KB-30E5

Fix R2-1 — P1 Partial-Failure Self-Cleaning

Revision 1
c1stagingcodex-r2-fixp1partial-cleanup2026-06-23

03 — FIX R2-1: P1 PARTIAL-FAILURE SELF-CLEANING

File: bin/dot-staging-sandbox-create (sha256 3694a0b6d35cc761637826537bfb04375b12a2db4b98b13954beeec90e33d23e).

Requirement

Track the candidate sandbox before any create; if P1 fails after creating the DB but before emitting SANDBOX_JSON, attempt cleanup; if cleanup fails, exit nonzero and report it; never emit created=true unless all postconditions pass.

Code branch — candidate derived + validated BEFORE any DB op

TS="$(date -u +%Y%m%d_%H%M)"
SBX="${SBX:-c1_staging_${TS}}"   # keep a preselected id (e.g. from --sandbox-id), else derive one
OPERATOR="$(whoami)@$(hostname)"
stg_assert_sandbox_name "$SBX"   # R2-1: candidate id derived + validated BEFORE any DB op
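
The body of stg_assert_sandbox_name is not shown in this report. As a rough illustration of a fail-closed name check, the sketch below assumes names must start with c1_staging_, use only lowercase letters, digits and underscores, and fit PostgreSQL's 63-byte identifier limit. The function name sketch_assert_sandbox_name, the pattern, and the return code 3 are all assumptions for illustration, not the real helper.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for stg_assert_sandbox_name; the real helper is not shown here.
# Assumptions: names start with "c1_staging_", use only [a-z0-9_], and fit the
# 63-byte PostgreSQL identifier limit. The return code 3 is illustrative.
sketch_assert_sandbox_name() {
  local LC_ALL=C                  # keep the a-z range strictly ASCII
  local name="${1:-}"
  case "$name" in
    ''|*[!a-z0-9_]*) ;;                                   # empty or unsafe characters: reject
    c1_staging_?*) [ "${#name}" -le 63 ] && return 0 ;;   # expected prefix and length: accept
  esac
  echo "SANDBOX_NAME_REJECTED: '${name}'" >&2
  return 3
}

TS="$(date -u +%Y%m%d_%H%M)"
sketch_assert_sandbox_name "c1_staging_${TS}" && echo "candidate ok: c1_staging_${TS}"
```

Rejecting by allow-list rather than by escaping means a name that would need quoting in SQL never reaches a database command at all.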

Code branch — EXIT trap armed before create; compensating drop on partial failure

P1_CREATED_DB=0; P1_DONE=0
p1_on_exit(){
  local rc=$?
  if [ "$rc" -ne 0 ] && [ "$P1_CREATED_DB" -eq 1 ] && [ "$P1_DONE" -ne 1 ]; then
    echo "P1_PARTIAL_FAILURE: rc=$rc after DB create, before completion -> compensating drop of ${SBX}" >&2
    if stg_drop_db "$SBX"; then
      echo "P1_PARTIAL_CLEANUP_OK: candidate ${SBX} dropped; no sandbox leaked" >&2
    else
      echo "P1_PARTIAL_CLEANUP_FAILED: candidate ${SBX} may still exist -- run 'dot-staging-sandbox-drop --sandbox-id ${SBX}'" >&2
      stg_cleanup_remote_tmps || true
      exit 70                       # cleanup failure forces nonzero
    fi
  fi
  stg_cleanup_remote_tmps || true
  exit "$rc"
}
trap p1_on_exit EXIT

Code branch — CREATED_DB armed the instant the DB exists

stg_run_sql_file postgres "$SQLDIR/p1a-create-db.sql" -v sbx="$SBX"
P1_CREATED_DB=1   # R2-1: DB now exists -> EXIT trap will compensating-drop on any later failure

Code branch — created=true emitted ONLY after all postconditions, then disarm

# ... orphan check / ledger / RETIRE hint / SANDBOX_READY ...
printf 'SANDBOX_JSON: {"sandbox_id":"%s","sandbox_db":"%s","created":true}\n' "$SBX" "$SBX"
P1_DONE=1   # full success -> disarm compensating drop (created=true now authoritative)

Why this is complete

  • The trap is installed before the reuse-first read and before phase-a create, so every exit after that point runs it; the handler only acts when the exit is nonzero, the DB was created, and P1 has not completed.
  • P1_CREATED_DB=1 runs immediately after a successful CREATE DATABASE (p1a is a single CREATE DATABASE under ON_ERROR_STOP=1; if it fails no DB exists and the flag stays 0 → no spurious drop).
  • Every post-create step (phase-b meta, orphan check, ledger, echoes) runs under set -e; a failure exits the script and the trap drops the candidate.
  • SANDBOX_JSON/created=true is the last emitted artifact; P1_DONE=1 immediately follows, so the trap never drops on full success and never lets a partial run advertise success.
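  • Defense in depth: the phase-b failure branch also does an explicit stg_drop_db "$SBX" || true (idempotent DROP DATABASE IF EXISTS); the trap then repeats the same idempotent drop on the nonzero exit, so a failed explicit drop gets a second attempt and is reported if it fails again.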
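
The properties listed above can be exercised without a database. The sketch below is a stand-alone demo, not the real script: a temp file stands in for the sandbox DB, and run_demo, its fail_at argument, and the exit codes used for the simulated failures are hypothetical. It uses explicit "|| exit" rather than set -e, because bash ignores set -e inside a function called on the left of ||.

```shell
#!/usr/bin/env bash
# Demo of the arm/disarm pattern with a temp file standing in for the database.
# run_demo and its fail_at argument are hypothetical, for illustration only.
run_demo() {                           # $1 = resource path, $2 = none | pre | post
  (
    res="$1"; fail_at="$2"; created=0; finished=0
    on_exit() {
      rc=$?
      if [ "$rc" -ne 0 ] && [ "$created" -eq 1 ] && [ "$finished" -ne 1 ]; then
        rm -f "$res" || exit 70        # compensating cleanup; its failure forces nonzero
      fi
      exit "$rc"
    }
    trap on_exit EXIT                  # armed before anything is created
    [ "$fail_at" != pre ] || exit 3    # simulated guard failure before create
    : > "$res" || exit 1               # "create" the resource
    created=1                          # set right after the resource exists
    [ "$fail_at" != post ] || exit 1   # simulated post-create failure
    finished=1                         # full success: disarm
  )
}

demo_dir="$(mktemp -d)"
for mode in none pre post; do
  rc=0; run_demo "$demo_dir/$mode" "$mode" || rc=$?
  if [ -e "$demo_dir/$mode" ]; then left=yes; else left=no; fi
  echo "fail_at=$mode rc=$rc resource_left=$left"
done
# Prints:
#   fail_at=none rc=0 resource_left=yes
#   fail_at=pre rc=3 resource_left=no
#   fail_at=post rc=1 resource_left=no
```

Only the "post" case triggers the compensating cleanup: "pre" fails before anything exists, and "none" disarms before exit.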

Plan-side complement (report 05 option b)

The dry-run plan preselects and validates the sandbox id and passes it via --sandbox-id, so SBX is set before P1 is invoked. The plan's own EXIT-trap cleanup target is therefore known even if P1 fully succeeds but the JSON parse fails (see file 04).
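
A minimal caller-side sketch of that idea, under stated assumptions: p1_ok, p1_garbled and drop_stub are hypothetical stand-ins for the real create and drop commands, the sed-based parse is just one way a caller might read the SANDBOX_JSON line, and dropping only when the run is unconfirmed is an assumed policy (the actual plan is described in file 04). The point it illustrates is that the cleanup target comes from the preselected id, never from parsed output.

```shell
#!/usr/bin/env bash
# Caller-side sketch. p1_ok, p1_garbled and drop_stub are hypothetical stand-ins for
# the real create/drop commands; the sed-based parse is an assumed way to read the line.
parse_sandbox_id() {   # stdin: P1 output; prints sandbox_id only when created is true
  sed -n 's/^SANDBOX_JSON: {"sandbox_id":"\([a-z0-9_]*\)","sandbox_db":"[a-z0-9_]*","created":true}$/\1/p'
}
p1_ok()      { printf 'SANDBOX_JSON: {"sandbox_id":"%s","sandbox_db":"%s","created":true}\n' "$1" "$1"; }
p1_garbled() { echo 'SANDBOX_JSON: {"sandbox_id":'; }   # DB exists, output unusable
drop_stub()  { echo "$1" >> "$DROP_LOG"; }               # records what would be dropped

plan_run() {           # $1 = preselected sandbox id, $2 = P1 command to invoke
  (
    sbx="$1"; confirmed=0
    plan_on_exit() {
      rc=$?
      # Target is the preselected id, so no parsing is needed to clean up.
      [ "$confirmed" -eq 1 ] || drop_stub "$sbx" || exit 70
      exit "$rc"
    }
    trap plan_on_exit EXIT
    out="$("$2" "$sbx")" || exit 1
    [ "$(printf '%s\n' "$out" | parse_sandbox_id)" = "$sbx" ] || exit 2
    confirmed=1
  )
}

DROP_LOG="$(mktemp)"
plan_run c1_staging_20260623_1015 p1_ok      && echo "ok run: nothing dropped"
plan_run c1_staging_20260623_1015 p1_garbled || echo "parse failed: rc=$? dropped=$(cat "$DROP_LOG")"
```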

Validation

bash -n OK; shellcheck warning-clean; guard self-tests confirm the pre-create guards still fail closed (exit 3/4) without ever reaching CREATE DATABASE. No live execution of the create path (no staging DB created).

knowledge/dev/laws-new/reports/c1-staging-codex-r2-fixes-ready-for-r3/03-fix-p1-partial-failure-cleanup.md