T2 RP — OOM Safety Pack
03 — OOM Safety Pack (production-critical)
Classification: VERIFIED / OOM_SAFE. All five required runbook rules are present in RUNBOOK_v2 §0 and §5, and live state confirms OOM safety (T2 read-only check, 2026-06-05).
Live verification (T2, read-only)
- v_rp_guard_safety_status = OOM_SAFE__NO_LIVE_SMOKE_COMBO_LANDMINE (0 live crash landmines, 6 function-backed guards, 10 slow-bounded views, deploy_guard_fn present, acceptance_fn present).
- v_rp_oom_landmine_detector: CRASH_LANDMINE rows = 0.
- Postgres container log: last signal-9 / "all server processes terminated" was 06:04:02 UTC. Nothing after it but healthy checkpoints, benign statement-timeouts, and column-name errors.
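The two view checks in this section can be scripted as a read-only probe. What follows is a minimal sketch, assuming psql is on PATH and connection settings come from the usual PG* environment variables. The function names and the PSQL override variable are hypothetical. The column layout of both views is not stated here, so the sketch matches on output text rather than naming a column. Each view is queried in its own statement, never joined.

```shell
#!/usr/bin/env bash
# Read-only OOM safety probe (sketch). PSQL and all function names are
# illustrative assumptions; connection details come from PG* env vars.
PSQL="${PSQL:-psql}"

# Print the status view as unaligned, tuples-only text ("|"-separated fields).
guard_status() {
  "$PSQL" -X -At -c "SELECT * FROM v_rp_guard_safety_status"
}

# Count detector output lines that mention CRASH_LANDMINE (expected: 0).
# A psql failure returns 2 instead of a misleading count of 0.
landmine_count() {
  local rows
  rows="$("$PSQL" -X -At -c "SELECT * FROM v_rp_oom_landmine_detector")" || return 2
  printf '%s\n' "$rows" | grep -c 'CRASH_LANDMINE' || true
}

# Succeed only if a field starts with OOM_SAFE and no landmine lines exist.
oom_safe() {
  local st count
  st="$(guard_status)" || return 2
  count="$(landmine_count)" || return 2
  echo "status=${st} crash_landmines=${count}"
  printf '%s\n' "$st" | grep -Eq '(^|\|)OOM_SAFE' && [ "$count" -eq 0 ]
}
```

Calling oom_safe prints one summary line and returns non-zero on any landmine, on a non-OOM_SAFE status, or on a failed query, so it can gate a deploy script. It does not replace the log check: a green view is not proof.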
The incident (why this matters)
This VPS Postgres crashes (signal 9 / OOM — even under EXPLAIN, so the cost is planner-side,
not execution-side) when ONE SQL statement combines the smoke probe
v_rp_ui_current_smoke_probe (which expands the deep contract stack ~15×) with ANY other deep
RP stack. Five such crashes occurred 2026-06-05 05:28–06:04 UTC (05:28:27, 05:54:43, 05:58:47,
05:59:34, 06:04:02). A prior checkpoint WRONGLY claimed the landmine was already neutralized;
live logs disproved it. Each crash auto-recovered via WAL replay with NO data loss but dropped
directus/nuxt connections for seconds.
The known landmine views (do NOT query / EXPLAIN)
- v_rp_ui_current_production_acceptance_dashboard: the deep-composite landmine (combined smoke + guard + anti-false-green + autoscale/validation in ONE statement). Use the function fn_rp_ui_current_production_acceptance() instead.
- The pre-fix single-statement deploy-guard view (superseded by fn_rp_ui_deploy_final_readiness_guard()).
- Root amplifier: v_rp_ui_current_smoke_probe (expands deep stack ~15×).
The FIVE required rules (all present in the runbook — confirmed)
- Do not use deep composite views in a single statement. Never reference the smoke probe together with any other deep RP stack in ONE statement.
- Do not EXPLAIN smoke-combo views. EXPLAIN alone OOMs (planner-side cost).
- Use function-backed guards only. fn_rp_ui_deploy_final_readiness_guard() and fn_rp_ui_current_production_acceptance() isolate each gate as a separate bounded statement; a function-scan in FROM is opaque, so the caller never builds one giant plan.
- Stop if signal-9 appears after 06:04 UTC. Before trusting any "crash-safe" claim, grep the postgres container logs for "signal 9: Killed" / "all server processes terminated". LIVE EVIDENCE WINS: a green view is not proof; the log is.
- Run the detector before AND after any deploy/repoint: v_rp_oom_landmine_detector (must be 0 CRASH_LANDMINE) and v_rp_guard_safety_status (must read OOM_SAFE…). The detector uses the dependency graph (pg_depend / TRUE refs), not text-matching, so it cannot be fooled by string coincidences.
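The log check in the fourth rule can be made time-aware, so that the five known crashes do not drown out a new one. This is a minimal sketch. It assumes each log line begins with a "YYYY-MM-DD HH:MM:SS" timestamp in UTC, which depends on the server's log_line_prefix and log_timezone settings. The function name crash_lines_after is hypothetical, and reading the log with docker logs is an assumption about how the container is run.

```shell
#!/usr/bin/env bash
# Print OOM-crash log lines stamped strictly after a cutoff (sketch).
# Reads the log on stdin; assumes lines start with "YYYY-MM-DD HH:MM:SS".
crash_lines_after() {
  local cutoff="$1"   # e.g. "2026-06-05 06:04:02"
  awk -v cutoff="$cutoff" '
    /signal 9: Killed|all server processes terminated/ {
      ts = substr($0, 1, 19)     # keep "YYYY-MM-DD HH:MM:SS", drop fractions
      if (ts > cutoff) print     # ISO-style timestamps compare as strings
    }'
}

# Usage (container name is a placeholder):
#   docker logs <postgres-container> 2>&1 | crash_lines_after "2026-06-05 06:04:02"
```

Empty output means no new crash signature after the cutoff. Any printed line is a stop condition. Follow-on lines from the known 06:04:02 crash may carry a slightly later stamp, so read any hit in context before drawing a conclusion.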
Operational nuance discovered by T2 (add to runbook)
- The function-backed guard fn_rp_ui_deploy_final_readiness_guard() itself can hit a statement-timeout (not a crash) on its smoke-probe gate when called through the MCP query_pg 5s wrapper (observed 06:23:13 UTC). To get the verdict, call it via ssh psql with SET statement_timeout=0;. This is a wrapper limitation, not an OOM event: the gate is bounded, just slow.
- Heavy SLOW_BOUNDED views (validation_summary_v2, autoscale_v2, smoke probe, parity regression) are safe ONLY standalone with statement_timeout=0: never join two, never add the smoke probe. (The 06:08 cancel of v_rp_generator_parity_regression_v2 under the 5s wrapper is this same bounded-but-slow behavior, not a crash.)
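The ssh psql workaround described above might look like the following. It is a minimal sketch, not the team's actual tooling: RP_SSH_HOST, RP_DB, the SSH override variable, and the function name run_unbounded are all hypothetical. The SQL is piped over stdin so psql sends the SET and the query as separate statements in one session, and exactly one slow-bounded object should be passed per call.

```shell
#!/usr/bin/env bash
# Run ONE slow-bounded statement with the statement timeout lifted (sketch).
# RP_SSH_HOST and RP_DB are placeholders, not values taken from this report.
SSH="${SSH:-ssh}"

run_unbounded() {
  local sql="$1"   # a single statement, without a trailing semicolon
  printf 'SET statement_timeout = 0;\n%s;\n' "$sql" |
    "$SSH" "${RP_SSH_HOST:?set RP_SSH_HOST}" \
      psql -X -At -v ON_ERROR_STOP=1 -d "${RP_DB:?set RP_DB}"
}

# Usage:
#   run_unbounded "SELECT * FROM fn_rp_ui_deploy_final_readiness_guard()"
```

Lifting the timeout removes only the 5s wrapper limit. It gives no protection against the OOM landmine, so the single-statement rules still apply to whatever SQL is passed in.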
No-go (OOM)
- Any CRASH_LANDMINE present in v_rp_oom_landmine_detector.
- Any new signal-9 in postgres logs after 06:04:02 UTC → STOP, do not proceed with deploy/repoint.
- Restoring 99_rollback.sql re-introduces the landmine acceptance dashboard; do not query it post-rollback.