KB-6DB2

T2 RP — OOM Safety Pack

Revision 1

Tags: rp, terminal2, oom, postgres, signal-9, safety, 2026-06-05

03 — OOM Safety Pack (production-critical)

Classification: VERIFIED / OOM_SAFE. All five required runbook rules are present in RUNBOOK_v2 §0 and §5, and live state confirms OOM safety (T2 read-only check, 2026-06-05).

Live verification (T2, read-only)

v_rp_guard_safety_status = OOM_SAFE__NO_LIVE_SMOKE_COMBO_LANDMINE (0 live crash landmines, 6 function-backed guards, 10 slow-bounded views, deploy_guard_fn present, acceptance_fn present).
v_rp_oom_landmine_detector CRASH_LANDMINE rows = 0.
Postgres container log: the last signal-9 / "all server processes terminated" entry was at 06:04:02 UTC on 2026-06-05. Everything logged after it is a healthy checkpoint, a benign statement-timeout, or a column-name error.
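The log check above can be sketched as a small scanner. This is a minimal sketch, not the tooling T2 used: it assumes each log line starts with a "YYYY-MM-DD HH:MM:SS.mmm UTC" timestamp (the Postgres default log_line_prefix), and the sample lines are illustrative, not copied from the live log.

```python
import re
from datetime import datetime, timezone

# Assumption: lines start with "YYYY-MM-DD HH:MM:SS[.mmm] UTC" (default '%m [%p] ' prefix).
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(?:\.\d+)? UTC\b")

# Crash signatures named in this note.
CRASH_MARKERS = ("signal 9", "all server processes terminated")


def crashes_after(log_lines, cutoff):
    """Return (timestamp, line) pairs for crash entries strictly later than cutoff."""
    hits = []
    for line in log_lines:
        match = TS_RE.match(line)
        if not match:
            continue  # continuation line or unknown format: skipped, not guessed
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        ts = ts.replace(tzinfo=timezone.utc)
        if ts > cutoff and any(marker in line for marker in CRASH_MARKERS):
            hits.append((ts, line.rstrip("\n")))
    return hits


# Last known crash from this note (whole-second precision).
LAST_KNOWN_CRASH = datetime(2026, 6, 5, 6, 4, 2, tzinfo=timezone.utc)

# Illustrative sample only; PIDs and millisecond values are made up.
sample = [
    "2026-06-05 06:04:02.101 UTC [1] LOG:  server process (PID 4242) was terminated by signal 9: Killed",
    "2026-06-05 06:04:02.102 UTC [1] LOG:  all server processes terminated; reinitializing",
    "2026-06-05 06:08:11.500 UTC [77] ERROR:  canceling statement due to statement timeout",
    "2026-06-05 06:23:13.000 UTC [81] ERROR:  canceling statement due to statement timeout",
]

new_crashes = crashes_after(sample, LAST_KNOWN_CRASH)
print(len(new_crashes))  # prints 0: the timeouts are not crash signatures
```

Feeding it the real container log (for example, the lines of a saved log dump) should return an empty list as long as the 06:04:02 crash remains the last one.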

The incident (why this matters)

The Postgres instance on this VPS is killed by signal 9 (OOM) when ONE SQL statement combines the smoke probe v_rp_ui_current_smoke_probe, which expands the deep contract stack ~15×, with ANY other deep RP stack. The crash occurs even under EXPLAIN, so the memory cost is planner-side, not execution-side. Five such crashes occurred on 2026-06-05 between 05:28 and 06:04 UTC (05:28:27, 05:54:43, 05:58:47, 05:59:34, 06:04:02). A prior checkpoint WRONGLY claimed the landmine was already neutralized; live logs disproved it. Each crash auto-recovered via WAL replay with NO data loss, but dropped directus/nuxt connections for a matter of seconds.

The known landmine views (do NOT query / EXPLAIN)

v_rp_ui_current_production_acceptance_dashboard — the deep-composite landmine (combined smoke + guard + anti-false-green + autoscale/validation in ONE statement). Use the function fn_rp_ui_current_production_acceptance() instead.
The pre-fix single-statement deploy-guard view (superseded by fn_rp_ui_deploy_final_readiness_guard()).
Root amplifier: v_rp_ui_current_smoke_probe (expands deep stack ~15×).

The FIVE required rules (all present in the runbook — confirmed)

1. Do not use deep composite views in a single statement. Never reference the smoke probe together with any other deep RP stack in ONE statement.
2. Do not EXPLAIN smoke-combo views. EXPLAIN alone triggers the OOM, because the cost is planner-side.
3. Use function-backed guards only. fn_rp_ui_deploy_final_readiness_guard() and fn_rp_ui_current_production_acceptance() isolate each gate as a separate bounded statement; a function scan in FROM is opaque, so the caller never builds one giant plan.
4. Stop if signal-9 appears after 06:04:02 UTC. Before trusting any "crash-safe" claim, grep the postgres container logs for "signal 9: Killed" and "all server processes terminated". LIVE EVIDENCE WINS: a green view is not proof; the log is.
5. Run the detector before AND after any deploy/repoint. Check both v_rp_oom_landmine_detector (must return 0 CRASH_LANDMINE rows) and v_rp_guard_safety_status (must read OOM_SAFE…). The detector uses the dependency graph (pg_depend / TRUE refs), not text-matching, so it cannot be fooled by string coincidences.
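To illustrate why a dependency-graph check is harder to fool than text-matching, here is a toy sketch in Python. It is not the implementation of v_rp_oom_landmine_detector, whose SQL is not shown in this note. The edge map stands in for what pg_depend would provide, and every name starting with v_hypothetical_ or t_hypothetical_ is invented for the example.

```python
SMOKE_PROBE = "v_rp_ui_current_smoke_probe"


def reachable(view, deps):
    """Relations reachable from view; the smoke probe is treated as a leaf."""
    seen = set()
    stack = [view]
    while stack:
        node = stack.pop()
        for child in deps.get(node, ()):
            if child in seen:
                continue
            seen.add(child)
            if child != SMOKE_PROBE:  # do not expand through the probe itself
                stack.append(child)
    return seen


def crash_landmines(deps, deep_stacks):
    """Views that truly depend on the smoke probe AND on another deep stack."""
    flagged = []
    for view in sorted(deps):
        if view == SMOKE_PROBE:
            continue
        refs = reachable(view, deps)
        if SMOKE_PROBE in refs and refs & deep_stacks:
            flagged.append(view)
    return flagged


# Toy dependency edges (hypothetical, for illustration only).
deps = {
    "v_rp_ui_current_production_acceptance_dashboard": {SMOKE_PROBE, "v_hypothetical_guard_stack"},
    "v_hypothetical_guard_stack": {"v_hypothetical_contract_base"},
    SMOKE_PROBE: {"v_hypothetical_contract_base"},
    "v_hypothetical_smoke_only": {SMOKE_PROBE},
    # Name contains "smoke_probe" but has no real dependency on the probe.
    "v_hypothetical_smoke_probe_notes": {"t_hypothetical_notes"},
}
deep_stacks = {"v_hypothetical_guard_stack"}

print(crash_landmines(deps, deep_stacks))
# prints ['v_rp_ui_current_production_acceptance_dashboard']
```

The view whose name merely contains "smoke_probe" is not flagged, which is the string-coincidence case a text-match would get wrong.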

Operational nuance discovered by T2 (add to runbook)

The function-backed guard fn_rp_ui_deploy_final_readiness_guard() can itself hit a statement-timeout (not a crash) on its smoke-probe gate when called through the MCP query_pg 5s wrapper (observed 06:23:13 UTC). To get the verdict, call it via ssh psql after running SET statement_timeout=0 in the same session. This is a wrapper limitation, not an OOM event: the gate is bounded, just slow.
Heavy SLOW_BOUNDED views (validation_summary_v2, autoscale_v2, smoke probe, parity regression) are safe ONLY when queried standalone with statement_timeout=0. Never join two of them, and never combine any of them with the smoke probe. (The 06:08 cancel of v_rp_generator_parity_regression_v2 under the 5s wrapper is this same bounded-but-slow behavior, not a crash.)
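A minimal sketch of building the ssh psql call for the guard, assuming psql is on the remote PATH and reachable without extra wrapping. The SSH target and database name are hypothetical placeholders. Two separate -c options are used so the SET runs as its own command before the guard in the same session.

```python
import shlex

# The guard is called as a function scan in FROM, as this note describes.
GUARD_SQL = "SELECT * FROM fn_rp_ui_deploy_final_readiness_guard();"


def guard_command(ssh_target, dbname):
    """Build the argv for running the guard over ssh with no statement timeout."""
    remote = shlex.join([
        "psql", "-X",                        # -X: ignore any psqlrc on the host
        "-d", dbname,
        "-v", "ON_ERROR_STOP=1",
        "-c", "SET statement_timeout = 0;",  # lift the timeout for this session
        "-c", GUARD_SQL,                     # the guard only, nothing joined to it
    ])
    return ["ssh", ssh_target, remote]


# Hypothetical target and database name, for illustration only.
cmd = guard_command("ops@t2-vps.example", "rp_db")
print(cmd[2])
```

The resulting list can be passed to subprocess.run(cmd, check=True). If psql is only available inside the Postgres container, the remote command would need to be wrapped in the container's exec call, which is not shown here.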

No-go (OOM)

Any CRASH_LANDMINE present in v_rp_oom_landmine_detector.
Any new signal-9 in postgres logs after 06:04:02 UTC → STOP, do not proceed with deploy/repoint.
Restoring 99_rollback.sql re-introduces the landmine acceptance dashboard — do not query it post-rollback.
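The first two no-go conditions, together with the "must read OOM_SAFE…" requirement from the rules, can be folded into one gate. This is a sketch under stated assumptions: the function name and its inputs are hypothetical, and the operator is assumed to collect each input from its own separate statement or log scan. The rollback caveat is a standing rule and is not modeled.

```python
def oom_go_no_go(landmine_rows, safety_status, new_signal9_count):
    """Return (go, reasons). Any single failed check means no-go."""
    reasons = []
    if landmine_rows != 0:
        reasons.append(f"{landmine_rows} CRASH_LANDMINE row(s) in v_rp_oom_landmine_detector")
    if not safety_status.startswith("OOM_SAFE"):
        reasons.append(f"v_rp_guard_safety_status reads {safety_status!r}, not OOM_SAFE")
    if new_signal9_count > 0:
        reasons.append(f"{new_signal9_count} signal-9 entr(y/ies) after 06:04:02 UTC")
    return (not reasons, reasons)


# Values reported by the T2 read-only check in this note.
go, reasons = oom_go_no_go(0, "OOM_SAFE__NO_LIVE_SMOKE_COMBO_LANDMINE", 0)
print(go, reasons)  # prints: True []
```

Running the gate once before and once after a deploy/repoint, and stopping on the first no-go, matches the detector rule above.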