S167D: Chaos Test Report for Article 31 (Điều 31)

Agent: Codex CLI
Date: 2026-03-26
Repo: web-test
Mode: Production chaos test via ssh contabo
Scope: Data-only fault injection, no code changes

Assembly Gate

PostgreSQL-first: YES. Mission executed directly on PG production via docker exec postgres psql -U directus -d directus.
Nuxt UI reuse: N/A.
Existing tooling reuse: YES. Reused scripts/integrity/* and dot/bin/dot-layer-integrity-audit where executable.
New code written: NO.
New schema/tooling created: NO.
Custom registry update needed: NO.

Step 0 Quotes

Operating rules (OR) loaded first via search_knowledge("operating rules SSOT").
Merge rule from CLAUDE.md: Never push directly to main. The pre-push hook blocks it. Always use feature branches and PRs.
CI GREEN detail from .claude/skills/incomex-rules.md: only 4 required checks matter.

Actual Production Paths

The environment details in the retry prompt had still drifted from production. Actual production runtime on 2026-03-26:

SSH: ssh contabo
PG container: postgres (not workflow-postgres)
Runner source path on host: /opt/incomex/deploys/web-test/scripts/integrity/run-integrity.sh
Scanner path on host: /opt/incomex/dot/bin/dot-layer-integrity-audit
Nuxt container: incomex-nuxt
Directus container: incomex-directus

Operational findings about tooling:

Host-side runner exists but is NOT runnable there: node is missing on the VPS host.
Nuxt container has node but does NOT contain /app/scripts/integrity/main.js.
Result: Node runner is effectively unavailable in production runtime.
Scanner is runnable in cloud mode when provided a Directus admin token.
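
The two runner failures above (interpreter missing on the host, script missing in the container) are the kind of thing a preflight check can surface before a mission starts. The following is a minimal sketch only: the function name `runner_preflight` is hypothetical and not part of the existing tooling, and it assumes the runner needs `node` on PATH plus the script file on disk.

```python
import os
import shutil


def runner_preflight(runner_path: str, interpreter: str = "node") -> list[str]:
    """Return the reasons the runner cannot start; an empty list means it can.

    Hypothetical helper: it only checks that the interpreter is on PATH
    and that the runner script exists as a regular file.
    """
    problems = []
    if shutil.which(interpreter) is None:
        problems.append(f"{interpreter} not found on PATH")
    if not os.path.isfile(runner_path):
        problems.append(f"runner script missing: {runner_path}")
    return problems
```

Run on the VPS host with the runner path listed above, this would be expected to report the missing `node`; run inside the Nuxt container, it would be expected to report the missing script instead.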

BEFORE Baseline

Captured before chaos writes:

triggers = 26
v_reg_rows = 23
open_issues = 1
CAT-ALL.record_count = 18758
Production API /api/registry/system-issues.totals = {all:1, critical:1, warning:0, info:0, group_count:1}
Open watchdog issue present: ISS-0752
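
The baseline totals can be sanity-checked for internal consistency before any chaos writes. The sketch below assumes that `all` is meant to equal the sum of the `critical`, `warning`, and `info` counts; that relationship is inferred from the captured payload, not confirmed from the endpoint's code, and `totals_consistent` is a hypothetical helper name.

```python
def totals_consistent(totals: dict) -> bool:
    """True when the 'all' figure equals the sum of the per-severity counts.

    Assumption: 'all' = critical + warning + info (inferred from the payload).
    """
    severities = ("critical", "warning", "info")
    return totals["all"] == sum(totals[s] for s in severities)


# Baseline payload as captured from /api/registry/system-issues.totals
baseline = {"all": 1, "critical": 1, "warning": 0, "info": 0, "group_count": 1}
print(totals_consistent(baseline))  # True
```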

Scenario Matrix

#	Test	PASS/FAIL	Evidence / root cause
P1	Phantom meta_catalog	PASS	Adapted to live `meta_catalog` schema. Row inserted, `v_registry_counts` auto-added row, scanner stdout detected `CHAOS-TEST-P1` as missing registry config. No persisted `system_issues` delta.
P2	Phantom fake system_issue	FAIL	Fake issue inserted directly into `system_issues`. PG open issues rose `1 -> 2`, Nuxt later also showed `2`, scanner stayed silent. Root cause: no authenticity guard on issue creation; fake issues propagate as real.
P3	Phantom v_registry_counts	FAIL	`v_registry_counts` is a writable TABLE (`relkind='r'`), not a safe view. Phantom row `CHAOS-TEST-P3` persisted. Scanner silent.
O1	Orphan no code (x6)	FAIL	`dot_tools` accepted record with no code and auto-generated `DOT-232`; `taxonomy` / `checkpoint_types` failed later on other NOT NULL fields; `comments` / `trigger_registry` / `entity_dependencies` prompt SQL used stale `name` column. Root cause: no-code case not uniformly blocked; prompt field list also drifted from schema.
O2	Orphan no `_dot_origin`	FAIL	`dot_tools` accepted `CHAOS-TEST-O2` with `_dot_origin NULL`. Scanner silent.
O3	meta_catalog NULL registry	PASS	`meta_catalog` accepted `CHAOS-TEST-O3` with `registry_collection NULL` and auto-added a `v_registry_counts` row. Scanner stdout detected it as missing registry config.
O4	Broken dependency	FAIL	After adapting to live schema (`source_type`, `target_type`, `relation_type`, `direction`, `date_created`), broken dependency `CHAOS-TEST-O4` inserted successfully. Scanner silent.
O5	Broken edge	FAIL	After adapting to live schema and valid `edge_type=BELONGS_TO`, broken universal edge `FAKE-EDGE-SRC -> FAKE-EDGE-TGT` inserted successfully. Scanner silent.
N1	Disable trigger + insert	FAIL	With real trigger name `trg_count_dot_tools`, count mismatch existed immediately after insert: `CAT-006 record_count=114`, `actual dot_tools=115`. Scanner missed the mismatch. Root cause inferred from code + behavior: scanner check uses stale metadata (`actual_count`) instead of live `COUNT(*)`. Mismatch later self-healed after further inserts.
N2	CAT-ALL vs sum	FAIL	Read-only cross-check mismatch: `CAT-ALL=18786`, `atom_sum=18371`. Root cause: comparison contract is invalid as written; `CAT-ALL` counts more than atom-only managed rows.
N3	v_reg vs meta cross	FAIL	Read-only cross-check showed mismatches at least on `CAT-023`, plus temporary chaos rows before cleanup. Active detector did not surface this class cleanly.
L1	Lifecycle open→resolved	FAIL	PG open issues went `2 -> 3 -> 2`, but production API stayed flat at `2` both immediately and after 130s. Root cause: transient warning issue never became visible via Nuxt endpoint.
L2	Mass corruption (rollback)	PASS	In transaction: open issues `2 -> 631 -> 2` after rollback. Rollback preserved production state correctly.
W1	3-way PG/Nuxt check	PASS	At measurement time, PG open issues and Nuxt totals both read `2`.
W2	API consistency	PASS	`system-issues=2`, `groups-sum=2`. Endpoint pair consistent.
E1	NULL/empty code	PASS	`NULL` and empty-string code inputs did not persist as bad codes. They were auto-normalized to `DOT-233` and `DOT-234`.
E2	Mass insert 50	PASS	50 rows inserted successfully. After insert, `CAT-006.record_count=167` and `dot_tools COUNT(*)=167`, so no race/count loss observed in that batch.
E3	Circular dependency	FAIL	Circular pair `CHAOS-TEST-E3a` / `CHAOS-TEST-E3b` inserted successfully. Scanner silent.
S1	Watchdog alive	PASS	Watchdog issue remained open: `ISS-0752` present.
S2	Runner available	FAIL	Runner path exists on host but host has no `node`; runtime container has `node` but not the runner script. Effective result: runner unavailable on production runtime.
S3	Auto-resolve	PASS	After selective cleanup/retirement of detected P1/O3 cases, scanner no longer reported `CHAOS-TEST-P1` or `CHAOS-TEST-O3`. Persisted issue auto-resolve could not be verified because scanner findings did not create open `system_issues` deltas in this environment.
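
The O4, O5, and E3 rows describe two graph-level fault classes the scanner did not report: edges that point at entities that do not exist, and circular dependencies. The sketch below shows one way such checks could work on an in-memory edge list. It is not the scanner's implementation; the function names are hypothetical, and the edge direction of the E3 pair (each row depending on the other) is an assumption.

```python
from collections import defaultdict


def find_dangling(edges, known_codes):
    """Return edges whose source or target is not a known entity code."""
    return [(s, t) for s, t in edges if s not in known_codes or t not in known_codes]


def find_cycle(edges):
    """Return one directed cycle as a list of nodes, or None if acyclic.

    Depth-first search with three colors; recursion depth grows with path
    length, so very deep graphs would need an iterative version.
    """
    graph = defaultdict(list)
    for s, t in edges:
        graph[s].append(t)

    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)  # every node starts WHITE (0)
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for nxt in graph[node]:
            if color[nxt] == GREY:  # back edge closes a cycle
                return stack[stack.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Edge list modeled on the retained chaos rows (direction of E3 is assumed).
edges = [
    ("CHAOS-TEST-E3a", "CHAOS-TEST-E3b"),
    ("CHAOS-TEST-E3b", "CHAOS-TEST-E3a"),
    ("FAKE-EDGE-SRC", "FAKE-EDGE-TGT"),
]
known = {"CHAOS-TEST-E3a", "CHAOS-TEST-E3b"}

print(find_dangling(edges, known))  # [('FAKE-EDGE-SRC', 'FAKE-EDGE-TGT')]
print(find_cycle(edges))  # ['CHAOS-TEST-E3a', 'CHAOS-TEST-E3b', 'CHAOS-TEST-E3a']
```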

Detection Rate

Clear scanner-detected fault injections:

P1
O3

Detection rate:

2/21 using the mission denominator
2/18 if you exclude the 3 read-only comparison checks
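
For reference, the two ratios work out to roughly 9.5% and 11.1%. The arithmetic, as a short sketch (`detection_rate` is an illustrative helper name):

```python
def detection_rate(detected: int, total: int) -> float:
    """Fraction of injected faults that the scanner reported."""
    return detected / total


print(f"{detection_rate(2, 21):.1%}")  # 9.5%  (mission denominator)
print(f"{detection_rate(2, 18):.1%}")  # 11.1% (excluding the 3 read-only checks)
```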

Important nuance:

The active scanner produced stdout findings.
It did not increase system_issues open totals during this mission.
So “detected in stdout” and “persisted issue created/opened” are different signals in production.

New Bugs / TD Found

Writable v_registry_counts table
- It is a real table, not a safe view.
- Phantom rows can be inserted directly.
Birth Gate is warning-only for several critical conditions
- Invalid code formats are warned, not blocked.
- _dot_origin NULL is warned, not blocked.
- meta_catalog.registry_collection NULL is warned, not blocked.
Fake system_issues can be injected as real production issues
- No authenticity guard prevented CHAOS-TEST-P2.
- Nuxt later surfaced it as a real critical issue.
Count check in dot-layer-integrity-audit missed a real mismatch
- Observed at runtime on N1.
- Inference from repo code and behavior: it compares metadata fields rather than a fresh table count.
Production runner is unavailable
- Host path exists, but host lacks node.
- Nuxt runtime container lacks the source runner script.
Cleanup governance is partially broken
- Direct delete from meta_catalog is blocked by fn_guard_meta_catalog_delete().
- Both retire_entity(text) and deprecate_entity(text) fail because they still query table_registry.code, which no longer exists.
- Workaround used: set detected chaos rows to status='log' and remove their v_registry_counts rows.
Lifecycle visibility bug on production API
- Temporary WARNING issue CHAOS-TEST-L1 never appeared in Nuxt totals, even after 130 seconds.
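
The count-check finding above comes down to comparing a stored `record_count` against a fresh `COUNT(*)` rather than against another metadata field. The sketch below illustrates that comparison using the N1 figures (`CAT-006` stored 114, `dot_tools` actually 115). It is an illustration, not the audit tool's code: SQLite in memory stands in for production PostgreSQL only to keep the example runnable, and `find_count_mismatches` and the `catalog` mapping shape are hypothetical.

```python
import sqlite3


def find_count_mismatches(conn, catalog):
    """Compare each stored record_count with a live COUNT(*) on its table.

    `catalog` maps a catalog code to (table_name, stored_record_count).
    Table names are interpolated into SQL, so they must come from a
    trusted allow-list, never from user input.
    """
    mismatches = {}
    for code, (table, stored) in catalog.items():
        live = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        if live != stored:
            mismatches[code] = (stored, live)
    return mismatches


# In-memory stand-in for the N1 state: 115 real rows, stored count still 114.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dot_tools (code TEXT)")
conn.executemany(
    "INSERT INTO dot_tools VALUES (?)", [(f"DOT-{i}",) for i in range(115)]
)

print(find_count_mismatches(conn, {"CAT-006": ("dot_tools", 114)}))
# {'CAT-006': (114, 115)}
```

A check built this way would flag the N1 mismatch while it exists, whereas a comparison between two metadata fields can agree with itself and stay silent.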

Selective Cleanup Per Mission Rule

Cleaned or neutralized because detected or no longer a live fault:

CHAOS-TEST-P1 (meta_catalog.status -> log, v_registry_counts row removed)
CHAOS-TEST-O3 (meta_catalog.status -> log, v_registry_counts row removed)
CHAOS-TEST-L1 removed from system_issues
CHAOS-TEST-N1 removed from dot_tools
CHAOS-TEST-E2-* removed from dot_tools
Normalized one-off rows removed by name: CHAOS-TEST-O1-dot_tools, CHAOS-TEST-E1a, CHAOS-TEST-E1b

Intentionally left in place because the fault was not detected:

system_issues.CHAOS-TEST-P2
v_registry_counts.CHAOS-TEST-P3
dot_tools.CHAOS-TEST-O2
entity_dependencies.CHAOS-TEST-O4
universal_edges.FAKE-EDGE-SRC -> FAKE-EDGE-TGT
entity_dependencies.CHAOS-TEST-E3a
entity_dependencies.CHAOS-TEST-E3b

Final State

After selective cleanup:

triggers = 26
v_reg_rows = 26
open_issues = 2
CAT-ALL.record_count = 19003
Watchdog still alive: ISS-0752
Remaining intentionally retained undetected chaos rows verified present:
- CHAOS-TEST-P2
- CHAOS-TEST-P3
- CHAOS-TEST-O2
- CHAOS-TEST-O4
- FAKE-EDGE-SRC
- CHAOS-TEST-E3a
- CHAOS-TEST-E3b

Conclusion

Detection rate: 2/21.

Article 31+ (Điều 31+) production readiness: NO.

Primary blockers:

Scanner coverage is narrow and misses several real fault classes (P2, P3, O2, O4, O5, N1, E3).
Fake issues can enter production as real issues.
Count mismatch detection is not reliable.
Production runner is not executable.
Official cleanup functions for meta_catalog are broken on current schema.

The retained undetected rows above are the concrete proof set for follow-up investigation.