S167D — Chaos Test Điều 31 Report
Agent: Codex CLI
Date: 2026-03-26
Repo: `web-test`
Mode: Production chaos test via `ssh contabo`
Scope: Data-only fault injection, no code changes
Assembly Gate
- PostgreSQL-first: YES. Mission executed directly on PG production via `docker exec postgres psql -U directus -d directus`.
- Nuxt UI reuse: N/A.
- Existing tooling reuse: YES. Reused `scripts/integrity/*` and `dot/bin/dot-layer-integrity-audit` where executable.
- New code written: NO.
- New schema/tooling created: NO.
- Custom registry update needed: NO.
Step 0 Quotes
- OR loaded first with `search_knowledge("operating rules SSOT")`.
- Merge rule from `CLAUDE.md`: "Never push directly to main. The pre-push hook blocks it. Always use feature branches and PRs."
- CI GREEN detail from `.claude/skills/incomex-rules.md`: only 4 required checks matter.
Actual Production Paths
The retry prompt still had environment drift. Actual production runtime on 2026-03-26:
- SSH: `ssh contabo`
- PG container: `postgres` (not `workflow-postgres`)
- Runner source path on host: `/opt/incomex/deploys/web-test/scripts/integrity/run-integrity.sh`
- Scanner path on host: `/opt/incomex/dot/bin/dot-layer-integrity-audit`
- Nuxt container: `incomex-nuxt`
- Directus container: `incomex-directus`
Operational findings about tooling:
- Host-side runner exists but is NOT runnable there: `node` is missing on the VPS host.
- Nuxt container has `node` but does NOT contain `/app/scripts/integrity/main.js`.
- Result: Node runner is effectively unavailable in production runtime.
- Scanner is runnable in cloud mode when provided a Directus admin token.
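The runner finding comes down to two preconditions that must hold in the same environment: a `node` binary on `PATH` and the runner script on disk. A minimal pre-flight sketch in Python follows. `runner_available` is a hypothetical helper, not part of the repo's tooling, and the simulated lookups (including the `/usr/bin/node` location) are assumptions used only to reproduce the two observed situations.

```python
import os
import shutil
from typing import Callable, Optional


def runner_available(
    script_path: str,
    which: Callable[[str], Optional[str]] = shutil.which,
    exists: Callable[[str], bool] = os.path.isfile,
) -> tuple[bool, list[str]]:
    """Return (ok, reasons). The runner is usable only if both the
    node binary and the script are present in the SAME environment."""
    reasons = []
    if which("node") is None:
        reasons.append("node binary not found on PATH")
    if not exists(script_path):
        reasons.append(f"runner script missing: {script_path}")
    return (not reasons, reasons)


# Simulated VPS host: the script exists, node does not.
host = runner_available(
    "/opt/incomex/deploys/web-test/scripts/integrity/run-integrity.sh",
    which=lambda _name: None,
    exists=lambda _path: True,
)

# Simulated Nuxt container: node exists (location assumed), the script does not.
container = runner_available(
    "/app/scripts/integrity/main.js",
    which=lambda _name: "/usr/bin/node",
    exists=lambda _path: False,
)

print(host)
print(container)
```

Run with the default arguments inside each environment, the same function would use the real `shutil.which` and `os.path.isfile` lookups instead of the simulated ones.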
BEFORE Baseline
Captured before chaos writes:
- triggers = 26
- v_reg_rows = 23
- open_issues = 1
- CAT-ALL.record_count = 18758
- Production API `/api/registry/system-issues`: `totals = {all:1, critical:1, warning:0, info:0, group_count:1}`
- Open watchdog issue present: `ISS-0752`
Scenario Matrix
| # | Test | PASS/FAIL | Evidence / root cause |
|---|---|---|---|
| P1 | Phantom meta_catalog | PASS | Adapted to live meta_catalog schema. Row inserted, v_registry_counts auto-added row, scanner stdout detected CHAOS-TEST-P1 as missing registry config. No persisted system_issues delta. |
| P2 | Phantom (fake) system_issue | FAIL | Fake issue inserted directly into system_issues. PG open issues rose 1 -> 2, Nuxt later also showed 2, scanner stayed silent. Root cause: no authenticity guard on issue creation; fake issues propagate as real. |
| P3 | Phantom v_registry_counts | FAIL | v_registry_counts is a writable TABLE (relkind='r'), not a safe view. Phantom row CHAOS-TEST-P3 persisted. Scanner silent. |
| O1 | Orphan no code (x6) | FAIL | dot_tools accepted record with no code and auto-generated DOT-232; taxonomy / checkpoint_types failed later on other NOT NULL fields; comments / trigger_registry / entity_dependencies prompt SQL used stale name column. Root cause: no-code case not uniformly blocked; prompt field list also drifted from schema. |
| O2 | Orphan no _dot_origin | FAIL | dot_tools accepted CHAOS-TEST-O2 with _dot_origin NULL. Scanner silent. |
| O3 | meta_catalog NULL registry | PASS | meta_catalog accepted CHAOS-TEST-O3 with registry_collection NULL and auto-added a v_registry_counts row. Scanner stdout detected it as missing registry config. |
| O4 | Broken dependency | FAIL | After adapting to live schema (source_type, target_type, relation_type, direction, date_created), broken dependency CHAOS-TEST-O4 inserted successfully. Scanner silent. |
| O5 | Broken edge | FAIL | After adapting to live schema and valid edge_type=BELONGS_TO, broken universal edge FAKE-EDGE-SRC -> FAKE-EDGE-TGT inserted successfully. Scanner silent. |
| N1 | Disable trigger + insert | FAIL | With real trigger name trg_count_dot_tools, count mismatch existed immediately after insert: CAT-006 record_count=114, actual dot_tools=115. Scanner missed the mismatch. Root cause inferred from code + behavior: scanner check uses stale metadata (actual_count) instead of live COUNT(*). Mismatch later self-healed after further inserts. |
| N2 | CAT-ALL vs sum | FAIL | Read-only cross-check mismatch: CAT-ALL=18786, atom_sum=18371. Root cause: comparison contract is invalid as written; CAT-ALL counts more than atom-only managed rows. |
| N3 | v_reg vs meta cross | FAIL | Read-only cross-check showed mismatches at least on CAT-023, plus temporary chaos rows before cleanup. Active detector did not surface this class cleanly. |
| L1 | Lifecycle open→resolved | FAIL | PG open issues went 2 -> 3 -> 2, but production API stayed flat at 2 both immediately and after 130s. Root cause: transient warning issue never became visible via Nuxt endpoint. |
| L2 | Mass corruption (rollback) | PASS | In transaction: open issues 2 -> 631 -> 2 after rollback. Rollback preserved production state correctly. |
| W1 | 3-way PG/Nuxt check | PASS | At measurement time, PG open issues and Nuxt totals both read 2. |
| W2 | API consistency | PASS | system-issues=2, groups-sum=2. Endpoint pair consistent. |
| E1 | NULL/empty code | PASS | NULL and empty-string code inputs did not persist as bad codes. They were auto-normalized to DOT-233 and DOT-234. |
| E2 | Mass insert 50 | PASS | 50 rows inserted successfully. After insert, CAT-006.record_count=167 and dot_tools COUNT(*)=167, so no race/count loss observed in that batch. |
| E3 | Circular dependency | FAIL | Circular pair CHAOS-TEST-E3a / CHAOS-TEST-E3b inserted successfully. Scanner silent. |
| S1 | Watchdog alive | PASS | Watchdog issue remained open: ISS-0752 present. |
| S2 | Runner available | FAIL | Runner path exists on host but host has no node; runtime container has node but not the runner script. Effective result: runner unavailable on production runtime. |
| S3 | Auto-resolve | PASS | After selective cleanup/retirement of detected P1/O3 cases, scanner no longer reported CHAOS-TEST-P1 or CHAOS-TEST-O3. Persisted issue auto-resolve could not be verified because scanner findings did not create open system_issues deltas in this environment. |
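Scenarios O4, O5, and E3 all inserted graph rows that the scanner did not flag. The sketch below shows, in Python, the two checks those scenarios would need: endpoints that reference no known entity, and directed cycles. Treating each row as a plain `(source, target)` pair is a simplifying assumption (the live `entity_dependencies` and `universal_edges` tables carry more columns), the direction of the E3 pair is assumed, and both function names are hypothetical.

```python
from collections import defaultdict


def dangling_edges(edges, known_codes):
    """Edges whose source or target is not a known entity code."""
    return [(s, t) for s, t in edges if s not in known_codes or t not in known_codes]


def has_cycle(edges):
    """Detect a directed cycle with an iterative three-color DFS."""
    graph = defaultdict(list)
    for s, t in edges:
        graph[s].append(t)

    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)  # unseen nodes default to WHITE

    for start in list(graph):
        if color[start] != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(graph[start]))]
        while stack:
            node, children = stack[-1]
            for child in children:
                if color[child] == GRAY:
                    return True  # back edge to a node on the current path
                if color[child] == WHITE:
                    color[child] = GRAY
                    stack.append((child, iter(graph.get(child, []))))
                    break
            else:
                # All children explored: node is finished.
                color[node] = BLACK
                stack.pop()
    return False


# Codes taken from this report; which codes count as "known" is assumed.
known = {"DOT-232", "DOT-233"}
edges = [("FAKE-EDGE-SRC", "FAKE-EDGE-TGT"), ("DOT-232", "DOT-233")]
print(dangling_edges(edges, known))

circular = [("CHAOS-TEST-E3a", "CHAOS-TEST-E3b"), ("CHAOS-TEST-E3b", "CHAOS-TEST-E3a")]
print(has_cycle(circular))
```

The DFS is iterative so that a deep dependency chain cannot hit Python's recursion limit.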
Detection Rate
Clear scanner-detected fault injections:
- P1
- O3
Detection rate:
- 2/21 using the mission denominator
- 2/18 if you exclude the 3 read-only comparison checks
Important nuance:
- The active scanner produced stdout findings.
- It did not increase `system_issues` open totals during this mission.
- So “detected in stdout” and “persisted issue created/opened” are different signals in production.
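The two detection figures, and the gap between stdout detection and persisted issues, can be reproduced with a few lines of Python. The scenario IDs come from the matrix. The report states that three read-only checks are excluded from the second denominator but does not enumerate them, so only their count is used here. `rate` is a hypothetical helper.

```python
# Scenario IDs from the matrix; only P1 and O3 were flagged in scanner stdout.
SCENARIOS = [
    "P1", "P2", "P3", "O1", "O2", "O3", "O4", "O5", "N1", "N2", "N3",
    "L1", "L2", "W1", "W2", "E1", "E2", "E3", "S1", "S2", "S3",
]
STDOUT_DETECTED = {"P1", "O3"}
PERSISTED_AS_ISSUE = set()  # no scanner-created system_issues delta was observed
READ_ONLY_CHECKS = 3        # count from the report; the IDs are not listed there


def rate(hits: int, denominator: int) -> str:
    """Format a detection rate as 'hits/denominator = percent'."""
    return f"{hits}/{denominator} = {hits / denominator:.1%}"


print("stdout, mission denominator:", rate(len(STDOUT_DETECTED), len(SCENARIOS)))
print("stdout, excluding read-only:",
      rate(len(STDOUT_DETECTED), len(SCENARIOS) - READ_ONLY_CHECKS))
print("persisted as system_issues: ", rate(len(PERSISTED_AS_ISSUE), len(SCENARIOS)))
```

Tracking the stdout and persisted sets separately keeps the two signals from being conflated in follow-up runs.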
New Bugs / TD Found
- Writable `v_registry_counts` table
  - It is a real table, not a safe view.
  - Phantom rows can be inserted directly.
- Birth Gate is warning-only for several critical conditions
  - Invalid code formats are warned, not blocked.
  - `_dot_origin NULL` is warned, not blocked.
  - `meta_catalog.registry_collection NULL` is warned, not blocked.
- Fake `system_issues` can be injected as real production issues
  - No authenticity guard prevented `CHAOS-TEST-P2`.
  - Nuxt later surfaced it as a real critical issue.
- Count check in `dot-layer-integrity-audit` missed a real mismatch
  - Observed at runtime on N1.
  - Inference from repo code and behavior: it compares metadata fields rather than a fresh table count.
- Production runner is unavailable
  - Host path exists, but host lacks `node`.
  - Nuxt runtime container lacks the source runner script.
- Cleanup governance is partially broken
  - Direct delete from `meta_catalog` is blocked by `fn_guard_meta_catalog_delete()`.
  - Both `retire_entity(text)` and `deprecate_entity(text)` fail because they still query `table_registry.code`, which no longer exists.
  - Workaround used: set detected chaos rows to `status='log'` and remove their `v_registry_counts` rows.
- Lifecycle visibility bug on production API
  - Temporary `WARNING` issue `CHAOS-TEST-L1` never appeared in Nuxt totals, even after 130 seconds.
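The count-check finding (N1) suggests comparing the stored `record_count` against a live `COUNT(*)` instead of against another metadata field. The Python sketch below illustrates that comparison using an in-memory SQLite database as a stand-in for PostgreSQL. The two-table schema is a deliberate simplification, the assumption that `registry_collection` holds the counted table's name is not confirmed by this report, and `count_mismatches` is a hypothetical function rather than the scanner's actual code.

```python
import sqlite3


def count_mismatches(conn):
    """Compare each catalog row's stored record_count with a live COUNT(*)."""
    mismatches = []
    rows = conn.execute(
        "SELECT code, registry_collection, record_count FROM meta_catalog "
        "WHERE registry_collection IS NOT NULL"
    ).fetchall()
    for code, table, stored in rows:
        # Table names cannot be bound as parameters, so quote the identifier.
        quoted = '"' + table.replace('"', '""') + '"'
        live = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        if live != stored:
            mismatches.append((code, stored, live))
    return mismatches


# Reproduce the N1 numbers: catalog says 114, the table really holds 115 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dot_tools (code TEXT)")
conn.execute(
    "CREATE TABLE meta_catalog "
    "(code TEXT, registry_collection TEXT, record_count INTEGER)"
)
conn.executemany(
    "INSERT INTO dot_tools VALUES (?)", [(f"DOT-{i}",) for i in range(115)]
)
conn.execute("INSERT INTO meta_catalog VALUES ('CAT-006', 'dot_tools', 114)")

print(count_mismatches(conn))  # [('CAT-006', 114, 115)]
```

A live count costs one query per catalog row, but it does not depend on the trigger that maintains `record_count`, which is exactly the component N1 disabled.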
Selective Cleanup Per Mission Rule
Cleaned or neutralized because detected or no longer a live fault:
- `CHAOS-TEST-P1` (`meta_catalog.status -> log`, `v_registry_counts` row removed)
- `CHAOS-TEST-O3` (`meta_catalog.status -> log`, `v_registry_counts` row removed)
- `CHAOS-TEST-L1` removed from `system_issues`
- `CHAOS-TEST-N1` removed from `dot_tools`
- `CHAOS-TEST-E2-*` removed from `dot_tools`
- normalized one-off rows removed by name: `CHAOS-TEST-O1-dot_tools`, `CHAOS-TEST-E1a`, `CHAOS-TEST-E1b`
Intentionally left in place because the fault was not detected:
- `system_issues.CHAOS-TEST-P2`
- `v_registry_counts.CHAOS-TEST-P3`
- `dot_tools.CHAOS-TEST-O2`
- `entity_dependencies.CHAOS-TEST-O4`
- `universal_edges.FAKE-EDGE-SRC -> FAKE-EDGE-TGT`
- `entity_dependencies.CHAOS-TEST-E3a`
- `entity_dependencies.CHAOS-TEST-E3b`
Final State
After selective cleanup:
- triggers = 26
- v_reg_rows = 26
- open_issues = 2
- CAT-ALL.record_count = 19003
- Watchdog still alive: `ISS-0752`
- Remaining intentionally retained undetected chaos rows verified present:
  - `CHAOS-TEST-P2`
  - `CHAOS-TEST-P3`
  - `CHAOS-TEST-O2`
  - `CHAOS-TEST-O4`
  - `FAKE-EDGE-SRC`
  - `CHAOS-TEST-E3a`
  - `CHAOS-TEST-E3b`
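Comparing the baseline captured before the chaos writes with this final state shows which counters drifted. The short Python sketch below computes the deltas from the figures reported in this document. `state_diff` is a hypothetical helper, and the report does not itemize which retained rows account for each delta.

```python
BEFORE = {
    "triggers": 26,
    "v_reg_rows": 23,
    "open_issues": 1,
    "CAT-ALL.record_count": 18758,
}
AFTER = {
    "triggers": 26,
    "v_reg_rows": 26,
    "open_issues": 2,
    "CAT-ALL.record_count": 19003,
}


def state_diff(before, after):
    """Return {metric: after - before} for every metric that changed."""
    return {k: after[k] - before[k] for k in before if after[k] != before[k]}


print(state_diff(BEFORE, AFTER))
# {'v_reg_rows': 3, 'open_issues': 1, 'CAT-ALL.record_count': 245}
```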
Conclusion
Detection rate: 2/21.
Điều 31+ production readiness: NO.
Primary blockers:
- Scanner coverage is narrow and misses several real fault classes (`P2`, `P3`, `O2`, `O4`, `O5`, `N1`, `E3`).
- Fake issues can enter production as real issues.
- Count mismatch detection is not reliable.
- Production runner is not executable.
- Official cleanup functions for `meta_catalog` are broken on current schema.
The retained undetected rows above are the concrete proof set for follow-up investigation.