KB-1443 rev 3

S167D — Chaos Test Điều 31 Report

  • Agent: Codex CLI
  • Date: 2026-03-26
  • Repo: web-test
  • Mode: Production chaos test via ssh contabo
  • Scope: Data-only fault injection, no code changes

Assembly Gate

  1. PostgreSQL-first: YES. Mission executed directly on PG production via docker exec postgres psql -U directus -d directus.
  2. Nuxt UI reuse: N/A.
  3. Existing tooling reuse: YES. Reused scripts/integrity/* and dot/bin/dot-layer-integrity-audit where executable.
  4. New code written: NO.
  5. New schema/tooling created: NO.
  6. Custom registry update needed: NO.

Step 0 Quotes

  • Operating rules (OR) loaded first with search_knowledge("operating rules SSOT").
  • Merge rule from CLAUDE.md: Never push directly to main. The pre-push hook blocks it. Always use feature branches and PRs.
  • CI GREEN detail from .claude/skills/incomex-rules.md: only 4 required checks matter.

Actual Production Paths

The retry prompt still did not match the live environment. Actual production runtime on 2026-03-26:

  • SSH: ssh contabo
  • PG container: postgres (not workflow-postgres)
  • Runner source path on host: /opt/incomex/deploys/web-test/scripts/integrity/run-integrity.sh
  • Scanner path on host: /opt/incomex/dot/bin/dot-layer-integrity-audit
  • Nuxt container: incomex-nuxt
  • Directus container: incomex-directus

Operational findings about tooling:

  • Host-side runner exists but is NOT runnable there: node is missing on the VPS host.
  • Nuxt container has node but does NOT contain /app/scripts/integrity/main.js.
  • Result: Node runner is effectively unavailable in production runtime.
  • Scanner is runnable in cloud mode when provided a Directus admin token.
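
The two availability findings above suggest a simple preflight check: the runner is usable only where both node and the runner script are present. The following is a minimal sketch, not part of scripts/integrity. The helper name `runner_available` and its injectable `which` / `is_file` parameters are assumptions made for illustration, and the node path in the container example is a placeholder.

```python
import os
import shutil
from typing import Callable, List, Optional, Tuple


def runner_available(
    script_path: str,
    which: Callable[[str], Optional[str]] = shutil.which,
    is_file: Callable[[str], bool] = os.path.isfile,
) -> Tuple[bool, List[str]]:
    """Return (ok, reasons). ok is True only if node AND the script exist."""
    reasons: List[str] = []
    if which("node") is None:
        reasons.append("node not found on PATH")
    if not is_file(script_path):
        reasons.append(f"runner script missing: {script_path}")
    return (not reasons, reasons)


# Simulated environments mirroring the two findings (lookups are stubbed).
host = runner_available(
    "/opt/incomex/deploys/web-test/scripts/integrity/run-integrity.sh",
    which=lambda name: None,      # host: node is missing
    is_file=lambda path: True,    # host: script is present
)
container = runner_available(
    "/app/scripts/integrity/main.js",
    which=lambda name: "/usr/local/bin/node",  # placeholder path, node present
    is_file=lambda path: False,                # container: script is absent
)
print(host)       # (False, ['node not found on PATH'])
print(container)  # (False, ['runner script missing: /app/scripts/integrity/main.js'])
```

Called with the defaults, the same function checks the real PATH and filesystem of whichever environment it runs in.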

BEFORE Baseline

Captured before chaos writes:

  • triggers = 26
  • v_reg_rows = 23
  • open_issues = 1
  • CAT-ALL.record_count = 18758
  • Production API /api/registry/system-issues.totals = {all:1, critical:1, warning:0, info:0, group_count:1}
  • Open watchdog issue present: ISS-0752

Scenario Matrix

| ID | Test | Result | Evidence / root cause |
|----|------|--------|-----------------------|
| P1 | Phantom meta_catalog | PASS | Adapted to live meta_catalog schema. Row inserted, v_registry_counts auto-added a row, and scanner stdout detected CHAOS-TEST-P1 as missing registry config. No persisted system_issues delta. |
| P2 | Phantom (fake) system_issue | FAIL | Fake issue inserted directly into system_issues. PG open issues rose 1 -> 2, Nuxt later also showed 2, scanner stayed silent. Root cause: no authenticity guard on issue creation; fake issues propagate as real. |
| P3 | Phantom v_registry_counts | FAIL | v_registry_counts is a writable TABLE (relkind='r'), not a safe view. Phantom row CHAOS-TEST-P3 persisted. Scanner silent. |
| O1 | Orphan no code (x6) | FAIL | dot_tools accepted a record with no code and auto-generated DOT-232; taxonomy / checkpoint_types failed later on other NOT NULL fields; the prompt SQL for comments / trigger_registry / entity_dependencies used a stale name column. Root cause: the no-code case is not uniformly blocked; the prompt field list has also drifted from the schema. |
| O2 | Orphan no _dot_origin | FAIL | dot_tools accepted CHAOS-TEST-O2 with _dot_origin NULL. Scanner silent. |
| O3 | meta_catalog NULL registry | PASS | meta_catalog accepted CHAOS-TEST-O3 with registry_collection NULL and auto-added a v_registry_counts row. Scanner stdout detected it as missing registry config. |
| O4 | Broken dependency | FAIL | After adapting to the live schema (source_type, target_type, relation_type, direction, date_created), broken dependency CHAOS-TEST-O4 inserted successfully. Scanner silent. |
| O5 | Broken edge | FAIL | After adapting to the live schema with a valid edge_type=BELONGS_TO, broken universal edge FAKE-EDGE-SRC -> FAKE-EDGE-TGT inserted successfully. Scanner silent. |
| N1 | Disable trigger + insert | FAIL | With the real trigger name trg_count_dot_tools, a count mismatch existed immediately after insert: CAT-006 record_count=114, actual dot_tools=115. Scanner missed the mismatch. Root cause inferred from code + behavior: the scanner check uses stale metadata (actual_count) instead of a live COUNT(*). The mismatch later self-healed after further inserts. |
| N2 | CAT-ALL vs sum | FAIL | Read-only cross-check mismatch: CAT-ALL=18786, atom_sum=18371. Root cause: the comparison contract is invalid as written; CAT-ALL counts more than atom-only managed rows. |
| N3 | v_reg vs meta cross | FAIL | Read-only cross-check showed mismatches on at least CAT-023, plus temporary chaos rows before cleanup. The active detector did not surface this class cleanly. |
| L1 | Lifecycle open→resolved | FAIL | PG open issues went 2 -> 3 -> 2, but the production API stayed flat at 2, both immediately and after 130s. Root cause: the transient warning issue never became visible via the Nuxt endpoint. |
| L2 | Mass corruption (rollback) | PASS | In transaction: open issues 2 -> 631 -> 2 after rollback. Rollback preserved production state correctly. |
| W1 | 3-way PG/Nuxt check | PASS | At measurement time, PG open issues and Nuxt totals both read 2. |
| W2 | API consistency | PASS | system-issues=2, groups-sum=2. Endpoint pair consistent. |
| E1 | NULL/empty code | PASS | NULL and empty-string code inputs did not persist as bad codes. They were auto-normalized to DOT-233 and DOT-234. |
| E2 | Mass insert 50 | PASS | 50 rows inserted successfully. After insert, CAT-006.record_count=167 and dot_tools COUNT(*)=167, so no race/count loss was observed in that batch. |
| E3 | Circular dependency | FAIL | Circular pair CHAOS-TEST-E3a / CHAOS-TEST-E3b inserted successfully. Scanner silent. |
| S1 | Watchdog alive | PASS | Watchdog issue remained open: ISS-0752 present. |
| S2 | Runner available | FAIL | Runner path exists on the host but the host has no node; the runtime container has node but not the runner script. Effective result: runner unavailable on production runtime. |
| S3 | Auto-resolve | PASS | After selective cleanup/retirement of the detected P1/O3 cases, the scanner no longer reported CHAOS-TEST-P1 or CHAOS-TEST-O3. Persisted issue auto-resolve could not be verified because scanner findings did not create open system_issues deltas in this environment. |
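
The O4, O5 and E3 rows describe two graph-level faults the scanner stayed silent on: edges whose endpoints do not exist, and circular dependencies. The sketch below shows one way such checks could be expressed once the rows are loaded as (source, target) pairs. It is an illustration under stated assumptions, not the logic of dot-layer-integrity-audit: the function names are hypothetical, and the `DOT-001` / `DOT-002` codes stand in for whatever set of known entity codes the caller supplies.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (source_code, target_code)


def dangling_edges(edges: Iterable[Edge], known: Set[str]) -> List[Edge]:
    """Edges where at least one endpoint is not a known entity code."""
    return [(s, t) for s, t in edges if s not in known or t not in known]


def has_cycle(edges: Iterable[Edge]) -> bool:
    """Depth-first search with three colours; a GREY hit means a back edge."""
    graph: Dict[str, List[str]] = {}
    for s, t in edges:
        graph.setdefault(s, []).append(t)
        graph.setdefault(t, [])
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node: str) -> bool:
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in graph)


known_codes = {"DOT-001", "DOT-002"}  # placeholder codes for illustration
edges = [("DOT-001", "DOT-002"), ("FAKE-EDGE-SRC", "FAKE-EDGE-TGT")]
circular = [("CHAOS-TEST-E3a", "CHAOS-TEST-E3b"),
            ("CHAOS-TEST-E3b", "CHAOS-TEST-E3a")]

print(dangling_edges(edges, known_codes))  # [('FAKE-EDGE-SRC', 'FAKE-EDGE-TGT')]
print(has_cycle(circular))                 # True
print(has_cycle(edges))                    # False
```

The recursive search is fine for small graphs; a very deep dependency chain would need an iterative version to stay under Python's recursion limit.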

Detection Rate

Clear scanner-detected fault injections:

  • P1
  • O3

Detection rate:

  • 2/21 using the mission denominator
  • 2/18 if you exclude the 3 read-only comparison checks
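
As a quick arithmetic check, the two ratios above work out to roughly 9.5% and 11.1%. The helper name `detection_rate` is just for illustration.

```python
def detection_rate(detected: int, total: int) -> str:
    """Format a detection ratio together with its percentage."""
    return f"{detected}/{total} = {detected / total * 100:.1f}%"


print(detection_rate(2, 21))  # 2/21 = 9.5%
print(detection_rate(2, 18))  # 2/18 = 11.1%
```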

Important nuance:

  • The active scanner produced stdout findings.
  • It did not increase system_issues open totals during this mission.
  • So “detected in stdout” and “persisted issue created/opened” are different signals in production.

New Bugs / TD Found

  1. Writable v_registry_counts table

    • It is a real table, not a safe view.
    • Phantom rows can be inserted directly.
  2. Birth Gate is warning-only for several critical conditions

    • Invalid code formats raise a warning but are not blocked.
    • _dot_origin NULL raises a warning but is not blocked.
    • meta_catalog.registry_collection NULL raises a warning but is not blocked.
  3. Fake system_issues can be injected as real production issues

    • No authenticity guard prevented CHAOS-TEST-P2.
    • Nuxt later surfaced it as a real critical issue.
  4. Count check in dot-layer-integrity-audit missed a real mismatch

    • Observed at runtime on N1.
    • Inference from repo code and behavior: it compares metadata fields rather than a fresh table count.
  5. Production runner is unavailable

    • Host path exists, but host lacks node.
    • Nuxt runtime container lacks the source runner script.
  6. Cleanup governance is partially broken

    • Direct delete from meta_catalog is blocked by fn_guard_meta_catalog_delete().
    • Both retire_entity(text) and deprecate_entity(text) fail because they still query table_registry.code, which no longer exists.
    • Workaround used: set detected chaos rows to status='log' and remove their v_registry_counts rows.
  7. Lifecycle visibility bug on production API

    • Temporary WARNING issue CHAOS-TEST-L1 never appeared in Nuxt totals, even after 130 seconds.
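
Bug 4 comes down to which two numbers are compared. A check that catches the N1 case has to compare the recorded record_count against a count taken from the table at check time. The sketch below assumes the caller has already fetched both sets of numbers (for example with a COUNT(*) per table); `count_mismatches` is a hypothetical name, and the sketch is not the scanner's actual code.

```python
from typing import Dict, List, Tuple


def count_mismatches(
    recorded: Dict[str, int], live: Dict[str, int]
) -> List[Tuple[str, int, int]]:
    """Return (code, recorded, live) for every catalog code whose counts differ."""
    return [
        (code, recorded[code], live[code])
        for code in sorted(recorded)
        if code in live and recorded[code] != live[code]
    ]


# N1: trigger disabled, one row inserted, metadata not updated.
print(count_mismatches({"CAT-006": 114}, {"CAT-006": 115}))
# [('CAT-006', 114, 115)]

# E2: after the 50-row batch both numbers agreed.
print(count_mismatches({"CAT-006": 167}, {"CAT-006": 167}))
# []
```

Comparing two stored metadata fields with each other, as the inference in bug 4 suggests, would pass in the N1 case because neither field reflects the newly inserted row.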

Selective Cleanup Per Mission Rule

Cleaned or neutralized because the fault was detected or was no longer live:

  • CHAOS-TEST-P1 (meta_catalog.status -> log, v_registry_counts row removed)
  • CHAOS-TEST-O3 (meta_catalog.status -> log, v_registry_counts row removed)
  • CHAOS-TEST-L1 removed from system_issues
  • CHAOS-TEST-N1 removed from dot_tools
  • CHAOS-TEST-E2-* removed from dot_tools
  • normalized one-off rows removed by name: CHAOS-TEST-O1-dot_tools, CHAOS-TEST-E1a, CHAOS-TEST-E1b

Intentionally left in place because the fault was not detected:

  • system_issues.CHAOS-TEST-P2
  • v_registry_counts.CHAOS-TEST-P3
  • dot_tools.CHAOS-TEST-O2
  • entity_dependencies.CHAOS-TEST-O4
  • universal_edges.FAKE-EDGE-SRC -> FAKE-EDGE-TGT
  • entity_dependencies.CHAOS-TEST-E3a
  • entity_dependencies.CHAOS-TEST-E3b

Final State

After selective cleanup:

  • triggers = 26
  • v_reg_rows = 26
  • open_issues = 2
  • CAT-ALL.record_count = 19003
  • Watchdog still alive: ISS-0752
  • Intentionally retained, undetected chaos rows verified as still present:
    • CHAOS-TEST-P2
    • CHAOS-TEST-P3
    • CHAOS-TEST-O2
    • CHAOS-TEST-O4
    • FAKE-EDGE-SRC
    • CHAOS-TEST-E3a
    • CHAOS-TEST-E3b
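
Comparing the BEFORE Baseline figures with the final state makes the net footprint of the mission explicit. A minimal sketch of that comparison follows; the dictionary keys are shorthand chosen here, not field names from the database.

```python
from typing import Dict


def baseline_delta(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return after - before for every metric that changed."""
    return {k: after[k] - before[k] for k in before if after[k] != before[k]}


before = {"triggers": 26, "v_reg_rows": 23, "open_issues": 1,
          "cat_all_record_count": 18758}
after = {"triggers": 26, "v_reg_rows": 26, "open_issues": 2,
         "cat_all_record_count": 19003}

print(baseline_delta(before, after))
# {'v_reg_rows': 3, 'open_issues': 1, 'cat_all_record_count': 245}
```

The trigger count is unchanged, and the open_issues delta of 1 is consistent with the retained CHAOS-TEST-P2 row.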

Conclusion

Detection rate: 2/21.

Điều 31+ production readiness: NO.

Primary blockers:

  • Scanner coverage is narrow and misses several real fault classes (P2, P3, O2, O4, O5, N1, E3).
  • Fake issues can enter production as real issues.
  • Count mismatch detection is not reliable.
  • Production runner is not executable.
  • Official cleanup functions for meta_catalog are broken on current schema.

The retained undetected rows above are the concrete proof set for follow-up investigation.