KB-391C

S167F Chaos Re-Test Report

Revision 1
Tags: report, s167f, chaos-test, dieu31, production

S167F — Chaos Re-Test Round 2 (Expanded)

Date: 2026-03-26
Agent: Codex CLI (codex-webtest)
Repo: web-test
Mode: Production chaos test, no code changes

Mandatory Rule Check

  • search_knowledge("operating rules SSOT") completed.
  • .claude/skills/incomex-rules.md read.
  • S167D chaos test report read.
  • S167E chaos hardening report read.
  • §0-AK quote used for scoring (translated): TD-379 Birth Gate: DECISION = DO NOT BLOCK. Keep WARN. The Điều 31 scanner MUST detect _dot_origin NULL and report it.
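The §0-AK rule quoted above (warn, do not block, but always report a NULL _dot_origin) can be sketched as a scanner check. This is a minimal illustration, not the Điều 31 scanner's actual code: the `Finding` type, the `scan_dot_origin` function, the `code` column name, and the `dot_tools` table used in the usage lines are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Mapping, Optional


@dataclass(frozen=True)
class Finding:
    severity: str          # always "WARN" here: the rule says do not block
    table: str
    code: Optional[str]    # hypothetical identifier column
    message: str


def scan_dot_origin(table: str, rows: Iterable[Mapping[str, object]]) -> list[Finding]:
    """Return one WARN finding per row whose _dot_origin is NULL (None)."""
    findings = []
    for row in rows:
        if row.get("_dot_origin") is None:
            code = row.get("code")
            findings.append(
                Finding("WARN", table, None if code is None else str(code), "_dot_origin is NULL")
            )
    return findings


# Usage with made-up rows: the write is accepted, the scanner still reports it.
rows = [
    {"code": "CHAOS-TEST-O2", "_dot_origin": None},
    {"code": "DOT-001", "_dot_origin": "seed"},
]
findings = scan_dot_origin("dot_tools", rows)
```

The check never raises or rejects a row, which matches the "keep WARN" decision; the only obligation is that the finding is emitted.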

BEFORE Baseline

PG

| Metric | BEFORE |
| --- | --- |
| trigger_count (trg_% + fn_guard_%) | 141 |
| v_registry_counts rows | 23 |
| open system_issues | 1 |
| CAT-ALL.record_count | 18962 |
| total dot_tools | 112 |
| total entity_dependencies | 141 |
| total universal_edges | 2040 |

Nuxt / Agent Data

  • system-issues.totals: {"all":1,"critical":1,"warning":0,"info":0,"group_count":1}
  • /api/health.data_integrity: document_count=582, vector_point_count=851, ratio=1.46, sync_status=ok
  • /api/health.event_system: enabled=true, listeners=1, events_logged=1330

Pre-clean for stale chaos

S167F found 3 leftover entity_dependencies rows from S167D even though S167E reported clean:

  • CHAOS-TEST-O4
  • CHAOS-TEST-E3a
  • CHAOS-TEST-E3b

They were deleted before new injections, as required by mission rule #8.

Phase 1 — 21 Original Scenarios

| # | Test | S167D | S167F | Delta | Evidence | Root cause if FAIL |
| --- | --- | --- | --- | --- | --- | --- |
| P1 | Phantom meta_catalog | PASS | PASS | same | Inserted CHAOS-TEST-P1; scanner matched "CHAOS-TEST-P1 Phantom Test — Lớp 2 thiếu cấu hình bảng registry" ("Layer 2 is missing the registry table configuration") | |
| P2 | Phantom fake system_issue | FAIL | PASS | fixed | PG error: system_issues INSERT requires source_system or source field. Anonymous inserts blocked. | |
| P3 | Phantom v_registry_counts | FAIL | PASS | fixed | PG error: Direct modification of v_registry_counts is blocked. | |
| O1 | Orphan no code (x6) | FAIL | FAIL | same | dot_tools→DOT-235, taxonomy→LBL-508, checkpoint_types→CP-033, trigger_registry→TRG-087, entity_dependencies→DEP-0336; comments failed NOT NULL; scanner grep on all O1 markers returned no match | 5/6 tables auto-generated a code and comments was hard-blocked by NOT NULL, so true orphan-no-code cases never reached the scanner |
| O2 | Orphan no _dot_origin | FAIL | FAIL | same | Inserted CHAOS-TEST-O2 with _dot_origin=NULL; scanner grep for `CHAOS-TEST-O2 DOT origin | |
| O3 | meta_catalog NULL registry | PASS | PASS | same | Inserted CHAOS-TEST-O3; scanner matched "Missing Registry — Lớp 2 thiếu cấu hình bảng registry" ("Layer 2 is missing the registry table configuration") | |
| O4 | Broken dependency | FAIL | FAIL | same | Inserted CHAOS-TEST-O4 -> FAKE-SRC-999 -> FAKE-TGT-999; scanner grep returned no match | No broken-reference detection for entity_dependencies |
| O5 | Broken edge | FAIL | FAIL | same | Inserted FAKE-EDGE-SRC -> FAKE-EDGE-TGT; scanner grep returned no match | No broken-reference detection for universal_edges |
| N1 | Disable trigger + insert | FAIL | PASS* | improved but unstable | After hidden insert: scanner emitted "GEM-CHAOS-P1 ... meta_catalog nói 116 nhưng thực tế 117" ("meta_catalog says 116 but the actual count is 117") | Detection happened, but the mismatch was attributed to a duplicate meta_catalog row (GEM-CHAOS-P1) instead of CAT-006 |
| N2 | CAT-ALL vs sum | FAIL | FAIL | same | CAT-ALL=19127, atom_sum=18709 during Phase 1 | Query invariant is not stable: CAT-ALL covers all managed rows, not only atom rows, and live production writes changed totals during the mission |
| N3 | v_reg vs meta cross | FAIL | FAIL | same | Mismatches observed: CAT-006 113 vs 118, CAT-019 108 vs 107, CAT-023 17941 vs 17635, plus chaos rows while injected | v_registry_counts and meta_catalog.record_count drift independently; duplicate registry_collection mappings amplify misattribution |
| L1 | Lifecycle open→resolved | FAIL | PASS | fixed | PG 1→2→1; API totals updated after ~30s in both directions (T30=2, RT30=1) | |
| L2 | Mass corruption rollback | PASS | PASS | same | 500 issue inserts inside a transaction gave after_insert_open=501; ROLLBACK restored to 1, chaos_left=0 | |
| W1 | 3-way PG/Nuxt check | PASS | FAIL (transient) | regressed | At one check: PG open=2, API totals.all=1 | Eventual consistency or a race during the concurrent issue lifecycle made PG and Nuxt diverge momentarily |
| W2 | API consistency | PASS | FAIL (transient) | regressed | Same moment: totals.all=1, groups-sum=2, PG=2 | system-issues totals lagged system-issues-groups and PG truth |
| E1 | NULL / empty code | PASS | PASS | same | NULL normalized to DOT-238; empty string normalized to DOT-239; blank-code count remained 0 | |
| E2 | Mass insert 50 | PASS | FAIL | regressed | dot_tools actual 121→171, but CAT-006.record_count stayed 113; scanner only changed description percentage and later misattributed count mismatch via GEM-CHAOS-P1 | Count update path for dot_tools is broken/stale; duplicate meta_catalog for dot_tools distorts detection |
| E3 | Circular dependency | FAIL | FAIL | same | Inserted CHAOS-TEST-E3a/E3b (DOT-001↔DOT-002); scanner grep returned no match | No cycle detection on entity_dependencies |
| S1 | Watchdog alive | PASS | PASS | same | ISS-0752 open at start | |
| S2 | Runner available | FAIL | PASS | fixed enough | Bare shell: node: command not found; with source ~/.nvm/nvm.sh, node /opt/incomex/deploys/web-test/scripts/integrity/main.js --dry-run ran and produced stdout (`PASS: 37 FAIL: 0 | |
| S3 | Auto-resolve after cleanup | PASS | PASS | same | After cleanup, scanner grep for `CHAOS-TEST GEM-CHAOS | |
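The O4, O5 and E3 failures all come down to graph checks that the scanner does not yet perform. The sketch below shows one way such checks could look: a broken-reference filter and a depth-first cycle search over dependency edges. It assumes each edge is a `(dependency code, source code, target code)` tuple and that the set of valid entity codes is available; both shapes are assumptions for illustration, not the real schema of entity_dependencies or universal_edges.

```python
from collections import defaultdict
from typing import Iterable, Optional

Edge = tuple[str, str, str]  # (dependency code, source code, target code), assumed shape


def find_broken_edges(edges: Iterable[Edge], known_codes: set[str]) -> list[Edge]:
    """Return edges whose source or target is not a known entity code."""
    return [e for e in edges if e[1] not in known_codes or e[2] not in known_codes]


def find_cycle(edges: Iterable[Edge]) -> Optional[list[str]]:
    """Return one cycle as a list of codes (first == last), or None if acyclic."""
    graph: dict[str, list[str]] = defaultdict(list)
    for _, src, tgt in edges:
        graph[src].append(tgt)

    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    color: dict[str, int] = defaultdict(int)
    path: list[str] = []

    def visit(node: str) -> Optional[list[str]]:
        color[node] = GREY
        path.append(node)
        for nxt in graph.get(node, []):
            if color[nxt] == GREY:        # back edge: nxt is already on the path
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in list(graph):
        if color[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return None


# Usage with the chaos rows from the table above.
edges = [
    ("CHAOS-TEST-E3a", "DOT-001", "DOT-002"),
    ("CHAOS-TEST-E3b", "DOT-002", "DOT-001"),
    ("CHAOS-TEST-O4", "FAKE-SRC-999", "FAKE-TGT-999"),
]
broken = find_broken_edges(edges, known_codes={"DOT-001", "DOT-002"})
cycle = find_cycle(edges)
```

The recursive search is adequate for a table of this size (141 rows at baseline); a much larger graph would call for an iterative traversal to stay clear of Python's recursion limit.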

Phase 2 — Auto-System Liveness (10 scenarios)

| # | Test | Result | Evidence | Implication |
| --- | --- | --- | --- | --- |
| V1 | Vector/Document count parity | FAIL | /api/health: document_count=582, vector_point_count=851, ratio=1.46 | Exposed parity metric is far outside the 5% threshold |
| V2 | Vector sync after CRUD | PASS | Uploaded test doc with unique token; search_knowledge found it after 30s; after delete + 30s, search no longer found it | Vector create/delete propagation works via the Agent Data API path |
| V3 | Orphan vector detection | FAIL | /api/openapi.json exposed no live orphan-vector endpoint; parity gap remained 582 vs 851 | orphan = 0 cannot be proven from live telemetry |
| A1 | Event system alive | PASS | /api/health.event_system: enabled=true, listeners=1 | Event loop is alive |
| A2 | Directus sync active | FAIL | Agent Data list_documents(path="knowledge/") -> count=361; Directus knowledge_documents -> 370 | Sync drift of 9 docs between Directus and Agent Data |
| A3 | Event system kill test (read-only) | PASS | docker inspect incomex-agent-data -> StartedAt=2026-03-23T14:53:04Z, restart_policy=unless-stopped | Container will auto-restart, so liveness does not hinge on a single process start |
| C1 | Integrity runner cron active | PASS | Crontab lines exist: cron-integrity.sh daily, watchdog-monitor.sh hourly | Automation is scheduled |
| C2 | Runner last execution | FAIL | Only full run artifact was cron-20260323-200011.log; current cron.log ended with DATABASE_URL: unbound variable and DIRECTUS_TOKEN ... PERMISSION | Cron exists, but the last successful runner execution is stale and the current schedule path is broken |
| C3 | Scanner last execution | FAIL | No dedicated scanner cron entry and no recent scanner log artifact found | Scanner automation is not independently scheduled/observable |
| C4 | Watchdog heartbeat freshness | PASS | Initial watchdog last_seen_at=2026-03-25 15:35:25.611+00; later a new watchdog row ISS-1647 was opened at 2026-03-26 05:50:12Z | Heartbeat stayed <24h throughout the mission |
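The V1 and A2 verdicts are simple arithmetic on the reported counts. The sketch below reproduces them, assuming the 5% threshold is a relative deviation of the vector/document ratio from 1.0; that reading of the threshold, and the function names, are assumptions for the example rather than the health endpoint's actual logic.

```python
def parity_report(document_count: int, vector_point_count: int, tolerance: float = 0.05) -> dict:
    """Compare vector points to documents; tolerance is the allowed deviation from ratio 1.0."""
    if document_count <= 0:
        raise ValueError("document_count must be positive")
    ratio = vector_point_count / document_count
    return {
        "ratio": round(ratio, 2),
        "within_tolerance": abs(ratio - 1.0) <= tolerance,
    }


def sync_drift(directus_count: int, agent_data_count: int) -> int:
    """Absolute difference in document counts between the two stores."""
    return abs(directus_count - agent_data_count)


# Figures taken from the V1 and A2 rows above.
v1 = parity_report(document_count=582, vector_point_count=851)
a2 = sync_drift(directus_count=370, agent_data_count=361)
```

If documents are legitimately split into several vector points, a ratio above 1.0 is expected and the threshold would need to be defined against a different baseline; the report does not say which applies.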

Phase 3 — Multi-Round Consistency

| Round | Scenarios re-tested | All PASS? | Anomalies |
| --- | --- | --- | --- |
| 1 | Full Phase 1 + Phase 2 | NO | Major misses remained on _dot_origin, broken deps/edges, circular deps, API consistency, vector parity, automation freshness |
| 2 | P2, P3, N1, V2, W1, W2 | NO | P2/P3 blocked again; V2 passed again; W1/W2 stabilized (PG=2, API totals=2, groups-sum=2); N1 did not stably detect after cleanup and only changed description counts |
| 3 | After 5-minute wait: V1, W1, W2, A1, C4, baseline | NO | V1 still failed (582/851, ratio 1.46); W1/W2 stayed stable (PG open=2, API all=2, groups-sum=2); A1 stayed alive; C4 stayed fresh; CAT-ALL continued increasing due to live production traffic |

Detection Rates

  • Phase 1: 11/21
  • Phase 2: 5/10
  • Total: 16/31
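As a quick check of the arithmetic, the per-phase pass counts can be summed and expressed as percentages. The dictionary below simply restates the figures from the list above.

```python
# (passed, total) per phase, as reported above.
phases = {"Phase 1": (11, 21), "Phase 2": (5, 10)}

passed = sum(p for p, _ in phases.values())
total = sum(t for _, t in phases.values())

for name, (p, t) in phases.items():
    print(f"{name}: {p}/{t} = {p / t:.1%}")   # 52.4% and 50.0%
print(f"Total: {passed}/{total} = {passed / total:.1%}")  # 51.6%
```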

Comparison:

  • S167D: 2/21
  • S167F: 11/21 on Phase 1, but still not production-ready

New / Confirmed Bugs

CRITICAL

  • Scanner still misses _dot_origin NULL on entity tables (O2) despite §0-AK requiring detection.
  • Broken entity_dependencies and universal_edges remain invisible to Điều 31 (O4, O5).
  • Circular dependencies remain invisible (E3).
  • dot_tools count integrity is unreliable; mass inserts can leave meta_catalog.record_count stale (E2).
  • Duplicate meta_catalog.registry_collection='dot_tools' rows (CAT-006, GEM-CHAOS-P1) plus refresh_registry_count() using LIMIT 1 cause misattributed or missed count alerts.
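The duplicate-mapping problem in the last item can be detected independently of the count refresh. The sketch below groups catalog rows by registry_collection and reports any collection claimed by more than one row. The `code` key and the row shape are assumptions, and the internals of refresh_registry_count() are not shown in this report; the point is only that a single-row lookup (such as LIMIT 1 without a deterministic ordering) has no way to tell which of two rows is the intended one.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def duplicate_registry_mappings(catalog_rows: Iterable[Mapping[str, object]]) -> dict[str, list[str]]:
    """Map registry_collection -> catalog codes, only where more than one row claims it."""
    by_collection: dict[str, list[str]] = defaultdict(list)
    for row in catalog_rows:
        collection = row.get("registry_collection")
        if collection is not None:
            by_collection[str(collection)].append(str(row["code"]))
    return {c: codes for c, codes in by_collection.items() if len(codes) > 1}


# Usage with the two rows named above plus one unambiguous row.
catalog = [
    {"code": "CAT-006", "registry_collection": "dot_tools"},
    {"code": "GEM-CHAOS-P1", "registry_collection": "dot_tools"},
    {"code": "CAT-023", "registry_collection": "birth_registry"},
]
duplicates = duplicate_registry_mappings(catalog)
```

Running a check like this before comparing counts would let the scanner report the ambiguity itself instead of attributing a mismatch to whichever row happens to be returned first.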

HIGH

  • API totals can disagree with both PG and groups endpoint (W1, W2).
  • Runner cron exists but successful execution is stale; current schedule path is breaking on env/token setup (C2).
  • No separate scanner automation evidence (C3).
  • Agent Data vs Directus knowledge-doc sync drift remains (A2).
  • Vector parity metric is far outside threshold and orphan=0 cannot be proven (V1, V3).
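Because the W1/W2 disagreement was transient and the API totals were observed to catch up after roughly 30 seconds, a consistency check may need to distinguish lag from a persistent mismatch. The sketch below retries a three-way comparison before declaring failure. The three reader callables, the attempt count, and the wait time are hypothetical parameters; nothing here is taken from the existing test harness.

```python
import time
from typing import Callable


def three_way_consistent(
    pg_open: Callable[[], int],
    api_total: Callable[[], int],
    groups_sum: Callable[[], int],
    attempts: int = 3,
    wait_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[bool, tuple[int, int, int]]:
    """Re-read all three sources until they agree or attempts run out."""
    last = (0, 0, 0)
    for attempt in range(attempts):
        last = (pg_open(), api_total(), groups_sum())
        if len(set(last)) == 1:           # all three values are equal
            return True, last
        if attempt < attempts - 1:
            sleep(wait_seconds)           # give the lagging source time to catch up
    return False, last


# Usage with simulated readings: the API total lags once, then catches up.
api_readings = iter([1, 2])
ok, values = three_way_consistent(
    pg_open=lambda: 2,
    api_total=lambda: next(api_readings),
    groups_sum=lambda: 2,
    sleep=lambda _seconds: None,          # skip real waiting in the example
)
```

A result of `False` after all attempts would indicate a persistent divergence rather than the momentary one recorded in W1 and W2.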

MEDIUM

  • O1 cannot currently exercise true orphan-no-code paths on most tested tables because auto-code or hard constraints intercept the write before scanner can prove coverage.
  • N1 improved from S167D, but detection is not stable across rounds.

LOW

  • S2 runner is usable only after sourcing NVM; bare non-login shell still says node: command not found.
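Both the S2 finding here and the C2 cron failure point at environment setup: node is missing from the bare shell's PATH, and cron.log reports DATABASE_URL as unbound. A preflight check run at the start of the cron job could surface these before the runner starts. The sketch below is an assumption-laden illustration: the `preflight` function and the `REQUIRED_ENV` list are hypothetical, and checking that DIRECTUS_TOKEN is present says nothing about whether it has the right permissions.

```python
import os
import shutil
from typing import Mapping, Optional

# Variables named in the C2 evidence; treat this list as an assumption.
REQUIRED_ENV = ("DATABASE_URL", "DIRECTUS_TOKEN")


def preflight(env: Optional[Mapping[str, str]] = None, path: Optional[str] = None) -> list[str]:
    """Return a list of human-readable problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    problems = []
    if shutil.which("node", path=path) is None:   # path=None falls back to the process PATH
        problems.append("node not found on PATH")
    for name in REQUIRED_ENV:
        if not env.get(name):
            problems.append(f"{name} is unset or empty")
    return problems


# Usage: simulate a bare cron environment with nothing configured.
problems = preflight(env={}, path="/nonexistent-dir-for-example")
```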

Cleanup

Chaos cleanup result

All chaos rows were removed, including concurrent GEM-CHAOS-* data found on production during the mission.

Final zero-check (row counts matching the %CHAOS% / %FAKE% markers):

  • meta_catalog=0
  • v_registry_counts=0
  • dot_tools=0
  • taxonomy=0
  • checkpoint_types=0
  • trigger_registry=0
  • entity_dependencies=0
  • universal_edges=0
  • system_issues=0
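A zero-check of this kind can be expressed as a per-table count of rows carrying a chaos marker. The sketch below works on in-memory rows and matches the CHAOS and FAKE substrings in any string column, mirroring the %CHAOS% / %FAKE% patterns; the function name and the row dictionaries are made up for the example and do not reflect how the audit was actually run.

```python
from typing import Iterable, Mapping

MARKERS = ("CHAOS", "FAKE")  # substrings equivalent to the %CHAOS% / %FAKE% patterns


def chaos_residue(tables: Mapping[str, Iterable[Mapping[str, object]]]) -> dict[str, int]:
    """Return table name -> number of rows with a marker in any string value."""

    def is_chaos(row: Mapping[str, object]) -> bool:
        return any(
            isinstance(value, str) and any(marker in value for marker in MARKERS)
            for value in row.values()
        )

    return {name: sum(1 for row in rows if is_chaos(row)) for name, rows in tables.items()}


# Usage with invented rows: one leftover dependency, one clean table.
residue = chaos_residue({
    "entity_dependencies": [
        {"code": "CHAOS-TEST-O4", "source": "FAKE-SRC-999"},
        {"code": "DEP-0001", "source": "DOT-001"},
    ],
    "dot_tools": [{"code": "DOT-001"}],
})
clean = all(count == 0 for count in residue.values())
```

Cleanup is complete only when every table reports zero, which is the condition the list above records.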

AFTER baseline

| Metric | BEFORE | AFTER | Note |
| --- | --- | --- | --- |
| trigger_count | 141 | 141 | restored |
| v_registry_counts rows | 23 | 23 | restored |
| open system_issues | 1 | 2 | new watchdog issue ISS-1647 opened at 2026-03-26 05:50:12Z during mission |
| CAT-ALL.record_count | 18962 | 19047 | live production traffic increased total during mission |
| total dot_tools | 112 | 112 | restored |
| total entity_dependencies | 141 | 141 | restored |
| total universal_edges | 2040 | 2040 | restored |

Important evidence that AFTER drift is external, not cleanup residue:

  • Final chaos-row audit was 0 across every tested table.
  • birth_registry ended at 17990, and CAT-023.record_count also ended at 17990.
  • The increase in CAT-ALL from 18962 -> 19047 is exactly +85, matching live birth_registry growth during the mission window.
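The BEFORE/AFTER comparison reduces to a dictionary diff. The sketch below restates the baseline figures from this report and returns only the metrics that changed; the helper name is hypothetical.

```python
BEFORE = {
    "trigger_count": 141,
    "v_registry_counts rows": 23,
    "open system_issues": 1,
    "CAT-ALL.record_count": 18962,
    "total dot_tools": 112,
    "total entity_dependencies": 141,
    "total universal_edges": 2040,
}
AFTER = {**BEFORE, "open system_issues": 2, "CAT-ALL.record_count": 19047}


def baseline_delta(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Return metric -> (after - before) for metrics whose value changed."""
    return {name: after[name] - before[name] for name in before if after[name] != before[name]}


delta = baseline_delta(BEFORE, AFTER)
```

Only two metrics move: the open issue count (the new watchdog issue) and CAT-ALL.record_count (+85), which is what the evidence above attributes to live traffic rather than to cleanup residue.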

Final Conclusion

  1. Detection rate: 16/31 overall, 11/21 on the original S167D scenarios.
  2. Production readiness for Điều 31+: NO.
  3. Primary blockers:
  • Scanner coverage gaps on _dot_origin, broken deps/edges, and circular dependencies.
  • Count integrity for dot_tools is still unreliable and misattributed under duplicate meta_catalog mappings.
  • Runner/scanner automation is not healthy enough to trust unattended operation.
  • API consistency and knowledge/vector parity remain unstable or unprovable.