S167F — Chaos Re-Test Round 2 (Expanded)

Date: 2026-03-26
Agent: Codex CLI (codex-webtest)
Repo: web-test
Mode: Production chaos test, no code changes

Mandatory Rule Check

search_knowledge("operating rules SSOT") completed.
.claude/skills/incomex-rules.md read.
S167D chaos test report read.
S167E chaos hardening report read.
§0-AK quote used for scoring (translated): "TD-379 Birth Gate: DECISION = DO NOT BLOCK. Keep WARN. The Điều 31 scanner MUST detect _dot_origin NULL and report it."

BEFORE Baseline

PG

Metric	BEFORE
trigger_count (`trg_%` + `fn_guard_%`)	141
v_registry_counts rows	23
open system_issues	1
`CAT-ALL.record_count`	18962
total `dot_tools`	112
total `entity_dependencies`	141
total `universal_edges`	2040

Nuxt / Agent Data

system-issues.totals: {"all":1,"critical":1,"warning":0,"info":0,"group_count":1}
/api/health.data_integrity: document_count=582, vector_point_count=851, ratio=1.46, sync_status=ok
/api/health.event_system: enabled=true, listeners=1, events_logged=1330

Pre-clean for stale chaos

S167F found 3 leftover entity_dependencies rows from S167D even though S167E reported clean:

CHAOS-TEST-O4
CHAOS-TEST-E3a
CHAOS-TEST-E3b

They were deleted before new injections, as required by mission rule #8.

Phase 1 — 21 Original Scenarios

#	Test	S167D	S167F	Delta	Evidence	Root cause if FAIL
P1	Phantom meta_catalog	PASS	PASS	same	Inserted `CHAOS-TEST-P1`; scanner matched `CHAOS-TEST-P1 Phantom Test — Lớp 2 thiếu cấu hình bảng registry` ("Layer 2 is missing registry table configuration")
P2	Phantom system_issue giả	FAIL	PASS	fixed	PG error: `system_issues INSERT requires source_system or source field. Anonymous inserts blocked.`
P3	Phantom v_registry_counts	FAIL	PASS	fixed	PG error: `Direct modification of v_registry_counts is blocked.`
O1	Orphan no code (x6)	FAIL	FAIL	same	`dot_tools→DOT-235`, `taxonomy→LBL-508`, `checkpoint_types→CP-033`, `trigger_registry→TRG-087`, `entity_dependencies→DEP-0336`; `comments` failed NOT NULL; scanner grep on all O1 markers returned no match	5/6 tables auto-generated code, `comments` still hard-blocked by NOT NULL, so true orphan-no-code cases never reached scanner
O2	Orphan no `_dot_origin`	FAIL	FAIL	same	Inserted `CHAOS-TEST-O2` with `_dot_origin=NULL`; scanner grep for `CHAOS-TEST-O2	DOT origin
O3	meta_catalog NULL registry	PASS	PASS	same	Inserted `CHAOS-TEST-O3`; scanner matched `Missing Registry — Lớp 2 thiếu cấu hình bảng registry` ("Layer 2 is missing registry table configuration")
O4	Broken dependency	FAIL	FAIL	same	Inserted `CHAOS-TEST-O4` -> `FAKE-SRC-999 -> FAKE-TGT-999`; scanner grep returned no match	No broken-reference detection for `entity_dependencies`
O5	Broken edge	FAIL	FAIL	same	Inserted `FAKE-EDGE-SRC -> FAKE-EDGE-TGT`; scanner grep returned no match	No broken-reference detection for `universal_edges`
N1	Disable trigger + insert	FAIL	PASS*	improved but unstable	After hidden insert: scanner emitted `GEM-CHAOS-P1 ... meta_catalog nói 116 nhưng thực tế 117` ("meta_catalog says 116 but actual is 117")	Detection happened, but the mismatch was attributed to a duplicate `meta_catalog` row (`GEM-CHAOS-P1`) instead of `CAT-006`
N2	CAT-ALL vs sum	FAIL	FAIL	same	`CAT-ALL=19127`, `atom_sum=18709` during Phase 1	Query invariant is not stable: `CAT-ALL` covers all managed rows, not only atom rows, and live production writes changed totals during mission
N3	v_reg vs meta cross	FAIL	FAIL	same	Mismatches observed: `CAT-006 113 vs 118`, `CAT-019 108 vs 107`, `CAT-023 17941 vs 17635`, plus chaos rows while injected	`v_registry_counts` and `meta_catalog.record_count` drift independently; duplicate `registry_collection` mappings amplify misattribution
L1	Lifecycle open→resolved	FAIL	PASS	fixed	PG `1→2→1`; API totals updated after ~30s in both directions (`T30=2`, `RT30=1`)
L2	Mass corruption rollback	PASS	PASS	same	500 issue inserts inside transaction gave `after_insert_open=501`; `ROLLBACK` restored to `1`, `chaos_left=0`
W1	3-way PG/Nuxt check	PASS	FAIL (transient)	regressed	At one check: PG open=`2`, API totals.all=`1`	Eventual-consistency/race during concurrent issue lifecycle made PG and Nuxt diverge momentarily
W2	API consistency	PASS	FAIL (transient)	regressed	Same moment: totals.all=`1`, groups-sum=`2`, PG=`2`	`system-issues` totals lagged `system-issues-groups` and PG truth
E1	NULL / empty code	PASS	PASS	same	`NULL` normalized to `DOT-238`; empty string normalized to `DOT-239`; blank-code count remained `0`
E2	Mass insert 50	PASS	FAIL	regressed	`dot_tools actual 121→171`, but `CAT-006.record_count` stayed `113`; scanner only changed description percentage and later misattributed count mismatch via `GEM-CHAOS-P1`	Count update path for `dot_tools` is broken/stale; duplicate `meta_catalog` for `dot_tools` distorts detection
E3	Circular dependency	FAIL	FAIL	same	Inserted `CHAOS-TEST-E3a/E3b` (`DOT-001↔DOT-002`); scanner grep returned no match	No cycle detection on `entity_dependencies`
S1	Watchdog alive	PASS	PASS	same	`ISS-0752` open at start
S2	Runner available	FAIL	PASS	fixed enough	Bare shell: `node: command not found`; with `source ~/.nvm/nvm.sh`, `node /opt/incomex/deploys/web-test/scripts/integrity/main.js --dry-run` ran and produced stdout (`PASS: 37	FAIL: 0
S3	Auto-resolve after cleanup	PASS	PASS	same	After cleanup, scanner grep for `CHAOS-TEST	GEM-CHAOS
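O4 and O5 fail for the same reason: nothing checks that both endpoints of a dependency or edge row exist. The check can be sketched as below, assuming rows can be loaded as (code, source, target) tuples and that the set of known entity codes is available. `Ref`, `find_broken_refs`, `DEP-OK` and the sample data are hypothetical and are not part of the scanner.

```python
from typing import Iterable, NamedTuple


class Ref(NamedTuple):
    code: str    # row identifier, e.g. a dependency code
    source: str  # code of the entity the row points from
    target: str  # code of the entity the row points to


def find_broken_refs(rows: Iterable[Ref], known_codes: set[str]) -> list[tuple[str, str]]:
    """Return (row code, missing endpoint) pairs for rows whose endpoints do not exist."""
    broken = []
    for row in rows:
        for endpoint in (row.source, row.target):
            if endpoint not in known_codes:
                broken.append((row.code, endpoint))
    return broken


# Illustrative data only: DOT-001 and DOT-002 stand in for real entity codes.
known = {"DOT-001", "DOT-002"}
rows = [
    Ref("DEP-OK", "DOT-001", "DOT-002"),
    Ref("CHAOS-TEST-O4", "FAKE-SRC-999", "FAKE-TGT-999"),
]
print(find_broken_refs(rows, known))
```

With this sample input, only the `CHAOS-TEST-O4` row is reported, once per missing endpoint. The same function would apply to `universal_edges` rows if they expose a source and target in the same way, which is an assumption here.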

Phase 2 — Auto-System Liveness (10 scenarios)

#	Test	Result	Evidence	Implication
V1	Vector/Document count parity	FAIL	`/api/health`: `document_count=582`, `vector_point_count=851`, `ratio=1.46`	Exposed parity metric is far outside 5% threshold
V2	Vector sync after CRUD	PASS	Uploaded test doc with unique token; `search_knowledge` found it after 30s; after delete + 30s, search no longer found it	Vector create/delete propagation works via Agent Data API path
V3	Orphan vector detection	FAIL	`/api/openapi.json` exposed no live orphan-vector endpoint; parity gap remained `582 vs 851`	`orphan = 0` cannot be proven from live telemetry
A1	Event system alive	PASS	`/api/health.event_system`: `enabled=true`, `listeners=1`	Event loop is alive
A2	Directus sync active	FAIL	Agent Data `list_documents(path="knowledge/") -> count=361`; Directus `knowledge_documents -> 370`	Sync drift of 9 docs between Directus and Agent Data
A3	Event system kill test (read-only)	PASS	`docker inspect incomex-agent-data -> StartedAt=2026-03-23T14:53:04Z`, `restart_policy=unless-stopped`	Restart policy should bring the container back after a crash, so liveness does not depend on a single start
C1	Integrity runner cron active	PASS	Crontab lines exist: `cron-integrity.sh` daily, `watchdog-monitor.sh` hourly	Automation is scheduled
C2	Runner last execution	FAIL	Only full run artifact was `cron-20260323-200011.log`; current `cron.log` ended with `DATABASE_URL: unbound variable` and `DIRECTUS_TOKEN ... PERMISSION`	Cron exists, but successful runner execution is stale and current schedule path is broken
C3	Scanner last execution	FAIL	No dedicated scanner cron entry and no recent scanner log artifact found	Scanner automation is not independently scheduled/observable
C4	Watchdog heartbeat freshness	PASS	Initial watchdog `last_seen_at=2026-03-25 15:35:25.611+00`; later a new watchdog row `ISS-1647` was opened at `2026-03-26 05:50:12Z`	Heartbeat stayed <24h throughout mission
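The V1 verdict follows from simple arithmetic on the `/api/health` figures. A minimal sketch, assuming the 5% threshold means the vector/document ratio must stay within 0.05 of 1.0 (the report does not define the threshold precisely, and `parity_ok` is a hypothetical helper):

```python
def parity_ok(document_count: int, vector_point_count: int, tolerance: float = 0.05) -> bool:
    """True when the vector/document ratio is within `tolerance` of 1.0."""
    if document_count == 0:
        # No documents: parity holds only if there are no vectors either.
        return vector_point_count == 0
    ratio = vector_point_count / document_count
    return abs(ratio - 1.0) <= tolerance


# Figures reported by /api/health.data_integrity during the mission.
ratio = 851 / 582
print(round(ratio, 2), parity_ok(582, 851))  # 1.46 False
```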

Phase 3 — Multi-Round Consistency

Round	Scenarios re-tested	All PASS?	Anomalies
1	Full Phase 1 + Phase 2	NO	Major misses remained on `_dot_origin`, broken deps/edges, circular deps, API consistency, vector parity, automation freshness
2	P2, P3, N1, V2, W1, W2	NO	`P2/P3` blocked again; `V2` passed again; `W1/W2` stabilized (`PG=2`, `API totals=2`, groups-sum=`2`); `N1` did not stably detect after cleanup and only changed description counts
3	After 5-minute wait: V1, W1, W2, A1, C4, baseline	NO	`V1` still failed (`582/851, ratio 1.46`); `W1/W2` stayed stable (`PG open=2`, `API all=2`, groups-sum=2); `A1` stayed alive; `C4` stayed fresh; `CAT-ALL` continued increasing due to live production traffic

Detection Rates

Phase 1: 11/21
Phase 2: 5/10
Total: 16/31
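These counts can be re-derived from the S167F result columns of the two tables. The sketch below transcribes those columns in row order and assumes that N1's `PASS*` counts as a pass and that `FAIL (transient)` counts as a fail, which is what the reported 11/21 implies. `tally` is a hypothetical helper.

```python
# S167F result columns transcribed from the Phase 1 and Phase 2 tables, in row order.
phase1 = (
    "PASS PASS PASS FAIL FAIL PASS FAIL FAIL PASS* FAIL FAIL "
    "PASS PASS FAIL FAIL PASS FAIL FAIL PASS PASS PASS"
).split()
phase2 = "FAIL PASS FAIL PASS FAIL PASS PASS FAIL FAIL PASS".split()


def tally(results):
    """Return (passed, total); any result starting with PASS counts as passed."""
    passed = sum(1 for r in results if r.startswith("PASS"))
    return passed, len(results)


for name, results in (("Phase 1", phase1), ("Phase 2", phase2), ("Total", phase1 + phase2)):
    passed, total = tally(results)
    print(f"{name}: {passed}/{total} ({passed / total:.1%})")
```

This prints 11/21 (52.4%), 5/10 (50.0%) and 16/31 (51.6%).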

Comparison:

S167D: 2/21
S167F: 11/21 on Phase 1, but still not production-ready

New / Confirmed Bugs

CRITICAL

Scanner still misses _dot_origin NULL on entity tables (O2) despite §0-AK requiring detection.
Broken entity_dependencies and universal_edges remain invisible to Điều 31 (O4, O5).
Circular dependencies remain invisible (E3).
dot_tools count integrity is unreliable; mass inserts can leave meta_catalog.record_count stale (E2).
Duplicate meta_catalog.registry_collection='dot_tools' rows (CAT-006, GEM-CHAOS-P1) plus refresh_registry_count() using LIMIT 1 cause misattributed or missed count alerts.
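For the E3 gap, a cycle check over dependency pairs can be sketched with a depth-first search. This is an illustration under the assumption that `entity_dependencies` rows reduce to (source, target) pairs. `find_cycle` is a hypothetical name, not an existing scanner function.

```python
from collections import defaultdict


def find_cycle(edges):
    """Return one cycle as a list of nodes (first == last), or None if acyclic."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, fully explored
    color = defaultdict(int)
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for nxt in graph[node]:
            if color[nxt] == GREY:
                # Back edge to a node on the current path: a cycle is closed.
                return stack[stack.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(graph):  # snapshot: visit() may add keys to the defaultdict
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# The E3 injection: DOT-001 and DOT-002 depend on each other.
print(find_cycle([("DOT-001", "DOT-002"), ("DOT-002", "DOT-001")]))
```

The recursive form is adequate for a table of this size (141 rows at baseline). A much larger graph would need an iterative traversal to stay within Python's recursion limit.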

HIGH

API totals can disagree with both PG and groups endpoint (W1, W2).
Runner cron exists but successful execution is stale; current schedule path is breaking on env/token setup (C2).
No separate scanner automation evidence (C3).
Agent Data vs Directus knowledge-doc sync drift remains (A2).
Vector parity metric is far outside threshold and orphan=0 cannot be proven (V1, V3).
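Because L1 showed API totals catching up after roughly 30 seconds, a W1/W2 check can separate transient lag from real disagreement by polling before it declares failure. A sketch under that assumption follows. `wait_for_agreement` and the reader callables are hypothetical stand-ins for the PG query and the two API calls.

```python
import time
from typing import Callable


def wait_for_agreement(
    readers: dict[str, Callable[[], int]],
    timeout_s: float = 60.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[bool, dict[str, int]]:
    """Poll all readers until they report the same count or the timeout elapses."""
    waited = 0.0
    while True:
        snapshot = {name: read() for name, read in readers.items()}
        if len(set(snapshot.values())) == 1:
            return True, snapshot
        if waited >= timeout_s:
            return False, snapshot
        sleep(interval_s)
        waited += interval_s


# Simulated readers: API totals lag for two polls, then catch up with PG and groups.
api_totals = iter([1, 1, 2])
ok, snapshot = wait_for_agreement(
    {
        "pg_open": lambda: 2,
        "api_totals_all": lambda: next(api_totals),
        "groups_sum": lambda: 2,
    },
    sleep=lambda _seconds: None,  # skip real waiting in this illustration
)
print(ok, snapshot)
```

A result of `False` after the timeout would point to a persistent inconsistency rather than the momentary divergence seen in Round 1.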

MEDIUM

O1 cannot currently exercise true orphan-no-code paths on most tested tables because auto-code or hard constraints intercept the write before scanner can prove coverage.
N1 improved from S167D, but detection is not stable across rounds.

LOW

S2 runner is usable only after sourcing NVM; bare non-login shell still says node: command not found.

Cleanup

Chaos cleanup result

All chaos rows were removed, including concurrent GEM-CHAOS-* data found on production during the mission.

Final zero-check:

meta_catalog=0
v_registry_counts=0
dot_tools=0
taxonomy=0
checkpoint_types=0
trigger_registry=0
entity_dependencies=0
universal_edges=0
system_issues=0 for all %CHAOS% / %FAKE% markers.

AFTER baseline

Metric	BEFORE	AFTER	Note
trigger_count	141	141	restored
v_registry_counts rows	23	23	restored
open system_issues	1	2	new watchdog issue `ISS-1647` opened at `2026-03-26 05:50:12Z` during mission
`CAT-ALL.record_count`	18962	19047	live production traffic increased total during mission
total `dot_tools`	112	112	restored
total `entity_dependencies`	141	141	restored
total `universal_edges`	2040	2040	restored

Important evidence that AFTER drift is external, not cleanup residue:

Final chaos-row audit was 0 across every tested table.
birth_registry ended at 17990, and CAT-023.record_count also ended at 17990.
The increase in CAT-ALL from 18962 -> 19047 is exactly +85, matching live birth_registry growth during the mission window.
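The drift can be checked directly from the BEFORE and AFTER columns. This short sketch uses only the figures from the table above:

```python
before = {
    "trigger_count": 141,
    "v_registry_counts rows": 23,
    "open system_issues": 1,
    "CAT-ALL.record_count": 18962,
    "dot_tools": 112,
    "entity_dependencies": 141,
    "universal_edges": 2040,
}
after = {
    "trigger_count": 141,
    "v_registry_counts rows": 23,
    "open system_issues": 2,
    "CAT-ALL.record_count": 19047,
    "dot_tools": 112,
    "entity_dependencies": 141,
    "universal_edges": 2040,
}

# Keep only the metrics that changed, with their signed delta.
drift = {name: after[name] - before[name] for name in before if after[name] != before[name]}
print(drift)  # {'open system_issues': 1, 'CAT-ALL.record_count': 85}
```

Only the two metrics with an external explanation changed. Every table touched by chaos injections returned to its baseline value.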

Final Conclusion

Detection rate: 16/31 overall, 11/21 on the original S167D scenarios.
Production readiness for Điều 31+: NO.
Primary blockers:

Scanner coverage gaps on _dot_origin, broken deps/edges, and circular dependencies.
Count integrity for dot_tools is still unreliable and misattributed under duplicate meta_catalog mappings.
Runner/scanner automation is not healthy enough to trust unattended operation.
API consistency and knowledge/vector parity remain unstable or unprovable.