KB-3CB9

S167H Codex Chaos + Automation Audit Report


S167H — Codex Chaos Test + Automation Gap Audit

  • Date: 2026-03-26
  • Agent: Codex CLI (codex-webtest)
  • Repo: web-test
  • Mode: Production chaos test + automation audit, no code changes
  • Codex prefix: CHAOS-R3-CDX-

Rule Check

  • search_knowledge("operating rules SSOT") completed.
  • .claude/skills/incomex-rules.md read.
  • Read: handoff S140, the S167F chaos retest report, the S167H-FIX data quality report, the S167G scanner hardening report, and automation status.
  • Operational rules used for scoring:
    • §0-AM: entity-table bad records may enter; scanner must detect.
    • §0-AN: infrastructure-table corruption must be guard-blocked.
    • §0-AP: fix root = investigate → guard → fix → verify.
    • §0-AQ: dual-agent cleanup must delete only own prefix.

BEFORE Baseline

PG

| Metric | BEFORE |
| --- | --- |
| trigger_count | 141 |
| v_registry_counts rows | 23 |
| open system_issues | 1 |
| CAT-ALL.record_count | 19129 |
| total dot_tools | 112 |
| total entity_dependencies | 141 |
| total universal_edges | 2039 |

Production URLs

  • /api/registry/system-issues: {"all":1,"critical":1,"warning":0,"info":0,"group_count":1}
  • /api/health.data_integrity: document_count=591, vector_point_count=871, ratio=1.47, sync_status=ok
  • /api/health.event_system: enabled=true, listeners=1, events_logged=1432
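The reported ratio can be reproduced from the two counts in the health payload. The sketch below assumes the ratio is vector_point_count divided by document_count (which matches 871 / 591 ≈ 1.47) and evaluates it against the two thresholds recorded elsewhere in this report: the production A6 check (pass when ratio <= 2.0) and the sync-check.yml healthy range (2.5 to 4.5). The function names and the inclusive range bounds are assumptions made for illustration, not code from the repo.

```python
def vector_ratio(document_count: int, vector_point_count: int) -> float:
    """Vector points per document, rounded to two decimals like the health payload."""
    return round(vector_point_count / document_count, 2)


def passes_a6(ratio: float, max_ratio: float = 2.0) -> bool:
    """Production scanner rule as described in this report: pass when ratio <= 2.0."""
    return ratio <= max_ratio


def healthy_in_ci(ratio: float, low: float = 2.5, high: float = 4.5) -> bool:
    """CI rule as described in this report; inclusive bounds are an assumption."""
    return low <= ratio <= high


ratio = vector_ratio(document_count=591, vector_point_count=871)
print(ratio, passes_a6(ratio), healthy_in_ci(ratio))  # 1.47 True False
```

Under these assumptions the same production snapshot passes the scanner check while falling outside the CI range, which is the disagreement the threshold-inconsistency gap describes.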

Chaos Score Card

| # | Test | S167F | S167H | Delta | Evidence |
| --- | --- | --- | --- | --- | --- |
| P1 | Phantom meta_catalog | PASS | PASS | same | Inserted CHAOS-R3-CDX-P1; DOT audit grep matched CHAOS-R3-CDX-P1 ... Lớp 2 thiếu cấu hình bảng registry ("Layer 2 is missing registry table configuration"). |
| P2 | Phantom system_issue | PASS | PASS | same | Direct insert blocked twice; round-2 retest error: system_issues INSERT requires source_system or source field. |
| P3 | Phantom v_registry_counts | PASS | PASS | same | Direct insert blocked twice; round-2 retest error: Direct modification of v_registry_counts is blocked. |
| O1 | Orphan no code (x6) | FAIL | FAIL | same | taxonomy -> LBL-509, trigger_registry -> TRG-088; dot_tools, checkpoint_types, entity_dependencies, task_comments rejected because default _dot_origin='DOT:UNKNOWN' fails validator, not because of code policy. |
| O2 | Orphan no _dot_origin | FAIL | PASS | improved | Inserted CHAOS-R3-CDX-O2; measurement_log run s167h-cdx-o2 shows `MSR-D31-A1 |
| O3 | meta_catalog NULL registry | PASS | FAIL | regressed | Inserted CHAOS-R3-CDX-O3; scanner grep for CHAOS-R3-CDX-O3 returned no match; row still auto-added into v_registry_counts. |
| O4 | Broken dependency | FAIL | PASS | improved | Inserted CHAOS-R3-CDX-O4; measurement_log run s167h-cdx-o4 shows `MSR-D31-A2 |
| O5 | Broken edge | FAIL | PASS | improved | Inserted valid universal_edges row with fake codes; measurement_log run s167h-cdx-o5c shows `MSR-D31-A3 |
| N1 | Disable trigger + hidden insert | PASS* | FAIL | regressed | CAT-006.record_count 165, live COUNT(dot_tools)=166; runner run s167h-cdx-n1 did not detect a new count fault; D26 checks are still method=1. |
| N2 | CAT-ALL vs sum | FAIL | FAIL | same | After cleanup: CAT-ALL=20640, atom_sum=20226. |
| N3 | v_reg vs meta cross | FAIL | FAIL | same | After cleanup, mismatch still exists: `CAT-023 |
| L1 | Lifecycle open→resolved | PASS | PASS | same | Inserted CHAOS-R3-CDX-L1 with source_system; PG open 4→5→4; API stayed 4, then 5 at 09:05:57Z, then back to 4 at 09:08:11Z. |
| L2 | Mass corruption rollback | PASS | PASS | same | Transaction evidence: BEGIN, open 4, after insert 504, ROLLBACK, restored 4, residue 0. |
| W1 | 3-way PG/Nuxt check | FAIL | PASS | improved | Stable round-3 snapshot: PG open 4, API totals 4, groups total 4. |
| W2 | API consistency | FAIL | PASS | improved | Round-3 /system-issues and /system-issues-groups both report all=4; groups sum=4. |
| E1 | NULL / empty code | PASS | FAIL | regressed | INSERT dot_tools(code=NULL) and code='' both failed before normalization with DOT origin rejected ... got: DOT:UNKNOWN. |
| E2 | Mass insert 50 | FAIL | PASS | improved | Inserted 50 dot_tools; CAT-006 165→215, live count 165→215, inserted row count 50; cleanup restored. |
| E3 | Circular dependency | FAIL | PASS | improved | Inserted CHAOS-R3-CDX-E3A/B; measurement_log run s167h-cdx-e3 shows `MSR-D31-A4 |
| S1 | Watchdog alive | PASS | PASS | same | Open watchdog issue exists: ISS-1647; round-3 freshness 5.1 minutes stale. |
| S2 | Runner available | PASS | FAIL | regressed | Raw mission command runs Node but falls back to legacy dry-run: Token NOT SET, DB NOT SET, PG connection failed, falling back to legacy. |
| S3 | Auto-resolve after cleanup | PASS | PASS | same | Post-cleanup runner s167h-cdx-postcleanup returned A1=0; only Gemini baseline anomalies remained (A2=3, A3=1, A4=2); Codex prefix residue verified 0 across all tables. |
| V1 | Vector/Document parity | FAIL | PASS | improved | /api/health.data_integrity.ratio=1.47; MSR-D31-A6 passes with threshold <=2.0. |
| V2 | Vector sync CRUD | PASS | PASS | same | upload_document rev 1; search_knowledge found CHAOS-R3-CDX-V2-UNIQUE-SYNC-TOKEN; delete_document rev 2; follow-up search no longer returned the deleted document in context. |
| V3 | Orphan vector detection | FAIL | PASS | improved | Production /opt/incomex/dot/bin/dot-vector-audit --cloud reported Status: needs_cleanup and Ghost documents (25). |
| A1 | Event system alive | PASS | PASS | same | /api/health.event_system: enabled=true, listeners=1. |
| A2 | Directus sync active | FAIL | PASS | improved* | Post-cleanup runner MSR-D31-A5 passed: Directus published 377, Agent Data document_count=591. Note: this is one-way logic only. |
| A3 | Container restart policy | PASS | PASS | same | docker inspect incomex-agent-data -> unless-stopped; started 2026-03-23T14:53:04Z. |
| C1 | Runner cron active | PASS | PASS | same | Crontab contains 0 */6 * * * /opt/incomex/deploys/web-test/scripts/integrity/cron-integrity.sh. |
| C2 | Runner last execution | FAIL | PASS | improved | Latest runner artifacts: cron-20260326-064834.log and cron-20260326-065007.log, both exit 0; latest file timestamp same day. |
| C3 | Scanner cron independent | FAIL | FAIL | same | Crontab has one integrity cron plus watchdog only; no separate scanner/vector audit cron entry. |
| C4 | Watchdog heartbeat freshness | PASS | PASS | same | Round-3 query: `ISS-1647 |
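The L2 evidence (BEGIN, 4 open, 504 after insert, ROLLBACK, 4 restored, residue 0) follows a before/during/after/residue pattern. A minimal sketch of that pattern is shown below, using Python's sqlite3 with an in-memory database as a stand-in for production PostgreSQL. The table layout, the CHAOS-DEMO- prefix, and the function names are hypothetical; the real test ran against the live system_issues table with its own guards.

```python
import sqlite3

PREFIX = "CHAOS-DEMO-"  # hypothetical prefix used only in this sketch


def open_count(conn: sqlite3.Connection) -> int:
    """Number of open issues currently visible on this connection."""
    return conn.execute(
        "SELECT COUNT(*) FROM system_issues WHERE status = 'open'"
    ).fetchone()[0]


def mass_insert_then_rollback(conn: sqlite3.Connection, n: int):
    """Insert n rows inside one transaction, roll back, and report the counts."""
    before = open_count(conn)
    conn.execute("BEGIN")
    try:
        conn.executemany(
            "INSERT INTO system_issues (code, status) VALUES (?, 'open')",
            [(f"{PREFIX}{i:03d}",) for i in range(n)],
        )
        during = open_count(conn)
    finally:
        conn.execute("ROLLBACK")  # always undo, even if the insert fails
    after = open_count(conn)
    residue = conn.execute(
        "SELECT COUNT(*) FROM system_issues WHERE code LIKE ?", (PREFIX + "%",)
    ).fetchone()[0]
    return before, during, after, residue


# isolation_level=None puts sqlite3 in autocommit mode so BEGIN/ROLLBACK are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE system_issues (code TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO system_issues VALUES (?, 'open')", [(f"ISS-{i}",) for i in range(4)]
)
result = mass_insert_then_rollback(conn, 500)
print(result)  # (4, 504, 4, 0)
```

The test passes only when the third value equals the first and the residue is zero, which is the condition the L2 row reports.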

Detection Rates

  • Phase 1: 14/21
  • Phase 2: 9/10
  • Total: 23/31

Comparison:

  • S167F: 16/31
  • S167H Codex: 23/31
  • Net change: +7
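The totals above can be re-derived from the score card. The sketch below transcribes the S167F and S167H columns and tallies them; it assumes that Phase 1 is the 21 scenarios in groups P, O, N, L, W, E, S and Phase 2 is the 10 scenarios in groups V, A, C (the report does not define the split explicitly, but it matches the stated denominators), and that PASS* counts as a pass.

```python
# One line per scenario: ID, S167F result, S167H result (transcribed from the score card).
SCORECARD = """
P1 PASS PASS
P2 PASS PASS
P3 PASS PASS
O1 FAIL FAIL
O2 FAIL PASS
O3 PASS FAIL
O4 FAIL PASS
O5 FAIL PASS
N1 PASS* FAIL
N2 FAIL FAIL
N3 FAIL FAIL
L1 PASS PASS
L2 PASS PASS
W1 FAIL PASS
W2 FAIL PASS
E1 PASS FAIL
E2 FAIL PASS
E3 FAIL PASS
S1 PASS PASS
S2 PASS FAIL
S3 PASS PASS
V1 FAIL PASS
V2 PASS PASS
V3 FAIL PASS
A1 PASS PASS
A2 FAIL PASS
A3 PASS PASS
C1 PASS PASS
C2 FAIL PASS
C3 FAIL FAIL
C4 PASS PASS
"""

PHASE_2_GROUPS = {"V", "A", "C"}  # assumption: inferred from the 21/10 split
S167F, S167H = 1, 2  # column indexes within each parsed row

rows = [line.split() for line in SCORECARD.strip().splitlines()]


def passed(result: str) -> bool:
    return result.startswith("PASS")  # treats PASS* as a pass


def tally(rows, column, phase=None):
    """Return (passes, total) for one column, optionally limited to a phase."""
    selected = [
        r for r in rows
        if phase is None or (r[0][0] in PHASE_2_GROUPS) == (phase == 2)
    ]
    return sum(passed(r[column]) for r in selected), len(selected)


print(tally(rows, S167H, phase=1))  # (14, 21)
print(tally(rows, S167H, phase=2))  # (9, 10)
print(tally(rows, S167H), tally(rows, S167F))  # (23, 31) (16, 31)
failing = [r[0] for r in rows if not passed(r[S167H])]
print(failing)  # ['O1', 'O3', 'N1', 'N2', 'N3', 'E1', 'S2', 'C3']
```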

Phase 3 — Multi-Round Consistency

| Round | Scenarios re-tested | All PASS? | Anomalies |
| --- | --- | --- | --- |
| 1 | Full Codex run across 31 scenarios | NO | Fails remained in O1, O3, N1, N2, N3, E1, S2, C3. |
| 2 | Post-cleanup runner + guard retest | NO | P2/P3 still block correctly; s167h-cdx-postcleanup returned A1=0 but A2=3, A3=1, A4=2 because Gemini chaos remained live. |
| 3 | After >5 min: V1, W1, W2, A1, C4, AFTER baseline | NO | Stability held: ratio 1.47, API totals 4, groups 4, PG open 4, watchdog fresh. AFTER baseline still drifted from BEFORE due to concurrent Gemini writes and live production traffic. |

Automation Gap Registry

| # | Area | Problem | Severity | Current | Need |
| --- | --- | --- | --- | --- | --- |
| 1 | Article 31 (Điều 31) counting | MSR-D26-* are enabled but still method=1, so PG runner ignores count integrity. | CRITICAL | MSR-D26-001/002/004 -> method=1, runner only loads method=2. | Move D26 checks into method-2 or add a dedicated PG runner path for method-1. |
| 2 | Article 31 (Điều 31) counting | verify_counts() is broken on production. | CRITICAL | SELECT COUNT(*) FROM verify_counts() errors on species_collection_map WHERE code ... column "code" does not exist. | Repair verify_counts() or remove it from active counting doctrine. |
| 3 | Runner automation | Raw mission command is not production-ready without injected env. | HIGH | scripts/integrity/pg-client.js requires DATABASE_URL; raw run falls back to legacy dry-run. | Provide stable env export on VPS or containerized entrypoint. |
| 4 | Cron env wiring | cron-integrity.sh reads PG_USER/PG_PASSWORD/PG_DATABASE, but production postgres container exposes POSTGRES_*. | HIGH | Script lines 30-35 expect PG_*; docker inspect postgres shows POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB. | Align cron env lookup with live container env keys. |
| 5 | Cron docs drift | cron-integrity.sh header says daily 20:00 UTC, but live crontab is 0 */6 * * *. | MEDIUM | Source comments and production schedule disagree. | Update script header/comments to match live schedule. |
| 6 | Scanner independence | No separate scanner/vector-audit cron exists. | HIGH | Crontab shows one integrity cron and one watchdog cron only. | Add an independent scanner/vector audit schedule with separate logs. |
| 7 | Watchdog auth path | watchdog-monitor.sh exits 0 on missing token, so auth failure becomes silent non-alerting skip. | HIGH | Log contains repeated WATCHDOG: No token — skipping; script lines 19-21 return success. | Make missing token a hard failure and page it. |
| 8 | Alerting | dedupe.js only creates/updates system_issues; there is no external notification path for CRITICAL findings. | HIGH | Code writes Directus rows only. | Add Slack/email/webhook alert fan-out for CRITICAL and runner failure. |
| 9 | Sync drift logic | A5 is one-way only: Agent Data >= Directus published docs always passes. | HIGH | compareSyncDrift() returns pass when ad >= dc; duplicates/oversync are invisible. | Add upper-bound / sample-content parity, not only lower-bound deficit detection. |
| 10 | Auto-expansion | A1/A2/A3 are hardcoded table lists, not meta-driven expansion. | MEDIUM | sql/s167g_scanner_hardening.sql enumerates specific collections for _dot_origin, deps, edges. | Generate checks from registry metadata so new collections are covered automatically. |
| 11 | CI regression guard | Critical-file guard protects a fixed file list only. | MEDIUM | .github/workflows/guard_critical_files.yml hardcodes paths and patterns. | Derive coverage from contracts/routes/registries rather than fixed filenames. |
| 12 | Smoke coverage | scripts/smoke-test.sh checks a fixed set of pages/APIs only. | MEDIUM | Static endpoint list; new pages or APIs are invisible until manually added. | Expand smoke generation from routing/contracts. |
| 13 | Threshold inconsistency | sync-check.yml treats vector ratio 2.5-4.5 as healthy while A6 passes <=2.0. | MEDIUM | GitHub Action and production scanner disagree on what healthy means. | Unify vector parity thresholds across CI and production. |
| 14 | Flow liveness | No sync_heartbeats or flow_heartbeats table exists to prove Directus flow/service freshness. | HIGH | to_regclass('sync_heartbeats') -> NULL, to_regclass('flow_heartbeats') -> NULL. | Add heartbeat tables and scanner checks for stale flows/services. |
| 15 | Vector audit automation | Vector orphan visibility exists, but scheduling is local/macOS-centric, not VPS-native. | HIGH | dot-vector-audit-schedule creates a LaunchAgent; VPS crontab has no vector-audit entry. | Add VPS cron/systemd timer for dot-vector-audit --cloud or equivalent API call. |
| 16 | Log rotation | Integrity logs are not included in /etc/logrotate.d/incomex. | MEDIUM | Logrotate covers /var/log/mcp-health.log, reconcile logs, backup logs, but not /opt/incomex/logs/integrity/*.log. | Add integrity log paths to logrotate. |
| 17 | Cleanup operability | Official meta_catalog cleanup path is broken on live schema. | HIGH | Direct delete is guard-blocked; deprecate_entity()/retire_entity() are broken, so cleanup required temporary guard-trigger disable. | Repair governed lifecycle/delete functions so test cleanup does not need trigger bypass. |
| 18 | Watchdog design | Watchdog freshness is inferred from system_issues.last_seen_at, not an independent heartbeat channel. | MEDIUM | Same row acts as alert and heartbeat store. | Split liveness heartbeat from issue state to avoid circular dependence. |
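Gap 1 comes down to a filter: if the runner only selects method=2 measurements, every enabled method=1 check is skipped without any error. The sketch below illustrates that behaviour and the effect of widening the filter. The Measurement type and load_for_runner function are hypothetical and do not mirror scripts/integrity/main.js; treating the MSR-D31-* checks as method=2 is inferred from the fact that the runner executes them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    code: str
    method: int
    enabled: bool = True


def load_for_runner(measurements, methods=(2,)):
    """Select the enabled measurements whose method the runner is willing to load."""
    return [m for m in measurements if m.enabled and m.method in methods]


catalog = [
    Measurement("MSR-D26-001", method=1),
    Measurement("MSR-D26-002", method=1),
    Measurement("MSR-D26-004", method=1),
    Measurement("MSR-D31-A1", method=2),  # assumption: D31 checks are method=2
    Measurement("MSR-D31-A5", method=2),
]

loaded = [m.code for m in load_for_runner(catalog)]
skipped = [m.code for m in catalog if m.code not in loaded]
print(loaded)   # ['MSR-D31-A1', 'MSR-D31-A5']
print(skipped)  # ['MSR-D26-001', 'MSR-D26-002', 'MSR-D26-004']

# Either remediation in the registry amounts to making the D26 checks selectable,
# for example by letting the PG runner load method 1 as well.
widened = [m.code for m in load_for_runner(catalog, methods=(1, 2))]
print(len(widened))  # 5
```

Reporting the skipped list alongside the loaded list would at least make the blind spot visible in runner output, whichever remediation is chosen.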

Production Evidence Behind Key Gaps

  • scripts/integrity/main.js:43-91 loads only method=2 measurements.
  • scripts/integrity/pg-client.js:15-18 hard-fails without DATABASE_URL.
  • scripts/integrity/runners/pg-vs-nuxt-check.js:163-174 makes A5 one-way (ad >= dc => pass).
  • scripts/integrity/cron-integrity.sh:18-41 contains token/DB env assumptions that do not match current production.
  • scripts/integrity/watchdog-monitor.sh:19-21 silently skips when token is absent.
  • sql/s133_measurement_framework.sql:316-330 still defines D26 checks as method=1.
  • sql/s167g_scanner_hardening.sql:13-149 hardcodes A1/A2/A3/A4 scope instead of registry-driven scope.
  • .github/workflows/guard_critical_files.yml:16-30 and :53-62 are fixed-list guards.
  • scripts/smoke-test.sh:87-119 is fixed-endpoint smoke coverage.
  • .github/workflows/sync-check.yml:67-70 uses a vector-ratio health range inconsistent with production A6.

Cleanup

Codex-prefix zero residue

Verified all CHAOS-R3-CDX-* residue is gone:

  • meta_catalog=0
  • v_registry_counts=0
  • system_issues=0
  • dot_tools=0
  • taxonomy=0
  • checkpoint_types=0
  • trigger_registry=0
  • entity_dependencies=0
  • universal_edges=0

Cleanup note

meta_catalog direct delete is guard-blocked on production, and official lifecycle cleanup functions are broken on the live schema. To remove only CHAOS-R3-CDX-P1 and CHAOS-R3-CDX-O3, I disabled only trg_guard_meta_catalog_delete and trg_guard_v_registry_counts inside one transaction, deleted the two Codex rows, and immediately re-enabled both guards.

AFTER Baseline

| Metric | BEFORE | AFTER | Note |
| --- | --- | --- | --- |
| trigger_count | 141 | 141 | restored |
| v_registry_counts rows | 23 | 24 | drift from concurrent Gemini data |
| open system_issues | 1 | 4 | concurrent Gemini + reopened sync faults + watchdog |
| CAT-ALL.record_count | 19129 | 20640 | live production traffic during mission |
| total dot_tools | 112 | 165 | external writes during mission |
| total entity_dependencies | 141 | 144 | exactly matches 3 live Gemini rows |
| total universal_edges | 2039 | 2040 | external live delta |

Concurrent Gemini evidence at finish (rows matching CHAOS-R3-GEM-* / GEM-* markers, shown as table|count):

  • meta_catalog|1
  • v_registry_counts|1
  • system_issues|1
  • dot_tools|1
  • entity_dependencies|3
  • universal_edges|0
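The baseline drift can be split into the part explained by live Gemini marker rows and the remainder attributed to other production writes. The sketch below does that arithmetic for the metrics that appear in both the AFTER baseline and the Gemini evidence. Mapping each metric to a Gemini table count (for example, open system_issues to the system_issues marker count) is an assumption of this sketch.

```python
BEFORE = {
    "v_registry_counts": 23,
    "system_issues": 1,
    "dot_tools": 112,
    "entity_dependencies": 141,
    "universal_edges": 2039,
}
AFTER = {
    "v_registry_counts": 24,
    "system_issues": 4,
    "dot_tools": 165,
    "entity_dependencies": 144,
    "universal_edges": 2040,
}
# Rows carrying Gemini markers at finish, per table.
GEMINI = {
    "v_registry_counts": 1,
    "system_issues": 1,
    "dot_tools": 1,
    "entity_dependencies": 3,
    "universal_edges": 0,
}


def attribute_drift(before, after, gemini):
    """Return {table: (total_delta, delta_not_explained_by_gemini_rows)}."""
    return {
        table: (after[table] - before[table],
                after[table] - before[table] - gemini[table])
        for table in before
    }


drift = attribute_drift(BEFORE, AFTER, GEMINI)
for table, (delta, unexplained) in drift.items():
    print(f"{table}: delta={delta:+d}, unexplained={unexplained:+d}")
# v_registry_counts: delta=+1, unexplained=+0
# system_issues: delta=+3, unexplained=+2
# dot_tools: delta=+53, unexplained=+52
# entity_dependencies: delta=+3, unexplained=+0
# universal_edges: delta=+1, unexplained=+1
```

Under this mapping, v_registry_counts and entity_dependencies are fully accounted for by Gemini rows, while most of the dot_tools growth is not, consistent with the "external writes during mission" note.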

Final Conclusion

  1. Detection rate: 23/31.
  2. Compared to S167F: improved from 16/31 to 23/31.
  3. Primary blockers still open: O1, O3, N1, N2, N3, E1, S2, C3.
  4. Readiness: NO (KHÔNG). Article 31 (Điều 31) is materially stronger than in S167F, but counting integrity, raw runner operability, auto-expansion, and automation liveness still leave blind spots that can fail silently.