KB-7F69
S168 Fix 3 Root Causes + Automation Gaps Report
Date: 2026-03-26 | Agent: Claude CLI (claude-go) | PR: #638 (MERGED) | Post-deploy verified: 2026-03-26T11:06 UTC
Phase A: Counting Integrity
A1: verify_counts() crash fix
- Before: ERROR: column "code" does not exist (species_collection_map)
- Root cause: meta_catalog.code_column = 'code' but table uses 'species_code'
- Fix: Updated code_column to 'species_code' and hardened the function to check that the column exists before querying it
- After: Returns 23 rows, no crash. 1 MISMATCH (CAT-023 birth_registry live drift).
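The hardening idea in A1 can be sketched in Python. This is only an illustration: the real check lives inside the verify_counts() database function, whose source is not part of this report. The helper name `split_by_column_presence`, the dict shapes, and the `id` column are hypothetical.

```python
def split_by_column_presence(catalog, table_columns):
    """Split catalog entries into (usable, skipped).

    catalog: list of dicts with 'table' and 'code_column' keys (assumed shape).
    table_columns: dict mapping table name -> set of real column names.
    """
    usable, skipped = [], []
    for entry in catalog:
        columns = table_columns.get(entry["table"], set())
        if entry["code_column"] in columns:
            usable.append(entry)
        else:
            # Record the bad entry instead of failing with 'column does not exist'.
            skipped.append(entry)
    return usable, skipped


# The A1 misconfiguration: the catalog said 'code', the table has 'species_code'.
# The 'id' column is a hypothetical placeholder.
columns = {"species_collection_map": {"id", "species_code"}}
before = [{"table": "species_collection_map", "code_column": "code"}]
after = [{"table": "species_collection_map", "code_column": "species_code"}]

print(split_by_column_presence(before, columns))  # entry lands in 'skipped'
print(split_by_column_presence(after, columns))   # entry lands in 'usable'
```

With a guard of this kind, a stale code_column produces a skipped entry rather than aborting the whole count run.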
A2: D26 measurements invisible to runner
- Before: MSR-D26-001/002/004 method=1, runner only loads method=2
- Fix: Updated to method=2, target_type='pg_query'
- After: Runner loads 13 measurements (was 10). D26 checks execute.
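A minimal sketch of why the three D26 measurements were invisible, assuming the runner selects rows purely on `method == 2` as stated above. The function name `loadable` and the dict shape are hypothetical; the runner's actual loading code is not shown in this report.

```python
RUNNER_METHOD = 2  # per the report, the runner only loads method=2


def loadable(measurements):
    """Return the measurements the runner would load."""
    return [m for m in measurements if m["method"] == RUNNER_METHOD]


d26_ids = ("MSR-D26-001", "MSR-D26-002", "MSR-D26-004")
d26_before = [{"id": i, "method": 1} for i in d26_ids]
d26_after = [{"id": i, "method": 2, "target_type": "pg_query"} for i in d26_ids]

print(len(loadable(d26_before)))  # 0: filtered out before the fix
print(len(loadable(d26_after)))   # 3: accounts for the 10 -> 13 change
```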
A3: CAT-ALL drift investigation
- CAT-ALL = 19603, atom_sum = 19504, delta = 99
- Root cause: birth_registry (CAT-023) count drifts due to live traffic
- verify_counts() shows: stored=17635, live=18570 (ongoing inserts)
- Not a bug: this is expected behavior on a live system. Narrowing the gap would require refreshing the stored counts more often.
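The two differences quoted in A3 can be reproduced directly from the reported figures. The variable names are illustrative only.

```python
# Figures as reported in A3.
cat_all = 19603
atom_sum = 19504
delta = cat_all - atom_sum

# birth_registry (CAT-023) as reported by verify_counts().
stored = 17635
live = 18570
drift = live - stored

print(f"CAT-ALL delta: {delta}")          # 99
print(f"CAT-023 stored vs live: {drift}")  # 935
```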
Phase B: Runner/Cron Operability
B1: Cron env wiring
- Before: Script read PG_USER/PG_DATABASE, which are not defined in .env, so it fell back to defaults
- Fix: Hardcode directus user/db, read only PG_PASSWORD from .env
- After: DATABASE_URL correctly constructed
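The B1 wiring can be sketched as follows. The actual cron script is a shell script and is not reproduced here; this Python version only illustrates the logic. It assumes "directus user/db" means both the user and the database are named `directus`. The host, port, and helper names are assumptions, not values from the report.

```python
from urllib.parse import quote

PG_USER = "directus"      # hardcoded, per the B1 fix
PG_DATABASE = "directus"  # hardcoded, per the B1 fix
PG_HOST = "localhost"     # assumption: host is not stated in the report
PG_PORT = 5432            # assumption: PostgreSQL default port


def read_env_value(env_text, key):
    """Return the value for key from .env-style text, or None if absent."""
    for line in env_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            return value.strip().strip('"').strip("'")
    return None


def build_database_url(env_text):
    """Build DATABASE_URL using only PG_PASSWORD from the .env contents."""
    password = read_env_value(env_text, "PG_PASSWORD")
    if not password:
        raise ValueError("PG_PASSWORD missing from .env")
    # Percent-encode the password so characters like '@' cannot break the URL.
    encoded = quote(password, safe="")
    return f"postgresql://{PG_USER}:{encoded}@{PG_HOST}:{PG_PORT}/{PG_DATABASE}"


print(build_database_url("PG_PASSWORD=example-secret\n"))
# postgresql://directus:example-secret@localhost:5432/directus
```

Raising on a missing password, rather than substituting a default, keeps the script from silently connecting with the wrong credentials.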
B2: Watchdog hard fail
- Before: Missing token -> exit 0 (silent skip)
- Fix: Missing token -> exit 1 (hard failure with alert message)
- After: Verified: empty token = exit 1
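The change from silent skip to hard failure can be sketched as below. The real watchdog is a shell script; this Python version is illustrative. The variable name `WATCHDOG_TOKEN` and the alert wording are hypothetical, since the report does not name them.

```python
import sys


def watchdog_exit_code(env):
    """Return the exit code the watchdog should use for the given environment."""
    token = env.get("WATCHDOG_TOKEN", "").strip()  # hypothetical variable name
    if not token:
        # Previously this path exited 0, so cron treated the skip as success.
        print("ALERT: watchdog token missing, aborting", file=sys.stderr)
        return 1
    return 0


print(watchdog_exit_code({}))                         # 1
print(watchdog_exit_code({"WATCHDOG_TOKEN": ""}))     # 1
print(watchdog_exit_code({"WATCHDOG_TOKEN": "abc"}))  # 0
```

In a real entry point the result would be passed to `sys.exit(...)` so that cron and any monitoring see the non-zero status.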
B3: Independent scanner cron
- Before: No separate scanner cron
- Fix: Added scanner-counts.sh (verify_counts via docker exec, every 3h)
- After: crontab shows 3 entries: integrity/6h, watchdog/1h, scanner/3h
B4: meta_catalog cleanup path
- Not attempted. Investigation of deprecate_entity()/retire_entity() is deferred to a future session.
Phase C: Validation Order
C1: DOT:UNKNOWN DEFAULT fix
- Before: DEFAULT 'DOT:UNKNOWN' rejected by fn_validate_dot_origin (not pipe-separated)
- Fix: Changed DEFAULT to 'DIRECTUS' (whitelisted prefix, no pipe check)
- After: New records get valid _dot_origin automatically
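The validation behavior described in C1 can be approximated as below. This is a guess at the rule based only on the two observations in this report (whitelisted prefixes skip the pipe check; other values must be pipe-separated). The real fn_validate_dot_origin is a database function and may apply further checks. The whitelist contents and the function name `is_valid_dot_origin` are assumptions.

```python
# Assumption: 'DIRECTUS' is the only whitelisted prefix named in the report.
WHITELISTED_PREFIXES = ("DIRECTUS",)


def is_valid_dot_origin(value):
    """Approximate the C1 rule: whitelisted prefix, else pipe-separated."""
    if value.startswith(WHITELISTED_PREFIXES):
        return True  # whitelisted prefixes bypass the pipe check
    return "|" in value


print(is_valid_dot_origin("DOT:UNKNOWN"))  # False: old DEFAULT was rejected
print(is_valid_dot_origin("DIRECTUS"))     # True: new DEFAULT passes
```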
Post-Deploy Production Evidence
Health
{"status":"healthy","data_integrity":{"document_count":593,"ratio":1.48,"sync_status":"ok"}}
System Issues
{"totals":{"all":3,"critical":2,"warning":1}}
- #2157: Watchdog beacon (by design)
- #2763: 1 Gemini chaos residue CHAOS-R3-GEM-SRC (per mission: do not delete)
- #3338: verify_counts 1 mismatch (birth_registry live drift)
Crontab
0 */6 * * * .../cron-integrity.sh (Dieu 31 runner)
0 * * * * .../watchdog-monitor.sh (watchdog)
0 */3 * * * .../scanner-counts.sh (D26 scanner)
Scanner Run (s168-final)
Loaded 13 measurements (was 10)
PASS: 10 | FAIL: 2 | ERROR: 0
Pass Rate: 83.3% (10/12)
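The pass rate above follows from the reported counters. The check below assumes the rate is computed over executed checks (PASS + FAIL + ERROR), which matches the 10/12 shown.

```python
passed, failed, errored = 10, 2, 0  # from the s168-final run
executed = passed + failed + errored
pass_rate = round(100 * passed / executed, 1)

print(f"Pass Rate: {pass_rate}% ({passed}/{executed})")  # Pass Rate: 83.3% (10/12)
```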
Gemini Residue Status
1 record: entity_dependencies id=353 (CHAOS-R3-GEM-SRC -> CHAOS-R3-GEM-TGT). NOT deleted per mission constraint.
Self Check
| # | Item | Status |
|---|---|---|
| 1 | verify_counts() runs on production? | PASS (23 rows) |
| 2 | D26 checks in runner log? | PASS (MSR-D26-001/002/004) |
| 3 | Cron env correct, no fallback? | PASS |
| 4 | Watchdog hard fail on missing token? | PASS |
| 5 | Scanner cron entry exists? | PASS (0 */3 * * *) |
| 6 | Smoke test? | N/A (script not on VPS) |
| 7 | Production URLs verified? | PASS |
| 8 | PR merged (4 GREEN, no --admin)? | PASS (#638) |
| 9 | Report uploaded? | PASS |
| 10 | Gemini residue documented? | PASS |
S168 DONE. 3 root causes fixed. Runner 10→13 measurements. Cron 2→3 entries.