KB-7F69

S168 Fix 3 Root Causes + Automation Gaps Report

Date: 2026-03-26 | Agent: Claude CLI (claude-go) | PR: #638 (MERGED) | Post-deploy verified: 2026-03-26T11:06 UTC

Phase A: Counting Integrity

A1: verify_counts() crash fix

  • Before: ERROR: column "code" does not exist (species_collection_map)
  • Root cause: meta_catalog.code_column = 'code' but table uses 'species_code'
  • Fix: Updated code_column + hardened function to check column existence
  • After: Returns 23 rows, no crash. 1 MISMATCH (CAT-023 birth_registry live drift).
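
The column-existence guard from A1 could be approximated outside the database as a shell pre-check. This is only a sketch: the actual fix hardened verify_counts() itself, and its body is not shown in this report. The helper name column_exists and the use of psql with a DATABASE_URL connection string are assumptions made for illustration.

```shell
# Hypothetical pre-check: succeed only if the given column exists on the table.
# Assumes DATABASE_URL is set and that table/column names come from trusted
# catalog rows, because they are interpolated directly into the SQL text.
column_exists() {
    table="$1"
    column="$2"
    # -t: rows only, -A: unaligned output, -c: run a single command
    result=$(psql "$DATABASE_URL" -tAc \
        "SELECT 1 FROM information_schema.columns
          WHERE table_name = '$table' AND column_name = '$column'
          LIMIT 1")
    [ "$result" = "1" ]
}
```

A caller could run `column_exists species_collection_map species_code` and skip the catalog row, rather than crash, when the check fails.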

A2: D26 measurements invisible to runner

  • Before: MSR-D26-001/002/004 had method=1, but the runner loads only method=2
  • Fix: Updated to method=2, target_type='pg_query'
  • After: Runner loads 13 measurements (was 10). D26 checks execute.

A3: CAT-ALL drift investigation

  • CAT-ALL = 19603, atom_sum = 19504, delta = 99
  • Root cause: birth_registry (CAT-023) count drifts due to live traffic
  • verify_counts() shows: stored=17635, live=18570 (ongoing inserts)
  • Not a bug: this is expected behavior on a live system. Closing the gap would require refreshing the stored counts more often.
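
The figures in A3 can be checked with plain shell arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, and both results are point-in-time readings on a table that is still receiving inserts.

```shell
# Aggregate vs. sum of per-catalog counts (values from the report)
cat_all=19603
atom_sum=19504
delta=$((cat_all - atom_sum))
echo "CAT-ALL delta: $delta"

# Stored vs. live count for birth_registry (CAT-023), as shown by verify_counts()
stored=17635
live=18570
drift=$((live - stored))
echo "birth_registry drift: $drift"
```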

Phase B: Runner/Cron Operability

B1: Cron env wiring

  • Before: The cron script read PG_USER/PG_DATABASE, which do not exist in .env, and fell back to defaults
  • Fix: Hardcoded the directus user and database; only PG_PASSWORD is read from .env
  • After: DATABASE_URL correctly constructed
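
A minimal sketch of the B1 wiring, assuming a dotenv-style file with unquoted KEY=value lines. The function names, and the host and port (localhost:5432), are assumptions; the report only states that the user and database are fixed to directus and that PG_PASSWORD is the single value read from .env.

```shell
# Print the value of KEY from a dotenv-style file without sourcing the file.
# If the key appears more than once, the last occurrence wins.
read_env_value() {
    key="$1"
    file="$2"
    sed -n "s/^${key}=//p" "$file" | tail -n 1
}

# Build DATABASE_URL with a fixed user and database; fail loudly instead of
# falling back to a default when the password is missing.
build_database_url() {
    env_file="$1"
    pg_password=$(read_env_value PG_PASSWORD "$env_file")
    if [ -z "$pg_password" ]; then
        echo "PG_PASSWORD missing in $env_file" >&2
        return 1
    fi
    # Host and port are assumed. A password containing URL-reserved
    # characters would need percent-encoding, which this sketch omits.
    printf 'postgresql://%s:%s@localhost:5432/%s\n' directus "$pg_password" directus
}
```

In a cron script this might be used as `DATABASE_URL=$(build_database_url /path/to/.env) || exit 1`, where the path is a placeholder.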

B2: Watchdog hard fail

  • Before: Missing token -> exit 0 (silent skip)
  • Fix: Missing token -> exit 1 (hard failure with alert message)
  • After: Verified: empty token = exit 1
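
The hard-fail behavior in B2 can be sketched as a guard function. The variable name WATCHDOG_TOKEN and the alert wording are hypothetical; the report does not show the contents of watchdog-monitor.sh.

```shell
# Return non-zero with an alert on stderr when the token is unset or empty.
# WATCHDOG_TOKEN is a placeholder name for whatever token the watchdog needs.
require_token() {
    if [ -z "${WATCHDOG_TOKEN:-}" ]; then
        echo "ALERT: watchdog token missing, aborting" >&2
        return 1
    fi
}
```

A script would call `require_token || exit 1` before doing any work, so cron sees a failed run rather than a silent skip.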

B3: Independent scanner cron

  • Before: No separate scanner cron
  • Fix: Added scanner-counts.sh (verify_counts via docker exec, every 3h)
  • After: crontab shows 3 entries: integrity/6h, watchdog/1h, scanner/3h
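
A scanner along the lines of B3 might look like the sketch below. The container name is passed in because the report does not give it, and calling verify_counts() through `SELECT * FROM` assumes it is a set-returning function; neither detail is confirmed by the source.

```shell
# Run verify_counts() inside the database container and print its rows.
# $1: container name (placeholder, supplied by the caller).
run_scanner() {
    container="$1"
    # -t: rows only, -A: unaligned output, suited to log files
    docker exec "$container" psql -U directus -d directus -tA \
        -c "SELECT * FROM verify_counts();"
}
```

Scheduled with the `0 */3 * * *` entry, a wrapper script would call `run_scanner` with the real container name and redirect the output to a log.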

B4: meta_catalog cleanup path

  • Not attempted - deprecate_entity()/retire_entity() investigation deferred to future session.

Phase C: Validation Order

C1: DOT:UNKNOWN DEFAULT fix

  • Before: DEFAULT 'DOT:UNKNOWN' rejected by fn_validate_dot_origin (not pipe-separated)
  • Fix: Changed DEFAULT to 'DIRECTUS' (whitelisted prefix, no pipe check)
  • After: New records get valid _dot_origin automatically

Post-Deploy Production Evidence

Health

{"status":"healthy","data_integrity":{"document_count":593,"ratio":1.48,"sync_status":"ok"}}

System Issues

{"totals":{"all":3,"critical":2,"warning":1}}
  • #2157: Watchdog beacon (by design)
  • #2763: 1 Gemini chaos residue CHAOS-R3-GEM-SRC (per mission: do not delete)
  • #3338: verify_counts 1 mismatch (birth_registry live drift)

Crontab

0 */6 * * * .../cron-integrity.sh    (Dieu 31 runner)
0 * * * *   .../watchdog-monitor.sh  (watchdog)
0 */3 * * * .../scanner-counts.sh    (D26 scanner)

Scanner Run (s168-final)

Loaded 13 measurements (was 10)
PASS: 10 | FAIL: 2 | ERROR: 0
Pass Rate: 83.3% (10/12)

Gemini Residue Status

1 record: entity_dependencies id=353 (CHAOS-R3-GEM-SRC -> CHAOS-R3-GEM-TGT). NOT deleted per mission constraint.

Self Check

#  | Item                                 | Status
1  | verify_counts() runs on production?  | PASS (23 rows)
2  | D26 checks in runner log?            | PASS (MSR-D26-001/002/004)
3  | Cron env correct, no fallback?       | PASS
4  | Watchdog hard fail on missing token? | PASS
5  | Scanner cron entry exists?           | PASS (0 */3 * * *)
6  | Smoke test?                          | N/A (script not on VPS)
7  | Production URLs verified?            | PASS
8  | PR merged (4 GREEN, no --admin)?     | PASS (#638)
9  | Report uploaded?                     | PASS
10 | Gemini residue documented?           | PASS

S168 DONE. 3 root causes fixed. Runner 10→13 measurements. Cron 2→3 entries.