5000x · one-command 7-surface healthcheck
04 — One-command 7-surface healthcheck
1. From SQL fragments to a monitoring-ready artifact
4000x extended dot_iu_external_healthcheck to 4 surfaces. But the
output was tabular SQL — not directly pipeable to a dashboard. 5000x
turns it into a real one-command healthcheck:
$ scripts/iu_core_healthcheck.sh
{"overall_ok":true,"surfaces":[…7 surface records…]}
$ echo $?
0 # 0 = all green / 2 = at least one fail / 3 = bootstrap error
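A monitoring caller would typically combine the exit code with the JSON on stdout. The following is a minimal sketch of that, assuming the script prints its report only on exit codes 0 and 2 (the behaviour on exit 3 is not stated here, so the sketch does not try to parse anything in that case). The names `interpret` and `run_script` are hypothetical helpers, not part of the module.

```python
import json
import subprocess

# Exit codes as documented for scripts/iu_core_healthcheck.sh.
EXIT_MEANINGS = {0: "all green", 2: "at least one fail", 3: "bootstrap error"}


def interpret(returncode: int, stdout: str) -> dict:
    """Combine the exit code with the JSON report (if any) into one summary."""
    meaning = EXIT_MEANINGS.get(returncode, "unexpected exit code")
    report = None
    # Assumption: a report is only expected when the check itself ran (0 or 2).
    if returncode in (0, 2) and stdout.strip():
        report = json.loads(stdout)
    return {"returncode": returncode, "meaning": meaning, "report": report}


def run_script(path: str = "scripts/iu_core_healthcheck.sh") -> dict:
    """Run the healthcheck script and interpret its result."""
    # check=False: a non-zero exit is an expected outcome, not an exception.
    proc = subprocess.run([path], capture_output=True, text=True, check=False)
    return interpret(proc.returncode, proc.stdout)
```

Keeping `interpret` separate from the subprocess call means the exit-code mapping can be exercised without running the script.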
2. Surfaces (7, in fixed order)
| # | surface | green criteria |
|---|---|---|
| 1 | three_axis_cache | cache_healthy=t AND in_sync=t |
| 2 | directus_collection | table_rows > 0 AND permission_rows >= 1 |
| 3 | qdrant_collection | active collection registered AND sync_points_indexed > 0 |
| 4 | auto_refresh_trigger | trigger_errors_24h = 0 |
| 5 | vector_boundary | unique_units = pts or off by ≤ 1 (the legitimate KT-B 2-chunk IU) |
| 6 | write_gates | 6 IU Core write gates all false |
| 7 | operator_runtime | failed_24h = 0 |
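The green criteria above can be read as one predicate per surface. The sketch below is an illustration, not the module's actual `_VERDICTS` table: the row shape, the `active_collection` and `gates` field names, and the `evaluate` helper are assumptions, and "off by ≤ 1" is interpreted as an absolute difference.

```python
from typing import Any, Callable, Dict, Mapping

Row = Mapping[str, Any]

# Surface name -> predicate over that surface's result row, in the fixed order.
VERDICTS: Dict[str, Callable[[Row], bool]] = {
    "three_axis_cache": lambda r: bool(r["cache_healthy"]) and bool(r["in_sync"]),
    "directus_collection": lambda r: r["table_rows"] > 0 and r["permission_rows"] >= 1,
    # `active_collection` is an assumed field holding the registered collection name.
    "qdrant_collection": lambda r: bool(r["active_collection"]) and r["sync_points_indexed"] > 0,
    "auto_refresh_trigger": lambda r: r["trigger_errors_24h"] == 0,
    # The point count may differ from the unique-unit count by at most 1.
    "vector_boundary": lambda r: abs(r["pts"] - r["unique_units"]) <= 1,
    # `gates` is an assumed list holding the six write-gate booleans.
    "write_gates": lambda r: len(r["gates"]) == 6 and not any(r["gates"]),
    "operator_runtime": lambda r: r["failed_24h"] == 0,
}


def evaluate(rows: Mapping[str, Row]) -> Dict[str, Any]:
    """Apply every rule and derive overall_ok as the AND of all surfaces."""
    surfaces = [{"surface": name, "ok": rule(rows[name])} for name, rule in VERDICTS.items()]
    return {"overall_ok": all(s["ok"] for s in surfaces), "surfaces": surfaces}


# Sample rows mirroring the captured live output; fields that output does
# not show (cache_healthy, trigger_errors_24h) are assumed green.
SAMPLE_ROWS = {
    "three_axis_cache": {"cache_healthy": True, "in_sync": True},
    "directus_collection": {"table_rows": 163, "permission_rows": 1},
    "qdrant_collection": {"active_collection": "iu_core_iu_chunks", "sync_points_indexed": 61},
    "auto_refresh_trigger": {"trigger_errors_24h": 0},
    "vector_boundary": {"pts": 61, "unique_units": 60},
    "write_gates": {"gates": [False] * 6},
    "operator_runtime": {"failed_24h": 0},
}
```

With these sample rows, `evaluate(SAMPLE_ROWS)["overall_ok"]` is `True`; flipping any one write gate to `True` turns it `False`.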
Live output captured at end of 5000x:
three_axis_cache: in_sync
directus_collection: 163 rows / 1 read-permission
qdrant_collection: iu_core_iu_chunks (61 indexed)
auto_refresh_trigger: gate=false fires_24h=3
vector_boundary: 61 pts / 60 unique
write_gates: all 6 inert
operator_runtime: open_runs=0 failed_24h=0 active_leases=0
OVERALL_OK=True
3. Module shape
cutter_agent/iu_core/healthcheck.py exposes:
- `run_healthcheck(executor: SqlExecutor) -> HealthcheckReport`: pure Python; takes an injected SQL executor (a callable returning a list of dicts), which lets tests cover every verdict rule without touching the DB.
- `make_ssh_executor(ssh_host, pg_container, pg_user, pg_db)`: the default executor, which wraps `ssh <host> docker exec -i <container> psql -U <user> -d <db> -tAc <sql>` and parses the JSON-aggregated output.
- `main(argv)`: CLI entrypoint; prints the report's JSON and exits 0/2/3.
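To make the executor contract concrete, here is a rough sketch of how an SSH-backed executor of this shape could be assembled. It is not the module's code: `build_ssh_argv`, `make_ssh_executor_sketch`, and the injectable `runner` parameter are hypothetical, and the sketch assumes each surface query aggregates its rows into a single JSON array (for example via `json_agg`), so that stdout is one JSON document or empty.

```python
import json
import shlex
import subprocess
from typing import Any, Callable, Dict, List

SqlExecutor = Callable[[str], List[Dict[str, Any]]]


def build_ssh_argv(ssh_host: str, pg_container: str, pg_user: str,
                   pg_db: str, sql: str) -> List[str]:
    """Build the local argv for: ssh <host> docker exec -i ... psql ... -tAc <sql>."""
    remote = ["docker", "exec", "-i", pg_container,
              "psql", "-U", pg_user, "-d", pg_db, "-tAc", sql]
    # ssh hands the remote side a single string that a shell re-parses,
    # so every token is quoted before joining.
    return ["ssh", ssh_host, " ".join(shlex.quote(tok) for tok in remote)]


def make_ssh_executor_sketch(ssh_host: str, pg_container: str, pg_user: str,
                             pg_db: str, runner=subprocess.run) -> SqlExecutor:
    """Return a callable that runs one SQL string and yields a list of dicts."""
    def execute(sql: str) -> List[Dict[str, Any]]:
        argv = build_ssh_argv(ssh_host, pg_container, pg_user, pg_db, sql)
        proc = runner(argv, capture_output=True, text=True, check=True)
        out = proc.stdout.strip()
        # Assumption: empty output means the aggregate was NULL (no rows).
        return json.loads(out) if out else []
    return execute
```

Because the executor is just a callable, a test can pass a stub `runner` (or a plain function returning canned rows) in place of the SSH round trip.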
The companion shell wrapper scripts/iu_core_healthcheck.sh exists
only to set the working directory and call the Python entrypoint; the
script never logs a secret value.
4. Tests
tests/test_iu_core_5000x_healthcheck.py covers:
- the 7 expected surfaces are queried (no over-coverage, no missing);
- every surface has a verdict rule (`_VERDICTS` matches `_SURFACE_SQL`);
- the surface SQL strings carry no forbidden literal (`Bearer` / `api-key` / `QDRANT__SERVICE__API_KEY` / `NUXT_DIRECTUS_SERVICE_TOKEN` / `password=`);
- happy-path inputs return ok across all surfaces, with the JSON output being sorted-key and single-line;
- canonical fault inputs (cache unhealthy / trigger errors / write gate open / boundary breach / failed runs) flip the verdict correctly.
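Two of the checks listed above, the forbidden-literal scan and the sorted-key single-line output, are easy to picture as small helpers. The sketch below is illustrative only: `find_forbidden` and `is_sorted_single_line` are hypothetical names, the case-insensitive match is an assumption, and the compact separators are inferred from the sample output shown earlier.

```python
import json
from typing import Dict, List, Tuple

# Literals that must never appear in a surface SQL string.
FORBIDDEN_LITERALS = (
    "Bearer",
    "api-key",
    "QDRANT__SERVICE__API_KEY",
    "NUXT_DIRECTUS_SERVICE_TOKEN",
    "password=",
)


def find_forbidden(surface_sql: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (surface, literal) pairs for every forbidden literal found."""
    hits = []
    for surface, sql in surface_sql.items():
        for literal in FORBIDDEN_LITERALS:
            # Assumption: matching is case-insensitive to catch variants.
            if literal.lower() in sql.lower():
                hits.append((surface, literal))
    return hits


def is_sorted_single_line(payload: str) -> bool:
    """True if payload is one line and equals its sorted-key compact re-dump."""
    if "\n" in payload:
        return False
    canonical = json.dumps(json.loads(payload), sort_keys=True, separators=(",", ":"))
    return payload == canonical
```

A test would then assert `find_forbidden(surface_sql) == []` for the real SQL table and `is_sorted_single_line(output)` for the happy-path report.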
Result: 10 / 10 PASS.
5. Cron / uptime-kuma / Grafana wiring (NOT enabled in this macro)
Per 5000x rule "no broad service restart; no persistent cron unless gated", the macro stops at "artifact authored". A separate ops macro would add one of:
- a host-level cron entry that runs `scripts/iu_core_healthcheck.sh` every N minutes and POSTs the JSON to uptime-kuma's status endpoint;
- a Grafana exec probe that runs the same script and exposes the seven ok-booleans as gauge metrics;
- a Slack notifier that fires only on `overall_ok=false` transitions.
All three paths are additive and can be enabled / disabled independently; the healthcheck itself remains the SSOT.
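As an illustration of how thin these consumers could be, the sketch below derives a transition-only alert decision and per-surface gauge values from the report JSON. None of this exists in the macro: `should_notify` and `ok_gauges` are hypothetical, the surface record keys `surface` and `ok` are assumed, and treating an unknown previous state as a transition is a design choice made for the example.

```python
import json
from typing import Dict, Optional


def should_notify(previous_ok: Optional[bool], report_json: str) -> bool:
    """True only when overall_ok has just become false."""
    current_ok = bool(json.loads(report_json)["overall_ok"])
    # Assumption: an unknown previous state (None) counts as a transition,
    # so the very first failing run still alerts.
    return (not current_ok) and previous_ok is not False


def ok_gauges(report_json: str) -> Dict[str, int]:
    """Map each surface to 1 (ok) or 0 (fail), suitable for gauge metrics."""
    report = json.loads(report_json)
    # Assumption: each surface record carries "surface" and "ok" keys.
    return {rec["surface"]: int(bool(rec["ok"])) for rec in report["surfaces"]}
```

Both helpers only read the JSON the healthcheck already emits, which keeps the healthcheck itself as the single source of truth.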
6. Five-layer impact
| layer | impact |
|---|---|
| PG | none — pure SELECTs, no DDL |
| Directus | none |
| Nuxt | none |
| AgentData | +1 KB report (this doc) |
| Qdrant | none |