KB-7A28

5000x · one-command 7-surface healthcheck

Revision 1

iu-core · 5000x · healthcheck · monitoring · 7-surfaces · cli

04 — One-command 7-surface healthcheck

1. From SQL fragments to a monitoring-ready artifact

4000x extended dot_iu_external_healthcheck to 4 surfaces, but its output was tabular SQL that could not be piped directly to a dashboard. 5000x turns it into a real one-command healthcheck that prints a single JSON report and sets an exit code a monitor can act on:

$ scripts/iu_core_healthcheck.sh
{"overall_ok":true,"surfaces":[…7 surface records…]}
$ echo $?
0          # 0 = all green / 2 = at least one fail / 3 = bootstrap error
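
A monitoring consumer can act on the exit code alone, or parse the JSON line for per-surface detail. The following is a minimal sketch of that mapping, not the module's actual code. The helper names `exit_code_for` and `failing_surfaces`, and the per-surface keys `surface` and `ok`, are assumptions for illustration, since the surface record layout is elided in the sample above.

```python
import json
from typing import List, Optional

EXIT_OK = 0         # all surfaces green
EXIT_FAIL = 2       # at least one surface failed
EXIT_BOOTSTRAP = 3  # the healthcheck itself could not produce a report


def exit_code_for(raw_output: Optional[str]) -> int:
    """Map the healthcheck's JSON line to the documented exit codes."""
    if raw_output is None:
        return EXIT_BOOTSTRAP
    try:
        report = json.loads(raw_output)
        overall_ok = report["overall_ok"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Unparseable or malformed output is treated as a bootstrap error.
        return EXIT_BOOTSTRAP
    return EXIT_OK if overall_ok is True else EXIT_FAIL


def failing_surfaces(raw_output: str) -> List[str]:
    """List surfaces whose record is not ok (assumed keys: 'surface', 'ok')."""
    report = json.loads(raw_output)
    return [rec["surface"] for rec in report["surfaces"] if not rec["ok"]]
```

Treating malformed output as exit 3 rather than 2 keeps "the check could not run" distinguishable from "a surface is red", which is the distinction the three documented codes draw.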

2. Surfaces (7, in fixed order)

#	surface	green criteria
1	`three_axis_cache`	`cache_healthy=t` AND `in_sync=t`
2	`directus_collection`	`table_rows > 0` AND `permission_rows >= 1`
3	`qdrant_collection`	`active` collection registered AND `sync_points_indexed > 0`
4	`auto_refresh_trigger`	`trigger_errors_24h = 0`
5	`vector_boundary`	`unique_units = pts`, or the two differ by at most 1 (the legitimate KT-B 2-chunk IU)
6	`write_gates`	6 IU Core write gates all `false`
7	`operator_runtime`	`failed_24h = 0`
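
The green criteria in the table can be written as one predicate per surface. The sketch below is an illustration under assumptions, not the contents of healthcheck.py: the row keys `active_collections` and `gates`, and the function name `evaluate`, are hypothetical, and the sample rows reuse the figures from the live output in this section.

```python
from typing import Any, Callable, Dict

Row = Dict[str, Any]

# One predicate per surface, in the fixed order from the table.
# 'active_collections' and 'gates' are hypothetical key names.
VERDICTS: Dict[str, Callable[[Row], bool]] = {
    "three_axis_cache": lambda r: bool(r["cache_healthy"]) and bool(r["in_sync"]),
    "directus_collection": lambda r: r["table_rows"] > 0 and r["permission_rows"] >= 1,
    "qdrant_collection": lambda r: r["active_collections"] >= 1
    and r["sync_points_indexed"] > 0,
    "auto_refresh_trigger": lambda r: r["trigger_errors_24h"] == 0,
    # Equal, or differing by at most 1 (the 2-chunk IU case).
    "vector_boundary": lambda r: abs(r["pts"] - r["unique_units"]) <= 1,
    # All six gates must be present and false.
    "write_gates": lambda r: len(r["gates"]) == 6 and not any(r["gates"]),
    "operator_runtime": lambda r: r["failed_24h"] == 0,
}


def evaluate(rows: Dict[str, Row]) -> Dict[str, Any]:
    """Apply every verdict rule and fold the results into one report."""
    surfaces = [{"surface": name, "ok": rule(rows[name])} for name, rule in VERDICTS.items()]
    return {"overall_ok": all(s["ok"] for s in surfaces), "surfaces": surfaces}


# Sample rows mirroring the live output captured at the end of 5000x.
LIVE_ROWS: Dict[str, Row] = {
    "three_axis_cache": {"cache_healthy": True, "in_sync": True},
    "directus_collection": {"table_rows": 163, "permission_rows": 1},
    "qdrant_collection": {"active_collections": 1, "sync_points_indexed": 61},
    "auto_refresh_trigger": {"trigger_errors_24h": 0},
    "vector_boundary": {"pts": 61, "unique_units": 60},
    "write_gates": {"gates": [False] * 6},
    "operator_runtime": {"failed_24h": 0},
}
```

With these rows `evaluate(LIVE_ROWS)["overall_ok"]` is true; opening any one gate or widening the `pts` / `unique_units` gap to 2 flips it to false.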

Live output captured at end of 5000x:

three_axis_cache:     in_sync
directus_collection:  163 rows / 1 read-permission
qdrant_collection:    iu_core_iu_chunks (61 indexed)
auto_refresh_trigger: gate=false fires_24h=3
vector_boundary:      61 pts / 60 unique
write_gates:          all 6 inert
operator_runtime:     open_runs=0 failed_24h=0 active_leases=0
OVERALL_OK=True

3. Module shape

cutter_agent/iu_core/healthcheck.py exposes:

run_healthcheck(executor: SqlExecutor) -> HealthcheckReport: pure Python; takes an injected SQL executor (a callable that accepts a SQL string and returns a list of dicts), so tests can cover every verdict rule without touching the DB.
make_ssh_executor(ssh_host, pg_container, pg_user, pg_db): the default executor, which wraps ssh <host> docker exec -i <container> psql -U <user> -d <db> -tAc <sql> and parses the JSON-aggregated output.
main(argv): CLI entrypoint; prints the report's JSON and exits 0/2/3.
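
The executor-injection pattern described above can be sketched as follows. This is a hedged illustration rather than the module's real implementation: `build_ssh_argv`, `make_ssh_executor_sketch`, and `make_fake_executor` are hypothetical names, and the sketch assumes each query is written so that psql prints a single JSON array (for example via `json_agg`).

```python
import json
import shlex
import subprocess
from typing import Any, Callable, Dict, List

# An executor takes one SQL string and returns rows as dicts.
SqlExecutor = Callable[[str], List[Dict[str, Any]]]


def build_ssh_argv(ssh_host: str, pg_container: str, pg_user: str,
                   pg_db: str, sql: str) -> List[str]:
    """Build the ssh argv; the remote command is quoted for the remote shell."""
    remote = ["docker", "exec", "-i", pg_container,
              "psql", "-U", pg_user, "-d", pg_db, "-tAc", sql]
    return ["ssh", ssh_host, " ".join(shlex.quote(part) for part in remote)]


def make_ssh_executor_sketch(ssh_host: str, pg_container: str,
                             pg_user: str, pg_db: str) -> SqlExecutor:
    """Return an executor that runs psql over ssh and parses JSON output."""
    def execute(sql: str) -> List[Dict[str, Any]]:
        argv = build_ssh_argv(ssh_host, pg_container, pg_user, pg_db, sql)
        done = subprocess.run(argv, check=True, capture_output=True, text=True)
        out = done.stdout.strip()
        # An empty result (json_agg over zero rows) is treated as no rows.
        return json.loads(out) if out else []
    return execute


def make_fake_executor(canned: Dict[str, List[Dict[str, Any]]]) -> SqlExecutor:
    """Test double: return canned rows keyed by the exact SQL string."""
    def execute(sql: str) -> List[Dict[str, Any]]:
        return canned.get(sql, [])
    return execute
```

Because both factories return the same callable shape, a test can hand `make_fake_executor({...})` to the healthcheck where production code would pass the ssh-backed executor.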

The companion shell wrapper scripts/iu_core_healthcheck.sh exists only to set the working directory and call the Python entrypoint; the script never logs a secret value.

4. Tests

tests/test_iu_core_5000x_healthcheck.py covers:

the 7 expected surfaces are queried (no over-coverage, no missing);
every surface has a verdict rule (_VERDICTS matches _SURFACE_SQL);
the surface SQL strings carry no forbidden literal (Bearer / api-key / QDRANT__SERVICE__API_KEY / NUXT_DIRECTUS_SERVICE_TOKEN / password=);
happy-path inputs return ok across all surfaces with the JSON output being sorted-key, single-line;
canonical fault inputs (cache unhealthy / trigger errors / write gate open / boundary breach / failed runs) flip the verdict correctly.

Result: 10 / 10 PASS.
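
Two of the checks listed above, the forbidden-literal scan and the sorted-key single-line JSON, are easy to express as small helpers. The sketch below assumes the surface SQL is held in a dict keyed by surface name; `find_forbidden` and `render_report` are hypothetical helper names, and the compact separators are an assumption based on the sample output in section 1.

```python
import json
from typing import Any, Dict, List, Tuple

# Literals that must never appear in a surface SQL string (from the test list).
FORBIDDEN = (
    "Bearer",
    "api-key",
    "QDRANT__SERVICE__API_KEY",
    "NUXT_DIRECTUS_SERVICE_TOKEN",
    "password=",
)


def find_forbidden(sql_by_surface: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (surface, literal) pairs for every forbidden literal found."""
    return [
        (surface, literal)
        for surface, sql in sql_by_surface.items()
        for literal in FORBIDDEN
        if literal in sql
    ]


def render_report(report: Dict[str, Any]) -> str:
    """Serialize a report as sorted-key, single-line JSON."""
    return json.dumps(report, sort_keys=True, separators=(",", ":"))
```

A test would then assert that `find_forbidden(...)` returns an empty list for the real SQL mapping and that `render_report(...)` contains no newline.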

5. Cron / uptime-kuma / Grafana wiring (NOT enabled in this macro)

Per 5000x rule "no broad service restart; no persistent cron unless gated", the macro stops at "artifact authored". A separate ops macro would add one of:

a host-level cron entry that runs scripts/iu_core_healthcheck.sh every N minutes and POSTs the JSON to uptime-kuma's status endpoint;
a Grafana exec probe that runs the same script and exposes the seven ok-booleans as gauge metrics;
a Slack notifier that fires only on overall_ok=false transitions.

All three paths are additive and can be enabled / disabled independently; the healthcheck itself remains the single source of truth (SSOT).
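
The "fire only on transitions" behaviour of the notifier option can be sketched as a small state comparison. This is an illustrative sketch for a macro that is not enabled here; `should_notify` and `notification_indices` are hypothetical names, and treating a red first run (no previous state) as a transition is an assumption.

```python
from typing import List, Optional


def should_notify(previous_ok: Optional[bool], current_ok: bool) -> bool:
    """True only when overall_ok moves to false.

    previous_ok is None on the first run; a red first run is assumed
    to count as a transition.
    """
    return (not current_ok) and previous_ok is not False


def notification_indices(history: List[bool]) -> List[int]:
    """Replay a sequence of overall_ok values and return the runs that notify."""
    previous: Optional[bool] = None
    fired: List[int] = []
    for index, current in enumerate(history):
        if should_notify(previous, current):
            fired.append(index)
        previous = current
    return fired
```

Repeated red runs stay silent until the check has gone green again, so a long outage produces one alert rather than one per cron interval.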

6. Five-layer impact

layer	impact
PG	none — pure SELECTs, no DDL
Directus	none
Nuxt	none
AgentData	+1 KB report (this doc)
Qdrant	none