KB-7A28

5000x · one-command 7-surface healthcheck

Revision 1
Tags: iu-core, 5000x, healthcheck, monitoring, 7-surfaces, cli

04 — One-command 7-surface healthcheck

1. From SQL fragments to a monitoring-ready artifact

4000x extended dot_iu_external_healthcheck to 4 surfaces, but its output was tabular SQL that could not be piped directly to a dashboard. 5000x turns it into a real one-command healthcheck that prints a single JSON line and returns a meaningful exit code:

$ scripts/iu_core_healthcheck.sh
{"overall_ok":true,"surfaces":[…7 surface records…]}
$ echo $?
0          # 0 = all green / 2 = at least one fail / 3 = bootstrap error
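
A caller such as a dashboard probe could combine the exit code with the JSON line roughly as follows. This is a minimal sketch, not part of the 5000x artifact: the names EXIT_MEANINGS and interpret are hypothetical, and it assumes only the overall_ok key and the three exit codes shown above.

```python
import json
from typing import Any, Dict, Optional

# Exit codes as documented for scripts/iu_core_healthcheck.sh.
EXIT_MEANINGS = {0: "all green", 2: "at least one fail", 3: "bootstrap error"}


def interpret(returncode: int, stdout: str) -> Dict[str, Any]:
    """Combine the process exit code with the one-line JSON report.

    Hypothetical helper: only exit codes 0 and 2 are assumed to come with
    a parseable report; on a bootstrap error the output is not inspected.
    """
    meaning = EXIT_MEANINGS.get(returncode, "unknown exit code")
    overall_ok: Optional[bool] = None
    if returncode in (0, 2):
        overall_ok = json.loads(stdout)["overall_ok"]
    return {"meaning": meaning, "overall_ok": overall_ok}
```

In practice returncode and stdout would come from something like subprocess.run on the wrapper script; the sketch keeps that call out so it stays testable without the host.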

2. Surfaces (7, in fixed order)

#  surface               green criteria
1  three_axis_cache      cache_healthy=t AND in_sync=t
2  directus_collection   table_rows > 0 AND permission_rows >= 1
3  qdrant_collection     active collection registered AND sync_points_indexed > 0
4  auto_refresh_trigger  trigger_errors_24h = 0
5  vector_boundary       unique_units = pts or off by ≤ 1 (the legitimate KT-B 2-chunk IU)
6  write_gates           6 IU Core write gates all false
7  operator_runtime      failed_24h = 0
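
The green criteria can be read as one predicate per surface. The sketch below is an illustration under stated assumptions, not the module's actual _VERDICTS table: it assumes each surface query yields a single dict-like row, the keys active_registered and gates are hypothetical names for "active collection registered" and the six write-gate flags, and the shape of the surface records it returns is invented for the example.

```python
from typing import Any, Callable, Dict, Mapping

Row = Mapping[str, Any]

# One predicate per surface, in the fixed order of the table.
# "active_registered" and "gates" are hypothetical key names.
VERDICTS: Dict[str, Callable[[Row], bool]] = {
    "three_axis_cache": lambda r: bool(r["cache_healthy"]) and bool(r["in_sync"]),
    "directus_collection": lambda r: r["table_rows"] > 0 and r["permission_rows"] >= 1,
    "qdrant_collection": lambda r: bool(r["active_registered"]) and r["sync_points_indexed"] > 0,
    "auto_refresh_trigger": lambda r: r["trigger_errors_24h"] == 0,
    # Equal, or off by at most one (the 2-chunk IU).
    "vector_boundary": lambda r: abs(r["pts"] - r["unique_units"]) <= 1,
    "write_gates": lambda r: len(r["gates"]) == 6 and not any(r["gates"]),
    "operator_runtime": lambda r: r["failed_24h"] == 0,
}


def evaluate(rows: Mapping[str, Row]) -> Dict[str, Any]:
    """Apply every rule; overall_ok is true only if all seven are green."""
    surfaces = [{"surface": name, "ok": rule(rows[name])} for name, rule in VERDICTS.items()]
    return {"overall_ok": all(s["ok"] for s in surfaces), "surfaces": surfaces}
```

Keeping the rules in an ordered mapping means the surface order in the report follows the table without extra sorting.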

Live output captured at end of 5000x:

three_axis_cache:     in_sync
directus_collection:  163 rows / 1 read-permission
qdrant_collection:    iu_core_iu_chunks (61 indexed)
auto_refresh_trigger: gate=false fires_24h=3
vector_boundary:      61 pts / 60 unique
write_gates:          all 6 inert
operator_runtime:     open_runs=0 failed_24h=0 active_leases=0
OVERALL_OK=True

3. Module shape

cutter_agent/iu_core/healthcheck.py exposes:

  • run_healthcheck(executor: SqlExecutor) -> HealthcheckReport — pure Python, takes an injected SQL executor (a callable returning a list of dicts). Lets tests cover every verdict rule without touching the DB.
  • make_ssh_executor(ssh_host, pg_container, pg_user, pg_db) — the default executor that wraps ssh <host> docker exec -i <container> psql -U <user> -d <db> -tAc <sql> and parses the JSON-aggregated output.
  • main(argv) — CLI entrypoint; prints the report's JSON and exits 0/2/3.
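
The SSH executor described above might be put together along these lines. It is a hedged sketch rather than the module's code: build_ssh_argv and parse_rows are hypothetical helper names, and it assumes each surface query wraps its SELECT in a JSON aggregate so that psql prints one JSON array (or an empty line when there are no rows).

```python
import json
import shlex
import subprocess
from typing import Any, Callable, Dict, List

SqlExecutor = Callable[[str], List[Dict[str, Any]]]


def build_ssh_argv(ssh_host: str, pg_container: str, pg_user: str,
                   pg_db: str, sql: str) -> List[str]:
    """Build the local argv for: ssh <host> docker exec -i <container> psql ..."""
    remote = ["docker", "exec", "-i", pg_container,
              "psql", "-U", pg_user, "-d", pg_db, "-tAc", sql]
    # ssh hands the remote command to a shell, so quote it as one string.
    return ["ssh", ssh_host, shlex.join(remote)]


def parse_rows(stdout: str) -> List[Dict[str, Any]]:
    """Parse the JSON-aggregated psql output; empty output means no rows."""
    text = stdout.strip()
    return json.loads(text) if text else []


def make_ssh_executor(ssh_host: str, pg_container: str,
                      pg_user: str, pg_db: str) -> SqlExecutor:
    def execute(sql: str) -> List[Dict[str, Any]]:
        argv = build_ssh_argv(ssh_host, pg_container, pg_user, pg_db, sql)
        proc = subprocess.run(argv, capture_output=True, text=True, check=True)
        return parse_rows(proc.stdout)
    return execute
```

Because run_healthcheck only needs a callable with this signature, a test can pass a plain function that returns canned rows instead of calling make_ssh_executor.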

The companion shell wrapper scripts/iu_core_healthcheck.sh exists only to set the working directory and call the Python entrypoint; the script never logs a secret value.

4. Tests

tests/test_iu_core_5000x_healthcheck.py covers:

  • the 7 expected surfaces are queried (no over-coverage, no missing);
  • every surface has a verdict rule (_VERDICTS matches _SURFACE_SQL);
  • the surface SQL strings carry no forbidden literal (Bearer / api-key / QDRANT__SERVICE__API_KEY / NUXT_DIRECTUS_SERVICE_TOKEN / password=);
  • happy-path inputs return ok on all surfaces, and the JSON output is single-line with sorted keys;
  • canonical fault inputs (cache unhealthy / trigger errors / write gate open / boundary breach / failed runs) flip the verdict correctly.
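
Three of these checks are easy to picture as small helpers. The following is an illustrative sketch, not an excerpt from the test file: find_forbidden, same_surfaces and render_report are hypothetical names, and the forbidden literals are matched case-sensitively, which is an assumption.

```python
import json
from typing import Any, Dict, Iterable, List

# Literals listed in the test description above.
FORBIDDEN_LITERALS = (
    "Bearer",
    "api-key",
    "QDRANT__SERVICE__API_KEY",
    "NUXT_DIRECTUS_SERVICE_TOKEN",
    "password=",
)


def find_forbidden(surface_sql: Dict[str, str]) -> List[str]:
    """Return 'surface: literal' for every forbidden literal found in the SQL."""
    hits = []
    for surface, sql in surface_sql.items():
        for literal in FORBIDDEN_LITERALS:
            if literal in sql:
                hits.append(f"{surface}: {literal}")
    return hits


def same_surfaces(surface_sql: Iterable[str], verdicts: Iterable[str]) -> bool:
    """True when every queried surface has a verdict rule and vice versa."""
    return set(surface_sql) == set(verdicts)


def render_report(report: Dict[str, Any]) -> str:
    """Serialize a report as single-line JSON with sorted keys."""
    return json.dumps(report, sort_keys=True, separators=(",", ":"))
```

Sorted keys and fixed separators make the output byte-stable for identical inputs, which is convenient when a test compares the whole line.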

Result: 10 / 10 PASS.

5. Cron / uptime-kuma / Grafana wiring (NOT enabled in this macro)

Per 5000x rule "no broad service restart; no persistent cron unless gated", the macro stops at "artifact authored". A separate ops macro would add one of:

  • a host-level cron entry that runs scripts/iu_core_healthcheck.sh every N minutes and POSTs the JSON to uptime-kuma's status endpoint;
  • a Grafana exec probe that runs the same script and exposes the seven ok-booleans as gauge metrics;
  • a Slack notifier that fires only on overall_ok=false transitions.

All three paths are additive and can be enabled / disabled independently; the healthcheck itself remains the single source of truth (SSOT).
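
For the Slack option, "fires only on transitions" amounts to comparing the current overall_ok with the previous one. A minimal sketch, assuming the notifier keeps the last observed value somewhere; should_notify and notifications are hypothetical names, and treating an unknown previous state as "not failing" is a design assumption.

```python
from typing import List, Optional, Sequence


def should_notify(previous_ok: Optional[bool], current_ok: bool) -> bool:
    """Fire only when overall_ok becomes false and was not already false.

    previous_ok is None on the first run (no stored state yet); in that
    case a failing first result still notifies.
    """
    return current_ok is False and previous_ok is not False


def notifications(history: Sequence[bool]) -> List[int]:
    """Return the indexes in a run history at which a notification fires."""
    fired = []
    previous: Optional[bool] = None
    for index, ok in enumerate(history):
        if should_notify(previous, ok):
            fired.append(index)
        previous = ok
    return fired
```

With this rule a long outage produces one message at its start rather than one per cron run.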

6. Five-layer impact

layer      impact
PG         none (pure SELECTs, no DDL)
Directus   none
Nuxt       none
AgentData  +1 KB report (this doc)
Qdrant     none