KB-18AD

IU Core 4000x — 04 External healthcheck + monitoring extension

Revision 1
Tags: iu-core, 4000x, external-healthcheck, monitoring, dot-commands, trigger-status

04 — External healthcheck + monitoring (4000x extension)

1. What 3000x left

3000x added 7 external operator commands, including dot_iu_external_healthcheck, a one-shot aggregate over three external surfaces: the three-axis cache, the Directus collection, and the Qdrant collection. The healthcheck did not report auto-refresh trigger health, which was acceptable for 3000x because no trigger existed yet.

2. 4000x extensions

2.1 dot_iu_external_healthcheck now reports 4 surfaces

The healthcheck SQL now emits 4 jsonb rows: three_axis_cache (drift in_sync + view/table count), directus_collection (permission rows + table rows), qdrant_collection (active collection + sync points indexed), and auto_refresh_trigger (gate value + trigger_fires_24h + trigger_errors_24h).

Live output captured right after runtime/340 (4000x):

three_axis_cache      | {"in_sync": true, "view_count": 163, "table_count": 163}
directus_collection   | {"table_rows": 163, "permission_rows": 1}
qdrant_collection     | {"active": "iu_core_iu_chunks", "sync_points_indexed": 61}
auto_refresh_trigger  | {"gate": "false", "trigger_fires_24h": 1, "trigger_errors_24h": 0}

The trigger_fires_24h: 1 value reflects the single fire from the runtime/340 durable smoke test; trigger_errors_24h: 0 means no exception path has fired in the last 24 hours.
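A consumer of this output usually needs the four rows as structured data. The following is a minimal sketch, assuming the rows arrive as the pipe-separated text shown above; the helper names parse_healthcheck and gate_is_open are hypothetical and not part of dot_commands.py.

```python
import json

# The four lines of live output shown above, as psql prints them.
RAW_OUTPUT = """\
three_axis_cache      | {"in_sync": true, "view_count": 163, "table_count": 163}
directus_collection   | {"table_rows": 163, "permission_rows": 1}
qdrant_collection     | {"active": "iu_core_iu_chunks", "sync_points_indexed": 61}
auto_refresh_trigger  | {"gate": "false", "trigger_fires_24h": 1, "trigger_errors_24h": 0}
"""


def parse_healthcheck(text):
    """Map each surface name to its decoded detail object (hypothetical helper)."""
    report = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first pipe only; the JSON side may itself contain "|".
        surface, detail = line.split("|", 1)
        report[surface.strip()] = json.loads(detail)
    return report


def gate_is_open(report):
    # The gate is reported as the string "false"/"true", not a JSON boolean.
    # bool("false") is True in Python, so compare the text explicitly.
    return str(report["auto_refresh_trigger"]["gate"]).lower() == "true"


report = parse_healthcheck(RAW_OUTPUT)
print(sorted(report))
print("gate open:", gate_is_open(report))
```

The explicit string comparison matters because the captured gate value is the text "false"; a plain truthiness check on it would wrongly report the gate as open.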

2.2 dot_iu_three_axis_envelope_trigger_status — new dedicated command

Read-only command focused on the migration 024 surface only:

SELECT                 | Purpose
gate                   | Current iu_core.three_axis_auto_refresh_enabled value + updated_at.
recent_lifecycle_fires | Count / outcome breakdown (skipped_in_sync, refreshed) + last started_at, over the trailing 24 h, scoped to actor='iu_lifecycle_trigger'.
recent_trigger_errors  | Count of rows in iu_three_axis_envelope_trigger_error_log over the trailing 24 h + last captured_at.

Total registry: 17 governed + 8 external (7 from 3000x + 1 from 4000x).

3. Monitoring-ready output

Both commands emit pure tabular / JSONB output suitable for piping to a monitoring system (e.g. Uptime Kuma, Grafana, or a Slack notifier):

  • dot_iu_external_healthcheck returns 4 rows × (surface, detail jsonb) — the operator's existing cron can post the JSON straight to a dashboard.
  • dot_iu_three_axis_envelope_trigger_status returns 3 rows (gate, fires, errors) suitable for an alert rule like: trigger_errors_24h > 0 ⇒ page.
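The alert rule in the second bullet can be sketched as a small predicate. This is an illustration only: the function name should_page and the max_errors parameter are assumptions, and the input is taken to be the decoded auto_refresh_trigger detail object from the healthcheck.

```python
def should_page(trigger_detail, max_errors=0):
    """Return True when the documented rule trigger_errors_24h > 0 holds.

    max_errors is a hypothetical knob; the default of 0 matches the rule
    stated in the text.
    """
    # Index directly so a missing field raises KeyError instead of
    # silently reading as "healthy".
    errors = int(trigger_detail["trigger_errors_24h"])
    return errors > max_errors


healthy = {"gate": "false", "trigger_fires_24h": 1, "trigger_errors_24h": 0}
failing = dict(healthy, trigger_errors_24h=2)

print(should_page(healthy))  # False
print(should_page(failing))  # True
```

Raising on a missing field is a deliberate choice in this sketch: if the surface disappears from the output, the monitor fails loudly rather than staying quiet.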

The 4000x macro does NOT install any cron / scheduler — it preserves the "no broad service restart" rule and leaves the cadence choice to ops.

4. Rollback / disable

  • Closing the gate (UPDATE dot_config SET value='false' WHERE key='iu_core.three_axis_auto_refresh_enabled') silences the trigger surface in real time.
  • git revert of the dot_commands.py hunk removes the new dot_iu_three_axis_envelope_trigger_status registry entry and the auto_refresh_trigger surface from the external healthcheck.
  • The migration 024 rollback file drops the trigger function + the error log table.
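The gate-closing statement in the first bullet can be wrapped so that a missing key is noticed instead of passing silently. The sketch below uses an in-memory SQLite table as a stand-in for the real Postgres dot_config table, so the two-column schema, the `?` placeholder style, and the close_gate helper are all assumptions for illustration; a Postgres driver would use its own placeholder syntax.

```python
import sqlite3

GATE_KEY = "iu_core.three_axis_auto_refresh_enabled"


def close_gate(conn):
    """Set the gate to 'false' and confirm exactly one row was changed."""
    cur = conn.execute(
        "UPDATE dot_config SET value = ? WHERE key = ?",
        ("false", GATE_KEY),
    )
    if cur.rowcount != 1:
        # Zero rows means the key is absent, so the gate was not closed.
        conn.rollback()
        raise RuntimeError(f"expected 1 row for {GATE_KEY}, got {cur.rowcount}")
    conn.commit()


# Stand-in table for demonstration only; the real table lives in Postgres.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dot_config (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO dot_config VALUES (?, ?)", (GATE_KEY, "true"))
conn.commit()

close_gate(conn)
value = conn.execute(
    "SELECT value FROM dot_config WHERE key = ?", (GATE_KEY,)
).fetchone()[0]
print(value)  # false
```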

5. Five-layer impact

layer                    | impact
PG                       | No new DDL beyond migration 024; the extended healthcheck and the new trigger-status command are pure SELECT projections of existing tables / functions.
AgentData                | +1 report (this doc).
Directus / Nuxt / Qdrant | unchanged.