IU Core 4000x — 04 External healthcheck + monitoring extension
04 — External healthcheck + monitoring (4000x extension)
1. What 3000x left
3000x added 7 external operator commands, including dot_iu_external_healthcheck, a one-shot aggregate over three external surfaces: the three-axis cache, the Directus collection, and the Qdrant collection. The healthcheck did not report the health of the auto-refresh trigger, which was acceptable for 3000x because no trigger existed yet.
2. 4000x extensions
2.1 dot_iu_external_healthcheck now reports 4 surfaces
The healthcheck SQL now emits 4 jsonb rows: three_axis_cache (drift in_sync + view/table count), directus_collection (permission rows + table rows), qdrant_collection (active collection + sync points indexed), and auto_refresh_trigger (gate value + trigger_fires_24h + trigger_errors_24h).
Live output captured right after runtime/340 (4000x):
three_axis_cache | {"in_sync": true, "view_count": 163, "table_count": 163}
directus_collection | {"table_rows": 163, "permission_rows": 1}
qdrant_collection | {"active": "iu_core_iu_chunks", "sync_points_indexed": 61}
auto_refresh_trigger | {"gate": "false", "trigger_fires_24h": 1, "trigger_errors_24h": 0}
The trigger_fires_24h value of 1 is the runtime/340 durable smoke; the trigger_errors_24h value of 0 indicates that no exception path has fired in the last 24 hours.
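As a rough illustration of how a consumer might read this output, the sketch below parses the four captured rows into a dictionary keyed by surface name. It assumes the rows arrive as pipe-delimited text exactly as printed above; the real command may expose them differently, and the function name parse_healthcheck is hypothetical.

```python
import json

# Rows copied from the live output above, in "surface | detail" form.
SAMPLE_OUTPUT = """\
three_axis_cache | {"in_sync": true, "view_count": 163, "table_count": 163}
directus_collection | {"table_rows": 163, "permission_rows": 1}
qdrant_collection | {"active": "iu_core_iu_chunks", "sync_points_indexed": 61}
auto_refresh_trigger | {"gate": "false", "trigger_fires_24h": 1, "trigger_errors_24h": 0}
"""


def parse_healthcheck(text: str) -> dict[str, dict]:
    """Map each surface name to its decoded JSONB detail (hypothetical helper)."""
    surfaces = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first pipe only: surface name on the left, JSON on the right.
        surface, _, detail = line.partition("|")
        surfaces[surface.strip()] = json.loads(detail)
    return surfaces


report = parse_healthcheck(SAMPLE_OUTPUT)
print(sorted(report))
print(report["auto_refresh_trigger"]["trigger_errors_24h"])
```

Note that in the captured row the gate is the JSON string "false" rather than a boolean, so a consumer should compare it against the string, not against False.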
2.2 dot_iu_three_axis_envelope_trigger_status — new dedicated command
Read-only command focused on the migration 024 surface only:
| SELECT | Purpose |
|---|---|
| gate | Current iu_core.three_axis_auto_refresh_enabled value + updated_at. |
| recent_lifecycle_fires | Count / outcome breakdown (skipped_in_sync, refreshed) + last started_at, over the trailing 24 h, scoped to actor='iu_lifecycle_trigger'. |
| recent_trigger_errors | Count of rows in iu_three_axis_envelope_trigger_error_log over the trailing 24 h + last captured_at. |
Total registry: 17 governed + 8 external (7 from 3000x + 1 from 4000x).
3. Monitoring-ready output
Both commands emit pure tabular / JSONB output suitable for piping to a monitoring system (e.g. uptime-kuma / Grafana / Slack notifier):
- dot_iu_external_healthcheck returns 4 rows × (surface, detail jsonb); the operator's existing cron can post the JSON straight to a dashboard.
- dot_iu_three_axis_envelope_trigger_status returns 3 rows (gate, fires, errors) suitable for an alert rule like: trigger_errors_24h > 0 ⇒ page.
The 4000x macro does NOT install any cron / scheduler — it preserves the "no broad service restart" rule and leaves the cadence choice to ops.
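Since the cadence is left to ops, the alert rule itself can stay a small pure function that whatever scheduler is chosen calls. The following is a minimal sketch, assuming the trigger detail has already been decoded into a dictionary with the key names shown in the healthcheck output; should_page is a hypothetical name, not part of the command registry.

```python
from typing import Any, Mapping


def should_page(trigger_detail: Mapping[str, Any]) -> bool:
    """Apply the rule from the text: trigger_errors_24h > 0 => page."""
    # A missing key raises KeyError on purpose: an incomplete row is a
    # monitoring problem in itself and should not be read as "healthy".
    return int(trigger_detail["trigger_errors_24h"]) > 0


# Detail of the auto_refresh_trigger surface as captured after runtime/340.
captured = {"gate": "false", "trigger_fires_24h": 1, "trigger_errors_24h": 0}

print(should_page(captured))                               # False
print(should_page({**captured, "trigger_errors_24h": 2}))  # True
```

Keeping the rule free of I/O means the same function can sit behind a cron job, a Grafana webhook handler, or a Slack notifier without change.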
4. Rollback / disable
- Closing the gate (UPDATE dot_config SET value='false' WHERE key='iu_core.three_axis_auto_refresh_enabled') silences the trigger surface in real time.
- git revert of the dot_commands.py hunk removes the new dot_iu_three_axis_envelope_trigger_status registry entry and the auto_refresh_trigger surface from the external healthcheck.
- The migration 024 rollback file drops the trigger function + the error log table.
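The gate-closing step can be exercised safely against a stand-in table before touching production. The sketch below uses an in-memory SQLite database purely as a substitute for PostgreSQL; the two-column dot_config layout and the close_gate helper are assumptions made for illustration, and the real table may carry more columns (such as the updated_at mentioned earlier).

```python
import sqlite3

GATE_KEY = "iu_core.three_axis_auto_refresh_enabled"

# In-memory stand-in for dot_config; the real table lives in PostgreSQL and
# its full column list is not shown in this report.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dot_config (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO dot_config (key, value) VALUES (?, 'true')", (GATE_KEY,))


def close_gate(db: sqlite3.Connection) -> int:
    """Set the gate to 'false' and return the number of rows changed."""
    cur = db.execute(
        "UPDATE dot_config SET value = 'false' WHERE key = ?", (GATE_KEY,)
    )
    db.commit()
    return cur.rowcount


changed = close_gate(conn)
value = conn.execute(
    "SELECT value FROM dot_config WHERE key = ?", (GATE_KEY,)
).fetchone()[0]
print(changed, value)
```

Checking that exactly one row changed guards against a silent no-op, for example when the key is mistyped and the UPDATE matches nothing.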
5. Five-layer impact
| layer | impact |
|---|---|
| PG | No new DDL beyond migration 024; the extended healthcheck and the new status command are pure SELECT projections of existing tables / functions. |
| AgentData | +1 report (this doc). |
| Directus / Nuxt / Qdrant | unchanged. |