06 — DP4 NOTIFY / Worker / Heartbeat (silent-gap closure)
06 — DP4 — NOTIFY / Worker / Heartbeat
DESIGN-ONLY. Cites Điều 45 §5.1, §5.3, §5.4, §15.3, §15.4, §15.5. Addresses the live 4-day worker silence.
§1. Goal
Close the §15.5 silent-gap class of failure with a substrate that:
- Provides a low-latency wake-up (NOTIFY) without violating §5.1 (PG is SoT) or §5.3 (NOTIFY not durable).
- Records an executor heartbeat so the difference between "alive with nothing to do" and "silently dead" is observable.
- Emits
system/queue_worker_silentevents when staleness exceeds threshold (§15.4). - Surfaces a single
v_queue_healthview for D31/D43 watchdogs. - Works whether the scheduler is external poll (Phase 1), NOTIFY-bridged hybrid, or eventually pg_cron.
§2. Current state
iu_route_worker_cursor.last_run_at= 2026-05-22 11:31 — 4-day silent gap (95 hours).hc_executor_last_run(dot_config) is being updated each tick — this is the working pattern DP4 generalises.fn_iu_route_worker_health()exists but its output is not aggregated.fn_kb_notify_vector_syncpublishespg_notify('kb_vector_sync', …)consumed bypg_vector_listener.py. This is the only PG-NOTIFY producer in the system.- No trigger emits NOTIFY for
event_outboxinserts.
§3. Proposed design
§3.1 Heartbeat substrate
NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY
table public.queue_heartbeat (
executor_id text PRIMARY KEY, -- e.g. 'iu_outbound_default', 'job_worker_a', 'hermes_batch_1'
executor_kind text NOT NULL
CHECK (executor_kind IN ('DOT','Agent','Hermes','Codex','PG_worker','external_worker','future_Kestra_adapter')),
domain_scope text[] NOT NULL, -- which event_domain values this executor covers
last_beat_at timestamptz NOT NULL,
last_beat_payload jsonb NOT NULL DEFAULT '{}'::jsonb
CHECK (NOT last_beat_payload ?| ARRAY['body','content','raw','vector','embedding','secret','token','password','ssn','personal_data']),
expected_cadence_seconds integer NOT NULL CHECK (expected_cadence_seconds > 0),
stale_threshold_seconds integer NOT NULL CHECK (stale_threshold_seconds > 0),
registered_at timestamptz NOT NULL DEFAULT now(),
registered_by text NOT NULL
);
function fn_queue_heartbeat_write(p_executor_id, p_payload, p_actor) RETURNS jsonb
-- UPSERT (executor_id) — sets last_beat_at = now(), validates row existence.
function fn_queue_stale_check() RETURNS jsonb
-- For each executor where now() - last_beat_at > stale_threshold_seconds AND
-- no system/queue_worker_silent event in the last (stale_threshold_seconds × 2) seconds:
-- emit one system/queue_worker_silent event (severity = warning OR critical per multiplier)
§3.2 NOTIFY wake-up bridge (opt-in)
NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY
trigger trg_event_outbox_wake_notify AFTER INSERT ON event_outbox
FOR EACH ROW WHEN ( fn_queue_notify_enabled() ) -- guard reads dot_config
EXECUTE FUNCTION fn_queue_wake_notify_event_outbox();
trigger trg_job_queue_wake_notify AFTER INSERT ON job_queue
FOR EACH ROW WHEN ( fn_queue_notify_enabled() )
EXECUTE FUNCTION fn_queue_wake_notify_job_queue();
function fn_queue_wake_notify_event_outbox() RETURNS trigger
-- PERFORM pg_notify('queue_wake_'||NEW.event_domain,
-- jsonb_build_object('layer','event','event_id',NEW.id,'event_domain',NEW.event_domain,
-- 'event_type',NEW.event_type,'lane',NEW.delivery_lane,
-- 'occurred_at',NEW.occurred_at)::text);
function fn_queue_wake_notify_job_queue() RETURNS trigger
-- PERFORM pg_notify('queue_wake_'||NEW.event_domain,
-- jsonb_build_object('layer','job','job_id',NEW.job_id,'job_kind',NEW.job_kind,
-- 'process_after',NEW.process_after)::text);
Properties:
- Payload is signal-only (refs + small ints), never the actual event/job body.
- Triggers are gated by
queue.notify.bridge_enabled; defaultfalseat Phase 1. - Listeners (external workers) subscribe via
LISTEN queue_wake_iu;etc. - A missed NOTIFY does NOT lose work — the external poll cadence (DP1 Layer 1) and heartbeat (this DP) catch it.
§3.3 v_queue_health — unified surface
NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY
view v_queue_health AS
SELECT 'cursor'::text AS source,
c.worker_name AS subject,
c.event_domain AS domain,
c.last_run_at AS last_seen_at,
EXTRACT(epoch FROM (now() - c.last_run_at))::bigint AS age_seconds,
c.events_seen, c.attempts_written, c.dead_lettered,
NULL::text AS status_hint
FROM iu_route_worker_cursor c
UNION ALL
SELECT 'heartbeat'::text, h.executor_id, NULL,
h.last_beat_at,
EXTRACT(epoch FROM (now() - h.last_beat_at))::bigint,
NULL, NULL, NULL,
CASE WHEN now() - h.last_beat_at > make_interval(secs => h.stale_threshold_seconds)
THEN 'stale' ELSE 'fresh' END
FROM queue_heartbeat h
UNION ALL
SELECT 'dead_letter_event'::text, ir.event_domain, ir.event_domain,
ir.last_failed_at, NULL, NULL, NULL, NULL,
'open'
FROM iu_route_dead_letter ir WHERE ir.resolved_at IS NULL
UNION ALL
SELECT 'dead_letter_job'::text, jd.job_kind, jd.event_domain,
jd.last_failed_at, NULL, NULL, NULL, NULL,
'open'
FROM job_dead_letter jd WHERE jd.resolved_at IS NULL
UNION ALL
SELECT 'backlog_job_queue'::text, j.job_kind, j.event_domain,
MAX(j.enqueued_at) FILTER (WHERE j.status='queued'),
NULL,
COUNT(*) FILTER (WHERE j.status='queued'),
NULL, NULL, NULL
FROM job_queue j GROUP BY j.job_kind, j.event_domain;
§3.4 Heartbeat write pattern (each executor)
Every executor MUST call fn_queue_heartbeat_write either:
- On every tick (PG worker, external poll), regardless of whether work was found.
- Or via a "still alive" job queued at cadence (Agent/Hermes/Codex sessions).
The hc executor pattern (writing hc_executor_last_run to dot_config) is migrated to a real queue_heartbeat row by Phase 1.
§3.5 Silent-gap event emission
fn_queue_stale_check():
-- For each stale executor:
-- gap_ratio = age_seconds / expected_cadence_seconds
-- severity = 'warning' if 3 ≤ gap_ratio < 10 else 'critical'
-- IF no system/queue_worker_silent event in last (stale_threshold × 2) seconds:
-- emit via fn_iu_emit_event with:
-- event_domain='system'
-- event_type='queue_worker_silent' -- requires registry entry
-- event_stream='alert'
-- delivery_lane='immediate'
-- event_severity = computed
-- event_subject_table='queue_heartbeat'
-- event_subject_ref=executor_id
-- safe_payload = {executor_id, age_seconds, expected_cadence_seconds, gap_ratio}
New event_type system/queue_worker_silent must be added to event_type_registry — that ratification is itself a Council step (vocab change), not in scope of this design pack.
§4. Lifecycle / status
Heartbeat rows do not have a lifecycle in the §6.7 sense; they are a continuously-updated mirror of executor liveness. They are additive observability surface, not transactional state.
§5. Indexes / performance
queue_heartbeatis a small table (1 row per executor, expected count < 100).- PK on
executor_id; an additional partial index on(last_beat_at)for fast stale scan is unnecessary at expected scale. - NOTIFY cost at 100 inserts/hour is negligible.
§6. Security / governance
- Only executors register themselves with a one-time
fn_queue_heartbeat_register(p_executor_id, p_kind, p_domain_scope, p_expected_cadence_seconds, p_stale_threshold_seconds, p_actor)(separate from the_writefn). - Registration requires
workflow_adminrole; write requiresdot_executor. - NOTIFY payloads are signal-only — CHECK on
last_beat_payloadmirrorsevent_outbox.safe_payload.
§7. Rollback / disable
queue.notify.bridge_enabled = false→ trigger guard returns; no NOTIFY emitted.queue.heartbeat.stale_threshold_seconds = huge_number→ stale-check becomes inert.- Drop
queue_heartbeattable only after all executors deregistered (separate deprecation pack).
§8. Healthcheck / observability
v_queue_healthis the single pane (§3.3).- D31 watchdog and D43 red_zones consume
v_queue_healthrows wherestatus_hint='stale'ordead_letterrows are open. - Operator playbook: see doc 14 phase 3 prerequisites.
§9. Compatibility with Điều 45 v1.0
| Clause | Compliance |
|---|---|
| §5.1 PG SoT | ✅ heartbeat row in PG; NOTIFY is wake-up only |
| §5.3 NOTIFY not SoT | ✅ payload is signal; listeners pull from PG |
| §5.4 no pg_cron Phase 1 | ✅ stale-check runs as a worker job (DP3 lease_reaper-style cadence) |
| §15.3 healthcheck contract | ✅ fn_<domain>_route_worker_health pattern preserved; aggregated into v_queue_health |
| §15.4 cadence rule + emit | ✅ implemented via fn_queue_stale_check |
| §15.5 silent_gap_is_a_health_violation | ✅ this is the enforcement substrate |
§10. Implementation prerequisites
- Register
system/queue_worker_silentevent_type in registry (Council vocab gate; §6.4 inclusion rubric: "Health/integrity alert severity ≥ warning" applies). - DP3 lease_reaper-style cadence reaches
fn_queue_stale_checkat < stale_threshold / 2. - D31 watchdog / D43 red_zones adapter must learn the
v_queue_healthview (separate D31/D43 design pack).
§11. Open questions
| # | Question | Routed to |
|---|---|---|
| DP4-Q1 | Approve heartbeat write on every tick (even no-work) vs only when work seen? | Council |
| DP4-Q2 | Approve executor_id text PK (no FK) vs FK to a new executor_registry table? |
Council |
| DP4-Q3 | NOTIFY channel naming: queue_wake_<domain> vs queue_wake_<event_type>? |
Council |
| DP4-Q4 | Should queue_worker_silent event be once-per-stale-period or repeating? |
Council |
| DP4-Q5 | Stale_threshold default per executor_kind (e.g. external_worker tolerates more)? | Council |
| DP4-Q6 | Should hc_executor_last_run dot_config key be migrated to queue_heartbeat row, or coexist? |
Council |
§12. Self-test
self_test:
cites_dieu45_section: §5.1,§5.3,§5.4,§15.3,§15.4,§15.5
defines_status_lifecycle_compatible_with_§6_7: n/a (observability surface)
defines_idempotency_key: heartbeat upsert is idempotent per executor_id
defines_retry_dlq: defers to DP3 substrate
defines_lease: n/a
defines_observability_view: v_queue_health
defines_dot_config_disable_flag: queue.notify.bridge_enabled, queue.heartbeat.stale_threshold_seconds
defines_executor_set_compatible_with_§11_5: yes (CHECK on executor_kind verbatim)
no_vector_in_transient: yes (CHECK on last_beat_payload)
signal_not_data: yes (NOTIFY payload signal; heartbeat payload signal)
pg_sot: yes
rollback_concept: yes
no_pg_cron_dependency_phase_1: yes
no_pg_18_dependency: yes
no_mutation_authored: yes
DP4 design. No mutation. Authored 2026-05-26 by Claude Opus 4.7 (1M).