KB-CD8C

06 — DP4 NOTIFY / Worker / Heartbeat (silent-gap closure)

12 min read Revision 1
design-packdieu-45dp4notifyheartbeatsilent-gapqueue-healthdesign-only

06 — DP4 — NOTIFY / Worker / Heartbeat

DESIGN-ONLY. Cites Điều 45 §5.1, §5.3, §5.4, §15.3, §15.4, §15.5. Addresses the live 4-day worker silence.


§1. Goal

Close the §15.5 silent-gap class of failure with a substrate that:

  1. Provides a low-latency wake-up (NOTIFY) without violating §5.1 (PG is SoT) or §5.3 (NOTIFY not durable).
  2. Records an executor heartbeat so the difference between "alive with nothing to do" and "silently dead" is observable.
  3. Emits system/queue_worker_silent events when staleness exceeds threshold (§15.4).
  4. Surfaces a single v_queue_health view for D31/D43 watchdogs.
  5. Works whether the scheduler is external poll (Phase 1), NOTIFY-bridged hybrid, or eventually pg_cron.

§2. Current state

  • iu_route_worker_cursor.last_run_at = 2026-05-22 11:31 — 4-day silent gap (95 hours).
  • hc_executor_last_run (dot_config) is being updated each tick — this is the working pattern DP4 generalises.
  • fn_iu_route_worker_health() exists but its output is not aggregated.
  • fn_kb_notify_vector_sync publishes pg_notify('kb_vector_sync', …) consumed by pg_vector_listener.py. This is the only PG-NOTIFY producer in the system.
  • No trigger emits NOTIFY for event_outbox inserts.

§3. Proposed design

§3.1 Heartbeat substrate

NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY

table public.queue_heartbeat (
  executor_id          text PRIMARY KEY,            -- e.g. 'iu_outbound_default', 'job_worker_a', 'hermes_batch_1'
  executor_kind        text NOT NULL
       CHECK (executor_kind IN ('DOT','Agent','Hermes','Codex','PG_worker','external_worker','future_Kestra_adapter')),
  domain_scope         text[] NOT NULL,             -- which event_domain values this executor covers
  last_beat_at         timestamptz NOT NULL,
  last_beat_payload    jsonb NOT NULL DEFAULT '{}'::jsonb
       CHECK (NOT last_beat_payload ?| ARRAY['body','content','raw','vector','embedding','secret','token','password','ssn','personal_data']),
  expected_cadence_seconds  integer NOT NULL CHECK (expected_cadence_seconds > 0),
  stale_threshold_seconds   integer NOT NULL CHECK (stale_threshold_seconds > 0),
  registered_at        timestamptz NOT NULL DEFAULT now(),
  registered_by        text NOT NULL
);

function fn_queue_heartbeat_write(p_executor_id, p_payload, p_actor) RETURNS jsonb
  -- UPSERT (executor_id) — sets last_beat_at = now(), validates row existence.

function fn_queue_stale_check() RETURNS jsonb
  -- For each executor where now() - last_beat_at > stale_threshold_seconds AND
  --   no system/queue_worker_silent event in the last (stale_threshold_seconds × 2) seconds:
  --     emit one system/queue_worker_silent event (severity = warning OR critical per multiplier)

§3.2 NOTIFY wake-up bridge (opt-in)

NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY

trigger trg_event_outbox_wake_notify  AFTER INSERT ON event_outbox
  FOR EACH ROW WHEN ( fn_queue_notify_enabled() )                  -- guard reads dot_config
  EXECUTE FUNCTION fn_queue_wake_notify_event_outbox();

trigger trg_job_queue_wake_notify     AFTER INSERT ON job_queue
  FOR EACH ROW WHEN ( fn_queue_notify_enabled() )
  EXECUTE FUNCTION fn_queue_wake_notify_job_queue();

function fn_queue_wake_notify_event_outbox() RETURNS trigger
  -- PERFORM pg_notify('queue_wake_'||NEW.event_domain,
  --                   jsonb_build_object('layer','event','event_id',NEW.id,'event_domain',NEW.event_domain,
  --                                      'event_type',NEW.event_type,'lane',NEW.delivery_lane,
  --                                      'occurred_at',NEW.occurred_at)::text);

function fn_queue_wake_notify_job_queue() RETURNS trigger
  -- PERFORM pg_notify('queue_wake_'||NEW.event_domain,
  --                   jsonb_build_object('layer','job','job_id',NEW.job_id,'job_kind',NEW.job_kind,
  --                                      'process_after',NEW.process_after)::text);

Properties:

  • Payload is signal-only (refs + small ints), never the actual event/job body.
  • Triggers are gated by queue.notify.bridge_enabled; default false at Phase 1.
  • Listeners (external workers) subscribe via LISTEN queue_wake_iu; etc.
  • A missed NOTIFY does NOT lose work — the external poll cadence (DP1 Layer 1) and heartbeat (this DP) catch it.

§3.3 v_queue_health — unified surface

NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY

view v_queue_health AS
SELECT 'cursor'::text                    AS source,
       c.worker_name                     AS subject,
       c.event_domain                    AS domain,
       c.last_run_at                     AS last_seen_at,
       EXTRACT(epoch FROM (now() - c.last_run_at))::bigint AS age_seconds,
       c.events_seen, c.attempts_written, c.dead_lettered,
       NULL::text                        AS status_hint
  FROM iu_route_worker_cursor c
UNION ALL
SELECT 'heartbeat'::text, h.executor_id, NULL,
       h.last_beat_at,
       EXTRACT(epoch FROM (now() - h.last_beat_at))::bigint,
       NULL, NULL, NULL,
       CASE WHEN now() - h.last_beat_at > make_interval(secs => h.stale_threshold_seconds)
            THEN 'stale' ELSE 'fresh' END
  FROM queue_heartbeat h
UNION ALL
SELECT 'dead_letter_event'::text, ir.event_domain, ir.event_domain,
       ir.last_failed_at, NULL, NULL, NULL, NULL,
       'open'
  FROM iu_route_dead_letter ir WHERE ir.resolved_at IS NULL
UNION ALL
SELECT 'dead_letter_job'::text, jd.job_kind, jd.event_domain,
       jd.last_failed_at, NULL, NULL, NULL, NULL,
       'open'
  FROM job_dead_letter jd WHERE jd.resolved_at IS NULL
UNION ALL
SELECT 'backlog_job_queue'::text, j.job_kind, j.event_domain,
       MAX(j.enqueued_at) FILTER (WHERE j.status='queued'),
       NULL,
       COUNT(*) FILTER (WHERE j.status='queued'),
       NULL, NULL, NULL
  FROM job_queue j GROUP BY j.job_kind, j.event_domain;

§3.4 Heartbeat write pattern (each executor)

Every executor MUST call fn_queue_heartbeat_write either:

  • On every tick (PG worker, external poll), regardless of whether work was found.
  • Or via a "still alive" job queued at cadence (Agent/Hermes/Codex sessions).

The hc executor pattern (writing hc_executor_last_run to dot_config) is migrated to a real queue_heartbeat row by Phase 1.

§3.5 Silent-gap event emission

fn_queue_stale_check():
  -- For each stale executor:
  --   gap_ratio = age_seconds / expected_cadence_seconds
  --   severity = 'warning' if 3 ≤ gap_ratio < 10 else 'critical'
  --   IF no system/queue_worker_silent event in last (stale_threshold × 2) seconds:
  --     emit via fn_iu_emit_event with:
  --       event_domain='system'
  --       event_type='queue_worker_silent'       -- requires registry entry
  --       event_stream='alert'
  --       delivery_lane='immediate'
  --       event_severity = computed
  --       event_subject_table='queue_heartbeat'
  --       event_subject_ref=executor_id
  --       safe_payload = {executor_id, age_seconds, expected_cadence_seconds, gap_ratio}

New event_type system/queue_worker_silent must be added to event_type_registry — that ratification is itself a Council step (vocab change), not in scope of this design pack.


§4. Lifecycle / status

Heartbeat rows do not have a lifecycle in the §6.7 sense; they are a continuously-updated mirror of executor liveness. They are additive observability surface, not transactional state.


§5. Indexes / performance

  • queue_heartbeat is a small table (1 row per executor, expected count < 100).
  • PK on executor_id; an additional partial index on (last_beat_at) for fast stale scan is unnecessary at expected scale.
  • NOTIFY cost at 100 inserts/hour is negligible.

§6. Security / governance

  • Only executors register themselves with a one-time fn_queue_heartbeat_register(p_executor_id, p_kind, p_domain_scope, p_expected_cadence_seconds, p_stale_threshold_seconds, p_actor) (separate from the _write fn).
  • Registration requires workflow_admin role; write requires dot_executor.
  • NOTIFY payloads are signal-only — CHECK on last_beat_payload mirrors event_outbox.safe_payload.

§7. Rollback / disable

  • queue.notify.bridge_enabled = false → trigger guard returns; no NOTIFY emitted.
  • queue.heartbeat.stale_threshold_seconds = huge_number → stale-check becomes inert.
  • Drop queue_heartbeat table only after all executors deregistered (separate deprecation pack).

§8. Healthcheck / observability

  • v_queue_health is the single pane (§3.3).
  • D31 watchdog and D43 red_zones consume v_queue_health rows where status_hint='stale' or dead_letter rows are open.
  • Operator playbook: see doc 14 phase 3 prerequisites.

§9. Compatibility with Điều 45 v1.0

Clause Compliance
§5.1 PG SoT ✅ heartbeat row in PG; NOTIFY is wake-up only
§5.3 NOTIFY not SoT ✅ payload is signal; listeners pull from PG
§5.4 no pg_cron Phase 1 ✅ stale-check runs as a worker job (DP3 lease_reaper-style cadence)
§15.3 healthcheck contract fn_<domain>_route_worker_health pattern preserved; aggregated into v_queue_health
§15.4 cadence rule + emit ✅ implemented via fn_queue_stale_check
§15.5 silent_gap_is_a_health_violation ✅ this is the enforcement substrate

§10. Implementation prerequisites

  • Register system/queue_worker_silent event_type in registry (Council vocab gate; §6.4 inclusion rubric: "Health/integrity alert severity ≥ warning" applies).
  • DP3 lease_reaper-style cadence reaches fn_queue_stale_check at < stale_threshold / 2.
  • D31 watchdog / D43 red_zones adapter must learn the v_queue_health view (separate D31/D43 design pack).

§11. Open questions

# Question Routed to
DP4-Q1 Approve heartbeat write on every tick (even no-work) vs only when work seen? Council
DP4-Q2 Approve executor_id text PK (no FK) vs FK to a new executor_registry table? Council
DP4-Q3 NOTIFY channel naming: queue_wake_<domain> vs queue_wake_<event_type>? Council
DP4-Q4 Should queue_worker_silent event be once-per-stale-period or repeating? Council
DP4-Q5 Stale_threshold default per executor_kind (e.g. external_worker tolerates more)? Council
DP4-Q6 Should hc_executor_last_run dot_config key be migrated to queue_heartbeat row, or coexist? Council

§12. Self-test

self_test:
  cites_dieu45_section: §5.1,§5.3,§5.4,§15.3,§15.4,§15.5
  defines_status_lifecycle_compatible_with_§6_7: n/a (observability surface)
  defines_idempotency_key: heartbeat upsert is idempotent per executor_id
  defines_retry_dlq: defers to DP3 substrate
  defines_lease: n/a
  defines_observability_view: v_queue_health
  defines_dot_config_disable_flag: queue.notify.bridge_enabled, queue.heartbeat.stale_threshold_seconds
  defines_executor_set_compatible_with_§11_5: yes (CHECK on executor_kind verbatim)
  no_vector_in_transient: yes (CHECK on last_beat_payload)
  signal_not_data: yes (NOTIFY payload signal; heartbeat payload signal)
  pg_sot: yes
  rollback_concept: yes
  no_pg_cron_dependency_phase_1: yes
  no_pg_18_dependency: yes
  no_mutation_authored: yes

DP4 design. No mutation. Authored 2026-05-26 by Claude Opus 4.7 (1M).

Back to Knowledge Hub knowledge/dev/laws/dieu44-trien-khai/v0.6-dieu45-full-queue-orchestration-design-pack/06-DP4-notify-worker-heartbeat.md