KB-2C5F

03 — DP1 Scheduler Decision (pg_cron vs external poll vs LISTEN/NOTIFY vs hybrid)

Revision 1
Tags: design-pack, dieu-45, dp1, scheduler, pg-cron, listen-notify, external-poll, hybrid, design-only

DESIGN-ONLY. Cites Điều 45 §5.4, §9, §15.4, §15.5. No extension install. No worker started.


§1. Goal

Decide how the queue substrate gets ticked: who calls fn_iu_route_worker_run (and future per-domain workers, cleanup sweeps, lease reapers, and heartbeat checks), and at what cadence, without violating the §15.5 silent-gap invariant.

Implicit sub-goals: (a) make the cadence visible (auditable, configurable, not lore); (b) preserve §5.4's explicit "no pg_cron at Phase 1" stance; (c) leave a documented path to add pg_cron later through its own sub-amendment.


§2. Current state

| Fact | Value |
|---|---|
| pg_cron installed? | No |
| Worker function | fn_iu_route_worker_run (live) |
| Worker last_run_at | 2026-05-22 11:31:41 (95h ago at this re-survey) |
| Worker enabled flag | iu_core.route_worker_enabled = true (live, default-open) |
| HC executor cadence | Frequency unknown, but hc_executor_last_run is being updated each tick |
| External invokers known | Hermes batch, Codex sessions, Directus flows, host cron (per §5.4); none of them has a published cadence contract |
| Result | 4-day silent gap; per §15.5 this becomes a violation going forward |

§3. Options considered

| Option | What it means | Pros | Cons | Verdict |
|---|---|---|---|---|
| A. pg_cron only | Install pg_cron 1.6+; cron rows drive ticks | In-DB; cadence is SoT in PG; survives external orchestrator outages | §5.4 forbids; requires extension install + amendment; ties scheduler upgrades to PG upgrades; pg_cron on PG18 unconfirmed | ❌ Out of scope at Phase 1 |
| B. External poll only | Hermes/Codex/host cron call worker fn at documented cadence; no extension | Honours §5.4; minimal change; no new DB surface | Cadence lives outside PG; if external system stops, queue stalls silently (today's exact failure) | ⚠️ Necessary but insufficient |
| C. LISTEN/NOTIFY only | Trigger emits NOTIFY on event_outbox INSERT; external listener calls worker fn | Low latency; wakes only when work arrives | NOTIFY can be missed (process restart, reconnect gap); §5.1/§5.3 forbid as SoT | ❌ Insufficient on its own |
| D. Hybrid B + C + heartbeat | External poll at coarse cadence (e.g. 30–60s) as floor; NOTIFY adds low-latency wake; heartbeat row makes silent gaps observable | Honours all law clauses; closes silent gap; low cost; pg_cron-ready later | More moving parts to document | RECOMMENDED |

§4. Proposed design — Option D Hybrid

§4.1 Three layers

┌─────────────────────────────────────────────────────────────────┐
│ Layer 1 — External coarse cadence (mandatory floor)              │
│   - One named external orchestrator (Hermes, host cron container) │
│   - Calls SELECT fn_<domain>_route_worker_run() every N seconds   │
│   - N = queue.tick.<worker_name>.interval_seconds dot_config      │
│   - Default suggestion: 60s (Council to ratify in DP3)            │
│   - Writes its own heartbeat row (DP4)                            │
└─────────────────────────────────────────────────────────────────┘
                       ↑ poll
┌─────────────────────────────────────────────────────────────────┐
│ Layer 2 — LISTEN/NOTIFY low-latency wake (opt-in, signal-only)   │
│   - Optional trigger AFTER INSERT on event_outbox / job_queue     │
│     emits pg_notify('queue_wake_<event_domain>', json_signal)     │
│   - External worker LISTENs to channel; receiving NOTIFY triggers │
│     an immediate worker tick instead of waiting for the next poll │
│   - Worker MUST tolerate missed NOTIFY (Layer 1 catches it)       │
│   - Enabled via queue.notify.bridge_enabled dot_config             │
└─────────────────────────────────────────────────────────────────┘
                       ↑ NOTIFY
┌─────────────────────────────────────────────────────────────────┐
│ Layer 3 — Heartbeat + silent-gap detector (mandatory)            │
│   - Each executor writes queue_heartbeat row each tick (DP4)      │
│   - Stale-detection function runs as a worker job                 │
│     (job_kind='stale_heartbeat_check') and emits                  │
│     system/queue_worker_silent event when threshold exceeded      │
└─────────────────────────────────────────────────────────────────┘
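
The interplay of Layer 1 and Layer 2 can be sketched as a wait that ends either on a wake signal or on the poll interval, whichever comes first. This is a minimal in-process sketch, not the design's implementation: `threading.Event` stands in for a received NOTIFY, and `wait_for_tick`, `run_ticks`, and the `tick` callback are hypothetical names. A real listener would hold a database connection and call the worker function instead.

```python
import threading


def wait_for_tick(wake: threading.Event, interval_seconds: float) -> str:
    """Block until a wake signal arrives or the poll interval elapses.

    Returns "notify" when woken early (Layer 2) and "poll" when the
    interval expired (Layer 1 floor).
    """
    woken = wake.wait(timeout=interval_seconds)
    if woken:
        wake.clear()  # consume the signal so the next wait starts fresh
        return "notify"
    return "poll"


def run_ticks(wake, interval_seconds, tick, max_ticks):
    """Run a bounded number of ticks and record why each one fired."""
    reasons = []
    for _ in range(max_ticks):
        reason = wait_for_tick(wake, interval_seconds)
        tick()  # hypothetical stand-in for calling the worker function
        reasons.append(reason)
    return reasons
```

Because the timeout always expires eventually, a lost wake signal delays a tick by at most one interval; it never suppresses it.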

§4.2 Cadence catalogue (proposed defaults — Council ratifies)

NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY

queue.tick.iu_outbound_default.interval_seconds:      60
queue.tick.staging_outbound.interval_seconds:         60      # future per-domain worker
queue.tick.job_queue_default.interval_seconds:        30
queue.tick.stale_heartbeat_check.interval_seconds:    120
queue.tick.staging_cleanup_sweep.interval_seconds:    3600    # hourly
queue.tick.lease_reaper.interval_seconds:             60
queue.heartbeat.stale_threshold_seconds:              300     # 5× the 60s default tick
queue.notify.bridge_enabled:                          false   # Phase 1 default OFF

§4.3 pg_cron path (deferred, design seam only)

If/when pg_cron is installed (separate DP1-extension pack, separate amendment of §5.4 + §9.4):

NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY

cron jobs (proposed naming):
  queue_iu_outbound_default      * * * * *
  queue_job_queue_default        * * * * *
  queue_stale_heartbeat_check    */2 * * * *
  queue_staging_cleanup_sweep    0 * * * *
  queue_lease_reaper             * * * * *

config gates:
  queue.scheduler.mode = 'external' | 'pg_cron' | 'hybrid'

In pg_cron mode, Layer 1 cadence is taken over by pg_cron. Layer 2 (NOTIFY) and Layer 3 (heartbeat) remain unchanged. External orchestrators can be retired or kept as redundant ticks (no double-fire because each worker function uses lease + SKIP LOCKED).
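
One way to see how the §4.2 intervals relate to the cron rows above is a small mapping from seconds to a five-field cron expression. The sketch below is illustrative only; `interval_to_cron` is a hypothetical helper, and it assumes classic five-field cron with one-minute resolution, which is why the 30s job_queue_default tick lands on `* * * * *`.

```python
def interval_to_cron(interval_seconds: int) -> str:
    """Map a tick interval in seconds to a five-field cron expression.

    Assumption: five-field cron cannot express sub-minute cadences, so
    anything under 60s is coarsened to "every minute".
    """
    if interval_seconds <= 0:
        raise ValueError("interval must be positive; 0 means disabled")
    if interval_seconds < 60:
        return "* * * * *"
    if interval_seconds % 60 != 0:
        raise ValueError(f"{interval_seconds}s is not a whole number of minutes")
    minutes = interval_seconds // 60
    if minutes == 1:
        return "* * * * *"
    if minutes < 60 and 60 % minutes == 0:
        return f"*/{minutes} * * * *"
    if minutes % 60 == 0:
        hours = minutes // 60
        if hours == 1:
            return "0 * * * *"
        if hours < 24 and 24 % hours == 0:
            return f"0 */{hours} * * *"
        if hours == 24:
            return "0 0 * * *"
    raise ValueError(f"no clean cron expression for {interval_seconds}s")
```

Intervals that do not divide evenly into an hour or a day are rejected rather than approximated, so a cadence change never silently drifts from its configured value.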


§5. Tables / views / functions (sketch)

NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY

view  v_queue_scheduler_status   -- aggregates cadence config + last_run_at per worker

function fn_queue_tick_request(p_worker_name text) returns jsonb
  -- thin guard that checks dot_config.<worker_name>.enabled, acquires lease, calls underlying worker fn

trigger trg_queue_wake_event_outbox  AFTER INSERT ON event_outbox
  FOR EACH ROW EXECUTE FUNCTION fn_queue_wake_notify();
trigger trg_queue_wake_job_queue     AFTER INSERT ON job_queue
  FOR EACH ROW EXECUTE FUNCTION fn_queue_wake_notify();
function fn_queue_wake_notify() returns trigger
  -- emits pg_notify('queue_wake_'||NEW.event_domain, signal_payload)
  -- enabled only when queue.notify.bridge_enabled = 'true'
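
The control flow of the fn_queue_tick_request guard can be modelled in memory to make the order of checks explicit. This is a sketch under stated assumptions, not the function's definition: `TickGuard`, the status strings, and the in-memory lease set are hypothetical, a plain dict stands in for dot_config, and a missing or zero interval is treated as "disabled".

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class TickGuard:
    """In-memory stand-in for the fn_queue_tick_request guard."""
    config: Dict[str, str]                 # stands in for dot_config
    workers: Dict[str, Callable[[], int]]  # worker name -> worker function
    held_leases: Set[str] = field(default_factory=set)

    def tick_request(self, worker_name: str) -> dict:
        key = f"queue.tick.{worker_name}.interval_seconds"
        # Gate 1: a missing key or 0 means the worker is disabled (assumption).
        if int(self.config.get(key, "0")) == 0:
            return {"worker": worker_name, "status": "disabled"}
        # Gate 2: only one tick per worker may hold the lease at a time.
        if worker_name in self.held_leases:
            return {"worker": worker_name, "status": "lease_held"}
        self.held_leases.add(worker_name)
        try:
            processed = self.workers[worker_name]()
        finally:
            # Release the lease even if the worker raises.
            self.held_leases.discard(worker_name)
        return {"worker": worker_name, "status": "ran", "processed": processed}
```

Checking the disable gate before the lease means a disabled worker never contends for a lease, which keeps redundant orchestrators cheap.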

§6. Lifecycle / status

DP1 itself has no row-level lifecycle (it's a scheduling policy). It enforces these existing lifecycles:

  • iu_route_worker_cursor.last_run_at written each tick.
  • queue_heartbeat.last_beat_at written each tick (DP4).
  • job_queue.status advanced per §6.7 work state machine (DP2).

§7. Indexes / performance notes

  • The NOTIFY bridge sends one signal per INSERT. At 131k events/18 days ≈ 300 NOTIFYs/hour (about 5 per minute), well under the capacity of PostgreSQL's notification queue.
  • External poll at 60s on a queue with no work executes one cheap call that finds nothing to claim; the cost is negligible, and lease acquire is sub-ms.
  • pg_cron later: cadence rows are O(few); cost is the underlying worker function cost (unchanged).
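
The hourly NOTIFY rate can be re-derived from the event count and the observation window. The snippet below is only a convenience for that arithmetic; `notifies_per_hour` is a hypothetical helper and it assumes events are spread evenly over the window, which ignores bursts.

```python
def notifies_per_hour(event_count: int, window_days: float) -> float:
    """Average signals per hour, assuming one NOTIFY per inserted event."""
    return event_count / (window_days * 24)


rate = notifies_per_hour(131_000, 18)
print(f"{rate:.0f} NOTIFYs/hour, {rate / 60:.1f} per minute")
```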

§8. Security / governance

  • The fn_queue_tick_request guard is the only public surface; underlying worker fns are reachable only via the guard or directly by privileged roles (workflow_admin, dot_executor).
  • NOTIFY payload is signal-only: {worker_hint, event_domain, occurred_at} — no body, no payload_ref. Listeners pull from PG.
  • Cadence config keys land in dot_config under the queue.tick.* namespace; mutation requires DOT (per Điều 35) with workflow_admin role.
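
The signal-only rule for the NOTIFY payload can be enforced with an allow-list on both the sending and the receiving side. This is a minimal sketch: the three keys come from the bullet above, while `build_wake_signal`, `parse_wake_signal`, and the ISO 8601 timestamp encoding are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone

# The only keys a wake signal may carry; no body, no payload_ref.
ALLOWED_KEYS = {"worker_hint", "event_domain", "occurred_at"}


def build_wake_signal(worker_hint: str, event_domain: str, occurred_at: datetime) -> str:
    """Serialise a signal-only payload as JSON text."""
    return json.dumps({
        "worker_hint": worker_hint,
        "event_domain": event_domain,
        "occurred_at": occurred_at.isoformat(),
    })


def parse_wake_signal(raw: str) -> dict:
    """Parse a payload and reject anything beyond the allowed keys."""
    signal = json.loads(raw)
    if not isinstance(signal, dict):
        raise ValueError("wake signal must be a JSON object")
    extra = set(signal) - ALLOWED_KEYS
    if extra:
        raise ValueError(f"wake signal carries non-signal keys: {sorted(extra)}")
    return signal
```

A listener that rejects unexpected keys fails loudly if a later change starts leaking data into the channel, instead of quietly depending on it.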

§9. Rollback / disable

  • Layer 1: set iu_core.route_worker_enabled='false' (existing flag) or set queue.tick.<worker>.interval_seconds=0 → tick guard returns immediately.
  • Layer 2: set queue.notify.bridge_enabled='false' → trigger emits nothing.
  • Layer 3: set queue.heartbeat.stale_threshold_seconds to a very large number → silent-gap detection becomes inert (not recommended, but reversible).

§10. Healthcheck / observability

v_queue_scheduler_status (proposed) exposes:

worker_name | configured_interval_seconds | last_run_at | age_seconds | stale (boolean) | mode (external|pg_cron|hybrid)

A row with stale=true is the §15.5 signal. The stale-check job emits system/queue_worker_silent events to drive D43 red_zones.
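
How one row of the proposed view might be derived is sketched below. It is an assumption-laden illustration, not the view definition: `scheduler_status_row` is a hypothetical helper, it applies a single global threshold to every worker (the source does not say whether thresholds are per worker), and pairing the `iu_outbound_default` key with the §2 last_run_at is also an assumption.

```python
from datetime import datetime, timedelta


def scheduler_status_row(worker_name, configured_interval_seconds, last_run_at,
                         now, stale_threshold_seconds=300, mode="external") -> dict:
    """Compute the columns of one status row from a worker's last run time."""
    age_seconds = int((now - last_run_at).total_seconds())
    return {
        "worker_name": worker_name,
        "configured_interval_seconds": configured_interval_seconds,
        "last_run_at": last_run_at,
        "age_seconds": age_seconds,
        "stale": age_seconds > stale_threshold_seconds,
        "mode": mode,
    }


# The gap recorded in §2: last run 95 hours before the re-survey.
last_run = datetime(2026, 5, 22, 11, 31, 41)
row = scheduler_status_row("iu_outbound_default", 60, last_run,
                           last_run + timedelta(hours=95))
```

With the §2 figures the row has age_seconds = 342000 (95 × 3600) and stale = true, which is the condition the stale-check job would report.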


§11. Compatibility with Điều 45 v1.0

| Clause | Compliance |
|---|---|
| §5.1 PG SoT | ✅ Layer 1 + 3 anchor in PG; NOTIFY is wake-up only |
| §5.3 NOTIFY-not-SoT | ✅ Bridge is opt-in, signal-only |
| §5.4 no pg_cron at Phase 1 | ✅ Layer 1 = external; pg_cron only via separate amendment |
| §9 scheduling boundary | ✅ Cadence + deadline modelled explicitly |
| §15.4 cadence rule | ✅ Heartbeat + stale-check + event |
| §15.5 silent_gap_is_a_health_violation | ✅ Stale-check is the enforcement surface |

§12. Implementation prerequisites

  • DP4 (heartbeat) MUST land before or with DP1 — otherwise Layer 3 is unenforced and §15.5 stays open.
  • DP2 (job_queue) is helpful for the stale-check itself to be a job kind, but not strictly required (a worker fn can stand alone).
  • One named external orchestrator must be identified and documented (Hermes batch, host cron container, dedicated tick container). This is a runbook decision, not a schema decision.

§13. Open questions

| # | Question | Routed to |
|---|---|---|
| DP1-Q1 | Approve hybrid Option D over pure external-poll? | Council |
| DP1-Q2 | Approve 60s default tick + 300s stale threshold? | Council |
| DP1-Q3 | Which external orchestrator owns the canonical cadence? Hermes, host cron, dedicated container, Directus flow? | Council / Ops |
| DP1-Q4 | Approve queue.notify.bridge_enabled=false at Phase 1 (NOTIFY adoption later)? | Council |
| DP1-Q5 | When does pg_cron sub-amendment open? After M jobs/sec? After PG18 ready? | Council |

§14. Self-test

self_test:
  cites_dieu45_section: §5.4,§9,§15.4,§15.5
  defines_status_lifecycle_compatible_with_§6_7: n/a (policy doc)
  defines_idempotency_key: n/a (tick is idempotent by design)
  defines_retry_dlq: defers to DP3
  defines_lease: uses dot_iu_runtime_lease in fn_queue_tick_request
  defines_observability_view: v_queue_scheduler_status
  defines_dot_config_disable_flag: queue.tick.<w>.interval_seconds=0
  defines_executor_set_compatible_with_§11_5: yes (Layer 1 orchestrator is "external_worker")
  no_vector_in_transient: yes
  signal_not_data: yes (NOTIFY payload signal-only)
  pg_sot: yes
  rollback_concept: yes
  no_pg_cron_dependency_phase_1: yes
  no_pg_18_dependency: yes
  no_mutation_authored: yes

DP1 design. No mutation. Authored 2026-05-26 by Claude Opus 4.7 (1M).
