03 — DP1 Scheduler Decision (pg_cron vs external poll vs LISTEN/NOTIFY vs hybrid)
DESIGN-ONLY. Cites Điều 45 §5.4, §9, §15.4, §15.5. No extension install. No worker started.
§1. Goal
Decide how the queue substrate gets ticked: who calls fn_iu_route_worker_run (and future per-domain workers, cleanup sweeps, lease reapers, and heartbeat checks), and at what cadence, without violating the §15.5 silent-gap invariant.
Implicit sub-goals: (a) make the cadence visible (auditable, configurable, not lore); (b) preserve §5.4's explicit "no pg_cron at Phase 1" stance; (c) leave a documented path to add pg_cron later through its own sub-amendment.
§2. Current state
| Fact | Value |
|---|---|
| pg_cron installed? | No |
| Worker function | fn_iu_route_worker_run (live) |
| Worker last_run_at | 2026-05-22 11:31:41 (95h ago at this re-survey) |
| Worker enabled flag | iu_core.route_worker_enabled = true (live, default-open) |
| HC executor cadence | Frequency unknown; hc_executor_last_run is updated on each tick |
| External invokers known | Hermes batch, Codex sessions, Directus flows, host cron (per §5.4) — none of which has a published cadence contract |
| Result | 4-day silent gap; under §15.5 a gap of this kind counts as a health violation going forward |
§3. Options considered
| Option | What it means | Pros | Cons | Verdict |
|---|---|---|---|---|
| A. pg_cron only | Install pg_cron 1.6+; cron rows drive ticks | In-DB; cadence is SoT in PG; survives external orchestrator outages | §5.4 forbids; requires extension install + amendment; ties scheduler upgrades to PG upgrades; pg_cron on PG18 unconfirmed | ❌ Out of scope at Phase 1 |
| B. External poll only | Hermes/Codex/host cron call worker fn at documented cadence; no extension | Honours §5.4; minimal change; no new DB surface | Cadence lives outside PG; if external system stops, queue stalls silently (today's exact failure) | ⚠️ Necessary but insufficient |
| C. LISTEN/NOTIFY only | Trigger emits NOTIFY on event_outbox INSERT; external listener calls worker fn | Low latency; wakes only when work arrives | NOTIFY can be missed (process restart, reconnect gap); §5.1/§5.3 forbid as SoT | ❌ Insufficient on its own |
| D. Hybrid B + C + heartbeat | External poll at coarse cadence (e.g. 30–60s) as floor; NOTIFY adds low-latency wake; heartbeat row makes silent gaps observable | Honours all law clauses; closes silent gap; low cost; pg_cron-ready later | More moving parts to document | ✅ RECOMMENDED |
§4. Proposed design — Option D Hybrid
§4.1 Three layers
┌─────────────────────────────────────────────────────────────────┐
│ Layer 1 — External coarse cadence (mandatory floor) │
│ - One named external orchestrator (Hermes, host cron container) │
│ - Calls SELECT fn_<domain>_route_worker_run() every N seconds │
│ - N = queue.tick.<worker_name>.interval_seconds dot_config │
│ - Default suggestion: 60s (Council to ratify in DP3) │
│ - Writes its own heartbeat row (DP4) │
└─────────────────────────────────────────────────────────────────┘
↑ poll
┌─────────────────────────────────────────────────────────────────┐
│ Layer 2 — LISTEN/NOTIFY low-latency wake (opt-in, signal-only) │
│ - Optional trigger AFTER INSERT on event_outbox / job_queue │
│ emits pg_notify('queue_wake_<event_domain>', json_signal) │
│ - External worker LISTENs to channel; receiving NOTIFY triggers │
│ an immediate worker tick instead of waiting for the next poll │
│ - Worker MUST tolerate missed NOTIFY (Layer 1 catches it) │
│ - Enabled via queue.notify.bridge_enabled dot_config │
└─────────────────────────────────────────────────────────────────┘
↑ NOTIFY
┌─────────────────────────────────────────────────────────────────┐
│ Layer 3 — Heartbeat + silent-gap detector (mandatory) │
│ - Each executor writes queue_heartbeat row each tick (DP4) │
│ - Stale-detection function runs as a worker job │
│ (job_kind='stale_heartbeat_check') and emits │
│ system/queue_worker_silent event when threshold exceeded │
└─────────────────────────────────────────────────────────────────┘
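The interplay of Layers 1 and 2 can be sketched in a few lines of Python. This is an illustrative simulation only, not the worker's implementation: wait_for_tick and run_ticks are hypothetical names, and a threading.Event stands in for a NOTIFY arriving on a LISTEN connection. The point it shows is that a wake signal only shortens the wait; the poll interval still fires when no signal arrives, which is why a missed NOTIFY costs latency but not correctness.

```python
import threading


def wait_for_tick(wake: threading.Event, interval_seconds: float) -> str:
    """Block until a wake signal arrives or the poll interval elapses.

    Returns "notify" when woken early (Layer 2) and "poll" when the
    interval expired with no signal (Layer 1 floor).
    """
    woken = wake.wait(timeout=interval_seconds)
    wake.clear()  # coalesce a burst of signals into a single tick
    return "notify" if woken else "poll"


def run_ticks(wake, interval_seconds, run_worker, max_ticks):
    """Tick max_ticks times; every wake-up, whatever its cause, runs the worker."""
    reasons = []
    for _ in range(max_ticks):
        reasons.append(wait_for_tick(wake, interval_seconds))
        run_worker()
    return reasons
```

In a real listener the interval would come from queue.tick.<worker_name>.interval_seconds and run_worker would issue the SELECT against the worker function; both are left abstract here.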
§4.2 Cadence catalogue (proposed defaults — Council ratifies)
NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY
queue.tick.iu_outbound_default.interval_seconds: 60
queue.tick.staging_outbound.interval_seconds: 60 # future per-domain worker
queue.tick.job_queue_default.interval_seconds: 30
queue.tick.stale_heartbeat_check.interval_seconds: 120
queue.tick.staging_cleanup_sweep.interval_seconds: 3600 # hourly
queue.tick.lease_reaper.interval_seconds: 60
queue.heartbeat.stale_threshold_seconds: 300 # 5× the 60s default tick (10× the smallest tick, 30s)
queue.notify.bridge_enabled: false # Phase 1 default OFF
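A quick way to sanity-check the proposed defaults is to compute how many ticks each worker may miss before it crosses the stale threshold. The sketch below copies the catalogue values into a Python dict; it assumes, purely for illustration, that the single queue.heartbeat.stale_threshold_seconds value applies to every worker, which the catalogue does not state. The function names are hypothetical.

```python
# Values copied from the proposed catalogue; none of them is ratified.
TICK_INTERVALS = {
    "iu_outbound_default": 60,
    "staging_outbound": 60,
    "job_queue_default": 30,
    "stale_heartbeat_check": 120,
    "staging_cleanup_sweep": 3600,
    "lease_reaper": 60,
}
STALE_THRESHOLD_SECONDS = 300


def missed_ticks_before_stale(interval_seconds, threshold=STALE_THRESHOLD_SECONDS):
    """Whole ticks that fit inside the threshold before a worker reads as stale."""
    return threshold // interval_seconds


def slower_than_threshold(intervals=TICK_INTERVALS, threshold=STALE_THRESHOLD_SECONDS):
    """Workers whose normal cadence is already longer than the global threshold."""
    return sorted(name for name, secs in intervals.items() if secs > threshold)
```

Under that assumption the 60s workers get five ticks of slack and the 30s worker ten, while the hourly staging_cleanup_sweep would read as stale between normal runs. Whether slow workers need their own threshold is not settled by this document and may be worth raising alongside DP1-Q2.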
§4.3 pg_cron path (deferred, design seam only)
If/when pg_cron is installed (separate DP1-extension pack, separate amendment of §5.4 + §9.4):
NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY
cron jobs (proposed naming):
queue_iu_outbound_default * * * * *
queue_job_queue_default * * * * *
queue_stale_heartbeat_check */2 * * * *
queue_staging_cleanup_sweep 0 * * * *
queue_lease_reaper * * * * *
config gates:
queue.scheduler.mode = 'external' | 'pg_cron' | 'hybrid'
In pg_cron mode, Layer 1 cadence is taken over by pg_cron. Layer 2 (NOTIFY) and Layer 3 (heartbeat) remain unchanged. External orchestrators can be retired or kept as redundant ticks (no double-fire because each worker function uses lease + SKIP LOCKED).
§5. Tables / views / functions (sketch)
NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY
view v_queue_scheduler_status -- aggregates cadence config + last_run_at per worker
function fn_queue_tick_request(p_worker_name text) returns jsonb
-- thin guard that checks dot_config.<worker_name>.enabled, acquires lease, calls underlying worker fn
trigger trg_queue_wake_event_outbox AFTER INSERT ON event_outbox
FOR EACH ROW EXECUTE FUNCTION fn_queue_wake_notify();
trigger trg_queue_wake_job_queue AFTER INSERT ON job_queue
FOR EACH ROW EXECUTE FUNCTION fn_queue_wake_notify();
function fn_queue_wake_notify() returns trigger
-- emits pg_notify('queue_wake_'||NEW.event_domain, signal_payload)
-- enabled only when queue.notify.bridge_enabled = 'true'
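The control flow intended for fn_queue_tick_request (check config, take a lease, call the worker) can be illustrated with an in-memory Python stand-in. Everything here is an assumption made for the sake of the sketch: CONFIG replaces dot_config, a non-blocking threading.Lock replaces the database lease, and treating iu_core.route_worker_enabled as a gate for every worker is a simplification. The returned dict only loosely mirrors the jsonb return type.

```python
import threading
from typing import Callable, Dict

# Hypothetical in-memory stand-ins for dot_config and the runtime lease.
CONFIG: Dict[str, str] = {
    "iu_core.route_worker_enabled": "true",
    "queue.tick.iu_outbound_default.interval_seconds": "60",
}
_LEASES: Dict[str, threading.Lock] = {}


def tick_request(worker_name: str, worker_fn: Callable[[], object]) -> dict:
    """Run worker_fn only if the worker is enabled and its lease is free."""
    if CONFIG.get("iu_core.route_worker_enabled") != "true":
        return {"worker": worker_name, "status": "disabled"}
    key = f"queue.tick.{worker_name}.interval_seconds"
    if int(CONFIG.get(key, "0")) <= 0:  # interval 0 acts as the disable flag
        return {"worker": worker_name, "status": "disabled"}
    lease = _LEASES.setdefault(worker_name, threading.Lock())
    if not lease.acquire(blocking=False):  # another tick is already running
        return {"worker": worker_name, "status": "lease_held"}
    try:
        return {"worker": worker_name, "status": "ran", "processed": worker_fn()}
    finally:
        lease.release()
```

Because a second caller gets "lease_held" instead of blocking, redundant tickers (for example an external poll and a NOTIFY-triggered tick landing together) degrade to a no-op and do not double-fire.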
§6. Lifecycle / status
DP1 itself has no row-level lifecycle (it's a scheduling policy). It enforces these existing lifecycles:
- iu_route_worker_cursor.last_run_at: written each tick.
- queue_heartbeat.last_beat_at: written each tick (DP4).
- job_queue.status: advanced per the §6.7 work state machine (DP2).
§7. Indexes / performance notes
- The NOTIFY bridge sends one signal per INSERT. At 131k events/18 days ≈ 300 NOTIFYs/hour, well within the capacity of PostgreSQL's notification queue.
- External poll at 60s on a queue with no work executes one cheap SELECT … LIMIT 0 style call; the cost is negligible and lease acquire is sub-ms.
- pg_cron later: cadence rows are O(few); cost is the underlying worker function cost (unchanged).
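The NOTIFY volume estimate is plain arithmetic on the observed figures (131k events over 18 days); the short sketch below works it through. notify_rates is a hypothetical helper name, and the calculation assumes events are spread evenly over the window, which real traffic will not be.

```python
def notify_rates(events: int, days: int) -> dict:
    """Average NOTIFY rates if each INSERT emits exactly one signal."""
    hours = days * 24
    per_hour = events / hours
    return {
        "per_hour": per_hour,
        "per_minute": per_hour / 60,
        "mean_gap_seconds": hours * 3600 / events,
    }


rates = notify_rates(131_000, 18)
print(round(rates["per_hour"]))             # 303
print(round(rates["mean_gap_seconds"], 1))  # 11.9
```

At roughly one event every 12 seconds on average, a 60s poll would find about five new rows per tick even with the NOTIFY bridge switched off.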
§8. Security / governance
- The fn_queue_tick_request guard is the only public surface; underlying worker fns are reachable only via the guard or directly by privileged roles (workflow_admin, dot_executor).
- NOTIFY payload is signal-only: {worker_hint, event_domain, occurred_at}, with no body and no payload_ref. Listeners pull from PG.
- Cadence config keys land in dot_config under the queue.tick.* namespace; mutation requires DOT (per Điều 35) with workflow_admin role.
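The signal-only rule can be made concrete with a small builder and a listener-side check. This is a sketch under stated assumptions: the function names are hypothetical, JSON is assumed as the encoding (the design says json_signal but fixes no schema), and the size guard reflects PostgreSQL's default limit that a NOTIFY payload must be shorter than 8000 bytes.

```python
import json
from datetime import datetime, timezone

ALLOWED_KEYS = {"worker_hint", "event_domain", "occurred_at"}
MAX_PAYLOAD_BYTES = 7999  # default PostgreSQL limit: shorter than 8000 bytes


def build_wake_signal(worker_hint: str, event_domain: str, occurred_at: datetime) -> str:
    """Serialise the three signal fields; no body and no payload_ref."""
    signal = {
        "worker_hint": worker_hint,
        "event_domain": event_domain,
        "occurred_at": occurred_at.astimezone(timezone.utc).isoformat(),
    }
    text = json.dumps(signal, separators=(",", ":"))
    if len(text.encode("utf-8")) > MAX_PAYLOAD_BYTES:
        raise ValueError("signal too large for NOTIFY")
    return text


def is_signal_only(payload: str) -> bool:
    """Listener-side check: reject anything carrying keys beyond the signal set."""
    data = json.loads(payload)
    return isinstance(data, dict) and set(data) <= ALLOWED_KEYS
```

A listener that receives a valid signal still reads the actual work from PG; the payload only tells it which worker to tick sooner.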
§9. Rollback / disable
- Layer 1: set iu_core.route_worker_enabled='false' (existing flag) or set queue.tick.<worker>.interval_seconds=0 → tick guard returns immediately.
- Layer 2: set queue.notify.bridge_enabled='false' → trigger emits nothing.
- Layer 3: set queue.heartbeat.stale_threshold_seconds to a very large number → silent-gap detection becomes inert (not recommended, but reversible).
§10. Healthcheck / observability
v_queue_scheduler_status (proposed) exposes:
worker_name | configured_interval_seconds | last_run_at | age_seconds | stale (boolean) | mode (external|pg_cron|hybrid)
A row with stale=true is the §15.5 signal. The stale-check job emits system/queue_worker_silent events to drive D43 red_zones.
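How one row of the proposed view might be derived is sketched below in Python, using the gap recorded in §2 (last run 2026-05-22 11:31:41, surveyed about 95 hours later) as the worked input. scheduler_status_row is a hypothetical helper; the worker name, the strict "greater than" comparison and the 300s default are assumptions taken from the proposed catalogue, not ratified behaviour.

```python
from datetime import datetime, timedelta


def scheduler_status_row(worker_name, interval_seconds, last_run_at, now,
                         stale_threshold_seconds=300, mode="external"):
    """Build one status row: age since last run and whether it counts as stale."""
    age_seconds = int((now - last_run_at).total_seconds())
    return {
        "worker_name": worker_name,
        "configured_interval_seconds": interval_seconds,
        "last_run_at": last_run_at,
        "age_seconds": age_seconds,
        "stale": age_seconds > stale_threshold_seconds,
        "mode": mode,
    }


last_run = datetime(2026, 5, 22, 11, 31, 41)
row = scheduler_status_row("iu_outbound_default", 60, last_run,
                           now=last_run + timedelta(hours=95))
print(row["age_seconds"], row["stale"])  # 342000 True
```

With these inputs the 95-hour gap exceeds the threshold by three orders of magnitude, so the row would have been flagged long before a manual re-survey found it.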
§11. Compatibility with Điều 45 v1.0
| Clause | Compliance |
|---|---|
| §5.1 PG SoT | ✅ Layer 1 + 3 anchor in PG; NOTIFY is wake-up only |
| §5.3 NOTIFY-not-SoT | ✅ Bridge is opt-in, signal-only |
| §5.4 no pg_cron at Phase 1 | ✅ Layer 1 = external; pg_cron only via separate amendment |
| §9 scheduling boundary | ✅ Cadence + deadline modelled explicitly |
| §15.4 cadence rule | ✅ Heartbeat + stale-check + event |
| §15.5 silent_gap_is_a_health_violation | ✅ Stale-check is the enforcement surface |
§12. Implementation prerequisites
- DP4 (heartbeat) MUST land before or with DP1 — otherwise Layer 3 is unenforced and §15.5 stays open.
- DP2 (job_queue) is helpful for the stale-check itself to be a job kind, but not strictly required (a worker fn can stand alone).
- One named external orchestrator must be identified and documented (Hermes batch, host cron container, dedicated tick container). This is a runbook decision, not a schema decision.
§13. Open questions
| # | Question | Routed to |
|---|---|---|
| DP1-Q1 | Approve hybrid Option D over pure external-poll? | Council |
| DP1-Q2 | Approve 60s default tick + 300s stale threshold? | Council |
| DP1-Q3 | Which external orchestrator owns the canonical cadence? Hermes, host cron, dedicated container, Directus flow? | Council / Ops |
| DP1-Q4 | Approve queue.notify.bridge_enabled=false at Phase 1 (NOTIFY adoption later)? | Council |
| DP1-Q5 | When does pg_cron sub-amendment open? After M jobs/sec? After PG18 ready? | Council |
§14. Self-test
self_test:
cites_dieu45_section: §5.4,§9,§15.4,§15.5
defines_status_lifecycle_compatible_with_§6_7: n/a (policy doc)
defines_idempotency_key: n/a (tick is idempotent by design)
defines_retry_dlq: defers to DP3
defines_lease: uses dot_iu_runtime_lease in fn_queue_tick_request
defines_observability_view: v_queue_scheduler_status
defines_dot_config_disable_flag: queue.tick.<w>.interval_seconds=0
defines_executor_set_compatible_with_§11_5: yes (Layer 1 orchestrator is "external_worker")
no_vector_in_transient: yes
signal_not_data: yes (NOTIFY payload signal-only)
pg_sot: yes
rollback_concept: yes
no_pg_cron_dependency_phase_1: yes
no_pg_18_dependency: yes
no_mutation_authored: yes
DP1 design. No mutation. Authored 2026-05-26 by Claude Opus 4.7 (1M).