KB-137B

08 — Open Questions + Carry-Forward to Phase 2

Revision 1
dieu-45, phase-1, open-questions, carry-forward, phase-2, council, 2026-05-26

Decisions deferred from Phase 1 (require Council ratification before Phase 2)

DC-1. Executor whitelist CHECK on job_queue.executor?

Phase 1 stores executor identity only in lease_owner (free text, no CHECK). job_queue itself has no executor column — that intentionally avoids tying every row to an executor at enqueue time.

queue_heartbeat.executor_kind already enforces the §11.5 7-name set: DOT, Agent, Hermes, Codex, PG_worker, external_worker, future_Kestra_adapter.

Question: should we add a target_executor_kind text NULL CHECK (… §11.5 set …) column to job_queue in Phase 2, or keep enqueue executor-agnostic and let consumer_registry (Phase 4) bind job_kind → executor?

Default recommendation: keep agnostic; let DP6 consumer_registry route. Adding the column now would force a schema-level coupling that DP6 explicitly aims to invert.
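
To make the agnostic option concrete, here is a minimal sketch of routing at claim time instead of at enqueue time. The seven executor names come from the §11.5 set quoted above; the registry contents and the `register` / `route` helpers are hypothetical, since the real consumer_registry schema is not defined in Phase 1.

```python
from typing import Dict, Optional

# The seven executor kinds listed for §11.5 in this document.
EXECUTOR_KINDS = frozenset({
    "DOT", "Agent", "Hermes", "Codex",
    "PG_worker", "external_worker", "future_Kestra_adapter",
})

# Hypothetical bindings: the job_kind names appear in this document,
# but which executor serves them is an assumption for illustration.
CONSUMER_REGISTRY: Dict[str, str] = {
    "email_send": "Hermes",
    "vector_sync_drain": "PG_worker",
}


def register(job_kind: str, executor_kind: str) -> None:
    """Bind a job_kind to an executor, rejecting names outside the set."""
    if executor_kind not in EXECUTOR_KINDS:
        raise ValueError(f"unknown executor_kind: {executor_kind!r}")
    CONSUMER_REGISTRY[job_kind] = executor_kind


def route(job_kind: str) -> Optional[str]:
    """Resolve the executor when a job is claimed; None means unbound."""
    return CONSUMER_REGISTRY.get(job_kind)
```

With this shape, rebinding a job_kind to another executor is a registry update and leaves already-enqueued rows untouched, which is the decoupling the recommendation relies on.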

DC-2. NOTIFY channel naming

Phase 1 ships with queue.notify.enabled=false. When the gate is flipped on, which channel name should be used? DP4-Q3 proposes queue_wake_<domain>. No code emits notifications yet, so either choice carries no technical debt.
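
Whichever name is ratified, it may help to build it in one place. The sketch below assumes the queue_wake_<domain> proposal and that domains are lowercase snake_case; the helper name `wake_channel` and the example domain are hypothetical. The 63-byte limit is PostgreSQL's default maximum identifier length, which applies because NOTIFY channel names are identifiers.

```python
import re

# PostgreSQL's default identifier limit (NAMEDATALEN - 1).
MAX_IDENTIFIER_BYTES = 63

# Assumption: domains are lowercase snake_case, so the channel never needs quoting.
_DOMAIN_RE = re.compile(r"[a-z][a-z0-9_]*")


def wake_channel(domain: str) -> str:
    """Build the NOTIFY channel name for a queue domain."""
    if not _DOMAIN_RE.fullmatch(domain):
        raise ValueError(f"invalid domain: {domain!r}")
    channel = f"queue_wake_{domain}"
    if len(channel.encode("utf-8")) > MAX_IDENTIFIER_BYTES:
        raise ValueError(f"channel name too long: {channel!r}")
    return channel
```

For example, `wake_channel("iu_outbound")` returns `"queue_wake_iu_outbound"`, while a domain containing spaces or uppercase letters is rejected before it reaches SQL.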

DC-3. Lease reaper as DOT job vs. as fn

Phase 1 ships neither fn_job_lease_reaper nor a DOT entry for it. The design says the reaper can be a self-registered job_kind. Question: for Phase 2, should it be a fn invoked from an external scheduler, or a job_kind that the queue runs itself?

Default recommendation: ship fn_job_lease_reaper first (idempotent, callable from cron OR from the queue worker itself); register as job_kind only when consumer_registry lands.
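
The idempotency property is easier to see in a small in-memory model. This is a sketch, not the planned PL/pgSQL: the `Job` fields, the "leased" status name, the `run_after` field, and the one-second base delay are assumptions; only the behaviour (expired lease goes back to queued, attempts incremented, backoff applied) comes from this document.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Job:
    id: int
    status: str                          # "queued" or "leased" in this sketch
    attempts: int = 0
    lease_until: Optional[datetime] = None
    run_after: Optional[datetime] = None  # hypothetical "not before" field


def reap_expired_leases(jobs: List[Job], now: datetime,
                        base_delay: timedelta = timedelta(seconds=1)) -> int:
    """Return expired leases to the queue and report how many rows were reaped."""
    reaped = 0
    for job in jobs:
        expired = (job.status == "leased"
                   and job.lease_until is not None
                   and job.lease_until < now)
        if not expired:
            continue
        job.attempts += 1
        # Backoff factor 2^(attempts-1), with the exponent capped at 10.
        factor = 2 ** min(job.attempts - 1, 10)
        job.status = "queued"
        job.lease_until = None
        job.run_after = now + base_delay * factor
        reaped += 1
    return reaped
```

Because a reaped row no longer matches the "leased and expired" predicate, a second call with the same `now` changes nothing. That is what makes it safe to call from cron and from the queue worker at the same time.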

DC-4. DOT catalog entries for queue ops

Deferred (see 03-migration-phase-1-substrate.md §DOT catalog). Phase 4 (consumer_registry) is the natural home. Phase 2 may add a dot_queue_healthcheck entry as the only safe early DOT entry — it's read-only.

DC-5. Activating heartbeat for iu_outbound_default

See 06-heartbeat-and-silent-gap-status.md. Recommended in Phase 2 to close the live §15.5 violation. Two equivalent code paths (in-fn vs. external orchestrator). Council picks.

DC-6. Retention sweep (DP7)

The job_queue retention thresholds (30d hot, 90d archive, 365d DLQ) are not enforced in Phase 1. This is a Phase 7 task.
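
As a placeholder for the eventual sweep, the thresholds can be written down as data. The three durations are the ones quoted above; the tier keys, the helper name `past_retention`, and the reading that each tier has its own independent age limit are assumptions. What the sweep does with a row past its limit is deliberately left open.

```python
from datetime import timedelta

# Thresholds quoted in this document; the tier keys are illustrative.
RETENTION = {
    "hot": timedelta(days=30),
    "archive": timedelta(days=90),
    "dlq": timedelta(days=365),
}


def past_retention(tier: str, age: timedelta) -> bool:
    """True when a row of the given tier is older than its retention threshold."""
    return age > RETENTION[tier]
```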

DC-7. event_outbox → job_queue bridge

Phase 1 explicitly forbids automatic routing. Phase 6+ task; need consumer_registry first.

DC-8. Idempotency-key shape

Phase 1 uses text (free-form, callers pass UUID strings or business keys). DP2 design suggested uuid. Question: standardise on UUID-only in Phase 2 via CHECK (idempotency_key ~ '^[0-9a-f-]{36}$') or keep flexible?

Default recommendation: keep flexible — many likely callers (MARK/CUT, email_send, vector_sync_drain) already have natural business keys (file paths, message IDs) that are not UUIDs.
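
A quick check of what the proposed pattern would accept and reject. The sketch below mirrors the regex from the question in Python and compares it with a stricter 8-4-4-4-12 pattern; the stricter pattern and the sample keys are illustrations, not part of the Phase 1 design.

```python
import re

# Pattern as written in the question (36 characters drawn from hex digits and hyphens).
LOOSE = re.compile(r"[0-9a-f-]{36}")

# A stricter variant that also pins the hyphen positions (illustrative only).
STRICT = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")


def classify(key: str) -> dict:
    """Report which of the two patterns would accept the key."""
    return {
        "loose": LOOSE.fullmatch(key) is not None,
        "strict": STRICT.fullmatch(key) is not None,
    }


# Hypothetical keys of the kinds mentioned above.
SAMPLES = [
    "123e4567-e89b-12d3-a456-426614174000",  # canonical lowercase UUID
    "-" * 36,                                # not a UUID, but the right length
    "inbox/2026/05/report.pdf",              # file-path business key
    "<abc123@mail.example>",                 # message-ID business key
]

for sample in SAMPLES:
    print(sample, classify(sample))
```

Two things stand out: the pattern as written accepts strings that are not UUIDs (36 hyphens pass), and both patterns reject path-style and message-ID keys, as well as uppercase UUIDs. Either outcome supports keeping the column flexible unless callers are migrated first.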

DC-9. executor_kind for queue_heartbeat — should MOT ever appear?

§11.5 explicitly lists MOT as is_executor=false. The CHECK on queue_heartbeat.executor_kind does NOT permit MOT, matching the law. Confirmed correct.

DC-10. Backoff strategy

Phase 1 hardcodes exponential backoff: a factor of 2^(attempts-1), capped at 2^10. Question: should this become a dot_config.queue.retry.backoff_strategy ∈ {exponential, linear, fixed} selector?

Default recommendation: defer until a second strategy is actually wanted. YAGNI.
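
For reference, the hardcoded schedule can be sketched in a few lines. The base delay of one second and the function name `backoff_seconds` are assumptions; the document only fixes the factor 2^(attempts-1) and the cap at 2^10.

```python
def backoff_seconds(attempts: int, base_seconds: float = 1.0,
                    max_exponent: int = 10) -> float:
    """Delay before the next retry: base * 2^(attempts-1), factor capped at 2^max_exponent."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    return base_seconds * 2 ** min(attempts - 1, max_exponent)


# Delays for the first twelve attempts; the factor stops growing at attempt 11.
schedule = [backoff_seconds(n) for n in range(1, 13)]
```

With a one-second base, the schedule doubles from 1 s up to 1024 s at attempt 11 and stays there. A selector would only earn its place once a caller needs a curve this function cannot express.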

Carry-forward to Phase 2

  1. HB-A: Wire fn_queue_heartbeat_tick('iu_outbound_default','PG_worker',…) into the existing iu_outbound_default cadence (via Hermes or in-fn). Flip queue.heartbeat.enabled=true, then verify v_queue_health.executors_fresh=1.
  2. HB-B: Add fn_queue_stale_check event emission (system/queue_worker_silent) on the fresh→stale transition. This requires widening the event_type_registry CHECK if system/queue_worker_silent is new; per the roadmap, design this explicitly in Phase 3.
  3. LR-A: Author fn_job_lease_reaper, which moves rows with an expired lease (lease_until < now()) back to queued, increments attempts, and applies backoff. Include an integration test (BEGIN/ROLLBACK).
  4. DLR-A: Author fn_job_dead_letter_replay(dead_letter_id, authorization_source), gated by triage_status='manual_replay'. Include an operator runbook.
  5. WK-A: Phase 2 minimal worker (Python or PL/pgSQL?). Most likely Python under Hermes, looping fn_job_claim → run → fn_job_ack / fn_job_fail_or_retry. Council should pick the host process now.
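
The WK-A loop can be sketched independently of the host-process decision. In this sketch the four callables stand in for fn_job_claim, the job handler, fn_job_ack, and fn_job_fail_or_retry; their real signatures are not given in this document, so the parameter shapes here are assumptions.

```python
from typing import Any, Callable, Dict, Optional

Job = Dict[str, Any]


def run_worker(claim: Callable[[], Optional[Job]],
               handle: Callable[[Job], None],
               ack: Callable[[Job], None],
               fail_or_retry: Callable[[Job, str], None],
               max_jobs: int) -> int:
    """Claim and process jobs until the queue is empty or max_jobs is reached."""
    processed = 0
    while processed < max_jobs:
        job = claim()              # stand-in for fn_job_claim
        if job is None:            # nothing claimable right now
            break
        try:
            handle(job)            # the "run" step
        except Exception as exc:
            fail_or_retry(job, str(exc))   # stand-in for fn_job_fail_or_retry
        else:
            ack(job)               # stand-in for fn_job_ack
        processed += 1
    return processed
```

Passing the queue functions in as callables keeps the loop testable without a database; a Hermes-hosted worker would supply thin wrappers that execute the corresponding SQL functions.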

New lessons / memory writes (proposed)

  • Confirm with user: write feedback_dieu45_phase1_substrate_live_pattern documenting "for any future PG-native queue substrate work: ship gates default OFF, prove via BEGIN/ROLLBACK first, post-rollback diff to confirm inertness, separate hot/DLQ tables, denylist mirrors event_outbox safe_payload, partial-unique idempotency, SKIP LOCKED claim — all proven 2026-05-26 in 050 phase 1."
  • Refresh [[feedback-dieu45-silent-gap-violation-post-enactment]] noting substrate is now in place; activation is the Phase 2 task.

What is now safe to begin in parallel (no Phase 2 blockers)

  • Operator runbook: when to flip each queue.*.enabled gate, what to monitor in v_queue_health afterwards.
  • Architecture index update (HK1 from §22 post-enactment housekeeping): add the new queue_substrate_phase1 surface.
  • DP1-DP7 design pack annotation: tag DP2, DP3, DP4 docs with "implemented in 050" marker.