KB-29D8

06 — Existing vs Needed — Gap Matrix

Revision 1
Tags: survey, gap-matrix, queue-readiness, additive, missing, pg-native

Date: 2026-05-26 | Scope: Map each "system-wide PG-native queue principle" to live substrate; flag gaps.


§1. Principles vs reality

Principles from the mission:

queue_principles:
  - system_wide_not_cut_only
  - PG_first
  - PG_native
  - PG_driven
  - trigger_in_and_trigger_out_capable
  - supports_AGENT_DOT_worker_Hermes
  - supports_retry_dead_letter
  - supports_status_visibility
  - supports_idempotency
  - supports_scheduling
  - supports_batch
  - supports_cleanup
  - no_vector_for_transient_payload
  - compatible_with_governance/birth/DOT

Matrix:

| Principle | Live evidence | Gap |
|---|---|---|
| system_wide_not_cut_only | 9-domain × 7-stream × 2-lane vocab covers IU, piece, staging, system, birth_registry, governance, tac, kg, dot, health | Only 4 domains have active event types (iu, piece, staging, system); the remaining domains are vocab-only, with no producers wired yet |
| PG_first | All substrate is PG tables + plpgsql + triggers | |
| PG_native | No external queue (Redis/RabbitMQ/Kafka) involved | |
| PG_driven | Worker fn lives in PG, called externally; no in-DB scheduler | ⚠️ no pg_cron → calls come from outside |
| trigger_in_and_trigger_out_capable | iu_sql_event_route (trigger-IN bridge, dry-run) + fn_iu_auto_instantiate_from_event (trigger-OUT consumer) | ⚠️ both exist in prototype form; not generalized |
| supports_AGENT_DOT_worker_Hermes | DOT has dot_iu_command_run + dot_iu_command_catalog; Agent/Hermes use external orchestration | ❌ no unified job substrate for Agent/Hermes |
| supports_retry_dead_letter | iu_route_dead_letter (full schema), event_pending.error_count + last_error, iu_route_worker_cursor.dead_lettered | ⚠️ no max_attempts policy in dot_config; no retry backoff schedule |
| supports_status_visibility | iu_route_worker_cursor, v_iu_event_backlog, v_iu_composer_event_backlog, v_iu_route_dead_letter_open, v_iu_auto_instantiate_event_summary, v_dot_iu_command_run_health | ✅ good (could be unified into v_queue_health) |
| supports_idempotency | event_outbox unique constraints; staging idempotency_key; iu_auto_instantiate_event_log idempotency_key; dot_iu_command_run.params_digest | |
| supports_scheduling | None (no pg_cron, no schedule table) | ❌ scheduling boundary is "whoever calls fn_iu_route_worker_run" |
| supports_batch | Worker drains in single call (fn_iu_route_worker_run); event_pending designed for debounce | ⚠️ no formal batch policy; debounce config absent |
| supports_cleanup | fn_iu_staging_cleanup (3-pass), fn_iu_core_retention_cleanup, retention_enabled flag | ⚠️ retention gated off; cleanup is operator-invoked |
| no_vector_for_transient_payload | CHECK on event_outbox forbids vector, embedding; staging.vector_excluded=true; iu_vector_sync_point has no staging path | ✅ 4-layer guarantee live |
| compatible_with_governance/birth/DOT | event types for `staging.*`, `piece.*`, `iu.structure_*`; birth_registry auto-birth proven; DOT runs ledgered | |
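
The idempotency row relies on keys and digests such as `dot_iu_command_run.params_digest`. This survey does not show how those values are computed, so the following is only a minimal sketch of one common approach: a SHA-256 hash over canonical JSON (sorted keys, compact separators). The function name `params_digest` and the sample parameters are hypothetical, and the live column may be derived differently.

```python
import hashlib
import json


def params_digest(params: dict) -> str:
    """Return a stable hex digest for a parameter dict.

    Assumption: canonical JSON (sorted keys, compact separators) hashed
    with SHA-256. The real params_digest column may use another scheme.
    """
    canonical = json.dumps(
        params, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical command parameters: key order must not change the digest,
# so a repeated submission of the same command can be detected.
first = params_digest({"command": "cut", "iu_id": 42})
second = params_digest({"iu_id": 42, "command": "cut"})
different = params_digest({"command": "cut", "iu_id": 43})
```

Here `first` and `second` are equal while `different` is not, which is the property a unique constraint on the digest would depend on.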

§2. Substrate × use-case matrix

Columns are the 12 use cases from the mission. Rows are the substrate tables. ✅ = directly used today. ⚠ = capable but unwired. ❌ = no fit.

Substrate ↓ / Use case → tasks messages trigger-IN trigger-OUT worker jobs Agent/Hermes DOT jobs review/approval vector sync staging cleanup MOT workflows IU two-way
event_outbox ⚠ (stream=task) ⚠ (stream=comment) ⚠ via SQL route ⚠ (stream=review) ❌ (separate channel)
event_pending
event_read n/a n/a n/a n/a n/a n/a n/a n/a n/a
event_subscription n/a n/a
event_type_registry ✅ (vocab) n/a
iu_route_worker_cursor n/a
iu_route_dead_letter n/a
iu_sql_event_route n/a n/a ✅ (dry-run) n/a n/a n/a n/a n/a n/a n/a n/a
dot_iu_runtime_lease n/a
dot_iu_command_run n/a
dot_iu_command_catalog n/a n/a n/a
iu_staging_record/payload n/a n/a n/a n/a n/a n/a
iu_vector_sync_point + kb_vector_sync NOTIFY n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a
iu_auto_instantiate_event_log n/a n/a n/a n/a n/a n/a n/a n/a n/a
iu_lifecycle_log n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a

Reading: cells marked ✅ or ⚠ mean "substrate fits"; a column of mostly ❌ means "substrate gap".

The biggest gap is in the worker jobs, Agent/Hermes, and MOT workflows columns: there is no PG-native substrate for arbitrary long-running jobs today.


§3. Minimum additive deltas to close the gaps (NOT a design — just a list)

These would each need to be designed properly in the next pack:

| Δ | What | Why |
|---|---|---|
| Δ1 | event_type_registry rows for non-iu domains (`governance.*`, `tac.*`, `kg.*`, `dot.*`, `health.*`, `birth_registry.*`) | Activate dormant vocab |
| Δ2 | Additional worker cursors (iu_route_worker_cursor rows) for domains beyond iu | Drain `staging.*`, `system.*`, `dot.*` events |
| Δ3 | A worker invocation contract: either pg_cron rows OR a documented external-caller cadence (Hermes / Codex / Directus flow) | Currently implicit |
| Δ4 | Generic SQL-event routes in iu_sql_event_route (today only 1 row, dry-run; widen target_event_domain CHECK) | Generalize trigger-IN |
| Δ5 | Job substrate decision: extend event_outbox (new domain job) OR new job_outbox table | Cover Agent/Hermes/long-running |
| Δ6 | Retry policy table or dot_config keys (`event.retry.max_attempts`, `event.retry.backoff_seconds`) | Codify retry semantics |
| Δ7 | Scheduler table or dot_config keys (`schedule.<job_name>.cron`) | Even without pg_cron, formalize "what should run when" |
| Δ8 | Outbox-to-NOTIFY bridge (optional): emit `pg_notify('event_universal', json)` on event_outbox INSERT, consumed by external worker | Bring event substrate up to vector-sync parity |
| Δ9 | A v_queue_health aggregated view across cursors, dead-letter, retention candidates | One-pane observability |
| Δ10 | Ratify inclusion criteria (P3D4C0X §M.3) as enforced: registry check + producer rubric | Prevent activity-log scope creep |

None of Δ1–Δ10 require PG 18.
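
To make Δ6 concrete, here is a minimal sketch of the decision a retry policy would have to encode: retry after a delay, or move the event to dead letter. The two config keys are the ones proposed in Δ6. The values, the exponential doubling, and the function name `next_retry_delay` are assumptions for illustration, not a ratified policy.

```python
from typing import Mapping, Optional

# Hypothetical values for the dot_config keys proposed in Δ6.
RETRY_CONFIG: Mapping[str, int] = {
    "event.retry.max_attempts": 5,
    "event.retry.backoff_seconds": 30,
}


def next_retry_delay(
    error_count: int, config: Mapping[str, int] = RETRY_CONFIG
) -> Optional[int]:
    """Seconds to wait before the next attempt, or None to dead-letter.

    error_count is the number of failed attempts so far (>= 1).
    Assumption: exponential backoff, doubling from backoff_seconds.
    """
    if error_count < 1:
        raise ValueError("error_count must be >= 1")
    if error_count >= config["event.retry.max_attempts"]:
        return None  # attempts exhausted: route to dead letter
    return config["event.retry.backoff_seconds"] * 2 ** (error_count - 1)


# First failure waits 30 s, second 60 s, fourth 240 s; the fifth is final.
delays = [next_retry_delay(n) for n in (1, 2, 4, 5)]
```

With these sample values `delays` is `[30, 60, 240, None]`. Whether the rule lives in plpgsql inside the worker function or in the external caller is part of the design work that Δ6 defers.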


§4. Risk register on the substrate as-is

| Risk | Severity | Mitigation deferred to design |
|---|---|---|
| Worker runs once, never re-tickled (no scheduler) | Medium | Δ3 |
| event_pending could accumulate silently if worker breaks | Medium | Δ9 alerts + dead-letter |
| Subscription resolution view assumes default-broadcast fallback (per design doc); not verified live | Low | Re-verify live behavior |
| iu_sql_event_route CHECK forces dry_run when enabled=false; must be explicitly relaxed to enable | Low (good safety) | None |
| event_outbox has no partitioning; will reach >1M rows within 8 days at current system/issue_opened rate | Medium | Partition-by-month after >5M rows |
| iu_notification_event (legacy) is empty; compat layer status unclear | Low | Confirm fully deprecated then drop |
| Operator-driven cutting flow may bottleneck on human attention | Low | Optional puller in design |
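
The event_outbox growth risk is a simple rate projection, and the same arithmetic can be rerun whenever the insert rate changes. The sketch below assumes a constant daily insert rate. The row counts and rate are placeholder inputs, not measurements from the live system, and `days_until` is a hypothetical helper name.

```python
import math


def days_until(threshold_rows: int, current_rows: int, rows_per_day: float) -> int:
    """Whole days until a table reaches threshold_rows at a constant insert rate."""
    if rows_per_day <= 0:
        raise ValueError("rows_per_day must be positive")
    remaining = threshold_rows - current_rows
    if remaining <= 0:
        return 0  # threshold already reached
    return math.ceil(remaining / rows_per_day)


# Placeholder inputs: 250,000 rows today, 50,000 new rows per day.
to_one_million = days_until(1_000_000, 250_000, 50_000)
to_partition_trigger = days_until(5_000_000, 250_000, 50_000)
```

With these placeholder inputs the 1M mark is 15 days away and the 5M partitioning trigger is 95 days away. Substituting the observed row count and rate gives the lead time available for the partitioning work.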

§5. Summary

The substrate is mostly in place for events and mostly missing for jobs. Reaching the system-wide queue goal is additive for events (Δ1–Δ4, Δ6–Δ10) and requires a design choice for jobs (Δ5).

No demolition required. No PG upgrade required. No new extension required (unless pg_cron is in scope — separate decision).

knowledge/dev/laws/dieu44-trien-khai/v0.6-system-wide-pg-native-queue-law-readiness-survey/06-existing-vs-needed-gap-matrix.md