KB-1DC7

15 — Risk Register and Open Questions

Revision 1
Tags: design-pack, dieu-45, risk-register, open-questions, council-review, design-only

DESIGN-ONLY. Consolidates every open question raised in docs 03–14 and adds a register of cross-cutting risks.


§1. Risk register

§1.1 High severity

| # | Risk | Surface | Severity | Mitigation |
|---|---|---|---|---|
| R-H1 | The 4-day silent gap on iu_outbound_default (last_run_at=2026-05-22) continues post-enactment, violating §15.5 | DP4 / Phase 3 | High (already violated) | Phase 3 prioritised; runbook for external cadence; heartbeat + stale-check + event |
| R-H2 | New job_queue enqueue path goes live without consumer_registry (Phase 4), making jobs unreachable | DP2 dependency chain | High | queue.job_substrate.enabled=false until Phase 4 ratified; explicit prerequisite chain in doc 14 |
| R-H3 | event_outbox partitioning is left until after 5M rows; at the current rate that is ~324 days away, but bursty domains could overrun it sooner | DP7 | High (long-tail) | Add alerts at the 3M and 4M row thresholds, before partitioning is required |
| R-H4 | MOT pack is opened before §13.4 enforcement is fully ratified, allowing a path where MOT claims a job | doc 11 / Phase 6 | High | CHECK in 4 places ratified at Phase 3 (DP6 ships); MOT pack must depend on it |
| R-H5 | Customer body leaks into safe_payload because a new external adapter mis-implements the seam | doc 12 / Phase 7 | High | CHECK enforces; require an adapter integration test that attempts a body-in-payload INSERT and asserts refusal |
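The R-H3 thresholds can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes the "~324 days" figure covers the full 5M-row headroom at a steady insert rate, and the 3x burst factor is a hypothetical value, not a measured one.

```python
import math

PARTITION_THRESHOLD = 5_000_000            # rows, from R-H3 / DP7-Q1
ALERT_THRESHOLDS = (3_000_000, 4_000_000)  # rows, from the R-H3 mitigation
DAYS_TO_PARTITION = 324                    # "~324 days" at the current rate (R-H3)


def days_until(rows: int, rows_per_day: float) -> int:
    """Whole days until `rows` of growth accumulate at a constant rate."""
    return math.ceil(rows / rows_per_day)


def alert_schedule(rows_per_day: float) -> dict[int, int]:
    """Map each threshold (alerts plus the partition point) to days from now."""
    thresholds = (*ALERT_THRESHOLDS, PARTITION_THRESHOLD)
    return {t: days_until(t, rows_per_day) for t in thresholds}


# Assumption: the steady rate is whatever makes 5M rows take ~324 days.
steady_rate = PARTITION_THRESHOLD / DAYS_TO_PARTITION

steady = alert_schedule(steady_rate)
bursty = alert_schedule(steady_rate * 3)   # hypothetical 3x burst

for label, schedule in (("steady", steady), ("3x burst", bursty)):
    for threshold, days in schedule.items():
        print(f"{label:>8}: {threshold:>9,} rows in ~{days} days")
```

Under these assumptions the 3M and 4M alerts fire at roughly day 195 and day 260; a sustained 3x burst pulls them in to about day 65 and day 87, which is the long-tail case the alerts are meant to catch.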

§1.2 Medium severity

| # | Risk | Surface | Severity | Mitigation |
|---|---|---|---|---|
| R-M1 | consumer_registry race: same event matched by multiple consumers; double-enqueue | DP5 | Medium | UNIQUE (job_kind, idempotency_key) on job_queue rejects duplicates; consumers must pick distinct idempotency_key templates |
| R-M2 | Lease TTL too short causes premature reaping of legitimate long-running jobs | DP3 | Medium | Per-job-kind TTL override; renew API; documented as runbook |
| R-M3 | NOTIFY storm if every event INSERT calls pg_notify (currently 16k/day) | DP4 | Medium | Bridge default false; if enabled, payload signal-only; downstream LISTEN must dedupe |
| R-M4 | broadcast_fallback masks subscription drift, causing privacy concerns | DP6 | Medium | v_subscription_broadcast_used warns; Council watches the percentage |
| R-M5 | iu_sql_event_route widened CHECK enables routes whose payload_contract is sloppy, so restricted classification leaks | DP5 | Medium | payload_contract jsonb has forbid_keys; row-level review before enabled=true+dry_run=false |
| R-M6 | job_workflow_refresh derivation drifts from reality during cadence gaps | doc 11 | Medium | refresh-on-write side effect inside fn_job_succeed/_fail_* (eventually consistent within ms) |
| R-M7 | Phase 5 puller-enabled collection chosen poorly: the pilot picks a high-value collection and breaks it | Phase 5 | Medium | pilot collection selection requires explicit Council + operator approval; rollback = puller flag off |
| R-M8 | job_dead_letter and iu_route_dead_letter diverge over time (different resolution semantics) | DP3 | Medium | v_dead_letter_all UNION view + identical CHECK on resolution vocab |
| R-M9 | Composer flag (iu_core.composer_enabled) interplay with new job-driven cut path | doc 10 | Medium | Composer remains the structural gate; puller-enabled MARK/CUT inherits the existing G1–G7 guards via fn_iu_op_* |
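The R-M1 mitigation relies on a UNIQUE (job_kind, idempotency_key) constraint to absorb double-enqueues. The sketch below illustrates that behaviour with Python's built-in sqlite3 as a stand-in for the real database; the table shape beyond the two constrained columns, the `enqueue` helper, and the `consumer:event_id` key template are all hypothetical.

```python
import sqlite3

# In-memory stand-in; only the UNIQUE constraint mirrors the R-M1 mitigation.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE job_queue (
        id              INTEGER PRIMARY KEY,
        job_kind        TEXT NOT NULL,
        idempotency_key TEXT NOT NULL,
        UNIQUE (job_kind, idempotency_key)
    )
    """
)


def enqueue(job_kind: str, idempotency_key: str) -> bool:
    """Return True if the row was inserted, False if the constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO job_queue (job_kind, idempotency_key) VALUES (?, ?)",
                (job_kind, idempotency_key),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def key_for(consumer: str, event_id: str) -> str:
    """Hypothetical template: each consumer namespaces its own key."""
    return f"{consumer}:{event_id}"


first = enqueue("verify_cut", key_for("consumer_a", "evt-1"))  # accepted
dup = enqueue("verify_cut", key_for("consumer_a", "evt-1"))    # rejected as duplicate
other = enqueue("verify_cut", key_for("consumer_b", "evt-1"))  # distinct template, accepted
```

The third call shows why the mitigation also asks consumers to pick distinct templates: two consumers that legitimately react to the same event would otherwise collide on one key and the second job would be silently dropped.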

§1.3 Low severity

| # | Risk | Surface | Severity | Mitigation |
|---|---|---|---|---|
| R-L1 | New view set adds maintenance burden | All DPs | Low | Naming convention v_queue_* / v_job_* documented |
| R-L2 | dot_config namespace bloat | All DPs | Low | All new keys under queue.*; one-time inventory in doc 16 |
| R-L3 | Future Kestra adapter never lands; vocab carries dead label | DP6 | Low | Vocab is forward-compat; harmless to keep |
| R-L4 | iu_notification_event legacy not deprecated leaves stale schema | doc 13 | Low | Separate cleanup pack (post Council Q7) |
| R-L5 | Heartbeat write cost grows with executor count | DP4 | Low | Upsert is O(1); table small |

§2. Consolidated open questions (by source DP/doc)

§2.1 DP1 (scheduler)

DP1-Q1 — Approve hybrid Option D over pure external-poll?
DP1-Q2 — Approve 60s default tick + 300s stale threshold?
DP1-Q3 — Which external orchestrator owns the canonical cadence?
DP1-Q4 — Approve queue.notify.bridge_enabled=false at Phase 1?
DP1-Q5 — When does pg_cron sub-amendment open?
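To show what DP1-Q2 would mean in practice, here is a minimal sketch of a stale check using the proposed 60s tick and 300s threshold. The function names are hypothetical, and the midnight-UTC timestamps are assumed for illustration (the register only gives dates for the R-H1 gap).

```python
from datetime import datetime, timedelta, timezone

TICK_SECONDS = 60              # proposed default tick (DP1-Q2)
STALE_THRESHOLD_SECONDS = 300  # proposed stale threshold (DP1-Q2)


def is_stale(last_run_at: datetime, now: datetime,
             threshold_s: int = STALE_THRESHOLD_SECONDS) -> bool:
    """An executor is stale once it has been silent for longer than the threshold."""
    return (now - last_run_at) > timedelta(seconds=threshold_s)


def missed_ticks(last_run_at: datetime, now: datetime,
                 tick_s: int = TICK_SECONDS) -> int:
    """Number of whole ticks that elapsed since the last recorded run."""
    return int((now - last_run_at).total_seconds() // tick_s)


# The R-H1 gap: last run on 2026-05-22, observed on 2026-05-26.
last_run = datetime(2026, 5, 22, tzinfo=timezone.utc)
observed = datetime(2026, 5, 26, tzinfo=timezone.utc)

gap_is_stale = is_stale(last_run, observed)
gap_ticks = missed_ticks(last_run, observed)
```

With these defaults an executor is flagged after five missed ticks, so the 4-day gap in R-H1 (5760 missed ticks under the assumed timestamps) would have been detected within minutes rather than days.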

§2.2 DP2 (job substrate)

DP2-Q1 — Single job_queue vs one table per kind?
DP2-Q2 — executor CHECK list verbatim from §11.5?
DP2-Q3 — priority smallint default 0 — approved?
DP2-Q4 — payload_ref text vs jsonb multi-typed?
DP2-Q5 — Add result_ref column?
DP2-Q6 — Emit dot.job.<lifecycle_event> events + register in event_type_registry?
DP2-Q7 — job_workflow at Phase 1 or Phase 6?

§2.3 DP3 (retry/lease/DLQ)

DP3-Q1 — max_attempts default per domain (3, 5, 10)?
DP3-Q2 — Backoff strategy default (exponential / linear)?
DP3-Q3 — Lease TTL default (600s / 300s / per-kind override)?
DP3-Q4 — job_dead_letter separate vs widen iu_route_dead_letter?
DP3-Q5 — Replay authority (workflow_admin only / per-domain agency)?
DP3-Q6 — fn_job_fail_permanent require acknowledged_by?
DP3-Q7 — retry_waiting → queued: filter at claim vs explicit promoter?
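The two backoff options in DP3-Q2 differ mainly in how quickly retries spread out. A rough sketch for comparison follows; the 30s base delay and 600s cap are hypothetical values chosen for illustration, and jitter is deliberately left out.

```python
from typing import Callable


def linear_backoff(attempt: int, base_s: float = 30.0) -> float:
    """Delay grows by a fixed step: base, 2*base, 3*base, ..."""
    return base_s * attempt


def exponential_backoff(attempt: int, base_s: float = 30.0,
                        cap_s: float = 600.0) -> float:
    """Delay doubles per attempt, bounded by a cap: base, 2*base, 4*base, ..."""
    return min(cap_s, base_s * 2 ** (attempt - 1))


def schedule(strategy: Callable[[int], float], max_attempts: int) -> list[float]:
    """Delays (in seconds) that follow attempts 1..max_attempts."""
    return [strategy(n) for n in range(1, max_attempts + 1)]


# Compare the candidate max_attempts defaults from DP3-Q1.
for max_attempts in (3, 5, 10):
    print(max_attempts, "linear     ", schedule(linear_backoff, max_attempts))
    print(max_attempts, "exponential", schedule(exponential_backoff, max_attempts))
```

With max_attempts=5 the linear schedule is 30, 60, 90, 120, 150 seconds and the exponential one is 30, 60, 120, 240, 480 seconds. At max_attempts=10 the cap starts to matter, which suggests DP3-Q1 and DP3-Q2 are best decided together.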

§2.4 DP4 (NOTIFY/heartbeat)

DP4-Q1 — Heartbeat on every tick vs only when work seen?
DP4-Q2 — executor_id text PK vs FK to executor_registry?
DP4-Q3 — NOTIFY channel naming queue_wake_<domain> vs <event_type>?
DP4-Q4 — queue_worker_silent once-per-stale-period or repeating?
DP4-Q5 — Stale_threshold per executor_kind?
DP4-Q6 — Migrate hc_executor_last_run to queue_heartbeat row?

§2.5 DP5 (trigger-in/out)

DP5-Q1 — Widen iu_sql_event_route.target_event_domain CHECK to 10 values?
DP5-Q2 — consumer_registry as single dispatch surface?
DP5-Q3 — Idempotency template DSL: string interpolation vs jsonb_path?
DP5-Q4 — Cycle detection depth-limit (default depth)?
DP5-Q5 — Migrate fn_iu_auto_instantiate_from_event into consumer at Phase 1?
DP5-Q6 — Per-event_type rate limit — add now or defer?
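One way to read the depth-limit option in DP5-Q4 is that every derived event carries a counter that is incremented on each hop and refused past a limit. The sketch below assumes that reading; the `Event` shape, the `depth` field, the exception name, the event type names, and the default of 5 are all hypothetical.

```python
from dataclasses import dataclass

MAX_DEPTH = 5  # hypothetical default; the actual value is what DP5-Q4 asks for


class CycleDepthExceeded(Exception):
    """Raised when a derived event would exceed the allowed causal depth."""


@dataclass(frozen=True)
class Event:
    event_type: str
    depth: int = 0  # 0 = externally originated event


def derive(parent: Event, event_type: str, max_depth: int = MAX_DEPTH) -> Event:
    """Create a child event one hop deeper, refusing once the limit is passed."""
    child_depth = parent.depth + 1
    if child_depth > max_depth:
        raise CycleDepthExceeded(
            f"{event_type}: depth {child_depth} exceeds limit {max_depth}"
        )
    return Event(event_type, child_depth)


def run_loop(max_depth: int) -> int:
    """Simulate an A -> B -> A ... feedback loop; return hops before refusal."""
    event, hops = Event("a.created"), 0
    try:
        while True:
            event = derive(event, "b.created" if hops % 2 == 0 else "a.created",
                           max_depth)
            hops += 1
    except CycleDepthExceeded:
        return hops
```

A depth limit does not detect cycles as such: it bounds the damage of one, and it also refuses legitimately long chains. That trade-off is the substance of both DP5-Q4 and TX-Q4.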

§2.6 DP6 (subscription/routing/executor)

DP6-Q1 — Broadcast-fallback warn-only (not deny-by-default)?
DP6-Q2 — job_subscription at Phase 1 or Phase 3?
DP6-Q3 — MOT enforcement via CHECK in 4 places (vs single registry table)?
DP6-Q4 — Per-executor max_parallel default (1, 4, 16)?
DP6-Q5 — recipient_ref FK target table?
DP6-Q6 — Merge consumer_registry + event_subscription?

§2.7 DP7 (partition/retention)

DP7-Q1 — Partition threshold at 5M rows?
DP7-Q2 — job_queue cleaned after 30d, hard-delete after 90d?
DP7-Q3 — DLQ 365d retention (vs 180d / never)?
DP7-Q4 — archive schema name?
DP7-Q5 — dot_iu_command_run 90d retention reconfirm in Điều 35?
DP7-Q6 — Hard-delete emit *_hard_deleted event — vocab gate?

§2.8 Cross-cutting docs

MC-Q1 — Approve dual reading (operator + puller) co-existence in cutting?
MC-Q2 — Which collections get puller enabled at Phase 5 pilot?
MC-Q3 — approve as real job_kind vs human action outside queue?
MC-Q4 — copy_to_staging as separate job (currently inlined)?
MC-Q5 — verify_cut advances any status or remains verdict-only?

MOT-Q1 — job_workflow at Phase 1 vs MOT pack?
MOT-Q2 — fn_mot_graph_emit as only MOT entry-point?
MOT-Q3 — DAG semantics: any-parent vs all-parents at Phase 6?
MOT-Q4 — Nested workflows support?
MOT-Q5 — MOT template registry shape?
MOT-Q6 — partially_failed freezes workflow or allows retry?
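The difference between the two DAG semantics in MOT-Q3 comes down to a single readiness predicate. A minimal sketch, assuming a job state named `succeeded` (the actual status vocabulary is not stated in this document) and hypothetical mode labels `any` and `all`:

```python
def is_ready(parent_states: list[str], mode: str) -> bool:
    """Decide whether a workflow node may be enqueued, given its parents' states.

    mode="all": every parent must have succeeded (join semantics).
    mode="any": one succeeded parent is enough (race semantics).
    A node with no parents is a root and is always ready.
    """
    if mode not in ("any", "all"):
        raise ValueError(f"unknown mode: {mode!r}")
    if not parent_states:
        return True
    succeeded = [state == "succeeded" for state in parent_states]
    return all(succeeded) if mode == "all" else any(succeeded)


# One parent finished, the other is still waiting.
parents = ["succeeded", "queued"]
ready_any = is_ready(parents, "any")  # child may start now
ready_all = is_ready(parents, "all")  # child keeps waiting
```

Under any-parent semantics a child can start while a sibling parent later fails, which interacts directly with the partially_failed question in MOT-Q6.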

CM-Q1 — 7 seam-job_kinds vocab reservation at Phase 1?
CM-Q2 — approve-required default + dispensation flag?
CM-Q3 — customer_message_inbound creates event before job?
CM-Q4 — SLA modelling: deadline_at vs customer_sla table?
CM-Q5 — Body encryption mechanism?
CM-Q6 — Multi-channel routing approach?

TX-Q1 — When drop iu_notification_event/read? Survey Q7.
TX-Q2 — Register dot.job.* event types at Phase 1 or only when emitted?
TX-Q3 — consumer_registry multi-match: all enqueue or compete?
TX-Q4 — Cycle detection: refuse globally or depth N?

RM-Q1 — Phase order: Phase 3 (heartbeat) before Phase 1 (job_queue)?
RM-Q2 — Combine Phase 1+2 into one migration?
RM-Q3 — Phase 5 pilot collection?
RM-Q4 — Phase 6 MOT template — cut pipeline or new workflow?
RM-Q5 — Approve 7-phase cadence?

§3. Top 10 questions blocking Council ratification

Of the 69 questions above, these ten block ratification of the DP1–DP7 designs:

  1. DP2-Q2 — verbatim §11.5 executor CHECK (must align with law).
  2. DP1-Q1 — hybrid scheduler model (else DP4 stale-check has no caller).
  3. DP3-Q4 — separate job_dead_letter vs widen existing (determines table count).
  4. DP4-Q3 — NOTIFY channel naming (determines listener spec).
  5. DP5-Q1 — widen iu_sql_event_route.target_event_domain CHECK (vocab gate).
  6. DP6-Q1 — broadcast-fallback warn-only vs deny (changes behaviour live).
  7. DP7-Q1 — partition threshold 5M (sets capacity planning).
  8. MOT-Q1 — job_workflow at Phase 1 vs Phase 6 (changes Phase 1 footprint).
  9. TX-Q1 — iu_notification_event legacy fate (changes table-count baseline).
  10. RM-Q1 — Phase 3 (heartbeat) before Phase 1 (job_queue)?

§4. Risk register summary

risk_summary:
  high: 5
  medium: 9
  low: 5
  total: 19

mitigations_documented_for_all_high: true
new_substrate_disabled_by_default: true
no_phase_blocks_rollback: true
existing_law_unchanged: true

Risk register + open questions. No mutation. Authored 2026-05-26.
