KB-1DC7
15 — Risk Register and Open Questions
Revision 1
design-pack, dieu-45, risk-register, open-questions, council-review, design-only
DESIGN-ONLY. Consolidates every open question raised in docs 03–14 and adds cross-cutting risks.
§1. Risk register
§1.1 High severity
| # | Risk | Surface | Severity | Mitigation |
|---|---|---|---|---|
| R-H1 | The 4-day silent gap on iu_outbound_default (last_run_at=2026-05-22) continues post-enactment, violating §15.5 | DP4 / Phase 3 | High (already violated) | Phase 3 prioritised; runbook for external cadence; heartbeat + stale-check + event |
| R-H2 | New job_queue enqueue path goes live without consumer_registry (Phase 4), making jobs unreachable | DP2 dependency chain | High | queue.job_substrate.enabled=false until Phase 4 ratified; explicit prerequisite chain in doc 14 |
| R-H3 | event_outbox partitioning is left until after 5M rows — at current rate ~324 days, but bursty domains could overrun | DP7 | High (long-tail) | Add alert at 3M and 4M row thresholds before partitioning is required |
| R-H4 | MOT pack is opened before §13.4 enforcement is fully ratified, allowing a path where MOT claims a job | doc 11 / Phase 6 | High | CHECK in 4 places ratified at Phase 3 (DP6 ships); MOT pack must depend on it |
| R-H5 | Customer body leaks into safe_payload because new external adapter mis-implements the seam | doc 12 / Phase 7 | High | CHECK enforces; require adapter integration test that attempts a body-in-payload INSERT and asserts refusal |
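
The R-H3 mitigation (alerts at 3M and 4M rows ahead of the 5M partition threshold) can be pictured with a small sketch. This is only an illustration under stated assumptions: the function names and alert labels are hypothetical, the row count and ingest rate are supplied by the caller, and nothing here reflects how the alert would actually be wired into monitoring.

```python
from typing import Optional

# Thresholds taken from R-H3: alert at 3M and 4M rows, partition at 5M.
ALERT_THRESHOLDS = (3_000_000, 4_000_000)
PARTITION_THRESHOLD = 5_000_000


def outbox_alert_level(row_count: int) -> Optional[str]:
    """Return a hypothetical alert label for an event_outbox row count."""
    if row_count >= PARTITION_THRESHOLD:
        return "partition_required"
    crossed = [t for t in ALERT_THRESHOLDS if row_count >= t]
    if crossed:
        # Report the highest threshold crossed, e.g. "warn_4m".
        return f"warn_{max(crossed) // 1_000_000}m"
    return None


def days_until_partition(row_count: int, rows_per_day: float) -> float:
    """Headroom in days at a constant ingest rate (bursts shorten this)."""
    if rows_per_day <= 0:
        raise ValueError("rows_per_day must be positive")
    return max(PARTITION_THRESHOLD - row_count, 0) / rows_per_day
```

Because the headroom estimate assumes a constant rate, it is an upper bound for bursty domains, which is the case R-H3 is concerned with.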
§1.2 Medium severity
| # | Risk | Surface | Severity | Mitigation |
|---|---|---|---|---|
| R-M1 | consumer_registry race: same event matched by multiple consumers; double-enqueue | DP5 | Medium | UNIQUE (job_kind, idempotency_key) on job_queue rejects duplicate; consumers must pick distinct idempotency_key templates |
| R-M2 | Lease TTL too short causes premature reaping of legitimate long-running jobs | DP3 | Medium | Per-job-kind TTL override; renew API; documented as runbook |
| R-M3 | NOTIFY storm if every event INSERT pg_notifies (currently 16k/day) | DP4 | Medium | Bridge default false; if enabled, payload signal-only; downstream LISTEN must dedupe |
| R-M4 | broadcast_fallback masks subscription drift, causing privacy concerns | DP6 | Medium | v_subscription_broadcast_used warns; Council watches the percentage |
| R-M5 | iu_sql_event_route widened CHECK enables routes whose payload_contract is sloppy → restricted classification leaks | DP5 | Medium | payload_contract jsonb has forbid_keys; row-level review before enabled=true+dry_run=false |
| R-M6 | job_workflow_refresh derivation drifts from reality during cadence gaps | doc 11 | Medium | refresh-on-write side effect inside fn_job_succeed/_fail_* (eventually consistent within ms) |
| R-M7 | Phase 5 puller-enabled collection chosen poorly — pilot picks high-value collection and breaks it | Phase 5 | Medium | pilot collection selection requires explicit Council + operator approval; rollback = puller flag off |
| R-M8 | job_dead_letter and iu_route_dead_letter divergence over time (different resolution semantics) | DP3 | Medium | v_dead_letter_all UNION view + identical CHECK on resolution vocab |
| R-M9 | Composer flag (iu_core.composer_enabled) interplay with new job-driven cut path | doc 10 | Medium | Composer remains the structural gate; puller-enabled MARK/CUT inherits the existing G1–G7 guards via fn_iu_op_* |
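
The R-M1 mitigation relies on a UNIQUE (job_kind, idempotency_key) constraint to reject a double-enqueue. The sketch below shows that behaviour using an in-memory SQLite database as a stand-in for the real store. The table here is a hypothetical two-column reduction of job_queue, and the `make_queue` and `enqueue` helpers are illustrative names, not part of the design.

```python
import sqlite3


def make_queue() -> sqlite3.Connection:
    """Create a minimal stand-in for job_queue (assumed shape, SQLite only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE job_queue (
            id INTEGER PRIMARY KEY,
            job_kind TEXT NOT NULL,
            idempotency_key TEXT NOT NULL,
            UNIQUE (job_kind, idempotency_key)
        )
        """
    )
    return conn


def enqueue(conn: sqlite3.Connection, job_kind: str, idempotency_key: str) -> bool:
    """Insert a job; return False when the UNIQUE constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO job_queue (job_kind, idempotency_key) VALUES (?, ?)",
                (job_kind, idempotency_key),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

The same constraint also explains the second half of the mitigation: two consumers that legitimately need separate jobs of the same job_kind must render different keys, otherwise the second enqueue is silently treated as a duplicate.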
§1.3 Low severity
| # | Risk | Surface | Severity | Mitigation |
|---|---|---|---|---|
| R-L1 | New view set adds maintenance burden | All DPs | Low | Naming convention v_queue_* / v_job_* documented |
| R-L2 | dot_config namespace bloat | All DPs | Low | All new keys under queue.*; one-time inventory in doc 16 |
| R-L3 | Future Kestra adapter never lands; vocab carries dead label | DP6 | Low | Vocab is forward-compat; harmless to keep |
| R-L4 | iu_notification_event legacy not deprecated leaves stale schema | doc 13 | Low | Separate cleanup pack (post Council Q7) |
| R-L5 | Heartbeat write cost grows with executor count | DP4 | Low | Upsert is O(1); table small |
§2. Consolidated open questions (by source DP/doc)
§2.1 DP1 (scheduler)
DP1-Q1 — Approve hybrid Option D over pure external-poll?
DP1-Q2 — Approve 60s default tick + 300s stale threshold?
DP1-Q3 — Which external orchestrator owns the canonical cadence?
DP1-Q4 — Approve queue.notify.bridge_enabled=false at Phase 1?
DP1-Q5 — When does pg_cron sub-amendment open?
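To make DP1-Q2 concrete, here is a minimal sketch of a stale check using the proposed 60s tick and 300s stale threshold. The function name and the strict greater-than comparison are assumptions made for illustration; the design docs do not specify whether the boundary is inclusive.

```python
from datetime import datetime, timedelta, timezone

# Proposed defaults from DP1-Q2 (pending Council approval).
TICK_SECONDS = 60
STALE_THRESHOLD_SECONDS = 300


def is_stale(
    last_run_at: datetime,
    now: datetime,
    threshold_seconds: int = STALE_THRESHOLD_SECONDS,
) -> bool:
    """True when the executor has been silent for longer than the threshold."""
    return (now - last_run_at) > timedelta(seconds=threshold_seconds)


# The R-H1 gap: last_run_at=2026-05-22, checked on the authoring date.
last_run = datetime(2026, 5, 22, tzinfo=timezone.utc)
checked_at = datetime(2026, 5, 26, tzinfo=timezone.utc)
gap_is_stale = is_stale(last_run, checked_at)
```

With these defaults an executor can miss up to five consecutive ticks before it is flagged, so the ratio between the two values is part of what DP1-Q2 asks the Council to approve.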
§2.2 DP2 (job substrate)
DP2-Q1 — Single job_queue vs one table per kind?
DP2-Q2 — executor CHECK list verbatim from §11.5?
DP2-Q3 — priority smallint default 0 — approved?
DP2-Q4 — payload_ref text vs jsonb multi-typed?
DP2-Q5 — Add result_ref column?
DP2-Q6 — Emit dot.job.<lifecycle_event> events + register in event_type_registry?
DP2-Q7 — job_workflow at Phase 1 or Phase 6?
§2.3 DP3 (retry/lease/DLQ)
DP3-Q1 — max_attempts default per domain (3, 5, 10)?
DP3-Q2 — Backoff strategy default (exponential / linear)?
DP3-Q3 — Lease TTL default (600s / 300s / per-kind override)?
DP3-Q4 — job_dead_letter separate vs widen iu_route_dead_letter?
DP3-Q5 — Replay authority (workflow_admin only / per-domain agency)?
DP3-Q6 — fn_job_fail_permanent require acknowledged_by?
DP3-Q7 — retry_waiting → queued: filter at claim vs explicit promoter?
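The two backoff strategies named in DP3-Q2 differ mainly in how quickly the retry delay grows. The sketch below compares them. The 30s base and 3600s cap are placeholder values chosen for illustration, not figures from the design, and `backoff_delay` is a hypothetical helper.

```python
def backoff_delay(
    attempt: int,
    strategy: str = "exponential",
    base_seconds: float = 30.0,   # placeholder, not a ratified default
    cap_seconds: float = 3600.0,  # placeholder upper bound
) -> float:
    """Delay in seconds before retry number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    if strategy == "exponential":
        delay = base_seconds * 2 ** (attempt - 1)
    elif strategy == "linear":
        delay = base_seconds * attempt
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return min(delay, cap_seconds)
```

With max_attempts of 3 the two strategies stay close (120s versus 90s on the third attempt); at 10 attempts the exponential variant reaches the cap, which is why the answers to DP3-Q1 and DP3-Q2 probably need to be decided together.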
§2.4 DP4 (NOTIFY/heartbeat)
DP4-Q1 — Heartbeat on every tick vs only when work seen?
DP4-Q2 — executor_id text PK vs FK to executor_registry?
DP4-Q3 — NOTIFY channel naming queue_wake_<domain> vs <event_type>?
DP4-Q4 — queue_worker_silent once-per-stale-period or repeating?
DP4-Q5 — Stale_threshold per executor_kind?
DP4-Q6 — Migrate hc_executor_last_run to queue_heartbeat row?
§2.5 DP5 (trigger-in/out)
DP5-Q1 — Widen iu_sql_event_route.target_event_domain CHECK to 10 values?
DP5-Q2 — consumer_registry as single dispatch surface?
DP5-Q3 — Idempotency template DSL: string interpolation vs jsonb_path?
DP5-Q4 — Cycle detection depth-limit (default depth)?
DP5-Q5 — Migrate fn_iu_auto_instantiate_from_event into consumer at Phase 1?
DP5-Q6 — Per-event_type rate limit — add now or defer?
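For DP5-Q3, the string-interpolation option could look roughly like the sketch below, which renders an idempotency key from event fields with Python's `string.Template`. The template syntax, the field names, and the `render_idempotency_key` helper are all assumptions for illustration; the actual DSL is exactly what the question leaves open.

```python
from string import Template
from typing import Mapping


def render_idempotency_key(template: str, event: Mapping[str, object]) -> str:
    """Render a key such as 'cut:${collection}:${event_id}' from event fields.

    Template.substitute raises KeyError when a referenced field is missing,
    so a malformed event fails loudly instead of producing a partial key.
    """
    return Template(template).substitute(event)


# Hypothetical event fields, used only to show the rendering.
key = render_idempotency_key(
    "cut:${collection}:${event_id}",
    {"collection": "c1", "event_id": 42},
)
```

Interpolation keeps templates readable but only reaches top-level fields; a jsonb_path approach would be needed if keys must be derived from nested payload values.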
§2.6 DP6 (subscription/routing/executor)
DP6-Q1 — Broadcast-fallback warn-only (not deny-by-default)?
DP6-Q2 — job_subscription at Phase 1 or Phase 3?
DP6-Q3 — MOT enforcement via CHECK in 4 places (vs single registry table)?
DP6-Q4 — Per-executor max_parallel default (1, 4, 16)?
DP6-Q5 — recipient_ref FK target table?
DP6-Q6 — Merge consumer_registry + event_subscription?
§2.7 DP7 (partition/retention)
DP7-Q1 — Partition threshold at 5M rows?
DP7-Q2 — job_queue cleaned after 30d, hard-delete after 90d?
DP7-Q3 — DLQ 365d retention (vs 180d / never)?
DP7-Q4 — archive schema name?
DP7-Q5 — dot_iu_command_run 90d retention reconfirm in Điều 35?
DP7-Q6 — Hard-delete emit *_hard_deleted event — vocab gate?
§2.8 Cross-cutting docs
MC-Q1 — Approve dual reading (operator + puller) co-existence in cutting?
MC-Q2 — Which collections get puller enabled at Phase 5 pilot?
MC-Q3 — approve as real job_kind vs human action outside queue?
MC-Q4 — copy_to_staging as separate job (currently inlined)?
MC-Q5 — verify_cut advances any status or remains verdict-only?
MOT-Q1 — job_workflow at Phase 1 vs MOT pack?
MOT-Q2 — fn_mot_graph_emit as only MOT entry-point?
MOT-Q3 — DAG semantics: any-parent vs all-parents at Phase 6?
MOT-Q4 — Nested workflows support?
MOT-Q5 — MOT template registry shape?
MOT-Q6 — partially_failed freezes workflow or allows retry?
CM-Q1 — 7 seam-job_kinds vocab reservation at Phase 1?
CM-Q2 — approve-required default + dispensation flag?
CM-Q3 — customer_message_inbound creates event before job?
CM-Q4 — SLA modelling: deadline_at vs customer_sla table?
CM-Q5 — Body encryption mechanism?
CM-Q6 — Multi-channel routing approach?
TX-Q1 — When to drop iu_notification_event/read? See Survey Q7.
TX-Q2 — Register dot.job.* event types at Phase 1 or only when emitted?
TX-Q3 — consumer_registry multi-match: all enqueue or compete?
TX-Q4 — Cycle detection: refuse globally or depth N?
RM-Q1 — Phase order: Phase 3 (heartbeat) before Phase 1 (job_queue)?
RM-Q2 — Combine Phase 1+2 into one migration?
RM-Q3 — Phase 5 pilot collection?
RM-Q4 — Phase 6 MOT template — cut pipeline or new workflow?
RM-Q5 — Approve 7-phase cadence?
§3. Top 10 questions blocking Council ratification
Of the 69 questions above, these ten block ratification of the DP1–DP7 designs:
- DP2-Q2 — verbatim §11.5 executor CHECK (must align with law).
- DP1-Q1 — hybrid scheduler model (else DP4 stale-check has no caller).
- DP3-Q4 — separate `job_dead_letter` vs widen existing (determines table count).
- DP4-Q3 — NOTIFY channel naming (determines listener spec).
- DP5-Q1 — widen `iu_sql_event_route.target_event_domain` CHECK (vocab gate).
- DP6-Q1 — broadcast-fallback warn-only vs deny (changes behaviour live).
- DP7-Q1 — partition threshold 5M (sets capacity planning).
- MOT-Q1 — `job_workflow` at Phase 1 vs Phase 6 (changes Phase 1 footprint).
- TX-Q1 — `iu_notification_event` legacy fate (changes table-count baseline).
- RM-Q1 — Phase 3 (heartbeat) before Phase 1 (job_queue)?
§4. Risk register summary
risk_summary:
  high: 5
  medium: 9
  low: 5
  total: 19
  mitigations_documented_for_all_high: true
  new_substrate_disabled_by_default: true
  no_phase_blocks_rollback: true
  existing_law_unchanged: true
Risk register + open questions. No mutation. Authored 2026-05-26.