KB-20CB
05 — DP3 Retry, Lease, Lock, Dead-Letter
12 min read Revision 1
design-packdieu-45dp3retryleasedead-letterskip-lockeddesign-only
05 — DP3 — Retry, Lease, Lock, Dead-Letter
DESIGN-ONLY. Cites Điều 45 §7, §8 (all sub-sections), §10.1, §15.5. No DDL, no DML.
§1. Goal
Specify how the job_queue substrate from DP2 actually gets worked safely:
- Claim pattern under concurrency (FOR UPDATE SKIP LOCKED + lease).
- Retry with bounded attempts and backoff sourced from config.
- Lease lifecycle including stale-lease recovery.
- Dead-letter contract widened from event-only (
iu_route_dead_letter) to event+job. - Audit trail integrated with
dot_iu_command_runper Điều 35.
DP3 does not fix concrete numbers (e.g., max_attempts=5); it fixes the contract. Numbers are Council ratification points routed to doc 15.
§2. Current state
dot_iu_runtime_lease(lease_name PK, lease_holder, lease_token uuid, acquired_at, expires_at, renewed_at)exists; 0 rows;fn_dot_iu_runtime_lease_acquire/releaseexists.iu_route_dead_letterexists, fully shaped (attempts ≥ 1, failure_code, failure_detail, resolution ∈ {replayed, discarded, superseded}, CHECK(resolved_at IS NULL) = (resolution IS NULL)), 0 rows.iu_route_worker_cursor.attempts_written+.dead_letteredcolumns exist.event_pending.error_count + last_errorexist on event-tier.- No
max_attemptsconfig. No backoff function. Worker just retries on next tick.
§3. Proposed design
§3.1 Claim under concurrency
NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY
-- Inside fn_job_claim:
WITH candidate AS (
SELECT job_id FROM job_queue
WHERE status='queued'
AND process_after <= now()
AND job_kind = ANY(p_kinds)
ORDER BY priority DESC, enqueued_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED
)
UPDATE job_queue
SET status='leased',
lease_name = p_lease_name,
lease_holder = p_lease_holder,
lease_token = gen_random_uuid(),
leased_until = now() + (p_lease_ttl_seconds || ' seconds')::interval,
attempts = attempts + 1
WHERE job_id = (SELECT job_id FROM candidate)
RETURNING job_id, lease_token, leased_until;
FOR UPDATE SKIP LOCKED(§5.2 allowed) guarantees concurrent workers never see the same row.- Lease TTL default suggestion: 600s (Council). Lease is renewable via
fn_job_lease_renew. - Insert into
dot_iu_runtime_leasemirrors the lease for cross-process visibility (lease_name ='job:'||job_id).
§3.2 Retry contract
NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY
dot_config keys (proposed defaults — Council ratifies):
queue.retry.max_attempts.default = 5
queue.retry.max_attempts.<job_kind> = override per kind
queue.retry.backoff.strategy = 'exponential' -- or 'linear' | 'constant'
queue.retry.backoff.base_seconds = 30
queue.retry.backoff.cap_seconds = 3600
backoff function (suggested):
next_process_after = now() + LEAST(cap, base * (2 ^ (attempts - 1)) ) * (0.8 + random()*0.4)
-- jittered exponential
§3.3 Transient failure path
fn_job_fail_transient(p_job_id, p_lease_token, p_error_code, p_error_message, p_actor):
-- validates lease_token matches
-- writes last_error / last_error_at
-- if attempts < max_attempts(kind):
UPDATE job_queue SET status='retry_waiting',
process_after = now() + backoff_interval,
lease_name=NULL, lease_holder=NULL, lease_token=NULL, leased_until=NULL
WHERE job_id=p_job_id;
-- a retry_waiting row becomes queued automatically by the claim filter
-- (status='queued' OR (status='retry_waiting' AND process_after <= now()))
-- OR a stale-waiter-promoter job advances them to status='queued'.
-- else:
move to dead_letter (see §3.4)
§3.4 Dead-letter path
fn_job_fail_permanent(p_job_id, p_lease_token, p_error_code, p_error_message, p_actor):
-- §8.5 refusal contract — used for malformed payload, contract violation
-- attempts := max_attempts (forces dead_letter path without further retry)
-- delegates to fn_job_move_to_dead_letter
fn_job_move_to_dead_letter(p_job_id, p_failure_code, p_failure_detail):
-- snapshot job row into a widened iu_route_dead_letter (or a new job_dead_letter table — see §4)
-- UPDATE job_queue SET status='dead_letter', last_error_at=now()
-- INSERT system/queue_job_dead_lettered event (severity warning)
§3.5 Dead-letter table widening
Two options:
| Option | Description | Verdict |
|---|---|---|
A. Reuse iu_route_dead_letter |
Widen the table's event_ref/event_domain semantics to also store job snapshots. Risk: event-vs-job conflation in storage. |
⚠️ Possible but messy |
B. New job_dead_letter table |
Mirror the shape but type-clean. Add view v_dead_letter_all UNION ALL of both. |
✅ Recommended |
NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY (option B)
table public.job_dead_letter (
job_id uuid PRIMARY KEY REFERENCES job_queue(job_id),
job_kind text NOT NULL,
event_domain text NOT NULL,
executor text NOT NULL,
idempotency_key text NOT NULL,
failure_code text NOT NULL,
failure_detail text NULL,
attempts integer NOT NULL CHECK (attempts >= 1),
job_snapshot jsonb NOT NULL, -- frozen state for forensics; signal-only payload refs preserved
first_failed_at timestamptz NOT NULL,
last_failed_at timestamptz NOT NULL,
resolved_at timestamptz NULL,
resolution text NULL
CHECK (resolution IN ('replayed','discarded','superseded')),
resolution_note text NULL,
CHECK ((resolved_at IS NULL) = (resolution IS NULL)) -- mirrors iu_route_dead_letter contract
);
view v_dead_letter_all AS
SELECT 'event'::text AS layer, event_ref::text AS key, event_domain, event_type, route_code, failure_code, attempts,
first_failed_at, last_failed_at, resolved_at, resolution
FROM iu_route_dead_letter
UNION ALL
SELECT 'job'::text AS layer, job_id::text AS key, event_domain, job_kind, executor, failure_code, attempts,
first_failed_at, last_failed_at, resolved_at, resolution
FROM job_dead_letter;
§3.6 Stale lease recovery (lease reaper job)
A job_kind='lease_reaper' job runs at cadence (DP1 Layer 1):
NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY
fn_job_lease_reaper():
-- For each job_queue row where status IN ('leased','in_progress')
-- AND leased_until < now():
-- - move to retry_waiting (lease holder is presumed dead)
-- - clear lease fields
-- - write last_error = 'lease_expired'
-- - emit system/queue_lease_reaped event (severity warning)
-- Also for dot_iu_runtime_lease rows where expires_at < now(): DELETE.
This is the safety net: a worker crashing mid-job does not lock its row forever.
§4. Lifecycle / status
Status transitions enforced by DP3 functions (extending DP2 state machine):
queued → leased (fn_job_claim)
leased → in_progress (fn_job_progress)
leased → retry_waiting (fn_job_fail_transient, attempts < max)
leased → dead_letter (fn_job_fail_transient at limit, or fn_job_fail_permanent)
leased → retry_waiting (fn_job_lease_reaper, lease expired)
in_progress → succeeded (fn_job_succeed)
in_progress → retry_waiting | dead_letter (same as leased branch)
retry_waiting → queued (claim filter sees process_after passed)
dead_letter → queued (fn_job_dead_letter_replay; bumps attempts back to 0 if resolution='replayed')
§5. Indexes / performance
- Claim query indexed by
(status='queued', priority DESC, enqueued_at ASC)partial index. - Lease reaper query indexed by
(status IN ('leased','in_progress'), leased_until)partial index. - Dead-letter inspection:
(job_kind, last_failed_at DESC). - Replay: by
job_idPK.
§6. Security / governance
fn_job_claimcallable by executors holdingdot_executorrole.fn_job_dead_letter_replayrequiresworkflow_adminrole (per Điều 31 refusal contract + §8.4 authority list).fn_job_lease_reaperis a privileged maintenance fn; runs asworkflow_admin.
§7. Rollback / disable
queue.retry.max_attempts.default = 1disables retry effectively.queue.retry.backoff.cap_seconds = 0collapses to constant.queue.job_substrate.enabled = false(DP2 flag) makes the whole DP3 surface inert.fn_job_lease_reapercan be gated byqueue.lease_reaper.enabled.
§8. Healthcheck / observability
NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY
view v_job_retry_pressure AS
SELECT job_kind, count(*) FILTER (WHERE status='retry_waiting') AS waiting,
count(*) FILTER (WHERE status='dead_letter') AS dead,
max(attempts) FILTER (WHERE status<>'cleaned') AS max_attempts_seen
FROM job_queue GROUP BY job_kind;
view v_lease_active AS
SELECT lease_name, lease_holder, leased_until, EXTRACT(epoch FROM (leased_until - now())) AS ttl_remaining
FROM job_queue WHERE status IN ('leased','in_progress');
function fn_dead_letter_health() RETURNS jsonb
-- returns {open_count_event, open_count_job, oldest_first_failed_at, by_failure_code}
Fed into v_queue_health (DP4).
§9. Compatibility with Điều 45 v1.0
| Clause | Compliance |
|---|---|
| §7 idempotency three-layer defense | ✅ CHECK + SKIP LOCKED + lease token |
| §8.1 retry | ✅ config-driven max_attempts |
| §8.2 backoff explicit | ✅ named strategy + base + cap; not implicit |
| §8.3 lease | ✅ TTL + token + reaper |
| §8.4 dead-letter resolution {replayed, discarded, superseded} | ✅ verbatim CHECK |
| §8.5 refusal contract | ✅ fn_job_fail_permanent path |
| §10.1 batch principles | ✅ batch-claim sketch (LIMIT N + lease) honours rollback rule |
| §15.5 silent_gap | ✅ lease reaper closes the orphan-lease class of silent gap |
§10. Implementation prerequisites
- DP2 must be in place (job_queue exists).
- DP1 cadence must reach the lease reaper at < lease TTL / 2 (so reaping happens within ~5 min of crash).
dot_configkeys created (read-only at first, can ship with defaults).
§11. Open questions
| # | Question | Routed to |
|---|---|---|
| DP3-Q1 | max_attempts default per domain: 3? 5? 10? |
Council |
| DP3-Q2 | Backoff strategy default: exponential? linear? | Council |
| DP3-Q3 | Lease TTL default: 600s? 300s? per job_kind override? | Council |
| DP3-Q4 | Approve separate job_dead_letter (Option B) vs widening iu_route_dead_letter? |
Council |
| DP3-Q5 | Replay authority: workflow_admin only, or per-domain agency? | Council + Điều 37 |
| DP3-Q6 | Should fn_job_fail_permanent require an acknowledged_by audit field (D31 refusal evidence)? |
Council |
| DP3-Q7 | retry_waiting → queued promoter: filter at claim time vs explicit promoter job? | Council |
§12. Self-test
self_test:
cites_dieu45_section: §7,§8,§10.1,§15.5
defines_status_lifecycle_compatible_with_§6_7: yes
defines_idempotency_key: inherits DP2
defines_retry_dlq: yes
defines_lease: yes
defines_observability_view: v_job_retry_pressure, v_lease_active
defines_dot_config_disable_flag: queue.retry.*, queue.lease_reaper.enabled
defines_executor_set_compatible_with_§11_5: yes
no_vector_in_transient: yes
signal_not_data: yes (snapshot is jsonb signal-shaped)
pg_sot: yes
rollback_concept: yes
no_pg_cron_dependency_phase_1: yes
no_pg_18_dependency: yes
no_mutation_authored: yes
DP3 design. No mutation. Authored 2026-05-26 by Claude Opus 4.7 (1M).