KB-20CB

05 — DP3 Retry, Lease, Lock, Dead-Letter

12 min read Revision 1
design-packdieu-45dp3retryleasedead-letterskip-lockeddesign-only

05 — DP3 — Retry, Lease, Lock, Dead-Letter

DESIGN-ONLY. Cites Điều 45 §7, §8 (all sub-sections), §10.1, §15.5. No DDL, no DML.


§1. Goal

Specify how the job_queue substrate from DP2 actually gets worked safely:

  1. Claim pattern under concurrency (FOR UPDATE SKIP LOCKED + lease).
  2. Retry with bounded attempts and backoff sourced from config.
  3. Lease lifecycle including stale-lease recovery.
  4. Dead-letter contract widened from event-only (iu_route_dead_letter) to event+job.
  5. Audit trail integrated with dot_iu_command_run per Điều 35.

DP3 does not fix concrete numbers (e.g., max_attempts=5); it fixes the contract. Numbers are Council ratification points routed to doc 15.


§2. Current state

  • dot_iu_runtime_lease(lease_name PK, lease_holder, lease_token uuid, acquired_at, expires_at, renewed_at) exists; 0 rows; fn_dot_iu_runtime_lease_acquire/release exists.
  • iu_route_dead_letter exists, fully shaped (attempts ≥ 1, failure_code, failure_detail, resolution ∈ {replayed, discarded, superseded}, CHECK (resolved_at IS NULL) = (resolution IS NULL)), 0 rows.
  • iu_route_worker_cursor.attempts_written + .dead_lettered columns exist.
  • event_pending.error_count + last_error exist on event-tier.
  • No max_attempts config. No backoff function. Worker just retries on next tick.

§3. Proposed design

§3.1 Claim under concurrency

NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY

-- Inside fn_job_claim:
WITH candidate AS (
  SELECT job_id FROM job_queue
   WHERE status='queued'
     AND process_after <= now()
     AND job_kind = ANY(p_kinds)
   ORDER BY priority DESC, enqueued_at ASC
   LIMIT 1
   FOR UPDATE SKIP LOCKED
)
UPDATE job_queue
   SET status='leased',
       lease_name = p_lease_name,
       lease_holder = p_lease_holder,
       lease_token = gen_random_uuid(),
       leased_until = now() + (p_lease_ttl_seconds || ' seconds')::interval,
       attempts = attempts + 1
 WHERE job_id = (SELECT job_id FROM candidate)
RETURNING job_id, lease_token, leased_until;
  • FOR UPDATE SKIP LOCKED (§5.2 allowed) guarantees concurrent workers never see the same row.
  • Lease TTL default suggestion: 600s (Council). Lease is renewable via fn_job_lease_renew.
  • Insert into dot_iu_runtime_lease mirrors the lease for cross-process visibility (lease_name = 'job:'||job_id).

§3.2 Retry contract

NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY

dot_config keys (proposed defaults — Council ratifies):
  queue.retry.max_attempts.default                   = 5
  queue.retry.max_attempts.<job_kind>                = override per kind
  queue.retry.backoff.strategy                       = 'exponential'  -- or 'linear' | 'constant'
  queue.retry.backoff.base_seconds                   = 30
  queue.retry.backoff.cap_seconds                    = 3600

backoff function (suggested):
  next_process_after = now() + LEAST(cap, base * (2 ^ (attempts - 1)) ) * (0.8 + random()*0.4)
  -- jittered exponential

§3.3 Transient failure path

fn_job_fail_transient(p_job_id, p_lease_token, p_error_code, p_error_message, p_actor):
  -- validates lease_token matches
  -- writes last_error / last_error_at
  -- if attempts < max_attempts(kind):
       UPDATE job_queue SET status='retry_waiting',
              process_after = now() + backoff_interval,
              lease_name=NULL, lease_holder=NULL, lease_token=NULL, leased_until=NULL
        WHERE job_id=p_job_id;
       -- a retry_waiting row becomes queued automatically by the claim filter
       --   (status='queued' OR (status='retry_waiting' AND process_after <= now()))
       -- OR a stale-waiter-promoter job advances them to status='queued'.
  -- else:
       move to dead_letter (see §3.4)

§3.4 Dead-letter path

fn_job_fail_permanent(p_job_id, p_lease_token, p_error_code, p_error_message, p_actor):
  -- §8.5 refusal contract — used for malformed payload, contract violation
  -- attempts := max_attempts (forces dead_letter path without further retry)
  -- delegates to fn_job_move_to_dead_letter

fn_job_move_to_dead_letter(p_job_id, p_failure_code, p_failure_detail):
  -- snapshot job row into a widened iu_route_dead_letter (or a new job_dead_letter table — see §4)
  -- UPDATE job_queue SET status='dead_letter', last_error_at=now()
  -- INSERT system/queue_job_dead_lettered event (severity warning)

§3.5 Dead-letter table widening

Two options:

Option Description Verdict
A. Reuse iu_route_dead_letter Widen the table's event_ref/event_domain semantics to also store job snapshots. Risk: event-vs-job conflation in storage. ⚠️ Possible but messy
B. New job_dead_letter table Mirror the shape but type-clean. Add view v_dead_letter_all UNION ALL of both. ✅ Recommended
NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY (option B)

table public.job_dead_letter (
  job_id                uuid PRIMARY KEY REFERENCES job_queue(job_id),
  job_kind              text NOT NULL,
  event_domain          text NOT NULL,
  executor              text NOT NULL,
  idempotency_key       text NOT NULL,
  failure_code          text NOT NULL,
  failure_detail        text NULL,
  attempts              integer NOT NULL CHECK (attempts >= 1),
  job_snapshot          jsonb NOT NULL,             -- frozen state for forensics; signal-only payload refs preserved
  first_failed_at       timestamptz NOT NULL,
  last_failed_at        timestamptz NOT NULL,
  resolved_at           timestamptz NULL,
  resolution            text NULL
      CHECK (resolution IN ('replayed','discarded','superseded')),
  resolution_note       text NULL,
  CHECK ((resolved_at IS NULL) = (resolution IS NULL))   -- mirrors iu_route_dead_letter contract
);

view v_dead_letter_all AS
  SELECT 'event'::text AS layer, event_ref::text AS key, event_domain, event_type, route_code, failure_code, attempts,
         first_failed_at, last_failed_at, resolved_at, resolution
    FROM iu_route_dead_letter
  UNION ALL
  SELECT 'job'::text  AS layer, job_id::text AS key, event_domain, job_kind, executor, failure_code, attempts,
         first_failed_at, last_failed_at, resolved_at, resolution
    FROM job_dead_letter;

§3.6 Stale lease recovery (lease reaper job)

A job_kind='lease_reaper' job runs at cadence (DP1 Layer 1):

NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY

fn_job_lease_reaper():
  -- For each job_queue row where status IN ('leased','in_progress')
  --   AND leased_until < now():
  --     - move to retry_waiting (lease holder is presumed dead)
  --     - clear lease fields
  --     - write last_error = 'lease_expired'
  --     - emit system/queue_lease_reaped event (severity warning)
  -- Also for dot_iu_runtime_lease rows where expires_at < now(): DELETE.

This is the safety net: a worker crashing mid-job does not lock its row forever.


§4. Lifecycle / status

Status transitions enforced by DP3 functions (extending DP2 state machine):

queued      → leased            (fn_job_claim)
leased      → in_progress       (fn_job_progress)
leased      → retry_waiting     (fn_job_fail_transient, attempts < max)
leased      → dead_letter       (fn_job_fail_transient at limit, or fn_job_fail_permanent)
leased      → retry_waiting     (fn_job_lease_reaper, lease expired)
in_progress → succeeded         (fn_job_succeed)
in_progress → retry_waiting | dead_letter  (same as leased branch)
retry_waiting → queued          (claim filter sees process_after passed)
dead_letter → queued            (fn_job_dead_letter_replay; bumps attempts back to 0 if resolution='replayed')

§5. Indexes / performance

  • Claim query indexed by (status='queued', priority DESC, enqueued_at ASC) partial index.
  • Lease reaper query indexed by (status IN ('leased','in_progress'), leased_until) partial index.
  • Dead-letter inspection: (job_kind, last_failed_at DESC).
  • Replay: by job_id PK.

§6. Security / governance

  • fn_job_claim callable by executors holding dot_executor role.
  • fn_job_dead_letter_replay requires workflow_admin role (per Điều 31 refusal contract + §8.4 authority list).
  • fn_job_lease_reaper is a privileged maintenance fn; runs as workflow_admin.

§7. Rollback / disable

  • queue.retry.max_attempts.default = 1 disables retry effectively.
  • queue.retry.backoff.cap_seconds = 0 collapses to constant.
  • queue.job_substrate.enabled = false (DP2 flag) makes the whole DP3 surface inert.
  • fn_job_lease_reaper can be gated by queue.lease_reaper.enabled.

§8. Healthcheck / observability

NON-EXECUTABLE DESIGN SKETCH — DO NOT APPLY

view v_job_retry_pressure AS
  SELECT job_kind, count(*) FILTER (WHERE status='retry_waiting') AS waiting,
                            count(*) FILTER (WHERE status='dead_letter') AS dead,
                            max(attempts) FILTER (WHERE status<>'cleaned') AS max_attempts_seen
    FROM job_queue GROUP BY job_kind;

view v_lease_active AS
  SELECT lease_name, lease_holder, leased_until, EXTRACT(epoch FROM (leased_until - now())) AS ttl_remaining
    FROM job_queue WHERE status IN ('leased','in_progress');

function fn_dead_letter_health() RETURNS jsonb
  -- returns {open_count_event, open_count_job, oldest_first_failed_at, by_failure_code}

Fed into v_queue_health (DP4).


§9. Compatibility with Điều 45 v1.0

Clause Compliance
§7 idempotency three-layer defense ✅ CHECK + SKIP LOCKED + lease token
§8.1 retry ✅ config-driven max_attempts
§8.2 backoff explicit ✅ named strategy + base + cap; not implicit
§8.3 lease ✅ TTL + token + reaper
§8.4 dead-letter resolution {replayed, discarded, superseded} ✅ verbatim CHECK
§8.5 refusal contract ✅ fn_job_fail_permanent path
§10.1 batch principles ✅ batch-claim sketch (LIMIT N + lease) honours rollback rule
§15.5 silent_gap ✅ lease reaper closes the orphan-lease class of silent gap

§10. Implementation prerequisites

  • DP2 must be in place (job_queue exists).
  • DP1 cadence must reach the lease reaper at < lease TTL / 2 (so reaping happens within ~5 min of crash).
  • dot_config keys created (read-only at first, can ship with defaults).

§11. Open questions

# Question Routed to
DP3-Q1 max_attempts default per domain: 3? 5? 10? Council
DP3-Q2 Backoff strategy default: exponential? linear? Council
DP3-Q3 Lease TTL default: 600s? 300s? per job_kind override? Council
DP3-Q4 Approve separate job_dead_letter (Option B) vs widening iu_route_dead_letter? Council
DP3-Q5 Replay authority: workflow_admin only, or per-domain agency? Council + Điều 37
DP3-Q6 Should fn_job_fail_permanent require an acknowledged_by audit field (D31 refusal evidence)? Council
DP3-Q7 retry_waiting → queued promoter: filter at claim time vs explicit promoter job? Council

§12. Self-test

self_test:
  cites_dieu45_section: §7,§8,§10.1,§15.5
  defines_status_lifecycle_compatible_with_§6_7: yes
  defines_idempotency_key: inherits DP2
  defines_retry_dlq: yes
  defines_lease: yes
  defines_observability_view: v_job_retry_pressure, v_lease_active
  defines_dot_config_disable_flag: queue.retry.*, queue.lease_reaper.enabled
  defines_executor_set_compatible_with_§11_5: yes
  no_vector_in_transient: yes
  signal_not_data: yes (snapshot is jsonb signal-shaped)
  pg_sot: yes
  rollback_concept: yes
  no_pg_cron_dependency_phase_1: yes
  no_pg_18_dependency: yes
  no_mutation_authored: yes

DP3 design. No mutation. Authored 2026-05-26 by Claude Opus 4.7 (1M).

Back to Knowledge Hub knowledge/dev/laws/dieu44-trien-khai/v0.6-dieu45-full-queue-orchestration-design-pack/05-DP3-retry-lease-dead-letter.md