KB-40AF

Phase 2 — Lease Governance (stale lease reaper)

Revision 1 · 2026-05-26
Tags: dieu45, phase2, lease-governance, reaper, dp3

Two-function shape

| Function | Mode | Mutates | Gated by |
|---|---|---|---|
| fn_job_reap_stale_leases_dry_run(p_limit) | STABLE, INVOKER | No | none (always available) |
| fn_job_reap_stale_leases_apply(p_actor, p_limit) | VOLATILE, SECURITY DEFINER | Yes | queue.job_substrate.enabled AND queue.lease.reaper_enabled AND NOT queue.lease.reaper_dry_run_only |

Stale lease definition

A row in job_queue is stale-leased when:

  • state = 'leased'
  • lease_until IS NOT NULL
  • lease_until < now()

The Phase 1 index job_queue_lease_until_idx ON (lease_until) WHERE state='leased' supports this scan efficiently.
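
The three conditions read as a single predicate. A minimal Python sketch, assuming a hypothetical in-memory Job record in place of a job_queue row (the real check runs in SQL against the table):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Job:
    """Hypothetical stand-in for the job_queue columns the check reads."""
    state: str
    lease_until: Optional[datetime]


def is_stale_leased(job: Job, now: datetime) -> bool:
    # All three conditions from the definition must hold.
    return (
        job.state == "leased"
        and job.lease_until is not None
        and job.lease_until < now
    )


now = datetime(2026, 5, 26, 12, 0, tzinfo=timezone.utc)
expired = Job("leased", now - timedelta(seconds=30))
active = Job("leased", now + timedelta(seconds=30))
released = Job("retry_waiting", None)
print([is_stale_leased(j, now) for j in (expired, active, released)])
```

Only the first job qualifies: the second lease has not expired yet, and the third row is not in the leased state.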

fn_job_reap_stale_leases_dry_run — output shape

{
  "evaluated_at": "<now>",
  "stale_lease_count": <int>,
  "limit": <int>,
  "mutation": false,
  "jobs": [
    {
      "job_id": "<uuid>",
      "job_kind": "<text>",
      "lease_owner": "<text>",
      "lease_until": "<timestamptz>",
      "attempts": <int>,
      "max_attempts": <int>,
      "overdue_seconds": <bigint>,
      "would_action": "reset_to_retry_waiting" | "move_to_dead_letter"
    }
  ]
}

would_action predicts the apply-side decision: move_to_dead_letter when attempts + 1 >= max_attempts, otherwise reset_to_retry_waiting.
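
The prediction and the overdue figure can be sketched in Python as below. Both helper names are illustrative, and the reading of overdue_seconds as whole seconds elapsed since lease_until is an assumption; the source only gives the field name and type.

```python
from datetime import datetime, timedelta, timezone


def would_action(attempts: int, max_attempts: int) -> str:
    # The apply path increments attempts first, so compare attempts + 1.
    if attempts + 1 >= max_attempts:
        return "move_to_dead_letter"
    return "reset_to_retry_waiting"


def overdue_seconds(lease_until: datetime, now: datetime) -> int:
    # Assumption: whole seconds elapsed since the lease expired.
    return int((now - lease_until).total_seconds())


now = datetime(2026, 5, 26, 12, 1, 30, tzinfo=timezone.utc)
print(would_action(0, 5), would_action(4, 5))
print(overdue_seconds(now - timedelta(seconds=90), now))
```

With max_attempts=5, a job at attempts=0 is predicted to be retried and a job at attempts=4 is predicted to go to the dead-letter queue.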

fn_job_reap_stale_leases_apply — apply logic

For each stale-leased job (claimed with FOR UPDATE SKIP LOCKED):

If attempts + 1 >= max_attempts → move to DLQ via fn_job_move_to_dead_letter with final_error='lease_expired_reaped_at_max_attempts'. Attempts are bumped before the move so the DLQ row records the final attempt count.

Else → reset to retry_waiting:

  • attempts := attempts + 1
  • last_error := 'lease_expired_reaped'
  • scheduled_at := now() + queue.retry.backoff_base_sec * 2^(attempts-1) capped at 2^10
  • lease_owner := NULL, lease_until := NULL
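
The per-job decision can be sketched in Python as below. This is an in-memory illustration, not the SQL function. It makes three assumptions: attempts in the backoff formula is the post-increment value, "capped at 2^10" limits the exponential factor, and the base is 10 seconds. The base is chosen only so that the first retry waits 10 seconds; the configured value of queue.retry.backoff_base_sec is not stated here.

```python
from datetime import datetime, timedelta, timezone

# Assumed reading of "capped at 2^10": the exponential factor stops growing at 1024.
MAX_BACKOFF_FACTOR = 2 ** 10


def reap_one(job: dict, now: datetime, backoff_base_sec: int) -> dict:
    """Return an updated copy of a stale-leased job; no database is touched."""
    attempts = job["attempts"] + 1  # bumped before either branch
    if attempts >= job["max_attempts"]:
        # The real move is delegated to fn_job_move_to_dead_letter; only the
        # fields named in this section are modelled here.
        return {**job, "attempts": attempts, "state": "dead_letter",
                "final_error": "lease_expired_reaped_at_max_attempts"}
    backoff_sec = backoff_base_sec * min(2 ** (attempts - 1), MAX_BACKOFF_FACTOR)
    return {**job, "attempts": attempts, "state": "retry_waiting",
            "last_error": "lease_expired_reaped",
            "scheduled_at": now + timedelta(seconds=backoff_sec),
            "lease_owner": None, "lease_until": None}


now = datetime(2026, 5, 26, 12, 0, tzinfo=timezone.utc)
stale = {"attempts": 0, "max_attempts": 5, "state": "leased",
         "lease_owner": "worker-a",  # hypothetical owner name
         "lease_until": now - timedelta(minutes=1)}
retried = reap_one(stale, now, backoff_base_sec=10)  # base of 10 is an assumed value
print(retried["state"], retried["attempts"], retried["scheduled_at"] - now)
```

Under these assumptions the delay doubles with each reap (10s, 20s, 40s, ...) until the factor reaches 1024, after which it stays at 10240 seconds.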

Triple-gate design rationale

| Gate | Purpose | Default | Flip authority |
|---|---|---|---|
| queue.job_substrate.enabled | Substrate master gate (Phase 1) | false | Phase 3 enactment |
| queue.lease.reaper_enabled | Reaper master gate | false | Operator authorization |
| queue.lease.reaper_dry_run_only | Safety gate; refuses mutation even if reaper_enabled=true | true | Operator authorization per reap window |

The third gate is the dry-run safety: even after the reaper is enabled, durable mutation requires explicitly flipping reaper_dry_run_only to false. This gives operators an "armed but safe" window for observation.

Refusal proofs (bounded TX)

| Test | Result |
|---|---|
| apply with reaper_dry_run_only=true, reaper_enabled=true, substrate=true | {"reason":"queue.lease.reaper_dry_run_only=true","refused":true} |
| apply with reaper_enabled=false, reaper_dry_run_only=false | {"reason":"queue.lease.reaper_enabled=false","refused":true} |
| apply with substrate=false | {"reason":"queue.job_substrate.enabled=false","refused":true} (implicit from gate order; tested in Phase 1) |
| apply with empty actor | {"reason":"actor_required","refused":true} |
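
The gate evaluation behind these refusals can be sketched in Python as below. The function name and the settings dict are illustrative. The order of the checks (actor, then substrate, then reaper_enabled, then reaper_dry_run_only) is an assumption: the source confirms only that a gate order exists and gives the refusal payloads.

```python
from typing import Optional


def check_apply_gates(settings: dict, actor: str) -> Optional[dict]:
    """Return a refusal payload, or None when apply is allowed to mutate."""
    if not actor or not actor.strip():
        return {"reason": "actor_required", "refused": True}
    # Missing keys fall back to the documented defaults: false, false, true.
    if not settings.get("queue.job_substrate.enabled", False):
        return {"reason": "queue.job_substrate.enabled=false", "refused": True}
    if not settings.get("queue.lease.reaper_enabled", False):
        return {"reason": "queue.lease.reaper_enabled=false", "refused": True}
    if settings.get("queue.lease.reaper_dry_run_only", True):
        return {"reason": "queue.lease.reaper_dry_run_only=true", "refused": True}
    return None


armed_but_safe = {
    "queue.job_substrate.enabled": True,
    "queue.lease.reaper_enabled": True,
    "queue.lease.reaper_dry_run_only": True,
}
print(check_apply_gates(armed_but_safe, "phase2_proof_reaper"))
print(check_apply_gates({**armed_but_safe, "queue.lease.reaper_dry_run_only": False},
                        "phase2_proof_reaper"))
```

The first call models the "armed but safe" window and is refused by the dry-run gate. The second call has all three gates open and returns None, meaning the apply path may proceed.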

Apply path proof (bounded TX, all three gates temporarily true)

2 stale-leased jobs prepared:

  • phase2_proof_retryable (attempts=0, max=5) → reset_to_retry_waiting, attempts→1, backoff_sec=10
  • phase2_proof_dlq_bound (attempts=4, max=5) → moved_to_dead_letter, attempts→5

Apply output:

{
  "actor": "phase2_proof_reaper",
  "refused": false,
  "reset_count": 1,
  "dead_letter_count": 1,
  "actions": [
    {"action":"reset_to_retry_waiting","job_id":"5f2a426d…","attempts":1,"backoff_sec":10,"scheduled_at":"…"},
    {"action":"moved_to_dead_letter","job_id":"4f1eae41…","attempts":5,
     "dead_letter":{"dead_letter_id":"9a4f319f…","state":"dead_letter","refused":false,"attempts":5,…}}
  ]
}

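The cursor loop selects rows with FOR UPDATE SKIP LOCKED, so concurrent reaper invocations are safe: each invocation skips rows that another invocation has already locked instead of blocking on them or processing them twice.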
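
The skip-instead-of-wait behaviour can be illustrated outside the database. The Python sketch below is an analogy only: a non-blocking threading.Lock stands in for the row lock, and the Row class and reaper names are hypothetical.

```python
import threading


class Row:
    def __init__(self, job_id: int) -> None:
        self.job_id = job_id
        self.state = "leased"
        self.lock = threading.Lock()  # stands in for the row lock


def reap(rows: list, reaper: str, log: list) -> None:
    for row in rows:
        if not row.lock.acquire(blocking=False):
            continue  # locked by another reaper: skip, do not wait
        try:
            if row.state == "leased":  # still stale once we hold the lock
                row.state = "retry_waiting"
                log.append((reaper, row.job_id))
        finally:
            row.lock.release()


rows = [Row(i) for i in range(200)]
log: list = []
threads = [threading.Thread(target=reap, args=(rows, name, log))
           for name in ("reaper-1", "reaper-2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(log), len({job_id for _, job_id in log}))
```

However the two threads interleave, every row is reaped exactly once: the first reaper to take a row's lock processes it, and any later holder finds the state already changed.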

Lease duration source

queue.lease.duration_sec (default 300s) is used by fn_job_claim to set lease_until. The reaper does NOT re-read this — it only checks lease_until < now(). This means changing lease duration affects new leases, not in-flight ones.
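
A short Python sketch of why in-flight leases are unaffected, assuming fn_job_claim stores now plus the configured duration (the helper name is illustrative):

```python
from datetime import datetime, timedelta, timezone


def claim_lease_until(now: datetime, duration_sec: int) -> datetime:
    # Assumption: fn_job_claim stores now + queue.lease.duration_sec.
    return now + timedelta(seconds=duration_sec)


t0 = datetime(2026, 5, 26, 12, 0, tzinfo=timezone.utc)
old_lease = claim_lease_until(t0, 300)   # claimed under the 300s default

t1 = t0 + timedelta(seconds=120)         # duration is lowered to 60s at this point
new_lease = claim_lease_until(t1, 60)

# The reaper compares only the stored lease_until with the clock.
print(old_lease < t1, new_lease - t1)
```

The lease claimed at t0 keeps its original 300-second expiry and is not stale at t1, even though the setting now says 60 seconds. Only the lease claimed after the change gets the shorter window.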

Future work (out of Phase 2)

  • Lease reaper as a job_kind (DP3 §11.3.2 design): the reaper itself enqueues into job_queue rather than being called externally. Out of scope until Phase 3+ when worker substrate is enabled.
  • Per-job-kind lease duration override (job_queue.metadata.lease_override_sec?). Not designed yet.
knowledge/dev/laws/dieu44-trien-khai/v0.6-dieu45-phase-2-heartbeat-activation-lease-governance/03-lease-governance.md