KB-481E

Phase 3 — Queue health + stuck-job audit (PASS)

Revision 1
Tags: dieu45, phase3, queue-health, v-queue-health, v-job-queue-backlog, v-job-dead-letter-summary, no-stuck-job, 2026-05-26

1. v_queue_health

surface                | queue_substrate_phase1
substrate_enabled      | false                  -- closed at exit
heartbeat_enabled      | true
worker_enabled         | false
stale_threshold_seconds| 300
executors_fresh        | 1                      -- dieu45_phase3_pilot
executors_warning      | 0
executors_stale        | 1                      -- legacy iu_outbound_default (carry-forward)
queued_count           | 0
leased_count           | 0
in_progress_count      | 0
retry_waiting_count    | 0
failed_count           | 0
dead_letter_count      | 0
cancelled_count        | 0
succeeded_count        | 6                      -- this pilot
cleaned_count          | 0
stale_lease_count      | 0
dlq_pending_count      | 0
dlq_total_count        | 0

2. v_job_queue_backlog

job_kind                | state     | row_count
------------------------+-----------+----------
cut.cleanup_checkpoint  | succeeded |         1
cut.copy_to_staging     | succeeded |         1
cut.cut                 | succeeded |         1
cut.mark                | succeeded |         1
cut.verify_cut          | succeeded |         1
cut.verify_mark         | succeeded |         1

oldest_created_at = 2026-05-26 15:13:49.405831+00, newest_created_at = 2026-05-26 15:13:49.430201+00 (all 6 enqueued within 25 ms of each other), avg_attempts = 0.00, max_attempts_seen = 0.
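The "within 25 ms" figure can be re-derived from the two timestamps. The snippet below is only an arithmetic check; the variable names are illustrative and not part of the queue schema.

```python
from datetime import datetime

# Timestamps copied from the backlog summary; "+00" is written as "+00:00"
# so that datetime.fromisoformat accepts it on older Python 3 versions.
oldest = datetime.fromisoformat("2026-05-26 15:13:49.405831+00:00")
newest = datetime.fromisoformat("2026-05-26 15:13:49.430201+00:00")

# Spread between the first and last enqueue, in milliseconds.
spread_ms = (newest - oldest).total_seconds() * 1000
print(f"enqueue spread: {spread_ms:.2f} ms")  # enqueue spread: 24.37 ms
```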

3. v_job_dead_letter_summary

(0 rows)

4. No stuck jobs

SELECT count(*) FROM job_queue
WHERE state IN ('queued','leased','in_progress','retry_waiting','failed');
-- 0
SELECT count(*) FROM job_queue WHERE state='succeeded' AND finished_at IS NULL;
-- 0
SELECT count(*) FROM job_queue
WHERE lease_owner IS NOT NULL AND lease_until < now();
-- 0 (no stale leases; the pilot ran inside the 300 s lease window)
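For readers who want to replay the three checks outside the database, here is a minimal sketch of the same predicates over in-memory rows. The column names (state, finished_at, lease_owner, lease_until) come from the queries above. The helper name `stuck_job_counts` and the sample rows are hypothetical, and the sample assumes a succeeded job has its lease cleared, which this page does not confirm.

```python
from datetime import datetime, timedelta, timezone

# States the first audit query treats as "not finished".
NON_TERMINAL = {"queued", "leased", "in_progress", "retry_waiting", "failed"}

def stuck_job_counts(rows, now):
    """Return (non_terminal, succeeded_without_finish, stale_leases) for dict rows."""
    non_terminal = sum(1 for r in rows if r["state"] in NON_TERMINAL)
    unfinished = sum(
        1 for r in rows
        if r["state"] == "succeeded" and r["finished_at"] is None
    )
    # Like SQL, a NULL (None) lease_until never counts as an expired lease.
    stale = sum(
        1 for r in rows
        if r["lease_owner"] is not None
        and r["lease_until"] is not None
        and r["lease_until"] < now
    )
    return non_terminal, unfinished, stale

now = datetime(2026, 5, 26, 15, 20, tzinfo=timezone.utc)  # illustrative audit time
healthy = [
    {"state": "succeeded", "finished_at": now, "lease_owner": None, "lease_until": None}
    for _ in range(6)
]
# Hypothetical counter-example: a leased job whose lease expired a minute ago.
stuck = healthy + [
    {"state": "leased", "finished_at": None, "lease_owner": "w1",
     "lease_until": now - timedelta(minutes=1)}
]
print(stuck_job_counts(healthy, now))  # (0, 0, 0)
print(stuck_job_counts(stuck, now))    # (1, 0, 1)
```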

5. No unsafe dead_letter

SELECT count(*) FROM job_dead_letter;
-- 0

SELECT count(*) FROM job_dead_letter WHERE triage_status NOT IN ('pending','acknowledged');
-- 0

6. Staging lifecycle correctness

staging_record_id : 258c715c-…   (this pilot)
lifecycle_status  : consumed
approved_at       : 2026-05-26 15:13:49.501679+00
consumed_at       : 2026-05-26 15:17:38.784037+00
consumed_by_run_id: a64340fe-…   (matches cut_run_id)
vector_excluded   : t            (NVSZ guarantee)

Full state sequence: pending → pending_review → approved → consumed (no skip).
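The "no skip" claim amounts to checking that every observed state is the immediate successor of the previous one. A minimal sketch follows. The state order is taken from the sequence above; the function name is hypothetical and this is not the project's own validator.

```python
# Lifecycle order as listed in the audit.
LIFECYCLE = ["pending", "pending_review", "approved", "consumed"]

def is_complete_without_skip(observed):
    """True if observed starts at the first state, ends at the last, and never skips."""
    if any(s not in LIFECYCLE for s in observed):
        return False  # unknown state
    positions = [LIFECYCLE.index(s) for s in observed]
    if not positions or positions[0] != 0 or positions[-1] != len(LIFECYCLE) - 1:
        return False
    # Each step must advance by exactly one position.
    return all(b - a == 1 for a, b in zip(positions, positions[1:]))

print(is_complete_without_skip(["pending", "pending_review", "approved", "consumed"]))  # True
print(is_complete_without_skip(["pending", "approved", "consumed"]))  # False (skips pending_review)
```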

7. Cleanup checkpoint

fn_iu_op_cleanup_dry_run(older_than_days=15) returned:

{"alias":"fn_iu_op_cleanup_dry_run","apply":false,"actions":[],"eligible_count":0,"older_than_days":15}

apply=false and actions=[]: no staging row was deleted or moved, because in this phase the dry run is the policy. Candidates older than 15 days: 0. The older c3f3f073 staging record is dated 2026-05-26 09:13, so it is still inside the 15-day window; its lifecycle_status is 'consumed', so by status it is already cleanable once the policy flips to apply mode in a future phase.
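The dry-run result and the 15-day window can be sanity-checked as follows. The JSON is the payload quoted above. The helper names are hypothetical. The "now" value is an assumption (the pilot's consumed_at is used as a stand-in for the audit time), as is treating the 09:13 timestamp as UTC.

```python
import json
from datetime import datetime, timedelta, timezone

# Payload exactly as returned by the dry run.
result = json.loads(
    '{"alias":"fn_iu_op_cleanup_dry_run","apply":false,'
    '"actions":[],"eligible_count":0,"older_than_days":15}'
)

def is_noop_dry_run(r):
    """True when the call applied nothing and found nothing eligible."""
    return r["apply"] is False and r["actions"] == [] and r["eligible_count"] == 0

def is_older_than(ts, now, days):
    """True when ts lies more than `days` days before now."""
    return now - ts > timedelta(days=days)

# Assumption: audit time approximated by the pilot's consumed_at, in UTC.
now = datetime(2026, 5, 26, 15, 17, 38, tzinfo=timezone.utc)
older_staging = datetime(2026, 5, 26, 9, 13, tzinfo=timezone.utc)  # c3f3f073

print(is_noop_dry_run(result))                                          # True
print(is_older_than(older_staging, now, result["older_than_days"]))     # False
```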

8. Carry-forward stale executor

The single executors_stale=1 value is legacy (iu_outbound_default, last tick 2026-05-22 11:31:41+00). The §15.5 silent-gap surface is visible but not closed; closing it remains a Phase 3+ task. See 09-carry-forward.md item 2.
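To see how far past the threshold the legacy executor is, its heartbeat age can be compared with stale_threshold_seconds. This is a sketch under two assumptions: the view marks an executor stale when its last tick is older than the threshold, and the pilot's consumed_at stands in for the time of the audit. How the view separates "fresh" from "warning" is not stated here and is not modelled.

```python
from datetime import datetime

STALE_THRESHOLD_SECONDS = 300  # stale_threshold_seconds from v_queue_health

def heartbeat_age_seconds(last_tick, as_of):
    """Seconds elapsed between an executor's last tick and the reference time."""
    return (as_of - last_tick).total_seconds()

last_tick = datetime.fromisoformat("2026-05-22 11:31:41+00:00")  # iu_outbound_default
as_of = datetime.fromisoformat("2026-05-26 15:17:38+00:00")      # assumed audit time

age = heartbeat_age_seconds(last_tick, as_of)
print(age)                            # 359157.0
print(age > STALE_THRESHOLD_SECONDS)  # True
```

Under these assumptions the last tick is a little over four days old, which is why the executor shows up as stale rather than as a warning.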

9. Pilot durations

step                                | duration (ms)
------------------------------------+--------------
Enqueue 6 jobs                      |           ~27
cut.copy_to_staging claim→ack       |             8
cut.mark (incl. fn_iu_op_mark_file) |            38
cut.verify_mark                     |            13
cut.cut initial attempt (failed)    |           356
Patch + retry cut.cut               |           395
cut.verify_cut                      |            51
cut.cleanup_checkpoint              |            84
Final heartbeat + gate close        |            36

Total wall time of pilot SQL ~3 min 50 s (most of it idle between the failed cut and the patch).
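Adding up the listed step durations shows how small the active share of the wall time is. The figures are the ones reported above, with "~27 ms" taken as 27 and "~3 min 50 s" taken as 230 s; the dictionary layout is illustrative only.

```python
# Step durations in milliseconds, as listed in the pilot durations table.
durations_ms = {
    "Enqueue 6 jobs": 27,  # reported as ~27 ms
    "cut.copy_to_staging claim->ack": 8,
    "cut.mark (incl. fn_iu_op_mark_file)": 38,
    "cut.verify_mark": 13,
    "cut.cut initial attempt (failed)": 356,
    "Patch + retry cut.cut": 395,
    "cut.verify_cut": 51,
    "cut.cleanup_checkpoint": 84,
    "Final heartbeat + gate close": 36,
}

active_ms = sum(durations_ms.values())
wall_ms = (3 * 60 + 50) * 1000  # ~3 min 50 s of wall time

print(active_ms)                               # 1008
print(f"{active_ms / wall_ms:.2%} of wall time")  # 0.44% of wall time
```

On these numbers about one second of the pilot was measured work, consistent with the note that most of the wall time was idle.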

knowledge/dev/laws/dieu44-trien-khai/v0.6-dieu45-phase-3-mark-cut-queue-pilot-dieu37-write-channel/08-queue-health-and-stuck-job-check.md