KB-206C

Automation Orchestrator Design · 05 Batch / Overnight / Error / Resume Model

Revision 1
dot-iu-cutter · v0.5 · automation-orchestrator-design · batch-mode · overnight-run · error-queue · resume-queue · human-review-queue · g5-pass · dieu44 · 2026-05-20

doc 5 of 7 · 2026-05-20 · design-only macro

phase                : G5 — batch mode, overnight scheduling, error/resume queues
outcome              : G5 PASS — serial-per-doc batch with fan-in sovereign approvals
production_mutation  : NONE

1. Why batch mode is non-trivial

Cutting the Constitution required ≥ 20 separate stop-routed macros. Future cuts (e.g. 30 laws in one overnight run) cannot tolerate that pattern. But the v0.5 doctrine forbids batching authority: each (document, phase) pair needs its own sovereign approval (DQ_6, review_decision per phase).

Batch mode reconciles these as follows:

Batch mode collects REQUESTS in parallel and processes APPROVALS serially. The sovereign signs N approval docs in one sitting; the orchestrator then drives all N runs in parallel-safe lanes until each hits its next sovereign gate.

2. Batch entry point

$ cutter orchestrate batch --queue queue.yaml [--max-concurrent 4] \
                            [--stop-on-first-failure] [--resume]

queue.yaml shape (≤ 100 documents per batch in MVP):

schema_version : 1
batch_id       : <UUID7>
created_by     : <actor>
created_utc    : <ISO8601>
default_scope  : enacted_only
default_actor  : sovereign@example
items:
  - document_id : ICX-LAW-2026-001
    source_uri  : null    # null = use latest source_version already in DB
  - document_id : ICX-LAW-2026-002
    source_uri  : null
  - document_id : ICX-LAW-2026-003
    source_uri  : null

No secrets, no DSNs, no signature material in the queue file.
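
The constraints on queue.yaml stated above (schema version, the 100-document cap, no secret material) could be enforced by a pre-flight check. The following is a minimal sketch, not the orchestrator's actual validator: it assumes the YAML has already been parsed into a dict, and `validate_queue`, `FORBIDDEN_KEY_HINTS`, and the duplicate-`document_id` check are hypothetical additions for illustration.

```python
from typing import Any

MAX_ITEMS = 100  # "≤ 100 documents per batch in MVP"

# Hypothetical deny-list. The design only states "no secrets, no DSNs,
# no signature material"; these substrings are an assumed approximation.
FORBIDDEN_KEY_HINTS = ("secret", "password", "dsn", "signature", "token")


def validate_queue(queue: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means no problem was found."""
    problems: list[str] = []
    if queue.get("schema_version") != 1:
        problems.append("schema_version must be 1")

    items = queue.get("items") or []
    if len(items) > MAX_ITEMS:
        problems.append(f"too many items: {len(items)} > {MAX_ITEMS}")

    # Assumption: a document may appear only once per batch.
    seen: set[str] = set()
    for index, item in enumerate(items):
        doc_id = item.get("document_id")
        if not doc_id:
            problems.append(f"items[{index}] has no document_id")
        elif doc_id in seen:
            problems.append(f"duplicate document_id: {doc_id}")
        else:
            seen.add(doc_id)

    # Scan top-level keys and per-item keys for secret-like names.
    all_keys = list(queue) + [key for item in items for key in item]
    for key in all_keys:
        if any(hint in key.lower() for hint in FORBIDDEN_KEY_HINTS):
            problems.append(f"forbidden key in queue file: {key}")
    return problems
```

Returning a list of problems rather than raising on the first one lets the operator fix a queue file in a single pass.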

3. Batch state machine (overlays per-doc state machine)

batch_states:
  - batch_pending
  - batch_planning         # per-doc phases 1–5 running in parallel lanes
  - batch_awaiting_cut_authz_set     # all docs paused at SG_1
  - batch_writing          # per-doc phases 6–9 running serially in lanes
  - batch_awaiting_lifecycle_authz_set   # all docs paused at SG_2
  - batch_enacting         # per-doc phase 10 running serially
  - batch_closeout         # consolidated KB report
  - batch_failed_partial   # any doc terminal-failed; others rolled to terminal
  - batch_voided

Transitions:

batch_planning                     → batch_awaiting_cut_authz_set       : all docs reached SG_1
batch_awaiting_cut_authz_set       → batch_writing                      : SG_1 set received
batch_writing                      → batch_awaiting_lifecycle_authz_set : all docs through phase 9
batch_awaiting_lifecycle_authz_set → batch_enacting                     : SG_2 set received
batch_enacting                     → batch_closeout                     : all docs through phase 10
batch_closeout                     → terminal_success                   : final report uploaded
any_doc_terminal_fail              → batch_failed_partial               : (configurable: see §6)
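
The forward transitions listed above can be encoded as a lookup table so that any unlisted move is rejected. This is an illustrative sketch only: the names `BatchState`, `ALLOWED`, and `advance` are hypothetical, and the event-driven failure transition (any_doc_terminal_fail) is deliberately left out because its behaviour is policy-dependent.

```python
from enum import Enum


class BatchState(str, Enum):
    PENDING = "batch_pending"
    PLANNING = "batch_planning"
    AWAITING_CUT_AUTHZ_SET = "batch_awaiting_cut_authz_set"
    WRITING = "batch_writing"
    AWAITING_LIFECYCLE_AUTHZ_SET = "batch_awaiting_lifecycle_authz_set"
    ENACTING = "batch_enacting"
    CLOSEOUT = "batch_closeout"
    FAILED_PARTIAL = "batch_failed_partial"
    VOIDED = "batch_voided"
    # Appears only as a transition target in the design, not in batch_states.
    TERMINAL_SUCCESS = "terminal_success"


S = BatchState

# Only the transitions the design lists explicitly; the value is the trigger.
ALLOWED = {
    (S.PLANNING, S.AWAITING_CUT_AUTHZ_SET): "all docs reached SG_1",
    (S.AWAITING_CUT_AUTHZ_SET, S.WRITING): "SG_1 set received",
    (S.WRITING, S.AWAITING_LIFECYCLE_AUTHZ_SET): "all docs through phase 9",
    (S.AWAITING_LIFECYCLE_AUTHZ_SET, S.ENACTING): "SG_2 set received",
    (S.ENACTING, S.CLOSEOUT): "all docs through phase 10",
    (S.CLOSEOUT, S.TERMINAL_SUCCESS): "final report uploaded",
}


def advance(current: BatchState, target: BatchState) -> BatchState:
    """Return the target state, or raise if the move is not a listed transition."""
    if (current, target) not in ALLOWED:
        raise ValueError(
            f"illegal batch transition: {current.value} -> {target.value}"
        )
    return target
```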

4. Fan-in approval flow

SG_1_set_handshake:
  step_a (orchestrator) : per doc, upload SG1-request.md to that doc's KB folder
  step_b (orchestrator) : upload ONE consolidated
                          "<batch>/SG1-batch-overview.md" linking all per-doc
                          SG1 request docs
  step_c (orchestrator) : exit 0 ; await sovereign
  step_d (sovereign)    : reviews batch overview; for each doc requiring
                          approval, uploads an SG1-approval.md per-doc.
                          Sovereign MAY skip docs (those stay paused).
  step_e (operator)     : cutter orchestrate resume-batch --batch-id <id>
                                       --approval-set <SG1-approvals folder>
  step_f (orchestrator) : for each approval, validates + resumes that doc;
                          unsigned docs stay at batch_awaiting_cut_authz_set
                          and become batch_partial unless reopened.

SG_2_set_handshake:
  same shape with the additional constraint that each per-doc approval
  doc MUST embed a fresh review_decision_id (created by sovereign or by
  a sovereign-authorized tool that fires N inserts in one txn).

This way the sovereign reads ONE batch overview, then signs N approval docs — and the per-doc audit trail still has its own review_decision row, satisfying Phase 7 doctrine.
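
Step_f of the handshake (validate each approval, resume the matching doc, leave the rest paused) might look roughly like the sketch below. It is an assumption-laden illustration: `Approval` and `partition_by_approval` are hypothetical names, real validation would also check signatures and digests, and "fresh" review_decision_id is approximated here as "non-empty and not reused within the set".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Approval:
    document_id: str
    gate: str                                  # "SG_1" or "SG_2"
    review_decision_id: Optional[str] = None   # mandatory for SG_2


def partition_by_approval(
    paused_doc_ids: list[str], approvals: list[Approval], gate: str
) -> tuple[list[str], list[str]]:
    """Split paused docs into (resumable, still_paused) for one gate."""
    approved: set[str] = set()
    seen_decisions: set[str] = set()
    for approval in approvals:
        if approval.gate != gate:
            continue
        if gate == "SG_2":
            decision = approval.review_decision_id
            # Reject a missing or reused review_decision_id.
            if not decision or decision in seen_decisions:
                continue
            seen_decisions.add(decision)
        approved.add(approval.document_id)

    resumable = [d for d in paused_doc_ids if d in approved]
    still_paused = [d for d in paused_doc_ids if d not in approved]
    return resumable, still_paused
```

Docs the sovereign skipped simply end up in the second list, which matches the "sovereign MAY skip docs" rule in step_d.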

5. Parallel-lane safety

parallel_lanes:
  default_max_concurrent : 4
  lane_assignment        : one doc per lane ; FIFO
  per_lane_state         : independent sidecar at .../runs/<run_id>/state.json

forbidden_overlap:
  - two lanes mutating the same canonical_address_prefix at once
    (orchestrator pre-flight refuses if any two queue items share a prefix)
  - two lanes computing the same writer_digest at once
    (rare but possible if two docs have identical structure;
     orchestrator detects at cutplan phase and serializes them)
  - two lanes opening cut_leg_a txn simultaneously
    (a global file lock /var/lib/cutter/locks/cut_leg_a.lock serializes
     phase 6 across all lanes, as a backstop against double-write surprises)
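
The first forbidden overlap (two queue items sharing a canonical_address_prefix) lends itself to a simple pre-flight grouping. The sketch below assumes the prefix for each document is already known; how it is derived is not specified here, and `find_prefix_conflicts` is a hypothetical helper that treats "share a prefix" as exact equality.

```python
from collections import defaultdict


def find_prefix_conflicts(prefix_by_doc: dict[str, str]) -> dict[str, list[str]]:
    """Map each prefix claimed by more than one document to those documents."""
    docs_by_prefix: dict[str, list[str]] = defaultdict(list)
    for doc_id, prefix in prefix_by_doc.items():
        docs_by_prefix[prefix].append(doc_id)
    return {
        prefix: sorted(doc_ids)
        for prefix, doc_ids in docs_by_prefix.items()
        if len(doc_ids) > 1
    }
```

A non-empty result would be grounds for the pre-flight to refuse the batch before any lane starts.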

phase_6_serialization_rationale:
  fn_iu_create owns an advisory_xact_lock per canonical_address ;
  but two docs with DIFFERENT prefixes would be deadlock-free in theory.
  We still serialize phase 6 globally in MVP because the v0.5 backup
  (pg_dump narrow) is heavy and overlapping backups would saturate IO.
  Re-evaluate the global lock in v0.7 once we have steady-state metrics.
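
A global file lock of the kind described for phase 6 can be sketched with an advisory `flock` on Unix. This is one possible approach under stated assumptions, not the orchestrator's implementation: `phase6_lock` is a hypothetical name, the lock path is passed in by the caller, and `fcntl` is unavailable on Windows.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def phase6_lock(lock_path: str):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # Blocks until no other lane (process or descriptor) holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A lane would wrap its phase 6 work in `with phase6_lock(path): ...`; because the kernel drops the lock when the descriptor closes, a crashed lane does not leave the lock held.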

6. Failure handling — three policies

policy_default (--no-stop-on-first-failure):
  on_any_doc_failure : the failed doc moves to error_queue/ in the batch folder;
                       remaining docs continue.
  on_batch_close     : final batch report enumerates {successful, failed}.

policy_strict (--stop-on-first-failure):
  on_any_doc_failure : all in-flight docs are STOPPED at next safe boundary
                       (between phases ; never mid-txn).
                       Successful-so-far docs stay terminal-success.
                       Pending docs become batch_voided pending sovereign review.

policy_quarantine (DEFAULT for overnight runs):
  on_drift_or_invariant_fail : the failing doc is quarantined ;
                                its sidecar moves to quarantine_queue/ ;
                                a single KB STOP-quarantine doc is uploaded ;
                                batch proceeds with remaining docs.
  on_sovereign_stop          : entire batch enters batch_awaiting_<gate> ;
                                only the specific gate-pending docs wait.

policy_quarantine is the recommended default for unattended overnight runs because:

  • a single bad document doesn't waste the night;
  • the failure surface (one KB doc per quarantine) is bounded;
  • sovereign reviews one queue in the morning.
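
The three policies can be summarised as a small dispatch from (policy, failure kind) to a destination queue and a continue/stop decision. The sketch below is hypothetical (`FailureAction` and `on_doc_failure` are illustrative names). Where the design is silent, the code says so instead of guessing: strict mode names no destination queue for the failed doc, and quarantine mode only defines drift and invariant failures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailureAction:
    queue: Optional[str]     # destination folder for the failed doc's sidecar
    batch_continues: bool    # whether remaining docs keep running


def on_doc_failure(policy: str, failure_kind: str) -> FailureAction:
    if policy == "default":
        # Failed doc moves to error_queue/; remaining docs continue.
        return FailureAction("error_queue", True)
    if policy == "strict":
        # In-flight docs stop at the next safe boundary. The design does not
        # name a queue for the failed doc here, so None marks that gap.
        return FailureAction(None, False)
    if policy == "quarantine":
        if failure_kind in ("drift", "invariant_fail"):
            return FailureAction("quarantine_queue", True)
        raise ValueError(
            f"quarantine policy does not define handling for {failure_kind!r}"
        )
    raise ValueError(f"unknown policy: {policy!r}")
```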

7. Overnight scheduling

launch_pattern:
  cron               : 22:00 local (deployment cron, NOT orchestrator-internal)
  command            : cutter orchestrate batch --queue overnight.yaml
                                      --max-concurrent 2
                                      --quarantine
  hard_cap_minutes   : 360 (6 h wall clock for an overnight batch)
  per_doc_hard_cap   : 60 min (doc 02 §6)
  alarm_action       : STOP_OVER_BATCH_HARDCAP → upload partial summary

approval_window:
  pre_batch_window   : sovereign signs SG_1 + SG_2 approvals BEFORE batch start
                        (these are pre-batched approvals stored next to queue.yaml)
  in_batch_pause     : ANY in-batch request for SG_1 or SG_2 that the pre-batched
                        approval set does NOT cover → STOP_NEEDS_LATE_APPROVAL ;
                        doc moves to human_review_queue/ ;
                        batch proceeds.

pre_batched_approval_validity:
  scope               : exactly the (document_id, source_version_id, candidate_count,
                        writer_digest, change_set_id-placeholder) tuple
  ttl                 : 12 h from sovereign signature
  consume             : exactly-once per doc

This pre-batched approval mechanism makes unattended runs feasible WITHOUT changing the v0.5 "fresh review_decision per phase" doctrine — the sovereign still issued each review_decision; they just did it ahead of time.
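
The three validity rules for a pre-batched approval (exact scope tuple, 12 h TTL, exactly-once consumption) could be checked as in the sketch below. Treat it as illustrative only: `ApprovalScope`, `PreBatchedApproval`, and `try_consume` are hypothetical names, the in-memory `consumed` flag stands in for whatever durable record the real system keeps, and treating exactly 12 h as still valid is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

TTL = timedelta(hours=12)  # "12 h from sovereign signature"


@dataclass(frozen=True)
class ApprovalScope:
    document_id: str
    source_version_id: str
    candidate_count: int
    writer_digest: str
    change_set_id_placeholder: str


@dataclass
class PreBatchedApproval:
    scope: ApprovalScope
    signed_utc: datetime
    consumed: bool = False


def try_consume(
    approval: PreBatchedApproval, request_scope: ApprovalScope, now: datetime
) -> str:
    """Return "ok" and mark the approval consumed, or the reason it is unusable."""
    if approval.consumed:
        return "already_consumed"
    if now - approval.signed_utc > TTL:
        return "expired"
    # Frozen dataclasses compare field by field, so any drift in the tuple
    # (e.g. a different candidate_count) invalidates the approval.
    if approval.scope != request_scope:
        return "scope_mismatch"
    approval.consumed = True
    return "ok"
```

Any result other than "ok" corresponds to the STOP_NEEDS_LATE_APPROVAL path: the doc moves to human_review_queue/ and the batch proceeds.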

8. Queue directory layout

${CUTTER_BATCH_DIR}/<batch_id>/
  queue.yaml
  SG1-pre-batched-approvals/
    ICX-LAW-2026-001.md
    ICX-LAW-2026-002.md
    ...
  SG2-pre-batched-approvals/
    ICX-LAW-2026-001.md
    ICX-LAW-2026-002.md
    ...
  runs/
    <run_id_1>/state.json
    <run_id_1>/pre_write.gpg
    <run_id_1>/phase-*.md
    ...
  error_queue/<run_id>/...        # failed docs (default policy)
  quarantine_queue/<run_id>/...   # quarantine policy
  human_review_queue/<run_id>/... # missing pre-batched approval
  batch_report.md                 # consolidated KB upload after closeout

9. Resume — at every level

single_run_resume       : cutter orchestrate resume --run-id <id>
                                                    [--approval-kb-id <path>]
batch_resume            : cutter orchestrate resume-batch --batch-id <id>
                                                          [--approval-set <folder>]
queue_resume_after_void : cutter orchestrate batch --queue queue.yaml
                                                   --skip-completed
                                                   --resume

skip-completed_semantics:
  - if (document_id, source_version_id) already has a terminal-success run,
    that item is skipped (idempotent).
  - if terminal-failure, item is skipped UNLESS --include-failed is set,
    which requires per-doc sovereign authorization.
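
The skip-completed rules reduce to a small decision function over the prior outcome for a (document_id, source_version_id) pair. The sketch below is a hypothetical illustration: `should_run` is an invented name, the outcome strings are assumed labels, and raising `PermissionError` when `--include-failed` lacks sovereign authorization is one possible reading of "requires per-doc sovereign authorization".

```python
from typing import Optional


def should_run(
    prior_outcome: Optional[str],
    include_failed: bool = False,
    sovereign_authorized: bool = False,
) -> bool:
    """Decide whether a queue item should be (re)run under --skip-completed."""
    if prior_outcome is None:
        return True                      # no earlier terminal run: process it
    if prior_outcome == "terminal-success":
        return False                     # idempotent skip
    if prior_outcome == "terminal-failure":
        if not include_failed:
            return False                 # skipped unless --include-failed
        if not sovereign_authorized:
            raise PermissionError(
                "--include-failed requires per-doc sovereign authorization"
            )
        return True
    raise ValueError(f"unexpected prior outcome: {prior_outcome!r}")
```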

10. Error queue triage UX

When the sovereign reviews error_queue/:

per_run_error_doc:
  fields:
    - run_id
    - document_id
    - state_at_failure
    - failed_gate
    - invariant_diff
    - drift_evidence (if any)
    - sidecar_snapshot_truncated (≤ 50 lines)
    - reproduction_command
    - recommended_disposition: { amend | void | open-fix-macro }
  uploaded_to_kb: yes, under .../<batch_id>/error_queue-summary.md

The sovereign can void / amend / fix one doc at a time without blocking the batch.

11. Human review queue triage UX

For runs that hit STOP_NEEDS_LATE_APPROVAL:

per_run_review_doc:
  fields:
    - run_id, document_id
    - state_at_pause (SG_1 or SG_2)
    - request_payload (manifest_digest, writer_digest, candidate_count, …)
    - reason_pre_batched_approval_missing
    - one_line_resume_command
  uploaded_to_kb: under .../<batch_id>/human_review_queue-summary.md

The sovereign uploads the missing approval; operator resumes: cutter orchestrate resume --run-id <id> --approval-kb-id <path>.

12. Verdict

g5_outcome              : PASS
batch_states            : 7 + 2 terminal-fail/voided (9 total)
parallel_lanes_default  : 4 (configurable)
phase_6_global_serialization : yes (advisory lock + global file lock)
sovereign_approval_fan_in  : N requests, N approvals, batched in time
overnight_default_policy : quarantine + pre-batched approvals
resume_at_three_levels   : single run, batch, queue
queues                   : error_queue, quarantine_queue, human_review_queue
clutter_bound            : O(N) docs per batch, O(1) overhead docs