Automation Orchestrator Design · 05 Batch / Overnight / Error / Resume Model
doc 5 of 7 · 2026-05-20 · design-only macro
phase : G5 (batch mode, overnight scheduling, error/resume queues)
outcome : G5 PASS (serial-per-doc batch with fan-in sovereign approvals)
production_mutation : NONE
1. Why batch mode is non-trivial
The Constitution required ≥ 20 separate stop-routed macros. Future cuts (e.g. cutting 30 laws in one overnight run) cannot tolerate that pattern. But the v0.5 doctrine forbids batching authority: each (document, phase) needs its own sovereign approval (DQ_6, review_decision per phase).
The batch mode reconciles these as follows:
Batch mode collects REQUESTS in parallel and processes APPROVALS serially. Sovereign signs N approval docs in one sitting; the orchestrator then drives all N runs in parallel-safe lanes until each hits its next sovereign gate.
2. Batch entry point
$ cutter orchestrate batch --queue queue.yaml [--max-concurrent 4] \
[--stop-on-first-failure] [--resume]
queue.yaml shape (≤ 100 documents per batch in MVP):
schema_version : 1
batch_id : <UUID7>
created_by : <actor>
created_utc : <ISO8601>
default_scope : enacted_only
default_actor : sovereign@example
items:
  - document_id : ICX-LAW-2026-001
    source_uri : null # null = use latest source_version already in DB
  - document_id : ICX-LAW-2026-002
    source_uri : null
  - document_id : ICX-LAW-2026-003
    source_uri : null
No secrets, no DSNs, no signature material in the queue file.
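A minimal validation sketch for the queue shape above, assuming the YAML has already been parsed into a plain dict. The function name `validate_queue`, the key-hint deny-list, and the duplicate-`document_id` check are illustrative assumptions; the design itself only fixes the 100-item cap and the "no secrets" rule.

```python
from typing import Any

MAX_ITEMS = 100  # MVP cap stated for queue.yaml
# Hypothetical deny-list: the design only says "no secrets, no DSNs,
# no signature material"; the exact key names are an assumption.
FORBIDDEN_KEY_HINTS = ("secret", "password", "dsn", "signature", "token")


def validate_queue(queue: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the queue looks acceptable."""
    problems: list[str] = []
    if queue.get("schema_version") != 1:
        problems.append("schema_version must be 1")
    items = queue.get("items") or []
    if not items:
        problems.append("items must be a non-empty list")
    if len(items) > MAX_ITEMS:
        problems.append(f"too many items: {len(items)} > {MAX_ITEMS}")
    seen: set[str] = set()
    for i, item in enumerate(items):
        doc_id = item.get("document_id")
        if not doc_id:
            problems.append(f"items[{i}]: missing document_id")
        elif doc_id in seen:  # assumption: one queue entry per document
            problems.append(f"items[{i}]: duplicate document_id {doc_id}")
        else:
            seen.add(doc_id)
    # Scan top-level and per-item keys for anything that looks like a credential.
    all_keys = list(queue) + [key for item in items for key in item]
    for key in all_keys:
        if any(hint in key.lower() for hint in FORBIDDEN_KEY_HINTS):
            problems.append(f"forbidden key: {key}")
    return problems
```

Returning a list of problems rather than raising on the first one lets the operator fix the whole queue file in one pass before the batch is launched.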
3. Batch state machine (overlays per-doc state machine)
batch_states:
- batch_pending
- batch_planning # per-doc phases 1–5 running in parallel lanes
- batch_awaiting_cut_authz_set # all docs paused at SG_1
- batch_writing # per-doc phases 6–9 running serially in lanes
- batch_awaiting_lifecycle_authz_set # all docs paused at SG_2
- batch_enacting # per-doc phase 10 running serially
- batch_closeout # consolidated KB report
- batch_failed_partial # any doc terminal-failed; others rolled to terminal
- batch_voided
Transitions:
batch_planning → batch_awaiting_cut_authz_set : all docs reached SG_1
batch_awaiting_cut_authz_set → batch_writing : SG_1 set received
batch_writing → batch_awaiting_lifecycle_authz_set : all docs through phase 9
batch_awaiting_lifecycle_authz_set → batch_enacting : SG_2 set received
batch_enacting → batch_closeout : all docs through phase 10
batch_closeout → terminal_success : final report uploaded
any_doc_terminal_fail → batch_failed_partial : (configurable: see §6)
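The transition list above can be sketched as a lookup table. This is an illustration only: the event names (`all_docs_reached_sg1` and so on) are hypothetical labels for the conditions in the list, the `fail_batch_on_doc_failure` switch stands in for the "configurable" behaviour, and no `batch_pending → batch_planning` edge is encoded because the list does not state one.

```python
from enum import Enum


class BatchState(str, Enum):
    PENDING = "batch_pending"
    PLANNING = "batch_planning"
    AWAITING_CUT_AUTHZ_SET = "batch_awaiting_cut_authz_set"
    WRITING = "batch_writing"
    AWAITING_LIFECYCLE_AUTHZ_SET = "batch_awaiting_lifecycle_authz_set"
    ENACTING = "batch_enacting"
    CLOSEOUT = "batch_closeout"
    FAILED_PARTIAL = "batch_failed_partial"
    VOIDED = "batch_voided"
    TERMINAL_SUCCESS = "terminal_success"


# Event names are hypothetical; each maps to one line of the transition list.
TRANSITIONS = {
    (BatchState.PLANNING, "all_docs_reached_sg1"): BatchState.AWAITING_CUT_AUTHZ_SET,
    (BatchState.AWAITING_CUT_AUTHZ_SET, "sg1_set_received"): BatchState.WRITING,
    (BatchState.WRITING, "all_docs_through_phase_9"): BatchState.AWAITING_LIFECYCLE_AUTHZ_SET,
    (BatchState.AWAITING_LIFECYCLE_AUTHZ_SET, "sg2_set_received"): BatchState.ENACTING,
    (BatchState.ENACTING, "all_docs_through_phase_10"): BatchState.CLOSEOUT,
    (BatchState.CLOSEOUT, "final_report_uploaded"): BatchState.TERMINAL_SUCCESS,
}

TERMINAL = {BatchState.FAILED_PARTIAL, BatchState.VOIDED, BatchState.TERMINAL_SUCCESS}


def advance(state: BatchState, event: str, fail_batch_on_doc_failure: bool = True) -> BatchState:
    """Return the next batch state, or raise ValueError on an illegal move."""
    if state in TERMINAL:
        raise ValueError(f"{state.value} is terminal")
    if event == "any_doc_terminal_fail":
        # Configurable: some failure policies let the batch carry on.
        return BatchState.FAILED_PARTIAL if fail_batch_on_doc_failure else state
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.value} on {event!r}") from None
```

Keeping the table as data makes it easy to check that every non-terminal state has exactly one forward edge.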
4. Fan-in approval flow
SG_1_set_handshake:
step_a (orchestrator) : per doc, upload SG1-request.md to that doc's KB folder
step_b (orchestrator) : upload ONE consolidated
"<batch>/SG1-batch-overview.md" linking all per-doc
SG1 request docs
step_c (orchestrator) : exit 0 ; await sovereign
step_d (sovereign) : reviews batch overview; for each doc requiring
approval, uploads an SG1-approval.md per-doc.
Sovereign MAY skip docs (those stay paused).
step_e (operator) : cutter orchestrate resume-batch --batch-id <id>
--approval-set <SG1-approvals folder>
step_f (orchestrator) : for each approval, validates + resumes that doc;
unsigned docs stay at batch_awaiting_cut_authz_set
and become batch_partial unless reopened.
SG_2_set_handshake:
same shape with the additional constraint that each per-doc approval
doc MUST embed a fresh review_decision_id (created by sovereign or by
a sovereign-authorized tool that fires N inserts in one txn).
This way the sovereign reads ONE batch overview, then signs N approval docs — and the per-doc audit trail still has its own review_decision row, satisfying Phase 7 doctrine.
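The fan-in step can be sketched as a partition over the docs paused at a gate. This is a simplified illustration: `fan_in`, its argument shapes, and the "reused id" rejection are assumptions, and real validation of an approval doc would check far more than the review_decision_id.

```python
def fan_in(
    paused_docs: list[str],
    approvals: dict[str, str],
    used_decision_ids: set[str],
) -> tuple[list[str], list[str], dict[str, str]]:
    """Split paused docs into (resume, still_paused, rejected).

    approvals maps document_id -> review_decision_id embedded in that
    doc's approval. An id counts as fresh only if it has not been seen
    before, either in earlier batches or earlier in this same set.
    """
    resume: list[str] = []
    still_paused: list[str] = []
    rejected: dict[str, str] = {}
    seen = set(used_decision_ids)  # copy so the caller's set is not mutated
    for doc_id in paused_docs:
        decision_id = approvals.get(doc_id)
        if decision_id is None:
            still_paused.append(doc_id)  # sovereign skipped this doc
        elif decision_id in seen:
            rejected[doc_id] = "review_decision_id reused"
        else:
            seen.add(decision_id)
            resume.append(doc_id)
    return resume, still_paused, rejected
```

One shared review_decision_id across several approvals is rejected for every doc after the first, which is what keeps the per-doc audit trail distinct even when the sovereign signs in one sitting.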
5. Parallel-lane safety
parallel_lanes:
default_max_concurrent : 4
lane_assignment : one doc per lane ; FIFO
per_lane_state : independent sidecar at .../runs/<run_id>/state.json
forbidden_overlap:
- two lanes mutating the same canonical_address_prefix at once
(orchestrator pre-flight refuses if any two queue items share a prefix)
- two lanes computing the same writer_digest at once
(rare but possible if two docs have identical structure;
orchestrator detects at cutplan phase and serializes them)
- two lanes opening cut_leg_a txn simultaneously
(a global file lock /var/lib/cutter/locks/cut_leg_a.lock serializes
phase 6 across all lanes, as a backstop against double-write surprises)
phase_6_serialization_rationale:
fn_iu_create owns an advisory_xact_lock per canonical_address ;
but two docs with DIFFERENT prefixes would be deadlock-free in theory.
We still serialize phase 6 globally in MVP because the v0.5 backup
(pg_dump narrow) is heavy and overlapping backups would saturate IO.
Re-evaluate the global lock in v0.7 once we have steady-state metrics.
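Two of the safeguards above can be sketched briefly: the pre-flight prefix check and the global phase-6 file lock. Both are illustrations under assumptions. `find_prefix_conflicts` takes a precomputed document-to-prefix mapping (how a prefix is derived is not specified here), and `phase6_lock` uses POSIX `flock`, which is only one possible way to implement the lock file.

```python
import fcntl
from collections import defaultdict
from contextlib import contextmanager


def find_prefix_conflicts(prefix_by_doc: dict[str, str]) -> dict[str, list[str]]:
    """Return {prefix: [doc ids]} for every prefix claimed by more than one doc.

    A non-empty result means the pre-flight should refuse the queue.
    """
    docs_by_prefix: dict[str, list[str]] = defaultdict(list)
    for doc_id, prefix in prefix_by_doc.items():
        docs_by_prefix[prefix].append(doc_id)
    return {p: sorted(docs) for p, docs in docs_by_prefix.items() if len(docs) > 1}


@contextmanager
def phase6_lock(lock_path: str):
    """Hold an exclusive advisory lock on lock_path for the duration of phase 6.

    POSIX-only sketch: flock blocks until every other lane has released.
    """
    with open(lock_path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

A lane would wrap its phase-6 work in `with phase6_lock(path):`, so the heavy backup and the write transaction never overlap with another lane's.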
6. Failure handling — three policies
policy_default (--no-stop-on-first-failure):
on_any_doc_failure : the failed doc moves to error_queue/ in the batch folder;
remaining docs continue.
on_batch_close : final batch report enumerates {successful, failed}.
policy_strict (--stop-on-first-failure):
on_any_doc_failure : all in-flight docs are STOPPED at next safe boundary
(between phases ; never mid-txn).
Successful-so-far docs stay terminal-success.
Pending docs become batch_voided pending sovereign review.
policy_quarantine (DEFAULT for overnight runs):
on_drift_or_invariant_fail : the failing doc is quarantined ;
its sidecar moves to quarantine_queue/ ;
a single KB STOP-quarantine doc is uploaded ;
batch proceeds with remaining docs.
on_sovereign_stop : entire batch enters batch_awaiting_<gate> ;
only the specific gate-pending docs wait.
policy_quarantine is the recommended default for unattended overnight
runs because:
- a single bad document doesn't waste the night;
- the failure surface (one KB doc per quarantine) is bounded;
- sovereign reviews one queue in the morning.
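The three policies can be compared side by side with a small dispatch sketch. The function and the disposition strings are hypothetical. In particular, the text does not say where the failed doc itself lands under the strict policy, so `terminal_failure` below is an assumption.

```python
from enum import Enum


class Policy(Enum):
    DEFAULT = "default"
    STRICT = "strict"
    QUARANTINE = "quarantine"


def dispositions(
    policy: Policy, failed: str, in_flight: list[str], pending: list[str]
) -> dict[str, str]:
    """Map each not-yet-finished doc to what happens after `failed` fails."""
    out: dict[str, str] = {}
    if policy is Policy.STRICT:
        out[failed] = "terminal_failure"  # assumption: not stated in the design
        for doc_id in in_flight:
            out[doc_id] = "stop_at_next_phase_boundary"  # never mid-txn
        for doc_id in pending:
            out[doc_id] = "batch_voided"
        return out
    # DEFAULT and QUARANTINE differ only in which queue receives the failed doc.
    out[failed] = "error_queue" if policy is Policy.DEFAULT else "quarantine_queue"
    for doc_id in in_flight + pending:
        out[doc_id] = "continue"
    return out
```

Docs that already reached terminal-success are deliberately absent from the result, since no policy touches them.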
7. Overnight scheduling
launch_pattern:
cron : 22:00 local (deployment cron, NOT orchestrator-internal)
command : cutter orchestrate batch --queue overnight.yaml
--max-concurrent 2
--quarantine
hard_cap_minutes : 360 (6 h wall clock for an overnight batch)
per_doc_hard_cap : 60 min (doc 02 §6)
alarm_action : STOP_OVER_BATCH_HARDCAP → upload partial summary
approval_window:
pre_batch_window : sovereign signs SG_1 + SG_2 approvals BEFORE batch start
(these are pre-batched approvals stored next to queue.yaml)
in_batch_pause : ANY in-batch request for SG_1 or SG_2 that the pre-batched
approval set does NOT cover → STOP_NEEDS_LATE_APPROVAL ;
doc moves to human_review_queue/ ;
batch proceeds.
pre_batched_approval_validity:
scope : exactly the (document_id, source_version_id, candidate_count,
writer_digest, change_set_id-placeholder) tuple
ttl : 12 h from sovereign signature
consume : exactly-once per doc
This pre-batched approval mechanism makes unattended runs feasible WITHOUT changing the v0.5 "fresh review_decision per phase" doctrine — the sovereign still issued each review_decision; they just did it ahead of time.
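The three validity rules (exact scope tuple, 12 h TTL, exactly-once consumption) can be sketched as a single check. The class and function names are hypothetical, and treating an expired or already-consumed approval the same as a missing one (STOP_NEEDS_LATE_APPROVAL) is an assumption based on the "does NOT cover" wording above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

TTL = timedelta(hours=12)  # from sovereign signature
STOP = "STOP_NEEDS_LATE_APPROVAL"


@dataclass(frozen=True)
class ApprovalScope:
    document_id: str
    source_version_id: str
    candidate_count: int
    writer_digest: str
    change_set_id_placeholder: str


@dataclass
class PreBatchedApproval:
    scope: ApprovalScope
    signed_utc: datetime
    consumed: bool = False


def try_consume(
    approval: Optional[PreBatchedApproval], request: ApprovalScope, now: datetime
) -> str:
    """Return "OK" and mark the approval consumed, or STOP if it does not cover the request."""
    if approval is None or approval.scope != request:
        return STOP  # missing, or any field of the tuple differs
    if approval.consumed:
        return STOP  # exactly-once: a second use is refused
    if now - approval.signed_utc > TTL:
        return STOP  # signed more than 12 h ago
    approval.consumed = True
    return "OK"
```

Because `ApprovalScope` is a frozen dataclass, equality compares every field, so a changed candidate_count or writer_digest between signing and execution invalidates the approval without any extra code.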
8. Queue directory layout
${CUTTER_BATCH_DIR}/<batch_id>/
queue.yaml
SG1-pre-batched-approvals/
ICX-LAW-2026-001.md
ICX-LAW-2026-002.md
...
SG2-pre-batched-approvals/
ICX-LAW-2026-001.md
ICX-LAW-2026-002.md
...
runs/
<run_id_1>/state.json
<run_id_1>/pre_write.gpg
<run_id_1>/phase-*.md
...
error_queue/<run_id>/... # failed docs (default policy)
quarantine_queue/<run_id>/... # quarantine policy
human_review_queue/<run_id>/... # missing pre-batched approval
batch_report.md # consolidated KB upload after closeout
9. Resume — at every level
single_run_resume : cutter orchestrate resume --run-id <id>
[--approval-kb-id <path>]
batch_resume : cutter orchestrate resume-batch --batch-id <id>
[--approval-set <folder>]
queue_resume_after_void : cutter orchestrate batch --queue queue.yaml
--skip-completed
--resume
skip-completed_semantics:
- if (document_id, source_version_id) already has a terminal-success run,
that item is skipped (idempotent).
- if terminal-failure, item is skipped UNLESS --include-failed is set,
which requires per-doc sovereign authorization.
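The skip rule can be written as a small predicate. This is a sketch only: the outcome strings and the `sovereign_authorized` flag are hypothetical stand-ins for however the orchestrator records prior runs and per-doc authorization.

```python
from typing import Optional


def should_skip(
    prior_outcome: Optional[str],
    include_failed: bool = False,
    sovereign_authorized: bool = False,
) -> bool:
    """Decide whether a queue item is skipped under --skip-completed.

    prior_outcome is the terminal result of the latest run for this
    (document_id, source_version_id), or None if there is no terminal run.
    """
    if prior_outcome == "terminal_success":
        return True  # idempotent: never redo a successful cut
    if prior_outcome == "terminal_failure":
        # Re-running a failed doc needs both the flag and per-doc authorization.
        return not (include_failed and sovereign_authorized)
    return False  # no terminal run yet: process the item
```

Requiring both conditions means `--include-failed` on its own never re-runs a failed document.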
10. Error queue triage UX
When the sovereign reviews error_queue/:
per_run_error_doc:
fields:
- run_id
- document_id
- state_at_failure
- failed_gate
- invariant_diff
- drift_evidence (if any)
- sidecar_snapshot_truncated (≤ 50 lines)
- reproduction_command
- recommended_disposition: { amend | void | open-fix-macro }
uploaded_to_kb: yes, under .../<batch_id>/error_queue-summary.md
The sovereign can void / amend / fix one doc at a time without blocking the rest of the batch.
11. Human review queue triage UX
For runs that hit STOP_NEEDS_LATE_APPROVAL:
per_run_review_doc:
fields:
- run_id, document_id
- state_at_pause (SG_1 or SG_2)
- request_payload (manifest_digest, writer_digest, candidate_count, …)
- reason_pre_batched_approval_missing
- one_line_resume_command
uploaded_to_kb: under .../<batch_id>/human_review_queue-summary.md
The sovereign uploads the missing approval; operator resumes:
cutter orchestrate resume --run-id <id> --approval-kb-id <path>.
12. Verdict
g5_outcome : PASS
batch_states : 7 + 2 terminal-fail/voided (the 9 states listed in §3)
parallel_lanes_default : 4 (configurable)
phase_6_global_serialization : yes (advisory lock + global file lock)
sovereign_approval_fan_in : N requests, N approvals, batched in time
overnight_default_policy : quarantine + pre-batched approvals
resume_at_three_levels : single run, batch, queue
queues : error_queue, quarantine_queue, human_review_queue
clutter_bound : O(N) docs per batch, O(1) overhead docs