KB-4965

Automation Orchestrator Design · 06 Implementation Roadmap & Module Plan

Revision 1
Tags: dot-iu-cutter, v0.5, automation-orchestrator-design, implementation-roadmap, module-plan, rollout-phases, g6-pass, no-impl-yet, dieu44, 2026-05-20

doc 6 of 7 · 2026-05-20 · design-only macro

phase                : G6 — module map + rollout sequence
outcome              : G6 PASS — 11 new modules ≤ 200 LOC each ;
                       6-macro rollout (O1..O6) ; first cut at O4
production_mutation  : NONE (no code authored in this macro)

1. Where the new code lives

New code goes under cutter_agent/orchestrator/ so it is clearly separable from the ratified v0.5 modules at the top level of cutter_agent/.

cutter_agent/
├── (existing v0.5 modules — UNCHANGED for orchestrator MVP)
├── orchestrator/
│   ├── __init__.py
│   ├── runner.py
│   ├── state_store.py
│   ├── run_context.py
│   ├── gates.py
│   ├── approval.py
│   ├── kb_reporter.py
│   ├── discover.py
│   ├── batch.py
│   └── phases/
│       ├── __init__.py
│       ├── source_pin.py
│       ├── mark.py
│       ├── cutplan.py
│       ├── backup.py
│       ├── grant_probe.py
│       ├── cut_leg_a.py
│       ├── structural_verify.py
│       ├── leg_b_record.py
│       ├── write_verify.py
│       ├── lifecycle_enact.py
│       └── closeout.py
└── (existing cli.py extended with new subcommands)

2. Per-module surface contracts

2.1 runner.py (~150 LOC)

class OrchestratorRunner:
    def __init__(self, sidecar_root: Path, kb: KBReporter,
                 db_provider: Callable[..., Any]): ...
    def cut(self, *, document_id: str, mode: Mode,
            actor: str) -> RunResult: ...
    def resume(self, *, run_id: str,
               approval_kb_id: Optional[Path]) -> RunResult: ...
    def void(self, *, run_id: str, reason_kb_id: Path) -> RunResult: ...

The runner owns the state machine (doc 02). It NEVER imports psycopg directly; it uses an injectable db_provider (same pattern as prod_iu_adapter).
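The injection pattern can be illustrated with a minimal sketch. It assumes db_provider is a zero-argument callable that returns a connection-like object; FakeConn and MiniRunner are hypothetical names used only here and are not part of the module plan.

```python
from typing import Any, Callable, List


class FakeConn:
    """Hypothetical stand-in for a DB connection; records the SQL it is given."""

    def __init__(self) -> None:
        self.executed: List[str] = []

    def execute(self, sql: str) -> None:
        self.executed.append(sql)


class MiniRunner:
    """Illustrative only: holds a provider and never imports a DB driver itself."""

    def __init__(self, db_provider: Callable[..., Any]) -> None:
        self._db_provider = db_provider

    def probe(self) -> Any:
        conn = self._db_provider()  # connection is obtained lazily, per call
        conn.execute("SELECT 1")
        return conn


# In tests the provider is a fake; in live mode it would wrap the real driver.
conn = MiniRunner(db_provider=FakeConn).probe()
```

Because the runner sees only the callable, unit tests can swap in an in-memory fake without touching runner code.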

2.2 run_context.py (~100 LOC)

@dataclass(frozen=False)
class RunContext:
    run_id: str
    document_id: str
    source_document_id: str
    source_version_id: str
    mode: Mode                        # "dryrun" | "live"
    state: OrchestratorState
    context_pins: Dict[str, Any]      # manifest_digest, region_sha, writer_digest, …
    phases: Dict[PhaseName, PhaseRecord]
    sovereign_approvals: List[ApprovalRecord]
    idempotency_keys: Dict[PhaseName, str]
    started_utc: str
    actor: str
    schema_version: int = 1

PIN values that were module constants in the Constitution cut now live in context_pins; phases populate them at run time, so nothing is hardcoded.

2.3 state_store.py (~120 LOC)

class StateStore:
    def __init__(self, sidecar_root: Path): ...
    def acquire(self, run_id: str) -> ContextManager[RunContext]:
        # fcntl-locked open of state.json
        ...
    def create(self, ctx: RunContext) -> None: ...
    def update(self, ctx: RunContext) -> None: ...
    def read(self, run_id: str) -> RunContext: ...
    def append_runs_index(self, ctx: RunContext) -> None: ...

Persistence is fcntl-locked JSON; resume MUST go through acquire.
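One way the locked read-modify-write could look is sketched below. This is a sketch under assumptions, not the planned implementation: the layout <sidecar_root>/<run_id>/state.json, the function name acquire_state, and the use of a plain dict in place of RunContext are all hypothetical. fcntl is available on Unix only.

```python
import fcntl
import json
from contextlib import contextmanager
from pathlib import Path
from typing import Any, Dict, Iterator


@contextmanager
def acquire_state(sidecar_root: Path, run_id: str) -> Iterator[Dict[str, Any]]:
    """Yield the run's state dict under an exclusive lock; persist on clean exit."""
    state_path = sidecar_root / run_id / "state.json"  # assumed layout
    with open(state_path, "r+", encoding="utf-8") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks until no other holder remains
        try:
            state = json.load(fh)
            yield state
            # Reached only if the caller's block did not raise:
            # rewrite the file in place while still holding the lock.
            fh.seek(0)
            json.dump(state, fh, indent=2, sort_keys=True)
            fh.truncate()
            fh.flush()
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
```

If the caller's block raises, the write is skipped and the file keeps its previous content, which is the behaviour a resume path would want.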

2.4 gates.py (~100 LOC)

GATE_INVARIANTS: Dict[GateName, Dict[str, Callable[[RunContext], bool]]] = {...}

def evaluate_internal(gate: GateName, ctx: RunContext) -> GateResult: ...
def sovereign_handshake(gate: SovereignGateName, ctx: RunContext,
                        kb: KBReporter) -> None: ...
def validate_approval(approval_kb_id: Path, expected_gate: SovereignGateName,
                      ctx: RunContext) -> ApprovalRecord: ...
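
A minimal sketch of how evaluate_internal might walk the invariant table follows. The gate name G_EXAMPLE, both invariant names, the GateResult fields, and the trimmed Ctx class are assumptions made for illustration; only the shape of GATE_INVARIANTS comes from the contract above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Ctx:
    """Hypothetical, trimmed stand-in for RunContext."""
    context_pins: Dict[str, Any] = field(default_factory=dict)


@dataclass
class GateResult:
    """Assumed result shape: overall verdict plus the names that failed."""
    gate: str
    passed: bool
    failed_invariants: List[str]


# Gate and invariant names below are invented for illustration.
GATE_INVARIANTS: Dict[str, Dict[str, Callable[[Ctx], bool]]] = {
    "G_EXAMPLE": {
        "manifest_digest_pinned": lambda c: "manifest_digest" in c.context_pins,
        "region_sha_pinned": lambda c: "region_sha" in c.context_pins,
    },
}


def evaluate_internal(gate: str, ctx: Ctx) -> GateResult:
    """Run every invariant for the gate; report all failures, not just the first."""
    failed = [name for name, check in GATE_INVARIANTS[gate].items()
              if not check(ctx)]
    return GateResult(gate=gate, passed=not failed, failed_invariants=failed)


result = evaluate_internal("G_EXAMPLE", Ctx(context_pins={"manifest_digest": "abc"}))
```

Collecting every failed invariant, rather than stopping at the first, gives the phase report a complete picture in one pass.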

2.5 discover.py (~150 LOC)

Pure read-only helpers. NO writes ever:

def discover_source_document(conn, document_id: str) -> Optional[UUID]: ...
def discover_latest_source_version(conn, sd_id: UUID) -> SourceVersion: ...
def discover_function_md5(conn, qualified_fn: str) -> str: ...
def discover_grant_matrix(conn) -> GrantMatrix: ...
def discover_vocab_snapshot(conn) -> VocabSnapshot: ...
def discover_iu_count(conn, doc_prefix: str) -> int: ...
def discover_lifecycle_vocab_present(conn) -> bool: ...

2.6 kb_reporter.py (~100 LOC)

class KBReporter:
    def __init__(self, api: AgentDataAPI): ...
    def upload_phase_doc(self, ctx: RunContext, phase: PhaseName,
                         body_md: str) -> str: ...    # returns kb_id
    def upload_sovereign_request(self, ctx: RunContext,
                                 gate: SovereignGateName,
                                 body_md: str) -> str: ...
    def upload_stop_doc(self, ctx: RunContext, reason: str,
                        body_md: str) -> str: ...
    def append_to_runs_index(self, line: str) -> None: ...

Retries (3, exp backoff, cap 30 s) live here.
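The retry policy could be sketched as a small helper. This assumes "3" means three attempts in total and a 1 s base delay; neither is stated in the design, and with_retries is a hypothetical name. The sleep function is injectable so tests do not wait.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(call: Callable[[], T], *, attempts: int = 3,
                 base_delay_s: float = 1.0, cap_s: float = 30.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `call`; on exception wait base * 2**n (capped at cap_s) and retry."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for n in range(attempts):
        try:
            return call()
        except Exception:
            if n == attempts - 1:
                raise  # attempts exhausted: surface the last error
            sleep(min(cap_s, base_delay_s * (2 ** n)))
    raise AssertionError("unreachable")  # loop always returns or raises
```

With the defaults the waits are 1 s and 2 s, so the 30 s cap only matters if the attempt count or base delay is raised.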

2.7 approval.py (~80 LOC)

def validate_sovereign_approval(
    kb: KBReporter, approval_kb_id: Path,
    expected_gate: SovereignGateName, run_id: str,
    expected_payload_sha: str) -> ApprovalRecord:
    """
    - re-reads the KB doc
    - checks the sovereign signature line (StubSigning for v0.6)
    - checks (gate, run_id, payload_sha) match
    - checks approval age ≤ 24 h (SG_1) / ≤ 12 h (SG_2 batched)
    - raises ApprovalInvalid on any failure
    """

The signing surface is cutter_agent.signing.SigningProvider (existing). Default in v0.6 = StubSigningProvider. The real-crypto swap is sequenced separately (see §7 below).
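The matching and freshness checks from the docstring can be sketched as follows. The Approval dataclass is a hypothetical parsed form of the KB doc, check_approval is an invented name, and the 12 h window is applied to SG_2 generally as a simplifying assumption. The signature check and the KB re-read are deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


class ApprovalInvalid(Exception):
    """Raised on any mismatch or stale approval."""


@dataclass(frozen=True)
class Approval:
    """Hypothetical parsed form of the approval KB doc."""
    gate: str
    run_id: str
    payload_sha: str
    signed_utc: datetime


MAX_AGE = {"SG_1": timedelta(hours=24), "SG_2": timedelta(hours=12)}


def check_approval(a: Approval, *, expected_gate: str, run_id: str,
                   expected_payload_sha: str, now: datetime) -> None:
    """Raise ApprovalInvalid unless the approval matches and is fresh."""
    if (a.gate, a.run_id, a.payload_sha) != (expected_gate, run_id,
                                             expected_payload_sha):
        raise ApprovalInvalid("gate/run_id/payload_sha mismatch")
    age = now - a.signed_utc
    # A negative age means the approval is dated in the future; reject that too.
    if age < timedelta(0) or age > MAX_AGE[expected_gate]:
        raise ApprovalInvalid(f"approval age {age} outside window")


now = datetime(2026, 5, 20, 12, 0, tzinfo=timezone.utc)
fresh = Approval("SG_1", "run-1", "abc", now - timedelta(hours=23))
check_approval(fresh, expected_gate="SG_1", run_id="run-1",
               expected_payload_sha="abc", now=now)  # passes silently
```

Passing `now` in explicitly keeps the check deterministic under test.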

2.8 phases/*.py (≤ 200 LOC each)

Each phase module exports:

def run(ctx: RunContext, *, db_provider, kb: KBReporter) -> None:
    """
    - re-read live drift before action
    - perform the phase's work (reusing v0.5-proven modules)
    - record invariants, idempotency_key, artifacts into ctx
    - evaluate_internal(GATE_FOR_THIS_PHASE, ctx)
    - upload phase KB doc
    - update ctx.state to the next state
    """

Phases are stateless; all state goes through RunContext.
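The ordering in the docstring above can be sketched with a generic driver. Everything here is a stand-in: Ctx is a trimmed substitute for RunContext, StopDrift is a hypothetical exception, and the callables passed to run_phase replace the real db_provider, gate, and KBReporter.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


class StopDrift(Exception):
    """Hypothetical: raised when a live value no longer matches its pin."""


@dataclass
class Ctx:
    """Hypothetical, trimmed stand-in for RunContext."""
    state: str
    context_pins: Dict[str, Any] = field(default_factory=dict)
    uploaded: List[str] = field(default_factory=list)


def run_phase(ctx: Ctx, *, read_live_pins: Callable[[], Dict[str, Any]],
              work: Callable[[Ctx], None], gate: Callable[[Ctx], bool],
              upload: Callable[[str], str], next_state: str) -> None:
    live = read_live_pins()
    # 1. Drift check: every pinned value must still match the live value.
    for key, pinned in ctx.context_pins.items():
        if live.get(key) != pinned:
            raise StopDrift(key)
    work(ctx)                                  # 2. the phase's actual work
    if not gate(ctx):                          # 3. internal gate
        raise RuntimeError("gate failed")
    ctx.uploaded.append(upload(f"phase report for {ctx.state}"))  # 4. KB doc
    ctx.state = next_state                     # 5. advance only after all passed
```

The state transition comes last, so a failure at any earlier step leaves ctx.state unchanged and the run resumable from the same phase.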

2.9 batch.py (~200 LOC)

class BatchRunner:
    def __init__(self, sidecar_root, kb, db_provider): ...
    def run_queue(self, queue_path: Path, *, max_concurrent: int = 4,
                  policy: FailurePolicy = FailurePolicy.QUARANTINE) -> BatchResult: ...
    def resume_batch(self, batch_id: str,
                     approval_set_dir: Optional[Path]) -> BatchResult: ...

The batch runner wraps N OrchestratorRunner instances, a global lane scheduler, and global file locks (per doc 05 §5).

2.10 CLI extensions (cli.py)

$ cutter orchestrate cut --document-id <id> [--mode dryrun|live]
                          [--actor <a>]
$ cutter orchestrate resume --run-id <id> [--approval-kb-id <path>]
$ cutter orchestrate void --run-id <id> --reason-kb-id <path>
$ cutter orchestrate batch --queue <path> [--max-concurrent N]
                            [--policy quarantine|strict|continue]
$ cutter orchestrate resume-batch --batch-id <id> [--approval-set <dir>]
$ cutter orchestrate inspect --run-id <id>   # human-readable sidecar summary

Production refusal is preserved: any of these subcommands invoked with --mode live and no valid approval refuses with a non-zero exit code and a clear message. There is no --force override.
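The subcommand tree could be wired with argparse roughly as sketched below. Flag names follow the usage lines above; the default of dryrun for --mode is an assumption, and the defaults for --max-concurrent and --policy are taken from the BatchRunner signature. The approval-based refusal itself is not modelled here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cutter")
    top = parser.add_subparsers(dest="command", required=True)
    orch = top.add_parser("orchestrate").add_subparsers(dest="action",
                                                        required=True)

    cut = orch.add_parser("cut")
    cut.add_argument("--document-id", required=True)
    cut.add_argument("--mode", choices=["dryrun", "live"],
                     default="dryrun")  # ASSUMPTION: safe default
    cut.add_argument("--actor")

    resume = orch.add_parser("resume")
    resume.add_argument("--run-id", required=True)
    resume.add_argument("--approval-kb-id")

    void = orch.add_parser("void")
    void.add_argument("--run-id", required=True)
    void.add_argument("--reason-kb-id", required=True)

    batch = orch.add_parser("batch")
    batch.add_argument("--queue", required=True)
    batch.add_argument("--max-concurrent", type=int, default=4)
    batch.add_argument("--policy", choices=["quarantine", "strict", "continue"],
                       default="quarantine")

    resume_batch = orch.add_parser("resume-batch")
    resume_batch.add_argument("--batch-id", required=True)
    resume_batch.add_argument("--approval-set")

    inspect_cmd = orch.add_parser("inspect")
    inspect_cmd.add_argument("--run-id", required=True)
    # Deliberately no --force flag anywhere in the tree.
    return parser


args = build_parser().parse_args(["orchestrate", "cut", "--document-id", "D1"])
```

Since no parser defines --force, argparse rejects it as an unrecognized argument and exits non-zero, which matches the refusal posture.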

3. Config model

dot_config_keys_new_in_v0.6:
  orchestrator.sidecar_root            : "/var/lib/cutter/runs"
  orchestrator.batch_dir               : "/var/lib/cutter/batches"
  orchestrator.global_locks_dir        : "/var/lib/cutter/locks"
  orchestrator.backup_target_table_set : "public.information_unit, public.unit_version, …"
  orchestrator.backup_gpg_fpr          : ${BACKUP_GPG_FPR}    # GSM resolves at boot
  orchestrator.grant_apply_principal   : "directus"           # not used at runtime in v0.6+
  orchestrator.expected_grant_matrix_sha: "<sha256 of canonical YAML>"
  orchestrator.kb_runs_index_path      : "knowledge/dev/laws/_orchestrator-runs-index/v0.6-runs.md"
  orchestrator.phase_soft_cap_minutes  : (per-phase YAML map)
  orchestrator.run_hard_cap_minutes    : 60

Every value is editable by the sovereign without a code change.

4. GSM / secret handling

secrets_consumed:
  BACKUP_GPG_FPR           : GPG public-key fingerprint for backup encryption
  CUTTER_EXEC_DSN          : production DSN for cutter_exec (live mode only)
  CUTTER_VERIFY_DSN        : production DSN for cutter_verify (live mode only)
  DIRECTUS_SECDEF_DSN      : production DSN for directus principal (probe lane)
  AGENT_DATA_API_TOKEN     : KB upload auth

secrets_NOT_user_supplied  : YES — all resolved from GSM at process start
secrets_NOT_in_sidecar     : YES — sidecar holds keys, not values
secrets_NOT_in_KB_reports  : YES — kb_reporter strips on serialize
secrets_NOT_in_logs        : YES — logger redacts via env-guard list

The hardcode-cleanliness policy applies verbatim: env-var NAMES are allowed in refusal guards, env-var VALUES are never recorded.
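The names-versus-values rule can be illustrated with a small redaction helper. This is a sketch only: the function name redact and the marker format are assumptions, while the guarded names are the secrets listed above.

```python
import os
from typing import Iterable, Mapping

GUARDED_ENV_NAMES = (
    "BACKUP_GPG_FPR",
    "CUTTER_EXEC_DSN",
    "CUTTER_VERIFY_DSN",
    "DIRECTUS_SECDEF_DSN",
    "AGENT_DATA_API_TOKEN",
)


def redact(text: str, env: Mapping[str, str] = os.environ,
           names: Iterable[str] = GUARDED_ENV_NAMES) -> str:
    """Replace each guarded variable's VALUE with a marker citing its NAME."""
    for name in names:
        value = env.get(name)
        if value:  # skip unset or empty values; replacing "" would corrupt text
            text = text.replace(value, f"<redacted:{name}>")
    return text


line = redact("connect postgres://u:p@h/db failed",
              env={"CUTTER_EXEC_DSN": "postgres://u:p@h/db"})
```

The marker keeps the variable name visible for debugging while the value never reaches the log, sidecar, or KB report.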

5. Test plan

unit_tests:
  - test_state_store_roundtrip
  - test_state_store_fcntl_locking
  - test_run_context_serialization_no_secrets
  - test_gates_invariant_eval_pass_fail
  - test_approval_validate_happy_path
  - test_approval_validate_rejects_stale
  - test_approval_validate_rejects_wrong_gate
  - test_kb_reporter_retry_then_fail
  - test_discover_helpers_against_fixture_db
  - test_each_phase_with_mocked_db_provider
  - test_batch_runner_serial_under_lock
  - test_batch_runner_quarantine_on_invariant_fail

dryrun_e2e_tests:
  - test_cut_constitution_replays_to_identical_writer_digest
  - test_cut_synthetic_doc_phases_1_to_11_all_pass
  - test_resume_after_each_internal_gate
  - test_resume_after_each_sovereign_gate
  - test_void_clears_idempotency_safely

live_smoke_tests:
  - manual; not in CI
  - first live invocation = sovereign-supervised at O4
  - re-cuts the Constitution with a freshly-issued review_decision_id
    to prove the v0.6 path produces byte-identical lifecycle log rows
    (the only mutation is a NEW review_decision row, recorded for audit)

property_tests (xhigh-effort follow-on):
  - randomised candidate counts (1 .. 1000)
  - randomised section_type distributions
  - random drift injection between phases (assert STOP_DRIFT)

6. Rollout — six macros, sovereign-sequenced

macro_O1 — AUTHORING-ONLY (≤ 60 min)
  scope        : module skeletons + interface contracts ; no execution
  produces     : ≤ 2000 LOC of new code under cutter_agent/orchestrator/
                 + matching test stubs (all skipping until O2 fills them)
  gates        : G0 repo precheck, G1 contract author, G2 tests-compile,
                 G3 commit, G4 KB report
  authority    : single-line sovereign approval
  duration     : 1 day

macro_O2 — UNIT-IMPL + IN-MEMORY E2E (≤ 60 min ; xhigh-effort)
  scope        : fill the modules ; add a mock db_provider ;
                 phases 1–11 run against an InMemoryAdapter for the
                 Constitution rowset ;
  produces     : 265 → ~400 tests OK (no new live writes)
  gates        : G0..G7
  authority    : single-line sovereign approval
  duration     : 1 day

macro_O3 — DRY-RUN END-TO-END ON CONSTITUTION (≤ 60 min)
  scope        : runner with mode=dryrun against a snapshot of live PG
                 (using a transient PG container; same pattern as v0.5
                  S2 ROLLBACK-only test)
  produces     : KB folder under .../v0.6-orchestrator-dryrun-constitution/
                 with all 11 phase docs + final report ;
                 writer_digest equals v0.5 d99a31d4… (proof of equivalence)
  authority    : sovereign approval
  duration     : 1 day

macro_O4 — FIRST LIVE NEW-DOCUMENT CUT (≤ 60 min)
  scope        : pick a small new document (≤ 20 candidates) ; full live
                 phases 1–11 ; sovereign-supervised SG_1 + SG_2
  produces     : production rows for that document + KB folder
  authority    : sovereign approval (this is the LIVE gate, not the
                 authoring gate)
  duration     : 1 day

macro_O5 — BATCH MODE GA (≤ 60 min)
  scope        : enable cutter orchestrate batch with quarantine default
  produces     : first batch run of ≤ 5 documents under sovereign supervision
  authority    : sovereign approval
  duration     : 1 day

macro_O6 — REAL CRYPTO MIGRATION (deferred ; effort xhigh)
  scope        : replace StubSigning with a real signing provider
                 (HSM-backed or KMS-backed) for both executor + verifier
                 lanes ; replay strategy ruled per DQ_2
  produces     : NEW SigningProvider implementation ; no v0.5 row mutated
  authority    : sovereign architectural ruling FIRST
  duration     : multi-day ; not blocking for O1..O5

7. SigningProvider — interface (today + tomorrow)

class SigningProvider(Protocol):
    name: str             # "stub" | "kms-hsm" | …
    def sign(self, payload: bytes, *, principal: str) -> Signature: ...
    def verify(self, payload: bytes, sig: Signature) -> bool: ...

class StubSigningProvider:
    name = "stub"
    def sign(self, payload, *, principal):
        return Signature(b"stub:" + sha256(payload + principal.encode()).digest(),
                         provider=self.name)
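    def verify(self, payload, sig):
        # NOTE: verify() receives no principal, so the elided operand must be the
        # principal bytes used by sign(), recovered from sig or provider state.
        return sig.bytes == b"stub:" + sha256(payload + ...).digest()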

class KmsHsmSigningProvider:           # implemented in O6
    ...

The provider is injected at runner construction. v0.5 ratifications are NOT replayed under real crypto (DQ_2 option "swap forward for next document"); the Constitution's stub signatures stand.
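A runnable variant of the stub is sketched here. It assumes the Signature type carries the principal so that verify can recompute the digest; that field is not stated in the interface above and is labeled as an assumption in the code. The comparison uses hmac.compare_digest for constant-time equality.

```python
import hmac
from dataclasses import dataclass
from hashlib import sha256


@dataclass(frozen=True)
class Signature:
    bytes: bytes
    provider: str
    principal: str  # ASSUMPTION: carried so verify() can recompute the digest


class StubSigningProvider:
    name = "stub"

    def _digest(self, payload: bytes, principal: str) -> bytes:
        return b"stub:" + sha256(payload + principal.encode()).digest()

    def sign(self, payload: bytes, *, principal: str) -> Signature:
        return Signature(self._digest(payload, principal),
                         provider=self.name, principal=principal)

    def verify(self, payload: bytes, sig: Signature) -> bool:
        if sig.provider != self.name:
            return False  # signature was produced by a different provider
        return hmac.compare_digest(sig.bytes,
                                   self._digest(payload, sig.principal))


provider = StubSigningProvider()
sig = provider.sign(b"payload", principal="cutter_exec")
```

The stub offers no real security, since anyone can recompute the hash; it only exercises the interface until a real provider replaces it.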

8. Backward-compat with existing CLI

cli.py keeps demo / run / selftest exactly as they are. The orchestrate subcommand tree is purely additive. The --production flag is still refused: the orchestrator gates production through its own sovereign-approval machinery, and the --production short-circuit must stay refused to preserve the v0.5 entry-point invariant.

9. Open items deferred (not blocking the design)

deferred:
  - new role cutter_orchestrator (vs reusing directus SECDEF probe) — revisit O2
  - DB table cutter_governance.orchestrator_run (vs sidecar) — revisit O5
  - real crypto signing migration — O6
  - per-document custom MARK extractors (currently we assume the
    Constitution's MARK shape generalizes) — revisit when first non-CONST doc lands
  - i18n of phase report templates — out of scope
  - browser-friendly UI for sovereign approvals — out of scope

10. Verdict

g6_outcome              : PASS
new_modules_total       : 11 + 2 CLI extensions
loc_estimate_total      : ≤ 1500 LOC orchestrator + ≤ 500 LOC tests
reused_modules          : 8 v0.5-proven modules
rollout_macros          : 6 (O1..O6) ; first live cut at O4
authority_per_macro     : single-line sovereign approval
constitution_replay     : equivalence-tested at O3 (writer_digest d99a31d4…)
real_crypto             : deferred to O6 ; provider interface present from O1