dot-iu-cutter v0.4 — Scale, Automation & Non-Hardcode Review (2026-05-17)
Date: 2026-05-17 · DESIGN ONLY (no code/commit/dry-run/provision/secret/deploy). r2/addendum to the schema-binding package, covering the User scale / automation / non-hardcode mandate GPT flagged as not yet addressed. Inputs: accepted code 56d3732, deployed cutter_governance DDL + PK/FK/UNIQUE (read-only grounding).
A. Scale review (100k and 1M information units)
A.1 Write amplification per information unit (IU), canonical MARK→SWEEP→REVIEW→CUT→VERIFY
| Table | Rows / IU | Growth driver |
|---|---|---|
| decision_backlog_entry | 1 (+1 per escalation) | 1× IU |
| decision_backlog_history | 5 (+1 per extra transition: re-review/re-sweep/abandon) | state transitions — fastest linear grower |
| decision_backlog_sweep_log | ~1 per sweep pass (not per IU) | sweep frequency, not IU count |
| manifest_envelope | 1 (+1 per re-manifest) | 1× IU |
| manifest_unit_block | U = units per manifest | manifest fan-out — dominant at large IUs |
| review_decision | 1 (+1 per re-review) | review iterations |
| dot_pair_signature | 2 (+1 per re-cut/re-verify) | signatures, grows with retries |
| cut_change_set | 1 (+1 per forward-compensation) | 1× IU (+rollback) |
| cut_change_set_affected_row | A = affected rows per cut | cut fan-out — dominant for wide cuts |
| verify_result | 1 (+1 per re-verify) | verify iterations |
| decision_backlog_dependency | 0 (N per dependency edge) | dependency graph density |
| canonical_address_alias | 0 (OD-2 deferred) | n/a |
Single-unit IU ⇒ ~15 rows/IU (r3 baseline). Realistic IU with U unit-blocks and A affected rows ⇒ ≈ 12 + U + A rows/IU (the sum of the table's fixed per-IU minimums), plus ~1 sweep-log row per sweep pass.
- 100k IUs: ~1.5M rows (single-unit) → tens of millions if U, A ~ 10–100.
- 1M IUs: ~15M rows (single-unit) → 100M–1B+ with fan-out.
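The arithmetic above can be reproduced with a small estimator. This is a sketch only: the fixed counts are the table's per-IU minimums (no retries, escalations, or dependency edges), and the function names are hypothetical, not part of cutter_agent.

```python
# Minimum rows written per IU on the canonical path, taken from the A.1 table.
FIXED_ROWS_PER_IU = {
    "decision_backlog_entry": 1,
    "decision_backlog_history": 5,
    "manifest_envelope": 1,
    "review_decision": 1,
    "dot_pair_signature": 2,
    "cut_change_set": 1,
    "verify_result": 1,
}


def rows_per_iu(units_per_manifest: int, affected_rows_per_cut: int) -> int:
    """Fixed rows plus manifest fan-out (U) and cut fan-out (A)."""
    return sum(FIXED_ROWS_PER_IU.values()) + units_per_manifest + affected_rows_per_cut


def estimate_total_rows(iu_count: int, units_per_manifest: int,
                        affected_rows_per_cut: int, sweep_passes: int = 0) -> int:
    """Sweep-log rows scale with the number of passes, not with IU count."""
    per_iu = rows_per_iu(units_per_manifest, affected_rows_per_cut)
    return iu_count * per_iu + sweep_passes


if __name__ == "__main__":
    for iu_count in (100_000, 1_000_000):
        for fan_out in (1, 10, 100):
            print(iu_count, fan_out, estimate_total_rows(iu_count, fan_out, fan_out))
```

A single IU with U = A = 1 and one sweep pass gives 15 rows, matching the r3 baseline.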
A.2 Fastest-growing tables
- manifest_unit_block and cut_change_set_affected_row: multiplied by manifest/cut fan-out (U, A); these dominate.
- decision_backlog_history: 5×/IU minimum, more with retries; the append-only audit spine.
- dot_pair_signature: 2×/IU, grows with re-cut/re-verify.
- decision_backlog_sweep_log grows with sweep cadence, not IU volume (cheap).
A.3 Indexes needed BEFORE scale (additive, non-structural; NOT needed for the single-IU dry-run)
PG does not auto-index FKs. Required btree indexes before production scale:
- decision_backlog_history (entry_id), (entry_id, changed_at): lineage/audit reads.
- cut_change_set (decision_backlog_entry_id); UNIQUE already covers idempotency_key, rollback_key.
- cut_change_set_affected_row (change_set_id).
- verify_result (change_set_id), (prior_verify_result_id).
- review_decision (manifest_id), (prior_review_decision_id).
- manifest_envelope (source_doc_ref): SB-DEC-1 lineage join (entry→manifest).
- dot_pair_signature (prior_signature_id): per-lane chain walk.
- Partial decision_backlog_entry (status) WHERE status IN ('marked','reviewed_deferred'): sweep candidate scan.
- Scale blocker (call-out): phases.mark() does find("decision_backlog_entry") (full SELECT *) then filters the idempotency key in Python → O(N) full-table scan + full materialization per MARK. At 1M entries this is fatal. Fix is two-part: (code) mark() must look the key up server-side with an equality filter; (index) the idempotency key lives in payload jsonb → needs either an expression index ((payload->>'idempotency_key')) or a generated stored column + UNIQUE index. The expression-index option is purely additive DDL; the generated-column option is a minimal additive structural migration. Either way: not needed for the single-IU dry-run; required before scale. Recommended: expression UNIQUE index (no column add) + server-side filtered lookup in mark().
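The index list and the recommended MARK fix could be expressed roughly as follows. This is a sketch under assumptions: the index names, the helper functions, and the psycopg-style %s placeholder are illustrative and not taken from the accepted code; only the table and column names come from the list above.

```python
from typing import List, Optional, Sequence, Tuple

IndexSpec = Tuple[str, Sequence[str], Optional[str]]  # (table, columns, partial predicate)

PLANNED_INDEXES: List[IndexSpec] = [
    ("decision_backlog_history", ("entry_id",), None),
    ("decision_backlog_history", ("entry_id", "changed_at"), None),
    ("cut_change_set", ("decision_backlog_entry_id",), None),
    ("cut_change_set_affected_row", ("change_set_id",), None),
    ("verify_result", ("change_set_id",), None),
    ("verify_result", ("prior_verify_result_id",), None),
    ("review_decision", ("manifest_id",), None),
    ("review_decision", ("prior_review_decision_id",), None),
    ("manifest_envelope", ("source_doc_ref",), None),
    ("dot_pair_signature", ("prior_signature_id",), None),
    ("decision_backlog_entry", ("status",), "status IN ('marked','reviewed_deferred')"),
]

# Expression UNIQUE index on the jsonb key: additive DDL, no column added.
IDEMPOTENCY_INDEX_DDL = (
    "CREATE UNIQUE INDEX IF NOT EXISTS ux_decision_backlog_entry_idempotency_key "
    "ON decision_backlog_entry ((payload->>'idempotency_key'));"
)


def index_ddl(table: str, columns: Sequence[str], predicate: Optional[str] = None) -> str:
    """Build one additive btree index statement (index name is hypothetical)."""
    name = "ix_" + table + "_" + "_".join(columns)
    ddl = f"CREATE INDEX IF NOT EXISTS {name} ON {table} ({', '.join(columns)})"
    if predicate:
        ddl += f" WHERE {predicate}"
    return ddl + ";"


def mark_lookup_query(idempotency_key: str) -> Tuple[str, Tuple[str, ...]]:
    """Server-side equality lookup; the filter expression matches the index expression."""
    sql = ("SELECT entry_id FROM decision_backlog_entry "
           "WHERE payload->>'idempotency_key' = %s LIMIT 1")
    return sql, (idempotency_key,)


if __name__ == "__main__":
    for spec in PLANNED_INDEXES:
        print(index_ddl(*spec))
    print(IDEMPOTENCY_INDEX_DDL)
```

The key is passed as a bound parameter, so MARK fetches at most one row instead of materializing the whole table.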
A.4 Deterministic sweep cursor strategy
- Candidate set: status IN ('marked','reviewed_deferred'), ordered by keyset (emitted_at, entry_id) ascending (stable, gap-free under concurrent inserts).
- Bounded batch size from config (env knob, same pattern as RetryPolicy._int, e.g. DOT_CUTTER_SWEEP_BATCH, clamped); never hardcoded.
- Watermark: persist the last (emitted_at, entry_id) high-water in decision_backlog_sweep_log, which is already written each pass. It could go in the findings/mirror_path jsonb or in a dedicated nullable col; the design choice for the code cycle is to reuse the findings jsonb, so no migration. Next pass resumes after the watermark (resumable, idempotent: re-running a pass over already-promoted rows is a no-op because the CAS marked→review_pending fails the predicate for rows no longer marked).
- Promotion uses the existing CAS (OD-SM-1) so two concurrent sweeps cannot double-promote.
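The cursor strategy above can be modelled in memory to show the resume and no-op behaviour. This is a sketch under assumptions: rows are plain dicts, timestamps are ISO-8601 strings (which sort lexicographically), the clamp bounds are illustrative, and the function names are hypothetical. The transition for reviewed_deferred rows is not specified in this section, so the sketch leaves them untouched.

```python
import os
from typing import Dict, List, Optional, Tuple

Watermark = Tuple[str, str]  # (emitted_at, entry_id)
SWEEPABLE = ("marked", "reviewed_deferred")

# PostgreSQL form of the same keyset page (row-value comparison); reference only.
KEYSET_PAGE_SQL = (
    "SELECT entry_id, emitted_at, status FROM decision_backlog_entry "
    "WHERE status IN ('marked','reviewed_deferred') "
    "AND (emitted_at, entry_id) > (%s, %s) "
    "ORDER BY emitted_at, entry_id LIMIT %s"
)


def sweep_batch_size(lo: int = 1, hi: int = 5000) -> int:
    """Read DOT_CUTTER_SWEEP_BATCH and clamp it; the bounds here are illustrative."""
    try:
        value = int(os.environ.get("DOT_CUTTER_SWEEP_BATCH", ""))
    except ValueError:
        value = lo  # unset or malformed: fall back to the smallest batch
    return max(lo, min(hi, value))


def next_batch(rows: List[Dict[str, str]], watermark: Optional[Watermark],
               batch_size: int) -> List[Dict[str, str]]:
    """Return up to batch_size sweepable rows strictly after the watermark."""
    candidates = sorted(
        (r for r in rows if r["status"] in SWEEPABLE),
        key=lambda r: (r["emitted_at"], r["entry_id"]),
    )
    if watermark is not None:
        candidates = [r for r in candidates
                      if (r["emitted_at"], r["entry_id"]) > watermark]
    return candidates[:batch_size]


def sweep_pass(rows: List[Dict[str, str]], watermark: Optional[Watermark],
               batch_size: int) -> Tuple[Optional[Watermark], int]:
    """Promote one batch and return (new watermark, rows promoted)."""
    batch = next_batch(rows, watermark, batch_size)
    promoted = 0
    for row in batch:
        if row["status"] == "marked":  # CAS predicate: only still-marked rows move
            row["status"] = "review_pending"
            promoted += 1
    if batch:
        watermark = (batch[-1]["emitted_at"], batch[-1]["entry_id"])
    return watermark, promoted
```

Re-running a pass from the start after all rows are promoted selects nothing and promotes nothing, which is the idempotency property the watermark design relies on.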
A.5 Bounded retry & idempotency at scale
- Retry knobs already env-driven & clamped (RetryPolicy, DA-13): non-hardcoded, fail-closed. Reused unchanged.
- Idempotency: cut_change_set.idempotency_key is UNIQUE ⇒ 23505 → IdempotencyResume (converge by select). MARK replay dedup uses payload_idempotency_key (deterministic SHA-256 over signal fields, OD-1), which must become an indexed server-side lookup at scale (A.3). Deterministic keys (no randomness) ⇒ at-least-once queue delivery is safe.
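The replay-safety argument can be illustrated with a minimal model. This is a sketch, not the cutter_agent implementation: the canonical-JSON encoding of the signal fields is an assumption (the actual field set and encoding are defined by OD-1, not here), and InMemoryLedger, UniqueViolation, and insert_or_resume are hypothetical stand-ins for the UNIQUE constraint and the IdempotencyResume path.

```python
import hashlib
import json
from typing import Dict


def sketch_idempotency_key(signal_fields: Dict[str, object]) -> str:
    """SHA-256 over a canonical serialization: same fields give the same key."""
    canonical = json.dumps(signal_fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class UniqueViolation(Exception):
    """Stand-in for a PostgreSQL unique_violation (SQLSTATE 23505)."""
    sqlstate = "23505"


class InMemoryLedger:
    """Models a table with a UNIQUE idempotency_key column."""

    def __init__(self) -> None:
        self._row_by_key: Dict[str, str] = {}

    def insert(self, key: str, row_id: str) -> None:
        if key in self._row_by_key:
            raise UniqueViolation(key)
        self._row_by_key[key] = row_id

    def select(self, key: str) -> str:
        return self._row_by_key[key]


def insert_or_resume(ledger: InMemoryLedger, key: str, new_row_id: str) -> str:
    """On a duplicate key, converge on the existing row instead of failing."""
    try:
        ledger.insert(key, new_row_id)
        return new_row_id
    except UniqueViolation:
        return ledger.select(key)
```

A redelivered signal produces the same key, hits the unique violation, and resolves to the row written by the first delivery.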
A.6 Future archival / compaction boundary
- Lineages whose entry is in a terminal state (verified_complete, reviewed_rejected, abandoned) and whose changed_at/verified_at is older than a configured retention horizon are archive-eligible.
- Recommended boundary: range-partition the two fastest growers (decision_backlog_history by changed_at, cut_change_set_affected_row by applied_at) monthly; detach+archive old closed partitions. Append-only invariant is preserved (archival = partition detach, never row DELETE in the live ledger).
- This is a post-scale workstream (separate GPT-gated DDL cycle); only the boundary definition is fixed here. Not needed for dry-run or initial scale.
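The boundary definition above reduces to an eligibility predicate plus monthly range bounds. A minimal sketch, assuming timezone-aware timestamps and a retention horizon supplied from config; the function names are hypothetical and nothing here performs the detach itself.

```python
from datetime import date, datetime, timedelta, timezone
from typing import Tuple

TERMINAL_STATES = frozenset({"verified_complete", "reviewed_rejected", "abandoned"})


def is_archive_eligible(status: str, last_event_at: datetime,
                        now: datetime, retention: timedelta) -> bool:
    """Terminal lineage whose last event is older than the retention horizon."""
    return status in TERMINAL_STATES and last_event_at < now - retention


def month_partition_bounds(year: int, month: int) -> Tuple[date, date]:
    """Half-open [start, end) bounds for one monthly range partition."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end


if __name__ == "__main__":
    now = datetime(2026, 5, 17, tzinfo=timezone.utc)
    old = datetime(2025, 1, 1, tzinfo=timezone.utc)
    print(is_archive_eligible("verified_complete", old, now, timedelta(days=365)))
    print(month_partition_bounds(2026, 12))
```

A lineage still in a non-terminal state is never eligible, regardless of age, which keeps open work in the live ledger.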
B. Non-hardcode review
All constants introduced by the mapping design, classified. Reject criteria: no fixed IP / container ID / DSN / password / doc ID / batch size / vector collection name may be hardcoded.
| Constant (mapping doc) | Value | Class | Verdict |
|---|---|---|---|
| TOOL_REV | cutter_agent.__version__ | config/derived (read from package version, not a literal) | OK: must be read from __version__, never a copied string literal |
| ACTOR_EXEC/ACTOR_VERIFY | cutter_exec/cutter_verify | protocol constant (principal identities; already module constants PRINCIPAL_EXEC/VERIFY) | OK |
| lane → signature_kind (DOT-991→executor, DOT-992→verifier); signer_dot_id=lane | DOT-991/992 | protocol constant | OK |
| operation_kind, status (envelope), governance_event_kind, review_scope, block_role, reviewer_class, risk_class_assessment, trigger_kind, verdict map, state | text vocab | schema-contract expected value (BATCH-1 documented allowed values, no PG enum) | OK iff centralized in one binding-vocabulary module and asserted by a schema-contract test against a documented allowed-values registry; not scattered literals |
| cross_signed_by_dot_verifier=false (at review) | false | protocol constant | OK |
| verifier_tool_revision="pending" sentinel at CUT (SB-DEC-3) | "pending" | schema-contract expected value | OK: declare in vocabulary module + test |
| target_table="cutter_governance/none", operation_kind="apply" (SB-DEC-4) | sentinel | schema-contract expected value (dry-run logical cut) | OK: declared sentinel; flagged that real-target semantics are a later (post-dry-run) concern |
| rollback_key=f"rbk:{entry}:{cs}", idempotency_key=f"ick:{entry}" | derived | derived deterministic key (function of identity, not a literal) | OK: deterministic, collision-safe, no randomness |
| sweep batch size | (none) | config key (must be env knob) | MUST be config (DOT_CUTTER_SWEEP_BATCH), clamped; reject hardcoding |
| connection (host/port/db/sslmode/user/pw) | (none) | config key (already load_connection_config from env; _Secret) | OK: no IP/DSN/password literal anywhere in mapping; verified |
| vector collection name (future) | (none) | config key | not present in v0.4; when introduced MUST be config, never literal |
| migration artefact hash | (none) | n/a (no migration this cycle) | n/a |
- No fixed IP / container ID / DSN / password / doc ID / batch size / vector collection appears in the mapping design. Connection identity is 100% env-sourced (db_adapter.load_connection_config), password held in _Secret (redacted); confirmed.
- Action for the code cycle: all vocabulary/sentinel constants live in ONE cutter_agent binding-vocabulary module; sweep batch + any future tunables are env knobs; runtime table/column assumptions are backed by the schema-contract test fixture (pg-backed-test-revision-plan §2) so a deployed-schema change fails tests deterministically rather than at runtime.
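One possible shape for the centralized vocabulary module and its contract check is sketched below. It is illustrative only: the module layout and helper names are hypothetical, the vocabulary lists contain only values named in this review (the full allowed-values lists live in the BATCH-1 registry and are not reproduced here), and the registry passed to the check is a stand-in for that documented source.

```python
from typing import Dict, FrozenSet, List

# Protocol constants and sentinels named in the table above.
SIGNATURE_KIND_BY_LANE: Dict[str, str] = {"DOT-991": "executor", "DOT-992": "verifier"}
VERIFIER_TOOL_REVISION_PENDING = "pending"

# Partial vocabulary: only values that appear in this review.
BINDING_VOCABULARY: Dict[str, FrozenSet[str]] = {
    "signature_kind": frozenset(SIGNATURE_KIND_BY_LANE.values()),
    "verifier_tool_revision_sentinel": frozenset({VERIFIER_TOOL_REVISION_PENDING}),
    "operation_kind": frozenset({"apply"}),
}


def rollback_key(entry_id: str, change_set_id: str) -> str:
    """Derived key: a pure function of identity, no literal and no randomness."""
    return f"rbk:{entry_id}:{change_set_id}"


def idempotency_key(entry_id: str) -> str:
    return f"ick:{entry_id}"


def undocumented_values(used: Dict[str, FrozenSet[str]],
                        registry: Dict[str, FrozenSet[str]]) -> List[str]:
    """Return every field=value the code uses that the registry does not allow."""
    problems: List[str] = []
    for field, values in sorted(used.items()):
        allowed = registry.get(field, frozenset())
        for value in sorted(values - allowed):
            problems.append(f"{field}={value!r}")
    return problems
```

A schema-contract test would assert that undocumented_values(BINDING_VOCABULARY, registry) is empty, so a vocabulary drift fails in CI rather than at runtime.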
C. SQL / NoSQL hybrid strategy
C.1 Field classification (representative; full per-column in per-writer-mapping-design)
- SQL core identity (all PK uuids): entry_id, history_id, (envelope_id, unit_local_id), review_decision_id, signature_id, change_set_id, affected_row_id, verify_result_id, sweep_id.
- SQL relationship (all FK columns): history.entry_id, unit_block.envelope_id, review_decision.manifest_id, cut_change_set.{executor/verifier_signature_id,decision_backlog_entry_id}, affected_row.change_set_id, verify_result.{change_set_id,executor/verifier_signature_id,escalation_ref,prior_verify_result_id}, self-FK chains.
- SQL governance event: status, change_kind, governance_event_kind, verdict, validation_state, state, decision_at, changed_at, signed_at, verified_at, cross_signed_by_dot_verifier.
- SQL query projection: status, source_doc_ref, idempotency_key, rollback_key, manifest_id, change_set_id, entry_id, changed_at, emitted_at (filter/join/sweep/lineage keys → indexed at scale).
- JSONB payload: payload, change_diff, source_span, findings, reviewer_identity, reviewer_independence_evidence, payload_envelope, payload_summary, candidate_edges, report_summary, before/after_state_snapshot.
- Vector payload: none in v0.4 (future: embeddings of canonical IU text for semantic retrieval).
- Blob/object pointer: mirror_path (text pointer), source_doc_ref (lineage/doc-store pointer).
C.2 SQL is SSOT — confirmed
PostgreSQL cutter_governance is the single source of truth for: identity (PK uuids), lifecycle/state machine (decision_backlog_entry.status + CAS, OD-SM-1/2), governance & audit (append-only decision_backlog_history), review, cut, verify (their tables + FK lineage + DOT-pair signature chain), and idempotency (UNIQUE cut_change_set.idempotency_key). No other store may assert or override these.
C.3 When JSONB should later be normalized
Normalize a JSONB field into columns / a child table when any of: (a) it is filtered/joined at scale (would need a GIN/expression index and still be opaque), (b) it requires FK integrity to another entity, (c) it drives a control-flow/governance decision, (d) it is reported/aggregated routinely. Concrete near-term candidates if/when queried: manifest_unit_block.source_span (if span-range queried), verify_result.findings (if drift analytics), review_decision.findings. Until a query need is proven, JSONB stays (avoids premature schema churn). The idempotency key inside payload is the first to graduate out of JSONB (A.3) because it is queried on the MARK hot path.
C.4 Vector store = acceleration only, never authority
Any future vector/embedding store is a derived, rebuildable index of SQL-authoritative IU/canonical content. It never holds identity, lifecycle, governance, audit, or idempotency truth. On any divergence, SQL wins and the vector index is rebuilt. Vector collection name + endpoint = config keys (B), never literals. This keeps the hybrid honest: NoSQL/vector = search latency optimization; SQL = correctness & authority.
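The "SQL wins, rebuild on divergence" rule can be sketched as a one-way reconcile. Everything here is hypothetical (no vector store exists in v0.4): the index is modelled as a dict, the embed callable is supplied by the caller, and a content hash is used as the divergence check.

```python
import hashlib
from typing import Callable, Dict, List


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def reconcile_vector_index(sql_content: Dict[str, str],
                           vector_index: Dict[str, dict],
                           embed: Callable[[str], List[float]]) -> List[str]:
    """Make the derived index match SQL; return the IU ids rebuilt or dropped.

    Data flows one way only: SQL content is never modified from the index.
    """
    changed: List[str] = []
    for iu_id, text in sql_content.items():
        digest = content_hash(text)
        entry = vector_index.get(iu_id)
        if entry is None or entry["content_hash"] != digest:
            # Missing or stale entry: rebuild it from the authoritative text.
            vector_index[iu_id] = {"content_hash": digest, "vector": embed(text)}
            changed.append(iu_id)
    for iu_id in list(vector_index):
        if iu_id not in sql_content:
            # Entry with no SQL counterpart: drop it.
            del vector_index[iu_id]
            changed.append(iu_id)
    return changed
```

Running the reconcile twice in a row changes nothing the second time, so the rebuild is safe to repeat.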
D. Information-unit-centric design
| Writer / table | Relationship to source/result IU |
|---|---|
| decision_backlog_entry | The governance request about a source IU (payload.iu_ref); lifecycle anchor for that IU's cut |
| manifest_envelope | The proposed operation over the source IU; source_doc_ref ← entry_id carries IU lineage (SB-DEC-1) |
| manifest_unit_block | The IU's unit decomposition: one block per logical unit; proposed_canonical_address = the result unit's canonical address (canonical_address relationship lives here, not on envelope) |
| review_decision | Governance verdict on the manifested IU operation; FK→envelope ties it to the IU |
| cut_change_set | The applied transformation producing the result IU(s); decision_backlog_entry_id→source IU; payload_summary.content_hash = result content identity |
| cut_change_set_affected_row | The concrete affected target rows/result-IUs (one per affected unit; dry-run uses SB-DEC-4 sentinel since the cut is logical, no real row mutated) |
| verify_result | Round-trip verification of result IU vs source (axis-1 drift); closes the IU lifecycle at verified_complete |
| decision_backlog_history | Append-only audit of the IU's every state transition |
| dot_pair_signature | Executor/verifier attestation binding the IU's cut & verify to DOT-991/DOT-992 lanes |
| canonical_address_alias | Deferred (OD-2, 0 rows). Canonical-address aliasing/rename of an IU is a future HIGH-risk workstream |
Supersession status: handled in-SQL, write-once, append-only — review_decision.superseded_by_review_decision_id (re-review), manifest_envelope.superseded_by_envelope_id (re-manifest), cut_change_set forward-compensation (no physical rollback). Alias-based supersession remains deferred.
E. Automation readiness
- Resumable phase state: only stable/terminal states persist (decision_backlog_entry.status, OD-SM-2; S5/S7 never persisted). Any phase resumes by reading status and re-attempting under CAS; no half-state exists.
- Concurrency guard per unit/manifest: status compare-and-set (OD-SM-1, no advisory lock) serializes advancement per entry; manifest_unit_block composite PK blocks duplicate unit blocks; UNIQUE cut_change_set.idempotency_key/rollback_key blocks duplicate cuts. SERIALIZABLE isolation for CUT/VERIFY.
- No manual SQL in runtime: the adapter generates all SQL from an allow-listed identifier set with parameterized values; no operator/manual SQL path; binding vocabulary centralized; append-only guards reject DELETE/TRUNCATE/DDL/GRANT.
- Structured redacted logs: _Secret redaction; JSON log line schema {sqlstate,error_class,phase,table,entry_id,key_name}; passwords never rendered (<redacted-secret>); no secret in artefacts (G-23/G-24).
- Future queue/signal contract: v0.4 input is LocalSignal only (OD-4; no production queue/bus). Automation will front MARK with an at-least-once queue; deterministic idempotency keys make replays safe (converge, not duplicate). The production signal-source/queue contract is a separate deferred GPT-gated design; it is named here as the integration boundary, not designed.
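The status compare-and-set guard and the redacted log line can be sketched together. Assumptions are loud here: SQLite (stdlib) stands in for PostgreSQL so the example runs anywhere, the table is reduced to two columns, and cas_promote, RedactedSecret, and log_line are hypothetical names, not the adapter's API or the existing _Secret class.

```python
import json
import sqlite3

LOG_FIELDS = ("sqlstate", "error_class", "phase", "table", "entry_id", "key_name")


def cas_promote(conn: sqlite3.Connection, entry_id: str,
                expected: str, new: str) -> bool:
    """Advance status only if it still equals `expected`; True if this caller won."""
    cur = conn.execute(
        "UPDATE decision_backlog_entry SET status = ? "
        "WHERE entry_id = ? AND status = ?",
        (new, entry_id, expected),
    )
    conn.commit()
    return cur.rowcount == 1


class RedactedSecret:
    """Holds a secret that never renders in str(), repr(), or logs."""

    def __init__(self, value: str) -> None:
        self._value = value

    def reveal(self) -> str:
        return self._value

    def __repr__(self) -> str:
        return "<redacted-secret>"

    __str__ = __repr__


def log_line(**fields: object) -> str:
    """Emit only the documented schema keys; anything else is dropped."""
    return json.dumps({k: fields.get(k) for k in LOG_FIELDS}, sort_keys=True)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE decision_backlog_entry (entry_id TEXT PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO decision_backlog_entry VALUES ('e1', 'marked')")
    print(cas_promote(conn, "e1", "marked", "review_pending"))  # first caller wins
    print(cas_promote(conn, "e1", "marked", "review_pending"))  # second caller loses
    print(log_line(phase="SWEEP", table="decision_backlog_entry", entry_id="e1"))
```

The second promotion matches zero rows because the predicate no longer holds, which is why no advisory lock is needed to prevent double-promotion.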
F. Verdict (this doc)
- Code patch still sufficient for correctness, but its scope must expand to: (1) centralized binding-vocabulary module (non-hardcode B); (2) server-side idempotency lookup in mark() (scale A.3); (3) config-driven sweep batch + deterministic keyset cursor (A.4); (4) schema-contract tests covering vocabulary + columns. Still column-level binding, no flow/principal/isolation change.
- Structural schema migration of cutter_governance: still NOT required for the dry-run. For production scale, an additive index migration (A.3 list) plus one idempotency index (expression or generated-column, per A.3) is required. That is a separate, GPT-gated, index-only DDL cycle and not a blocker for resuming the single-IU PG-backed dry-run.
- JSONB: stays JSONB now; normalize per C.3 rule when queried at scale; the payload idempotency key graduates first.
- PG-backed dry-run can resume after the (expanded) code patch PASSes: single-IU, count-invariant, r3 unchanged, no index/migration needed for it.