KB-6731

dot-iu-cutter v0.4 — Scale, Automation & Non-Hardcode Review (2026-05-17)

Revision 1
Tags: dot-iu-cutter, v0.4, schema-binding, scale, non-hardcode, automation, dieu44, design

Date: 2026-05-17 · DESIGN ONLY (no code/commit/dry-run/provision/secret/deploy). This r2/addendum to the schema-binding package covers the User scale / automation / non-hardcode mandate that GPT flagged as not yet addressed. Inputs: accepted code 56d3732, deployed cutter_governance DDL + PK/FK/UNIQUE (read-only grounding).

A. Scale review (100k and 1M information units)

A.1 Write amplification per information unit (IU), canonical MARK→SWEEP→REVIEW→CUT→VERIFY

| Table | Rows / IU | Growth driver |
|---|---|---|
| decision_backlog_entry | 1 (+1 per escalation) | 1× IU |
| decision_backlog_history | 5 (+1 per extra transition: re-review/re-sweep/abandon) | state transitions; fastest linear grower |
| decision_backlog_sweep_log | ~1 per sweep pass (not per IU) | sweep frequency, not IU count |
| manifest_envelope | 1 (+1 per re-manifest) | 1× IU |
| manifest_unit_block | U = units per manifest | manifest fan-out; dominant at large IUs |
| review_decision | 1 (+1 per re-review) | review iterations |
| dot_pair_signature | 2 (+1 per re-cut/re-verify) | signatures, grows with retries |
| cut_change_set | 1 (+1 per forward-compensation) | 1× IU (+rollback) |
| cut_change_set_affected_row | A = affected rows per cut | cut fan-out; dominant for wide cuts |
| verify_result | 1 (+1 per re-verify) | verify iterations |
| decision_backlog_dependency | 0 (N per dependency edge) | dependency graph density |
| canonical_address_alias | 0 (OD-2 deferred) | n/a |

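Single-unit IU ⇒ ~15 rows/IU (r3 baseline). Realistic IU with U unit-blocks and A affected rows ⇒ ≈ 13 + U + A rows/IU: the per-table minimums above sum to 13 fixed rows (1 + 5 + 1 + 1 + 1 + 2 + 1 + 1, counting one sweep_log row), and U = A = 1 recovers the ~15 baseline.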

  • 100k IUs: ~1.5M rows (single-unit) → tens of millions if U,A ~10–100.
  • 1M IUs: ~15M rows (single-unit) → 100M–1B+ with fan-out.
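
The estimate can be reproduced with a few lines of arithmetic. This is a minimal sketch, not project code: the function name and dictionary are illustrative, the counts are the per-table minimums from the table above, and retries, escalations and dependency edges are deliberately ignored, so the result is a lower bound.

```python
# Per-IU minimum row counts taken from the table above. The two fan-out
# tables (manifest_unit_block, cut_change_set_affected_row) are passed in
# separately as U and A.
FIXED_ROWS_PER_IU = {
    "decision_backlog_entry": 1,
    "decision_backlog_history": 5,
    "decision_backlog_sweep_log": 1,  # "~1 per sweep pass", counted once here
    "manifest_envelope": 1,
    "review_decision": 1,
    "dot_pair_signature": 2,
    "cut_change_set": 1,
    "verify_result": 1,
}


def estimate_rows(iu_count: int, units_per_manifest: int = 1,
                  affected_rows_per_cut: int = 1) -> int:
    """Lower-bound row estimate: no retries, escalations or dependency edges."""
    per_iu = (sum(FIXED_ROWS_PER_IU.values())
              + units_per_manifest + affected_rows_per_cut)
    return iu_count * per_iu


if __name__ == "__main__":
    for ius in (100_000, 1_000_000):
        for fan_out in (1, 10, 100):
            rows = estimate_rows(ius, fan_out, fan_out)
            print(f"{ius:>9,} IUs, U=A={fan_out:>3}: {rows:>13,} rows")
```

With U = A = 1 this gives 1.5M rows for 100k IUs and 15M rows for 1M IUs, matching the single-unit figures above.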

A.2 Fastest-growing tables

  1. manifest_unit_block and cut_change_set_affected_row — multiplied by manifest/cut fan-out (U, A); these dominate.
  2. decision_backlog_history — 5×/IU minimum, more with retries; the append-only audit spine.
  3. dot_pair_signature — 2×/IU, grows with re-cut/re-verify. decision_backlog_sweep_log grows with sweep cadence, not IU volume (cheap).

A.3 Indexes needed BEFORE scale (additive, non-structural; NOT needed for the single-IU dry-run)

PG does not auto-index FKs. Required btree indexes before production scale:

  • decision_backlog_history (entry_id), (entry_id, changed_at) — lineage/audit reads.
  • cut_change_set (decision_backlog_entry_id); UNIQUE already covers idempotency_key, rollback_key.
  • cut_change_set_affected_row (change_set_id).
  • verify_result (change_set_id), (prior_verify_result_id).
  • review_decision (manifest_id), (prior_review_decision_id).
  • manifest_envelope (source_doc_ref): SB-DEC-1 lineage join (entry→manifest).
  • dot_pair_signature (prior_signature_id) — per-lane chain walk.
  • Partial decision_backlog_entry (status) WHERE status IN ('marked','reviewed_deferred') — sweep candidate scan.
  • Scale blocker (call-out): phases.mark() calls find("decision_backlog_entry") (a full SELECT *) and then filters on the idempotency key in Python, which means an O(N) full-table scan plus full materialization per MARK. At 1M entries this is fatal. The fix has two parts: (code) mark() must look the key up server-side with an equality filter; (index) because the idempotency key lives in payload jsonb, it needs either an expression index ((payload->>'idempotency_key')) or a generated stored column + UNIQUE index. The expression-index option is purely additive DDL; the generated-column option is a minimal additive structural migration. Either way, the fix is not needed for the single-IU dry-run but is required before scale. Recommended: expression UNIQUE index (no column add) + server-side filtered lookup in mark().
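
The index list and the recommended MARK lookup could be rendered roughly as follows. This is a sketch under stated assumptions: table and column names come from the list above, but every index name is invented here, the statements are unqualified (no schema prefix), and the `%s` placeholder assumes a psycopg-style driver. None of this is taken from the deployed DDL or the adapter.

```python
# (table, columns) pairs from the A.3 list. Index names are generated below
# and are hypothetical.
BTREE_INDEXES = [
    ("decision_backlog_history", ("entry_id",)),
    ("decision_backlog_history", ("entry_id", "changed_at")),
    ("cut_change_set", ("decision_backlog_entry_id",)),
    ("cut_change_set_affected_row", ("change_set_id",)),
    ("verify_result", ("change_set_id",)),
    ("verify_result", ("prior_verify_result_id",)),
    ("review_decision", ("manifest_id",)),
    ("review_decision", ("prior_review_decision_id",)),
    ("manifest_envelope", ("source_doc_ref",)),
    ("dot_pair_signature", ("prior_signature_id",)),
]


def index_name(table, columns):
    """Hypothetical naming scheme: ix_<table>__<col>_<col>."""
    return f"ix_{table}__{'_'.join(columns)}"


def btree_ddl(table, columns):
    """Render one additive btree index statement."""
    cols = ", ".join(columns)
    return f"CREATE INDEX IF NOT EXISTS {index_name(table, columns)} ON {table} ({cols});"


# Partial index for the sweep candidate scan.
PARTIAL_SWEEP_INDEX = (
    "CREATE INDEX IF NOT EXISTS ix_decision_backlog_entry__sweep_candidates "
    "ON decision_backlog_entry (status) "
    "WHERE status IN ('marked', 'reviewed_deferred');"
)

# Recommended option: expression UNIQUE index, no column added.
IDEMPOTENCY_EXPR_INDEX = (
    "CREATE UNIQUE INDEX IF NOT EXISTS ux_decision_backlog_entry__idempotency_key "
    "ON decision_backlog_entry ((payload->>'idempotency_key'));"
)

# Server-side equality lookup that mark() could issue instead of SELECT * plus
# a Python filter. The expression matches the index expression above.
MARK_LOOKUP_SQL = (
    "SELECT entry_id FROM decision_backlog_entry "
    "WHERE payload->>'idempotency_key' = %s LIMIT 1;"
)

if __name__ == "__main__":
    for table, columns in BTREE_INDEXES:
        print(btree_ddl(table, columns))
    print(PARTIAL_SWEEP_INDEX)
    print(IDEMPOTENCY_EXPR_INDEX)
```

Keeping the lookup predicate textually identical to the index expression is what lets the planner use the expression index for the equality filter.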

A.4 Deterministic sweep cursor strategy

  • Candidate set: status IN ('marked','reviewed_deferred'), ordered by keyset (emitted_at, entry_id) ascending (stable, gap-free under concurrent inserts).
  • Bounded batch size from config (env knob, same pattern as RetryPolicy._int, e.g. DOT_CUTTER_SWEEP_BATCH clamped); never hardcoded.
  • Watermark: persist the last (emitted_at, entry_id) high-water mark in decision_backlog_sweep_log, which is already written on each pass. Where it is stored (the findings/mirror_path jsonb or a dedicated nullable column) is a design choice for the code cycle; reusing the findings jsonb needs no migration. The next pass resumes after the watermark. This is resumable and idempotent: re-running a pass over already-promoted rows is a no-op, because the CAS marked→review_pending fails its predicate for rows that are no longer marked.
  • Promotion uses the existing CAS (OD-SM-1) so two concurrent sweeps cannot double-promote.
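
A sweep pass along these lines can be sketched end to end. The example below uses an in-memory sqlite3 table as a stand-in for PostgreSQL so that it runs without a server; the function names, the clamp bounds (1 to 1000, default 100) and the three-column table are illustrative assumptions. Only the marked→review_pending transition is shown, since the handling of reviewed_deferred candidates is not spelled out here.

```python
import os
import sqlite3


def sweep_batch_size(env=os.environ, default=100, lo=1, hi=1000):
    """Read DOT_CUTTER_SWEEP_BATCH and clamp it (bounds are illustrative)."""
    try:
        value = int(env.get("DOT_CUTTER_SWEEP_BATCH", default))
    except ValueError:
        value = default
    return max(lo, min(hi, value))


def sweep_pass(conn, watermark, batch_size):
    """Promote one keyset-ordered batch; return (promoted_ids, new_watermark)."""
    rows = conn.execute(
        "SELECT emitted_at, entry_id FROM decision_backlog_entry "
        "WHERE status IN ('marked', 'reviewed_deferred') "
        "AND (emitted_at, entry_id) > (?, ?) "
        "ORDER BY emitted_at, entry_id LIMIT ?",
        (*watermark, batch_size),
    ).fetchall()
    promoted = []
    for emitted_at, entry_id in rows:
        # Compare-and-set: only a row that is still 'marked' advances. A
        # concurrent sweep that already promoted it makes this update 0 rows.
        cur = conn.execute(
            "UPDATE decision_backlog_entry SET status = 'review_pending' "
            "WHERE entry_id = ? AND status = 'marked'",
            (entry_id,),
        )
        if cur.rowcount == 1:
            promoted.append(entry_id)
        watermark = (emitted_at, entry_id)
    return promoted, watermark


def make_demo_db():
    """Three 'marked' entries, two of them sharing the same emitted_at."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE decision_backlog_entry ("
        "entry_id TEXT PRIMARY KEY, emitted_at TEXT NOT NULL, status TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO decision_backlog_entry (entry_id, emitted_at, status) "
        "VALUES (?, ?, 'marked')",
        [("e1", "2026-05-17T00:00:01"),
         ("e2", "2026-05-17T00:00:01"),
         ("e3", "2026-05-17T00:00:02")],
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    first, mark = sweep_pass(conn, ("", ""), 2)
    second, mark = sweep_pass(conn, mark, 2)
    replay, _ = sweep_pass(conn, ("", ""), 2)  # restart from scratch
    print(first, second, replay)
```

The tie on emitted_at is broken by entry_id, which is why the cursor is a pair rather than a timestamp alone. The replayed pass finds no candidates because every row has already left the marked state.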

A.5 Bounded retry & idempotency at scale

  • Retry knobs already env-driven & clamped (RetryPolicy, DA-13) — non-hardcoded, fail-closed. Reused unchanged.
  • Idempotency: cut_change_set.idempotency_key is UNIQUE ⇒ 23505 → IdempotencyResume (converge by select). MARK replay dedup uses payload_idempotency_key (deterministic SHA-256 over signal fields, OD-1) — must become an indexed server-side lookup at scale (A.3). Deterministic keys (no randomness) ⇒ at-least-once queue delivery is safe.
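
The two idempotency mechanisms can be illustrated together. This is a hedged sketch: the signal field names are placeholders (the OD-1 field set is not reproduced in this document), the canonical-JSON encoding is one plausible way to make the SHA-256 deterministic, and the in-memory ledger only imitates a UNIQUE constraint raising SQLSTATE 23505.

```python
import hashlib
import json


def mark_idempotency_key(signal_fields: dict) -> str:
    """Deterministic SHA-256 over the signal fields (canonical JSON, sorted keys)."""
    canonical = json.dumps(signal_fields, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class UniqueViolation(Exception):
    """Stand-in for the driver error carrying SQLSTATE 23505."""
    sqlstate = "23505"


class InMemoryLedger:
    """Imitates a table whose idempotency_key column is UNIQUE."""

    def __init__(self):
        self._rows = {}

    def insert(self, key, row):
        if key in self._rows:
            raise UniqueViolation(key)
        self._rows[key] = row
        return row

    def select(self, key):
        return self._rows[key]


def write_once(ledger, key, row):
    """Insert, or converge on the existing row when the key is already taken."""
    try:
        return ledger.insert(key, row), "inserted"
    except UniqueViolation:
        # Resume path: read back what the first delivery wrote.
        return ledger.select(key), "resumed"


if __name__ == "__main__":
    # Hypothetical signal fields, for illustration only.
    signal = {"iu_ref": "iu-001", "trigger_kind": "manual"}
    key = mark_idempotency_key(signal)
    ledger = InMemoryLedger()
    print(write_once(ledger, key, {"attempt": 1}))
    print(write_once(ledger, key, {"attempt": 2}))  # redelivery converges
```

Because the key is a pure function of the signal, a redelivered message produces the same key and lands on the resume path instead of creating a second row.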

A.6 Future archival / compaction boundary

  • Lineages whose entry is in a terminal state (verified_complete, reviewed_rejected, abandoned) and whose changed_at/verified_at is older than a configured retention horizon are archive-eligible.
  • Recommended boundary: range-partition the two fastest growers (decision_backlog_history by changed_at, cut_change_set_affected_row by applied_at) monthly; detach+archive old closed partitions. Append-only invariant is preserved (archival = partition detach, never row DELETE in the live ledger).
  • This is a post-scale workstream (separate GPT-gated DDL cycle); only the boundary definition is fixed here. Not needed for dry-run or initial scale.
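
The boundary definition reduces to a small predicate plus a monthly range calculation. The sketch below assumes the retention horizon arrives as a timedelta from configuration; the function names are illustrative, and the partition DDL itself is left to the later DDL cycle.

```python
from datetime import datetime, timedelta, timezone

# Terminal states named in the boundary definition above.
TERMINAL_STATES = frozenset({"verified_complete", "reviewed_rejected", "abandoned"})


def is_archive_eligible(status, closed_at, now, retention):
    """True when the lineage is terminal and older than the retention horizon."""
    return status in TERMINAL_STATES and closed_at < now - retention


def month_partition_bounds(moment):
    """Half-open [start, end) range of the monthly partition containing moment."""
    start = moment.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if start.month == 12:
        end = start.replace(year=start.year + 1, month=1)
    else:
        end = start.replace(month=start.month + 1)
    return start, end


if __name__ == "__main__":
    now = datetime(2026, 5, 17, tzinfo=timezone.utc)
    retention = timedelta(days=90)  # illustrative value; would come from config
    closed = datetime(2026, 1, 1, tzinfo=timezone.utc)
    print(is_archive_eligible("verified_complete", closed, now, retention))
    print(month_partition_bounds(closed))
```

A partition becomes a detach candidate only once every lineage it holds passes the predicate, which keeps archival a partition-level operation rather than a row-level one.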

B. Non-hardcode review

All constants introduced by the mapping design, classified. Reject criteria: no fixed IP / container ID / DSN / password / doc ID / batch size / vector collection name may be hardcoded.

| Constant (mapping doc) | Value | Class | Verdict |
|---|---|---|---|
| TOOL_REV | cutter_agent.__version__ | config/derived (read from package version, not a literal) | OK: must be read from __version__, never a copied string literal |
| ACTOR_EXEC/ACTOR_VERIFY | cutter_exec/cutter_verify | protocol constant (principal identities; already module constants PRINCIPAL_EXEC/VERIFY) | OK |
| lane → signature_kind (DOT-991→executor, DOT-992→verifier); signer_dot_id=lane | DOT-991/992 | protocol constant | OK |
| operation_kind, status (envelope), governance_event_kind, review_scope, block_role, reviewer_class, risk_class_assessment, trigger_kind, verdict map, state | text vocab | schema-contract expected value (BATCH-1 documented allowed values, no PG enum) | OK iff centralized in one binding-vocabulary module and asserted by a schema-contract test against a documented allowed-values registry, not scattered literals |
| cross_signed_by_dot_verifier=false (at review) | false | protocol constant | OK |
| verifier_tool_revision="pending" sentinel at CUT (SB-DEC-3) | "pending" | schema-contract expected value | OK: declare in vocabulary module + test |
| target_table="cutter_governance/none", operation_kind="apply" (SB-DEC-4) | sentinel | schema-contract expected value (dry-run logical cut) | OK: declared sentinel; flagged that real-target semantics are a later (post-dry-run) concern |
| rollback_key=f"rbk:{entry}:{cs}", idempotency_key=f"ick:{entry}" | derived | derived deterministic key (function of identity, not a literal) | OK: deterministic, collision-safe, no randomness |
| sweep batch size | | config key (must be env knob) | MUST be config (DOT_CUTTER_SWEEP_BATCH), clamped; reject hardcoding |
| connection (host/port/db/sslmode/user/pw) | | config key (already load_connection_config from env; _Secret) | OK: no IP/DSN/password literal anywhere in mapping; verified |
| vector collection name (future) | | config key | not present in v0.4; when introduced MUST be config, never literal |
| migration artefact hash | | n/a (no migration this cycle) | n/a |
  • No fixed IP / container ID / DSN / password / doc ID / batch size / vector collection appears in the mapping design. Connection identity is 100% env-sourced (db_adapter.load_connection_config), password held in _Secret (redacted) — confirmed.
  • Action for the code cycle: all vocabulary/sentinel constants live in ONE cutter_agent binding-vocabulary module; sweep batch + any future tunables are env knobs; runtime table/column assumptions are backed by the schema-contract test fixture (pg-backed-test-revision-plan §2) so a deployed-schema change fails tests deterministically rather than at runtime.
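
One possible shape for the single binding-vocabulary module and its contract check is sketched below. The sentinel values and key formats are the ones listed in the table above; the constant names, the function names and the small registry are assumptions for illustration, and the registry is a placeholder subset rather than the real BATCH-1 allowed-values list.

```python
from types import MappingProxyType

# Sentinels and protocol constants quoted from the classification table.
VERIFIER_TOOL_REVISION_PENDING = "pending"        # SB-DEC-3
DRY_RUN_TARGET_TABLE = "cutter_governance/none"   # SB-DEC-4
DRY_RUN_OPERATION_KIND = "apply"                  # SB-DEC-4
LANE_SIGNATURE_KIND = MappingProxyType(
    {"DOT-991": "executor", "DOT-992": "verifier"}
)


def rollback_key(entry_id, change_set_id):
    """Derived key: a function of identity, never a stored literal."""
    return f"rbk:{entry_id}:{change_set_id}"


def idempotency_key(entry_id):
    """Derived key for the cut of one entry."""
    return f"ick:{entry_id}"


# Placeholder registry: only values that appear in this document.
ALLOWED_VALUES = MappingProxyType({
    "operation_kind": frozenset({"apply"}),
    "signature_kind": frozenset({"executor", "verifier"}),
})


def contract_violations(used, registry=ALLOWED_VALUES):
    """Return sorted (field, value) pairs that the registry does not allow."""
    return sorted(
        (field, value)
        for field, value in used.items()
        if value not in registry.get(field, frozenset())
    )


if __name__ == "__main__":
    writer_values = {
        "operation_kind": DRY_RUN_OPERATION_KIND,
        "signature_kind": LANE_SIGNATURE_KIND["DOT-991"],
    }
    print(contract_violations(writer_values))            # no violations
    print(contract_violations({"operation_kind": "aply"}))  # typo is caught
```

A schema-contract test would call a check like this for every writer, so a value that drifts from the registry fails in the test run rather than at insert time.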

C. SQL / NoSQL hybrid strategy

C.1 Field classification (representative; full per-column in per-writer-mapping-design)

  • SQL core identity: all PK uuids — entry_id, history_id, (envelope_id,unit_local_id), review_decision_id, signature_id, change_set_id, affected_row_id, verify_result_id, sweep_id.
  • SQL relationship: all FK columns — history.entry_id, unit_block.envelope_id, review_decision.manifest_id, cut_change_set.{executor/verifier_signature_id,decision_backlog_entry_id}, affected_row.change_set_id, verify_result.{change_set_id,executor/verifier_signature_id,escalation_ref,prior_verify_result_id}, self-FK chains.
  • SQL governance event: status, change_kind, governance_event_kind, verdict, validation_state, state, decision_at, changed_at, signed_at, verified_at, cross_signed_by_dot_verifier.
  • SQL query projection: status, source_doc_ref, idempotency_key, rollback_key, manifest_id, change_set_id, entry_id, changed_at, emitted_at (filter/join/sweep/lineage keys → indexed at scale).
  • JSONB payload: payload, change_diff, source_span, findings, reviewer_identity, reviewer_independence_evidence, payload_envelope, payload_summary, candidate_edges, report_summary, before/after_state_snapshot.
  • Vector payload: none in v0.4 (future: embeddings of canonical IU text for semantic retrieval).
  • Blob/object pointer: mirror_path (text pointer), source_doc_ref (lineage/doc-store pointer).

C.2 SQL is SSOT — confirmed

PostgreSQL cutter_governance is the single source of truth for: identity (PK uuids), lifecycle/state machine (decision_backlog_entry.status + CAS, OD-SM-1/2), governance & audit (append-only decision_backlog_history), review, cut, verify (their tables + FK lineage + DOT-pair signature chain), and idempotency (UNIQUE cut_change_set.idempotency_key). No other store may assert or override these.

C.3 When JSONB should later be normalized

Normalize a JSONB field into columns / a child table when any of: (a) it is filtered/joined at scale (would need a GIN/expression index and still be opaque), (b) it requires FK integrity to another entity, (c) it drives a control-flow/governance decision, (d) it is reported/aggregated routinely. Concrete near-term candidates if/when queried: manifest_unit_block.source_span (if span-range queried), verify_result.findings (if drift analytics), review_decision.findings. Until a query need is proven, JSONB stays (avoids premature schema churn). The idempotency key inside payload is the first to graduate out of JSONB (A.3) because it is queried on the MARK hot path.

C.4 Vector store = acceleration only, never authority

Any future vector/embedding store is a derived, rebuildable index of SQL-authoritative IU/canonical content. It never holds identity, lifecycle, governance, audit, or idempotency truth. On any divergence, SQL wins and the vector index is rebuilt. Vector collection name + endpoint = config keys (B), never literals. This keeps the hybrid honest: NoSQL/vector = search latency optimization; SQL = correctness & authority.

D. Information-unit-centric design

| Writer / table | Relationship to source/result IU |
|---|---|
| decision_backlog_entry | The governance request about a source IU (payload.iu_ref); lifecycle anchor for that IU's cut |
| manifest_envelope | The proposed operation over the source IU; source_doc_ref ← entry_id carries IU lineage (SB-DEC-1) |
| manifest_unit_block | The IU's unit decomposition, one block per logical unit; proposed_canonical_address = the result unit's canonical address (the canonical_address relationship lives here, not on the envelope) |
| review_decision | Governance verdict on the manifested IU operation; FK→envelope ties it to the IU |
| cut_change_set | The applied transformation producing the result IU(s); decision_backlog_entry_id→source IU; payload_summary.content_hash = result content identity |
| cut_change_set_affected_row | The concrete affected target rows/result-IUs (one per affected unit; dry-run uses SB-DEC-4 sentinel since the cut is logical, no real row mutated) |
| verify_result | Round-trip verification of result IU vs source (axis-1 drift); closes the IU lifecycle at verified_complete |
| decision_backlog_history | Append-only audit of the IU's every state transition |
| dot_pair_signature | Executor/verifier attestation binding the IU's cut & verify to DOT-991/DOT-992 lanes |
| canonical_address_alias | Deferred (OD-2, 0 rows). Canonical-address aliasing/rename of an IU is a future HIGH-risk workstream |

Supersession status: handled in-SQL, write-once, append-only — review_decision.superseded_by_review_decision_id (re-review), manifest_envelope.superseded_by_envelope_id (re-manifest), cut_change_set forward-compensation (no physical rollback). Alias-based supersession remains deferred.
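
Reading the current version of a superseded record amounts to following the pointer until it is empty. The helper below is a minimal sketch over plain dictionaries; the function name and the row shape are assumptions, and only the column name superseded_by_review_decision_id comes from the text above.

```python
def current_head(rows_by_id, start_id,
                 pointer="superseded_by_review_decision_id"):
    """Follow supersession pointers from start_id to the live record's id."""
    seen = set()
    node = start_id
    while True:
        if node in seen:
            # A write-once chain should never loop; fail loudly if it does.
            raise ValueError(f"supersession cycle at {node}")
        seen.add(node)
        successor = rows_by_id[node].get(pointer)
        if successor is None:
            return node
        node = successor


if __name__ == "__main__":
    reviews = {
        "rd-1": {"superseded_by_review_decision_id": "rd-2"},
        "rd-2": {"superseded_by_review_decision_id": "rd-3"},
        "rd-3": {"superseded_by_review_decision_id": None},
    }
    print(current_head(reviews, "rd-1"))
```

The same walk applies to manifest_envelope by passing pointer="superseded_by_envelope_id".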

E. Automation readiness

  • Resumable phase state: only stable/terminal states persist (decision_backlog_entry.status, OD-SM-2 — S5/S7 never persisted). Any phase resumes by reading status and re-attempting under CAS; no half-state exists.
  • Concurrency guard per unit/manifest: status compare-and-set (OD-SM-1, no advisory lock) serializes advancement per entry; manifest_unit_block composite PK blocks duplicate unit blocks; UNIQUE cut_change_set.idempotency_key/rollback_key blocks duplicate cuts. CUT and VERIFY run under SERIALIZABLE isolation.
  • No manual SQL in runtime: the adapter generates all SQL from an allow-listed identifier set with parameterized values; no operator/manual SQL path; binding vocabulary centralized; append-only guards reject DELETE/TRUNCATE/DDL/GRANT.
  • Structured redacted logs: _Secret redaction; JSON log line schema {sqlstate,error_class,phase,table,entry_id,key_name}; passwords never rendered (<redacted-secret>); no secret in artefacts (G-23/G-24).
  • Future queue/signal contract: v0.4 input is LocalSignal only (OD-4; no production queue/bus). Automation will front MARK with an at-least-once queue; deterministic idempotency keys make replays safe (converge, not duplicate). The production signal-source/queue contract is a separate deferred GPT-gated design — named here as the integration boundary, not designed.
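
The structured, redacted log line can be sketched as follows. The six field names and the <redacted-secret> rendering are taken from the bullet above; the Secret class here is a hypothetical stand-in for the project's _Secret, and the choice to reject unknown fields is an assumption made for this example.

```python
import json


class Secret:
    """Wraps a sensitive value so that str() and repr() never expose it."""

    def __init__(self, value):
        self._value = value

    def reveal(self):
        """Explicit access for the one place that needs the raw value."""
        return self._value

    def __repr__(self):
        return "<redacted-secret>"

    __str__ = __repr__


# Log line schema named in the bullet above.
LOG_FIELDS = ("sqlstate", "error_class", "phase", "table", "entry_id", "key_name")


def log_line(**fields):
    """Render one JSON log line restricted to the fixed schema."""
    unknown = set(fields) - set(LOG_FIELDS)
    if unknown:
        raise ValueError(f"fields outside the log schema: {sorted(unknown)}")
    record = {name: fields.get(name) for name in LOG_FIELDS}
    # default=str turns any non-JSON value, including Secret, into its
    # string form, which for Secret is the redaction marker.
    return json.dumps(record, sort_keys=True, default=str)


if __name__ == "__main__":
    print(log_line(sqlstate="23505", error_class="UniqueViolation",
                   phase="CUT", table="cut_change_set",
                   entry_id="e1", key_name="idempotency_key"))
    print(log_line(phase="MARK", key_name=Secret("hunter2")))
```

Restricting the record to a fixed field set means an accidental extra keyword, such as a connection string, raises an error instead of reaching the log.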

F. Verdict (this doc)

  • Code patch still sufficient for correctness, but its scope must expand to: (1) centralized binding-vocabulary module (non-hardcode B); (2) server-side idempotency lookup in mark() (scale A.3); (3) config-driven sweep batch + deterministic keyset cursor (A.4); (4) schema-contract tests covering vocabulary + columns. Still column-level binding, no flow/principal/isolation change.
  • Structural schema migration of cutter_governance: still NOT required for the dry-run. For production scale, an additive index migration (A.3 list) and optionally one expression/generated idempotency index are required — separate, GPT-gated, index-only DDL cycle; not a blocker for resuming the single-IU PG-backed dry-run.
  • JSONB: stays JSONB now; normalize per C.3 rule when queried at scale; the payload idempotency key graduates first.
  • PG-backed dry-run can resume after the (expanded) code patch PASSes — single-IU, count-invariant, r3 unchanged, no index/migration needed for it.