80000x · 03 — cut_manifest schema (YAML/JSON, M1-M16 validation rules)
The single artifact passed from MARK to REVIEW to CUT. Its identity is the SHA-256 of its canonical-JSON serialization (manifest_digest).
1. Top-level structure (YAML for readability; serialized as JSON)
manifest:
  manifest_id: <ulid or uuid; assigned by Agent at MARK time>
  manifest_digest: <sha256(canonical-json(manifest)); recomputed by CUT>
  manifest_format_version: "1.0"
  doc_code: <e.g., "LUAT-XYZ-2024"; if unknown, propose & flag>
  created_by: <agent identity: "codex-mark-r1" | "claude-mark" | ...>
  created_at: <ISO-8601 UTC>
  source:
    type: url | file | inline
    url_or_file: <verbatim from user>
    retrieved_at: <ISO-8601 UTC>
    source_hash: <sha256 of the fetched body>
    source_bytes: <int>
    normalization_rule: "whitespace_collapse_v1" # named rule; see §3
  articles: # one entry per article requested
    - article_label: "Điều 37"
      article_number: 37
      title: <article title text or null>
      original_text_hash: <sha256 of normalized article body>
      boundary:
        start_quote: <first 80 chars, byte-exact from source>
        end_quote: <last 80 chars, byte-exact from source>
        method: regex_label_match | manual_anchor | inline_text_entire_input
      pieces: # array; see §2
        - { … }
      uncertainty_flags: [...] # article-level flags
  reconstruction:
    method: "concat_by_source_position_then_normalize_v1"
    expected_digest: <sha256(normalized_concat) == original_text_hash>
    preview: <first 400 chars of normalized concat, for human eyeball>
    rerun_byte_identical: true | false # Agent self-test
  approval:
    status: pending | approved | rejected | verified
    approved_by: <operator id / DOT id / null>
    approved_at: <ISO-8601 UTC / null>
    approval_doc_id: <KB doc id / null>
    rejection_reason: <string / null>
  cut_record: # filled in by CUT stage; null pre-cut
    cut_at: <ISO-8601 UTC / null>
    cut_by_principal: "workflow_admin" | "directus" | null
    dot_command_run_id: <uuid / null>
    iu_ids_created: [<uuid>, …] / null
    iu_piece_collection_ids: [<uuid>, …] / null
    iu_piece_membership_count: <int / null>
  verify_record: # filled in by VERIFY stage; null pre-verify
    verified_at: <ISO-8601 UTC / null>
    v1_axis_a: PASS | FAIL | null
    v2_axis_b: PASS | FAIL | null
    v3_axis_c: PASS | FAIL | null
    v4_no_qdrant_drift: PASS | FAIL | null
    v5_production_documents_untouched: PASS | FAIL | null
    v6_gates_inert: PASS | FAIL | null
    v7_dot_command_run_present: PASS | FAIL | null
    v8_sql_bridges_resolved: PASS | FAIL | null
    verify_report_doc_id: <KB doc id / null>
  uncertainty_flags: [...] # manifest-level flags
2. Piece schema (one entry of articles[].pieces[])
- local_piece_id: "lp-001-title" # proposal handle; not a real iu_id
  source_position: 1 # Axis A: dense, monotonic, unique-in-article
  depth: 0 # 0=title, 1=clause, 2=sub-point
  parent_local_piece_id: null | "lp-XXX-..." # Axis C
  unit_kind: design_doc_section | law_unit
  section_type: <substrate vocab; see 02 §6>
  piece_role: title | intro | body | step | clause | appendix | reference
  text: <byte-exact piece body from source>
  text_hash: <sha256(text)>
  text_bytes: <int>
  axis_a:
    source_position: <int; MUST match top-level>
    source_url: <article URL or file>
    source_hash: <sha256 of full source body>
  axis_b:
    legal_document: <e.g., "luat-xyz-2024" or null>
    section_type: <e.g., "article" or "paragraph">
    unit_kind: <design_doc_section | law_unit>
    professional_tags: [<string>, …] # optional axis-B extras
  axis_c:
    parent_local_piece_id: <local_piece_id or null>
    depth: <int 0|1|2>
    subtree_position: <int; position among siblings, 1-based>
  sql_bridge: # optional
    link_role: "represents" # or other 11-role vocab term
    object_kind: "table" | "view" | "function" | ...
    direction: "outbound" # MARK always outbound; inbound is V0 carry-forward
    target_object_ref: <e.g., "public.information_unit">
  uncertainty_flags: [...] # piece-level flags
3. Normalization rules
whitespace_collapse_v1
Used by source.normalization_rule and the reconstruction step.
1. Decode source as UTF-8 (BOM stripped if present).
2. Convert CRLF and CR to LF.
3. For each non-empty line: strip trailing whitespace; collapse runs of \t and space to a single space; preserve leading indentation up to 4 spaces (deeper indentation collapses to 4).
4. Collapse runs of ≥3 consecutive LFs to exactly 2.
5. Trim leading/trailing LFs from the final string.
Each piece's text field stores the byte-exact slice from the normalized source. The reconstruction concat re-applies rules 3-5 (the third through fifth steps above) to the joined string.
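The five steps above can be sketched in Python as follows. This is a minimal sketch, not the reference implementation, and it makes three assumptions the text does not settle: "whitespace" means space and tab only; leading indentation is the count of leading space/tab characters, re-emitted as at most 4 spaces; and a whitespace-only line becomes an empty line.

```python
import re


def whitespace_collapse_v1(raw: bytes) -> str:
    """Sketch of whitespace_collapse_v1 under the assumptions stated above."""
    # Step 1: decode as UTF-8; the "utf-8-sig" codec drops a leading BOM if present.
    text = raw.decode("utf-8-sig")
    # Step 2: CRLF first, then any remaining bare CR, become LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Step 3: per-line cleanup.
    lines = []
    for line in text.split("\n"):
        line = line.rstrip(" \t")  # strip trailing whitespace
        if line:
            body = line.lstrip(" \t")
            indent = min(len(line) - len(body), 4)  # assumption: cap at 4 spaces
            line = " " * indent + re.sub(r"[ \t]+", " ", body)
        lines.append(line)
    text = "\n".join(lines)
    # Step 4: three or more consecutive LFs collapse to exactly two.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Step 5: trim leading and trailing LFs.
    return text.strip("\n")
```

Under these assumptions the function is idempotent: running it again on its own UTF-8-encoded output returns the same string, which is the property the reconstruction step relies on.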
concat_by_source_position_then_normalize_v1
- Sort pieces[] by source_position ascending.
- Join with the inter-piece separator the source used (typically \n\n for paragraphs, \n for sub-points). The separator is part of whitespace_collapse_v1 rule 4.
- Apply whitespace_collapse_v1 rules 3-5 to the result.
- Compute sha256; assert equality with articles[].original_text_hash.
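The reconstruction steps above might look like this in Python. It is a sketch under two loudly stated assumptions: a single `separator` is used for every join (the schema shown here does not record a per-piece separator, so a real implementation would need one from somewhere), and `normalize` is a caller-supplied function standing in for rules 3-5 of whitespace_collapse_v1. The function name `reconstruct_digest` is hypothetical.

```python
import hashlib
from typing import Callable


def reconstruct_digest(
    pieces: list[dict],
    normalize: Callable[[str], str],
    separator: str = "\n\n",  # assumption: one separator for all joins
) -> str:
    """Sort, join, normalize, and hash the pieces of one article."""
    ordered = sorted(pieces, key=lambda p: p["source_position"])
    joined = separator.join(p["text"] for p in ordered)
    normalized = normalize(joined)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

A caller would then compare the result with the article's original_text_hash and stop if they differ.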
4. Validation rules (mark stage)
The Agent enforces these before emitting the manifest:
M1 doc_code : matches /^[A-Z][A-Z0-9_-]+$/
M2 articles[] : len >= 1
M3 pieces[] : len >= 1 per article
M4 source_position : dense (1..N, no gaps) within an article
M5 source_position : strictly increasing within an article
M6 source_position : unique within an article
M7 parent_local_piece_id : null OR resolves to a piece in the SAME article
M8 depth : root pieces depth=0; depth(p) == depth(parent(p)) + 1
M9 acyclic : Axis-C graph has no cycles
M10 unit_kind : in {design_doc_section, law_unit}
M11 section_type : in substrate vocab (snapshot in 02 §6; re-verify in DB if available)
M12 piece_role : in {title, intro, body, step, clause, appendix, reference}
M13 text_hash : == sha256(text)
M14 reconstruction : sha256(normalized concat) == original_text_hash (the value recorded as reconstruction.expected_digest)
M15 manifest_digest : present, computed by Step 9 of 02 §2
M16 approval.status : == 'pending' (never 'approved' at MARK time)
A manifest that fails any of M1–M16 is not emitted. The Agent stops and reports.
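As an illustration, the structural rules M3-M9 and M13 can be checked over one article's pieces[] with a few lines of Python. This is a sketch, not the Agent's actual validator: the function name `check_article_pieces` is hypothetical, the pieces are assumed to be plain dicts with the field names from §2, and M5 is read as "strictly increasing in array order".

```python
import hashlib


def check_article_pieces(pieces: list[dict]) -> list[str]:
    """Return the IDs of the violated rules among M3-M9 and M13, in rule order."""
    if not pieces:
        return ["M3"]
    failed: set[str] = set()

    positions = [p["source_position"] for p in pieces]
    if sorted(positions) != list(range(1, len(pieces) + 1)):
        failed.add("M4")  # not dense 1..N
    if any(b <= a for a, b in zip(positions, positions[1:])):
        failed.add("M5")  # assumption: order is judged in array order
    if len(set(positions)) != len(positions):
        failed.add("M6")

    by_id = {p["local_piece_id"]: p for p in pieces}
    for p in pieces:
        parent_id = p["parent_local_piece_id"]
        if parent_id is None:
            if p["depth"] != 0:
                failed.add("M8")  # root pieces must have depth 0
        elif parent_id not in by_id:
            failed.add("M7")  # parent must be a piece of the same article
        elif p["depth"] != by_id[parent_id]["depth"] + 1:
            failed.add("M8")

        # M9: walk the parent chain; revisiting a piece means a cycle.
        seen = {p["local_piece_id"]}
        cursor = parent_id
        while cursor in by_id:
            if cursor in seen:
                failed.add("M9")
                break
            seen.add(cursor)
            cursor = by_id[cursor]["parent_local_piece_id"]

        if hashlib.sha256(p["text"].encode("utf-8")).hexdigest() != p["text_hash"]:
            failed.add("M13")

    return sorted(failed, key=lambda rule: int(rule[1:]))
```

An empty return value means the checked rules pass; any non-empty result corresponds to the "stop and report" case.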
5. Validation rules (review stage)
The reviewer's checklist (04-review-approval-checklist.md) re-runs M1–M16 plus:
R1 source URL is reachable from a fresh fetch
R2 source_hash matches a fresh fetch (within 24 h or recomputed)
R3 start_quote and end_quote appear verbatim in the source
R4 uncertainty_flags : empty OR each flag explicitly resolved by reviewer
R5 reconstruction preview matches the article body when manually compared
R6 doc_code is unique within the existing IU corpus
R7 no piece's text appears verbatim in an existing enacted IU (no duplicate cut)
If R1–R7 all PASS and the reviewer agrees, they set:
approval:
  status: approved
  approved_by: <operator id>
  approved_at: <ISO-8601 UTC>
  approval_doc_id: <KB doc id of the approval record>
and re-upload the manifest to a stable KB path. From this moment the manifest is immutable.
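Two of the review checks, R2 and R3, are mechanical enough to sketch in Python. The function name `check_r2_r3` is hypothetical, and the sketch assumes that source_hash covers the raw fetched bytes (as the schema comment "sha256 of the fetched body" suggests) and that the boundary quotes are matched against the UTF-8-decoded body before normalization. The fetch itself (R1) is left to the caller.

```python
import hashlib


def check_r2_r3(manifest: dict, fresh_body: bytes) -> dict[str, bool]:
    """Evaluate R2 and R3 against a freshly fetched source body."""
    source_text = fresh_body.decode("utf-8-sig")  # tolerate a leading BOM
    r2 = hashlib.sha256(fresh_body).hexdigest() == manifest["source"]["source_hash"]
    r3 = all(
        article["boundary"]["start_quote"] in source_text
        and article["boundary"]["end_quote"] in source_text
        for article in manifest["articles"]
    )
    return {"R2": r2, "R3": r3}
```

The remaining checks (R4-R7) depend on reviewer judgment or on the existing IU corpus and are not covered here.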
6. Validation rules (cut stage)
The DOT command (dot_iu_cut_from_manifest) re-checks:
C1 approval.status == 'approved'
C2 approval_doc_id resolves to a real KB doc
C3 approved_at fresh (default ≤ 24 h; configurable)
C4 manifest_digest recomputed == manifest_digest in file
C5 every section_type / unit_kind / piece_role / link_role still in the live DB vocab
C6 no row already exists at the proposed canonical_address (G6-style cut-once guard)
C7 fresh pg_dump within ≤ 60 min (G7 backup gate)
C8 principal in {workflow_admin, directus_with_iu_grant} per channel constitution
Failure of any C check raises RefusedCut and does not write a single row.
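The refuse-before-write behaviour can be illustrated with a sketch of C1, C3 and C4. This is not the dot_iu_cut_from_manifest implementation: the function name `check_cut_gates` and its parameters are hypothetical, and the digest recomputation is passed in as a callable because this section does not spell out how the manifest_digest field itself is treated when hashing. C2 and C5-C8 need the KB, the live DB, and the backup state, so they are omitted.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable


class RefusedCut(Exception):
    """Raised when any cut-stage check fails; nothing is written."""


def check_cut_gates(
    manifest: dict,
    now: datetime,
    recompute_digest: Callable[[dict], str],  # assumption: supplied by the caller
    max_age: timedelta = timedelta(hours=24),  # C3 default; configurable
) -> None:
    approval = manifest["approval"]
    if approval["status"] != "approved":
        raise RefusedCut("C1: approval.status is not 'approved'")

    approved_at_raw = approval["approved_at"]
    if approved_at_raw is None:
        raise RefusedCut("C3: approved_at is missing")
    # Accept a trailing 'Z' on Python versions whose fromisoformat rejects it.
    approved_at = datetime.fromisoformat(approved_at_raw.replace("Z", "+00:00"))
    if now - approved_at > max_age:
        raise RefusedCut("C3: approval is stale")

    if recompute_digest(manifest) != manifest["manifest_digest"]:
        raise RefusedCut("C4: manifest_digest mismatch")
```

All checks run before any write is attempted, so a RefusedCut leaves the database untouched by construction.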
7. JSON example (truncated)
{
  "manifest": {
    "manifest_id": "01J9XYZABCMARK37",
    "manifest_digest": "<sha256>",
    "manifest_format_version": "1.0",
    "doc_code": "LUAT-XYZ-2024",
    "created_by": "codex-mark-r1",
    "created_at": "2026-05-25T12:34:56Z",
    "source": { "type": "url", "url_or_file": "https://example.gov.vn/...dieu-37", "retrieved_at": "...", "source_hash": "...", "source_bytes": 21345, "normalization_rule": "whitespace_collapse_v1" },
    "articles": [{
      "article_label": "Điều 37",
      "article_number": 37,
      "title": "Tên điều khoản",
      "original_text_hash": "<sha256>",
      "boundary": { "start_quote": "Điều 37. Tên điều khoản...", "end_quote": "...đối với cá nhân.", "method": "regex_label_match" },
      "pieces": [ /* see §2 */ ],
      "uncertainty_flags": []
    }],
    "reconstruction": { "method": "concat_by_source_position_then_normalize_v1", "expected_digest": "<sha256>", "preview": "Điều 37. Tên điều khoản\n\n1. ...", "rerun_byte_identical": true },
    "approval": { "status": "pending", "approved_by": null, "approved_at": null, "approval_doc_id": null, "rejection_reason": null },
    "cut_record": null,
    "verify_record": null,
    "uncertainty_flags": []
  }
}
8. Canonical JSON serialization
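import hashlib
import json

def canonical(obj) -> bytes:
    # Sorted keys, no insignificant whitespace, non-ASCII characters kept as UTF-8.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False).encode("utf-8")

# `manifest` is the parsed manifest object (a dict).
manifest_digest = hashlib.sha256(canonical(manifest)).hexdigest()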
Keys are sorted lexicographically at every level. Whitespace is suppressed. Unicode is preserved. Floating-point numbers are forbidden (use integers for all counts/positions).
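A short sketch of these properties, using the same json.dumps settings. The helper `assert_no_floats` is hypothetical (the text only states that floats are forbidden, not how that is enforced); it walks the structure and raises on the first float it finds.

```python
import hashlib
import json


def canonical(obj) -> bytes:
    # Same settings as the serialization rule: sorted keys, compact separators, raw Unicode.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False).encode("utf-8")


def assert_no_floats(obj, path: str = "manifest") -> None:
    """Hypothetical guard: reject any float anywhere in the structure."""
    if isinstance(obj, float):
        raise TypeError(f"float not allowed at {path}")
    if isinstance(obj, dict):
        for key, value in obj.items():
            assert_no_floats(value, f"{path}.{key}")
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            assert_no_floats(value, f"{path}[{index}]")


# Key insertion order does not affect the bytes, so it does not affect the digest.
a = {"b": 1, "a": {"d": "Điều", "c": 2}}
b = {"a": {"c": 2, "d": "Điều"}, "b": 1}
same_digest = hashlib.sha256(canonical(a)).hexdigest() == hashlib.sha256(canonical(b)).hexdigest()
```

Running the float guard before hashing avoids platform-dependent number formatting leaking into the digest.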