KB-5233

80000x · 03 — cut_manifest schema (YAML/JSON, M1-M16 validation rules)



The cut_manifest is the single artifact passed from MARK to REVIEW to CUT. Its identity is the SHA-256 of its canonical-JSON serialization (manifest_digest).

1. Top-level structure (YAML for readability; serialized as JSON)

manifest:
  manifest_id:           <ulid or uuid; assigned by Agent at MARK time>
  manifest_digest:       <sha256(canonical-json(manifest)); recomputed by CUT>
  manifest_format_version: "1.0"
  doc_code:              <e.g., "LUAT-XYZ-2024"; if unknown, propose & flag>
  created_by:            <agent identity: "codex-mark-r1" | "claude-mark" | ...>
  created_at:            <ISO-8601 UTC>

  source:
    type:                url | file | inline
    url_or_file:         <verbatim from user>
    retrieved_at:        <ISO-8601 UTC>
    source_hash:         <sha256 of the fetched body>
    source_bytes:        <int>
    normalization_rule:  "whitespace_collapse_v1"      # named rule; see §3

  articles:                                            # one entry per article requested
    - article_label:            "Điều 37"
      article_number:           37
      title:                    <article title text or null>
      original_text_hash:       <sha256 of normalized article body>
      boundary:
        start_quote:            <first 80 chars, byte-exact from source>
        end_quote:              <last 80 chars, byte-exact from source>
        method:                 regex_label_match | manual_anchor | inline_text_entire_input
      pieces:                                          # array; see §2
        - { … }
      uncertainty_flags:        [...]                  # article-level flags

  reconstruction:
    method:                     "concat_by_source_position_then_normalize_v1"
    expected_digest:            <sha256(normalized_concat) == original_text_hash>
    preview:                    <first 400 chars of normalized concat, for human eyeball>
    rerun_byte_identical:       true | false           # Agent self-test

  approval:
    status:                     pending | approved | rejected | verified
    approved_by:                <operator id / DOT id / null>
    approved_at:                <ISO-8601 UTC / null>
    approval_doc_id:            <KB doc id / null>
    rejection_reason:           <string / null>

  cut_record:                                          # filled in by CUT stage; null pre-cut
    cut_at:                     <ISO-8601 UTC / null>
    cut_by_principal:           "workflow_admin" | "directus" | null
    dot_command_run_id:         <uuid / null>
    iu_ids_created:             [<uuid>, …] / null
    iu_piece_collection_ids:    [<uuid>, …] / null
    iu_piece_membership_count:  <int / null>

  verify_record:                                       # filled in by VERIFY stage; null pre-verify
    verified_at:                <ISO-8601 UTC / null>
    v1_axis_a:                  PASS | FAIL | null
    v2_axis_b:                  PASS | FAIL | null
    v3_axis_c:                  PASS | FAIL | null
    v4_no_qdrant_drift:         PASS | FAIL | null
    v5_production_documents_untouched: PASS | FAIL | null
    v6_gates_inert:             PASS | FAIL | null
    v7_dot_command_run_present: PASS | FAIL | null
    v8_sql_bridges_resolved:    PASS | FAIL | null
    verify_report_doc_id:       <KB doc id / null>

  uncertainty_flags:            [...]                  # manifest-level flags

2. Piece schema (one entry of articles[].pieces[])

- local_piece_id:               "lp-001-title"          # proposal handle; not a real iu_id
  source_position:              1                       # Axis A: dense, monotonic, unique-in-article
  depth:                        0                       # 0=title, 1=clause, 2=sub-point
  parent_local_piece_id:        null | "lp-XXX-..."     # Axis C
  unit_kind:                    design_doc_section | law_unit
  section_type:                 <substrate vocab; see 02 §6>
  piece_role:                   title | intro | body | step | clause | appendix | reference
  text:                         <byte-exact piece body from source>
  text_hash:                    <sha256(text)>
  text_bytes:                   <int>

  axis_a:
    source_position:            <int; MUST match top-level>
    source_url:                 <article URL or file>
    source_hash:                <sha256 of full source body>

  axis_b:
    legal_document:             <e.g., "luat-xyz-2024" or null>
    section_type:               <e.g., "article" or "paragraph">
    unit_kind:                  <design_doc_section | law_unit>
    professional_tags:          [<string>, …]          # optional axis-B extras

  axis_c:
    parent_local_piece_id:      <local_piece_id or null>
    depth:                      <int 0|1|2>
    subtree_position:           <int; position among siblings, 1-based>

  sql_bridge:                                          # optional
    link_role:                  "represents"           # or other 11-role vocab term
    object_kind:                "table" | "view" | "function" | ...
    direction:                  "outbound"             # MARK always outbound; inbound is V0 carry-forward
    target_object_ref:          <e.g., "public.information_unit">

  uncertainty_flags:            [...]                  # piece-level flags
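
The derived fields text_hash and text_bytes can be filled in mechanically from text. A minimal sketch, assuming the hash is the lowercase hex SHA-256 of the UTF-8 bytes and that text_bytes counts UTF-8 bytes (the schema does not state the encoding explicitly); the helper name piece_text_fields is hypothetical:

```python
import hashlib


def piece_text_fields(text: str) -> dict:
    """Build the text / text_hash / text_bytes trio for one piece.

    Assumption: hashing and byte counting both use the UTF-8 encoding of `text`.
    """
    raw = text.encode("utf-8")
    return {
        "text": text,
        "text_hash": hashlib.sha256(raw).hexdigest(),
        "text_bytes": len(raw),
    }


fields = piece_text_fields("Điều 37. Tên điều khoản")
# Vietnamese text uses multi-byte characters, so bytes exceed characters.
print(fields["text_bytes"] > len(fields["text"]))  # True
```

Counting bytes rather than characters matters for Vietnamese text, where many letters take two or three bytes in UTF-8.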

3. Normalization rules

whitespace_collapse_v1

Used by source.normalization_rule and the reconstruction step.

  1. Decode source as UTF-8 (BOM stripped if present).
  2. Convert CRLF and CR to LF.
  3. For each non-empty line: strip trailing whitespace; collapse interior runs of \t and space to a single space; preserve leading indentation up to 4 spaces (deeper indentation collapses to 4).
  4. Collapse runs of ≥3 consecutive LFs to exactly 2.
  5. Trim leading/trailing LFs from the final string.

Each piece's text field stores the byte-exact slice of the normalized source. The reconstruction concat re-applies rules 3-5 to the joined string.
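
The five rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the reference implementation: "whitespace" is taken to mean space and tab only, each leading space or tab counts as one column of indentation, and the helper names _normalize_line and rules_3_to_5 are hypothetical.

```python
import re


def _normalize_line(line: str) -> str:
    """Rule 3 for a single line (the line contains no LF)."""
    line = line.rstrip(" \t")  # assumption: whitespace means space and tab
    body = line.lstrip(" \t")
    if not body:
        return ""
    # Assumption: every leading space or tab counts as one indentation column.
    indent = min(len(line) - len(body), 4)
    return " " * indent + re.sub(r"[ \t]+", " ", body)


def rules_3_to_5(text: str) -> str:
    """Rules 3-5, the part that reconstruction re-applies to a joined string."""
    text = "\n".join(_normalize_line(line) for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)  # rule 4
    return text.strip("\n")                 # rule 5


def whitespace_collapse_v1(raw: bytes) -> str:
    text = raw.decode("utf-8-sig")  # rule 1; utf-8-sig drops a leading BOM
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # rule 2
    return rules_3_to_5(text)
```

Under these assumptions the output is a fixed point of rules 3-5, which is what lets the reconstruction step re-apply them without changing an already-normalized article body.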

concat_by_source_position_then_normalize_v1

  1. Sort pieces[] by source_position ascending.
  2. Join with the inter-piece separator the source used (typically \n\n for paragraphs, \n for sub-points). Separator runs are subject to whitespace_collapse_v1 rule 4, so a run of three or more LFs collapses to exactly two.
  3. Apply whitespace_collapse_v1 rules 3-5 to the result.
  4. Compute sha256; assert equality with articles[].original_text_hash.
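
The four steps above might look like this in Python. The schema has no field that records the inter-piece separator, so this sketch takes a hypothetical separators list (one entry between each pair of adjacent pieces, in source order) and a caller-supplied normalize callable standing in for whitespace_collapse_v1 rules 3-5. The function name reconstruct_digest is also an assumption.

```python
import hashlib
from typing import Callable, Sequence


def reconstruct_digest(
    pieces: Sequence[dict],
    separators: Sequence[str],
    normalize: Callable[[str], str],
) -> str:
    """Return the sha256 hex digest of the normalized concatenation."""
    if not pieces:
        raise ValueError("an article needs at least one piece")
    ordered = sorted(pieces, key=lambda p: p["source_position"])  # step 1
    if len(separators) != len(ordered) - 1:
        raise ValueError("need exactly one separator between adjacent pieces")
    chunks = []
    for i, piece in enumerate(ordered):  # step 2
        if i > 0:
            chunks.append(separators[i - 1])
        chunks.append(piece["text"])
    normalized = normalize("".join(chunks))  # step 3
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()  # step 4
```

The caller then compares the returned digest with the article's original_text_hash and treats any mismatch as a failed reconstruction.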

4. Validation rules (mark stage)

The Agent enforces these before emitting the manifest:

M1  doc_code         : matches /^[A-Z][A-Z0-9_-]+$/
M2  articles[]       : len >= 1
M3  pieces[]         : len >= 1 per article
M4  source_position  : dense (1..N, no gaps) within an article
M5  source_position  : strictly increasing within an article
M6  source_position  : unique within an article
M7  parent_local_piece_id : null OR resolves to a piece in the SAME article
M8  depth            : root pieces depth=0; depth(p) == depth(parent(p)) + 1
M9  acyclic          : Axis-C graph has no cycles
M10 unit_kind        : in {design_doc_section, law_unit}
M11 section_type     : in substrate vocab (snapshot in 02 §6; re-verify in DB if available)
M12 piece_role       : in {title, intro, body, step, clause, appendix, reference}
M13 text_hash        : == sha256(text)
M14 reconstruction   : expected_digest == original_text_hash
M15 manifest_digest  : present, computed by Step 9 of 02 §2
M16 approval.status  : == 'pending' (never 'approved' at MARK time)

A manifest that fails any of M1–M16 is not emitted. The Agent stops and reports.
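
Several of the M rules are purely structural and can be checked without touching the source or the database. A minimal sketch covering M1, M3-M10, M12 and M13 for one article's pieces; the function names are hypothetical, and text_hash is assumed to be the lowercase hex SHA-256 of the UTF-8 bytes of text:

```python
import hashlib
import re

DOC_CODE_RE = re.compile(r"^[A-Z][A-Z0-9_-]+$")  # M1
UNIT_KINDS = {"design_doc_section", "law_unit"}  # M10
PIECE_ROLES = {"title", "intro", "body", "step", "clause", "appendix", "reference"}  # M12


def check_doc_code(doc_code: str) -> bool:
    return DOC_CODE_RE.fullmatch(doc_code) is not None


def check_pieces(pieces: list[dict]) -> list[str]:
    """Return the IDs of the rules violated by one article's pieces[]."""
    if not pieces:
        return ["M3"]
    failed = set()
    positions = [p["source_position"] for p in pieces]
    n = len(positions)
    if set(positions) != set(range(1, n + 1)):
        failed.add("M4")  # not dense 1..N
    if any(a >= b for a, b in zip(positions, positions[1:])):
        failed.add("M5")  # array order is not strictly increasing
    if len(set(positions)) != n:
        failed.add("M6")  # duplicate position
    by_id = {p["local_piece_id"]: p for p in pieces}
    for p in pieces:
        parent_id = p["parent_local_piece_id"]
        parent = by_id.get(parent_id)
        if parent_id is None:
            if p["depth"] != 0:
                failed.add("M8")
        elif parent is None:
            failed.add("M7")  # parent not found in the same article
        elif p["depth"] != parent["depth"] + 1:
            failed.add("M8")
        # M9: walk up the parent chain; revisiting a node means a cycle.
        seen, node = set(), p
        while node is not None:
            if node["local_piece_id"] in seen:
                failed.add("M9")
                break
            seen.add(node["local_piece_id"])
            node = by_id.get(node["parent_local_piece_id"])
        if p["unit_kind"] not in UNIT_KINDS:
            failed.add("M10")
        if p["piece_role"] not in PIECE_ROLES:
            failed.add("M12")
        if hashlib.sha256(p["text"].encode("utf-8")).hexdigest() != p["text_hash"]:
            failed.add("M13")
    return sorted(failed, key=lambda rule: int(rule[1:]))
```

Returning every violated rule at once, rather than stopping at the first, gives the Agent a complete report to hand back when it refuses to emit the manifest.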

5. Validation rules (review stage)

The reviewer's checklist (04-review-approval-checklist.md) re-runs M1–M16 plus:

R1  source URL is reachable from a fresh fetch
R2  source_hash matches a fresh fetch (within 24 h or recomputed)
R3  start_quote and end_quote appear verbatim in the source
R4  uncertainty_flags  : empty OR each flag explicitly resolved by reviewer
R5  reconstruction preview matches the article body when manually compared
R6  doc_code is unique within the existing IU corpus
R7  no piece's text appears verbatim in an existing enacted IU (no duplicate cut)

If M1–M16 and R1–R7 all PASS and the reviewer agrees, they set:

approval:
  status: approved
  approved_by: <operator id>
  approved_at: <ISO-8601 UTC>
  approval_doc_id: <KB doc id of the approval record>

and re-upload the manifest to a stable KB path. From this moment the manifest is immutable.
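
Of the review checks, R2 and R3 are the easiest to automate once the source body has been re-fetched. A minimal sketch, assuming the caller supplies the freshly fetched bytes and that the quotes are compared as UTF-8 bytes. The function name review_source_checks is hypothetical, and requiring end_quote to start at or after start_quote goes slightly beyond the literal wording of R3.

```python
import hashlib


def review_source_checks(
    body: bytes,
    manifest_source_hash: str,
    start_quote: str,
    end_quote: str,
) -> dict:
    """Evaluate R2 and R3 against a freshly fetched source body."""
    r2 = hashlib.sha256(body).hexdigest() == manifest_source_hash
    start = body.find(start_quote.encode("utf-8"))
    # Assumption: end_quote must occur at or after the start of start_quote.
    r3 = start >= 0 and body.find(end_quote.encode("utf-8"), start) >= 0
    return {"R2": r2, "R3": r3}
```

Working on raw bytes keeps the comparison byte-exact, which matches how boundary.start_quote and boundary.end_quote are defined.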

6. Validation rules (cut stage)

The DOT command (dot_iu_cut_from_manifest) re-checks:

C1  approval.status == 'approved'
C2  approval_doc_id resolves to a real KB doc
C3  approved_at is fresh (default ≤ 24 h; configurable)
C4  manifest_digest recomputed == manifest_digest in file
C5  every section_type / unit_kind / piece_role / link_role still in the live DB vocab
C6  no row already exists at the proposed canonical_address (G6-style cut-once guard)
C7  pg_dump taken within the last 60 min (G7 backup gate)
C8  principal in {workflow_admin, directus_with_iu_grant} per channel constitution

Failure of any C check raises RefusedCut and does not write a single row.
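
The fail-closed behaviour can be illustrated with the checks that need no database access (C1, C3, C4). This is a sketch, not the dot_iu_cut_from_manifest implementation: the function name precheck_cut is hypothetical, the digest recomputation is passed in as a callable because its exact input is defined elsewhere, and refusing an approved_at that lies in the future is an added assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable


class RefusedCut(Exception):
    """Raised when a cut-stage check fails; nothing may be written afterwards."""


def precheck_cut(
    manifest: dict,
    recompute_digest: Callable[[dict], str],
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
) -> None:
    approval = manifest["approval"]
    if approval["status"] != "approved":
        raise RefusedCut("C1: approval.status is not 'approved'")
    if not approval["approved_at"]:
        raise RefusedCut("C3: approved_at is missing")
    # Accept a trailing 'Z' on Python versions whose fromisoformat rejects it.
    approved_at = datetime.fromisoformat(approval["approved_at"].replace("Z", "+00:00"))
    age = now - approved_at  # `now` must be timezone-aware
    if age < timedelta(0) or age > max_age:
        raise RefusedCut("C3: approval is not fresh")
    if recompute_digest(manifest) != manifest["manifest_digest"]:
        raise RefusedCut("C4: manifest_digest mismatch")


manifest = {
    "manifest_digest": "abc",
    "approval": {"status": "approved", "approved_at": "2026-05-25T12:00:00Z"},
}
now = datetime(2026, 5, 25, 13, 0, tzinfo=timezone.utc)
precheck_cut(manifest, lambda m: "abc", now)  # passes silently
```

Running every check before the first write, and signalling failure with an exception, is one simple way to guarantee that a refused cut leaves no partial rows behind.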

7. JSON example (truncated)

{
  "manifest": {
    "manifest_id": "01J9XYZABCMARK37",
    "manifest_digest": "<sha256>",
    "manifest_format_version": "1.0",
    "doc_code": "LUAT-XYZ-2024",
    "created_by": "codex-mark-r1",
    "created_at": "2026-05-25T12:34:56Z",
    "source": { "type": "url", "url_or_file": "https://example.gov.vn/...dieu-37", "retrieved_at": "...", "source_hash": "...", "source_bytes": 21345, "normalization_rule": "whitespace_collapse_v1" },
    "articles": [{
      "article_label": "Điều 37",
      "article_number": 37,
      "title": "Tên điều khoản",
      "original_text_hash": "<sha256>",
      "boundary": { "start_quote": "Điều 37. Tên điều khoản...", "end_quote": "...đối với cá nhân.", "method": "regex_label_match" },
      "pieces": [ /* see §2 */ ],
      "uncertainty_flags": []
    }],
    "reconstruction": { "method": "concat_by_source_position_then_normalize_v1", "expected_digest": "<sha256>", "preview": "Điều 37. Tên điều khoản\n\n1. ...", "rerun_byte_identical": true },
    "approval": { "status": "pending", "approved_by": null, "approved_at": null, "approval_doc_id": null, "rejection_reason": null },
    "cut_record": null,
    "verify_record": null,
    "uncertainty_flags": []
  }
}

8. Canonical JSON serialization

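import hashlib
import json

def canonical(obj) -> bytes:
    # Sorted keys, no whitespace between tokens, unescaped Unicode, UTF-8 bytes.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False).encode("utf-8")

# `manifest` is the parsed manifest object (a dict).
manifest_digest = hashlib.sha256(canonical(manifest)).hexdigest()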

Keys are sorted lexicographically at every level. Whitespace between tokens is suppressed. Unicode is preserved (not \u-escaped) and encoded as UTF-8. Floating-point numbers are forbidden (use integers for all counts/positions).
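
A short demonstration of these properties: two objects that differ only in key order serialize to the same bytes, and floats are rejected before hashing. The canonical function repeats the definition above; the reject_floats helper is a hypothetical addition, since json.dumps on its own will serialize floats without complaint.

```python
import json


def canonical(obj) -> bytes:
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False).encode("utf-8")


def reject_floats(obj, path: str = "$") -> None:
    """Raise TypeError if any float appears anywhere in the structure."""
    if isinstance(obj, float):
        raise TypeError(f"float not allowed at {path}")
    if isinstance(obj, dict):
        for key, value in obj.items():
            reject_floats(value, f"{path}.{key}")
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            reject_floats(value, f"{path}[{index}]")


a = {"b": 1, "a": {"z": "Điều 37", "y": [2, 1]}}
b = {"a": {"y": [2, 1], "z": "Điều 37"}, "b": 1}
reject_floats(a)
assert canonical(a) == canonical(b)  # key order does not affect the bytes
print(canonical(a).decode("utf-8"))  # {"a":{"y":[2,1],"z":"Điều 37"},"b":1}
```

Array order is preserved, so it still affects the digest; only object keys are reordered.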
