KB-4A7B

dot-iu-cutter v0.5 — Canonicalization & Address Grammar Design (DESIGN ONLY) (2026-05-17)

Revision 1
Tags: dot-iu-cutter, v0.5, canonicalization, address-grammar, canonical-address, iu-id, provenance, design-only, dieu44

Date: 2026-05-17
Phase: v0_5_constitution_hardtest_and_information_unit_factory_master_plan
Nature: DESIGN ONLY. No parser runs, no IU is produced, no cut is made.

Parent: dot-iu-cutter-v0.5-constitution-hardtest-master-plan-2026-05-17.md


1. The grammar problem (corrected by grounding)

The handoff assumed a national-Constitution grammar: Chương / Điều / Khoản / Điểm / đoạn (Chapter / Article / Clause / Point / paragraph). The configured fixture is the internal Incomex Architecture Constitution v4.6.3, whose actual hierarchy is:

observed_hierarchy:
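  L0_preamble: "Văn bản tối cao..." (opens as "Supreme document..."; no classic Lời nói đầu, i.e. Preamble)
  L1_principles: "15 Nguyên tắc Nền tảng" (15 Foundational Principles, numbered 1..15)
  L1_infra: "KIẾN TRÚC HẠ TẦNG" (Infrastructure Architecture) with sections A, B, C
  L1_law_index: "Mục lục Luật" (Index of Laws) -> Điều 0..44, each with {Tên, File, Ghi chú} (Name, File, Notes)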
  status_markers: ✅ ENACTED | 📋 CONTROLLED DRAFT inline on nodes
  NOT_present: Chương / Khoản / Điểm classic tree

Therefore: grammar must be detected and selected from a grammar-profile registry, never hardcoded. A single hardcoded Chương/Điều/Khoản/Điểm parser would mis-cut this document. This is open decision OD-G1 and risk R-GRM-1.
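
As an illustration of "detected and selected, never hardcoded", the sketch below picks a profile from registry-style rules and fails closed when zero or several profiles match. It is a minimal sketch, not the cutter's implementation: `DetectionRule`, the signal names (`has_chuong`, `has_law_index`, ...) and `AmbiguousGrammar` are hypothetical names introduced here, and in the real design the rules would be registry rows rather than a Python list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionRule:
    # Hypothetical shape of one registry rule: a profile plus the
    # document signals that must all be present for it to match.
    grammar_profile_ref: str
    required_signals: frozenset


class AmbiguousGrammar(Exception):
    """Zero or several profiles matched; the caller routes to human review."""


def detect_profile(doc_signals, rules):
    """Return the single matching profile ref, or fail closed."""
    matches = [
        r.grammar_profile_ref
        for r in rules
        if r.required_signals <= set(doc_signals)  # subset test
    ]
    if len(matches) != 1:
        # No silent default: an ambiguous document is never cut.
        raise AmbiguousGrammar(f"expected exactly one profile, got {matches}")
    return matches[0]


# Illustrative rules only; the signal names are assumptions.
RULES = [
    DetectionRule("vn-national-constitution-2013",
                  frozenset({"has_chuong", "has_khoan"})),
    DetectionRule("incomex-architecture-constitution-v4",
                  frozenset({"has_nguyen_tac", "has_law_index"})),
]
```

A document exposing `has_nguyen_tac` and `has_law_index` resolves to `incomex-architecture-constitution-v4`; a document matching neither rule, or both, raises instead of falling back to a default.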


2. Grammar-profile registry (design — NOT created)

grammar_profile:
  grammar_profile_ref: text PK              # e.g. "vn-national-constitution-2013",
                                            #      "incomex-architecture-constitution-v4"
  level_definitions:                        # ordered, config-driven
    - level: name (e.g. NGUYEN_TAC | KIEN_TRUC_SECTION | DIEU | KHOAN | DIEM | DOAN)
      matcher_kind: regex|structural|heading-rule (config, not inline literal)
      matcher_ref: pointer to matcher config row
      numbering_scheme: arabic|roman|letter|none
      is_leaf_candidate: boolean
  status_marker_rules:
    - marker: "✅"  => enacted
    - marker: "📋"  => controlled_draft (eligible_for_cut depends on enacted_only_policy)
  address_template_ref: pointer to address grammar (see §4)

Profiles are data. The Constitution hardtest ships a new profile incomex-architecture-constitution-v4; the national 2013 profile remains available but is not the fixture profile.
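
To make "profiles are data" concrete, here is one possible in-memory shape for a profile row, with status-marker semantics read from the profile rather than from inline string checks. This is a sketch under assumptions: the class names, the `matcher_ref` values, the `address_template_ref` value and the sample heading are all hypothetical, and raising on an unknown status is an assumed fail-closed behavior that the design does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelDefinition:
    level: str              # e.g. "NGUYEN_TAC", "DIEU"
    matcher_kind: str       # "regex" | "structural" | "heading-rule"
    matcher_ref: str        # pointer to a matcher config row (hypothetical ids)
    numbering_scheme: str   # "arabic" | "roman" | "letter" | "none"
    is_leaf_candidate: bool


@dataclass(frozen=True)
class GrammarProfile:
    grammar_profile_ref: str
    level_definitions: tuple
    status_marker_rules: dict   # marker -> status, taken from config
    address_template_ref: str


def status_of(profile, raw_heading):
    """Look up the status marker via profile config, not a literal check."""
    for marker, status in profile.status_marker_rules.items():
        if marker in raw_heading:
            return status
    return None


def eligible_for_cut(status, enacted_only_policy):
    """Drafts are cut only when the enacted-only policy is off."""
    if status == "enacted":
        return True
    if status == "controlled_draft":
        return not enacted_only_policy
    raise ValueError(f"unknown status {status!r}")  # assumed fail-closed


PROFILE = GrammarProfile(
    grammar_profile_ref="incomex-architecture-constitution-v4",
    level_definitions=(
        LevelDefinition("NGUYEN_TAC", "heading-rule", "matcher:nt", "arabic", True),
        LevelDefinition("DIEU", "heading-rule", "matcher:dieu", "arabic", True),
    ),
    status_marker_rules={"✅": "enacted", "📋": "controlled_draft"},
    address_template_ref="address-template:icx",
)
```

With this profile, a heading such as "Điều 44 📋 CONTROLLED DRAFT" resolves to `controlled_draft`, which is excluded from the cut whenever `enacted_only_policy` is true.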


3. Canonicalization stages

canon_pipeline:
  C0_input: parser-ready normalized content + source_span table (ingestion doc)
  C1_grammar_detect:
    do: choose grammar_profile_ref (registry rule on doc signals); fail closed
        if ambiguous -> human review queue (no silent default)
  C2_hierarchy_extract:
    do: walk content per profile level_definitions, build node tree
        node = {level, ordinal, raw_heading, status_marker, span_ref}
  C3_address_derivation:
    do: assign deterministic canonical_address per node (see §4)
  C4_iu_id_derivation:
    do: leaf nodes -> IU; iu_id = deterministic(document_version_id, canonical_address)
        entry_id = deterministic(iu_id, manifest_strategy)  (see §5)
  C5_text_normalization:
    do: canonical text = trimmed, NFC, internal-whitespace-collapsed,
        diacritics + status tokens preserved; store canonical_text_checksum
  C6_provenance_link:
    do: bind every IU to source_span.span_id + span_checksum (reject if missing)
  C7_grammar_validation:
    do: assert tree well-formed per profile (no orphan level, monotone ordinals);
        violations -> review queue, NOT auto-fixed
  C8_review_queue:
    do: ambiguous/draft/violating segments queued for human/GPT decision
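
Stage C5 is simple enough to sketch directly. The example below applies NFC normalization, collapses internal whitespace, trims, and hashes the result. SHA-256 over UTF-8 bytes is an assumption made for illustration; the design only says a `canonical_text_checksum` is stored, and the function names are hypothetical.

```python
import hashlib
import re
import unicodedata


def canonical_text(raw):
    """NFC-normalize, collapse internal whitespace runs, trim the ends.

    Diacritics and status tokens (e.g. the check-mark emoji) are left
    in place; only the Unicode composition form and spacing change.
    """
    text = unicodedata.normalize("NFC", raw)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def canonical_text_checksum(raw):
    # Assumption: SHA-256 of the UTF-8 canonical text, hex-encoded.
    return hashlib.sha256(canonical_text(raw).encode("utf-8")).hexdigest()


# The same heading typed with decomposed and precomposed diacritics.
decomposed = "  \u0110ie\u0302\u0300u   44 \n \u2705 ENACTED "
precomposed = "\u0110i\u1ec1u 44 \u2705 ENACTED"
```

Both inputs canonicalize to the same string and therefore the same checksum, which is the property a re-ingest of the same content relies on.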

4. Stable canonical_address grammar

canonical_address:
  goal: globally unique, deterministic, human-legible, stable across re-ingest
        of the same document_version
  shape (proposal, OD-A1):
    "<DOCPREFIX>/<L1>/<L2>/.../<Lk>"   # "/" between path segments, as in the examples below
  docprefix:
    - derived from source_document_ref, NOT a literal (e.g. ICX-CONST)
    - guarantees no collision with existing DIEU_28/32/35 addresses (e.g. D38-DIEU28-...)
  examples (illustrative, profile incomex-architecture-constitution-v4):
    - "ICX-CONST/NT-12"            # Nguyên tắc 12
    - "ICX-CONST/KT-A"             # Kiến trúc Hạ tầng section A
    - "ICX-CONST/DIEU-44"          # Điều 44 (📋 controlled draft -> may be cut-excluded)
  rules:
    - ordinals are taken from the source numbering; a normalized form (roman->arabic)
      is recorded as metadata, but the address keeps the source form for legibility;
      uniqueness comes from the full path
    - address NEVER encodes volatile state (status marker is metadata, not address)
    - re-ingest of same content_checksum => identical addresses (determinism A3)
    - address is namespace-prefixed so cross-document uniqueness holds (coexistence)

canonical_address_alias exists in schema but alias writes remain forbidden; aliasing policy is deferred (master plan §10).
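
The address rules in this section can be sketched as a pure function of the document prefix and the node's path. This is a hedged illustration only: the "/" separator and the short level codes (`NT`, `KT`, `DIEU`) follow the illustrative examples above and remain subject to OD-A1, and `canonical_address` as a function name is hypothetical. The docprefix is passed in because it is derived from `source_document_ref`, and no status argument exists, so volatile state cannot leak into the address.

```python
def canonical_address(docprefix, path):
    """Build an address from a docprefix and a root-to-node path.

    path is a sequence of (level_code, ordinal_as_written) pairs.
    The ordinal keeps its source form (e.g. "A", "12"); any normalized
    form would be stored as metadata, not in the address.
    """
    if not docprefix:
        raise ValueError("docprefix must be derived, not empty")
    if not path:
        raise ValueError("a node needs at least one level in its path")
    segments = [f"{code}-{ordinal}" for code, ordinal in path]
    return "/".join([docprefix, *segments])


# Same inputs always give the same address (determinism on re-ingest).
nt12 = canonical_address("ICX-CONST", [("NT", "12")])
kt_a = canonical_address("ICX-CONST", [("KT", "A")])
dieu44 = canonical_address("ICX-CONST", [("DIEU", "44")])
```

These calls reproduce the three illustrative addresses listed above: `ICX-CONST/NT-12`, `ICX-CONST/KT-A` and `ICX-CONST/DIEU-44`.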


5. Stable IU id / entry id

identity:
  iu_id        = sha-derived(document_version_id, canonical_address)      # content-stable
  entry_id     = sha-derived(iu_id, manifest_strategy_marker)             # ledger key
  properties:
    - same version + same address => same iu_id (idempotent re-cut guard, A6)
    - different document_version_id (source changed) => new iu_id (no silent overwrite)
    - DIEU_28 IU vs Constitution IU cannot collide (docprefix in address -> in iu_id)
  manifest_strategy (OD-M1):
    - default = per-IU envelope (preserves validated +15 row delta invariant)
    - document-level manifest is an alternative requiring invariant re-derivation
      => must NOT be silently adopted; escalated

6. Leaf-IU definition (OD-G2)

leaf_iu_question:
  - For this fixture, is the leaf a Điều, a sub-bullet inside a Điều, a
    Nguyên tắc, or a status-marked clause?
  impact:
    - drives volume estimate (handoff guessed 300..500 leaf IUs for clause/point
      grammar; this fixture's Điều-level count is ~45 + 15 principles + infra A/B/C,
      far fewer at Điều granularity, far more if sub-bullets are leaves)
    - drives governance row volume (+15 per leaf IU under per-IU manifest)
  ruling_needed: GPT/User must fix leaf granularity before volume estimate is trusted
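
The counts quoted above can be turned into a quick estimate for the Điều-granularity case. This is arithmetic on the numbers already stated (Điều 0..44, 15 principles, infra sections A/B/C, +15 governance rows per leaf IU), assuming every one of those nodes is a leaf; the real figure depends on the OD-G2 ruling. The function and dictionary names are hypothetical.

```python
ROWS_PER_LEAF_IU = 15  # per-IU manifest delta stated in this design


def estimate(leaf_counts):
    """Return (leaf_iu_total, governance_row_total) for a leaf tally."""
    total = sum(leaf_counts.values())
    return total, total * ROWS_PER_LEAF_IU


# Assumption: each listed node is itself a leaf (no sub-bullet leaves).
dieu_granularity = {
    "dieu": len(range(0, 45)),   # Điều 0..44 inclusive
    "nguyen_tac": 15,
    "infra_sections": 3,         # A, B, C
}

leaves, rows = estimate(dieu_granularity)
print(leaves, rows)  # 63 945
```

Under these assumptions the fixture yields 63 leaf IUs and 945 governance rows, well below the handoff's 300..500 guess; choosing sub-bullets as leaves would push both numbers up.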

7. Anti-hardcoding (binding)

no_hardcode:
  - grammar levels/matchers: grammar_profile registry rows, not inline regex constants
  - docprefix: derived from source_document_ref, never a literal
  - status marker semantics: profile config, not inline "if '✅' in line"
  - leaf rule: profile config, not hardcoded "Điều == leaf"

8. Open decisions

open_decisions:
  OD-G1: confirm fixture grammar profile = incomex-architecture-constitution-v4
         (NOT national 2013 Chương/Khoản/Điểm) — escalated
  OD-G2: leaf-IU granularity for this document — escalated
  OD-A1: canonical_address namespacing (docprefix scheme) — escalated
  OD-G3: handling of living-document status markers (✅/📋) in address vs metadata
  OD-G4: review-queue tooling/owner for ambiguous segments

9. Do not run yet

No grammar detection run, no parsing, no IU production, no canonical_address write, no alias write, no cut. Design only. Forbidden list = master plan §10.


10. Git

git: { branch: main, HEAD: e93424b5ff7fa5e4b8406131977ce4339cd0856a,
       status_short_iu_cutter: clean, code_changed: false, commit_made: false }