dot-iu-cutter v0.5 — Canonicalization & Address Grammar Design (DESIGN ONLY) (2026-05-17)
Date: 2026-05-17
Phase: v0_5_constitution_hardtest_and_information_unit_factory_master_plan
Nature: DESIGN ONLY. No parser runs, no IU is produced, no cut.
Parent: dot-iu-cutter-v0.5-constitution-hardtest-master-plan-2026-05-17.md
1. The grammar problem (corrected by grounding)
The handoff assumed a national-Constitution grammar:
Chương (chapter) / Điều (article) / Khoản (clause) / Điểm (point) / đoạn (paragraph).
The configured fixture is instead the internal
Incomex Architecture Constitution v4.6.3, whose actual hierarchy is:
observed_hierarchy:
  L0_preamble: "Văn bản tối cao..." ("Supreme document..."; no Lời nói đầu / preamble in classic form)
  L1_principles: "15 Nguyên tắc Nền tảng" (15 Foundational Principles, numbered 1..15)
  L1_infra: "KIẾN TRÚC HẠ TẦNG" (Infrastructure Architecture) with sections A, B, C
  L1_law_index: "Mục lục Luật" (Law Index) -> Điều 0..44, each with {Tên, File, Ghi chú} (name, file, notes)
  status_markers: ✅ ENACTED | 📋 CONTROLLED DRAFT inline on nodes
  NOT_present: Chương / Khoản / Điểm classic tree
Therefore: grammar must be detected and selected from a grammar-profile
registry, never hardcoded. A single hardcoded Chương/Điều/Khoản/Điểm parser
would mis-cut this document. This is open decision OD-G1 and risk R-GRM-1.
2. Grammar-profile registry (design — NOT created)
grammar_profile:
  grammar_profile_ref: text PK   # e.g. "vn-national-constitution-2013",
                                 #      "incomex-architecture-constitution-v4"
  level_definitions:             # ordered, config-driven
    - level: name (e.g. NGUYEN_TAC | KIEN_TRUC_SECTION | DIEU | KHOAN | DIEM | DOAN)
      matcher_kind: regex|structural|heading-rule (config, not inline literal)
      matcher_ref: pointer to matcher config row
      numbering_scheme: arabic|roman|letter|none
      is_leaf_candidate: boolean
  status_marker_rules:
    - marker: "✅" => enacted
    - marker: "📋" => controlled_draft (eligible_for_cut depends on enacted_only_policy)
  address_template_ref: pointer to address grammar (see §4)
Profiles are data. The Constitution hardtest ships a new profile
incomex-architecture-constitution-v4; the national 2013 profile remains
available but is not the fixture profile.
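To make "profiles are data" concrete, the following is a minimal sketch of how a registry row could be modelled in Python. It is illustrative only and creates nothing: the class names, the REGISTRY dict, the register helper, and every matcher_ref / address_template_ref value are hypothetical, and the is_leaf_candidate values are placeholders because leaf granularity is still open (OD-G2).

```python
from dataclasses import dataclass
from typing import Dict, Literal, Tuple


@dataclass(frozen=True)
class LevelDefinition:
    level: str  # e.g. "NGUYEN_TAC", "DIEU"
    matcher_kind: Literal["regex", "structural", "heading-rule"]
    matcher_ref: str  # pointer to a matcher config row, never an inline pattern
    numbering_scheme: Literal["arabic", "roman", "letter", "none"]
    is_leaf_candidate: bool


@dataclass(frozen=True)
class GrammarProfile:
    grammar_profile_ref: str
    level_definitions: Tuple[LevelDefinition, ...]  # ordered
    status_marker_rules: Dict[str, str]  # marker -> status
    address_template_ref: str


# Hypothetical in-memory stand-in for the registry table.
REGISTRY: Dict[str, GrammarProfile] = {}


def register(profile: GrammarProfile) -> None:
    """Add a profile; refuse to overwrite an existing ref silently."""
    if profile.grammar_profile_ref in REGISTRY:
        raise ValueError(f"duplicate profile: {profile.grammar_profile_ref}")
    REGISTRY[profile.grammar_profile_ref] = profile


register(
    GrammarProfile(
        grammar_profile_ref="incomex-architecture-constitution-v4",
        level_definitions=(
            # matcher_ref values are made up for illustration.
            # is_leaf_candidate values are placeholders pending OD-G2.
            LevelDefinition("NGUYEN_TAC", "heading-rule", "matcher/nguyen-tac", "arabic", True),
            LevelDefinition("KIEN_TRUC_SECTION", "heading-rule", "matcher/kien-truc", "letter", True),
            LevelDefinition("DIEU", "structural", "matcher/dieu-index-row", "arabic", True),
        ),
        status_marker_rules={"✅": "enacted", "📋": "controlled_draft"},
        address_template_ref="address-template/icx-const",  # hypothetical
    )
)
```

Because the levels live in data, switching to the national 2013 grammar would mean registering a second profile rather than editing parser code.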
3. Canonicalization stages
canon_pipeline:
  C0_input: parser-ready normalized content + source_span table (ingestion doc)
  C1_grammar_detect:
    do: choose grammar_profile_ref (registry rule on doc signals); fail closed
        if ambiguous -> human review queue (no silent default)
  C2_hierarchy_extract:
    do: walk content per profile level_definitions, build node tree
        node = {level, ordinal, raw_heading, status_marker, span_ref}
  C3_address_derivation:
    do: assign deterministic canonical_address per node (see §4)
  C4_iu_id_derivation:
    do: leaf nodes -> IU; iu_id = deterministic(document_version_id, canonical_address)
        entry_id = deterministic(iu_id, manifest_strategy) (see §5)
  C5_text_normalization:
    do: canonical text = trimmed, NFC, internal-whitespace-collapsed,
        diacritics + status tokens preserved; store canonical_text_checksum
  C6_provenance_link:
    do: bind every IU to source_span.span_id + span_checksum (reject if missing)
  C7_grammar_validation:
    do: assert tree well-formed per profile (no orphan level, monotone ordinals);
        violations -> review queue, NOT auto-fixed
  C8_review_queue:
    do: ambiguous/draft/violating segments queued for human/GPT decision
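Two of the stages above are mechanical enough to sketch. The block below illustrates C5 (trim, NFC, whitespace collapse, checksum) and the ordinal part of C7, under stated assumptions: the function names are hypothetical, SHA-256 is assumed because the design does not name a hash, "monotone" is read as strictly increasing, and ordinals are assumed to be already normalized to integers. Violations are reported, not repaired, in line with C7.

```python
import hashlib
import re
import unicodedata
from typing import List

# Python's \s on str also matches Unicode whitespace such as NBSP.
_WHITESPACE_RUN = re.compile(r"\s+")


def canonical_text(raw: str) -> str:
    """C5: NFC-compose, collapse whitespace runs to one space, trim the ends.

    NFC keeps diacritics (it only recomposes them), and status tokens such
    as the enacted/draft markers pass through untouched.
    """
    composed = unicodedata.normalize("NFC", raw)
    return _WHITESPACE_RUN.sub(" ", composed).strip()


def canonical_text_checksum(raw: str) -> str:
    """Checksum of the canonical form. SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(canonical_text(raw).encode("utf-8")).hexdigest()


def ordinal_violations(ordinals: List[int]) -> List[int]:
    """C7 (partial): indexes of siblings whose ordinal does not increase.

    Returns the offending positions for the review queue; nothing is fixed.
    """
    return [i for i in range(1, len(ordinals)) if ordinals[i] <= ordinals[i - 1]]
```

Two inputs that differ only in Unicode composition or spacing yield the same checksum, which is what lets a re-ingest of unchanged content be recognised as unchanged.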
4. Stable canonical_address grammar
canonical_address:
  goal: globally unique, deterministic, human-legible, stable across re-ingest
        of the same document_version
  shape (proposal, OD-A1):
    "<DOCPREFIX>/<L1>/<L2>/.../<Lk>"
  docprefix:
    - derived from source_document_ref, NOT a literal (e.g. ICX-CONST)
    - guarantees no collision with existing DIEU_28/32/35 addresses (e.g. D38-DIEU28-...)
  examples (illustrative, profile incomex-architecture-constitution-v4):
    - "ICX-CONST/NT-12"    # Nguyên tắc 12
    - "ICX-CONST/KT-A"     # Kiến trúc Hạ tầng section A
    - "ICX-CONST/DIEU-44"  # Điều 44 (📋 controlled draft -> may be cut-excluded)
  rules:
    - ordinals come from the source numbering; the address keeps the source form
      (e.g. roman) for legibility, the normalized arabic value is recorded
      alongside, and uniqueness comes from the full path
    - address NEVER encodes volatile state (status marker is metadata, not address)
    - re-ingest of same content_checksum => identical addresses (determinism A3)
    - address is namespace-prefixed so cross-document uniqueness holds (coexistence)
canonical_address_alias exists in schema but alias writes remain forbidden;
aliasing policy is deferred (master plan §10).
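A minimal sketch of the address derivation follows, using the separators seen in the illustrative examples ("/" between path segments, "-" inside a segment). The function name and the (level_code, ordinal) path representation are hypothetical, and the docprefix is taken as an input because the rule that derives it from source_document_ref is not specified here (OD-A1).

```python
from typing import List, Tuple


def canonical_address(docprefix: str, path: List[Tuple[str, str]]) -> str:
    """Build an address from a docprefix and a root-to-node path.

    Each path item is (level_code, ordinal_as_written_in_source).
    The result depends only on its arguments, so the same node always
    yields the same address. Status markers are deliberately not an input.
    """
    if not docprefix:
        raise ValueError("docprefix must be derived from source_document_ref")
    if not path:
        raise ValueError("an address needs at least one level segment")
    segments = []
    for level_code, ordinal in path:
        if not level_code or not ordinal:
            raise ValueError("empty level code or ordinal")
        if "/" in level_code or "/" in ordinal:
            # "/" is the segment separator in this sketch; reject rather than escape.
            raise ValueError("segment parts must not contain '/'")
        segments.append(f"{level_code}-{ordinal}")
    return "/".join([docprefix, *segments])


print(canonical_address("ICX-CONST", [("NT", "12")]))    # ICX-CONST/NT-12
print(canonical_address("ICX-CONST", [("KT", "A")]))     # ICX-CONST/KT-A
print(canonical_address("ICX-CONST", [("DIEU", "44")]))  # ICX-CONST/DIEU-44
```

Keeping the status marker out of the function signature is one way to enforce the rule that volatile state never reaches the address.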
5. Stable IU id / entry id
identity:
  iu_id    = sha-derived(document_version_id, canonical_address)   # content-stable
  entry_id = sha-derived(iu_id, manifest_strategy_marker)          # ledger key
  properties:
    - same version + same address => same iu_id (idempotent re-cut guard, A6)
    - different document_version_id (source changed) => new iu_id (no silent overwrite)
    - DIEU_28 IU vs Constitution IU cannot collide (docprefix in address -> in iu_id)
  manifest_strategy (OD-M1):
    - default = per-IU envelope (preserves validated +15 row delta invariant)
    - document-level manifest is an alternative requiring invariant re-derivation
      => must NOT be silently adopted; escalated
6. Leaf-IU definition (OD-G2)
leaf_iu_question:
  - For this fixture, is the leaf a Điều, or a sub-bullet inside a Điều, or a
    Nguyên tắc, or a status-marked clause?
  impact:
    - drives volume estimate (handoff guessed 300..500 leaf IUs for clause/point
      grammar; this fixture's Điều-level count is ~45 + 15 principles + infra A/B/C,
      far fewer at Điều granularity, far more if sub-bullets are leaves)
    - drives governance row volume (+15 per leaf IU under per-IU manifest)
  ruling_needed: GPT/User must fix leaf granularity before volume estimate is trusted
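The arithmetic behind that impact can be worked through as scenarios. This is not a ruling on OD-G2: it assumes one leaf per Điều, per Nguyên tắc, and per infra section for the coarse case, treats Điều 0..44 as exactly 45 entries, and applies the +15 governance rows per leaf IU stated above. The sub-bullet scenario is left out because no count for it is given.

```python
ROWS_PER_LEAF_IU = 15  # "+15 per leaf IU under per-IU manifest"

# Coarse scenario: every top-level node is a leaf (assumption, pending OD-G2).
dieu_count = len(range(0, 45))   # Điều 0..44 inclusive -> 45
principle_count = 15             # Nguyên tắc 1..15
infra_section_count = 3          # sections A, B, C

coarse_leaves = dieu_count + principle_count + infra_section_count
coarse_rows = coarse_leaves * ROWS_PER_LEAF_IU

# Handoff guess for a clause/point grammar: 300..500 leaf IUs.
handoff_rows = (300 * ROWS_PER_LEAF_IU, 500 * ROWS_PER_LEAF_IU)

print(coarse_leaves, coarse_rows)  # 63 945
print(handoff_rows)                # (4500, 7500)
```

The gap between roughly 945 rows and 4500..7500 rows is why the estimate cannot be trusted until leaf granularity is fixed.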
7. Anti-hardcoding (binding)
no_hardcode:
  - grammar levels/matchers: grammar_profile registry rows, not inline regex constants
  - docprefix: derived from source_document_ref, never a literal
  - status marker semantics: profile config, not inline "if '✅' in line"
  - leaf rule: profile config, not hardcoded "Điều == leaf"
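As an illustration of the status-marker rule, the sketch below resolves a marker through a rules mapping passed in from profile config, so no marker literal appears in the logic. The function names are hypothetical, and two behaviours are assumptions of this sketch, not part of the design: conflicting markers raise an error, and an unknown status fails closed.

```python
from typing import Mapping, Optional


def resolve_status(raw_heading: str, marker_rules: Mapping[str, str]) -> Optional[str]:
    """Map a heading to a status using rules supplied by the grammar profile.

    Returns None when no configured marker is present.
    """
    hits = {status for marker, status in marker_rules.items() if marker in raw_heading}
    if len(hits) > 1:
        # Assumption: conflicting markers are a review-queue case, never auto-resolved.
        raise ValueError(f"conflicting status markers in heading: {raw_heading!r}")
    return next(iter(hits), None)


def eligible_for_cut(status: Optional[str], enacted_only_policy: bool) -> bool:
    """Enacted nodes are eligible; drafts depend on enacted_only_policy."""
    if status == "enacted":
        return True
    if status == "controlled_draft":
        return not enacted_only_policy
    # Assumption: anything else fails closed instead of defaulting.
    raise ValueError(f"unrecognised status: {status!r}")


# The mapping would come from status_marker_rules in the selected profile.
rules = {"✅": "enacted", "📋": "controlled_draft"}
status = resolve_status("Điều 44 📋", rules)
print(status, eligible_for_cut(status, enacted_only_policy=True))  # controlled_draft False
```

A different profile could supply different markers or statuses without any change to these functions.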
8. Open decisions
open_decisions:
  OD-G1: confirm fixture grammar profile = incomex-architecture-constitution-v4
         (NOT national 2013 Chương/Khoản/Điểm); escalated
  OD-G2: leaf-IU granularity for this document; escalated
  OD-A1: canonical_address namespacing (docprefix scheme); escalated
  OD-G3: handling of living-document status markers (✅/📋) in address vs metadata
  OD-G4: review-queue tooling/owner for ambiguous segments
9. Do not run yet
No grammar detection run, no parsing, no IU production, no canonical_address write, no alias write, no cut. Design only. Forbidden list = master plan §10.
10. Git
git: { branch: main, HEAD: e93424b5ff7fa5e4b8406131977ce4339cd0856a,
status_short_iu_cutter: clean, code_changed: false, commit_made: false }