KB-1958
dot-iu-cutter v0.5 — Constitution Nuxt Authoritative Extraction + Normalization Design (raw vs normalized checksum, include/exclude, failure conditions)
Revision 1
Tags: dot-iu-cutter, v0.5, constitution-fixture, nuxt-parser-checksum, extraction, normalization, checksum-design, b6, design-only, dieu44, 2026-05-18
Phase: v0_5_constitution_nuxt_parser_checksum_ratification · Nature: design_only__no_execution · Date: 2026-05-18 · doc 3 of 5
dml: none ; source_seed: none ; dry_run: none ; cut: none ; verify: none
This is a SPECIFICATION to be ratified (OD-SR2), not an executed transform.
decision_authority: GPT / User ONLY ; self_advance: PROHIBITED
Derived from doc 1 guarantees (G1–G5) and doc 2 grounded evidence. Resolves the OD-SR2 design content that GPT flagged as blocking in the seed-authoring review.
1. Authoritative content container rule
rule: authoritative content = the server-rendered (SSR) <article> element of the
Constitution route, or equivalently <main> (doc 2 §5 showed both yield identical normalized text).
rationale:
- present and stable in raw SSR HTML (no client JS needed) -> reproducible offline (G5)
- normalized hash identical across 3 same-session re-renders despite raw churn (G1/G3)
- it is a complete coherent document: H1 title -> ... -> CHANGELOG -> backlink
selection (proposed, OD-SR2 to ratify):
outer: first <article> ... </article> on the Constitution route
inner authoritative span: from the H1 line "HIẾN PHÁP KIẾN TRÚC HỆ THỐNG INCOMEX — v4.6.3 BAN HÀNH"
THROUGH the end of CHANGELOG, EXCLUDING the trailing "Back to Knowledge Hub …" backlink
(= doc 2 candidate_B; sha256 f9d22d05… stable x3)
fallback_if_absent: if no <article>/<main> with the H1 anchor -> FAIL_NO_SPAN (see §6),
do NOT fall back to raw bytes, do NOT fabricate.
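The selection rule above can be sketched as follows. This is a minimal illustration, not the ratified extractor: the names `outer_container`, `inner_span`, and `NoSpanError` are hypothetical, the regex stands in for a real HTML parser and does not handle nested containers, and it assumes the H1 anchor appears literally (not entity-encoded or split by inline tags) in the SSR markup. Cutting at the last line that starts with "Back to Knowledge Hub" is used as a proxy for "end of CHANGELOG".

```python
import re

H1_ANCHOR = "HIẾN PHÁP KIẾN TRÚC HỆ THỐNG INCOMEX — v4.6.3 BAN HÀNH"
BACKLINK_PREFIX = "Back to Knowledge Hub"


class NoSpanError(Exception):
    """Signals FAIL_NO_SPAN: no raw-byte fallback, no fabricated content."""


def outer_container(html: str) -> str:
    """Return the inner HTML of the first <article> (else <main>) that holds the H1 anchor."""
    for tag in ("article", "main"):
        match = re.search(rf"<{tag}\b[^>]*>(.*?)</{tag}>", html, re.DOTALL | re.IGNORECASE)
        if match and H1_ANCHOR in match.group(1):
            return match.group(1)
    raise NoSpanError("FAIL_NO_SPAN")


def inner_span(lines: list[str]) -> list[str]:
    """Keep lines from the H1 anchor up to, but excluding, the trailing backlink."""
    starts = [i for i, line in enumerate(lines) if H1_ANCHOR in line]
    if not starts:
        raise NoSpanError("FAIL_NO_SPAN")
    start = starts[0]
    end = len(lines)
    # Search backwards so only the trailing backlink line cuts the span.
    for i in range(len(lines) - 1, start, -1):
        if lines[i].startswith(BACKLINK_PREFIX):
            end = i
            break
    return lines[start:end]
```

`inner_span` is meant to run on detagged text lines, so the top breadcrumb falls before the anchor and is dropped without a separate rule.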
2. Include / Exclude rules
INCLUDE (part of content identity):
- H1 title line incl. version label "v4.6.3 BAN HÀNH" ("promulgated"; document identity)
- "Văn bản tối cao …" ("Supreme document …") preamble
- 15 NGUYÊN TẮC NỀN TẢNG (15 foundational principles; status-bearing principle nodes)
- KIẾN TRÚC HẠ TẦNG (infrastructure architecture; A/B/C sections)
- MỤC LỤC LUẬT (law index incl. its status legend ✅/📋/📝/⛔ and Điều (Article) entries)
- 2 CHIỀU QUẢN LÝ (two management dimensions)
- THUẬT NGỮ (glossary)
- CHANGELOG (authoritative: it is part of the document body, drives living-document
drift, and is the operator's change ledger -> INCLUDE; subject to DECISION_FLAG_CL1 below)
- all status markers ✅ 📋 📝 ⛔ exactly as rendered
EXCLUDE (Nuxt chrome / non-document / volatile):
- <head>, <script>, <style>, window.__NUXT__, <script id="__NUXT_DATA__"> hydration JSON
(this is the proven raw-volatility locus, doc 2 §3)
- top breadcrumb "Knowledge / Phát triển / Laws / …" (navigation chrome)
- trailing "Back to Knowledge Hub knowledge/dev/laws/constitution.md" backlink (chrome)
- site header/footer/nav/aside portal shell ("Incomex AI Portal")
- Directus/asset URLs, build-hash query strings
- any renderer-injected timestamp (page render time) — NOT document content
(document-authored dates inside CHANGELOG are INCLUDED; renderer clock is EXCLUDED)
DECISION_FLAG_CL1: CHANGELOG inclusion is authoritative-by-default here (it IS the
document's own change ledger). GPT/User may rule CHANGELOG excluded if it is deemed
renderer-generated rather than document-authored. Recommended: INCLUDE.
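The markup-level EXCLUDE rules can be illustrated with the standard-library `html.parser`. This is a sketch under stated assumptions, not the ratified parser: `VisibleText`, `visible_text`, `DROPPED_TAGS`, and `BLOCK_TAGS` are hypothetical names; it assumes `header`, `footer`, `nav`, and `aside` occur only as portal shell (never inside the authoritative span); and emitting a newline at block-tag boundaries is an illustrative choice that this design does not specify.

```python
from html.parser import HTMLParser

# Assumption: these elements are Nuxt chrome or hydration payload, never document body.
DROPPED_TAGS = {"head", "script", "style", "nav", "header", "footer", "aside"}
# Assumption: block-level boundaries become line breaks in the extracted text.
BLOCK_TAGS = {"p", "div", "li", "tr", "br", "h1", "h2", "h3", "h4", "h5", "h6"}


class VisibleText(HTMLParser):
    """Collects text outside dropped subtrees; entities are unescaped in a single pass."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS and self._skip_depth == 0:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in DROPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS and self._skip_depth == 0:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```

Text-level chrome (the breadcrumb and the backlink) is not handled here; under this design it is removed by the span boundaries in §1.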
3. Normalization pipeline (deterministic, ordered)
N1 decode: UTF-8 ; replace undecodable bytes deterministically (errors=replace)
N2 isolate: extract authoritative span (§1); drop <script>/<style> subtrees first
N3 detag: remove remaining HTML tags -> text; HTML-entity unescape (single pass)
N4 unicode: Unicode NFC ; strip U+FEFF BOM
N5 newlines: CRLF -> LF ; lone CR -> LF
N6 hspace: collapse runs of [ \t\f\v] to a single SPACE
N7 trim: strip leading/trailing space per line
N8 vspace: collapse runs of blank lines to a single \n ; trim doc ends
N9 preserve: invariant over the preceding steps, no further transformation:
Vietnamese diacritics UNCHANGED (NFC-composed, no transliteration);
status markers preserved by EXACT codepoint (see §4): never stripped,
never altered by N4 (they are NFC-stable), never emoji-variation-normalized
order_is_normative: N1..N9 applied in this exact order (different order => different hash)
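The text-level steps of the pipeline can be sketched in Python as below. This is an illustration under assumptions, not the ratified implementation: `n1_decode` and `n4_to_n8` are hypothetical names, N2/N3 (span isolation and detagging) are assumed to have already produced the input string, and N8 is read here as "any run of consecutive newlines becomes one newline", which is one possible interpretation of the rule.

```python
import re
import unicodedata


def n1_decode(raw: bytes) -> str:
    """N1: decode as UTF-8; undecodable bytes become U+FFFD deterministically."""
    return raw.decode("utf-8", errors="replace")


def n4_to_n8(text: str) -> str:
    """Apply N4..N8 in the normative order to already-detagged text."""
    # N4: Unicode NFC, then strip U+FEFF
    text = unicodedata.normalize("NFC", text).replace("\ufeff", "")
    # N5: CRLF -> LF, then lone CR -> LF
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # N6: collapse runs of horizontal whitespace to a single space
    text = re.sub(r"[ \t\f\v]+", " ", text)
    # N7: strip leading/trailing space per line
    text = "\n".join(line.strip(" ") for line in text.split("\n"))
    # N8 (assumed reading): collapse newline runs to one newline, trim document ends
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip("\n")
```

For example, `n4_to_n8("\ufeffCafe\u0301  \t x ✅\r\n\r\n\r\n  line two \r")` yields `"Café x ✅\nline two"`; the function is idempotent, so re-normalizing already normalized text leaves it unchanged.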
4. Unicode / emoji status-marker handling (QG5 — codepoint-exact)
markers_by_exact_codepoint (must survive N1..N9 unchanged):
"✅" U+2705 utf8 e2 9c 85 -> enacted (LIVE map)
"📋" U+1F4CB utf8 f0 9f 93 8b -> controlled_draft (LIVE map)
"📝" U+1F4DD utf8 f0 9f 93 9d -> draft (LIVE map; added B1)
"⛔" U+26D4 utf8 e2 9b 94 -> obsolete (LIVE map; added B1)
rules:
- NFC (N4) does NOT decompose these codepoints (verified: NFC-stable) -> safe
- do NOT strip variation selectors if present; hash the bytes as received post-NFC
- do NOT map emoji to text aliases before hashing
- marker counts are an integrity probe, not the identity (identity = full-span sha256)
grounded_evidence: doc 2 §5 — counts (✅19 📋1 📝1 ⛔1 in SSR article) identical across
all 3 fetches; markers passed through normalization byte-stable.
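The marker table and the count probe can be checked with a short Python sketch. The names `STATUS_MARKERS`, `marker_counts`, and `is_nfc_stable` are hypothetical; the codepoints and status labels are the ones listed above, and the probe is an integrity check only, not the identity.

```python
import unicodedata

# Codepoint -> status, as listed in the marker table above.
STATUS_MARKERS = {
    "\u2705": "enacted",
    "\U0001F4CB": "controlled_draft",
    "\U0001F4DD": "draft",
    "\u26D4": "obsolete",
}


def marker_counts(text: str) -> dict[str, int]:
    """Count each status marker by exact codepoint (integrity probe, not identity)."""
    return {status: text.count(marker) for marker, status in STATUS_MARKERS.items()}


def is_nfc_stable(marker: str) -> bool:
    """True when NFC normalization leaves the marker unchanged."""
    return unicodedata.normalize("NFC", marker) == marker
```

Comparing `marker_counts` before and after normalization gives a cheap signal that no marker was lost or rewritten along the way.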
5. raw_fetch_checksum vs normalized_content_checksum (QG3)
raw_fetch_checksum:
def: sha256(exact raw response bytes of a single controlled GET, redirects=0)
property: NON-DETERMINISTIC for this source (doc 2 §3 proved fetch3 ≠ fetch1/2)
role: FORENSIC / AUDIT ONLY — never an identity, never a drift signal
storage: source_document_version_registry.provenance->>'raw_checksum'
(no raw_checksum column live — MISMATCH-5)
normalized_content_checksum (THE persisted identity):
def: sha256( normalize_N1..N9( extract_authoritative_span( raw ) ) )
property: STABLE across cosmetic Nuxt re-render (doc 2 §5: identical x3)
role: the version identity + the "Cắt Hiến pháp" ("Cut the Constitution") drift signal (G1/G2/G3)
storage: source_document_version_registry.content_checksum (NOT NULL)
candidate_value_under_proposed_span (OD-SR2 candidate_B, evidence only — NOT to persist
until ratified): f9d22d0571fa296cbc8e308c46acde93804ffcfb4a19a2e7f55dabd8657d1689
document_version_id (deterministic PK, unchanged from ratification plan):
"icxconst-" || left( sha256_hex( content_checksum || '|' || 'incomex-constitution' ), 32 )
-> pure f(content_checksum, source_document_ref); idempotent under
live UNIQUE(source_document_ref, content_checksum)
drift_handling: new content -> new content_checksum -> new document_version_id;
provenance->>'supersedes_version_id' records the chain (no supersedes column live).
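The two checksums and the deterministic key can be sketched with `hashlib`. This is an illustration under assumptions: the function names are hypothetical, the normalized text is assumed to be hashed as its UTF-8 bytes, and `content_checksum` is assumed to be the lowercase hex digest concatenated as text, mirroring the `||` expression above.

```python
import hashlib

SOURCE_DOCUMENT_REF = "incomex-constitution"


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def raw_fetch_checksum(raw_response: bytes) -> str:
    """Forensic/audit value only: hash of the exact response bytes."""
    return sha256_hex(raw_response)


def normalized_content_checksum(normalized_text: str) -> str:
    """Persisted identity: hash of the normalized authoritative span (UTF-8 assumed)."""
    return sha256_hex(normalized_text.encode("utf-8"))


def document_version_id(content_checksum: str,
                        source_document_ref: str = SOURCE_DOCUMENT_REF) -> str:
    """Pure function of (content_checksum, source_document_ref)."""
    digest = sha256_hex(f"{content_checksum}|{source_document_ref}".encode("utf-8"))
    return "icxconst-" + digest[:32]
```

Because the key depends only on its two inputs, re-running the derivation for unchanged content yields the same `document_version_id`, which is what makes the insert idempotent under the UNIQUE constraint.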
6. Failure conditions (fail-closed; map to doc 1 BLOCKED/FAIL)
FAIL_NO_SPAN: no <article>/<main> with H1 anchor -> BLOCKED (no fabrication, no raw fallback)
FAIL_MARKER_LOSS: post-normalization marker codepoint count anomaly / unknown marker -> BLOCKED
FAIL_HTTP: non-200, redirect>0, or content-type not text/html -> BLOCKED
FAIL_NONDETERMINISM: two controlled fetches under the ratified profile yield different
normalized_content_checksum -> BLOCKED (profile not sound for this source)
DRIFT (not a failure): normalized checksum differs from REGISTERED version -> new version
row proposed + routed to review (doc 1 FAIL_DRIFT) — expected for a
living document; never a silent re-cut
governance: never invent a checksum; BLOCKED is always preferred over a guessed PASS.
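The fail-closed decision order can be sketched as below. This is a hypothetical illustration, not the ratified control flow: `http_ok` and `classify` are invented names, the outcome strings reuse the labels above, and the sketch covers only the HTTP, nondeterminism, and drift checks (span and marker checks are assumed to have run earlier).

```python
def http_ok(status: int, redirect_count: int, content_type: str) -> bool:
    """FAIL_HTTP guard: 200 only, zero redirects, text/html media type."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return status == 200 and redirect_count == 0 and media_type == "text/html"


def classify(fetch_checksums: list[str], registered_checksum: str) -> str:
    """Map normalized checksums from controlled fetches to an outcome label."""
    if len(fetch_checksums) < 2:
        # Repeatability cannot be judged from a single fetch.
        raise ValueError("need at least two controlled fetches")
    if len(set(fetch_checksums)) != 1:
        return "FAIL_NONDETERMINISM"  # BLOCKED: profile not sound for this source
    if fetch_checksums[0] != registered_checksum:
        return "DRIFT"  # not a failure: propose a new version row, route to review
    return "PASS"
```

The nondeterminism check runs before the drift check so that an unstable profile is never mistaken for a genuine content change.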
7. Statement
- Authoritative span rule + include/exclude + ordered normalization pipeline defined (QG4); raw vs normalized checksum distinguished (QG3); status markers preserved codepoint-exact with grounded evidence (QG5); failure conditions fail-closed.
- Design only — no source seed/DML/dry-run/cut/verify (QG7). doc 3 of 5; STOP after package → route GPT/User. Self-advance PROHIBITED.
Companion: parser-operational-framing, source-grounding-and-repeatability, parser-profile-and-ruling-request, ratification-report.