KB-1958

dot-iu-cutter v0.5 — Constitution Nuxt Authoritative Extraction + Normalization Design (raw vs normalized checksum, include/exclude, failure conditions)

Revision 1

Phase: v0_5_constitution_nuxt_parser_checksum_ratification · Nature: design_only__no_execution · Date: 2026-05-18 · doc 3 of 5

dml: none ; source_seed: none ; dry_run: none ; cut: none ; verify: none
this is a SPECIFICATION to be ratified (OD-SR2), not an executed transform
decision_authority: GPT / User ONLY ; self_advance: PROHIBITED

Derived from doc 1 guarantees (G1–G5) and doc 2 grounded evidence. Resolves the OD-SR2 design content that GPT flagged as blocking in the seed-authoring review.

1. Authoritative content container rule

rule: authoritative content = the server-rendered (SSR) <article> element of the
  Constitution route — equivalently <main> (doc 2 §5 proved identical normalized text).
rationale:
  - present and stable in raw SSR HTML (no client JS needed) -> reproducible offline (G5)
  - normalized hash identical across 3 same-session re-renders despite raw churn (G1/G3)
  - it is a complete coherent document: H1 title -> ... -> CHANGELOG -> backlink
selection (proposed, OD-SR2 to ratify):
  outer: first <article> ... </article> on the Constitution route
  inner authoritative span: from the H1 line "HIẾN PHÁP KIẾN TRÚC HỆ THỐNG INCOMEX — v4.6.3 BAN HÀNH"
    THROUGH the end of CHANGELOG, EXCLUDING the trailing "Back to Knowledge Hub …" backlink
  (= doc 2 candidate_B; sha256 f9d22d05… stable x3)
fallback_if_absent: if no <article>/<main> with the H1 anchor -> FAIL_NO_SPAN (see §6),
  do NOT fall back to raw bytes, do NOT fabricate.
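
The inner-span rule above could be sketched roughly as follows, operating on text that has already been detagged. The names `select_span` and `FailNoSpan` are hypothetical. Matching on the title prefix rather than the full versioned H1 line, and treating the backlink as the marker for the end of CHANGELOG, are assumptions of this sketch, not ratified behaviour.

```python
H1_ANCHOR = "HIẾN PHÁP KIẾN TRÚC HỆ THỐNG INCOMEX"  # title prefix only (assumption)
BACKLINK_ANCHOR = "Back to Knowledge Hub"


class FailNoSpan(Exception):
    """Hypothetical exception standing in for FAIL_NO_SPAN."""


def select_span(article_text: str) -> str:
    """Return the text from the H1 anchor up to, but not including, the backlink."""
    start = article_text.find(H1_ANCHOR)
    if start == -1:
        # Fail closed: no raw-byte fallback, no fabricated span.
        raise FailNoSpan("H1 anchor not found in article text")
    end = article_text.find(BACKLINK_ANCHOR, start)
    if end == -1:
        # Assumption: with no backlink present, the span runs to the end of the text.
        end = len(article_text)
    return article_text[start:end]
```

Anything before the H1 anchor (for example the breadcrumb) is dropped by the slice, so only the trailing backlink needs an explicit end anchor.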

2. Include / Exclude rules

INCLUDE (part of content identity):
  - H1 title line incl. version label "v4.6.3 BAN HÀNH" (document identity)
  - "Văn bản tối cao …" preamble
  - 15 NGUYÊN TẮC NỀN TẢNG (status-bearing principle nodes)
  - KIẾN TRÚC HẠ TẦNG (A/B/C sections)
  - MỤC LỤC LUẬT (law index incl. its status legend ✅/📋/📝/⛔ and Điều entries)
  - 2 CHIỀU QUẢN LÝ
  - THUẬT NGỮ (glossary)
  - CHANGELOG  (authoritative: it is part of the document body, drives living-document
    drift, and is the operator's change ledger -> INCLUDE; see DECISION_FLAG_CL1 below)
  - all status markers ✅ 📋 📝 ⛔ exactly as rendered

EXCLUDE (Nuxt chrome / non-document / volatile):
  - <head>, <script>, <style>, window.__NUXT__, <script id="__NUXT_DATA__"> hydration JSON
    (this is the proven raw-volatility locus, doc 2 §3)
  - top breadcrumb "Knowledge / Phát triển / Laws / …" (navigation chrome)
  - trailing "Back to Knowledge Hub  knowledge/dev/laws/constitution.md" backlink (chrome)
  - site header/footer/nav/aside portal shell ("Incomex AI Portal")
  - Directus/asset URLs, build-hash query strings
  - any renderer-injected timestamp (page render time) — NOT document content
    (document-authored dates inside CHANGELOG are INCLUDED; renderer clock is EXCLUDED)
DECISION_FLAG_CL1: CHANGELOG inclusion is authoritative-by-default here (it IS the
  document's own change ledger). GPT/User may rule CHANGELOG excluded if it is deemed
  renderer-generated rather than document-authored. Recommended: INCLUDE.
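
As an illustration of the EXCLUDE rules, the sketch below collects text from the first `<article>` while skipping excluded subtrees, using only the standard-library `html.parser`. The class name `ArticleText`, the helper `article_text`, and the tag set are hypothetical. It assumes the site header and footer sit outside `<article>` and that articles are not nested.

```python
from html.parser import HTMLParser

# Assumed mapping of the EXCLUDE list onto element names inside <article>.
EXCLUDED_TAGS = {"script", "style", "nav", "aside"}


class ArticleText(HTMLParser):
    """Collect text inside the first <article>, skipping excluded subtrees."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # entities unescaped in one pass
        self.parts: list[str] = []
        self._in_article = False
        self._done = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "article" and not self._in_article and not self._done:
            self._in_article = True
        elif tag in EXCLUDED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self._in_article:
            self._in_article = False
            self._done = True  # only the first <article> counts
        elif tag in EXCLUDED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_article and not self._skip_depth:
            self.parts.append(data)


def article_text(html: str) -> str:
    parser = ArticleText()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)
```

Everything outside `<article>` (including `<head>` and the hydration `<script>`) is never collected, so the tag set only has to cover chrome that could appear inside the container.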

3. Normalization pipeline (deterministic, ordered)

N1 decode:        UTF-8 ; replace undecodable bytes deterministically (errors=replace)
N2 isolate:       extract authoritative span (§1); drop <script>/<style> subtrees first
N3 detag:         remove remaining HTML tags -> text; HTML-entity unescape (single pass)
N4 unicode:       Unicode NFC ; strip U+FEFF BOM
N5 newlines:      CRLF -> LF ; lone CR -> LF
N6 hspace:        collapse runs of [ \t\f\v] to a single SPACE
N7 trim:          strip leading/trailing space per line
N8 vspace:        collapse runs of blank lines to a single \n ; trim doc ends
N9 preserve:      Vietnamese diacritics UNCHANGED (NFC, no transliteration);
                  status markers preserved by EXACT codepoint (see §4): never stripped,
                  never altered by NFC (they are NFC-stable), never emoji-variation-normalized
order_is_normative: N1..N9 applied in this exact order (different order => different hash)
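
A minimal sketch of steps N4 to N8, assuming N1 to N3 have already produced detagged text. The function name `normalize_text` is hypothetical. N8 is read here as "every run of newlines becomes a single newline", which is one possible interpretation and would need to match the ratified profile to reproduce the candidate hash.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Apply N4..N8 in order to already-detagged text."""
    # N4: Unicode NFC, then strip U+FEFF.
    text = unicodedata.normalize("NFC", text).replace("\ufeff", "")
    # N5: CRLF -> LF, then any remaining lone CR -> LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # N6: collapse runs of space, tab, form feed, vertical tab to one space.
    text = re.sub(r"[ \t\f\v]+", " ", text)
    # N7: strip leading/trailing spaces on each line.
    text = "\n".join(line.strip(" ") for line in text.split("\n"))
    # N8 (assumed reading): collapse newline runs to one newline, trim document ends.
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip("\n")
```

Because each step consumes the output of the previous one, reordering them (for example trimming lines before collapsing tabs) can change the resulting bytes and therefore the hash.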

4. Unicode / emoji status-marker handling (QG5 — codepoint-exact)

markers_by_exact_codepoint (must survive N1..N9 unchanged):
  "✅"  U+2705            utf8 e2 9c 85   -> enacted          (LIVE map)
  "📋"  U+1F4CB           utf8 f0 9f 93 8b -> controlled_draft (LIVE map)
  "📝"  U+1F4DD           utf8 f0 9f 93 9d -> draft            (LIVE map; added B1)
  "⛔"  U+26D4            utf8 e2 9b 94   -> obsolete          (LIVE map; added B1)
rules:
  - NFC (N4) does NOT decompose these codepoints (verified: NFC-stable) -> safe
  - do NOT strip variation selectors if present; hash the bytes as received post-NFC
  - do NOT map emoji to text aliases before hashing
  - marker counts are an integrity probe, not the identity (identity = full-span sha256)
grounded_evidence: doc 2 §5 — counts (✅19 📋1 📝1 ⛔1 in SSR article) identical across
  all 3 fetches; markers passed through normalization byte-stable.
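
The marker rules above can be probed with a short check such as the following. The names `MARKERS`, `marker_counts` and `markers_nfc_stable` are hypothetical; the codepoints and status labels are copied from the map above.

```python
import unicodedata

# Codepoint -> status, written as escapes so the source stays codepoint-exact.
MARKERS = {
    "\u2705": "enacted",               # U+2705
    "\U0001F4CB": "controlled_draft",  # U+1F4CB
    "\U0001F4DD": "draft",             # U+1F4DD
    "\u26D4": "obsolete",              # U+26D4
}


def marker_counts(text: str) -> dict[str, int]:
    """Count each status marker; an integrity probe, not the identity."""
    return {status: text.count(cp) for cp, status in MARKERS.items()}


def markers_nfc_stable() -> bool:
    """True if NFC leaves every marker codepoint unchanged."""
    return all(unicodedata.normalize("NFC", cp) == cp for cp in MARKERS)
```

Comparing `marker_counts` before and after normalization is one way to detect marker loss; the full-span sha256 remains the identity.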

5. raw_fetch_checksum vs normalized_content_checksum (QG3)

raw_fetch_checksum:
  def: sha256(exact raw response bytes of a single controlled GET, redirects=0)
  property: NON-DETERMINISTIC for this source (doc 2 §3 proved fetch3 ≠ fetch1/2)
  role: FORENSIC / AUDIT ONLY — never an identity, never a drift signal
  storage: source_document_version_registry.provenance->>'raw_checksum'
           (no raw_checksum column live — MISMATCH-5)

normalized_content_checksum (THE persisted identity):
  def: sha256( normalize_N1..N9( extract_authoritative_span( raw ) ) )
  property: STABLE across cosmetic Nuxt re-render (doc 2 §5: identical x3)
  role: the version identity + the "Cắt Hiến pháp" drift signal (G1/G2/G3)
  storage: source_document_version_registry.content_checksum  (NOT NULL)
  candidate_value_under_proposed_span (OD-SR2 candidate_B, evidence only — NOT to persist
    until ratified): f9d22d0571fa296cbc8e308c46acde93804ffcfb4a19a2e7f55dabd8657d1689

document_version_id (deterministic PK, unchanged from ratification plan):
  "icxconst-" || left( sha256_hex( content_checksum || '|' || 'incomex-constitution' ), 32 )
  -> pure f(content_checksum, source_document_ref); idempotent under
     live UNIQUE(source_document_ref, content_checksum)
drift_handling: new content -> new content_checksum -> new document_version_id;
  provenance->>'supersedes_version_id' records the chain (no supersedes column live).
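
The two checksums and the derived key could be computed along these lines. The function names are hypothetical. The sketch assumes the normalized text is hashed as UTF-8 bytes and that `content_checksum` is the lowercase hex digest; both would need to match the ratified profile.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def raw_fetch_checksum(raw_bytes: bytes) -> str:
    """Forensic only: hash of the exact response bytes."""
    return sha256_hex(raw_bytes)


def normalized_content_checksum(normalized_text: str) -> str:
    """Persisted identity: hash of the normalized span (UTF-8 assumed)."""
    return sha256_hex(normalized_text.encode("utf-8"))


def document_version_id(content_checksum: str,
                        source_ref: str = "incomex-constitution") -> str:
    """Mirror of: 'icxconst-' || left(sha256_hex(checksum || '|' || ref), 32)."""
    digest = sha256_hex(f"{content_checksum}|{source_ref}".encode("utf-8"))
    return "icxconst-" + digest[:32]
```

Since `document_version_id` is a pure function of its two inputs, re-running it for an unchanged `content_checksum` yields the same key, which is what makes inserts idempotent under the UNIQUE constraint.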

6. Failure conditions (fail-closed; map to doc 1 BLOCKED/FAIL)

FAIL_NO_SPAN:        no <article>/<main> with H1 anchor -> BLOCKED (no fabrication, no raw fallback)
FAIL_MARKER_LOSS:    post-normalization marker codepoint count anomaly / unknown marker -> BLOCKED
FAIL_HTTP:           non-200, redirect>0, or content-type not text/html -> BLOCKED
FAIL_NONDETERMINISM: two controlled fetches under the ratified profile yield different
                     normalized_content_checksum -> BLOCKED (profile not sound for this source)
DRIFT (not a failure): normalized checksum differs from REGISTERED version -> new version
                     row proposed + routed to review (doc 1 FAIL_DRIFT) — expected for a
                     living document; never a silent re-cut
governance: never invent a checksum; BLOCKED is always preferred over a guessed PASS.
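
A rough sketch of how the HTTP, nondeterminism and drift conditions might be distinguished in code. The function names and the returned string labels are hypothetical; only the conditions themselves come from the list above.

```python
from typing import Optional


def http_failure(status: int, redirects: int, content_type: str) -> Optional[str]:
    """Return "FAIL_HTTP" if the response violates the fetch rules, else None."""
    # Ignore parameters such as "; charset=utf-8" when comparing the media type.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if status != 200 or redirects > 0 or media_type != "text/html":
        return "FAIL_HTTP"
    return None


def classify_checksums(fetch_a: str, fetch_b: str, registered: str) -> str:
    """Compare two controlled fetches with each other, then with the registry."""
    if fetch_a != fetch_b:
        # BLOCKED: the profile is not sound for this source.
        return "FAIL_NONDETERMINISM"
    if fetch_a != registered:
        # Not a failure: propose a new version row and route it to review.
        return "DRIFT"
    return "PASS"
```

The nondeterminism check runs before the drift check so that an unstable profile is never mistaken for a genuine content change.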

7. Statement

  • Authoritative span rule + include/exclude + ordered normalization pipeline defined (QG4); raw vs normalized checksum distinguished (QG3); status markers preserved codepoint-exact with grounded evidence (QG5); failure conditions fail-closed.
  • Design only — no source seed/DML/dry-run/cut/verify (QG7). doc 3 of 5; STOP after package → route GPT/User. Self-advance PROHIBITED.

Companion: parser-operational-framing, source-grounding-and-repeatability, parser-profile-and-ruling-request, ratification-report.
