KB-12ED

dot-iu-cutter v0.5 — Constitution Nuxt Parser Reference: Algorithm Analysis (+135 divergence localized to N8 vertical-whitespace; Codex vs Claude reconstruction)

Revision 1

Tags: dot-iu-cutter · v0.5 · constitution-fixture · nuxt-parser-reference-impl · algorithm-analysis · divergence-localized · n8-vspace · authoring-only · no-execution · dieu44 · 2026-05-18

Phase: v0_5_constitution_nuxt_parser_reference_implementation_authoring · Nature: analysis_only__no_execution · Date: 2026-05-18 · doc 2 of 5
mutation: none (KB read-only grounding + analysis) ; decision_authority: GPT/User ONLY
self_advance: PROHIBITED

This document recovers the exact algorithm evidence from the ratification, cross-interval, drift-triage, and E1-blocked packages, then localizes the +135-char divergence by controlled experiment.

1. Checksum lineage (KB-SSOT, candidate_B authoritative span)

L1 ratification (Claude prose, 2026-05-18 ~09:02Z):     f9d22d05…d1689 / len 17791 / 19·1·1·1
L2 cross-interval (Claude prose, ~09:17Z, sep session): f9d22d05…d1689 / len 17791 / 19·1·1·1  (stable across interval)
   -> KB article then re-revised ("Revision 44" living-document edit)
L3 Codex canonical seed exec (canonical VPS impl, post-revision): 17660443… / len 17522 / 19·1·1·1  == RATIFIED CANONICAL
L4 drift-triage (Claude prose reconstruction, post-revision):     072983ac… / len 17657 / 19·1·1·1
L5 E1 capture (Claude prose, post-revision, deterministic 3/3):   072983ac… / len 17657 / 19·1·1·1
key: markers ✅19 📋1 📝1 ⛔1 invariant across ALL points -> NOT a normative
  change; the constitution version label "v4.6.3 BAN HÀNH" never bumped.

Two superimposed deltas existed in prior analysis and had to be separated:

delta_1 living-document edit : L1/L2 (17791) -> L3/L4/L5 region (KB "Revision 44")
delta_2 executor divergence  : on the SAME post-revision content,
                               Codex 17522  vs  Claude 17657  = +135 chars
prior_triage_could_not_separate_them: the drift-triage doc explicitly flagged
  its absolute checksums as non-authoritative (prose reimpl ≠ canonical impl)
  and relied on relative span geometry only.
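
The separation of the two deltas can be checked with plain arithmetic on the lengths in the lineage table. The sketch below is illustrative only; the constant and function names are hypothetical, and only the three lengths come from the table.

```python
# Lengths copied from the lineage table above; all names are illustrative.
LEN_PRE_REVISION = 17791  # L1/L2, Claude prose, before the KB re-revision
LEN_CODEX = 17522         # L3, Codex canonical executor, post-revision
LEN_CLAUDE = 17657        # L4/L5, Claude prose reconstruction, post-revision


def executor_delta(claude_len: int, codex_len: int) -> int:
    """delta_2: both executors ran on the SAME post-revision content."""
    return claude_len - codex_len


delta_2 = executor_delta(LEN_CLAUDE, LEN_CODEX)

# A naive L1 -> L3 comparison changes the content AND the executor at once,
# so it cannot attribute the difference to either delta on its own.
mixed_l1_to_l3 = LEN_CODEX - LEN_PRE_REVISION

print(delta_2, mixed_l1_to_l3)
```

Only delta_2 compares like with like (same content, different executor), which is why it is the quantity the controlled experiment in §4 targets.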

2. What was already excluded as the cause (from prior KB evidence)

NOT marker handling     : ✅19 📋1 📝1 ⛔1 identical at every lineage point
NOT span boundary       : candidate_A − candidate_B ≈ 336 (ratif.) vs 339
                          (drift-triage) — stable; breadcrumb/backlink exclusion
                          and H1→end-CHANGELOG boundary are NOT moving
NOT extraction container : single <article>, single </article>, single <main>
                          (re-confirmed live this phase)
NOT raw transport        : raw bytes vary (Nuxt render noise) but normalized is
                          stable per executor -> transport is not the signal
remaining_suspects: the N1..N9 normalization steps where prose admits >1 reading
  — primarily detag block-boundary policy and N8 vertical-whitespace.
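
The marker-invariance argument above amounts to a codepoint count compared across lineage points. A minimal sketch of such a probe follows, assuming the four codepoints named in D-MARKERS (§6); the function names and the dictionary keys are hypothetical.

```python
from collections import Counter

# Codepoints as listed under D-MARKERS; the key names are illustrative.
MARKERS = {
    "check": "\u2705",
    "clipboard": "\U0001F4CB",
    "memo": "\U0001F4DD",
    "no_entry": "\u26D4",
}


def marker_counts(text: str) -> dict:
    """Count each marker by exact codepoint, with no folding or aliasing."""
    per_char = Counter(text)
    return {name: per_char[cp] for name, cp in MARKERS.items()}


def markers_invariant(samples: list) -> bool:
    """True when every sample has the same marker profile."""
    profiles = {tuple(marker_counts(s).values()) for s in samples}
    return len(profiles) <= 1
```

Equal counts only rule markers out as the cause of a length difference; they say nothing about identity, which stays with the checksum.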

3. Live re-grounding this phase (read-only, 3 GETs)

source_url: https://vps.incomexsaigoncorp.vn/knowledge/dev/laws/constitution
transport: HTTP 200 ×3 ; nginx/1.29.5 ; text/html;charset=utf-8 ; x-powered-by Nuxt ;
  cache-control no-cache ; 0 redirects (same transport class as all prior phases)
raw: 1,215,202 bytes ×3 ; raw_sha256 c1c273f2… IDENTICAL 3/3 this session
  (raw happened to be stable this session; still treated forensic-only)
structure: <article> ×1 ; H1 "HIẾN PHÁP KIẾN TRÚC HỆ THỐNG INCOMEX — v4.6.3 BAN
  HÀNH" present ; CHANGELOG present ; trailing "Back to Knowledge Hub" backlink
  in a <footer> inside <article> (single occurrence) -> candidate_B = H1 start →
  start of that <footer>.
finding: live normalized content is byte-identical to Codex's ratified canonical
  (see §5) -> the source has NOT content-drifted since L3; delta_1 settled at the
  canonical value; only delta_2 (executor divergence) remained to explain.
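
The candidate_B span described under "structure" can be sketched with regular expressions over the decoded page. This is an illustration under stated assumptions, not the canonical extractor: the function name, the error messages, and the regex-based approach (rather than an HTML parser) are choices made here, and the H1 pattern is the one given in D-SPAN (§6) without the line anchor.

```python
import re

H1_RE = re.compile(r"<h1[^>]*>\s*HIẾN PHÁP")
FOOTER_OPEN_RE = re.compile(r"<footer\b[^>]*>", re.IGNORECASE)
BACKLINK_TEXT = "Back to Knowledge Hub"


def candidate_b(page: str) -> str:
    """Return the span from the H1 start to the start of the backlink footer."""
    art_open = page.find("<article")
    if art_open == -1:
        raise ValueError("no <article> found")
    inner_start = page.index(">", art_open) + 1
    inner_end = page.index("</article>", inner_start)
    inner = page[inner_start:inner_end]  # first <article> inner content

    h1 = H1_RE.search(inner)
    if h1 is None:
        raise ValueError("constitution H1 not found")

    # Walk the footers after the H1 and stop at the one holding the backlink.
    for footer in FOOTER_OPEN_RE.finditer(inner, h1.start()):
        close = inner.find("</footer>", footer.end())
        body = inner[footer.end(): close if close != -1 else len(inner)]
        if BACKLINK_TEXT in body:
            return inner[h1.start():footer.start()]
    raise ValueError("backlink footer not found")
```

Anything before the H1 (such as a breadcrumb) and the backlink footer itself fall outside the returned span, while the CHANGELOG stays inside it.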

4. Controlled divergence experiment (decisive)

A single fragment (candidate_B span of fetch1) was normalized under the pinned pipeline while varying only one ambiguous step at a time:

R1 pinned: every block tag (open/close) -> "\n"; collapse [ \t\f\v ]+ -> 1 space;
   collapse >=2 blank lines -> 1 blank line
   => len 17819  sha b6ea1672…  mk 19/1/1/1
V2 R1 but DO NOT collapse U+00A0 (nbsp) separately
   => len 17819  sha b6ea1672…  mk 19/1/1/1   (IDENTICAL to R1 -> nbsp NOT a factor here)
V3 R1 but N8 = drop ALL empty lines (single "\n" between content, ZERO blank lines)
   => len 17522  sha 17660443e0f2…  mk 19/1/1/1   ==  RATIFIED CANONICAL (exact)
V4 R1 but detag = only p/br/li/tr/h* -> "\n", other block tags removed (no \n)
   => len 17657  sha 072983ac6cf4…  mk 19/1/1/1   ==  CLAUDE E1 OUTPUT (exact)
V5 R1 but NO tag -> newline at all (all tags stripped to "")
   => len 17634  sha 438c77c3…      mk 19/1/1/1

conclusion (QG2 — divergence source IDENTIFIED, not "insufficient evidence"):
  divergence_step: N8 vertical-whitespace ("collapse blank-line runs to a single \n")
  canonical_semantics (Codex, ratified):
    "single \n" == ONE NEWLINE -> drop every empty line; blocks separated by a
    single \n with NO blank line between them.   (variant V3 -> 17660443/17522)
  claude_E1_semantics (prose misreading):
    a combined block-detag + "one blank line between blocks" reading
    -> +135 chars.   (variant V4 -> 072983ac/17657)
  the +135 is exactly the residual empty-line characters the canonical executor
  removes and the Claude reconstruction kept. Markers/span/NFC/entity/nbsp are
  all confirmed NOT the cause (V2 isolates nbsp as null; markers invariant;
  candidate_A−B stable).
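
The two N8 readings can be compared in isolation on a toy string. The sketch below is illustrative only: the function names are hypothetical, and it does not reproduce the 17522 or 17657 figures, which depend on the real fragment and on the detag policy as well.

```python
import re


def n8_keep_one_blank(text: str) -> str:
    """R1-style reading: any run of blank lines collapses to ONE blank line."""
    stripped = "\n".join(line.strip() for line in text.split("\n"))
    return re.sub(r"\n{3,}", "\n\n", stripped)


def n8_drop_all_blank(text: str) -> str:
    """Canonical (V3) reading: every empty line is dropped."""
    lines = (line.strip() for line in text.split("\n"))
    return "\n".join(line for line in lines if line)


sample = "A\n\n\nB\n \nC\nD"
kept = n8_keep_one_blank(sample)     # "A\n\nB\n\nC\nD"
dropped = n8_drop_all_blank(sample)  # "A\nB\nC\nD"
print(len(kept) - len(dropped))      # one "\n" per retained blank line
```

Each blank line that survives normalization costs exactly one extra "\n", so a length gap between two executors with identical marker counts is consistent with a vertical-whitespace difference.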

5. Source-drift vs parser-mismatch classification

classification: PARSER-IMPLEMENTATION DIVERGENCE (delta_2), fully resolved.
  Source content is NOT drifted vs the ratified canonical: under the canonical
  N8 semantics (V3) the live page reproduces
  17660443e0f23e994e1807cf8e22920951a9e70c598956dbd0e752f4f5cae80c / 17522
  EXACTLY and deterministically 3/3.
  The earlier "living KB-revision drift" (delta_1) had already settled at this
  canonical value before L3; no further content drift has occurred.
raw_byte_variation: forensic-only Nuxt render noise; never the identity signal
  (consistent with ratified raw/normalized split).
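
The classification reasoning reduces to two checksum comparisons made in a fixed order. A minimal sketch follows, with hypothetical label strings and function names, and toy inputs standing in for the real normalized content.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Identity checksum over the UTF-8 bytes of normalized content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def classify(ratified_sha: str, live_sha_canonical_rules: str,
             other_executor_sha: str) -> str:
    """Labels are illustrative; the order of checks mirrors the text above."""
    # Step 1: does the live page, under canonical rules, still match?
    if live_sha_canonical_rules != ratified_sha:
        return "source_drift_suspected"
    # Step 2: source is stable, so any remaining mismatch is the executor's.
    if other_executor_sha != ratified_sha:
        return "parser_implementation_divergence"
    return "no_divergence"


ratified = sha256_hex("A\nB")  # toy stand-in for the ratified content
result = classify(ratified, sha256_hex("A\nB"), sha256_hex("A\n\nB"))
print(result)
```

Raw-byte checksums are deliberately absent from this decision: only normalized content takes part in it.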

6. Pinned decisions required by the reference implementation

D-FETCH   : HTTP GET, redirects=0, expect 200 + text/html; raw bytes captured;
            raw_sha256 = forensic only (never identity)
D-DECODE  : UTF-8, errors='replace'; strip leading U+FEFF
D-SPAN    : first <article>…</article> inner; candidate_B = from <h1> matching
            /^<h1[^>]*>\s*HIẾN PHÁP/ THROUGH end CHANGELOG, end-bounded by the
            start of the <footer> that contains "Back to Knowledge Hub"
            (that backlink footer EXCLUDED); CHANGELOG INCLUDED (GPT R-CL1)
D-DROP    : remove <script>/<style> subtrees and HTML comments first
D-DETAG   : EVERY block-level tag (open OR close OR self) of {p div section
            article header footer nav aside main h1-6 ul ol li dl dt dd table
            thead tbody tfoot tr td th blockquote pre hr br figure figcaption}
            -> "\n"; ALL other (inline) tags -> "" (no separator)
D-ENTITY  : html.unescape, single pass, AFTER detag
D-NFC     : Unicode NFC; strip U+FEFF
D-EOL     : CRLF -> LF, lone CR -> LF
D-HSPACE  : collapse runs of [ \t\f\v] AND U+00A0 -> single ASCII space
D-VSPACE  : **CANONICAL** — strip each line, then DROP ALL EMPTY LINES; join
            remaining lines with a single "\n" (this is the fix vs Claude E1)
D-TRAILNL : NO trailing newline appended (identity = exact joined bytes)
D-MARKERS : U+2705 / U+1F4CB / U+1F4DD / U+26D4 preserved by codepoint; never
            folded/aliased; counts are an integrity probe, not the identity
D-SENTINEL: when later written to the snapshot artifact, identity = sha256 of
            ONLY the bytes between <<<BEGIN-NORMALIZED-CONTENT…/…END…>>> sentinel
            lines (sentinels excluded, no added trailing \n) — equals this
            implementation's normalized_content bytes exactly (artifact-spec doc)
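
The pinned decisions from D-DECODE through D-TRAILNL can be read as one ordered pipeline. The sketch below is an interpretation under stated assumptions, not the canonical VPS implementation: it is regex-based, it takes an already extracted span as input, it reads "strip each line" as Python's str.strip(), and all names are hypothetical.

```python
import hashlib
import html
import re
import unicodedata

BLOCK_TAGS = (
    "p div section article header footer nav aside main h1 h2 h3 h4 h5 h6 "
    "ul ol li dl dt dd table thead tbody tfoot tr td th blockquote pre hr br "
    "figure figcaption"
).split()
DROP_RE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.I | re.S)
COMMENT_RE = re.compile(r"<!--.*?-->", re.S)
BLOCK_RE = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(BLOCK_TAGS), re.I)
ANY_TAG_RE = re.compile(r"<[^>]+>")
HSPACE_RE = re.compile(r"[ \t\f\v\u00a0]+")


def normalize(fragment: bytes) -> str:
    text = fragment.decode("utf-8", errors="replace").lstrip("\ufeff")  # D-DECODE
    text = COMMENT_RE.sub("", DROP_RE.sub("", text))                    # D-DROP
    text = BLOCK_RE.sub("\n", text)          # D-DETAG: block tags -> "\n"
    text = ANY_TAG_RE.sub("", text)          # D-DETAG: inline tags -> ""
    text = html.unescape(text)               # D-ENTITY: single pass, after detag
    text = unicodedata.normalize("NFC", text).replace("\ufeff", "")     # D-NFC
    text = text.replace("\r\n", "\n").replace("\r", "\n")               # D-EOL
    text = HSPACE_RE.sub(" ", text)          # D-HSPACE, U+00A0 included
    lines = (line.strip() for line in text.split("\n"))
    return "\n".join(line for line in lines if line)  # D-VSPACE + D-TRAILNL


def identity(normalized: str) -> str:
    """sha256 over the exact joined bytes, with no trailing newline added."""
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


sample = (b"<h1>Ti&amp;tle</h1>\r\n<div><p>a&nbsp;&nbsp; b <b>c</b></p></div>"
          b"<!-- x --><script>var p=\"<p>\";</script><p>\xe2\x9c\x85 ok</p>")
normalized = normalize(sample)
print(repr(normalized))
```

On the sample, the nested div/p boundaries produce several consecutive newlines after detag, and D-VSPACE removes all of the resulting empty lines, leaving single "\n" separators and no trailing newline.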

7. Statement

+135 divergence localized by controlled experiment to N8 vertical-whitespace semantics; canonical reading reproduced exactly; markers/span/nbsp/entity excluded as causes (QG2 satisfied — source identified, not "insufficient").
Analysis only; nothing executed/persisted (QG1/QG5).
doc 2 of 5; STOP after 5 files → route GPT/User. Self-advance PROHIBITED.

Companions: operational-framing, implementation-draft, test-result, authoring-report.