KB-12ED
dot-iu-cutter v0.5 — Constitution Nuxt Parser Reference: Algorithm Analysis (+135 divergence localized to N8 vertical-whitespace; Codex vs Claude reconstruction)
Revision 1
dot-iu-cutter · v0.5 · constitution-fixture · nuxt-parser-reference-impl · algorithm-analysis · divergence-localized · n8-vspace · authoring-only · no-execution · dieu44 · 2026-05-18
Phase: v0_5_constitution_nuxt_parser_reference_implementation_authoring · Nature: analysis_only__no_execution · Date: 2026-05-18 · doc 2 of 5
mutation: none (KB read-only grounding + analysis) ; decision_authority: GPT/User ONLY ; self_advance: PROHIBITED
This document recovers the exact algorithm evidence from the ratification, cross-interval, drift-triage, and E1-blocked packages, then localizes the +135-char divergence by controlled experiment.
1. Checksum lineage (KB-SSOT, candidate_B authoritative span)
L1 ratification (Claude prose, 2026-05-18 ~09:02Z): f9d22d05…d1689 / len 17791 / 19·1·1·1
L2 cross-interval (Claude prose, ~09:17Z, separate session): f9d22d05…d1689 / len 17791 / 19·1·1·1 (stable across interval)
-> KB article then re-revised ("Revision 44" living-document edit)
L3 Codex canonical seed exec (canonical VPS impl, post-revision): 17660443… / len 17522 / 19·1·1·1 == RATIFIED CANONICAL
L4 drift-triage (Claude prose reconstruction, post-revision): 072983ac… / len 17657 / 19·1·1·1
L5 E1 capture (Claude prose, post-revision, deterministic 3/3): 072983ac… / len 17657 / 19·1·1·1
key: markers ✅19 📋1 📝1 ⛔1 invariant across ALL points -> NOT a normative
change; the constitution version label "v4.6.3 BAN HÀNH" never bumped.
Two superimposed deltas existed in prior analysis and had to be separated:
delta_1 living-document edit : L1/L2 (17791) -> L3/L4/L5 region (KB "Revision 44")
delta_2 executor divergence : on the SAME post-revision content,
Codex 17522 vs Claude 17657 = +135 chars
prior_triage_could_not_separate_them: the drift-triage doc explicitly flagged
its absolute checksums as non-authoritative (prose reimpl ≠ canonical impl)
and relied on relative span geometry only.
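The separation of the two deltas can be checked with plain arithmetic on the lengths in the lineage table. The sketch below is illustrative only: the variable names are hypothetical, and the same-executor reading of delta_1 assumes that the L1/L2 and L4/L5 Claude runs applied the same prose interpretation, which the lineage suggests but does not prove.

```python
# Lengths copied from the lineage table above.
LEN_L1_L2_CLAUDE = 17791  # pre-revision content, Claude prose reading
LEN_L3_CODEX = 17522      # post-revision content, canonical implementation
LEN_L4_L5_CLAUDE = 17657  # post-revision content, Claude prose reading

# delta_2: same post-revision content, different executor.
delta_2 = LEN_L4_L5_CLAUDE - LEN_L3_CODEX

# delta_1 as seen by one executor (ASSUMPTION: both Claude runs used the
# same reading, so their difference is the content edit alone).
delta_1_same_executor = LEN_L1_L2_CLAUDE - LEN_L4_L5_CLAUDE

# A naive comparison of L1/L2 against L3 mixes both deltas together.
naive_mixed = LEN_L1_L2_CLAUDE - LEN_L3_CODEX

print(delta_2, delta_1_same_executor, naive_mixed)
```

Comparing L1/L2 directly against L3 therefore conflates a content edit with an executor difference, which is why the two had to be measured on separate axes.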
2. What was already excluded as the cause (from prior KB evidence)
NOT marker handling : ✅19 📋1 📝1 ⛔1 identical at every lineage point
NOT span boundary : candidate_A − candidate_B ≈ 336 (ratification) vs 339
(drift-triage), i.e. stable; breadcrumb/backlink exclusion
and H1→end-CHANGELOG boundary are NOT moving
NOT extraction container : single <article>, single </article>, single <main>
(re-confirmed live this phase)
NOT raw transport : raw bytes vary (Nuxt render noise) but normalized is
stable per executor -> transport is not the signal
remaining_suspects: the N1..N9 normalization steps where prose admits >1 reading
— primarily detag block-boundary policy and N8 vertical-whitespace.
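The marker invariant used as the first exclusion above can be probed with a few lines of code. This is a minimal sketch, not the canonical implementation: the function and dictionary names are hypothetical, and it counts raw codepoints only (no variation-selector handling).

```python
# Codepoints taken from the marker list (U+2705, U+1F4CB, U+1F4DD, U+26D4).
MARKERS = {
    "check": "\u2705",
    "clipboard": "\U0001F4CB",
    "memo": "\U0001F4DD",
    "no_entry": "\u26D4",
}

EXPECTED_COUNTS = (19, 1, 1, 1)  # invariant reported at every lineage point


def marker_counts(text: str) -> tuple:
    """Count each marker by exact codepoint, with no folding or aliasing."""
    return tuple(text.count(cp) for cp in MARKERS.values())


def markers_intact(text: str) -> bool:
    """Integrity probe only: matching counts do not establish identity."""
    return marker_counts(text) == EXPECTED_COUNTS
```

A matching count rules out marker handling as a suspect but says nothing about whitespace, which is why the remaining suspects are all normalization steps.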
3. Live re-grounding this phase (read-only, 3 GETs)
source_url: https://vps.incomexsaigoncorp.vn/knowledge/dev/laws/constitution
transport: HTTP 200 ×3 ; nginx/1.29.5 ; text/html;charset=utf-8 ; x-powered-by Nuxt ;
cache-control no-cache ; 0 redirects (same transport class as all prior phases)
raw: 1,215,202 bytes ×3 ; raw_sha256 c1c273f2… IDENTICAL 3/3 this session
(raw happened to be stable this session; still treated forensic-only)
structure: <article> ×1 ; H1 "HIẾN PHÁP KIẾN TRÚC HỆ THỐNG INCOMEX — v4.6.3 BAN HÀNH"
present ; CHANGELOG present ; trailing "Back to Knowledge Hub" backlink
in a <footer> inside <article> (single occurrence) -> candidate_B = H1 start →
start of that <footer>.
finding: live normalized content is byte-identical to Codex's ratified canonical
(see §5) -> the source has NOT content-drifted since L3; delta_1 settled at the
canonical value; only delta_2 (executor divergence) remained to explain.
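The raw/normalized split in this section (raw bytes forensic-only, normalized bytes as the identity signal) can be expressed as a small stability check over repeated fetches. The sketch assumes the response bodies have already been captured; `fetch_stability` and its `normalize` callback are hypothetical names, and no network access is performed.

```python
import hashlib
from typing import Callable, Iterable


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def fetch_stability(raw_bodies: Iterable[bytes],
                    normalize: Callable[[bytes], str]) -> dict:
    """Hash each captured body twice: raw (forensic) and normalized (identity)."""
    raw_hashes = []
    norm_hashes = []
    for body in raw_bodies:
        raw_hashes.append(sha256_hex(body))
        norm_hashes.append(sha256_hex(normalize(body).encode("utf-8")))
    return {
        "fetches": len(raw_hashes),
        "raw_identical": len(set(raw_hashes)) == 1,
        "normalized_identical": len(set(norm_hashes)) == 1,
        "normalized_sha256": norm_hashes[0] if norm_hashes else None,
    }


# Toy stand-in for the real normalizer: collapse all whitespace runs.
def toy_normalize(body: bytes) -> str:
    return " ".join(body.decode("utf-8", errors="replace").split())


report = fetch_stability([b"a  b", b"a b", b"a\tb"], toy_normalize)
print(report["raw_identical"], report["normalized_identical"])
```

In the toy run the raw hashes differ while the normalized hashes agree, which mirrors the rule that raw variation is noise and only the normalized digest carries identity.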
4. Controlled divergence experiment (decisive)
A single fragment (candidate_B span of fetch1) was normalized under the pinned pipeline while varying only one ambiguous step at a time:
R1 pinned: every block tag (open/close) -> "\n"; collapse [ \t\f\v ]+ -> 1 space;
   collapse >=2 blank lines -> 1 blank line
   => len 17819 sha b6ea1672… mk 19/1/1/1
V2: R1 but DO NOT collapse U+00A0 (nbsp) separately
   => len 17819 sha b6ea1672… mk 19/1/1/1 (IDENTICAL to R1 -> nbsp NOT a factor here)
V3: R1 but N8 = drop ALL empty lines (single "\n" between content, ZERO blank lines)
   => len 17522 sha 17660443e0f2… mk 19/1/1/1 == RATIFIED CANONICAL (exact)
V4: R1 but detag = only p/br/li/tr/h* -> "\n", other block tags removed (no \n)
   => len 17657 sha 072983ac6cf4… mk 19/1/1/1 == CLAUDE E1 OUTPUT (exact)
V5: R1 but NO tag -> newline at all (all tags stripped to "")
   => len 17634 sha 438c77c3… mk 19/1/1/1
conclusion (QG2 — divergence source IDENTIFIED, not "insufficient evidence"):
divergence_step: N8 vertical-whitespace ("collapse blank-line runs to a single \n")
canonical_semantics (Codex, ratified):
"single \n" == ONE NEWLINE -> drop every empty line; blocks separated by a
single \n with NO blank line between them. (variant V3 -> 17660443/17522)
claude_E1_semantics (prose misreading):
a combined reading: block-detag restricted to p/br/li/tr/h* -> "\n", plus the
"one blank line between blocks" collapse kept from R1
-> +135 chars vs canonical. (variant V4 -> 072983ac/17657)
the +135 is exactly the residual empty-line characters the canonical executor
removes and the Claude reconstruction kept. Markers/span/NFC/entity/nbsp are
all confirmed NOT the cause (V2 isolates nbsp as null; markers invariant;
candidate_A−B stable).
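The two readings of N8 can be contrasted on a toy input. This is a sketch under stated assumptions: both function names are hypothetical, and per-line stripping in the "keep one blank line" reading is assumed rather than confirmed by the source.

```python
import re


def n8_keep_one_blank(text: str) -> str:
    """R1-style reading: runs of 2+ blank lines collapse to one blank line."""
    lines = [ln.strip() for ln in text.split("\n")]
    joined = "\n".join(lines)
    # Three or more consecutive newlines mean two or more blank lines.
    return re.sub(r"\n{3,}", "\n\n", joined).strip("\n")


def n8_drop_all_blank(text: str) -> str:
    """Canonical reading (V3): drop every empty line, join with one newline."""
    lines = [ln.strip() for ln in text.split("\n")]
    return "\n".join(ln for ln in lines if ln)


toy = "A\n\n\nB\n\nC"
kept = n8_keep_one_blank(toy)      # "A\n\nB\n\nC"
dropped = n8_drop_all_blank(toy)   # "A\nB\nC"
print(len(kept) - len(dropped))    # 2: one character per retained blank line
```

Each retained blank line costs exactly one extra "\n", so a fixed character surplus with unchanged marker counts is the signature expected from a vertical-whitespace difference.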
5. Source-drift vs parser-mismatch classification
classification: PARSER-IMPLEMENTATION DIVERGENCE (delta_2), fully resolved.
Source content is NOT drifted vs the ratified canonical: under the canonical
N8 semantics (V3) the live page reproduces
17660443e0f23e994e1807cf8e22920951a9e70c598956dbd0e752f4f5cae80c / 17522
EXACTLY and deterministically 3/3.
The earlier "living KB-revision drift" (delta_1) had already settled at this
canonical value before L3; no further content drift has occurred.
raw_byte_variation: forensic-only Nuxt render noise; never the identity signal
(consistent with ratified raw/normalized split).
6. Pinned decisions required by the reference implementation
D-FETCH : HTTP GET, redirects=0, expect 200 + text/html; raw bytes captured;
raw_sha256 = forensic only (never identity)
D-DECODE : UTF-8, errors='replace'; strip leading U+FEFF
D-SPAN : inner content of the first <article>…</article>; candidate_B = from the <h1>
matching /^<h1[^>]*>\s*HIẾN PHÁP/ through the end of CHANGELOG, end-bounded by the
start of the <footer> that contains "Back to Knowledge Hub"
(that backlink footer EXCLUDED); CHANGELOG INCLUDED (GPT R-CL1)
D-DROP : remove <script>/<style> subtrees and HTML comments first
D-DETAG : EVERY block-level tag (opening, closing, OR self-closing) of {p div section
article header footer nav aside main h1-h6 ul ol li dl dt dd table
thead tbody tfoot tr td th blockquote pre hr br figure figcaption}
-> "\n"; ALL other (inline) tags -> "" (no separator)
D-ENTITY : html.unescape, single pass, AFTER detag
D-NFC : Unicode NFC; strip U+FEFF
D-EOL : CRLF -> LF, lone CR -> LF
D-HSPACE : collapse runs of [ \t\f\v] AND U+00A0 -> single ASCII space
D-VSPACE : **CANONICAL** — strip each line, then DROP ALL EMPTY LINES; join
remaining lines with a single "\n" (this is the fix vs Claude E1)
D-TRAILNL : NO trailing newline appended (identity = exact joined bytes)
D-MARKERS : U+2705 / U+1F4CB / U+1F4DD / U+26D4 preserved by codepoint; never
folded/aliased; counts are an integrity probe, not the identity
D-SENTINEL: when later written to the snapshot artifact, identity = sha256 of
ONLY the bytes between <<<BEGIN-NORMALIZED-CONTENT…/…END…>>> sentinel
lines (sentinels excluded, no added trailing \n) — equals this
implementation's normalized_content bytes exactly (artifact-spec doc)
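The pinned decisions from D-DROP through D-TRAILNL can be read as one ordered pipeline. The following is a minimal regex-based sketch, not the canonical VPS implementation: `normalize_fragment` and the regex names are hypothetical, it assumes the input is an already-decoded candidate_B fragment, and it has not been verified to reproduce the ratified checksum.

```python
import html
import re
import unicodedata

# Block-level tag set copied from D-DETAG.
BLOCK_TAGS = (
    "p div section article header footer nav aside main "
    "h1 h2 h3 h4 h5 h6 ul ol li dl dt dd table thead tbody tfoot "
    "tr td th blockquote pre hr br figure figcaption"
).split()

_DROP_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>|<!--.*?-->", re.I | re.S)
_BLOCK_RE = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(BLOCK_TAGS), re.I)
_ANY_TAG_RE = re.compile(r"<[^>]+>")


def normalize_fragment(fragment: str) -> str:
    text = _DROP_RE.sub("", fragment)                  # D-DROP
    text = _BLOCK_RE.sub("\n", text)                   # D-DETAG: block -> "\n"
    text = _ANY_TAG_RE.sub("", text)                   # D-DETAG: inline -> ""
    text = html.unescape(text)                         # D-ENTITY, after detag
    text = unicodedata.normalize("NFC", text)          # D-NFC
    text = text.replace("\ufeff", "")                  # D-NFC: strip U+FEFF
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # D-EOL
    text = re.sub(r"[ \t\f\v\u00a0]+", " ", text)      # D-HSPACE
    lines = [ln.strip() for ln in text.split("\n")]    # D-VSPACE: strip lines
    # D-VSPACE + D-TRAILNL: drop empty lines, no trailing newline appended.
    return "\n".join(ln for ln in lines if ln)


sample = ("<h1>Title</h1><div><p>A&nbsp;&amp;  <b>B</b></p></div>"
          "<script>x()</script><!-- c --><ul><li>one</li><li>two</li></ul>")
print(repr(normalize_fragment(sample)))
```

Unescaping entities only after tag removal keeps an escaped `&lt;` from being mistaken for markup, which is one reason the step order is pinned rather than left to the implementer.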
7. Statement
- +135 divergence localized by controlled experiment to N8 vertical-whitespace semantics; canonical reading reproduced exactly; markers/span/nbsp/entity excluded as causes (QG2 satisfied — source identified, not "insufficient").
- Analysis only; nothing executed/persisted (QG1/QG5).
- doc 2 of 5; STOP after 5 files → route GPT/User. Self-advance PROHIBITED.
Companions: operational-framing, implementation-draft, test-result, authoring-report.