KB-4EE3 rev 3

dot-iu-cutter v0.5 — Constitution Nuxt Parser Reference Implementation Draft (byte-exact, stdlib-only; reproduces ratified canonical 17660443…/17522)

Tags: dot-iu-cutter, v0.5, constitution-fixture, nuxt-parser-reference-impl, implementation-draft, byte-exact, reproduces-canonical, authoring-only, no-deploy, dieu44, 2026-05-18

Phase: v0_5_constitution_nuxt_parser_reference_implementation_authoring · Nature: reference_implementation_draft__authored_in_KB__not_deployed_not_committed · Date: 2026-05-18 · doc 3 of 5
deployed: false ; git_commit: false ; installed_on_vps: false
executed_where: local /tmp scratch ONLY (read-only source GET), scratch deleted
snapshot_written: false ; seed: none ; dml: none ; dry_run/cut/verify: none
decision_authority: GPT / User ONLY ; self_advance: PROHIBITED

This is the canonical reference implementation draft for parser_profile nuxt-incomex-portal-constitution-v1. It is published to the KB (SSOT) so that it is portable and ratifiable independently of any one environment. It is not deployed and not committed to any repo in this phase (the GPT ruling withholds repo work until a later implementation phase explicitly authorizes it).

1. Identity of this implementation

parser_version:      nuxt-incomex-portal-constitution-v1.refimpl.r1
language:            Python 3 (stdlib only: re, html, hashlib, unicodedata, json)
dependencies:        none (deterministic, offline-capable on captured bytes)
reference_script_sha256: 8f6220c9b346a21b823cc41c12c886cb5f51ef4ab557806d2137ad78a1d08e29
  (sha256 of the exact §4 source as run; embed verbatim — tabs/whitespace
   exact — to reproduce this hash if the script identity itself is ratified)
reproduces_ratified_canonical: YES
  checksum 17660443e0f23e994e1807cf8e22920951a9e70c598956dbd0e752f4f5cae80c
  length 17522 / markers ✅19 📋1 📝1 ⛔1,
  deterministic 3/3 against the live source (doc 4 evidence)
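
The "deterministic 3/3" claim can be checked mechanically. The following is a minimal sketch, assuming the captures are already on hand as byte strings; `check_determinism`, `captures`, and `toy_normalize` are hypothetical names, and the stand-in normalizer is not the §4 parser.

```python
import hashlib

def checksum(text):
    # Same identity rule as this document: sha256 over the UTF-8 encoding.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def check_determinism(captures, normalize):
    """Return OK only if every capture normalizes to one checksum.

    captures  : list of raw byte strings (hypothetical input)
    normalize : callable mapping raw bytes -> normalized text (hypothetical)
    """
    sums = [checksum(normalize(raw)) for raw in captures]
    if len(set(sums)) != 1:
        # Mirrors FAIL_NONDETERMINISM: block, never pick a winner.
        return {"status": "BLOCKED", "reason": "FAIL_NONDETERMINISM",
                "checksums": sums}
    return {"status": "OK", "checksum": sums[0], "runs": len(sums)}

def toy_normalize(raw):
    # Stand-in only: decode and trim. NOT the reference parser.
    return raw.decode("utf-8", errors="replace").strip()

same = check_determinism([b"abc\n", b"abc\n", b"abc\n"], toy_normalize)
drift = check_determinism([b"abc\n", b"abd\n"], toy_normalize)
```

In this sketch an empty capture list is also reported as BLOCKED, in keeping with the rule that BLOCKED is preferred over a guessed PASS.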

2. Required outputs (all produced by `run()`)

normalized_content            : b_text (the candidate_B normalized bytes; identity)
normalized_content_checksum   : sha256(normalized_content utf-8)
normalized_content_length     : codepoint count len(normalized_content)
marker_counts                 : {enacted ✅, controlled_draft 📋, draft 📝, obsolete ⛔}
extraction_span_diagnostics   : raw_bytes, raw_sha256, article_found,
                                 article_inner_rawlen, candidate_A_length,
                                 candidate_B_length, A_minus_B, span_note
parser_version                : nuxt-incomex-portal-constitution-v1.refimpl.r1
candidate_A (diagnostic only) : full <article> inner normalized (NOT the identity)
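
The distinction between codepoint count and byte count matters for normalized_content_length, because the markers and Vietnamese text are multi-byte in UTF-8. A small sketch of the measurements on a made-up sample string follows; the `utf8_bytes` field is illustrative only and is not one of the required outputs.

```python
import hashlib

# Marker codepoints as listed under D-MARKERS.
MARKERS = {"enacted": "\u2705", "controlled_draft": "\U0001F4CB",
           "draft": "\U0001F4DD", "obsolete": "\u26D4"}

def measure(text):
    data = text.encode("utf-8")
    return {
        "checksum": hashlib.sha256(data).hexdigest(),  # identity value
        "length": len(text),       # codepoints, as this section specifies
        "utf8_bytes": len(data),   # illustrative extra, not a required output
        "marker_counts": {k: text.count(v) for k, v in MARKERS.items()},
    }

# Made-up sample: one enacted marker, one controlled_draft marker.
sample = "\u2705 Article 1\n\U0001F4CB Article 2"
m = measure(sample)
```

Here `m["length"]` and `m["utf8_bytes"]` differ because U+2705 takes three bytes and U+1F4CB takes four, while each counts as a single codepoint.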

3. Pinned decisions (normative — see doc 2 §6 for rationale)

D-FETCH    HTTP GET, redirects=0, expect 200 text/html; raw bytes -> raw_sha256 (forensic only)
D-DECODE   UTF-8 errors='replace'; strip leading U+FEFF
D-SPAN     first <article>…</article> inner; candidate_B = <h1> matching
           /<h1[^>]*>\s*HIẾN PHÁP[^<]*<\/h1>/ THROUGH end CHANGELOG, end-bounded
           by start of the <footer> containing "Back to Knowledge Hub"
           (backlink footer EXCLUDED); CHANGELOG INCLUDED (GPT R-CL1)
D-DROP     remove <script>/<style> subtrees + HTML comments BEFORE detag
D-DETAG    every block tag (open/close/self) in BLOCK set -> "\n"; all other
           (inline) tags -> "" (no separator)
D-ENTITY   html.unescape, single pass, AFTER detag
D-NFC      Unicode NFC; strip U+FEFF
D-EOL      CRLF -> LF; lone CR -> LF
D-HSPACE   collapse [ \t\f\v]+ and U+00A0 -> single ASCII space
D-VSPACE   strip each line; DROP ALL EMPTY LINES; join with single "\n"
           (CANONICAL reading of "collapse blank-line runs to a single \n")
D-TRAILNL  no trailing newline appended
D-MARKERS  U+2705 / U+1F4CB / U+1F4DD / U+26D4 preserved by codepoint, never folded
D-SENTINEL snapshot-artifact identity = sha256 of bytes strictly between the
           BEGIN/END sentinel lines (sentinels excluded, no added trailing \n)
           == this normalized_content exactly (artifact-spec doc, GPT Q6)
ORDER      D-DROP→D-DETAG→D-ENTITY→D-NFC→D-EOL→D-HSPACE→trim→D-VSPACE is
           NORMATIVE; reordering changes the hash
FAILURE    FAIL_NO_SPAN (no <article>/H1 anchor) -> BLOCKED, no raw fallback,
           no fabrication; FAIL_NONDETERMINISM (two fetches differ) -> BLOCKED;
           DRIFT (checksum ≠ registered) -> propose new version + review, never
           silent re-cut; BLOCKED always preferred over a guessed PASS
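
To illustrate why ORDER is normative, the sketch below swaps D-DETAG and D-ENTITY on a made-up fragment. The `detag` helper is a deliberately simplified stand-in (only `<br>` is treated as a block tag), not the reference tag walk.

```python
import html
import re

def detag(s):
    # Simplified D-DETAG for this demo: <br> -> newline, any other tag -> "".
    def repl(m):
        return "\n" if re.match(r"</?\s*br\b", m.group(0), re.I) else ""
    return re.sub(r"<[^>]+>", repl, s)

# Made-up fragment whose text mentions an escaped tag.
fragment = "<p>use &lt;br&gt; for breaks</p>"

# Normative order: detag first, then unescape entities once.
normative = html.unescape(detag(fragment))

# Swapped order: the escaped "<br>" becomes a real tag and is consumed.
swapped = detag(html.unescape(fragment))
```

With the normative order the escaped tag survives as literal text; with the swapped order it turns into a line break, so the normalized content and therefore the checksum would differ.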

4. Reference implementation source (verbatim — embed exactly)

#!/usr/bin/env python3
# Reference implementation candidate for parser_profile
#   nuxt-incomex-portal-constitution-v1
# parser_version: nuxt-incomex-portal-constitution-v1.refimpl.r1
# Deterministic, stdlib-only. Read-only on input bytes; no side effects.
import sys, re, html, hashlib, unicodedata, json

PARSER_VERSION = "nuxt-incomex-portal-constitution-v1.refimpl.r1"

# Block-level tags whose open OR close becomes a hard newline boundary (D-DETAG).
BLOCK = {"p","div","section","article","header","footer","nav","aside","main",
         "h1","h2","h3","h4","h5","h6","ul","ol","li","dl","dt","dd",
         "table","thead","tbody","tfoot","tr","td","th","blockquote",
         "pre","hr","br","figure","figcaption"}
MARKERS = {"enacted":"✅","controlled_draft":"\U0001F4CB",
           "draft":"\U0001F4DD","obsolete":"⛔"}

def sha(s): return hashlib.sha256(s.encode("utf-8")).hexdigest()

def extract_article_inner(h):
    m = re.search(r"<article\b[^>]*>", h)
    if not m: return None, "FAIL_NO_SPAN:no <article>"
    start = m.end()
    end = h.find("</article>", start)
    if end == -1: return None, "FAIL_NO_SPAN:no </article>"
    return h[start:end], None

def slice_candidate_B(article_inner):
    # span = from the <h1> whose text contains the H1 anchor THROUGH end of
    # CHANGELOG, EXCLUDING the trailing backlink footer (D-SPAN).
    mh = re.search(r"<h1\b[^>]*>\s*HIẾN PHÁP[^<]*</h1>", article_inner)
    if not mh: return None, "FAIL_NO_SPAN:no H1 HIẾN PHÁP anchor"
    h1_start = mh.start()
    bl = article_inner.find("Back to Knowledge Hub")
    if bl == -1:
        return article_inner[h1_start:], "WARN:no backlink (kept tail)"
    foot = article_inner.rfind("<footer", h1_start, bl)
    end = foot if foot != -1 else bl
    return article_inner[h1_start:end], None

def detag_normalize(fragment):
    # 1. drop <script>/<style> subtrees + comments
    s = re.sub(r"<script\b[^>]*>.*?</script>", "", fragment, flags=re.S|re.I)
    s = re.sub(r"<style\b[^>]*>.*?</style>", "", s, flags=re.S|re.I)
    s = re.sub(r"<!--.*?-->", "", s, flags=re.S)
    # 2. tag walk: block tag (open/close/self) -> "\n"; inline tag -> "" (D-DETAG)
    out = []
    for tok in re.split(r"(<[^>]+>)", s):
        if tok.startswith("<") and tok.endswith(">"):
            nm = re.match(r"</?\s*([a-zA-Z0-9]+)", tok)
            out.append("\n" if (nm and nm.group(1).lower() in BLOCK) else "")
        else:
            out.append(tok)
    s = "".join(out)
    # 3. HTML entity unescape (single pass)
    s = html.unescape(s)
    # 4. Unicode NFC + strip BOM
    s = unicodedata.normalize("NFC", s).replace("\ufeff", "")
    # 5. CRLF/CR -> LF
    s = s.replace("\r\n", "\n").replace("\r", "\n")
    # 6. collapse horizontal ws incl. U+00A0 -> single space (D-HSPACE)
    s = re.sub(r"[ \t\f\v\u00a0]+", " ", s)
    # 7. trim each line
    lines = [ln.strip(" ") for ln in s.split("\n")]
    # 8. D-VSPACE (CANONICAL): "collapse blank-line runs to a single \n" read
    #    literally as a single newline -> drop ALL empty lines; blocks are
    #    separated by exactly one \n, NO blank line between them.
    res = [ln for ln in lines if ln != ""]
    return "\n".join(res)   # no trailing newline (D-TRAILNL)

def measure(text):
    return {"checksum": sha(text), "length": len(text),
            "marker_counts": {k: text.count(v) for k,v in MARKERS.items()}}

def run(path):
    raw = open(path,"rb").read()
    h = raw.decode("utf-8", errors="replace")
    diag = {"raw_bytes": len(raw), "raw_sha256": hashlib.sha256(raw).hexdigest(),
            "article_found": False}
    art, err = extract_article_inner(h)
    if err: return {"status":"BLOCKED","reason":err,"diagnostics":diag}
    diag["article_found"] = True
    diag["article_inner_rawlen"] = len(art)
    a_text = detag_normalize(art)
    b_frag, berr = slice_candidate_B(art)
    if b_frag is None: return {"status":"BLOCKED","reason":berr,"diagnostics":diag}
    b_text = detag_normalize(b_frag)
    A, B = measure(a_text), measure(b_text)
    diag.update({"candidate_A_length": A["length"],
                 "candidate_B_length": B["length"],
                 "A_minus_B": A["length"]-B["length"],
                 "span_note": berr or "ok"})
    return {"status":"OK","parser_version":PARSER_VERSION,
            "candidate_A": A, "candidate_B": B,
            "normalized_content_checksum": B["checksum"],
            "normalized_content_length": B["length"],
            "marker_counts": B["marker_counts"],
            "extraction_span_diagnostics": diag,
            "_b_text": b_text}

if __name__ == "__main__":
    r = run(sys.argv[1])
    bt = r.pop("_b_text", None)
    print(json.dumps(r, ensure_ascii=False, indent=2))
    if bt is not None and len(sys.argv) > 2:
        open(sys.argv[2],"w",encoding="utf-8").write(bt)

Note on the embedded source: the script is embedded verbatim as UTF-8. The <h1> anchor regex carries literal Vietnamese diacritics ("HIẾN PHÁP", where Ế = U+1EBE); ✅ U+2705 and ⛔ U+26D4 appear as literal characters; 📋 U+1F4CB and 📝 U+1F4DD appear as \U0001F4CB/\U0001F4DD escapes, exactly as in the running file. The local scratch file that produced reference_script_sha256 = 8f6220c9b346… differs from this embedded copy only in its header comment line (scratch said "SCRATCH ONLY - not deployed…"; here it reads "Deterministic, stdlib-only. Read-only on input bytes; no side effects."). Because the algorithm bytes are identical, this embedded form reproduces the normalized-content outputs in doc 4, although hashing this copy will not reproduce 8f6220c9b346…. On ratification, the ratifying executor must re-pin the canonical script form and a freshly computed sha256 over the agreed canonical header (doc 5 R-RI2). The script hash is NOT the version identity; normalized_content_checksum is.
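
Re-pinning the script hash amounts to hashing the file's raw bytes. The sketch below is one way to do that and shows that a changed header comment alone changes the digest; the helper names and the two throwaway files are hypothetical and stand in for the scratch and embedded forms.

```python
import hashlib
import os
import tempfile

def script_sha256(path):
    # Hash the file's raw bytes; binary mode avoids newline translation.
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()

def _write_temp(body):
    # Write bytes to a throwaway file and return its path.
    fd, path = tempfile.mkstemp(suffix=".py")
    with os.fdopen(fd, "wb") as fh:
        fh.write(body)
    return path

# Two files with identical "algorithm bytes" and different header comments.
algo = b"def f():\n    return 1\n"
p1 = _write_temp(b"# SCRATCH ONLY\n" + algo)
p2 = _write_temp(b"# Deterministic, stdlib-only.\n" + algo)
h1, h2 = script_sha256(p1), script_sha256(p2)
os.remove(p1)
os.remove(p2)
```

The two digests differ even though the function body is byte-identical, which is why the canonical header has to be agreed before the script hash is pinned.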

5. Equivalence note (re-implementation guidance)

any_reimplementation_is_canonical_iff: given identical source bytes it produces
  the SAME normalized_content (hence checksum/length/markers). The 11 D-* pinned
  decisions + the normative step ORDER are the contract; the language is not.
  A Node/Go/PG port is acceptable only if it passes the doc-4 test vector
  (live source -> 17660443…/17522/19·1·1·1) AND the doc-4 negative variants.
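
A port could be checked against the test vector with a comparison along these lines. This is a sketch only: `conforms` is a hypothetical helper, the result keys follow the §2 output names, and the expected checksum is the wrapped §1 value joined into one 64-hex digest. The doc-4 negative variants are not covered here.

```python
# Expected values taken from §1 of this document.
EXPECTED = {
    "checksum": "17660443e0f23e994e1807cf8e22920951a9e70c"
                "598956dbd0e752f4f5cae80c",
    "length": 17522,
    "marker_counts": {"enacted": 19, "controlled_draft": 1,
                      "draft": 1, "obsolete": 1},
}

def conforms(result, expected=EXPECTED):
    """Return the list of mismatched fields; an empty list means a match."""
    got = {
        "checksum": result.get("normalized_content_checksum"),
        "length": result.get("normalized_content_length"),
        "marker_counts": result.get("marker_counts"),
    }
    return [k for k in expected if got[k] != expected[k]]

# Hypothetical port outputs: one matching, one with a wrong length.
good = {"normalized_content_checksum": EXPECTED["checksum"],
        "normalized_content_length": 17522,
        "marker_counts": dict(EXPECTED["marker_counts"])}
bad = dict(good, normalized_content_length=17523)

mismatch_good = conforms(good)
mismatch_bad = conforms(bad)
```

Returning the mismatched field names rather than a bare boolean keeps the failure diagnosable without guessing at a PASS.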

6. Statement

Byte-exact reference implementation drafted with all 11 ambiguous steps pinned and embedded in KB SSOT; reproduces the ratified canonical identity (QG3 satisfied — byte-exact draft produced, not "blocked").
Not deployed, not committed, no snapshot/seed/DML/dry-run/CUT/VERIFY (QG1).
doc 3 of 5; STOP after 5 files → route GPT/User. Self-advance PROHIBITED.

Companions: operational-framing, algorithm-analysis, test-result, authoring-report.