dot-iu-cutter v0.5 — Constitution Nuxt Parser Reference Implementation Draft (byte-exact, stdlib-only; reproduces ratified canonical 17660443…/17522)
Phase: v0_5_constitution_nuxt_parser_reference_implementation_authoring · Nature: reference_implementation_draft__authored_in_KB__not_deployed_not_committed · Date: 2026-05-18 · doc 3 of 5
deployed: false ; git_commit: false ; installed_on_vps: false
executed_where: local /tmp scratch ONLY (read-only source GET), scratch deleted
snapshot_written: false ; seed: none ; dml: none ; dry_run/cut/verify: none
decision_authority: GPT / User ONLY ; self_advance: PROHIBITED
This is the canonical reference implementation draft for parser_profile nuxt-incomex-portal-constitution-v1. It is published to KB (SSOT) so it is portable and ratifiable independent of any one environment. It is not deployed and not committed to any repo this phase (GPT ruling withholds repo work until a later implementation phase explicitly authorizes it).
1. Identity of this implementation
parser_version: nuxt-incomex-portal-constitution-v1.refimpl.r1
language: Python 3 (stdlib only: re, html, hashlib, unicodedata, json)
dependencies: none (deterministic, offline-capable on captured bytes)
reference_script_sha256: 8f6220c9b346a21b823cc41c12c886cb5f51ef4ab557806d2137ad78a1d08e29
    (sha256 of the exact §4 source as run; embed verbatim, tabs/whitespace
    exact, to reproduce this hash if the script identity itself is ratified)
reproduces_ratified_canonical: YES
    checksum 17660443e0f23e994e1807cf8e22920951a9e70c598956dbd0e752f4f5cae80c
    length 17522 / markers ✅19 📋1 📝1 ⛔1
    deterministic 3/3 against the live source (doc 4 evidence)
2. Required outputs (all produced by run())
normalized_content : b_text (the candidate_B normalized bytes; identity)
normalized_content_checksum : sha256(normalized_content utf-8)
normalized_content_length : codepoint count len(normalized_content)
marker_counts : {enacted ✅, controlled_draft 📋, draft 📝, obsolete ⛔}
extraction_span_diagnostics : raw_bytes, raw_sha256, article_found,
                              article_inner_rawlen, candidate_A_length,
                              candidate_B_length, A_minus_B, span_note
parser_version : nuxt-incomex-portal-constitution-v1.refimpl.r1
candidate_A (diagnostic only) : full <article> inner normalized (NOT the identity)
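As a minimal sketch of how the identity fields above relate, assuming a normalized string is already in hand: the checksum is taken over the UTF-8 bytes, while the length counts codepoints, so the two numbers diverge for non-ASCII text. The helper name `describe`, the `utf8_byte_length` field, and the sample text are hypothetical and are not part of the reference script or the live source.

```python
import hashlib

# Marker codepoints as listed in section 2 and D-MARKERS.
MARKERS = {"enacted": "\u2705", "controlled_draft": "\U0001F4CB",
           "draft": "\U0001F4DD", "obsolete": "\u26D4"}

def describe(normalized_content: str) -> dict:
    """Hypothetical helper: derive checksum, length and marker counts."""
    utf8 = normalized_content.encode("utf-8")
    return {
        "normalized_content_checksum": hashlib.sha256(utf8).hexdigest(),
        "normalized_content_length": len(normalized_content),  # codepoints
        "utf8_byte_length": len(utf8),  # illustrative only, not an identity field
        "marker_counts": {k: normalized_content.count(v)
                          for k, v in MARKERS.items()},
    }

# Illustrative text only (escapes keep the sample in NFC form).
sample = "HI\u1EBEN PH\u00C1P\n\u2705 \u0110i\u1EC1u 1\n\U0001F4DD \u0110i\u1EC1u 2"
info = describe(sample)
```

Here the codepoint length is smaller than the UTF-8 byte length, which is why a port that reports a byte count in `normalized_content_length` would disagree with the ratified length even when its checksum matches.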
3. Pinned decisions (normative — see doc 2 §6 for rationale)
D-FETCH     HTTP GET, redirects=0, expect 200 text/html; raw bytes -> raw_sha256 (forensic only)
D-DECODE    UTF-8 errors='replace'; strip leading U+FEFF
D-SPAN      first <article>…</article> inner; candidate_B = <h1> matching
            /<h1[^>]*>\s*HIẾN PHÁP[^<]*<\/h1>/ THROUGH end CHANGELOG, end-bounded
            by start of the <footer> containing "Back to Knowledge Hub"
            (backlink footer EXCLUDED); CHANGELOG INCLUDED (GPT R-CL1)
D-DROP      remove <script>/<style> subtrees + HTML comments BEFORE detag
D-DETAG     every block tag (open/close/self) in BLOCK set -> "\n"; all other
            (inline) tags -> "" (no separator)
D-ENTITY    html.unescape, single pass, AFTER detag
D-NFC       Unicode NFC; strip U+FEFF
D-EOL       CRLF -> LF; lone CR -> LF
D-HSPACE    collapse [ \t\f\v]+ and U+00A0 -> single ASCII space
D-VSPACE    strip each line; DROP ALL EMPTY LINES; join with single "\n"
            (CANONICAL reading of "collapse blank-line runs to a single \n")
D-TRAILNL   no trailing newline appended
D-MARKERS   U+2705 / U+1F4CB / U+1F4DD / U+26D4 preserved by codepoint, never folded
D-SENTINEL  snapshot-artifact identity = sha256 of bytes strictly between the
            BEGIN/END sentinel lines (sentinels excluded, no added trailing \n)
            == this normalized_content exactly (artifact-spec doc, GPT Q6)
ORDER       D-DROP→D-DETAG→D-ENTITY→D-NFC→D-EOL→D-HSPACE→trim→D-VSPACE is
            NORMATIVE; reordering changes the hash
FAILURE     FAIL_NO_SPAN (no <article>/H1 anchor) -> BLOCKED, no raw fallback,
            no fabrication; FAIL_NONDETERMINISM (two fetches differ) -> BLOCKED;
            DRIFT (checksum ≠ registered) -> propose new version + review, never
            silent re-cut; BLOCKED always preferred over a guessed PASS
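To illustrate why ORDER is normative, the following sketch runs a toy fragment through D-DETAG and D-ENTITY in both orders. It is a simplified stand-in under stated assumptions, not the reference walk: `BLOCK` is reduced to four tags, and `detag` and `finish` are hypothetical helpers that use a single `re.sub` instead of the token split used in §4.

```python
import hashlib
import html
import re

BLOCK = {"p", "div", "li", "h1"}  # reduced block set, for illustration only

def detag(s: str) -> str:
    # Block tags become "\n"; every other tag is removed with no separator.
    def repl(m):
        return "\n" if m.group(1).lower() in BLOCK else ""
    return re.sub(r"</?\s*([a-zA-Z0-9]+)[^>]*>", repl, s)

def finish(s: str) -> str:
    # D-HSPACE, per-line trim, then D-VSPACE (drop all empty lines).
    s = re.sub(r"[ \t\f\v\u00a0]+", " ", s)
    return "\n".join(ln.strip(" ") for ln in s.split("\n") if ln.strip(" "))

def sha(s: str) -> str:
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

fragment = "<p>use &lt;b&gt; for bold</p>"
normative = finish(html.unescape(detag(fragment)))  # D-DETAG, then D-ENTITY
reordered = finish(detag(html.unescape(fragment)))  # D-ENTITY, then D-DETAG
```

In the normative order the escaped `&lt;b&gt;` survives as visible text; in the swapped order it is unescaped into a real tag and then stripped, so the two results (and their sha256 values) differ.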
4. Reference implementation source (verbatim — embed exactly)
#!/usr/bin/env python3
# Reference implementation candidate for parser_profile
#   nuxt-incomex-portal-constitution-v1
# parser_version: nuxt-incomex-portal-constitution-v1.refimpl.r1
# Deterministic, stdlib-only. Read-only on input bytes; no side effects.
import sys, re, html, hashlib, unicodedata, json

PARSER_VERSION = "nuxt-incomex-portal-constitution-v1.refimpl.r1"

# Block-level tags whose open OR close becomes a hard newline boundary (D-DETAG).
BLOCK = {"p","div","section","article","header","footer","nav","aside","main",
         "h1","h2","h3","h4","h5","h6","ul","ol","li","dl","dt","dd",
         "table","thead","tbody","tfoot","tr","td","th","blockquote",
         "pre","hr","br","figure","figcaption"}

MARKERS = {"enacted":"✅","controlled_draft":"\U0001F4CB",
           "draft":"\U0001F4DD","obsolete":"⛔"}

def sha(s): return hashlib.sha256(s.encode("utf-8")).hexdigest()

def extract_article_inner(h):
    m = re.search(r"<article\b[^>]*>", h)
    if not m: return None, "FAIL_NO_SPAN:no <article>"
    start = m.end()
    end = h.find("</article>", start)
    if end == -1: return None, "FAIL_NO_SPAN:no </article>"
    return h[start:end], None

def slice_candidate_B(article_inner):
    # span = from the <h1> whose text contains the H1 anchor THROUGH end of
    # CHANGELOG, EXCLUDING the trailing backlink footer (D-SPAN).
    mh = re.search(r"<h1\b[^>]*>\s*HIẾN PHÁP[^<]*</h1>", article_inner)
    if not mh: return None, "FAIL_NO_SPAN:no H1 HIẾN PHÁP anchor"
    h1_start = mh.start()
    bl = article_inner.find("Back to Knowledge Hub")
    if bl == -1:
        return article_inner[h1_start:], "WARN:no backlink (kept tail)"
    foot = article_inner.rfind("<footer", h1_start, bl)
    end = foot if foot != -1 else bl
    return article_inner[h1_start:end], None

def detag_normalize(fragment):
    # 1. drop <script>/<style> subtrees + comments
    s = re.sub(r"<script\b[^>]*>.*?</script>", "", fragment, flags=re.S|re.I)
    s = re.sub(r"<style\b[^>]*>.*?</style>", "", s, flags=re.S|re.I)
    s = re.sub(r"<!--.*?-->", "", s, flags=re.S)
    # 2. tag walk: block tag (open/close/self) -> "\n"; inline tag -> "" (D-DETAG)
    out = []
    for tok in re.split(r"(<[^>]+>)", s):
        if tok.startswith("<") and tok.endswith(">"):
            nm = re.match(r"</?\s*([a-zA-Z0-9]+)", tok)
            out.append("\n" if (nm and nm.group(1).lower() in BLOCK) else "")
        else:
            out.append(tok)
    s = "".join(out)
    # 3. HTML entity unescape (single pass)
    s = html.unescape(s)
    # 4. Unicode NFC + strip BOM (U+FEFF)
    s = unicodedata.normalize("NFC", s).replace("\ufeff", "")
    # 5. CRLF/CR -> LF
    s = s.replace("\r\n", "\n").replace("\r", "\n")
    # 6. collapse horizontal ws incl. U+00A0 -> single space (D-HSPACE)
    s = re.sub(r"[ \t\f\v\u00a0]+", " ", s)
    # 7. trim each line
    lines = [ln.strip(" ") for ln in s.split("\n")]
    # 8. D-VSPACE (CANONICAL): "collapse blank-line runs to a single \n" read
    #    literally as a single newline -> drop ALL empty lines; blocks are
    #    separated by exactly one \n, NO blank line between them.
    res = [ln for ln in lines if ln != ""]
    return "\n".join(res)  # no trailing newline (D-TRAILNL)

def measure(text):
    return {"checksum": sha(text), "length": len(text),
            "marker_counts": {k: text.count(v) for k,v in MARKERS.items()}}

def run(path):
    raw = open(path,"rb").read()
    h = raw.decode("utf-8", errors="replace")
    diag = {"raw_bytes": len(raw), "raw_sha256": hashlib.sha256(raw).hexdigest(),
            "article_found": False}
    art, err = extract_article_inner(h)
    if err: return {"status":"BLOCKED","reason":err,"diagnostics":diag}
    diag["article_found"] = True
    diag["article_inner_rawlen"] = len(art)
    a_text = detag_normalize(art)
    b_frag, berr = slice_candidate_B(art)
    if b_frag is None: return {"status":"BLOCKED","reason":berr,"diagnostics":diag}
    b_text = detag_normalize(b_frag)
    A, B = measure(a_text), measure(b_text)
    diag.update({"candidate_A_length": A["length"],
                 "candidate_B_length": B["length"],
                 "A_minus_B": A["length"]-B["length"],
                 "span_note": berr or "ok"})
    return {"status":"OK","parser_version":PARSER_VERSION,
            "candidate_A": A, "candidate_B": B,
            "normalized_content_checksum": B["checksum"],
            "normalized_content_length": B["length"],
            "marker_counts": B["marker_counts"],
            "extraction_span_diagnostics": diag,
            "_b_text": b_text}

if __name__ == "__main__":
    r = run(sys.argv[1])
    bt = r.pop("_b_text", None)
    print(json.dumps(r, ensure_ascii=False, indent=2))
    if bt is not None and len(sys.argv) > 2:
        open(sys.argv[2],"w",encoding="utf-8").write(bt)
Note on the embedded source: the script is embedded verbatim (UTF-8, with literal Vietnamese diacritics in the <h1> anchor regex: "HIẾN PHÁP", H I Ế N where Ế = U+1EBE; literal ✅ U+2705 and ⛔ U+26D4; 📋 U+1F4CB and 📝 U+1F4DD written as \U0001F4CB / \U0001F4DD escapes, exactly as in the running file). The local scratch file that produced reference_script_sha256 = 8f6220c9b346… differs from this fenced copy only by its header comment line (scratch said "SCRATCH ONLY - not deployed…"; here it reads "Deterministic, stdlib-only. Read-only on input bytes; no side effects."). The algorithm bytes are identical; therefore the normalized-content outputs in doc 4 are reproduced by this embedded form. On ratification, the ratifying executor must re-pin the canonical script form + a freshly computed sha256 over the agreed canonical header (doc 5 R-RI2). The script hash is NOT the version identity; normalized_content_checksum is.
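The distinction between script hash and version identity can be sketched as follows. This is an illustration under stated assumptions: `script_sha256` is a hypothetical helper, and the two byte strings are made-up stand-ins for a scratch copy and an embedded copy that differ only in their header comment.

```python
import hashlib

def script_sha256(source_bytes: bytes) -> str:
    """Hypothetical helper: hash the exact script bytes, with no
    newline or whitespace normalization applied first."""
    return hashlib.sha256(source_bytes).hexdigest()

# Illustrative script bodies; only the header comment line differs.
scratch = b"# SCRATCH ONLY\nprint('same algorithm')\n"
embedded = b"# Deterministic, stdlib-only.\nprint('same algorithm')\n"

hashes_differ = script_sha256(scratch) != script_sha256(embedded)
```

A one-line comment change is enough to change the script hash while the algorithm, and therefore the normalized output, stays the same. That is why the script hash has to be re-pinned over an agreed canonical form before it can be used for anything.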
5. Equivalence note (re-implementation guidance)
any_reimplementation_is_canonical_iff: given identical source bytes it produces
    the SAME normalized_content (hence checksum/length/markers). The D-* pinned
    decisions of §3 + the normative step ORDER are the contract; the language is not.
    A Node/Go/PG port is acceptable only if it passes the doc-4 test vector
    (live source -> 17660443…/17522/19·1·1·1) AND the doc-4 negative variants.
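A port's output could be checked against a registered vector along these lines. This is a minimal sketch, not the doc-4 harness: `conforms` is a hypothetical name, and the toy vector below is derived from a made-up string, not from the live source.

```python
import hashlib

def conforms(normalized_content: str, expected_checksum: str,
             expected_length: int, expected_markers: dict) -> list:
    """Return mismatch descriptions; an empty list means the port agrees."""
    markers = {"enacted": "\u2705", "controlled_draft": "\U0001F4CB",
               "draft": "\U0001F4DD", "obsolete": "\u26D4"}
    problems = []
    checksum = hashlib.sha256(normalized_content.encode("utf-8")).hexdigest()
    if checksum != expected_checksum:
        problems.append("checksum mismatch")
    if len(normalized_content) != expected_length:
        problems.append("length mismatch")
    counts = {k: normalized_content.count(v) for k, v in markers.items()}
    if counts != expected_markers:
        problems.append("marker_counts mismatch")
    return problems

# Toy vector (NOT the doc-4 vector), derived from a known-good string.
good = "\u2705 A\n\u26D4 B"
expected = (hashlib.sha256(good.encode("utf-8")).hexdigest(), len(good),
            {"enacted": 1, "controlled_draft": 0, "draft": 0, "obsolete": 1})
ok = conforms(good, *expected)
bad = conforms(good + "\n", *expected)  # a port that appends a trailing newline
```

The second call shows how a single D-TRAILNL violation surfaces as both a checksum and a length mismatch while the marker counts still agree, so all three fields need to be compared rather than the markers alone.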
6. Statement
- Byte-exact reference implementation drafted with all 11 ambiguous steps pinned and embedded in KB SSOT; reproduces the ratified canonical identity (QG3 satisfied — byte-exact draft produced, not "blocked").
- Not deployed, not committed, no snapshot/seed/DML/dry-run/CUT/VERIFY (QG1).
- doc 3 of 5; STOP after 5 files → route GPT/User. Self-advance PROHIBITED.
Companions: operational-framing, algorithm-analysis, test-result, authoring-report.