03 - Active-Scope / Fence / Section Extractor (recheck-6 blocker B)
Load-bearing copy: doc 00 §Active-scope / fence / section extractor. This is the full pseudocode + the fail-closed status table + worked tests. One deterministic algorithm produces the bytes/markers/ranges that feed every per-doc digest; "restricted to active scope" is no longer prose.
Algorithm
- Input = raw UTF-8 bytes of `get_document_for_rewrite(document_id)` at the pinned `kb_revision`.
- Normalize first: CRLF → LF, then lone CR → LF, over the whole byte string. All later steps run on normalized bytes (so line numbering is stable and CRLF↔LF flips don't change content).
- Line model: split on LF; lines are 1-based; `line_text(i)` = line i without its terminating LF; a trailing non-LF remainder is the final line.
- Marker recognition (exact, line-based): a line is a marker when `line_text(i)` matches exactly one of these grammars: `<!-- DOC_STATUS: <value> -->` (whole line); the literals `<!-- ENVELOPE:EXCLUDE-BEGIN -->` / `<!-- ENVELOPE:EXCLUDE-END -->`; `<!-- SUPERSEDED_NON_AUTHORITY BEGIN…-->` (prefix match, ends `-->`, no embedded `-->`) / the literal `<!-- SUPERSEDED_NON_AUTHORITY END -->`. Matching is not exempted inside code fences (an accidental standalone marker line therefore fails closed as a duplicate, which is the safe outcome).
- Fence pairing (single stack, flat; nesting unsupported): push on any BEGIN; a BEGIN while the stack is non-empty → `FENCE_NESTED_UNSUPPORTED`; an END must match the stack top, else `FENCE_UNBALANCED` (or `EXCLUDE_REGION_UNBALANCED` for an EXCLUDE END); a non-empty stack at end-of-scan → unbalanced; an EXCLUDE region in any doc other than self-host doc 00 → `EXCLUDE_REGION_UNBALANCED` (EXCLUDE is doc-00-only).
- DOC_STATUS cardinality: exactly one; zero → `ACTIVE_SCOPE_MARKER_MISSING`, two+ → `ACTIVE_SCOPE_MARKER_DUPLICATE`; an ACTIVE member's value must be `ACTIVE_AUTHORITY`.
- Overlap assertion: any SUPERSEDED range ∩ EXCLUDE range → `ACTIVE_SUPERSEDED_OVERLAP` (structurally impossible under flat non-nesting; kept as a defense-in-depth assertion).
- Active bytes: removal set = all SUPERSEDED ranges (BEGIN..END inclusive) ∪ (self-host only) the EXCLUDE range inclusive. `normalized_active_content` = concatenation, in ascending line order, of `line_text(i) + LF` for every line NOT in the removal set (re-emitting one LF per retained line fixes the trailing newline deterministically). Hash input = `FIX7_DOC_NORMALIZED_CONTENT_V1\t<document_id>\n` + these bytes.
- Section identity (by marker structure, NEVER heading text): `WHOLE_DOCUMENT` (no fences) / `WHOLE_DOCUMENT_MINUS_SUPERSEDED_FENCES` / (self-host) `WHOLE_DOCUMENT_MINUS_EXCLUDE_AND_SUPERSEDED`; ≠ the envelope's recorded value → `SECTION_ID_MISMATCH`; a differing line-range form → `SECTION_RANGE_MISMATCH`.
- Registry: emit `(document_id, marker_kind, marker_literal)` per marker; live `marker_fence_registry_sha256` ≠ sealed → `MARKER_REGISTRY_MISMATCH`.
- Fail-closed default: any status above → STOP, block authoring, return to Codex recheck. Never best-effort.
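The steps above can be sketched end to end in Python. This is a minimal illustration under stated assumptions, not the load-bearing implementation: the names `extract`, `classify`, `ExtractError`, the `self_host` flag, and the `marker_kind` strings are hypothetical; the DOC_STATUS value grammar (`[A-Z_]+`) is assumed; an EXCLUDE fence left open at end-of-scan is assumed to report `EXCLUDE_REGION_UNBALANCED`; and fence errors are raised before the cardinality check, an ordering the text does not fix. The `ACTIVE_AUTHORITY` value check is left to the caller because no status is named for it.

```python
import re

# Marker grammars; each must match a whole line of the normalized bytes.
# Assumption: a DOC_STATUS value is uppercase letters and underscores.
DOC_STATUS_RE = re.compile(rb"<!-- DOC_STATUS: ([A-Z_]+) -->")
EXACT = {
    b"<!-- ENVELOPE:EXCLUDE-BEGIN -->": ("EXCLUDE", "BEGIN"),
    b"<!-- ENVELOPE:EXCLUDE-END -->": ("EXCLUDE", "END"),
    b"<!-- SUPERSEDED_NON_AUTHORITY END -->": ("SUPERSEDED", "END"),
}
SUPERSEDED_BEGIN_PREFIX = b"<!-- SUPERSEDED_NON_AUTHORITY BEGIN"


class ExtractError(Exception):
    """Carries one fail-closed status; callers stop rather than recover."""


def classify(line: bytes):
    """Return (family, edge) for a marker line, else None."""
    if DOC_STATUS_RE.fullmatch(line):
        return ("DOC_STATUS", "")
    if line in EXACT:
        return EXACT[line]
    if (line.startswith(SUPERSEDED_BEGIN_PREFIX) and line.endswith(b"-->")
            and line.count(b"-->") == 1):  # no embedded "-->"
        return ("SUPERSEDED", "BEGIN")
    return None


def unbalanced(family: str) -> ExtractError:
    return ExtractError("EXCLUDE_REGION_UNBALANCED" if family == "EXCLUDE"
                        else "FENCE_UNBALANCED")


def extract(raw: bytes, document_id: str, self_host: bool = False) -> dict:
    data = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")  # normalize first
    lines = data.split(b"\n")
    if lines[-1] == b"":  # a final LF terminates the last line, no extra line
        lines.pop()
    stack, statuses, registry = [], [], []
    ranges = {"SUPERSEDED": set(), "EXCLUDE": set()}  # 1-based line numbers
    for i, line in enumerate(lines, start=1):
        marker = classify(line)
        if marker is None:
            continue
        family, edge = marker
        registry.append((document_id, family + ("_" + edge if edge else ""), line))
        if family == "DOC_STATUS":
            statuses.append(DOC_STATUS_RE.fullmatch(line).group(1))
        elif family == "EXCLUDE" and not self_host:
            raise ExtractError("EXCLUDE_REGION_UNBALANCED")  # doc-00-only
        elif edge == "BEGIN":
            if stack:
                raise ExtractError("FENCE_NESTED_UNSUPPORTED")
            stack.append((family, i))
        else:
            if not stack or stack[-1][0] != family:
                raise unbalanced(family)
            _, begin = stack.pop()
            ranges[family].update(range(begin, i + 1))  # BEGIN..END inclusive
    if stack:
        raise unbalanced(stack[-1][0])
    if len(statuses) != 1:
        raise ExtractError("ACTIVE_SCOPE_MARKER_MISSING" if not statuses
                           else "ACTIVE_SCOPE_MARKER_DUPLICATE")
    if ranges["SUPERSEDED"] & ranges["EXCLUDE"]:  # defense in depth
        raise ExtractError("ACTIVE_SUPERSEDED_OVERLAP")
    removed = ranges["SUPERSEDED"] | ranges["EXCLUDE"]
    active = b"".join(l + b"\n" for n, l in enumerate(lines, 1) if n not in removed)
    if ranges["EXCLUDE"]:
        section_id = "WHOLE_DOCUMENT_MINUS_EXCLUDE_AND_SUPERSEDED"
    elif ranges["SUPERSEDED"]:
        section_id = "WHOLE_DOCUMENT_MINUS_SUPERSEDED_FENCES"
    else:
        section_id = "WHOLE_DOCUMENT"
    prefix = b"FIX7_DOC_NORMALIZED_CONTENT_V1\t" + document_id.encode("utf-8") + b"\n"
    return {"section_id": section_id, "doc_status": statuses[0],
            "active": active, "hash_input": prefix + active, "registry": registry}


# CRLF input with one SUPERSEDED fence and no trailing newline.
doc = (b"<!-- DOC_STATUS: ACTIVE_AUTHORITY -->\r\nkeep\r\n"
       b"<!-- SUPERSEDED_NON_AUTHORITY BEGIN old draft -->\r\ndrop\r\n"
       b"<!-- SUPERSEDED_NON_AUTHORITY END -->\r\ntail")
result = extract(doc, "doc-03")
print(result["section_id"])
print(result["active"])
```

Raising on the first status keeps the sketch fail-closed by construction: there is no code path that returns partial bytes after an error.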
Required fail-closed statuses (all present)
ACTIVE_SCOPE_MARKER_MISSING, ACTIVE_SCOPE_MARKER_DUPLICATE, FENCE_UNBALANCED,
FENCE_NESTED_UNSUPPORTED, ACTIVE_SUPERSEDED_OVERLAP, SECTION_ID_MISMATCH, SECTION_RANGE_MISMATCH,
EXCLUDE_REGION_UNBALANCED, MARKER_REGISTRY_MISMATCH. (G-ACTIVE-SCOPE-EXTRACTOR, doc 06.)
Design decisions Codex flagged, now pinned
- Line numbering: 1-based, over normalized bytes (normalization precedes numbering).
- CRLF→LF timing: before any line/marker/range computation.
- Find markers: exact whole-line grammar; SUPERSEDED BEGIN carries a free description but is matched by prefix+suffix and its exact bytes are bound via `marker_literal` in the registry.
- Duplicate markers: fail closed (`*_DUPLICATE`).
- Nested / unbalanced / missing fences: fail closed (`FENCE_NESTED_UNSUPPORTED` / `FENCE_UNBALANCED` / `EXCLUDE_REGION_UNBALANCED`).
- Overlapping active/superseded: impossible under flat non-nesting; explicit `ACTIVE_SUPERSEDED_OVERLAP` assertion anyway.
- Document without marker: `ACTIVE_SCOPE_MARKER_MISSING`.
- Heading/section changes: identity is by marker structure, so heading edits change the content hash (correct) but never silently change identity; a recorded-vs-computed mismatch fails closed.
- Extraction is by explicit marker + computed descriptor, not heading id, not implicit byte range.
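The recorded-vs-computed comparison behind these decisions might look like the sketch below. Everything about the data shapes is an assumption made for illustration: the function names, the dictionary keys (`section_id`, `removed_ranges`, `registry`), the registry serialization (one tab-separated record per marker, in scan order), and the order in which the three checks run. Only the status names and `marker_fence_registry_sha256` come from the text.

```python
import hashlib


def registry_sha256(registry) -> str:
    # Assumed serialization: "document_id\tmarker_kind\tmarker_literal\n"
    # per marker, in scan order. The real sealed format may differ.
    h = hashlib.sha256()
    for document_id, marker_kind, marker_literal in registry:
        h.update(f"{document_id}\t{marker_kind}\t{marker_literal}\n".encode("utf-8"))
    return h.hexdigest()


def check_against_envelope(computed: dict, recorded: dict):
    """Return the first fail-closed status, or None when everything matches."""
    if computed["section_id"] != recorded["section_id"]:
        return "SECTION_ID_MISMATCH"
    if computed["removed_ranges"] != recorded["removed_ranges"]:
        return "SECTION_RANGE_MISMATCH"
    if registry_sha256(computed["registry"]) != recorded["marker_fence_registry_sha256"]:
        return "MARKER_REGISTRY_MISMATCH"
    return None


# A sealed envelope for a document with one SUPERSEDED fence on lines 3..5.
sealed_registry = [
    ("doc-03", "DOC_STATUS", "<!-- DOC_STATUS: ACTIVE_AUTHORITY -->"),
    ("doc-03", "SUPERSEDED_BEGIN", "<!-- SUPERSEDED_NON_AUTHORITY BEGIN old draft -->"),
    ("doc-03", "SUPERSEDED_END", "<!-- SUPERSEDED_NON_AUTHORITY END -->"),
]
recorded = {
    "section_id": "WHOLE_DOCUMENT_MINUS_SUPERSEDED_FENCES",
    "removed_ranges": [(3, 5)],
    "marker_fence_registry_sha256": registry_sha256(sealed_registry),
}

# Same structure, but the free description on the BEGIN marker was edited.
edited_registry = list(sealed_registry)
edited_registry[1] = ("doc-03", "SUPERSEDED_BEGIN",
                      "<!-- SUPERSEDED_NON_AUTHORITY BEGIN newer note -->")
computed = {
    "section_id": "WHOLE_DOCUMENT_MINUS_SUPERSEDED_FENCES",
    "removed_ranges": [(3, 5)],
    "registry": edited_registry,
}
print(check_against_envelope(computed, recorded))
```

Because `marker_literal` holds the exact marker bytes, editing only the description of a SUPERSEDED BEGIN line leaves the section descriptor and ranges unchanged yet still changes the registry digest, so the edit cannot pass unnoticed.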
Worked tests (doc 08, computed)
Synthetic docs exercised: WHOLE_DOCUMENT extracts cleanly; a self-host doc with an EXCLUDE region yields
WHOLE_DOCUMENT_MINUS_EXCLUDE_AND_SUPERSEDED; duplicate/missing DOC_STATUS, nested SUPERSEDED, unbalanced
EXCLUDE, and an EXCLUDE-inside-SUPERSEDED overlap each fail closed with the documented status; adding a
SUPERSEDED fence flips the computed section descriptor → SECTION_ID_MISMATCH vs a recorded WHOLE_DOCUMENT.