KB-6ABF

Active-Scope / Fence / Section Extractor (recheck-6 blocker B)

Revision 1

03 - Active-Scope / Fence / Section Extractor (recheck-6 blocker B)

Load-bearing copy: doc 00 §Active-scope / fence / section extractor. This doc gives the full pseudocode, the fail-closed status table, and worked tests. One deterministic algorithm produces the bytes, markers, and ranges that feed every per-doc digest, so "restricted to active scope" is a computed result rather than prose.

Algorithm

  1. Input = raw UTF-8 bytes of get_document_for_rewrite(document_id) at the pinned kb_revision.
  2. Normalize first: CRLF → LF, then lone CR → LF, over the whole byte string. All later steps run on normalized bytes (so line numbering is stable and CRLF↔LF flips don't change content).
  3. Line model: split on LF; lines are 1-based; line_text(i) = line i without its terminating LF; a trailing non-LF remainder is the final line.
  4. Marker recognition (exact, line-based): a line is a marker only if line_text(i) matches exactly one of these grammars: <!-- DOC_STATUS: <value> --> (whole line); the literals <!-- ENVELOPE:EXCLUDE-BEGIN --> / <!-- ENVELOPE:EXCLUDE-END -->; <!-- SUPERSEDED_NON_AUTHORITY BEGIN…--> (prefix match, ends with -->, no embedded -->); or the literal <!-- SUPERSEDED_NON_AUTHORITY END -->. Matching is not exempted inside code fences, so an accidental standalone marker line inside a code block fails closed (for example, a second DOC_STATUS line is reported as a duplicate), which is the safe outcome.
  5. Fence pairing (single stack, flat — nesting unsupported): push on any BEGIN; a BEGIN while the stack is non-empty → FENCE_NESTED_UNSUPPORTED; an END must match the stack top, else FENCE_UNBALANCED (or EXCLUDE_REGION_UNBALANCED for an EXCLUDE END); a non-empty stack at end-of-scan → unbalanced; an EXCLUDE region in any doc other than self-host doc 00 → EXCLUDE_REGION_UNBALANCED (EXCLUDE is doc-00-only).
  6. DOC_STATUS cardinality: exactly one — zero → ACTIVE_SCOPE_MARKER_MISSING, two+ → ACTIVE_SCOPE_MARKER_DUPLICATE; an ACTIVE member's value must be ACTIVE_AUTHORITY.
  7. Overlap assertion: any SUPERSEDED range ∩ EXCLUDE range → ACTIVE_SUPERSEDED_OVERLAP (structurally impossible under flat non-nesting; kept as a defense-in-depth assertion).
  8. Active bytes: removal set = all SUPERSEDED ranges (BEGIN..END inclusive) ∪ (self-host only) the EXCLUDE range inclusive. normalized_active_content = concatenation, in ascending line order, of line_text(i) + LF for every line NOT in the removal set (re-emitting one LF per retained line fixes the trailing newline deterministically). Hash input = FIX7_DOC_NORMALIZED_CONTENT_V1\t<document_id>\n + these bytes.
  9. Section identity (by marker structure, NEVER heading text): WHOLE_DOCUMENT (no fences) / WHOLE_DOCUMENT_MINUS_SUPERSEDED_FENCES / (self-host) WHOLE_DOCUMENT_MINUS_EXCLUDE_AND_SUPERSEDED; ≠ the envelope's recorded value → SECTION_ID_MISMATCH; a differing line-range form → SECTION_RANGE_MISMATCH.
  10. Registry: emit (document_id, marker_kind, marker_literal) per marker; live marker_fence_registry_sha256 ≠ sealed → MARKER_REGISTRY_MISMATCH.
  11. Fail-closed default: any status above → STOP, block authoring, return to Codex recheck. Never best-effort.
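Steps 2 through 8 above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the sealed implementation: the names `extract`, `classify`, `ExtractorStop`, and the `self_host` flag are hypothetical; the DOC_STATUS value is assumed to be a single whitespace-free token; and an unclosed fence at end-of-scan is assumed to report the status matching its fence kind. The overlap assertion, section identity, registry, and the ACTIVE_AUTHORITY value check (steps 6, 7, 9, 10) are left out for brevity.

```python
import re

DOC_STATUS_RE = re.compile(r"<!-- DOC_STATUS: \S+ -->")  # value grammar is an assumption
EXCLUDE_BEGIN = "<!-- ENVELOPE:EXCLUDE-BEGIN -->"
EXCLUDE_END = "<!-- ENVELOPE:EXCLUDE-END -->"
SUPERSEDED_BEGIN_PREFIX = "<!-- SUPERSEDED_NON_AUTHORITY BEGIN"
SUPERSEDED_END = "<!-- SUPERSEDED_NON_AUTHORITY END -->"
HASH_PREFIX = "FIX7_DOC_NORMALIZED_CONTENT_V1\t{document_id}\n"


class ExtractorStop(Exception):
    """Fail-closed stop; args[0] is one of the documented status names."""


def normalize(raw):
    # Step 2: CRLF -> LF first, then any remaining lone CR -> LF.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def split_lines(data):
    # Step 3: split on LF; a trailing non-LF remainder is the final line.
    parts = data.decode("utf-8").split("\n")
    if parts[-1] == "":
        parts.pop()
    return parts


def classify(text):
    # Step 4: whole-line match; returns (fence, role) or None for ordinary text.
    if DOC_STATUS_RE.fullmatch(text):
        return ("DOC_STATUS", "MARK")
    if text == EXCLUDE_BEGIN:
        return ("EXCLUDE", "BEGIN")
    if text == EXCLUDE_END:
        return ("EXCLUDE", "END")
    if text == SUPERSEDED_END:
        return ("SUPERSEDED", "END")
    if (text.startswith(SUPERSEDED_BEGIN_PREFIX) and text.endswith("-->")
            and text.count("-->") == 1):
        return ("SUPERSEDED", "BEGIN")
    return None


def unbalanced(fence):
    return "EXCLUDE_REGION_UNBALANCED" if fence == "EXCLUDE" else "FENCE_UNBALANCED"


def extract(raw, document_id, self_host=False):
    lines = split_lines(normalize(raw))
    stack, removed, status_count = [], set(), 0
    for i, text in enumerate(lines, start=1):
        marker = classify(text)
        if marker is None:
            continue
        fence, role = marker
        if role == "MARK":
            status_count += 1
        elif role == "BEGIN":
            # Step 5: flat pairing, so a BEGIN inside an open fence is rejected.
            if stack:
                raise ExtractorStop("FENCE_NESTED_UNSUPPORTED")
            if fence == "EXCLUDE" and not self_host:
                raise ExtractorStop("EXCLUDE_REGION_UNBALANCED")
            stack.append((fence, i))
        else:
            if not stack or stack[-1][0] != fence:
                raise ExtractorStop(unbalanced(fence))
            _, begin = stack.pop()
            removed.update(range(begin, i + 1))  # BEGIN..END inclusive
    if stack:
        raise ExtractorStop(unbalanced(stack[-1][0]))  # assumed status choice
    # Step 6: exactly one DOC_STATUS marker.
    if status_count == 0:
        raise ExtractorStop("ACTIVE_SCOPE_MARKER_MISSING")
    if status_count > 1:
        raise ExtractorStop("ACTIVE_SCOPE_MARKER_DUPLICATE")
    # Step 8: re-emit one LF per retained line, in ascending line order.
    active = "".join(t + "\n" for i, t in enumerate(lines, start=1)
                     if i not in removed).encode("utf-8")
    return active, HASH_PREFIX.format(document_id=document_id).encode("utf-8") + active
```

Calling `extract(raw_bytes, "some-document-id")` returns the active bytes and the hash input, or raises `ExtractorStop` with a status name. The digest function applied to the hash input is not shown because this section does not name it.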

Required fail-closed statuses (all present)

ACTIVE_SCOPE_MARKER_MISSING, ACTIVE_SCOPE_MARKER_DUPLICATE, FENCE_UNBALANCED, FENCE_NESTED_UNSUPPORTED, ACTIVE_SUPERSEDED_OVERLAP, SECTION_ID_MISMATCH, SECTION_RANGE_MISMATCH, EXCLUDE_REGION_UNBALANCED, MARKER_REGISTRY_MISMATCH. (G-ACTIVE-SCOPE-EXTRACTOR, doc 06.)
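One way to keep this list closed in code is an enumeration plus a single exception type, so a misspelled status cannot be raised. The sketch below is illustrative only; `FailClosedStatus`, `ExtractorStop`, and `gate` are hypothetical names, and the shape of the returned dictionary is an assumption.

```python
from enum import Enum


class FailClosedStatus(Enum):
    ACTIVE_SCOPE_MARKER_MISSING = "ACTIVE_SCOPE_MARKER_MISSING"
    ACTIVE_SCOPE_MARKER_DUPLICATE = "ACTIVE_SCOPE_MARKER_DUPLICATE"
    FENCE_UNBALANCED = "FENCE_UNBALANCED"
    FENCE_NESTED_UNSUPPORTED = "FENCE_NESTED_UNSUPPORTED"
    ACTIVE_SUPERSEDED_OVERLAP = "ACTIVE_SUPERSEDED_OVERLAP"
    SECTION_ID_MISMATCH = "SECTION_ID_MISMATCH"
    SECTION_RANGE_MISMATCH = "SECTION_RANGE_MISMATCH"
    EXCLUDE_REGION_UNBALANCED = "EXCLUDE_REGION_UNBALANCED"
    MARKER_REGISTRY_MISMATCH = "MARKER_REGISTRY_MISMATCH"


class ExtractorStop(Exception):
    """Raised by any extractor step that hits a documented status."""

    def __init__(self, status):
        super().__init__(status.value)
        self.status = status


def gate(extract_step):
    # Fail closed: any documented status blocks authoring; no partial result leaks out.
    try:
        result = extract_step()
    except ExtractorStop as stop:
        return {"authoring_allowed": False, "result": None, "status": stop.status}
    return {"authoring_allowed": True, "result": result, "status": None}


def broken_step():
    raise ExtractorStop(FailClosedStatus.FENCE_UNBALANCED)


outcome = gate(broken_step)
```

Here `outcome["authoring_allowed"]` is `False` and `outcome["status"]` is `FailClosedStatus.FENCE_UNBALANCED`; a caller would hand that status back to the recheck rather than attempt a best-effort extraction.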

Design decisions Codex flagged, now pinned

  • Line numbering: 1-based, over normalized bytes (normalization precedes numbering).
  • CRLF→LF timing: before any line/marker/range computation.
  • Finding markers: exact whole-line grammar; SUPERSEDED BEGIN carries a free-text description, so it is matched by prefix + suffix, and its exact bytes are bound via marker_literal in the registry.
  • Duplicate markers: fail closed (*_DUPLICATE).
  • Nested / unbalanced / missing fences: fail closed (FENCE_NESTED_UNSUPPORTED / FENCE_UNBALANCED / EXCLUDE_REGION_UNBALANCED).
  • Overlapping active/superseded: impossible under flat non-nesting; explicit ACTIVE_SUPERSEDED_OVERLAP assertion anyway.
  • Document without marker: ACTIVE_SCOPE_MARKER_MISSING.
  • Heading/section changes: identity is by marker structure, so heading edits change the content hash (correct) but never silently change identity; a recorded-vs-computed mismatch fails closed.
  • Extraction is by explicit marker + computed descriptor, not heading id, not implicit byte range.
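The first two decisions (1-based numbering over normalized bytes, and CRLF→LF before anything else) can be checked with a small Python sketch. The helper names are hypothetical. It also shows why the replacement order matters: converting lone CR first would turn every CRLF into two LFs and shift every later line number.

```python
def normalize(raw):
    # CRLF -> LF first, then any remaining lone CR -> LF.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def numbered_lines(data):
    # 1-based lines over already-normalized bytes; a trailing LF does not
    # create an extra empty final line.
    parts = data.split(b"\n")
    if parts[-1] == b"":
        parts.pop()
    return list(enumerate(parts, start=1))


lf = b"alpha\nbeta\ngamma\n"
crlf = b"alpha\r\nbeta\r\ngamma\r\n"
cr = b"alpha\rbeta\rgamma\r"

# All three line-ending styles yield identical numbered lines.
same = numbered_lines(normalize(lf))
assert numbered_lines(normalize(crlf)) == same
assert numbered_lines(normalize(cr)) == same

# Wrong order: CR -> LF before CRLF -> LF doubles every line break.
wrong = crlf.replace(b"\r", b"\n").replace(b"\r\n", b"\n")
assert len(numbered_lines(wrong)) == 6
assert len(same) == 3
```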

Worked tests (doc 08, computed)

Synthetic docs exercised: WHOLE_DOCUMENT extracts cleanly; a self-host doc with an EXCLUDE region yields WHOLE_DOCUMENT_MINUS_EXCLUDE_AND_SUPERSEDED; duplicate/missing DOC_STATUS, nested SUPERSEDED, unbalanced EXCLUDE, and an EXCLUDE-inside-SUPERSEDED overlap each fail closed with the documented status; adding a SUPERSEDED fence flips the computed section descriptor → SECTION_ID_MISMATCH vs a recorded WHOLE_DOCUMENT.
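The last case (a new SUPERSEDED fence flipping the descriptor) can be reproduced with a short Python sketch. It assumes fences have already been validated as balanced and flat, and that the descriptor is chosen purely from which fence kinds are present; `computed_descriptor` and `check_section_id` are hypothetical names, and the synthetic lines are made up for illustration.

```python
EXCLUDE_BEGIN = "<!-- ENVELOPE:EXCLUDE-BEGIN -->"
SUPERSEDED_BEGIN_PREFIX = "<!-- SUPERSEDED_NON_AUTHORITY BEGIN"


def computed_descriptor(lines):
    # Identity comes from marker structure only; heading text is never read.
    has_exclude = any(t == EXCLUDE_BEGIN for t in lines)
    has_superseded = any(
        t.startswith(SUPERSEDED_BEGIN_PREFIX) and t.endswith("-->") for t in lines
    )
    if has_exclude:
        return "WHOLE_DOCUMENT_MINUS_EXCLUDE_AND_SUPERSEDED"
    if has_superseded:
        return "WHOLE_DOCUMENT_MINUS_SUPERSEDED_FENCES"
    return "WHOLE_DOCUMENT"


def check_section_id(lines, recorded):
    # Recorded-vs-computed comparison; any difference fails closed.
    return "OK" if computed_descriptor(lines) == recorded else "SECTION_ID_MISMATCH"


base = ["<!-- DOC_STATUS: ACTIVE_AUTHORITY -->", "# Title", "body"]
heading_edit = ["<!-- DOC_STATUS: ACTIVE_AUTHORITY -->", "# Renamed title", "body"]
fenced = base + [
    "<!-- SUPERSEDED_NON_AUTHORITY BEGIN old draft -->",
    "stale text",
    "<!-- SUPERSEDED_NON_AUTHORITY END -->",
]

assert check_section_id(base, "WHOLE_DOCUMENT") == "OK"
assert check_section_id(heading_edit, "WHOLE_DOCUMENT") == "OK"
assert check_section_id(fenced, "WHOLE_DOCUMENT") == "SECTION_ID_MISMATCH"
```

The heading edit leaves the descriptor unchanged (only the content hash would move), while the added fence changes the computed descriptor and is caught against the recorded WHOLE_DOCUMENT.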

knowledge/dev/reports/architecture/t1-fix7-blueprint-patch-after-codex-recheck-6-byte-exact-envelope-2026-06-09/03-active-scope-fence-section-extractor.md