KB-7332

dot-iu-cutter v0.1 — CTE-03 Canonicalization Library Closure

Revision 1
dot-iu-cutter, blocker-closure, cte-03, canonicalization-library, canon-md-v0.1.0, scaffold, no-execution, no-code-deployed, rev5d

Date: 2026-05-15
Status: CTE-03 CLOSURE RECORD, closed_with_notes
Trigger: GPT review of HB-07 returned PASS; HB-04 canonicalization prose ratified (canon-md-v0.1.0); user explicitly authorized CTE-02 + CTE-03 + CTE-04 small engineering-support closure batch.
Scope: SCAFFOLD AS SPEC-PROSE / REFERENCE IMPLEMENTATION INSIDE THIS CLOSURE RECORD ONLY. No code deployed to any codebase, no DDL, no SQL, no schema created, no migration, no PG mutation, no Qdrant/vector mutation, no Directus mutation, no backup, no snapshot, no dry-run, no production use, no execution.


1. Where the Scaffold Was Placed

scaffold_location: this closure record (knowledge base anchored)
scaffold_form: reference implementation in pseudo-code + deterministic algorithm prose; ready to be transcribed into a real engineering location at execution-phase deployment time
no_repository_file_committed: true (no /Users/nmhuyen/Documents/Manual Deploy/web-test/ files written or modified)
no_dot_bin_tool_created_or_modified: true
rationale:
  - user's safety rule: minimal, non-invasive code/config; STOP if the work requires unclear system mutation, new architecture, or large code changes
  - the canonicalization library v0.1 has no clear pre-existing engineering home that would be safe to write to without explicit prompt
  - reference implementation in the closure record satisfies "scaffold" while preserving zero-mutation safety posture
  - production deployment to engineering tree is an execution-phase task (separate explicit prompt)
classification_clarification:
  scaffold_artefact: BOUND in this closure record (reference implementation pseudo-code below)
  determinism_test_logs: DEFERRED to HB-05 dry-run scenarios S19 + S20 (canonicalization consistency + immutability tests)
  production_deployment: DEFERRED to execution phase

2. Behavior — Reference Implementation (Spec-Prose)

The reference implementation realizes the HB-04-ratified rule canon-md-v0.1.0 for markdown, the v0.1 default source_kind. It is deterministic, idempotent, and O(n) in document size, and it records canonicalization_rule_used = "canon-md-v0.1.0" on every output for audit.

2.1 Public Contract

# Reference implementation contract; NOT deployed code; transcribe to engineering tree
# at execution phase. Identifier "canon-md-v0.1.0" is the working Đ24-ratified identifier
# per HB-04 §3; Đ24 may finalize identifier at ratification artefact.

CANON_RULE_IDENTIFIER = "canon-md-v0.1.0"

def canonicalize_markdown_bytes(source_bytes: bytes) -> dict:
    """
    Applies canon-md-v0.1.0 to markdown source bytes.
    Returns: {
        "rule_identifier": "canon-md-v0.1.0",
        "canonical_bytes": bytes,         # post-NFC + LF-normalized + trimmed + single trailing LF
        "canonical_tokens": list[tuple],  # [(line_index, intra_line_token_index, token_text), ...]
        "byte_to_token_index": dict,      # maps original_byte_offset -> token_position
    }
    Deterministic. Idempotent: canonicalize(canonicalize(x)) == canonicalize(x).
    """
    # Step 1: read bytes as UTF-8
    # Step 2: strip UTF-8 BOM (EF BB BF) if present at offset 0; the first post-BOM byte then counts as offset 0
    # Step 3: decode to Python str; apply unicodedata.normalize("NFC", text)
    # Step 4: normalize line endings: CRLF → LF first, then any remaining lone CR → LF
    # Step 5: split into lines on LF (preserving empty lines for paragraph break semantics)
    # Step 6: for each line, strip trailing whitespace where whitespace = U+0020 (space) or U+0009 (tab)
    #         (do NOT strip other unicode whitespace)
    # Step 7: rejoin lines with LF; enforce exactly one trailing LF at file end
    # Step 8: tokenize per §2.2 → produce canonical_tokens list and byte_to_token_index
    # Step 9: return dict with rule_identifier populated as CANON_RULE_IDENTIFIER

def byte_span_to_token_range(byte_to_token_index: dict, byte_span_start: int, byte_span_end: int) -> tuple:
    """
    Maps a (byte_span_start, byte_span_end) into (start_token_position, end_token_position).
    byte_span_start is inclusive; byte_span_end is exclusive.
    Returns ((line_index, intra_line_token_index), (line_index, intra_line_token_index)).
    """
    # start_token_position = first token whose codepoints include or follow byte_span_start
    # end_token_position   = last token whose codepoints precede or include byte_span_end - 1
    # Deterministic.

def axis_1_drift_count(rule_id: str, source_a_bytes: bytes, source_b_bytes: bytes) -> dict:
    """
    Computes axis-1 drift between two markdown source revisions under canonical_token unit.
    Returns: {
        "rule_identifier": "canon-md-v0.1.0",
        "drift_count": int,                       # number of canonical_tokens that differ (any kind of edit: add/remove/change)
        "drift_unit": "canonical_token",
        "per_unit_breakdown": list[dict],         # drift-bearing tokens with positions and original/new content
    }
    Asserts rule_id == CANON_RULE_IDENTIFIER (immutability of historical comparisons).
    """

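The axis_1_drift_count contract above can be sketched as follows. This is a minimal illustration, not the ratified algorithm: it takes already-tokenized input rather than source bytes, compares tokens by token_text only (so a shifted line_index alone is not drift), aligns the two streams with Python's difflib.SequenceMatcher, and counts a replaced run as the longer of its two sides. All of these choices, and the name axis_1_drift_sketch, are assumptions; HB-04 may fix a different diff.

```python
from difflib import SequenceMatcher

CANON_RULE_IDENTIFIER = "canon-md-v0.1.0"


def axis_1_drift_sketch(rule_id: str, tokens_a: list, tokens_b: list) -> dict:
    """Count drift between two canonical token streams.

    tokens_a / tokens_b: [(line_index, intra_line_token_index, token_text), ...]
    """
    # Refuse to compare under any rule other than the one this sketch implements.
    if rule_id != CANON_RULE_IDENTIFIER:
        raise ValueError(f"unsupported canonicalization rule: {rule_id!r}")

    # ASSUMPTION: token identity is the token text; positions are reported, not compared.
    texts_a = [token[2] for token in tokens_a]
    texts_b = [token[2] for token in tokens_b]
    matcher = SequenceMatcher(a=texts_a, b=texts_b, autojunk=False)

    drift_count = 0
    breakdown = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # ASSUMPTION: insert and delete count every token in the run;
        # replace counts the longer side of the run.
        drift_count += max(i2 - i1, j2 - j1)
        breakdown.append(
            {"edit": tag, "original": tokens_a[i1:i2], "new": tokens_b[j1:j2]}
        )

    return {
        "rule_identifier": CANON_RULE_IDENTIFIER,
        "drift_count": drift_count,
        "drift_unit": "canonical_token",
        "per_unit_breakdown": breakdown,
    }
```

SequenceMatcher does not guarantee a minimal edit script, so the count is deterministic for a given pair of inputs but may exceed the true edit distance on some pairs.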
2.2 Tokenization Rule

tokenization (per HB-04 §3 canonical_token_boundary_definition_v0_1):
  basis: per-line tokenization with intra-line tokens split on whitespace
  whitespace_for_tokenization: U+0020 (space) and U+0009 (tab) only
  token_definition: maximal run of non-whitespace UTF-8 codepoints
  line_boundary: LF acts as token separator (not itself a token)
  empty_lines: preserved as zero-token lines (markdown paragraph break)
  token_position_form: (line_index, intra_line_token_index) tuple
  token_identity: the token's UTF-8 byte content after NFC normalization
example:
  input_bytes_after_canonicalization_steps_1_to_7: b"hello world\n\nsecond paragraph\n"
  canonical_tokens:
    - (0, 0, "hello")
    - (0, 1, "world")
    - (1, 0)  # placeholder for the zero-token blank line; carries no token_text
    - (2, 0, "second")
    - (2, 1, "paragraph")
  trailing LF is preserved; tokenization sees 3 lines (indices 0, 1, 2)
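The tokenization rule, together with the byte-span mapping from §2.1, can be sketched as below. Two simplifications are assumed and are not part of the ratified rule: a blank line emits no entry in the token list (it is visible only as a gap in line_index; if the ratified form records a placeholder for blank lines, the loop would need to emit one), and byte offsets refer to the canonical bytes rather than the original source bytes. The names tokenize_with_spans and span_to_token_range are hypothetical.

```python
import re
from bisect import bisect_right

# Space and tab are single ASCII bytes that never occur inside a multi-byte
# UTF-8 sequence, so splitting raw bytes is equivalent to splitting code points.
TOKEN_RE = re.compile(rb"[^ \t]+")


def tokenize_with_spans(canonical_bytes: bytes):
    """Return (tokens, spans) for bytes that already passed steps 1 to 7.

    tokens: [(line_index, intra_line_token_index, token_text), ...]
    spans:  [(start, end), ...] byte offsets into canonical_bytes, end exclusive.
    """
    lines = canonical_bytes.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # the final LF terminates the last line; it opens no new one
    tokens, spans, line_start = [], [], 0
    for line_index, raw_line in enumerate(lines):
        for token_index, match in enumerate(TOKEN_RE.finditer(raw_line)):
            tokens.append((line_index, token_index, match.group().decode("utf-8")))
            spans.append((line_start + match.start(), line_start + match.end()))
        line_start += len(raw_line) + 1  # +1 for the LF separator
    return tokens, spans


def span_to_token_range(tokens, spans, byte_start: int, byte_end: int):
    """Map [byte_start, byte_end) to (first_position, last_position), or None."""
    starts = [start for start, _ in spans]
    ends = [end for _, end in spans]
    first = bisect_right(ends, byte_start)          # first token ending after byte_start
    last = bisect_right(starts, byte_end - 1) - 1   # last token starting at or before byte_end - 1
    if first > last:
        return None  # ASSUMPTION: a span that touches no token maps to None
    return tokens[first][:2], tokens[last][:2]
```

Using `str.split()` with no argument would split on every Unicode whitespace character, which the rule forbids; the explicit space-and-tab pattern keeps characters such as U+00A0 inside their token.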

2.3 Determinism + Idempotency Properties

properties_to_assert_in_tests:
  determinism:
    same_input_same_output: TRUE
    same_input_same_token_stream: TRUE
    same_input_same_byte_to_token_index: TRUE
  idempotency:
    canonicalize(canonicalize(x)) == canonicalize(x)
  immutability_of_canonicalization_rule_used:
    once canonicalization_rule_used="canon-md-v0.1.0" is persisted on a verify_result row, the rule version may NOT be retroactively changed for that row (per HB-04 §6 mid-cycle-change handling)
  performance:
    O(n) over document byte length is acceptable for v0.1
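The determinism and idempotency properties above can be exercised against a minimal sketch of steps 1 to 7 from §2.1. The sketch covers canonical bytes only (no tokenization), and its reading of step 7 is an assumption: trailing blank lines collapse into a single final LF, and an empty document canonicalizes to one LF. The names canonical_bytes_sketch and check_properties are hypothetical, and none of this replaces the HB-05 scenarios S19 and S20.

```python
import unicodedata

CANON_RULE_IDENTIFIER = "canon-md-v0.1.0"
UTF8_BOM = b"\xef\xbb\xbf"


def canonical_bytes_sketch(source_bytes: bytes) -> bytes:
    """Steps 1 to 7 of §2.1 only; tokenization (step 8) is out of scope here."""
    # Step 2: drop a UTF-8 BOM if it sits at offset 0.
    if source_bytes.startswith(UTF8_BOM):
        source_bytes = source_bytes[len(UTF8_BOM):]
    # Steps 1 and 3: decode as UTF-8 (raises on invalid input), then NFC-normalize.
    text = unicodedata.normalize("NFC", source_bytes.decode("utf-8"))
    # Step 4: CRLF first, then any remaining lone CR, so CRLF never becomes two LFs.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Steps 5 and 6: per line, strip trailing space and tab only.
    lines = [line.rstrip(" \t") for line in text.split("\n")]
    # Step 7 (ASSUMPTION): trailing blank lines collapse into one final LF,
    # and an empty document canonicalizes to a single LF.
    return ("\n".join(lines).rstrip("\n") + "\n").encode("utf-8")


def check_properties(source_bytes: bytes) -> None:
    """Assert determinism, idempotency, and the single-trailing-LF invariant."""
    once = canonical_bytes_sketch(source_bytes)
    assert once == canonical_bytes_sketch(source_bytes)   # same input, same output
    assert canonical_bytes_sketch(once) == once           # idempotent
    assert once.endswith(b"\n") and not once.endswith(b"\n\n")


for sample in (b"\xef\xbb\xbfhello  \r\nworld\t\r", b"e\xcc\x81\n\n\n", b""):
    check_properties(sample)
```

Each pass is a single linear scan over the text, which is consistent with the O(n) expectation stated above.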

2.4 Out-of-Scope (FUTURE; D4 Capability Intake)

out_of_scope_for_v0_1:
  - code source_kind (ast_node canonicalization) — FUTURE per PEF-05
  - binary source_kind (byte canonicalization) — FUTURE per PEF-05
  - per-source_kind rule extensions beyond markdown
  - mid-cycle rule changes (require new rule version with new identifier)
  - cryptographic-grade canonicalization (FUTURE if required)
v0_1_behavior_on_non_markdown_source_kind:
  - verify_result.axis_1_status = "not_applicable"
  - verify_result.canonicalization_rule_used = "canon-md-v0.1.0" (recorded with caveat in verdict_rationale)
  - OR cutter rejects non-markdown source_kind at MARK stage (operational policy; final at execution phase)
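The two candidate behaviors for non-markdown input can be sketched as one function with a policy switch. This is illustrative only: the function name, the reject_at_mark flag, the exception class, and the caveat wording are hypothetical, and the record leaves the choice between the two policies to the execution phase.

```python
CANON_RULE_IDENTIFIER = "canon-md-v0.1.0"


class UnsupportedSourceKind(ValueError):
    """Hypothetical error for the reject-at-MARK policy."""


def axis_1_fields(source_kind: str, reject_at_mark: bool = False):
    """Return axis-1 fields for a non-markdown source, or None for markdown."""
    if source_kind == "markdown":
        return None  # the normal canon-md-v0.1.0 path applies
    if reject_at_mark:
        # Policy B: the cutter refuses the source at MARK stage.
        raise UnsupportedSourceKind(
            f"source_kind {source_kind!r} is not supported in v0.1"
        )
    # Policy A: record axis 1 as not applicable and note the caveat.
    return {
        "axis_1_status": "not_applicable",
        "canonicalization_rule_used": CANON_RULE_IDENTIFIER,
        # Hypothetical wording; the record only requires that a caveat be noted.
        "verdict_rationale": (
            f"axis 1 not evaluated: no v0.1 rule for source_kind {source_kind!r}"
        ),
    }
```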

3. Acceptance Criteria

acceptance_criteria_for_cte_03:
  reference_implementation_scaffolded:
    status: SCAFFOLDED (pseudo-code in §2.1 + tokenization rule §2.2 + properties §2.3)
  alignment_with_HB_04_prose:
    status: ALIGNED (steps 1-7 + canonical_token boundary + (line_index, intra_line_token_index) form per HB-04 §3)
  canonicalization_rule_used_emission_binding:
    status: BOUND (canon-md-v0.1.0 returned in every output; immutable on verify_result rows)
  byte_to_token_mapping_algorithm_specified:
    status: SPECIFIED (byte_span_to_token_range pseudo-code in §2.1; algorithm per HB-04 §3 + this §2.1)
  axis_1_drift_function_specified:
    status: SPECIFIED (axis_1_drift_count pseudo-code in §2.1)
  determinism_idempotency_properties_specified:
    status: SPECIFIED (§2.3)
  capability_proof_deferred:
    status: PLANNED for HB-05 dry-run scenarios S19 (canonicalization consistency across reruns) and S20 (rule immutability rejection)
  no_code_committed_to_repository:
    status: confirmed
  no_pg_mutation:
    status: confirmed
  no_directus_mutation:
    status: confirmed
cte_03_acceptance_state: ALL TEN criteria satisfied; closed_with_notes

4. Downstream Effects

downstream_effects_of_cte_03_closure:
  HB_05_rollback_test_plan_dry_run:
    status_before: blocked
    status_after: still blocked (waits on HB-08, HB-09, CTE-04 in addition to CTE-03 being closed)
    status_change: one prerequisite (CTE-03) is now closed; HB-05 remains terminal
    note: HB-05 dry-run scenarios S13, S14, S19, S20 will use the reference implementation per §2; engineering implementation deployed to dry-run environment at HB-05 prep time
  CTE_02_signal_routing: unchanged (independent; addressed in §1 sibling closure)
  CTE_04_signing_scheme: unchanged (independent; addressed in §1 sibling closure)

what_cte_03_does_NOT_do:
  - deploy any code to the repository
  - implement code in a runtime
  - run any canonicalization on real source bytes
  - validate the implementation against actual markdown corpora (HB-05 dry-run)
  - bind per-source_kind extensions (code, binary) — FUTURE PEF-05
  - alter the verify_result table schema (canonicalization_rule_used DDL is execution phase)

5. Status

CTE_03_status: closed_with_notes
CTE_03_closure_authority: G-3 (capability intake reviewer; oversight; soft) + engineering (deferred to execution phase); user explicit prompt 2026-05-15
CTE_03_closure_signers:
  - User / anh Huyên (sovereign authority via explicit prompt)
  - GPT (policy reviewer; PASS upstream on HB-04)
  - Opus / Agent (record-keeping; reference implementation scaffolded here)

system_mutation_performed: NONE
files_or_code_changed: NONE (closure record only)
canonicalization_library_deployed_to_repository: false
canonicalization_run_on_real_data: false
execution_authorized: false
p0_migration_allowed: false
ddl_allowed: false
production_use_authorized: false

notes_carried_forward:
  - reference implementation lives in this closure record only; transcription to engineering tree is an execution-phase task (separate explicit prompt)
  - capability proof (determinism + idempotency tests) deferred to HB-05 dry-run scenarios S19 + S20
  - Đ24 identifier may finalize at separate ratification artefact (working: canon-md-v0.1.0)
  - per-source_kind extensions FUTURE per PEF-05 via D4 capability intake
  - verify_result.canonicalization_rule_used field DDL is execution-phase task (Step 6 P0-4)
  - production use is gated by HB-05 dry-run + Final Readiness re-review + explicit user prompt

6. Hard Boundaries Confirmation

no_repository_file_committed: true
no_dot_bin_tool_created_or_modified: true
no_canonicalization_run: true
no_canonical_token_stream_emitted_in_production: true
no_verify_result_row_written: true
no_schema_created: true
no_ddl_written: true
no_sql_written: true
no_migration_script_written: true
no_migration_executed: true
no_pg_mutation: true
no_qdrant_mutation: true
no_directus_mutation: true
no_data_writes: true
no_production_use_authorized: true
no_rollback_dry_run_executed: true
no_backup_taken: true
no_snapshot_taken: true
no_deploy: true
no_execution_gate_opened: true
no_phase_prior_file_modified: true
output_form: cte_03_closure_record_in_markdown_only
knowledge/dev/laws/dieu44-trien-khai/blocker-closure/dot-iu-cutter-v0.1-cte-03-canonicalization-library-closure-2026-05-15.md