dot-iu-cutter v0.1 — CTE-03 Canonicalization Library Closure
Date: 2026-05-15
Status: CTE-03 CLOSURE RECORD (closed_with_notes)
Trigger: GPT review of HB-07 returned PASS; HB-04 canonicalization prose ratified (canon-md-v0.1.0); user explicitly authorized CTE-02 + CTE-03 + CTE-04 small engineering-support closure batch.
Scope: SCAFFOLD AS SPEC-PROSE / REFERENCE IMPLEMENTATION INSIDE THIS CLOSURE RECORD ONLY. No code deployed to any codebase, no DDL, no SQL, no schema created, no migration, no PG mutation, no Qdrant/vector mutation, no Directus mutation, no backup, no snapshot, no dry-run, no production use, no execution.
1. Where the Scaffold Was Placed
scaffold_location: this closure record (knowledge base anchored)
scaffold_form: reference implementation in pseudo-code + deterministic algorithm prose; ready to be transcribed into a real engineering location at execution-phase deployment time
no_repository_file_committed: true (no /Users/nmhuyen/Documents/Manual Deploy/web-test/ files written or modified)
no_dot_bin_tool_created_or_modified: true
rationale:
- user's safety rule: minimal non-invasive code/config; STOP if the work requires an unclear system mutation, new architecture, or large code changes
- the canonicalization library v0.1 has no clear pre-existing engineering home that would be safe to write to without explicit prompt
- reference implementation in the closure record satisfies "scaffold" while preserving zero-mutation safety posture
- production deployment to engineering tree is an execution-phase task (separate explicit prompt)
classification_clarification:
scaffold_artefact: BOUND in this closure record (reference implementation pseudo-code below)
determinism_test_logs: DEFERRED to HB-05 dry-run scenarios S19 + S20 (canonicalization consistency + immutability tests)
production_deployment: DEFERRED to execution phase
2. Behavior — Reference Implementation (Spec-Prose)
The reference implementation realizes HB-04 ratified canon-md-v0.1.0 for the markdown source_kind v0.1 default. It is deterministic, idempotent, O(n) over document size, and persists canonicalization_rule_used = "canon-md-v0.1.0" on every produced output for audit.
2.1 Public Contract
# Reference implementation contract; NOT deployed code; transcribe to engineering tree
# at execution phase. Identifier "canon-md-v0.1.0" is the working Đ24-ratified identifier
# per HB-04 §3; Đ24 may finalize identifier at ratification artefact.
CANON_RULE_IDENTIFIER = "canon-md-v0.1.0"
def canonicalize_markdown_bytes(source_bytes: bytes) -> dict:
    """
    Applies canon-md-v0.1.0 to markdown source bytes.
    Returns: {
        "rule_identifier": "canon-md-v0.1.0",
        "canonical_bytes": bytes,         # post-NFC + LF-normalized + trimmed + single trailing LF
        "canonical_tokens": list[tuple],  # [(line_index, intra_line_token_index, token_text), ...]
        "byte_to_token_index": dict,      # maps original_byte_offset -> token_position
    }
    Deterministic. Idempotent: canonicalize(canonicalize(x)) == canonicalize(x).
    """
    # Step 1: read bytes as UTF-8
    # Step 2: strip UTF-8 BOM (EF BB BF) if present at offset 0; record offset 0 of post-BOM
    # Step 3: decode to Python str; apply unicodedata.normalize("NFC", text)
    # Step 4: normalize line endings: CRLF → LF first, then any remaining lone CR → LF
    # Step 5: split into lines on LF (preserving empty lines for paragraph break semantics)
    # Step 6: for each line, strip trailing whitespace where whitespace = U+0020 (space) or U+0009 (tab)
    #         (do NOT strip other unicode whitespace)
    # Step 7: rejoin lines with LF; enforce exactly one trailing LF at file end
    # Step 8: tokenize per §2.2 → produce canonical_tokens list and byte_to_token_index
    # Step 9: return dict with rule_identifier populated as CANON_RULE_IDENTIFIER

def byte_span_to_token_range(byte_to_token_index: dict, byte_span_start: int, byte_span_end: int) -> tuple:
    """
    Maps a (byte_span_start, byte_span_end) into (start_token_position, end_token_position).
    byte_span_start is inclusive; byte_span_end is exclusive.
    Returns ((line_index, intra_line_token_index), (line_index, intra_line_token_index)).
    """
    # start_token_position = first token whose codepoints include or follow byte_span_start
    # end_token_position = last token whose codepoints precede or include byte_span_end - 1
    # Deterministic.

def axis_1_drift_count(rule_id: str, source_a_bytes: bytes, source_b_bytes: bytes) -> dict:
    """
    Computes axis-1 drift between two markdown source revisions under canonical_token unit.
    Returns: {
        "rule_identifier": "canon-md-v0.1.0",
        "drift_count": int,               # number of canonical_tokens that differ (any kind of edit: add/remove/change)
        "drift_unit": "canonical_token",
        "per_unit_breakdown": list[dict], # drift-bearing tokens with positions and original/new content
    }
    Asserts rule_id == CANON_RULE_IDENTIFIER (immutability of historical comparisons).
    """
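The contract above can be sketched in runnable form. What follows is a minimal, non-authoritative Python sketch of steps 1-7 and of the axis-1 drift count; it is not the ratified implementation and, like the rest of this record, is not deployed anywhere. Several points are assumptions that this record does not state: the names `canonical_bytes_sketch`, `token_texts`, and `axis_1_drift_count_sketch` are hypothetical; trailing blank lines are assumed to collapse into the single trailing LF; empty input is assumed to canonicalize to a lone LF; drift is assumed to be counted from a `difflib.SequenceMatcher` alignment over token text only (positions ignored), with a replaced run counted as the longer of its two sides; and a rule mismatch raises `ValueError` rather than using a bare `assert`.

```python
import difflib
import re
import unicodedata

CANON_RULE_IDENTIFIER = "canon-md-v0.1.0"
UTF8_BOM = b"\xef\xbb\xbf"


def canonical_bytes_sketch(source_bytes: bytes) -> bytes:
    """Steps 1-7 of the contract, under the assumptions listed in the lead-in."""
    # Step 2: drop a UTF-8 BOM only when it sits at offset 0.
    if source_bytes.startswith(UTF8_BOM):
        source_bytes = source_bytes[len(UTF8_BOM):]
    # Steps 1 and 3: strict UTF-8 decode, then NFC normalization.
    text = unicodedata.normalize("NFC", source_bytes.decode("utf-8"))
    # Step 4: CRLF first, so a CRLF pair never turns into two LFs.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Steps 5-6: split on LF only; strip trailing space/tab, nothing else.
    lines = [line.rstrip(" \t") for line in text.split("\n")]
    # Step 7: rejoin and enforce exactly one trailing LF (assumption:
    # trailing blank lines are folded into that single LF).
    return ("\n".join(lines).rstrip("\n") + "\n").encode("utf-8")


def token_texts(canonical: bytes) -> list[str]:
    """Flat sequence of token texts; space, tab and LF act as separators."""
    return re.findall(r"[^ \t\n]+", canonical.decode("utf-8"))


def axis_1_drift_count_sketch(rule_id: str, source_a_bytes: bytes,
                              source_b_bytes: bytes) -> dict:
    """Count differing canonical tokens between two revisions (assumed metric)."""
    if rule_id != CANON_RULE_IDENTIFIER:
        raise ValueError(f"unsupported canonicalization rule: {rule_id!r}")
    a = token_texts(canonical_bytes_sketch(source_a_bytes))
    b = token_texts(canonical_bytes_sketch(source_b_bytes))
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    drift, breakdown = 0, []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # A replaced run counts as the longer side; inserts and deletes
        # count every token they add or remove.
        drift += max(i2 - i1, j2 - j1)
        breakdown.append({"kind": tag, "a_range": (i1, i2), "b_range": (j1, j2),
                          "original": a[i1:i2], "new": b[j1:j2]})
    return {
        "rule_identifier": CANON_RULE_IDENTIFIER,
        "drift_count": drift,
        "drift_unit": "canonical_token",
        "per_unit_breakdown": breakdown,
    }
```

Under these assumptions, a revision that differs only in line endings or trailing whitespace yields a drift count of zero, while inserting one word between two unchanged words yields a drift count of one.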
2.2 Tokenization Rule
tokenization (per HB-04 §3 canonical_token_boundary_definition_v0_1):
basis: per-line tokenization with intra-line tokens split on whitespace
whitespace_for_tokenization: U+0020 (space) and U+0009 (tab) only
token_definition: maximal run of non-whitespace Unicode code points (UTF-8 encoded)
line_boundary: LF acts as token separator (not itself a token)
empty_lines: preserved as zero-token lines (markdown paragraph break)
token_position_form: (line_index, intra_line_token_index) tuple
token_identity: the token's UTF-8 byte content after NFC normalization
example:
input_bytes_after_canonicalization_steps_1_to_7: b"hello world\n\nsecond paragraph\n"
canonical_tokens:
- (0, 0, "hello")
- (0, 1, "world")
- (1, 0) # zero-token line (blank line)
- (2, 0, "second")
- (2, 1, "paragraph")
trailing LF is preserved; tokenization sees 3 lines (indices 0, 1, 2)
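The tokenization rule and the byte-span lookup can be illustrated with a short Python sketch. It is a hedged illustration only, with several assumptions: the names `tokenize_canonical_sketch` and `byte_span_to_token_range_sketch` are hypothetical; the input is assumed to have already passed steps 1-7; byte offsets are taken over the canonical bytes, so mapping back to original pre-canonicalization offsets is not covered; a blank line is emitted as a `(line_index, 0)` placeholder to mirror the example above; and a span that touches no token returns `None`.

```python
import re

# Space (0x20) and tab (0x09) never occur inside a multi-byte UTF-8 sequence,
# so splitting on those bytes cannot cut a code point in half.
TOKEN_RE = re.compile(rb"[^ \t]+")


def tokenize_canonical_sketch(canonical_bytes: bytes):
    """Return (canonical_tokens, spans) for bytes already through steps 1-7.

    spans holds ((line_index, intra_line_token_index), start, end) for every
    real token, with end exclusive and offsets relative to canonical_bytes.
    """
    body = canonical_bytes[:-1] if canonical_bytes.endswith(b"\n") else canonical_bytes
    tokens, spans = [], []
    offset = 0  # byte offset of the current line start
    for line_index, line in enumerate(body.split(b"\n")):
        matches = list(TOKEN_RE.finditer(line))
        if not matches:
            # Zero-token line: placeholder entry as in the example (assumption).
            tokens.append((line_index, 0))
        for k, m in enumerate(matches):
            tokens.append((line_index, k, m.group().decode("utf-8")))
            spans.append(((line_index, k), offset + m.start(), offset + m.end()))
        offset += len(line) + 1  # +1 for the LF separator
    return tokens, spans


def byte_span_to_token_range_sketch(spans, byte_span_start: int, byte_span_end: int):
    """Map [byte_span_start, byte_span_end) to (first, last) token positions."""
    hits = [pos for pos, start, end in spans
            if end > byte_span_start and start < byte_span_end]
    if not hits:
        return None  # span covers only separators (assumed behavior)
    return hits[0], hits[-1]


tokens, spans = tokenize_canonical_sketch(b"hello world\n\nsecond paragraph\n")
print(tokens)
print(byte_span_to_token_range_sketch(spans, 6, 19))
```

For the example input, the first print reproduces the five entries listed above, and the span 6..19 (from "world" through "second") maps to the positions (0, 1) and (2, 0).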
2.3 Determinism + Idempotency Properties
properties_to_assert_in_tests:
determinism:
same_input_same_output: TRUE
same_input_same_token_stream: TRUE
same_input_same_byte_to_token_index: TRUE
idempotency:
canonicalize(canonicalize(x)) == canonicalize(x)
immutability_of_canonicalization_rule_used:
once canonicalization_rule_used="canon-md-v0.1.0" is persisted on a verify_result row, the rule version may NOT be retroactively changed for that row (per HB-04 §6 mid-cycle-change handling)
performance:
O(n) over document byte length; acceptable for v0.1
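The determinism and idempotency properties above lend themselves to a small check harness. The sketch below is an illustration only and does not stand in for the deferred HB-05 scenarios S19 and S20. Assumptions: `check_properties`, `canonicalize_stand_in`, and `SAMPLES` are hypothetical names; the stand-in returns canonical bytes directly rather than the contract's dict; and the stand-in folds trailing blank lines into the single trailing LF.

```python
import unicodedata


def canonicalize_stand_in(source_bytes: bytes) -> bytes:
    """Compact stand-in for steps 1-7 (hypothetical; bytes in, bytes out)."""
    if source_bytes.startswith(b"\xef\xbb\xbf"):
        source_bytes = source_bytes[3:]
    text = unicodedata.normalize("NFC", source_bytes.decode("utf-8"))
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = [line.rstrip(" \t") for line in text.split("\n")]
    return ("\n".join(lines).rstrip("\n") + "\n").encode("utf-8")


def check_properties(canonicalize, samples) -> int:
    """Raise AssertionError on the first violated property; return samples checked."""
    for sample in samples:
        first = canonicalize(sample)
        # Determinism: the same input gives the same output on a rerun.
        if canonicalize(sample) != first:
            raise AssertionError(f"non-deterministic for {sample!r}")
        # Idempotency: canonicalizing canonical output changes nothing.
        if canonicalize(first) != first:
            raise AssertionError(f"not idempotent for {sample!r}")
        # Step 7: exactly one trailing LF.
        if not first.endswith(b"\n") or first.endswith(b"\n\n"):
            raise AssertionError(f"trailing LF rule violated for {sample!r}")
    return len(samples)


SAMPLES = [
    b"",                          # empty input
    b"\xef\xbb\xbfa \r\nb\t\r",   # BOM, mixed line endings, trailing whitespace
    b"e\xcc\x81\n\n\n",           # decomposed e + combining acute, extra blank lines
    b"x\xc2\xa0 \n",              # U+00A0 must survive; trailing space must not
]

print(check_properties(canonicalize_stand_in, SAMPLES))
```

Passing the function under test as a parameter keeps the harness reusable: the same checks could later be pointed at whatever implementation is transcribed into the engineering tree.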
2.4 Out-of-Scope (FUTURE; D4 Capability Intake)
out_of_scope_for_v0_1:
- code source_kind (ast_node canonicalization) — FUTURE per PEF-05
- binary source_kind (byte canonicalization) — FUTURE per PEF-05
- per-source_kind rule extensions beyond markdown
- mid-cycle rule changes (require new rule version with new identifier)
- cryptographic-grade canonicalization (FUTURE if required)
v0_1_behavior_on_non_markdown_source_kind:
- verify_result.axis_1_status = "not_applicable"
- verify_result.canonicalization_rule_used = "canon-md-v0.1.0" (recorded with caveat in verdict_rationale)
- OR cutter rejects non-markdown source_kind at MARK stage (operational policy; final at execution phase)
3. Acceptance Criteria
acceptance_criteria_for_cte_03:
reference_implementation_scaffolded:
status: SCAFFOLDED (pseudo-code in §2.1 + tokenization rule §2.2 + properties §2.3)
alignment_with_HB_04_prose:
status: ALIGNED (steps 1-7 + canonical_token boundary + (line_index, intra_line_token_index) form per HB-04 §3)
canonicalization_rule_used_emission_binding:
status: BOUND (canon-md-v0.1.0 returned in every output; immutable on verify_result rows)
byte_to_token_mapping_algorithm_specified:
status: SPECIFIED (byte_span_to_token_range pseudo-code in §2.1; algorithm per HB-04 §3 + this §2.1)
axis_1_drift_function_specified:
status: SPECIFIED (axis_1_drift_count pseudo-code in §2.1)
determinism_idempotency_properties_specified:
status: SPECIFIED (§2.3)
capability_proof_deferred:
status: PLANNED for HB-05 dry-run scenarios S19 (canonicalization consistency across reruns) and S20 (rule immutability rejection)
no_code_committed_to_repository:
status: confirmed
no_pg_mutation:
status: confirmed
no_directus_mutation:
status: confirmed
cte_03_acceptance_state: ALL TEN criteria satisfied; closed_with_notes
4. Downstream Effects
downstream_effects_of_cte_03_closure:
HB_05_rollback_test_plan_dry_run:
status_before: blocked
status_after: still blocked (waits on HB-08, HB-09, CTE-04 in addition to CTE-03 being closed)
status_change: one prerequisite (CTE-03) is now closed; HB-05 remains terminal
note: HB-05 dry-run scenarios S13, S14, S19, S20 will use the reference implementation per §2; engineering implementation deployed to dry-run environment at HB-05 prep time
CTE_02_signal_routing: unchanged (independent; addressed in §1 sibling closure)
CTE_04_signing_scheme: unchanged (independent; addressed in §1 sibling closure)
what_cte_03_does_NOT_do:
- deploy any code to the repository
- implement code in a runtime
- run any canonicalization on real source bytes
- validate the implementation against actual markdown corpora (HB-05 dry-run)
- bind per-source_kind extensions (code, binary) — FUTURE PEF-05
- alter the verify_result table schema (canonicalization_rule_used DDL is execution phase)
5. Status
CTE_03_status: closed_with_notes
CTE_03_closure_authority: G-3 (capability intake reviewer; oversight; soft) + engineering (deferred to execution phase); user explicit prompt 2026-05-15
CTE_03_closure_signers:
- User / anh Huyên (sovereign authority via explicit prompt)
- GPT (policy reviewer; PASS upstream on HB-04)
- Opus / Agent (record-keeping; reference implementation scaffolded here)
system_mutation_performed: NONE
files_or_code_changed: NONE (closure record only)
canonicalization_library_deployed_to_repository: false
canonicalization_run_on_real_data: false
execution_authorized: false
p0_migration_allowed: false
ddl_allowed: false
production_use_authorized: false
notes_carried_forward:
- reference implementation lives in this closure record only; transcription to engineering tree is an execution-phase task (separate explicit prompt)
- capability proof (determinism + idempotency tests) deferred to HB-05 dry-run scenarios S19 + S20
- Đ24 identifier may finalize at separate ratification artefact (working: canon-md-v0.1.0)
- per-source_kind extensions FUTURE per PEF-05 via D4 capability intake
- verify_result.canonicalization_rule_used field DDL is execution-phase task (Step 6 P0-4)
- production use is gated by HB-05 dry-run + Final Readiness re-review + explicit user prompt
6. Hard Boundaries Confirmation
no_repository_file_committed: true
no_dot_bin_tool_created_or_modified: true
no_canonicalization_run: true
no_canonical_token_stream_emitted_in_production: true
no_verify_result_row_written: true
no_schema_created: true
no_ddl_written: true
no_sql_written: true
no_migration_script_written: true
no_migration_executed: true
no_pg_mutation: true
no_qdrant_mutation: true
no_directus_mutation: true
no_data_writes: true
no_production_use_authorized: true
no_rollback_dry_run_executed: true
no_backup_taken: true
no_snapshot_taken: true
no_deploy: true
no_execution_gate_opened: true
no_phase_prior_file_modified: true
output_form: cte_03_closure_record_in_markdown_only