dot-iu-cutter v0.1 — P0 Canonicalization Rule v0.1 Planning Note
Date: 2026-05-15
Status: IMPLEMENTATION PLANNING; Lane "canonicalization"
Scope: PLANNING ONLY. No code, no executable canonicalization implementation, no DDL, no SQL, no migration.
Master: implementation-planning/dot-iu-cutter-v0.1-p0-implementation-planning-master-2026-05-15.md
1. Purpose
Author a prose plan for canonicalization_rule_v0.1 — what the rule does, why it is bound to byte → canonical_token conversion (per X-A), and what remains to be Đ24-ratified before the rule may be used in any production CUT.
This file is planning prose only. No code is written; no canonicalization library is implemented; no rule version is yet Đ24-ratified.
2. Source Inputs
- ratification/dot-iu-cutter-v0.1-x-a-source-span-drift-unit-ratification-2026-05-15.md §3.3 + §3.4
- risk-review/dot-iu-cutter-v0.1-p0-cross-cutting-decision-register-2026-05-15.md §3.7 (X-7)
- implementation-planning/dot-iu-cutter-v0.1-p0-cross-cutting-resolution-plan-2026-05-15.md §9 (X-7)
- migration-design/dot-iu-cutter-v0.1-p0-4-verify-result-migration-design-2026-05-15.md §6, §14 (mid-cycle change risk)
- migration-design/dot-iu-cutter-v0.1-p0-2-manifest-envelope-unit-block-migration-design-2026-05-15.md §9 item 4 (source_span unit)
3. Placeholder Already Ratified at X-A
placeholder_ratified_at_X_A:
  scope: markdown source_kind only (v0.1 default)
  steps:
    - NFC unicode normalization
    - LF line endings (normalize CR/CRLF to LF)
    - trailing whitespace trim (per line)
  identity_field: canonicalization_rule_used on verify_result MUST record the rule version
  rule_changes_mid_cycle: prohibited
  prose_full_ratification: still pending Đ24 (X-7)
  placeholder_authority: Đ24 vocab owner (drift unit vocabulary + placeholder acceptance) + Đ44 family registry custodian (cross-family alignment) + Đ32 (residual risk acceptance)
The placeholder is sufficient for the byte → canonical_token conversion structure used in X-A. It is NOT sufficient for production CUTs because edge cases (BOM, mixed line endings, token boundaries) need explicit prose binding.
4. Prose Plan — What canonicalization_rule_v0.1 Must Specify
The full prose (to be authored as a separate Đ24 ratification artefact before execution) must specify the following items. Each item is recorded here at planning level.
4.1 Source-kind scope (v0.1 default)
source_kind_scope_v0_1:
  primary: markdown
  fallback_for_unspecified_source_kind: markdown rule applied with WARNING (per-source_kind extension is FUTURE)
  out_of_scope_v0_1:
    code: ast_node canonicalization (FUTURE D4 capability intake)
    binary: byte-level handling (FUTURE)
4.2 Step ordering and idempotency
step_ordering:
  1. read source bytes as UTF-8
  2. apply NFC unicode normalization
  3. normalize line endings: CR or CRLF → LF
  4. trim trailing whitespace per line (whitespace = space + tab; not other unicode whitespace)
  5. ensure file ends with exactly one trailing LF (open prose decision; recommend YES)
  6. tokenize into canonical_tokens (token boundary rule below)
idempotency_requirement:
  - applying the rule twice yields the same canonical_token stream as applying it once
  - byte-to-token mapping is deterministic across runs on the same input
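As an illustrative, non-normative sketch only (it is not the ratified rule and not part of any implementation), steps 1 to 5 above could look like the following in Python. The function name `canonicalize_text` is hypothetical. Two behaviours are assumptions pending the Đ24 prose: trailing blank lines are folded into the single final LF, and an empty document canonicalizes to a lone LF. BOM stripping (§4.3) and tokenization (step 6) are left out.

```python
import unicodedata


def canonicalize_text(raw: bytes) -> str:
    """Hypothetical helper covering steps 1-5 of the planned ordering."""
    # Step 1: read source bytes as UTF-8 (raises UnicodeDecodeError if invalid).
    text = raw.decode("utf-8")
    # Step 2: NFC unicode normalization.
    text = unicodedata.normalize("NFC", text)
    # Step 3: CRLF must be replaced before lone CR, otherwise each CRLF
    # would turn into two LFs.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Step 4: trim trailing space and tab only, not other unicode whitespace.
    lines = [line.rstrip(" \t") for line in text.split("\n")]
    # Step 5 (assumption): fold any trailing blank lines into exactly one LF.
    return "\n".join(lines).rstrip("\n") + "\n"


sample = b"cafe\xcc\x81 \t\r\nline2\r\rend"
once = canonicalize_text(sample)
twice = canonicalize_text(once.encode("utf-8"))
print(once == twice)  # idempotency check on this one sample
```

A check like the last three lines only demonstrates idempotency for one input; the ratified prose would still need to argue it for all inputs.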
4.3 BOM handling
bom_handling:
  policy: strip UTF-8 BOM (EF BB BF) at file start if present; byte offset 0 of post-BOM bytes used for source_span calculations
  rationale: BOM is encoding metadata, not content
  prose_to_author_in_dieu24: explicit
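A minimal, non-normative sketch of the BOM policy, assuming a hypothetical helper `strip_bom` that also reports how many bytes were removed, so a caller can relate post-BOM offsets back to on-disk offsets if it ever needs to:

```python
UTF8_BOM = b"\xef\xbb\xbf"


def strip_bom(raw: bytes) -> tuple[bytes, int]:
    """Return (post-BOM bytes, number of BOM bytes stripped)."""
    # Only a BOM at the very start of the file is treated as metadata.
    if raw.startswith(UTF8_BOM):
        return raw[len(UTF8_BOM):], len(UTF8_BOM)
    return raw, 0


content, stripped = strip_bom(b"\xef\xbb\xbf# Title\n")
print(content, stripped)  # b'# Title\n' 3
```

Under the stated policy, source_span offset 0 refers to `content[0]`, not to the first byte of the BOM.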
4.4 Mixed line endings within a single document
mixed_line_endings_handling:
  policy: all CR / CRLF / LF sequences normalized to LF; no document is rejected for mixed endings
  rationale: stable normalization required regardless of source provenance
  prose_to_author_in_dieu24: explicit
4.5 Trailing newline at file end
trailing_newline_handling:
  policy_recommended_v0_1: enforce exactly one LF at file end (POSIX text file convention)
  rationale: prevents trailing-newline drift contributing to false positives in axis-1
  alternative_under_discussion: preserve original (no enforcement); rejected for v0.1 because it makes drift detection non-stable
  prose_to_author_in_dieu24: explicit
4.6 Consecutive blank lines
consecutive_blank_lines_handling:
  policy_recommended_v0_1: preserve as-is (no collapsing)
  rationale: markdown semantics treat consecutive blank lines as content (paragraph break), not noise
  prose_to_author_in_dieu24: explicit
4.7 Canonical token boundary definition
canonical_token_boundary_definition_recommendation:
  basis: per-line tokenization with intra-line tokens split on whitespace
  per_line_tokens:
    - tokens are maximal runs of non-whitespace UTF-8 code points
    - whitespace = space (U+0020) or tab (U+0009)
    - tokens preserved in original order
  line_boundary:
    - LF is itself a token boundary marker (not a token)
    - each line yields zero or more tokens
  identity_of_a_canonical_token:
    - the token's UTF-8 byte content after NFC normalization
  position_index:
    - canonical_token_position = (line_index, intra_line_token_index) OR a flat token sequence index; exact representation chosen at execution time; prose must specify a single canonical form
  prose_to_author_in_dieu24: explicit
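The boundary recommendation above can be illustrated with a short, non-normative Python sketch. The name `tokenize` is hypothetical, and the input is assumed to be already canonicalized text (NFC, LF-only). The sketch emits `(line_index, intra_line_token_index, token)` triples; the position of a triple in the returned list doubles as the flat sequence index, so it does not pre-empt the choice between the two position forms.

```python
import re

# A token is a maximal run of code points other than space (U+0020) and tab (U+0009).
_TOKEN = re.compile(r"[^ \t]+")


def tokenize(canonical_text: str) -> list[tuple[int, int, str]]:
    """Split canonicalized text into (line_index, token_index, token) triples."""
    tokens = []
    # LF acts as a boundary marker only; it never becomes a token itself.
    for line_index, line in enumerate(canonical_text.split("\n")):
        for token_index, match in enumerate(_TOKEN.finditer(line)):
            tokens.append((line_index, token_index, match.group()))
    return tokens


for position in tokenize("# Title\n\nfoo\tbar baz\n"):
    print(position)
```

Blank lines yield zero tokens but still advance `line_index`, which matches the "preserve as-is" recommendation of §4.6. Token identity under the recommendation would be `token.encode("utf-8")`.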
4.8 Byte-offset → canonical_token position mapping algorithm
byte_offset_to_token_position_mapping:
  algorithm_recommended_v0_1:
    1. read post-BOM bytes
    2. walk bytes from offset 0, applying step ordering §4.2 progressively to maintain byte→codepoint→token correspondence
    3. for each byte_span_start: locate the first canonical_token whose codepoints include or follow that byte
    4. for each byte_span_end: locate the last canonical_token whose codepoints precede or include that byte
    5. emit (start_token_position, end_token_position) as the canonical token range corresponding to the byte span
  determinism: required
  performance_class: O(n) over document size acceptable v0.1
  prose_to_author_in_dieu24: explicit
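One possible shape of this mapping is sketched below, purely for illustration and under several assumptions that the Đ24 prose would have to confirm or replace: byte spans are half-open `[start, end)`; positions use the flat token sequence index; and token boundaries are found directly in the post-BOM bytes by splitting on space, tab, CR and LF, which relies on NFC neither creating nor removing those ASCII delimiters. All function names are hypothetical.

```python
import re
from bisect import bisect_left, bisect_right
from typing import Optional

# UTF-8 continuation bytes are all >= 0x80, so a byte-level split on these
# four ASCII delimiters never cuts through a multi-byte code point.
_TOKEN_BYTES = re.compile(rb"[^ \t\r\n]+")


def token_spans(post_bom: bytes) -> list[tuple[int, int]]:
    """Byte span [start, end) of every token, in document order (one O(n) scan)."""
    return [m.span() for m in _TOKEN_BYTES.finditer(post_bom)]


def byte_span_to_token_range(
    post_bom: bytes, span_start: int, span_end: int
) -> Optional[tuple[int, int]]:
    """Map a half-open byte span to (first_token_index, last_token_index)."""
    spans = token_spans(post_bom)
    starts = [s for s, _ in spans]
    ends = [e for _, e in spans]
    # First token that includes or follows span_start: first token end > span_start.
    first = bisect_right(ends, span_start)
    # Last token that precedes or includes the span end: last token start < span_end.
    last = bisect_left(starts, span_end) - 1
    if first > last:
        return None  # the span covers only whitespace or line breaks
    return first, last


data = b"alpha  beta\r\ngamma\tdelta\n"
print(byte_span_to_token_range(data, 7, 18))  # (1, 2)
```

The scan is linear in document size and the lookup is a binary search over sorted offsets, so the sketch stays within the O(n) class named above. It returns `None` for spans that touch no token; how the real rule should treat that case is an open question, not something this sketch decides.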
4.9 Per-source_kind extension policy (FUTURE)
extension_policy:
  channel: D4 capability intake → Đ24 ratification → Đ32 risk review
  example_extensions_FUTURE:
    code:
      rule_kind: ast_node
      requires_parser: per language
    binary:
      rule_kind: byte-level
      requires_canonicalization_resistant_handling: true
  v0_1_treatment:
    - non-markdown source_kind triggers axis_1_status='not_applicable' (verify_result)
    - or applies the markdown rule with a WARNING if reviewer authorizes
  prose_to_author_in_dieu24: explicit
5. What Still Needs Đ24 Ratification Before Execution
pending_dieu24_ratification:
  - full prose canonicalization_rule_v0.1 (this planning note only plans that prose; ratification is a separate Đ24 step recorded in a ratification file)
  - identity scheme for canonicalization_rule_used field value (e.g., "canon-md-v0.1.0" SemVer-style)
  - token-position representation form ((line_index, intra_line_index) tuple OR flat sequence index); must be exactly one
  - explicit BOM / mixed line ending / trailing newline policy bindings per §4.3–§4.5
  - explicit canonical_token boundary policy per §4.7
  - explicit byte-offset → token-position algorithm per §4.8
6. Byte → Canonical_token Conversion Expectation (per X-A)
conversion_expectation:
  input: source_span_start, source_span_end (byte offsets per X-A) on manifest_unit_block
  process: apply canonicalization_rule_v0.1 to source revision bytes; compute token positions per §4.8
  output: (start_token_position, end_token_position) used by VERIFY axis-1 drift comparison
  drift_unit: canonical_token (per X-A)
  drift_threshold: 0 (default; non-zero only with explicit Đ32 policy)
verify_result_recording:
  canonicalization_rule_used: REQUIRED field populated on every verify_result row
  axis_1_drift_details: JSONB carries per-unit byte-span → token-position mapping for audit
mid_cycle_rule_change_handling:
  - prohibited; rule changes require D4 capability intake + Đ24 ratification of a new rule version
  - legacy verify_result rows retain their canonicalization_rule_used value and remain immutable
7. canonicalization_rule_used Requirement
field_canonicalization_rule_used:
  table: verify_result
  type-class: text (SemVer-style identifier)
  nullability: NOT NULL
  immutability: immutable after row insert
  default_value_v0_1: identifier of the first Đ24-ratified prose version (e.g., "canon-md-v0.1.0"; exact identifier set at ratification time)
  audit_use:
    - allows reproduction of historical drift calculations
    - prevents ghost drift on rule version change
    - supports rule-version impact analysis (which verify_results would change verdict under a new rule)
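To make the NOT NULL and immutability requirements concrete, here is an application-layer illustration only; it is not DDL and not a proposed schema. The class `VerifyResult`, the `unit_id` field and the identifier pattern are all hypothetical; the pattern simply mirrors the "canon-md-v0.1.0" example, and the real identifier scheme is set at Đ24 ratification.

```python
import re
from dataclasses import FrozenInstanceError, dataclass

# Assumed shape only: "canon-<source_kind>-v<MAJOR>.<MINOR>.<PATCH>".
_RULE_ID = re.compile(r"^canon-[a-z]+-v\d+\.\d+\.\d+$")


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after construction
class VerifyResult:
    unit_id: str
    canonicalization_rule_used: str

    def __post_init__(self) -> None:
        # Mirrors NOT NULL plus a SemVer-style format check.
        if not _RULE_ID.match(self.canonicalization_rule_used or ""):
            raise ValueError("canonicalization_rule_used must be a rule identifier")


row = VerifyResult(unit_id="unit-1", canonicalization_rule_used="canon-md-v0.1.0")
try:
    row.canonicalization_rule_used = "canon-md-v0.2.0"
except FrozenInstanceError:
    print("rule identifier is immutable once recorded")
```

Recording the identifier on every row is what lets a later rule version be compared against historical verdicts without rewriting them.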
8. Blockers Before Execution
execution_preconditions_from_this_planning_note:
  - Đ24 full prose canonicalization_rule_v0.1 ratified
  - canonicalization_rule_v0.1 identifier assigned (e.g., "canon-md-v0.1.0")
  - canonicalization rule library scaffolding present (placeholder implementation; v0.1 acceptable as application-layer)
  - canonicalization_rule_used field on verify_result wired in DDL (execution phase)
9. What This Plan Does NOT Do
this_file_does_NOT:
  - implement the canonicalization rule
  - emit any canonical_token stream
  - run any conversion
  - ratify the prose (Đ24 ratification is a separate file)
  - mutate any state
  - write any code
  - write any SQL / DDL / migration script
10. Explicit Confirmation
no_code_written: true
no_canonicalization_implementation: true
no_canonical_token_stream_emitted: true
no_dieu24_ratification_recorded_in_this_file: true (ratification is a separate file)
no_ddl_written: true
no_sql_written: true
no_migration_executed: true
no_pg_mutation: true
no_qdrant_mutation: true
no_directus_mutation: true
no_data_writes: true
no_implementation_execution: true
no_phase_prior_file_modified: true
output_form: canonicalization_rule_planning_prose_only