dot-iu-cutter v0.1 — P0 Canonicalization Rule v0.1 Planning Note

Date: 2026-05-15
Status: IMPLEMENTATION PLANNING (Lane "canonicalization")
Scope: PLANNING ONLY. No code, no executable canonicalization implementation, no DDL, no SQL, no migration.
Master: implementation-planning/dot-iu-cutter-v0.1-p0-implementation-planning-master-2026-05-15.md

1. Purpose

Author a prose plan for canonicalization_rule_v0.1 — what the rule does, why it is bound to byte → canonical_token conversion (per X-A), and what remains to be Đ24-ratified before the rule may be used in any production CUT.

This file is planning prose only. No code is written; no canonicalization library is implemented; no rule version is yet Đ24-ratified.

2. Source Inputs

ratification/dot-iu-cutter-v0.1-x-a-source-span-drift-unit-ratification-2026-05-15.md §3.3 + §3.4
risk-review/dot-iu-cutter-v0.1-p0-cross-cutting-decision-register-2026-05-15.md §3.7 (X-7)
implementation-planning/dot-iu-cutter-v0.1-p0-cross-cutting-resolution-plan-2026-05-15.md §9 (X-7)
migration-design/dot-iu-cutter-v0.1-p0-4-verify-result-migration-design-2026-05-15.md §6, §14 (mid-cycle change risk)
migration-design/dot-iu-cutter-v0.1-p0-2-manifest-envelope-unit-block-migration-design-2026-05-15.md §9 item 4 (source_span unit)

3. Placeholder Already Ratified at X-A

placeholder_ratified_at_X_A:
  scope: markdown source_kind only (v0.1 default)
  steps:
    - NFC unicode normalization
    - LF line endings (normalize CR/CRLF to LF)
    - trailing whitespace trim (per line)
  identity_field: canonicalization_rule_used on verify_result MUST record the rule version
  rule_changes_mid_cycle: prohibited
  prose_full_ratification: still pending Đ24 (X-7)
placeholder_authority: Đ24 vocab owner (drift unit vocabulary + placeholder acceptance) + Đ44 family registry custodian (cross-family alignment) + Đ32 (residual risk acceptance)

The placeholder is sufficient for the byte → canonical_token conversion structure used in X-A. It is NOT sufficient for production CUTs because edge cases (BOM, mixed line endings, token boundaries) need explicit prose binding.

4. Prose Plan — What canonicalization_rule_v0.1 Must Specify

The full prose (to be authored as a separate Đ24 ratification artefact before execution) must specify the following items. Each item is recorded here at planning level.

4.1 Source-kind scope (v0.1 default)

source_kind_scope_v0_1:
  primary: markdown
  fallback_for_unspecified_source_kind: markdown rule applied with WARNING (per-source_kind extension is FUTURE)
  out_of_scope_v0_1:
    code: ast_node canonicalization (FUTURE D4 capability intake)
    binary: byte-level handling (FUTURE)

4.2 Step ordering and idempotency

step_ordering:
  1. read source bytes as UTF-8
  2. apply NFC unicode normalization
  3. normalize line endings: CR or CRLF → LF
  4. trim trailing whitespace per line (whitespace = space + tab; not other unicode whitespace)
  5. ensure file ends with exactly one trailing LF (open prose decision; recommend YES)
  6. tokenize into canonical_tokens (token boundary rule below)
idempotency_requirement:
  - applying the rule twice yields the same canonical_token stream as applying it once
  - byte-to-token mapping deterministic across runs on the same input
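
Non-normative illustration only (not an implementation and not part of any ratified rule): steps 1 to 5 above could be sketched roughly as follows. The function name canonicalize_text is hypothetical. The sketch assumes BOM stripping (§4.3) has already happened, assumes the "recommend YES" option for step 5, and assumes an empty document stays empty. All of these are assumptions pending the Đ24 prose.

```python
import unicodedata


def canonicalize_text(post_bom: bytes) -> str:
    """Hypothetical sketch of steps 1-5; tokenization (step 6) is not included."""
    text = post_bom.decode("utf-8")              # step 1: raises on invalid UTF-8
    text = unicodedata.normalize("NFC", text)    # step 2: NFC normalization
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # step 3: CRLF / CR -> LF
    # step 4: trim only space and tab, not other Unicode whitespace
    lines = [line.rstrip(" \t") for line in text.split("\n")]
    # step 5 (assumed policy): drop trailing empty lines, then end with one LF
    while lines and lines[-1] == "":
        lines.pop()
    return "\n".join(lines) + "\n" if lines else ""
```

Under these assumptions, re-encoding the result as UTF-8 and running it through the sketch again returns the same string, which is the idempotency property listed above. Interior blank lines are kept; only blank lines at the very end of the file are folded into the single trailing LF.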

4.3 BOM handling

bom_handling:
  policy: strip UTF-8 BOM (EF BB BF) at file start if present; byte offset 0 of post-BOM bytes used for source_span calculations
  rationale: BOM is encoding metadata, not content
prose_to_author_in_dieu24: explicit
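
As a non-normative illustration of the offset convention above, a helper could return both the post-BOM bytes and the number of bytes removed, so callers can see that offset 0 refers to the first byte after the BOM. The name strip_utf8_bom and the return shape are hypothetical, not part of the rule.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def strip_utf8_bom(raw: bytes) -> tuple[bytes, int]:
    """Return (post_bom_bytes, stripped_byte_count); hypothetical sketch."""
    if raw.startswith(UTF8_BOM):
        # Only a BOM at file start is removed; offset 0 now points past it.
        return raw[len(UTF8_BOM):], len(UTF8_BOM)
    return raw, 0
```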

4.4 Mixed line endings within a single document

mixed_line_endings_handling:
  policy: all CR / CRLF / LF sequences normalized to LF; no document is rejected for mixed endings
  rationale: stable normalization required regardless of source provenance
prose_to_author_in_dieu24: explicit

4.5 Trailing newline at file end

trailing_newline_handling:
  policy_recommended_v0_1: enforce exactly one LF at file end (POSIX text file convention)
  rationale: prevents trailing-newline drift contributing to false positives in axis-1
  alternative_under_discussion: preserve original (no enforcement) — rejected for v0.1 because it makes drift detection non-stable
prose_to_author_in_dieu24: explicit

4.6 Consecutive blank lines

consecutive_blank_lines_handling:
  policy_recommended_v0_1: preserve as-is (no collapsing)
  rationale: markdown semantics treat consecutive blank lines as content (paragraph break), not noise
prose_to_author_in_dieu24: explicit

4.7 Canonical token boundary definition

canonical_token_boundary_definition_recommendation:
  basis: per-line tokenization with intra-line tokens split on whitespace
  per_line_tokens:
    - tokens are maximal runs of non-whitespace Unicode code points
    - whitespace = space (U+0020) or tab (U+0009)
    - tokens preserved in original order
  line_boundary:
    - LF is itself a token boundary marker (not a token)
    - each line yields zero or more tokens
  identity_of_a_canonical_token:
    - the token's UTF-8 byte content after NFC normalization
  position_index:
    - canonical_token_position = (line_index, intra_line_token_index) OR a flat token sequence index; the Đ24 prose must fix exactly one canonical form before execution (see §5)
prose_to_author_in_dieu24: explicit
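
For illustration only, the recommended boundary rule could look like the sketch below. It assumes the input text has already been through NFC, LF normalization and trailing-whitespace trimming, and it uses the (line_index, intra_line_token_index) tuple form purely as an example. The choice between that form and a flat index remains open. The name tokenize is hypothetical.

```python
import re

# A token is a maximal run of characters that are neither space nor tab.
_TOKEN = re.compile(r"[^ \t]+")


def tokenize(canonical_text: str) -> list[tuple[int, int, str]]:
    """Return (line_index, intra_line_token_index, token) triples in order."""
    tokens = []
    # LF is a boundary marker, never a token: split on it and discard it.
    for line_index, line in enumerate(canonical_text.split("\n")):
        for intra_index, match in enumerate(_TOKEN.finditer(line)):
            tokens.append((line_index, intra_index, match.group()))
    return tokens
```

For example, tokenize("# Title\n\nfoo\tbar  baz\n") yields two tokens on line 0, none on the blank line 1, and three on line 2.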

4.8 Byte-offset → canonical_token position mapping algorithm

byte_offset_to_token_position_mapping:
  algorithm_recommended_v0_1:
    1. read post-BOM bytes
    2. walk bytes from offset 0, applying step ordering §4.2 progressively to maintain byte→codepoint→token correspondence
    3. for each byte_span_start: locate the first canonical_token whose codepoints include or follow that byte
    4. for each byte_span_end: locate the last canonical_token whose codepoints precede or include that byte
    5. emit (start_token_position, end_token_position) as the canonical token range corresponding to the byte span
  determinism: required
  performance_class: O(n) over document size acceptable v0.1
prose_to_author_in_dieu24: explicit
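
A simplified, non-normative sketch of this mapping follows. It does not walk the normalization steps progressively. Instead it relies on two assumptions that the Đ24 prose would need to confirm: (a) the separator bytes 0x20, 0x09, 0x0A and 0x0D never occur inside a multi-byte UTF-8 sequence, so token boundaries can be found directly on the post-BOM bytes, and (b) NFC normalization does not move those boundaries. It also assumes span_end is an inclusive byte offset and uses a flat token index. All names are hypothetical.

```python
import bisect
import re

# Tokens on raw bytes: maximal runs of anything except space, tab, CR, LF.
_TOKEN_BYTES = re.compile(rb"[^ \t\r\n]+")


def token_byte_spans(post_bom: bytes) -> list[tuple[int, int]]:
    """Half-open (start, end) byte spans of each token, in document order."""
    return [(m.start(), m.end()) for m in _TOKEN_BYTES.finditer(post_bom)]


def byte_span_to_token_range(post_bom: bytes, span_start: int, span_end: int):
    """Map an inclusive byte span to (first_token, last_token) flat indices.

    Returns None when the span covers no token (for example, only whitespace).
    """
    spans = token_byte_spans(post_bom)
    starts = [s for s, _ in spans]
    ends = [e for _, e in spans]
    # First token that includes or follows span_start: first with end > span_start.
    first = bisect.bisect_right(ends, span_start)
    # Last token that precedes or includes span_end: last with start <= span_end.
    last = bisect.bisect_right(starts, span_end) - 1
    if first >= len(spans) or last < first:
        return None
    return first, last
```

The token scan is linear in document size and each lookup is a binary search, which stays within the O(n) class noted above. The sketch rebuilds the span list on every call for brevity; a real implementation would presumably compute it once per source revision.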

4.9 Per-source_kind extension policy (FUTURE)

extension_policy:
  channel: D4 capability intake → Đ24 ratification → Đ32 risk review
  example_extensions_FUTURE:
    code:
      rule_kind: ast_node
      requires_parser: per language
    binary:
      rule_kind: byte-level
      requires_canonicalization_resistant_handling: true
v0_1_treatment:
  - non-markdown source_kind triggers axis_1_status='not_applicable' (verify_result)
  - or applies the markdown rule with a WARNING if reviewer authorizes
prose_to_author_in_dieu24: explicit

5. What Still Needs Đ24 Ratification Before Execution

pending_dieu24_ratification:
  - full prose canonicalization_rule_v0.1 (this planning note is the authoring input; ratification is a separate Đ24 step recorded in a ratification file)
  - identity scheme for canonicalization_rule_used field value (e.g., "canon-md-v0.1.0" SemVer-style)
  - token-position representation form ((line_index, intra_line_index) tuple OR flat sequence index) — must be exactly one
  - explicit BOM / mixed line ending / trailing newline policy bindings per §4.3–§4.5
  - explicit canonical_token boundary policy per §4.7
  - explicit byte-offset → token-position algorithm per §4.8

6. Byte → Canonical_token Conversion Expectation (per X-A)

conversion_expectation:
  input: source_span_start, source_span_end (byte offsets per X-A) on manifest_unit_block
  process: apply canonicalization_rule_v0.1 to source revision bytes; compute token positions per §4.8
  output: (start_token_position, end_token_position) used by VERIFY axis-1 drift comparison
  drift_unit: canonical_token (per X-A)
  drift_threshold: 0 (default; non-zero only with explicit Đ32 policy)
verify_result_recording:
  canonicalization_rule_used: REQUIRED field populated on every verify_result row
  axis_1_drift_details: JSONB carries per-unit byte-span → token-position mapping for audit
mid_cycle_rule_change_handling:
  prohibited; rule changes require D4 capability intake + Đ24 ratification of a new rule version
  legacy verify_result rows retain their canonicalization_rule_used value and remain immutable

7. canonicalization_rule_used Requirement

field_canonicalization_rule_used:
  table: verify_result
  type_class: text (SemVer-style identifier)
  nullability: NOT NULL
  immutability: immutable after row insert
  default_value_v0_1: identifier of the first Đ24-ratified prose version (e.g., "canon-md-v0.1.0" — exact identifier set at ratification time)
  audit_use:
    - allows reproduction of historical drift calculations
    - prevents ghost drift on rule version change
    - supports rule-version impact analysis (which verify_results would change verdict under a new rule)

8. Blockers Before Execution

execution_preconditions_from_this_planning_note:
  - Đ24 full prose canonicalization_rule_v0.1 ratified
  - canonicalization_rule_v0.1 identifier assigned (e.g., "canon-md-v0.1.0")
  - canonicalization rule library scaffolding present (placeholder implementation; v0.1 acceptable as application-layer)
  - canonicalization_rule_used field on verify_result wired in DDL (execution phase)

9. What This Plan Does NOT Do

this_file_does_NOT:
  - implement the canonicalization rule
  - emit any canonical_token stream
  - run any conversion
  - ratify the prose (Đ24 ratification is a separate file)
  - mutate any state
  - write any code
  - write any SQL / DDL / migration script

10. Explicit Confirmation

no_code_written: true
no_canonicalization_implementation: true
no_canonical_token_stream_emitted: true
no_dieu24_ratification_recorded_in_this_file: true (ratification is a separate file)
no_ddl_written: true
no_sql_written: true
no_migration_executed: true
no_pg_mutation: true
no_qdrant_mutation: true
no_directus_mutation: true
no_data_writes: true
no_implementation_execution: true
no_phase_prior_file_modified: true
output_form: canonicalization_rule_planning_prose_only