KB-4251

dot-iu-cutter v0.1 — HB-04 X-7 Canonicalization Rule v0.1 Prose Ratification Closure

Revision 1
Tags: dot-iu-cutter, blocker-closure, hb-04, x-7, canonicalization-rule, v0.1, dieu24, dieu44, no-code, no-execution, no-ddl, rev5d

Date: 2026-05-15
Status: HB-04 CLOSURE RECORD; closed_with_notes
Trigger: GPT review of HB-06 returned PASS (2026-05-15). User has explicitly authorized batch closure of HB-01, HB-02, HB-03, HB-04.
Scope: CLOSURE RECORD ONLY. No code, no canonicalization library implementation, no DDL, no SQL, no migration, no PG mutation, no Directus mutation, no Qdrant/vector mutation, no execution.


1. Scope

HB-04 ratifies the canonicalization rule v0.1 prose per X-7. The v0.1 placeholder bound at X-A (NFC + LF + trim) is now elevated to ratified prose covering the markdown source_kind v0.1 default plus the binding that canonicalization_rule_used must be recorded on every verify_result row.

hb_04_scope:
  in_scope:
    - ratify canonicalization rule v0.1 prose at the level needed to unblock CTE-03 scaffolding
    - bind canonicalization_rule_used field requirement on verify_result
    - record Đ24 + Đ44 sign-off attribution
    - record downstream effect on CTE-03
  not_in_scope:
    - implement the canonicalization library (CTE-03; separate engineering session)
    - issue per-source_kind extensions for code (ast_node) or binary (byte) — FUTURE via D4 capability intake (PEF-05)
    - create or alter any table
    - write any code

2. Source References

  • reviews/dot-iu-cutter-v0.1-hb-06-operational-seats-closure-gpt-review-2026-05-15.md (PASS — authorizes batch closure)
  • blocker-closure/dot-iu-cutter-v0.1-hb-06-operational-seats-closure-2026-05-15.md
  • ratification/dot-iu-cutter-v0.1-x-a-source-span-drift-unit-ratification-2026-05-15.md §3.3 + §3.4 (v0.1 placeholder accepted)
  • implementation-planning/dot-iu-cutter-v0.1-p0-canonicalization-rule-v0.1-planning-note-2026-05-15.md §4 (prose plan)
  • migration-design/dot-iu-cutter-v0.1-p0-4-verify-result-migration-design-2026-05-15.md §6 + §14 (mid-cycle change risk + canonicalization_rule_used requirement)
  • migration-design/dot-iu-cutter-v0.1-p0-2-manifest-envelope-unit-block-migration-design-2026-05-15.md §9 item 4 (source_span unit alignment)
  • risk-review/dot-iu-cutter-v0.1-p0-cross-cutting-decision-register-2026-05-15.md §3.7 (X-7 options + recommendation)
  • blocker-closure/dot-iu-cutter-v0.1-p0-workstream-b-vocabulary-schema-canonicalization-2026-05-15.md §5 (HB-04 acceptance criteria)

3. Decision Recorded

decision_id: HB-04
cross_cutting_decision_resolved: X-7
selected_option: dieu24_ratified_prose_for_markdown_v0_1_with_byte_to_token_conversion

canonicalization_rule_v0_1_prose_ratified:
  identifier_proposal: canon-md-v0.1.0
  identifier_status: working identifier accepted; Đ24 retains authority to set final identifier at execution-phase DDL authoring without re-opening HB-04
  scope: markdown source_kind v0.1 default
  rule_steps (in order; idempotent):
    1: read source bytes as UTF-8
    2: strip UTF-8 BOM (EF BB BF) at file start if present; byte offset 0 of post-BOM bytes used for source_span calculations
    3: apply NFC unicode normalization
    4: normalize line endings (replace each CRLF pair with LF first, then each remaining lone CR with LF)
    5: trim trailing whitespace per line (whitespace = space U+0020 + tab U+0009; not other unicode whitespace)
    6: enforce exactly one LF at file end (recommendation v0.1; Đ24 ratifies)
    7: tokenize into canonical_tokens (token boundary rule below)
  canonical_token_boundary_definition_v0_1:
    basis: per-line tokenization with intra-line tokens split on whitespace
    intra_line_tokens: maximal runs of non-whitespace Unicode code points
    whitespace: space (U+0020) or tab (U+0009)
    line_boundary: LF acts as token separator; not itself a token
    consecutive_blank_lines: preserved as-is (markdown semantic content; not collapsed)
    canonical_token_identity: the token's UTF-8 byte content after NFC normalization
    canonical_token_position_form: (line_index, intra_line_token_index) tuple — bound v0.1; flat-sequence-index alternative is NOT chosen v0.1
  byte_offset_to_canonical_token_position_mapping_algorithm:
    1: read post-BOM bytes
    2: walk bytes from offset 0, applying rule steps 3-7 progressively to maintain byte→codepoint→token correspondence
    3: for each byte_span_start: locate the first canonical_token whose codepoints include or follow that byte
    4: for each byte_span_end: locate the last canonical_token whose codepoints precede or include that byte
    5: emit (start_token_position, end_token_position) per byte_span
    determinism: required
    performance_class: O(n) over document size acceptable v0.1
  axis_1_drift_unit_binding: canonical_token (per X-A; reaffirmed here)
  drift_threshold_default: 0 (any drift = FAIL; non-zero allowance requires explicit Đ32 policy)
  per_source_kind_extension_policy:
    markdown_v0_1: this prose applies
    non_markdown_source_kind_v0_1: axis_1_status='not_applicable' (out of scope v0.1)
    code_source_kind: FUTURE ast_node rule via D4 capability intake (PEF-05) + Đ24 ratification
    binary_source_kind: FUTURE byte rule via D4 capability intake (PEF-05) + Đ24 ratification
  mid_cycle_rule_change_handling:
    prohibited: rule changes require D4 capability intake + Đ24 ratification of a NEW rule version (new identifier)
    legacy_verify_results: retain their canonicalization_rule_used value; immutable

canonicalization_rule_used_field_binding:
  table: verify_result (P0-4)
  type_class: text (SemVer-style identifier)
  nullability: NOT NULL
  immutability: immutable after row insert
  default_v0_1: canon-md-v0.1.0 (or final Đ24 identifier)
  audit_use:
    - allows reproduction of historical drift calculations
    - prevents ghost drift on rule version change
    - supports rule-version impact analysis
  requirement: MUST be populated on every verify_result row
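
The field binding above (NOT NULL, immutable after insert, default identifier) can be pictured at the application layer with a frozen dataclass. This is an illustrative sketch only, not DDL and not the P0-4 schema: the class name `VerifyResultRow` and the `verify_result_id` field are hypothetical.

```python
from dataclasses import dataclass, FrozenInstanceError

DEFAULT_RULE_ID = "canon-md-v0.1.0"  # working identifier; Đ24 may finalize another


@dataclass(frozen=True)  # frozen=True models "immutable after row insert"
class VerifyResultRow:
    verify_result_id: str  # hypothetical key, for illustration only
    canonicalization_rule_used: str = DEFAULT_RULE_ID

    def __post_init__(self):
        # Models NOT NULL: reject None, non-text, and blank identifiers.
        value = self.canonicalization_rule_used
        if not isinstance(value, str) or not value.strip():
            raise ValueError("canonicalization_rule_used must be populated")


row = VerifyResultRow(verify_result_id="vr-1")
try:
    row.canonicalization_rule_used = "canon-md-v0.2.0"
except FrozenInstanceError:
    pass  # a rule change needs a new row under a new rule version, not an update
```

Keeping the identifier on each row is what lets a later reader rerun the exact rule version that produced a historical drift result.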

4. Authority / Sign-Off

authorities_signing:
  primary_signers:
    - Đ24 (vocabulary owner) — ratifies canonicalization rule v0.1 prose
    - Đ44 (family registry custodian) — accepts cross-family alignment: verify_result (verify_family) references the rule identifier; manifest_unit_block (manifest_family) source_span uses byte offsets that the rule maps via §3.3 algorithm
  secondary_signers:
    - GPT (policy reviewer; PASS upstream on cross-cutting register and X-7 recommendation; PASS on canonicalization planning note)
    - User / anh Huyên (sovereign authority)
    - Opus / Agent (record-keeping side)

what_each_authority_accepts:
  Đ24:
    - full prose for canonicalization rule v0.1 (markdown scope, step ordering, BOM/line-ending/trim/exactly-one-LF policies, canonical_token boundary, byte→token mapping algorithm, per-source_kind extension policy, mid-cycle change handling)
    - canonicalization_rule_used field binding on verify_result
    - identifier working name canon-md-v0.1.0 (final identifier at Đ24 ratification artefact if renamed)
  Đ44:
    - cross-family alignment between verify_result and manifest_unit_block via byte-span → canonical-token-position conversion
    - reaffirms X-A binding
  GPT:
    - cross-cutting register §3.7 recommendation matches the closure
  User / anh Huyên:
    - sovereign acceptance per the explicit prompt
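
The byte-span to canonical-token-position conversion that Đ44 accepts can be sketched as follows. This is a hedged illustration, not the ratified implementation: `token_byte_ranges` and `map_span` are hypothetical names, positions are 0-based by assumption, `span_end` is treated as the inclusive offset of the last byte (the prose does not say inclusive or exclusive), and the sketch relies on the assumption that NFC never creates or removes space, tab, CR, or LF, so token boundaries can be located on the pre-NFC text.

```python
import bisect
from typing import Optional

UTF8_BOM = b"\xef\xbb\xbf"


def token_byte_ranges(source: bytes) -> list:
    """Return [(start_byte, end_byte_exclusive, (line_index, token_index)), ...]
    with offsets counted from byte 0 of the post-BOM bytes."""
    if source.startswith(UTF8_BOM):
        source = source[len(UTF8_BOM):]
    ranges, offset, line, index = [], 0, 0, 0
    token_start, previous = None, ""
    for ch in source.decode("utf-8"):
        if ch in " \t\r\n":
            if token_start is not None:  # whitespace closes the open token
                ranges.append((token_start, offset, (line, index)))
                index, token_start = index + 1, None
            # CR, lone LF, and CRLF each count as exactly one line break.
            if ch == "\r" or (ch == "\n" and previous != "\r"):
                line, index = line + 1, 0
        elif token_start is None:
            token_start = offset
        offset += len(ch.encode("utf-8"))
        previous = ch
    if token_start is not None:
        ranges.append((token_start, offset, (line, index)))
    return ranges


def map_span(ranges: list, span_start: int, span_end: int) -> Optional[tuple]:
    """Map one byte span to (start_token_position, end_token_position)."""
    starts = [r[0] for r in ranges]
    ends = [r[1] for r in ranges]
    # First token that includes or follows span_start.
    first = bisect.bisect_right(ends, span_start)
    # Last token that precedes or includes span_end.
    last = bisect.bisect_right(starts, span_end) - 1
    if first >= len(ranges) or last < first:
        return None  # the span covers whitespace only
    return ranges[first][2], ranges[last][2]
```

The single pass over the document is linear in its size, and each lookup is a binary search over the precomputed ranges. A span that starts or ends in the middle of a multi-byte code point still resolves to the token containing that byte, which keeps the mapping deterministic.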

5. Acceptance Criteria

acceptance_criteria_for_hb_04:
  v0_1_prose_ratified:
    status: RATIFIED (scope, steps, token boundary, mapping algorithm, extension policy, mid-cycle handling)
  canonicalization_rule_used_field_binding_recorded:
    status: BOUND (NOT NULL + immutable on verify_result)
  identifier_recorded:
    status: ASSIGNED (working: canon-md-v0.1.0; Đ24 may finalize at ratification artefact)
  signing_attribution_recorded:
    status: ATTRIBUTED (Đ24 + Đ44 primary; GPT + User + Opus/Agent secondary)
  no_canonicalization_library_implemented:
    status: confirmed (CTE-03 remains OPEN; now ready_to_close)
  no_code_written:
    status: confirmed
  no_DDL:
    status: confirmed
hb_04_acceptance_state: ALL SEVEN criteria satisfied; closed_with_notes

6. Downstream Effects

downstream_effects_of_hb_04_closure:
  CTE_03_canonicalization_library_scaffolding:
    status_before: blocked (waited on HB-04)
    status_after: ready_to_close (HB-04 ratified)
    next_action: open engineering session to scaffold the application-layer canonicalization library implementing the v0.1 prose; G-3 oversight
    note: CTE-03 is NOT closed by this closure

  HB_05_rollback_test_plan_dry_run:
    status_before: blocked
    status_after: still blocked (terminal node; many upstream remain)
    status_change: none — HB-05 cannot close until CTE-03 + others all close
    note: HB-04 contributes the parallel chain HB-04 → CTE-03 → HB-05; further closures of CTE-03 + CTE-02 + CTE-04 + HB-07 + HB-09 are required before HB-05

  Step_6_DDL (P0-4 verify_result execution):
    note: first DDL of Step 6 requires canonicalization_rule_used field referencing a Đ24-ratified identifier; HB-04 satisfies the identifier-existence requirement
    status_change: pre-execution gate intact

  HB_01_HB_02_HB_03_HB_06_HB_07_HB_08_HB_09_CTE_01_CTE_02_CTE_04:
    status_change: none (independent of HB-04)
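
The blocking relationships described above can be read as a small dependency check. The sketch below is illustrative only: `UPSTREAM` and `is_unblocked` are hypothetical names, the CTE-03 entry lists only the HB-04 dependency named in this record, and the HB-05 entry lists only the named items, so it is a lower bound on what HB-05 actually waits for.

```python
# Upstream items as named in this record; HB-05 also waits on unnamed "others".
UPSTREAM = {
    "CTE-03": {"HB-04"},
    "HB-05": {"CTE-03", "CTE-02", "CTE-04", "HB-07", "HB-09"},
}


def is_unblocked(item: str, closed: set) -> bool:
    """An item is unblocked once every listed upstream item is closed."""
    return UPSTREAM.get(item, set()) <= closed


closed_now = {"HB-04"}
cte_03_ready = is_unblocked("CTE-03", closed_now)  # True: HB-04 is closed
hb_05_ready = is_unblocked("HB-05", closed_now)    # False: CTE-03 and others open
```

Being unblocked only means work on the item may start; CTE-03 itself stays open until its own engineering session closes it.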

what_HB_04_does_NOT_do:
  - implement the canonicalization library (CTE-03; engineering work; separate session)
  - run any conversion
  - emit any canonical_token stream
  - bind per-source_kind extensions for code or binary (FUTURE; PEF-05)
  - alter any table (verify_result.canonicalization_rule_used field DDL is execution-phase task)
  - issue final Đ24 ratification artefact under ratification/ (Đ24 may produce a separate ratification file referencing this closure; closure is sufficient for HB-04 acceptance per user prompt)

7. Status

HB_04_status: closed_with_notes
HB_04_closure_authority: Đ24 + Đ44 (per cross-cutting register §3.7 + user prompt 2026-05-15)
HB_04_closure_signers:
  - Đ24 vocabulary owner (primary)
  - Đ44 family registry custodian (primary)
  - GPT (policy reviewer)
  - User / anh Huyên (sovereign authority)
  - Opus / Agent (record-keeping)

execution_authorized: false
implementation_allowed: false
ddl_allowed: false
migration_allowed: false
canonicalization_library_scaffolded: false (CTE-03 remains OPEN; now ready_to_close)
canonicalization_library_executed: false

notes_carried_forward:
  - identifier working: canon-md-v0.1.0; Đ24 may finalize at a separate ratification artefact under ratification/ if renamed
  - per-source_kind extensions for code (ast_node) and binary (byte) FUTURE via D4 capability intake (PEF-05)
  - canonical_token_position_form chosen v0.1: (line_index, intra_line_token_index) tuple
  - mid-cycle rule change requires a new rule version with new identifier; legacy verify_result rows remain immutable
  - CTE-03 scaffolding is engineering work; G-3 oversight; HB-04 does not implement

8. Hard Boundaries Confirmation

no_canonicalization_library_implemented: true (CTE-03 remains OPEN)
no_canonical_token_stream_emitted: true
no_conversion_run: true
no_code_written: true
no_ddl_written: true
no_sql_written: true
no_table_created: true
no_table_altered: true (verify_result.canonicalization_rule_used field DDL is execution-phase task)
no_migration_script_written: true
no_migration_executed: true
no_pg_mutation: true
no_qdrant_mutation: true
no_directus_mutation: true
no_data_writes: true
no_per_source_kind_extension_for_code_or_binary_in_this_file: true (FUTURE; PEF-05)
no_execution: true
no_phase_prior_file_modified: true
output_form: hb_04_closure_record_in_markdown_only
knowledge/dev/laws/dieu44-trien-khai/blocker-closure/dot-iu-cutter-v0.1-hb-04-canonicalization-prose-ratification-2026-05-15.md