KB-8659

Canonical Hash Encoding Spec (FIX7-CANON-V1)

Revision 1

(Codex recheck-5 blocker A.) This spec defines the byte-exact, domain-separated, schema-versioned encoding that makes every aggregate digest in the ACTIVE_AUTHORITY_APPROVAL_ENVELOPE reproducible. The encoding is recorded verbatim in doc 00 §Canonical hash encoding, which is the load-bearing copy; this report doc supplies the rationale and the worked proof. The envelope pins canonical_encoding_version: FIX7-CANON-V1, and a verifier MUST reject any other version (G-CANONICAL-ENCODING-CONTRACT).

Universal rules

| dimension | rule |
| --- | --- |
| algorithm | SHA-256, lowercase hex; `shasum -a 256` == python `hashlib.sha256().hexdigest()` (proven) |
| encoding | UTF-8; no Unicode NFC/NFD normalization; raw bytes after newline normalization |
| newline normalization | CRLF (0x0D 0x0A) and lone CR (0x0D) → LF (0x0A) before hashing |
| record separator | each record terminated by exactly one LF (0x0A) |
| field separator | intra-record fields joined by a single TAB (0x09) |
| field order | fixed per record type (below); never the source YAML key order |
| sort key | records emitted in ascending lexicographic UTF-8-byte order by the named key |
| trailing newline | input = domain tag (LF-terminated) + records (each LF-terminated); ends in exactly one trailing LF |
| domain tag | fixed ASCII string ending in LF, unique per hash; prevents cross-kind collision |
| path normalization | full canonical KB document_id incl. `knowledge/dev/reports/architecture/...`; no prefix-strip, no leading slash, no `./`, no trailing slash, exact case |
| revision representation | base-10 ASCII, no leading zeros; self-host doc 00 uses SELF_HOST_PIN_BY_EXCLUDE_REGION_HASH |
| null / absent | forbidden; explicit tokens only: NOT_APPLICABLE, SEAL_AT_CODEX_RECHECK_6, NON_AUTHORITY_DIAGNOSTIC (hashed literally) |
| boolean | lowercase true / false only |
| reproducible command | `printf '%s' "<input>" \| shasum -a 256` == `hashlib.sha256(input).hexdigest()` |
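
The scalar rules above (path normalization, revision representation, null / absent, boolean) can be read as field-level checks applied before any record is assembled. The following is a minimal sketch under stated assumptions, not the project's verifier: `encode_scalar` and `check_document_id` are hypothetical helper names, and the rejection of negative revisions is an assumption the spec does not state.

```python
def encode_scalar(value) -> str:
    """Render one field value as text (hypothetical helper)."""
    if value is None:
        # null / absent is forbidden; callers must pass an explicit token
        # such as "NOT_APPLICABLE" instead.
        raise ValueError("null / absent is forbidden; use an explicit token")
    if isinstance(value, bool):  # test bool before int: bool subclasses int
        return "true" if value else "false"
    if isinstance(value, int):
        if value < 0:
            # Assumption: revisions are non-negative (not stated in the spec).
            raise ValueError("negative revision")
        return str(value)  # base-10 ASCII; str(int) never emits leading zeros
    if isinstance(value, str):
        return value  # explicit tokens and identifiers are hashed literally
    raise TypeError(f"unsupported field type: {type(value).__name__}")


def check_document_id(document_id: str) -> str:
    """Reject the path shapes that the path-normalization rule excludes."""
    if not document_id:
        raise ValueError("empty document_id")
    if document_id.startswith("/"):
        raise ValueError("leading slash")
    if document_id.startswith("./") or "/./" in document_id:
        raise ValueError("'./' segment")
    if document_id.endswith("/"):
        raise ValueError("trailing slash")
    # Returned unchanged: no prefix-strip and no case folding.
    return document_id
```

The sketch only rejects malformed values; it never repairs them, because silently stripping a prefix or folding case would change the bytes being hashed.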

No aggregate hash may be described only in prose; none may depend on unordered map/object serialization. Both prohibitions are enforced by G-CANONICAL-ENCODING-CONTRACT.
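
As an illustration of how the universal rules combine, the sketch below assembles the canonical input (domain tag, then sorted TAB-joined records, each LF-terminated) and hashes it. The function names `canonical_input` and `canonical_sha256` are hypothetical, and rejecting fields that contain TAB, CR, or LF is an assumption made here to keep records unambiguous; the spec above does not say how such fields are handled.

```python
import hashlib
from typing import Callable, Iterable, Sequence, Tuple

Row = Tuple[bytes, ...]


def canonical_input(
    domain_tag: str,
    records: Iterable[Sequence[str]],
    sort_key: Callable[[Row], object] = lambda row: row[0],
) -> bytes:
    """Build domain tag + records, each record LF-terminated."""
    if not domain_tag.endswith("\n") or domain_tag.count("\n") != 1:
        raise ValueError("domain tag must end in exactly one LF")
    rows = []
    for fields in records:
        for field in fields:
            if "\t" in field or "\n" in field or "\r" in field:
                # Assumption: fields never carry separator bytes.
                raise ValueError("field contains TAB, CR, or LF")
        rows.append(tuple(field.encode("utf-8") for field in fields))
    # Python compares bytes lexicographically by byte value, which matches
    # "ascending lexicographic UTF-8-byte order".
    rows.sort(key=sort_key)
    body = b"".join(b"\t".join(row) + b"\n" for row in rows)
    return domain_tag.encode("ascii") + body


def canonical_sha256(domain_tag, records, sort_key=lambda row: row[0]) -> str:
    """Lowercase-hex SHA-256 of the canonical input."""
    data = canonical_input(domain_tag, records, sort_key)
    return hashlib.sha256(data).hexdigest()


# Usage with hypothetical marker rows; the composite sort key is the whole row.
example = canonical_sha256(
    "FIX7_MARKER_FENCE_REGISTRY_V1\n",
    [["doc-b", "fence", "END"], ["doc-a", "fence", "BEGIN"]],
    sort_key=lambda row: row,
)
```

Because records are supplied as ordered sequences and sorted on encoded bytes, the result does not depend on map or object serialization order in the source data.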

Per-aggregate specifications

| aggregate | domain tag | record (field order) | sort key | seal |
| --- | --- | --- | --- | --- |
| active_corpus_membership_sha256 | `FIX7_ACTIVE_AUTHORITY_MEMBERSHIP_V1\n` | `document_id` | document_id asc | NOW (structural) |
| active_corpus_sha256 | `FIX7_ACTIVE_AUTHORITY_CORPUS_V1\n` | `document_id \t doc_status \t active_section_id_or_range \t revision_repr \t normalized_active_content_sha256` | document_id asc | recheck-6 |
| marker_fence_registry_sha256 | `FIX7_MARKER_FENCE_REGISTRY_V1\n` | `document_id \t marker_kind \t marker_literal` | (document_id, marker_kind, marker_literal) asc | recheck-6 |
| superseded_boundary_sha256 | `FIX7_SUPERSEDED_BOUNDARY_V1\n` | `superseded_id \t fence_range` | superseded_id asc | recheck-6 |
| guard_set_sha256 | `FIX7_GUARD_SET_V1\n` | = normalized_active_content_sha256(doc 06); guard_set_revision = doc 06 kb_revision | n/a | recheck-6 |
| per-doc normalized_active_content_sha256 | `FIX7_DOC_NORMALIZED_CONTENT_V1\t<document_id>\n` | then the normalized active bytes (active scope; EXCLUDE region removed for self-host) | n/a | recheck-6 |
| per-doc full_document_sha256 | | NON_AUTHORITY_DIAGNOSTIC for every member | n/a | |
| envelope_manifest_sha256 | `FIX7_ACTIVE_AUTHORITY_ENVELOPE_MANIFEST_V1\n` | the complete authority-field roster (doc 03) EXCEPT itself + detached_seal_sha256 | roster order | recheck-6 |
| detached_seal_sha256 | `FIX7_CODEX_DETACHED_SEAL_V1\n` | detached-seal authority fields EXCEPT itself | seal order | recheck-6 (Codex) |
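
To make two of the rows concrete, the sketch below computes a per-doc normalized_active_content_sha256 and then feeds it into an active_corpus_sha256 record. It rests on assumptions: the function names are hypothetical, members are represented as plain dicts keyed by the field names in the table, and the per-doc hash takes the normalized bytes as-is without adding or stripping a trailing LF (the spec text here does not settle that point for document content).

```python
import hashlib

DOC_TAG = "FIX7_DOC_NORMALIZED_CONTENT_V1"
CORPUS_TAG = "FIX7_ACTIVE_AUTHORITY_CORPUS_V1\n"

# Fixed field order for a corpus record; never taken from dict/YAML key order.
CORPUS_FIELDS = (
    "document_id",
    "doc_status",
    "active_section_id_or_range",
    "revision_repr",
    "normalized_active_content_sha256",
)


def normalize_newlines(data: bytes) -> bytes:
    """CRLF first, then any remaining lone CR, both mapped to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def doc_content_sha256(document_id: str, active_bytes: bytes) -> str:
    """Hash the tag, TAB, document_id, LF, then the normalized active bytes."""
    prefix = f"{DOC_TAG}\t{document_id}\n".encode("utf-8")
    return hashlib.sha256(prefix + normalize_newlines(active_bytes)).hexdigest()


def corpus_sha256(members) -> str:
    """Hash corpus records sorted by document_id in UTF-8 byte order."""
    rows = [tuple(m[name].encode("utf-8") for name in CORPUS_FIELDS) for m in members]
    rows.sort(key=lambda row: row[0])
    body = b"".join(b"\t".join(row) + b"\n" for row in rows)
    return hashlib.sha256(CORPUS_TAG.encode("ascii") + body).hexdigest()
```

Embedding the document_id in the per-doc prefix means two documents with identical bytes still produce different digests, which is the domain-separation property the tag column is there to provide.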

Seal timing (why most values are SEAL_AT_CODEX_RECHECK_6)

A content hash of the approved corpus can only be computed at the approval event. If T1 pre-wrote "approved" content hashes, that would be self-fabricated authority, which is the exact anti-pattern this chain polices. T1 therefore fixes only the structural values now (the membership hash and canonical_encoding_version); Codex computes and seals every content and aggregate hash at the recheck-6 PASS and records them in the detached seal. Because the encoding is fully specified now, Codex's computation is deterministic, not a judgement call.
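
A pre-seal check along these lines could enforce that split. This is a sketch under assumptions: the envelope is modelled as a flat dict, `check_pre_seal_envelope` is a hypothetical name, and the deferred list covers only four aggregates that the table above marks recheck-6.

```python
import re

PINNED_VERSION = "FIX7-CANON-V1"
SEAL_SENTINEL = "SEAL_AT_CODEX_RECHECK_6"
HEX64 = re.compile(r"[0-9a-f]{64}")

# Aggregates that must stay unsealed until the recheck-6 PASS (subset, assumed).
DEFERRED = (
    "active_corpus_sha256",
    "marker_fence_registry_sha256",
    "superseded_boundary_sha256",
    "guard_set_sha256",
)


def check_pre_seal_envelope(envelope: dict) -> list:
    """Return a list of problems; an empty list means the shape is acceptable."""
    errors = []
    if envelope.get("canonical_encoding_version") != PINNED_VERSION:
        errors.append("canonical_encoding_version is not FIX7-CANON-V1")
    membership = envelope.get("active_corpus_membership_sha256")
    if not isinstance(membership, str) or not HEX64.fullmatch(membership):
        errors.append("membership hash must be 64 lowercase hex characters")
    for name in DEFERRED:
        if envelope.get(name) != SEAL_SENTINEL:
            # A real-looking digest here would be a pre-written "approved" hash.
            errors.append(f"{name} must be {SEAL_SENTINEL} before recheck-6")
    return errors
```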

Worked proof (this pass)

Membership hash over the 10 full doc_ids, using domain tag FIX7_ACTIVE_AUTHORITY_MEMBERSHIP_V1, ascending order, LF record terminators, and exactly one trailing LF:

f2bda8effc7be19b54722828126b82d7d2d48bee5e5e5dc0c8f347ce210fe251

Identical under shasum -a 256 and python hashlib. The canonical input bytes are FIX7_ACTIVE_AUTHORITY_MEMBERSHIP_V1\n followed by each full document_id\n in ascending order. The adversarial self-audit (doc 08) shows: prefix-strip → mismatch; trailing-LF removal → mismatch; unordered serialization → mismatch; reorder of source → stable (the verifier re-sorts); CRLF→LF → content-stable. Each was computed, not asserted.
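
The same adversarial checks can be replayed in a few lines. The document IDs below are hypothetical placeholders, not the 10 real members, so the digests will not match the value published above; the sketch only demonstrates which perturbations change the digest and which do not.

```python
import hashlib

TAG = b"FIX7_ACTIVE_AUTHORITY_MEMBERSHIP_V1\n"
PREFIX = "knowledge/dev/reports/architecture/"


def membership_input(document_ids) -> bytes:
    """Domain tag, then each document_id plus LF, in ascending byte order."""
    rows = sorted(doc_id.encode("utf-8") for doc_id in document_ids)
    return TAG + b"".join(row + b"\n" for row in rows)


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Hypothetical IDs, deliberately listed in non-ascending order.
ids = [
    PREFIX + "example-patch/02-encoding.md",
    PREFIX + "example-patch/00-envelope.md",
]

baseline = sha256_hex(membership_input(ids))
results = {
    # Reordering the source is harmless because the encoder re-sorts.
    "reorder": sha256_hex(membership_input(list(reversed(ids)))),
    # Stripping the path prefix changes the hashed bytes.
    "prefix_strip": sha256_hex(membership_input([i[len(PREFIX):] for i in ids])),
    # Dropping the final LF changes the hashed bytes.
    "no_trailing_lf": sha256_hex(membership_input(ids)[:-1]),
    # Emitting records in source order, without sorting, changes the bytes.
    "unsorted": sha256_hex(TAG + b"".join(i.encode("utf-8") + b"\n" for i in ids)),
}
stable = {name for name, digest in results.items() if digest == baseline}
```

With these inputs, `stable` contains only `"reorder"`; every other perturbation yields a different digest.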

knowledge/dev/reports/architecture/t1-fix7-blueprint-patch-after-codex-recheck-5-canonical-envelope-2026-06-09/02-canonical-hash-encoding-spec.md