02 - Canonical Hash Encoding Spec (FIX7-CANON-V1)
(Codex recheck-5 blocker A.) This is the byte-exact, domain-separated, schema-versioned encoding that
makes every aggregate digest in the ACTIVE_AUTHORITY_APPROVAL_ENVELOPE reproducible. It is recorded
verbatim in doc 00 §Canonical hash encoding (the load-bearing copy); this report doc is the
rationale + worked proof. canonical_encoding_version: FIX7-CANON-V1 is pinned in the envelope and a
verifier MUST reject any other version (G-CANONICAL-ENCODING-CONTRACT).
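The version pin can be sketched as a guard that runs before any hashing. This is a minimal illustration, not the verifier's actual code: the `envelope` dict shape, the function name `require_pinned_version`, and the exception class are assumptions made for the example; only the field name `canonical_encoding_version` and the value `FIX7-CANON-V1` come from this spec.

```python
PINNED_VERSION = "FIX7-CANON-V1"


class EncodingVersionError(ValueError):
    """Raised when an envelope declares an unsupported canonical encoding."""


def require_pinned_version(envelope: dict) -> None:
    # Hypothetical envelope shape: a plain dict carrying the pinned field.
    declared = envelope.get("canonical_encoding_version")
    if declared != PINNED_VERSION:
        # A missing field is rejected as well: null/absent is never acceptable.
        raise EncodingVersionError(
            f"unsupported canonical_encoding_version: {declared!r}"
        )
```

Rejecting on anything other than exact string equality keeps the check free of parsing or "compatible version" logic.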
Universal rules
| dimension | rule |
|---|---|
| algorithm | SHA-256, lowercase hex; shasum -a 256 == python hashlib.sha256().hexdigest() (proven) |
| encoding | UTF-8; no Unicode NFC/NFD normalization; raw bytes after newline normalization |
| newline normalization | CRLF (0x0D 0x0A) and lone CR (0x0D) → LF (0x0A) before hashing |
| record separator | each record terminated by exactly one LF (0x0A) |
| field separator | intra-record fields joined by a single TAB (0x09) |
| field order | fixed per record type (below); never the source YAML key order |
| sort key | records emitted in ascending lexicographic UTF-8-byte order by the named key |
| trailing newline | input = domain tag (LF-terminated) + records (each LF-terminated); ends in exactly one trailing LF |
| domain tag | fixed ASCII string ending in LF, unique per hash; prevents cross-kind collision |
| path normalization | full canonical KB document_id incl. knowledge/dev/reports/architecture/...; no prefix-strip, no leading slash, no ./, no trailing slash, exact case |
| revision representation | base-10 ASCII, no leading zeros; self-host doc 00 uses SELF_HOST_PIN_BY_EXCLUDE_REGION_HASH |
| null / absent | forbidden; explicit tokens only: NOT_APPLICABLE, SEAL_AT_CODEX_RECHECK_6, NON_AUTHORITY_DIAGNOSTIC (hashed literally) |
| boolean | lowercase true / false only |
| reproducible command | printf '%s' "<input>" \| shasum -a 256 == hashlib.sha256(input).hexdigest() |
No aggregate hash may be described only in prose; none may depend on unordered map/object
serialization. Both prohibitions are enforced by G-CANONICAL-ENCODING-CONTRACT.
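The universal rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the load-bearing implementation: the function names are hypothetical, the domain tag is passed without its terminating LF (the helper appends it), and records are assumed to arrive as already-encoded UTF-8 byte strings.

```python
import hashlib


def normalize_newlines(data: bytes) -> bytes:
    # CRLF must be handled before lone CR, or CRLF would become two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def canonical_input(domain_tag: str, records: list[bytes]) -> bytes:
    # Domain tag is fixed ASCII and LF-terminated.
    out = domain_tag.encode("ascii") + b"\n"
    # sorted() on bytes compares byte values, i.e. ascending UTF-8-byte order.
    for record in sorted(records):
        out += record + b"\n"  # exactly one LF per record
    return out


def sha256_hex(data: bytes) -> str:
    # hexdigest() is lowercase hex, matching the algorithm rule.
    return hashlib.sha256(data).hexdigest()
```

For example, `sha256_hex(canonical_input("SOME_TAG", [b"b", b"a"]))` hashes the bytes `SOME_TAG\na\nb\n`; because the last record carries its own LF, the input always ends in exactly one trailing LF.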
Per-aggregate specifications
| aggregate | domain tag | record (field order) | sort key | seal |
|---|---|---|---|---|
| active_corpus_membership_sha256 | FIX7_ACTIVE_AUTHORITY_MEMBERSHIP_V1\n | document_id | document_id asc | NOW (structural) |
| active_corpus_sha256 | FIX7_ACTIVE_AUTHORITY_CORPUS_V1\n | document_id \t doc_status \t active_section_id_or_range \t revision_repr \t normalized_active_content_sha256 | document_id asc | recheck-6 |
| marker_fence_registry_sha256 | FIX7_MARKER_FENCE_REGISTRY_V1\n | document_id \t marker_kind \t marker_literal | (document_id, marker_kind, marker_literal) asc | recheck-6 |
| superseded_boundary_sha256 | FIX7_SUPERSEDED_BOUNDARY_V1\n | superseded_id \t fence_range | superseded_id asc | recheck-6 |
| guard_set_sha256 | FIX7_GUARD_SET_V1\n | = normalized_active_content_sha256(doc 06); guard_set_revision = doc 06 kb_revision | n/a | recheck-6 |
| per-doc normalized_active_content_sha256 | FIX7_DOC_NORMALIZED_CONTENT_V1\t<document_id>\n | then the normalized active bytes (active scope; EXCLUDE region removed for self-host) | n/a | recheck-6 |
| per-doc full_document_sha256 | — | NON_AUTHORITY_DIAGNOSTIC for every member | — | n/a |
| envelope_manifest_sha256 | FIX7_ACTIVE_AUTHORITY_ENVELOPE_MANIFEST_V1\n | the complete authority-field roster (doc 03) EXCEPT itself + detached_seal_sha256 | roster order | recheck-6 |
| detached_seal_sha256 | FIX7_CODEX_DETACHED_SEAL_V1\n | detached-seal authority fields EXCEPT itself | seal order | recheck-6 (Codex) |
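A multi-field aggregate such as marker_fence_registry_sha256 could be computed roughly as follows. This is a sketch under assumptions: rows are taken to be Python tuples in the fixed field order, the helper names `encode_field` and `registry_digest` are hypothetical, and rejecting negative integers and fields containing TAB/CR/LF is an added safeguard that the spec does not state explicitly.

```python
import hashlib

DOMAIN_TAG = b"FIX7_MARKER_FENCE_REGISTRY_V1\n"


def encode_field(value) -> bytes:
    if value is None:
        raise ValueError("null/absent is forbidden; use an explicit token")
    if isinstance(value, bool):  # must be tested before int: bool is an int subclass
        return b"true" if value else b"false"
    if isinstance(value, int):
        if value < 0:
            raise ValueError("negative revisions are not covered (assumption)")
        return str(value).encode("ascii")  # base-10, no leading zeros
    encoded = str(value).encode("utf-8")  # no Unicode normalization applied
    if any(sep in encoded for sep in (b"\t", b"\n", b"\r")):
        # Assumption: separators inside a field would make records ambiguous.
        raise ValueError("field contains a separator byte")
    return encoded


def registry_digest(rows) -> str:
    # Each row: (document_id, marker_kind, marker_literal), fixed field order.
    encoded = [tuple(encode_field(f) for f in row) for row in rows]
    encoded.sort()  # element-wise byte comparison = composite key, ascending
    body = b"".join(b"\t".join(fields) + b"\n" for fields in encoded)
    return hashlib.sha256(DOMAIN_TAG + body).hexdigest()
```

Sorting the tuples of encoded fields, rather than the TAB-joined lines, keeps the ordering tied to the named composite key and independent of how the separator byte happens to compare.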
Seal timing (why most values are SEAL_AT_CODEX_RECHECK_6)
A content hash of the approved corpus can only be computed at the approval event. If T1 pre-wrote
"approved" content hashes, that would be self-fabricated authority, the exact anti-pattern this chain
polices. So T1 fixes only the structural values now (membership + canonical_encoding_version), and
Codex computes and seals every content/aggregate hash at the recheck-6 PASS, recording them in the
detached seal. The encoding is fully specified now so that Codex's computation is deterministic, not a
judgement call.
Worked proof (this pass)
Membership hash over the 10 full document_ids (domain tag FIX7_ACTIVE_AUTHORITY_MEMBERSHIP_V1, ascending order, LF-terminated records, exactly one trailing LF):
f2bda8effc7be19b54722828126b82d7d2d48bee5e5e5dc0c8f347ce210fe251
Identical under shasum -a 256 and python hashlib. The canonical input bytes are
FIX7_ACTIVE_AUTHORITY_MEMBERSHIP_V1\n followed by each full document_id\n in ascending order. The
adversarial self-audit (doc 08) shows: prefix-strip → mismatch; trailing-LF removal → mismatch;
unordered serialization → mismatch; reorder of source → stable (the verifier re-sorts); CRLF→LF →
content-stable. Each was computed, not asserted.
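The shape of these adversarial checks can be reproduced with a short script. The sketch below does not recompute the digest above: the three document_ids are hypothetical stand-ins for the real 10-member corpus, and the helper names are illustrative. It only demonstrates which mutations change the membership hash and which do not.

```python
import hashlib

TAG = b"FIX7_ACTIVE_AUTHORITY_MEMBERSHIP_V1\n"
PREFIX = "knowledge/dev/reports/architecture/"


def membership_input(document_ids, resort=True) -> bytes:
    ids = [d.encode("utf-8") for d in document_ids]
    if resort:
        ids.sort()  # the verifier re-sorts in ascending UTF-8-byte order
    return TAG + b"".join(i + b"\n" for i in ids)


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Hypothetical ids (not the real corpus), deliberately listed out of order.
source_ids = [
    PREFIX + "example/02-encoding.md",
    PREFIX + "example/00-envelope.md",
    PREFIX + "example/01-roster.md",
]

baseline = digest(membership_input(source_ids))
checks = {
    "reorder of source": digest(membership_input(source_ids[::-1])) == baseline,
    "prefix-strip": digest(
        membership_input([d[len(PREFIX):] for d in source_ids])
    ) == baseline,
    "trailing-LF removal": digest(membership_input(source_ids)[:-1]) == baseline,
    "unordered serialization": digest(
        membership_input(source_ids, resort=False)
    ) == baseline,
}
for name, stable in checks.items():
    print(f"{name}: {'stable' if stable else 'mismatch'}")
```

Only the reorder case should report stable, because the input is re-sorted before hashing; every other mutation changes the canonical bytes and therefore the digest.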