KB-D565

dot-iu-cutter v0.5 — Source-Document Ingestion Pipeline Design (DESIGN ONLY) (2026-05-17)

Revision 1
Tags: dot-iu-cutter, v0.5, ingestion, source-registry, checksum, versioning, provenance, design-only, dieu44

Date: 2026-05-17
Phase: v0_5_constitution_hardtest_and_information_unit_factory_master_plan
Nature: DESIGN ONLY. No table, no fetch job, no write is authorized.

Parent: dot-iu-cutter-v0.5-constitution-hardtest-master-plan-2026-05-17.md


1. Problem

The "cắt hiến pháp" ("cut the Constitution") intent must resolve a human phrase to an authoritative, versioned, checksummed source representation before any canonicalization or cut. Today the Constitution is not in the system; it lives at an external Directus-rendered URL. The runtime must never hardcode that URL or path; it must resolve it through a source-document registry.

Grounding: the configured URL serves the internal Hiến pháp Kiến trúc Hệ thống Incomex (Incomex System Architecture Constitution) v4.6.3 (KB-7294 rev 44; rendered as HTML from a markdown source; status markers mix ✅ ENACTED and 📋 CONTROLLED DRAFT). The ingestion design must therefore be format- and authority-aware rather than assume a clean legal PDF.


2. Source-document registry (design — NOT created)

Conceptual entity source_document (proposed columns; creation forbidden now):

source_document:
  source_document_ref: text PK            # stable logical key, e.g. "incomex-constitution"
  human_aliases: text[]                    # {"hiến pháp","cắt hiến pháp"} -> intent resolution
  source_url: text                         # registered, NOT hardcoded in code
  source_authority_class: text             # enum-by-registry: authoritative|draft|mirror
  expected_format: text                    # html|markdown|pdf|plain (parser_profile key)
  parser_profile_ref: text                 # FK -> parser_profile registry (config-driven)
  grammar_profile_ref: text                # FK -> grammar profile (canonicalization doc)
  enacted_only_policy: boolean             # if true, exclude 📋 CONTROLLED DRAFT nodes
  registered_by: text
  registered_at: timestamptz
  active: boolean

source_document_version (immutable snapshot per ingest):

source_document_version:
  document_version_id: text PK             # deterministic (see §5)
  source_document_ref: text FK
  content_checksum: text                   # sha256 of canonical-normalized bytes
  raw_checksum: text                       # sha256 of raw fetched bytes (pre-normalization)
  retrieved_at: timestamptz
  http_etag: text                          # opportunistic, advisory only
  http_last_modified: text                 # opportunistic, advisory only
  source_format_detected: text
  parser_profile_resolved: text
  byte_length: bigint
  authority_snapshot: text                 # authoritative|draft at ingest time
  supersedes_version_id: text NULL         # version chain
  ingest_status: text                      # fetched|normalized|anchored|ready|rejected

source_span (provenance anchor; one per addressable node):

source_span:
  span_id: text PK
  document_version_id: text FK
  node_path: text                          # detected-hierarchy path (canonicalization doc)
  char_start: bigint
  char_end: bigint
  byte_start: bigint
  byte_end: bigint
  span_checksum: text                      # sha256 of the exact substring

These are proposals for a later schema-design cycle (Q5/Q6). No DDL here.
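The source_span proposal above records both character and byte offsets, which diverge as soon as the text contains Vietnamese diacritics. The following is a minimal in-memory sketch of how one anchor could be derived from already-normalized content. It assumes byte offsets are measured over the UTF-8 encoding and that char_end / byte_end are exclusive; the `make_span` helper and the `span_id` scheme are hypothetical, since the design does not define them. Nothing here creates a table or writes anything.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSpan:
    span_id: str
    document_version_id: str
    node_path: str
    char_start: int   # inclusive, counted in code points
    char_end: int     # exclusive
    byte_start: int   # inclusive, counted in UTF-8 bytes (assumption)
    byte_end: int     # exclusive
    span_checksum: str


def make_span(content: str, document_version_id: str, node_path: str,
              char_start: int, char_end: int) -> SourceSpan:
    """Build one provenance anchor over normalized content, in memory only."""
    if not 0 <= char_start <= char_end <= len(content):
        raise ValueError("span offsets fall outside the content")
    substring_bytes = content[char_start:char_end].encode("utf-8")
    # The byte offset of the span start is the encoded length of the prefix.
    byte_start = len(content[:char_start].encode("utf-8"))
    span_checksum = hashlib.sha256(substring_bytes).hexdigest()
    # Hypothetical span_id scheme: the design leaves this undefined.
    key = f"{document_version_id}|{node_path}|{char_start}|{char_end}"
    span_id = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return SourceSpan(span_id, document_version_id, node_path,
                      char_start, char_end,
                      byte_start, byte_start + len(substring_bytes),
                      span_checksum)


# "Điều 1" on the first line, "Hiến pháp" on the second, written as NFC escapes.
content = "\u0110i\u1ec1u 1\nHi\u1ebfn ph\u00e1p"
span = make_span(content, "sdv_example", "dieu-1/title", 7, 16)
print(span.char_start, span.char_end, span.byte_start, span.byte_end)  # 7 16 10 22
```

The second line starts at character 7 but at byte 10, which is why the proposal keeps both offset pairs: a consumer reading raw bytes and a consumer reading decoded text can each locate the same substring.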


3. Ingestion stages (config-driven, no hardcoded path/label)

pipeline:
  I0_intent_resolve:
    in: human phrase ("cắt hiến pháp")
    do: lookup human_aliases -> source_document_ref; fail closed if unmapped
  I1_fetch:
    do: HTTP GET registered source_url (read-only); capture status, etag, bytes
    guard: reject non-2xx, reject if authority_class disallows, size ceiling
  I2_identify_format:
    do: detect html/markdown/pdf/plain; resolve parser_profile_ref
    note: current fixture = HTML (Directus) wrapping markdown w/ tables+lists
  I3_compute_raw_checksum:
    do: raw_checksum = sha256(raw_bytes)
  I4_normalize_encoding:
    do: UTF-8 NFC; strip BOM; CRLF->LF; preserve Vietnamese diacritics exactly
  I5_strip_non_content_noise:
    do: remove Directus chrome/nav/script/style; keep headings, numbering,
        tables, ordered/nested lists, status markers (✅/📋) as content tokens
    guard: noise-strip rules are parser_profile config, NOT inline constants
  I6_content_checksum:
    do: content_checksum = sha256(normalized_content_bytes)
  I7_structural_anchoring:
    do: emit source_span anchors over normalized content (char+byte offsets)
        BEFORE canonicalization, so provenance is offset-stable
  I8_version_id:
    do: document_version_id = deterministic(content_checksum, source_document_ref)
        (see §5)
  I9_authority_gate:
    do: record authority_snapshot; if enacted_only_policy then flag 📋 CONTROLLED
        DRAFT nodes as excluded-from-cut (not deleted, just not eligible)
  I10_emit_ready:
    out: parser-ready canonical source representation + span table (in isolated
         env only during dry-run; NO production write now)
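Stages I3 to I6 above can be sketched as pure functions over bytes that are already in memory. This is an illustration under stated assumptions, not an authorized implementation: `ingest_checksums` and its `strip_noise` parameter are hypothetical names, the noise-strip step is left as a pass-through because its rules belong in parser_profile config, and no fetch or write takes place.

```python
import hashlib
import unicodedata


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def normalize_encoding(raw: bytes) -> str:
    """I4: decode UTF-8, drop a leading BOM, convert CRLF to LF, apply NFC."""
    text = raw.decode("utf-8-sig")       # "utf-8-sig" strips a leading BOM if present
    text = text.replace("\r\n", "\n")
    return unicodedata.normalize("NFC", text)


def ingest_checksums(raw: bytes, strip_noise=lambda text: text) -> dict:
    """I3, I4, I5, I6 in the order listed in the pipeline.

    strip_noise stands in for the parser_profile-driven I5 stage; the default
    is a pass-through so the sketch carries no inline noise rules.
    """
    raw_checksum = sha256_hex(raw)                    # I3
    content = strip_noise(normalize_encoding(raw))    # I4 then I5
    content_checksum = sha256_hex(content.encode("utf-8"))  # I6
    return {
        "raw_checksum": raw_checksum,
        "content_checksum": content_checksum,
        "content": content,
    }


# Same text delivered two ways: clean NFC, and BOM + decomposed (NFD) + CRLF.
nfc_text = "Hi\u1ebfn ph\u00e1p\n"
clean_raw = nfc_text.encode("utf-8")
noisy_raw = b"\xef\xbb\xbf" + unicodedata.normalize("NFD", nfc_text).replace(
    "\n", "\r\n").encode("utf-8")

a = ingest_checksums(clean_raw)
b = ingest_checksums(noisy_raw)
print(a["raw_checksum"] == b["raw_checksum"])          # False
print(a["content_checksum"] == b["content_checksum"])  # True
```

The two deliveries differ byte for byte, so their raw checksums differ, yet they normalize to the same content and therefore share one content checksum.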

4. Source authority model

authority:
  question_OD-S1: is KB-7294 rev44 authoritative enough to cut?
  proposal:
    - source_authority_class on the registry decides eligibility
    - 📋 CONTROLLED DRAFT nodes (e.g. Điều 44, i.e. Article 44) excluded unless explicitly waived
    - cutting an authoritative version pins document_version_id; later edits to
      the source create a NEW version, never mutate the cut one
    - human alias "cắt hiến pháp" resolves only to active authoritative registry rows
  escalate: GPT/User must rule OD-S1 before any Constitution dry-run-at-volume
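The enacted-only rule in the proposal above can be illustrated with a small eligibility filter. This is a sketch only, and OD-S1 remains unruled: the `Node` shape, the `authority_gate` function, the `waived_paths` parameter, and the example node paths are all hypothetical, and the draft marker is passed in as a parameter on the assumption that it would come from config rather than from an inline constant.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class Node:
    node_path: str
    text: str


def authority_gate(nodes: Iterable[Node], enacted_only_policy: bool,
                   draft_marker: str,
                   waived_paths: FrozenSet[str] = frozenset()) -> Dict[str, bool]:
    """Map node_path to eligible-for-cut. Nodes are flagged, never deleted."""
    eligibility = {}
    for node in nodes:
        is_draft = draft_marker in node.text
        excluded = (enacted_only_policy and is_draft
                    and node.node_path not in waived_paths)
        eligibility[node.node_path] = not excluded
    return eligibility


# Example inputs; in a real run the marker would be read from config.
marker = "📋 CONTROLLED DRAFT"
nodes = [
    Node("dieu-43", "Điều 43 ✅ ENACTED ..."),
    Node("dieu-44", "Điều 44 📋 CONTROLLED DRAFT ..."),
]

default_result = authority_gate(nodes, True, marker)
waived_result = authority_gate(nodes, True, marker, frozenset({"dieu-44"}))
print(default_result)   # {'dieu-43': True, 'dieu-44': False}
print(waived_result)    # {'dieu-43': True, 'dieu-44': True}
```

Returning a flag per node, instead of filtering the list, keeps draft nodes visible in the span table while making them ineligible for cutting.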

5. Determinism, checksum, versioning

determinism:
  raw_checksum: sha256(raw_fetched_bytes)
  content_checksum: sha256(strip(normalize(raw)))   # §3 order: I4 normalize, then I5 strip; stable across cosmetic Directus changes
  document_version_id: f(content_checksum, source_document_ref)
    - same content + same source_ref  => same version_id (idempotent re-ingest, A2)
    - changed content                 => new version_id, supersedes chain
  span_checksum: sha256(exact_substring)            # detects drift between version and cut
rationale:
  - content_checksum (not raw) is the identity basis so Directus re-render noise
    does not spuriously fork versions
  - raw_checksum retained for forensic/audit only
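The design leaves the function f unspecified. One possible shape, shown purely as an assumption, is a SHA-256 over the source reference and the content checksum joined by a separator. The `sdv_` prefix, the 32-character truncation, and the `plan_ingest` helper are hypothetical, and supersession semantics remain open under OD-SR3.

```python
import hashlib
from typing import Optional, Tuple


def document_version_id(content_checksum: str, source_document_ref: str) -> str:
    """One candidate for f: deterministic, with no clock or random input."""
    # A NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
    payload = f"{source_document_ref}\x00{content_checksum}".encode("utf-8")
    return "sdv_" + hashlib.sha256(payload).hexdigest()[:32]


def plan_ingest(latest_version_id: Optional[str], content_checksum: str,
                source_document_ref: str) -> Tuple[str, str, Optional[str]]:
    """Return (action, version_id, supersedes_version_id) without writing."""
    version_id = document_version_id(content_checksum, source_document_ref)
    if version_id == latest_version_id:
        return ("reuse", version_id, None)            # idempotent re-ingest
    return ("create", version_id, latest_version_id)  # extends the chain


checksum_v1 = hashlib.sha256(b"content v1").hexdigest()
checksum_v2 = hashlib.sha256(b"content v2").hexdigest()

_, v1, _ = plan_ingest(None, checksum_v1, "incomex-constitution")
same = plan_ingest(v1, checksum_v1, "incomex-constitution")
changed = plan_ingest(v1, checksum_v2, "incomex-constitution")
print(same[0], changed[0])   # reuse create
```

Because the identifier depends only on the content checksum and the source reference, re-ingesting unchanged content yields the existing identifier, while any content change yields a new one whose supersedes pointer names the previous version.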

6. Provenance to every IU

Every IU produced downstream MUST carry:

iu_provenance:
  document_version_id
  source_span.span_id
  node_path
  span_checksum_at_cut
guarantee: an IU with no resolvable source_span is INVALID and rejected at REVIEW.

This satisfies master-plan acceptance A4 and P5.
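A REVIEW-time check for the guarantee above could look like the following sketch. It assumes spans and normalized content are available as in-memory dictionaries and that the span checksum is a SHA-256 over the UTF-8 bytes of the substring; `validate_iu_provenance`, the dictionary shapes, and the key name `span_id` (for source_span.span_id) are hypothetical.

```python
import hashlib
from typing import Dict, List

REQUIRED_FIELDS = ("document_version_id", "span_id", "node_path",
                   "span_checksum_at_cut")


def validate_iu_provenance(iu: dict, spans: Dict[str, dict],
                           content_by_version: Dict[str, str]) -> List[str]:
    """Return rejection reasons; an empty list means the IU passes."""
    missing = [name for name in REQUIRED_FIELDS if not iu.get(name)]
    if missing:
        return [f"missing field: {name}" for name in missing]
    span = spans.get(iu["span_id"])
    if span is None:
        return ["unresolvable source_span"]
    reasons = []
    if span["document_version_id"] != iu["document_version_id"]:
        reasons.append("span belongs to a different document version")
    if span["node_path"] != iu["node_path"]:
        reasons.append("node_path mismatch")
    content = content_by_version.get(iu["document_version_id"])
    if content is None:
        reasons.append("unknown document version")
        return reasons
    # Recompute the checksum of the anchored substring to detect drift.
    substring = content[span["char_start"]:span["char_end"]]
    current = hashlib.sha256(substring.encode("utf-8")).hexdigest()
    if current != iu["span_checksum_at_cut"]:
        reasons.append("span checksum drift")
    return reasons


content = "Article 1\nBody text"
spans = {"s1": {"document_version_id": "v1", "node_path": "a1",
                "char_start": 0, "char_end": 9}}
good_iu = {"document_version_id": "v1", "span_id": "s1", "node_path": "a1",
           "span_checksum_at_cut": hashlib.sha256(b"Article 1").hexdigest()}
orphan_iu = dict(good_iu, span_id="missing")

print(validate_iu_provenance(good_iu, spans, {"v1": content}))    # []
print(validate_iu_provenance(orphan_iu, spans, {"v1": content}))  # ['unresolvable source_span']
```

Returning reasons instead of a boolean lets a reviewer see why an IU was rejected, for example an unresolvable span versus checksum drift.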


7. Anti-hardcoding rules (binding)

no_hardcode:
  - source_url never literal in code: always source_document registry lookup
  - parser noise-strip rules: parser_profile config rows, not inline constants
  - format detection table: registry/config, not switch-on-literal
  - human alias mapping: registry, not if/elif on "cắt hiến pháp"
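To make the rules above concrete, here is a sketch of alias resolution that reads only registry rows and fails closed. The in-memory `REGISTRY` list stands in for the proposed (not yet created) source_document table; the row class, the `UnmappedIntent` exception, the placeholder URL, and the NFC plus casefold matching rule are all assumptions made for illustration.

```python
import unicodedata
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class SourceDocumentRow:
    source_document_ref: str
    human_aliases: Tuple[str, ...]
    source_url: str
    source_authority_class: str
    active: bool


class UnmappedIntent(LookupError):
    """Raised when a phrase does not map to exactly one eligible row."""


def _alias_key(phrase: str) -> str:
    # Assumed matching rule: NFC, trimmed, case-insensitive.
    return unicodedata.normalize("NFC", phrase).strip().casefold()


def resolve_intent(phrase: str,
                   registry: Iterable[SourceDocumentRow]) -> SourceDocumentRow:
    wanted = _alias_key(phrase)
    matches = [
        row for row in registry
        if row.active
        and row.source_authority_class == "authoritative"
        and wanted in {_alias_key(alias) for alias in row.human_aliases}
    ]
    if len(matches) != 1:   # fail closed on both zero and ambiguous matches
        raise UnmappedIntent(f"{len(matches)} eligible rows for {phrase!r}")
    return matches[0]


# Stand-in registry data; the URL is a placeholder, not the real source.
REGISTRY = [
    SourceDocumentRow("incomex-constitution", ("hiến pháp", "cắt hiến pháp"),
                      "https://example.invalid/constitution",
                      "authoritative", True),
    SourceDocumentRow("old-mirror", ("bản sao",),
                      "https://example.invalid/mirror", "mirror", True),
]

row = resolve_intent("Cắt Hiến Pháp", REGISTRY)
print(row.source_document_ref)   # incomex-constitution
```

The calling code never sees a literal URL or a literal phrase comparison: both the alias set and the URL arrive as data, so changing either is a registry update, not a code change.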

8. Open decisions

open_decisions:
  OD-S1: authority sufficiency + enacted-only policy (escalated)
  OD-SR1: registry table ownership/schema (cutter_governance vs new) — Q5/Q6
  OD-SR2: checksum normalization profile exact rule set (whitespace/table policy)
  OD-SR3: version supersession semantics for a living document (S178-style edits)

9. Do not run yet

No fetch job, no registry table, no checksum write, no span write, no production write, no schema migration, no code change. A single read-only GET of the source URL was performed, for grounding only. See master plan §10 for the full forbidden list.


10. Git

git: { branch: main, HEAD: e93424b5ff7fa5e4b8406131977ce4339cd0856a,
       status_short_iu_cutter: clean, code_changed: false, commit_made: false }