dot-iu-cutter v0.5 — Source-Document Ingestion Pipeline Design (DESIGN ONLY) (2026-05-17)
Date: 2026-05-17
Phase: v0_5_constitution_hardtest_and_information_unit_factory_master_plan
Nature: DESIGN ONLY. No table, no fetch job, no write is authorized.
Parent: dot-iu-cutter-v0.5-constitution-hardtest-master-plan-2026-05-17.md
1. Problem
A "cắt hiến pháp" ("cut the Constitution") request must resolve the human
phrase to an authoritative, versioned, checksummed source representation before
any canonicalization or cut. Today the Constitution is not in the system; it
lives at an external Directus-rendered URL. The runtime must never hardcode that
URL or path; it must resolve it through a source-document registry.
Grounding: the configured URL serves the internal Hiến pháp Kiến trúc Hệ thống Incomex (Incomex System Architecture Constitution) v4.6.3, registered as KB-7294 rev 44. It is rendered as HTML from a markdown source, and its status markers mix ✅ ENACTED and 📋 CONTROLLED DRAFT. The ingestion design must therefore be format- and authority-aware and must not assume a clean legal PDF.
2. Source-document registry (design — NOT created)
Conceptual entity source_document (proposed columns; creation forbidden now):
source_document:
  source_document_ref: text PK # stable logical key, e.g. "incomex-constitution"
  human_aliases: text[] # {"hiến pháp","cắt hiến pháp"} -> intent resolution
  source_url: text # registered, NOT hardcoded in code
  source_authority_class: text # enum-by-registry: authoritative|draft|mirror
  expected_format: text # html|markdown|pdf|plain (parser_profile key)
  parser_profile_ref: text # FK -> parser_profile registry (config-driven)
  grammar_profile_ref: text # FK -> grammar profile (canonicalization doc)
  enacted_only_policy: boolean # if true, exclude 📋 CONTROLLED DRAFT nodes
  registered_by: text
  registered_at: timestamptz
  active: boolean
source_document_version (immutable snapshot per ingest):
source_document_version:
  document_version_id: text PK # deterministic (see §5)
  source_document_ref: text FK
  content_checksum: text # sha256 of canonical-normalized bytes
  raw_checksum: text # sha256 of raw fetched bytes (pre-normalization)
  retrieved_at: timestamptz
  http_etag: text # opportunistic, advisory only
  http_last_modified: text # opportunistic, advisory only
  source_format_detected: text
  parser_profile_resolved: text
  byte_length: bigint
  authority_snapshot: text # authoritative|draft at ingest time
  supersedes_version_id: text NULL # version chain
  ingest_status: text # fetched|normalized|anchored|ready|rejected
source_span (provenance anchor; one per addressable node):
source_span:
  span_id: text PK
  document_version_id: text FK
  node_path: text # detected-hierarchy path (canonicalization doc)
  char_start: bigint
  char_end: bigint
  byte_start: bigint
  byte_end: bigint
  span_checksum: text # sha256 of the exact substring
These are proposals for a later schema-design cycle (Q5/Q6). No DDL here.
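To illustrate how the char and byte offsets of a source_span relate to each other, here is a minimal Python sketch. It is not part of the authorized design. The SourceSpan class, the make_span helper, exclusive end offsets, UTF-8 byte counting, and the span_id derivation are all assumptions made for illustration; the design itself does not yet define them.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSpan:
    """Hypothetical in-memory mirror of the proposed source_span columns."""
    span_id: str
    document_version_id: str
    node_path: str
    char_start: int
    char_end: int
    byte_start: int
    byte_end: int
    span_checksum: str


def make_span(document_version_id: str, node_path: str,
              content: str, char_start: int, char_end: int) -> SourceSpan:
    """Build a span over normalized content. End offsets are exclusive (assumption)."""
    substring = content[char_start:char_end]
    # Byte offsets come from the UTF-8 encoding of the prefix, so they stay
    # consistent with the char offsets even for multi-byte Vietnamese text.
    byte_start = len(content[:char_start].encode("utf-8"))
    byte_end = byte_start + len(substring.encode("utf-8"))
    span_checksum = hashlib.sha256(substring.encode("utf-8")).hexdigest()
    # The span_id formula is an assumption; the design leaves it undefined.
    span_key = f"{document_version_id}|{node_path}|{char_start}|{char_end}"
    span_id = hashlib.sha256(span_key.encode("utf-8")).hexdigest()[:16]
    return SourceSpan(span_id, document_version_id, node_path,
                      char_start, char_end, byte_start, byte_end, span_checksum)
```

Storing both offset families lets a consumer slice either the decoded string or the stored bytes and arrive at the same substring, which is what span_checksum then verifies.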
3. Ingestion stages (config-driven, no hardcoded path/label)
pipeline:
  I0_intent_resolve:
    in: human phrase ("cắt hiến pháp")
    do: lookup human_aliases -> source_document_ref; fail closed if unmapped
  I1_fetch:
    do: HTTP GET registered source_url (read-only); capture status, etag, bytes
    guard: reject non-2xx; reject if source_authority_class disallows; enforce size ceiling
  I2_identify_format:
    do: detect html/markdown/pdf/plain; resolve parser_profile_ref
    note: current fixture = HTML (Directus) wrapping markdown with tables and lists
  I3_compute_raw_checksum:
    do: raw_checksum = sha256(raw_bytes)
  I4_normalize_encoding:
    do: UTF-8 NFC; strip BOM; CRLF->LF; preserve Vietnamese diacritics exactly
  I5_strip_non_content_noise:
    do: remove Directus chrome/nav/script/style; keep headings, numbering,
      tables, ordered/nested lists, status markers (✅/📋) as content tokens
    guard: noise-strip rules are parser_profile config, NOT inline constants
  I6_content_checksum:
    do: content_checksum = sha256(normalized_content_bytes)
  I7_structural_anchoring:
    do: emit source_span anchors over normalized content (char+byte offsets)
      BEFORE canonicalization, so provenance is offset-stable
  I8_version_id:
    do: document_version_id = deterministic(content_checksum, source_document_ref)
      (see §5)
  I9_authority_gate:
    do: record authority_snapshot; if enacted_only_policy then flag 📋 CONTROLLED
      DRAFT nodes as excluded-from-cut (not deleted, just not eligible)
  I10_emit_ready:
    out: parser-ready canonical source representation + span table (in isolated
      env only during dry-run; NO production write now)
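Stages I3, I4, and I6 can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not an authorized implementation: the function names are hypothetical, and stage I5 (noise stripping) is deliberately omitted because its rules are meant to live in parser_profile config.

```python
import hashlib
import unicodedata


def normalize_encoding(raw_bytes: bytes) -> str:
    """I4 sketch: decode UTF-8, drop a leading BOM, CRLF->LF, then NFC."""
    text = raw_bytes.decode("utf-8-sig")  # utf-8-sig strips a leading BOM if present
    text = text.replace("\r\n", "\n")
    # NFC composes base letters and combining marks, so the same Vietnamese
    # text always yields the same code points regardless of how it was typed.
    return unicodedata.normalize("NFC", text)


def compute_checksums(raw_bytes: bytes, normalized_content: str) -> tuple[str, str]:
    """I3 + I6 sketch: return (raw_checksum, content_checksum) as hex digests."""
    raw_checksum = hashlib.sha256(raw_bytes).hexdigest()
    content_checksum = hashlib.sha256(normalized_content.encode("utf-8")).hexdigest()
    return raw_checksum, content_checksum
```

Two fetches that differ only in BOM, line endings, or Unicode composition form would get different raw_checksum values but the same content_checksum under this sketch.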
4. Source authority model
authority:
  question_OD-S1: is KB-7294 rev 44 authoritative enough to cut?
  proposal:
    - source_authority_class on the registry decides eligibility
    - 📋 CONTROLLED DRAFT nodes (e.g. Điều 44) excluded unless explicitly waived
    - cutting an authoritative version pins document_version_id; later edits to
      the source create a NEW version, never mutate the cut one
    - human alias "cắt hiến pháp" resolves only to active authoritative registry rows
  escalate: GPT/User must rule on OD-S1 before any Constitution dry-run-at-volume
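The draft-exclusion rule in the proposal could look roughly like the Python sketch below. Everything here is hypothetical: the Node class, the apply_authority_gate function, the waived_paths parameter, and the idea of detecting drafts by a marker substring are assumptions for illustration. The marker itself is passed in rather than hardcoded, in the spirit of the config-driven design.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Node:
    """Hypothetical addressable node with its status marker kept in the text."""
    node_path: str
    text: str
    eligible_for_cut: bool = True


def apply_authority_gate(nodes: list[Node], enacted_only_policy: bool,
                         draft_marker: str,
                         waived_paths: frozenset[str] = frozenset()) -> list[Node]:
    """Flag draft nodes as not eligible; never drop them from the list."""
    gated = []
    for node in nodes:
        is_draft = draft_marker in node.text
        excluded = (enacted_only_policy and is_draft
                    and node.node_path not in waived_paths)
        gated.append(replace(node, eligible_for_cut=not excluded))
    return gated
```

Returning every node with a flag, instead of filtering, keeps excluded nodes visible for audit and makes an explicit waiver a one-parameter change.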
5. Determinism, checksum, versioning
determinism:
  raw_checksum: sha256(raw_fetched_bytes)
  content_checksum: sha256(strip(normalize(raw))) # I4 then I5; stable across cosmetic Directus changes
  document_version_id: f(content_checksum, source_document_ref)
    - same content + same source_ref => same version_id (idempotent re-ingest, A2)
    - changed content => new version_id, supersedes chain
  span_checksum: sha256(exact_substring) # detects drift between version and cut
rationale:
  - content_checksum (not raw) is the identity basis so Directus re-render noise
    does not spuriously fork versions
  - raw_checksum retained for forensic/audit only
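One possible shape for f(content_checksum, source_document_ref) is a hash over the two inputs, sketched below in Python. The exact formula is not specified by this design, so the delimiter, the "dv_" prefix, and the 32-character truncation are assumptions chosen only to make the idempotency property concrete.

```python
import hashlib


def document_version_id(content_checksum: str, source_document_ref: str) -> str:
    """Derive a version id purely from its two inputs (hypothetical formula)."""
    # A newline delimiter keeps ("ab", "c") and ("a", "bc") from colliding.
    payload = f"{source_document_ref}\n{content_checksum}".encode("utf-8")
    return "dv_" + hashlib.sha256(payload).hexdigest()[:32]
```

Because no timestamp or counter enters the formula, re-ingesting unchanged content yields the same id, while any change to the normalized content yields a new one.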
6. Provenance to every IU
Every IU produced downstream MUST carry:
iu_provenance:
  document_version_id
  source_span.span_id
  node_path
  span_checksum_at_cut
guarantee: an IU with no resolvable source_span is INVALID and rejected at REVIEW.
This satisfies master-plan acceptance A4 and P5.
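A REVIEW-time provenance check could be sketched as follows. This is a hedged illustration only: the validate_iu_provenance function, the dict-based row shapes, the flat span_id key (standing in for source_span.span_id), and the error strings are all hypothetical, and end offsets are assumed to be exclusive.

```python
import hashlib

REQUIRED_FIELDS = ("document_version_id", "span_id", "node_path", "span_checksum_at_cut")


def validate_iu_provenance(iu: dict, spans_by_id: dict, normalized_content: str) -> list[str]:
    """Return rejection reasons; an empty list means the IU passes this check."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if not iu.get(name)]
    if errors:
        return errors
    span = spans_by_id.get(iu["span_id"])
    if span is None:
        return ["unresolvable source_span"]
    if span["document_version_id"] != iu["document_version_id"]:
        errors.append("span belongs to a different document_version_id")
    if span["node_path"] != iu["node_path"]:
        errors.append("node_path mismatch")
    # Recompute the checksum from the stored offsets to detect drift since the cut.
    substring = normalized_content[span["char_start"]:span["char_end"]]
    current = hashlib.sha256(substring.encode("utf-8")).hexdigest()
    if current != iu["span_checksum_at_cut"]:
        errors.append("span_checksum drift since cut")
    return errors
```

An IU would be rejected whenever the returned list is non-empty, which keeps the rule fail-closed.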
7. Anti-hardcoding rules (binding)
no_hardcode:
  - source_url never literal in code: always source_document registry lookup
  - parser noise-strip rules: parser_profile config rows, not inline constants
  - format detection table: registry/config, not switch-on-literal
  - human alias mapping: registry, not if/elif on "cắt hiến pháp"
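The alias rule can be made concrete with a registry-driven lookup such as the Python sketch below. It is illustrative only: resolve_intent, the dict row shape, and the matching policy (NFC plus case-folding, exactly one active authoritative match) are assumptions. No alias or URL appears as a literal in the function; everything comes from the rows passed in.

```python
import unicodedata


def _normalize_phrase(phrase: str) -> str:
    """Trim, case-fold, and NFC-normalize so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", phrase.strip().casefold())


def resolve_intent(phrase: str, registry_rows: list[dict]) -> str:
    """Map a human phrase to a source_document_ref, failing closed otherwise."""
    wanted = _normalize_phrase(phrase)
    matches = {
        row["source_document_ref"]
        for row in registry_rows
        if row["active"]
        and row["source_authority_class"] == "authoritative"
        and wanted in {_normalize_phrase(alias) for alias in row["human_aliases"]}
    }
    if len(matches) != 1:
        # Unmapped or ambiguous phrases are rejected rather than guessed.
        raise LookupError(f"cannot resolve phrase to exactly one source: {phrase!r}")
    return next(iter(matches))
```

Adding a new alias or a new source document then becomes a registry change, with no edit to the resolver.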
8. Open decisions
open_decisions:
  OD-S1: authority sufficiency + enacted-only policy (escalated)
  OD-SR1: registry table ownership/schema (cutter_governance vs new) — Q5/Q6
  OD-SR2: checksum normalization profile exact rule set (whitespace/table policy)
  OD-SR3: version supersession semantics for a living document (S178-style edits)
9. Do not run yet
No fetch job, no registry table, no checksum write, no span write, no production write, no schema migration, no code change. Read-only GET of the source URL was used once for grounding only. See master plan §10 for the full forbidden list.
10. Git
git: { branch: main, HEAD: e93424b5ff7fa5e4b8406131977ce4339cd0856a,
status_short_iu_cutter: clean, code_changed: false, commit_made: false }