KB-B362

D36 No-Vector Staging Zone — 03 SQL + NoSQL Staging Architecture</title> <parameter name="tags">["d36", "no-vector-staging-zone", "architecture", "iu_staging_record", "iu_staging_payload", "design", "2026-05-25"]

14 min read Revision 1

03 — SQL + NoSQL Staging Architecture

Purpose. Design the general substrate that serves all temporary data needs — SQL temp, NoSQL temp, MARK manifests, draft payloads, review packages, agent intermediate outputs, event working state, import previews. Parent: 01-d36-no-vector-staging-zone-addendum.md, 02-…-matrix.md. Status: DESIGN ONLY. No DDL, no migration, no production mutation.

The substrate is intentionally small — a two-table sidecar pattern (a proven precedent: see 15000x mig 031 sidecar template registry / 18000x mig 032 sidecar versioning) — parameterised by payload_type. One shape covers all eight use cases.


1. Substrate overview

┌──────────────────────────────────────────┐
│  iu_staging_record   (metadata, SQL row) │
│  PK: staging_record_id (uuid)            │
│  - source / owner / lifecycle / index    │
│  - payload_type discriminator            │
│  - integrity hash + size + counts        │
└──────────────────────────────────────────┘
                  │ 1
                  │
                  │ 1..N (some types have multiple parts; most have 1)
                  ▼
┌──────────────────────────────────────────┐
│  iu_staging_payload  (payload layer)     │
│  PK: staging_payload_id (uuid)           │
│  FK: staging_record_id                   │
│  - payload_kind (json|text|blob_ref)     │
│  - payload_json (jsonb, nullable)        │
│  - payload_text (text, nullable)         │
│  - blob_ref     (text, nullable)         │
│  - byte_len + content_hash               │
└──────────────────────────────────────────┘

Why two tables instead of one wide table?

  1. The metadata row is small and queryable for governance, lifecycle, indexes, audit, and retention scans. Keeping it narrow keeps lifecycle scans fast.
  2. The payload row is wide and rarely scanned; it is the place where bulk JSON / text / blob refs live. A separate table lets pg_toast handle big payloads without bloating the metadata table.
  3. A staging item may have multiple payload parts (e.g. a review package = manifest + commentary + check-list). The 1:N pattern handles this without forcing the metadata row to model parts.
  4. The "narrow sidecar" pattern matches the 15000x/18000x precedent: "Two narrow sidecars beat one wide sidecar" — applied here to the metadata vs payload split.

2. iu_staging_record — metadata table

Column Type Required Notes
staging_record_id uuid PK yes gen_random_uuid(); full uuid, no truncation
staging_kind text (enum) yes One of the 8 use cases — see §4 below
payload_type text (enum) yes Discriminator for payload table; see §5
purpose text yes Short human description, like collection_registry.purpose
lifecycle_status text (enum) yes pendingapprovedconsumed / rejected / expired / cleaned
owner_actor text yes Agent or human actor who created the record
source_kind text yes agent / user / system / import
source_ref text no URL / doc id / event id the record references
idempotency_key text UNIQUE yes md5(source_ref‖staging_kind‖suffix); for safe replay (25000x lesson)
content_hash text yes sha256 over canonical payload bytes
byte_len bigint yes Total payload bytes
part_count int yes Number of rows in iu_staging_payload for this record
metadata jsonb no Free-form tags / hints (NOT the payload)
referenced_iu_ids uuid[] no If the record points at IU-Core entities (Điều 37 cross-ref)
created_at timestamptz yes DEFAULT now()
approved_at timestamptz no Set on approved transition
approved_by text no Reviewer name / role
approval_doc_id text no KB doc id of the review package
consumed_at timestamptz no Set on consumed transition
consumed_by_run_id uuid no FK to dot_iu_command_run.run_id of the consumer
expires_at timestamptz yes Computed at create time from policy
cleaned_at timestamptz no Set on cleaned transition
vector_excluded boolean yes DEFAULT true; CHECK = true (Rule N1 hard)

CHECK constraints (testable):

CHECK (vector_excluded = true)                          -- Rule N1 hard
CHECK (lifecycle_status IN
       ('pending','approved','consumed','rejected','expired','cleaned'))
CHECK ( (lifecycle_status = 'approved')
        = (approved_at IS NOT NULL AND approved_by IS NOT NULL) )
CHECK ( (lifecycle_status = 'consumed')
        = (consumed_at IS NOT NULL AND consumed_by_run_id IS NOT NULL) )
CHECK ( expires_at > created_at )

Indexes: lifecycle_status, expires_at, staging_kind, payload_type, idempotency_key (UNIQUE), owner_actor.


3. iu_staging_payload — payload table

Column Type Required Notes
staging_payload_id uuid PK yes gen_random_uuid()
staging_record_id uuid FK yes iu_staging_record(staging_record_id); ON DELETE CASCADE
part_index int yes 0-based; UNIQUE per staging_record_id
part_name text yes Human label — manifest / commentary / checklist / preview_rows / …
payload_kind text (enum) yes json / text / blob_ref
payload_json jsonb conditional NOT NULL iff payload_kind = 'json'
payload_text text conditional NOT NULL iff payload_kind = 'text'
blob_ref text conditional NOT NULL iff payload_kind = 'blob_ref'; points at object storage (e.g. s3://… or directus_file_id)
byte_len bigint yes Computed at insert
content_hash text yes sha256 over canonical bytes of this part
created_at timestamptz yes DEFAULT now()

CHECK constraints:

CHECK (payload_kind IN ('json','text','blob_ref'))
CHECK (
  (payload_kind = 'json'     AND payload_json IS NOT NULL AND payload_text IS NULL AND blob_ref IS NULL) OR
  (payload_kind = 'text'     AND payload_text IS NOT NULL AND payload_json IS NULL AND blob_ref IS NULL) OR
  (payload_kind = 'blob_ref' AND blob_ref     IS NOT NULL AND payload_json IS NULL AND payload_text IS NULL)
)
UNIQUE (staging_record_id, part_index)

Indexes: staging_record_id, payload_kind.

Vectorization note. The payload table is the only place that holds embeddable bytes (payload_json / payload_text). By having it sit under iu_staging_payload, the vector connector's species/path exclusion (04-vector-exclusion-contract.md) is sufficient to keep it out of Qdrant.


4. staging_kind enum (8 supported use cases)

staging_kind:
  - mark_manifest          # Stage-1 MARK output (cut_manifest)
  - review_package         # Reviewer's bundle for a manifest
  - cut_preview            # Dry-run row-set a future CUT would write
  - sql_snapshot           # SELECT result frozen for review
  - nosql_payload          # JSON payload (e.g. Lark export, import preview)
  - draft_iu_composition   # In-flight IU-Core composition before commit
  - agent_intermediate     # Agent dry-run / research / planning artifact
  - event_working_state    # In-flight enrichment / partial composition

A new staging_kind value is added only by a follow-up addendum + matching DOT enum bump + pinning-test bump (small-blast-radius change; not a new law).


5. payload_type enum

Mapped from staging_kind but kept separate to allow re-use across kinds:

payload_type:
  - manifest_json          # cut_manifest schema M1..M16+R1..R7+C1..C8
  - mark_report            # textual MARK output (Markdown / log)
  - sql_result_snapshot    # rows + columns + sql_text + ran_at
  - nosql_payload          # opaque JSON document
  - source_excerpt         # excerpt of an external doc/URL
  - import_preview         # proposed import row-set
  - event_working_state    # event enrichment buffer
  - composition_draft      # IU-Core pieces + planned membership
  - review_bundle          # mixed multi-part: manifest + commentary + checks

review_bundle is the multi-part case (part_count > 1). The others typically have part_count = 1.


6. Specific use-case shapes

6.1 MARK manifest (covers MARK→REVIEW→CUT integration)

iu_staging_record
  staging_kind      = 'mark_manifest'
  payload_type      = 'manifest_json'
  purpose           = 'MARK manifest for <article#> — IU-Core cut'
  source_ref        = '<source_doc_url_or_id>'
  idempotency_key   = md5(source_ref ‖ 'mark_manifest' ‖ <suffix>)
  expires_at        = created_at + INTERVAL '14 days'
  vector_excluded   = true
iu_staging_payload (part_index = 0)
  part_name      = 'manifest'
  payload_kind   = 'json'
  payload_json   = { … cut_manifest M1..M16+R1..R7+C1..C8 … }
  content_hash   = sha256(canonical(payload_json))

The content_hash becomes the manifest_digest that the CUT contract (dot_iu_cut_from_manifest) requires for the approval gate. No more hash drift between disk-file and DB-row.

6.2 SQL temporary snapshot

staging_kind = 'sql_snapshot'
payload_type = 'sql_result_snapshot'
payload_json = {
  "sql_text": "...",
  "ran_at":   "2026-05-25T08:11:15Z",
  "row_count": 42,
  "columns":  ["col_a","col_b"],
  "rows":     [[…],[…],…]
}

6.3 NoSQL temporary payload (Lark / import preview)

staging_kind = 'nosql_payload' | 'import_preview'
payload_type = 'nosql_payload'  | 'import_preview'
payload_json = { … upstream JSON document … }

6.4 Review package (multi-part)

iu_staging_record  staging_kind = 'review_package'  part_count = 3
iu_staging_payload part_index=0  part_name='manifest'   payload_kind='json'
iu_staging_payload part_index=1  part_name='commentary' payload_kind='text'
iu_staging_payload part_index=2  part_name='checklist'  payload_kind='json'

6.5 Agent intermediate output / draft composition / event working state

All follow the same pattern: one metadata row + one (or several) payload rows, typed by staging_kind + payload_type.


7. Views

Two read-only views ship with the substrate (matching the 15000x convention):

View Purpose
v_iu_staging_record Convenience join of record + part counts + days_to_expiry + age
v_iu_staging_payload_observability Per-record payload summary (size, hashes, parts)

These views are surfaced in Nuxt /knowledge/registries/staging (optional UI).


8. Functions (DDL contract; not implemented in this macro)

Function Caller Role
fn_iu_staging_create(staging_kind, payload_type, purpose, owner_actor, source_kind, source_ref, idempotency_key, parts jsonb[], expires_at) DOT only Atomic create of record + parts; sets vector_excluded = true
fn_iu_staging_approve(staging_record_id, approved_by, approval_doc_id) DOT only Gate-checked transition pending → approved
fn_iu_staging_consume(staging_record_id, consumed_by_run_id) DOT only Gate-checked transition approved → consumed
fn_iu_staging_reject(staging_record_id, reason) DOT only pending → rejected
fn_iu_staging_cleanup(scope jsonb) DOT only (expired ∨ consumed ∨ rejected) → cleaned; respects retention policy
fn_iu_staging_emit_event(staging_record_id, event_type) Trigger or DOT Writes to event_outbox
fn_iu_staging_healthcheck() Healthcheck runner Returns ok/false; includes vector-exclusion invariant + lifecycle counts
fn_iu_staging_unregister(staging_record_id) Rollback DOT REFUSED-guarded; deletes record + parts atomically

All functions are SECURITY DEFINER, STRICT, owned by workflow_admin, REVOKE FROM PUBLIC, GRANT EXECUTE TO directus. Same pattern as IU-Core.


9. Configuration gates

Gate Default Effect
iu_core.staging_writes_enabled false Master gate for any write into iu_staging_*
iu_core.staging_cleanup_enabled false Master gate for fn_iu_staging_cleanup
iu_core.no_vector_staging_excluded true Defence-in-depth gate read by the vector connector

Standard "in-TX gate toggle reversibility" pattern (proven 12000x): open → run → close inside one transaction.


10. Storage decisions deferred to review

Decision Options Default proposal
Schema iu_core vs new iu_staging iu_core (reuse role boundary; tables prefixed iu_staging_)
Payload size limit 1 MiB / 10 MiB / unlimited 10 MiB per part (blob_ref for larger)
Retention default 7d / 14d / 30d per kind per-kind table; see 07-implementation-backlog.md
Approval requirement Required for mark_manifest, review_package, cut_preview; optional otherwise Same
Event domain staging vs iu staging (own domain for clean filters)

Open questions about these are in 08-review-questions-before-implementation.md.

Back to Knowledge Hub knowledge/dev/laws/dieu44-trien-khai/design/d36-no-vector-staging-zone/03-sql-nosql-staging-architecture.md