KB-18A0

IU vector-sync boundary rule — one vector never spans two IUs

Revision 1
iu-core, vector-sync, qdrant, boundary-rule, embedding, binding-rule, dot-iu-cutter

02 — IU vector-sync boundary rule (per-IU embedding integrity)

  • Macro: AGENTDATA_MCP_CONNECTOR_REPAIR_AND_1K_REPORT_VERIFICATION
  • Date: 2026-05-23
  • Status: BINDING RULE for all IU-Core → Qdrant / vector-store sync work (the 1200x IU_CORE_1200X_QDRANT_LIVE_SYNC_AND_OPERATOR_UI macro and beyond).

Rule

When embedding Information Units (IUs) into a vector store:

  1. One vector per chunk; one chunk = content from exactly one IU. A vector / chunk must never contain content drawn from more than one IU.
  2. Over-long IU → chunk only inside that IU's boundary. If a single IU is too long for one embedding unit, split it into multiple chunks — but every chunk stays strictly within that one IU. A chunk must never straddle an IU boundary.
  3. Every chunk carries identity. Each chunk's payload / metadata must carry unit_id (the IU id) and parent_piece_id. A chunk with no IU identity is invalid.
  4. Never concatenate IU A + IU B into one embedding unit. No merging of distinct IUs into a shared vector for any reason — not padding, not batching, not "context-window efficiency".
  5. Collection / document-level vectors are metadata only. A collection- or document-level summary vector is permitted ONLY as an explicitly-marked metadata / summary object. It must never replace or substitute for the per-IU vectors.
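Rules 1 to 4 can be illustrated with a minimal sketch. It assumes each IU arrives as a dict with unit_id, parent_piece_id and content keys, and that the embedding size limit is a simple character count. Chunk, chunk_ius and max_chars are hypothetical names used only for illustration, not the actual cutter_agent API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    unit_id: str          # IU id (required by rule 3)
    parent_piece_id: str  # lineage id (required by rule 3)
    chunk_index: int      # position of this chunk inside its own IU
    text: str


def chunk_ius(ius, max_chars):
    """Split each IU on its own; text from two IUs never shares a chunk."""
    if max_chars < 1:
        raise ValueError("max_chars must be positive")
    chunks = []
    for iu in ius:
        # Rule 3: an IU without identity cannot produce valid chunks.
        if not iu.get("unit_id") or not iu.get("parent_piece_id"):
            raise ValueError("IU without identity cannot be embedded")
        content = iu["content"]
        # Rules 2 and 4: slice inside this IU only. A short final chunk
        # is left short rather than padded with text from the next IU.
        for index, start in enumerate(range(0, len(content), max_chars)):
            chunks.append(
                Chunk(iu["unit_id"], iu["parent_piece_id"], index,
                      content[start:start + max_chars])
            )
    return chunks


# Hypothetical sample data: two IUs from the same parent piece.
ius = [
    {"unit_id": "iu-A", "parent_piece_id": "piece-1", "content": "abcdefgh"},
    {"unit_id": "iu-B", "parent_piece_id": "piece-1", "content": "xyz"},
]
chunks = chunk_ius(ius, max_chars=5)
```

With these inputs iu-A yields two chunks and iu-B one. The trailing "fgh" of iu-A is not topped up with "xy" from iu-B, even though that would fill the chunk to the limit.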

Why

The IU is the atomic unit of governed meaning. A vector that mixes two IUs makes retrieval, drift detection (comparing iu_vector_sync_point.content_digest against indexed_digest), and per-IU provenance unsound, because a search hit can no longer be traced to a single governed unit. Per-IU boundary integrity keeps the vector layer consistent with the five-layer model and the unit_id / parent_piece_id lineage already enforced in Postgres.
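The drift check mentioned above only works when each digest covers exactly one IU. A minimal sketch follows. The field names content_digest and indexed_digest come from iu_vector_sync_point; SHA-256 as the hash, the dict row shape, and the find_drifted helper are assumptions made for illustration.

```python
import hashlib


def content_digest(text):
    """Digest of one IU's content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_drifted(sync_points):
    """Return unit_ids whose indexed vector no longer matches the content."""
    return [
        p["unit_id"]
        for p in sync_points
        if p["content_digest"] != p["indexed_digest"]
    ]


# Hypothetical rows: iu-A was edited after indexing, iu-B was not.
sync_points = [
    {
        "unit_id": "iu-A",
        "content_digest": content_digest("IU A, revision 2"),
        "indexed_digest": content_digest("IU A, revision 1"),
    },
    {
        "unit_id": "iu-B",
        "content_digest": content_digest("IU B"),
        "indexed_digest": content_digest("IU B"),
    },
]
drifted = find_drifted(sync_points)
```

If one vector held text from both iu-A and iu-B, an edit to either IU would change a shared digest, and the check could not say which governed unit had drifted.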

How to apply

  • In cutter_agent/iu_core/vector_sync.py (build_sync_plan / VectorPoint): when an IU exceeds the embedding size limit, emit multiple VectorPoints, each with a stable per-chunk point_key derived from unit_id + chunk index. Never emit one point spanning two IUs.
  • Carry unit_id and parent_piece_id in every Qdrant point payload.
  • A source_kind=collection or source_kind=corpus point is allowed only as a marked summary object — additive to, never instead of, the source_kind=iu points.
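The three points above could look roughly like the sketch below. It is not the real build_sync_plan or VectorPoint, whose fields are not shown here: points are plain dicts, vectors are omitted, and NAMESPACE, iu_points, summary_point and the is_summary flag are hypothetical. The key is derived with uuid5 because Qdrant point ids must be unsigned integers or UUIDs, and uuid5 gives the same id for the same unit_id and chunk index on every run.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "iu-core/vector-sync")


def point_key(unit_id, chunk_index):
    """Stable per-chunk key derived from unit_id + chunk index."""
    return str(uuid.uuid5(NAMESPACE, f"{unit_id}:{chunk_index}"))


def iu_points(unit_id, parent_piece_id, chunk_texts):
    """One point per chunk; every payload carries the IU identity."""
    return [
        {
            "id": point_key(unit_id, index),
            "payload": {
                "source_kind": "iu",
                "unit_id": unit_id,
                "parent_piece_id": parent_piece_id,
                "chunk_index": index,
                "text": text,
            },
        }
        for index, text in enumerate(chunk_texts)
    ]


def summary_point(collection_id, summary_text):
    """Marked summary object: additive to the IU points, never a substitute."""
    return {
        "id": str(uuid.uuid5(NAMESPACE, f"collection:{collection_id}")),
        "payload": {
            "source_kind": "collection",
            "is_summary": True,  # assumed marker name
            "text": summary_text,
        },
    }


points = iu_points("iu-A", "piece-1", ["abcde", "fgh"])
summary = summary_point("col-1", "Summary of the collection")
```

Because the key depends only on unit_id and chunk index, re-syncing an unchanged IU overwrites the same points instead of creating duplicates.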

Relates to: the 1k macro doc 02 (vector-sync foundation) and the 1200x IU_CORE_1200X_QDRANT_LIVE_SYNC_AND_OPERATOR_UI next-macro package.

knowledge/dev/laws/dieu44-trien-khai/agentdata-mcp-connector-repair-and-1k-report-verification/02-iu-vector-boundary-rule.md