KB-2C4F

S170 Deep Investigation — doc_id Encoding Bug Class

8 min read Revision 1
reportinvestigationdoc-idencoding_fs_keyrootcauses1702026-04-07

S170 Deep Investigation — doc_id Encoding Bug Class

Date: 2026-04-07 | Status: INVESTIGATED — chờ duyệt phương án


7 Câu Trả Lời

Q1: Ai quyết định dùng __ làm PG key? Commit nào? Lý do gốc?

Evidence: _fs_key docstring gốc: "Encode a document ID for use as a flat Firestore document key."

Firestore không cho phép / trong document ID (nó parse / thành collection/document path). Khi hệ thống dùng Firestore (trước S109), _fs_key chuyển /__ là BẮT BUỘC.

Commit S109 (PR #289-293) migrate Firestore→PostgreSQL. pg_store.py line 6: "Simple key-value store with JSONB data column, mimicking Firestore's document model." _fs_key được giữ nguyên — Firestore legacy, không ai refactor.

SSOT: __ format là DERIVED (Firestore encoding). / format là LOGICAL (human-readable path).

Q2: Ai quyết định dùng / ở layer khác?

Evidence: directus_sync.py line 44: _SYNC_PREFIXES = ("knowledge/",) — uses / because Directus knowledge_documents.source_id stores agentdata:knowledge/dev/laws/file.md (human path).

API consumers (MCP, curl, agents) use / format in POST body (payload.document_id = "knowledge/dev/laws/file.md") and in URL path (GET /documents/knowledge/dev/laws/file.md). FastAPI URL encoding also accepts __ in path.

2 entry points, 2 formats:

  • POST body document_id: user decides format (typically /)
  • URL path {doc_id:path}: FastAPI decodes, can be either

Q3: Cái nào SSOT, cái nào derived?

Format Role Where stored Who creates
/ (slash) LOGICAL SSOT POST payload, Directus source_id, Qdrant doc_id, data->document_id User/API consumer
__ (double underscore) PG KEY (derived) kb_documents.key _fs_key() in server.py

/ là SSOT. __ là encoding artifact từ Firestore. PG hoàn toàn hỗ trợ / trong TEXT key.

Q4: Có thể thống nhất 1 format?

CÓ. PG TEXT primary key chấp nhận /. Rào cản kỹ thuật:

  • 17 call sites dùng _fs_key() — tất cả cần thay đổi
  • ~900 existing rows trong kb_documents dùng __ key — cần migration
  • kb_documents_history 1,950 rows reference __ keys — cần migration
  • kb_audit_log 2,500+ rows — cần migration
  • Qdrant vectors dùng / format → ALREADY correct
  • Directus source_id dùng / format → ALREADY correct

Rào cản chính: application code refactor (17 sites) + data migration (900+1950+2500 rows).

Q5: 17 call sites phân loại

# Line Context Type Risk
1 481 Session sentinel check READ Low
2 757 /ingest write to KB WRITE Medium — writes __ key
3 1163 move_document parent check READ Low (deprecated)
4 1210 move_document ancestor check READ Low (deprecated)
5 1230 create_document READ+WRITE High — sets __ key
6 1336 update_document READ+WRITE High — uses __ key
7 1481 move_document READ Low (deprecated 501)
8 1549 delete_document READ High — delete uses __ key, emit uses raw doc_id
9 1614 get_document READ Low
10 1694 patch_document READ+WRITE Medium
11 1803 batch_documents READ Low
12 1884 list_kb READ Low
13 1962 reindex — update vector_status WRITE Low
14 2100 reindex_missing — read doc READ Low
15 2137 reindex_missing — update status WRITE Low

3 HIGH risk: create (line 1230), update (line 1336), delete (line 1549). These are the CRUD paths where __ key is written to PG and events are emitted.

Q6: Khi nào orphan tích tụ?

Timeline:

  • Before WEB-82 (#268): No event system. No sync. Agent Data and Directus independently managed. → ALL drift accumulated here.
  • WEB-82 (#268): Event system introduced. But events use raw doc_id from URL path.
  • WEB-83 (#269): directus_sync introduced with _should_sync("knowledge/").
  • S167: __ encoding fix in directus_sync_listener (normalize before check).
  • Gap: reconcile-knowledge (WEB-80b #267) only syncs A→B (create), never B→A (cleanup).

574 B→A orphans: Accumulated since system start. Directus records created by reconcile-knowledge, but when AD files were deleted, no reverse cleanup existed.

Q7: PG tools cho bài toán này

Option Mechanism Pros Cons
Generated column ALTER TABLE ADD COLUMN doc_id TEXT GENERATED ALWAYS AS (replace(key, '__', '/')) STORED Zero app change for read. PG auto-derives / from __ key. Can index. Still stores __ as key. Doesn't eliminate root cause.
CHECK constraint ALTER TABLE ADD CONSTRAINT chk_key_format CHECK (key NOT LIKE '%/%') Enforces __ format at PG level. Prevents accidental / in key. Doesn't help with format mismatch in events.
Domain type CREATE DOMAIN doc_id_t AS TEXT CHECK (VALUE NOT LIKE '%__%') Can enforce format on specific columns. Requires schema change.
Trigger BEFORE INSERT trigger: NEW.key = replace(NEW.key, '/', '__') Auto-normalize. App doesn't need to call _fs_key. Hides the conversion, harder to debug.
Full migration Change PG key to use / format. Drop _fs_key. Eliminates root cause. Single format everywhere. 17 code changes + 900+1950+2500 row migration.

So Sánh 3 Phương Án

Criteria PA1: Python normalize PA2: PG generated column + CHECK PA3: Full migration (key→/)
PG FIRST (NT-13) ❌ App-level fix ✅ PG enforces ✅✅ PG is SSOT
Vĩnh viễn? Partial — only directus_sync Partial — key still __ YES — 1 format
Cơ hội nhầm? (CQ-2) 16 sites unfixed Low (generated col auto) 0 — no _fs_key
DOT cặp? (CP-12) DOT-317 added DOT-317 + PG CHECK DOT-317 + PG CHECK
CP-11 không sửa code? ✅ No code change ✅ Only PG DDL ❌ 17 code sites changed
Risk Low Low Medium (data migration)
Effort Done (S170) ~30 min ~2-4 hours

Khuyến Nghị

Giai đoạn 1 (NGAY — đã làm): PA1 + DOT-317

  • Python normalize in directus_sync ✅
  • DOT-317 scan daily ✅
  • 0 orphans hiện tại ✅

Giai đoạn 2 (TIẾP — đề xuất): PA2 — PG generated column

ALTER TABLE kb_documents ADD COLUMN doc_id TEXT
  GENERATED ALWAYS AS (replace(key, '__', '/')) STORED;
CREATE INDEX idx_kb_documents_doc_id_gen ON kb_documents (doc_id);
  • Mọi query/event CÓ THỂ dùng doc_id (/ format) thay vì key (__ format)
  • App code dần chuyển sang đọc doc_id column
  • _fs_key vẫn hoạt động, nhưng events emit doc_id column thay vì raw URL

Giai đoạn 3 (DÀI HẠN — khi refactor): PA3 — Full migration

  • Chuyển PG key sang / format
  • Drop _fs_key() hoàn toàn
  • 1 format duy nhất toàn hệ thống

Lý do bám NT-13 PG FIRST: PA2 đặt SSOT ở PG (generated column). App đọc column thay vì tự derive. PG enforce format qua CHECK. Không cần sửa 17 code sites ngay.

DOT Cặp Đề Xuất

DOT Tier Role Trigger
DOT-317 (sync-orphan-scan) A Detect lệch AD↔Directus Cron daily 5AM
DOT_KB_PROTECT B Prevent — event triggers on kb_documents Event (PG trigger)

Contract Test (Đ31)

SSOT CONTRACT: doc_id format
- Logical doc_id: always uses '/' separator (e.g. knowledge/dev/laws/file.md)
- PG key: always uses '__' separator (legacy encoding from Firestore)
- Conversion: _fs_key(logical) → PG key
- Reverse: replace('__', '/') → logical
- Events MUST emit logical format (with '/')
- PG generated column doc_id provides SSOT for logical format

7 questions answered | 15 call sites classified | 3 solutions compared | PA2 recommended