S170 Deep Investigation — doc_id Encoding Bug Class
S170 Deep Investigation — doc_id Encoding Bug Class
Date: 2026-04-07 | Status: INVESTIGATED — chờ duyệt phương án
7 Câu Trả Lời
Q1: Ai quyết định dùng __ làm PG key? Commit nào? Lý do gốc?
Evidence: _fs_key docstring gốc: "Encode a document ID for use as a flat Firestore document key."
Firestore không cho phép / trong document ID (nó parse / thành collection/document path). Khi hệ thống dùng Firestore (trước S109), _fs_key chuyển / → __ là BẮT BUỘC.
Commit S109 (PR #289-293) migrate Firestore→PostgreSQL. pg_store.py line 6: "Simple key-value store with JSONB data column, mimicking Firestore's document model." _fs_key được giữ nguyên — Firestore legacy, không ai refactor.
SSOT: __ format là DERIVED (Firestore encoding). / format là LOGICAL (human-readable path).
Q2: Ai quyết định dùng / ở layer khác?
Evidence: directus_sync.py line 44: _SYNC_PREFIXES = ("knowledge/",) — uses / because Directus knowledge_documents.source_id stores agentdata:knowledge/dev/laws/file.md (human path).
API consumers (MCP, curl, agents) use / format in POST body (payload.document_id = "knowledge/dev/laws/file.md") and in URL path (GET /documents/knowledge/dev/laws/file.md). FastAPI URL encoding also accepts __ in path.
2 entry points, 2 formats:
- POST body
document_id: user decides format (typically/) - URL path
{doc_id:path}: FastAPI decodes, can be either
Q3: Cái nào SSOT, cái nào derived?
| Format | Role | Where stored | Who creates |
|---|---|---|---|
/ (slash) |
LOGICAL SSOT | POST payload, Directus source_id, Qdrant doc_id, data->document_id | User/API consumer |
__ (double underscore) |
PG KEY (derived) | kb_documents.key | _fs_key() in server.py |
/ là SSOT. __ là encoding artifact từ Firestore. PG hoàn toàn hỗ trợ / trong TEXT key.
Q4: Có thể thống nhất 1 format?
CÓ. PG TEXT primary key chấp nhận /. Rào cản kỹ thuật:
- 17 call sites dùng
_fs_key()— tất cả cần thay đổi - ~900 existing rows trong kb_documents dùng
__key — cần migration - kb_documents_history 1,950 rows reference
__keys — cần migration - kb_audit_log 2,500+ rows — cần migration
- Qdrant vectors dùng
/format → ALREADY correct - Directus source_id dùng
/format → ALREADY correct
Rào cản chính: application code refactor (17 sites) + data migration (900+1950+2500 rows).
Q5: 17 call sites phân loại
| # | Line | Context | Type | Risk |
|---|---|---|---|---|
| 1 | 481 | Session sentinel check | READ | Low |
| 2 | 757 | /ingest write to KB | WRITE | Medium — writes __ key |
| 3 | 1163 | move_document parent check | READ | Low (deprecated) |
| 4 | 1210 | move_document ancestor check | READ | Low (deprecated) |
| 5 | 1230 | create_document | READ+WRITE | High — sets __ key |
| 6 | 1336 | update_document | READ+WRITE | High — uses __ key |
| 7 | 1481 | move_document | READ | Low (deprecated 501) |
| 8 | 1549 | delete_document | READ | High — delete uses __ key, emit uses raw doc_id |
| 9 | 1614 | get_document | READ | Low |
| 10 | 1694 | patch_document | READ+WRITE | Medium |
| 11 | 1803 | batch_documents | READ | Low |
| 12 | 1884 | list_kb | READ | Low |
| 13 | 1962 | reindex — update vector_status | WRITE | Low |
| 14 | 2100 | reindex_missing — read doc | READ | Low |
| 15 | 2137 | reindex_missing — update status | WRITE | Low |
3 HIGH risk: create (line 1230), update (line 1336), delete (line 1549). These are the CRUD paths where __ key is written to PG and events are emitted.
Q6: Khi nào orphan tích tụ?
Timeline:
- Before WEB-82 (#268): No event system. No sync. Agent Data and Directus independently managed. → ALL drift accumulated here.
- WEB-82 (#268): Event system introduced. But events use raw
doc_idfrom URL path. - WEB-83 (#269): directus_sync introduced with
_should_sync("knowledge/"). - S167:
__encoding fix in directus_sync_listener (normalize before check). - Gap: reconcile-knowledge (WEB-80b #267) only syncs A→B (create), never B→A (cleanup).
574 B→A orphans: Accumulated since system start. Directus records created by reconcile-knowledge, but when AD files were deleted, no reverse cleanup existed.
Q7: PG tools cho bài toán này
| Option | Mechanism | Pros | Cons |
|---|---|---|---|
| Generated column | ALTER TABLE ADD COLUMN doc_id TEXT GENERATED ALWAYS AS (replace(key, '__', '/')) STORED |
Zero app change for read. PG auto-derives / from __ key. Can index. |
Still stores __ as key. Doesn't eliminate root cause. |
| CHECK constraint | ALTER TABLE ADD CONSTRAINT chk_key_format CHECK (key NOT LIKE '%/%') |
Enforces __ format at PG level. Prevents accidental / in key. |
Doesn't help with format mismatch in events. |
| Domain type | CREATE DOMAIN doc_id_t AS TEXT CHECK (VALUE NOT LIKE '%__%') |
Can enforce format on specific columns. | Requires schema change. |
| Trigger | BEFORE INSERT trigger: NEW.key = replace(NEW.key, '/', '__') |
Auto-normalize. App doesn't need to call _fs_key. | Hides the conversion, harder to debug. |
| Full migration | Change PG key to use / format. Drop _fs_key. |
Eliminates root cause. Single format everywhere. | 17 code changes + 900+1950+2500 row migration. |
So Sánh 3 Phương Án
| Criteria | PA1: Python normalize | PA2: PG generated column + CHECK | PA3: Full migration (key→/) |
|---|---|---|---|
| PG FIRST (NT-13) | ❌ App-level fix | ✅ PG enforces | ✅✅ PG is SSOT |
| Vĩnh viễn? | Partial — only directus_sync | Partial — key still __ | YES — 1 format |
| Cơ hội nhầm? (CQ-2) | 16 sites unfixed | Low (generated col auto) | 0 — no _fs_key |
| DOT cặp? (CP-12) | DOT-317 added | DOT-317 + PG CHECK | DOT-317 + PG CHECK |
| CP-11 không sửa code? | ✅ No code change | ✅ Only PG DDL | ❌ 17 code sites changed |
| Risk | Low | Low | Medium (data migration) |
| Effort | Done (S170) | ~30 min | ~2-4 hours |
Khuyến Nghị
Giai đoạn 1 (NGAY — đã làm): PA1 + DOT-317
- Python normalize in directus_sync ✅
- DOT-317 scan daily ✅
- 0 orphans hiện tại ✅
Giai đoạn 2 (TIẾP — đề xuất): PA2 — PG generated column
ALTER TABLE kb_documents ADD COLUMN doc_id TEXT
GENERATED ALWAYS AS (replace(key, '__', '/')) STORED;
CREATE INDEX idx_kb_documents_doc_id_gen ON kb_documents (doc_id);
- Mọi query/event CÓ THỂ dùng
doc_id(/ format) thay vìkey(__ format) - App code dần chuyển sang đọc
doc_idcolumn _fs_keyvẫn hoạt động, nhưng events emitdoc_idcolumn thay vì raw URL
Giai đoạn 3 (DÀI HẠN — khi refactor): PA3 — Full migration
- Chuyển PG key sang
/format - Drop
_fs_key()hoàn toàn - 1 format duy nhất toàn hệ thống
Lý do bám NT-13 PG FIRST: PA2 đặt SSOT ở PG (generated column). App đọc column thay vì tự derive. PG enforce format qua CHECK. Không cần sửa 17 code sites ngay.
DOT Cặp Đề Xuất
| DOT | Tier | Role | Trigger |
|---|---|---|---|
| DOT-317 (sync-orphan-scan) | A | Detect lệch AD↔Directus | Cron daily 5AM |
| DOT_KB_PROTECT | B | Prevent — event triggers on kb_documents | Event (PG trigger) |
Contract Test (Đ31)
SSOT CONTRACT: doc_id format
- Logical doc_id: always uses '/' separator (e.g. knowledge/dev/laws/file.md)
- PG key: always uses '__' separator (legacy encoding from Firestore)
- Conversion: _fs_key(logical) → PG key
- Reverse: replace('__', '/') → logical
- Events MUST emit logical format (with '/')
- PG generated column doc_id provides SSOT for logical format
7 questions answered | 15 call sites classified | 3 solutions compared | PA2 recommended