P3D Vector/Search Freshness Audit Report
Date: 2026-05-11
Mode: READ_ONLY_INVESTIGATION
Scope: Agent Data legacy KB vector search, Qdrant collection production_documents
Report path: knowledge/dev/laws/dieu44-trien-khai/reports/p3d-vector-search-freshness-audit-report.md
Status Fields
phase_status=PASS
mode=READ_ONLY_INVESTIGATION
no_mutation_performed=true
kb_direct_read_ok=true
qdrant_status=ok
sync_status=warning
recent_docs_vectorized=all
search_ranking_quality=noisy
root_cause=SEARCH_RANKING_NO_PATH_TITLE_BOOST_NOT_VECTOR_FRESHNESS
root_cause_confidence=high
recommended_next_action=P3D_VECTOR_SEARCH_HYBRID_PATH_TITLE_BOOST_PACK
unsafe_actions_not_taken=no_reindex,no_auto_heal,no_kb_reindex,no_kb_reindex_missing,no_cleanup_orphans_write,no_qdrant_delete,no_qdrant_upsert,no_db_write,no_trigger_creation,no_restart,no_redeploy,no_config_change,no_code_change,no_dot_heal,no_iu_vector_implementation
0. Rule Read Evidence
Read in main process, no background agent:
| Required read | Evidence |
|---|---|
| .claude/skills/incomex-rules.md | Read locally. Key constraints confirmed: direct search_knowledge, no background agent, report with evidence. |
| search_knowledge("operating rules SSOT") | Returned knowledge/dev/ssot/operating-rules.md, title Nguyên tắc Điều hành — SSOT v7.58 Concise, score 0.42450312; also returned VPS Operating Rules. |
| search_knowledge("hiến pháp v4.0 constitution") | Returned knowledge/dev/laws/constitution.md, title Hiến pháp Kiến trúc Hệ thống Incomex v4.6.3 BAN HÀNH, and warnings that v3.9 is superseded. |
| Prompt file | Read via get_document_for_rewrite("knowledge/dev/laws/dieu44-trien-khai/prompts/p3d-agent-vector-search-freshness-audit-readonly-2026-05-10.md"), truncated=false, content_length=6508. |
1. Three Declarations
- Permanent? The root cause is not fixed by manually reindexing one document. The durable fix is deterministic search behavior: path/title/document_id exact-match or boost before/alongside vector similarity.
- Can it go wrong? The current system can confuse exact prompt/report lookups because the Qdrant semantic score ignores path and title as ranking signals. A hybrid/boosted search contract would make exact-path/title queries hard to mis-rank.
- 100% automatic? Create/update vector sync is automatic, but discoverability is not deterministic for exact names. Search ranking needs an automated boost/rerank, not operator memory or manual reindexing.
2. System Health Evidence
Read-only /health from inside incomex-agent-data on port 8000:
{
"status": "healthy",
"services": {
"qdrant": {"status": "ok", "latency_ms": 9.3, "last_error": null},
"postgres": {"status": "ok", "latency_ms": 1.0, "last_error": null},
"openai": {"status": "ok", "latency_ms": 0.0, "last_error": null}
},
"data_integrity": {
"document_count": 2406,
"vector_point_count": 4973,
"ratio": 2.07,
"sync_status": "warning",
"embed_calls": 1443,
"embed_tokens": 1145939
},
"event_system": {
"enabled": true,
"webhooks_registered": 0,
"webhooks_active": 0,
"listeners": 1,
"events_logged": 1232
}
}
Container/source evidence:
incomex-agent-data image=agent-data-local:latest status=Up 3 days (healthy)
QDRANT_COLLECTION=production_documents
store_enabled=true
3. Target Document Results
All five recent targets exist in KB/PG, have vector_status=ready, and have Qdrant points. None of the target documents are missing from Qdrant.
| Target doc | KB exists | vector_status | Qdrant points | chunks | Search top5/top10/top20 | Rank | Notes |
|---|---|---|---|---|---|---|---|
| reviews/gpt-review-p3d-step1-reauthored-spec-and-pack1-directive-2026-05-10.md | true | ready | 1 | 1 | true/true/true | 5 | Present but outranked by older semantically similar GPT review docs. |
| prompts/p3d-pack1-readonly-inventory-prompt.md | true | ready | 3 | 3 | true/true/true | 3 | Revision 2, chunked. Query ranks related report/review above exact target. |
| directives/gpt-directive-agent-run-step1-checkpoint-and-pack1-inventory-readonly-2026-05-10.md | true | ready | 1 | 1 | true/true/true | 1 | Exact-ish directive query ranks correctly. |
| prompts/p3d-agent-copy-paste-run-step1-checkpoint-and-pack1-inventory-2026-05-10.md | true | ready | 1 | 1 | true/true/true | 1 | Exact-ish prompt query ranks correctly. |
| prompts/p3d-agent-vector-search-freshness-audit-readonly-2026-05-10.md | true | ready | 2 | 2 | true/true/true | 1 | Exact-ish vector audit query ranks correctly. |
Direct KB reads via MCP get_document returned all five documents with expected titles and revisions:
gpt-review-p3d-step1... revision=1 content_length=3612
p3d-pack1-readonly-inventory-prompt.md revision=2 content_length=10498
gpt-directive-agent-run-step1... revision=1 content_length=2537
p3d-agent-copy-paste-run-step1... revision=1 content_length=2523
p3d-agent-vector-search-freshness-audit... revision=1 content_length=6508
Qdrant payload sample pattern for all targets:
payload.document_id=<full KB path>
payload.content=<chunk text>
payload.metadata.title=<document title>
payload.metadata.tags=<tags>
payload.metadata.chunk_index=<0-based chunk>
payload.metadata.total_chunks=<chunk count>
4. Ranking Evidence
The key failure mode is noisy ranking, not missing vectors.
Target 1 Query
Query:
GPT Review P3D Step 1 Re-authored Spec Pack 1 Directive
Top hits:
1 score=0.641119 knowledge/dev/laws/dieu44-trien-khai/reviews/gpt-review-23-p3d1-notification-schema-functions-prompt-rev1-2026-05-07.md
2 score=0.631840 knowledge/dev/laws/dieu44-trien-khai/reviews/gpt-review-23-p3d3-prompt-rev1-opus-checkpoint-2026-05-08.md
3 score=0.629037 knowledge/dev/laws/dieu44-trien-khai/reviews/gpt-review-gpilot4-pass-and-gpilot5-directive-2026-05-04.md
4 score=0.624177 knowledge/dev/laws/dieu44-trien-khai/reviews/gpt-review-gpilot5-pass-and-next-directive-2026-05-04.md
5 score=0.621927 knowledge/dev/laws/dieu44-trien-khai/reviews/gpt-review-p3d-step1-reauthored-spec-and-pack1-directive-2026-05-10.md
Interpretation: exact target is vectorized and retrieved, but older documents with similar semantic shape outrank it. This is a ranking issue.
Target 2 Query
Query:
p3d-pack1-readonly-inventory-prompt revision 2
Top hits:
1 score=0.630753 knowledge/dev/laws/dieu44-trien-khai/reports/p3d-pack1-iu-canonical-contract-and-tac-iu-reconciliation-report.md
2 score=0.626626 knowledge/dev/laws/dieu44-trien-khai/reviews/gpt-review-p3d-pack1-preliminary-design-and-inventory-next-2026-05-10.md
3 score=0.610256 knowledge/dev/laws/dieu44-trien-khai/prompts/p3d-pack1-readonly-inventory-prompt.md
Interpretation: query includes the exact filename slug and revision, but current vector-only path does not boost document_id, metadata.title, or revision metadata. It returns the exact target in top 5 but not first.
Targets 3-5
Exact-ish query ranks:
gpt-directive-agent-run-step1-checkpoint-and-pack1-inventory-readonly -> rank 1
p3d-agent-copy-paste-run-step1-checkpoint-and-pack1-inventory 2026-05-10 -> rank 1
vector search freshness audit readonly 2026-05-10 -> rank 1
Interpretation: search can work when the semantic query is distinctive enough, but ranking is not deterministic for exact path/title lookup.
5. Source Inspection
Source inspected read-only on VPS:
/opt/incomex/docker/agent-data-repo/agent_data/server.py
/opt/incomex/docker/agent-data-repo/agent_data/vector_store.py
/opt/incomex/docker/agent-data-repo/agent_data/pg_vector_listener.py
/opt/incomex/docker/agent-data-repo/agent_data/resilient_client.py
Create / Update Sync
Evidence:
server.py:1427-1440 creates document_data with vector_status="pending"
server.py:1442 writes PG row
server.py:1444-1452 immediately calls _sync_vector_entry(...)
server.py:535-570 _sync_vector_entry calls store.upsert_document(...) and updates vector_status to result.status
server.py:1551-1557 re-embeds only when update payload changes content
Conclusion: the API createDocument and content-update paths perform an immediate, best-effort vector sync in-request. Recent target timestamps support this:
| Doc | created_at | updated_at | Approx sync delta |
|---|---|---|---|
| gpt review | 2026-05-10T14:43:09.964361Z | 2026-05-10T14:43:10.147478Z | 0.18s |
| directive | 2026-05-11T01:43:06.638308Z | 2026-05-11T01:43:06.846556Z | 0.21s |
| copy/paste prompt | 2026-05-11T01:45:11.632516Z | 2026-05-11T01:45:11.929767Z | 0.30s |
| vector audit prompt | 2026-05-11T01:48:12.921346Z | 2026-05-11T01:48:13.458715Z | 0.54s |
The Pack 1 inventory prompt is revision 2 and was updated later; it also has vector_status=ready and 3 Qdrant chunks.
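The approximate sync deltas in the table above can be re-derived from the two timestamps per row. The following is a minimal sketch, assuming that updated_at reflects the vector_status write; the helper name sync_delta_seconds and the TARGETS dictionary are illustrative and not part of the Agent Data code.

```python
from datetime import datetime


def sync_delta_seconds(created_at: str, updated_at: str) -> float:
    """Seconds between two ISO-8601 UTC timestamps that end in 'Z'."""
    def parse(ts: str) -> datetime:
        # Older Python versions reject a trailing 'Z' in fromisoformat(),
        # so normalize it to an explicit UTC offset first.
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))

    return (parse(updated_at) - parse(created_at)).total_seconds()


# Timestamps copied from the table above (created_at, updated_at).
TARGETS = {
    "gpt review": ("2026-05-10T14:43:09.964361Z", "2026-05-10T14:43:10.147478Z"),
    "directive": ("2026-05-11T01:43:06.638308Z", "2026-05-11T01:43:06.846556Z"),
    "copy/paste prompt": ("2026-05-11T01:45:11.632516Z", "2026-05-11T01:45:11.929767Z"),
    "vector audit prompt": ("2026-05-11T01:48:12.921346Z", "2026-05-11T01:48:13.458715Z"),
}

deltas = {
    name: round(sync_delta_seconds(created, updated), 2)
    for name, (created, updated) in TARGETS.items()
}
for name, delta in deltas.items():
    print(f"{name}: {delta:.2f}s")
```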
Qdrant Search Logic
Evidence:
server.py:1185-1192 says strategy is Qdrant vector first, PG keyword fallback.
server.py:1194-1217 returns immediately when Qdrant has hits.
server.py:1221-1282 PG keyword scan only runs as fallback.
vector_store.py:273 embeds query.
vector_store.py:292 calls Qdrant search with limit=top_k*2.
vector_store.py:294-308 deduplicates by document_id.
vector_store.py:301-305 returns document_id, snippet, score, metadata.
Conclusion:
searchKnowledge_mode=QDRANT_VECTOR_FIRST_WITH_PG_KEYWORD_FALLBACK_ONLY_ON_EMPTY_OR_ERROR
hybrid_search_active_for_normal_qdrant_hits=false
title_path_document_id_boost_active=false
The code includes metadata.title in Qdrant payload, but it is not used as a boost/rerank signal on the normal vector path. PG fallback includes body and title keyword overlap, but fallback is bypassed whenever Qdrant returns any hits.
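The control flow described above can be illustrated with a small, self-contained sketch. It is not the code in server.py or vector_store.py: the function names, the hit shape, and the truncation to top_k are assumptions made for illustration. It only shows why a keyword match can never influence ranking once Qdrant returns at least one hit.

```python
def dedupe_by_document_id(points, top_k):
    """Keep the first (best-scoring) chunk per document_id.

    Assumes `points` is already sorted by descending score, as a vector
    search would return it.
    """
    seen = set()
    results = []
    for point in points:
        doc_id = point["document_id"]
        if doc_id in seen:
            continue
        seen.add(doc_id)
        results.append(point)
        if len(results) == top_k:
            break
    return results


def search_knowledge_sketch(query, top_k, vector_search, keyword_search):
    """Vector first; keyword search only on error or an empty result."""
    try:
        # Over-fetch (top_k * 2) because several chunks may share a document.
        points = vector_search(query, top_k * 2)
    except Exception:
        points = []  # a vector-store error falls through to the keyword path
    hits = dedupe_by_document_id(points, top_k)
    if hits:
        # Early return: title/path keyword overlap is never consulted here.
        return "qdrant", hits
    return "pg_keyword", keyword_search(query, top_k)
```

In this shape, any non-empty vector result short-circuits the keyword path, which matches the observed behavior where a query containing the exact filename slug still ranks by semantic score alone.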
Chunking
Evidence:
vector_store.py:186-188 splits content into chunks.
vector_store.py:200-208 stores content, document_id, metadata.chunk_index, metadata.total_chunks.
Long docs are expected to produce multiple Qdrant points. This explains why vector/doc ratio can exceed 1.0 without indicating missing vectors.
6. Trigger / Listener / Webhook Evidence
Runtime:
event_system.enabled=true
webhooks_registered=0
webhooks_active=0
event_system.listeners=1
pg_stat_activity LISTEN kb_vector_sync listeners=2
Database triggers on kb_documents:
trg_kb_vector_sync|O|CREATE TRIGGER trg_kb_vector_sync AFTER INSERT OR DELETE OR UPDATE ON public.kb_documents FOR EACH ROW EXECUTE FUNCTION fn_kb_notify_vector_sync()
Relevant fn_kb_notify_vector_sync() behavior:
skips operations/tasks/comments, registries, empty key
on soft-delete transition emits semantic DELETE
skips empty/short content (<10 chars)
otherwise pg_notify('kb_vector_sync', {op,key,document_id})
Listener source:
pg_vector_listener.py listens on channel kb_vector_sync.
For INSERT/UPDATE it reads the PG row and calls store.upsert_document(...).
For DELETE or soft-deleted rows it calls store.delete_document(...).
resilient_client.py:469-473 starts the PG->Qdrant listener during FastAPI lifespan.
Conclusion:
webhooks_connected=false
pg_trigger_connected=true
pg_listener_connected=true
api_create_sync_connected=true
direct_pg_write_sync_connected=true_for_non_empty_non_excluded_rows
Nuance: direct PG listener upserts vectors but does not update vector_status in PG. API create/update does update vector_status.
7. Sync Warning Root Cause
Read-only /kb/audit-sync {"auto_heal": false}:
{
"status_code": 200,
"total_documents": 2406,
"total_vectors": 4973,
"ghost_count": 5,
"orphan_count": 0,
"status": "needs_cleanup",
"recommendations": [
"5 documents missing vectors — run POST /kb/reindex-missing"
],
"documents_without_vectors_sample": [
"",
"knowledge/current-state/templates/test_empty.md.tmpl",
"knowledge/dev",
"knowledge/dev/blueprints",
"knowledge/dev/laws"
],
"orphan_sample": []
}
Vector status aggregate:
none|3
pending|1
ready|2402
live_docs=2406
deleted_docs=2132
Non-ready live rows:
knowledge/current-state/templates/test_empty.md.tmpl|pending|body_len=2
knowledge/dev|none|body_len=0
knowledge/dev/blueprints|none|body_len=0
knowledge/dev/laws|none|body_len=0
Empty-id nuance:
<empty-id> exists=true vector_status=ready body_len=3846 qdrant_points=1
Audit still reports "" as ghost because list_document_ids() ignores falsey document_id values while PG includes the live empty-id row. This is an audit-accounting edge case, not evidence that recent docs are missing.
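The empty-id edge case can be reproduced with plain sets. This is a minimal sketch under the assumption that the audit computes ghosts as PG ids minus Qdrant ids; find_ghosts and its drop_falsey_vector_ids flag are hypothetical names, not the audit's actual API.

```python
def find_ghosts(pg_ids, qdrant_ids, drop_falsey_vector_ids):
    """Return PG document ids that have no matching id on the vector side."""
    vector_ids = set(qdrant_ids)
    if drop_falsey_vector_ids:
        # Mirrors the described behavior: falsey ids such as "" are ignored
        # when listing ids from the vector store.
        vector_ids = {doc_id for doc_id in vector_ids if doc_id}
    return sorted(set(pg_ids) - vector_ids)


pg_ids = ["", "knowledge/dev/laws/constitution.md"]
qdrant_ids = ["", "knowledge/dev/laws/constitution.md"]

# With the falsey filter, the empty id is reported as a ghost even though
# a point for it exists on the vector side.
ghosts_with_filter = find_ghosts(pg_ids, qdrant_ids, drop_falsey_vector_ids=True)
ghosts_without_filter = find_ghosts(pg_ids, qdrant_ids, drop_falsey_vector_ids=False)
```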
Health sync_status=warning is caused by:
primary_warning_source=ratio_threshold
ratio=4973/2406=2.07
threshold_warning=ratio > 2.0
secondary_audit_status=5 ghost ids, mostly intentionally empty/folder/short docs
orphan_count=0
recent_target_missing_vectors=false
The ratio is inflated by legitimate chunking of long documents plus normal collection growth. The audit status is noisy because folder/empty/short docs are counted as ghost candidates even though the trigger/listener intentionally skips short or empty content.
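The ratio arithmetic behind the warning is simple enough to check by hand. The sketch below assumes the threshold is a strict `ratio > 2.0` comparison as reported above; the function name sync_status and the "ok" label for the non-warning state are illustrative assumptions.

```python
def sync_status(document_count, vector_point_count, warn_ratio=2.0):
    """Return (rounded ratio, status) for a vectors-per-document check."""
    if document_count == 0:
        return 0.0, "ok"  # nothing to compare against
    ratio = vector_point_count / document_count
    status = "warning" if ratio > warn_ratio else "ok"
    return round(ratio, 2), status


# Values from /health in this report.
health = sync_status(2406, 4973)
# Values from the latest cron log sample.
cron = sync_status(2323, 4767)
```

Because chunked documents contribute more than one point each, a fixed threshold of 2.0 will trip as average document length grows, independent of whether any vectors are missing.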
8. Cron / Audit Monitoring
Crontab evidence:
30 4 * * * /opt/incomex/dot/bin/dot-vector-audit --local >> /var/log/incomex/dot-vector-audit.log 2>&1
dot-vector-audit --help:
Usage: dot-vector-audit [--heal] [--cloud] [--local]
--heal Auto-fix orphans + reindex missing vectors
--cloud Target cloud Agent Data
--local Target local Agent Data (default)
Latest log sample:
VECTOR AUDIT — local (http://172.18.0.5:8000)
Pre-flight check... OK
Running audit-sync (dry-run)...
Documents: 2323
Vectors: 4767
Orphans: 0
Ghosts: 5
Status: needs_cleanup
RESULT: ISSUES FOUND — report-only mode; do not auto-heal from cron
Conclusion: cron safety net is report-only and points at the local Agent Data container URL. It is not configured with --heal.
9. Required Answers
1. Are newly created KB documents being vectorized immediately after createDocument?
Yes for API create/update content path. Source calls _sync_vector_entry() immediately after PG write, and recent targets show sub-second created-to-ready update deltas. This is best-effort and depends on OpenAI/Qdrant being healthy, both currently OK.
2. Are recent documents present in PG/KB but missing in Qdrant?
No for the five target documents. All are in KB/PG and Qdrant:
target_qdrant_counts=1,3,1,1,2
target_vector_status=ready for all
The only audit ghosts are empty/folder/short/empty-id edge cases, not these recent P3D documents.
3. Are recent documents present in Qdrant but ranked poorly?
Partially yes. All five targets are in the top 5 for the tested queries, but two are not rank 1:
gpt review target rank=5
pack1 inventory prompt rank=3
This matches user reports that recent docs are not reliably searchable when query intent is exact path/title lookup.
4. Is searchKnowledge using Qdrant vector search only, hybrid search, or additional keyword logic?
Normal successful path is Qdrant vector search only. There is PG keyword fallback, but only when Qdrant is disabled, errors, or returns no hits. It is not a true hybrid search for normal Qdrant hits.
5. Is title/path/document_id being boosted or ignored?
Ignored for ranking in the normal Qdrant path. Title/tags are stored in payload metadata and returned, but no code boosts metadata.title, document_id, or path tokens. document_id is only used for deduplication and counting/deletion filters.
6. Is there a lag between KB write and vector availability?
For target API-created docs, observed lag is sub-second based on created_at to vector-status updated_at. No evidence of material lag for recent targets.
7. Are webhooks/triggers/listeners actually connected?
Webhooks are not registered/active. PG trigger and listener are connected:
webhooks_registered=0
webhooks_active=0
trg_kb_vector_sync enabled=O
pg_stat_activity LISTEN kb_vector_sync listeners=2
event_system.listeners=1
The API create path also has direct sync independent of the PG listener.
8. Is sync_status=warning caused by ghosts/orphans, ratio threshold, pending vector_status rows, or other drift?
The /health warning is directly caused by the ratio threshold (ratio > 2.0); the current ratio is 2.07. The read-only audit also reports ghost_count=5 and orphan_count=0, but those ghosts are mostly empty/folder/short rows plus one empty-id accounting edge case. There is one live pending row with body length 2.
9. What minimal safe next action is recommended?
Do not reindex as a first response. The minimal safe next action is a small search behavior pack:
P3D_VECTOR_SEARCH_HYBRID_PATH_TITLE_BOOST_PACK
Scope should be read/design first, then small implementation if approved:
1. Add deterministic exact `document_id` / path-slug / `metadata.title` boost or rerank before returning Qdrant hits.
2. Preserve legacy vector search and chunk payloads.
3. Keep Qdrant collection unchanged.
4. Optionally refine audit health so empty/folder/short docs and empty-id edge cases do not create noisy ghost warnings.
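Item 1 could take the form of a post-retrieval rerank that leaves the Qdrant collection and chunk payloads untouched. The following is a minimal sketch, not a proposed implementation: the function names, the token-coverage heuristic, and the weights 0.2 and 0.1 are untuned assumptions. The hit shape (document_id, score, metadata.title) follows the payload pattern described in section 3.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens, e.g. 'p3d-pack1' -> {'p3d', 'pack1'}."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def coverage(field_text, query_tokens):
    """Fraction of the field's tokens that also appear in the query."""
    field_tokens = tokens(field_text)
    if not field_tokens:
        return 0.0
    return len(field_tokens & query_tokens) / len(field_tokens)


def rerank(query, hits, slug_weight=0.2, title_weight=0.1):
    """Re-sort vector hits by score plus slug/title token coverage."""
    query_tokens = tokens(query)

    def boosted(hit):
        slug = hit["document_id"].rsplit("/", 1)[-1]
        if slug.endswith(".md"):
            slug = slug[:-3]
        title = hit.get("metadata", {}).get("title", "")
        return (
            hit["score"]
            + slug_weight * coverage(slug, query_tokens)
            + title_weight * coverage(title, query_tokens)
        )

    return sorted(hits, key=boosted, reverse=True)


# Target 2 hits from section 4 (titles omitted because they are not listed there).
BASE = "knowledge/dev/laws/dieu44-trien-khai/"
hits = [
    {"document_id": BASE + "reports/p3d-pack1-iu-canonical-contract-and-tac-iu-reconciliation-report.md", "score": 0.630753},
    {"document_id": BASE + "reviews/gpt-review-p3d-pack1-preliminary-design-and-inventory-next-2026-05-10.md", "score": 0.626626},
    {"document_id": BASE + "prompts/p3d-pack1-readonly-inventory-prompt.md", "score": 0.610256},
]
reranked = rerank("p3d-pack1-readonly-inventory-prompt revision 2", hits)
```

With these illustrative weights, the exact prompt moves from rank 3 to rank 1 because every token of its filename slug appears in the query, while the two longer slugs are only partially covered. Real weights would need to be validated against a wider query set before any implementation is approved.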
10. Final Root Cause
root_cause=SEARCH_RANKING_NO_PATH_TITLE_BOOST_NOT_VECTOR_FRESHNESS
confidence=high
The recent documents are not stale or missing. They are vectorized and retrievable. The unreliable user experience comes from Qdrant semantic ranking being used as the normal path without exact path/title/document_id boosting. Similar older P3D/GPT review/prompt documents can outrank the exact recent target.
11. Boundary Compliance
Actions performed:
read_local_skill=true
search_knowledge_direct_main_process=true
get_document_read=true
source_inspection_read_only=true
health_get_read_only=true
audit_sync_auto_heal_false=true
pg_select_only=true
qdrant_count_scroll_search_read_only=true
cron_log_read_only=true
report_created_and_uploaded=true
Actions explicitly not performed:
no_reindex=true
no_auto_heal=true
no_kb_reindex=true
no_kb_reindex_missing=true
no_cleanup_orphans_write=true
no_qdrant_delete=true
no_qdrant_upsert=true
no_db_insert_update_delete=true
no_trigger_creation=true
no_restart=true
no_redeploy=true
no_config_change=true
no_code_change=true
no_dot_vector_audit_heal=true
no_iu_vector_implementation=true
no_production_documents_mutation=true
Note: final report upload to the required KB path is the only write-like action and is part of the requested deliverable, not a system/data repair.