P3D Vector/Search Freshness Audit Report

Date: 2026-05-11
Mode: READ_ONLY_INVESTIGATION
Scope: Agent Data legacy KB vector search, Qdrant collection production_documents
Report path: knowledge/dev/laws/dieu44-trien-khai/reports/p3d-vector-search-freshness-audit-report.md

Status Fields

phase_status=PASS
mode=READ_ONLY_INVESTIGATION
no_mutation_performed=true
kb_direct_read_ok=true
qdrant_status=ok
sync_status=warning
recent_docs_vectorized=all
search_ranking_quality=noisy
root_cause=SEARCH_RANKING_NO_PATH_TITLE_BOOST_NOT_VECTOR_FRESHNESS
root_cause_confidence=high
recommended_next_action=P3D_VECTOR_SEARCH_HYBRID_PATH_TITLE_BOOST_PACK
unsafe_actions_not_taken=no_reindex,no_auto_heal,no_kb_reindex,no_kb_reindex_missing,no_cleanup_orphans_write,no_qdrant_delete,no_qdrant_upsert,no_db_write,no_trigger_creation,no_restart,no_redeploy,no_config_change,no_code_change,no_dot_heal,no_iu_vector_implementation

0. Rule Read Evidence

Read in main process, no background agent:

| Required read | Evidence |
| --- | --- |
| `.claude/skills/incomex-rules.md` | Read locally. Key constraints confirmed: direct `search_knowledge`, no background agent, report with evidence. |
| `search_knowledge("operating rules SSOT")` | Returned `knowledge/dev/ssot/operating-rules.md`, title `Nguyên tắc Điều hành — SSOT v7.58 Concise`, score `0.42450312`; also returned VPS Operating Rules. |
| `search_knowledge("hiến pháp v4.0 constitution")` | Returned `knowledge/dev/laws/constitution.md`, title `Hiến pháp Kiến trúc Hệ thống Incomex v4.6.3 BAN HÀNH`, and warnings that v3.9 is superseded. |
| Prompt file | Read via `get_document_for_rewrite("knowledge/dev/laws/dieu44-trien-khai/prompts/p3d-agent-vector-search-freshness-audit-readonly-2026-05-10.md")`, `truncated=false`, `content_length=6508`. |

1. Three Declarations

Permanent? The root cause is not fixed by manually reindexing one document. The durable fix is deterministic search behavior: exact-match or boost on path/title/document_id, applied before or alongside vector similarity.
Can it be mistaken? The current system can confuse exact prompt/report lookups because the Qdrant semantic score ignores path and title as ranking signals. A hybrid/boosted search contract would make exact-path/title queries hard to mis-rank.
100% automatic? Create/update vector sync is automatic, but discoverability is not deterministic for exact names. Search ranking needs an automated boost/rerank, not operator memory or manual reindexing.

2. System Health Evidence

Read-only /health from inside incomex-agent-data on port 8000:

{
  "status": "healthy",
  "services": {
    "qdrant": {"status": "ok", "latency_ms": 9.3, "last_error": null},
    "postgres": {"status": "ok", "latency_ms": 1.0, "last_error": null},
    "openai": {"status": "ok", "latency_ms": 0.0, "last_error": null}
  },
  "data_integrity": {
    "document_count": 2406,
    "vector_point_count": 4973,
    "ratio": 2.07,
    "sync_status": "warning",
    "embed_calls": 1443,
    "embed_tokens": 1145939
  },
  "event_system": {
    "enabled": true,
    "webhooks_registered": 0,
    "webhooks_active": 0,
    "listeners": 1,
    "events_logged": 1232
  }
}

Container/source evidence:

incomex-agent-data image=agent-data-local:latest status=Up 3 days (healthy)
QDRANT_COLLECTION=production_documents
store_enabled=true

3. Target Document Results

All five recent targets exist in KB/PG, have vector_status=ready, and have Qdrant points. None of the target documents are missing from Qdrant.

| Target doc | KB exists | vector_status | Qdrant points | chunks | Search top5/top10/top20 | Rank | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `reviews/gpt-review-p3d-step1-reauthored-spec-and-pack1-directive-2026-05-10.md` | true | ready | 1 | 1 | true/true/true | 5 | Present but outranked by older semantically similar GPT review docs. |
| `prompts/p3d-pack1-readonly-inventory-prompt.md` | true | ready | 3 | 3 | true/true/true | 3 | Revision 2, chunked. Query ranks related report/review above exact target. |
| `directives/gpt-directive-agent-run-step1-checkpoint-and-pack1-inventory-readonly-2026-05-10.md` | true | ready | 1 | 1 | true/true/true | 1 | Exact-ish directive query ranks correctly. |
| `prompts/p3d-agent-copy-paste-run-step1-checkpoint-and-pack1-inventory-2026-05-10.md` | true | ready | 1 | 1 | true/true/true | 1 | Exact-ish prompt query ranks correctly. |
| `prompts/p3d-agent-vector-search-freshness-audit-readonly-2026-05-10.md` | true | ready | 2 | 2 | true/true/true | 1 | Exact-ish vector audit query ranks correctly. |

Direct KB reads via MCP get_document returned all five documents with expected titles and revisions:

gpt-review-p3d-step1... revision=1 content_length=3612
p3d-pack1-readonly-inventory-prompt.md revision=2 content_length=10498
gpt-directive-agent-run-step1... revision=1 content_length=2537
p3d-agent-copy-paste-run-step1... revision=1 content_length=2523
p3d-agent-vector-search-freshness-audit... revision=1 content_length=6508

Qdrant payload sample pattern for all targets:

payload.document_id=<full KB path>
payload.content=<chunk text>
payload.metadata.title=<document title>
payload.metadata.tags=<tags>
payload.metadata.chunk_index=<0-based chunk>
payload.metadata.total_chunks=<chunk count>

4. Ranking Evidence

The key failure mode is noisy ranking, not missing vectors.

Target 1 Query

Query:

GPT Review P3D Step 1 Re-authored Spec Pack 1 Directive

Top hits:

1 score=0.641119 knowledge/dev/laws/dieu44-trien-khai/reviews/gpt-review-23-p3d1-notification-schema-functions-prompt-rev1-2026-05-07.md
2 score=0.631840 knowledge/dev/laws/dieu44-trien-khai/reviews/gpt-review-23-p3d3-prompt-rev1-opus-checkpoint-2026-05-08.md
3 score=0.629037 knowledge/dev/laws/dieu44-trien-khai/reviews/gpt-review-gpilot4-pass-and-gpilot5-directive-2026-05-04.md
4 score=0.624177 knowledge/dev/laws/dieu44-trien-khai/reviews/gpt-review-gpilot5-pass-and-next-directive-2026-05-04.md
5 score=0.621927 knowledge/dev/laws/dieu44-trien-khai/reviews/gpt-review-p3d-step1-reauthored-spec-and-pack1-directive-2026-05-10.md

Interpretation: exact target is vectorized and retrieved, but older documents with similar semantic shape outrank it. This is a ranking issue.

Target 2 Query

Query:

p3d-pack1-readonly-inventory-prompt revision 2

Top hits:

1 score=0.630753 knowledge/dev/laws/dieu44-trien-khai/reports/p3d-pack1-iu-canonical-contract-and-tac-iu-reconciliation-report.md
2 score=0.626626 knowledge/dev/laws/dieu44-trien-khai/reviews/gpt-review-p3d-pack1-preliminary-design-and-inventory-next-2026-05-10.md
3 score=0.610256 knowledge/dev/laws/dieu44-trien-khai/prompts/p3d-pack1-readonly-inventory-prompt.md

Interpretation: query includes the exact filename slug and revision, but current vector-only path does not boost document_id, metadata.title, or revision metadata. It returns the exact target in top 5 but not first.
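A minimal sketch of how a path-slug boost could reorder the Target 2 hits listed above. The scores and paths are copied from the evidence; the function `rerank_with_slug_boost`, the tokenization, and the weight of 0.1 are assumptions for illustration and are not existing code in `vector_store.py`.

```python
import re

# Hits copied from the Target 2 evidence (shared path prefix factored out).
PREFIX = "knowledge/dev/laws/dieu44-trien-khai/"
HITS = [
    {"score": 0.630753, "document_id": PREFIX + "reports/p3d-pack1-iu-canonical-contract-and-tac-iu-reconciliation-report.md"},
    {"score": 0.626626, "document_id": PREFIX + "reviews/gpt-review-p3d-pack1-preliminary-design-and-inventory-next-2026-05-10.md"},
    {"score": 0.610256, "document_id": PREFIX + "prompts/p3d-pack1-readonly-inventory-prompt.md"},
]


def tokens(text):
    """Lowercase alphanumeric tokens; '/', '-', '.', and spaces act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank_with_slug_boost(query, hits, weight=0.1):
    """Hypothetical rerank: vector score plus weight * (share of slug tokens found in the query)."""
    query_tokens = tokens(query)

    def boosted_score(hit):
        slug = hit["document_id"].rsplit("/", 1)[-1]  # filename part of the path
        slug_tokens = tokens(slug)
        overlap = len(query_tokens & slug_tokens) / len(slug_tokens) if slug_tokens else 0.0
        return hit["score"] + weight * overlap

    return sorted(hits, key=boosted_score, reverse=True)


reranked = rerank_with_slug_boost("p3d-pack1-readonly-inventory-prompt revision 2", HITS)
print(reranked[0]["document_id"])
```

With these inputs the exact target shares 5 of its 6 slug tokens with the query, so its boosted score exceeds the two related documents and it moves to rank 1. The weight would need tuning against real queries before any implementation.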

Targets 3-5

Exact-ish query ranks:

gpt-directive-agent-run-step1-checkpoint-and-pack1-inventory-readonly -> rank 1
p3d-agent-copy-paste-run-step1-checkpoint-and-pack1-inventory 2026-05-10 -> rank 1
vector search freshness audit readonly 2026-05-10 -> rank 1

Interpretation: search can work when the semantic query is distinctive enough, but ranking is not deterministic for exact path/title lookup.

5. Source Inspection

Source inspected read-only on VPS:

/opt/incomex/docker/agent-data-repo/agent_data/server.py
/opt/incomex/docker/agent-data-repo/agent_data/vector_store.py
/opt/incomex/docker/agent-data-repo/agent_data/pg_vector_listener.py
/opt/incomex/docker/agent-data-repo/agent_data/resilient_client.py

Create / Update Sync

Evidence:

server.py:1427-1440 creates document_data with vector_status="pending"
server.py:1442 writes PG row
server.py:1444-1452 immediately calls _sync_vector_entry(...)
server.py:535-570 _sync_vector_entry calls store.upsert_document(...) and updates vector_status to result.status
server.py:1551-1557 re-embeds only when update payload changes content

Conclusion: the API createDocument and content-update paths perform an immediate, best-effort vector sync inside the request. Recent target timestamps support this:

| Doc | created_at | updated_at | Approx sync delta |
| --- | --- | --- | --- |
| gpt review | 2026-05-10T14:43:09.964361Z | 2026-05-10T14:43:10.147478Z | 0.18s |
| directive | 2026-05-11T01:43:06.638308Z | 2026-05-11T01:43:06.846556Z | 0.21s |
| copy/paste prompt | 2026-05-11T01:45:11.632516Z | 2026-05-11T01:45:11.929767Z | 0.30s |
| vector audit prompt | 2026-05-11T01:48:12.921346Z | 2026-05-11T01:48:13.458715Z | 0.54s |

The Pack 1 inventory prompt is revision 2 and was updated later; it also has vector_status=ready and 3 Qdrant chunks.
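The approximate sync deltas in the table can be reproduced from the timestamps. The sketch below uses only the standard library; the helper name `sync_delta_seconds` is an assumption introduced for illustration.

```python
from datetime import datetime


def sync_delta_seconds(created_at, updated_at):
    """Seconds between two ISO-8601 UTC timestamps that end in 'Z'."""
    def parse(ts):
        # Replace the 'Z' suffix so fromisoformat also works before Python 3.11.
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))

    return (parse(updated_at) - parse(created_at)).total_seconds()


# Timestamps copied from the table above.
TARGETS = {
    "gpt review": ("2026-05-10T14:43:09.964361Z", "2026-05-10T14:43:10.147478Z"),
    "directive": ("2026-05-11T01:43:06.638308Z", "2026-05-11T01:43:06.846556Z"),
    "copy/paste prompt": ("2026-05-11T01:45:11.632516Z", "2026-05-11T01:45:11.929767Z"),
    "vector audit prompt": ("2026-05-11T01:48:12.921346Z", "2026-05-11T01:48:13.458715Z"),
}

for name, (created, updated) in TARGETS.items():
    print(f"{name}: {sync_delta_seconds(created, updated):.2f}s")
```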

Qdrant Search Logic

Evidence:

server.py:1185-1192 says strategy is Qdrant vector first, PG keyword fallback.
server.py:1194-1217 returns immediately when Qdrant has hits.
server.py:1221-1282 PG keyword scan only runs as fallback.
vector_store.py:273 embeds query.
vector_store.py:292 calls Qdrant search with limit=top_k*2.
vector_store.py:294-308 deduplicates by document_id.
vector_store.py:301-305 returns document_id, snippet, score, metadata.

Conclusion:

searchKnowledge_mode=QDRANT_VECTOR_FIRST_WITH_PG_KEYWORD_FALLBACK_ONLY_ON_EMPTY_OR_ERROR
hybrid_search_active_for_normal_qdrant_hits=false
title_path_document_id_boost_active=false

The code includes metadata.title in Qdrant payload, but it is not used as a boost/rerank signal on the normal vector path. PG fallback includes body and title keyword overlap, but fallback is bypassed whenever Qdrant returns any hits.
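The control flow described above can be sketched as follows. This is an illustration of the observed behavior, not a copy of `server.py`; the function and parameter names (`search_vector_first`, `vector_search`, `keyword_search`) are hypothetical.

```python
def search_vector_first(query, vector_search, keyword_search, top_k=5):
    """Vector search first; keyword search runs only on error or an empty result."""
    try:
        hits = vector_search(query, top_k)
    except Exception:
        hits = []  # treat a vector-store error like an empty result

    if hits:
        # Any vector hit short-circuits: keyword/title overlap is never consulted.
        return {"mode": "vector", "results": hits}

    return {"mode": "keyword_fallback", "results": keyword_search(query, top_k)}


# Usage: one weak vector hit is enough to bypass the keyword path entirely.
calls = []

def fake_vector(query, top_k):
    return [{"document_id": "some/other/doc.md", "score": 0.41}]

def fake_keyword(query, top_k):
    calls.append(query)
    return [{"document_id": "exact/title/match.md", "score": 1.0}]

result = search_vector_first("exact title match", fake_vector, fake_keyword)
print(result["mode"], len(calls))
```

This shape is why a title or path match in PG cannot correct a semantically noisy Qdrant ranking on the normal path.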

Chunking

Evidence:

vector_store.py:186-188 splits content into chunks.
vector_store.py:200-208 stores content, document_id, metadata.chunk_index, metadata.total_chunks.

Long docs are expected to produce multiple Qdrant points. This explains why vector/doc ratio can exceed 1.0 without indicating missing vectors.
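A minimal sketch of how chunking turns one document into several Qdrant points, using the payload fields listed in section 3. The real splitter in `vector_store.py` was not reproduced here: the fixed 4000-character chunk size is an assumption, chosen only because it happens to match the observed chunk counts (3612 chars gives 1, 6508 gives 2, 10498 gives 3), and `build_chunk_payloads` is a hypothetical name.

```python
def build_chunk_payloads(document_id, content, title, tags=(), chunk_size=4000):
    """Split content into fixed-size chunks and build one payload per chunk."""
    # An empty document still yields a single (empty) chunk in this sketch.
    chunks = [content[i:i + chunk_size] for i in range(0, len(content), chunk_size)] or [""]
    return [
        {
            "document_id": document_id,
            "content": chunk,
            "metadata": {
                "title": title,
                "tags": list(tags),
                "chunk_index": index,          # 0-based, as in the payload sample
                "total_chunks": len(chunks),
            },
        }
        for index, chunk in enumerate(chunks)
    ]


payloads = build_chunk_payloads("prompts/p3d-pack1-readonly-inventory-prompt.md", "x" * 10498, "Pack 1 prompt")
print(len(payloads), payloads[-1]["metadata"])
```

Because every chunk is its own point, a collection with many long documents naturally has more points than documents.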

6. Trigger / Listener / Webhook Evidence

Runtime:

event_system.enabled=true
webhooks_registered=0
webhooks_active=0
event_system.listeners=1
pg_stat_activity LISTEN kb_vector_sync listeners=2

Database triggers on kb_documents:

trg_kb_vector_sync|O|CREATE TRIGGER trg_kb_vector_sync AFTER INSERT OR DELETE OR UPDATE ON public.kb_documents FOR EACH ROW EXECUTE FUNCTION fn_kb_notify_vector_sync()

Relevant fn_kb_notify_vector_sync() behavior:

skips operations/tasks/comments, registries, empty key
on soft-delete transition emits semantic DELETE
skips empty/short content (<10 chars)
otherwise pg_notify('kb_vector_sync', {op,key,document_id})

Listener source:

pg_vector_listener.py listens on channel kb_vector_sync.
For INSERT/UPDATE it reads the PG row and calls store.upsert_document(...).
For DELETE or soft-deleted rows it calls store.delete_document(...).
resilient_client.py:469-473 starts the PG->Qdrant listener during FastAPI lifespan.
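The listener behavior described above can be sketched as a small dispatch function. The payload keys `op`, `key`, and `document_id` come from the trigger description; the row field names, the `fetch_row` callback, the argument list of `upsert_document`, and the `RecordingStore` stand-in are assumptions for illustration and do not reproduce `pg_vector_listener.py`.

```python
import json


class RecordingStore:
    """Stand-in for the vector store; it only records which method was called."""

    def __init__(self):
        self.calls = []

    def upsert_document(self, document_id, content, metadata):
        self.calls.append(("upsert", document_id))

    def delete_document(self, document_id):
        self.calls.append(("delete", document_id))


def handle_notification(payload_json, fetch_row, store):
    """Route one kb_vector_sync notification to an upsert or a delete."""
    event = json.loads(payload_json)
    document_id = event["document_id"]

    if event["op"] == "DELETE":
        store.delete_document(document_id)
        return "deleted"

    row = fetch_row(event["key"])  # read the current PG row
    if row is None or row.get("deleted"):
        store.delete_document(document_id)  # soft-deleted rows are removed too
        return "deleted"

    store.upsert_document(document_id, row["content"], row.get("metadata", {}))
    return "upserted"


store = RecordingStore()
rows = {"k1": {"content": "long enough body text", "metadata": {"title": "T"}}}
print(handle_notification('{"op": "UPDATE", "key": "k1", "document_id": "doc/a.md"}', rows.get, store))
print(handle_notification('{"op": "DELETE", "key": "k1", "document_id": "doc/a.md"}', rows.get, store))
```

Note that nothing in this path writes `vector_status` back to PG, which matches the nuance recorded in the conclusion of this section.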

Conclusion:

webhooks_connected=false
pg_trigger_connected=true
pg_listener_connected=true
api_create_sync_connected=true
direct_pg_write_sync_connected=true_for_non_empty_non_excluded_rows

Nuance: direct PG listener upserts vectors but does not update vector_status in PG. API create/update does update vector_status.

7. Sync Warning Root Cause

Read-only /kb/audit-sync {"auto_heal": false}:

{
  "status_code": 200,
  "total_documents": 2406,
  "total_vectors": 4973,
  "ghost_count": 5,
  "orphan_count": 0,
  "status": "needs_cleanup",
  "recommendations": [
    "5 documents missing vectors — run POST /kb/reindex-missing"
  ],
  "documents_without_vectors_sample": [
    "",
    "knowledge/current-state/templates/test_empty.md.tmpl",
    "knowledge/dev",
    "knowledge/dev/blueprints",
    "knowledge/dev/laws"
  ],
  "orphan_sample": []
}

Vector status aggregate:

none|3
pending|1
ready|2402
live_docs=2406
deleted_docs=2132

Non-ready live rows:

knowledge/current-state/templates/test_empty.md.tmpl|pending|body_len=2
knowledge/dev|none|body_len=0
knowledge/dev/blueprints|none|body_len=0
knowledge/dev/laws|none|body_len=0

Empty-id nuance:

<empty-id> exists=true vector_status=ready body_len=3846 qdrant_points=1

Audit still reports "" as ghost because list_document_ids() ignores falsey document_id values while PG includes the live empty-id row. This is an audit-accounting edge case, not evidence that recent docs are missing.
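A short sketch of how this accounting edge case can arise. Only the name `list_document_ids` comes from the source; its body here, the `find_ghosts` helper, and the sample data are assumptions that reproduce the described behavior (falsey ids dropped on the vector side, kept on the PG side).

```python
def list_document_ids(points):
    """Collect document ids from vector points, skipping falsey ids (including "")."""
    return {p["document_id"] for p in points if p.get("document_id")}


def find_ghosts(pg_ids, points):
    """PG document ids that have no matching id on the vector side."""
    vector_ids = list_document_ids(points)
    return sorted(doc_id for doc_id in pg_ids if doc_id not in vector_ids)


# The empty-id document has a real point, yet it is still reported as a ghost.
points = [{"document_id": ""}, {"document_id": "knowledge/dev/a.md"}]
pg_ids = ["", "knowledge/dev/a.md"]
print(find_ghosts(pg_ids, points))
```

Comparing against `None` instead of relying on truthiness would keep the empty-string id on the vector side, but whether an empty id should exist at all is a separate data question.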

Health sync_status=warning is caused by:

primary_warning_source=ratio_threshold
ratio=4973/2406=2.07
threshold_warning=ratio > 2.0
secondary_audit_status=5 ghost ids, mostly intentionally empty/folder/short docs
orphan_count=0
recent_target_missing_vectors=false

The ratio is inflated by legitimate chunking of long documents plus normal collection growth. The audit status is noisy because folder/empty/short docs are counted as ghost candidates even though the trigger/listener intentionally skips short or empty content.
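The ratio check can be expressed in a few lines. The threshold of 2.0 and the counts come from the evidence above; the function name `ratio_sync_status` and the `"ok"` label for the non-warning case are assumptions, since the health code itself was not reproduced in this report.

```python
def ratio_sync_status(document_count, vector_point_count, warn_ratio=2.0):
    """Return (rounded ratio, status) for a vectors-per-document threshold check."""
    if document_count == 0:
        return 0.0, "ok"
    ratio = vector_point_count / document_count
    status = "warning" if ratio > warn_ratio else "ok"
    return round(ratio, 2), status


print(ratio_sync_status(2406, 4973))
```

A threshold on the raw ratio cannot distinguish legitimate chunking from orphaned points, which is why `orphan_count=0` from the audit is the more informative signal here.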

8. Cron / Audit Monitoring

Crontab evidence:

30 4 * * * /opt/incomex/dot/bin/dot-vector-audit --local >> /var/log/incomex/dot-vector-audit.log 2>&1

dot-vector-audit --help:

Usage: dot-vector-audit [--heal] [--cloud] [--local]
  --heal   Auto-fix orphans + reindex missing vectors
  --cloud  Target cloud Agent Data
  --local  Target local Agent Data (default)

Latest log sample:

VECTOR AUDIT — local (http://172.18.0.5:8000)
Pre-flight check... OK
Running audit-sync (dry-run)...
Documents: 2323
Vectors: 4767
Orphans: 0
Ghosts: 5
Status: needs_cleanup
RESULT: ISSUES FOUND — report-only mode; do not auto-heal from cron

Conclusion: cron safety net is report-only and points at the local Agent Data container URL. It is not configured with --heal.

9. Required Answers

1. Are newly created KB documents being vectorized immediately after createDocument?

Yes for API create/update content path. Source calls _sync_vector_entry() immediately after PG write, and recent targets show sub-second created-to-ready update deltas. This is best-effort and depends on OpenAI/Qdrant being healthy, both currently OK.

2. Are recent documents present in PG/KB but missing in Qdrant?

No for the five target documents. All are in KB/PG and Qdrant:

target_qdrant_counts=1,3,1,1,2
target_vector_status=ready for all

The only audit ghosts are empty/folder/short/empty-id edge cases, not these recent P3D documents.

3. Are recent documents present in Qdrant but ranked poorly?

Partially yes. All five targets are in top 20 for the tested queries; two are not rank 1:

gpt review target rank=5
pack1 inventory prompt rank=3

This matches user reports that recent docs are not reliably searchable when query intent is exact path/title lookup.

4. Is searchKnowledge using Qdrant vector search only, hybrid search, or additional keyword logic?

Normal successful path is Qdrant vector search only. There is PG keyword fallback, but only when Qdrant is disabled, errors, or returns no hits. It is not a true hybrid search for normal Qdrant hits.

5. Is title/path/document_id being boosted or ignored?

Ignored for ranking in the normal Qdrant path. Title/tags are stored in payload metadata and returned, but no code boosts metadata.title, document_id, or path tokens. document_id is only used for deduplication and counting/deletion filters.

6. Is there a lag between KB write and vector availability?

For target API-created docs, observed lag is sub-second based on created_at to vector-status updated_at. No evidence of material lag for recent targets.

7. Are webhooks/triggers/listeners actually connected?

Webhooks are not registered/active. PG trigger and listener are connected:

webhooks_registered=0
webhooks_active=0
trg_kb_vector_sync enabled=O
pg_stat_activity LISTEN kb_vector_sync listeners=2
event_system.listeners=1

The API create path also has direct sync independent of the PG listener.

8. Is sync_status=warning caused by ghosts/orphans, ratio threshold, pending vector_status rows, or other drift?

The /health warning is directly caused by ratio threshold ratio > 2.0; current ratio is 2.07. The read-only audit also reports ghost_count=5, orphan_count=0, but those ghosts are mostly empty/folder/short rows and one empty-id accounting edge case. There is one live pending row with body length 2.

9. What minimal safe next action is recommended?

Do not reindex as a first response. The minimal safe next action is a small search behavior pack:

P3D_VECTOR_SEARCH_HYBRID_PATH_TITLE_BOOST_PACK

Scope should be read/design first, then small implementation if approved:

1. Add deterministic exact `document_id` / path-slug / `metadata.title` boost or rerank before returning Qdrant hits.
2. Preserve legacy vector search and chunk payloads.
3. Keep Qdrant collection unchanged.
4. Optionally refine audit health so empty/folder/short docs and empty-id edge cases do not create noisy ghost warnings.
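For item 4, one possible refinement is to classify each ghost candidate before it counts toward a warning. The sketch below is a design illustration only: the 10-character minimum mirrors the trigger's short-content skip, the rows are taken from section 7, and `classify_ghost` with its three labels is hypothetical.

```python
MIN_CONTENT_CHARS = 10  # mirrors the trigger's "<10 chars" skip rule


def classify_ghost(body_len, qdrant_points):
    """Label a ghost candidate reported by the audit."""
    if qdrant_points > 0:
        return "accounting_edge_case"  # vectors exist; the audit id listing missed them
    if body_len < MIN_CONTENT_CHARS:
        return "expected_skip"         # empty/folder/short rows are never embedded
    return "missing_vector"            # the only class that should raise a warning


# (document_id, body_len, qdrant_points) from the audit evidence.
CANDIDATES = [
    ("", 3846, 1),
    ("knowledge/current-state/templates/test_empty.md.tmpl", 2, 0),
    ("knowledge/dev", 0, 0),
    ("knowledge/dev/blueprints", 0, 0),
    ("knowledge/dev/laws", 0, 0),
]

labels = {doc_id: classify_ghost(body_len, points) for doc_id, body_len, points in CANDIDATES}
print(sum(1 for label in labels.values() if label == "missing_vector"))
```

Under this classification none of the five current ghost candidates would count as a genuinely missing vector.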

10. Final Root Cause

root_cause=SEARCH_RANKING_NO_PATH_TITLE_BOOST_NOT_VECTOR_FRESHNESS
confidence=high

The recent documents are not stale or missing. They are vectorized and retrievable. The unreliable user experience comes from Qdrant semantic ranking being used as the normal path without exact path/title/document_id boosting. Similar older P3D/GPT review/prompt documents can outrank the exact recent target.

11. Boundary Compliance

Actions performed:

read_local_skill=true
search_knowledge_direct_main_process=true
get_document_read=true
source_inspection_read_only=true
health_get_read_only=true
audit_sync_auto_heal_false=true
pg_select_only=true
qdrant_count_scroll_search_read_only=true
cron_log_read_only=true
report_created_and_uploaded=true

Actions explicitly not performed:

no_reindex=true
no_auto_heal=true
no_kb_reindex=true
no_kb_reindex_missing=true
no_cleanup_orphans_write=true
no_qdrant_delete=true
no_qdrant_upsert=true
no_db_insert_update_delete=true
no_trigger_creation=true
no_restart=true
no_redeploy=true
no_config_change=true
no_code_change=true
no_dot_vector_audit_heal=true
no_iu_vector_implementation=true
no_production_documents_mutation=true

Note: final report upload to the required KB path is the only write-like action and is part of the requested deliverable, not a system/data repair.