20A/20B-P0 — Search/Vector Hygiene Inspection Report
Date: 2026-05-05 | Mode: READ-ONLY (no mutations performed)
Controlling: Design pack 20 + GPT directive + Opus review
Prompt: knowledge/dev/laws/dieu44-trien-khai/prompts/20a-20b-p0-search-vector-hygiene-inspection-prompt.md
Inspector: Opus 4.7 (1M)
20A Verdict — Search Filter Feasibility
VIABLE — small patch (~20 LOC).
Evidence
Search entry points found:
- /app/agent_data/vector_store.py:254 - VectorStore.search(query, top_k, filter_tags, filter_status) (Qdrant call site)
- /app/agent_data/server.py:826 - query_knowledge(payload) (HTTP/MCP wrapper)
- Wrapper threads filter_tags + filter_status only (server.py:1116-1122).
Current filter logic (vector_store.py:275-290):
conditions: list[Any] = []
if filter_tags:
    conditions.append(qmodels.FieldCondition(
        key="metadata.tags",
        match=qmodels.MatchAny(any=filter_tags)))
if filter_status:
    conditions.append(qmodels.FieldCondition(
        key="metadata.status",
        match=qmodels.MatchValue(value=filter_status)))
query_filter = qmodels.Filter(must=conditions) if conditions else None
Qdrant client classes already imported (line 26): qmodels from qdrant_client.http.models — Filter, FieldCondition, MatchAny, MatchValue all in use. MatchExcept available in same module (no new import needed).
Payload structure (vector_store.py:201-210):
payload = {
    "content": chunk_text,
    "document_id": document_id,  # e.g. "context-pack/<BID>/<file>"
    "metadata": { **base_metadata, "chunk_index": ..., "total_chunks": ... },
    "parent_id": parent_id,
    "is_human_readable": ...,
}
Critical confirmation for A4: Context-pack uploads carry metadata.source = "dieu43_context_pack_publish" (build script line 1080). This field is the natural key for the exclude filter — no new payload field needed, no re-index required.
A1–A7 Answers
| # | Q | Answer |
|---|---|---|
| A1 | Search params? | query: str, top_k: int=5, filter_tags: list[str]\|None, filter_status: str\|None |
| A2 | Filter coverage? | Only metadata.tags (MatchAny) + metadata.status (MatchValue). No source/path/document_id filter. |
| A3 | Qdrant classes? | Filter(must=[...]) + FieldCondition(key, match) with MatchAny/MatchValue. |
| A4 | source/path in payload? | metadata.source IS set by build script ("dieu43_context_pack_publish"). document_id top-level holds full path. Both queryable. |
| A5 | Patch size? | Small (~20 LOC): add exclude_source: list[str]\|None param + 5-line must_not block in search(), thread through server.py:1116-1122. |
| A6 | Patch points (if VIABLE) | (a) vector_store.py:259 add param; (b) after line 290, build a must_not condition on metadata.source using MatchAny (must_not combined with MatchExcept would invert the exclusion); (c) server.py:1116 add exclude_source extract; (d) server.py:1121 pass-through; (e) Pydantic QueryFilters model add field (need to locate; see Notes). |
| A7 | Why not viable? | n/a — VIABLE |
Recommended patch shape (for design phase, NOT executed)
# vector_store.py: signature
def search(self, *, query, top_k=5,
           filter_tags=None, filter_status=None,
           exclude_source: list[str] | None = None):
    ...
    must_not: list[Any] = []
    if exclude_source:
        must_not.append(qmodels.FieldCondition(
            key="metadata.source",
            match=qmodels.MatchAny(any=exclude_source)))
    query_filter = qmodels.Filter(
        must=conditions or None,
        must_not=must_not or None,
    ) if (conditions or must_not) else None
LOC estimate: 8 in vector_store.py + 4 in server.py + 2 in Pydantic model = ~14 LOC core. Plus tests/docstring → ≤30 LOC total. Well under the 50 LOC threshold.
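To make the intended behaviour of the patch concrete, the sketch below emulates the combined must / must_not filter in plain Python over payload dicts. It is not Qdrant code: the helper names (get_path, matches_any, passes_filter) are hypothetical, and the semantics reflect an assumed reading of Qdrant filtering, in particular that a point with no metadata.source is not excluded by a must_not match.

```python
from __future__ import annotations

from typing import Any


def get_path(payload: dict[str, Any], dotted_key: str) -> Any:
    """Resolve a dotted key such as 'metadata.source'; return None if absent."""
    node: Any = payload
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def matches_any(value: Any, wanted: list[str]) -> bool:
    """True if the value (scalar or list) shares at least one entry with wanted."""
    if value is None:
        return False
    values = value if isinstance(value, list) else [value]
    return any(v in wanted for v in values)


def passes_filter(
    payload: dict[str, Any],
    *,
    filter_tags: list[str] | None = None,
    filter_status: str | None = None,
    exclude_source: list[str] | None = None,
) -> bool:
    """Emulate must=[tags, status] plus must_not=[source] on one payload."""
    # must: every supplied condition has to match
    if filter_tags and not matches_any(get_path(payload, "metadata.tags"), filter_tags):
        return False
    if filter_status and get_path(payload, "metadata.status") != filter_status:
        return False
    # must_not: drop the point only if its source is in the exclude list
    if exclude_source and matches_any(get_path(payload, "metadata.source"), exclude_source):
        return False
    return True
```

Under these assumptions, a context-pack chunk is dropped when exclude_source=["dieu43_context_pack_publish"] is passed, while documents that carry no metadata.source at all remain searchable.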
20B-P0 Findings
B1 — Upload pattern (DOT build script)
ACCUMULATE. dot-context-pack-build.sh:1068 sets doc_id="context-pack/${BUILD_ID}/${outname}". Each cron build mints a new BUILD_ID (timestamp+hash); upload POSTs to documents?upsert=true but the document_id is unique per build → 9 new docs per build, no replacement of prior builds.
B2 — Delete-after-upload
ABSENT. grep -n 'deleteDocument\|delete_document' dot-context-pack-build.sh returned no matches in the upload section. The only delete token in the script is at line 592 (a sed comment about markdown processing). Build script never trims older builds from KB.
B3 — Patch insertion point (informational, not executed)
After line 1109 log_ok "7e KB mirror DONE (${ok}/8 sections live)" add a step 7f that calls list_documents(path="context-pack/"), filters out the current ${BUILD_ID} and any retention window, and issues delete_document per stale doc_id. Self-contained, runs only on full-success path.
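The selection logic for such a step 7f could look roughly like the following. This is a sketch under assumptions: the function name and the keep_build_ids parameter are hypothetical, the input is taken to be a plain list of document_id strings in the context-pack/<BUILD_ID>/<file> shape from B1, and the actual list_documents / delete_document calls are deliberately left out.

```python
from __future__ import annotations

from collections.abc import Iterable

PREFIX = "context-pack/"


def stale_context_pack_docs(
    doc_ids: Iterable[str],
    current_build_id: str,
    keep_build_ids: Iterable[str] = (),
) -> list[str]:
    """Return context-pack doc_ids that belong to builds outside the keep set."""
    keep = {current_build_id, *keep_build_ids}
    stale: list[str] = []
    for doc_id in doc_ids:
        if not doc_id.startswith(PREFIX):
            continue  # not a context-pack document, never touched
        parts = doc_id[len(PREFIX):].split("/", 1)
        if len(parts) != 2 or not parts[0] or not parts[1]:
            continue  # unexpected shape, leave it alone rather than guess
        if parts[0] not in keep:
            stale.append(doc_id)
    return stale
```

Skipping malformed ids instead of deleting them keeps the step conservative, which matches the intent that 7f runs only on the full-success path.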
B4 — Filesystem build count + cron cleanup
- /opt/incomex/context-pack/: empty except current symlink → /opt/incomex/context-pack-staging/20260505-100008-4277ba
- /opt/incomex/context-pack-staging/: 141 build directories
- Total size: 36 MB (very small per-build, ~256 KB avg)
- Cron cleanup: NONE (crontab -l | grep -Ei 'context|pack|cleanup' empty for incomex user; root crontab empty; systemctl list-timers no match)
B5 — Cold archive feasibility
YES. Disk: /dev/sda1 96G total, 53G free (46% used). 36 MB / 141 builds ≈ 256 KB/build → 7-day retention at 8 builds/day ≈ 56 builds × 256 KB ≈ 14 MB. Path structure already keyed by build_id timestamp → trivial find -mtime +7 -delete. No structural blocker.
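The retention estimate can be re-derived from the measured figures. The short calculation below uses only the numbers quoted in B4/B5 (36 MB, 141 builds, 8 builds/day, 7 days); the variable names are illustrative. Computed exactly, the per-build average comes out slightly above the rounded 256 KB, and the steady-state footprint stays at roughly 14 MB.

```python
# Inputs taken from the B4/B5 measurements
total_mb = 36          # size of /opt/incomex/context-pack-staging/
build_dirs = 141       # build directories currently on disk
builds_per_day = 8
retention_days = 7

# Average size of one build, in KB
avg_kb_per_build = total_mb * 1024 / build_dirs

# Builds kept under a 7-day window, and their combined size in MB
retained_builds = builds_per_day * retention_days
steady_state_mb = retained_builds * avg_kb_per_build / 1024

print(f"avg per build: {avg_kb_per_build:.0f} KB")
print(f"retained builds: {retained_builds}")
print(f"steady-state size: {steady_state_mb:.1f} MB")
```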
B6 — PG manifest audit trail
YES, sufficient. All 3 tables exist:
- context_pack_manifest (131 rows): 20 columns including id, generated_at, published_at, git_commit, trigger_source, publish_status, kb_mirror_status, publish_step, health_status, logical_checksum_sha256, file_checksum_sha256, _dot_origin
- context_pack_requests (150 rows)
- context_pack_sections (1157 rows)
Schema covers build_id (via timestamp/checksum), status lifecycle, dual checksums, KB mirror status. Adequate for permanent audit trail without KB.
B7 — v_context_pack_latest
View returns build 20260504-040018-372b48 (manifest id=138, generated 2026-05-04 04:00:22 UTC, publish_status=live, kb_mirror_status=live, health_status=fail).
Matches KB latest from prior report (20260504-040018-372b48) — confirmed.
Note: A newer staging build exists on FS (20260505-100008-4277ba, today 12:00 local) but has not been promoted to live in the PG view, even though the current symlink already points to it. This suggests the publish pipeline may be stuck at a step after staging but before manifest finalization. Out of scope here; flag for follow-up.
B8 — Delete candidates if "keep latest only"
If only the latest complete build is kept (the 9 sections of 20260504-040018-372b48), the delete candidates are KB total 1,174 - 9 = 1,165 stale docs. The count is consistent with the accumulate pattern found in B1.
B9 — Full KB removal risk
LOW. Combined evidence:
- No Python/Markdown source under /opt/incomex/ references context-pack/ via search_knowledge or KB get_document (grep returned 0 matches outside the build/verify scripts themselves).
- The DOT consumer scripts (dot-dieu43-fs-verify.sh, dot-dieu43-fs-init.sh) reference filesystem paths only.
- Claude Code agent context-pack consumption (e.g. LAWS_INDEX.md, PROJECT_MAP.md) flows from the filesystem current symlink, not from KB search.
- KB upload of context-pack is a one-way mirror with no readers identified.
Residual risk: any ad-hoc/external tool not under /opt/incomex that calls search_knowledge and depends on tags=["context-pack"] results would lose those hits. Mitigation: 20A exclude filter would have shadowed those results anyway, so the dependency is already broken or absent.
B10 — Agent runtime dependency
Filesystem only. current symlink → /opt/incomex/context-pack-staging/<BUILD_ID>/ containing 9 files: ARCHITECTURE.mmd, DB_MAP.md, DOT_REGISTRY.md, ENTITIES_OVERVIEW.md, LAWS_INDEX.md, OPS_CODE_INVENTORY.md, PROJECT_MAP.md, RED_ZONES.md, project-map.json. No code path searches KB for context-pack documents.
→ KB mirror is redundant. Stopping uploads + purging existing 1,174 docs is safe under the read patterns observed.
Recommendation
Primary path: STOP uploading context-pack to KB + purge all 1,174 existing docs.
Rationale:
- Context-pack consumers are filesystem-based (B10).
- PG manifest already provides a permanent, queryable audit trail with dual checksums (B6).
- Filesystem retains 141 builds in 36 MB; 7-day cron cleanup is trivial to add (B4/B5).
- KB mirror provides no observed value, costs vector quota, and pollutes search noise (the original motivation for this inspection).
20A next: Patch is VIABLE as a defense-in-depth supplement (≤30 LOC), but not strictly required if 20B-P0 primary path executes — once KB is clear of context-pack, there is nothing to exclude. Recommend proceeding with 20A patch anyway as guard against future re-uploads.
20B next: Proceed to design lifecycle P1:
- Add delete_document step 7f to dot-context-pack-build.sh (post-7e), cleaning all context-pack/* not matching current BUILD_ID; or, simpler: change KB upload to a no-op (KB_MIRROR_OK=1; return 0) and one-shot purge existing 1,174.
- Add cron: find /opt/incomex/context-pack-staging -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} + (-mindepth 1 keeps the staging root itself out of the match set).
- One-shot batch delete of 1,174 KB docs (paginated delete_document calls).
- Document the new authority chain: PG manifest = SSOT for build metadata, FS = active artifacts, KB = no longer in scope.
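As a possible alternative to the find one-liner, the filesystem cleanup could be expressed in Python so that the build the current symlink points to is never removed, even if it is older than the retention window (relevant while the publish pipeline is stalled). This is a sketch, not part of the inspected scripts: the function name, parameters, and the dry-run default are assumptions.

```python
from __future__ import annotations

import shutil
import time
from pathlib import Path


def prune_old_builds(
    staging_root: Path,
    current_link: Path,
    max_age_days: float = 7.0,
    now: float | None = None,
    dry_run: bool = True,
) -> list[Path]:
    """Remove build directories older than max_age_days, sparing the live one."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    # Target of the 'current' symlink, if it resolves to something that exists
    protected = current_link.resolve() if current_link.exists() else None

    removed: list[Path] = []
    for entry in sorted(staging_root.iterdir()):
        if entry.is_symlink() or not entry.is_dir():
            continue  # only real directories are build candidates
        if protected is not None and entry.resolve() == protected:
            continue  # never delete the build agents are reading from
        if entry.stat().st_mtime < cutoff:
            removed.append(entry)
            if not dry_run:
                shutil.rmtree(entry)
    return removed
```

With dry_run left at True the function only reports what it would delete, for example prune_old_builds(Path("/opt/incomex/context-pack-staging"), Path("/opt/incomex/context-pack/current")), which fits a first read-only verification pass before the cron entry is enabled.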
Open items flagged (out of scope for this inspection):
- Latest staging build 20260505-100008-4277ba exists on FS but PG v_context_pack_latest still shows 20260504-040018-372b48. Publish pipeline appears stuck / health_status=fail on id=138. Investigate separately.
- 7 .bak copies of dot-context-pack-build.sh accumulating in /opt/incomex/dot/bin/. Housekeeping.
Hard Boundary Compliance
| Boundary | Status |
|---|---|
| Read-only inspection | ✅ |
| No deleteDocument | ✅ |
| No deindex | ✅ |
| No DOT script patch | ✅ |
| No Đ43 patch | ✅ |
| No vector config edit | ✅ |
| No filesystem cleanup | ✅ |
| No service restart | ✅ |
| No Pack 2C/IU | ✅ |
20A/20B-P0 Inspection Report | 2026-05-05 | Read-only complete. Awaiting design dispatch.