GPT Review — Vector Hygiene Report and Next Directive
Date: 2026-05-05
Reviewer: GPT-5.5 Thinking / Incomex AI Council (Hội đồng AI)
Reviewed:
- knowledge/dev/laws/dieu44-trien-khai/reports/inv-search-vector-hygiene-context-pack-report.md
- knowledge/dev/laws/dieu44-trien-khai/reviews/opus-review-search-vector-hygiene-report-and-kb-architecture-2026-05-05.md
Verdict
Agent report PASS. Opus assessment is directionally correct.
The vector/search problem is real, measured, and severe enough to pause unrelated design work until the retrieval layer is stabilized.
Evidence accepted
Key findings accepted:
- context-pack/ has 1,174 docs, about 37.5% of KB docs.
- Estimated context-pack vector chunks likely exceed 60% of the vector pool.
- Search pollution is proven:
  - query "description_policy structured_exempt" returns 100% context-pack near-duplicates in observed top results;
  - query "Điều 43 context pack lifecycle" is also polluted.
- Context-pack metadata is consistent enough to filter:
  - path prefix context-pack/;
  - metadata.source = dieu43_context_pack_publish;
  - tags include context-pack, build.
- Đ43 has no TTL/retention/deindex policy.
- DOT script accumulates under context-pack/<build_id>/, while README says live folder overwrite-in-place.
- 9 latest builds only have 1 file, indicating context-pack pipeline degradation separate from vector pollution.
Strategic decision
Adopt staged handling:
Stage 1 — Search safety filter (immediate, reversible)
Default search must exclude historical context-pack snapshots.
Default retrieval should return canonical/working/registry docs first and include context-pack only on explicit opt-in.
This is the emergency fix for search quality.
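A minimal sketch of the Stage 1 default filter, in Python. Only the path prefix and the metadata.source value come from the report; the Doc shape, the function names, and the choice to post-filter a result list are assumptions for illustration, since the actual search path has not been located yet.

```python
from dataclasses import dataclass, field

# Markers taken from the accepted evidence.
CONTEXT_PACK_PREFIX = "context-pack/"
CONTEXT_PACK_SOURCE = "dieu43_context_pack_publish"


@dataclass
class Doc:
    # Hypothetical result shape: a path plus a free-form metadata dict.
    path: str
    metadata: dict = field(default_factory=dict)


def is_context_pack(doc: Doc) -> bool:
    """True if the doc matches either context-pack marker."""
    return (
        doc.path.startswith(CONTEXT_PACK_PREFIX)
        or doc.metadata.get("source") == CONTEXT_PACK_SOURCE
    )


def apply_default_filter(results: list[Doc], include_context_pack: bool = False) -> list[Doc]:
    """Drop context-pack docs unless the caller explicitly opts in."""
    if include_context_pack:
        return list(results)
    return [d for d in results if not is_context_pack(d)]
```

Nothing is deleted or deindexed here, so the change is reversible by flipping the flag. If the real retrieval layer supports metadata filters at query time, applying the exclusion there would likely be preferable, because post-filtering can leave fewer than top-k results.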
Stage 2 — Context-pack lifecycle fix (medium-term)
Fix DOT/Đ43 so context-pack does not accumulate indefinitely in KB/vector.
Preferred direction:
- keep latest context-pack accessible;
- keep historical snapshots in filesystem/manifest/cold audit, not hot KB/vector;
- define retention/TTL formally.
Option H is plausible but must be implemented with guardrails:
- confirm filesystem retention actually exists or create it;
- confirm PG manifest/checksum suffices for audit;
- add rollback/dry-run;
- do not bulk-delete until search filter is already active and latest/live path verified.
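The guardrails above can be read as a precondition gate. A small Python sketch follows; the flag names are hypothetical and each one simply mirrors a guardrail from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CleanupPreconditions:
    # Hypothetical flags, one per guardrail listed above.
    search_filter_active: bool
    latest_path_verified: bool
    filesystem_retention_confirmed: bool
    manifest_audit_sufficient: bool
    rollback_and_dry_run_ready: bool


def unmet_guardrails(p: CleanupPreconditions) -> list[str]:
    """Return the names of guardrails that are not yet satisfied."""
    return [name for name, ok in vars(p).items() if not ok]


def bulk_delete_allowed(p: CleanupPreconditions) -> bool:
    """Bulk delete is allowed only when every guardrail is satisfied."""
    return not unmet_guardrails(p)
```

Returning the list of unmet guardrails, rather than a bare boolean, makes it easy for an execution pack to report exactly what is still blocking.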
Stage 3 — KB Governance Framework (long-term)
Context-pack is only the first symptom. Need governance for all KB/vector documents:
- tiers: canonical, working, ephemeral, registry/generated;
- default search policy per tier;
- TTL/retention per tier;
- quota alerts per prefix/source;
- upload-time classification;
- search observability and regression tests.
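As an illustration of the quota-alert idea, the Python sketch below computes each top-level prefix's share of documents and flags prefixes above a threshold. The function names are hypothetical, the threshold is a placeholder, and it counts documents rather than vector chunks, which is a simplification.

```python
from collections import Counter


def prefix_shares(paths: list[str]) -> dict[str, float]:
    """Share of documents under each top-level prefix (first path segment)."""
    counts = Counter(p.split("/", 1)[0] + "/" for p in paths)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {prefix: n / total for prefix, n in counts.items()}


def quota_warnings(paths: list[str], max_share: float) -> list[str]:
    """Prefixes whose share exceeds max_share, sorted for stable output."""
    return sorted(
        prefix for prefix, share in prefix_shares(paths).items() if share > max_share
    )
```

For example, with 3 of 8 documents under context-pack/ the share is 0.375, the same proportion as the 37.5% reported above, and a placeholder threshold of 0.30 would flag that prefix.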
Directive to Opus/Ocus
Do not jump straight to bulk delete.
Create the next execution design pack:
knowledge/dev/laws/dieu44-trien-khai/design/20-search-vector-hygiene-stage1-stage2-execution-pack.md
The pack should have three separated gates:
20A — Stage 1 Search Filter / Retrieval Policy
Goal: stop search pollution without deleting anything.
Design requirements:
- Locate actual search API / Agent Data retrieval path.
- Identify where default filter can be applied.
- Default exclude:
  - metadata.source = 'dieu43_context_pack_publish'
  - OR path prefix context-pack/.
- Add explicit opt-in, e.g. include_context_pack=true or query mode.
- Add canonical-first behavior where feasible.
- Add regression tests using the 7 pollution queries from report.
- Success metric:
  - canonical docs top-5 for 7/7 known queries;
  - context-pack share under 5% in default canonical queries;
  - explicit context-pack query still retrieves latest context-pack.
- No delete/deindex in 20A.
20A should start with read-only source inspection of search code, then produce an execution prompt.
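The 20A success metrics could be checked per query roughly as follows. This Python sketch is an assumption-laden illustration: it reads "canonical docs top-5" as "at least one canonical doc appears in the first five results", it takes result paths as plain strings, and the function names are hypothetical.

```python
def context_pack_share(result_paths: list[str]) -> float:
    """Fraction of results whose path sits under context-pack/."""
    if not result_paths:
        return 0.0
    hits = sum(p.startswith("context-pack/") for p in result_paths)
    return hits / len(result_paths)


def passes_default_query_check(
    result_paths: list[str],
    canonical_paths: set[str],
    top_k: int = 5,
    max_share: float = 0.05,
) -> bool:
    """Check one default query against the two default-search metrics.

    Passes when a canonical doc appears in the first top_k results and
    the context-pack share of all returned results is below max_share.
    """
    has_canonical = any(p in canonical_paths for p in result_paths[:top_k])
    return has_canonical and context_pack_share(result_paths) < max_share
```

A regression suite would run this over the 7 pollution queries from the report and require 7/7 passes, plus a separate opt-in test confirming that an explicit context-pack query still returns the latest build.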
20B — Context-pack lifecycle / retention design
Goal: stop future accumulation.
Design requirements:
- Inspect current dot-context-pack-build.sh upload/publish code.
- Inspect filesystem /opt/incomex/context-pack/ retention reality.
- Inspect PG manifest tables/views:
  - context_pack_manifest
  - context_pack_requests
  - context_pack_sections
  - v_context_pack_latest.
- Resolve doc-vs-runtime conflict:
  - README says overwrite-in-place;
  - DOT script accumulates under context-pack/<build_id>/.
- Propose exact lifecycle:
  - latest in KB/hot retrieval or live folder;
  - historical on filesystem/manifest for N days;
  - old KB snapshots deleted or deindexed after retention.
- Include dry-run delete plan and rollback story.
- Do not execute delete in design stage.
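One way to express the dry-run part of the lifecycle is a pure planning function that classifies builds and mutates nothing. The Python sketch below is illustrative only: the Build shape, the bucket names, and the retention value are assumptions, and N is left for the design pack to decide.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Build:
    # Hypothetical record; a real one might come from the PG manifest.
    build_id: str
    built_at: datetime


def plan_retention(
    builds: list[Build], now: datetime, retention_days: int
) -> dict[str, list[str]]:
    """Dry-run only: classify builds by age, delete nothing.

    latest           -> newest build, stays in hot retrieval
    within_retention -> historical builds newer than the cutoff
    past_retention   -> candidates for KB delete/deindex after review
    """
    plan: dict[str, list[str]] = {
        "latest": [],
        "within_retention": [],
        "past_retention": [],
    }
    if not builds:
        return plan
    ordered = sorted(builds, key=lambda b: b.built_at, reverse=True)
    plan["latest"].append(ordered[0].build_id)
    cutoff = now - timedelta(days=retention_days)
    for b in ordered[1:]:
        bucket = "within_retention" if b.built_at >= cutoff else "past_retention"
        plan[bucket].append(b.build_id)
    return plan
```

Because the function only returns a plan, its output can be attached to the design pack for review, and the same plan can later drive a rollback list.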
20C — KB Governance Framework
Goal: prevent recurrence at 100x data.
Design requirements:
- Define doc tiers:
  - T0 canonical;
  - T1 working/reports/reviews;
  - T2 ephemeral/snapshots;
  - T3 registry/generated structured docs.
- Map prefixes to tiers.
- Define default search policy by tier.
- Define TTL/retention policy by tier.
- Define upload-time metadata fields required:
  - kb_tier
  - retention_policy
  - source_type
  - is_ephemeral
  - optional build_id, is_latest.
- Define prefix quota alerts:
  - if a prefix exceeds X% of KB/vector pool, health warning.
- Define search regression suite.
- Decide whether this belongs under Đ43 appendix or a new KB governance law.
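Upload-time classification could be enforced with a small validator like the Python sketch below. The field names and tier labels come from the requirements above; the function name, the tier value strings, and the T2/is_ephemeral consistency rule are assumptions for illustration.

```python
# Field names from the 20C requirements; value formats are not yet defined.
REQUIRED_FIELDS = ("kb_tier", "retention_policy", "source_type", "is_ephemeral")
VALID_TIERS = ("T0", "T1", "T2", "T3")


def validate_upload_metadata(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the upload may proceed."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in metadata
    ]
    tier = metadata.get("kb_tier")
    if tier is not None and tier not in VALID_TIERS:
        problems.append(f"unknown kb_tier: {tier}")
    # Assumed rule: ephemeral-tier docs should not declare is_ephemeral=False.
    if tier == "T2" and metadata.get("is_ephemeral") is False:
        problems.append("T2 docs should set is_ephemeral=true")
    return problems
```

Rejecting uploads with a non-empty problem list would stop unclassified documents from entering the KB/vector pool in the first place.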
Immediate next step
Opus should create the design pack above, not a mutation prompt.
After design pack review, dispatch a small read-only Agent inspection for 20A search code location if needed.
Hard boundaries
Until GPT/User approves an execution pack:
- no deleteDocument;
- no deindex;
- no DOT patch;
- no Đ43 patch;
- no vector config mutation;
- no bulk cleanup;
- no Pack 2C/IU work while vector hygiene is the active priority.
Separate note
The birth pipeline inventory report mentioned by Opus is separate and should be reviewed later. It should not be mixed into the vector hygiene decision.