KB-1069

GPT Review — Vector Hygiene Report and Next Directive

Revision 1
Tags: gpt-review, vector-hygiene, context-pack, kb-governance, search-quality, directive

Date: 2026-05-05
Reviewer: GPT-5.5 Thinking / Incomex AI Council (Hội đồng AI)
Reviewed:

  • knowledge/dev/laws/dieu44-trien-khai/reports/inv-search-vector-hygiene-context-pack-report.md
  • knowledge/dev/laws/dieu44-trien-khai/reviews/opus-review-search-vector-hygiene-report-and-kb-architecture-2026-05-05.md

Verdict

Agent report PASS. Opus assessment is directionally correct.

The vector/search problem is real, measured, and severe enough to pause unrelated design work until the retrieval layer is stabilized.

Evidence accepted

Key findings accepted:

  • context-pack/ has 1,174 docs, about 37.5% of KB docs.
  • Context-pack vector chunks are estimated to exceed 60% of the vector pool.
  • Search pollution is proven:
    • query description_policy structured_exempt returns 100% context-pack near-duplicates in observed top results.
    • query Điều 43 context pack lifecycle is also polluted.
  • Context-pack metadata is consistent enough to filter:
    • path prefix context-pack/
    • metadata.source = dieu43_context_pack_publish
    • tags include context-pack, build.
  • Đ43 has no TTL/retention/deindex policy.
  • The DOT script accumulates builds under context-pack/<build_id>/, while the README says the live folder is overwritten in place.
  • The 9 latest builds each contain only 1 file, which indicates context-pack pipeline degradation that is separate from vector pollution.

Strategic decision

Adopt staged handling:

Stage 1 — Search safety filter (immediate, reversible)

Default search must exclude historical context-pack snapshots.

Default retrieval should return canonical/working/registry docs first and include context-pack only on explicit opt-in.

This is the emergency fix for search quality.
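
A minimal sketch of the Stage 1 default policy, assuming a post-filter over search hits. The `Hit` shape, the function names, and the `include_context_pack` parameter are hypothetical; the actual search API has not been located yet, so this only illustrates the intended behavior (exclude by default, include on explicit opt-in, delete nothing).

```python
from dataclasses import dataclass, field

# Signals taken from the accepted evidence above.
CONTEXT_PACK_SOURCE = "dieu43_context_pack_publish"
CONTEXT_PACK_PREFIX = "context-pack/"


@dataclass
class Hit:
    """Hypothetical search hit; the real result shape is not known."""
    path: str
    score: float
    metadata: dict = field(default_factory=dict)


def is_context_pack(hit: Hit) -> bool:
    """True if either the metadata source or the path prefix matches."""
    return (
        hit.metadata.get("source") == CONTEXT_PACK_SOURCE
        or hit.path.startswith(CONTEXT_PACK_PREFIX)
    )


def apply_default_policy(hits: list[Hit], include_context_pack: bool = False) -> list[Hit]:
    """Drop context-pack hits unless the caller opts in. Nothing is deleted."""
    if include_context_pack:
        return list(hits)
    return [h for h in hits if not is_context_pack(h)]
```

A post-filter like this can return fewer than the requested number of results when the raw top-k is dominated by context-pack chunks, so pushing the same predicate into the vector query itself is probably preferable wherever the retrieval layer supports metadata filters.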

Stage 2 — Context-pack lifecycle fix (medium-term)

Fix DOT/Đ43 so context-pack does not accumulate indefinitely in KB/vector.

Preferred direction:

  • keep latest context-pack accessible;
  • keep historical snapshots in filesystem/manifest/cold audit, not hot KB/vector;
  • define retention/TTL formally.

Option H is plausible but must be implemented with guardrails:

  • confirm filesystem retention actually exists or create it;
  • confirm PG manifest/checksum suffices for audit;
  • add rollback/dry-run;
  • do not bulk-delete until search filter is already active and latest/live path verified.

Stage 3 — KB Governance Framework (long-term)

Context-pack is only the first symptom. Governance is needed for all KB/vector documents:

  • tiers: canonical, working, ephemeral, registry/generated;
  • default search policy per tier;
  • TTL/retention per tier;
  • quota alerts per prefix/source;
  • upload-time classification;
  • search observability and regression tests.

Directive to Opus/Ocus

Do not jump straight to bulk delete.

Create the next execution design pack:

knowledge/dev/laws/dieu44-trien-khai/design/20-search-vector-hygiene-stage1-stage2-execution-pack.md

The pack should have three separated gates:

20A — Stage 1 Search Filter / Retrieval Policy

Goal: stop search pollution without deleting anything.

Design requirements:

  1. Locate actual search API / Agent Data retrieval path.
  2. Identify where default filter can be applied.
  3. Default exclude:
    • metadata.source = 'dieu43_context_pack_publish'
    • OR path prefix context-pack/.
  4. Add explicit opt-in, e.g. include_context_pack=true or query mode.
  5. Add canonical-first behavior where feasible.
  6. Add regression tests using the 7 pollution queries from report.
  7. Success metric:
    • canonical docs top-5 for 7/7 known queries;
    • context-pack share under 5% in default canonical queries;
    • explicit context-pack query still retrieves latest context-pack.
  8. No delete/deindex in 20A.

20A should start with read-only source inspection of search code, then produce an execution prompt.
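
The 20A success metric could be checked with a small regression harness along these lines. This is a sketch under assumptions: `search` stands in for whatever retrieval call 20A locates, and the query-to-expected-path mapping must come from the 7 pollution queries in the report, which are not reproduced here.

```python
from typing import Callable

CONTEXT_PACK_PREFIX = "context-pack/"


def context_pack_share(paths: list[str]) -> float:
    """Fraction of result paths that live under the context-pack prefix."""
    if not paths:
        return 0.0
    return sum(p.startswith(CONTEXT_PACK_PREFIX) for p in paths) / len(paths)


def check_query(
    result_paths: list[str],
    expected_canonical: str,
    top_k: int = 5,
    max_share: float = 0.05,
) -> bool:
    """Pass if the canonical doc is in the top-k and context-pack share is under the limit."""
    in_top_k = expected_canonical in result_paths[:top_k]
    return in_top_k and context_pack_share(result_paths) < max_share


def run_suite(
    cases: dict[str, str],
    search: Callable[[str], list[str]],
) -> dict[str, bool]:
    """Run every query and report pass/fail per query.

    `cases` maps a query string to its expected canonical path;
    `search` is a hypothetical callable returning ranked result paths.
    """
    return {query: check_query(search(query), expected) for query, expected in cases.items()}
```

The gate passes only when every entry in the returned mapping is true, which corresponds to the 7/7 target. The third criterion (an explicit context-pack query still retrieves the latest context-pack) needs a separate opt-in test case.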

20B — Context-pack lifecycle / retention design

Goal: stop future accumulation.

Design requirements:

  1. Inspect current dot-context-pack-build.sh upload/publish code.
  2. Inspect filesystem /opt/incomex/context-pack/ retention reality.
  3. Inspect PG manifest tables/views:
    • context_pack_manifest
    • context_pack_requests
    • context_pack_sections
    • v_context_pack_latest.
  4. Resolve doc-vs-runtime conflict:
    • README says overwrite-in-place;
    • DOT script accumulates under context-pack/<build_id>/.
  5. Propose exact lifecycle:
    • latest in KB/hot retrieval or live folder;
    • historical on filesystem/manifest for N days;
    • old KB snapshots deleted or deindexed after retention.
  6. Include dry-run delete plan and rollback story.
  7. Do not execute delete in design stage.
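
To make the dry-run requirement concrete, the retention decision can be computed as a pure plan with no side effects. The sketch below assumes each build has an identifier and a publish timestamp; the `Build` type, `plan_retention`, and the retention window are hypothetical, and the real source of build records (for example the manifest tables listed above) still has to be confirmed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Build:
    """Hypothetical build record; real fields depend on the manifest schema."""
    build_id: str
    published_at: datetime


def plan_retention(
    builds: list[Build],
    now: datetime,
    retention_days: int,
) -> tuple[list[Build], list[Build]]:
    """Return (keep, expire) lists, newest first. Dry-run only: nothing is deleted."""
    if not builds:
        return [], []
    ordered = sorted(builds, key=lambda b: b.published_at, reverse=True)
    latest, older = ordered[0], ordered[1:]
    cutoff = now - timedelta(days=retention_days)
    # The latest build is always kept, even if it is older than the cutoff.
    keep = [latest] + [b for b in older if b.published_at >= cutoff]
    expire = [b for b in older if b.published_at < cutoff]
    return keep, expire
```

Because the function only returns two lists, the `expire` list can be reviewed, logged, and stored as the rollback reference before any execution pack is approved.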

20C — KB Governance Framework

Goal: prevent recurrence when data volume grows 100x.

Design requirements:

  1. Define doc tiers:
    • T0 canonical;
    • T1 working/reports/reviews;
    • T2 ephemeral/snapshots;
    • T3 registry/generated structured docs.
  2. Map prefixes to tiers.
  3. Define default search policy by tier.
  4. Define TTL/retention policy by tier.
  5. Define the required upload-time metadata fields:
    • kb_tier
    • retention_policy
    • source_type
    • is_ephemeral
    • optional build_id, is_latest.
  6. Define prefix quota alerts:
    • if a prefix exceeds X% of the KB/vector pool, raise a health warning.
  7. Define search regression suite.
  8. Decide whether this belongs under Đ43 appendix or a new KB governance law.
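
As an illustration of items 1, 5, and 6, the tier list, the required metadata fields, and a prefix quota check might be expressed as follows. This is a sketch, not a proposed schema: the enum values, function names, and the rule that the prefix is the first path segment are all assumptions, and the threshold X is left to the design pack.

```python
from collections import Counter
from enum import Enum


class Tier(Enum):
    """Doc tiers from item 1; the string values are illustrative."""
    T0 = "canonical"
    T1 = "working"
    T2 = "ephemeral"
    T3 = "registry"


# Required fields from item 5; build_id and is_latest stay optional.
REQUIRED_FIELDS = ("kb_tier", "retention_policy", "source_type", "is_ephemeral")


def metadata_problems(metadata: dict) -> list[str]:
    """List upload-time metadata problems; an empty list means the doc is acceptable."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if name not in metadata]
    tier = metadata.get("kb_tier")
    if tier is not None and tier not in Tier.__members__:
        problems.append(f"unknown kb_tier {tier!r}")
    return problems


def top_prefix(path: str) -> str:
    """First path segment with a trailing slash (assumed prefix granularity)."""
    return path.split("/", 1)[0] + "/"


def quota_warnings(paths: list[str], max_share: float) -> dict[str, float]:
    """Return each prefix whose share of all docs exceeds max_share."""
    counts = Counter(top_prefix(p) for p in paths)
    total = len(paths)
    return {prefix: n / total for prefix, n in counts.items() if n / total > max_share}
```

Counting documents is only a proxy here. Since the report estimates chunk share separately from doc share, a production check would likely count vector chunks per prefix as well.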

Immediate next step

Opus should create the design pack above, not a mutation prompt.

After design pack review, dispatch a small read-only Agent inspection for 20A search code location if needed.

Hard boundaries

Until GPT/User approves an execution pack:

  • no deleteDocument;
  • no deindex;
  • no DOT patch;
  • no Đ43 patch;
  • no vector config mutation;
  • no bulk cleanup;
  • no Pack 2C/IU work while vector hygiene is active priority.

Separate note

The birth pipeline inventory report mentioned by Opus is separate and should be reviewed later. It should not be mixed into the vector hygiene decision.

knowledge/dev/laws/dieu44-trien-khai/reviews/gpt-review-vector-hygiene-report-and-next-directive-2026-05-05.md