KB-295C

Phase 2F-D1 — list_documents SQL Pushdown Implementation Report (2026-05-14)



  • Date: 2026-05-14
  • Status: PASS — deployed to VPS, committed
  • Commit: 19315ce on main of /opt/incomex/docker/agent-data-repo
  • Design reference: knowledge/current-state/reports/phase-2f-c-rev2-mcp-production-patch-design-2026-05-14.md
  • Approval: knowledge/current-state/reviews/gpt-approval-phase-2f-c-rev2-d1-2026-05-14.md


1. Code changes summary

  • agent_data/pg_store.py (+85 lines): new _escape_like_prefix(prefix) and list_kb_docs_paginated(prefix, limit, offset): lean projection, COLLATE "C" pushdown, partial-index-aware predicates. stream_docs retained for non-list callers.
  • agent_data/server.py (+47 / -33 lines): new constant LIST_DOCUMENTS_MAX_OFFSET = 10000 (env-tunable via AGENT_DATA_LIST_MAX_OFFSET); offset cap in _normalize_list_pagination; rewrite of _list_kb_documents_page to call the new pg_store helper and fetch limit+1 to detect the next page; MCP schema description clarifies paging semantics and adds the prefix synonym; dispatcher accepts the prefix synonym; CONNECTOR_SCHEMA_VERSION bumped to gpt-agent-data-2026-05-14.1.

Total diff: 2 files changed, 132 insertions(+), 33 deletions(-).
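In outline, the pushed-down statement built by list_kb_docs_paginated looks like the following. This is an illustrative reconstruction from this report's description and the DDL in section 2, not the exact production SQL; the projected columns and parameter style are assumptions.

```python
# Illustrative reconstruction of the pushed-down list query. The projected
# columns and the %(...)s parameter style are assumptions, not the exact
# production code in agent_data/pg_store.py.
LIST_QUERY = """
SELECT data->>'document_id' AS document_id,
       data->>'updated_at'  AS updated_at
FROM kb_documents
WHERE (data->>'deleted_at') IS NULL
  AND (data->>'document_id') IS NOT NULL
  AND length(data->>'document_id') > 0
  AND (data->>'document_id') COLLATE "C" LIKE %(pattern)s
ORDER BY (data->>'document_id') COLLATE "C"
LIMIT %(limit_plus_one)s OFFSET %(offset)s
"""
```

The WHERE predicates deliberately mirror the partial index so the planner can match idx_kb_documents_doc_id_c_live, and the shared COLLATE "C" on both the LIKE and the ORDER BY is what allows a Sort-free Index Scan.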

Behavioural changes

  • count semantics: now reflects the number of items returned in this page, not the total matches. Clients should detect end-of-pages by next_offset == null.
  • prefix synonym: MCP arg prefix accepted alongside path; first non-empty wins. Backward compat with path preserved.
  • offset cap: > 10 000 returns 422 INVALID_ARGUMENT with field=offset, max_offset=10000.
  • limit cap: unchanged at 100 (already enforced).
  • Soft-deleted rows: excluded by SQL WHERE (data->>'deleted_at') IS NULL, matching pre-D1 behaviour.
  • Anomalous row: the one row with empty key and empty document_id (4.4 KB body) is excluded by length(data->>'document_id') > 0.
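The offset cap, limit cap, and limit+1 next-page detection described above can be sketched as follows. Helper names here are hypothetical; the production logic lives in _normalize_list_pagination and _list_kb_documents_page in server.py.

```python
# Hypothetical sketch of the pagination rules above; not the production code.
MAX_LIMIT = 100
MAX_OFFSET = 10_000  # env-tunable via AGENT_DATA_LIST_MAX_OFFSET in production


def normalize_pagination(limit: int, offset: int) -> tuple[int, int]:
    """Validate paging args, mirroring the 422 INVALID_ARGUMENT behaviour."""
    if limit > MAX_LIMIT:
        raise ValueError(f"INVALID_ARGUMENT: max_limit={MAX_LIMIT}")
    if offset > MAX_OFFSET:
        raise ValueError(f"INVALID_ARGUMENT: max_offset={MAX_OFFSET}")
    return limit, offset


def page_from_rows(rows: list, limit: int, offset: int) -> dict:
    """Given limit+1 rows fetched upstream, build one result page.

    The extra row only signals that a next page exists; it is never returned.
    """
    has_more = len(rows) > limit
    items = rows[:limit]
    return {
        "items": items,
        "count": len(items),  # items in THIS page, not total matches
        "next_offset": offset + limit if has_more else None,
    }
```

With 51 rows fetched for limit=50, offset=0, the page holds 50 items and next_offset=50; with 50 or fewer rows, next_offset is null and the client stops.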

2. Index DDL applied

Applied directly via psql on VPS (one-off SRE operation, non-transactional, non-blocking):

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_kb_documents_doc_id_c_live
  ON kb_documents (((data->>'document_id') COLLATE "C"))
  WHERE (data->>'deleted_at') IS NULL
    AND (data->>'document_id') IS NOT NULL
    AND length(data->>'document_id') > 0;
ANALYZE kb_documents;

Design deviation noted: Rev2 specified the text_pattern_ops opclass. During verification, a plain btree on (data->>'document_id') COLLATE "C" produced a strictly better plan, because COLLATE "C" satisfies both the LIKE 'prefix%' range optimization and the ORDER BY ordering, letting the planner do a straight Index Scan with no Sort step. With text_pattern_ops, the planner still needed an explicit Sort because ~<~ ordering does not satisfy an ordinary ORDER BY. The index name is therefore idx_kb_documents_doc_id_c_live (no _pattern suffix). Same partial WHERE clause, same correctness, better plan.
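The Sort-free plan falls out of the planner's LIKE-prefix rewrite under "C" collation, which can be sketched in Python as below. This is an illustrative simplification that ignores the edge case of a prefix ending in the maximum byte value.

```python
def like_prefix_bounds(prefix: str) -> tuple[str, str]:
    """Range equivalent of LIKE 'prefix%' under byte-wise ("C") ordering.

    The planner rewrites the LIKE into key >= prefix AND key < upper, where
    upper is the prefix with its last character incremented. Because the same
    "C" ordering also satisfies the ORDER BY, no separate Sort is needed.
    (Simplified sketch: does not handle a prefix ending in the maximum byte.)
    """
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, upper
```

For example, like_prefix_bounds("knowledge/") returns ("knowledge/", "knowledge0"), matching the Index Cond shown in section 3 ('/' is byte 47, '0' is byte 48).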

Build time: <1 s. Size: ~70 KB. Existing idx_kb_documents_doc_id retained as safety net.

3. EXPLAIN before / after

Before (pre-D1 stream_docs path)

Seq Scan on kb_documents  (rows=5040 width=529, actual time=0.044..3.626)
Execution Time: 4.015 ms  -- PG only; Python materialization dominated wall time

After — new query plans (post-index, post-ANALYZE)

  • path=knowledge/ (2 575 matching rows): Limit -> Index Scan using idx_kb_documents_doc_id_c_live (Index Cond >= 'knowledge/' AND < 'knowledge0'), no Sort. Execution time 9.5 ms.
  • path=knowledge/test/ (28 matching rows): Limit -> Sort -> Bitmap Heap Scan (planner picks a bitmap for a very narrow set). Execution time 0.35 ms.
  • empty prefix (2 911 alive rows): Limit -> Index Scan using idx_kb_documents_doc_id_c_live, no Sort. Execution time 1.9 ms.
  • path=no/such/path/ (0 rows): Limit -> Index Scan returns immediately. Execution time 0.04 ms.

4. Functional test results

Test harness: 16 checks run inside incomex-agent-data against http://127.0.0.1:8000/mcp-gpt.

  • T1 /mcp-gpt tools/list: count 8 (search/list/get/getrw/batch/upload/update/patch)
  • T2 /mcp-gpt-full tools/list: count 11 (above + delete/move/ingest)
  • T3 list_documents, no args: returns 50 items, next_offset=50, truncated=True
  • T4 path=knowledge/ limit=50: 50 items, all start with knowledge/, next_offset=50
  • T5 path=knowledge/test/ limit=50: 28 items, samples confirmed
  • T6 path=no/such/path/: 0 items, next_offset=null
  • T7 path=knowledge_/ (literal _): 0 items; _ is escaped (would have returned knowledge/* if escaping were broken)
  • T8 limit=1 offset=0: returns first doc, next_offset=1, truncated=True
  • T9 limit=1 offset=1: returns a different, second doc
  • T10 limit=101: isError=true, INVALID_ARGUMENT max_limit=100
  • T11 offset=10001: isError=true, INVALID_ARGUMENT max_offset=10000
  • T12 prefix synonym vs path: identical results
  • T13 soft-delete exclusion: 100 items, none expose the deleted_at field
  • T14 regression get_document: OK, no error
  • T15 regression batch_read (2 paths): OK, no error
  • T16 regression search_knowledge: OK, returns hits

16 / 16 PASS.
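For illustration, a tools/call request exercising the prefix synonym (as in T12) could be built like this. The payload shape follows standard MCP JSON-RPC; the harness's actual code is not shown in this report, so names here are illustrative.

```python
import json


def list_documents_call(prefix: str, limit: int = 50, offset: int = 0) -> str:
    """Build an MCP tools/call request body for list_documents (illustrative)."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {
            "name": "list_documents",
            # 'prefix' is accepted as a synonym of 'path'; first non-empty wins
            "arguments": {"prefix": prefix, "limit": limit, "offset": offset},
        },
    })
```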

5. Performance: before / after

Phase 2F-A baseline

  • list_documents path=knowledge/test/, concurrency 5: p95 4 470 ms.

D1 result (this implementation)

Same hardware, same dataset, same MCP endpoint (/mcp-gpt).

A. Single-client (n=5 sequential)

  • list path=knowledge/test/ limit=50: p50 7.6 ms, p95 8.0 ms, max 8.0 ms
  • list path=knowledge/ limit=50: p50 14.2 ms, p95 24.5 ms, max 24.5 ms
  • list path=no/such/path/ limit=50: p50 8.2 ms, p95 10.5 ms, max 10.5 ms

B. Concurrency 5 (n=50)

  • list path=knowledge/test/: p50 16.3 ms, p95 23.3 ms, max 28.4 ms, 0 errors
  • list path=knowledge/: p50 45.6 ms, p95 78.6 ms, max 1 803 ms*, 0 errors
  • list path='' (worst case): p50 37.4 ms, p95 68.0 ms, max 73.5 ms, 0 errors

* Single outlier on first burst (cold cache). Re-run with n=100 sustained:

  • list path=knowledge/ limit=50, n=100, c=5: p50 72.9 ms, p95 138.0 ms, p99 157.5 ms, max 157.5 ms

C. Concurrency 10 (n=50)

  • list path=knowledge/: p50 70.5 ms, p95 139.2 ms, max 146.9 ms
  • list path='': p50 57.4 ms, p95 130.4 ms, max 143.9 ms

Comparison

  • p95 knowledge/test/ c=5: 4 470 ms before (2F-A) -> 23.3 ms after (D1), ~192x speedup
  • p95 knowledge/ c=5 sustained: n/a before (2F-A was slower) -> 138 ms after
  • p95 knowledge/ c=10: n/a before (2F-A would degrade) -> 139 ms after

Targets met:

  • p95 concurrency 5: < 250 ms target -> 23–138 ms PASS
  • p95 concurrency 10: < 400 ms target -> 130–140 ms PASS
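For reference, the p50/p95/p99 figures quoted above can be recomputed from raw latency samples with a nearest-rank percentile. This is a generic sketch; the harness's exact aggregation method is not stated in this report.

```python
import math


def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of the samples are at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]
```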

6. Regression tests

No tool latency regressed.

  • get_document, n=30, c=5: p50 16.0 ms, p95 24.5 ms, max 26.3 ms, 0 errors
  • batch_read (2 paths), n=30, c=5: p50 19.3 ms, p95 28.7 ms, max 35.1 ms, 0 errors
  • search_knowledge, n=10, c=3: p50 202.8 ms, p95 580.6 ms, max 580.6 ms, 0 errors

7. Rollback procedure

Code revert:

ssh root@38.242.240.89
cd /opt/incomex/docker/agent-data-repo
git revert 19315ce
cd /opt/incomex/docker
docker compose build agent-data && docker compose up -d agent-data

Index drop (separate, run only if also reverting):

DROP INDEX CONCURRENTLY IF EXISTS idx_kb_documents_doc_id_c_live;

Both operations are non-blocking. The pre-D1 stream_docs-based path remains in code (revert restores it as the caller); existing idx_kb_documents_doc_id index stays in place either way.

8. Commit

  • Hash: 19315ce
  • Branch: main
  • Message preview: phase2f-d1: optimize list_documents with SQL pushdown + partial index
  • Co-author trailer: Claude Opus 4.7 (1M context)

Repo: /opt/incomex/docker/agent-data-repo (5 ahead, 112 behind origin/main — long-standing per project convention; do not git pull).

9. Risks / notes

  • R1 (mitigated). LIKE escape: tested with knowledge_/ (returns 0, correctly does NOT match knowledge/*). Order is backslash -> % -> _. Verified with 1 184 alive docs containing _.
  • R2 (acknowledged). Sort order under COLLATE "C" is byte-wise, matching Python's default string sort (code-point order) for UTF-8 text. No semantic change vs pre-D1.
  • R3 (deviated from Rev2). Plain btree COLLATE "C" was used instead of text_pattern_ops because it produced a strictly better plan (no Sort step). Both correctness and intent preserved. Index name reflects: idx_kb_documents_doc_id_c_live (not ..._c_pattern_live).
  • R4 (out of scope). stream_docs still has the full-scan pattern. Used by reindex_kb_documents, cleanup_orphan_vectors, _compute_data_integrity, audit_sync. Will need server-side cursor migration when the table grows past ~30 k rows. Not blocking now.
  • R5 (data anomaly). One row with empty key and empty document_id is excluded by length(...) > 0. Should be cleaned up separately.
  • R6 (planner cold start). First burst after index creation showed a 1.8 s outlier in a single test. Sustained runs (100 requests) show p99 = 157 ms with no outliers. Acceptable; warm-up is one-time.
  • R7 (logging). New connector_call log line replaces total_count with has_more. Any log-parsing tooling must update accordingly.
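If R4's cursor migration goes ahead, the stream_docs callers could move to keyset pagination, sketched below. This is a sketch of the general technique under this report's schema, not a committed design; names are illustrative.

```python
# Sketch of keyset (cursor) pagination for the stream_docs callers in R4.
# Unlike OFFSET, cost does not grow with page depth: each page seeks the
# index at the last key returned, reusing the same COLLATE "C" index.
# Illustrative only; not production code.
KEYSET_QUERY = """
SELECT data->>'document_id' AS document_id, data
FROM kb_documents
WHERE (data->>'deleted_at') IS NULL
  AND (data->>'document_id') IS NOT NULL
  AND length(data->>'document_id') > 0
  AND (data->>'document_id') COLLATE "C" > %(after_key)s
ORDER BY (data->>'document_id') COLLATE "C"
LIMIT %(page_size)s
"""
```

The caller feeds the last document_id of each page back in as after_key, starting from the empty string, until a page comes back short.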

10. Recommendation

Proceed to Phase 2F-D2a (transport content negotiation).

D1 is in production, p95 targets exceeded, zero errors, zero regressions. The list_documents path is no longer the throughput bottleneck. Next blocker for ChatGPT MCP App handshake is transport (Accept negotiation / SSE), per Rev2 S3. D2a is independent of D1 and can ship immediately.

D3 (move_document schema cleanup) can be bundled with D2a or shipped right after; trivial change.

D4 canary remains parked until C1..C7 evidence is collected.


Appendix — VPS state at completion

  • Container: incomex-agent-data Up (healthy) after rebuild; CPU 1.6 %, mem 1.4 GiB / 2.5 GiB.
  • Postgres: 5045 total rows, 2912 alive (one new test doc created during functional checks).
  • New index: idx_kb_documents_doc_id_c_live present, size ~70 KB.
  • Logs: no ERROR / Exception / Traceback since restart.
  • Health: all three services (qdrant, postgres, openai) reporting ok.

End of report.
