KB-295C

Phase 2F-D1 — list_documents SQL Pushdown Implementation Report (2026-05-14)



  • Date: 2026-05-14
  • Status: PASS — deployed to VPS, committed
  • Commit: 19315ce on main of /opt/incomex/docker/agent-data-repo
  • Design reference: knowledge/current-state/reports/phase-2f-c-rev2-mcp-production-patch-design-2026-05-14.md
  • Approval: knowledge/current-state/reviews/gpt-approval-phase-2f-c-rev2-d1-2026-05-14.md


1. Code changes summary

  • agent_data/pg_store.py (+85 lines): new _escape_like_prefix(prefix) and list_kb_docs_paginated(prefix, limit, offset): lean projection, COLLATE "C" pushdown, partial-index-aware predicates. stream_docs retained for non-list callers.
  • agent_data/server.py (+47 / -33 lines): new constant LIST_DOCUMENTS_MAX_OFFSET = 10000 (env-tunable via AGENT_DATA_LIST_MAX_OFFSET); offset cap in _normalize_list_pagination; rewrite of _list_kb_documents_page to call the new pg_store helper and fetch limit+1 to detect the next page; MCP schema description clarifies paging semantics and adds the prefix synonym; dispatcher accepts the prefix synonym; CONNECTOR_SCHEMA_VERSION bumped to gpt-agent-data-2026-05-14.1.

Total diff: 2 files changed, 132 insertions(+), 33 deletions(-).
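In outline, the pushed-down statement built by list_kb_docs_paginated looks like the following. This is an illustrative reconstruction from this report's description and the DDL in section 2, not the exact production SQL; the projected columns and parameter style are assumptions.

```python
# Illustrative reconstruction of the pushed-down list query. The projected
# columns and the %(...)s parameter style are assumptions, not the exact
# production code in agent_data/pg_store.py.
LIST_QUERY = """
SELECT data->>'document_id' AS document_id,
       data->>'updated_at'  AS updated_at
FROM kb_documents
WHERE (data->>'deleted_at') IS NULL
  AND (data->>'document_id') IS NOT NULL
  AND length(data->>'document_id') > 0
  AND (data->>'document_id') COLLATE "C" LIKE %(pattern)s
ORDER BY (data->>'document_id') COLLATE "C"
LIMIT %(limit_plus_one)s OFFSET %(offset)s
"""
```

The WHERE predicates deliberately mirror the partial index so the planner can match idx_kb_documents_doc_id_c_live, and the shared COLLATE "C" on both the LIKE and the ORDER BY is what allows a Sort-free Index Scan.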

Behavioural changes

  • count semantics: now reflects the number of items returned in this page, not the total matches. Clients should detect end-of-pages by next_offset == null.
  • prefix synonym: MCP arg prefix accepted alongside path; first non-empty wins. Backward compat with path preserved.
  • offset cap: > 10 000 returns 422 INVALID_ARGUMENT with field=offset, max_offset=10000.
  • limit cap: unchanged at 100 (already enforced).
  • Soft-deleted rows: excluded by SQL WHERE (data->>'deleted_at') IS NULL, matching pre-D1 behaviour.
  • Anomalous row: the one row with empty key and empty document_id (4.4 KB body) is excluded by length(data->>'document_id') > 0.
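The offset cap, limit cap, and limit+1 next-page detection described above can be sketched as follows. Helper names here are hypothetical; the production logic lives in _normalize_list_pagination and _list_kb_documents_page in server.py.

```python
# Hypothetical sketch of the pagination rules above; not the production code.
MAX_LIMIT = 100
MAX_OFFSET = 10_000  # env-tunable via AGENT_DATA_LIST_MAX_OFFSET in production


def normalize_pagination(limit: int, offset: int) -> tuple[int, int]:
    """Validate paging args, mirroring the 422 INVALID_ARGUMENT behaviour."""
    if limit > MAX_LIMIT:
        raise ValueError(f"INVALID_ARGUMENT: max_limit={MAX_LIMIT}")
    if offset > MAX_OFFSET:
        raise ValueError(f"INVALID_ARGUMENT: max_offset={MAX_OFFSET}")
    return limit, offset


def page_from_rows(rows: list, limit: int, offset: int) -> dict:
    """Given limit+1 rows fetched upstream, build one result page.

    The extra row only signals that a next page exists; it is never returned.
    """
    has_more = len(rows) > limit
    items = rows[:limit]
    return {
        "items": items,
        "count": len(items),  # items in THIS page, not total matches
        "next_offset": offset + limit if has_more else None,
    }
```

With 51 rows fetched for limit=50, offset=0, the page holds 50 items and next_offset=50; with 50 or fewer rows, next_offset is null and the client stops.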

2. Index DDL applied

Applied directly via psql on VPS (one-off SRE operation, non-transactional, non-blocking):

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_kb_documents_doc_id_c_live
  ON kb_documents (((data->>'document_id') COLLATE "C"))
  WHERE (data->>'deleted_at') IS NULL
    AND (data->>'document_id') IS NOT NULL
    AND length(data->>'document_id') > 0;
ANALYZE kb_documents;

Design deviation noted: Rev2 specified the text_pattern_ops opclass. During verification, a plain btree on (data->>'document_id') COLLATE "C" produced a strictly better plan, because COLLATE "C" satisfies both the LIKE 'prefix%' range optimization and the ORDER BY ordering, letting the planner do a straight Index Scan with no Sort step. With text_pattern_ops, the planner still needed an explicit Sort because ~<~ ordering does not satisfy an ordinary ORDER BY. The index name is therefore idx_kb_documents_doc_id_c_live (no _pattern suffix). Same partial WHERE clause, same correctness, better plan.
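The Sort-free plan falls out of the planner's LIKE-prefix rewrite under "C" collation, which can be sketched in Python as below. This is an illustrative simplification that ignores the edge case of a prefix ending in the maximum byte value.

```python
def like_prefix_bounds(prefix: str) -> tuple[str, str]:
    """Range equivalent of LIKE 'prefix%' under byte-wise ("C") ordering.

    The planner rewrites the LIKE into key >= prefix AND key < upper, where
    upper is the prefix with its last character incremented. Because the same
    "C" ordering also satisfies the ORDER BY, no separate Sort is needed.
    (Simplified sketch: does not handle a prefix ending in the maximum byte.)
    """
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, upper
```

For example, like_prefix_bounds("knowledge/") returns ("knowledge/", "knowledge0"), matching the Index Cond shown in section 3 ('/' is byte 47, '0' is byte 48).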

Build time: <1 s. Size: ~70 KB. Existing idx_kb_documents_doc_id retained as safety net.

3. EXPLAIN before / after

Before (pre-D1 stream_docs path)

Seq Scan on kb_documents  (rows=5040 width=529, actual time=0.044..3.626)
Execution Time: 4.015 ms  -- PG only; Python materialization dominated wall time

After — new query plans (post-index, post-ANALYZE)

  • path=knowledge/ (2 575 matching rows): Limit -> Index Scan using idx_kb_documents_doc_id_c_live (Index Cond >= 'knowledge/' AND < 'knowledge0'), no Sort. Execution time 9.5 ms.
  • path=knowledge/test/ (28 matching rows): Limit -> Sort -> Bitmap Heap Scan (planner picks a bitmap for a very narrow set). Execution time 0.35 ms.
  • empty prefix (2 911 alive rows): Limit -> Index Scan using idx_kb_documents_doc_id_c_live, no Sort. Execution time 1.9 ms.
  • path=no/such/path/ (0 rows): Limit -> Index Scan returns immediately. Execution time 0.04 ms.

4. Functional test results

Test harness: 16 checks run inside incomex-agent-data against http://127.0.0.1:8000/mcp-gpt.

  • T1 /mcp-gpt tools/list: count 8 (search/list/get/getrw/batch/upload/update/patch)
  • T2 /mcp-gpt-full tools/list: count 11 (above + delete/move/ingest)
  • T3 list_documents, no args: returns 50 items, next_offset=50, truncated=True
  • T4 path=knowledge/ limit=50: 50 items, all start with knowledge/, next_offset=50
  • T5 path=knowledge/test/ limit=50: 28 items, samples confirmed
  • T6 path=no/such/path/: 0 items, next_offset=null
  • T7 path=knowledge_/ (literal _): 0 items; _ is escaped (would have returned knowledge/* if escaping were broken)
  • T8 limit=1 offset=0: returns first doc, next_offset=1, truncated=True
  • T9 limit=1 offset=1: returns a different, second doc
  • T10 limit=101: isError=true, INVALID_ARGUMENT max_limit=100
  • T11 offset=10001: isError=true, INVALID_ARGUMENT max_offset=10000
  • T12 prefix synonym vs path: identical results
  • T13 soft-delete exclusion: 100 items, none expose the deleted_at field
  • T14 regression get_document: OK, no error
  • T15 regression batch_read (2 paths): OK, no error
  • T16 regression search_knowledge: OK, returns hits

16 / 16 PASS.
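For illustration, a tools/call request exercising the prefix synonym (as in T12) could be built like this. The payload shape follows standard MCP JSON-RPC; the harness's actual code is not shown in this report, so names here are illustrative.

```python
import json


def list_documents_call(prefix: str, limit: int = 50, offset: int = 0) -> str:
    """Build an MCP tools/call request body for list_documents (illustrative)."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {
            "name": "list_documents",
            # 'prefix' is accepted as a synonym of 'path'; first non-empty wins
            "arguments": {"prefix": prefix, "limit": limit, "offset": offset},
        },
    })
```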

5. Performance: before / after

Phase 2F-A baseline

  • list_documents path=knowledge/test/, concurrency 5: p95 4 470 ms.

D1 result (this implementation)

Same hardware, same dataset, same MCP endpoint (/mcp-gpt).

A. Single-client (n=5 sequential)

  • list path=knowledge/test/ limit=50: p50 7.6 ms, p95 8.0 ms, max 8.0 ms
  • list path=knowledge/ limit=50: p50 14.2 ms, p95 24.5 ms, max 24.5 ms
  • list path=no/such/path/ limit=50: p50 8.2 ms, p95 10.5 ms, max 10.5 ms

B. Concurrency 5 (n=50)

  • list path=knowledge/test/: p50 16.3 ms, p95 23.3 ms, max 28.4 ms, 0 errors
  • list path=knowledge/: p50 45.6 ms, p95 78.6 ms, max 1 803 ms*, 0 errors
  • list path='' (worst case): p50 37.4 ms, p95 68.0 ms, max 73.5 ms, 0 errors

* Single outlier on first burst (cold cache). Re-run with n=100 sustained:

  • list path=knowledge/ limit=50, n=100, c=5: p50 72.9 ms, p95 138.0 ms, p99 157.5 ms, max 157.5 ms

C. Concurrency 10 (n=50)

  • list path=knowledge/: p50 70.5 ms, p95 139.2 ms, max 146.9 ms
  • list path='': p50 57.4 ms, p95 130.4 ms, max 143.9 ms

Comparison

  • p95 knowledge/test/ c=5: 4 470 ms before (2F-A) -> 23.3 ms after (D1), ~192x speedup
  • p95 knowledge/ c=5 sustained: n/a before (2F-A was slower) -> 138 ms after
  • p95 knowledge/ c=10: n/a before (2F-A would degrade) -> 139 ms after

Targets met:

  • p95 concurrency 5: < 250 ms target -> 23–138 ms PASS
  • p95 concurrency 10: < 400 ms target -> 130–140 ms PASS
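For reference, the p50/p95/p99 figures quoted above can be recomputed from raw latency samples with a nearest-rank percentile. This is a generic sketch; the harness's exact aggregation method is not stated in this report.

```python
import math


def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of the samples are at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]
```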

6. Regression tests

No tool latency regressed.

  • get_document, n=30, c=5: p50 16.0 ms, p95 24.5 ms, max 26.3 ms, 0 errors
  • batch_read (2 paths), n=30, c=5: p50 19.3 ms, p95 28.7 ms, max 35.1 ms, 0 errors
  • search_knowledge, n=10, c=3: p50 202.8 ms, p95 580.6 ms, max 580.6 ms, 0 errors

7. Rollback procedure

Code revert:

ssh root@38.242.240.89
cd /opt/incomex/docker/agent-data-repo
git revert 19315ce
cd /opt/incomex/docker
docker compose build agent-data && docker compose up -d agent-data

Index drop (separate, run only if also reverting):

DROP INDEX CONCURRENTLY IF EXISTS idx_kb_documents_doc_id_c_live;

Both operations are non-blocking. The pre-D1 stream_docs-based path remains in code (revert restores it as the caller); existing idx_kb_documents_doc_id index stays in place either way.

8. Commit

  • Hash: 19315ce
  • Branch: main
  • Message preview: phase2f-d1: optimize list_documents with SQL pushdown + partial index
  • Co-author trailer: Claude Opus 4.7 (1M context)

Repo: /opt/incomex/docker/agent-data-repo (5 ahead, 112 behind origin/main — long-standing per project convention; do not git pull).

9. Risks / notes

  • R1 (mitigated). LIKE escape: tested with knowledge_/ (returns 0, correctly does NOT match knowledge/*). Order is backslash -> % -> _. Verified with 1 184 alive docs containing _.
  • R2 (acknowledged). Sort order under COLLATE "C" is byte-wise, matching Python's default string sort (code-point order) for UTF-8 text. No semantic change vs pre-D1.
  • R3 (deviated from Rev2). Plain btree COLLATE "C" was used instead of text_pattern_ops because it produced a strictly better plan (no Sort step). Both correctness and intent preserved. Index name reflects: idx_kb_documents_doc_id_c_live (not ..._c_pattern_live).
  • R4 (out of scope). stream_docs still has the full-scan pattern. Used by reindex_kb_documents, cleanup_orphan_vectors, _compute_data_integrity, audit_sync. Will need server-side cursor migration when the table grows past ~30 k rows. Not blocking now.
  • R5 (data anomaly). One row with empty key and empty document_id is excluded by length(...) > 0. Should be cleaned up separately.
  • R6 (planner cold start). First burst after index creation showed a 1.8 s outlier in a single test. Sustained runs (100 requests) show p99 = 157 ms with no outliers. Acceptable; warm-up is one-time.
  • R7 (logging). New connector_call log line replaces total_count with has_more. Any log-parsing tooling must update accordingly.
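If R4's cursor migration goes ahead, the stream_docs callers could move to keyset pagination, sketched below. This is a sketch of the general technique under this report's schema, not a committed design; names are illustrative.

```python
# Sketch of keyset (cursor) pagination for the stream_docs callers in R4.
# Unlike OFFSET, cost does not grow with page depth: each page seeks the
# index at the last key returned, reusing the same COLLATE "C" index.
# Illustrative only; not production code.
KEYSET_QUERY = """
SELECT data->>'document_id' AS document_id, data
FROM kb_documents
WHERE (data->>'deleted_at') IS NULL
  AND (data->>'document_id') IS NOT NULL
  AND length(data->>'document_id') > 0
  AND (data->>'document_id') COLLATE "C" > %(after_key)s
ORDER BY (data->>'document_id') COLLATE "C"
LIMIT %(page_size)s
"""
```

The caller feeds the last document_id of each page back in as after_key, starting from the empty string, until a page comes back short.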

10. Recommendation

Proceed to Phase 2F-D2a (transport content negotiation).

D1 is in production, p95 targets exceeded, zero errors, zero regressions. The list_documents path is no longer the throughput bottleneck. Next blocker for ChatGPT MCP App handshake is transport (Accept negotiation / SSE), per Rev2 S3. D2a is independent of D1 and can ship immediately.

D3 (move_document schema cleanup) can be bundled with D2a or shipped right after; trivial change.

D4 canary remains parked until C1..C7 evidence is collected.


Appendix — VPS state at completion

  • Container: incomex-agent-data Up (healthy) after rebuild; CPU 1.6 %, mem 1.4 GiB / 2.5 GiB.
  • Postgres: 5045 total rows, 2912 alive (one new test doc created during functional checks).
  • New index: idx_kb_documents_doc_id_c_live present, size ~70 KB.
  • Logs: no ERROR / Exception / Traceback since restart.
  • Health: all three services (qdrant, postgres, openai) reporting ok.

End of report.
