Phase 2F-C — MCP Production Patch Design Report
Date: 2026-05-13
Status: Investigation + design only. No code changes, no rebuild, no restart, no commit.
Author: Claude (Phase 2F-C)
Scope: Production-grade design for the AgentData MCP server prior to re-connecting ChatGPT.
Executive summary
- `list_documents` p95 4.47 s is a Python-side bottleneck, not SQL. PostgreSQL returns the full table in ~4 ms; the application then transfers 27 MB of JSONB, materializes 5 040 dicts per request, then filters/sorts/paginates in Python. Pagination args exist in the API but are applied after full materialization.
- The transport blocker for ChatGPT/Claude.ai is application-level, not nginx. All `/mcp*` handlers return `application/json`. The `/gpt-mcp` nginx route already has `proxy_buffering off` + `chunked_transfer_encoding on` — it is ready for SSE. The fix is in `agent_data/server.py`.
- `move_document` is intentionally `501 NOT_IMPLEMENTED` (S170 deprecation) but its schema is still advertised in `MCP_TOOLS`. This is a schema/runtime mismatch, not a missing implementation.
- Write p95 0.8–1.3 s is mostly OpenAI embed + Qdrant `wait=True` AND double work: handlers call `_sync_vector_entry` synchronously while a PG-NOTIFY listener (`pg_vector_listener.py`) does the same work asynchronously. Removing the inline sync path is a large, low-risk win.
- Endpoint policy is already partially correct — only `/gpt-mcp/{token}/mcp → /mcp-gpt` (8 tools, no delete/move/ingest) is exposed publicly via nginx. `/mcp-gpt-full` exists in the app but is not nginx-routed. Recommend keeping it that way.
- Two patches (`2F-D1` list_documents + `2F-D2` transport content-negotiation) are sufficient to unblock ChatGPT testing; everything else can follow.
- One nginx secret was observed in `/etc/nginx/secrets/gpt-mcp-route.conf`. It is NOT printed in this report. Recommend rotating after this design phase completes (separate task — do not rotate now per instruction).
Phase 0 — Source map / fact base
0.1 Source paths (VPS = SSOT)
| Concern | VPS path |
|---|---|
| MCP handlers / dispatch / tool schemas | /opt/incomex/docker/agent-data-repo/agent_data/server.py |
| PostgreSQL store | /opt/incomex/docker/agent-data-repo/agent_data/pg_store.py |
| Qdrant + OpenAI embedding | /opt/incomex/docker/agent-data-repo/agent_data/vector_store.py |
| Async vector sync listener | /opt/incomex/docker/agent-data-repo/agent_data/pg_vector_listener.py |
| Nginx site config | container incomex-nginx:/etc/nginx/conf.d/default.conf |
| Nginx secret-token MCP route | container incomex-nginx:/etc/nginx/secrets/gpt-mcp-route.conf |
0.2 Git state (VPS repo)
- Branch: `main`
- HEAD: `8d1442f` phase2e: harden MCP profiles and certify get_document behavior
- Working tree: clean (`git status --short` returns nothing).
- Repo is 5 ahead, 112 behind `origin/main` (long-standing — per project memory, edit the host directly, never `git pull`). The local clone at `/Users/nmhuyen/agent-data-test` is outdated relative to the VPS — do all design against VPS source.
0.3 Runtime state
incomex-agent-data Up 33 minutes (healthy) 8080/tcp
postgres Up 3 weeks (healthy) 127.0.0.1:5432->5432/tcp
incomex-qdrant Up 8 weeks (healthy) 6333-6334/tcp
incomex-nginx Up 13 days :80, :443
Application log shows PG->Qdrant vector sync listener started — the async listener thread is already running in production.
0.4 MCP profiles (Python-level routes, not nginx)
agent_data/server.py:2689-2718:
| Profile | Route in app | Tools |
|---|---|---|
| internal | POST /mcp | All 11 (MCP_TOOLS) |
| readonly | POST /mcp-readonly | 5 read tools (MCP_READONLY_TOOLS) |
| gpt | POST /mcp-gpt | 8 = read + upload/update/patch (MCP_GPT_TOOLS) |
| gpt-full | POST /mcp-gpt-full | All 11 (MCP_GPT_FULL_TOOLS) |
Nginx public exposure (/etc/nginx/secrets/gpt-mcp-route.conf):
- `/gpt-mcp/{secret}/mcp` → upstream `/mcp-gpt` (8 tools).
- Sets the `X-API-Key` header internally.
- Has `proxy_buffering off`, `chunked_transfer_encoding on`, `proxy_read_timeout 60s`.
- `/mcp-gpt-full` is NOT exposed by nginx.
0.5 PostgreSQL kb_documents schema (VPS)
key TEXT PRIMARY KEY -- filesystem-style ("knowledge__current-state__..."); _fs_key() translates '/' <-> '__'
data JSONB -- {document_id, parent_id, content:{body}, metadata, revision, deleted_at, vector_status, ...}
updated_at TIMESTAMPTZ
Indexes:
kb_documents_pkey btree (key)
idx_kb_documents_doc_id btree ((data->>'document_id'))
idx_kb_documents_deleted btree ((data->>'deleted_at'))
Triggers (6):
trg_kb_audit, trg_kb_snapshot_update, trg_kb_snapshot_delete,
trg_kb_truncation_guard, trg_kb_updated_at, trg_kb_vector_sync (-> pg_notify),
trg_context_pack_law_enact
Row stats:
total = 5040, alive = 2907, table size = 27 MB
avg JSONB row ~6 KB, max ~94 KB
top prefixes (alive): knowledge=2570, registries=223, operations=105
Part 1 — list_documents root cause + production design
1.1 Current code path
agent_data/server.py:2046 _list_kb_documents_page:
def _list_kb_documents_page(*, prefix="", limit=LIST_DOCUMENTS_DEFAULT_LIMIT, offset=0):
limit, offset = _normalize_list_pagination(limit, offset)
_ensure_pg()
docs = pg_store.stream_docs(KB_COLLECTION) # <- FULL TABLE LOAD
matches = []
for data in docs: # <- Python loop over 5040 rows
if data.get("deleted_at") is not None: continue
doc_id = data.get("document_id", data.get("_key", ""))
if prefix and not doc_id.startswith(prefix): continue
matches.append({...})
matches.sort(key=lambda x: x["document_id"]) # <- Python sort
page = matches[offset : offset + limit] # <- Python slice
agent_data/pg_store.py:170-176 stream_docs:
def stream_docs(collection):
with _conn() as conn, conn.cursor(cursor_factory=RealDictCursor) as cur:
cur.execute(f"SELECT key, data FROM {tbl}") # <- no WHERE, no LIMIT, no ORDER
return [{"_key": row["key"], **dict(row["data"])} for row in cur.fetchall()]
No body file reads, no Qdrant calls, no N+1. Just an unbounded full-table scan returning every JSONB row, fully materialized into Python dicts on every call.
1.2 Why p95 = 4.47 s while EXPLAIN says 4 ms
PG-side SELECT key, data FROM kb_documents:
Seq Scan on kb_documents (cost=0.00..445.14 rows=5014 width=529)
(actual time=0.044..3.626 rows=5040 loops=1)
Execution Time: 4.015 ms
PostgreSQL itself is fast. The cost is in the wire + Python layer:
- 27 MB of JSONB over the unix socket (psycopg2 -> `cursor.fetchall`).
- 5 040 calls to `dict(row["data"])` + 5 040 `{**dict, "_key": ...}` allocations.
- Then iterate again, filter, sort. Then slice.
- Multiplied by concurrency 5 -> GIL contention, gunicorn/uvicorn worker time spikes.
- No caching, no shared snapshot.
1.3 Proposed SQL (D1)
SELECT
data->>'document_id' AS document_id,
data->>'parent_id' AS parent_id,
COALESCE(data->'metadata'->>'title','') AS title,
COALESCE(data->'metadata'->'tags','[]'::jsonb) AS tags,
COALESCE((data->>'revision')::int, 0) AS revision
FROM kb_documents
WHERE (data->>'deleted_at') IS NULL
AND (%(prefix)s = '' OR (data->>'document_id') LIKE %(prefix_like)s)
ORDER BY (data->>'document_id')
LIMIT %(limit)s OFFSET %(offset)s;
`prefix_like` is `prefix || '%'`, built in Python with LIKE-escaping applied to `prefix` to neutralize `%`, `_`, and `\` (see the sketch below).
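A minimal sketch of the corresponding `pg_store` helper (the `list_kb_docs` name comes from the D1 patch table in Part 6; `LIST_SQL` stands for the query above, and `_conn()` is the existing pg_store connection helper — a design sketch, not the final implementation):

```python
import re

_LIKE_SPECIALS = re.compile(r"([\\%_])")  # escape \ % _ for LIKE (risk R1)

def list_kb_docs(prefix: str = "", limit: int = 50, offset: int = 0) -> list[dict]:
    # Backslash-escape LIKE wildcards in the user-supplied prefix, then
    # anchor the pattern to the start of document_id.
    prefix_like = _LIKE_SPECIALS.sub(r"\\\1", prefix) + "%"
    params = {"prefix": prefix, "prefix_like": prefix_like,
              "limit": limit, "offset": offset}
    with _conn() as conn, conn.cursor() as cur:
        cur.execute(LIST_SQL, params)  # LIST_SQL = the query in S1.3
        cols = [c.name for c in cur.description]
        return [dict(zip(cols, row)) for row in cur.fetchall()]
```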
Also need total count (current API returns count = total matches). Two options:
- A. A second `SELECT COUNT(*) FROM kb_documents WHERE ... AND (data->>'document_id') LIKE ...` — extra query (~5–20 ms).
- B. Drop `total_count`; return `count = len(items)` + `next_offset`. Recommended — matches REST norms, removes a query, and `total_count` is rarely needed by a paginated client.
If total_count must stay, prefer A.
1.4 Index design
Current idx_kb_documents_doc_id uses default collation. EXPLAIN with the new SQL:
Limit (actual time=81.6..87.4 rows=50 loops=1)
Index Scan using idx_kb_documents_doc_id on kb_documents
Filter: (((data->>'deleted_at') IS NULL) AND ((data->>'document_id') ~~ 'knowledge/%'))
Rows Removed by Filter: 1611
Execution Time: 87.5 ms
Index Scan walks 1 661 entries to find 50 — LIKE 'knowledge/%' is treated as a filter, not a range, under UTF-8 collation (confirmed: 'knowledge/x' < 'knowledge0' returns FALSE under default collation; TRUE under COLLATE "C").
Recommended new index (partial + text_pattern_ops):
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_kb_documents_doc_id_pattern
ON kb_documents ((data->>'document_id') text_pattern_ops)
WHERE (data->>'deleted_at') IS NULL;
- `text_pattern_ops` makes `LIKE 'prefix%'` a true range scan (no row-by-row filter).
- The partial `WHERE (data->>'deleted_at') IS NULL` clause shrinks the index by ~42 % (only 2 907 of 5 040 rows) and removes the deleted-at filter cost.
- Expected new SQL execution: <10 ms at LIMIT 50.
Deploy plan:
- `CREATE INDEX CONCURRENTLY` cannot run inside a transaction; it must be one statement outside a migration block. AgentData has no Alembic; the current pattern is bare SQL applied by SRE.
- Roll forward: apply on the running DB (CONCURRENTLY is non-blocking; it only takes a ShareUpdateExclusiveLock against schema changes and allows reads + writes).
- Rollback: `DROP INDEX CONCURRENTLY IF EXISTS idx_kb_documents_doc_id_pattern;` — also non-blocking. The existing `idx_kb_documents_doc_id` stays in place as a safety net.
1.5 MCP schema / API
The VPS already accepts `prefix`/`limit`/`offset` internally (server.py:2756, LIST_DOCUMENTS_DEFAULT_LIMIT=50, MAX=100), but the MCP tool schema sends `path`, which is mapped to `prefix` (server.py:2972-2976). Keep:
- Backward compat: accept `path` (the current MCP arg) AND `prefix`; use the first non-empty (see the sketch after this list).
- `limit`: optional, default 50, max 100. Already in the code path.
- `offset`: optional, default 0. Already in the code path.
- No `recursive`/`depth` in this phase — prefix already gives recursive subtree behaviour.
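A minimal sketch of the first-non-empty resolution (the helper name `_resolve_prefix` is hypothetical; the real mapping lives in the dispatcher around server.py:2972):

```python
def _resolve_prefix(args: dict) -> str:
    # Accept both the legacy MCP arg 'path' and the new 'prefix'; first non-empty wins.
    return (args.get("prefix") or args.get("path") or "").strip()
```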
Response shape stays compatible:
{
"items": [...],
"count": <items.length>,
"returned_count": <items.length>,
"limit": <int>,
"offset": <int>,
"next_offset": <int|null>,
"truncated": <bool>
}
If we drop the global total_count semantics, document the change in the schema description.
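A sketch of assembling that shape under option B (the `_page_response` name is hypothetical; note the `len(items) == limit` heuristic may emit a `next_offset` that turns out to be an empty page, which paginated clients tolerate):

```python
def _page_response(items: list[dict], limit: int, offset: int) -> dict:
    # If the page is full there may be more rows; otherwise we are at the end.
    next_offset = offset + limit if len(items) == limit else None
    return {
        "items": items,
        "count": len(items),
        "returned_count": len(items),
        "limit": limit,
        "offset": offset,
        "next_offset": next_offset,
        "truncated": next_offset is not None,
    }
```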
1.6 Expected performance
Conservative targets after D1 only (SQL pushdown, no new index):
| Scenario | Before | After D1 |
|---|---|---|
| Single request, no prefix, limit=50 | ~900 ms p50 | ~80 ms p50 |
| Single request, narrow prefix, limit=50 | ~900 ms p50 | ~90 ms p50 |
| Concurrency 5 p95 | 4 470 ms | <= 250 ms |
| Concurrency 10 p95 | (untested) | <= 400 ms |
After also adding the partial text_pattern_ops index:
| Scenario | After D1 + index |
|---|---|
| Single request, narrow prefix | < 30 ms p50 |
| Concurrency 5 p95 | < 150 ms |
| Concurrency 10 p95 | < 250 ms |
Production target: list_documents p95 < 300 ms at concurrency 10. Achievable with D1; index makes it comfortable.
1.7 Scaling estimate
| Rows alive | Current behaviour (full scan + Python) | After D1 |
|---|---|---|
| 3 k | ~900 ms / call | ~80 ms |
| 30 k | ~10 s / call (would already be DOS-able) | ~100 ms |
| 300 k | timeouts | ~200 ms |
1.8 Risks
- R1. `LIKE 'prefix%'` with a user-supplied `prefix` requires escaping `%`, `_`, `\` (psycopg2 parameter binding does not escape LIKE wildcards). Mitigation: `prefix.replace('\\','\\\\').replace('%','\\%').replace('_','\\_') + '%'`.
- R2. `text_pattern_ops` orders bytewise, not by locale. Output order will differ slightly from the current Python `.sort()` (which is codepoint-wise on str, so identical for ASCII ids). Verify: no test asserts a locale-specific order.
- R3. `stream_docs` callers — `reindex_kb_documents`, `cleanup_orphan_vectors`, `_compute_data_integrity`, `audit_sync` still need full iteration. Keep `stream_docs` (or rename to `iter_all_docs` and yield via a server-side cursor; see the sketch after this list) but don't use it for `list_documents`. Lower-priority follow-up: convert reindex/cleanup to a chunked cursor.
- R4. The `LIST_DOCUMENTS_MAX_RESPONSE_BYTES` guard already exists (server.py:2085) — keep it; with limit=100 and rich metadata, response bytes stay well under the guard.
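A sketch of the R3 follow-up: a named (server-side) psycopg2 cursor streams rows in chunks instead of one 27 MB `fetchall` (function name per R3's suggested rename; `_conn()` is the existing pg_store helper):

```python
def iter_all_docs(chunk_size: int = 500):
    with _conn() as conn:
        # A named cursor makes psycopg2 use a server-side cursor, so rows are
        # fetched from PostgreSQL in batches of `itersize` instead of all at once.
        with conn.cursor(name="kb_iter_all_docs") as cur:
            cur.itersize = chunk_size
            cur.execute("SELECT key, data FROM kb_documents")
            for key, data in cur:
                yield {"_key": key, **data}
```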
1.9 Tests after D1
- Unit: prefix=empty, prefix='knowledge/', prefix containing `%` and `_`, prefix='nonexistent/' (see the sketch after this list).
- Pagination boundary: offset=0/24/49/50/99/100; limit=1/50/100/101 (reject).
- Soft-delete behaviour: `deleted_at` set -> excluded from results.
- Ordering stability: same call twice returns the same order.
- Concurrency 5, 10, 20 — p95 < target.
- Backward compat: MCP `args.path` still works.
- EXPLAIN check (in CI): assert the plan uses `Index Scan`, not `Seq Scan` (optional, requires a test DB).
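A minimal pytest sketch of the wildcard unit test (the `client` fixture and `_tool_call` envelope helper are assumptions, not existing test utilities):

```python
import json

def _tool_call(name: str, arguments: dict) -> dict:
    # Hypothetical helper: wrap a tools/call request in a JSON-RPC envelope.
    return {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
            "params": {"name": name, "arguments": arguments}}

def test_prefix_wildcards_are_literal(client):
    resp = client.post("/mcp", json=_tool_call("list_documents", {"path": "knowledge/%"}))
    # Per S3.1, tool output is JSON text inside result.content[0].text.
    payload = json.loads(resp.json()["result"]["content"][0]["text"])
    # '%' must match literally after LIKE-escaping, not as a wildcard.
    assert all(i["document_id"].startswith("knowledge/%") for i in payload["items"])
```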
Part 2 — MCP transport / content negotiation design
2.1 Current state
App (agent_data/server.py):
@app.post("/mcp")(line 3074) — returns plain dict via FastAPI auto-conversion. ImplicitlyContent-Type: application/json.@app.post("/mcp-readonly|gpt|gpt-full")(line 3353-3380) ->_mcp_filtered_handler(line 3214) — returnsJSONResponse(content=...). Explicitlyapplication/json.- No
Acceptheader is read. - No
text/event-streampath. No streaming. - No keep-alive ping. No session id. No
Last-Event-ID.
Nginx (/gpt-mcp/{token}/mcp -> /mcp-gpt):
- `proxy_buffering off` — OK
- `chunked_transfer_encoding on` — OK
- `proxy_cache off` — OK
- `proxy_read_timeout 60s` — would need extending for true server-sent streams
- `Connection ''` — OK (prevents `keep-alive` from being forwarded; allows hop-by-hop chunked)
- No `X-Accel-Buffering: no` set explicitly — not strictly required when buffering is already off.
Conclusion: nginx is SSE-ready for the /gpt-mcp route. The blocker is the app.
2.2 Why FastMCP saw "Unexpected content type"
FastMCP (recent versions, and the ChatGPT/Claude.ai connectors) follow the MCP Streamable HTTP transport (spec rev 2025-03-26). The client sends:
POST /mcp HTTP/1.1
Accept: application/json, text/event-stream
Content-Type: application/json
It then accepts either:
- a single JSON response (`Content-Type: application/json`), OR
- an SSE stream (`Content-Type: text/event-stream`) carrying one or more `event: message` frames with the JSON-RPC response.
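For reference, the SSE variant of a response to the request above would be framed like this (illustrative body; the blank line after `data:` terminates the event):

```
HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache

event: message
data: {"jsonrpc":"2.0","id":1,"result":{"tools":[...]}}

```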
Older clients warned/failed when the server didn't echo SSE for tools/call results, especially long-running ones. The "Unexpected content type" message in our trace likely came from a FastMCP version that strictly expected text/event-stream for some operation, OR from the ChatGPT MCP App which (per Anthropic / OpenAI docs as of cutoff) requires the SSE variant for Streamable HTTP. Either way the server-side delta is the same: support content negotiation.
2.3 Design — Phase A vs B
| | A. JSON hardening | B. JSON + SSE content negotiation |
|---|---|---|
| Always send explicit `Content-Type: application/json; charset=utf-8` | yes | yes, when Accept lacks SSE |
| Honour `Accept: text/event-stream` | reject with 406 OR keep JSON | return SSE-framed JSON-RPC |
| Code surface | small (1–2 helpers) | small-medium (1 helper + streaming wrapper) |
| Affects nginx | no | yes, for routes other than /gpt-mcp (/api/ would need proxy_buffering off if used for SSE) |
| Breaks current FastMCP/curl clients | no | no (a single-event SSE response is valid JSON-RPC; the client picks via Accept) |
| Unblocks ChatGPT MCP App | probable but not certain | strongly likely |
| Future Streamable HTTP features (resumability, server pushes) | no | becomes feasible |
Recommendation: Implement B, but in two micro-steps:
- D2a: Add a tiny helper that inspects `Accept` and serializes the JSON-RPC response either as `JSONResponse(content=..., media_type="application/json")` or as `StreamingResponse(..., media_type="text/event-stream")` with a single SSE frame. Apply to `/mcp`, `/mcp-readonly`, `/mcp-gpt`, `/mcp-gpt-full`. Don't introduce session-id or resumability yet.
- D2b (later): If the ChatGPT App needs keep-alive comments or multiple frames, extend with `event: message` + `event: ping` every N seconds, plus `Mcp-Session-Id` support. Not required to unblock.
2.4 Helper sketch (descriptive — NOT to be implemented in this phase)
Conceptual response builder:
import json

from fastapi import Response
from fastapi.responses import JSONResponse, StreamingResponse

def _mcp_respond(payload: dict, accept: str) -> Response:
    accept = accept or ""
    # R5 rule (S2.6): stream only when the client asks for SSE and does NOT also accept JSON.
    if "text/event-stream" in accept and "application/json" not in accept:
        body = f"event: message\ndata: {json.dumps(payload, default=str)}\n\n"
        return StreamingResponse(
            iter([body]),
            media_type="text/event-stream",
            headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
        )
    return JSONResponse(content=payload, media_type="application/json; charset=utf-8")
- For requests that return immediately (the current pattern, all tools), one SSE frame is enough — clients accept it.
- `event: message` is the spec-recommended event name; some clients accept unnamed `data:` events too.
- No keep-alive ping is needed because every response closes the stream.
- For `notifications/*` (currently `204 No Content`), keep `204` — SSE only fires when there is a JSON-RPC reply.
2.5 Nginx changes
- The `/gpt-mcp/{token}/mcp` location is already SSE-ready. No change needed. Bump `proxy_read_timeout` from 60s to 120s if a future stream may sit idle.
- `location /api/` (the general AgentData proxy) — currently has `proxy_read_timeout 300s` but does not disable buffering. If anyone hits `/api/mcp` via this route and asks for SSE, nginx will buffer. Add `proxy_buffering off; proxy_cache off;` in `/api/` when D2a ships. Safe — these routes are typically short JSON responses, and `proxy_buffering off` doesn't hurt them.
2.6 Risks
- R5. Some HTTP clients send `Accept: */*`; the helper must treat `*/*` as "JSON is fine" (don't auto-stream). A fully correct rule would honour `q`-weighting across media types, but naive substring matching ("stream whenever `text/event-stream` is present") would regress clients that send `Accept: application/json, text/event-stream;q=0.1`. Tradeoff for D2a: stream only if `Accept` contains `text/event-stream` AND does NOT contain `application/json`. The ChatGPT MCP App sends only `text/event-stream` for the Streamable HTTP variant; the Claude.ai connector sends both. This rule keeps the current `curl -H 'Accept: application/json'` behaviour 100 % unchanged, and gives the SSE-strict client what it wants.
- R6. Order of operations: the handler currently builds the dict, then returns. With the SSE wrapper, errors must be delivered as in-stream JSON, not HTTP 500. The current code path already returns `{"isError": true, ...}` in the JSON-RPC envelope; preserve that.
- R7. Stream framing bytes: `event: message\ndata: <json>\n\n`. The terminating `\n\n` (blank line) is critical — it marks end-of-event. Test with `curl --no-buffer`.
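A self-contained check of the R5 rule (the `_wants_sse` predicate name is hypothetical; the same condition is inlined in `_mcp_respond` above):

```python
def _wants_sse(accept: str) -> bool:
    accept = accept or ""
    return "text/event-stream" in accept and "application/json" not in accept

assert _wants_sse("text/event-stream")                        # SSE-strict client -> stream
assert not _wants_sse("application/json, text/event-stream")  # Claude.ai mode -> JSON
assert not _wants_sse("*/*")                                  # generic client -> JSON
assert not _wants_sse("")                                     # no Accept header -> JSON
```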
2.7 Test matrix
| Test | Expectation |
|---|---|
| `curl -H 'Accept: application/json' POST /mcp` | `Content-Type: application/json; charset=utf-8`, single dict body |
| `curl -H 'Accept: application/json, text/event-stream' POST /mcp` (Claude.ai mode) | `application/json` (R5 rule) |
| `curl -H 'Accept: text/event-stream' POST /mcp` | `Content-Type: text/event-stream`, body `event: message\ndata: {...}\n\n`, connection closed |
| `curl --no-buffer -H 'Accept: text/event-stream' POST /gpt-mcp/.../mcp` | nginx passes through unbuffered |
| FastMCP client `initialize` + `tools/list` + `tools/call` | all succeed |
| ChatGPT MCP App handshake against `/gpt-mcp/.../mcp` (post-D2a deploy) | succeeds; tools list loads; one tool call returns |
Part 3 — move_document decision
3.1 Mystery resolved
agent_data/server.py:1624-1636:
@app.post("/documents/{doc_id:path}/move", response_model=DocumentResponse)
async def move_document(doc_id, payload, _=Depends(require_api_key)):
# DEPRECATED S170: move_document only changed parent_id metadata,
# did NOT rename PG key or Qdrant document_id. Use copy+delete instead.
raise _error(
501,
"NOT_IMPLEMENTED",
"move_document is deprecated. Use: (1) upload_document to new path, (2) delete old path.",
document_id=doc_id,
)
try:
... # <- original implementation, unreachable
The function is a kill switch: the raise is at the very top of the body, leaving the original logic as dead code below. Decision was made in S170 because move only updated parent_id metadata without renaming the PG key (which is the canonical _fs_key form) or the Qdrant document_id, leaving the document discoverable under two paths and search indices stale.
The MCP schema (`MCP_TOOLS`) and the dispatcher (`_dispatch_mcp_tool`, line 3057) still advertise and route the call — so any GPT/Claude that calls it gets `501 NOT_IMPLEMENTED` wrapped in the JSON-RPC `result.content[0].text` (the catch in `_mcp_filtered_handler`, line 3326, turns `HTTPException` into a structured error). That matches the Phase 2F-A observation.
3.2 Policy options
- A. Implement properly (rename the PG key + rename the Qdrant document_id atomically).
- B. Remove from `MCP_TOOLS` AND from `MCP_GPT_FULL_TOOLS` so the schema matches the runtime.
- C. Keep the schema, prefix the `description` with `[DEPRECATED]`.
- D. Keep in `/mcp-gpt-full` only; remove from the public `/mcp-gpt`.
/mcp-gpt already excludes move_document. /mcp-gpt-full is not nginx-exposed. So the live impact is small — the only path that lets a client call it is /mcp (internal) or /mcp-gpt-full (not exposed). Still: the schema-runtime mismatch is a correctness defect and confuses every LLM that reads the tools list.
3.3 Recommendation
B. Remove move_document entirely from:
- the `MCP_TOOLS` list (server.py:2722) — schema
- the `MCP_GPT_FULL_TOOLS` frozenset (server.py:2708) — allowlist
- the `_dispatch_mcp_tool` branch (server.py:3061) — dispatcher
Keep `@app.post("/documents/{doc_id:path}/move")` returning 501 — backward compat for any internal caller.
Rationale:
- The runtime decision (S170) is months old and the right call: copy+delete is a clean replacement.
- Leaving a tool the LLM can attempt but will always fail wastes a tool slot in the LLM's instruction budget and pollutes audit logs.
- Re-implementing properly is non-trivial (atomic rename across PG + Qdrant + invalidate caches + handle deep-prefix children) and is not required for production.
If business later needs first-class rename: re-introduce as rename_document with a proper transactional implementation.
3.4 Tests after B
- `tools/list` on `/mcp-gpt-full` does NOT include `move_document`.
- `tools/call` with `name=move_document` on any profile returns `-32601 Tool not allowed` (or `Unknown tool`).
- Direct `POST /documents/{x}/move` still returns 501 (regression: don't accidentally re-enable the dead code).
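A pytest sketch of the first check (the `client` fixture and `TEST_KEY` are assumptions):

```python
def test_move_document_absent_from_tools_list(client):
    resp = client.post(
        "/mcp-gpt-full",
        json={"jsonrpc": "2.0", "id": 1, "method": "tools/list"},
        headers={"X-API-Key": TEST_KEY},
    )
    names = [tool["name"] for tool in resp.json()["result"]["tools"]]
    assert "move_document" not in names
```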
Part 4 — Write latency design
4.1 Current path (per upload/update/patch)
For example update_document (server.py:1490-1622):
1. `_ensure_pg`
2. `pg_store.get_doc(...)` — 1 SELECT by PK (~5–15 ms)
3. Compute new content/metadata
4. `pg_store.update_doc(..., expected_revision=...)` — 1 UPDATE + 6 triggers fire (~30–80 ms total):
   - `trg_kb_updated_at` (BEFORE)
   - `trg_kb_truncation_guard` (BEFORE)
   - `trg_kb_audit` (AFTER)
   - `trg_kb_snapshot_update` (BEFORE)
   - `trg_context_pack_law_enact` (AFTER, conditional)
   - `trg_kb_vector_sync` (AFTER) -> `pg_notify('kb_vector_sync', ...)`
5. Inline sync: `_delete_vector_entry(doc_id)` -> `vector_store.delete_document` (Qdrant call, ~50–150 ms)
6. Inline sync: `_sync_vector_entry(...)` -> `vector_store.upsert_document`:
   - `_split_text` (CPU, small)
   - For each chunk: `self._openai.embeddings.create(...)` (sequential, ~150–400 ms/chunk over HTTPS)
   - `self._qdrant_upsert(points)` with `wait=True` (~50–150 ms)
   - `pg_store.update_doc(..., {"vector_status": "ready"})` — second UPDATE, fires the triggers again (!)
7. `emit_fire_and_forget(DOCUMENT_UPDATED, ...)` (~microseconds)
Plus in parallel (background thread):
- `pg_vector_listener.py` receives the NOTIFY from step 4
- calls `vector_store.upsert_document` again for the same doc
Net result: each write does the Qdrant upsert twice (once inline, once async). The inline path is the one that contributes to user-visible p95.
4.2 What is mandatory sync vs async
| Step | Must be sync? | Why |
|---|---|---|
| PG read (get) | yes | required for revision check / not-found semantics |
| PG write (update_doc) | yes | confirms persistence, returns 200 |
| Revision bump | yes | atomic with PG write |
| Trigger fires NOTIFY | yes (PG-side) | already in trigger |
| Delete old Qdrant vectors | no | the listener can do this; user only cares once data lands |
| OpenAI embed | no | costliest step, async-friendly |
| Qdrant upsert | no | ditto |
| `vector_status` update | no | listener updates after embed |
| event_bus emit | already async (fire-and-forget) | — |
4.3 Options
Option 1 — keep sync, batch + parallelize embeds
Stay sync but:
- Use OpenAI batch embeddings: one HTTPS call with N chunk inputs returns N vectors. Cuts a 3-chunk doc from 3 × 250 ms to ~300 ms (see the sketch after this list).
- Use `asyncio.to_thread` or `ThreadPoolExecutor` to run delete + upsert in parallel.
- Drop `wait=True` on the Qdrant upsert; the listener already deals with eventual consistency. (Note: Qdrant `wait=False` returns immediately; the server still applies the upsert.)
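A minimal sketch of the batch-embed call (assumes the OpenAI Python client v1.x already used in `vector_store.py`; `EMBED_MODEL` and `chunks` stand in for the module's existing values):

```python
# One HTTPS round-trip embeds every chunk instead of one call per chunk.
resp = self._openai.embeddings.create(model=EMBED_MODEL, input=chunks)
vectors = [item.embedding for item in resp.data]  # order matches `chunks`
```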
Expected p95: ~500–700 ms (down from 1.3 s).
Risk: still on the critical path; OpenAI rate-limit at concurrency 10+ will push tail latency. Doesn't address the double work with the listener.
Option 2 — async vector via existing listener
Remove the inline _delete_vector_entry + _sync_vector_entry from create_document, update_document (content branch), and patch_document. The PG-NOTIFY trigger already publishes; the listener already consumes.
Set vector_status: "pending" in the same UPDATE that lands the content (it's already done for create — server.py:1456).
Expected p95: ~80–150 ms for upload/update/patch (dominated by PG round-trips + triggers).
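Illustrative shape of the change inside `update_document` (names are from this report; the exact `update_doc` signature is assumed from S4.1, and the surrounding validation/revision logic in server.py is elided):

```python
# D4 sketch: the write handler keeps only the PG work and returns immediately.
data["vector_status"] = "pending"   # same UPDATE that lands the content
doc = pg_store.update_doc(KB_COLLECTION, doc_id, data, expected_revision=expected)
# (removed) _delete_vector_entry(doc_id)     # trg_kb_vector_sync -> listener does this
# (removed) _sync_vector_entry(doc_id, data)
emit_fire_and_forget(DOCUMENT_UPDATED, ...)  # unchanged (already fire-and-forget)
return doc
```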
Risks:
- Eventual consistency: `search_knowledge` may miss the new content for the lag between PG commit and listener processing (typically <2 s on this hardware).
- Listener failure mode: if the listener thread dies, vectors fall behind silently. Today's code already exposes `/kb/audit-sync` and `/kb/reindex-missing` endpoints to recover; surface this in monitoring/health.
- Documentation contract: clients (GPT) must be told that writes are committed synchronously while semantic indexing is eventually consistent. `get_document` always reflects current content.
Option 3 — hybrid by body size
Threshold env (e.g. MCP_SYNC_EMBED_MAX_CHARS=2000):
- Small docs (<=2 000 chars, ~1 chunk) -> sync embed (~250 ms). The user sees a searchable doc immediately.
- Larger docs -> fall through to the listener (async).
Expected p95: ~250 ms small, ~120 ms large.
Risk: behaviour split; harder to reason about.
4.4 Recommendation
Option 2 for production, with these guardrails:
- Remove the inline `_delete_vector_entry` + `_sync_vector_entry` calls in `create_document`, `update_document` (content branch), `patch_document`. Keep the `vector_status: "pending"` write.
- Verify the PG-NOTIFY trigger fires for all three paths (UPDATE only when `data` IS DISTINCT FROM the old row — already in code; confirmed via `pg_get_functiondef`).
- The listener has resilience: it reconnects on PG drop (poll loop in `pg_vector_listener.py`). Add a Prometheus/log gauge `vector_sync_lag_seconds = NOW() - min(updated_at WHERE vector_status='pending')` (age of the oldest pending row, i.e. worst-case lag) and alert if > 30 s. See the sketch after this list.
- The existing `/kb/reindex-missing` (server.py:2317 area) becomes the recovery handle.
- Add to the MCP `update_document`/`patch_document` descriptions: "Document content is persisted synchronously; semantic indexing is eventual-consistent (typically <2 s). Use `get_document` to read back content."
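A minimal sketch of that gauge (the query shape is assumed from the kb_documents schema in S0.5; the psycopg2 connection comes from the existing pg_store helpers):

```python
VECTOR_SYNC_LAG_SQL = """
    SELECT COALESCE(EXTRACT(EPOCH FROM NOW() - MIN(updated_at)), 0)
    FROM kb_documents
    WHERE data->>'vector_status' = 'pending'
      AND data->>'deleted_at' IS NULL
"""

def vector_sync_lag_seconds(conn) -> float:
    # 0.0 when nothing is pending; otherwise the age of the oldest pending row.
    with conn.cursor() as cur:
        cur.execute(VECTOR_SYNC_LAG_SQL)
        return float(cur.fetchone()[0])
```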
4.5 What about _delete_vector_entry on delete and upsert path?
The PG trigger emits `op=DELETE` for hard-delete and on the soft-delete transition (`deleted_at` NULL -> NOT NULL). The listener calls `vector_store.delete_document`. So the `delete_document` handler also doesn't need the inline call — the same simplification applies.
4.6 Risk register
- R8. Soft-delete-then-immediate-search race: the user soft-deletes, calls `search_knowledge` within 1 s, and gets the deleted doc back. Acceptable; document it.
- R9. OpenAI/Qdrant outage during listener catch-up: docs stuck on `vector_status=pending` indefinitely. Mitigate with the lag alert (S4.4 guardrails) and a periodic auto-reindex cron (`/kb/reindex-missing`).
- R10. Order of operations under concurrent updates: the listener may apply embeds out of order. The listener fetches the current `data` on each notification (`pg_store.get_doc(...)` in `pg_vector_listener._handle_notification`), so the final embedded state always matches the final committed state. Already correct.
4.7 Tests after D4
- Upload doc -> responds <200 ms -> wait 5 s -> `search_knowledge` finds it.
- Update doc -> responds <200 ms -> `get_document` returns the new content immediately.
- Patch doc -> responds <200 ms; verify `vector_status` transitions pending -> ready within 5 s.
- Soft-delete -> responds <200 ms; verify the Qdrant entry is removed within 5 s.
- Concurrency 5 across upload/update/patch -> all p95 <300 ms.
- Kill the listener -> writes still succeed; documents go `pending`; restarting the listener catches up.
Part 5 — Endpoint policy for GPT
5.1 Threat surface per tool
| Tool | Read/Write | Risk if mis-issued by LLM | Reversible? |
|---|---|---|---|
| search_knowledge | R | none | n/a |
| list_documents | R | none | n/a |
| get_document | R | none | n/a |
| get_document_for_rewrite | R | none | n/a |
| batch_read | R | none | n/a |
| upload_document | W | wrong path, duplicate doc | yes (delete) |
| update_document | W | overwrites a doc | yes (snapshot trigger captures prior) |
| patch_document | W | targeted edit | yes (snapshot) |
| delete_document | W (soft) | drops a doc from list/search | yes (UPDATE deleted_at to NULL) |
| move_document | W | currently 501 — recommend remove (Part 3) | n/a |
| ingest_document | W | SSRF / external fetch from LLM-controlled URL | partial (delete after) |
5.2 Recommendation
Keep current production policy: /gpt-mcp/{token}/mcp -> /mcp-gpt (8 tools, no delete/move/ingest) is the only public surface.
- Do NOT expose `/mcp-gpt-full` publicly via nginx until each high-risk tool has:
  - an SSRF allow-list for `ingest_document` (only `gs://` + a known set of `https://` hosts),
  - rate-limiting and audit-log shipping for `delete_document`,
  - a re-implementation or removal of `move_document` per Part 3.
- Admin/manual mode: keep `/mcp-gpt-full` as an internal-only route (no public nginx exposure; only reachable from inside the Docker network, or via a separate, more restricted nginx route gated by both API key and source IP).
- Audit log: ensure `connector_call wrapper=delete_document ...` is captured in the nginx access log (currently `access_log off` on `/gpt-mcp/...`). Recommend re-enabling a structured access log on the `/gpt-mcp` location with a low-cardinality format — knowing which tool was invoked is essential for incident response.
- Rotate the nginx-side `gpt-mcp` token after this design phase (separate ticket; not now). It was observed in the clear in the conf file and, per the comment there, in a previous deploy log.
5.3 Rollback
include /etc/nginx/secrets/gpt-mcp-route.conf; is one line — comment it out to remove the public route entirely. Reload nginx. No upstream change required.
Part 6 — Patch sequence
Each phase is one PR-sized change. No phase requires nginx changes unless explicitly listed.
2F-D1 — list_documents SQL pushdown + index
| Goal | Push prefix, sort, limit, offset to PostgreSQL. Add partial pattern index. |
| Files | agent_data/pg_store.py (add list_kb_docs(prefix, limit, offset)), agent_data/server.py (rewrite _list_kb_documents_page). DB migration SQL applied out-of-band (CREATE INDEX CONCURRENTLY). |
| MCP schema change | None (current path/limit/offset preserved; behaviour identical apart from the total_count semantics — keep or drop per S1.3). |
| Nginx change | None. |
| Risk | LIKE escaping (R1); ordering change (R2); stream_docs callers untouched (R3). |
| Rollback | Revert PR + DROP INDEX CONCURRENTLY idx_kb_documents_doc_id_pattern;. |
| Tests | S1.9. |
| Pass criteria | Functional MCP unchanged; list_documents single p50 < 100 ms, concurrency 5 p95 < 300 ms; no Seq Scan in EXPLAIN. |
| Expected impact | 4.47 s -> <300 ms p95 @ c5. |
2F-D2a — Transport content negotiation (JSON + SSE)
| Goal | Honour Accept: text/event-stream; keep JSON for everyone else. |
| Files | agent_data/server.py (one helper _mcp_respond, applied at the 4 return sites in /mcp, _mcp_filtered_handler). |
| MCP schema change | None. |
| Nginx change | Optionally add proxy_buffering off; proxy_cache off; to location /api/ for symmetry; not strictly required if ChatGPT only hits /gpt-mcp/.... |
| Risk | R5 Accept-rule heuristic; R6 in-stream error envelopes; R7 framing bytes. |
| Rollback | Revert PR. Nginx untouched. |
| Tests | S2.7 matrix. |
| Pass criteria | All current clients (curl, FastMCP, Codex, Claude) unchanged. SSE-strict client receives valid text/event-stream framing for initialize, tools/list, tools/call. |
| Expected impact | Unblocks ChatGPT MCP App handshake (high confidence). |
2F-D3 — move_document schema cleanup
| Goal | Remove move_document from MCP tools list + dispatcher; keep HTTP endpoint returning 501. |
| Files | agent_data/server.py (delete branch in _dispatch_mcp_tool; remove entry from MCP_TOOLS; remove from MCP_GPT_FULL_TOOLS). |
| MCP schema change | One tool fewer (MCP_TOOLS 11 -> 10; MCP_GPT_FULL_TOOLS 11 -> 10). connectorSchemaVersion will need a bump; _mcp_schema_hash() auto-updates. |
| Nginx change | None. |
| Risk | Any client still calling move_document over MCP gets Unknown tool instead of NOT_IMPLEMENTED. Should be no impact since GPT/Claude read the dynamic tools list. |
| Rollback | Revert PR. |
| Tests | S3.4. |
| Pass criteria | tools/list excludes move_document; HTTP /documents/{x}/move still 501. |
| Expected impact | Schema-runtime correctness; LLM no longer attempts deprecated tool. |
2F-D4 — Async write path via existing listener
| Goal | Remove inline _delete_vector_entry + _sync_vector_entry from create/update/patch/delete handlers; rely on PG-NOTIFY listener. |
| Files | agent_data/server.py (create_document, update_document, patch_document, delete_document — strip the inline vector calls; keep vector_status: "pending" write). Add a vector_sync_lag_seconds health metric. |
| MCP schema change | Description update on write tools to declare eventual-consistent semantic indexing. |
| Nginx change | None. |
| Risk | R8/R9/R10 above. |
| Rollback | Revert PR. Listener keeps running; documents reach consistency either way. |
| Tests | S4.7. |
| Pass criteria | Upload/update/patch p95 < 300 ms at concurrency 10; vector_sync_lag_seconds < 5 s steady-state; reindex-missing catches all stragglers within 1 cron cycle. |
| Expected impact | 0.8–1.3 s -> <250 ms p95. |
2F-D5 (optional, later) — Streamable HTTP resumability
Session id (Mcp-Session-Id header), Last-Event-ID, keep-alive ping. Only needed if/when long-running tool calls require server-pushed progress events. Out of scope to unblock ChatGPT.
Part 7 — Production pass criteria
7.1 Functional
| Tool | Pass |
|---|---|
| search_knowledge | returns >=1 hit on 'operating rules SSOT', latency < 2 s |
| list_documents (no prefix) | returns first 50; next_offset present |
| list_documents (prefix='knowledge/test/') | returns expected subset |
| get_document | found = full payload; missing = {error: "not found"} |
| get_document_for_rewrite | full body returned |
| batch_read (3 paths) | returns 3 items, truncated by default |
| upload_document | new doc visible in get_document immediately; in search <5 s |
| update_document | revision +1; new content readable immediately |
| patch_document | one-replace succeeds; ambiguous returns 409 |
| delete_document | soft-delete; excluded from list_documents; vector gone <5 s |
| move_document | (after D3) not present in tools/list |
| ingest_document | (only via /mcp-gpt-full admin) honours allow-list |
7.2 Latency targets (p95)
| Path | After D1+D2+D4 | After D1+D2+D4 with index |
|---|---|---|
| initialize | < 50 ms | < 50 ms |
| tools/list | < 50 ms | < 50 ms |
| get_document | < 200 ms | < 200 ms |
| batch_read (3 paths) | < 300 ms | < 300 ms |
| list_documents single | < 100 ms | < 50 ms |
| list_documents concurrency 5 | < 250 ms | < 150 ms |
| list_documents concurrency 10 | < 400 ms | < 250 ms |
| upload_document | < 250 ms | < 250 ms |
| update_document | < 250 ms | < 250 ms |
| patch_document | < 250 ms | < 250 ms |
| search_knowledge | < 2 s (OpenAI bound) | unchanged |
7.3 Transport pass
- `Accept: application/json` -> JSON, byte-identical to today.
- `Accept: application/json, text/event-stream` -> JSON (heuristic R5).
- `Accept: text/event-stream` -> SSE with one `event: message` frame, `data: <json>\n\n`.
- FastMCP client `initialize -> tools/list -> tools/call` E2E green.
- ChatGPT MCP App handshake + tools/list + one `list_documents` call green (live test after D2a deploys).
7.4 Stress
- Concurrency 5 and 10 across mixed tool calls (1 minute, 100 rps target).
- `error_rate == 0` for all valid calls.
- No 5xx, no read timeouts, no Qdrant `wait=True` timeouts.
- Container CPU < 50 % avg, memory stable, no thread leak.
7.5 Safety
- No secret rendered in logs (current code uses `<redacted>` for the env dump; verify the same in any added log statements).
- `/mcp-gpt-full` not nginx-exposed.
- `delete_document` audited at the nginx layer (re-enable access_log on `/gpt-mcp/...`).
- Rotate `/gpt-mcp/{token}` after D1+D2 ship (separate task).
- Rollback documented per patch (Part 6).
Part 8 — Recommendation summary
Implement D1 + D2a before re-connecting ChatGPT. Both are surgical, small-blast-radius, and address the two top blockers (list latency, transport).
Then implement D3. Trivial cleanup; gets the schema honest.
D4 is high-value but bigger surface; prefer to land it as a separate session after D1+D2a are verified in production. Existing listener already covers the async path — D4 just removes the redundant sync path.
Do not expose /mcp-gpt-full publicly. Current 8-tool policy is correct for autonomous GPT.
Appendix A — Observed facts (read-only inspection)
- `kb_documents` row counts and sizes: total=5040, alive=2907, table 27 MB, indexes 1.7 MB combined.
- EXPLAIN of the current `stream_docs` query: Seq Scan, 4.0 ms execution; the cost is in transfer + Python materialization.
- EXPLAIN of the proposed pushed-down query: 71–97 ms execution; would drop to <10 ms with the partial `text_pattern_ops` index.
- PG trigger `fn_kb_notify_vector_sync` emits `pg_notify('kb_vector_sync', json)` on INSERT/UPDATE/DELETE, with skip rules for `operations/tasks/comments/`, `registries/`, and empty bodies.
- PG listener `pg_vector_listener.py` is running (`PG->Qdrant vector sync listener started` in container logs).
- All `/mcp*` endpoints return JSON only; the nginx `/gpt-mcp/{token}/mcp` route already has SSE-friendly proxy settings.
- `move_document` raises `501 NOT_IMPLEMENTED` unconditionally at function entry (server.py:1631-1635).
Appendix B — Open questions (for human approval)
- Drop `total_count` from the `list_documents` response, or keep it with an extra `COUNT(*)` query? (Recommend drop.)
- Bump nginx `proxy_read_timeout` on `/gpt-mcp/{token}` from 60 s to 120 s for SSE futures? (Optional.)
- Add a structured access log on the `/gpt-mcp/{token}` location for audit? (Recommend yes.)
- Schedule the gpt-mcp token rotation as a follow-up task? (Recommend yes, separate ticket.)
- Add the `vector_sync_lag_seconds` health metric in D4? (Recommend yes.)
End of report. No code modified. No commit. No restart. No nginx reload.