KB-3281

Phase 2F-C — MCP Production Patch Design Report (2026-05-13)

Revision 1
Tags: phase-2f, mcp, design, production, agent-data, list_documents, transports, sse, write-latency


Date: 2026-05-13
Status: Investigation + design only. No code changes, no rebuild, no restart, no commit.
Author: Claude (Phase 2F-C)
Scope: Production-grade design for the AgentData MCP server prior to re-connecting ChatGPT.


Executive summary

  1. list_documents p95 of 4.47 s is a Python-side bottleneck, not SQL. PostgreSQL returns the full table in ~4 ms; the application then transfers 27 MB of JSONB, materializes 5 040 dicts per request, and filters/sorts/paginates in Python. Pagination args exist in the API but are applied only after full materialization.
  2. Transport blocker for ChatGPT/Claude.ai is application-level, not nginx. All /mcp* handlers return application/json. The /gpt-mcp nginx route already has proxy_buffering off + chunked_transfer_encoding on — it is ready for SSE. The fix is in agent_data/server.py.
  3. move_document is intentionally 501 NOT_IMPLEMENTED (S170 deprecation) but its schema is still advertised in MCP_TOOLS. This is a schema/runtime mismatch, not a missing implementation.
  4. Write p95 of 0.8–1.3 s comes mostly from OpenAI embedding + Qdrant wait=True, compounded by double work. Handlers call _sync_vector_entry synchronously while a PG-NOTIFY listener (pg_vector_listener.py) does the same work asynchronously. Removing the inline sync path is a large, low-risk win.
  5. Endpoint policy is already partially correct — only /gpt-mcp/{token}/mcp → /mcp-gpt (8 tools, no delete/move/ingest) is exposed publicly via nginx. /mcp-gpt-full exists in the app but is not nginx-routed. Recommend keeping it that way.
  6. Two patches (2F-D1 list_documents + 2F-D2 transport content-negotiation) are sufficient to unblock ChatGPT testing; everything else can follow.
  7. One nginx secret was observed in /etc/nginx/secrets/gpt-mcp-route.conf. It is NOT printed in this report. Recommend rotating after this design phase completes (separate task — do not rotate now per instruction).

Phase 0 — Source map / fact base

0.1 Source paths (VPS = SSOT)

Concern VPS path
MCP handlers / dispatch / tool schemas /opt/incomex/docker/agent-data-repo/agent_data/server.py
PostgreSQL store /opt/incomex/docker/agent-data-repo/agent_data/pg_store.py
Qdrant + OpenAI embedding /opt/incomex/docker/agent-data-repo/agent_data/vector_store.py
Async vector sync listener /opt/incomex/docker/agent-data-repo/agent_data/pg_vector_listener.py
Nginx site config container incomex-nginx:/etc/nginx/conf.d/default.conf
Nginx secret-token MCP route container incomex-nginx:/etc/nginx/secrets/gpt-mcp-route.conf

0.2 Git state (VPS repo)

  • Branch: main
  • HEAD: 8d1442f phase2e: harden MCP profiles and certify get_document behavior
  • Working tree: clean (git status --short returns nothing).
  • Repo is 5 ahead, 112 behind origin/main (long-standing — per project memory, edit host directly, never git pull). Local clone at /Users/nmhuyen/agent-data-test is outdated relative to VPS — do all design against VPS source.

0.3 Runtime state

incomex-agent-data   Up 33 minutes (healthy)   8080/tcp
postgres             Up 3 weeks (healthy)      127.0.0.1:5432->5432/tcp
incomex-qdrant       Up 8 weeks (healthy)      6333-6334/tcp
incomex-nginx        Up 13 days                :80, :443

Application log shows PG->Qdrant vector sync listener started — the async listener thread is already running in production.

0.4 MCP profiles (Python-level routes, not nginx)

agent_data/server.py:2689-2718:

Profile Route in app Tools
internal POST /mcp All 11 (MCP_TOOLS)
readonly POST /mcp-readonly 5 read tools (MCP_READONLY_TOOLS)
gpt POST /mcp-gpt 8 = read + upload/update/patch (MCP_GPT_TOOLS)
gpt-full POST /mcp-gpt-full All 11 (MCP_GPT_FULL_TOOLS)

Nginx public exposure (/etc/nginx/secrets/gpt-mcp-route.conf):

  • /gpt-mcp/{secret}/mcp → upstream /mcp-gpt (8 tools).
  • Sets X-API-Key header internally.
  • Has proxy_buffering off, chunked_transfer_encoding on, proxy_read_timeout 60s.
  • /mcp-gpt-full is NOT exposed by nginx.

0.5 PostgreSQL kb_documents schema (VPS)

key        TEXT PRIMARY KEY     -- filesystem-style ("knowledge__current-state__..."); _fs_key() translates '/' <-> '__'
data       JSONB                -- {document_id, parent_id, content:{body}, metadata, revision, deleted_at, vector_status, ...}
updated_at TIMESTAMPTZ
Indexes:
  kb_documents_pkey                btree (key)
  idx_kb_documents_doc_id          btree ((data->>'document_id'))
  idx_kb_documents_deleted         btree ((data->>'deleted_at'))
Triggers (6):
  trg_kb_audit, trg_kb_snapshot_update, trg_kb_snapshot_delete,
  trg_kb_truncation_guard, trg_kb_updated_at, trg_kb_vector_sync (-> pg_notify),
  trg_context_pack_law_enact
Row stats:
  total = 5040, alive = 2907, table size = 27 MB
  avg JSONB row ~6 KB, max ~94 KB
  top prefixes (alive): knowledge=2570, registries=223, operations=105

Part 1 — list_documents root cause + production design

1.1 Current code path

agent_data/server.py:2046 _list_kb_documents_page:

def _list_kb_documents_page(*, prefix="", limit=LIST_DOCUMENTS_DEFAULT_LIMIT, offset=0):
    limit, offset = _normalize_list_pagination(limit, offset)
    _ensure_pg()
    docs = pg_store.stream_docs(KB_COLLECTION)          # <- FULL TABLE LOAD
    matches = []
    for data in docs:                                    # <- Python loop over 5040 rows
        if data.get("deleted_at") is not None: continue
        doc_id = data.get("document_id", data.get("_key", ""))
        if prefix and not doc_id.startswith(prefix): continue
        matches.append({...})
    matches.sort(key=lambda x: x["document_id"])         # <- Python sort
    page = matches[offset : offset + limit]              # <- Python slice

agent_data/pg_store.py:170-176 stream_docs:

def stream_docs(collection):
    with _conn() as conn, conn.cursor(cursor_factory=RealDictCursor) as cur:
        cur.execute(f"SELECT key, data FROM {tbl}")     # <- no WHERE, no LIMIT, no ORDER
        return [{"_key": row["key"], **dict(row["data"])} for row in cur.fetchall()]

No body file reads, no Qdrant calls, no N+1. Just an unbounded full-table scan returning every JSONB row, fully materialized into Python dicts on every call.

1.2 Why p95 = 4.47 s while EXPLAIN says 4 ms

PG-side SELECT key, data FROM kb_documents:

Seq Scan on kb_documents  (cost=0.00..445.14 rows=5014 width=529)
                          (actual time=0.044..3.626 rows=5040 loops=1)
Execution Time: 4.015 ms

PostgreSQL itself is fast. The cost is in the wire + Python layer:

  • 27 MB JSONB over the unix socket (psycopg2 -> cursor.fetchall).
  • 5 040 calls to dict(row["data"]) + 5 040 {**dict, "_key": ...} allocations.
  • Then iterate again, filter, sort. Then slice.
  • Multiplied by concurrency 5 -> GIL contention, gunicorn/uvicorn worker time spikes.
  • No caching, no shared snapshot.

1.3 Proposed SQL (D1)

SELECT
  data->>'document_id'                            AS document_id,
  data->>'parent_id'                              AS parent_id,
  COALESCE(data->'metadata'->>'title','')         AS title,
  COALESCE(data->'metadata'->'tags','[]'::jsonb)  AS tags,
  COALESCE((data->>'revision')::int, 0)           AS revision
FROM kb_documents
WHERE (data->>'deleted_at') IS NULL
  AND (%(prefix)s = '' OR (data->>'document_id') LIKE %(prefix_like)s)
ORDER BY (data->>'document_id')
LIMIT %(limit)s OFFSET %(offset)s;

prefix_like is prefix + '%', built in Python after LIKE-escaping the prefix to neutralize %, _ and \ (sketch below).
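
A minimal sketch of that escaping (the helper name _like_escape is illustrative, not existing code):

def _like_escape(prefix: str) -> str:
    # Escape LIKE wildcards (R1) so a user-supplied prefix matches literally.
    return (
        prefix.replace("\\", "\\\\")
              .replace("%", "\\%")
              .replace("_", "\\_")
    )

params = {
    "prefix": prefix,
    "prefix_like": _like_escape(prefix) + "%",
    "limit": limit,
    "offset": offset,
}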

Also need total count (current API returns count = total matches). Two options:

  • A. Second SELECT COUNT(*) FROM kb_documents WHERE ... AND (data->>'document_id') LIKE ... — extra query (~5–20 ms).
  • B. Drop total_count; return count = len(items) + next_offset. Recommended — matches REST norms, removes a query, and total_count is rarely needed by a paginated client.

If total_count must stay, prefer A.
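
Putting 1.1–1.3 together, a hedged sketch of the new pg_store helper proposed in Part 6 (list_kb_docs). LIST_SQL is the query above without the trailing semicolon; the limit+1 fetch implements option B's next_offset without a COUNT(*); _conn, RealDictCursor and _like_escape are the existing module helpers plus the escape sketch above:

LIST_SQL = """
SELECT
  data->>'document_id'                            AS document_id,
  data->>'parent_id'                              AS parent_id,
  COALESCE(data->'metadata'->>'title','')         AS title,
  COALESCE(data->'metadata'->'tags','[]'::jsonb)  AS tags,
  COALESCE((data->>'revision')::int, 0)           AS revision
FROM kb_documents
WHERE (data->>'deleted_at') IS NULL
  AND (%(prefix)s = '' OR (data->>'document_id') LIKE %(prefix_like)s)
ORDER BY (data->>'document_id')
LIMIT %(limit)s OFFSET %(offset)s
"""

def list_kb_docs(prefix: str = "", limit: int = 50, offset: int = 0):
    # Fetch limit+1 rows so next_offset can be derived without a COUNT(*).
    params = {
        "prefix": prefix,
        "prefix_like": _like_escape(prefix) + "%",
        "limit": limit + 1,
        "offset": offset,
    }
    with _conn() as conn, conn.cursor(cursor_factory=RealDictCursor) as cur:
        cur.execute(LIST_SQL, params)
        rows = cur.fetchall()
    items = [dict(r) for r in rows[:limit]]
    next_offset = offset + limit if len(rows) > limit else None
    return items, next_offset

The caller then sets count = len(items) and truncated = next_offset is not None, matching the response shape in S1.5.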

1.4 Index design

Current idx_kb_documents_doc_id uses default collation. EXPLAIN with the new SQL:

Limit  (actual time=81.6..87.4 rows=50 loops=1)
  Index Scan using idx_kb_documents_doc_id on kb_documents
    Filter: (((data->>'deleted_at') IS NULL) AND ((data->>'document_id') ~~ 'knowledge/%'))
    Rows Removed by Filter: 1611
Execution Time: 87.5 ms

Index Scan walks 1 661 entries to find 50 — LIKE 'knowledge/%' is treated as a filter, not a range, under UTF-8 collation (confirmed: 'knowledge/x' < 'knowledge0' returns FALSE under default collation; TRUE under COLLATE "C").

Recommended new index (partial + text_pattern_ops):

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_kb_documents_doc_id_pattern
  ON kb_documents ((data->>'document_id') text_pattern_ops)
  WHERE (data->>'deleted_at') IS NULL;

  • text_pattern_ops makes LIKE 'prefix%' a true range scan (no row-by-row filter).
  • Partial WHERE deleted_at IS NULL shrinks the index by ~42 % (only 2 907 of 5 040 rows) and removes the deleted-at filter cost.
  • Expected new SQL execution: <10 ms at LIMIT 50.

Deploy plan:

  • CREATE INDEX CONCURRENTLY — cannot run inside a transaction, must be one statement outside a migration block. AgentData has no Alembic; current pattern is bare SQL applied by SRE.
  • Roll forward: apply on the running DB (CONCURRENTLY is non-blocking; it only takes a ShareUpdateExclusiveLock against schema changes and allows reads + writes). Verify with the EXPLAIN sketch after this list.
  • Rollback: DROP INDEX CONCURRENTLY IF EXISTS idx_kb_documents_doc_id_pattern; — also non-blocking. Existing idx_kb_documents_doc_id stays in place as a safety net.
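
Roll-forward verification sketch (run in psql; the exact plan shape is planner-dependent, but the new index should appear and the LIKE should collapse to a range condition):

EXPLAIN (ANALYZE, BUFFERS)
SELECT data->>'document_id'
FROM kb_documents
WHERE (data->>'deleted_at') IS NULL
  AND (data->>'document_id') LIKE 'knowledge/%'
ORDER BY (data->>'document_id')
LIMIT 50;
-- Expect: Index Scan (or Bitmap Index Scan) using idx_kb_documents_doc_id_pattern,
-- with the prefix rewritten to >= / < bounds rather than a row-by-row filter.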

1.5 MCP schema / API

VPS currently advertises prefix/limit/offset already (server.py:2756, LIST_DOCUMENTS_DEFAULT_LIMIT=50, MAX=100). But the MCP tool schema sends path, which is mapped to prefix (server.py:2972-2976). Keep:

  • Backward compat: accept path (current MCP arg) AND prefix; use the first non-empty (sketch after this list).
  • Add limit: optional, default 50, max 100. Already in code path.
  • Add offset: optional, default 0. Already in code path.
  • No recursive/depth in this phase — prefix already gives recursive subtree behaviour.
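
A sketch of that normalization at the dispatch site, reusing the existing _normalize_list_pagination (the args dict name is illustrative):

# Hypothetical normalization inside the list_documents dispatch branch:
prefix = (args.get("prefix") or args.get("path") or "").strip()  # first non-empty wins
limit, offset = _normalize_list_pagination(args.get("limit"), args.get("offset"))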

Response shape stays compatible:

{
  "items": [...],
  "count": <items.length>,
  "returned_count": <items.length>,
  "limit": <int>,
  "offset": <int>,
  "next_offset": <int|null>,
  "truncated": <bool>
}

If we drop the global total_count semantics, document the change in the schema description.

1.6 Expected performance

Conservative targets after D1 only (SQL pushdown, no new index):

Scenario Before After D1
Single request, no prefix, limit=50 ~900 ms p50 ~80 ms p50
Single request, narrow prefix, limit=50 ~900 ms p50 ~90 ms p50
Concurrency 5 p95 4 470 ms <= 250 ms
Concurrency 10 p95 (untested) <= 400 ms

After also adding the partial text_pattern_ops index:

Scenario After D1 + index
Single request, narrow prefix < 30 ms p50
Concurrency 5 p95 < 150 ms
Concurrency 10 p95 < 250 ms

Production target: list_documents p95 < 300 ms at concurrency 10. Achievable with D1; index makes it comfortable.

1.7 Scaling estimate

Rows alive Current behaviour (full scan + Python) After D1
3 k ~900 ms / call ~80 ms
30 k ~10 s / call (would already be DoS-able) ~100 ms
300 k timeouts ~200 ms

1.8 Risks

  • R1. LIKE 'prefix%' with user-supplied prefix requires escaping % _ \ (psycopg2 parameter binding doesn't escape LIKE wildcards). Mitigation: prefix.replace('\\','\\\\').replace('%','\\%').replace('_','\\_') + '%'.
  • R2. text_pattern_ops orders bytewise, not by locale. Output order will differ slightly from current Python .sort() (which is also bytewise on str). Verify: no test asserts a locale-specific order.
  • R3. Removing stream_docs callers — reindex_kb_documents, cleanup_orphan_vectors, _compute_data_integrity, audit_sync still need full iteration. Keep stream_docs (or rename to iter_all_docs and yield via server-side cursor) but don't use it for list_documents. Lower priority follow-up: convert reindex/cleanup to chunked cursor.
  • R4. LIST_DOCUMENTS_MAX_RESPONSE_BYTES guard already exists (server.py:2085) — keep it; with limit=100 and rich metadata, response bytes stay well under guard.

1.9 Tests after D1

  1. Unit: prefix=empty, prefix='knowledge/', prefix with % and _, prefix='nonexistent/'.
  2. Pagination boundary: offset=0/24/49/50/99/100; limit=1/50/100/101 (reject).
  3. Soft-delete behaviour: deleted_at set -> excluded from results.
  4. Ordering stability: same call twice returns same order.
  5. Concurrency 5, 10, 20 — p95 < target.
  6. Backward compat: MCP args.path still works.
  7. EXPLAIN check (in CI): assert plan uses Index Scan not Seq Scan (optional, requires test DB).
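
A minimal pytest sketch for tests 1 and 6, assuming a TestClient fixture named client with auth handled elsewhere; the result.content[0].text envelope shape follows the observation in S3.1:

import json

def _tools_call(name, arguments):
    # Hypothetical JSON-RPC envelope builder for tests.
    return {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
            "params": {"name": name, "arguments": arguments}}

def test_wildcard_prefix_is_literal(client):
    # Test 1: '%' in the prefix must match literally, not as a LIKE wildcard (R1).
    r = client.post("/mcp", json=_tools_call("list_documents", {"path": "know%"}))
    payload = json.loads(r.json()["result"]["content"][0]["text"])
    assert payload["count"] == 0

def test_path_arg_backward_compat(client):
    # Test 6: legacy args.path still selects the subtree.
    r = client.post("/mcp", json=_tools_call("list_documents", {"path": "knowledge/"}))
    payload = json.loads(r.json()["result"]["content"][0]["text"])
    assert all(i["document_id"].startswith("knowledge/") for i in payload["items"])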

Part 2 — MCP transport / content negotiation design

2.1 Current state

App (agent_data/server.py):

  • @app.post("/mcp") (line 3074) — returns plain dict via FastAPI auto-conversion. Implicitly Content-Type: application/json.
  • @app.post("/mcp-readonly|gpt|gpt-full") (line 3353-3380) -> _mcp_filtered_handler (line 3214) — returns JSONResponse(content=...). Explicitly application/json.
  • No Accept header is read.
  • No text/event-stream path. No streaming.
  • No keep-alive ping. No session id. No Last-Event-ID.

Nginx (/gpt-mcp/{token}/mcp -> /mcp-gpt):

  • proxy_buffering off OK
  • chunked_transfer_encoding on OK
  • proxy_cache off OK
  • proxy_read_timeout 60s (would need extending if responses become true server-sent streams)
  • Connection '' OK (prevents keep-alive from being forwarded; allows hop-by-hop chunked)
  • X-Accel-Buffering: no is not set explicitly (not strictly required when buffering is already off).

Conclusion: nginx is SSE-ready for the /gpt-mcp route. The blocker is the app.

2.2 Why FastMCP saw "Unexpected content type"

FastMCP (recent versions, and the ChatGPT/Claude.ai connectors) follow the MCP Streamable HTTP transport (spec rev 2025-03-26). The client sends:

POST /mcp HTTP/1.1
Accept: application/json, text/event-stream
Content-Type: application/json

It then accepts either:

  • a single JSON response (Content-Type: application/json), OR
  • an SSE stream (Content-Type: text/event-stream) carrying one or more event: message frames with the JSON-RPC response.

Older clients warned or failed when the server didn't answer with SSE for tools/call results, especially long-running ones. The "Unexpected content type" message in our trace likely came either from a FastMCP version that strictly expected text/event-stream for some operation, or from the ChatGPT MCP App, which (per Anthropic/OpenAI docs as of cutoff) requires the SSE variant of Streamable HTTP. Either way the server-side delta is the same: support content negotiation.

2.3 Design — Phase A vs B

Option A = JSON hardening; Option B = JSON + SSE content negotiation.

  • Always send explicit Content-Type: application/json; charset=utf-8: A yes; B yes, when Accept lacks SSE.
  • Honour Accept: text/event-stream: A rejects with 406 or keeps JSON; B returns SSE-framed JSON-RPC.
  • Code surface: A small (1–2 helpers); B small-medium (1 helper + streaming wrapper).
  • Affects nginx: A no; B yes, for routes other than /gpt-mcp (/api/ would need proxy_buffering off if used for SSE).
  • Breaks current FastMCP/curl clients: A no; B no (a single-event SSE response is valid JSON-RPC; the client picks via Accept).
  • Unblocks ChatGPT MCP App: A probable but not certain; B strongly likely.
  • Future Streamable HTTP features (resumability, server pushes): A no; B becomes feasible.

Recommendation: Implement B, but in two micro-steps:

  • D2a: Add a tiny helper that inspects Accept and serializes the JSON-RPC response either as JSONResponse(content=..., media_type="application/json") or as StreamingResponse(... , media_type="text/event-stream") with a single SSE frame. Apply to /mcp, /mcp-readonly, /mcp-gpt, /mcp-gpt-full. Don't introduce session-id or resumability yet.
  • D2b (later): If ChatGPT App needs keep-alive comments or multiple frames, extend with event: message + event: ping every N seconds, and Mcp-Session-Id support. Not required to unblock.

2.4 Helper sketch (descriptive — NOT to be implemented in this phase)

Conceptual response builder:

def _mcp_respond(payload: dict, accept: str) -> Response:
    # D2a decision rule (R5): stream only when the client asks for SSE
    # and does NOT also accept plain JSON.
    accept = accept or ""
    if "text/event-stream" in accept and "application/json" not in accept:
        body = f"event: message\ndata: {json.dumps(payload, default=str)}\n\n"
        return StreamingResponse(
            iter([body]),
            media_type="text/event-stream",
            headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
        )
    return JSONResponse(content=payload, media_type="application/json; charset=utf-8")

  • For requests that return immediately (current pattern, all tools), one SSE frame is enough — clients accept it.
  • event: message is the spec-recommended event name; some clients accept unnamed data: events too.
  • No keep-alive ping needed because every response closes the stream.
  • For notifications/* (currently 204 No Content), keep 204 — SSE only fires when there's a JSON-RPC reply.
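
Applying it would then be a one-line change at each of the four return sites, assuming the handlers have (or gain) a fastapi.Request parameter:

# e.g. inside _mcp_filtered_handler, replacing the JSONResponse return:
return _mcp_respond(payload, request.headers.get("accept", ""))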

2.5 Nginx changes

  • /gpt-mcp/{token}/mcp location — already SSE-ready. No change needed. Bump proxy_read_timeout from 60s to 120s if a future stream may sit idle.
  • location /api/ (general AgentData proxy) — currently has proxy_read_timeout 300s but does not disable buffering. If anyone hits /api/mcp via this route and asks for SSE, nginx will buffer. Add proxy_buffering off; proxy_cache off; in /api/ when D2a ships. Safe — these routes are typically short JSON responses, but proxy_buffering off doesn't hurt them.

2.6 Risks

  • R5. Some HTTP clients send Accept: */*; the helper must treat */* as "JSON is fine" (don't auto-stream). A fully correct rule would honour q-weighting, but the simplest heuristic ("stream whenever text/event-stream is present") would regress clients that send Accept: application/json, text/event-stream;q=0.1. Tradeoff for D2a: stream only if Accept contains text/event-stream AND does NOT contain application/json. The ChatGPT MCP App sends only text/event-stream for the Streamable HTTP variant; the Claude.ai connector sends both. This rule keeps current curl -H 'Accept: application/json' clients 100 % unchanged and gives the SSE-strict client what it wants.
  • R6. Order of operations: handler currently builds the dict, then returns. With SSE wrapper, errors must be in-stream JSON, not HTTP 500. The current code path already returns {"isError": true, ...} in the JSON-RPC envelope; preserve that.
  • R7. Stream framing bytes: event: message\ndata: <json>\n\n. The terminating blank line (the \n\n) is critical: it marks end-of-event. Test with curl --no-buffer (snippet below).
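
A framing check of the kind R7 calls for (host and token are placeholders):

curl --no-buffer -sS -D- -X POST 'https://<host>/gpt-mcp/<token>/mcp' \
  -H 'Accept: text/event-stream' \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'
# Expect header: Content-Type: text/event-stream
# Expect body:   event: message
#                data: {"jsonrpc":"2.0","id":1,"result":{...}}
#                (blank line terminates the event; connection then closes)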

2.7 Test matrix

Test Expectation
curl -H 'Accept: application/json' POST /mcp Content-Type: application/json; charset=utf-8, single dict body
curl -H 'Accept: application/json, text/event-stream' POST /mcp (Claude.ai mode) application/json (R5 rule)
curl -H 'Accept: text/event-stream' POST /mcp Content-Type: text/event-stream, body event: message\ndata: {...}\n\n, connection closed
curl --no-buffer -H 'Accept: text/event-stream' POST /gpt-mcp/.../mcp nginx passes through unbuffered
FastMCP client initialize + tools/list + tools/call All succeed
ChatGPT MCP App handshake against /gpt-mcp/.../mcp (post D2a deploy) Succeeds; tools list loads; one tool call returns

Part 3 — move_document decision

3.1 Mystery resolved

agent_data/server.py:1624-1636:

@app.post("/documents/{doc_id:path}/move", response_model=DocumentResponse)
async def move_document(doc_id, payload, _=Depends(require_api_key)):
    # DEPRECATED S170: move_document only changed parent_id metadata,
    # did NOT rename PG key or Qdrant document_id. Use copy+delete instead.
    raise _error(
        501,
        "NOT_IMPLEMENTED",
        "move_document is deprecated. Use: (1) upload_document to new path, (2) delete old path.",
        document_id=doc_id,
    )
    try:
        ...  # <- original implementation, unreachable

The function is a kill switch: the raise is at the very top of the body, leaving the original logic as dead code below. Decision was made in S170 because move only updated parent_id metadata without renaming the PG key (which is the canonical _fs_key form) or the Qdrant document_id, leaving the document discoverable under two paths and search indices stale.

The MCP schema (MCP_TOOLS) and the dispatcher (_dispatch_mcp_tool line 3057) still advertise and route the call — so any GPT/Claude that calls it gets 501 NOT_IMPLEMENTED returned wrapped in the JSON-RPC result.content[0].text (because the catch in _mcp_filtered_handler line 3326 turns HTTPException into a structured error). That matches the Phase 2F-A observation.

3.2 Policy options

  • A. Implement properly (rename key + rename Qdrant document_id atomically).
  • B. Remove from MCP_TOOLS AND remove from MCP_GPT_FULL_TOOLS so the schema matches the runtime.
  • C. Keep schema, prefix description with [DEPRECATED].
  • D. Keep in /mcp-gpt-full only; remove from public /mcp-gpt.

/mcp-gpt already excludes move_document. /mcp-gpt-full is not nginx-exposed. So the live impact is small — the only path that lets a client call it is /mcp (internal) or /mcp-gpt-full (not exposed). Still: the schema-runtime mismatch is a correctness defect and confuses every LLM that reads the tools list.

3.3 Recommendation

B. Remove move_document entirely from:

  • MCP_TOOLS list (server.py:2722) — schema
  • MCP_GPT_FULL_TOOLS frozenset (server.py:2708) — allowlist
  • _dispatch_mcp_tool branch (server.py:3061) — dispatcher
  • Keep @app.post("/documents/{doc_id:path}/move") returning 501 — backward compat for any internal caller.

Rationale:

  • The runtime decision (S170) is months old and the right call: copy+delete is a clean replacement.
  • Leaving a tool the LLM can attempt but will always fail wastes a tool slot in the LLM's instruction budget and pollutes audit logs.
  • Re-implementing properly is non-trivial (atomic rename across PG + Qdrant + invalidate caches + handle deep-prefix children) and is not required for production.

If business later needs first-class rename: re-introduce as rename_document with a proper transactional implementation.

3.4 Tests after B

  • tools/list on /mcp-gpt-full does NOT include move_document.
  • tools/call with name=move_document on any profile returns -32601 Tool not allowed (or Unknown tool).
  • Direct POST /documents/{x}/move still returns 501 (regression: don't accidentally re-enable the dead code).

Part 4 — Write latency design

4.1 Current path (per upload/update/patch)

For example update_document (server.py:1490-1622):

  1. _ensure_pg
  2. pg_store.get_doc(...) — 1 SELECT by PK (~5–15 ms)
  3. Compute new content/metadata
  4. pg_store.update_doc(..., expected_revision=...) — 1 UPDATE + 6 triggers fire:
    • trg_kb_updated_at (BEFORE)
    • trg_kb_truncation_guard (BEFORE)
    • trg_kb_audit (AFTER)
    • trg_kb_snapshot_update (BEFORE)
    • trg_context_pack_law_enact (AFTER, conditional)
    • trg_kb_vector_sync (AFTER) -> pg_notify('kb_vector_sync', ...) (~30–80 ms total for all triggers)
  5. Inline sync: _delete_vector_entry(doc_id) -> vector_store.delete_document (Qdrant call, ~50–150 ms)
  6. Inline sync: _sync_vector_entry(...) -> vector_store.upsert_document:
    • _split_text (CPU, small)
    • For each chunk: self._openai.embeddings.create(...) (sequential, ~150–400 ms/chunk over HTTPS)
    • self._qdrant_upsert(points) with wait=True (~50–150 ms)
    • pg_store.update_doc(..., {"vector_status": "ready"}) — second UPDATE, fires triggers again (!)
  7. emit_fire_and_forget(DOCUMENT_UPDATED, ...) (~microseconds)

Plus in parallel (background thread):

  • pg_vector_listener.py receives the NOTIFY from step 4
  • Calls vector_store.upsert_document again for the same doc

Net result: each write does the Qdrant upsert twice (once inline, once async). The inline path is the one that contributes to user-visible p95.

4.2 What is mandatory sync vs async

Step Must be sync? Why
PG read (get) yes required for revision check / not-found semantics
PG write (update_doc) yes confirms persistence, returns 200
Revision bump yes atomic with PG write
Trigger fires NOTIFY yes (PG-side) already in trigger
Delete old Qdrant vectors no the listener can do this; user only cares once data lands
OpenAI embed no costliest step, async-friendly
Qdrant upsert no ditto
vector_status update no listener updates after embed
event_bus emit n/a already async (fire-and-forget)

4.3 Options

Option 1 — keep sync, batch + parallelize embeds

Stay sync but:

  • Use OpenAI batch embeddings: one HTTPS call with N chunk inputs returns N vectors, cutting a 3-chunk doc from 3 x 250 ms to ~300 ms (sketch after this list).
  • Use asyncio.to_thread or ThreadPoolExecutor to run delete + upsert in parallel.
  • Drop wait=True on Qdrant upsert; the listener already deals with eventual consistency. (Note: Qdrant wait=False returns immediately; the server still applies the upsert.)
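
A sketch of the batched call, assuming the OpenAI v1 Python client already held in self._openai (EMBED_MODEL stands in for whatever model constant vector_store uses):

# One HTTPS round-trip embeds every chunk; resp.data preserves input order.
resp = self._openai.embeddings.create(model=EMBED_MODEL, input=chunks)
vectors = [item.embedding for item in resp.data]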

Expected p95: ~500–700 ms (down from 1.3 s).

Risk: still on the critical path; OpenAI rate-limit at concurrency 10+ will push tail latency. Doesn't address the double work with the listener.

Option 2 — async vector via existing listener

Remove the inline _delete_vector_entry + _sync_vector_entry from create_document, update_document (content branch), and patch_document. The PG-NOTIFY trigger already publishes; the listener already consumes.

Set vector_status: "pending" in the same UPDATE that lands the content (it's already done for create — server.py:1456).

Expected p95: ~80–150 ms for upload/update/patch (dominated by PG round-trips + triggers).
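
Under Option 2 the content branch of update_document reduces to roughly the following (a sketch against the names observed in 4.1; key, new_data and expected_revision are illustrative locals, and the update_doc signature follows the existing handlers):

# Persist content + mark pending; trg_kb_vector_sync NOTIFYs and
# pg_vector_listener does the embed/upsert off the request path.
new_data["vector_status"] = "pending"
pg_store.update_doc(KB_COLLECTION, key, new_data, expected_revision=expected_revision)
# (no _delete_vector_entry / _sync_vector_entry here any more)
emit_fire_and_forget(DOCUMENT_UPDATED, document_id=doc_id)  # kwargs illustrative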

Risks:

  • Eventual consistency: search_knowledge may miss the new content for the lag between PG commit and listener processing (typically <2 s on this hardware).
  • Listener failure-mode: if the listener thread dies, vectors fall behind silently. Today's code already runs /kb/audit-sync and /kb/reindex-missing endpoints to recover; surface this in monitoring/health.
  • Documentation contract: clients (GPT) must be told: writes are committed sync; semantic indexing eventually consistent. get_document always reflects current content.

Option 3 — hybrid by body size

Threshold env (e.g. MCP_SYNC_EMBED_MAX_CHARS=2000):

  • Small docs (<=2 000 chars, ~1 chunk) -> sync embed (~250 ms). User sees searchable doc immediately.
  • Larger docs -> fall through to listener (async).

Expected p95: ~250 ms small, ~120 ms large.

Risk: behaviour split; harder to reason about.

4.4 Recommendation

Option 2 for production, with these guardrails:

  1. Remove inline _delete_vector_entry + _sync_vector_entry calls in create_document, update_document (content branch), patch_document. Keep the vector_status: "pending" write.
  2. Verify the PG-NOTIFY trigger fires for all three paths (UPDATE on data IS DISTINCT FROM — already in code) — confirmed via pg_get_functiondef.
  3. The listener has resilience: it reconnects on PG drop (poll loop in pg_vector_listener.py). Add a Prometheus/log gauge vector_sync_lag_seconds = NOW() - min(updated_at where vector_status='pending'), i.e. the age of the oldest pending doc, and alert if > 30 s (SQL sketch after this list).
  4. Existing /kb/reindex-missing (server.py:2317 area) becomes the recovery handle.
  5. Add to MCP update_document/patch_document description: "Document content is persisted synchronously; semantic indexing is eventual-consistent (typically <2 s). Use get_document to read back content."
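
A sketch of the gauge query for item 3, measuring the age of the oldest still-pending document (COALESCE makes the gauge 0 when nothing is pending):

SELECT COALESCE(
         EXTRACT(EPOCH FROM now() - min(updated_at)),
         0
       ) AS vector_sync_lag_seconds
FROM kb_documents
WHERE data->>'vector_status' = 'pending';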

4.5 What about _delete_vector_entry on delete and upsert path?

The PG trigger emits op=DELETE for hard-delete and op=DELETE on soft-delete transition (deleted_at NULL -> NOT NULL). The listener calls vector_store.delete_document. So delete_document handler also doesn't need the inline call — same simplification applies.

4.6 Risk register

  • R8. Soft-delete-then-immediate-search race: user soft-deletes, calls search_knowledge within 1 s, gets the deleted doc back. Acceptable; document.
  • R9. OpenAI/Qdrant outage during listener catchup: docs stuck on vector_status=pending indefinitely. Mitigate with the vector_sync_lag_seconds alert (4.4, item 3) and a periodic auto-reindex cron (/kb/reindex-missing).
  • R10. Order of operations under concurrent updates: listener may apply embeds out of order. The listener fetches current data on each notification (pg_store.get_doc(...) in pg_vector_listener._handle_notification), so the final embedded state always matches the final committed state. Already correct.

4.7 Tests after D4

  • Upload doc -> respond <200 ms -> wait 5 s -> search_knowledge finds it.
  • Update doc -> respond <200 ms -> get_document returns new content immediately.
  • Patch doc -> respond <200 ms; verify vector_status transitions pending -> ready within 5 s.
  • Soft-delete -> respond <200 ms; verify Qdrant entry removed within 5 s.
  • Concurrency 5 across upload/update/patch -> all p95 <300 ms.
  • Kill listener -> writes still succeed; documents go pending; restarting listener catches up.

Part 5 — Endpoint policy for GPT

5.1 Threat surface per tool

Tool Read/Write Risk if mis-issued by LLM Reversible?
search_knowledge R none n/a
list_documents R none n/a
get_document R none n/a
get_document_for_rewrite R none n/a
batch_read R none n/a
upload_document W wrong path, duplicate doc yes (delete)
update_document W overwrites a doc yes (snapshot trigger captures prior)
patch_document W targeted edit yes (snapshot)
delete_document W (soft) drops a doc from list/search yes (set deleted_at back to NULL)
move_document W currently 501 — recommend remove (Part 3) n/a
ingest_document W SSRF / external fetch from LLM-controlled URL partial (delete after)

5.2 Recommendation

Keep current production policy: /gpt-mcp/{token}/mcp -> /mcp-gpt (8 tools, no delete/move/ingest) is the only public surface.

  • Do NOT expose /mcp-gpt-full publicly via nginx until each high-risk tool has:
    • SSRF allow-list for ingest_document (only gs:// + a known set of https:// hosts).
    • Rate-limit and audit-log shipping for delete_document.
    • A re-implementation or removal of move_document per Part 3.
  • Admin/manual mode: keep /mcp-gpt-full as an internal-only route (no public nginx exposure, only reachable from inside Docker network or with a separate, more restricted nginx route gated by both API key and source IP).
  • Audit log: ensure connector_call wrapper=delete_document ... is captured in nginx access log (currently access_log off on /gpt-mcp/...). Recommend re-enabling structured access log on the /gpt-mcp location with a low-cardinality format (sketch below) — knowing which tool was invoked is essential for incident response.
  • Rotate the nginx-side gpt-mcp token after this design phase (separate ticket; not now). It was observed in clear in the conf file and in a previous deploy log per the comment.
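
A hedged sketch of such a format. The tool= field assumes the app echoes an X-MCP-Tool response header (not something the code does today); $sent_http_* is standard nginx:

# Sketch only. log_format lives at the http{} level.
# tool= relies on the app echoing an X-MCP-Tool response header (an assumption).
log_format mcp_audit '$remote_addr $time_iso8601 $status $request_time '
                     'tool=$sent_http_x_mcp_tool';

# Inside the /gpt-mcp location, replacing access_log off:
access_log /var/log/nginx/gpt-mcp.access.log mcp_audit;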

5.3 Rollback

include /etc/nginx/secrets/gpt-mcp-route.conf; is one line — comment it out to remove the public route entirely. Reload nginx. No upstream change required.


Part 6 — Patch sequence

Each phase is one PR-sized change. No phase requires nginx changes unless explicitly listed.

2F-D1 — list_documents SQL pushdown + index

Goal Push prefix, sort, limit, offset to PostgreSQL. Add partial pattern index.
Files agent_data/pg_store.py (add list_kb_docs(prefix, limit, offset)), agent_data/server.py (rewrite _list_kb_documents_page). DB migration SQL applied out-of-band (CREATE INDEX CONCURRENTLY).
MCP schema change None (current path/limit/offset preserved; behaviour identical apart from the total_count semantics — keep or drop per S1.3).
Nginx change None.
Risk LIKE escaping (R1); ordering change (R2); stream_docs callers untouched (R3).
Rollback Revert PR + DROP INDEX CONCURRENTLY idx_kb_documents_doc_id_pattern;.
Tests S1.9.
Pass criteria Functional MCP unchanged; list_documents single p50 < 100 ms, concurrency 5 p95 < 300 ms; no Seq Scan in EXPLAIN.
Expected impact 4.47 s -> <300 ms p95 @ c5.

2F-D2a — Transport content negotiation (JSON + SSE)

Goal Honour Accept: text/event-stream; keep JSON for everyone else.
Files agent_data/server.py (one helper _mcp_respond, applied at the 4 return sites in /mcp, _mcp_filtered_handler).
MCP schema change None.
Nginx change Optionally add proxy_buffering off; proxy_cache off; to location /api/ for symmetry; not strictly required if ChatGPT only hits /gpt-mcp/....
Risk R5 Accept-rule heuristic; R6 in-stream error envelopes; R7 framing bytes.
Rollback Revert PR. Nginx untouched.
Tests S2.7 matrix.
Pass criteria All current clients (curl, FastMCP, Codex, Claude) unchanged. SSE-strict client receives valid text/event-stream framing for initialize, tools/list, tools/call.
Expected impact Unblocks ChatGPT MCP App handshake (high confidence).

2F-D3 — move_document schema cleanup

Goal Remove move_document from MCP tools list + dispatcher; keep HTTP endpoint returning 501.
Files agent_data/server.py (delete branch in _dispatch_mcp_tool; remove entry from MCP_TOOLS; remove from MCP_GPT_FULL_TOOLS).
MCP schema change One tool fewer (MCP_TOOLS 11 -> 10; MCP_GPT_FULL_TOOLS 11 -> 10). connectorSchemaVersion will need a bump; _mcp_schema_hash() auto-updates.
Nginx change None.
Risk Any client still calling move_document over MCP gets Unknown tool instead of NOT_IMPLEMENTED. Should be no impact since GPT/Claude read the dynamic tools list.
Rollback Revert PR.
Tests S3.4.
Pass criteria tools/list excludes move_document; HTTP /documents/{x}/move still 501.
Expected impact Schema-runtime correctness; LLM no longer attempts deprecated tool.

2F-D4 — Async write path via existing listener

Goal Remove inline _delete_vector_entry + _sync_vector_entry from create/update/patch/delete handlers; rely on PG-NOTIFY listener.
Files agent_data/server.py (create_document, update_document, patch_document, delete_document — strip the inline vector calls; keep vector_status: "pending" write). Add a vector_sync_lag_seconds health metric.
MCP schema change Description update on write tools to declare eventual-consistent semantic indexing.
Nginx change None.
Risk R8/R9/R10 above.
Rollback Revert PR. Listener keeps running; documents reach consistency either way.
Tests S4.7.
Pass criteria Upload/update/patch p95 < 300 ms at concurrency 10; vector_sync_lag_seconds < 5 s steady-state; reindex-missing catches all stragglers within 1 cron cycle.
Expected impact 0.8–1.3 s -> <250 ms p95.

2F-D5 (optional, later) — Streamable HTTP resumability

Session id (Mcp-Session-Id header), Last-Event-ID, keep-alive ping. Only needed if/when long-running tool calls require server-pushed progress events. Out of scope to unblock ChatGPT.


Part 7 — Production pass criteria

7.1 Functional

Tool Pass
search_knowledge returns >=1 hit on 'operating rules SSOT', latency < 2 s
list_documents (no prefix) returns first 50; next_offset present
list_documents (prefix='knowledge/test/') returns expected subset
get_document found = full payload; missing = {error: "not found"}
get_document_for_rewrite full body returned
batch_read (3 paths) returns 3 items, truncated by default
upload_document new doc visible in get_document immediately; in search <5 s
update_document revision +1; new content readable immediately
patch_document one-replace succeeds; ambiguous returns 409
delete_document soft-delete; excluded from list_documents; vector gone <5 s
move_document (after D3) not present in tools/list
ingest_document (only via /mcp-gpt-full admin) honours allow-list

7.2 Latency targets (p95)

Path After D1+D2+D4 After D1+D2+D4 with index
initialize < 50 ms < 50 ms
tools/list < 50 ms < 50 ms
get_document < 200 ms < 200 ms
batch_read (3 paths) < 300 ms < 300 ms
list_documents single < 100 ms < 50 ms
list_documents concurrency 5 < 250 ms < 150 ms
list_documents concurrency 10 < 400 ms < 250 ms
upload_document < 250 ms < 250 ms
update_document < 250 ms < 250 ms
patch_document < 250 ms < 250 ms
search_knowledge < 2 s (OpenAI bound) unchanged

7.3 Transport pass

  • Accept: application/json -> JSON, byte-identical to today.
  • Accept: application/json, text/event-stream -> JSON (heuristic R5).
  • Accept: text/event-stream -> SSE with one event: message frame, data: <json>\n\n.
  • FastMCP client initialize -> tools/list -> tools/call E2E green.
  • ChatGPT MCP App handshake + tools/list + one list_documents call green (live test after D2a deploys).

7.4 Stress

  • Concurrency 5 and 10 across mixed tool calls (1 minute, 100 rps target).
  • error_rate == 0 for all valid calls.
  • No 5xx, no read timeouts, no Qdrant wait=True timeouts.
  • Container CPU < 50 % avg, memory stable, no goroutine/thread leak.

7.5 Safety

  • No secret rendered in logs (current code uses <redacted> for env dump; verify same in any added log statements).
  • /mcp-gpt-full not nginx-exposed.
  • delete_document audited at nginx layer (re-enable access_log on /gpt-mcp/...).
  • Rotate /gpt-mcp/{token} after D1+D2 ship (separate task).
  • Rollback documented per patch (Part 6).

Part 8 — Recommendation summary

Implement D1 + D2a before re-connecting ChatGPT. Both are surgical, small-blast-radius, and address the two top blockers (list latency, transport).

Then implement D3. Trivial cleanup; gets the schema honest.

D4 is high-value but bigger surface; prefer to land it as a separate session after D1+D2a are verified in production. Existing listener already covers the async path — D4 just removes the redundant sync path.

Do not expose /mcp-gpt-full publicly. Current 8-tool policy is correct for autonomous GPT.


Appendix A — Observed facts (read-only inspection)

  • kb_documents row counts and sizes: total=5040, alive=2907, table 27 MB, idx 1.7 MB combined.
  • EXPLAIN of current stream_docs: Seq Scan, 4.0 ms execution; cost is in transfer + Python materialization.
  • EXPLAIN of proposed pushed-down query: 71–97 ms execution; would drop to <10 ms with partial text_pattern_ops index.
  • pg trigger fn_kb_notify_vector_sync emits pg_notify('kb_vector_sync', json) on INSERT/UPDATE/DELETE with skip rules for operations/tasks/comments/, registries/, empty body.
  • pg listener pg_vector_listener.py is running (PG->Qdrant vector sync listener started in container logs).
  • All /mcp* endpoints return JSON only; nginx /gpt-mcp/{token}/mcp already has SSE-friendly proxy settings.
  • move_document raises 501 NOT_IMPLEMENTED unconditionally at function entry (server.py:1631-1635).

Appendix B — Open questions (for human approval)

  1. Drop total_count from list_documents response, or keep with extra COUNT(*) query? (Recommend drop.)
  2. Bump nginx proxy_read_timeout on /gpt-mcp/{token} from 60 s -> 120 s for SSE futures? (Optional.)
  3. Add structured access log on /gpt-mcp/{token} location for audit? (Recommend yes.)
  4. Schedule the gpt-mcp token rotation as a follow-up task? (Recommend yes, separate ticket.)
  5. Add vector_sync_lag_seconds health metric in D4? (Recommend yes.)

End of report. No code modified. No commit. No restart. No nginx reload.