KB-3281

Phase 2F-C — MCP Production Patch Design Report (2026-05-13)

Revision 1
Tags: phase-2f, mcp, design, production, agent-data, list_documents, transports, sse, write-latency


Date: 2026-05-13
Status: Investigation + design only. No code changes, no rebuild, no restart, no commit.
Author: Claude (Phase 2F-C)
Scope: Production-grade design for the AgentData MCP server prior to re-connecting ChatGPT.


Executive summary

  1. list_documents p95 of 4.47 s is a Python-side bottleneck, not SQL. PostgreSQL returns the full table in ~4 ms; the application then transfers 27 MB of JSONB, materializes 5 040 dicts per request, and filters/sorts/paginates in Python. Pagination args exist in the API but are applied only after full materialization.
  2. Transport blocker for ChatGPT/Claude.ai is application-level, not nginx. All /mcp* handlers return application/json. The /gpt-mcp nginx route already has proxy_buffering off + chunked_transfer_encoding on — it is ready for SSE. The fix is in agent_data/server.py.
  3. move_document is intentionally 501 NOT_IMPLEMENTED (S170 deprecation) but its schema is still advertised in MCP_TOOLS. This is a schema/runtime mismatch, not a missing implementation.
  4. Write p95 of 0.8–1.3 s comes mostly from OpenAI embedding + Qdrant wait=True, compounded by double work. Handlers call _sync_vector_entry synchronously while a PG-NOTIFY listener (pg_vector_listener.py) does the same work asynchronously. Removing the inline sync path is a large, low-risk win.
  5. Endpoint policy is already partially correct — only /gpt-mcp/{token}/mcp → /mcp-gpt (8 tools, no delete/move/ingest) is exposed publicly via nginx. /mcp-gpt-full exists in the app but is not nginx-routed. Recommend keeping it that way.
  6. Two patches (2F-D1 list_documents + 2F-D2 transport content-negotiation) are sufficient to unblock ChatGPT testing; everything else can follow.
  7. One nginx secret was observed in /etc/nginx/secrets/gpt-mcp-route.conf. It is NOT printed in this report. Recommend rotating after this design phase completes (separate task — do not rotate now per instruction).

Phase 0 — Source map / fact base

0.1 Source paths (VPS = SSOT)

Concern VPS path
MCP handlers / dispatch / tool schemas /opt/incomex/docker/agent-data-repo/agent_data/server.py
PostgreSQL store /opt/incomex/docker/agent-data-repo/agent_data/pg_store.py
Qdrant + OpenAI embedding /opt/incomex/docker/agent-data-repo/agent_data/vector_store.py
Async vector sync listener /opt/incomex/docker/agent-data-repo/agent_data/pg_vector_listener.py
Nginx site config container incomex-nginx:/etc/nginx/conf.d/default.conf
Nginx secret-token MCP route container incomex-nginx:/etc/nginx/secrets/gpt-mcp-route.conf

0.2 Git state (VPS repo)

  • Branch: main
  • HEAD: 8d1442f phase2e: harden MCP profiles and certify get_document behavior
  • Working tree: clean (git status --short returns nothing).
  • Repo is 5 ahead, 112 behind origin/main (long-standing — per project memory, edit host directly, never git pull). Local clone at /Users/nmhuyen/agent-data-test is outdated relative to VPS — do all design against VPS source.

0.3 Runtime state

incomex-agent-data   Up 33 minutes (healthy)   8080/tcp
postgres             Up 3 weeks (healthy)      127.0.0.1:5432->5432/tcp
incomex-qdrant       Up 8 weeks (healthy)      6333-6334/tcp
incomex-nginx        Up 13 days                :80, :443

Application log shows PG->Qdrant vector sync listener started — the async listener thread is already running in production.

0.4 MCP profiles (Python-level routes, not nginx)

agent_data/server.py:2689-2718:

Profile Route in app Tools
internal POST /mcp All 11 (MCP_TOOLS)
readonly POST /mcp-readonly 5 read tools (MCP_READONLY_TOOLS)
gpt POST /mcp-gpt 8 = read + upload/update/patch (MCP_GPT_TOOLS)
gpt-full POST /mcp-gpt-full All 11 (MCP_GPT_FULL_TOOLS)

Nginx public exposure (/etc/nginx/secrets/gpt-mcp-route.conf):

  • /gpt-mcp/{secret}/mcp → upstream /mcp-gpt (8 tools).
  • Sets X-API-Key header internally.
  • Has proxy_buffering off, chunked_transfer_encoding on, proxy_read_timeout 60s.
  • /mcp-gpt-full is NOT exposed by nginx.

0.5 PostgreSQL kb_documents schema (VPS)

key        TEXT PRIMARY KEY     -- filesystem-style ("knowledge__current-state__..."); _fs_key() translates '/' <-> '__'
data       JSONB                -- {document_id, parent_id, content:{body}, metadata, revision, deleted_at, vector_status, ...}
updated_at TIMESTAMPTZ
Indexes:
  kb_documents_pkey                btree (key)
  idx_kb_documents_doc_id          btree ((data->>'document_id'))
  idx_kb_documents_deleted         btree ((data->>'deleted_at'))
Triggers (6):
  trg_kb_audit, trg_kb_snapshot_update, trg_kb_snapshot_delete,
  trg_kb_truncation_guard, trg_kb_updated_at, trg_kb_vector_sync (-> pg_notify),
  trg_context_pack_law_enact
Row stats:
  total = 5040, alive = 2907, table size = 27 MB
  avg JSONB row ~6 KB, max ~94 KB
  top prefixes (alive): knowledge=2570, registries=223, operations=105

Part 1 — list_documents root cause + production design

1.1 Current code path

agent_data/server.py:2046 _list_kb_documents_page:

def _list_kb_documents_page(*, prefix="", limit=LIST_DOCUMENTS_DEFAULT_LIMIT, offset=0):
    limit, offset = _normalize_list_pagination(limit, offset)
    _ensure_pg()
    docs = pg_store.stream_docs(KB_COLLECTION)          # <- FULL TABLE LOAD
    matches = []
    for data in docs:                                    # <- Python loop over 5040 rows
        if data.get("deleted_at") is not None: continue
        doc_id = data.get("document_id", data.get("_key", ""))
        if prefix and not doc_id.startswith(prefix): continue
        matches.append({...})
    matches.sort(key=lambda x: x["document_id"])         # <- Python sort
    page = matches[offset : offset + limit]              # <- Python slice

agent_data/pg_store.py:170-176 stream_docs:

def stream_docs(collection):
    with _conn() as conn, conn.cursor(cursor_factory=RealDictCursor) as cur:
        cur.execute(f"SELECT key, data FROM {tbl}")     # <- no WHERE, no LIMIT, no ORDER
        return [{"_key": row["key"], **dict(row["data"])} for row in cur.fetchall()]

No body file reads, no Qdrant calls, no N+1. Just an unbounded full-table scan returning every JSONB row, fully materialized into Python dicts on every call.

1.2 Why p95 = 4.47 s while EXPLAIN says 4 ms

PG-side SELECT key, data FROM kb_documents:

Seq Scan on kb_documents  (cost=0.00..445.14 rows=5014 width=529)
                          (actual time=0.044..3.626 rows=5040 loops=1)
Execution Time: 4.015 ms

PostgreSQL itself is fast. The cost is in the wire + Python layer:

  • 27 MB JSONB over the unix socket (psycopg2 -> cursor.fetchall).
  • 5 040 calls to dict(row["data"]) + 5 040 {**dict, "_key": ...} allocations.
  • Then iterate again, filter, sort. Then slice.
  • Multiplied by concurrency 5 -> GIL contention, gunicorn/uvicorn worker time spikes.
  • No caching, no shared snapshot.

1.3 Proposed SQL (D1)

SELECT
  data->>'document_id'                            AS document_id,
  data->>'parent_id'                              AS parent_id,
  COALESCE(data->'metadata'->>'title','')         AS title,
  COALESCE(data->'metadata'->'tags','[]'::jsonb)  AS tags,
  COALESCE((data->>'revision')::int, 0)           AS revision
FROM kb_documents
WHERE (data->>'deleted_at') IS NULL
  AND (%(prefix)s = '' OR (data->>'document_id') LIKE %(prefix_like)s)
ORDER BY (data->>'document_id')
LIMIT %(limit)s OFFSET %(offset)s;

prefix_like is prefix + '%', built in Python after LIKE-escaping the prefix to neutralize %, _ and \ (sketch below).
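
A minimal sketch of that escaping (the helper name _like_escape is illustrative, not existing code):

def _like_escape(prefix: str) -> str:
    # Escape LIKE wildcards (R1) so a user-supplied prefix matches literally.
    return (
        prefix.replace("\\", "\\\\")
              .replace("%", "\\%")
              .replace("_", "\\_")
    )

params = {
    "prefix": prefix,
    "prefix_like": _like_escape(prefix) + "%",
    "limit": limit,
    "offset": offset,
}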

Also need total count (current API returns count = total matches). Two options:

  • A. Second SELECT COUNT(*) FROM kb_documents WHERE ... AND (data->>'document_id') LIKE ... — extra query (~5–20 ms).
  • B. Drop total_count; return count = len(items) + next_offset. Recommended — matches REST norms, removes a query, and total_count is rarely needed by a paginated client.

If total_count must stay, prefer A.
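
Putting 1.1–1.3 together, a hedged sketch of the new pg_store helper proposed in Part 6 (list_kb_docs). LIST_SQL is the query above without the trailing semicolon; the limit+1 fetch implements option B's next_offset without a COUNT(*); _conn, RealDictCursor and _like_escape are the existing module helpers plus the escape sketch above:

LIST_SQL = """
SELECT
  data->>'document_id'                            AS document_id,
  data->>'parent_id'                              AS parent_id,
  COALESCE(data->'metadata'->>'title','')         AS title,
  COALESCE(data->'metadata'->'tags','[]'::jsonb)  AS tags,
  COALESCE((data->>'revision')::int, 0)           AS revision
FROM kb_documents
WHERE (data->>'deleted_at') IS NULL
  AND (%(prefix)s = '' OR (data->>'document_id') LIKE %(prefix_like)s)
ORDER BY (data->>'document_id')
LIMIT %(limit)s OFFSET %(offset)s
"""

def list_kb_docs(prefix: str = "", limit: int = 50, offset: int = 0):
    # Fetch limit+1 rows so next_offset can be derived without a COUNT(*).
    params = {
        "prefix": prefix,
        "prefix_like": _like_escape(prefix) + "%",
        "limit": limit + 1,
        "offset": offset,
    }
    with _conn() as conn, conn.cursor(cursor_factory=RealDictCursor) as cur:
        cur.execute(LIST_SQL, params)
        rows = cur.fetchall()
    items = [dict(r) for r in rows[:limit]]
    next_offset = offset + limit if len(rows) > limit else None
    return items, next_offset

The caller then sets count = len(items) and truncated = next_offset is not None, matching the response shape in S1.5.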

1.4 Index design

Current idx_kb_documents_doc_id uses default collation. EXPLAIN with the new SQL:

Limit  (actual time=81.6..87.4 rows=50 loops=1)
  Index Scan using idx_kb_documents_doc_id on kb_documents
    Filter: (((data->>'deleted_at') IS NULL) AND ((data->>'document_id') ~~ 'knowledge/%'))
    Rows Removed by Filter: 1611
Execution Time: 87.5 ms

Index Scan walks 1 661 entries to find 50 — LIKE 'knowledge/%' is treated as a filter, not a range, under UTF-8 collation (confirmed: 'knowledge/x' < 'knowledge0' returns FALSE under default collation; TRUE under COLLATE "C").

Recommended new index (partial + text_pattern_ops):

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_kb_documents_doc_id_pattern
  ON kb_documents ((data->>'document_id') text_pattern_ops)
  WHERE (data->>'deleted_at') IS NULL;

  • text_pattern_ops makes LIKE 'prefix%' a true range scan (no row-by-row filter).
  • Partial WHERE deleted_at IS NULL shrinks the index by ~42 % (only 2 907 of 5 040 rows) and removes the deleted-at filter cost.
  • Expected new SQL execution: <10 ms at LIMIT 50.

Deploy plan:

  • CREATE INDEX CONCURRENTLY — cannot run inside a transaction, must be one statement outside a migration block. AgentData has no Alembic; current pattern is bare SQL applied by SRE.
  • Roll forward: apply on the running DB (CONCURRENTLY is non-blocking; it only takes a ShareUpdateExclusiveLock against schema changes and allows reads + writes). Verify with the EXPLAIN sketch after this list.
  • Rollback: DROP INDEX CONCURRENTLY IF EXISTS idx_kb_documents_doc_id_pattern; — also non-blocking. Existing idx_kb_documents_doc_id stays in place as a safety net.
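
Roll-forward verification sketch (run in psql; the exact plan shape is planner-dependent, but the new index should appear and the LIKE should collapse to a range condition):

EXPLAIN (ANALYZE, BUFFERS)
SELECT data->>'document_id'
FROM kb_documents
WHERE (data->>'deleted_at') IS NULL
  AND (data->>'document_id') LIKE 'knowledge/%'
ORDER BY (data->>'document_id')
LIMIT 50;
-- Expect: Index Scan (or Bitmap Index Scan) using idx_kb_documents_doc_id_pattern,
-- with the prefix rewritten to >= / < bounds rather than a row-by-row filter.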

1.5 MCP schema / API

VPS currently advertises prefix/limit/offset already (server.py:2756, LIST_DOCUMENTS_DEFAULT_LIMIT=50, MAX=100). But the MCP tool schema sends path, which is mapped to prefix (server.py:2972-2976). Keep:

  • Backward compat: accept path (current MCP arg) AND prefix; use the first non-empty (sketch after this list).
  • Add limit: optional, default 50, max 100. Already in code path.
  • Add offset: optional, default 0. Already in code path.
  • No recursive/depth in this phase — prefix already gives recursive subtree behaviour.
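
A sketch of that normalization at the dispatch site, reusing the existing _normalize_list_pagination (the args dict name is illustrative):

# Hypothetical normalization inside the list_documents dispatch branch:
prefix = (args.get("prefix") or args.get("path") or "").strip()  # first non-empty wins
limit, offset = _normalize_list_pagination(args.get("limit"), args.get("offset"))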

Response shape stays compatible:

{
  "items": [...],
  "count": <items.length>,
  "returned_count": <items.length>,
  "limit": <int>,
  "offset": <int>,
  "next_offset": <int|null>,
  "truncated": <bool>
}

If we drop the global total_count semantics, document the change in the schema description.

1.6 Expected performance

Conservative targets after D1 only (SQL pushdown, no new index):

Scenario Before After D1
Single request, no prefix, limit=50 ~900 ms p50 ~80 ms p50
Single request, narrow prefix, limit=50 ~900 ms p50 ~90 ms p50
Concurrency 5 p95 4 470 ms <= 250 ms
Concurrency 10 p95 (untested) <= 400 ms

After also adding the partial text_pattern_ops index:

Scenario After D1 + index
Single request, narrow prefix < 30 ms p50
Concurrency 5 p95 < 150 ms
Concurrency 10 p95 < 250 ms

Production target: list_documents p95 < 300 ms at concurrency 10. Achievable with D1; index makes it comfortable.

1.7 Scaling estimate

Rows alive Current behaviour (full scan + Python) After D1
3 k ~900 ms / call ~80 ms
30 k ~10 s / call (would already be DoS-able) ~100 ms
300 k timeouts ~200 ms

1.8 Risks

  • R1. LIKE 'prefix%' with user-supplied prefix requires escaping % _ \ (psycopg2 parameter binding doesn't escape LIKE wildcards). Mitigation: prefix.replace('\\','\\\\').replace('%','\\%').replace('_','\\_') + '%'.
  • R2. text_pattern_ops orders bytewise, not by locale. Output order will differ slightly from current Python .sort() (which is also bytewise on str). Verify: no test asserts a locale-specific order.
  • R3. Removing stream_docs callers — reindex_kb_documents, cleanup_orphan_vectors, _compute_data_integrity, audit_sync still need full iteration. Keep stream_docs (or rename to iter_all_docs and yield via server-side cursor) but don't use it for list_documents. Lower priority follow-up: convert reindex/cleanup to chunked cursor.
  • R4. LIST_DOCUMENTS_MAX_RESPONSE_BYTES guard already exists (server.py:2085) — keep it; with limit=100 and rich metadata, response bytes stay well under guard.

1.9 Tests after D1

  1. Unit: prefix=empty, prefix='knowledge/', prefix with % and _, prefix='nonexistent/'.
  2. Pagination boundary: offset=0/24/49/50/99/100; limit=1/50/100/101 (reject).
  3. Soft-delete behaviour: deleted_at set -> excluded from results.
  4. Ordering stability: same call twice returns same order.
  5. Concurrency 5, 10, 20 — p95 < target.
  6. Backward compat: MCP args.path still works.
  7. EXPLAIN check (in CI): assert plan uses Index Scan not Seq Scan (optional, requires test DB).
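
A minimal pytest sketch for tests 1 and 6, assuming a TestClient fixture named client with auth handled elsewhere; the result.content[0].text envelope shape follows the observation in S3.1:

import json

def _tools_call(name, arguments):
    # Hypothetical JSON-RPC envelope builder for tests.
    return {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
            "params": {"name": name, "arguments": arguments}}

def test_wildcard_prefix_is_literal(client):
    # Test 1: '%' in the prefix must match literally, not as a LIKE wildcard (R1).
    r = client.post("/mcp", json=_tools_call("list_documents", {"path": "know%"}))
    payload = json.loads(r.json()["result"]["content"][0]["text"])
    assert payload["count"] == 0

def test_path_arg_backward_compat(client):
    # Test 6: legacy args.path still selects the subtree.
    r = client.post("/mcp", json=_tools_call("list_documents", {"path": "knowledge/"}))
    payload = json.loads(r.json()["result"]["content"][0]["text"])
    assert all(i["document_id"].startswith("knowledge/") for i in payload["items"])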

Part 2 — MCP transport / content negotiation design

2.1 Current state

App (agent_data/server.py):

  • @app.post("/mcp") (line 3074) — returns plain dict via FastAPI auto-conversion. Implicitly Content-Type: application/json.
  • @app.post("/mcp-readonly|gpt|gpt-full") (line 3353-3380) -> _mcp_filtered_handler (line 3214) — returns JSONResponse(content=...). Explicitly application/json.
  • No Accept header is read.
  • No text/event-stream path. No streaming.
  • No keep-alive ping. No session id. No Last-Event-ID.

Nginx (/gpt-mcp/{token}/mcp -> /mcp-gpt):

  • proxy_buffering off OK
  • chunked_transfer_encoding on OK
  • proxy_cache off OK
  • proxy_read_timeout 60s (would need extending if responses become true server-sent streams)
  • Connection '' OK (prevents keep-alive from being forwarded; allows hop-by-hop chunked)
  • X-Accel-Buffering: no is not set explicitly (not strictly required when buffering is already off).

Conclusion: nginx is SSE-ready for the /gpt-mcp route. The blocker is the app.

2.2 Why FastMCP saw "Unexpected content type"

FastMCP (recent versions, and the ChatGPT/Claude.ai connectors) follow the MCP Streamable HTTP transport (spec rev 2025-03-26). The client sends:

POST /mcp HTTP/1.1
Accept: application/json, text/event-stream
Content-Type: application/json

It then accepts either:

  • a single JSON response (Content-Type: application/json), OR
  • an SSE stream (Content-Type: text/event-stream) carrying one or more event: message frames with the JSON-RPC response.

Older clients warned or failed when the server didn't answer with SSE for tools/call results, especially long-running ones. The "Unexpected content type" message in our trace likely came either from a FastMCP version that strictly expected text/event-stream for some operation, or from the ChatGPT MCP App, which (per Anthropic/OpenAI docs as of cutoff) requires the SSE variant of Streamable HTTP. Either way the server-side delta is the same: support content negotiation.

2.3 Design — Phase A vs B

Option A = JSON hardening; Option B = JSON + SSE content negotiation.

  • Always send explicit Content-Type: application/json; charset=utf-8: A yes; B yes, when Accept lacks SSE.
  • Honour Accept: text/event-stream: A rejects with 406 or keeps JSON; B returns SSE-framed JSON-RPC.
  • Code surface: A small (1–2 helpers); B small-medium (1 helper + streaming wrapper).
  • Affects nginx: A no; B yes, for routes other than /gpt-mcp (/api/ would need proxy_buffering off if used for SSE).
  • Breaks current FastMCP/curl clients: A no; B no (a single-event SSE response is valid JSON-RPC; the client picks via Accept).
  • Unblocks ChatGPT MCP App: A probable but not certain; B strongly likely.
  • Future Streamable HTTP features (resumability, server pushes): A no; B becomes feasible.

Recommendation: Implement B, but in two micro-steps:

  • D2a: Add a tiny helper that inspects Accept and serializes the JSON-RPC response either as JSONResponse(content=..., media_type="application/json") or as StreamingResponse(... , media_type="text/event-stream") with a single SSE frame. Apply to /mcp, /mcp-readonly, /mcp-gpt, /mcp-gpt-full. Don't introduce session-id or resumability yet.
  • D2b (later): If ChatGPT App needs keep-alive comments or multiple frames, extend with event: message + event: ping every N seconds, and Mcp-Session-Id support. Not required to unblock.

2.4 Helper sketch (descriptive — NOT to be implemented in this phase)

Conceptual response builder:

def _mcp_respond(payload: dict, accept: str) -> Response:
    # D2a decision rule (R5): stream only when the client asks for SSE
    # and does NOT also accept plain JSON.
    accept = accept or ""
    if "text/event-stream" in accept and "application/json" not in accept:
        body = f"event: message\ndata: {json.dumps(payload, default=str)}\n\n"
        return StreamingResponse(
            iter([body]),
            media_type="text/event-stream",
            headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
        )
    return JSONResponse(content=payload, media_type="application/json; charset=utf-8")

  • For requests that return immediately (current pattern, all tools), one SSE frame is enough — clients accept it.
  • event: message is the spec-recommended event name; some clients accept unnamed data: events too.
  • No keep-alive ping needed because every response closes the stream.
  • For notifications/* (currently 204 No Content), keep 204 — SSE only fires when there's a JSON-RPC reply.
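
Applying it would then be a one-line change at each of the four return sites, assuming the handlers have (or gain) a fastapi.Request parameter:

# e.g. inside _mcp_filtered_handler, replacing the JSONResponse return:
return _mcp_respond(payload, request.headers.get("accept", ""))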

2.5 Nginx changes

  • /gpt-mcp/{token}/mcp location — already SSE-ready. No change needed. Bump proxy_read_timeout from 60s to 120s if a future stream may sit idle.
  • location /api/ (general AgentData proxy) — currently has proxy_read_timeout 300s but does not disable buffering. If anyone hits /api/mcp via this route and asks for SSE, nginx will buffer. Add proxy_buffering off; proxy_cache off; in /api/ when D2a ships. Safe — these routes are typically short JSON responses, but proxy_buffering off doesn't hurt them.

2.6 Risks

  • R5. Some HTTP clients send Accept: */*; the helper must treat */* as "JSON is fine" (don't auto-stream). A fully correct rule would honour q-weighting, but the simplest heuristic ("stream whenever text/event-stream is present") would regress clients that send Accept: application/json, text/event-stream;q=0.1. Tradeoff for D2a: stream only if Accept contains text/event-stream AND does NOT contain application/json. The ChatGPT MCP App sends only text/event-stream for the Streamable HTTP variant; the Claude.ai connector sends both. This rule keeps current curl -H 'Accept: application/json' clients 100 % unchanged and gives the SSE-strict client what it wants.
  • R6. Order of operations: handler currently builds the dict, then returns. With SSE wrapper, errors must be in-stream JSON, not HTTP 500. The current code path already returns {"isError": true, ...} in the JSON-RPC envelope; preserve that.
  • R7. Stream framing bytes: event: message\ndata: <json>\n\n. The terminating blank line (the \n\n) is critical: it marks end-of-event. Test with curl --no-buffer (snippet below).
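
A framing check of the kind R7 calls for (host and token are placeholders):

curl --no-buffer -sS -D- -X POST 'https://<host>/gpt-mcp/<token>/mcp' \
  -H 'Accept: text/event-stream' \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'
# Expect header: Content-Type: text/event-stream
# Expect body:   event: message
#                data: {"jsonrpc":"2.0","id":1,"result":{...}}
#                (blank line terminates the event; connection then closes)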

2.7 Test matrix

Test Expectation
curl -H 'Accept: application/json' POST /mcp Content-Type: application/json; charset=utf-8, single dict body
curl -H 'Accept: application/json, text/event-stream' POST /mcp (Claude.ai mode) application/json (R5 rule)
curl -H 'Accept: text/event-stream' POST /mcp Content-Type: text/event-stream, body event: message\ndata: {...}\n\n, connection closed
curl --no-buffer -H 'Accept: text/event-stream' POST /gpt-mcp/.../mcp nginx passes through unbuffered
FastMCP client initialize + tools/list + tools/call All succeed
ChatGPT MCP App handshake against /gpt-mcp/.../mcp (post D2a deploy) Succeeds; tools list loads; one tool call returns

Part 3 — move_document decision

3.1 Mystery resolved

agent_data/server.py:1624-1636:

@app.post("/documents/{doc_id:path}/move", response_model=DocumentResponse)
async def move_document(doc_id, payload, _=Depends(require_api_key)):
    # DEPRECATED S170: move_document only changed parent_id metadata,
    # did NOT rename PG key or Qdrant document_id. Use copy+delete instead.
    raise _error(
        501,
        "NOT_IMPLEMENTED",
        "move_document is deprecated. Use: (1) upload_document to new path, (2) delete old path.",
        document_id=doc_id,
    )
    try:
        ...  # <- original implementation, unreachable

The function is a kill switch: the raise is at the very top of the body, leaving the original logic as dead code below. Decision was made in S170 because move only updated parent_id metadata without renaming the PG key (which is the canonical _fs_key form) or the Qdrant document_id, leaving the document discoverable under two paths and search indices stale.

The MCP schema (MCP_TOOLS) and the dispatcher (_dispatch_mcp_tool line 3057) still advertise and route the call — so any GPT/Claude that calls it gets 501 NOT_IMPLEMENTED returned wrapped in the JSON-RPC result.content[0].text (because the catch in _mcp_filtered_handler line 3326 turns HTTPException into a structured error). That matches the Phase 2F-A observation.

3.2 Policy options

  • A. Implement properly (rename key + rename Qdrant document_id atomically).
  • B. Remove from MCP_TOOLS AND remove from MCP_GPT_FULL_TOOLS so the schema matches the runtime.
  • C. Keep schema, prefix description with [DEPRECATED].
  • D. Keep in /mcp-gpt-full only; remove from public /mcp-gpt.

/mcp-gpt already excludes move_document. /mcp-gpt-full is not nginx-exposed. So the live impact is small — the only path that lets a client call it is /mcp (internal) or /mcp-gpt-full (not exposed). Still: the schema-runtime mismatch is a correctness defect and confuses every LLM that reads the tools list.

3.3 Recommendation

B. Remove move_document entirely from:

  • MCP_TOOLS list (server.py:2722) — schema
  • MCP_GPT_FULL_TOOLS frozenset (server.py:2708) — allowlist
  • _dispatch_mcp_tool branch (server.py:3061) — dispatcher
  • Keep @app.post("/documents/{doc_id:path}/move") returning 501 — backward compat for any internal caller.

Rationale:

  • The runtime decision (S170) is months old and the right call: copy+delete is a clean replacement.
  • Leaving a tool the LLM can attempt but will always fail wastes a tool slot in the LLM's instruction budget and pollutes audit logs.
  • Re-implementing properly is non-trivial (atomic rename across PG + Qdrant + invalidate caches + handle deep-prefix children) and is not required for production.

If business later needs first-class rename: re-introduce as rename_document with a proper transactional implementation.

3.4 Tests after B

  • tools/list on /mcp-gpt-full does NOT include move_document.
  • tools/call with name=move_document on any profile returns -32601 Tool not allowed (or Unknown tool).
  • Direct POST /documents/{x}/move still returns 501 (regression: don't accidentally re-enable the dead code).

Part 4 — Write latency design

4.1 Current path (per upload/update/patch)

For example update_document (server.py:1490-1622):

  1. _ensure_pg
  2. pg_store.get_doc(...) — 1 SELECT by PK (~5–15 ms)
  3. Compute new content/metadata
  4. pg_store.update_doc(..., expected_revision=...) — 1 UPDATE + 6 triggers fire:
    • trg_kb_updated_at (BEFORE)
    • trg_kb_truncation_guard (BEFORE)
    • trg_kb_audit (AFTER)
    • trg_kb_snapshot_update (BEFORE)
    • trg_context_pack_law_enact (AFTER, conditional)
    • trg_kb_vector_sync (AFTER) -> pg_notify('kb_vector_sync', ...) (~30–80 ms total for all triggers)
  5. Inline sync: _delete_vector_entry(doc_id) -> vector_store.delete_document (Qdrant call, ~50–150 ms)
  6. Inline sync: _sync_vector_entry(...) -> vector_store.upsert_document:
    • _split_text (CPU, small)
    • For each chunk: self._openai.embeddings.create(...) (sequential, ~150–400 ms/chunk over HTTPS)
    • self._qdrant_upsert(points) with wait=True (~50–150 ms)
    • pg_store.update_doc(..., {"vector_status": "ready"}) — second UPDATE, fires triggers again (!)
  7. emit_fire_and_forget(DOCUMENT_UPDATED, ...) (~microseconds)

Plus in parallel (background thread):

  • pg_vector_listener.py receives the NOTIFY from step 4
  • Calls vector_store.upsert_document again for the same doc

Net result: each write does the Qdrant upsert twice (once inline, once async). The inline path is the one that contributes to user-visible p95.

4.2 What is mandatory sync vs async

Step Must be sync? Why
PG read (get) yes required for revision check / not-found semantics
PG write (update_doc) yes confirms persistence, returns 200
Revision bump yes atomic with PG write
Trigger fires NOTIFY yes (PG-side) already in trigger
Delete old Qdrant vectors no the listener can do this; user only cares once data lands
OpenAI embed no costliest step, async-friendly
Qdrant upsert no ditto
vector_status update no listener updates after embed
event_bus emit n/a already async (fire-and-forget)

4.3 Options

Option 1 — keep sync, batch + parallelize embeds

Stay sync but:

  • Use OpenAI batch embeddings: one HTTPS call with N chunk inputs returns N vectors, cutting a 3-chunk doc from 3 x 250 ms to ~300 ms (sketch after this list).
  • Use asyncio.to_thread or ThreadPoolExecutor to run delete + upsert in parallel.
  • Drop wait=True on Qdrant upsert; the listener already deals with eventual consistency. (Note: Qdrant wait=False returns immediately; the server still applies the upsert.)
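
A sketch of the batched call, assuming the OpenAI v1 Python client already held in self._openai (EMBED_MODEL stands in for whatever model constant vector_store uses):

# One HTTPS round-trip embeds every chunk; resp.data preserves input order.
resp = self._openai.embeddings.create(model=EMBED_MODEL, input=chunks)
vectors = [item.embedding for item in resp.data]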

Expected p95: ~500–700 ms (down from 1.3 s).

Risk: still on the critical path; OpenAI rate-limit at concurrency 10+ will push tail latency. Doesn't address the double work with the listener.

Option 2 — async vector via existing listener

Remove the inline _delete_vector_entry + _sync_vector_entry from create_document, update_document (content branch), and patch_document. The PG-NOTIFY trigger already publishes; the listener already consumes.

Set vector_status: "pending" in the same UPDATE that lands the content (it's already done for create — server.py:1456).

Expected p95: ~80–150 ms for upload/update/patch (dominated by PG round-trips + triggers).
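
Under Option 2 the content branch of update_document reduces to roughly the following (a sketch against the names observed in 4.1; key, new_data and expected_revision are illustrative locals, and the update_doc signature follows the existing handlers):

# Persist content + mark pending; trg_kb_vector_sync NOTIFYs and
# pg_vector_listener does the embed/upsert off the request path.
new_data["vector_status"] = "pending"
pg_store.update_doc(KB_COLLECTION, key, new_data, expected_revision=expected_revision)
# (no _delete_vector_entry / _sync_vector_entry here any more)
emit_fire_and_forget(DOCUMENT_UPDATED, document_id=doc_id)  # kwargs illustrative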

Risks:

  • Eventual consistency: search_knowledge may miss the new content for the lag between PG commit and listener processing (typically <2 s on this hardware).
  • Listener failure-mode: if the listener thread dies, vectors fall behind silently. Today's code already runs /kb/audit-sync and /kb/reindex-missing endpoints to recover; surface this in monitoring/health.
  • Documentation contract: clients (GPT) must be told: writes are committed sync; semantic indexing eventually consistent. get_document always reflects current content.

Option 3 — hybrid by body size

Threshold env (e.g. MCP_SYNC_EMBED_MAX_CHARS=2000):

  • Small docs (<=2 000 chars, ~1 chunk) -> sync embed (~250 ms). User sees searchable doc immediately.
  • Larger docs -> fall through to listener (async).

Expected p95: ~250 ms small, ~120 ms large.

Risk: behaviour split; harder to reason about.

4.4 Recommendation

Option 2 for production, with these guardrails:

  1. Remove inline _delete_vector_entry + _sync_vector_entry calls in create_document, update_document (content branch), patch_document. Keep the vector_status: "pending" write.
  2. Verify the PG-NOTIFY trigger fires for all three paths (UPDATE on data IS DISTINCT FROM — already in code) — confirmed via pg_get_functiondef.
  3. The listener has resilience: it reconnects on PG drop (poll loop in pg_vector_listener.py). Add a Prometheus/log gauge vector_sync_lag_seconds = NOW() - min(updated_at where vector_status='pending'), i.e. the age of the oldest pending doc, and alert if > 30 s (SQL sketch after this list).
  4. Existing /kb/reindex-missing (server.py:2317 area) becomes the recovery handle.
  5. Add to MCP update_document/patch_document description: "Document content is persisted synchronously; semantic indexing is eventual-consistent (typically <2 s). Use get_document to read back content."
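
A sketch of the gauge query for item 3, measuring the age of the oldest still-pending document (COALESCE makes the gauge 0 when nothing is pending):

SELECT COALESCE(
         EXTRACT(EPOCH FROM now() - min(updated_at)),
         0
       ) AS vector_sync_lag_seconds
FROM kb_documents
WHERE data->>'vector_status' = 'pending';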

4.5 What about _delete_vector_entry on delete and upsert path?

The PG trigger emits op=DELETE for hard-delete and op=DELETE on soft-delete transition (deleted_at NULL -> NOT NULL). The listener calls vector_store.delete_document. So delete_document handler also doesn't need the inline call — same simplification applies.

4.6 Risk register

  • R8. Soft-delete-then-immediate-search race: user soft-deletes, calls search_knowledge within 1 s, gets the deleted doc back. Acceptable; document.
  • R9. OpenAI/Qdrant outage during listener catchup: docs stuck on vector_status=pending indefinitely. Mitigate with the vector_sync_lag_seconds alert (4.4, item 3) and a periodic auto-reindex cron (/kb/reindex-missing).
  • R10. Order of operations under concurrent updates: listener may apply embeds out of order. The listener fetches current data on each notification (pg_store.get_doc(...) in pg_vector_listener._handle_notification), so the final embedded state always matches the final committed state. Already correct.

4.7 Tests after D4

  • Upload doc -> respond <200 ms -> wait 5 s -> search_knowledge finds it.
  • Update doc -> respond <200 ms -> get_document returns new content immediately.
  • Patch doc -> respond <200 ms; verify vector_status transitions pending -> ready within 5 s.
  • Soft-delete -> respond <200 ms; verify Qdrant entry removed within 5 s.
  • Concurrency 5 across upload/update/patch -> all p95 <300 ms.
  • Kill listener -> writes still succeed; documents go pending; restarting listener catches up.

Part 5 — Endpoint policy for GPT

5.1 Threat surface per tool

Tool Read/Write Risk if mis-issued by LLM Reversible?
search_knowledge R none n/a
list_documents R none n/a
get_document R none n/a
get_document_for_rewrite R none n/a
batch_read R none n/a
upload_document W wrong path, duplicate doc yes (delete)
update_document W overwrites a doc yes (snapshot trigger captures prior)
patch_document W targeted edit yes (snapshot)
delete_document W (soft) drops a doc from list/search yes (set deleted_at back to NULL)
move_document W currently 501 — recommend remove (Part 3) n/a
ingest_document W SSRF / external fetch from LLM-controlled URL partial (delete after)

5.2 Recommendation

Keep current production policy: /gpt-mcp/{token}/mcp -> /mcp-gpt (8 tools, no delete/move/ingest) is the only public surface.

  • Do NOT expose /mcp-gpt-full publicly via nginx until each high-risk tool has:
    • SSRF allow-list for ingest_document (only gs:// + a known set of https:// hosts).
    • Rate-limit and audit-log shipping for delete_document.
    • A re-implementation or removal of move_document per Part 3.
  • Admin/manual mode: keep /mcp-gpt-full as an internal-only route (no public nginx exposure, only reachable from inside Docker network or with a separate, more restricted nginx route gated by both API key and source IP).
  • Audit log: ensure connector_call wrapper=delete_document ... is captured in nginx access log (currently access_log off on /gpt-mcp/...). Recommend re-enabling structured access log on the /gpt-mcp location with a low-cardinality format (sketch below) — knowing which tool was invoked is essential for incident response.
  • Rotate the nginx-side gpt-mcp token after this design phase (separate ticket; not now). It was observed in clear in the conf file and in a previous deploy log per the comment.
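
A hedged sketch of such a format. The tool= field assumes the app echoes an X-MCP-Tool response header (not something the code does today); $sent_http_* is standard nginx:

# Sketch only. log_format lives at the http{} level.
# tool= relies on the app echoing an X-MCP-Tool response header (an assumption).
log_format mcp_audit '$remote_addr $time_iso8601 $status $request_time '
                     'tool=$sent_http_x_mcp_tool';

# Inside the /gpt-mcp location, replacing access_log off:
access_log /var/log/nginx/gpt-mcp.access.log mcp_audit;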

5.3 Rollback

include /etc/nginx/secrets/gpt-mcp-route.conf; is one line — comment it out to remove the public route entirely. Reload nginx. No upstream change required.


Part 6 — Patch sequence

Each phase is one PR-sized change. No phase requires nginx changes unless explicitly listed.

2F-D1 — list_documents SQL pushdown + index

Goal Push prefix, sort, limit, offset to PostgreSQL. Add partial pattern index.
Files agent_data/pg_store.py (add list_kb_docs(prefix, limit, offset)), agent_data/server.py (rewrite _list_kb_documents_page). DB migration SQL applied out-of-band (CREATE INDEX CONCURRENTLY).
MCP schema change None (current path/limit/offset preserved; behaviour identical apart from the total_count semantics — keep or drop per S1.3).
Nginx change None.
Risk LIKE escaping (R1); ordering change (R2); stream_docs callers untouched (R3).
Rollback Revert PR + DROP INDEX CONCURRENTLY idx_kb_documents_doc_id_pattern;.
Tests S1.9.
Pass criteria Functional MCP unchanged; list_documents single p50 < 100 ms, concurrency 5 p95 < 300 ms; no Seq Scan in EXPLAIN.
Expected impact 4.47 s -> <300 ms p95 @ c5.

2F-D2a — Transport content negotiation (JSON + SSE)

Goal Honour Accept: text/event-stream; keep JSON for everyone else.
Files agent_data/server.py (one helper _mcp_respond, applied at the 4 return sites in /mcp, _mcp_filtered_handler).
MCP schema change None.
Nginx change Optionally add proxy_buffering off; proxy_cache off; to location /api/ for symmetry; not strictly required if ChatGPT only hits /gpt-mcp/....
Risk R5 Accept-rule heuristic; R6 in-stream error envelopes; R7 framing bytes.
Rollback Revert PR. Nginx untouched.
Tests S2.7 matrix.
Pass criteria All current clients (curl, FastMCP, Codex, Claude) unchanged. SSE-strict client receives valid text/event-stream framing for initialize, tools/list, tools/call.
Expected impact Unblocks ChatGPT MCP App handshake (high confidence).

2F-D3 — move_document schema cleanup

Goal Remove move_document from MCP tools list + dispatcher; keep HTTP endpoint returning 501.
Files agent_data/server.py (delete branch in _dispatch_mcp_tool; remove entry from MCP_TOOLS; remove from MCP_GPT_FULL_TOOLS).
MCP schema change One tool fewer (MCP_TOOLS 11 -> 10; MCP_GPT_FULL_TOOLS 11 -> 10). connectorSchemaVersion will need a bump; _mcp_schema_hash() auto-updates.
Nginx change None.
Risk Any client still calling move_document over MCP gets Unknown tool instead of NOT_IMPLEMENTED. Should be no impact since GPT/Claude read the dynamic tools list.
Rollback Revert PR.
Tests S3.4.
Pass criteria tools/list excludes move_document; HTTP /documents/{x}/move still 501.
Expected impact Schema-runtime correctness; LLM no longer attempts deprecated tool.

2F-D4 — Async write path via existing listener

Goal Remove inline _delete_vector_entry + _sync_vector_entry from create/update/patch/delete handlers; rely on PG-NOTIFY listener.
Files agent_data/server.py (create_document, update_document, patch_document, delete_document — strip the inline vector calls; keep vector_status: "pending" write). Add a vector_sync_lag_seconds health metric.
MCP schema change Description update on write tools to declare eventual-consistent semantic indexing.
Nginx change None.
Risk R8/R9/R10 above.
Rollback Revert PR. Listener keeps running; documents reach consistency either way.
Tests S4.7.
Pass criteria Upload/update/patch p95 < 300 ms at concurrency 10; vector_sync_lag_seconds < 5 s steady-state; reindex-missing catches all stragglers within 1 cron cycle.
Expected impact 0.8–1.3 s -> <250 ms p95.

2F-D5 (optional, later) — Streamable HTTP resumability

Session id (Mcp-Session-Id header), Last-Event-ID, keep-alive ping. Only needed if/when long-running tool calls require server-pushed progress events. Out of scope to unblock ChatGPT.


Part 7 — Production pass criteria

7.1 Functional

Tool Pass
search_knowledge returns >=1 hit on 'operating rules SSOT', latency < 2 s
list_documents (no prefix) returns first 50; next_offset present
list_documents (prefix='knowledge/test/') returns expected subset
get_document found = full payload; missing = {error: "not found"}
get_document_for_rewrite full body returned
batch_read (3 paths) returns 3 items, truncated by default
upload_document new doc visible in get_document immediately; in search <5 s
update_document revision +1; new content readable immediately
patch_document one-replace succeeds; ambiguous returns 409
delete_document soft-delete; excluded from list_documents; vector gone <5 s
move_document (after D3) not present in tools/list
ingest_document (only via /mcp-gpt-full admin) honours allow-list

7.2 Latency targets (p95)

Path After D1+D2+D4 After D1+D2+D4 with index
initialize < 50 ms < 50 ms
tools/list < 50 ms < 50 ms
get_document < 200 ms < 200 ms
batch_read (3 paths) < 300 ms < 300 ms
list_documents single < 100 ms < 50 ms
list_documents concurrency 5 < 250 ms < 150 ms
list_documents concurrency 10 < 400 ms < 250 ms
upload_document < 250 ms < 250 ms
update_document < 250 ms < 250 ms
patch_document < 250 ms < 250 ms
search_knowledge < 2 s (OpenAI bound) unchanged

7.3 Transport pass

  • Accept: application/json -> JSON, byte-identical to today.
  • Accept: application/json, text/event-stream -> JSON (heuristic R5).
  • Accept: text/event-stream -> SSE with one event: message frame, data: <json>\n\n.
  • FastMCP client initialize -> tools/list -> tools/call E2E green.
  • ChatGPT MCP App handshake + tools/list + one list_documents call green (live test after D2a deploys).

7.4 Stress

  • Concurrency 5 and 10 across mixed tool calls (1 minute, 100 rps target).
  • error_rate == 0 for all valid calls.
  • No 5xx, no read timeouts, no Qdrant wait=True timeouts.
  • Container CPU < 50 % avg, memory stable, no goroutine/thread leak.

7.5 Safety

  • No secret rendered in logs (current code uses <redacted> for env dump; verify same in any added log statements).
  • /mcp-gpt-full not nginx-exposed.
  • delete_document audited at nginx layer (re-enable access_log on /gpt-mcp/...).
  • Rotate /gpt-mcp/{token} after D1+D2 ship (separate task).
  • Rollback documented per patch (Part 6).

Part 8 — Recommendation summary

Implement D1 + D2a before re-connecting ChatGPT. Both are surgical, small-blast-radius, and address the two top blockers (list latency, transport).

Then implement D3. Trivial cleanup; gets the schema honest.

D4 is high-value but bigger surface; prefer to land it as a separate session after D1+D2a are verified in production. Existing listener already covers the async path — D4 just removes the redundant sync path.

Do not expose /mcp-gpt-full publicly. Current 8-tool policy is correct for autonomous GPT.


Appendix A — Observed facts (read-only inspection)

  • kb_documents row counts and sizes: total=5040, alive=2907, table 27 MB, idx 1.7 MB combined.
  • EXPLAIN of current stream_docs: Seq Scan, 4.0 ms execution; cost is in transfer + Python materialization.
  • EXPLAIN of proposed pushed-down query: 71–97 ms execution; would drop to <10 ms with partial text_pattern_ops index.
  • pg trigger fn_kb_notify_vector_sync emits pg_notify('kb_vector_sync', json) on INSERT/UPDATE/DELETE with skip rules for operations/tasks/comments/, registries/, empty body.
  • pg listener pg_vector_listener.py is running (PG->Qdrant vector sync listener started in container logs).
  • All /mcp* endpoints return JSON only; nginx /gpt-mcp/{token}/mcp already has SSE-friendly proxy settings.
  • move_document raises 501 NOT_IMPLEMENTED unconditionally at function entry (server.py:1631-1635).

Appendix B — Open questions (for human approval)

  1. Drop total_count from list_documents response, or keep with extra COUNT(*) query? (Recommend drop.)
  2. Bump nginx proxy_read_timeout on /gpt-mcp/{token} from 60 s -> 120 s for SSE futures? (Optional.)
  3. Add structured access log on /gpt-mcp/{token} location for audit? (Recommend yes.)
  4. Schedule the gpt-mcp token rotation as a follow-up task? (Recommend yes, separate ticket.)
  5. Add vector_sync_lag_seconds health metric in D4? (Recommend yes.)

End of report. No code modified. No commit. No restart. No nginx reload.