KB-4C12

GPT MCP Connector Root Cause Timeline Investigation — 2026-05-12

Revision 1

Date: 2026-05-12
Agent: Codex/CoreX
Scope: GPT connector / Agent Data MCP ClientResponseError investigation
Mode: read-mostly investigation; no VPS restart, no Qdrant/PG mutation, no OGV-2C rollback

Executive conclusion

root_cause=GPT_CONNECTOR_WRAPPER_SCHEMA_DRIFT_AND_LIST_RESPONSE_SIZE

The evidence does not support a VPS, Agent Data, Qdrant, PG, nginx upstream, or OGV-2C server-side incident. During the failure window, Agent Data kept returning 200 for /api/mcp, /api/health, /documents/..., and OpenAPI discovery. Container uptime is stable, no deploy/restart occurred in the window, and nginx/Agent Data logs show no 5xx pattern.

The failure is concentrated at the GPT connector/action wrapper layer:

  1. healthCheck is a GPT/OpenAPI-style REST action, not a current MCP tool. Current MCP tool discovery exposes snake_case tools and no healthCheck.
  2. listDocuments is unsafe in the GPT/action wrapper because it can produce oversized responses. The REST route expects the query parameter prefix; if a wrapper sends the filter as path instead, Agent Data ignores it and returns the full KB (709367 bytes in the probes below). Even with the correct prefix, the reviews prefix is large enough to produce a 167057-byte JSON body, and an MCP-wrapped response of 192639 bytes was observed in the nginx log.
  3. searchKnowledge and getDocumentTruncated use narrower request/response paths, which explains why they can work while healthCheck and listDocuments fail.

No fix was applied because the writable GPT connector registry/gateway is not accessible from this Codex environment. The safe root fix is to update or rebind the GPT connector schema/wrapper, not to restart or roll back server infrastructure.

Rules and sources read

  • .claude/skills/incomex-rules.md
  • search_knowledge("operating rules SSOT") returned knowledge/dev/ssot/operating-rules.md with version marker v7.58.
  • search_knowledge("hiến pháp v4.0 constitution") returned knowledge/dev/laws/constitution.md with version marker v4.6.3.
  • search_knowledge("MCP Agent Data connector healthCheck listDocuments ClientResponseError GPT") returned prior Agent Data MCP diagnostic documents.
  • knowledge/dev/laws/dieu44-trien-khai/notes/opus-diagnostic-agent-data-mcp-connection-gpt-client-side-2026-05-12.md
  • knowledge/current-state/reports/gpt-mcp-agent-data-healthcheck-2026-04-20.md

Timeline, absolute timestamps

Timezone notes:

  • Agent Data and nginx Docker logs below are UTC.
  • Vietnam local time = UTC+07.
  • VPS local time = CEST UTC+02.

| Timestamp ICT | Timestamp UTC | Event | Evidence/log source | Conclusion |
|---|---|---|---|---|
| 2026-05-12 10:02:05 | 03:02:05Z | /kb/list?prefix=knowledge/dev/laws/dieu44-trien-khai/reports/ returned 200. | Agent Data log: GET /kb/list?prefix=...reports... 200 OK from connector network. | KB list route was alive before incident escalation. |
| 2026-05-12 10:05:22 | 03:05:22Z | /kb/list?prefix=.../notes/ returned 200. | Agent Data log: GET /kb/list?prefix=...notes... 200 OK. | Correct prefix filtering path worked. |
| 2026-05-12 10:06-10:16 | 03:06-03:16Z | Repeated /mcp and /health 200. | Agent Data/nginx logs show recurring POST /mcp 200 OK and GET /api/health 200. | Server was not down. |
| 2026-05-12 10:10:03 | 03:10:03Z | Client attempted OAuth discovery under /api/mcp/.well-known/oauth-authorization-server; received 404; immediately followed by /api/mcp 200 and 204/200 sequence. | nginx + Agent Data logs. | Connector/proxy performed discovery/reconnect behavior; 404 is on discovery path only, not /mcp failure. |
| 2026-05-12 10:17:08 | 03:17:08Z | Same OAuth discovery 404, followed by /api/mcp 200, 204, 200, 200, 200. | nginx + Agent Data logs. | Reconnect/schema negotiation continued to work at MCP route. |
| 2026-05-12 10:19:12-10:19:15 | 03:19:12-03:19:15Z | Three /api/mcp calls returned 200 with response sizes 862, 1042, 6222 bytes. | nginx logs from client IP 123.24.178.152. | Search/get-like MCP calls were succeeding during the window. |
| 2026-05-12 10:20:03-10:20:06 | 03:20:03-03:20:06Z | /api/mcp 200 and /api/health 200 from curl source. | nginx logs from 38.242.240.89. | Independent health smoke succeeded. |
| 2026-05-12 10:21:21 | 03:21:21Z | GET /api/documents/...opus-gate-review...?... returned 200, 894 bytes, user agent ChatGPT-User/1.0. | nginx log. | Direct get-document path worked from GPT/browser-style client during the incident window. |
| 2026-05-12 10:21:57 | 03:21:57Z | POST /api/mcp returned 200 but 192639 bytes; nginx warned upstream response buffered to temp file. | nginx log: upstream response is buffered to a temporary file then POST /api/mcp 200 192639. | list_documents/large result behavior is confirmed. Server returned 200; client-side size/parsing limits can still fail. |
| 2026-05-12 10:22:49 | 03:22:49Z | Five unauthenticated local JSON-RPC probes returned 401. | Agent Data log from 127.0.0.1; this was Codex's negative probe without API key. | Not GPT incident evidence; confirms auth gate works. |
| 2026-05-12 10:22:50-10:22:54 | 03:22:50-03:22:54Z | Internal route-size probes showed prefix vs path behavior. | Codex internal probes, see evidence section. | Wrong REST parameter path returns full KB. |
| 2026-05-12 10:23:51 | 03:23:51Z | OAuth discovery 404 again, immediately followed by /api/mcp 200/204/200/200/200. | nginx + Agent Data logs. | Discovery 404 is non-fatal; /mcp route kept working. |
| 2026-05-12 10:25:01-10:25:03 | 03:25:01-03:25:03Z | /api/mcp 200 and /api/health 200 from curl source. | nginx logs. | Server still healthy. |
| 2026-05-12 10:28:53-10:28:56 | 03:28:53-03:28:56Z | Codex direct MCP search_knowledge calls returned 200 with 7362/7958/6589 bytes. | nginx + Agent Data logs from current Codex session. | Codex MCP connection and search path healthy. |
| 2026-05-12 10:32:25-10:32:34 | 03:32:25-03:32:34Z | Codex-side acceptance smoke: search_knowledge, get_document, list_documents(path=notes) all succeeded. | Codex MCP tool results; nginx/Agent Data /mcp 200 logs. | Agent Data MCP server usable from Codex side after incident report. |
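
The ICT column in the timeline is derived from the UTC log timestamps using the offsets in the timezone notes. A minimal sketch of that conversion follows; the helper names parse_utc and to_local are illustrative, and the fixed offsets (UTC+07 for ICT, UTC+02 for CEST) are taken from the notes above rather than from a timezone database.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets copied from the timezone notes; no DST handling is attempted.
ICT = timezone(timedelta(hours=7), "ICT")
CEST = timezone(timedelta(hours=2), "CEST")


def parse_utc(stamp: str) -> datetime:
    """Parse a log timestamp such as '2026-05-12T03:21:57Z' as UTC."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def to_local(stamp: str, tz: timezone) -> str:
    """Render a UTC log timestamp in the given fixed-offset zone."""
    return parse_utc(stamp).astimezone(tz).strftime("%Y-%m-%d %H:%M:%S")


print(to_local("2026-05-12T03:21:57Z", ICT))   # 2026-05-12 10:21:57
print(to_local("2026-05-12T03:21:57Z", CEST))  # 2026-05-12 05:21:57
```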

Server and deploy state

Observed container state:

incomex-agent-data   agent-data-local:latest   Up 24 hours (healthy)
incomex-nginx        nginx:alpine              Up 11 days
postgres             postgres:16               Up 3 weeks (healthy)
incomex-qdrant       qdrant/qdrant:latest      Up 7 weeks (healthy)
agent-data StartedAt=2026-05-11T03:17:28.63541742Z
agent-data Image=sha256:be5a82c4caee3eaed0bbcf5efff51dcf07243e9079b5f47ea77001b7dc67a731

Recent Agent Data git history:

eaf2140 | 2026-05-11 05:19:31 +0200 | P3D: harden vector search rerank
ff2fc25 | 2026-05-11 04:40:41 +0200 | P3D vector search: app-layer path/title boost rerank
a40b217 | 2026-05-07 07:23:52 +0200 | OGV-2C: write gate

Conclusion: no Agent Data deploy, restart, or image change occurred in the 30-minute incident window, and nothing observed warrants an OGV-2C rollback.

Tool schema and wrapper evidence

Current MCP tool source exposes:

agent_data/server.py:
MCP_TOOLS includes:
- search_knowledge(query, limit)
- list_documents(path)
- get_document(document_id)
- get_document_for_rewrite(document_id)
No healthCheck MCP tool is defined.

_dispatch_mcp_tool:
if tool_name == "list_documents":
    return await list_kb_documents(prefix=args.get("path", "docs"))

Current REST/OpenAPI source exposes GPT/action style operations:

docs/api/openapi.yaml:
POST /chat operationId: searchKnowledge
GET /kb/list operationId: listDocuments, query parameter: prefix
GET /health operationId: healthCheck

Live public OpenAPI check:

https://vps.incomexsaigoncorp.vn/api/openapi.json 200 application/json
listDocuments params [('prefix', 'query')]
https://vps.incomexsaigoncorp.vn/api/health 200 application/json

Interpretation:

  • Codex MCP connector uses mcp__agent_data__.search_knowledge, list_documents(path), get_document(...).
  • GPT/action reports use camelCase names searchKnowledge, getDocumentTruncated, listDocuments(prefix), healthCheck.
  • A GPT wrapper that calls healthCheck as an MCP tool targets a tool that does not exist in the current MCP schema; such a wrapper is stale.
  • A GPT wrapper calling REST /kb/list with path instead of prefix loses filtering.
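
The two surfaces carry the same filter under different keys, which is where a wrapper can go wrong. The sketch below illustrates the mapping under stated assumptions: build_rest_list_request, build_mcp_list_call, and simulate_kb_list are hypothetical helpers, and simulate_kb_list is only a toy model of the observed behavior (unknown query keys are ignored), not Agent Data's actual implementation.

```python
from urllib.parse import urlencode


def build_rest_list_request(prefix: str) -> str:
    """REST/OpenAPI surface: the filter travels as the `prefix` query key."""
    return "/api/kb/list?" + urlencode({"prefix": prefix}, safe="/")


def build_mcp_list_call(prefix: str) -> dict:
    """MCP surface: the same filter travels as the `path` tool argument.

    The dict is the `params` object of an MCP `tools/call` request.
    """
    return {"name": "list_documents", "arguments": {"path": prefix}}


def simulate_kb_list(query: dict, all_ids: list) -> list:
    """Toy model: only `prefix` filters; any other key (e.g. `path`) is ignored."""
    prefix = query.get("prefix")
    if prefix is None:
        return list(all_ids)
    return [doc_id for doc_id in all_ids if doc_id.startswith(prefix)]


# Hypothetical document ids, used only to show the filtering difference.
ids = ["knowledge/notes/a.md", "knowledge/notes/b.md", "knowledge/reviews/c.md"]
print(len(simulate_kb_list({"prefix": "knowledge/notes"}, ids)))  # 2
print(len(simulate_kb_list({"path": "knowledge/notes"}, ids)))    # 3 (filter lost)
```

In this model the wrong key does not fail loudly; it silently widens the result, which matches the 688-byte versus 709367-byte difference recorded in the probes.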

Response-size and parameter evidence

Read-only internal route probes against Agent Data:

/kb/list?prefix=knowledge/dev/laws/dieu44-trien-khai/reviews 200 167057 application/json
/kb/list?path=knowledge/dev/laws/dieu44-trien-khai/reviews   200 709367 application/json
/kb/list?prefix=knowledge/dev/laws/dieu44-trien-khai/notes   200 688 application/json
/kb/list?path=knowledge/dev/laws/dieu44-trien-khai/notes     200 709367 application/json

Nginx observed the same class of risk through MCP:

2026-05-12T03:21:57Z warn upstream response is buffered to temp file while reading upstream, request: "POST /api/mcp"
2026-05-12T03:21:57Z "POST /api/mcp HTTP/1.1" 200 192639

Prior KB evidence from 2026-04-20:

listDocuments(prefix="knowledge/") -> ResponseTooLargeError
searchKnowledge and getDocumentTruncated succeeded in the same report

Conclusion: listDocuments has an established oversized-response failure mode. If wrapper mapping is wrong, even a narrow prefix like notes returns full KB. If mapping is correct but prefix is broad, the response can still exceed GPT connector limits.
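
Oversized responses like the one above can be spotted by scanning access-log lines for the byte count. The following is a rough sketch, assuming log lines contain the quoted request followed by status and size, as in the nginx excerpt; the 100000-byte threshold is an assumption, since the real GPT connector limit is not known from this environment. The second sample line is constructed for illustration using a response size from the timeline.

```python
import re
from typing import Iterable, List, Tuple

# Matches the request/status/size portion shown in the nginx excerpt.
ACCESS_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) (?P<size>\d+)'
)


def find_large_responses(
    lines: Iterable[str], threshold: int = 100_000
) -> List[Tuple[str, int, int]]:
    """Return (path, status, bytes) for responses larger than `threshold` bytes."""
    hits = []
    for line in lines:
        match = ACCESS_RE.search(line)
        if match and int(match.group("size")) > threshold:
            hits.append(
                (match.group("path"), int(match.group("status")), int(match.group("size")))
            )
    return hits


sample = [
    '2026-05-12T03:21:57Z "POST /api/mcp HTTP/1.1" 200 192639',
    '2026-05-12T03:19:15Z "POST /api/mcp HTTP/1.1" 200 6222',  # illustrative line
]
print(find_large_responses(sample))  # [('/api/mcp', 200, 192639)]
```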

GPT gateway log access

Direct GPT connector / MCP gateway logs were not accessible from this Codex workspace. Local Codex logs show rmcp request/response for this Codex session and successful Agent Data calls, but they do not contain the GPT-side ClientResponseError stack, request id, trace id, or wrapper parser failure.

Therefore:

  • exact GPT ClientResponseError stack: NOT_ACCESSIBLE
  • GPT connector cached schema before/after refresh: NOT_ACCESSIBLE
  • GPT gateway response parser/body limit log: NOT_ACCESSIBLE

This is the remaining uncertainty. The root cause above is still supported by server/proxy/source evidence and by prior GPT Action KB evidence.

Hypothesis matrix

| Hypothesis | Result | Evidence |
|---|---|---|
| H1. GPT connector uses stale schema after schema/tool registry refresh. | PARTIAL_PASS | GPT-facing names are camelCase REST actions; current MCP exposes snake_case only and no healthCheck. OAuth discovery/reconnect events appeared at 03:10, 03:17, 03:23 UTC, but GPT schema cache logs are not accessible. |
| H2. GPT wrapper calls healthCheck/listDocuments that no longer match current MCP schema. | PASS | Current MCP source has no healthCheck; list_documents takes path, while REST OpenAPI listDocuments takes prefix. |
| H3. listDocuments wrapper maps prefix to REST path, returning full KB. | PASS_RISK_CONFIRMED | /kb/list?path=...notes returned 709367 bytes while correct prefix returned 688 bytes. Exact GPT wrapper request body is not accessible, so mapping mistake is confirmed as a live hazard, not directly captured in GPT logs. |
| H4. GPT session/token expired or refresh failed after idle. | NOT_PRIMARY | Server logs show authenticated /api/mcp 200 and /documents 200 in the window. Only 401s observed were Codex's unauthenticated local negative probes from 127.0.0.1. |
| H5. MCP transport/proxy stale connection pool. | NOT_PROVEN | Discovery/reconnect events occurred, but every follow-up /api/mcp completed 200/204. No upstream 5xx/timeout evidence. |
| H6. Response size/timeout causes ClientResponseError. | PASS | Historical ResponseTooLargeError; current /api/mcp 192639 bytes with nginx temp buffering; REST wrong-param full KB is 709367 bytes. |
| H7. Tool registry/config changed in last 30 minutes. | FAIL_NOT_FOUND | No deploy/restart/git change in window. Current source/docs show a standing dual-surface schema drift, not a new server change. |
| H8. OGV-2C a40b217 affected more than create path. | FAIL | OGV-2C was 2026-05-07; Agent Data writes/read/search/list health succeeded after it; no server errors point to write gate. |
| H9. Server-side 500/502. | FAIL | nginx and Agent Data logs show 200/204/401 expected probes; no 500/502/504 pattern in incident window. |

Fix status

No server-side fix was applied in this pack:

  • no code change
  • no deploy
  • no restart
  • no Qdrant mutation
  • no PG mutation
  • no OGV-2C rollback
  • no connector registry edit, because GPT connector wrapper registry is not available from this Codex environment

The root fix must be applied in the GPT connector/action wrapper registry:

  1. Remove stale MCP healthCheck wrapper, or explicitly map GPT healthCheck to REST GET /api/health. Do not expose it as an MCP tool unless Agent Data adds an official MCP health tool.
  2. For listDocuments, use exactly one supported path:
    • MCP: call list_documents with path.
    • REST/OpenAPI: call /api/kb/list with query parameter prefix.
  3. Add default pagination or limit to listDocuments; do not allow full-KB returns by default.
  4. Add a hard response-size guard with a clear error such as LIST_DOCUMENTS_RESPONSE_TOO_LARGE_USE_NARROWER_PREFIX, instead of surfacing opaque ClientResponseError.
  5. Add connector schema version/hash and fail fast when GPT cached wrapper names do not match the current route/tool schema.
  6. Log per call: wrapper name, upstream route, mapped query/body keys, status, response bytes, and parse/body-limit failure reason. Do not log tokens or raw document bodies.
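
Items 3 and 4 can be sketched together as a default page size plus a hard size guard. This is an illustrative sketch, not the connector's implementation: guard_list_response, ListDocumentsTooLarge, DEFAULT_LIMIT, and MAX_RESPONSE_BYTES are hypothetical names, and both numeric values are assumptions because the actual GPT connector limit is not known from this environment.

```python
import json

MAX_RESPONSE_BYTES = 100_000  # assumption: the real connector limit is unknown
DEFAULT_LIMIT = 50            # assumption: any conservative default would do


class ListDocumentsTooLarge(Exception):
    """Raised instead of letting an opaque client-side error surface."""

    code = "LIST_DOCUMENTS_RESPONSE_TOO_LARGE_USE_NARROWER_PREFIX"


def guard_list_response(items, limit=DEFAULT_LIMIT, offset=0,
                        max_bytes=MAX_RESPONSE_BYTES) -> bytes:
    """Serialize one page of `items` and refuse bodies above `max_bytes`."""
    page = items[offset:offset + limit]
    body = json.dumps(
        {"items": page, "count": len(page), "total": len(items), "offset": offset}
    ).encode("utf-8")
    if len(body) > max_bytes:
        raise ListDocumentsTooLarge(
            f"{ListDocumentsTooLarge.code}: {len(body)} bytes > {max_bytes}"
        )
    return body


# Hypothetical listing of 200 documents; only the first 10 are returned.
docs = [{"document_id": f"doc-{i}"} for i in range(200)]
first_page = json.loads(guard_list_response(docs, limit=10))
print(first_page["count"], first_page["total"])  # 10 200
```

Returning total alongside count lets the caller decide whether to page further or narrow the prefix, rather than receiving the full KB by default.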

Acceptance tests

The required GPT-side 3-round acceptance test could not be run from this Codex environment because the GPT connector gateway/wrapper registry is not exposed as a callable surface here. This report therefore does not claim a GPT-side PASS.

Codex-side smoke after investigation:

search_knowledge("OGV-2C write gate a40b217") -> PASS, qdrant_hits=5, top result opus-gate-review-ogv-2c-case-closure-2026-05-07.md
get_document("knowledge/dev/laws/dieu44-trien-khai/reviews/opus-gate-review-ogv-2c-case-closure-2026-05-07.md") -> PASS, revision=1, truncated=true
list_documents(path="knowledge/dev/laws/dieu44-trien-khai/notes") -> PASS, count=2
HTTP /api/health -> PASS, 200 application/json

Required GPT-side post-fix test remains:

Round 1-3, spaced a few minutes:
1. searchKnowledge("OGV-2C write gate a40b217")
2. getDocumentTruncated("knowledge/dev/laws/dieu44-trien-khai/reviews/opus-gate-review-ogv-2c-case-closure-2026-05-07.md")
3. listDocuments(prefix="knowledge/dev/laws/dieu44-trien-khai/notes")
4. listDocuments(prefix="knowledge/dev/laws/dieu44-trien-khai/reviews", limit=10) if wrapper supports limit
5. health endpoint/tool officially mapped to HTTP /api/health, or remove stale healthCheck MCP wrapper

PASS criteria after connector fix:

0 ClientResponseError in 3 rounds
logs show listDocuments notes uses prefix/path correctly and does not return full KB
logs show reviews request is limited/paginated or rejected with explicit size error
logs show schema version/hash used by GPT connector
healthCheck either maps to HTTP /api/health or is removed as stale MCP wrapper
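
The first PASS criterion (zero errors across three rounds) could be checked with a small harness like the one below. It is a sketch under assumptions: run_rounds is a hypothetical helper, and the callables passed to it stand in for the GPT-side wrapper calls, which are not accessible from this environment. The pause between rounds is configurable and defaults to zero here so the example runs immediately.

```python
import time
from typing import Callable, Dict, List, Tuple


def run_rounds(calls: Dict[str, Callable[[], object]], rounds: int = 3,
               pause_s: float = 0.0) -> List[Tuple[int, str, str]]:
    """Run every call once per round; return (round, name, error type) failures."""
    failures = []
    for round_no in range(1, rounds + 1):
        for name, call in calls.items():
            try:
                call()
            except Exception as exc:  # any error counts against the PASS criterion
                failures.append((round_no, name, type(exc).__name__))
        if round_no < rounds:
            time.sleep(pause_s)  # in practice, a few minutes between rounds
    return failures


def ok_call():
    """Stand-in for a wrapper call that succeeds."""
    return "ok"


def failing_call():
    """Stand-in for a wrapper call that raises."""
    raise RuntimeError("simulated failure")


result = run_rounds({"searchKnowledge": ok_call, "listDocuments": failing_call})
print(len(result))  # 3 (one failure per round)
```

An empty return value corresponds to the "0 ClientResponseError in 3 rounds" criterion; the remaining criteria depend on connector logs and are not covered by this harness.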

Prevention

Permanent prevention should be implemented at connector/schema level:

  • publish one canonical connector manifest, generated from the live server schema;
  • include schema version/hash in every connector session;
  • reject stale wrappers when operation names or input keys drift;
  • add contract tests for listDocuments mapping: prefix must not be sent as path on REST, and MCP path must map to server prefix;
  • add default limit/pagination to list routes;
  • add response-size tests for notes, reviews, and root knowledge/;
  • add log fields for upstream route, mapped query keys, status, bytes, and parser failure class;
  • keep healthCheck as HTTP health unless an official MCP health tool is added.
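
The schema hash and stale-wrapper rejection can be sketched as a comparison between the wrapper's cached operations and the live schema. The helper names schema_hash and check_wrapper are hypothetical; the live entries mirror the MCP tools listed earlier in this report, while the cached entries are an invented example of drift, not a captured GPT connector configuration.

```python
import hashlib
import json


def schema_hash(operations: dict) -> str:
    """Stable SHA-256 over operation names and their input keys."""
    canonical = json.dumps(operations, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_wrapper(cached: dict, live: dict) -> list:
    """Return drift findings; an empty list means the wrapper matches."""
    problems = []
    for name, keys in cached.items():
        if name not in live:
            problems.append(f"unknown operation: {name}")
        elif sorted(keys) != sorted(live[name]):
            problems.append(
                f"input keys drifted for {name}: {sorted(keys)} != {sorted(live[name])}"
            )
    return problems


# Live MCP tools as listed in the tool schema evidence section.
live_mcp = {
    "search_knowledge": ["query", "limit"],
    "list_documents": ["path"],
    "get_document": ["document_id"],
    "get_document_for_rewrite": ["document_id"],
}
# Hypothetical stale wrapper cache, for illustration only.
cached_wrapper = {"healthCheck": [], "list_documents": ["prefix"]}

for problem in check_wrapper(cached_wrapper, live_mcp):
    print(problem)
# unknown operation: healthCheck
# input keys drifted for list_documents: ['prefix'] != ['path']
```

A connector could compare schema_hash of its cached operations with the hash of the live schema at session start and run check_wrapper only on mismatch, so the common case stays cheap.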

Final status

phase_status=PARTIAL_BLOCKED_EXTERNAL_GPT_CONNECTOR
root_cause=GPT_CONNECTOR_WRAPPER_SCHEMA_DRIFT_AND_LIST_RESPONSE_SIZE
confidence=medium_high
server_side_incident=false
ogv_2c_rollback_needed=false
fix_applied=false
no_mutation_performed=true_except_report_upload
recommended_next_action=GPT_CONNECTOR_SCHEMA_REBIND_AND_LISTDOCUMENTS_PAGINATION_FIX