AgentData MCP Timeout / KB Stability Investigation — 2026-05-18
0. Governance / scope
- Mission: investigate unstable AgentData MCP / KB reads reported during the dot-iu-cutter operation on 2026-05-18.
- Production mutation: none, except uploading this report to the KB via upload_document.
- Code touched: none.
- Secrets: not included; nginx route/API secret values were observed only to confirm configuration and are intentionally redacted from this report.
Three Manifesto statements
- Permanent: do not patch only this one incident; fix by adding chunked read contracts, request_id/duration logging, timeout alignment, and health/circuit checks so future large reads are bounded and diagnosable.
- Mistake-proof: enforce response-size limits and cursor/chunk APIs server-side so clients cannot accidentally request unbounded full payloads.
- 100% automated: add synthetic MCP probes and structured log alerts for 5xx/slow calls so instability is detected without manual GPT retry loops.
1. Rules / documents read
- .claude/skills/incomex-rules.md read locally; 36 rules acknowledged. Background agents not used.
- search_knowledge("operating rules SSOT"): found knowledge/dev/ssot/operating-rules.md, OR v7.58 dated 2026-05-01, and knowledge/dev/ssot/vps/vps-operating-rules.md v1.0.
- search_knowledge("hiến pháp v4.0 constitution"): current constitution source returned knowledge/dev/laws/constitution.md, metadata title Hiến pháp Kiến trúc Hệ thống Incomex v4.6.3 BAN HÀNH.
- Related MCP prior reports read/searched:
  - phase-2f-c-rev2-mcp-production-patch-design-2026-05-14.md
  - phase-2f-e2-chatgpt-mcp-patch-hang-diagnosis-2026-05-14.md
  - gpt-mcp-connector-root-cause-2026-05-12.md
2. Symptoms observed
Reported symptoms from GPT operating session:
- list_documents sometimes returned 504 upstream request timeout.
- get_document_for_rewrite sometimes returned 503 upstream connect error / disconnect/reset before headers.
- batch_read sometimes succeeded, while full reads of long files were more fragile.
- Affected paths included:
  - knowledge/dev/laws/dieu44-trien-khai/v0.5-constitution-source-document-seed-authoring/
  - knowledge/dev/laws/dieu44-trien-khai/v0.5-fabric-addendum-scope/
3. Reproduced yes/no
Partially reproduced.
- Hard 503/504 was NOT reproduced during this investigation.
- Slow connector behavior WAS reproduced through the MCP app layer:
  - list_documents for v0.5-constitution-source-document-seed-authoring/: success, 5 items, MCP tool wall time 16.1695s.
  - get_document_for_rewrite for dot-iu-cutter-v0.5-constitution-source-document-grounding-and-checksum-plan-2026-05-18.md: success, content_length=8721, MCP tool wall time 20.2761s.
  - list_documents for v0.5-fabric-addendum-scope/: success, 22 items, MCP tool wall time 22.3201s.
Same requests against AgentData directly were fast:
container-local /mcp:
list run=1 http=200 time=0.017 size=3492
list run=2 http=200 time=0.009 size=3492
get run=1 http=200 time=0.033 size=10243
get run=2 http=200 time=0.011 size=10243
list-fabric run=1 http=200 time=0.022 size=11029
list-fabric run=2 http=200 time=0.012 size=11029
Through public nginx /api/mcp from the VPS:
api-list http=200 time=0.120 size=3492
api-get http=200 time=0.046 size=10243
api-list-fabric http=200 time=0.443 size=11029
Concurrent public-nginx stress, 20 parallel MCP calls:
concurrency_total_time=0.684
errors= 0
p50=0.564 p95=0.603 max=0.631
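The report does not record how the 20-way stress run was driven or how p50/p95 were computed. The following is a minimal sketch of one way to reproduce that kind of measurement, assuming a thread pool and nearest-rank percentiles. The names percentile, timed_call, and stress are hypothetical, and call stands in for whatever function issues one MCP request.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank: the smallest sample with at least pct% of samples at or below it.
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def timed_call(call):
    """Run call() once and return (elapsed_seconds, error_or_None)."""
    start = time.perf_counter()
    try:
        call()
        return time.perf_counter() - start, None
    except Exception as exc:  # a failed request still counts as a timing sample
        return time.perf_counter() - start, exc


def stress(call, parallel=20):
    """Fire `parallel` calls concurrently and summarize errors and latency."""
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        results = list(pool.map(lambda _: timed_call(call), range(parallel)))
    durations = [elapsed for elapsed, _ in results]
    return {
        "errors": sum(1 for _, err in results if err is not None),
        "p50": percentile(durations, 50),
        "p95": percentile(durations, 95),
        "max": max(durations),
    }
```

With 20 samples, nearest-rank p95 is the 19th-fastest call, so a single slow outlier shows up in max but not in p95.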
4. Evidence from logs / runtime
Service state:
incomex-agent-data Up 4 days (healthy)
RestartCount=0 OOMKilled=false Health=healthy
host memory: 11Gi total, 7.4Gi available
sample docker stats: agent-data CPU 55.13%, memory 1.531GiB / 2.5GiB; postgres 262.5MiB / 2GiB; qdrant 111.4MiB / 1GiB
AgentData logs on 2026-05-18 showed wrapper calls succeeding with 200 for relevant endpoint families; examples:
connector_call wrapper=list_documents ... status=200 response_bytes=9695
connector_call wrapper=batch_read ... status=200 response_bytes=78290
connector_call wrapper=get_document_for_rewrite ... status=200 response_bytes=19946
connector_call wrapper=upload_document ... status=200 response_bytes=178
Qdrant probe OK: 7503 vectors (7-38ms samples)
An nginx log filter for /api/mcp and upstream timeout/reset/503/504 patterns showed no MCP 5xx during the checked window. The 5xx/reset search only found unrelated Nuxt buffering warnings such as:
an upstream response is buffered to a temporary file ... request: "GET /" ... upstream: "http://172.18.0.6:3000/"
DB/storage evidence:
kb_count ms=295.06 rows=[(3516,)]
max_body ms=389.82 rows=[(129290,)]
target_body ms=13.25 rows=[(8721,)]
fabric_top max observed body bytes=19754
EXPLAIN list prefix: Index Scan path via idx_kb_documents_doc_id_c_live; Execution Time: 5.749 ms
Relevant nginx config facts:
- /api/AgentData route has proxy_read_timeout 300s.
- Claude MCP route has proxy_read_timeout 300s, proxy_send_timeout 300s.
- GPT MCP route has proxy_read_timeout 60s, proxy_send_timeout 30s, access_log off.
5. Endpoint assessment
- list_documents: current handler uses SQL pushdown, prefix filter, pagination, max limit 100, max response bytes 128000. Tested folders returned fast at core service and DB level. Not root cause for the reproduced case.
- batch_read: returns up to 20 docs; full=true returns monolithic full bodies without total response guard or cursor. Works in logs, but is a reliability risk through GPT/connector gateways.
- get_document_for_rewrite: full document returned as one JSON payload. Tested target is only 8721 chars, but the contract has no chunk/cursor fallback.
- upload_document: multiple 200s observed in logs; no upload instability reproduced. This report upload is the only mutation performed.
6. Root cause candidate
Most likely: connector/gateway/proxy-route instability, not AgentData core, DB, Qdrant, or nginx /api/mcp core path.
Rationale:
- AgentData direct handler returns the same requests in 0.009-0.033s locally and 0.046-0.443s through public nginx.
- DB query plan for the problematic folder executes in 5.749ms and uses the intended live prefix index.
- AgentData/nginx logs did not show matching MCP 503/504/reset evidence for the checked /api/mcp path.
- MCP app layer showed 16-22s wall time for small successful responses; this latency is outside the core service timings.
- GPT MCP route has weaker timeout settings than the /api/mcp and Claude routes and has access logs disabled, limiting correlation.
Secondary reliability contributors:
- get_document_for_rewrite and batch_read(full=true) use monolithic JSON responses.
- No structured duration/request_id logs at enough layers to prove where external 503/504 was generated.
- GPT route access logging is disabled, so exact ChatGPT connector request correlation is incomplete.
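To make the logging gap concrete, here is a minimal sketch of a wrapper that emits one structured record per tool call. It is not the AgentData implementation: call_with_log, the emit hook, and the 200/500 status convention are assumptions for illustration, and the field names simply mirror the kind of request_id/duration data the investigation lacked.

```python
import json
import time
import uuid


def call_with_log(tool, handler, emit=print, **kwargs):
    """Call handler(**kwargs) and emit one JSON log line, success or failure."""
    record = {"request_id": uuid.uuid4().hex, "tool": tool}
    start = time.perf_counter()
    try:
        response = handler(**kwargs)
        body = json.dumps(response).encode("utf-8")
        record.update(status=200, response_bytes=len(body), failure_class=None)
        return response
    except Exception as exc:
        # Assumed convention: any handler exception is logged as status 500.
        record.update(status=500, response_bytes=0,
                      failure_class=type(exc).__name__)
        raise
    finally:
        # duration_ms is recorded on both paths so slow failures are visible too.
        record["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        emit(json.dumps(record, sort_keys=True))
```

If the same request_id were also passed through the proxy and connector layers, a 503/504 seen by the client could be matched against the layer that produced it.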
Classification:
- MCP service code bug: not proven for tested case.
- Upstream/proxy/gateway: likely primary candidate.
- DB/storage: not supported by evidence.
- Payload/document size: possible contributor for full/batch reads through connector, not for the tested 8.7KB target alone.
- Connection pool: not supported by evidence; 20 concurrent public requests passed.
- Timeout config: GPT route 60s/30s is a candidate mismatch; /api/mcp 300s is not.
- Handler bug: no direct bug found; missing chunking/response guards are design gaps.
7. Proposed fixes
Quick mitigation, no production patch required:
- Prefer narrow list_documents(path=...) with limit<=100; avoid root-wide listing.
- For long files, use batch_read(full=false) first, then only full-read a small number of exact docs.
- Retry transient MCP failures with exponential backoff and jitter: 1s, 2s, 4s, max 3 attempts.
- When the GPT route fails, retry via the /api/mcp / Codex path if available, because /api/mcp is currently faster and has a 300s timeout.
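The retry mitigation above can be sketched as follows. This is a minimal illustration, not an existing client helper. It reads "1s, 2s, 4s, max 3 attempts" as up to three retries after the initial call, which is an assumption. TransientMCPError, retry_with_backoff, and the 0.5s jitter ceiling are hypothetical names and values.

```python
import random
import time


class TransientMCPError(Exception):
    """Hypothetical marker for retryable failures, e.g. a 503/504 from a gateway."""


def retry_with_backoff(call, max_retries=3, base_delay=1.0, max_jitter=0.5,
                       sleep=time.sleep, jitter=random.uniform):
    """Call call(); on TransientMCPError wait 1s, 2s, 4s (plus jitter) and retry.

    Non-transient exceptions propagate immediately. After max_retries failed
    retries, the last TransientMCPError is re-raised to the caller.
    """
    attempt = 0
    while True:
        try:
            return call()
        except TransientMCPError:
            if attempt >= max_retries:
                raise
            # Exponential base delay plus a small random offset so that
            # concurrent clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt) + jitter(0.0, max_jitter)
            sleep(delay)
            attempt += 1
```

The sleep and jitter parameters are injectable so the schedule can be checked in tests without waiting.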
Low-risk config proposal, pending explicit approval:
- Align GPT MCP route with /api/mcp and Claude MCP: set proxy_read_timeout 300s, proxy_send_timeout 300s.
- Enable sanitized access logging for GPT MCP route with request_id, status, upstream_status, request_time, upstream_response_time, bytes_sent, no headers/secrets.
Code/contract proposal:
- Add structured log fields to all MCP tool calls: request_id, tool, profile, duration_ms, status, response_bytes, failure_class, doc_count, content_length.
- Add get_document_chunk(document_id, offset, limit) or extend get_document_for_rewrite with max_chars + cursor.
- Add total response-byte guard for batch_read(full=true); return partial results with per-doc cursor when over threshold.
- Add client-visible error shape: retryable, suggested_backoff_ms, partial_available, request_id.
- Add synthetic canary job for list_documents, get_document_for_rewrite, batch_read, upload_document with p50/p95 and 5xx alerting.
- Keep current SQL-pushdown listing; add regression test that prefix listing does not fetch content.body and remains index-backed.
- Consider streaming only after chunk API is in place; chunk/cursor is simpler and more reliable for GPT clients than one large streaming JSON response.
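The chunk/cursor and batch-guard proposals above might look roughly like the sketch below. It is an illustration under stated assumptions, not the proposed API. The real get_document_chunk would take a document_id and load the body from storage, whereas here the body is passed in directly. The budget counts characters rather than bytes for simplicity. batch_read_guarded, the response keys, and the default limits are all hypothetical.

```python
def get_document_chunk(body, offset=0, limit=4000):
    """Return one bounded slice of body plus a cursor for the next slice."""
    if offset < 0 or limit <= 0:
        raise ValueError("offset must be >= 0 and limit must be > 0")
    chunk = body[offset:offset + limit]
    next_offset = offset + len(chunk)
    return {
        "content": chunk,
        "offset": offset,
        "total_length": len(body),
        # None signals that the caller has reached the end of the document.
        "next_cursor": next_offset if next_offset < len(body) else None,
    }


def batch_read_guarded(docs, max_total_chars=16000):
    """Read several documents under one shared size budget.

    docs maps document_id -> body. Documents that fit are returned whole;
    once the budget is spent, later documents come back truncated or empty
    with a per-document cursor so the client can fetch the rest by chunk.
    """
    remaining = max_total_chars
    results = {}
    for doc_id, body in docs.items():
        if remaining <= 0:
            results[doc_id] = {
                "content": "",
                "offset": 0,
                "total_length": len(body),
                "next_cursor": 0 if body else None,
            }
            continue
        part = get_document_chunk(body, 0, remaining)
        results[doc_id] = part
        remaining -= len(part["content"])
    return results
```

A client would loop on next_cursor until it is None, so each response stays bounded regardless of document size.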
8. Tests run
- search_knowledge("operating rules SSOT"): PASS.
- search_knowledge("hiến pháp v4.0 constitution"): PASS.
- list_documents target authoring folder through MCP app: PASS but slow, 16.1695s.
- get_document_for_rewrite target full document through MCP app: PASS but slow, 20.2761s.
- list_documents fabric folder through MCP app: PASS but slow, 22.3201s.
- Direct AgentData local /mcp timings: PASS, all 0.009-0.033s.
- Public nginx /api/mcp timings: PASS, 0.046-0.443s.
- 20 concurrent public nginx calls: PASS, 0 errors, p95 0.603s.
- AgentData/nginx/postgres log grep: no MCP 503/504/reset found in checked output; unrelated PG errors/deadlocks exist but not tied to KB MCP timings.
- DB prefix EXPLAIN: PASS, execution 5.749ms, live prefix index used.
9. Remaining risks / blockers
- Exact GPT-side 503/504 request IDs were not available from OpenAI/connector gateway logs, so the external failure origin cannot be proven conclusively.
- GPT MCP route has access_log off; historical correlation for that route is limited.
- No production code/config patch was applied in this investigation.
- Existing PG logs contain unrelated application errors/deadlocks; they should be tracked separately but do not explain the tested MCP read latency.
- Full-document and full-batch reads remain connector-fragile until chunk/cursor and response guards are implemented.
10. Done status
MERGE is N/A. Production verification for investigation PASS: core AgentData/DB/nginx path is healthy for tested cases; GPT/connector reliability remains at risk until timeout alignment, structured logging, and chunked read contract are implemented.