AgentData MCP Timeout / KB Stability Investigation — 2026-05-18

0. Governance / scope

Mission: investigate unstable AgentData MCP / KB reads reported during dot-iu-cutter operation on 2026-05-18.
Production mutation: none, except uploading this report to the KB via upload_document.
Code touched: none.
Secrets: not included; nginx route/API secret values were observed only to confirm configuration and are intentionally redacted from this report.

Three declaration statements (3 câu Tuyên ngôn)

Permanent (Vĩnh viễn): do not patch this one incident only; fix the class of problem by adding chunked read contracts, request_id/duration logging, timeout alignment, and health/circuit checks so that future large reads are bounded and diagnosable.
Mistake-proof (Nhầm được không): enforce response-size limits and cursor/chunk APIs server-side so that clients cannot accidentally request unbounded full payloads.
100% automated (100% tự động): add synthetic MCP probes and structured log alerts for 5xx/slow calls so that instability is detected without manual GPT retry loops.

1. Rules / documents read

.claude/skills/incomex-rules.md read locally; 36 rules acknowledged. Background agents not used.
search_knowledge("operating rules SSOT"): found knowledge/dev/ssot/operating-rules.md, OR v7.58 dated 2026-05-01, and knowledge/dev/ssot/vps/vps-operating-rules.md v1.0.
search_knowledge("hiến pháp v4.0 constitution"): returned the current constitution source knowledge/dev/laws/constitution.md, metadata title "Hiến pháp Kiến trúc Hệ thống Incomex v4.6.3 BAN HÀNH" (Incomex System Architecture Constitution v4.6.3, promulgated).
Related MCP prior reports read/searched: phase-2f-c-rev2-mcp-production-patch-design-2026-05-14.md, phase-2f-e2-chatgpt-mcp-patch-hang-diagnosis-2026-05-14.md, gpt-mcp-connector-root-cause-2026-05-12.md.

2. Symptoms observed

Reported symptoms from GPT operating session:

list_documents sometimes returned 504 upstream request timeout.
get_document_for_rewrite sometimes returned 503 upstream connect error / disconnect/reset before headers.
batch_read sometimes succeeded, while full read of long files was more fragile.
Affected paths included:
- knowledge/dev/laws/dieu44-trien-khai/v0.5-constitution-source-document-seed-authoring/
- knowledge/dev/laws/dieu44-trien-khai/v0.5-fabric-addendum-scope/

3. Reproduced yes/no

Partially reproduced.

Hard 503/504 was NOT reproduced during this investigation.
Slow connector behavior WAS reproduced through the MCP app layer:
- list_documents for v0.5-constitution-source-document-seed-authoring/: success, 5 items, MCP tool wall time 16.1695s.
- get_document_for_rewrite for dot-iu-cutter-v0.5-constitution-source-document-grounding-and-checksum-plan-2026-05-18.md: success, content_length=8721, MCP tool wall time 20.2761s.
- list_documents for v0.5-fabric-addendum-scope/: success, 22 items, MCP tool wall time 22.3201s.

Same requests against AgentData directly were fast:

container-local /mcp:
list run=1 http=200 time=0.017 size=3492
list run=2 http=200 time=0.009 size=3492
get run=1 http=200 time=0.033 size=10243
get run=2 http=200 time=0.011 size=10243
list-fabric run=1 http=200 time=0.022 size=11029
list-fabric run=2 http=200 time=0.012 size=11029

Through public nginx /api/mcp from the VPS:

api-list http=200 time=0.120 size=3492
api-get http=200 time=0.046 size=10243
api-list-fabric http=200 time=0.443 size=11029

Concurrent public-nginx stress, 20 parallel MCP calls:

concurrency_total_time=0.684
errors= 0
p50=0.564 p95=0.603 max=0.631
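
A measurement of this kind can be reproduced with a small harness. The following is a minimal Python sketch, not the tool used in this investigation. The names stress, timed_call, and percentile are hypothetical, the probe is any zero-argument callable that performs one MCP call, and the nearest-rank percentile is an assumed definition, since the report does not state how p50/p95 were computed.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def timed_call(call: Callable[[], object]) -> Tuple[float, bool]:
    """Run one probe and return (elapsed seconds, succeeded)."""
    start = time.perf_counter()
    try:
        call()
        ok = True
    except Exception:
        ok = False
    return time.perf_counter() - start, ok


def percentile(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile over an already sorted, non-empty list."""
    # Integer ceiling of pct * n / 100, to avoid float rounding surprises.
    rank = max(1, (pct * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]


def stress(call: Callable[[], object], parallel: int = 20) -> Dict[str, float]:
    """Fire `parallel` probes at once and summarize their latencies."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        results = list(pool.map(lambda _: timed_call(call), range(parallel)))
    total = time.perf_counter() - start
    durations = sorted(elapsed for elapsed, _ in results)
    return {
        "concurrency_total_time": total,
        "errors": sum(1 for _, ok in results if not ok),
        "p50": percentile(durations, 50),
        "p95": percentile(durations, 95),
        "max": durations[-1],
    }
```

In practice the callable would wrap one HTTP request to the route under test; it is left abstract here so the sketch does not assume a request format.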

4. Evidence from logs / runtime

Service state:

incomex-agent-data Up 4 days (healthy)
RestartCount=0 OOMKilled=false Health=healthy
host memory: 11Gi total, 7.4Gi available
sample docker stats: agent-data CPU 55.13%, memory 1.531GiB / 2.5GiB; postgres 262.5MiB / 2GiB; qdrant 111.4MiB / 1GiB

AgentData logs on 2026-05-18 showed wrapper calls succeeding with 200 for relevant endpoint families; examples:

connector_call wrapper=list_documents ... status=200 response_bytes=9695
connector_call wrapper=batch_read ... status=200 response_bytes=78290
connector_call wrapper=get_document_for_rewrite ... status=200 response_bytes=19946
connector_call wrapper=upload_document ... status=200 response_bytes=178
Qdrant probe OK: 7503 vectors (7-38ms samples)

Filtering the nginx logs for /api/mcp and for upstream timeout, reset, 503, and 504 entries showed no MCP 5xx during the checked window. The only matches were unrelated Nuxt buffering warnings such as:

an upstream response is buffered to a temporary file ... request: "GET /" ... upstream: "http://172.18.0.6:3000/"

DB/storage evidence:

kb_count ms=295.06 rows=[(3516,)]
max_body ms=389.82 rows=[(129290,)]
target_body ms=13.25 rows=[(8721,)]
fabric_top max observed body bytes=19754
EXPLAIN list prefix: Index Scan path via idx_kb_documents_doc_id_c_live; Execution Time: 5.749 ms

Relevant nginx config facts:

/api/ AgentData route has proxy_read_timeout 300s.
Claude MCP route has proxy_read_timeout 300s, proxy_send_timeout 300s.
GPT MCP route has proxy_read_timeout 60s, proxy_send_timeout 30s, access_log off.

5. Endpoint assessment

list_documents: current handler uses SQL pushdown, prefix filter, pagination, max limit 100, max response bytes 128000. Tested folders returned fast at core service and DB level. Not root cause for the reproduced case.
batch_read: returns up to 20 docs; full=true returns monolithic full bodies without total response guard or cursor. Works in logs, but is a reliability risk through GPT/connector gateways.
get_document_for_rewrite: full document returned as one JSON payload. Tested target is only 8721 chars, but the contract has no chunk/cursor fallback.
upload_document: multiple 200s observed in logs; no upload instability reproduced. This report upload is the only mutation performed.
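
The missing chunk/cursor fallback noted for get_document_for_rewrite could take roughly the following shape. This is a sketch under assumptions: the get_document_chunk(document_id, offset, limit) signature follows the proposal in section 7, while the in-memory store, the Chunk fields, and the 4000-character cap are illustrative and not part of the current AgentData contract.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

MAX_CHUNK_CHARS = 4000  # assumed server-side cap, not a value from the report


@dataclass
class Chunk:
    document_id: str
    offset: int
    content: str
    total_chars: int
    next_offset: Optional[int]  # None once the document is exhausted


def get_document_chunk(store: Dict[str, str], document_id: str,
                       offset: int = 0, limit: int = MAX_CHUNK_CHARS) -> Chunk:
    """Return at most `limit` characters starting at `offset`."""
    body = store[document_id]
    if offset < 0 or offset > len(body):
        raise ValueError("offset out of range")
    # The server clamps the limit, so a client cannot request an unbounded read.
    limit = max(1, min(limit, MAX_CHUNK_CHARS))
    end = min(offset + limit, len(body))
    next_offset = end if end < len(body) else None
    return Chunk(document_id, offset, body[offset:end], len(body), next_offset)


def read_all(store: Dict[str, str], document_id: str,
             limit: int = MAX_CHUNK_CHARS) -> str:
    """Client-side loop: follow next_offset until the document is complete."""
    parts: List[str] = []
    offset: Optional[int] = 0
    while offset is not None:
        chunk = get_document_chunk(store, document_id, offset, limit)
        parts.append(chunk.content)
        offset = chunk.next_offset
    return "".join(parts)
```

With this shape, every individual response stays bounded by the cap regardless of document size, and a failed call costs the client one chunk rather than the whole document.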

6. Root cause candidate

Most likely: connector/gateway/proxy-route instability, not AgentData core, DB, Qdrant, or nginx /api/mcp core path.

Rationale:

AgentData direct handler returns the same requests in 0.009-0.033s locally and 0.046-0.443s through public nginx.
DB query plan for the problematic folder executes in 5.749ms and uses the intended live prefix index.
AgentData/nginx logs did not show matching MCP 503/504/reset evidence for the checked /api/mcp path.
MCP app layer showed 16-22s wall time for small successful responses; this latency is outside the core service timings.
GPT MCP route has shorter timeouts (proxy_read_timeout 60s, proxy_send_timeout 30s) than the /api/mcp and Claude routes (300s) and has access logs disabled, which limits correlation.

Secondary reliability contributors:

get_document_for_rewrite and batch_read(full=true) use monolithic JSON responses.
No structured duration/request_id logs at enough layers to prove where external 503/504 was generated.
GPT route access logging is disabled, so exact ChatGPT connector request correlation is incomplete.
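
The logging gap described above could be narrowed by wrapping every tool call so that it emits one structured record. The sketch below is illustrative only: call_with_log, the logger name, and the 200/500 status mapping are hypothetical, and only a subset of the log fields proposed in section 7 is shown.

```python
import json
import logging
import time
import uuid
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger("mcp.tool")  # hypothetical logger name


def call_with_log(tool: str,
                  handler: Callable[..., Dict[str, Any]],
                  request_id: Optional[str] = None,
                  emit: Optional[Callable[[str], None]] = None,
                  **kwargs: Any) -> Dict[str, Any]:
    """Run a tool handler and emit one JSON log line, on success or failure."""
    record: Dict[str, Any] = {
        # Reuse the caller's id when present so layers can be correlated.
        "request_id": request_id or uuid.uuid4().hex,
        "tool": tool,
        "status": 200,  # assumed mapping: 200 on success, 500 on exception
        "failure_class": None,
    }
    start = time.perf_counter()
    try:
        result = handler(**kwargs)
        record["response_bytes"] = len(json.dumps(result).encode("utf-8"))
        return result
    except Exception as exc:
        record["status"] = 500
        record["failure_class"] = type(exc).__name__
        raise
    finally:
        # Always logged, so slow and failed calls leave a trace as well.
        record["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        (emit or logger.info)(json.dumps(record))
```

If the same request_id is also written by the proxy layer, a 16-22s wall time can be attributed to a specific hop instead of being inferred from separate timings.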

Classification:

MCP service code bug: not proven for tested case.
Upstream/proxy/gateway: likely primary candidate.
DB/storage: not supported by evidence.
Payload/document size: possible contributor for full/batch reads through connector, not for the tested 8.7KB target alone.
Connection pool: not supported by evidence; 20 concurrent public requests passed.
Timeout config: GPT route 60s/30s is a candidate mismatch; /api/mcp 300s is not.
Handler bug: no direct bug found; missing chunking/response guards are design gaps.

7. Proposed fixes

Quick mitigation, no production patch required:

Prefer narrow list_documents(path=...) with limit<=100; avoid root-wide listing.
For long files, use batch_read(full=false) first, then only full-read a small number of exact docs.
Retry transient MCP failures with exponential backoff and jitter: 1s, 2s, 4s, max 3 attempts.
When GPT route fails, retry via /api/mcp/Codex path if available because /api/mcp is currently faster and has 300s timeout.
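
The retry guidance above can be sketched as follows. This is a minimal example, not an existing client: TransientMCPError and call_with_retry are hypothetical names, "max 3 attempts" is read here as up to three retries after the initial call (base delays 1s, 2s, 4s), and the jitter of up to 25% of each delay is an assumed value.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientMCPError(Exception):
    """Hypothetical marker for retryable failures such as 503/504."""


def call_with_retry(call: Callable[[], T],
                    retries: int = 3,
                    base: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep,
                    rng: Callable[[], float] = random.random) -> T:
    """Call once, then retry transient failures with exponential backoff."""
    attempt = 0
    while True:
        try:
            return call()
        except TransientMCPError:
            if attempt >= retries:
                raise  # retries exhausted: surface the last error
            delay = base * (2 ** attempt)  # 1s, 2s, 4s with the defaults
            # Jitter spreads out clients that failed at the same moment.
            sleep(delay + delay * 0.25 * rng())
            attempt += 1
```

Only errors classified as transient are retried; any other exception propagates immediately, so a genuine client error is not repeated three times.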

Low-risk config proposal, pending explicit approval:

Align GPT MCP route with /api/mcp and Claude MCP: set proxy_read_timeout 300s, proxy_send_timeout 300s.
Enable sanitized access logging for GPT MCP route with request_id, status, upstream_status, request_time, upstream_response_time, bytes_sent, no headers/secrets.

Code/contract proposal:

Add structured log fields to all MCP tool calls: request_id, tool, profile, duration_ms, status, response_bytes, failure_class, doc_count, content_length.
Add get_document_chunk(document_id, offset, limit) or extend get_document_for_rewrite with max_chars + cursor.
Add total response-byte guard for batch_read(full=true); return partial results with per-doc cursor when over threshold.
Add client-visible error shape: retryable, suggested_backoff_ms, partial_available, request_id.
Add synthetic canary job for list_documents, get_document_for_rewrite, batch_read, upload_document with p50/p95 and 5xx alerting.
Keep current SQL-pushdown listing; add regression test that prefix listing does not fetch content.body and remains index-backed.
Consider streaming only after chunk API is in place; chunk/cursor is simpler and more reliable for GPT clients than one large streaming JSON response.
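
The response-byte guard proposed for batch_read(full=true) might look like the sketch below. It is illustrative only: guarded_batch_read, the in-memory docs mapping, and the result fields are hypothetical, the 128000-byte budget simply mirrors the list_documents cap cited in section 5, and the budget counts document bodies only, not the JSON envelope.

```python
from typing import Any, Dict, List

MAX_RESPONSE_BYTES = 128000  # assumed budget, mirroring the list_documents cap


def guarded_batch_read(docs: Dict[str, str], ids: List[str],
                       max_bytes: int = MAX_RESPONSE_BYTES) -> Dict[str, Any]:
    """Return full bodies while they fit the budget, cursors for the rest."""
    results: List[Dict[str, Any]] = []
    used = 0
    for doc_id in ids:
        body = docs[doc_id]
        size = len(body.encode("utf-8"))
        if used + size <= max_bytes:
            results.append({"document_id": doc_id, "content": body, "cursor": None})
            used += size
        else:
            # Over budget: omit the body and hand back a cursor that a
            # chunked read can start from. Later, smaller docs may still fit.
            results.append({
                "document_id": doc_id,
                "content": None,
                "cursor": {"offset": 0, "total_bytes": size},
            })
    return {
        "results": results,
        "partial_available": any(r["cursor"] is not None for r in results),
        "body_bytes": used,
    }
```

Returning a cursor instead of failing the whole batch keeps the documents that did fit usable and tells the client exactly which ones need a follow-up chunked read.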

8. Tests run

search_knowledge("operating rules SSOT"): PASS.
search_knowledge("hiến pháp v4.0 constitution"): PASS.
list_documents target authoring folder through MCP app: PASS but slow, 16.1695s.
get_document_for_rewrite target full document through MCP app: PASS but slow, 20.2761s.
list_documents fabric folder through MCP app: PASS but slow, 22.3201s.
Direct AgentData local /mcp timings: PASS, all 0.009-0.033s.
Public nginx /api/mcp timings: PASS, 0.046-0.443s.
20 concurrent public nginx calls: PASS, 0 errors, p95 0.603s.
AgentData/nginx/postgres log grep: no MCP 503/504/reset found in checked output; unrelated PG errors/deadlocks exist but not tied to KB MCP timings.
DB prefix EXPLAIN: PASS, execution 5.749ms, live prefix index used.

9. Remaining risks / blockers

Exact GPT-side 503/504 request IDs were not available from OpenAI/connector gateway logs, so the external failure origin cannot be proven conclusively.
GPT MCP route has access_log off; historical correlation for that route is limited.
No production code/config patch was applied in this investigation.
Existing PG logs contain unrelated application errors/deadlocks; they should be tracked separately but do not explain the tested MCP read latency.
Full-document and full-batch reads remain connector-fragile until chunk/cursor and response guards are implemented.

10. Done status

MERGE: N/A. Production verification for this investigation: PASS. The core AgentData/DB/nginx path is healthy for the tested cases; GPT/connector reliability remains at risk until timeout alignment, structured logging, and a chunked read contract are implemented.