AgentData MCP Timeout Investigation — GPT Review and Next Actions
Date: 2026-05-18
Reviewer: GPT
Reviewed report: knowledge/dev/laws/dieu44-trien-khai/ops/agentdata-mcp-timeout-investigation-2026-05-18.md
Verdict
investigation_quality: PASS
core_agentdata_db_root_cause: unlikely_for_tested_cases
primary_candidate: GPT_MCP_connector_or_route_gateway_layer
hard_503_504_reproduced: false
slow_MCP_app_layer_reproduced: true
additional_investigation_needed: true_before_large_code_patch
quick_config_mitigation_recommended: true
chunk_cursor_contract_recommended: true
Coretics' report is sufficient to conclude that, in the tested cases, the dominant latency is not explained by the AgentData core handler, DB prefix listing, Qdrant, or the public /api/mcp path.
The key evidence is the latency split:
latency_split:
  AgentData_local_direct: 0.009_to_0.033_seconds
  public_nginx_api_mcp: 0.046_to_0.443_seconds
  MCP_tool_layer_observed: 16_to_22_seconds
This points to connector/gateway/tool-layer latency, or a GPT MCP route config/logging gap, rather than DB/query slowness.
Accepted findings
accepted:
  DB_listing_not_root_cause:
    evidence: prefix list EXPLAIN uses idx_kb_documents_doc_id_c_live; execution 5.749ms
  AgentData_core_fast:
    evidence: direct /mcp requests < 0.033s for tested cases
  public_nginx_api_fast:
    evidence: /api/mcp requests < 0.443s; 20-concurrency p95 0.603s
  GPT_route_observability_gap:
    evidence: GPT MCP route access_log off; exact GPT request correlation unavailable
  full_read_contract_risk:
    evidence: get_document_for_rewrite and batch_read(full=true) return monolithic JSON without cursor/chunk fallback
What should be done now
Priority 1 — Observability and route timeout alignment
This is the fastest safety improvement and should be done before speculative code rewrites.
priority_1:
  - align GPT MCP route timeout with /api/mcp and Claude route:
      proxy_read_timeout: 300s
      proxy_send_timeout: 300s
  - enable sanitized GPT MCP access log:
      fields:
        - request_id
        - status
        - upstream_status
        - request_time
        - upstream_response_time
        - bytes_sent
      forbidden:
        - secrets
        - auth headers
        - request body content
  - add/verify request_id propagation if available
Expected benefit:
benefit:
  - fewer false 503/504 caused by route timeout mismatch
  - ability to prove whether latency is in GPT route, upstream, or external connector layer
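Once the sanitized access log is enabled, a small check can confirm that sampled log lines carry only the allowed fields. The sketch below is illustrative only: it assumes a space-separated key=value log format, which is a choice made for this example and not the format of the existing routes, and the function names are hypothetical.

```python
# Allowed fields, taken from the priority_1 list above.
ALLOWED_FIELDS = {
    "request_id",
    "status",
    "upstream_status",
    "request_time",
    "upstream_response_time",
    "bytes_sent",
}


def parse_line(line: str) -> dict:
    """Parse one access-log line in an ASSUMED 'key=value key=value' format."""
    record = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"malformed token: {token!r}")
        record[key] = value
    return record


def unexpected_fields(line: str) -> set:
    """Return the field names in the line that are not on the allowlist."""
    return set(parse_line(line)) - ALLOWED_FIELDS
```

An allowlist is used instead of a blocklist so that any field added to the log format later is flagged until someone reviews it.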
Priority 2 — Chunk/cursor read contract
For maximum stability, full document reads must be bounded.
priority_2:
  - add get_document_chunk(document_id, offset, limit)
  - or extend get_document_for_rewrite with max_chars + cursor
  - add response byte guard for batch_read(full=true)
  - return partial_available + cursor instead of large monolithic payload
Expected benefit:
benefit:
  - GPT can read long docs reliably
  - avoids connector payload/serialization fragility
  - permits retry/resume at chunk level
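The chunk/cursor contract proposed under Priority 2 could look roughly like the sketch below. It is a minimal illustration, not the AgentData implementation: the in-memory store, the `Chunk` shape, the `MAX_LIMIT` cap, and the `read_all` helper are all assumptions, and offsets are counted in characters rather than bytes.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical in-memory store standing in for the real document backend.
_DOCS = {"doc-1": "x" * 2500}

MAX_LIMIT = 1000  # assumed server-side cap on characters per response


@dataclass
class Chunk:
    document_id: str
    text: str
    offset: int
    next_cursor: Optional[int]  # None once the document is exhausted
    partial_available: bool     # True when more content remains


def get_document_chunk(document_id: str, offset: int = 0, limit: int = MAX_LIMIT) -> Chunk:
    """Return at most `limit` characters starting at `offset`."""
    if offset < 0 or limit <= 0:
        raise ValueError("offset must be >= 0 and limit must be > 0")
    body = _DOCS[document_id]
    limit = min(limit, MAX_LIMIT)  # the guard bounds every response
    text = body[offset:offset + limit]
    end = offset + len(text)
    more = end < len(body)
    return Chunk(document_id, text, offset, end if more else None, more)


def read_all(document_id: str, limit: int = MAX_LIMIT) -> str:
    """Client-side loop: follow the cursor until the document is exhausted."""
    parts = []
    cursor: Optional[int] = 0
    while cursor is not None:
        chunk = get_document_chunk(document_id, cursor, limit)
        parts.append(chunk.text)
        cursor = chunk.next_cursor
    return "".join(parts)
```

Because each call is bounded and stateless, a client that fails on one chunk can retry from the last cursor it received instead of re-reading the whole document.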
Priority 3 — Synthetic probes and p95 alerting
priority_3:
  - synthetic probes for list_documents, get_document_for_rewrite, batch_read, upload_document
  - track p50/p95/max, 5xx, retryable failures
  - alert if p95 > threshold or 5xx appears
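The probe summary and alert rule above can be sketched as follows. This is a hedged example: it uses the nearest-rank percentile definition, and the function names and the threshold value passed by the caller are assumptions, since the report does not fix a threshold.

```python
import math
from typing import Dict, List, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(durations_s: Sequence[float], statuses: Sequence[int]) -> Dict[str, float]:
    """Summarize one probe run: latency percentiles plus a count of 5xx responses."""
    return {
        "p50": percentile(durations_s, 50),
        "p95": percentile(durations_s, 95),
        "max": max(durations_s),
        "count_5xx": sum(1 for s in statuses if 500 <= s <= 599),
    }


def alerts(summary: Dict[str, float], p95_threshold_s: float) -> List[str]:
    """Apply the rule 'alert if p95 > threshold or any 5xx appears'."""
    out = []
    if summary["p95"] > p95_threshold_s:
        out.append(f"p95 {summary['p95']:.3f}s exceeds {p95_threshold_s:.3f}s")
    if summary["count_5xx"] > 0:
        out.append(f"{summary['count_5xx']} responses with 5xx status")
    return out
```

Running one summary per tool (list_documents, get_document_for_rewrite, batch_read, upload_document) keeps a slow tool from being hidden by fast ones.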
Additional investigation still needed
The remaining unknown is where the 16-22s tool-layer latency is introduced.
needs_more_evidence:
  - GPT/OpenAI connector gateway logs unavailable
  - GPT MCP route access log currently off
  - no exact request_id linking GPT tool call to nginx/AgentData logs
Therefore, the next investigation should not repeat the DB/query checks. It should instrument the route and compare:
compare_layers:
  - GPT MCP route access log request_time/upstream_response_time
  - AgentData internal wrapper duration_ms
  - public /api/mcp duration
  - client observed tool wall time
If nginx request_time is fast but GPT tool wall time remains slow, the delay is introduced outside the VPS/API core path.
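The layer comparison can be expressed as a simple subtraction once a single request is correlated across logs. The sketch below is illustrative: the `LayerTimings` type and `attribute_latency` function are hypothetical, and it assumes all three timings belong to the same request, which requires the request_id correlation that is not yet available.

```python
from dataclasses import dataclass


@dataclass
class LayerTimings:
    upstream_response_time_s: float  # nginx upstream_response_time for the request
    request_time_s: float            # nginx request_time on the GPT MCP route
    client_wall_time_s: float        # tool wall time observed by the caller


def attribute_latency(t: LayerTimings) -> dict:
    """Split the client-observed wall time into per-layer shares."""
    upstream = t.upstream_response_time_s
    # Time spent in nginx beyond waiting for the upstream.
    route = max(0.0, t.request_time_s - upstream)
    # Time the client saw that nginx never accounted for.
    outside = max(0.0, t.client_wall_time_s - t.request_time_s)
    parts = {"upstream": upstream, "nginx_route": route, "outside_vps": outside}
    return {**parts, "dominant": max(parts, key=parts.get)}
```

With illustrative inputs in the range the report observed (0.03 s upstream, 0.44 s at nginx, 16 s at the tool layer), the dominant share is `outside_vps`. Those figures were measured on different paths, so this only shows how the comparison would be read, not a result.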
Recommended next action
Open a narrow ops phase:
next_phase: agentdata_mcp_reliability_phase_1_route_observability_and_timeout_alignment
nature: config_patch_design_or_apply_with_safety_gate
Suggested sequence:
sequence:
  1: inspect exact GPT MCP nginx route config
  2: author minimal config patch for timeout alignment + sanitized access log
  3: review patch
  4: apply config with nginx -t and reload if approved
  5: run synthetic probes through GPT route and /api/mcp
  6: report latency and 5xx results
If a direct production config patch is allowed by the user/Coretics process, keep it minimal and reversible. Otherwise, author the patch only and route it for approval.
Longer-term implementation
After route observability is fixed, implement:
long_term:
  - chunked read API
  - total response byte guard
  - structured MCP tool-call logging
  - client-visible retryable error shape
  - synthetic canary
Final status
status: INVESTIGATION_REVIEWED__ROUTE_OBSERVABILITY_AND_CHUNKING_NEXT