KB-3700

AgentData MCP Timeout Investigation — GPT Review and Next Actions

Revision 1

Tags: agentdata, mcp, timeout, kb, ops, gpt-review, performance, stability, next-actions, dieu44, 2026-05-18

Date: 2026-05-18
Reviewer: GPT
Reviewed report: knowledge/dev/laws/dieu44-trien-khai/ops/agentdata-mcp-timeout-investigation-2026-05-18.md

Verdict

investigation_quality: PASS
core_agentdata_db_root_cause: unlikely_for_tested_cases
primary_candidate: GPT_MCP_connector_or_route_gateway_layer
hard_503_504_reproduced: false
slow_MCP_app_layer_reproduced: true
additional_investigation_needed: true_before_large_code_patch
quick_config_mitigation_recommended: true
chunk_cursor_contract_recommended: true

Coretics' report is sufficient to conclude that, for the tested cases, the dominant latency is not explained by the AgentData core handler, DB prefix listing, Qdrant, or the public /api/mcp path.

The key evidence is the latency split:

latency_split:
  AgentData_local_direct: 0.009_to_0.033_seconds
  public_nginx_api_mcp: 0.046_to_0.443_seconds
  MCP_tool_layer_observed: 16_to_22_seconds

This points to connector/gateway/tool-layer latency, or a GPT MCP route config/logging gap, rather than DB/query slowness.
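The size of the gap can be made concrete with a short calculation. The sketch below uses only the figures from latency_split. It assumes, for illustration, that the slowest public /api/mcp request and the fastest MCP tool call are comparable, which the report does not establish because the requests were not correlated. The function name is hypothetical.

```python
# Figures copied from the latency_split block (seconds).
LOCAL_DIRECT = (0.009, 0.033)
PUBLIC_NGINX = (0.046, 0.443)
TOOL_LAYER = (16.0, 22.0)


def unexplained_share(tool_wall: float, slowest_known_path: float) -> float:
    """Fraction of tool wall time not accounted for by the measured path."""
    return (tool_wall - slowest_known_path) / tool_wall


# Most conservative pairing: fastest tool call against slowest public request.
worst_case = unexplained_share(TOOL_LAYER[0], PUBLIC_NGINX[1])
print(f"at least {worst_case:.1%} of tool wall time is unexplained")
```

Even under this conservative pairing, more than 97% of the observed tool wall time falls outside the measured nginx and AgentData path.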

Accepted findings

accepted:
  DB_listing_not_root_cause:
    evidence: prefix list EXPLAIN uses idx_kb_documents_doc_id_c_live; execution 5.749ms

  AgentData_core_fast:
    evidence: direct /mcp requests <= 0.033s for tested cases

  public_nginx_api_fast:
    evidence: /api/mcp requests <= 0.443s; p95 at concurrency 20 is 0.603s

  GPT_route_observability_gap:
    evidence: GPT MCP route access_log off; exact GPT request correlation unavailable

  full_read_contract_risk:
    evidence: get_document_for_rewrite and batch_read(full=true) return monolithic JSON without cursor/chunk fallback

What should be done now

Priority 1 — Observability and route timeout alignment

This is the fastest safety improvement and should be done before speculative code rewrites.

priority_1:
  - align GPT MCP route timeout with /api/mcp and Claude route:
      proxy_read_timeout: 300s
      proxy_send_timeout: 300s
  - enable sanitized GPT MCP access log:
      fields:
        - request_id
        - status
        - upstream_status
        - request_time
        - upstream_response_time
        - bytes_sent
      forbidden:
        - secrets
        - auth headers
        - request body content
  - add/verify request_id propagation if available

Expected benefit:

benefit:
  - fewer false 503/504 caused by route timeout mismatch
  - ability to prove whether latency is in GPT route, upstream, or external connector layer
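Once the sanitized access log exists, each line can be checked mechanically. The sketch below assumes a hypothetical key=value line format carrying only the fields listed under priority_1; the key names (rid, ustatus, rt, urt) and the sample values are illustrative, not the deployed format. The parser rejects any field outside the allowlist, which doubles as a guard against secrets or auth headers leaking into the log.

```python
from typing import Optional

# Hypothetical line format (an assumption, not the deployed config):
#   rid=<id> status=<code> ustatus=<code> rt=<sec> urt=<sec> bytes=<n>
ALLOWED_FIELDS = {"rid", "status", "ustatus", "rt", "urt", "bytes"}


def parse_line(line: str) -> dict:
    """Parse one key=value access log line, rejecting unexpected fields."""
    record = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if not sep or key not in ALLOWED_FIELDS:
            raise ValueError(f"unexpected field in sanitized log: {key!r}")
        record[key] = value
    return record


def non_upstream_seconds(record: dict) -> Optional[float]:
    """request_time minus upstream_response_time, or None if unavailable.

    The difference covers time spent in nginx and in client transfer.
    """
    try:
        return float(record["rt"]) - float(record["urt"])
    except (KeyError, ValueError):
        # nginx writes "-" when no upstream was contacted, and a
        # comma-separated list when several upstreams were tried.
        return None


sample = "rid=abc123 status=200 ustatus=200 rt=0.412 urt=0.398 bytes=5120"
rec = parse_line(sample)
print(rec["rid"], rec["status"], non_upstream_seconds(rec))
```

A small difference between request_time and upstream_response_time on a slow call would point at the upstream; a small request_time on a call the client saw as slow would point outside the route entirely.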

Priority 2 — Chunk/cursor read contract

For maximum stability, full document reads must be bounded.

priority_2:
  - add get_document_chunk(document_id, offset, limit)
  - or extend get_document_for_rewrite with max_chars + cursor
  - add response byte guard for batch_read(full=true)
  - return partial_available + cursor instead of large monolithic payload

Expected benefit:

benefit:
  - GPT can read long docs reliably
  - avoids connector payload/serialization fragility
  - permits retry/resume at chunk level
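A minimal sketch of the first option, get_document_chunk(document_id, offset, limit), is shown below. Everything beyond that signature is an assumption: the in-memory DOCS store stands in for the real backend, the response fields are one possible shape for partial_available + cursor, and offsets count characters, so a production byte guard would need its own accounting.

```python
from typing import Optional, TypedDict


class Chunk(TypedDict):
    document_id: str
    text: str
    cursor: Optional[int]      # offset of the next chunk, None when done
    partial_available: bool    # True while more content remains


# Hypothetical in-memory store standing in for the real document backend.
DOCS = {"doc-1": "x" * 2500}


def get_document_chunk(document_id: str, offset: int = 0, limit: int = 1000) -> Chunk:
    """Return at most `limit` characters starting at `offset`."""
    if offset < 0 or limit <= 0:
        raise ValueError("offset must be >= 0 and limit must be > 0")
    body = DOCS[document_id]
    text = body[offset:offset + limit]
    end = offset + len(text)
    more = end < len(body)
    return {"document_id": document_id, "text": text,
            "cursor": end if more else None, "partial_available": more}


def read_all(document_id: str, limit: int = 1000) -> str:
    """Client loop: follow the cursor until the document is exhausted."""
    parts = []
    cursor: Optional[int] = 0
    while cursor is not None:
        chunk = get_document_chunk(document_id, cursor, limit)
        parts.append(chunk["text"])
        cursor = chunk["cursor"]
    return "".join(parts)
```

Because every response is bounded by limit, a failed call costs one chunk rather than the whole document, and the client can resume from the last cursor it received.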

Priority 3 — Synthetic probes and p95 alerting

priority_3:
  - synthetic probes for list_documents, get_document_for_rewrite, batch_read, upload_document
  - track p50/p95/max, 5xx, retryable failures
  - alert if p95 > threshold or 5xx appears
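The alert rule can be sketched as follows. The nearest-rank percentile method, the function names, the sample values, and the 1.0s threshold are all assumptions for illustration; the review does not specify a threshold or a percentile definition.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def evaluate(latencies: Sequence[float], statuses: Sequence[int],
             p95_threshold: float) -> dict:
    """Summarize one probe window and decide whether to alert."""
    p95 = percentile(latencies, 95)
    five_xx = sum(1 for s in statuses if 500 <= s <= 599)
    return {
        "p50": percentile(latencies, 50),
        "p95": p95,
        "max": max(latencies),
        "5xx": five_xx,
        # Alert if p95 exceeds the threshold or any 5xx appears.
        "alert": p95 > p95_threshold or five_xx > 0,
    }


# Illustrative probe window: four fast calls and one slow tool-layer call.
summary = evaluate([0.05, 0.06, 0.07, 0.4, 18.0], [200] * 5, p95_threshold=1.0)
print(summary)
```

With so few samples a single slow call dominates p95, so a real probe would likely need a larger window per tool before alerting.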

Additional investigation still needed

The remaining unknown is where the 16-22s tool-layer latency is introduced.

needs_more_evidence:
  - GPT/OpenAI connector gateway logs unavailable
  - GPT MCP route access log currently off
  - no exact request_id linking GPT tool call to nginx/AgentData logs

Therefore, the next investigation should not repeat the DB/query checks. It should instrument the route and compare the timing reported at each layer:

compare_layers:
  - GPT MCP route access log request_time/upstream_response_time
  - AgentData internal wrapper duration_ms
  - public /api/mcp duration
  - client observed tool wall time

If nginx request_time is fast but the GPT tool wall time remains slow, the delay lies outside the VPS/API core path.
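This decision rule can be written down as a small classifier. The sketch below is illustrative only: it assumes the three timings belong to the same request (which requires the request_id correlation that is not yet available), and the 1.0s "slow" cutoff and the returned labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LayerTimings:
    tool_wall_s: float       # client observed tool wall time
    nginx_request_s: float   # GPT MCP route request_time
    upstream_s: float        # GPT MCP route upstream_response_time


def locate_delay(t: LayerTimings, slow_s: float = 1.0) -> str:
    """Name the innermost layer whose timing already looks slow."""
    if t.upstream_s >= slow_s:
        return "agentdata_upstream"
    if t.nginx_request_s >= slow_s:
        return "gpt_mcp_route"
    if t.tool_wall_s >= slow_s:
        return "outside_vps_api_core_path"
    return "no_slow_layer"


# Values in the range the report observed: fast core, fast nginx, slow tool call.
print(locate_delay(LayerTimings(tool_wall_s=18.0, nginx_request_s=0.4, upstream_s=0.03)))
```

Checking from the innermost layer outward keeps a slow upstream from being misattributed to the route or the connector.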

Recommended next action

Open a narrow ops phase:

next_phase: agentdata_mcp_reliability_phase_1_route_observability_and_timeout_alignment
nature: config_patch_design_or_apply_with_safety_gate

Suggested sequence:

sequence:
  1: inspect exact GPT MCP nginx route config
  2: author minimal config patch for timeout alignment + sanitized access log
  3: review patch
  4: apply config with nginx -t and reload if approved
  5: run synthetic probes through GPT route and /api/mcp
  6: report latency and 5xx results

If the user/Coretics process allows a direct production config patch, keep it minimal and reversible. Otherwise, author the patch only and route it for approval.

Longer-term implementation

After route observability is fixed, implement:

long_term:
  - chunked read API
  - total response byte guard
  - structured MCP tool-call logging
  - client-visible retryable error shape
  - synthetic canary

Final status

status: INVESTIGATION_REVIEWED__ROUTE_OBSERVABILITY_AND_CHUNKING_NEXT