KB-1E9A

AgentData MCP Reliability Phase 1 — Route Observability + Timeout Alignment — 2026-05-18

Revision 1
Tags: agentdata, mcp, gpt-route, nginx, observability, timeout, synthetic-probe, ops, dieu44, 2026-05-18

0. Scope / governance

Phase: agentdata_mcp_reliability_phase_1_route_observability_and_timeout_alignment

Goal: improve stability and diagnosability of the GPT MCP route for AgentData/KB by aligning route timeouts and enabling sanitized route timing logs.

Production mutation: nginx config only, followed by nginx -t and nginx reload. No DB changes. No AgentData core rewrite. No data deletion. No service restart outside nginx reload.

Secrets policy: no auth headers, request body, API keys, or route secret values are included in this report. GPT route path is redacted.

1. Rules / documents read

  • .claude/skills/incomex-rules.md: read locally; 36-item workflow acknowledged; no background agents used.
  • search_knowledge("operating rules SSOT"): returned knowledge/dev/ssot/operating-rules.md OR v7.58 and knowledge/dev/ssot/vps/vps-operating-rules.md v1.0.
  • search_knowledge("hiến pháp v4.0 constitution"): current constitution source returned knowledge/dev/laws/constitution.md, metadata title Hiến pháp Kiến trúc Hệ thống Incomex v4.6.3 BAN HÀNH.
  • Investigation report read: knowledge/dev/laws/dieu44-trien-khai/ops/agentdata-mcp-timeout-investigation-2026-05-18.md.
  • GPT review read: knowledge/dev/laws/dieu44-trien-khai/ops/agentdata-mcp-timeout-investigation-gpt-review-and-next-actions-2026-05-18.md.

2. Three declaration questions (3 câu Tuyên ngôn)

  1. Permanent (Vĩnh viễn): route timeout and request timing are now infrastructure-level signals, not session-specific manual diagnosis.
  2. Can it go wrong? (Nhầm được không): the sanitized access log format excludes auth headers, request body, and route secret by construction; the nginx -t gate prevents reloading an invalid config.
  3. 100% automatic (100% tự động): synthetic probes produce p50/p95/max/status/error evidence; future route failures can be correlated with request_id and upstream timings.

3. Config inspected

Sanitized inspection before patch:

GPT MCP route:
proxy_pass http://agent_data_backend/mcp-gpt-full;
proxy_buffering off;
proxy_read_timeout 60s;
proxy_send_timeout 30s;
proxy_connect_timeout 10s;
access_log off;

Route count: two exact GPT MCP locations: /gpt-mcp/[REDACTED]/mcp and /gpt-mcp/[REDACTED]/mcp/
/api/ AgentData route: proxy_read_timeout 300s
Claude MCP route: already 300s read/send timeout from prior config family
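
The mismatch above (60s/30s on the GPT route versus 300s elsewhere) can be detected mechanically. The following is a minimal sketch, not the tooling used in this phase: the function names and the 300s reference value passed in are illustrative assumptions, and the parser only understands the plain "Ns;" form that appears in this report.

```python
import re

# Matches directives such as "proxy_read_timeout 60s;" (seconds suffix only,
# which is the only form used in this report).
_TIMEOUT_RE = re.compile(
    r"^\s*(proxy_(?:read|send|connect)_timeout)\s+(\d+)s;\s*$"
)


def parse_timeouts(config_text: str) -> dict[str, int]:
    """Return {directive: seconds} for proxy timeout lines in a location block."""
    timeouts = {}
    for line in config_text.splitlines():
        match = _TIMEOUT_RE.match(line)
        if match:
            timeouts[match.group(1)] = int(match.group(2))
    return timeouts


def misaligned(route: dict[str, int], reference_seconds: int) -> list[str]:
    """List read/send timeout directives that differ from the reference value."""
    return [
        name
        for name in ("proxy_read_timeout", "proxy_send_timeout")
        if route.get(name) != reference_seconds
    ]


# Timeout lines copied from the sanitized pre-patch GPT MCP route above.
GPT_ROUTE_BEFORE = """
proxy_read_timeout 60s;
proxy_send_timeout 30s;
proxy_connect_timeout 10s;
"""

if __name__ == "__main__":
    print(misaligned(parse_timeouts(GPT_ROUTE_BEFORE), reference_seconds=300))
```

The connect timeout is deliberately left out of the comparison, since the report keeps it at 10s both before and after the patch.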

4. Patch proposed/applied

Patch applied, minimal and reversible.

Files changed on VPS host bind mounts:

/opt/incomex/docker/nginx/conf.d/default.conf
/opt/incomex/docker/nginx/secrets/gpt-mcp-route.conf

Backups created:

/opt/incomex/docker/nginx/conf.d/default.conf.bak-agentdata-mcp-phase1-20260518T092354Z
/opt/incomex/docker/nginx/secrets/gpt-mcp-route.conf.bak-agentdata-mcp-phase1-20260518T092354Z

Sanitized effective config after patch:

log_format gpt_mcp_sanitized 'request_id=$request_id status=$status upstream_status=$upstream_status request_time=$request_time upstream_response_time=$upstream_response_time bytes_sent=$bytes_sent';

GPT MCP route:
proxy_pass http://agent_data_backend/mcp-gpt-full;
proxy_buffering off;
proxy_read_timeout 300s;
proxy_send_timeout 300s;
proxy_connect_timeout 10s;
access_log /var/log/nginx/gpt-mcp-access.log gpt_mcp_sanitized;

Forbidden fields not logged:

auth headers: not present
request body: not present
route path/secret: not present
API key: not present
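
One way to keep the "excluded by construction" property checkable is to compare the variables in the log format against an allowlist. The sketch below assumes an allowlist approach and hypothetical helper names; it was not part of the applied patch. The format string is the one shown above.

```python
import re

# The sanitized format string from the effective config above.
LOG_FORMAT = (
    "request_id=$request_id status=$status upstream_status=$upstream_status "
    "request_time=$request_time upstream_response_time=$upstream_response_time "
    "bytes_sent=$bytes_sent"
)

# Assumption: only these variables are considered safe to log on this route.
ALLOWED_VARIABLES = {
    "request_id",
    "status",
    "upstream_status",
    "request_time",
    "upstream_response_time",
    "bytes_sent",
}


def variables_in(log_format: str) -> set[str]:
    """Extract nginx variable names written as $name or ${name}."""
    return set(re.findall(r"\$\{?([a-z0-9_]+)\}?", log_format))


def unexpected_variables(log_format: str) -> list[str]:
    """Return variables that are present but not on the allowlist, sorted."""
    return sorted(variables_in(log_format) - ALLOWED_VARIABLES)


if __name__ == "__main__":
    print(unexpected_variables(LOG_FORMAT))
```

An allowlist fails closed: adding a variable such as $request_uri or $http_authorization to the format would be reported even though nobody listed it as forbidden in advance.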

5. nginx -t / reload result

Patch command result:

PATCH_APPLIED_TO_FILES=2
NGINX_T_EXIT=0
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
NGINX_RELOAD_EXIT=0
2026/05/18 09:23:54 [notice] 2314#2314: signal process started

Post-reload validation:

nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful

Rollback was not needed.
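
The test-then-reload gate described in this section can be sketched as follows. This is an illustration under assumptions, not the script that produced the output above: the function names and status strings are hypothetical, and because nginx runs in a container here, the real commands were presumably wrapped in a container exec prefix that is not shown in this report.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes a command and returns its exit code. Injecting it keeps the
# gate logic testable without a real nginx binary.
Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(list(cmd), check=False).returncode


def check_then_reload(runner: Runner = run) -> str:
    """Reload nginx only if the configuration test passes."""
    if runner(["nginx", "-t"]) != 0:
        # Invalid config: do not reload; the caller should restore the backup.
        return "TEST_FAILED_NO_RELOAD"
    if runner(["nginx", "-s", "reload"]) != 0:
        return "RELOAD_FAILED"
    return "RELOADED"
```

On "TEST_FAILED_NO_RELOAD" the running nginx keeps serving the old configuration, so restoring the .bak files is enough to return the host files to their previous state.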

6. Before probe results

Probe set: 5 calls each for /api/mcp and GPT route.

BASELINE_PROBE_START
api_mcp list_seed count=5 errors=0 statuses=200 p50=0.037s p95=0.038s max=0.105s sizes=[3490]
api_mcp get_target count=5 errors=0 statuses=200 p50=0.053s p95=0.058s max=0.151s sizes=[10241]
api_mcp list_fabric count=5 errors=0 statuses=200 p50=0.085s p95=0.098s max=0.108s sizes=[11027]
api_mcp batch_small count=5 errors=0 statuses=200 p50=0.028s p95=0.042s max=0.052s sizes=[2340]
gpt_route list_seed count=5 errors=0 statuses=200 p50=0.022s p95=0.034s max=0.078s sizes=[3490]
gpt_route get_target count=5 errors=0 statuses=200 p50=0.029s p95=0.041s max=0.073s sizes=[10241]
gpt_route list_fabric count=5 errors=0 statuses=200 p50=0.045s p95=0.050s max=0.056s sizes=[11027]
gpt_route batch_small count=5 errors=0 statuses=200 p50=0.073s p95=0.118s max=0.149s sizes=[2340]
BASELINE_PROBE_END

7. After probe results

Same probe set after nginx -t + reload.

AFTER_PROBE_START
api_mcp list_seed count=5 errors=0 statuses=200 p50=0.023s p95=0.029s max=0.089s sizes=[3490]
api_mcp get_target count=5 errors=0 statuses=200 p50=0.023s p95=0.030s max=0.046s sizes=[10241]
api_mcp list_fabric count=5 errors=0 statuses=200 p50=0.032s p95=0.047s max=0.057s sizes=[11027]
api_mcp batch_small count=5 errors=0 statuses=200 p50=0.032s p95=0.038s max=0.071s sizes=[2340]
gpt_route list_seed count=5 errors=0 statuses=200 p50=0.028s p95=0.032s max=0.035s sizes=[3490]
gpt_route get_target count=5 errors=0 statuses=200 p50=0.027s p95=0.037s max=0.043s sizes=[10241]
gpt_route list_fabric count=5 errors=0 statuses=200 p50=0.071s p95=0.095s max=0.187s sizes=[11027]
gpt_route batch_small count=5 errors=0 statuses=200 p50=0.026s p95=0.030s max=0.053s sizes=[2340]
AFTER_PROBE_END
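
For reference, a summary line of this shape can be produced from raw per-call durations as sketched below. The report does not state which percentile method the probe used; this sketch assumes a simple lower-index rule, treats any non-200 status as an error, and omits the sizes field. The durations in the example are made up for illustration and are not the probe's raw data.

```python
def percentile(samples: list[float], q: float) -> float:
    """Lower-index percentile: the value at position int(q * (n - 1)) when sorted.

    Assumption: this rule is chosen for illustration only.
    """
    ordered = sorted(samples)
    return ordered[int(q * (len(ordered) - 1))]


def summarize(name: str, durations: list[float], statuses: list[int]) -> str:
    """Format one probe summary line in the style used in this report."""
    errors = sum(1 for status in statuses if status != 200)
    distinct = ",".join(str(status) for status in sorted(set(statuses)))
    return (
        f"{name} count={len(durations)} errors={errors} statuses={distinct} "
        f"p50={percentile(durations, 0.50):.3f}s "
        f"p95={percentile(durations, 0.95):.3f}s "
        f"max={max(durations):.3f}s"
    )


if __name__ == "__main__":
    # Illustrative values only.
    print(summarize("example_probe", [0.050, 0.010, 0.030, 0.020, 0.040], [200] * 5))
```

With only 5 samples per probe, p95 and max are each determined by a single call, so small before/after differences should be read with caution.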

8. Timing observability verification

Sanitized GPT route access log is now active.

Sample tail:

request_id=278032cd38e163b451d5b1d60a0a08fc status=200 upstream_status=200 request_time=0.010 upstream_response_time=0.008 bytes_sent=10824
request_id=b34881f19523d693df65f18f71fdc607 status=200 upstream_status=200 request_time=0.090 upstream_response_time=0.062 bytes_sent=11610
request_id=f2249cf217380e4c4d5205f90ec47395 status=200 upstream_status=200 request_time=0.013 upstream_response_time=0.011 bytes_sent=2922
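
Because the format is plain key=value pairs, these lines are easy to consume in tooling. The following is a minimal parsing sketch with hypothetical function names. It assumes a single upstream attempt per request; nginx writes "-" or a comma-separated list in the upstream fields in other cases, and the sketch maps those to None rather than guessing.

```python
import re
from typing import Optional

# Second sample line from the log tail above.
SAMPLE = (
    "request_id=b34881f19523d693df65f18f71fdc607 status=200 upstream_status=200 "
    "request_time=0.090 upstream_response_time=0.062 bytes_sent=11610"
)


def _to_float(value: str) -> Optional[float]:
    """Convert a single numeric field; return None for '-' or multi-value fields."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_line(line: str) -> dict:
    """Parse one gpt_mcp_sanitized log line into typed fields."""
    fields = dict(re.findall(r"(\w+)=(\S+)", line))
    return {
        "request_id": fields["request_id"],
        "status": int(fields["status"]),
        "upstream_status": fields["upstream_status"],
        "request_time": float(fields["request_time"]),
        "upstream_response_time": _to_float(fields["upstream_response_time"]),
        "bytes_sent": int(fields["bytes_sent"]),
    }


def time_outside_upstream(record: dict) -> Optional[float]:
    """Seconds of request_time not spent waiting on the upstream response."""
    if record["upstream_response_time"] is None:
        return None
    return round(record["request_time"] - record["upstream_response_time"], 3)
```

For the sample line, the difference between request_time and upstream_response_time is 0.028s, which covers nginx processing plus reading from and writing to the client.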

Sensitive pattern check:

SENSITIVE_PATTERN_HITS=0

Checked patterns included X-API-Key, Authorization, jsonrpc, tools/call, gpt-mcp/, and Bearer.
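
A check of this kind can be sketched as below. The pattern list is the one named above; the case-insensitive substring matching and the function name are assumptions, since the report does not say how the check was implemented.

```python
# Patterns listed in this report's sensitive pattern check.
SENSITIVE_PATTERNS = (
    "X-API-Key",
    "Authorization",
    "jsonrpc",
    "tools/call",
    "gpt-mcp/",
    "Bearer",
)


def count_sensitive_hits(lines: list[str]) -> int:
    """Count log lines that contain any sensitive pattern (case-insensitive)."""
    lowered = [pattern.lower() for pattern in SENSITIVE_PATTERNS]
    return sum(
        1
        for line in lines
        if any(pattern in line.lower() for pattern in lowered)
    )


if __name__ == "__main__":
    clean = (
        "request_id=278032cd38e163b451d5b1d60a0a08fc status=200 upstream_status=200 "
        "request_time=0.010 upstream_response_time=0.008 bytes_sent=10824"
    )
    print(f"SENSITIVE_PATTERN_HITS={count_sensitive_hits([clean])}")
```

Counting lines rather than individual matches keeps the result comparable to a grep-style line count.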

9. Whether latency improved

Route timing stayed healthy before and after the patch. No 503 or 504 was reproduced in either the baseline or the after probes.

Observed changes:

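  • /api/mcp probes were mostly faster after the reload: p95 was at or below 0.047s on all four probes, and the highest max was 0.089s (list_seed). With only 5 calls per probe, differences of this size may be run-to-run variation.
  • GPT route remained healthy. batch_small improved from p95 0.118s to 0.030s. list_fabric p95 rose from 0.050s to 0.095s (max 0.187s), still far below both the previous 60s and the new 300s read timeout.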
  • Primary improvement is not raw latency; it is timeout headroom and route-level timing visibility.

10. Remaining risks

  • GPT/OpenAI connector gateway latency can still happen outside VPS/nginx/AgentData; this phase now makes that distinguishable when nginx request_time is fast but client tool wall time is slow.
  • Full document and batch_read(full=true) remain monolithic payload contracts; large reads can still be fragile through connector layers.
  • Current log is route-level only; AgentData internal tool duration logging should still be enhanced later for end-to-end request_id correlation.
  • Synthetic probes were ad hoc in this phase; a scheduled canary/alert still needs implementation.
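
The first risk above implies a simple triage rule once client wall time and the nginx log fields are available for the same request. The sketch below is one possible formulation, assuming a hypothetical 5-second "slow" threshold and illustrative labels; none of these values come from the investigation.

```python
HEALTHY = "healthy"
UPSTREAM = "upstream (AgentData)"
EDGE = "nginx or client transfer"
OUTSIDE = "outside VPS (connector gateway or network)"


def locate_latency(
    client_wall_s: float,
    request_time_s: float,
    upstream_time_s: float,
    slow_s: float = 5.0,  # Assumption: illustrative threshold, not a measured SLO.
) -> str:
    """Guess where a slow call spent its time from three timing signals."""
    if client_wall_s < slow_s:
        return HEALTHY
    if upstream_time_s >= slow_s:
        # nginx itself waited a long time for the backend.
        return UPSTREAM
    if request_time_s >= slow_s:
        # Backend was fast, but nginx still held the request for a long time.
        return EDGE
    # nginx saw a fast request; the delay happened before or after the VPS.
    return OUTSIDE


if __name__ == "__main__":
    print(locate_latency(client_wall_s=40.0, request_time_s=0.090, upstream_time_s=0.062))
```

The last branch is the case this phase makes visible: a fast request_time in the route log combined with a slow tool call on the client side.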

11. Next recommendation

Proceed with chunk/cursor API design next, but do it as a separate phase after observing GPT route logs for real traffic.

Recommended next phase:

next_phase: agentdata_mcp_reliability_phase_2_chunk_cursor_and_response_guards
trigger: if GPT route logs show nginx/upstream fast but GPT tool wall time remains slow, or if full/batch reads still fail
scope:
  - get_document_chunk(document_id, offset, limit)
  - batch_read full=true response byte guard
  - partial_available + cursor response shape
  - structured AgentData tool duration logs with request_id
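
To make the proposed scope concrete, here is a hypothetical sketch of what a chunked read with a cursor could look like. The API has not been designed yet: the field semantics, character-based offsets, in-memory store, and the read_all helper are all assumptions for illustration, and only the names get_document_chunk, document_id, offset, limit, partial_available, and cursor are taken from the scope list above.

```python
from typing import Optional, TypedDict


class Chunk(TypedDict):
    document_id: str
    content: str
    partial_available: bool  # Assumed meaning: more content remains after this chunk.
    cursor: Optional[int]    # Assumed meaning: offset for the next call, or None at the end.


# Stand-in for the real document store (hypothetical).
_DOCUMENTS = {"doc-1": "abcdefghij"}


def get_document_chunk(document_id: str, offset: int, limit: int) -> Chunk:
    """Return at most `limit` characters of a document starting at `offset`."""
    if offset < 0 or limit <= 0:
        raise ValueError("offset must be >= 0 and limit must be > 0")
    text = _DOCUMENTS[document_id]
    end = min(offset + limit, len(text))
    more = end < len(text)
    return {
        "document_id": document_id,
        "content": text[offset:end],
        "partial_available": more,
        "cursor": end if more else None,
    }


def read_all(document_id: str, limit: int) -> str:
    """Client-side loop: follow the cursor until the document is exhausted."""
    parts = []
    cursor: Optional[int] = 0
    while cursor is not None:
        chunk = get_document_chunk(document_id, cursor, limit)
        parts.append(chunk["content"])
        cursor = chunk["cursor"]
    return "".join(parts)
```

Each call returns a bounded payload, so a failure in a connector layer costs one chunk and the client can resume from the last cursor instead of repeating a monolithic read.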

Phase 1 status:

status: APPLIED_AND_VERIFIED
nginx_reload: success
route_works_after_patch: yes
request_timing_observable: yes
secrets_logged: no evidence found
Document path: knowledge/dev/laws/dieu44-trien-khai/ops/agentdata-mcp-reliability-phase-1-route-observability-timeout-2026-05-18.md