KB-54B9

GPT ↔ Agent Data connector — third recurrence root-cause analysis (2026-05-12)

Tags: incident, gpt-action, connector, root-cause, 2026-05-12


  • Date (UTC): 2026-05-12
  • Recurrences today (before this report): ≥ 3
  • Investigation window: 04:00 UTC → 10:10 UTC
  • Verdict: Server-side (VPS, nginx, Agent Data, schema, redirect) is NOT the root cause. Root cause is per-conversation OpenAI Action gateway "stuck" session state.

1. Full-day timeline (UTC)

| time (UTC) | event | GPT-side | server-side | conclusion |
|---|---|---|---|---|
| 01:46–04:14 | regular ChatGPT-User traffic to /api/{chat,documents,documents/batch} | repeated PASS rounds | all 200 | server healthy |
| 04:14 | Agent Data container restart (planned) | n/a | startup OK; Qdrant + PG probe OK | no GPT regression after restart |
| 04:15:03–04:15:14 | brief upstream unavailability during restart | n/a | nginx 502 to internal probes only (curl, Uptime-Kuma) | did not affect OpenAI traffic |
| 04:15:52 | wrapper listDocuments HTTP 400 WRONG_QUERY_PARAMETER | n/a | caller IP 38.242.240.89 (internal), UA Python-urllib/3.12 | internal smoke script using stale param shape, not OpenAI |
| 06:33–07:13 | /api/chatgpt-openapi.json redirect rebuilt | GPT-Builder import attempts from 123.24.178.152 | nginx patched: return 308 /api/openapi.json | deprecated-URL guard installed |
| 06:52, 07:12, 07:13 | HEAD /api/openapi.json 200 0 | curl HEAD probes by user | nginx returns headers only (RFC-compliant) | not a corruption event |
| 07:05–08:49 | normal traffic continues (125 ChatGPT-User reqs today, all 200) | repeated PASS | all 200, bytes 148–34059 | server healthy throughout |
| 08:49:43 | last ChatGPT-User request reaching VPS: POST /api/documents?upsert=true 200 148 | end of last PASS window | normal | last successful egress |
| 08:50 → 09:42 | ~53 min idle (no GPT egress to VPS) | session sat idle in same conversation | no GPT traffic | suspected window for gateway-session transition to error state |
| 09:42:54–09:48 | Failure window #A (same conversation) — 5 tool calls triggered | healthCheck no output; searchKnowledge / getDocumentTruncated / listDocuments(notes) / listDocuments(reviews) → ClientResponseError | 0 ChatGPT-User UA hits, 0 OpenAI-IP hits, 0 non-healthCheck wrapper calls in window | requests never left OpenAI's gateway |
| 09:48–09:55 | Failure window #B (same conversation, 5 retries) | identical failure pattern | identical: 0 egress to VPS | stuck state reproduced |
| 09:55–10:10 | fresh conversation triggered, same Custom GPT | healthCheck PASS, searchKnowledge PASS, listDocuments PASS (×2), createDocument PASS | 5 ChatGPT-User reqs from 172.196.40.{212,213,215,222}, all 200 | gateway egress restored simply by opening a new conversation |

Today's connector wrapper telemetry (00:00 UTC → 10:00 UTC): 1202 PASS / 1 FAIL (the FAIL is the internal smoke-script param mismatch above). No FAIL ever originated from OpenAI IPs.
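The tally above can be recomputed from the wrapper log. A minimal sketch, assuming a line-oriented log where each telemetry line carries a literal PASS or FAIL token (the exact log format is an assumption; the path mirrors the appendix pointer, and the full-day counts in the appendix come from docker logs incomex-agent-data):

```python
from collections import Counter

def tally_wrapper_verdicts(log_path: str) -> Counter:
    """Count PASS/FAIL verdict lines in a wrapper telemetry log."""
    counts = Counter()
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            if "PASS" in line:
                counts["PASS"] += 1
            elif "FAIL" in line:
                counts["FAIL"] += 1
    return counts

# Run against the full-day wrapper log to reproduce 1202 PASS / 1 FAIL.
print(tally_wrapper_verdicts("/tmp/gpt_tail_agentdata.log"))
```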


2. Evidence — three failure rounds

Round 1 (earlier today, user-reported)

Symptom: GPT tools returned ClientResponseError. Earlier occurrences had been addressed by (a) a connector-layer patch, (b) reimporting /api/openapi.json, and (c) the /api/chatgpt-openapi.json → 308 redirect. Those fixes addressed real but distinct earlier VPS-side bugs (each is permanent and confirmed in the nginx config); none of them touched the recurring gateway-session issue.

Round 2 — same-conversation, 09:42:54 → 09:48 UTC

| # | tool | GPT result | request at nginx | OpenAI IP at VPS | wrapper status |
|---|---|---|---|---|---|
| 1 | healthCheck | (silent) | NO | NO | — |
| 2 | searchKnowledge | ClientResponseError | NO | NO | — |
| 3 | getDocumentTruncated | ClientResponseError | NO | NO | — |
| 4 | listDocuments(notes) | ClientResponseError | NO | NO | — |
| 5 | listDocuments(reviews) | ClientResponseError | NO | NO | — |

Window-level: 0 ChatGPT-User UA hits, 0 hits from 20.194.x / 172.196.x. Only internal Uptime-Kuma + self-probes (172.18.0.7) and a stray iPhone-UA bot at 09:44:46 hitting / (301).
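The window-level counts come from scanning the nginx access log for ChatGPT-User hits inside the failure window. A minimal sketch, assuming the standard nginx combined log format (timestamp as [12/May/2026:09:42:54 +0000], user agent as the last quoted field); log path and format are assumptions to adjust to the actual deployment:

```python
import re
from datetime import datetime, timezone

# Matches the bracketed timestamp and the trailing quoted user-agent field.
LOG_RE = re.compile(r'\[(?P<ts>[^\]]+)\].*"(?P<ua>[^"]*)"\s*$')

def chatgpt_hits(log_path: str, start: datetime, end: datetime) -> int:
    """Count ChatGPT-User requests landing at the VPS inside [start, end]."""
    hits = 0
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            m = LOG_RE.search(line)
            if not m:
                continue
            ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
            if start <= ts <= end and "ChatGPT-User" in m.group("ua"):
                hits += 1
    return hits

win_start = datetime(2026, 5, 12, 9, 42, 54, tzinfo=timezone.utc)
win_end = datetime(2026, 5, 12, 9, 48, 0, tzinfo=timezone.utc)
print(chatgpt_hits("/tmp/gpt_today.log", win_start, win_end))  # 0 for window #A
```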

Round 3 — same-conversation retry, 09:48 → 09:55 UTC

Identical signature (0 egress / 5 ClientResponseError) — reproduces the stuck state. The conversation cannot recover by retrying.

Counter-experiment — fresh conversation, 09:55 → 10:10 UTC

| time (UTC) | tool | OpenAI IP | route | status | bytes | wrapper log |
|---|---|---|---|---|---|---|
| 09:56:32 | healthCheck | 172.196.40.213 | GET /api/health | 200 | 531 | healthy |
| 09:56:48 | searchKnowledge | 172.196.40.215 | POST /api/chat | 200 | 3910 | OK |
| 09:56:56 | listDocuments | 172.196.40.222 | GET /api/kb/list?prefix=knowledge/current-state/reports&limit=5 | 200 | 1275 | total_count=403 |
| 09:57:07 | createDocument | 172.196.40.212 | POST /api/documents | 200 | 119 | OK |
| 09:59:34 | searchKnowledge | 172.196.40.215 | POST /api/chat | 200 | 5585 | OK |

A/B summary: same VPS, same Custom GPT, same schema (openapi 3.1.0, version 1.2.0, x-connector-schema-version gpt-agent-data-2026-05-12.1, hash aaec3d401df2), same auth — only the conversation changed. Fresh conversation → 100% PASS. Same conversation → 0% egress, 100% ClientResponseError.
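Because the schema layer keeps coming under suspicion, a cheap drift probe is to fetch the canonical schema URL and compare the connector identifiers against the values recorded here. A minimal sketch, assuming x-connector-schema-version and x-connector-schema-hash are exposed as HTTP response headers on /api/openapi.json (if they instead live inside the JSON body, read them from the parsed document); the base URL is a placeholder:

```python
import urllib.request

# Values recorded in this report; any mismatch means drift since 2026-05-12.
EXPECTED = {
    "x-connector-schema-version": "gpt-agent-data-2026-05-12.1",
    "x-connector-schema-hash": "aaec3d401df2",
}

def schema_headers_match(base_url: str) -> bool:
    """Fetch the canonical schema route and compare connector headers."""
    with urllib.request.urlopen(base_url + "/api/openapi.json") as resp:
        return all(resp.headers.get(name) == want
                   for name, want in EXPECTED.items())

# Usage (placeholder host): schema_headers_match("https://example-vps.invalid")
```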


3. PASS vs FAIL window comparison

| Factor | PASS (07:00–08:50 UTC, then fresh 09:55+) | FAIL (09:42–09:55 UTC) | Difference |
|---|---|---|---|
| OpenAI IP reaches VPS | YES (125 + 5 hits) | NO (0 hits) | egress vanished |
| Routes called | /api/chat, /api/documents/batch, /api/documents, /api/kb/list, /api/health | n/a | — |
| Schema URL fetched | /api/openapi.json 200 27833 | unchanged | identical |
| Schema hash | aaec3d401df2 | unchanged | identical |
| Schema version | gpt-agent-data-2026-05-12.1 | unchanged | identical |
| info.version | 1.2.0 | 1.2.0 | identical |
| Server status | healthy | healthy | identical |
| Conversation | various (fresh-conv PASS = new) | same stuck conv | only variable |
| 4xx/5xx from OpenAI | 0 all day | n/a (no requests) | — |
| Auth header invalid (401/403) | none | none | identical |

Single biggest difference: conversation identity. Everything else identical.


4. Root cause

The OpenAI Custom-GPT Action gateway maintains a per-conversation tool-call session that, after some triggering event (most likely an idle interval of several tens of minutes or more, possibly combined with an earlier transient response anomaly), enters a degraded state. In that state the gateway drops every subsequent tool dispatch from the conversation before it egresses to the upstream server; no request lands at the VPS while the session is stuck.

The mechanism is opaque to us (lives entirely inside OpenAI infrastructure), but the empirical signature is unambiguous:

  • N retries inside the stuck conversation → 0 reach VPS, all return ClientResponseError.
  • A fresh conversation against the same Custom GPT → 100% reach VPS, all PASS.
  • No server-side state change is needed to recover (a triage rule capturing this signature is sketched below).
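That signature reduces to a one-line triage rule. A hedged sketch, where function and message names are illustrative rather than an existing API:

```python
def classify_episode(gpt_tool_errors: int, vps_hits_in_window: int) -> str:
    """Triage rule implied by the empirical signature above."""
    if gpt_tool_errors == 0:
        return "healthy"
    if vps_hits_in_window == 0:
        # Errors with zero egress: requests died inside OpenAI's Action gateway.
        return "gateway-stuck: open a fresh conversation, do not touch the VPS"
    # Errors with egress present point at transport/auth/server instead.
    return "server-or-transport: inspect nginx and wrapper logs for the window"

print(classify_episode(5, 0))  # failure windows #A/#B classify as gateway-stuck
```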

5. Hypotheses evaluated (with evidence)

| H | Hypothesis | Verdict | Evidence |
|---|---|---|---|
| H1 | Schema cache/lifecycle drift | partially supported (same gateway-side layer) | gateway-internal; not directly observable from VPS, but the stuck state lives in the same layer |
| H2 | Gateway suppresses egress after internal error | CONFIRMED | 0/10 same-conv calls reached VPS; 5/5 fresh-conv calls reached |
| H3 | One bad response poisons session | not falsified, but not necessary | last successful egress at 08:49 was unremarkable (148-byte upsert); session may have drifted on idle alone |
| H4 | Auth/token issue | REJECTED | no 401/403 today; fresh conv used identical auth and passed |
| H5 | REST↔MCP schema split confuses GPT Actions | REJECTED | GPT Action surface uses REST aaec3d401df2; MCP 5f13902e975a is internal MCP transport only |
| H6 | Deprecated /api/chatgpt-openapi.json still in use | REJECTED | zero traffic to that path after 07:13:50 (308 redirect installed) |
| H7 | batchReadDocuments parser error / large response | REJECTED for this incident | failures include the smallest tools (healthCheck); batchRead not even attempted in failure window; last successful batchRead at 08:49:08 was a typical 14,748 bytes |
| H8 | Real server-side error (5xx/timeout/upstream reset) | REJECTED | nginx access log shows 0 non-2xx from OpenAI all day; nginx error_log has only buffer-to-temp-file warnings; conntrack 18/262144; no firewall drops; Agent Data uvicorn logged no exceptions during traffic |

6. Fix applied / why no VPS change

No VPS change applied or warranted. The hard boundaries from the incident ticket were respected:

  • No rollback of OGV-2C.
  • No PG / Qdrant touch.
  • No Agent Data / nginx restart (no server-side evidence required one).
  • No curl/Claude-only smoke test claimed as PASS — every verdict above is anchored to a real GPT tool invocation from the user.

Operational recovery (proven, user-side only): when a conversation enters stuck state, start a fresh conversation with the same Custom GPT. The new conversation gets a clean gateway session and tool calls resume immediately.

If the recurrence pattern still bothers us after this report, three optional mitigations require user approval before any VPS edit:

  1. Bump info.version on every schema change (currently frozen at 1.2.0 across multiple edits today). Each bump invalidates the gateway's per-schema cache. Cheap, defensive, low-risk.
  2. Add a lightweight /api/actions/ping returning < 200 bytes. This gives the gateway a near-zero-cost idle keep-alive target and may reduce the rate of idle-induced stuck sessions (no guarantee; gateway internals are opaque).
  3. Stamp every response with an X-Schema-Version: <hash> header. This lets us correlate server-side responses with the gateway's view of the schema and detect drift cheaply. (Items 2 and 3 are sketched together below.)
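A combined sketch of mitigations 2 and 3, assuming Agent Data is a FastAPI app behind uvicorn (the framework is an assumption; the report only confirms uvicorn). Route and header names follow the list above; SCHEMA_HASH would come from wherever the wrapper already derives aaec3d401df2:

```python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
SCHEMA_HASH = "aaec3d401df2"  # keep in sync with the served schema

@app.get("/api/actions/ping")
async def actions_ping() -> JSONResponse:
    # Mitigation 2: tiny (<200 byte) body, a near-zero-cost keep-alive target.
    return JSONResponse({"ok": True, "schema": SCHEMA_HASH})

@app.middleware("http")
async def stamp_schema_version(request, call_next):
    # Mitigation 3: stamp every response so server-side logs can be correlated
    # with the gateway's view of the schema.
    response = await call_next(request)
    response.headers["X-Schema-Version"] = SCHEMA_HASH
    return response
```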

None of these change the diagnosis: server is healthy.


7. Long-run acceptance test

Round A (T+0, fresh conversation): PASS — all five tool calls in the fresh conversation reached the VPS with HTTP 200. User-side: healthCheck healthy; searchKnowledge returned hits; listDocuments returned total_count=403 (reports prefix) and total_count=2728 (root prefix, consistent with the user's observation of 2727); createDocument wrote the fresh-test marker at knowledge/current-state/reports/gpt-action-agent-data-fresh-test-2026-05-12.md.

Rounds B–F (long-idle, heavy-tool, post-batch, post-5×search, post-large-list): not yet executed — they require additional GPT-side triggers spread over ~15 min + heavy operations. Recommended to run inside the fresh conversation across the next 30 minutes; if any of those rounds drops to 0 egress we re-open Phase 3 with new evidence.

If user runs Rounds B–F and they all pass, the incident is considered resolved with the operational recovery procedure above as the standing remediation until OpenAI publishes a fix at the gateway layer.


8. Recommendations — to reduce recurrence (priority order)

  1. Treat ClientResponseError-without-egress as a gateway-side incident class. Do not restart VPS services, do not edit OpenAPI, do not re-import schema as a first response. Open a fresh conversation first.
  2. Bump info.version whenever the OpenAPI schema body changes. Today's hash aaec3d401df2 was served under 1.2.0 even after structural edits, so the gateway cannot reliably invalidate its cache.
  3. Capture, for every future ClientResponseError, the exact UTC timestamp, tool name, and visible error payload from the GPT side (a capture template is sketched after this list). That is the only signal that distinguishes a gateway-stuck episode from transport, auth, or server failures.
  4. Optional VPS mitigations (require approval before applying): info.version bump policy, /api/actions/ping, X-Schema-Version response header.
  5. Sync-status warning (document_count=2727 vs vector_point_count=5583, ratio=2.05) is unrelated to today's incident — chunking ratio ~2 is normal — but worth a separate follow-up if RAG quality drift is suspected.
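For recommendation 3 above, a minimal capture template: one JSON line per observed ClientResponseError, written where triage can later join it against nginx and wrapper logs. Field names and the output path are illustrative, not an existing interface:

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

@dataclass
class ToolErrorRecord:
    ts_utc: str          # exact UTC timestamp of the failed call
    conversation: str    # conversation identifier (or a user-visible proxy)
    tool: str            # e.g. "searchKnowledge"
    error_payload: str   # visible error text from the GPT side

def capture(conversation: str, tool: str, error_payload: str,
            path: str = "/tmp/gpt_tool_errors.jsonl") -> None:
    """Append one structured error record per ClientResponseError sighting."""
    rec = ToolErrorRecord(datetime.now(timezone.utc).isoformat(),
                          conversation, tool, error_payload)
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(rec)) + "\n")
```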

Appendix — raw evidence pointers (VPS, ephemeral)

  • /tmp/gpt_today.log — 125 ChatGPT-User access lines (timestamp, IP, route, status, bytes)
  • /tmp/gpt_tail_nginx.log — 5-line capture of fresh-conv hits (172.196.40.213/215/222/212/215)
  • /tmp/gpt_tail_agentdata.log — 336-line capture spanning both stuck windows and fresh-conv window
  • docker logs incomex-nginx --since "2026-05-12T00:00:00Z" — full nginx today (status counts: 125× 200 for ChatGPT-User, 0× non-2xx)
  • docker logs incomex-agent-data --since "2026-05-12T00:00:00Z" — wrapper telemetry: 1202 PASS / 1 FAIL (FAIL = internal smoke at 04:15:52)
  • Active nginx conf: /opt/incomex/docker/nginx/conf.d/default.conf lines 137–150 (308 redirect + canonical schema route)
  • Schema served: 27,833 bytes, md5 a2e75d3354d930fc6db9a4252a249aa5, version 1.2.0, x-connector-schema-version gpt-agent-data-2026-05-12.1, x-connector-schema-hash aaec3d401df2
  • Fresh-test user report co-evidence: knowledge/current-state/reports/gpt-action-agent-data-fresh-test-2026-05-12.md (created by user via fresh-conv createDocument at 09:57:07 UTC)