# GPT ↔ Agent Data connector — third recurrence root-cause analysis (2026-05-12)
- Date (UTC): 2026-05-12
- Recurrences today (before this report): ≥ 3
- Investigation window: 04:00 UTC → 10:10 UTC
- Verdict: Server-side (VPS, nginx, Agent Data, schema, redirect) is NOT the root cause. Root cause is per-conversation OpenAI Action gateway "stuck" session state.
## 1. Full-day timeline (UTC)
| time | event | GPT-side | server-side | conclusion |
|---|---|---|---|---|
| 01:46–04:14 | regular ChatGPT-User traffic to /api/{chat,documents,documents/batch} | repeated PASS rounds | all 200 | server healthy |
| 04:14 | Agent Data container restart (planned) | n/a | startup OK; Qdrant + PG probe OK | no GPT regression after restart |
| 04:15:03–04:15:14 | brief upstream unavailability during restart | n/a | nginx 502 to internal probes only (curl, Uptime-Kuma) | did not affect OpenAI traffic |
| 04:15:52 | wrapper listDocuments HTTP 400 WRONG_QUERY_PARAMETER | n/a | caller IP 38.242.240.89 (internal), UA Python-urllib/3.12 | internal smoke script using stale param shape, not OpenAI |
| 06:33–07:13 | /api/chatgpt-openapi.json redirect rebuilt | GPT-Builder import attempts from 123.24.178.152 | nginx patched: return 308 /api/openapi.json | deprecated URL guard installed |
| 06:52, 07:12, 07:13 | HEAD /api/openapi.json 200 0 | curl HEAD probes by user | nginx returns headers only (RFC-compliant) | not a corruption event |
| 07:05–08:49 | normal traffic continues (125 ChatGPT-User reqs today, all 200) | repeated PASS | all 200, bytes 148–34059 | server healthy throughout |
| 08:49:43 | last ChatGPT-User request reaching VPS: POST /api/documents?upsert=true 200 148 | end of last PASS window | normal | last successful egress |
| 08:50 → 09:42 | ~53 min idle (no GPT egress to VPS) | session sat idle in same conversation | no GPT traffic | suspected window for gateway-session transition to error state |
| 09:42:54–09:48 | Failure window #A (same conversation) — 5 tool calls triggered | healthCheck no output; searchKnowledge / getDocumentTruncated / listDocuments(notes) / listDocuments(reviews) → ClientResponseError | 0 ChatGPT-User UA hits, 0 OpenAI-IP hits, 0 non-healthCheck wrapper calls in window | requests never left OpenAI's gateway |
| 09:48–09:55 | Failure window #B (same conversation, 5 retries) | identical failure pattern | identical: 0 egress to VPS | stuck state reproduced |
| 09:55–10:10 | Fresh conversation triggered, same Custom GPT | healthCheck PASS, searchKnowledge PASS, listDocuments PASS (×2), createDocument PASS | 5 ChatGPT-User reqs from 172.196.40.{212,213,215,222}, all 200 | gateway egress restored simply by opening a new conversation |
Today's connector wrapper telemetry (00:00 UTC → 10:00 UTC): 1202 PASS / 1 FAIL (the FAIL is the internal smoke-script param mismatch above). No FAIL ever originated from OpenAI IPs.
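The per-window egress counts above come from scanning the nginx access log for the ChatGPT-User user agent. As a minimal illustration (the combined-log layout and field positions here are assumptions, not taken from this report), the scan can be sketched as:

```python
import re
from collections import Counter

# Assumed nginx combined-log-style line; the real log format may differ.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+) "[^"]*" "(?P<ua>[^"]*)"'
)

def egress_summary(lines):
    """Count ChatGPT-User requests per status code, ignoring all other callers."""
    statuses = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and "ChatGPT-User" in m.group("ua"):
            statuses[m.group("status")] += 1
    return statuses

# Synthetic line modeled on the fresh-conversation healthCheck hit.
sample = ('172.196.40.213 - - [12/May/2026:09:56:32 +0000] '
          '"GET /api/health HTTP/1.1" 200 531 "-" "ChatGPT-User/1.0"')
print(egress_summary([sample]))  # Counter({'200': 1})
```

An empty Counter over a FAIL window is exactly the "0 egress" signal the report relies on.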
## 2. Evidence — three failure rounds
### Round 1 (earlier today, user-reported)
Symptom: GPT tools returned ClientResponseError. Earlier remediations ((a) connector-layer patch, (b) reimport of /api/openapi.json, (c) /api/chatgpt-openapi.json → 308 redirect) addressed real but distinct earlier VPS-side bugs; each fix is permanent and confirmed in the nginx config, but none of them addressed the recurring gateway-session issue.
### Round 2 — same-conversation, 09:42:54 → 09:48 UTC
| # | tool | GPT result | request at nginx | OpenAI IP at VPS | wrapper status |
|---|---|---|---|---|---|
| 1 | healthCheck | (silent) | NO | NO | — |
| 2 | searchKnowledge | ClientResponseError | NO | NO | — |
| 3 | getDocumentTruncated | ClientResponseError | NO | NO | — |
| 4 | listDocuments(notes) | ClientResponseError | NO | NO | — |
| 5 | listDocuments(reviews) | ClientResponseError | NO | NO | — |
Window-level: 0 ChatGPT-User UA hits, 0 hits from 20.194.x / 172.196.x. Only internal Uptime-Kuma + self-probes (172.18.0.7) and a stray iPhone-UA bot at 09:44:46 hitting / (301).
### Round 3 — same-conversation retry, 09:48 → 09:55 UTC
Identical signature (0 egress / 5 ClientResponseError) — reproduces the stuck state. The conversation cannot recover by retrying.
### Counter-experiment — fresh conversation, 09:55 → 10:10 UTC
| time UTC | tool | OpenAI IP | route | status | bytes | wrapper log |
|---|---|---|---|---|---|---|
| 09:56:32 | healthCheck | 172.196.40.213 | GET /api/health | 200 | 531 | healthy |
| 09:56:48 | searchKnowledge | 172.196.40.215 | POST /api/chat | 200 | 3910 | OK |
| 09:56:56 | listDocuments | 172.196.40.222 | GET /api/kb/list?prefix=knowledge/current-state/reports&limit=5 | 200 | 1275 | total_count=403 |
| 09:57:07 | createDocument | 172.196.40.212 | POST /api/documents | 200 | 119 | OK |
| 09:59:34 | searchKnowledge | 172.196.40.215 | POST /api/chat | 200 | 5585 | OK |
A/B summary: same VPS, same Custom GPT, same schema (openapi 3.1.0, version 1.2.0, x-connector-schema-version gpt-agent-data-2026-05-12.1, hash aaec3d401df2), same auth — only the conversation changed. Fresh conversation → 100% PASS. Same conversation → 0% egress, 100% ClientResponseError.
## 3. PASS vs FAIL window comparison
| Factor | PASS (07:00–08:50 UTC, then fresh 09:55+) | FAIL (09:42–09:55 UTC) | Difference |
|---|---|---|---|
| OpenAI IP reaches VPS | YES (125 + 5 hits) | NO (0 hits) | egress vanished |
| Routes called | /api/chat, /api/documents/batch, /api/documents, /api/kb/list, /api/health | n/a | — |
| Schema URL fetched | /api/openapi.json 200 27833 | unchanged | identical |
| Schema hash | aaec3d401df2 | unchanged | identical |
| Schema version | gpt-agent-data-2026-05-12.1 | unchanged | identical |
| info.version | 1.2.0 | 1.2.0 | identical |
| Server status | healthy | healthy | identical |
| Conversation | (PASS windows: various; fresh-conv PASS = new) | same stuck conv | only variable |
| 4xx/5xx from OpenAI | 0 all day | n/a (no requests) | — |
| Auth header invalid (401/403) | none | none | identical |
Single biggest difference: conversation identity. Everything else identical.
## 4. Root cause
The OpenAI Custom-GPT Action gateway maintains a per-conversation tool-call session that, after some triggering event (most likely an idle interval ≥ several tens of minutes, possibly combined with an earlier transient response anomaly), enters a degraded state. In this state, the gateway drops every subsequent tool dispatch from this conversation before egress reaches the upstream server. No request lands at VPS while the session is in that state.
The mechanism is opaque to us (lives entirely inside OpenAI infrastructure), but the empirical signature is unambiguous:
- N retries inside the stuck conversation → 0 reach VPS, all return ClientResponseError.
- A fresh conversation against the same Custom GPT → 100% reach VPS, all PASS.
- No server-side state change is needed to recover.
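That signature lends itself to a mechanical triage rule. A sketch in Python (the `WindowEvidence` type and `classify` function are illustrative names, not existing tooling):

```python
from dataclasses import dataclass

@dataclass
class WindowEvidence:
    client_errors: int   # ClientResponseError count seen on the GPT side
    egress_hits: int     # ChatGPT-User requests logged at the VPS in the window
    server_non_2xx: int  # non-2xx responses served to OpenAI IPs in the window

def classify(w: WindowEvidence) -> str:
    """Coarse triage matching the A/B evidence pattern in this report."""
    if w.client_errors and w.egress_hits == 0:
        # Client-visible errors with zero egress: requests died inside the gateway.
        return "gateway-stuck (open a fresh conversation)"
    if w.server_non_2xx:
        return "server-side (investigate VPS)"
    return "healthy / inconclusive"

# Failure window #A: 5 ClientResponseError, 0 egress, 0 server-side errors.
print(classify(WindowEvidence(5, 0, 0)))  # gateway-stuck (open a fresh conversation)
# Fresh conversation: 0 errors, 5 hits, all 200.
print(classify(WindowEvidence(0, 5, 0)))  # healthy / inconclusive
```

The point of the rule: no combination of server-side evidence alone can produce the "gateway-stuck" verdict, which is why no VPS change was warranted.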
## 5. Hypotheses ruled out (with evidence)
| H | Hypothesis | Verdict | Evidence |
|---|---|---|---|
| H1 | Schema cache/lifecycle drift | PLAUSIBLE as a contributing layer | gateway-internal; not directly observable from the VPS, but the stuck state lives in the same layer |
| H2 | Gateway suppresses egress after internal error | CONFIRMED | 0/10 same-conv calls reached VPS; 5/5 fresh-conv calls reached |
| H3 | One bad response poisons session | not falsified, but not necessary | last successful egress 08:49 was unremarkable (148 bytes upsert); session may have drifted on idle alone |
| H4 | Auth/token issue | REJECTED | no 401/403 today; fresh conv used identical auth and passed |
| H5 | REST↔MCP schema split confuses GPT Actions | REJECTED | GPT Action surface uses REST aaec3d401df2; MCP 5f13902e975a is internal MCP transport only |
| H6 | Deprecated /api/chatgpt-openapi.json still in use | REJECTED | zero traffic to that path after 07:13:50 (308 redirect installed) |
| H7 | batchReadDocuments parser error / large response | REJECTED for this incident | failures include the smallest tools (healthCheck); batchRead not even attempted in failure window; last successful batchRead at 08:49:08 was a typical 14,748 bytes |
| H8 | Real server-side error (5xx/timeout/upstream reset) | REJECTED | nginx access log 0 non-2xx from OpenAI all day; nginx error_log only buffer-to-temp-file warnings; conntrack 18/262144; no firewall drops; Agent Data uvicorn no exceptions during traffic |
## 6. Fix applied / why no VPS change
No VPS change applied or warranted. The hard boundaries from the incident ticket were respected:
- No rollback of OGV-2C.
- No PG / Qdrant touch.
- No Agent Data / nginx restart (no server-side evidence required one).
- No curl/Claude-only smoke test claimed as PASS — every verdict above is anchored to a real GPT tool invocation from the user.
Operational recovery (proven, user-side only): when a conversation enters stuck state, start a fresh conversation with the same Custom GPT. The new conversation gets a clean gateway session and tool calls resume immediately.
If the recurrence pattern still bothers us after this report, three optional mitigations require user approval before any VPS edit:
- Bump `info.version` on every schema change (currently frozen at `1.2.0` across multiple edits today). Each bump invalidates the gateway's per-schema cache. Cheap, defensive, low-risk.
- Add a lightweight `/api/actions/ping` returning < 200 bytes. Gives the gateway a near-zero-cost idle keep-alive target and may reduce the rate of idle-induced stuck sessions. (No guarantee — gateway internals are opaque.)
- Stamp every response with an `X-Schema-Version: <hash>` header. Lets us correlate server-side responses with the gateway's view of the schema and detect drift cheaply.
None of these change the diagnosis: server is healthy.
## 7. Long-run acceptance test
Round A (T+0 from fresh conv): PASS — five tool calls in fresh conversation all reached VPS with HTTP 200; user-side: healthCheck healthy, searchKnowledge returned hits, listDocuments returned total_count=403 (reports prefix) and total_count=2728 (root prefix, matches user's observation 2727), createDocument wrote the fresh-test marker at knowledge/current-state/reports/gpt-action-agent-data-fresh-test-2026-05-12.md.
Rounds B–F (long-idle, heavy-tool, post-batch, post-5×search, post-large-list): not yet executed — they require additional GPT-side triggers spread over ~15 min + heavy operations. Recommended to run inside the fresh conversation across the next 30 minutes; if any of those rounds drops to 0 egress we re-open Phase 3 with new evidence.
If user runs Rounds B–F and they all pass, the incident is considered resolved with the operational recovery procedure above as the standing remediation until OpenAI publishes a fix at the gateway layer.
## 8. Recommendations — to reduce recurrence (priority order)
- Treat ClientResponseError-without-egress as a gateway-side incident class. Do not restart VPS services, do not edit OpenAPI, do not re-import schema as a first response. Open a fresh conversation first.
- Bump `info.version` whenever the OpenAPI schema body changes. Today's hash `aaec3d401df2` was served under `1.2.0` even after structural edits, so the gateway cannot reliably invalidate its cache.
- Capture, for every future ClientResponseError, the exact UTC timestamp + tool name + visible error payload from the GPT side. That is the only signal that distinguishes gateway-stuck from transport, auth, and server failures.
- Optional VPS mitigations (require approval before applying): `info.version` bump policy, `/api/actions/ping`, `X-Schema-Version` response header.
- Sync-status warning (`document_count=2727` vs `vector_point_count=5583`, ratio=2.05) is unrelated to today's incident — a chunking ratio of ~2 is normal — but worth a separate follow-up if RAG quality drift is suspected.
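The frozen-version problem behind the `info.version` recommendation can be caught mechanically before deploying a schema. A sketch (the fingerprinting scheme is illustrative; the real connector hash `aaec3d401df2` is computed by an unspecified method, so md5-over-sorted-JSON here is only an assumption):

```python
import hashlib
import json

def schema_fingerprint(schema: dict) -> tuple:
    """Return (info.version, short body hash) for a parsed OpenAPI document."""
    body = json.dumps(schema, sort_keys=True).encode()
    return schema["info"]["version"], hashlib.md5(body).hexdigest()[:12]

def needs_version_bump(old: dict, new: dict) -> bool:
    """True when the schema body changed but info.version stayed frozen."""
    old_version, old_hash = schema_fingerprint(old)
    new_version, new_hash = schema_fingerprint(new)
    return old_hash != new_hash and old_version == new_version

# Hypothetical before/after pair: body edited, version left at 1.2.0.
old = {"openapi": "3.1.0", "info": {"version": "1.2.0"}, "paths": {}}
new = {"openapi": "3.1.0", "info": {"version": "1.2.0"},
       "paths": {"/api/actions/ping": {}}}
print(needs_version_bump(old, new))  # True
```

Wired into a pre-deploy check, a `True` result would block the edit until `info.version` is bumped, closing the "structural edits under a frozen `1.2.0`" gap noted above.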
## Appendix — raw evidence pointers (VPS, ephemeral)
- `/tmp/gpt_today.log` — 125 ChatGPT-User access lines (timestamp, IP, route, status, bytes)
- `/tmp/gpt_tail_nginx.log` — 5-line capture of fresh-conv hits (172.196.40.213/215/222/212/215)
- `/tmp/gpt_tail_agentdata.log` — 336-line capture spanning both stuck windows and fresh-conv window
- `docker logs incomex-nginx --since "2026-05-12T00:00:00Z"` — full nginx today (status counts: 125× 200 for ChatGPT-User, 0× non-2xx)
- `docker logs incomex-agent-data --since "2026-05-12T00:00:00Z"` — wrapper telemetry: 1202 PASS / 1 FAIL (FAIL = internal smoke at 04:15:52)
- Active nginx conf: `/opt/incomex/docker/nginx/conf.d/default.conf` lines 137–150 (308 redirect + canonical schema route)
- Schema served: 27,833 bytes, md5 `a2e75d3354d930fc6db9a4252a249aa5`, version `1.2.0`, x-connector-schema-version `gpt-agent-data-2026-05-12.1`, x-connector-schema-hash `aaec3d401df2`
- Fresh-test user report co-evidence: `knowledge/current-state/reports/gpt-action-agent-data-fresh-test-2026-05-12.md` (created by user via fresh-conv createDocument at 09:57:07 UTC)