KB-54B9

GPT ↔ Agent Data connector — third recurrence root-cause analysis (2026-05-12)

Tags: incident, gpt-action, connector, root-cause, 2026-05-12


  • Date (UTC): 2026-05-12
  • Recurrences today (before this report): ≥ 3
  • Investigation window: 04:00 UTC → 10:10 UTC
  • Verdict: Server-side (VPS, nginx, Agent Data, schema, redirect) is NOT the root cause. Root cause is per-conversation OpenAI Action gateway "stuck" session state.

1. Full-day timeline (UTC)

| time (UTC) | event | GPT-side | server-side | conclusion |
|---|---|---|---|---|
| 01:46–04:14 | regular ChatGPT-User traffic to /api/{chat,documents,documents/batch} | repeated PASS rounds | all 200 | server healthy |
| 04:14 | Agent Data container restart (planned) | n/a | startup OK; Qdrant + PG probe OK | no GPT regression after restart |
| 04:15:03–04:15:14 | brief upstream unavailability during restart | n/a | nginx 502 to internal probes only (curl, Uptime-Kuma) | did not affect OpenAI traffic |
| 04:15:52 | wrapper listDocuments HTTP 400 WRONG_QUERY_PARAMETER | n/a | caller IP 38.242.240.89 (internal), UA Python-urllib/3.12 | internal smoke script using stale param shape, not OpenAI |
| 06:33–07:13 | /api/chatgpt-openapi.json redirect rebuilt | GPT-Builder import attempts from 123.24.178.152 | nginx patched: return 308 /api/openapi.json | deprecated-URL guard installed |
| 06:52, 07:12, 07:13 | HEAD /api/openapi.json 200 0 | curl HEAD probes by user | nginx returns headers only (RFC-compliant) | not a corruption event |
| 07:05–08:49 | normal traffic continues (125 ChatGPT-User reqs today, all 200) | repeated PASS | all 200, bytes 148–34059 | server healthy throughout |
| 08:49:43 | last ChatGPT-User request reaching VPS: POST /api/documents?upsert=true 200 148 | end of last PASS window | normal | last successful egress |
| 08:50 → 09:42 | ~53 min idle (no GPT egress to VPS) | session sat idle in same conversation | no GPT traffic | suspected window for gateway-session transition to error state |
| 09:42:54–09:48 | Failure window #A (same conversation) — 5 tool calls triggered | healthCheck no output; searchKnowledge / getDocumentTruncated / listDocuments(notes) / listDocuments(reviews) → ClientResponseError | 0 ChatGPT-User UA hits, 0 OpenAI-IP hits, 0 non-healthCheck wrapper calls in window | requests never left OpenAI's gateway |
| 09:48–09:55 | Failure window #B (same conversation, 5 retries) | identical failure pattern | identical: 0 egress to VPS | stuck state reproduced |
| 09:55–10:10 | fresh conversation triggered, same Custom GPT | healthCheck PASS, searchKnowledge PASS, listDocuments PASS (×2), createDocument PASS | 5 ChatGPT-User reqs from 172.196.40.{212,213,215,222}, all 200 | gateway egress restored simply by opening a new conversation |

Today's connector wrapper telemetry (00:00 UTC → 10:00 UTC): 1202 PASS / 1 FAIL (the FAIL is the internal smoke-script param mismatch above). No FAIL ever originated from OpenAI IPs.
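The tally above can be recomputed from the wrapper log. A minimal sketch, assuming a line-oriented log where each telemetry line carries a literal PASS or FAIL token (the exact log format is an assumption; the path mirrors the appendix pointer, and the full-day counts in the appendix come from docker logs incomex-agent-data):

```python
from collections import Counter

def tally_wrapper_verdicts(log_path: str) -> Counter:
    """Count PASS/FAIL verdict lines in a wrapper telemetry log."""
    counts = Counter()
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            if "PASS" in line:
                counts["PASS"] += 1
            elif "FAIL" in line:
                counts["FAIL"] += 1
    return counts

# Run against the full-day wrapper log to reproduce 1202 PASS / 1 FAIL.
print(tally_wrapper_verdicts("/tmp/gpt_tail_agentdata.log"))
```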


2. Evidence — three failure rounds

Round 1 (earlier today, user-reported)

Symptom: GPT tools returned ClientResponseError. Earlier occurrences had been addressed by (a) a connector-layer patch, (b) reimporting /api/openapi.json, and (c) the /api/chatgpt-openapi.json → 308 redirect. Those fixes addressed real but distinct earlier VPS-side bugs (each is permanent and confirmed in the nginx config); none of them touched the recurring gateway-session issue.

Round 2 — same-conversation, 09:42:54 → 09:48 UTC

| # | tool | GPT result | request at nginx | OpenAI IP at VPS | wrapper status |
|---|---|---|---|---|---|
| 1 | healthCheck | (silent) | NO | NO | — |
| 2 | searchKnowledge | ClientResponseError | NO | NO | — |
| 3 | getDocumentTruncated | ClientResponseError | NO | NO | — |
| 4 | listDocuments(notes) | ClientResponseError | NO | NO | — |
| 5 | listDocuments(reviews) | ClientResponseError | NO | NO | — |

Window-level: 0 ChatGPT-User UA hits, 0 hits from 20.194.x / 172.196.x. Only internal Uptime-Kuma + self-probes (172.18.0.7) and a stray iPhone-UA bot at 09:44:46 hitting / (301).
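The window-level counts come from scanning the nginx access log for ChatGPT-User hits inside the failure window. A minimal sketch, assuming the standard nginx combined log format (timestamp as [12/May/2026:09:42:54 +0000], user agent as the last quoted field); log path and format are assumptions to adjust to the actual deployment:

```python
import re
from datetime import datetime, timezone

# Matches the bracketed timestamp and the trailing quoted user-agent field.
LOG_RE = re.compile(r'\[(?P<ts>[^\]]+)\].*"(?P<ua>[^"]*)"\s*$')

def chatgpt_hits(log_path: str, start: datetime, end: datetime) -> int:
    """Count ChatGPT-User requests landing at the VPS inside [start, end]."""
    hits = 0
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            m = LOG_RE.search(line)
            if not m:
                continue
            ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
            if start <= ts <= end and "ChatGPT-User" in m.group("ua"):
                hits += 1
    return hits

win_start = datetime(2026, 5, 12, 9, 42, 54, tzinfo=timezone.utc)
win_end = datetime(2026, 5, 12, 9, 48, 0, tzinfo=timezone.utc)
print(chatgpt_hits("/tmp/gpt_today.log", win_start, win_end))  # 0 for window #A
```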

Round 3 — same-conversation retry, 09:48 → 09:55 UTC

Identical signature (0 egress / 5 ClientResponseError) — reproduces the stuck state. The conversation cannot recover by retrying.

Counter-experiment — fresh conversation, 09:55 → 10:10 UTC

| time (UTC) | tool | OpenAI IP | route | status | bytes | wrapper log |
|---|---|---|---|---|---|---|
| 09:56:32 | healthCheck | 172.196.40.213 | GET /api/health | 200 | 531 | healthy |
| 09:56:48 | searchKnowledge | 172.196.40.215 | POST /api/chat | 200 | 3910 | OK |
| 09:56:56 | listDocuments | 172.196.40.222 | GET /api/kb/list?prefix=knowledge/current-state/reports&limit=5 | 200 | 1275 | total_count=403 |
| 09:57:07 | createDocument | 172.196.40.212 | POST /api/documents | 200 | 119 | OK |
| 09:59:34 | searchKnowledge | 172.196.40.215 | POST /api/chat | 200 | 5585 | OK |

A/B summary: same VPS, same Custom GPT, same schema (openapi 3.1.0, version 1.2.0, x-connector-schema-version gpt-agent-data-2026-05-12.1, hash aaec3d401df2), same auth — only the conversation changed. Fresh conversation → 100% PASS. Same conversation → 0% egress, 100% ClientResponseError.
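Because the schema layer keeps coming under suspicion, a cheap drift probe is to fetch the canonical schema URL and compare the connector identifiers against the values recorded here. A minimal sketch, assuming x-connector-schema-version and x-connector-schema-hash are exposed as HTTP response headers on /api/openapi.json (if they instead live inside the JSON body, read them from the parsed document); the base URL is a placeholder:

```python
import urllib.request

# Values recorded in this report; any mismatch means drift since 2026-05-12.
EXPECTED = {
    "x-connector-schema-version": "gpt-agent-data-2026-05-12.1",
    "x-connector-schema-hash": "aaec3d401df2",
}

def schema_headers_match(base_url: str) -> bool:
    """Fetch the canonical schema route and compare connector headers."""
    with urllib.request.urlopen(base_url + "/api/openapi.json") as resp:
        return all(resp.headers.get(name) == want
                   for name, want in EXPECTED.items())

# Usage (placeholder host): schema_headers_match("https://example-vps.invalid")
```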


3. PASS vs FAIL window comparison

| Factor | PASS (07:00–08:50 UTC, then fresh 09:55+) | FAIL (09:42–09:55 UTC) | Difference |
|---|---|---|---|
| OpenAI IP reaches VPS | YES (125 + 5 hits) | NO (0 hits) | egress vanished |
| Routes called | /api/chat, /api/documents/batch, /api/documents, /api/kb/list, /api/health | n/a | — |
| Schema URL fetched | /api/openapi.json 200 27833 | unchanged | identical |
| Schema hash | aaec3d401df2 | unchanged | identical |
| Schema version | gpt-agent-data-2026-05-12.1 | unchanged | identical |
| info.version | 1.2.0 | 1.2.0 | identical |
| Server status | healthy | healthy | identical |
| Conversation | various (fresh-conv PASS = new) | same stuck conv | only variable |
| 4xx/5xx from OpenAI | 0 all day | n/a (no requests) | — |
| Auth header invalid (401/403) | none | none | identical |

Single biggest difference: conversation identity. Everything else identical.


4. Root cause

The OpenAI Custom-GPT Action gateway maintains a per-conversation tool-call session that, after some triggering event (most likely an idle interval of several tens of minutes or more, possibly combined with an earlier transient response anomaly), enters a degraded state. In that state the gateway drops every subsequent tool dispatch from the conversation before it egresses to the upstream server; no request lands at the VPS while the session is stuck.

The mechanism is opaque to us (lives entirely inside OpenAI infrastructure), but the empirical signature is unambiguous:

  • N retries inside the stuck conversation → 0 reach VPS, all return ClientResponseError.
  • A fresh conversation against the same Custom GPT → 100% reach VPS, all PASS.
  • No server-side state change is needed to recover (a triage rule capturing this signature is sketched below).
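That signature reduces to a one-line triage rule. A hedged sketch, where function and message names are illustrative rather than an existing API:

```python
def classify_episode(gpt_tool_errors: int, vps_hits_in_window: int) -> str:
    """Triage rule implied by the empirical signature above."""
    if gpt_tool_errors == 0:
        return "healthy"
    if vps_hits_in_window == 0:
        # Errors with zero egress: requests died inside OpenAI's Action gateway.
        return "gateway-stuck: open a fresh conversation, do not touch the VPS"
    # Errors with egress present point at transport/auth/server instead.
    return "server-or-transport: inspect nginx and wrapper logs for the window"

print(classify_episode(5, 0))  # failure windows #A/#B classify as gateway-stuck
```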

5. Hypotheses evaluated (with evidence)

| H | Hypothesis | Verdict | Evidence |
|---|---|---|---|
| H1 | Schema cache/lifecycle drift | partially supported (same gateway-side layer) | gateway-internal; not directly observable from VPS, but the stuck state lives in the same layer |
| H2 | Gateway suppresses egress after internal error | CONFIRMED | 0/10 same-conv calls reached VPS; 5/5 fresh-conv calls reached |
| H3 | One bad response poisons session | not falsified, but not necessary | last successful egress at 08:49 was unremarkable (148-byte upsert); session may have drifted on idle alone |
| H4 | Auth/token issue | REJECTED | no 401/403 today; fresh conv used identical auth and passed |
| H5 | REST↔MCP schema split confuses GPT Actions | REJECTED | GPT Action surface uses REST aaec3d401df2; MCP 5f13902e975a is internal MCP transport only |
| H6 | Deprecated /api/chatgpt-openapi.json still in use | REJECTED | zero traffic to that path after 07:13:50 (308 redirect installed) |
| H7 | batchReadDocuments parser error / large response | REJECTED for this incident | failures include the smallest tools (healthCheck); batchRead not even attempted in failure window; last successful batchRead at 08:49:08 was a typical 14,748 bytes |
| H8 | Real server-side error (5xx/timeout/upstream reset) | REJECTED | nginx access log shows 0 non-2xx from OpenAI all day; nginx error_log has only buffer-to-temp-file warnings; conntrack 18/262144; no firewall drops; Agent Data uvicorn logged no exceptions during traffic |

6. Fix applied / why no VPS change

No VPS change applied or warranted. The hard boundaries from the incident ticket were respected:

  • No rollback of OGV-2C.
  • No PG / Qdrant touch.
  • No Agent Data / nginx restart (no server-side evidence required one).
  • No curl/Claude-only smoke test claimed as PASS — every verdict above is anchored to a real GPT tool invocation from the user.

Operational recovery (proven, user-side only): when a conversation enters stuck state, start a fresh conversation with the same Custom GPT. The new conversation gets a clean gateway session and tool calls resume immediately.

If the recurrence pattern still bothers us after this report, three optional mitigations require user approval before any VPS edit:

  1. Bump info.version on every schema change (currently frozen at 1.2.0 across multiple edits today). Each bump invalidates the gateway's per-schema cache. Cheap, defensive, low-risk.
  2. Add a lightweight /api/actions/ping returning < 200 bytes. This gives the gateway a near-zero-cost idle keep-alive target and may reduce the rate of idle-induced stuck sessions (no guarantee; gateway internals are opaque).
  3. Stamp every response with an X-Schema-Version: <hash> header. This lets us correlate server-side responses with the gateway's view of the schema and detect drift cheaply. (Items 2 and 3 are sketched together below.)
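A combined sketch of mitigations 2 and 3, assuming Agent Data is a FastAPI app behind uvicorn (the framework is an assumption; the report only confirms uvicorn). Route and header names follow the list above; SCHEMA_HASH would come from wherever the wrapper already derives aaec3d401df2:

```python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
SCHEMA_HASH = "aaec3d401df2"  # keep in sync with the served schema

@app.get("/api/actions/ping")
async def actions_ping() -> JSONResponse:
    # Mitigation 2: tiny (<200 byte) body, a near-zero-cost keep-alive target.
    return JSONResponse({"ok": True, "schema": SCHEMA_HASH})

@app.middleware("http")
async def stamp_schema_version(request, call_next):
    # Mitigation 3: stamp every response so server-side logs can be correlated
    # with the gateway's view of the schema.
    response = await call_next(request)
    response.headers["X-Schema-Version"] = SCHEMA_HASH
    return response
```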

None of these change the diagnosis: server is healthy.


7. Long-run acceptance test

Round A (T+0, fresh conversation): PASS — all five tool calls in the fresh conversation reached the VPS with HTTP 200. User-side: healthCheck healthy; searchKnowledge returned hits; listDocuments returned total_count=403 (reports prefix) and total_count=2728 (root prefix, consistent with the user's observation of 2727); createDocument wrote the fresh-test marker at knowledge/current-state/reports/gpt-action-agent-data-fresh-test-2026-05-12.md.

Rounds B–F (long-idle, heavy-tool, post-batch, post-5×search, post-large-list): not yet executed — they require additional GPT-side triggers spread over ~15 min + heavy operations. Recommended to run inside the fresh conversation across the next 30 minutes; if any of those rounds drops to 0 egress we re-open Phase 3 with new evidence.

If user runs Rounds B–F and they all pass, the incident is considered resolved with the operational recovery procedure above as the standing remediation until OpenAI publishes a fix at the gateway layer.


8. Recommendations — to reduce recurrence (priority order)

  1. Treat ClientResponseError-without-egress as a gateway-side incident class. Do not restart VPS services, do not edit OpenAPI, do not re-import schema as a first response. Open a fresh conversation first.
  2. Bump info.version whenever the OpenAPI schema body changes. Today's hash aaec3d401df2 was served under 1.2.0 even after structural edits, so the gateway cannot reliably invalidate its cache.
  3. Capture, for every future ClientResponseError, the exact UTC timestamp, tool name, and visible error payload from the GPT side (a capture template is sketched after this list). That is the only signal that distinguishes a gateway-stuck episode from transport, auth, or server failures.
  4. Optional VPS mitigations (require approval before applying): info.version bump policy, /api/actions/ping, X-Schema-Version response header.
  5. Sync-status warning (document_count=2727 vs vector_point_count=5583, ratio=2.05) is unrelated to today's incident — chunking ratio ~2 is normal — but worth a separate follow-up if RAG quality drift is suspected.
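For recommendation 3 above, a minimal capture template: one JSON line per observed ClientResponseError, written where triage can later join it against nginx and wrapper logs. Field names and the output path are illustrative, not an existing interface:

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

@dataclass
class ToolErrorRecord:
    ts_utc: str          # exact UTC timestamp of the failed call
    conversation: str    # conversation identifier (or a user-visible proxy)
    tool: str            # e.g. "searchKnowledge"
    error_payload: str   # visible error text from the GPT side

def capture(conversation: str, tool: str, error_payload: str,
            path: str = "/tmp/gpt_tool_errors.jsonl") -> None:
    """Append one structured error record per ClientResponseError sighting."""
    rec = ToolErrorRecord(datetime.now(timezone.utc).isoformat(),
                          conversation, tool, error_payload)
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(rec)) + "\n")
```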

Appendix — raw evidence pointers (VPS, ephemeral)

  • /tmp/gpt_today.log — 125 ChatGPT-User access lines (timestamp, IP, route, status, bytes)
  • /tmp/gpt_tail_nginx.log — 5-line capture of fresh-conv hits (172.196.40.213/215/222/212/215)
  • /tmp/gpt_tail_agentdata.log — 336-line capture spanning both stuck windows and fresh-conv window
  • docker logs incomex-nginx --since "2026-05-12T00:00:00Z" — full nginx today (status counts: 125× 200 for ChatGPT-User, 0× non-2xx)
  • docker logs incomex-agent-data --since "2026-05-12T00:00:00Z" — wrapper telemetry: 1202 PASS / 1 FAIL (FAIL = internal smoke at 04:15:52)
  • Active nginx conf: /opt/incomex/docker/nginx/conf.d/default.conf lines 137–150 (308 redirect + canonical schema route)
  • Schema served: 27,833 bytes, md5 a2e75d3354d930fc6db9a4252a249aa5, version 1.2.0, x-connector-schema-version gpt-agent-data-2026-05-12.1, x-connector-schema-hash aaec3d401df2
  • Fresh-test user report co-evidence: knowledge/current-state/reports/gpt-action-agent-data-fresh-test-2026-05-12.md (created by user via fresh-conv createDocument at 09:57:07 UTC)