KB-346C

Single Worker Fix Report 2026-04-03

Revision 1

Mission: S135H (fix flaky GPT action)
PR: #337 (fix: uvicorn 2 workers + /chat request timeout 25s)
Status: DEPLOYED + VERIFIED


Evidence before the fix

  • 31 nginx upstream timeouts in 72h
  • Process freeze 2h12m (Apr 2 03:26→05:38)
  • Uvicorn 1 worker, /chat sync blocking
  • Container: 521MB / 1.5GB (34%)

Fix applied

Fix 1: --workers 2 (Dockerfile CMD)

CMD ["bash", "-lc", "uvicorn agent_data.server:app --host 0.0.0.0 --port ${PORT:-8080} --workers 2"]

Memory safe: 2×~600MB = ~1.2GB < 1.5GB limit.
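The memory check above is plain arithmetic, and a minimal sketch of it follows. The ~600MB per-worker figure and the 1.5GB limit come from this report; the helper name `max_safe_workers` and the 10% headroom are illustrative assumptions, not part of the deployed code.

```python
def max_safe_workers(per_worker_mb: float, limit_mb: float, headroom: float = 0.10) -> int:
    """Return how many workers fit under the container limit with some headroom left free."""
    usable_mb = limit_mb * (1.0 - headroom)
    return int(usable_mb // per_worker_mb)

# Figures from this report: ~600MB per worker, 1.5GB container limit.
PER_WORKER_MB = 600
LIMIT_MB = 1536  # 1.5GB expressed in MB (assumes 1GB = 1024MB)

workers = max_safe_workers(PER_WORKER_MB, LIMIT_MB)
total_mb = 2 * PER_WORKER_MB  # memory used by the two configured workers
```

With these numbers two workers fit, while a third (3×600MB = 1.8GB) would exceed the limit.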

Fix 2: 25s timeout on agent.llm_response() (server.py:902-921)

from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout
_LLM_TIMEOUT_S = 25
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(agent.llm_response, "!" + llm_input)
    agent_reply = future.result(timeout=_LLM_TIMEOUT_S)

On timeout, /chat returns HTTP 504 with a clear error message instead of timing out silently.
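One caveat with this pattern: leaving a `with ThreadPoolExecutor(...)` block calls `shutdown(wait=True)`, so the caller can still end up waiting for the slow call to finish after `future.result` has raised. The sketch below is an illustration, not the code in server.py. It keeps a long-lived pool instead and maps the timeout to a 504. The names `_POOL`, `call_with_timeout`, and `slow_llm_response`, and the `(status, body)` return shape, are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

# Long-lived pool: it is not used as a context manager, so a timed-out call
# does not force the caller to wait for the worker thread to finish.
_POOL = ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, arg, timeout_s):
    """Run fn(arg) in the pool and return (status_code, body)."""
    future = _POOL.submit(fn, arg)
    try:
        return 200, future.result(timeout=timeout_s)
    except FuturesTimeout:
        # The thread keeps running in the background; cancel() only helps
        # if the task has not started yet.
        future.cancel()
        return 504, f"LLM response timed out after {timeout_s}s"

def slow_llm_response(prompt):
    """Hypothetical stand-in for agent.llm_response."""
    time.sleep(0.5)
    return "reply to " + prompt

start = time.monotonic()
status, body = call_with_timeout(slow_llm_response, "!hello", timeout_s=0.1)
elapsed = time.monotonic() - start
```

Here the caller gets its 504 after roughly 0.1s even though the stand-in call takes 0.5s. The background thread still runs to completion, so a pool like this needs enough threads to absorb slow calls.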

CI Status

Check                        Status
Lint Only                    PASS
Pass Gate                    PASS
Guard Bootstrap Scaffold     PASS
Deploy Agent Data to VPS     PASS (success)

Production Verification

Workers confirmed

PID 784249 — master process (--workers 2)
PID 784336 — worker 1 (600MB)
PID 784739 — worker 2 (598MB)

Health: HEALTHY

{"status":"healthy","services":{"qdrant":{"status":"ok","latency_ms":4.5},"postgres":{"status":"ok","latency_ms":0.7},"openai":{"status":"ok","latency_ms":0.0}}}
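A health payload of this shape can be checked programmatically rather than by eye. The following is a minimal sketch that assumes the JSON structure shown above; the helper name `all_services_ok` is hypothetical.

```python
import json

def all_services_ok(payload: str) -> bool:
    """True if the top-level status is 'healthy' and every service reports 'ok'."""
    data = json.loads(payload)
    services = data.get("services", {})
    return data.get("status") == "healthy" and all(
        svc.get("status") == "ok" for svc in services.values()
    )

# Payload copied from the health response in this report.
sample = (
    '{"status":"healthy","services":{"qdrant":{"status":"ok","latency_ms":4.5},'
    '"postgres":{"status":"ok","latency_ms":0.7},'
    '"openai":{"status":"ok","latency_ms":0.0}}}'
)
healthy = all_services_ok(sample)
```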

Concurrent test: PASSED

REQ1 (parallel): HTTP 200 — 4.631s
REQ2 (parallel): HTTP 200 — 7.858s

Both completed. Previously REQ2 had to wait for REQ1 to finish (serial); now they run in parallel.
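The idea behind the parallel check can be sketched as follows: if two requests really overlap, total wall time is close to the slower request rather than the sum of both. This sketch makes no HTTP calls. `fake_request` is a hypothetical stand-in that simulates latency with `time.sleep`, and the latencies are made-up values.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(latency_s: float) -> float:
    """Hypothetical stand-in for one /chat request; sleeps, then returns its own duration."""
    start = time.monotonic()
    time.sleep(latency_s)
    return time.monotonic() - start

def run_parallel(latencies):
    """Issue all requests at once; return (per-request durations, total wall time)."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(latencies)) as pool:
        durations = list(pool.map(fake_request, latencies))
    return durations, time.monotonic() - start

durations, wall = run_parallel([0.2, 0.3])
# Overlapping requests finish in less time than running them back to back.
served_in_parallel = wall < sum(durations)
```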

5x sequential search: 5/5 PASS

Search 1: HTTP 200 — 2.127s
Search 2: HTTP 200 — 3.466s
Search 3: HTTP 200 — 3.938s
Search 4: HTTP 200 — 5.304s
Search 5: HTTP 200 — 2.458s

Average: 3.5s | Max: 5.3s | Min: 2.1s
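The summary line can be recomputed from the five timings listed above. A minimal sketch, using only those values:

```python
from statistics import mean

# Response times in seconds, copied from the five searches above.
timings_s = [2.127, 3.466, 3.938, 5.304, 2.458]

avg_s = mean(timings_s)
summary = (
    f"Average: {avg_s:.2f}s | Max: {max(timings_s):.2f}s | Min: {min(timings_s):.2f}s"
)
```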

Post-test health: HEALTHY

Conclusion

The fix addresses root cause #1 (single-worker blocking) and adds a safety net (25s timeout).

  • 2 workers: if one is blocked, the other still serves requests
  • 25s timeout: the worker is released when OpenAI is slow, and GPT receives a clear error
  • Concurrent test passed: evidence that parallel serving works

Root cause #2 (process freeze/OOM) still needs monitoring: if both workers hit OOM, the server still goes down. With 2 workers plus a restart policy, however, recovery is considerably faster.


Report date: 2026-04-04
PR: Huyen1974/agent-data-test#337