KB-346C
Single Worker Fix Report 2026-04-03
Revision 1
Mission: S135H (fix flaky GPT action)
PR: #337 (fix: uvicorn 2 workers + /chat request timeout 25s)
Status: DEPLOYED + VERIFIED
Evidence before the fix
- 31 nginx upstream timeouts in 72h
- Process freeze 2h12m (Apr 2 03:26→05:38)
- Uvicorn 1 worker, /chat sync blocking
- Container: 521MB / 1.5GB (34%)
Fix applied
Fix 1: --workers 2 (Dockerfile CMD)
CMD ["bash", "-lc", "uvicorn agent_data.server:app --host 0.0.0.0 --port ${PORT:-8080} --workers 2"]
Memory safe: 2×~600MB = ~1.2GB < 1.5GB limit.
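The headroom arithmetic above can be checked with a short sketch. It assumes the 1.5GB limit means 1536 MB (an assumption, chosen because 521 MB / 1536 MB matches the 34% reported earlier) and uses the report's ~600MB per-worker estimate. `memory_budget` is a hypothetical helper, not part of the deployed code.

```python
LIMIT_MB = 1536      # assumption: 1.5GB read as 1.5 * 1024 MB
PER_WORKER_MB = 600  # per-worker estimate taken from the report


def memory_budget(workers, per_worker_mb=PER_WORKER_MB, limit_mb=LIMIT_MB):
    """Return (total_mb, headroom_mb, fits) for a given worker count."""
    total = workers * per_worker_mb
    return total, limit_mb - total, total <= limit_mb


two = memory_budget(2)    # the deployed configuration
three = memory_budget(3)  # what one more worker would cost
print(two, three)
```

Under these assumptions two workers leave a few hundred MB of headroom, while a third worker would exceed the limit.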
Fix 2: 25s timeout on agent.llm_response() (server.py:902-921)
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout
_LLM_TIMEOUT_S = 25
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(agent.llm_response, "!" + llm_input)
    agent_reply = future.result(timeout=_LLM_TIMEOUT_S)
On timeout, /chat returns HTTP 504 with a clear error message instead of timing out silently.
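A minimal sketch of the timeout-to-504 pattern follows. It is not the code in server.py: `call_with_timeout` and `slow_llm` are hypothetical names, and the 504 is returned as a plain tuple rather than a framework response. One detail is worth checking against the real handler. `ThreadPoolExecutor` used as a context manager calls `shutdown(wait=True)` on exit, so if the deployed code matches the excerpt exactly, leaving the `with` block after a timeout would still wait for the LLM call to finish. The sketch calls `shutdown(wait=False)` so the caller returns as soon as the timeout fires.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

_LLM_TIMEOUT_S = 25  # value from the report


def call_with_timeout(fn, arg, timeout_s=_LLM_TIMEOUT_S):
    """Run fn(arg) in a helper thread and return (http_status, payload)."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, arg)
    try:
        return 200, future.result(timeout=timeout_s)
    except FuturesTimeout:
        return 504, f"LLM call exceeded {timeout_s}s"
    finally:
        # wait=False returns immediately; the helper thread keeps running
        # in the background until fn returns, since it cannot be killed.
        pool.shutdown(wait=False)


def slow_llm(prompt):
    # Stand-in for agent.llm_response; sleeps to simulate a slow upstream.
    time.sleep(0.5)
    return "reply to " + prompt


start = time.monotonic()
status, payload = call_with_timeout(slow_llm, "!hello", timeout_s=0.1)
elapsed = time.monotonic() - start
fast_status, fast_payload = call_with_timeout(lambda p: p.upper(), "ok", timeout_s=1)
print(status, payload, round(elapsed, 2), fast_status, fast_payload)
```

The abandoned call still occupies a thread until it returns, so a timeout inside the underlying HTTP client, where one is available, is a useful complement.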
CI Status
| Check | Status |
|---|---|
| Lint Only | PASS |
| Pass Gate | PASS |
| Guard Bootstrap Scaffold | PASS |
| Deploy Agent Data to VPS | PASS (success) |
Production Verification
Workers confirmed
PID 784249 — master process (--workers 2)
PID 784336 — worker 1 (600MB)
PID 784739 — worker 2 (598MB)
Health: HEALTHY
{"status":"healthy","services":{"qdrant":{"status":"ok","latency_ms":4.5},"postgres":{"status":"ok","latency_ms":0.7},"openai":{"status":"ok","latency_ms":0.0}}}
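As an illustration, a payload like the one above can be checked programmatically. This is a sketch only: `unhealthy_services` is a hypothetical helper, and treating `"ok"` and `"healthy"` as the sole passing values is an assumption drawn from this single sample.

```python
import json

# Health payload copied from the report.
HEALTH_JSON = (
    '{"status":"healthy","services":{'
    '"qdrant":{"status":"ok","latency_ms":4.5},'
    '"postgres":{"status":"ok","latency_ms":0.7},'
    '"openai":{"status":"ok","latency_ms":0.0}}}'
)


def unhealthy_services(payload):
    """Return the sorted names of services whose status is not "ok"."""
    services = payload.get("services", {})
    return sorted(name for name, info in services.items()
                  if info.get("status") != "ok")


payload = json.loads(HEALTH_JSON)
failing = unhealthy_services(payload)
is_healthy = payload.get("status") == "healthy" and not failing
print(is_healthy, failing)
```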
Concurrent test: PASSED
REQ1 (parallel): HTTP 200 — 4.631s
REQ2 (parallel): HTTP 200 — 7.858s
Both completed. Previously REQ2 had to wait for REQ1 to finish (serial); now they run in parallel.
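One way to reproduce a check like this is to fire the requests at the same moment and compare total wall time with the sum of the individual latencies: when requests are served in parallel, wall time is close to the slowest request rather than the sum. The sketch below is an assumed harness, not the script used for this report. `fake_request` stands in for a real HTTP call, and the 0.75 threshold is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(send, label):
    """Call send() and return (label, status, latency_seconds)."""
    start = time.monotonic()
    status = send()
    return label, status, time.monotonic() - start


def run_parallel(sends):
    """Fire all callables at once; return (results, wall_seconds)."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(sends)) as pool:
        futures = [pool.submit(timed_call, send, f"REQ{i + 1}")
                   for i, send in enumerate(sends)]
        results = [f.result() for f in futures]
    return results, time.monotonic() - start


def fake_request():
    # Stand-in for an HTTP request to the service; swap in a real client call.
    time.sleep(0.3)
    return 200


results, wall = run_parallel([fake_request, fake_request])
sum_latency = sum(latency for _, _, latency in results)
served_in_parallel = wall < 0.75 * sum_latency  # arbitrary threshold
print(results, round(wall, 2), served_in_parallel)
```

With the stand-in the check always passes, because the client threads themselves run concurrently. Against a real server the comparison only passes if the server also handles the requests concurrently.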
5x sequential search: 5/5 PASS
Search 1: HTTP 200 — 2.127s
Search 2: HTTP 200 — 3.466s
Search 3: HTTP 200 — 3.938s
Search 4: HTTP 200 — 5.304s
Search 5: HTTP 200 — 2.458s
Average: 3.5s | Max: 5.3s | Min: 2.1s
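The summary line can be recomputed from the five timings listed above. A short sketch using only those values:

```python
from statistics import mean

# Latencies in seconds for Search 1 to Search 5, copied from the report.
latencies_s = [2.127, 3.466, 3.938, 5.304, 2.458]

summary = {
    "avg": round(mean(latencies_s), 2),
    "max": max(latencies_s),
    "min": min(latencies_s),
}
print(summary)
```

The mean comes out at about 3.46s, which is 3.5s to one decimal place.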
Post-test health: HEALTHY
Conclusion
The fix addresses root cause #1 (single worker blocking) and adds a safety net (25s timeout).
- 2 workers: if one is blocked, the other keeps serving
- 25s timeout: the worker is freed when OpenAI is slow, and GPT receives a clear error
- Concurrent test passed: evidence that parallel serving works
Root cause #2 (process freeze/OOM) still needs monitoring: if both workers OOM, the server still goes down. With 2 workers plus a restart policy, however, recovery is considerably faster.
Report date: 2026-04-04
PR: Huyen1974/agent-data-test#337