KB-59C6

VPS Log Investigation — 8 Error Groups

Revision 1 | Tags: report, logs, investigation, errors, deadlocks, 170 | 2026-04-06

VPS Log Investigation — 2026-04-06

Date: 2026-04-06 | Status: AUDIT COMPLETE (8 error groups found)


Log Summary

Service | Total Errors/Warnings | Critical
Agent Data | 24 | 3 query failures + 1 x 500
Directus | 241 | 37 meta_catalog blocked + 2 deadlocks
PostgreSQL | 104 | 5 deadlocks + 37 S129-A guard blocks
Nginx | 1,280 | ~1,270 proxy_temp warnings + 2 connection refused
Nuxt | 8,678 | 8,552 icon undefined + 118 homepage 403

Error Group 1: PG Deadlocks (RECURRING — P1)

Count: 5 deadlocks in 24h
Timestamps: 14:50, 15:10, 15:20, 18:00 (x2) UTC on Apr 5

Pattern: Always between the same two cron jobs:

Process A: SELECT refresh_meta_catalog_from_pivot();     (cron */10 min)
Process B: SELECT fn_refresh_orphan_species(); ...       (cron */10 min)

Both run at */10 minute intervals, both lock meta_catalog (relation 44273).

Root cause: Two crons with the same schedule (*/10 * * * *) compete for AccessExclusiveLock on meta_catalog. When both start in the same minute, they can deadlock.

Fix: Stagger the crons. Move the orphan refresh from */10 * * * * to a 5-minute offset: 5,15,25,35,45,55 * * * *. Alternatively, combine both refreshes into a single function.
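
The staggered schedule can be sanity-checked before deployment. The sketch below is a simplified minute-field expander, not a full cron parser: it handles only `*`, `*/n`, ranges, and comma lists, and the helper names are hypothetical. It reports the minutes at which two schedules would start together.

```python
def expand_minutes(field: str) -> set[int]:
    """Expand a cron minute field into the set of minutes (0-59) it matches.

    Simplified: supports '*', '*/n', 'a-b', 'a-b/n', and comma-separated lists.
    """
    minutes: set[int] = set()
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            start, end = 0, 59
        elif "-" in part:
            low, high = part.split("-")
            start, end = int(low), int(high)
        else:
            start = end = int(part)
        minutes.update(range(start, end + 1, step))
    return minutes


def colliding_minutes(field_a: str, field_b: str) -> set[int]:
    """Return the minutes at which both schedules fire."""
    return expand_minutes(field_a) & expand_minutes(field_b)


# Current setup: both jobs on */10, so every run collides.
current = colliding_minutes("*/10", "*/10")
# Proposed setup: orphan refresh offset by 5 minutes.
staggered = colliding_minutes("*/10", "5,15,25,35,45,55")

print(sorted(current))    # [0, 10, 20, 30, 40, 50]
print(sorted(staggered))  # []
```

Staggering only removes the overlap if each job finishes in under 5 minutes. If either refresh can run longer than that, combining them into a single function is the safer of the two options.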


Error Group 2: S129-A Guard Block (EXPECTED — P3)

Count: 37 blocked updates
Message: Direct UPDATE of meta_catalog counting fields is blocked. Use refresh_registry_counts() or DOT tools.

Root cause: Directus Flows try to UPDATE meta_catalog counting fields directly. PG trigger S129-A blocks this — correct behavior. The Flows should call refresh_registry_counts() instead.

Fix: Update Directus Flows that touch meta_catalog counts to call the PG function instead of direct UPDATE. Or disable the Flows and rely on cron refresh.


Error Group 3: Agent Data Query Knowledge Failed (P2)

Count: 3 failures at 09:56:41-43 UTC
Message: Query knowledge failed: (empty error detail)
Result: 1 x POST /chat → 500 Internal Server Error

Root cause: The /chat endpoint calls vector search (Qdrant), which failed. Likely causes: an overly complex search query, an embedding timeout, or momentary Qdrant load. The logged error detail is empty, which points to poor error logging.

Fix: Improve error logging in chat endpoint to include the actual exception. Add retry logic for transient Qdrant failures.
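
A minimal sketch of both parts of this fix, assuming the vector search can be wrapped in a zero-argument callable. The function name, logger name, and retry parameters are hypothetical and not taken from the Agent Data code. Every failure is logged with its exception type, message, and traceback, and the call is retried with exponential backoff.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("agent_data.chat")  # hypothetical logger name


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, logging each failure in full and retrying with backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Log type, message, and traceback so the detail is never empty.
            logger.warning(
                "Query knowledge failed (attempt %d/%d): %s: %s",
                attempt, attempts, type(exc).__name__, exc, exc_info=True,
            )
            if attempt == attempts:
                raise  # out of retries: surface the original exception
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Catching `Exception` keeps the sketch short. In practice it would be narrowed to the timeout and connection errors raised by the Qdrant client in use, so that non-transient failures are not retried.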


Error Group 4: Nuxt Icon Undefined (NOISE — P3)

Count: 8,552 occurrences
Message: [Icon] failed to load icon 'mdi:undefined'

Root cause: A Nuxt component references mdi:undefined icon — likely a dynamic icon name that resolves to undefined when data is missing.

Fix: Find the component with dynamic icon binding and add fallback: icon ?? 'mdi:help-circle'.


Error Group 5: Nuxt Homepage 403 (P2)

Count: 118 occurrences
Message: [Homepage SSR] Directus query failed: GET /items/pages ... 403 Forbidden

Root cause: Nuxt SSR queries Directus for homepage data but gets 403. The Directus token used by Nuxt SSR may have expired or the public role lacks permission on the pages collection.

Fix: Check Directus public permissions for pages collection. Ensure Nuxt uses a valid token or public access is configured.


Error Group 6: Nginx Proxy Temp + Connection Refused (P3)

Count: ~1,270 proxy_temp warnings + 2 connection refused
Messages:

  • upstream response is buffered to a temporary file (large responses)
  • connect() failed (111: Connection refused) at 14:52 UTC (Agent Data restart)

Root cause:

  • Proxy temp: Nuxt SSR responses exceed nginx buffer size. Cosmetic, not functional.
  • Connection refused: Agent Data was restarting (Docker rebuild at 14:52). Expected.

Fix: Increase proxy_buffer_size and proxy_buffers in nginx config to reduce temp file warnings. Connection refused is transient (resolved after restart).


Error Group 7: move_document Bug (CONFIRMED — P1)

Evidence: 3 move API calls returned 200 OK:

POST /documents/knowledge/dev/architecture/label-law.md/move → 200
POST /documents/knowledge/dev/architecture/dieu26-new-registries-counting-law-draft.md/move → 200
POST /documents/knowledge/dev/architecture/dieu28-display-technology-law-v2-draft.md/move → 200

But files were NOT actually moved (S170 agent had to use manual copy+delete).

Root cause: The move_document endpoint returns 200 but does not change the document_id in PG or re-embed in Qdrant. Needs code investigation in server.py.

Fix: Fix move_document implementation in server.py, or deprecate the endpoint and document copy+delete as the standard pattern.
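
If copy+delete becomes the standard pattern, the important property is that the original is removed only after the copy has been verified. The sketch below shows that ordering against a plain dict standing in for the real storage. The PG and Qdrant calls in server.py are not shown in this report, so the store and the function name are hypothetical.

```python
def copy_then_delete(store: dict[str, str], source_id: str, target_id: str) -> None:
    """Move a document by copying it, verifying the copy, then deleting the original.

    `store` is an in-memory stand-in mapping document_id -> content.
    """
    if source_id not in store:
        raise KeyError(f"source document not found: {source_id}")
    if target_id in store:
        raise ValueError(f"target already exists: {target_id}")

    # Step 1: copy.
    store[target_id] = store[source_id]

    # Step 2: verify before deleting, so a silent failure cannot lose the document.
    if store.get(target_id) != store[source_id]:
        raise RuntimeError(f"copy to {target_id} could not be verified")

    # Step 3: delete the original only after the copy is confirmed.
    del store[source_id]
```

A post-move check of this kind would also have caught the bug above, since it compares actual state instead of trusting the 200 response.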


Error Group 8: Directus Path Traversal Probes (SECURITY — P2)

Count: 2,301 x 403 responses
Paths probed: /logs/error.log, /wp-content/uploads/error_log, /wp-content/error_log

Root cause: External scanners/bots probing for common web vulnerabilities. Directus correctly returns 403.

Fix: No code fix needed. Consider rate-limiting or blocking known scanner IPs at nginx level.
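
Before blocking anything, it helps to see which client IPs generate the probes. The sketch below counts 403 responses on the probed paths per IP. It assumes access log lines in nginx's default "combined" format, which this report does not confirm, and the sample lines are synthetic and use documentation-range IPs.

```python
import re
from collections import Counter

# Paths listed in this report as probed by scanners.
PROBE_PATHS = (
    "/logs/error.log",
    "/wp-content/uploads/error_log",
    "/wp-content/error_log",
)

# Assumed layout: nginx "combined" log format.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)


def probe_counts(lines) -> Counter:
    """Count 403 responses on known probe paths, grouped by client IP."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match or match["status"] != "403":
            continue
        path = match["path"].split("?")[0]  # ignore any query string
        if path in PROBE_PATHS:
            counts[match["ip"]] += 1
    return counts


# Synthetic sample lines for illustration only.
sample = [
    '203.0.113.7 - - [05/Apr/2026:14:50:01 +0000] "GET /wp-content/error_log HTTP/1.1" 403 153 "-" "scanner"',
    '203.0.113.7 - - [05/Apr/2026:14:50:02 +0000] "GET /logs/error.log HTTP/1.1" 403 153 "-" "scanner"',
    '198.51.100.4 - - [05/Apr/2026:14:50:03 +0000] "GET /items/pages HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

for ip, hits in probe_counts(sample).most_common():
    print(ip, hits)  # 203.0.113.7 2
```

IPs that dominate the count are candidates for a deny rule or a rate limit at the nginx level.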


Priority Summary

Priority | Group | Action
P1 | Deadlocks | Stagger cron schedules
P1 | move_document bug | Fix or deprecate endpoint
P2 | Query knowledge 500 | Improve error logging
P2 | Homepage 403 | Check Directus permissions
P2 | Path traversal | Rate-limit bots
P3 | S129-A blocks | Update Directus Flows
P3 | Icon undefined | Fix Nuxt component
P3 | Nginx proxy_temp | Increase buffer size

Log audit complete | 8 error groups | 2 P1, 3 P2, 3 P3 | 0 changes made