KB-35C1

S174 Hardtest Agent-Data CRUD + Log Stability

Revision 1

Date: 2026-04-09. Operator: Codex. Scope: hardtest the MCP CRUD/search path, inspect runtime logs during the test, and post-verify the Đ31 integrity chain after S174-FIX-04 and S174-FIX-05. Constraint: no fixes performed in this mission.

1. Executive Summary

Test window started at 2026-04-09 05:24:42 CEST on VPS (2026-04-09 10:24:42 ICT).

Overall result:

  • Total measured MCP operations: 90
  • PASS: 90
  • FAIL: 0
  • Observed fail rate: 0%
  • Search path looked stable under both sequential and concurrent load.
  • No backend evidence of 500, ERROR, WARNING, Traceback, PostgreSQL slow/deadlock, or Qdrant index/query failure during the test window.
  • Host/container resources stayed healthy: host RAM available 8.6Gi, swap 0B used, disk / at 40%, TCP established 5.
  • One anomaly remains unexplained: a single patch_document step took 115323 ms of client-observed wall-clock time, yet backend logs in the same window showed no DB/Qdrant/runtime failure.
  • The Đ31 chain is not fully clean yet: the latest cron log within 7h shows env-contract-check PASS, logrotate-config-check PASS, and WATCHDOG alive, but rsyslog-health-check still reported a fault and the overall run ended PASS: 10 | FAIL: 117 | ERROR: 1.

Bottom line:

  • agent-data CRUD/search is operationally stable enough for normal use based on this hardtest.
  • There is no evidence of silent backend failure during the test window.
  • Đ31 recovery is only partial from an operations perspective: the runner is alive again, but the cron chain is not fully green.

2. Test Method

The test used direct MCP calls against the knowledge base:

  • search_knowledge
  • get_document
  • upload_document
  • patch_document
  • delete_document

Latency measurement method:

  • search_knowledge: tool-reported usage.latency_ms
  • get_document / upload_document / patch_document / delete_document: local wall-clock timing around each call
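The wall-clock timing for the non-search tools can be sketched as a simple wrapper; this is a minimal illustration, not the harness actually used:

```python
import time

def timed_call(fn, *args, **kwargs):
    """Wrap a call with local wall-clock timing; returns (result, elapsed_ms)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = round((time.perf_counter() - start) * 1000)
    return result, elapsed_ms

# Stub standing in for a real MCP call such as get_document:
result, ms = timed_call(lambda: "ok")
```

Note that this measures client-observed latency, so network and serialization overhead are included, which matters when interpreting spikes like the one discussed below.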

Cleanup verification:

  • Created test files knowledge/current-state/tests/hardtest-1.md through hardtest-5.md
  • Deleted all five after lifecycle tests
  • Final list check returned an empty directory

3. Part A — Hardtest CRUD Results

3.1 Summary Table

| Group | Count | PASS | FAIL | Avg (ms) | p95 (ms) | p99 (ms) | Max (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A1 Sequential search_knowledge | 20 | 20 | 0 | 3482.9 | 5638 | 7300 | 7300 |
| A1 Sequential get_document | 10 | 10 | 0 | 14257.1 | 17779 | 17779 | 17779 |
| A2 Concurrent search_knowledge (all rounds) | 30 | 30 | 0 | 2078.7 | 4732 | 8698 | 8698 |
| A3 CRUD lifecycle total | 5 | 5 | 0 | 105654.2 | 178823 | 178823 | 178823 |
| A3 upload_document | 5 | 5 | 0 | 15554.8 | 19688 | 19688 | 19688 |
| A3 get_document | 5 | 5 | 0 | 14548.0 | 17102 | 17102 | 17102 |
| A3 patch_document | 5 | 5 | 0 | 37690.0 | 115323 | 115323 | 115323 |
| A3 search_knowledge verify | 5 | 5 | 0 | 2043.4 | 3196 | 3196 | 3196 |
| A3 delete_document | 5 | 5 | 0 | 9249.8 | 14277 | 14277 | 14277 |
| A4 Heavy search_knowledge | 5 | 5 | 0 | 4876.2 | 5407 | 5407 | 5407 |

3.2 A1 Sequential Baseline

Sequential search_knowledge latencies, ms:

504, 3128, 4452, 3515, 2810, 3109, 3395, 3526, 3135, 3429,
5638, 4797, 7300, 2433, 3263, 2304, 3721, 3633, 2849, 2718

Sequential get_document latencies, ms:

12161, 16758, 14756, 10211, 9987, 17779, 15745, 15278, 15742, 14154

Interpretation:

  • Baseline search stayed in low single-digit seconds.
  • get_document is materially slower than search_knowledge in this environment, but it was consistent and had no failures.

3.3 A2 Concurrent Stress

Each round fired 10 search_knowledge requests near-simultaneously, then paused 5s.
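The fire-near-simultaneously pattern can be sketched with a thread pool; this is an illustration of the round structure, not the actual harness, and the stub call stands in for search_knowledge:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_round(call, n=10, pause_s=5):
    """Fire n calls near-simultaneously and collect per-call latencies in ms."""
    def one(i):
        start = time.perf_counter()
        call(i)
        return round((time.perf_counter() - start) * 1000)
    with ThreadPoolExecutor(max_workers=n) as pool:
        latencies = list(pool.map(one, range(n)))
    time.sleep(pause_s)  # cool-down between rounds
    return latencies

# Stub standing in for search_knowledge:
lat = run_round(lambda i: time.sleep(0.01), n=3, pause_s=0)
```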

Round 1 latencies, ms:

590, 433, 4732, 2778, 2131, 2872, 2783, 3999, 3900, 2381

Round 2 latencies, ms:

751, 460, 644, 507, 8698, 2395, 945, 511, 1402, 4184

Round 3 latencies, ms:

573, 440, 4112, 393, 398, 464, 856, 4388, 2888, 752

Per-round summary:

| Round | Count | PASS | FAIL | Avg (ms) | p95 (ms) | Max (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 10 | 10 | 0 | 2659.9 | 4732 | 4732 |
| 2 | 10 | 10 | 0 | 2049.7 | 8698 | 8698 |
| 3 | 10 | 10 | 0 | 1526.4 | 4388 | 4388 |
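The per-round statistics are reproducible from the raw latencies with a nearest-rank percentile; the exact percentile method used by the harness is not stated in this report, but nearest-rank matches every value above:

```python
import math

def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile: value at rank ceil(pct/100 * n) in sorted order."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

round2 = [751, 460, 644, 507, 8698, 2395, 945, 511, 1402, 4184]
avg = sum(round2) / len(round2)             # 2049.7
p95 = nearest_rank_percentile(round2, 95)   # 8698: with n=10, p95 lands on the max
```

With only 10 samples per round, p95 coincides with the maximum, which is why p95 and Max match in every row.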

Interpretation:

  • Concurrent search did not produce errors or retries.
  • Tail latency exists, with one 8698 ms outlier, but the overall average under concurrency was lower than the sequential baseline.

3.4 A3 CRUD Lifecycle

Each lifecycle did:

  1. upload_document
  2. get_document
  3. patch_document
  4. search_knowledge with unique marker
  5. delete_document
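The five steps above can be sketched as one loop; this assumes a hypothetical client object exposing the five MCP tools, and the marker scheme is illustrative:

```python
def run_lifecycle(client, n):
    """One CRUD lifecycle; method names mirror the MCP tools listed above."""
    path = f"knowledge/current-state/tests/hardtest-{n}.md"
    marker = f"HARDTEST-MARKER-{n}"
    client.upload_document(path, f"# hardtest {n}\n")
    client.get_document(path)                           # read-after-write
    client.patch_document(path, f"# hardtest {n}\n{marker}\n")
    hits = client.search_knowledge(marker)              # search-after-write on the marker
    client.delete_document(path)                        # cleanup
    return len(hits) > 0
```

Searching on a unique per-iteration marker is what lets a passing step 4 serve as evidence that the vector index saw the patch, not just the original upload.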

Per-lifecycle timings, ms:

| Iteration | Upload | Get | Patch | Search | Delete | Total |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 19688 | 14091 | 16609 | 3196 | 14277 | 93321 |
| 2 | 16316 | 11910 | 115323 | 1857 | 6007 | 178823 |
| 3 | 16052 | 16589 | 26306 | 1583 | 5204 | 85875 |
| 4 | 12803 | 13048 | 16466 | 1704 | 14022 | 80918 |
| 5 | 12915 | 17102 | 13746 | 1877 | 6739 | 89334 |

Evidence that vector update/read path completed correctly:

  • All 5 patch steps succeeded.
  • All 5 post-patch searches found the unique marker immediately.
  • Cleanup check returned zero remaining test files.

Cleanup evidence:

list_documents(path="knowledge/current-state/tests")
=> {"items":[],"count":0}

Interpretation:

  • Functional CRUD lifecycle passed 5/5.
  • The only hard anomaly in Part A is the second patch_document wall-clock spike at 115323 ms.
  • Because read-after-write and search-after-write still passed, this is not evidence of data loss or vector-sync failure by itself.

3.5 A4 Heavy Search

Heavy-query latencies, ms:

4990, 5161, 4217, 4606, 5407

Interpretation:

  • Complex search stayed roughly 4.2s to 5.4s.
  • No timeout or retry behavior was observed.

4. Part B — Runtime Logs and Resource Evidence

4.1 incomex-agent-data Logs

Command:

ssh root@38.242.240.89 "docker logs incomex-agent-data --since '2026-04-09T05:24:42+02:00' 2>&1 | tail -200"

Representative output:

Qdrant probe OK: 11 points (5ms)
PostgreSQL probe OK (0ms)
INFO:     172.18.0.1:57826 - "GET /health HTTP/1.1" 200 OK
INFO:     172.18.0.1:57842 - "POST /mcp HTTP/1.1" 200 OK
INFO:     172.18.0.1:57842 - "GET /kb/get/knowledge/current-state/tests/hardtest-4.md?full=true&search=false HTTP/1.1" 200 OK
INFO:     172.18.0.1:57852 - "POST /mcp HTTP/1.1" 200 OK
Qdrant probe OK: 11 points (26ms)

Focused log counts during the test window:

http_500=0
http_404=1
error_level=0
warning_level=0
traceback=0
exception=0
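The focused counts above can be produced by simple pattern matching over the log lines; a minimal sketch, using hypothetical sample lines shaped like the representative output (the path in the 404 line is illustrative):

```python
import re

# Hypothetical sample lines standing in for the docker logs output:
lines = [
    'INFO:     172.18.0.1:57826 - "GET /health HTTP/1.1" 200 OK',
    'INFO:     172.18.0.1:57842 - "POST /mcp HTTP/1.1" 200 OK',
    'INFO: 172.18.0.7:41816 - "GET /documents/some/missing/path.md HTTP/1.1" 404 Not Found',
]

counts = {
    "http_500": sum(1 for l in lines if re.search(r'" 500 ', l)),
    "http_404": sum(1 for l in lines if re.search(r'" 404 ', l)),
    "error_level": sum(1 for l in lines if l.startswith("ERROR")),
    "traceback": sum(1 for l in lines if "Traceback" in l),
}
```

Matching on `" 404 ` (status code right after the closing quote) avoids false hits when a digit sequence appears inside the request path.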

The lone 404 observed:

INFO: 172.18.0.7:41816 - "GET /documents/knowledge/dev/architecture/nd-36-01-semantic-relationship-infrastructure-draft.md?full=true&search=false HTTP/1.1" 404 Not Found

Assessment:

  • No hardtest-generated 500, exception, or warning was present.
  • The single 404 points to an unrelated missing draft path, not to any hardtest-* document.

4.2 Qdrant Logs

Command:

ssh root@38.242.240.89 "docker logs incomex-qdrant --since '2026-04-09T05:24:42+02:00' 2>&1 | tail -100"

Representative output:

2026-04-09T03:30:23.219534Z  INFO actix_web::middleware::logger: 172.18.0.3 "PUT /collections/documents/points?wait=true&ordering=weak HTTP/1.1" 200 639 "-" "-" 0.074744
2026-04-09T03:30:24.835833Z  INFO actix_web::middleware::logger: 172.18.0.3 "POST /collections/documents/points/search HTTP/1.1" 200 2736 "-" "-" 0.004804
2026-04-09T03:30:30.171320Z  INFO actix_web::middleware::logger: 172.18.0.3 "POST /collections/documents/points/search HTTP/1.1" 200 2976 "-" "-" 0.012230
2026-04-09T03:30:31.422245Z  INFO actix_web::middleware::logger: 172.18.0.3 "POST /collections/documents/points/delete?wait=true&ordering=weak HTTP/1.1" 200 99 "-" "-" 0.025348

Error-focused command:

ssh root@38.242.240.89 "docker logs incomex-qdrant --since '2026-04-09T05:24:42+02:00' 2>&1 | grep -iE 'error|warn|timeout|panic|fail' | tail -100"

Output:

[no output]

Assessment:

  • Qdrant handled search, upsert, and delete during the hardtest without visible error.
  • Server-side vector operations stayed in the sub-75ms range in sampled log lines.
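The sub-75ms claim comes straight off the log lines: actix_web's default access logger ends each line with the handling time in seconds, so the slowest sampled operation can be extracted like this (a sketch over the sample line above):

```python
import re

# Sample actix_web access-log line from the Qdrant output above:
line = ('2026-04-09T03:30:23.219534Z  INFO actix_web::middleware::logger: '
        '172.18.0.3 "PUT /collections/documents/points?wait=true&ordering=weak '
        'HTTP/1.1" 200 639 "-" "-" 0.074744')

# The trailing float is the request handling duration in seconds.
duration_ms = float(re.search(r'(\d+\.\d+)$', line).group(1)) * 1000  # ~74.7 ms
```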

4.3 PostgreSQL Logs

Command:

ssh root@38.242.240.89 "docker logs postgres --since '2026-04-09T05:24:42+02:00' 2>&1 | grep -iE 'error|slow|timeout|deadlock' | tail -50"

Output:

[no output]

Assessment:

  • No PostgreSQL evidence of slow query, timeout, error, or deadlock during the hardtest window.

4.4 Container and Host Resources

Command:

ssh root@38.242.240.89 "docker stats --no-stream --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}'"

Output:

NAME                 CPU %   MEM USAGE / LIMIT      NET I/O
incomex-nuxt         0.00%   84.03MiB / 512MiB      89.2MB / 540MB
uptime-kuma          1.44%   159.1MiB / 11.68GiB    331MB / 20.1MB
incomex-agent-data   1.82%   1.097GiB / 2.5GiB      102GB / 527MB
postgres             0.03%   222.5MiB / 2GiB        1.86GB / 279GB
incomex-nginx        0.00%   32.05MiB / 256MiB      11.5GB / 11.9GB
incomex-directus     6.82%   192.2MiB / 1GiB        1.12GB / 1.42GB
incomex-qdrant       0.49%   114.7MiB / 1GiB        640MB / 839MB

Command:

ssh root@38.242.240.89 "free -h && echo '---' && df -h /"

Output:

               total        used        free      shared  buff/cache   available
Mem:            11Gi       3.1Gi       5.8Gi       214Mi       2.7Gi       8.6Gi
Swap:          2.0Gi          0B       2.0Gi
---
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        96G   38G   59G  40% /

Command:

ssh root@38.242.240.89 "ss -s"

Output:

Total: 249
TCP:   64 (estab 5, closed 49, orphaned 0, timewait 4)

Assessment:

  • No CPU, RAM, disk, swap, or socket saturation evidence appeared during the test.
  • incomex-agent-data memory at 1.097GiB / 2.5GiB leaves meaningful headroom.

5. Part C — Đ31 Chain Post-Verification

5.1 Latest Cron Artifacts

Command:

ssh root@38.242.240.89 "ls -lt /opt/incomex/logs/integrity/cron-*.log | head -3"

Output:

-rw-r--r-- 1 root root 40118 Apr  9 04:49 /opt/incomex/logs/integrity/cron-20260409-044906.log
-rw-r--r-- 1 root root  1040 Apr  9 04:44 /opt/incomex/logs/integrity/cron-20260409-044426.log
-rw-r--r-- 1 root root   494 Apr  9 04:42 /opt/incomex/logs/integrity/cron-20260409-044209.log

This satisfies the user's check for a fresh cron log within 7h.

5.2 Latest Cron Summary

Tail evidence from /opt/incomex/logs/integrity/cron-20260409-044906.log:

PASS: 10 | FAIL: 117 | ERROR: 1
Pass Rate: 7.9% (10/127)
Issues Created: 0 | Reopened: 117
WATCHDOG: alive
run_id: cron-20260409-044906
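The summary line parses mechanically, and the log's stated pass rate is consistent with a denominator of PASS + FAIL (the ERROR run is excluded); a sketch:

```python
import re

summary = "PASS: 10 | FAIL: 117 | ERROR: 1"
passed, failed, errored = map(
    int, re.match(r"PASS: (\d+) \| FAIL: (\d+) \| ERROR: (\d+)", summary).groups()
)

total = passed + failed                      # 127, matching the log's "(10/127)"
pass_rate = round(100 * passed / total, 1)   # 7.9
```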

Focused chain grep:

ssh root@38.242.240.89 "grep -nE 'env-contract-check|logrotate-config-check|rsyslog-health-check|WATCHDOG|runner sống|Starting integrity scan|Missing required env' /opt/incomex/logs/integrity/cron-20260409-044906.log | tail -50"

Output:

4:env-contract-check: scanned 5 required vars
8:logrotate-config-check: dry-run complete
11:rsyslog-health-check: status=active, suspend_count_1h=651
14:WARN: rsyslog-health-check detected fault (exit=1). Issue reported. Runner continues.
161:    MSR-D31-WATCHDOG [dieu31] WATCHDOG — runner sống
794:  ⚡ MSR-D31-WATCHDOG: WATCHDOG — WATCHDOG — runner sống
797:    Delta: WATCHDOG — runner alive
804:  WATCHDOG: alive

Assessment:

  • env-contract-check: PASS
  • logrotate-config-check: PASS
  • rsyslog-health-check: NOT clean, fault detected
  • Runner: alive
  • Watchdog: alive

Therefore the cron chain is not fully PASS end-to-end.

5.3 system_issues Distribution

The user-provided SQL filtered on issue_type, but the current rsyslog health code records faults under issue_class='rsyslog_fault'. The direct query on issue_type returned 0 rows because of that column mismatch, not because rsyslog faults never existed.

Field evidence:

ssh root@38.242.240.89 "docker exec postgres psql -U directus -d directus -Atc \"SELECT issue_class, status, count(*) FROM system_issues WHERE issue_class IN ('config_error','env_drift','logrotate_drift','rsyslog_fault') GROUP BY 1,2 ORDER BY 1,2;\""

Output:

config_error|resolved|1
env_drift|resolved|1
logrotate_drift|resolved|1
rsyslog_fault|resolved|4

Recent rsyslog rows:

ssh root@38.242.240.89 "docker exec postgres psql -U directus -d directus -Atc \"SELECT issue_class, status, code, last_seen_at, resolved_at, occurrence_count FROM system_issues WHERE issue_class='rsyslog_fault' ORDER BY last_seen_at DESC LIMIT 5;\""

Output:

rsyslog_fault|resolved|ISS-2780|2026-04-09 02:49:06+00|2026-04-09 02:49:33.812|1
rsyslog_fault|resolved|ISS-2779|2026-04-09 02:44:27+00|2026-04-09 02:49:33.812|1
rsyslog_fault|resolved|ISS-2778|2026-04-09 02:42:10+00||1
rsyslog_fault|resolved|ISS-2777|2026-04-09 02:41:25+00||1

Interpretation:

  • The issue records exist and are currently marked resolved.
  • Two resolved rows show empty resolved_at, so lifecycle metadata is not fully normalized.
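The resolved-but-blank rows can be flagged directly from the pipe-delimited `psql -Atc` output, where an empty field means the column is NULL; a minimal sketch over two of the rows shown above:

```python
# Pipe-delimited rows as emitted by psql -Atc; an empty field means NULL.
rows = [
    "rsyslog_fault|resolved|ISS-2780|2026-04-09 02:49:06+00|2026-04-09 02:49:33.812|1",
    "rsyslog_fault|resolved|ISS-2778|2026-04-09 02:42:10+00||1",
]

# Collect codes of rows marked resolved whose resolved_at field is blank.
inconsistent = [
    fields[2]
    for fields in (r.split("|") for r in rows)
    if fields[1] == "resolved" and fields[4] == ""
]
# inconsistent == ["ISS-2778"]
```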

Implementation evidence from the local repo:

Relevant behavior:

  • rsyslog-health-check.sh counts journal suspend events over the last hour and reports faults under issue_class='rsyslog_fault'.
  • cron-integrity.sh treats this check as warn-only and allows the runner to continue.

6. Conclusion

6.1 Is agent-data CRUD stable?

Yes, with an important caveat.

  • Measured operations: 90/90 PASS
  • Observed fail rate: 0%
  • Concurrent search hardtest produced no retries or failed calls
  • Search p95 stayed at 4732 ms under concurrent load and 5638 ms in sequential baseline
  • Heavy search max was 5407 ms
  • End-to-end CRUD lifecycle passed 5/5

The caveat is one client-observed patch_document latency spike at 115323 ms. That spike is real, but current evidence does not tie it to Qdrant, PostgreSQL, memory pressure, CPU pressure, or backend error logs.

6.2 Is there any silent instability left?

No silent backend failure was found during the hardtest window.

Evidence:

  • incomex-agent-data: no 500, ERROR, WARNING, Traceback, or exception
  • postgres: no error|slow|timeout|deadlock
  • qdrant: no error|warn|timeout|panic|fail
  • Host and container resource headroom remained healthy

The only notable residual signals are:

  • the unexplained patch_document latency spike
  • an unrelated 404 on a missing draft knowledge path
  • latest Đ31 cron still warning on rsyslog history

6.3 Does the Đ31 chain PASS after the fixes?

Not fully.

What is confirmed PASS:

  • fresh cron artifact exists within 7h
  • env contract check ran successfully
  • logrotate config dry-run completed
  • runner is alive
  • watchdog is alive

What is not clean:

  • latest cron log still reports rsyslog-health-check detected fault
  • latest integrity run still shows PASS: 10 | FAIL: 117 | ERROR: 1

So the correct conclusion is: Đ31 is revived, but the whole chain is not yet fully green.

7. Unknowns

  • Root cause of the 115323 ms patch latency spike is not proven by current evidence.
  • It is not yet proven whether the rsyslog warning in the latest cron log reflects a still-active problem or a stale one-hour lookback over events that had already self-recovered before the hardtest window.
  • The system_issues lifecycle metadata for rsyslog_fault is partially inconsistent because some resolved rows have blank resolved_at.

8. Appendix — Commands and Evidence

8.1 MCP Test Operations

Sequential search test:

20 direct calls to search_knowledge(<distinct query>)
Result: 20/20 PASS

Sequential get test:

10 direct calls to get_document(<distinct path>)
Result: 10/10 PASS

Concurrent search stress:

3 rounds x 10 parallel search_knowledge calls
Pause: 5s between rounds
Result: 30/30 PASS

CRUD lifecycle:

For N=1..5:
- upload_document(knowledge/current-state/tests/hardtest-N.md)
- get_document(...)
- patch_document(...)
- search_knowledge(unique patched marker)
- delete_document(...)
Result: 5/5 lifecycle PASS

Cleanup:

list_documents(path="knowledge/current-state/tests")
=> {"items":[],"count":0}

8.2 VPS Log and Resource Commands

ssh root@38.242.240.89 "docker logs incomex-agent-data --since '2026-04-09T05:24:42+02:00' 2>&1 | tail -200"
ssh root@38.242.240.89 "docker logs incomex-qdrant --since '2026-04-09T05:24:42+02:00' 2>&1 | tail -100"
ssh root@38.242.240.89 "docker logs incomex-qdrant --since '2026-04-09T05:24:42+02:00' 2>&1 | grep -iE 'error|warn|timeout|panic|fail' | tail -100"
ssh root@38.242.240.89 "docker logs postgres --since '2026-04-09T05:24:42+02:00' 2>&1 | grep -iE 'error|slow|timeout|deadlock' | tail -50"
ssh root@38.242.240.89 "docker stats --no-stream --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}'"
ssh root@38.242.240.89 "free -h && echo '---' && df -h /"
ssh root@38.242.240.89 "ss -s"

8.3 Đ31 Commands

ssh root@38.242.240.89 "ls -lt /opt/incomex/logs/integrity/cron-*.log | head -3"
ssh root@38.242.240.89 "grep -nE 'env-contract-check|logrotate-config-check|rsyslog-health-check|WATCHDOG|runner sống|Starting integrity scan|Missing required env' /opt/incomex/logs/integrity/cron-20260409-044906.log | tail -50"
ssh root@38.242.240.89 "docker exec postgres psql -U directus -d directus -Atc \"SELECT issue_class, status, count(*) FROM system_issues WHERE issue_class IN ('config_error','env_drift','logrotate_drift','rsyslog_fault') GROUP BY 1,2 ORDER BY 1,2;\""
ssh root@38.242.240.89 "docker exec postgres psql -U directus -d directus -Atc \"SELECT issue_class, status, code, last_seen_at, resolved_at, occurrence_count FROM system_issues WHERE issue_class='rsyslog_fault' ORDER BY last_seen_at DESC LIMIT 5;\""