KB-1451
S174-FIX-01: PG Backup Repair Report
Revision 1
S174-FIX-01: PG Backup Repair + Kuma Heartbeat — Report
Date: 2026-04-08 Status: DONE, awaiting Desktop approval
Summary of changes
| File | Action |
|---|---|
| `/opt/incomex/scripts/pg-backup.sh` | NEW: replaces `mysql-backup.sh.retired`. Backs up Directus PG → `/opt/incomex/backups/pg/` |
| `/opt/workflow/postgres/backup.sh` | MODIFIED: container `workflow-postgres` → `postgres`. GCS upload → best-effort (IAM was already broken). |
| `/opt/incomex/scripts/backup-to-gdrive.sh` | ADDED: Kuma heartbeat at the end of the script |
| Kuma monitors (3) | NEW: PG Backup Local (#12), PG Backup Workflow (#13), PG Backup GDrive (#14) |
| Crontab | ADDED: `0 2 * * * /opt/incomex/scripts/pg-backup.sh` |
Not deleted:
- `mysql-backup.sh.retired`: kept unchanged as evidence
- `backup.sh.pre-s174`: original workflow script from before the fix
- `backup-to-gdrive.sh.pre-s174`: original gdrive script
VERIFY Checklist
V1: File sizes
$ ls -la /opt/incomex/backups/pg/*.gz | tail -5
-rw------- 1 root root 36097373 Apr 8 11:23 directus_2026-04-08_0922.sql.gz
-rw------- 1 root root 36097347 Apr 8 11:26 directus_2026-04-08_0926.sql.gz
$ ls -la /opt/workflow/postgres/backups/workflow_20260408T092825Z.sql.gz
-rw------- 1 root root 456 Apr 8 11:28 workflow_20260408T092825Z.sql.gz
- PG Local: 36,097,347 bytes (35MB) — Directus DB full dump
- Workflow: 456 bytes. The workflow DB is nearly empty (40 lines of SQL), so this size is expected
V2: gunzip -t
$ gunzip -t /opt/incomex/backups/pg/directus_2026-04-08_0926.sql.gz
exit code: 0 (PASS)
$ gunzip -t /opt/workflow/postgres/backups/workflow_20260408T092825Z.sql.gz
exit code: 0 (PASS)
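The V1 and V2 checks can be combined into one helper. The following is a minimal sketch, not the code in pg-backup.sh: the function name `verify_backup` and its messages are assumptions, and the caller supplies the threshold (this report uses >1KB for the Directus dump and >50 bytes for the workflow dump).

```bash
#!/usr/bin/env bash
set -euo pipefail

# verify_backup FILE MIN_BYTES  (hypothetical helper name)
# Returns 0 only if FILE exists, is larger than MIN_BYTES, and passes gunzip -t.
verify_backup() {
  local file="$1" min_bytes="$2" size
  if [ ! -f "$file" ]; then
    echo "ERROR: $file does not exist" >&2
    return 1
  fi
  # Arithmetic expansion strips any padding that wc may add.
  size=$(( $(wc -c < "$file") ))
  if [ "$size" -le "$min_bytes" ]; then
    echo "ERROR: $file is only $size bytes (need > $min_bytes)" >&2
    return 1
  fi
  if ! gunzip -t "$file"; then
    echo "ERROR: $file failed the gzip integrity check" >&2
    return 1
  fi
  echo "OK $file size=$size"
}
```

A call such as `verify_backup /opt/incomex/backups/pg/directus_2026-04-08_0926.sql.gz 1024` would then cover both checks in one step.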
V3: No 2>/dev/null left in the new scripts
$ grep -n '/dev/null' /opt/incomex/scripts/pg-backup.sh
(empty, CLEAN)
$ grep -n '/dev/null' /opt/workflow/postgres/backup.sh
(empty, CLEAN)
Note on backup-to-gdrive.sh: the script keeps 13 occurrences of 2>/dev/null on auxiliary operations (Qdrant snapshot discovery, file copy, rclone cleanup). These provide fault tolerance for optional features and are NOT on the main PG dump path. The critical path docker exec postgres pg_dump... has no suppression. Removing them would break the Qdrant snapshot flow when a collection does not exist.
V4: set -euo pipefail
/opt/incomex/scripts/pg-backup.sh:8:set -euo pipefail
/opt/workflow/postgres/backup.sh:7:set -euo pipefail
/opt/incomex/scripts/backup-to-gdrive.sh:8:set -euo pipefail
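Why `pipefail` matters for a `pg_dump | gzip` pipeline can be shown with a stand-in producer. This is an illustrative sketch only: `failing_dump` and the two helper functions are hypothetical names and do not come from the real scripts.

```bash
#!/usr/bin/env bash

# Stand-in for a failing pg_dump: writes nothing and exits non-zero.
failing_dump() { return 1; }

# Without pipefail the pipeline status is gzip's status, so the failed
# producer goes unnoticed and a small archive is still written to $1.
status_without_pipefail() {
  (
    set +o pipefail
    rc=0
    failing_dump | gzip > "$1" || rc=$?
    echo "$rc"
  )
}

# With pipefail the pipeline reports the failing producer's status.
status_with_pipefail() {
  (
    set -o pipefail
    rc=0
    failing_dump | gzip > "$1" || rc=$?
    echo "$rc"
  )
}
```

gzip run on empty input still writes a small archive that passes `gunzip -t`, which is one reason the scripts pair the integrity check with a minimum size check.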
V5: Crontab entries
$ crontab -l | grep -i backup
0 2 * * * /opt/workflow/postgres/backup.sh >> /opt/workflow/postgres/backup.log 2>&1
0 20 * * * /opt/incomex/scripts/backup-to-gdrive.sh >> /opt/incomex/logs/backup-gdrive.log 2>&1
# S174-FIX-01: Daily Directus PG backup (replaces retired mysql-backup.sh)
0 2 * * * /opt/incomex/scripts/pg-backup.sh >> /opt/incomex/backups/pg/backup.log 2>&1
V6: Kuma monitors
PG Backup Local | pg-backup-local | 90000s (25h) | active
PG Backup Workflow | pg-backup-workflow | 90000s (25h) | active
PG Backup GDrive | pg-backup-gdrive | 90000s (25h) | active
All monitors are linked to the Telegram-Jack notification (id: 2)
Last heartbeats:
PG Backup Local | 2026-04-08 09:26:19 | OK size=35M
PG Backup Workflow | 2026-04-08 09:28:28 | OK size=4.0K
PG Backup GDrive | 2026-04-08 09:19:02 | test-init (waiting for tonight's gdrive cron run)
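A heartbeat to an Uptime Kuma push monitor is a plain HTTP GET to `/api/push/<token>` with `status` and `msg` query parameters. The sketch below shows one way to send it. The host `kuma.example.com` is a placeholder, the function name `kuma_push` is hypothetical, and using the monitor names listed above as push tokens is an assumption.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholder host: replace with the real Kuma base URL.
KUMA_BASE_URL="${KUMA_BASE_URL:-https://kuma.example.com}"

# kuma_push TOKEN MESSAGE  (hypothetical helper name)
# Sends an "up" heartbeat; -f makes curl return non-zero on HTTP errors,
# and --data-urlencode with -G safely encodes spaces and '=' in MESSAGE.
kuma_push() {
  local token="$1" msg="$2"
  curl -fsS --max-time 10 -G -o /dev/null \
    --data-urlencode "status=up" \
    --data-urlencode "msg=${msg}" \
    "${KUMA_BASE_URL}/api/push/${token}"
}
```

Calling `kuma_push pg-backup-local "OK size=35M"` only after every check has passed keeps the monitor silent on failure, so the 25h timeout is what raises the alert.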
V7: Failure simulation
$ docker rename postgres postgres-test-s174
$ /opt/incomex/scripts/pg-backup.sh
[2026-04-08T09:29:02Z] START pg-backup -> .../directus_2026-04-08_0929.sql.gz
[2026-04-08T09:29:03Z] ERROR: container postgres is not running
EXIT_CODE=1
- Script exits with code 1 ✅
- No empty file is created (the container check runs BEFORE pg_dump) ✅
- Kuma receives NO heartbeat ✅
- If no heartbeat arrives within 25h, Kuma fires the Telegram alert automatically
- Container restored immediately after the test: `docker rename postgres-test-s174 postgres` ✅
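The behaviour verified in V7 depends on checking the container before the output file is opened. A minimal sketch of that ordering follows. The function names are hypothetical, and removing a partial file when the dump fails is an addition for illustration, not something the report confirms.

```bash
#!/usr/bin/env bash
set -euo pipefail

# container_running NAME: true only if Docker reports the container as running.
container_running() {
  local name="$1" state
  state=$(docker inspect -f '{{.State.Running}}' "$name" 2>&1) || return 1
  [ "$state" = "true" ]
}

# guarded_dump NAME OUTFILE: never opens OUTFILE unless the container is up.
guarded_dump() {
  local name="$1" out="$2"
  if ! container_running "$name"; then
    echo "ERROR: container $name is not running" >&2
    return 1
  fi
  # With pipefail a pg_dump failure fails the whole pipeline.
  # Assumption: drop the partial file so no truncated dump is left behind.
  docker exec "$name" pg_dump -U directus -d directus | gzip > "$out" || {
    rm -f "$out"
    return 1
  }
}
```

Because the redirection to the output file only happens after the check, a renamed or stopped container produces an error and exit code 1 without leaving a file on disk.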
V8: Restore dry-run
$ zcat directus_2026-04-08_0926.sql.gz | head -3
-- PostgreSQL database dump
-- Dumped from database version 16.13
-- Dumped by pg_dump version 16.13
$ zcat directus_2026-04-08_0926.sql.gz | grep -c 'CREATE TABLE\|INSERT INTO\|COPY'
570 (statements)
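The V8 inspection can be scripted so that it also behaves under `set -euo pipefail`. This is a sketch under assumptions: `dump_summary` is a hypothetical helper, and it only inspects the dump text without replaying it into a database. On large dumps `zcat file | head -3` can end the decompressor early with SIGPIPE, which `pipefail` reports as a failure, so the sketch reads the stream to the end instead.

```bash
#!/usr/bin/env bash
set -euo pipefail

# dump_summary FILE  (hypothetical helper name)
# Prints "header=<yes|no> statements=<N>" for a gzip-compressed SQL dump.
dump_summary() {
  local file="$1" header_hits statements
  gunzip -t "$file" || return 1
  # sed -n '1,3p' consumes the whole stream, so gzip is never cut short.
  # grep -c exits 1 when the count is 0, hence the "|| true".
  header_hits=$(gzip -dc "$file" | sed -n '1,3p' | grep -c 'PostgreSQL database dump') || true
  statements=$(gzip -dc "$file" | grep -cE 'CREATE TABLE|INSERT INTO|COPY') || true
  if [ "$header_hits" -gt 0 ]; then
    echo "header=yes statements=$statements"
  else
    echo "header=no statements=$statements"
  fi
}
```

The count is a number of matching lines, the same quantity that `grep -c` reports in the transcript above.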
New script design
pg-backup.sh (new)
Container check → pg_dump | gzip → size check (>1KB) → gunzip -t → retention cleanup → Kuma heartbeat
- Target: `docker exec postgres pg_dump -U directus -d directus`
- Output: `/opt/incomex/backups/pg/directus_YYYY-MM-DD_HHMM.sql.gz`
- Retention: 7 days
- Fail-safe: container check before the dump, size check, gzip integrity check
- Heartbeat: the `pg-backup-local` push monitor is notified only when ALL checks pass
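The retention step in the pipeline above could be implemented with `find`. This is a sketch, assuming GNU or BSD `find` (for `-maxdepth` and `-delete`); the function name `prune_old_backups` is hypothetical. The name filter matters because `backup.log` is written to the same directory by the cron entry.

```bash
#!/usr/bin/env bash
set -euo pipefail

# prune_old_backups DIR DAYS  (hypothetical helper name)
# Deletes directus_*.sql.gz files directly inside DIR that are older than DAYS.
# Note: find counts whole 24h periods, so "-mtime +7" matches files that
# are at least 8 days old.
prune_old_backups() {
  local dir="$1" days="$2"
  find "$dir" -maxdepth 1 -type f -name 'directus_*.sql.gz' \
    -mtime "+${days}" -print -delete
}
```

Running `prune_old_backups /opt/incomex/backups/pg 7` after the integrity checks, not before, means a failed run never removes the last good dump.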
workflow backup.sh (modified)
Container check → pg_dump | gzip → size check (>50 bytes) → gunzip -t → GCS upload (best-effort) → retention → Kuma heartbeat
- Main fix: `workflow-postgres` → `postgres` (the container that actually exists)
- Size threshold: 50 bytes (the workflow DB is nearly empty, so 456 bytes compressed is correct)
- GCS upload: best-effort because IAM permissions are broken (pre-existing, out of scope)
- Heartbeat: `pg-backup-workflow` is sent when the local backup is valid (GCS failure = WARN only)
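Under `set -e` a failing upload would normally abort the script before the heartbeat, so a best-effort step needs an explicit wrapper. The sketch below shows one way to express it. The name `best_effort` is hypothetical, and the report does not say which upload command the real script uses.

```bash
#!/usr/bin/env bash
set -euo pipefail

# best_effort LABEL COMMAND [ARGS...]  (hypothetical helper name)
# Runs COMMAND; on failure logs a WARN line to stderr and still returns 0,
# so the caller continues to retention and heartbeat.
best_effort() {
  local label="$1" rc=0
  shift
  "$@" || rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] WARN: ${label} failed (exit ${rc}), continuing" >&2
  fi
  return 0
}

# Example call (the upload tool and bucket name are placeholders):
# best_effort "GCS upload" gsutil cp "$backup_file" "gs://example-bucket/workflow/"
```

Unlike `2>/dev/null`, this keeps the error output of the wrapped command visible in the log while preventing it from blocking the local backup.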
backup-to-gdrive.sh (heartbeat added)
- Only 4 lines added at the end: Kuma push to `pg-backup-gdrive` after a successful upload
- Existing logic is unchanged
Additional issues found (out of scope, need separate fixes)
- GCS IAM broken: `cursor-ci-builder@...` lacks the `storage.buckets.list` permission. The workflow backup GCS upload fails (WARN, does not block the local backup).
- `backup-to-gdrive.sh` auxiliary `2>/dev/null`: 13 instances on the Qdrant and file copy paths. They provide fault tolerance for optional features and do not mask PG backup failures. A separate review is needed if a 100% clean script is wanted.
- Repo `infra/cron/incomex-crontab` is stale: the file in the repo still references `mysql-backup.sh`. An update needs to be committed.
Conclusion
- 2 of 3 scripts ran successfully, with output >50 bytes and valid gzip
- pg-backup-gdrive is waiting for tonight's run (cron 20:00 UTC)
- Failure simulation: the script exits non-zero, creates no empty file, and sends no heartbeat
- 3 Kuma monitors active, linked to Telegram-Jack, 25h timeout
- No .retired files were deleted; containers and schema were not touched