KB-1451

S174-FIX-01: PG Backup Repair Report

Revision 1

S174-FIX-01: PG Backup Repair + Kuma Heartbeat — Report

Date: 2026-04-08 Status: DONE, awaiting Desktop review


Summary of changes

File | Action
/opt/incomex/scripts/pg-backup.sh | NEW. Replaces mysql-backup.sh.retired. Backs up the Directus PG database to /opt/incomex/backups/pg/
/opt/workflow/postgres/backup.sh | MODIFIED. Container name changed: workflow-postgres → postgres. GCS upload is now best-effort (IAM was already broken before this fix).
/opt/incomex/scripts/backup-to-gdrive.sh | ADDED. Kuma heartbeat at the end of the script
Kuma monitors (3) | NEW. PG Backup Local (#12), PG Backup Workflow (#13), PG Backup GDrive (#14)
Crontab | ADDED. 0 2 * * * /opt/incomex/scripts/pg-backup.sh

Not deleted:

  • mysql-backup.sh.retired: kept unchanged as evidence
  • backup.sh.pre-s174: original workflow script from before the fix
  • backup-to-gdrive.sh.pre-s174: original gdrive script

VERIFY Checklist

V1: File sizes

$ ls -la /opt/incomex/backups/pg/*.gz | tail -5
-rw------- 1 root root 36097373 Apr  8 11:23 directus_2026-04-08_0922.sql.gz
-rw------- 1 root root 36097347 Apr  8 11:26 directus_2026-04-08_0926.sql.gz

$ ls -la /opt/workflow/postgres/backups/workflow_20260408T092825Z.sql.gz
-rw------- 1 root root 456 Apr  8 11:28 workflow_20260408T092825Z.sql.gz
  • PG Local: 36,097,347 bytes (about 34.4 MiB, reported as 35M by the heartbeat): full dump of the Directus DB
  • Workflow: 456 bytes. The workflow DB is nearly empty (40 lines of SQL), so this is the expected size

V2: gunzip -t

$ gunzip -t /opt/incomex/backups/pg/directus_2026-04-08_0926.sql.gz
exit code: 0 (PASS)

$ gunzip -t /opt/workflow/postgres/backups/workflow_20260408T092825Z.sql.gz
exit code: 0 (PASS)
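
The V1 and V2 checks can be folded into one reusable function. The following is a minimal sketch, not the code from the deployed scripts: the helper name `validate_dump` is hypothetical, and the minimum size is passed in by the caller (this report uses 1 KB for Directus and 50 bytes for workflow).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if FILE is larger than MIN_BYTES
# and passes the gzip integrity test (the same checks as V1 and V2).
validate_dump() {
  local file=$1 min_bytes=$2 size
  [ -f "$file" ] || { echo "ERROR: $file not found" >&2; return 1; }
  # Arithmetic expansion strips the padding some wc implementations add.
  size=$(( $(wc -c < "$file") ))
  if [ "$size" -le "$min_bytes" ]; then
    echo "ERROR: $file is $size bytes, expected more than $min_bytes" >&2
    return 1
  fi
  # gzip -t is equivalent to gunzip -t and prints its own error on failure.
  gzip -t "$file" || { echo "ERROR: $file failed gzip -t" >&2; return 1; }
  echo "OK $file size=$size"
}
```

Example call, assuming 1 KB means 1024 bytes: `validate_dump /opt/incomex/backups/pg/directus_2026-04-08_0926.sql.gz 1024`.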

V3: No 2>/dev/null left in the new scripts

$ grep -n '/dev/null' /opt/incomex/scripts/pg-backup.sh
(empty, CLEAN)

$ grep -n '/dev/null' /opt/workflow/postgres/backup.sh
(empty, CLEAN)

Note on backup-to-gdrive.sh: the script keeps 13 occurrences of 2>/dev/null on auxiliary operations (Qdrant snapshot discovery, file copy, rclone cleanup). These provide fault tolerance for optional features and are NOT on the main PG dump path. The critical path, docker exec postgres pg_dump..., has no suppression. Removing them would break the Qdrant snapshot flow when a collection does not exist.

V4: set -euo pipefail

/opt/incomex/scripts/pg-backup.sh:8:set -euo pipefail
/opt/workflow/postgres/backup.sh:7:set -euo pipefail
/opt/incomex/scripts/backup-to-gdrive.sh:8:set -euo pipefail

V5: Crontab entries

$ crontab -l | grep -i backup
0 2 * * * /opt/workflow/postgres/backup.sh >> /opt/workflow/postgres/backup.log 2>&1
0 20 * * * /opt/incomex/scripts/backup-to-gdrive.sh >> /opt/incomex/logs/backup-gdrive.log 2>&1
# S174-FIX-01: Daily Directus PG backup (replaces retired mysql-backup.sh)
0 2 * * * /opt/incomex/scripts/pg-backup.sh >> /opt/incomex/backups/pg/backup.log 2>&1

V6: Kuma monitors

PG Backup Local    | pg-backup-local    | 90000s (25h) | active
PG Backup Workflow | pg-backup-workflow | 90000s (25h) | active
PG Backup GDrive   | pg-backup-gdrive   | 90000s (25h) | active

All monitors are linked to the Telegram-Jack notification (id: 2).

Last heartbeats:

PG Backup Local    | 2026-04-08 09:26:19 | OK size=35M
PG Backup Workflow | 2026-04-08 09:28:28 | OK size=4.0K
PG Backup GDrive   | 2026-04-08 09:19:02 | test-init (awaiting tonight's gdrive cron run)

V7: Failure simulation

$ docker rename postgres postgres-test-s174
$ /opt/incomex/scripts/pg-backup.sh
[2026-04-08T09:29:02Z] START pg-backup -> .../directus_2026-04-08_0929.sql.gz
[2026-04-08T09:29:03Z] ERROR: container postgres is not running
EXIT_CODE=1
  • Script exits with code 1 ✅
  • No empty file is created (the container check runs BEFORE pg_dump) ✅
  • Kuma receives NO heartbeat ✅
  • If no heartbeat arrives within 25h, Kuma fires the Telegram alert automatically
  • Container restored immediately after the test: docker rename postgres-test-s174 postgres
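
The container check that produced the ERROR line above could look roughly like this. It is a sketch under assumptions: the function names `container_running` and `require_container` are hypothetical, and the real script may detect the container differently (for example with `docker ps`).

```shell
#!/usr/bin/env bash
# Succeeds only when the named container exists and is running.
# docker's own stderr is left visible so the log shows why the check failed.
container_running() {
  local name=$1 state
  state=$(docker inspect -f '{{.State.Running}}' "$name") || return 1
  [ "$state" = "true" ]
}

# Logs an error in the report's log format and fails when the container is not usable.
require_container() {
  if ! container_running "$1"; then
    echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] ERROR: container $1 is not running" >&2
    return 1
  fi
}
```

Calling `require_container postgres` before `pg_dump` means that, under `set -e`, the script stops before any output file is opened, which matches the "no empty file" result in V7.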

V8: Restore dry-run

$ zcat directus_2026-04-08_0926.sql.gz | head -3
-- PostgreSQL database dump
-- Dumped from database version 16.13
-- Dumped by pg_dump version 16.13

$ zcat directus_2026-04-08_0926.sql.gz | grep -c 'CREATE TABLE\|INSERT INTO\|COPY'
570  (statements)
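
The same dry-run can be scripted so that it fails loudly instead of relying on a visual check. This is a sketch only: `dry_run_check` is a hypothetical name, and the pattern is the one used in the `grep -c` command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the number of lines matching the V8 pattern.
# Fails if the file has no pg_dump header or no matching lines.
dry_run_check() {
  local file=$1 count
  # awk reads the whole stream, so gzip never writes into a closed pipe
  # (with `head`, a large dump can fail under `set -o pipefail`).
  gzip -dc "$file" \
    | awk 'NR <= 5 && /PostgreSQL database dump/ { ok = 1 } END { exit !ok }' \
    || { echo "ERROR: $file does not look like a pg_dump file" >&2; return 1; }
  # grep -c exits 1 when the count is 0, hence the `|| true`.
  count=$(gzip -dc "$file" | grep -c -E 'CREATE TABLE|INSERT INTO|COPY' || true)
  [ "$count" -gt 0 ] || { echo "ERROR: no DDL or data statements in $file" >&2; return 1; }
  echo "$count"
}
```

Note that this counts matching lines, not parsed SQL statements, so it is a sanity check rather than a real restore test.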

New script design

pg-backup.sh (new)

Container check → pg_dump | gzip → size check (>1KB) → gunzip -t → retention cleanup → Kuma heartbeat
  • Target: docker exec postgres pg_dump -U directus -d directus
  • Output: /opt/incomex/backups/pg/directus_YYYY-MM-DD_HHMM.sql.gz
  • Retention: 7 days
  • Fail-safe: container check before the dump, size check, gzip integrity check
  • Heartbeat: the pg-backup-local push monitor is pinged only when ALL checks pass
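
The dump, check, and retention steps of this flow can be sketched as below. This is an illustration under assumptions, not the deployed pg-backup.sh: the function name `backup_once`, the `.partial` suffix, and passing the dump command as arguments are all hypothetical choices made so the logic can be exercised without a database.

```shell
#!/usr/bin/env bash
set -euo pipefail

log() { printf '[%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*"; }

# backup_once OUT_DIR MIN_BYTES DUMP_CMD [ARGS...]
# Runs DUMP_CMD | gzip into a .partial file and renames it only after the
# size and integrity checks pass, so a failed run leaves no file behind.
backup_once() {
  local out_dir=$1 min_bytes=$2
  shift 2
  local out tmp size
  out="$out_dir/directus_$(date +%Y-%m-%d_%H%M).sql.gz"
  tmp="$out.partial"
  log "START pg-backup -> $out"

  # pipefail makes the pipeline fail when the dump command fails,
  # even though gzip itself exits 0 on empty input.
  if ! "$@" | gzip > "$tmp"; then
    log "ERROR: dump command failed"
    rm -f "$tmp"
    return 1
  fi

  size=$(( $(wc -c < "$tmp") ))
  if [ "$size" -le "$min_bytes" ] || ! gzip -t "$tmp"; then
    log "ERROR: dump rejected (size=$size bytes)"
    rm -f "$tmp"
    return 1
  fi

  mv "$tmp" "$out"
  # Retention: -mtime +7 matches files whose age in whole days exceeds 7.
  find "$out_dir" -type f -name 'directus_*.sql.gz' -mtime +7 -delete
  log "OK size=$size"
}
```

A call matching the target above would be `backup_once /opt/incomex/backups/pg 1024 docker exec postgres pg_dump -U directus -d directus`, assuming the 1 KB threshold means 1024 bytes. The heartbeat would be sent only after this function returns 0.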

workflow backup.sh (modified)

Container check → pg_dump | gzip → size check (>50 bytes) → gunzip -t → GCS upload (best-effort) → retention → Kuma heartbeat
  • Main fix: container name workflow-postgres → postgres (the actual container)
  • Size threshold: 50 bytes (the workflow DB is nearly empty, so 456 bytes compressed is correct)
  • GCS upload: best-effort because IAM permissions are broken (pre-existing, out of scope)
  • Heartbeat: pg-backup-workflow is sent when the local backup is valid (a GCS failure is only a WARN)
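
"Best-effort" here means a failure is logged as a WARN and the script continues, rather than the error being hidden. One way to express that under `set -e` is a small wrapper; the name `best_effort` is hypothetical and the actual backup.sh may implement this inline.

```shell
#!/usr/bin/env bash
# best_effort LABEL CMD [ARGS...]
# Runs CMD. On failure it logs a WARN with the exit code and returns 0,
# so a script running under `set -e` carries on. The command's own stderr
# stays visible, unlike with 2>/dev/null.
best_effort() {
  local label=$1 rc=0
  shift
  "$@" || rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] WARN: $label failed (exit $rc), continuing" >&2
  fi
  return 0
}
```

The upload step would then be wrapped as `best_effort "GCS upload" <upload command>`, with the heartbeat sent afterwards regardless of the upload result.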

backup-to-gdrive.sh (heartbeat added)

  • Only 4 lines were added at the end: a Kuma push to pg-backup-gdrive after a successful upload
  • Existing logic is unchanged
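
For reference, a heartbeat push might look like the sketch below. It assumes "Kuma" is Uptime Kuma and that its push monitors accept `GET <base>/api/push/<token>?status=up&msg=...`; the base URL and token are placeholders, and `kuma_push` is a hypothetical name rather than the code added to the scripts.

```shell
#!/usr/bin/env bash
# kuma_push BASE_URL TOKEN MSG
# Sends an "up" heartbeat to a push monitor.
# -f: fail on HTTP errors, -sS: quiet but still print errors,
# -m 10: give up after 10 seconds, -G: send the data as a query string.
kuma_push() {
  local base=$1 token=$2 msg=$3
  curl -fsS -m 10 -G \
    --data-urlencode "status=up" \
    --data-urlencode "msg=$msg" \
    "$base/api/push/$token"
}
```

Placed as the last command of a script running under `set -e`, for example `kuma_push "$KUMA_URL" "$KUMA_TOKEN" "OK size=35M"` with both variables being assumptions, the push is reached only when every earlier step succeeded.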

Additional issues found (out of scope, to be fixed separately)

  1. GCS IAM broken: cursor-ci-builder@... lacks the storage.buckets.list permission. The workflow backup's GCS upload fails (WARN only; it does not block the local backup).
  2. Auxiliary 2>/dev/null in backup-to-gdrive.sh: 13 instances on the Qdrant and file-copy paths. They are fault tolerance for optional features and do not mask PG backup failures. They need a separate review if a 100% clean script is wanted.
  3. Stale infra/cron/incomex-crontab in the repo: the file in the repo still references mysql-backup.sh. An update needs to be committed.
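
Drift like issue 3 can be detected mechanically. The sketch below assumes the repo file is meant to be a full copy of the live crontab of the user running the check; `crontab_drift` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# crontab_drift REPO_FILE
# Compares the live crontab with the copy tracked in the repo.
# Prints a unified diff and returns 1 when they differ.
crontab_drift() {
  local repo_file=$1
  if crontab -l | diff -u "$repo_file" -; then
    echo "crontab matches $repo_file"
  else
    echo "DRIFT: live crontab differs from $repo_file" >&2
    return 1
  fi
}
```

Run from the repo root as `crontab_drift infra/cron/incomex-crontab`; a non-zero exit indicates the file needs updating.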

Conclusion

  • 2 of 3 scripts ran successfully, with output >50 bytes and valid gzip
  • pg-backup-gdrive is awaiting tonight's run (cron 20:00 UTC)
  • Failure simulation: the script exits non-zero, creates no empty file, and sends no heartbeat
  • 3 Kuma monitors are active, linked to Telegram-Jack, with a 25h timeout
  • No .retired files were deleted; containers and schema were not touched