KB-2059

P9 G6 Backup Integrity Recovery — 2026-04-27

Revision 1
Tags: dieu38, g6, backup, pg-dump, sandbox_tac, recovery, 2026-04-27, hard-stop-5

title: P9 G6 Backup Integrity Recovery - 2026-04-27
date: 2026-04-27
executor: Claude Code (medium effort, VPS context via SSH contabo)
type: investigation-recovery-report
status: HARD_STOP_5_TRIGGERED - fresh backup integrity FAIL

G6 Backup Integrity Recovery — 2026-04-27

TL;DR

Both anomalies share ONE root cause: the PostgreSQL role directus lacks the USAGE privilege on schema sandbox_tac (created by S178 Fix 20 M3A). pg_dump fails in its lock-table phase, so the gzip output is empty (20 bytes), and pg-backup.sh and backup-to-gdrive.sh both break in the same way.

  • Task 3 ran a fresh backup: the governed script pg-backup.sh exited 1 with the same error, so the new file directus_2026-04-27_1353.sql.gz is also 20 bytes → Hard Stop #5 TRIGGERED.
  • NO retry, NO script patch, NO GRANT. Waiting for orchestrator + user to authorize the DDL fix.

§0. Execution Context

hostname: vmi3080463 (VPS Contabo)
whoami:   root
pwd:      /root
kernel:   Linux 6.8.0-90-generic Ubuntu

VPS context confirmed (via SSH contabo, key-based, BatchMode=yes).


§1. Task 1: Anomaly #2 Root Cause

1.1 Backup directory state

ls -lht /opt/incomex/backups/pg/ (head):
-rw------- 1 root root  20 Apr 27 02:00  directus_2026-04-27_0000.sql.gz   ← BROKEN
-rw-r--r-- 1 root root 11K Apr 27 02:00  backup.log
-rw------- 1 root root 42M Apr 26 02:00  directus_2026-04-26_0000.sql.gz
-rw------- 1 root root 42M Apr 25 02:00  directus_2026-04-25_0000.sql.gz
... (>=40M baseline for all earlier days)

1.2 gzip / file metadata

  • gzip -t directus_2026-04-27_0000.sql.gz → exit 0 (technically valid empty stream)
  • file ... → gzip compressed data, max compression, from Unix, original size modulo 2^32 0 (original size = 0 bytes, so gzip header + EOF only)
  • zcat ... | head → empty (no PG header, no SQL)

→ The file is a gzip header plus an empty payload (~20 bytes), not corruption: pg_dump wrote 0 bytes into the pipe because it failed before emitting any SQL.
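Because gzip -t passes on an empty stream, the emptiness has to be checked on the decompressed payload. A minimal sketch of such a check, where is_empty_gzip is a hypothetical helper name (not part of any script on the VPS):

```bash
#!/usr/bin/env bash
# Hypothetical helper: returns 0 when FILE is a valid gzip stream whose
# decompressed payload is empty, 1 when it has content, 2 when it is not gzip.
is_empty_gzip() {
  local file="$1" bytes
  gzip -t "$file" 2>/dev/null || return 2            # not a valid gzip stream
  bytes=$(gzip -dc "$file" | wc -c | tr -d ' ')      # decompressed size in bytes
  [ "$bytes" -eq 0 ]
}
```

Example use: is_empty_gzip /opt/incomex/backups/pg/directus_2026-04-27_0000.sql.gz && echo "valid gzip, empty payload".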

1.3 backup.log evidence (masked)

[2026-04-25T00:00:02Z] OK size=42M (43311450 bytes)
[2026-04-26T00:00:01Z] OK size=42M (43763755 bytes)
[2026-04-27T00:00:01Z] START pg-backup -> .../directus_2026-04-27_0000.sql.gz
pg_dump: error: query failed: ERROR:  permission denied for schema sandbox_tac
pg_dump: detail: Query was: LOCK TABLE public.agent_views, ..., 
        sandbox_tac.section_type_vocab, sandbox_tac.publication_type_vocab,
        sandbox_tac.logical_unit, sandbox_tac.unit_version,
        sandbox_tac.publication, sandbox_tac.publication_member,
        sandbox_tac.change_set, sandbox_tac.change_set_member
        IN ACCESS SHARE MODE

1.4 Root cause confirmation

Read-only diagnostic (no mutation):

docker exec postgres psql -U directus -d directus -tAc \
  "SELECT has_schema_privilege(current_user, 'sandbox_tac', 'USAGE');"
→ f
SELECT current_user → directus

Role directus does NOT have USAGE on schema sandbox_tac. When pg_dump enumerates tables (database-wide) and issues a LOCK TABLE that includes the 8 sandbox_tac.* tables, it fails at that point.

History (from memory): the sandbox_tac schema was created in S178 Fix 20 M3B (2026-04-19) for 5 PG-only governance tables plus several auxiliary tables. Permissions were never opened for the dump role.

Anomaly #2 = a permission gap inherited from S178 Fix 20. It is NOT a bug in pg-backup.sh.
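The single-schema check above generalizes to every schema the dump role cannot enter. A minimal read-only sketch, assuming the role name is a plain identifier; missing_usage_sql is a hypothetical helper that only builds the query and executes nothing:

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints a read-only query listing every non-system
# schema on which ROLE lacks USAGE. It builds SQL only; it runs nothing.
missing_usage_sql() {
  local role="$1"
  # Accept plain identifiers only, so the role name is safe to inline.
  if ! [[ "$role" =~ ^[A-Za-z_][A-Za-z0-9_]*$ ]]; then
    echo "invalid role name: $role" >&2
    return 1
  fi
  cat <<SQL
SELECT nspname
FROM pg_namespace
WHERE nspname !~ '^pg_'
  AND nspname <> 'information_schema'
  AND NOT has_schema_privilege('${role}', oid, 'USAGE')
ORDER BY nspname;
SQL
}
```

One possible invocation, reusing the container and credentials from the diagnostic above: missing_usage_sql directus | docker exec -i postgres psql -U directus -d directus -tA. Given the finding in this section, the output would be expected to include sandbox_tac.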


§2. Việc 2 — Backup Script Inspection

2.1 Metadata

path: /opt/incomex/scripts/pg-backup.sh
size: 2157 bytes, mtime 2026-04-08 11:23
owner: root:root, mode 755 (NOT world-writable ✓)

2.2 Selective grep (business logic, masked)

4:  # S174-FIX-01: Replaces retired mysql-backup.sh
5:  # Runs via cron, keeps 7 days of backups
6:  # Heartbeat → Uptime Kuma push monitor (pg-backup-local)
11: CONTAINER="postgres"
12: DB_USER="directus"
13: DB_NAME="directus"
14: BACKUP_DIR="/opt/incomex/backups/pg"
16: KUMA_PUSH_URL="http://localhost:3001/api/push/***"     ← TOKEN MASKED
22: echo "[$(date -u ...)] START pg-backup -> ${BACKUP_FILE}"
31: docker exec "$CONTAINER" \
32:   pg_dump -U "$DB_USER" -d "$DB_NAME" --no-owner --no-acl \
33:   | gzip -9 > "$BACKUP_FILE"
38: echo "[...] ERROR: backup file too small (${FILE_SIZE} bytes)" >&2
43: # Verify gzip integrity
45: echo "[...] ERROR: gzip integrity check failed" >&2
52: # Cleanup old backups
58: curl -fsS "${KUMA_PUSH_URL}?status=up&msg=...&ping=${FILE_SIZE}" >&2

Observations:

  • The pipeline pg_dump | gzip > FILE never checks ${PIPESTATUS[0]} for the pg_dump exit code; gzip on empty input still exits 0, so the script catches the failure only through the size check (line 38).
  • The heartbeat (line 58) fires only when every check passes, so the 2026-04-27 00:00 UTC cron run sent no heartbeat (Kuma will alert "down", but that is a separate detection layer and out of scope here).
  • The script's business logic is CORRECT as it stands; the fix belongs in the DB permissions, not in the script.
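The masking effect in the first observation can be reproduced without a database. A minimal sketch, assuming bash; pipeline_status is a hypothetical helper and the producer command stands in for pg_dump:

```bash
#!/usr/bin/env bash
# Returns the exit status of "producer | gzip", with or without pipefail.
# The producer is a stand-in for pg_dump; "false" simulates its failure.
pipeline_status() {
  local mode="$1"; shift
  (
    if [ "$mode" = "pipefail" ]; then set -o pipefail; fi
    "$@" | gzip -9 > /dev/null
  )
}
```

pipeline_status default false returns 0 because only the status of gzip is reported, while pipeline_status pipefail false returns non-zero. This is why, without pipefail or a PIPESTATUS check, the size check is the only guard that can notice a failed dump.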

§3. Task 3: Fresh Backup Run (Governed)

3.1 Preconditions (P1–P6)

# | Check | Result | Status
P1 | Cron registration | 0 2 * * * /opt/incomex/scripts/pg-backup.sh ... | PASS
P2 | NOT world-writable | 755 root:root | PASS
P3 | Disk space | 55 GB free / 96 GB | PASS
P4 | No concurrent job | pgrep empty | PASS
P5 | Postgres container | Running=true Status=running | PASS
P6 | PG governed path | Lines 11+31: CONTAINER="postgres", docker exec "$CONTAINER" pg_dump ... (Docker-local via variable, not literal) | PASS
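Precondition P2 can be checked mechanically. A minimal sketch, assuming a find that supports -perm with a bit mask (GNU and BSD find both do); is_world_writable is a hypothetical helper, not part of pg-backup.sh:

```bash
#!/usr/bin/env bash
# Hypothetical helper: returns 0 when PATH has the world-writable bit set.
is_world_writable() {
  # -prune keeps find from descending; -perm -0002 matches the "other write" bit.
  [ -n "$(find "$1" -prune -perm -0002 -print 2>/dev/null)" ]
}
```

Example use: is_world_writable /opt/incomex/scripts/pg-backup.sh && echo "P2 FAIL". For the 755 mode recorded above, the helper would return non-zero.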

3.2 Pre-run diagnostic (read-only, no bypass)

docker exec postgres psql -U directus -d directus -tAc \
  "SELECT has_schema_privilege(current_user, 'sandbox_tac', 'USAGE');"
→ f

The diagnostic indicates a near-certain failure. Per the dispatch I still ran the governed script (preconditions PASS) without bypassing it; a failure means Hard Stop #5.

3.3 Run result

$ bash /opt/incomex/scripts/pg-backup.sh >> /opt/incomex/backups/pg/backup.log 2>&1
exit_code=1   duration=0s

backup.log tail (masked):

[2026-04-27T13:53:48Z] START pg-backup -> .../directus_2026-04-27_1353.sql.gz
pg_dump: error: query failed: ERROR:  permission denied for schema sandbox_tac
pg_dump: detail: Query was: LOCK TABLE ... sandbox_tac.section_type_vocab ...
                                                                IN ACCESS SHARE MODE

Side effects observed:

  • New file directus_2026-04-27_1353.sql.gz (20 bytes): the script uses an _HHMM suffix, so _0000 was NOT overwritten
  • New log entry (same error)
  • Heartbeat did not fire (the script exited non-zero before line 58)
  • DB data/schema: no mutation

§4. Task 4: Integrity Verification

# | Check | New file directus_2026-04-27_1353.sql.gz | Result
V4-1 | File exists | -rw------- 20 bytes 2026-04-27 15:53 | PRESENT
V4-2 | gzip -t | exit 0 (empty stream technically valid) | PASS (misleading)
V4-3 | PG dump header | (empty zcat output) | FAIL
V4-4 | CREATE/COPY count | 0 | FAIL
V4-5 | Size > 1 MB / baseline order | 20 bytes vs baseline ~42 MB | FAIL
V4-6 | Fresh mtime | 2026-04-27 15:53:48 (~2 min ago) | PASS
V4-7 | Baseline comparison | 22-26: 40–42 MB; 27_0000 + 27_1353: 20 B | FAIL

Verdict: integrity FAIL (4/7 fail; gzip-only check misleading).

Hard Stop #5 TRIGGERED. STOP: no retry, no self-repair.
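Checks V4-2 through V4-5 can be bundled so that a passing gzip -t alone never yields a PASS. A minimal sketch, assuming plain-format pg_dump output (which begins with a "-- PostgreSQL database dump" comment); verify_dump and its default 1 MB threshold are illustrative, not an existing tool:

```bash
#!/usr/bin/env bash
# Hypothetical verifier mirroring V4-2..V4-5.
#   $1 = dump file (.sql.gz), $2 = minimum size in bytes (default 1 MB)
verify_dump() {
  local file="$1" min_bytes="${2:-1048576}" size stmts
  [ -f "$file" ] || { echo "FAIL missing file"; return 1; }
  size=$(wc -c < "$file" | tr -d ' ')
  [ "$size" -ge "$min_bytes" ] || { echo "FAIL size ${size} < ${min_bytes}"; return 1; }
  gzip -t "$file" 2>/dev/null || { echo "FAIL gzip integrity"; return 1; }
  # Plain-format dumps start with a "-- PostgreSQL database dump" comment block.
  gzip -dc "$file" | head -n 20 | grep -q 'PostgreSQL database dump' \
    || { echo "FAIL no pg_dump header"; return 1; }
  # Count CREATE/COPY statements; grep -c prints 0 and exits 1 when none match.
  stmts=$(gzip -dc "$file" | grep -cE '^(CREATE|COPY) ' || true)
  [ "$stmts" -gt 0 ] || { echo "FAIL no CREATE/COPY statements"; return 1; }
  echo "PASS size=${size} statements=${stmts}"
}
```

Against either 20-byte file from 2026-04-27 this sketch would stop at the size check; with the threshold lowered, it would stop at the header check instead.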


§5. Task 5: Anomaly #1 (tar lag)

5.1 Directory state

/opt/incomex/backups/vps-backup-20260426_200001/
└── postgresql-directus.sql.gz  (20 bytes, mtime 2026-04-26 20:00)

stat: Modify/Change/Birth are all 2026-04-26 20:00:02 → the directory stopped at step 1 and no further files were written.

5.2 Process check

ps -eo pid,etime,cmd | grep -iE "tar|backup-to-gdrive"
→ (no matching process)

5.3 backup-gdrive.log evidence (masked)

2026-04-25 20:01:53 INFO  : vps-backup-20260425_200002.tar.gz: Copied (new)
2026-04-25 20:01:57 BACKUP DONE: ... Archive: 105M PG: 43M | Qdrant: 104M
HEARTBEAT sent to Kuma (pg-backup-gdrive)
==========================================
BACKUP START: 2026-04-26 20:00:01 CEST
==========================================
[1/5] PostgreSQL dump...
pg_dump: error: query failed: ERROR:  permission denied for schema sandbox_tac
pg_dump: detail: Query was: LOCK TABLE ... sandbox_tac.* IN ACCESS SHARE MODE

5.4 Classification & root cause

  • Classification: (b) backup job failed before finalization. The process has exited, the staging directory was never tarred, and no Qdrant/archive files exist after step 1.
  • Root cause IDENTICAL to Anomaly #2: the directus role lacks USAGE on the sandbox_tac schema. Step [1/5] PostgreSQL dump of backup-to-gdrive.sh failed, so the script aborted before steps [2/5] Qdrant, [3/5] archive, [4/5] tar, and [5/5] rclone copy.
  • The 2026-04-27 cron run of backup-to-gdrive.sh has not executed yet (cron fires at 20:00 CEST = 18:00 UTC; the current time is 13:53 UTC, so it is pending). When it runs, it will fail in the same way.

The directory vps-backup-20260426_200001/ was NOT cleaned up (per Hard Exclusion #10).


§6. Recovery Path Recommendation

This is a proposal for the orchestrator + GPT R-next; it has NOT been executed.

6.1 Fix root cause (DDL, REQUIRES separate authorization)

Option A (preferred, minimal blast radius):

GRANT USAGE ON SCHEMA sandbox_tac TO directus;
GRANT SELECT ON ALL TABLES IN SCHEMA sandbox_tac TO directus;
ALTER DEFAULT PRIVILEGES IN SCHEMA sandbox_tac
  GRANT SELECT ON TABLES TO directus;

Option B (exclude sandbox_tac from the dump; a workaround, not recommended):

  • Edit pg-backup.sh to add --exclude-schema=sandbox_tac → the 8 governance tables would no longer be backed up.

Option A is recommended because: (i) the script already passes --no-owner --no-acl, so the dump does not carry permission state out; (ii) the backup must cover the whole database; (iii) sandbox_tac holds important governance data.

6.2 Sequence after the fix

  1. User authorize Option A (DDL).
  2. Apply the GRANT in a separate session (out of scope for the current dispatch).
  3. Re-run pg-backup.sh → expect a file of ~42 MB, integrity PASS.
  4. Re-run backup-to-gdrive.sh → expect the tar and the rclone upload to succeed.
  5. Only then can PF-07 v0.5 (lag window 30h) PASS.
  6. Only then can the G6 retry be authorized.

6.3 Anti-regression (out-of-scope dispatch)

  • Patch pg-backup.sh to check ${PIPESTATUS[0]} for pg_dump (catching the failure even though gzip succeeds on empty input).
  • Add an explicit Kuma "down" alert path when pg_dump fails (currently the heartbeat just goes silent, and Kuma triggers only after X minutes).
  • Standardize S178: every newly created schema must GRANT USAGE/SELECT to the dump role.
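The first two items could be combined roughly as follows. This is a sketch under stated assumptions, not the patch itself: guarded_dump and kuma_push are hypothetical names, KUMA_PUSH_URL is assumed to be set as in pg-backup.sh, and the push monitor is assumed to accept status=down.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper, not code from pg-backup.sh.
#   $1 = output file, $2 = notifier command, rest = dump command (pg_dump stand-in)
guarded_dump() {
  local out="$1" notify="$2"; shift 2
  local -a rc
  "$@" | gzip -9 > "$out"
  rc=("${PIPESTATUS[@]}")   # copy at once; the next command overwrites PIPESTATUS
  if [ "${rc[0]}" -ne 0 ] || [ "${rc[1]}" -ne 0 ]; then
    "$notify" down "dump exit=${rc[0]} gzip exit=${rc[1]}"
    return 1
  fi
  "$notify" up "OK"
}

# Hypothetical notifier; assumes KUMA_PUSH_URL is set and that the monitor
# accepts status=down. A failed notification is ignored here.
kuma_push() {
  curl -fsS -G --data-urlencode "status=$1" --data-urlencode "msg=$2" \
    "$KUMA_PUSH_URL" >/dev/null || true
}
```

A call might look like: guarded_dump "$BACKUP_FILE" kuma_push docker exec "$CONTAINER" pg_dump -U "$DB_USER" -d "$DB_NAME" --no-owner --no-acl. The sketch leaves the undersized output file in place on failure, so it remains available as evidence.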

§7. Compliance Confirm

  • ✅ VPS context (vmi3080463), NOT Mac local
  • ✅ Tasks 1, 2, 4, 5 read-only; Task 3 produced only the governed script's side effects (one 20-byte file + one log entry, 0 heartbeats)
  • ✅ no DDL · no DML · no SCHEMA mutation · no DB data mutation
  • ✅ no script edit · no cron edit · no systemd edit · no rclone destination change
  • ✅ no cat/head/tail/less of the whole script (grep -nE only)
  • ✅ no rclone config show / cat rclone.conf
  • ✅ no backup payload download
  • ✅ no backup deletion (the broken 20-byte files are kept as evidence)
  • ✅ no anomaly #1 cleanup
  • ✅ no sudo interactive
  • ✅ no G6 retry / no PF-07 patching wrapper
  • ✅ no git commit/git push

Secret hygiene scan (pre-upload)

Patterns scanned in this report:

  • password= / PASSWORD= → 0 hit
  • token= / TOKEN= / Authorization: Bearer → 0 hit
  • Kuma URL containing a token → masked *** (1 occurrence, §2.2 line 16)
  • rclone client_secret / refresh_token → 0 hit (rclone.conf was never read)
  • postgres://user:pass@ → 0 hit (only the plain value DB_USER="directus", which is not a credential string)
  • API keys (sk-, ghp_, gh_pat_, JWT eyJ) → 0 hit
  • DB password literal → 0 hit (the script contains none; the dump runs through docker exec using peer auth)

Hygiene scan PASS, 0 leaks.
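A scan of this kind can be approximated with a single extended regex. A minimal sketch; scan_secrets and its pattern set are illustrative assumptions, not the scanner that produced the result above, and values masked with * are deliberately not counted as hits:

```bash
#!/usr/bin/env bash
# Hypothetical scanner: prints matching lines (with line numbers) and returns 1
# if any secret-like pattern is found in FILE; returns 0 when the file is clean.
scan_secrets() {
  local file="$1"
  local pattern='(password|token|client_secret|refresh_token)=[^*[:space:]]'
  pattern+='|Authorization: Bearer [^*[:space:]]'
  pattern+='|postgres(ql)?://[^:/[:space:]]+:[^@[:space:]]+@'
  pattern+='|sk-[A-Za-z0-9]{20,}|ghp_[A-Za-z0-9]{20,}|eyJ[A-Za-z0-9_-]{10,}\.'
  if grep -nEi -- "$pattern" "$file"; then
    return 1
  fi
  return 0
}
```

Example use before upload: scan_secrets report.md || echo "possible leak, review before upload", where report.md is a placeholder file name.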


§8. Hand-off

STOP HERE. Investigation + governed verification done. Both anomalies documented + root cause proven (single shared cause).

Awaiting orchestrator + GPT review:

  1. Authorize Option A GRANT (separate dispatch, DDL gate)
  2. After the fix → re-run pg-backup.sh + backup-to-gdrive.sh outside cron to shorten the restore window
  3. After backup integrity PASS → PF-07 v0.5 wrapper + G6 retry chain
  4. Clean up /opt/incomex/backups/pg/{directus_2026-04-27_0000,directus_2026-04-27_1353}.sql.gz + the directory /opt/incomex/backups/vps-backup-20260426_200001/ once a fresh successful backup exists (separate dispatch)

Report | 2026-04-27 | Claude Code VPS executor | medium effort | Hard Stop #5 (integrity FAIL) triggered per procedure | GPT R14+R15+R16 chain