KB-2278

P9-G6 Backup Incident Triage + Fix Plan — 2026-04-27

17 min read Revision 1
backupincidenttriagefix-planp9-g6sandbox_tacpg_dump

P9-G6 Backup Incident — Triage + Fix Plan

Date: 2026-04-27 Executor: Claude Code on Mac via SSH wrapper (ssh contabo …) → VPS Contabo (vmi3080463, root) Mode: Read-only investigation + 4-option fix proposal — NO mutation Authorize chain: GPT R17+R18+R19 PASS; user authorize PENDING


0. Hostname verify

hostname: vmi3080463 (VPS Contabo)
whoami:   root
pwd:      /root

VPS context confirmed. Mac local ruled out. All commands executed via ssh contabo "…" wrapper (read-only DB queries, file inspections, log reads).


1. Việc 1 — Schema sandbox_tac inspection

1.1 Owner

nspname owner
sandbox_tac workflow_admin

1.2 Privileges (has_schema_privilege)

role_name has_usage has_create
directus f f
workflow_admin t t
postgres t t (superuser)

→ Confirmed: role directus thiếu USAGE và CREATE trên sandbox_tac. Chính là root cause pg_dump LOCK TABLE fail.

1.3 Object inventory (catalog metadata)

relkind count
r (tables) 8
i (indexes) 17
sequences 0
functions 3
triggers 2

Tables: change_set, change_set_member, logical_unit, publication, publication_member, publication_type_vocab, section_type_vocab, unit_version.

1.4 Row counts (pg_stat_user_tables)

table n_live_tup
unit_version 76
logical_unit 76
publication_member 76
section_type_vocab 11
publication 3
publication_type_vocab 1
change_set 0
change_set_member 0

→ Schema CÓ data thật (≥243 live tuples). KHÔNG phải empty test schema.

1.5 External dependencies

pg_depend → only pg_toast.* references (PG-internal storage, không phải cross-schema FK).

→ KHÔNG có schema khác reference vào sandbox_tac. Standalone island.

1.6 Directus exposure

SELECT COUNT(*) FROM directus_collections
WHERE collection LIKE 'sandbox_tac%' OR collection IN
  ('change_set','change_set_member','logical_unit','publication',
   'publication_member','publication_type_vocab','section_type_vocab','unit_version');

Result: 0

Khớp với prior P7B PF-4 evidence (PASS = 0 sandbox_tac collections). Schema KHÔNG exposed qua Directus API/admin.

1.7 Classification

  • Owner: workflow_admin (P7/C2 pilot — legacy, không có active Directus binding)
  • Use status: Standalone, có data nhưng không bind Directus, không có cross-schema dep
  • Risk to drop: Low (nếu confirm với owner). Có data → cần export/archive trước khi DROP.

2. Việc 2 — Backup history audit

2.1 Local backup dir: /opt/incomex/backups/pg/

Timestamp (UTC+2) Size gzip File
2026-04-19 02:00 39 MB VALID directus_2026-04-19_0000.sql.gz
2026-04-20 02:00 40 MB VALID directus_2026-04-20_0000.sql.gz
2026-04-21 02:00 40 MB VALID directus_2026-04-21_0000.sql.gz
2026-04-22 02:00 40 MB VALID directus_2026-04-22_0000.sql.gz
2026-04-23 02:01 41 MB VALID directus_2026-04-23_0000.sql.gz
2026-04-24 02:00 41 MB VALID directus_2026-04-24_0000.sql.gz
2026-04-25 02:00 42 MB VALID directus_2026-04-25_0000.sql.gz
2026-04-26 02:00 42 MB VALID directus_2026-04-26_0000.sql.gz ← last good
2026-04-27 02:00 20 B VALID* directus_2026-04-27_0000.sql.gz ← first broken
2026-04-27 13:53 20 B VALID* directus_2026-04-27_1353.sql.gz ← manual retry

(*) gzip header on empty stream = 20 bytes, gzip -t reports valid (passes integrity), nhưng decompressed payload = 0 bytes.

2.2 Window broken

  • Last good: 2026-04-26 00:00:51 UTC, 43,763,755 bytes
  • First broken: 2026-04-27 00:00:06 UTC, 20 bytes
  • Latest broken: 2026-04-27 13:53:48 UTC (manual retry), 20 bytes
  • Time without good backup: ≈37–41h (dispatch claim "≥41h" consistent with orchestrator rounding forward)
  • Baseline size growth: Linear ~400–500 KB/ngày (compression healthy until break)

2.3 Remote audit

Backup-to-gdrive.sh archive includes PG dump as one of 5 components. From log: 2026-04-25 20:00 archive uploaded successfully (105 MB, includes ~43 MB PG). 2026-04-26 20:00 run also failed at step 1 (same permission denied for schema sandbox_tac error). Remote rclone listing skipped — local primary evidence sufficient.

2.4 Size growth & monitor cadence

  • pg-backup.sh: nightly 02:00 local (Hanoi)
  • backup-to-gdrive.sh: nightly 20:00 UTC = 03:00 Hanoi
  • Both currently broken since 2026-04-26 20:00 UTC and 2026-04-27 00:00 UTC respectively.

3. Việc 3 — pg-backup.sh + backup-to-gdrive.sh behavior

3.1 File metadata

File Owner Mode Size
/opt/incomex/scripts/pg-backup.sh root:root 755 2,157 B
/opt/incomex/scripts/backup-to-gdrive.sh root:root 755 5,646 B

KHÔNG world-writable (Hard stop #6 PASS).

3.2 Shebang + flags

Cả hai script đều có:

#!/usr/bin/env bash    (pg-backup.sh — line 1)
#!/bin/bash            (backup-to-gdrive.sh — line 1)
set -euo pipefail      (pg-backup.sh:8, backup-to-gdrive.sh:9)
umask 077              (pg-backup.sh:9)

→ Shell hardening đã có từ S174-FIX-01 / S174-FIX-01B.

3.3 pg-backup.sh validation chain

Sequence (lines 22–60):

  1. START log
  2. Check container postgres running
  3. Dump line: docker exec postgres pg_dump … | gzip -9 > "$BACKUP_FILE" (no || true, no stderr suppression — comment "AP-SILENT-SWALLOW")
  4. Size check: if [[ FILE_SIZE -lt 1024 ]]; then rm -f && exit 1
  5. gzip integrity: if ! gunzip -t … then exit 1
  6. Retention (-mtime +7)
  7. Kuma heartbeat (only after full success)

3.4 Vì sao 20B file vẫn tồn tại?

Critical analysis — Dispatch claim "Kuma push success ngay cả khi fail" KHÔNG khớp evidence. Thực tế:

  • pg_dump exits non-zero (permission denied)
  • gzip -9 succeeds with empty stdin → ghi 20-byte gzip header
  • Pipe overall → non-zero (pipefail)
  • set -e triggers IMMEDIATELY ở dòng pipe → script EXIT
  • Validation block (size + gzip-t) KHÔNG bao giờ chạy — vì set -e đã exit trước
  • Hệ quả: rm -f "$BACKUP_FILE" (nằm INSIDE size check) KHÔNG fire → 20B file remains
  • Kuma heartbeat (cuối script) KHÔNG fire → push monitor pg-backup-local SHOULD đi DOWN sau timeout window

Conclusion: Script KHÔNG silent về Kuma; nó silent về on-disk artifact (20B file giả-valid gzip). Kuma có alert hay không phụ thuộc monitor timeout config + ai đang đọc dashboard. ≥37h gap → rất khả năng Kuma đã DOWN nhưng nobody noticed.

3.5 backup-to-gdrive.sh validation chain

5 steps. Step 1 (pg_dump | gzip > postgresql-directus.sql.gz) fails identically. set -e kills script before steps 2–5. Heartbeat at line 139 KHÔNG fire. Same silent-artifact pattern.

3.6 Validation gaps identified

Gap Current state Impact
Partial-file cleanup rm -f only inside size-check block (unreachable when set -e triggers earlier) 20B file stays, looks "VALID" to gzip -t
pg_dump exit isolation pg_dump | gzip — no separate pg_dump exit check Can't distinguish pg_dump fail vs gzip fail
Active failure alert Kuma DOWN-by-absence only; no Telegram/email push on detected fail Discovery delayed until someone reads dashboard
Pre-flight schema USAGE check None Adding new schema without GRANT silently breaks backup
Trap-on-error No trap 'rm -f $BACKUP_FILE' ERR Partial artifacts persist on any error path

3.7 Role discovery

Privilege query covered directus, workflow_admin, postgres. List of all non-pg_* roles NOT pulled (not needed — primary failure role is directus). If governance later needs full role list → run separate query, không tự đoán.


4. Việc 4 — Fix options

Option D (Script hardening) là REQUIRED COMPANION với A/B/C, không phải alternative. Lý do: nếu chỉ GRANT mà không hardening, lần sau (auth fail / disk full / container down / new schema added without GRANT) vẫn để lại artifact giả-valid 20B + delayed alert.

4.1 Option A — GRANT path

Áp dụng khi: sandbox_tac cần nằm trong backup dump (governance/audit/recovery).

Pre-check:

-- Schema còn tồn tại?
SELECT 1 FROM pg_namespace WHERE nspname='sandbox_tac';
-- Privileges hiện tại
SELECT has_schema_privilege('directus','sandbox_tac','USAGE');

Diff đề xuất (proposal — KHÔNG apply):

GRANT USAGE ON SCHEMA sandbox_tac TO directus;
GRANT SELECT ON ALL TABLES IN SCHEMA sandbox_tac TO directus;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA sandbox_tac TO directus;
ALTER DEFAULT PRIVILEGES FOR ROLE workflow_admin IN SCHEMA sandbox_tac
  GRANT SELECT ON TABLES TO directus;
ALTER DEFAULT PRIVILEGES FOR ROLE workflow_admin IN SCHEMA sandbox_tac
  GRANT SELECT ON SEQUENCES TO directus;

Rollback: REVOKE …; symmetric.

Post-verify:

SELECT has_schema_privilege('directus','sandbox_tac','USAGE'); -- expect t
-- + smoke pg_dump test
docker exec postgres pg_dump -U directus -d directus -n sandbox_tac --schema-only | head -20

Pros: Backup hoàn chỉnh, không thay đổi data ownership/topology. Cons: Mở rộng surface đọc cho directus role; default privileges cần ALTER cho cả role workflow_admin (tránh drift khi tạo bảng mới).

+ Option D companion bắt buộc.

4.2 Option B — Backup-role path

Áp dụng khi: Kiến trúc đúng phải dùng dedicated backup role (separation of concerns), directus chỉ là application role.

Diff đề xuất:

CREATE ROLE pg_backup_role WITH LOGIN PASSWORD '<env-managed>';
GRANT pg_read_all_data TO pg_backup_role;  -- PG14+ predefined role
-- Hoặc explicit:
-- GRANT USAGE ON SCHEMA <each schema> TO pg_backup_role;
-- GRANT SELECT ON ALL TABLES IN SCHEMA <each schema> TO pg_backup_role;
  • Update pg-backup.sh line 12: DB_USER="pg_backup_role" + script gets password từ secret store (vault/env).

Rollback: DROP ROLE pg_backup_role; + revert script.

Post-verify: identical to Option A but role=pg_backup_role.

Pros: Architecturally cleaner, principle-of-least-privilege cho directus, future-proof khi thêm schema (chỉ cần pg_read_all_data). Cons: Thêm credential cần manage (env/secret rotation); script update + cron audit; introduce một artifact mới trong governance registry.

+ Option D companion bắt buộc.

4.3 Option C — Exclude/drop path

Áp dụng khi: sandbox_tac xác nhận obsolete pilot (P7/C2 abandoned, không có owner/usecase active).

Sub-option C1 (exclude):

# pg-backup.sh + backup-to-gdrive.sh
pg_dump -U directus -d directus --no-owner --no-acl \
  --exclude-schema=sandbox_tac \
  | gzip -9 > "$BACKUP_FILE"
  • Pros: Không touch data, reversible (xóa flag là quay lại).
  • Cons: Schema vẫn tồn tại, vẫn có data orphan, vẫn không có disaster recovery cho sandbox_tac.

Sub-option C2 (drop):

-- Pre-check: dependency confirm = 0 (xem 1.5, 1.6 — đã PASS)
-- Pre-archive: pg_dump -n sandbox_tac > sandbox_tac_archive.sql.gz (off-line copy)
DROP SCHEMA sandbox_tac CASCADE;
  • Cần authorize gate riêng (irreversible). Cần pre-archive off-line copy.
  • Pros: Loại bỏ root cause vĩnh viễn, simplify backup graph.
  • Cons: Irreversible. Cần verify owner workflow_admin (P7/C2 stakeholder) confirm abandonment.

+ Option D companion bắt buộc.

4.4 Option D — Script hardening (REQUIRED COMPANION)

Diff đề xuất cho pg-backup.sh (between line 30 and pg_dump):

# Trap to cleanup partial artifacts on any error path
cleanup_on_error() {
  local rc=$?
  if [[ -f "$BACKUP_FILE" ]]; then
    local sz; sz=$(stat -c%s "$BACKUP_FILE" 2>/dev/null || echo 0)
    if [[ "$sz" -lt 1024 ]]; then
      rm -f "$BACKUP_FILE"
      echo "[ERROR] partial backup removed (size=$sz)" >&2
    fi
  fi
  # Active failure alert (Telegram or Kuma DOWN push)
  curl -fsS "${KUMA_PUSH_URL}?status=down&msg=backup_failed_rc${rc}" >&2 || true
  exit $rc
}
trap cleanup_on_error ERR

# Isolate pg_dump exit
docker exec "$CONTAINER" \
  pg_dump -U "$DB_USER" -d "$DB_NAME" --no-owner --no-acl \
  > "${BACKUP_FILE}.raw"
gzip -9 < "${BACKUP_FILE}.raw" > "$BACKUP_FILE"
rm -f "${BACKUP_FILE}.raw"

Optional pre-flight (paranoid):

# Verify directus has USAGE on every schema before dump
docker exec "$CONTAINER" psql -U "$DB_USER" -d "$DB_NAME" -tAc "
  SELECT n.nspname FROM pg_namespace n
  WHERE n.nspname NOT IN ('pg_catalog','information_schema','pg_toast')
    AND NOT n.nspname LIKE 'pg_temp%'
    AND NOT has_schema_privilege('$DB_USER', n.nspname, 'USAGE');
" | grep -q . && { echo "[ERROR] missing USAGE on schemas above" >&2; exit 2; }

Pros: Cứu mọi failure mode tương lai (auth fail, disk full, container down, new schema, etc.). Chuyển từ silent-artifact + DOWN-by-absence → clean-fail + active-DOWN-push. Cons: Script phức tạp hơn ~25 dòng; cần governance review trước apply.


5. Việc 5 — Execution gate

Step Option A Option B Option C1 Option C2 Option D
Pre-check has_schema_privilege query role check + pg_read_all_data avail --exclude-schema smoke test dep+exposure recheck (1.5/1.6) + pre-archive shellcheck pg-backup.sh patched
Diff GRANT block (4.1) CREATE ROLE + script edit (4.2) script flag edit DROP SCHEMA CASCADE trap+isolation (4.4)
Rollback REVOKE block DROP ROLE + revert revert flag pg_restore từ pre-archive git checkout script prev
Action log dot_action_log (DOT-PG-GRANT-DIRECTUS) dot_action_log (DOT-PG-NEW-BACKUP-ROLE) dot_action_log (DOT-PG-EXCLUDE-SCHEMA) dot_action_log (DOT-PG-DROP-SCHEMA) — high-risk dot_action_log (DOT-PG-BACKUP-HARDEN)
Post-verify smoke pg_dump smoke pg_dump as new role smoke pg_dump excludes schema \dn shows no sandbox_tac + smoke pg_dump dry-run pg_dump triggers trap
Luật Đ32 (gate enforce), Đ33 (gate), Đ35 (DOT 100%), Đ39 (canonical) + Đ37 §5 (registry) cho role mới Đ32, Đ33, Đ35 + Hiến pháp irreversible gate, pre-archive mandate Đ32, Đ33, Đ35, Đ39
DOT execution DOT mới DOT-PG-GRANT-SCHEMA (đề xuất tạo + enact) hoặc reuse APR-runtime v2.0.1 (patch_ops_code slot) DOT mới DOT-PG-NEW-ROLE DOT mới DOT-PG-BACKUP-EXCLUDE-SCHEMA DOT mới DOT-PG-DROP-SCHEMA (irreversible — cần dual-control) DOT mới DOT-BACKUP-SCRIPT-HARDEN (diff via APR runtime)

Execution path (chung): governed AI/DOT gate qua APR runtime v2.0.1 (S178 Fix 20 M3A). KHÔNG manual psql. KHÔNG manual sed/edit script.

Recommendation (đề xuất, chưa quyết):

  1. Khẩn (≤24h): Option A (GRANT) + Option D (hardening) song song qua DOT pipeline → restore backup ngay.
  2. Medium-term (≤1 tuần): Đánh giá sandbox_tac ownership với P7/C2 stakeholder. Nếu confirm abandoned → C2 (DROP). Nếu vẫn dùng → giữ A long-term hoặc upgrade lên B (backup role tách biệt).
  3. D bất kể path nào — phải apply đầu tiên hoặc đồng thời.

Lý do recommend A+D thay vì B+D ngay: A reversible, không thay đổi credential surface, đủ giải bug ngay. B là kiến trúc đẹp hơn nhưng chậm hơn (cần secret management, governance review role mới). D bắt buộc.


6. Confirmations

  • No DDL executed
  • No DML executed
  • No GRANT/REVOKE executed
  • No script edit
  • No cron edit
  • No git commit / git push
  • No backup payload download (only size + gzip -t metadata)
  • No cat/head/less of full scripts (only awk/sed/grep -nE)
  • All commands via ssh contabo "…" from Mac → VPS
  • Secret hygiene scan PASS (Kuma URL = http://localhost:3001/api/push/pg-backup-local — no token; QDRANT_KEY only referenced as variable name, not value; no DB password literal; no Telegram URL)
  • Hard stop checks: VPS hostname OK, schema exists, no Directus binding (PASS C-eligible), no secret leak, backup dir found, scripts not world-writable

7. Post-execution

  1. STOP after upload.
  2. KHÔNG tự apply option nào.
  3. KHÔNG retry G6 backup.
  4. Chờ GPT R18 review + User authorize.

P9-G6 Backup Incident Triage — 2026-04-27 — Claude Code via ssh contabo wrapper — read-only — medium effort — 100% DOT compliance — root cause: directus role thiếu USAGE on sandbox_tac (owner=workflow_admin); 20B artifact persists vì set -e exits trước cleanup block; recommendation: Option A (GRANT) + Option D (hardening) qua DOT pipeline.

Back to Knowledge Hub knowledge/dev/laws/dieu38-trien-khai/reports/p9-g6-backup-incident-triage-fix-plan-2026-04-27.md