P9-G6 Backup Incident Triage + Fix Plan — 2026-04-27
P9-G6 Backup Incident — Triage + Fix Plan
Date: 2026-04-27
Executor: Claude Code on Mac via SSH wrapper (ssh contabo …) → VPS Contabo (vmi3080463, root)
Mode: Read-only investigation + 4-option fix proposal — NO mutation
Authorize chain: GPT R17+R18+R19 PASS; user authorize PENDING
0. Hostname verify
hostname: vmi3080463 (VPS Contabo)
whoami: root
pwd: /root
VPS context confirmed. Mac local ruled out. All commands executed via ssh contabo "…" wrapper (read-only DB queries, file inspections, log reads).
1. Việc 1 — Schema sandbox_tac inspection
1.1 Owner
| nspname | owner |
|---|---|
| sandbox_tac | workflow_admin |
1.2 Privileges (has_schema_privilege)
| role_name | has_usage | has_create |
|---|---|---|
directus |
f | f |
workflow_admin |
t | t |
postgres |
t | t (superuser) |
→ Confirmed: role directus thiếu USAGE và CREATE trên sandbox_tac. Chính là root cause pg_dump LOCK TABLE fail.
1.3 Object inventory (catalog metadata)
| relkind | count |
|---|---|
r (tables) |
8 |
i (indexes) |
17 |
| sequences | 0 |
| functions | 3 |
| triggers | 2 |
Tables: change_set, change_set_member, logical_unit, publication, publication_member, publication_type_vocab, section_type_vocab, unit_version.
1.4 Row counts (pg_stat_user_tables)
| table | n_live_tup |
|---|---|
unit_version |
76 |
logical_unit |
76 |
publication_member |
76 |
section_type_vocab |
11 |
publication |
3 |
publication_type_vocab |
1 |
change_set |
0 |
change_set_member |
0 |
→ Schema CÓ data thật (≥243 live tuples). KHÔNG phải empty test schema.
1.5 External dependencies
pg_depend → only pg_toast.* references (PG-internal storage, không phải cross-schema FK).
→ KHÔNG có schema khác reference vào sandbox_tac. Standalone island.
1.6 Directus exposure
SELECT COUNT(*) FROM directus_collections
WHERE collection LIKE 'sandbox_tac%' OR collection IN
('change_set','change_set_member','logical_unit','publication',
'publication_member','publication_type_vocab','section_type_vocab','unit_version');
Result: 0
Khớp với prior P7B PF-4 evidence (PASS = 0 sandbox_tac collections). Schema KHÔNG exposed qua Directus API/admin.
1.7 Classification
- Owner:
workflow_admin(P7/C2 pilot — legacy, không có active Directus binding) - Use status: Standalone, có data nhưng không bind Directus, không có cross-schema dep
- Risk to drop: Low (nếu confirm với owner). Có data → cần export/archive trước khi DROP.
2. Việc 2 — Backup history audit
2.1 Local backup dir: /opt/incomex/backups/pg/
| Timestamp (UTC+2) | Size | gzip | File |
|---|---|---|---|
| 2026-04-19 02:00 | 39 MB | VALID | directus_2026-04-19_0000.sql.gz |
| 2026-04-20 02:00 | 40 MB | VALID | directus_2026-04-20_0000.sql.gz |
| 2026-04-21 02:00 | 40 MB | VALID | directus_2026-04-21_0000.sql.gz |
| 2026-04-22 02:00 | 40 MB | VALID | directus_2026-04-22_0000.sql.gz |
| 2026-04-23 02:01 | 41 MB | VALID | directus_2026-04-23_0000.sql.gz |
| 2026-04-24 02:00 | 41 MB | VALID | directus_2026-04-24_0000.sql.gz |
| 2026-04-25 02:00 | 42 MB | VALID | directus_2026-04-25_0000.sql.gz |
| 2026-04-26 02:00 | 42 MB | VALID | directus_2026-04-26_0000.sql.gz ← last good |
| 2026-04-27 02:00 | 20 B | VALID* | directus_2026-04-27_0000.sql.gz ← first broken |
| 2026-04-27 13:53 | 20 B | VALID* | directus_2026-04-27_1353.sql.gz ← manual retry |
(*) gzip header on empty stream = 20 bytes, gzip -t reports valid (passes integrity), nhưng decompressed payload = 0 bytes.
2.2 Window broken
- Last good: 2026-04-26 00:00:51 UTC, 43,763,755 bytes
- First broken: 2026-04-27 00:00:06 UTC, 20 bytes
- Latest broken: 2026-04-27 13:53:48 UTC (manual retry), 20 bytes
- Time without good backup: ≈37–41h (dispatch claim "≥41h" consistent with orchestrator rounding forward)
- Baseline size growth: Linear ~400–500 KB/ngày (compression healthy until break)
2.3 Remote audit
Backup-to-gdrive.sh archive includes PG dump as one of 5 components. From log: 2026-04-25 20:00 archive uploaded successfully (105 MB, includes ~43 MB PG). 2026-04-26 20:00 run also failed at step 1 (same permission denied for schema sandbox_tac error). Remote rclone listing skipped — local primary evidence sufficient.
2.4 Size growth & monitor cadence
- pg-backup.sh: nightly 02:00 local (Hanoi)
- backup-to-gdrive.sh: nightly 20:00 UTC = 03:00 Hanoi
- Both currently broken since 2026-04-26 20:00 UTC and 2026-04-27 00:00 UTC respectively.
3. Việc 3 — pg-backup.sh + backup-to-gdrive.sh behavior
3.1 File metadata
| File | Owner | Mode | Size |
|---|---|---|---|
/opt/incomex/scripts/pg-backup.sh |
root:root | 755 | 2,157 B |
/opt/incomex/scripts/backup-to-gdrive.sh |
root:root | 755 | 5,646 B |
KHÔNG world-writable (Hard stop #6 PASS).
3.2 Shebang + flags
Cả hai script đều có:
#!/usr/bin/env bash (pg-backup.sh — line 1)
#!/bin/bash (backup-to-gdrive.sh — line 1)
set -euo pipefail (pg-backup.sh:8, backup-to-gdrive.sh:9)
umask 077 (pg-backup.sh:9)
→ Shell hardening đã có từ S174-FIX-01 / S174-FIX-01B.
3.3 pg-backup.sh validation chain
Sequence (lines 22–60):
STARTlog- Check container
postgresrunning - Dump line:
docker exec postgres pg_dump … | gzip -9 > "$BACKUP_FILE"(no|| true, no stderr suppression — comment "AP-SILENT-SWALLOW") - Size check:
if [[ FILE_SIZE -lt 1024 ]]; then rm -f && exit 1 - gzip integrity:
if ! gunzip -t … then exit 1 - Retention (
-mtime +7) - Kuma heartbeat (only after full success)
3.4 Vì sao 20B file vẫn tồn tại?
Critical analysis — Dispatch claim "Kuma push success ngay cả khi fail" KHÔNG khớp evidence. Thực tế:
pg_dumpexits non-zero (permission denied)gzip -9succeeds with empty stdin → ghi 20-byte gzip header- Pipe overall → non-zero (pipefail)
set -etriggers IMMEDIATELY ở dòng pipe → script EXIT- Validation block (size + gzip-t) KHÔNG bao giờ chạy — vì
set -eđã exit trước - Hệ quả:
rm -f "$BACKUP_FILE"(nằm INSIDE size check) KHÔNG fire → 20B file remains - Kuma heartbeat (cuối script) KHÔNG fire → push monitor
pg-backup-localSHOULD đi DOWN sau timeout window
Conclusion: Script KHÔNG silent về Kuma; nó silent về on-disk artifact (20B file giả-valid gzip). Kuma có alert hay không phụ thuộc monitor timeout config + ai đang đọc dashboard. ≥37h gap → rất khả năng Kuma đã DOWN nhưng nobody noticed.
3.5 backup-to-gdrive.sh validation chain
5 steps. Step 1 (pg_dump | gzip > postgresql-directus.sql.gz) fails identically. set -e kills script before steps 2–5. Heartbeat at line 139 KHÔNG fire. Same silent-artifact pattern.
3.6 Validation gaps identified
| Gap | Current state | Impact |
|---|---|---|
| Partial-file cleanup | rm -f only inside size-check block (unreachable when set -e triggers earlier) |
20B file stays, looks "VALID" to gzip -t |
| pg_dump exit isolation | pg_dump | gzip — no separate pg_dump exit check |
Can't distinguish pg_dump fail vs gzip fail |
| Active failure alert | Kuma DOWN-by-absence only; no Telegram/email push on detected fail | Discovery delayed until someone reads dashboard |
| Pre-flight schema USAGE check | None | Adding new schema without GRANT silently breaks backup |
| Trap-on-error | No trap 'rm -f $BACKUP_FILE' ERR |
Partial artifacts persist on any error path |
3.7 Role discovery
Privilege query covered directus, workflow_admin, postgres. List of all non-pg_* roles NOT pulled (not needed — primary failure role is directus). If governance later needs full role list → run separate query, không tự đoán.
4. Việc 4 — Fix options
Option D (Script hardening) là REQUIRED COMPANION với A/B/C, không phải alternative. Lý do: nếu chỉ GRANT mà không hardening, lần sau (auth fail / disk full / container down / new schema added without GRANT) vẫn để lại artifact giả-valid 20B + delayed alert.
4.1 Option A — GRANT path
Áp dụng khi: sandbox_tac cần nằm trong backup dump (governance/audit/recovery).
Pre-check:
-- Schema còn tồn tại?
SELECT 1 FROM pg_namespace WHERE nspname='sandbox_tac';
-- Privileges hiện tại
SELECT has_schema_privilege('directus','sandbox_tac','USAGE');
Diff đề xuất (proposal — KHÔNG apply):
GRANT USAGE ON SCHEMA sandbox_tac TO directus;
GRANT SELECT ON ALL TABLES IN SCHEMA sandbox_tac TO directus;
GRANT SELECT ON ALL SEQUENCES IN SCHEMA sandbox_tac TO directus;
ALTER DEFAULT PRIVILEGES FOR ROLE workflow_admin IN SCHEMA sandbox_tac
GRANT SELECT ON TABLES TO directus;
ALTER DEFAULT PRIVILEGES FOR ROLE workflow_admin IN SCHEMA sandbox_tac
GRANT SELECT ON SEQUENCES TO directus;
Rollback: REVOKE …; symmetric.
Post-verify:
SELECT has_schema_privilege('directus','sandbox_tac','USAGE'); -- expect t
-- + smoke pg_dump test
docker exec postgres pg_dump -U directus -d directus -n sandbox_tac --schema-only | head -20
Pros: Backup hoàn chỉnh, không thay đổi data ownership/topology.
Cons: Mở rộng surface đọc cho directus role; default privileges cần ALTER cho cả role workflow_admin (tránh drift khi tạo bảng mới).
→ + Option D companion bắt buộc.
4.2 Option B — Backup-role path
Áp dụng khi: Kiến trúc đúng phải dùng dedicated backup role (separation of concerns), directus chỉ là application role.
Diff đề xuất:
CREATE ROLE pg_backup_role WITH LOGIN PASSWORD '<env-managed>';
GRANT pg_read_all_data TO pg_backup_role; -- PG14+ predefined role
-- Hoặc explicit:
-- GRANT USAGE ON SCHEMA <each schema> TO pg_backup_role;
-- GRANT SELECT ON ALL TABLES IN SCHEMA <each schema> TO pg_backup_role;
- Update
pg-backup.shline 12:DB_USER="pg_backup_role"+ script gets password từ secret store (vault/env).
Rollback: DROP ROLE pg_backup_role; + revert script.
Post-verify: identical to Option A but role=pg_backup_role.
Pros: Architecturally cleaner, principle-of-least-privilege cho directus, future-proof khi thêm schema (chỉ cần pg_read_all_data).
Cons: Thêm credential cần manage (env/secret rotation); script update + cron audit; introduce một artifact mới trong governance registry.
→ + Option D companion bắt buộc.
4.3 Option C — Exclude/drop path
Áp dụng khi: sandbox_tac xác nhận obsolete pilot (P7/C2 abandoned, không có owner/usecase active).
Sub-option C1 (exclude):
# pg-backup.sh + backup-to-gdrive.sh
pg_dump -U directus -d directus --no-owner --no-acl \
--exclude-schema=sandbox_tac \
| gzip -9 > "$BACKUP_FILE"
- Pros: Không touch data, reversible (xóa flag là quay lại).
- Cons: Schema vẫn tồn tại, vẫn có data orphan, vẫn không có disaster recovery cho
sandbox_tac.
Sub-option C2 (drop):
-- Pre-check: dependency confirm = 0 (xem 1.5, 1.6 — đã PASS)
-- Pre-archive: pg_dump -n sandbox_tac > sandbox_tac_archive.sql.gz (off-line copy)
DROP SCHEMA sandbox_tac CASCADE;
- Cần authorize gate riêng (irreversible). Cần pre-archive off-line copy.
- Pros: Loại bỏ root cause vĩnh viễn, simplify backup graph.
- Cons: Irreversible. Cần verify owner
workflow_admin(P7/C2 stakeholder) confirm abandonment.
→ + Option D companion bắt buộc.
4.4 Option D — Script hardening (REQUIRED COMPANION)
Diff đề xuất cho pg-backup.sh (between line 30 and pg_dump):
# Trap to cleanup partial artifacts on any error path
cleanup_on_error() {
local rc=$?
if [[ -f "$BACKUP_FILE" ]]; then
local sz; sz=$(stat -c%s "$BACKUP_FILE" 2>/dev/null || echo 0)
if [[ "$sz" -lt 1024 ]]; then
rm -f "$BACKUP_FILE"
echo "[ERROR] partial backup removed (size=$sz)" >&2
fi
fi
# Active failure alert (Telegram or Kuma DOWN push)
curl -fsS "${KUMA_PUSH_URL}?status=down&msg=backup_failed_rc${rc}" >&2 || true
exit $rc
}
trap cleanup_on_error ERR
# Isolate pg_dump exit
docker exec "$CONTAINER" \
pg_dump -U "$DB_USER" -d "$DB_NAME" --no-owner --no-acl \
> "${BACKUP_FILE}.raw"
gzip -9 < "${BACKUP_FILE}.raw" > "$BACKUP_FILE"
rm -f "${BACKUP_FILE}.raw"
Optional pre-flight (paranoid):
# Verify directus has USAGE on every schema before dump
docker exec "$CONTAINER" psql -U "$DB_USER" -d "$DB_NAME" -tAc "
SELECT n.nspname FROM pg_namespace n
WHERE n.nspname NOT IN ('pg_catalog','information_schema','pg_toast')
AND NOT n.nspname LIKE 'pg_temp%'
AND NOT has_schema_privilege('$DB_USER', n.nspname, 'USAGE');
" | grep -q . && { echo "[ERROR] missing USAGE on schemas above" >&2; exit 2; }
Pros: Cứu mọi failure mode tương lai (auth fail, disk full, container down, new schema, etc.). Chuyển từ silent-artifact + DOWN-by-absence → clean-fail + active-DOWN-push. Cons: Script phức tạp hơn ~25 dòng; cần governance review trước apply.
5. Việc 5 — Execution gate
| Step | Option A | Option B | Option C1 | Option C2 | Option D |
|---|---|---|---|---|---|
| Pre-check | has_schema_privilege query |
role check + pg_read_all_data avail | --exclude-schema smoke test |
dep+exposure recheck (1.5/1.6) + pre-archive | shellcheck pg-backup.sh patched |
| Diff | GRANT block (4.1) | CREATE ROLE + script edit (4.2) | script flag edit | DROP SCHEMA CASCADE | trap+isolation (4.4) |
| Rollback | REVOKE block | DROP ROLE + revert | revert flag | pg_restore từ pre-archive |
git checkout script prev |
| Action log | dot_action_log (DOT-PG-GRANT-DIRECTUS) |
dot_action_log (DOT-PG-NEW-BACKUP-ROLE) |
dot_action_log (DOT-PG-EXCLUDE-SCHEMA) |
dot_action_log (DOT-PG-DROP-SCHEMA) — high-risk |
dot_action_log (DOT-PG-BACKUP-HARDEN) |
| Post-verify | smoke pg_dump | smoke pg_dump as new role | smoke pg_dump excludes schema | \dn shows no sandbox_tac + smoke pg_dump |
dry-run pg_dump triggers trap |
| Luật | Đ32 (gate enforce), Đ33 (gate), Đ35 (DOT 100%), Đ39 (canonical) | + Đ37 §5 (registry) cho role mới | Đ32, Đ33, Đ35 | + Hiến pháp irreversible gate, pre-archive mandate | Đ32, Đ33, Đ35, Đ39 |
| DOT execution | DOT mới DOT-PG-GRANT-SCHEMA (đề xuất tạo + enact) hoặc reuse APR-runtime v2.0.1 (patch_ops_code slot) |
DOT mới DOT-PG-NEW-ROLE |
DOT mới DOT-PG-BACKUP-EXCLUDE-SCHEMA |
DOT mới DOT-PG-DROP-SCHEMA (irreversible — cần dual-control) |
DOT mới DOT-BACKUP-SCRIPT-HARDEN (diff via APR runtime) |
Execution path (chung): governed AI/DOT gate qua APR runtime v2.0.1 (S178 Fix 20 M3A). KHÔNG manual psql. KHÔNG manual sed/edit script.
Recommendation (đề xuất, chưa quyết):
- Khẩn (≤24h): Option A (GRANT) + Option D (hardening) song song qua DOT pipeline → restore backup ngay.
- Medium-term (≤1 tuần): Đánh giá
sandbox_tacownership với P7/C2 stakeholder. Nếu confirm abandoned → C2 (DROP). Nếu vẫn dùng → giữ A long-term hoặc upgrade lên B (backup role tách biệt). - D bất kể path nào — phải apply đầu tiên hoặc đồng thời.
Lý do recommend A+D thay vì B+D ngay: A reversible, không thay đổi credential surface, đủ giải bug ngay. B là kiến trúc đẹp hơn nhưng chậm hơn (cần secret management, governance review role mới). D bắt buộc.
6. Confirmations
- No DDL executed
- No DML executed
- No GRANT/REVOKE executed
- No script edit
- No cron edit
- No
git commit/git push - No backup payload download (only size + gzip -t metadata)
- No
cat/head/lessof full scripts (onlyawk/sed/grep -nE) - All commands via
ssh contabo "…"from Mac → VPS - Secret hygiene scan PASS (Kuma URL =
http://localhost:3001/api/push/pg-backup-local— no token; QDRANT_KEY only referenced as variable name, not value; no DB password literal; no Telegram URL) - Hard stop checks: VPS hostname OK, schema exists, no Directus binding (PASS C-eligible), no secret leak, backup dir found, scripts not world-writable
7. Post-execution
- STOP after upload.
- KHÔNG tự apply option nào.
- KHÔNG retry G6 backup.
- Chờ GPT R18 review + User authorize.
P9-G6 Backup Incident Triage — 2026-04-27 — Claude Code via ssh contabo wrapper — read-only — medium effort — 100% DOT compliance — root cause: directus role thiếu USAGE on sandbox_tac (owner=workflow_admin); 20B artifact persists vì set -e exits trước cleanup block; recommendation: Option A (GRANT) + Option D (hardening) qua DOT pipeline.