KB-9123

PG Backup Phase C Hardening Report — 2026-05-19

Revision 1
Tags: pg-backup, phase-c, hardening, permission-guard, kuma, restore-drill, s174

Date: 2026-05-19
Host: vmi3080463
Scope: permission guard lifecycle, registry lock coverage audit, stale monitor cleanup, restore drill.

0. Governance

  • Skill read: .claude/skills/incomex-rules.md.
  • KB read via search_knowledge (direct, main process):
    • knowledge/dev/ssot/operating-rules.md — OR v7.58, 2026-05-01.
    • knowledge/dev/laws/constitution.md — Constitution v4.6.3.
    • P9-G6/S174 backup incident context: PIPESTATUS, active DOWN push, grant drift guard, restore drill.
  • Three statements:
    • Permanent: an independent daily permission drift monitor was added; the stale workflow monitor no longer produces false DOWN alerts; the restore drill validates recoverability.
    • Cannot silently fail: the guard has a dedicated Kuma monitor and a failure heartbeat, separate from the local backup and GDrive monitors.
    • Automated: the guard runs daily via cron and sends its own heartbeat.

1. Permission Guard Hardening

Before Phase C, check-pg-dump-permissions.sh ran only as a step inside pg-backup.sh; the root crontab had no standalone guard entry.

Current guard coverage:

  • Dump role: directus.
  • Non-system schema scan excludes only information_schema and schemas matching pg_%.
  • No custom application schema allowlist or exclude list is present.
  • Objects checked: USAGE on schemas; SELECT on tables, partitioned tables, views, materialized views, and foreign tables; USAGE and SELECT on sequences.
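
The coverage rules above can be expressed as one catalog query plus a small evaluator. The following is a minimal sketch under stated assumptions, not the contents of check-pg-dump-permissions.sh: the function names guard_sql, evaluate_guard and run_guard, the "kind|object" row format, and the database name in the usage line are hypothetical. The has_*_privilege functions are standard PostgreSQL.

```shell
# Sketch only; not the contents of check-pg-dump-permissions.sh.
# Emits SQL that returns one "kind|object" row per object on which
# the role (psql variable :'role') lacks a required privilege.
guard_sql() {
  cat <<'SQL'
WITH ns AS (
  SELECT oid, nspname
  FROM pg_namespace
  WHERE nspname <> 'information_schema'
    AND nspname NOT LIKE 'pg_%'  -- "_" is a one-character wildcard in LIKE
)
SELECT 'schema' AS kind, ns.nspname::text AS object
FROM ns
WHERE NOT has_schema_privilege(CAST(:'role' AS name), ns.oid, 'USAGE')
UNION ALL
SELECT 'relation', c.oid::regclass::text
FROM pg_class c JOIN ns ON ns.oid = c.relnamespace
WHERE c.relkind IN ('r', 'p', 'v', 'm', 'f')
  AND NOT has_table_privilege(CAST(:'role' AS name), c.oid, 'SELECT')
UNION ALL
SELECT 'sequence', c.oid::regclass::text
FROM pg_class c JOIN ns ON ns.oid = c.relnamespace
WHERE c.relkind = 'S'
  AND NOT (has_sequence_privilege(CAST(:'role' AS name), c.oid, 'USAGE')
       AND has_sequence_privilege(CAST(:'role' AS name), c.oid, 'SELECT'));
SQL
}

# Reads result rows on stdin: no rows means PASS, any row means FAIL.
evaluate_guard() {
  missing=$(cat)
  if [ -z "$missing" ]; then
    echo "PASS: dump role has required schema/table/view/sequence privileges"
    return 0
  fi
  echo "FAIL: missing privileges on:"
  printf '%s\n' "$missing"
  return 1
}

# $1 = role, $2 = database (both supplied by the caller).
# The query result is captured first so that a psql failure is reported
# as a failure instead of being mistaken for "no missing privileges".
run_guard() {
  rows=$(guard_sql | psql -X -At -v ON_ERROR_STOP=1 -v role="$1" -d "$2") || {
    echo "FAIL: permission query could not run"
    return 2
  }
  printf '%s' "$rows" | evaluate_guard
}

# Usage (hypothetical database name): run_guard directus directus
```

Separating the query failure path from the empty-result path matters here: an empty result is the success signal, so a check that only looks at the row count would pass silently when the database is unreachable.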

Changes:

  • Added /opt/incomex/scripts/pg-dump-permission-guard-monitor.sh.
  • Added /opt/incomex/scripts/ensure-pg-dump-permission-guard-kuma-monitor.sh.
  • Created Uptime Kuma push monitor PG Dump Permission Guard, id 17, token pg-dump-permission-guard, interval 90000, active 1.
  • Added root cron: 17 4 * * * /opt/incomex/scripts/pg-dump-permission-guard-monitor.sh.
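
A wrapper of the kind listed above could look like the following sketch. It is not the contents of pg-dump-permission-guard-monitor.sh: KUMA_BASE_URL, the location of the guard script, and the function names are assumptions for illustration. The /api/push/<token> path with status and msg query parameters is the push URL form Uptime Kuma shows for push monitors. Assuming the monitor interval is in seconds, 90000 is 25 hours, so a daily run has roughly one hour of slack before the monitor flips to DOWN.

```shell
# Hypothetical values: adjust to the real Kuma address and guard location.
KUMA_BASE_URL="${KUMA_BASE_URL:-http://127.0.0.1:3001}"
KUMA_TOKEN="pg-dump-permission-guard"   # token from the monitor definition
GUARD="${GUARD:-/opt/incomex/scripts/check-pg-dump-permissions.sh}"

# Map the guard's exit code to the status value sent to Kuma.
kuma_status() {
  if [ "$1" -eq 0 ]; then echo up; else echo down; fi
}

# $1 = status (up|down), $2 = message. curl -G appends the URL-encoded
# fields to the query string and sends a GET request.
push_heartbeat() {
  curl -fsS -m 20 -G \
    --data-urlencode "status=$1" \
    --data-urlencode "msg=$2" \
    "$KUMA_BASE_URL/api/push/$KUMA_TOKEN" >/dev/null
}

# Run the guard, then push UP on success or an explicit DOWN on failure,
# using the last line of the guard output as the heartbeat message.
run_guard_with_heartbeat() {
  if out=$("$GUARD" 2>&1); then rc=0; else rc=$?; fi
  msg=$(printf '%s\n' "$out" | tail -n 1)
  push_heartbeat "$(kuma_status "$rc")" "$msg"
  return "$rc"
}

# Entry point when used as a cron script:
# run_guard_with_heartbeat
```

Pushing an explicit down status on failure, rather than only withholding the heartbeat, means a permission drift is reported at the time of the run and not up to a day later when the interval expires.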

Manual verification:

PASS: dump role directus has required schema/table/view/sequence privileges
2026-05-19T05:05:49+02:00 FINAL_STATUS=success ERROR_CLASS=none MSG=PASS: dump role directus has required schema/table/view/sequence privileges
17  PG Dump Permission Guard  1  2026-05-19 03:05:49.881  1  permission guard PASS  1

No GRANT was applied in Phase C because the guard already passed.

2. Registry Lock Coverage Audit

Batch functions expected from Phase B:

fn_refresh_orphan_col            has_lock=t uses_key=t
fn_refresh_orphan_dot            has_lock=t uses_key=t
fn_refresh_orphan_species        has_lock=t uses_key=t
fn_refresh_species_per_level     has_lock=t uses_key=t
refresh_meta_catalog_from_pivot  has_lock=t uses_key=t
refresh_registry_views           has_lock=t uses_key=t truncates_registry_counts=t
refresh_pivot_results            has_lock=f uses_key=f
refresh_matrix_results           has_lock=f uses_key=f
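
The report does not state how the has_lock and uses_key flags were derived. One plausible way to reproduce them is to scan each function definition for an advisory-lock call and for the key string. The sketch below does that; classify_fn and the schema-qualified name in the usage line are hypothetical, and a text match is only a heuristic, since it cannot tell whether the lock is actually reached at run time.

```shell
LOCK_KEY="incomex.registry_refresh.v1"

# Reads a function definition on stdin and prints flags in the same
# layout as the table above. $1 = function name used as the row label.
classify_fn() {
  src=$(cat)
  has_lock=f
  uses_key=f
  # Matches pg_advisory_lock, pg_advisory_xact_lock and their try/shared variants.
  case "$src" in
    *advisory_lock*|*advisory_xact_lock*) has_lock=t ;;
  esac
  case "$src" in
    *"$LOCK_KEY"*) uses_key=t ;;
  esac
  printf '%-32s has_lock=%s uses_key=%s\n' "$1" "$has_lock" "$uses_key"
}

# Usage (hypothetical schema and database name):
# psql -X -At -d directus \
#   -c "SELECT pg_get_functiondef('public.refresh_pivot_results()'::regprocedure)" \
#   | classify_fn refresh_pivot_results
```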

Coverage gaps found during the read-only audit (not changed in Phase C):

  • refresh_pivot_results() and refresh_matrix_results() are scheduled write-heavy jobs and do not take the registry lock keyed by incomex.registry_refresh.v1.
  • Trigger/function paths touching meta_catalog or v_registry_counts without the registry lock include fn_auto_cleanup_on_meta_delete, fn_auto_sync_v_registry_counts, fn_ensure_registry_counts, count refresh functions, refresh_all_meta_counts, update_record_count, trg_pivot_def_refresh, and trg_fn_refresh_orphan_*.
  • DOT/code paths that can mutate registry counts or meta_catalog exist outside these DB functions, including /opt/incomex/dot/bin/dot-schema-dot-origin-ensure, /opt/incomex/dot/bin/dot-orphan-scan, /opt/incomex/dot/bin/dot-registry-count-refresh, and /opt/incomex/dot/bin/dot-schema-trigger-registry-ensure.

Conclusion: Phase B removed the immediate cron/deadlock class for the main backup collision, but advisory-lock coverage is not complete for all registry/meta write paths. This should be a separate approved SQL/DOT hardening phase.
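
For the follow-up phase, an unlocked writer could be serialized by running it inside a transaction that first takes the shared advisory lock. The sketch below only generates the SQL. How Phase B maps the key string to a lock id is not stated in this report, so hashtext() is an assumption, and with_registry_lock is a hypothetical helper. Every writer must derive the same lock id as the existing locked functions, otherwise nothing is serialized.

```shell
LOCK_KEY="incomex.registry_refresh.v1"

# $1 = SQL statement to run under the lock. Prints a transaction that takes
# a transaction-scoped advisory lock (released at COMMIT or ROLLBACK) first.
# ASSUMPTION: lock id = hashtext(key); verify against the Phase B functions.
with_registry_lock() {
  cat <<SQL
BEGIN;
SELECT pg_advisory_xact_lock(hashtext('$LOCK_KEY'));
$1
COMMIT;
SQL
}

# Usage (hypothetical database name):
# with_registry_lock "SELECT refresh_pivot_results();" \
#   | psql -X -v ON_ERROR_STOP=1 -d directus
```

A transaction-scoped lock is used in the sketch because it cannot leak if the job aborts; a session-scoped lock would need an explicit unlock on every exit path.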

3. Monitoring Cleanup

Evidence that PG Backup Workflow was stale:

# S174-FIX-03 archived: 0 2 * * * /opt/workflow/postgres/backup.sh >> /opt/workflow/postgres/backup.log 2>&1
/opt/workflow/postgres/backup.sh.retired.20260408
/opt/workflow/postgres/backups.retired.20260408/workflow_20260401T000001Z.sql.gz 20 bytes
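
The 20-byte size of the last retired dump is worth noting: 20 bytes is exactly what gzip writes when its input is empty, which is consistent with a dump pipeline that produced no data. A minimal sketch for flagging such files follows; is_empty_gzip is a hypothetical helper, not part of the existing scripts.

```shell
# True when a .gz file decompresses to zero bytes. A file that cannot be
# decompressed at all also yields zero bytes and is flagged the same way.
is_empty_gzip() {
  [ "$(gzip -dc "$1" 2>/dev/null | wc -c | tr -d ' ')" = "0" ]
}

# Usage:
# is_empty_gzip /path/to/dump.sql.gz && echo "unusable backup"
```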

Before:

13  PG Backup Workflow  push  1  pg-backup-workflow  90000
2026-05-19 01:21:53.452  0  No heartbeat in the time window
2026-05-18 00:21:53.440  0  No heartbeat in the time window

Action: paused monitor #13 via Kuma socket API. Did not restart PostgreSQL. Did not touch nginx or unrelated services.

After:

13  PG Backup Workflow  push  0  pg-backup-workflow  90000

Current backup monitor separation:

12  PG Backup Local           active=1 latest success OK size=64M
14  PG Backup GDrive          active=1 latest success OK archive=131M
17  PG Dump Permission Guard  active=1 latest success permission guard PASS

4. Restore Drill

Local backup tested:

LOCAL_BACKUP=/opt/incomex/backups/pg/directus_2026-05-19_0240.sql.gz
LOCAL_SIZE_BYTES=66690408
LOCAL_PAYLOAD_BYTES=720906367
LOCAL_HEADER=-- -- PostgreSQL database dump --
RESTORE_DB=restore_drill_20260519_050848
RESTORE_TABLES=302
RESTORE_SCHEMAS=cutter_governance,public,sandbox_tac
RESTORE_ERROR_LINES=none

GDrive archive tested:

GDRIVE_ARCHIVE=vps-backup-20260519_044156.tar.gz
GDRIVE_ARCHIVE_SIZE_BYTES=136703863
GDRIVE_PG_MEMBER=vps-backup-20260519_044156/postgresql-directus.sql.gz
GDRIVE_PG_SIZE_BYTES=66690402
GDRIVE_PAYLOAD_BYTES=720906367
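GDRIVE_HEADER=-- -- PostgreSQL database dump --

The local dump and the GDrive archive member decompress to the same payload size (720906367 bytes); their compressed sizes differ by 6 bytes (66690408 vs 66690402).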

Cleanup evidence: no restore_drill_% database rows and no /tmp/pg-restore-drill-* directories remained.
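
The checks recorded in this section can be sketched as small helpers. This is an illustration under stated assumptions, not the drill script that was actually run: the function names, the table-count query, and the reliance on the caller's default psql/createdb/dropdb connection settings are all hypothetical, and the sketch prints an error count where the report shows "none".

```shell
# First three lines of the decompressed dump, joined with spaces.
# For a plain pg_dump file this is "-- -- PostgreSQL database dump --".
dump_header() {
  gzip -dc "$1" 2>/dev/null | head -n 3 | tr '\n' ' ' | sed 's/ *$//'
}

# Size in bytes of the decompressed payload.
payload_bytes() {
  gzip -dc "$1" | wc -c | tr -d ' '
}

# $1 = path to a .sql.gz dump. Restores into a throwaway database,
# reports table count and error lines, then always drops the database.
restore_drill() {
  db="restore_drill_$(date +%Y%m%d_%H%M%S)"
  errlog=$(mktemp)
  createdb "$db" || { rm -f "$errlog"; return 1; }
  gzip -dc "$1" | psql -X -q -d "$db" >/dev/null 2>"$errlog"
  tables=$(psql -X -At -d "$db" -c "SELECT count(*) FROM information_schema.tables WHERE table_type = 'BASE TABLE' AND table_schema NOT IN ('pg_catalog', 'information_schema')")
  errors=$(grep -c 'ERROR' "$errlog")
  echo "RESTORE_DB=$db"
  echo "RESTORE_TABLES=$tables"
  echo "RESTORE_ERROR_LINES=$errors"
  dropdb "$db"
  rm -f "$errlog"
  [ "$errors" -eq 0 ]
}

# Usage:
# dump_header /opt/incomex/backups/pg/directus_2026-05-19_0240.sql.gz
# restore_drill /opt/incomex/backups/pg/directus_2026-05-19_0240.sql.gz
```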

5. Backups Created

/root/crontab.pre-phase-c-pgbackup-hardening-1779159891.txt
/opt/incomex/backups/kuma.db.pre-phase-c-pgbackup-hardening-1779159891
/opt/incomex/backups/kuma.db.pre-phase-c-disable-pg-backup-workflow-1779159993
/opt/incomex/scripts/ensure-pg-dump-permission-guard-kuma-monitor.sh.pre-phase-c-1779159934

6. Rollback

Restore crontab:

crontab /root/crontab.pre-phase-c-pgbackup-hardening-1779159891.txt

Full Kuma rollback to pre-Phase-C monitor state:

docker stop uptime-kuma
cp /opt/incomex/backups/kuma.db.pre-phase-c-pgbackup-hardening-1779159891 /opt/incomex/uptime-kuma/kuma.db
docker start uptime-kuma

Rollback only the stale monitor disable:

docker stop uptime-kuma
cp /opt/incomex/backups/kuma.db.pre-phase-c-disable-pg-backup-workflow-1779159993 /opt/incomex/uptime-kuma/kuma.db
docker start uptime-kuma

Remove Phase C guard scripts if doing full rollback:

rm -f /opt/incomex/scripts/pg-dump-permission-guard-monitor.sh
rm -f /opt/incomex/scripts/ensure-pg-dump-permission-guard-kuma-monitor.sh

7. Remaining Risks

  • Advisory lock coverage does not yet extend to all registry/meta write paths. A separate task should wrap all registry-count/meta-catalog writers and relevant DOT jobs with a shared serialization mechanism.
  • TRUNCATE public.v_registry_counts remains in refresh_registry_views(). Replacement with upsert/delete-diff or a staging-table swap should be a separate approved redesign.
  • A scheduled, periodic restore drill is recommended in addition to manual incident-time validation.
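
For the scheduled drill, the job first has to pick the most recent local dump. A minimal sketch follows, assuming the directus_YYYY-MM-DD_HHMM.sql.gz naming seen in section 4, so that name order equals time order. latest_backup, the script name, and the cron schedule in the comment are hypothetical.

```shell
# $1 = backup directory. Prints the newest dump by name; returns 1 if none.
# Glob results are sorted, so the last existing match is the latest timestamp.
latest_backup() {
  latest=""
  for f in "$1"/directus_*.sql.gz; do
    [ -e "$f" ] && latest=$f
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Hypothetical weekly cron entry for a drill script built on this helper:
# 30 5 * * 0 /opt/incomex/scripts/pg-restore-drill.sh
#
# Usage:
# backup=$(latest_backup /opt/incomex/backups/pg) || echo "no backup found"
```

As with the permission guard, a scheduled drill would benefit from its own push monitor so that a skipped or failed drill is visible.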