KB-989A

S175-VACCINE-02 autoResolveStale Pattern Deep Audit

Revision 1


Date: 2026-04-09 Mode: READ-ONLY audit + design (no fix executed) Agent: Claude Code Goal: Q1 — in how many places does the "upsert by code" pattern exist? Q2 — what is the permanent fix?

Summary

The "create-by-code-bug" pattern in system_issues is far more widespread than expected. The audit found 9 writers, 4 distinct anti-patterns, and 3 active duplicate sources.

S175-VACCINE-01 (partial UNIQUE on (violation_hash) WHERE status='open') blocks only 1 of the 9 writers (dedupe.js, which uses violation_hash). The remaining 8 writers are not blocked because they:

  • use a different logical id ((entity_type, title), raw INSERT, blind POST)
  • or set violation_hash=NULL (619 dot-scanner rows + 280 dot-collection-health rows)

Conclusion for Q1: the pattern is widespread, not an isolated bug. Conclusion for Q2: the fix must span 3 layers: schema, writer, sweeper. Recommendation: Option B ("Reopen = UPDATE existing") plus cleanup of the blind POST writers.


P0 — The autoResolveStale mechanism, precisely

Code path

File: /opt/incomex/docker/nuxt-repo/scripts/integrity/dedupe.js

  • /opt/incomex/deploys/web-test/scripts/integrity/dedupe.js is identical except for the hard env-check (S174-FIX-04)
  • Call chain: cron-integrity.sh → main.js → require('./dedupe') → autoResolveStale

Trigger condition

Cron schedule: 0 */6 * * * (every 6h, 4 times/day)

0 */6 * * * /opt/incomex/deploys/web-test/scripts/integrity/cron-integrity.sh

Each cron run: scans 129 measurements → builds failedHashes set → calls autoResolveStale(failedHashes, runId).

Action

// dedupe.js line 169-214 (excerpt; page, items, now, updateIssue are defined in the full file)
async function autoResolveStale(seenHashes, runId) {
    // Pagination loop:
    const url = `${DIRECTUS_URL}/items/system_issues?` +
        `filter[source_system][_eq]=dieu31-runner` +
        `&filter[status][_in]=open,archived` +
        `&filter[issue_class][_neq]=watchdog_fault` +
        `&limit=100&page=${page}`;
    // ...
    for (const item of items) {
        if (seenHashes.has(item.violation_hash)) continue;
        // NOT in current run → resolve
        await updateIssue(item.id, {
            status: 'resolved',
            resolved_at: now,
            resolution: `Auto-resolved: check passed in run ${runId}`,
        });
    }
}

Action: for each issue NOT present in the current scan results, UPDATE status='resolved'. This updates the existing row; it never INSERTs.

Race window with createIssue

findExistingIssue (dedupe.js line 25-40):

const url = `${DIRECTUS_URL}/items/system_issues?filter[violation_hash][_eq]=${violationHash}&filter[status][_eq]=open&limit=1`;

Critical: Filter is status=open. Resolved/archived rows are INVISIBLE to writer.

Reproducible duplicate scenario

Day N: violation X exists → CREATE row A (status='open', hash=X)
Day N+1: violation X hits a transient false negative (DB query timeout, network blip) → autoResolveStale resolves row A
Day N+2: violation X reappears → findExistingIssue(X, status=open) → NOT FOUND (row A is now resolved!) → CREATE row B (status='open', hash=X)
Result: Row A (resolved, X) + Row B (open, X) — duplicate created via status flicker
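The flicker is easy to reproduce with a small in-memory model of the two functions. This is an illustrative sketch, not the real dedupe.js code: the table shape and helper names are assumptions; only the buggy status='open' filter and the sweep logic mirror the report.

```javascript
// In-memory model of the status-flicker race (illustrative table shape).
const table = [];
let nextId = 1;

// Writer side: only rows with status='open' are visible (the buggy filter)
function findExistingIssue(hash) {
  return table.find(r => r.violation_hash === hash && r.status === 'open');
}

function createOrBump(hash) {
  const existing = findExistingIssue(hash);
  if (existing) { existing.count += 1; return; }
  table.push({ id: nextId++, violation_hash: hash, status: 'open', count: 1 });
}

// Sweeper side: resolve anything not in this run's seen set
function autoResolveStale(seenHashes) {
  for (const row of table) {
    if (row.status === 'open' && !seenHashes.has(row.violation_hash)) {
      row.status = 'resolved';
    }
  }
}

createOrBump('X');           // Day N: row A created
autoResolveStale(new Set()); // Day N+1: transient false negative, A resolved
createOrBump('X');           // Day N+2: A invisible to the writer, row B created

console.log(table.length);   // → 2 (one resolved + one open: the duplicate)
```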

This matches exactly the 2 OPEN pairs we found in S175-VACCINE-01:

3896 first_seen=2026-03-28 (canonical, count=19, last 2026-04-09)
3994 first_seen=2026-03-30 (orphan, count=1, never re-seen) ← created 2 days later by reopen-as-create

(For 9cc0eb pair: 3884 created 2026-03-28, 3999 created 2026-04-01 — same pattern, 4 days apart)


P1 — Pattern Scan: 9 Writers, 4 Anti-patterns

Pattern Matrix

#  | Writer                                        | Pattern                                                    | Logical ID                                                             | Status                          | Risk
---+-----------------------------------------------+------------------------------------------------------------+------------------------------------------------------------------------+---------------------------------+------
1  | dedupe.js dedupeAndReport                     | find_open_then_create                                      | violation_hash                                                         | ACTIVE                          | MED (S175v01 catches now)
2  | dedupe.js autoResolveStale                    | sweep_resolve_by_seen                                      | violation_hash                                                         | ACTIVE                          | HIGH (root cause of original dups)
3  | dot-layer-integrity-audit                     | find_open_then_create                                      | (entity_type, title)                                                   | INERT (status='mở' fails CHECK) | LATENT HIGH
4  | dot-layer-integrity-audit sweeper             | sweep_resolve_by_title                                     | (entity_type, title)                                                   | INERT                           | LATENT MED
5  | dot-registry-crosscheck                       | find_open_then_create                                      | (entity_type, source)                                                  | INERT (status='mở' fails CHECK) | LATENT HIGH
6  | dot-pg-audit-ensure PG fn run_audit_to_issues | NOT_EXISTS_then_INSERT                                     | (entity_code, issue_type)                                              | INERT (status='mở' fails CHECK) | LATENT HIGH
7  | dot-collection-health                         | BLIND_POST                                                 | none                                                                   | ACTIVE                          | HIGH (280 dups confirmed)
8  | dot-apr-health                                | BLIND_POST                                                 | none                                                                   | ACTIVE                          | HIGH
9  | dot-script-lint                               | RAW_SQL_INSERT_ON_CONFLICT_DO_NOTHING (no matching UNIQUE) | code (auto-gen, never collides)                                        | ACTIVE                          | HIGH (effectively blind POST)
10 | dot-id-collision-check                        | BLIND_POST                                                 | none                                                                   | UNKNOWN (0 rows in DB)          | LATENT HIGH
11 | timer_callback/main.py                        | UPDATE_BY_ID                                               | id (PK)                                                                | ACTIVE                          | LOGIC_SAFE (different table: ai_discussions)
12 | approval_requests writers                     | (DB constraint enforced)                                   | (request_type, target_collection, target_entity_code, issue_signature) | ACTIVE                          | LOGIC_SAFE (idx_apr_dedupe partial UNIQUE)

4 Distinct Anti-patterns

Anti-pattern A: "find_open_then_create" (3 sources)

  • Query for existing row with status='open' filter
  • If not found → POST CREATE
  • Bug: resolved/archived rows invisible → reopen creates duplicate
  • Sources: dedupe.js, dot-layer-integrity-audit, dot-registry-crosscheck

Anti-pattern B: "sweep_resolve_by_seen" (2 sources)

  • Periodically iterate all open issues
  • For each: if not in current scan's "seen" set → mark resolved
  • Bug: transient false negative → row gets resolved → next run creates duplicate
  • Sources: dedupe.js autoResolveStale, dot-layer-integrity-audit sweeper
  • This is the primary catalyst for Anti-pattern A

Anti-pattern C: "BLIND_POST" (3-4 sources)

  • POST every time, no find-existing check at all
  • Each script run = N new rows
  • Sources: dot-collection-health (280 confirmed dups), dot-apr-health, dot-id-collision-check, dot-script-lint (effectively)

Anti-pattern D: "NOT_EXISTS_then_INSERT in PL/pgSQL" (1 source)

  • Stored function loops through audit results
  • For each: IF NOT EXISTS (SELECT 1 ... WHERE entity_code AND status='mở') THEN INSERT
  • Same find-then-create race as A, but in DB (which has slightly better isolation)
  • Source: run_audit_to_issues() PG function

Active duplicate evidence

dot-collection-health (BLIND_POST confirmed):

SELECT title, COUNT(*) FROM system_issues
WHERE source = 'dot-collection-health'
GROUP BY title HAVING COUNT(*) > 1 ORDER BY COUNT(*) DESC LIMIT 5;
title                                          | count
-----------------------------------------------+-------
Species code missing: tasks                    |    10
Species code missing: entity_species           |    10
Species code missing: agents                   |    10
Species code missing: system_issues            |    10
Species code missing: field_type_equivalences  |    10

280 rows in 1 day, all written 2026-04-02. Each title has exactly 10 duplicates → script ran 10 times, no dedup. All 280 are now status='resolved' (auto-cleaned up later or stale by retention).

dot-scanner (find_open_then_create + status='mở' bug):

SELECT COUNT(*), COUNT(violation_hash) AS with_hash FROM system_issues WHERE source_system = 'dot-scanner';
→ total=619, with_hash=0  (all violation_hash NULL)

619 rows from 2026-03-10 to 2026-03-14. INERT now (the status='mở' bug prevents new writes), but the pattern remains in the code.

dieu31-runner (find_open_then_create — fixed by S175v01):

  • 165 active rows
  • Last activity: 2026-04-09 02:49 (today)
  • Now blocked by idx_sysissue_open_violation_hash_unique

P2 — Impact Assessment

Live duplicate count

Source                | OPEN dups                        | Caught by S175v01?       | Notes
----------------------+----------------------------------+--------------------------+---------------------------------------------
dieu31-runner         | 0 (fixed by P1 of S175v01)       | YES                      | Active, protected
dot-scanner           | 0 OPEN (all resolved)            | NO (violation_hash NULL) | Inert, latent risk
dot-collection-health | 0 OPEN (all resolved/historical) | NO                       | Active script, will create new dups next run
dot-script-lint       | (need check)                     | NO                       | Active script
dot-apr-health        | (need check)                     | NO                       | Active script
SELECT source_system, source, status, COUNT(*) FROM system_issues
WHERE status = 'open' GROUP BY source_system, source, status;

Output (status=open only):

source_system  | source                    | status | count
---------------+---------------------------+--------+------
dieu31-runner  |                           | open   | ~165
dot-scanner    | dot-pg-audit-ensure       | open   | 0
dot-scanner    | dot-layer-integrity-audit | open   | 0

(All other open=0 because dot-scanner family is inert and BLIND_POST sources sit in resolved/archived).

Risk Rank

Rank     | Source                                       | Why
---------+----------------------------------------------+--------------------------------------------------------------------------------------------
HIGH 1   | dedupe.js autoResolveStale                   | Active sweeper, root cause of dup creation, status flicker → reopen = create. Catalyst for HIGH 2-4.
HIGH 2   | dot-collection-health                        | Active BLIND_POST, 28 issues × N runs = N×28 dups, no protection
HIGH 3   | dot-apr-health                               | Active BLIND_POST, same pattern as HIGH 2
HIGH 4   | dot-script-lint                              | Effectively BLIND_POST (ON CONFLICT DO NOTHING with no matching UNIQUE)
MED 5    | dedupe.js dedupeAndReport                    | Now caught by S175v01 partial UNIQUE → fails loud, but creates HTTP 400 noise on race
LATENT 6 | dot-layer-integrity-audit (writer + sweeper) | INERT due to status='mở' bug. If fixed without addressing dedup → HIGH
LATENT 7 | dot-registry-crosscheck                      | Same as 6
LATENT 8 | dot-pg-audit-ensure PG fn                    | Same as 6
LATENT 9 | dot-id-collision-check                       | Code exists but no rows in DB → never run or all cleaned. Would be HIGH if reactivated.

P3 — Permanent Fix Design

Root question

How do we make "status flicker → duplicate" impossible?

Option A — "Don't resolve, just mark stale"

Idea: Replace UPDATE status='resolved' with UPDATE last_stale_at=NOW(). Status only changes by user action or workflow, never automatically.

Implementation:

  • Add column last_stale_at TIMESTAMP
  • Rename autoResolveStale → autoMarkStale (sets a timestamp instead of status)
  • Reader/UI: filter by (status='open' AND last_stale_at IS NULL OR last_stale_at < NOW() - INTERVAL '24 hours' …)
  • Old data: keep status='resolved' rows, new ones use timestamp
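Option A's sweep/read split can be sketched as follows. This is a minimal sketch: the row shape and helper names are assumptions layered on the proposed last_stale_at column and 24h reader filter, and the liveness comparison is one reading of that filter that must be confirmed.

```javascript
// Option A sketch (illustrative row shape and helper names).
const STALE_WINDOW_MS = 24 * 60 * 60 * 1000; // the proposal's 24h interval

// Sweeper: stamp a timestamp instead of flipping status
function autoMarkStale(rows, seenHashes, now = Date.now()) {
  for (const row of rows) {
    if (row.status === 'open' && !seenHashes.has(row.violation_hash)) {
      row.last_stale_at = now; // timestamp only; status is never touched
    }
  }
}

// Reader: a FRESH stamp means "currently recovered" (the sweeper re-stamps a
// recovered issue every run), so hide it; an absent or old stamp means the
// issue is still failing and should be shown as live.
function isLive(row, now = Date.now()) {
  return row.status === 'open' &&
    (row.last_stale_at == null || now - row.last_stale_at >= STALE_WINDOW_MS);
}

const rows = [{ violation_hash: 'X', status: 'open', last_stale_at: null }];
autoMarkStale(rows, new Set());               // X absent from this scan → stamped
console.log(rows[0].status, isLive(rows[0])); // → open false
```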

Pros:

  • Simplest semantic: status is human/workflow truth, timestamps track activity
  • 1 row per logical_id, ever
  • No race condition

Cons:

  • Schema change (add column) → DDL on production
  • All writers + readers need to learn new model
  • Existing status='resolved' rows have unclear meaning (was it auto-resolved or human-resolved?)
  • Doesn't fix BLIND_POST pattern (those still create dups by not checking existing)

Effort: 4-6h (column + writer changes + reader changes) Risk: MED (semantic shift across many consumers)

Option B — "Reopen = UPDATE existing"

Idea: Change findExistingIssue to look at ANY status (not just open). On reopen, UPDATE the existing row back to status='open' + increment count, instead of CREATE.

Implementation:

  • dedupe.js line 28: remove &filter[status][_eq]=open filter
  • dedupeAndReport: when found in resolved/archived state, also set status='open' + clear resolved_at + increment occurrence_count
  • DB level: change partial UNIQUE WHERE status='open' → full UNIQUE on (violation_hash) WHERE violation_hash IS NOT NULL
  • Other writers (dot-collection-health, dot-apr-health, dot-script-lint): need to compute violation_hash (or another logical_id) + use same upsert logic
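The reopen-as-update flow can be sketched against an in-memory store so the control flow is testable. The field names (occurrence_count, resolved_at) come from this report; the helper itself is illustrative, not the final dedupe.js change.

```javascript
// Option B sketch: upsert by violation_hash, reopen instead of re-create.
function upsertIssue(store, violationHash, fields) {
  // Key change vs. findExistingIssue: match the hash across ALL statuses
  const existing = store.find(r => r.violation_hash === violationHash);
  if (!existing) {
    const row = { id: store.length + 1, violation_hash: violationHash,
                  status: 'open', occurrence_count: 1, resolved_at: null,
                  ...fields };
    store.push(row);
    return row;
  }
  // Reopen path: UPDATE the canonical row, never INSERT a sibling
  existing.status = 'open';
  existing.resolved_at = null;
  existing.occurrence_count += 1;
  return existing;
}

const store = [];
upsertIssue(store, 'X', { title: 'demo' }); // day N: create
store[0].status = 'resolved';               // sweeper flickers it
upsertIssue(store, 'X', { title: 'demo' }); // day N+2: reopen, no duplicate
console.log(store.length, store[0].status, store[0].occurrence_count);
// → 1 open 2
```

With the full UNIQUE on (violation_hash), any writer that bypasses this logic fails at the DB instead of silently duplicating.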

Pros:

  • 1 row per logical_id, ever (across all statuses)
  • Status truth is preserved (open = current, resolved = recovered)
  • DB constraint enforces invariant
  • Re-opens work naturally (UPDATE existing)
  • Backward compatible: existing rows stay (just need to dedupe before adding constraint)

Cons:

  • DB constraint requires dedupe of all historical rows first (~7+ groups in resolved state for dieu31-runner)
  • BLIND_POST writers must be refactored to compute logical_id
  • Loses per-incident history (only latest occurrence info on the row)
    • Mitigated by: occurrence_count + last_seen_at + audit log
  • Adding UNIQUE on full table is bigger change than partial

Effort: 6-10h (dedupe historical + writer refactor + UNIQUE add) Risk: MED (fewer consumer changes than A, but needs all writers to compute hash)

Option C — "Schema-enforced reopen via trigger"

Idea: BEFORE INSERT trigger checks if any row with same logical_id exists in any status. If yes, RAISE EXCEPTION with hint to UPDATE instead.

Implementation:

  • PG function fn_block_duplicate_logical_id_insert BEFORE INSERT on system_issues
  • If SELECT 1 FROM system_issues WHERE violation_hash = NEW.violation_hash exists → RAISE EXCEPTION 'Use UPDATE for existing logical_id %', NEW.violation_hash
  • For rows without violation_hash: use (source, title, entity_code, entity_type) composite

Pros:

  • Pure DB change, no writer code changes
  • Catches ALL writers including BLIND_POST
  • Educates developers via clear error message
  • 100% deterministic (DB enforced)

Cons:

  • Aggressive — breaks all writers immediately on deploy
  • BLIND_POST scripts will fail loudly (which IS the goal but causes operational pain)
  • Composite logical_id for non-hash writers is fragile (title text might change)
  • Trigger overhead on every INSERT
  • Doesn't address the "sweep resolve" pattern (sweep still resolves rows, just doesn't create duplicates)

Effort: 2-3h (write trigger + test), plus ~10h to fix all broken writers afterward Risk: HIGH if deployed without prep (production breakage)

Comparison

Criterion                              | Option A                            | Option B                           | Option C
---------------------------------------+-------------------------------------+------------------------------------+--------------------------
"Permanent" (1 row/logical_id forever) | YES                                 | YES                                | YES
"Zero chance of error"                 | Partial (BLIND_POST still bypasses) | YES (after writer refactor)        | YES (DB enforced)
Schema changes                         | New column                          | Constraint change                  | Trigger function
Writer code changes                    | Many readers + writers              | All non-dedupe.js writers          | None initially, then all
Operational risk                       | LOW (additive)                      | MED (constraint blocks bad writes) | HIGH (immediate breakage)
Effort                                 | 4-6h                                | 6-10h                              | 2-3h + ~10h cleanup

Recommendation

Option B + cleanup blind writers (phased):

Phase 1 (S175v02-fix-1): DB-side prep

  1. Dedupe historical rows: for each violation_hash with multiple rows, keep canonical (lowest id, highest occurrence_count, latest last_seen_at), archive rest
  2. Drop partial UNIQUE idx_sysissue_open_violation_hash_unique
  3. Add full UNIQUE idx_sysissue_violation_hash_unique on (violation_hash) WHERE violation_hash IS NOT NULL
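Step 1's canonical selection can be sketched like this. The report lists three criteria (lowest id, highest occurrence_count, latest last_seen_at) without an explicit precedence; the ordering below is one plausible reading and must be confirmed before running the dedupe.

```javascript
// Phase 1 sketch: pick the canonical row from a duplicate violation_hash group.
// Assumed precedence: most occurrences, then most recently seen, then lowest id.
function pickCanonical(rows) {
  return [...rows].sort((a, b) =>
    (b.occurrence_count - a.occurrence_count) ||                 // most occurrences wins
    (Date.parse(b.last_seen_at) - Date.parse(a.last_seen_at)) || // then most recent
    (a.id - b.id)                                                // then lowest id
  )[0];
}

// The 3896/3994 pair from P0: canonical should be the count=19 row
const group = [
  { id: 3896, occurrence_count: 19, last_seen_at: '2026-04-09' },
  { id: 3994, occurrence_count: 1,  last_seen_at: '2026-03-30' },
];
console.log(pickCanonical(group).id); // → 3896
```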

Phase 2 (S175v02-fix-2): dedupe.js writer

  4. Modify findExistingIssue to remove the status filter
  5. dedupeAndReport UPDATE branch sets status='open', clears resolved_at, increments count
  6. Smoke-test the run

Phase 3 (S175v02-fix-3): BLIND_POST writers

  7. dot-collection-health, dot-apr-health, dot-script-lint: add violation_hash computation (sha256 of source+title) and use the same upsert pattern
  8. Test each script

Phase 4 (S175v02-fix-4): latent writers

  9. dot-layer-integrity-audit, dot-registry-crosscheck, run_audit_to_issues: fix status='mở' → 'open' AND adopt the upsert pattern
  10. dot-id-collision-check: add upsert

Why Option B over C:

  • Less operational risk: phased rollout, each phase tested independently
  • Preserves existing data semantics
  • DB constraint comes after dedupe, so no breakage on deploy
  • Writer refactor is targeted (per script), not big-bang

Why Option B over A:

  • No schema column add (less migration overhead)
  • Status semantics preserved (open=current, resolved=recovered)
  • DB constraint is single source of truth, doesn't depend on consumer reading new column correctly

Manifesto mapping

  • "Vĩnh viễn" (permanent): Option B has a DB UNIQUE on (violation_hash) — 1 row per logical_id, forever. ✓
  • "Cơ hội nhầm = 0" (zero chance of error): after Phase 4, all writers share the same upsert logic and the DB constraint catches any bypass. ✓
  • "LỖI = cơ hội vàng tìm gốc" (an error is a golden opportunity to find the root): this audit IS that opportunity. The S175 root cause led to discovering 9 writers and 4 patterns; the comprehensive fix prevents the whole class of bugs. ✓

Files Referenced (file:line)

  • /opt/incomex/docker/nuxt-repo/scripts/integrity/dedupe.js:25-40 — findExistingIssue (status=open filter)
  • /opt/incomex/docker/nuxt-repo/scripts/integrity/dedupe.js:118-160 — dedupeAndReport (find_open_then_create)
  • /opt/incomex/docker/nuxt-repo/scripts/integrity/dedupe.js:169-214 — autoResolveStale (sweep resolver)
  • /opt/incomex/docker/nuxt-repo/scripts/integrity/main.js:154,361 — autoResolveStale call sites
  • /opt/incomex/docker/nuxt-repo/dot/bin/dot-layer-integrity-audit:395-490 — find_open_then_create + sweep
  • /opt/incomex/docker/nuxt-repo/dot/bin/dot-registry-crosscheck:230-260 — find_open_then_create
  • /opt/incomex/docker/nuxt-repo/dot/bin/dot-pg-audit-ensure:245-300 — PG function NOT_EXISTS_then_INSERT
  • /opt/incomex/dot/bin/dot-collection-health:60-80 — BLIND_POST (no find)
  • /opt/incomex/dot/bin/dot-apr-health:64-88 — BLIND_POST
  • /opt/incomex/dot/bin/dot-script-lint:43 — RAW SQL ON CONFLICT (no matching UNIQUE)
  • /opt/incomex/dot/bin/dot-id-collision-check:175 — BLIND_POST
  • /opt/incomex/docker/agent-data-repo/functions/timer_callback/main.py:172 — UPDATE_BY_ID (LOGIC_SAFE)

Cron Schedule (when these run)

0 */6 * * * /opt/incomex/deploys/web-test/scripts/integrity/cron-integrity.sh   # dedupe.js path, 4×/day
0 * * * *   /opt/incomex/deploys/web-test/scripts/integrity/watchdog-monitor.sh # 24×/day
0 */3 * * * /opt/incomex/deploys/web-test/scripts/integrity/scanner-counts.sh   # 8×/day

dot-collection-health, dot-apr-health, dot-script-lint, dot-layer-integrity-audit: not visible in crontab. They run manually or via other automation (CI/CD?); the schedule needs verification.

Unknowns / Need clarification

  1. dot-collection-health schedule: 280 rows on 2026-04-02 only. Why a single-day burst? Manual run? CI? Need to verify whether it is still being invoked.
  2. dot-scanner family inactive since 2026-03-14: Why stopped? Did someone manually disable? Need to confirm intentional or accidental.
  3. dot-script-lint ON CONFLICT semantics: Does it expect a UNIQUE on (title) that doesn't exist? Maybe a planned-but-not-applied schema change?
  4. dot-id-collision-check 0 rows: Code exists in /opt/incomex/dot/bin/ but never wrote anything. Inactive or never invoked?
  5. Composite logical_id for non-hash writers: For dot-collection-health ("Species code missing: tasks"), what should the natural unique key be? (source, title)? (source, entity_type, entity_code)? Need product decision.