KB-989A

S175-VACCINE-02 autoResolveStale Pattern Deep Audit

Revision 1


Date: 2026-04-09 Mode: READ-ONLY audit + design (no fix executed) Agent: Claude Code Goal: Q1 — in how many places does the "upsert by code" pattern exist? Q2 — what is the permanent fix?

Summary

The "create-by-code-bug" pattern in system_issues is far more widespread than expected. The audit found 9 writers, 4 distinct anti-patterns, and 3 active duplicate sources.

S175-VACCINE-01 (partial UNIQUE on (violation_hash) WHERE status='open') blocks only 1 of the 9 writers (dedupe.js, which uses violation_hash). The remaining 8 writers are not blocked because they:

  • use a different logical id ((entity_type, title), raw INSERT, blind POST)
  • or set violation_hash=NULL (619 dot-scanner rows + 280 dot-collection-health rows)

Conclusion for Q1: the pattern is widespread, not an isolated bug. Conclusion for Q2: the fix must span 3 layers: schema, writer, sweeper. Recommendation: Option B ("Reopen = UPDATE existing") plus cleanup of the blind POST writers.


P0 — The autoResolveStale mechanism, precisely

Code path

File: /opt/incomex/docker/nuxt-repo/scripts/integrity/dedupe.js

  • /opt/incomex/deploys/web-test/scripts/integrity/dedupe.js is identical except for the hard env-check (S174-FIX-04)
  • Call chain: cron-integrity.sh → main.js → require('./dedupe') → autoResolveStale

Trigger condition

Cron schedule: 0 */6 * * * (every 6h, 4 times/day)

0 */6 * * * /opt/incomex/deploys/web-test/scripts/integrity/cron-integrity.sh

Each cron run: scans 129 measurements → builds failedHashes set → calls autoResolveStale(failedHashes, runId).

Action

// dedupe.js line 169-214 (excerpt; page, items, now, updateIssue are defined in the full file)
async function autoResolveStale(seenHashes, runId) {
    // Pagination loop:
    const url = `${DIRECTUS_URL}/items/system_issues?` +
        `filter[source_system][_eq]=dieu31-runner` +
        `&filter[status][_in]=open,archived` +
        `&filter[issue_class][_neq]=watchdog_fault` +
        `&limit=100&page=${page}`;
    // ...
    for (const item of items) {
        if (seenHashes.has(item.violation_hash)) continue;
        // NOT in current run → resolve
        await updateIssue(item.id, {
            status: 'resolved',
            resolved_at: now,
            resolution: `Auto-resolved: check passed in run ${runId}`,
        });
    }
}

Action: for each issue NOT present in the current scan results, UPDATE status='resolved'. This updates the existing row; it never INSERTs.

Race window with createIssue

findExistingIssue (dedupe.js line 25-40):

const url = `${DIRECTUS_URL}/items/system_issues?filter[violation_hash][_eq]=${violationHash}&filter[status][_eq]=open&limit=1`;

Critical: Filter is status=open. Resolved/archived rows are INVISIBLE to writer.

Reproducible duplicate scenario

Day N: violation X exists → CREATE row A (status='open', hash=X)
Day N+1: violation X hits a transient false negative (DB query timeout, network blip) → autoResolveStale resolves row A
Day N+2: violation X reappears → findExistingIssue(X, status=open) → NOT FOUND (row A is now resolved!) → CREATE row B (status='open', hash=X)
Result: Row A (resolved, X) + Row B (open, X) — duplicate created via status flicker
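The flicker is easy to reproduce with a small in-memory model of the two functions. This is an illustrative sketch, not the real dedupe.js code: the table shape and helper names are assumptions; only the buggy status='open' filter and the sweep logic mirror the report.

```javascript
// In-memory model of the status-flicker race (illustrative table shape).
const table = [];
let nextId = 1;

// Writer side: only rows with status='open' are visible (the buggy filter)
function findExistingIssue(hash) {
  return table.find(r => r.violation_hash === hash && r.status === 'open');
}

function createOrBump(hash) {
  const existing = findExistingIssue(hash);
  if (existing) { existing.count += 1; return; }
  table.push({ id: nextId++, violation_hash: hash, status: 'open', count: 1 });
}

// Sweeper side: resolve anything not in this run's seen set
function autoResolveStale(seenHashes) {
  for (const row of table) {
    if (row.status === 'open' && !seenHashes.has(row.violation_hash)) {
      row.status = 'resolved';
    }
  }
}

createOrBump('X');           // Day N: row A created
autoResolveStale(new Set()); // Day N+1: transient false negative, A resolved
createOrBump('X');           // Day N+2: A invisible to the writer, row B created

console.log(table.length);   // → 2 (one resolved + one open: the duplicate)
```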

This matches exactly the 2 OPEN pairs we found in S175-VACCINE-01:

3896 first_seen=2026-03-28 (canonical, count=19, last 2026-04-09)
3994 first_seen=2026-03-30 (orphan, count=1, never re-seen) ← created 2 days later by reopen-as-create

(For 9cc0eb pair: 3884 created 2026-03-28, 3999 created 2026-04-01 — same pattern, 4 days apart)


P1 — Pattern Scan: 9 Writers, 4 Anti-patterns

Pattern Matrix

#  | Writer                                        | Pattern                                                    | Logical ID                                                             | Status                          | Risk
---+-----------------------------------------------+------------------------------------------------------------+------------------------------------------------------------------------+---------------------------------+------
1  | dedupe.js dedupeAndReport                     | find_open_then_create                                      | violation_hash                                                         | ACTIVE                          | MED (S175v01 catches now)
2  | dedupe.js autoResolveStale                    | sweep_resolve_by_seen                                      | violation_hash                                                         | ACTIVE                          | HIGH (root cause of original dups)
3  | dot-layer-integrity-audit                     | find_open_then_create                                      | (entity_type, title)                                                   | INERT (status='mở' fails CHECK) | LATENT HIGH
4  | dot-layer-integrity-audit sweeper             | sweep_resolve_by_title                                     | (entity_type, title)                                                   | INERT                           | LATENT MED
5  | dot-registry-crosscheck                       | find_open_then_create                                      | (entity_type, source)                                                  | INERT (status='mở' fails CHECK) | LATENT HIGH
6  | dot-pg-audit-ensure PG fn run_audit_to_issues | NOT_EXISTS_then_INSERT                                     | (entity_code, issue_type)                                              | INERT (status='mở' fails CHECK) | LATENT HIGH
7  | dot-collection-health                         | BLIND_POST                                                 | none                                                                   | ACTIVE                          | HIGH (280 dups confirmed)
8  | dot-apr-health                                | BLIND_POST                                                 | none                                                                   | ACTIVE                          | HIGH
9  | dot-script-lint                               | RAW_SQL_INSERT_ON_CONFLICT_DO_NOTHING (no matching UNIQUE) | code (auto-gen, never collides)                                        | ACTIVE                          | HIGH (effectively blind POST)
10 | dot-id-collision-check                        | BLIND_POST                                                 | none                                                                   | UNKNOWN (0 rows in DB)          | LATENT HIGH
11 | timer_callback/main.py                        | UPDATE_BY_ID                                               | id (PK)                                                                | ACTIVE                          | LOGIC_SAFE (different table: ai_discussions)
12 | approval_requests writers                     | (DB constraint enforced)                                   | (request_type, target_collection, target_entity_code, issue_signature) | ACTIVE                          | LOGIC_SAFE (idx_apr_dedupe partial UNIQUE)

4 Distinct Anti-patterns

Anti-pattern A: "find_open_then_create" (3 sources)

  • Query for existing row with status='open' filter
  • If not found → POST CREATE
  • Bug: resolved/archived rows invisible → reopen creates duplicate
  • Sources: dedupe.js, dot-layer-integrity-audit, dot-registry-crosscheck

Anti-pattern B: "sweep_resolve_by_seen" (2 sources)

  • Periodically iterate all open issues
  • For each: if not in current scan's "seen" set → mark resolved
  • Bug: transient false negative → row gets resolved → next run creates duplicate
  • Sources: dedupe.js autoResolveStale, dot-layer-integrity-audit sweeper
  • This is the primary catalyst for Anti-pattern A

Anti-pattern C: "BLIND_POST" (3-4 sources)

  • POST every time, no find-existing check at all
  • Each script run = N new rows
  • Sources: dot-collection-health (280 confirmed dups), dot-apr-health, dot-id-collision-check, dot-script-lint (effectively)

Anti-pattern D: "NOT_EXISTS_then_INSERT in PL/pgSQL" (1 source)

  • Stored function loops through audit results
  • For each: IF NOT EXISTS (SELECT 1 ... WHERE entity_code AND status='mở') THEN INSERT
  • Same find-then-create race as A, but in DB (which has slightly better isolation)
  • Source: run_audit_to_issues() PG function

Active duplicate evidence

dot-collection-health (BLIND_POST confirmed):

SELECT title, COUNT(*) FROM system_issues
WHERE source = 'dot-collection-health'
GROUP BY title HAVING COUNT(*) > 1 ORDER BY COUNT(*) DESC LIMIT 5;
title                                          | count
-----------------------------------------------+-------
Species code missing: tasks                    |    10
Species code missing: entity_species           |    10
Species code missing: agents                   |    10
Species code missing: system_issues            |    10
Species code missing: field_type_equivalences  |    10

280 rows in 1 day, all written 2026-04-02. Each title has exactly 10 duplicates → script ran 10 times, no dedup. All 280 are now status='resolved' (auto-cleaned up later or stale by retention).

dot-scanner (find_open_then_create + status='mở' bug):

SELECT COUNT(*), COUNT(violation_hash) AS with_hash FROM system_issues WHERE source_system = 'dot-scanner';
→ total=619, with_hash=0  (all violation_hash NULL)

619 rows from 2026-03-10 to 2026-03-14. INERT now (the status='mở' bug prevents new writes), but the pattern remains in the code.

dieu31-runner (find_open_then_create — fixed by S175v01):

  • 165 active rows
  • Last activity: 2026-04-09 02:49 (today)
  • Now blocked by idx_sysissue_open_violation_hash_unique

P2 — Impact Assessment

Live duplicate count

Source                | OPEN dups                        | Caught by S175v01?       | Notes
----------------------+----------------------------------+--------------------------+---------------------------------------------
dieu31-runner         | 0 (fixed by P1 of S175v01)       | YES                      | Active, protected
dot-scanner           | 0 OPEN (all resolved)            | NO (violation_hash NULL) | Inert, latent risk
dot-collection-health | 0 OPEN (all resolved/historical) | NO                       | Active script, will create new dups next run
dot-script-lint       | (need check)                     | NO                       | Active script
dot-apr-health        | (need check)                     | NO                       | Active script
SELECT source_system, source, status, COUNT(*) FROM system_issues
WHERE status = 'open' GROUP BY source_system, source, status;

Output (status=open only):

source_system  | source                    | status | count
---------------+---------------------------+--------+------
dieu31-runner  |                           | open   | ~165
dot-scanner    | dot-pg-audit-ensure       | open   | 0
dot-scanner    | dot-layer-integrity-audit | open   | 0

(All other open=0 because dot-scanner family is inert and BLIND_POST sources sit in resolved/archived).

Risk Rank

Rank     | Source                                       | Why
---------+----------------------------------------------+--------------------------------------------------------------------------------------------
HIGH 1   | dedupe.js autoResolveStale                   | Active sweeper, root cause of dup creation, status flicker → reopen = create. Catalyst for HIGH 2-4.
HIGH 2   | dot-collection-health                        | Active BLIND_POST, 28 issues × N runs = N×28 dups, no protection
HIGH 3   | dot-apr-health                               | Active BLIND_POST, same pattern as HIGH 2
HIGH 4   | dot-script-lint                              | Effectively BLIND_POST (ON CONFLICT DO NOTHING with no matching UNIQUE)
MED 5    | dedupe.js dedupeAndReport                    | Now caught by S175v01 partial UNIQUE → fails loud, but creates HTTP 400 noise on race
LATENT 6 | dot-layer-integrity-audit (writer + sweeper) | INERT due to status='mở' bug. If fixed without addressing dedup → HIGH
LATENT 7 | dot-registry-crosscheck                      | Same as 6
LATENT 8 | dot-pg-audit-ensure PG fn                    | Same as 6
LATENT 9 | dot-id-collision-check                       | Code exists but no rows in DB → never run or all cleaned. Would be HIGH if reactivated.

P3 — Permanent Fix Design

Root question

How do we make "status flicker → duplicate" impossible?

Option A — "Don't resolve, just mark stale"

Idea: Replace UPDATE status='resolved' with UPDATE last_stale_at=NOW(). Status only changes by user action or workflow, never automatically.

Implementation:

  • Add column last_stale_at TIMESTAMP
  • Rename autoResolveStale → autoMarkStale (sets a timestamp instead of status)
  • Reader/UI: filter by (status='open' AND last_stale_at IS NULL OR last_stale_at < NOW() - INTERVAL '24 hours' …)
  • Old data: keep status='resolved' rows, new ones use timestamp
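Option A's sweep/read split can be sketched as follows. This is a minimal sketch: the row shape and helper names are assumptions layered on the proposed last_stale_at column and 24h reader filter, and the liveness comparison is one reading of that filter that must be confirmed.

```javascript
// Option A sketch (illustrative row shape and helper names).
const STALE_WINDOW_MS = 24 * 60 * 60 * 1000; // the proposal's 24h interval

// Sweeper: stamp a timestamp instead of flipping status
function autoMarkStale(rows, seenHashes, now = Date.now()) {
  for (const row of rows) {
    if (row.status === 'open' && !seenHashes.has(row.violation_hash)) {
      row.last_stale_at = now; // timestamp only; status is never touched
    }
  }
}

// Reader: a FRESH stamp means "currently recovered" (the sweeper re-stamps a
// recovered issue every run), so hide it; an absent or old stamp means the
// issue is still failing and should be shown as live.
function isLive(row, now = Date.now()) {
  return row.status === 'open' &&
    (row.last_stale_at == null || now - row.last_stale_at >= STALE_WINDOW_MS);
}

const rows = [{ violation_hash: 'X', status: 'open', last_stale_at: null }];
autoMarkStale(rows, new Set());               // X absent from this scan → stamped
console.log(rows[0].status, isLive(rows[0])); // → open false
```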

Pros:

  • Simplest semantic: status is human/workflow truth, timestamps track activity
  • 1 row per logical_id, ever
  • No race condition

Cons:

  • Schema change (add column) → DDL on production
  • All writers + readers need to learn new model
  • Existing status='resolved' rows have unclear meaning (was it auto-resolved or human-resolved?)
  • Doesn't fix BLIND_POST pattern (those still create dups by not checking existing)

Effort: 4-6h (column + writer changes + reader changes) Risk: MED (semantic shift across many consumers)

Option B — "Reopen = UPDATE existing"

Idea: Change findExistingIssue to look at ANY status (not just open). On reopen, UPDATE the existing row back to status='open' + increment count, instead of CREATE.

Implementation:

  • dedupe.js line 28: remove &filter[status][_eq]=open filter
  • dedupeAndReport: when found in resolved/archived state, also set status='open' + clear resolved_at + increment occurrence_count
  • DB level: change partial UNIQUE WHERE status='open' → full UNIQUE on (violation_hash) WHERE violation_hash IS NOT NULL
  • Other writers (dot-collection-health, dot-apr-health, dot-script-lint): need to compute violation_hash (or another logical_id) + use same upsert logic
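The reopen-as-update flow can be sketched against an in-memory store so the control flow is testable. The field names (occurrence_count, resolved_at) come from this report; the helper itself is illustrative, not the final dedupe.js change.

```javascript
// Option B sketch: upsert by violation_hash, reopen instead of re-create.
function upsertIssue(store, violationHash, fields) {
  // Key change vs. findExistingIssue: match the hash across ALL statuses
  const existing = store.find(r => r.violation_hash === violationHash);
  if (!existing) {
    const row = { id: store.length + 1, violation_hash: violationHash,
                  status: 'open', occurrence_count: 1, resolved_at: null,
                  ...fields };
    store.push(row);
    return row;
  }
  // Reopen path: UPDATE the canonical row, never INSERT a sibling
  existing.status = 'open';
  existing.resolved_at = null;
  existing.occurrence_count += 1;
  return existing;
}

const store = [];
upsertIssue(store, 'X', { title: 'demo' }); // day N: create
store[0].status = 'resolved';               // sweeper flickers it
upsertIssue(store, 'X', { title: 'demo' }); // day N+2: reopen, no duplicate
console.log(store.length, store[0].status, store[0].occurrence_count);
// → 1 open 2
```

With the full UNIQUE on (violation_hash), any writer that bypasses this logic fails at the DB instead of silently duplicating.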

Pros:

  • 1 row per logical_id, ever (across all statuses)
  • Status truth is preserved (open = current, resolved = recovered)
  • DB constraint enforces invariant
  • Re-opens work naturally (UPDATE existing)
  • Backward compatible: existing rows stay (just need to dedupe before adding constraint)

Cons:

  • DB constraint requires dedupe of all historical rows first (~7+ groups in resolved state for dieu31-runner)
  • BLIND_POST writers must be refactored to compute logical_id
  • Loses per-incident history (only latest occurrence info on the row)
    • Mitigated by: occurrence_count + last_seen_at + audit log
  • Adding UNIQUE on full table is bigger change than partial

Effort: 6-10h (dedupe historical + writer refactor + UNIQUE add) Risk: MED (fewer consumer changes than A, but needs all writers to compute hash)

Option C — "Schema-enforced reopen via trigger"

Idea: BEFORE INSERT trigger checks if any row with same logical_id exists in any status. If yes, RAISE EXCEPTION with hint to UPDATE instead.

Implementation:

  • PG function fn_block_duplicate_logical_id_insert BEFORE INSERT on system_issues
  • If SELECT 1 FROM system_issues WHERE violation_hash = NEW.violation_hash exists → RAISE EXCEPTION 'Use UPDATE for existing logical_id %', NEW.violation_hash
  • For rows without violation_hash: use (source, title, entity_code, entity_type) composite

Pros:

  • Pure DB change, no writer code changes
  • Catches ALL writers including BLIND_POST
  • Educates developers via clear error message
  • 100% deterministic (DB enforced)

Cons:

  • Aggressive — breaks all writers immediately on deploy
  • BLIND_POST scripts will fail loudly (which IS the goal but causes operational pain)
  • Composite logical_id for non-hash writers is fragile (title text might change)
  • Trigger overhead on every INSERT
  • Doesn't address the "sweep resolve" pattern (sweep still resolves rows, just doesn't create duplicates)

Effort: 2-3h (write trigger + test), plus ~10h to fix all broken writers afterward Risk: HIGH if deployed without prep (production breakage)

Comparison

Criterion                              | Option A                            | Option B                           | Option C
---------------------------------------+-------------------------------------+------------------------------------+--------------------------
"Permanent" (1 row/logical_id forever) | YES                                 | YES                                | YES
"Zero chance of error"                 | Partial (BLIND_POST still bypasses) | YES (after writer refactor)        | YES (DB enforced)
Schema changes                         | New column                          | Constraint change                  | Trigger function
Writer code changes                    | Many readers + writers              | All non-dedupe.js writers          | None initially, then all
Operational risk                       | LOW (additive)                      | MED (constraint blocks bad writes) | HIGH (immediate breakage)
Effort                                 | 4-6h                                | 6-10h                              | 2-3h + ~10h cleanup

Recommendation

Option B + cleanup blind writers (phased):

Phase 1 (S175v02-fix-1): DB-side prep

  1. Dedupe historical rows: for each violation_hash with multiple rows, keep canonical (lowest id, highest occurrence_count, latest last_seen_at), archive rest
  2. Drop partial UNIQUE idx_sysissue_open_violation_hash_unique
  3. Add full UNIQUE idx_sysissue_violation_hash_unique on (violation_hash) WHERE violation_hash IS NOT NULL
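Step 1's canonical selection can be sketched like this. The report lists three criteria (lowest id, highest occurrence_count, latest last_seen_at) without an explicit precedence; the ordering below is one plausible reading and must be confirmed before running the dedupe.

```javascript
// Phase 1 sketch: pick the canonical row from a duplicate violation_hash group.
// Assumed precedence: most occurrences, then most recently seen, then lowest id.
function pickCanonical(rows) {
  return [...rows].sort((a, b) =>
    (b.occurrence_count - a.occurrence_count) ||                 // most occurrences wins
    (Date.parse(b.last_seen_at) - Date.parse(a.last_seen_at)) || // then most recent
    (a.id - b.id)                                                // then lowest id
  )[0];
}

// The 3896/3994 pair from P0: canonical should be the count=19 row
const group = [
  { id: 3896, occurrence_count: 19, last_seen_at: '2026-04-09' },
  { id: 3994, occurrence_count: 1,  last_seen_at: '2026-03-30' },
];
console.log(pickCanonical(group).id); // → 3896
```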

Phase 2 (S175v02-fix-2): dedupe.js writer

  4. Modify findExistingIssue to remove the status filter
  5. dedupeAndReport UPDATE branch sets status='open', clears resolved_at, increments count
  6. Smoke-test the run

Phase 3 (S175v02-fix-3): BLIND_POST writers

  7. dot-collection-health, dot-apr-health, dot-script-lint: add violation_hash computation (sha256 of source+title) and use the same upsert pattern
  8. Test each script

Phase 4 (S175v02-fix-4): latent writers

  9. dot-layer-integrity-audit, dot-registry-crosscheck, run_audit_to_issues: fix status='mở' → 'open' AND adopt the upsert pattern
  10. dot-id-collision-check: add upsert

Why Option B over C:

  • Less operational risk: phased rollout, each phase tested independently
  • Preserves existing data semantics
  • DB constraint comes after dedupe, so no breakage on deploy
  • Writer refactor is targeted (per script), not big-bang

Why Option B over A:

  • No schema column add (less migration overhead)
  • Status semantics preserved (open=current, resolved=recovered)
  • DB constraint is single source of truth, doesn't depend on consumer reading new column correctly

Manifesto mapping

  • "Vĩnh viễn" (permanent): Option B has a DB UNIQUE on (violation_hash) — 1 row per logical_id, forever. ✓
  • "Cơ hội nhầm = 0" (zero chance of error): after Phase 4, all writers share the same upsert logic and the DB constraint catches any bypass. ✓
  • "LỖI = cơ hội vàng tìm gốc" (an error is a golden opportunity to find the root): this audit IS that opportunity. The S175 root cause led to discovering 9 writers and 4 patterns; the comprehensive fix prevents the whole class of bugs. ✓

Files Referenced (file:line)

  • /opt/incomex/docker/nuxt-repo/scripts/integrity/dedupe.js:25-40 — findExistingIssue (status=open filter)
  • /opt/incomex/docker/nuxt-repo/scripts/integrity/dedupe.js:118-160 — dedupeAndReport (find_open_then_create)
  • /opt/incomex/docker/nuxt-repo/scripts/integrity/dedupe.js:169-214 — autoResolveStale (sweep resolver)
  • /opt/incomex/docker/nuxt-repo/scripts/integrity/main.js:154,361 — autoResolveStale call sites
  • /opt/incomex/docker/nuxt-repo/dot/bin/dot-layer-integrity-audit:395-490 — find_open_then_create + sweep
  • /opt/incomex/docker/nuxt-repo/dot/bin/dot-registry-crosscheck:230-260 — find_open_then_create
  • /opt/incomex/docker/nuxt-repo/dot/bin/dot-pg-audit-ensure:245-300 — PG function NOT_EXISTS_then_INSERT
  • /opt/incomex/dot/bin/dot-collection-health:60-80 — BLIND_POST (no find)
  • /opt/incomex/dot/bin/dot-apr-health:64-88 — BLIND_POST
  • /opt/incomex/dot/bin/dot-script-lint:43 — RAW SQL ON CONFLICT (no matching UNIQUE)
  • /opt/incomex/dot/bin/dot-id-collision-check:175 — BLIND_POST
  • /opt/incomex/docker/agent-data-repo/functions/timer_callback/main.py:172 — UPDATE_BY_ID (LOGIC_SAFE)

Cron Schedule (when these run)

0 */6 * * * /opt/incomex/deploys/web-test/scripts/integrity/cron-integrity.sh   # dedupe.js path, 4×/day
0 * * * *   /opt/incomex/deploys/web-test/scripts/integrity/watchdog-monitor.sh # 24×/day
0 */3 * * * /opt/incomex/deploys/web-test/scripts/integrity/scanner-counts.sh   # 8×/day

dot-collection-health, dot-apr-health, dot-script-lint, dot-layer-integrity-audit: not visible in crontab. They run manually or via other automation (CI/CD?); the schedule needs verification.

Unknowns / Need clarification

  1. dot-collection-health schedule: 280 rows on 2026-04-02 only. Why a single-day burst? Manual run? CI? Need to verify whether it is still being invoked.
  2. dot-scanner family inactive since 2026-03-14: Why stopped? Did someone manually disable? Need to confirm intentional or accidental.
  3. dot-script-lint ON CONFLICT semantics: Does it expect a UNIQUE on (title) that doesn't exist? Maybe a planned-but-not-applied schema change?
  4. dot-id-collision-check 0 rows: Code exists in /opt/incomex/dot/bin/ but never wrote anything. Inactive or never invoked?
  5. Composite logical_id for non-hash writers: For dot-collection-health ("Species code missing: tasks"), what should the natural unique key be? (source, title)? (source, entity_type, entity_code)? Need product decision.