S175-VACCINE-02 — autoResolveStale Pattern Deep Audit
Date: 2026-04-09 Mode: READ-ONLY audit + design (no fix executed) Agent: Claude Code Goal: Q1 — In how many places does the "upsert by code" pattern exist? Q2 — What is the permanent fix?
Summary
The "create-by-code-bug" pattern in system_issues is far more widespread than predicted. The audit found 9 writers, 4 distinct anti-patterns, and 3 active duplicate sources.
S175-VACCINE-01 (partial UNIQUE on (violation_hash) WHERE status='open') blocks only 1 of the 9 writers (dedupe.js, which uses violation_hash). The remaining 8 writers are not caught because they:
- use a different logical id ((entity_type, title), raw INSERT, blind POST), or
- set violation_hash=NULL (619 dot-scanner rows + 280 dot-collection-health rows)
Conclusion Q1: the pattern is widespread, not an isolated bug. Conclusion Q2: the fix must land at 3 layers — schema, writer, sweeper. Recommended: Option B ("Reopen = UPDATE existing") + cleanup of the blind POST writers.
P0 — The autoResolveStale mechanism, precisely
Code path
File: /opt/incomex/docker/nuxt-repo/scripts/integrity/dedupe.js
- /opt/incomex/deploys/web-test/scripts/integrity/dedupe.js is identical except for the hard env-check (S174-FIX-04)
- Call chain: cron-integrity.sh → main.js → require('./dedupe') → autoResolveStale
Trigger condition
Cron schedule: 0 */6 * * * (every 6h, 4 times/day)
0 */6 * * * /opt/incomex/deploys/web-test/scripts/integrity/cron-integrity.sh
Each cron run: scans 129 measurements → builds failedHashes set → calls autoResolveStale(failedHashes, runId).
Action
// dedupe.js lines 169-214
async function autoResolveStale(seenHashes, runId) {
  // Pagination loop:
  const url = `${DIRECTUS_URL}/items/system_issues?` +
    `filter[source_system][_eq]=dieu31-runner` +
    `&filter[status][_in]=open,archived` +
    `&filter[issue_class][_neq]=watchdog_fault` +
    `&limit=100&page=${page}`;
  // ...
  for (const item of items) {
    if (seenHashes.has(item.violation_hash)) continue;
    // NOT in current run → resolve
    await updateIssue(item.id, {
      status: 'resolved',
      resolved_at: now,
      resolution: `Auto-resolved: check passed in run ${runId}`,
    });
  }
}
Action: for each issue NOT in the current scan results → UPDATE status='resolved'. It UPDATEs the existing row; it does not INSERT.
Race window with createIssue
findExistingIssue (dedupe.js line 25-40):
const url = `${DIRECTUS_URL}/items/system_issues?filter[violation_hash][_eq]=${violationHash}&filter[status][_eq]=open&limit=1`;
Critical: Filter is status=open. Resolved/archived rows are INVISIBLE to writer.
Reproducible duplicate scenario
Day N: violation X exists → CREATE row A (status='open', hash=X)
Day N+1: violation X hits a transient false negative (DB query timeout, network blip) → autoResolveStale resolves row A
Day N+2: violation X reappears → findExistingIssue(X, status=open) → NOT FOUND (row A is now resolved!) → CREATE row B (status='open', hash=X)
Result: row A (resolved, X) + row B (open, X) — DUPLICATE created via status flicker
This matches exactly the 2 OPEN pairs we found in S175-VACCINE-01:
3896 first_seen=2026-03-28 (canonical, count=19, last 2026-04-09)
3994 first_seen=2026-03-30 (orphan, count=1, never re-seen) ← created 2 days later by reopen-as-create
(For 9cc0eb pair: 3884 created 2026-03-28, 3999 created 2026-04-01 — same pattern, 4 days apart)
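The flicker sequence can be reproduced with a minimal in-memory model. This is a sketch only — `issues`, `createOrSkip`, and the simplified `findExistingIssue`/`autoResolveStale` are hypothetical stand-ins for the Directus-backed calls in dedupe.js:

```javascript
// In-memory model of the status-flicker race.
const issues = [];

// Mirrors the buggy filter: only status='open' rows are visible to the writer.
function findExistingIssue(hash) {
  return issues.find((i) => i.violation_hash === hash && i.status === 'open');
}

function createOrSkip(hash) {
  if (!findExistingIssue(hash)) {
    issues.push({ id: issues.length + 1, violation_hash: hash, status: 'open' });
  }
}

// Mirrors autoResolveStale: anything not in the current run's seen set gets resolved.
function autoResolveStale(seenHashes) {
  for (const i of issues) {
    if (i.status === 'open' && !seenHashes.has(i.violation_hash)) i.status = 'resolved';
  }
}

// Day N: violation X seen → row A created.
createOrSkip('X');
// Day N+1: transient false negative → X absent from seen set → row A resolved.
autoResolveStale(new Set());
// Day N+2: X reappears → row A invisible (resolved) → duplicate row B created.
createOrSkip('X');

console.log(issues.filter((i) => i.violation_hash === 'X').length); // 2 — the duplicate
```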
P1 — Pattern Scan: 9 Writers, 4 Anti-patterns
Pattern Matrix
| # | Writer | Pattern | Logical ID | Status | Risk |
|---|---|---|---|---|---|
| 1 | dedupe.js dedupeAndReport | find_open_then_create | violation_hash | ACTIVE | MED (S175v01 catches now) |
| 2 | dedupe.js autoResolveStale | sweep_resolve_by_seen | violation_hash | ACTIVE | HIGH (root cause of original dups) |
| 3 | dot-layer-integrity-audit | find_open_then_create | (entity_type, title) | INERT (status='mở' fails CHECK) | LATENT HIGH |
| 4 | dot-layer-integrity-audit sweeper | sweep_resolve_by_title | (entity_type, title) | INERT | LATENT MED |
| 5 | dot-registry-crosscheck | find_open_then_create | (entity_type, source) | INERT (status='mở' fails CHECK) | LATENT HIGH |
| 6 | dot-pg-audit-ensure PG fn run_audit_to_issues | NOT_EXISTS_then_INSERT | (entity_code, issue_type) | INERT (status='mở' fails CHECK) | LATENT HIGH |
| 7 | dot-collection-health | BLIND_POST | none | ACTIVE | HIGH (280 dups confirmed) |
| 8 | dot-apr-health | BLIND_POST | none | ACTIVE | HIGH |
| 9 | dot-script-lint | RAW_SQL_INSERT_ON_CONFLICT_DO_NOTHING (no matching UNIQUE) | code (auto-gen, never collides) | ACTIVE | HIGH (effectively blind POST) |
| 10 | dot-id-collision-check | BLIND_POST | none | UNKNOWN (0 rows in DB) | LATENT HIGH |
| 11 | timer_callback/main.py | UPDATE_BY_ID | id (PK) | ACTIVE | LOGIC_SAFE (different table: ai_discussions) |
| 12 | approval_requests writers | (DB constraint enforced) | (request_type, target_collection, target_entity_code, issue_signature) | ACTIVE | LOGIC_SAFE (idx_apr_dedupe partial UNIQUE) |
4 Distinct Anti-patterns
Anti-pattern A: "find_open_then_create" (3 sources)
- Query for an existing row with a status='open' filter
- If not found → POST CREATE
- Bug: resolved/archived rows are invisible → reopen creates a duplicate
- Sources: dedupe.js, dot-layer-integrity-audit, dot-registry-crosscheck
Anti-pattern B: "sweep_resolve_by_seen" (2 sources)
- Periodically iterate all open issues
- For each: if not in the current scan's "seen" set → mark resolved
- Bug: transient false negative → row gets resolved → next run creates a duplicate
- Sources: dedupe.js autoResolveStale, dot-layer-integrity-audit sweeper
- This is the primary catalyst for Anti-pattern A
Anti-pattern C: "BLIND_POST" (3-4 sources)
- POST every time, no find-existing check at all
- Each script run = N new rows
- Sources: dot-collection-health (280 confirmed dups), dot-apr-health, dot-id-collision-check, dot-script-lint (effectively)
Anti-pattern D: "NOT_EXISTS_then_INSERT in PL/pgSQL" (1 source)
- Stored function loops through audit results
- For each: IF NOT EXISTS (SELECT 1 ... WHERE entity_code AND status='mở') THEN INSERT
- Same find-then-create race as A, but in the DB (which has slightly better isolation)
- Source: run_audit_to_issues() PG function
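The racy shape versus its atomic equivalent can be sketched as follows. This is illustrative only — the actual run_audit_to_issues() body is not reproduced here, the column list is simplified, and the ON CONFLICT form assumes a UNIQUE on (entity_code, issue_type) that does not exist yet (see P3):

```sql
-- Racy (current shape, sketch): two statements, another writer
-- can INSERT between the check and the insert.
-- IF NOT EXISTS (SELECT 1 FROM system_issues
--                WHERE entity_code = r.entity_code AND status = 'mở') THEN
--   INSERT INTO system_issues (entity_code, issue_type, status) VALUES (...);
-- END IF;

-- Atomic alternative: single statement, but requires a matching
-- UNIQUE constraint on (entity_code, issue_type) -- an assumption here.
INSERT INTO system_issues (entity_code, issue_type, status)
VALUES ('ent-001', 'missing_field', 'open')
ON CONFLICT (entity_code, issue_type)
DO UPDATE SET status = 'open', resolved_at = NULL;
```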
Active duplicate evidence
dot-collection-health (BLIND_POST confirmed):
SELECT title, COUNT(*) FROM system_issues
WHERE source = 'dot-collection-health'
GROUP BY title HAVING COUNT(*) > 1 ORDER BY COUNT(*) DESC LIMIT 5;
title | count
-----------------------------------------------+-------
Species code missing: tasks | 10
Species code missing: entity_species | 10
Species code missing: agents | 10
Species code missing: system_issues | 10
Species code missing: field_type_equivalences | 10
280 rows in 1 day, all written 2026-04-02. Each title has exactly 10 duplicates → the script ran 10 times with no dedup. All 280 are now status='resolved' (cleaned up automatically later, or aged out by retention).
dot-scanner (find_open_then_create + status='mở' bug):
SELECT COUNT(*), COUNT(violation_hash) AS with_hash FROM system_issues WHERE source_system = 'dot-scanner';
→ total=619, with_hash=0 (all violation_hash NULL)
619 rows from 2026-03-10 to 2026-03-14 — INERT now (status='mở' bug prevents new writes). But pattern exists.
dieu31-runner (find_open_then_create — fixed by S175v01):
- 165 active rows
- Last activity: 2026-04-09 02:49 (today)
- Now blocked by idx_sysissue_open_violation_hash_unique
P2 — Impact Assessment
Live duplicate count
| Source | OPEN dups | Caught by S175v01? | Notes |
|---|---|---|---|
| dieu31-runner | 0 (fixed by P1 of S175v01) | YES | Active, protected |
| dot-scanner | 0 OPEN (all resolved) | NO (violation_hash NULL) | Inert, latent risk |
| dot-collection-health | 0 OPEN (all resolved/historical) | NO | Active script, will create new dups next run |
| dot-script-lint | (need check) | NO | Active script |
| dot-apr-health | (need check) | NO | Active script |
SELECT source_system, source, status, COUNT(*) FROM system_issues
WHERE status = 'open' GROUP BY source_system, source, status;
Output (status=open only):
source_system | source | status | count
---------------+---------------------------+--------+------
dieu31-runner | | open | ~165
dot-scanner | dot-pg-audit-ensure | open | 0
dot-scanner | dot-layer-integrity-audit | open | 0
(All other open counts are 0 because the dot-scanner family is inert and the BLIND_POST sources sit in resolved/archived.)
Risk Rank
| Rank | Source | Why |
|---|---|---|
| HIGH 1 | dedupe.js autoResolveStale | Active sweeper, root cause of dup creation, status flicker → reopen = create. Catalyst for HIGH 2-4. |
| HIGH 2 | dot-collection-health | Active BLIND_POST, 28 issues × N runs = N×28 dups, no protection |
| HIGH 3 | dot-apr-health | Active BLIND_POST, same pattern as HIGH 2 |
| HIGH 4 | dot-script-lint | Effectively BLIND_POST (ON CONFLICT DO NOTHING with no matching UNIQUE) |
| MED 5 | dedupe.js dedupeAndReport | Now caught by S175v01 partial UNIQUE → fails loud, but creates HTTP 400 noise on race |
| LATENT 6 | dot-layer-integrity-audit (writer + sweeper) | INERT due to status='mở' bug. If fixed without addressing dedup → HIGH |
| LATENT 7 | dot-registry-crosscheck | Same as 6 |
| LATENT 8 | dot-pg-audit-ensure PG fn | Same as 6 |
| LATENT 9 | dot-id-collision-check | Code exists but no rows in DB → never run or all cleaned. Would be HIGH if reactivated. |
P3 — Permanent Fix Design
Root question
How do we make "status flicker → duplicate" impossible?
Option A — "Don't resolve, just mark stale"
Idea: Replace UPDATE status='resolved' with UPDATE last_stale_at=NOW().
Status only changes by user action or workflow, never automatically.
Implementation:
- Add column last_stale_at TIMESTAMP
- autoResolveStale → autoMarkStale (sets the timestamp instead of status)
- Reader/UI: filter by (status='open' AND (last_stale_at IS NULL OR last_stale_at < NOW() - INTERVAL '24 hours' …))
- Old data: keep status='resolved' rows, new ones use the timestamp
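Under Option A the sweeper change is small. A sketch — `fetchOpenIssues` and `updateIssue` are hypothetical stand-ins for the dedupe.js helpers, and the `last_stale_at` column is the proposal above, not an existing field:

```javascript
// Option A sketch: record staleness with a timestamp, never change status.
async function autoMarkStale(seenHashes, fetchOpenIssues, updateIssue) {
  const now = new Date().toISOString();
  for (const item of await fetchOpenIssues()) {
    if (seenHashes.has(item.violation_hash)) {
      // Still failing in this run: clear any previous stale mark.
      if (item.last_stale_at) await updateIssue(item.id, { last_stale_at: null });
    } else {
      // Not seen this run: stamp it; status stays whatever the human/workflow set.
      await updateIssue(item.id, { last_stale_at: now });
    }
  }
}
```

A transient false negative now only moves a timestamp, so the writer's status='open' lookup keeps finding the same row and no duplicate can be created.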
Pros:
- Simplest semantic: status is human/workflow truth, timestamps track activity
- 1 row per logical_id, ever
- No race condition
Cons:
- Schema change (add column) → DDL on production
- All writers + readers need to learn new model
- Existing status='resolved' rows have unclear meaning (was it auto-resolved or human-resolved?)
- Doesn't fix the BLIND_POST pattern (those still create dups by not checking existing)
Effort: 4-6h (column + writer changes + reader changes) Risk: MED (semantic shift across many consumers)
Option B — "Reopen = UPDATE existing"
Idea: Change findExistingIssue to look at ANY status (not just open). On reopen, UPDATE the existing row back to status='open' + increment count, instead of CREATE.
Implementation:
- dedupe.js line 28: remove the &filter[status][_eq]=open filter
- dedupeAndReport: when found in resolved/archived state, also set status='open' + clear resolved_at + increment occurrence_count
- DB level: change the partial UNIQUE WHERE status='open' → full UNIQUE on (violation_hash) WHERE violation_hash IS NOT NULL
- Other writers (dot-collection-health, dot-apr-health, dot-script-lint): need to compute violation_hash (or another logical_id) + use the same upsert logic
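The Option B writer path can be sketched like this — `store` and its `findByHash`/`create`/`update` methods are hypothetical stand-ins for the Directus calls in dedupe.js, not existing APIs:

```javascript
// Option B sketch: look up by hash across ALL statuses; reopen via UPDATE
// instead of creating a second row.
async function upsertIssue(store, hash, fields) {
  const existing = await store.findByHash(hash); // no status filter
  if (!existing) {
    return store.create({
      ...fields, violation_hash: hash, status: 'open', occurrence_count: 1,
    });
  }
  // Reopen path: same row, back to open, occurrence count bumped.
  return store.update(existing.id, {
    ...fields,
    status: 'open',
    resolved_at: null,
    occurrence_count: (existing.occurrence_count || 0) + 1,
  });
}
```

Because the lookup sees resolved rows, a status flicker followed by a reappearing violation updates row A instead of creating row B.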
Pros:
- 1 row per logical_id, ever (across all statuses)
- Status truth is preserved (open = current, resolved = recovered)
- DB constraint enforces invariant
- Re-opens work naturally (UPDATE existing)
- Backward compatible: existing rows stay (just need to dedupe before adding constraint)
Cons:
- DB constraint requires dedupe of all historical rows first (~7+ groups in resolved state for dieu31-runner)
- BLIND_POST writers must be refactored to compute logical_id
- Loses per-incident history (only latest occurrence info on the row)
- Mitigated by: occurrence_count + last_seen_at + audit log
- Adding UNIQUE on full table is bigger change than partial
Effort: 6-10h (dedupe historical + writer refactor + UNIQUE add) Risk: MED (fewer consumer changes than A, but needs all writers to compute hash)
Option C — "Schema-enforced reopen via trigger"
Idea: BEFORE INSERT trigger checks if any row with same logical_id exists in any status. If yes, RAISE EXCEPTION with hint to UPDATE instead.
Implementation:
- PG function
fn_block_duplicate_logical_id_insertBEFORE INSERT on system_issues - If
SELECT 1 FROM system_issues WHERE violation_hash = NEW.violation_hashexists →RAISE EXCEPTION 'Use UPDATE for existing logical_id %', NEW.violation_hash - For rows without violation_hash: use
(source, title, entity_code, entity_type)composite
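A sketch of the trigger shape — the function and trigger names are the proposal above (they do not exist yet), and the composite fallback for hash-less rows is omitted for brevity:

```sql
-- Sketch only: block INSERTs whose logical_id already exists in ANY status.
CREATE OR REPLACE FUNCTION fn_block_duplicate_logical_id_insert()
RETURNS trigger AS $$
BEGIN
  IF NEW.violation_hash IS NOT NULL AND EXISTS (
    SELECT 1 FROM system_issues WHERE violation_hash = NEW.violation_hash
  ) THEN
    RAISE EXCEPTION 'Use UPDATE for existing logical_id %', NEW.violation_hash;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_block_duplicate_logical_id
  BEFORE INSERT ON system_issues
  FOR EACH ROW EXECUTE FUNCTION fn_block_duplicate_logical_id_insert();
```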
Pros:
- Pure DB change, no writer code changes
- Catches ALL writers including BLIND_POST
- Educates developers via clear error message
- 100% deterministic (DB enforced)
Cons:
- Aggressive — breaks all writers immediately on deploy
- BLIND_POST scripts will fail loudly (which IS the goal but causes operational pain)
- Composite logical_id for non-hash writers is fragile (title text might change)
- Trigger overhead on every INSERT
- Doesn't address the "sweep resolve" pattern (sweep still resolves rows, just doesn't create duplicates)
Effort: 2-3h (write trigger + test) but plus ~10h to fix all broken writers afterward Risk: HIGH if deployed without prep (production breakage)
Comparison
| Criterion | Option A | Option B | Option C |
|---|---|---|---|
| "Permanent" (1 row/logical_id forever) | YES | YES | YES |
| "Zero chance of error" | Partial (BLIND_POST still bypasses) | YES (after writer refactor) | YES (DB enforced) |
| Schema changes | New column | Constraint change | Trigger function |
| Writer code changes | Many readers + writers | All non-dedupe.js writers | None initially, then all |
| Operational risk | LOW (additive) | MED (constraint blocks bad writes) | HIGH (immediate breakage) |
| Effort | 4-6h | 6-10h | 2-3h + ~10h cleanup |
Recommendation
Option B + cleanup blind writers (phased):
Phase 1 (S175v02-fix-1): DB-side prep
- Dedupe historical rows: for each violation_hash with multiple rows, keep the canonical one (lowest id, highest occurrence_count, latest last_seen_at), archive the rest
- Drop the partial UNIQUE idx_sysissue_open_violation_hash_unique
- Add a full UNIQUE idx_sysissue_violation_hash_unique on (violation_hash) WHERE violation_hash IS NOT NULL
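Phase 1 could look roughly like this. A sketch under assumptions: the canonical-row choice is simplified to "lowest id per hash", and archiving here also NULLs violation_hash on the losers — one way to satisfy the new index, since the full UNIQUE spans all statuses and would otherwise be blocked by archived duplicates:

```sql
-- Sketch: neutralize non-canonical duplicates, then swap the UNIQUE index.
UPDATE system_issues s
SET status = 'archived', violation_hash = NULL
WHERE s.violation_hash IS NOT NULL
  AND s.id <> (SELECT MIN(id) FROM system_issues c
               WHERE c.violation_hash = s.violation_hash);

DROP INDEX IF EXISTS idx_sysissue_open_violation_hash_unique;

CREATE UNIQUE INDEX idx_sysissue_violation_hash_unique
  ON system_issues (violation_hash)
  WHERE violation_hash IS NOT NULL;
```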
Phase 2 (S175v02-fix-2): dedupe.js writer
4. Modify findExistingIssue to remove status filter
5. dedupeAndReport UPDATE branch sets status='open', clears resolved_at, increments count
6. Test smoke run
Phase 3 (S175v02-fix-3): BLIND_POST writers
7. dot-collection-health, dot-apr-health, dot-script-lint: add violation_hash computation (sha256 of source+title) and use upsert pattern
8. Test each script
Phase 4 (S175v02-fix-4): Latent writers
9. dot-layer-integrity-audit, dot-registry-crosscheck, run_audit_to_issues: fix status='mở' → 'open' AND adopt upsert pattern
10. dot-id-collision-check: add upsert
Why Option B over C:
- Less operational risk: phased rollout, each phase tested independently
- Preserves existing data semantics
- DB constraint comes after dedupe, so no breakage on deploy
- Writer refactor is targeted (per script), not big-bang
Why Option B over A:
- No schema column add (less migration overhead)
- Status semantics preserved (open=current, resolved=recovered)
- DB constraint is single source of truth, doesn't depend on consumer reading new column correctly
Manifesto mapping
- ① Permanent: Option B has a DB UNIQUE on (violation_hash) — 1 row per logical_id, forever. ✓
- ② Zero chance of error: after Phase 4, all writers use the same upsert logic + the DB constraint catches any bypass. ✓
- ③ ERROR = golden opportunity to find the root cause: this audit IS that opportunity. The S175 root cause led to discovering 9 writers and 4 patterns. The comprehensive fix prevents a whole class of bugs. ✓
Files Referenced (file:line)
- /opt/incomex/docker/nuxt-repo/scripts/integrity/dedupe.js:25-40 — findExistingIssue (status=open filter)
- /opt/incomex/docker/nuxt-repo/scripts/integrity/dedupe.js:118-160 — dedupeAndReport (find_open_then_create)
- /opt/incomex/docker/nuxt-repo/scripts/integrity/dedupe.js:169-214 — autoResolveStale (sweep resolver)
- /opt/incomex/docker/nuxt-repo/scripts/integrity/main.js:154,361 — autoResolveStale call sites
- /opt/incomex/docker/nuxt-repo/dot/bin/dot-layer-integrity-audit:395-490 — find_open_then_create + sweep
- /opt/incomex/docker/nuxt-repo/dot/bin/dot-registry-crosscheck:230-260 — find_open_then_create
- /opt/incomex/docker/nuxt-repo/dot/bin/dot-pg-audit-ensure:245-300 — PG function NOT_EXISTS_then_INSERT
- /opt/incomex/dot/bin/dot-collection-health:60-80 — BLIND_POST (no find)
- /opt/incomex/dot/bin/dot-apr-health:64-88 — BLIND_POST
- /opt/incomex/dot/bin/dot-script-lint:43 — RAW SQL ON CONFLICT (no matching UNIQUE)
- /opt/incomex/dot/bin/dot-id-collision-check:175 — BLIND_POST
- /opt/incomex/docker/agent-data-repo/functions/timer_callback/main.py:172 — UPDATE_BY_ID (LOGIC_SAFE)
Cron Schedule (when these run)
0 */6 * * * /opt/incomex/deploys/web-test/scripts/integrity/cron-integrity.sh # dedupe.js path, 4×/day
0 * * * * /opt/incomex/deploys/web-test/scripts/integrity/watchdog-monitor.sh # 24×/day
0 */3 * * * /opt/incomex/deploys/web-test/scripts/integrity/scanner-counts.sh # 8×/day
dot-collection-health, dot-apr-health, dot-script-lint, dot-layer-integrity-audit: not visible in the crontab. They run manually or via other automation (CI/CD?). Schedule needs verification.
Unknowns / Need clarification
- dot-collection-health schedule: 280 rows on 2026-04-02 only. Why 1 day burst? Manual? CI? Need to verify if it's still being invoked.
- dot-scanner family inactive since 2026-03-14: Why stopped? Did someone manually disable? Need to confirm intentional or accidental.
- dot-script-lint ON CONFLICT semantics: Does it expect a UNIQUE on (title) that doesn't exist? Maybe a planned-but-not-applied schema change?
- dot-id-collision-check 0 rows: Code exists in /opt/incomex/dot/bin/ but never wrote anything. Inactive or never invoked?
- Composite logical_id for non-hash writers: for dot-collection-health ("Species code missing: tasks"), what should the natural unique key be? (source, title)? (source, entity_type, entity_code)? Needs a product decision.