S175-VACCINE-02B — Deep WHY Audit (Tại sao có bệnh?)

Date: 2026-04-09 Mode: READ-ONLY archaeology + design Agent: Claude Code Goal: Trả lời 5 câu WHY về root cause của 9 writers / 4 anti-patterns

Summary

Vòng 1 (s175v02-autoresolvestale-audit) phát hiện 9 writers / 4 anti-patterns. Vòng 2 này điều tra TẠI SAO chúng tồn tại.

Câu trả lời cốt lõi: 9 writers được tạo trong 21 ngày (2026-03-10 → 2026-03-31) bởi AI-assisted development với:

Template propagation: copy-paste log_system_issue() byte-identical giữa các script
Mindset gap: AI default to procedural "SELECT-then-INSERT" thay vì DB constraint
Triple guideline gap: anti-patterns.md (21 patterns, 0 về upsert) + operating-rules.md (0 nội dung về DB constraint) + CI (11 workflows, 0 check duplicate)
No knowledge transfer: mỗi session/PR reinvent cùng pattern, không học từ session trước
Iterative dev tests: 5 PRs/1 ngày trên dot-collection-health = 5×28 = 140-280 dups (smoking gun explained)

WHY-1: AI viết ra 4 anti-patterns? Khi nào?

Timeline (git blame)

#	Writer	Origin commit	Date	Session/PR	Author	Pattern from start?
1	`dedupe.js findExistingIssue`	`41ba16f`	2026-03-23	S131-M3 (#578)	Huyên	YES
2	`dedupe.js dedupeAndReport`	`4003661`	2026-03-25	S161F (#620)	Huyên	YES
3	`dedupe.js autoResolveStale`	`b1fe6fe`	2026-03-25	S164 (#626)	Huyên	YES
4	`dot-layer-integrity-audit`	`fd770ca`	2026-03-26	S165 (#629)	Huyên	YES
5	`dot-registry-crosscheck`	`d87957b`	2026-03-10	S109 P1 (#476)	Huyên	YES
6	`dot-pg-audit-ensure` PG fn	`30b0a5f`	2026-03-14	S121 (#506)	Huyên	YES
7	`dot-id-collision-check`	`f699163/c5fcd77`	2026-03-12	Mission-E (#486/#487)	Huyên	YES
8	`dot-collection-health`	`75faf48`	2026-03-31	S155 (#655)	Huyên	YES
9	`dot-apr-health`	`75faf48`	2026-03-31	S155 (#655)	Huyên	YES

Findings

Window: 21 ngày (2026-03-10 → 2026-03-31)
Single git author for all 9: Nguyễn Minh Huyên (PR commits — actual code generated by AI assistants per project's documented workflow)
Pattern present from initial commit for ALL writers — none are later regressions
Sessions span: S109, S121, Mission-E, S131-M3, S155, S161F, S164, S165 — 8 different sessions, each one creating writers with same bug class
No session learned from previous: S121 had the bug, S131-M3 had it, S164/S165 had it. No fix between sessions.

Intent classification

Writer	Intent	Evidence
dedupe.js	DELIBERATE design (dedup engine)	"prevents duplicate system_issues" comment line 2
dot-layer-integrity-audit	DELIBERATE (find existing first)	comment "Check for existing open issue"
dot-registry-crosscheck	DELIBERATE	similar pattern
dot-pg-audit-ensure	DELIBERATE (NOT EXISTS)	PL/pgSQL with explicit "-- Skip if already exists" comment
dot-id-collision-check	NEGLIGENT	NO existence check at all, just POST
dot-collection-health	NEGLIGENT (template copy)	byte-identical `log_system_issue` from dot-apr-health
dot-apr-health	NEGLIGENT (template copy)	byte-identical `log_system_issue` from dot-collection-health
dot-script-lint	CARGO_CULT	`ON CONFLICT DO NOTHING` without column clause (broken syntax)

Pattern of failure:

DELIBERATE designs (5/9): Author intended to dedup but used wrong abstraction (REST API instead of DB constraint)
NEGLIGENT (3/9): No dedup at all (BLIND_POST family)
CARGO_CULT (1/9): Used ON CONFLICT syntax without understanding it

WHY-2: 4 anti-patterns có gốc chung?

Code DNA evidence: copy-paste smoking gun

dot-collection-health log_system_issue vs dot-apr-health log_system_issue:

# Both committed in same PR #655 (S155, 2026-03-31)
# Diff is ONLY: `source` value + `severity` value. Body byte-identical.

# dot-collection-health:
log_system_issue() {
  local TITLE="$1" ISSUE_TYPE="$2" DETAIL="$3" BASE_URL="$4"
  curl -sS --globoff -k -X POST "${BASE_URL}/items/system_issues" \
    -H "Authorization: Bearer $DOT_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$(jq -n \
      --arg title "$TITLE" --arg it "$ISSUE_TYPE" --arg desc "$DETAIL" \
      '{ title: $title, issue_type: $it, description: $desc,
         source: "dot-collection-health",  # ← only diff
         severity: "warning",                # ← only diff
         status: "open", detected_at: (now | todate) }')" \
    > /dev/null 2>&1 || true
}

# dot-apr-health:
log_system_issue() {
  local TITLE="$1" ISSUE_TYPE="$2" DETAIL="$3" BASE_URL="$4"
  curl -sS --globoff -k -X POST "${BASE_URL}/items/system_issues" \
    -H "Authorization: Bearer $DOT_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$(jq -n \
      --arg title "$TITLE" --arg it "$ISSUE_TYPE" --arg desc "$DETAIL" \
      --arg src "dot-apr-health" \               # ← only diff
      '{ title: $title, issue_type: $it, description: $desc,
         source: $src,
         severity: "info",                        # ← only diff
         status: "open", detected_at: (now | todate) }')" \
    > /dev/null 2>&1 || true
}

→ Same template, same PR, same author, same bug. Direct copy-paste.

dot-layer-integrity-audit vs dot-registry-crosscheck

# dot-registry-crosscheck (S109, 2026-03-10) — ORIGINAL
et_filter = urllib.parse.quote(issue['entity_type'])
existing = api_get(
    f'/items/system_issues?filter[entity_type][_eq]={et_filter}'
    f'&filter[source][_eq]=dot-registry-crosscheck'
    f'&filter[status][_in]=mở,đang_xử_lý'
    f'&limit=1&fields=id'
)
if existing.get('data', []):
    skipped += 1
    continue
result = api_post('/items/system_issues', {... 'status': 'mở' ...})

# dot-layer-integrity-audit (S165, 2026-03-26) — COPY+EXTEND
et_filter = urllib.parse.quote(issue['entity_type'])
title_filter = urllib.parse.quote(issue['title'])  # ← added
existing = api_get(
    f'/items/system_issues?filter[entity_type][_eq]={et_filter}'
    f'&filter[title][_eq]={title_filter}'           # ← added
    f'&filter[status][_in]=mở,đang_xử_lý'
    f'&limit=1&fields=id'
)
if existing.get('data', []):
    skipped += 1
    continue
# ... (identical structure)

→ S165 author copied S109's pattern, added one filter, never questioned the underlying race condition. 16-day gap, same bug propagated.

Common mindset gap

"Treat Directus REST API as the only abstraction layer"

The author/AI consistently:

Used api_get + filter[...][_eq]=X + limit=1 instead of PG SELECT ... FOR UPDATE
Used api_post instead of PG INSERT ... ON CONFLICT (col) DO UPDATE
Filtered by status='open' (or 'mở') believing this guards against duplicates
Never thought about: race condition, transaction isolation, DB constraint as truth

Why this mindset?

Directus REST API doesn't expose ON CONFLICT semantics
AI prompt likely "create record if not exists" → AI literal interpretation: SELECT then INSERT
No reference template in repo showing the CORRECT pattern (atomic upsert via PG)
"dedupe.js" was the most authoritative pattern but itself has the bug → propagated as best practice

Cargo cult evidence (dot-script-lint):

INSERT INTO system_issues (...) VALUES (...) ON CONFLICT DO NOTHING;
-- ON CONFLICT WITHOUT (column) clause defaults to PK conflict
-- PK = auto-generated id → never collides → DO NOTHING never fires
-- Effectively blind insert with extra syntax for show

The author/AI knew ON CONFLICT DO NOTHING exists as a PG idiom but didn't understand it requires (column) to specify which constraint. Knowledge of name without knowledge of semantics.

WHY-3: Tại sao lọt qua review/CI/test?

Triple guideline gap

Layer 1 — anti-patterns.md (rev 9, 21 patterns):

Listed: AP-01 to AP-21 covering Assembly Gate, deploy, branch protection, filter hiding, manual deploy, automation, config-first, DOT bypass, junction handling, dynamic pages, accuracy verify, law-reading, background agent, schema bypass (ALTER TABLE), Directus FK, API vs PG access, positive/negative verify, retroactive law, enacted content overwrite, hardcoded enum.

MISSING (relevant to S175 root cause):

❌ AP-X: "Upsert by code without DB constraint"
❌ AP-X: "Find-then-create race condition"
❌ AP-X: "Status flicker creates duplicate via reopen-as-create"
❌ AP-X: "BLIND_POST without dedup logic"
❌ AP-X: "ON CONFLICT without column clause"

Result: 0 of the 4 anti-patterns from vòng 1 are documented in the registry that authors are supposed to read before each mission.

Layer 2 — operating-rules.md (rev 48, 5727 chars):

$ grep -ic 'UNIQUE\|CONSTRAINT\|ON CONFLICT\|upsert\|race\|dedupe' operating-rules.md
0  (the 1 'CONSTRAINT' match is just a section header for mission template)

0 mentions of DB constraints, race conditions, or upsert pattern.

Layer 3 — CI workflows (.github/workflows/, 11 files):

deploy-vps.yml         deploy.yml
dieu31-vps-setup.yml   e2e-test.yml
enforce-table-rule.yml guard_critical_files.yml
nuxt-ci.yml            ops-smoke.yml
post-deploy-e2e.yml    required-docs-guard.yml
sync-check.yml

grep cho duplicate|upsert|UNIQUE → only false positives (unique from jq filter, race from "trace"). 0 CI checks for upsert pattern or duplicate prevention.

Layer 4 — pre-commit hooks:

$ ls /opt/incomex/docker/nuxt-repo/.git/hooks/pre-commit
(does not exist)
$ find /opt/incomex/docker/nuxt-repo -name '.husky' -o -name '.pre-commit*'
(none)

No client-side enforcement.

Code review evidence

PRs #476, #487, #506, #578, #620, #626, #629, #655 all merged with the bug pattern. Either:

Reviewers (or AI co-reviewer) didn't catch it
Or no review happened (single-author repo with self-merge)
Or review focused on functionality not architecture

enforce-table-rule.yml exists but doesn't cover this — its name suggests Nuxt page rules, not DB schema rules.

Conclusion WHY-3: Code lọt qua vì không có ai/cái gì kiểm tra. Triple gap: guidelines (no rule), CI (no test), review (no checklist). The bug class was invisible to all 3 layers.

WHY-4: 280 duplicates ngày 2026-04-02

git log 2026-04-02 commits touching dot-collection-health

03532f4 fix(health): H15a read permissions include locked collections (#666)
d4d5ae7 feat(health): add governance_role 'locked' support (Dieu 38 config tables) (#665)
0c8816d perf(health): bulk SQL for checks 1,2,11 — eliminate per-collection loops (#663)
982b68f fix(health): H14+H15 PG-only, H17 retroactive, Check 9 no hardcode (#662)
3f26b28 Điều 37 P1: DOT collection-create v2 + collection-health v2 (#661)

5 PRs in 1 day modifying dot-collection-health. Plus PR #664 (security commit, possibly triggered redeploy).

Script changelog corroborates

v2.2.0 (2026-04-02): +governance_role 'locked' support  ← #665
v2.1.1 (2026-04-02): Checks 1,2,11 bulk SQL              ← #663
v2.1.0 (2026-04-02): H14+H15 PG-only, H17 retroactive    ← #662
v2.0.0 (2026-04-02): +H14 FK constraints, +H15...        ← #661 (Điều 37 P1)

Math

28 unique titles per script run (from earlier P1 query: Species code missing: tasks/agents/...)
280 rows total / 28 titles = exactly 10 runs
5-6 PRs × 1-2 dev test invocations each ≈ 8-12 runs
10 falls within 8-12 range → matches iterative dev cycle

Reconstruction

Each PR cycle: developer (or AI agent) edits script → manual run to test → push PR → CI runs → merge → deploy → cron picks up
Each manual run = +28 issues (script has BLIND_POST, no dedup)
None of the runs cleaned up previous runs' issues (no autoResolveStale equivalent for dot-collection-health)
After 10 runs, 280 rows accumulate

Cron involvement: dot-collection-health is NOT in the visible crontab (/var/spool/cron/crontabs/root). Searched:

0 */6 * * * .../cron-integrity.sh
0 * * * *   .../watchdog-monitor.sh
0 */3 * * * .../scanner-counts.sh

None of these run dot-collection-health. So it's purely manual/dev invocation, not cron.

Conclusion WHY-4: 280 dups = 10 manual dev test runs during 1-day iterative development of v2.0→v2.2. No cron involvement. The script's BLIND_POST pattern + no cleanup = guaranteed accumulation.

Unknowns from vòng 1 — RESOLVED

#	Question	Answer	Evidence
(a)	dot-collection-health 1-day burst — cron or manual?	Manual dev iteration (10 runs, 5 PRs same day)	git log + script changelog + crontab inspection
(b)	dot-scanner disabled 2026-03-14 — why?	Abandoned after S121, no cron schedule, only ran during dev	git log shows no commits after `30b0a5f` (S121)
(c)	dot-script-lint ON CONFLICT expects which UNIQUE?	NONE — without column clause = PK conflict, never fires	Source line 43
(d)	dot-id-collision-check 0 rows — never run?	Likely never invoked in production, only 2 commits ever, no active schedule	git log shows only #486 (initial) + #487 (curl fix), no further work
(e)	Composite logical_id design for non-hash writers	DEFER to fix design (need product decision)	n/a

7 RESOLVED duplicate groups (from S175-VACCINE-01)

Origin reconstruction (deferred — would require querying first_seen timestamps for each group and matching against git history). The pattern is consistent with iterative dev cycles like 2026-04-02. Likely created during similar S121-S165 development sessions.

WHY-5: Systemic Prevention — 3-tier design

Current state: triple gap (P2 finding)

Layer	Status	Coverage
Guidelines (anti-patterns.md, OR)	rev 9 / rev 48	0% of upsert/race/UNIQUE
CI workflows	11 files	0% of duplicate prevention
Code review checklist	not found	0%

Tier 1 — GUIDELINE (knowledge update)

G1.1: Add 5 anti-patterns to anti-patterns.md (rev 10)

ID	Name	Symptom	Fix
AP-22	UPSERT-BY-CODE-NO-CONSTRAINT	findExisting+POST without DB UNIQUE → race creates duplicates	DB partial UNIQUE first, then writer can be best-effort
AP-23	BLIND-POST	curl POST without find-existing check at all	Compute logical_id (sha256 of natural key) + use upsert
AP-24	STATUS-FLICKER-REOPEN	sweep_resolve marks row resolved → next scan creates duplicate (status=open filter)	Reopen = UPDATE existing (find any status), not POST new
AP-25	ON-CONFLICT-WITHOUT-COLUMN	`INSERT ... ON CONFLICT DO NOTHING` without (column) → defaults to PK, never fires for auto-gen IDs	Always specify the conflict target column
AP-26	TEMPLATE-PROPAGATION-WITHOUT-AUDIT	Copy-paste log_system_issue from script A to script B without auditing dedup	Code review checklist requires "is this the right abstraction?"

G1.2: Add operating-rule note:

NT-DB-CONSTRAINT-FIRST: Mọi bảng có logical_id (slug, code, source_id, hash, file_path) BẮT BUỘC có UNIQUE/partial index TRƯỚC khi viết writer. "Writer atomic" không thay thế "DB constraint" — phải có cả hai. DB constraint là single source of truth, writer là best-effort.

G1.3: Writer template:

Add knowledge/dev/ssot/templates/atomic-upsert-template.md with:

PG-direct upsert pattern (using psycopg2 + ON CONFLICT (col) DO UPDATE)
Bash curl + Directus REST upsert pattern (with explicit retry on conflict)
Anti-pattern examples (the 5 from G1.1) with X-marks

Effort: 2-3h Risk: LOW (additive documentation) Owner: Desktop chính designs, agent updates

Tier 2 — ENFORCEMENT (CI/lint)

G2.1: Pre-commit lint rule

Script: scripts/lint-no-blind-insert.sh

grep added/modified files for /items/.+ POST patterns
Flag if no nearby filter[.+][_eq]= query (warn for find-then-create)
Flag if ON CONFLICT without ( (broken cargo-cult syntax)
Flag if any new file in scripts/integrity/ or dot/bin/ writes to system_issues without computing violation_hash

G2.2: CI workflow (.github/workflows/duplicate-prevention.yml)

Runs on PRs touching scripts/, dot/bin/, functions/
Smoke test: deploy + invoke writer twice + assert no duplicates
Schema test: every table with logical_id col has corresponding UNIQUE index (query pg_indexes)

G2.3: Custom Directus extension hook (optional)

Server-side hook on system_issues items.create
Reject POST if violation_hash exists in any status
Force client to use UPDATE pattern

Effort: 4-6h (G2.1+G2.2), +6h G2.3 Risk: LOW (G2.1, G2.2), MED (G2.3 — could break existing writers until refactored) Owner: Desktop approves design, agent implements

Tier 3 — REVIEW CHECKLIST

G3.1: PR template (.github/PULL_REQUEST_TEMPLATE.md) Add section:

## DB Write Checklist (only if PR touches Directus tables)
- [ ] Bảng có UNIQUE constraint trên logical_id chưa? Nếu chưa: tạo trong PR này
- [ ] Writer atomic (ON CONFLICT hoặc transaction) hay best-effort?
- [ ] Nếu best-effort: DB UNIQUE đã có để bắt race chưa?
- [ ] Có copy-paste pattern từ writer khác không? Audit nguồn nếu có
- [ ] Test: chạy writer 2 lần liên tiếp, COUNT không tăng quá 1

G3.2: Reviewer guideline (knowledge/dev/ssot/code-review-checklist.md)

Quote 5 new anti-patterns from G1.1
Provide diff examples of GOOD vs BAD patterns
Mandatory question: "What happens if this writer is invoked 10 times in 1 day?"

Effort: 1-2h Risk: LOW (additive) Owner: Desktop chính

Combined effort

Tier	Effort	Risk	Priority
1 (Guidelines)	2-3h	LOW	HIGH (foundation for tier 2-3)
2 (CI/lint)	4-12h	LOW-MED	MED (catches future regressions)
3 (Review)	1-2h	LOW	HIGH (cheap immediate wins)
Total	7-17h	LOW overall	—

Tuyên ngôn 3-câu mapping

① Vĩnh viễn: Tier 1 (knowledge) + Tier 2 (CI test) — once added, future code must comply ✓
② Cơ hội nhầm = 0: Tier 2 G2.3 (DB hook) is the only true zero-chance prevention. Without it, Tier 1+3 reduce probability but don't eliminate. Recommend G2.3 for critical tables (system_issues, knowledge_documents, universal_rule_violations) ✓ partial
③ LỖI = cơ hội vàng tìm gốc: This audit IS that opportunity. 9 writers / 4 patterns / 21-day window / triple guideline gap → comprehensive prevention design ✓

CÁI CHƯA BIẾT (cần điều tra thêm)

Why no code review caught these PRs? — Need to read PR review comments (if any) on #476, #506, #578, #620, #626, #629, #655. Could be:
- Self-merge (no human review)
- AI co-reviewer didn't have anti-patterns.md context
- Reviewer focused on functionality/syntax not architecture
Did Codex/Gemini agents also write any of these? — git author is always Huyên (single PR account). Need session handoff files to identify which AI session generated each writer.
Why no fix-forward between sessions? — S109 → S121 → S131 → S155 → S161 → S164 → S165 = 7 sessions in 16 days, all with same bug. No retrospective happened. Need to check if there was a "after-action" doc or if sessions ran in parallel.
Is there a known correct pattern in repo somewhere? — Did anyone EVER write a correct atomic upsert in this repo? Need to grep for ON CONFLICT (col) with proper syntax. If yes, why wasn't it propagated. If no, the pattern was never seeded.
7 RESOLVED duplicate groups origin — same archeology needed for those hashes. Skipped in this session due to time.

Files Referenced (file:line + commit)

/opt/incomex/docker/nuxt-repo/scripts/integrity/dedupe.js (3 commits: 41ba16f S131-M3, 4003661 S161F, b1fe6fe S164)
/opt/incomex/docker/nuxt-repo/dot/bin/dot-layer-integrity-audit (fd770ca S165)
/opt/incomex/docker/nuxt-repo/dot/bin/dot-registry-crosscheck (d87957b S109)
/opt/incomex/docker/nuxt-repo/dot/bin/dot-pg-audit-ensure (30b0a5f S121)
/opt/incomex/docker/nuxt-repo/dot/bin/dot-id-collision-check (f699163 Mission-E)
/opt/incomex/dot/bin/dot-collection-health (75faf48 S155, +5 PRs on 2026-04-02)
/opt/incomex/dot/bin/dot-apr-health (75faf48 S155)
/opt/incomex/dot/bin/dot-script-lint (no git history, line 43 has broken ON CONFLICT DO NOTHING)
knowledge/dev/ssot/anti-patterns.md rev 9 (21 patterns, 0 about upsert)
knowledge/dev/ssot/operating-rules.md rev 48 (0 mentions of UNIQUE/CONSTRAINT/upsert)
/opt/incomex/docker/nuxt-repo/.github/workflows/*.yml (11 files, 0 about duplicate)

Verify checklist

WHY	Has answer?	Evidence type	Confidence
WHY-1	YES	git log + commit hashes for 9/9 writers	HIGH
WHY-2	YES	byte-identical code diff (2 pairs) + cargo-cult ON CONFLICT example	HIGH
WHY-3	YES	rev counts + grep counts (0 matches)	HIGH
WHY-4	YES	5 PRs on 2026-04-02 + math (10 runs × 28 issues = 280)	HIGH
WHY-5	YES	3-tier design with effort/risk/owner per tier	HIGH (design), N/A for execution