S175-VACCINE-02B Deep WHY Audit
S175-VACCINE-02B — Deep WHY Audit (Tại sao có bệnh?)
Date: 2026-04-09 Mode: READ-ONLY archaeology + design Agent: Claude Code Goal: Trả lời 5 câu WHY về root cause của 9 writers / 4 anti-patterns
Summary
Vòng 1 (s175v02-autoresolvestale-audit) phát hiện 9 writers / 4 anti-patterns. Vòng 2 này điều tra TẠI SAO chúng tồn tại.
Câu trả lời cốt lõi: 9 writers được tạo trong 21 ngày (2026-03-10 → 2026-03-31) bởi AI-assisted development với:
- Template propagation: copy-paste
log_system_issue()byte-identical giữa các script - Mindset gap: AI default to procedural "SELECT-then-INSERT" thay vì DB constraint
- Triple guideline gap: anti-patterns.md (21 patterns, 0 về upsert) + operating-rules.md (0 nội dung về DB constraint) + CI (11 workflows, 0 check duplicate)
- No knowledge transfer: mỗi session/PR reinvent cùng pattern, không học từ session trước
- Iterative dev tests: 5 PRs/1 ngày trên dot-collection-health = 5×28 = 140-280 dups (smoking gun explained)
WHY-1: AI viết ra 4 anti-patterns? Khi nào?
Timeline (git blame)
| # | Writer | Origin commit | Date | Session/PR | Author | Pattern from start? |
|---|---|---|---|---|---|---|
| 1 | dedupe.js findExistingIssue |
41ba16f |
2026-03-23 | S131-M3 (#578) | Huyên | YES |
| 2 | dedupe.js dedupeAndReport |
4003661 |
2026-03-25 | S161F (#620) | Huyên | YES |
| 3 | dedupe.js autoResolveStale |
b1fe6fe |
2026-03-25 | S164 (#626) | Huyên | YES |
| 4 | dot-layer-integrity-audit |
fd770ca |
2026-03-26 | S165 (#629) | Huyên | YES |
| 5 | dot-registry-crosscheck |
d87957b |
2026-03-10 | S109 P1 (#476) | Huyên | YES |
| 6 | dot-pg-audit-ensure PG fn |
30b0a5f |
2026-03-14 | S121 (#506) | Huyên | YES |
| 7 | dot-id-collision-check |
f699163/c5fcd77 |
2026-03-12 | Mission-E (#486/#487) | Huyên | YES |
| 8 | dot-collection-health |
75faf48 |
2026-03-31 | S155 (#655) | Huyên | YES |
| 9 | dot-apr-health |
75faf48 |
2026-03-31 | S155 (#655) | Huyên | YES |
Findings
- Window: 21 ngày (2026-03-10 → 2026-03-31)
- Single git author for all 9:
Nguyễn Minh Huyên(PR commits — actual code generated by AI assistants per project's documented workflow) - Pattern present from initial commit for ALL writers — none are later regressions
- Sessions span: S109, S121, Mission-E, S131-M3, S155, S161F, S164, S165 — 8 different sessions, each one creating writers with same bug class
- No session learned from previous: S121 had the bug, S131-M3 had it, S164/S165 had it. No fix between sessions.
Intent classification
| Writer | Intent | Evidence |
|---|---|---|
| dedupe.js | DELIBERATE design (dedup engine) | "prevents duplicate system_issues" comment line 2 |
| dot-layer-integrity-audit | DELIBERATE (find existing first) | comment "Check for existing open issue" |
| dot-registry-crosscheck | DELIBERATE | similar pattern |
| dot-pg-audit-ensure | DELIBERATE (NOT EXISTS) | PL/pgSQL with explicit "-- Skip if already exists" comment |
| dot-id-collision-check | NEGLIGENT | NO existence check at all, just POST |
| dot-collection-health | NEGLIGENT (template copy) | byte-identical log_system_issue from dot-apr-health |
| dot-apr-health | NEGLIGENT (template copy) | byte-identical log_system_issue from dot-collection-health |
| dot-script-lint | CARGO_CULT | ON CONFLICT DO NOTHING without column clause (broken syntax) |
Pattern of failure:
- DELIBERATE designs (5/9): Author intended to dedup but used wrong abstraction (REST API instead of DB constraint)
- NEGLIGENT (3/9): No dedup at all (BLIND_POST family)
- CARGO_CULT (1/9): Used
ON CONFLICTsyntax without understanding it
WHY-2: 4 anti-patterns có gốc chung?
Code DNA evidence: copy-paste smoking gun
dot-collection-health log_system_issue vs dot-apr-health log_system_issue:
# Both committed in same PR #655 (S155, 2026-03-31)
# Diff is ONLY: `source` value + `severity` value. Body byte-identical.
# dot-collection-health:
log_system_issue() {
local TITLE="$1" ISSUE_TYPE="$2" DETAIL="$3" BASE_URL="$4"
curl -sS --globoff -k -X POST "${BASE_URL}/items/system_issues" \
-H "Authorization: Bearer $DOT_TOKEN" \
-H "Content-Type: application/json" \
-d "$(jq -n \
--arg title "$TITLE" --arg it "$ISSUE_TYPE" --arg desc "$DETAIL" \
'{ title: $title, issue_type: $it, description: $desc,
source: "dot-collection-health", # ← only diff
severity: "warning", # ← only diff
status: "open", detected_at: (now | todate) }')" \
> /dev/null 2>&1 || true
}
# dot-apr-health:
log_system_issue() {
local TITLE="$1" ISSUE_TYPE="$2" DETAIL="$3" BASE_URL="$4"
curl -sS --globoff -k -X POST "${BASE_URL}/items/system_issues" \
-H "Authorization: Bearer $DOT_TOKEN" \
-H "Content-Type: application/json" \
-d "$(jq -n \
--arg title "$TITLE" --arg it "$ISSUE_TYPE" --arg desc "$DETAIL" \
--arg src "dot-apr-health" \ # ← only diff
'{ title: $title, issue_type: $it, description: $desc,
source: $src,
severity: "info", # ← only diff
status: "open", detected_at: (now | todate) }')" \
> /dev/null 2>&1 || true
}
→ Same template, same PR, same author, same bug. Direct copy-paste.
dot-layer-integrity-audit vs dot-registry-crosscheck
# dot-registry-crosscheck (S109, 2026-03-10) — ORIGINAL
et_filter = urllib.parse.quote(issue['entity_type'])
existing = api_get(
f'/items/system_issues?filter[entity_type][_eq]={et_filter}'
f'&filter[source][_eq]=dot-registry-crosscheck'
f'&filter[status][_in]=mở,đang_xử_lý'
f'&limit=1&fields=id'
)
if existing.get('data', []):
skipped += 1
continue
result = api_post('/items/system_issues', {... 'status': 'mở' ...})
# dot-layer-integrity-audit (S165, 2026-03-26) — COPY+EXTEND
et_filter = urllib.parse.quote(issue['entity_type'])
title_filter = urllib.parse.quote(issue['title']) # ← added
existing = api_get(
f'/items/system_issues?filter[entity_type][_eq]={et_filter}'
f'&filter[title][_eq]={title_filter}' # ← added
f'&filter[status][_in]=mở,đang_xử_lý'
f'&limit=1&fields=id'
)
if existing.get('data', []):
skipped += 1
continue
# ... (identical structure)
→ S165 author copied S109's pattern, added one filter, never questioned the underlying race condition. 16-day gap, same bug propagated.
Common mindset gap
"Treat Directus REST API as the only abstraction layer"
The author/AI consistently:
- Used
api_get+filter[...][_eq]=X+limit=1instead of PGSELECT ... FOR UPDATE - Used
api_postinstead of PGINSERT ... ON CONFLICT (col) DO UPDATE - Filtered by
status='open'(or 'mở') believing this guards against duplicates - Never thought about: race condition, transaction isolation, DB constraint as truth
Why this mindset?
- Directus REST API doesn't expose
ON CONFLICTsemantics - AI prompt likely "create record if not exists" → AI literal interpretation: SELECT then INSERT
- No reference template in repo showing the CORRECT pattern (atomic upsert via PG)
- "dedupe.js" was the most authoritative pattern but itself has the bug → propagated as best practice
Cargo cult evidence (dot-script-lint):
INSERT INTO system_issues (...) VALUES (...) ON CONFLICT DO NOTHING;
-- ON CONFLICT WITHOUT (column) clause defaults to PK conflict
-- PK = auto-generated id → never collides → DO NOTHING never fires
-- Effectively blind insert with extra syntax for show
The author/AI knew ON CONFLICT DO NOTHING exists as a PG idiom but didn't understand it requires (column) to specify which constraint. Knowledge of name without knowledge of semantics.
WHY-3: Tại sao lọt qua review/CI/test?
Triple guideline gap
Layer 1 — anti-patterns.md (rev 9, 21 patterns):
Listed: AP-01 to AP-21 covering Assembly Gate, deploy, branch protection, filter hiding, manual deploy, automation, config-first, DOT bypass, junction handling, dynamic pages, accuracy verify, law-reading, background agent, schema bypass (ALTER TABLE), Directus FK, API vs PG access, positive/negative verify, retroactive law, enacted content overwrite, hardcoded enum.
MISSING (relevant to S175 root cause):
- ❌ AP-X: "Upsert by code without DB constraint"
- ❌ AP-X: "Find-then-create race condition"
- ❌ AP-X: "Status flicker creates duplicate via reopen-as-create"
- ❌ AP-X: "BLIND_POST without dedup logic"
- ❌ AP-X: "ON CONFLICT without column clause"
Result: 0 of the 4 anti-patterns from vòng 1 are documented in the registry that authors are supposed to read before each mission.
Layer 2 — operating-rules.md (rev 48, 5727 chars):
$ grep -ic 'UNIQUE\|CONSTRAINT\|ON CONFLICT\|upsert\|race\|dedupe' operating-rules.md
0 (the 1 'CONSTRAINT' match is just a section header for mission template)
0 mentions of DB constraints, race conditions, or upsert pattern.
Layer 3 — CI workflows (.github/workflows/, 11 files):
deploy-vps.yml deploy.yml
dieu31-vps-setup.yml e2e-test.yml
enforce-table-rule.yml guard_critical_files.yml
nuxt-ci.yml ops-smoke.yml
post-deploy-e2e.yml required-docs-guard.yml
sync-check.yml
grep cho duplicate|upsert|UNIQUE → only false positives (unique from jq filter, race from "trace"). 0 CI checks for upsert pattern or duplicate prevention.
Layer 4 — pre-commit hooks:
$ ls /opt/incomex/docker/nuxt-repo/.git/hooks/pre-commit
(does not exist)
$ find /opt/incomex/docker/nuxt-repo -name '.husky' -o -name '.pre-commit*'
(none)
No client-side enforcement.
Code review evidence
PRs #476, #487, #506, #578, #620, #626, #629, #655 all merged with the bug pattern. Either:
- Reviewers (or AI co-reviewer) didn't catch it
- Or no review happened (single-author repo with self-merge)
- Or review focused on functionality not architecture
enforce-table-rule.yml exists but doesn't cover this — its name suggests Nuxt page rules, not DB schema rules.
Conclusion WHY-3: Code lọt qua vì không có ai/cái gì kiểm tra. Triple gap: guidelines (no rule), CI (no test), review (no checklist). The bug class was invisible to all 3 layers.
WHY-4: 280 duplicates ngày 2026-04-02
git log 2026-04-02 commits touching dot-collection-health
03532f4 fix(health): H15a read permissions include locked collections (#666)
d4d5ae7 feat(health): add governance_role 'locked' support (Dieu 38 config tables) (#665)
0c8816d perf(health): bulk SQL for checks 1,2,11 — eliminate per-collection loops (#663)
982b68f fix(health): H14+H15 PG-only, H17 retroactive, Check 9 no hardcode (#662)
3f26b28 Điều 37 P1: DOT collection-create v2 + collection-health v2 (#661)
5 PRs in 1 day modifying dot-collection-health. Plus PR #664 (security commit, possibly triggered redeploy).
Script changelog corroborates
v2.2.0 (2026-04-02): +governance_role 'locked' support ← #665
v2.1.1 (2026-04-02): Checks 1,2,11 bulk SQL ← #663
v2.1.0 (2026-04-02): H14+H15 PG-only, H17 retroactive ← #662
v2.0.0 (2026-04-02): +H14 FK constraints, +H15... ← #661 (Điều 37 P1)
Math
- 28 unique titles per script run (from earlier P1 query:
Species code missing: tasks/agents/...) - 280 rows total / 28 titles = exactly 10 runs
- 5-6 PRs × 1-2 dev test invocations each ≈ 8-12 runs
- 10 falls within 8-12 range → matches iterative dev cycle
Reconstruction
- Each PR cycle: developer (or AI agent) edits script → manual run to test → push PR → CI runs → merge → deploy → cron picks up
- Each manual run = +28 issues (script has BLIND_POST, no dedup)
- None of the runs cleaned up previous runs' issues (no autoResolveStale equivalent for dot-collection-health)
- After 10 runs, 280 rows accumulate
Cron involvement: dot-collection-health is NOT in the visible crontab (/var/spool/cron/crontabs/root). Searched:
0 */6 * * * .../cron-integrity.sh
0 * * * * .../watchdog-monitor.sh
0 */3 * * * .../scanner-counts.sh
None of these run dot-collection-health. So it's purely manual/dev invocation, not cron.
Conclusion WHY-4: 280 dups = 10 manual dev test runs during 1-day iterative development of v2.0→v2.2. No cron involvement. The script's BLIND_POST pattern + no cleanup = guaranteed accumulation.
Unknowns from vòng 1 — RESOLVED
| # | Question | Answer | Evidence |
|---|---|---|---|
| (a) | dot-collection-health 1-day burst — cron or manual? | Manual dev iteration (10 runs, 5 PRs same day) | git log + script changelog + crontab inspection |
| (b) | dot-scanner disabled 2026-03-14 — why? | Abandoned after S121, no cron schedule, only ran during dev | git log shows no commits after 30b0a5f (S121) |
| (c) | dot-script-lint ON CONFLICT expects which UNIQUE? | NONE — without column clause = PK conflict, never fires | Source line 43 |
| (d) | dot-id-collision-check 0 rows — never run? | Likely never invoked in production, only 2 commits ever, no active schedule | git log shows only #486 (initial) + #487 (curl fix), no further work |
| (e) | Composite logical_id design for non-hash writers | DEFER to fix design (need product decision) | n/a |
7 RESOLVED duplicate groups (from S175-VACCINE-01)
Origin reconstruction (deferred — would require querying first_seen timestamps for each group and matching against git history). The pattern is consistent with iterative dev cycles like 2026-04-02. Likely created during similar S121-S165 development sessions.
WHY-5: Systemic Prevention — 3-tier design
Current state: triple gap (P2 finding)
| Layer | Status | Coverage |
|---|---|---|
| Guidelines (anti-patterns.md, OR) | rev 9 / rev 48 | 0% of upsert/race/UNIQUE |
| CI workflows | 11 files | 0% of duplicate prevention |
| Code review checklist | not found | 0% |
Tier 1 — GUIDELINE (knowledge update)
G1.1: Add 5 anti-patterns to anti-patterns.md (rev 10)
| ID | Name | Symptom | Fix |
|---|---|---|---|
| AP-22 | UPSERT-BY-CODE-NO-CONSTRAINT | findExisting+POST without DB UNIQUE → race creates duplicates | DB partial UNIQUE first, then writer can be best-effort |
| AP-23 | BLIND-POST | curl POST without find-existing check at all | Compute logical_id (sha256 of natural key) + use upsert |
| AP-24 | STATUS-FLICKER-REOPEN | sweep_resolve marks row resolved → next scan creates duplicate (status=open filter) | Reopen = UPDATE existing (find any status), not POST new |
| AP-25 | ON-CONFLICT-WITHOUT-COLUMN | INSERT ... ON CONFLICT DO NOTHING without (column) → defaults to PK, never fires for auto-gen IDs |
Always specify the conflict target column |
| AP-26 | TEMPLATE-PROPAGATION-WITHOUT-AUDIT | Copy-paste log_system_issue from script A to script B without auditing dedup | Code review checklist requires "is this the right abstraction?" |
G1.2: Add operating-rule note:
NT-DB-CONSTRAINT-FIRST: Mọi bảng có logical_id (slug, code, source_id, hash, file_path) BẮT BUỘC có UNIQUE/partial index TRƯỚC khi viết writer. "Writer atomic" không thay thế "DB constraint" — phải có cả hai. DB constraint là single source of truth, writer là best-effort.
G1.3: Writer template:
Add knowledge/dev/ssot/templates/atomic-upsert-template.md with:
- PG-direct upsert pattern (using psycopg2 + ON CONFLICT (col) DO UPDATE)
- Bash curl + Directus REST upsert pattern (with explicit retry on conflict)
- Anti-pattern examples (the 5 from G1.1) with X-marks
Effort: 2-3h Risk: LOW (additive documentation) Owner: Desktop chính designs, agent updates
Tier 2 — ENFORCEMENT (CI/lint)
G2.1: Pre-commit lint rule
Script: scripts/lint-no-blind-insert.sh
- grep added/modified files for
/items/.+POST patterns - Flag if no nearby
filter[.+][_eq]=query (warn for find-then-create) - Flag if
ON CONFLICTwithout((broken cargo-cult syntax) - Flag if any new file in
scripts/integrity/ordot/bin/writes to system_issues without computing violation_hash
G2.2: CI workflow (.github/workflows/duplicate-prevention.yml)
- Runs on PRs touching
scripts/,dot/bin/,functions/ - Smoke test: deploy + invoke writer twice + assert no duplicates
- Schema test: every table with logical_id col has corresponding UNIQUE index (query
pg_indexes)
G2.3: Custom Directus extension hook (optional)
- Server-side hook on
system_issuesitems.create - Reject POST if violation_hash exists in any status
- Force client to use UPDATE pattern
Effort: 4-6h (G2.1+G2.2), +6h G2.3 Risk: LOW (G2.1, G2.2), MED (G2.3 — could break existing writers until refactored) Owner: Desktop approves design, agent implements
Tier 3 — REVIEW CHECKLIST
G3.1: PR template (.github/PULL_REQUEST_TEMPLATE.md)
Add section:
## DB Write Checklist (only if PR touches Directus tables)
- [ ] Bảng có UNIQUE constraint trên logical_id chưa? Nếu chưa: tạo trong PR này
- [ ] Writer atomic (ON CONFLICT hoặc transaction) hay best-effort?
- [ ] Nếu best-effort: DB UNIQUE đã có để bắt race chưa?
- [ ] Có copy-paste pattern từ writer khác không? Audit nguồn nếu có
- [ ] Test: chạy writer 2 lần liên tiếp, COUNT không tăng quá 1
G3.2: Reviewer guideline (knowledge/dev/ssot/code-review-checklist.md)
- Quote 5 new anti-patterns from G1.1
- Provide diff examples of GOOD vs BAD patterns
- Mandatory question: "What happens if this writer is invoked 10 times in 1 day?"
Effort: 1-2h Risk: LOW (additive) Owner: Desktop chính
Combined effort
| Tier | Effort | Risk | Priority |
|---|---|---|---|
| 1 (Guidelines) | 2-3h | LOW | HIGH (foundation for tier 2-3) |
| 2 (CI/lint) | 4-12h | LOW-MED | MED (catches future regressions) |
| 3 (Review) | 1-2h | LOW | HIGH (cheap immediate wins) |
| Total | 7-17h | LOW overall | — |
Tuyên ngôn 3-câu mapping
- ① Vĩnh viễn: Tier 1 (knowledge) + Tier 2 (CI test) — once added, future code must comply ✓
- ② Cơ hội nhầm = 0: Tier 2 G2.3 (DB hook) is the only true zero-chance prevention. Without it, Tier 1+3 reduce probability but don't eliminate. Recommend G2.3 for critical tables (system_issues, knowledge_documents, universal_rule_violations) ✓ partial
- ③ LỖI = cơ hội vàng tìm gốc: This audit IS that opportunity. 9 writers / 4 patterns / 21-day window / triple guideline gap → comprehensive prevention design ✓
CÁI CHƯA BIẾT (cần điều tra thêm)
-
Why no code review caught these PRs? — Need to read PR review comments (if any) on #476, #506, #578, #620, #626, #629, #655. Could be:
- Self-merge (no human review)
- AI co-reviewer didn't have anti-patterns.md context
- Reviewer focused on functionality/syntax not architecture
-
Did Codex/Gemini agents also write any of these? — git author is always Huyên (single PR account). Need session handoff files to identify which AI session generated each writer.
-
Why no fix-forward between sessions? — S109 → S121 → S131 → S155 → S161 → S164 → S165 = 7 sessions in 16 days, all with same bug. No retrospective happened. Need to check if there was a "after-action" doc or if sessions ran in parallel.
-
Is there a known correct pattern in repo somewhere? — Did anyone EVER write a correct atomic upsert in this repo? Need to grep for
ON CONFLICT (col)with proper syntax. If yes, why wasn't it propagated. If no, the pattern was never seeded. -
7 RESOLVED duplicate groups origin — same archeology needed for those hashes. Skipped in this session due to time.
Files Referenced (file:line + commit)
/opt/incomex/docker/nuxt-repo/scripts/integrity/dedupe.js(3 commits: 41ba16f S131-M3, 4003661 S161F, b1fe6fe S164)/opt/incomex/docker/nuxt-repo/dot/bin/dot-layer-integrity-audit(fd770ca S165)/opt/incomex/docker/nuxt-repo/dot/bin/dot-registry-crosscheck(d87957b S109)/opt/incomex/docker/nuxt-repo/dot/bin/dot-pg-audit-ensure(30b0a5f S121)/opt/incomex/docker/nuxt-repo/dot/bin/dot-id-collision-check(f699163 Mission-E)/opt/incomex/dot/bin/dot-collection-health(75faf48 S155, +5 PRs on 2026-04-02)/opt/incomex/dot/bin/dot-apr-health(75faf48 S155)/opt/incomex/dot/bin/dot-script-lint(no git history, line 43 has brokenON CONFLICT DO NOTHING)knowledge/dev/ssot/anti-patterns.mdrev 9 (21 patterns, 0 about upsert)knowledge/dev/ssot/operating-rules.mdrev 48 (0 mentions of UNIQUE/CONSTRAINT/upsert)/opt/incomex/docker/nuxt-repo/.github/workflows/*.yml(11 files, 0 about duplicate)
Verify checklist
| WHY | Has answer? | Evidence type | Confidence |
|---|---|---|---|
| WHY-1 | YES | git log + commit hashes for 9/9 writers | HIGH |
| WHY-2 | YES | byte-identical code diff (2 pairs) + cargo-cult ON CONFLICT example | HIGH |
| WHY-3 | YES | rev counts + grep counts (0 matches) | HIGH |
| WHY-4 | YES | 5 PRs on 2026-04-02 + math (10 runs × 28 issues = 280) | HIGH |
| WHY-5 | YES | 3-tier design with effort/risk/owner per tier | HIGH (design), N/A for execution |