KB-3C6B

S175-VACCINE-02B Deep WHY Audit

22 min read Revision 1
reports175v02bvaccineauditwhyroot-causeanti-pattern

S175-VACCINE-02B — Deep WHY Audit (Tại sao có bệnh?)

Date: 2026-04-09 Mode: READ-ONLY archaeology + design Agent: Claude Code Goal: Trả lời 5 câu WHY về root cause của 9 writers / 4 anti-patterns

Summary

Vòng 1 (s175v02-autoresolvestale-audit) phát hiện 9 writers / 4 anti-patterns. Vòng 2 này điều tra TẠI SAO chúng tồn tại.

Câu trả lời cốt lõi: 9 writers được tạo trong 21 ngày (2026-03-10 → 2026-03-31) bởi AI-assisted development với:

  1. Template propagation: copy-paste log_system_issue() byte-identical giữa các script
  2. Mindset gap: AI default to procedural "SELECT-then-INSERT" thay vì DB constraint
  3. Triple guideline gap: anti-patterns.md (21 patterns, 0 về upsert) + operating-rules.md (0 nội dung về DB constraint) + CI (11 workflows, 0 check duplicate)
  4. No knowledge transfer: mỗi session/PR reinvent cùng pattern, không học từ session trước
  5. Iterative dev tests: 5 PRs/1 ngày trên dot-collection-health = 5×28 = 140-280 dups (smoking gun explained)

WHY-1: AI viết ra 4 anti-patterns? Khi nào?

Timeline (git blame)

# Writer Origin commit Date Session/PR Author Pattern from start?
1 dedupe.js findExistingIssue 41ba16f 2026-03-23 S131-M3 (#578) Huyên YES
2 dedupe.js dedupeAndReport 4003661 2026-03-25 S161F (#620) Huyên YES
3 dedupe.js autoResolveStale b1fe6fe 2026-03-25 S164 (#626) Huyên YES
4 dot-layer-integrity-audit fd770ca 2026-03-26 S165 (#629) Huyên YES
5 dot-registry-crosscheck d87957b 2026-03-10 S109 P1 (#476) Huyên YES
6 dot-pg-audit-ensure PG fn 30b0a5f 2026-03-14 S121 (#506) Huyên YES
7 dot-id-collision-check f699163/c5fcd77 2026-03-12 Mission-E (#486/#487) Huyên YES
8 dot-collection-health 75faf48 2026-03-31 S155 (#655) Huyên YES
9 dot-apr-health 75faf48 2026-03-31 S155 (#655) Huyên YES

Findings

  1. Window: 21 ngày (2026-03-10 → 2026-03-31)
  2. Single git author for all 9: Nguyễn Minh Huyên (PR commits — actual code generated by AI assistants per project's documented workflow)
  3. Pattern present from initial commit for ALL writers — none are later regressions
  4. Sessions span: S109, S121, Mission-E, S131-M3, S155, S161F, S164, S165 — 8 different sessions, each one creating writers with same bug class
  5. No session learned from previous: S121 had the bug, S131-M3 had it, S164/S165 had it. No fix between sessions.

Intent classification

Writer Intent Evidence
dedupe.js DELIBERATE design (dedup engine) "prevents duplicate system_issues" comment line 2
dot-layer-integrity-audit DELIBERATE (find existing first) comment "Check for existing open issue"
dot-registry-crosscheck DELIBERATE similar pattern
dot-pg-audit-ensure DELIBERATE (NOT EXISTS) PL/pgSQL with explicit "-- Skip if already exists" comment
dot-id-collision-check NEGLIGENT NO existence check at all, just POST
dot-collection-health NEGLIGENT (template copy) byte-identical log_system_issue from dot-apr-health
dot-apr-health NEGLIGENT (template copy) byte-identical log_system_issue from dot-collection-health
dot-script-lint CARGO_CULT ON CONFLICT DO NOTHING without column clause (broken syntax)

Pattern of failure:

  • DELIBERATE designs (5/9): Author intended to dedup but used wrong abstraction (REST API instead of DB constraint)
  • NEGLIGENT (3/9): No dedup at all (BLIND_POST family)
  • CARGO_CULT (1/9): Used ON CONFLICT syntax without understanding it

WHY-2: 4 anti-patterns có gốc chung?

Code DNA evidence: copy-paste smoking gun

dot-collection-health log_system_issue vs dot-apr-health log_system_issue:

# Both committed in same PR #655 (S155, 2026-03-31)
# Diff is ONLY: `source` value + `severity` value. Body byte-identical.

# dot-collection-health:
log_system_issue() {
  local TITLE="$1" ISSUE_TYPE="$2" DETAIL="$3" BASE_URL="$4"
  curl -sS --globoff -k -X POST "${BASE_URL}/items/system_issues" \
    -H "Authorization: Bearer $DOT_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$(jq -n \
      --arg title "$TITLE" --arg it "$ISSUE_TYPE" --arg desc "$DETAIL" \
      '{ title: $title, issue_type: $it, description: $desc,
         source: "dot-collection-health",  # ← only diff
         severity: "warning",                # ← only diff
         status: "open", detected_at: (now | todate) }')" \
    > /dev/null 2>&1 || true
}

# dot-apr-health:
log_system_issue() {
  local TITLE="$1" ISSUE_TYPE="$2" DETAIL="$3" BASE_URL="$4"
  curl -sS --globoff -k -X POST "${BASE_URL}/items/system_issues" \
    -H "Authorization: Bearer $DOT_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$(jq -n \
      --arg title "$TITLE" --arg it "$ISSUE_TYPE" --arg desc "$DETAIL" \
      --arg src "dot-apr-health" \               # ← only diff
      '{ title: $title, issue_type: $it, description: $desc,
         source: $src,
         severity: "info",                        # ← only diff
         status: "open", detected_at: (now | todate) }')" \
    > /dev/null 2>&1 || true
}

→ Same template, same PR, same author, same bug. Direct copy-paste.

dot-layer-integrity-audit vs dot-registry-crosscheck

# dot-registry-crosscheck (S109, 2026-03-10) — ORIGINAL
et_filter = urllib.parse.quote(issue['entity_type'])
existing = api_get(
    f'/items/system_issues?filter[entity_type][_eq]={et_filter}'
    f'&filter[source][_eq]=dot-registry-crosscheck'
    f'&filter[status][_in]=mở,đang_xử_lý'
    f'&limit=1&fields=id'
)
if existing.get('data', []):
    skipped += 1
    continue
result = api_post('/items/system_issues', {... 'status': 'mở' ...})

# dot-layer-integrity-audit (S165, 2026-03-26) — COPY+EXTEND
et_filter = urllib.parse.quote(issue['entity_type'])
title_filter = urllib.parse.quote(issue['title'])  # ← added
existing = api_get(
    f'/items/system_issues?filter[entity_type][_eq]={et_filter}'
    f'&filter[title][_eq]={title_filter}'           # ← added
    f'&filter[status][_in]=mở,đang_xử_lý'
    f'&limit=1&fields=id'
)
if existing.get('data', []):
    skipped += 1
    continue
# ... (identical structure)

→ S165 author copied S109's pattern, added one filter, never questioned the underlying race condition. 16-day gap, same bug propagated.

Common mindset gap

"Treat Directus REST API as the only abstraction layer"

The author/AI consistently:

  1. Used api_get + filter[...][_eq]=X + limit=1 instead of PG SELECT ... FOR UPDATE
  2. Used api_post instead of PG INSERT ... ON CONFLICT (col) DO UPDATE
  3. Filtered by status='open' (or 'mở') believing this guards against duplicates
  4. Never thought about: race condition, transaction isolation, DB constraint as truth

Why this mindset?

  • Directus REST API doesn't expose ON CONFLICT semantics
  • AI prompt likely "create record if not exists" → AI literal interpretation: SELECT then INSERT
  • No reference template in repo showing the CORRECT pattern (atomic upsert via PG)
  • "dedupe.js" was the most authoritative pattern but itself has the bug → propagated as best practice

Cargo cult evidence (dot-script-lint):

INSERT INTO system_issues (...) VALUES (...) ON CONFLICT DO NOTHING;
-- ON CONFLICT WITHOUT (column) clause defaults to PK conflict
-- PK = auto-generated id → never collides → DO NOTHING never fires
-- Effectively blind insert with extra syntax for show

The author/AI knew ON CONFLICT DO NOTHING exists as a PG idiom but didn't understand it requires (column) to specify which constraint. Knowledge of name without knowledge of semantics.


WHY-3: Tại sao lọt qua review/CI/test?

Triple guideline gap

Layer 1 — anti-patterns.md (rev 9, 21 patterns):

Listed: AP-01 to AP-21 covering Assembly Gate, deploy, branch protection, filter hiding, manual deploy, automation, config-first, DOT bypass, junction handling, dynamic pages, accuracy verify, law-reading, background agent, schema bypass (ALTER TABLE), Directus FK, API vs PG access, positive/negative verify, retroactive law, enacted content overwrite, hardcoded enum.

MISSING (relevant to S175 root cause):

  • ❌ AP-X: "Upsert by code without DB constraint"
  • ❌ AP-X: "Find-then-create race condition"
  • ❌ AP-X: "Status flicker creates duplicate via reopen-as-create"
  • ❌ AP-X: "BLIND_POST without dedup logic"
  • ❌ AP-X: "ON CONFLICT without column clause"

Result: 0 of the 4 anti-patterns from vòng 1 are documented in the registry that authors are supposed to read before each mission.

Layer 2 — operating-rules.md (rev 48, 5727 chars):

$ grep -ic 'UNIQUE\|CONSTRAINT\|ON CONFLICT\|upsert\|race\|dedupe' operating-rules.md
0  (the 1 'CONSTRAINT' match is just a section header for mission template)

0 mentions of DB constraints, race conditions, or upsert pattern.

Layer 3 — CI workflows (.github/workflows/, 11 files):

deploy-vps.yml         deploy.yml
dieu31-vps-setup.yml   e2e-test.yml
enforce-table-rule.yml guard_critical_files.yml
nuxt-ci.yml            ops-smoke.yml
post-deploy-e2e.yml    required-docs-guard.yml
sync-check.yml

grep cho duplicate|upsert|UNIQUE → only false positives (unique from jq filter, race from "trace"). 0 CI checks for upsert pattern or duplicate prevention.

Layer 4 — pre-commit hooks:

$ ls /opt/incomex/docker/nuxt-repo/.git/hooks/pre-commit
(does not exist)
$ find /opt/incomex/docker/nuxt-repo -name '.husky' -o -name '.pre-commit*'
(none)

No client-side enforcement.

Code review evidence

PRs #476, #487, #506, #578, #620, #626, #629, #655 all merged with the bug pattern. Either:

  • Reviewers (or AI co-reviewer) didn't catch it
  • Or no review happened (single-author repo with self-merge)
  • Or review focused on functionality not architecture

enforce-table-rule.yml exists but doesn't cover this — its name suggests Nuxt page rules, not DB schema rules.

Conclusion WHY-3: Code lọt qua vì không có ai/cái gì kiểm tra. Triple gap: guidelines (no rule), CI (no test), review (no checklist). The bug class was invisible to all 3 layers.


WHY-4: 280 duplicates ngày 2026-04-02

git log 2026-04-02 commits touching dot-collection-health

03532f4 fix(health): H15a read permissions include locked collections (#666)
d4d5ae7 feat(health): add governance_role 'locked' support (Dieu 38 config tables) (#665)
0c8816d perf(health): bulk SQL for checks 1,2,11 — eliminate per-collection loops (#663)
982b68f fix(health): H14+H15 PG-only, H17 retroactive, Check 9 no hardcode (#662)
3f26b28 Điều 37 P1: DOT collection-create v2 + collection-health v2 (#661)

5 PRs in 1 day modifying dot-collection-health. Plus PR #664 (security commit, possibly triggered redeploy).

Script changelog corroborates

v2.2.0 (2026-04-02): +governance_role 'locked' support  ← #665
v2.1.1 (2026-04-02): Checks 1,2,11 bulk SQL              ← #663
v2.1.0 (2026-04-02): H14+H15 PG-only, H17 retroactive    ← #662
v2.0.0 (2026-04-02): +H14 FK constraints, +H15...        ← #661 (Điều 37 P1)

Math

  • 28 unique titles per script run (from earlier P1 query: Species code missing: tasks/agents/...)
  • 280 rows total / 28 titles = exactly 10 runs
  • 5-6 PRs × 1-2 dev test invocations each ≈ 8-12 runs
  • 10 falls within 8-12 range → matches iterative dev cycle

Reconstruction

  1. Each PR cycle: developer (or AI agent) edits script → manual run to test → push PR → CI runs → merge → deploy → cron picks up
  2. Each manual run = +28 issues (script has BLIND_POST, no dedup)
  3. None of the runs cleaned up previous runs' issues (no autoResolveStale equivalent for dot-collection-health)
  4. After 10 runs, 280 rows accumulate

Cron involvement: dot-collection-health is NOT in the visible crontab (/var/spool/cron/crontabs/root). Searched:

0 */6 * * * .../cron-integrity.sh
0 * * * *   .../watchdog-monitor.sh
0 */3 * * * .../scanner-counts.sh

None of these run dot-collection-health. So it's purely manual/dev invocation, not cron.

Conclusion WHY-4: 280 dups = 10 manual dev test runs during 1-day iterative development of v2.0→v2.2. No cron involvement. The script's BLIND_POST pattern + no cleanup = guaranteed accumulation.

Unknowns from vòng 1 — RESOLVED

# Question Answer Evidence
(a) dot-collection-health 1-day burst — cron or manual? Manual dev iteration (10 runs, 5 PRs same day) git log + script changelog + crontab inspection
(b) dot-scanner disabled 2026-03-14 — why? Abandoned after S121, no cron schedule, only ran during dev git log shows no commits after 30b0a5f (S121)
(c) dot-script-lint ON CONFLICT expects which UNIQUE? NONE — without column clause = PK conflict, never fires Source line 43
(d) dot-id-collision-check 0 rows — never run? Likely never invoked in production, only 2 commits ever, no active schedule git log shows only #486 (initial) + #487 (curl fix), no further work
(e) Composite logical_id design for non-hash writers DEFER to fix design (need product decision) n/a

7 RESOLVED duplicate groups (from S175-VACCINE-01)

Origin reconstruction (deferred — would require querying first_seen timestamps for each group and matching against git history). The pattern is consistent with iterative dev cycles like 2026-04-02. Likely created during similar S121-S165 development sessions.


WHY-5: Systemic Prevention — 3-tier design

Current state: triple gap (P2 finding)

Layer Status Coverage
Guidelines (anti-patterns.md, OR) rev 9 / rev 48 0% of upsert/race/UNIQUE
CI workflows 11 files 0% of duplicate prevention
Code review checklist not found 0%

Tier 1 — GUIDELINE (knowledge update)

G1.1: Add 5 anti-patterns to anti-patterns.md (rev 10)

ID Name Symptom Fix
AP-22 UPSERT-BY-CODE-NO-CONSTRAINT findExisting+POST without DB UNIQUE → race creates duplicates DB partial UNIQUE first, then writer can be best-effort
AP-23 BLIND-POST curl POST without find-existing check at all Compute logical_id (sha256 of natural key) + use upsert
AP-24 STATUS-FLICKER-REOPEN sweep_resolve marks row resolved → next scan creates duplicate (status=open filter) Reopen = UPDATE existing (find any status), not POST new
AP-25 ON-CONFLICT-WITHOUT-COLUMN INSERT ... ON CONFLICT DO NOTHING without (column) → defaults to PK, never fires for auto-gen IDs Always specify the conflict target column
AP-26 TEMPLATE-PROPAGATION-WITHOUT-AUDIT Copy-paste log_system_issue from script A to script B without auditing dedup Code review checklist requires "is this the right abstraction?"

G1.2: Add operating-rule note:

NT-DB-CONSTRAINT-FIRST: Mọi bảng có logical_id (slug, code, source_id, hash, file_path) BẮT BUỘC có UNIQUE/partial index TRƯỚC khi viết writer. "Writer atomic" không thay thế "DB constraint" — phải có cả hai. DB constraint là single source of truth, writer là best-effort.

G1.3: Writer template:

Add knowledge/dev/ssot/templates/atomic-upsert-template.md with:

  • PG-direct upsert pattern (using psycopg2 + ON CONFLICT (col) DO UPDATE)
  • Bash curl + Directus REST upsert pattern (with explicit retry on conflict)
  • Anti-pattern examples (the 5 from G1.1) with X-marks

Effort: 2-3h Risk: LOW (additive documentation) Owner: Desktop chính designs, agent updates

Tier 2 — ENFORCEMENT (CI/lint)

G2.1: Pre-commit lint rule

Script: scripts/lint-no-blind-insert.sh

  • grep added/modified files for /items/.+ POST patterns
  • Flag if no nearby filter[.+][_eq]= query (warn for find-then-create)
  • Flag if ON CONFLICT without ( (broken cargo-cult syntax)
  • Flag if any new file in scripts/integrity/ or dot/bin/ writes to system_issues without computing violation_hash

G2.2: CI workflow (.github/workflows/duplicate-prevention.yml)

  • Runs on PRs touching scripts/, dot/bin/, functions/
  • Smoke test: deploy + invoke writer twice + assert no duplicates
  • Schema test: every table with logical_id col has corresponding UNIQUE index (query pg_indexes)

G2.3: Custom Directus extension hook (optional)

  • Server-side hook on system_issues items.create
  • Reject POST if violation_hash exists in any status
  • Force client to use UPDATE pattern

Effort: 4-6h (G2.1+G2.2), +6h G2.3 Risk: LOW (G2.1, G2.2), MED (G2.3 — could break existing writers until refactored) Owner: Desktop approves design, agent implements

Tier 3 — REVIEW CHECKLIST

G3.1: PR template (.github/PULL_REQUEST_TEMPLATE.md) Add section:

## DB Write Checklist (only if PR touches Directus tables)
- [ ] Bảng có UNIQUE constraint trên logical_id chưa? Nếu chưa: tạo trong PR này
- [ ] Writer atomic (ON CONFLICT hoặc transaction) hay best-effort?
- [ ] Nếu best-effort: DB UNIQUE đã có để bắt race chưa?
- [ ] Có copy-paste pattern từ writer khác không? Audit nguồn nếu có
- [ ] Test: chạy writer 2 lần liên tiếp, COUNT không tăng quá 1

G3.2: Reviewer guideline (knowledge/dev/ssot/code-review-checklist.md)

  • Quote 5 new anti-patterns from G1.1
  • Provide diff examples of GOOD vs BAD patterns
  • Mandatory question: "What happens if this writer is invoked 10 times in 1 day?"

Effort: 1-2h Risk: LOW (additive) Owner: Desktop chính

Combined effort

Tier Effort Risk Priority
1 (Guidelines) 2-3h LOW HIGH (foundation for tier 2-3)
2 (CI/lint) 4-12h LOW-MED MED (catches future regressions)
3 (Review) 1-2h LOW HIGH (cheap immediate wins)
Total 7-17h LOW overall

Tuyên ngôn 3-câu mapping

  • ① Vĩnh viễn: Tier 1 (knowledge) + Tier 2 (CI test) — once added, future code must comply ✓
  • ② Cơ hội nhầm = 0: Tier 2 G2.3 (DB hook) is the only true zero-chance prevention. Without it, Tier 1+3 reduce probability but don't eliminate. Recommend G2.3 for critical tables (system_issues, knowledge_documents, universal_rule_violations) ✓ partial
  • ③ LỖI = cơ hội vàng tìm gốc: This audit IS that opportunity. 9 writers / 4 patterns / 21-day window / triple guideline gap → comprehensive prevention design ✓

CÁI CHƯA BIẾT (cần điều tra thêm)

  1. Why no code review caught these PRs? — Need to read PR review comments (if any) on #476, #506, #578, #620, #626, #629, #655. Could be:

    • Self-merge (no human review)
    • AI co-reviewer didn't have anti-patterns.md context
    • Reviewer focused on functionality/syntax not architecture
  2. Did Codex/Gemini agents also write any of these? — git author is always Huyên (single PR account). Need session handoff files to identify which AI session generated each writer.

  3. Why no fix-forward between sessions? — S109 → S121 → S131 → S155 → S161 → S164 → S165 = 7 sessions in 16 days, all with same bug. No retrospective happened. Need to check if there was a "after-action" doc or if sessions ran in parallel.

  4. Is there a known correct pattern in repo somewhere? — Did anyone EVER write a correct atomic upsert in this repo? Need to grep for ON CONFLICT (col) with proper syntax. If yes, why wasn't it propagated. If no, the pattern was never seeded.

  5. 7 RESOLVED duplicate groups origin — same archeology needed for those hashes. Skipped in this session due to time.


Files Referenced (file:line + commit)

  • /opt/incomex/docker/nuxt-repo/scripts/integrity/dedupe.js (3 commits: 41ba16f S131-M3, 4003661 S161F, b1fe6fe S164)
  • /opt/incomex/docker/nuxt-repo/dot/bin/dot-layer-integrity-audit (fd770ca S165)
  • /opt/incomex/docker/nuxt-repo/dot/bin/dot-registry-crosscheck (d87957b S109)
  • /opt/incomex/docker/nuxt-repo/dot/bin/dot-pg-audit-ensure (30b0a5f S121)
  • /opt/incomex/docker/nuxt-repo/dot/bin/dot-id-collision-check (f699163 Mission-E)
  • /opt/incomex/dot/bin/dot-collection-health (75faf48 S155, +5 PRs on 2026-04-02)
  • /opt/incomex/dot/bin/dot-apr-health (75faf48 S155)
  • /opt/incomex/dot/bin/dot-script-lint (no git history, line 43 has broken ON CONFLICT DO NOTHING)
  • knowledge/dev/ssot/anti-patterns.md rev 9 (21 patterns, 0 about upsert)
  • knowledge/dev/ssot/operating-rules.md rev 48 (0 mentions of UNIQUE/CONSTRAINT/upsert)
  • /opt/incomex/docker/nuxt-repo/.github/workflows/*.yml (11 files, 0 about duplicate)

Verify checklist

WHY Has answer? Evidence type Confidence
WHY-1 YES git log + commit hashes for 9/9 writers HIGH
WHY-2 YES byte-identical code diff (2 pairs) + cargo-cult ON CONFLICT example HIGH
WHY-3 YES rev counts + grep counts (0 matches) HIGH
WHY-4 YES 5 PRs on 2026-04-02 + math (10 runs × 28 issues = 280) HIGH
WHY-5 YES 3-tier design with effort/risk/owner per tier HIGH (design), N/A for execution