KB-56E4

80000x · 09 — Test plan: Codex (fresh Agent) does MARK, Claude Code verifies

8 min read Revision 1
iu-core80000xtest-plancodex-markclaude-verifyfresh-agentpass-criteriafailure-modestrapsiteration-cap

09 — Test plan: Codex (fresh Agent) does MARK, Claude Code verifies

The next macro after this documentation macro. Validates that this package is sufficient to drive a fresh Agent through a correct MARK without prior context.

1. Goal

Prove that a new Agent (Codex, or a fresh Claude with no project memory) reading only the 11 docs in this package can:

  1. Receive a real MARK request for Điều 37;
  2. Produce a cut_manifest that passes the 04-review-approval-checklist.md checks;
  3. Stop at MARK without claiming CUT done.

Then prove that Claude Code (with project context) reviewing the Codex output catches any honest defects via the checklist.

This is the empirical validation of the package. If a fresh Agent cannot succeed reading only these docs, the docs are incomplete and must be revised.

2. Setup

codex_agent:
  identity:        codex-mark-r1
  context:         the 11 docs in this package, fetched read-only from KB
                   knowledge/dev/laws/dieu44-trien-khai/v0.6-iu-core-operational-cut-workflow-mark-review-cut-verify/
  no other context: no prior macro reports, no auto-memory, no project CLAUDE.md
  tools:           HTTP GET, file read, KB read, KB upload to <out-dir>; no DB write
claude_verifier:
  identity:        claude-code (project operator)
  context:         project memory + KB + this package
  tools:           full read access; read-only DB SELECT; no DB write

3. Test inputs

3.1 Request (sent to Codex)

Đánh dấu cắt Điều 37 từ link: https://<exact_url_provided_by_user>

doc_code: <user-supplied or omit-and-let-codex-propose>

A single article (Điều 37). Single-article keeps the first test simple; multi-article (37+38+39) is the second test.

3.2 Reference vocab snapshot

If Codex cannot read the live DB, it falls back to the snapshot in 02 §6. Claude Code's review compares the snapshot Codex used against live, and notes drift.

4. Test execution

4.1 Step 1 — Codex receives the request

Codex MUST:

  • Read all 11 docs in this package before producing any output.
  • Confirm understanding: in its first response, restate (a) what MARK is, (b) the four stages, (c) the forbidden actions list. (Optional but recommended.)

4.2 Step 2 — Codex performs MARK

Codex MUST:

  • Follow the 12-step procedure in 02-agent-mark-instructions.md.
  • Emit the 4 output artifacts per 08-agent-mark-output-template.md.
  • End with the literal STOP message: MARK_PROPOSAL_READY_FOR_REVIEW ….

Codex MUST NOT:

  • Call any DB write function.
  • Upload files outside its --out-dir.
  • Claim CUT done.
  • Self-advance to CUT.

4.3 Step 3 — Claude Code verifies

Claude Code MUST:

  • Read Codex's manifest + report.
  • Run every check in 04-review-approval-checklist.md (R0–R11).
  • Optionally re-fetch the source and recompute hashes (R1).
  • Emit a review report and either approve or reject.

Claude Code MUST NOT:

  • Run CUT (this test stops at REVIEW).
  • Mutate the manifest in any way (the manifest is the Agent's artifact; Claude annotates separately).

5. PASS criteria

The test PASSes iff all of:

codex_outputs_present:
  manifest_json:                   true
  mark_report_md:                  true
  coverage_proof_json:             true
  determinism_digest_md:           true

codex_outputs_well_formed:
  manifest_format_version_1_0:     true
  manifest_digest_matches_recompute: true
  all_M1_M16_validation_rules_pass: true

codex_did_not_violate:
  no_db_writes:                    true
  no_kb_writes_outside_out_dir:    true
  no_claim_of_cut_done:            true
  no_self_advance:                 true
  approval_status_is_pending:      true

reviewer_R0_R11_pass:              true
  uncertainty_flags_resolved:      true | escalated_to_re_mark

If any line is false, the test FAILs. The failure pinpoints whether the gap is in Codex's behavior (Agent needs better instructions) or in the documentation (this package needs revision).

6. Acceptance: what "good MARK" looks like for Điều 37

Property Expected
Article count 1 (Điều 37 only)
Piece count typically 2–8 (title + 1–6 clauses/sub-points; depends on the actual source)
source_position dense 1..N, monotonic, unique
parent_local_piece_id graph exactly one root with parent=null; all non-root resolve in same article; acyclic
depth max ≤ 2 (deeper allowed only with flag + reviewer sign-off)
axis_b.unit_kind law_unit (every piece)
axis_b.section_type all in substrate vocab
axis_b.legal_document set to doc_code-derived value OR flagged
reconstruction.expected_digest == articles[0].original_text_hash
approval.status pending
cut_record null
verify_record null
STOP message exact string MARK_PROPOSAL_READY_FOR_REVIEW

7. Multi-article variant (Điều 37 + 38 + 39)

Repeat §3–§6 with all three articles. Additionally:

  • coverage_proof.json.requested_articles must equal articles_in_manifest;
  • coverage_proof.coverage_closed == true;
  • Each article's pieces are independently rooted (no cross-article parent links).

8. Failure modes to watch for

A fresh Agent often falls into one of these traps. The reviewer should specifically check:

trap_1_fabricated_source:
  symptom: pieces contain text not in the source body
  detector: hash recomputation on a fresh fetch

trap_2_invented_pieces:
  symptom: more pieces than the source has clauses
  detector: piece count plausibility + reconstruction byte-equal check

trap_3_missing_pieces:
  symptom: source has a clause not represented as a piece
  detector: reconstruction byte-equal check fails

trap_4_axis_b_drift:
  symptom: section_type='clause' but vocab only has 'paragraph' or 'definition'
  detector: R10 live-vocab check

trap_5_fake_approval:
  symptom: approval.status='approved' with approved_by='codex-mark-r1'
  detector: R11 forbidden-side-effects check

trap_6_self_advance_to_cut:
  symptom: report claims "CUT applied" or "IU created"
  detector: STOP message contains forbidden substrings

trap_7_fabricated_uuid:
  symptom: pieces have iu_id field set
  detector: schema validation — iu_id is NOT a manifest field (only local_piece_id)

trap_8_silent_url_fallback:
  symptom: source URL failed but Agent used a different URL without saying
  detector: source.url_or_file != user-requested URL

9. After the test

If PASS:

  • Record the Codex MARK output + Claude review under knowledge/dev/laws/luat-xyz-2024/mark-runs/<UTC>/.
  • The next macro is CUT: an operator runs dot_iu_cut_from_manifest against the approved manifest, per 05-dot-cut-from-approved-manifest-contract.md.

If FAIL:

  • Diagnose: instructions gap vs. Agent capability gap.
  • Update the 11 docs to close the instructions gap.
  • Re-run the test.

10. Iteration cap

If a fresh Agent fails this test three times in a row with different defects, the package is insufficient and must be redesigned. Two-strikes is acceptable noise; three is structural.

Back to Knowledge Hub knowledge/dev/laws/dieu44-trien-khai/v0.6-iu-core-operational-cut-workflow-mark-review-cut-verify/09-test-plan-codex-mark-then-claude-verify.md