80000x · 02 — Agent MARK Instructions (12-step procedure for fresh Agent)
Audience: a fresh Agent (Codex, Claude, GPT, etc.) that has never seen this project before.
Read this entire file BEFORE producing any MARK output. Do not skim.
0. What MARK is, in one paragraph
MARK is a proposal-only stage. You read a user request that names one or more articles and a source link, you fetch the source, you propose how each article should be cut into pieces, and you emit a single cut_manifest artifact for review. You do not call any DB write function. You do not say "CUT done". You do not say "IU created". Your output is a JSON manifest plus a Markdown report. Nothing more.
1. Hard rules (memorize)
forbidden_during_MARK:
- any DB INSERT/UPDATE/DELETE
- any call to fn_iu_create / fn_iu_compose / fn_iu_collection_add_piece /
fn_iu_piece_merge / fn_iu_piece_split / fn_iu_piece_retire
- any write to KB outside of the manifest output directory
- any claim of "CUT done", "IU created", "production updated"
- inventing a source URL that the user did not provide
- inventing a piece body the source does not contain
- fabricating a UUID for any piece (use local_piece_id only; see Step 7)
- self-advancing to Stage 3 (CUT)
allowed_during_MARK:
- read-only HTTP GET of the user-provided URL
- read-only file read of a user-provided file path
- read-only KB read / search
- read-only DB SELECT (only to discover vocabularies; no writes)
- writing files under --out-dir scratch
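The DB part of these rules can be approximated by a coarse pre-check run before any statement is sent. This is only a sketch: the function name `is_allowed_during_mark` is hypothetical, the check is textual rather than a real SQL parser, and it is deliberately conservative (it also rejects harmless statements such as CTEs). It should not be treated as a security boundary.

```python
import re

# Write functions named in the forbidden list above.
FORBIDDEN_FUNCTIONS = (
    "fn_iu_create",
    "fn_iu_compose",
    "fn_iu_collection_add_piece",
    "fn_iu_piece_merge",
    "fn_iu_piece_split",
    "fn_iu_piece_retire",
)


def is_allowed_during_mark(sql: str) -> bool:
    """Accept only a single SELECT statement that names no forbidden function."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        return False  # more than one statement
    if not re.match(r"(?i)select\b", statement):
        return False  # INSERT/UPDATE/DELETE and anything else
    lowered = statement.lower()
    return not any(fn in lowered for fn in FORBIDDEN_FUNCTIONS)
```

Note that a SELECT can still call a write function, which is why the function names are checked separately from the statement keyword.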
2. Step-by-step procedure
Step 1 — Read the user request
Extract:
- article_labels[] (e.g., ["Điều 37","Điều 38","Điều 39"]);
- one source per article (URL OR file OR inline text);
- a doc_code if the user states one (e.g., LUAT-XYZ); otherwise propose a candidate based on the source.
If any of these is missing or ambiguous, stop and ask one clarification question. Do not guess.
Step 2 — Confirm source is accessible
For each provided source:
- URL → HTTP GET with a 30 s timeout; record retrieved_at (UTC) and source_hash = sha256(body).
- File → read the file; record retrieved_at and source_hash.
- Inline text → use as-is; source_hash = sha256(text).
If a URL fails, report it in uncertainty_flags and stop. Do not silently fall back to a different source unless the user provided an alternate and you announce the switch.
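A minimal sketch of the URL and inline-text cases, using only the Python standard library. The function names are hypothetical, and the ISO 8601 format for retrieved_at is an assumption; the manifest schema may require a different one.

```python
import hashlib
import urllib.request
from datetime import datetime, timezone


def fetch_url(url: str, timeout_s: float = 30.0) -> bytes:
    """Read-only HTTP GET; raises on failure so the caller can flag it and stop."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.read()


def record_source(body: bytes) -> dict:
    """Return the provenance fields for one source, given its raw bytes."""
    return {
        # Assumption: ISO 8601 timestamp with an explicit UTC offset.
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "source_hash": hashlib.sha256(body).hexdigest(),
    }
```

For inline text, the same `record_source` call can be applied to the UTF-8 bytes of the text. An exception from `fetch_url` corresponds to the "report and stop" branch, not to a retry against some other source.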
Step 3 — Locate the article boundary
Within the source body, find the start and end of each requested article.
- Start anchor: a line/heading matching the article label (e.g., ^Điều\s+37\b, case-insensitive, allowing dot/colon punctuation).
- End anchor: the next sibling article label, OR EOF if last in document, OR the next section/chapter header at the same depth.
Record:
boundary:
start_quote: <first 80 chars of the article body, byte-exact>
end_quote: <last 80 chars of the article body, byte-exact>
method: regex_label_match | manual_anchor | inline_text_entire_input
If you cannot find the article label, set uncertainty_flags: [article_not_found] and stop.
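The label-matching case can be sketched as follows. `find_article` is a hypothetical helper; it handles only the "next sibling article" and EOF end anchors, not chapter or section headers, and it trims trailing whitespace from the article body, which is an assumption about what counts as the body.

```python
import re
from typing import Optional


def find_article(body: str, number: int) -> Optional[dict]:
    """Locate one article by its label; return text and boundary, or None."""
    # Start anchor: "Điều <n>" at the start of a line, then a word boundary.
    start_re = re.compile(rf"^Điều\s+{number}\b", re.IGNORECASE | re.MULTILINE)
    start = start_re.search(body)
    if start is None:
        return None  # caller sets uncertainty_flags: [article_not_found] and stops
    # End anchor: the next article label of any number, else end of document.
    next_re = re.compile(r"^Điều\s+\d+\b", re.IGNORECASE | re.MULTILINE)
    following = next_re.search(body, start.end())
    end = following.start() if following else len(body)
    text = body[start.start():end].rstrip()
    return {
        "text": text,
        "boundary": {
            "start_quote": text[:80],
            "end_quote": text[-80:],
            "method": "regex_label_match",
        },
    }
```

The word boundary after the number keeps a request for article 3 from matching article 37.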
Step 4 — Propose piece segmentation
Inside each article, identify segments. The default segmentation rule for legal documents:
- The article title (e.g., Điều 37. Tên điều khoản) → one piece, piece_role='title', section_type='article'.
- Each top-level paragraph or numbered clause (1., 2., 3.) → one piece, piece_role='clause' or 'body', section_type='paragraph' or 'clause'.
- Each sub-point (e.g., a), b)) → one piece, piece_role='step', section_type='definition' or 'paragraph', with parent_local_piece_id pointing at the parent clause.
- Trailing references, citations, or notes → piece_role='reference'.
If the article has only a title and one body paragraph, two pieces suffice. Do not invent sub-points the source does not contain.
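The default rule can be sketched for the simplest layout, where every title, clause, and sub-point sits on its own line. That one-piece-per-line layout is an assumption, as are the helper name `segment_article` and the `parent_index` field, which stands in for parent_local_piece_id until IDs are assigned. Trailing references are not detected here.

```python
import re

CLAUSE_RE = re.compile(r"^\d+\.\s")       # "1. ", "2. "
SUBPOINT_RE = re.compile(r"^[a-zđ]\)\s")  # "a) ", "b) "


def segment_article(article_text: str) -> list[dict]:
    """Split an article into title, clause, and sub-point pieces, one per line."""
    pieces: list[dict] = []
    last_clause_index = None
    for line in article_text.splitlines():
        if not line.strip():
            continue  # blank lines separate blocks and carry no piece
        if not pieces:
            piece = {"piece_role": "title", "section_type": "article",
                     "depth": 0, "parent_index": None}
        elif CLAUSE_RE.match(line):
            piece = {"piece_role": "clause", "section_type": "paragraph",
                     "depth": 1, "parent_index": 0}
            last_clause_index = len(pieces)
        elif SUBPOINT_RE.match(line) and last_clause_index is not None:
            piece = {"piece_role": "step", "section_type": "paragraph",
                     "depth": 2, "parent_index": last_clause_index}
        else:
            piece = {"piece_role": "body", "section_type": "paragraph",
                     "depth": 1, "parent_index": 0}
        piece["text"] = line  # body is copied from the source, never rewritten
        pieces.append(piece)
    return pieces
```

A sub-point that appears before any clause falls through to 'body' rather than being attached to an invented parent.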
Step 5 — Assign source_position
Walk the article from top to bottom. Number pieces 1, 2, 3, … in reading order. The numbering MUST be:
- dense (no gaps);
- monotonic (strictly increasing);
- unique within the article.
This is Axis A. Document the rule and the assignment.
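The three properties can be checked mechanically before the manifest is written. A small sketch, with a hypothetical function name and hypothetical problem labels:

```python
def check_source_positions(positions: list[int]) -> list[str]:
    """Return the violated properties; an empty list means the numbering is valid."""
    problems = []
    distinct = set(positions)
    if len(distinct) != len(positions):
        problems.append("not_unique")
    if any(b <= a for a, b in zip(positions, positions[1:])):
        problems.append("not_monotonic")
    # Dense means the values are exactly 1..n with nothing skipped.
    if sorted(distinct) != list(range(1, len(distinct) + 1)):
        problems.append("not_dense")
    return problems
```

For example, `[1, 3, 4]` is unique and increasing but skips 2, so only the density check fails.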
Step 6 — Assign Axis A / B / C draft metadata
For each proposed piece:
axis_a:
source_position: <int, from step 5>
source_url: <article URL>
source_hash: <sha256 of the source body>
axis_b:
legal_document: <e.g., "luat-XYZ-2024"> # if known
section_type: <one of the substrate vocab> # required
unit_kind: design_doc_section | law_unit # required
professional_tags: [] # optional, only if obvious from source
axis_c:
parent_local_piece_id: <local_piece_id or null> # null only for the article title
depth: <int 0|1|2> # 0=title, 1=clause/paragraph, 2=sub-point
If you are unsure about any axis field, leave it as null and add an entry to uncertainty_flags[]. Do not guess.
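The Axis C comments imply a few consistency rules that can be checked per article. The sketch below assumes that a parent always sits exactly one depth level above its child, which the text suggests but does not state outright; `check_axis_c` is a hypothetical name. Pieces whose depth was left null on purpose are skipped, since those are covered by uncertainty_flags instead.

```python
def check_axis_c(pieces: list[dict]) -> list[str]:
    """Return problems in parent/depth metadata; an empty list means consistent."""
    by_id = {p["local_piece_id"]: p for p in pieces}
    problems = []
    for p in pieces:
        pid = p["local_piece_id"]
        parent = p["parent_local_piece_id"]
        depth = p["depth"]
        if depth is None:
            continue  # left null deliberately; must be explained by a flag
        if depth not in (0, 1, 2):
            problems.append(f"{pid}: depth {depth} is outside 0..2")
        elif (parent is None) != (depth == 0):
            problems.append(f"{pid}: a null parent is reserved for the depth-0 title")
        elif parent is not None and parent not in by_id:
            problems.append(f"{pid}: parent {parent} is not in this article")
        elif parent is not None and by_id[parent]["depth"] not in (None, depth - 1):
            problems.append(f"{pid}: parent {parent} is not exactly one level up")
    return problems
```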
Step 7 — Compute local_piece_id and content hashes
local_piece_id: lp-<3-digit-zero-padded>-<short-slug>
# e.g., lp-001-title, lp-002-paragraph-1, lp-003-paragraph-2a
text: <byte-exact piece body from the source>
text_hash: sha256(text)
local_piece_id is a proposal handle. It is NOT a real iu_id. The CUT stage assigns real UUIDs.
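A sketch of both computations. Two assumptions are made here: the three-digit number is the piece's source_position (consistent with the examples above but not stated), and text_hash is taken over the UTF-8 bytes of the text. The function names are hypothetical.

```python
import hashlib
import re


def make_local_piece_id(source_position: int, slug: str) -> str:
    """Build lp-<3-digit-zero-padded>-<short-slug>; a proposal handle, not an iu_id."""
    clean = re.sub(r"[^a-z0-9]+", "-", slug.lower()).strip("-")
    return f"lp-{source_position:03d}-{clean}"


def text_hash(text: str) -> str:
    """sha256 over the piece body exactly as it appears in the source."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

No UUID library is involved at any point, which keeps the handle visibly distinct from the real identifiers that CUT assigns.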
Step 8 — Build the reconstruction preview
Concatenate the pieces in source_position order. Apply the same normalization rule you used when computing original_text_hash (typically: collapse runs of whitespace to single space, trim leading/trailing whitespace, preserve newlines between block-level pieces).
Assert: sha256(normalized_concat) == manifest.articles[].original_text_hash.
If not equal → set uncertainty_flags: [reconstruction_mismatch], attach a diff, and stop. Do not "fix" by changing piece bodies.
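The check can be sketched as below, assuming the "typical" normalization named above and assuming that each non-blank source line is one block-level piece. Both assumptions, and the helper names, are illustrative; the rule actually used for original_text_hash takes precedence.

```python
import hashlib
import re


def normalize(block: str) -> str:
    """Collapse whitespace runs inside one block to a single space and trim it."""
    return re.sub(r"\s+", " ", block).strip()


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def reconstruct(pieces: list[dict]) -> str:
    """Join normalized piece bodies in source_position order, one block per line."""
    ordered = sorted(pieces, key=lambda p: p["source_position"])
    return "\n".join(normalize(p["text"]) for p in ordered)


def reconstruction_matches(pieces: list[dict], original_text: str) -> bool:
    """Compare the hash of the rebuilt article with the hash of the original."""
    original = "\n".join(
        normalize(line) for line in original_text.splitlines() if line.strip()
    )
    return _sha256(reconstruct(pieces)) == _sha256(original)
```

A False result maps to the reconstruction_mismatch flag; the piece bodies themselves stay untouched.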
Step 9 — Compute manifest_digest
import hashlib
import json

# Canonical form: sorted keys, no extra whitespace, non-ASCII characters kept as-is.
canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
manifest_digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
Record manifest_digest in the manifest header. The Stage 3 CUT command will recompute this and refuse to run if it does not match.
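Because the digest is stored inside the document it describes, the recomputation only agrees if both sides hash the same fields. The text does not say whether the manifest_digest field is part of the hashed input. The sketch below assumes it is a top-level key that is removed before hashing; that assumption should be checked against 03-cut-manifest-schema.md. The function names are hypothetical.

```python
import copy
import hashlib
import json


def compute_manifest_digest(manifest: dict) -> str:
    """Digest of the canonical JSON of the manifest, without its own digest field."""
    body = copy.deepcopy(manifest)
    body.pop("manifest_digest", None)  # assumption: excluded from the hashed input
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def digest_is_current(manifest: dict) -> bool:
    """True if the recorded digest still matches the manifest content."""
    return manifest.get("manifest_digest") == compute_manifest_digest(manifest)
```

Sorting the keys makes the digest independent of key order, so any edit to the manifest content after the digest is recorded shows up as a mismatch.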
Step 10 — Emit outputs
Write to --out-dir (an ephemeral scratch dir provided by the operator, e.g., $WD/manifest/):
- manifest.json (the cut_manifest, schema in 03-cut-manifest-schema.md);
- mark_report.md (human summary; see 08-agent-mark-output-template.md);
- coverage_proof.json (lists article_labels[] requested vs. articles present in manifest);
- determinism_digest.md (logs manifest_digest and the inputs that determined it).
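The coverage proof is a comparison of two label lists. A sketch with hypothetical field names (`article_label`, `requested`, `present`, `missing`, `unrequested`); the real layout may be fixed elsewhere in the project:

```python
def build_coverage_proof(requested: list[str], manifest: dict) -> dict:
    """Compare the requested article labels with those present in the manifest."""
    present = [article["article_label"] for article in manifest["articles"]]
    return {
        "requested": requested,
        "present": present,
        "missing": [label for label in requested if label not in present],
        "unrequested": [label for label in present if label not in requested],
    }
```

A non-empty `missing` or `unrequested` list would be worth surfacing in mark_report.md rather than leaving for the operator to discover.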
Step 11 — Upload artifacts to KB
Upload all 4 files under the user-specified KB path (default:
knowledge/dev/laws/<doc-folder>/mark-runs/<UTC-timestamp>/).
Set the manifest's approval.status = pending.
Step 12 — STOP and route to operator
Output exactly:
MARK_PROPOSAL_READY_FOR_REVIEW
manifest_path: <KB path>
manifest_digest: <sha256>
articles_proposed: N
uncertainty_flags: [...]
next_step: operator_review_per_04_review_approval_checklist
Do not print "CUT done", "IU created", "production updated", or any equivalent. If you do, you have violated the operating model.
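The hand-off block can be produced by one function that also refuses to emit the forbidden claims. A sketch; `format_handoff` is a hypothetical name, and a literal phrase check cannot catch "any equivalent" wording, so it complements the rule rather than enforcing it.

```python
FORBIDDEN_CLAIMS = ("CUT done", "IU created", "production updated")


def format_handoff(manifest_path: str, manifest_digest: str,
                   articles_proposed: int, uncertainty_flags: list[str]) -> str:
    """Build the Step 12 message and reject it if it contains a forbidden claim."""
    message = "\n".join([
        "MARK_PROPOSAL_READY_FOR_REVIEW",
        f"manifest_path: {manifest_path}",
        f"manifest_digest: {manifest_digest}",
        f"articles_proposed: {articles_proposed}",
        f"uncertainty_flags: [{', '.join(uncertainty_flags)}]",
        "next_step: operator_review_per_04_review_approval_checklist",
    ])
    lowered = message.lower()
    for claim in FORBIDDEN_CLAIMS:
        if claim.lower() in lowered:
            raise ValueError(f"forbidden claim in MARK output: {claim!r}")
    return message
```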
3. Frequently violated rules — DO NOT do these
- ❌ "I'll just run fn_iu_compose to verify the structure works." → No. MARK is read-only.
- ❌ "The article numbering looks wrong, so I'll fix it." → No. Source is the source of truth.
- ❌ "I cannot find the article at the URL, but here's what I think it says." → No. Stop and report article_not_found.
- ❌ "I'll approve my own manifest because the result looks obvious." → No. Self-advance is prohibited.
- ❌ "I'll generate a UUID for each piece so the CUT stage doesn't have to." → No. Use local_piece_id only.
4. When you are uncertain
Set uncertainty_flags[] honestly. Examples:
- boundary_ambiguous: two plausible end anchors.
- piece_role_ambiguous: clause or body? hand to operator.
- parent_inferred: indentation suggested a parent but no explicit marker.
- section_type_unknown: source structure doesn't map cleanly to substrate vocab.
- professional_tag_missing: no obvious tag from source; operator may supply.
An honest PARTIAL is always preferred over a fake PASS. The operator can correct one ambiguity faster than they can investigate ten fabricated pieces.
5. After approval (informational only — you do NOT run this)
Once Stage 2 approves the manifest, the operator will invoke:
dot_iu_cut_from_manifest --manifest <approved_manifest.json> --apply --reconstruct --verify
That is not your concern. Your responsibility ends at Step 12. If the operator asks you to also run CUT, refuse and refer them to 05-dot-cut-from-approved-manifest-contract.md.
6. Reference vocab cheat sheet
Discovered live at 70000x and subject to drift; always re-discover from pg_get_constraintdef or iu_metadata_tag_registry when in doubt:
unit_kind: {design_doc_section, law_unit}
piece_role: {title, intro, body, step, clause, appendix, reference}
section_type: {appendix, article, changelog, checklist, definition,
governance_process, heading, instruction_block,
paragraph, principle, process, section, technical_spec}
axis_b_kinds: {legal_document, section_type, unit_kind}
iu_sql_link.link_role: {represents, references, derived_from, supersedes,
rolled_up_from, audits, mirrors, indexes, hosts,
validates, projects} # 11 roles; default 'represents'
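The cheat sheet can double as a local pre-check on proposed pieces. The sketch below hard-codes the snapshot above, so it inherits the same drift risk and is no substitute for re-discovering the live vocabularies. `vocab_violations` is a hypothetical name.

```python
# Snapshot of the vocabularies listed above; re-discover before relying on it.
VOCAB = {
    "unit_kind": {"design_doc_section", "law_unit"},
    "piece_role": {"title", "intro", "body", "step", "clause", "appendix", "reference"},
    "section_type": {
        "appendix", "article", "changelog", "checklist", "definition",
        "governance_process", "heading", "instruction_block",
        "paragraph", "principle", "process", "section", "technical_spec",
    },
}


def vocab_violations(piece: dict) -> list[str]:
    """List fields whose value is set but absent from the snapshot vocabulary."""
    return [
        f"{field}={piece[field]!r}"
        for field, allowed in VOCAB.items()
        if piece.get(field) is not None and piece[field] not in allowed
    ]
```

Against this snapshot, a piece with section_type='clause' would be reported, so that value is worth confirming against the live constraint before it is used.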
7. Output template
Use the copy-pasteable scaffold in 08-agent-mark-output-template.md. Fill in every field. Leave a field blank only if the corresponding uncertainty_flags[] entry explains why.