02 — Agent MARK Instructions

Audience: a fresh Agent (Codex, Claude, GPT, etc.) that has never seen this project before.

Read this entire file BEFORE producing any MARK output. Do not skim.

0. What MARK is, in one paragraph

MARK is a proposal-only stage. You read a user request that names one or more articles and a source link, you fetch the source, you propose how each article should be cut into pieces, and you emit a single cut_manifest artifact for review. You do not call any DB write function. You do not say "CUT done". You do not say "IU created". Your output is a JSON manifest plus a Markdown report. Nothing more.

1. Hard rules (memorize)

forbidden_during_MARK:
  - any DB INSERT/UPDATE/DELETE
  - any call to fn_iu_create / fn_iu_compose / fn_iu_collection_add_piece /
                fn_iu_piece_merge / fn_iu_piece_split / fn_iu_piece_retire
  - any write to KB outside of the manifest output directory
  - any claim of "CUT done", "IU created", "production updated"
  - inventing a source URL that the user did not provide
  - inventing a piece body the source does not contain
  - fabricating a UUID for any piece (use local_piece_id only; see §2, Step 7)
  - self-advancing to Stage 3 (CUT)
allowed_during_MARK:
  - read-only HTTP GET of the user-provided URL
  - read-only file read of a user-provided file path
  - read-only KB read / search
  - read-only DB SELECT (only to discover vocabularies; no writes)
  - writing files under --out-dir scratch

2. Step-by-step procedure

Step 1 — Read the user request

Extract:

article_labels[] (e.g., ["Điều 37","Điều 38","Điều 39"]);
one source per article (URL OR file OR inline text);
a doc_code if the user states one (e.g., LUAT-XYZ); otherwise propose a candidate based on the source.

If any of these is missing or ambiguous, stop and ask one clarification question. Do not guess.

Step 2 — Confirm source is accessible

For each provided source:

URL → HTTP GET with a 30 s timeout; record retrieved_at (UTC) and source_hash = sha256(body).
File → read the file; record retrieved_at and source_hash.
Inline text → use as-is; source_hash = sha256(text).

If a URL fails, report it in uncertainty_flags and stop. Do not silently fall back to a different source unless the user provided an alternate and you announce the switch.
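The three source cases above can be sketched as follows. This is a minimal illustration, not the project's fetcher: the function names are hypothetical, and hashing inline text over its UTF-8 encoding is an assumption.

```python
import hashlib
import urllib.request
from datetime import datetime, timezone


def record_source(body: bytes) -> dict:
    """Return the provenance fields Step 2 asks for, given the raw source bytes."""
    return {
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "source_hash": hashlib.sha256(body).hexdigest(),
    }


def fetch_url(url: str) -> tuple:
    """Read-only GET with a 30 s timeout; returns (body, provenance record)."""
    # Any network error propagates, so the caller can flag the failure and stop.
    with urllib.request.urlopen(url, timeout=30) as response:
        body = response.read()
    return body, record_source(body)


def record_inline_text(text: str) -> dict:
    # Assumption: inline text is hashed over its UTF-8 encoding.
    return record_source(text.encode("utf-8"))
```

Letting the exception from fetch_url propagate keeps the failure visible; there is deliberately no fallback branch.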

Step 3 — Locate the article boundary

Within the source body, find the start and end of each requested article.

Start anchor: a line/heading matching the article label (e.g., ^Điều\s+37\b, case-insensitive, allowing dot/colon punctuation).
End anchor: whichever comes first of the next sibling article label, the next section/chapter header at the same depth, or EOF if the article is last in the document.

Record:

boundary:
  start_quote: <first 80 chars of the article body, byte-exact>
  end_quote:   <last 80 chars of the article body, byte-exact>
  method:      regex_label_match | manual_anchor | inline_text_entire_input

If you cannot find the article label, set uncertainty_flags: [article_not_found] and stop.
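A hedged sketch of the regex_label_match method. The function name is hypothetical, only the next article label and EOF are handled as end anchors (chapter headers are ignored), and stripping trailing whitespace before taking end_quote is an assumption.

```python
import re
from typing import Optional


def find_article_boundary(body: str, number: int) -> Optional[dict]:
    """Locate one article by its label; return the boundary record, or None if absent."""
    start_re = re.compile(rf"^Điều\s+{number}\b", re.IGNORECASE | re.MULTILINE)
    start = start_re.search(body)
    if start is None:
        return None  # caller sets uncertainty_flags: [article_not_found] and stops

    # End anchor: the next article label of any number, or EOF for the last article.
    next_re = re.compile(r"^Điều\s+\d+\b", re.IGNORECASE | re.MULTILINE)
    following = next_re.search(body, start.end())
    end = following.start() if following else len(body)

    # Assumption: trailing whitespace before the next label is not part of the article.
    article = body[start.start():end].rstrip()
    return {
        "start_quote": article[:80],
        "end_quote": article[-80:],
        "method": "regex_label_match",
    }
```

The \b after the number keeps a request for article 3 from matching "Điều 37".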

Step 4 — Propose piece segmentation

Inside each article, identify segments. The default segmentation rule for legal documents:

The article title (e.g., Điều 37. Tên điều khoản) → one piece, piece_role='title', section_type='article'.
Each top-level paragraph or numbered clause (1., 2., 3.) → one piece, piece_role='clause' or 'body', section_type='paragraph' ('clause' is a piece_role; it is not a section_type in the §6 vocabulary).
Each sub-point (e.g., a), b)) → one piece, piece_role='step', section_type='definition' or 'paragraph', with parent_local_piece_id pointing at the parent clause.
Trailing references, citations, or notes → piece_role='reference'.

If the article has only a title and one body paragraph, two pieces suffice. Do not invent sub-points the source does not contain.
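The default rule can be sketched as a line classifier. This is illustrative only: it assumes one segment per source line (wrapped clauses would need joining first), the function name is hypothetical, and the parent is recorded as a list index rather than a local_piece_id.

```python
import re

CLAUSE_RE = re.compile(r"^\d+\.\s")          # numbered clause: "1. ", "2. "
SUBPOINT_RE = re.compile(r"^[^\W\d_]\)\s")   # lettered sub-point: "a) ", "b) "


def segment_article(article: str) -> list:
    """Classify each non-empty line as title, clause, sub-point, or body paragraph."""
    pieces = []
    last_clause = None  # index in `pieces` of the most recent numbered clause
    for line in article.splitlines():
        probe = line.strip()
        if not probe:
            continue  # blank lines separate blocks; they are not pieces
        if not pieces:
            role, section, parent, depth = "title", "article", None, 0
        elif CLAUSE_RE.match(probe):
            role, section, parent, depth = "clause", "paragraph", 0, 1
            last_clause = len(pieces)
        elif SUBPOINT_RE.match(probe) and last_clause is not None:
            role, section, parent, depth = "step", "paragraph", last_clause, 2
        else:
            role, section, parent, depth = "body", "paragraph", 0, 1
        pieces.append({
            "text": line,  # kept exactly as it appears in the source line
            "piece_role": role,
            "section_type": section,
            "parent": parent,
            "depth": depth,
        })
    return pieces
```

A lettered line with no preceding clause falls through to 'body'; the classifier never creates a sub-point without a parent clause.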

Step 5 — Assign source_position

Walk the article from top to bottom. Number pieces 1, 2, 3, … in reading order. The numbering MUST be:

dense (no gaps);
monotonic (strictly increasing);
unique within the article.

This is Axis A. Document the rule and the assignment.
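The three numbering properties can be checked mechanically before the manifest is written. A minimal sketch; the function name and the problem labels are hypothetical.

```python
def check_source_positions(positions: list) -> list:
    """Return the violated properties; an empty list means the numbering is valid."""
    problems = []
    distinct = set(positions)
    if len(distinct) != len(positions):
        problems.append("duplicate")
    # Strictly increasing in the order the pieces are listed.
    if any(b <= a for a, b in zip(positions, positions[1:])):
        problems.append("not_monotonic")
    # Dense: the distinct values are exactly 1..n.
    if sorted(distinct) != list(range(1, len(distinct) + 1)):
        problems.append("gap")
    return problems
```

For example, check_source_positions([1, 3]) reports only "gap", while [1, 2, 2] reports "duplicate" and "not_monotonic".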

Step 6 — Assign Axis A / B / C draft metadata

For each proposed piece:

axis_a:
  source_position: <int, from step 5>
  source_url:      <article URL>
  source_hash:     <sha256 of the source body>
axis_b:
  legal_document:  <e.g., "luat-XYZ-2024">           # if known
  section_type:    <one of the substrate vocab>      # required
  unit_kind:       design_doc_section | law_unit     # required
  professional_tags: []                              # optional, only if obvious from source
axis_c:
  parent_local_piece_id: <local_piece_id or null>    # null only for the article title
  depth: <int 0|1|2>                                  # 0=title, 1=clause/paragraph, 2=sub-point

If you are unsure about any axis field, leave it as null and add an entry to uncertainty_flags[]. Do not guess.
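The Axis C fields lend themselves to a consistency check. The sketch below assumes that a piece sits exactly one level below its parent, which follows from the depth legend above but is not stated as a rule; the function name and problem labels are hypothetical.

```python
def check_axis_c(pieces: list) -> list:
    """Return (local_piece_id, problem) pairs for inconsistent parent/depth values."""
    by_id = {p["local_piece_id"]: p for p in pieces}
    problems = []
    for p in pieces:
        pid = p["local_piece_id"]
        parent_id = p["parent_local_piece_id"]
        if parent_id is None:
            # A null parent is allowed only for the article title (depth 0).
            if p["depth"] != 0:
                problems.append((pid, "null_parent_below_title"))
            continue
        parent = by_id.get(parent_id)
        if parent is None:
            problems.append((pid, "unknown_parent"))
        elif p["depth"] != parent["depth"] + 1:
            # Assumption: a child is exactly one level deeper than its parent.
            problems.append((pid, "depth_mismatch"))
    return problems
```

Any reported problem is a candidate for uncertainty_flags[], not something to repair by guessing.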

Step 7 — Compute `local_piece_id` and content hashes

local_piece_id: lp-<3-digit-zero-padded>-<short-slug>
                # e.g., lp-001-title, lp-002-paragraph-1, lp-003-paragraph-2a
text:           <byte-exact piece body from the source>
text_hash:      sha256(text)

local_piece_id is a proposal handle. It is NOT a real iu_id. The CUT stage assigns real UUIDs.
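A short sketch of the handle and hash computation. The helper names are hypothetical; using the piece's position as the 3-digit number and hashing the UTF-8 encoding of the text are both assumptions.

```python
import hashlib


def make_local_piece_id(position: int, slug: str) -> str:
    """Build a proposal handle such as lp-003-paragraph-2a."""
    # Assumption: the 3-digit number is the piece's position within the article.
    return f"lp-{position:03d}-{slug}"


def make_piece(position: int, slug: str, text: str) -> dict:
    return {
        "local_piece_id": make_local_piece_id(position, slug),
        "text": text,  # byte-exact from the source; never edited here
        # Assumption: the hash is taken over the UTF-8 encoding of the text.
        "text_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }
```

No uuid module appears anywhere in this step; the handle is derived only from position and slug.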

Step 8 — Build the reconstruction preview

Concatenate the pieces in source_position order. Apply the same normalization rule you used when computing original_text_hash (typically: collapse runs of whitespace within a piece to a single space, trim leading and trailing whitespace, and keep a newline between block-level pieces).

Assert: sha256(normalized_concat) == manifest.articles[].original_text_hash.

If not equal → set uncertainty_flags: [reconstruction_mismatch], attach a diff, and stop. Do not "fix" by changing piece bodies.
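One way the check might look, assuming the "typical" normalization described above and a newline as the joiner between pieces. The function names are hypothetical, and the real rule must be whichever one produced original_text_hash.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse space/tab runs per line, trim each line, and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def reconstruction_matches(pieces: list, original_text_hash: str) -> bool:
    """Rebuild the article from its pieces and compare against the recorded hash."""
    ordered = sorted(pieces, key=lambda p: p["source_position"])
    concat = "\n".join(p["text"] for p in ordered)
    digest = hashlib.sha256(normalize(concat).encode("utf-8")).hexdigest()
    return digest == original_text_hash
```

A False result is reported as reconstruction_mismatch; the function never modifies a piece to force a match.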

Step 9 — Compute `manifest_digest`

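import hashlib
import json

# manifest: the cut_manifest dict as it stands before manifest_digest is recorded in it.
# Canonical form: sorted keys, compact separators, non-ASCII kept as UTF-8.
canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
manifest_digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()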

Record manifest_digest in the manifest header. The Stage 3 CUT command will recompute this and refuse to run if it does not match.

Step 10 — Emit outputs

Write to --out-dir (an ephemeral scratch dir provided by the operator, e.g., $WD/manifest/):

manifest.json (the cut_manifest, schema in 03-cut-manifest-schema.md);
mark_report.md (human summary; see 08-agent-mark-output-template.md);
coverage_proof.json (lists article_labels[] requested vs. articles present in manifest);
determinism_digest.md (logs manifest_digest and the inputs that determined it).
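A possible shape for coverage_proof.json, sketched under assumptions: the field names below (requested, present, missing, unrequested) and the article_label key inside manifest["articles"] are hypothetical, and the authoritative layout is whatever 03-cut-manifest-schema.md defines.

```python
def build_coverage_proof(requested: list, manifest: dict) -> dict:
    """Compare the requested article labels with the articles present in the manifest."""
    # Assumption: each manifest article carries its label under "article_label".
    present = [article["article_label"] for article in manifest["articles"]]
    return {
        "requested": requested,
        "present": present,
        "missing": [label for label in requested if label not in present],
        "unrequested": [label for label in present if label not in requested],
    }
```

A non-empty missing or unrequested list is worth surfacing in mark_report.md before the operator reviews the manifest.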

Step 11 — Upload artifacts to KB

Upload all 4 files under the user-specified KB path (default: knowledge/dev/laws/<doc-folder>/mark-runs/<UTC-timestamp>/). Set the manifest's approval.status = pending.

Step 12 — STOP and route to operator

Output exactly:

MARK_PROPOSAL_READY_FOR_REVIEW
manifest_path: <KB path>
manifest_digest: <sha256>
articles_proposed: N
uncertainty_flags: [...]
next_step: operator_review_per_04_review_approval_checklist

Do not print "CUT done", "IU created", "production updated", or any equivalent. Printing any of them violates the operating model.
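Rendering the hand-off block from a fixed template helps keep stray status claims out of the output. A minimal sketch; the function name is hypothetical, and formatting the flags as a bracketed, comma-separated list is an assumption.

```python
def render_ready_message(manifest_path: str, manifest_digest: str,
                         articles_proposed: int, uncertainty_flags: list) -> str:
    """Render the fixed six-line hand-off block; nothing else is appended."""
    lines = [
        "MARK_PROPOSAL_READY_FOR_REVIEW",
        f"manifest_path: {manifest_path}",
        f"manifest_digest: {manifest_digest}",
        f"articles_proposed: {articles_proposed}",
        # Assumption: flags are rendered as a bracketed, comma-separated list.
        f"uncertainty_flags: [{', '.join(uncertainty_flags)}]",
        "next_step: operator_review_per_04_review_approval_checklist",
    ]
    return "\n".join(lines)
```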

3. Frequently violated rules — DO NOT do these

❌ "I'll just run fn_iu_compose to verify the structure works." → No. MARK is read-only.
❌ "The article numbering looks wrong, so I'll fix it." → No. Source is the source of truth.
❌ "I cannot find the article at the URL, but here's what I think it says." → No. Stop and report article_not_found.
❌ "I'll approve my own manifest because the result looks obvious." → No. Self-advance is prohibited.
❌ "I'll generate a UUID for each piece so the CUT stage doesn't have to." → No. Use local_piece_id only.

4. When you are uncertain

Set uncertainty_flags[] honestly. Examples:

boundary_ambiguous: two plausible end anchors.
piece_role_ambiguous: clause or body? hand to operator.
parent_inferred: indentation suggested a parent but no explicit marker.
section_type_unknown: source structure doesn't map cleanly to substrate vocab.
professional_tag_missing: no obvious tag from source; operator may supply.

An honest PARTIAL is always preferred over a fake PASS. The operator can correct one ambiguity faster than they can investigate ten fabricated pieces.

5. After approval (informational only — you do NOT run this)

Once Stage 2 approves the manifest, the operator will invoke:

dot_iu_cut_from_manifest --manifest <approved_manifest.json> --apply --reconstruct --verify

That is not your concern. Your responsibility ends at Step 12. If the operator asks you to also run CUT, refuse and refer them to 05-dot-cut-from-approved-manifest-contract.md.

6. Reference vocab cheat sheet

Discovered live at 70000x and subject to drift; when in doubt, always re-discover from pg_get_constraintdef or iu_metadata_tag_registry:

unit_kind:         {design_doc_section, law_unit}
piece_role:        {title, intro, body, step, clause, appendix, reference}
section_type:      {appendix, article, changelog, checklist, definition,
                    governance_process, heading, instruction_block,
                    paragraph, principle, process, section, technical_spec}
axis_b_kinds:      {legal_document, section_type, unit_kind}
iu_sql_link.link_role: {represents, references, derived_from, supersedes,
                        rolled_up_from, audits, mirrors, indexes, hosts,
                        validates, projects}   # 11 roles; default 'represents'
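The cheat sheet can be used as a local pre-check on each proposed piece. The sketch below hard-codes the snapshot above, so it drifts when the live vocabulary does; the function name is hypothetical, and null fields are skipped because they are covered by uncertainty_flags[].

```python
VOCAB = {
    "unit_kind": {"design_doc_section", "law_unit"},
    "piece_role": {"title", "intro", "body", "step", "clause", "appendix", "reference"},
    "section_type": {
        "appendix", "article", "changelog", "checklist", "definition",
        "governance_process", "heading", "instruction_block",
        "paragraph", "principle", "process", "section", "technical_spec",
    },
}


def vocab_violations(piece: dict) -> list:
    """List every field whose non-null value is outside the snapshot vocabulary."""
    return [
        f"{field}={piece[field]!r}"
        for field in sorted(VOCAB)
        if piece.get(field) is not None and piece[field] not in VOCAB[field]
    ]
```

For instance, a piece with section_type set to "clause" is reported, since "clause" appears only under piece_role.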

7. Output template

Use the copy-pasteable scaffold in 08-agent-mark-output-template.md. Fill in every field. Leave a field blank only if the corresponding uncertainty_flags[] entry explains why.