dot-iu-cutter v0.1 — Operational Problem Statement Rev2 — C1A Integrated — 2026-05-14

Status

status=DRAFT_FOR_USER_APPROVAL
agent_dispatch_allowed=false
implementation_allowed=false
purpose=approve_problem_statement_before_design

This is not an Agent task. This document is the proposed operational problem statement for User approval before any Agent is asked to design.

0. Change from Rev1

Rev1 correctly identified Mark → Review → Cut, but did not fully integrate the existing canonical segmentation law.

Rev2 fixes that.

Controlling segmentation foundation:

knowledge/dev/laws/dieu38-trien-khai/C1A-segmentation-operating-model.md

C1A is OFFICIAL, User PASS / GPT PASS. It already answers the foundational question:

By what rules does an Agent cut a document into pieces of information?

Therefore, dot-iu-cutter must not invent a new segmentation law. It must operationalize C1A into an automated closed-loop process.

1. Concise process problem statement

Design an operational process so that when the user says:

Cắt luật A ("Cut law A")

or, later:

Cắt văn bản X ("Cut document X")

the system can automatically execute a closed-loop workflow:

Resolve source
→ check existing cut/history/collisions
→ MARK semantic cut manifest under C1A rules
→ REVIEW manifest under independent AI review
→ CUT deterministically from approved manifest
→ VERIFY by round-trip/no-loss/no-overlap/invariants
→ REPORT result and rollback keys
→ if later wrong granularity is found, correct by governed SPLIT/MERGE lifecycle

The core design problem is not “how to call fn_iu_create.” Phase 5C2 proved that execution is feasible. The core design problem is:

How does the system decide where to cut, prove the decision is safe, detect errors, and repair structure later without relying on hidden human judgment?

Default authority:

AI decides and reviews by default.
Human/User is final approver of this problem statement and later high-risk policies, but is not in the normal per-document cutting loop.

Human escalation is exceptional, not normal.

2. Canonical rules that must be inherited from C1A

The future design must explicitly inherit these C1A elements.

2.1 Three-question test — primary cut test

Every candidate unit must pass C1A §3.2:

Clear title? Can the unit be named so that another agent understands its main idea without opening it?
Independently editable? Can this unit be edited without necessarily editing another unit?
Not too hard to edit? Is it short and simple enough that review remains practical?

These three questions are the default semantic unit test. They must appear in the Mark and Review stages.

2.2 SR-1 → SR-7 — official segmentation rules

Design must carry forward:

SR-1: section with clear title + independently editable → one logical unit.
SR-2: no clear title → body of parent, not its own unit.
SR-3: if editing A necessarily pulls B → merge/group as one unit.
SR-4: too short + no authority → merge with parent/sibling.
SR-5: cut by meaning, not mechanically.
SR-6: title must describe meaning; no mechanical A/B/C names.
SR-7: each unit has exactly one canonical parent in the structural tree.

2.3 OD-PILOT edge cases

Design must carry forward:

OD-01 Code/config block: default body of parent; separate only if independently referenced, versioned, testable, or reusable.
OD-02 Heading-only: valid unit if it has authority/governance role; otherwise structural/navigation node.
OD-03 Mission/instruction block: keep atomic even if long unless it can be split into independent missions.
OD-04 section_type: controlled vocabulary; no silent invention.
OD-05 Hard-limit matrix/table: split by semantic dimension unless matrix must be read as a whole and has approved exception.
OD-06 Field responsibility matrix: split by object family when independently editable.

2.4 NL1–NL4 and length management

Design must apply:

NL1 Unit-Centric — unit is center.
NL2 Semantic Unit Rule — cut when title clear + separately editable + not too hard to edit.
NL3 Risk-tiered Authority — agent authority depends on risk tier and lifecycle.
NL4 Length as Trigger — length warns/reviews; it does not mechanically cut.

Length rules:

normal <= 500 words default
soft-limit >500 and <=1500 words default
hard-limit >1500 words default

These thresholds are defaults and review triggers, not blind rules. Publishing or enacting a unit that is over the hard limit requires one of:

split
re-segment
length exception approved
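
The length rules above can be sketched as a small classifier. This is an illustrative sketch only: the names LengthFlag, classify_length and may_publish are hypothetical, word count is assumed to mean whitespace-delimited tokens, and a unit of exactly 500 words is assumed to be normal. In line with NL4, the function only flags; it never cuts.

```python
from enum import Enum
from typing import Optional


class LengthFlag(Enum):
    NORMAL = "normal"
    SOFT_LIMIT = "soft-limit"
    HARD_LIMIT = "hard-limit"


# Default thresholds taken from the length rules; assumed to be configurable.
SOFT_LIMIT_WORDS = 500
HARD_LIMIT_WORDS = 1500

# The three resolutions that permit publish/enact for a hard-limit unit.
HARD_LIMIT_RESOLUTIONS = {"split", "re-segment", "length_exception_approved"}


def classify_length(body: str) -> LengthFlag:
    """Flag a unit body by word count; the flag triggers review, not a cut."""
    word_count = len(body.split())  # assumption: whitespace-delimited words
    if word_count <= SOFT_LIMIT_WORDS:
        return LengthFlag.NORMAL
    if word_count <= HARD_LIMIT_WORDS:
        return LengthFlag.SOFT_LIMIT
    return LengthFlag.HARD_LIMIT


def may_publish(flag: LengthFlag, resolution: Optional[str] = None) -> bool:
    """A hard-limit unit needs one recorded resolution before publish/enact."""
    if flag is not LengthFlag.HARD_LIMIT:
        return True
    return resolution in HARD_LIMIT_RESOLUTIONS
```

A soft-limit flag would typically be copied into length_flag on the manifest so that REVIEW can decide whether the unit still passes the three-question test.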

2.5 C1A invariants CI-1 → CI-12

The design must not violate any C1A invariant, in particular:

CI-3: publication does not contain inline content, only references unit_versions.
CI-4: label doc=X is not publication membership.
CI-6: canonical address stable for same logical unit.
CI-8: no mechanical A/B/C split; title must describe meaning.
CI-9: no parallel label registry.
CI-11: each unit has one canonical parent.
CI-12: every new unit goes through birth gate.

2.6 Lessons from P10A/P10B/Phase5C2

The design must integrate these empirical lessons:

P10A D35 v1 segmentation showed why a root must not duplicate full document body.
P10A D35 v2 showed section_type diversity and semantic split of §4/§6 are required.
P10B D32 proved round-trip 0 drift is a reliable validation gate.
Phase5C2 proved bounded transaction + rollback keys + fn_iu_create writer path works.
DIEU-32 proved heading/container body policy must distinguish TAC representation from IU body requirements.

3. Questions the design must answer

Group A — Source and command

Q1. How does the user command work?

Input: “Cắt luật A” ("Cut law A").
System resolves source by doc_code/name/path.
If exactly one match: continue.
If none/multiple: ask one clarification question.
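
The resolution rule in Q1 can be sketched as follows. This is a minimal sketch under assumptions: SourceDoc, Resolution, resolve_source and the in-memory catalog are hypothetical, and matching is assumed to be a case-insensitive exact match on doc_code, name or path. A real resolver may use a different lookup.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SourceDoc:
    doc_code: str
    name: str
    path: str


@dataclass(frozen=True)
class Resolution:
    status: str                       # "RESOLVED" or "NEEDS_CLARIFICATION"
    source: Optional[SourceDoc] = None
    question: Optional[str] = None    # the single clarification question


def resolve_source(query: str, catalog: List[SourceDoc]) -> Resolution:
    """Continue only on exactly one match; otherwise ask one question."""
    needle = query.strip().lower()
    matches = [
        doc for doc in catalog
        if needle in (doc.doc_code.lower(), doc.name.lower(), doc.path.lower())
    ]
    if len(matches) == 1:
        return Resolution("RESOLVED", source=matches[0])
    if not matches:
        question = f"No source matches '{query}'. Which document do you mean?"
    else:
        options = ", ".join(doc.path for doc in matches)
        question = f"'{query}' matches several sources ({options}). Which one?"
    return Resolution("NEEDS_CLARIFICATION", question=question)
```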

Q2. What is the source of truth?

KB markdown for new documents.
TAC publication for existing TAC sources.
Future source modes allowed only if they declare canonical source and source hash.

Q3. What if the source was already cut?

Collision/history check is mandatory.
If already cut, system must not blindly duplicate.
It must classify: status-only, split/merge, supersede/re-cut, or block for review.

Group B — Marking / segmentation decision

Q4. What does MARK produce?

A manifest, not writes.

Minimum fields:

manifest_id
source_doc_ref
source_version_ref
source_hash
source_mode
unit_index
source_start_line/source_end_line or byte/char span
canonical_address_proposal
title
section_type
unit_kind
parent_manifest_id
hierarchy_depth
body_source_policy
semantic_role
cut_reason
C1A_rule_refs
three_question_test_result
confidence
review_required_flags
length_flag
edge_readiness_notes
split_merge_notes
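
One possible shape for the manifest is sketched below. The field names come from the list above; everything else is an assumption made for illustration: the split into a Manifest header and per-unit ManifestUnit records, the Python types, the line-based span, the default values, and the helper missing_evidence.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ManifestUnit:
    unit_index: int
    source_start_line: int            # assumption: line span, not byte/char span
    source_end_line: int
    canonical_address_proposal: str
    title: str
    section_type: str
    unit_kind: str
    parent_manifest_id: Optional[str]  # None for a root unit (assumed)
    hierarchy_depth: int
    body_source_policy: str
    semantic_role: str
    cut_reason: str
    C1A_rule_refs: List[str]
    three_question_test_result: Dict[str, bool]
    confidence: float
    review_required_flags: List[str] = field(default_factory=list)
    length_flag: str = "normal"
    edge_readiness_notes: str = ""
    split_merge_notes: str = ""


@dataclass
class Manifest:
    manifest_id: str
    source_doc_ref: str
    source_version_ref: str
    source_hash: str
    source_mode: str
    units: List[ManifestUnit] = field(default_factory=list)


def missing_evidence(unit: ManifestUnit) -> List[str]:
    """List decision-evidence fields that are empty on a unit."""
    missing = []
    if not unit.cut_reason.strip():
        missing.append("cut_reason")
    if not unit.C1A_rule_refs:
        missing.append("C1A_rule_refs")
    if not unit.three_question_test_result:
        missing.append("three_question_test_result")
    return missing
```

Because MARK produces a manifest and not writes, nothing in this sketch touches storage; it only describes the record that REVIEW and CUT would consume.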

Q5. How does AI decide a cut boundary?

Expected answer:

First apply C1A three-question test.
Then apply SR-1→SR-7.
Use structure as evidence, not as blind rule.
Semantic role changes can create cuts even without headings.
Length triggers review, not mechanical split.
Every decision must carry cut_reason and C1A_rule_refs.

Q6. How are section_type and unit_kind chosen?

Expected answer:

Use controlled vocabulary.
Do not invent silently.
If no type fits, set NEW_VOCAB_REQUIRED and escalate.
section_type must support later edge/professional linking.
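
The "no silent invention" rule can be sketched as a vocabulary gate. The vocabulary contents below are placeholders, not the real controlled vocabulary, and check_section_type is a hypothetical helper; only the NEW_VOCAB_REQUIRED flag name comes from the text above.

```python
from typing import FrozenSet, Optional

# Placeholder vocabulary for illustration; the real list is defined elsewhere.
EXAMPLE_SECTION_TYPES: FrozenSet[str] = frozenset({"heading", "rule", "definition"})


def check_section_type(section_type: str, vocabulary: FrozenSet[str]) -> Optional[str]:
    """Return a review flag when the type is outside the controlled vocabulary."""
    if section_type in vocabulary:
        return None
    # Never coerce to a nearby type: flag it and let escalation decide.
    return "NEW_VOCAB_REQUIRED"
```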

Q7. How are heading/container and body policies handled?

Expected answer:

Inherit C1A OD-02 and Phase5C2 policy.
Heading with authority/governance role can be unit.
Approved body policy:

SYNTHESIZE_TITLE iff section_type='heading' AND body IS NULL AND children>0
PRESERVE iff body IS NOT NULL
BLOCK iff body IS NULL AND not heading-container
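
The three body-policy rules above translate directly into a decision function. This is a sketch: the function name body_policy is hypothetical, and SQL NULL is assumed to map to Python None (an empty string is treated as a non-NULL body).

```python
from typing import Optional


def body_policy(section_type: str, body: Optional[str], children: int) -> str:
    """Pick the body policy for one unit from the approved rules."""
    if body is not None:
        return "PRESERVE"
    # body IS NULL from here on
    if section_type == "heading" and children > 0:
        return "SYNTHESIZE_TITLE"
    return "BLOCK"
```

The three branches are mutually exclusive and exhaustive, so every unit receives exactly one policy.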

Group C — Review / decision authority

Q8. Who reviews the manifest?

Expected answer:

AI review by default.
Reviewer must be role-separated from Marker, even if the same agent performs the second pass.
Human not in normal loop.

Q9. What does REVIEW check?

Expected answer:

coverage/no-loss
no-overlap
C1A three-question test per unit
SR/OD rule compliance
semantic cohesion
actionability
section_type/vocab correctness
hierarchy and one-parent rule
length flags and exceptions
body policy
edge readiness
round-trip feasibility

Review output:

manifest_review_status=PASS|PATCHED_PASS|BLOCKED
human_escalation_required=true|false
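
The coverage/no-loss and no-overlap checks listed for REVIEW can be sketched over line spans. Assumptions: spans are 1-indexed inclusive (start, end) pairs, they describe non-nested units, and every source line is expected to belong to exactly one span. The function name check_spans is hypothetical.

```python
from typing import List, Tuple


def check_spans(spans: List[Tuple[int, int]], total_lines: int) -> List[str]:
    """Return coverage and overlap errors; an empty list means the spans tile the source."""
    errors: List[str] = []
    expected = 1  # next line that must be covered
    for start, end in sorted(spans):
        if start > end:
            errors.append(f"invalid span {start}-{end}")
            continue
        if end > total_lines:
            errors.append(f"span {start}-{end} exceeds source length {total_lines}")
        if start < expected:
            errors.append(f"overlap at line {start}")
        elif start > expected:
            errors.append(f"gap: lines {expected}-{start - 1} uncovered")
        expected = max(expected, end + 1)
    if expected <= total_lines:
        errors.append(f"gap: lines {expected}-{total_lines} uncovered")
    return errors
```

A non-empty result would naturally map to manifest_review_status=BLOCKED, or to PATCHED_PASS if the reviewer can repair the spans.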

Q10. When is human escalation required?

Expected answer:

Human escalation only for:

source ambiguity that AI cannot resolve;
new vocab/type required;
suspected data loss/corruption;
competing valid cuts with different legal/governance meaning;
high/highest-risk finalization if law requires it;
a split/merge that changes enacted canonical meaning.

Group D — Cut / verify / rollback

Q11. How does CUT execute?

Expected answer:

CUT executes only from approved manifest.
Use canonical writer only (fn_iu_create for IU path).
No direct IU/UV insert.
Per-document/publication bounded transaction.
Birth trigger must fire.
Profile/provenance must include manifest and C1A decision evidence.

Q12. How does the system prove the cut is correct?

Expected answer:

Mandatory round-trip verification.
Reconstruct from created pieces and compare to canonical source or declared normalizer.
No-loss/no-overlap must hold.
For representation conversions, e.g. heading-title synthesis, provenance and V-3b' policy must prove equivalence.
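
A round-trip check of this kind can be sketched with a declared normalizer and a hash comparison. The normalizer shown here (newline unification, trailing-whitespace trim) and the newline join between pieces are assumptions for illustration; they are not the approved normalizer, and this sketch does not cover the heading-title synthesis case.

```python
import hashlib
from typing import List


def normalize(text: str) -> str:
    """Hypothetical declared normalizer: unify newlines, trim trailing whitespace."""
    lines = text.replace("\r\n", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).strip("\n")


def digest(text: str) -> str:
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def round_trip_ok(source: str, pieces: List[str]) -> bool:
    """Reassemble pieces in unit order and compare against the canonical source."""
    rebuilt = "\n".join(pieces)  # assumption: units are joined by a newline
    return digest(rebuilt) == digest(source)
```

Comparing digests rather than raw strings lets the same value double as evidence in the execution report alongside source_hash.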

Q13. What happens if verification fails?

Expected answer:

Content/integrity error → rollback automatically using exact keys.
Semantic granularity issue after successful round-trip → Split/Merge lifecycle, not rollback.
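
The routing between rollback and the Split/Merge lifecycle can be sketched as a single decision. The function and the returned labels are hypothetical names; the rule itself is the one stated above.

```python
def failure_action(round_trip_passed: bool, granularity_issue: bool) -> str:
    """Route a post-cut finding to rollback, Split/Merge, or no action."""
    if not round_trip_passed:
        # Content/integrity error: undo the cut using exact keys.
        return "ROLLBACK"
    if granularity_issue:
        # Content is intact; only the structure is suboptimal.
        return "SPLIT_MERGE"
    return "NONE"
```

Integrity is checked first, so a cut that both fails round-trip and has poor granularity is rolled back rather than restructured.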

Q14. How does rollback work?

Expected answer:

Exact-key rollback only.
Rollback keys dual-written to KB + VPS log before COMMIT.
Pattern deletion prohibited.
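
The ordering constraint (keys written to both sinks before COMMIT) and the exact-key restriction can be sketched with injected callables. All names here are hypothetical, the sinks and the delete operation are stand-ins supplied by the caller, and the wildcard check is only one illustrative way to refuse pattern-like keys.

```python
from typing import Callable, Sequence


def commit_with_rollback_keys(
    rollback_keys: Sequence[str],
    write_kb: Callable[[Sequence[str]], None],
    write_vps_log: Callable[[Sequence[str]], None],
    commit: Callable[[], None],
) -> None:
    """Commit only after the rollback keys are recorded in both sinks."""
    if not rollback_keys:
        raise ValueError("refusing to commit without rollback keys")
    write_kb(rollback_keys)        # any exception here prevents the commit
    write_vps_log(rollback_keys)
    commit()


def rollback_exact(
    rollback_keys: Sequence[str],
    delete_by_key: Callable[[str], None],
) -> int:
    """Delete by exact key only; reject anything that looks like a pattern."""
    for key in rollback_keys:
        if any(ch in key for ch in "*%?"):  # assumed wildcard characters
            raise ValueError(f"pattern-like key rejected: {key!r}")
    for key in rollback_keys:
        delete_by_key(key)
    return len(rollback_keys)
```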

Group E — Split/Merge correction lifecycle

Q15. How does Split work?

Expected answer:

Mark split points
Review split manifest
Create new units/versions through canonical writer
Mark old unit superseded, not deleted
Create split_from/supersedes/superseded_by relations
Reassign/propose edge reassignment
Round-trip verify new units equal old unit content or accepted normalized representation
Report

Required metadata:

operation=split
source_unit_id
source_unit_version_id
new_unit_ids
split_reason
span_mapping
semantic_mapping
old_canonical_address
new_canonical_addresses
edge_reassignment_plan
rollback_plan
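
A span_mapping for a split, together with its round-trip check, can be sketched as follows. Assumptions: split points are character offsets into the old unit body, the new units must concatenate back to the old body exactly (no normalizer), and the function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, text of the new unit)


def split_by_offsets(body: str, cut_offsets: List[int]) -> List[Span]:
    """Build a span mapping from character offsets strictly inside the body."""
    if any(offset <= 0 or offset >= len(body) for offset in cut_offsets):
        raise ValueError("cut offsets must fall strictly inside the body")
    bounds = [0, *sorted(set(cut_offsets)), len(body)]
    return [(start, end, body[start:end]) for start, end in zip(bounds, bounds[1:])]


def split_round_trip_ok(body: str, span_mapping: List[Span]) -> bool:
    """New units, in order, must reproduce the old unit content."""
    return "".join(text for _, _, text in span_mapping) == body
```

The same comparison, run in the opposite direction, covers Merge: the merged body must equal the old bodies concatenated in order.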

Q16. How does Merge work?

Expected answer:

Mark merge candidate units
Review merge decision
Create new merged unit
Mark old units superseded
Preserve aliases/redirects
Reassign/propose edge reassignment
Round-trip verify merged content equals ordered old content
Report

Group F — Simplicity and operability

Q17. Can this run in one operation?

Expected answer:

Normal case yes:

Resolve → Mark → Review → Cut → Verify → Report

Correction case:

Find issue → Mark structural change → Review → Apply Split/Merge → Verify → Report

Q18. How do agents remember the process?

Expected answer:

The detailed design must reduce to two state machines:

Cut: Resolve → Mark → Review → Cut → Verify → Report
Structural correction: Detect → Mark → Review → Split/Merge → Verify → Report

All detailed gates map to one of these states.
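
The two state machines can be sketched as ordered state lists with one shared runner. This is a minimal sketch: run_flow and the handler convention (a callable per state that returns True on success) are assumptions, and a real design would likely still emit a report when a state fails.

```python
from typing import Callable, Dict, List, Optional, Tuple

CUT_FLOW = ["Resolve", "Mark", "Review", "Cut", "Verify", "Report"]
CORRECTION_FLOW = ["Detect", "Mark", "Review", "Split/Merge", "Verify", "Report"]


def run_flow(
    flow: List[str],
    handlers: Dict[str, Callable[[], bool]],
) -> Tuple[List[str], Optional[str]]:
    """Run states in order; stop at the first state whose handler fails.

    Returns (completed states, failed state or None).
    """
    completed: List[str] = []
    for state in flow:
        if not handlers[state]():
            return completed, state
        completed.append(state)
    return completed, None
```

Because both flows share the runner, every detailed gate only needs to declare which state it belongs to.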

Q19. How does this support all document kinds?

Expected answer:

Same workflow, different unit_kind, section_type profile, render policy:

law → law_unit
design doc → design_doc_section
process → process_section
report → report_section
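
The mapping above can be held as a lookup table so that the workflow itself stays identical across document kinds. The dictionary keys and the helper name unit_kind_for are assumptions; the unit_kind values are the ones listed above.

```python
UNIT_KIND_BY_DOCUMENT_KIND = {
    "law": "law_unit",
    "design doc": "design_doc_section",
    "process": "process_section",
    "report": "report_section",
}


def unit_kind_for(document_kind: str) -> str:
    """Look up the unit_kind; unknown kinds fail loudly instead of defaulting."""
    try:
        return UNIT_KIND_BY_DOCUMENT_KIND[document_kind]
    except KeyError:
        raise ValueError(f"no unit_kind profile for document kind {document_kind!r}") from None
```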

Q20. How are decisions persisted?

Expected answer:

Persist or make persistable:

source resolution
manifest
manifest review
execution report
round-trip result
rollback keys
split/merge operation
policy exceptions

v0.1 may use KB artifacts; design must state PG-native direction.

4. Expected approach for answering the questions

The design should answer the above questions with this approach:

4.1 Do not write a new segmentation law

Use C1A as canonical segmentation law.

dot-iu-cutter adds automation process and execution controls around C1A.

4.2 Separate decision from execution

MARK decides.
REVIEW validates/repairs/blocks.
CUT executes.

Execution backend must not silently invent cut boundaries.

4.3 Use manifest as audit object

The manifest is the durable decision record.

Every unit must record:

source span;
title;
type;
parent;
C1A rules used;
three-question test result;
cut reason;
confidence;
review flags.

4.4 Require round-trip verification

No successful cut without reconstruction and comparison.

This is the concrete answer to:

After cutting, does anyone check the result?

Yes: the system checks by reassembling and comparing. Human is not needed for normal cases.

4.5 Treat semantic mistakes as structural lifecycle, not silent edits

If a cut is content-wrong, rollback.

If a cut is semantically suboptimal, use Split/Merge with supersession/history and edge reassignment.

4.6 Keep human out by default, but define escalation clearly

AI should decide normal cuts.

Human should only approve the problem statement, policy changes, high-risk exceptions, and semantic/legal ambiguities that AI cannot resolve.

4.7 Keep the process simple for executors

The design must be rich internally, but executor-facing workflow must remain:

Resolve → Mark → Review → Cut → Verify → Report

If the design cannot be expressed this simply, it is not ready.

5. Acceptance criteria for User approval of the problem statement

This problem statement is ready for Agent design only if User accepts:

C1A is canonical segmentation law.
Mark → Review → Cut is the default operating model.
AI decides cuts by default; human escalation is exceptional.
Manifest is mandatory before execution.
Independent AI review is mandatory before cut.
Round-trip verification is mandatory after cut.
Content/integrity failure rolls back automatically.
Semantic-granularity failure is handled by Split/Merge lifecycle.
Split/Merge preserves history and uses supersession, not silent overwrite.
Phase5C2 body policy and patched V-3 semantics are carried forward.
Process must remain reducible to simple state machines.
No Agent design or implementation starts until User approves this problem statement.

6. Current decision needed

User should approve, amend, or reject this Rev2 problem statement.

No Agent work should be started until User approval.

Final flags

problem_statement_rev2_status=DRAFT_FOR_USER_APPROVAL
c1a_integrated=true
agent_design_allowed=false_until_user_approval
implementation_allowed=false
next_step=user_review_problem_statement_rev2