P3D Pack 1 Phase 5 — Read-Only/Dry-Run TAC→IU Migration Mission Prompt (rev6)

Date: 2026-05-11 | Mode: READ-ONLY / DRY-RUN — no writes, no migration, no seed Style: MISSION PROMPT — semantic field registry + deterministic resolution + bounded output Design reference: p3d-pack1-phase5-tac-to-iu-migration-design.md

0. SCOPE BLOCK

target_collection_primary       = 'information_unit'
target_collection_subordinate   = 'unit_version'
target_unit_kind                = 'law_unit'
target_vocab_key                = 'vocab.unit_kind.law_unit'

source_table_logical            = 'tac_logical_unit'
source_table_version            = 'tac_unit_version'
source_table_publication        = 'tac_publication'
source_table_pub_member         = 'tac_publication_member'

0.1 SEMANTIC CANDIDATE FIELD REGISTRY

Instead of writing column names, this prompt defines SEMANTIC CONCEPTS. Each concept has candidate labels. Agent resolves each concept against live schema using the resolution rule below.

Resolution rule (deterministic, no exceptions):

For each (concept, table):
  candidates_found = columns in table whose name matches ANY candidate label
  
  0 found → FIELD_ABSENT. Skip any query requiring this concept on this table.
  1 found → RESOLVED. Use that column name.
  >1 found → AMBIGUOUS_FIELD. Report ALL matching column names. Do NOT pick one.
             Skip any query requiring this concept on this table.
             GPT/User resolves ambiguity in review.

Registry:

Concept ID	Semantic meaning	Candidate labels	Tables to check
`address_field`	Canonical/unique address identifier	`canonical_address`, `address`, `uri`, `path`, `slug`	source_table_logical, target_collection_primary
`content_body`	Main text content	`body`, `content`, `text`, `raw_content`, `markdown`	source_table_version, target_collection_subordinate
`content_hash`	Hash digest of content	`content_hash`, `hash`, `checksum`, `digest`	source_table_version, target_collection_subordinate
`provenance_json_profile`	JSONB structured metadata/provenance container	`content_profile`, `profile`, `metadata_profile`	target_collection_subordinate
`provenance_text_note`	Free-text provenance/origin note	`provenance`, `provenance_note`, `origin`	target_collection_subordinate, source_table_version
`parent_ref`	Reference to parent entity	`parent_id`, `parent_ref`, `parent_or_container_ref`, `container_ref`	source_table_logical, target_collection_primary
`sort_order`	Ordering/sequencing field (intra-unit)	`sort_order`, `order`, `sort`, `position`, `rank`, `sequence`	source_table_logical
`section_type`	Classification of content section	`section_type`, `type`, `kind`, `category`	source_table_logical, target_collection_primary
`lifecycle`	Status/lifecycle state	`lifecycle_status`, `status`, `state`	source_table_logical, target_collection_primary
`publication_ref`	FK to publication table	`publication_id`, `publication_ref`, `pub_id`, `document_id`	source_table_pub_member
`logical_unit_ref`	FK to logical unit table	`logical_unit_id`, `unit_id`, `lu_id`, `logical_unit_ref`	source_table_pub_member
`publication_render_order`	Intra-publication ordering	`render_order`, `pub_order`, `display_order`, `render_sequence`	source_table_pub_member
`title_field`	Title/name of content	`title`, `name`, `display_name`, `heading`	source_table_version, target_collection_subordinate
`description_field`	Description/summary	`description`, `summary`, `abstract`, `desc`	source_table_version, target_collection_subordinate

--- Registry/Species domain ---
`species_identifier`	Unique species code	`species_code`, `species_id`, `species`	entity_species, species_collection_map, collection_registry, birth_registry
`species_display`	Human-readable species name	`display_name`, `name`, `species_name`, `label`, `title`	entity_species
`species_composition`	Composition level of species	`composition_level`, `composition`, `level`	entity_species, birth_registry
`species_management`	Management/governance mode of species	`management_mode`, `mode`, `governance`, `management`	entity_species
`species_hierarchy_parent`	Parent species reference	`parent_id`, `parent_ref`, `parent_species_id`	entity_species
`species_hierarchy_depth`	Tree depth	`depth`, `tree_depth`, `level`	entity_species
`collection_table_key`	Collection PG table name (FK/JOIN key)	`collection_name`, `table_name`, `collection`	collection_registry, species_collection_map, birth_registry
`collection_display_name`	Human-readable collection label	`name`, `display_name`, `title`, `name_en`	collection_registry
`governance_role`	Collection governance classification	`governance_role`, `role`, `governance`	collection_registry, birth_registry
`migration_state`	Migration status	`migration_state`, `state`, `migration`	collection_registry
`birth_strategy`	Birth code strategy	`birth_code_strategy`, `strategy`, `birth_strategy`	collection_registry
`mapping_primary`	Primary mapping flag	`is_primary`, `primary`, `is_default`	species_collection_map
`disc_field`	Discriminator field name	`discriminator_field`, `disc_field`	species_collection_map
`disc_value`	Discriminator value	`discriminator_value`, `disc_value`	species_collection_map
`disc_operator`	Discriminator operator	`discriminator_operator`, `disc_operator`	species_collection_map
`disc_config`	Discriminator config	`discriminator_config`, `disc_config`	species_collection_map
`birth_entity`	Birth entity identifier	`entity_code`, `code`, `entity_id`	birth_registry

Agent must resolve ALL concepts against ALL listed tables in a single introspection pass (GATE-0 extension). Results form the "resolved field map" used by all subsequent goals. No goal may reference a column name directly — only concept IDs from this registry.

1. GATE-0 — Scope preflight + field resolution

Phase 1: Table existence

Check	What
P1-P4	All 4 source tables exist (relkind='r', public schema)
P5-P6	Both target tables exist
P7	collection_registry rows for both targets
P8	dot_config key = target_vocab_key
P8b-P8e	Registry tables exist (relkind='r', public): entity_species, species_collection_map, collection_registry, birth_registry

Any P1-P8e fail → STOP. P9-P14 informational (PRESENT/ABSENT, non-fatal).

Phase 2: Semantic field resolution

For EVERY (concept, table) pair in §0.1 registry:

1. Query information_schema.columns for that table
2. Check which candidate labels match actual column names
3. Record: RESOLVED(column_name) / FIELD_ABSENT / AMBIGUOUS_FIELD(list)

Output: Resolved field map — a table of (concept_id, scope_table, resolution_status, resolved_column_name_or_null). This map is THE authoritative reference for all subsequent goals. If a goal needs a concept that is ABSENT or AMBIGUOUS on a table → that goal's sub-query for that table is SKIPPED.

Phase 3: Function existence

Check	What
P9	fn_iu_create exists in public
P10	fn_iu_create_plan exists in public
P11	fn_content_hash exists in public (or equivalent hash function)
P12	fn_iu_birth_gate_layer1 exists
P13	fn_iu_birth_gate_layer2 exists
P14	fn_iu_verify_invariants exists

P9-P14: PRESENT or ABSENT. Non-fatal (informational for G8/G9).

2. GOALS (all column references via concept IDs from §0.1 registry — no old/deprecated concept names)

G1 — Source row accounting

Compute: SELECT count(*) per source table. For each source table, also query distinct values of RESOLVED categorical concepts (section_type, lifecycle) with counts.
Column references: section_type concept on source_table_logical. lifecycle concept on source_table_logical. If ABSENT → skip that breakdown.
Output: Counts per table. Distinct breakdowns for resolved concepts.

G2 — Schema alignment: source→target field mapping

Compute: For each (source_table, target_table) pair, compare the resolved field map entries. For each concept: both RESOLVED (aligned), source RESOLVED + target ABSENT (migration gap), target RESOLVED + source ABSENT (target-only field).
Also: List ALL columns on each table (not just registered concepts) to catch fields outside the registry.
Output: Alignment table per pair. Gap analysis. Fields not covered by any concept (agent reports them as UNREGISTERED_FIELD for GPT review).

G3 — Nesting evidence

Column reference: parent_ref concept on source_table_logical.
If RESOLVED: Count roots (NULL parent) vs children (non-NULL). Recursive depth (WITH RECURSIVE, limit 10). Distribution by depth level. Parent→child section_type concept pairs (if section_type RESOLVED).
If ABSENT: Report parent_ref FIELD_ABSENT on source_table_logical. Nesting analysis = N/A. Nesting decision defaults to "flatten" evidence.
Output: Nesting structure. This is THE input for design §D nesting decision.

G4 — Publication structure

Column references: publication_ref concept on source_table_pub_member (FK to publication). logical_unit_ref concept on source_table_pub_member (FK to logical unit). publication_render_order concept on source_table_pub_member.
Compute: Introspect source_table_publication columns (SELECT * for full dump). Count publications. If publication_ref RESOLVED: count members per publication via JOIN. If logical_unit_ref RESOLVED: verify 1:1 with source_table_logical. If publication_render_order RESOLVED: verify ordering integrity (gaps, duplicates per publication).
Output: Publication count, membership distribution per publication, ordering analysis.

G5 — Canonical address collision check

Column reference: address_field concept on source_table_logical AND target_collection_primary.
If both RESOLVED: Compute intersection: source addresses that already exist in target.
If either ABSENT: Report which side is ABSENT. Collision check = N/A.
Output: Collision count + sample (if any). Zero = PASS.

G6 — Hash/provenance projection

Column references: content_body on source_table_version + target_collection_subordinate. content_hash on source_table_version + target_collection_subordinate. provenance_json_profile on target_collection_subordinate (for structured TAC hash provenance per Phase 4B addendum §7). provenance_text_note on target_collection_subordinate (informational only).
Function reference: fn_content_hash (from GATE-0 P11).
If content_body + content_hash RESOLVED on source, and fn_content_hash PRESENT:
- Sample 5 source rows: compute fn_content_hash(source.content_body) and compare with source.content_hash. Show whether recipes differ (re-verify live, don't assume).
- Project: target would get fn_content_hash(migrated_body) as its hash. Source's original hash becomes provenance.
If provenance_json_profile RESOLVED on target: Report its data_type (verify JSONB). Check if existing rows have this field populated. This is THE carrier for source_hashes.tac_v1 per Phase 4B contract.
If provenance_text_note RESOLVED on target: Report as informational (editor note, not structured hash provenance).
If any concept ABSENT: Report which. Hash/provenance analysis partial.
Output: Hash recipe comparison, provenance storage feasibility.

G7 — Species/composition per nesting option (NO interpretation by agent)

Source: Live entity_species (SELECT * for all rows + all columns). Re-verify Phase 4C governance_role match live.
Concept references: species_composition concept on entity_species. species_management concept on entity_species. governance_role concept on collection_registry (for target).
Compute:
1. If species_composition concept RESOLVED on entity_species → query ALL distinct values of that resolved column (live, not listed from memory).
2. If RESOLVED → group ALL species rows by that resolved column.
3. If species_management concept RESOLVED on entity_species AND governance_role concept RESOLVED on collection_registry → for each group, count species where resolved species_management value equals resolved governance_role value of target_collection_primary. Label: PLAUSIBLE (match), CONFLICT (mismatch), UNKNOWN (either concept ABSENT or AMBIGUOUS).
4. If species_composition ABSENT or AMBIGUOUS → composition grouping = UNKNOWN. Still dump all species via SELECT * for GPT review.
5. If species_management or governance_role ABSENT or AMBIGUOUS → PLAUSIBLE/CONFLICT labels = UNKNOWN for all species.
Agent does NOT interpret which composition "allows containment" or "is simplest." Agent presents grouping + labels. GPT/User applies Điều 0-B.
Output: Table: resolved_composition_value | species_count | plausible_count | conflict_count | unknown_count | sample_species (from live SELECT *). All labelled candidate_not_approved.

G8 — IU creation pathway analysis (bounded output)

For each function in GATE-0 P9-P14 that is PRESENT:
- pg_get_function_identity_arguments(oid) → identity args string
- proargnames → parameter name array
- prorettype::regtype::text → return type
- If fn_iu_create: does its identity_args include a parameter whose name matches parent_ref concept candidates? (Check by string containment on identity_args, not by assuming a specific name.)
Trigger inventory for target_collection_primary:
- SELECT tgname, tgtype, tgfoid::regproc AS function_name FROM pg_trigger WHERE tgrelid = '<target>'::regclass AND NOT tgisinternal ORDER BY tgname
Output per function: function_name | identity_args | return_type | parameter_names. Per trigger: trigger_name | trigger_type | function_called.
If a function is ABSENT: Report ABSENT. Mark related analysis as UNKNOWN.
Agent does NOT interpret function behavior from source code. Agent provides identity/signature/trigger inventory. GPT/User interprets.

G9 — Birth gate field requirements (bounded output)

For fn_iu_birth_gate_layer1 and fn_iu_birth_gate_layer2 (if PRESENT):
- Capture function source via pg_get_functiondef.
- Extract: every NEW.<field_name> reference in the source text. List them as gate_required_fields.
- For each gate_required_field: check if it exists on target_collection_primary (via resolved field map or direct introspection).
- Label: SATISFIED (field exists on target), GAP (field absent), UNKNOWN (can't determine).
Output: Per gate function: field_name | exists_on_target | status (SATISFIED/GAP/UNKNOWN).

G10 — Pilot document selection candidates (metric-only, NO subjective ranking)

Source: source_table_publication (all rows via SELECT *).
If publication_ref RESOLVED on source_table_pub_member: JOIN to source_table_publication to count members per publication.
If logical_unit_ref RESOLVED on source_table_pub_member AND section_type RESOLVED on source_table_logical: Count distinct section_type values per publication via pub_member→logical_unit JOIN (= diversity metric).
If publication_render_order RESOLVED on source_table_pub_member: Report ordering integrity per publication (min, max, gaps, duplicates).
Compute per publication (ALL publications, not a filtered subset):
- member_count (total logical units in publication)
- diversity_count (distinct section_type values, if concept RESOLVED; else UNKNOWN)
- nesting_depth (max parent_ref depth within this publication's members, if parent_ref RESOLVED; else UNKNOWN)
Sort: by diversity_count DESC, member_count ASC (deterministic). If diversity_count = UNKNOWN, sort by member_count ASC only.
Output: ALL publications with metrics. Agent does NOT filter by "manageable" or "not the largest." Agent does NOT pick favorites. GPT/User selects pilot from the full list.
All rows labelled candidate_not_approved.

G11 — Rollback input computation

If address_field RESOLVED on source_table_logical: Count total source rows (= projected insert count). If publication_ref RESOLVED on source_table_pub_member: group by publication to compute per-publication projected insert count for per-batch rollback sizing.
Output: Total projected migration size. Per-publication breakdown.

3. ACCEPTANCE CRITERIA

#	Criterion	PASS
1	GATE-0 Phase 1-3 all reported	Each check: PASS/FAIL/PRESENT/ABSENT
2	Resolved field map complete	Every (concept, table) pair resolved
3	G1-G11 all attempted	Evidence, skip reason, or FIELD_ABSENT
4	Zero writes	Confirmed
5	No column name from memory	All via resolved field map
6	No "or similar" language	All resolution deterministic
7	No composition interpretation by agent	G7 groups only, no "simplest"/"allows containment"
8	G8/G9 output bounded	Identity/params/triggers/fields, with UNKNOWN
9	AMBIGUOUS_FIELD handled	Reported, not silently resolved
10	All candidates labelled	`candidate_not_approved`
11	Scale brittleness self-audit	§5

4. CONSTRAINTS

No writes. No migration. No seed. No backfill.
No column names outside resolved field map. Every field reference → concept ID → resolved via §0.1 registry.
No fuzzy resolution. 0=ABSENT, 1=USE, >1=AMBIGUOUS. No "or similar."
No composition/species interpretation. Agent groups and presents. GPT/User interprets.
No function behavior interpretation. Agent provides signature/source/triggers. GPT/User interprets.
Re-verify everything. Phase 3/4 findings are labels only.
Mission-style. Agent writes queries using resolved field map.

5. SCALE BRITTLENESS SELF-AUDIT

Check	Required
Column name from memory used in a query?	No — all via resolved field map
"or similar" or "equivalent" in any selector?	No
Agent interpreted "simplest" or "allows containment"?	No
AMBIGUOUS_FIELD silently resolved?	No — reported for GPT/User
Fixed counts as logic?	No
Phase 3/4 labels as facts?	No — re-verified
Semantic concept missing from registry?	Report as `UNREGISTERED_FIELD`

6. REPORT REQUIREMENTS

Save to: knowledge/dev/laws/dieu44-trien-khai/reports/p3d-pack1-phase5-tac-to-iu-migration-dryrun-report.md

Must include: GATE-0 (tables + resolved field map + functions), G1-G11 evidence, self-audit, status flags.

Phase 5 Mission Prompt rev6 | All goals use current concept IDs | No deprecated concepts | No migration note workaround | 2026-05-11