Assembly-First Open-Source Integration — Critique + Recommended Architecture

Path: knowledge/dev/design/assembly-first-open-source-integration-critique.md

Status: DRAFT. Claude Code critique of the user's candidate strategy (2026-05-27 addendum). Not final truth; subject to Council review.

Principle assumed: PG/DOT/registry remains the legal/config/source-of-truth core. Open-source tools attach as replaceable adapters, executors, brokers, and gateways, never as owners of governance.

1. Critique matrix

Legend: A=Accept · M=Modify · R=Reject

#	Slot	Proposed choice	A/M/R	Reason	Risk	Better option if any	Phase
1.1	Producer / CDC	Keep PG-native event_outbox	A	event_outbox + event_type_registry + register-before-emit is curated, governed, and already operational (140k+ historical rows). Aligns with Hiến pháp NT13 + Điều 45 §6.	None.	—	Immediate.
1.2	Producer / CDC	Add Benthos / Redpanda Connect (config-driven CDC)	M	YAML config aligns with NT2/NT4. BUT WAL/CDC requires logical replication slot + pgoutput / wal2json plugin + slot supervision; on a single VPS this is non-trivial ops. More critically, CDC captures every row change indiscriminately — that conflicts with the "register-before-emit" governance enshrined in event_type_registry. CDC is useful for external table mirroring (e.g., Lark/Notion replicas), not for IU/workflow/task domain events.	If adopted broadly: shadow event taxonomy outside event_type_registry → governance fragmentation; replication slot bloat if consumer lags.	Keep trigger-emitted events for domain; reserve Benthos for bounded mirroring use cases (e.g., S177 Lark CRUD shadow → PG mirror feed) — and only when a real such case appears.	Later (Phase 5+ if any).
1.3	Producer / CDC	Artie / PeerDB	R for now	Heavier replication tooling; same governance problem as Benthos; more vendor surface.	High ops cost for marginal benefit at current scale.	Re-evaluate only if multi-DC PG replication becomes a hard requirement.	Not in roadmap.
2.1	Broker / Event bus	PG-native outbox + LISTEN/NOTIFY + event_subscription	A	Already exists. queue.notify.enabled is currently false but gate is in place. Phase 1 just turns it on with discipline.	Polling lag if NOTIFY disabled; bounded by outbox tail cursor.	—	Immediate.
2.2	Broker / Event bus	NATS	M (defer)	NATS is excellent for fanout/pub-sub across hosts, but introduces a separate cluster and a parallel SoT for messages. Live evidence does not yet show multi-host workers or cross-service consumers requiring NATS. Adoption now duplicates Điều 45's outbox semantics. Migration path: introduce later as a transport fed by an outbox-relay worker — event_outbox stays SoT; NATS publishes to topic mirrors. No law change required since Điều 45 §6 is substrate-neutral.	Premature NATS = two SoTs for events, drift between PG and broker, harder DLQ semantics.	PG LISTEN/NOTIFY for in-host, in-cluster fanout first. NATS later iff multi-host or external consumer demands it.	Phase 5+ (only if profile demands).
2.3	Broker / Event bus	Redis Streams	R if no existing Redis	Live PG survey shows no Redis in the stack. Adding Redis is a new persistence + ops surface for marginal benefit over PG. If Redis appears later for caching, then Streams becomes a possible add — but Redis persistence semantics are weaker than PG; not a substitute SoT.	Two SoTs; Redis durability < PG; ops surface grows.	Skip.	Not in roadmap (revisit only if Redis already arrives for another reason).
2.4	Broker / Event bus	Kafka / Redpanda	R	Operational cost completely out of proportion to current scale. Conflicts with VPS-light constraint.	Massive ops + cost.	Skip until cross-region durability is mandatory and PG cannot keep up — likely years away.	Not in roadmap.
3.1	Job queue / Worker	Existing PG job_queue + job_dead_letter + queue_heartbeat + dot_iu_runtime_lease	A	Tables and gates already in place; queue.runtime.phase=`phase2_governance`; missing pieces are config-on + caller obligations, not new substrate.	None — improvements native.	—	Immediate.
3.2	Job queue / Worker	pg-boss	M / R as substrate; A as library pattern reference	pg-boss owns its own state vocab (created→active→completed/failed/expired) that does not match Điều 45 §6.7 work_state_machine (≥9 states). Adopting pg-boss as substrate means either (a) Council amends §6.7 — high blast radius — or (b) we wrap pg-boss and lose its straight-line benefits. Better to keep our state model and borrow patterns from pg-boss (cron-style schedules, throttling) implemented natively.	If adopted as substrate: Điều 45 §6.7 violation.	Native implementation, pattern-inspired.	Patterns immediate; library no.
3.3	Job queue / Worker	Graphile Worker	R as substrate	Same shape problem as pg-boss; tightly coupled to Postgres functions called by name; cron-extension dependency for scheduling on container Postgres can be messy.	Same as pg-boss.	Native.	Not in roadmap as substrate.
3.4	Job queue / Worker	Implement missing semantics natively	A	`idempotency_key` (`UNIQUE NULLS NOT DISTINCT`) + lease + heartbeat (Điều 45 §15.5) + retry with backoff + DLQ are already partially in place. Phase 0–1 finishes the rest.	None.	—	Immediate–Phase 1.
4.1	Executor	DOT executor (governance layer)	A	DOT remains the audit + governance plane. Executor classes register; DOT logs runs (`dot_iu_command_run`).	None.	—	Immediate.
4.2	Executor	Node / Python / Go workers	M	Node fits the existing Directus/Nuxt stack — minimal new runtime. Python only where AI agent runs require it (memo: do not import Python services prematurely). Go optional, only if a tight inner loop justifies it. Hard rule: workers contain only execution code; business state stays in PG via DOT command results. Recommend: Node first, Python optional, Go deferred.	Workers turn into business-logic sprawl if not policed.	Node-first; per-class concurrency from executor_class_registry.	Phase 2.
4.3	Executor	Watermill (Go routing lib)	R	Adds Go dependency and a routing abstraction we do not yet need; conflicts with PG-first routing (event_subscription / job_queue class filter already covers routing).	Premature complexity + dead dependency.	Skip.	Not in roadmap.
4.4	Executor	Executor class registry PG-native	A	Per master design §5.3 and law extraction plan §3.8 / §18 candidate clauses.	None.	—	Phase 1.
5.1	Realtime gateway	Custom Nuxt server-route SSE	A as starting choice	Lowest ops cost; no new service; SSE sufficient for problem-only governance summaries (one-way push). Nuxt server route reads outbox tail via LISTEN/NOTIFY.	Saturates above ~1k concurrent connected clients on a single instance.	—	Phase 5 entry.
5.2	Realtime gateway	WebSocket gateway (custom)	M (later)	Add only when a real bidirectional UI flow appears (e.g., live form locking, collaborative editing). Otherwise SSE wins on simplicity.	None if deferred.	—	Phase 5+ on demand.
5.3	Realtime gateway	Socket.io	R	Carries opinionated room/namespace + custom framing on top of WS; harder to reason about backpressure; harder to enforce permission filter at backend layer.	Hidden complexity.	Plain WS if WS needed.	Not in roadmap.
5.4	Realtime gateway	Centrifugo (later)	M (defer)	Excellent for >10k connected clients with channel auth. Adopt only when Phase 5 SSE/WS shows saturation.	None if deferred.	—	Phase 6+ on demand.
5.5	Realtime gateway	Directus realtime	R	Directus realtime ties UI to Directus schema/permission model and to its WS protocol; violates the gateway boundary (Nuxt → backend gateway, NOT Nuxt → Directus realtime). Also bypasses the governance-UI principle of summaries-only because Directus emits raw row changes.	Boundary violation + raw event leakage to UI.	Skip.	Not in roadmap.
6.1	Workflow orchestrator	MOW native state machine (PG)	A as primary	Most production workflow systems run on plain PG state machines + cron. Year-long durability is a PG property if snapshots + checkpoints exist. Config-first requires the workflow grammar in PG — Temporal/Camunda keep it elsewhere.	Need to engineer snapshot + resume + cron + escalation carefully (Phase 3 scope).	—	Phase 3.
6.2	Workflow orchestrator	Temporal	R for primary; M (later, conditional)	Temporal puts workflow logic in code (TS/Go/Java/Python), which is fundamentally at odds with config-first + DOT-governed. Adopting Temporal would either (a) move ownership of workflow definitions out of PG (violates SoT), or (b) wrap Temporal as a thin executor (then most Temporal value is wasted). Reconsider only if the MOW state machine cannot meet documented requirements after Phase 6 scale hardening: deterministic replay for compliance, cross-DC durability, or >100k concurrent active runs with strict SLA.	High lock-in; opex; governance bypass risk.	Continue MOW native.	Re-evaluate post-Phase 6 only.
6.3	Workflow orchestrator	Camunda	R	BPMN XML is a parallel SoT for workflow grammar; adopting it means DOT/registry no longer own workflow definitions. Camunda for human approval also overlaps Điều 32 — double ownership.	Boundary violation + double ownership.	Skip.	Not in roadmap.
6.4	Workflow orchestrator	Airflow	R	Airflow is a batch/data-pipeline scheduler; semantics (DAG run, task instance) do not match business workflow with human approvals + long pauses.	Wrong paradigm.	Skip.	Not in roadmap.
7.1	Observability	PG audit tables	A	Already partially in place (`dot_iu_command_run`, `cut_request_transition`, `iu_lifecycle_log`, `iu_tree_change_log`). Extend per phase.	Disk growth — handled by retention policy.	—	Immediate.
7.2	Observability	event_type_registry as schema registry	A with hardening	Add `schema_jsonb` (JSON Schema), `schema_version` (semver), `compatibility_mode` (forward/backward/none), and `fn_event_schema_validate(event_type, payload_jsonb)` (Phase 1).	If validator turned on suddenly, existing producers may fail — gate per producer.	—	Phase 1.
7.3	Observability	trace_id / correlation_id	A — adopt W3C tracecontext shape now	Adopt W3C: trace_id (16-byte hex, 32 chars), parent_span_id (8-byte hex, 16 chars), sampled flag. Even without OpenTelemetry SDK now, the shape is forward-compatible so OTel can attach later. Present on every new event_outbox + job_queue + workflow_run + task_run row.	None — additive.	—	Phase 1.
7.4	Observability	OpenTelemetry	M (later attach)	Adopt SDK in Node workers when worker code is written (Phase 2+). Endpoint = local collector → Jaeger. Until then, trace_id columns alone are sufficient for end-to-end correlation in PG.	None if shape adopted now.	—	Phase 2+ when workers ship.
7.5	Observability	Jaeger	M (later)	Backend for OTel; attach only when worker count justifies UI for traces.	Extra service.	Could also be SigNoz/Grafana Tempo — defer choice.	Phase 5+.
7.6	Observability	Governance UI summaries	A	Per master design §15: problem-only surface; aggregate counts; 1-line AI/worker summaries; drill-down on demand.	None.	—	Phase 5.
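
Row 7.3 fixes the identifier widths but not how they are produced or checked. The sketch below shows one way to generate and validate the W3C `traceparent` shape without any OpenTelemetry dependency. The function names are hypothetical, and `Math.random` stands in for whatever random source the workers end up using.

```typescript
// Sketch only: helper names are assumptions, field widths follow W3C Trace Context.
interface TraceContext {
  traceId: string;      // 16 bytes, 32 lowercase hex chars
  parentSpanId: string; // 8 bytes, 16 lowercase hex chars
  sampled: boolean;     // bit 0 of the trace-flags byte
}

// Only version "00" is accepted here; later versions may add fields.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;
const ALL_ZERO_RE = /^0+$/;

// Lowercase hex string of the given byte length.
function randomHex(byteCount: number): string {
  let out = "";
  for (let i = 0; i < byteCount; i++) {
    const b = Math.floor(Math.random() * 256);
    out += (b < 16 ? "0" : "") + b.toString(16);
  }
  return out;
}

// All-zero ids are invalid in the spec, so regenerate in that (unlikely) case.
function randomNonZeroHex(byteCount: number): string {
  let id = randomHex(byteCount);
  while (ALL_ZERO_RE.test(id)) {
    id = randomHex(byteCount);
  }
  return id;
}

function newTraceContext(sampled: boolean): TraceContext {
  return {
    traceId: randomNonZeroHex(16),
    parentSpanId: randomNonZeroHex(8),
    sampled: sampled,
  };
}

function formatTraceparent(ctx: TraceContext): string {
  return "00-" + ctx.traceId + "-" + ctx.parentSpanId + "-" + (ctx.sampled ? "01" : "00");
}

// Returns null for anything that is not a well-formed version-00 header.
function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT_RE.exec(header);
  if (m === null) {
    return null;
  }
  const traceId = m[1] || "";
  const parentSpanId = m[2] || "";
  const flags = m[3] || "00";
  if (ALL_ZERO_RE.test(traceId) || ALL_ZERO_RE.test(parentSpanId)) {
    return null;
  }
  return {
    traceId: traceId,
    parentSpanId: parentSpanId,
    sampled: (parseInt(flags, 16) & 1) === 1,
  };
}
```

Storing the three fields as separate columns, and formatting the header only at process boundaries, keeps the PG rows queryable by `trace_id` while still matching what an OTel exporter would later expect.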

2. Cross-cutting critique

2.1 The "assembly-first" principle is right — but watch for SoT drift

The strategy correctly insists PG/DOT/registry remain SoT. The biggest hidden risk in every candidate above is a second SoT sneaking in: Benthos config file as event taxonomy, pg-boss state vocab as job state, Temporal code as workflow definition, Camunda BPMN as workflow grammar. Each looks like an adapter but quietly becomes an owner. Discipline: every adopted tool MUST point its identity columns / config back to a PG registry row.

2.2 "Adapters are replaceable" must be enforced by interface, not aspiration

Replaceability requires an interface boundary in PG: e.g., an event_publisher interface (function signature) that internal producers call, with the real publisher selectable by config (PG-only → PG+NATS → NATS-relay). Without that, "replaceable" remains rhetorical.
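
To make the boundary concrete, here is a minimal application-side sketch of a config-selected publisher. All names (`EventPublisher`, `PublisherMode`, `makePublisher`) are hypothetical, the arrays stand in for event_outbox and an external transport, and only two of the three modes mentioned above are modelled.

```typescript
// Sketch only: every name below is an assumption, not an existing Incomex API.
interface DomainEvent {
  eventType: string;
  payload: Record<string, unknown>;
}

// The single interface that internal producers are allowed to call.
interface EventPublisher {
  publish(event: DomainEvent): void;
}

// Stand-in for an INSERT into event_outbox; the array plays the table.
class OutboxPublisher implements EventPublisher {
  readonly rows: DomainEvent[] = [];
  publish(event: DomainEvent): void {
    this.rows.push(event);
  }
}

// Stand-in for an external transport such as NATS.
class TransportPublisher implements EventPublisher {
  readonly sent: DomainEvent[] = [];
  publish(event: DomainEvent): void {
    this.sent.push(event);
  }
}

// Writes to the source of truth first, then mirrors to the transport.
class MirroringPublisher implements EventPublisher {
  private readonly sot: EventPublisher;
  private readonly mirror: EventPublisher;
  constructor(sot: EventPublisher, mirror: EventPublisher) {
    this.sot = sot;
    this.mirror = mirror;
  }
  publish(event: DomainEvent): void {
    this.sot.publish(event);
    this.mirror.publish(event);
  }
}

type PublisherMode = "pg_only" | "pg_plus_transport";

// The mode would come from configuration; producers never see which one is active.
function makePublisher(
  mode: PublisherMode,
  outbox: OutboxPublisher,
  transport: TransportPublisher
): EventPublisher {
  if (mode === "pg_only") {
    return outbox;
  }
  return new MirroringPublisher(outbox, transport);
}
```

Because producers depend only on `EventPublisher`, switching modes is a configuration change rather than a code change, which is the property the paragraph above asks for.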

2.3 The current VPS-on-single-host shape rejects most heavy tools by default

NATS cluster, Kafka, Temporal cluster, Camunda Zeebe: all assume multi-host or dedicated nodes. The current Incomex deployment (single VPS, single Postgres container) disqualifies them by default for the first two years. Re-evaluation gate: when a second VPS or a managed cluster joins the topology.

2.4 Config-first vs framework-first is the deciding lens

Tools whose definitions live in code or YAML inside their own runtime (Temporal workflows, Camunda BPMN, Airflow DAGs, Watermill routes) fight config-first. Tools whose runtime is driven by data in your DB (Benthos reading PG, pg-boss reading PG, Graphile reading PG) conflict only on state-vocab fit. Always prefer the latter shape, and even there adopt only when the alternative (native PG) is provably insufficient.
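
A small sketch of what "definitions as data" means in practice: the workflow grammar is a set of rows, and the engine is a generic lookup. The state and event names below are hypothetical placeholders; the real vocabulary is whatever the work_state_machine in Điều 45 §6.7 defines.

```typescript
// Sketch only: state and event names are invented for illustration.
interface TransitionRow {
  fromState: string;
  event: string;
  toState: string;
}

// In the config-first shape these rows would be read from a PG table.
const transitions: TransitionRow[] = [
  { fromState: "draft", event: "submit", toState: "pending_approval" },
  { fromState: "pending_approval", event: "approve", toState: "running" },
  { fromState: "pending_approval", event: "reject", toState: "draft" },
  { fromState: "running", event: "complete", toState: "done" },
];

// Generic engine: it knows nothing about any particular workflow.
function advance(rows: TransitionRow[], state: string, event: string): string {
  for (const row of rows) {
    if (row.fromState === state && row.event === event) {
      return row.toState;
    }
  }
  throw new Error('no transition from "' + state + '" on "' + event + '"');
}
```

Adding a step is then a row insert that can go through the usual change-request path, whereas in a framework-first tool the same change is a code deployment inside that tool's runtime.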

2.5 The "register-before-emit" gate is the single biggest governance win

The gate is already in place via event_type_registry. Adding fn_event_schema_validate and turning it on at producer boundaries gives Incomex a property few OSS tools provide out of the box: emission is governed, not opportunistic.
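
The gate logic can be illustrated outside the database. The sketch below is an assumption-laden model, not the PG function: `RegistryEntry`, `checkEmit`, and the per-producer enforcement list are hypothetical, and a required-fields list stands in for a real JSON Schema check.

```typescript
// Sketch only: models register-before-emit with per-producer schema enforcement.
interface RegistryEntry {
  eventType: string;
  requiredFields: string[];    // tiny stand-in for a JSON Schema in schema_jsonb
  enforcedProducers: string[]; // producers for which validation is switched on
}

interface EmitResult {
  ok: boolean;
  reason: string; // empty when ok
}

function checkEmit(
  registry: Map<string, RegistryEntry>,
  producer: string,
  eventType: string,
  payload: Record<string, unknown>
): EmitResult {
  const entry = registry.get(eventType);
  // Rule 1: unregistered event types are always rejected.
  if (entry === undefined) {
    return { ok: false, reason: 'event type "' + eventType + '" is not registered' };
  }
  // Rule 2: schema validation applies only where it has been enabled.
  if (entry.enforcedProducers.indexOf(producer) === -1) {
    return { ok: true, reason: "" };
  }
  const missing = entry.requiredFields.filter(
    (field) => !Object.prototype.hasOwnProperty.call(payload, field)
  );
  if (missing.length > 0) {
    return { ok: false, reason: "missing fields: " + missing.join(", ") };
  }
  return { ok: true, reason: "" };
}
```

Keeping registration unconditional while making schema enforcement opt-in per producer is one way to turn the validator on gradually without breaking existing emitters.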

3. Final recommended architecture

3.1 Immediate (Phase 0 → Phase 1)

Producers: PG triggers + DOT command emits → event_outbox. Register-before-emit enforced via event_type_registry + new fn_event_schema_validate.
Broker: PG event_outbox + event_pending/event_read/event_subscription + LISTEN/NOTIFY when queue.notify.enabled=true.
Job queue: PG job_queue + job_dead_letter + lease + heartbeat (Điều 45 §15.5) + retry/backoff.
Executor: DOT command catalog as governance layer; Node worker pool consumes job_queue with class filter; executor classes registered in new executor_class_registry.
Realtime: none yet — governance UI consumes via REST + simple polling until Phase 5.
Observability: PG audit tables; W3C trace_id shape on every new write; dot_iu_command_run for command audit; healthcheck surface (already present).
Validation: native PG side (cross-field engine in fn_moit_validate).
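
The job-queue semantics listed above (idempotency key, lease, heartbeat, retry with backoff, DLQ) interact in ways that are easier to see in a small model. The sketch below is an in-memory, single-threaded simulation with hypothetical names and explicit timestamps. It is not the job_queue schema, and a real PG implementation would additionally need row locking on claim (for example `FOR UPDATE SKIP LOCKED`).

```typescript
// Sketch only: in-memory model of lease + heartbeat + backoff + dead letter.
interface Job {
  id: number;
  idempotencyKey: string;
  attempts: number;
  runAfter: number;       // earliest time (ms) the job may be claimed
  leaseExpiresAt: number; // 0 when not leased
  leasedBy: string;       // "" when not leased
}

interface QueueConfig {
  maxAttempts: number;
  leaseMs: number;
  baseBackoffMs: number;
  maxBackoffMs: number;
}

// Exponential backoff, doubling per attempt and capped.
function backoffMs(attempt: number, base: number, cap: number): number {
  return Math.min(cap, base * Math.pow(2, attempt - 1));
}

class SketchQueue {
  readonly jobs: Job[] = [];
  readonly deadLetter: Job[] = [];
  private nextId = 1;
  private readonly config: QueueConfig;

  constructor(config: QueueConfig) {
    this.config = config;
  }

  // A repeated key returns the existing job (stand-in for a unique constraint).
  enqueue(idempotencyKey: string, now: number): Job {
    for (const existing of this.jobs) {
      if (existing.idempotencyKey === idempotencyKey) return existing;
    }
    const job: Job = {
      id: this.nextId++, idempotencyKey: idempotencyKey, attempts: 0,
      runAfter: now, leaseExpiresAt: 0, leasedBy: "",
    };
    this.jobs.push(job);
    return job;
  }

  // Claims the first due job whose lease is absent or expired.
  claim(worker: string, now: number): Job | null {
    for (const job of this.jobs) {
      if (job.runAfter <= now && job.leaseExpiresAt <= now) {
        job.attempts += 1;
        job.leasedBy = worker;
        job.leaseExpiresAt = now + this.config.leaseMs;
        return job;
      }
    }
    return null;
  }

  // Extends the lease; fails if the caller no longer holds it.
  heartbeat(jobId: number, worker: string, now: number): boolean {
    const job = this.held(jobId, worker, now);
    if (job === null) return false;
    job.leaseExpiresAt = now + this.config.leaseMs;
    return true;
  }

  // Retries with backoff, or moves the job to the dead-letter list.
  fail(jobId: number, worker: string, now: number): boolean {
    const job = this.held(jobId, worker, now);
    if (job === null) return false;
    job.leasedBy = "";
    job.leaseExpiresAt = 0;
    if (job.attempts >= this.config.maxAttempts) {
      this.jobs.splice(this.jobs.indexOf(job), 1);
      this.deadLetter.push(job);
    } else {
      job.runAfter = now + backoffMs(job.attempts, this.config.baseBackoffMs, this.config.maxBackoffMs);
    }
    return true;
  }

  private held(jobId: number, worker: string, now: number): Job | null {
    for (const job of this.jobs) {
      if (job.id === jobId && job.leasedBy === worker && job.leaseExpiresAt > now) return job;
    }
    return null;
  }
}
```

The heartbeat obligation shows up directly in `claim`: a worker that stops heartbeating loses its lease at `leaseExpiresAt`, and another worker can pick the job up without any manual intervention.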

3.2 Near term (Phase 2 → Phase 4)

MOT with 2 executor classes operational (dot + human), then expand to sql + ai_agent + external_api + notification + render.
MOW core with state machine + advance loop + proposal mode wired to workflow_change_requests.
MOIT / MOUT factories: field_registry, input_form_registry, output_table_registry, dot_function_registry.
Schema registry hardening: validator turned on per producer, gradually.
OpenTelemetry SDK added to Node workers as they ship (trace_id columns already populated → just attach exporters).
Heartbeat caller obligation enforced for all new workers.

3.3 Phase 5 (when governance UI ships)

Realtime gateway: Nuxt server-route SSE as starting choice; one-way summary push from outbox tail. Permission filter at backend.
Governance UI per master design §15.
DLQ replay via UI with approval gate (dlq_replay_request table).
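
For the SSE gateway, the two pieces worth pinning down early are the backend permission filter and the wire frame. The sketch below assumes a hypothetical `GovernanceSummary` shape, a hypothetical `scope` field for permissions, and an event name of `governance_summary`. Only the `id:` / `event:` / `data:` framing is standard SSE.

```typescript
// Sketch only: summary shape, scope model, and event name are assumptions.
interface GovernanceSummary {
  id: number;                    // outbox position, reused as the SSE event id
  scope: string;                 // hypothetical permission scope
  severity: "problem" | "info";
  text: string;                  // one-line summary
}

// Backend-side filter: problem-only, and only scopes the client may see.
function visibleTo(allowedScopes: string[], items: GovernanceSummary[]): GovernanceSummary[] {
  return items.filter(
    (s) => s.severity === "problem" && allowedScopes.indexOf(s.scope) !== -1
  );
}

// One SSE message: field lines, then a blank line to terminate the event.
// JSON.stringify escapes newlines, so the data stays on a single line.
function toSseFrame(s: GovernanceSummary): string {
  const data = JSON.stringify({ scope: s.scope, text: s.text });
  return "id: " + s.id + "\nevent: governance_summary\ndata: " + data + "\n\n";
}
```

Using the outbox position as the SSE `id` means a reconnecting browser sends it back in the `Last-Event-ID` header, which gives the server route a natural resume cursor.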

3.4 Later scale (Phase 6+, profile-driven)

NATS introduced as transport (not SoT) iff multi-host workers or cross-service consumers emerge. event_outbox stays SoT; an outbox-relay worker mirrors to NATS topics.
WebSocket gateway iff a real bidirectional UI flow appears (collaborative editing, live form locking).
Centrifugo iff connected-client count exceeds Nuxt server-route or custom WS capacity.
Benthos in a bounded scope (external table mirroring), not as domain event source.
OpenTelemetry collector + Jaeger / Grafana Tempo / SigNoz for worker fleet trace UI.
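
The outbox-relay idea in the first item can be sketched as a cursor that advances only after a successful send. Names here (`OutboxRow`, `Transport`, `relayBatch`) are hypothetical, the array stands in for event_outbox, and `send` stands in for a publish call on whichever transport is eventually chosen.

```typescript
// Sketch only: event_outbox stays the source of truth; the relay just mirrors it.
interface OutboxRow {
  id: number;
  eventType: string;
  payload: string;
}

interface Transport {
  // Returns false when the publish did not succeed.
  send(subject: string, body: string): boolean;
}

// Publishes rows with id > cursor in id order and stops at the first failure.
// Returns the new cursor; the outbox itself is never modified.
function relayBatch(
  outbox: OutboxRow[],
  cursor: number,
  transport: Transport,
  batchSize: number
): number {
  const pending = outbox
    .filter((row) => row.id > cursor)
    .sort((a, b) => a.id - b.id)
    .slice(0, batchSize);
  let next = cursor;
  for (const row of pending) {
    if (!transport.send(row.eventType, row.payload)) {
      break; // leave the cursor before the failed row so it is retried
    }
    next = row.id;
  }
  return next;
}
```

This gives at-least-once delivery: if the relay crashes after a send but before the cursor is persisted, the row is sent again, so downstream consumers still need to be idempotent.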

3.5 Re-evaluation criteria

Trigger	Action
Two or more VPS hosts running workers	Re-evaluate NATS as transport.
Outbox tail consistently >N minutes lag	Tune indexes; if still lagging, evaluate NATS or partitioning.
External tables mirror needed (Lark/Notion/...)	Evaluate Benthos for bounded mirror.
Connected real-time clients > ~1k	Evaluate WS gateway / Centrifugo.
Workflow state machine soak shows insufficiency (>100k active runs OR deterministic replay required)	Evaluate Temporal with full Council review.
AI agent fleet emerges with Python deps	Add Python worker class; share trace_id.
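
The numeric triggers in this table could be kept as data and evaluated against a metrics snapshot, so re-evaluation is prompted rather than remembered. The sketch below is one possible shape with hypothetical names. The lag threshold N is not specified in the table, so it is left as a caller-supplied parameter, and the two non-numeric triggers (external mirroring, Python AI agents) are omitted.

```typescript
// Sketch only: metric and trigger names are assumptions; thresholds mirror the table.
interface TopologyMetrics {
  workerHosts: number;
  outboxLagMinutes: number;
  realtimeClients: number;
  activeWorkflowRuns: number;
  deterministicReplayRequired: boolean;
}

interface Trigger {
  name: string;
  action: string;
  fires: (m: TopologyMetrics) => boolean;
}

// lagThresholdMinutes is the table's unspecified "N"; the caller must choose it.
function buildTriggers(lagThresholdMinutes: number): Trigger[] {
  return [
    {
      name: "multi_host_workers",
      action: "Re-evaluate NATS as transport.",
      fires: (m) => m.workerHosts >= 2,
    },
    {
      name: "outbox_lag",
      action: "Tune indexes; if still lagging, evaluate NATS or partitioning.",
      fires: (m) => m.outboxLagMinutes > lagThresholdMinutes,
    },
    {
      name: "realtime_clients",
      action: "Evaluate WS gateway / Centrifugo.",
      fires: (m) => m.realtimeClients > 1000,
    },
    {
      name: "workflow_scale",
      action: "Evaluate Temporal with full Council review.",
      fires: (m) => m.activeWorkflowRuns > 100000 || m.deterministicReplayRequired,
    },
  ];
}

// Returns the actions whose trigger condition holds, in table order.
function firedActions(triggers: Trigger[], m: TopologyMetrics): string[] {
  return triggers.filter((t) => t.fires(m)).map((t) => t.action);
}
```

A single snapshot cannot express "consistently" lagging; a real check would likely look at a window of samples before firing the lag trigger.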

4. Decisions that should NOT be made yet

Decision	Why not yet
Adopt or reject NATS finally.	Need Phase 1–4 metrics on outbox + LISTEN/NOTIFY behaviour.
Adopt or reject Temporal finally.	Need Phase 3 MOW state machine soak + Phase 6 scale hardening results.
Choose SSE vs WS vs Centrifugo for realtime.	Need Phase 5 client profile and a real bidirectional use case.
Adopt pg-boss / Graphile Worker as job queue substrate.	Would require Điều 45 §6.7 amendment — premature; native is provably enough.
Adopt Benthos for domain events.	Would create taxonomy outside event_type_registry — wait for a bounded mirror use case.
Adopt Camunda for human approval.	Would double-own with Điều 32 — premature and likely never.
Adopt Airflow.	Wrong paradigm — no current need.
Adopt Watermill.	Routing covered by event_subscription + job_queue class filter — likely never needed.
Adopt Directus realtime for UI.	Boundary violation — likely never.
Adopt Hasura subscriptions.	Boundary + ownership conflict — already rejected by master design.
Pick OTel backend (Jaeger vs Tempo vs SigNoz).	Defer to Phase 5+ when worker fleet justifies a UI.

5. What this critique adds beyond the candidate strategy

State-vocab fit as a first-class adoption criterion (rejects pg-boss/Graphile as substrate even though they look ideal).
Config-first vs framework-first distinction (rejects Temporal/Camunda/Airflow at a deeper level than "later").
SoT-drift watchpoint (every adopted tool must point back to a PG registry row).
W3C trace_id shape adoption now so OTel attaches later without schema migration.
Re-evaluation triggers (concrete metrics, not vibes).
Directus realtime explicit reject — the candidate strategy didn't flag this and it is the easiest accidental boundary violation.

End critique.