KB-E815

Assembly-First Open-Source Integration — Critique + Recommended Architecture (DRAFT, 2026-05-27)

Revision 1
Tags: design, critique, open-source, integration, assembly-first, broker, queue, cdc, realtime, orchestrator, observability, draft, 2026-05-27

Path: knowledge/dev/design/assembly-first-open-source-integration-critique.md
Status: DRAFT. Claude Code critique of the user's candidate strategy (2026-05-27 addendum). Not final truth; subject to Council review.
Principle assumed: PG/DOT/registry remains the legal/config/source-of-truth (SoT) core. Open-source tools attach as replaceable adapters, executors, brokers, gateways; they never own governance.


1. Critique matrix

Legend: A=Accept · M=Modify · R=Reject

| # | Slot | Proposed choice | A/M/R | Reason | Risk | Better option if any | Phase |
|---|---|---|---|---|---|---|---|
| 1.1 | Producer / CDC | Keep PG-native event_outbox | A | event_outbox + event_type_registry + register-before-emit is curated, governed, already operational (140k+ rows historical). Aligns with Hiến pháp (Constitution) NT13 + Điều (Article) 45 §6. | None. | | Immediate. |
| 1.2 | Producer / CDC | Add Benthos / Redpanda Connect (config-driven CDC) | M | YAML config aligns with NT2/NT4. BUT WAL/CDC requires a logical replication slot + pgoutput / wal2json plugin + slot supervision; on a single VPS this is non-trivial ops. More critically, CDC captures every row change on the tables it watches, which conflicts with the "register-before-emit" governance enshrined in event_type_registry. CDC is useful for external table mirroring (e.g., Lark/Notion replicas), not for IU/workflow/task domain events. | If adopted broadly: shadow event taxonomy outside event_type_registry → governance fragmentation; replication slot bloat if consumer lags. | Keep trigger-emitted events for domain; reserve Benthos for bounded mirroring use cases (e.g., S177 Lark CRUD shadow → PG mirror feed), and only when a real such case appears. | Later (Phase 5+ if any). |
| 1.3 | Producer / CDC | Artie / PeerDB | R for now | Heavier replication tooling; same governance problem as Benthos; more vendor surface. | High ops cost for marginal benefit at current scale. | Re-evaluate only if multi-DC PG replication becomes a hard requirement. | Not in roadmap. |
| 2.1 | Broker / Event bus | PG-native outbox + LISTEN/NOTIFY + event_subscription | A | Already exists. queue.notify.enabled is currently false but the gate is in place. Phase 1 just turns it on with discipline. | Polling lag if NOTIFY disabled; bounded by outbox tail cursor. | | Immediate. |
| 2.2 | Broker / Event bus | NATS | M (defer) | NATS is excellent for fanout/pub-sub across hosts, but introduces a separate cluster and a parallel SoT for messages. Live evidence does not yet show multi-host workers or cross-service consumers requiring NATS. Adoption now duplicates Điều 45's outbox semantics. Migration path: introduce later as a transport fed by an outbox-relay worker; event_outbox stays SoT and NATS publishes to topic mirrors. No law change required since Điều 45 §6 is substrate-neutral. | Premature NATS = two SoTs for events, drift between PG and broker, harder DLQ semantics. | PG LISTEN/NOTIFY for in-host, in-cluster fanout first. NATS later iff multi-host or external consumer demands it. | Phase 5+ (only if profile demands). |
| 2.3 | Broker / Event bus | Redis Streams | R if no existing Redis | Live PG survey shows no Redis in the stack. Adding Redis is a new persistence + ops surface for marginal benefit over PG. If Redis appears later for caching, then Streams becomes a possible add, but Redis persistence semantics are weaker than PG; not a substitute SoT. | Two SoTs; Redis durability < PG; ops surface grows. | Skip. | Not in roadmap (revisit only if Redis already arrives for another reason). |
| 2.4 | Broker / Event bus | Kafka / Redpanda | R | Operational cost completely out of proportion to current scale. Conflicts with VPS-light constraint. | Massive ops + cost. | Skip until cross-region durability is mandatory and PG cannot keep up; likely years away. | Not in roadmap. |
| 3.1 | Job queue / Worker | Existing PG job_queue + job_dead_letter + queue_heartbeat + dot_iu_runtime_lease | A | Tables and gates already in place; queue.runtime.phase=phase2_governance; missing pieces are config-on + caller obligations, not new substrate. | None (improvements are native). | | Immediate. |
| 3.2 | Job queue / Worker | pg-boss | M / R as substrate; A as library pattern reference | pg-boss owns its own state vocab (created→active→completed/failed/expired) that does not match the Điều 45 §6.7 work_state_machine (≥9 states). Adopting pg-boss as substrate means either (a) Council amends §6.7, with high blast radius, or (b) we wrap pg-boss and lose its straight-line benefits. Better to keep our state model and borrow patterns from pg-boss (cron-style schedules, throttling) implemented natively. | If adopted as substrate: Điều 45 §6.7 violation. | Native implementation, pattern-inspired. | Patterns immediate; library no. |
| 3.3 | Job queue / Worker | Graphile Worker | R as substrate | Same shape problem as pg-boss; tightly coupled to Postgres functions called by name; cron-extension dependency for scheduling on container Postgres can be messy. | Same as pg-boss. | Native. | Not in roadmap as substrate. |
| 3.4 | Job queue / Worker | Implement missing semantics natively | A | idempotency_key (UNIQUE NULLS NOT DISTINCT) + lease + heartbeat (Điều 45 §15.5) + retry with backoff + DLQ are already partially in place. Phase 0–1 finishes the rest. | None. | | Immediate–Phase 1. |
| 4.1 | Executor | DOT executor (governance layer) | A | DOT remains the audit + governance plane. Executor classes register; DOT logs runs (dot_iu_command_run). | None. | | Immediate. |
| 4.2 | Executor | Node / Python / Go workers | M | Node fits the existing Directus/Nuxt stack: minimal new runtime. Python only where AI agent runs require it (memo: do not import Python services prematurely). Go optional, only if a tight inner loop justifies it. Hard rule: workers contain only execution code; business state stays in PG via DOT command results. Recommend: Node first, Python optional, Go deferred. | Workers turn into business-logic sprawl if not policed. | Node-first; per-class concurrency from executor_class_registry. | Phase 2. |
| 4.3 | Executor | Watermill (Go routing lib) | R | Adds a Go dependency and a routing abstraction we do not yet need; conflicts with PG-first routing (event_subscription / job_queue class filter already covers routing). | Premature complexity + dead dependency. | Skip. | Not in roadmap. |
| 4.4 | Executor | Executor class registry, PG-native | A | Per master design §5.3 and law extraction plan §3.8 / §18 candidate clauses. | None. | | Phase 1. |
| 5.1 | Realtime gateway | Custom Nuxt server-route SSE | A as starting choice | Lowest ops cost; no new service; SSE sufficient for problem-only governance summaries (one-way push). Nuxt server route reads outbox tail via LISTEN/NOTIFY. | Saturates above ~1k concurrent connected clients on a single instance. | | Phase 5 entry. |
| 5.2 | Realtime gateway | WebSocket gateway (custom) | M (later) | Add only when a real bidirectional UI flow appears (e.g., live form locking, collaborative editing). Otherwise SSE wins on simplicity. | None if deferred. | | Phase 5+ on demand. |
| 5.3 | Realtime gateway | Socket.io | R | Carries opinionated room/namespace + custom framing on top of WS; harder to reason about backpressure; harder to enforce permission filter at backend layer. | Hidden complexity. | Plain WS if WS needed. | Not in roadmap. |
| 5.4 | Realtime gateway | Centrifugo (later) | M (defer) | Excellent for >10k connected clients with channel auth. Adopt only when Phase 5 SSE/WS shows saturation. | None if deferred. | | Phase 6+ on demand. |
| 5.5 | Realtime gateway | Directus realtime | R | Directus realtime ties the UI to the Directus schema/permission model and to its WS protocol; violates the gateway boundary (Nuxt → backend gateway, NOT Nuxt → Directus realtime). Also bypasses the governance-UI principle of summaries-only because Directus emits raw row changes. | Boundary violation + raw event leakage to UI. | Skip. | Not in roadmap. |
| 6.1 | Workflow orchestrator | MOW native state machine (PG) | A as primary | Many production workflow systems run on plain PG state machines + cron. Year-long durability is a PG property if snapshots + checkpoints exist. Config-first requires the workflow grammar in PG; Temporal/Camunda keep it elsewhere. | Need to engineer snapshot + resume + cron + escalation carefully (Phase 3 scope). | | Phase 3. |
| 6.2 | Workflow orchestrator | Temporal | R for primary; M (later, conditional) | Temporal puts workflow logic in code (TS/Go/Java/Python), which is fundamentally at odds with config-first + DOT-governed. Adopting Temporal would either (a) move ownership of workflow definitions out of PG (violates SoT), or (b) wrap Temporal as a thin executor (then most Temporal value is wasted). Reconsider only if the MOW state machine cannot meet documented requirements after Phase 6 scale hardening: deterministic replay for compliance, cross-DC durability, or >100k concurrent active runs with strict SLA. | High lock-in; opex; governance bypass risk. | Continue MOW native. | Re-evaluate post-Phase 6 only. |
| 6.3 | Workflow orchestrator | Camunda | R | BPMN XML is a parallel SoT for workflow grammar; adopting it means DOT/registry no longer own workflow definitions. Camunda for human approval also overlaps Điều 32 (double ownership). | Boundary violation + double ownership. | Skip. | Not in roadmap. |
| 6.4 | Workflow orchestrator | Airflow | R | Airflow is a batch/data-pipeline scheduler; semantics (DAG run, task instance) do not match business workflow with human approvals + long pauses. | Wrong paradigm. | Skip. | Not in roadmap. |
| 7.1 | Observability | PG audit tables | A | Already partially in place (dot_iu_command_run, cut_request_transition, iu_lifecycle_log, iu_tree_change_log). Extend per phase. | Disk growth, handled by retention policy. | | Immediate. |
| 7.2 | Observability | event_type_registry as schema registry | A with hardening | Add schema_jsonb (JSON Schema), schema_version (semver), compatibility_mode (forward/backward/none), and fn_event_schema_validate(event_type, payload_jsonb) (Phase 1). | If validator turned on suddenly, existing producers may fail; gate per producer. | | Phase 1. |
| 7.3 | Observability | trace_id / correlation_id | A (adopt W3C Trace Context shape now) | Adopt W3C Trace Context: trace_id (16 bytes, 32 hex chars), parent_span_id (8 bytes, 16 hex chars), sampled flag. Even without the OpenTelemetry SDK now, the shape is forward-compatible so OTel can attach later. Present on every new event_outbox + job_queue + workflow_run + task_run row. | None (additive). | | Phase 1. |
| 7.4 | Observability | OpenTelemetry | M (later attach) | Adopt SDK in Node workers when worker code is written (Phase 2+). Endpoint = local collector → Jaeger. Until then, trace_id columns alone are sufficient for end-to-end correlation in PG. | None if shape adopted now. | | Phase 2+ when workers ship. |
| 7.5 | Observability | Jaeger | M (later) | Backend for OTel; attach only when worker count justifies a UI for traces. | Extra service. | Could also be SigNoz/Grafana Tempo; defer choice. | Phase 5+. |
| 7.6 | Observability | Governance UI summaries | A | Per master design §15: problem-only surface; aggregate counts; 1-line AI/worker summaries; drill-down on demand. | None. | | Phase 5. |
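
Row 7.3 can be made concrete with a small sketch. The helper names below (`newTraceContext`, `formatTraceparent`, `parseTraceparent`) are hypothetical and not part of any existing Incomex code; only the field widths and the `00-<trace-id>-<parent-id>-<flags>` layout come from the W3C Trace Context `traceparent` header. `Math.random` is used purely to keep the example dependency-free; a real worker would presumably use a cryptographic random source.

```typescript
// Hypothetical helpers; field widths follow the W3C Trace Context traceparent header.
interface TraceContext {
  traceId: string;      // 16 bytes, 32 lowercase hex chars
  parentSpanId: string; // 8 bytes, 16 lowercase hex chars
  sampled: boolean;
}

const HEX = "0123456789abcdef";

// Illustration only: Math.random is not a cryptographic source.
function randomHex(chars: number): string {
  let out = "";
  for (let i = 0; i < chars; i++) {
    out += HEX.charAt(Math.floor(Math.random() * 16));
  }
  return out;
}

function isAllZero(s: string): boolean {
  return /^0+$/.test(s);
}

function newTraceContext(sampled: boolean): TraceContext {
  let traceId = randomHex(32);
  while (isAllZero(traceId)) traceId = randomHex(32); // all-zero ids are invalid
  let parentSpanId = randomHex(16);
  while (isAllZero(parentSpanId)) parentSpanId = randomHex(16);
  return { traceId: traceId, parentSpanId: parentSpanId, sampled: sampled };
}

function formatTraceparent(ctx: TraceContext): string {
  return "00-" + ctx.traceId + "-" + ctx.parentSpanId + "-" + (ctx.sampled ? "01" : "00");
}

// This sketch accepts version "00" only and returns null for anything else.
function parseTraceparent(header: string): TraceContext | null {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (m === null) return null;
  if (isAllZero(m[1]) || isAllZero(m[2])) return null;
  const flags = parseInt(m[3], 16);
  return { traceId: m[1], parentSpanId: m[2], sampled: (flags & 1) === 1 };
}
```

Storing `traceId`, `parentSpanId` and `sampled` as separate columns keeps them queryable in PG, while `formatTraceparent` yields the header form an OTel exporter would expect later.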

2. Cross-cutting critique

2.1 The "assembly-first" principle is right — but watch for SoT drift

The strategy correctly insists PG/DOT/registry remain SoT. The biggest hidden risk in every candidate above is a second SoT sneaking in: Benthos config file as event taxonomy, pg-boss state vocab as job state, Temporal code as workflow definition, Camunda BPMN as workflow grammar. Each looks like an adapter but quietly becomes an owner. Discipline: every adopted tool MUST point its identity columns / config back to a PG registry row.
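
One way to turn the discipline above into a check is to list every reference a tool's config makes and flag those with no matching registry row. This is a minimal sketch under assumptions: `ToolBinding` and `findSotDrift` are hypothetical names, and the registry is represented as a plain list of keys rather than a real query against event_type_registry.

```typescript
// Hypothetical shape: one reference from an external tool's config to a PG registry key.
interface ToolBinding {
  tool: string;     // e.g. "benthos"
  refersTo: string; // the registry key the tool claims to use
}

// Returns the bindings that point at nothing in the registry (candidate second SoTs).
function findSotDrift(registryKeys: string[], bindings: ToolBinding[]): ToolBinding[] {
  const known: Record<string, boolean> = {};
  for (const key of registryKeys) known[key] = true;
  return bindings.filter(function (b) {
    return known[b.refersTo] !== true;
  });
}
```

Run as a periodic healthcheck, a non-empty result would mean a tool has started naming things the registry does not know about.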

2.2 "Adapters are replaceable" must be enforced by interface, not aspiration

Replaceability requires an interface boundary in PG: e.g., an event_publisher interface (function signature) that internal producers call, with the real publisher selectable by config (PG-only → PG+NATS → NATS-relay). Without that, "replaceable" remains rhetorical.
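
The boundary described here lives in PG in the proposal; the same idea can be sketched on the worker side to show what "selectable by config" means. Everything named below (`EventPublisher`, `PgOutboxPublisher`, `RelayingPublisher`, the mode strings) is hypothetical, the outbox is an in-memory array, and the relay is a plain callback standing in for a future NATS transport.

```typescript
interface OutboxEvent {
  eventType: string;
  payload: string; // serialized JSON
}

interface EventPublisher {
  // Returns the sinks written, purely so the example is observable.
  publish(ev: OutboxEvent): string[];
}

// Stand-in for the PG event_outbox write.
class PgOutboxPublisher implements EventPublisher {
  rows: OutboxEvent[] = [];
  publish(ev: OutboxEvent): string[] {
    this.rows.push(ev);
    return ["event_outbox"];
  }
}

// Decorator: the outbox stays the source of truth, the relay is an extra sink.
class RelayingPublisher implements EventPublisher {
  private inner: EventPublisher;
  private relay: (ev: OutboxEvent) => void;
  constructor(inner: EventPublisher, relay: (ev: OutboxEvent) => void) {
    this.inner = inner;
    this.relay = relay;
  }
  publish(ev: OutboxEvent): string[] {
    const sinks = this.inner.publish(ev); // outbox write always happens first
    this.relay(ev);
    return sinks.concat(["relay"]);
  }
}

type PublisherMode = "pg_only" | "pg_plus_relay";

function selectPublisher(
  mode: PublisherMode,
  pg: PgOutboxPublisher,
  relay: (ev: OutboxEvent) => void
): EventPublisher {
  return mode === "pg_plus_relay" ? new RelayingPublisher(pg, relay) : pg;
}
```

Producers only ever see `EventPublisher`, so switching modes is a config change rather than a code change in each producer.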

2.3 The current VPS-on-single-host shape rejects most heavy tools by default

NATS cluster, Kafka, Temporal cluster, Camunda Zeebe: all assume multi-host or dedicated nodes. The current Incomex deployment (single VPS, single Postgres container) disqualifies these for the first two years. Re-evaluation gate: when a second VPS or managed cluster joins the topology.

2.4 Config-first vs framework-first is the deciding lens

Tools whose definitions live in code or YAML inside their own runtime (Temporal workflows, Camunda BPMN, Airflow DAGs, Watermill routes) fight config-first. Tools whose runtime is driven by data in your DB (Benthos reading PG, pg-boss reading PG, Graphile reading PG) merely fight state-vocab fit. Always prefer the latter shape — and even there, only adopt when the alternative (native PG) is provably insufficient.

2.5 The "register-before-emit" gate is the single biggest governance win

Already in place via event_type_registry. Adding fn_event_schema_validate and turning it on at producer boundaries gives Incomex a property few OSS tools provide out of the box: emission is governed, not opportunistic.
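
The gate can be pictured as a single decision function at the producer boundary. The sketch below is an application-side illustration only: `checkEmit` and its types are hypothetical, the registry is an in-memory object, and the field-type check is a deliberately tiny stand-in for the JSON Schema validation that fn_event_schema_validate is proposed to perform. The `enforceSchema` flag models the per-producer gate mentioned in row 7.2.

```typescript
type FieldType = "string" | "number" | "boolean";

interface RegisteredEventType {
  schemaVersion: string;
  requiredFields: Record<string, FieldType>;
}

interface EmitDecision {
  allowed: boolean;
  reason: string;
}

function checkEmit(
  registry: Record<string, RegisteredEventType>,
  eventType: string,
  payload: Record<string, unknown>,
  enforceSchema: boolean
): EmitDecision {
  // Register-before-emit: unknown event types are rejected regardless of the schema gate.
  if (!Object.prototype.hasOwnProperty.call(registry, eventType)) {
    return { allowed: false, reason: "unregistered event_type: " + eventType };
  }
  if (!enforceSchema) {
    return { allowed: true, reason: "registered; schema check gated off" };
  }
  const required = registry[eventType].requiredFields;
  for (const field of Object.keys(required)) {
    if (typeof payload[field] !== required[field]) {
      return { allowed: false, reason: "payload." + field + " must be " + required[field] };
    }
  }
  return { allowed: true, reason: "registered and valid" };
}
```

Keeping registration mandatory while the schema check is switchable lets existing producers keep emitting during the gradual rollout.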

3. Recommended architecture

3.1 Immediate (Phase 0 → Phase 1)

  • Producers: PG triggers + DOT command emits → event_outbox. Register-before-emit enforced via event_type_registry + new fn_event_schema_validate.
  • Broker: PG event_outbox + event_pending/event_read/event_subscription + LISTEN/NOTIFY when queue.notify.enabled=true.
  • Job queue: PG job_queue + job_dead_letter + lease + heartbeat (Điều 45 §15.5) + retry/backoff.
  • Executor: DOT command catalog as governance layer; Node worker pool consumes job_queue with class filter; executor classes registered in new executor_class_registry.
  • Realtime: none yet — governance UI consumes via REST + simple polling until Phase 5.
  • Observability: PG audit tables; W3C trace_id shape on every new write; dot_iu_command_run for command audit; healthcheck surface (already present).
  • Validation: native PG side (cross-field engine in fn_moit_validate).
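
The retry/backoff and DLQ items in the job-queue bullet can be sketched as a pure decision function. The names (`RetryPolicy`, `backoffDelayMs`, `onFailure`) and the exponential-doubling policy are assumptions for illustration; the source does not specify the backoff curve, and jitter is left out to keep the example deterministic.

```typescript
interface RetryPolicy {
  maxAttempts: number; // attempts allowed before dead-lettering
  baseDelayMs: number; // delay after the first failed attempt
  maxDelayMs: number;  // cap on any single delay
}

interface FailureOutcome {
  action: "retry" | "dead_letter";
  runAtMs: number; // when the job becomes eligible again (unused for dead_letter)
}

// Exponential backoff: base, 2x base, 4x base, ... capped at maxDelayMs.
function backoffDelayMs(policy: RetryPolicy, attempt: number): number {
  const raw = policy.baseDelayMs * Math.pow(2, attempt - 1);
  return Math.min(policy.maxDelayMs, raw);
}

// attemptsMade counts the attempt that just failed.
function onFailure(policy: RetryPolicy, attemptsMade: number, nowMs: number): FailureOutcome {
  if (attemptsMade >= policy.maxAttempts) {
    return { action: "dead_letter", runAtMs: nowMs };
  }
  return { action: "retry", runAtMs: nowMs + backoffDelayMs(policy, attemptsMade) };
}
```

In the PG-native design the same decision would presumably be applied when a worker reports failure: either reschedule the job_queue row or move it to job_dead_letter.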

3.2 Near term (Phase 2 → Phase 4)

  • MOT with 2 executor classes operational (dot + human), then expand to sql + ai_agent + external_api + notification + render.
  • MOW core with state machine + advance loop + proposal mode wired to workflow_change_requests.
  • MOIT / MOUT factories: field_registry, input_form_registry, output_table_registry, dot_function_registry.
  • Schema registry hardening: validator turned on per producer, gradually.
  • OpenTelemetry SDK added to Node workers as they ship (trace_id columns already populated → just attach exporters).
  • Heartbeat caller obligation enforced for all new workers.
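
The heartbeat obligation in the last bullet amounts to a lease that lapses unless its owner keeps renewing it. The sketch below assumes a simple time-to-live model; the `Lease` fields and function names are hypothetical and are not the actual columns of dot_iu_runtime_lease or queue_heartbeat.

```typescript
interface Lease {
  jobId: number;
  workerId: string;
  lastHeartbeatMs: number;
  ttlMs: number;
}

function isExpired(lease: Lease, nowMs: number): boolean {
  return nowMs - lease.lastHeartbeatMs > lease.ttlMs;
}

// Returns the renewed lease, or null when the caller does not own the lease
// or has already let it lapse (another worker may have reclaimed the job).
function heartbeat(lease: Lease, workerId: string, nowMs: number): Lease | null {
  if (lease.workerId !== workerId || isExpired(lease, nowMs)) return null;
  return {
    jobId: lease.jobId,
    workerId: lease.workerId,
    lastHeartbeatMs: nowMs,
    ttlMs: lease.ttlMs
  };
}

// Jobs whose worker stopped heartbeating; these can be handed to another worker.
function reclaimableJobIds(leases: Lease[], nowMs: number): number[] {
  return leases
    .filter(function (l) { return isExpired(l, nowMs); })
    .map(function (l) { return l.jobId; });
}
```

A worker that receives `null` from `heartbeat` should stop working on the job, since it can no longer assume exclusive ownership.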

3.3 Phase 5 (when governance UI ships)

  • Realtime gateway: Nuxt server-route SSE as starting choice; one-way summary push from outbox tail. Permission filter at backend.
  • Governance UI per master design §15.
  • DLQ replay via UI with approval gate (dlq_replay_request table).
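
The first bullet combines two small pieces: a backend permission filter and Server-Sent Events framing. A minimal sketch follows, assuming a hypothetical `Summary` shape with a `scope` field used for permission checks; only the `id:` / `event:` / `data:` lines and the terminating blank line come from the SSE wire format.

```typescript
interface Summary {
  id: number;    // outbox position, reused as the SSE event id
  kind: string;  // e.g. "dlq_growth"
  scope: string; // hypothetical permission scope
  line: string;  // one-line summary shown in the UI
}

// Backend-side filter: the client never receives summaries outside its scopes.
function visibleSummaries(allowedScopes: string[], items: Summary[]): Summary[] {
  return items.filter(function (s) {
    return allowedScopes.indexOf(s.scope) >= 0;
  });
}

// One SSE frame. JSON.stringify keeps the data on a single line,
// and the trailing blank line terminates the event.
function formatSse(s: Summary): string {
  return (
    "id: " + s.id + "\n" +
    "event: " + s.kind + "\n" +
    "data: " + JSON.stringify({ scope: s.scope, line: s.line }) + "\n\n"
  );
}
```

Using the outbox position as the event id means a reconnecting browser can send `Last-Event-ID` and the server route can resume from that point in the outbox tail.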

3.4 Later scale (Phase 6+, profile-driven)

  • NATS introduced as transport (not SoT) iff multi-host workers or cross-service consumers emerge. event_outbox stays SoT; an outbox-relay worker mirrors to NATS topics.
  • WebSocket gateway iff a real bidirectional UI flow appears (collaborative editing, live form locking).
  • Centrifugo iff connected-client count exceeds Nuxt server-route or custom WS capacity.
  • Benthos in a bounded scope (external table mirroring), not as domain event source.
  • OpenTelemetry collector + Jaeger / Grafana Tempo / SigNoz for worker fleet trace UI.
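
The outbox-relay worker in the first bullet can be sketched as a cursor that only advances past rows the transport has acknowledged. `OutboxRow`, `Transport` and `relayBatch` are hypothetical names; `Transport.publish` is a stand-in callback and not the NATS client API, and the `"outbox."` subject prefix is an assumption.

```typescript
interface OutboxRow {
  id: number;        // monotonically increasing outbox position
  eventType: string;
  payload: string;
}

interface Transport {
  // Returns true when the broker acknowledged the message.
  publish(subject: string, body: string): boolean;
}

// Relays rows after `cursor` (rows must be ordered by id ascending) and returns
// the new cursor. The cursor stops at the first failure, so that row is retried
// on the next pass: delivery is at-least-once and event_outbox stays the SoT.
function relayBatch(rows: OutboxRow[], cursor: number, transport: Transport, batchSize: number): number {
  let next = cursor;
  let sent = 0;
  for (const row of rows) {
    if (row.id <= cursor) continue;
    if (sent >= batchSize) break;
    if (!transport.publish("outbox." + row.eventType, row.payload)) break;
    next = row.id;
    sent++;
  }
  return next;
}
```

Because a crash between publish and cursor persistence can resend a row, consumers on the broker side would need to deduplicate, for example on the outbox id.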

3.5 Re-evaluation criteria

| Trigger | Action |
|---|---|
| Two or more VPS hosts running workers | Re-evaluate NATS as transport. |
| Outbox tail consistently >N minutes lag | Tune indexes; if still lagging, evaluate NATS or partitioning. |
| External tables mirror needed (Lark/Notion/...) | Evaluate Benthos for bounded mirror. |
| Connected real-time clients > ~1k | Evaluate WS gateway / Centrifugo. |
| Workflow state machine soak shows insufficiency (>100k active runs OR deterministic replay required) | Evaluate Temporal, with full Council review. |
| AI agent fleet emerges with Python deps | Add Python worker class; share trace_id. |
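
The numeric triggers above lend themselves to an automated check over a periodic metrics snapshot. This is a sketch under assumptions: `DeploymentProfile`, `Thresholds` and `dueReevaluations` are hypothetical, the lag threshold is supplied by the caller because the source leaves N unspecified, and the two qualitative triggers (external mirror need, Python agent fleet) are left to human judgment. A single snapshot also cannot express "consistently"; a real check would look at a window of samples.

```typescript
interface DeploymentProfile {
  workerHosts: number;
  outboxLagMinutes: number;
  realtimeClients: number;
  activeWorkflowRuns: number;
  deterministicReplayRequired: boolean;
}

interface Thresholds {
  outboxLagMinutes: number;   // "N" in the table; not fixed by the source
  realtimeClients: number;    // ~1000 per the table
  activeWorkflowRuns: number; // 100000 per the table
}

function dueReevaluations(p: DeploymentProfile, t: Thresholds): string[] {
  const due: string[] = [];
  if (p.workerHosts >= 2) due.push("Re-evaluate NATS as transport");
  if (p.outboxLagMinutes > t.outboxLagMinutes) {
    due.push("Tune indexes; if still lagging, evaluate NATS or partitioning");
  }
  if (p.realtimeClients > t.realtimeClients) due.push("Evaluate WS gateway / Centrifugo");
  if (p.activeWorkflowRuns > t.activeWorkflowRuns || p.deterministicReplayRequired) {
    due.push("Evaluate Temporal (full Council review)");
  }
  return due;
}
```

An empty result means none of the measurable gates has opened, so the native PG shape stays as is.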

4. Decisions that should NOT be made yet

| Decision | Why not yet |
|---|---|
| Adopt or reject NATS finally. | Need Phase 1–4 metrics on outbox + LISTEN/NOTIFY behaviour. |
| Adopt or reject Temporal finally. | Need Phase 3 MOW state machine soak + Phase 6 scale hardening results. |
| Choose SSE vs WS vs Centrifugo for realtime. | Need Phase 5 client profile and a real bidirectional use case. |
| Adopt pg-boss / Graphile Worker as job queue substrate. | Would require a Điều 45 §6.7 amendment: premature; native is provably enough. |
| Adopt Benthos for domain events. | Would create taxonomy outside event_type_registry; wait for a bounded mirror use case. |
| Adopt Camunda for human approval. | Would double-own with Điều 32; premature and likely never. |
| Adopt Airflow. | Wrong paradigm; no current need. |
| Adopt Watermill. | Routing covered by event_subscription + job_queue class filter; likely never needed. |
| Adopt Directus realtime for UI. | Boundary violation; likely never. |
| Adopt Hasura subscriptions. | Boundary + ownership conflict; already rejected by master design. |
| Pick OTel backend (Jaeger vs Tempo vs SigNoz). | Defer to Phase 5+ when worker fleet justifies a UI. |

5. What this critique adds beyond the candidate strategy

  • State-vocab fit as a first-class adoption criterion (rejects pg-boss/Graphile as substrate even though they look ideal).
  • Config-first vs framework-first distinction (rejects Temporal/Camunda/Airflow at a deeper level than "later").
  • SoT-drift watchpoint (every adopted tool must point back to a PG registry row).
  • W3C trace_id shape adoption now so OTel attaches later without schema migration.
  • Re-evaluation triggers (concrete metrics, not vibes).
  • Directus realtime explicit reject — the candidate strategy didn't flag this and it is the easiest accidental boundary violation.

End critique.