Assembly-First Open-Source Integration — Critique + Recommended Architecture (DRAFT, 2026-05-27)
Path: knowledge/dev/design/assembly-first-open-source-integration-critique.md
Status: DRAFT. Claude Code critique of the user's candidate strategy (2026-05-27 addendum). Not final truth; subject to Council review.
Principle assumed: PG/DOT/registry remains the legal/config/source-of-truth core. Open-source tools attach as replaceable adapters, executors, brokers, gateways; never owners of governance.
1. Critique matrix
Legend: A=Accept · M=Modify · R=Reject
| # | Slot | Proposed choice | A/M/R | Reason | Risk | Better option if any | Phase |
|---|---|---|---|---|---|---|---|
| 1.1 | Producer / CDC | Keep PG-native event_outbox | A | event_outbox + event_type_registry + register-before-emit is curated, governed, already operational (140 k+ rows historical). Aligns with Hiến pháp NT13 + Điều 45 §6. | None. | — | Immediate. |
| 1.2 | Producer / CDC | Add Benthos / Redpanda Connect (config-driven CDC) | M | YAML config aligns with NT2/NT4. BUT WAL/CDC requires logical replication slot + pgoutput / wal2json plugin + slot supervision; on a single VPS this is non-trivial ops. More critically, CDC captures every row change indiscriminately — that conflicts with the "register-before-emit" governance enshrined in event_type_registry. CDC is useful for external table mirroring (e.g., Lark/Notion replicas), not for IU/workflow/task domain events. | If adopted broadly: shadow event taxonomy outside event_type_registry → governance fragmentation; replication slot bloat if consumer lags. | Keep trigger-emitted events for domain; reserve Benthos for bounded mirroring use cases (e.g., S177 Lark CRUD shadow → PG mirror feed) — and only when a real such case appears. | Later (Phase 5+ if any). |
| 1.3 | Producer / CDC | Artie / PeerDB | R for now | Heavier replication tooling; same governance problem as Benthos; more vendor surface. | High ops cost for marginal benefit at current scale. | Re-evaluate only if multi-DC PG replication becomes a hard requirement. | Not in roadmap. |
| 2.1 | Broker / Event bus | PG-native outbox + LISTEN/NOTIFY + event_subscription | A | Already exists. queue.notify.enabled is currently false but gate is in place. Phase 1 just turns it on with discipline. | Polling lag if NOTIFY disabled; bounded by outbox tail cursor. | — | Immediate. |
| 2.2 | Broker / Event bus | NATS | M (defer) | NATS is excellent for fanout/pub-sub across hosts, but introduces a separate cluster and a parallel SoT for messages. Live evidence does not yet show multi-host workers or cross-service consumers requiring NATS. Adoption now duplicates Điều 45's outbox semantics. Migration path: introduce later as a transport fed by an outbox-relay worker — event_outbox stays SoT; NATS publishes to topic mirrors. No law change required since Điều 45 §6 is substrate-neutral. | Premature NATS = two SoTs for events, drift between PG and broker, harder DLQ semantics. | PG LISTEN/NOTIFY for in-host, in-cluster fanout first. NATS later iff multi-host or external consumer demands it. | Phase 5+ (only if profile demands). |
| 2.3 | Broker / Event bus | Redis Streams | R if no existing Redis | Live PG survey shows no Redis in the stack. Adding Redis is a new persistence + ops surface for marginal benefit over PG. If Redis appears later for caching, then Streams becomes a possible add — but Redis persistence semantics are weaker than PG; not a substitute SoT. | Two SoTs; Redis durability < PG; ops surface grows. | Skip. | Not in roadmap (revisit only if Redis already arrives for another reason). |
| 2.4 | Broker / Event bus | Kafka / Redpanda | R | Operational cost completely out of proportion to current scale. Conflicts with VPS-light constraint. | Massive ops + cost. | Skip until cross-region durability is mandatory and PG cannot keep up — likely years away. | Not in roadmap. |
| 3.1 | Job queue / Worker | Existing PG job_queue + job_dead_letter + queue_heartbeat + dot_iu_runtime_lease | A | Tables and gates already in place; queue.runtime.phase=phase2_governance; missing pieces are config-on + caller obligations, not new substrate. | None; improvements native. | — | Immediate. |
| 3.2 | Job queue / Worker | pg-boss | M / R as substrate; A as library pattern reference | pg-boss owns its own state vocab (created→active→completed/failed/expired) that does not match Điều 45 §6.7 work_state_machine (≥9 states). Adopting pg-boss as substrate means either (a) Council amends §6.7 — high blast radius — or (b) we wrap pg-boss and lose its straight-line benefits. Better to keep our state model and borrow patterns from pg-boss (cron-style schedules, throttling) implemented natively. | If adopted as substrate: Điều 45 §6.7 violation. | Native implementation, pattern-inspired. | Patterns immediate; library no. |
| 3.3 | Job queue / Worker | Graphile Worker | R as substrate | Same shape problem as pg-boss; tightly coupled to Postgres functions called by name; cron-extension dependency for scheduling on container Postgres can be messy. | Same as pg-boss. | Native. | Not in roadmap as substrate. |
| 3.4 | Job queue / Worker | Implement missing semantics natively | A | Idempotency_key (UNIQUE NULLS NOT DISTINCT) + lease + heartbeat (Điều 45 §15.5) + retry with backoff + DLQ are already partially in place. Phase 0–1 finishes the rest. | None. | — | Immediate–Phase 1. |
| 4.1 | Executor | DOT executor (governance layer) | A | DOT remains the audit + governance plane. Executor classes register; DOT logs runs (dot_iu_command_run). | None. | — | Immediate. |
| 4.2 | Executor | Node / Python / Go workers | M | Node fits the existing Directus/Nuxt stack — minimal new runtime. Python only where AI agent runs require it (memo: do not import Python services prematurely). Go optional, only if a tight inner loop justifies it. Hard rule: workers contain only execution code; business state stays in PG via DOT command results. Recommend: Node first, Python optional, Go deferred. | Workers turn into business-logic sprawl if not policed. | Node-first; per-class concurrency from executor_class_registry. | Phase 2. |
| 4.3 | Executor | Watermill (Go routing lib) | R | Adds Go dependency and a routing abstraction we do not yet need; conflicts with PG-first routing (event_subscription / job_queue class filter already covers routing). | Premature complexity + dead dependency. | Skip. | Not in roadmap. |
| 4.4 | Executor | Executor class registry PG-native | A | Per master design §5.3 and law extraction plan §3.8 / §18 candidate clauses. | None. | — | Phase 1. |
| 5.1 | Realtime gateway | Custom Nuxt server-route SSE | A as starting choice | Lowest ops cost; no new service; SSE sufficient for problem-only governance summaries (one-way push). Nuxt server route reads outbox tail via LISTEN/NOTIFY. | Saturates above ~1k concurrent connected clients on a single instance. | — | Phase 5 entry. |
| 5.2 | Realtime gateway | WebSocket gateway (custom) | M (later) | Add only when a real bidirectional UI flow appears (e.g., live form locking, collaborative editing). Otherwise SSE wins on simplicity. | None if deferred. | — | Phase 5+ on demand. |
| 5.3 | Realtime gateway | Socket.io | R | Carries opinionated room/namespace + custom framing on top of WS; harder to reason about backpressure; harder to enforce permission filter at backend layer. | Hidden complexity. | Plain WS if WS needed. | Not in roadmap. |
| 5.4 | Realtime gateway | Centrifugo (later) | M (defer) | Excellent for >10k connected clients with channel auth. Adopt only when Phase 5 SSE/WS shows saturation. | None if deferred. | — | Phase 6+ on demand. |
| 5.5 | Realtime gateway | Directus realtime | R | Directus realtime ties UI to Directus schema/permission model and to its WS protocol; violates the gateway boundary (Nuxt → backend gateway, NOT Nuxt → Directus realtime). Also bypasses the governance-UI principle of summaries-only because Directus emits raw row changes. | Boundary violation + raw event leakage to UI. | Skip. | Not in roadmap. |
| 6.1 | Workflow orchestrator | MOW native state machine (PG) | A as primary | Most production workflow systems run on plain PG state machines + cron. Year-long durability is a PG property if snapshots + checkpoints exist. Config-first requires the workflow grammar in PG — Temporal/Camunda keep it elsewhere. | Need to engineer snapshot + resume + cron + escalation carefully (Phase 3 scope). | — | Phase 3. |
| 6.2 | Workflow orchestrator | Temporal | R for primary; M (later, conditional) | Temporal puts workflow logic in code (TS/Go/Java/Python) — fundamentally at odds with config-first + DOT-governed. Adopting Temporal would either (a) move ownership of workflow definitions out of PG (violates SoT), or (b) wrap Temporal as a thin executor (then most Temporal value is wasted). Reconsider only if MOW state machine cannot meet documented requirements after Phase 6 scale hardening: deterministic replay for compliance, cross-DC durability, or >100 k concurrent active runs with strict SLA. | High lock-in; opex; governance bypass risk. | Continue MOW native. | Re-evaluate post-Phase 6 only. |
| 6.3 | Workflow orchestrator | Camunda | R | BPMN XML is a parallel SoT for workflow grammar; adopting it means DOT/registry no longer own workflow definitions. Camunda for human approval also overlaps Điều 32 — double ownership. | Boundary violation + double ownership. | Skip. | Not in roadmap. |
| 6.4 | Workflow orchestrator | Airflow | R | Airflow is a batch/data-pipeline scheduler; semantics (DAG run, task instance) do not match business workflow with human approvals + long pauses. | Wrong paradigm. | Skip. | Not in roadmap. |
| 7.1 | Observability | PG audit tables | A | Already partially in place (dot_iu_command_run, cut_request_transition, iu_lifecycle_log, iu_tree_change_log). Extend per phase. | Disk growth, handled by retention policy. | — | Immediate. |
| 7.2 | Observability | event_type_registry as schema registry | A with hardening | Add schema_jsonb (JSON Schema), schema_version (semver), compatibility_mode (forward/backward/none), and fn_event_schema_validate(event_type, payload_jsonb) (Phase 1). | If validator turned on suddenly, existing producers may fail; gate per producer. | — | Phase 1. |
| 7.3 | Observability | trace_id / correlation_id | A — adopt W3C tracecontext shape now | Adopt W3C: trace_id (16-byte hex, 32 chars), parent_span_id (8-byte hex, 16 chars), sampled flag. Even without OpenTelemetry SDK now, the shape is forward-compatible so OTel can attach later. Present on every new event_outbox + job_queue + workflow_run + task_run row. | None — additive. | — | Phase 1. |
| 7.4 | Observability | OpenTelemetry | M (later attach) | Adopt SDK in Node workers when worker code is written (Phase 2+). Endpoint = local collector → Jaeger. Until then, trace_id columns alone are sufficient for end-to-end correlation in PG. | None if shape adopted now. | — | Phase 2+ when workers ship. |
| 7.5 | Observability | Jaeger | M (later) | Backend for OTel; attach only when worker count justifies UI for traces. | Extra service. | Could also be SigNoz/Grafana Tempo — defer choice. | Phase 5+. |
| 7.6 | Observability | Governance UI summaries | A | Per master design §15: problem-only surface; aggregate counts; 1-line AI/worker summaries; drill-down on demand. | None. | — | Phase 5. |
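
The trace-context shape in row 7.3 maps onto the W3C `traceparent` header, which joins a version, the 32-hex trace id, the 16-hex parent id, and a flags byte with hyphens. The sketch below is illustrative only: the type and function names are hypothetical, it handles version `00` only, and it is not taken from any existing Incomex code.

```typescript
// Sketch under assumptions: helper names are hypothetical; only version "00" is handled.
interface TraceContext {
  traceId: string;      // 16 bytes as 32 lowercase hex chars
  parentSpanId: string; // 8 bytes as 16 lowercase hex chars
  sampled: boolean;     // lowest bit of the trace-flags byte
}

const TRACEPARENT_V00 = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;
const ALL_ZERO = /^0+$/;

// Serialise the three stored columns into a traceparent header value.
function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.parentSpanId}-${ctx.sampled ? "01" : "00"}`;
}

// Parse a header value; returns null for anything malformed.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_V00.exec(header);
  if (match === null) return null;
  // Defaults only satisfy strict index checks; the regex guarantees all three groups.
  const [, traceId = "", parentSpanId = "", flagsHex = ""] = match;
  // All-zero ids are invalid in W3C Trace Context.
  if (ALL_ZERO.test(traceId) || ALL_ZERO.test(parentSpanId)) return null;
  return { traceId, parentSpanId, sampled: (parseInt(flagsHex, 16) & 1) === 1 };
}
```

Storing the three fields as columns and deriving the header on demand keeps PG rows independent of any tracing SDK, which is the forward-compatibility property the row relies on.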
2. Cross-cutting critique
2.1 The "assembly-first" principle is right — but watch for SoT drift
The strategy correctly insists PG/DOT/registry remain SoT. The biggest hidden risk in every candidate above is a second SoT sneaking in: Benthos config file as event taxonomy, pg-boss state vocab as job state, Temporal code as workflow definition, Camunda BPMN as workflow grammar. Each looks like an adapter but quietly becomes an owner. Discipline: every adopted tool MUST point its identity columns / config back to a PG registry row.
2.2 "Adapters are replaceable" must be enforced by interface, not aspiration
Replaceability requires an interface boundary in PG: e.g., an event_publisher interface (function signature) that internal producers call, with the real publisher selectable by config (PG-only → PG+NATS → NATS-relay). Without that, "replaceable" remains rhetorical.
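
The interface boundary described in 2.2 can be pictured as follows. The text places the boundary in PG as a function signature; this sketch shows the same shape on the worker side purely for illustration. Every name here (`EventPublisher`, `selectPublisher`, the mode strings) is hypothetical, the publishers are in-memory stand-ins rather than real PG or NATS clients, and the relay-style mode is left out for brevity.

```typescript
// Sketch under assumptions: all type, class, and function names here are hypothetical.
type PublisherMode = "pg_only" | "pg_plus_nats";

interface DomainEvent {
  eventType: string;
  payload: Record<string, unknown>;
}

// The one signature producers are allowed to depend on.
interface EventPublisher {
  publish(event: DomainEvent): void;
}

// In-memory stand-in for an INSERT into event_outbox.
class OutboxPublisher implements EventPublisher {
  readonly rows: DomainEvent[] = [];
  publish(event: DomainEvent): void {
    this.rows.push(event);
  }
}

// In-memory stand-in for a NATS client.
class FakeNatsPublisher implements EventPublisher {
  readonly sent: DomainEvent[] = [];
  publish(event: DomainEvent): void {
    this.sent.push(event);
  }
}

// Writes to the source of truth first, then mirrors to the transport.
class MirroringPublisher implements EventPublisher {
  private readonly primary: EventPublisher;
  private readonly mirror: EventPublisher;
  constructor(primary: EventPublisher, mirror: EventPublisher) {
    this.primary = primary;
    this.mirror = mirror;
  }
  publish(event: DomainEvent): void {
    this.primary.publish(event);
    this.mirror.publish(event);
  }
}

// Config picks the implementation; producers never see which one they got.
function selectPublisher(
  mode: PublisherMode,
  outbox: EventPublisher,
  nats: EventPublisher,
): EventPublisher {
  return mode === "pg_only" ? outbox : new MirroringPublisher(outbox, nats);
}
```

Because producers only see `EventPublisher`, switching the configured mode changes the transport without touching producer code, and the outbox write always happens in either mode.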
2.3 The current VPS-on-single-host shape rejects most heavy tools by default
A NATS cluster, Kafka, a Temporal cluster, and Camunda Zeebe all assume multi-host or dedicated nodes. The current Incomex deployment (single VPS, single Postgres container) disqualifies these by default, likely for the first two years. Re-evaluation gate: when a second VPS or managed cluster joins the topology.
2.4 Config-first vs framework-first is the deciding lens
Tools whose definitions live in code or YAML inside their own runtime (Temporal workflows, Camunda BPMN, Airflow DAGs, Watermill routes) fight config-first. Tools whose runtime is driven by data in your DB (Benthos reading PG, pg-boss reading PG, Graphile reading PG) merely fight state-vocab fit. Always prefer the latter shape — and even there, only adopt when the alternative (native PG) is provably insufficient.
2.5 The "register-before-emit" gate is the single biggest governance win
Already in place via event_type_registry. Adding fn_event_schema_validate and turning it on at producer boundaries gives Incomex a property few OSS tools provide out of the box: emission is governed, not opportunistic.
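
A minimal sketch of the register-before-emit gate follows. It is an in-memory illustration, not the PG implementation: the class and field names are hypothetical, the event type `task.created` is made up for the example, and a list of required keys stands in for the JSON Schema check that fn_event_schema_validate is meant to perform.

```typescript
// Sketch under assumptions: names are hypothetical; requiredKeys is a simplified
// stand-in for JSON Schema validation.
interface RegisteredEventType {
  schemaVersion: string;
  requiredKeys: string[];
}

type EmitResult = { ok: true } | { ok: false; reason: string };

interface OutboxEntry {
  eventType: string;
  schemaVersion: string;
  payload: Record<string, unknown>;
}

class GovernedEmitter {
  private readonly registry = new Map<string, RegisteredEventType>();
  readonly outbox: OutboxEntry[] = [];

  // Registration must happen before any emit of this type succeeds.
  register(eventType: string, entry: RegisteredEventType): void {
    this.registry.set(eventType, entry);
  }

  emit(eventType: string, payload: Record<string, unknown>): EmitResult {
    const entry = this.registry.get(eventType);
    if (entry === undefined) {
      return { ok: false, reason: `unregistered event type: ${eventType}` };
    }
    const missing = entry.requiredKeys.filter((key) => !(key in payload));
    if (missing.length > 0) {
      return { ok: false, reason: `missing keys: ${missing.join(", ")}` };
    }
    // Only governed, validated events reach the outbox.
    this.outbox.push({ eventType, schemaVersion: entry.schemaVersion, payload });
    return { ok: true };
  }
}
```

An unregistered type and an incomplete payload are both rejected before anything is written, which is the "governed, not opportunistic" property in miniature.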
3. Final recommended architecture
3.1 Immediate (Phase 0 → Phase 1)
- Producers: PG triggers + DOT command emits → event_outbox. Register-before-emit enforced via event_type_registry + new fn_event_schema_validate.
- Broker: PG event_outbox + event_pending / event_read / event_subscription + LISTEN/NOTIFY when queue.notify.enabled=true.
- Job queue: PG job_queue + job_dead_letter + lease + heartbeat (Điều 45 §15.5) + retry/backoff.
- Executor: DOT command catalog as governance layer; Node worker pool consumes job_queue with class filter; executor classes registered in new executor_class_registry.
- Realtime: none yet; governance UI consumes via REST + simple polling until Phase 5.
- Observability: PG audit tables; W3C trace_id shape on every new write; dot_iu_command_run for command audit; healthcheck surface (already present).
- Validation: native PG side (cross-field engine in fn_moit_validate).
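
The job-queue semantics listed above (lease, heartbeat, retry with backoff, dead-lettering) can be sketched as plain functions over a job record. This is a hedged illustration only: the field names, the 1 s base delay, the 60 s cap, and the doubling schedule are assumptions, not the actual job_queue columns or Incomex policy.

```typescript
// Sketch under assumptions: field names and timing constants are hypothetical.
interface JobRow {
  id: string;
  attempts: number;
  maxAttempts: number;
  leaseOwner: string | null;
  leaseExpiresAt: number; // epoch ms; 0 when never leased
}

// Exponential backoff: base, 2x base, 4x base, ... capped at capMs.
function backoffDelayMs(attempt: number, baseMs: number = 1000, capMs: number = 60000): number {
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}

// A worker may claim a job only if nobody holds an unexpired lease.
function tryClaim(job: JobRow, workerId: string, now: number, leaseMs: number): boolean {
  const leaseHeld = job.leaseOwner !== null && job.leaseExpiresAt > now;
  if (leaseHeld) return false;
  job.leaseOwner = workerId;
  job.leaseExpiresAt = now + leaseMs;
  return true;
}

// Heartbeat extends the lease, but only for the current, unexpired owner.
function heartbeat(job: JobRow, workerId: string, now: number, leaseMs: number): boolean {
  if (job.leaseOwner !== workerId || job.leaseExpiresAt <= now) return false;
  job.leaseExpiresAt = now + leaseMs;
  return true;
}

type FailureOutcome = { kind: "retry"; runAfter: number } | { kind: "dead_letter" };

// On failure: release the lease, then either schedule a retry or dead-letter.
function recordFailure(job: JobRow, now: number): FailureOutcome {
  job.attempts += 1;
  job.leaseOwner = null;
  if (job.attempts >= job.maxAttempts) return { kind: "dead_letter" };
  return { kind: "retry", runAfter: now + backoffDelayMs(job.attempts) };
}
```

In PG the claim step would typically be a single atomic UPDATE so two workers cannot both win; the in-memory version only shows the decision logic.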
3.2 Near term (Phase 2 → Phase 4)
- MOT with 2 executor classes operational (dot + human), then expand to sql + ai_agent + external_api + notification + render.
- MOW core with state machine + advance loop + proposal mode wired to workflow_change_requests.
- MOIT / MOUT factories: field_registry, input_form_registry, output_table_registry, dot_function_registry.
- Schema registry hardening: validator turned on per producer, gradually.
- OpenTelemetry SDK added to Node workers as they ship (trace_id columns already populated → just attach exporters).
- Heartbeat caller obligation enforced for all new workers.
3.3 Phase 5 (when governance UI ships)
- Realtime gateway: Nuxt server-route SSE as starting choice; one-way summary push from outbox tail. Permission filter at backend.
- Governance UI per master design §15.
- DLQ replay via UI with approval gate (dlq_replay_request table).
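
The one-way summary push with a backend permission filter can be sketched as two pure functions: one that drops summaries the viewer may not see, and one that serialises a summary into a Server-Sent Events frame. All names and the `scope` field are hypothetical; reusing the outbox position as the SSE `id` is an assumption that would let a reconnecting client resume via `Last-Event-ID`.

```typescript
// Sketch under assumptions: the summary shape and scope model are hypothetical.
interface GovernanceSummary {
  cursor: number; // outbox position, reused as the SSE event id
  scope: string;  // permission scope the summary belongs to
  kind: string;   // event name from a fixed vocabulary, e.g. "dlq_growth"
  line: string;   // one-line summary text
}

// Permission filter applied on the backend, before anything is framed.
function visibleTo(
  summaries: GovernanceSummary[],
  allowedScopes: ReadonlySet<string>,
): GovernanceSummary[] {
  return summaries.filter((s) => allowedScopes.has(s.scope));
}

// SSE wire format: field lines, then a blank line terminating the event.
function toSseFrame(s: GovernanceSummary): string {
  // JSON.stringify escapes newlines, so the payload stays on one data line.
  const data = JSON.stringify({ scope: s.scope, line: s.line });
  return `id: ${s.cursor}\nevent: ${s.kind}\ndata: ${data}\n\n`;
}
```

A Nuxt server route would write these frames to a response with content type `text/event-stream`; that wiring is omitted here.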
3.4 Later scale (Phase 6+, profile-driven)
- NATS introduced as transport (not SoT) iff multi-host workers or cross-service consumers emerge. event_outbox stays SoT; an outbox-relay worker mirrors to NATS topics.
- WebSocket gateway iff a real bidirectional UI flow appears (collaborative editing, live form locking).
- Centrifugo iff connected-client count exceeds Nuxt server-route or custom WS capacity.
- Benthos in a bounded scope (external table mirroring), not as domain event source.
- OpenTelemetry collector + Jaeger / Grafana Tempo / SigNoz for worker fleet trace UI.
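
The outbox-relay idea in the first bullet (event_outbox stays SoT, a worker mirrors rows to the transport) can be sketched as a cursor-driven batch function. This is an assumption-laden illustration: the row shape, the `events.<type>` subject naming, and the `Transport` interface are hypothetical and are not a real NATS client API.

```typescript
// Sketch under assumptions: row shape, subject naming, and Transport are hypothetical.
interface OutboxRow {
  id: number; // monotonically increasing outbox position
  eventType: string;
  payload: string;
}

interface Transport {
  // Returns false when the broker did not accept the message.
  publish(subject: string, body: string): boolean;
}

// Publishes rows after `cursor` in id order and returns the new cursor.
// The cursor stops at the last row the transport accepted, so a failed row
// is retried on the next pass (at-least-once delivery).
function relayBatch(
  rows: OutboxRow[],
  cursor: number,
  transport: Transport,
  batchSize: number,
): number {
  const pending = rows
    .filter((row) => row.id > cursor)
    .sort((a, b) => a.id - b.id)
    .slice(0, batchSize);
  let next = cursor;
  for (const row of pending) {
    if (!transport.publish(`events.${row.eventType}`, row.payload)) break;
    next = row.id;
  }
  return next;
}
```

Because the relay only reads the outbox and advances a cursor, the transport can be replaced or replayed from any position without touching producers.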
3.5 Re-evaluation criteria
| Trigger | Action |
|---|---|
| Two or more VPS hosts running workers | Re-evaluate NATS as transport. |
| Outbox tail consistently >N minutes lag | Tune indexes; if still lagging, evaluate NATS or partitioning. |
| External tables mirror needed (Lark/Notion/...) | Evaluate Benthos for bounded mirror. |
| Connected real-time clients > ~1 k | Evaluate WS gateway / Centrifugo. |
| Workflow state machine soak shows insufficiency (>100 k active runs OR deterministic replay required) | Evaluate Temporal — full Council review. |
| AI agent fleet emerges with Python deps | Add Python worker class; share trace_id. |
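
The trigger table above can be read as a pure function from a deployment profile to the list of reviews that are due. The sketch below is hypothetical in its names and metric fields; the lag threshold is passed in by the caller because the table leaves N unspecified, and the 1,000-client figure is the table's approximate value.

```typescript
// Sketch under assumptions: metric names are hypothetical; thresholds mirror the table.
interface DeploymentProfile {
  workerHosts: number;
  outboxLagMinutes: number;
  externalMirrorNeeded: boolean;
  realtimeClients: number;
  activeWorkflowRuns: number;
  deterministicReplayRequired: boolean;
  pythonAgentFleet: boolean;
}

// Returns the actions whose trigger condition currently holds, in table order.
function dueReevaluations(p: DeploymentProfile, lagThresholdMinutes: number): string[] {
  const due: string[] = [];
  if (p.workerHosts >= 2) due.push("Re-evaluate NATS as transport");
  if (p.outboxLagMinutes > lagThresholdMinutes) {
    due.push("Tune indexes; if still lagging, evaluate NATS or partitioning");
  }
  if (p.externalMirrorNeeded) due.push("Evaluate Benthos for bounded mirror");
  if (p.realtimeClients > 1000) due.push("Evaluate WS gateway / Centrifugo");
  if (p.activeWorkflowRuns > 100000 || p.deterministicReplayRequired) {
    due.push("Evaluate Temporal (full Council review)");
  }
  if (p.pythonAgentFleet) due.push("Add Python worker class; share trace_id");
  return due;
}
```

Keeping the thresholds in one function (or, in the spirit of the document, in a PG config row that feeds it) makes each re-evaluation a measurable event rather than a judgment call.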
4. Decisions that should NOT be made yet
| Decision | Why not yet |
|---|---|
| Adopt or reject NATS finally. | Need Phase 1–4 metrics on outbox + LISTEN/NOTIFY behaviour. |
| Adopt or reject Temporal finally. | Need Phase 3 MOW state machine soak + Phase 6 scale hardening results. |
| Choose SSE vs WS vs Centrifugo for realtime. | Need Phase 5 client profile and a real bidirectional use case. |
| Adopt pg-boss / Graphile Worker as job queue substrate. | Would require Điều 45 §6.7 amendment — premature; native is provably enough. |
| Adopt Benthos for domain events. | Would create taxonomy outside event_type_registry — wait for a bounded mirror use case. |
| Adopt Camunda for human approval. | Would double-own with Điều 32 — premature and likely never. |
| Adopt Airflow. | Wrong paradigm — no current need. |
| Adopt Watermill. | Routing covered by event_subscription + job_queue class filter — likely never needed. |
| Adopt Directus realtime for UI. | Boundary violation — likely never. |
| Adopt Hasura subscriptions. | Boundary + ownership conflict — already rejected by master design. |
| Pick OTel backend (Jaeger vs Tempo vs SigNoz). | Defer to Phase 5+ when worker fleet justifies a UI. |
5. What this critique adds beyond the candidate strategy
- State-vocab fit as a first-class adoption criterion (rejects pg-boss/Graphile as substrate even though they look ideal).
- Config-first vs framework-first distinction (rejects Temporal/Camunda/Airflow at a deeper level than "later").
- SoT-drift watchpoint (every adopted tool must point back to a PG registry row).
- W3C trace_id shape adoption now so OTel attaches later without schema migration.
- Re-evaluation triggers (concrete metrics, not vibes).
- Directus realtime explicit reject — the candidate strategy didn't flag this and it is the easiest accidental boundary violation.
End critique.