KB-4B13

IU 4-Mothers Master Design Rev2 — WS7 OSS Candidate Strategy + Gate A/B (DRAFT 2026-05-27)

Revision 1

Master Design Rev2 — OSS Candidate Strategy (WS7)

Path: knowledge/dev/design/v0.6-iu-4mothers-event-foundation-rev2/05-oss-candidate-strategy-rev2.md

Status: DRAFT Rev2 (document-only). Companion to 00-master-design-rev2.md.

Date: 2026-05-27

Authority: Rev2 brief §15 (OSS labels + Gate A/B), §13 (Constitution matrix, Hiến pháp NT13 PG-first), §11 (No-double-ownership), §6 (Event 5-layer), §7.3 (≥9-state machine, Điều (Article) 45 §6.7).

Boundary: NO FINAL TOOL SELECTION. Per Rev2 §15 + §20: labels only; no tool pin, version, CI step, or dockerfile. Apply the 7 labels via Gates A + B (state-vocab fit + config-first fit). Apply Hiến pháp NT13 (PG-first / SoT-back PG) and Điều 7 (Assembly First: OSS as adapter, never owner).


§1. Method — labels, gates, and verdict shape

§1.1 The 7 labels (Rev2 §15)

| Label | Meaning | When applied |
|---|---|---|
| L1 confirmed_invariant | Already invariant; do not change | Pattern is universal contract (e.g. W3C trace shape) |
| L2 reject_as_core_owner | Do not let tool own core (queue/event/workflow/IU) | Tool would force a SoT outside PG |
| L3 reject_as_primary_substrate_now | Not primary substrate today; revisit by profile trigger | PG-native covers current scale |
| L4 defer_until_profile_trigger | Hold until metric/scale crosses a named threshold | Tool buys us nothing until X |
| L5 future_adapter_slot_preserved | Keep adapter slot (read-only / bounded mirror) | Likely needed later but bounded |
| L6 sandbox/reference_only | Read pattern from project; do not run in prod | Architectural inspiration only |
| L7 not_a_second_SoT | Must point back to PG registry row | Adopted as transport / projection only |

§1.2 The 2 gates (memory feedback-oss-tool-adoption-state-vocab-fit-and-config-first-test)

  • Gate A — state-vocab fit (Điều 45 §6.7 ≥9-state). Does the tool impose its own state vocabulary that conflicts with the 9-state floor in 02-step-state-machine-and-workflow-ui-design.md §3?
    • PASS = tool does not impose state vocab (e.g. transport) OR tool's vocab is aligned/extensible.
    • FAIL = tool's vocab is its identity (e.g. pg-boss created/active/completed/failed/expired) AND would replace the 9-state floor.
  • Gate B — config-first fit (Hiến pháp NT2/NT4). Do workflow / task / event definitions live in PG/registry, or in the tool's code/yaml?
    • PASS = tool reads definitions from PG/registry; tool's role is transport/runtime.
    • FAIL = tool requires definitions in its own DSL/yaml/code (workflow-as-code in the tool's repo, e.g. Temporal workflow code in Go/TS).

§1.3 Verdict shape

Verdict per tool: each tool gets a Gate A outcome and a Gate B outcome (pass/fail), and the two together give the verdict: A candidate (both gates pass), M candidate (one gate passes; borrow the pattern natively or use a bounded adapter), or R candidate (both gates fail).
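The default rule in §1.3 can be sketched as a small function. This is an illustrative sketch only: the names `Gate` and `default_verdict` are hypothetical, and treating PARTIAL as "not a pass" is an assumption, since §1.3 defines only pass and fail. The per-tool verdicts in §3 may be stricter than this default where the §2 pre-conditions apply.

```python
from enum import Enum


class Gate(Enum):
    PASS = "PASS"
    PARTIAL = "PARTIAL"  # used in §3; assumed here to count as "not a pass"
    FAIL = "FAIL"


def default_verdict(gate_a: Gate, gate_b: Gate) -> str:
    """Return 'A', 'M', or 'R' from the two gate outcomes."""
    passes = [gate_a, gate_b].count(Gate.PASS)
    if passes == 2:
        return "A"  # both gates pass
    if passes == 1:
        return "M"  # one gate passes: borrow pattern or bounded adapter
    return "R"      # neither gate passes
```

For example, `default_verdict(Gate.PASS, Gate.PARTIAL)` gives `"M"`, which matches the Benthos row in §3.5.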

Labels applied: 1..N labels from the 7-set, justified by the gates.

Per Rev2 §15: no final tool pick, no version pin, no CI step. Sentinel: this document mentions zero version numbers, zero dockerfile lines, zero CI steps.


§2. Default adoption / rejection — applied universally

These are the binding pre-conditions before any tool can be considered:

  1. Hiến pháp NT13 — PG-first. No tool can be the SoT for any artifact (workflow def / event type / state machine / IU body). PG owns. Tool may project.
  2. Điều 7 — Assembly First. OSS as adapter, never owner. Adoption requires a SoT-pointback row in PG.
  3. Điều 45 §6.7 — ≥9-state. Adopted tool must not collapse the 9-state floor.
  4. Điều 45 §11.5 — Executor boundary. Adopted tool must not blur executor/orchestrator/queue/event boundaries.
  5. Điều 0-G — Birth registry. New tool category must be registered (event type / executor class / adapter class) before run.
  6. No cross-IU vector pollution. Adapter mirroring IU events must respect "1 IU → ≥1 point" (per 03-event-5layer-… §3.4).
  7. Reversibility. Adoption must have a documented exit path (Điều 30).

§3. Per-tool verdicts

§3.1 pg-boss / Graphile Worker

Role inspected: PG-backed job queue libraries (worker scheduling, retry, DLQ).

  • Gate A (state-vocab): FAIL — pg-boss imposes created / active / completed / failed / expired / cancelled / retry, distinct from our 9-state floor. Adoption as substrate would re-implement work_state_machine outside Điều 45 §6.7.
  • Gate B (config-first): FAIL as substrate — pg-boss owns table schemas and lifecycle. PARTIAL as embedded library (could be called as a wrapper) but still encodes its own table model.

Verdict: R candidate as substrate; M candidate as borrow-pattern.

Labels: L2 reject_as_core_owner + L5 future_adapter_slot_preserved (only if we map their lifecycle back to Điều 45 lifecycle in a bounded adapter) + L7 not_a_second_SoT.

Decision in this design: Reject as substrate. Keep job_queue PG-native owned by Điều 45. Permitted use: borrow patterns (priority + lease + backoff) for our own design, with native implementation.
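As a rough illustration of what "borrow patterns, native implementation" could mean for lease and backoff, the sketch below shows the two calculations in isolation. The function names, the base delay, and the cap are all hypothetical and are not taken from pg-boss, Graphile Worker, or Điều 45; a real implementation would live next to the PG-native job_queue.

```python
from datetime import datetime, timedelta, timezone


def backoff_delay(attempt: int, base_s: float = 2.0, cap_s: float = 300.0) -> float:
    """Capped exponential backoff: base, 2*base, 4*base, ... up to cap_s."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    return min(cap_s, base_s * (2 ** (attempt - 1)))


def lease_expired(leased_at: datetime, lease_s: float, now: datetime) -> bool:
    """A job whose lease has run out may be picked up by another worker."""
    return now >= leased_at + timedelta(seconds=lease_s)
```

With the assumed defaults, attempts 1, 2, 3 wait 2, 4, and 8 seconds, and the delay never exceeds 300 seconds.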

Profile trigger to re-evaluate: None on the horizon — Điều 45 native covers current and projected scale. If multi-host worker fleet grows >100 nodes and lease contention becomes a measurable issue, revisit per 06-… §S5/§S6.

SoT-pointback requirement: Not applicable (rejected as substrate).

§3.2 Temporal

Role inspected: Workflow-as-code orchestrator with durable execution, history replay, signals.

  • Gate A (state-vocab): FAIL — Temporal's workflow + activity model imposes its own lifecycle vocabulary; integrating would collapse the 9-state floor or force a translation layer.
  • Gate B (config-first): FAIL. Temporal workflows are code (Go/TS/Java/Python source files that live in the worker codebase, not in a registry). Our requirement is config-first via PG registry. Re-implementing every workflow as Temporal code violates Assembly First.

Verdict: R candidate as primary orchestrator; defer for evaluation post-scale.

Labels: L2 reject_as_core_owner + L3 reject_as_primary_substrate_now + L4 defer_until_profile_trigger.

Decision in this design: Reject as MOW core. MOW orchestration stays PG/registry-driven.

Profile trigger to re-evaluate (per Rev2 OD6): Post-Phase 6, IF (a) MOW process saturation observed in production AND (b) >100k long-running workflows active simultaneously AND (c) PG-native scheduling cannot meet p95 SLA, THEN convene Council Round (Điều 37) to evaluate bounded Temporal as deep-engine substrate (workflows still declared in PG registry; Temporal acts as execution engine via adapter). Even then, workflow definitions stay in PG; only execution semantics borrow.
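The trigger above is a strict conjunction: all conditions must hold before a Council Round is convened. A minimal sketch, assuming a hypothetical `MowProfile` snapshot whose field names are invented for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MowProfile:
    # All field names are hypothetical; they mirror the conditions in the text.
    phase_6_complete: bool
    saturation_observed_in_prod: bool
    active_long_running_workflows: int
    pg_native_meets_p95_sla: bool


def temporal_reeval_due(p: MowProfile) -> bool:
    """True only when every condition of the OD6 trigger holds."""
    return (
        p.phase_6_complete
        and p.saturation_observed_in_prod              # (a)
        and p.active_long_running_workflows > 100_000  # (b)
        and not p.pg_native_meets_p95_sla              # (c)
    )
```

A result of `True` means only that evaluation is due; per the text, workflow definitions stay in PG even if Temporal is later adopted as an execution engine.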

SoT-pointback requirement: If ever adopted, every Temporal workflow execution row must reference workflow_def.workflow_def_id and workflow_run.workflow_run_id — these stay PG SoT.

§3.3 Camunda

Role inspected: BPMN engine; visual workflow + approval / human-task patterns.

  • Gate A (state-vocab): FAIL — BPMN state model imposes its own task lifecycle.
  • Gate B (config-first): PARTIAL — BPMN XML is config, but it's Camunda's config, not PG. Adoption would create a second SoT.

Verdict: R candidate as approval engine (overlaps Điều 32); M candidate as reference for BPMN patterns.

Labels: L2 reject_as_core_owner (approval = Điều 32) + L6 sandbox/reference_only.

Decision in this design: Reject as approval engine (Điều 32 owns approval). Permitted: reference BPMN patterns for our workflow visualization / proposal-mode UI / decision-gate semantics.

Profile trigger: None — Điều 32 covers approval natively.

SoT-pointback requirement: N/A (rejected as engine).

§3.4 Airflow

Role inspected: Batch / data pipeline workflow scheduler (Python DAGs).

  • Gate A (state-vocab): FAIL — Airflow's DAG / task instance states differ from our 9-state floor.
  • Gate B (config-first): FAIL. DAGs are Python files in the Airflow deployment's DAG folder, and run state lives in Airflow's own metadata DB, not in the PG SoT.

Verdict: R candidate as MOW core; M candidate as batch / data-flow reference for non-MOW workloads.

Labels: L2 reject_as_core_owner + L6 sandbox/reference_only (batch / data workflow patterns).

Decision in this design: Reject as MOW core. MOW is not a data pipeline scheduler. Permitted: reference Airflow for our future batch / data-projection workloads if they appear separately from MOW (e.g. analytical aggregations not bound to IU).

Profile trigger: N/A for MOW. For batch / data side, re-evaluate when batch workload volume exceeds what PG pg_cron + DOT functions can serve.

SoT-pointback requirement: If adopted for batch side later, DAG identity rows must reference PG-side batch_job_registry (paper).

§3.5 Benthos / Redpanda Connect

Role inspected: Stream processor with declarative pipelines (YAML config), mature for CDC + transform + sink fan-out.

  • Gate A (state-vocab): PASS — Benthos is transport / transform; does not impose work_state vocabulary.
  • Gate B (config-first): PARTIAL. Benthos pipelines are config-driven YAML, but the YAML lives in the Benthos deployment's own config files, not in PG. As a bounded adapter this is acceptable IF the adapter only sinks PG-side events to external systems (read-only mirror), with the source-of-truth event_type registered in PG.

Verdict: M candidate as bounded external mirror (CDC sink).

Labels: L3 reject_as_primary_substrate_now (PG outbox covers current scale) + L5 future_adapter_slot_preserved (bounded external mirror) + L7 not_a_second_SoT.

Decision in this design: Adapter slot preserved. When external mirror is needed (e.g. analytical warehouse, partner stream, future Kafka federation), Benthos is a candidate to consume from event_outbox and sink externally. Pipeline YAML must reference event_type_registry and not duplicate event schema.

Profile trigger (per Rev2 OD5): When p95 lag of PG-native outbox-to-external mirror exceeds threshold OR when external system mandates a streaming integration that PG-native cannot match.

SoT-pointback requirement: Every Benthos pipeline row references event_type_registry.event_type and includes trace_id propagation.
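One way to picture the pointback rule is a pre-deploy check over a pipeline descriptor. The sketch below is an assumption-heavy illustration: the descriptor keys `event_type`, `schema`, and `propagate_trace_id` are hypothetical and are not Benthos configuration fields; they stand for whatever metadata the adapter row would carry.

```python
def pipeline_violations(pipeline: dict, registered_event_types: frozenset) -> list:
    """Return human-readable problems; an empty list means the check passes."""
    problems = []
    event_type = pipeline.get("event_type")
    if event_type is None:
        problems.append("missing event_type reference")
    elif event_type not in registered_event_types:
        problems.append(f"event_type {event_type!r} not in event_type_registry")
    if "schema" in pipeline:
        # The registry owns the schema; a copy here would be a second SoT.
        problems.append("inline schema duplicates the registry")
    if not pipeline.get("propagate_trace_id", False):
        problems.append("trace_id propagation not enabled")
    return problems
```

A descriptor such as `{"event_type": "iu.updated", "propagate_trace_id": True}` passes when `"iu.updated"` (an invented example name) is registered.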

§3.6 NATS

Role inspected: Lightweight pub/sub + JetStream durable storage; transport layer.

  • Gate A (state-vocab): PASS — NATS is transport; no work_state vocab imposition.
  • Gate B (config-first): PASS — Subjects can be declared in PG registry and NATS consumed as transport. No workflow definition required in NATS.

Verdict: A candidate as future transport layer (NOT as event SoT).

Labels: L4 defer_until_profile_trigger (multi-host / fan-out scale) + L7 not_a_second_SoT.

Decision in this design: Defer. PG-native event_outbox + event_subscription covers current single-VPS scale. NATS adoption deferred until multi-host workers / multi-region fan-out becomes a real load.

Profile trigger (per Rev2 OD5): Multi-host worker fleet (>2 hosts), per-event fan-out >10 subscribers, or per-event throughput overwhelming PG NOTIFY + polling.

SoT-pointback requirement: NATS subjects map 1:1 to event_type_registry rows. Every NATS message carries event_outbox_id so subscribers can correlate back to canonical ledger.
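A minimal sketch of this requirement, independent of any NATS client library: the subject is derived from a registered event type, and the message body carries event_outbox_id for correlation. The function names, the JSON envelope layout, and the choice to use the registry key itself as the subject are assumptions made for illustration.

```python
import json


def subject_for(event_type: str, registered_event_types: frozenset) -> str:
    """Map an event type to its transport subject, 1:1.

    Using the registry key itself as the subject keeps the mapping trivially
    one-to-one; unregistered types are refused rather than auto-created.
    """
    if event_type not in registered_event_types:
        raise KeyError(f"event_type {event_type!r} is not in event_type_registry")
    return event_type


def build_message(event_outbox_id: int, event_type: str, trace_id: str,
                  payload: dict, registered_event_types: frozenset) -> tuple:
    """Return (subject, body) ready to hand to a transport client."""
    subject = subject_for(event_type, registered_event_types)
    envelope = {
        "event_outbox_id": event_outbox_id,  # pointback to the canonical ledger
        "event_type": event_type,
        "trace_id": trace_id,
        "payload": payload,
    }
    return subject, json.dumps(envelope, sort_keys=True).encode("utf-8")
```

A subscriber can then read `event_outbox_id` from the body and look up the canonical row in PG instead of trusting the transported copy.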

§3.7 Redis Streams

Role inspected: Stream data structure in Redis; consumer group semantics for distributed processing.

  • Gate A (state-vocab): PASS — transport.
  • Gate B (config-first): PASS. Same as NATS: the stream key equals event_type and is declared in PG.

Verdict: A candidate as transport — but only if Redis is already in ops mix for other reasons.

Labels: L4 defer_until_profile_trigger + L7 not_a_second_SoT.

Decision in this design: Defer. Adopt only if Redis is already operationally justified (caching layer, session store) AND profile trigger met. Otherwise NATS is preferred for clean dedicated transport (single tool).

Profile trigger: Same as NATS, plus "Redis already in ops".

SoT-pointback requirement: Same as NATS.

§3.8 Hasura subscriptions

Role inspected: GraphQL subscription engine over PG; client-facing realtime via WebSocket.

  • Gate A (state-vocab): PASS — does not impose work_state vocab.
  • Gate B (config-first): FAIL — Hasura would own query/subscription patterns and direct PG access from clients; conflicts with Điều 33 v2.1 (3-layer: Nuxt → Directus/backend → PG) and Điều 28 (Nuxt render shell). Hasura connecting clients directly to PG bypasses our backend gateway.

Verdict: R candidate as core realtime owner.

Labels: L2 reject_as_core_owner (boundary violation app plane) + L6 sandbox/reference_only.

Decision in this design: Reject as realtime core. Backend realtime gateway owns the boundary (03-event-5layer-… §7). Permitted: reference Hasura for understanding subscription-over-PG patterns.

Profile trigger: N/A.

SoT-pointback requirement: N/A.

§3.9 Directus realtime

Role inspected: Directus built-in realtime via WebSocket on Directus collections.

  • Gate A (state-vocab): PASS — transport.
  • Gate B (config-first): FAIL. Directus realtime fans out collection events directly to clients, bypassing our backend gateway permission/relevance filter. Boundary violation (app event plane).

Verdict: R candidate as app event plane; M candidate as admin diagnostics if scoped.

Labels: L2 reject_as_core_owner (boundary violation) + L6 sandbox/reference_only (admin diagnostics if approved).

Decision in this design: Reject as app event plane. Permitted: Directus realtime in admin-only console for diagnostics, scoped to admin role, never reaches app clients. Sentinel: no app-facing route subscribes to Directus realtime.

Profile trigger: N/A.

SoT-pointback requirement: N/A.

§3.10 Watermill

Role inspected: Go library for event-driven applications (pub/sub abstraction over multiple brokers).

  • Gate A (state-vocab): PASS — abstraction layer.
  • Gate B (config-first): PARTIAL — Watermill is a library used in worker code; doesn't itself define workflow / event in code, but its handler functions are Go code. Not config-first for workflow.

Verdict: R candidate as substrate; M candidate as code reference.

Labels: L3 reject_as_primary_substrate_now + L6 sandbox/reference_only.

Decision in this design: Not primary substrate. PG-native covers needs. Permitted: reference Watermill design (router / middleware patterns) for our own worker library if we build one.

Profile trigger: N/A.

SoT-pointback requirement: N/A.

§3.11 Centrifugo

Role inspected: Real-time messaging server (WebSocket / SSE / SockJS / gRPC) with horizontal scaling.

  • Gate A (state-vocab): PASS — transport only.
  • Gate B (config-first): PASS — channels/permissions configured via Centrifugo config + JWT, can be derived from PG. Workflow definitions remain in PG.

Verdict: A candidate as future realtime gateway implementation IF concurrent client load justifies.

Labels: L4 defer_until_profile_trigger (>1k concurrent realtime clients) + L5 future_adapter_slot_preserved (gateway adapter slot).

Decision in this design: Defer. Realtime gateway starts as native (Nuxt SSE shell → Node backend gateway service per 03-event-5layer-… §7.4; OD4 default kept). Preserve Centrifugo adapter slot for >1k concurrent clients.

Profile trigger (per Rev2 OD4): Concurrent realtime clients >1k OR multi-region distribution.

SoT-pointback requirement: Topic identity rows in realtime_gateway_topic_registry (PG SoT) map 1:1 to Centrifugo channels.

§3.12 W3C traceparent shape

Role inspected: Distributed tracing header format (W3C Trace Context spec).

  • Gate A: PASS — it's an invariant; not a state-vocab.
  • Gate B: PASS — adoption is shape-only; no engine, no DSL.

Verdict: L1 confirmed_invariant — adopt NOW (per memory feedback and Rev2 §15 L1).

Labels: L1 confirmed_invariant.

Decision in this design: Adopt NOW. Every event_outbox / job_queue / step_run carries trace_id (32 hex) + parent_span_id (16 hex) + correlation_id (uuid) per 03-event-5layer-… §5.6. Later OTel ingestion is zero-migration.
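The shape described above lines up with the version 00 `traceparent` header from the W3C Trace Context spec: `version-traceid-parentid-flags`, all lowercase hex. The sketch below generates and validates that shape; the helper names are hypothetical, and only version 00 headers are handled.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_trace_ids() -> tuple:
    """Return a fresh (trace_id, parent_span_id) pair as lowercase hex."""
    return secrets.token_hex(16), secrets.token_hex(8)


def format_traceparent(trace_id: str, parent_span_id: str, sampled: bool = True) -> str:
    return f"00-{trace_id}-{parent_span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str) -> dict:
    """Validate a version 00 style header and split it into its fields."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError("malformed traceparent")
    version, trace_id, parent_span_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace-id / parent-id values.
    if version == "ff" or trace_id == "0" * 32 or parent_span_id == "0" * 16:
        raise ValueError("invalid traceparent value")
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }
```

Storing `trace_id` and `parent_span_id` in exactly these widths is what keeps a later OTel export a pure read of existing columns.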

Profile trigger: N/A — adopt unconditionally.

SoT-pointback requirement: N/A (shape, not data).

§3.13 OpenTelemetry collector + Jaeger

Role inspected: OTel collector (OTLP receiver / processor / exporter) + Jaeger backend for trace UI.

  • Gate A: PASS — transport.
  • Gate B: PASS. Collector config is YAML and can be derived from PG (export pipelines declared in PG).

Verdict: A candidate after W3C trace_id ubiquity.

Labels: L4 defer_until_profile_trigger (after trace_id is universal across producers/workers; ~Phase 3+).

Decision in this design: Defer. Adopt OTel collector once trace_id is consistently present in events + jobs + heartbeats. Jaeger as trace UI is a Phase 3+ adoption candidate.

Profile trigger: trace_id present in ≥95% of producers/workers AND need for distributed trace UI grows (e.g. multi-service architecture).
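The coverage half of this trigger could be checked roughly as follows. This sketch assumes a hypothetical inventory that maps each producer or worker name to whether it emits trace_id; how that inventory is gathered is outside the scope of this document.

```python
def trace_id_coverage_met(emits_trace_id: dict, threshold_pct: int = 95) -> bool:
    """True when at least threshold_pct percent of producers/workers emit trace_id.

    Integer arithmetic avoids floating-point edge cases right at the threshold.
    """
    total = len(emits_trace_id)
    if total == 0:
        return False  # nothing measured yet, so the trigger cannot be met
    covered = sum(1 for emits in emits_trace_id.values() if emits)
    return covered * 100 >= threshold_pct * total
```

With 20 producers, 19 emitting trace_id is exactly 95% and meets the threshold; 18 does not.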

SoT-pointback requirement: Trace ID lives in PG audit rows (event_outbox.trace_id); OTel is consumer.

§3.14 SSE / WebSocket / Centrifugo (realtime transports)

Native SSE shell is the start; WebSocket / Centrifugo are upgrade paths. Already covered via §3.11. No final tool pick.

§3.15 Jaeger (covered §3.13 as OTel sink).


§4. Aggregate summary table

| Tool | Gate A | Gate B | Verdict | Labels (Rev2 §15) | Decision now | Profile trigger |
|---|---|---|---|---|---|---|
| pg-boss / Graphile Worker | FAIL | FAIL | R | L2 + L5 + L7 | Reject as substrate | Multi-host worker contention |
| Temporal | FAIL | FAIL | R | L2 + L3 + L4 | Reject as MOW core | Post-Phase 6 saturation |
| Camunda | FAIL | PARTIAL | R | L2 + L6 | Reject as approval engine | None |
| Airflow | FAIL | FAIL | R | L2 + L6 | Reject as MOW core | Batch / data side revisit |
| Benthos / Redpanda Connect | PASS | PARTIAL | M | L3 + L5 + L7 | Adapter slot preserved | External mirror need |
| NATS | PASS | PASS | A | L4 + L7 | Defer | Multi-host / >10 fan-out |
| Redis Streams | PASS | PASS | A | L4 + L7 | Defer | Multi-host + Redis-in-ops |
| Hasura subs | PASS | FAIL | R | L2 + L6 | Reject as core | None |
| Directus realtime | PASS | FAIL | R | L2 + L6 | Reject as app plane | None |
| Watermill | PASS | PARTIAL | R | L3 + L6 | Not substrate; ref only | N/A |
| Centrifugo | PASS | PASS | A | L4 + L5 | Defer; gateway slot kept | >1k concurrent clients |
| W3C trace_id | PASS | PASS | A1 | L1 | Adopt NOW | N/A |
| OpenTelemetry collector | PASS | PASS | A | L4 | Defer | trace_id ubiquity |
| Jaeger | PASS | PASS | A | L4 | Defer | OTel adoption |

§5. Invariants / sentinels for OSS adoption

For any tool category considered later, this design imposes the following sentinel checks before adoption:

  1. Gate A test — explicit verdict on state-vocab fit (no implicit assumption).
  2. Gate B test — explicit verdict on config-first fit.
  3. SoT-pointback row — adopted tool's identity table contains a column pointing to PG registry primary key. Schema-level enforcement (FK or trigger).
  4. Reversibility — documented exit path before adoption.
  5. No double-ownership: the tool's role must not overlap any concern owned by Điều 32 / 35 / 38 / 39 / 45.
  6. Birth registry — new tool category registered in external_tool_registry (paper) before run.
  7. Heartbeat: any process-class tool (worker / gateway) has a heartbeat caller (Điều 45 §15.5).
  8. No raw event noise — any UI-exposed tool surface (e.g. Jaeger) is gated behind backend permission filter.
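Sentinel 3 asks for schema-level enforcement of the pointback. The sketch below shows the foreign-key variant, using Python's built-in sqlite3 purely as a stand-in so the example runs anywhere; the production registry is PG, where the same REFERENCES clause applies. The `adapter_channel` table and the event type value are hypothetical.

```python
import sqlite3


def make_registry_db() -> sqlite3.Connection:
    """In-memory stand-in: a registry table plus an adapter table that points back."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(
        """
        CREATE TABLE event_type_registry (
            event_type TEXT PRIMARY KEY
        );
        CREATE TABLE adapter_channel (           -- hypothetical adapter identity table
            channel    TEXT PRIMARY KEY,
            event_type TEXT NOT NULL
                       REFERENCES event_type_registry (event_type)
        );
        """
    )
    return conn
```

Inserting an `adapter_channel` row whose `event_type` is not in the registry raises `sqlite3.IntegrityError`, so an adapter identity cannot exist without its registry row.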

§6. Cross-references and acceptance

Cross-refs:

  • State-machine 9-floor (Gate A test target) → 02-step-state-machine-and-workflow-ui-design.md §3.
  • Event 5-layer substrate (Gate B test target) → 03-event-5layer-realtime-dlq-design.md §§3-7.
  • IU / 4 Mothers boundary (no tool can replace) → 04-iu-centered-4mothers-binding-design.md §§3-4.
  • Open Decisions OD4 (realtime), OD5 (CDC), OD6 (Temporal/Camunda re-eval) → 06-open-decisions-and-readiness.md.

Acceptance:

A1. Every tool listed in Rev2 §15 has a verdict row in §4 summary table.

A2. Every verdict carries Gate A + Gate B test outcomes, labels from the 7-set, and a profile trigger if deferred.

A3. No final tool selection / no pinned version / no CI step / no dockerfile.

A4. W3C trace_id is the only L1 confirmed_invariant; adopt NOW.

A5. Sentinel rules §5 capture binding requirements for any future adoption.

End WS7 design.
