IU 4-Mothers Master Design Rev2 — WS7 OSS Candidate Strategy + Gate A/B (DRAFT 2026-05-27)
Master Design Rev2 — OSS Candidate Strategy (WS7)
Path: knowledge/dev/design/v0.6-iu-4mothers-event-foundation-rev2/05-oss-candidate-strategy-rev2.md
Status: DRAFT Rev2 (document-only). Companion to 00-master-design-rev2.md.
Date: 2026-05-27
Authority: Rev2 brief §15 (OSS labels + Gate A/B), §13 (Constitution matrix, Hiến pháp NT13 PG-first), §11 (No-double-ownership), §6 (Event 5-layer), §7.3 (≥9-state machine, Điều 45 §6.7).
Boundary: NO FINAL TOOL SELECTION. Per Rev2 §15 + §20: labels only, no tool pin / version / CI step / dockerfile. Apply the 7 labels via Gates A + B (state-vocab fit + config-first fit). Apply Hiến pháp NT13 (PG-first / SoT-back PG) and Điều 7 (Assembly First: OSS as adapter, never owner).
§1. Method — labels, gates, and verdict shape
§1.1 The 7 labels (Rev2 §15)
| Label | Meaning | When applied |
|---|---|---|
| L1 confirmed_invariant | Already invariant; do not change | Pattern is universal contract (e.g. W3C trace shape) |
| L2 reject_as_core_owner | Do not let tool own core (queue/event/workflow/IU) | Tool would force a SoT outside PG |
| L3 reject_as_primary_substrate_now | Not primary substrate today; revisit by profile trigger | PG-native covers current scale |
| L4 defer_until_profile_trigger | Hold until metric/scale crosses a named threshold | Tool buys us nothing until X |
| L5 future_adapter_slot_preserved | Keep adapter slot (read-only / bounded mirror) | Likely needed later but bounded |
| L6 sandbox/reference_only | Read pattern from project; do not run in prod | Architectural inspiration only |
| L7 not_a_second_SoT | Must point back to PG registry row | Adopted as transport / projection only |
§1.2 The 2 gates (memory feedback-oss-tool-adoption-state-vocab-fit-and-config-first-test)
- Gate A — state-vocab fit (Điều 45 §6.7 ≥9-state). Does the tool impose its own state vocabulary that conflicts with the 9-state floor in 02-step-state-machine-and-workflow-ui-design.md §3?
  - PASS = tool does not impose state vocab (e.g. transport) OR tool's vocab is aligned/extensible.
  - FAIL = tool's vocab is its identity (e.g. pg-boss created/active/completed/failed/expired) AND would replace the 9-state floor.
- Gate B — config-first fit (Hiến pháp NT2/NT4). Do workflow / task / event definitions live in PG/registry, or in the tool's code/yaml?
  - PASS = tool reads definitions from PG/registry; tool's role is transport/runtime.
  - FAIL = tool requires definitions in its own DSL/yaml/code (workflow-as-code in the tool's repo, e.g. Temporal workflow code in Go/TS).
§1.3 Verdict shape
Verdict per tool: Gates: A pass/fail + B pass/fail → Verdict: A candidate (both pass) / M candidate (one pass — borrow pattern natively or bounded adapter) / R candidate (both fail).
Labels applied: 1..N labels from the 7-set, justified by the gates.
Per Rev2 §15: no final tool pick, no version pin, no CI step. Sentinel: this document mentions zero version numbers, zero dockerfile lines, zero CI steps.
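The verdict shape above can be sketched as a small function. This is a minimal illustration, not part of the Rev2 brief: the `Gate` enum and `verdict` function are hypothetical names, and it assumes that a PARTIAL outcome (used in §3) does not count as a full pass.

```python
from enum import Enum


class Gate(Enum):
    PASS = "PASS"
    PARTIAL = "PARTIAL"
    FAIL = "FAIL"


def verdict(gate_a: Gate, gate_b: Gate) -> str:
    """Return 'A', 'M', or 'R' from the two gate outcomes."""
    # Assumption: only a full PASS counts as a pass; PARTIAL does not.
    passes = sum(1 for g in (gate_a, gate_b) if g is Gate.PASS)
    return {2: "A", 1: "M", 0: "R"}[passes]


# Gate outcomes taken from the §4 summary table.
print(verdict(Gate.PASS, Gate.PASS))     # NATS
print(verdict(Gate.PASS, Gate.PARTIAL))  # Benthos / Redpanda Connect
print(verdict(Gate.FAIL, Gate.FAIL))     # Temporal
```

The per-tool sections qualify each verdict by role (for example, Hasura is recorded as an R candidate as core realtime owner despite passing Gate A), so this mechanical rule is only a starting point for the written verdict.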
§2. Default adoption / rejection — applied universally
These are the binding pre-conditions before any tool can be considered:
- Hiến pháp NT13 — PG-first. No tool can be the SoT for any artifact (workflow def / event type / state machine / IU body). PG owns. Tool may project.
- Điều 7 — Assembly First. OSS as adapter, never owner. Adoption requires a SoT-pointback row in PG.
- Điều 45 §6.7 — ≥9-state. Adopted tool must not collapse the 9-state floor.
- Điều 45 §11.5 — Executor boundary. Adopted tool must not blur executor/orchestrator/queue/event boundaries.
- Điều 0-G — Birth registry. New tool category must be registered (event type / executor class / adapter class) before run.
- No cross-IU vector pollution. Adapter mirroring IU events must respect "1 IU → ≥1 point" (per 03-event-5layer-… §3.4).
- Reversibility. Adoption must have a documented exit path (Điều 30).
§3. Per-tool verdicts
§3.1 pg-boss / Graphile Worker
Role inspected: PG-backed job queue libraries (worker scheduling, retry, DLQ).
- Gate A (state-vocab): FAIL — pg-boss imposes created / active / completed / failed / expired / cancelled / retry, distinct from our 9-state floor. Adoption as substrate would re-implement work_state_machine outside Điều 45 §6.7.
- Gate B (config-first): FAIL as substrate — pg-boss owns table schemas and lifecycle. PARTIAL as embedded library (could be called as a wrapper) but still encodes its own table model.
Verdict: R candidate as substrate; M candidate as borrow-pattern.
Labels: L2 reject_as_core_owner + L5 future_adapter_slot_preserved (only if we map their lifecycle back to Điều 45 lifecycle in a bounded adapter) + L7 not_a_second_SoT.
Decision in this design: Reject as substrate. Keep job_queue PG-native owned by Điều 45. Permitted use: borrow patterns (priority + lease + backoff) for our own design, with native implementation.
Profile trigger to re-evaluate: None on the horizon — Điều 45 native covers current and projected scale. If multi-host worker fleet grows >100 nodes and lease contention becomes a measurable issue, revisit per 06-… §S5/§S6.
SoT-pointback requirement: Not applicable (rejected as substrate).
§3.2 Temporal
Role inspected: Workflow-as-code orchestrator with durable execution, history replay, signals.
- Gate A (state-vocab): FAIL — Temporal's workflow + activity model imposes its own lifecycle vocabulary; integrating would collapse the 9-state floor or force a translation layer.
- Gate B (config-first): FAIL — Temporal workflows are code (Go/TS/Java/Python files in the worker codebase), not registry rows. Our requirement is config-first via PG registry. Re-implementing every workflow as Temporal code violates Assembly First.
Verdict: R candidate as primary orchestrator; defer for evaluation post-scale.
Labels: L2 reject_as_core_owner + L3 reject_as_primary_substrate_now + L4 defer_until_profile_trigger.
Decision in this design: Reject as MOW core. MOW orchestration stays PG/registry-driven.
Profile trigger to re-evaluate (per Rev2 OD6): Post-Phase 6, IF (a) MOW process saturation observed in production AND (b) >100k long-running workflows active simultaneously AND (c) PG-native scheduling cannot meet p95 SLA, THEN convene Council Round (Điều 37) to evaluate bounded Temporal as deep-engine substrate (workflows still declared in PG registry; Temporal acts as execution engine via adapter). Even then, workflow definitions stay in PG; only execution semantics borrow.
SoT-pointback requirement: If ever adopted, every Temporal workflow execution row must reference workflow_def.workflow_def_id and workflow_run.workflow_run_id — these stay PG SoT.
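The Temporal profile trigger above is a conjunction of conditions, which can be sketched as a predicate. This is an illustration only: the `MowProfile` fields and the function name are hypothetical, and how each signal is measured is left to 06-open-decisions-and-readiness.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MowProfile:
    # Hypothetical field names; each mirrors one clause of the trigger.
    past_phase_6: bool
    process_saturation_observed: bool    # clause (a)
    active_long_running_workflows: int   # clause (b)
    pg_native_meets_p95_sla: bool        # clause (c), negated in the trigger


def temporal_reevaluation_due(p: MowProfile) -> bool:
    """True only when every clause of the trigger holds at once."""
    return (
        p.past_phase_6
        and p.process_saturation_observed
        and p.active_long_running_workflows > 100_000
        and not p.pg_native_meets_p95_sla
    )
```

A true result only means a Council Round should be convened; it does not select Temporal.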
§3.3 Camunda
Role inspected: BPMN engine; visual workflow + approval / human-task patterns.
- Gate A (state-vocab): FAIL — BPMN state model imposes its own task lifecycle.
- Gate B (config-first): PARTIAL — BPMN XML is config, but it's Camunda's config, not PG. Adoption would create a second SoT.
Verdict: R candidate as approval engine (overlaps Điều 32); M candidate as reference for BPMN patterns.
Labels: L2 reject_as_core_owner (approval = Điều 32) + L6 sandbox/reference_only.
Decision in this design: Reject as approval engine (Điều 32 owns approval). Permitted: reference BPMN patterns for our workflow visualization / proposal-mode UI / decision-gate semantics.
Profile trigger: None — Điều 32 covers approval natively.
SoT-pointback requirement: N/A (rejected as engine).
§3.4 Airflow
Role inspected: Batch / data pipeline workflow scheduler (Python DAGs).
- Gate A (state-vocab): FAIL — Airflow's DAG / task instance states differ from our 9-state floor.
- Gate B (config-first): FAIL — DAGs are Python files in Airflow's DAG folder, and their configuration and run state live in Airflow's metadata DB, not PG SoT.
Verdict: R candidate as MOW core; M candidate as batch / data-flow reference for non-MOW workloads.
Labels: L2 reject_as_core_owner + L6 sandbox/reference_only (batch / data workflow patterns).
Decision in this design: Reject as MOW core. MOW is not a data pipeline scheduler. Permitted: reference Airflow for our future batch / data-projection workloads if they appear separately from MOW (e.g. analytical aggregations not bound to IU).
Profile trigger: N/A for MOW. For batch / data side, re-evaluate when batch workload volume exceeds what PG pg_cron + DOT functions can serve.
SoT-pointback requirement: If adopted for batch side later, DAG identity rows must reference PG-side batch_job_registry (paper).
§3.5 Benthos / Redpanda Connect
Role inspected: Stream processor with declarative pipelines (YAML config), mature for CDC + transform + sink fan-out.
- Gate A (state-vocab): PASS — Benthos is transport / transform; does not impose work_state vocabulary.
- Gate B (config-first): PARTIAL — Benthos pipelines are YAML and config-driven, but the YAML lives in the Benthos deployment's own config files, not in PG. As a bounded adapter this is acceptable IF the adapter only sinks PG-side events to external systems (read-only mirror), with the source-of-truth event_type registered in PG.
Verdict: M candidate as bounded external mirror (CDC sink).
Labels: L3 reject_as_primary_substrate_now (PG outbox covers current scale) + L5 future_adapter_slot_preserved (bounded external mirror) + L7 not_a_second_SoT.
Decision in this design: Adapter slot preserved. When external mirror is needed (e.g. analytical warehouse, partner stream, future Kafka federation), Benthos is a candidate to consume from event_outbox and sink externally. Pipeline YAML must reference event_type_registry and not duplicate event schema.
Profile trigger (per Rev2 OD5): When p95 lag of PG-native outbox-to-external mirror exceeds threshold OR when external system mandates a streaming integration that PG-native cannot match.
SoT-pointback requirement: Every Benthos pipeline row references event_type_registry.event_type and includes trace_id propagation.
§3.6 NATS
Role inspected: Lightweight pub/sub + JetStream durable storage; transport layer.
- Gate A (state-vocab): PASS — NATS is transport; no work_state vocab imposition.
- Gate B (config-first): PASS — Subjects can be declared in PG registry and NATS consumed as transport. No workflow definition required in NATS.
Verdict: A candidate as future transport layer (NOT as event SoT).
Labels: L4 defer_until_profile_trigger (multi-host / fan-out scale) + L7 not_a_second_SoT.
Decision in this design: Defer. PG-native event_outbox + event_subscription covers current single-VPS scale. NATS adoption deferred until multi-host workers / multi-region fan-out becomes a real load.
Profile trigger (per Rev2 OD5): Multi-host worker fleet (>2 hosts), per-event fan-out >10 subscribers, or per-event throughput overwhelming PG NOTIFY + polling.
SoT-pointback requirement: NATS subjects map 1:1 to event_type_registry rows. Every NATS message carries event_outbox_id so subscribers can correlate back to canonical ledger.
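The pointback requirement above can be sketched without any NATS client: a mapping check for the 1:1 rule and an envelope builder that always carries event_outbox_id. The function names, the subject strings, and the example event type are hypothetical; in a real adapter the mapping would be loaded from event_type_registry rather than written inline.

```python
import json


def check_one_to_one(subject_by_event_type: dict) -> None:
    """Raise if two event types would publish on the same subject."""
    subjects = list(subject_by_event_type.values())
    if len(set(subjects)) != len(subjects):
        raise ValueError("two event types share a subject")


def build_message(subject_by_event_type: dict, event_type: str,
                  event_outbox_id: int, payload: dict):
    """Return (subject, body) for one outbox row; the body points back to PG."""
    if event_type not in subject_by_event_type:
        # Unregistered event types must not reach the transport.
        raise KeyError(f"event_type not registered: {event_type}")
    envelope = {
        "event_type": event_type,
        "event_outbox_id": event_outbox_id,  # correlation key to the canonical ledger
        "payload": payload,
    }
    body = json.dumps(envelope, sort_keys=True).encode("utf-8")
    return subject_by_event_type[event_type], body


# Hypothetical registry content, for illustration only.
registry = {"iu.created": "evt.iu.created", "iu.updated": "evt.iu.updated"}
check_one_to_one(registry)
subject, body = build_message(registry, "iu.created", 42, {"iu_id": "abc"})
```

Subscribers can then decode the body and look up event_outbox_id in PG, which keeps the transport a projection rather than a second SoT.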
§3.7 Redis Streams
Role inspected: Stream data structure in Redis; consumer group semantics for distributed processing.
- Gate A (state-vocab): PASS — transport.
- Gate B (config-first): PASS — same as NATS: stream key = event_type, declared in PG.
Verdict: A candidate as transport — but only if Redis is already in ops mix for other reasons.
Labels: L4 defer_until_profile_trigger + L7 not_a_second_SoT.
Decision in this design: Defer. Adopt only if Redis is already operationally justified (caching layer, session store) AND profile trigger met. Otherwise NATS is preferred for clean dedicated transport (single tool).
Profile trigger: Same as NATS, plus "Redis already in ops".
SoT-pointback requirement: Same as NATS.
§3.8 Hasura subscriptions
Role inspected: GraphQL subscription engine over PG; client-facing realtime via WebSocket.
- Gate A (state-vocab): PASS — does not impose work_state vocab.
- Gate B (config-first): FAIL — Hasura would own query/subscription patterns and direct PG access from clients; conflicts with Điều 33 v2.1 (3-layer: Nuxt → Directus/backend → PG) and Điều 28 (Nuxt render shell). Hasura connecting clients directly to PG bypasses our backend gateway.
Verdict: R candidate as core realtime owner.
Labels: L2 reject_as_core_owner (boundary violation app plane) + L6 sandbox/reference_only.
Decision in this design: Reject as realtime core. Backend realtime gateway owns the boundary (03-event-5layer-… §7). Permitted: reference Hasura for understanding subscription-over-PG patterns.
Profile trigger: N/A.
SoT-pointback requirement: N/A.
§3.9 Directus realtime
Role inspected: Directus built-in realtime via WebSocket on Directus collections.
- Gate A (state-vocab): PASS — transport.
- Gate B (config-first): FAIL — Directus realtime fans out collection events directly to clients, bypassing our backend gateway permission/relevance filter. Boundary violation (app event plane).
Verdict: R candidate as app event plane; M candidate as admin diagnostics if scoped.
Labels: L2 reject_as_core_owner (boundary violation) + L6 sandbox/reference_only (admin diagnostics if approved).
Decision in this design: Reject as app event plane. Permitted: Directus realtime in admin-only console for diagnostics, scoped to admin role, never reaches app clients. Sentinel: no app-facing route subscribes to Directus realtime.
Profile trigger: N/A.
SoT-pointback requirement: N/A.
§3.10 Watermill
Role inspected: Go library for event-driven applications (pub/sub abstraction over multiple brokers).
- Gate A (state-vocab): PASS — abstraction layer.
- Gate B (config-first): PARTIAL — Watermill is a library used in worker code; doesn't itself define workflow / event in code, but its handler functions are Go code. Not config-first for workflow.
Verdict: R candidate as substrate; M candidate as code reference.
Labels: L3 reject_as_primary_substrate_now + L6 sandbox/reference_only.
Decision in this design: Not primary substrate. PG-native covers needs. Permitted: reference Watermill design (router / middleware patterns) for our own worker library if we build one.
Profile trigger: N/A.
SoT-pointback requirement: N/A.
§3.11 Centrifugo
Role inspected: Real-time messaging server (WebSocket / SSE / SockJS / gRPC) with horizontal scaling.
- Gate A (state-vocab): PASS — transport only.
- Gate B (config-first): PASS — channels/permissions configured via Centrifugo config + JWT, can be derived from PG. Workflow definitions remain in PG.
Verdict: A candidate as future realtime gateway implementation IF concurrent client load justifies.
Labels: L4 defer_until_profile_trigger (>1k concurrent realtime clients) + L5 future_adapter_slot_preserved (gateway adapter slot).
Decision in this design: Defer. Realtime gateway starts as native (Nuxt SSE shell → Node backend gateway service per 03-event-5layer-… §7.4; OD4 default kept). Preserve Centrifugo adapter slot for >1k concurrent clients.
Profile trigger (per Rev2 OD4): Concurrent realtime clients >1k OR multi-region distribution.
SoT-pointback requirement: Topic identity rows in realtime_gateway_topic_registry (PG SoT) map 1:1 to Centrifugo channels.
§3.12 W3C traceparent shape
Role inspected: Distributed tracing header format (W3C Trace Context spec).
- Gate A: PASS — it's an invariant; not a state-vocab.
- Gate B: PASS — adoption is shape-only; no engine, no DSL.
Verdict: L1 confirmed_invariant — adopt NOW (per memory feedback and Rev2 §15 L1).
Labels: L1 confirmed_invariant.
Decision in this design: Adopt NOW. Every event_outbox / job_queue / step_run carries trace_id (32 hex) + parent_span_id (16 hex) + correlation_id (uuid) per 03-event-5layer-… §5.6. Later OTel ingestion is zero-migration.
Profile trigger: N/A — adopt unconditionally.
SoT-pointback requirement: N/A (shape, not data).
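The field shapes named above line up with the W3C Trace Context `traceparent` header (`version-traceid-parentid-flags`, lowercase hex). A minimal sketch of generating and validating them follows; the function names are hypothetical, and the column names are taken from this section.

```python
import re
import secrets
import uuid

# Version "00" header: 32-hex trace-id, 16-hex parent-id, 2-hex trace-flags.
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_trace_fields() -> dict:
    """Generate the three correlation fields carried on each row."""
    return {
        "trace_id": secrets.token_hex(16),       # 16 random bytes -> 32 hex chars
        "parent_span_id": secrets.token_hex(8),  # 8 random bytes -> 16 hex chars
        "correlation_id": str(uuid.uuid4()),
    }


def is_valid_traceparent(header: str) -> bool:
    match = TRACEPARENT_RE.fullmatch(header)
    if match is None:
        return False
    trace_id, parent_id, _flags = match.groups()
    # The spec treats all-zero trace-id and parent-id values as invalid.
    return set(trace_id) != {"0"} and set(parent_id) != {"0"}


def to_traceparent(trace_id: str, parent_span_id: str, sampled: bool = True) -> str:
    header = f"00-{trace_id}-{parent_span_id}-{'01' if sampled else '00'}"
    if not is_valid_traceparent(header):
        raise ValueError("fields do not form a valid traceparent header")
    return header


fields = new_trace_fields()
header = to_traceparent(fields["trace_id"], fields["parent_span_id"])
```

Storing the fields in this shape is what lets a later collector read them without a data migration.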
§3.13 OpenTelemetry collector + Jaeger
Role inspected: OTel collector (OTLP receiver / processor / exporter) + Jaeger backend for trace UI.
- Gate A: PASS — transport.
- Gate B: PASS — collector config is YAML and can be derived from PG (export pipelines declared in PG).
Verdict: A candidate after W3C trace_id ubiquity.
Labels: L4 defer_until_profile_trigger (after trace_id is ubiquitous across producers/workers; ~Phase 3+).
Decision in this design: Defer. Adopt OTel collector once trace_id is consistently present in events + jobs + heartbeats. Jaeger as trace UI is a Phase 3+ adoption candidate.
Profile trigger: trace_id present in ≥95% of producers/workers AND need for distributed trace UI grows (e.g. multi-service architecture).
SoT-pointback requirement: Trace ID lives in PG audit rows (event_outbox.trace_id); OTel is consumer.
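The coverage half of the profile trigger can be sketched as a simple ratio check. The input shape (a mapping from producer/worker name to whether it emits trace_id) and all names here are assumptions for illustration; the "need for a trace UI" clause is a judgment call and is passed in as a flag.

```python
def trace_coverage_met(emits_trace_id: dict, threshold_percent: int = 95) -> bool:
    """True when at least threshold_percent of producers/workers emit trace_id."""
    total = len(emits_trace_id)
    if total == 0:
        return False
    covered = sum(1 for emits in emits_trace_id.values() if emits)
    # Integer comparison avoids floating-point rounding at the boundary.
    return covered * 100 >= threshold_percent * total


def otel_trigger_met(emits_trace_id: dict, trace_ui_needed: bool) -> bool:
    return trace_coverage_met(emits_trace_id) and trace_ui_needed


# Hypothetical fleet: 19 of 20 producers emit trace_id (exactly 95%).
fleet = {f"producer-{i}": i != 0 for i in range(20)}
```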
§3.14 SSE / WebSocket / Centrifugo (realtime transports)
Native SSE shell is the start; WebSocket / Centrifugo are upgrade paths. Already covered via §3.11. No final tool pick.
§3.15 Jaeger (covered §3.13 as OTel sink).
§4. Aggregate summary table
| Tool | Gate A | Gate B | Verdict | Labels (Rev2 §15) | Decision now | Profile trigger |
|---|---|---|---|---|---|---|
| pg-boss / Graphile Worker | FAIL | FAIL | R | L2 + L5 + L7 | Reject as substrate | Multi-host worker contention |
| Temporal | FAIL | FAIL | R | L2 + L3 + L4 | Reject as MOW core | Post-Phase 6 saturation |
| Camunda | FAIL | PARTIAL | R | L2 + L6 | Reject as approval engine | None |
| Airflow | FAIL | FAIL | R | L2 + L6 | Reject as MOW core | Batch / data side revisit |
| Benthos / Redpanda Connect | PASS | PARTIAL | M | L3 + L5 + L7 | Adapter slot preserved | External mirror need |
| NATS | PASS | PASS | A | L4 + L7 | Defer | Multi-host / >10 fan-out |
| Redis Streams | PASS | PASS | A | L4 + L7 | Defer | Multi-host + Redis-in-ops |
| Hasura subs | PASS | FAIL | R | L2 + L6 | Reject as core | None |
| Directus realtime | PASS | FAIL | R | L2 + L6 | Reject as app plane | None |
| Watermill | PASS | PARTIAL | R | L3 + L6 | Not substrate; ref only | N/A |
| Centrifugo | PASS | PASS | A | L4 + L5 | Defer; gateway slot kept | >1k concurrent clients |
| W3C trace_id | PASS | PASS | A | L1 | Adopt NOW | N/A |
| OpenTelemetry collector | PASS | PASS | A | L4 | Defer | trace_id ubiquity |
| Jaeger | PASS | PASS | A | L4 | Defer | OTel adoption |
§5. Invariants / sentinels for OSS adoption
For any tool category considered later, this design imposes the following sentinel checks before adoption:
- Gate A test — explicit verdict on state-vocab fit (no implicit assumption).
- Gate B test — explicit verdict on config-first fit.
- SoT-pointback row — adopted tool's identity table contains a column pointing to PG registry primary key. Schema-level enforcement (FK or trigger).
- Reversibility — documented exit path before adoption.
- No double-ownership — tool's role mapped to no overlapping concern with Điều 32 / 35 / 38 / 39 / 45.
- Birth registry — new tool category registered in external_tool_registry (paper) before run.
- Heartbeat — any process-class tool (worker / gateway) has heartbeat caller (Điều 45 §15.5).
- No raw event noise — any UI-exposed tool surface (e.g. Jaeger) is gated behind backend permission filter.
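The SoT-pointback sentinel in the list above calls for schema-level enforcement. A minimal sketch using a foreign key follows, with Python's built-in `sqlite3` standing in for PG purely so the example runs on its own. The `adapter_pipeline` table and the `register_pipeline` helper are hypothetical; `event_type_registry.event_type` is the registry column named in §3.5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE event_type_registry (
    event_type TEXT PRIMARY KEY
);
-- Hypothetical adapter identity table: every row must point back to the registry.
CREATE TABLE adapter_pipeline (
    pipeline_id TEXT PRIMARY KEY,
    event_type  TEXT NOT NULL REFERENCES event_type_registry(event_type)
);
INSERT INTO event_type_registry (event_type) VALUES ('iu.created');
""")


def register_pipeline(conn: sqlite3.Connection, pipeline_id: str, event_type: str) -> bool:
    """Insert an adapter row; return False if the pointback target does not exist."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO adapter_pipeline (pipeline_id, event_type) VALUES (?, ?)",
                (pipeline_id, event_type),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this shape, an adapter row that names an unregistered event type is rejected by the database itself rather than by application code.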
§6. Cross-references and acceptance
Cross-refs:
- State-machine 9-floor (Gate A test target) → 02-step-state-machine-and-workflow-ui-design.md §3.
- Event 5-layer substrate (Gate B test target) → 03-event-5layer-realtime-dlq-design.md §§3-7.
- IU / 4 Mothers boundary (no tool can replace) → 04-iu-centered-4mothers-binding-design.md §§3-4.
- Open Decisions OD4 (realtime), OD5 (CDC), OD6 (Temporal/Camunda re-eval) → 06-open-decisions-and-readiness.md.
Acceptance:
- A1. Every tool listed in Rev2 §15 has a verdict row in §4 summary table.
- A2. Every verdict carries Gate A + Gate B test outcomes, labels from the 7-set, and a profile trigger if deferred.
- A3. No final tool selection / no pinned version / no CI step / no dockerfile.
- A4. W3C trace_id is the only L1 confirmed_invariant; adopt NOW.
- A5. Sentinel rules §5 capture binding requirements for any future adoption.
End WS7 design.