KB-5A00

S177 Architecture Design — Lark Base Controlled CRUD Gateway (2026-05-19)


S177 — Architecture Design Document: Lark Base Controlled CRUD Gateway

Status: DRAFT for Huyên review, then Sprint 1 implementation
Date: 2026-05-19
Source of truth: knowledge/dev/lark/s177-controlled-crud-gateway-requirements-v2.md (requirements brief v2.2 FINAL, GPT R3 8.8/10)
Author: Claude Code (Opus 4.7)
Survey basis: KB architecture contract (knowledge/dev/lark/README.md rev3, lark-client-architecture.md S176, lark-base-registry.md, snapshot 88-phai-cu-base-dem.md) + live @larksuiteoapi/lark-mcp tool surface observed in-session. Live source /opt/incomex/lark-client/ was NOT directly read; see §J Open Question OQ-1.

Readback provenance (S177-DESIGN-E1): original written to VPS /opt/incomex/docs/mcp-writes/s177-architecture-design.md; write_file reported 31407 bytes; byte-identical readback ⇒ SHA-256 0440ef92ee9f5355c16902aaf417a346b1b2a97adbd7dded360cf320763639e5, 502 lines. NOT git-committed; final repo path not yet populated.


A. Executive Summary

Scope

Add controlled write capability (records + fields + later tables/views) to the existing read-only lark-client v1.0.0, behind a mandatory 8-layer SafetyLayer, exposed through two tracks that share one Application Service Layer:

  • Track B (CLI, production-grade, built first): lark-tool records ... / lark-tool fields ...
  • Track A (MCP, Cowork interactive): adapter over the same service layer; production delete forbidden, Base đệm only.

Architecture (target)

Cowork / MCP (Track A)          Claude Code CLI / Cron (Track B)
        │                                  │
        └────────────────┬─────────────────┘
                         ▼
          Application Service Layer  (lark_client/service.py)
          — single write entrypoint, no duplicated write logic —
                         ▼
              SafetyLayer  (lark_client/safety.py)
   dry-run → approval → backup(GPG) → audit-pre → lock →
   rate-limit → PII-scan → Lark API call → audit-post
                         ▼
                   LarkCore  (existing — GSM token, whitelist, retry)
                         ▼
            Lark Open API  https://open.larksuite.com

Sprint plan (from requirements §10, unchanged)

| Sprint | Deliverable | Track |
|---|---|---|
| 1 | writer.py + SafetyLayer core + GPG backup + 2-phase audit + tests | B |
| 2 | MCP adapter over service layer + record.get / record.delete MCP | A |
| 3 | field_manager.py (Text/Number/SingleSelect/Checkbox) + ApprovalProvider interface + Directus prototype | B |
| 4 | table/base schema ops + monitoring + full integration test | A+B |

Top risks identified during survey

| # | Risk | Severity | Mitigation |
|---|---|---|---|
| R-1 | Design not validated against live source code (could not read /opt/incomex/lark-client/) | HIGH | Sprint 1 step 0 = code-reconcile checklist (§J OQ-1) before any new code |
| R-2 | Audit-post failure after a successful destructive API call → write with no trail | HIGH | 2-phase audit + emergency fallback log to a second sink (§C.6) |
| R-3 | PII leaking into audit/backup logs | HIGH | 2 parallel PII layers + metadata-only audit + GPG-encrypted backup |
| R-4 | GPG private key on VPS → backups decryptable by an attacker who roots the box | HIGH | Public-key-only on VPS; private key offline with Huyên (§E) |
| R-5 | Lark batch partial failure leaves data half-written | MED | Stop-and-report, no auto-rollback, auto-generated manual rollback cmd (req §12.10) |
| R-6 | Track A lark-mcp cannot be extended via config → needs custom MCP server | MED | Gap analysis §F decides this in Sprint 2; service layer makes either path cheap |
| R-7 | Cowork/MCP accidentally hitting production Base instead of Base đệm | MED | Hard allowlist: MCP write path rejects any app_token ≠ Base đệm token (§F, §H) |

B. Application Service Layer

File: lark_client/service.py
Principle (req §12.8): CLI and MCP MUST call this layer. No write logic anywhere else. Never import requests; all HTTP goes through the existing LarkCore.

B.1 Interface

from abc import ABC, abstractmethod
from typing import Literal, TypedDict

class WriteOutcome(TypedDict):
    status: Literal["dry_run", "success", "partial_failure", "failed", "aborted"]
    operation: str                 # "record.create" | "record.delete" | "field.create" ...
    base_key: str
    table_id: str
    targets: list[str]             # record_ids / field_ids affected (or planned)
    idempotency_key: str           # UUID v4
    rollback_command: str | None   # auto-generated, printed to stdout
    audit_pre_id: str              # id of the pre-execution audit entry
    audit_post_id: str | None
    pii: dict                      # metadata only (see §D.4)
    error: str | None

class LarkWriteServiceABC(ABC):
    @abstractmethod
    def create_record(self, ctx: WriteContext, fields: dict) -> WriteOutcome: ...
    @abstractmethod
    def batch_create_records(self, ctx: WriteContext, records: list[dict]) -> WriteOutcome: ...
    @abstractmethod
    def get_record(self, ctx: ReadContext, record_id: str) -> dict: ...
    @abstractmethod
    def update_record(self, ctx: WriteContext, record_id: str, fields: dict) -> WriteOutcome: ...
    @abstractmethod
    def batch_update_records(self, ctx: WriteContext, records: list[dict]) -> WriteOutcome: ...
    @abstractmethod
    def delete_record(self, ctx: WriteContext, record_id: str) -> WriteOutcome: ...
    @abstractmethod
    def batch_delete_records(self, ctx: WriteContext, record_ids: list[str]) -> WriteOutcome: ...
    # Sprint 3+
    @abstractmethod
    def create_field(self, ctx: WriteContext, spec: FieldSpec) -> WriteOutcome: ...
    @abstractmethod
    def update_field(self, ctx: WriteContext, field_id: str, spec: FieldSpec) -> WriteOutcome: ...
    @abstractmethod
    def delete_field(self, ctx: WriteContext, field_id: str) -> WriteOutcome: ...
    # Sprint 4
    @abstractmethod
    def create_table(self, ctx: WriteContext, spec: TableSpec) -> WriteOutcome: ...
    @abstractmethod
    def delete_table(self, ctx: WriteContext) -> WriteOutcome: ...
    @abstractmethod
    def list_views(self, ctx: ReadContext) -> list[dict]: ...

WriteContext carries the intent, not the credential:

from dataclasses import dataclass, field
from uuid import uuid4

@dataclass(frozen=True)
class WriteContext:
    base_key: str          # registry key, resolved to app_token via bases.yaml (NEVER hardcoded)
    table_id: str
    operation: str         # canonical op id, drives approval defaults + wildcard policy
    agent: str             # $LARK_AGENT ∈ {claude-code, cowork-mcp, cron}
    approval_id: str
    dry_run: bool = True   # default ON (req §5.1)
    confirmed: bool = False  # --confirm; required for update/delete on production
    idempotency_key: str = field(default_factory=lambda: str(uuid4()))
    is_buffer_base: bool = False  # True iff base_key resolves to Base đệm token

B.2 LarkWriteService (concrete, Sprint 1)

  • Constructed with dependency-injected collaborators (no internal new): LarkWriteService(core: LarkCore, safety: SafetyLayer, registry: Registry)
  • Resolves base_key → app_token through the existing Registry/bases.yaml SSOT. Raises UnknownBaseError if not in registry (req §12.1, no hardcoded app_token).
  • Builds the Lark request, then delegates the entire mutating call to SafetyLayer.guard(...) — the service never calls LarkCore write methods directly.
  • Returns WriteOutcome; never raises for an expected guarded rejection (returns status="aborted" + error); raises only on programming errors.

B.3 Error handling strategy

| Class | Example | Behaviour |
|---|---|---|
| ApprovalError | missing/expired/scope-mismatch/used one-time approval | abort before API, status=aborted, exit code 3 |
| SafetyViolation | dry-run not run, lock held, PII block, audit-pre fail | abort before API, exit code 3 |
| LarkApiError | 4xx/5xx from Lark after retries | status=failed, audit-post records failure, exit 4 |
| PartialFailureError | batch: some items ok, some not | status=partial_failure, no auto-rollback, print manual rollback cmd, exit 5 (req §12.10) |
| AuditError (post) | audit-post sink down after API success | status=success + warning, emergency fallback log written (§C.6), exit 0 with warning |

All errors derive from existing lark_client.exceptions base; add the new subclasses there (do not invent a parallel hierarchy).
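The table above can be sketched as a small exception hierarchy plus an exit-code lookup. This is only an illustration: LarkClientError stands in for the existing lark_client.exceptions base class, whose real name is not confirmed (see OQ-1), and the fallback exit code 1 for unexpected errors is an assumption that the requirements do not state.

```python
class LarkClientError(Exception):
    """Hypothetical stand-in for the existing lark_client.exceptions base."""


class ApprovalError(LarkClientError):
    """Approval missing, expired, out of scope, or already consumed."""


class SafetyViolation(LarkClientError):
    """A SafetyLayer gate refused the write before the API call."""


class LarkApiError(LarkClientError):
    """Lark returned 4xx/5xx after retries."""


class PartialFailureError(LarkClientError):
    """Some items of a batch committed, others did not."""


class AuditError(LarkClientError):
    """Audit-post sink failed after a successful API call."""


# Exit codes as listed in B.3; AuditError exits 0 with a warning.
EXIT_CODES = {
    ApprovalError: 3,
    SafetyViolation: 3,
    LarkApiError: 4,
    PartialFailureError: 5,
    AuditError: 0,
}


def exit_code_for(exc: Exception) -> int:
    """Map an exception to the CLI exit code from the table."""
    for cls, code in EXIT_CODES.items():
        if isinstance(exc, cls):
            return code
    return 1  # assumption: programming errors exit with a generic code
```

Keeping the mapping in one dictionary lets the CLI and the MCP adapter report the same outcome for the same error class.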

B.4 Rate limit

Reuse the existing LarkCore global file lock /var/lock/lark-api.lock @ 10 req/s (README §4). The service adds batch sizing: split any batch >500 into ≤500-record chunks (req §5.7), each chunk a separate guarded call with its own idempotency sub-key {idempotency_key}#{chunk_index}. Chunk failure → stop, report which chunks committed (R-5).
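A minimal sketch of the batch sizing rule, assuming records arrive as a plain list and that the chunk index starts at 0 (the requirements do not say which index comes first). chunk_batch is a hypothetical helper name.

```python
from typing import Iterator

MAX_BATCH = 500  # configured ceiling from req §5.7


def chunk_batch(
    records: list[dict], idempotency_key: str, size: int = MAX_BATCH
) -> Iterator[tuple[str, list[dict]]]:
    """Yield (sub_key, chunk) pairs, each chunk holding at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for index, start in enumerate(range(0, len(records), size)):
        # Sub-key format {idempotency_key}#{chunk_index} as described above.
        yield f"{idempotency_key}#{index}", records[start:start + size]
```

Each yielded pair would become one guarded call, so a failure in a later chunk leaves a clear record of which sub-keys already committed.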


C. SafetyLayer Design

File: lark_client/safety.py
Single public method: guard(ctx: WriteContext, payload, api_call: Callable) -> WriteOutcome
api_call is a zero-arg closure that performs exactly one LarkCore mutating request; SafetyLayer decides if/when to invoke it.

C.1 Execution order (req §4 invariant)

1 dry-run gate      2 approval check    3 backup (GPG)     4 audit-pre
5 lock acquire      6 rate-limit        7 PII scan         8 → api_call()
                                                            9 audit-post
                                                           10 lock release

approval_exempt_bases bypasses layer 2 only; layers 1,3,4,5,6,7,8,9 always run (req §5 note, §13.3).
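The ordering can be illustrated with a skeleton that records each step in a trace list instead of doing real work. guard_order, its parameters, and the use of RuntimeError for a Lark failure are all hypothetical; the real guard takes a WriteContext and returns a WriteOutcome.

```python
from typing import Callable


def guard_order(
    dry_run: bool,
    approval_exempt: bool,
    api_call: Callable[[], dict],
    trace: list[str],
) -> str:
    """Walk the layers in the documented order, appending each step to trace."""
    trace.append("dry-run gate")
    if dry_run:
        return "dry_run"  # the API is never reached on a dry run
    if not approval_exempt:
        trace.append("approval")  # the only layer an exempt base skips
    trace.append("backup")
    trace.append("audit-pre")
    trace.append("lock acquire")
    try:
        trace.append("rate-limit")
        trace.append("pii-scan")
        try:
            api_call()
            status = "success"
        except RuntimeError:  # stand-in for LarkApiError
            status = "failed"
        trace.append(f"audit-post:{status}")
        return status
    finally:
        trace.append("lock release")  # runs even when the call fails
```

The finally block is the point of the sketch: the lock release happens on every path that acquired the lock, including a failed API call.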

C.2 Per-layer behaviour & failure mode

| # | Layer | Pass condition | Failure mode |
|---|---|---|---|
| 1 | dry-run gate | if ctx.dry_run: build + validate payload, return status=dry_run WITHOUT calling API. Real run requires dry_run=False; update/delete on a non-buffer base also requires confirmed=True | not confirmed → SafetyViolation, abort |
| 2 | approval | ApprovalProvider.check(ctx) → valid, unexpired, scope covers base_key+table_id, op allowed, wildcard policy ok, one-time not yet consumed | invalid → ApprovalError, abort |
| 3 | backup | for update/delete: get_record(s) BEFORE mutation, serialize, GPG-encrypt, write to backups dir, fsync | encryption/write fail → SafetyViolation, abort (never mutate without a backup) |
| 4 | audit-pre | append phase=planned JSONL entry, fsync, capture entry id | write fail → ABORT, do not call API (req §9) |
| 5 | lock | acquire per-record advisory lock lark-write:{base_key}:{table_id}:{record_id} (and the global rate lock) | lock held → SafetyViolation (concurrent write), abort |
| 6 | rate-limit | token-bucket 10 req/s via existing global lock; batch ≤500 | exceeded → block/wait, then proceed |
| 7 | PII scan | run FieldPIIRegistry + PatternPIIDetector over payload; compute redaction metadata; policy: detection never blocks the write itself, it only controls what audit/backup record (metadata-only) | scanner crash → fail-closed: abort with SafetyViolation |
| 8 | api_call | invoke the closure once (with idempotency/client_token) | Lark error after retries → LarkApiError, jump to audit-post(failed) |
| 9 | audit-post | append phase=success or phase=failed JSONL entry | write fail → emergency fallback sink (§C.6) |
| 10 | release | always release locks in finally | n/a |

C.3 ApprovalProvider — dependency-injected (req §7, §13.11)

class ApprovalProvider(ABC):
    @abstractmethod
    def check(self, ctx: WriteContext) -> ApprovalDecision: ...
    @abstractmethod
    def consume(self, ctx: WriteContext, approval_id: str) -> None: ...  # one-time-use marking

class YamlApprovalProvider(ApprovalProvider):   # Sprint 1–2
    def __init__(self, path="config/write-approvals.yaml"): ...

class DirectusApprovalProvider(ApprovalProvider):  # Sprint 3+ prototype
    ...

Constructor: SafetyLayer.__init__(self, *, approval_provider: ApprovalProvider, ...). SafetyLayer never imports YamlApprovalProvider. Wiring happens in a composition root (lark_client/factory.py or CLI bootstrap). Swapping YAML→Directus must not touch safety.py.

C.4 Wildcard / first-write policy (req §7, §13.8, §13.10)

Enforced inside layer 2 from a static table:

| Operation | Wildcard table allowed? |
|---|---|
| record.create | ✅ (within scope+expiry) |
| record.update / delete | ❌ |
| field.create/update | ❌ |
| field/table delete | ❌ (break-glass, explicit) |

First write to any specific base/table → approval scope MUST name explicit base_key+table_id; wildcard rejected regardless of operation.
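A minimal sketch of this policy as a pure function. It assumes a wildcard table scope is written as "*" (a hypothetical notation) and that only record.create may ever use one, which is how the comment in the write-approvals schema (§G.4) reads.

```python
# Assumption: record.create is the only operation that accepts a wildcard table.
WILDCARD_OPS = frozenset({"record.create"})


def wildcard_scope_ok(operation: str, table_scope: str, first_write: bool) -> bool:
    """Return True when the approval's table scope is acceptable for this write."""
    if table_scope != "*":
        return True  # an explicit table_id always satisfies the wildcard rule
    if first_write:
        return False  # first write to a base/table must name it explicitly
    return operation in WILDCARD_OPS
```

Tracking which base/table pairs have already been written (the first_write flag) is left to the caller; the requirements do not say where that state lives.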

C.5 Approval defaults (req §8, baked into provider validation)

record.create reusable-within-expiry (narrow scope only); record.update one-time; record.delete one-time mandatory; field.create/update one-time mandatory; field/table delete break-glass one-time mandatory. Reusable must be explicit and is forbidden for delete/schema ops.

C.6 Audit 2-phase + emergency fallback (req §9)

  • Primary sink: /var/log/lark-ops/YYYYMMDD.jsonl (existing audit stream, append+fsync).
  • Phase 1 (pre): entry {phase:"planned", op, base_key, table_id, targets, agent, approval_id, idempotency_key, ts}. Fsync fail → abort, API not called.
  • Phase 2 (post): entry {phase:"success"|"failed", ...same id..., lark_response_meta, pii:{...}}.
  • Emergency fallback: if the phase-2 write fails after a successful API call, write to an independent sink /var/log/lark-ops/EMERGENCY/<ts>-<idempotency_key>.json (separate file, separate fd) AND emit WriteOutcome.status="success" with error="audit_post_degraded". Never silently swallow. If even the emergency sink fails → also stderr-print a structured LARK-AUDIT-LOST line so cron/CI capture it.
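The post-call audit write and its fallback chain can be sketched as follows. The function names and the three return labels are hypothetical; the paths are passed in so the sketch does not hardcode /var/log/lark-ops, and it assumes the entry carries ts and idempotency_key.

```python
import json
import os
import sys
from pathlib import Path


def append_jsonl(path: Path, entry: dict) -> None:
    """Append one JSON line and fsync before returning."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def write_audit_post(primary: Path, emergency_dir: Path, entry: dict) -> str:
    """Try the primary sink, then the emergency sink, then stderr."""
    try:
        append_jsonl(primary, entry)
        return "primary"
    except OSError:
        pass  # fall through to the independent sink
    try:
        name = f"{entry['ts']}-{entry['idempotency_key']}.json"
        append_jsonl(emergency_dir / name, entry)
        return "emergency"
    except OSError:
        # Last resort: a structured line that cron/CI can capture.
        print("LARK-AUDIT-LOST " + json.dumps(entry), file=sys.stderr)
        return "lost"
```

A caller would map "emergency" and "lost" to status="success" with error="audit_post_degraded", since the API call itself has already succeeded by this point.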

D. PII Protection (req §6)

Two layers run in parallel (both always active — 18 bases built by many people, unknown PII fields).

D.1 FieldPIIRegistry (whitelist)

  • Structure: config/pii-fields.yaml
    bases:
      "65-yeu-cau-thanh-toan":
        tblXXXX:
          fldYYYY: { type: national_id, label: "CMND/CCCD" }
          fldZZZZ: { type: bank_account }

  • Loaded once at service init; keyed by (base_key, table_id, field_id). Fields are identified by field_id, never by name (req §12.7). Seed from the S176 schema snapshots; growable by PR.

D.2 PatternPIIDetector (regex, for unknown/legacy fields)

VN-specific patterns (ordered, longest/most-specific first to reduce false positives):

| Type | Pattern (anchored on token boundaries) | Note |
|---|---|---|
| national_id_cccd | \b\d{12}\b | CCCD 12 digits; match before phone |
| national_id_cmnd | \b\d{9}\b | CMND 9 digits |
| passport | \b[A-Z]{1,2}\d{7}\b | e.g. B1234567, C1234567 |
| phone_vn | \b(?:\+84\|0)(?:3… (rest of the pattern is cut off in this copy) | |
| bank_account | \b\d{8,16}\b | heuristic; high false-positive; only flag, never auto-mutate |
| email | RFC-lite \b[\w.+-]+@[\w-]+\.[\w.-]+\b | optional, low risk |

Detector returns types + counts only, never the matched substrings.
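A sketch of a detector that returns types and counts only, using the patterns from the table that are fully legible (phone_vn is left out because its pattern is incomplete here). Consuming each match before trying the next pattern is one possible reading of "most-specific first"; the design does not spell out how overlaps are resolved.

```python
import re
from collections import Counter

# Ordered most-specific first; a value matched earlier is not re-counted later.
PATTERNS = [
    ("national_id_cccd", re.compile(r"\b\d{12}\b")),
    ("national_id_cmnd", re.compile(r"\b\d{9}\b")),
    ("passport", re.compile(r"\b[A-Z]{1,2}\d{7}\b")),
    ("bank_account", re.compile(r"\b\d{8,16}\b")),
]


def detect(text: str) -> dict[str, int]:
    """Return {pii_type: count}; matched substrings are never returned."""
    counts: Counter[str] = Counter()
    remaining = text
    for name, pattern in PATTERNS:
        # Blank out each hit so broader patterns cannot claim it again.
        remaining, hits = pattern.subn(" ", remaining)
        if hits:
            counts[name] += hits
    return dict(counts)
```

For example, detect("CCCD 012345678901") reports one national_id_cccd and no bank_account, even though the same digits would also satisfy the broader bank-account pattern.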

D.3 Pipeline integration

Both layers feed SafetyLayer layer 7. Union of (registry hits ∪ pattern hits) → redaction_types, redacted_fields_count. Policy decision (matches req §6): PII presence does NOT block the write — it governs what gets logged (metadata-only audit) and ensures the GPG backup (which does contain raw old values) is encrypted. A --pii-strict mode (off by default) MAY be added to abort on detection; default = log+proceed. Flag this default in §J OQ-3 for Huyên confirmation.

D.4 Audit redaction format (req §6)

{ "pii_redacted": true,
  "redaction_types": ["national_id_cccd","bank_account"],
  "redacted_fields_count": 3,
  "detector": ["registry","pattern"] }

Raw values appear ONLY inside the GPG-encrypted backup blob — never in JSONL, never in stdout, never in rollback command (rollback cmd references the encrypted backup file path, not inline values).
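The metadata block can be assembled from the two detectors' hits without ever touching a raw value. A minimal sketch, assuming each detector reports hits keyed by field_id (the input shapes and the function name are hypothetical):

```python
def build_pii_metadata(
    registry_hits: dict[str, str],
    pattern_hits: dict[str, set[str]],
) -> dict:
    """Merge registry and pattern hits into the audit redaction format."""
    types = set(registry_hits.values())
    for found in pattern_hits.values():
        types |= found
    fields = set(registry_hits) | set(pattern_hits)  # distinct field_ids
    detectors = []
    if registry_hits:
        detectors.append("registry")
    if pattern_hits:
        detectors.append("pattern")
    return {
        "pii_redacted": bool(fields),
        "redaction_types": sorted(types),
        "redacted_fields_count": len(fields),
        "detector": detectors,
    }
```

Only type names, a count, and detector labels leave this function, which keeps the audit entry metadata-only by construction.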


E. GPG Backup Design (req §6, §13.6 — mandatory from Sprint 1)

E.1 Key source — GSM, public-key-only on VPS

Consistent with the golden rule "1 credential, GSM SSOT, never hardcode":

  • New GSM secret in project github-chatgpt-ggcloud: LARK_BACKUP_GPG_PUBKEY = ASCII-armored public key.
  • VPS fetches it via the same LarkCore/secret path used for LARK_APP_* (do not read GSM directly in business code — extend the existing secret accessor).
  • Private key is NEVER on the VPS. Held offline by Huyên (hardware token / offline keyring). VPS can encrypt, cannot decrypt → a rooted VPS still cannot read PII backups (mitigates R-4).
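Encrypting with only a public key present could look like the sketch below, which builds (but does not run) a gpg command line. It assumes the public key has already been imported into the local keyring and that the recipient is addressed by fingerprint; the helper name is hypothetical, and the exact invocation is not fixed by this design.

```python
def gpg_encrypt_argv(recipient_fingerprint: str, source: str, dest: str) -> list[str]:
    """Build a non-interactive gpg encryption command for one backup file."""
    return [
        "gpg",
        "--batch",            # never prompt; suitable for cron and CLI use
        "--yes",              # overwrite dest if it already exists
        "--trust-model", "always",  # assumption: key trust is established via GSM
        "--recipient", recipient_fingerprint,
        "--output", dest,
        "--encrypt", source,
    ]
```

The resulting list can be handed to subprocess.run(argv, check=True). Nothing in it needs the private key, which is what lets the VPS encrypt without being able to decrypt.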

E.2 Rotation policy

  • Rotate annually, or immediately on suspected compromise.
  • Procedure: generate new keypair offline → publish new public key as a new GSM version of LARK_BACKUP_GPG_PUBKEY → service picks up latest on next start → old backups remain decryptable with the retired private key (retain old private keys offline, indexed by fingerprint). Each backup file records the encrypting key fingerprint in its sidecar metadata.

E.3 File naming & storage

/var/log/lark-ops/writes/<YYYYMMDD>/
  <base_key>__<table_id>__<record_id>__<idempotency_key>__pre.json.gpg
  <base_key>__<table_id>__<record_id>__<idempotency_key>__pre.meta.json   # unencrypted: key fp, ts, op, NO pii

Batch: one .json.gpg per chunk, records concatenated as JSON lines before encryption.
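The naming scheme can be captured in one helper that returns both the encrypted blob path and its sidecar path. Using the UTC date for the directory name is an assumption; the scheme above does not say which timezone YYYYMMDD refers to.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

BACKUP_ROOT = PurePosixPath("/var/log/lark-ops/writes")


def backup_paths(
    base_key: str,
    table_id: str,
    record_id: str,
    idempotency_key: str,
    when: datetime,
) -> tuple[PurePosixPath, PurePosixPath]:
    """Return (encrypted_blob_path, sidecar_meta_path) for one pre-image backup."""
    day = when.astimezone(timezone.utc).strftime("%Y%m%d")  # assumption: UTC
    stem = "__".join([base_key, table_id, record_id, idempotency_key, "pre"])
    folder = BACKUP_ROOT / day
    return folder / f"{stem}.json.gpg", folder / f"{stem}.meta.json"
```

Because both names share one stem, the recovery procedure can go from an idempotency_key to the blob and its sidecar with a single directory search.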

E.4 Recovery procedure

  1. Locate backup file by idempotency_key (also recorded in the audit-pre entry).
  2. Read sidecar .meta.json → confirm key fingerprint.
  3. On Huyên's offline machine: gpg --decrypt <file>.json.gpg > restored.json.
  4. Re-apply via lark-tool records update ... --data @restored.json --approval <new APR> --no-dry-run --confirm (recovery is itself a guarded write — fully audited, no special bypass).

F. Track A — MCP Plugin

F.1 Survey result: @larksuiteoapi/lark-mcp

Could not run npm list on the VPS (no exec tool). Authoritative substitute: the live lark-mcp tool surface bound to this very session, which is the same @larksuiteoapi/lark-mcp plugin. Exactly 9 bitable tools exposed:

| Available now (9) | Category |
|---|---|
| bitable_v1_app_create | Base app |
| bitable_v1_appTable_create, bitable_v1_appTable_list | Table |
| bitable_v1_appTableField_list | Field (read) |
| bitable_v1_appTableRecord_search | Record (read) |
| bitable_v1_appTableRecord_create, _batchCreate | Record (create) |
| bitable_v1_appTableRecord_update, _batchUpdate | Record (update) |

This exactly matches requirements §2 ("MCP plugin 9 tools — Create + Read + Update, NO Delete, NO field management").

F.2 Gap analysis

| Needed | In plugin? | Decision |
|---|---|---|
| record.get by id | ❌ (only search) | must add |
| record.delete / batchDelete | ❌ | must add |
| field.create/update/delete | ❌ (only field_list) | must add |
| appTable.update/delete | ❌ (only create/list) | must add (Sprint 4) |
| view.list/create/delete | ❌ | must add (Sprint 4) |

The published @larksuiteoapi/lark-mcp exposes a fixed tool set; missing operations are not togglable via config (the plugin simply does not implement delete/field-mgmt tools). Conclusion: a thin custom MCP server is required for Track A — but it must NOT re-implement Lark calls. It is an adapter that imports lark_client.service.LarkWriteService and exposes new MCP tools.

F.3 Custom MCP adapter design (Sprint 2)

  • New small server lark_client/mcp_adapter/ (Python, MCP SDK) OR extend if an internal MCP host exists — adapter only, zero write logic.
  • Tools exposed (Sprint 2 scope): lark_record_get, lark_record_delete, lark_record_create, lark_record_update. Sprint 4: field/view tools.
  • Every tool builds a WriteContext with agent="cowork-mcp" and calls LarkWriteService.
  • Hard guard (R-7): the adapter resolves base_key; if the resolved app_token ≠ Base đệm token Nf2bb1ExXaYnlksgoyQl72GNgAc, any delete / schema op is rejected at the adapter boundary with a clear error (req §11, §13.4 — Cowork/MCP delete = Base đệm only). Production writes via MCP are adapter-only and still pass full SafetyLayer.
  • Auth: reuses GSM LARK_APP_* through LarkCore. No new bot, no new credential (req §12.2).
  • The existing 9-tool @larksuiteoapi/lark-mcp may remain mounted for read/create/update interactive use; the custom adapter only fills the gap. Final mount topology = §J OQ-4.
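The hard guard at the adapter boundary can be sketched as a pure predicate. The buffer token is passed in rather than hardcoded, in line with the registry rule; the classification of which operation ids count as "delete / schema" is an assumption made for illustration.

```python
# Assumption: these operation ids are reads and are never restricted.
READ_ONLY_OPS = frozenset({"record.get", "view.list"})
SCHEMA_FAMILIES = frozenset({"field", "table", "view"})


def is_restricted(operation: str) -> bool:
    """True for delete and schema operations, which MCP may run on Base đệm only."""
    if operation in READ_ONLY_OPS:
        return False
    family = operation.split(".", 1)[0]
    return operation.endswith(".delete") or family in SCHEMA_FAMILIES


def adapter_allows(operation: str, app_token: str, buffer_token: str) -> bool:
    """Reject restricted operations unless the target is the buffer base."""
    return app_token == buffer_token or not is_restricted(operation)
```

A rejected call would return a clear error from the adapter without ever constructing a WriteContext, so the check runs before, not instead of, the SafetyLayer.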

G. Track B — CLI Write Module

G.1 lark_client/writer.py

class LarkWriter:
    def __init__(self, service: LarkWriteService): ...   # DI, no write logic here
    def create(self, ctx, fields: dict) -> WriteOutcome: ...
    def batch_create(self, ctx, records: list[dict]) -> WriteOutcome: ...
    def get(self, ctx, record_id: str) -> dict: ...
    def update(self, ctx, record_id: str, fields: dict) -> WriteOutcome: ...
    def batch_update(self, ctx, records: list[dict]) -> WriteOutcome: ...
    def delete(self, ctx, record_id: str) -> WriteOutcome: ...
    def batch_delete(self, ctx, record_ids: list[str]) -> WriteOutcome: ...

writer.py is a typed façade over service.py (keeps CLI thin, satisfies req §12.8 single-layer rule). Return type always WriteOutcome.

G.2 lark_client/field_manager.py (Sprint 3)

class LarkFieldManager:
    SUPPORTED = {"Text", "Number", "SingleSelect", "Checkbox"}   # req §10 note
    def create(self, ctx, name: str, ftype: str, options: dict | None) -> WriteOutcome: ...
    def update(self, ctx, field_id: str, spec: FieldSpec) -> WriteOutcome: ...
    def delete(self, ctx, field_id: str) -> WriteOutcome: ...   # field_id only, never name (req §12.7)

Complex types (Formula, Lookup, Link) → explicit UnsupportedFieldType until Sprint 4+.

G.3 CLI commands (Click)

cli/commands/records.py:

lark-tool records create <base-key> <table-id> --data '{...}|@file' --approval APR-xxx [--no-dry-run]
lark-tool records get    <base-key> <table-id> <record-id>
lark-tool records update <base-key> <table-id> <record-id> --data '{...}' --approval APR-xxx --no-dry-run --confirm
lark-tool records delete <base-key> <table-id> <record-id> --approval APR-xxx --no-dry-run --confirm
lark-tool records batch-create/-update/-delete <base-key> <table-id> --data @file.jsonl --approval APR-xxx --no-dry-run [--confirm]

cli/commands/fields.py:

lark-tool fields create <base-key> <table-id> --name "X" --type Text --approval APR-xxx --no-dry-run
lark-tool fields update <base-key> <table-id> <field-id> ... --approval APR-xxx --no-dry-run --confirm
lark-tool fields delete <base-key> <table-id> <field-id> --approval APR-xxx --no-dry-run --confirm "tôi hiểu không thể undo"

Conventions: --dry-run default ON (omit --no-dry-run ⇒ dry run); update/delete on non-buffer base require --confirm; field/table delete require the literal acknowledgement string. $LARK_AGENT mandatory for batch (req §12.5), defaults to claude-code for interactive CLI. Exit codes per §B.3. Registered into the existing cli/lark_tool.py Click group (do not fork a new entrypoint).

G.4 config/write-approvals.yaml schema

approvals:
  - id: APR-001
    operation: record.update           # canonical op id
    scope:
      base_key: "65-yeu-cau-thanh-toan"  # explicit; wildcard table only for record.create
      table_id: "tblXXXX"                # required for first write & all update/delete/schema
    one_time_use: true                  # default true; reusable must be explicit + not delete/schema
    used: false
    reason: "Fix sai số tiền dòng 42 theo yêu cầu KT"
    created_by: "Huyên"                 # human-created (req §13.2)
    created_at: "2026-05-19T10:00:00Z"
    expires_at: "2026-05-20T10:00:00Z"
approval_exempt_bases:                   # bypass approval CHECK only; all other layers apply
  - "88-phai-cu-base-dem"

G.5 Integration with existing LarkCore

  • Reuse LarkCore for: GSM token, retry (3× backoff on 429/503/network), global rate lock, endpoint whitelist.
  • Whitelist additions to config/allowed_endpoints.yaml (each = 1 reviewed change, req README §5.2 "write endpoints initially EMPTY"):
    • POST /open-apis/bitable/v1/apps/:app_token/tables/:table_id/records
    • GET .../records/:record_id
    • PUT .../records/:record_id
    • DELETE .../records/:record_id
    • POST .../records/batch_create | batch_update | batch_delete
    • POST .../tables/:table_id/fields · PUT/DELETE .../fields/:field_id (Sprint 3)
    • POST /open-apis/bitable/v1/apps/:app_token/tables · DELETE .../tables/:table_id · view endpoints (Sprint 4)
  • Idempotency: pass client_token/UUID per Lark write API where supported (record create/batch).
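How a whitelist of templated endpoints might be matched against a concrete request is sketched below. The real format of config/allowed_endpoints.yaml and LarkCore's matching logic have not been read (OQ-1), so the (method, template) pair list and both function names are assumptions.

```python
from __future__ import annotations

import re


def template_to_regex(template: str) -> re.Pattern[str]:
    """Turn '/apps/:app_token/tables' into a regex; ':name' matches one segment."""
    parts = [
        "[^/]+" if segment.startswith(":") else re.escape(segment)
        for segment in template.split("/")
    ]
    return re.compile("/".join(parts))


def is_allowed(method: str, path: str, allowed: list[tuple[str, str]]) -> bool:
    """True when (method, path) matches one whitelisted (method, template) pair."""
    return any(
        method.upper() == allowed_method
        and template_to_regex(template).fullmatch(path) is not None
        for allowed_method, template in allowed
    )
```

Using fullmatch means a longer path such as .../records/batch_delete is not implicitly allowed by the shorter .../records entry; each endpoint has to be listed on its own.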

H. Testing Strategy

H.1 Base đệm (test target — CONFIRMED not production)

  • Name: 88 - Phái cử (Base đệm) — registry row 8, role "staging/buffer".
  • app_token: Nf2bb1ExXaYnlksgoyQl72GNgAc
  • Tables: TTS = tblPQ6N79EeOmnTm (7 fields, PK STT); Đơn hàng = tblaU7kxyPTNBSrR (5 fields, PK STT). Duplex link fields between them.
  • Production Base 88 is a DIFFERENT token: YSIkb8PxOaNaozs2vwalOOcagkf (80 tables, "Core"). Tests MUST NOT use this token. A test-time assertion rejects any app_token other than the Base đệm token (req §H, §13.4).

H.2 12 test cases (Base đệm only)

| # | Case | Expect |
|---|---|---|
| T1 | record.create dry-run default | no API call, status=dry_run |
| T2 | record.create --no-dry-run valid approval | record created, audit pre+post present |
| T3 | record.update without --confirm on non-buffer | aborted SafetyViolation |
| T4 | record.update on Base đệm with confirm | updated, GPG backup of old value exists |
| T5 | record.delete one-time approval, reuse same approval | 2nd call → ApprovalError (consumed) |
| T6 | batch_create 600 records | split 500+100, both chunks audited |
| T7 | batch partial failure (1 bad record) | partial_failure, no auto-rollback, rollback cmd printed |
| T8 | approval scope mismatch (wrong table_id) | ApprovalError |
| T9 | wildcard table on record.delete | rejected by wildcard policy |
| T10 | PII payload (CCCD + bank acct) | write proceeds, audit shows metadata only, raw only in GPG backup |
| T11 | audit-pre sink unwritable | API NOT called, abort |
| T12 | audit-post fails after success (inject) | status=success + emergency fallback file written |

H.3 Isolation & mocking

  • Unit tests: mock LarkCore HTTP layer (no real API) — assert SafetyLayer ordering, approval logic, PII metadata, GPG invoked, audit phases. This is the bulk; mirrors existing 19/19 + 8/8 mocked style.
  • Integration tests (T2,T4,T6 subset): real Lark API against Base đệm only, gated behind env LARK_TEST_INTEGRATION=1, hard-asserting the Base đệm token. Base đệm reset by Claude Code is permitted but the reset itself must be audited (req §12.12).
  • app_token literal allowed only in tests/ and bases.yaml (README §6).

I. Sprint Breakdown

Sprint 1 — Track B core (CLI records + safety)

Deliverables: service.py (record ops), writer.py, safety.py (8 layers), ApprovalProvider ABC + YamlApprovalProvider, GPG backup module, 2-phase audit, cli/commands/records.py, config/write-approvals.yaml, config/pii-fields.yaml, PII registry+pattern, whitelist record endpoints.
Acceptance: T1–T12 (record scope) green; existing 19/19+8/8 still pass; no import requests; no hardcoded app_token (pre-commit grep); dry-run default verified; GPG backup decryptable offline with the private key; audit-pre-fail aborts before API.

Sprint 2 — Track A MCP adapter

Deliverables: lark_client/mcp_adapter/ exposing lark_record_get/delete/create/update; Base-đệm hard guard for delete; wired to LarkWriteService.
Acceptance: Cowork can get/delete a record on Base đệm via MCP; MCP delete on a production token is rejected at adapter boundary; all MCP writes show in the same audit stream with agent=cowork-mcp; no write logic duplicated (adapter imports service).

Sprint 3 — Field operations + ApprovalProvider swap

Deliverables: field_manager.py (Text/Number/SingleSelect/Checkbox), cli/commands/fields.py, field endpoints whitelisted, DirectusApprovalProvider prototype injected without touching safety.py.
Acceptance: create/update/delete a Text+Number field on Base đệm; complex types rejected with UnsupportedFieldType; Directus provider passes the same approval contract tests as YAML provider (provider-swap test).

Sprint 4 — Schema ops + monitoring + full integration

Deliverables: table/base create/delete + view list/create/delete, maintenance-window + staging gate for schema ops (req §13.5), monitoring (audit volume / failure-rate alarms, e.g. uptime-kuma push), full end-to-end integration suite.
Acceptance: schema op refuses to run outside declared maintenance window; full T1–T12 + schema cases green on Base đệm; monitoring fires on injected audit-loss; documentation + README §3/§8 updated.


J. Open Questions (resolve before Sprint 1 coding)

  • OQ-1 (BLOCKER, R-1): This design was built from the KB architecture contract, not from reading /opt/incomex/lark-client/ (no shell/file access to that path in this environment). Sprint 1 must begin with a code-reconcile checklist: confirm actual module names/signatures of LarkCore (token method, retry, rate-lock API), Registry/bases.yaml loader, lark_client.exceptions base classes, the Click group in cli/lark_tool.py, and existing test harness conventions. Any deviation from this doc's assumed names is an implementation detail to adjust, not a redesign — but it must be checked first.
  • OQ-2: GPG key — confirm the public-key-only on VPS / private-key-offline model (§E) and who custodies the private key + GSM secret name LARK_BACKUP_GPG_PUBKEY. If Huyên wants on-VPS decryption capability, R-4 mitigation weakens — needs explicit sign-off.
  • OQ-3: PII default — confirm "detect → log metadata → proceed" (NOT block) is the intended behaviour (matches req §6 wording). Decide whether --pii-strict (abort on detection) ships in Sprint 1 or later.
  • OQ-4: Track A topology — keep the existing 9-tool @larksuiteoapi/lark-mcp mounted alongside the custom adapter, or replace it entirely with the custom server? (§F.3)
  • OQ-5: Lark batch hard limit — requirements say 500/batch; confirm against Lark Open API current limit for batch_delete specifically (some endpoints cap lower). Sprint 1 will treat 500 as the configured ceiling, overridable in config.
  • OQ-6 (process): This file was written to /opt/incomex/docs/mcp-writes/s177-architecture-design.md (the only VPS write-allowlisted dir) and was not git-committed (no exec tool / repo not in scope). Huyên or an agent with repo access must move it to the intended path and run the S177-DESIGN: commit.

End of S177 Architecture Design Document — DRAFT awaiting Huyên review on OQ-1…OQ-6.