S177 — Architecture Design Document: Lark Base Controlled CRUD Gateway

Status: DRAFT for Huyên review → then Sprint 1 implementation
Date: 2026-05-19
Source of truth: knowledge/dev/lark/s177-controlled-crud-gateway-requirements-v2.md (problem statement v2.2 FINAL, GPT R3 8.8/10)
Author: Claude Code (Opus 4.7)
Survey basis: KB architecture contract (knowledge/dev/lark/README.md rev3, lark-client-architecture.md S176, lark-base-registry.md, snapshot 88-phai-cu-base-dem.md) + live @larksuiteoapi/lark-mcp tool surface observed in-session. Live source /opt/incomex/lark-client/ was NOT directly read; see §J Open Question OQ-1.

Readback provenance (S177-DESIGN-E1): original written to VPS /opt/incomex/docs/mcp-writes/s177-architecture-design.md; write_file reported 31407 bytes; byte-identical readback ⇒ SHA-256 0440ef92ee9f5355c16902aaf417a346b1b2a97adbd7dded360cf320763639e5, 502 lines. NOT git-committed; final repo path not yet populated.

A. Executive Summary

Scope

Add controlled write capability (records + fields + later tables/views) to the existing read-only lark-client v1.0.0, behind a mandatory 8-layer SafetyLayer, exposed through two tracks that share one Application Service Layer:

Track B (CLI, production-grade, built first): lark-tool records ... / lark-tool fields ...
Track A (MCP, Cowork interactive): adapter over the same service layer; production delete forbidden, Base đệm only.

Architecture (target)

Cowork / MCP (Track A)          Claude Code CLI / Cron (Track B)
        │                                  │
        └────────────────┬─────────────────┘
                         ▼
          Application Service Layer  (lark_client/service.py)
          — single write entrypoint, no duplicated write logic —
                         ▼
              SafetyLayer  (lark_client/safety.py)
   dry-run → approval → backup(GPG) → audit-pre → lock →
   rate-limit → PII-scan → Lark API call → audit-post
                         ▼
                   LarkCore  (existing — GSM token, whitelist, retry)
                         ▼
            Lark Open API  https://open.larksuite.com

Sprint plan (from requirements §10, unchanged)

Sprint	Deliverable	Track
1	`writer.py` + SafetyLayer core + GPG backup + 2-phase audit + tests	B
2	MCP adapter over service layer + `record.get` / `record.delete` MCP	A
3	`field_manager.py` (Text/Number/SingleSelect/Checkbox) + `ApprovalProvider` interface + Directus prototype	B
4	table/base schema ops + monitoring + full integration test	A+B

Top risks identified during survey

#	Risk	Severity	Mitigation
R-1	Design not validated against live source code (could not read `/opt/incomex/lark-client/`)	HIGH	Sprint 1 step 0 = code-reconcile checklist (§J OQ-1) before any new code
R-2	Audit-post failure after a successful destructive API call → write with no trail	HIGH	2-phase audit + emergency fallback log to a second sink (§C.6)
R-3	PII leaking into audit/backup logs	HIGH	2 parallel PII layers + metadata-only audit + GPG-encrypted backup
R-4	GPG private key on VPS → backups decryptable by an attacker who roots the box	HIGH	Public-key-only on VPS; private key offline with Huyên (§E)
R-5	Lark batch partial failure leaves data half-written	MED	Stop-and-report, no auto-rollback, auto-generated manual rollback cmd (req §12.10)
R-6	Track A `lark-mcp` cannot be extended via config → needs custom MCP server	MED	Gap analysis §F decides this in Sprint 2; service layer makes either path cheap
R-7	Cowork/MCP accidentally hitting production Base instead of Base đệm	MED	Hard allowlist: MCP delete/schema path rejects any app_token ≠ Base đệm token (§F, §H)

B. Application Service Layer

File: lark_client/service.py
Principle (req §12.8): CLI and MCP MUST call this layer. No write logic anywhere else. Never import requests; all HTTP via existing LarkCore.

B.1 Interface

from abc import ABC, abstractmethod
from typing import Literal, TypedDict

class WriteOutcome(TypedDict):
    status: Literal["dry_run", "success", "partial_failure", "failed", "aborted"]
    operation: str                 # "record.create" | "record.delete" | "field.create" ...
    base_key: str
    table_id: str
    targets: list[str]             # record_ids / field_ids affected (or planned)
    idempotency_key: str           # UUID v4
    rollback_command: str | None   # auto-generated, printed to stdout
    audit_pre_id: str              # id of the pre-execution audit entry
    audit_post_id: str | None
    pii: dict                      # metadata only (see §D.4)
    error: str | None

class LarkWriteServiceABC(ABC):
    @abstractmethod
    def create_record(self, ctx: WriteContext, fields: dict) -> WriteOutcome: ...
    @abstractmethod
    def batch_create_records(self, ctx: WriteContext, records: list[dict]) -> WriteOutcome: ...
    @abstractmethod
    def get_record(self, ctx: ReadContext, record_id: str) -> dict: ...
    @abstractmethod
    def update_record(self, ctx: WriteContext, record_id: str, fields: dict) -> WriteOutcome: ...
    @abstractmethod
    def batch_update_records(self, ctx: WriteContext, records: list[dict]) -> WriteOutcome: ...
    @abstractmethod
    def delete_record(self, ctx: WriteContext, record_id: str) -> WriteOutcome: ...
    @abstractmethod
    def batch_delete_records(self, ctx: WriteContext, record_ids: list[str]) -> WriteOutcome: ...
    # Sprint 3+
    @abstractmethod
    def create_field(self, ctx: WriteContext, spec: FieldSpec) -> WriteOutcome: ...
    @abstractmethod
    def update_field(self, ctx: WriteContext, field_id: str, spec: FieldSpec) -> WriteOutcome: ...
    @abstractmethod
    def delete_field(self, ctx: WriteContext, field_id: str) -> WriteOutcome: ...
    # Sprint 4
    @abstractmethod
    def create_table(self, ctx: WriteContext, spec: TableSpec) -> WriteOutcome: ...
    @abstractmethod
    def delete_table(self, ctx: WriteContext) -> WriteOutcome: ...
    @abstractmethod
    def list_views(self, ctx: ReadContext) -> list[dict]: ...

WriteContext carries the intent, not the credential:

from dataclasses import dataclass, field
from uuid import uuid4

@dataclass(frozen=True)
class WriteContext:
    base_key: str          # registry key, resolved to app_token via bases.yaml (NEVER hardcoded)
    table_id: str
    operation: str         # canonical op id, drives approval defaults + wildcard policy
    agent: str             # $LARK_AGENT ∈ {claude-code, cowork-mcp, cron}
    approval_id: str
    dry_run: bool = True   # default ON (req §5.1)
    confirmed: bool = False  # --confirm; required for update/delete on production
    idempotency_key: str = field(default_factory=lambda: str(uuid4()))
    is_buffer_base: bool = False  # True iff base_key resolves to Base đệm token

B.2 `LarkWriteService` (concrete, Sprint 1)

Constructed with dependency-injected collaborators (no internal new): LarkWriteService(core: LarkCore, safety: SafetyLayer, registry: Registry)
Resolves base_key → app_token through the existing Registry/bases.yaml SSOT. Raises UnknownBaseError if not in registry (req §12.1, no hardcoded app_token).
Builds the Lark request, then delegates the entire mutating call to SafetyLayer.guard(...) — the service never calls LarkCore write methods directly.
Returns WriteOutcome; never raises for an expected guarded rejection (returns status="aborted" + error); raises only on programming errors.

B.3 Error handling strategy

Class	Example	Behaviour
`ApprovalError`	missing/expired/scope-mismatch/used one-time approval	abort before API, `status=aborted`, exit code 3
`SafetyViolation`	dry-run not run, lock held, PII block, audit-pre fail	abort before API, exit code 3
`LarkApiError`	4xx/5xx from Lark after retries	`status=failed`, audit-post records failure, exit 4
`PartialFailureError`	batch: some items ok, some not	`status=partial_failure`, no auto-rollback, print manual rollback cmd, exit 5 (req §12.10)
`AuditError` (post)	audit-post sink down after API success	`status=success` + warning, emergency fallback log written (§C.6), exit 0 with warning

All errors derive from existing lark_client.exceptions base; add the new subclasses there (do not invent a parallel hierarchy).
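
The exit-code column of the table above can be sketched as a small lookup. This is a minimal illustration, not the implementation: `LarkClientError` is a hypothetical stand-in for the real base class in lark_client.exceptions (not verified against the live source, see OQ-1), and the fallback code 1 for unmapped errors is an assumption.

```python
class LarkClientError(Exception):
    """Hypothetical stand-in for the existing lark_client.exceptions base."""

class ApprovalError(LarkClientError): ...
class SafetyViolation(LarkClientError): ...
class LarkApiError(LarkClientError): ...
class PartialFailureError(LarkClientError): ...
class AuditError(LarkClientError): ...

# Exit codes copied from the B.3 table; AuditError (post) exits 0 with a warning.
EXIT_CODES = {
    ApprovalError: 3,
    SafetyViolation: 3,
    LarkApiError: 4,
    PartialFailureError: 5,
    AuditError: 0,
}

def exit_code_for(exc: Exception) -> int:
    """Return the CLI exit code for a guarded-write error."""
    for cls, code in EXIT_CODES.items():
        if isinstance(exc, cls):
            return code
    return 1  # assumption: generic failure code for anything unmapped
```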

B.4 Rate limit

Reuse the existing LarkCore global file lock /var/lock/lark-api.lock @ 10 req/s (README §4). The service adds batch sizing: split any batch >500 into ≤500-record chunks (req §5.7), each chunk a separate guarded call with its own idempotency sub-key {idempotency_key}#{chunk_index}. Chunk failure → stop, report which chunks committed (R-5).
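
The chunking rule can be illustrated with a short helper. This is a sketch under the assumptions stated in the paragraph above (500-record ceiling, sub-key format `{idempotency_key}#{chunk_index}`); the function name `chunk_batch` is hypothetical.

```python
from typing import Iterator

MAX_BATCH = 500  # req §5.7 ceiling; OQ-5 proposes making it overridable in config

def chunk_batch(records: list, idempotency_key: str,
                size: int = MAX_BATCH) -> Iterator[tuple[str, list]]:
    """Yield (sub_key, chunk) pairs, each chunk holding at most `size` records."""
    for index, start in enumerate(range(0, len(records), size)):
        yield f"{idempotency_key}#{index}", records[start:start + size]
```

Each yielded pair would then be passed through its own guarded call, so a failure in chunk 1 leaves chunk 0 committed and reportable.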

C. SafetyLayer Design

File: lark_client/safety.py
Single public method: guard(ctx: WriteContext, payload, api_call: Callable) -> WriteOutcome
api_call is a zero-arg closure that performs exactly one LarkCore mutating request; SafetyLayer decides if/when to invoke it.

C.1 Execution order (req §4 invariant)

1 dry-run gate      2 approval check    3 backup (GPG)     4 audit-pre
5 lock acquire      6 rate-limit        7 PII scan         8 → api_call()
                                                            9 audit-post
                                                           10 lock release

approval_exempt_bases bypasses layer 2 only; layers 1,3,4,5,6,7,8,9 always run (req §5 note, §13.3).

C.2 Per-layer behaviour & failure mode

#	Layer	Pass condition	Failure mode
1	dry-run gate	if `ctx.dry_run`: build + validate payload, return `status=dry_run` WITHOUT calling API. Real run requires `dry_run=False`; `update/delete` on a non-buffer base also requires `confirmed=True`	not confirmed → `SafetyViolation`, abort
2	approval	`ApprovalProvider.check(ctx)` → valid, unexpired, scope covers `base_key+table_id`, op allowed, wildcard policy ok, one-time not yet consumed	invalid → `ApprovalError`, abort
3	backup	for update/delete: `get_record`(s) BEFORE mutation, serialize, GPG-encrypt, write to backups dir, fsync	encryption/write fail → `SafetyViolation`, abort (never mutate without a backup)
4	audit-pre	append `phase=planned` JSONL entry, fsync, capture entry id	write fail → ABORT, do not call API (req §9)
5	lock	acquire per-record advisory lock `lark-write:{base_key}:{table_id}:{record_id}` (and the global rate lock)	lock held → `SafetyViolation` (concurrent write), abort
6	rate-limit	token-bucket 10 req/s via existing global lock; batch ≤500	exceeded → block/wait, then proceed
7	PII scan	run FieldPIIRegistry + PatternPIIDetector over payload; compute redaction metadata; policy: detection never blocks the write itself — it only controls what audit/backup record (metadata-only)	scanner crash → fail-closed: abort with `SafetyViolation`
8	api_call	invoke the closure once (with idempotency/client_token)	Lark error after retries → `LarkApiError`, jump to audit-post(failed)
9	audit-post	append `phase=success|failed` JSONL entry	write fail after API success → emergency fallback log (§C.6), `status=success` + warning
10	release	always release locks in `finally`	—
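
The ordering in C.1 and C.2 can be sketched as a single function. This is a simplified illustration, not the SafetyLayer itself: `steps` is a hypothetical dict holding one callable per layer, `ctx` is a plain dict, any exception from a pre-API layer simply propagates (the real service maps it to `status=aborted`), and every API exception is treated as a failure.

```python
from typing import Callable

def guard(ctx: dict, steps: dict[str, Callable[[], None]],
          api_call: Callable[[], object]) -> dict:
    """Run the C.1 order once; `steps` holds one hypothetical callable per layer."""
    if ctx["dry_run"]:
        return {"status": "dry_run"}       # layer 1: never reaches the API
    for name in ("approval", "backup", "audit_pre"):
        steps[name]()                      # layers 2-4: an exception aborts before the API
    steps["lock_acquire"]()                # layer 5
    try:
        steps["rate_limit"]()              # layer 6
        steps["pii_scan"]()                # layer 7
        try:
            api_call()                     # layer 8: invoked exactly once
            status = "success"
        except Exception:
            status = "failed"
        steps["audit_post"]()              # layer 9: runs for success and failure
        return {"status": status}
    finally:
        steps["lock_release"]()            # layer 10: always released
```

Keeping the lock release in `finally` means a crash in the PII scan or the audit-post step still frees the per-record lock.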

C.3 ApprovalProvider — dependency-injected (req §7, §13.11)

class ApprovalProvider(ABC):
    @abstractmethod
    def check(self, ctx: WriteContext) -> ApprovalDecision: ...
    @abstractmethod
    def consume(self, ctx: WriteContext, approval_id: str) -> None: ...  # one-time-use marking

class YamlApprovalProvider(ApprovalProvider):   # Sprint 1–2
    def __init__(self, path="config/write-approvals.yaml"): ...

class DirectusApprovalProvider(ApprovalProvider):  # Sprint 3+ prototype
    ...

SafetyLayer.__init__(self, *, approval_provider: ApprovalProvider, ...) — SafetyLayer never imports YamlApprovalProvider. Wiring happens in a composition root (lark_client/factory.py or CLI bootstrap). Swapping YAML→Directus must not touch safety.py.

C.4 Wildcard / first-write policy (req §7, §13.8, §13.10)

Enforced inside layer 2 from a static table:

Operation	Wildcard table allowed?
record.create	✅ (within scope+expiry)
record.update / delete	❌
field.create/update	❌
field/table delete	❌ (break-glass, explicit)

First write to any specific base/table → approval scope MUST name explicit base_key+table_id; wildcard rejected regardless of operation.
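
The static table and the first-write rule could be encoded as below. This is a sketch only: the canonical op ids `field.delete` and `table.delete` are assumed spellings, and `wildcard_permitted` is a hypothetical helper name.

```python
# Static table from C.4: only record.create may use a wildcard table scope.
WILDCARD_ALLOWED = {
    "record.create": True,
    "record.update": False,
    "record.delete": False,
    "field.create": False,
    "field.update": False,
    "field.delete": False,   # assumed op id (break-glass)
    "table.delete": False,   # assumed op id (break-glass)
}

def wildcard_permitted(operation: str, is_first_write: bool) -> bool:
    """Return True if a wildcard-table approval may authorize this write."""
    if is_first_write:
        return False  # first write must name base_key + table_id explicitly
    return WILDCARD_ALLOWED.get(operation, False)  # unknown ops are denied
```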

C.5 Approval defaults (req §8, baked into provider validation)

record.create reusable-within-expiry (narrow scope only); record.update one-time; record.delete one-time mandatory; field.create/update one-time mandatory; field/table delete break-glass one-time mandatory. Reusable must be explicit and is forbidden for delete/schema ops.

C.6 Audit 2-phase + emergency fallback (req §9)

Primary sink: /var/log/lark-ops/YYYYMMDD.jsonl (existing audit stream, append+fsync).
Phase 1 (pre): entry {phase:"planned", op, base_key, table_id, targets, agent, approval_id, idempotency_key, ts}. Fsync fail → abort, API not called.
Phase 2 (post): entry {phase:"success"|"failed", ...same id..., lark_response_meta, pii:{...}}.
Emergency fallback: if the phase-2 write fails after a successful API call, write to an independent sink /var/log/lark-ops/EMERGENCY/<ts>-<idempotency_key>.json (separate file, separate fd) AND emit WriteOutcome.status="success" with error="audit_post_degraded". Never silently swallow. If even the emergency sink fails → also stderr-print a structured LARK-AUDIT-LOST line so cron/CI capture it.
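
The fallback chain for the post-execution audit entry can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the entry is assumed to carry `ts` and `idempotency_key`, and the entry must already be metadata-only (no raw PII) before it reaches this code.

```python
import json
import os
import sys
from pathlib import Path

def append_fsync(path: Path, entry: dict) -> None:
    """Append one JSON line and force it to disk."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
        fh.flush()
        os.fsync(fh.fileno())

def audit_post(primary: Path, emergency_dir: Path, entry: dict) -> str:
    """Return which sink took the entry: 'primary', 'emergency', or 'lost'."""
    try:
        append_fsync(primary, entry)
        return "primary"
    except OSError:
        pass  # primary sink down: fall through to the independent sink
    try:
        emergency_dir.mkdir(parents=True, exist_ok=True)
        name = f"{entry['ts']}-{entry['idempotency_key']}.json"
        append_fsync(emergency_dir / name, entry)
        return "emergency"
    except OSError:
        # Last resort: a structured line on stderr for cron/CI to capture.
        print("LARK-AUDIT-LOST " + json.dumps(entry), file=sys.stderr)
        return "lost"
```

A caller would map `"emergency"` and `"lost"` to `error="audit_post_degraded"` while keeping `status="success"`.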

D. PII Protection (req §6)

Two layers run in parallel (both always active — 18 bases built by many people, unknown PII fields).

D.1 `FieldPIIRegistry` (whitelist)

Structure: config/pii-fields.yaml

bases:
  "65-yeu-cau-thanh-toan":
    tblXXXX:
      fldYYYY: { type: national_id, label: "CMND/CCCD" }
      fldZZZZ: { type: bank_account }

Loaded once at service init; keyed by (base_key, table_id, field_id) — field_id, never name (req §12.7). Seed from the S176 schema snapshots; growable by PR.
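
The tuple-keyed lookup could be built by flattening the parsed YAML once at init. A minimal sketch, assuming the file has already been parsed into a nested dict; `flatten_registry` is a hypothetical name.

```python
def flatten_registry(config: dict) -> dict[tuple[str, str, str], dict]:
    """Flatten bases -> table -> field into a (base_key, table_id, field_id) lookup."""
    lookup = {}
    for base_key, tables in config.get("bases", {}).items():
        for table_id, fields in tables.items():
            for field_id, meta in fields.items():
                lookup[(base_key, table_id, field_id)] = meta
    return lookup
```

A payload field is then registry-flagged when `(base_key, table_id, field_id) in lookup`, which keeps the check independent of field names.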

D.2 `PatternPIIDetector` (regex, for unknown/legacy fields)

VN-specific patterns (ordered, longest/most-specific first to reduce false positives):

Type	Pattern (anchored on token boundaries)	Note
`national_id_cccd`	`\b\d{12}\b`	CCCD 12 digits — match before phone
`national_id_cmnd`	`\b\d{9}\b`	CMND 9 digits
`passport`	`\b[A-Z]{1,2}\d{7}\b`	e.g. `B1234567`, `C1234567`
`phone_vn`	`\b(?:\+84|0)(?:3…`	remainder of pattern truncated in source
`bank_account`	`\b\d{8,16}\b`	heuristic — high false-positive; only flag, never auto-mutate
`email`	RFC-lite `\b[\w.+-]+@[\w-]+\.[\w.-]+\b`	optional, low risk

Detector returns types + counts only, never the matched substrings.
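
A counts-only detector with ordered matching might look like the sketch below. It is illustrative, not the planned module: the `phone_vn` pattern is an assumption (the table row does not give a complete pattern), the other patterns are copied from the table, and claiming matched spans is one possible way to honour the "most specific first" ordering.

```python
import re
from collections import Counter

# Ordered most-specific first, as in the table above.
PATTERNS = [
    ("national_id_cccd", re.compile(r"\b\d{12}\b")),
    ("national_id_cmnd", re.compile(r"\b\d{9}\b")),
    ("passport", re.compile(r"\b[A-Z]{1,2}\d{7}\b")),
    ("phone_vn", re.compile(r"(?<!\w)(?:\+84|0)[35789]\d{8}\b")),  # assumed pattern
    ("bank_account", re.compile(r"\b\d{8,16}\b")),
    ("email", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")),
]

def detect_pii(text: str) -> dict[str, int]:
    """Return {type: count}; matched substrings are never stored or returned."""
    claimed: list[tuple[int, int]] = []  # spans already attributed to an earlier pattern
    counts: Counter = Counter()
    for pii_type, pattern in PATTERNS:
        for m in pattern.finditer(text):
            if any(m.start() < end and start < m.end() for start, end in claimed):
                continue  # overlaps a more specific match; do not double-count
            claimed.append(m.span())
            counts[pii_type] += 1
    return dict(counts)
```

Without the span check, a 12-digit CCCD would also be counted as a `bank_account`, which is the false-positive risk the table notes.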

D.3 Pipeline integration

Both layers feed SafetyLayer layer 7. Union of (registry hits ∪ pattern hits) → redaction_types, redacted_fields_count. Policy decision (matches req §6): PII presence does NOT block the write; it governs what gets logged (metadata-only audit) and ensures the GPG backup (which does contain raw old values) is encrypted. A --pii-strict mode (off by default) MAY be added to abort on detection; the default is log and proceed. This default is flagged in §J OQ-3 for Huyên's confirmation.

D.4 Audit redaction format (req §6)

{ "pii_redacted": true,
  "redaction_types": ["national_id_cccd","bank_account"],
  "redacted_fields_count": 3,
  "detector": ["registry","pattern"] }

Raw values appear ONLY inside the GPG-encrypted backup blob — never in JSONL, never in stdout, never in rollback command (rollback cmd references the encrypted backup file path, not inline values).
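
One way to merge the two detector layers into the metadata block above is sketched here. The input shapes are assumptions: `registry_hits` maps field_id to a PII type, `pattern_hits` maps field_id to a set of types, and `redacted_fields_count` is taken to mean the number of distinct flagged fields.

```python
def build_pii_metadata(registry_hits: dict[str, str],
                       pattern_hits: dict[str, set[str]]) -> dict:
    """Merge both detector layers into the metadata-only audit block (no raw values)."""
    types = set(registry_hits.values())
    for found in pattern_hits.values():
        types |= found
    fields = set(registry_hits) | set(pattern_hits)
    detectors = [name for name, hits in
                 (("registry", registry_hits), ("pattern", pattern_hits)) if hits]
    return {
        "pii_redacted": bool(fields),
        "redaction_types": sorted(types),
        "redacted_fields_count": len(fields),
        "detector": detectors,
    }
```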

E. GPG Backup Design (req §6, §13.6 — mandatory from Sprint 1)

E.1 Key source — GSM, public-key-only on VPS

Consistent with the golden rule "1 credential, GSM SSOT, never hardcode":

New GSM secret in project github-chatgpt-ggcloud: LARK_BACKUP_GPG_PUBKEY = ASCII-armored public key.
VPS fetches it via the same LarkCore/secret path used for LARK_APP_* (do not read GSM directly in business code — extend the existing secret accessor).
Private key is NEVER on the VPS. Held offline by Huyên (hardware token / offline keyring). VPS can encrypt, cannot decrypt → a rooted VPS still cannot read PII backups (mitigates R-4).

E.2 Rotation policy

Rotate annually, or immediately on suspected compromise.
Procedure: generate new keypair offline → publish new public key as a new GSM version of LARK_BACKUP_GPG_PUBKEY → service picks up latest on next start → old backups remain decryptable with the retired private key (retain old private keys offline, indexed by fingerprint). Each backup file records the encrypting key fingerprint in its sidecar metadata.

E.3 File naming & storage

/var/log/lark-ops/writes/<YYYYMMDD>/
  <base_key>__<table_id>__<record_id>__<idempotency_key>__pre.json.gpg
  <base_key>__<table_id>__<record_id>__<idempotency_key>__pre.meta.json   # unencrypted: key fp, ts, op, NO pii

Batch: one .json.gpg per chunk, records concatenated as JSON lines before encryption.
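
The naming scheme above can be expressed as a small path builder. A sketch only; `backup_paths` is a hypothetical helper, and the root directory is taken from the layout shown in this section.

```python
from datetime import date
from pathlib import Path

BACKUP_ROOT = Path("/var/log/lark-ops/writes")

def backup_paths(base_key: str, table_id: str, record_id: str,
                 idempotency_key: str, day: date) -> tuple[Path, Path]:
    """Return (encrypted blob path, plaintext sidecar path) following the E.3 layout."""
    stem = "__".join([base_key, table_id, record_id, idempotency_key, "pre"])
    folder = BACKUP_ROOT / day.strftime("%Y%m%d")
    return folder / f"{stem}.json.gpg", folder / f"{stem}.meta.json"
```

Because the idempotency key is part of the file name, recovery can locate the blob from the audit-pre entry alone.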

E.4 Recovery procedure

Locate backup file by idempotency_key (also recorded in the audit-pre entry).
Read sidecar .meta.json → confirm key fingerprint.
On Huyên's offline machine: gpg --decrypt <file>.json.gpg > restored.json.
Re-apply via lark-tool records update ... --data @restored.json --approval <new APR> --no-dry-run --confirm (recovery is itself a guarded write — fully audited, no special bypass).

F. Track A — MCP Plugin

F.1 Survey result: `@larksuiteoapi/lark-mcp`

Could not run npm list on the VPS (no exec tool). Authoritative substitute: the live lark-mcp tool surface bound to this very session, which is the same @larksuiteoapi/lark-mcp plugin. Exactly 9 bitable tools exposed:

Available now (9)	Category
`bitable_v1_app_create`	Base app
`bitable_v1_appTable_create`, `bitable_v1_appTable_list`	Table
`bitable_v1_appTableField_list`	Field (read)
`bitable_v1_appTableRecord_search`	Record (read)
`bitable_v1_appTableRecord_create`, `_batchCreate`	Record (create)
`bitable_v1_appTableRecord_update`, `_batchUpdate`	Record (update)

This exactly matches requirements §2 ("MCP plugin 9 tools — Create + Read + Update, NO Delete, NO field management").

F.2 Gap analysis

Needed	In plugin?	Decision
`record.get` by id	❌ (only `search`)	must add
`record.delete` / `batchDelete`	❌	must add
`field.create/update/delete`	❌ (only `field_list`)	must add
`appTable.update/delete`	❌ (only create/list)	must add (Sprint 4)
`view.list/create/delete`	❌	must add (Sprint 4)

The published @larksuiteoapi/lark-mcp exposes a fixed tool set; missing operations are not togglable via config (the plugin simply does not implement delete/field-mgmt tools). Conclusion: a thin custom MCP server is required for Track A — but it must NOT re-implement Lark calls. It is an adapter that imports lark_client.service.LarkWriteService and exposes new MCP tools.

F.3 Custom MCP adapter design (Sprint 2)

New small server lark_client/mcp_adapter/ (Python, MCP SDK) OR extend if an internal MCP host exists — adapter only, zero write logic.
Tools exposed (Sprint 2 scope): lark_record_get, lark_record_delete, lark_record_create, lark_record_update. Sprint 4: field/view tools.
Every tool builds a WriteContext with agent="cowork-mcp" and calls LarkWriteService.
Hard guard (R-7): the adapter resolves base_key; if the resolved app_token ≠ Base đệm token Nf2bb1ExXaYnlksgoyQl72GNgAc, any delete / schema op is rejected at the adapter boundary with a clear error (req §11, §13.4: Cowork/MCP delete = Base đệm only). Production create/update requests that arrive via MCP go through the same adapter and still pass the full SafetyLayer.
Auth: reuses GSM LARK_APP_* through LarkCore. No new bot, no new credential (req §12.2).
The existing 9-tool @larksuiteoapi/lark-mcp may remain mounted for read/create/update interactive use; the custom adapter only fills the gap. Final mount topology = §J OQ-4.
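
The adapter-boundary hard guard could be as small as the check below. A sketch under assumptions: the op ids in `BUFFER_ONLY_OPS` are hypothetical spellings of the delete and schema operations, and `is_buffer_base` is assumed to be derived from the registry lookup rather than from a token literal in code.

```python
# Assumed canonical op ids for operations that MCP may run on Base đệm only.
BUFFER_ONLY_OPS = {
    "record.delete", "record.batch_delete",
    "field.create", "field.update", "field.delete",
    "table.create", "table.delete",
}

def mcp_operation_allowed(operation: str, is_buffer_base: bool) -> bool:
    """Adapter-boundary check: delete and schema ops are allowed on Base đệm only."""
    if is_buffer_base:
        return True
    return operation not in BUFFER_ONLY_OPS
```

A rejected call never reaches LarkWriteService; an allowed one still passes every SafetyLayer step.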

G. Track B — CLI Write Module

G.1 `lark_client/writer.py`

class LarkWriter:
    """Typed façade over LarkWriteService; holds no write logic of its own."""
    def __init__(self, service: LarkWriteService): ...   # DI, no write logic here
    def create(self, ctx, fields: dict) -> WriteOutcome: ...
    def batch_create(self, ctx, records: list[dict]) -> WriteOutcome: ...
    def get(self, ctx, record_id: str) -> dict: ...
    def update(self, ctx, record_id: str, fields: dict) -> WriteOutcome: ...
    def batch_update(self, ctx, records: list[dict]) -> WriteOutcome: ...
    def delete(self, ctx, record_id: str) -> WriteOutcome: ...
    def batch_delete(self, ctx, record_ids: list[str]) -> WriteOutcome: ...

writer.py is a typed façade over service.py (keeps CLI thin, satisfies req §12.8 single-layer rule). Return type always WriteOutcome.

G.2 `lark_client/field_manager.py` (Sprint 3)

class LarkFieldManager:
    SUPPORTED = {"Text", "Number", "SingleSelect", "Checkbox"}   # req §10 note
    def create(self, ctx, name: str, ftype: str, options: dict | None) -> WriteOutcome: ...
    def update(self, ctx, field_id: str, spec: FieldSpec) -> WriteOutcome: ...
    def delete(self, ctx, field_id: str) -> WriteOutcome: ...   # field_id only, never name (req §12.7)

Complex types (Formula, Lookup, Link) → explicit UnsupportedFieldType until Sprint 4+.

G.3 CLI commands (Click)

cli/commands/records.py:

lark-tool records create <base-key> <table-id> --data '{...}|@file' --approval APR-xxx [--no-dry-run]
lark-tool records get    <base-key> <table-id> <record-id>
lark-tool records update <base-key> <table-id> <record-id> --data '{...}' --approval APR-xxx --no-dry-run --confirm
lark-tool records delete <base-key> <table-id> <record-id> --approval APR-xxx --no-dry-run --confirm
lark-tool records batch-create/-update/-delete <base-key> <table-id> --data @file.jsonl --approval APR-xxx --no-dry-run [--confirm]

cli/commands/fields.py:

lark-tool fields create <base-key> <table-id> --name "X" --type Text --approval APR-xxx --no-dry-run
lark-tool fields update <base-key> <table-id> <field-id> ... --approval APR-xxx --no-dry-run --confirm
lark-tool fields delete <base-key> <table-id> <field-id> --approval APR-xxx --no-dry-run --confirm "tôi hiểu không thể undo"

Conventions: --dry-run default ON (omit --no-dry-run ⇒ dry run); update/delete on non-buffer base require --confirm; field/table delete require the literal acknowledgement string "tôi hiểu không thể undo" (Vietnamese for "I understand this cannot be undone"). $LARK_AGENT mandatory for batch (req §12.5), defaults to claude-code for interactive CLI. Exit codes per §B.3. Registered into the existing cli/lark_tool.py Click group (do not fork a new entrypoint).

G.4 `config/write-approvals.yaml` schema

approvals:
  - id: APR-001
    operation: record.update           # canonical op id
    scope:
      base_key: "65-yeu-cau-thanh-toan"  # explicit; wildcard table only for record.create
      table_id: "tblXXXX"                # required for first write & all update/delete/schema
    one_time_use: true                  # default true; reusable must be explicit + not delete/schema
    used: false
    reason: "Fix wrong amount on row 42 per Accounting (KT) request"
    created_by: "Huyên"                 # human-created (req §13.2)
    created_at: "2026-05-19T10:00:00Z"
    expires_at: "2026-05-20T10:00:00Z"
approval_exempt_bases:                   # bypass approval CHECK only; all other layers apply
  - "88-phai-cu-base-dem"
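
A provider's validation of one entry from this schema might follow the sketch below. It is illustrative only: the function name is hypothetical, the entry is assumed to be the parsed YAML mapping, `now` must be timezone-aware, and wildcard scopes and approval_exempt_bases are left out for brevity.

```python
from datetime import datetime, timezone
from typing import Optional

def approval_problem(entry: dict, operation: str, base_key: str,
                     table_id: str, now: datetime) -> Optional[str]:
    """Return a rejection reason, or None when the entry covers this write."""
    if entry["operation"] != operation:
        return "operation mismatch"
    scope = entry["scope"]
    if scope["base_key"] != base_key or scope.get("table_id") != table_id:
        return "scope mismatch"
    # "Z" suffix rewritten so datetime.fromisoformat accepts it on older Pythons.
    expires = datetime.fromisoformat(entry["expires_at"].replace("Z", "+00:00"))
    if now >= expires:
        return "expired"
    if entry.get("one_time_use", True) and entry.get("used", False):
        return "already consumed"
    return None
```

Returning a reason string rather than a boolean lets the caller put the cause into the `ApprovalError` message and the audit entry.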

G.5 Integration with existing `LarkCore`

Reuse LarkCore for: GSM token, retry (3× backoff on 429/503/network), global rate lock, endpoint whitelist.
Whitelist additions to config/allowed_endpoints.yaml (each = 1 reviewed change, req README §5.2 "write endpoints initially EMPTY"):
- POST /open-apis/bitable/v1/apps/:app_token/tables/:table_id/records
- GET .../records/:record_id
- PUT .../records/:record_id
- DELETE .../records/:record_id
- POST .../records/batch_create | batch_update | batch_delete
- POST .../tables/:table_id/fields · PUT/DELETE .../fields/:field_id (Sprint 3)
- POST /open-apis/bitable/v1/apps/:app_token/tables · DELETE .../tables/:table_id · view endpoints (Sprint 4)
Idempotency: pass client_token/UUID per Lark write API where supported (record create/batch).
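
Matching a request against `:param`-style templates like those listed above can be sketched as follows. The actual format of config/allowed_endpoints.yaml and the existing LarkCore whitelist check were not read (OQ-1), so the (method, template) pair layout and both function names are assumptions.

```python
import re

def template_to_regex(template: str) -> re.Pattern:
    """Turn '/apps/:app_token/tables' into an anchored regex; :name matches one segment."""
    parts = [r"[^/]+" if seg.startswith(":") else re.escape(seg)
             for seg in template.split("/")]
    return re.compile("^" + "/".join(parts) + "$")

def is_allowed(method: str, path: str, allowlist: list[tuple[str, str]]) -> bool:
    """Check (method, path) against (method, template) pairs; unlisted means denied."""
    return any(m == method.upper() and template_to_regex(t).match(path)
               for m, t in allowlist)
```

Anchoring the regex at both ends keeps a whitelisted collection endpoint from also admitting its per-record sub-paths.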

H. Testing Strategy

H.1 Base đệm (test target — CONFIRMED not production)

Name: 88 - Phái cử (Base đệm) — registry row 8, role "staging/buffer".
app_token: Nf2bb1ExXaYnlksgoyQl72GNgAc
Tables: TTS = tblPQ6N79EeOmnTm (7 fields, PK STT); Đơn hàng = tblaU7kxyPTNBSrR (5 fields, PK STT). Duplex link fields between them.
Production Base 88 is a DIFFERENT token: YSIkb8PxOaNaozs2vwalOOcagkf (80 tables, "Core"). Tests MUST NOT use this token. A test-time assertion rejects any app_token other than the Base đệm token (req §H, §13.4).

H.2 12 test cases (Base đệm only)

#	Case	Expect
T1	record.create dry-run default	no API call, `status=dry_run`
T2	record.create `--no-dry-run` valid approval	record created, audit pre+post present
T3	record.update without `--confirm` on non-buffer	aborted `SafetyViolation`
T4	record.update on Base đệm with confirm	updated, GPG backup of old value exists
T5	record.delete one-time approval, reuse same approval	2nd call → `ApprovalError` (consumed)
T6	batch_create 600 records	split 500+100, both chunks audited
T7	batch partial failure (1 bad record)	`partial_failure`, no auto-rollback, rollback cmd printed
T8	approval scope mismatch (wrong table_id)	`ApprovalError`
T9	wildcard table on record.delete	rejected by wildcard policy
T10	PII payload (CCCD + bank acct)	write proceeds, audit shows metadata only, raw only in GPG backup
T11	audit-pre sink unwritable	API NOT called, abort
T12	audit-post fails after success (inject)	`status=success` + emergency fallback file written

H.3 Isolation & mocking

Unit tests: mock LarkCore HTTP layer (no real API) — assert SafetyLayer ordering, approval logic, PII metadata, GPG invoked, audit phases. This is the bulk; mirrors existing 19/19 + 8/8 mocked style.
Integration tests (T2,T4,T6 subset): real Lark API against Base đệm only, gated behind env LARK_TEST_INTEGRATION=1, hard-asserting the Base đệm token. Base đệm reset by Claude Code is permitted but the reset itself must be audited (req §12.12).
app_token literal allowed only in tests/ and bases.yaml (README §6).

I. Sprint Breakdown

Sprint 1 — Track B core (CLI records + safety)

Deliverables: service.py (record ops), writer.py, safety.py (8 layers), ApprovalProvider ABC + YamlApprovalProvider, GPG backup module, 2-phase audit, cli/commands/records.py, config/write-approvals.yaml, config/pii-fields.yaml, PII registry+pattern, whitelist record endpoints.
Acceptance: T1–T12 (record scope) green; existing 19/19+8/8 still pass; no import requests; no hardcoded app_token (pre-commit grep); dry-run default verified; GPG backup decryptable offline with the private key; audit-pre-fail aborts before API.

Sprint 2 — Track A MCP adapter

Deliverables: lark_client/mcp_adapter/ exposing lark_record_get/delete/create/update; Base-đệm hard guard for delete; wired to LarkWriteService.
Acceptance: Cowork can get/delete a record on Base đệm via MCP; MCP delete on a production token is rejected at adapter boundary; all MCP writes show in the same audit stream with agent=cowork-mcp; no write logic duplicated (adapter imports service).

Sprint 3 — Field operations + ApprovalProvider swap

Deliverables: field_manager.py (Text/Number/SingleSelect/Checkbox), cli/commands/fields.py, field endpoints whitelisted, DirectusApprovalProvider prototype injected without touching safety.py.
Acceptance: create/update/delete a Text+Number field on Base đệm; complex types rejected with UnsupportedFieldType; Directus provider passes the same approval contract tests as YAML provider (provider-swap test).

Sprint 4 — Schema ops + monitoring + full integration

Deliverables: table/base create/delete + view list/create/delete, maintenance-window + staging gate for schema ops (req §13.5), monitoring (audit volume / failure-rate alarms, e.g. uptime-kuma push), full end-to-end integration suite.
Acceptance: schema op refuses to run outside declared maintenance window; full T1–T12 + schema cases green on Base đệm; monitoring fires on injected audit-loss; documentation + README §3/§8 updated.

J. Open Questions (resolve before Sprint 1 coding)

OQ-1 (BLOCKER, R-1): This design was built from the KB architecture contract, not from reading /opt/incomex/lark-client/ (no shell/file access to that path in this environment). Sprint 1 must begin with a code-reconcile checklist: confirm actual module names/signatures of LarkCore (token method, retry, rate-lock API), Registry/bases.yaml loader, lark_client.exceptions base classes, the Click group in cli/lark_tool.py, and existing test harness conventions. Any deviation from this doc's assumed names is an implementation detail to adjust, not a redesign — but it must be checked first.
OQ-2: GPG key — confirm the public-key-only on VPS / private-key-offline model (§E) and who custodies the private key + GSM secret name LARK_BACKUP_GPG_PUBKEY. If Huyên wants on-VPS decryption capability, R-4 mitigation weakens — needs explicit sign-off.
OQ-3: PII default — confirm "detect → log metadata → proceed" (NOT block) is the intended behaviour (matches req §6 wording). Decide whether --pii-strict (abort on detection) ships in Sprint 1 or later.
OQ-4: Track A topology — keep the existing 9-tool @larksuiteoapi/lark-mcp mounted alongside the custom adapter, or replace it entirely with the custom server? (§F.3)
OQ-5: Lark batch hard limit — requirements say 500/batch; confirm against Lark Open API current limit for batch_delete specifically (some endpoints cap lower). Sprint 1 will treat 500 as the configured ceiling, overridable in config.
OQ-6 (process): This file was written to /opt/incomex/docs/mcp-writes/s177-architecture-design.md (the only VPS write-allowlisted dir) and was not git-committed (no exec tool / repo not in scope). Huyên or an agent with repo access must move it to the intended path and run the S177-DESIGN: commit.

End of S177 Architecture Design Document — DRAFT awaiting Huyên review on OQ-1…OQ-6.