S177 — Architecture Design Document: Lark Base Controlled CRUD Gateway — PATCH1

Status: DRAFT v1.1 (PATCH1) for Huyên review → then S177-R0 Code Reconcile → Sprint 1
Date: 2026-05-19
Supersedes: s177-architecture-design-2026-05-19.md (base, SHA-256 0440ef92…3639e5, 31407 B, 502 ln)
Source of truth: knowledge/dev/lark/s177-controlled-crud-gateway-requirements-v2.md (requirements brief v2.2 FINAL, GPT R3 8.8/10)
Author: Claude Code (Opus 4.7)

PATCH1 changelog (8 changes): P1 atomic approval check-and-consume · P2 MCP topology hardening · P3 configurable API limits + batch_delete default 100 · P4 $LARK_AGENT write/dry-run nuance · P5 new S177-R0 Code Reconcile gate · P6 PII policy nuance (block plaintext/export paths) · P7 orphan-backup handling · P8 Sprint 1 test split (unit-before-commit vs gated integration). Sections changed: A, B.4, C.2, C.3, C.6, D.3, E (new E.5), F.1, F.3, G.3, G.5, H.3, I (new Sprint 0), J.

A. Executive Summary

Scope

Add controlled write capability (records + fields + later tables/views) to the existing read-only lark-client v1.0.0, behind a mandatory 8-layer SafetyLayer, exposed through two tracks that share one Application Service Layer:

Track B (CLI, production-grade, built first): lark-tool records ... / lark-tool fields ...
Track A (MCP, Cowork interactive): custom adapter over the same service layer; delete is forbidden on production and allowed on Base đệm (the buffer base) only.

Architecture (target)

Cowork / MCP (Track A)          Claude Code CLI / Cron (Track B)
        │                                  │
        └────────────────┬─────────────────┘
                         ▼
          Application Service Layer  (lark_client/service.py)
          — single write entrypoint, no duplicated write logic —
                         ▼
              SafetyLayer  (lark_client/safety.py)
   dry-run → approval(atomic) → backup(GPG) → audit-pre → lock →
   rate-limit → PII-scan → Lark API call → audit-post
                         ▼
                   LarkCore  (existing — GSM token, whitelist, retry)
                         ▼
            Lark Open API  https://open.larksuite.com

Sprint plan (PATCH1: Sprint 0 added)

Sprint	Deliverable	Track
0 (S177-R0)	Code Reconcile — inspect live source, confirm assumptions, STOP on material drift (no code written)	—
1	`writer.py` + SafetyLayer core + atomic approval + GPG backup + 2-phase audit + unit tests	B
2	Custom MCP adapter over service layer + `record.get`/`delete` MCP	A
3	`field_manager.py` (Text/Number/SingleSelect/Checkbox) + `ApprovalProvider` swap + Directus prototype	B
4	table/base schema ops + monitoring + full integration test	A+B

Top risks

#	Risk	Sev	Mitigation
R-1	Design not validated vs live source	HIGH	S177-R0 Code Reconcile gate (Sprint 0) before any code (§I.0, OQ-1)
R-2	Audit-post fail after destructive API success	HIGH	2-phase audit + emergency fallback sink (§C.6)
R-3	PII leaking into logs/exports	HIGH	2 parallel PII layers; metadata-only audit; block non-GPG egress (§D.3)
R-4	GPG private key on VPS	HIGH	Public-key-only on VPS; private key offline (§E)
R-5	Batch partial failure / oversized batch	MED	Configurable limits, batch_delete default 100, stop-and-report (§B.4,§G.5)
R-6	Plugin write tools bypass SafetyLayer	HIGH (raised)	Plugin = read/list/search only; production create/update via custom adapter; replace plugin if hiding impossible (§F)
R-7	Cowork/MCP hitting production not Base đệm	MED	Hard app_token allowlist at adapter boundary (§F.3,§H)
R-8	Approval one-time double-spend under concurrency	MED (new)	Atomic file-locked check-and-consume, single winner (§C.3)
R-9	Orphan GPG backup if audit-pre fails after backup	LOW (new)	Delete-if-safe else mark orphaned in cleanup log (§C.6,§E.5)

B. Application Service Layer

File: lark_client/service.py
Principle (req §12.8): CLI and MCP MUST call this layer. No write logic anywhere else. Never import requests; all HTTP goes through the existing LarkCore.

B.1 Interface

from typing import Literal, TypedDict

class WriteOutcome(TypedDict):
    status: Literal["dry_run", "success", "partial_failure", "failed", "aborted"]
    operation: str
    base_key: str
    table_id: str
    targets: list[str]
    idempotency_key: str           # UUID v4
    rollback_command: str | None   # auto-generated; NEVER contains raw PII (refs encrypted backup path)
    audit_pre_id: str
    audit_post_id: str | None
    pii: dict                      # metadata only (see §D.4)
    error: str | None

from abc import ABC, abstractmethod

class LarkWriteServiceABC(ABC):
    @abstractmethod
    def create_record(self, ctx, fields: dict) -> WriteOutcome: ...
    @abstractmethod
    def batch_create_records(self, ctx, records: list[dict]) -> WriteOutcome: ...
    @abstractmethod
    def get_record(self, ctx, record_id: str) -> dict: ...
    @abstractmethod
    def update_record(self, ctx, record_id: str, fields: dict) -> WriteOutcome: ...
    @abstractmethod
    def batch_update_records(self, ctx, records: list[dict]) -> WriteOutcome: ...
    @abstractmethod
    def delete_record(self, ctx, record_id: str) -> WriteOutcome: ...
    @abstractmethod
    def batch_delete_records(self, ctx, record_ids: list[str]) -> WriteOutcome: ...
    # Sprint 3+
    @abstractmethod
    def create_field(self, ctx, spec) -> WriteOutcome: ...
    @abstractmethod
    def update_field(self, ctx, field_id: str, spec) -> WriteOutcome: ...
    @abstractmethod
    def delete_field(self, ctx, field_id: str) -> WriteOutcome: ...
    # Sprint 4
    @abstractmethod
    def create_table(self, ctx, spec) -> WriteOutcome: ...
    @abstractmethod
    def delete_table(self, ctx) -> WriteOutcome: ...
    @abstractmethod
    def list_views(self, ctx) -> list[dict]: ...

WriteContext carries the intent, not the credential:

from dataclasses import dataclass, field
from uuid import uuid4

@dataclass(frozen=True)
class WriteContext:
    base_key: str
    table_id: str
    operation: str
    agent: str             # $LARK_AGENT ∈ {claude-code, cowork-mcp, cron}; see §G.3
    approval_id: str
    dry_run: bool = True
    confirmed: bool = False
    idempotency_key: str = field(default_factory=lambda: str(uuid4()))
    is_buffer_base: bool = False

B.2 `LarkWriteService` (concrete, Sprint 1)

Same as base: DI of (core, safety, registry); base_key→app_token via Registry/bases.yaml (no hardcode, req §12.1); delegates the entire mutating call to SafetyLayer.guard(...); returns WriteOutcome, raises only on programming errors.

B.3 Error handling strategy

Class	Example	Behaviour
`ApprovalError`	missing/expired/scope-mismatch/already-consumed one-time / lost concurrency race	abort before API, `status=aborted`, exit 3
`SafetyViolation`	dry-run not run, lock held, audit-pre fail, PII egress to non-GPG path	abort before API, exit 3
`LarkApiError`	4xx/5xx after retries	`status=failed`, audit-post failure, exit 4
`PartialFailureError`	batch some ok/some not	`status=partial_failure`, no auto-rollback, manual rollback cmd, exit 5
`AuditError` (post)	audit-post sink down after API success	`status=success`+warning, emergency fallback (§C.6), exit 0 w/ warning

All derive from existing lark_client.exceptions base (do not fork the hierarchy).

B.4 Rate limit & batch sizing — [PATCH1 P3]

Reuse LarkCore global file lock /var/lock/lark-api.lock @ 10 req/s (README §4).

Batch ceilings are configurable, not hardcoded. New file config/lark-api-limits.yaml is the single source for per-operation limits:

rate:
  requests_per_sec: 10
batch:
  record_create_max: 500
  record_update_max: 500
  record_delete_max: 100      # PATCH1 default: conservative until Lark docs / Base đệm probe confirm
notes:
  record_delete_max: "Lower default than create/update; some Lark batch_delete endpoints cap below 500. Raise ONLY after official doc citation OR a Base đệm probe recorded in KB."

Service splits any batch above the configured ceiling into chunks of ≤ceiling, each a separate guarded call with idempotency sub-key {idempotency_key}#{chunk_index}. Chunk failure → stop, report which chunks committed (R-5).
The ceiling is read at service init; an out-of-range explicit request is rejected with SafetyViolation (no silent truncation).
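
The chunking rule above can be sketched as follows. This is a minimal illustration, not the service code: split_into_chunks is a hypothetical helper name, and the ceiling is passed in directly instead of being read from config/lark-api-limits.yaml.

```python
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_into_chunks(items: Sequence[T], ceiling: int,
                      idempotency_key: str) -> Iterator[Tuple[str, Sequence[T]]]:
    """Yield (sub_key, chunk) pairs; each chunk holds at most `ceiling` items."""
    if ceiling < 1:
        raise ValueError("ceiling must be >= 1")
    for chunk_index, start in enumerate(range(0, len(items), ceiling)):
        # Sub-key format taken from the text: {idempotency_key}#{chunk_index}
        yield f"{idempotency_key}#{chunk_index}", items[start:start + ceiling]


# 150 record ids with the default delete ceiling of 100 give chunks of 100 and 50.
record_ids = [f"rec{i}" for i in range(150)]
chunks = list(split_into_chunks(record_ids, 100, "key"))
```

Each yielded chunk would then go through its own guarded call; on a chunk failure the caller stops iterating and reports the sub-keys already committed.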

C. SafetyLayer Design

File: lark_client/safety.py
Single public method: guard(ctx, payload, api_call: Callable) -> WriteOutcome. api_call is a zero-arg closure performing exactly one LarkCore mutating request.

C.1 Execution order (req §4 invariant)

1 dry-run gate   2 approval(atomic check+consume)   3 backup(GPG)   4 audit-pre
5 lock acquire   6 rate-limit   7 PII scan   8 → api_call()   9 audit-post   10 release

approval_exempt_bases bypasses layer 2 only; layers 1,3,4,5,6,7,8,9 always run (req §5 note, §13.3).

C.2 Per-layer behaviour & failure mode — [PATCH1 P1, P7]

#	Layer	Pass condition	Failure mode
1	dry-run gate	if `ctx.dry_run`: build+validate payload, return `status=dry_run`, NO API. Real run needs `dry_run=False`; update/delete on non-buffer base also needs `confirmed=True`	not confirmed → `SafetyViolation`, abort
2	approval	`ApprovalProvider.check_and_consume(ctx)` — atomic for one-time (see §C.3): valid, unexpired, scope covers `base_key+table_id`, op allowed, wildcard policy ok, one-time not yet consumed AND consumed in the same critical section	invalid / lost race → `ApprovalError`, abort
3	backup	update/delete: `get_record`(s) BEFORE mutation, serialize, GPG-encrypt, write to backups dir, fsync; record backup path on the in-flight context	encrypt/write fail → `SafetyViolation`, abort (never mutate without backup)
4	audit-pre	append `phase=planned` JSONL, fsync, capture id	write fail → ABORT, API NOT called; if a backup was already written at layer 3 → orphan-backup handling (§C.6 / §E.5)
5	lock	per-record advisory lock `lark-write:{base_key}:{table_id}:{record_id}` + global rate lock	held → `SafetyViolation` (concurrent write), abort
6	rate-limit	token-bucket 10 req/s; batch ≤ configured ceiling (§B.4)	exceeded → block/wait then proceed
7	PII scan	run FieldPIIRegistry + PatternPIIDetector; compute redaction metadata; does not block a guarded record write by default but blocks any non-GPG egress path (§D.3)	scanner crash → fail-closed `SafetyViolation`; PII routed to plaintext/export/stdout → `SafetyViolation`, abort
8	api_call	invoke closure once (idempotency/client_token)	Lark error after retries → `LarkApiError`, go to audit-post(failed)
9	audit-post	append `phase=success|failed` JSONL	write fail after API success → emergency fallback sink, `status=success` + warning (§C.6)
10	release	release locks in `finally`	—
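
The layer ordering in the table can be illustrated with a small skeleton. It is a sketch under assumptions: the real guard(ctx, payload, api_call) signature is reduced to a dry_run flag plus a dict of layer callables (a hypothetical shape chosen only to make the order observable), and a bare Exception stands in for LarkApiError.

```python
from typing import Callable, Dict

# Pre-API layers in §C.1 order; layer 1 (dry-run gate) is handled inline below.
PRE_API_ORDER = ["approval", "backup", "audit_pre", "lock", "rate_limit", "pii_scan"]


def guard(dry_run: bool, layers: Dict[str, Callable[..., None]],
          api_call: Callable[[], object]) -> dict:
    """Run the layers in fixed order; `layers` maps layer name -> callable."""
    if dry_run:
        return {"status": "dry_run"}      # layer 1: no API call is made
    lock_held = False
    try:
        for name in PRE_API_ORDER:
            layers[name]()                # any exception here aborts before the API
            if name == "lock":
                lock_held = True
        try:
            api_call()                    # layer 8: exactly one mutating request
            status = "success"
        except Exception:                 # stand-in for LarkApiError after retries
            status = "failed"
        layers["audit_post"](status)      # layer 9 runs for both outcomes
        return {"status": status}
    finally:
        if lock_held:
            layers["release"]()           # layer 10: release only what was acquired


# Record the call order with stub layers.
calls = []
layers = {name: (lambda *args, n=name: calls.append(n))
          for name in PRE_API_ORDER + ["audit_post", "release"]}
outcome = guard(False, layers, lambda: calls.append("api_call"))
```

Keeping the order in one list makes the invariant easy to assert in a unit test, which is what the mocked ordering test in §H.3 would check.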

C.3 ApprovalProvider — DI + atomic check-and-consume [PATCH1 P1]

class ApprovalProvider(ABC):
    @abstractmethod
    def check_and_consume(self, ctx: WriteContext) -> ApprovalDecision:
        """ATOMIC for one-time approvals: validate AND mark-consumed inside ONE
        critical section. Reusable approvals: validate only (no consume).
        Exactly one concurrent caller may win a one-time approval; all others
        receive ApprovalError('already_consumed')."""

class YamlApprovalProvider(ApprovalProvider):           # Sprint 1–2
    def __init__(self, path="config/write-approvals.yaml"): ...
    # Implementation contract (Sprint 1):
    #  - acquire an exclusive OS file lock on write-approvals.yaml
    #    (flock/lockf on the file or a sidecar .lock) for the WHOLE
    #    check→decide→consume→rewrite sequence
    #  - re-read the file INSIDE the lock (no stale in-memory copy)
    #  - one-time: if used==false → set used=true + used_by(agent)
    #    + used_at + idempotency_key, atomically rewrite (tmp+fsync+rename),
    #    THEN release lock
    #  - if used==true on entry → ApprovalError('already_consumed')
    #  - lock contention → bounded wait then ApprovalError('approval_locked')
    #  - reusable-within-expiry: validate under lock, do NOT mutate

class DirectusApprovalProvider(ApprovalProvider):       # Sprint 3+ prototype
    ...   # atomicity via a conditional/transactional UPDATE (compare-and-set on used flag)

SafetyLayer.__init__(self, *, approval_provider: ApprovalProvider, ...) — SafetyLayer never imports YamlApprovalProvider; wired in a composition root (lark_client/factory.py / CLI bootstrap). Swapping YAML→Directus must not touch safety.py. The atomicity contract is part of the interface, so any provider must guarantee single-winner one-time consumption (covered by the provider-swap contract test, §H.3 / Sprint 3).
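
A minimal sketch of the file-locked check-and-consume sequence described in the implementation contract, assuming a POSIX host (fcntl). To stay standard-library only it stores approvals as JSON instead of YAML, uses a sidecar .lock file, and omits the expiry/scope validation and the bounded lock wait; the function name and the file shape are illustrative, not the planned YamlApprovalProvider.

```python
import fcntl
import json
import os
import tempfile
import time


class ApprovalError(Exception):
    """Stand-in for the ApprovalError named in §B.3."""


def check_and_consume(path: str, approval_id: str, agent: str,
                      idempotency_key: str) -> dict:
    """Validate and mark a one-time approval as used inside one critical section."""
    with open(path + ".lock", "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)      # exclusive; blocks until acquired
        try:
            with open(path, encoding="utf-8") as fh:
                approvals = json.load(fh)          # re-read INSIDE the lock
            entry = approvals.get(approval_id)
            if entry is None:
                raise ApprovalError("not_found")
            if entry.get("used"):
                raise ApprovalError("already_consumed")
            entry.update(used=True, used_by=agent, used_at=time.time(),
                         idempotency_key=idempotency_key)
            # Atomic rewrite: temp file in the same directory, fsync, then rename.
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
            with os.fdopen(fd, "w", encoding="utf-8") as out:
                json.dump(approvals, out)
                out.flush()
                os.fsync(out.fileno())
            os.replace(tmp, path)
            return entry
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Because every caller re-reads the file after acquiring the lock, a second concurrent caller always sees used == true and fails with already_consumed, which is the single-winner behaviour T5b is meant to assert.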

C.4 Wildcard / first-write policy (req §7, §13.8, §13.10) — unchanged

record.create wildcard-table ✅ (within scope+expiry); record.update/delete ❌; field.create/update ❌; field/table delete ❌ (break-glass). First write to any base/table → explicit base_key+table_id, wildcard rejected.

C.5 Approval defaults (req §8) — unchanged

record.create reusable-within-expiry (narrow scope only); record.update one-time; record.delete one-time mandatory; field.create/update one-time mandatory; field/table delete break-glass one-time mandatory. Reusable explicit only, forbidden for delete/schema.

C.6 Audit 2-phase + emergency fallback + orphan-backup handling [PATCH1 P7]

Primary sink: /var/log/lark-ops/YYYYMMDD.jsonl (append+fsync).
Phase 1 (pre): {phase:"planned", op, base_key, table_id, targets, agent, approval_id, idempotency_key, backup_ref, ts}. Fsync fail → abort, API NOT called.
Phase 3 (post): {phase:"success"|"failed", ...same id..., lark_response_meta, pii:{…}}.
Emergency fallback: phase-3 write fail after API success → independent sink /var/log/lark-ops/EMERGENCY/<ts>-<idempotency_key>.json (separate fd) + WriteOutcome.status="success", error="audit_post_degraded"; never silently swallow; if even emergency sink fails → stderr LARK-AUDIT-LOST line.
Orphan-backup handling (audit-pre fails after a backup was already written at layer 3): the mutation never happened, so the GPG backup is orphaned.
1. If the backup file can be removed safely (it was created this attempt, path matches the in-flight idempotency_key, no concurrent reader) → delete it.
2. Else (uncertain ownership / removal error) → leave it and append a record to /var/log/lark-ops/orphan-backups.log: {idempotency_key, backup_path, key_fingerprint, reason:"audit_pre_failed", ts} for later operator sweep. Never let an orphan encrypted blob accumulate silently. Orphan log lines contain no raw PII (only path + fingerprint + id).
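
The post-phase fallback chain (primary sink, then emergency sink, then stderr) can be sketched as below. The helper names are hypothetical, the paths are passed in as parameters, and the entry is assumed to be an already-redacted dict that carries idempotency_key.

```python
import json
import os
import sys
import time
from typing import Optional


def append_fsync(path: str, entry: dict) -> None:
    """Append one JSON line and force it to disk."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def audit_post(primary_path: str, emergency_dir: str, entry: dict) -> Optional[str]:
    """Return None on a normal write, or 'audit_post_degraded' if a fallback was used."""
    try:
        append_fsync(primary_path, entry)
        return None
    except OSError:
        pass                                        # fall through to the emergency sink
    try:
        os.makedirs(emergency_dir, exist_ok=True)
        name = f"{int(time.time())}-{entry['idempotency_key']}.json"
        append_fsync(os.path.join(emergency_dir, name), entry)
    except OSError:
        # Last resort: never swallow silently.
        print(f"LARK-AUDIT-LOST {entry['idempotency_key']}", file=sys.stderr)
    return "audit_post_degraded"
```

The returned string maps onto WriteOutcome.error, while status stays "success" because the Lark mutation has already been applied.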

D. PII Protection (req §6)

Two layers run in parallel (always active — 18 bases, unknown PII fields).

D.1 `FieldPIIRegistry` (whitelist)

config/pii-fields.yaml, keyed by (base_key, table_id, field_id) — field_id never name (req §12.7); seed from S176 snapshots; growable by PR.

D.2 `PatternPIIDetector` (regex, VN)

Type	Pattern	Note
`national_id_cccd`	`\b\d{12}\b`	match before phone
`national_id_cmnd`	`\b\d{9}\b`
`passport`	`\b[A-Z]{1,2}\d{7}\b`
`phone_vn`	`\b(?:\+84|0)(?:3|…`
`bank_account`	`\b\d{8,16}\b`	heuristic, flag only
`email`	`\b[\w.+-]+@[\w-]+\.[\w.-]+\b`	optional

Returns types + counts only, never substrings.
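
As an illustration of "types + counts only", here is a small detector built from the patterns in the table. It is a sketch, not the planned PatternPIIDetector: phone_vn is left out because its pattern is incomplete above, each pattern is counted independently (so a 12-digit number is reported as both national_id_cccd and bank_account), and the function name detect is an assumption.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Patterns copied from the table; phone_vn omitted (pattern incomplete in the table).
PATTERNS = {
    "national_id_cccd": re.compile(r"\b\d{12}\b"),
    "national_id_cmnd": re.compile(r"\b\d{9}\b"),
    "passport": re.compile(r"\b[A-Z]{1,2}\d{7}\b"),
    "bank_account": re.compile(r"\b\d{8,16}\b"),      # heuristic, flag only
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def detect(values: Iterable[str]) -> Dict[str, int]:
    """Return {pii_type: hit_count}; matched substrings are never returned or stored."""
    counts: Counter = Counter()
    for value in values:
        for pii_type, pattern in PATTERNS.items():
            counts[pii_type] += len(pattern.findall(value))
    return {pii_type: n for pii_type, n in counts.items() if n}


hits = detect(["CCCD 012345678901", "contact a.b@example.com"])
```

Only the dictionary of counts leaves the function, so the result can feed redaction_types in the audit entry without carrying raw values.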

D.3 Pipeline integration & policy nuance [PATCH1 P6]

Both layers feed SafetyLayer layer 7; union of hits → redaction_types, redacted_fields_count.

Policy (PATCH1, explicit two-rule split):

Guarded record write path: PII detection does NOT block the write by default (matches req §6 — the write itself proceeds; the protection is in what gets logged + GPG backup).
Egress paths: PII detection MUST block when the same data would leave through a non-protected channel — i.e. plaintext file output, --export/dump, stdout/console print, or any non-GPG backup/log path. Such a route with detected PII → SafetyViolation, abort.
Audit + rollback command MUST NEVER contain raw PII — audit holds metadata only (§D.4); the auto-generated rollback command references the encrypted backup file path, never inline values.

Optional --pii-strict (off by default) escalates rule 1 to also abort guarded writes on detection — decision deferred to Huyên (OQ-3).
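
The policy split can be expressed as one small decision function. This is a sketch: the sink labels ("lark_api", "gpg_backup", "stdout", and so on) and the function name are hypothetical, chosen only to show which combinations abort.

```python
from typing import List


class SafetyViolation(Exception):
    """Stand-in for the SafetyViolation named in §B.3."""


# Hypothetical sink labels: the guarded write itself and the GPG backup are protected.
PROTECTED_SINKS = {"lark_api", "gpg_backup"}


def enforce_pii_policy(sink: str, redaction_types: List[str],
                       pii_strict: bool = False) -> None:
    """Raise SafetyViolation when detected PII would take a forbidden route."""
    if not redaction_types:
        return                                   # nothing detected, nothing to block
    if sink not in PROTECTED_SINKS:
        # Egress rule: plaintext file, export/dump, stdout, non-GPG backup or log.
        raise SafetyViolation(f"PII egress to unprotected sink: {sink}")
    if pii_strict and sink == "lark_api":
        # Optional escalation: also abort the guarded write.
        raise SafetyViolation("PII detected and --pii-strict is set")
```

With the default pii_strict=False, a guarded write with detected PII passes, while the same detection routed to "stdout" aborts.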

D.4 Audit redaction format (req §6)

{ "pii_redacted": true,
  "redaction_types": ["national_id_cccd","bank_account"],
  "redacted_fields_count": 3,
  "detector": ["registry","pattern"] }

E. GPG Backup Design (req §6, §13.6 — mandatory from Sprint 1)

E.1 Key source — GSM, public-key-only on VPS

GSM secret LARK_BACKUP_GPG_PUBKEY (project github-chatgpt-ggcloud), ASCII-armored public key, fetched via the existing secret accessor (not read directly in business code). Private key NEVER on VPS — offline with Huyên. VPS encrypts, cannot decrypt (mitigates R-4).

E.2 Rotation policy

Annual or on compromise. New keypair offline → publish new public key as a new GSM version → service picks up latest on next start → old backups still decryptable with retained offline private keys, indexed by fingerprint (recorded in each backup's sidecar meta).

E.3 File naming & storage

/var/log/lark-ops/writes/<YYYYMMDD>/
  <base_key>__<table_id>__<record_id>__<idempotency_key>__pre.json.gpg
  <base_key>__<table_id>__<record_id>__<idempotency_key>__pre.meta.json   # unencrypted: key fp, ts, op, NO pii

Batch: one .json.gpg per chunk; records as JSON lines before encryption.
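
A minimal sketch of how the two paths for a single-record backup could be derived from the naming scheme above. backup_paths is a hypothetical helper; it assumes the identifiers themselves do not contain the "__" separator.

```python
from datetime import date
from pathlib import PurePosixPath
from typing import Tuple


def backup_paths(base_key: str, table_id: str, record_id: str,
                 idempotency_key: str, day: date,
                 root: str = "/var/log/lark-ops/writes") -> Tuple[PurePosixPath, PurePosixPath]:
    """Return (encrypted_blob_path, sidecar_meta_path) for one pre-mutation backup."""
    stem = "__".join([base_key, table_id, record_id, idempotency_key, "pre"])
    folder = PurePosixPath(root) / day.strftime("%Y%m%d")
    return folder / f"{stem}.json.gpg", folder / f"{stem}.meta.json"


blob, meta = backup_paths("base", "tbl1", "rec1", "key", date(2026, 5, 19))
```

Deriving both names from one stem keeps the blob and its sidecar paired, which is what the recovery procedure relies on when it locates a backup by idempotency_key.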

E.4 Recovery procedure

Locate by idempotency_key (also in audit-pre backup_ref) → verify fingerprint via sidecar → gpg --decrypt on Huyên's offline machine → re-apply via a normal guarded lark-tool records update ... --no-dry-run --confirm (recovery is itself fully guarded/audited).

E.5 Orphan backup lifecycle — [PATCH1 P7]

A backup is committed only once its paired audit-pre entry is durably written (the audit-pre records backup_ref). Backups whose audit-pre never landed are orphans:

Inline handling: §C.6 (delete-if-safe, else append orphan-backups.log).
Operator sweep (documented runbook, Sprint 4 monitoring): periodically reconcile writes/ blobs vs audit-pre backup_ref set; any blob with no corresponding committed audit-pre and present in orphan-backups.log (or older than a grace window with no audit-pre) → securely delete after operator confirmation. Sweep actions are themselves audited. No automatic unconfirmed deletion of anything not provably this-attempt orphaned.
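
The inline delete-if-safe rule can be sketched as follows. The ownership test here (created in this attempt and idempotency_key present in the file name) is a simplification; the "no concurrent reader" condition is not modelled, and the function name and return values are assumptions.

```python
import json
import os
import time


def handle_orphan_backup(backup_path: str, idempotency_key: str,
                         key_fingerprint: str, orphan_log: str,
                         created_this_attempt: bool) -> str:
    """Delete a provably-owned orphan blob, otherwise log it for the operator sweep."""
    owned = created_this_attempt and idempotency_key in os.path.basename(backup_path)
    if owned:
        try:
            os.remove(backup_path)
            return "deleted"
        except OSError:
            pass                                   # removal failed: fall through and log
    record = {"idempotency_key": idempotency_key, "backup_path": backup_path,
              "key_fingerprint": key_fingerprint, "reason": "audit_pre_failed",
              "ts": time.time()}                   # path + fingerprint + id only, no PII
    with open(orphan_log, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
        fh.flush()
        os.fsync(fh.fileno())
    return "logged"
```

Anything that returns "logged" is left on disk for the operator sweep, so nothing that is not provably this-attempt orphaned is deleted automatically.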

F. Track A — MCP Plugin

F.1 Survey result + topology rule [PATCH1 P2]

Live @larksuiteoapi/lark-mcp surface (observed in-session, = same plugin) — exactly 9 bitable tools: app_create, appTable_create, appTable_list, appTableField_list, appTableRecord_search, appTableRecord_create, _batchCreate, appTableRecord_update, _batchUpdate. Matches requirements §2.

PATCH1 topology rule (hard):

The existing plugin MAY remain mounted ONLY for read-class tools: appTable_list, appTableField_list, appTableRecord_search (and read-only app/table listing).
The plugin's write tools (appTableRecord_create/_batchCreate/_update/_batchUpdate, appTable_create, app_create) MUST NOT be used in any production workflow — they call Lark directly and bypass SafetyLayer entirely (no dry-run, approval, backup, audit, PII). This is risk R-6 (raised to HIGH).
If the MCP host cannot hide/disable individual plugin tools (tool-level allowlist not supported) → replace the @larksuiteoapi/lark-mcp plugin entirely with the custom adapter (which itself exposes guarded read + write tools). Partial trust of a plugin whose write tools cannot be removed is not acceptable.

F.2 Gap analysis

Needed	In plugin?	Decision
record.get by id	❌ (only search)	custom adapter
record.delete / batchDelete	❌	custom adapter
field.create/update/delete	❌	custom adapter (Sprint 3)
appTable.update/delete	❌	custom adapter (Sprint 4)
view.list/create/delete	❌	custom adapter (Sprint 4)
guarded record.create/update	plugin has UNguarded only	must come from custom adapter (plugin write tools forbidden, P2)

F.3 Custom MCP adapter (Sprint 2)

lark_client/mcp_adapter/ (Python MCP SDK) — adapter only, zero write logic, imports lark_client.service.LarkWriteService.
Sprint 2 tools: lark_record_get, lark_record_delete, lark_record_create, lark_record_update (all guarded). Sprint 4: field/view.
Each tool builds WriteContext(agent="cowork-mcp") → LarkWriteService → full SafetyLayer.
Hard guard (R-7): resolve base_key; if resolved app_token ≠ Base đệm Nf2bb1ExXaYnlksgoyQl72GNgAc, any delete/schema op rejected at adapter boundary (req §11, §13.4).
Auth reuses GSM LARK_APP_* via LarkCore; no new bot/credential (req §12.2).
Final mount topology (plugin-read-only-alongside vs full-replace) depends on whether the host supports tool-level hiding → OQ-4 (now a concrete feasibility check, not an open preference).
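
A sketch of the adapter-boundary hard guard. The operation naming ("record.delete", "field.create", ...) follows the style used in §C.4, but the exact classification of destructive/schema operations and the function name are assumptions; the buffer token is passed in (for example from bases.yaml) so that no app_token literal appears in code.

```python
class SafetyViolation(Exception):
    """Stand-in for the SafetyViolation named in §B.3."""


# Assumption: schema operations are those in the field/table/view namespaces.
SCHEMA_PREFIXES = ("field.", "table.", "view.")


def assert_allowed_at_adapter(operation: str, resolved_app_token: str,
                              buffer_app_token: str) -> None:
    """Reject any delete or schema operation unless the target is the buffer base."""
    restricted = operation.endswith(".delete") or operation.startswith(SCHEMA_PREFIXES)
    if restricted and resolved_app_token != buffer_app_token:
        raise SafetyViolation(
            f"{operation} via MCP is only allowed on the buffer base")
```

The check runs before WriteContext is built, so a rejected request never reaches the service layer or consumes an approval.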

G. Track B — CLI Write Module

G.1 `lark_client/writer.py`

Typed façade over service.py (CLI thin, req §12.8): create / batch_create / get / update / batch_update / delete / batch_delete, return WriteOutcome. No write logic here.

G.2 `lark_client/field_manager.py` (Sprint 3)

SUPPORTED = {"Text","Number","SingleSelect","Checkbox"}; create/update/delete by field_id only (req §12.7); complex types → UnsupportedFieldType until Sprint 4+.

G.3 CLI commands (Click) + `$LARK_AGENT` nuance [PATCH1 P4]

cli/commands/records.py and cli/commands/fields.py as in base (create/get/update/delete/batch-*, fields create/update/delete with literal ack string for delete).

$LARK_AGENT rule (PATCH1):

Every real write (--no-dry-run) — single or batch, CLI or MCP — MUST have an explicit agent identity. Missing/empty $LARK_AGENT on a real write → abort SafetyViolation before layer 8.
Dry-run only: CLI MAY default $LARK_AGENT=claude-code (convenience for --dry-run, which never mutates).
Valid values: claude-code | cowork-mcp | cron (MCP adapter always sets cowork-mcp).
Audit always records agent in both phase-1 and phase-3 entries — there is no audit entry without an agent field.
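
The rule can be sketched as a small resolver. resolve_agent is a hypothetical helper; it reads the environment mapping passed in (defaulting to os.environ) and treats an unknown value as a violation, which is an assumption beyond the missing/empty case stated above.

```python
import os
from typing import Mapping, Optional


class SafetyViolation(Exception):
    """Stand-in for the SafetyViolation named in §B.3."""


VALID_AGENTS = {"claude-code", "cowork-mcp", "cron"}


def resolve_agent(dry_run: bool, environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the agent identity, applying the dry-run default and real-write rule."""
    env = os.environ if environ is None else environ
    agent = env.get("LARK_AGENT", "").strip()
    if not agent:
        if dry_run:
            return "claude-code"          # convenience default; a dry-run never mutates
        raise SafetyViolation("LARK_AGENT is required for a real write")
    if agent not in VALID_AGENTS:
        raise SafetyViolation(f"unknown LARK_AGENT: {agent}")
    return agent
```

Calling this once at CLI bootstrap and storing the result in WriteContext.agent means both audit phases read the same value.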

Conventions otherwise unchanged: dry-run default ON; update/delete on non-buffer base need --confirm; field/table delete need the literal ack string; exit codes per §B.3; registered into existing cli/lark_tool.py Click group (no new entrypoint).

G.4 `config/write-approvals.yaml` schema

As base (id, operation, scope{base_key,table_id}, one_time_use default true, used, reason, created_by human, created_at, expires_at; approval_exempt_bases: ["88-phai-cu-base-dem"]). PATCH1: the used/used_by/used_at fields are written under file lock by YamlApprovalProvider (§C.3).

G.5 Integration with `LarkCore` + API limits config [PATCH1 P3]

Reuse LarkCore: GSM token, retry (3× backoff 429/503/network), global rate lock, endpoint whitelist.
Whitelist additions to config/allowed_endpoints.yaml (each = 1 reviewed change; write endpoints initially empty per README §5.2): record create/get/update/delete + batch_create/batch_update/batch_delete; (Sprint 3) field create/update/delete; (Sprint 4) table create/delete + views.
New config/lark-api-limits.yaml (see §B.4) holds rate + per-op batch ceilings; record_delete_max default 100 until an official Lark doc citation or a recorded Base đệm probe raises it. Service rejects explicit over-ceiling requests (no silent truncation).
Idempotency: client_token/UUID per Lark write API where supported.

H. Testing Strategy

H.1 Base đệm (test target — CONFIRMED not production)

88 - Phái cử (Base đệm), app_token Nf2bb1ExXaYnlksgoyQl72GNgAc; tables TTS=tblPQ6N79EeOmnTm (7 fld), Đơn hàng=tblaU7kxyPTNBSrR (5 fld). Production Base 88 is the different token YSIkb8PxOaNaozs2vwalOOcagkf (80 tbl) — tests MUST NOT use it; hard token assert enforced.

H.2 Test cases T1–T12 plus PATCH1 b-variants (Base đệm only)

T1 dry-run default no API
T2 create real + audit pre/post
T3 update no-confirm aborts
T4 update w/ confirm + GPG backup exists
T5 one-time approval reuse → ApprovalError
T5b (PATCH1) two concurrent one-time consumers → exactly one success, other already_consumed
T6 batch_create 600 → 500+100 chunks audited
T6b (PATCH1) batch_delete 150 with default ceiling 100 → rejected or 100+50 chunked per config
T7 batch partial failure → partial_failure, no auto-rollback, rollback cmd
T8 scope mismatch → ApprovalError
T9 wildcard delete rejected
T10 PII payload → guarded write proceeds, audit metadata only, raw only in GPG
T10b (PATCH1) PII to plaintext/--export/stdout → SafetyViolation
T11 audit-pre unwritable → API not called + orphan backup deleted-or-logged
T12 audit-post fails post-success → status=success + emergency fallback

H.3 Isolation & test split [PATCH1 P8]

Unit / mock tests — MANDATORY before any commit. Mock LarkCore HTTP; assert SafetyLayer ordering, atomic approval single-winner (T5b via threads/processes), PII metadata + egress block (T10b), GPG invoked, 2-phase audit, orphan-backup path (T11), batch-ceiling logic (T6b). Mirrors existing 19/19+8/8 mocked style. Commit gate = all unit/mock green.
Base đệm integration tests — gated. Run only when env LARK_TEST_INTEGRATION=1 AND a hard assertion confirms app_token == Base đệm token; otherwise skipped (never silently hit any base). Base đệm reset by Claude Code allowed but the reset itself must be audited (req §12.12). Integration green is NOT required for the unit-level commit gate but IS required before Sprint 1 sign-off.
app_token literal allowed only in tests/ and bases.yaml (README §6).
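
The integration gate can be sketched as a helper that a test fixture would call before touching any base. The function name is hypothetical, and raising AssertionError on a token mismatch (rather than skipping) is an assumption about how the "hard assertion" should behave.

```python
import os
from typing import Mapping, Optional


def integration_skip_reason(app_token: str, buffer_token: str,
                            environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a skip reason, or None when integration tests may run."""
    env = os.environ if environ is None else environ
    if env.get("LARK_TEST_INTEGRATION") != "1":
        return "LARK_TEST_INTEGRATION is not set to 1"
    if app_token != buffer_token:
        # Opted in but pointed at the wrong base: fail loudly instead of skipping.
        raise AssertionError("integration tests may only target the buffer base")
    return None
```

A fixture would skip the test when a reason is returned and proceed only on None, so an unset flag can never result in a call against any base.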

I. Sprint Breakdown

Sprint 0 — S177-R0 Code Reconcile [PATCH1 P5] (NO CODE WRITTEN)

Goal: validate every assumption this design makes against the live lark-client source before a single line of Sprint 1 code. Activities (read-only, by an agent/operator with repo+shell access to /opt/incomex/lark-client/):

Confirm LarkCore actual class/module name and method signatures (token acquisition/refresh, retry, the global rate-lock API, the GSM/secret accessor used for LARK_APP_*).
Confirm Registry / bases.yaml loader name + base_key→app_token resolution API and the Base đệm key string.
Confirm lark_client.exceptions base class hierarchy (so new errors subclass it, not a parallel tree).
Confirm CLI entrypoint cli/lark_tool.py Click group name + how subcommand modules are registered.
Confirm config dir conventions and existing files (allowed_endpoints.yaml, bases.yaml) and where new write-approvals.yaml / pii-fields.yaml / lark-api-limits.yaml should live.
Confirm existing test layout/harness conventions (the 19/19 + 8/8 suites) and the app_token-literal exception scope.

Exit / STOP rule: produce a reconcile report (KB) mapping each design symbol → actual symbol. If the live source materially differs from design assumptions (e.g. no LarkCore global lock, different secret path, no Click group) → STOP, do not start Sprint 1, route the delta back to Huyên/GPT for a design amendment. Only a clean/aligned reconcile (or trivial rename mapping) authorizes Sprint 1.

Sprint 1 — Track B core (CLI records + safety)

Deliverables: service.py (record ops), writer.py, safety.py (8 layers incl. atomic approval + orphan-backup), ApprovalProvider ABC + YamlApprovalProvider (file-locked atomic), GPG backup module, 2-phase audit + emergency + orphan log, cli/commands/records.py, config/write-approvals.yaml, config/pii-fields.yaml, config/lark-api-limits.yaml, PII registry+pattern + egress guard, record endpoints whitelisted.
Acceptance: all unit/mock tests green (incl. T5b, T6b, T10b, T11 orphan) as the mandatory commit gate; gated Base đệm integration subset green before sign-off; existing 19/19+8/8 still pass; no import requests; no hardcoded app_token (pre-commit grep); dry-run default verified; real write without $LARK_AGENT aborts; GPG backup decryptable offline; audit-pre-fail aborts before API with orphan handling.

Sprint 2 — Track A custom MCP adapter

Deliverables: lark_client/mcp_adapter/ (lark_record_get/delete/create/update, all guarded), Base-đệm hard guard, topology decision per OQ-4 (read-only plugin alongside vs full replace).
Acceptance: Cowork get/delete on Base đệm via guarded adapter; production-token delete rejected at boundary; plugin write tools demonstrably unused/removed for production; all MCP writes in the audit stream with agent=cowork-mcp; no duplicated write logic.

Sprint 3 — Field ops + ApprovalProvider swap

Deliverables: field_manager.py (Text/Number/SingleSelect/Checkbox), cli/commands/fields.py, field endpoints whitelisted, DirectusApprovalProvider prototype injected without touching safety.py.
Acceptance: field create/update/delete on Base đệm; complex types → UnsupportedFieldType; provider-swap contract test: Directus provider passes the SAME atomic single-winner one-time-consume contract as YAML provider.

Sprint 4 — Schema ops + monitoring + full integration

Deliverables: table/base create/delete + view list/create/delete, maintenance-window + staging gate (req §13.5), monitoring (audit volume / failure-rate / orphan-backup sweep alarms, e.g. uptime-kuma push), full e2e integration suite.
Acceptance: schema op refuses outside maintenance window; full T1–T12 (+b variants) + schema cases green on Base đệm; monitoring fires on injected audit-loss and on orphan backups; README §3/§8 updated.

J. Open Questions

OQ-1 → folded into Sprint 0 (S177-R0). Still the top blocker, but now a defined gated phase with an explicit STOP rule (§I.0). Closed by a clean reconcile report; escalate on material drift.
OQ-2: GPG key — confirm public-key-only-on-VPS / private-key-offline model and private-key custodian + GSM secret name LARK_BACKUP_GPG_PUBKEY. (unchanged)
OQ-3 (refined): Confirm the PATCH1 two-rule PII policy — guarded writes proceed, but plaintext/export/stdout/non-GPG paths are blocked; audit/rollback never raw PII. Decide if --pii-strict (also abort guarded writes) ships Sprint 1 or later.
OQ-4 (refined to a feasibility check): Does the MCP host support tool-level hiding of individual plugin tools? If YES → keep @larksuiteoapi/lark-mcp mounted read-only-tools-only alongside the custom adapter. If NO → fully replace the plugin with the custom adapter (P2). Needs a concrete host-capability answer.
OQ-5 (refined): record.batch_delete ceiling defaults to 100 in config/lark-api-limits.yaml. Raise only after (a) an official Lark Open API doc citation, or (b) a recorded Base đệm probe. Confirm acceptance of the conservative default.
OQ-6 (process, unchanged): Patched doc still lives only in KB (and the base on VPS /opt/incomex/docs/mcp-writes/); not git-committed, final repo path not populated — requires repo+shell access this environment lacks.
OQ-7 (new): Orphan-backup sweep — confirm the operator runbook + grace window and that automated sweep deletion requires operator confirmation (no unconfirmed auto-delete of non-provably-orphaned blobs).

End of S177 Architecture Design — PATCH1. DRAFT v1.1 awaiting Huyên review (OQ-2…OQ-7) and S177-R0 Code Reconcile (OQ-1). Not commit-ready.