S177 Architecture Design — Lark CRUD Gateway — PATCH1 (2026-05-19)
S177 — Architecture Design Document: Lark Base Controlled CRUD Gateway — PATCH1
Status: DRAFT v1.1 (PATCH1) for Huyên review → then S177-R0 Code Reconcile → Sprint 1
Date: 2026-05-19
Supersedes: s177-architecture-design-2026-05-19.md (base, SHA-256 0440ef92…3639e5, 31407 B, 502 ln)
Source of truth: knowledge/dev/lark/s177-controlled-crud-gateway-requirements-v2.md (requirements brief v2.2 FINAL, GPT R3 8.8/10)
Author: Claude Code (Opus 4.7)
PATCH1 changelog (8 changes): P1 atomic approval check-and-consume · P2 MCP topology hardening · P3 configurable API limits + batch_delete default 100 · P4 `$LARK_AGENT` write/dry-run nuance · P5 new S177-R0 Code Reconcile gate · P6 PII policy nuance (block plaintext/export paths) · P7 orphan-backup handling · P8 Sprint 1 test split (unit-before-commit vs gated integration). Sections changed: A, B.4, C.2, C.3, C.6, D.3, E (new E.5), F.1, F.3, G.3, G.5, H.3, I (new Sprint 0), J.
A. Executive Summary
Scope
Add controlled write capability (records + fields + later tables/views) to the existing read-only lark-client v1.0.0, behind a mandatory 8-layer SafetyLayer, exposed through two tracks that share one Application Service Layer:
- Track B (CLI, production-grade, built first): `lark-tool records ...` / `lark-tool fields ...`
- Track A (MCP, Cowork interactive): custom adapter over the same service layer; production delete forbidden, Base đệm only.
Architecture (target)
    Cowork / MCP (Track A)         Claude Code CLI / Cron (Track B)
            │                                  │
            └────────────────┬─────────────────┘
                             ▼
          Application Service Layer (lark_client/service.py)
          (single write entrypoint, no duplicated write logic)
                             ▼
                SafetyLayer (lark_client/safety.py)
       dry-run → approval(atomic) → backup(GPG) → audit-pre → lock →
       rate-limit → PII-scan → Lark API call → audit-post
                             ▼
          LarkCore (existing: GSM token, whitelist, retry)
                             ▼
             Lark Open API https://open.larksuite.com
Sprint plan (PATCH1: Sprint 0 added)
| Sprint | Deliverable | Track |
|---|---|---|
| 0 (S177-R0) | Code Reconcile — inspect live source, confirm assumptions, STOP on material drift (no code written) | — |
| 1 | writer.py + SafetyLayer core + atomic approval + GPG backup + 2-phase audit + unit tests | B |
| 2 | Custom MCP adapter over service layer + record.get/delete MCP | A |
| 3 | field_manager.py (Text/Number/SingleSelect/Checkbox) + ApprovalProvider swap + Directus prototype | B |
| 4 | table/base schema ops + monitoring + full integration test | A+B |
Top risks
| # | Risk | Sev | Mitigation |
|---|---|---|---|
| R-1 | Design not validated vs live source | HIGH | S177-R0 Code Reconcile gate (Sprint 0) before any code (§I.0, OQ-1) |
| R-2 | Audit-post fail after destructive API success | HIGH | 2-phase audit + emergency fallback sink (§C.6) |
| R-3 | PII leaking into logs/exports | HIGH | 2 parallel PII layers; metadata-only audit; block non-GPG egress (§D.3) |
| R-4 | GPG private key on VPS | HIGH | Public-key-only on VPS; private key offline (§E) |
| R-5 | Batch partial failure / oversized batch | MED | Configurable limits, batch_delete default 100, stop-and-report (§B.4,§G.5) |
| R-6 | Plugin write tools bypass SafetyLayer | HIGH (raised) | Plugin = read/list/search only; production create/update via custom adapter; replace plugin if hiding impossible (§F) |
| R-7 | Cowork/MCP hitting production not Base đệm | MED | Hard app_token allowlist at adapter boundary (§F.3,§H) |
| R-8 | Approval one-time double-spend under concurrency | MED (new) | Atomic file-locked check-and-consume, single winner (§C.3) |
| R-9 | Orphan GPG backup if audit-pre fails after backup | LOW (new) | Delete-if-safe else mark orphaned in cleanup log (§C.6,§E.5) |
B. Application Service Layer
File: lark_client/service.py
Principle (req §12.8): CLI and MCP MUST call this layer. No write logic anywhere else. Never import requests; all HTTP via existing LarkCore.
B.1 Interface
    from abc import ABC, abstractmethod
    from typing import Literal, TypedDict

    class WriteOutcome(TypedDict):
        status: Literal["dry_run", "success", "partial_failure", "failed", "aborted"]
        operation: str
        base_key: str
        table_id: str
        targets: list[str]
        idempotency_key: str          # UUID v4
        rollback_command: str | None  # auto-generated; NEVER contains raw PII (refs encrypted backup path)
        audit_pre_id: str
        audit_post_id: str | None
        pii: dict                     # metadata only (see §D.4)
        error: str | None

    class LarkWriteServiceABC(ABC):
        @abstractmethod
        def create_record(self, ctx, fields: dict) -> WriteOutcome: ...
        @abstractmethod
        def batch_create_records(self, ctx, records: list[dict]) -> WriteOutcome: ...
        @abstractmethod
        def get_record(self, ctx, record_id: str) -> dict: ...
        @abstractmethod
        def update_record(self, ctx, record_id: str, fields: dict) -> WriteOutcome: ...
        @abstractmethod
        def batch_update_records(self, ctx, records: list[dict]) -> WriteOutcome: ...
        @abstractmethod
        def delete_record(self, ctx, record_id: str) -> WriteOutcome: ...
        @abstractmethod
        def batch_delete_records(self, ctx, record_ids: list[str]) -> WriteOutcome: ...
        # Sprint 3+
        @abstractmethod
        def create_field(self, ctx, spec) -> WriteOutcome: ...
        @abstractmethod
        def update_field(self, ctx, field_id: str, spec) -> WriteOutcome: ...
        @abstractmethod
        def delete_field(self, ctx, field_id: str) -> WriteOutcome: ...
        # Sprint 4
        @abstractmethod
        def create_table(self, ctx, spec) -> WriteOutcome: ...
        @abstractmethod
        def delete_table(self, ctx) -> WriteOutcome: ...
        @abstractmethod
        def list_views(self, ctx) -> list[dict]: ...
WriteContext carries the intent, not the credential:
    from dataclasses import dataclass, field
    from uuid import uuid4

    @dataclass(frozen=True)
    class WriteContext:
        base_key: str
        table_id: str
        operation: str
        agent: str          # $LARK_AGENT ∈ {claude-code, cowork-mcp, cron}; see §G.3
        approval_id: str
        dry_run: bool = True
        confirmed: bool = False
        idempotency_key: str = field(default_factory=lambda: str(uuid4()))
        is_buffer_base: bool = False
B.2 LarkWriteService (concrete, Sprint 1)
Same as base: DI of (core, safety, registry); base_key→app_token via Registry/bases.yaml (no hardcode, req §12.1); delegates the entire mutating call to SafetyLayer.guard(...); returns WriteOutcome, raises only on programming errors.
B.3 Error handling strategy
| Class | Example | Behaviour |
|---|---|---|
| `ApprovalError` | missing/expired/scope-mismatch/already-consumed one-time / lost concurrency race | abort before API, status=aborted, exit 3 |
| `SafetyViolation` | dry-run not run, lock held, audit-pre fail, PII egress to non-GPG path | abort before API, exit 3 |
| `LarkApiError` | 4xx/5xx after retries | status=failed, audit-post failure, exit 4 |
| `PartialFailureError` | batch some ok/some not | status=partial_failure, no auto-rollback, manual rollback cmd, exit 5 |
| `AuditError` (post) | audit-post sink down after API success | status=success+warning, emergency fallback (§C.6), exit 0 w/ warning |
All derive from existing lark_client.exceptions base (do not fork the hierarchy).
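The table above can be read as a small exception hierarchy carrying a status and an exit code. The following is a minimal sketch only: `LarkWriteError` is a hypothetical stand-in for the real `lark_client.exceptions` base (to be confirmed in S177-R0), the `aborted` status for `SafetyViolation` and the fallback exit code 1 for unrelated exceptions are assumptions, not part of the requirements.

```python
class LarkWriteError(Exception):
    """Hypothetical stand-in for the existing lark_client.exceptions base."""
    exit_code = 1
    status = "failed"


class ApprovalError(LarkWriteError):
    exit_code, status = 3, "aborted"


class SafetyViolation(LarkWriteError):
    # Assumption: the table gives only the exit code; "aborted" is inferred.
    exit_code, status = 3, "aborted"


class LarkApiError(LarkWriteError):
    exit_code, status = 4, "failed"


class PartialFailureError(LarkWriteError):
    exit_code, status = 5, "partial_failure"


class AuditError(LarkWriteError):
    # The API call already succeeded, so the run exits 0 with a warning.
    exit_code, status = 0, "success"


def exit_code_for(exc: Exception) -> int:
    """Map an exception to a CLI exit code; unknown errors fall back to 1."""
    return exc.exit_code if isinstance(exc, LarkWriteError) else 1
```

Keeping the code on the class lets the CLI bootstrap translate any gateway error into an exit code without a separate lookup table.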
B.4 Rate limit & batch sizing — [PATCH1 P3]
- Reuse `LarkCore` global file lock `/var/lock/lark-api.lock` @ 10 req/s (README §4).
- Batch ceilings are configurable, not hardcoded. New file `config/lark-api-limits.yaml` is the single source for per-operation limits:
      rate:
        requests_per_sec: 10
      batch:
        record_create_max: 500
        record_update_max: 500
        record_delete_max: 100   # PATCH1 default: conservative until Lark docs / Base đệm probe confirm
      notes:
        record_delete_max: "Lower default than create/update; some Lark batch_delete endpoints cap below 500. Raise ONLY after official doc citation OR a Base đệm probe recorded in KB."
- Service splits any batch above the configured ceiling into chunks of ≤ceiling, each a separate guarded call with idempotency sub-key `{idempotency_key}#{chunk_index}`. Chunk failure → stop, report which chunks committed (R-5).
- The ceiling is read at service init; an out-of-range explicit request is rejected with `SafetyViolation` (no silent truncation).
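The chunking rule above can be sketched as a small generator. This is an illustration under assumptions: the function name `split_into_chunks` is hypothetical, and the real service would wrap each yielded chunk in its own guarded call.

```python
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_into_chunks(
    items: Sequence[T], ceiling: int, idempotency_key: str
) -> Iterator[Tuple[str, Sequence[T]]]:
    """Yield (sub_key, chunk) pairs, each chunk holding at most `ceiling` items."""
    if ceiling < 1:
        raise ValueError("ceiling must be >= 1")
    for index, start in enumerate(range(0, len(items), ceiling)):
        # Sub-key format from the design: {idempotency_key}#{chunk_index}
        yield f"{idempotency_key}#{index}", items[start:start + ceiling]
```

For example, 600 records with a ceiling of 500 produce two chunks (500 and 100 items) keyed `<key>#0` and `<key>#1`, which matches test case T6.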
C. SafetyLayer Design
File: lark_client/safety.py
Single public method: guard(ctx, payload, api_call: Callable) -> WriteOutcome. api_call is a zero-arg closure performing exactly one LarkCore mutating request.
C.1 Execution order (req §4 invariant)
1 dry-run gate 2 approval(atomic check+consume) 3 backup(GPG) 4 audit-pre
5 lock acquire 6 rate-limit 7 PII scan 8 → api_call() 9 audit-post 10 release
approval_exempt_bases bypasses layer 2 only; layers 1,3,4,5,6,7,8,9 always run (req §5 note, §13.3).
C.2 Per-layer behaviour & failure mode — [PATCH1 P1, P7]
| # | Layer | Pass condition | Failure mode |
|---|---|---|---|
| 1 | dry-run gate | if ctx.dry_run: build+validate payload, return status=dry_run, NO API. Real run needs dry_run=False; update/delete on non-buffer base also needs confirmed=True | not confirmed → SafetyViolation, abort |
| 2 | approval | ApprovalProvider.check_and_consume(ctx), atomic for one-time (see §C.3): valid, unexpired, scope covers base_key+table_id, op allowed, wildcard policy ok, one-time not yet consumed AND consumed in the same critical section | invalid / lost race → ApprovalError, abort |
| 3 | backup | update/delete: get_record(s) BEFORE mutation, serialize, GPG-encrypt, write to backups dir, fsync; record backup path on the in-flight context | encrypt/write fail → SafetyViolation, abort (never mutate without backup) |
| 4 | audit-pre | append phase=planned JSONL, fsync, capture id | write fail → ABORT, API NOT called; if a backup was already written at layer 3 → orphan-backup handling (§C.6 / §E.5) |
| 5 | lock | per-record advisory lock lark-write:{base_key}:{table_id}:{record_id} + global rate lock | held → SafetyViolation (concurrent write), abort |
| 6 | rate-limit | token-bucket 10 req/s; batch ≤ configured ceiling (§B.4) | exceeded → block/wait then proceed |
| 7 | PII scan | run FieldPIIRegistry + PatternPIIDetector; compute redaction metadata; does not block a guarded record write by default but blocks any non-GPG egress path (§D.3) | scanner crash → fail-closed SafetyViolation; PII routed to plaintext/export/stdout → SafetyViolation, abort |
| 8 | api_call | invoke closure once (idempotency/client_token) | Lark error after retries → LarkApiError, go to audit-post(failed) |
| 9 | audit-post | append `phase=success` or `phase=failed` JSONL | write fail after API success → emergency fallback sink (§C.6), status=success + warning |
| 10 | release | release locks in `finally` | n/a |
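To make the ordering invariant concrete, here is a minimal sketch of a guard loop that records which layers ran. It is not the planned `safety.py`: the `hooks` dict, the `trace` list and the single `SafetyViolation` abort type are simplifications introduced for illustration (the design also aborts on `ApprovalError`).

```python
from typing import Callable, Dict, List

# Layers 2-7 in the order required by §C.1.
PRE_API_LAYERS = ("approval", "backup", "audit_pre", "lock", "rate_limit", "pii_scan")


class SafetyViolation(Exception):
    """Raised by any layer that must abort before the API call."""


def _noop() -> None:
    return None


def guard(dry_run: bool, hooks: Dict[str, Callable[[], None]],
          api_call: Callable[[], None], trace: List[str]) -> str:
    trace.append("dry_run_gate")
    if dry_run:
        return "dry_run"                  # layer 1: nothing else runs, no API
    try:
        for name in PRE_API_LAYERS:       # any failure here aborts before the API
            trace.append(name)
            hooks.get(name, _noop)()
        trace.append("api_call")          # layer 8: invoked exactly once
        try:
            api_call()
            status = "success"
        except Exception:
            status = "failed"
        trace.append(f"audit_post:{status}")  # layer 9 runs on success and failure
        return status
    except SafetyViolation:
        return "aborted"
    finally:
        trace.append("release")           # layer 10: always release locks
```

A hook that raises at `audit_pre` yields `aborted` with no `api_call` entry in the trace, which is the behaviour test T11 checks.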
C.3 ApprovalProvider — DI + atomic check-and-consume [PATCH1 P1]
    class ApprovalProvider(ABC):
        @abstractmethod
        def check_and_consume(self, ctx: WriteContext) -> ApprovalDecision:
            """ATOMIC for one-time approvals: validate AND mark-consumed inside ONE
            critical section. Reusable approvals: validate only (no consume).
            Exactly one concurrent caller may win a one-time approval; all others
            receive ApprovalError('already_consumed')."""

    class YamlApprovalProvider(ApprovalProvider):  # Sprint 1–2
        def __init__(self, path="config/write-approvals.yaml"): ...
        # Implementation contract (Sprint 1):
        #  - acquire an exclusive OS file lock on write-approvals.yaml
        #    (flock/lockf on the file or a sidecar .lock) for the WHOLE
        #    check→decide→consume→rewrite sequence
        #  - re-read the file INSIDE the lock (no stale in-memory copy)
        #  - one-time: if used==false → set used=true + used_by(agent)
        #    + used_at + idempotency_key, atomically rewrite (tmp+fsync+rename),
        #    THEN release lock
        #  - if used==true on entry → ApprovalError('already_consumed')
        #  - lock contention → bounded wait then ApprovalError('approval_locked')
        #  - reusable-within-expiry: validate under lock, do NOT mutate

    class DirectusApprovalProvider(ApprovalProvider):  # Sprint 3+ prototype
        ...  # atomicity via a conditional/transactional UPDATE (compare-and-set on used flag)
SafetyLayer.__init__(self, *, approval_provider: ApprovalProvider, ...) — SafetyLayer never imports YamlApprovalProvider; wired in a composition root (lark_client/factory.py / CLI bootstrap). Swapping YAML→Directus must not touch safety.py. The atomicity contract is part of the interface, so any provider must guarantee single-winner one-time consumption (covered by the provider-swap contract test, §H.3 / Sprint 3).
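The file-locked check-and-consume contract could look roughly like the sketch below. Several things are assumptions made to keep it self-contained: it stores approvals as JSON rather than YAML (stdlib only), uses a blocking `flock` instead of the bounded wait the contract asks for, and skips expiry and scope validation. It is POSIX-only because of `fcntl`.

```python
import fcntl
import json
import os
import tempfile


class ApprovalError(Exception):
    pass


def check_and_consume(path: str, approval_id: str, agent: str) -> None:
    """Consume a one-time approval; exactly one concurrent caller can succeed."""
    with open(path + ".lock", "w") as lock_fd:      # sidecar lock file
        fcntl.flock(lock_fd, fcntl.LOCK_EX)         # held for the whole sequence
        with open(path, encoding="utf-8") as fh:    # re-read INSIDE the lock
            approvals = json.load(fh)
        entry = approvals.get(approval_id)
        if entry is None:
            raise ApprovalError("missing")
        if entry.get("used"):
            raise ApprovalError("already_consumed")
        entry.update(used=True, used_by=agent)
        # Atomic rewrite: temp file in the same directory, fsync, then rename.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "w", encoding="utf-8") as out:
            json.dump(approvals, out)
            out.flush()
            os.fsync(out.fileno())
        os.replace(tmp, path)
    # Closing lock_fd releases the lock, also when an ApprovalError is raised.
```

Each caller opens its own descriptor on the lock file, so the exclusive lock serialises threads as well as processes; that is what makes a thread-based version of T5b meaningful.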
C.4 Wildcard / first-write policy (req §7, §13.8, §13.10) — unchanged
record.create wildcard-table ✅ (within scope+expiry); record.update/delete ❌; field.create/update ❌; field/table delete ❌ (break-glass). First write to any base/table → explicit base_key+table_id, wildcard rejected.
C.5 Approval defaults (req §8) — unchanged
record.create reusable-within-expiry (narrow scope only); record.update one-time; record.delete one-time mandatory; field.create/update one-time mandatory; field/table delete break-glass one-time mandatory. Reusable explicit only, forbidden for delete/schema.
C.6 Audit 2-phase + emergency fallback + orphan-backup handling [PATCH1 P7]
- Primary sink: `/var/log/lark-ops/YYYYMMDD.jsonl` (append+fsync).
- Phase 1 (pre): `{phase:"planned", op, base_key, table_id, targets, agent, approval_id, idempotency_key, backup_ref, ts}`. Fsync fail → abort, API NOT called.
- Phase 3 (post): `{phase:"success"|"failed", ...same id..., lark_response_meta, pii:{…}}`.
- Emergency fallback: phase-3 write fail after API success → independent sink `/var/log/lark-ops/EMERGENCY/<ts>-<idempotency_key>.json` (separate fd) + `WriteOutcome.status="success"`, `error="audit_post_degraded"`; never silently swallow; if even emergency sink fails → stderr `LARK-AUDIT-LOST` line.
- Orphan-backup handling (audit-pre fails after a backup was already written at layer 3): the mutation never happened, so the GPG backup is orphaned.
  - If the backup file can be removed safely (it was created this attempt, path matches the in-flight `idempotency_key`, no concurrent reader) → delete it.
  - Else (uncertain ownership / removal error) → leave it and append a record to `/var/log/lark-ops/orphan-backups.log`: `{idempotency_key, backup_path, key_fingerprint, reason:"audit_pre_failed", ts}` for later operator sweep. Never let an orphan encrypted blob accumulate silently. Orphan log lines contain no raw PII (only path + fingerprint + id).
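The audit-post degradation path can be sketched as a primary write with two fallbacks. This is an illustrative sketch, not the planned module: the function names are hypothetical, paths are passed in rather than hardcoded, and the `"audit_lost"` return value is an assumption (the design only specifies the stderr line).

```python
import json
import os
import sys
import time
from typing import Optional


def append_jsonl(path: str, entry: dict) -> None:
    """Append one JSON line and fsync so the entry is durable."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def write_audit_post(entry: dict, primary: str, emergency_dir: str) -> Optional[str]:
    """Return None on success, or a degradation marker if a fallback was needed."""
    try:
        append_jsonl(primary, entry)
        return None
    except OSError:
        pass                                    # fall through to the emergency sink
    try:
        os.makedirs(emergency_dir, exist_ok=True)
        name = f"{int(time.time())}-{entry['idempotency_key']}.json"
        append_jsonl(os.path.join(emergency_dir, name), entry)
        return "audit_post_degraded"
    except OSError:
        # Last resort: never swallow the loss silently.
        print(f"LARK-AUDIT-LOST {entry['idempotency_key']}", file=sys.stderr)
        return "audit_lost"
```

The caller would copy a non-`None` marker into `WriteOutcome.error` while keeping `status="success"`, since the Lark mutation has already happened.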
D. PII Protection (req §6)
Two layers run in parallel (always active — 18 bases, unknown PII fields).
D.1 FieldPIIRegistry (whitelist)
config/pii-fields.yaml, keyed by (base_key, table_id, field_id) — field_id never name (req §12.7); seed from S176 snapshots; growable by PR.
D.2 PatternPIIDetector (regex, VN)
| Type | Pattern | Note |
|---|---|---|
| national_id_cccd | `\b\d{12}\b` | match before phone |
| national_id_cmnd | `\b\d{9}\b` | |
| passport | `\b[A-Z]{1,2}\d{7}\b` | |
| phone_vn | `\b(?:+84\|0)(?:3\|…` (pattern truncated in source) | |
| bank_account | `\b\d{8,16}\b` | heuristic, flag only |
| email | `\b[\w.+-]+@[\w-]+\.[\w.-]+\b` | optional |
Returns types + counts only, never substrings.
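A minimal sketch of a detector that returns types and counts only. It uses the patterns from the table except `phone_vn`, whose pattern is incomplete in this document. Masking each match before trying the next pattern is an assumption added here so that a 12-digit CCCD is not counted again by the broader `bank_account` heuristic.

```python
import re
from collections import Counter
from typing import Dict

# Order matters: more specific ID patterns are matched (and masked) first.
PATTERNS = [
    ("national_id_cccd", re.compile(r"\b\d{12}\b")),
    ("national_id_cmnd", re.compile(r"\b\d{9}\b")),
    ("passport", re.compile(r"\b[A-Z]{1,2}\d{7}\b")),
    ("bank_account", re.compile(r"\b\d{8,16}\b")),
    ("email", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")),
]


def detect_pii(text: str) -> Dict[str, int]:
    """Return {pii_type: count}; matched substrings are never returned."""
    counts: Counter = Counter()
    for pii_type, pattern in PATTERNS:
        def mask(match: "re.Match[str]", pii_type: str = pii_type) -> str:
            counts[pii_type] += 1
            return " " * len(match.group())   # blank out so later patterns skip it
        text = pattern.sub(mask, text)
    return dict(counts)
```

The result feeds `redaction_types` and the count fields in the audit metadata; the masked text is discarded.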
D.3 Pipeline integration & policy nuance [PATCH1 P6]
Both layers feed SafetyLayer layer 7; union of hits → redaction_types, redacted_fields_count.
Policy (PATCH1, explicit two-rule split):
1. Guarded record write path: PII detection does NOT block the write by default (matches req §6: the write itself proceeds; the protection is in what gets logged + GPG backup).
2. Egress paths: PII detection MUST block when the same data would leave through a non-protected channel, i.e. plaintext file output, `--export`/dump, stdout/console print, or any non-GPG backup/log path. Such a route with detected PII → `SafetyViolation`, abort.
- Audit + rollback command MUST NEVER contain raw PII: audit holds metadata only (§D.4); the auto-generated rollback command references the encrypted backup file path, never inline values.
Optional --pii-strict (off by default) escalates rule 1 to also abort guarded writes on detection — decision deferred to Huyên (OQ-3).
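The two-rule split, plus the optional strict mode, can be expressed as one small check. This is a sketch under assumptions: the sink labels (`guarded_write`, `gpg_backup`, `stdout`, ...) are hypothetical names for the channels listed above, not identifiers from the codebase.

```python
from typing import Dict


class SafetyViolation(Exception):
    pass


# Hypothetical sink labels; only these two count as protected channels.
PROTECTED_SINKS = {"guarded_write", "gpg_backup"}


def enforce_pii_policy(pii_hits: Dict[str, int], sink: str, pii_strict: bool = False) -> None:
    """Raise SafetyViolation when detected PII would leave via a blocked route."""
    if not pii_hits:
        return                                      # nothing detected, nothing to block
    if sink not in PROTECTED_SINKS:
        # Rule 2: plaintext file, export/dump, stdout, non-GPG backup or log.
        raise SafetyViolation(f"PII egress to non-GPG sink: {sink}")
    if pii_strict and sink == "guarded_write":
        # Optional escalation of rule 1 (off by default).
        raise SafetyViolation("pii-strict: guarded write blocked")
```

With the default settings a guarded write carrying PII passes, while the same payload routed to `stdout` aborts, which is the T10 / T10b pair.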
D.4 Audit redaction format (req §6)
{ "pii_redacted": true,
"redaction_types": ["national_id_cccd","bank_account"],
"redacted_fields_count": 3,
"detector": ["registry","pattern"] }
E. GPG Backup Design (req §6, §13.6 — mandatory from Sprint 1)
E.1 Key source — GSM, public-key-only on VPS
GSM secret LARK_BACKUP_GPG_PUBKEY (project github-chatgpt-ggcloud), ASCII-armored public key, fetched via the existing secret accessor (not read directly in business code). Private key NEVER on VPS — offline with Huyên. VPS encrypts, cannot decrypt (mitigates R-4).
E.2 Rotation policy
Annual or on compromise. New keypair offline → publish new public key as a new GSM version → service picks up latest on next start → old backups still decryptable with retained offline private keys, indexed by fingerprint (recorded in each backup's sidecar meta).
E.3 File naming & storage
    /var/log/lark-ops/writes/<YYYYMMDD>/
      <base_key>__<table_id>__<record_id>__<idempotency_key>__pre.json.gpg
      <base_key>__<table_id>__<record_id>__<idempotency_key>__pre.meta.json   # unencrypted: key fp, ts, op, NO pii
Batch: one .json.gpg per chunk; records as JSON lines before encryption.
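For the single-record case, the naming scheme can be sketched as a pure path builder. The function name and the `recX` record id used in the usage note are hypothetical; only the directory layout and the `__`-joined stem come from the scheme above.

```python
from pathlib import PurePosixPath
from typing import Tuple

BACKUP_ROOT = PurePosixPath("/var/log/lark-ops/writes")


def backup_paths(day: str, base_key: str, table_id: str,
                 record_id: str, idempotency_key: str) -> Tuple[PurePosixPath, PurePosixPath]:
    """Return (encrypted blob path, sidecar meta path) for a pre-mutation backup."""
    stem = "__".join([base_key, table_id, record_id, idempotency_key, "pre"])
    folder = BACKUP_ROOT / day                     # day is YYYYMMDD
    return folder / f"{stem}.json.gpg", folder / f"{stem}.meta.json"
```

Calling `backup_paths("20260519", "88-phai-cu-base-dem", "tblPQ6N79EeOmnTm", "recX", "k1")` gives a blob and a sidecar that share one stem, so the recovery step can locate both from the `idempotency_key` alone.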
E.4 Recovery procedure
Locate by idempotency_key (also in audit-pre backup_ref) → verify fingerprint via sidecar → gpg --decrypt on Huyên's offline machine → re-apply via a normal guarded lark-tool records update ... --no-dry-run --confirm (recovery is itself fully guarded/audited).
E.5 Orphan backup lifecycle — [PATCH1 P7]
A backup is committed only once its paired audit-pre entry is durably written (the audit-pre records backup_ref). Backups whose audit-pre never landed are orphans:
- Inline handling: §C.6 (delete-if-safe, else append `orphan-backups.log`).
- Operator sweep (documented runbook, Sprint 4 monitoring): periodically reconcile `writes/` blobs vs audit-pre `backup_ref` set; any blob with no corresponding committed audit-pre and present in `orphan-backups.log` (or older than a grace window with no audit-pre) → securely delete after operator confirmation. Sweep actions are themselves audited. No automatic unconfirmed deletion of anything not provably this-attempt orphaned.
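The reconcile step of the sweep is essentially a set difference. A minimal sketch, assuming the three inputs have already been collected as sets of backup paths (the function name and the two result buckets are hypothetical); it only classifies candidates and deletes nothing.

```python
from typing import Dict, List, Set


def find_orphan_candidates(blob_paths: Set[str], committed_refs: Set[str],
                           logged_orphans: Set[str]) -> Dict[str, List[str]]:
    """Classify backup blobs that have no committed audit-pre backup_ref."""
    unreferenced = blob_paths - committed_refs
    return {
        # Already recorded in orphan-backups.log: ready for operator confirmation.
        "logged": sorted(unreferenced & logged_orphans),
        # Not logged: only eligible once older than the grace window.
        "needs_grace_check": sorted(unreferenced - logged_orphans),
    }
```

Keeping classification separate from deletion leaves the operator-confirmation step as an explicit, auditable action.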
F. Track A — MCP Plugin
F.1 Survey result + topology rule [PATCH1 P2]
Live @larksuiteoapi/lark-mcp surface (observed in-session, = same plugin) — exactly 9 bitable tools: app_create, appTable_create, appTable_list, appTableField_list, appTableRecord_search, appTableRecord_create, _batchCreate, appTableRecord_update, _batchUpdate. Matches requirements §2.
PATCH1 topology rule (hard):
- The existing plugin MAY remain mounted ONLY for read-class tools: `appTable_list`, `appTableField_list`, `appTableRecord_search` (and read-only `app`/`table` listing).
- The plugin's write tools (`appTableRecord_create`/`_batchCreate`/`_update`/`_batchUpdate`, `appTable_create`, `app_create`) MUST NOT be used in any production workflow: they call Lark directly and bypass SafetyLayer entirely (no dry-run, approval, backup, audit, PII). This is risk R-6 (raised to HIGH).
- If the MCP host cannot hide/disable individual plugin tools (tool-level allowlist not supported) → replace the `@larksuiteoapi/lark-mcp` plugin entirely with the custom adapter (which itself exposes guarded read + write tools). Partial trust of a plugin whose write tools cannot be removed is not acceptable.
F.2 Gap analysis
| Needed | In plugin? | Decision |
|---|---|---|
| record.get by id | ❌ (only search) | custom adapter |
| record.delete / batchDelete | ❌ | custom adapter |
| field.create/update/delete | ❌ | custom adapter (Sprint 3) |
| appTable.update/delete | ❌ | custom adapter (Sprint 4) |
| view.list/create/delete | ❌ | custom adapter (Sprint 4) |
| guarded record.create/update | plugin has UNguarded only | must come from custom adapter (plugin write tools forbidden, P2) |
F.3 Custom MCP adapter (Sprint 2)
- `lark_client/mcp_adapter/` (Python MCP SDK): adapter only, zero write logic, imports `lark_client.service.LarkWriteService`.
- Sprint 2 tools: `lark_record_get`, `lark_record_delete`, `lark_record_create`, `lark_record_update` (all guarded). Sprint 4: field/view.
- Each tool builds `WriteContext(agent="cowork-mcp")` → `LarkWriteService` → full SafetyLayer.
- Hard guard (R-7): resolve `base_key`; if resolved app_token ≠ Base đệm `Nf2bb1ExXaYnlksgoyQl72GNgAc`, any delete/schema op rejected at adapter boundary (req §11, §13.4).
- Auth reuses GSM `LARK_APP_*` via `LarkCore`; no new bot/credential (req §12.2).
- Final mount topology (plugin-read-only-alongside vs full-replace) depends on whether the host supports tool-level hiding → OQ-4 (now a concrete feasibility check, not an open preference).
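The adapter-boundary hard guard (R-7) might be sketched as follows. Assumptions: operation names follow the `record.delete` / `field.create` style used elsewhere in this document, "schema op" is taken to mean any field, table or view operation, and the buffer token is passed in from the registry instead of being hardcoded (to respect the `app_token`-literal rule).

```python
class SafetyViolation(Exception):
    pass


SCHEMA_KINDS = {"field", "table", "view"}   # assumption: what counts as a schema op


def is_restricted(operation: str) -> bool:
    """True for delete operations and for any schema operation."""
    kind, _, verb = operation.partition(".")
    return verb.endswith("delete") or kind in SCHEMA_KINDS


def assert_adapter_allowed(operation: str, resolved_app_token: str,
                           buffer_app_token: str) -> None:
    """Reject delete/schema ops unless the target resolves to the buffer base."""
    if is_restricted(operation) and resolved_app_token != buffer_app_token:
        raise SafetyViolation(f"{operation} rejected at MCP boundary: target is not Base đệm")
```

Running this check before a `WriteContext` is even built keeps a production delete from reaching the service layer through Track A.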
G. Track B — CLI Write Module
G.1 lark_client/writer.py
Typed façade over service.py (CLI thin, req §12.8): create / batch_create / get / update / batch_update / delete / batch_delete, return WriteOutcome. No write logic here.
G.2 lark_client/field_manager.py (Sprint 3)
SUPPORTED = {"Text","Number","SingleSelect","Checkbox"}; create/update/delete by field_id only (req §12.7); complex types → UnsupportedFieldType until Sprint 4+.
G.3 CLI commands (Click) + $LARK_AGENT nuance [PATCH1 P4]
cli/commands/records.py and cli/commands/fields.py as in base (create/get/update/delete/batch-*, fields create/update/delete with literal ack string for delete).
$LARK_AGENT rule (PATCH1):
- Every real write (`--no-dry-run`), single or batch, CLI or MCP, MUST have an explicit agent identity. Missing/empty `$LARK_AGENT` on a real write → abort `SafetyViolation` before layer 8.
- Dry-run only: CLI MAY default `$LARK_AGENT=claude-code` (convenience for `--dry-run`, which never mutates).
- Valid values: `claude-code | cowork-mcp | cron` (MCP adapter always sets `cowork-mcp`).
- Audit always records `agent` in both phase-1 and phase-3 entries; there is no audit entry without an agent field.
Conventions otherwise unchanged: dry-run default ON; update/delete on non-buffer base need --confirm; field/table delete need the literal ack string; exit codes per §B.3; registered into existing cli/lark_tool.py Click group (no new entrypoint).
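The `$LARK_AGENT` rule reduces to a small resolver. A minimal sketch with a hypothetical function name; rejecting unknown agent values is an assumption that follows from the "valid values" list rather than an explicit requirement.

```python
from typing import Optional


class SafetyViolation(Exception):
    pass


VALID_AGENTS = {"claude-code", "cowork-mcp", "cron"}


def resolve_agent(env_value: Optional[str], dry_run: bool) -> str:
    """Return the agent identity, defaulting only when the run is a dry-run."""
    agent = (env_value or "").strip()
    if not agent:
        if dry_run:
            return "claude-code"        # convenience default; dry-run never mutates
        raise SafetyViolation("real write requires an explicit $LARK_AGENT")
    if agent not in VALID_AGENTS:
        raise SafetyViolation(f"unknown $LARK_AGENT: {agent}")
    return agent
```

In the CLI this would typically be called as `resolve_agent(os.environ.get("LARK_AGENT"), dry_run)` before the `WriteContext` is constructed.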
G.4 config/write-approvals.yaml schema
As base (id, operation, scope{base_key,table_id}, one_time_use default true, used, reason, created_by human, created_at, expires_at; approval_exempt_bases: ["88-phai-cu-base-dem"]). PATCH1: the used/used_by/used_at fields are written under file lock by YamlApprovalProvider (§C.3).
G.5 Integration with LarkCore + API limits config [PATCH1 P3]
- Reuse `LarkCore`: GSM token, retry (3× backoff 429/503/network), global rate lock, endpoint whitelist.
- Whitelist additions to `config/allowed_endpoints.yaml` (each = 1 reviewed change; write endpoints initially empty per README §5.2): record create/get/update/delete + batch_create/batch_update/batch_delete; (Sprint 3) field create/update/delete; (Sprint 4) table create/delete + views.
- New `config/lark-api-limits.yaml` (see §B.4) holds rate + per-op batch ceilings; `record_delete_max` default 100 until an official Lark doc citation or a recorded Base đệm probe raises it. Service rejects explicit over-ceiling requests (no silent truncation).
- Idempotency: `client_token`/UUID per Lark write API where supported.
H. Testing Strategy
H.1 Base đệm (test target — CONFIRMED not production)
88 - Phái cử (Base đệm), app_token Nf2bb1ExXaYnlksgoyQl72GNgAc; tables TTS=tblPQ6N79EeOmnTm (7 fld), Đơn hàng=tblaU7kxyPTNBSrR (5 fld). Production Base 88 is the different token YSIkb8PxOaNaozs2vwalOOcagkf (80 tbl) — tests MUST NOT use it; hard token assert enforced.
H.2 12 test cases (Base đệm only)
T1 dry-run default no API · T2 create real + audit pre/post · T3 update no-confirm aborts · T4 update w/ confirm + GPG backup exists · T5 one-time approval reuse → ApprovalError · T5b (PATCH1) two concurrent one-time consumers → exactly one success, other already_consumed · T6 batch_create 600 → 500+100 chunks audited · T6b (PATCH1) batch_delete 150 with default ceiling 100 → rejected or 100+50 chunked per config · T7 batch partial failure → partial_failure, no auto-rollback, rollback cmd · T8 scope mismatch → ApprovalError · T9 wildcard delete rejected · T10 PII payload → guarded write proceeds, audit metadata only, raw only in GPG · T10b (PATCH1) PII to plaintext/--export/stdout → SafetyViolation · T11 audit-pre unwritable → API not called + orphan backup deleted-or-logged · T12 audit-post fails post-success → status=success + emergency fallback.
H.3 Isolation & test split [PATCH1 P8]
- Unit / mock tests: MANDATORY before any commit. Mock `LarkCore` HTTP; assert SafetyLayer ordering, atomic approval single-winner (T5b via threads/processes), PII metadata + egress block (T10b), GPG invoked, 2-phase audit, orphan-backup path (T11), batch-ceiling logic (T6b). Mirrors existing 19/19+8/8 mocked style. Commit gate = all unit/mock green.
- Base đệm integration tests: gated. Run only when env `LARK_TEST_INTEGRATION=1` AND a hard assertion confirms app_token == Base đệm token; otherwise skipped (never silently hit any base). Base đệm reset by Claude Code allowed but the reset itself must be audited (req §12.12). Integration green is NOT required for the unit-level commit gate but IS required before Sprint 1 sign-off.
- `app_token` literal allowed only in `tests/` and `bases.yaml` (README §6).
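One possible reading of the integration gate, as a sketch: skip when the env flag is absent, and fail loudly when the flag is set but the resolved token is not the buffer token. The function name is hypothetical, and treating a token mismatch as a hard failure rather than a skip is an interpretation of "hard assertion" that should be confirmed.

```python
from typing import Mapping


def integration_enabled(env: Mapping[str, str], resolved_app_token: str,
                        buffer_app_token: str) -> bool:
    """Return True only when integration tests may run against the buffer base."""
    if env.get("LARK_TEST_INTEGRATION") != "1":
        return False                    # default: integration tests are skipped
    if resolved_app_token != buffer_app_token:
        # Opted in but pointed at the wrong base: refuse loudly, never proceed.
        raise AssertionError("integration tests may only target Base đệm")
    return True
```

A test harness could call this once per session, for example from a fixture, and skip the integration module when it returns `False`.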
I. Sprint Breakdown
Sprint 0 — S177-R0 Code Reconcile [PATCH1 P5] (NO CODE WRITTEN)
Goal: validate every assumption this design makes against the live lark-client source before a single line of Sprint 1 code.
Activities (read-only, by an agent/operator with repo+shell access to /opt/incomex/lark-client/):
- Confirm `LarkCore` actual class/module name and method signatures (token acquisition/refresh, retry, the global rate-lock API, the GSM/secret accessor used for `LARK_APP_*`).
- Confirm `Registry` / `bases.yaml` loader name + `base_key→app_token` resolution API and the Base đệm key string.
- Confirm `lark_client.exceptions` base class hierarchy (so new errors subclass it, not a parallel tree).
- Confirm CLI entrypoint `cli/lark_tool.py` Click group name + how subcommand modules are registered.
- Confirm config dir conventions and existing files (`allowed_endpoints.yaml`, `bases.yaml`) and where new `write-approvals.yaml` / `pii-fields.yaml` / `lark-api-limits.yaml` should live.
- Confirm existing test layout/harness conventions (the 19/19 + 8/8 suites) and the `app_token`-literal exception scope.
Exit / STOP rule: produce a reconcile report (KB) mapping each design symbol → actual symbol. If the live source materially differs from design assumptions (e.g. no `LarkCore` global lock, different secret path, no Click group) → STOP, do not start Sprint 1, route the delta back to Huyên/GPT for a design amendment. Only a clean/aligned reconcile (or trivial rename mapping) authorizes Sprint 1.
Sprint 1 — Track B core (CLI records + safety)
Deliverables: service.py (record ops), writer.py, safety.py (8 layers incl. atomic approval + orphan-backup), ApprovalProvider ABC + YamlApprovalProvider (file-locked atomic), GPG backup module, 2-phase audit + emergency + orphan log, cli/commands/records.py, config/write-approvals.yaml, config/pii-fields.yaml, config/lark-api-limits.yaml, PII registry+pattern + egress guard, record endpoints whitelisted.
Acceptance: all unit/mock tests green (incl. T5b, T6b, T10b, T11 orphan) — mandatory commit gate; gated Base đệm integration subset green before sign-off; existing 19/19+8/8 still pass; no import requests; no hardcoded app_token (pre-commit grep); dry-run default verified; real write without $LARK_AGENT aborts; GPG backup decryptable offline; audit-pre-fail aborts before API with orphan handling.
Sprint 2 — Track A custom MCP adapter
Deliverables: lark_client/mcp_adapter/ (lark_record_get/delete/create/update, all guarded), Base-đệm hard guard, topology decision per OQ-4 (read-only plugin alongside vs full replace).
Acceptance: Cowork get/delete on Base đệm via guarded adapter; production-token delete rejected at boundary; plugin write tools demonstrably unused/removed for production; all MCP writes in the audit stream with agent=cowork-mcp; no duplicated write logic.
Sprint 3 — Field ops + ApprovalProvider swap
Deliverables: field_manager.py (Text/Number/SingleSelect/Checkbox), cli/commands/fields.py, field endpoints whitelisted, DirectusApprovalProvider prototype injected without touching safety.py.
Acceptance: field create/update/delete on Base đệm; complex types → UnsupportedFieldType; provider-swap contract test: Directus provider passes the SAME atomic single-winner one-time-consume contract as YAML provider.
Sprint 4 — Schema ops + monitoring + full integration
Deliverables: table/base create/delete + view list/create/delete, maintenance-window + staging gate (req §13.5), monitoring (audit volume / failure-rate / orphan-backup sweep alarms, e.g. uptime-kuma push), full e2e integration suite.
Acceptance: schema op refuses outside maintenance window; full T1–T12 (+b variants) + schema cases green on Base đệm; monitoring fires on injected audit-loss and on orphan backups; README §3/§8 updated.
J. Open Questions
- OQ-1 → folded into Sprint 0 (S177-R0). Still the top blocker, but now a defined gated phase with an explicit STOP rule (§I.0). Closed by a clean reconcile report; escalate on material drift.
- OQ-2: GPG key: confirm public-key-only-on-VPS / private-key-offline model and private-key custodian + GSM secret name `LARK_BACKUP_GPG_PUBKEY`. (unchanged)
- OQ-3 (refined): Confirm the PATCH1 two-rule PII policy: guarded writes proceed, but plaintext/export/stdout/non-GPG paths are blocked; audit/rollback never raw PII. Decide if `--pii-strict` (also abort guarded writes) ships Sprint 1 or later.
- OQ-4 (refined to a feasibility check): Does the MCP host support tool-level hiding of individual plugin tools? If YES → keep `@larksuiteoapi/lark-mcp` mounted read-only-tools-only alongside the custom adapter. If NO → fully replace the plugin with the custom adapter (P2). Needs a concrete host-capability answer.
- OQ-5 (refined): `record.batch_delete` ceiling defaults to 100 in `config/lark-api-limits.yaml`. Raise only after (a) an official Lark Open API doc citation, or (b) a recorded Base đệm probe. Confirm acceptance of the conservative default.
- OQ-6 (process, unchanged): Patched doc still lives only in KB (and the base on VPS `/opt/incomex/docs/mcp-writes/`); not git-committed, final repo path not populated. Requires repo+shell access this environment lacks.
- OQ-7 (new): Orphan-backup sweep: confirm the operator runbook + grace window and that automated sweep deletion requires operator confirmation (no unconfirmed auto-delete of non-provably-orphaned blobs).
End of S177 Architecture Design — PATCH1. DRAFT v1.1 awaiting Huyên review (OQ-2…OQ-7) and S177-R0 Code Reconcile (OQ-1). Not commit-ready.