Decisions

13. Key Architectural Decisions Log

#	Decision	Rationale	Date
ADR-001	Use cagent native YAML, no wrapper format	Zero translation layer, users get full cagent features	2026-02-23
ADR-002	`soul.yaml` as single identity file per agent	Simpler than OpenClaw's 6+ bootstrap files. Can add more via `add_prompt_files`	2026-02-23
ADR-003	`cagent serve api` as primary container entrypoint	HTTP API is the natural interface for containerized agents	2026-02-23
ADR-004	Bash CLI, not compiled binary	Minimal dependencies (docker, curl, jq). Ship fast, iterate.	2026-02-23
ADR-005	Debian slim base image	Better cagent/tool compat than Alpine. Acceptable size trade-off.	2026-02-23
ADR-006	`mobyclaw.yaml` is dev-only, not product config	Separation of concerns: dev agent ≠ product agent	2026-02-23
ADR-007	"moby" as the default/reference agent	Clear identity, easy onboarding, extensible pattern	2026-02-23
ADR-008	Docker Compose over Kubernetes	Right-sized for personal agent deployment. K8s is overkill.	2026-02-23
ADR-009	Delegate agent loop entirely to cagent	Focus on orchestration, not reimplementing inference + tool execution	2026-02-23
ADR-010	Memory as plain Markdown files (OpenClaw pattern)	Simple, portable, agent can read/write with filesystem tools. No DB needed.	2026-02-23
ADR-011	Gateway as separate container from agent	Clean separation: gateway handles I/O + routing, agent handles thinking + acting	2026-02-23
ADR-012	Messaging adapters inside gateway, not separate containers	Simpler (one container), all JS libs anyway, enable/disable via env vars. Matches OpenClaw.	2026-02-23
ADR-013	Docker volumes for persistence	Workspace (memory) and data (sessions, cron) survive container restarts	2026-02-23
ADR-014	4-service separation: moby, gateway, workspace, memory	Each concern in its own container. Clean ownership. Independent scaling/failure.	2026-02-23
ADR-015	Workspace + memory as MCP servers	cagent's `type: mcp` toolset connects moby to services. No direct host mounts on agent.	2026-02-23
ADR-016	Separate workspace and memory volumes	Workspace = host files (projects, code). Memory = agent state (MEMORY.md, daily logs). Different lifecycles, different owners.	2026-02-23
ADR-017	`~/.mobyclaw/` as user data directory, bind-mounted	User-visible, editable, portable, survives `docker system prune`. Not a Docker volume.	2026-02-23
ADR-018	Messaging adapters inside gateway, not separate bridge containers	Simpler, less config, matches OpenClaw. Enable via env var presence.	2026-02-23
ADR-019	Single agent only — no multi-agent support	Mobyclaw is a personal agent, not a platform. One agent (moby), one container. Simplifies routing, config, and mental model. Can always revisit.	2026-02-23
ADR-020	Sessions created with `tools_approved: true`	`cagent serve api` pauses at `tool_call_confirmation` unless the session has `tools_approved: true`. Gateway sets this on session creation. Container isolation provides the safety boundary.	2026-02-23
ADR-021	`.env` file for secrets management	Single file, Docker Compose native, no Swarm/Vault needed. Least-privilege: per-service `environment` blocks control which container sees which var.	2026-02-23
ADR-022	End-to-end streaming via SSE PassThrough	cagent emits tokens in real-time. Gateway streams them through via PassThrough piped to HTTP response. Critical: use `res.on('close')` not `req.on('close')` for disconnect detection. Telegram adapter edits message every ~1s. CLI prints tokens to stdout.	2026-02-23
ADR-023	`docker-compose.override.yml` for per-user config	Base compose stays static + git-committed. Override is auto-generated from `credentials.env` + `workspaces.conf` on every `mobyclaw up`. Docker Compose merges them automatically. Gitignored.	2026-02-23
ADR-024	Separate `credentials.env` from `.env`	`.env` = mobyclaw infra (LLM keys, messaging). `credentials.env` = user service tokens (gh, aws). Different owners, different lifecycles. `credentials.env` lives in `~/.mobyclaw/` (portable with agent state).	2026-02-23
ADR-025	Workspaces as host bind mounts via `workspaces.conf`	Simple `name=path` format in `~/.mobyclaw/workspaces.conf`. CLI manages it (`workspace add/remove/list`). Override generation maps each entry to a bind-mount entry under the service's `volumes`. Changes require restart.	2026-02-23
ADR-026	Gateway-side scheduler with agent-created schedules via REST API	Agent calls `POST /api/schedules` via curl. Gateway owns timing, persistence, and delivery. Separation: agent composes messages, gateway delivers at the right time. No agent involvement at fire time (pre-composed messages).	2026-02-23
ADR-027	Heartbeat as periodic agent prompt, separate from scheduler	Scheduler = precise dumb timer (30s resolution). Heartbeat = intelligent agent review (15m interval). Different concerns: scheduler delivers pre-composed messages; heartbeat invokes full LLM reasoning. Agent uses `/api/deliver` to proactively message users from heartbeat.	2026-02-23
ADR-028	TASKS.md as agent-managed task store (Markdown)	Flexible Markdown file. Agent writes entries via filesystem tools. `[scheduled]` marker prevents double-scheduling. Channel stored per-task. Heartbeat reviews it. Complements schedules.json (gateway-owned) — TASKS.md is the agent's view, schedules.json is the gateway's execution state.	2026-02-23
ADR-029	Channel context injected as message prefix by gateway	Gateway prepends `[context: channel=telegram:123, time=...]` to every user message. Only mechanism available since cagent API has no per-message metadata. Agent extracts channel for schedule creation. Never displayed to user.	2026-02-23
ADR-030	Last active channel for fallback delivery	Gateway tracks last messaging channel used. Fallback when heartbeat/agent needs to deliver without a specific channel target. Resets on restart (acceptable for personal agent).	2026-02-23
ADR-031	Source code mounted at `/source` for self-modification	Agent needs to modify its own Dockerfile, gateway source, compose config, CLI, and documentation. Bind-mounting the project root gives full read-write access. Safety via: git (revert), permission-before-modify policy, syntax checks before rebuild. Four signal types: `restart`, `rebuild`, `rebuild-gateway`, `rebuild-all`.	2026-02-23
ADR-032	Persistent channel store	`ChannelStore` persists known channels to `~/.mobyclaw/channels.json` (one entry per platform). Saved on first message. Schedule API falls back to known channel. Heartbeat includes known channels in prompt. Supersedes the in-memory `lastActiveChannel` from ADR-030.	2026-02-24
ADR-033	Schedule pruning — splice-on-delivery	`markDelivered()` and `cancel()` splice entries out of array. `_load()` filters to only `pending` on startup. `schedules.json` only ever contains pending entries. Prevents unbounded growth.	2026-02-24
ADR-034	Heartbeat skip guard	`let running = false` flag prevents heartbeat overlap. If previous heartbeat still running, next tick skips. Uses `try/finally` to reset. Prevents infinite queue buildup at 30s intervals.	2026-02-24
ADR-035	Collect queue mode (OpenClaw-inspired)	Default queue mode coalesces rapid queued messages into a single combined turn. Prevents "continue, continue" spam. Messages separated by `---`. All promises resolve with the same response. Configurable via `QUEUE_MODE` env var.	2026-02-24
ADR-036	Typing indicators on message receipt	Telegram adapter sends `sendChatAction('typing')` immediately when a message is received, before any processing. Refresh every 4s while processing. OpenClaw pattern: `instant` mode. Makes agent feel responsive even during queue waits.	2026-02-24
ADR-037	Queue feedback to user	When a message is queued behind a running task, the user sees a Telegram message: "⏳ Working on something else, I'll get to this next...". It is deleted automatically when processing starts. SSE endpoint emits a `queued` event. Visible acknowledgment prevents confusion.	2026-02-24
ADR-038	Session daily/idle reset	Sessions auto-reset at configurable hour (default 4 AM) and/or after idle timeout. OpenClaw pattern: daily reset clears stale context, idle reset catches long gaps. `/new` and `/reset` commands force immediate reset. Persisted `lastActivity` timestamp survives restarts.	2026-02-24
ADR-039	/stop abort command	`/stop` in Telegram (or `POST /api/stop`) clears the queue and signals abort on the current run. Returns count of cleared messages. Graceful: doesn't crash the agent, just ends the current turn.	2026-02-24
ADR-040	Queue cap with oldest-drop overflow	Max 20 queued messages (configurable). When cap exceeded, oldest message is dropped with error. Prevents unbounded memory growth from spam or runaway loops. OpenClaw uses summarize policy; we use simple drop for now.	2026-02-24
ADR-041	Debounce on queue drain	1000ms debounce before draining collected messages (collect mode only). Lets rapid messages accumulate before the agent processes them as one turn. Configurable via `QUEUE_DEBOUNCE_MS`.	2026-02-24
ADR-042	Tool Gateway as MCP aggregator in separate container	External service access (Notion, Google, etc.) routed through a dedicated `tool-gateway` container. Manages upstream MCP connections, auth, and token lifecycle independently. Exposes aggregated tools as a single MCP server to cagent via HTTP bridge. Clean separation: agent doesn't know about OAuth, tokens, or MCP wiring.	2026-02-24
ADR-043	Chat-mediated auth for all external services	No CLI commands, no admin UIs for auth. All OAuth/device-code flows are initiated conversationally — user says "connect notion", agent sends auth URL via Telegram, user clicks and authorizes, agent confirms. Mirrors how `gh auth login` worked (Moby sent the device code via Telegram). For OAuth redirect flows (Notion), tool-gateway hosts callback endpoint.	2026-02-24
ADR-044	mcp-bridge: stdio-to-HTTP relay for cagent → tool-gateway	cagent only supports MCP via stdio (`command` + `args`). Tool-gateway runs in a separate container with HTTP. Bridge script in moby container translates stdio ↔ HTTP. ~50 lines, shell or Go. Allows clean container separation while keeping native MCP tool discovery.	2026-02-24
ADR-045	CLI tools (gh, git, curl) installed directly in agent container	If a service has a solid CLI, skip the MCP layer. `gh` already in moby container. Agent uses via shell toolset. Simpler, fewer moving parts. MCP reserved for services that need structured tool schemas or complex auth.	2026-02-24
ADR-046	Zod schemas required for McpServer.tool()	MCP SDK v1.27.0's `McpServer.tool()` requires Zod schema objects, not plain JSON Schema `{type:"string"}`. `isZodRawShapeCompat()` silently rejects plain objects → empty `inputSchema.properties`. All tool definitions (tool-gateway + mcp-bridge re-registration) must use `z.string()`, `z.number()`, etc.	2026-02-24
ADR-047	zod installed globally in moby container	mcp-bridge runs inside moby and needs zod to convert JSON Schema → Zod when re-registering remote tools. Added `zod` to `npm install -g` in Dockerfile alongside `@modelcontextprotocol/sdk`. Bridge uses `NODE_PATH` auto-discovery for global modules.	2026-02-24
ADR-048	Full Playwright browser in tool-gateway	Headless Chromium via Playwright in tool-gateway container for full web interaction (navigate, click, type, fill forms, screenshots). Uses Playwright’s internal `_snapshotForAI()` for accessibility snapshots with aria-ref element targeting — same approach as `@playwright/mcp`. Single persistent browser context with 10min idle auto-close. Browser is ~400MB but enables account creation, multi-step flows, CAPTCHA viewing via screenshots.	2026-02-24
ADR-049	Accessibility snapshots over screenshots for interaction	Agent uses text-based accessibility tree (with ref IDs) to understand and interact with pages. Screenshots are secondary — useful for visual verification (CAPTCHAs, layout) but you "can’t perform actions based on screenshots." Refs change after every action; agent must use refs from the most recent snapshot. Matches Playwright MCP’s design philosophy.	2026-02-24
ADR-050	Recursive JSON Schema → Zod in mcp-bridge	Bridge now handles nested types: arrays (`z.array()`), objects (`z.object()`), enums (`z.enum()`), not just primitives. Required for `browser_fill_form` (array of field objects) and `browser_tabs` (enum action). Single recursive `jsonSchemaToZod()` function.	2026-02-24
ADR-051	Agent max_iterations raised to 15	Browser automation tasks require many sequential tool calls (navigate → snapshot → fill → click → wait → snapshot → ...). The default of 5 iterations was too low. 15 allows a realistic multi-step flow while still preventing runaway loops.	2026-02-24
ADR-052	Snapshot trimming: tree-based compact mode	`_snapshotForAI()` returns full accessibility trees (59KB+ for HN, 135KB for Wikipedia). Rewrote trimmer from naive line-based to proper tree parser: parse indentation tree, strip `/url` metadata, unwrap noise wrappers, remove separator text nodes, collapse single-child chains, collapse repeated siblings, hard-cap at 5000 chars. Results: HN 59KB→1.4KB (98%), Wikipedia 135KB→5KB (96%). `browser_snapshot` accepts `full=true` as an escape hatch.	2026-02-24
ADR-053	Read-only integrations as native tool-gateway tools	Slack, Notion, Gmail, and Calendar integrations implemented as native tool-gateway tools (direct REST calls) rather than proxying upstream MCP servers. We only need 3-5 read-only endpoints per service; native tools are simpler, faster, no third-party MCP dependencies. 15 new tools total.	2026-02-24
ADR-054	Notion uses internal integration token, not OAuth	For read-only access, Notion's internal integration token is dramatically simpler than OAuth 2.0 + PKCE. User creates integration at notion.so/my-integrations, shares pages with it, pastes token. No callback URLs, no browser redirects, no token refresh complexity.	2026-02-24
ADR-055	Google OAuth shared across Gmail + Calendar	One Google Cloud project, one OAuth consent screen, one auth flow. Both `gmail.readonly` and `calendar.readonly` scopes requested in the same authorization URL. User authorizes once, gets access to both services. Single token in `~/.mobyclaw/tokens/google.json`.	2026-02-24
ADR-056	Short-term memory (STM) for session continuity	Rolling buffer of last 20 user↔agent exchanges saved to `short-term-memory.json`. On new session creation, injected as `[SHORT-TERM MEMORY]` block into the first message. Heartbeat/system messages excluded. Messages capped at 1500 chars. Solves the amnesia problem where daily/turn-limit session resets lose all context.	2026-02-24
ADR-057	Context optimizer — smart context injection	Before user messages reach the agent, fetch relevant MEMORY.md sections (scored by keyword overlap), inner emotional state, self-model summary, and matching explorations. Prepend as `[MEMORY CONTEXT]` block. Agent doesn't need to manually read MEMORY.md each turn. Fetches from dashboard API with 3s timeout and 1500-token budget. Graceful fallback if API fails.	2026-02-24
ADR-058	Exploration heartbeats	Every Nth heartbeat (default: 4th, configurable via `EXPLORATION_FREQUENCY`) allows the agent to follow a curiosity topic from its `curiosity_queue`, fetch 1 URL, and write a summary to `explorations/`. Normal heartbeats are reflection-only (cheap). At 2h intervals, that's ~1 exploration per 8 hours. Cost-controlled: max fetches and summary length are configurable.	2026-02-24
ADR-059	Session turn limit (80 exchanges)	Sessions auto-rotate after 80 turns. Prevents cagent history from growing to 100+ messages, where Anthropic's tool_use/tool_result sequencing can become corrupted (discovered in production: corruption at `messages.102` caused permanent 400 errors). Combined with STM injection, context is preserved across rotations.	2026-02-24
ADR-060	Stream error detection and auto-recovery	cagent returns HTTP 200 even when Anthropic returns 400. The error appears only in SSE `type: "error"` events. Gateway tracks stream errors; if stream ends with error and no content, rejects the promise. `isSessionError()` expanded to recognize corruption patterns. Auto-clears session and retries once.	2026-02-24
ADR-061	Heartbeat consecutive failure tracking	After 2 consecutive heartbeat failures, pauses heartbeats until the session changes (user `/new` or auto-recovery). Prevents hammering a corrupted session every N minutes. Auto-resumes when `lastKnownSessionId` changes.	2026-02-24
ADR-062	Context fetch after setBusy to prevent double-processing	The context optimizer's async HTTP fetch created a race window where `isBusy()` was false but message processing had logically started. Fix: orchestrator sets `busy=true` FIRST, then awaits context fetch via a `contextFetcher` callback. No async work before the busy guard.	2026-02-24
ADR-063	Telegram message deduplication	Tracks last 50 `message_id`s in a Set. Skips any message already processed. Prevents double-processing when Telegraf's polling restarts and re-delivers updates. In-memory only (resets on gateway restart, which is fine since polling restart is what causes re-delivery).	2026-02-24
ADR-064	Telegraf polling liveness monitor	Telegraf v4's long-polling can die silently (no error events, no crash). Gateway tracks last update activity via `handleUpdate` intercept. If idle >5 minutes and Telegram API is reachable, restarts polling. Conservative threshold prevents false positives during quiet periods.	2026-02-24
ADR-065	Agent-controlled tunnel start via POST /api/tunnel/start	The agent cannot exec into the dashboard container or access the Docker socket. Added `POST /api/tunnel/start` to the dashboard API so moby can start the Cloudflare tunnel over HTTP. Endpoint checks for stale PID, kills old process if needed, spawns fresh cloudflared, and delivers the new URL to the user via gateway's `/api/deliver`.	2026-02-26
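
ADR-029 describes the gateway prepending a `[context: channel=..., time=...]` prefix that the agent later reads back. A minimal sketch of that round trip follows. The function names `addContextPrefix` and `parseContextPrefix`, the ISO time format, and the single space after the closing bracket are assumptions for illustration, not the gateway's actual code.

```javascript
// Hypothetical helpers: only the "[context: channel=..., time=...]" layout
// comes from ADR-029; names and the time format are assumptions.
function addContextPrefix(channel, text, now = new Date()) {
  return `[context: channel=${channel}, time=${now.toISOString()}] ${text}`;
}

// Splits a prefixed message back into its channel, time, and user text.
// Messages without a prefix are returned unchanged with null metadata.
function parseContextPrefix(message) {
  const match = /^\[context: channel=([^,\]]+), time=([^\]]+)\] ?/.exec(message);
  if (!match) return { channel: null, time: null, text: message };
  return {
    channel: match[1],
    time: match[2],
    text: message.slice(match[0].length),
  };
}

const wrapped = addContextPrefix('telegram:123', 'remind me at 9am', new Date(0));
const parsed = parseContextPrefix(wrapped);
console.log(parsed.channel, '|', parsed.text);
```

Parsing with a single anchored regular expression keeps the prefix easy to strip before anything is shown to the user.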
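
The skip guard in ADR-034 (a `running` flag reset in `try/finally`) can be sketched as below. `createGuardedHeartbeat` and `runHeartbeat` are hypothetical names, and the `'ran'`/`'skipped'` return values exist only to make the behavior observable here.

```javascript
// Sketch of a heartbeat skip guard. `runHeartbeat` stands in for the real
// heartbeat body (an assumption); only the flag + try/finally pattern is
// taken from ADR-034.
function createGuardedHeartbeat(runHeartbeat) {
  let running = false;
  return async function tick() {
    if (running) return 'skipped'; // previous heartbeat still in flight
    running = true;
    try {
      await runHeartbeat();
      return 'ran';
    } finally {
      running = false; // reset even if the heartbeat throws
    }
  };
}
```

A timer would then call the returned `tick` on every interval, for example `setInterval(tick, intervalMs)`, and overlapping ticks fall through without queuing.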
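
ADR-035 (collect mode) and ADR-040 (queue cap with oldest-drop) interact closely, so a combined sketch may help. The class name `CollectQueue`, the `runTurn` callback, and the exact `\n---\n` separator are assumptions. The ADRs only state that messages are separated by `---`, that all promises resolve with the same response, and that the oldest message is dropped with an error once the cap is exceeded.

```javascript
// Illustrative queue, not the gateway's implementation.
class CollectQueue {
  constructor(maxSize = 20) {
    this.maxSize = maxSize;
    this.pending = [];
  }

  // Returns a promise that settles when the message is answered or dropped.
  enqueue(text) {
    return new Promise((resolve, reject) => {
      this.pending.push({ text, resolve, reject });
      if (this.pending.length > this.maxSize) {
        const oldest = this.pending.shift();
        oldest.reject(new Error('queue full: oldest message dropped'));
      }
    });
  }

  // Coalesces everything queued so far into one turn and fans the single
  // response (or error) out to every waiting caller.
  async drain(runTurn) {
    const batch = this.pending.splice(0);
    if (batch.length === 0) return;
    const combined = batch.map((m) => m.text).join('\n---\n');
    try {
      const response = await runTurn(combined);
      batch.forEach((m) => m.resolve(response));
    } catch (err) {
      batch.forEach((m) => m.reject(err));
    }
  }
}
```

In a setup like the one ADR-041 describes, `drain` would be called after the debounce window elapses rather than immediately on enqueue.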
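
ADR-038 combines a daily reset hour with an idle timeout, both evaluated against a persisted `lastActivity` timestamp. One plausible way to express that check is sketched here. The helper name `resetReason`, its option names, and the rule "reset if the last activity predates the most recent reset boundary" are assumptions about how such a check could work, not a description of the gateway's code.

```javascript
// Hypothetical helper: returns 'daily', 'idle', or null.
function resetReason(lastActivity, now, { resetHour = 4, idleMs = Infinity } = {}) {
  // Most recent daily boundary at or before `now`, in local time.
  const boundary = new Date(now);
  boundary.setHours(resetHour, 0, 0, 0);
  if (boundary > now) boundary.setDate(boundary.getDate() - 1);

  if (lastActivity < boundary) return 'daily'; // last activity predates today's reset
  if (now - lastActivity > idleMs) return 'idle'; // gap longer than the idle timeout
  return null;
}

const lastSeen = new Date(2026, 1, 23, 23, 0); // 23 Feb, 23:00 local time
console.log(resetReason(lastSeen, new Date(2026, 1, 24, 5, 0)));
```

Because the check depends only on two timestamps, it gives the same answer after a gateway restart as long as `lastActivity` was persisted.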
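
The deduplication in ADR-063 (a Set holding the last 50 `message_id`s) can be sketched as a small bounded set. `RecentIds` and `markIfNew` are hypothetical names, and evicting the oldest id once the limit is exceeded is an assumption about how "last 50" is maintained.

```javascript
// Bounded set of recently seen ids; illustrative only.
class RecentIds {
  constructor(limit = 50) {
    this.limit = limit;
    this.ids = new Set();
  }

  // Returns true the first time an id is seen, false for a repeat.
  markIfNew(id) {
    if (this.ids.has(id)) return false;
    this.ids.add(id);
    if (this.ids.size > this.limit) {
      // Sets iterate in insertion order, so the first value is the oldest.
      this.ids.delete(this.ids.values().next().value);
    }
    return true;
  }
}
```

An adapter would call `markIfNew(message_id)` at the top of its message handler and return early when it yields `false`.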