fix(container-runner): skip broken nested file mounts on Apple Container #2649
Open
jurre-mbt-it wants to merge 14 commits into
Replaces Docker as the agent container runtime on macOS. Apple Container is preferred because it's natively installed via Homebrew on this host; Docker required `brew install --cask docker-desktop`, which conflicted with an existing /usr/local/bin/hub-tool symlink.
- src/container-runtime.ts: CONTAINER_RUNTIME_BIN='container', readonly mounts use --mount type=bind,source,target,readonly (Apple Container's preferred syntax). ensureContainerRuntimeRunning probes `container system status` and auto-starts via `container system start`. cleanupOrphans parses `container ls --all --format json` and filters by install-label (supports both object and array label forms).
- container/build.sh: default CONTAINER_RUNTIME=container.
- setup/container.ts: dual-runtime support. Prefers Apple Container on macOS when installed, falls back to Docker. Probes use runtime-specific status commands.
- setup/environment.ts, setup/verify.ts: detect Apple Container before Docker; report under existing docker key for backwards compatibility.
Verified: agent image rebuilt as nanoclaw-agent-v2-05ec8912:latest, all 13 container-runtime tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
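The "both object and array label forms" handling in cleanupOrphans can be illustrated with a small sketch. The entry shape, the function names, and the label key used in any call are hypothetical; only the fact that labels may arrive as an object or as an array comes from the commit text, and the real `container ls --all --format json` output may be structured differently.

```typescript
// Hypothetical shape of one listed container; not the real CLI schema.
type Labels = Record<string, string> | string[] | undefined;

interface ContainerEntry {
  id: string;
  labels?: Labels;
}

// Normalize either label form into a plain key/value record.
function normalizeLabels(labels: Labels): Record<string, string> {
  if (!labels) return {};
  if (!Array.isArray(labels)) return { ...labels };
  const out: Record<string, string> = {};
  for (const item of labels) {
    const eq = item.indexOf("=");
    // "key=value" splits at the first "="; a bare "key" maps to "".
    if (eq === -1) out[item] = "";
    else out[item.slice(0, eq)] = item.slice(eq + 1);
  }
  return out;
}

// Return the ids of containers that carry the given install label.
function findOrphans(entries: ContainerEntry[], key: string, value: string): string[] {
  return entries
    .filter((entry) => normalizeLabels(entry.labels)[key] === value)
    .map((entry) => entry.id);
}
```

Normalizing first keeps the filter itself independent of which label form the runtime happens to emit.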
OneCLI's installer is docker-compose based, which doesn't work on hosts using Apple Container instead of Docker. This adds a built-in HTTP proxy that reads ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN from .env and injects credentials on every forwarded API request; containers see only a placeholder token.
Activated automatically when ONECLI_URL is unset. When ONECLI_URL is set, behavior is unchanged: the OneCLI gateway is still used and the proxy is not started.
- src/credential-proxy.ts: new HTTP forward-proxy. Supports both api-key (injects x-api-key) and oauth (injects Authorization: Bearer on the Claude CLI temp-key exchange request) modes. Strips hop-by-hop headers.
- src/config.ts: adds USE_NATIVE_CREDENTIAL_PROXY (true when ONECLI_URL is empty), CREDENTIAL_PROXY_PORT (default 3001), CREDENTIAL_PROXY_HOST (default 127.0.0.1; Apple Container hosts override with the bridge100 gateway IP or 0.0.0.0).
- src/index.ts: starts the proxy on boot when in native mode.
- src/container-runner.ts: branches per mode. Native mode pushes ANTHROPIC_BASE_URL=http://<CONTAINER_HOST_GATEWAY>:<port> and a placeholder token; OneCLI mode keeps its ensureAgent + applyContainerConfig flow.
- src/modules/approvals/index.ts: skips the OneCLI manual-approval long-poll handler in native mode (no gateway to poll).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
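The header handling described for the proxy (strip hop-by-hop headers, replace the placeholder credential) might look roughly like the sketch below. All names are hypothetical, and it simplifies the oauth mode by rewriting the bearer on every request rather than only on the temp-key exchange request; it is not the contents of src/credential-proxy.ts.

```typescript
type HeaderMap = Record<string, string>;

// Commonly cited hop-by-hop headers that a forward proxy should not relay.
const HOP_BY_HOP = new Set([
  "connection",
  "keep-alive",
  "proxy-authenticate",
  "proxy-authorization",
  "te",
  "trailer",
  "transfer-encoding",
  "upgrade",
]);

type Credential =
  | { mode: "api-key"; apiKey: string }
  | { mode: "oauth"; token: string };

// Build the outbound header set: lowercase names, drop hop-by-hop headers,
// drop whatever placeholder credential the container sent, then inject the
// real credential that only the host knows.
function rewriteHeaders(incoming: HeaderMap, cred: Credential): HeaderMap {
  const out: HeaderMap = {};
  for (const [name, value] of Object.entries(incoming)) {
    const lower = name.toLowerCase();
    if (HOP_BY_HOP.has(lower)) continue;
    if (lower === "x-api-key" || lower === "authorization") continue;
    out[lower] = value;
  }
  if (cred.mode === "api-key") {
    out["x-api-key"] = cred.apiKey;
  } else {
    out["authorization"] = `Bearer ${cred.token}`;
  }
  return out;
}
```

Keeping this step a pure function makes the credential swap testable without opening a socket.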
Wires the Telegram channel from the `channels` branch into the host build. Replaces the stock cli-only config so the bot can receive messages on the user's existing Telegram bot identity (Hivemind_evabot).
- Copy telegram.ts + telegram-pairing.ts + telegram-markdown-sanitize.ts (and tests) from origin/channels into src/channels/.
- Append the registration import to src/channels/index.ts.
- Pin @chat-adapter/telegram@4.27.0 (peer dep brings chat@4.27.0; bump the direct chat dependency from ^4.24.0 to ^4.27.0 so TypeScript resolves one consistent version).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end smoke test surfaced four issues with the Apple Container path:
1. Bind-mount of files: `--mount type=bind,source=<file>,...` is rejected by Apple Container with "is not a directory". Container-runner mounts `container.json`, the composed CLAUDE.md, and the shared base CLAUDE.md as individual files. Switch `readonlyMountArgs` to the `-v src:dst:ro` syntax instead, which Apple Container accepts for both files and dirs and still honors :ro. Tests updated.
2. Dockerfile WORKDIR pointed at `/workspace/group`, the v1 mount path. v2 mounts the agent group folder at `/workspace/agent`. With Docker the missing dir was created at run time; Apple Container errors out at process start with "failed to change directory". Update WORKDIR and the mkdir line to /workspace/agent.
3. The native credential proxy was passing `CLAUDE_CODE_OAUTH_TOKEN=placeholder` to the container. Claude CLI then tried to exchange that placeholder via /api/oauth/claude_cli/create_api_key, the proxy rewrote the bearer with the real token, and api.anthropic.com returned 403 because the user's Pro-tier OAuth token lacks `org:create_api_key` scope. Switch to `ANTHROPIC_AUTH_TOKEN=placeholder` so the CLI sends a plain Bearer header on /v1/messages, which the proxy rewrites, bypassing the failing exchange.
4. `CONTAINER_HOST_GATEWAY` was read from `process.env`, but the value lives in .env (not exported). Reading via `process.env` therefore fell back to `host.docker.internal`, which Apple Container does not resolve. Containers couldn't reach the host proxy at all. Wire it through config.ts via readEnvFile() with `host.docker.internal` retained as the Docker-on-macOS fallback default.
Plus: setup/service.ts plist PATH gains /opt/homebrew/bin so launchd-launched v2 can find the `container` binary (Apple Container installs there on Apple Silicon).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
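Issue 4 comes down to where the value is looked up. A minimal sketch of reading `CONTAINER_HOST_GATEWAY` from .env text with the Docker default as fallback follows; `parseEnvText` and `resolveHostGateway` are hypothetical names and are not the project's readEnvFile(), whose parsing rules are not shown in this PR.

```typescript
// Parse KEY=VALUE lines from .env-style text; skips blank lines and # comments.
function parseEnvText(text: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no key before "=", or no "=" at all
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    // Strip one pair of matching surrounding quotes, if present.
    const first = value[0];
    if (value.length >= 2 && (first === '"' || first === "'") && value[value.length - 1] === first) {
      value = value.slice(1, -1);
    }
    out[key] = value;
  }
  return out;
}

// Fallback retained for Docker on macOS, as described in the commit message.
const DOCKER_DEFAULT_GATEWAY = "host.docker.internal";

// Prefer the value from the .env text; fall back to the Docker default when
// the key is missing or empty.
function resolveHostGateway(envText: string): string {
  const fromFile = parseEnvText(envText)["CONTAINER_HOST_GATEWAY"];
  return fromFile || DOCKER_DEFAULT_GATEWAY;
}
```

The point of the sketch is that the lookup never touches `process.env`, so an unexported value in .env is still found.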
Closes the manual portion of the v1 → v2 migration. Migration is live: service `com.nanoclaw-v2-05ec8912` running, Telegram answering on @Hivemind_evabot, owner role granted, both messaging groups locked down to known-senders-only with the owner added as member of both agent groups.
- Add three v2-standard container skills (capabilities, pdf-reader, status) that were copied in by migrate-v2.sh but never committed.
- Delete orphan group folders groups/global/ and groups/main/. They were v2 stock template copies; the cutover at boot already drops groups/global/. v1 originals remain at /Users/eva/nanoclaw/groups/ if recovery is ever needed.
- Archive .nanoclaw-migrations/ (the prior migration's gotchas notes and guide.md) so future contributors can see how we got here.
- Archive logs/setup-migration/handoff.json into docs/migration-2026-05-28.json.
- Add docs/v1-fork-followups.md describing what remains to port from the v1 fork (Telegram image vision / 4-bot swarm / reply context, Apple Pages MCP tools, Gmail tool, container memory tuning) and what's already done (Apple Container runtime, native credential proxy).
CLAUDE.local.md cleanups for the two wired groups (telegram_main 325 → 21 lines, telegram_hivemind 152 → 38 lines, paths /workspace/group/ → /workspace/agent/) aren't tracked; those files are deliberately per-install.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an optional `memory_mb` column to container_configs. When set, `buildContainerArgs` passes `-m <N>MiB` to the runtime so the cgroup limit is honored (both Docker and Apple Container accept this syntax). Surfaces via `ncl groups config update --memory-mb <N>` (0 clears the limit). Migration 016 adds the column NULL-able so existing rows keep the runtime default. Reason: v1 reliably set ≥2 GB so Chromium (used by agent-browser MCP) wouldn't OOM mid-session. v2 had no exposed lever. Both wired groups are now set to 2048 MiB. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
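How a nullable `memory_mb` value could turn into runtime arguments is sketched below. Both function names are hypothetical, and the treatment of `0` as "store NULL" is an assumption based on the "(0 clears the limit)" note; this is not the body of `buildContainerArgs`.

```typescript
// memoryMb mirrors the nullable memory_mb column: null or undefined means
// "no explicit limit, keep the runtime default".
function memoryArgs(memoryMb: number | null | undefined): string[] {
  if (memoryMb == null || memoryMb <= 0) return [];
  if (!Number.isInteger(memoryMb)) {
    throw new Error(`memory_mb must be a whole number of MiB, got ${memoryMb}`);
  }
  // Matches the `-m <N>MiB` form named in the commit message.
  return ["-m", `${memoryMb}MiB`];
}

// Assumed CLI-side mapping: `--memory-mb 0` clears the limit by storing NULL.
function memoryMbFromFlag(flag: number): number | null {
  return flag === 0 ? null : flag;
}
```

With the 2048 MiB setting mentioned above, `memoryArgs(2048)` yields the two arguments `-m` and `2048MiB`, while rows that predate the migration produce no extra arguments.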
`chat-sdk-bridge.messageToInbound` previously swallowed any `fetchData()` error with a generic warn-log; the agent would then see an attachment entry without a `data` field and have no way to tell whether the file was missing, too big, or undecodable. Telegram caps bot getFile at 20 MiB, which is the most common cause and the one v1 surfaced explicitly.
Now we classify the error: `tooBig` matches /too big|too large|too_big|413|payload too large/i and tags the entry with `error: 'too_big'` and `errorMessage: <raw>`. Other failures get `error: 'download_failed'`.
`container/agent-runner/src/formatter.ts` then renders explanatory markers in place of the usual `[type: name — saved to …]` line:
- too_big: `[<type>: <name> (~N MB) — too large to download (over the bot's getFile limit); ask the user to share a smaller file or a download link]`
- download_failed: `[<type>: <name> — download failed; ask the user to resend or share another way]`
The agent then crafts a natural reply instead of silently dropping the attachment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
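The classification step can be sketched as a single pure function. The pattern is the one quoted in the commit message (without the line-wrap space before 413); the function and interface names are hypothetical, and attaching `errorMessage` to the `download_failed` case as well is an assumption.

```typescript
// Pattern quoted in the commit message for "file too large" failures.
const TOO_BIG = /too big|too large|too_big|413|payload too large/i;

interface AttachmentError {
  error: "too_big" | "download_failed";
  errorMessage: string;
}

// Turn whatever the download threw into a tagged, agent-readable entry.
function classifyDownloadError(err: unknown): AttachmentError {
  const errorMessage = err instanceof Error ? err.message : String(err);
  const error = TOO_BIG.test(errorMessage) ? "too_big" : "download_failed";
  return { error, errorMessage };
}
```

Because the regex has no `g` flag, reusing the same `TOO_BIG` object across calls is safe; `test` keeps no position state between invocations.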
Restores v1's multi-bot persona pattern: when an agent team builds out multiple personas (e.g. "Marine Biologist" + "Alexander Hamilton") in the Hivemind group, each persona's messages appear from a dedicated Telegram bot with a matching display name. The user sees a real multi-participant conversation instead of a single bot relaying everyone.
- src/channels/telegram-swarm.ts: new module. Reads TELEGRAM_BOT_POOL (comma-separated tokens) and init-validates each via getMe. Maintains a sticky map keyed on (platformId, sender) → pool index, assigned round-robin on first use. On first assignment of a (platformId, sender) pair, calls setMyName(sender) so the bot's display name matches the persona; waits 1.5 s for Telegram propagation before sending. Sends a plain-text Telegram message via the picked bot's sendMessage endpoint.
- src/channels/telegram.ts: wraps the adapter's `deliver`. When the outbound content has `text` and `sender` and the pool is up (and the message isn't a special operation like edit/reaction), routes through the pool; falls back to the primary bot on any error.
- container/agent-runner/src/mcp-tools/core.ts: `send_message` gains an optional `sender` field. When supplied, it flows into the content JSON so the host's deliver path can route via the pool.
Without TELEGRAM_BOT_POOL configured, behavior is unchanged: `sender` is ignored and everything goes through the primary bot.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
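The sticky (platformId, sender) → pool index map with round-robin assignment can be sketched as follows. `StickyPool` and its method are hypothetical names; the Telegram calls (getMe, setMyName, sendMessage) and the 1.5 s wait are deliberately left out, with `isNew` marking the moment where a rename would be triggered.

```typescript
class StickyPool {
  private readonly assigned = new Map<string, number>();
  private readonly poolSize: number;
  private next = 0;

  constructor(poolSize: number) {
    if (!Number.isInteger(poolSize) || poolSize <= 0) {
      throw new Error("pool must contain at least one bot");
    }
    this.poolSize = poolSize;
  }

  // Returns the pool index for this pair. isNew is true only on first
  // assignment, which is when a caller would rename the bot to the persona.
  pick(platformId: string, sender: string): { index: number; isNew: boolean } {
    // JSON-encode the pair so ("a", "b:c") and ("a:b", "c") cannot collide.
    const key = JSON.stringify([platformId, sender]);
    const existing = this.assigned.get(key);
    if (existing !== undefined) return { index: existing, isNew: false };
    const index = this.next;
    this.next = (this.next + 1) % this.poolSize;
    this.assigned.set(key, index);
    return { index, isNew: true };
  }
}
```

Once there are more personas than bots, the round-robin wraps and two personas share a bot; the sketch does not address how the display name should behave in that case.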
Marks the four items shipped in this batch (container memory cap, file-too-big handling, reply context [v2 already had it], Telegram bot pool / swarm) and rewrites the Gmail + Pages sections with the architectural detail uncovered while attempting them.
Gmail: upstream skill assumes OneCLI TLS-MITM. v1-style file-based tokens are blocked by the mount-security `credentials` substring pattern; viable path is rename → GMAIL_CREDENTIALS_PATH override + per-group rebuild. Or build a real TLS MITM proxy (essentially mini-OneCLI).
Pages: v1's filesystem IPC (DATA_DIR/ipc/<group>/messages/<id>.json) doesn't fit v2's DB-only host↔container channel. v2-shaped designs are sketched out; the preferred one is a `kind=system-action, action=pages.<verb>` on messages_out with a host handler that calls osascript.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The agent-runner's default of 14 days is too tight when restoring an install whose last conversation activity is just over 2 weeks old, which is the normal state after a migration that took a few days, or any user returning from vacation.
What actually went wrong today: after migrate-v2.sh restored v1's session transcripts into the container's CLAUDE_CONFIG_DIR, the first container wake found them 12–55 days old (depending on group) and rotated the active continuation, archiving the v1 transcript to conversations/<date>.md and starting fresh sessions. Both wired groups lost their resume context as a result; the data was preserved on disk but the session pointer in `session_state.continuation` got rewritten to a brand-new empty session.
Bumping to 180 keeps multi-month context alive across reasonable gaps. Transcripts still rotate on the size cap (CLAUDE_TRANSCRIPT_ROTATE_BYTES, default 12 MiB), which is what bounds memory.
Recovery for the existing install (not in this commit, done out of band):
- telegram_main: `session_state.continuation` switched from c66f8582 (3 days) back to 74ebe63b (6.2 MB, Apr 4 → May 27).
- telegram_hivemind: switched from 72ea255a (25 lines) back to d046f59c (1954 lines, Apr 3 → May 16).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
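The interplay between the idle-age threshold and the size cap can be illustrated with a small decision function. This is only a sketch under assumed semantics (rotate when the size cap is reached, or when idle time exceeds the threshold); the names are hypothetical and the agent-runner's actual rotation logic is not shown in this PR.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

interface RotationPolicy {
  maxIdleDays: number; // 14 before this change, 180 after
  maxBytes: number; // size cap, e.g. 12 MiB
}

// Returns why a transcript would be rotated, or null to keep resuming it.
function rotationReason(
  lastActivityMs: number,
  sizeBytes: number,
  nowMs: number,
  policy: RotationPolicy,
): "size" | "age" | null {
  if (sizeBytes >= policy.maxBytes) return "size";
  if (nowMs - lastActivityMs > policy.maxIdleDays * DAY_MS) return "age";
  return null;
}
```

Under these assumptions, a transcript idle for 55 days is rotated for age with a 14-day threshold and kept with a 180-day one, while the size cap continues to apply in both cases.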
Restores v1's local-only Pages integration. v1's filesystem-IPC pattern
(DATA_DIR/ipc/<group>/messages/<id>.json) doesn't fit v2's DB-only
host↔container channel; this port uses the existing system-action
registry instead.
Container side (container/agent-runner/src/mcp-tools/pages.ts) exposes
11 MCP tools — pages_create, pages_open, pages_save, pages_close,
pages_get_text, pages_insert_text, pages_replace_text,
pages_format_paragraph, pages_export_pdf, pages_list, pages_delete.
Each tool writes a `kind='system', action='pages_request'` message with
{requestId, verb, args} to outbound.db and polls inbound.db for a
matching pages_response. Same correlation pattern as cli_request in
cli/ncl.ts.
Host side (src/modules/pages/applescript.ts) is the v1 osascript helper
module ported verbatim except for the logger swap (pino-style
`logger.info({obj}, 'msg')` → v2's `log.info('msg', {obj})`) and import
path updates. All v1 sandboxing logic ported intact: FILENAME_PATTERN
allowlist (letters/digits/space/_-().), path-escape check after resolve,
sandbox under `groups/<folder>/pages/`. AppleScript correctness — style
applied first because it resets font/color, per-property `set X of Y`
assignment because the `properties` record form fails on rich text —
preserved unchanged from v1.
src/modules/pages/index.ts registers the delivery action, looks the
group folder up via getAgentGroup(session.agent_group_id), dispatches
verb to the AppleScript helper, and writes the response frame to
inbound.db. Self-registers via the modules barrel.
No DB schema or container_configs changes. Works for any group on a
host with Pages.app installed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
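The sandboxing rules listed in this commit message (filename allowlist, path-escape check after resolve, sandbox under `groups/<folder>/pages/`) can be sketched as follows. The function name and the `groupsRoot` parameter are hypothetical, and the pattern below assumes ASCII letters only; the real FILENAME_PATTERN in src/modules/pages/applescript.ts may differ.

```typescript
import * as path from "node:path";

// Allowlist as described: letters, digits, space, and the characters _ - ( ) .
const FILENAME_PATTERN = /^[A-Za-z0-9 _\-().]+$/;

// Resolve a user-supplied filename to an absolute path inside the group's
// pages sandbox, or throw if it is not allowed.
function resolveSandboxedPath(groupsRoot: string, folder: string, filename: string): string {
  if (!FILENAME_PATTERN.test(filename)) {
    throw new Error(`invalid filename: ${JSON.stringify(filename)}`);
  }
  const sandbox = path.resolve(groupsRoot, folder, "pages");
  const resolved = path.resolve(sandbox, filename);
  // Names such as ".." pass the allowlist (dots are permitted), so the
  // resolved path must still be checked to sit strictly inside the sandbox.
  if (!resolved.startsWith(sandbox + path.sep)) {
    throw new Error(`path escapes sandbox: ${JSON.stringify(filename)}`);
  }
  return resolved;
}
```

The two checks are complementary: the allowlist rejects separators and other unexpected characters, and the post-resolve check catches the dot-only names that the allowlist lets through.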
Closes the last gap in the container_config CLI surface. Up to now the only way to wire an additional host→container mount was direct UPDATE on data/v2.db: fine for the maintainer but unreproducible for anyone following the followups doc, and outside the approval-gated CLI path that every other config-mutating command goes through.
- `config add-mount`: upserts by containerPath. Calls validateMount() up front so a bad host path (nonexistent, outside the allowlist, matches a blocked pattern) fails immediately with the actual reason instead of being silently dropped at spawn time with a WARN buried in the error log. Reports the effective container path (`/workspace/extra/<container>`) so the caller knows where their files will appear inside the container.
- `config remove-mount`: filters by containerPath; errors if not found.
Both gated as `approval`-tier, matching add-mcp-server / add-package.
Updates the Gmail wiring recipe in docs/v1-fork-followups.md to use the new command instead of the prior SQL one-liner.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
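The upsert-by-containerPath and remove-or-error behavior can be sketched over a plain array. The `Mount` fields other than `containerPath`, the function names, and the validator type are hypothetical; the validator is a stand-in for validateMount(), whose real checks are not reproduced here.

```typescript
interface Mount {
  hostPath: string;
  containerPath: string;
  readonly: boolean;
}

// Stand-in for validateMount(): returns a rejection reason, or null if OK.
type MountValidator = (mount: Mount) => string | null;

interface AddMountResult {
  mounts: Mount[];
  effectivePath: string;
}

// Validate first so a bad mount fails immediately with its reason, then
// replace an existing entry with the same containerPath or append a new one.
function addMount(mounts: Mount[], mount: Mount, validate: MountValidator): AddMountResult {
  const reason = validate(mount);
  if (reason !== null) throw new Error(`mount rejected: ${reason}`);
  const next = mounts.slice();
  const index = next.findIndex((m) => m.containerPath === mount.containerPath);
  if (index >= 0) next[index] = mount;
  else next.push(mount);
  return { mounts: next, effectivePath: `/workspace/extra/${mount.containerPath}` };
}

// Remove by containerPath; a missing entry is an error, not a no-op.
function removeMount(mounts: Mount[], containerPath: string): Mount[] {
  const next = mounts.filter((m) => m.containerPath !== containerPath);
  if (next.length === mounts.length) {
    throw new Error(`no mount with containerPath ${containerPath}`);
  }
  return next;
}
```

Both functions return new arrays rather than mutating their input, which keeps an approval-gated caller free to discard the result if the change is denied.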
Both Gmail and Apple Pages were ported in prior commits (the "Ported" sections at the top of the file cover them with the actual wiring recipes). The old "## To port — …" headings underneath were leftover from when they were still deferred; remove to avoid implying open work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
container.json and the composed CLAUDE.md were mounted as nested read-only file overlays on top of the agent group directory mount. Apple Container's virtio-fs creates these inodes at the destination path (stat() returns 0644) but blocks reads, even as root, returning EACCES. This silently disabled all MCP servers configured via `ncl groups config add-mcp-server` and skipped the composed CLAUDE.md context (skill instructions, on-wake messages, fragments) for every Apple Container user. The agent kept answering because CLAUDE.local.md is exposed only through the directory mount, which always reads fine, so chat looked normal while tool-driven workflows silently improvised without ever calling the tools.
The dir mount at /workspace/agent already exposes both files. The read-only intent is preserved in practice because the host re-materializes container.json from the central DB and recomposes CLAUDE.md from the shared base + fragments on every container spawn, so any in-container write to either is clobbered on the next session.
If preserving the read-only guarantee with defense-in-depth matters, a follow-up could detect the runtime at host startup (e.g. inspect CONTAINER_RUNTIME_BIN or run `container --version`) and only skip the nested mounts on Apple Container.
Reported separately so the agent-runner mount-race retry (which only helps once these mounts are skipped) can land independently.
This was referenced May 30, 2026
Summary
The nested file mounts for `container.json` and the composed `CLAUDE.md` produce phantom inodes on Apple Container: `stat()` returns mode `0644` but reads return `EACCES` even as root. This silently disables every MCP server configured via `ncl groups config add-mcp-server` and skips composed CLAUDE.md context for every Apple Container user.
Discovered while wiring a new MCP server on a v2.0.70 install. Symptom: agent answered "I don't have access to that data" while the configured MCP server was sitting unreached. From inside the container, `stat` lies: the kernel returns `EACCES` regardless of POSIX bits because Apple Container's virtio-fs doesn't support the "file mount on top of a dir mount" pattern that works on Docker.
Impact (when unpatched on Apple Container)
- `ncl groups config add-mcp-server` entries are silently ignored: `loadConfig()` falls back to all-empty defaults.
- The agent keeps answering because `CLAUDE.local.md` is exposed only through the directory mount, which always reads fine.
Users discover this only when they try to use a feature that depends on MCP tools; the agent improvises around the missing capability rather than erroring.
Fix
Remove both nested file mounts. The dir mount at `/workspace/agent` already exposes both files. The host re-materializes both from the central DB / composer on every container spawn, so any in-container writes are clobbered next session; the read-only intent is preserved in practice.
Follow-up
A separate PR adds a startup retry to `agent-runner/src/config.ts` for a related virtio-fs race that surfaces once these mounts are skipped (first read can still fail with `ENOENT` because the nested dir mount takes ~100–500 ms to expose on Apple Container).
Suggested upstream-quality version
If preserving the read-only guarantee with defense-in-depth matters, detect the runtime at host startup and only skip the nested mounts on Apple Container:
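A minimal sketch of that gating, assuming the runtime can be identified from the configured binary name (`container` for Apple Container). `isAppleContainer`, `nestedFileMountArgs`, and the `FileOverlay` shape are hypothetical names for illustration and are not taken from container-runner.ts.

```typescript
interface FileOverlay {
  hostPath: string;
  containerPath: string;
}

// Assumption: the value of CONTAINER_RUNTIME_BIN is either a bare binary
// name or a full path, and Apple Container's binary is named "container".
function isAppleContainer(runtimeBin: string): boolean {
  const base = runtimeBin.split("/").pop() ?? runtimeBin;
  return base === "container";
}

// Emit read-only nested file mounts only where they are readable. On Apple
// Container the directory mount at /workspace/agent already exposes the files.
function nestedFileMountArgs(runtimeBin: string, overlays: FileOverlay[]): string[] {
  if (isAppleContainer(runtimeBin)) return [];
  const args: string[] = [];
  for (const overlay of overlays) {
    args.push("-v", `${overlay.hostPath}:${overlay.containerPath}:ro`);
  }
  return args;
}
```

Matching on the binary name is the cheapest check; probing `container --version` at startup would be an alternative if the binary could be renamed or wrapped.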
Happy to rework this PR into that shape if you'd prefer — let me know.
Test plan
- `container exec ... cat /workspace/agent/container.json` succeeds.
- `/workspace/agent/CLAUDE.md`.
- `ncl groups config add-mcp-server` and send a message that should trigger it; agent should actually call the tool.
🤖 Generated with Claude Code