fix(container-runner): skip broken nested file mounts on Apple Container #2649

Open
jurre-mbt-it wants to merge 14 commits into nanocoai:main from jurre-mbt-it:fix/apple-container-nested-file-mounts

@jurre-mbt-it

Summary

The nested file mounts for container.json and the composed CLAUDE.md produce phantom inodes on Apple Container — stat() returns mode 0644 but reads return EACCES even as root. This silently disables every MCP server configured via ncl groups config add-mcp-server and skips composed CLAUDE.md context for every Apple Container user.

Discovered while wiring a new MCP server on a v2.0.70 install. Symptom: agent answered "I don't have access to that data" while the configured MCP server was sitting unreached. From inside the container:

$ container exec --user 0:0 <container> sh -c 'stat /workspace/agent/container.json; cat /workspace/agent/container.json'
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
cat: /workspace/agent/container.json: Permission denied

stat lies. The kernel returns EACCES regardless of POSIX bits because Apple Container's virtio-fs doesn't support the "file mount on top of a dir mount" pattern that works on Docker.
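
Because stat() cannot be trusted here, a check for this failure mode has to actually read from the file. The following is a minimal sketch of such a probe, not code from this PR; `probeReadable` and `ProbeResult` are hypothetical names, and the temp-file demo at the end only illustrates the call shape.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

/** Result of probing a path by actually reading from it. */
type ProbeResult = { readable: true } | { readable: false; code: string };

// Hypothetical helper: open the file and read one byte. Unlike stat() or
// access(), this exercises the open/read path that fails in the report above.
function probeReadable(filePath: string): ProbeResult {
  let fd: number | undefined;
  try {
    fd = fs.openSync(filePath, "r");
    fs.readSync(fd, Buffer.alloc(1), 0, 1, 0);
    return { readable: true };
  } catch (err) {
    // Node attaches the errno name (EACCES, ENOENT, ...) as `code`.
    const code = (err as NodeJS.ErrnoException).code ?? "UNKNOWN";
    return { readable: false, code };
  } finally {
    if (fd !== undefined) fs.closeSync(fd);
  }
}

// Demo on a throwaway temp file; the real target would be
// /workspace/agent/container.json inside the container.
const demoFile = path.join(fs.mkdtempSync(path.join(os.tmpdir(), "probe-")), "container.json");
fs.writeFileSync(demoFile, "{}");
console.log(probeReadable(demoFile));
console.log(probeReadable(demoFile + ".missing"));
```

On an affected Apple Container install, a probe like this would be expected to report EACCES for the nested file mounts even though stat() shows mode 0644.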

Impact (when unpatched on Apple Container)

  • All ncl groups config add-mcp-server entries are silently ignored — loadConfig() falls back to all-empty defaults.
  • Composed CLAUDE.md (skill instructions, fragments, on-wake messages) doesn't reach the agent.
  • Existing chat keeps working because CLAUDE.local.md is exposed only through the directory mount, which stays readable.

Users discover this only when they try to use a feature that depends on MCP tools — the agent improvises around the missing capability rather than erroring.

Fix

Remove both nested file mounts. The dir mount at /workspace/agent already exposes both files. The host re-materializes both from the central DB / composer on every container spawn, so any in-container writes are clobbered next session — the read-only intent is preserved in practice.

Follow-up

A separate PR adds a startup retry to agent-runner/src/config.ts for a related virtio-fs race that surfaces once these mounts are skipped (first read can still fail with ENOENT because the nested dir mount takes ~100–500 ms to expose on Apple Container).
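
The retry itself is not part of this PR, so the following is only a sketch of what an ENOENT-only startup retry could look like. `readWithRetry`, the attempt count, and the delay are assumptions chosen to cover the ~100–500 ms window mentioned above; the injectable `read` parameter exists only so the logic can be exercised without a real mount.

```typescript
import * as fs from "node:fs/promises";

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: retry a read while the mount is still appearing.
// Only ENOENT is retried; any other error (including EACCES) propagates
// immediately so that real permission problems are not masked.
async function readWithRetry(
  filePath: string,
  attempts = 5, // assumed value, not taken from the follow-up PR
  delayMs = 150, // assumed value, not taken from the follow-up PR
  read: (p: string) => Promise<string> = (p) => fs.readFile(p, "utf8"),
): Promise<string> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await read(filePath);
    } catch (err) {
      const code = (err as NodeJS.ErrnoException).code;
      if (code !== "ENOENT" || attempt >= attempts) throw err;
      await sleep(delayMs);
    }
  }
}
```

Retrying only on ENOENT keeps the two failure modes apart: a slow mount recovers, while an unreadable file still fails fast.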

Suggested upstream-quality version

If preserving the read-only guarantee with defense-in-depth matters, detect the runtime at host startup and only skip the nested mounts on Apple Container:

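// Keep the read-only file overlays on Docker; skip them only on Apple Container,
// where nested file mounts are unreadable. detectAppleContainerRuntime() is a
// placeholder for a runtime probe (for example, running `container --version`).
const isAppleContainer =
  process.env.CONTAINER_RUNTIME_BIN === 'container' || detectAppleContainerRuntime();
if (!isAppleContainer) {
  mounts.push({ hostPath: containerJsonPath, containerPath: '/workspace/agent/container.json', readonly: true });
  mounts.push({ hostPath: composedClaudeMd, containerPath: '/workspace/agent/CLAUDE.md', readonly: true });
}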

Happy to rework this PR into that shape if you'd prefer — let me know.

Test plan

  • Spawn an agent container on Apple Container; confirm container exec ... cat /workspace/agent/container.json succeeds.
  • Confirm same for /workspace/agent/CLAUDE.md.
  • Configure an MCP server via ncl groups config add-mcp-server and send a message that should trigger it — agent should actually call the tool.
  • Spawn an agent container on Docker; verify no regression (both files still readable; behavior unchanged).

🤖 Generated with Claude Code

Eva and others added 14 commits May 28, 2026 18:31
Replaces Docker as the agent container runtime on macOS. Apple Container
is preferred because it's natively installed via Homebrew on this host;
Docker required brew install --cask docker-desktop which conflicted with
an existing /usr/local/bin/hub-tool symlink.

- src/container-runtime.ts: CONTAINER_RUNTIME_BIN='container', readonly
  mounts use --mount type=bind,source,target,readonly (Apple Container's
  preferred syntax). ensureContainerRuntimeRunning probes
  `container system status` and auto-starts via `container system start`.
  cleanupOrphans parses `container ls --all --format json` and filters by
  install-label (supports both object and array label forms).
- container/build.sh: default CONTAINER_RUNTIME=container.
- setup/container.ts: dual-runtime support — prefers Apple Container on
  macOS when installed, falls back to Docker. Probes use runtime-specific
  status commands.
- setup/environment.ts, setup/verify.ts: detect Apple Container before
  Docker; report under existing docker key for backwards compatibility.

Verified: agent image rebuilt as nanoclaw-agent-v2-05ec8912:latest, all
13 container-runtime tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OneCLI's installer is docker-compose based, which doesn't work on hosts
using Apple Container instead of Docker. This adds a built-in HTTP proxy
that reads ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN from .env and
injects credentials on every forwarded API request — containers see only
a placeholder token.

Activated automatically when ONECLI_URL is unset. When ONECLI_URL is set,
behavior is unchanged: the OneCLI gateway is still used and the proxy is
not started.

- src/credential-proxy.ts: new HTTP forward-proxy. Supports both api-key
  (injects x-api-key) and oauth (injects Authorization: Bearer on the
  Claude CLI temp-key exchange request) modes. Strips hop-by-hop headers.
- src/config.ts: adds USE_NATIVE_CREDENTIAL_PROXY (true when ONECLI_URL
  is empty), CREDENTIAL_PROXY_PORT (default 3001), CREDENTIAL_PROXY_HOST
  (default 127.0.0.1; Apple Container hosts override with the bridge100
  gateway IP or 0.0.0.0).
- src/index.ts: starts the proxy on boot when in native mode.
- src/container-runner.ts: branches per mode. Native mode pushes
  ANTHROPIC_BASE_URL=http://<CONTAINER_HOST_GATEWAY>:<port> and a
  placeholder token; OneCLI mode keeps its ensureAgent + applyContainerConfig
  flow.
- src/modules/approvals/index.ts: skips the OneCLI manual-approval
  long-poll handler in native mode (no gateway to poll).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
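
The header handling described in the credential-proxy commit above can be sketched as a pure function. This is an illustration under assumptions, not the contents of src/credential-proxy.ts: `rewriteHeaders`, `HeaderMap`, and `ProxyCredential` are hypothetical names, and the sketch injects the credential on every request rather than only on specific endpoints.

```typescript
type HeaderMap = Record<string, string>;

// Headers commonly treated as hop-by-hop; a forward proxy should not relay them.
const HOP_BY_HOP = new Set([
  "connection",
  "keep-alive",
  "proxy-authenticate",
  "proxy-authorization",
  "proxy-connection",
  "te",
  "trailer",
  "transfer-encoding",
  "upgrade",
]);

type ProxyCredential =
  | { mode: "api-key"; apiKey: string }
  | { mode: "oauth"; token: string };

// Hypothetical helper: build the upstream headers from the container's request.
function rewriteHeaders(incoming: HeaderMap, cred: ProxyCredential): HeaderMap {
  const out: HeaderMap = {};
  for (const [name, value] of Object.entries(incoming)) {
    const lower = name.toLowerCase();
    if (HOP_BY_HOP.has(lower)) continue;
    // Drop whatever placeholder credential the container sent.
    if (lower === "x-api-key" || lower === "authorization") continue;
    out[lower] = value;
  }
  // Inject the real credential, which never leaves the host.
  if (cred.mode === "api-key") out["x-api-key"] = cred.apiKey;
  else out["authorization"] = `Bearer ${cred.token}`;
  return out;
}
```

Keeping this step free of I/O makes it easy to check that a placeholder token can never reach the upstream API alongside the real one.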
Wires the Telegram channel from the `channels` branch into the host build.
Replaces the stock cli-only config so the bot can receive messages on the
user's existing Telegram bot identity (Hivemind_evabot).

- Copy telegram.ts + telegram-pairing.ts + telegram-markdown-sanitize.ts
  (and tests) from origin/channels into src/channels/.
- Append the registration import to src/channels/index.ts.
- Pin @chat-adapter/telegram@4.27.0 (peer dep brings chat@4.27.0; bump the
  direct chat dependency from ^4.24.0 to ^4.27.0 so TypeScript resolves one
  consistent version).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end smoke test surfaced four issues with the Apple Container path:

1. Bind-mount of files: `--mount type=bind,source=<file>,...` is rejected
   by Apple Container with "is not a directory". Container-runner mounts
   `container.json`, the composed CLAUDE.md, and the shared base CLAUDE.md
   as individual files. Switch `readonlyMountArgs` to the `-v src:dst:ro`
   syntax instead, which Apple Container accepts for both files and dirs
   and still honors :ro. Tests updated.

2. Dockerfile WORKDIR pointed at `/workspace/group`, the v1 mount path.
   v2 mounts the agent group folder at `/workspace/agent`. With Docker
   the missing dir was created at run time; Apple Container errors out
   at process start with "failed to change directory". Update WORKDIR
   and the mkdir line to /workspace/agent.

3. The native credential proxy was passing `CLAUDE_CODE_OAUTH_TOKEN=
   placeholder` to the container. Claude CLI then tried to exchange that
   placeholder via /api/oauth/claude_cli/create_api_key, the proxy
   rewrote the bearer with the real token, and api.anthropic.com
   returned 403 because the user's Pro-tier OAuth token lacks
   `org:create_api_key` scope. Switch to `ANTHROPIC_AUTH_TOKEN=placeholder`
   so the CLI sends a plain Bearer header on /v1/messages, which the
   proxy rewrites — bypassing the failing exchange.

4. `CONTAINER_HOST_GATEWAY` was read from `process.env`, but the value
   lives in .env (not exported). The `process.env` read therefore fell
   back to `host.docker.internal`, which Apple Container does not
   resolve, so containers couldn't reach the host proxy at all. Wire it
   through config.ts via readEnvFile() with `host.docker.internal`
   retained as the Docker-on-macOS fallback default.

Plus: setup/service.ts plist PATH gains /opt/homebrew/bin so
launchd-launched v2 can find the `container` binary (Apple Container
installs there on Apple Silicon).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the manual portion of the v1 → v2 migration. Migration is live:
service `com.nanoclaw-v2-05ec8912` running, Telegram answering on
@Hivemind_evabot, owner role granted, both messaging groups locked down
to known-senders-only with the owner added as member of both agent
groups.

- Add three v2-standard container skills (capabilities, pdf-reader,
  status) that were copied in by migrate-v2.sh but never committed.
- Delete orphan group folders groups/global/ and groups/main/. They were
  v2 stock template copies; the cutover at boot already drops
  groups/global/. v1 originals remain at /Users/eva/nanoclaw/groups/ if
  recovery is ever needed.
- Archive .nanoclaw-migrations/ (the prior migration's gotchas notes and
  guide.md) so future contributors can see how we got here.
- Archive logs/setup-migration/handoff.json into
  docs/migration-2026-05-28.json.
- Add docs/v1-fork-followups.md describing what remains to port from the
  v1 fork (Telegram image vision / 4-bot swarm / reply context, Apple
  Pages MCP tools, Gmail tool, container memory tuning) and what's
  already done (Apple Container runtime, native credential proxy).

CLAUDE.local.md cleanups for the two wired groups (telegram_main 325 →
21 lines, telegram_hivemind 152 → 38 lines, paths /workspace/group/ →
/workspace/agent/) aren't tracked — those files are deliberately
per-install.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an optional `memory_mb` column to container_configs. When set,
`buildContainerArgs` passes `-m <N>MiB` to the runtime so the cgroup
limit is honored (both Docker and Apple Container accept this syntax).

Surfaces via `ncl groups config update --memory-mb <N>` (0 clears the
limit). Migration 016 adds the column NULL-able so existing rows keep
the runtime default.

Reason: v1 reliably set ≥2 GB so Chromium (used by agent-browser MCP)
wouldn't OOM mid-session. v2 had no exposed lever. Both wired groups
are now set to 2048 MiB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
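
As a rough illustration of the memory-cap commit above, the mapping from `memory_mb` to runtime arguments might look like the sketch below. `memoryArgs` is a hypothetical helper, not the actual `buildContainerArgs` code, and treating 0 the same as NULL is an assumption based on the "0 clears the limit" note.

```typescript
// Hypothetical helper: translate the optional memory_mb setting into runtime
// arguments. NULL/undefined (and 0, by assumption) mean "use the runtime default".
function memoryArgs(memoryMb: number | null | undefined): string[] {
  if (memoryMb == null || memoryMb === 0) return [];
  if (!Number.isInteger(memoryMb) || memoryMb < 0) {
    throw new RangeError(`memory_mb must be a non-negative integer, got ${memoryMb}`);
  }
  return ["-m", `${memoryMb}MiB`];
}

console.log(memoryArgs(2048));
console.log(memoryArgs(null));
```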
`chat-sdk-bridge.messageToInbound` previously ate any `fetchData()` error
with a generic warn-log; the agent would then see an attachment entry
without a `data` field and have no way to tell whether the file was
missing, too big, or undecodable. Telegram caps bot getFile at 20 MiB,
which is the most common cause and the one v1 surfaced explicitly.

Now we classify the error: `tooBig` matches /too big|too large|too_big|
413|payload too large/i and tags the entry with `error: 'too_big'` and
`errorMessage: <raw>`. Other failures get `error: 'download_failed'`.

`container/agent-runner/src/formatter.ts` then renders explanatory
markers in place of the usual `[type: name — saved to …]` line:
- too_big: `[<type>: <name> (~N MB) — too large to download (over the
  bot's getFile limit); ask the user to share a smaller file or a
  download link]`
- download_failed: `[<type>: <name> — download failed; ask the user to
  resend or share another way]`

The agent then crafts a natural reply instead of silently dropping the
attachment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
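
The error classification in the attachment commit above can be sketched as follows. The regular expression is the one quoted in the commit message; `classifyFetchError` and `AttachmentError` are hypothetical names, and attaching `errorMessage` to `download_failed` entries as well is an assumption.

```typescript
type AttachmentError = {
  error: "too_big" | "download_failed";
  errorMessage: string;
};

// Pattern quoted in the commit message; no `g` flag, so test() is stateless.
const TOO_BIG = /too big|too large|too_big|413|payload too large/i;

// Hypothetical helper: turn a thrown fetchData() error into a tagged entry
// that the formatter can render as an explanatory marker.
function classifyFetchError(err: unknown): AttachmentError {
  const errorMessage = err instanceof Error ? err.message : String(err);
  return {
    error: TOO_BIG.test(errorMessage) ? "too_big" : "download_failed",
    errorMessage,
  };
}

console.log(classifyFetchError(new Error("Bad Request: file is too big")).error);
console.log(classifyFetchError(new Error("socket hang up")).error);
```

Matching on the bare substring `413` is broad: any message that happens to contain those digits would be tagged `too_big`, which may or may not matter for the messages the adapter actually produces.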
Restores v1's multi-bot persona pattern: when an agent team builds out
multiple personas (e.g. "Marine Biologist" + "Alexander Hamilton") in
the Hivemind group, each persona's messages appear from a dedicated
Telegram bot with a matching display name. The user sees a real
multi-participant conversation instead of a single bot relaying everyone.

- src/channels/telegram-swarm.ts: new module. Reads TELEGRAM_BOT_POOL
  (comma-separated tokens) and init-validates each via getMe. Maintains
  a sticky map keyed on (platformId, sender) → pool index, assigned
  round-robin on first use. On first assignment of a (platformId, sender)
  pair, calls setMyName(sender) so the bot's display name matches the
  persona; waits 1.5 s for Telegram propagation before sending. Sends a
  plain-text Telegram message via the picked bot's sendMessage endpoint.
- src/channels/telegram.ts: wraps the adapter's `deliver`. When the
  outbound content has `text` and `sender` and the pool is up (and the
  message isn't a special operation like edit/reaction), routes through
  the pool; falls back to the primary bot on any error.
- container/agent-runner/src/mcp-tools/core.ts: `send_message` gains an
  optional `sender` field. When supplied, it flows into the content JSON
  so the host's deliver path can route via the pool.

Without TELEGRAM_BOT_POOL configured, behavior is unchanged — `sender`
is ignored and everything goes through the primary bot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
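
The sticky round-robin assignment described in the swarm commit above can be sketched in isolation. `StickyPool` is a hypothetical class, not the contents of telegram-swarm.ts; the `firstUse` flag marks the point where the commit says `setMyName(sender)` is called, and the Telegram calls and the 1.5 s wait are left out.

```typescript
// Hypothetical sketch: map each (platformId, sender) pair to a pool index,
// assigned round-robin on first use and reused on every later message.
class StickyPool {
  private readonly assigned = new Map<string, number>();
  private readonly size: number;
  private next = 0;

  constructor(size: number) {
    if (!Number.isInteger(size) || size <= 0) {
      throw new RangeError("pool must contain at least one bot");
    }
    this.size = size;
  }

  /** Returns the pool index for this pair and whether it was just assigned. */
  pick(platformId: string, sender: string): { index: number; firstUse: boolean } {
    // JSON-encode the pair so distinct pairs can never collide on a separator.
    const key = JSON.stringify([platformId, sender]);
    const existing = this.assigned.get(key);
    if (existing !== undefined) return { index: existing, firstUse: false };
    const index = this.next;
    this.next = (this.next + 1) % this.size;
    this.assigned.set(key, index);
    return { index, firstUse: true };
  }
}

const pool = new StickyPool(2);
console.log(pool.pick("chat-1", "Marine Biologist"));
console.log(pool.pick("chat-1", "Alexander Hamilton"));
console.log(pool.pick("chat-1", "Marine Biologist"));
```

With more personas than bots, indices wrap around, so two personas would share one bot; the sketch does not attempt to handle the resulting display-name conflict.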
Marks the four items shipped in this batch (container memory cap,
file-too-big handling, reply context [v2 already had it], Telegram bot
pool / swarm) and rewrites the Gmail + Pages sections with the
architectural detail uncovered while attempting them.

Gmail: upstream skill assumes OneCLI TLS-MITM. v1-style file-based
tokens are blocked by the mount-security `credentials` substring
pattern; viable path is rename → GMAIL_CREDENTIALS_PATH override + per-
group rebuild. Or build a real TLS MITM proxy (essentially mini-OneCLI).

Pages: v1's filesystem IPC (DATA_DIR/ipc/<group>/messages/<id>.json)
doesn't fit v2's DB-only host↔container channel. v2-shaped designs are
sketched out; the preferred one is a `kind=system-action, action=pages.<verb>`
message on messages_out with a host handler that calls osascript.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The agent-runner's default of 14 days is too tight when restoring an
install whose last conversation activity is just over 2 weeks old —
which is the normal state after a migration that took a few days, or
any user returning from vacation.

What actually went wrong today: after migrate-v2.sh restored v1's
session transcripts into the container's CLAUDE_CONFIG_DIR, the first
container wake found them 12–55 days old (depending on group) and
rotated the active continuation, archiving the v1 transcript to
conversations/<date>.md and starting fresh sessions. Both wired groups
lost their resume context as a result; the data was preserved on disk
but the session pointer in `session_state.continuation` got rewritten
to a brand-new empty session.

Bumping to 180 keeps multi-month context alive across reasonable gaps.
Transcripts still rotate on the size cap (CLAUDE_TRANSCRIPT_ROTATE_BYTES,
default 12 MiB), which is what bounds memory.

Recovery for the existing install (not in this commit, done out of band):
- telegram_main: `session_state.continuation` switched from c66f8582
  (3 days) back to 74ebe63b (6.2 MB, Apr 4 → May 27).
- telegram_hivemind: switched from 72ea255a (25 lines) back to d046f59c
  (1954 lines, Apr 3 → May 16).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restores v1's local-only Pages integration. v1's filesystem-IPC pattern
(DATA_DIR/ipc/<group>/messages/<id>.json) doesn't fit v2's DB-only
host↔container channel; this port uses the existing system-action
registry instead.

Container side (container/agent-runner/src/mcp-tools/pages.ts) exposes
11 MCP tools — pages_create, pages_open, pages_save, pages_close,
pages_get_text, pages_insert_text, pages_replace_text,
pages_format_paragraph, pages_export_pdf, pages_list, pages_delete.
Each tool writes a `kind='system', action='pages_request'` message with
{requestId, verb, args} to outbound.db and polls inbound.db for a
matching pages_response. Same correlation pattern as cli_request in
cli/ncl.ts.

Host side (src/modules/pages/applescript.ts) is the v1 osascript helper
module ported verbatim except for the logger swap (pino-style
`logger.info({obj}, 'msg')` → v2's `log.info('msg', {obj})`) and import
path updates. All v1 sandboxing logic ported intact: FILENAME_PATTERN
allowlist (letters/digits/space/_-().), path-escape check after resolve,
sandbox under `groups/<folder>/pages/`. AppleScript correctness — style
applied first because it resets font/color, per-property `set X of Y`
assignment because the `properties` record form fails on rich text —
preserved unchanged from v1.

src/modules/pages/index.ts registers the delivery action, looks the
group folder up via getAgentGroup(session.agent_group_id), dispatches
verb to the AppleScript helper, and writes the response frame to
inbound.db. Self-registers via the modules barrel.

No DB schema or container_configs changes. Works for any group on a
host with Pages.app installed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
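
The sandboxing steps named in the Pages commit above (filename allowlist, then a path-escape check after resolve) can be sketched like this. The regular expression is a reconstruction from the prose description and assumes ASCII letters; it is not the actual FILENAME_PATTERN, and `resolveInSandbox` is a hypothetical name.

```typescript
import * as path from "node:path";

// Assumed reconstruction of the allowlist: letters, digits, space, and _ - ( ) .
const FILENAME_PATTERN = /^[A-Za-z0-9 _\-().]+$/;

// Hypothetical helper: resolve a user-supplied filename inside the sandbox
// directory, rejecting anything that fails the allowlist or escapes the root.
function resolveInSandbox(sandboxDir: string, filename: string): string {
  if (!FILENAME_PATTERN.test(filename)) {
    throw new Error(`invalid filename: ${JSON.stringify(filename)}`);
  }
  const root = path.resolve(sandboxDir);
  const resolved = path.resolve(root, filename);
  // The allowlist still admits "." and "..", so re-check after resolving.
  if (!resolved.startsWith(root + path.sep)) {
    throw new Error("path escapes sandbox");
  }
  return resolved;
}

console.log(resolveInSandbox("/tmp/groups/demo/pages", "Report (v2).pages"));
```

The second check matters because an allowlist that permits dots cannot by itself rule out `..`; resolving first and comparing against the root covers that case.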
Closes the last gap in the container_config CLI surface. Up to now the
only way to wire an additional host→container mount was direct UPDATE
on data/v2.db — fine for the maintainer but unreproducible for anyone
following the followups doc, and outside the approval-gated CLI path
that every other config-mutating command goes through.

- `config add-mount`: upserts by containerPath. Calls validateMount()
  up front so a bad host path (nonexistent, outside the allowlist,
  matches a blocked pattern) fails immediately with the actual reason
  instead of being silently dropped at spawn time with a WARN buried in
  the error log. Reports the effective container path
  (`/workspace/extra/<container>`) so the caller knows where their
  files will appear inside the container.
- `config remove-mount`: filters by containerPath; errors if not found.

Both gated as `approval`-tier, matching add-mcp-server / add-package.

Updates the Gmail wiring recipe in docs/v1-fork-followups.md to use the
new command instead of the prior SQL one-liner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both Gmail and Apple Pages were ported in prior commits (the "Ported"
sections at the top of the file cover them with the actual wiring
recipes). The old "## To port — …" headings underneath were leftover
from when they were still deferred; remove to avoid implying open work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
container.json and the composed CLAUDE.md were mounted as nested
read-only file overlays on top of the agent group directory mount.
Apple Container's virtio-fs creates these inodes at the destination
path (stat() returns 0644) but blocks reads — even as root —
returning EACCES.

This silently disabled all MCP servers configured via
`ncl groups config add-mcp-server` and skipped the composed
CLAUDE.md context (skill instructions, on-wake messages, fragments)
for every Apple Container user. The agent kept answering because
CLAUDE.local.md is exposed only through the directory mount, which
stays readable, so chat looked normal while tool-driven workflows
silently improvised without ever calling the tools.

The dir mount at /workspace/agent already exposes both files. The
read-only intent is preserved in practice because the host
re-materializes container.json from the central DB and recomposes
CLAUDE.md from the shared base + fragments on every container spawn
— any in-container write to either is clobbered on the next session.

If preserving the read-only guarantee with defense-in-depth matters,
a follow-up could detect the runtime at host startup (e.g. inspect
CONTAINER_RUNTIME_BIN or run `container --version`) and only skip
the nested mounts on Apple Container. Reported separately so the
agent-runner mount-race retry (which only helps once these mounts
are skipped) can land independently.