fix(container-runner): rewrite OneCLI proxy port for multi-instance installs#2652
Closed
matty271828 wants to merge 101 commits into
Closed
fix(container-runner): rewrite OneCLI proxy port for multi-instance installs#2652matty271828 wants to merge 101 commits into
matty271828 wants to merge 101 commits into
Conversation
…ild, enable --now
Add Makefile and systemd service
Resolve nanoclaw unit name via install slug in Makefile
Logger emitted only HH:MM:SS.mmm, which made correlating events across days (e.g. when investigating slow message round-trips) painful. Include the date so each line is self-contained.
Deps like @chat-adapter/telegram emit log lines via console.error directly,
bypassing the host logger and producing un-timestamped output (visible in
logs/nanoclaw.error.log). Wrap console.{log,info,warn,error,debug} at
startup so every line gets the same timestamp + level prefix as the host
logger. Multi-line outputs are prefixed per line.
- Switch ts() to toISOString() — includes offset (Z), unambiguous across
hosts and DST transitions, still sortable
- Extract console capture into installConsoleCapture(); call it once from
src/index.ts. Importing { log } no longer mutates global console, which
fixes test isolation (vi.spyOn(console, ...) etc.)
- Use util.format(...) so printf-style placeholders work
- Special-case Error so console.error(err) preserves stack+message
- Route console.log -> debug (chatty deps don't pollute info); explicit
console.info still maps to info
…rformance-vXYFK Timestamp host + third-party log lines
Stage 1 of issue #4 streaming replies. Adds a `partial` provider event that carries the accumulated assistant text as the Claude SDK emits it, and an early-dispatch path in the poll loop that sends each <message to="..."> block to the channel as soon as its closing tag appears — instead of waiting for the SDK's final `result` event. Gated by NANOCLAW_STREAM_REPLIES=1; default off. Wins for multi-message turns and text-then-tool-use turns. Mid-block partial streaming (the bigger ~1-2s win for single short replies) requires edit-on-existing-message support in delivery + adapters and is designed in docs/streaming-replies.md as Stage 2. https://claude.ai/code/session_01Vn6LaaQpAyibb1nt7z3QXp
Two correctness fixes from Copilot review on #18: - Track successfully dispatched blocks by index (set), not by count. Unknown-destination blocks no longer skew skipFirstBlocks, so the result pass still scratchpad-logs them and the unwrapped-output nudge fires correctly. - Actually reset the stream extractor and dispatched-index set on `init` events (e.g. PreCompact mid-stream), matching what the doc comment claimed. https://claude.ai/code/session_01Vn6LaaQpAyibb1nt7z3QXp
perf: stream closed <message> blocks as they arrive (#4 stage 1)
When a 'Message routed' log line appeared with no corresponding row in
inbound.db, there was no way to tell whether the write happened, what
primary key was used, or what seq was assigned. Make the route -> write
-> persist chain auditable from logs alone.
- insertMessage returns { seq } (was void).
- writeSessionMessage returns { id, seq } and emits a debug 'Session
message persisted' log with the messageId, seq, kind, and resolved
inbound.db path. Absence of this line for a given routed message
proves the insert never ran; presence with a different path proves
cross-mount confusion.
- 'Message routed' in router.ts now includes messageId and seq so
every host-side route log is correlatable to its DB row.
- host-sweep emits a symmetric 'Container healthy' debug log on the
'no action' branch (with heartbeatAgeMs), to distinguish 'sweep saw
container and decided not to kill' from 'sweep never ran' when
diagnosing stuck sessions.
Return-type changes are non-breaking: existing callers that ignore the
return value continue to work.
Addressing Copilot review: the audit log emits after updateSession(), so a throw there would suppress the line even though insertMessage already committed. That breaks the 'absence proves the insert never ran' invariant the log exists to provide. Move the log into the try block, immediately after insertMessage returns. db.close() in finally still runs; updateSession() throws no longer hide the persist log.
Audit trail for inbound writes and sweep decisions
Splits Stage 2 of the streaming-replies plan into two PRs. This is the
host-only half — pure dead code until the container writer lands in 2b,
but lets the schema + delivery path land first.
## What changes
- `getDeliveredPlatformId(db, messageOutId)` — read helper against the
per-session `delivered` table, returns the platform message ID for an
already-delivered row (or null).
- `src/delivery.ts` recognises `content.operation === 'stream_edit'`:
- Resolves `targetMessageOutId` via `getDeliveredPlatformId`.
- If unresolved → throw, falling into the existing 3-attempt retry
path. No new state.
- If resolved → rewrite to `{ operation: 'edit', messageId, text }`
and dispatch through the same adapter path `edit` already uses
(the chat-sdk bridge has handled `operation: 'edit'` for a while).
- Coalescing pass before the per-row delivery loop: walks the
undelivered queue and marks every `stream_edit` row as delivered
*without* sending if a later `stream_edit` targeting the same
`messages_out.id` also sits in the same pass. Every edit is a full
replacement, so superseded ones would just burn the per-chat rate
limit budget.
## Why split
Stage 2 as originally specced bundled schema, host resolver, container
writer, and per-channel rate limiting. Landing the host resolver first
means 2b can be reviewed as a focused container-side diff (writer +
throttle, no host plumbing). Until 2b ships, this branch is a no-op:
nothing in the codebase emits `stream_edit` rows.
## Test plan
- [x] `pnpm run build` — clean
- [x] `pnpm test` — 372 pass (3 new in `delivery.test.ts`)
- resolves stream_edit → edit using the delivered platform ID
- retries until target lands, then sends the edit
- coalesces 3 stream_edits to 1 adapter call; superseded rows marked delivered
Four review comments on #20 — all valid, all addressed: 1. Deterministic coalesce order. `getDueOutboundMessages` now orders by `timestamp ASC, seq ASC` (was just `timestamp ASC`). `datetime('now')` is second-precision, so multiple stream_edits could share a timestamp and the relative order of the coalesce loop's "last write wins" was nondeterministic. `seq` is monotonic per outbound queue and was already on the row. 2. `deliveryAttempts` cleanup on coalesce. Superseded stream_edit rows that had previously accrued retries (e.g. their target wasn't delivered on an earlier pass) were leaving stale entries in the in-memory map. Delete them alongside the markDelivered call. 3. Fail-fast on permanently-failed targets. `getDeliveredPlatformId` only returned null and treated "missing", "failed", and "delivered with NULL platform_id" identically — so an edit pointing at a permanently-failed target burned 3 retries before the throw path marked it failed too. Replaced with `getDeliveredLookup` which returns a discriminated union (missing | failed | delivered). Edits targeting failed (or delivered-with-no-platform-id) rows are dropped immediately with a clear log line; only `missing` triggers the retry-throw. 4. Doc: the Status preamble said "the current branch ships Stage 1" but the table marked 2a as shipped. Updated to reflect both. New test: `fails fast on stream_edit whose target has permanently failed` — exhausts the 3 retries on the target, then asserts the follow-up stream_edit lands in `delivered` (as failed) after a single drain with zero adapter calls. 373 tests pass.
perf: host-side stream_edit resolver + coalescing (#4 stage 2a)
Stage 2b of #4. Completes the streaming-replies pipeline: with `NANOCLAW_STREAM_REPLIES=1` the user now sees a `<message>` block grow in real time, not just a single delivery once the block closes. ## What changes - `stream-dispatch.ts` exports `detectOpenBlock(text, closedCount)`, returning the trailing `<message to="…">` block whose closing tag hasn't arrived yet. Its `index` lines up with the closed-block index the Stage 1 extractor returns, so `dispatchResultText` and `earlyDispatched` work unchanged. - `poll-loop.ts` adds a per-turn `StreamingBlockState` and two helpers: - `advanceStreamingBlock` — on first sight of a non-empty open block with a known destination, does an ordinary `writeMessageOut` and captures the row's `messages_out.id` as the streaming target. - `flushStreamEdit` — writes a `stream_edit` row referencing that target, gated by `STREAM_EDIT_MIN_INTERVAL_MS` (1500ms, picked for Telegram's 1 edit/sec/chat limit; Stage 3 will make this per-channel) and a text-equality skip. - When the streamed block finally closes, the partial handler emits one throttle-bypassing final `flushStreamEdit` and adds the block's index to `earlyDispatched` so the `result`-pass doesn't re-send it. - State is cleared on `init` (e.g. PreCompact mid-stream) and `result`. ## Why this is safe - Empty openings and unknown destinations bail out before any send, so the initial `messages_out` row is never written without a body or a target. The result-event scratchpad path still handles those cases. - The host already coalesces superseded `stream_edit` rows (Stage 2a), so if delivery is slower than the 1500ms write cadence, only the newest queued edit per target hits the adapter on each drain pass. - Default off — gated by `NANOCLAW_STREAM_REPLIES` like Stage 1. ## Test plan - [x] `bun test` in `container/agent-runner/` — 100 pass (5 new in `stream-dispatch.test.ts` covering `detectOpenBlock`). - [x] `bun run typecheck` — clean. - [x] `pnpm run build` — clean. - [x] `pnpm test` (host) — 373 pass. - [ ] Manual: set `NANOCLAW_STREAM_REPLIES=1`, send a prompt that produces a long single-block reply, watch the message grow in the channel in real time. Closes the implementation half of #4. Stage 3 (per-channel rate limiting + `supportsEdits` fallback) is still designed-only.
perf: container-side mid-block stream_edit writer (#4 stage 2b)
…advance Host periodically fetches the current branch's upstream, and when local HEAD is strictly behind and fast-forwardable, shells out to `systemd-run --user --no-block bash -lc 'make deploy'` so the deploy survives the parent restart triggered by `systemctl --user restart` in the Makefile target. Off by default. Enable with `NANOCLAW_SELF_UPGRADE_ENABLED=true`. Skips dirty working trees, diverged forks, and detached HEADs to avoid clobbering local work. Linux-only — matches what `make deploy` already requires. https://claude.ai/code/session_01YNGgookqKaA3cQnnUycVnh
- Fail safe when `git status` errors out: treat as dirty and skip the deploy instead of letting null short-circuit `workingTreeDirty` to false. Previously a lock file, broken index, or missing git binary would silently shell out `make deploy` against an unknown working state. - Convert the tick from spawnSync to async spawn. The previous version blocked the Node event loop on every git fetch — including network round-trips — freezing delivery polls, host sweep, and the CLI socket server until the fetch returned. All shellouts now have explicit timeouts (fetch capped at min(60s, interval/2), quick git ops 10s, systemd-run 10s) with SIGTERM → SIGKILL escalation. Added inFlight guard so overlapping ticks (from very long fetches) are skipped, not queued. - Best-effort `systemctl --user reset-failed <unit>.service` before systemd-run. A previous deploy that exited non-zero leaves a stale failed unit with the same name, which would otherwise block every subsequent self-upgrade with "Unit already exists". - Documented the synchronous-shutdown assumption on stopSelfUpgrade. https://claude.ai/code/session_01YNGgookqKaA3cQnnUycVnh
…3nzX feat: opt-in self-upgrade poller that runs `make deploy` on upstream advance
Flip the default for `NANOCLAW_SELF_UPGRADE_ENABLED` from off to on. Single-user installs don't need to opt in; set the env var to `false` to disable. https://claude.ai/code/session_013d9ysGuomLMQMHgmnC5XgA
…fault-jW2SP feat(self-upgrade): enable by default
Tens of ms per write saved on slow disks (Pi SD card target) across the chatty inbound/outbound poll path. Also adds busy_timeout=5000 to the central DB, which was missing. Safety: - synchronous=NORMAL with journal_mode=WAL: standard safe pairing; durable across crashes, just skips the redundant cross-commit fsync. - synchronous=NORMAL with journal_mode=DELETE: each commit still fsyncs the rollback journal before deletion. Power-loss tolerance unchanged from our existing model. - temp_store=MEMORY: any query materializing a temp table/index stays in RAM. Always safe. Closes #16
`pollActive` only iterates running sessions (containers with a turn actively in flight), so cadence already gates on "work in progress" — 5× polling adds negligible idle load. The existing inflightDeliveries guard prevents double-delivery races with the 60s sweep. Worst-case detection lag for an outbound row drops from ~1s to 200ms. Closes #11
Two writer opens in container/agent-runner/src/cli/ncl.ts (writeRequest and the processing_ack INSERT in pollResponse) were missed by the prior pragma pass. Bring them in line with the other session-DB writers for consistency — ncl invocations are infrequent enough that perf doesn't matter, but the next reader shouldn't have to wonder why these writers are different. Addresses review feedback on #16.
Setup wrote NEXTAUTH_SECRET into every per-instance OneCLI .env, but never wired Google OAuth. Recent OneCLI versions refuse to start in this state and the web UI shows "NEXTAUTH_SECRET is set but GOOGLE_CLIENT_ID/SECRET are missing." Default installs run in local mode, so only persist the secret when OAuth creds are also present, and strip stale secrets from existing .env files so the gateway boots cleanly.
The writeback rewrote .env from a hardcoded list every tick, which would wipe any user-added vars — including GOOGLE_CLIENT_ID/SECRET added to opt into OAuth mode. Only the keys setup actually owns get rewritten; everything else is carried through.
The previous filter dropped every comment line and blank line, which contradicted the header promising user-added content is preserved across redeploys. Now we strip only the previous auto-generated header block (so it doesn't accumulate) and managed KEY=VALUE lines; user comments, blank lines, and unknown keys carry through verbatim.
Covers what differs because two instances run from one checkout: NCL_INSTANCE fan-out, auto-derived OneCLI port triples, the flock-serialized make deploy, per-instance channels/tokens, the shared container image, add/remove flow, and common gotchas. Everything else defers to CLAUDE.md and docs/.
- Container image is nanoclaw-agent-v2-<slug>:latest, not nanoclaw-agent:<slug>; the slug is in the image name and the tag is always latest. - ncl socket path is <DATA_DIR>/ncl.sock — per-instance via DATA_DIR, not derived from the install slug. - Instance removal needs docker compose -p onecli-<name> down; bumping ONECLI_PORT_STRIDE needs the OneCLI compose project brought down and re-upped on the new port too.
The image bakes NEXTAUTH_URL=http://localhost:10254 as default, so NextAuth-generated OAuth callbacks (e.g. GitHub App connect) always pointed at port 10254 regardless of ONECLI_APP_PORT. Non-default-port installs hit "redirect_uri not associated with this application" or landed on the wrong instance's gateway. Bind to 127.0.0.1 explicitly — ONECLI_BIND_HOST may be 172.17.0.1 (docker bridge), but OAuth providers redirect the user's browser, which lives on the host.
Path is /api/apps/github/callback (NextAuth route), not /v1/... Also note the multi-instance port lookup and 127.0.0.1 vs localhost, which NextAuth distinguishes.
- Re-add the remote-install example dropped in the previous pass. - Explain *why* 127.0.0.1 vs localhost matters (NextAuth normalization) so future readers don't 'fix' it back to localhost. - Call out the path change (/v1/ -> /api/) and host-part change for users upgrading existing GitHub App registrations.
OneCLI's /app-connect OAuth flow (GitHub Apps, etc.) reads NEXT_PUBLIC_APP_URL — not NEXTAUTH_URL — when constructing redirect_uri and the 'Add this URL to your OAuth app' string in the UI. The image bakes 'http://localhost:10254' as the fallback, so non-default-port installs sent users to the wrong port mid-OAuth even after the previous NEXTAUTH_URL fix. NEXTAUTH_URL only governs NextAuth's own /api/auth/callback/* routes.
Yesterday's fix stripped NEXTAUTH_SECRET from per-instance .env files in
local mode, but the compose template still had
`NEXTAUTH_SECRET: ${NEXTAUTH_SECRET:-}`, which injects the variable as
an empty string into the container. OneCLI 1.23.0 treats that as "set"
and refuses to start, surfacing the same "NEXTAUTH_SECRET is set but
GOOGLE_CLIENT_ID/SECRET are missing" error in the web UI.
Render the compose file instead of copying it, and only emit the
NEXTAUTH_SECRET environment line when we actually have a secret.
`docker compose restart` keeps the pgdata and app-data named volumes intact (no down/up, no --volumes), so credentials and the postgres state survive a deploy.
`docker compose restart` only bounces the existing containers — it won't re-read the compose file or pull a new ONECLI_VERSION. `up -d` preserves the named volumes the same way and actually applies any template/.env changes the deploy might have brought in.
`POLL_INTERVAL_MS` gates only the idle / accumulate-only sleeps between
poll iterations — the active follow-up poll while a turn is in flight
already runs at 500ms (ACTIVE_POLL_INTERVAL_MS, untouched). Each idle
poll is a single indexed read against messages_in, so 5× cadence is
cheap. Worst-case wake lag for a freshly-inserted inbound row drops
from ~1s to 200ms — symmetric with the host delivery poll change in
the prior PR.
Also refreshes stale 1s references in docs/architecture.md,
architecture-diagram.{md,html}, and agent-runner-details.md (a few were
about the host delivery poll missed in the prior pass).
Closes #10
The deploy step that ran `setup --step onecli` (which re-copies the compose template from setup/onecli/docker-compose.yml.template into ~/.onecli-<inst>/) was gated behind 'container not already running' — so template updates landed in the repo on `make deploy`, but the per-instance compose file was only re-rendered on first install. This meant the recent NEXTAUTH_URL / NEXT_PUBLIC_APP_URL additions never made it into running gateways: `docker compose up -d` saw an unchanged compose file and treated the containers as up-to-date. Copy the template before the existing bounce step so the per-instance compose file always reflects the repo, and `up -d` recreates the container whenever the env block changes.
OneCLI creates agents in `selective` secret mode by default, which
means no vault secrets are assigned — every credentialed call from
the container 401s until an operator manually runs
`onecli agents set-secret-mode --mode all` (documented gotcha in
CLAUDE.md). The SDK doesn't expose setSecretMode, so we hit the HTTP
API directly with the existing Bearer token: GET /api/agents to
resolve the internal id from our identifier, then PATCH
/api/agents/<id>/secret-mode {mode:"all"}.
Failures are logged and swallowed so a transient OneCLI hiccup
doesn't block all container spawns for the agent group — the agent
still starts, just in the old selective mode.
…il comment, drop unrelated Makefile change - ensureAgentSecretModeAll now warns when the agent record is missing the expected `secretMode` field, so a future OneCLI rename surfaces in logs instead of silently turning the idempotency check into a PATCH-per-spawn. - Comment block above the OneCLI wiring calls now distinguishes the hard-fail wiring (ensureAgent + applyContainerConfig — throws block spawn) from the soft-fail secret-mode flip (degrades to selective mode, doesn't block spawn). - Reverts the Makefile change that slipped in from a stash pop — it belongs to a separate PR about `make deploy` compose refresh.
… tail Split the deploy recipe into two phases. Phase 1 (git pull, install, bootstrap, build) runs best-effort under `set -e` in a subshell; if it fails, deploy logs a WARN, marks a non-zero exit, and falls through instead of aborting. Phase 2 (OneCLI bounce, agent-container kill, unit restart) now ALWAYS runs, so a wedged agent container or a dirty/broken tree no longer leaves the instance stuck — deploy bounces it onto the last good dist/ regardless. Phase 2 also force-kills every agent container by its `nanoclaw-install=<slug>` label (the same scope the host's startup cleanupOrphans() uses) before restarting, so a hung turn is cleared immediately rather than waiting on the restarted host to reap it. Automated self-upgrade behaviour is unchanged: src/self-upgrade.ts only fires `make deploy` on a clean fast-forward and skips dirty trees, so removing the recipe-level hard stop cannot cause restart thrash.
…nstalls OneCLI's /api/container-config hardcodes host.docker.internal:10255 in HTTPS_PROXY/HTTP_PROXY, which is correct for stock single-instance setups but wrong for multi-instance installs where each instance's gateway lives on app_port + 1 (e.g. 10355, 10455). Containers point at :10255, nothing listens there, every Claude API call hangs, the agent produces no reply, heartbeat eventually stales, and host-sweep kills the container at the 30-minute absolute ceiling — surfacing as "typing indicator but no reply". Fix: parse the port from ONECLI_URL after applyContainerConfig and rewrite :10255 in the -e HTTP[S]_PROXY args. No-op when the computed port is already 10255 (single-instance default), so stock installs are unaffected.
Author
|
Closing — should have been raised on the fork, not upstream. Reopening on madebymac/nanoclaw. |
This was referenced May 31, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
OneCLI's
/api/container-confighardcodeshost.docker.internal:10255inHTTPS_PROXY/HTTP_PROXY. That's correct for stock single-instance setups (gateway on host port 10255), but wrong for multi-instance installs (instances.confwithONECLI_BASE_PORT/ONECLI_PORT_STRIDE), where each instance's gateway lives onapp_port + 1(e.g. 10355, 10455).Symptom: containers spawn fine, but
HTTPS_PROXYpoints at:10255where nothing is listening. Every Claude API call hangs, the agent never writes a reply, heartbeat eventually stales, andhost-sweepkills the container at the 30-minute absolute ceiling. The user sees a Telegram "typing…" indicator and then nothing.Fix: parse the port from
ONECLI_URL, computeexpected = port + 1, and rewrite:10255→:expectedin the-e HTTP[S]_PROXY=...args thatapplyContainerConfigjust pushed. Early-return when the computed port is already 10255, so stock single-instance installs are unaffected.Verified on a Pi running two instances (
reviewon 10354/10355,generalon 10454/10455) — before the fix,curl -x $HTTPS_PROXY https://api.anthropic.comfrom inside the agent container timed out; after the fix it succeeds and the OneCLI gateway logs showMITM POST /v1/messages 200.Test plan
pnpm run buildpassesscripts/q.test.tspasses in isolation (the 2 failures in the full suite are pre-existing Pi-IO flakiness ontsx-subprocess tests, not caused by this change)ONECLI_URLparses to port 10254,expected = 10255, early-return path taken