Skip to content

fix(container-runner): rewrite OneCLI proxy port for multi-instance installs#2652

Closed
matty271828 wants to merge 101 commits into
nanocoai:mainfrom
madebymac:fix/onecli-proxy-port-multi-instance
Closed

fix(container-runner): rewrite OneCLI proxy port for multi-instance installs#2652
matty271828 wants to merge 101 commits into
nanocoai:mainfrom
madebymac:fix/onecli-proxy-port-multi-instance

Conversation

@matty271828
Copy link
Copy Markdown

Summary

OneCLI's /api/container-config hardcodes host.docker.internal:10255 in HTTPS_PROXY/HTTP_PROXY. That's correct for stock single-instance setups (gateway on host port 10255), but wrong for multi-instance installs (instances.conf with ONECLI_BASE_PORT / ONECLI_PORT_STRIDE), where each instance's gateway lives on app_port + 1 (e.g. 10355, 10455).

Symptom: containers spawn fine, but HTTPS_PROXY points at :10255 where nothing is listening. Every Claude API call hangs, the agent never writes a reply, heartbeat eventually stales, and host-sweep kills the container at the 30-minute absolute ceiling. The user sees a Telegram "typing…" indicator and then nothing.

Fix: parse the port from ONECLI_URL, compute expected = port + 1, and rewrite :10255:expected in the -e HTTP[S]_PROXY=... args that applyContainerConfig just pushed. Early-return when the computed port is already 10255, so stock single-instance installs are unaffected.

Verified on a Pi running two instances (review on 10354/10355, general on 10454/10455) — before the fix, curl -x $HTTPS_PROXY https://api.anthropic.com from inside the agent container timed out; after the fix it succeeds and the OneCLI gateway logs show MITM POST /v1/messages 200.

Test plan

  • pnpm run build passes
  • scripts/q.test.ts passes in isolation (the 2 failures in the full suite are pre-existing Pi-IO flakiness on tsx-subprocess tests, not caused by this change)
  • Send a Telegram message to a multi-instance bot and confirm a reply lands (manual)
  • Confirm single-instance install is unaffected — ONECLI_URL parses to port 10254, expected = 10255, early-return path taken

matty271828 and others added 30 commits May 19, 2026 22:46
Resolve nanoclaw unit name via install slug in Makefile
Logger emitted only HH:MM:SS.mmm, which made correlating events across
days (e.g. when investigating slow message round-trips) painful. Include
the date so each line is self-contained.
Deps like @chat-adapter/telegram emit log lines via console.error directly,
bypassing the host logger and producing un-timestamped output (visible in
logs/nanoclaw.error.log). Wrap console.{log,info,warn,error,debug} at
startup so every line gets the same timestamp + level prefix as the host
logger. Multi-line outputs are prefixed per line.
- Switch ts() to toISOString() — includes offset (Z), unambiguous across
  hosts and DST transitions, still sortable
- Extract console capture into installConsoleCapture(); call it once from
  src/index.ts. Importing { log } no longer mutates global console, which
  fixes test isolation (vi.spyOn(console, ...) etc.)
- Use util.format(...) so printf-style placeholders work
- Special-case Error so console.error(err) preserves stack+message
- Route console.log -> debug (chatty deps don't pollute info); explicit
  console.info still maps to info
…rformance-vXYFK

Timestamp host + third-party log lines
Stage 1 of issue #4 streaming replies. Adds a `partial` provider event
that carries the accumulated assistant text as the Claude SDK emits it,
and an early-dispatch path in the poll loop that sends each
<message to="..."> block to the channel as soon as its closing tag
appears — instead of waiting for the SDK's final `result` event.

Gated by NANOCLAW_STREAM_REPLIES=1; default off.

Wins for multi-message turns and text-then-tool-use turns. Mid-block
partial streaming (the bigger ~1-2s win for single short replies)
requires edit-on-existing-message support in delivery + adapters and
is designed in docs/streaming-replies.md as Stage 2.

https://claude.ai/code/session_01Vn6LaaQpAyibb1nt7z3QXp
Two correctness fixes from Copilot review on #18:

- Track successfully dispatched blocks by index (set), not by count.
  Unknown-destination blocks no longer skew skipFirstBlocks, so the
  result pass still scratchpad-logs them and the unwrapped-output
  nudge fires correctly.
- Actually reset the stream extractor and dispatched-index set on
  `init` events (e.g. PreCompact mid-stream), matching what the doc
  comment claimed.

https://claude.ai/code/session_01Vn6LaaQpAyibb1nt7z3QXp
perf: stream closed &lt;message&gt; blocks as they arrive (#4 stage 1)
When a 'Message routed' log line appeared with no corresponding row in
inbound.db, there was no way to tell whether the write happened, what
primary key was used, or what seq was assigned. Make the route -> write
-> persist chain auditable from logs alone.

- insertMessage returns { seq } (was void).
- writeSessionMessage returns { id, seq } and emits a debug 'Session
  message persisted' log with the messageId, seq, kind, and resolved
  inbound.db path. Absence of this line for a given routed message
  proves the insert never ran; presence with a different path proves
  cross-mount confusion.
- 'Message routed' in router.ts now includes messageId and seq so
  every host-side route log is correlatable to its DB row.
- host-sweep emits a symmetric 'Container healthy' debug log on the
  'no action' branch (with heartbeatAgeMs), to distinguish 'sweep saw
  container and decided not to kill' from 'sweep never ran' when
  diagnosing stuck sessions.

Return-type changes are non-breaking: existing callers that ignore the
return value continue to work.
Addressing Copilot review: the audit log emits after updateSession(),
so a throw there would suppress the line even though insertMessage
already committed. That breaks the 'absence proves the insert never
ran' invariant the log exists to provide.

Move the log into the try block, immediately after insertMessage
returns. db.close() in finally still runs; updateSession() throws
no longer hide the persist log.
Audit trail for inbound writes and sweep decisions
Splits Stage 2 of the streaming-replies plan into two PRs. This is the
host-only half — pure dead code until the container writer lands in 2b,
but lets the schema + delivery path land first.

## What changes

- `getDeliveredPlatformId(db, messageOutId)` — read helper against the
  per-session `delivered` table, returns the platform message ID for an
  already-delivered row (or null).
- `src/delivery.ts` recognises `content.operation === 'stream_edit'`:
  - Resolves `targetMessageOutId` via `getDeliveredPlatformId`.
  - If unresolved → throw, falling into the existing 3-attempt retry
    path. No new state.
  - If resolved → rewrite to `{ operation: 'edit', messageId, text }`
    and dispatch through the same adapter path `edit` already uses
    (the chat-sdk bridge has handled `operation: 'edit'` for a while).
- Coalescing pass before the per-row delivery loop: walks the
  undelivered queue and marks every `stream_edit` row as delivered
  *without* sending if a later `stream_edit` targeting the same
  `messages_out.id` also sits in the same pass. Every edit is a full
  replacement, so superseded ones would just burn the per-chat rate
  limit budget.

## Why split

Stage 2 as originally specced bundled schema, host resolver, container
writer, and per-channel rate limiting. Landing the host resolver first
means 2b can be reviewed as a focused container-side diff (writer +
throttle, no host plumbing). Until 2b ships, this branch is a no-op:
nothing in the codebase emits `stream_edit` rows.

## Test plan

- [x] `pnpm run build` — clean
- [x] `pnpm test` — 372 pass (3 new in `delivery.test.ts`)
  - resolves stream_edit → edit using the delivered platform ID
  - retries until target lands, then sends the edit
  - coalesces 3 stream_edits to 1 adapter call; superseded rows marked delivered
Four review comments on #20 — all valid, all addressed:

1. Deterministic coalesce order. `getDueOutboundMessages` now orders by
   `timestamp ASC, seq ASC` (was just `timestamp ASC`). `datetime('now')`
   is second-precision, so multiple stream_edits could share a timestamp
   and the relative order of the coalesce loop's "last write wins" was
   nondeterministic. `seq` is monotonic per outbound queue and was
   already on the row.

2. `deliveryAttempts` cleanup on coalesce. Superseded stream_edit rows
   that had previously accrued retries (e.g. their target wasn't
   delivered on an earlier pass) were leaving stale entries in the
   in-memory map. Delete them alongside the markDelivered call.

3. Fail-fast on permanently-failed targets. `getDeliveredPlatformId`
   only returned null and treated "missing", "failed", and "delivered
   with NULL platform_id" identically — so an edit pointing at a
   permanently-failed target burned 3 retries before the throw path
   marked it failed too. Replaced with `getDeliveredLookup` which
   returns a discriminated union (missing | failed | delivered). Edits
   targeting failed (or delivered-with-no-platform-id) rows are dropped
   immediately with a clear log line; only `missing` triggers the
   retry-throw.

4. Doc: the Status preamble said "the current branch ships Stage 1" but
   the table marked 2a as shipped. Updated to reflect both.

New test: `fails fast on stream_edit whose target has permanently
failed` — exhausts the 3 retries on the target, then asserts the
follow-up stream_edit lands in `delivered` (as failed) after a single
drain with zero adapter calls.

373 tests pass.
perf: host-side stream_edit resolver + coalescing (#4 stage 2a)
Stage 2b of #4. Completes the streaming-replies pipeline: with
`NANOCLAW_STREAM_REPLIES=1` the user now sees a `<message>` block grow
in real time, not just a single delivery once the block closes.

## What changes

- `stream-dispatch.ts` exports `detectOpenBlock(text, closedCount)`,
  returning the trailing `<message to="…">` block whose closing tag
  hasn't arrived yet. Its `index` lines up with the closed-block index
  the Stage 1 extractor returns, so `dispatchResultText` and
  `earlyDispatched` work unchanged.
- `poll-loop.ts` adds a per-turn `StreamingBlockState` and two helpers:
  - `advanceStreamingBlock` — on first sight of a non-empty open block
    with a known destination, does an ordinary `writeMessageOut` and
    captures the row's `messages_out.id` as the streaming target.
  - `flushStreamEdit` — writes a `stream_edit` row referencing that
    target, gated by `STREAM_EDIT_MIN_INTERVAL_MS` (1500ms, picked for
    Telegram's 1 edit/sec/chat limit; Stage 3 will make this
    per-channel) and a text-equality skip.
- When the streamed block finally closes, the partial handler emits
  one throttle-bypassing final `flushStreamEdit` and adds the block's
  index to `earlyDispatched` so the `result`-pass doesn't re-send it.
- State is cleared on `init` (e.g. PreCompact mid-stream) and `result`.

## Why this is safe

- Empty openings and unknown destinations bail out before any send, so
  the initial `messages_out` row is never written without a body or a
  target. The result-event scratchpad path still handles those cases.
- The host already coalesces superseded `stream_edit` rows (Stage 2a),
  so if delivery is slower than the 1500ms write cadence, only the
  newest queued edit per target hits the adapter on each drain pass.
- Default off — gated by `NANOCLAW_STREAM_REPLIES` like Stage 1.

## Test plan

- [x] `bun test` in `container/agent-runner/` — 100 pass (5 new in
      `stream-dispatch.test.ts` covering `detectOpenBlock`).
- [x] `bun run typecheck` — clean.
- [x] `pnpm run build` — clean.
- [x] `pnpm test` (host) — 373 pass.
- [ ] Manual: set `NANOCLAW_STREAM_REPLIES=1`, send a prompt that
      produces a long single-block reply, watch the message grow in
      the channel in real time.

Closes the implementation half of #4. Stage 3 (per-channel rate
limiting + `supportsEdits` fallback) is still designed-only.
perf: container-side mid-block stream_edit writer (#4 stage 2b)
…advance

Host periodically fetches the current branch's upstream, and when local
HEAD is strictly behind and fast-forwardable, shells out to
`systemd-run --user --no-block bash -lc 'make deploy'` so the deploy
survives the parent restart triggered by `systemctl --user restart` in
the Makefile target.

Off by default. Enable with `NANOCLAW_SELF_UPGRADE_ENABLED=true`. Skips
dirty working trees, diverged forks, and detached HEADs to avoid clobbering
local work. Linux-only — matches what `make deploy` already requires.

https://claude.ai/code/session_01YNGgookqKaA3cQnnUycVnh
- Fail safe when `git status` errors out: treat as dirty and skip the
  deploy instead of letting null short-circuit `workingTreeDirty` to
  false. Previously a lock file, broken index, or missing git binary
  would silently shell out `make deploy` against an unknown working
  state.

- Convert the tick from spawnSync to async spawn. The previous version
  blocked the Node event loop on every git fetch — including network
  round-trips — freezing delivery polls, host sweep, and the CLI socket
  server until the fetch returned. All shellouts now have explicit
  timeouts (fetch capped at min(60s, interval/2), quick git ops 10s,
  systemd-run 10s) with SIGTERM → SIGKILL escalation. Added inFlight
  guard so overlapping ticks (from very long fetches) are skipped, not
  queued.

- Best-effort `systemctl --user reset-failed <unit>.service` before
  systemd-run. A previous deploy that exited non-zero leaves a stale
  failed unit with the same name, which would otherwise block every
  subsequent self-upgrade with "Unit already exists".

- Documented the synchronous-shutdown assumption on stopSelfUpgrade.

https://claude.ai/code/session_01YNGgookqKaA3cQnnUycVnh
…3nzX

feat: opt-in self-upgrade poller that runs `make deploy` on upstream advance
Flip the default for `NANOCLAW_SELF_UPGRADE_ENABLED` from off to on.
Single-user installs don't need to opt in; set the env var to `false`
to disable.

https://claude.ai/code/session_013d9ysGuomLMQMHgmnC5XgA
…fault-jW2SP

feat(self-upgrade): enable by default
matty271828 and others added 24 commits May 24, 2026 19:04
Tens of ms per write saved on slow disks (Pi SD card target) across the
chatty inbound/outbound poll path. Also adds busy_timeout=5000 to the
central DB, which was missing.

Safety:
- synchronous=NORMAL with journal_mode=WAL: standard safe pairing;
  durable across crashes, just skips the redundant cross-commit fsync.
- synchronous=NORMAL with journal_mode=DELETE: each commit still fsyncs
  the rollback journal before deletion. Power-loss tolerance unchanged
  from our existing model.
- temp_store=MEMORY: any query materializing a temp table/index stays
  in RAM. Always safe.

Closes #16
`pollActive` only iterates running sessions (containers with a turn
actively in flight), so cadence already gates on "work in progress" —
5× polling adds negligible idle load. The existing inflightDeliveries
guard prevents double-delivery races with the 60s sweep.

Worst-case detection lag for an outbound row drops from ~1s to 200ms.

Closes #11
Two writer opens in container/agent-runner/src/cli/ncl.ts (writeRequest
and the processing_ack INSERT in pollResponse) were missed by the prior
pragma pass. Bring them in line with the other session-DB writers for
consistency — ncl invocations are infrequent enough that perf doesn't
matter, but the next reader shouldn't have to wonder why these writers
are different.

Addresses review feedback on #16.
Setup wrote NEXTAUTH_SECRET into every per-instance OneCLI .env, but
never wired Google OAuth. Recent OneCLI versions refuse to start in this
state and the web UI shows "NEXTAUTH_SECRET is set but
GOOGLE_CLIENT_ID/SECRET are missing." Default installs run in local mode,
so only persist the secret when OAuth creds are also present, and strip
stale secrets from existing .env files so the gateway boots cleanly.
The writeback rewrote .env from a hardcoded list every tick, which would
wipe any user-added vars — including GOOGLE_CLIENT_ID/SECRET added to
opt into OAuth mode. Only the keys setup actually owns get rewritten;
everything else is carried through.
The previous filter dropped every comment line and blank line, which
contradicted the header promising user-added content is preserved
across redeploys. Now we strip only the previous auto-generated header
block (so it doesn't accumulate) and managed KEY=VALUE lines; user
comments, blank lines, and unknown keys carry through verbatim.
Covers what differs because two instances run from one checkout:
NCL_INSTANCE fan-out, auto-derived OneCLI port triples, the
flock-serialized make deploy, per-instance channels/tokens, the
shared container image, add/remove flow, and common gotchas.
Everything else defers to CLAUDE.md and docs/.
- Container image is nanoclaw-agent-v2-<slug>:latest, not
  nanoclaw-agent:<slug>; the slug is in the image name and the
  tag is always latest.
- ncl socket path is <DATA_DIR>/ncl.sock — per-instance via
  DATA_DIR, not derived from the install slug.
- Instance removal needs docker compose -p onecli-<name> down;
  bumping ONECLI_PORT_STRIDE needs the OneCLI compose project
  brought down and re-upped on the new port too.
The image bakes NEXTAUTH_URL=http://localhost:10254 as default, so
NextAuth-generated OAuth callbacks (e.g. GitHub App connect) always
pointed at port 10254 regardless of ONECLI_APP_PORT. Non-default-port
installs hit "redirect_uri not associated with this application" or
landed on the wrong instance's gateway.

Bind to 127.0.0.1 explicitly — ONECLI_BIND_HOST may be 172.17.0.1
(docker bridge), but OAuth providers redirect the user's browser,
which lives on the host.
Path is /api/apps/github/callback (NextAuth route), not /v1/...
Also note the multi-instance port lookup and 127.0.0.1 vs localhost,
which NextAuth distinguishes.
- Re-add the remote-install example dropped in the previous pass.
- Explain *why* 127.0.0.1 vs localhost matters (NextAuth normalization)
  so future readers don't 'fix' it back to localhost.
- Call out the path change (/v1/ -> /api/) and host-part change for
  users upgrading existing GitHub App registrations.
OneCLI's /app-connect OAuth flow (GitHub Apps, etc.) reads
NEXT_PUBLIC_APP_URL — not NEXTAUTH_URL — when constructing redirect_uri
and the 'Add this URL to your OAuth app' string in the UI. The image
bakes 'http://localhost:10254' as the fallback, so non-default-port
installs sent users to the wrong port mid-OAuth even after the previous
NEXTAUTH_URL fix. NEXTAUTH_URL only governs NextAuth's own
/api/auth/callback/* routes.
Yesterday's fix stripped NEXTAUTH_SECRET from per-instance .env files in
local mode, but the compose template still had
`NEXTAUTH_SECRET: ${NEXTAUTH_SECRET:-}`, which injects the variable as
an empty string into the container. OneCLI 1.23.0 treats that as "set"
and refuses to start, surfacing the same "NEXTAUTH_SECRET is set but
GOOGLE_CLIENT_ID/SECRET are missing" error in the web UI.

Render the compose file instead of copying it, and only emit the
NEXTAUTH_SECRET environment line when we actually have a secret.
`docker compose restart` keeps the pgdata and app-data named volumes
intact (no down/up, no --volumes), so credentials and the postgres
state survive a deploy.
`docker compose restart` only bounces the existing containers — it
won't re-read the compose file or pull a new ONECLI_VERSION. `up -d`
preserves the named volumes the same way and actually applies any
template/.env changes the deploy might have brought in.
`POLL_INTERVAL_MS` gates only the idle / accumulate-only sleeps between
poll iterations — the active follow-up poll while a turn is in flight
already runs at 500ms (ACTIVE_POLL_INTERVAL_MS, untouched). Each idle
poll is a single indexed read against messages_in, so 5× cadence is
cheap. Worst-case wake lag for a freshly-inserted inbound row drops
from ~1s to 200ms — symmetric with the host delivery poll change in
the prior PR.

Also refreshes stale 1s references in docs/architecture.md,
architecture-diagram.{md,html}, and agent-runner-details.md (a few were
about the host delivery poll missed in the prior pass).

Closes #10
The deploy step that ran `setup --step onecli` (which re-copies the
compose template from setup/onecli/docker-compose.yml.template into
~/.onecli-<inst>/) was gated behind 'container not already running' —
so template updates landed in the repo on `make deploy`, but the
per-instance compose file was only re-rendered on first install.

This meant the recent NEXTAUTH_URL / NEXT_PUBLIC_APP_URL additions
never made it into running gateways: `docker compose up -d` saw an
unchanged compose file and treated the containers as up-to-date.

Copy the template before the existing bounce step so the per-instance
compose file always reflects the repo, and `up -d` recreates the
container whenever the env block changes.
OneCLI creates agents in `selective` secret mode by default, which
means no vault secrets are assigned — every credentialed call from
the container 401s until an operator manually runs
`onecli agents set-secret-mode --mode all` (documented gotcha in
CLAUDE.md). The SDK doesn't expose setSecretMode, so we hit the HTTP
API directly with the existing Bearer token: GET /api/agents to
resolve the internal id from our identifier, then PATCH
/api/agents/<id>/secret-mode {mode:"all"}.

Failures are logged and swallowed so a transient OneCLI hiccup
doesn't block all container spawns for the agent group — the agent
still starts, just in the old selective mode.
…il comment, drop unrelated Makefile change

- ensureAgentSecretModeAll now warns when the agent record is missing
  the expected `secretMode` field, so a future OneCLI rename surfaces
  in logs instead of silently turning the idempotency check into a
  PATCH-per-spawn.
- Comment block above the OneCLI wiring calls now distinguishes the
  hard-fail wiring (ensureAgent + applyContainerConfig — throws block
  spawn) from the soft-fail secret-mode flip (degrades to selective
  mode, doesn't block spawn).
- Reverts the Makefile change that slipped in from a stash pop — it
  belongs to a separate PR about `make deploy` compose refresh.
… tail

Split the deploy recipe into two phases. Phase 1 (git pull, install,
bootstrap, build) runs best-effort under `set -e` in a subshell; if it
fails, deploy logs a WARN, marks a non-zero exit, and falls through
instead of aborting. Phase 2 (OneCLI bounce, agent-container kill, unit
restart) now ALWAYS runs, so a wedged agent container or a dirty/broken
tree no longer leaves the instance stuck — deploy bounces it onto the
last good dist/ regardless.

Phase 2 also force-kills every agent container by its
`nanoclaw-install=<slug>` label (the same scope the host's startup
cleanupOrphans() uses) before restarting, so a hung turn is cleared
immediately rather than waiting on the restarted host to reap it.

Automated self-upgrade behaviour is unchanged: src/self-upgrade.ts only
fires `make deploy` on a clean fast-forward and skips dirty trees, so
removing the recipe-level hard stop cannot cause restart thrash.
…nstalls

OneCLI's /api/container-config hardcodes host.docker.internal:10255 in
HTTPS_PROXY/HTTP_PROXY, which is correct for stock single-instance setups
but wrong for multi-instance installs where each instance's gateway lives
on app_port + 1 (e.g. 10355, 10455). Containers point at :10255, nothing
listens there, every Claude API call hangs, the agent produces no reply,
heartbeat eventually stales, and host-sweep kills the container at the
30-minute absolute ceiling — surfacing as "typing indicator but no reply".

Fix: parse the port from ONECLI_URL after applyContainerConfig and rewrite
:10255 in the -e HTTP[S]_PROXY args. No-op when the computed port is
already 10255 (single-instance default), so stock installs are unaffected.
@matty271828
Copy link
Copy Markdown
Author

Closing — should have been raised on the fork, not upstream. Reopening on madebymac/nanoclaw.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants