fix(whatsapp): correct recovery guidance for decrypt failures and 401 logout#2469

Open
dwudwu wants to merge 127 commits into nanocoai:main from dwudwu:feat/whatsapp-recovery-guidance

@dwudwu dwudwu commented May 14, 2026

Summary

Two small fixes to the WhatsApp adapter's failure-mode operator UX:

  • Decrypt-failure alert text. The current alert tells the operator to run launchctl kickstart on NanoClaw, which doesn't recover from Signal session corruption because the bad sessions live on disk in store/auth/. Replace it with accurate guidance: the session usually self-heals on the next inbound message, and if it doesn't, the alert points at the specific store/auth/session-<userPart>*.json file to scrub.
  • 401 DisconnectReason.loggedOut handler. Today this only emits log.info('WhatsApp logged out') and then goes silent, which is easy to miss until messages have been dropping for hours. Promote it to log.warn with explicit re-pair instructions, and write a flag file at store/whatsapp-logged-out.txt so operator tooling (status bar app, monitoring scripts) can detect the dead-credential state without scraping logs. This mirrors the existing pairing-code.txt pattern; the flag is cleared on successful reconnect.

Context

While diagnosing a WhatsApp test message, I found this install had hit five 401 logouts and 3,190 raw Bad MAC log lines over the lifetime of the log file. Investigation showed that the existing whatsapp-decrypt-tracker (#7d3fce8) correctly catches the ~8 messages that were actually dropped. The 3,190:8 ratio comes from libsignal logging an error per session attempt, and most of those errors are benign (a later session in the multi-session decrypt succeeds). But two follow-on issues remained:

  1. When the tracker does fire, the recovery hint sends the operator down the wrong path (launchctl kickstart doesn't fix anything that lives in store/auth/).
  2. When WhatsApp does invalidate the device (401), the adapter goes silent. This is the most user-visible failure mode, yet it currently produces the quietest log output.

Changes

src/channels/whatsapp.ts:

  • Add loggedOutFile constant alongside pairingCodeFile.
  • Rewrite the handleDecryptFailure operator-DM text.
  • On 401 close: log.warn (was log.info) + write the flag file.
  • On connection === 'open': also clear the flag file.
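
The flag-file lifecycle listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the helper names (markLoggedOut, clearLoggedOut, isLoggedOut), the file contents, and the temporary demo directory are all hypothetical. Only the whatsapp-logged-out.txt file name and the write-on-401, clear-on-open behavior come from the description.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const FLAG_NAME = "whatsapp-logged-out.txt";

// Hypothetical helper: would be called from the 401 (loggedOut) close path.
function markLoggedOut(storeDir: string, now: Date = new Date()): string {
  const flagPath = path.join(storeDir, FLAG_NAME);
  fs.mkdirSync(storeDir, { recursive: true });
  // The file body is an assumption; tooling only needs the file to exist.
  fs.writeFileSync(flagPath, `logged out at ${now.toISOString()}\n`, "utf8");
  return flagPath;
}

// Hypothetical helper: would be called when connection === 'open'.
// force: true makes this a no-op when the flag is already absent.
function clearLoggedOut(storeDir: string): void {
  fs.rmSync(path.join(storeDir, FLAG_NAME), { force: true });
}

// What operator tooling could poll instead of scraping logs.
function isLoggedOut(storeDir: string): boolean {
  return fs.existsSync(path.join(storeDir, FLAG_NAME));
}

// Demo against a throwaway directory rather than the real store/.
const demoStore = fs.mkdtempSync(path.join(os.tmpdir(), "wa-flag-"));
markLoggedOut(demoStore);
console.log(isLoggedOut(demoStore)); // true
clearLoggedOut(demoStore);
console.log(isLoggedOut(demoStore)); // false
```

Keeping the clear step idempotent means the open handler can call it unconditionally on every successful connection.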

Test plan

  • pnpm run build (host typecheck) clean.
  • pnpm test — 279/279 host tests pass.
  • Manual: trigger a 401 (e.g. log out the linked device from the phone) → confirm store/whatsapp-logged-out.txt appears + WARN logged.
  • Manual: re-pair → confirm flag file is removed on connection === 'open'.
  • Manual: trigger 3+ decrypt failures in 60s → confirm new operator DM text references the correct session filename pattern.

🤖 Generated with Claude Code

gavrielc and others added 30 commits March 8, 2026 22:59
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	container/agent-runner/src/index.ts
#	src/container-runner.ts
Syncs with upstream main (on schedule, dispatch, or push), then
merges main into all skill/* branches with build+test validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dwudwu and others added 26 commits April 30, 2026 17:54
- Remove reply quote duplication from formatted prompts (reply_to ID
  attribute is sufficient since the original is in conversation history)
- Use compact JSON for webhook payloads instead of pretty-printed
- Gate MCP tool loading on optional `capabilities` array in container.json
  so unused tool schemas aren't sent to the model every turn
- Lower default compaction window from 165K to 120K tokens and make it
  configurable via `compactWindowTokens` in container.json

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	container/agent-runner/src/providers/claude.ts
#	setup/register.ts
SQLite cannot recover a hot journal in readonly mode. When the container
crashes mid-write or the host polls during an active write, the leftover
journal causes "attempt to write a readonly database" on every subsequent
read — blocking all message delivery for that session indefinitely.

Opening read-write lets SQLite transparently replay the journal on open.
The single-writer-per-file invariant is maintained by convention.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy adapters from upstream channels branch and wire self-registration
imports. Deploy machine needs `pnpm install` + `pnpm run build`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy adapter from upstream channels branch and wire self-registration import.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pre-task session clear (clearBeforeTask) resets the SDK transcript before
scheduled tasks so history doesn't grow unboundedly. Cross-day context
comes from files, not transcript replay. Configurable tool allowlist
(allowedTools) lets per-group container.json trim unused SDK tools.
Instruction fragment composition now respects the capabilities array,
skipping fragments for disabled MCP modules.

Estimated 83% input token reduction over 7 days for the daily news
gather + follow-up use case.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Single-agent install — no reason to accumulate cross-day task history
by default. Opt out with "clearBeforeTask": false in container.json.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… 100k compact window)

No container.json edits needed on deploy machine — sensible defaults
baked into code. Override with explicit values in container.json.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nSession

The delivery poll had several code paths that returned silently with no
logging — inflight skip, missing agent group, and DB open failures. This
made it impossible to diagnose why messages weren't being delivered.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…debug)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only WhatsApp is needed. The iMessage adapter was hanging during init
(websocket reconnection loop) and preventing the host from reaching
delivery poll startup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds the data layer for the memo/knowledge-base feature:
- Migration: memos table + memos_fts FTS5 virtual table + sync triggers
- Types: Memo and MemoSearchResult interfaces
- CRUD module: insert, get, update, delete, searchMemos (FTS5 with bm25
  ranking, query sanitization, LIKE fallback), listMemos (tag filter, pagination)
- Tests: 16 cases covering CRUD, FTS5 search, edge cases, tenant isolation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Memos with agent_group_id = NULL are visible to all agent groups.
Queries use (agent_group_id = ? OR agent_group_id IS NULL) so each
group sees its own memos plus global ones.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Moves memos from central DB to a standalone data/memos.db (DELETE journal
mode) for cross-mount RO access. Host writes via delivery action handlers;
containers read directly. Adds memo_save, memo_search, memo_list,
memo_delete MCP tools that always load (no capability flag).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Barrel was already stripped in ce449a0; these files were dead. Channel
adapters are skill-installed from the `channels` branch so trunk
shouldn't carry stale copies.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
If a session directory gets deleted out from under an active session
row, the delivery poller hits openInboundDb at ~1Hz forever, either
crashing on a missing parent dir or silently creating an empty
unmigrated DB ("no such table: messages_in" cascade — 2026-05-06).

- delivery.ts: guard drainSession with an inboundDbPath existence check
  and demote the session via closeOrphanSession when the dir is gone.
- session-manager.ts: add closeOrphanSession + reconcileOrphanSessions
  (startup pass over getActiveSessions).
- index.ts: run the startup reconcile after migrations, before the
  pollers start.
- delivery.test.ts: cover both the per-tick guard and the startup pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The MCP server is spawned as a stdio child of the Claude Agent SDK,
which swallows the child's stderr — so console.error never reaches
docker logs. Mirror each call/done/fail line into
/workspace/.mcp-tool-calls.log so it surfaces on the host at
data/v2-sessions/<ag>/<session>/.mcp-tool-calls.log for debugging.

Args are previewed (120-char cap) to keep entries terse and avoid
dumping memo bodies or scheduled-task prompts into the log.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
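
The argument preview described in this commit could look something like the sketch below. The function name previewArgs, the "..." truncation marker, and the "[unserializable]" fallback are assumptions for illustration; only the 120-character cap comes from the commit message.

```typescript
// Hypothetical helper: renders tool-call arguments as a short, single-line preview.
function previewArgs(args: unknown, cap: number = 120): string {
  let text: string;
  try {
    // JSON.stringify returns undefined for inputs such as undefined or functions.
    const json: string | undefined = JSON.stringify(args);
    text = json === undefined ? String(args) : json;
  } catch {
    // Circular structures and BigInt values make JSON.stringify throw.
    text = "[unserializable]";
  }
  // Truncate long payloads (memo bodies, task prompts) so log entries stay terse.
  return text.length <= cap ? text : text.slice(0, cap) + "...";
}

console.log(previewArgs({ query: "news", limit: 5 })); // {"query":"news","limit":5}
console.log(previewArgs({ body: "x".repeat(1000) }).length); // 123
```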
No behavior change — collapse multi-line .prepare() / .filter().map()
calls onto single lines and rewrap long template literals to match
prettier's printWidth.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Shared script fetches configured news sources in parallel, strips HTML,
and returns structured text via the scriptOutput protocol. Saves ~60k
tokens per scheduled news task by eliminating WebSearch/WebFetch tool
calls from the transcript. Mount added at /app/scripts/ (read-only).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Worker now parses RSS/Atom feeds into structured items
({title, link, pubDate, summary}) and filters to a recency window
(default 24h, configurable via news-sources.json windowHours).

Output shape moves from a stripped-HTML blob per source to an array
of dated items, so the agent can dedupe by topic and respect the
recency rule without re-deriving timestamps from page text.

Non-feed URLs now return `error: "not RSS/Atom"` instead of being
silently rendered as HTML chunks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
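
As a rough illustration of the recency window described in this commit, the filter below keeps only items published within the last windowHours. The FeedItem field names follow the commit message. The function name filterRecent, the choice to drop items with unparseable dates, and the string type for pubDate are assumptions.

```typescript
// Item shape taken from the commit message: {title, link, pubDate, summary}.
interface FeedItem {
  title: string;
  link: string;
  pubDate: string;
  summary: string;
}

// Hypothetical helper: keeps items no older than windowHours relative to `now`.
function filterRecent(items: FeedItem[], windowHours: number = 24, now: Date = new Date()): FeedItem[] {
  const cutoff = now.getTime() - windowHours * 3_600_000;
  return items.filter((item) => {
    const ts = Date.parse(item.pubDate);
    // Assumption: an item whose date cannot be parsed is treated as stale.
    return !Number.isNaN(ts) && ts >= cutoff;
  });
}

const demoNow = new Date("2026-05-10T12:00:00Z");
const demoItems: FeedItem[] = [
  { title: "fresh", link: "https://example.com/a", pubDate: "2026-05-10T06:00:00Z", summary: "" },
  { title: "stale", link: "https://example.com/b", pubDate: "2026-05-08T06:00:00Z", summary: "" },
];
console.log(filterRecent(demoItems, 24, demoNow).map((i) => i.title)); // [ 'fresh' ]
```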
…rver

Adds a Bun-spawned end-to-end test for the pre-task news fetcher:
RSS/Atom parsing, recency window, non-feed and HTTP-error reporting,
and total-budget trimming. Uses a temp Bun.serve fixture instead of
hitting real feeds.

To make the worker addressable from tests without writing into
/workspace/agent/, CONFIG_PATH now honors a NEWS_SOURCES_PATH env
override (default unchanged).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tests the scheduled-task pipeline at the agent-runner boundary:
pass-through for non-task / scriptless messages, scriptOutput injection
on wakeAgent=true, and skip on wakeAgent=false / non-zero exit /
non-JSON last line / empty stdout.

Final case wires the real news-fetch worker into a task script against
a local Bun.serve RSS fixture, asserting parsed items land in the
enriched message content. No mocks — execFile, fetch, and parser all
run their production code.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
libsignal Bad MAC errors after socket reconnects were dropping messages
silently — the host only logged them to error.log. Track per-sender
decrypt-stub messages in a sliding window; on 3-in-60s, DM the operator
on their phone JID (different Signal session, stays viable), then
cool down for 10 min.
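
The 3-in-60s rule with a 10-minute cooldown can be sketched as a small sliding-window tracker. This is not the tracker from the repository: the class name, method name, and the decision to apply the cooldown per sender are assumptions. Only the threshold, window, and cooldown values come from the commit message.

```typescript
// Hypothetical sliding-window tracker; timestamps are passed in so it is easy to test.
class DecryptFailureTracker {
  private readonly failures = new Map<string, number[]>();
  private readonly cooldownUntil = new Map<string, number>();
  private readonly threshold: number;
  private readonly windowMs: number;
  private readonly cooldownMs: number;

  constructor(threshold: number = 3, windowMs: number = 60_000, cooldownMs: number = 600_000) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.cooldownMs = cooldownMs;
  }

  // Records one failure for `sender`; returns true when the operator should be alerted.
  record(sender: string, nowMs: number): boolean {
    // Keep only failures that are still inside the sliding window, then add this one.
    const recent = (this.failures.get(sender) ?? []).filter((t) => nowMs - t < this.windowMs);
    recent.push(nowMs);
    this.failures.set(sender, recent);

    // Assumption: the cooldown is tracked per sender.
    if (nowMs < (this.cooldownUntil.get(sender) ?? 0)) return false;
    if (recent.length < this.threshold) return false;

    this.cooldownUntil.set(sender, nowMs + this.cooldownMs);
    this.failures.set(sender, []);
    return true;
  }
}

const tracker = new DecryptFailureTracker();
console.log(tracker.record("sender-a", 0)); // false
console.log(tracker.record("sender-a", 10_000)); // false
console.log(tracker.record("sender-a", 20_000)); // true
console.log(tracker.record("sender-a", 30_000)); // false (cooling down)
```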
… logout

The decrypt-failure operator alert pointed at `launchctl kickstart`, which
doesn't recover Signal session corruption — the bad sessions live on disk
in store/auth/. Rewrite the alert to describe the actual self-heal path
and, if drops persist, the specific session file to remove.

For 401 (DisconnectReason.loggedOut) closes, promote the log from INFO to
WARN with explicit re-pair instructions, and write a flag file at
store/whatsapp-logged-out.txt so operator tooling can detect the
dead-credential state without scraping logs. Mirror the existing
pairing-code.txt pattern; clear the flag on successful reconnect.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dwudwu and others added 2 commits May 15, 2026 10:50
The host's host-sweep kills stale containers with `docker stop`, but the
agent-runner had no SIGTERM handler — so Bun exited without closing
bun:sqlite. On macOS Docker's gRPC-FUSE bind mount, the truncated write
plus unreleased file lock left outbound.db in a state where the next
container saw "attempt to write a readonly database" and the host saw
"disk I/O error" on PRAGMA, looping for days.

- Add closeSessionDbs() and wire it to SIGTERM/SIGINT in agent-runner.
- Bump `docker stop -t 1` → `-t 10` so SQLite has time to flush before
  SIGKILL escalates.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
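
A minimal sketch of the shutdown wiring this commit describes is shown below. The name closeSessionDbs is taken from the commit message, but the body here is a guess at the shape, not the actual agent-runner code. The Closable interface, the Set-based registry, and installShutdownHandlers are all hypothetical.

```typescript
// Anything with a close() method, for example a bun:sqlite Database handle.
interface Closable {
  close(): void;
}

// Closes every registered handle and returns how many closed cleanly.
function closeSessionDbs(dbs: Set<Closable>): number {
  let closed = 0;
  dbs.forEach((db) => {
    try {
      db.close();
      closed++;
    } catch {
      // One failing close must not prevent the remaining handles from closing.
    }
  });
  dbs.clear();
  return closed;
}

// Hypothetical wiring: close databases before exiting on SIGTERM or SIGINT.
function installShutdownHandlers(dbs: Set<Closable>): void {
  for (const signal of ["SIGTERM", "SIGINT"] as const) {
    process.once(signal, () => {
      closeSessionDbs(dbs);
      process.exit(0);
    });
  }
}
```

The handler only helps if the process gets enough time to run it, which is why the same commit raises the docker stop timeout from 1 to 10 seconds.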
The LID→phone JID translation cache was in-memory only. After a host
restart, neither the cache nor Baileys' signalRepository know the mapping,
so an inbound DM arriving with a LID remoteJid never resolves back to the
phone JID stored in messaging_groups. The router then misses the group
lookup and silently exits at the "no mg + not a mention" path.

Persist the map to store/lid-map.json: load on adapter init, write on
every setLidPhoneMapping. Also promote two previously-debug-only router
drop logs to info/warn so the same class of silent drop is visible next
time.

Lockfile bump folds in the discord and imessage chat adapters that were
already installed locally via skills.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
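
The load-on-init, write-on-every-set behavior described for store/lid-map.json can be sketched as follows. The class name LidMapStore, the flat JSON object layout, the write-to-temp-then-rename step, and the example JIDs are assumptions for illustration; the commit message only states that the map is loaded on adapter init and written on every setLidPhoneMapping.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical persistent LID -> phone JID map backed by a single JSON file.
class LidMapStore {
  private readonly map = new Map<string, string>();
  private readonly filePath: string;

  constructor(filePath: string) {
    this.filePath = filePath;
    if (!fs.existsSync(filePath)) return;
    try {
      const raw: unknown = JSON.parse(fs.readFileSync(filePath, "utf8"));
      if (raw !== null && typeof raw === "object") {
        for (const [lid, phone] of Object.entries(raw as Record<string, unknown>)) {
          if (typeof phone === "string") this.map.set(lid, phone);
        }
      }
    } catch {
      // Assumption: a corrupt file is ignored and the cache starts empty.
    }
  }

  // Updates the in-memory map and rewrites the file on every call.
  set(lid: string, phoneJid: string): void {
    this.map.set(lid, phoneJid);
    const tmpPath = `${this.filePath}.tmp`;
    fs.writeFileSync(tmpPath, JSON.stringify(Object.fromEntries(this.map)), "utf8");
    // Rename so a crash mid-write does not leave a truncated map file behind.
    fs.renameSync(tmpPath, this.filePath);
  }

  get(lid: string): string | undefined {
    return this.map.get(lid);
  }
}

// Demo: a second instance simulates a host restart and still resolves the LID.
const demoFile = path.join(fs.mkdtempSync(path.join(os.tmpdir(), "lid-map-")), "lid-map.json");
new LidMapStore(demoFile).set("12345@lid", "15551234567@s.whatsapp.net");
console.log(new LidMapStore(demoFile).get("12345@lid")); // 15551234567@s.whatsapp.net
```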

4 participants