fix(whatsapp): correct recovery guidance for decrypt failures and 401 logout #2469
Open
dwudwu wants to merge 127 commits into
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#   container/agent-runner/src/index.ts
#   src/container-runner.ts
Syncs with upstream main (on schedule, dispatch, or push), then merges main into all skill/* branches with build+test validation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove reply quote duplication from formatted prompts (the reply_to ID attribute is sufficient since the original is in conversation history)
- Use compact JSON for webhook payloads instead of pretty-printed
- Gate MCP tool loading on an optional `capabilities` array in container.json so unused tool schemas aren't sent to the model every turn
- Lower the default compaction window from 165K to 120K tokens and make it configurable via `compactWindowTokens` in container.json
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#   container/agent-runner/src/providers/claude.ts
#   setup/register.ts
SQLite cannot recover a hot journal in readonly mode. When the container crashes mid-write or the host polls during an active write, the leftover journal causes "attempt to write a readonly database" on every subsequent read — blocking all message delivery for that session indefinitely. Opening read-write lets SQLite transparently replay the journal on open. The single-writer-per-file invariant is maintained by convention. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy adapters from upstream channels branch and wire self-registration imports. Deploy machine needs `pnpm install` + `pnpm run build`. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy adapter from upstream channels branch and wire self-registration import. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pre-task session clear (clearBeforeTask) resets the SDK transcript before scheduled tasks so history doesn't grow unboundedly. Cross-day context comes from files, not transcript replay. Configurable tool allowlist (allowedTools) lets per-group container.json trim unused SDK tools. Instruction fragment composition now respects the capabilities array, skipping fragments for disabled MCP modules. Estimated 83% input token reduction over 7 days for the daily news gather + follow-up use case. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Single-agent install — no reason to accumulate cross-day task history by default. Opt out with "clearBeforeTask": false in container.json. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… 100k compact window) No container.json edits needed on deploy machine — sensible defaults baked into code. Override with explicit values in container.json. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nSession The delivery poll had several code paths that returned silently with no logging — inflight skip, missing agent group, and DB open failures. This made it impossible to diagnose why messages weren't being delivered. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…debug) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only WhatsApp is needed. The iMessage adapter was hanging during init (websocket reconnection loop) and preventing the host from reaching delivery poll startup. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds the data layer for the memo/knowledge-base feature:
- Migration: memos table + memos_fts FTS5 virtual table + sync triggers
- Types: Memo and MemoSearchResult interfaces
- CRUD module: insert, get, update, delete, searchMemos (FTS5 with bm25 ranking, query sanitization, LIKE fallback), listMemos (tag filter, pagination)
- Tests: 16 cases covering CRUD, FTS5 search, edge cases, tenant isolation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
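The commit mentions query sanitization for FTS5 without saying how it is done. One common approach, sketched here as an assumption rather than the actual implementation, is to quote every whitespace-separated term so FTS5 reads it as a string literal instead of query syntax. `sanitizeFtsQuery` is a hypothetical name.

```typescript
// Hypothetical helper: not taken from the PR, which only says "query sanitization".
// Each term is wrapped in double quotes, and embedded double quotes are doubled,
// which is how FTS5 escapes a quote inside a string.
function sanitizeFtsQuery(raw: string): string {
  return raw
    .split(/\s+/)
    .filter((term) => term.length > 0)
    .map((term) => `"${term.replace(/"/g, '""')}"`)
    .join(" ");
}

// An empty result means there is nothing to MATCH on; a caller could use
// that as the signal to take the LIKE fallback path instead.
```

Quoting turns operators such as `AND` or a stray `"` into plain search terms, so user input cannot produce an FTS5 syntax error.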
Memos with agent_group_id = NULL are visible to all agent groups. Queries use (agent_group_id = ? OR agent_group_id IS NULL) so each group sees its own memos plus global ones. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
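The visibility rule can be illustrated in memory. This is a sketch only: the `MemoRow` shape and `visibleMemos` are assumptions, and the real code applies the rule in SQL. The explicit `IS NULL` branch matters because in SQL `agent_group_id = ?` never matches a NULL value.

```typescript
// Assumed row shape for illustration; the real Memo interface is not shown in the PR.
interface MemoRow {
  id: number;
  agent_group_id: string | null; // null marks a global memo
  body: string;
}

// In-memory equivalent of: WHERE (agent_group_id = ? OR agent_group_id IS NULL)
function visibleMemos(rows: MemoRow[], agentGroupId: string): MemoRow[] {
  return rows.filter(
    (row) => row.agent_group_id === agentGroupId || row.agent_group_id === null,
  );
}
```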
Moves memos from central DB to a standalone data/memos.db (DELETE journal mode) for cross-mount RO access. Host writes via delivery action handlers; containers read directly. Adds memo_save, memo_search, memo_list, memo_delete MCP tools that always load (no capability flag). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Barrel was already stripped in ce449a0; these files were dead. Channel adapters are skill-installed from the `channels` branch so trunk shouldn't carry stale copies. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
If a session directory gets deleted out from under an active session
row, the delivery poller hits openInboundDb at ~1Hz forever, either
crashing on a missing parent dir or silently creating an empty
unmigrated DB ("no such table: messages_in" cascade — 2026-05-06).
- delivery.ts: guard drainSession with an inboundDbPath existence check
and demote the session via closeOrphanSession when the dir is gone.
- session-manager.ts: add closeOrphanSession + reconcileOrphanSessions
(startup pass over getActiveSessions).
- index.ts: run the startup reconcile after migrations, before the
pollers start.
- delivery.test.ts: cover both the per-tick guard and the startup pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The MCP server is spawned as a stdio child of the Claude Agent SDK, which swallows the child's stderr — so console.error never reaches docker logs. Mirror each call/done/fail line into /workspace/.mcp-tool-calls.log so it surfaces on the host at data/v2-sessions/<ag>/<session>/.mcp-tool-calls.log for debugging. Args are previewed (120-char cap) to keep entries terse and avoid dumping memo bodies or scheduled-task prompts into the log. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
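A minimal sketch of the 120-character argument preview described above. The function name, the trailing ellipsis, and the handling of unserializable values are assumptions; only the cap itself comes from the commit message.

```typescript
const PREVIEW_CAP = 120; // cap stated in the commit message

// Hypothetical helper: serializes tool-call args and truncates the result so
// that long memo bodies or task prompts never land in the log in full.
function previewArgs(args: unknown, cap: number = PREVIEW_CAP): string {
  let text: string;
  try {
    // JSON.stringify returns undefined for inputs such as `undefined` itself.
    const json: string | undefined = JSON.stringify(args);
    text = json ?? "undefined";
  } catch {
    // Circular structures and BigInt values make JSON.stringify throw.
    text = "[unserializable]";
  }
  return text.length <= cap ? text : text.slice(0, cap) + "…";
}
```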
No behavior change — collapse multi-line .prepare() / .filter().map() calls onto single lines and rewrap long template literals to match prettier's printWidth. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Shared script fetches configured news sources in parallel, strips HTML, and returns structured text via the scriptOutput protocol. Saves ~60k tokens per scheduled news task by eliminating WebSearch/WebFetch tool calls from the transcript. Mount added at /app/scripts/ (read-only). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Worker now parses RSS/Atom feeds into structured items
({title, link, pubDate, summary}) and filters to a recency window
(default 24h, configurable via news-sources.json windowHours).
Output shape moves from a stripped-HTML blob per source to an array
of dated items, so the agent can dedupe by topic and respect the
recency rule without re-deriving timestamps from page text.
Non-feed URLs now return `error: "not RSS/Atom"` instead of being
silently rendered as HTML chunks.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
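The recency filter from this commit might look roughly like the sketch below. The item fields follow the shape named in the commit message; `withinWindow` and the choice to drop items with unparseable dates are assumptions.

```typescript
// Item shape as listed in the commit message.
interface FeedItem {
  title: string;
  link: string;
  pubDate: string;
  summary: string;
}

// Hypothetical helper: keeps items published within the last `windowHours`.
// Items whose pubDate cannot be parsed are dropped (an assumption, since the
// commit does not say how such items are treated).
function withinWindow(
  items: FeedItem[],
  windowHours: number = 24,
  now: Date = new Date(),
): FeedItem[] {
  const cutoff = now.getTime() - windowHours * 60 * 60 * 1000;
  return items.filter((item) => {
    const published = Date.parse(item.pubDate);
    return !Number.isNaN(published) && published >= cutoff;
  });
}
```

Passing `now` explicitly keeps the filter deterministic in tests, in the same spirit as the fixture-based tests described in the later commits.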
…rver Adds a Bun-spawned end-to-end test for the pre-task news fetcher: RSS/Atom parsing, recency window, non-feed and HTTP-error reporting, and total-budget trimming. Uses a temp Bun.serve fixture instead of hitting real feeds. To make the worker addressable from tests without writing into /workspace/agent/, CONFIG_PATH now honors a NEWS_SOURCES_PATH env override (default unchanged). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tests the scheduled-task pipeline at the agent-runner boundary: pass-through for non-task / scriptless messages, scriptOutput injection on wakeAgent=true, and skip on wakeAgent=false / non-zero exit / non-JSON last line / empty stdout. Final case wires the real news-fetch worker into a task script against a local Bun.serve RSS fixture, asserting parsed items land in the enriched message content. No mocks — execFile, fetch, and parser all run their production code. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
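The skip conditions listed above can be summarized in one decision function. This is a sketch under assumptions: `interpretScriptResult`, the `ScriptDecision` type, and the idea of passing the whole parsed object through as the script output are illustrative, and only the `wakeAgent` flag and the four skip cases come from the commit message.

```typescript
type ScriptDecision =
  | { wake: true; scriptOutput: Record<string, unknown> }
  | { wake: false; reason: string };

// Hypothetical helper: decides whether a pre-task script's result should be
// injected into the message, based on exit code and the last stdout line.
function interpretScriptResult(exitCode: number, stdout: string): ScriptDecision {
  if (exitCode !== 0) return { wake: false, reason: "non-zero exit" };

  const lines = stdout
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  const last = lines.pop();
  if (last === undefined) return { wake: false, reason: "empty stdout" };

  let parsed: unknown;
  try {
    parsed = JSON.parse(last);
  } catch {
    return { wake: false, reason: "non-JSON last line" };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { wake: false, reason: "last line is not a JSON object" };
  }

  const payload = parsed as Record<string, unknown>;
  if (payload["wakeAgent"] !== true) return { wake: false, reason: "wakeAgent is not true" };
  return { wake: true, scriptOutput: payload };
}
```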
libsignal Bad MAC errors after socket reconnects were dropping messages silently — the host only logged them to error.log. Track per-sender decrypt-stub messages in a sliding window; on 3-in-60s, DM the operator on their phone JID (different Signal session, stays viable), then cool down for 10 min.
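The 3-in-60s rule with a 10-minute cooldown can be sketched as a small sliding-window tracker. The class name and API are hypothetical, and treating the cooldown as per-sender is an assumption; the commit message does not say whether it is per-sender or global.

```typescript
// Hypothetical sketch of the alerting rule; not the code from the PR.
class DecryptFailureTracker {
  private readonly hits = new Map<string, number[]>();
  private readonly cooldownUntil = new Map<string, number>();
  private readonly threshold: number;
  private readonly windowMs: number;
  private readonly cooldownMs: number;

  constructor(threshold = 3, windowMs = 60_000, cooldownMs = 10 * 60_000) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.cooldownMs = cooldownMs;
  }

  // Records one decrypt failure and returns true when the operator should be alerted.
  record(senderJid: string, now: number = Date.now()): boolean {
    // Keep only failures that are still inside the sliding window.
    const recent = (this.hits.get(senderJid) ?? []).filter((t) => now - t < this.windowMs);
    recent.push(now);
    this.hits.set(senderJid, recent);

    if (recent.length < this.threshold) return false;
    if (now < (this.cooldownUntil.get(senderJid) ?? 0)) return false;

    // Alert once, then stay quiet for this sender until the cooldown expires.
    this.cooldownUntil.set(senderJid, now + this.cooldownMs);
    this.hits.set(senderJid, []);
    return true;
  }
}
```

Taking `now` as a parameter lets tests drive the clock directly instead of sleeping.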
… logout
The decrypt-failure operator alert pointed at `launchctl kickstart`, which doesn't recover Signal session corruption: the bad sessions live on disk in store/auth/. Rewrite the alert to describe the actual self-heal path and, if drops persist, the specific session file to remove.
For 401 (DisconnectReason.loggedOut) closes, promote the log from INFO to WARN with explicit re-pair instructions, and write a flag file at store/whatsapp-logged-out.txt so operator tooling can detect the dead-credential state without scraping logs. Mirror the existing pairing-code.txt pattern; clear the flag on successful reconnect.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
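The flag-file lifecycle (write on a 401 close, clear on reconnect) might be sketched as follows. The event shape here is deliberately simplified and hypothetical; the adapter's real connection handler receives a richer update object. `FlagStore` is an assumed abstraction so the sketch runs without touching disk. In practice it could be backed by `fs.writeFileSync` and `fs.rmSync` with `{ force: true }`.

```typescript
// Assumed abstraction over the filesystem, for illustration only.
interface FlagStore {
  write(path: string, text: string): void;
  remove(path: string): void; // must not throw when the file is absent
}

const LOGGED_OUT_FILE = "store/whatsapp-logged-out.txt"; // path named in the PR
const LOGGED_OUT_STATUS = 401; // numeric value of DisconnectReason.loggedOut

// Simplified, hypothetical event shape.
type ConnectionEvent =
  | { connection: "open" }
  | { connection: "close"; statusCode: number };

function handleConnectionUpdate(
  event: ConnectionEvent,
  store: FlagStore,
  warn: (message: string) => void,
): void {
  if (event.connection === "open") {
    // Successful reconnect: the credentials work, so clear any stale flag.
    store.remove(LOGGED_OUT_FILE);
    return;
  }
  if (event.statusCode === LOGGED_OUT_STATUS) {
    warn("WhatsApp logged out (401): credentials are dead, re-pair the device");
    store.write(LOGGED_OUT_FILE, `logged out at ${new Date().toISOString()}\n`);
  }
}
```

Operator tooling then only needs to test for the file's existence, which is the point of mirroring the pairing-code.txt pattern.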
The host's host-sweep kills stale containers with `docker stop`, but the agent-runner had no SIGTERM handler, so Bun exited without closing bun:sqlite. On macOS Docker's gRPC-FUSE bind mount, the truncated write plus unreleased file lock left outbound.db in a state where the next container saw "attempt to write a readonly database" and the host saw "disk I/O error" on PRAGMA, looping for days.
- Add closeSessionDbs() and wire it to SIGTERM/SIGINT in agent-runner.
- Bump `docker stop -t 1` → `-t 10` so SQLite has time to flush before SIGKILL escalates.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The LID→phone JID translation cache was in-memory only. After a host restart, neither the cache nor Baileys' signalRepository know the mapping, so an inbound DM arriving with a LID remoteJid never resolves back to the phone JID stored in messaging_groups. The router then misses the group lookup and silently exits at the "no mg + not a mention" path. Persist the map to store/lid-map.json: load on adapter init, write on every setLidPhoneMapping. Also promote two previously-debug-only router drop logs to info/warn so the same class of silent drop is visible next time. Lockfile bump folds in the discord and imessage chat adapters that were already installed locally via skills. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
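A minimal sketch of the load-on-init, write-on-every-set pattern described above. The `TextFile` abstraction and the `LidPhoneMap` class are assumptions made so the example runs without disk access; the real adapter reads and writes store/lid-map.json. Starting empty on a corrupt file is also an assumption.

```typescript
// Assumed abstraction over a single file, for illustration only.
interface TextFile {
  read(): string | null; // null when the file does not exist yet
  write(text: string): void;
}

class LidPhoneMap {
  private readonly map = new Map<string, string>();
  private readonly file: TextFile;

  constructor(file: TextFile) {
    this.file = file;
    const raw = file.read();
    if (raw === null) return;
    try {
      const parsed: unknown = JSON.parse(raw);
      if (typeof parsed === "object" && parsed !== null) {
        for (const [lid, phone] of Object.entries(parsed)) {
          if (typeof phone === "string") this.map.set(lid, phone);
        }
      }
    } catch {
      // Corrupt file: start empty rather than fail adapter init.
    }
  }

  // Persists on every update so a host restart does not lose the mapping.
  set(lidJid: string, phoneJid: string): void {
    this.map.set(lidJid, phoneJid);
    this.file.write(JSON.stringify(Object.fromEntries(this.map)));
  }

  get(lidJid: string): string | undefined {
    return this.map.get(lidJid);
  }
}
```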
Summary
Two small fixes to the WhatsApp adapter's failure-mode operator UX:
- Decrypt-failure operator alert. The alert pointed at `launchctl kickstart` for NanoClaw, which doesn't recover Signal session corruption: the bad sessions live on disk in `store/auth/`. Replace with accurate guidance: the session usually self-heals on the next inbound, and if it doesn't, point at the specific `store/auth/session-<userPart>*.json` file to scrub.
- `DisconnectReason.loggedOut` handler. Today this only emits `log.info('WhatsApp logged out')` and goes silent, which is easy to miss until messages have been dropping for hours. Promote to `log.warn` with explicit re-pair instructions, and write a flag file at `store/whatsapp-logged-out.txt` so operator tooling (status bar app, monitoring scripts) can detect the dead-credential state without scraping logs. Mirrors the existing `pairing-code.txt` pattern; cleared on successful reconnect.

Context
While diagnosing a WhatsApp test message I found this install had hit 5× 401 logouts and 3,190 raw `Bad MAC` log lines over the lifetime of the log file. Investigation showed the existing `whatsapp-decrypt-tracker` (#7d3fce8) correctly catches the ~8 real dropped messages; the 3,190:8 ratio is libsignal's per-session-attempt error logging, most of which is benign (a later session in the multi-session decrypt succeeds). But two follow-on issues remained:
- The decrypt-failure alert gave the wrong recovery step (`launchctl kickstart` doesn't fix anything that lives in `store/auth/`).
- The 401 logout was logged only at INFO and was easy to miss.

Changes
`src/channels/whatsapp.ts`:
- Add `loggedOutFile` constant alongside `pairingCodeFile`.
- Rewrite `handleDecryptFailure` operator-DM text.
- On 401 logout: `log.warn` (was `log.info`) + write the flag file.
- On `connection === 'open'`: also clear the flag file.

Test plan
- `pnpm run build` (host typecheck) clean.
- `pnpm test`: 279/279 host tests pass.
- On 401 logout, `store/whatsapp-logged-out.txt` appears + WARN logged.
- Flag file is cleared on `connection === 'open'`.

🤖 Generated with Claude Code