[pull] main from danny-avila:main#77
Merged
pull[bot] merged 5 commits into innFactory:main from danny-avila:main on May 6, 2026
…ges (#141)

* 🧵 feat: Recency-window splitter for compaction boundaries

Introduces splitAtRecencyBoundary, a pure helper that partitions a
message list into a head (older messages, eligible for summarization)
and a tail (recent messages, preserved verbatim). Cuts strictly at
HumanMessage boundaries so tool_use ↔ tool_result pairs are never
split. The most recent user-led turn is always included in the tail,
so a single oversized first message is never destroyed by
summarization. Optional token cap (`tokens`) allows walk-back;
tokenless usage falls back to turn-count only (`turns`).

13 unit tests cover default behavior, the disabled (turns: 0) path,
the token cap, and degenerate inputs (no HumanMessage, trailing tool
results, single message). The splitter is wired into the summarize
node in a follow-up commit.

* 🛡️ feat: Preserve recent turns during summarization

The summarize node previously summarized every message and replaced
them all with the LLM-generated checkpoint. This destroyed the user's
most recent payload whenever pruning fired on a single big turn — see
librechat#12940 ("Skip Summarization of First Turn"), where a
30K-token paste on turn 1 was silently replaced by a generic summary.

Adds SummarizationConfig.retainRecent ({ turns?, tokens? }) with a
default of 2 turns. Before invoking the LLM, the node splits the
state at a recency boundary; only the head is summarized and the tail
is re-injected via the messages reducer:

    return { messages: [createRemoveAllMessage(), ...messagesToRetain] }

When the head is empty (e.g. fresh conversation with one big user
message), the LLM call is skipped entirely, the trigger is marked
consumed via markSummarizationTriggered, and no ON_SUMMARIZE_* events
are emitted — the user's payload survives intact. Setting
`retainRecent: { turns: 0 }` reverts to legacy behavior for callers
who want the old "summarize everything" semantics.

Test updates:
- node.test.ts default agent context now uses { turns: 0 } so the 17
  existing LLM-call tests continue exercising that path.
- Four new node-level tests cover first-turn skip, single-turn skip
  including tool messages, head/tail summarization, and the legacy
  shape.

* 🧪 chore: Live multi-provider script for recency-window verification

Adds src/scripts/summarization-recency.ts that exercises both
recency-window scenarios end-to-end against real provider APIs:
1. First-turn protection — a single oversized user message must not
   trigger ON_SUMMARIZE_START/COMPLETE. An empty_messages pruning
   error is treated as expected when the message exceeds the
   configured budget; we only fail if summarization fired.
2. Multi-turn compaction — across four padded turns, summarization
   must fire (proving the LLM path still works) while the recency
   tail is preserved verbatim.

Verified across anthropic (claude-sonnet-4-5), openAI (gpt-5.4-mini),
google (gemini-2.5-flash), bedrock (us.anthropic.claude-sonnet-4-5),
openrouter (moonshotai/kimi-k2.6), and deepseek (deepseek-v4-flash).

Run with the dotenv preload + override flag so AWS SDK module-init
captures populated env vars:

    DOTENV_CONFIG_OVERRIDE=true node -r dotenv/config \
      --loader ./tsconfig-paths-bootstrap.mjs \
      --experimental-specifier-resolution=node \
      ./src/scripts/summarization-recency.ts --provider all

The script auto-skips providers whose credentials are absent, and
defaults BEDROCK_AWS_REGION to AWS_DEFAULT_REGION (or us-east-1) when
the dedicated knob is missing.

* 🪡 fix: PostCompact reports retained-tail length, not always zero

The PostCompact hook payload's `messagesAfterCount` was hard-coded to
`0`, which was correct under the legacy "remove-all only" return
shape but stale once the recency window started preserving a tail
([removeAll, ...messagesToRetain]). Threads the retained tail length
through dispatchCompletionEvents so observers (metrics, cleanup
logic) see the true post-compaction message count.

When retainRecent.turns is 0, the value is still 0; when
retainRecent.turns > 0 and a tail survives, it reflects the
preserved-message count.

compactHooks.test.ts:
- Existing test now opts into { turns: 0 } via a helper-default arg
  to keep documenting the legacy shape.
- New test asserts messagesAfterCount === 4 under the default
  retainRecent.turns: 2 with a 20-turn synthetic conversation (last
  2 turns × 2 messages each = 4 retained).

Reported by chatgpt-codex-connector on PR #141.

* 🪡 fix: Keep masked tool payloads in retained recency tail

The retained tail was sliced from `restoredMessages`, which has
oversized ToolMessage content rehydrated by
`restoreOriginalToolContent` for the summarizer's benefit.
Reinjecting that restored content into state defeats the masking that
prune applied earlier in the turn, bloats checkpoint size, and forces
more expensive re-pruning on later turns.

The summarizer still reads the restored head (full content for
summary quality), but the tail now slices `state.messages` (masked
form) at the same `tailStartIndex` returned by the splitter. Both
arrays have identical length and structure since
`restoreOriginalToolContent` only swaps content at specific indices,
so the same boundary applies to both.

Adds a node-level test that exercises the divergence: the summarizer
sees the restored head content (`FULL_ORIGINAL_OUTPUT...`) while the
returned tail keeps the masked stub (`masked-tail-stub`).

Reported by chatgpt-codex-connector on PR #141.

* 🪡 fix: Carry forward original tool content for retained-tail messages

The previous fix kept the masked tail content in live state but
eagerly cleared `agentContext.pendingOriginalToolContent`. When a
later compaction moves those tail messages into its head, the new
pruner (recreated by `setSummary` clearing `pruneMessages`) has no
record of the original full content — so the summarizer sees only the
masked stub, dropping fidelity for tool-derived facts.

Two changes:
1. summarize/node.ts: capture `pendingOriginalToolContent` locally
   before restoration; on the summarize-fired path, after computing
   `tailStartIndex`, build a reindexed map containing only the
   tail-relevant entries (idx >= tailStartIndex shifted to start at
   0) and store it back on AgentContext. Entries < tailStartIndex
   belonged to the just-summarized head and are correctly dropped.
   On the skip path, the captured value is left untouched — state is
   unchanged so its indices remain valid.
2. graphs/Graph.ts: merge the pruner's masking record into
   `pendingOriginalToolContent` instead of overwriting it. Both maps
   key off current-state indices, so a key-wise union is the right
   semantic. Without this, a fresh pruner that masks new content on a
   later turn would clobber the carry-over before the next summarize
   could read it.

Tests:
- New: `preserves tail-relevant pendingOriginalToolContent entries
  (reindexed) for future summaries` — runs summarize with a map that
  spans both head and tail indices, asserts only tail entries survive
  and are reindexed correctly.
- New: `does not clear pendingOriginalToolContent on the skip path
  (state unchanged)` — confirms the same Map reference survives when
  summarization is skipped, so the next prune cycle still sees the
  originals.

Reported by chatgpt-codex-connector on PR #141.

* 🪡 fix: Cap merged tool-content map and align dedupe baseline with tail

Two issues introduced by the recency-window carry-over plumbing:
1. **Pending-content size cap bypassed by merge**
   `createPruneMessages` enforces ORIGINAL_CONTENT_MAX_CHARS (~2 MB)
   inside its own `originalToolContent` map via the `onContentStored`
   callback, but the prior commit's key-wise union in Graph.ts
   bypassed that accounting. Long sessions with large tool outputs
   could grow `pendingOriginalToolContent` without bound across
   compaction cycles. Fix: extract `enforceOriginalContentCap` from
   prune.ts and re-apply it in Graph.ts after the merge.
2. **Dedupe baseline reset to zero after partial compaction**
   `dispatchCompletionEvents` calls
   `rebuildTokenMapAfterSummarization({})` which sets
   `_lastSummarizationMsgCount = 0` — correct for the legacy
   "remove-all only" shape, but stale once the recency window keeps
   a tail. With baseline at 0, a follow-up prune cycle on the
   unchanged tail re-evaluates the trigger from scratch and can loop
   into another summarize/skip cycle. Fix: explicitly call
   `markSummarizationTriggered(messagesToRetain.length)` after
   dispatch so the baseline matches the surviving tail.

Tests:
- New prune.test.ts cases: `enforceOriginalContentCap` is a no-op
  below the cap, evicts oldest entries (Map insertion order) above
  the cap, and exposes the cap constant.
- New node.test.ts case: after compaction with retainRecent.turns: 1
  on a 2-turn conversation, `shouldSkipSummarization(2)` returns true
  (matches tail) and `shouldSkipSummarization(3)` returns false
  (state grew by 1).

Reported by chatgpt-codex-connector on PR #141.

* 🪡 fix: Make recency-split linearity self-evident + pin via test

The recency-split inner loop's upper bound was the mutated
`tailStartIndex`, which is decreased to the current `turnStart` at
the end of each iteration. The slices are disjoint and the algorithm
is O(messages_in_visited_turns), bounded by O(messages.length) — but
the nested-loop-over-`messages` shape was triggering repeated
quadratic flags in static analysis (Codex P3 → P1).

Refactor the inner upper bound to `turnStarts[t + 1]`, derived from
the immutable `turnStarts` array. The disjoint-range invariant is now
obvious without having to trace `tailStartIndex` mutations across
iterations. Add a comment explaining the strategy.

Also add two tests that pin the runtime call count:
- `calls tokenCounter once per message in visited turns` —
  2,000-message history (200 turns × 10 messages), generous caps so
  every turn is visited. Asserts `tokenCounter` is called exactly
  `messages.length` times. Quadratic behavior would push it to
  ~400,000 calls.
- `stops counting once the tokens cap is exceeded` — 50-turn history
  with a 10-token cap. Asserts call count ≤ 12 (5 included turns + 1
  rejected turn × 2 messages each), confirming the loop
  short-circuits rather than scanning all 100 messages.

Reported by chatgpt-codex-connector on PR #141.
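The splitter's shape can be sketched roughly as follows. This is an illustrative reconstruction, not the actual implementation: the real helper operates on LangChain messages (a minimal `Msg` type stands in here), the signature is assumed, and the handling of a history with no HumanMessage is a guess. Note the inner loop's upper bound is `turnStarts[t + 1]`, the immutable-array form the linearity commit describes:

```typescript
// Hypothetical sketch of splitAtRecencyBoundary (names and types
// simplified; not the engine's actual code).
type Msg = { role: "human" | "ai" | "tool"; text: string };

interface Split { head: Msg[]; tail: Msg[]; tailStartIndex: number }

function splitAtRecencyBoundary(
  messages: Msg[],
  retain: { turns: number; tokens?: number },
  countTokens: (m: Msg) => number = (m) => m.text.length,
): Split {
  if (retain.turns <= 0) {
    // Legacy behavior: everything is eligible for summarization.
    return { head: messages.slice(), tail: [], tailStartIndex: messages.length };
  }
  // Turn starts = indices of HumanMessages. Cutting only here keeps
  // tool_use/tool_result pairs (which never begin a turn) together.
  const turnStarts: number[] = [];
  for (let i = 0; i < messages.length; i++) {
    if (messages[i].role === "human") turnStarts.push(i);
  }
  if (turnStarts.length === 0) {
    return { head: [], tail: messages.slice(), tailStartIndex: 0 };
  }
  let tailStartIndex = messages.length;
  let tokens = 0;
  // Walk back over whole turns, newest first. Each turn's slice is
  // [turnStarts[t], turnStarts[t + 1]), so the ranges are disjoint
  // and every message is counted at most once (linear, not quadratic).
  for (let t = turnStarts.length - 1; t >= 0; t--) {
    if (turnStarts.length - 1 - t >= retain.turns) break;
    let turnTokens = 0;
    for (let i = turnStarts[t]; i < (turnStarts[t + 1] ?? messages.length); i++) {
      turnTokens += countTokens(messages[i]);
    }
    const isNewestTurn = t === turnStarts.length - 1;
    // The most recent user-led turn is always kept, so a single
    // oversized first message is never destroyed by summarization.
    if (!isNewestTurn && retain.tokens != null && tokens + turnTokens > retain.tokens) break;
    tokens += turnTokens;
    tailStartIndex = turnStarts[t];
  }
  return {
    head: messages.slice(0, tailStartIndex),
    tail: messages.slice(tailStartIndex),
    tailStartIndex,
  };
}
```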
* feat: add local execution engine
* feat(local-engine): hardening + ergonomics follow-ups
Adds the follow-up items flagged in the design comparison so the local
engine ships with a defensible default posture.
Security & correctness
- Bridge bearer token: per-Run 256-bit token; programmatic Python and
bash callers send `x-librechat-bridge-token`; server uses
`timingSafeEqual`. Stops cross-process listeners on the same loopback
port from invoking SDK tools.
- Symlink-aware workspace path: new `resolveWorkspacePathSafe()` walks
up to the nearest existing ancestor, realpaths it, and rejects
containment escapes — covers `write_file` to a brand-new path
without false positives on `/tmp -> /private/tmp` style ancestors.
- Read tool guards: stat-cap (default 10 MiB, configurable via
  `local.maxReadBytes`), NUL-byte binary detection on the first 8 KiB
  before reading the rest; refused reads return descriptive stubs.
- Local engine warns once per process when it runs without
`@anthropic-ai/sandbox-runtime` wrapping so the no-sandbox default
stays loud in CI/server logs.
Validation
- New `bashAst` config (`'off' | 'auto' | 'strict'`): a heuristic
  categorical-hazard pass over the quote-stripped command
(command substitution, zsh privileged builtins,
`/proc/<pid>/environ` reads, `IFS=` injection, hex-escape
obfuscation, `source $VAR`, plus `eval`/`exec` denied in strict).
Forward-compatible name for a future tree-sitter-bash AST drop-in.
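The quote-strip-then-scan pass can be sketched as below. The pattern list is a reduced, illustrative subset of the hazards named above (zsh privileged builtins are omitted); the real rule set and its exact regexes are assumptions here.

```typescript
// Illustrative sketch of the bashAst categorical-hazard pass.
type BashAstMode = "off" | "auto" | "strict";

// Remove single- and double-quoted spans so literal strings can
// neither mask nor spuriously trigger hazard patterns.
function stripQuoted(cmd: string): string {
  return cmd.replace(/'[^']*'/g, "''").replace(/"[^"]*"/g, '""');
}

function findBashHazards(cmd: string, mode: BashAstMode): string[] {
  if (mode === "off") return [];
  const bare = stripQuoted(cmd);
  const rules: Array<[string, RegExp]> = [
    ["command substitution", /\$\(|`/],
    ["/proc environ read", /\/proc\/[^ ]*\/environ/],
    ["IFS injection", /\bIFS=/],
    ["hex-escape obfuscation", /\\x[0-9a-fA-F]{2}/],
    ["variable source", /\bsource\s+\$/],
  ];
  if (mode === "strict") {
    // eval/exec are denied only in strict mode, per the commit.
    rules.push(["eval/exec", /\b(eval|exec)\b/]);
  }
  return rules.filter(([, re]) => re.test(bare)).map(([name]) => name);
}
```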
Ergonomics
- File checkpointing: `LocalFileCheckpointerImpl` snapshots pre-write
contents for `write_file`/`edit_file` and rewinds them on demand.
`createLocalCodingToolBundle` returns the checkpointer alongside the
tools when `local.fileCheckpointing: true`.
- Pluggable spawn seam: `LocalExecutionConfig.spawn?: LocalSpawn` lets
callers route every command through SSH/containers/remote runners
without forking the engine.
- Ripgrep fallback: `grep_search` and `glob_search` now probe for `rg`
once per process and fall back to a Node-side walker + JS regex when
ripgrep isn't on PATH; the old hard requirement is gone.
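The probe-once-per-process pattern can be sketched as follows; function names are illustrative stand-ins, not the engine's actual exports, and the fallback is reduced to a plain JS regex filter.

```typescript
// Sketch: probe for ripgrep once, cache the result for the process
// lifetime, fall back to a JS regex filter when rg is absent.
import { spawnSync } from "node:child_process";

let rgAvailable: boolean | undefined;

function hasRipgrep(): boolean {
  if (rgAvailable === undefined) {
    const probe = spawnSync("rg", ["--version"], { stdio: "ignore" });
    rgAvailable = !probe.error && probe.status === 0;
  }
  return rgAvailable; // cached: later calls never spawn again
}

// The fallback search path: the real engine shells out to rg when
// hasRipgrep() is true and walks files in Node otherwise.
function grepSearch(pattern: string, lines: string[]): string[] {
  const re = new RegExp(pattern);
  return lines.filter((l) => re.test(l));
}
```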
Tests
- 20 new unit tests covering AST findings, sandbox-off warning latch,
checkpointer round-trip and file deletion, binary-file refusal,
oversize stub, symlink escape, ripgrep fallback, and bridge auth
rejection. Existing 109 tests still pass.
* feat(toolnode): fire PreToolUse/PostToolUse/PostToolUseFailure on direct path
Before this change, in-process tools added to `directToolNames` (every
graphTool with a real implementation, including the new local-engine
tools) skipped the hook lifecycle entirely. Only schema-only tools
that round-tripped through the host event-dispatch path actually fired
PreToolUse/PostToolUse/PostToolUseFailure/PermissionDenied. That meant
`createToolPolicyHook`, the new HITL `'ask'` flow, and PostToolUse
output rewriting all silently no-op'd on direct tools — including
every `bash`/`code`/`edit_file`/`write_file` from the local engine.
This adds `runDirectToolWithLifecycleHooks(call, config, batchContext)`
which wraps `runTool` with the same hook semantics
`dispatchToolEvents` uses for the single-call case, and routes the
three direct-call sites in `run()` through it. Behavior:
- Fast path: when the registry has none of PreToolUse / PostToolUse /
PostToolUseFailure registered for this run, falls through to
`runTool` with zero overhead. No regression for callers without
hooks.
- PreToolUse: pre-resolves args (matching the event path), fires the
hook with resolved args.
- `decision: 'deny'` → synthesizes a Blocked error ToolMessage and
fires PermissionDenied (observational).
- `decision: 'ask'` → fail-closed deny regardless of
`humanInTheLoop.enabled`. The event path handles 'ask' via
LangGraph's interrupt(), which requires restructuring the direct
invoke path; until that lands, returning 'ask' on a direct tool
blocks the call (strict improvement vs. ignoring the hook
entirely). A one-time warning fires when HITL is enabled, so
hosts notice the gap.
- `updatedInput` → applied to the call before runTool; runTool's
placeholder resolution is idempotent on already-resolved args.
- PostToolUse: when runTool returns a non-error ToolMessage, fires
PostToolUse and applies `updatedOutput` (preserving id/name/status).
- PostToolUseFailure: when runTool returns a status='error'
ToolMessage, fires the failure hook (observational); the original
error continues to propagate.
PostToolBatch participation across direct + dispatched outcomes is
intentionally out of scope: the batch hook accumulates inside
dispatchToolEvents and is fired at the end of its scope, and merging
direct-call outcomes into that aggregation crosses the two paths'
result-shaping logic. Left as a follow-up.
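The control flow above can be condensed into a sketch. Hook and message shapes are simplified stand-ins for the real types, and the arg pre-resolution and PermissionDenied firing are elided:

```typescript
// Condensed, illustrative sketch of runDirectToolWithLifecycleHooks.
type ToolMessage = { name: string; status: "success" | "error"; content: string };
type PreDecision = { decision?: "allow" | "deny" | "ask"; updatedInput?: unknown };

interface Hooks {
  preToolUse?: (name: string, args: unknown) => PreDecision;
  postToolUse?: (msg: ToolMessage) => { updatedOutput?: string } | void;
  postToolUseFailure?: (msg: ToolMessage) => void;
}

function runDirectToolWithLifecycleHooks(
  name: string,
  args: unknown,
  hooks: Hooks,
  runTool: (name: string, args: unknown) => ToolMessage,
): ToolMessage {
  // Fast path: no hooks registered -> zero-overhead passthrough.
  if (!hooks.preToolUse && !hooks.postToolUse && !hooks.postToolUseFailure) {
    return runTool(name, args);
  }
  const pre = hooks.preToolUse?.(name, args) ?? {};
  // 'ask' fails closed on the direct path until interrupt() support
  // lands; 'deny' synthesizes a blocked error ToolMessage.
  if (pre.decision === "deny" || pre.decision === "ask") {
    return { name, status: "error", content: "Blocked by PreToolUse hook" };
  }
  const result = runTool(name, pre.updatedInput ?? args);
  if (result.status === "error") {
    hooks.postToolUseFailure?.(result); // observational; error propagates
    return result;
  }
  const post = hooks.postToolUse?.(result);
  if (post && post.updatedOutput !== undefined) {
    return { ...result, content: post.updatedOutput }; // name/status preserved
  }
  return result;
}
```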
8 new unit tests in directToolHooks.test.ts cover deny+PermissionDenied,
ask fail-closed, allow, updatedInput, updatedOutput, PostToolUseFailure,
no-hooks fast path, and a mixed batch where one direct call denies and
the sibling runs.
Total non-API test count: 1163 -> 1171 passing.
* chore(scripts): add live-API smoke scripts for the local engine
Four scripts that exercise the local execution engine end-to-end with
a real LLM provider (matching the convention in src/scripts/*). Each
spins up a temporary workspace, runs the graph, then dumps observable
state so the behavior under test is visible in the console.
- local_engine.ts — baseline e2e: bash + write_file + read_file
+ edit_file + list_directory in a single
multi-step task. Sets `bashAst: 'auto'` so
the categorical-hazard pass is exercised.
- local_engine_ptc.ts — `run_tools_with_code` (Python) calling
write_file / read_file / edit_file through
the localhost bridge in a single call. The
bridge token auth (timingSafeEqual) is on
by default for every Run, so this also
exercises that path.
- local_engine_hooks.ts — direct-path hook integration: registers a
`createToolPolicyHook` that denies
write_file / edit_file plus an explicit
PostToolUse that prefixes every successful
tool result with "[reviewed]". Asks the
model to write then read; the write must
be blocked, the read must come back tagged.
Also asserts blocked.txt does not land on
disk.
- local_engine_checkpointer.ts — builds the local coding tools via
`createLocalCodingToolBundle({ fileCheckpointing: true })`
so the host keeps a handle on the
checkpointer, lets the model edit two
seeded files, then calls `rewind()` and
verifies originals come back. Exits non-zero
if rewind didn't restore both files.
npm scripts added for each: `local`, `local:ptc`, `local:hooks`,
`local:checkpointer`. All four typecheck clean (`tsc --noEmit`).
* fix(graph): pass hookRegistry on the non-event-driven path
Live integration test (`npm run local:hooks`) surfaced that hooks
weren't actually firing on local-engine tools when used with a plain
`graphConfig: { type: 'standard' }` (no `agentContext.toolDefinitions`).
Reason: `Graph.ts` only forwarded `hookRegistry` and `humanInTheLoop`
to ToolNode in the event-driven branch — the legacy non-event-driven
branch dropped both, so the policy hook silently no-op'd and a
`write_file` that the script's `createToolPolicyHook` was supposed to
deny landed on disk.
Two changes:
1. `StandardGraph.createToolNode` now passes `hookRegistry` and
`humanInTheLoop` in BOTH branches. Hooks are conceptually
orthogonal to whether tools dispatch as events or invoke in
process; only the event path was historically wired.
2. `ToolNode.run`'s non-event-driven `else` branch now invokes
`runDirectToolWithLifecycleHooks` instead of `runTool` directly,
so PreToolUse / PostToolUse / PostToolUseFailure / PermissionDenied
fire the same way they do on the event-driven mixed-batch path.
The helper still falls through to `runTool` on the no-hooks fast
path, so callers without hooks pay zero overhead.
Verified by re-running `npm run local:hooks`:
- PermissionDenied fired 1 time(s) for write_file with the policy
reason
- PostToolUse saw read_file (the allowed read got the [reviewed]
tag)
- blocked.txt landed on disk? false (expected: false)
Also re-ran the full non-API jest suite — 1171 passing, no regressions.
* feat(local-engine): coding-tool parity push (fuzzy edit, diff, BOM/EOL, overflow)
Live cross-comparison against claude-code, pi-mono, and opencode (sst)
surfaced four meaningful gaps in the foundation tools. This commit
closes them; LSP integration and apply_patch (multi-file) are larger
and stay as documented follow-ups.
Edit fuzzy matching
-------------------
Our edit_file required exact `oldText` match, so a single mistyped
space would force the LLM to re-read and retry. New
`editStrategies.ts` walks a four-stage chain — exact, line-trimmed,
whitespace-normalized, indentation-flexible — stopping at the first
strategy that locates EXACTLY ONE match. The matched on-disk slice is
then literally replaced with `newText`; we never modify newText. The
strategy that fired is surfaced in the tool result's `strategies`
metadata so callers can see when fuzzy fallback kicked in. Inspired
by opencode's nine-strategy chain; kept to the four highest-yield
strategies for a first cut, room for more (block-anchor + Levenshtein
etc.) as needed.
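The stop-at-first-unique-match chain can be sketched as below. Only two of the four stages are shown and the matching is deliberately simplified; the real editStrategies.ts locates on-disk slices and replaces them with `newText` untouched.

```typescript
// Illustrative two-stage version of the edit strategy chain.
type Strategy = { name: string; find: (haystack: string, needle: string) => number[] };

const findAll = (hay: string, needle: string): number[] => {
  const hits: number[] = [];
  for (let i = hay.indexOf(needle); i !== -1; i = hay.indexOf(needle, i + 1)) hits.push(i);
  return hits;
};

const strategies: Strategy[] = [
  { name: "exact", find: findAll },
  {
    name: "whitespace-normalized",
    find: (hay, needle) => {
      // Collapse whitespace runs on both sides before matching.
      const norm = (s: string) => s.replace(/\s+/g, " ").trim();
      return findAll(norm(hay), norm(needle));
    },
  },
];

// First strategy that locates EXACTLY ONE match wins, so an
// ambiguous oldText never silently edits the wrong site. The
// winning strategy name is surfaced to the caller.
function locateEdit(fileText: string, oldText: string): { strategy: string } | null {
  for (const s of strategies) {
    if (s.find(fileText, oldText).length === 1) return { strategy: s.name };
  }
  return null;
}
```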
Unified-diff output on edit/write
---------------------------------
`edit_file` and `write_file` now embed a unified diff in the
ToolMessage so the model sees what actually changed instead of the
prior `Applied 1 edit(s)` blob. Uses npm `diff`'s
`createTwoFilesPatch` with 3 lines of context, capped at 4 KB so
large rewrites don't blow context.
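The cap-before-embedding step can be sketched with a hand-rolled line diff standing in for npm `diff`'s `createTwoFilesPatch` (which the real code uses); only the 4 KB cap value comes from the commit, the helper names are ours.

```typescript
// Naive line diff as a stand-in for createTwoFilesPatch.
function naiveDiff(before: string, after: string): string {
  const a = before.split("\n");
  const b = after.split("\n");
  const out: string[] = [];
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    if (a[i] === b[i]) continue;
    if (a[i] !== undefined) out.push("-" + a[i]);
    if (b[i] !== undefined) out.push("+" + b[i]);
  }
  return out.join("\n");
}

// Cap the diff so large rewrites don't blow the model's context.
function capDiff(diff: string, cap = 4096): string {
  return diff.length > cap ? diff.slice(0, cap) + "\n[diff truncated]" : diff;
}
```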
BOM + line-ending preservation
------------------------------
New `textEncoding.ts`:
- `decodeFile` strips and remembers UTF-8 BOM and CRLF/LF newline.
- `encodeFile` reapplies them on write.
`edit_file` / `write_file` now read the file, decode to LF + remember,
operate, and re-encode on write. Verified by tests: editing a CRLF
file keeps CRLF; overwriting a BOM file keeps the BOM.
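The strip-and-remember round-trip can be sketched as below. The commit's functions operate on files (`decodeFile`/`encodeFile`); these string-level stand-ins show the same invariant.

```typescript
// String-level sketch of the textEncoding.ts round-trip.
const BOM = "\uFEFF";

interface DecodedText { text: string; hadBom: boolean; eol: "\n" | "\r\n" }

function decodeText(raw: string): DecodedText {
  const hadBom = raw.startsWith(BOM);
  if (hadBom) raw = raw.slice(1);
  const eol: "\n" | "\r\n" = raw.includes("\r\n") ? "\r\n" : "\n";
  // Normalize to LF so edit strategies never reason about CRLF.
  return { text: raw.replace(/\r\n/g, "\n"), hadBom, eol };
}

function encodeText(d: DecodedText): string {
  // Reapply the remembered newline style and BOM on write.
  const body = d.eol === "\r\n" ? d.text.replace(/\n/g, "\r\n") : d.text;
  return (d.hadBom ? BOM : "") + body;
}
```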
Bash output overflow → temp file
--------------------------------
`spawnLocalProcess` previously truncated overflowing output in place.
Now: when total bytes exceed 2× `maxOutputChars`, the FULL
stdout/stderr is written to a temp file (`/tmp/lc-local-output-*.txt`)
and the path is appended to the formatted output as
`full_output_path: …`. The model can then `bash tail -n N <path>` (or
`grep`) to inspect specifics it actually needs. Mirrors pi-mono's and
opencode's overflow handling.
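The spill-to-temp-file behavior can be sketched as follows. The 2× threshold and the `full_output_path:` marker come from the commit; the temp-file naming and helper name are assumptions (the real path lives inside `spawnLocalProcess`).

```typescript
// Sketch of the output-overflow spill.
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

function formatBashOutput(output: string, maxOutputChars: number): string {
  if (output.length <= 2 * maxOutputChars) return output;
  // Write the FULL output somewhere the model can tail/grep later.
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "lc-local-output-"));
  const tmp = path.join(dir, "output.txt");
  fs.writeFileSync(tmp, output);
  // Model sees the head plus a pointer to the complete text.
  return output.slice(0, maxOutputChars) + `\n[truncated]\nfull_output_path: ${tmp}`;
}
```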
New deps
--------
- `diff@9` and `@types/diff@7` — the canonical TS diff lib opencode
uses too, MIT-licensed.
Tests
-----
Added unit tests for each new behavior — 5 new cases for fuzzy
matching, diff embedding, CRLF preservation, BOM preservation. Ran
the full non-API jest suite: 1176 passing (was 1171). Ran all four
live integration scripts (`local`, `local:hooks`, `local:checkpointer`,
`local:ptc`) end-to-end against the real Anthropic API; all four
green and the diff payload now shows up in the model-visible tool
results.
* feat(local-engine): inline image/PDF attachments in read_file
Closes the last "Close now" parity gap: read_file can now return image
and PDF files as inline `MessageContentComplex[]` content blocks so
vision-capable models actually see the bytes, instead of getting our
old "binary file" stub.
New module
----------
src/tools/local/attachments.ts
- Magic-byte sniff for the five formats we care about (PNG, JPEG,
GIF, WebP, PDF). Inlined the signatures rather than pulling in
`file-type` (ESM-only, awkward under ts-jest); LibreChat's
api/server/utils/files.js takes the same approach but uses the
package since it's a CJS-OK Node server.
- Returns a typed Attachment union that the read tool branches on:
image -> image_url block, pdf -> image_url block (for Anthropic's
PDF-as-data-url path), oversize -> stub, binary -> stub,
text-or-unknown -> falls through to the existing line-window read.
- Configurable via two new fields on `LocalExecutionConfig`:
- `attachReadAttachments`: 'off' (default) | 'images-only' |
'images-and-pdf'. Defaults `'off'` to preserve current behavior;
hosts opt in when their model is vision-capable.
- `maxAttachmentBytes`: pre-encoding size cap (default 5 MiB) that
bounds the post-base64 token cost.
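The inlined magic-byte signatures can be sketched as below; the return shape is a simplified stand-in for the engine's typed Attachment union.

```typescript
// Magic-byte sniff for the five supported formats.
function sniffAttachmentMime(buf: Buffer): string | null {
  const ascii = (start: number, s: string) =>
    buf.length >= start + s.length && buf.toString("latin1", start, start + s.length) === s;
  if (buf.length >= 8 && buf[0] === 0x89 && ascii(1, "PNG")) return "image/png";
  if (buf.length >= 3 && buf[0] === 0xff && buf[1] === 0xd8 && buf[2] === 0xff) return "image/jpeg";
  if (ascii(0, "GIF87a") || ascii(0, "GIF89a")) return "image/gif";
  if (ascii(0, "RIFF") && ascii(8, "WEBP")) return "image/webp";
  if (ascii(0, "%PDF-")) return "application/pdf";
  return null; // text-or-unknown -> fall through to line-window read
}
```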
Plumbing
--------
- read_file branches on attachment.kind. For an image, it returns
`[ [textBlock, imageUrlBlock], { artifact } ]`. LangChain's
Anthropic adapter (verified at
node_modules/@langchain/anthropic/dist/utils/message_inputs.js)
preserves image_url blocks inside `tool_result` content arrays;
OpenAI Chat Completions accepts them on vision models too. The
Responses API flattens to text on the wire (function_call_output
is string-only) which is the safe degrade.
- The textBlock that goes alongside the image_url carries the
filename/mime/byte count so non-vision models still get useful
signal.
Tests
-----
4 new unit cases:
- default behavior: still returns the binary stub
- images-only: returns array content with `image_url` data URL +
artifact { mime: 'image/png', attachment: 'image' }
- oversize: caps embedding past `maxAttachmentBytes` -> "Refusing to
embed" stub
- text files unaffected when embedding is enabled
29 LocalExecutionTools tests pass; 1180 non-API tests overall.
Live verification
-----------------
New `npm run local:image` script:
1. Copies a real PNG (the macOS Certificate Assistant droppedImage)
into a temp workspace.
2. Asks the agent to `read_file` it and describe what's in it.
3. Inspects the run's ToolMessage to confirm the content is
`[textBlock, imageBlock]` with a `data:image/png;base64,...` URL.
Live run against real Anthropic Sonnet:
ToolMessage(name=read_file, content=[text,image_url] url=data:image/png;base64,iVBORw0KGgo…)
Image attachment landed in tool result: true ✔
ASSISTANT: "The image shows a blank certificate template with a
light blue border. The word 'Certificate' appears at the top in an
elegant, cursive script font. There's a decorative flourish or
underline beneath the text, and a gold/yellow seal or badge
decoration in the lower right portion..."
That's the actual content of the source PNG — vision is wired
through end-to-end.
* feat(local-engine): post-edit syntax check + compile_check tool
The two cheap "tell-the-agent-when-it-broke-the-file" alternatives I
called out as the pragmatic substitute for full LSP integration. Both
are opt-in.
1. Post-edit syntax check
-------------------------
New `local.postEditSyntaxCheck: 'off' | 'auto' | 'strict'`
(default `'off'`).
After every successful `edit_file` / `write_file`, runs a fast
per-file syntax checker keyed on extension and appends any error to
the tool result so the model self-corrects on the next turn without
a separate read round-trip:
- `.js` / `.mjs` / `.cjs` / `.jsx` -> `node --check <file>`
- `.py` / `.pyw` -> `python3 -m py_compile`
- `.json` -> `JSON.parse` (in process)
- `.sh` / `.bash` -> `bash -n <file>`
`'auto'` appends `[syntax-check warning ...]` to the tool result and
returns success. `'strict'` throws the error so the next turn must
react. Each checker probes once for tool availability and silently
skips if missing.
TypeScript per-file syntax check is intentionally not in this list:
per-file `tsc` is slow and only catches a small fraction of TS
issues without type information. Use `compile_check` (below) for that.
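The extension-keyed dispatch can be sketched as below. Only the in-process JSON branch actually executes here; the other branches shell out in the real engine (and probe once for tool availability), so they appear as comments.

```typescript
// Illustrative sketch of the per-extension syntax-checker dispatch.
import * as path from "node:path";

function checkSyntax(filePath: string, contents: string): string | null {
  switch (path.extname(filePath)) {
    case ".json":
      try {
        JSON.parse(contents); // in-process, no subprocess needed
        return null;
      } catch (e) {
        // 'auto' appends this warning to the tool result;
        // 'strict' would throw it instead.
        return `[syntax-check warning ${(e as Error).message}]`;
      }
    // .js/.mjs/.cjs/.jsx -> node --check <file>
    // .py/.pyw           -> python3 -m py_compile
    // .sh/.bash          -> bash -n <file>
    default:
      return null; // unknown extension: no checker, no noise
  }
}
```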
2. compile_check tool
---------------------
New tool auto-bound when `engine: 'local'` and `includeCodingTools`
is on. Lets the agent ask "did my change break anything?" without us
shipping a real LSP client.
Auto-detection (first hit wins):
- tsconfig.json or package.json[typescript] -> `npx --no-install tsc --noEmit`
- Cargo.toml -> `cargo check --message-format=short`
- go.mod -> `go vet ./...`
- pyproject.toml/setup.py with mypy -> `python3 -m mypy .`
- pyproject.toml/setup.py without mypy -> `compileall .`
- none of the above -> tells the agent "no toolchain"
Honours `local.compileCheck.command` and `compile_check.command` arg
overrides; honours `local.compileCheck.timeoutMs` (default 120s).
Output is truncated through the standard local-engine output cap and
spills overflow to a temp file like the bash tool does. Returns
exit-code-aware `PASSED`/`FAILED` headline + body.
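The first-hit-wins probe can be sketched as follows. The marker/command pairs are taken from the table above; the package.json[typescript] check is omitted and the mypy detection is reduced to a file-content scan, so treat this as an approximation.

```typescript
// Illustrative sketch of compile_check's project-marker detection.
import * as fs from "node:fs";
import * as path from "node:path";

function detectCompileCommand(workspace: string): string | null {
  const has = (f: string) => fs.existsSync(path.join(workspace, f));
  if (has("tsconfig.json")) return "npx --no-install tsc --noEmit";
  if (has("Cargo.toml")) return "cargo check --message-format=short";
  if (has("go.mod")) return "go vet ./...";
  if (has("pyproject.toml") || has("setup.py")) {
    const markers = ["pyproject.toml", "setup.py"].filter(has);
    const usesMypy = markers.some((f) =>
      fs.readFileSync(path.join(workspace, f), "utf8").includes("mypy"),
    );
    return usesMypy ? "python3 -m mypy ." : "python3 -m compileall .";
  }
  return null; // no toolchain -> tell the agent
}
```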
Tests
-----
8 new unit cases:
- JS/JSON broken/valid round-trip
- returns null for unknown extensions
- write_file appends `[syntax-check warning ...]` in `auto`
- write_file in `strict` throws
- compile_check reports "no recognised project marker" in an empty workspace
- compile_check honours an explicit command override and reports exit codes
37 LocalExecutionTools tests pass; 1188 non-API tests overall.
Live integration
----------------
New `npm run local:compile` script that:
1. Seeds a tiny TS workspace (tsconfig.json + index.ts).
2. Asks Anthropic Sonnet to write a broken JS file (post-edit
syntax-check warns) -> fix it -> write a broken TS file with a
real type error -> call compile_check (must FAIL with
`error TS2322`) -> fix it -> call compile_check again (must
PASS).
Live run, end-to-end:
- `[syntax-check warning via node --check]` appended to write_file
- First compile_check: FAILED via `npx --no-install tsc --noEmit`
(exit=2): `broken.ts(1,14): error TS2322: Type 'string' is not
assignable to type 'number'.`
- Second compile_check (after edit_file fix): PASSED
- Both assertions ✔.
Other live scripts (local, local:hooks, local:checkpointer,
local:ptc, local:image) still green.
* chore(scripts): side-by-side comparison harness vs pi-mono
`npm run compare:pi` runs four identical tasks against pi-mono's CLI
and our local engine in parallel temp workspaces using the same model
(claude-sonnet-4-5 by default). Captures: tool-call shape, wall time,
input/output/cache tokens, and verifies the workspace ended in the
expected state.
Tasks
-----
- T1 simple-edit : single literal substitution in greet.py
- T2 fuzzy-edit : edit a file with trailing whitespace + tabs;
exercises both agents' fuzzy-match recovery
- T3 syntax-error-fix : pre-seeded broken JS; ours has post-edit
syntax check, pi has to discover via bash
- T4 type-error-fix : pre-seeded TS file with a real type error in
a tiny tsconfig project; ours has
compile_check, pi has bash + tsc
Pi runner spawns `node $PI_BIN --print --mode json --no-session
--provider anthropic --model <model>` and parses its JSON event
stream. Ours uses Run.create with `engine: 'local'` and walks the
final conversation. PI_BIN defaults to
~/Projects/pi-mono/packages/coding-agent/dist/cli.js.
Live results (real Anthropic, Sonnet 4.5)
-----------------------------------------
task metric pi ours
T1 simple-edit verify ✔ ✔
wall 8.4s 11.0s
tool calls 2 2
output tok 313 281
T2 fuzzy-edit verify ✔ ✔
wall 16.1s 7.2s
tool calls 2 2
output tok 384 246
T4 type-error-fix-loop verify ✔ ✔
wall 23.5s 13.6s
tool calls 6 4
output tok 805 397
T3 syntax-error-fix verify ✔ ✔
wall 16.2s 11.7s
tool calls 3 4 (note: ours used 1
bash, pi used 2)
output tok 371 503
Both agents 4/4 verify ✔. Ours wins wall-time on 3/4 tasks, output
tokens on every task, and tool-call count on T4 (compile_check
short-circuits the bash+read+bash loop pi did).
Open follow-up surfaced by the harness: pi reports thousands of
cacheRead tokens per turn (Anthropic prompt cache rebate), ours
reports zero. We're paying full rate for the system+tools prefix on
every turn. Not a regression introduced here, but worth its own PR
to wire prompt caching on the Anthropic adapter.
* chore(scripts): compare:pi — caching enabled, +2 tasks, +variance
Updates to the side-by-side harness based on first-run findings.
1. Anthropic prompt caching wired on our side
-----------------------------------------------
The first comparison reported cache_read=0 on our side. To clarify
(this is pre-existing behavior, not a regression from this PR):
caching IS supported (`addCacheControl` + the agent context's
`hasAnthropicPromptCache()`), but it's gated on
`clientOptions.promptCache === true`. The harness's first cut set
that on `graphConfig.clientOptions`, but `Run.createLegacyGraph`
(src/run.ts:156) destructures `llmConfig` and rebuilds
`agentConfig.clientOptions` from it — `graphConfig.clientOptions`
is silently dropped on the standard path. So in this harness we
set `promptCache: true` on the llmConfig itself.
After the fix, ours reports real cache_read / cache_creation per
turn. Caching works.
2. New tasks
------------
- T5 multi-file-rename : rename `calc_total` → `calculateTotal` across
src/lib.ts, src/index.ts, src/index.test.ts. Tests how each agent
finds + applies the rename.
- T6 image-read-and-describe (ours-only): copies a real PNG into the
workspace (macOS Certificate Assistant icon) and asks the model to
describe it. Exercises our `attachReadAttachments: 'images-only'`
end-to-end. pi has no equivalent and is skipped via the new
`skip: 'pi'` field.
3. Variance support
-------------------
COMPARE_ITERS=N runs each task N times per side and reports the
mean. Skipped tasks count cleanly (no false N/A entries). The
verify check now force-fails when a runner errored, so a soft
verify (e.g. T6's "file-still-on-disk" check) can't mask a real
provider rejection.
4. Cost computation
-------------------
We now compute our cost from the same per-turn breakdown using
Sonnet 4.5 pricing ($3/$15 per Mtok input/output, $3.75 cache write,
$0.30 cache read), so the cost columns are directly comparable.
Live results (N=2)
------------------
Functional: pi 10/10 ✔, ours 12/12 ✔ (T6 ours-only adds 2).
Wall time: ours wins all 5 shared tasks (~30% faster on average).
Tool calls: ours fewer or equal on every shared task.
Output tokens: ours fewer on every shared task.
Cost: ours is ~6× more expensive per task even with caching on.
Why the cost gap survives caching
---------------------------------
Per-turn "input new" tokens:
pi (T1): 35 ours (T1): 28298
pi (T4): 76 ours (T4): 48630
Pi only sends the new turn's payload (~35 tokens); nearly the entire
system prefix and tool defs are cache hits. Ours sends ~28k tokens of new
input per turn even with caching on, which strongly suggests our
cache prefix isn't covering tool definitions, OR the breakpoint
position isn't matching across turns. Worth a dedicated PR to tune
the cache breakpoints in `addCacheControl` / `AgentContext.systemRunnable`
so tool defs sit inside the cached prefix; this will close the cost
gap considerably.
The harness made this gap impossible to miss — the entire reason for
the cross-comparison.
* feat(cache): tool-prefix cache breakpoint + harness accounting fix
Two related changes; the second was the bigger win.
1. Tool-prefix cache breakpoint (Anthropic)
-------------------------------------------
New `partitionAndMarkAnthropicToolCache` helper at
src/messages/anthropicToolCache.ts. When `clientOptions.promptCache
=== true` and the provider is Anthropic, we now:
- Stable-partition the bound tool list into [static, deferred] using
`defer_loading` from `agentContext.toolDefinitions`.
- Stamp `cache_control: { type: 'ephemeral' }` on the LAST static
tool via the `extras` field LangChain's Anthropic adapter
forwards (`AnthropicToolExtrasSchema`).
This way, the cached prefix covers exactly the static tool inventory.
Discovered deferred tools that arrive across turns sit *after* the
breakpoint and don't invalidate the prefix.
The wrap is non-mutating (clones with `Object.create(getProto…)` so
the original tool reference is untouched) and idempotent. 10 unit
tests in `src/messages/__tests__/anthropicToolCache.test.ts` cover
partition order, stamp behaviour, prototype preservation, extras
merging, and idempotency.
Wired in `Graph.ts` right after `resolveLocalToolsForBinding`, gated
on Anthropic + `promptCache: true`.
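The partition-and-stamp step can be sketched as follows, with simplified tool shapes (the real helper clones via `Object.create` to preserve prototypes and merges into LangChain's `extras` field; the names here are illustrative):

```typescript
interface ToolLike {
  name: string;
  defer_loading?: boolean;
  extras?: Record<string, unknown>;
}

function partitionAndMarkToolCacheSketch(tools: ToolLike[]): ToolLike[] {
  // Stable partition: static tools first, deferred after, relative
  // order preserved within each group.
  const staticTools = tools.filter((t) => !t.defer_loading);
  const deferredTools = tools.filter((t) => t.defer_loading);
  // Stamp the cache breakpoint on the LAST static tool only, via a
  // non-mutating copy so the caller's objects stay untouched.
  const marked = staticTools.map((t, i) =>
    i === staticTools.length - 1
      ? { ...t, extras: { ...t.extras, cache_control: { type: 'ephemeral' } } }
      : t
  );
  return [...marked, ...deferredTools];
}
```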
In practice the impact is smaller than I expected because
`addCacheControl` on the last user message ALREADY creates a cache
breakpoint that covers the tools+system prefix as part of the
cumulative cached prefix. The tool-level breakpoint becomes load-
bearing in cases where:
- A turn arrives without a fresh user message (resume after tool
call only),
- The user message would otherwise blow the cache breakpoint, or
- Cache budget needs to be split across multiple breakpoints to fit
inside Anthropic's 4-breakpoint cap.
Still correct; still worth shipping; just not a 6× win on its own.
2. Harness accounting fix (the actual ~6× swing)
------------------------------------------------
`compare:pi`'s previous run reported our input tokens as ~28k–48k
per turn vs pi's ~35–80, suggesting a 6× cost gap that survived
caching. That number was wrong.
LangChain's Anthropic adapter at
src/llm/anthropic/utils/message_outputs.ts:31 reports
`usage_metadata.input_tokens` as the SUM of uncached input +
cache_creation + cache_read — i.e., the total prompt size, NOT the
uncached portion. Pi reports `usage.input` as uncached only. So our
"input new" column was double-counting cached tokens.
The harness now subtracts `cache_read + cache_creation` from
`input_tokens` when computing our "input new", so the column is
apples-to-apples with pi's `input` field.
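The correction in isolation, with illustrative field names (not the exact adapter types):

```typescript
interface AnthropicTurnUsage {
  input_tokens: number;    // adapter-reported TOTAL prompt size
  cache_read: number;      // tokens served from cache
  cache_creation: number;  // tokens written to cache this turn
}

function uncachedInputTokens(u: AnthropicTurnUsage): number {
  // input_tokens already sums all three buckets, so strip the cached
  // portions to match pi's uncached-only `input` field.
  return u.input_tokens - u.cache_read - u.cache_creation;
}
```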
Recomputed live results (N=2, caching on, real Anthropic):
task                   pi cost    ours cost    delta
T1 simple-edit         $0.0158    $0.0171      +8%
T2 fuzzy-edit          $0.0157    $0.0173      +10%
T3 syntax-fix          $0.0210    $0.0248      +18%
T4 type-error-loop     $0.0325    $0.0292      -10%  ← we win
T5 multi-file          $0.0264    $0.0312      +18%
T6 image (ours-only)   —          $0.0158
Functional: pi 10/10 ✔, ours 12/12 ✔.
Wall time: ours wins T3/T4/T5 (most-complex tasks), pi wins T1/T2.
Tool calls: ours wins or ties every shared task.
Output tokens: ours wins every shared task.
Cost: pi ~10–20% cheaper on simpler tasks, ours wins the most
complex one (T4).
We're effectively at parity. The "we're 6× more expensive" claim
posted earlier was the harness error.
* feat(local-engine): workspace + exec seams + HITL on direct path
Three changes that fit together — first two answer the "what about a
second engine" + "what's the workspace" questions; the third was the
last gap blocking workspace-policy `'ask'` from working end-to-end.
1. WorkspaceFS exec seam (engine reusability)
---------------------------------------------
New `WorkspaceFS` interface in src/tools/local/workspaceFS.ts (read,
write, stat, readdir, mkdir, realpath, unlink, open) with a
`nodeWorkspaceFS` default. Every file-touching factory in the local
coding suite (read_file, write_file, edit_file, grep_search,
glob_search, list_directory, file checkpointer, binary detection
helper) now routes through this interface.
A future engine — stateful remote sandbox, in-memory test FS, ssh
jail — supplies its own `WorkspaceFS` and inherits every tool's
behavior (fuzzy-match edit, BOM/EOL preservation, syntax check,
checkpointer, attachment classifier) for free. Same story for
process spawning: the existing `local.spawn` is now nested under
`local.exec.spawn` with the legacy top-level field still honoured.
The seams are wired through four new resolvers:
- `getWorkspaceRoots` — canonical root + additionalRoots
- `getReadRoots`/`getWriteRoots` — split allow-outside knobs
- `getWorkspaceFS` — `local.exec.fs` ?? nodeWorkspaceFS
- `getSpawn` — `local.exec.spawn` ?? local.spawn ?? Node
Plus 7 unit tests in src/tools/__tests__/workspaceSeam.test.ts that
verify additionalRoots, the new allowReadOutside/allowWriteOutside
split, legacy `cwd` + `allowOutsideWorkspace` back-compat, and that
a custom WorkspaceFS actually receives the file tool calls.
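A trimmed-down sketch of the seam; `MiniWorkspaceFS` and `createMemoryFS` are illustrative names, not SDK exports (the real interface also carries stat, readdir, mkdir, realpath, unlink, open):

```typescript
interface MiniWorkspaceFS {
  read(path: string): Promise<string>;
  write(path: string, data: string): Promise<void>;
}

// A hypothetical in-memory backend a test (or a future remote engine)
// could supply via `local.exec.fs`.
function createMemoryFS(): MiniWorkspaceFS {
  const store = new Map<string, string>();
  return {
    async read(path) {
      const data = store.get(path);
      if (data === undefined) throw new Error(`ENOENT: ${path}`);
      return data;
    },
    async write(path, data) {
      store.set(path, data);
    },
  };
}
```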
2. First-class workspace boundary + policy hook
-----------------------------------------------
Nested `local.workspace` config:
  workspace: {
    root: string,
    additionalRoots?: readonly string[],  // monorepo: extend boundary
    allowReadOutside?: boolean,           // split from write
    allowWriteOutside?: boolean,
  }
Back-compat: top-level `cwd` and `allowOutsideWorkspace` still work
and map to `workspace.root` / both `allowOutside*` flags.
`resolveWorkspacePathSafe(path, config, intent)` now takes a `'read'`
| `'write'` intent so reads and writes can have different allow-
outside semantics. Default callers pass `'write'` for stricter
clamping; read tools (read_file, grep_search, glob_search,
list_directory) explicitly pass `'read'`.
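The intent split can be sketched like this, assuming a flattened config and plain prefix comparison (the real resolver also realpaths ancestors; names are illustrative):

```typescript
import { isAbsolute, resolve } from 'node:path';

interface WsConfigSketch {
  root: string;
  allowReadOutside?: boolean;
  allowWriteOutside?: boolean;
}

function resolveWorkspacePathSketch(
  p: string,
  cfg: WsConfigSketch,
  intent: 'read' | 'write'
): string {
  const abs = isAbsolute(p) ? resolve(p) : resolve(cfg.root, p);
  const inside = abs === cfg.root || abs.startsWith(cfg.root + '/');
  // Reads and writes consult different allow-outside flags.
  const allowOutside =
    intent === 'read' ? cfg.allowReadOutside : cfg.allowWriteOutside;
  if (!inside && !allowOutside) {
    throw new Error(`${intent} outside workspace: ${abs}`);
  }
  return abs;
}
```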
New `createWorkspacePolicyHook` in src/hooks/createWorkspacePolicyHook.ts
returns a `PreToolUse` callback that:
- inspects each tool call's input via per-tool path extractors
(defaults cover the local-engine coding suite; hosts can
override / extend),
- returns `allow` for in-workspace paths (incl. additionalRoots),
- returns `'ask'` (default) / `'allow'` / `'deny'` for outside
paths, with separate read/write policies.
This is exactly the user-visible "ask once / always" semantic
claude-code and opencode have, but built on the existing
PreToolUse + HITL machinery instead of a parallel permissions API.
14 unit tests in src/hooks/__tests__/createWorkspacePolicyHook.test.ts.
3. HITL interrupt() on the direct path (the missing piece)
----------------------------------------------------------
The previous direct-path hook commit fail-closed `'ask'` decisions
because the interrupt machinery only existed in the event-dispatch
path. The workspace policy hook above wants to use `'ask'` for
real, so this commit closes that gap.
`runDirectToolWithLifecycleHooks` now:
- When `humanInTheLoop.enabled === true` and PreToolUse returns
`'ask'`: builds a single-tool `tool_approval` payload and raises
a real LangGraph `interrupt()` (anchored to the node's
RunnableConfig the same way `dispatchToolEvents` does — ToolNode
disables LangSmith tracing so the AsyncLocalStorage frame must
be re-established).
- On resume:
- `approve` runs the tool with the (possibly hook-rewritten) args
- `reject` blocks via the new `blockDirectCall` helper
- `respond` returns the host-supplied `responseText` as a
synthetic success ToolMessage (no tool execution)
- `edit` re-runs the tool with the host-edited args
- When HITL is off: collapses to a fail-closed deny (matches the
rest of the SDK's HITL-disabled default). One-time warning
logged so hosts notice the gap.
- `allowedDecisions` enforcement matches the event path — a
decision type outside the policy's allowlist is fail-closed.
`blockDirectCall` is the shared helper that synthesises the Blocked
ToolMessage and fires `PermissionDenied`, used by every deny path
(deny decision, fail-closed ask, reject, allowedDecisions violation,
malformed respond).
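The resume-decision dispatch, sketched with illustrative shapes (not the SDK's exact types):

```typescript
type ResumeDecision =
  | { type: 'approve' }
  | { type: 'edit'; updatedInput: Record<string, unknown> }
  | { type: 'respond'; responseText: string }
  | { type: 'reject' };

function applyDecision(
  decision: ResumeDecision,
  originalArgs: Record<string, unknown>,
  runTool: (args: Record<string, unknown>) => string
): string {
  switch (decision.type) {
    case 'approve':
      // Run with the (possibly hook-rewritten) original args.
      return runTool(originalArgs);
    case 'edit':
      // Re-run with the host-edited args.
      return runTool(decision.updatedInput);
    case 'respond':
      // Host supplies the tool output; the tool body never executes.
      return decision.responseText;
    case 'reject':
      // Stand-in for the synthetic Blocked ToolMessage.
      return 'Blocked by human reviewer';
  }
}
```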
Updated the existing fail-closed-on-ask test in
directToolHooks.test.ts to assert the new HITL-disabled behaviour
explicitly.
Live verification
-----------------
New `npm run local:workspace` script:
- workspace.root = temp dir
- additionalRoots = sibling temp dir
- createWorkspacePolicyHook with outsideRead/outsideWrite='ask'
- humanInTheLoop.enabled = true
- asks the model to read 3 files: in-workspace, sibling, outside
Live result against real Anthropic Sonnet:
✔ inside.txt → workspace policy 'allow' → read INSIDE
✔ sibling.txt → 'allow' (additionalRoots) → read SIBLING
✔ outside.txt → 'ask' → INTERRUPT raised with action_request
carrying the path → script auto-approves
→ tool runs → read SECRET
Final: HITL interrupts handled = 1, successful tool messages = 3
The model didn't even need its "retry if blocked" instruction — HITL
handled the approval transparently.
Functional + non-API tests: 1245 passing (was 1224; +21 new across
workspace seam + workspace policy hook + the updated direct-tool
ask test).
* feat(common): canonical tool-name constants for local coding tools
LibreChat's `getToolIconType` (client/src/components/Chat/Messages/
Content/ToolOutput/ToolIcon.tsx) special-renders four of our tools
by exact name match: `bash_tool`, `read_file`, `execute_code`,
`run_tools_with_code`. We already use those names, so they pick up
the right icon end-to-end.
The newer local-engine tools (`write_file`, `edit_file`,
`grep_search`, `glob_search`, `list_directory`, `compile_check`)
were defined as per-file `Local*ToolName` consts — string literals
that would silently drift if anyone renamed them in only one place,
and that no consumer UI could discover programmatically.
Promotes them to `Constants.*` (single source of truth) and adds
two grouped lists alongside the workspace types so consumers can
match against canonical strings:
- `Constants.WRITE_FILE` ('write_file')
- `Constants.EDIT_FILE` ('edit_file')
- `Constants.GREP_SEARCH` ('grep_search')
- `Constants.GLOB_SEARCH` ('glob_search')
- `Constants.LIST_DIRECTORY` ('list_directory')
- `Constants.COMPILE_CHECK` ('compile_check')
- `LOCAL_CODING_TOOL_NAMES` — the 7 local-only tools (the 6 above
plus READ_FILE)
- `LOCAL_CODING_BUNDLE_NAMES` — the local-only set plus the bash/
  code/PTC tools the bundle wraps around the existing factories
The per-file `Local*ToolName` and `CompileCheckToolName` exports are
retained as back-compat aliases pointing at the canonical Constants
so existing consumers keep working.
Wired through:
- `createWorkspacePolicyHook` default path extractors now key off
`Constants.*` (was string literals).
- `LocalCodingTools.ts` Local*ToolName aliases collapsed into
`Constants.*` references.
- `CompileCheckTool.ts` likewise.
5 new unit tests in localToolNames.test.ts pin:
- per-file aliases match Constants
- canonical strings match the wire-level expectations LibreChat
(and other consumer UIs) special-case
- LOCAL_CODING_BUNDLE_NAMES matches what `createLocalCodingTools`
actually emits
- LOCAL_CODING_TOOL_NAMES matches what
`createLocalCodingToolDefinitions` emits
- bundle list is a strict superset of the local-only list
Net effect: when LibreChat (or any other consumer UI) eventually
adds icons for write_file / edit_file / grep_search / glob_search /
list_directory / compile_check, the wiring picks up automatically
without an SDK change. And a rename on one side without the other
gets caught at test time.
Tests: 1250 passing (was 1245; +5 new).
* fix(local): three Codex P2 review nits — bash args, rg cache, root-relative extras
Three real issues Codex flagged on the PR. Each is small but
behavior-affecting; tests pin each.
#1 — `executeLocalCode` dropped `args` when lang was bash
--------------------------------------------------------
For every other runtime (py, js, ts, php, go, rs, c, cpp, java, r,
d, f90), `getRuntimeCommand` appends `input.args` as positional
parameters. The `lang === 'bash'` short-circuit was calling
`executeLocalBash(input.code, config)` and silently discarding
`args`, so `{lang:'bash', code:'echo $1', args:['hi']}` printed an
empty string instead of `hi`.
Fix: when lang is bash AND args is non-empty, route through a new
`executeLocalBashWithArgs` helper that uses the standard
`bash -lc <code> -- <args...>` form so `$1`, `$2`, … resolve. Same
AST validation as the no-args path. The plain `executeLocalBash`
(used by the `bash_tool` factory itself) is unchanged.
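The argv layout, sketched as a pure function; it relies on bash's `-c` convention that the first argument after the script fills `$0`, so `--` takes `$0` and the input args land in `$1`, `$2`, …

```typescript
function bashArgvWithArgs(code: string, args: string[]): string[] {
  // With args: bash -lc <code> -- <args...>; without: bash -lc <code>.
  return args.length > 0 ? ['-lc', code, '--', ...args] : ['-lc', code];
}
```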
Test: `codex review fixes › executeLocalCode bash args › passes
input.args as positional shell parameters when lang is bash`.
#2 — ripgrep availability cache was process-global
--------------------------------------------------
With per-Run `local.exec.spawn` backends (the seam added for the
future remote-engine), a single `let cachedRgAvailable` would let
Run A's "rg works" verdict poison Run B whose backend (e.g. a
remote sandbox without rg) doesn't have it. Run B would skip the
probe, try to spawn rg, throw, and never reach the Node fallback.
Fix: cache is now a `WeakMap<LocalSpawn, Promise<boolean>>` keyed
on the effective backend (whatever `getSpawn(config)` returns).
WeakMap lets disposed backends GC their entry; the test reset hook
re-creates the map.
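A sketch of the backend-keyed cache, assuming `LocalSpawn` is any object identity a WeakMap can key on:

```typescript
type LocalSpawn = object;

const rgAvailable = new WeakMap<LocalSpawn, Promise<boolean>>();

function probeRgOnce(
  backend: LocalSpawn,
  probe: () => Promise<boolean>
): Promise<boolean> {
  let cached = rgAvailable.get(backend);
  if (!cached) {
    // First call for THIS backend runs the probe; later calls reuse it.
    // A different backend gets its own probe, so verdicts never bleed.
    cached = probe();
    rgAvailable.set(backend, cached);
  }
  return cached;
}
```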
Test: `codex review fixes › ripgrep cache backend scope › does not
bleed an "rg available" verdict from one backend to another`.
#3 — `additionalRoots` resolved against process.cwd, not workspace root
----------------------------------------------------------------------
With `workspace: { root: '/repo/app', additionalRoots: ['../shared'] }`,
the intended boundary is `/repo/shared`. The previous
`path.resolve(extra)` anchored to the process cwd, producing
`${process.cwd()}/../shared` — completely different on a server
with a different cwd, and could deny the legitimate sibling or
allow an unrelated directory.
Fix: `getWorkspaceRoots` and `createWorkspacePolicyHook` now both
do `isAbsolute(extra) ? resolve(extra) : resolve(root, extra)`.
Behaviour matches what users would expect from a "monorepo
sibling" config.
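The fix in isolation:

```typescript
import { isAbsolute, resolve } from 'node:path';

// Relative extras anchor to the workspace root, never to process.cwd().
function resolveExtraRoot(root: string, extra: string): string {
  return isAbsolute(extra) ? resolve(extra) : resolve(root, extra);
}
```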
Test: `codex review fixes › additionalRoots resolved against
workspace root › treats relative additionalRoots as siblings of
root, not of process.cwd`.
Tests: 1254 passing (was 1250; +4 new). Lint + tsc clean.
* docs(hitl): unstale the "event-driven only" caveats; pin resume scope
Three JSDoc blocks were honest at PR #134 ship time but went stale
once `6122460` (hooks fire on the legacy non-event-driven path) and
`caed7a2` (HITL interrupt() lifted into the direct path) landed:
- `HumanInTheLoopConfig` — "Scope: event-driven tools only" header
plus the bullet declaring direct-path tools "bypass the hook
system entirely. PreToolUse hooks do not fire for those tools and
HITL approval does not gate them." Both halves are now false.
- `ToolNodeOptions.hookRegistry` — "Only fires for event-driven
tool calls; tools routed through directToolNames bypass hook
dispatch entirely." Now false.
- `ToolNodeOptions.humanInTheLoop` — "the interrupt path is only
wired into the event-driven dispatch (dispatchToolEvents), not
into directToolNames execution — direct tools bypass HITL
entirely." Now false.
Also re-evaluated the "mixed direct + event batches re-execute the
direct half on resume" caveat. New tests in
`directToolHITLResumeScope.test.ts` pin the actual scope:
- When a single tool interrupts via the direct path, its body
runs **once** total. The first pass interrupted before reaching
runTool; the resume pass ran the body after the host's decision
was applied. The PreToolUse hook fires twice (the documented
idempotency caveat).
- When a sibling tool already ran in the same batch before the
interrupting tool, the sibling's body runs **twice** — once on
each pass. This applies symmetrically to event-dispatched and
direct siblings; LangGraph rolls back the entire ToolNode
invocation, not just the direct half.
Updated wording reflects both: HITL spans both paths uniformly, and
the side-effect warning applies to "any tool with side effects in a
batch where another tool interrupts" rather than singling out direct
tools. Hosts that want a tool to skip HITL must omit it from any
registered matcher rather than relying on the path it takes.
Tests: 1256 passing (was 1254; +2 new for resume-scope pinning).
* fix(local): two more Codex review nits — output OOM cap + bash_tool args
#4 P1 — `spawnLocalProcess` could OOM the host on noisy commands
----------------------------------------------------------------
Previous implementation accumulated every stdout/stderr chunk into
in-memory strings and only truncated post-`close`. A noisy command
(`yes`, `cat /dev/urandom | base64`, a verbose build) could stream
gigabytes before the cap kicked in — production-stability risk
since local tools execute arbitrary shell.
Fix: stream-time cap with three layers, all in `spawnLocalProcess`:
1. Per-stream in-memory buffer capped at `maxOutputChars * 2`.
Past that, chunks divert to a lazy temp file (the spill).
2. The spill file is seeded with whatever was already buffered so
it holds the FULL output (head from memory + tail from stream),
not just the post-cap remainder.
3. New `local.maxSpawnedBytes` config (default 50 MiB) hard-kills
the process tree once total stdout+stderr crosses the cap.
Independent from `maxOutputChars` (which only affects what the
model sees) — this is the OOM backstop.
`finish()` now waits for the spill stream to flush before resolving,
so the model never sees `full_output_path: …` for a still-being-
written file.
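The three-layer bookkeeping, sketched with string chunks and in-memory stand-ins for the spill file and the process kill:

```typescript
function createOutputCapSketch(maxOutputChars: number, hardKillBytes: number) {
  let buffered = '';
  let spilled = '';
  let total = 0;
  let killed = false;
  return {
    push(chunk: string) {
      total += chunk.length;
      // Layer 3: hard kill once TOTAL output crosses the cap
      // (hardKillBytes <= 0 means "no cap", per the documented contract).
      if (hardKillBytes > 0 && total > hardKillBytes) {
        killed = true; // stand-in for killing the process tree
        return;
      }
      if (buffered.length < maxOutputChars * 2) {
        // Layer 1: in-memory buffer capped at maxOutputChars * 2.
        buffered += chunk;
      } else {
        // Layer 2: divert to the spill, seeding it with the buffered
        // head so the spill holds the FULL output.
        if (spilled === '') spilled = buffered;
        spilled += chunk;
      }
    },
    state: () => ({ buffered, spilled, total, killed }),
  };
}
```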
Tests pinned: `codex review fixes (round 2) › streaming output cap`:
- `yes` gets killed in ~200ms when capped at 64 KiB (vs the 30s
timeout that would otherwise apply)
- 200 KiB output with an 8 KB inline cap successfully spills; the
file holds more than the in-memory truncation
- small outputs do NOT create a spill file (no perf regression)
#5 P2 — `bash_tool` factory broke positional shell parameters
-------------------------------------------------------------
Same shape as the previous `executeLocalCode` bash-args fix
(`9b5fb85`), but in the `bash_tool` factory itself. It was
appending args literally to the command string:
`${command} ${args.map(shellQuote).join(' ')}`
So `command: 'echo "$1"'` with `args: ['hi']` ran
`echo "$1" hi` — `$1` was empty, `hi` got appended literally.
Tool schema advertised args as execution arguments; the actual
behavior didn't match.
Fix: route through the `executeLocalBashWithArgs` helper (now
exported from LocalExecutionEngine) which uses
`bash -lc <command> -- <args...>` so `$1`, `$2`, … resolve
correctly. Falls back to the no-args `executeLocalBash` when args
is empty (no behavior change for that case).
Test: `bash_tool args › populates positional shell parameters from
input.args` (and a second case for missing args).
Tests: 1261 passing (was 1256; +5 new).
* fix(local): three more Codex nits — config.shell, probe cache scope, grep -e
#6 P1 — validateBashCommand hard-coded DEFAULT_SHELL
----------------------------------------------------
The `bash -n -c <command>` syntax preflight was hard-coded to
DEFAULT_SHELL ('bash' on POSIX, 'bash.exe' on Windows), so a Run
configured with `local.shell: '/bin/sh'` (or pointing at zsh, dash,
or any host where bash isn't available) would get its commands
syntax-validated against a different binary than the one that would
actually execute them. Valid commands could be rejected before
execution; commands relying on bashisms could be falsely accepted
on hosts without bash.
Fix: route the syntax preflight through `config.shell ?? DEFAULT_SHELL`
so validation uses the same binary the actual execution path uses.
Test: `codex review fixes (round 3) › validateBashCommand honours
configured shell › routes the -n preflight through local.shell when
set`. Intercepts the spawn calls and asserts the syntax check used
`/bin/sh`, not the DEFAULT_SHELL fallback.
#7 P2 — syntax-check probe cache was process-global
---------------------------------------------------
Same shape as the ripgrep-cache fix in `9b5fb85`: per-Run
`local.exec.spawn` backends mean a "node available" / "python
available" / "bash available" verdict from one backend's probe
could poison subsequent Runs whose backend (e.g. a remote sandbox
that has python but not node) doesn't have those tools.
Fix: cache is now `WeakMap<LocalSpawn, ProbeCache>` keyed on the
effective spawn backend (whatever `getSpawn(config)` returns).
Disposed backends GC their entry; the test reset hook re-creates the
map. Mirrors the ripgrep cache exactly.
Test: `codex review fixes (round 3) › syntax-check probe cache is
backend-keyed › does not bleed an "rg/node/python available" verdict
from one backend to another`.
#8 P2 — grep pattern parsed as flag when dash-prefixed
------------------------------------------------------
The `grep_search` tool sent `input.pattern` as a positional arg to
`rg`. A pattern like `-foo` got parsed as an unknown flag and rg
bailed out instead of searching for it. `rg --help` explicitly
requires `-e/--regexp` (or `--`) for dash-prefixed patterns.
Fix: pass the pattern via `-e <pattern>`. Same trick also defends
against any future flag-conflict if a user query happens to look like
an rg long option. The Node fallback path doesn't need this — JS
RegExp accepts dash-prefixed patterns natively.
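The argv change, sketched with a minimal flag set:

```typescript
function rgSearchArgs(pattern: string, path: string): string[] {
  // `-e` marks the next argv entry as the pattern, so dash-prefixed
  // patterns like '-foo' can't be parsed as flags by rg.
  return ['--line-number', '-e', pattern, path];
}
```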
Test: `codex review fixes (round 3) › grep passes pattern via -e ›
handles dash-prefixed patterns without rg interpreting them as flags`.
Tests: 1264 passing (was 1261; +3 new).
* fix(local): two more P1 Codex nits — quoted destructive + symlink-escape
#9 P1 — dangerous-command gate missed quoted targets
----------------------------------------------------
`validateBashCommand` checks `dangerousCommandPatterns` against
`normalized` — the post-`stripQuotedContent` form. That pass blanks
the contents of every quoted span, so destructive targets that the
LLM (accidentally or not) wraps in quotes — `rm -rf "/"`,
`rm -rf "$HOME"`, `chmod -R 777 "/"` — slip past the regex even
though they execute identically to the bare form. Real safety
regression on the default-on guard.
Fix: split the destructive check into two passes.
1. The existing `dangerousCommandPatterns` continue to run against
`normalized`. Same false-positive defence (`echo "rm -rf /"`
stays valid because the destructive text is wrapped by the
OUTER quote pair, not by quotes around the path argument).
2. New `quotedDestructivePatterns` run against the ORIGINAL
command string. Each pattern explicitly REQUIRES a matching
quote pair around the destructive path (`"/"`, `"$HOME"`,
`'/'`, etc.). `echo "rm -rf /"` doesn't match — the `/` isn't
surrounded by quotes there, only the whole `rm -rf /` string is.
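The two-pass check, sketched with one illustrative pattern per pass (the real guard carries a fuller pattern list):

```typescript
// Pass 1 input: blank the contents of quoted spans, keep the quotes.
function stripQuotedContent(cmd: string): string {
  return cmd.replace(/"[^"]*"/g, '""').replace(/'[^']*'/g, "''");
}

// Pass 1: bare destructive form, checked against the stripped command.
const dangerousBare = /rm\s+-rf\s+\/(\s|$)/;
// Pass 2: REQUIRES a quote pair around the destructive path itself,
// checked against the ORIGINAL command string.
const dangerousQuoted = /rm\s+-rf\s+(?:"(?:\/|\$HOME)"|'(?:\/|\$HOME)')/;

function isDestructive(cmd: string): boolean {
  return dangerousBare.test(stripQuotedContent(cmd)) ||
    dangerousQuoted.test(cmd);
}
```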
Tests: `codex review fixes (round 4) › quoted destructive targets`
covers `"/"`, `"$HOME"`, `'/'`, `chmod -R 777 "/"`, no-regression
on bare forms, and the `echo "rm -rf /"` no-false-positive case.
#10 P1 — workspace policy hook compared lexical paths only
----------------------------------------------------------
`createWorkspacePolicyHook` did `isAbsolute(p) ? resolve(p) :
resolve(root, p)` and called it done. A symlink inside the workspace
pointing outside (e.g. `workspace/escape → /etc/secret`) lexically
appears under `workspace/` and got auto-allowed — even though the
agent reading through it touches a path the policy meant to gate.
Critical when the hook is the primary gate: the documented "ask the
user" setup sets `allowReadOutside: true` so the file tools' own
clamp is off, leaving the hook as the sole defence.
Fix: the hook now realpaths both the candidate and the workspace
roots before comparing. Roots are pre-realpath'd once at construction
(memoized via a Promise so the per-call cost is one realpath) and
candidate paths get the same `realpathOfPathOrAncestor` walk
`resolveWorkspacePathSafe` already uses — so paths that don't yet
exist (e.g. `write_file` to a brand-new path) still get checked
against their nearest existing ancestor's realpath.
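A sync sketch of the ancestor walk, assuming POSIX paths (the real helper is async and memoizes the roots' realpaths via a Promise):

```typescript
import { existsSync, realpathSync } from 'node:fs';
import { basename, dirname, join } from 'node:path';

function realpathOfPathOrAncestorSketch(p: string): string {
  let current = p;
  const missingTail: string[] = [];
  // Walk up until an existing ancestor is found; the non-existent tail
  // is re-appended untouched after realpathing that ancestor.
  while (!existsSync(current)) {
    const parent = dirname(current);
    if (parent === current) break; // reached the filesystem root
    missingTail.unshift(basename(current));
    current = parent;
  }
  const real = existsSync(current) ? realpathSync(current) : current;
  return missingTail.reduce((acc, segment) => join(acc, segment), real);
}
```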
Two test cases pin both halves:
- `workspace/escape → extra/secret.txt` — hook DENIES read of
`escape` even though it's lexically inside.
- `extra/alt-mount → workspace/` — hook ALLOWS read of
`extra/alt-mount/file.ts` because the realpath lands inside the
workspace root (covers the alternate-mount case so we don't
over-correct).
Tests: 1272 passing (was 1264; +8 across both fixes).
* fix(local): treat maxSpawnedBytes=0 as unlimited (Codex P2 #11)
The public LocalExecutionConfig contract documents `maxSpawnedBytes: 0`
as "no cap", but the kill check `totalSpawnedBytes > hardKillBytes`
triggered on the very first byte when the cap was zero. Skip the kill
path entirely when `hardKillBytes <= 0` so configs that explicitly opt
out of the OOM backstop behave as documented.
* chore(scripts): forward ON_RUN_STEP in pi-vs-ours harness
Without an ON_RUN_STEP handler the aggregator's stepMap is empty when
ON_RUN_STEP_COMPLETED arrives, so every tool call logged "No run step
or runId found for completed step event" to stderr. Mirror the pattern
used in src/scripts/code_exec.ts (and friends): register both events.
Harness-only — no SDK behavior change.
* fix(local): ESM-safe spill + surface rg failures in glob_search (Codex P1 #12, P2 #13)
Two issues from the latest review pass:
1) `spawnLocalProcess`'s spill path used `require('fs')` inside an
ESM-shipped module. Fine in CJS test runs, but it throws
`ReferenceError: require is not defined` for any ESM consumer that
triggers the overflow path. Switch to a static
`import { createWriteStream } from 'fs'` so both build outputs work.
2) `createLocalGlobSearchTool` ignored ripgrep's exit code and stderr,
so `rg --files <missing-target>` (exit 2) silently mapped to "No
files found." — the agent then treated a tooling failure as a real
absence of matches. Now we check `result.exitCode > 1` (and
`timedOut`) and return an explicit `glob_search failed: <stderr>`
string + an artifact carrying the error.
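The exit-code handling, sketched against rg's documented convention (0 = matches found, 1 = no matches, >1 = real error); the result shape is illustrative:

```typescript
interface SpawnResultSketch {
  exitCode: number;
  stdout: string;
  stderr: string;
  timedOut?: boolean;
}

function interpretGlobResult(r: SpawnResultSketch): string {
  // Exit 1 is a genuine "nothing matched"; anything above is a tooling
  // failure that must be surfaced, not mapped to an empty result.
  if (r.timedOut || r.exitCode > 1) {
    return `glob_search failed: ${r.stderr.trim()}`;
  }
  const files = r.stdout.split('\n').filter(Boolean);
  return files.length > 0 ? files.join('\n') : 'No files found.';
}
```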
Tests pinned: `codex review fixes (round 5) › spill path is ESM-safe`
and `… › glob_search surfaces ripgrep failures`. The glob test uses an
injected spawn backend so it runs deterministically on hosts without
ripgrep installed.
* fix(local): sandbox loopback bridge + additionalRoots writes (Codex P1 #14, P2 #15)
Two issues with how `buildSandboxRuntimeConfig` lays out the
SandboxRuntimeConfig:
1) The local programmatic-tool bridge listens on 127.0.0.1, but the
sandbox default `allowedDomains: []` denies all outbound network —
so `run_tools_with_code` / `run_tools_with_bash` fail under sandbox
even though they work unsandboxed. Seed allowedDomains with
`127.0.0.1`, `localhost`, `::1` (skip any host the user explicitly
listed in `deniedDomains`, and dedupe against user-supplied
`allowedDomains`).
2) The sandbox `allowWrite` allowlist was seeded with `cwd` only, so
`workspace.additionalRoots` paths were resolvable by file tools but
blocked for sandboxed shell/code. Now seed from `getWorkspaceRoots`
so both surfaces share the same boundary.
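The seeding logic, sketched with plain string lists:

```typescript
function seedAllowedDomains(
  userAllowed: string[],
  userDenied: string[]
): string[] {
  const loopback = ['127.0.0.1', 'localhost', '::1'];
  const merged = [...userAllowed];
  for (const host of loopback) {
    // Respect an explicit user deny; dedupe against user-supplied
    // allowedDomains entries.
    if (!userDenied.includes(host) && !merged.includes(host)) {
      merged.push(host);
    }
  }
  return merged;
}
```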
Exported `buildSandboxRuntimeConfig` so the round-6 tests can drive it
directly without standing up the real native sandbox runtime. Test
count: 1283 passing (was 1276). Lint baseline unchanged.
* fix(direct-path): HITL edit reads updatedInput; PostToolUse updates registry (Codex P1 #16, P2 #17)
Two bugs in `runDirectToolWithLifecycleHooks`:
1) The `edit` decision branch read `decision.args`, but the documented
`ToolApprovalDecision` shape uses `updatedInput`. Hosts following
the public type signature were silently ignored — the tool ran
with the original (un-edited) arguments. Now mirrors the
event-driven path's validation: read `updatedInput`, fail closed
on missing/wrong-shaped payloads (returns an error ToolMessage
instead of executing with undefined args).
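The validation, sketched with a minimal decision shape (illustrative names, not the public `ToolApprovalDecision` type):

```typescript
interface EditDecisionSketch {
  type: 'edit';
  updatedInput?: unknown;
}

type EditOutcome =
  | { ok: true; args: Record<string, unknown> }
  | { ok: false; error: string };

function readEditedArgs(d: EditDecisionSketch): EditOutcome {
  // Fail closed on missing/wrong-shaped payloads instead of executing
  // the tool with undefined args.
  if (
    d.updatedInput == null ||
    typeof d.updatedInput !== 'object' ||
    Array.isArray(d.updatedInput)
  ) {
    return { ok: false, error: 'edit decision missing updatedInput' };
  }
  return { ok: true, args: d.updatedInput as Record<string, unknown> };
}
```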
2) When a `PostToolUse` hook returns `updatedOutput`, the direct
path returned a new ToolMessage with the replacement content but
never updated the `toolOutputReferences` registry. Subsequent
`{{tool<i>turn<n>}}` substitutions then delivered the stale
pre-hook bytes while the model observed the post-hook
replacement. Now reads `_refKey`/`_refScope` from the message
metadata (already stamped by `recordOutputReference`) and calls
`toolOutputRegistry.set` with the replacement.
Tests: `direct-path HITL: resume scope > edit decision (Codex P1 #16)`
covers both the happy path (edited args reach the body) and the
fail-closed branch. `ToolNode tool output references > PostToolUse
updatedOutput updates the registry (Codex P2 #17)` chains two calls
where the second references the first via {{tool0turn0}} and asserts
the substituted value matches the post-hook content.
* fix: snapshot direct-batch args + curl-first bash bridge (Codex P1 #18, P2 #19)
P1 #18 (ToolNode.ts): direct-path tools resolved {{tool…turn…}}
placeholders against the LIVE registry inside `runTool`, which runs
AFTER awaiting PreToolUse hooks. In a `Promise.all`-driven batch a
slow hook on call A could let call B finish first and register its
output, then A's late re-resolve would substitute B's output into A's
args — order-dependent leakage that violates same-turn isolation.
Snapshot the registry once per batch in `run()` (mirrors the
event-driven path), thread it through `RunToolBatchContext`, and have
both `runDirectToolWithLifecycleHooks` and `runTool` resolve against
the snapshot when present.
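The snapshot semantics, sketched with a Map-backed registry and string args:

```typescript
function resolvePlaceholders(
  args: string,
  registry: ReadonlyMap<string, string>
): string {
  return args.replace(/\{\{(tool\d+turn\d+)\}\}/g, (m, key) =>
    registry.get(key) ?? m
  );
}

// Snapshot once per batch: late writes by sibling calls inside the same
// Promise.all batch are invisible to every call in that batch.
function snapshotRegistry(live: Map<string, string>): Map<string, string> {
  return new Map(live);
}
```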
P2 #19 (LocalProgrammaticToolCalling.ts): the bash bridge required
`python3` on PATH, breaking `run_tools_with_bash` on minimal
containers and Windows hosts without python3 installed. Bridge now
serves a `?mode=text` variant that returns the already-serialized
result body so bash callers can use `curl` (universally available)
without pulling in a JSON parser. Bash helper tries curl first, falls
back to python3 for environments without curl, errors helpfully if
neither is available.
Tests pinned: `ToolNode tool output references › direct-batch
snapshot isolation (Codex P1 #18)` (slow PreToolUse hook + sibling
finishes-first scenario), and `ProgrammaticToolCalling › bash bridge
script does not require python3 (Codex P2 #19)` (asserts both helper
branches present in the generated script with curl preferred).
* fix(local): block `--` end-of-options + gate compile_check overrides (Codex P1 #20, P1 #21)
P1 #20: the destructive-command guard required the target path to
follow option flags directly, so `rm -rf -- "/"` (and similarly
`chmod -R 777 -- "/"`) bypassed the check via the GNU/BSD
end-of-options marker. Both `dangerousCommandPatterns` and
`quotedDestructivePatterns` now allow an optional `(?:--\\s+)?`
between the flags and the destructive target.
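One illustrative pattern with the optional marker (the real guard applies it across both pattern lists):

```typescript
// Flags, then an optional GNU/BSD end-of-options marker, then the
// destructive target.
const destructiveRmSketch = /rm\s+(?:-[a-zA-Z]+\s+)*(?:--\s+)?\/(\s|$)/;
```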
P1 #21: `compile_check` ran the resolved command through `bash -lc`
without going through `validateBashCommand`, so a host-supplied
`command` override (or `compileCheck.command` config entry) could
execute mutating/destructive commands even with `readOnly: true` or
the dangerous-command guard normally in force. Now routes through
`validateBashCommand(detection.command, config)` first; on failure
returns an explanatory ToolMessage with `ran: false` rather than
spawning the process.
Tests pinned: `codex review fixes (round 6) › destructive guard
handles `--` end-of-options` (5 cases including a benign-`--`
no-regression check) and `… › compile_check enforces
validateBashCommand + readOnly` (rm -rf, touch under readOnly,
benign echo). Lint baseline unchanged.
* chore: audit-pass cleanups + missing test coverage (manual review #1, #3, #5, #7, #9, #10, #11)
Comprehensive-review audit findings, fixed in one pass:
#1 (attachments full-file MIME sniff): switch classifyAttachment to
open() + 12-byte header read for sniffing; only readFile() the
full buffer when we're about to embed. A 9 MB image now allocates
12 bytes for sniffing instead of 9 MB.
#3 (fallback walker SKIP_DIRS too small): expanded the Node-fallback
walker's skip set to cover dist/build/.next/.cache/__pycache__/
.venv/venv/target/vendor/coverage/.tox/.mypy_cache and friends.
ripgrep itself respects .gitignore so this only affects the
fallback.
#5 (dynamic fs/promises import): replaced two `await import('fs/
promises').then(...)` calls in CompileCheckTool with a static
`import { readFile, stat } from 'fs/promises'` per AGENTS.md.
#7 (test-only resets in public API): added @internal JSDoc to all
three `_reset*ForTests` functions so IDE autocomplete signals
they aren't part of the SDK surface. (The leading underscore was
convention-only; @internal is the standard tag.)
#9 (legacy fields missing @deprecated): added @deprecated tags to
`LocalExecutionConfig.cwd` and `allowOutsideWorkspace` pointing
to their workspace.* replacements.
#10 (dead lexicallyInside compute in workspace policy hook): dropped
the unused variable. Realpath is the source of truth for both
the symlink-escape and alternate-mount cases; the lexical check
provides no information so we don't pay for it.
#11 (voided unused imports in pi comparison script): removed the
`void readdir; void cp;` workaround and the dead imports they
were silencing.
Bonus: also addressed the comprehensive review's #2 (bitwise OR) by
verifying it was a misread — the code uses `||`, not `|`. Manual
review's reading was wrong; no fix needed.
Larger findings addressed:
#4 (no editStrategies tests): added
`src/tools/local/__tests__/editStrategies.test.ts` with 12 cases
covering exact / line-trimmed / whitespace-normalized /
indentation-flexible match boundaries, start/end-of-file matches,
unicode content, multi-line needles spanning blank lines,
ambiguous-match rejection, and applyEdit semantics.
#12 (no FileCheckpointer tests): added
`src/tools/local/__tests__/FileCheckpointer.test.ts` with 6
cases covering capture+rewind happy path, idempotent capture,
absent-file rewind (deletion), multi-file rewind, oversize-file
tracking-but-not-snapshotting, and rewind across a deleted
parent directory.
Performance:
#8 (lineWindow split-whole-file): replaced `content.split('\\n')`
in `lineWindow` with a newline-walk that's O(start + limit)
instead of O(file). For a 10 MB file with `offset: 1, limit: 10`
this drops from "split the whole file into millions of strings"
to ~10 indexOf calls. Falls back to the simple split when no
limit is supplied.
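The newline-walk might look like this (illustrative; the SDK's `lineWindow` handles more edge cases), using a 1-based `offset` as in the commit:

```typescript
// O(offset + limit) window over a large string: walk newlines with indexOf
// instead of splitting the whole file into an array of lines.
function lineWindow(content: string, offset: number, limit?: number): string {
  if (limit === undefined) {
    // No limit supplied: the simple whole-file split is acceptable.
    return content.split('\n').slice(offset - 1).join('\n');
  }
  let start = 0;
  for (let i = 1; i < offset; i++) {
    const nl = content.indexOf('\n', start);
    if (nl === -1) return ''; // offset points past the end of the file
    start = nl + 1;
  }
  let end = start;
  for (let i = 0; i < limit; i++) {
    const nl = content.indexOf('\n', end);
    if (nl === -1) return content.slice(start); // window runs to end of file
    end = nl + 1;
  }
  return content.slice(start, end);
}
```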
Deferred:
#6 (spill files never cleaned) needs a Run-lifecycle hook to run
cleanup; tracked separately rather than fixed inline.
* fix: comprehensive review (round 7) — bridge hooks, sandbox latch, nested-shell, regex DoS, attachments via WorkspaceFS, expose checkpointer
Codex P2 (sandbox warning latch): syntax preflight, ripgrep-availability,
and node/python/bash probes used to spawn through the public path that
fires `maybeWarnSandboxOff`, both emitting a misleading "sandbox is off"
warning when the run had `sandbox.enabled: true` AND latching
`sandboxOffWarned = true` so a later genuinely-unsandboxed execution
silently skipped its warning. Added an opt-in `{ internal: true }`
options arg to spawnLocalProcess that suppresses both. Threaded it
through every internal probe.
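The warn-and-latch logic can be sketched as follows (names from the commit; the actual signatures are assumptions):

```typescript
// Module-level latch: only genuinely user-visible unsandboxed spawns
// should warn, and only the first one should consume the latch.
let sandboxOffWarned = false;

function maybeWarnSandboxOff(sandboxEnabled: boolean, internal: boolean): string | undefined {
  if (internal || sandboxEnabled || sandboxOffWarned) return undefined;
  sandboxOffWarned = true;
  return 'warning: sandbox is off';
}
```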
Manual #A (P1): the in-process programmatic-tool bridge invoked inner
tools via `executeTools` directly, bypassing every PreToolUse hook the
host registered. ToolNode now plumbs a `hookContext` (registry +
runId/threadId/agentId) into the programmatic-tool factory; the bridge
runs PreToolUse before each inner tool.invoke. `deny` and `ask` (HITL
not reachable from inside an HTTP handler) fail closed; `updatedInput`
is applied to the inner tool args. 4 tests pinned.
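A minimal sketch of the fail-closed bridge (hook and tool shapes are assumptions, not the SDK's real types):

```typescript
type HookDecision = { action: 'allow' | 'deny' | 'ask'; updatedInput?: unknown };
type PreToolUseHook = (toolName: string, input: unknown) => Promise<HookDecision>;

// Run the host's PreToolUse hook before each inner tool.invoke; `deny`
// and `ask` (HITL is unreachable inside an HTTP handler) fail closed.
async function invokeWithHook(
  hook: PreToolUseHook,
  toolName: string,
  input: unknown,
  invoke: (input: unknown) => Promise<unknown>,
): Promise<unknown> {
  const decision = await hook(toolName, input);
  if (decision.action !== 'allow') {
    throw new Error(`Tool "${toolName}" blocked by PreToolUse hook (${decision.action})`);
  }
  // Apply the hook's updatedInput to the inner tool args when present.
  return invoke(decision.updatedInput ?? input);
}
```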
Manual #B (P1): already fixed in c6e2632 (round 9 — compile_check
routes through validateBashCommand). The reviewer was at 75a54b3,
which predates that commit.
Manual #C (P1): `bash -lc "rm -rf $HOME"` (and friends) bypassed the
destructive-command guard because `stripQuotedContent` blanks the
nested payload before the patterns run. Added a third pattern list
`nestedShellDestructivePatterns` that runs against the original
command and matches destructive ops (`rm -rf`, `chmod -R 777`,
`chown -R`) inside `<shell> -[l]?c "..."` and `eval "..."` payloads.
4 tests including a benign-nested-echo no-regression check.
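The shape of such a pattern might look like this (illustrative only; the SDK's `nestedShellDestructivePatterns` list is broader):

```typescript
// Runs against the ORIGINAL command string, before quoted content is
// stripped, so the nested payload is still visible.
const nestedShellDestructive =
  /(?:\b(?:ba|z)?sh\s+-l?c|\beval)\s+["'][^"']*\b(?:rm\s+-[a-zA-Z]*r[a-zA-Z]*f|chmod\s+-R\s+777|chown\s+-R)\b/;

nestedShellDestructive.test('bash -lc "rm -rf $HOME"'); // blocked
nestedShellDestructive.test('eval "chmod -R 777 /"');   // blocked
nestedShellDestructive.test('bash -lc "echo hi"');      // benign, allowed
```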
Manual #D (P2): `fallbackGrep` compiled the model-supplied pattern via
`new RegExp` and ran it across every file, opening the door to
catastrophic-backtracking DoS (`(a+)+$`-style patterns). Added a
guardrail layer: 1024-char pattern length cap, nested-quantifier
heuristic rejection, 5-second wall-clock budget for the overall
search. Surfaced as a structured `FallbackGrepError` -> tool result
with engine: 'node-fallback' + the kind tag. Hosts that need
bulletproof regex safety should install ripgrep (its Rust regex
engine is a finite automaton in the RE2 family, with no
backtracking). 2 tests pinned.
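The first two guardrail layers can be sketched as follows (the constants and the heuristic regex are illustrative; the real code additionally enforces the wall-clock budget):

```typescript
const MAX_PATTERN_LENGTH = 1024;
// Heuristic: a quantified group immediately followed by another quantifier,
// e.g. (a+)+ or (\w*)*, the classic catastrophic-backtracking shape.
const nestedQuantifier = /\([^)]*[+*]\)[+*{]/;

function checkPattern(pattern: string): { ok: boolean; reason?: string } {
  if (pattern.length > MAX_PATTERN_LENGTH) return { ok: false, reason: 'pattern-too-long' };
  if (nestedQuantifier.test(pattern)) return { ok: false, reason: 'nested-quantifier' };
  return { ok: true };
}
```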
Manual #E (P2): `local.fileCheckpointing: true` in RunConfig was a
silent no-op in the auto-bind path — `createLocalCodingTools()` made
a checkpointer that was immediately discarded. Now the auto-bind path
uses `createLocalCodingToolBundle()` when checkpointing is on,
returns the checkpointer through `ResolveLocalToolsResult`, and
ToolNode stashes it + exposes via `getFileCheckpointer()`. 2 tests
pinned.
Manual #F (P3): `classifyAttachment` used host `fs/promises` directly
instead of the configured `WorkspaceFS`. A custom/remote engine could
fail to embed valid attachments OR accidentally read a host path with
the same absolute name. Added optional `fs` arg to classifyAttachment
threaded from `read_file` through to its `WorkspaceFS`. Backward
compatible — defaults to host fs/promises when no fs is supplied.
Tests touched: 768 passing across all suites (was 750ish). Lint
baseline unchanged.
* chore: address audit-of-audit findings (4/4 valid)
Follow-up audit on commit ebf8b6a flagged 4 nits/minors. All valid:
F1 (NIT): the lineWindow perf fix introduced exactly the same
`void <var>` pattern that finding #10 had me remove from
createWorkspacePolicyHook. Embarrassing — dropped the dead
`line` variable + `line++` + `void line`.
F2 (MINOR): the editStrategies test commit message claimed coverage
of "ambiguous-match rejection" but no test actually exercised the
multi-match → null path. Added the missing case for the exact
strategy. Noted in a comment that the looser strategies in the
chain may still resolve duplicates unambiguously by falling
through; whether that's correct is a separate design call.
F3 (NIT): the new FileCheckpointer.test.ts had an `import('fs/
promises')` inside a catch block when `mkdir` was already
available via the static import at the top. Switched to the
static `mkdir` import and restructured the test to
mkdir-then-writeFile unconditionally (the catch-and-retry
pattern was unnecessary defensive plumbing).
F4 (NIT): the unused-imports cleanup missed `sum`/`void sum` in
the comparison harness. Removed the function and its void.
* fix: expose file checkpointer through Run/Graph (audit follow-up)
Round-7 finding E only fixed half the problem: ToolNode got
`getFileCheckpointer()` so direct `new ToolNode(...)` users could
reach it, but the normal `Run.create(...)` path constructed
ToolNodes inline inside `StandardGraph.initializeTools` and dropped
the reference. So `RunConfig.toolExecution.local.fileCheckpointing:
true` was still a no-op for Run callers — and the public JSDoc on
`fileCheckpointing` referenced a `Run.rewindFiles()` that didn't
exist.
Now:
- Graph adds `getOrCreateFileCheckpointer()` — lazily constructs ONE
per-Run checkpointer when local fileCheckpointing is on. Cleared
in `clearHeavyState()` along with the other per-Run state.
- `StandardGraph.initializeTools` threads it into BOTH ToolNode
constructions (the agentContext branch and the traditional-tools
branch). Multi-agent graphs end up sharing a single snapshot
store, so `rewind()` reaches writes from any agent.
- ToolNode accepts `fileCheckpointer?` in its constructor and
passes it to `resolveLocalExecutionTools`, which forwards into
`createLocalCodingToolBundle({ checkpointer })`. Caller-provided
wins over the auto-created one.
- Run gains `getFileCheckpointer()` and `rewindFiles()` — the
documented APIs. `rewindFiles()` returns 0 when checkpointing is
disabled (no-op rather than throwing).
- `LocalExecutionConfig.fileCheckpointing` JSDoc updated to
reference the now-real APIs.
Tests pinned: `… › fileCheckpointer reachable through
Run.getFileChec…
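The lazy one-per-Run pattern described above can be sketched as follows (the class shapes are assumptions; only the names come from the commit):

```typescript
// Minimal stand-in for the real FileCheckpointer.
class FileCheckpointer {
  private snapshots = new Map<string, string>();
  capture(path: string, content: string): void {
    if (!this.snapshots.has(path)) this.snapshots.set(path, content); // idempotent
  }
  rewind(): number {
    const restored = this.snapshots.size; // real impl writes snapshots back to disk
    this.snapshots.clear();
    return restored;
  }
}

class GraphSketch {
  private fileCheckpointer?: FileCheckpointer;
  // One shared instance per Run, so multi-agent graphs share a snapshot store.
  getOrCreateFileCheckpointer(): FileCheckpointer {
    return (this.fileCheckpointer ??= new FileCheckpointer());
  }
  clearHeavyState(): void {
    this.fileCheckpointer = undefined;
  }
}

class RunSketch {
  constructor(private graph: GraphSketch, private checkpointingEnabled: boolean) {}
  getFileCheckpointer(): FileCheckpointer | undefined {
    return this.checkpointingEnabled ? this.graph.getOrCreateFileCheckpointer() : undefined;
  }
  rewindFiles(): number {
    // Returns 0 when checkpointing is disabled: no-op rather than throwing.
    return this.getFileCheckpointer()?.rewind() ?? 0;
  }
}
```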
`CodeEnvFile` and `FileRef` gain optional `entity_id`, propagated by `ToolNode`'s `runTool` and `getCodeSessionContext` paths so codeapi can resolve a per-file sessionKey. Without this, an execute that mixes files uploaded under different entities (e.g. a skill file plus a user attachment) collapses to a single request-level entity and one side 403s during file authorization. Additive: absent `entity_id` preserves existing behavior end-to-end.
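The per-file resolution might be sketched like this (the session-key format is invented for illustration; only the optional `entity_id` field comes from the commit):

```typescript
interface FileRef { file_id: string; entity_id?: string }

// Resolve a per-file session key, falling back to the request-level entity
// when a file doesn't carry its own entity_id (the additive default).
function sessionKeyFor(file: FileRef, requestEntityId: string): string {
  return `${file.entity_id ?? requestEntityId}:${file.file_id}`;
}
```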
* 🧬 feat: Inherit Parent Configurable Into Subagent Workflow
`SubagentExecutor` previously invoked the child workflow with a fresh
`configurable: { thread_id: childRunId }`, dropping every host-set
field on the parent's outer `configurable` (e.g. `requestBody`,
`user`, `userMCPAuthMap`). Subagent tool dispatches to the parent's
`ON_TOOL_EXECUTE` handler then arrived with that stripped view, so any
host that reads context from the configurable saw `undefined` for
fields that work fine for the parent agent — placeholder substitution
and per-user lookups silently failed in subagent contexts even though
the same handler runs in the same process.
Changes:
- `SubagentExecuteParams.parentConfigurable?: Record<string, unknown>` —
optional snapshot of the parent's `config.configurable`. The
executor spreads it into the child's `workflow.invoke` configurable,
scrubbing `thread_id` / `run_id` / `parent_run_id` (so the child
doesn't claim parent's run identity), then sets `thread_id` to the
new `childRunId`. Anything else the host put on parent's
configurable is forwarded as-is — the SDK doesn't interpret
host-defined keys.
- `Graph.ts` — the spawn-subagent tool's `_call` now passes
`parentConfigurable: config.configurable`. The closure already has
the parent's runnable config in scope, so this is a one-line wire-up.
- Live verification script (`src/scripts/subagent-configurable-inheritance.ts`)
spawns a self-subagent that uses a calculator tool, captures
`data.configurable` from each `ON_TOOL_EXECUTE` dispatch, and asserts
every key the parent put on its outer configurable shows up on the
subagent dispatch (and `thread_id` is overridden). Passes against
`gpt-4o-mini`.
- Unit tests cover: forwarding host fields, overriding parent
`thread_id`, scrubbing run-identity fields, and back-compat for
hosts that omit `parentConfigurable`.
`parentConfigurable` is optional, so existing callers continue to
work unchanged — they just keep the old isolated behavior.
* v3.1.78
* 🪄 fix: Drop run-identity scrubbing from inherited configurable
Reviewer pushback: "scrubbing `run_id`/`parent_run_id`/`thread_id` for
safety" was the wrong default. The subagent IS the same run /
conversation from the user's perspective, and `SubagentExecutor`
already commits to that scope by firing `SubagentStart` / `SubagentStop`
hooks against `this.parentRunId`. Scrubbing run-identity fields
contradicts that contract.
Changes:
- Forward `parentConfigurable` verbatim into the child workflow's
`configurable` — no key-level deletes. `thread_id` falls back to
`childRunId` only when parent didn't supply one (was: `thread_id`
always overridden to `childRunId`).
- Updated the live verification script to capture run-identity fields
alongside host-set keys and dump them on each dispatch. Empirical
finding (verified against gpt-4o + LangGraph runtime):
* `thread_id`: propagates as inherited ✅
* `parent_run_id`: propagates as inherited ✅
* `run_id`: LangGraph runtime overwrites at child invoke time —
the SDK forwards what we put in, but the runtime swaps it out
before tools see it. Host can't reliably propagate `run_id` via
configurable.
Type doc updated to reflect this so callers aren't misled.
- The script now uses a non-self subagent (math-worker with calculator;
parent has no tools) so the spawn path is forced and parent vs
subagent dispatches are reliably distinguishable by `agentId`.
- Unit tests updated: replaced the "scrubs run-identity fields" test
with "inherits parent thread_id when supplied" + "falls back to
childRunId when not" + "forwards run-identity fields verbatim into
child invoke configurable" (with a note that LangGraph runtime
overrides `run_id` downstream).
A future revision will likely make this inheritance configurable per
spawn type — background / async subagents may want isolation rather
than sharing parent's host context. Out of scope here.
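The final inheritance semantics reduce to something like this sketch (the helper name is invented; field names come from the commit):

```typescript
// Forward the parent's configurable verbatim; thread_id falls back to the
// child run id only when the parent didn't supply one.
function buildChildConfigurable(
  parentConfigurable: Record<string, unknown> | undefined,
  childRunId: string,
): Record<string, unknown> {
  return {
    ...parentConfigurable,
    thread_id: parentConfigurable?.thread_id ?? childRunId,
  };
}
```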
Created by pull[bot] (v2.0.0-alpha.4)