diff --git a/designs/0004-sandboxes-and-codeact.md b/designs/0004-sandboxes-and-codeact.md new file mode 100644 index 000000000..328e562c7 --- /dev/null +++ b/designs/0004-sandboxes-and-codeact.md @@ -0,0 +1,1197 @@ +# Sandboxes and code execution + +## Overview + +Strands tools that execute code (shell, python_repl, file operations) hardcode their execution environment. This design introduces a Sandbox abstraction that decouples tool logic from where code runs, a programmatic tool calling tool that orchestrates tools via code, and a CodeAct plugin that replaces standard tool calling with code-based orchestration. Together, these enable tool reuse across environments (local, Docker, AgentCore) and unlock higher-level agent paradigms. + +## Problem statement + +### Tools are coupled to their execution environment + +Today, each tool that runs code manages its own execution. `python_repl` creates a PTY and forks a process. `shell` does the same with its own `CommandExecutor`. `code_interpreter` talks to AgentCore's sandbox API. Each tool reimplements filesystem access, process management, and output capture independently. + +This coupling creates real problems. + +When you deploy an agent to AgentCore Runtime, tools that assume local filesystem access break immediately. Issue [#335](https://github.com/strands-agents/tools/issues/335) documents this: `python_repl` writes to `Path.cwd()`, `journal` writes to the home directory, `workflow` writes to `~/.strands/workflows`. The only writable directory in Runtime is `/tmp`. Users work around this by setting environment variables per tool (`PYTHON_REPL_PERSISTENCE_DIR=/tmp/workflows/repl`, `STRANDS_WORKFLOW_DIR=/tmp/workflows`), but every tool needs its own workaround. + +There is no way to say "use the same shell tool but run it inside a Docker container." You cannot take a working local agent and deploy it to a sandboxed environment without rewriting tool configurations or swapping tool implementations entirely. 
+
+### No shared execution environment across tools
+
+When an agent uses `shell` to create a file and then `python_repl` to read it, these tools operate in separate execution contexts. There is no shared filesystem, no shared working directory, no shared environment variables. Each tool is an island.
+
+A comment on issue #335 captures this well:
+
+> Should we have a `STRANDS_WORKING_DIR` env variable and config or equivalent, so that all tools can get a common directory to write these files to, that isn't the current working directory?
+
+The answer is yes, but the solution needs to go deeper than an environment variable. Tools need a shared execution environment.
+
+### Code-based tool orchestration is ad hoc
+
+Published results suggest that having agents write code to call tools (instead of making individual JSON tool calls) can substantially reduce token usage (reductions of up to 98.7% have been reported) and improve accuracy. Anthropic's [advanced tool use](https://www.anthropic.com/engineering/advanced-tool-use) and HuggingFace's [smolagents](https://huggingface.co/docs/smolagents/en/index) both describe this approach.
+
+Issue [#1540](https://github.com/strands-agents/sdk-python/issues/1540) requests this for Strands. PR [#387](https://github.com/strands-agents/tools/pull/387) (`programmatic_tool_caller`) is a first implementation, but it defines its own `Executor` class independently of any shared abstraction. There is no standard way for code-execution tools to share an environment or for higher-level paradigms like CodeAct to build on a common foundation.
+
+## Proposed design
+
+The design has three layers, each building on the one below.
+ +``` +┌─────────────────────────────────────────────────┐ +│ CodeAct Plugin │ +│ Wraps an agent, replaces tools with itself, │ +│ forces code-based orchestration │ +├─────────────────────────────────────────────────┤ +│ Tooling │ +│ Updated tools + programmatic tool caller │ +│ All tools delegate to the Sandbox │ +├─────────────────────────────────────────────────┤ +│ Sandbox │ +│ Abstract interface for code execution │ +│ Local │ Docker │ AgentCore │ Custom │ +└─────────────────────────────────────────────────┘ +``` + +### Layer 1: Sandbox + +A Sandbox is an execution environment that provides code execution, command execution, and filesystem operations. It is the "where" — completely independent of the SDK's `ToolExecutor` (which controls scheduling: concurrent versus sequential). + +Tools that need to run code or access a filesystem receive a Sandbox instead of managing their own execution. + + +#### Interface + +The Sandbox ABC lives in the SDK (`strands-agents/sdk-python`). Concrete implementations (`LocalSandbox`, `DockerSandbox`, `AgentCoreSandbox`) also live in the SDK as vended sandbox providers, similar to how the SDK already vends tools and plugins. Third-party sandbox providers can be published as separate packages. + +There are two viable interface patterns from the ecosystem (see [Appendix B](#appendix-b-prior-art) for full survey). We present both and recommend Option A. + +##### Option A: Minimal `execute()` (recommended) + +A single abstract method with convenience methods built on top. The base class derives filesystem operations by running shell commands through `execute()`. + +LangChain DeepAgents is an example of this pattern — Daytona, Modal, E2B, and Runloop all implement a single `execute()` method against this interface. 
+ +```python +from abc import ABC, abstractmethod +import secrets +import shlex +from dataclasses import dataclass + + +@dataclass +class ExecutionResult: + """Result of code or command execution.""" + + exit_code: int + stdout: str + stderr: str + + +class Sandbox(ABC): + """Abstract execution environment for agent tools. + + A Sandbox provides the runtime context where tools execute code, + run commands, and interact with a filesystem. Multiple tools + share the same Sandbox instance, giving them a common working + directory, environment variables, and filesystem. + + Implementations only need to provide execute(). All other methods + are built on top of it. Implementations may override convenience + methods with native versions for better performance (for example, + DockerSandbox can use docker cp instead of shell-based file ops). + """ + + @abstractmethod + async def execute( + self, + command: str, + timeout: int | None = None, + ) -> ExecutionResult: + """Execute a shell command in the sandbox. + + This is the only method implementations must provide. All other + methods are built on top of this one by default. + + Args: + command: The shell command to execute. + timeout: Maximum execution time in seconds. None means no timeout. + + Returns: + The result of the command execution. + """ + ... + + # --- Convenience methods built on execute() --- + + async def execute_code( + self, + code: str, + language: str = "python", + timeout: int | None = None, + ) -> ExecutionResult: + """Execute code in the sandbox. Override for native code execution.""" + return await self.execute(f"{language} -c {shlex.quote(code)}", timeout=timeout) + + async def read_file(self, path: str) -> str: + """Read a file from the sandbox filesystem. 
Override for native file I/O.""" + result = await self.execute(f"cat {shlex.quote(path)}") + if result.exit_code != 0: + raise FileNotFoundError(result.stderr) + return result.stdout + + async def write_file(self, path: str, content: str) -> None: + """Write a file to the sandbox filesystem. Override for native file I/O.""" + delimiter = f"STRANDS_EOF_{secrets.token_hex(8)}" + result = await self.execute( + f"cat > {shlex.quote(path)} << '{delimiter}'\n{content}\n{delimiter}" + ) + if result.exit_code != 0: + raise IOError(result.stderr) + + async def list_files(self, path: str = ".") -> list[str]: + """List files in a sandbox directory. Override for native listing.""" + result = await self.execute(f"ls -1 {shlex.quote(path)}") + if result.exit_code != 0: + raise FileNotFoundError(result.stderr) + return [f for f in result.stdout.strip().split("\n") if f] + + # --- Lifecycle --- + + async def start(self) -> None: + """Initialize the sandbox. Called once before first use.""" + pass + + async def stop(self) -> None: + """Clean up sandbox resources.""" + pass + + async def __aenter__(self) -> "Sandbox": + await self.start() + return self + + async def __aexit__(self, *args) -> None: + await self.stop() +``` + +New sandbox providers implement one method. The base class handles everything else. Providers that have native filesystem or code execution APIs (for example, AgentCore's `invoke("executeCode", ...)` or Docker's `docker cp`) override the convenience methods for better performance, encoding safety, and binary file support. + +The convenience methods use `shlex.quote()` for shell escaping and a randomized heredoc delimiter (`STRANDS_EOF_{random_hex}`) to prevent content injection in `write_file()`. Providers that need better file handling (binary files, encoding) override the convenience methods with native implementations. + +##### Option B: Multi-method ABC + +Every method is abstract. Implementations must provide all of them. 
+ +E2B and Daytona are examples of this pattern, exposing separate modules or methods for commands, code, and filesystem. Our existing `CodeInterpreter` ABC in `strands_tools/code_interpreter/code_interpreter.py` also follows this pattern. + +```python +class Sandbox(ABC): + """Every method is abstract. Implementations must provide all of them.""" + + @abstractmethod + async def execute_command(self, command: str, timeout: int | None = None) -> ExecutionResult: ... + + @abstractmethod + async def execute_code(self, code: str, language: str = "python", timeout: int | None = None) -> ExecutionResult: ... + + @abstractmethod + async def read_file(self, path: str) -> str: ... + + @abstractmethod + async def write_file(self, path: str, content: str) -> None: ... + + @abstractmethod + async def list_files(self, path: str = ".") -> list[str]: ... + + async def start(self) -> None: ... + async def stop(self) -> None: ... +``` + +##### Comparison + +| Aspect | Option A (minimal `execute()`) | Option B (multi-method ABC) | +|--------|-------------------------------|----------------------------| +| Methods to implement | 1 required, 4 optional overrides | 5 required | +| New provider effort | Minimal — one method gets you a working sandbox | Higher — must implement all filesystem ops | +| Filesystem quality | Shell-based by default (encoding/binary issues) | Native by default (provider controls quality) | +| Third-party adoption | Lower barrier, more providers | Higher barrier, fewer providers | +| Type safety | Weaker for file ops (everything is strings through shell) | Stronger (dedicated method signatures) | + +##### Recommendation + +We recommend Option A. The LangChain DeepAgents ecosystem validates this pattern — four major sandbox providers (Daytona, Modal, E2B, Runloop) all implement a single `execute()` method successfully. The key insight is that filesystem operations are just shell commands (`cat`, `ls`, `echo >`), and most sandboxes already have a shell. 
Providers that need better file handling (binary files, encoding) override the convenience methods. + +This also matches how tools actually use sandboxes. `shell` only needs `execute()`. `python_repl` only needs `execute_code()`. `file_read` only needs `read_file()`. By making the base class handle the wiring, we avoid forcing every provider to implement methods they may not natively support. + +#### Implementations + +The following table summarizes the planned Sandbox implementations. See [Appendix A](#appendix-a-sandbox-implementation-sketches) for code examples. + +| Sandbox | Where it runs | Isolation | Latency | Use case | +|---------|--------------|-----------|---------|----------| +| `LocalSandbox` | Host process | None | Minimal | Development, trusted agents. Default when no sandbox is configured. | +| `DockerSandbox` | Docker container | Process + filesystem | ~1s startup | Production, untrusted code | +| `AgentCoreSandbox` | AWS Bedrock AgentCore | Full cloud isolation | ~800ms startup | AWS-deployed agents | + +The `Sandbox` ABC is designed for extensibility. Third-party providers (for example, [E2B](https://e2b.dev), [Daytona](https://www.daytona.io), or custom in-process virtual shells) can implement the interface to plug into the same ecosystem. + +#### How tools use a Sandbox + +Tools access the sandbox through `tool_context.agent.sandbox`. The `Agent` class gains a new `sandbox` attribute as part of this design. It defaults to `LocalSandbox` when no sandbox is explicitly configured, so all tools are backwards compatible without any conditional fallback logic. 
+ +```python +from strands import Agent +from strands_tools import shell, python_repl, file_write + +# No sandbox specified — defaults to LocalSandbox +agent = Agent(tools=[shell, python_repl, file_write]) + +# Explicit sandbox — all tools share it +agent = Agent( + tools=[shell, python_repl, file_write], + sandbox=DockerSandbox(image="python:3.12-slim"), +) + +# The agent creates a file with shell, reads it with python_repl +# Both operate in the same sandbox +agent("Create a file called data.csv with some sample data, then analyze it with Python") +``` + +Inside a tool, the Sandbox is accessed via [`ToolContext`](https://strandsagents.com/docs/user-guide/concepts/tools/custom-tools/index.md): + +```python +from strands import tool, ToolContext + + +@tool(context=True) +async def shell(command: str, tool_context: ToolContext) -> str: + """Execute a shell command in the sandbox.""" + result = await tool_context.agent.sandbox.execute(command) + return result.stdout +``` + +Because the agent always has a sandbox (defaulting to `LocalSandbox`), tools do not need fallback logic. They always delegate to the sandbox. + +#### State management + +The sandbox must decide whether state (working directory, environment variables) persists between `execute()` calls. + +We recommend stateless by default: each `execute()` call is independent. This avoids concurrency issues when multiple tools call the sandbox in parallel (tool A does `cd /tmp`, tool B does `cd /home` — with shared state, the last one wins and both are confused). Stateless execution is predictable and matches the LangChain DeepAgents model. + +This means `cd`, `export`, and other state-modifying commands do not persist between calls. Tools that need state persistence (for example, a shell tool that tracks `cd` across calls) can manage it themselves, as the current `shell` tool already does via `CommandContext`. 
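The stateless model described above can be illustrated with a small sketch. It assumes a POSIX shell and uses two hypothetical names that are not part of the proposed API: `StatelessLocalSandbox` (a stand-in for `LocalSandbox`) and `CwdTrackingShell` (a tool-side helper loosely modeled on the idea of `CommandContext`). Each `execute()` call gets a fresh shell, and the helper re-enters its tracked directory on every call.

```python
import asyncio
import shlex
from dataclasses import dataclass
from typing import Optional

MARKER = "__STRANDS_CWD__"  # hypothetical sentinel used to report the final cwd


@dataclass
class ExecutionResult:
    exit_code: int
    stdout: str
    stderr: str


class StatelessLocalSandbox:
    """Illustrative stand-in for LocalSandbox: a fresh shell per execute() call."""

    async def execute(self, command: str, timeout: Optional[float] = None) -> ExecutionResult:
        proc = await asyncio.create_subprocess_shell(
            command,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        try:
            out, err = await asyncio.wait_for(proc.communicate(), timeout)
        except asyncio.TimeoutError:
            proc.kill()
            await proc.wait()
            raise
        return ExecutionResult(proc.returncode, out.decode(), err.decode())


class CwdTrackingShell:
    """Hypothetical tool-side helper that remembers cwd across stateless calls."""

    def __init__(self, sandbox: StatelessLocalSandbox, cwd: str = "/"):
        self.sandbox = sandbox
        self.cwd = cwd

    async def run(self, command: str) -> ExecutionResult:
        # Re-enter the tracked directory, run the command, then report the final cwd.
        script = (
            f"cd {shlex.quote(self.cwd)} && {command}\n"
            "rc=$?\n"
            f"printf '\\n{MARKER}%s\\n' \"$PWD\"\n"
            "exit $rc"
        )
        result = await self.sandbox.execute(script)
        body, sep, tail = result.stdout.rpartition("\n" + MARKER)
        if sep:
            self.cwd = tail.strip()  # remember where the command left us
            result.stdout = body     # hide the sentinel line from the caller
        return result
```

With this shape, two tools running in parallel never see each other's `cd`, because the only persistent state lives in each tool's own helper rather than in the sandbox.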
+ +Each sandbox implementation handles this differently: + +- `LocalSandbox` — spawns a fresh subprocess per `execute()` call. Stateless by default. +- `DockerSandbox` — each `docker exec` gets its own process. Filesystem changes persist (shared container), but cwd and env do not carry across calls. +- `AgentCoreSandbox` — session state persists natively via the AgentCore API. + +Sandbox implementations may optionally offer a stateful mode (for example, `LocalSandbox` could use a persistent shell process for tools that opt in), but the default behavior and the abstract interface make no persistence promise. + +#### Tool proxy + +When code runs inside a sandbox (for example, the programmatic tool caller or CodeAct executing model-generated code in Docker), that code needs to call agent tools. But agent tools are Python objects in the host process — they cannot be serialized into a remote sandbox. The tool proxy solves this by bridging tool calls from the sandbox back to the host. + +We considered three approaches: + +| Approach | How it works | Pros | Cons | +|----------|-------------|------|------| +| Host-process execution | Orchestration code runs locally, tools are local functions | Simple, no IPC | Orchestration code is unsandboxed | +| Serialize tools as source code | Generate Python source for each tool, send to sandbox (smolagents approach) | Tools run fully in sandbox | Only works for self-contained tools. Strands tools have closures over the agent, access `ToolContext`, make API calls — they cannot be serialized as source | +| Callback proxy | Sandbox code calls tools via HTTP back to the host | Full sandbox isolation, works with any tool | Requires proxy server, network latency per tool call | + +We recommend the callback proxy approach, implemented incrementally: + +1. Phase 1 (P0/P1): orchestration code runs in the host process. Tools are local async functions. This is simple and works today. +2. Phase 2 (P2): add a tool proxy server. 
Orchestration code runs fully inside the sandbox. Tool calls are proxied back to the host via HTTP.
+
+The tool proxy works as follows:
+
+1. Before executing code in the sandbox, the host starts a lightweight HTTP server
+2. For each agent tool, a stub function is generated as Python source code. The stub makes an HTTP POST to the proxy server with the tool name and arguments
+3. The generated stubs are prepended to the model's code and sent to `sandbox.execute_code()`
+4. When the sandbox code calls a tool stub, the HTTP request reaches the host proxy
+5. The host proxy dispatches to `agent.tool.X()`, which executes the tool (using the sandbox for its own execution)
+6. The result is returned as the HTTP response, and sandbox code continues
+
+```
+┌─────────────────────────────────────────────────────────┐
+│  Sandbox (Docker, AgentCore, etc.)                      │
+│                                                         │
+│  # Generated stub (prepended to model's code)           │
+│  async def calculator(expression: str) -> str:          │
+│      async with httpx.AsyncClient() as client:          │
+│          resp = await client.post(                      │
+│              "http://host:9999/tool/calculator",        │
+│              json={"expression": expression},           │
+│          )                                              │
+│      return resp.json()["result"]                       │
+│                                                         │
+│  # Model's code                                         │
+│  result = await calculator(expression="2 + 2")          │
+│  print(result)                                          │
+└──────────────────────┬──────────────────────────────────┘
+                       │ HTTP POST /tool/calculator
+                       ▼
+┌─────────────────────────────────────────────────────────┐
+│  Host process                                           │
+│                                                         │
+│  Tool Proxy Server (port 9999)                          │
+│  ├── POST /tool/calculator                              │
+│  │   → agent.tool.calculator(expression="2 + 2")        │
+│  │   → returns {"result": "4"}                          │
+│  └── POST /tool/shell                                   │
+│      → agent.tool.shell(command="ls")                   │
+│      → (shell tool uses sandbox.execute() internally)   │
+└─────────────────────────────────────────────────────────┘
+```
+
+The proxy server is managed by the sandbox or the tool that needs it (programmatic tool caller, CodeAct). It starts before code execution and stops after. The `Sandbox` interface does not change: the proxy is a layer on top.
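The proxy flow above can be sketched end to end with only the standard library. This is an illustration under stated assumptions, not the planned implementation: `make_stub_source`, `start_proxy`, and the `add` tool are hypothetical names, the stub is synchronous and uses `urllib` instead of `httpx` so the example has no dependencies, and a local `exec()` stands in for `sandbox.execute_code()`.

```python
import json
import secrets
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Source template for one generated stub; the sandbox only ever sees this text.
STUB_TEMPLATE = '''
def {name}(**kwargs):
    import json, urllib.request
    req = urllib.request.Request(
        "{url}/tool/{name}",
        data=json.dumps(kwargs).encode(),
        headers={{"Content-Type": "application/json", "Authorization": "Bearer {token}"}},
    )
    # Bypass any configured HTTP proxy for this direct call to the host
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({{}}))
    with opener.open(req) as resp:
        return json.load(resp)["result"]
'''


def make_stub_source(name: str, url: str, token: str) -> str:
    """Render Python source for one tool stub."""
    return STUB_TEMPLATE.format(name=name, url=url, token=token)


def start_proxy(tools: dict, token: str) -> ThreadingHTTPServer:
    """Start a proxy on an ephemeral localhost port that dispatches to `tools`."""

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Read the body first so the connection is drained before any error reply
            length = int(self.headers.get("Content-Length", 0))
            raw = self.rfile.read(length)
            name = self.path.rsplit("/", 1)[-1]
            if self.headers.get("Authorization") != f"Bearer {token}" or name not in tools:
                self.send_error(403)
                return
            args = json.loads(raw or b"{}")
            body = json.dumps({"result": tools[name](**args)}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):  # keep the demo quiet
            pass

    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def demo():
    """Run one proxied tool call; exec() stands in for sandbox.execute_code()."""
    token = secrets.token_urlsafe(16)  # short-lived token shared with the stubs
    server = start_proxy({"add": lambda a, b: a + b}, token)
    try:
        url = "http://127.0.0.1:%d" % server.server_address[1]
        sandbox_ns = {}
        exec(make_stub_source("add", url, token), sandbox_ns)
        return sandbox_ns["add"](a=2, b=2)
    finally:
        server.shutdown()
        server.server_close()
```

Embedding the token in the generated stub source is one simple way to meet the authentication requirement, since only code the host itself prepared can present it.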
+ +Considerations: + +- The proxy server must be reachable from the sandbox. For `DockerSandbox`, this means host networking or `host.docker.internal`. For `AgentCoreSandbox`, the sandbox must allow outbound HTTP. +- `httpx` (or equivalent) must be available inside the sandbox for the stubs to work. +- Each tool call adds network round-trip latency. For tools that are themselves fast (like `calculator`), this overhead is noticeable. For tools that do I/O (like `http_request` or `shell`), it is negligible. +- The proxy should authenticate requests (for example, a short-lived token) to prevent unauthorized tool calls from other processes. + + +### Layer 2: Tooling + +This layer covers the tools that use the Sandbox. It includes updating existing tools (`shell`, `python_repl`, `file_read`, `file_write`, `editor`) to delegate to the sandbox, and adding the `programmatic_tool_caller` — a new tool that lets the model write code to orchestrate other tools. + +#### Updating existing tools + +Existing tools are updated to use `tool_context.agent.sandbox` instead of managing their own execution. The [`ToolContext`](https://strandsagents.com/docs/user-guide/concepts/tools/custom-tools/index.md) provides access to the agent and its sandbox via the `@tool(context=True)` decorator. See the shell example in [How tools use a Sandbox](#how-tools-use-a-sandbox) above. + +This replaces the current approach where each tool manages its own PTY, subprocess, and filesystem logic. + +#### New `python` tool + +A new sandbox-based `python` tool delegates to `sandbox.execute_code()`. Unlike the existing `python_repl` (which maintains a persistent namespace via `dill` serialization and supports interactive PTY), the `python` tool is stateless — each invocation runs in a fresh interpreter. This is simpler, works across all sandbox types, and is the recommended default. The existing `python_repl` tool remains available for use cases that need stateful execution. 
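To make the stateless behavior of the new `python` tool concrete, here is a minimal sketch. It assumes a local interpreter and uses a hypothetical `execute_code()` helper in place of `sandbox.execute_code()`. The real tool would be decorated with `@tool(context=True)` and would call `tool_context.agent.sandbox.execute_code()` instead.

```python
import asyncio
import sys
from typing import Optional, Tuple


async def execute_code(code: str, timeout: Optional[float] = None) -> Tuple[int, str, str]:
    """Stand-in for sandbox.execute_code(): run code in a fresh interpreter."""
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", code,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    out, err = await asyncio.wait_for(proc.communicate(), timeout)
    return proc.returncode, out.decode(), err.decode()


async def python(code: str) -> str:
    """Hypothetical body of the stateless `python` tool."""
    exit_code, stdout, stderr = await execute_code(code)
    if exit_code != 0:
        # Surface the failure so the model can see what went wrong
        return f"{stdout}\n[exit code {exit_code}]\n{stderr}"
    return stdout


async def demo() -> Tuple[str, str]:
    first = await python("x = 1\nprint(x)")
    second = await python("print(x)")  # new interpreter: x is no longer defined
    return first, second
```

The second call fails with a `NameError` because nothing carries over between invocations, which is the trade-off against `python_repl`.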
+ +#### Programmatic tool caller + +The programmatic tool caller is a new tool that lets the model write code to orchestrate other tools. It lives in `strands-agents/tools` and builds on the Sandbox for its code execution. + +This is the approach from PR [#387](https://github.com/strands-agents/tools/pull/387) and aligns with Anthropic's [advanced tool use](https://www.anthropic.com/engineering/advanced-tool-use) programmatic tool calling feature. + +#### How it works + +1. The model receives `programmatic_tool_caller` as one of its available tools +2. When the model wants to orchestrate multiple tools, it writes Python code that calls them as async functions +3. The tool executes this code in the host process with tool wrappers injected (Phase 1), or inside the sandbox via the tool proxy (Phase 2) +4. Only `print()` output enters the agent's context window — intermediate tool results stay in the code execution context + +```python +from strands import Agent +from strands_tools import programmatic_tool_caller, shell, calculator + +agent = Agent( + tools=[programmatic_tool_caller, shell, calculator], + sandbox=LocalSandbox(), +) + +# The model can choose to use programmatic_tool_caller when it helps +result = agent("Calculate the squares of numbers 1-100 and sum them") +``` + +The model might respond with a `programmatic_tool_caller` invocation containing: + +```python +total = 0 +for i in range(1, 101): + square = await calculator(expression=f"{i} ** 2") + total += int(square) +print(f"Sum of squares: {total}") +``` + +This executes as a single tool call instead of 100 separate round-trips. The intermediate `calculator` results never enter the context window. + +#### Relationship to Sandbox + +The programmatic tool caller does not run its code in the sandbox. It runs the model's orchestration code in the host process, injecting tool wrappers as async functions that call `agent.tool.X()`. Those tools in turn use the sandbox. 
The sandbox is used by the individual tools, not by the orchestration layer.
+
+The tool:
+
+1. Introspects the agent's tool registry to find available tools
+2. Creates async wrapper functions for each tool
+3. Injects these wrappers into the local execution namespace
+4. Runs the model's code in-process via `exec()`
+5. Captures `print()` output as the tool result
+
+```python
+import contextlib
+import io
+
+from strands import Agent
+
+
+class ProgrammaticToolCaller:
+    """Tool that executes code calling other tools."""
+
+    def __init__(self, allowed_tools: list[str] | None = None):
+        self.allowed_tools = allowed_tools
+
+    async def execute(self, code: str, agent: Agent) -> str:
+        # Build namespace with tool wrappers that call agent.tool.X()
+        namespace = self._build_tool_namespace(agent)
+
+        # exec() does not support top-level await, so wrap the code in an async main
+        wrapped = "async def __ptc_main__():\n" + "\n".join(
+            f"    {line}" for line in code.splitlines()
+        )
+        exec(compile(wrapped, "", "exec"), namespace)
+
+        # Run in the host process (tools use the sandbox internally), capturing print()
+        stdout = io.StringIO()
+        with contextlib.redirect_stdout(stdout):
+            await namespace["__ptc_main__"]()
+        return stdout.getvalue()
+```
+
+This means programmatic tool calling works with any sandbox: the orchestration code does not need to know or care what sandbox is configured.
+
+The initial implementation runs orchestration code in the host process. Once the tool proxy is available (see [Tool proxy](#tool-proxy) in Layer 1), the programmatic tool caller can be upgraded to run code fully inside the sandbox.
+
+### Layer 3: CodeAct plugin
+
+CodeAct is a higher-level paradigm where the agent always responds with code instead of JSON tool calls. It is implemented as a hook using the existing agent lifecycle, not a modification to the agent loop.
+
+#### The CodeAct paradigm
+
+In standard tool calling, the model outputs structured JSON to invoke tools one at a time. In CodeAct (from the [Apple ML Research paper](https://machinelearning.apple.com/research/codeact)), the model outputs executable code that calls tools as functions.
This has several advantages: + +- Loops, conditionals, and data transformations are native (no multi-turn back-and-forth) +- Intermediate results stay in code variables, not the context window +- The model can self-correct by catching exceptions and retrying in the next turn +- Token usage drops dramatically (up to 98.7% reduction reported) + +HuggingFace's [smolagents](https://huggingface.co/docs/smolagents/en/index) implements this as `CodeAgent` — a separate agent class that generates Python code instead of JSON tool calls. + +#### How it works in Strands + +CodeAct maps naturally to Strands' hook system. The [`AfterInvocationEvent.resume`](https://strandsagents.com/docs/user-guide/concepts/agents/hooks/index.md) property triggers a follow-up agent invocation with new input, which is exactly the CodeAct observation loop: model generates code → execute → feed results back as next observation → model generates more code. + +The plugin registers hooks on the agent: + +1. `BeforeInvocationEvent` — modifies the system prompt to instruct the model to respond with Python code, and injects tool function signatures into the prompt +2. `AfterInvocationEvent` — parses code from the model's response, executes it in a persistent namespace with tools available as callable functions, and sets `event.resume` with the execution result (stdout, errors) as the next observation +3. 
The loop terminates when the model calls `final_answer()` in its code or responds without a code block
+
+```python
+import contextlib
+import io
+
+from strands import Agent
+from strands.hooks import AfterInvocationEvent, BeforeInvocationEvent
+
+
+class CodeActPlugin:
+    """Implements CodeAct via agent hooks."""
+
+    def __init__(self, agent: Agent):
+        self.agent = agent
+        self.namespace = {"__builtins__": __builtins__}
+
+        # Inject tool wrappers into the execution namespace
+        self._inject_tools()
+
+        # Register hooks
+        agent.hooks.add_callback(BeforeInvocationEvent, self._setup_prompt)
+        agent.hooks.add_callback(AfterInvocationEvent, self._execute_and_resume)
+
+    def _inject_tools(self):
+        """Make agent tools callable as Python functions in the namespace."""
+        for tool_name in self.agent.tool_registry.registry:
+            self.namespace[tool_name] = self._make_tool_wrapper(tool_name)
+
+        def final_answer(result):
+            self.namespace["__final_answer__"] = result
+
+        self.namespace["final_answer"] = final_answer
+
+    async def _setup_prompt(self, event: BeforeInvocationEvent):
+        """Instruct the model to respond with Python code."""
+        # Prepend CodeAct instructions to system prompt
+        ...
+
+    async def _execute_and_resume(self, event: AfterInvocationEvent):
+        """Parse code from response, execute, resume with results."""
+        code = self._parse_code(event.result.message)
+        if not code:
+            return  # No code block: model is done
+
+        stdout = io.StringIO()
+        try:
+            # exec() does not support top-level await, so we wrap in an async main
+            wrapped = "async def __codeact_main__():\n" + "\n".join(
+                f"    {line}" for line in code.splitlines()
+            )
+            exec(compile(wrapped, "", "exec"), self.namespace)
+            # Capture print() output while the wrapped code runs
+            with contextlib.redirect_stdout(stdout):
+                await self.namespace["__codeact_main__"]()
+            output = stdout.getvalue()
+        except Exception as e:
+            output = f"{stdout.getvalue()}Error: {e}"
+
+        # Check after execution so final_answer() in this turn's code ends the loop
+        if "__final_answer__" in self.namespace:
+            return  # Model called final_answer(): done
+
+        # Feed execution result back as next observation
+        event.resume = f"Execution result:\n{output}"
+```
+
+Usage:
+
+```python
+from strands import Agent
+from strands_tools import shell, http_request, calculator
+
+agent = Agent(tools=[shell, http_request, calculator])
+
+# Apply CodeAct: the agent now responds with code
+codeact = CodeActPlugin(agent)
+
+result = agent("Fetch the top 10 HN stories and save them to a file")
+```
+
+The model's response would be natural language with a code block. The CodeAct plugin parses the code block, executes it, and feeds the result back via `resume`.
+
+Key properties of this approach:
+
+- Works within the existing agent loop (no custom loop needed)
+- The model outputs natural language reasoning alongside code (matching the paper)
+- State persists across turns via the shared `namespace` dict (names assigned inside the wrapped `__codeact_main__` function are local unless declared `global`)
+- Errors feed back as observations, enabling self-correction
+- `final_answer()` terminates the loop
+- `resume` handles the observation cycle natively
+
+Note: the initial implementation runs code in the host process with tools as local functions. Once the tool proxy is available (see [Tool proxy](#tool-proxy) in Layer 1), CodeAct can run code fully inside the sandbox.
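Two pieces of the observation loop can be sketched in isolation: extracting the code block from the model's reply, and running it so that variables survive into the next turn. The names `parse_code` and `run_turn` are hypothetical, and the approach shown (compiling with `ast.PyCF_ALLOW_TOP_LEVEL_AWAIT`, available in Python 3.8+) is one possible alternative to wrapping the code in an async function, not necessarily what the plugin will use.

```python
import ast
import asyncio
import contextlib
import inspect
import io
import re
from typing import Optional

FENCE = "`" * 3  # built programmatically so this example can itself sit in a fenced block
CODE_BLOCK = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def parse_code(message_text: str) -> Optional[str]:
    """Return the first fenced Python block in the model's reply, if any."""
    match = CODE_BLOCK.search(message_text)
    return match.group(1) if match else None


async def run_turn(code: str, namespace: dict) -> str:
    """Run one snippet with top-level await allowed and return captured stdout."""
    code_obj = compile(code, "<codeact>", "exec", flags=ast.PyCF_ALLOW_TOP_LEVEL_AWAIT)
    stdout = io.StringIO()
    with contextlib.redirect_stdout(stdout):
        # The snippet runs at module level in `namespace`, so assignments persist.
        result = eval(code_obj, namespace)
        if inspect.iscoroutine(result):  # only a coroutine if the snippet used await
            await result
    return stdout.getvalue()


async def demo() -> str:
    namespace: dict = {}
    reply = "Let me set a value.\n" + FENCE + "python\nx = 41\n" + FENCE
    await run_turn(parse_code(reply), namespace)
    # A later turn can read x and use await at the top level
    return await run_turn("import asyncio\nawait asyncio.sleep(0)\nprint(x + 1)", namespace)
```

Because the snippet executes directly against the shared dict, a variable set in one turn is visible in the next without any `global` declarations.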
+ +#### Difference from programmatic tool caller + +| Aspect | Programmatic tool caller | CodeAct | +|--------|------------------------|---------| +| Scope | One tool among many | Replaces the agent's interaction mode | +| Model choice | Model decides when to use it | Always active | +| Tool calling | Standard JSON tool calls | Code blocks in natural language response | +| State | Stateless per call | Persistent namespace across turns | +| Self-correction | Not built in | Native via error feedback loop | +| Integration | Drop-in tool | Hook-based plugin | +| Sandbox isolation | Phase 1: host process. Phase 2: full sandbox via tool proxy | Phase 1: host process. Phase 2: full sandbox via tool proxy | +| Use case | Optimization for batch operations | Full paradigm shift | + +The programmatic tool caller is a tool the model can optionally use within standard tool calling. CodeAct changes how the agent interacts entirely — the model always writes code, and the hook system handles execution and feedback. + + +## API surface + +### SDK changes (`strands-agents/sdk-python`) + +The Sandbox ABC and `ExecutionResult` dataclass are added to the SDK. The `Agent` class accepts an optional `sandbox` parameter that defaults to `LocalSandbox`. + +```python +from strands import Agent +from strands.sandbox import Sandbox, LocalSandbox, ExecutionResult +from strands.sandbox.docker import DockerSandbox + +# Default: LocalSandbox (backwards compatible, same behavior as today) +agent = Agent(tools=[shell]) + +# Explicit sandbox +agent = Agent(tools=[shell], sandbox=DockerSandbox(image="python:3.12-slim")) +``` + +Tools access the sandbox via [`ToolContext`](https://strandsagents.com/docs/user-guide/concepts/tools/custom-tools/index.md). No changes to the tool interface are required. 
+
+```python
+from strands import tool, ToolContext
+
+
+@tool(context=True)
+async def my_tool(tool_context: ToolContext) -> str:
+    result = await tool_context.agent.sandbox.execute("ls -la")
+    return result.stdout
+```
+
+### Tools changes (`strands-agents/tools`)
+
+Existing tools (`shell`, `python_repl`, `file_read`, `file_write`, `editor`) are updated to use `tool_context.agent.sandbox`. Since the agent always provides a sandbox (defaulting to `LocalSandbox`), tools do not need conditional fallback logic.
+
+A new sandbox-based `python` tool will be added that delegates to `sandbox.execute_code()`. The existing `python_repl` tool (with stateful REPL, PTY, and `dill` persistence) will be kept for use cases that need it.
+
+New tools (`programmatic_tool_caller`) use the sandbox the same way.
+
+### TypeScript SDK
+
+The Sandbox interface should be mirrored in `sdk-typescript`. The interface is simple enough that implementations can differ by language while maintaining the same mental model.
+
+```typescript
+interface ExecutionResult {
+  exitCode: number;
+  stdout: string;
+  stderr: string;
+}
+
+interface Sandbox {
+  // The only required method
+  execute(command: string, timeout?: number): Promise<ExecutionResult>;
+
+  // Convenience methods (default implementations built on execute())
+  executeCode?(code: string, language?: string, timeout?: number): Promise<ExecutionResult>;
+  readFile?(path: string): Promise<string>;
+  writeFile?(path: string, content: string): Promise<void>;
+  listFiles?(path?: string): Promise<string[]>;
+
+  // Lifecycle
+  start?(): Promise<void>;
+  stop?(): Promise<void>;
+}
+```
+
+## Migration and backwards compatibility
+
+The default `LocalSandbox` preserves existing behavior. No user action is required.
+- `Agent` defaults to `LocalSandbox` when no `sandbox` parameter is provided +- `LocalSandbox` wraps the same subprocess/filesystem behavior tools use today +- Tools access the sandbox via `tool_context.agent.sandbox`, so no conditional logic is needed +- The `CodeInterpreter` ABC in `strands_tools` can be refactored to implement `Sandbox`, making `AgentCoreCodeInterpreter` both a `CodeInterpreter` and a `Sandbox` + +Migration path for existing tools: + +1. Update tools to use `tool_context.agent.sandbox` instead of managing their own execution +2. Deprecate tool-specific environment variables (`PYTHON_REPL_PERSISTENCE_DIR`, etc.) in favor of Sandbox configuration +3. Eventually remove direct execution code from tools, making Sandbox the only path + +## Open questions + +1. **Namespace injection for programmatic tool calling.** The programmatic tool caller and CodeAct need tool wrapper functions in the code execution namespace. Phase 1 runs orchestration code in the host process with tools as local functions. Phase 2 uses the tool proxy to run code fully inside the sandbox. See [Tool proxy](#tool-proxy) for the full design and alternatives considered. + +2. **Stateful sandbox mode.** The default is stateless (see [State management](#state-management)). Some use cases may benefit from a stateful mode where `cd` and `export` persist across calls. Should this be a constructor flag on sandbox implementations, a separate `StatefulSandbox` subclass, or left entirely to individual tools? + +3. **Parallel tool execution.** Strands' `ConcurrentToolExecutor` runs multiple tool calls in parallel. If two tools call `sandbox.execute()` concurrently on a sandbox backed by a single persistent shell, their commands and output interleave on the shared stdin/stdout streams, which is a race condition. `DockerSandbox` and `AgentCoreSandbox` are naturally concurrency-safe (each `execute()` gets its own process or API call). + + We recommend that the `Sandbox` interface make no concurrency guarantees. Each implementation documents its own behavior.
`LocalSandbox` can handle this by spawning subprocesses that inherit the persistent shell's tracked state (cwd, env) rather than sharing the shell process for all calls. + +4. **Interactive mode.** Both `shell` and `python_repl` today support interactive PTY mode for real-time user input. This does not map cleanly to remote sandboxes (Docker, AgentCore). Our inclination is to keep interactive mode as a `LocalSandbox` concern and not complicate the abstract interface. This is not a blocker for the initial implementation and can be revisited later. + +5. **Naming.** This document uses "Sandbox" throughout. We recommend this name but want to surface alternatives for discussion. + + | Name | Pros | Cons | + |------|------|------| + | `Sandbox` (recommended) | Clear mental model, aligns with E2B/Daytona/LangChain ecosystem terminology, implies containment | Implies isolation that `LocalSandbox` does not provide | + | `Runtime` | Neutral, accurate | Conflicts with "Python runtime", "AgentCore Runtime" | + | `Environment` | Intuitive | Extremely overloaded in programming (env vars, virtual envs, deployment envs) | + | `ExecutionBackend` | Precise, no ambiguity | Verbose, feels enterprise-y | + | `Workspace` | Implies shared filesystem | Too narrow (does not convey code execution) | + | `Shell` | Familiar | Too narrow (does not cover code execution, file ops) | + +## Tasks + +### P0 + +- [ ] Define `Sandbox` ABC and `ExecutionResult` in `sdk-python` +- [ ] Implement `LocalSandbox` +- [ ] Add `sandbox` parameter to `Agent` (default: `LocalSandbox`) +- [ ] Update `shell` tool to use `tool_context.agent.sandbox` +- [ ] Update `file_read` and `file_write` tools to use sandbox +- [ ] Add sandbox-based `python` tool (stateless, delegates to `sandbox.execute_code()`) +- [ ] Keep existing `python_repl` tool for stateful REPL use cases + +### P1 + +- [ ] Implement `DockerSandbox` +- [ ] Refactor `AgentCoreCodeInterpreter` to implement `Sandbox` (`AgentCoreSandbox`) +- [ ] Update 
remaining tools (`editor`, `journal`, `workflow`) to use sandbox +- [ ] Implement `programmatic_tool_caller` using sandbox (refactor from PR #387) +- [ ] Deprecate tool-specific environment variables (`PYTHON_REPL_PERSISTENCE_DIR`, `STRANDS_WORKFLOW_DIR`, etc.) + +### P2 + +- [ ] Implement `CodeActPlugin` +- [ ] Add Anthropic advanced tool use support (tool search tool, tool use examples) +- [ ] Tool proxy: implement proxy server, stub generation, and integrate with programmatic tool caller and CodeAct +- [ ] Mirror `Sandbox` interface in `sdk-typescript` +- [ ] Implement `LocalSandbox` and `DockerSandbox` for TypeScript + +## Alternatives considered + +### Sandbox as a tool wrapper instead of SDK concept + +We considered making Sandbox a tool-level concern — each tool would optionally accept a Sandbox in its constructor. This avoids SDK changes but means every tool must independently handle Sandbox injection, and there is no guarantee that tools share the same Sandbox instance. + +Rejected because the shared-environment property (tools operating in the same filesystem) requires the Sandbox to be agent-level, not tool-level. + +### CodeAct as a new agent class + +smolagents implements CodeAct as a separate `CodeAgent` class. We considered adding `CodeActAgent` to the SDK alongside `Agent`. + +Rejected because it duplicates the agent loop and creates a maintenance burden. The hook-based approach using `AfterInvocationEvent.resume` implements the CodeAct observation loop within the existing agent lifecycle, composing with other agent features (hooks, conversation management, streaming) without reimplementation. + +### Extending ToolExecutor for Sandbox + +We considered adding Sandbox capabilities to the existing `ToolExecutor` hierarchy (which handles concurrent versus sequential execution). + +Rejected because these are orthogonal concerns. `ToolExecutor` controls scheduling. Sandbox controls environment. 
A tool can run in a Docker Sandbox with concurrent scheduling — these should compose independently. + +## References + +- [Issue #335: Tools break in AgentCore Runtime](https://github.com/strands-agents/tools/issues/335) +- [Issue #1540: Code-based tool execution](https://github.com/strands-agents/sdk-python/issues/1540) +- [Issue #1349: Anthropic advanced tool use](https://github.com/strands-agents/sdk-python/issues/1349) +- [PR #387: Programmatic tool caller](https://github.com/strands-agents/tools/pull/387) +- [PR #39: Shell tool (devtools)](https://github.com/strands-agents/devtools/pull/39) +- [Anthropic: Advanced tool use](https://www.anthropic.com/engineering/advanced-tool-use) +- [HuggingFace: smolagents](https://huggingface.co/docs/smolagents/en/index) +- [Apple ML Research: CodeAct](https://machinelearning.apple.com/research/codeact) +- [E2B: Code Interpreter SDK](https://e2b.dev) +- [Daytona: AI Sandbox SDK](https://www.daytona.io) +- [LangChain DeepAgents: Sandbox backends](https://docs.langchain.com/oss/python/deepagents/sandboxes) +- [OpenHands: Runtime architecture](https://docs.openhands.dev/openhands/usage/architecture/runtime) +- [Google ADK: GKE Code Executor](https://google.github.io/adk-docs/integrations/gke-code-executor/) +- [Anthropic: Sandbox Runtime](https://github.com/anthropic-experimental/sandbox-runtime) + +--- + +
+Appendix A: Sandbox implementation sketches + +### LocalSandbox + +Wraps the current behavior of `shell` and `python_repl` behind the Sandbox interface. Uses `asyncio.create_subprocess_shell` for command execution and native filesystem calls for file I/O. Overrides `read_file` and `write_file` to avoid shell escaping issues entirely. + +```python +import asyncio +import os + +from strands.sandbox import Sandbox, ExecutionResult + + +class LocalSandbox(Sandbox): + """Execute code and commands on the local host. + + Uses asyncio subprocesses for command execution and native filesystem + operations for file I/O. This is the default sandbox, providing the + same behavior as running commands directly on the host. + """ + + def __init__(self, working_dir: str | None = None) -> None: + self.working_dir = working_dir or os.getcwd() + + async def execute(self, command: str, timeout: int | None = None) -> ExecutionResult: + proc = await asyncio.create_subprocess_shell( + command, + cwd=self.working_dir, + stdout=asyncio.subprocess.PIPE, + stderr=asyncio.subprocess.PIPE, + ) + try: + stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=timeout) + except asyncio.TimeoutError: + proc.kill() + await proc.communicate() + raise + return ExecutionResult( + exit_code=proc.returncode or 0, + stdout=stdout.decode(), + stderr=stderr.decode(), + ) + + # Override for native file I/O (avoids shell escaping issues) + async def read_file(self, path: str) -> str: + full_path = os.path.join(self.working_dir, path) if not os.path.isabs(path) else path + with open(full_path) as f: + return f.read() + + async def write_file(self, path: str, content: str) -> None: + full_path = os.path.join(self.working_dir, path) if not os.path.isabs(path) else path + parent_dir = os.path.dirname(full_path) + if parent_dir: + os.makedirs(parent_dir, exist_ok=True) + with open(full_path, "w") as f: + f.write(content) +``` + +### DockerSandbox + +Runs commands inside a Docker container. 
The container is created on `start()` and destroyed on `stop()`. Each `execute()` call runs `docker exec` on the running container. Filesystem state persists across calls (shared container), but changes to the working directory (`cd`) and environment variables set via `export` do not carry across calls (each `docker exec` starts a new shell process). + +Overrides `write_file` to pipe content via `docker exec -i ... cat >` instead of using heredocs, eliminating content injection risks entirely. Overrides `read_file` with `shlex.quote()` for safe path handling. + +```python +import asyncio +import shlex + +from strands.sandbox import Sandbox, ExecutionResult + + +class DockerSandbox(Sandbox): + """Execute code and commands in a Docker container.""" + + def __init__( + self, + image: str = "python:3.12-slim", + volumes: dict[str, str] | None = None, + environment: dict[str, str] | None = None, + working_dir: str = "/workspace", + ): + self.image = image + self.volumes = volumes or {} + self.environment = environment or {} + self.working_dir = working_dir + self._container_id: str | None = None + + async def _run_docker( + self, args: list[str], timeout: int | None = None, stdin_data: bytes | None = None, + ) -> ExecutionResult: + """Run a docker CLI command and return the result.""" + proc = await asyncio.create_subprocess_exec( + "docker", *args, + stdout=asyncio.subprocess.PIPE, + stderr=asyncio.subprocess.PIPE, + # Compare against None so that empty content (b"") is still piped + stdin=asyncio.subprocess.PIPE if stdin_data is not None else asyncio.subprocess.DEVNULL, + ) + try: + stdout, stderr = await asyncio.wait_for( + proc.communicate(input=stdin_data), timeout=timeout, + ) + except asyncio.TimeoutError: + proc.kill() + await proc.communicate() + raise + return ExecutionResult( + exit_code=proc.returncode or 0, + stdout=stdout.decode(), + stderr=stderr.decode(), + ) + + async def start(self) -> None: + # docker create --rm -i -w /workspace <image> sleep infinity + create_args = ["create", "--rm", "-i", "-w", self.working_dir] + for host_path, container_path in
self.volumes.items(): + create_args += ["-v", f"{host_path}:{container_path}"] + for key, value in self.environment.items(): + create_args += ["-e", f"{key}={value}"] + create_args += [self.image, "sleep", "infinity"] + result = await self._run_docker(create_args, timeout=60) + if result.exit_code != 0: + raise RuntimeError(f"failed to create container: {result.stderr}") + self._container_id = result.stdout.strip() + await self._run_docker(["start", self._container_id], timeout=30) + await self.execute(f"mkdir -p {shlex.quote(self.working_dir)}") + + async def execute(self, command: str, timeout: int | None = None) -> ExecutionResult: + if self._container_id is None: + raise RuntimeError("docker sandbox not started") + return await self._run_docker( + ["exec", "-w", self.working_dir, self._container_id, "sh", "-c", command], + timeout=timeout, + ) + + # Override: pipe content via stdin to avoid heredoc injection + async def write_file(self, path: str, content: str) -> None: + if self._container_id is None: + raise RuntimeError("docker sandbox not started") + if not path.startswith("/"): + path = f"{self.working_dir}/{path}" + parent = "/".join(path.split("/")[:-1]) + if parent: + await self.execute(f"mkdir -p {shlex.quote(parent)}") + result = await self._run_docker( + ["exec", "-i", "-w", self.working_dir, self._container_id, + "sh", "-c", f"cat > {shlex.quote(path)}"], + stdin_data=content.encode(), + ) + if result.exit_code != 0: + raise IOError(result.stderr) + + async def stop(self) -> None: + if self._container_id: + await self._run_docker(["rm", "-f", self._container_id], timeout=30) + self._container_id = None +``` + +### AgentCoreSandbox + +Wraps the Bedrock AgentCore Code Interpreter behind the Sandbox interface. Overrides `execute_code()`, `read_file()`, `write_file()`, and `list_files()` with native AgentCore API calls for better performance and binary-file support. 
+ +```python +import uuid +from typing import Any + +from strands.sandbox import Sandbox, ExecutionResult + + +class AgentCoreSandbox(Sandbox): + """Execute code and commands in AWS Bedrock AgentCore.""" + + def __init__(self, region: str | None = None, session_name: str | None = None): + self.region = region + self.session_name = session_name or f"sandbox-{uuid.uuid4().hex[:12]}" + self._client: Any = None + + async def start(self) -> None: + from bedrock_agentcore.tools.code_interpreter_client import ( + CodeInterpreter as BedrockAgentCoreCodeInterpreterClient, + ) + kwargs: dict[str, Any] = {} + if self.region: + kwargs["region"] = self.region + self._client = BedrockAgentCoreCodeInterpreterClient(**kwargs) + self._client.start(identifier="aws.codeinterpreter.v1", name=self.session_name) + + def _parse_stream_result(self, response: dict[str, Any]) -> ExecutionResult: + """Parse an AgentCore event-stream response into an ExecutionResult.""" + if "stream" in response: + for event in response["stream"]: + if "result" in event: + result = event["result"] + content = str(result.get("content", "")) + is_error = response.get("isError", False) + return ExecutionResult( + exit_code=1 if is_error else 0, + stdout="" if is_error else content, + stderr=content if is_error else "", + ) + return ExecutionResult(exit_code=1, stdout="", stderr=f"unexpected response: {response}") + + async def execute(self, command: str, timeout: int | None = None) -> ExecutionResult: + response = self._client.invoke("executeCommand", {"command": command}) + return self._parse_stream_result(response) + + # Override: AgentCore has a native code execution API + async def execute_code(self, code: str, language: str = "python", timeout: int | None = None) -> ExecutionResult: + response = self._client.invoke("executeCode", {"code": code, "language": language}) + return self._parse_stream_result(response) + + # Override: AgentCore has native file APIs + async def read_file(self, path: str) -> str: + 
response = self._client.invoke("readFiles", {"paths": [path]}) + result = self._parse_stream_result(response) + if result.exit_code != 0: + raise FileNotFoundError(result.stderr) + return result.stdout + + async def write_file(self, path: str, content: str) -> None: + response = self._client.invoke("writeFiles", {"content": [{"path": path, "text": content}]}) + result = self._parse_stream_result(response) + if result.exit_code != 0: + raise IOError(result.stderr) + + async def list_files(self, path: str = ".") -> list[str]: + response = self._client.invoke("listFiles", {"path": path}) + result = self._parse_stream_result(response) + if result.exit_code != 0: + raise FileNotFoundError(result.stderr) + return [f for f in result.stdout.strip().split("\n") if f] + + async def stop(self) -> None: + if self._client: + self._client.stop() + self._client = None +``` + +
+ +--- + +
+Appendix B: Prior art + +A survey of sandbox interfaces across the AI agent ecosystem. This informed the interface design in [Layer 1](#layer-1-sandbox). + +### E2B (e2b.dev) + +Cloud-based sandboxes running in Firecracker microVMs. The most widely adopted sandbox SDK in the agent ecosystem. + +Interface uses module-based grouping on the sandbox object: + +```python +from e2b_code_interpreter import Sandbox + +sandbox = Sandbox() + +# Shell commands via sandbox.commands +result = sandbox.commands.run("echo hello") # → stdout, stderr, exit_code + +# Code execution as a top-level method +execution = sandbox.run_code("x = 1 + 1; print(x)") # → text, logs + +# Filesystem via sandbox.files +sandbox.files.write("/tmp/data.txt", "content") +content = sandbox.files.read("/tmp/data.txt") +files = sandbox.files.list("/") +``` + +Key design choices: +- `commands`, `files`, and `pty` are separate readonly modules +- Code execution (`run_code`) is a top-level method, not under `commands` +- Sandboxes are stateful — variables persist across `run_code` calls +- Supports connecting to existing sandboxes by ID + +### Daytona (daytona.io) + +Container-based sandboxes with sub-90ms creation time. Similar to E2B but uses `process` and `fs` sub-objects. 
+ +```python +from daytona import Daytona, CreateSandboxParams + +daytona = Daytona() +sandbox = daytona.create(CreateSandboxParams(language="python")) + +# Code execution (language-aware) +response = sandbox.process.code_run('print("hello")') # → result, exit_code, stderr + +# Shell commands +response = sandbox.process.exec("ls -la", cwd="/home", timeout=10) # → result, exit_code + +# Filesystem +sandbox.fs.upload_file(b"data", "/tmp/file.txt") +files = sandbox.fs.list("/workspace") +``` + +Key design choices: +- Separates `code_run` (language-aware) from `exec` (shell) +- Filesystem uses `upload_file` (bytes) rather than `write` (string) +- Sandbox creation accepts language parameter + +### LangChain DeepAgents + +Abstract backend interface that multiple providers implement. The most relevant prior art for our design because it defines a minimal contract. + +Core insight: implement only `execute()`, get everything else for free. + +```python +class BaseSandbox(BackendProtocol): + def execute( + self, + command: str, + timeout: int = 120, + max_output_bytes: int = 100000, + env: dict | None = None, + inherit_env: bool = False, + ) -> dict: + """Run a shell command. Returns {stdout, stderr, returncode, duration}.""" + ... + + # Built on top of execute() by the base class: + # ls_info(path) → List[FileInfo] + # read(file_path, offset, limit) → str + # write(file_path, content) → WriteResult + # edit(file_path, old, new) → EditResult + # glob_info(pattern) → List[FileInfo] + # grep_raw(pattern, path) → List[GrepMatch] +``` + +Daytona, Modal, E2B, and Runloop all implement this single `execute()` method. The base class builds `ls`, `read`, `write`, `edit`, `glob`, and `grep` on top by running shell commands through `execute()`. 
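The "implement only `execute()`" pattern described above can be sketched in a few lines. This is a hypothetical illustration, not LangChain's code: `MinimalSandbox`, `SubprocessSandbox`, and the `read`/`write`/`ls` helpers are names assumed here, and the helpers assume a POSIX `sh` with `cat`, `printf`, and `ls` available.

```python
import shlex
import subprocess
from abc import ABC, abstractmethod


class MinimalSandbox(ABC):
    """Hypothetical base class: one abstract method, file helpers derived from it."""

    @abstractmethod
    def execute(self, command: str) -> tuple[int, str, str]:
        """Run a shell command and return (exit_code, stdout, stderr)."""

    def read(self, path: str) -> str:
        # Derived operation: reading a file is `cat` routed through execute().
        code, out, err = self.execute(f"cat {shlex.quote(path)}")
        if code != 0:
            raise FileNotFoundError(err.strip())
        return out

    def write(self, path: str, content: str) -> None:
        # Content travels as a quoted shell argument, so this only suits text
        # that fits in the platform's argument-size limit and has no NUL bytes.
        code, _, err = self.execute(
            f"printf %s {shlex.quote(content)} > {shlex.quote(path)}"
        )
        if code != 0:
            raise OSError(err.strip())

    def ls(self, path: str = ".") -> list[str]:
        code, out, err = self.execute(f"ls -1 {shlex.quote(path)}")
        if code != 0:
            raise FileNotFoundError(err.strip())
        return out.splitlines()


class SubprocessSandbox(MinimalSandbox):
    """Hypothetical provider: the only method it implements is execute()."""

    def execute(self, command: str) -> tuple[int, str, str]:
        proc = subprocess.run(["sh", "-c", command], capture_output=True, text=True)
        return proc.returncode, proc.stdout, proc.stderr
```

A provider with a native file API would override `read` and `write` instead of inheriting the shell-derived versions, which is how the encoding and binary limitations of this approach are usually avoided.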
+ +Key design choices: +- Single abstract method minimizes provider implementation effort +- Filesystem ops are derived from shell commands (for example, `read` runs `cat`) +- Tradeoff: shell-based file ops have encoding and binary limitations +- Providers can override derived methods with native implementations + +### OpenHands (formerly OpenDevin) + +Docker-based runtime with an action/observation pattern. More complex than a sandbox interface — it is a full agent runtime. + +``` +Architecture: + Host (backend) ←→ REST API ←→ Docker container (ActionExecutor) + +Actions: + CmdRunAction → runs shell commands + IPythonRunCellAction → runs Python code cells + FileReadAction → reads files + FileWriteAction → writes files +``` + +Key design choices: +- Action/observation dispatch pattern (extensible but complex) +- Communication via REST API between host and container +- `ActionExecutor` inside the container handles all action types +- Supports custom Docker images with pre-installed dependencies +- Daytona also provides an alternative runtime backend for OpenHands + +### Google ADK (Agent Development Kit) + +GKE-based with two execution modes: + +```python +executor = GkeCodeExecutor( + namespace="agent-sandbox-system", + executor_type="sandbox", # or "job" for ephemeral + sandbox_template="python-sandbox-template", +) +result = executor.execute_code(ctx, CodeExecutionInput(code="print('Hello')")) +``` + +Key design choices: +- Single `execute_code` method (very narrow interface) +- "Sandbox" mode uses pre-warmed instances for low latency +- "Job" mode creates ephemeral Kubernetes jobs with gVisor isolation +- Agent Engine variant auto-creates persistent sandboxes for multi-step tasks + +### Anthropic Sandbox Runtime (SRT) + +Not a sandbox SDK in the traditional sense. Focuses on policy and permissions rather than execution interface. 
+ +```python +config = { + "filesystem": { + "allowWrite": [".", "/tmp"], + "denyRead": ["~/.ssh"], + }, + "network": { + "allowedDomains": ["anthropic.com"], + }, +} +``` + +Key design choices: +- Security wrapper around local execution, not a remote sandbox +- Config-driven: declares what the sandbox can do, not how you call it +- Filesystem allow/deny lists for reads and writes +- Network traffic routed through proxy with domain filtering +- Used by Claude Code for autonomous execution in cloud VMs + +### Summary of patterns + +| Project | Interface style | Abstract methods | Filesystem approach | Isolation | +|---------|----------------|-----------------|-------------------|-----------| +| E2B | Module-based (`commands`, `files`, `pty`) | Multiple | Native API | Firecracker microVM | +| Daytona | Sub-object (`process`, `fs`) | Multiple | Native API | Container | +| LangChain DeepAgents | Single `execute()` | 1 | Shell-derived | Provider-dependent | +| OpenHands | Action/observation dispatch | Multiple action types | Action-based | Docker | +| Google ADK | Single `execute_code()` | 1 | Not exposed | GKE/gVisor | +| Anthropic SRT | Config-driven policy | N/A (wraps local) | Allow/deny lists | Process-level | + +The ecosystem converges on two patterns: (1) a single execution primitive with derived operations (LangChain, Google ADK), or (2) separate modules for commands, code, and files (E2B, Daytona). Our recommended Option A follows pattern 1, with the ability to override derived methods following pattern 2 where native APIs exist. + +
+ +--- + +
+Appendix C: Research on open questions + +### Namespace injection approaches in the ecosystem + +The programmatic tool caller needs to make tools callable as Python functions inside the execution context. Different frameworks handle this differently. + +smolagents' `CodeAgent` runs code in the host process with tools pre-loaded as global functions. Tools are registered at initialization (`tools=[WebSearchTool()]`) and become directly callable in the generated code (for example, `web_search("query")`). The execution happens in a sandboxed executor (E2B, Docker, Modal), but the tool namespace is injected before execution. smolagents supports `additional_authorized_imports` to allow specific imports in the sandbox. + +E2B and Daytona handle this at the environment level — custom Docker images or VM snapshots pre-populate `globals()` with functions before agent code runs. This works for static tool sets but not for dynamic tool registries. + +LangChain uses `@tool` decorated wrappers with runtime state injection (`ToolRuntime`). Tools access context via `runtime.state`, not via global namespace injection. The execution itself is not sandboxed by LangChain — it pairs with external executors. + +The key insight: for dynamic tool sets (where tools are registered at agent creation time, not baked into an image), in-process execution with namespace injection is the only practical approach. Remote sandboxes cannot receive arbitrary Python closures. This is why we recommend the programmatic tool caller runs orchestration code locally. + +### Working directory persistence approaches + +The current `shell` tool in `strands_tools` uses `CommandContext` to track the working directory. The devtools shell ([PR #39](https://github.com/strands-agents/devtools/pull/39)) takes a different approach: a persistent subprocess-based shell where `cd` and `export` persist naturally because commands are sent to the same shell process. 
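A persistent shell of the kind described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the devtools implementation: the `PersistentShell` class and its marker-line protocol are hypothetical, stderr is merged into stdout for simplicity, and a POSIX `/bin/sh` is assumed.

```python
from __future__ import annotations

import asyncio
import uuid


class PersistentShell:
    """Hypothetical helper: one long-lived /bin/sh, so `cd` and `export` persist."""

    def __init__(self) -> None:
        self._proc: asyncio.subprocess.Process | None = None

    async def start(self) -> None:
        self._proc = await asyncio.create_subprocess_exec(
            "/bin/sh",
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.STDOUT,  # merged for simplicity
        )

    async def run(self, command: str) -> tuple[int, str]:
        """Send one command to the shell and return (exit_code, output)."""
        assert self._proc and self._proc.stdin and self._proc.stdout
        # A unique marker line tells us where this command's output ends and
        # carries its exit status. Commands that read stdin would break this.
        marker = f"__done_{uuid.uuid4().hex}__"
        script = f"{command}\nprintf '\\n%s %s\\n' {marker} $?\n"
        self._proc.stdin.write(script.encode())
        await self._proc.stdin.drain()

        lines: list[str] = []
        while True:
            raw = await self._proc.stdout.readline()
            if not raw:
                raise RuntimeError("shell exited unexpectedly")
            line = raw.decode()
            if line.startswith(marker):
                exit_code = int(line.split()[1])
                break
            lines.append(line)
        return exit_code, "".join(lines).rstrip("\n")

    async def stop(self) -> None:
        if self._proc and self._proc.stdin:
            self._proc.stdin.close()  # EOF on stdin makes sh exit
            await self._proc.wait()
            self._proc = None
```

Because every command goes through the same process, two concurrent `run()` calls would interleave their output, which is the concurrency concern discussed later in this appendix.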
+ +E2B supports full sandbox persistence via `sandbox.pause()` and `sandbox.connect()` — the entire environment (filesystem, working directory, env vars) survives across sessions. Daytona provides persistent sandboxes by design, maintaining state between runs without rebuilding. + +Docker `exec` on a running container naturally preserves filesystem state. However, each `docker exec` starts a new shell process, so `cd` in one exec does not affect the next. To persist `cd`, you either track it externally or use a persistent shell inside the container. + +For `LocalSandbox`, a persistent shell process is the right approach. It matches Unix expectations and avoids fragile command parsing. The implementation would keep a long-running `/bin/sh` process, send commands via stdin, and read output from stdout/stderr. + +### Parallel execution and sandbox concurrency + +Strands' `ConcurrentToolExecutor` runs tool calls in parallel via `asyncio.create_task`. This creates a concurrency concern for sandboxes that maintain shared state. + +Docker handles this naturally — each `docker exec` gets its own process inside the container. Multiple concurrent execs do not interfere with each other (though they share the filesystem, so concurrent file writes can conflict). + +E2B and Daytona handle concurrent API calls independently — each call gets its own execution context within the sandbox. + +The challenge is `LocalSandbox` with a persistent shell. If two tools send commands to the same shell process simultaneously, stdout/stderr from both commands interleave unpredictably. Solutions: + +1. Serialize with a lock — simple but negates concurrency benefits. +2. Spawn subprocesses per call that inherit tracked state — each `execute()` creates a fresh subprocess with `cwd=self._tracked_cwd` and `env=self._tracked_env`, while the persistent shell is only used for state-tracking commands. This preserves concurrency while maintaining state. +3. 
Pool of shell processes — more complex, diminishing returns. + +Option 2 is the recommended approach for `LocalSandbox`. + +
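Option 2 can be sketched as below. This is a simplified illustration, not the proposed `LocalSandbox`: the `TrackedStateSandbox` class and its `change_dir`/`set_env` methods are hypothetical stand-ins for whatever mechanism (such as the persistent shell) ends up updating the tracked state, and results are returned as plain tuples rather than `ExecutionResult`.

```python
from __future__ import annotations

import asyncio
import os


class TrackedStateSandbox:
    """Hypothetical sketch: a fresh subprocess per call, seeded from tracked state."""

    def __init__(self, working_dir: str | None = None) -> None:
        self._tracked_cwd = working_dir or os.getcwd()
        self._tracked_env = dict(os.environ)

    def change_dir(self, path: str) -> None:
        # Stand-in for a state-tracking `cd`; relative paths resolve against
        # the tracked directory, not the host process's cwd.
        new_cwd = os.path.normpath(os.path.join(self._tracked_cwd, path))
        if not os.path.isdir(new_cwd):
            raise NotADirectoryError(new_cwd)
        self._tracked_cwd = new_cwd

    def set_env(self, key: str, value: str) -> None:
        # Stand-in for a state-tracking `export`.
        self._tracked_env[key] = value

    async def execute(self, command: str) -> tuple[int, str, str]:
        # Each call gets its own process and its own pipes, so concurrent
        # calls cannot interleave output. The env dict is copied so later
        # set_env() calls do not affect a command that is already running.
        proc = await asyncio.create_subprocess_shell(
            command,
            cwd=self._tracked_cwd,
            env=dict(self._tracked_env),
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        stdout, stderr = await proc.communicate()
        return proc.returncode or 0, stdout.decode(), stderr.decode()
```

With this shape, `asyncio.gather(sb.execute(a), sb.execute(b))` runs both commands in parallel against the same tracked directory and environment. The tradeoff is that state changes made inside a command (a `cd` in the middle of a script) are not observed unless they are routed through the tracking path.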