docs(designs): sandboxes and code execution #681

mkmeral wants to merge 5 commits into strands-agents:main from
Conversation
Introduces a Sandbox abstraction that decouples tool logic from execution environment, enabling tool reuse across local, Docker, and AgentCore sandboxes. Covers three layers: Sandbox interface (SDK-level), tooling updates including programmatic tool caller, and CodeAct plugin for code-based tool orchestration. Includes prior art survey of E2B, Daytona, LangChain DeepAgents, OpenHands, Google ADK, and Anthropic SRT.
- Rename to 'Sandboxes and code execution'
- Rewrite CodeAct plugin to use AfterInvocationEvent.resume for the observation loop instead of wrapping the agent
- Fix programmatic tool caller to run in host process, not sandbox
- Add open questions for namespace injection, cwd persistence, and parallel execution with research in Appendix C
- Fix duplicate shell example, dead variable, broken appendix link
- Add python tool description
- Add tool proxy section to Layer 1 with incremental rollout plan
- Fix contradiction between programmatic tool caller steps and relationship to sandbox section
- Fix async exec issue in CodeAct (wrap in async main)
- Remove redundant open question 1 (now covered by tool proxy section)
- Fix import paths: sandbox implementations live in SDK
- Add sandbox isolation row to comparison table
- Wrap appendices in details/summary tags
Documentation Preview Ready

Your documentation preview has been successfully deployed!

Preview URL: https://d3ehv1nix5p99z.cloudfront.net/pr-cms-681/docs/user-guide/quickstart/overview/
Updated at: 2026-03-19T22:10:28.182Z
Related POC branch and notebook
Initial agent review

## 🔍 Deep Review: Sandboxes & Code Execution Design + POC Branch

I've reviewed the full design document (1058 lines), the POC branch (

✅ What Works Well

The overall architecture is sound. Three things stand out:
🐛 Bugs in the POC Branch

Bug 1:
…eedback

- Change execute() to AsyncGenerator that yields output lines and final ExecutionResult, matching SDK's existing async tool streaming
- Add State management section to Layer 1 with stateless-by-default recommendation and rationale (concurrency safety)
- Move state discussion out of open questions into proper design section
- Explicitly address cd and export persistence in stateless model
- Update shell tool example to show streaming pattern
- Update TypeScript interface to match
Update Option A interface, tool examples, TypeScript SDK, and all three Appendix A implementation sketches to match the actual code.

Main body:
- Replace AsyncGenerator streaming with direct ExecutionResult return
- Use shlex.quote() for shell escaping in convenience methods
- Use randomized heredoc delimiter (secrets.token_hex) in write_file
- Remove _execute_to_result() helper (not needed without streaming)
- Update tool usage examples (shell, my_tool) to use await pattern
- Update TypeScript interface to use Promise<ExecutionResult>

Appendix A — LocalSandbox:
- Add timeout error handling (proc.kill() on TimeoutError)
- Add absolute path resolution (os.path.isabs check)
- Add os.makedirs for parent directory creation in write_file
- Add class docstring

Appendix A — DockerSandbox:
- Replace ... placeholders with full implementation
- Add _run_docker helper with stdin_data support
- Add stdin pipe approach for write_file (no heredoc injection)
- Add shlex.quote for safe path handling
- Add working_dir parameter and container lifecycle
- Add timeout handling

Appendix A — AgentCoreSandbox:
- Fix import path (bedrock_agentcore.tools.code_interpreter_client)
- Add _parse_stream_result helper for response parsing
- Add list_files override with native API
- Add uuid-based default session naming
- Add stop() method
- Add proper error handling in read_file/write_file
> | | Option A | Option B |
> |--------|-------------------------------|----------------------------|
> | Methods to implement | 1 required, 4 optional overrides | 5 required |
> | New provider effort | Minimal — one method gets you a working sandbox | Higher — must implement all filesystem ops |
> | Filesystem quality | Shell-based by default (encoding/binary issues) | Native by default (provider controls quality) |
What kind of issues/what does this look like/what customer impacting limitations does this impose?
So essentially, if you are trying to send all data (including files) over the shell, you will need to base64 them, and also be extra careful about the end of the file. For example, normally you'd have EOF tags, but now that you use the shell, what if the file content itself contains that tag? You might end up parsing something wrong.
The tradeoff here is, more methods are better/safer, but single method is better DevX, and easier to understand/develop.
So we propose single-method abstractions, but still provide the other methods (file read/write/etc.) as overridable, so folks can achieve a higher security bar.
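For illustration, the EOF-collision problem above is exactly what the revision notes' "randomized heredoc delimiter (secrets.token_hex)" addresses. A hedged sketch; `build_write_command` is a hypothetical helper, not the doc's actual `write_file`:

```python
import secrets
import shlex


def build_write_command(path: str, content: str) -> str:
    """Build a shell command that writes `content` to `path` via a heredoc.

    A randomized delimiter avoids the collision described above: with a
    fixed tag like EOF, content containing that tag would terminate the
    heredoc early and corrupt the write. Note the heredoc appends a
    trailing newline to the content.
    """
    delimiter = f"EOF_{secrets.token_hex(8)}"  # vanishingly unlikely to appear in content
    return (
        f"cat > {shlex.quote(path)} << '{delimiter}'\n"
        f"{content}\n"
        f"{delimiter}"
    )
```

Quoting the delimiter (`'{delimiter}'`) also disables shell expansion inside the heredoc body, so `$VAR` in file content is written literally.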
I comment on it elsewhere, but I'd lean on:

> The tradeoff here is, more methods are better/safer

Over:

> but single method is better DevX, and easier to understand/develop.

Especially given the tenet:

> The obvious path is the happy path

The # of people who will be implementing a Sandbox will be small, I think, compared to the number of users, so we want to nudge the implementation to be the right one. We can always offer BaseShellSandbox for "easy devx" IMHO.
> The following table summarizes the planned Sandbox implementations. See [Appendix A](#appendix-a-sandbox-implementation-sketches) for code examples.
>
> | Sandbox | Where it runs | Isolation | Latency | Use case |
Where do these latency stats come from?
Claude made them up; honestly, I didn't even check that column. I will delete it later, it doesn't really concern us in the scope of this doc.
I think it is definitely a statistic that would be useful to know, as it will have a large net influence on high-tool-use conversations per Sandbox, but I would be curious to see this tested rather than hallucinated lol.
```
│ result = await calculator(expression="2 + 2") │
│ print(result)                                 │
└──────────────────────┬──────────────────────────────────┘
                       │ HTTP POST /tool/calculator
```
Might be nice to have this in mermaid so it is renderable on gh.
Technically we can. Claude decided to do it 😅

That said, diagrams are a bit more annoying in the PR version; I tend to review raw md files (maybe I should change lol).
> I tend to review on raw md files
It would help me; I do side-by-side and mermaid would be prettier :)
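For reference, the flow in the boxed diagram above could be rendered as a mermaid sequence diagram. This is a sketch of the same call path, not the doc's actual diagram; participant names are illustrative:

```mermaid
sequenceDiagram
    participant S as Sandbox (generated code)
    participant P as Tool proxy (host)
    participant T as calculator tool
    S->>P: HTTP POST /tool/calculator {"expression": "2 + 2"}
    P->>T: calculator(expression="2 + 2")
    T-->>P: 4
    P-->>S: {"result": 4}
```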
> - `LocalSandbox` — spawns a fresh subprocess per `execute()` call. Stateless by default.
> - `DockerSandbox` — each `docker exec` gets its own process. Filesystem changes persist (shared container), but cwd and env do not carry across calls.
> - `AgentCoreSandbox` — session state persists natively via the AgentCore API.
Which aspects of AgentCore are you imagining here? Which API?
These APIs, essentially: AgentCore creates a code interpreter instance that we can connect to and run code on. That container itself is stateful (I'm guessing up to a certain TTL).
```python
result = agent("Calculate the squares of numbers 1-100 and sum them")
```

> The model might respond with a `programmatic_tool_caller` invocation containing:
just a thought -- I'm curious if this could have a twin tool that integrates with AI functions. seems fairly similar in nature.
> #### How it works
>
> 1. The model receives `programmatic_tool_caller` as one of its available tools
Would the model prefer writing code that calls tools or code that uses libraries/clients. In the case below for example, the model could instead write:
```python
print(sum(i ** 2 for i in range(1, 101)))
```
I think the answer is (as always) it depends. This is an application security concern. Depending on my security posture + what I want to accomplish, I might want to enable more libraries (or fewer).

If I want a highly secure env but want to use PTC (programmatic tool caller, I hate writing the long name), I can just remove everything from the namespace except tools, and maybe just add the AgentCore code interpreter as the sandbox.

But if I am running this locally to search GitHub for Strands references, I will be a lot more relaxed w.r.t. what I give my agent.
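The "remove everything from the namespace except tools" idea above can be made concrete with a hedged sketch. The `calculator` wrapper and the builtin allowlist are illustrative assumptions, not the actual PTC implementation:

```python
def calculator(expression: str):
    # Stand-in for a wrapped agent tool.
    return eval(expression, {"__builtins__": {}})


code = "result = calculator(expression='2 + 2')"

# Expose only tool wrappers plus a minimal builtin allowlist. Without
# __import__ in __builtins__, `import` statements fail outright.
namespace = {
    "__builtins__": {"print": print, "sum": sum, "range": range},
    "calculator": calculator,
}
exec(compile(code, "<programmatic_tool_caller>", "exec"), namespace)
print(namespace["result"])  # prints 4

try:
    exec(compile("import os", "<programmatic_tool_caller>", "exec"), namespace)
except ImportError as e:
    print("import blocked:", e)
```

This is namespace hygiene, not a security boundary; real isolation still needs a sandbox underneath.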
> ### P0
>
> - [ ] Define `Sandbox` ABC and `ExecutionResult` in `sdk-python`
I don't believe in them anymore 🙃
I already have a branch that implements this https://github.com/mkmeral/sdk-python/tree/feat/sandbox-abstraction and a notebook if you want to try it out https://github.com/mkmeral/sdk-python/blob/feat/sandbox-abstraction/notebooks/sandbox_demo.ipynb
Will poke around! But any estimates on date/effort to get this code from the dev branch into prod? Mainly asking so I can track this for context management and have it inform my roadmap dates.
```python
wrapped = "async def __codeact_main__():\n" + "\n".join(
    f"    {line}" for line in code.splitlines()
)
exec(compile(wrapped, "<codeact>", "exec"), self.namespace)
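One caveat with the wrapping pattern quoted above: defining `__codeact_main__` is not enough, it also has to be awaited to run. A self-contained sketch of the full pattern, where the `code` string and `namespace` dict are illustrative stand-ins:

```python
import asyncio

code = "result = sum(i ** 2 for i in range(1, 101))\nprint(result)"
namespace = {}

# Indent each line of the model-generated code into an async function body,
# so top-level `await` inside the snippet would be legal.
wrapped = "async def __codeact_main__():\n" + "\n".join(
    f"    {line}" for line in code.splitlines()
)
exec(compile(wrapped, "<codeact>", "exec"), namespace)

# Defining the function only compiles it; it must be awaited to execute.
asyncio.run(namespace["__codeact_main__"]())  # prints 338350
```

Note that `result` is local to `__codeact_main__` here; persisting values across turns would need explicit plumbing back into the namespace.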
Does it have to be limited to Python?
Aligns programmatic_tool_caller with the sandboxes design doc (strands-agents/docs#681) Phase 1 requirements:

- Remove Executor ABC and LocalAsyncExecutor classes. The design doc separates Sandbox (SDK-level, where code runs) from the programmatic tool caller (tools-level, runs in host process). The Executor abstraction competed with the Sandbox design.
- Inline async execution logic directly in the tool function. Phase 1 always runs orchestration code in-process. The ~15 lines of execution logic are now directly in programmatic_tool_caller().
- Use compile() for better error tracebacks. Per the design doc: compile(code, '<programmatic_tool_caller>', 'exec') gives clearer tracebacks than raw exec().
- Remove custom executor documentation and examples. The Custom Executors section in the module docstring is removed. The Sandbox + Tool Proxy design (Phase 2) replaces this concept.
- Remove executor-related tests. TestExecutor class and test_custom_executor removed. Added test_stderr_captured and test_syntax_error_handled for coverage.

The core tool logic (tool wrappers, _execute_tool, _create_async_tool_function, _validate_code, _get_allowed_tools) is unchanged. The tool gets simpler, not more complex.

Refs: strands-agents/docs#681, strands-agents#387
```
└─────────────────────────────────────────────────┘
```
> ### Layer 1: Sandbox
Let's decouple the name; Sandbox implies safety, but this is more of an Environment esp when running locally.
Not a fan of Environment directly either, but it's closer than Sandbox
I have naming alternatives below. I think environment is too overloaded
> #### Interface
>
> The Sandbox ABC lives in the SDK (`strands-agents/sdk-python`). Concrete implementations (`LocalSandbox`, `DockerSandbox`, `AgentCoreSandbox`) also live in the SDK as vended sandbox providers, similar to how the SDK already vends tools and plugins. Third-party sandbox providers can be published as separate packages.
> Concrete implementations (`LocalSandbox`, `DockerSandbox`, `AgentCoreSandbox`)

I'm not as convinced of this living in the SDK, esp. DockerSandbox and AgentCoreSandbox, esp. because of the dependencies they might include. I'm okay with us vending them, but the SDK isn't the right place.
> share the same Sandbox instance, giving them a common working directory, environment variables, and filesystem.
>
> Implementations only need to provide execute(). All other methods
> Implementations only need to provide execute(). All other methods are built on top of it.

Option 1 & Option 2 are the same API shape; it's just that one provides a default implementation, right?

I'd like to nudge implementors to implement the Sandbox fully to be more efficient; ShellBasedSandbox can be an escape hatch for those that just have a shell or don't care about optimizations, but I think the default is "Conform to this entire API".
```python
async def write_file(self, path: str, content: str) -> None: ...

@abstractmethod
async def list_files(self, path: str = ".") -> list[str]: ...
```
3 file APIs is fewer than I'd expect, TBH; are we sure that's it?

https://developer.mozilla.org/en-US/docs/Web/API/File_System_API is a sandboxed implementation of a filesystem (and one we should identify as a potential target for TS) to see what else we might need. The only thing I really see from there is a "get[root]Directory()".
> We recommend stateless by default: each `execute()` call is independent. This avoids concurrency issues when multiple tools call the sandbox in parallel (tool A does `cd /tmp`, tool B does `cd /home` — with shared state, the last one wins and both are confused). Stateless execution is predictable and matches the LangChain DeepAgents model.
>
> This means `cd`, `export`, and other state-modifying commands do not persist between calls. Tools that need state persistence (for example, a shell tool that tracks `cd` across calls) can manage it themselves, as the current `shell` tool already does via `CommandContext`.
> for example, a shell tool that tracks `cd` across calls
Have we POC'd this? I'd be curious if the shell would be able to track this given that it'd be delegating to the Sandbox for a lot more work.
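As a rough answer to the POC question: tracking `cd` on top of a stateless `execute()` can be sketched as below. `CommandContext`, `execute`, and `shell_tool` here are hypothetical stand-ins, not the actual tool or Sandbox implementation:

```python
import asyncio
import shlex
import subprocess
from dataclasses import dataclass


@dataclass
class CommandContext:
    # Hypothetical stand-in for the shell tool's tracked state.
    cwd: str = "."


async def execute(command: str) -> str:
    # Stand-in for a stateless Sandbox.execute(): a fresh process per call.
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return proc.stdout


async def shell_tool(command: str, ctx: CommandContext) -> str:
    # Re-establish the remembered cwd, run the command, then append `pwd`
    # so the resulting cwd can be read back and survive into the next call.
    wrapped = f"cd {shlex.quote(ctx.cwd)} && {command} && pwd"
    out = (await execute(wrapped)).rstrip("\n")
    lines = out.splitlines()
    if lines:
        ctx.cwd = lines[-1]  # remember cwd for the next stateless call
    return "\n".join(lines[:-1])


ctx = CommandContext()
asyncio.run(shell_tool("cd /tmp", ctx))
print(ctx.cwd)  # cwd persisted across the stateless call boundary
```

The trick only works because the cwd is round-tripped through the command itself; `export`ed env vars would need the same treatment.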
> #### Tool proxy
>
> When code runs inside a sandbox (for example, the programmatic tool caller or CodeAct executing model-generated code in Docker), that code needs to call agent tools. But agent tools are Python objects in the host process — they cannot be serialized into a remote sandbox. The tool proxy solves this by bridging tool calls from the sandbox back to the host.
> that code needs to call agent tools
I wonder - do they? have we explored splitting the world in terms of "Tools that are attached to the environment, but not the agent". That would reduce a lot of complexity.
What I'm trying to tease out is: how important is tool context vs code execution?

> Strands tools have closures over the agent, access ToolContext, make API calls — they cannot be serialized as source
> We recommend the callback proxy approach, implemented incrementally:
>
> 1. Phase 1 (P0/P1): orchestration code runs in the host process. Tools are local async functions. This is simple and works today.
> 2. Phase 2 (P2): add a tool proxy server. Orchestration code runs fully inside the sandbox. Tool calls are proxied back to the host via HTTP.
If we go forward with the concept of a tool proxy, I think the sandbox should have the concept of the proxy built in; otherwise we're delegating a lot of complexity to the caller
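A minimal sketch of what a built-in callback proxy could look like, using only the standard library. The `/tool/<name>` route, the `TOOLS` registry, and the `calculator` stub are assumptions for illustration, not the actual design:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# --- Host side: proxy server dispatching /tool/<name> to local tools ---


def calculator(expression: str):
    # Stand-in for a real agent tool living in the host process.
    return eval(expression, {"__builtins__": {}})


TOOLS = {"calculator": calculator}


class ToolProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        name = self.path.rsplit("/", 1)[-1]
        body = self.rfile.read(int(self.headers["Content-Length"]))
        result = TOOLS[name](**json.loads(body))
        payload = json.dumps({"result": result}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the demo quiet


server = HTTPServer(("127.0.0.1", 0), ToolProxyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# --- Sandbox side: a generated stub that calls back to the host over HTTP ---


def call_tool(name: str, **kwargs):
    req = urllib.request.Request(
        f"http://127.0.0.1:{server.server_port}/tool/{name}",
        data=json.dumps(kwargs).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["result"]


print(call_tool("calculator", expression="2 + 2"))  # prints 4
```

If the sandbox owns the proxy lifecycle, as suggested above, the caller would only ever see generated stubs like `call_tool`.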
> #### New `python` tool
>
> A new sandbox-based `python` tool delegates to `sandbox.execute_code()`. Unlike the existing `python_repl` (which maintains a persistent namespace via `dill` serialization and supports interactive PTY), the `python` tool is stateless — each invocation runs in a fresh interpreter. This is simpler, works across all sandbox types, and is the recommended default. The existing `python_repl` tool remains available for use cases that need stateful execution.
Does it have to be python? I'm thinking that in the browser, JS is the natural choice.
> #### New `python` tool
>
> A new sandbox-based `python` tool delegates to `sandbox.execute_code()`. Unlike the existing `python_repl` (which maintains a persistent namespace via `dill` serialization and supports interactive PTY), the `python` tool is stateless — each invocation runs in a fresh interpreter. This is simpler, works across all sandbox types, and is the recommended default. The existing `python_repl` tool remains available for use cases that need stateful execution.
Did we do an analysis on other similar tools to see if they're mostly stateless? I can imagine that stateful execution would be useful for incremental development. Not a blocker though
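For concreteness, a stateless `python` tool over a minimal local sandbox might look like the sketch below. `ExecutionResult`, `LocalSandbox`, and `python_tool` are simplified stand-ins for the doc's interfaces, not the actual implementation:

```python
import asyncio
import os
import shlex
import sys
import tempfile
from dataclasses import dataclass


@dataclass
class ExecutionResult:
    stdout: str
    stderr: str
    exit_code: int


class LocalSandbox:
    """Minimal stand-in for the doc's LocalSandbox: fresh subprocess per call."""

    async def execute(self, command: str, timeout: float = 30.0) -> ExecutionResult:
        proc = await asyncio.create_subprocess_shell(
            command,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        out, err = await asyncio.wait_for(proc.communicate(), timeout)
        return ExecutionResult(out.decode(), err.decode(), proc.returncode or 0)

    async def write_file(self, path: str, content: str) -> None:
        with open(path, "w") as f:
            f.write(content)


async def python_tool(code: str, sandbox: LocalSandbox) -> str:
    # Stateless: write the snippet, run a fresh interpreter, return stdout.
    path = os.path.join(tempfile.mkdtemp(), "snippet.py")
    await sandbox.write_file(path, code)
    result = await sandbox.execute(
        f"{shlex.quote(sys.executable)} {shlex.quote(path)}"
    )
    return result.stdout


print(asyncio.run(python_tool("print(21 * 2)", LocalSandbox())).strip())  # prints 42
```

Statelessness falls out of the structure: nothing survives between calls except what the snippet itself writes to the sandbox filesystem.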
> #### Relationship to Sandbox
>
> The programmatic tool caller does not run its code in the sandbox. It runs the model's orchestration code in the host process, injecting tool wrappers as async functions that call `agent.tool.X()`. Those tools in turn use the sandbox. The sandbox is used by the individual tools, not by the orchestration layer.
> It runs the model's orchestration code in the host process
Why? This seems like we'd have two different execution patterns - why not do it in the sandbox?
> #### How it works in Strands
>
> CodeAct maps naturally to Strands' hook system. The [`AfterInvocationEvent.resume`](https://strandsagents.com/docs/user-guide/concepts/agents/hooks/index.md) property triggers a follow-up agent invocation with new input, which is exactly the CodeAct observation loop: model generates code → execute → feed results back as next observation → model generates more code.
I think the state machine would be a better fit; this is effectively rewriting the agent-loop.
But implementation detail I guess
> | Sandbox isolation | Phase 1: host process. Phase 2: full sandbox via tool proxy | Phase 1: host process. Phase 2: full sandbox via tool proxy |
> | Use case | Optimization for batch operations | Full paradigm shift |
>
> The programmatic tool caller is a tool the model can optionally use within standard tool calling. CodeAct changes how the agent interacts entirely — the model always writes code, and the hook system handles execution and feedback.
What have been the most common use cases for CodeAct? Coding agents?
> ### Layer 3: CodeAct plugin
>
> CodeAct is a higher-level paradigm where the agent always responds with code instead of JSON tool calls. It is implemented as a hook using the existing agent lifecycle, not a modification to the agent loop.
This is really interesting; how effective has it been in practice?

I'm curious how closely the model(s) actually output code instead of text.
> ## Open questions
>
> 1. **Namespace injection for programmatic tool calling.** The programmatic tool caller and CodeAct need tool wrapper functions in the code execution namespace. Phase 1 runs orchestration code in the host process with tools as local functions. Phase 2 uses the tool proxy to run code fully inside the sandbox. See [Tool proxy](#tool-proxy) for the full design and alternatives considered.
> We recommend that the `Sandbox` interface makes no concurrency guarantees. Implementations document their behavior. `LocalSandbox` can handle this by spawning subprocesses that inherit the persistent shell's tracked state (cwd, env) rather than sharing the shell process for all calls.
> 4. **Interactive mode.** Both `shell` and `python_repl` today support interactive PTY mode for real-time user input. This does not map cleanly to remote sandboxes (Docker, AgentCore). Our inclination is to keep interactive mode as a `LocalSandbox` concern and not complicate the abstract interface. This is not a blocker for the initial implementation and can be revisited later.
> today support interactive PTY mode for real-time user input
How does this end up surfacing to users?
> | Name | Pros | Cons |
> |------|------|------|
> | `Sandbox` (recommended) | Clear mental model, aligns with E2B/Daytona/LangChain ecosystem terminology, implies containment | Implies isolation that `LocalSandbox` does not provide |
Ha; here's the section.

> Clear mental model, aligns with... implies containment

I'm not sure this is clear. This is not a Sandbox; I feel like it's the opposite of "clear" ;). It can be a sandbox, but LocalSandbox is not clear IMHO.

Some more alternatives:

- `ExecutionRuntime`
- `CodeEnvironment`
- `ToolEnvironment`
- `ToolExecutionEnvironment`
- `ToolBackend`
> ### P2
>
> - [ ] Implement `CodeActPlugin`
> - [ ] Add Anthropic advanced tool use support (tool search tool, tool use examples)
This one seems like a separate workstream
> - [ ] Implement `CodeActPlugin`
> - [ ] Add Anthropic advanced tool use support (tool search tool, tool use examples)
> - [ ] Tool proxy: implement proxy server, stub generation, and integrate with programmatic tool caller and CodeAct
> - [ ] Mirror `Sandbox` interface in `sdk-typescript`
Depending on timelines, I would expect this done much sooner. We should be doing each phase in parallel.
> We considered making Sandbox a tool-level concern — each tool would optionally accept a Sandbox in its constructor. This avoids SDK changes but means every tool must independently handle Sandbox injection, and there is no guarantee that tools share the same Sandbox instance.
>
> Rejected because the shared-environment property (tools operating in the same filesystem) requires the Sandbox to be agent-level, not tool-level.
I was convinced that agent.sandbox should/can exist.
However this:

> Rejected because the shared-environment property... requires the Sandbox to be agent-level, not tool-level.

Is a larger issue that we can address. E.g. if we wanted Sandbox to not be fundamental to an agent, then we need a mechanism to store non-JSON data with an agent. E.g.:

```python
agent.services.set(Sandbox, sandbox)
sandbox = agent.services.get(Sandbox)
```
Description
Design doc proposing three layered additions to Strands:

1. A `Sandbox` abstraction (SDK-level) that decouples tool logic from execution environment. Tools delegate to `tool_context.agent.sandbox` instead of managing their own subprocesses and filesystem access. The agent defaults to `LocalSandbox`, so existing behavior is preserved. `DockerSandbox` and `AgentCoreSandbox` provide isolation for production and cloud deployments. The interface follows the minimal `execute()` pattern validated by LangChain DeepAgents, E2B, and Daytona.

2. A tooling layer that updates existing tools (`shell`, `python_repl`, `file_read`, `file_write`) to use the sandbox, adds a new stateless `python` tool, and introduces `programmatic_tool_caller` — a tool that lets the model write Python code to orchestrate other tools, reducing API round-trips and keeping intermediate results out of context.

3. A `CodeActPlugin` that implements the CodeAct paradigm (Apple ML Research) via Strands' hook system. Uses `AfterInvocationEvent.resume` for the observation loop: model generates code, plugin executes it, feeds results back as next observation. No custom agent loop needed.

The doc also covers a tool proxy design for running model-generated code fully inside the sandbox with tool calls proxied back to the host via HTTP, implemented incrementally (host-process first, proxy later).
Includes prior art survey of E2B, Daytona, LangChain DeepAgents, OpenHands, Google ADK, and Anthropic SRT.
Related Issues
Type of Change
Checklist
- `npm run dev`

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.