74 changes: 50 additions & 24 deletions docs/references/api/nemo-evaluator/sandbox/index.rst
@@ -1,55 +1,81 @@
``nemo_evaluator.sandbox``
======================================

Sandbox implementations used by evaluation harnesses that need a tmux-like interactive session.
Sandbox implementations used by evaluation harnesses that need isolated
container environments for command execution, file transfer, and agent hosting.

This module is designed to keep dependencies **optional**:

- The ECS Fargate implementation only imports AWS SDKs (``boto3``/``botocore``) when actually used.
- Using the ECS sandbox also requires the AWS CLI (``aws``) and ``session-manager-plugin`` on the host.
- Transport is SSH-based; no AWS CLI or session-manager-plugin required on the host.

Usage (ECS Fargate)
-------------------
Sandbox Protocol
----------------

Typical usage is:
All sandbox backends implement the :class:`~nemo_evaluator.sandbox.base.Sandbox`
protocol, so harnesses can be written backend-agnostically:

- configure :class:`~nemo_evaluator.sandbox.ecs_fargate.EcsFargateConfig`
- :meth:`~nemo_evaluator.sandbox.ecs_fargate.EcsFargateSandbox.spin_up` a sandbox context
- create an interactive :class:`~nemo_evaluator.sandbox.base.NemoSandboxSession`
- ``start()`` / ``stop()`` — lifecycle
- ``exec(command)`` — run a shell command
- ``upload(local_path, remote_path)`` / ``download(remote_path, local_path)`` — file transfer
- ``is_running`` — health check
- Context manager (``with sandbox: ...``) for automatic cleanup
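A minimal sketch of what backend-agnostic harness code could look like against this protocol. The ``ExecResult`` and ``Sandbox`` definitions mirror the ones added in ``base.py`` in this change; ``LocalSubprocessSandbox`` and ``run_smoke_check`` are hypothetical names invented for illustration and are not part of ``nemo_evaluator``.

```python
from __future__ import annotations

import shutil
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class ExecResult:
    """Local mirror of the ExecResult dataclass from base.py."""

    stdout: str
    stderr: str
    return_code: int


class Sandbox(Protocol):
    """Local mirror of the Sandbox protocol from base.py."""

    def start(self, *, force_build: bool = False) -> None: ...
    def stop(self) -> None: ...
    def exec(self, command: str, timeout_sec: float = 180) -> ExecResult: ...
    def upload(self, local_path: Path, remote_path: str) -> None: ...
    def download(self, remote_path: str, local_path: Path) -> None: ...
    @property
    def is_running(self) -> bool: ...
    def __enter__(self) -> "Sandbox": ...
    def __exit__(self, *exc: object) -> None: ...


class LocalSubprocessSandbox:
    """Hypothetical stand-in backend that runs commands on the local host."""

    def __init__(self) -> None:
        self._running = False

    def start(self, *, force_build: bool = False) -> None:
        self._running = True

    def stop(self) -> None:
        self._running = False

    def exec(self, command: str, timeout_sec: float = 180) -> ExecResult:
        proc = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout_sec
        )
        return ExecResult(proc.stdout, proc.stderr, proc.returncode)

    def upload(self, local_path: Path, remote_path: str) -> None:
        shutil.copy(local_path, remote_path)  # "remote" is just a local path here

    def download(self, remote_path: str, local_path: Path) -> None:
        shutil.copy(remote_path, local_path)

    @property
    def is_running(self) -> bool:
        return self._running

    def __enter__(self) -> "LocalSubprocessSandbox":
        return self

    def __exit__(self, *exc: object) -> None:
        self.stop()  # automatic cleanup when the with-block exits


def run_smoke_check(sandbox: Sandbox) -> ExecResult:
    """Harness code that depends only on the protocol, not on a backend."""
    with sandbox:
        sandbox.start()
        return sandbox.exec("echo hello")
```

Because ``Sandbox`` is a structural protocol, the stand-in backend does not need to inherit from it. Any object with matching methods can be passed to ``run_smoke_check``.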

Example::
Usage (ECS Fargate — exec-server mode)
--------------------------------------

from nemo_evaluator.sandbox import EcsFargateConfig, EcsFargateSandbox
For harnesses that drive execution from the orchestrator (e.g. terminal-bench)::

    from nemo_evaluator.sandbox import EcsFargateConfig, EcsFargateSandbox, SshSidecarConfig

    cfg = EcsFargateConfig(
        region="us-west-2",
        cluster="my-ecs-cluster",
        task_definition="my-task-def:1",
        container_name="eval",
        subnets=["subnet-abc"],
        security_groups=["sg-xyz"],
        image_template="123456789.dkr.ecr.us-west-2.amazonaws.com/my-repo:{task_id}",
        s3_bucket="my-staging-bucket",
        ssh_sidecar=SshSidecarConfig(
            public_key_secret_arn="arn:aws:secretsmanager:...:my-pubkey",
            private_key_secret_arn="arn:aws:secretsmanager:...:my-privkey",
            exec_server_port=19542,
        ),
    )

    with EcsFargateSandbox(cfg, task_id="task-001", run_id="run-001") as sandbox:
        sandbox.start()
        result = sandbox.exec("echo hello")
        print(result.stdout)

Usage (ECS Fargate — agent-server mode)
---------------------------------------

For harnesses that host an agent inside the container (e.g. openhands),
omit ``exec_server_port`` and configure reverse/forward tunnels::

    cfg = EcsFargateConfig(
        ...
        ssh_sidecar=SshSidecarConfig(
            public_key_secret_arn="arn:aws:secretsmanager:...:my-pubkey",
            private_key_secret_arn="arn:aws:secretsmanager:...:my-privkey",
            target_url_env="MODEL_URL",
        ),
    )

    with EcsFargateSandbox.spin_up(
        cfg=cfg,
        task_id="task-001",
        trial_name="trial-0001",
        run_id="run-2026-01-12",
    ) as sandbox:
        session = sandbox.create_session("main")
        session.send_keys(["echo hello", "Enter"], block=True)
        print(session.capture_pane())
    with EcsFargateSandbox(cfg, task_id="task-001", run_id="run-001") as sandbox:
        sandbox.start()
        # The agent inside the container can now reach the model via the reverse tunnel.
        # The orchestrator can reach the agent's API via sandbox.local_port.

Prerequisites / Notes
---------------------

- The harness host must have **AWS CLI** and **session-manager-plugin** installed.
- If you use S3-based fallbacks (large uploads / long commands), configure ``s3_bucket``.
- SSH keys must be **pre-provisioned** in AWS Secrets Manager.
- If you use S3-based file staging (large uploads / downloads), configure ``s3_bucket``.
- Docker image building via AWS CodeBuild is available through :class:`~nemo_evaluator.sandbox.ecs_fargate.ImageBuilder`.

.. automodule:: nemo_evaluator.sandbox
:members:
:undoc-members:
:member-order: bysource


24 changes: 14 additions & 10 deletions packages/nemo-evaluator/src/nemo_evaluator/sandbox/__init__.py
@@ -13,21 +13,25 @@
# See the License for the specific language governing permissions and
# limitations under the License.


from .base import NemoEvaluatorSandbox, NemoSandboxCommand, NemoSandboxSession
from .ecs_fargate import (
    AwsCliMissingError,
    EcsExecError,
from nemo_evaluator.sandbox.base import ExecResult, Sandbox
from nemo_evaluator.sandbox.ecs_fargate import (
    EcsFargateConfig,
    EcsFargateSandbox,
    EnvVarSpec,
    ExecClient,
    ImageBuilder,
    SshSidecarConfig,
    SshTunnel,
)

__all__ = [
    "NemoEvaluatorSandbox",
    "NemoSandboxCommand",
    "NemoSandboxSession",
    "AwsCliMissingError",
    "EcsExecError",
    "ExecResult",
    "Sandbox",
    "EcsFargateConfig",
    "EcsFargateSandbox",
    "EnvVarSpec",
    "ExecClient",
    "ImageBuilder",
    "SshSidecarConfig",
    "SshTunnel",
]
133 changes: 38 additions & 95 deletions packages/nemo-evaluator/src/nemo_evaluator/sandbox/base.py
@@ -13,103 +13,46 @@
# See the License for the specific language governing permissions and
# limitations under the License.

"""Sandbox protocol and shared types.

Any sandbox backend (ECS Fargate, local Docker, Modal, etc.) implements
the :class:`Sandbox` protocol so harnesses can be backend-agnostic.
"""

from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import ContextManager, Iterable, Protocol, runtime_checkable


@dataclass(frozen=True)
class NemoSandboxCommand:
    """
    TB-independent command model for driving an interactive terminal.

    Mirrors the fields terminal-bench agents commonly use (but does not depend on TB).
    """

    command: str
    min_timeout_sec: float = 0.0
    max_timeout_sec: float = 180.0
    block: bool = False
    append_enter: bool = True


@runtime_checkable
class NemoSandboxSession(Protocol):
    """
    Minimal session API used by agents/harnesses (tmux-like).
    """

    def send_keys(
        self,
        keys: str | list[str],
        block: bool = False,
        min_timeout_sec: float = 0.0,
        max_timeout_sec: float = 180.0,
    ) -> None: ...

    def send_command(self, command: NemoSandboxCommand) -> None: ...

    def capture_pane(self, capture_entire: bool = False) -> str: ...

    def is_session_alive(self) -> bool: ...

    def get_incremental_output(self) -> str: ...

    def get_asciinema_timestamp(self) -> float: ...

    def copy_to_sandbox(
        self,
        paths: list[Path] | Path,
        container_dir: str | None = None,
        container_filename: str | None = None,
    ) -> None: ...


class NemoEvaluatorSandbox(ABC):
    """
    Abstract factory for evaluator sandboxes.

    Implementations are responsible for provisioning an isolated environment and exposing
    a tmux-like session API for agents to interact with it.
    """

    @classmethod
    @abstractmethod
    def spin_up(
        cls,
        *,
        task_id: str,
        trial_name: str,
        run_id: str,
        pre_upload_paths: Iterable[Path] | None = None,
        upload_dest_dir: str | None = None,
        **kwargs,
    ) -> ContextManager["NemoEvaluatorSandbox"]:
        raise NotImplementedError

    @abstractmethod
    def create_session(
        self,
        session_name: str,
        is_active_stream: bool = False,
        as_configured_user: bool = True,
    ) -> NemoSandboxSession:
        raise NotImplementedError

    @abstractmethod
    def copy_to_sandbox(
        self,
        *,
        paths: list[Path] | Path,
        container_dir: str | None = None,
        container_filename: str | None = None,
    ) -> None:
        raise NotImplementedError

    @abstractmethod
    def stop(self) -> None:
        raise NotImplementedError
from typing import Protocol

from typing_extensions import Self


@dataclass
class ExecResult:
    """Result of a command executed inside a sandbox."""

    stdout: str
    stderr: str
    return_code: int


class Sandbox(Protocol):

[Power Review] ⚠️ Sandbox protocol claims universal contract but EcsFargateSandbox violates it in agent-server mode · 85% confidence

Per the knowledge document (Inconsistent Error Contract): 'A function documents that it raises ValueError on bad input but actually raises TypeError, or an HTTP endpoint documents 404 but returns 400.' The Sandbox protocol docstring says 'Minimal contract every sandbox backend must satisfy' and defines exec(), upload(), download(). However, EcsFargateSandbox raises RuntimeError for all three in agent-server mode (lines 2227-2232). Callers programming to the Sandbox protocol cannot reliably call these methods.

💡 Suggestion: Either split the protocol into Sandbox (lifecycle only) and ExecSandbox(Sandbox) (with exec/upload/download), or document clearly in the protocol docstring that exec/upload/download are optional and may raise NotImplementedError or RuntimeError depending on mode.
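One way the suggested split could look, as a rough sketch rather than the project's actual design. ``LifecycleSandbox``, ``ExecSandbox``, ``AgentOnlyBackend``, ``ExecBackend``, and ``supports_exec`` are hypothetical names chosen for illustration. The method signatures are copied from the ``Sandbox`` protocol in this diff.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Protocol, runtime_checkable


@dataclass
class ExecResult:
    stdout: str
    stderr: str
    return_code: int


@runtime_checkable
class LifecycleSandbox(Protocol):
    """Hypothetical base protocol: lifecycle and health check only."""

    def start(self, *, force_build: bool = False) -> None: ...
    def stop(self) -> None: ...
    @property
    def is_running(self) -> bool: ...


@runtime_checkable
class ExecSandbox(LifecycleSandbox, Protocol):
    """Hypothetical extension: backends that also support exec and file transfer."""

    def exec(self, command: str, timeout_sec: float = 180) -> ExecResult: ...
    def upload(self, local_path: Path, remote_path: str) -> None: ...
    def download(self, remote_path: str, local_path: Path) -> None: ...


class AgentOnlyBackend:
    """Dummy backend standing in for an agent-server style sandbox."""

    def __init__(self) -> None:
        self._running = False

    def start(self, *, force_build: bool = False) -> None:
        self._running = True

    def stop(self) -> None:
        self._running = False

    @property
    def is_running(self) -> bool:
        return self._running


class ExecBackend(AgentOnlyBackend):
    """Dummy backend standing in for an exec-server style sandbox."""

    def exec(self, command: str, timeout_sec: float = 180) -> ExecResult:
        return ExecResult(stdout="", stderr="", return_code=0)  # placeholder result

    def upload(self, local_path: Path, remote_path: str) -> None:
        pass  # no-op in this sketch

    def download(self, remote_path: str, local_path: Path) -> None:
        pass  # no-op in this sketch


def supports_exec(sandbox: LifecycleSandbox) -> bool:
    """Structural check: does this sandbox expose exec/upload/download?"""
    return isinstance(sandbox, ExecSandbox)
```

With a split like this, a harness that needs ``exec`` can annotate its parameter as ``ExecSandbox`` and let a type checker reject agent-only backends, instead of discovering the limitation through a ``RuntimeError`` at call time. Note that ``runtime_checkable`` only checks that the attributes exist, not their signatures.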

    """Minimal contract every sandbox backend must satisfy."""

    def start(self, *, force_build: bool = False) -> None: ...

    def stop(self) -> None: ...

    def exec(self, command: str, timeout_sec: float = 180) -> ExecResult: ...

    def upload(self, local_path: Path, remote_path: str) -> None: ...

    def download(self, remote_path: str, local_path: Path) -> None: ...

    @property
    def is_running(self) -> bool: ...

    def __enter__(self) -> Self: ...

    def __exit__(self, *exc: object) -> None: ...