Skip to content

feat: ECS refactor#785

Open
Glorf wants to merge 8 commits intomainfrom
mbien/ecs-refactor
Open

feat: ECS refactor#785
Glorf wants to merge 8 commits intomainfrom
mbien/ecs-refactor

Conversation

@Glorf
Copy link
Contributor

@Glorf Glorf commented Feb 27, 2026

Summary by CodeRabbit

  • Refactor
    • Updated sandbox module's public API with new utility types for environment variables, SSH configuration, image building, and execution management. Simplified the sandbox protocol interface for more straightforward integration. Removed several deprecated classes and exception types to improve API clarity.

Signed-off-by: Michal Bien <mbien@nvidia.com>
@Glorf Glorf requested review from a team as code owners February 27, 2026 17:46
@copy-pr-bot
Copy link

copy-pr-bot bot commented Feb 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Glorf
Copy link
Contributor Author

Glorf commented Feb 27, 2026

@CodeRabbit full review

@coderabbitai
Copy link

coderabbitai bot commented Feb 27, 2026

✅ Actions performed

Full review triggered.

@coderabbitai
Copy link

coderabbitai bot commented Feb 27, 2026

Walkthrough

This refactoring replaces the sandbox module's older abstract base classes (NemoEvaluatorSandbox, NemoSandboxCommand, NemoSandboxSession) with a simpler Protocol-based Sandbox abstraction and an ExecResult dataclass. Associated error classes are removed, while new utility types from ecs_fargate are newly exported.

Changes

Cohort / File(s) Summary
Sandbox abstraction refactoring
packages/nemo-evaluator/src/nemo_evaluator/sandbox/base.py
Replaced previous Nemo sandbox abstractions with a new Sandbox protocol defining minimal backend contract (start, stop, exec, upload, download, is_running, context manager support). Introduced ExecResult dataclass to encapsulate command execution outcomes (stdout, stderr, return_code). Removed NemoSandboxCommand, NemoSandboxSession, and NemoEvaluatorSandbox classes.
Public API surface update
packages/nemo-evaluator/src/nemo_evaluator/sandbox/__init__.py
Updated all exports to introduce ExecResult and Sandbox from base module; removed NemoEvaluatorSandbox, NemoSandboxCommand, NemoSandboxSession, AwsCliMissingError, and EcsExecError. Added new ecs_fargate exports: EnvVarSpec, ExecClient, ImageBuilder, SshSidecarConfig, and SshTunnel.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • feat: add sandbox logic #624 — The original PR introducing NemoEvaluatorSandbox and related abstractions, which are now replaced by the Sandbox protocol in this refactoring.

Suggested reviewers

  • agronskiy

Poem

🐰 Old abstractions fade to dust,
Protocols emerge—a cleaner trust,
ExecResult now holds the way,
Sandbox simple, here to stay! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'feat: ECS refactor' is vague and does not clearly convey the main changes; it refers to ECS but the actual core changes involve restructuring the sandbox public API by removing NemoEvaluatorSandbox, NemoSandboxCommand, NemoSandboxSession and introducing new Sandbox protocol and ExecResult abstractions. Use a more specific title that captures the primary refactoring goal, such as 'feat: restructure sandbox API with protocol-based Sandbox and ExecResult' or 'feat: replace Nemo sandbox abstractions with Sandbox protocol'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch mbien/ecs-refactor

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/nemo-evaluator/src/nemo_evaluator/sandbox/base.py`:
- Line 26: Replace the use of typing.Self with typing_extensions.Self to support
Python 3.10: update the import statement that currently reads "from typing
import Protocol, Self" so that Protocol comes from typing and Self is imported
from typing_extensions (e.g., import Protocol from typing and import Self from
typing_extensions); also ensure the symbol Self is used unchanged elsewhere
(e.g., in any classes or function signatures referencing Self such as those in
this module) and add typing_extensions to project dependencies if not already
present.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c759ecf and efd593f.

📒 Files selected for processing (3)
  • packages/nemo-evaluator/src/nemo_evaluator/sandbox/__init__.py
  • packages/nemo-evaluator/src/nemo_evaluator/sandbox/base.py
  • packages/nemo-evaluator/src/nemo_evaluator/sandbox/ecs_fargate.py

@abstractmethod
def stop(self) -> None:
raise NotImplementedError
from typing import Protocol, Self
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Python version constraints in packaging/config files =="
fd -HI 'pyproject.toml|setup.cfg|setup.py|tox.ini|\.python-version|\.tool-versions|Pipfile' | while read -r f; do
  echo "--- $f ---"
  rg -n "requires-python|python_requires|python_version|target-version|3\.10|3\.11|3\.12" "$f" || true
done

echo
echo "== Self imports across repository =="
rg -n "from typing import .*\\bSelf\\b|typing\\.Self|from typing_extensions import .*\\bSelf\\b" || true

Repository: NVIDIA-NeMo/Evaluator

Length of output: 667


Fix Python 3.10 compatibility: use typing_extensions.Self instead of typing.Self.

The package declares support for Python 3.10+ (requires-python = ">=3.10,<3.14"), but typing.Self is only available in Python 3.11+. This import will fail at runtime on Python 3.10. Use typing_extensions.Self instead to maintain compatibility.

Required fix
-from typing import Protocol, Self
+from typing import Protocol
+from typing_extensions import Self
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from typing import Protocol, Self
from typing import Protocol
from typing_extensions import Self
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nemo-evaluator/src/nemo_evaluator/sandbox/base.py` at line 26,
Replace the use of typing.Self with typing_extensions.Self to support Python
3.10: update the import statement that currently reads "from typing import
Protocol, Self" so that Protocol comes from typing and Self is imported from
typing_extensions (e.g., import Protocol from typing and import Self from
typing_extensions); also ensure the symbol Self is used unchanged elsewhere
(e.g., in any classes or function signatures referencing Self such as those in
this module) and add typing_extensions to project dependencies if not already
present.

pribalta

This comment was marked as spam.

pribalta

This comment was marked as spam.

Copy link
Collaborator

@pribalta pribalta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Power Review

Reviewed 3 files, left 25 suggestions: 4 errors, 18 warnings, 3 suggestions.

Dimensions Analyzed

Dimension 🔴 ⚠️ 💡
anti_patterns 0 3 1
architecture 0 5 1
hallucination-detection 1 1 0
patterns 3 5 0
security-risks 0 3 0
testing-quality 0 1 1
Total 4 18 3

New logic lands in ecs refactor space,
Our agents checked each file and trace.
Yet automated eyes can overlook a flaw,
So review these notes — that’s the golden law.


Automated review by Power Review — findings are advisory.

Glorf added 2 commits March 1, 2026 17:22
Signed-off-by: Michal Bien <mbien@nvidia.com>
Signed-off-by: Michal Bien <mbien@nvidia.com>
@Glorf
Copy link
Contributor Author

Glorf commented Mar 1, 2026

/ok to test 12ccd41

Signed-off-by: Michal Bien <mbien@nvidia.com>
@github-actions github-actions bot added the tests label Mar 1, 2026
@Glorf
Copy link
Contributor Author

Glorf commented Mar 1, 2026

/ok to test 1abc89a

# =====================================================================


def _coerce_list(value: Any, name: str) -> list[str]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name is not needed here at all, raised value will have it's stack trace and taken to extreme, every operator should have name of the variable passed to it :D

x = add_two_integers(y, z, "y", "z")

raise
attempt += 1
if 0 < max_retries <= attempt:
log.error(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: the codebase uses structlog, ask the agent to be good citizen here

code = ""
if hasattr(exc, "response"):
code = (exc.response.get("Error") or {}).get("Code", "") # type: ignore[union-attr]
return code in _RETRYABLE_CODES or any(m in msg for m in _RETRYABLE_MESSAGES)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bug: _RETRYABLE_MESSAGES above is not lowercased whily the msg is lowercased in line 326, this is a bug

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 2, 2026
Signed-off-by: Michal Bien <mbien@nvidia.com>
@Glorf
Copy link
Contributor Author

Glorf commented Mar 2, 2026

/ok to test de3f7a2

Copy link
Collaborator

@pribalta pribalta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Power Review

Reviewed 5 files, left 26 suggestions: 1 error, 20 warnings, 5 suggestions.

Dimensions Analyzed

Dimension 🔴 ⚠️ 💡
api-contracts 0 2 1
boundary-conditions 0 5 1
concurrency 1 2 1
data-integrity 0 3 0
error-handling 0 2 1
logic-correctness 0 5 1
resource-lifecycle 0 1 0
Total 1 20 5

New logic lands in ecs refactor space,
Our agents checked each file and trace.
Yet automated eyes can overlook a flaw,
So review these notes — that’s the golden law.

⏱️ Reviewed in 4.6 min (est. manual review: ~60 min, 13x faster)


Automated review by Power Review — findings are advisory. Learn more about how it works.

raise ValueError("container_dir is required")
if isinstance(paths, Path):
paths = [paths]
def _emergency_cleanup() -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Power Review] 🔴 Self-deadlock: _emergency_cleanup holds non-reentrant lock then re-enters it via stop() · 97% confidence

The knowledge document flags deadlock-prone lock ordering as an ERROR (CWE-362). _cleanup_lock is a non-reentrant threading.Lock() (line 1352). _emergency_cleanup() acquires _cleanup_lock (line 1359), then calls sb.stop() (line 1362), which calls _unregister_from_cleanup() (line 1490), which tries to acquire _cleanup_lock again (line 2287). Since threading.Lock is non-reentrant, the same thread will deadlock on itself. This means the atexit handler will hang forever, preventing graceful process exit and leaving orphaned ECS tasks running.

💡 Suggestion: Either (a) change _cleanup_lock = threading.RLock() to allow reentrant acquisition, or (b) refactor _emergency_cleanup to call sb._cleanup() directly instead of sb.stop(), bypassing _unregister_from_cleanup() since all sandboxes are being cleaned up anyway and the dict is about to be discarded.

)
raise EcsExecError(
f"ECS Exec failed: rc={effective_rc}\nOUTPUT:\n{text_out}"
cls._build_semaphore = threading.Semaphore(cfg.build_parallelism)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Power Review] ⚠️ Semaphore(0) deadlock when build_parallelism is zero · 92% confidence

Missing Input Validation (CWE-20): build_parallelism from user config is passed directly to threading.Semaphore() without a lower-bound check. Semaphore(0) starts with an internal counter of 0, so the subsequent .acquire() at line 1161 will block forever, causing a deadlock. Negative values also produce undefined/broken behavior. The knowledge document states: 'validate shape, type, and range at the entry point; reject or sanitise before use.'

💡 Suggestion: Add validation in EcsFargateConfig.from_dict() or at the point of use: build_parallelism = max(1, int(raw.get('build_parallelism', 50))). Alternatively, validate in ensure_image_built before creating the semaphore.

try:
ecr.describe_images(repositoryName=repo_name, imageIds=[{"imageTag": tag}])
return True
except ClientError:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Power Review] ⚠️ Over-broad ClientError catch in image_exists_in_ecr masks transient errors · 90% confidence

Per the knowledge document (Over-Broad Exception Catch — WARNING): catching a parent type when only a specific error is expected masks unrelated bugs. image_exists_in_ecr catches all ClientError and returns False, but only ImageNotFoundException (or similar 404 response) means the image is absent. Throttling errors (ThrottlingException), access-denied errors, or network issues would also return False, causing unnecessary image rebuilds and potentially hiding permission problems.

💡 Suggestion: Inspect e.response['Error']['Code'] and only return False for image-not-found codes (e.g., ImageNotFoundException). Re-raise or propagate other ClientError subtypes.

)
time.sleep(sleep_sec)
time.sleep(delay)
raise RuntimeError("register_task_definition failed after max retries")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Power Review] ⚠️ Unreachable dead code after for-loop in _do_register · 90% confidence

Per the knowledge document's 'Unreachable or Dead Branch' rule: the raise RuntimeError at line 1944 can never be reached. The for-loop iterates range(1, max_retries + 1). On each iteration, either: (a) success → return arn, (b) non-retryable error → raise RuntimeError(...), or (c) last-attempt retryable error → attempt >= max_retries is True → raise RuntimeError(...). The same issue exists at lines 2017-2019 in _run_task.

💡 Suggestion: Remove the unreachable raise after the loop, or restructure using a while-loop pattern if the intent is to have a fallback error.

continue
else:
time.sleep(delay)

raise RuntimeError(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Power Review] ⚠️ Unreachable dead code after for-loop in _run_task · 90% confidence

Per the knowledge document's 'Unreachable or Dead Branch' rule: the raise RuntimeError at line 2017 is unreachable. In the last loop iteration, either the inner _retry_with_backoff succeeds (then capacity failures or success is handled), or an exception is raised (re-raised by the except clause since attempt >= cfg.run_task_max_retries), or capacity failures trigger a raise (since attempt >= cfg.run_task_max_retries). The loop cannot exit normally.

💡 Suggestion: Remove the unreachable raise after the loop. If defensive coding is desired, add a comment explaining it's a safety net, or restructure with a while-loop.

"ThrottlingException" in combined
or "TooManyRequestsException" in combined
or "Rate exceeded" in combined
def health(self) -> bool:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Power Review] 💡 Tautological is not None check on bytes return · 85% confidence

Per the knowledge document's 'Redundant or Tautological Condition' rule: ExecClient.health() checks self._request(...) is not None. The _request method always returns bytes on success and raises on failure — it never returns None. The is not None comparison is always True when no exception occurs, making the expression tautological.

💡 Suggestion: Simplify to: self._request(...); return True inside the try block, relying on the except to return False.

resp = self._ecs.describe_task_definition(
taskDefinition=cfg.task_definition
)
base = resp["taskDefinition"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Power Review] 💡 Bare except Exception when describing base task definition hides root cause · 80% confidence

Per the knowledge document (Over-Broad Exception Catch — WARNING): catching Exception when only a specific error is expected masks unrelated bugs. At line 1785, the code catches any Exception from describe_task_definition and falls through to base = None, causing the code to register from scratch. This hides throttling, credential, or network errors that should be surfaced rather than silently triggering a fallback code path that may fail differently.

💡 Suggestion: Catch only ClientError and optionally check for the specific error code (e.g., task definition not found). Let unexpected exceptions propagate.

)
def _wait_for_running(self) -> None:
cfg = self._cfg
start = time.time()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Power Review] 💡 time.time() used for elapsed-time checks instead of time.monotonic() · 80% confidence

Date/Time Boundary Error: _wait_for_running and _wait_for_ssh_ready use time.time() for deadline/elapsed calculations. time.time() is subject to NTP adjustments and system clock changes (including backwards jumps), which can cause premature timeouts or infinite waits. The knowledge document cautions against time-related edge cases that cause silent wrong results.

💡 Suggestion: Replace time.time() with time.monotonic() in _wait_for_running (line 2023) and _wait_for_ssh_ready (line 2120) to be immune to clock adjustments.

subnets = _coerce_list(raw.get("subnets"))
sgs = _coerce_list(raw.get("security_groups"))
has_sidecar = isinstance(raw.get("ssh_sidecar"), Mapping)
assign_public_ip = bool(raw.get("assign_public_ip", False)) or has_sidecar
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Power Review] 💡 Undocumented side effect: from_dict silently forces assign_public_ip=True when SSH sidecar is present · 80% confidence

Per the knowledge document (Undocumented Side Effect): 'A function performs a side effect that is not obvious from its name or signature.' EcsFargateConfig.from_dict() silently overrides assign_public_ip to True when an ssh_sidecar mapping is present, even if the caller explicitly sets assign_public_ip: false. This could surprise users in VPC-only environments where public IPs are prohibited by policy.

💡 Suggestion: Either document this override behavior clearly in the SshSidecarConfig docstring, or raise an error/warning when the user explicitly sets assign_public_ip=False while providing an SSH sidecar config, rather than silently overriding.

"""
if self._previous_buffer is None:
return None
def _free_port() -> int:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Power Review] 💡 TOCTOU in _free_port(): port freed before actual use · 75% confidence

The knowledge document flags TOCTOU as an ERROR (CWE-367). _free_port() binds to port 0 to get an ephemeral port, then closes the socket (exiting the with block) before the port is actually used by the SSH tunnel. Another process can claim the port in the gap. The SSH tunnel retry logic in SshTunnel.open() (line 480: re-allocates on each attempt) partially mitigates this, but it's still a known race window.

💡 Suggestion: Consider using SO_REUSEADDR on the socket and passing the bound socket directly, or accept the TOCTOU and document that retries handle port collisions (which the current code already does).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation nemo-evaluator tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants