Skip to content

DRAFT: Gym integration PoC#770

Draft
Glorf wants to merge 2 commits intomainfrom
gym-integration
Draft

DRAFT: Gym integration PoC#770
Glorf wants to merge 2 commits intomainfrom
gym-integration

Conversation

@Glorf
Copy link
Contributor

@Glorf Glorf commented Feb 25, 2026

Summary by CodeRabbit

  • New Features

    • Introduced a new abstract interface for implementing custom evaluation harnesses
    • Added configuration generation utilities to create and persist evaluation setup files
    • Added a resource server process manager for programmatic server lifecycle management
  • Tests

    • Added comprehensive unit tests for configuration generation and resource server operations

@copy-pr-bot
Copy link

copy-pr-bot bot commented Feb 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Glorf
Copy link
Contributor Author

Glorf commented Feb 25, 2026

@CodeRabbit full review

@coderabbitai
Copy link

coderabbitai bot commented Feb 25, 2026

✅ Actions performed

Full review triggered.

@coderabbitai
Copy link

coderabbitai bot commented Feb 25, 2026

Walkthrough

This PR introduces a new GymHarness abstraction framework for nemo-gym integration, including an abstract base class for benchmark harnesses, configuration generation utilities, subprocess management for the resource server, and corresponding unit tests across five new modules.

Changes

Cohort / File(s) Summary
Public API & Abstractions
packages/nemo-evaluator/src/nemo_evaluator/__init__.py, packages/nemo-evaluator/src/nemo_evaluator/gym_harness.py
Exposes new GymHarness abstract base class via public import; defines interface with get_dataset() and async verify() methods for benchmark implementations.
Configuration Management
packages/nemo-evaluator/src/nemo_evaluator/core/gym_config_generator.py
Introduces generate_gym_config() for building nemo-gym configuration dicts and write_gym_config() for YAML serialization with optional auto-generated temp file paths.
Process Management
packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py
Adds ResourceServerProcess context manager for spawning and managing the NeMoEvaluatorResourcesServer subprocess lifecycle, including port discovery, command construction, and graceful cleanup.
Unit Tests
packages/nemo-evaluator/tests/unit_tests/core/test_gym_config_generator.py, packages/nemo-evaluator/tests/unit_tests/core/test_resource_server_process.py
Validates config generation, YAML writing, resource server initialization, harness package extraction, judge configuration handling, and port-finding behavior.

Sequence Diagram

sequenceDiagram
    participant Client
    participant ResourceServerProcess
    participant ResourceServer
    participant GymHarness
    
    Client->>ResourceServerProcess: __init__(eval_type, framework_config)
    ResourceServerProcess->>ResourceServerProcess: Extract harness_package & config
    
    Client->>ResourceServerProcess: __enter__()
    ResourceServerProcess->>ResourceServer: spawn subprocess with config
    ResourceServer->>GymHarness: load Harness implementation
    GymHarness->>GymHarness: get_dataset()
    ResourceServer-->>ResourceServerProcess: ready (health check)
    ResourceServerProcess-->>Client: return self with url
    
    Client->>Client: generate_gym_config(resource_server_host, ...)
    Client->>Client: write_gym_config(config, output_path)
    
    Client->>ResourceServerProcess: __exit__()
    ResourceServerProcess->>ResourceServer: terminate subprocess
    ResourceServer-->>ResourceServerProcess: shutdown complete
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A harness for gyms, abstract and neat,
Config and processes, all quite complete,
Subprocess management, with grace it runs,
New gym integration—hooray, what fun!
From abstract to concrete, the dance has begun! 🎯

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 21.05% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'DRAFT: Gym integration PoC' is vague and uses non-descriptive terms that don't clearly convey what changes are made, relying on acronyms (PoC, Gym) without context about the actual implementation. Replace with a more descriptive title that clearly indicates the main implementation details, such as 'Add GymHarness abstract interface and resource server integration' or similar that specifically describes what is being added.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch gym-integration

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
packages/nemo-evaluator/tests/unit_tests/core/test_resource_server_process.py (1)

30-82: Add lifecycle tests for __enter__ failure and cleanup paths.

Current coverage is good for config parsing, but it misses subprocess startup-failure behavior (terminate/timeout/kill), which is where the highest reliability risk sits.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/nemo-evaluator/tests/unit_tests/core/test_resource_server_process.py`
around lines 30 - 82, Add unit tests for ResourceServerProcess lifecycle to
cover failure and cleanup paths: create tests that exercise
ResourceServerProcess.__enter__ when the subprocess fails to start (simulate
subprocess creation/startup raising or the process exiting immediately) and
assert that ResourceServerProcess calls terminate/kill and enforces a timeout;
also add tests for the normal __exit__/cleanup path to verify terminate is
called and that a forced kill occurs if terminate times out. Use the
ResourceServerProcess class and its __enter__, __exit__, terminate, and any
internal subprocess handling (mock subprocess.Popen or the method that starts
the process) to simulate failures and assert the expected calls and behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py`:
- Around line 82-83: The auto-selected port assignment (self.port = port or
_find_free_port(host)) is susceptible to a TOCTOU race: when port is not
provided, retry binding with a freshly chosen port if the server startup fails;
modify the startup logic in the ResourceServerProcess (the code paths that call
self.port and launch the server—e.g., the constructor assignment and the method
that actually binds/starts the server) to detect bind/start failures and, if the
original port was auto-assigned (port argument was None), loop a few times: pick
a new port via _find_free_port(host), set self.port, and retry startup with a
short backoff before finally raising the error; ensure this retry logic is used
in both the initial constructor/startup and the alternate startup path
referenced around lines 118–138.
- Around line 126-130: The current logger.info call logs the full spawn command
(cmd) and can expose sensitive data in self.harness_kwargs; before logging,
sanitize cmd by removing or redacting the value passed to
"--harness_kwargs_json" (or omit that flag entirely) to produce a safe_cmd, then
log safe_cmd instead of cmd; keep subprocess.Popen(cmd) unchanged so the process
still receives the real arguments.
- Around line 136-138: The startup failure path currently calls
self._process.terminate() and self._process.wait(timeout=5) without handling
subprocess.TimeoutExpired, which can leave the child running; update the failure
cleanup in the method that raises RuntimeError(f"Resource server failed to start
at {self.url}") to mirror the __exit__ pattern: call self._process.terminate(),
catch subprocess.TimeoutExpired from self._process.wait(timeout=5), then call
self._process.kill() and self._process.wait() in the except block before
re-raising the RuntimeError so the process cannot be orphaned (refer to
self._process, the RuntimeError raising location, and the __exit__ cleanup
logic).

In `@packages/nemo-evaluator/tests/unit_tests/core/test_gym_config_generator.py`:
- Around line 61-67: The test test_auto_generates_path leaves a temp YAML file
in /tmp; update it to clean up after itself by deleting the generated file
(os.remove(path)) at the end of the test or switch to using pytest's tmp_path
fixture and write_gym_config so the file is created in a managed temp directory;
reference test_auto_generates_path and write_gym_config to locate the code and
ensure the cleanup (or tmp_path usage) runs even on assertion failures (e.g., in
a finally block or via tmp_path).

---

Nitpick comments:
In
`@packages/nemo-evaluator/tests/unit_tests/core/test_resource_server_process.py`:
- Around line 30-82: Add unit tests for ResourceServerProcess lifecycle to cover
failure and cleanup paths: create tests that exercise
ResourceServerProcess.__enter__ when the subprocess fails to start (simulate
subprocess creation/startup raising or the process exiting immediately) and
assert that ResourceServerProcess calls terminate/kill and enforces a timeout;
also add tests for the normal __exit__/cleanup path to verify terminate is
called and that a forced kill occurs if terminate times out. Use the
ResourceServerProcess class and its __enter__, __exit__, terminate, and any
internal subprocess handling (mock subprocess.Popen or the method that starts
the process) to simulate failures and assert the expected calls and behavior.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6cca91e and d3552df.

📒 Files selected for processing (6)
  • packages/nemo-evaluator/src/nemo_evaluator/__init__.py
  • packages/nemo-evaluator/src/nemo_evaluator/core/gym_config_generator.py
  • packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py
  • packages/nemo-evaluator/src/nemo_evaluator/gym_harness.py
  • packages/nemo-evaluator/tests/unit_tests/core/test_gym_config_generator.py
  • packages/nemo-evaluator/tests/unit_tests/core/test_resource_server_process.py

Comment on lines +82 to +83
self.port = port or _find_free_port(host)
self.data_dir = data_dir
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Auto-selected port can race and cause flaky startup.

Choosing a port first and binding later in another process is a TOCTOU race; another process can claim that port before the resource server starts.

Consider retrying startup with a newly selected port when port was auto-assigned and startup fails.

Also applies to: 118-138

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py`
around lines 82 - 83, The auto-selected port assignment (self.port = port or
_find_free_port(host)) is susceptible to a TOCTOU race: when port is not
provided, retry binding with a freshly chosen port if the server startup fails;
modify the startup logic in the ResourceServerProcess (the code paths that call
self.port and launch the server—e.g., the constructor assignment and the method
that actually binds/starts the server) to detect bind/start failures and, if the
original port was auto-assigned (port argument was None), loop a few times: pick
a new port via _find_free_port(host), set self.port, and retry startup with a
short backoff before finally raising the error; ensure this retry logic is used
in both the initial constructor/startup and the alternate startup path
referenced around lines 118–138.

Comment on lines +126 to +130
if self.harness_kwargs:
cmd += ["--harness_kwargs_json", json.dumps(self.harness_kwargs)]

logger.info("Spawning resource server: %s", " ".join(cmd))
self._process = subprocess.Popen(cmd)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid logging the full spawn command with raw harness kwargs.

The command log can include --harness_kwargs_json and expose sensitive data (for example credentials embedded in URLs).

🔒 Proposed hardening
-        logger.info("Spawning resource server: %s", " ".join(cmd))
+        logger.info(
+            "Spawning resource server (exe=%s, host=%s, port=%s, harness_package=%s, has_harness_kwargs=%s)",
+            exe,
+            self.host,
+            self.port,
+            self.harness_package,
+            bool(self.harness_kwargs),
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py`
around lines 126 - 130, The current logger.info call logs the full spawn command
(cmd) and can expose sensitive data in self.harness_kwargs; before logging,
sanitize cmd by removing or redacting the value passed to
"--harness_kwargs_json" (or omit that flag entirely) to produce a safe_cmd, then
log safe_cmd instead of cmd; keep subprocess.Popen(cmd) unchanged so the process
still receives the real arguments.

Comment on lines +136 to +138
self._process.terminate()
self._process.wait(timeout=5)
raise RuntimeError(f"Resource server failed to start at {self.url}")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the startup-failure cleanup path lacks TimeoutExpired handling
rg -n -C5 'self\._process\.terminate\(\)' packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py

Repository: NVIDIA-NeMo/Evaluator

Length of output: 798


🏁 Script executed:

#!/bin/bash
# Check imports in the file
head -50 packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py

Repository: NVIDIA-NeMo/Evaluator

Length of output: 1579


🏁 Script executed:

#!/bin/bash
# Look for TimeoutExpired handling anywhere in the file
rg -n 'TimeoutExpired' packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py

Repository: NVIDIA-NeMo/Evaluator

Length of output: 114


Handle TimeoutExpired in startup failure cleanup to prevent orphaned processes.

Lines 136–138 call wait(timeout=5) without catching subprocess.TimeoutExpired, which can leave the child process running if termination hangs. The __exit__ method (lines 143–147) already implements the correct pattern with force-kill; apply the same approach here.

Proposed fix
         self._process.terminate()
-        self._process.wait(timeout=5)
+        try:
+            self._process.wait(timeout=5)
+        except subprocess.TimeoutExpired:
+            logger.warning("Force-killing resource server (pid=%s)", self._process.pid)
+            self._process.kill()
+            self._process.wait()
         raise RuntimeError(f"Resource server failed to start at {self.url}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
self._process.terminate()
self._process.wait(timeout=5)
raise RuntimeError(f"Resource server failed to start at {self.url}")
self._process.terminate()
try:
self._process.wait(timeout=5)
except subprocess.TimeoutExpired:
logger.warning("Force-killing resource server (pid=%s)", self._process.pid)
self._process.kill()
self._process.wait()
raise RuntimeError(f"Resource server failed to start at {self.url}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py`
around lines 136 - 138, The startup failure path currently calls
self._process.terminate() and self._process.wait(timeout=5) without handling
subprocess.TimeoutExpired, which can leave the child running; update the failure
cleanup in the method that raises RuntimeError(f"Resource server failed to start
at {self.url}") to mirror the __exit__ pattern: call self._process.terminate(),
catch subprocess.TimeoutExpired from self._process.wait(timeout=5), then call
self._process.kill() and self._process.wait() in the except block before
re-raising the RuntimeError so the process cannot be orphaned (refer to
self._process, the RuntimeError raising location, and the __exit__ cleanup
logic).

Comment on lines +61 to +67
def test_auto_generates_path(self):
config = {"test": "value"}
path = write_gym_config(config)
assert path.endswith(".yaml")
with open(path) as f:
loaded = yaml.safe_load(f)
assert loaded == {"test": "value"}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

test_auto_generates_path should clean up the generated temp file.

This test leaves artifacts in /tmp, which can accumulate across runs.

🧹 Proposed fix
+import os
 import yaml
@@
     def test_auto_generates_path(self):
         config = {"test": "value"}
         path = write_gym_config(config)
-        assert path.endswith(".yaml")
-        with open(path) as f:
-            loaded = yaml.safe_load(f)
-        assert loaded == {"test": "value"}
+        try:
+            assert path.endswith(".yaml")
+            with open(path) as f:
+                loaded = yaml.safe_load(f)
+            assert loaded == {"test": "value"}
+        finally:
+            if os.path.exists(path):
+                os.remove(path)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def test_auto_generates_path(self):
config = {"test": "value"}
path = write_gym_config(config)
assert path.endswith(".yaml")
with open(path) as f:
loaded = yaml.safe_load(f)
assert loaded == {"test": "value"}
import os
import yaml
def test_auto_generates_path(self):
config = {"test": "value"}
path = write_gym_config(config)
try:
assert path.endswith(".yaml")
with open(path) as f:
loaded = yaml.safe_load(f)
assert loaded == {"test": "value"}
finally:
if os.path.exists(path):
os.remove(path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nemo-evaluator/tests/unit_tests/core/test_gym_config_generator.py`
around lines 61 - 67, The test test_auto_generates_path leaves a temp YAML file
in /tmp; update it to clean up after itself by deleting the generated file
(os.remove(path)) at the end of the test or switch to using pytest's tmp_path
fixture and write_gym_config so the file is created in a managed temp directory;
reference test_auto_generates_path and write_gym_config to locate the code and
ensure the cleanup (or tmp_path usage) runs even on assertion failures (e.g., in
a finally block or via tmp_path).

Signed-off-by: Michal Bien <mbien@nvidia.com>
Signed-off-by: Michal Bien <mbien@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant