DRAFT: Gym integration PoC by Glorf · Pull Request #770 · NVIDIA-NeMo/Evaluator

Glorf · 2026-02-25T16:43:15Z

Summary by CodeRabbit

New Features
- Introduced a new abstract interface for implementing custom evaluation harnesses
- Added configuration generation utilities to create and persist evaluation setup files
- Added a resource server process manager for programmatic server lifecycle management
Tests
- Added comprehensive unit tests for configuration generation and resource server operations

copy-pr-bot · 2026-02-25T16:43:18Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Glorf · 2026-02-25T17:58:51Z

@CodeRabbit full review

coderabbitai · 2026-02-25T17:58:59Z

✅ Actions performed

Full review triggered.

coderabbitai · 2026-02-25T18:03:54Z

Walkthrough

This PR introduces a new GymHarness abstraction framework for nemo-gym integration, including an abstract base class for benchmark harnesses, configuration generation utilities, subprocess management for the resource server, and corresponding unit tests across five new modules.

Changes

Cohort / File(s)	Summary
Public API & Abstractions `packages/nemo-evaluator/src/nemo_evaluator/__init__.py`, `packages/nemo-evaluator/src/nemo_evaluator/gym_harness.py`	Exposes new `GymHarness` abstract base class via public import; defines interface with `get_dataset()` and async `verify()` methods for benchmark implementations.
Configuration Management `packages/nemo-evaluator/src/nemo_evaluator/core/gym_config_generator.py`	Introduces `generate_gym_config()` for building nemo-gym configuration dicts and `write_gym_config()` for YAML serialization with optional auto-generated temp file paths.
Process Management `packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py`	Adds `ResourceServerProcess` context manager for spawning and managing the NeMoEvaluatorResourcesServer subprocess lifecycle, including port discovery, command construction, and graceful cleanup.
Unit Tests `packages/nemo-evaluator/tests/unit_tests/core/test_gym_config_generator.py`, `packages/nemo-evaluator/tests/unit_tests/core/test_resource_server_process.py`	Validates config generation, YAML writing, resource server initialization, harness package extraction, judge configuration handling, and port-finding behavior.
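
To make the interface in the table above concrete, here is a minimal sketch of what an abstract harness with `get_dataset()` and an async `verify()` could look like. The PR excerpt does not show the real signatures, so the class name `GymHarnessSketch`, the argument names, and the return types are all assumptions for illustration only.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any


class GymHarnessSketch(ABC):
    """Illustrative stand-in for the GymHarness base class (signatures assumed)."""

    @abstractmethod
    def get_dataset(self) -> list[dict[str, Any]]:
        """Return the benchmark rows to evaluate."""

    @abstractmethod
    async def verify(self, response: str, expected: str) -> float:
        """Score one model response; the reward scale is an assumption."""


class ExactMatchHarness(GymHarnessSketch):
    """Hypothetical concrete harness that scores by exact string match."""

    def get_dataset(self) -> list[dict[str, Any]]:
        return [{"prompt": "2 + 2 = ?", "expected": "4"}]

    async def verify(self, response: str, expected: str) -> float:
        # Ignore surrounding whitespace in the model output.
        return 1.0 if response.strip() == expected else 0.0
```

A caller would drive the async method with `asyncio.run(harness.verify(...))` or from inside an existing event loop.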

Sequence Diagram

sequenceDiagram
    participant Client
    participant ResourceServerProcess
    participant ResourceServer
    participant GymHarness
    
    Client->>ResourceServerProcess: __init__(eval_type, framework_config)
    ResourceServerProcess->>ResourceServerProcess: Extract harness_package & config
    
    Client->>ResourceServerProcess: __enter__()
    ResourceServerProcess->>ResourceServer: spawn subprocess with config
    ResourceServer->>GymHarness: load Harness implementation
    GymHarness->>GymHarness: get_dataset()
    ResourceServer-->>ResourceServerProcess: ready (health check)
    ResourceServerProcess-->>Client: return self with url
    
    Client->>Client: generate_gym_config(resource_server_host, ...)
    Client->>Client: write_gym_config(config, output_path)
    
    Client->>ResourceServerProcess: __exit__()
    ResourceServerProcess->>ResourceServer: terminate subprocess
    ResourceServer-->>ResourceServerProcess: shutdown complete

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A harness for gyms, abstract and neat,
Config and processes, all quite complete,
Subprocess management, with grace it runs,
New gym integration—hooray, what fun!
From abstract to concrete, the dance has begun! 🎯

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 21.05% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'DRAFT: Gym integration PoC' is vague and uses non-descriptive terms that don't clearly convey what changes are made, relying on acronyms (PoC, Gym) without context about the actual implementation.	Replace with a more descriptive title that clearly indicates the main implementation details, such as 'Add GymHarness abstract interface and resource server integration' or similar that specifically describes what is being added.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.


coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (1)

packages/nemo-evaluator/tests/unit_tests/core/test_resource_server_process.py (1)

30-82: Add lifecycle tests for __enter__ failure and cleanup paths.

Current coverage is good for config parsing, but it misses subprocess startup-failure behavior (terminate/timeout/kill), which is where the highest reliability risk sits.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In
`@packages/nemo-evaluator/tests/unit_tests/core/test_resource_server_process.py`
around lines 30 - 82, Add unit tests for ResourceServerProcess lifecycle to
cover failure and cleanup paths: create tests that exercise
ResourceServerProcess.__enter__ when the subprocess fails to start (simulate
subprocess creation/startup raising or the process exiting immediately) and
assert that ResourceServerProcess calls terminate/kill and enforces a timeout;
also add tests for the normal __exit__/cleanup path to verify terminate is
called and that a forced kill occurs if terminate times out. Use the
ResourceServerProcess class and its __enter__, __exit__, terminate, and any
internal subprocess handling (mock subprocess.Popen or the method that starts
the process) to simulate failures and assert the expected calls and behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py`:
- Around line 82-83: The auto-selected port assignment (self.port = port or
_find_free_port(host)) is susceptible to a TOCTOU race: when port is not
provided, retry binding with a freshly chosen port if the server startup fails;
modify the startup logic in the ResourceServerProcess (the code paths that call
self.port and launch the server—e.g., the constructor assignment and the method
that actually binds/starts the server) to detect bind/start failures and, if the
original port was auto-assigned (port argument was None), loop a few times: pick
a new port via _find_free_port(host), set self.port, and retry startup with a
short backoff before finally raising the error; ensure this retry logic is used
in both the initial constructor/startup and the alternate startup path
referenced around lines 118–138.
- Around line 126-130: The current logger.info call logs the full spawn command
(cmd) and can expose sensitive data in self.harness_kwargs; before logging,
sanitize cmd by removing or redacting the value passed to
"--harness_kwargs_json" (or omit that flag entirely) to produce a safe_cmd, then
log safe_cmd instead of cmd; keep subprocess.Popen(cmd) unchanged so the process
still receives the real arguments.
- Around line 136-138: The startup failure path currently calls
self._process.terminate() and self._process.wait(timeout=5) without handling
subprocess.TimeoutExpired, which can leave the child running; update the failure
cleanup in the method that raises RuntimeError(f"Resource server failed to start
at {self.url}") to mirror the __exit__ pattern: call self._process.terminate(),
catch subprocess.TimeoutExpired from self._process.wait(timeout=5), then call
self._process.kill() and self._process.wait() in the except block before
re-raising the RuntimeError so the process cannot be orphaned (refer to
self._process, the RuntimeError raising location, and the __exit__ cleanup
logic).

In `@packages/nemo-evaluator/tests/unit_tests/core/test_gym_config_generator.py`:
- Around line 61-67: The test test_auto_generates_path leaves a temp YAML file
in /tmp; update it to clean up after itself by deleting the generated file
(os.remove(path)) at the end of the test or switch to using pytest's tmp_path
fixture and write_gym_config so the file is created in a managed temp directory;
reference test_auto_generates_path and write_gym_config to locate the code and
ensure the cleanup (or tmp_path usage) runs even on assertion failures (e.g., in
a finally block or via tmp_path).

---

Nitpick comments:
In
`@packages/nemo-evaluator/tests/unit_tests/core/test_resource_server_process.py`:
- Around line 30-82: Add unit tests for ResourceServerProcess lifecycle to cover
failure and cleanup paths: create tests that exercise
ResourceServerProcess.__enter__ when the subprocess fails to start (simulate
subprocess creation/startup raising or the process exiting immediately) and
assert that ResourceServerProcess calls terminate/kill and enforces a timeout;
also add tests for the normal __exit__/cleanup path to verify terminate is
called and that a forced kill occurs if terminate times out. Use the
ResourceServerProcess class and its __enter__, __exit__, terminate, and any
internal subprocess handling (mock subprocess.Popen or the method that starts
the process) to simulate failures and assert the expected calls and behavior.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6cca91e and d3552df.

📒 Files selected for processing (6)

packages/nemo-evaluator/src/nemo_evaluator/__init__.py
packages/nemo-evaluator/src/nemo_evaluator/core/gym_config_generator.py
packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py
packages/nemo-evaluator/src/nemo_evaluator/gym_harness.py
packages/nemo-evaluator/tests/unit_tests/core/test_gym_config_generator.py
packages/nemo-evaluator/tests/unit_tests/core/test_resource_server_process.py

coderabbitai · 2026-02-25T18:03:58Z

packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py

+        self.port = port or _find_free_port(host)
+        self.data_dir = data_dir


⚠️ Potential issue | 🟠 Major

Auto-selected port can race and cause flaky startup.

Choosing a port first and binding later in another process is a TOCTOU race; another process can claim that port before the resource server starts.

Consider retrying startup with a newly selected port when port was auto-assigned and startup fails.

Also applies to: 118-138

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py` around lines 82 - 83, The auto-selected port assignment (self.port = port or _find_free_port(host)) is susceptible to a TOCTOU race: when port is not provided, retry binding with a freshly chosen port if the server startup fails; modify the startup logic in the ResourceServerProcess (the code paths that call self.port and launch the server—e.g., the constructor assignment and the method that actually binds/starts the server) to detect bind/start failures and, if the original port was auto-assigned (port argument was None), loop a few times: pick a new port via _find_free_port(host), set self.port, and retry startup with a short backoff before finally raising the error; ensure this retry logic is used in both the initial constructor/startup and the alternate startup path referenced around lines 118–138.
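
The retry idea in this comment can be sketched independently of the real class. The helper below is hypothetical: `start_with_port_retry`, its `start` callable, and the choice of `OSError` as the bind-failure signal are assumptions, not code from `ResourceServerProcess`. It retries only when the port was auto-assigned, as the comment suggests.

```python
import socket
import time
from typing import Callable, Optional


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused port (it may be taken again later)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def start_with_port_retry(
    start: Callable[[int], None],
    host: str = "127.0.0.1",
    port: Optional[int] = None,
    attempts: int = 3,
    backoff_s: float = 0.1,
) -> int:
    """Call start(port); retry with a fresh port only if the port was auto-picked."""
    if port is not None:
        # Caller chose the port explicitly: do not silently move to another one.
        start(port)
        return port

    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        candidate = find_free_port(host)
        try:
            start(candidate)
            return candidate
        except OSError as exc:  # assumed failure signal for "port already taken"
            last_error = exc
            time.sleep(backoff_s * (attempt + 1))
    raise RuntimeError(f"server failed to start after {attempts} attempts") from last_error
```

The race window is narrowed, not eliminated; passing port 0 to the server and reading back the bound port avoids it entirely when the server supports that.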

coderabbitai · 2026-02-25T18:03:58Z

packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py

+        if self.harness_kwargs:
+            cmd += ["--harness_kwargs_json", json.dumps(self.harness_kwargs)]
+
+        logger.info("Spawning resource server: %s", " ".join(cmd))
+        self._process = subprocess.Popen(cmd)


⚠️ Potential issue | 🟠 Major

Avoid logging the full spawn command with raw harness kwargs.

The command log can include --harness_kwargs_json and expose sensitive data (for example credentials embedded in URLs).

🔒 Proposed hardening

-        logger.info("Spawning resource server: %s", " ".join(cmd))
+        logger.info(
+            "Spawning resource server (exe=%s, host=%s, port=%s, harness_package=%s, has_harness_kwargs=%s)",
+            exe,
+            self.host,
+            self.port,
+            self.harness_package,
+            bool(self.harness_kwargs),
+        )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py` around lines 126 - 130, The current logger.info call logs the full spawn command (cmd) and can expose sensitive data in self.harness_kwargs; before logging, sanitize cmd by removing or redacting the value passed to "--harness_kwargs_json" (or omit that flag entirely) to produce a safe_cmd, then log safe_cmd instead of cmd; keep subprocess.Popen(cmd) unchanged so the process still receives the real arguments.
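
The alternative described in the prompt (log a sanitized copy, launch with the real arguments) might look like the sketch below. `redact_cmd` and the `<redacted>` placeholder are hypothetical names; only the `--harness_kwargs_json` flag comes from the snippet above, and the example command is made up.

```python
from typing import Iterable, List, Sequence


def redact_cmd(
    cmd: Sequence[str],
    secret_flags: Iterable[str] = ("--harness_kwargs_json",),
) -> List[str]:
    """Return a copy of cmd with the value after each secret flag masked."""
    flags = set(secret_flags)
    safe: List[str] = []
    redact_next = False
    for arg in cmd:
        if redact_next:
            # This argument is the value of a secret flag: never log it.
            safe.append("<redacted>")
            redact_next = False
            continue
        safe.append(arg)
        if arg in flags:
            redact_next = True
    return safe


# Hypothetical command line; the executable name is an assumption.
example_cmd = [
    "resource-server",
    "--harness_kwargs_json",
    '{"url": "https://user:secret@example.com"}',
]
safe_cmd = redact_cmd(example_cmd)
```

Here `safe_cmd` is what would go to the logger, while `example_cmd` itself is left untouched for `subprocess.Popen`.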

coderabbitai · 2026-02-25T18:03:58Z

packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py

+        self._process.terminate()
+        self._process.wait(timeout=5)
+        raise RuntimeError(f"Resource server failed to start at {self.url}")


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the startup-failure cleanup path lacks TimeoutExpired handling
rg -n -C5 'self\._process\.terminate\(\)' packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py

Repository: NVIDIA-NeMo/Evaluator

Length of output: 798

🏁 Script executed:

#!/bin/bash
# Check imports in the file
head -50 packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py

Repository: NVIDIA-NeMo/Evaluator

Length of output: 1579

🏁 Script executed:

#!/bin/bash
# Look for TimeoutExpired handling anywhere in the file
rg -n 'TimeoutExpired' packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py

Repository: NVIDIA-NeMo/Evaluator

Length of output: 114

Handle TimeoutExpired in startup failure cleanup to prevent orphaned processes.

Lines 136–138 call wait(timeout=5) without catching subprocess.TimeoutExpired, which can leave the child process running if termination hangs. The __exit__ method (lines 143–147) already implements the correct pattern with force-kill; apply the same approach here.

Proposed fix

         self._process.terminate()
-        self._process.wait(timeout=5)
+        try:
+            self._process.wait(timeout=5)
+        except subprocess.TimeoutExpired:
+            logger.warning("Force-killing resource server (pid=%s)", self._process.pid)
+            self._process.kill()
+            self._process.wait()
         raise RuntimeError(f"Resource server failed to start at {self.url}")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-        self._process.terminate()
-        self._process.wait(timeout=5)
-        raise RuntimeError(f"Resource server failed to start at {self.url}")
+        self._process.terminate()
+        try:
+            self._process.wait(timeout=5)
+        except subprocess.TimeoutExpired:
+            logger.warning("Force-killing resource server (pid=%s)", self._process.pid)
+            self._process.kill()
+            self._process.wait()
+        raise RuntimeError(f"Resource server failed to start at {self.url}")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@packages/nemo-evaluator/src/nemo_evaluator/core/resource_server_process.py` around lines 136 - 138, The startup failure path currently calls self._process.terminate() and self._process.wait(timeout=5) without handling subprocess.TimeoutExpired, which can leave the child running; update the failure cleanup in the method that raises RuntimeError(f"Resource server failed to start at {self.url}") to mirror the __exit__ pattern: call self._process.terminate(), catch subprocess.TimeoutExpired from self._process.wait(timeout=5), then call self._process.kill() and self._process.wait() in the except block before re-raising the RuntimeError so the process cannot be orphaned (refer to self._process, the RuntimeError raising location, and the __exit__ cleanup logic).
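
Because the same terminate-then-kill sequence is wanted in both the startup-failure path and `__exit__`, one option is to factor it into a small helper. The function below is a sketch under that assumption; `stop_process` is a hypothetical name and is not part of the PR.

```python
import subprocess


def stop_process(proc: subprocess.Popen, timeout_s: float = 5.0) -> int:
    """Terminate proc, escalating to kill() if it does not exit in time."""
    proc.terminate()
    try:
        proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Graceful shutdown hung: force-kill and reap so no child is orphaned.
        proc.kill()
        proc.wait()
    return proc.returncode
```

Both call sites could then run `stop_process(self._process)` before raising or returning, which keeps the two cleanup paths from drifting apart.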

coderabbitai · 2026-02-25T18:03:58Z

packages/nemo-evaluator/tests/unit_tests/core/test_gym_config_generator.py

+    def test_auto_generates_path(self):
+        config = {"test": "value"}
+        path = write_gym_config(config)
+        assert path.endswith(".yaml")
+        with open(path) as f:
+            loaded = yaml.safe_load(f)
+        assert loaded == {"test": "value"}


⚠️ Potential issue | 🟡 Minor

test_auto_generates_path should clean up the generated temp file.

This test leaves artifacts in /tmp, which can accumulate across runs.

🧹 Proposed fix

+import os
 import yaml
@@
     def test_auto_generates_path(self):
         config = {"test": "value"}
         path = write_gym_config(config)
-        assert path.endswith(".yaml")
-        with open(path) as f:
-            loaded = yaml.safe_load(f)
-        assert loaded == {"test": "value"}
+        try:
+            assert path.endswith(".yaml")
+            with open(path) as f:
+                loaded = yaml.safe_load(f)
+            assert loaded == {"test": "value"}
+        finally:
+            if os.path.exists(path):
+                os.remove(path)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    def test_auto_generates_path(self):
-        config = {"test": "value"}
-        path = write_gym_config(config)
-        assert path.endswith(".yaml")
-        with open(path) as f:
-            loaded = yaml.safe_load(f)
-        assert loaded == {"test": "value"}
+import os
+import yaml
+
+    def test_auto_generates_path(self):
+        config = {"test": "value"}
+        path = write_gym_config(config)
+        try:
+            assert path.endswith(".yaml")
+            with open(path) as f:
+                loaded = yaml.safe_load(f)
+            assert loaded == {"test": "value"}
+        finally:
+            if os.path.exists(path):
+                os.remove(path)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@packages/nemo-evaluator/tests/unit_tests/core/test_gym_config_generator.py` around lines 61 - 67, The test test_auto_generates_path leaves a temp YAML file in /tmp; update it to clean up after itself by deleting the generated file (os.remove(path)) at the end of the test or switch to using pytest's tmp_path fixture and write_gym_config so the file is created in a managed temp directory; reference test_auto_generates_path and write_gym_config to locate the code and ensure the cleanup (or tmp_path usage) runs even on assertion failures (e.g., in a finally block or via tmp_path).
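
The `tmp_path` alternative mentioned in the prompt could be sketched as follows. To stay self-contained, the example uses `write_flat_config`, a hypothetical stand-in that writes flat `key: value` lines; it is not `write_gym_config`, whose exact signature is not shown here. Note that a test written this way exercises the explicit-path branch, so the auto-generated-path case would still need the `try`/`finally` cleanup from the proposed fix.

```python
from pathlib import Path


def write_flat_config(config: dict, output_path: Path) -> str:
    """Hypothetical stand-in writer: one 'key: value' line per entry."""
    text = "".join(f"{key}: {value}\n" for key, value in config.items())
    output_path.write_text(text)
    return str(output_path)


def test_writes_into_managed_dir(tmp_path: Path) -> None:
    # pytest injects tmp_path and removes old temp directories itself,
    # so the test needs no manual cleanup even if an assertion fails.
    path = write_flat_config({"test": "value"}, tmp_path / "gym_config.yaml")
    assert path.endswith(".yaml")
    assert Path(path).read_text() == "test: value\n"
```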

Signed-off-by: Michal Bien <mbien@nvidia.com>

github-actions bot added nemo-evaluator tests labels Feb 25, 2026

coderabbitai bot reviewed Feb 25, 2026


Gym integration PoC

5492b11

Signed-off-by: Michal Bien <mbien@nvidia.com>

Glorf force-pushed the gym-integration branch from d3552df to 5492b11 Compare February 25, 2026 18:12

format

723e079

Signed-off-by: Michal Bien <mbien@nvidia.com>


DRAFT: Gym integration PoC #770
Glorf wants to merge 2 commits into main from gym-integration
