Revert "Eval kit support (#1239)" by Kipok · Pull Request #1294 · NVIDIA-NeMo/Skills

Kipok · 2026-03-06T19:47:32Z

This reverts commit b237e33.

That PR broke GPU tests (and likely Slurm tests as well).

Summary by CodeRabbit

Chores
- Removed VLMEvalKit integration and related evaluation functionality, including documentation and configuration files.
- Removed in-process generation mode support.
- Simplified evaluation pipeline configuration and command assembly.

This reverts commit b237e33. Signed-off-by: Igor Gitman <igitman@nvidia.com>

coderabbitai · 2026-03-06T19:49:26Z

📝 Walkthrough

This pull request removes the VLMEvalKit integration from NeMo Skills, eliminating documentation, module exports, dataset utilities, evaluation metrics, inference implementations, and pipeline logic that supported VLMEvalKit-based evaluation workflows.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation & Requirements `docs/evaluation/eval-kit.md`, `docs/evaluation/index.md`, `requirements/eval-kit.txt` | Removed VLMEvalKit documentation, index reference, and placeholder requirements file. |
| Dataset Module `nemo_skills/dataset/eval_kit/__init__.py`, `nemo_skills/dataset/utils.py` | Removed eval_kit module exports (GENERATION_MODULE, METRICS_TYPE, constants, and get_extra_generation_args function); removed special-case dotted dataset name handling in get_default_dataset_module. |
| Evaluation Metrics `nemo_skills/evaluation/metrics/eval_kit_metrics.py`, `nemo_skills/evaluation/metrics/map_metrics.py`, `nemo_skills/evaluation/metrics/translation_metrics.py` | Deleted EvalKitMetrics class; removed eval_kit entry from METRICS_MAP; centralized corpus_bleu import in translation_metrics. |
| Audio Evaluation `nemo_skills/evaluation/evaluator/audio.py` | Simplified generation extraction logic; removed specialized handling for AudioBench, ST-EN-ZH, and MathQA task types; streamlined ASR-translation routing. |
| Inference Implementations `nemo_skills/inference/eval/eval_kit.py`, `nemo_skills/inference/mcore_skills.py` | Deleted EvalKitGenerationTask and MegatronMCoreGenerationTask classes with all supporting logic for VLMEvalKit integration, mcore in-process generation, distributed data handling, and metrics computation. |
| Inference Base & Factory `nemo_skills/inference/factory.py`, `nemo_skills/inference/generate.py` | Removed mcore_skills enum member and module mapping; removed CONTAINER_KEY, USE_TORCHRUN, and related classmethods (is_self_contained, get_env_prefix, get_extra_package_dirs) from GenerationTask base class. |
| Pipeline Evaluation `nemo_skills/pipeline/eval.py` | Removed _apply_task_overrides helper; eliminated dynamic per-task GPU/container overrides, torchrun configuration, and extra package directory propagation; simplified container assignment and command assembly. |
| Pipeline Utils `nemo_skills/pipeline/utils/eval.py`, `nemo_skills/pipeline/utils/generation.py` | Removed _resolve_generation_task_class; removed self_contained_task, num_gpus, and generation_task_class fields from BenchmarkArgs; made input_file mandatory; simplified venv bootstrap to always use uv; added input validation to get_generation_cmd. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Eval kit support #1239: Directly inverse operation—adds the eval_kit integration that this PR removes; touches identical files, classes, and exports.
Add nemo-skills-core subpackage for lightweight installs #1229: Modifies dataset loading and eval_kit special-case handling in nemo_skills/dataset/utils.py and related pipeline dataset logic.
Fix run.Script refactor #1133: Concurrent changes to get_generation_cmd in nemo_skills/pipeline/utils/generation.py (venv and input handling modifications).

Suggested reviewers

melllinia
gwarmstrong

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Revert "Eval kit support (`#1239`)"' accurately reflects the primary change: reverting a previous commit that added eval kit support to the NeMo Skills project. |
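
For readers unfamiliar with the Docstring Coverage warning above, the following is a minimal sketch of how such a percentage could be computed. How CodeRabbit actually measures coverage is not stated in this review; counting only functions and classes, and the `docstring_coverage` helper itself, are assumptions made for illustration.

```python
import ast


def docstring_coverage(source: str) -> float:
    """Return the percentage of functions and classes in source that have a docstring."""
    tree = ast.parse(source)
    # Assumption: only function and class definitions count toward coverage.
    nodes = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    if not nodes:
        return 100.0
    documented = sum(1 for node in nodes if ast.get_docstring(node) is not None)
    return 100.0 * documented / len(nodes)


# Hypothetical module with one documented and one undocumented function.
SAMPLE = '''
def documented():
    """Has a docstring."""
    return 1

def undocumented():
    return 2
'''

coverage = docstring_coverage(SAMPLE)
print(f"{coverage:.2f}%")
```

With one of two functions documented, this sample sits at half coverage, which would fall below an 80% threshold like the one reported in the warning.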

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 115-122: The current block overwrites the previously resolved
input_file with a global data_dir, breaking datasets located outside data_dir;
instead, when data_dir is truthy only compute the unmounted check path (use
data_dir_unmounted = pipeline_utils.get_unmounted_path(cluster_config, data_dir)
and set check_path = f"{data_dir_unmounted}/{benchmark.replace('.',
'/')}/{split}.jsonl") and do not reassign input_file (leave the already-resolved
input_file/unmounted_path logic intact); update the code around the data_dir
branch to only set check_path from data_dir_unmounted and preserve the existing
input_file variable.
- Around line 97-104: The current branch sets input_file to the
container-mounted path even when cluster_config["executor"] == "none" and
local_data_path exists; change logic in the not is_on_cluster block so that when
executor == "none" you use the host/unmounted path (unmounted_path) as
input_file. Locate the block using pipeline_utils.is_mounted_filepath,
input_file, unmounted_input_file, unmounted_path and adjust: if local_data_path
is not None and executor == "none" assign input_file = unmounted_path (or
compute unmounted_path via local_data_path or pipeline_utils.get_unmounted_path)
instead of the mounted f"{data_path}/..."; keep existing get_unmounted_path
fallback for the non-local case.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4420c18f-469e-4e1b-9400-2659573c997f

📥 Commits

Reviewing files that changed from the base of the PR and between b237e33 and 99f2c08.

📒 Files selected for processing (16)

docs/evaluation/eval-kit.md
docs/evaluation/index.md
nemo_skills/dataset/eval_kit/__init__.py
nemo_skills/dataset/utils.py
nemo_skills/evaluation/evaluator/audio.py
nemo_skills/evaluation/metrics/eval_kit_metrics.py
nemo_skills/evaluation/metrics/map_metrics.py
nemo_skills/evaluation/metrics/translation_metrics.py
nemo_skills/inference/eval/eval_kit.py
nemo_skills/inference/factory.py
nemo_skills/inference/generate.py
nemo_skills/inference/mcore_skills.py
nemo_skills/pipeline/eval.py
nemo_skills/pipeline/utils/eval.py
nemo_skills/pipeline/utils/generation.py
requirements/eval-kit.txt

💤 Files with no reviewable changes (11)

docs/evaluation/index.md
docs/evaluation/eval-kit.md
nemo_skills/evaluation/metrics/eval_kit_metrics.py
nemo_skills/evaluation/metrics/map_metrics.py
nemo_skills/dataset/eval_kit/__init__.py
nemo_skills/inference/factory.py
nemo_skills/inference/eval/eval_kit.py
requirements/eval-kit.txt
nemo_skills/dataset/utils.py
nemo_skills/inference/generate.py
nemo_skills/inference/mcore_skills.py

coderabbitai · 2026-03-06T20:02:24Z

nemo_skills/pipeline/utils/eval.py

+    if not is_on_cluster:
+        if pipeline_utils.is_mounted_filepath(cluster_config, data_path) or cluster_config["executor"] == "none":
            input_file = f"{data_path}/{benchmark.replace('.', '/')}/{split}.jsonl"
-            unmounted_path = pipeline_utils.get_unmounted_path(cluster_config, input_file)
-
-        unmounted_path = str(unmounted_path)
-        # When data_dir is specified, use it for both input_file and the existence check
-        # data_dir is always assumed to be a mounted path
-        if data_dir:
-            data_dir_unmounted = pipeline_utils.get_unmounted_path(cluster_config, data_dir)
-            input_file = f"{data_dir}/{benchmark.replace('.', '/')}/{split}.jsonl"
-            check_path = f"{data_dir_unmounted}/{benchmark.replace('.', '/')}/{split}.jsonl"
-        else:
-            check_path = unmounted_path
-        # checking if data file exists (can check locally as well)
-        if is_on_cluster:
-            if not pipeline_utils.cluster_path_exists(cluster_config, check_path):
-                raise ValueError(
-                    f"Data file {check_path} does not exist on cluster. "
-                    "Please check the benchmark and split parameters. "
-                    "Did you forget to run prepare data commands or add data_dir argument?"
-                )
+            if local_data_path is not None:
+                unmounted_path = f"{local_data_path}/{benchmark.replace('.', '/')}/{split}.jsonl"
+            else:
+                unmounted_input_file = pipeline_utils.get_unmounted_path(cluster_config, input_file)
+                unmounted_path = str(Path(__file__).parents[3] / unmounted_input_file.replace("/nemo_run/code/", ""))


⚠️ Potential issue | 🟠 Major

Use the host path for remapped datasets when executor == "none".

_resolve_data_path() converts external roots into /nemo_run/code/..., but this branch still feeds that mounted path into input_file even when local_data_path is present. For the "none" executor the command is emitted/run outside the container, so the existence check passes on the host path and generation later tries to open a container-only path.

Suggested fix

-        if pipeline_utils.is_mounted_filepath(cluster_config, data_path) or cluster_config["executor"] == "none":
-            input_file = f"{data_path}/{benchmark.replace('.', '/')}/{split}.jsonl"
-            if local_data_path is not None:
-                unmounted_path = f"{local_data_path}/{benchmark.replace('.', '/')}/{split}.jsonl"
+        if cluster_config["executor"] == "none" and local_data_path is not None:
+            input_file = f"{local_data_path}/{benchmark.replace('.', '/')}/{split}.jsonl"
+            unmounted_path = input_file
+        elif pipeline_utils.is_mounted_filepath(cluster_config, data_path):
+            input_file = f"{data_path}/{benchmark.replace('.', '/')}/{split}.jsonl"
+            if local_data_path is not None:
+                unmounted_path = f"{local_data_path}/{benchmark.replace('.', '/')}/{split}.jsonl"
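
The branching proposed in this suggestion can be sketched in isolation as follows. This is only an illustration under stated assumptions, not the repository's code: `resolve_input_file`, `benchmark_relpath`, and the `is_mounted` flag are hypothetical stand-ins for the logic around `pipeline_utils.is_mounted_filepath`, and the paths and benchmark name in the usage lines are made up.

```python
from typing import Optional


def benchmark_relpath(benchmark: str, split: str) -> str:
    """Map a dotted benchmark name and a split to a relative .jsonl path."""
    return f"{benchmark.replace('.', '/')}/{split}.jsonl"


def resolve_input_file(
    executor: str,
    data_path: str,
    local_data_path: Optional[str],
    benchmark: str,
    split: str,
    is_mounted: bool,
) -> Optional[str]:
    """Hypothetical helper: pick the path the generation command should read."""
    rel = benchmark_relpath(benchmark, split)
    if executor == "none" and local_data_path is not None:
        # The command runs outside the container, so use the host path.
        return f"{local_data_path}/{rel}"
    if is_mounted:
        # The command runs inside the container, so use the mounted path.
        return f"{data_path}/{rel}"
    # Remaining cases (e.g. resolving an unmounted path) are out of scope here.
    return None


# Made-up paths and benchmark name, purely for illustration.
host = resolve_input_file("none", "/nemo_run/code/data", "/home/user/data", "suite.task", "test", False)
mounted = resolve_input_file("local", "/nemo_run/code/data", "/home/user/data", "suite.task", "test", True)
print(host)
print(mounted)
```

Checking the `"none"` executor first is the point of the reordering: it keeps a container-only path from reaching a command that runs on the host.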

coderabbitai · 2026-03-06T20:02:24Z

nemo_skills/pipeline/utils/eval.py

+    # When data_dir is specified, use it for both input_file and the existence check
+    # data_dir is always assumed to be a mounted path
+    if data_dir:
+        data_dir_unmounted = pipeline_utils.get_unmounted_path(cluster_config, data_dir)
+        input_file = f"{data_dir}/{benchmark.replace('.', '/')}/{split}.jsonl"
+        check_path = f"{data_dir_unmounted}/{benchmark.replace('.', '/')}/{split}.jsonl"
+    else:
+        check_path = unmounted_path


⚠️ Potential issue | 🟠 Major

Don't overwrite the resolved dataset path with the global data_dir.

The code above has already computed the correct input_file / unmounted_path pair for mounted paths, copied extra datasets, and local runs. Replacing both values here with data_dir throws that resolution away, so benchmarks whose files live outside data_dir — and local executor runs that need the mounted /nemo_run/code/... path — now point at the wrong file.

Suggested fix

-    # When data_dir is specified, use it for both input_file and the existence check
-    # data_dir is always assumed to be a mounted path
-    if data_dir:
-        data_dir_unmounted = pipeline_utils.get_unmounted_path(cluster_config, data_dir)
-        input_file = f"{data_dir}/{benchmark.replace('.', '/')}/{split}.jsonl"
-        check_path = f"{data_dir_unmounted}/{benchmark.replace('.', '/')}/{split}.jsonl"
-    else:
-        check_path = unmounted_path
+    check_path = unmounted_path
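
To illustrate the distinction this comment draws, here is a small sketch that keeps the path handed to generation separate from the path used for the existence check. `check_data_file` is a hypothetical helper, the benchmark name is made up, and the temporary directory merely stands in for an unmounted data location; none of it is taken from the repository.

```python
import tempfile
from pathlib import Path


def check_data_file(input_file: str, check_path: str) -> str:
    """Hypothetical helper: validate check_path but return input_file unchanged."""
    if not Path(check_path).exists():
        raise ValueError(
            f"Data file {check_path} does not exist. "
            "Please check the benchmark and split parameters."
        )
    return input_file


with tempfile.TemporaryDirectory() as host_root:
    benchmark, split = "suite.task", "test"  # made-up benchmark name
    rel = f"{benchmark.replace('.', '/')}/{split}.jsonl"

    # Create the file at the host (unmounted) location only.
    host_file = Path(host_root) / rel
    host_file.parent.mkdir(parents=True)
    host_file.write_text("{}\n")

    # The mounted path is what generation would read inside the container.
    mounted_input = f"/nemo_run/code/{rel}"
    result = check_data_file(mounted_input, str(host_file))

    # A missing check path is reported without touching input_file.
    missing_raised = False
    try:
        check_data_file(mounted_input, str(Path(host_root) / "missing.jsonl"))
    except ValueError:
        missing_raised = True
```

The existence check only ever reads `check_path`, so the already-resolved `input_file` survives unchanged whichever location is used for validation.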

commit a5da597
Author: Igor Gitman <igitman@nvidia.com>
Date: Fri Mar 6 12:13:36 2026 -0800

    Revert "Eval kit support (#1239)" (#1294)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit b237e33
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date: Fri Mar 6 20:25:37 2026 +0400

    Eval kit support (#1239)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

Revert "Eval kit support (#1239)"

99f2c08

This reverts commit b237e33. Signed-off-by: Igor Gitman <igitman@nvidia.com>

Kipok force-pushed the revert-eval-kit branch from 269fd2c to 99f2c08 March 6, 2026 19:48

Kipok requested a review from gwarmstrong March 6, 2026 19:49

Kipok enabled auto-merge (squash) March 6, 2026 19:49

coderabbitai bot reviewed Mar 6, 2026

Kipok disabled auto-merge March 6, 2026 20:13

Kipok merged commit a5da597 into main Mar 6, 2026
5 checks passed

Kipok deleted the revert-eval-kit branch March 6, 2026 20:13
