
Revert "Eval kit support (#1239)" #1294

Merged
Kipok merged 1 commit into main from revert-eval-kit
Mar 6, 2026

Kipok (Collaborator) commented Mar 6, 2026

This reverts commit b237e33.

That PR broke the GPU tests (and likely the Slurm tests as well).

Summary by CodeRabbit

  • Chores
    • Removed VLMEvalKit integration and related evaluation functionality, including documentation and configuration files.
    • Removed in-process generation mode support.
    • Simplified evaluation pipeline configuration and command assembly.

This reverts commit b237e33.

Signed-off-by: Igor Gitman <igitman@nvidia.com>
Kipok force-pushed the revert-eval-kit branch from 269fd2c to 99f2c08 March 6, 2026 19:48
Kipok requested a review from gwarmstrong March 6, 2026 19:49
Kipok enabled auto-merge (squash) March 6, 2026 19:49
coderabbitai bot (Contributor) commented Mar 6, 2026

📝 Walkthrough

This pull request removes the VLMEvalKit integration from NeMo Skills, eliminating documentation, module exports, dataset utilities, evaluation metrics, inference implementations, and pipeline logic that supported VLMEvalKit-based evaluation workflows.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation & Requirements: `docs/evaluation/eval-kit.md`, `docs/evaluation/index.md`, `requirements/eval-kit.txt` | Removed VLMEvalKit documentation, index reference, and placeholder requirements file. |
| Dataset Module: `nemo_skills/dataset/eval_kit/__init__.py`, `nemo_skills/dataset/utils.py` | Removed eval_kit module exports (GENERATION_MODULE, METRICS_TYPE, constants, and get_extra_generation_args function); removed special-case dotted dataset name handling in get_default_dataset_module. |
| Evaluation Metrics: `nemo_skills/evaluation/metrics/eval_kit_metrics.py`, `nemo_skills/evaluation/metrics/map_metrics.py`, `nemo_skills/evaluation/metrics/translation_metrics.py` | Deleted EvalKitMetrics class; removed eval_kit entry from METRICS_MAP; centralized corpus_bleu import in translation_metrics. |
| Audio Evaluation: `nemo_skills/evaluation/evaluator/audio.py` | Simplified generation extraction logic; removed specialized handling for AudioBench, ST-EN-ZH, and MathQA task types; streamlined ASR-translation routing. |
| Inference Implementations: `nemo_skills/inference/eval/eval_kit.py`, `nemo_skills/inference/mcore_skills.py` | Deleted EvalKitGenerationTask and MegatronMCoreGenerationTask classes with all supporting logic for VLMEvalKit integration, mcore in-process generation, distributed data handling, and metrics computation. |
| Inference Base & Factory: `nemo_skills/inference/factory.py`, `nemo_skills/inference/generate.py` | Removed mcore_skills enum member and module mapping; removed CONTAINER_KEY, USE_TORCHRUN, and related classmethods (is_self_contained, get_env_prefix, get_extra_package_dirs) from GenerationTask base class. |
| Pipeline Evaluation: `nemo_skills/pipeline/eval.py` | Removed _apply_task_overrides helper; eliminated dynamic per-task GPU/container overrides, torchrun configuration, and extra package directory propagation; simplified container assignment and command assembly. |
| Pipeline Utils: `nemo_skills/pipeline/utils/eval.py`, `nemo_skills/pipeline/utils/generation.py` | Removed _resolve_generation_task_class; removed self_contained_task, num_gpus, and generation_task_class fields from BenchmarkArgs; made input_file mandatory; simplified venv bootstrap to always use uv; added input validation to get_generation_cmd. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • melllinia
  • gwarmstrong
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Revert "Eval kit support (#1239)"' accurately reflects the primary change: reverting a previous commit that added eval kit support to the NeMo Skills project. |
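
A docstring-coverage figure like the 50.00% reported above can be approximated with the standard-library `ast` module. The following is a minimal sketch, assuming coverage means "share of functions and classes that carry a docstring"; the exact rules CodeRabbit applies are not stated in this thread, and `docstring_coverage`, `SAMPLE`, and `THRESHOLD` are illustrative names.

```python
import ast


def docstring_coverage(source: str) -> float:
    """Return the percentage of functions and classes in `source` with a docstring."""
    tree = ast.parse(source)
    nodes = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    if not nodes:
        # Nothing to document, so treat the file as fully covered.
        return 100.0
    documented = sum(1 for node in nodes if ast.get_docstring(node) is not None)
    return 100.0 * documented / len(nodes)


# Hypothetical input: one documented and one undocumented function.
SAMPLE = '''
def documented():
    """Has a docstring."""
    return 1

def undocumented():
    return 2
'''

THRESHOLD = 80.0  # threshold quoted in the pre-merge check
coverage = docstring_coverage(SAMPLE)
print(f"coverage={coverage:.2f}% passes={coverage >= THRESHOLD}")
```

With one of two functions documented, the sample lands at 50% and fails an 80% threshold, which mirrors the shape of the warning above.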


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 115-122: The current block overwrites the previously resolved
input_file with a global data_dir, breaking datasets located outside data_dir;
instead, when data_dir is truthy only compute the unmounted check path (use
data_dir_unmounted = pipeline_utils.get_unmounted_path(cluster_config, data_dir)
and set check_path = f"{data_dir_unmounted}/{benchmark.replace('.',
'/')}/{split}.jsonl") and do not reassign input_file (leave the already-resolved
input_file/unmounted_path logic intact); update the code around the data_dir
branch to only set check_path from data_dir_unmounted and preserve the existing
input_file variable.
- Around line 97-104: The current branch sets input_file to the
container-mounted path even when cluster_config["executor"] == "none" and
local_data_path exists; change logic in the not is_on_cluster block so that when
executor == "none" you use the host/unmounted path (unmounted_path) as
input_file. Locate the block using pipeline_utils.is_mounted_filepath,
input_file, unmounted_input_file, unmounted_path and adjust: if local_data_path
is not None and executor == "none" assign input_file = unmounted_path (or
compute unmounted_path via local_data_path or pipeline_utils.get_unmounted_path)
instead of the mounted f"{data_path}/..."; keep existing get_unmounted_path
fallback for the non-local case.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4420c18f-469e-4e1b-9400-2659573c997f

📥 Commits

Reviewing files that changed from the base of the PR and between b237e33 and 99f2c08.

📒 Files selected for processing (16)
  • docs/evaluation/eval-kit.md
  • docs/evaluation/index.md
  • nemo_skills/dataset/eval_kit/__init__.py
  • nemo_skills/dataset/utils.py
  • nemo_skills/evaluation/evaluator/audio.py
  • nemo_skills/evaluation/metrics/eval_kit_metrics.py
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/evaluation/metrics/translation_metrics.py
  • nemo_skills/inference/eval/eval_kit.py
  • nemo_skills/inference/factory.py
  • nemo_skills/inference/generate.py
  • nemo_skills/inference/mcore_skills.py
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/utils/eval.py
  • nemo_skills/pipeline/utils/generation.py
  • requirements/eval-kit.txt
💤 Files with no reviewable changes (11)
  • docs/evaluation/index.md
  • docs/evaluation/eval-kit.md
  • nemo_skills/evaluation/metrics/eval_kit_metrics.py
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/dataset/eval_kit/__init__.py
  • nemo_skills/inference/factory.py
  • nemo_skills/inference/eval/eval_kit.py
  • requirements/eval-kit.txt
  • nemo_skills/dataset/utils.py
  • nemo_skills/inference/generate.py
  • nemo_skills/inference/mcore_skills.py

Comment on lines +97 to +104
if not is_on_cluster:
if pipeline_utils.is_mounted_filepath(cluster_config, data_path) or cluster_config["executor"] == "none":
input_file = f"{data_path}/{benchmark.replace('.', '/')}/{split}.jsonl"
unmounted_path = pipeline_utils.get_unmounted_path(cluster_config, input_file)

unmounted_path = str(unmounted_path)
# When data_dir is specified, use it for both input_file and the existence check
# data_dir is always assumed to be a mounted path
if data_dir:
data_dir_unmounted = pipeline_utils.get_unmounted_path(cluster_config, data_dir)
input_file = f"{data_dir}/{benchmark.replace('.', '/')}/{split}.jsonl"
check_path = f"{data_dir_unmounted}/{benchmark.replace('.', '/')}/{split}.jsonl"
else:
check_path = unmounted_path
# checking if data file exists (can check locally as well)
if is_on_cluster:
if not pipeline_utils.cluster_path_exists(cluster_config, check_path):
raise ValueError(
f"Data file {check_path} does not exist on cluster. "
"Please check the benchmark and split parameters. "
"Did you forget to run prepare data commands or add data_dir argument?"
)
if local_data_path is not None:
unmounted_path = f"{local_data_path}/{benchmark.replace('.', '/')}/{split}.jsonl"
else:
unmounted_input_file = pipeline_utils.get_unmounted_path(cluster_config, input_file)
unmounted_path = str(Path(__file__).parents[3] / unmounted_input_file.replace("/nemo_run/code/", ""))

⚠️ Potential issue | 🟠 Major

Use the host path for remapped datasets when executor == "none".

_resolve_data_path() converts external roots into /nemo_run/code/..., but this branch still feeds that mounted path into input_file even when local_data_path is present. For the "none" executor the command is emitted/run outside the container, so the existence check passes on the host path and generation later tries to open a container-only path.

Suggested fix
-    if pipeline_utils.is_mounted_filepath(cluster_config, data_path) or cluster_config["executor"] == "none":
-        input_file = f"{data_path}/{benchmark.replace('.', '/')}/{split}.jsonl"
-        if local_data_path is not None:
-            unmounted_path = f"{local_data_path}/{benchmark.replace('.', '/')}/{split}.jsonl"
+    if cluster_config["executor"] == "none" and local_data_path is not None:
+        input_file = f"{local_data_path}/{benchmark.replace('.', '/')}/{split}.jsonl"
+        unmounted_path = input_file
+    elif pipeline_utils.is_mounted_filepath(cluster_config, data_path):
+        input_file = f"{data_path}/{benchmark.replace('.', '/')}/{split}.jsonl"
+        if local_data_path is not None:
+            unmounted_path = f"{local_data_path}/{benchmark.replace('.', '/')}/{split}.jsonl"
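
The branch ordering proposed in the suggested fix can be sketched in isolation. This is a minimal, hypothetical model rather than the project's code: `resolve_paths`, `benchmark_file`, and the `is_mounted` flag (standing in for `pipeline_utils.is_mounted_filepath`) are illustrative names, and the sample paths are made up.

```python
from typing import Optional, Tuple


def benchmark_file(root: str, benchmark: str, split: str) -> str:
    """Build <root>/<benchmark with dots turned into slashes>/<split>.jsonl."""
    return f"{root}/{benchmark.replace('.', '/')}/{split}.jsonl"


def resolve_paths(
    executor: str,
    data_path: str,
    local_data_path: Optional[str],
    benchmark: str,
    split: str,
    is_mounted: bool,
) -> Tuple[str, Optional[str]]:
    """Return (input_file, unmounted_path) following the suggested branch order."""
    # With executor "none" the command runs on the host, so a container-only
    # path cannot be opened; use the host copy for both values.
    if executor == "none" and local_data_path is not None:
        input_file = benchmark_file(local_data_path, benchmark, split)
        return input_file, input_file
    # Otherwise a mounted path stays the input, and the host copy (if known)
    # is only used for the existence check.
    if is_mounted:
        input_file = benchmark_file(data_path, benchmark, split)
        unmounted = None
        if local_data_path is not None:
            unmounted = benchmark_file(local_data_path, benchmark, split)
        return input_file, unmounted
    raise NotImplementedError("other branches are not modeled in this sketch")


host_run = resolve_paths("none", "/nemo_run/code/ext", "/home/user/ext", "group.bench", "test", True)
container_run = resolve_paths("local", "/nemo_run/code/ext", "/home/user/ext", "group.bench", "test", True)
print(host_run)
print(container_run)
```

In the host run both values point at the same host file, so the path that passes the existence check is also the path generation opens.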

Comment on lines +115 to +122
# When data_dir is specified, use it for both input_file and the existence check
# data_dir is always assumed to be a mounted path
if data_dir:
data_dir_unmounted = pipeline_utils.get_unmounted_path(cluster_config, data_dir)
input_file = f"{data_dir}/{benchmark.replace('.', '/')}/{split}.jsonl"
check_path = f"{data_dir_unmounted}/{benchmark.replace('.', '/')}/{split}.jsonl"
else:
check_path = unmounted_path

⚠️ Potential issue | 🟠 Major

Don't overwrite the resolved dataset path with the global data_dir.

The code above has already computed the correct input_file / unmounted_path pair for mounted paths, copied extra datasets, and local runs. Replacing both values here with data_dir throws that resolution away, so benchmarks whose files live outside data_dir — and local executor runs that need the mounted /nemo_run/code/... path — now point at the wrong file.

Suggested fix
-    # When data_dir is specified, use it for both input_file and the existence check
-    # data_dir is always assumed to be a mounted path
-    if data_dir:
-        data_dir_unmounted = pipeline_utils.get_unmounted_path(cluster_config, data_dir)
-        input_file = f"{data_dir}/{benchmark.replace('.', '/')}/{split}.jsonl"
-        check_path = f"{data_dir_unmounted}/{benchmark.replace('.', '/')}/{split}.jsonl"
-    else:
-        check_path = unmounted_path
+    check_path = unmounted_path
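
To make the difference concrete, the flagged behaviour and the suggested one can be compared side by side. This is a hedged sketch with hypothetical helpers (`with_data_dir_override`, `without_override`, `benchmark_file`) and invented paths; `data_dir_unmounted` is passed in directly as a stand-in for the result of `pipeline_utils.get_unmounted_path(cluster_config, data_dir)`.

```python
from typing import Optional, Tuple


def benchmark_file(root: str, benchmark: str, split: str) -> str:
    """Build <root>/<benchmark with dots turned into slashes>/<split>.jsonl."""
    return f"{root}/{benchmark.replace('.', '/')}/{split}.jsonl"


def with_data_dir_override(
    input_file: str,
    unmounted_path: str,
    data_dir: Optional[str],
    data_dir_unmounted: Optional[str],
    benchmark: str,
    split: str,
) -> Tuple[str, str]:
    """Flagged behaviour: a truthy data_dir replaces the already-resolved paths."""
    if data_dir:
        input_file = benchmark_file(data_dir, benchmark, split)
        check_path = benchmark_file(data_dir_unmounted, benchmark, split)
    else:
        check_path = unmounted_path
    return input_file, check_path


def without_override(input_file: str, unmounted_path: str) -> Tuple[str, str]:
    """Suggested behaviour: keep the resolved input_file and check its unmounted twin."""
    return input_file, unmounted_path


# A benchmark whose file was resolved to a location outside data_dir.
resolved_input = "/nemo_run/code/external/my/bench/test.jsonl"
resolved_unmounted = "/home/user/external/my/bench/test.jsonl"

overridden = with_data_dir_override(
    resolved_input, resolved_unmounted, "/data", "/mnt/cluster/data", "my.bench", "test"
)
preserved = without_override(resolved_input, resolved_unmounted)
print(overridden)
print(preserved)
```

In the first result the resolved external location is lost and both values point under `data_dir`; in the second the earlier resolution survives unchanged.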

Kipok disabled auto-merge March 6, 2026 20:13
Kipok merged commit a5da597 into main Mar 6, 2026
5 checks passed
Kipok deleted the revert-eval-kit branch March 6, 2026 20:13
sgunasekar added a commit that referenced this pull request Mar 11, 2026
commit a5da597
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Mar 6 12:13:36 2026 -0800

    Revert "Eval kit support  (#1239)" (#1294)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit b237e33
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Mar 6 20:25:37 2026 +0400

    Eval kit support  (#1239)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

commit dc28bbf
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Mar 5 10:17:44 2026 -0800

    Python direct tool calling without MCP (#1286)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 12454dd
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Mar 4 13:06:21 2026 -0800

    Allow het servers for nemo-rl jobs (#1223)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 8884a68
Author: Prasoon Varshney <prasoon1995@gmail.com>
Date:   Wed Mar 4 10:24:02 2026 -0800

    Support source_lang param for translation recipe (#1290)

    Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 4618b19
Author: Meriem B. <113170426+ka00ri@users.noreply.github.com>
Date:   Wed Mar 4 18:59:28 2026 +0100

    Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285)

    Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 5ac8609
Author: Talor Abramovich <talor19@gmail.com>
Date:   Wed Mar 4 02:30:06 2026 +0200

    Add SPEED-Bench (within repo) (#1279)

    Signed-off-by: Talor Abramovich <talora@nvidia.com>
    Signed-off-by: talora <talora@nvidia.com>
    Signed-off-by: Talor Abramovich <talor19@gmail.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>

commit c31eec5
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 12:18:15 2026 -0800

    Fix os.getlogin() crash in ns setup (#1289)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit c228e66
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 11:04:54 2026 -0800

    Fix streaming TypeError when delta.content is None (#1267) (#1288)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit aa47923
Author: Matvei Novikov <mnovikov@nvidia.com>
Date:   Mon Mar 2 16:28:41 2026 -0800

    Add LibTrace recipe for generating domain-specific reasoning data (#1224)

    Signed-off-by: jubick1337 <mnovikov@nvidia.com>
    Signed-off-by: mnovikov <mnovikov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 313cad7
Author: Stephen Ge <stepheng@nvidia.com>
Date:   Mon Mar 2 18:28:49 2026 -0500

    fix: clean parse-failure retries in prover (#1284)

    Signed-off-by: Stephen Ge <stepheng@nvidia.com>

commit 813cfa3
Author: George Armstrong <georgea@nvidia.com>
Date:   Mon Mar 2 15:10:08 2026 -0800

    tst: rollback inference-api to integrate (#1287)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 31735f9
Author: Valentin Mendelev <vmendelev@nvidia.com>
Date:   Mon Mar 2 23:11:25 2026 +0100

    Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250)

    Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>

commit d4ef8c0
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Feb 27 23:58:54 2026 +0400

    Update promt_config to working with openai format + inline setup (#1210)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit e879cbc
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:41:23 2026 -0800

    Update noc tutorial (#1282)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit f6e3505
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:17:33 2026 -0800

    Add noc reasoning tutorial (#1278)

    Signed-off-by: Amparo Canaveras <acanaveras@nvidia.com>
    Signed-off-by: rajeshwarid179 <rdevaramani@nvidia.com>
    Signed-off-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Amparo Canaveras <acanaveras@nvidia.com>
    Co-authored-by: Cursor <cursoragent@cursor.com>
    Co-authored-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Co-authored-by: rajeshwarid179 <rdevaramani@nvidia.com>

commit fc2072a
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 27 10:10:25 2026 -0800

    CritPt generation add prompt_format=None (#1280)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit c8abe5d
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 27 09:31:26 2026 -0800

    New slurm customization parameters (account, containers) (#1209)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 2b38cce
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 25 17:59:52 2026 -0800

    Add nemo-skills-core subpackage for lightweight installs (#1229)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 9fa8e83
Author: Dheeraj Peri <peri.dheeraj@gmail.com>
Date:   Wed Feb 25 12:56:35 2026 -0800

    feat: add custom judge type support for external repo integration (#1274)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Yongqiang Wang <yongqiang.seagull@gmail.com>
    Co-authored-by: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>

commit 8a32b13
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 24 15:24:42 2026 -0800

    Exclude numb3rs form test_eval.py (#1275)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6da2219
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Mon Feb 23 18:37:46 2026 +0400

    Numb3rs ds addition (#1174)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

commit ad034b5
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Sun Feb 22 11:55:24 2026 -0800

    Add DSBench-DA evaluation (#1254)

    Squash merge of changes during code-review.
    Signed-off-by: suriya <sgunasekar@nvidia.com>

commit 7593ab3
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 20 16:42:01 2026 -0800

    Add CritPt benchmark (#1200)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 58c31b2
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 20 16:19:22 2026 -0800

    Fix no_answer metric overcounting in _compute_pass_at_k (#1245)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 1f1a2e7
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 15:58:40 2026 -0800

    Fix incorrect prompt tokens count due to HF api update (#1264)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8ebc6f5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 09:05:33 2026 -0800

    Remove deprecated dataset group (#1263)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit ea4177f
Author: Yongqiang Wang <yongqiang.seagull@gmail.com>
Date:   Thu Feb 19 19:57:25 2026 -0500

    fix deps (#1258)

commit 60905a7
Author: Minho Ryu <ryumin93@gmail.com>
Date:   Fri Feb 20 09:39:39 2026 +0900

    Add aime26 (#1256)

    Signed-off-by: bzantium <ryumin93@gmail.com>

commit b28afc5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:18:25 2026 -0800

    Rename custom -> external benchmarks (#1262)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6cc9c45
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:10:33 2026 -0800

    Add reference to internal benchmarks repo (#1261)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 5202af6
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:08:05 2026 -0800

    Remove incorrect presence-penalty setting (#1259)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 144c70b
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 15:26:33 2026 -0800

    Adding an option to store benchmarks in external repo (#1240)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 10e6e39
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Thu Feb 19 19:57:21 2026 +0400

    update vllm miltimodal for api calls convenience (#1213)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Co-authored-by: mmkrtchyan <mmkrtchyan@nvidia.com>

commit 1ba4219
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Wed Feb 18 03:28:23 2026 +0400

    Fix --server_container not being applied to dependent jobs (#1244)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 9517614
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Mon Feb 16 11:13:24 2026 -0800

    Support mini-swe-agent as agent harness (#1212)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Signed-off-by: Charlie Truong <chtruong@nvidia.com>
    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Stephen Ge <stepheng@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Signed-off-by: Wei Du <wedu@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
    Co-authored-by: Ivan <imoshkov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Charlie Truong <chtruong@nvidia.com>
    Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
    Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Stephen Ge <stepheng@nvidia.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
    Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
    Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
    Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
    Co-authored-by: Wei Du <wedu@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
    Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
    Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com>
    Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>

commit a3d44dc
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 13 22:32:15 2026 -0800

    Add --installation_command support to prepare_data (#1243)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

commit e80d524
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 12 17:26:00 2026 -0800

    Fix CI disk space for Docker image builds (#1241)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit d22236c
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Feb 11 17:55:00 2026 -0800

    Fix answerbench prompt parsing (#1235)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 2401628
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 11 14:56:43 2026 -0800

    feat: add lockfiles for reproducible sandbox builds (#1233)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5a0a84d
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Wed Feb 11 13:30:03 2026 -0800

    removing datasets version restriction for LCB eval (#1230)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit ef0a890
Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Date:   Wed Feb 11 12:03:16 2026 +0400

    Gnalbandyan/add physics (#1214)

    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>

commit bd9d30c
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Tue Feb 10 15:13:27 2026 -0800

    LCB generic prompting (#1215)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit 7d6c49a
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Sat Feb 7 08:45:46 2026 -0800

    Add support for different variations of nemo-rl (#1220)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit b19ba96
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 6 21:40:56 2026 -0800

    Add multi-node sandbox support for SLURM clusters (#1218)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 8950bb0
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date:   Sat Feb 7 01:38:00 2026 +0100

    support structured outputs in hle judge for optional AA compatibility (#1186)

    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b84f7a2
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 6 14:51:02 2026 -0800

    A small update on running tests docs (#1219)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8e838e1
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 5 18:01:35 2026 -0800

    feat: add flag to disable sandbox replay (#1217)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5fd9085
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 5 15:57:01 2026 -0800

    Add an option to limit number of tool calls (#1216)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit d820200
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 3 10:43:55 2026 -0800

    Add arena-hard v2 (#1205)

    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: bzantium <ryumin93@gmail.com>

commit a30920e
Author: Igor Gitman <igitman@nvidia.com>
Date:   Mon Feb 2 10:53:55 2026 -0800

    Fix mkdocs warnings (#1204)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 19d7788
Author: Ivan <imoshkov@nvidia.com>
Date:   Mon Feb 2 23:25:13 2026 +0500

    Fix infinite wait in sandbox.wait_for_sandbox (#1206)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>

commit 3e65fbf
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Fri Jan 30 19:38:38 2026 -0800

    Improve tts (#1203)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 250c862
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Fri Jan 30 22:12:29 2026 +0400

    SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

commit 7ded756
Author: Ivan <imoshkov@nvidia.com>
Date:   Fri Jan 30 09:57:41 2026 +0500

     Add proper token counting to code execution model (#1184)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b986304
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Jan 29 17:57:07 2026 -0800

    Upgrade containers (#1198)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 3b44f02
Author: Dan Lord <blahblahasdf@gmail.com>
Date:   Thu Jan 29 16:40:47 2026 -0800

    Fix incorrect string format (#1199)

    Signed-off-by: dlord <dlord@nvidia.com>

commit c4854b8
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Thu Jan 29 13:43:36 2026 -0800

    Update nemo-rl to latest (#1087)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>