Add DSBench-DA evaluation #1254

Merged: sgunasekar merged 23 commits into main from dsbench-da-eval on Feb 22, 2026

Conversation

@sgunasekar
Collaborator

@sgunasekar sgunasekar commented Feb 18, 2026

Adds evaluation support for the DA (data analysis) portion of DSBench.

  • New dataset (dsbench_da) with data preparation and prompt configs
  • New evaluator with case-insensitive MCQ comparison and recursive list/dict comparison
  • Minor change to math evaluator to pass relaxed argument through to extract_answer — default behavior is unchanged
  • Updated sandbox.lock with DSBench dependencies (openpyxl, pyxlsb)

Most files are independent and will not affect existing workflows. Evals have been tested end-to-end.
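
For quick orientation, a minimal sketch of the relaxed comparison described above (the signature matches the relaxed_equal excerpts quoted in the review threads below; the body is illustrative, folds in the exception-narrowing suggested during review, and per its docstring the merged code recurses via math_equal rather than plain string comparison):

import json
from typing import Any

def relaxed_equal(gt_answer: Any, predicted_answer: Any) -> bool:
    # A missing prediction only matches a missing ground truth.
    if predicted_answer is None:
        return gt_answer is None
    # Promote JSON-encoded strings to structured values; keep strings on failure.
    try:
        predicted_answer = json.loads(predicted_answer)
    except (json.JSONDecodeError, TypeError):
        pass
    try:
        gt_answer = json.loads(gt_answer)
    except (json.JSONDecodeError, TypeError):
        pass
    # Recursive dict comparison: same keys, relaxed-equal values.
    if isinstance(gt_answer, dict) and isinstance(predicted_answer, dict):
        return gt_answer.keys() == predicted_answer.keys() and all(
            relaxed_equal(value, predicted_answer[key]) for key, value in gt_answer.items()
        )
    # Recursive list comparison: same length, element-wise relaxed equality.
    if isinstance(gt_answer, list) and isinstance(predicted_answer, list):
        return len(gt_answer) == len(predicted_answer) and all(
            relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer)
        )
    # Case-insensitive scalar fallback covers MCQ letters like "A" vs "a".
    return str(gt_answer).strip().lower() == str(predicted_answer).strip().lower()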

Summary by CodeRabbit

  • New Features

    • DSBench evaluation added with flexible/relaxed answer matching for varied formats.
    • Data preparation pipeline and CLI for DSBench: downloads, validates, converts Excel tasks to prompt-ready entries.
    • New prompt configurations for in-context and tool-mode DSBench usage, enforcing boxed final-answer formatting.
  • Chores

    • Added Excel/data-processing dependencies and updated ignore entries for local tooling.

@sgunasekar sgunasekar self-assigned this Feb 18, 2026
@sgunasekar sgunasekar removed their assignment Feb 18, 2026
@sgunasekar sgunasekar requested review from Kipok and tmfs10 February 18, 2026 20:27
@coderabbitai
Contributor

coderabbitai bot commented Feb 18, 2026

No actionable comments were generated in the recent review. 🎉


📝 Walkthrough

Adds DSBench (data analysis benchmark) support: dataset preparation (Excel reading, HF download), a DSBenchEvaluator with relaxed structural equality fallback, prompt templates, module constants, and Excel-related dependency updates.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Git Configuration: .gitignore | Added .claude and .cursor ignore entries; removed an extra blank line. |
| DSBench Dataset Configuration: nemo_skills/dataset/dsbench_da/__init__.py | Added module constants EVAL_SPLIT, METRICS_TYPE, GENERATION_ARGS and license/header. |
| DSBench Data Preparation: nemo_skills/dataset/dsbench_da/prepare.py | New end-to-end data pipeline: Excel reading (including .xlsb), path formatting, HF hub download+extract, metadata parsing, in-context content extraction, and split JSONL output; adds CLI. |
| Evaluator Integration: nemo_skills/evaluation/evaluator/__init__.py, nemo_skills/evaluation/evaluator/dsbench.py, nemo_skills/evaluation/evaluator/math.py | Registered DSBenchEvaluator in the class map; added DSBenchEvaluator subclass that uses relaxed_equal structural comparison as a fallback; added relaxed_extraction flag passed through to extract_answer. |
| Prompt Templates: nemo_skills/prompt/config/generic/dsbench-da.yaml, nemo_skills/prompt/config/generic/dsbench-da-incontext.yaml | Added two DSBench prompt configs (tool-mode and in-context) with placeholders, Excel usage guidance, and boxed final-answer formatting requirements. |
| Dependencies: requirements/main.txt, requirements/stem.txt | Added/pinned Excel-related deps: openpyxl>=3.1.0, pandas>=2.0.0, pyxlsb>=1.0.10. |
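
For context, a minimal standalone example of why both Excel engines are listed (file names are hypothetical):

import pandas as pd

# Modern .xlsx workbooks are read with the openpyxl engine.
df_xlsx = pd.read_excel("task.xlsx", engine="openpyxl")
# Legacy binary .xlsb workbooks require the pyxlsb engine.
df_xlsb = pd.read_excel("task.xlsb", engine="pyxlsb")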

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant DSBenchEval as DSBenchEvaluator
    participant MathEval as MathEvaluator
    participant ExtractAns as extract_answer()
    participant RelaxedEq as relaxed_equal()

    Client->>DSBenchEval: eval_single(data_point)
    DSBenchEval->>MathEval: super().eval_single(data_point)
    MathEval->>ExtractAns: extract_answer(response, relaxed=...)
    ExtractAns-->>MathEval: predicted_answer
    MathEval-->>DSBenchEval: evaluation_result (symbolic_correct)
    alt symbolic_correct == false
        DSBenchEval->>RelaxedEq: relaxed_equal(expected_answer, predicted_answer)
        RelaxedEq-->>DSBenchEval: match? (true/false)
        alt true
            DSBenchEval->>DSBenchEval: set symbolic_correct = true
        end
    end
    DSBenchEval-->>Client: final_evaluation_result
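Expressed as code, the fallback in the diagram might look roughly like this (a sketch only; names such as eval_single, symbolic_correct, and predicted_answer follow the diagram rather than the verified implementation):

class DSBenchEvaluator(MathEvaluator):
    # MathEvaluator and relaxed_equal are the evaluator pieces listed in the file table above.
    def eval_single(self, data_point: dict) -> dict:
        # Standard math evaluation runs first (this is where extract_answer is called).
        result = super().eval_single(data_point)
        # Relaxed structural comparison is only a fallback for symbolic failures.
        if not result.get("symbolic_correct") and relaxed_equal(
            data_point["expected_answer"], result.get("predicted_answer")
        ):
            result["symbolic_correct"] = True
        return result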

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

Possibly related PRs

  • Add compute eval #1158: Adds a new evaluator registration and dataset/evaluator integration; related changes to EVALUATOR_CLASS_MAP and evaluator wiring.

Suggested reviewers

  • gwarmstrong
🚥 Pre-merge checks: ✅ 3 of 3 passed

  • Description check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The PR title 'Add DSBench-DA evaluation' directly and concisely describes the main change, introducing DSBench data analysis evaluation support, and matches the core objective of the changeset.
  • Docstring coverage (✅ Passed): Docstring coverage is 85.71%, which meets the required threshold of 80.00%.




Collaborator

@Kipok Kipok left a comment

thanks, mostly good, just a few minor comments! Could you please also add this to docs?

# Use DSBench evaluator (extends MathEvaluator) with relaxed extraction and case-insensitive MCQ and handling of dict and list.
GENERATION_ARGS = "++prompt_config=generic/dsbench-da ++eval_type=dsbench ++eval_config.relaxed=true"

# # Recommend running LLM judge to verify dicts and lists correctly
Collaborator

should this not be commented out?

Collaborator Author

Was going back and forth on it: the LLM-as-judge catches about 1-2%, so I think it is ok to skip it for every run and only use it for final reporting. But I wanted to leave the config in the doc in case folks need it, so I took the middle ground of leaving it as a comment. Any recommendation on this?

    2. Dict/list comparison using math_equal recursively
    """
    if predicted_answer is None:
        return gt_answer is None
Collaborator

can gt_answer be None? If not, probably better to just return False here for clarity

Collaborator Author

Not in this eval per se, but the scenario I am thinking of is a benchmark where some questions have short answers and others require some action (e.g., saving files). In that case the gt answer can be None.

LOG = logging.getLogger(get_logger_name(__file__))


def relaxed_equal(gt_answer: Any, predicted_answer: Any) -> bool:
Collaborator

should we update the original math_equal with these changes?

Collaborator

if we do that, I guess we'd be able to fully reuse math evaluator here

Collaborator Author

Yes, updating the math evaluator would be cleanest, but there are two issues:

  • updating it directly raised the concern that it would make existing workflows inconsistent
  • if MathEvaluator is pegged to match some external 3rd-party evaluations, it would break those too

One option is to use the "relaxed" argument that is already there for extract_answer and use it to branch into relaxed MCQ comparison.
All of these are pretty straightforward to implement, so let me know what you prefer and I can make the changes.
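
For concreteness, a purely hypothetical sketch of that option (the relaxed argument exists per the discussion; _extract_boxed and the normalization shown are placeholders, not the actual parser):

def extract_answer(response: str, relaxed: bool = False):
    # _extract_boxed is a hypothetical stand-in for the existing \boxed{} parser.
    answer = _extract_boxed(response)
    if relaxed and answer is not None:
        # Relaxed branch: normalize case/whitespace so MCQ answers like "a" match "A".
        answer = answer.strip().lower()
    return answer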

Collaborator

I'd probably make a change directly. It feels like this is the right way to compare things. E.g. if options are A, B, C, D and llm says \boxed{a}, where A is correct, that should be counted as correct I guess. And the same for the other change.

So my suggestion would be to make a change directly but please run e.g. nano-v3 math eval on maybe comp-math-24-25 and if we get score within normal random variance, we should be good

Collaborator

And maybe also some MCQ benchmark, e.g., gpqa

Collaborator Author

I would still feel more comfortable doing this as a new PR, so that if it breaks anyone's workflow they can revert it. Plus it would unblock dsbench for now.


Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
nemo_skills/evaluation/evaluator/dsbench.py (1)

29-79: Add DSBench evaluator to slurm tests and documentation.

Per project learnings:

  1. New evaluation/metrics logic should be added to slurm tests for comprehensive evaluation coverage.
  2. New benchmarks should have documentation with example commands, expected results for tested models, and any dataset-specific preparation or inference arguments.

Would you like me to draft an issue to track slurm test registration and a documentation stub for dsbench_da?

Based on learnings: "When adding new evaluation or metrics logic for benchmarks, consider adding the dataset to slurm tests for comprehensive evaluation" and "When adding new benchmarks, add documentation with example commands for how to run evaluation, expected results for tested models, and any dataset-specific details."


@sgunasekar sgunasekar requested a review from Kipok February 19, 2026 22:55

Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
nemo_skills/evaluation/evaluator/dsbench.py (2)

38-45: Narrow exception types for JSON parsing.

Catching bare Exception is overly broad. The expected exceptions from json.loads() are json.JSONDecodeError, TypeError (if input is not a string), and potentially ValueError. Per coding guidelines, avoid catching exceptions that are not normally expected.

♻️ Proposed fix
     try:
         predicted_answer = json.loads(predicted_answer)
-    except Exception:
+    except (json.JSONDecodeError, TypeError, ValueError):
         pass  # keep original string form
     try:
         gt_answer = json.loads(gt_answer)
-    except Exception:
+    except (json.JSONDecodeError, TypeError, ValueError):
         pass  # keep original string form
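
For reference, these are the failure modes json.loads actually exhibits (standalone illustration, not PR code):

import json

json.loads('{"a": 1}')  # -> {'a': 1}
json.loads("not json")  # raises json.JSONDecodeError (a subclass of ValueError)
json.loads(None)        # raises TypeError: the JSON object must be str, bytes or bytearray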

62-64: Consider adding strict=True to zip() for defensive coding.

While the length equality check on line 62 ensures the lists are the same length, adding strict=True provides an additional safeguard and silences the static analysis warning.

♻️ Proposed fix
         return len(gt_answer) == len(predicted_answer) and all(
-            relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer)
+            relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer, strict=True)
         )
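
A standalone illustration of what strict=True changes (requires Python 3.10+):

list(zip([1, 2], [1, 2, 3]))               # [(1, 1), (2, 2)] - the extra item is silently dropped
list(zip([1, 2], [1, 2, 3], strict=True))  # raises ValueError because the lengths differ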


Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (3)
nemo_skills/evaluation/evaluator/dsbench.py (2)

38-45: Narrow the exception type from bare Exception to expected JSON parsing errors.

json.loads raises json.JSONDecodeError (or TypeError for non-string input). Catching bare Exception can mask unexpected errors like MemoryError or KeyboardInterrupt (the latter via BaseException, but Exception still catches more than necessary). The atomicity fix from the past review was applied (separate try/except blocks), but the exception type was not narrowed as also suggested.

♻️ Proposed fix
     try:
         predicted_answer = json.loads(predicted_answer)
-    except Exception:
+    except (json.JSONDecodeError, TypeError, ValueError):
         pass  # keep original string form
     try:
         gt_answer = json.loads(gt_answer)
-    except Exception:
+    except (json.JSONDecodeError, TypeError, ValueError):
         pass  # keep original string form

As per coding guidelines: "Do not catch exceptions when they are not normally expected to be raised; let code fail with clear errors instead of silently misbehaving."


57-64: zip() without strict=True is safe here but could be more explicit.

The len equality check on line 62 guarantees equal lengths before zip, so no data is silently dropped. Adding strict=True would make the intent self-documenting and guard against future refactors that might remove the length check.

♻️ Proposed fix
         return len(gt_answer) == len(predicted_answer) and all(
-            relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer)
+            relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer, strict=True)
         )
nemo_skills/dataset/dsbench_da/prepare.py (1)

127-144: errors="ignore" silently drops malformed bytes — consider errors="replace" for visibility.

Lines 131 and 163 use errors="ignore" when reading text files, which silently drops bytes that can't be decoded. If a file has encoding issues, this could lead to subtly truncated or corrupted content with no indication. Using errors="replace" would insert markers, making encoding problems visible in the output while still avoiding crashes.

♻️ Proposed fix
-            introduction = intro_file.read_text(encoding="utf-8", errors="ignore")
+            introduction = intro_file.read_text(encoding="utf-8", errors="replace")

Same for line 163:

-            question_text = question_file.read_text(encoding="utf-8", errors="ignore").strip()
+            question_text = question_file.read_text(encoding="utf-8", errors="replace").strip()
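
A standalone illustration of the behavioral difference (not PR code):

data = b"caf\xe9"  # 'café' encoded as Latin-1; the \xe9 byte is invalid UTF-8
data.decode("utf-8", errors="ignore")   # -> 'caf'  (the bad byte vanishes silently)
data.decode("utf-8", errors="replace")  # -> 'caf\ufffd' (U+FFFD keeps the problem visible)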

Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
nemo_skills/evaluation/evaluator/__init__.py (1)

34-34: LGTM — clean registration following the established pattern.

The import is inserted in alphabetical order, "dsbench" is absent from EVALUATOR_MAP so the overlap-validation guard at lines 83–91 won't fire, and all class-based-evaluator infrastructure (evaluate, get_evaluator_class, supports_single_eval) will pick up the new entry automatically.

Consider adding dsbench_da to the slurm evaluation tests for comprehensive end-to-end CI coverage. Based on learnings: "When adding new evaluation or metrics logic for benchmarks, consider adding the dataset to slurm tests for comprehensive evaluation."

Also applies to: 80-80


Signed-off-by: suriya <sgunasekar@nvidia.com>
sgunasekar and others added 20 commits February 20, 2026 23:55
…sts and dicts.

Signed-off-by: suriya <sgunasekar@nvidia.com>
…ches the current system.

Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
…en excel files with context manager rather than unclosed file handles

Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (3)
nemo_skills/dataset/dsbench_da/prepare.py (1)

155-161: Use LOG.warning instead of print for the skipped-file warning.

Line 160 uses print for a condition (missing question file) that warrants a structured warning log so it is visible and filterable in non-interactive runs. All progress/info messages using print are fine, but error/skip conditions should go through the logger.

♻️ Proposed fix
-                print(f"    Warning: {task_id}/{question_name}.txt not found, skipping")
+                LOG.warning("Skipping %s/%s.txt: file not found", task_id, question_name)

This also requires importing the logger at the module level (add alongside the existing imports):

import logging
from nemo_skills.utils import get_logger_name

LOG = logging.getLogger(get_logger_name(__file__))
nemo_skills/evaluation/evaluator/dsbench.py (2)

38-45: Narrow the caught exception type from Exception to (json.JSONDecodeError, TypeError).

json.loads only raises json.JSONDecodeError (for invalid JSON) and TypeError (for non-string/bytes input). Catching bare Exception here can mask unexpected errors such as AttributeError, MemoryError, etc., silently producing incorrect comparison results instead of a clear failure. Ruff BLE001/S110 flags this on both blocks.

♻️ Proposed fix
-    try:
-        predicted_answer = json.loads(predicted_answer)
-    except Exception:
-        pass  # keep original string form
-    try:
-        gt_answer = json.loads(gt_answer)
-    except Exception:
-        pass  # keep original string form
+    try:
+        predicted_answer = json.loads(predicted_answer)
+    except (json.JSONDecodeError, TypeError):
+        pass  # keep original string form
+    try:
+        gt_answer = json.loads(gt_answer)
+    except (json.JSONDecodeError, TypeError):
+        pass  # keep original string form

62-64: Add strict=True to zip() to make the length invariant explicit.

The preceding len(gt_answer) == len(predicted_answer) check guarantees equal lengths, so this is safe as-is. However, passing strict=True makes the invariant explicit and avoids Ruff B905.

♻️ Proposed fix
-        return len(gt_answer) == len(predicted_answer) and all(
-            relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer)
-        )
+        return len(gt_answer) == len(predicted_answer) and all(
+            relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer, strict=True)
+        )

Signed-off-by: suriya <sgunasekar@nvidia.com>
Collaborator

@Kipok Kipok left a comment

Let's create issues for unfinished items and we can merge as is as long as gpu tests are passing

@sgunasekar sgunasekar merged commit ad034b5 into main Feb 22, 2026
6 checks passed
@sgunasekar sgunasekar deleted the dsbench-da-eval branch February 22, 2026 19:55