Add DSBench-DA evaluation #1254

Merged: sgunasekar merged 23 commits into main from dsbench-da-eval on Feb 22, 2026

Conversation

@sgunasekar
Collaborator

@sgunasekar sgunasekar commented Feb 18, 2026

Adds evaluation support for the DA (data analysis) portion of DSBench.

  • New dataset (dsbench_da) with data preparation and prompt configs
  • New evaluator with case-insensitive MCQ comparison and recursive list/dict comparison
  • Minor change to math evaluator to pass relaxed argument through to extract_answer — default behavior is unchanged
  • Updated sandbox.lock with DSBench dependencies (openpyxl, pyxlsb)

Most files are independent and will not affect existing workflows. Evals have been tested end-to-end.
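
For quick orientation, a minimal sketch of the relaxed comparison described above (the signature matches the relaxed_equal excerpts quoted in the review threads below; the body is illustrative, folds in the exception-narrowing suggested during review, and per its docstring the merged code recurses via math_equal rather than plain string comparison):

import json
from typing import Any

def relaxed_equal(gt_answer: Any, predicted_answer: Any) -> bool:
    # A missing prediction only matches a missing ground truth.
    if predicted_answer is None:
        return gt_answer is None
    # Promote JSON-encoded strings to structured values; keep strings on failure.
    try:
        predicted_answer = json.loads(predicted_answer)
    except (json.JSONDecodeError, TypeError):
        pass
    try:
        gt_answer = json.loads(gt_answer)
    except (json.JSONDecodeError, TypeError):
        pass
    # Recursive dict comparison: same keys, relaxed-equal values.
    if isinstance(gt_answer, dict) and isinstance(predicted_answer, dict):
        return gt_answer.keys() == predicted_answer.keys() and all(
            relaxed_equal(value, predicted_answer[key]) for key, value in gt_answer.items()
        )
    # Recursive list comparison: same length, element-wise relaxed equality.
    if isinstance(gt_answer, list) and isinstance(predicted_answer, list):
        return len(gt_answer) == len(predicted_answer) and all(
            relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer)
        )
    # Case-insensitive scalar fallback covers MCQ letters like "A" vs "a".
    return str(gt_answer).strip().lower() == str(predicted_answer).strip().lower()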

Summary by CodeRabbit

  • New Features

    • DSBench evaluation added with flexible/relaxed answer matching for varied formats.
    • Data preparation pipeline and CLI for DSBench: downloads, validates, converts Excel tasks to prompt-ready entries.
    • New prompt configurations for in-context and tool-mode DSBench usage, enforcing boxed final-answer formatting.
  • Chores

    • Added Excel/data-processing dependencies and updated ignore entries for local tooling.

@sgunasekar sgunasekar self-assigned this Feb 18, 2026
@sgunasekar sgunasekar removed their assignment Feb 18, 2026
@sgunasekar sgunasekar requested review from Kipok and tmfs10 February 18, 2026 20:27
@coderabbitai
Contributor

coderabbitai bot commented Feb 18, 2026

No actionable comments were generated in the recent review. 🎉


📝 Walkthrough

Adds DSBench (data analysis benchmark) support: dataset preparation (Excel reading, HF download), a DSBenchEvaluator with relaxed structural equality fallback, prompt templates, module constants, and Excel-related dependency updates.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Git Configuration: .gitignore | Added .claude and .cursor ignore entries; removed an extra blank line. |
| DSBench Dataset Configuration: nemo_skills/dataset/dsbench_da/__init__.py | Added module constants EVAL_SPLIT, METRICS_TYPE, GENERATION_ARGS and license/header. |
| DSBench Data Preparation: nemo_skills/dataset/dsbench_da/prepare.py | New end-to-end data pipeline: Excel reading (including .xlsb), path formatting, HF hub download+extract, metadata parsing, in-context content extraction, and split JSONL output; adds CLI. |
| Evaluator Integration: nemo_skills/evaluation/evaluator/__init__.py, nemo_skills/evaluation/evaluator/dsbench.py, nemo_skills/evaluation/evaluator/math.py | Registered DSBenchEvaluator in the class map; added DSBenchEvaluator subclass that uses relaxed_equal structural comparison as a fallback; added relaxed_extraction flag passed through to extract_answer. |
| Prompt Templates: nemo_skills/prompt/config/generic/dsbench-da.yaml, nemo_skills/prompt/config/generic/dsbench-da-incontext.yaml | Added two DSBench prompt configs (tool-mode and in-context) with placeholders, Excel usage guidance, and boxed final-answer formatting requirements. |
| Dependencies: requirements/main.txt, requirements/stem.txt | Added/pinned Excel-related deps: openpyxl>=3.1.0, pandas>=2.0.0, pyxlsb>=1.0.10. |
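
For context, a minimal standalone example of why both Excel engines are listed (file names are hypothetical):

import pandas as pd

# Modern .xlsx workbooks are read with the openpyxl engine.
df_xlsx = pd.read_excel("task.xlsx", engine="openpyxl")
# Legacy binary .xlsb workbooks require the pyxlsb engine.
df_xlsb = pd.read_excel("task.xlsb", engine="pyxlsb")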

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant DSBenchEval as DSBenchEvaluator
    participant MathEval as MathEvaluator
    participant ExtractAns as extract_answer()
    participant RelaxedEq as relaxed_equal()

    Client->>DSBenchEval: eval_single(data_point)
    DSBenchEval->>MathEval: super().eval_single(data_point)
    MathEval->>ExtractAns: extract_answer(response, relaxed=...)
    ExtractAns-->>MathEval: predicted_answer
    MathEval-->>DSBenchEval: evaluation_result (symbolic_correct)
    alt symbolic_correct == false
        DSBenchEval->>RelaxedEq: relaxed_equal(expected_answer, predicted_answer)
        RelaxedEq-->>DSBenchEval: match? (true/false)
        alt true
            DSBenchEval->>DSBenchEval: set symbolic_correct = true
        end
    end
    DSBenchEval-->>Client: final_evaluation_result
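Expressed as code, the fallback in the diagram might look roughly like this (a sketch only; names such as eval_single, symbolic_correct, and predicted_answer follow the diagram rather than the verified implementation):

class DSBenchEvaluator(MathEvaluator):
    # MathEvaluator and relaxed_equal are the evaluator pieces listed in the file table above.
    def eval_single(self, data_point: dict) -> dict:
        # Standard math evaluation runs first (this is where extract_answer is called).
        result = super().eval_single(data_point)
        # Relaxed structural comparison is only a fallback for symbolic failures.
        if not result.get("symbolic_correct") and relaxed_equal(
            data_point["expected_answer"], result.get("predicted_answer")
        ):
            result["symbolic_correct"] = True
        return result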

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

Possibly related PRs

  • Add compute eval #1158: Adds a new evaluator registration and dataset/evaluator integration; related changes to EVALUATOR_CLASS_MAP and evaluator wiring.

Suggested reviewers

  • gwarmstrong
🚥 Pre-merge checks: ✅ 3 of 3 passed

  • Description check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The PR title 'Add DSBench-DA evaluation' directly and concisely describes the main change, introducing DSBench data analysis evaluation support, and matches the core objective of the changeset.
  • Docstring coverage (✅ Passed): Docstring coverage is 85.71%, which meets the required threshold of 80.00%.




Collaborator

@Kipok Kipok left a comment

thanks, mostly good, just a few minor comments! Could you please also add this to docs?

# Use DSBench evaluator (extends MathEvaluator) with relaxed extraction and case-insensitive MCQ and handling of dict and list.
GENERATION_ARGS = "++prompt_config=generic/dsbench-da ++eval_type=dsbench ++eval_config.relaxed=true"

# # Recommend running LLM judge to verify dicts and lists correctly
Collaborator

should this not be commented out?

Collaborator Author

Was going back and forth on it: the LLM-as-judge catches about 1-2%, so I think it is ok to skip it for every run and only use it for final reporting. But I wanted to leave the config in the doc in case folks need it, so I took the middle ground of leaving it as a comment. Any recommendation on this?

    2. Dict/list comparison using math_equal recursively
    """
    if predicted_answer is None:
        return gt_answer is None
Collaborator

can gt_answer be None? If not, probably better to just return False here for clarity

Collaborator Author

Not in this eval per se, but the scenario I am thinking of is a benchmark where some questions have short answers and others require some action (e.g., saving files). In that case the gt answer can be None.

LOG = logging.getLogger(get_logger_name(__file__))


def relaxed_equal(gt_answer: Any, predicted_answer: Any) -> bool:
Collaborator

should we update the original math_equal with these changes?

Collaborator

if we do that, I guess we'd be able to fully reuse math evaluator here

Collaborator Author

Yes, updating the math evaluator would be cleanest, but there are two issues:

  • updating it directly raised the concern that it would make existing workflows inconsistent
  • if MathEvaluator is pegged to match some external 3rd-party evaluations, it would break those too

One option is to use the "relaxed" argument that is already there for extract_answer and use it to branch into relaxed MCQ comparison.
All of these are pretty straightforward to implement, so let me know what you prefer and I can make the changes.
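
For concreteness, a purely hypothetical sketch of that option (the relaxed argument exists per the discussion; _extract_boxed and the normalization shown are placeholders, not the actual parser):

def extract_answer(response: str, relaxed: bool = False):
    # _extract_boxed is a hypothetical stand-in for the existing \boxed{} parser.
    answer = _extract_boxed(response)
    if relaxed and answer is not None:
        # Relaxed branch: normalize case/whitespace so MCQ answers like "a" match "A".
        answer = answer.strip().lower()
    return answer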

Collaborator

I'd probably make a change directly. It feels like this is the right way to compare things. E.g. if options are A, B, C, D and llm says \boxed{a}, where A is correct, that should be counted as correct I guess. And the same for the other change.

So my suggestion would be to make a change directly but please run e.g. nano-v3 math eval on maybe comp-math-24-25 and if we get score within normal random variance, we should be good

Collaborator

And maybe also some MCQ benchmark, e.g., gpqa

Collaborator Author

I would still feel more comfortable doing this as a new PR, so that if it breaks anyone's workflow they can revert it. Plus it would unblock dsbench for now.


Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
nemo_skills/evaluation/evaluator/dsbench.py (1)

29-79: Add DSBench evaluator to slurm tests and documentation.

Per project learnings:

  1. New evaluation/metrics logic should be added to slurm tests for comprehensive evaluation coverage.
  2. New benchmarks should have documentation with example commands, expected results for tested models, and any dataset-specific preparation or inference arguments.

Would you like me to draft an issue to track slurm test registration and a documentation stub for dsbench_da?

Based on learnings: "When adding new evaluation or metrics logic for benchmarks, consider adding the dataset to slurm tests for comprehensive evaluation" and "When adding new benchmarks, add documentation with example commands for how to run evaluation, expected results for tested models, and any dataset-specific details."


@sgunasekar sgunasekar requested a review from Kipok February 19, 2026 22:55

Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
nemo_skills/evaluation/evaluator/dsbench.py (2)

38-45: Narrow exception types for JSON parsing.

Catching bare Exception is overly broad. The expected exceptions from json.loads() are json.JSONDecodeError, TypeError (if input is not a string), and potentially ValueError. Per coding guidelines, avoid catching exceptions that are not normally expected.

♻️ Proposed fix
     try:
         predicted_answer = json.loads(predicted_answer)
-    except Exception:
+    except (json.JSONDecodeError, TypeError, ValueError):
         pass  # keep original string form
     try:
         gt_answer = json.loads(gt_answer)
-    except Exception:
+    except (json.JSONDecodeError, TypeError, ValueError):
         pass  # keep original string form
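
For reference, these are the failure modes json.loads actually exhibits (standalone illustration, not PR code):

import json

json.loads('{"a": 1}')  # -> {'a': 1}
json.loads("not json")  # raises json.JSONDecodeError (a subclass of ValueError)
json.loads(None)        # raises TypeError: the JSON object must be str, bytes or bytearray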

62-64: Consider adding strict=True to zip() for defensive coding.

While the length equality check on line 62 ensures the lists are the same length, adding strict=True provides an additional safeguard and silences the static analysis warning.

♻️ Proposed fix
         return len(gt_answer) == len(predicted_answer) and all(
-            relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer)
+            relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer, strict=True)
         )
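
A standalone illustration of what strict=True changes (requires Python 3.10+):

list(zip([1, 2], [1, 2, 3]))               # [(1, 1), (2, 2)] - the extra item is silently dropped
list(zip([1, 2], [1, 2, 3], strict=True))  # raises ValueError because the lengths differ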


Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (3)
nemo_skills/evaluation/evaluator/dsbench.py (2)

38-45: Narrow the exception type from bare Exception to expected JSON parsing errors.

json.loads raises json.JSONDecodeError (or TypeError for non-string input). Catching bare Exception can mask unexpected errors like MemoryError or KeyboardInterrupt (the latter via BaseException, but Exception still catches more than necessary). The atomicity fix from the past review was applied (separate try/except blocks), but the exception type was not narrowed as also suggested.

♻️ Proposed fix
     try:
         predicted_answer = json.loads(predicted_answer)
-    except Exception:
+    except (json.JSONDecodeError, TypeError, ValueError):
         pass  # keep original string form
     try:
         gt_answer = json.loads(gt_answer)
-    except Exception:
+    except (json.JSONDecodeError, TypeError, ValueError):
         pass  # keep original string form

As per coding guidelines: "Do not catch exceptions when they are not normally expected to be raised; let code fail with clear errors instead of silently misbehaving."


57-64: zip() without strict=True is safe here but could be more explicit.

The len equality check on line 62 guarantees equal lengths before zip, so no data is silently dropped. Adding strict=True would make the intent self-documenting and guard against future refactors that might remove the length check.

♻️ Proposed fix
         return len(gt_answer) == len(predicted_answer) and all(
-            relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer)
+            relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer, strict=True)
         )
nemo_skills/dataset/dsbench_da/prepare.py (1)

127-144: errors="ignore" silently drops malformed bytes — consider errors="replace" for visibility.

Lines 131 and 163 use errors="ignore" when reading text files, which silently drops bytes that can't be decoded. If a file has encoding issues, this could lead to subtly truncated or corrupted content with no indication. Using errors="replace" would insert markers, making encoding problems visible in the output while still avoiding crashes.

♻️ Proposed fix
-            introduction = intro_file.read_text(encoding="utf-8", errors="ignore")
+            introduction = intro_file.read_text(encoding="utf-8", errors="replace")

Same for line 163:

-            question_text = question_file.read_text(encoding="utf-8", errors="ignore").strip()
+            question_text = question_file.read_text(encoding="utf-8", errors="replace").strip()
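
A standalone illustration of the behavioral difference (not PR code):

data = b"caf\xe9"  # 'café' encoded as Latin-1; the \xe9 byte is invalid UTF-8
data.decode("utf-8", errors="ignore")   # -> 'caf'  (the bad byte vanishes silently)
data.decode("utf-8", errors="replace")  # -> 'caf\ufffd' (U+FFFD keeps the problem visible)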

Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
nemo_skills/evaluation/evaluator/__init__.py (1)

34-34: LGTM — clean registration following the established pattern.

The import is inserted in alphabetical order, "dsbench" is absent from EVALUATOR_MAP so the overlap-validation guard at lines 83–91 won't fire, and all class-based-evaluator infrastructure (evaluate, get_evaluator_class, supports_single_eval) will pick up the new entry automatically.

Consider adding dsbench_da to the slurm evaluation tests for comprehensive end-to-end CI coverage. Based on learnings: "When adding new evaluation or metrics logic for benchmarks, consider adding the dataset to slurm tests for comprehensive evaluation."

Also applies to: 80-80


Signed-off-by: suriya <sgunasekar@nvidia.com>
sgunasekar and others added 20 commits February 20, 2026 23:55
…sts and dicts.

Signed-off-by: suriya <sgunasekar@nvidia.com>
…ches the current system.

Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
…en excel files with context manager rather than unclosed file handles

Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (3)
nemo_skills/dataset/dsbench_da/prepare.py (1)

155-161: Use LOG.warning instead of print for the skipped-file warning.

Line 160 uses print for a condition (missing question file) that warrants a structured warning log so it is visible and filterable in non-interactive runs. All progress/info messages using print are fine, but error/skip conditions should go through the logger.

♻️ Proposed fix
-                print(f"    Warning: {task_id}/{question_name}.txt not found, skipping")
+                LOG.warning("Skipping %s/%s.txt: file not found", task_id, question_name)

This also requires importing the logger at the module level (add alongside the existing imports):

import logging
from nemo_skills.utils import get_logger_name

LOG = logging.getLogger(get_logger_name(__file__))
nemo_skills/evaluation/evaluator/dsbench.py (2)

38-45: Narrow the caught exception type from Exception to (json.JSONDecodeError, TypeError).

json.loads only raises json.JSONDecodeError (for invalid JSON) and TypeError (for non-string/bytes input). Catching bare Exception here can mask unexpected errors such as AttributeError, MemoryError, etc., silently producing incorrect comparison results instead of a clear failure. Ruff BLE001/S110 flags this on both blocks.

♻️ Proposed fix
-    try:
-        predicted_answer = json.loads(predicted_answer)
-    except Exception:
-        pass  # keep original string form
-    try:
-        gt_answer = json.loads(gt_answer)
-    except Exception:
-        pass  # keep original string form
+    try:
+        predicted_answer = json.loads(predicted_answer)
+    except (json.JSONDecodeError, TypeError):
+        pass  # keep original string form
+    try:
+        gt_answer = json.loads(gt_answer)
+    except (json.JSONDecodeError, TypeError):
+        pass  # keep original string form

62-64: Add strict=True to zip() to make the length invariant explicit.

The preceding len(gt_answer) == len(predicted_answer) check guarantees equal lengths, so this is safe as-is. However, passing strict=True makes the invariant explicit and avoids Ruff B905.

♻️ Proposed fix
-        return len(gt_answer) == len(predicted_answer) and all(
-            relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer)
-        )
+        return len(gt_answer) == len(predicted_answer) and all(
+            relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer, strict=True)
+        )

Signed-off-by: suriya <sgunasekar@nvidia.com>
Collaborator

@Kipok Kipok left a comment

Let's create issues for unfinished items and we can merge as is as long as gpu tests are passing

@sgunasekar sgunasekar merged commit ad034b5 into main Feb 22, 2026
6 checks passed
@sgunasekar sgunasekar deleted the dsbench-da-eval branch February 22, 2026 19:55