Conversation
|
No actionable comments were generated in the recent review. 🎉

📝 Walkthrough

Adds DSBench (data analysis benchmark) support: dataset preparation (Excel reading, HF download), a DSBenchEvaluator with relaxed structural equality fallback, prompt templates, module constants, and Excel-related dependency updates.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant DSBenchEval as DSBenchEvaluator
    participant MathEval as MathEvaluator
    participant ExtractAns as extract_answer()
    participant RelaxedEq as relaxed_equal()
    Client->>DSBenchEval: eval_single(data_point)
    DSBenchEval->>MathEval: super().eval_single(data_point)
    MathEval->>ExtractAns: extract_answer(response, relaxed=...)
    ExtractAns-->>MathEval: predicted_answer
    MathEval-->>DSBenchEval: evaluation_result (symbolic_correct)
    alt symbolic_correct == false
        DSBenchEval->>RelaxedEq: relaxed_equal(expected_answer, predicted_answer)
        RelaxedEq-->>DSBenchEval: match? (true/false)
        alt true
            DSBenchEval->>DSBenchEval: set symbolic_correct = true
        end
    end
    DSBenchEval-->>Client: final_evaluation_result
```
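The fallback flow in the diagram can be sketched in Python. This is a minimal illustration following the walkthrough, not the actual nemo_skills implementation; `MathEvaluator`, the result keys, and the scalar comparison inside `relaxed_equal` are all simplified stand-ins:

```python
def relaxed_equal(gt, pred):
    """Stand-in for the relaxed structural comparison (real code recurses
    into dicts/lists and uses symbolic math equality for scalars)."""
    return str(gt).strip().lower() == str(pred).strip().lower()


class MathEvaluator:
    def eval_single(self, data_point):
        # The real evaluator extracts the boxed answer and checks
        # strict symbolic equality; here we just echo the generation.
        return {
            "predicted_answer": data_point.get("generation"),
            "symbolic_correct": False,
        }


class DSBenchEvaluator(MathEvaluator):
    def eval_single(self, data_point):
        result = super().eval_single(data_point)
        # Fall back to relaxed structural equality only when the strict
        # symbolic check failed.
        if not result["symbolic_correct"] and relaxed_equal(
            data_point.get("expected_answer"), result["predicted_answer"]
        ):
            result["symbolic_correct"] = True
        return result
```

This mirrors the `alt symbolic_correct == false` branch above: the relaxed check can only flip a miss into a hit, never the reverse.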
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~28 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks: ✅ 3 passed
Force-pushed 5a8101d to 6403193.
Kipok
left a comment
thanks, mostly good, just a few minor comments! Could you please also add this to docs?
```python
# Use DSBench evaluator (extends MathEvaluator) with relaxed extraction and case-insensitive MCQ and handling of dict and list.
GENERATION_ARGS = "++prompt_config=generic/dsbench-da ++eval_type=dsbench ++eval_config.relaxed=true"

# # Recommend running LLM judge to verify dicts and lists correctly
```
should this not be commented out?
Was going back and forth on it: the LLM-as-judge catches about 1-2%, so I think it is OK to skip it for every run and only use it for final reporting. But I wanted to leave the config in the doc in case folks need it, so I took the middle ground of leaving it as a comment. Any recommendation on this?
```python
    2. Dict/list comparison using math_equal recursively
    """
    if predicted_answer is None:
        return gt_answer is None
```
can gt_answer be None? If not, probably better to just return False here for clarity
Not in this eval per se, but the scenario I am thinking of is a benchmark where some questions have short answers and others require some action (e.g., saving files). Then the gt answer can be None.
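A minimal sketch of the `relaxed_equal` behavior under discussion (a None guard plus recursive dict/list comparison). The scalar fallback here is a plain case-insensitive string compare standing in for the recursive `math_equal` the PR actually uses:

```python
from typing import Any


def relaxed_equal(gt_answer: Any, predicted_answer: Any) -> bool:
    # Missing prediction only matches a missing ground truth.
    if predicted_answer is None:
        return gt_answer is None
    # Dicts: same keys, recursively equal values.
    if isinstance(gt_answer, dict) and isinstance(predicted_answer, dict):
        return gt_answer.keys() == predicted_answer.keys() and all(
            relaxed_equal(v, predicted_answer[k]) for k, v in gt_answer.items()
        )
    # Lists: same length, element-wise recursion.
    if isinstance(gt_answer, list) and isinstance(predicted_answer, list):
        return len(gt_answer) == len(predicted_answer) and all(
            relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer)
        )
    # Scalars: stand-in comparison; the real code defers to math_equal.
    return str(gt_answer).strip().lower() == str(predicted_answer).strip().lower()
```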
```python
LOG = logging.getLogger(get_logger_name(__file__))


def relaxed_equal(gt_answer: Any, predicted_answer: Any) -> bool:
```
should we update the original math_equal with these changes?
if we do that, I guess we'd be able to fully reuse math evaluator here
Yes, updating the math evaluator would be the cleanest, but there are two issues:
- there was concern that changing it directly would make existing workflows inconsistent
- if MathEvaluator is pegged to match some external 3rd-party evaluations, it would break those too

One option is to use the "relaxed" argument that is already there for extract_answer and use it to branch into relaxed-mcq.
All of these are pretty straightforward to implement, so let me know what you prefer and I can make the changes.
I'd probably make the change directly. It feels like this is the right way to compare things. E.g. if the options are A, B, C, D and the LLM says \boxed{a}, where A is correct, that should be counted as correct, I guess. And the same for the other change.
So my suggestion would be to make the change directly, but please run e.g. nano-v3 math eval on maybe comp-math-24-25, and if we get a score within normal random variance, we should be good.
And maybe also some mcq benchmark, eg gpqa
I would still feel more comfortable doing this as a new PR so that if it breaks anyone's workflow they can revert it. Plus it would unblock dsbench for now.
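The case-insensitive MCQ comparison proposed above (so \boxed{a} matches option A) could look like the following sketch; `mcq_equal` and `normalize_mcq` are hypothetical helpers for illustration, not existing nemo_skills functions:

```python
def normalize_mcq(answer: str) -> str:
    """Hypothetical normalization for multiple-choice answers:
    strip surrounding whitespace and fold case."""
    return answer.strip().upper()


def mcq_equal(gt_answer: str, predicted_answer: str) -> bool:
    # \boxed{a} extracted as "a" should match ground truth "A".
    return normalize_mcq(gt_answer) == normalize_mcq(predicted_answer)
```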
Force-pushed 68c88cc to 2da0a4a.
🧹 Nitpick comments (1)
nemo_skills/evaluation/evaluator/dsbench.py (1)
Lines 29-79: Add DSBench evaluator to slurm tests and documentation.

Per project learnings:
- New evaluation/metrics logic should be added to slurm tests for comprehensive evaluation coverage.
- New benchmarks should have documentation with example commands, expected results for tested models, and any dataset-specific preparation or inference arguments.

Would you like me to draft an issue to track slurm test registration and a documentation stub for `dsbench_da`?

Based on learnings: "When adding new evaluation or metrics logic for benchmarks, consider adding the dataset to slurm tests for comprehensive evaluation" and "When adding new benchmarks, add documentation with example commands for how to run evaluation, expected results for tested models, and any dataset-specific details."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/evaluator/dsbench.py` around lines 29 - 79, Add the DSBench evaluator (dsbench_da) to slurm test registrations and create a documentation stub: update the slurm tests registry to include the new benchmark so it runs in CI (register the dsbench_da benchmark name and associated evaluation script/module), add a docs page under benchmarks describing how to run evaluation (example CLI commands and expected results for tested models), list dataset-specific preparation steps and inference arguments, and open an issue tracking adding slurm test coverage and the docs stub; reference the dsbench module (nemo_skills/evaluation/evaluator/dsbench.py) and the evaluator name dsbench_da so reviewers can locate the code to wire into tests and docs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@nemo_skills/evaluation/evaluator/dsbench.py`:
- Around line 38-45: The two broad except blocks around json.loads calls for
predicted_answer and gt_answer should be narrowed to only handle expected
decode/type errors: replace "except Exception" with "except
(json.JSONDecodeError, TypeError, ValueError)" for the
json.loads(predicted_answer) and json.loads(gt_answer) calls in dsbench.py so we
still keep the original string form on JSON/type parsing failures but don’t
swallow unrelated exceptions.
- Around line 47-64: The comparison currently allows asymmetric JSON parsing of
gt_answer and predicted_answer which yields false positives; change the parsing
so it is atomic: attempt to json.loads both gt_answer and predicted_answer
together (or attempt to parse each but only accept parsed values if both
parse-successfully), and only enter the dict/list comparison branches when both
sides parsed to the same container type; otherwise fall back to string/scalar
comparison using relaxed_equal. Use the existing variable names gt_answer and
predicted_answer and comparisons around isinstance(..., dict)/isinstance(...,
list) and relaxed_equal to locate where to enforce the atomic parse and
same-type check.
Force-pushed f049e73 to 7397408.
Force-pushed 7397408 to 6c84988.
Force-pushed 6c84988 to c9c1a56.
🧹 Nitpick comments (2)
nemo_skills/evaluation/evaluator/dsbench.py (2)
Lines 38-45: Narrow exception types for JSON parsing.

Catching bare `Exception` is overly broad. The expected exceptions from `json.loads()` are `json.JSONDecodeError`, `TypeError` (if the input is not a string), and potentially `ValueError`. Per coding guidelines, avoid catching exceptions that are not normally expected.

♻️ Proposed fix

```diff
 try:
     predicted_answer = json.loads(predicted_answer)
-except Exception:
+except (json.JSONDecodeError, TypeError, ValueError):
     pass  # keep original string form
 try:
     gt_answer = json.loads(gt_answer)
-except Exception:
+except (json.JSONDecodeError, TypeError, ValueError):
     pass  # keep original string form
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/evaluator/dsbench.py` around lines 38 - 45, The try/except blocks that call json.loads on predicted_answer and gt_answer are catching a bare Exception; narrow these to catch only the expected exceptions (json.JSONDecodeError, TypeError, and ValueError) so unrelated errors surface. Update both blocks around the json.loads(...) calls for predicted_answer and gt_answer to except (json.JSONDecodeError, TypeError, ValueError): and keep the same "pass" behavior to preserve the original string form.
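For reference, the failure modes of `json.loads` on bad input are easy to confirm with standard-library behavior alone; a parse helper following the narrowed-except pattern above might look like this (`try_parse` is an illustrative name, not code from the PR):

```python
import json


def try_parse(value):
    """Attempt JSON parsing, keeping the original value on the expected
    failure modes only (malformed JSON or non-string input)."""
    try:
        return json.loads(value)
    except (json.JSONDecodeError, TypeError, ValueError):
        return value  # keep original form
```

Note that `json.JSONDecodeError` subclasses `ValueError`, so listing both is redundant but harmless.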
Lines 62-64: Consider adding `strict=True` to `zip()` for defensive coding.

While the length equality check on line 62 ensures the lists are the same length, adding `strict=True` provides an additional safeguard and silences the static analysis warning.

♻️ Proposed fix

```diff
 return len(gt_answer) == len(predicted_answer) and all(
-    relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer)
+    relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer, strict=True)
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/evaluator/dsbench.py` around lines 62 - 64, The equality check uses zip(gt_answer, predicted_answer) without strictness; update the zip call used in the return statement (the expression using relaxed_equal in dsbench.py) to zip(gt_answer, predicted_answer, strict=True) so mismatched lengths raise immediately and satisfy static analysis—keep the existing len(...) equality check if you prefer defensive redundancy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@nemo_skills/dataset/dsbench_da/prepare.py`:
- Around line 151-171: The loop over task["questions"] can IndexError when
task["answers"] is shorter; before iterating (in the function/process where task
and task_id are available, e.g., in prepare.py around the for idx, question_name
in enumerate(task["questions"]) block), validate that len(task["answers"]) >=
len(task["questions"]) and raise a clear exception (including task_id and the
lengths) if not; this ensures the code fails fast with a descriptive error
instead of a bare IndexError when accessing task["answers"][idx].
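The fail-fast validation described in that prompt might look like the following; `validate_task` and the message wording are illustrative, with `task` and `task_id` as named in `prepare.py`:

```python
def validate_task(task: dict, task_id: str) -> None:
    """Raise a descriptive error instead of a bare IndexError when
    answers are missing for some questions."""
    n_questions = len(task["questions"])
    n_answers = len(task["answers"])
    if n_answers < n_questions:
        raise ValueError(
            f"Task {task_id}: {n_answers} answers for {n_questions} questions"
        )
```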
Force-pushed c9c1a56 to 7a42dc3.
Force-pushed ab01ad9 to 6e6e4c1.
🧹 Nitpick comments (3)
nemo_skills/evaluation/evaluator/dsbench.py (2)
Lines 38-45: Narrow the exception type from bare `Exception` to expected JSON parsing errors.

`json.loads` raises `json.JSONDecodeError` (or `TypeError` for non-string input). Catching bare `Exception` can mask unexpected errors like `MemoryError` or `KeyboardInterrupt` (the latter via `BaseException`, but `Exception` still catches more than necessary). The atomicity fix from the past review was applied (separate try/except blocks), but the exception type was not narrowed as also suggested.

♻️ Proposed fix

```diff
 try:
     predicted_answer = json.loads(predicted_answer)
-except Exception:
+except (json.JSONDecodeError, TypeError, ValueError):
     pass  # keep original string form
 try:
     gt_answer = json.loads(gt_answer)
-except Exception:
+except (json.JSONDecodeError, TypeError, ValueError):
     pass  # keep original string form
```

As per coding guidelines: "Do not catch exceptions when they are not normally expected to be raised; let code fail with clear errors instead of silently misbehaving."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/evaluator/dsbench.py` around lines 38 - 45, The try/except blocks that call json.loads on predicted_answer and gt_answer should not catch bare Exception; replace the generic except with a narrower except that only handles expected parsing failures (json.JSONDecodeError and TypeError) so other unexpected errors surface; update both places where json.loads(predicted_answer) and json.loads(gt_answer) are called to catch (json.JSONDecodeError, TypeError) and keep the existing behavior of leaving the original string on those specific failures.
Lines 57-64: `zip()` without `strict=True` is safe here but could be more explicit.

The `len` equality check on line 62 guarantees equal lengths before `zip`, so no data is silently dropped. Adding `strict=True` would make the intent self-documenting and guard against future refactors that might remove the length check.

♻️ Proposed fix

```diff
 return len(gt_answer) == len(predicted_answer) and all(
-    relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer)
+    relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer, strict=True)
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/evaluator/dsbench.py` around lines 57 - 64, Update the zip call in the DSBench evaluator branch that compares list answers to use zip(..., strict=True) to make the length-equality assumption explicit and guard against future refactors: in the block where predicted_answer is a list and gt_answer is also a list (the return that currently uses len(gt_answer) == len(predicted_answer) and all(relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer))), add strict=True to the zip invocation so it becomes zip(gt_answer, predicted_answer, strict=True).

nemo_skills/dataset/dsbench_da/prepare.py (1)
Lines 127-144: `errors="ignore"` silently drops malformed bytes; consider `errors="replace"` for visibility.

Lines 131 and 163 use `errors="ignore"` when reading text files, which silently drops bytes that can't be decoded. If a file has encoding issues, this could lead to subtly truncated or corrupted content with no indication. Using `errors="replace"` would insert � markers, making encoding problems visible in the output while still avoiding crashes.

♻️ Proposed fix

```diff
-introduction = intro_file.read_text(encoding="utf-8", errors="ignore")
+introduction = intro_file.read_text(encoding="utf-8", errors="replace")
```

Same for line 163:

```diff
-question_text = question_file.read_text(encoding="utf-8", errors="ignore").strip()
+question_text = question_file.read_text(encoding="utf-8", errors="replace").strip()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/dsbench_da/prepare.py` around lines 127 - 144, The code uses intro_file.read_text(encoding="utf-8", errors="ignore") (and a similar read_text call later around line 163) which silently drops undecodable bytes; change these read_text calls to use errors="replace" while keeping encoding="utf-8" so malformed bytes become visible as replacement characters (�) instead of being silently discarded—locate the intro_file.read_text invocation and any other read_text(...) calls in prepare.py and update their errors parameter accordingly.
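The difference between the two error handlers can be shown on a truncated UTF-8 sequence; this is standard codec behavior, independent of the PR code:

```python
# A multi-byte character cut off mid-sequence, as might happen in a
# corrupted data file: "café" is b'caf\xc3\xa9'; drop the last byte.
data = "café".encode("utf-8")[:4]  # b'caf\xc3'

# errors="ignore" silently drops the dangling byte:
assert data.decode("utf-8", errors="ignore") == "caf"

# errors="replace" makes the problem visible with U+FFFD:
assert data.decode("utf-8", errors="replace") == "caf\ufffd"
```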
🧹 Nitpick comments (1)
nemo_skills/evaluation/evaluator/__init__.py (1)
Line 34: LGTM, clean registration following the established pattern.

The import is inserted in alphabetical order, `"dsbench"` is absent from `EVALUATOR_MAP` so the overlap-validation guard at lines 83-91 won't fire, and all class-based-evaluator infrastructure (`evaluate`, `get_evaluator_class`, `supports_single_eval`) will pick up the new entry automatically.

Consider adding `dsbench_da` to the slurm evaluation tests for comprehensive end-to-end CI coverage. Based on learnings: "When adding new evaluation or metrics logic for benchmarks, consider adding the dataset to slurm tests for comprehensive evaluation."

Also applies to: 80-80
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/evaluator/__init__.py` at line 34, Add the new dsbench evaluator coverage to CI by including the dsbench_da dataset in the Slurm evaluation tests list so end-to-end evaluation runs in CI; update the test configuration that enumerates datasets for Slurm runs (the list/array that configures slurm evaluation datasets) to include "dsbench_da", and ensure any test harness that filters by evaluator uses DSBenchEvaluator (class name DSBenchEvaluator / registration in EVALUATOR_MAP) so the dataset is picked up during test execution. Verify the change triggers the existing evaluation flow (evaluate, get_evaluator_class, supports_single_eval) without modifying evaluator registration.
Force-pushed 3c4161d to f5986e5.
🧹 Nitpick comments (3)
nemo_skills/dataset/dsbench_da/prepare.py (1)
Lines 155-161: Use `LOG.warning` instead of `print`.

Line 160 uses `print` to report the skipped file.

♻️ Proposed fix

```diff
-print(f" Warning: {task_id}/{question_name}.txt not found, skipping")
+LOG.warning("Skipping %s/%s.txt: file not found", task_id, question_name)
```

This also requires importing the logger at the module level (add alongside the existing imports):

```python
import logging

from nemo_skills.utils import get_logger_name

LOG = logging.getLogger(get_logger_name(__file__))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/dsbench_da/prepare.py` around lines 155 - 161, Replace the print warning in the question-file check with a structured logger call: import logging and get_logger_name from nemo_skills.utils at module top, create LOG = logging.getLogger(get_logger_name(__file__)), and change the print in the loop (the missing-file branch inside the for idx, question_name in enumerate(task["questions"]) block that checks question_file.exists()) to LOG.warning with the same message text so skipped-file conditions are emitted as warnings.

nemo_skills/evaluation/evaluator/dsbench.py (2)
Lines 38-45: Narrow the caught exception type from `Exception` to `(json.JSONDecodeError, TypeError)`.

`json.loads` only raises `json.JSONDecodeError` (for invalid JSON) and `TypeError` (for non-string/bytes input). Catching bare `Exception` here can mask unexpected errors such as `AttributeError`, `MemoryError`, etc., silently producing incorrect comparison results instead of a clear failure. Ruff BLE001/S110 flags this on both blocks.

♻️ Proposed fix

```diff
-try:
-    predicted_answer = json.loads(predicted_answer)
-except Exception:
-    pass  # keep original string form
-try:
-    gt_answer = json.loads(gt_answer)
-except Exception:
-    pass  # keep original string form
+try:
+    predicted_answer = json.loads(predicted_answer)
+except (json.JSONDecodeError, TypeError):
+    pass  # keep original string form
+try:
+    gt_answer = json.loads(gt_answer)
+except (json.JSONDecodeError, TypeError):
+    pass  # keep original string form
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/evaluator/dsbench.py` around lines 38 - 45, The try/except blocks that call json.loads on predicted_answer and gt_answer should not catch broad Exception; replace "except Exception:" with "except (json.JSONDecodeError, TypeError):" so only JSON decode errors and non-string input are swallowed, leaving other errors to surface; update both places where json.loads(predicted_answer) and json.loads(gt_answer) are used accordingly.
Lines 62-64: Add `strict=True` to `zip()` to make the length invariant explicit.

The preceding `len(gt_answer) == len(predicted_answer)` check guarantees equal lengths, so this is safe as-is. However, passing `strict=True` makes the invariant explicit and avoids Ruff B905.

♻️ Proposed fix

```diff
-return len(gt_answer) == len(predicted_answer) and all(
-    relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer)
-)
+return len(gt_answer) == len(predicted_answer) and all(
+    relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer, strict=True)
+)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/evaluator/dsbench.py` around lines 62 - 64, The return expression in the function that compares gt_answer and predicted_answer uses zip(gt_answer, predicted_answer); update this to zip(gt_answer, predicted_answer, strict=True) so the length invariant is explicit (keep the existing len(...) == len(...) check or remove it if you prefer); target the return statement that calls relaxed_equal(e, p) for e, p in zip(gt_answer, predicted_answer) and add strict=True to the zip call.
Kipok
left a comment
Let's create issues for unfinished items and we can merge as is as long as gpu tests are passing
Adds evaluation support for the DA (data analysis) portion of DSBench.
- New benchmark (`dsbench_da`) with data preparation and prompt configs
- Plumbs a `relaxed` argument through to `extract_answer`; default behavior is unchanged
- Updates `sandbox.lock` with DSBench dependencies (`openpyxl`, `pyxlsb`)

Most files are independent and will not affect existing workflows. Evals have been tested end-to-end.
Summary by CodeRabbit
New Features
Chores