Address VDR feedback for NeMo FW evaluations #13701

athitten · 2025-05-22T17:31:05Z

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

Adds details to evaluation docs addressing the VDR feedback for evaluations in NeMo FW.

Collection: [Note which collection this PR will affect]

Changelog

Add specific line by line info of high level changes in this PR.

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

Related to # (issue)

Signed-off-by: Abhishree <[email protected]>

jgerh

Completed Tech Pubs review of docs/source/evaluation/evaluation-doc.rst and provided a few copyedits, formatting updates, and suggested text revisions.

docs/source/evaluation/evaluation-doc.rst

jgerh · 2025-05-22T21:59:03Z

@@ -51,15 +51,18 @@ facilitates easier debugging. However, for running evaluations on clusters, it i
 ease of use.



Here is a list of line-by-line edits to read-only text. These edits should also be addressed.

Lines 4/4 to 13/13
Please fix bullet list syntax (I think it is causing the following paragraphs to be bold). See the reStructuredText Guide here: https://aschilling.gitlab-master-pages.nvidia.com/documentation/repo_docs/latest/rst-guide.html. Replace asterisk with hyphen, delete leading indent, and separate sentence from bulleted list with a blank line.

This guide provides detailed instructions on evaluating NeMo 2.0 checkpoints using the `NVIDIA Evals Factory <https://pypi.org/project/nvidia-lm-eval/>`__ within the NeMo Framework. Supported benchmarks include:

- GPQA
- GSM8K
- IFEval
- MGSM
- MMLU
- MMLU-Pro
- MMLU-Redux
- Wikilingua
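
The bullet-list fix requested above (replace the asterisk with a hyphen, delete the leading indent, and separate the sentence from the list with a blank line) could be applied mechanically. The following is a minimal sketch, not part of this PR; the function name `normalize_bullets` is hypothetical, and the regex assumes every bullet is a single line that starts with optional whitespace, an asterisk, and a space.

```python
import re

# Matches a bullet written as optional indent, '*', whitespace, then the item text.
# Inline emphasis such as "*word*" is not matched because no space follows the '*'.
BULLET = re.compile(r"^\s*\*\s+(.*)$")


def normalize_bullets(text: str) -> str:
    """Rewrite indented '*' bullets as unindented '-' bullets.

    A blank line is inserted before the first bullet of a list when the
    preceding line contains text.
    """
    out = []
    in_list = False
    for line in text.splitlines():
        match = BULLET.match(line)
        if match:
            # Separate the list from the introducing sentence.
            if not in_list and out and out[-1].strip():
                out.append("")
            out.append(f"- {match.group(1)}")
            in_list = True
        else:
            out.append(line)
            # A non-blank, non-bullet line ends the current list.
            if line.strip():
                in_list = False
    return "\n".join(out)
```

Calling `normalize_bullets` on the contents of an `.rst` file returns the rewritten text; the result should still be reviewed by hand, since multi-line list items are not handled.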

Line 29/29 - 34/34
Same comment about bulleted list

The NVIDIA Evals Factory provides the following predefined configurations for evaluating the completions endpoint:

- gsm8k
- mgsm
- mmlu
- mmlu_pro
- mmlu_redux

Line 36/36 - Line 44/44
Same comment about bulleted list

It also provides the following configurations for evaluating the chat endpoint:

- gpqa_diamond_cot
- gsm8k_cot_instruct
- ifeval
- mgsm_cot
- mmlu_instruct
- mmlu_pro_instruct
- mmlu_redux_instruct
- wikilingua

Line 67/70 - 70/73
revise sentence (no "killed")

The entrypoint for evaluation is the ``evaluate`` method defined in ``nemo/collections/llm/api.py``. To run evaluations on the deployed model, use the following command. Make sure to open a new terminal within the same container to execute it. For longer evaluations, it is advisable to run both the deploy and evaluate commands in tmux sessions to prevent the processes from being terminated unexpectedly and aborting the runs.

Line 86/89
revise note

.. note::
   Please refer to the ``deploy`` and ``evaluate`` methods in ``nemo/collections/llm/api.py`` to review all available argument options, as the provided commands are only examples and do not include all arguments or their default values. For more detailed information on the arguments used in the ``ApiEndpoint`` and ``ConfigParams`` classes for evaluation, see the source code at `nemo/collections/llm/evaluation/api.py <https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/evaluation/api.py>`__.

Line 96/99
Make sure the # characters match the exact length of the heading text.

Line 98/101 - 99/102
remove period

The `evaluation.py <https://github.com/NVIDIA/NeMo/blob/main/scripts/llm/evaluation.py>`__ script serves as a reference for launching evaluations with NeMo-Run.

Line 120/123
Make sure the # characters match the exact length of the heading text.

Line 137/140
Make sure the - characters match the exact length of the heading text.
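
The three heading comments above all ask for the same thing: the underline must be exactly as long as the heading text. A check like this can be scripted. The sketch below is an illustration, not part of this PR; the function name `mismatched_underlines` is hypothetical, and it only recognises single-line titles followed directly by an underline made of one repeated character from a small, assumed set.

```python
import re

# A line made of one repeated adornment character, e.g. "#####" or "-----".
# The character set here is an assumption covering common heading styles.
ADORNMENT = re.compile(r"^([#=\-~^*+])\1+$")


def mismatched_underlines(text: str):
    """Find headings whose underline length differs from the title length.

    Returns a list of (title_line_number, title, underline_length) tuples,
    where title_line_number is 1-based.
    """
    lines = text.splitlines()
    problems = []
    for i in range(1, len(lines)):
        title = lines[i - 1].rstrip()
        underline = lines[i].rstrip()
        # Skip blank lines and overlines; only text followed by an underline counts.
        if title and not ADORNMENT.match(title) and ADORNMENT.match(underline):
            if len(underline) != len(title):
                problems.append((i, title, len(underline)))
    return problems
```

Running `mismatched_underlines` over the text of `docs/source/evaluation/evaluation-doc.rst` would list each heading that needs its `#` or `-` line adjusted; an empty list means every detected underline matches its title.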

Line 164/167 - Line 168/178
revise sentence (no "killed")

The ``evaluate`` method defined in ``nemo/collections/llm/api.py`` supports the legacy way of evaluating models. To run evaluations on the deployed model, use the following command. Make sure to pass the ``nemo_checkpoint_path`` and ``url`` parameters, as they are required for using the legacy evaluation code. Open a new terminal within the same container to execute it. For longer evaluations, it is advisable to run both the deploy and evaluate commands in tmux sessions to prevent the processes from being interrupted or terminated unexpectedly, which could cause the runs to abort.

@jgerh could you please review again? I have addressed all your comments. Thank you!

marta-sd

LGTM, thank you!

Co-authored-by: jgerh <[email protected]> Signed-off-by: Abhishree Thittenamane <[email protected]>

Signed-off-by: Abhishree <[email protected]>

jgerh

Completed tech pubs review of latest copyedits and approved.

chtruong814 · 2025-05-28T21:02:16Z

Fast merging since this is a docs change only

Address VDR feedback

d477486

Signed-off-by: Abhishree <[email protected]>

athitten requested review from marta-sd and jgerh May 22, 2025 17:31

jgerh reviewed May 22, 2025

marta-sd previously approved these changes May 23, 2025

Update docs/source/evaluation/evaluation-doc.rst

f7a0d5a

Co-authored-by: jgerh <[email protected]> Signed-off-by: Abhishree Thittenamane <[email protected]>

athitten dismissed marta-sd’s stale review via f7a0d5a May 27, 2025 21:28

Address code review comments

a03b60d

Signed-off-by: Abhishree <[email protected]>

jgerh approved these changes May 28, 2025

chtruong814 approved these changes May 28, 2025

chtruong814 merged commit be6aae6 into main May 28, 2025
35 checks passed

chtruong814 deleted the athitten/vdr_feedback branch May 28, 2025 21:02

athitten mentioned this pull request May 29, 2025

[Evaluation] Add support for simple-evals and tasks that require logprobs #13647

Open

8 tasks
