Address VDR feedback for NeMo FW evaluations #13701

Merged 3 commits on May 28, 2025

athitten
Collaborator

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

Adds details to evaluation docs addressing the VDR feedback for evaluations in NeMo FW.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Abhishree <[email protected]>
@athitten athitten requested review from marta-sd and jgerh May 22, 2025 17:31
Collaborator

@jgerh jgerh left a comment

Completed Tech Pubs review of docs/source/evaluation/evaluation-doc.rst and provided a few copyedits, formatting updates, and suggested text revisions.

@@ -51,15 +51,18 @@ facilitates easier debugging. However, for running evaluations on clusters, it i
ease of use.

Here is a list of line-by-line edits to read-only text. These edits should also be addressed.

Lines 4/4 to 13/13
Please fix bullet list syntax (I think it is causing the following paragraphs to be bold). See the reStructuredText Guide here: https://aschilling.gitlab-master-pages.nvidia.com/documentation/repo_docs/latest/rst-guide.html. Replace asterisk with hyphen, delete leading indent, and separate sentence from bulleted list with a blank line.

This guide provides detailed instructions on evaluating NeMo 2.0 checkpoints using the `NVIDIA Evals Factory <https://pypi.org/project/nvidia-lm-eval/>`__ within the NeMo Framework. Supported benchmarks include:

  • GPQA
  • GSM8K
  • IFEval
  • MGSM
  • MMLU
  • MMLU-Pro
  • MMLU-Redux
  • Wikilingua
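
The bullet list fix requested above (hyphen instead of asterisk, no leading indent, blank line before the list) could be applied mechanically. The following is a minimal sketch, not part of this PR; the function name `normalize_bullets` is hypothetical, and it assumes each bullet fits on a single line.

```python
import re


def normalize_bullets(rst_text: str) -> str:
    """Rewrite indented '* item' bullets as flush-left '- item' bullets.

    Also inserts a blank line when a list directly follows a paragraph.
    Assumption: every bullet item fits on one line (no continuation lines).
    """
    # Matches an optional indent, a single asterisk, whitespace, then the item text.
    # '**bold**' and '*emphasis*' do not match because no whitespace follows the asterisk.
    bullet = re.compile(r"^\s*\*\s+(.*)$")
    out = []
    for line in rst_text.splitlines():
        match = bullet.match(line)
        if match:
            # Separate the list from a preceding paragraph with a blank line.
            if out and out[-1].strip() and not out[-1].startswith("- "):
                out.append("")
            out.append("- " + match.group(1))
        else:
            out.append(line)
    return "\n".join(out)
```

Running it on a paragraph followed immediately by indented asterisk bullets yields the paragraph, a blank line, and flush-left hyphen bullets.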

Line 29/29 - 34/34
Same comment about bulleted list

The NVIDIA Evals Factory provides the following predefined configurations for evaluating the completions endpoint:

  • gsm8k
  • mgsm
  • mmlu
  • mmlu_pro
  • mmlu_redux

Line 36/36 - Line 44/44
same comment about bulleted list

It also provides the following configurations for evaluating the chat endpoint:

  • gpqa_diamond_cot
  • gsm8k_cot_instruct
  • ifeval
  • mgsm_cot
  • mmlu_instruct
  • mmlu_pro_instruct
  • mmlu_redux_instruct
  • wikilingua

Line 67/70 - 70/73
revise sentence (no "killed")

The entrypoint for evaluation is the ``evaluate`` method defined in ``nemo/collections/llm/api.py``. To run evaluations on the deployed model, use the following command. Make sure to open a new terminal within the same container to execute it. For longer evaluations, it is advisable to run both the deploy and evaluate commands in tmux sessions to prevent the processes from being terminated unexpectedly and aborting the runs.

Line 86/89
revise note

.. note::

   Please refer to the ``deploy`` and ``evaluate`` methods in ``nemo/collections/llm/api.py`` to review all available argument options, as the provided commands are only examples and do not include all arguments or their default values. For more detailed information on the arguments used in the ``ApiEndpoint`` and ``ConfigParams`` classes for evaluation, see the source code at `nemo/collections/llm/evaluation/api.py <https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/evaluation/api.py>`__.

Line 96/99
Make sure the # characters match the exact length of the heading text.

Line 98/101 - 99/102
remove period

The `evaluation.py <https://github.com/NVIDIA/NeMo/blob/main/scripts/llm/evaluation.py>`__ script serves as a reference for launching evaluations with NeMo-Run

Line 120/123
Make sure the # characters match the exact length of the heading text.

Line 137/140
Make sure the - characters match the exact length of the heading text.
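
The heading comments above (lines 96/99, 120/123, and 137/140) all ask for the underline to match the length of the heading text. A check like this could be scripted. The following is a rough sketch, not part of this PR; the function name `find_mismatched_underlines` is hypothetical, and it only looks at a title line followed directly by an underline made of one repeated punctuation character.

```python
def find_mismatched_underlines(rst_text: str) -> list[tuple[int, str]]:
    """Return (1-based line number, title) for headings whose underline
    length differs from the title length.

    Assumption: a heading is a non-empty text line followed directly by a
    line of at least two identical adornment characters.
    """
    adornments = set("#*=-^\"~")
    lines = rst_text.splitlines()
    problems = []
    for i in range(1, len(lines)):
        title = lines[i - 1].rstrip()
        underline = lines[i].rstrip()
        is_underline = (
            len(underline) >= 2
            and underline[0] in adornments
            and len(set(underline)) == 1
        )
        # Skip overlines: a "title" made of one repeated character is not a title.
        is_title = bool(title.strip()) and len(set(title)) > 1
        if is_underline and is_title and len(underline) != len(title):
            problems.append((i, title))
    return problems
```

Each reported entry gives the line of the heading text, so the underline to fix is on the following line.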

Line 164/167 - Line 168/178
revise sentence (no "killed")

The ``evaluate`` method defined in ``nemo/collections/llm/api.py`` supports the legacy way of evaluating models. To run evaluations on the deployed model, use the following command. Make sure to pass the ``nemo_checkpoint_path`` and ``url`` parameters, as they are required for using the legacy evaluation code. Open a new terminal within the same container to execute it. For longer evaluations, it is advisable to run both the deploy and evaluate commands in tmux sessions to prevent the processes from being interrupted or terminated unexpectedly, which could cause the runs to abort.

Collaborator Author

@jgerh could you please review again? I have addressed all your comments. Thank you!

marta-sd
marta-sd previously approved these changes May 23, 2025
Collaborator

@marta-sd marta-sd left a comment

LGTM, thank you!

Co-authored-by: jgerh <[email protected]>
Signed-off-by: Abhishree Thittenamane <[email protected]>
Collaborator

@jgerh jgerh left a comment

Completed tech pubs review of latest copyedits and approved.

@chtruong814
Collaborator

Fast merging since this is a docs change only

@chtruong814 chtruong814 merged commit be6aae6 into main May 28, 2025
35 checks passed
@chtruong814 chtruong814 deleted the athitten/vdr_feedback branch May 28, 2025 21:02
4 participants