Specify nodes for gpu metrics collection and split data to each rank #320

Merged
hemildesai merged 2 commits into NVIDIA-NeMo:main from ashbhandare:abhandare/llmb
Aug 18, 2025

@ashbhandare
Contributor

The list of nodes to collect GPU metrics from is a comma-separated list set in the environment, e.g.:
export GPU_METRICS_NODES=0,2
python3 -m scripts.performance.llm.pretrain_nemotron4 ...
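
For illustration only, here is a minimal sketch of how a comma-separated value like `GPU_METRICS_NODES=0,2` could be parsed and checked on each node. This is not the code from this PR: the helper names `parse_metrics_nodes` and `should_collect`, the `NODE_RANK` variable, and the choice to treat an unset value as "collect on every node" are all assumptions made for the example.

```python
import os


def parse_metrics_nodes(raw):
    """Parse a comma-separated node list such as "0,2" into a set of ints.

    Returns None when the value is unset or empty. This sketch treats that
    as "no restriction", which is an assumption and not confirmed by the PR.
    """
    if raw is None or not raw.strip():
        return None
    # int() tolerates surrounding whitespace, so "0, 2" also parses.
    return {int(part) for part in raw.split(",") if part.strip()}


def should_collect(node_rank, nodes):
    """Return True if GPU metrics should be collected on this node."""
    return nodes is None or node_rank in nodes


if __name__ == "__main__":
    nodes = parse_metrics_nodes(os.environ.get("GPU_METRICS_NODES"))
    # NODE_RANK is a hypothetical stand-in for however the launcher
    # exposes the index of the current node.
    node_rank = int(os.environ.get("NODE_RANK", "0"))
    print(should_collect(node_rank, nodes))
```

With `GPU_METRICS_NODES=0,2`, this sketch would enable collection on nodes 0 and 2 and skip node 1.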

Signed-off-by: Aishwarya Bhandare <abhandare@nvidia.com>
@briancoutinho

Lgtm, thanks for adding this change :)

@hemildesai hemildesai merged commit 04f900a into NVIDIA-NeMo:main Aug 18, 2025
19 of 21 checks passed
zoeyz101 pushed a commit to zoeyz101/NeMo-Run that referenced this pull request Nov 12, 2025
…VIDIA-NeMo#320)

* Specify nodes for gpu metrics collection and split data to each rank

Signed-off-by: Aishwarya Bhandare <abhandare@nvidia.com>

* Fix unit test

Signed-off-by: Aishwarya Bhandare <abhandare@nvidia.com>

---------

Signed-off-by: Aishwarya Bhandare <abhandare@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>

4 participants