Add nemo run plugins #65

Merged
hemildesai merged 3 commits into main from hemil/run-plugins
Jun 24, 2025

@hemildesai
Contributor

Closes #28

Contributor

@ananthsub ananthsub left a comment

This looks great!

For all of the run.Script paths, the arguments are specified in a particular way. We'll need to add some docs and/or point to the examples for how user scripts should be written if using these plugins.
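As a rough illustration of what such docs might show, here is a minimal sketch of a user script helper that accepts dotted `key=value` overrides such as `train.exit_signal_handler=true` (the override the plugin prints later in this PR). The helper names `parse_override` and `apply_overrides`, the plain-dict config, and the `profiling.use_nsys_profiler=false` override are assumptions for illustration only, not the repository's actual parsing code.

```python
from typing import Any


def parse_override(arg: str) -> tuple[list[str], Any]:
    """Split one 'a.b.c=value' argument into a key path and a typed value."""
    key, sep, raw = arg.partition("=")
    if not sep or not key:
        raise ValueError(f"expected key=value, got {arg!r}")
    lowered = raw.lower()
    if lowered in ("true", "false"):
        value: Any = lowered == "true"
    else:
        # Try int, then float, and fall back to the raw string.
        try:
            value = int(raw)
        except ValueError:
            try:
                value = float(raw)
            except ValueError:
                value = raw
    return key.split("."), value


def apply_overrides(config: dict, args: list[str]) -> dict:
    """Apply dotted overrides to a nested dict, creating sections as needed."""
    for arg in args:
        path, value = parse_override(arg)
        node = config
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return config


if __name__ == "__main__":
    import sys

    # Hypothetical default config; a real script would build its own.
    cfg = {"train": {"exit_signal_handler": False}}
    print(apply_overrides(cfg, sys.argv[1:]))
```

A script written this way could be launched with arguments like `train.exit_signal_handler=true`, which is the shape of override the plugins append in the `run.Script` path.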

How did you want to test these plugins in CI? AFAICT testing the nemo run plugins in nemo resorted to using end to end tests, but since these plugins override or specify overrides to the script or set env vars, do we need full e2e tests for these?

Comment on lines +122 to +125
# Check if nsys profiling is enabled and warn if so
if hasattr(task, "profiling") and task.profiling and task.profiling.use_nsys_profiler:
    print("Warning: Nsys not supported with the FaultTolerancePlugin.")
    task.profiling.use_nsys_profiler = False

Should this check also be added to the ConfigContainer validation, since we can't catch it in the script case? It would also be good to document this restriction in the docstring.

Contributor Author

Yes, but let's add that in a separate PR.
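For that follow-up, one possible shape of the validation is sketched below. All three dataclasses are hypothetical stand-ins (the real ConfigContainer and its field names are not shown in this thread; `ft` and `enabled` in particular are assumptions), and whether the check should raise or only warn is a design choice left open here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProfilingConfig:  # hypothetical stand-in
    use_nsys_profiler: bool = False


@dataclass
class FaultToleranceConfig:  # hypothetical stand-in
    enabled: bool = False


@dataclass
class Container:  # hypothetical stand-in for ConfigContainer
    profiling: Optional[ProfilingConfig] = None
    ft: Optional[FaultToleranceConfig] = None

    def validate(self) -> None:
        """Reject the combination that the plugin can only warn about."""
        ft_on = self.ft is not None and self.ft.enabled
        nsys_on = self.profiling is not None and self.profiling.use_nsys_profiler
        if ft_on and nsys_on:
            raise ValueError(
                "Nsys profiling is not supported together with fault tolerance."
            )
```

Placing the check in validation would cover the script case too, since the config is built inside the user script where the plugin cannot inspect it.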

        print(f"{self.__class__.__name__} added CLI override: train.exit_signal_handler=true")
else:
    # Enable exit signal handler in training config
    if self.enable_exit_handler and hasattr(task, "train"):

Is the task the ConfigContainer? From #56, megatron_pretrain is what's wrapped with run.Partial, so does this need to access task.config first to get to the ConfigContainer?
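If the task does turn out to wrap the container, a small helper could handle both shapes. This is only a sketch: `resolve_config` is a hypothetical name, and the assumption that the wrapped container is reachable as `task.config` comes from the question above, not from confirmed nemo run behavior.

```python
from types import SimpleNamespace
from typing import Any


def resolve_config(task: Any) -> Any:
    """Return the object carrying `train`, whether `task` is it or wraps it."""
    if hasattr(task, "train"):
        return task
    inner = getattr(task, "config", None)  # assumed attribute name
    if inner is not None and hasattr(inner, "train"):
        return inner
    raise AttributeError("task has neither `train` nor `config.train`")


# Stand-in objects for illustration; real tasks would come from nemo run.
container = SimpleNamespace(train=SimpleNamespace(exit_signal_handler=False))
wrapped = SimpleNamespace(config=container)

resolve_config(wrapped).train.exit_signal_handler = True
```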

@hemildesai
Contributor Author

> How did you want to test these plugins in CI? AFAICT testing the nemo run plugins in nemo resorted to using end to end tests, but since these plugins override or specify overrides to the script or set env vars, do we need full e2e tests for these?

I think unit tests for individual plugins and end to end tests in the CI with the plugins enabled are both required.
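A unit test for an individual plugin could look roughly like the sketch below. The function `disable_nsys_for_fault_tolerance` is a hypothetical stand-in that mirrors the nsys check quoted earlier in this review; it is not the plugin's real method, and the tasks are plain `SimpleNamespace` objects rather than real nemo run tasks.

```python
import unittest
from types import SimpleNamespace


def disable_nsys_for_fault_tolerance(task):
    """Hypothetical stand-in mirroring the check quoted in the review."""
    if hasattr(task, "profiling") and task.profiling and task.profiling.use_nsys_profiler:
        print("Warning: Nsys not supported with the FaultTolerancePlugin.")
        task.profiling.use_nsys_profiler = False
    return task


class DisableNsysTest(unittest.TestCase):
    def test_nsys_is_turned_off(self):
        task = SimpleNamespace(profiling=SimpleNamespace(use_nsys_profiler=True))
        disable_nsys_for_fault_tolerance(task)
        self.assertFalse(task.profiling.use_nsys_profiler)

    def test_task_without_profiling_is_untouched(self):
        task = SimpleNamespace()
        disable_nsys_for_fault_tolerance(task)
        self.assertFalse(hasattr(task, "profiling"))
```

Tests of this kind need no GPU or cluster, so they can run on every commit, with the end-to-end jobs reserved for checking that the plugins work when enabled in a real launch.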

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Contributor

@ananthsub ananthsub left a comment

LGTM, one question inline

@hemildesai hemildesai merged commit 4395912 into main Jun 24, 2025
25 checks passed
@hemildesai hemildesai deleted the hemil/run-plugins branch June 24, 2025 17:45
Linked issue: Plugins support for nemo run launching

2 participants