Conversation
ananthsub
left a comment
There was a problem hiding this comment.
This looks great!
For all of the run.Script paths, the arguments are specified in a particular way. We'll need to add some docs and/or point to the examples for how user scripts should be written if using these plugins.
How did you want to test these plugins in CI? AFAICT testing the nemo run plugins in nemo resorted to using end to end tests, but since these plugins override or specify overrides to the script or set env vars, do we need full e2e tests for these?
| # Check if nsys profiling is enabled and warn if so | ||
| if hasattr(task, "profiling") and task.profiling and task.profiling.use_nsys_profiler: | ||
| print("Warning: Nsys not supported with the FaultTolerancePlugin.") | ||
| task.profiling.use_nsys_profiler = False |
There was a problem hiding this comment.
should this check also be added to the configcontainer validation, since we can't catch this for the script case? it'd also be good to add this restriction to the docstring
There was a problem hiding this comment.
Yes, but let's add that in a separate PR.
| print(f"{self.__class__.__name__} added CLI override: train.exit_signal_handler=true") | ||
| else: | ||
| # Enable exit signal handler in training config | ||
| if self.enable_exit_handler and hasattr(task, "train"): |
There was a problem hiding this comment.
is the task the ConfigContainer ? from #56 the megatron_pretrain is what's wrapped with run.Partial, so does this need to access task.config first to get to the ConfigContainer?
I think unit tests for individual plugins and end to end tests in the CI with the plugins enabled are both required. |
ac02c3b to
d76c6a2
Compare
ananthsub
left a comment
There was a problem hiding this comment.
LGTM, one question inline
Closes #28