Mdp #1849
base: update-megatron-lm-5247a1f
Conversation
Signed-off-by: Anna Shors <ashors@nvidia.com>
terrykong left a comment
I know this is a draft and things are in flux, but I'm leaving some feedback from a first pass. This definitely needs more passes.

cc @ashors1 @ananthsub @yaoyu-33 for the design of mcore inference in the policy worker.

My opinion is that we may now need to move the inference code into a mixin so that the regular training policy methods are clearly separated from the generation ones, because there are now many of them. After separating, the MegatronInferenceMixin can be added as one of the parent classes (multiple inheritance); see the sketch below. Right now the Megatron policy worker class has ballooned quite significantly and is very intimidating.

As general feedback, I would love to have more of this boilerplate pushed into Megatron inference. The amount of code change needed seems like a lot, and we seem to need to set state that I would imagine the mcore inference APIs might handle (like the local/none thing).
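Something like this (class and method names below are illustrative, not from this PR):

```python
class MegatronInferenceMixin:
    """Generation-side methods split out of the policy worker (illustrative sketch)."""

    def setup_inference_engine(self) -> None:
        # build the dynamic inference engine, context, and wrapped model here
        ...

    def pause_inference_engine(self) -> None:
        ...

    def resume_inference_engine(self) -> None:
        ...


class MegatronPolicyWorkerBase:
    """Training-only methods: train, get_logprobs, checkpointing, etc. (illustrative only)."""


class MegatronPolicyWorker(MegatronPolicyWorkerBase, MegatronInferenceMixin):
    """Multiple inheritance keeps the training and generation code paths clearly separated."""
```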
from megatron.core.process_groups_config import ProcessGroupCollection
from megatron.core.rerun_state_machine import get_rerun_state_machine
from megatron.core.transformer import MegatronModule
# Note: delete_cuda_graphs is called internally by toggle_cuda_graphs when reset_cuda_graphs=True
nit: maybe move this comment to where toggle_cuda_graphs is called
why were these dropped?
self.dynamic_inference_engine = None
self.inference_client = None
self.inference_context = None
self.inference_wrapped_model = None
self._inference_engine_initialized = False
self._inference_engine_paused = True  # Start paused since we begin with training
self._inference_loop = None  # Event loop for inference operations
self._inference_thread = None  # Thread running the event loop
High-level question: why does mcore inference require so much bookkeeping by the application?
model_cfg = cfg_from_pretrained.model
cfg_from_pretrained.logger = LoggerConfig()

# Ensure make_vocab_size_divisible_by has a reasonable default (128 is standard)
@ananthsub @yaoyu-33 can you comment on this?
I feel this is potentially an unsafe thing to default, especially since we don't currently have a way to chop off the padded vocab during HF export.
It is good to have in general, though, because I know vocab parallelism will have issues without it, but it's probably not something to enable globally yet.
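For context, a rough sketch of the padding arithmetic this default controls (this mirrors the usual Megatron-LM rounding; illustrative only, not code from this PR):

```python
import math


def padded_vocab_size(vocab_size: int, make_vocab_size_divisible_by: int = 128, tp_size: int = 1) -> int:
    # Round the vocab up so the embedding table splits evenly across tensor-parallel ranks.
    multiple = make_vocab_size_divisible_by * tp_size
    return int(math.ceil(vocab_size / multiple)) * multiple


# A 32000-token vocab with the 128 default and tp_size=8 pads to 32768 rows;
# the extra 768 rows are exactly what an HF export would need to chop off again.
assert padded_vocab_size(32000, 128, 8) == 32768
```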
# Setting moe_router_dtype to higher precision (e.g. fp64) can improve numerical stability,
# especially when using many experts.
model_cfg.moe_router_dtype = self.cfg["megatron_cfg"]["moe_router_dtype"]
model_cfg.moe_token_dispatcher_type = "alltoall"
Why is this hard-coded? Can it be plumbed in?
RL/nemo_rl/models/policy/__init__.py (line 193 in dacac7e):
moe_token_dispatcher_type: str
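Something like this hedged sketch could plumb it through (the helper name is hypothetical, and the fallback simply mirrors the current hard-coded value):

```python
def resolve_moe_token_dispatcher_type(megatron_cfg: dict, default: str = "alltoall") -> str:
    # Fall back to the current hard-coded value only if the key is absent from the config.
    return megatron_cfg.get("moe_token_dispatcher_type", default)


# at the call site (sketch):
# model_cfg.moe_token_dispatcher_type = resolve_moe_token_dispatcher_type(self.cfg["megatron_cfg"])
```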
unified_memory_level = mcore_generation_config["unified_memory_level"]
model_config = self.model.config
# Enable CUDA graphs for inference
model_config.cuda_graph_impl = "local"
Another place where this is set to "local" (vs. "none"). Is this potentially error-prone, since there are many places where this needs to be set?
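One hedged option (the helper name is hypothetical, not from this PR): funnel every "local"/"none" flip through a single helper so the flag can't drift between call sites:

```python
def set_cuda_graph_mode(model_config, inference: bool) -> None:
    # Single place that decides the CUDA-graph implementation for training vs. inference.
    model_config.cuda_graph_impl = "local" if inference else "none"


# at the call sites (sketch):
# set_cuda_graph_mode(self.model.config, inference=True)   # entering generation
# set_cuda_graph_mode(self.model.config, inference=False)  # back to training
```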
self._inference_engine_paused = True
print(f"[Rank {self.rank}] paused inference engine")

async def pause_engine(self):
Should this be protected? It doesn't appear to be something the user needs to be aware of.
Suggested change:
- async def pause_engine(self):
+ async def _pause_engine(self):
self._inference_engine_paused = False
print(f"[Rank {self.rank}] Resumed inference engine")

async def resume_engine(self):
Same comment about making this protected.
self._inference_engine_paused = False

def pause_inference_engine(self):
Shall we use the sleep/wake nomenclature to match the other inference engines, at least in the nemo-rl API? I think it's okay that mcore inference calls it pause/resume, but that could be a little confusing if we ever do partial rollouts or in-flight weight updates, where an actual pause may be needed.
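A hedged sketch of the nemo-rl-facing naming (the class below is a minimal stand-in; only the pause/resume naming and the paused flag come from this PR):

```python
class MegatronGenerationInterface:
    """Minimal stand-in showing sleep/wake wrappers over mcore's pause/resume naming."""

    def __init__(self) -> None:
        self._inference_engine_paused = True  # start paused, mirroring the PR

    # mcore-flavored names, as in this PR
    def pause_inference_engine(self) -> None:
        self._inference_engine_paused = True

    def resume_inference_engine(self) -> None:
        self._inference_engine_paused = False

    # nemo-rl-facing names, matching the other inference engines
    def sleep(self) -> None:
        self.pause_inference_engine()

    def wake_up(self) -> None:
        self.resume_inference_engine()
```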
The dynamic inference engine for use during inference.
"""
# Get the language module (unwrap from precision wrappers if needed)
lang_module = (
_get_lang_module ?
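A hedged sketch of such a helper (the unwrap logic is an assumption about what the elided expression does; Float16Module-style wrappers usually expose the wrapped model as `.module`):

```python
def _get_lang_module(model):
    # Unwrap precision wrappers if present, otherwise return the model itself.
    return model.module if hasattr(model, "module") else model
```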
Force-pushed from 0c7d97e to e970643.
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information