
Conversation

@shanmugamr1992 (Contributor)

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@shanmugamr1992 requested review from a team as code owners on January 29, 2026 22:05
@terrykong (Contributor) left a comment


i know this is a draft and things are in flux, but leaving some feedback from a first pass. this definitely needs more passes.

cc @ashors1 @ananthsub @yaoyu-33 for design of mcore inf in the policy worker

my opinion is we now need to move the inference to a mixin so that the regular training policy methods are clearly separated from the generation ones, because now there are many. after separating, the MegatronInferenceMixin can be added as one of the parent classes (multiple inheritance). right now the megatron policy worker class has ballooned quite significantly and is very intimidating.
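For illustration, a minimal sketch of what that split could look like (the class and method names below are hypothetical, not from this PR):

# Hypothetical sketch of the mixin split; these names are illustrative only.
class TrainingWorkerBase:
    # stand-in for the existing worker base class
    def train_step(self, batch):
        raise NotImplementedError


class MegatronInferenceMixin:
    """Generation-only methods, kept out of the training class body."""

    def init_inference_engine(self):
        # engine setup and inference bookkeeping would live here
        self._engine = object()

    def generate(self, prompts):
        # generation entry point; assumes init_inference_engine() ran first
        return [f"<completion for {p}>" for p in prompts]


class MegatronPolicyWorker(MegatronInferenceMixin, TrainingWorkerBase):
    """Training methods stay here; inference arrives via the mixin."""

    def train_step(self, batch):
        return sum(batch)  # placeholder training step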

a general piece of feedback: i would love to have more of this boilerplate pushed into megatron inference. the amount of code change needed seems like a lot, and we seem to need to set state that i would imagine mcore inference APIs might handle themselves (like the local/none thing)

from megatron.core.process_groups_config import ProcessGroupCollection
from megatron.core.rerun_state_machine import get_rerun_state_machine
from megatron.core.transformer import MegatronModule
# Note: delete_cuda_graphs is called internally by toggle_cuda_graphs when reset_cuda_graphs=True

nit: maybe move this comment to where toggle_cuda_graphs is called

Comment on lines 161 to 141

why were these dropped?

Comment on lines +529 to +537
self.dynamic_inference_engine = None
self.inference_client = None
self.inference_context = None
self.inference_wrapped_model = None
self._inference_engine_initialized = False
self._inference_engine_paused = True # Start paused since we begin with training
self._inference_loop = None # Event loop for inference operations
self._inference_thread = None # Thread running the event loop


high-level q: why does mcore inference require so much bookkeeping by the application?
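As a hedged sketch, one way the application-side state could at least be grouped (this dataclass is my invention, not an mcore API):

from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical: collapse the eight attributes above into one handle so the
# policy worker carries a single self.inference object.
@dataclass
class InferenceEngineState:
    engine: Optional[Any] = None
    client: Optional[Any] = None
    context: Optional[Any] = None
    wrapped_model: Optional[Any] = None
    initialized: bool = False
    paused: bool = True  # start paused since we begin with training
    loop: Optional[Any] = None    # event loop for inference operations
    thread: Optional[Any] = None  # thread running the event loop

# usage: self.inference = InferenceEngineState()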

model_cfg = cfg_from_pretrained.model
cfg_from_pretrained.logger = LoggerConfig()

# Ensure make_vocab_size_divisible_by has a reasonable default (128 is standard)

@ananthsub @yaoyu-33 can you comment on this?

i feel this is potentially an unsafe thing to default, especially if we don't currently have a way to chop off the vocab during HF export.

this is good to have in general, though, b/c i know vocab parallel will have issues without it; it's just prob not a good thing to enable globally yet
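For context, a simplified sketch of the padding that make_vocab_size_divisible_by controls (simplified on purpose: Megatron's real computation also folds in tensor-parallel size):

import math

def padded_vocab_size(vocab_size: int, divisor: int = 128) -> int:
    # round the vocab up to the next multiple of `divisor`
    return math.ceil(vocab_size / divisor) * divisor

# 32000 is already divisible by 128, so it is unchanged; 32007 gets padded
# to 32128, and those 121 extra embedding rows are what would need to be
# chopped off again during HF export.
print(padded_vocab_size(32000))  # 32000
print(padded_vocab_size(32007))  # 32128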

# Setting moe_router_dtype to higher precision (e.g. fp64) can improve numerical stability,
# especially when using many experts.
model_cfg.moe_router_dtype = self.cfg["megatron_cfg"]["moe_router_dtype"]
model_cfg.moe_token_dispatcher_type = "alltoall"

why is this hard coded? can this be plumbed in through the existing config field (moe_token_dispatcher_type: str)?
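A hedged sketch of the plumbing being asked for (the config dict shape is assumed from the surrounding snippet; the key name mirrors the existing field):

from types import SimpleNamespace

# Hypothetical: read the dispatcher type from megatron_cfg instead of hard
# coding it, keeping "alltoall" (the PR's current value) as the fallback.
megatron_cfg = {"moe_router_dtype": "fp64"}  # note: no dispatcher key set
model_cfg = SimpleNamespace()

model_cfg.moe_token_dispatcher_type = megatron_cfg.get(
    "moe_token_dispatcher_type", "alltoall"
)
print(model_cfg.moe_token_dispatcher_type)  # alltoall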

unified_memory_level = mcore_generation_config["unified_memory_level"]
model_config = self.model.config
# Enable CUDA graphs for inference
model_config.cuda_graph_impl = "local"

another place where it's set to "local" (vs "none"). is this potentially error prone, since we have many places where this needs to be set?
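One hedged way to make this less error prone (the helper name is hypothetical):

# Hypothetical helper so cuda_graph_impl flips in exactly one place.
def set_cuda_graph_impl(model_config, inference: bool) -> None:
    # single switch between inference ("local") and training ("none")
    model_config.cuda_graph_impl = "local" if inference else "none"

# at each call site:
#   set_cuda_graph_impl(self.model.config, inference=True)   # entering inference
#   set_cuda_graph_impl(self.model.config, inference=False)  # back to training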

self._inference_engine_paused = True
print(f"[Rank {self.rank}] paused inference engine")

async def pause_engine(self):

should this be protected? it doesn't appear to be something the user needs to be aware of

Suggested change:
- async def pause_engine(self):
+ async def _pause_engine(self):

self._inference_engine_paused = False
print(f"[Rank {self.rank}] Resumed inference engine")

async def resume_engine(self):

same comment as above re: making this protected


self._inference_engine_paused = False

def pause_inference_engine(self):

shall we use the nomenclature sleep/wake to match the other inference engines, at least in the nemo-rl API? i think it's okay that mcore inf calls it pause/resume, but i do think that could potentially be a little confusing if we ever do partial rollouts or in-flight weight updates, when an actual pause may be needed
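A hedged sketch of that adapter (the sleep/wake names follow the suggestion above; everything else is assumed, not the PR's actual code):

# Hypothetical: expose sleep/wake in the nemo-rl-facing API while the
# mcore-level methods keep their pause/resume names.
class PolicyWorkerAPI:
    def __init__(self):
        self._inference_engine_paused = True

    # mcore-facing names, unchanged
    def pause_inference_engine(self):
        self._inference_engine_paused = True

    def resume_inference_engine(self):
        self._inference_engine_paused = False

    # nemo-rl-facing names, matching the other inference engines
    def sleep(self):
        self.pause_inference_engine()

    def wake(self):
        self.resume_inference_engine()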

The dynamic inference engine for use during inference.
"""
# Get the language module (unwrap from precision wrappers if needed)
lang_module = (

nit: extract this into a _get_lang_module helper?
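A hedged sketch of such a helper (the .module unwrapping is an assumption about the precision wrappers, not a documented contract):

def _get_lang_module(model):
    # return the language module, unwrapping a precision wrapper if present;
    # assumes wrappers expose the inner model as `.module` (an assumption)
    return model.module if hasattr(model, "module") else model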

@yfw force-pushed the update-megatron-lm-5247a1f branch from 0c7d97e to e970643 on February 9, 2026 18:42