
Conversation


@mingmao-inflection commented on Jan 30, 2026

What does this PR do?

Removes the MoE aux loss validation assertion that blocked usage; it was added to work around a known bug in Megatron-LM. The upstream issue has been resolved (NVIDIA/Megatron-LM#1984), so the assertion is no longer necessary.

Issues

N/A — This is a cleanup PR removing an outdated workaround.

Usage

No usage changes. MoE aux loss can now be used without triggering the assertion error.

# MoE aux loss is now supported
model_cfg.moe_router_load_balancing_type = "aux_loss"
model_cfg.moe_aux_loss_coeff = 0.01  # Previously would fail assertion

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

Summary by CodeRabbit

  • Bug Fixes
    • Removed overly restrictive validation that prevented certain Mixture of Experts (MoE) configuration combinations during model setup, allowing previously blocked auxiliary loss settings to be used.


Removed the assertion for MoE aux loss validation that worked around a known bug. The issue in Megatron-LM has already been fixed (NVIDIA/Megatron-LM#1984), so there is no need to keep the assertion.

Signed-off-by: Mingmao Sun <mingmao.sun@inflection.ai>
@mingmao-inflection requested a review from a team as a code owner on January 30, 2026 01:19
Contributor

coderabbitai bot commented Jan 30, 2026

📝 Walkthrough

A runtime assertion in the MoE training configuration validation function was removed from nemo_rl/models/megatron/setup.py. The assertion previously blocked non-zero auxiliary loss values for certain router load balancing type configurations. This validation check is no longer enforced during setup.

Changes

Cohort / File(s): MoE Validation Logic (nemo_rl/models/megatron/setup.py)
Summary: Removed the 9-line assertion in _validate_training_config that validated and rejected non-zero MoE auxiliary loss configurations for specific router load balancing types.
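
For context, the removed guard was roughly of the following shape. This is an illustrative sketch based on the summary above, not the exact deleted code: the config field names come from the usage example in the PR description, while the condition and message text are assumptions.

# Illustrative reconstruction of the removed guard in _validate_training_config
# (not the actual 9-line assertion; shown only to convey what was being blocked).
assert not (
    model_cfg.moe_router_load_balancing_type == "aux_loss"
    and model_cfg.moe_aux_loss_coeff != 0.0
), (
    "Non-zero MoE aux loss is not supported due to a known Megatron-LM bug; "
    "see NVIDIA/Megatron-LM#1984."
)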

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 passed | ❌ 1 failed
❌ Failed checks (1 warning)
  • Test Results For Major Changes: ⚠️ Warning. The PR removes the MoE aux loss assertion, but the test suite still expects it to be present, causing test failures. Resolution: update or remove test_moe_aux_loss_not_supported, add tests for non-zero MoE auxiliary loss configurations, and document test results (see the sketch after this checks list).
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping docstring coverage check.
  • Title check: ✅ Passed. The title accurately summarizes the main change, removing an obsolete MoE auxiliary loss validation assertion. It is concise, specific, and clearly describes the primary modification in the changeset.
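
As a companion to the failed check above, here is a minimal sketch of how test_moe_aux_loss_not_supported could be flipped to cover the now-allowed setting. The test name comes from the check output; the model_cfg fixture and the exact call signature of _validate_training_config are illustrative assumptions, not the repository's actual test setup.

# Hypothetical replacement for test_moe_aux_loss_not_supported; the real test
# suite may construct the Megatron config and invoke validation differently.
def test_moe_aux_loss_supported(model_cfg):
    model_cfg.moe_router_load_balancing_type = "aux_loss"
    model_cfg.moe_aux_loss_coeff = 0.01
    # With the assertion removed, validation should no longer raise.
    _validate_training_config(model_cfg)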


@mingmao-inflection changed the title from "Remove MoE aux loss validation assertion" to "fix: Remove obsolete MoE aux loss validation assertion" on Jan 30, 2026
@mingmao-inflection marked this pull request as draft on January 30, 2026 01:29
@mingmao-inflection marked this pull request as ready for review on January 30, 2026 01:29
Contributor

@terrykong left a comment


thanks for the contribution!

@yfw @gshennvm I recall this being fixed, but has anyone validated if that fix was sufficient?

@terrykong requested review from gshennvm and yfw on January 30, 2026 06:51
@gshennvm
Contributor

thanks for the contribution!

@yfw @gshennvm I recall this being fixed, but has anyone validated if that fix was sufficient?

I believe the fix needs to be more than disabling the check, since we also need to pass the padding mask in to mcore. I have a not-well-tested commit for it here: ad2a1d3



Development

Successfully merging this pull request may close these issues.

moe routing loss incorrect for post-training

3 participants