Conversation
❌ Submodule Fast-Forward Check Failed (reported for each of commits 2712979, 6d03b70, 82d1059, fde1997, cf59947, ea2ca52, 228c370, a0cb209, and 0c7d97e on PR #1787)

✅ Submodules that are properly updated:
- Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:
- Megatron-LM: ❌ Commits have DIVERGED from a common ancestor

Please ensure all submodule commits are fast-forwards of the main branch before merging.
Hi @Jeronymous. This PR still needs some work to fix some failures caused by the submodule bump. Have you seen our guide for RL of the Mamba hybrid MoE architecture (https://github.com/NVIDIA-NeMo/RL/blob/main/docs/guides/nemotron-3-nano.md)? That container and branch support seq-packing on the Mamba architecture.
Thanks a lot for spotting that @terrykong. I just don't understand the relation with MoE, as my model is not a mixture of experts (just a hybrid architecture with Mamba2 and transformer blocks), with this setting taken from the example sft.yaml: (I had to put
📝 Walkthrough

Updates Megatron-LM and Megatron-Bridge submodule references and dependencies, then integrates ProcessGroupCollection into the Megatron setup and policy worker to manage distributed process groups during training initialization and execution.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Setup as Megatron Setup
    participant Config as Configuration
    participant Model as Model Factory
    participant Worker as Policy Worker
    Setup->>Setup: Create ProcessGroupCollection
    Setup->>Config: Store pg_collection in megatron_cfg.model._pg_collection
    Setup->>Model: get_model(pg_collection=pg_collection)
    Model-->>Setup: Return initialized model
    Setup->>Worker: Initialize MegatronPolicyWorker
    Worker->>Config: Retrieve pg_collection via get_pg_collection()
    Worker->>Worker: Use pg_collection.mp for distributed aggregation<br/>(logical_and, reduce_max_stat operations)
```
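For orientation, here is a rough Python sketch of the flow in the diagram. It is not the PR's code: the import path for `ProcessGroupCollection`, the `get_model`/`get_pg_collection` signatures, and the helper names are assumptions based only on the identifiers mentioned in this review.

```python
import torch
import torch.distributed as dist

# Import paths are assumptions; only pg_utils.get_pg_collection is visible in this diff.
from megatron.core.process_groups_config import ProcessGroupCollection  # assumed location
from megatron.bridge.training.utils.pg_utils import get_pg_collection


def setup_model(megatron_cfg, get_model):
    """Setup lane: create the collection, stash it on the config, pass it to the factory."""
    pg_collection = ProcessGroupCollection.use_mpu_process_groups()
    megatron_cfg.model._pg_collection = pg_collection      # stored for later retrieval
    return get_model(pg_collection=pg_collection)           # factory signature assumed


def all_ranks_succeeded(megatron_cfg, local_ok: bool) -> bool:
    """Worker lane: a logical_and across the model-parallel group (pg_collection.mp)."""
    pg_collection = get_pg_collection(megatron_cfg.model)   # accessor argument assumed
    flag = torch.tensor([int(local_ok)], device=torch.cuda.current_device())
    dist.all_reduce(flag, op=dist.ReduceOp.MIN, group=pg_collection.mp)  # MIN of {0,1} == AND
    return bool(flag.item())
```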
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@3rdparty/Megatron-LM-workspace/setup.py`:
- Line 47: The setup currently lists "numpy" with no version constraint which
will allow numpy 2.x and cause runtime AttributeError due to use of np.product
in Megatron-LM; either pin numpy to a 1.x release by changing the dependency in
setup.py to "numpy<2.0.0" or instead apply the Megatron-LM fix that replaces all
uses of np.product() with np.prod() (search for np.product in the codebase and
update to np.prod) so the code is compatible with numpy 2.x; update setup.py or
the code references accordingly to resolve the incompatibility.
In `@nemo_rl/models/policy/workers/megatron_policy_worker.py`:
- Around line 30-51: Remove the unused Megatron imports listed in the top import
block: delete CheckpointConfig, ConfigContainer, DistributedDataParallelConfig,
LoggerConfig, OptimizerConfig, SchedulerConfig, TokenizerConfig, TrainingConfig,
initialize_megatron, set_jit_fusion_options, setup_optimizer,
_create_peft_pre_wrap_hook, _update_model_config_funcs, GlobalState, and
build_tokenizer so only get_pg_collection remains imported from
megatron.bridge.training.utils.pg_utils; update the import statement accordingly
and run pre-commit (ruff) to ensure there are no remaining unused-import errors.
🧹 Nitpick comments (1)
nemo_rl/models/megatron/setup.py (1)
735-746: Replace `setattr` with direct attribute assignment. Per static analysis (Ruff B010), using `setattr` with a constant attribute name is equivalent to direct assignment but less readable.

♻️ Suggested fix
```diff
 pg_collection = ProcessGroupCollection.use_mpu_process_groups()
-setattr(megatron_cfg.model, "_pg_collection", pg_collection)
+megatron_cfg.model._pg_collection = pg_collection
 model = get_model(
```
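To make the equivalence concrete, here is a tiny standalone illustration (plain Python, no Megatron types; `ModelCfg` is a made-up stand-in):

```python
class ModelCfg:
    """Hypothetical stand-in for megatron_cfg.model, just for this demo."""
    pass


cfg = ModelCfg()

# Flagged by Ruff B010: the attribute name is a constant string literal.
setattr(cfg, "_pg_collection", "dummy-collection")

# Equivalent direct assignment, which the suggested fix uses instead.
cfg._pg_collection = "dummy-collection"

assert cfg._pg_collection == "dummy-collection"
```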
```diff
 # Default dependencies from pyproject.toml
 "torch",
-"numpy<2.0.0",
+"numpy",
```
🧩 Analysis chain
🌐 Web query:
Megatron-LM numpy 2.0 compatibility
💡 Result:
Megatron-LM is not fully NumPy 2.x compatible out of the box (at least in some recent commits), because it still calls np.product(...), which was removed in NumPy 2.x—causing errors like AttributeError: module 'numpy' has no attribute 'product' during distributed checkpointing validation (megatron/core/dist_checkpointing/...). [1]
Fix / workaround options:
- Patch Megatron-LM: replace `np.product(...)` with `np.prod(...)` in the affected files (e.g., `exchange_utils.py`, `mapping.py`, `validation.py`). A PR exists proposing exactly this change. [2]
- Temporary shim (quick local workaround): add something like the following (this is also suggested in the bug report) [1]:

```python
import numpy as np
if not hasattr(np, "product"):
    np.product = np.prod
```

- Pin NumPy: use `numpy<2` until your Megatron-LM version includes the fix. [1]
References:
[1] NVIDIA/Megatron-LM Issue #1563 (“numpy has no attribute 'product'” with NumPy 2.2.5)
[2] NVIDIA/Megatron-LM PR #1440 (replace numpy.product with numpy.prod for NumPy ≥ 2.0)
🌐 Web query:
numpy 2.0 breaking changes downstream packages compatibility
💡 Result:
Key NumPy 2.0 breaking changes that affect downstream package compatibility:
- ABI break (biggest downstream impact): Any package with compiled extensions using the NumPy C-API (direct C/C++, Cython, f2py, etc.) that was built against NumPy 1.x will fail to import on NumPy 2.0 with a binary-incompatibility `ImportError`. Downstream projects generally need to rebuild/release wheels targeting NumPy 2.0. [1]
- "Build on 2.x, run on 1.x and 2.x" guidance: NumPy explicitly supports building binaries against NumPy 2.x such that they can run on both NumPy 2.0 and NumPy 1.x at runtime; the reverse is not true (build on 1.x → won't run on 2.0). NumPy recommends downstreams rebuild once the 2.0 ABI is stable (from `2.0.0rc1`). [1][2]
- C-API evolution / removals: NumPy 2.0 removes/changes older C-API pieces and hides more internals to enable future evolution; this is one reason extensions must be updated/recompiled. [1][3]
- Python API cleanup (expired/removed/moved symbols): NumPy 2.0 includes a clearer public/private API split and removes a number of non-recommended aliases/functions, so some pure-Python downstream code may break and needs updates. The goal is that once updated for 2.0, code should typically also work on 1.2x without version-branching. [1][4]
- Behavior change: dtype promotion rules (NEP 50): Mixed-dtype operations may yield different result dtypes/precision than in NumPy 1.x, which can break downstream tests or numerical assumptions even if code still runs. [1][5]
- Migration tooling: The NumPy project provides a NumPy-2.0 ruff rule (`NPY201`) to automatically fix many common Python-level changes in downstream code. [5]
Practical takeaway: if you maintain a library with compiled NumPy usage, you almost certainly need a new wheel build/release for NumPy 2.x; if you’re pure Python, you mainly need to address removed/moved APIs and promotion-related behavior/test changes. [1][2][5]
Sources: NumPy 2.0 release notes [1], downstream author guidance [2], NEP 53 (C-API evolution) [3], NEP 52 (Python API cleanup goals/compat) [4], NumPy 2.0 migration guide [5].
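One commonly cited NEP 50 example, as a small sketch (the exact warning text varies between NumPy versions):

```python
import numpy as np

x = np.float32(1.0)
y = x + 1e300
# NumPy 1.x (value-based promotion): the huge Python float pushes the result to
#   float64, so y == 1e+300 with dtype float64.
# NumPy 2.x (NEP 50): the Python float is a "weak" type, the result stays float32,
#   1e300 overflows to inf, and a RuntimeWarning about overflow is emitted.
print(y.dtype, y)
```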
🌐 Web query:
transformer-engine nvidia-modelopt numpy 2.0 support
💡 Result:
- Transformer Engine: The current PyPI release is `transformer-engine 2.11.0` (released 2026-01-02). Its published Python package metadata doesn't list NumPy as a runtime requirement, so Transformer Engine itself is generally not the thing that "needs NumPy 2.0 support" (NumPy compatibility issues usually come from other compiled extensions in the same environment). [1]
- NVIDIA ModelOpt: The current PyPI release is `nvidia-modelopt 0.41.0` (released 2026-01-20) (Python >=3.10,<3.13). NVIDIA doesn't clearly document a NumPy-2 pin or guarantee on the PyPI page itself, so the safest assumption is: use the latest ModelOpt if you want the best chance of NumPy 2.x working. [2]
If you hit the classic error/warning:
“A module that was compiled using NumPy 1.x cannot be run in NumPy 2.0.0 …”
that indicates some compiled dependency in your env was built against NumPy 1.x; typical workarounds are (a) upgrade the affected module to a wheel built for NumPy 2, or (b) downgrade NumPy to <2 in that environment. [3]
Sources: [1] [2] [3]
Revert numpy constraint or fix Megatron-LM numpy 2.x incompatibility.
Megatron-LM is not fully numpy 2.x compatible. The distributed checkpointing code uses np.product(), which was removed in numpy 2.0 and will cause AttributeError at runtime (see NVIDIA/Megatron-LM Issue #1563). Either pin numpy<2.0.0 to restore the original constraint, or apply the fix from PR #1440 that replaces np.product() with np.prod() throughout the codebase. Relaxing this constraint without addressing the incompatibility will introduce runtime failures.
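For reference, the `np.product` → `np.prod` change itself is mechanical; a minimal illustration (not the actual Megatron-LM code):

```python
import numpy as np

shape = (4, 8, 16)

# Old spelling used in Megatron-LM's dist_checkpointing code; removed in NumPy 2.0,
# where it raises AttributeError:
#   numel = np.product(shape)

# Spelling that works on both NumPy 1.x and 2.x:
numel = np.prod(shape)
assert numel == 4 * 8 * 16
```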
🤖 Prompt for AI Agents
In `@3rdparty/Megatron-LM-workspace/setup.py` at line 47, The setup currently
lists "numpy" with no version constraint which will allow numpy 2.x and cause
runtime AttributeError due to use of np.product in Megatron-LM; either pin numpy
to a 1.x release by changing the dependency in setup.py to "numpy<2.0.0" or
instead apply the Megatron-LM fix that replaces all uses of np.product() with
np.prod() (search for np.product in the codebase and update to np.prod) so the
code is compatible with numpy 2.x; update setup.py or the code references
accordingly to resolve the incompatibility.
```python
from megatron.bridge.training.config import (
    CheckpointConfig,
    ConfigContainer,
    DistributedDataParallelConfig,
    LoggerConfig,
    OptimizerConfig,
    SchedulerConfig,
    TokenizerConfig,
    TrainingConfig,
)
from megatron.bridge.training.initialize import (
    initialize_megatron,
    set_jit_fusion_options,
)
from megatron.bridge.training.optim import setup_optimizer
from megatron.bridge.training.setup import (
    _create_peft_pre_wrap_hook,
    _update_model_config_funcs,
)
from megatron.bridge.training.state import GlobalState
from megatron.bridge.training.tokenizers.tokenizer import build_tokenizer
from megatron.bridge.training.utils.pg_utils import get_pg_collection
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Check which of the new imports are actually used in megatron_policy_worker.py
echo "Checking usage of imports from lines 30-51..."
# Get the file and check for usage of imported names
file="nemo_rl/models/policy/workers/megatron_policy_worker.py"
# List of imported names to check
imports=(
"CheckpointConfig"
"ConfigContainer"
"DistributedDataParallelConfig"
"LoggerConfig"
"OptimizerConfig"
"SchedulerConfig"
"TokenizerConfig"
"TrainingConfig"
"initialize_megatron"
"set_jit_fusion_options"
"setup_optimizer"
"_create_peft_pre_wrap_hook"
"_update_model_config_funcs"
"GlobalState"
"build_tokenizer"
)
# For each import, check usage (excluding import statements)
echo "=== Usage Summary ==="
for name in "${imports[@]}"; do
count=$(rg "\b${name}\b" "$file" | grep -v "^[0-9]*:from \|^[0-9]*:import " | wc -l)
if [ "$count" -eq 0 ]; then
echo "$name: NOT USED"
else
echo "$name: USED ($count occurrences)"
rg -n "\b${name}\b" "$file" | grep -v "^[0-9]*:from \|^[0-9]*:import " | head -2
fi
done
# Also check if get_pg_collection is used (it was mentioned in the review)
echo ""
echo "=== Checking get_pg_collection (mentioned in review) ==="
rg -n "get_pg_collection" "$file"Repository: NVIDIA-NeMo/RL
Length of output: 1264
🏁 Script executed:
```bash
#!/bin/bash
file="nemo_rl/models/policy/workers/megatron_policy_worker.py"
echo "=== Checking actual usage (excluding import block lines 30-51) ==="
imports=(
"CheckpointConfig"
"ConfigContainer"
"DistributedDataParallelConfig"
"LoggerConfig"
"OptimizerConfig"
"SchedulerConfig"
"TokenizerConfig"
"TrainingConfig"
"initialize_megatron"
"set_jit_fusion_options"
"setup_optimizer"
"_create_peft_pre_wrap_hook"
"_update_model_config_funcs"
"GlobalState"
"build_tokenizer"
)
for name in "${imports[@]}"; do
# Use sed to exclude lines 30-51 and search for usage
count=$(sed '30,51d' "$file" | grep -c "\b${name}\b")
if [ "$count" -eq 0 ]; then
echo "$name: NOT USED OUTSIDE IMPORTS"
else
echo "$name: USED ($count occurrences)"
# Show where it's used
sed -n '30,51!p' "$file" | grep -n "\b${name}\b" | head -2
fi
done
echo ""
echo "=== File metadata ==="
wc -l "$file"Repository: NVIDIA-NeMo/RL
Length of output: 878
🏁 Script executed:
```bash
#!/bin/bash
file="nemo_rl/models/policy/workers/megatron_policy_worker.py"
echo "=== Checking for __all__ definition ==="
grep -n "__all__" "$file"
echo ""
echo "=== Checking file purpose (first 60 lines) ==="
head -60 "$file" | tail -30Repository: NVIDIA-NeMo/RL
Length of output: 1226
🏁 Script executed:
```bash
#!/bin/bash
file="nemo_rl/models/policy/workers/megatron_policy_worker.py"
echo "=== First 30 lines (copyright, imports start) ==="
head -30 "$file"
echo ""
echo "=== Class definition ==="
grep -n "^class " "$file" | head -5Repository: NVIDIA-NeMo/RL
Length of output: 1260
Remove unused imports from megatron.bridge.training modules.
The following imports (lines 30-50) are not used anywhere in this file and must be removed: CheckpointConfig, ConfigContainer, DistributedDataParallelConfig, LoggerConfig, OptimizerConfig, SchedulerConfig, TokenizerConfig, TrainingConfig, initialize_megatron, set_jit_fusion_options, setup_optimizer, _create_peft_pre_wrap_hook, _update_model_config_funcs, GlobalState, and build_tokenizer. Only get_pg_collection (line 51) is actually used in this file.
After cleanup, ensure `pre-commit run --all-files` passes with no ruff errors.
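If the suggestion is applied, the import block shown above would presumably shrink to a single line (a sketch, assuming nothing else in the file starts using the removed names):

```python
# Only symbol from that block the worker actually uses.
from megatron.bridge.training.utils.pg_utils import get_pg_collection
```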
🤖 Prompt for AI Agents
In `@nemo_rl/models/policy/workers/megatron_policy_worker.py` around lines 30 -
51, Remove the unused Megatron imports listed in the top import block: delete
CheckpointConfig, ConfigContainer, DistributedDataParallelConfig, LoggerConfig,
OptimizerConfig, SchedulerConfig, TokenizerConfig, TrainingConfig,
initialize_megatron, set_jit_fusion_options, setup_optimizer,
_create_peft_pre_wrap_hook, _update_model_config_funcs, GlobalState, and
build_tokenizer so only get_pg_collection remains imported from
megatron.bridge.training.utils.pg_utils; update the import statement accordingly
and run pre-commit (ruff) to ensure there are no remaining unused-import errors.
That branch has some fixes to the mamba package, so it would probably help the non-MoE mamba models as well. @yfw to comment on the load balancing
@Jeronymous This is an oversight in the
#1623) Signed-off-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Sherif Fawzy <sfawzy@cw-dfw-cs-001-login-02.cm.cluster> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Wenwen Gao <94138584+snowmanwwg@users.noreply.github.com> Signed-off-by: Terry Kong <terryk@nvidia.com> Co-authored-by: Wenwen Gao <94138584+snowmanwwg@users.noreply.github.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Rayen <ruit@nvidia.com> Co-authored-by: Rayen <ruit@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
…_CONF (#1806) Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Parth Chadha <pchadha@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Ryan <yzr1914001753@gmail.com> Signed-off-by: Zhuoran Yin <yzr1914001753@gmail.com> Signed-off-by: Night <32424487+PrinsYin@users.noreply.github.com> Signed-off-by: RolaoDenthu <xinyis10@illinois.edu> Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Xinyi Song <86638975+RolaoDenthu@users.noreply.github.com> Co-authored-by: Zhuoran Yin <yzr1914001753@gmail.com> Co-authored-by: Night <32424487+PrinsYin@users.noreply.github.com> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> Co-authored-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com> Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: ashors1 <ashors@nvidia.com> Signed-off-by: Anna Shors <ashors@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
…dge (main) Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
force-pushed from 0c7d97e to e970643
Closing in favor of #1902
Summary by CodeRabbit