
chore: Update Megatron submodule pins#1787

Closed
yaoyu-33 wants to merge 25 commits into main from update-megatron-lm-5247a1f

Conversation

@yaoyu-33
Contributor

@yaoyu-33 yaoyu-33 commented Jan 16, 2026

Summary by CodeRabbit

  • Chores
    • Updated dependencies: narrowed transformers to version <5.0.0, broadened package requirements, adjusted transformer-engine version range to >=2.10.0a0,<2.12.0, added accelerate and other packages
    • Updated submodule references to latest versions
  • Refactor
    • Restructured distributed model setup and training processes to utilize process group collection configuration

@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 2712979 (PR #1787 from update-megatron-lm-5247a1f)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/yaoyu-33/Megatron-LM/commits/b73ae5cdab9d409fcface2b2f3c375710abe6911/
CURRENT (PR #1787 from update-megatron-lm-5247a1f): https://github.com/yaoyu-33/Megatron-LM/commits/5247a1f4600c258c8f5d40c3bd934f210dbf3698/

Please ensure all submodule commits are fast-forwards of the main branch before merging.
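
The CI script that performs this check is not shown here. As a rough local approximation (the function name and subprocess wrapper below are illustrative, not the CI's actual implementation), the fast-forward condition can be verified with git merge-base:

import subprocess

def is_fast_forward(repo_path: str, ancestor: str, descendant: str) -> bool:
    """Return True if `descendant` contains `ancestor`, i.e. moving the pin from ancestor to descendant is a fast-forward."""
    # `git merge-base --is-ancestor A B` exits 0 when A is an ancestor of B.
    result = subprocess.run(
        ["git", "-C", repo_path, "merge-base", "--is-ancestor", ancestor, descendant],
        check=False,
    )
    return result.returncode == 0

# For the Megatron-LM submodule reported above, the TARGET commit pinned on main
# should be an ancestor of the CURRENT commit pinned by this branch, e.g.:
# is_fast_forward("3rdparty/Megatron-LM-workspace/Megatron-LM",
#                 "b73ae5cdab9d409fcface2b2f3c375710abe6911",
#                 "5247a1f4600c258c8f5d40c3bd934f210dbf3698")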

@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 6d03b70 (PR #1787 from update-megatron-lm-5247a1f)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/yaoyu-33/Megatron-LM/commits/b73ae5cdab9d409fcface2b2f3c375710abe6911/
CURRENT (PR #1787 from update-megatron-lm-5247a1f): https://github.com/yaoyu-33/Megatron-LM/commits/11dcbaca317133cc5c77c8bc4f54ed71d3b5d656/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@yaoyu-33 yaoyu-33 changed the title from "Update Megatron submodule pins" to "[Draft] Update Megatron submodule pins" on Jan 16, 2026
@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 82d1059 (PR #1787 from update-megatron-lm-5247a1f)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/yaoyu-33/Megatron-LM/commits/b73ae5cdab9d409fcface2b2f3c375710abe6911/
CURRENT (PR #1787 from update-megatron-lm-5247a1f): https://github.com/yaoyu-33/Megatron-LM/commits/11dcbaca317133cc5c77c8bc4f54ed71d3b5d656/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: fde1997 (PR #1787 from update-megatron-lm-5247a1f)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/yaoyu-33/Megatron-LM/commits/b73ae5cdab9d409fcface2b2f3c375710abe6911/
CURRENT (PR #1787 from update-megatron-lm-5247a1f): https://github.com/yaoyu-33/Megatron-LM/commits/11dcbaca317133cc5c77c8bc4f54ed71d3b5d656/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: cf59947 (PR #1787 from update-megatron-lm-5247a1f)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/yaoyu-33/Megatron-LM/commits/b73ae5cdab9d409fcface2b2f3c375710abe6911/
CURRENT (PR #1787 from update-megatron-lm-5247a1f): https://github.com/yaoyu-33/Megatron-LM/commits/11dcbaca317133cc5c77c8bc4f54ed71d3b5d656/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: ea2ca52 (PR #1787 from update-megatron-lm-5247a1f)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/yaoyu-33/Megatron-LM/commits/b73ae5cdab9d409fcface2b2f3c375710abe6911/
CURRENT (PR #1787 from update-megatron-lm-5247a1f): https://github.com/yaoyu-33/Megatron-LM/commits/193463c4f8414e6906a40dd527a450bca50706b1/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 228c370 (PR #1787 from update-megatron-lm-5247a1f)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/yaoyu-33/Megatron-LM/commits/b73ae5cdab9d409fcface2b2f3c375710abe6911/
CURRENT (PR #1787 from update-megatron-lm-5247a1f): https://github.com/yaoyu-33/Megatron-LM/commits/193463c4f8414e6906a40dd527a450bca50706b1/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: a0cb209 (PR #1787 from update-megatron-lm-5247a1f)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/yaoyu-33/Megatron-LM/commits/b73ae5cdab9d409fcface2b2f3c375710abe6911/
CURRENT (PR #1787 from update-megatron-lm-5247a1f): https://github.com/yaoyu-33/Megatron-LM/commits/193463c4f8414e6906a40dd527a450bca50706b1/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 0c7d97e (PR #1787 from update-megatron-lm-5247a1f)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/yaoyu-33/Megatron-LM/commits/b73ae5cdab9d409fcface2b2f3c375710abe6911/
CURRENT (PR #1787 from update-megatron-lm-5247a1f): https://github.com/yaoyu-33/Megatron-LM/commits/193463c4f8414e6906a40dd527a450bca50706b1/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@Jeronymous

@yaoyu-33 @ashors1 @yfw In order to finetune a Mamba architecture with sequence packing, we would need an update of the megatron core submodule.
So this PR looks very interesting :)
What is the status? Is it ready to be tested or are there known issues?

@ashors1 ashors1 mentioned this pull request Jan 29, 2026
@terrykong
Contributor

@yaoyu-33 @ashors1 @yfw In order to finetune a Mamba architecture with sequence packing, we would need an update of the megatron core submodule. So this PR looks very interesting :) What is the status? Is it ready to be tested or are there known issues?

Hi @Jeronymous. This PR still needs some work to fix some failures caused by the version bump. Have you seen our guide for RL of the mamba hybrid MoE architecture (https://github.com/NVIDIA-NeMo/RL/blob/main/docs/guides/nemotron-3-nano.md)? This container and branch support seq-packing on the mamba arch.

@Jeronymous

Thanks a lot for spotting that @terrykong
Indeed, finetuning (with seq packing) of the mamba hybrid architecture seems to work on the nano-v3 branch :)

I just don't understand the relation with MoE, as my model is not a mixture of experts (just a hybrid architecture with Mamba2 and transformer blocks).
Actually, I first had an error

AssertionError: MoE aux loss is currently not supported due to a known bug in Megatron-LM. See https://github.com/NVIDIA/Megatron-LM/issues/1984 for more details.

with this setting taken from the example sft.yaml:

    moe_router_load_balancing_type: "aux_loss"

(I had to put "none" instead of "aux_loss" so that it does not fail)

@yfw yfw changed the title from "[Draft] Update Megatron submodule pins" to "chore: Update Megatron submodule pins" on Feb 4, 2026
@yfw yfw added the CI:L1 (Run doctests, unit tests, and functional tests) label Feb 4, 2026
@yfw yfw marked this pull request as ready for review February 4, 2026 01:32
@yfw yfw requested review from a team as code owners February 4, 2026 01:32
@coderabbitai
Contributor

coderabbitai bot commented Feb 4, 2026

📝 Walkthrough


Updates Megatron-LM and Megatron-Bridge submodule references and dependencies, then integrates ProcessGroupCollection into the Megatron setup and policy worker to manage distributed process groups during training initialization and execution.

Changes

  • Submodule Updates (.gitmodules, 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge, 3rdparty/Megatron-LM-workspace/Megatron-LM): Updated the Megatron-LM submodule URL and branch; updated the Megatron-Bridge-workspace and Megatron-LM-workspace submodule pointers to newer commits.
  • Dependency Configuration (3rdparty/Megatron-Bridge-workspace/setup.py, 3rdparty/Megatron-LM-workspace/setup.py): Loosened and adjusted version constraints for transformers, transformer-engine, numpy, nvidia-modelopt, and other dependencies; added the accelerate, datasets, and fastapi packages; removed the setuptools upper bound.
  • ProcessGroupCollection Integration (nemo_rl/models/megatron/setup.py, nemo_rl/models/policy/workers/megatron_policy_worker.py): Introduced ProcessGroupCollection creation and plumbing through the setup flow; updated get_model and _update_model_config_funcs to accept a pg_collection parameter; configured the policy worker to use the process group collection for distributed aggregation operations. A minimal sketch of this flow follows the sequence diagram below.

Sequence Diagram(s)

sequenceDiagram
    participant Setup as Megatron Setup
    participant Config as Configuration
    participant Model as Model Factory
    participant Worker as Policy Worker

    Setup->>Setup: Create ProcessGroupCollection
    Setup->>Config: Store pg_collection in megatron_cfg.model._pg_collection
    Setup->>Model: get_model(pg_collection=pg_collection)
    Model-->>Setup: Return initialized model
    Setup->>Worker: Initialize MegatronPolicyWorker
    Worker->>Config: Retrieve pg_collection via get_pg_collection()
    Worker->>Worker: Use pg_collection.mp for distributed aggregation<br/>(logical_and, reduce_max_stat operations)
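
A minimal Python sketch of the flow in this diagram, using only names that appear in the walkthrough (ProcessGroupCollection.use_mpu_process_groups, the _pg_collection attribute, and the pg_collection argument to get_model). The import path for ProcessGroupCollection is an assumption; the real signatures live in the Megatron-LM/Megatron-Bridge and NeMo-RL sources.

# Sketch only: the import path below is assumed, not taken from this PR's diff.
from megatron.core.process_groups_config import ProcessGroupCollection

def setup_model_with_pg_collection(megatron_cfg, get_model):
    # Build a process-group collection from the already-initialized MPU groups.
    pg_collection = ProcessGroupCollection.use_mpu_process_groups()

    # Stash it on the model config so the policy worker can later retrieve it
    # (the walkthrough shows retrieval via get_pg_collection()).
    megatron_cfg.model._pg_collection = pg_collection

    # Thread it through model construction, matching the updated get_model signature.
    model = get_model(pg_collection=pg_collection)
    return model, pg_collection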

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

mcore

Suggested reviewers

  • terrykong
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 75.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Test Results For Major Changes ⚠️ Warning: The PR contains major submodule and dependency changes but lacks test results documentation; review comments identify an unresolved critical numpy 2.x incompatibility and other failures. Resolution: address the numpy 2.x incompatibility, remove unused imports, run comprehensive tests, document the results, and verify all reported issues are resolved before merging.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: The title 'chore: Update Megatron submodule pins' accurately describes the primary change of updating pinned Megatron-related submodule references across the codebase.







@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@3rdparty/Megatron-LM-workspace/setup.py`:
- Line 47: The setup currently lists "numpy" with no version constraint which
will allow numpy 2.x and cause runtime AttributeError due to use of np.product
in Megatron-LM; either pin numpy to a 1.x release by changing the dependency in
setup.py to "numpy<2.0.0" or instead apply the Megatron-LM fix that replaces all
uses of np.product() with np.prod() (search for np.product in the codebase and
update to np.prod) so the code is compatible with numpy 2.x; update setup.py or
the code references accordingly to resolve the incompatibility.

In `@nemo_rl/models/policy/workers/megatron_policy_worker.py`:
- Around line 30-51: Remove the unused Megatron imports listed in the top import
block: delete CheckpointConfig, ConfigContainer, DistributedDataParallelConfig,
LoggerConfig, OptimizerConfig, SchedulerConfig, TokenizerConfig, TrainingConfig,
initialize_megatron, set_jit_fusion_options, setup_optimizer,
_create_peft_pre_wrap_hook, _update_model_config_funcs, GlobalState, and
build_tokenizer so only get_pg_collection remains imported from
megatron.bridge.training.utils.pg_utils; update the import statement accordingly
and run pre-commit (ruff) to ensure there are no remaining unused-import errors.
🧹 Nitpick comments (1)
nemo_rl/models/megatron/setup.py (1)

735-746: Replace setattr with direct attribute assignment.

Per static analysis (Ruff B010), using setattr with a constant attribute name is equivalent to direct assignment but less readable.

♻️ Suggested fix
     pg_collection = ProcessGroupCollection.use_mpu_process_groups()
-    setattr(megatron_cfg.model, "_pg_collection", pg_collection)
+    megatron_cfg.model._pg_collection = pg_collection
     model = get_model(

Diff hunk from 3rdparty/Megatron-LM-workspace/setup.py:

 # Default dependencies from pyproject.toml
 "torch",
-"numpy<2.0.0",
+"numpy",

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Megatron-LM numpy 2.0 compatibility

💡 Result:

Megatron-LM is not fully NumPy 2.x compatible out of the box (at least in some recent commits), because it still calls np.product(...), which was removed in NumPy 2.x—causing errors like AttributeError: module 'numpy' has no attribute 'product' during distributed checkpointing validation (megatron/core/dist_checkpointing/...). [1]

Fix / workaround options:

  • Patch Megatron-LM: replace np.product(...) with np.prod(...) in the affected files (e.g., exchange_utils.py, mapping.py, validation.py). A PR exists proposing exactly this change. [2]
  • Temporary shim (quick local workaround): add something like:
    import numpy as np
    if not hasattr(np, "product"):
        np.product = np.prod
    (This is also suggested in the bug report.) [1]
  • Pin NumPy: use numpy<2 until your Megatron-LM version includes the fix. [1]

References:
[1] NVIDIA/Megatron-LM Issue #1563 (“numpy has no attribute 'product'” with NumPy 2.2.5)
[2] NVIDIA/Megatron-LM PR #1440 (replace numpy.product with numpy.prod for NumPy ≥ 2.0)


🌐 Web query:

numpy 2.0 breaking changes downstream packages compatibility

💡 Result:

Key NumPy 2.0 breaking changes that affect downstream package compatibility:

  • ABI break (biggest downstream impact): Any package with compiled extensions using the NumPy C-API (direct C/C++, Cython, f2py, etc.) that was built against NumPy 1.x will fail to import on NumPy 2.0 with a binary-incompatibility ImportError. Downstream projects generally need to rebuild/release wheels targeting NumPy 2.0. [1]
  • “Build on 2.x, run on 1.x and 2.x” guidance: NumPy explicitly supports building binaries against NumPy 2.x such that they can run on both NumPy 2.0 and NumPy 1.x at runtime; the reverse is not true (build on 1.x → won’t run on 2.0). NumPy recommends downstreams rebuild once the 2.0 ABI is stable (from 2.0.0rc1). [1][2]
  • C-API evolution / removals: NumPy 2.0 removes/changes older C-API pieces and hides more internals to enable future evolution; this is one reason extensions must be updated/recompiled. [1][3]
  • Python API cleanup (expired/removed/moved symbols): NumPy 2.0 includes a clearer public/private API split and removes a number of non-recommended aliases/functions, so some pure-Python downstream code may break and needs updates. The goal is that once updated for 2.0, code should typically also work on 1.2x without version-branching. [1][4]
  • Behavior change: dtype promotion rules (NEP 50): Mixed-dtype operations may yield different result dtypes/precision than in NumPy 1.x, which can break downstream tests or numerical assumptions even if code still runs. [1][5]
  • Migration tooling: The NumPy project provides a NumPy-2.0 ruff rule (NPY201) to automatically fix many common Python-level changes in downstream code. [5]

Practical takeaway: if you maintain a library with compiled NumPy usage, you almost certainly need a new wheel build/release for NumPy 2.x; if you’re pure Python, you mainly need to address removed/moved APIs and promotion-related behavior/test changes. [1][2][5]

Sources: NumPy 2.0 release notes [1], downstream author guidance [2], NEP 53 (C-API evolution) [3], NEP 52 (Python API cleanup goals/compat) [4], NumPy 2.0 migration guide [5].


🌐 Web query:

transformer-engine nvidia-modelopt numpy 2.0 support

💡 Result:

  • Transformer Engine: The current PyPI release is transformer-engine 2.11.0 (released 2026-01-02). Its published Python package metadata doesn’t list NumPy as a runtime requirement, so Transformer Engine itself is generally not the thing that “needs NumPy 2.0 support” (NumPy compatibility issues usually come from other compiled extensions in the same environment). [1]
  • NVIDIA ModelOpt: The current PyPI release is nvidia-modelopt 0.41.0 (released 2026-01-20) (Python >=3.10,<3.13). NVIDIA doesn’t clearly document a NumPy-2 pin or guarantee on the PyPI page itself, so the safest assumption is: use the latest ModelOpt if you want the best chance of NumPy 2.x working. [2]

If you hit the classic error/warning:

“A module that was compiled using NumPy 1.x cannot be run in NumPy 2.0.0 …”

that indicates some compiled dependency in your env was built against NumPy 1.x; typical workarounds are (a) upgrade the affected module to a wheel built for NumPy 2, or (b) downgrade NumPy to <2 in that environment. [3]

Sources: [1] [2] [3]


Revert numpy constraint or fix Megatron-LM numpy 2.x incompatibility.

Megatron-LM is not fully numpy 2.x compatible. The distributed checkpointing code uses np.product(), which was removed in numpy 2.0 and will cause AttributeError at runtime (see NVIDIA/Megatron-LM Issue #1563). Either pin numpy<2.0.0 to restore the original constraint, or apply the fix from PR #1440 that replaces np.product() with np.prod() throughout the codebase. Relaxing this constraint without addressing the incompatibility will introduce runtime failures.
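
For reference, a minimal sketch of the runtime shim quoted from the upstream issue (the packaging alternative is simply restoring "numpy<2.0.0" in setup.py); this is a workaround, not the np.product-to-np.prod fix proposed in PR #1440:

import numpy as np

# Workaround quoted from NVIDIA/Megatron-LM issue #1563: numpy 2.x removed np.product,
# which older Megatron-LM dist-checkpointing code still calls. Aliasing it to np.prod
# keeps that code working until the upstream replacement lands.
if not hasattr(np, "product"):
    np.product = np.prod

# Sanity check: both spellings now compute the same result on numpy 1.x and 2.x.
assert np.product([2, 3, 4]) == np.prod([2, 3, 4]) == 24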

🤖 Prompt for AI Agents
In `@3rdparty/Megatron-LM-workspace/setup.py` at line 47, The setup currently
lists "numpy" with no version constraint which will allow numpy 2.x and cause
runtime AttributeError due to use of np.product in Megatron-LM; either pin numpy
to a 1.x release by changing the dependency in setup.py to "numpy<2.0.0" or
instead apply the Megatron-LM fix that replaces all uses of np.product() with
np.prod() (search for np.product in the codebase and update to np.prod) so the
code is compatible with numpy 2.x; update setup.py or the code references
accordingly to resolve the incompatibility.

Comment on lines 30 to 51
from megatron.bridge.training.config import (
CheckpointConfig,
ConfigContainer,
DistributedDataParallelConfig,
LoggerConfig,
OptimizerConfig,
SchedulerConfig,
TokenizerConfig,
TrainingConfig,
)
from megatron.bridge.training.initialize import (
initialize_megatron,
set_jit_fusion_options,
)
from megatron.bridge.training.optim import setup_optimizer
from megatron.bridge.training.setup import (
_create_peft_pre_wrap_hook,
_update_model_config_funcs,
)
from megatron.bridge.training.state import GlobalState
from megatron.bridge.training.tokenizers.tokenizer import build_tokenizer
from megatron.bridge.training.utils.pg_utils import get_pg_collection

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check which of the new imports are actually used in megatron_policy_worker.py
echo "Checking usage of imports from lines 30-51..."

# Get the file and check for usage of imported names
file="nemo_rl/models/policy/workers/megatron_policy_worker.py"

# List of imported names to check
imports=(
    "CheckpointConfig"
    "ConfigContainer"
    "DistributedDataParallelConfig"
    "LoggerConfig"
    "OptimizerConfig"
    "SchedulerConfig"
    "TokenizerConfig"
    "TrainingConfig"
    "initialize_megatron"
    "set_jit_fusion_options"
    "setup_optimizer"
    "_create_peft_pre_wrap_hook"
    "_update_model_config_funcs"
    "GlobalState"
    "build_tokenizer"
)

# For each import, check usage (excluding import statements)
echo "=== Usage Summary ==="
for name in "${imports[@]}"; do
    count=$(rg "\b${name}\b" "$file" | grep -v "^[0-9]*:from \|^[0-9]*:import " | wc -l)
    if [ "$count" -eq 0 ]; then
        echo "$name: NOT USED"
    else
        echo "$name: USED ($count occurrences)"
        rg -n "\b${name}\b" "$file" | grep -v "^[0-9]*:from \|^[0-9]*:import " | head -2
    fi
done

# Also check if get_pg_collection is used (it was mentioned in the review)
echo ""
echo "=== Checking get_pg_collection (mentioned in review) ==="
rg -n "get_pg_collection" "$file"

Repository: NVIDIA-NeMo/RL

Length of output: 1264


🏁 Script executed:

#!/bin/bash
file="nemo_rl/models/policy/workers/megatron_policy_worker.py"

echo "=== Checking actual usage (excluding import block lines 30-51) ==="
imports=(
    "CheckpointConfig"
    "ConfigContainer"
    "DistributedDataParallelConfig"
    "LoggerConfig"
    "OptimizerConfig"
    "SchedulerConfig"
    "TokenizerConfig"
    "TrainingConfig"
    "initialize_megatron"
    "set_jit_fusion_options"
    "setup_optimizer"
    "_create_peft_pre_wrap_hook"
    "_update_model_config_funcs"
    "GlobalState"
    "build_tokenizer"
)

for name in "${imports[@]}"; do
    # Use sed to exclude lines 30-51 and search for usage
    count=$(sed '30,51d' "$file" | grep -c "\b${name}\b")
    if [ "$count" -eq 0 ]; then
        echo "$name: NOT USED OUTSIDE IMPORTS"
    else
        echo "$name: USED ($count occurrences)"
        # Show where it's used
        sed -n '30,51!p' "$file" | grep -n "\b${name}\b" | head -2
    fi
done

echo ""
echo "=== File metadata ==="
wc -l "$file"

Repository: NVIDIA-NeMo/RL

Length of output: 878


🏁 Script executed:

#!/bin/bash
file="nemo_rl/models/policy/workers/megatron_policy_worker.py"

echo "=== Checking for __all__ definition ==="
grep -n "__all__" "$file"

echo ""
echo "=== Checking file purpose (first 60 lines) ==="
head -60 "$file" | tail -30

Repository: NVIDIA-NeMo/RL

Length of output: 1226


🏁 Script executed:

#!/bin/bash
file="nemo_rl/models/policy/workers/megatron_policy_worker.py"

echo "=== First 30 lines (copyright, imports start) ==="
head -30 "$file"

echo ""
echo "=== Class definition ==="
grep -n "^class " "$file" | head -5

Repository: NVIDIA-NeMo/RL

Length of output: 1260


Remove unused imports from megatron.bridge.training modules.

The following imports (lines 30-50) are not used anywhere in this file and must be removed: CheckpointConfig, ConfigContainer, DistributedDataParallelConfig, LoggerConfig, OptimizerConfig, SchedulerConfig, TokenizerConfig, TrainingConfig, initialize_megatron, set_jit_fusion_options, setup_optimizer, _create_peft_pre_wrap_hook, _update_model_config_funcs, GlobalState, and build_tokenizer. Only get_pg_collection (line 51) is actually used in this file.

After cleanup, ensure pre-commit run --all-files passes with no ruff errors.
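
If the suggestion is applied, the surviving import would reduce to a single line (a sketch, assuming the bot's usage scan is accurate and no other name from these modules is referenced):

# Only name from the megatron.bridge.training.* import block that this module still uses.
from megatron.bridge.training.utils.pg_utils import get_pg_collection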

🤖 Prompt for AI Agents
In `@nemo_rl/models/policy/workers/megatron_policy_worker.py` around lines 30 -
51, Remove the unused Megatron imports listed in the top import block: delete
CheckpointConfig, ConfigContainer, DistributedDataParallelConfig, LoggerConfig,
OptimizerConfig, SchedulerConfig, TokenizerConfig, TrainingConfig,
initialize_megatron, set_jit_fusion_options, setup_optimizer,
_create_peft_pre_wrap_hook, _update_model_config_funcs, GlobalState, and
build_tokenizer so only get_pg_collection remains imported from
megatron.bridge.training.utils.pg_utils; update the import statement accordingly
and run pre-commit (ruff) to ensure there are no remaining unused-import errors.

@terrykong
Contributor

Thanks a lot for spotting that @terrykong. Indeed, finetuning (with seq packing) of the mamba hybrid architecture seems to work on the nano-v3 branch :)

I just don't understand the relation with MoE, as my model is not a mixture of experts (just a hybrid architecture with Mamba2 and transformer blocks). Actually, I first had an error

AssertionError: MoE aux loss is currently not supported due to a known bug in Megatron-LM. See https://github.com/NVIDIA/Megatron-LM/issues/1984 for more details.

with this setting taken from the example sft.yaml:

    moe_router_load_balancing_type: "aux_loss"

(I had to put "none" instead of "aux_loss" so that it does not fail)

That branch has some fixes to the mamba package, so it would probably help the non-MoE mamba models as well. @yfw to comment on the load balancing

@terrykong terrykong added the dependencies (Pull requests that update a dependency file) label Feb 4, 2026
@yfw
Contributor

yfw commented Feb 4, 2026

@Jeronymous This is an oversight in the sft.yaml config. We should set that to null while we still have this assertion in place. We have a WIP to fix this properly (ad2a1d3), but if you are not using an MoE model you can safely set that to null.

@yfw yfw added and then removed the CI:L1 (Run doctests, unit tests, and functional tests) label Feb 5, 2026
sfawzy-nv and others added 15 commits February 9, 2026 18:07
#1623)

Signed-off-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: Sherif Fawzy <sfawzy@cw-dfw-cs-001-login-02.cm.cluster>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Wenwen Gao <94138584+snowmanwwg@users.noreply.github.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Wenwen Gao <94138584+snowmanwwg@users.noreply.github.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Rayen <ruit@nvidia.com>
Co-authored-by: Rayen <ruit@nvidia.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
…_CONF (#1806)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Ryan <yzr1914001753@gmail.com>
Signed-off-by: Zhuoran Yin <yzr1914001753@gmail.com>
Signed-off-by: Night <32424487+PrinsYin@users.noreply.github.com>
Signed-off-by: RolaoDenthu <xinyis10@illinois.edu>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Xinyi Song <86638975+RolaoDenthu@users.noreply.github.com>
Co-authored-by: Zhuoran Yin <yzr1914001753@gmail.com>
Co-authored-by: Night <32424487+PrinsYin@users.noreply.github.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Anna Shors <ashors@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
…dge (main)

Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
@yfw yfw force-pushed the update-megatron-lm-5247a1f branch from 0c7d97e to e970643 on February 9, 2026 18:42
@yfw yfw requested review from a team as code owners February 9, 2026 18:42
@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label Feb 9, 2026
@yfw yfw mentioned this pull request Feb 9, 2026
@yfw
Contributor

yfw commented Feb 10, 2026

Closing in favor of #1902

@yfw yfw closed this Feb 10, 2026

Labels

  • CI:L1: Run doctests, unit tests, and functional tests
  • dependencies: Pull requests that update a dependency file
  • documentation: Improvements or additions to documentation
  • super-v3
