Conversation
❌ Submodule Fast-Forward Check Failed (reported for each of commits 2712979, 6d03b70, 82d1059, fde1997, cf59947, ea2ca52, 228c370, a0cb209, and 0c7d97e on PR #1787)

✅ Submodules that are properly updated:
- Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:
- Megatron-LM: ❌ Commits have DIVERGED from a common ancestor

Please ensure all submodule commits are fast-forwards of the main branch before merging.
Hi @Jeronymous. This PR still needs some work to fix some failures caused by the submodule bump. Have you seen our guide for RL of the Mamba hybrid MoE architecture (https://github.com/NVIDIA-NeMo/RL/blob/main/docs/guides/nemotron-3-nano.md)? That container and branch support seq-packing on the Mamba architecture.
Thanks a lot for spotting that @terrykong. I just don't understand the relation with MoE, as my model is not a mixture of experts (just a hybrid architecture with Mamba2 and transformer blocks), with this setting taken from the example sft.yaml: (I had to put
📝 Walkthrough

Updates Megatron-LM and Megatron-Bridge submodule references and dependencies, then integrates ProcessGroupCollection into the Megatron setup and policy worker to manage distributed process groups during training initialization and execution.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Setup as Megatron Setup
    participant Config as Configuration
    participant Model as Model Factory
    participant Worker as Policy Worker
    Setup->>Setup: Create ProcessGroupCollection
    Setup->>Config: Store pg_collection in megatron_cfg.model._pg_collection
    Setup->>Model: get_model(pg_collection=pg_collection)
    Model-->>Setup: Return initialized model
    Setup->>Worker: Initialize MegatronPolicyWorker
    Worker->>Config: Retrieve pg_collection via get_pg_collection()
    Worker->>Worker: Use pg_collection.mp for distributed aggregation<br/>(logical_and, reduce_max_stat operations)
```
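For orientation, here is a rough Python sketch of the flow in the diagram. It is not the PR's code: the import path for `ProcessGroupCollection`, the `get_model`/`get_pg_collection` signatures, and the helper names are assumptions based only on the identifiers mentioned in this review.

```python
import torch
import torch.distributed as dist

# Import paths are assumptions; only pg_utils.get_pg_collection is visible in this diff.
from megatron.core.process_groups_config import ProcessGroupCollection  # assumed location
from megatron.bridge.training.utils.pg_utils import get_pg_collection


def setup_model(megatron_cfg, get_model):
    """Setup lane: create the collection, stash it on the config, pass it to the factory."""
    pg_collection = ProcessGroupCollection.use_mpu_process_groups()
    megatron_cfg.model._pg_collection = pg_collection      # stored for later retrieval
    return get_model(pg_collection=pg_collection)           # factory signature assumed


def all_ranks_succeeded(megatron_cfg, local_ok: bool) -> bool:
    """Worker lane: a logical_and across the model-parallel group (pg_collection.mp)."""
    pg_collection = get_pg_collection(megatron_cfg.model)   # accessor argument assumed
    flag = torch.tensor([int(local_ok)], device=torch.cuda.current_device())
    dist.all_reduce(flag, op=dist.ReduceOp.MIN, group=pg_collection.mp)  # MIN of {0,1} == AND
    return bool(flag.item())
```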
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@3rdparty/Megatron-LM-workspace/setup.py`:
- Line 47: The setup currently lists "numpy" with no version constraint which
will allow numpy 2.x and cause runtime AttributeError due to use of np.product
in Megatron-LM; either pin numpy to a 1.x release by changing the dependency in
setup.py to "numpy<2.0.0" or instead apply the Megatron-LM fix that replaces all
uses of np.product() with np.prod() (search for np.product in the codebase and
update to np.prod) so the code is compatible with numpy 2.x; update setup.py or
the code references accordingly to resolve the incompatibility.
In `@nemo_rl/models/policy/workers/megatron_policy_worker.py`:
- Around line 30-51: Remove the unused Megatron imports listed in the top import
block: delete CheckpointConfig, ConfigContainer, DistributedDataParallelConfig,
LoggerConfig, OptimizerConfig, SchedulerConfig, TokenizerConfig, TrainingConfig,
initialize_megatron, set_jit_fusion_options, setup_optimizer,
_create_peft_pre_wrap_hook, _update_model_config_funcs, GlobalState, and
build_tokenizer so only get_pg_collection remains imported from
megatron.bridge.training.utils.pg_utils; update the import statement accordingly
and run pre-commit (ruff) to ensure there are no remaining unused-import errors.
🧹 Nitpick comments (1)
nemo_rl/models/megatron/setup.py (1)
735-746: Replace `setattr` with direct attribute assignment. Per static analysis (Ruff B010), using `setattr` with a constant attribute name is equivalent to direct assignment but less readable.

♻️ Suggested fix
```diff
 pg_collection = ProcessGroupCollection.use_mpu_process_groups()
-setattr(megatron_cfg.model, "_pg_collection", pg_collection)
+megatron_cfg.model._pg_collection = pg_collection
 model = get_model(
```
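To make the equivalence concrete, here is a tiny standalone illustration (plain Python, no Megatron types; `ModelCfg` is a made-up stand-in):

```python
class ModelCfg:
    """Hypothetical stand-in for megatron_cfg.model, just for this demo."""
    pass


cfg = ModelCfg()

# Flagged by Ruff B010: the attribute name is a constant string literal.
setattr(cfg, "_pg_collection", "dummy-collection")

# Equivalent direct assignment, which the suggested fix uses instead.
cfg._pg_collection = "dummy-collection"

assert cfg._pg_collection == "dummy-collection"
```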
```diff
 # Default dependencies from pyproject.toml
 "torch",
-"numpy<2.0.0",
+"numpy",
```
🧩 Analysis chain
🌐 Web query:
Megatron-LM numpy 2.0 compatibility
💡 Result:
Megatron-LM is not fully NumPy 2.x compatible out of the box (at least in some recent commits), because it still calls np.product(...), which was removed in NumPy 2.x—causing errors like AttributeError: module 'numpy' has no attribute 'product' during distributed checkpointing validation (megatron/core/dist_checkpointing/...). [1]
Fix / workaround options:
- Patch Megatron-LM: replace `np.product(...)` with `np.prod(...)` in the affected files (e.g., `exchange_utils.py`, `mapping.py`, `validation.py`). A PR exists proposing exactly this change. [2]
- Temporary shim (quick local workaround): add something like the following (this is also suggested in the bug report) [1]:

```python
import numpy as np
if not hasattr(np, "product"):
    np.product = np.prod
```

- Pin NumPy: use `numpy<2` until your Megatron-LM version includes the fix. [1]
References:
[1] NVIDIA/Megatron-LM Issue #1563 (“numpy has no attribute 'product'” with NumPy 2.2.5)
[2] NVIDIA/Megatron-LM PR #1440 (replace numpy.product with numpy.prod for NumPy ≥ 2.0)
🌐 Web query:
numpy 2.0 breaking changes downstream packages compatibility
💡 Result:
Key NumPy 2.0 breaking changes that affect downstream package compatibility:
- ABI break (biggest downstream impact): Any package with compiled extensions using the NumPy C-API (direct C/C++, Cython, f2py, etc.) that was built against NumPy 1.x will fail to import on NumPy 2.0 with a binary-incompatibility `ImportError`. Downstream projects generally need to rebuild/release wheels targeting NumPy 2.0. [1]
- "Build on 2.x, run on 1.x and 2.x" guidance: NumPy explicitly supports building binaries against NumPy 2.x such that they can run on both NumPy 2.0 and NumPy 1.x at runtime; the reverse is not true (build on 1.x → won't run on 2.0). NumPy recommends downstreams rebuild once the 2.0 ABI is stable (from `2.0.0rc1`). [1][2]
- C-API evolution / removals: NumPy 2.0 removes/changes older C-API pieces and hides more internals to enable future evolution; this is one reason extensions must be updated/recompiled. [1][3]
- Python API cleanup (expired/removed/moved symbols): NumPy 2.0 includes a clearer public/private API split and removes a number of non-recommended aliases/functions, so some pure-Python downstream code may break and needs updates. The goal is that once updated for 2.0, code should typically also work on 1.2x without version-branching. [1][4]
- Behavior change: dtype promotion rules (NEP 50): Mixed-dtype operations may yield different result dtypes/precision than in NumPy 1.x, which can break downstream tests or numerical assumptions even if code still runs. [1][5]
- Migration tooling: The NumPy project provides a NumPy-2.0 ruff rule (`NPY201`) to automatically fix many common Python-level changes in downstream code. [5]
Practical takeaway: if you maintain a library with compiled NumPy usage, you almost certainly need a new wheel build/release for NumPy 2.x; if you’re pure Python, you mainly need to address removed/moved APIs and promotion-related behavior/test changes. [1][2][5]
Sources: NumPy 2.0 release notes [1], downstream author guidance [2], NEP 53 (C-API evolution) [3], NEP 52 (Python API cleanup goals/compat) [4], NumPy 2.0 migration guide [5].
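One commonly cited NEP 50 example, as a small sketch (the exact warning text varies between NumPy versions):

```python
import numpy as np

x = np.float32(1.0)
y = x + 1e300
# NumPy 1.x (value-based promotion): the huge Python float pushes the result to
#   float64, so y == 1e+300 with dtype float64.
# NumPy 2.x (NEP 50): the Python float is a "weak" type, the result stays float32,
#   1e300 overflows to inf, and a RuntimeWarning about overflow is emitted.
print(y.dtype, y)
```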
🌐 Web query:
transformer-engine nvidia-modelopt numpy 2.0 support
💡 Result:
- Transformer Engine: The current PyPI release is `transformer-engine 2.11.0` (released 2026-01-02). Its published Python package metadata doesn't list NumPy as a runtime requirement, so Transformer Engine itself is generally not the thing that "needs NumPy 2.0 support" (NumPy compatibility issues usually come from other compiled extensions in the same environment). [1]
- NVIDIA ModelOpt: The current PyPI release is `nvidia-modelopt 0.41.0` (released 2026-01-20) (Python >=3.10,<3.13). NVIDIA doesn't clearly document a NumPy-2 pin or guarantee on the PyPI page itself, so the safest assumption is: use the latest ModelOpt if you want the best chance of NumPy 2.x working. [2]
If you hit the classic error/warning:
“A module that was compiled using NumPy 1.x cannot be run in NumPy 2.0.0 …”
that indicates some compiled dependency in your env was built against NumPy 1.x; typical workarounds are (a) upgrade the affected module to a wheel built for NumPy 2, or (b) downgrade NumPy to <2 in that environment. [3]
Sources: [1] [2] [3]
Revert numpy constraint or fix Megatron-LM numpy 2.x incompatibility.
Megatron-LM is not fully numpy 2.x compatible. The distributed checkpointing code uses np.product(), which was removed in numpy 2.0 and will cause AttributeError at runtime (see NVIDIA/Megatron-LM Issue #1563). Either pin numpy<2.0.0 to restore the original constraint, or apply the fix from PR #1440 that replaces np.product() with np.prod() throughout the codebase. Relaxing this constraint without addressing the incompatibility will introduce runtime failures.
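For reference, the `np.product` → `np.prod` change itself is mechanical; a minimal illustration (not the actual Megatron-LM code):

```python
import numpy as np

shape = (4, 8, 16)

# Old spelling used in Megatron-LM's dist_checkpointing code; removed in NumPy 2.0,
# where it raises AttributeError:
#   numel = np.product(shape)

# Spelling that works on both NumPy 1.x and 2.x:
numel = np.prod(shape)
assert numel == 4 * 8 * 16
```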
🤖 Prompt for AI Agents
In `@3rdparty/Megatron-LM-workspace/setup.py` at line 47, The setup currently
lists "numpy" with no version constraint which will allow numpy 2.x and cause
runtime AttributeError due to use of np.product in Megatron-LM; either pin numpy
to a 1.x release by changing the dependency in setup.py to "numpy<2.0.0" or
instead apply the Megatron-LM fix that replaces all uses of np.product() with
np.prod() (search for np.product in the codebase and update to np.prod) so the
code is compatible with numpy 2.x; update setup.py or the code references
accordingly to resolve the incompatibility.
```python
from megatron.bridge.training.config import (
    CheckpointConfig,
    ConfigContainer,
    DistributedDataParallelConfig,
    LoggerConfig,
    OptimizerConfig,
    SchedulerConfig,
    TokenizerConfig,
    TrainingConfig,
)
from megatron.bridge.training.initialize import (
    initialize_megatron,
    set_jit_fusion_options,
)
from megatron.bridge.training.optim import setup_optimizer
from megatron.bridge.training.setup import (
    _create_peft_pre_wrap_hook,
    _update_model_config_funcs,
)
from megatron.bridge.training.state import GlobalState
from megatron.bridge.training.tokenizers.tokenizer import build_tokenizer
from megatron.bridge.training.utils.pg_utils import get_pg_collection
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Check which of the new imports are actually used in megatron_policy_worker.py
echo "Checking usage of imports from lines 30-51..."
# Get the file and check for usage of imported names
file="nemo_rl/models/policy/workers/megatron_policy_worker.py"
# List of imported names to check
imports=(
"CheckpointConfig"
"ConfigContainer"
"DistributedDataParallelConfig"
"LoggerConfig"
"OptimizerConfig"
"SchedulerConfig"
"TokenizerConfig"
"TrainingConfig"
"initialize_megatron"
"set_jit_fusion_options"
"setup_optimizer"
"_create_peft_pre_wrap_hook"
"_update_model_config_funcs"
"GlobalState"
"build_tokenizer"
)
# For each import, check usage (excluding import statements)
echo "=== Usage Summary ==="
for name in "${imports[@]}"; do
count=$(rg "\b${name}\b" "$file" | grep -v "^[0-9]*:from \|^[0-9]*:import " | wc -l)
if [ "$count" -eq 0 ]; then
echo "$name: NOT USED"
else
echo "$name: USED ($count occurrences)"
rg -n "\b${name}\b" "$file" | grep -v "^[0-9]*:from \|^[0-9]*:import " | head -2
fi
done
# Also check if get_pg_collection is used (it was mentioned in the review)
echo ""
echo "=== Checking get_pg_collection (mentioned in review) ==="
rg -n "get_pg_collection" "$file"Repository: NVIDIA-NeMo/RL
Length of output: 1264
🏁 Script executed:
```bash
#!/bin/bash
file="nemo_rl/models/policy/workers/megatron_policy_worker.py"
echo "=== Checking actual usage (excluding import block lines 30-51) ==="
imports=(
"CheckpointConfig"
"ConfigContainer"
"DistributedDataParallelConfig"
"LoggerConfig"
"OptimizerConfig"
"SchedulerConfig"
"TokenizerConfig"
"TrainingConfig"
"initialize_megatron"
"set_jit_fusion_options"
"setup_optimizer"
"_create_peft_pre_wrap_hook"
"_update_model_config_funcs"
"GlobalState"
"build_tokenizer"
)
for name in "${imports[@]}"; do
# Use sed to exclude lines 30-51 and search for usage
count=$(sed '30,51d' "$file" | grep -c "\b${name}\b")
if [ "$count" -eq 0 ]; then
echo "$name: NOT USED OUTSIDE IMPORTS"
else
echo "$name: USED ($count occurrences)"
# Show where it's used
sed -n '30,51!p' "$file" | grep -n "\b${name}\b" | head -2
fi
done
echo ""
echo "=== File metadata ==="
wc -l "$file"Repository: NVIDIA-NeMo/RL
Length of output: 878
🏁 Script executed:
```bash
#!/bin/bash
file="nemo_rl/models/policy/workers/megatron_policy_worker.py"
echo "=== Checking for __all__ definition ==="
grep -n "__all__" "$file"
echo ""
echo "=== Checking file purpose (first 60 lines) ==="
head -60 "$file" | tail -30Repository: NVIDIA-NeMo/RL
Length of output: 1226
🏁 Script executed:
```bash
#!/bin/bash
file="nemo_rl/models/policy/workers/megatron_policy_worker.py"
echo "=== First 30 lines (copyright, imports start) ==="
head -30 "$file"
echo ""
echo "=== Class definition ==="
grep -n "^class " "$file" | head -5Repository: NVIDIA-NeMo/RL
Length of output: 1260
Remove unused imports from megatron.bridge.training modules.
The following imports (lines 30-50) are not used anywhere in this file and must be removed: CheckpointConfig, ConfigContainer, DistributedDataParallelConfig, LoggerConfig, OptimizerConfig, SchedulerConfig, TokenizerConfig, TrainingConfig, initialize_megatron, set_jit_fusion_options, setup_optimizer, _create_peft_pre_wrap_hook, _update_model_config_funcs, GlobalState, and build_tokenizer. Only get_pg_collection (line 51) is actually used in this file.
After cleanup, ensure `pre-commit run --all-files` passes with no ruff errors.
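If the suggestion is applied, the import block shown above would presumably shrink to a single line (a sketch, assuming nothing else in the file starts using the removed names):

```python
# Only symbol from that block the worker actually uses.
from megatron.bridge.training.utils.pg_utils import get_pg_collection
```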
🤖 Prompt for AI Agents
In `@nemo_rl/models/policy/workers/megatron_policy_worker.py` around lines 30 -
51, Remove the unused Megatron imports listed in the top import block: delete
CheckpointConfig, ConfigContainer, DistributedDataParallelConfig, LoggerConfig,
OptimizerConfig, SchedulerConfig, TokenizerConfig, TrainingConfig,
initialize_megatron, set_jit_fusion_options, setup_optimizer,
_create_peft_pre_wrap_hook, _update_model_config_funcs, GlobalState, and
build_tokenizer so only get_pg_collection remains imported from
megatron.bridge.training.utils.pg_utils; update the import statement accordingly
and run pre-commit (ruff) to ensure there are no remaining unused-import errors.
That branch has some fixes to the mamba package, so it would probably help the non-MoE mamba models as well. @yfw to comment on the load balancing
@Jeronymous This is an oversight in the
#1623) Signed-off-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Sherif Hosam Fouad Fawzy <sfawzy@login-eos01.eos.clusters.nvidia.com> Co-authored-by: Sherif Fawzy <sfawzy@cw-dfw-cs-001-login-02.cm.cluster> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com> Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Wenwen Gao <94138584+snowmanwwg@users.noreply.github.com> Signed-off-by: Terry Kong <terryk@nvidia.com> Co-authored-by: Wenwen Gao <94138584+snowmanwwg@users.noreply.github.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Rayen <ruit@nvidia.com> Co-authored-by: Rayen <ruit@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
…_CONF (#1806) Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Parth Chadha <pchadha@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Ryan <yzr1914001753@gmail.com> Signed-off-by: Zhuoran Yin <yzr1914001753@gmail.com> Signed-off-by: Night <32424487+PrinsYin@users.noreply.github.com> Signed-off-by: RolaoDenthu <xinyis10@illinois.edu> Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Xinyi Song <86638975+RolaoDenthu@users.noreply.github.com> Co-authored-by: Zhuoran Yin <yzr1914001753@gmail.com> Co-authored-by: Night <32424487+PrinsYin@users.noreply.github.com> Co-authored-by: Terry Kong <terrycurtiskong@gmail.com> Co-authored-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com> Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: ashors1 <ashors@nvidia.com> Signed-off-by: Anna Shors <ashors@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com> Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
…dge (main) Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
Signed-off-by: Yi-Fu Wu <yifu.wu@gmail.com>
force-pushed from 0c7d97e to e970643
Closing in favor of #1902
Summary by CodeRabbit