Skip to content

Conversation

@yfw
Copy link
Contributor

@yfw yfw commented Feb 9, 2026

What does this PR do ?

Sqashed #1787. Bumps mcore to a branch close to main.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Chores
    • Updated Megatron-LM submodule repository reference and branch
    • Updated dependency versions for transformers, transformer-engine, nvidia-modelopt, and additional packages
    • Added new dependencies: accelerate, datasets, and fastapi
    • Removed or adjusted version constraints on numpy, nvidia-resiliency-ext, and setuptools for improved compatibility

@yfw yfw requested review from a team as code owners February 9, 2026 20:00
@github-actions
Copy link

github-actions bot commented Feb 9, 2026

❌ Submodule Fast-Forward Check Failed

Check based on commit: 5e64e0c (PR #1902 from yifu/mcore_bump_20260209)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/yaoyu-33/Megatron-LM/commits/b73ae5cdab9d409fcface2b2f3c375710abe6911/
CURRENT (PR #1902 from yifu/mcore_bump_20260209): https://github.com/yaoyu-33/Megatron-LM/commits/193463c4f8414e6906a40dd527a450bca50706b1/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@yfw yfw force-pushed the yifu/mcore_bump_20260209 branch from 5e64e0c to 3ad8f21 Compare February 9, 2026 20:02
@yfw yfw changed the title chore: MCore bump + related updates chore: bump mcore to main + bump mbridge Feb 9, 2026
@yfw yfw changed the title chore: bump mcore to main + bump mbridge chore: bump mcore and mbridge Feb 9, 2026
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 9, 2026

📝 Walkthrough

Walkthrough

Updates Megatron-LM submodule source and branch, refreshes submodule pointers for Megatron-Bridge and Megatron-LM workspaces, adjusts dependency versions across setup.py files, and integrates ProcessGroupCollection for managing model-parallel process groups in Megatron model initialization and policy worker training operations.

Changes

Cohort / File(s) Summary
Submodule Configuration
.gitmodules, 3rdparty/Megatron-Bridge-workspace/Megatron-Bridge, 3rdparty/Megatron-LM-workspace/Megatron-LM
Updated Megatron-LM repository URL and branch; refreshed submodule commit pointers for both Megatron-Bridge and Megatron-LM workspaces to track updated versions.
Dependency Updates
3rdparty/Megatron-Bridge-workspace/setup.py, 3rdparty/Megatron-LM-workspace/setup.py
Modified dependency constraints: transformers restricted to <5.0.0, transformer-engine bumped to >=2.10.0a0,<2.12.0, added accelerate; removed setuptools upper bound, replaced pinned versions for nvidia-modelopt and nvidia-resiliency-ext with unpinned variants, added flash-linear-attention~=0.3.2, flashinfer-python~=0.5.0, datasets, and fastapi~=0.50.
ProcessGroupCollection Integration
nemo_rl/models/megatron/setup.py, nemo_rl/models/policy/workers/megatron_policy_worker.py
Introduced ProcessGroupCollection initialization and propagation through model construction paths; integrated pg_collection parameter into get_model calls and threaded mp_group from pg_collection into model-parallel distributed reduction operations.

Sequence Diagram

sequenceDiagram
    participant Setup as Model Setup Flow
    participant PGC as ProcessGroupCollection
    participant MegatronModel as Megatron Model
    participant PolicyWorker as Policy Worker

    Setup->>PGC: ProcessGroupCollection.use_mpu_process_groups()
    PGC-->>Setup: pg_collection instance
    Setup->>Setup: Store in megatron_cfg.model._pg_collection
    Setup->>MegatronModel: get_model(..., pg_collection)
    MegatronModel-->>Setup: model instance
    PolicyWorker->>PolicyWorker: get_pg_collection(model)
    PolicyWorker->>PolicyWorker: Use mp_group=pg_collection.mp for distributed ops
    PolicyWorker->>MegatronModel: logical_and/reduce_max with mp_group
    MegatronModel-->>PolicyWorker: aggregated results
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

CI:L1, r0.5.0

Suggested reviewers

  • terrykong
🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Results For Major Changes ⚠️ Warning PR contains major dependency and feature changes but lacks test results, convergence validation, and has unchecked testing items plus identified runtime bugs. Add comprehensive testing with documented results, performance metrics, fix runtime bugs (undefined get_pg_collection), and verify end-to-end training without regressions.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'chore: bump mcore and mbridge' is directly related to the main changes in the pull request, which involve updating Megatron-LM (mcore) and Megatron-Bridge (mbridge) submodules and their dependencies.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch yifu/mcore_bump_20260209

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link

github-actions bot commented Feb 9, 2026

❌ Submodule Fast-Forward Check Failed

Check based on commit: 3ad8f21 (PR #1902 from yifu/mcore_bump_20260209)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/yaoyu-33/Megatron-LM/commits/b73ae5cdab9d409fcface2b2f3c375710abe6911/
CURRENT (PR #1902 from yifu/mcore_bump_20260209): https://github.com/yaoyu-33/Megatron-LM/commits/193463c4f8414e6906a40dd527a450bca50706b1/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @.gitmodules:
- Around line 3-4: The .gitmodules entry currently points the Megatron-LM
submodule URL "https://github.com/yaoyu-33/Megatron-LM.git" (branch = main) to a
personal fork; change that URL to the official org repository (e.g.,
"https://github.com/NVIDIA/Megatron-LM.git" or
"https://github.com/NVIDIA-NeMo/Megatron-LM.git") and keep or update the branch
as appropriate, then run submodule sync and update locally to confirm it tracks
the official repo; ensure the updated URL replaces the existing "url" value for
the Megatron-LM submodule in .gitmodules and commit the change.

In `@3rdparty/Megatron-Bridge-workspace/setup.py`:
- Line 29: The dependency constraint "transformers<5.0.0" removes the minimum
version guarantee; update the requirement in setup.py by replacing the existing
"transformers<5.0.0" entry with a bounded range such as
"transformers>=4.57.1,<5.0.0" so pip will enforce a safe minimum version while
still blocking major v5 releases.

In `@nemo_rl/models/megatron/setup.py`:
- Around line 735-736: The code uses setattr(megatron_cfg.model,
"_pg_collection", pg_collection) which is unnecessary reflection; directly
assign the attribute instead by setting megatron_cfg.model._pg_collection =
pg_collection so replace the setattr call with a direct attribute assignment
referencing pg_collection and megatron_cfg.model and the "_pg_collection"
attribute.

In `@nemo_rl/models/policy/workers/megatron_policy_worker.py`:
- Line 419: Replace the undefined call to get_pg_collection in
megatron_policy_worker.py (which will crash train()) by reading the stored
collection off the model config: retrieve pg_collection from
self.model.config._pg_collection (or defensively via getattr(self.model.config,
"_pg_collection")) and use that variable instead of get_pg_collection; also
update the file copyright header year from 2025 to 2026.
🧹 Nitpick comments (2)
3rdparty/Megatron-Bridge-workspace/setup.py (1)

31-31: accelerate dependency has no version constraint.

The newly added accelerate dependency is unpinned. Consider adding at least a minimum version to ensure compatibility.

nemo_rl/models/megatron/setup.py (1)

879-879: Consider reusing the pg_collection already stored on megatron_cfg.model.

setup_reference_model_state creates a new ProcessGroupCollection.use_mpu_process_groups() instance, but megatron_cfg.model._pg_collection was already set in setup_model_and_optimizer. Since this function receives the same megatron_cfg, you could reuse megatron_cfg.model._pg_collection instead of creating a redundant instance.

Similarly, line 933 in finalize_megatron_setup creates yet another instance.

@yfw yfw force-pushed the yifu/mcore_bump_20260209 branch from 3ad8f21 to 7d37192 Compare February 10, 2026 00:19
@github-actions
Copy link

❌ Submodule Fast-Forward Check Failed

Check based on commit: 7d37192 (PR #1902 from yifu/mcore_bump_20260209)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/yaoyu-33/Megatron-LM/commits/b73ae5cdab9d409fcface2b2f3c375710abe6911/
CURRENT (PR #1902 from yifu/mcore_bump_20260209): https://github.com/yaoyu-33/Megatron-LM/commits/193463c4f8414e6906a40dd527a450bca50706b1/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

yfw added 3 commits February 9, 2026 16:43
Signed-off-by: Yi-Fu Wu <[email protected]>
Signed-off-by: Yi-Fu Wu <[email protected]>
@yfw yfw requested a review from a team as a code owner February 10, 2026 01:53
@github-actions
Copy link

❌ Submodule Fast-Forward Check Failed

Check based on commit: 4891923 (PR #1902 from yifu/mcore_bump_20260209)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/yaoyu-33/Megatron-LM/commits/b73ae5cdab9d409fcface2b2f3c375710abe6911/
CURRENT (PR #1902 from yifu/mcore_bump_20260209): https://github.com/yaoyu-33/Megatron-LM/commits/193463c4f8414e6906a40dd527a450bca50706b1/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@yfw yfw added the super-v3 label Feb 10, 2026
@yfw yfw self-assigned this Feb 10, 2026
@yfw yfw added the CI:L1 Run doctests, unit tests, and functional tests label Feb 10, 2026
Signed-off-by: Yi-Fu Wu <[email protected]>
@github-actions
Copy link

❌ Submodule Fast-Forward Check Failed

Check based on commit: af1d4b6 (PR #1902 from yifu/mcore_bump_20260209)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/yaoyu-33/Megatron-LM/commits/b73ae5cdab9d409fcface2b2f3c375710abe6911/
CURRENT (PR #1902 from yifu/mcore_bump_20260209): https://github.com/yaoyu-33/Megatron-LM/commits/193463c4f8414e6906a40dd527a450bca50706b1/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@yfw yfw added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 11, 2026
@yfw yfw added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 12, 2026
@github-actions
Copy link

❌ Submodule Fast-Forward Check Failed

Check based on commit: 7ad9e52 (PR #1902 from yifu/mcore_bump_20260209)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/yaoyu-33/Megatron-LM/commits/b73ae5cdab9d409fcface2b2f3c375710abe6911/
CURRENT (PR #1902 from yifu/mcore_bump_20260209): https://github.com/yaoyu-33/Megatron-LM/commits/193463c4f8414e6906a40dd527a450bca50706b1/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@yfw yfw requested a review from a team as a code owner February 12, 2026 10:10
@github-actions
Copy link

❌ Submodule Fast-Forward Check Failed

Check based on commit: e0f58f9 (PR #1902 from yifu/mcore_bump_20260209)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/yaoyu-33/Megatron-LM/commits/b73ae5cdab9d409fcface2b2f3c375710abe6911/
CURRENT (PR #1902 from yifu/mcore_bump_20260209): https://github.com/yaoyu-33/Megatron-LM/commits/193463c4f8414e6906a40dd527a450bca50706b1/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Shanmugam Ramasamy and others added 2 commits February 12, 2026 02:38
@yfw yfw force-pushed the yifu/mcore_bump_20260209 branch from e0f58f9 to 9dceaf4 Compare February 12, 2026 10:38
@yfw yfw added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 12, 2026
@github-actions
Copy link

❌ Submodule Fast-Forward Check Failed

Check based on commit: 9dceaf4 (PR #1902 from yifu/mcore_bump_20260209)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/yaoyu-33/Megatron-LM/commits/b73ae5cdab9d409fcface2b2f3c375710abe6911/
CURRENT (PR #1902 from yifu/mcore_bump_20260209): https://github.com/yaoyu-33/Megatron-LM/commits/193463c4f8414e6906a40dd527a450bca50706b1/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@github-actions
Copy link

❌ Submodule Fast-Forward Check Failed

Check based on commit: 03088ab (PR #1902 from yifu/mcore_bump_20260209)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/yaoyu-33/Megatron-LM/commits/b73ae5cdab9d409fcface2b2f3c375710abe6911/
CURRENT (PR #1902 from yifu/mcore_bump_20260209): https://github.com/yaoyu-33/Megatron-LM/commits/193463c4f8414e6906a40dd527a450bca50706b1/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@yfw yfw added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 12, 2026
@github-actions
Copy link

❌ Submodule Fast-Forward Check Failed

Check based on commit: 1c9a1f4 (PR #1902 from yifu/mcore_bump_20260209)

✅ Submodules that are properly updated:

Megatron-Bridge: ✅ PR branch is ahead of main branch (fast-forward)

❌ Submodules that need attention:

Megatron-LM: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/yaoyu-33/Megatron-LM/commits/b73ae5cdab9d409fcface2b2f3c375710abe6911/
CURRENT (PR #1902 from yifu/mcore_bump_20260209): https://github.com/yaoyu-33/Megatron-LM/commits/193463c4f8414e6906a40dd527a450bca50706b1/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L1 Run doctests, unit tests, and functional tests super-v3

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant