Unify bf16 gb300 qwen3 235b mapping by dingqingy-nv · Pull Request #2670 · NVIDIA-NeMo/Megatron-Bridge

dingqingy-nv · 2026-03-05T22:50:54Z

What does this PR do?

Align the BF16 V2 config for Qwen3 235B A22B on GB300 with the MXFP8/FP8 config so that all precisions use the same parallelism strategy, which avoids the NaN gradient issue.
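A quick way to confirm that several precision variants really share one parallelism strategy is to compare only the parallelism-related fields. The sketch below is a minimal illustration under stated assumptions: the `Mapping` dataclass, the field names in `PARALLELISM_FIELDS`, and every value are hypothetical and are not taken from `qwen3_workload_base_configs.py`.

```python
from dataclasses import dataclass

# Hypothetical field names; the real config may use different ones.
PARALLELISM_FIELDS = ("tensor_parallel", "pipeline_parallel", "expert_parallel")


@dataclass(frozen=True)
class Mapping:
    """Hypothetical stand-in for one precision variant of a workload config."""
    tensor_parallel: int
    pipeline_parallel: int
    expert_parallel: int
    precision: str


def parallelism_of(cfg: Mapping) -> tuple:
    """Return only the parallelism-related settings of a config."""
    return tuple(getattr(cfg, name) for name in PARALLELISM_FIELDS)


def mismatched(configs: dict, reference: str) -> list:
    """Names of configs whose parallelism differs from the reference config."""
    expected = parallelism_of(configs[reference])
    return [name for name, cfg in configs.items() if parallelism_of(cfg) != expected]


# Illustrative values only.
aligned = {
    "bf16": Mapping(1, 4, 32, "bf16"),
    "fp8_cs": Mapping(1, 4, 32, "fp8_cs"),
    "fp8_mx": Mapping(1, 4, 32, "fp8_mx"),
}
drifted = dict(aligned, bf16=Mapping(1, 8, 32, "bf16"))

print(mismatched(aligned, reference="fp8_cs"))
print(mismatched(drifted, reference="fp8_cs"))
```

A check of this kind only shows that the mappings agree; it does not by itself demonstrate that the NaN gradient issue is resolved.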

Summary by CodeRabbit

Chores
- Updated Qwen3 performance testing configurations to streamline configuration management and reduce redundancy.

Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>

copy-pr-bot · 2026-03-05T22:50:57Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-03-05T22:57:04Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b9a99bf9-1048-4869-8122-6184e77f8dbf

📥 Commits

Reviewing files that changed from the base of the PR and between b730ab6 and ea8a94d.

📒 Files selected for processing (1)

scripts/performance/configs/qwen/qwen3_workload_base_configs.py

📝 Walkthrough

This change simplifies a configuration file by replacing an explicit detailed definition of QWEN3_235B_A22B_PRETRAIN_CONFIG_GB300_FP8_CS_V2 with a direct alias to QWEN3_235B_A22B_PRETRAIN_CONFIG_GB300_BF16_V2, reducing the definition from 10 lines to 1 line and making FP8_CS_V2 equivalent to BF16_V2.

Changes

Cohort / File(s)	Summary
Configuration Alias `scripts/performance/configs/qwen/qwen3_workload_base_configs.py`	Replaced explicit replace(...) definition of `QWEN3_235B_A22B_PRETRAIN_CONFIG_GB300_FP8_CS_V2` with an alias to `QWEN3_235B_A22B_PRETRAIN_CONFIG_GB300_BF16_V2`, affecting downstream configurations that inherit from FP8_CS_V2 (e.g., FP8_MX_V2).
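To illustrate the difference between an explicit `replace(...)` definition and a plain alias, here is a minimal sketch. The `WorkloadConfig` dataclass, its fields, the shortened variable names, and all values are hypothetical stand-ins rather than the actual contents of `qwen3_workload_base_configs.py`; only the contrast between `replace(...)` and direct assignment mirrors the change described above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WorkloadConfig:
    """Hypothetical stand-in for a workload base config."""
    tensor_parallel: int = 1
    pipeline_parallel: int = 1
    expert_parallel: int = 1
    micro_batch_size: int = 1


# Hypothetical values, chosen only for illustration.
BF16_V2 = WorkloadConfig(tensor_parallel=1, pipeline_parallel=4, expert_parallel=32)

# Explicit form: replace() builds a separate object whose fields can drift
# away from the config it was derived from.
FP8_CS_V2_EXPLICIT = replace(BF16_V2, pipeline_parallel=8, micro_batch_size=2)

# Alias form: both names refer to the very same object.
FP8_CS_V2 = BF16_V2

# A config derived from the alias (a hypothetical derivation) picks up the
# BF16 settings for every field it does not override.
FP8_MX_V2 = replace(FP8_CS_V2, micro_batch_size=2)

print(FP8_CS_V2 is BF16_V2)
print(FP8_CS_V2_EXPLICIT == BF16_V2)
print(FP8_MX_V2.pipeline_parallel == BF16_V2.pipeline_parallel)
```

With an alias, any later edit to the source definition propagates to every name that points at it, which is why downstream configs are affected by a change like this one.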

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

- Update Qwen3 235B A22B MXFP8 GB200/300 recipe and resolve NaN grad norm (#2209): Directly modifies the same configuration symbol QWEN3_235B_A22B_PRETRAIN_CONFIG_GB300_FP8_CS_V2.
- cp: Update Qwen3 235B A22B MXFP8 GB200/300 recipe and resolve NaN grad norm (2209) into r0.3.0 (#2210): Also modifies the same public config QWEN3_235B_A22B_PRETRAIN_CONFIG_GB300_FP8_CS_V2 with field-level changes.
- Revert Qwen3 235B GB300 MXFP8 large scale mapping (#2338): Affected by this change, since FP8_MX_V2 now inherits the aliased BF16 settings instead of separate FP8_CS configuration parameters.

Suggested reviewers

ko3n1g
malay-nagda

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Test Results For Major Changes	⚠️ Warning	PR addresses a critical NaN gradient issue but provides no test results, validation data, or performance benchmarks to demonstrate the fix or confirm no regressions.	Add test results demonstrating NaN gradient resolution, validation showing no numerical regression, and performance benchmarks comparing aligned configurations.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately reflects the main change: unifying BF16 GB300 Qwen3 235B mapping by converting FP8_CS_V2 into an alias of BF16_V2.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

unify bf16 gb300 qwen3 235b mapping

ea8a94d

Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>

dingqingy-nv requested review from ko3n1g and malay-nagda March 5, 2026 22:50

dingqingy-nv added the performance, performance/release (Performance items related with NeMo release), and r0.3.0 (Cherry-pick label for r0.3.0 release branch) labels on Mar 5, 2026

yaoyu-33 approved these changes Mar 6, 2026

dingqingy-nv wants to merge 1 commit into NVIDIA-NeMo:main from dingqingy-nv:qwen3-patch-nan-grad-fix
