Unify bf16 gb300 qwen3 235b mapping #2670

Open
dingqingy-nv wants to merge 1 commit into NVIDIA-NeMo:main from dingqingy-nv:qwen3-patch-nan-grad-fix

Conversation

dingqingy-nv (Contributor) commented Mar 5, 2026

What does this PR do?

Align the BF16 V2 config for Qwen3 235B A22B on GB300 with the MXFP8/FP8 config so that all precisions use the same parallelism strategy, which avoids the NaN gradient issue.
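One way to sanity-check the "same parallelism strategy" claim is a small static comparison across the precision variants. The sketch below is illustrative only: `WorkloadConfig`, its field names, and the numeric values are hypothetical stand-ins, not the actual definitions in `qwen3_workload_base_configs.py`, and a check like this does not replace a training run that confirms the NaN gradients are gone.

```python
from dataclasses import dataclass

# Hypothetical field names; the real config class may use different ones.
PARALLELISM_FIELDS = ("tensor_parallel", "pipeline_parallel", "expert_parallel")


@dataclass(frozen=True)
class WorkloadConfig:
    """Hypothetical stand-in for a per-precision workload config."""
    precision: str
    tensor_parallel: int
    pipeline_parallel: int
    expert_parallel: int


def parallelism_mismatches(configs):
    """Return {field: {config_name: value}} for every parallelism field
    whose value differs between the given configs."""
    mismatches = {}
    for field in PARALLELISM_FIELDS:
        values = {name: getattr(cfg, field) for name, cfg in configs.items()}
        if len(set(values.values())) > 1:
            mismatches[field] = values
    return mismatches


# Illustrative values only, not taken from the PR.
aligned = {
    "BF16_V2": WorkloadConfig("bf16", 1, 4, 32),
    "FP8_CS_V2": WorkloadConfig("fp8_cs", 1, 4, 32),
    "FP8_MX_V2": WorkloadConfig("fp8_mx", 1, 4, 32),
}
drifted = dict(aligned, BF16_V2=WorkloadConfig("bf16", 2, 8, 16))

print(parallelism_mismatches(aligned))
print(sorted(parallelism_mismatches(drifted)))
```

Precision is deliberately excluded from the comparison, since it is the one setting that is expected to differ between variants.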

Summary by CodeRabbit

  • Chores
    • Updated Qwen3 performance testing configurations to streamline configuration management and reduce redundancy.

Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>
dingqingy-nv added the performance, performance/release, and r0.3.0 labels Mar 5, 2026
copy-pr-bot (bot) commented Mar 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai (bot, Contributor) commented Mar 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b9a99bf9-1048-4869-8122-6184e77f8dbf

📥 Commits

Reviewing files that changed from the base of the PR and between b730ab6 and ea8a94d.

📒 Files selected for processing (1)
  • scripts/performance/configs/qwen/qwen3_workload_base_configs.py

📝 Walkthrough

This change simplifies a configuration file by replacing an explicit detailed definition of QWEN3_235B_A22B_PRETRAIN_CONFIG_GB300_FP8_CS_V2 with a direct alias to QWEN3_235B_A22B_PRETRAIN_CONFIG_GB300_BF16_V2, reducing the definition from 10 lines to 1 line and making FP8_CS_V2 equivalent to BF16_V2.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Configuration Alias: scripts/performance/configs/qwen/qwen3_workload_base_configs.py | Replaced explicit replace(...) definition of QWEN3_235B_A22B_PRETRAIN_CONFIG_GB300_FP8_CS_V2 with an alias to QWEN3_235B_A22B_PRETRAIN_CONFIG_GB300_BF16_V2, affecting downstream configurations that inherit from FP8_CS_V2 (e.g., FP8_MX_V2). |
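The difference between an explicit `replace(...)` definition and an alias can be sketched as follows. This is a minimal illustration that assumes the configs are dataclass instances built with `dataclasses.replace`; the class `WorkloadConfig`, its fields, and every numeric value are hypothetical and are not taken from the changed file.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WorkloadConfig:
    """Hypothetical config with illustrative parallelism fields."""
    tensor_parallel: int = 1
    pipeline_parallel: int = 1
    expert_parallel: int = 1


BASE = WorkloadConfig()

# One variant defined explicitly from the base (values are made up).
BF16_V2 = replace(BASE, pipeline_parallel=4, expert_parallel=32)

# Explicit definition: a second, independent object whose settings can
# drift away from BF16_V2 whenever only one of the two is edited.
FP8_CS_V2_EXPLICIT = replace(BASE, pipeline_parallel=8, expert_parallel=16)

# Alias: the name is bound to the very same object, so the two variants
# cannot diverge.
FP8_CS_V2 = BF16_V2

# Anything defined in terms of FP8_CS_V2 picks up the same settings.
FP8_MX_V2 = FP8_CS_V2

print(FP8_CS_V2 is BF16_V2)           # True: same object
print(FP8_MX_V2 == BF16_V2)           # True: downstream alias matches
print(FP8_CS_V2_EXPLICIT == BF16_V2)  # False: explicit copy has drifted
```

The trade-off of aliasing is that the variants can no longer be tuned independently; a later per-precision override would need to go back to a `replace(...)` call on top of the shared config.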

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

Suggested reviewers

  • ko3n1g
  • malay-nagda

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Test Results For Major Changes | ⚠️ Warning | PR addresses a critical NaN gradient issue but provides no test results, validation data, or performance benchmarks to demonstrate the fix or confirm no regressions. | Add test results demonstrating NaN gradient resolution, validation showing no numerical regression, and performance benchmarks comparing aligned configurations. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main change: unifying BF16 GB300 Qwen3 235B mapping by converting FP8_CS_V2 into an alias of BF16_V2. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


Labels

  • performance/release: Performance items related with NeMo release
  • performance
  • r0.3.0: Cherry-pick label for r0.3.0 release branch

Projects

None yet

2 participants