
Add Qwen 3.5 recipes #2654

Open
cuichenx wants to merge 16 commits into main from chcui/qwen35_recipes

Conversation


@cuichenx cuichenx commented Mar 5, 2026

What does this PR do?

Add Qwen3.5 VL (dense + MoE) bridge, provider, fine-tuning recipes, and example scripts for the full model family.

Changelog

  • Add Qwen35VLBridge (dense) and Qwen35VLMoEBridge for bidirectional HF ↔ Megatron checkpoint conversion with GDN + Gated Attention hybrid architecture support
  • Add Qwen35VLModelProvider / Qwen35VLMoEModelProvider with selective attention spec patching (_patch_standard_attention_specs) for mRoPE on standard attention layers only
  • Add fine-tuning recipes for 8 model sizes: 800M, 2B, 4B, 9B, 27B (dense) and 35B-A3B, 122B-A10B, 397B-A17B (MoE), with recommended TP/PP/EP defaults for SFT and LoRA
  • Add recipes/qwen_vl/__init__.py to register the new recipe module
  • Add example scripts: conversion.sh, inference.sh, slurm_sft.sh, slurm_peft.sh
  • Add unit tests (test_qwen35_vl_recipes.py) and functional tests (test_qwen35_vl_recipes_finetune.py)
  • Fix test_qwen35_vl_bridge.py flaky assertions

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

Requires transformers >= 5.2.0. Import guards are in place for both dense and MoE transformers classes.
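A sketch of the import-guard pattern referred to above; the class name is a placeholder assumption, not a verified transformers symbol:

```python
# Sketch of an optional-dependency import guard. The class name
# "Qwen35VLForConditionalGeneration" is a placeholder assumption,
# not a verified transformers 5.2.0 symbol.
try:
    from transformers import Qwen35VLForConditionalGeneration  # hypothetical name
    HAVE_QWEN35_VL = True
except ImportError:
    Qwen35VLForConditionalGeneration = None
    HAVE_QWEN35_VL = False


def get_qwen35_vl_class():
    """Return the guarded class, or fail with an actionable message."""
    if not HAVE_QWEN35_VL:
        raise ImportError(
            "Qwen3.5 VL support requires transformers >= 5.2.0; "
            "upgrade with: pip install 'transformers>=5.2.0'"
        )
    return Qwen35VLForConditionalGeneration
```

The guard keeps module import cheap and safe when the optional dependency is absent, while the accessor converts a silent `None` into an actionable error at first use.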

cuichenx added 2 commits March 2, 2026 10:06
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx cuichenx marked this pull request as draft March 5, 2026 00:05

copy-pr-bot bot commented Mar 5, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


Signed-off-by: Chen Cui <chcui@nvidia.com>

coderabbitai bot commented Mar 5, 2026

📝 Walkthrough

This PR adds comprehensive fine-tuning support for Qwen3.5 Vision-Language models. It introduces parameterized SLURM scripts for SFT and PEFT training, a new recipe module with eight factory functions for different model sizes, validation logic for parallelism settings, and extensive test coverage. Scripts are updated to dynamically handle multiple model variants with appropriate tensor/pipeline/expert parallelism configurations.
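A minimal sketch of how size-specific factory functions might delegate to a shared builder, based only on the summary above; all names, signatures, and TP/PP/EP values are illustrative placeholders, not the actual recipe API:

```python
from dataclasses import dataclass, field


@dataclass
class ConfigContainerSketch:
    """Illustrative stand-in for the real ConfigContainer."""
    model_size: str
    tensor_model_parallel_size: int = 1
    pipeline_model_parallel_size: int = 1
    expert_model_parallel_size: int = 1
    overrides: dict = field(default_factory=dict)


def _qwen35_vl_common(model_size, tp=1, pp=1, ep=1, **overrides):
    """Shared builder: one place for defaults; per-size callers pass TP/PP/EP."""
    return ConfigContainerSketch(
        model_size=model_size,
        tensor_model_parallel_size=tp,
        pipeline_model_parallel_size=pp,
        expert_model_parallel_size=ep,
        overrides=overrides,
    )


def qwen35_vl_4b_finetune_config(**overrides):
    # Dense 4B: modest parallelism (placeholder values).
    return _qwen35_vl_common("4B", tp=1, pp=1, **overrides)


def qwen35_vl_35b_a3b_finetune_config(**overrides):
    # MoE 35B-A3B: expert parallelism enabled (placeholder values).
    return _qwen35_vl_common("35B-A3B", tp=2, ep=4, **overrides)
```

Keeping per-size functions as thin wrappers over one common builder is what lets eight variants share training, optimizer, and dataset wiring while differing only in parallelism defaults.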

Changes

  • Conversion & Inference Scripts (examples/models/vlm/qwen35_vl/conversion.sh, examples/models/vlm/qwen35_vl/inference.sh): Added strict error handling and parameterized parallelism flags (EP, PP, TP). The conversion script supports multiple dense and MoE variants; the inference script dynamically sets EP based on model type (dense vs. MoE).
  • SLURM Training Scripts (examples/models/vlm/qwen35_vl/slurm_peft.sh, examples/models/vlm/qwen35_vl/slurm_sft.sh): New scripts for orchestrating PEFT (LoRA) and SFT training via SLURM. Both accept model size arguments, map them to HF model names and recipes, configure container execution, build CLI overrides, and integrate with Weights & Biases logging.
  • Qwen3.5 VL Recipe Module (src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py): New module with eight public factory functions for sizes 800M–397B-A17B (dense and MoE). A shared _qwen35_vl_common builds the ConfigContainer with model, training, optimizer, scheduler, PEFT, dataset provider selection, and validation. Supports dataset type selection (mock, preloaded, HF) with error handling.
  • Bridge & Provider Updates (src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py, src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py): The bridge now uses the top-level HF config for embedding sharing instead of the text_config default. The provider adds a validate_parallelism() method to both dense and MoE variants, enforcing tensor_model_parallel_size ≤ num_query_groups.
  • Recipe Package Exports (src/megatron/bridge/recipes/qwen_vl/__init__.py): Added imports and __all__ exports for the eight new Qwen3.5-VL finetune config functions, grouped by density (dense and MoE).
  • Functional Tests (tests/functional_tests/recipes/test_qwen35_vl_recipes_finetune.py): New test class with four GPU-only SFT scenarios: nothing frozen, language frozen, vision+language frozen, and recompute. Includes an autouse fixture to reset the global microbatch calculator between tests.
  • Unit Tests (tests/unit_tests/recipes/qwen_vl/test_qwen35_vl_recipes.py, tests/unit_tests/models/qwen_vl/test_qwen35_vl_bridge.py): Comprehensive recipe unit tests with fake AutoBridge and model config doubles, parameterized coverage of 8 model sizes, SFT/PEFT scenarios, dataset type selection, and override propagation. Bridge tests updated with Mock(spec=[]) and torch_dtype attribute adjustments.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • PR #2613: Directly related—identical mock adjustments in test_qwen35_vl_bridge.py (Mock(spec=[]) and torch_dtype additions).
  • PR #2614: Related refactoring of VLM finetune recipe surface and shared SFT/PEFT helpers.

Suggested labels

Run CICD

Suggested reviewers

  • yaoyu-33
  • yashaswikarnati
  • chtruong814
🚥 Pre-merge checks: 3 passed, 1 failed

❌ Failed checks (1 warning)

  • Test Results For Major Changes (⚠️ Warning): The PR introduces major changes (691 new lines, 8 model variants, SLURM scripts) with comprehensive tests (955 lines), but the PR description lacks test results, coverage metrics, or validation information. Resolution: add a PR description section documenting executed tests, their results, coverage of new recipe functions and model variants, performance validation, and test run outputs.
✅ Passed checks (3 passed)
  • Description Check: Passed (check skipped; CodeRabbit's high-level summary is enabled).
  • Title check: Passed. The title "Add Qwen 3.5 recipes" is concise and directly reflects the primary objective of the PR: introducing fine-tuning recipes for the Qwen3.5 VL model variants (dense and MoE) across multiple sizes.
  • Docstring Coverage: Passed. Docstring coverage is 88.14%, above the required threshold of 80.00%.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

🧹 Nitpick comments (1)
tests/functional_tests/recipes/test_qwen35_vl_recipes_finetune.py (1)

138-199: Prefer subprocess isolation for these functional training runs.

These tests delegate to run_pretrain_vl_recipe_test (see tests/functional_tests/recipes/utils.py, Lines 199-298), which initializes distributed state and runs real pretraining in-process. Moving each scenario to a subprocess will reduce inter-test state coupling and flakes.

As per coding guidelines, "tests/functional_tests/**/*.py: Use subprocess for functional tests that require process isolation".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/recipes/test_qwen35_vl_recipes_finetune.py` around
lines 138 - 199, The four functional tests (test_sft_nothing_frozen,
test_sft_language_model_frozen, test_sft_vision_and_language_frozen,
test_recompute) currently call run_pretrain_vl_recipe_test in-process and must
be converted to run each scenario in a separate subprocess to avoid shared
distributed state; replace the direct call to run_pretrain_vl_recipe_test with
logic that spawns a subprocess (e.g., subprocess.run or multiprocessing.Process
invoking a small runner module/function that imports run_pretrain_vl_recipe_test
and executes it with the same arguments), pass the same
config_func/recipe_name/parallelism_overrides/model_overrides/tmp_path inputs
(serialize as needed), wait for completion, and assert the subprocess exit code
is zero; ensure the new runner entrypoint and the test wrappers use unique
identifiers matching the existing test parameterization so behavior and
signatures remain unchanged.
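A minimal sketch of the subprocess-isolation pattern the comment asks for; the scenario payload here is a trivial stand-in for invoking run_pretrain_vl_recipe_test with serialized arguments:

```python
import subprocess
import sys


def run_scenario_in_subprocess(runner_code: str) -> None:
    """Run one training scenario in a fresh interpreter so distributed
    state cannot leak between tests, then assert on the exit code."""
    result = subprocess.run(
        [sys.executable, "-c", runner_code],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0, (
        f"scenario failed (rc={result.returncode}):\n{result.stderr}"
    )


# In a real test, runner_code would import run_pretrain_vl_recipe_test and
# call it with the scenario's serialized arguments; a trivial stand-in here:
run_scenario_in_subprocess("print('scenario ok')")
```

Because each scenario gets a fresh interpreter, any torch.distributed or global-calculator state dies with the child process instead of bleeding into the next test.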
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/models/vlm/qwen35_vl/inference.sh`:
- Around line 26-33: The EP assignment in the MODEL_NAME case block currently
sets EP=4 for all MoE variants; update the case for Qwen3.5-35B-A3B,
Qwen3.5-122B-A10B, and Qwen3.5-397B-A17B to use per-model defaults (set EP to
the specific values used in the VL scripts instead of a single shared 4) by
modifying the MODEL_NAME case in inference.sh to assign the correct EP for each
model name (reference the MODEL_NAME case block and the EP variable).
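A shell sketch of per-model EP defaults in a MODEL_NAME case block; the EP numbers below are placeholders for illustration, not the authoritative values from the VL scripts:

```shell
# Illustrative only: the EP values below are placeholders, not the
# authoritative per-model defaults from the VL scripts.
MODEL_NAME="${1:-Qwen3.5-35B-A3B}"

case "$MODEL_NAME" in
  Qwen3.5-35B-A3B)   EP=4  ;;
  Qwen3.5-122B-A10B) EP=8  ;;
  Qwen3.5-397B-A17B) EP=16 ;;
  *)                 EP=1  ;;  # dense variants: no expert parallelism
esac

echo "EP=${EP}"
```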

In `@examples/models/vlm/qwen35_vl/slurm_peft.sh`:
- Around line 116-121: The script currently overwrites externally provided
CONTAINER_IMAGE and CONTAINER_MOUNTS with empty strings (symbols CONTAINER_IMAGE
and CONTAINER_MOUNTS), which breaks the later guard that checks for a container
image; change the assignment so you only set defaults when the variables are
unset or empty (e.g., use a conditional or shell parameter expansion to leave an
externally supplied value intact), and apply the same change to the mounts
handling and any subsequent logic that validates CONTAINER_IMAGE to ensure it
respects environment-supplied values.
- Around line 49-50: SBATCH output/error paths reference a logs/ directory that
may not exist when Slurm opens the files; either stop using a non-existent
subdirectory or ensure it is created before submission. Fix by changing the
SBATCH directives in the header (the lines with “#SBATCH
--output=logs/qwen35vl_lora_%j.out” and “#SBATCH
--error=logs/qwen35vl_lora_%j.err”) to write to the current directory (e.g.,
remove the "logs/" prefix or use "%x_%j.out" and "%x_%j.err") or ensure the logs
directory is created by the submitter before calling sbatch (add a pre-submit
mkdir -p logs in the wrapper that writes/submits this script rather than inside
the job script itself).
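The defaults-only-when-unset fix for CONTAINER_IMAGE suggested above can use shell parameter expansion, so an externally exported value survives; the default image path below is a placeholder:

```shell
# Assign defaults only when the variable is unset or empty, so values
# exported by the submitter are respected. The image path is a placeholder.
: "${CONTAINER_IMAGE:=nvcr.io/example/placeholder:latest}"
: "${CONTAINER_MOUNTS:=}"

if [ -z "$CONTAINER_IMAGE" ]; then
  echo "ERROR: CONTAINER_IMAGE must be set" >&2
  exit 1
fi
echo "Using container image: $CONTAINER_IMAGE"
```

`${var:=default}` assigns the default only when the variable is unset or null, which is exactly the behavior the review asks for.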

In `@examples/models/vlm/qwen35_vl/slurm_sft.sh`:
- Around line 1-15: Add strict shell options to the script to fail fast on
errors and unset variables: immediately after the shebang (#!/bin/bash) in
slurm_sft.sh enable "set -euo pipefail" and set a safe IFS (e.g., IFS=$'\n\t')
so command failures and unset variables cannot silently continue during SLURM
training runs.
- Around line 27-35: Commented parallelism recommendations in slurm_sft.sh are
inconsistent with the canonical recipe values in
src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py; update or remove the
mismatched lines (e.g., entries for 122B and 397B) so they reflect the recipe's
TP/PP/EP defaults. Open examples/models/vlm/qwen35_vl/slurm_sft.sh and either
correct the TP/PP/EP numbers to match the values defined in qwen35_vl.py or
replace the block with a short note directing users to the recipe file for
authoritative parallelism settings, ensuring the commented sizes (e.g., 27B,
35B, 122B, 397B) match the recipe's PP/EP/TP/EP entries. Make sure the text
explicitly references the recipe as the source of truth to avoid future drift.
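The strict-mode prologue requested in the first slurm_sft.sh comment would look like this at the top of the script:

```shell
#!/bin/bash
# Fail fast: -e exits on command failure, -u errors on unset variables,
# and pipefail surfaces a failure from any stage of a pipeline.
set -euo pipefail
IFS=$'\n\t'

echo "strict mode enabled"
```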

In `@src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py`:
- Around line 354-357: The documented default pipeline-parallel size for 122B
full-SFT (text saying "PP=4") is inconsistent with the implementation which sets
pipeline_model_parallel_size to 6; update one to match the other: either change
the documentation block in qwen35_vl.py to "PP=6" to reflect the code, or change
the implementation variable pipeline_model_parallel_size (and any related
FULL_SFT config assignments) to 4 so the code matches the docs; ensure both the
default config comment block and the pipeline_model_parallel_size assignment
(and any other FULL_SFT-specific parallelism constants) are updated together.
- Around line 593-595: The TokenizerConfig is being wired to hf_path while
preprocessing uses _processor_model (set from tokenizer_model or hf_path),
causing config drift; update the TokenizerConfig instantiation (and any other
places that currently pass hf_path, e.g., around the TokenizerConfig creation at
the block using hf_path) to use _processor_model (or tokenizer_model if present)
instead so the tokenizer config and dataset preprocessing use the same value
(ensure both the dataset preprocessing branch that relies on _processor_model
and the TokenizerConfig/tokenizer creation use the same symbol:
_processor_model).

In `@tests/functional_tests/recipes/test_qwen35_vl_recipes_finetune.py`:
- Around line 110-199: Add an explicit test category marker (e.g.,
pytest.mark.integration or pytest.mark.system) to this module/class so tests are
properly categorized; for example, set a module-level pytestmark or decorate the
TestQwen35VLFinetuneRecipes class with pytest.mark.integration. Update the
file-level or class-level marker near TestQwen35VLFinetuneRecipes so the GPU
gating remains (run_only_on stays on individual tests) but the module/class also
has the integration/system pytest.mark.
- Around line 113-127: The fixture _reset_microbatch_calculator only destroys
the leaked calculator after each test; change it to also clear any existing
global state before tests start by checking _GLOBAL_NUM_MICROBATCHES_CALCULATOR
and calling destroy_num_microbatches_calculator() prior to yield (in addition to
the existing post-yield cleanup). Update the fixture body that references
_GLOBAL_NUM_MICROBATCHES_CALCULATOR and destroy_num_microbatches_calculator so
both pre- and post-yield paths ensure the global is None.

In `@tests/unit_tests/recipes/qwen_vl/test_qwen35_vl_recipes.py`:
- Around line 98-102: Rename unused parameters to start with an underscore to
silence ruff ARG002/ARG004: in from_hf_pretrained change hf_path to _hf_path and
**kwargs to **_kwargs; in to_megatron_provider change self to _self (or keep
self and prefix unused params) and load_weights to _load_weights so the intent
is explicit and linters stop flagging _FakeAutoBridge and _FakeModelCfg factory
helpers.
- Around line 132-756: Add a pytest category marker so these are discoverable as
unit tests: declare a module-level marker variable `pytestmark =
pytest.mark.unit` near the top of the file (ensure `pytest` is imported), rather
than decorating every function; this will mark all functions such as
`test_each_qwen35_vl_recipe_builds_config`,
`test_qwen35_vl_dataset_type_selection`, `test_sft_nothing_frozen`, etc., as
unit tests for CI selection.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 09eb4a86-1e7b-4867-8806-73d971b7578d

📥 Commits

Reviewing files that changed from the base of the PR and between 394037d and cfd9d90.

📒 Files selected for processing (11)
  • examples/models/vlm/qwen35_vl/conversion.sh
  • examples/models/vlm/qwen35_vl/inference.sh
  • examples/models/vlm/qwen35_vl/slurm_peft.sh
  • examples/models/vlm/qwen35_vl/slurm_sft.sh
  • src/megatron/bridge/models/qwen_vl/qwen35_vl_bridge.py
  • src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py
  • src/megatron/bridge/recipes/qwen_vl/__init__.py
  • src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py
  • tests/functional_tests/recipes/test_qwen35_vl_recipes_finetune.py
  • tests/unit_tests/models/qwen_vl/test_qwen35_vl_bridge.py
  • tests/unit_tests/recipes/qwen_vl/test_qwen35_vl_recipes.py

Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx cuichenx marked this pull request as ready for review March 5, 2026 06:44
cuichenx added 2 commits March 4, 2026 22:51
- Add `set -euo pipefail` to slurm_sft.sh and slurm_peft.sh for fail-fast behavior
- Remove `logs/` prefix from SBATCH output/error paths (directory doesn't exist at job start)
- Remove now-unnecessary `mkdir -p logs` calls
- Fix 122B-A10B parallelism comment in slurm_sft.sh to match recipe (TP=2, PP=6, EP=8)
- Add `pytestmark = pytest.mark.integration` to functional test module
- Reset microbatch calculator both before and after each test in fixture

Signed-off-by: Chen Cui <chcui@nvidia.com>
Made-with: Cursor
cuichenx added 2 commits March 5, 2026 13:55
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
cuichenx added 2 commits March 5, 2026 16:25
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
cuichenx added 2 commits March 5, 2026 16:55
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx commented Mar 6, 2026

/ok to test 63395f9

