update merge_lora script #2688

Open

liding-nv wants to merge 2 commits into main from liding/lora_merge_update

Conversation


@liding-nv liding-nv commented Mar 6, 2026

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved CPU execution with informative warnings for parallelism configuration mismatches instead of strict errors.
    • Simplified out-of-memory error handling with clearer exit behavior.
  • Documentation

    • Updated usage examples and docstrings to reflect current CPU and GPU parallelism invocation patterns.
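The simplified out-of-memory behavior described above could be sketched roughly as follows (the helper name `run_merge` is illustrative, not the script's actual structure): instead of retrying after a CUDA OOM error, the merge now warns and exits with code 1.

```python
import logging
import sys

logger = logging.getLogger(__name__)


def run_merge(merge_fn):
    """Run the merge once; on CUDA OOM, warn and exit instead of retrying."""
    try:
        return merge_fn()
    except RuntimeError as err:
        # torch surfaces CUDA OOM as a RuntimeError ("CUDA out of memory ...").
        if "out of memory" not in str(err).lower():
            raise
        logger.warning("CUDA out of memory during merge; consider --cpu or higher TP/PP.")
        sys.exit(1)
```

A caller would simply wrap the merge entry point, e.g. `run_merge(lambda: merge(args))`.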

Signed-off-by: Li Ding <liding@nvidia.com>
@liding-nv liding-nv requested a review from cuichenx March 6, 2026 20:20

copy-pr-bot bot commented Mar 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Mar 6, 2026

📝 Walkthrough

The examples/peft/merge_lora.py file is refactored to centralize path resolution through a dedicated helper, adjust CPU-mode validation from strict assertion to warning with automatic reset, remove CUDA out-of-memory retry logic in favor of warning and error exit, and introduce a process group finalizer for cleanup.

Changes

Cohort / File(s) summary (all changes in examples/peft/merge_lora.py):

  • Path Resolution Refactoring: replaces direct Path(...).expanduser().resolve() calls with the centralized resolve_path(explicit) helper imported from megatron.bridge.utils.common_utils for consistent user-path handling across _resolve_pretrained() and main().
  • CPU Mode Validation Update: replaces the strict assertion enforcing TP/PP/EP ≤ 1 on CPU with a warning log and an automatic reset of the parallelism parameters to 1.
  • CUDA OOM Error Handling: removes the retry-on-out-of-memory logic in favor of a warning log followed by exit code 1.
  • Process Group Cleanup: introduces a finalizer to destroy distributed process groups if initialized.
  • Documentation Updates: updates usage/docstring examples to reflect the new CPU-only and GPU parallelism invocation formats.
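The CPU-mode validation and process-group cleanup described above might look roughly like the following sketch (helper names here are illustrative, not the script's actual ones):

```python
import atexit
import logging

logger = logging.getLogger(__name__)


def normalize_cpu_parallelism(tp, pp, ep, cpu):
    """Warn and force TP/PP/EP back to 1 for CPU merges instead of asserting."""
    if cpu and (tp != 1 or pp != 1 or ep != 1):
        logger.warning("TP, PP, and EP must be 1 when using CPU merge. Setting to 1.")
        return 1, 1, 1
    return tp, pp, ep


def destroy_process_group_if_initialized():
    """Finalizer: tear down the default process group if one was created."""
    try:
        import torch.distributed as dist
    except ImportError:  # torch absent; nothing to clean up
        return
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()


# Registered once at import so cleanup runs even on early exits.
atexit.register(destroy_process_group_if_initialized)
```

With this shape, the warning-and-reset path replaces the previous hard assertion, and the finalizer guarantees the gloo/nccl group is destroyed on interpreter shutdown.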

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 66.67%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Title Check (❓ Inconclusive): The title "update merge_lora script" is vague and generic and doesn't convey the specific nature of the changes. Resolution: provide a more specific title, such as "Replace Path resolution with resolve_path helper and add CPU-only mode warning" or "Improve path handling and CPU parallelism enforcement in merge_lora script".
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Test Results For Major Changes (✅ Passed): Changes are minor refactoring and bug fixes to a utility script affecting path handling, error handling, and resource cleanup, without impacting core numerics or performance.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/peft/merge_lora.py (1)

157-171: ⚠️ Potential issue | 🟠 Major

Normalize CPU parallelism before configuring model_provider.

At Lines 158-160, the user-supplied TP/PP/EP values are copied into model_provider. Lines 164-168 then reset only args. If someone runs --cpu --tp 2, the warning says the sizes were forced back to 1, but initialize_model_parallel() still sees the stale non-1 values from model_provider.

Suggested fix
-    print_rank_0(f"Setting Parallelism: TP={args.tp} | PP={args.pp} | EP={args.ep}")
-    model_provider.tensor_model_parallel_size = args.tp
-    model_provider.pipeline_model_parallel_size = args.pp
-    model_provider.expert_model_parallel_size = args.ep
-    model_provider.expert_tensor_parallel_size = 1
-    model_provider.pipeline_dtype = torch.bfloat16
     if args.cpu:
         if args.tp != 1 or args.pp != 1 or args.ep != 1:
             logger.warning("TP, PP, and EP must be 1 when using CPU merge. Setting to 1.")
             args.tp = 1
             args.pp = 1
             args.ep = 1
         if not torch.distributed.is_initialized():
             torch.distributed.init_process_group("gloo")
+    print_rank_0(f"Setting Parallelism: TP={args.tp} | PP={args.pp} | EP={args.ep}")
+    model_provider.tensor_model_parallel_size = args.tp
+    model_provider.pipeline_model_parallel_size = args.pp
+    model_provider.expert_model_parallel_size = args.ep
+    model_provider.expert_tensor_parallel_size = 1
+    model_provider.pipeline_dtype = torch.bfloat16
     model_provider.initialize_model_parallel(seed=0)
🧹 Nitpick comments (1)
examples/peft/merge_lora.py (1)

30-46: Use uv run in the usage examples.

These examples still show direct python/torchrun execution, which drifts from the repo convention for examples/**/*.py.

Suggested doc update
-    python merge_lora.py \
+    uv run python examples/peft/merge_lora.py \
         --lora-checkpoint path/to/finetune_ckpt \
         --hf-model-path   path/to/hf_model \
         --output          path/to/merged_ckpt \
         [--pretrained path/to/base_ckpt] \
         --cpu

-    torchrun --nproc_per_node <N> merge_lora.py \
+    uv run torchrun --nproc_per_node <N> examples/peft/merge_lora.py \
         --lora-checkpoint path/to/finetune_ckpt \
         --hf-model-path   path/to/hf_model \
         --output          path/to/merged_ckpt \
         [--pretrained path/to/base_ckpt] \
         [--tp 1] [--pp 1] [--ep 1]

As per coding guidelines "Use 'uv run' to execute scripts instead of activating a virtual environment and calling 'python' directly".


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 184e718a-8c25-4026-876f-413bc6a18ce6

📥 Commits

Reviewing files that changed from the base of the PR and between 1d25ea2 and 4922e74.

📒 Files selected for processing (1)
  • examples/peft/merge_lora.py

Comment on lines 163 to 170
     if args.cpu:
-        assert args.tp == args.pp == args.ep == 1, "TP, PP, and EP must be 1 when using CPU merge"
+        if args.tp != 1 or args.pp != 1 or args.ep != 1:
+            logger.warning("TP, PP, and EP must be 1 when using CPU merge. Setting to 1.")
+            args.tp = 1
+            args.pp = 1
+            args.ep = 1
         if not torch.distributed.is_initialized():
             torch.distributed.init_process_group("gloo")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In the current PyTorch documentation, what defaults does torch.distributed.init_process_group() use when only backend="gloo" is passed, and which environment variables are required for env:// initialization?

💡 Result:

Defaults used by torch.distributed.init_process_group(backend="gloo") (i.e., only backend passed)

From the current PyTorch docs, the function signature/defaults are: init_process_group(backend=None, init_method=None, timeout=None, world_size=-1, rank=-1, store=None, group_name='', pg_options=None, device_id=None, ...). (docs.pytorch.org)

So with only backend="gloo" passed, the defaults in effect are:

  • init_method: assumed to be "env://" (because neither store nor init_method is specified). (docs.pytorch.org)
  • timeout: 30 minutes (docs: “10 minutes for NCCL and 30 minutes for other backends”; Gloo is “other backends”). (docs.pytorch.org)
  • world_size: -1 (meaning it must be provided via env:// env vars or explicitly). (docs.pytorch.org)
  • rank: -1 (meaning it must be provided via env:// env vars or explicitly). (docs.pytorch.org)
  • store=None, group_name='', pg_options=None, device_id=None (and other shown defaults). (docs.pytorch.org)

Environment variables required for env:// initialization

PyTorch docs list these variables for env://:

  • MASTER_PORT — required
  • MASTER_ADDR — required (except for rank 0)
  • WORLD_SIZE — required (unless you pass world_size directly)
  • RANK — required (unless you pass rank directly) (docs.pytorch.org)

🏁 Scripts executed (outputs collapsed; repository NVIDIA-NeMo/Megatron-Bridge):

  # Check the file and its surrounding context
  head -n 180 examples/peft/merge_lora.py | tail -n 50

  # Look for the docstring/description of what --cpu is supposed to do
  head -n 50 examples/peft/merge_lora.py

  # Check for any conditional logic that skips distributed init for CPU-only
  grep -n "torch.distributed" examples/peft/merge_lora.py -A 3 -B 3

  # Check for a way to properly initialize distributed for the single-rank case
  grep -n "initialize_model_parallel\|world_size\|rank" examples/peft/merge_lora.py | head -20

  # Look for environment variable handling or alternative initialization patterns
  grep -n "MASTER\|RANK\|WORLD_SIZE" examples/peft/merge_lora.py

  # Look for torch.distributed usage patterns in other files in the repo
  find . -name "*.py" -type f -exec grep -l "init_process_group" {} \; | head -5

  # Check how other examples initialize the distributed process group
  grep -A 5 "init_process_group" ./examples/conversion/compare_text_generation.py

  # Check whether initialize_model_parallel requires distributed init
  grep -A 10 -B 5 "initialize_model_parallel" ./src/megatron/bridge/models/model_provider.py | head -40

  # Search for any handling of rank=0 or single-rank initialization patterns
  rg "rank.*0|world_size.*1" examples/peft/merge_lora.py -A 2 -B 2

Supply explicit single-rank configuration for standalone CPU usage or document required environment variables.

The docstring advertises python merge_lora.py ... --cpu as a standalone single-process example, but the code calls torch.distributed.init_process_group("gloo") without explicit rank, world_size, or init_method. PyTorch defaults init_method to "env://" when not specified, requiring MASTER_PORT, MASTER_ADDR, WORLD_SIZE, and RANK environment variables—which will not be set in a standalone Python invocation. Either pass explicit parameters (e.g., rank=0, world_size=1) or document that users must export these environment variables before running the CPU example.
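A minimal standalone single-rank initialization along those lines, assuming the gloo backend and a free local TCP port (the port number here is arbitrary), could be:

```python
import torch.distributed as dist


def init_single_rank_cpu_group(port: int = 29500) -> None:
    """Initialize a one-process gloo group without relying on env:// variables."""
    if dist.is_initialized():
        return
    # Explicit rank/world_size plus a TCP init_method avoid the env:// requirement
    # for MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK.
    dist.init_process_group(
        backend="gloo",
        init_method=f"tcp://127.0.0.1:{port}",
        rank=0,
        world_size=1,
    )
```

This keeps `python merge_lora.py ... --cpu` runnable without a launcher, at the cost of hard-coding the single-rank assumption.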


@cuichenx cuichenx left a comment

LGTM thanks!

@liding-nv (Contributor, Author)

/ok to test 4922e74
