update merge_lora script #2688

Open

liding-nv wants to merge 2 commits into main from liding/lora_merge_update

Conversation


@liding-nv liding-nv commented Mar 6, 2026

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved CPU execution with informative warnings for parallelism configuration mismatches instead of strict errors.
    • Simplified out-of-memory error handling with clearer exit behavior.
  • Documentation

    • Updated usage examples and docstrings to reflect current CPU and GPU parallelism invocation patterns.
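The simplified out-of-memory behavior described above could be sketched roughly as follows (the helper name `run_merge` is illustrative, not the script's actual structure): instead of retrying after a CUDA OOM error, the merge now warns and exits with code 1.

```python
import logging
import sys

logger = logging.getLogger(__name__)


def run_merge(merge_fn):
    """Run the merge once; on CUDA OOM, warn and exit instead of retrying."""
    try:
        return merge_fn()
    except RuntimeError as err:
        # torch surfaces CUDA OOM as a RuntimeError ("CUDA out of memory ...").
        if "out of memory" not in str(err).lower():
            raise
        logger.warning("CUDA out of memory during merge; consider --cpu or higher TP/PP.")
        sys.exit(1)
```

A caller would simply wrap the merge entry point, e.g. `run_merge(lambda: merge(args))`.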

Signed-off-by: Li Ding <liding@nvidia.com>
@liding-nv liding-nv requested a review from cuichenx March 6, 2026 20:20

copy-pr-bot bot commented Mar 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Mar 6, 2026

📝 Walkthrough

The examples/peft/merge_lora.py file is refactored to centralize path resolution through a dedicated helper, adjust CPU-mode validation from strict assertion to warning with automatic reset, remove CUDA out-of-memory retry logic in favor of warning and error exit, and introduce a process group finalizer for cleanup.

Changes

Cohort / File(s) summary (all changes in examples/peft/merge_lora.py):

  • Path Resolution Refactoring: replaces direct Path(...).expanduser().resolve() calls with the centralized resolve_path(explicit) helper imported from megatron.bridge.utils.common_utils for consistent user-path handling across _resolve_pretrained() and main().
  • CPU Mode Validation Update: replaces the strict assertion enforcing TP/PP/EP ≤ 1 on CPU with a warning log and an automatic reset of the parallelism parameters to 1.
  • CUDA OOM Error Handling: removes the retry-on-out-of-memory logic in favor of a warning log followed by exit code 1.
  • Process Group Cleanup: introduces a finalizer to destroy distributed process groups if initialized.
  • Documentation Updates: updates usage/docstring examples to reflect the new CPU-only and GPU parallelism invocation formats.
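The CPU-mode validation and process-group cleanup described above might look roughly like the following sketch (helper names here are illustrative, not the script's actual ones):

```python
import atexit
import logging

logger = logging.getLogger(__name__)


def normalize_cpu_parallelism(tp, pp, ep, cpu):
    """Warn and force TP/PP/EP back to 1 for CPU merges instead of asserting."""
    if cpu and (tp != 1 or pp != 1 or ep != 1):
        logger.warning("TP, PP, and EP must be 1 when using CPU merge. Setting to 1.")
        return 1, 1, 1
    return tp, pp, ep


def destroy_process_group_if_initialized():
    """Finalizer: tear down the default process group if one was created."""
    try:
        import torch.distributed as dist
    except ImportError:  # torch absent; nothing to clean up
        return
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()


# Registered once at import so cleanup runs even on early exits.
atexit.register(destroy_process_group_if_initialized)
```

With this shape, the warning-and-reset path replaces the previous hard assertion, and the finalizer guarantees the gloo/nccl group is destroyed on interpreter shutdown.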

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 66.67%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Title Check (❓ Inconclusive): The title "update merge_lora script" is vague and generic and doesn't convey the specific nature of the changes. Resolution: provide a more specific title, such as "Replace Path resolution with resolve_path helper and add CPU-only mode warning" or "Improve path handling and CPU parallelism enforcement in merge_lora script".
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Test Results For Major Changes (✅ Passed): Changes are minor refactoring and bug fixes to a utility script affecting path handling, error handling, and resource cleanup, without impacting core numerics or performance.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/peft/merge_lora.py (1)

157-171: ⚠️ Potential issue | 🟠 Major

Normalize CPU parallelism before configuring model_provider.

At Lines 158-160, the user-supplied TP/PP/EP values are copied into model_provider. Lines 164-168 then reset only args. If someone runs --cpu --tp 2, the warning says the sizes were forced back to 1, but initialize_model_parallel() still sees the stale non-1 values from model_provider.

Suggested fix
-    print_rank_0(f"Setting Parallelism: TP={args.tp} | PP={args.pp} | EP={args.ep}")
-    model_provider.tensor_model_parallel_size = args.tp
-    model_provider.pipeline_model_parallel_size = args.pp
-    model_provider.expert_model_parallel_size = args.ep
-    model_provider.expert_tensor_parallel_size = 1
-    model_provider.pipeline_dtype = torch.bfloat16
     if args.cpu:
         if args.tp != 1 or args.pp != 1 or args.ep != 1:
             logger.warning("TP, PP, and EP must be 1 when using CPU merge. Setting to 1.")
             args.tp = 1
             args.pp = 1
             args.ep = 1
         if not torch.distributed.is_initialized():
             torch.distributed.init_process_group("gloo")
+    print_rank_0(f"Setting Parallelism: TP={args.tp} | PP={args.pp} | EP={args.ep}")
+    model_provider.tensor_model_parallel_size = args.tp
+    model_provider.pipeline_model_parallel_size = args.pp
+    model_provider.expert_model_parallel_size = args.ep
+    model_provider.expert_tensor_parallel_size = 1
+    model_provider.pipeline_dtype = torch.bfloat16
     model_provider.initialize_model_parallel(seed=0)
🧹 Nitpick comments (1)
examples/peft/merge_lora.py (1)

30-46: Use uv run in the usage examples.

These examples still show direct python/torchrun execution, which drifts from the repo convention for examples/**/*.py.

Suggested doc update
-    python merge_lora.py \
+    uv run python examples/peft/merge_lora.py \
         --lora-checkpoint path/to/finetune_ckpt \
         --hf-model-path   path/to/hf_model \
         --output          path/to/merged_ckpt \
         [--pretrained path/to/base_ckpt] \
         --cpu

-    torchrun --nproc_per_node <N> merge_lora.py \
+    uv run torchrun --nproc_per_node <N> examples/peft/merge_lora.py \
         --lora-checkpoint path/to/finetune_ckpt \
         --hf-model-path   path/to/hf_model \
         --output          path/to/merged_ckpt \
         [--pretrained path/to/base_ckpt] \
         [--tp 1] [--pp 1] [--ep 1]

As per coding guidelines "Use 'uv run' to execute scripts instead of activating a virtual environment and calling 'python' directly".


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 184e718a-8c25-4026-876f-413bc6a18ce6

📥 Commits

Reviewing files that changed from the base of the PR and between 1d25ea2 and 4922e74.

📒 Files selected for processing (1)
  • examples/peft/merge_lora.py

Comment on lines 163 to 170
     if args.cpu:
-        assert args.tp == args.pp == args.ep == 1, "TP, PP, and EP must be 1 when using CPU merge"
+        if args.tp != 1 or args.pp != 1 or args.ep != 1:
+            logger.warning("TP, PP, and EP must be 1 when using CPU merge. Setting to 1.")
+            args.tp = 1
+            args.pp = 1
+            args.ep = 1
         if not torch.distributed.is_initialized():
             torch.distributed.init_process_group("gloo")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In the current PyTorch documentation, what defaults does torch.distributed.init_process_group() use when only backend="gloo" is passed, and which environment variables are required for env:// initialization?

💡 Result:

Defaults used by torch.distributed.init_process_group(backend="gloo") (i.e., only backend passed)

From the current PyTorch docs, the function signature/defaults are: init_process_group(backend=None, init_method=None, timeout=None, world_size=-1, rank=-1, store=None, group_name='', pg_options=None, device_id=None, ...). (docs.pytorch.org)

So with only backend="gloo" passed, the defaults in effect are:

  • init_method: assumed to be "env://" (because neither store nor init_method is specified). (docs.pytorch.org)
  • timeout: 30 minutes (docs: “10 minutes for NCCL and 30 minutes for other backends”; Gloo is “other backends”). (docs.pytorch.org)
  • world_size: -1 (meaning it must be provided via env:// env vars or explicitly). (docs.pytorch.org)
  • rank: -1 (meaning it must be provided via env:// env vars or explicitly). (docs.pytorch.org)
  • store=None, group_name='', pg_options=None, device_id=None (and other shown defaults). (docs.pytorch.org)

Environment variables required for env:// initialization

PyTorch docs list these variables for env://:

  • MASTER_PORT — required
  • MASTER_ADDR — required (except for rank 0)
  • WORLD_SIZE — required (unless you pass world_size directly)
  • RANK — required (unless you pass rank directly) (docs.pytorch.org)

🏁 Scripts executed (outputs collapsed; repository NVIDIA-NeMo/Megatron-Bridge):

  # Check the file and its surrounding context
  head -n 180 examples/peft/merge_lora.py | tail -n 50

  # Look for the docstring/description of what --cpu is supposed to do
  head -n 50 examples/peft/merge_lora.py

  # Check for any conditional logic that skips distributed init for CPU-only
  grep -n "torch.distributed" examples/peft/merge_lora.py -A 3 -B 3

  # Check for a way to properly initialize distributed for the single-rank case
  grep -n "initialize_model_parallel\|world_size\|rank" examples/peft/merge_lora.py | head -20

  # Look for environment variable handling or alternative initialization patterns
  grep -n "MASTER\|RANK\|WORLD_SIZE" examples/peft/merge_lora.py

  # Look for torch.distributed usage patterns in other files in the repo
  find . -name "*.py" -type f -exec grep -l "init_process_group" {} \; | head -5

  # Check how other examples initialize the distributed process group
  grep -A 5 "init_process_group" ./examples/conversion/compare_text_generation.py

  # Check whether initialize_model_parallel requires distributed init
  grep -A 10 -B 5 "initialize_model_parallel" ./src/megatron/bridge/models/model_provider.py | head -40

  # Search for any handling of rank=0 or single-rank initialization patterns
  rg "rank.*0|world_size.*1" examples/peft/merge_lora.py -A 2 -B 2

Supply explicit single-rank configuration for standalone CPU usage or document required environment variables.

The docstring advertises python merge_lora.py ... --cpu as a standalone single-process example, but the code calls torch.distributed.init_process_group("gloo") without explicit rank, world_size, or init_method. PyTorch defaults init_method to "env://" when not specified, requiring MASTER_PORT, MASTER_ADDR, WORLD_SIZE, and RANK environment variables—which will not be set in a standalone Python invocation. Either pass explicit parameters (e.g., rank=0, world_size=1) or document that users must export these environment variables before running the CPU example.
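A minimal standalone single-rank initialization along those lines, assuming the gloo backend and a free local TCP port (the port number here is arbitrary), could be:

```python
import torch.distributed as dist


def init_single_rank_cpu_group(port: int = 29500) -> None:
    """Initialize a one-process gloo group without relying on env:// variables."""
    if dist.is_initialized():
        return
    # Explicit rank/world_size plus a TCP init_method avoid the env:// requirement
    # for MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK.
    dist.init_process_group(
        backend="gloo",
        init_method=f"tcp://127.0.0.1:{port}",
        rank=0,
        world_size=1,
    )
```

This keeps `python merge_lora.py ... --cpu` runnable without a launcher, at the cost of hard-coding the single-rank assumption.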


@cuichenx cuichenx left a comment

LGTM thanks!

@liding-nv (Contributor, Author)

/ok to test 4922e74
