Conversation
Signed-off-by: Li Ding <liding@nvidia.com>
📝 Walkthrough

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ Passed checks (2 passed) | ❌ Failed checks (1 warning, 1 inconclusive)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/peft/merge_lora.py (1)
157-171: ⚠️ Potential issue | 🟠 Major

**Normalize CPU parallelism before configuring `model_provider`.**

At lines 158-160, the user-supplied TP/PP/EP values are copied into `model_provider`. Lines 164-168 then reset only `args`. If someone runs `--cpu --tp 2`, the warning says the sizes were forced back to 1, but `initialize_model_parallel()` still sees the stale non-1 values from `model_provider`.

Suggested fix:

```diff
-    print_rank_0(f"Setting Parallelism: TP={args.tp} | PP={args.pp} | EP={args.ep}")
-    model_provider.tensor_model_parallel_size = args.tp
-    model_provider.pipeline_model_parallel_size = args.pp
-    model_provider.expert_model_parallel_size = args.ep
-    model_provider.expert_tensor_parallel_size = 1
-    model_provider.pipeline_dtype = torch.bfloat16
     if args.cpu:
         if args.tp != 1 or args.pp != 1 or args.ep != 1:
             logger.warning("TP, PP, and EP must be 1 when using CPU merge. Setting to 1.")
             args.tp = 1
             args.pp = 1
             args.ep = 1
         if not torch.distributed.is_initialized():
             torch.distributed.init_process_group("gloo")
+    print_rank_0(f"Setting Parallelism: TP={args.tp} | PP={args.pp} | EP={args.ep}")
+    model_provider.tensor_model_parallel_size = args.tp
+    model_provider.pipeline_model_parallel_size = args.pp
+    model_provider.expert_model_parallel_size = args.ep
+    model_provider.expert_tensor_parallel_size = 1
+    model_provider.pipeline_dtype = torch.bfloat16
     model_provider.initialize_model_parallel(seed=0)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/peft/merge_lora.py` around lines 157 - 171, The CPU-path resets args.tp/pp/ep after you already set model_provider.parallel sizes, causing initialize_model_parallel() to use stale non-1 values; fix by normalizing args when args.cpu is true before assigning to model_provider (or, alternatively, after you detect CPU usage update model_provider.tensor_model_parallel_size, model_provider.pipeline_model_parallel_size, and model_provider.expert_model_parallel_size to reflect the forced-to-1 values), ensuring model_provider values and args are consistent before calling model_provider.initialize_model_parallel(seed=0).
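The ordering pitfall the comment describes can be reduced to a stdlib-only sketch (the `configure` helper and the `SimpleNamespace` stand-ins are hypothetical illustrations, not code from the PR): copying `args` into the provider before normalizing them leaves the provider holding stale values.

```python
from types import SimpleNamespace

def configure(args, normalize_first):
    """Copy parallelism sizes from args into a provider stand-in."""
    provider = SimpleNamespace()
    if normalize_first and args.cpu:
        args.tp = args.pp = args.ep = 1  # normalize BEFORE copying
    provider.tp, provider.pp, provider.ep = args.tp, args.pp, args.ep
    if not normalize_first and args.cpu:
        args.tp = args.pp = args.ep = 1  # too late: provider already holds stale values
    return provider

buggy = configure(SimpleNamespace(cpu=True, tp=2, pp=1, ep=1), normalize_first=False)
fixed = configure(SimpleNamespace(cpu=True, tp=2, pp=1, ep=1), normalize_first=True)
print(buggy.tp, fixed.tp)  # → 2 1
```

Moving the normalization above the copy, as the suggested fix does, keeps `args` and the provider consistent before `initialize_model_parallel()` runs.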
🧹 Nitpick comments (1)
examples/peft/merge_lora.py (1)
30-46: **Use `uv run` in the usage examples.**

These examples still show direct `python`/`torchrun` execution, which drifts from the repo convention for `examples/**/*.py`.

Suggested doc update:

```diff
-python merge_lora.py \
+uv run python examples/peft/merge_lora.py \
     --lora-checkpoint path/to/finetune_ckpt \
     --hf-model-path path/to/hf_model \
     --output path/to/merged_ckpt \
     [--pretrained path/to/base_ckpt] \
     --cpu

-torchrun --nproc_per_node <N> merge_lora.py \
+uv run torchrun --nproc_per_node <N> examples/peft/merge_lora.py \
     --lora-checkpoint path/to/finetune_ckpt \
     --hf-model-path path/to/hf_model \
     --output path/to/merged_ckpt \
     [--pretrained path/to/base_ckpt] \
     [--tp 1] [--pp 1] [--ep 1]
```

As per coding guidelines: "Use 'uv run' to execute scripts instead of activating a virtual environment and calling 'python' directly."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/peft/merge_lora.py` around lines 30 - 46, The usage examples in merge_lora.py show direct python/torchrun execution; update them to the repo convention by replacing those invocations with the `uv run` form (single-process: `uv run merge_lora.py --lora-checkpoint ... --hf-model-path ... --output ... [--pretrained ...] --cpu`; multi-process: `uv run -n <N> merge_lora.py --lora-checkpoint ... --hf-model-path ... --output ... [--pretrained ...] [--tp 1] [--pp 1] [--ep 1]`). Edit the example block that currently lists `python merge_lora.py` and `torchrun --nproc_per_node <N> merge_lora.py` so it uses `uv run` and `uv run -n <N>` while keeping the same flags (`--lora-checkpoint`, `--hf-model-path`, `--output`, `--pretrained`, `--cpu`, `--tp`, `--pp`, `--ep`) and surrounding text.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/peft/merge_lora.py`:
- Around line 163-170: When args.cpu is true the code calls
torch.distributed.init_process_group("gloo") without rank/world_size/init_method
which relies on environment vars; update the init call in merge_lora.py (the
block guarded by args.cpu and torch.distributed.is_initialized()) to initialize
a single-rank group explicitly (e.g., pass backend="gloo", rank=0, world_size=1
or provide an explicit init_method suitable for local single-process use) so the
CPU example runs standalone, or alternatively add a short comment/docstring near
args.cpu explaining that MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK must be
set if using the env:// init method.
---
Outside diff comments:
In `@examples/peft/merge_lora.py`:
- Around line 157-171: The CPU-path resets args.tp/pp/ep after you already set
model_provider.parallel sizes, causing initialize_model_parallel() to use stale
non-1 values; fix by normalizing args when args.cpu is true before assigning to
model_provider (or, alternatively, after you detect CPU usage update
model_provider.tensor_model_parallel_size,
model_provider.pipeline_model_parallel_size, and
model_provider.expert_model_parallel_size to reflect the forced-to-1 values),
ensuring model_provider values and args are consistent before calling
model_provider.initialize_model_parallel(seed=0).
---
Nitpick comments:
In `@examples/peft/merge_lora.py`:
- Around line 30-46: The usage examples in merge_lora.py show direct
python/torchrun execution; update them to the repo convention by replacing those
invocations with the `uv run` form (single-process: `uv run merge_lora.py
--lora-checkpoint ... --hf-model-path ... --output ... [--pretrained ...]
--cpu`; multi-process: `uv run -n <N> merge_lora.py --lora-checkpoint ...
--hf-model-path ... --output ... [--pretrained ...] [--tp 1] [--pp 1] [--ep
1]`). Edit the example block that currently lists `python merge_lora.py` and
`torchrun --nproc_per_node <N> merge_lora.py` so it uses `uv run` and `uv run -n
<N>` while keeping the same flags (`--lora-checkpoint`, `--hf-model-path`,
`--output`, `--pretrained`, `--cpu`, `--tp`, `--pp`, `--ep`) and surrounding
text.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 184e718a-8c25-4026-876f-413bc6a18ce6
📒 Files selected for processing (1)
examples/peft/merge_lora.py
```diff
 if args.cpu:
-    assert args.tp == args.pp == args.ep == 1, "TP, PP, and EP must be 1 when using CPU merge"
+    if args.tp != 1 or args.pp != 1 or args.ep != 1:
+        logger.warning("TP, PP, and EP must be 1 when using CPU merge. Setting to 1.")
+        args.tp = 1
+        args.pp = 1
+        args.ep = 1
     if not torch.distributed.is_initialized():
         torch.distributed.init_process_group("gloo")
```
🧩 Analysis chain
🌐 Web query:
In the current PyTorch documentation, what defaults does torch.distributed.init_process_group() use when only backend="gloo" is passed, and which environment variables are required for env:// initialization?
💡 Result:
Defaults used by torch.distributed.init_process_group(backend="gloo") (i.e., only backend passed)
From the current PyTorch docs, the function signature/defaults are: init_process_group(backend=None, init_method=None, timeout=None, world_size=-1, rank=-1, store=None, group_name='', pg_options=None, device_id=None, ...). (docs.pytorch.org)
So with only backend="gloo" passed, the defaults in effect are:
- `init_method`: assumed to be `"env://"` (because neither `store` nor `init_method` is specified). (docs.pytorch.org)
- `timeout`: 30 minutes (docs: "10 minutes for NCCL and 30 minutes for other backends"; Gloo is "other backends"). (docs.pytorch.org)
- `world_size`: -1 (meaning it must be provided via `env://` env vars or explicitly). (docs.pytorch.org)
- `rank`: -1 (meaning it must be provided via `env://` env vars or explicitly). (docs.pytorch.org)
- `store=None`, `group_name=''`, `pg_options=None`, `device_id=None` (and other shown defaults). (docs.pytorch.org)
Environment variables required for env:// initialization
PyTorch docs list these variables for env://:
- `MASTER_PORT`: required
- `MASTER_ADDR`: required (except for rank 0)
- `WORLD_SIZE`: required (unless you pass `world_size` directly)
- `RANK`: required (unless you pass `rank` directly) (docs.pytorch.org)
Citation: https://docs.pytorch.org/docs/stable/distributed.html
🏁 Script executed:

```shell
# First, let's check the file and its surrounding context
head -n 180 examples/peft/merge_lora.py | tail -n 50
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 2363

🏁 Script executed:

```shell
# Also look for the docstring/description of what --cpu is supposed to do
head -n 50 examples/peft/merge_lora.py
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1931

🏁 Script executed:

```shell
# Check if there's any conditional logic that skips distributed init for CPU-only
grep -n "torch.distributed" examples/peft/merge_lora.py -A 3 -B 3
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 682

🏁 Script executed:

```shell
# Check if there's a way to properly initialize distributed for single-rank case
grep -n "initialize_model_parallel\|world_size\|rank" examples/peft/merge_lora.py | head -20
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 673

🏁 Script executed:

```shell
# Look for environment variable handling or alternative initialization patterns
grep -n "MASTER\|RANK\|WORLD_SIZE" examples/peft/merge_lora.py
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 53

🏁 Script executed:

```shell
# Check if there's conditional logic or context that handles single-rank initialization
grep -n "rank\|world_size" examples/peft/merge_lora.py -i
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 616

🏁 Script executed:

```shell
# Look for torch.distributed usage patterns in other files in the repo
find . -name "*.py" -type f -exec grep -l "init_process_group" {} \; | head -5
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1106

🏁 Script executed:

```shell
# Check how other examples initialize distributed process group
grep -A 5 "init_process_group" ./examples/conversion/compare_text_generation.py
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 237

🏁 Script executed:

```shell
# Check the model_provider to understand if initialize_model_parallel requires distributed init
grep -A 10 -B 5 "initialize_model_parallel" ./src/megatron/bridge/models/model_provider.py | head -40
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1942

🏁 Script executed:

```shell
# Search for any handling of rank=0 or single-rank initialization patterns
rg "rank.*0|world_size.*1" examples/peft/merge_lora.py -A 2 -B 2
```

Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1468
**Supply explicit single-rank configuration for standalone CPU usage, or document the required environment variables.**

The docstring advertises `python merge_lora.py ... --cpu` as a standalone single-process example, but the code calls `torch.distributed.init_process_group("gloo")` without explicit `rank`, `world_size`, or `init_method`. PyTorch defaults `init_method` to `"env://"` when not specified, which requires the `MASTER_PORT`, `MASTER_ADDR`, `WORLD_SIZE`, and `RANK` environment variables; these will not be set in a standalone Python invocation. Either pass explicit parameters (e.g., `rank=0, world_size=1`) or document that users must export these environment variables before running the CPU example.
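One low-friction way to satisfy the `env://` requirement, sketched here with illustrative address and port values rather than anything from the PR, is to default the rendezvous variables before initializing the process group:

```python
import os

# Default the env:// rendezvous variables so a bare `python merge_lora.py --cpu`
# run can initialize a single-rank gloo group. setdefault leaves any values
# already exported by a launcher untouched.
for var, default in {
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",  # any free local port works
    "RANK": "0",
    "WORLD_SIZE": "1",
}.items():
    os.environ.setdefault(var, default)

print({k: os.environ[k] for k in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")})
```

Because `os.environ.setdefault` never overwrites launcher-provided values (e.g., from `torchrun`), the same script still works unchanged in multi-process runs.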
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/peft/merge_lora.py` around lines 163 - 170, When args.cpu is true
the code calls torch.distributed.init_process_group("gloo") without
rank/world_size/init_method which relies on environment vars; update the init
call in merge_lora.py (the block guarded by args.cpu and
torch.distributed.is_initialized()) to initialize a single-rank group explicitly
(e.g., pass backend="gloo", rank=0, world_size=1 or provide an explicit
init_method suitable for local single-process use) so the CPU example runs
standalone, or alternatively add a short comment/docstring near args.cpu
explaining that MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK must be set if
using the env:// init method.
/ok to test 4922e74
Summary by CodeRabbit
Release Notes
Bug Fixes
Documentation