[QEff Finetune]: Enable PP+DDP #394


Status: Draft — wants to merge 3 commits into main
Conversation

@quic-mamta (Contributor) commented May 8, 2025

Added support for PP and DDP.

Command for PP only (the number of pipeline stages will equal the number of visible devices):
QAIC_VISIBLE_DEVICES=0,1,2,3 python -m QEfficient.cloud.finetune --device qaic --enable_pp --dist_backend qccl

Command for DDP only:
QAIC_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node 4 -m QEfficient.cloud.finetune --device qaic --enable_ddp --dist_backend qccl

Command for PP+DDP, e.g. for 4 qaic devices (1 Ultra) with 2 pipeline stages:
QAIC_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node 2 -m QEfficient.cloud.finetune --device qaic --enable_ddp --enable_pp --num_pp_stages 2 --dist_backend qccl
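For the PP+DDP launch above, each torchrun rank drives one DDP replica that spans a contiguous block of pipeline-stage devices. A minimal sketch of that mapping (hypothetical helper, mirroring the `rank * num_pp_stages` device numbering used in the PR's device map):

```python
def devices_for_rank(rank, num_pp_stages):
    """Devices owned by one DDP replica's pipeline stages.

    Each DDP rank drives a contiguous block of num_pp_stages devices,
    so with 4 visible devices and 2 stages, torchrun launches 2 ranks:
    rank 0 -> devices [0, 1], rank 1 -> devices [2, 3].
    """
    first = rank * num_pp_stages
    return list(range(first, first + num_pp_stages))
```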

Signed-off-by: Mamta Singh <[email protected]>
@quic-mamta quic-mamta marked this pull request as draft May 8, 2025 07:55
@quic-mamta quic-mamta self-assigned this May 8, 2025
@quic-mamta quic-mamta changed the title Enable PP+DDP [QEff Finetune]: Enable PP+DDP May 8, 2025
@quic-mamta quic-mamta requested review from vbaddi and quic-swatia May 8, 2025 07:58
@quic-mamta quic-mamta force-pushed the pp_ddp branch 2 times, most recently from e8b1da7 to df36ae1 Compare May 8, 2025 08:34
@quic-mamta quic-mamta force-pushed the pp_ddp branch 8 times, most recently from 3ca1229 to 53ff3c4 Compare May 11, 2025 19:37
@quic-meetkuma (Contributor) left a comment:

Good work, Mamta! Please address the comments. Let us discuss offline if anything is confusing.


model.to(train_config.device)
optimizer = optim.AdamW(model.parameters(), lr=train_config.lr, weight_decay=train_config.weight_decay)
# model.to(train_config.device)

Will this commented-out line be required for the non-(PP+DDP) use case?

scheduler = StepLR(optimizer, step_size=1, gamma=train_config.gamma)
if train_config.enable_ddp:
model = nn.parallel.DistributedDataParallel(model, device_ids=[dist.get_rank()])
model = nn.parallel.DistributedDataParallel(model) # , device_ids=[dist.get_rank()])

Why did we remove device_ids in the DDP case? Because we are using device_map now?

)
print(model.hf_device_map)

If this is just for debugging, please remove it. If we actually want to show the user which part of the model is distributed to which device, then add some DEBUG logs about the split. That would help the user debug easily and make our tool less of a black box. :)

@@ -99,6 +99,8 @@ class TrainConfig:
# profiler_dir: str = "PATH/to/save/profiler/results" # will be used if using profiler

# dist-related
enable_pp: bool = False

I think this support is only added for decoder-style models, so this needs to be properly documented. Maybe we can share some numerical guidance as well, e.g. if the user's model is larger than, say, 8B, they may need 4 PP stages; if it is larger than 30B, they may need 16 PP stages.

getattr(torch, torch_device.type).set_device(dist.get_rank())
if train_config.enable_pp:
assert dist.get_world_size() % train_config.num_pp_stages == 0, (
"total available devices should be multiple of number of pipeline stages"

"Total" instead of "total", and a full stop at the end.


Also, can we inform the user that:
if dist.get_world_size() // train_config.num_pp_stages == 1, this will be pure PP;
if dist.get_world_size() // train_config.num_pp_stages > 1, this will actually be PP+DDP.

This might help make our system idiot-proof.
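The suggested check could be sketched as follows (hypothetical helper, not in the PR; assumes dist.get_world_size() counts all participating devices, as the comment's formula implies):

```python
def parallel_mode(world_size, num_pp_stages):
    """Classify the run as pure PP or PP+DDP.

    Assumes world_size counts all participating devices; dividing by
    the stages per replica gives the number of DDP replicas.
    """
    replicas = world_size // num_pp_stages
    return "pure PP" if replicas == 1 else "PP+DDP"
```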


Also, we need another assert condition:
assert dist.get_world_size() * train_config.num_pp_stages == total_available_devices

- This device map structure is verified for llama models only.
"""
device_map = {
"model.embed_tokens": rank * num_pp_stages,

Please add some explanation of why these particular layers are mapped to a particular device (L64 to L67).

"model.rotary_emb": rank * num_pp_stages + (num_pp_stages - 1),
}
n_layer_per_stage = math.ceil(num_layers / num_pp_stages)
for j in range(num_pp_stages):

Please add thorough documentation for this double for loop. It is difficult to understand without working through a case; better to add an example and explain with it.
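A hedged reconstruction of the device map builder from the diff fragments above (the PR notes this structure is verified for llama models only; the exact loop bounds are an assumption filled in from the surrounding snippets):

```python
import math

def get_device_map(rank, num_pp_stages, num_layers):
    """Build a per-rank device map for pipeline stages.

    Hypothetical reconstruction from the PR's snippets. Each DDP rank
    owns devices [rank * num_pp_stages, ..., + num_pp_stages - 1]:
    embed_tokens goes on the rank's first device, norm/rotary_emb on
    its last, and decoder layers are split evenly across the stages.

    Example: rank=1, num_pp_stages=2, num_layers=4 places
    embed_tokens and layers 0-1 on device 2, layers 2-3 and norm
    on device 3.
    """
    first_device = rank * num_pp_stages
    last_device = first_device + num_pp_stages - 1
    device_map = {
        "model.embed_tokens": first_device,
        "model.norm": last_device,
        "model.rotary_emb": last_device,
    }
    n_layer_per_stage = math.ceil(num_layers / num_pp_stages)
    # Outer loop walks the pipeline stages; inner loop assigns that
    # stage's contiguous slice of decoder layers to the stage's device.
    for j in range(num_pp_stages):
        for i in range(n_layer_per_stage * j,
                       min(n_layer_per_stage * (j + 1), num_layers)):
            device_map[f"model.layers.{i}"] = first_device + j
    return device_map
```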

num_layers = get_num_layers_from_config(model_config)
device_map = get_device_map(rank, train_config.num_pp_stages, num_layers)
else:
device_map = "auto"

Does "auto" work well for both the DDP-only use case and the single-device use case?

"model.norm": rank * num_pp_stages + (num_pp_stages - 1),
"model.rotary_emb": rank * num_pp_stages + (num_pp_stages - 1),
}
n_layer_per_stage = math.ceil(num_layers / num_pp_stages)

Suggestion: use np.ceil so that no new module needs to be imported.
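One consideration when weighing this suggestion (a quick check, not part of the PR): math.ceil returns a plain int, while np.ceil returns a NumPy float, so the latter would need an int() cast before use as a loop bound in range():

```python
import math
import numpy as np

half_up = 7 / 2  # 3.5

# math.ceil yields a plain int, directly usable as a loop bound.
assert math.ceil(half_up) == 4
assert isinstance(math.ceil(half_up), int)

# np.ceil yields a numpy float64; range() would reject it
# without an explicit int() cast.
assert float(np.ceil(half_up)) == 4.0
assert not isinstance(np.ceil(half_up), int)
```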
