DDP MultiGPU Training doesn't reduce training time #18187

AlejandroTL · 2023-07-28T14:50:57Z

AlejandroTL
Jul 28, 2023

Hello!

I want to do multiGPU training with a model. I have a node with 4 GPUs. Training with just 1 GPU, each epoch takes 9 hours. Training with 4 GPUs, each epoch takes 9 hours. There is no reduction whatsoever nor in the number of batches needed to train nor the time.

The way I am calling the trainer is:

trainer = pl.Trainer(
                    min_epochs=1,
                    max_epochs=100,
                    check_val_every_n_epoch=2, 
                    logger=wandb_logger,
                    accelerator='gpu',
                    devices=-1, 
                    strategy="ddp_find_unused_parameters_true")

Lines I see in the logs when training with 4 GPUs are:

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1

distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3     ]

How can I actually check whether I am indeed working with 4 GPUs? I know that my system can see them.

Finally, I use ddp_find_unused_parameters_true instead of ddp cause I use a torch.nn.Embedding and not in every minibatch I retrieve all indices, which apparently provokes some problems:

RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value `strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with `strategy=DDPStrategy(find_unused_parameters=True)`.

Torch version: '2.0.1+cu117'
Pytorch lightning version: '2.0.6'

I solved this problem but now what happens is that my script gets stuck in the following construction of the DDP process:

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:15094 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to []:15094 (errno: 97 - Address family not supported by protocol).

I have seen some githubs issues with similar problems but no solution found! Any idea?

Thanks!!!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DDP MultiGPU Training doesn't reduce training time #18187

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 0 comments

Select a reply

DDP MultiGPU Training doesn't reduce training time #18187

AlejandroTL Jul 28, 2023

Replies: 0 comments

AlejandroTL
Jul 28, 2023