How to properly split train/val/test sets when using DDP and multiple GPUs #13343
-
I trained a model using a single GPU. Now I am trying to use 4 GPUs with DDP. The problem is that the code is executed 4 times, once per process, which breaks my split of the dataset into training, validation, and test sets: each process can end up with a different split. For example, suppose I have 4 records with IDs
How do you suggest fixing this? How can I run the "splitting" only once? It is fine that every execution of the code produces a different split; I just want the split to be consistent across all the GPUs.
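The divergence described above can be reproduced without any GPUs: if each DDP rank performs its own unseeded shuffle, the ranks end up holding different splits. A minimal stdlib sketch (the dataset size, rank count, and `split_ids` helper are illustrative, not part of any Lightning API):

```python
import random

def split_ids(ids, train_frac=0.75, seed=None):
    """Shuffle the IDs and split into train/val lists.

    With seed=None each call (i.e. each simulated rank) uses its
    own random state, so the resulting splits generally differ.
    """
    rng = random.Random(seed)   # fresh, unseeded RNG per "rank"
    shuffled = ids[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

ids = list(range(100))
# Simulate 4 ranks each running the splitting code independently:
splits = [split_ids(ids) for _ in range(4)]
all_equal = all(s == splits[0] for s in splits)
print("unseeded ranks agree:", all_equal)
```

With 100 IDs and 4 independent shuffles, `all_equal` is `False` in practice: every rank trains and validates on a different partition of the same data, which is exactly the bug reported here.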
-
Did you set `seed_everything(seed)` at the beginning of your `main`?
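Seeding before the split makes every rank draw the same permutation, so the split is still random from run to run but identical across GPUs. A sketch of the seeded version, using the stdlib as a stand-in for what `seed_everything(seed)` does to the global RNG state (the seed value and `split_ids` helper are illustrative):

```python
import random

SEED = 42  # assumed fixed seed; every rank must use the same value

def split_ids(ids, train_frac=0.75, seed=SEED):
    """Deterministic split: same seed on every rank => same split."""
    rng = random.Random(seed)
    shuffled = ids[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

ids = list(range(100))
splits = [split_ids(ids) for _ in range(4)]  # simulate 4 DDP ranks
print(all(s == splits[0] for s in splits))  # True: all ranks agree
```

With PyTorch itself, the same idea can be expressed by passing a seeded generator to the split, e.g. `torch.utils.data.random_split(dataset, lengths, generator=torch.Generator().manual_seed(SEED))`. To get a different (but still rank-consistent) split on each run, derive the seed once outside the training processes and pass it in, rather than generating it independently per rank.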