DDP: All devices get the same data #16548
Reply excerpts:

> This only gets added if you set …

> At the very beginning. Typically before you do anything with random number generators, data, model, training, etc.

> No, each GPU should only see 1/N of the data, where N is the number of GPUs. If this isn't the case, it means the distributed sampler wasn't applied.
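For reference, a minimal sketch (plain PyTorch, no Lightning) of the 1/N split a `DistributedSampler` performs; passing `num_replicas` and `rank` explicitly avoids needing an initialized process group:

```python
from torch.utils.data.distributed import DistributedSampler

dataset = list(range(8))  # stand-in for any map-style dataset (only len() is used)

for rank in range(2):  # simulate world_size = 2
    sampler = DistributedSampler(dataset, num_replicas=2, rank=rank, shuffle=False)
    print(f"rank {rank} sees indices {list(sampler)}")

# rank 0 sees indices [0, 2, 4, 6]
# rank 1 sees indices [1, 3, 5, 7]
```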
Hi!

I am attempting to implement multi-GPU training for our single-cell genomics model, written with Pyro / scvi-tools, a collaboration between the Stegle, Bayraktar and Teichmann groups (Wellcome Sanger Institute, DKFZ, EMBL, including @macwiatrak @gtca), as well as with @adamgayoso (scvi-tools). This project would also help scvi-tools (a single-cell genomics modelling project) provide multi-GPU training for all of its models.

Our current model uses a custom Dataset and BatchSampler (map-style) to enable loading various variables from an anndata object using both `obs` and `var` indices. This is a reasonably complex project with many moving parts, and given that I am new to Lightning and multi-GPU training, it is hard for me to produce simpler reproducible examples. `DistributedSamplerWrapper` makes sense for our application, so any suggestions on what is going on and how to fix these issues would be great.

Following on from the discussion in the parameter loading issue here @awaelchli, here is a more detailed description:
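To make the setup concrete, here is a simplified sketch of our data-loading pattern; `AnnDataset` and `ObsBatchSampler` are hypothetical stand-ins, not the actual model code:

```python
import torch
from torch.utils.data import DataLoader, Dataset, Sampler

class AnnDataset(Dataset):
    """Map-style dataset over the observations (rows) of an AnnData-like object."""

    def __init__(self, n_obs: int):
        self.n_obs = n_obs

    def __len__(self):
        return self.n_obs

    def __getitem__(self, idx):
        # In the real model this slices obs/var variables out of the anndata object.
        return {"obs_idx": idx}

class ObsBatchSampler(Sampler):
    """Yields lists of obs indices; this is the custom batch sampler that
    Lightning is expected to wrap in DistributedSamplerWrapper."""

    def __init__(self, n_obs: int, batch_size: int, randomise_batches: bool = True):
        self.n_obs = n_obs
        self.batch_size = batch_size
        self.randomise_batches = randomise_batches

    def __iter__(self):
        order = torch.randperm(self.n_obs) if self.randomise_batches else torch.arange(self.n_obs)
        for i in range(0, self.n_obs, self.batch_size):
            yield order[i:i + self.batch_size].tolist()

    def __len__(self):
        return (self.n_obs + self.batch_size - 1) // self.batch_size

loader = DataLoader(AnnDataset(1000), batch_sampler=ObsBatchSampler(1000, batch_size=128))
```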
Problem 1 (the main problem): All devices get the same data, suggesting that `DistributedSamplerWrapper` does not select a different subset of the data on each rank. After reading #7186 and other related issues, I still do not understand what the correct setup is to address this.
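One way to check this (a sketch, assuming the batch carries its `obs_idx` as in the sketch above): print the indices each rank receives, and if every rank prints the same indices, the distributed sampler was not applied.

```python
import pytorch_lightning as pl

class DebugModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        # Print which obs indices this rank receives; identical output on
        # every rank means the data is not being sharded.
        if batch_idx == 0:
            print(f"global_rank={self.global_rank} first obs_idx: {batch['obs_idx'][:8].tolist()}")
        # ... actual loss computation goes here
```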
Problem 2: During `trainer.fit()`, `sampler.shuffle` is set to `False` for the training batch sampler (my sampler, not `DistributedSamplerWrapper`). This can be worked around by renaming the argument (e.g. to `randomise_batches`); however, Problem 1 still holds and all GPUs see the same data. This line sets `shuffle` to `False` for all samplers except `RandomSampler`. The `DistributedSamplerWrapper` docs say this should not happen, but maybe it does [bug?].

Also, `self.trainer.train_dataloaders` does not exist before or after `trainer.fit`. We use a `pl.LightningDataModule` with `.setup()` and `.train_dataloader()` methods.

Also, the lack of debug messages about the seed suggests that `worker_init_fn=pl_worker_init_function` is never called, i.e. it is never added. If I add it manually, I see that all workers on all ranks have the same seed.
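The manual check looks roughly like this (`debug_worker_init_fn` is a diagnostic sketch, not part of Lightning):

```python
import torch
import torch.distributed as dist

def debug_worker_init_fn(worker_id: int) -> None:
    # torch.initial_seed() inside a worker reports the seed the DataLoader
    # assigned to that worker; it should differ across workers and ranks.
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"rank={rank} worker={worker_id} seed={torch.initial_seed()}")

# DataLoader(..., num_workers=4, worker_init_fn=debug_worker_init_fn)
```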
Problem 3: I am not using `DistributedSampler`; however, setting `replace_sampler_ddp=False` does not raise any issues, errors or warnings. Setting `replace_sampler_ddp=False` and … is the only way I can get different GPU devices to see different data batches. As far as I understand, this is not correct, because each process still goes through the full set of data batches, so every observation is seen by every process.
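For illustration, the kind of manual sharding I would expect to need with `replace_sampler_ddp=False` (a sketch building on the hypothetical `ObsBatchSampler` above; for this to be a true partition, all ranks must shuffle with the same seed, and `__len__` would also need to shrink to the per-rank batch count):

```python
import torch.distributed as dist

class RankShardedBatchSampler(ObsBatchSampler):
    """Keep only every world_size-th batch on each rank, so each process
    sees 1/N of the batches instead of the full epoch."""

    def __iter__(self):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        for i, batch in enumerate(super().__iter__()):
            if i % world_size == rank:
                yield batch
```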
Problem 4: Where in the training script should `seed_everything(1, workers=True)` be called? Currently, it is called when the scvi-tools package is imported, at the very start of the script.
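For context, a simplified sketch of the current layout; the names below are placeholders:

```python
from pytorch_lightning import seed_everything

seed_everything(1, workers=True)  # currently triggered by importing scvi-tools, at the top of the script

# ... only afterwards are the data module, model and trainer constructed:
# dm = MyDataModule(...)
# model = MyModule(...)
# trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")
# trainer.fit(model, datamodule=dm)
```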