DDP: All devices get the same data #16548
Reply excerpts:

> This only gets added if you set …

> At the very beginning. Typically before you do anything with random number generators, data, model, training, etc.

> No, each GPU should only see 1/N of the data, where N is the number of GPUs. If this isn't the case, it means the distributed sampler wasn't applied.
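For reference, a minimal sketch (plain PyTorch, no Lightning) of the 1/N split a `DistributedSampler` performs; passing `num_replicas` and `rank` explicitly avoids needing an initialized process group:

```python
from torch.utils.data.distributed import DistributedSampler

dataset = list(range(8))  # stand-in for any map-style dataset (only len() is used)

for rank in range(2):  # simulate world_size = 2
    sampler = DistributedSampler(dataset, num_replicas=2, rank=rank, shuffle=False)
    print(f"rank {rank} sees indices {list(sampler)}")

# rank 0 sees indices [0, 2, 4, 6]
# rank 1 sees indices [1, 3, 5, 7]
```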
Hi!

I am attempting to implement multi-GPU training for our single-cell genomics model, written with Pyro / scvi-tools, a collaboration between the Stegle, Bayraktar and Teichmann groups (Wellcome Sanger Institute, DKFZ, EMBL, including @macwiatrak @gtca), as well as with @adamgayoso (scvi-tools). This project would also help scvi-tools (a single-cell genomics modelling project) provide multi-GPU training for all of its models.

Our current model uses a custom Dataset and BatchSampler (map-style) to enable loading various variables from an anndata object using both `obs` and `var` indices. This is a reasonably complex project with many moving parts, and given that I am new to Lightning and multi-GPU training, it is hard for me to produce simpler reproducible examples. `DistributedSamplerWrapper` makes sense for our application, so any suggestions on what is going on and how to fix these issues would be great.

Following on from the discussion in the parameter loading issue here @awaelchli, here is a more detailed description:
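To make the setup concrete, here is a simplified sketch of our data-loading pattern; `AnnDataset` and `ObsBatchSampler` are hypothetical stand-ins, not the actual model code:

```python
import torch
from torch.utils.data import DataLoader, Dataset, Sampler

class AnnDataset(Dataset):
    """Map-style dataset over the observations (rows) of an AnnData-like object."""

    def __init__(self, n_obs: int):
        self.n_obs = n_obs

    def __len__(self):
        return self.n_obs

    def __getitem__(self, idx):
        # In the real model this slices obs/var variables out of the anndata object.
        return {"obs_idx": idx}

class ObsBatchSampler(Sampler):
    """Yields lists of obs indices; this is the custom batch sampler that
    Lightning is expected to wrap in DistributedSamplerWrapper."""

    def __init__(self, n_obs: int, batch_size: int, randomise_batches: bool = True):
        self.n_obs = n_obs
        self.batch_size = batch_size
        self.randomise_batches = randomise_batches

    def __iter__(self):
        order = torch.randperm(self.n_obs) if self.randomise_batches else torch.arange(self.n_obs)
        for i in range(0, self.n_obs, self.batch_size):
            yield order[i:i + self.batch_size].tolist()

    def __len__(self):
        return (self.n_obs + self.batch_size - 1) // self.batch_size

loader = DataLoader(AnnDataset(1000), batch_sampler=ObsBatchSampler(1000, batch_size=128))
```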
Problem 1 (the main problem): All devices get the same data, suggesting that `DistributedSamplerWrapper` does not select a different subset of the data on each rank. After reading #7186 and other related issues, I still do not understand what the correct setup is to address this.
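One way to check this (a sketch, assuming the batch carries its `obs_idx` as in the sketch above): print the indices each rank receives, and if every rank prints the same indices, the distributed sampler was not applied.

```python
import pytorch_lightning as pl

class DebugModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        # Print which obs indices this rank receives; identical output on
        # every rank means the data is not being sharded.
        if batch_idx == 0:
            print(f"global_rank={self.global_rank} first obs_idx: {batch['obs_idx'][:8].tolist()}")
        # ... actual loss computation goes here
```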
Problem 2: During `trainer.fit()`, `sampler.shuffle` is set to `False` for the training batch sampler (my sampler, not `DistributedSamplerWrapper`). This can be worked around by renaming the argument (e.g. to `randomise_batches`); however, Problem 1 still holds and all GPUs see the same data. This line sets `shuffle` to `False` for all samplers except `RandomSampler`. The `DistributedSamplerWrapper` docs say this should not happen, but maybe it does [bug?].

Also, `self.trainer.train_dataloaders` does not exist before or after `trainer.fit`. We use a `pl.LightningDataModule` with `.setup()` and `.train_dataloader()` methods.

Also, the lack of debug messages about the seed suggests that `worker_init_fn=pl_worker_init_function` is never called, i.e. it is never added. If I add it manually, I see that all workers on all ranks have the same seed.
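The manual check looks roughly like this (`debug_worker_init_fn` is a diagnostic sketch, not part of Lightning):

```python
import torch
import torch.distributed as dist

def debug_worker_init_fn(worker_id: int) -> None:
    # torch.initial_seed() inside a worker reports the seed the DataLoader
    # assigned to that worker; it should differ across workers and ranks.
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"rank={rank} worker={worker_id} seed={torch.initial_seed()}")

# DataLoader(..., num_workers=4, worker_init_fn=debug_worker_init_fn)
```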
Problem 3: I am not using `DistributedSampler`; however, setting `replace_sampler_ddp=False` does not raise any issues, errors or warnings. Setting `replace_sampler_ddp=False` and … is the only way I can get different GPU devices to see different data batches. As far as I understand, this is not correct, because each process still goes through the full set of data batches, so every observation is seen by every process.
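For illustration, the kind of manual sharding I would expect to need with `replace_sampler_ddp=False` (a sketch building on the hypothetical `ObsBatchSampler` above; for this to be a true partition, all ranks must shuffle with the same seed, and `__len__` would also need to shrink to the per-rank batch count):

```python
import torch.distributed as dist

class RankShardedBatchSampler(ObsBatchSampler):
    """Keep only every world_size-th batch on each rank, so each process
    sees 1/N of the batches instead of the full epoch."""

    def __iter__(self):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        for i, batch in enumerate(super().__iter__()):
            if i % world_size == rank:
                yield batch
```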
Problem 4: Where in the training script should `seed_everything(1, workers=True)` be called? Currently, it is called when the scvi-tools package is imported, at the very start of the script.
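For context, a simplified sketch of the current layout; the names below are placeholders:

```python
from pytorch_lightning import seed_everything

seed_everything(1, workers=True)  # currently triggered by importing scvi-tools, at the top of the script

# ... only afterwards are the data module, model and trainer constructed:
# dm = MyDataModule(...)
# model = MyModule(...)
# trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")
# trainer.fit(model, datamodule=dm)
```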