Lightning-AI / pytorch-lightning · Discussions
Sorted by latest activity.

🤖 DDP / multi-GPU / multi-node
Any questions about DDP or multi-GPU topics.

🤖 Multi-GPU DDP - How the dataset is distributed across the GPUs
Labels: data handling (generic data-related topic), strategy: ddp (DistributedDataParallel)
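
Under the `ddp` strategy, Lightning replaces the DataLoader's sampler with a `torch.utils.data.DistributedSampler`, so each rank iterates over a disjoint ~1/world_size shard of the dataset per epoch. A minimal sketch of the equivalent manual setup (the toy dataset and the hard-coded `num_replicas`/`rank` are illustrative):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Toy dataset standing in for your own Dataset.
dataset = TensorDataset(torch.arange(100).float())

# What Lightning sets up for you under strategy="ddp": on rank r of
# world_size ranks, the sampler yields a disjoint ~1/world_size shard.
sampler = DistributedSampler(dataset, num_replicas=4, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

# Re-seed the shard shuffling each epoch (Lightning also does this).
for epoch in range(2):
    sampler.set_epoch(epoch)
    for (batch,) in loader:
        pass  # 100 samples / 4 ranks -> ~25 per rank, batched by 8
```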

🤖 DDP - Synchronization on DGX - Use CPUs or GPU-to-GPU interconnect
Labels: accelerator: cuda (Compute Unified Device Architecture GPU)
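
On a DGX the usual answer is to let the NCCL backend drive gradient all-reduce over the GPU-to-GPU interconnect (NVLink/NVSwitch) rather than staging through the CPUs with `gloo`. A sketch, assuming a Lightning version whose `DDPStrategy` accepts `process_group_backend` (imports use the unified `lightning` package; older releases use `pytorch_lightning` instead, and the device count is a placeholder):

```python
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import DDPStrategy

# "nccl" performs collectives directly between GPUs over NVLink/
# NVSwitch; "gloo" would route the synchronization through the CPUs.
strategy = DDPStrategy(process_group_backend="nccl")

trainer = Trainer(accelerator="gpu", devices=8, strategy=strategy)
# trainer.fit(model)  # model: your LightningModule
```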

🤖 self.log twice problem
Labels: logging (related to the `LoggerConnector` and `log()`)
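
The thread body isn't reproduced here, but a frequent source of duplicated or diverging `self.log` values under DDP is that every rank logs its own local value; passing `sync_dist=True` reduces across ranks first. A sketch of the pattern inside a LightningModule (`compute_loss` is a hypothetical helper):

```python
def validation_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)  # hypothetical helper
    # Each DDP rank runs this hook; with sync_dist=True the logged
    # value is reduced (mean by default) across ranks before it is
    # handed to the logger, instead of being recorded per rank.
    self.log("val_loss", loss, sync_dist=True)
    return loss
```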

🤖 What's the relationship between the number of GPUs and batch size (global batch size)?
Labels: distributed (generic distributed-related topic), accelerator: cuda (Compute Unified Device Architecture GPU)
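
For reference, under DDP every rank draws its own batch, so the global (effective) batch size per optimizer step is the product of the per-device batch size, the device and node counts, and any gradient accumulation. The numbers below are illustrative:

```python
per_device_batch_size = 32   # what each GPU's DataLoader yields
devices_per_node = 8
num_nodes = 2
accumulate_grad_batches = 1  # Trainer(accumulate_grad_batches=...)

global_batch_size = (per_device_batch_size
                     * devices_per_node
                     * num_nodes
                     * accumulate_grad_batches)
print(global_batch_size)  # 512 samples per optimizer step
```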

🤖 How to carry out the validation loop on one single GPU
Labels: accelerator: cuda (Compute Unified Device Architecture GPU), trainer: validate
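
One commonly suggested approach (a sketch, not necessarily this thread's resolution): fit with DDP, then run `trainer.validate` with a separate single-device Trainer so the `DistributedSampler` cannot pad or duplicate samples and skew the metrics. `model` and `datamodule` are placeholders for your LightningModule and LightningDataModule:

```python
from lightning.pytorch import Trainer

fit_trainer = Trainer(accelerator="gpu", devices=4, strategy="ddp")
fit_trainer.fit(model, datamodule=datamodule)

# A fresh single-GPU Trainer sees the full, unsharded val set, so
# metrics are exact rather than averaged over padded shards.
val_trainer = Trainer(accelerator="gpu", devices=1)
val_trainer.validate(model, datamodule=datamodule)
```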

🤖 DynamicBatchSampler
Labels: data handling (generic data-related topic)
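
For context, a dynamic batch sampler groups sample indices until a size budget is hit instead of using a fixed batch size, and is passed to the DataLoader via `batch_sampler`. A self-contained toy sketch (the class and the length-based budget are illustrative; under DDP you would additionally need to shard the indices per rank yourself):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

class DynamicBatchSampler:
    """Toy sampler: packs indices until a per-batch size budget is hit."""

    def __init__(self, sizes, max_total_size):
        self.sizes = sizes                  # e.g. sequence lengths
        self.max_total_size = max_total_size

    def __iter__(self):
        batch, total = [], 0
        for idx, size in enumerate(self.sizes):
            if batch and total + size > self.max_total_size:
                yield batch
                batch, total = [], 0
            batch.append(idx)
            total += size
        if batch:
            yield batch

sizes = [5, 9, 3, 7, 2, 8]                  # per-sample "lengths"
dataset = TensorDataset(torch.arange(len(sizes)))
loader = DataLoader(dataset, batch_sampler=DynamicBatchSampler(sizes, 12))
print([batch for (batch,) in loader])       # variable-size batches
```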

🤖 Run Trainer.fit multiple times under DDP mode
Labels: strategy: ddp (DistributedDataParallel)
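
One pattern often suggested in this situation (a hedged sketch, not necessarily this thread's resolution): rather than reusing the same Trainer, build a fresh Trainer per fit call and resume from a checkpoint written by the previous run. `model` is a placeholder for your LightningModule and the checkpoint path is illustrative:

```python
from lightning.pytorch import Trainer

first = Trainer(accelerator="gpu", devices=4, strategy="ddp", max_epochs=5)
first.fit(model)

# A fresh Trainer avoids carrying over state from the first fit;
# ckpt_path restores weights, optimizer state, and epoch counters.
second = Trainer(accelerator="gpu", devices=4, strategy="ddp", max_epochs=10)
second.fit(model, ckpt_path="checkpoints/last.ckpt")
```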

🤖 Implementing a Metric and including an nn.Module doesn't work correctly in parallel
Labels: strategy: ddp (DistributedDataParallel), accelerator: cuda (Compute Unified Device Architecture GPU)
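
For context, `torchmetrics.Metric` only synchronizes state that is registered through `add_state()`; tensors held on a plain `nn.Module` attribute stay rank-local and are never reduced. A toy sketch:

```python
import torch
from torchmetrics import Metric

class SumOfSquares(Metric):
    """Toy metric whose state is reduced across DDP ranks."""

    def __init__(self):
        super().__init__()
        # dist_reduce_fx tells torchmetrics how to merge each rank's
        # local state when compute() runs under DDP.
        self.add_state("total", default=torch.tensor(0.0),
                       dist_reduce_fx="sum")

    def update(self, x: torch.Tensor) -> None:
        self.total += (x ** 2).sum()

    def compute(self) -> torch.Tensor:
        return self.total

metric = SumOfSquares()
metric.update(torch.tensor([1.0, 2.0]))
print(metric.compute())  # tensor(5.)
```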

🤖 ModelCheckpoint in DDP
Labels: callback: model checkpoint, strategy: ddp (DistributedDataParallel)
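
For reference, `ModelCheckpoint` works under DDP out of the box: every rank evaluates the callback, but only rank 0 writes the file. The usual caveat is to log the monitored quantity with `sync_dist=True` so all ranks agree on its value. A sketch (the monitored key assumes you log `val_loss` yourself):

```python
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint

# Assumes self.log("val_loss", ..., sync_dist=True) in validation_step,
# so every rank monitors the same reduced value.
checkpoint_cb = ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=1)

trainer = Trainer(
    accelerator="gpu",
    devices=4,
    strategy="ddp",
    callbacks=[checkpoint_cb],
)
# trainer.fit(model)  # model: your LightningModule
```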

🤖 Is there any way to cache the data when training with 'ddp'?
Labels: data handling (generic data-related topic)
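
One way to cache safely under `ddp` is to split the work across the two LightningDataModule hooks: `prepare_data()` runs in a single process (so it can build the cache exactly once), while `setup()` runs in every DDP process and should only read it. A sketch with hypothetical helpers:

```python
from lightning.pytorch import LightningDataModule

class CachedDataModule(LightningDataModule):
    """Sketch: write the cache once, read it from every DDP process."""

    def prepare_data(self):
        # Runs in one process only, before the DDP workers need the
        # data: a safe place to download/preprocess and write a cache.
        build_cache_on_disk()  # hypothetical helper

    def setup(self, stage=None):
        # Runs in every DDP process: read-only access to the cache.
        self.dataset = load_cache_from_disk()  # hypothetical helper
```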