These scripts share the same toy model and random dataset so you can focus on the differences between the parallelism strategies.
- `common.py`: shared model, dataset, and config.
- `ddp_min.py`: DistributedDataParallel baseline.
- `fsdp_min.py`: FullyShardedDataParallel (parameter sharding).
- `zero_min.py`: DDP + `ZeroRedundancyOptimizer` (ZeRO stage-1 style optimizer sharding).
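The shared pieces might look something like this (a minimal sketch; the model architecture, dataset sizes, and function names here are illustrative assumptions, not necessarily what `common.py` actually contains):

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset

class ToyModel(nn.Module):
    """Illustrative toy MLP; layer sizes are placeholder assumptions."""

    def __init__(self, in_dim=32, hidden=64, out_dim=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def make_dataset(n=1024, in_dim=32, out_dim=10, seed=0):
    # Random features and labels with a fixed seed, so every rank
    # builds identical data without needing a download.
    g = torch.Generator().manual_seed(seed)
    x = torch.randn(n, in_dim, generator=g)
    y = torch.randint(0, out_dim, (n,), generator=g)
    return TensorDataset(x, y)
```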
- Python 3.10+
- PyTorch with CUDA + NCCL
- Multi-GPU machine
Run each script on 4 GPUs:

```bash
cd parallel_minimal
torchrun --standalone --nproc_per_node=4 ddp_min.py
torchrun --standalone --nproc_per_node=4 fsdp_min.py
torchrun --standalone --nproc_per_node=4 zero_min.py
```

For a quick single-GPU smoke test:

```bash
cd parallel_minimal
torchrun --standalone --nproc_per_node=1 ddp_min.py
torchrun --standalone --nproc_per_node=1 fsdp_min.py
torchrun --standalone --nproc_per_node=1 zero_min.py
```

- DDP:
  - Full model replica on each GPU.
  - Gradients synchronized every step.
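The DDP pattern in skeleton form (a single-process CPU/gloo sketch, not `ddp_min.py` itself — the real script would use the NCCL backend on GPUs, and the model, optimizer, and port here are placeholders):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE; default
    # them here so the sketch also runs as a plain single process.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    rank = int(os.environ.get("RANK", "0"))
    world = int(os.environ.get("WORLD_SIZE", "1"))
    # gloo so this runs on CPU; the GPU script would pass "nccl".
    dist.init_process_group("gloo", rank=rank, world_size=world)

    model = torch.nn.Linear(8, 2)   # placeholder model
    ddp_model = DDP(model)          # full replica on every rank
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    for _ in range(3):
        x = torch.randn(16, 8)
        loss = ddp_model(x).pow(2).mean()
        loss.backward()             # gradients all-reduced here
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()
    return loss.item()

# Under torchrun, each spawned process would call main().
```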
- FSDP:
  - Parameters and gradients are sharded across GPUs.
  - Better memory scaling at the cost of extra communication.
- ZeRO (here, stage-1 style):
  - Model is still replicated (as in DDP), but optimizer states are sharded across ranks.
  - Memory savings come mainly from the optimizer states.
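The ZeRO stage-1 combination in skeleton form (a single-process CPU/gloo sketch; model, learning rate, and port are placeholders, and `zero_min.py` would run this pattern with NCCL on GPUs):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29503")
    dist.init_process_group(
        "gloo",
        rank=int(os.environ.get("RANK", "0")),
        world_size=int(os.environ.get("WORLD_SIZE", "1")),
    )

    # The model itself is fully replicated, exactly as in plain DDP...
    model = DDP(torch.nn.Linear(8, 2))
    # ...but each rank stores optimizer state (e.g. Adam moments) only
    # for its own shard of the parameters, then broadcasts updates.
    opt = ZeroRedundancyOptimizer(
        model.parameters(),
        optimizer_class=torch.optim.Adam,
        lr=1e-3,
    )

    x = torch.randn(16, 8)
    loss = model(x).pow(2).mean()
    loss.backward()
    opt.step()

    dist.destroy_process_group()
    return loss.item()

# Under torchrun, each spawned process would call main().
```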
`zero_min.py` uses PyTorch's `ZeroRedundancyOptimizer`, which is conceptually ZeRO stage-1.
- If you want ZeRO stage-2/3 (gradient/parameter sharding), use DeepSpeed or FSDP-style full sharding.