4D parallelism combines four strategies to train massive models across many GPUs:

- Data parallelism – split data batches across GPUs.
- Tensor parallelism – split large weight matrices across GPUs.
- Pipeline parallelism – split model layers into stages across GPUs.
- Sequence/Expert parallelism – split token sequences (or experts in MoE models) across GPUs.

Together, they balance memory and compute so trillion-parameter models can be trained efficiently (see the sketch below).
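
To make the composition concrete, here is a minimal, framework-agnostic Python sketch of the bookkeeping every 4D setup performs: decomposing a flat GPU rank into one coordinate per parallelism axis. The axis sizes and the `rank_to_coords` helper are hypothetical, chosen only for illustration.

```python
# Hypothetical axis sizes; their product must equal the total GPU count.
# 4-way data x 2-way pipeline x 2-way tensor x 2-way expert = 32 GPUs.
DATA, PIPELINE, TENSOR, EXPERT = 4, 2, 2, 2

def rank_to_coords(rank: int) -> dict:
    """Map a global GPU rank to (data, pipeline, tensor, expert) coordinates.

    The innermost axis varies fastest, mirroring the common convention of
    keeping tensor-parallel peers close together (e.g. on the same node)
    because they communicate most frequently.
    """
    expert, rank = rank % EXPERT, rank // EXPERT
    tensor, rank = rank % TENSOR, rank // TENSOR
    pipeline, rank = rank % PIPELINE, rank // PIPELINE
    data = rank % DATA
    return {"data": data, "pipeline": pipeline,
            "tensor": tensor, "expert": expert}

if __name__ == "__main__":
    # Each of the 32 ranks gets a unique coordinate tuple; GPUs sharing a
    # coordinate along one axis form that axis's communication group.
    for r in (0, 1, 7, 31):
        print(r, rank_to_coords(r))
```

Each GPU then holds only its shard along every axis: its slice of the batch, its slice of each weight matrix, its pipeline stage's layers, and its subset of sequences or experts.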