Milestones

  • 4D parallelism = combining four strategies to train massive models across many GPUs (see the sketch after this milestone):
    - Data parallelism – split data batches across GPUs.
    - Tensor parallelism – split large weight matrices across GPUs.
    - Pipeline parallelism – split model layers into stages across GPUs.
    - Sequence/Expert parallelism – split token sequences (or experts in MoE) across GPUs.

    Together, they balance memory and compute so trillion-parameter models can be trained efficiently.

    No due date
    1/1 issues closed
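
Below is a minimal, hypothetical Python sketch of how a flat set of GPU ranks can be laid out on such a 4D mesh. The mesh sizes (dp=2, pp=2, tp=4, sp=2, i.e. 32 GPUs) and all names here are illustrative assumptions, not taken from any real training configuration.

```python
# Hypothetical sketch: mapping flat GPU ranks onto a 4D parallelism mesh.
# Axis sizes are assumptions for illustration, not a real training setup.

# Mesh axes: data, pipeline, tensor, and sequence/expert parallel degrees.
MESH = {"dp": 2, "pp": 2, "tp": 4, "sp": 2}

WORLD_SIZE = 1
for size in MESH.values():
    WORLD_SIZE *= size  # 2 * 2 * 4 * 2 = 32 GPUs in total


def rank_to_coords(rank: int) -> dict:
    """Decompose a flat rank into (dp, pp, tp, sp) coordinates,
    with the last axis varying fastest (row-major order)."""
    coords = {}
    for name, size in reversed(MESH.items()):
        coords[name] = rank % size
        rank //= size
    return coords


if __name__ == "__main__":
    for rank in range(WORLD_SIZE):
        c = rank_to_coords(rank)
        # GPUs that share (pp, tp, sp) but differ in dp hold replicas of
        # the same model shard and all-reduce gradients with each other;
        # GPUs that share (dp, pp, sp) but differ in tp hold slices of
        # the same weight matrices; and so on for each axis.
        print(f"rank {rank:2d} -> {c}")
```

Real frameworks build an equivalent coordinate grid (e.g. PyTorch's DeviceMesh) and attach a communication group to each axis, so gradients all-reduce along the dp axis, activations flow stage-to-stage along pp, and weight shards are gathered along tp.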