## TorchTitan & TorchComms Composability Testing

### Overview

This folder provides a framework for composability testing with TorchComms and distributed training in TorchTitan. It enables flexible experimentation with distributed communication primitives and various parallelism strategies in PyTorch.

> **TODO:** Additional documentation will be provided once TorchComms is publicly released.

### Quick Start

The following command uses Llama 3 as an example, but it should work with other models as well:

```bash
TEST_BACKEND=nccl TRAIN_FILE=torchtitan.experiments.torchcomms.train CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh
```

### Features

#### Distributed Training Utilities
- Custom communicator backend initialization via `torchcomms.new_comm`
- Composition of TorchComms with `DeviceMesh` via the `torchcomms.init_device_mesh` wrapper API (see the sketch below)
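
The sketch below illustrates how these two entry points might be wired together. Only the function names `torchcomms.new_comm` and `torchcomms.init_device_mesh` come from this folder; the argument names, the backend string, and the mesh shape are assumptions and should be checked against the TorchComms API and the training script in this directory.

```python
# Illustrative sketch only -- the function names come from this README, but the
# argument names and values below are assumptions; check the TorchComms API.
import torch
import torchcomms

# Pick the local CUDA device for this rank (device selection is an assumption).
device = torch.device("cuda", torch.cuda.current_device())

# Create a communicator for the backend selected via TEST_BACKEND (e.g. "nccl").
comm = torchcomms.new_comm("nccl", device=device)  # assumed signature

# Build a DeviceMesh through the TorchComms wrapper so parallelism APIs
# (fully_shard, TP, PP, CP) can consume it like a regular DeviceMesh.
# How `comm` is attached to the mesh is implementation-specific and omitted here.
mesh = torchcomms.init_device_mesh(  # assumed to mirror torch.distributed's init_device_mesh
    "cuda",
    (2, 4),                          # e.g. 2-way data parallel x 4-way tensor parallel
    mesh_dim_names=("dp", "tp"),
)
```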

#### Parallelism Support
Locally tested with the following parallelism strategies (see the `fully_shard` sketch after this list):
- **FSDP** (`fully_shard`) - Fully Sharded Data Parallel
- **TP** - Tensor Parallelism
- **PP** - Pipeline Parallelism
- **CP** - Context Parallelism
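
As a composability example, a model can be sharded with FSDP2's `fully_shard` over the data-parallel dimension of a TorchComms-backed mesh. This is a minimal sketch: `mesh` is assumed to come from `torchcomms.init_device_mesh` as in the sketch above, `ToyModel` is a placeholder, and the `fully_shard` import path may differ across PyTorch versions.

```python
# Minimal sketch: shard a toy model over the "dp" dimension of a
# TorchComms-backed DeviceMesh (`mesh` is assumed from the sketch above).
import torch
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # import path may vary by PyTorch version

class ToyModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.in_proj = nn.Linear(1024, 4096)
        self.out_proj = nn.Linear(4096, 1024)

    def forward(self, x):
        return self.out_proj(torch.relu(self.in_proj(x)))

model = ToyModel().cuda()

# Shard parameters across the data-parallel ranks; other mesh dimensions
# (e.g. "tp") remain available for the other parallelism strategies.
fully_shard(model, mesh=mesh["dp"])
```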

### Roadmap

- [ ] N-D parallelism end-to-end performance and convergence tests
- [ ] Integration and testing with Expert Parallelism
- [ ] Integration and testing with `torch.compile`