
Commit 466e909

reproducible + network
1 parent be999b6 commit 466e909

2 files changed (+97 -4 lines)

throughput/README.md

Lines changed: 28 additions & 4 deletions

In general, maximizing throughput is all about running many experiments and measuring the outcome.

## Crucial reproducibility requirements

The most important requirement for a series of successful experiments is the ability to reproduce the experiment environment again and again while changing only one or a few setup variables.

Therefore, when you try to figure out whether a given change improves performance or makes it worse, you must figure out how to keep everything else stable.

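To make this concrete, here is a minimal, hypothetical sketch of the "change one variable at a time" workflow: everything except the swept setting stays pinned, and each run's measured throughput is recorded for comparison. The script name, flags and output format below are placeholders, not something from this repo:

```
# hypothetical sweep harness: vary a single setting, pin everything else,
# and record each run's measured throughput for later comparison
import csv
import subprocess

micro_batch_sizes = [1, 2, 4, 8]                    # the one variable being changed
fixed_args = ["--tp-size", "4", "--pp-size", "12"]  # placeholder flags - keep identical across runs

with open("throughput_sweep.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["micro_batch_size", "samples_per_sec"])
    for mbs in micro_batch_sizes:
        # "train.py" stands in for whatever launches your actual training run
        result = subprocess.run(
            ["python", "train.py", "--micro-batch-size", str(mbs), *fixed_args],
            capture_output=True, text=True, check=True,
        )
        # assumes the training script prints a final line like "samples_per_sec: 123.4"
        line = [l for l in result.stdout.splitlines() if l.startswith("samples_per_sec:")][-1]
        writer.writerow([mbs, line.split(":")[1].strip()])
```

The exact harness doesn't matter; what matters is that only one knob moves between runs.
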
For example, you need to find a way to keep the network usage from fluctuating. When we were doing performance optimizations for the [108B pre-BLOOM experiments](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr8-104B-wide) it was close to impossible to do, since we were on a shared internode network and the exact same setup would yield different throughput depending on how many other users were using the network. It simply did not work. During BLOOM-176B we were given a dedicated SLURM partition with an isolated network where the only traffic was ours, and doing performance optimization in such an environment was just perfect.

## Network throughput

It's critical to understand your particular model size and framework requirements with regard to network bandwidth, throughput and latency. If you underpay for the network you will end up with idle GPUs, wasting money and time. If you overpay for a very fast network but your GPUs are slow, then again you have wasted money and time.

If your network is very slow, your training is likely to be network-bound, and many improvements to the training setup will not help improve performance.

Here is a simple all-reduce benchmark that you can use to quickly measure the throughput of your internode network:

[all_reduce_bench.py](./all_reduce_bench.py)

Usually benchmarking at least 4 nodes is recommended, but, of course, if you already have access to all the nodes you will be using during the training, benchmark using all of them.

To run it on 4 nodes (assuming 4 GPUs per node and that `$MASTER_ADDR` holds the hostname of one of the nodes; adjust both to your setup):

```
python -m torch.distributed.run --nnodes=4 --nproc_per_node=4 --rdzv_endpoint=$MASTER_ADDR:6000 --rdzv_backend=c10d all_reduce_bench.py
```

Launch the same command on each of the 4 nodes, or let your cluster's launcher (e.g. SLURM's `srun`) do it for you.

You may get results anywhere between 5Gbps and 1600Gbps (as of this writing). The minimal speed needed to avoid being network-bound will depend on your particular training framework, but typically you'd want at least 400Gbps or higher. Though we trained BLOOM on 50Gbps.

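For a feel of what these numbers imply, here is a hedged back-of-envelope sketch (every value below is hypothetical and purely illustrative) of how long a plain data-parallel gradient all-reduce would take at a given measured bus bandwidth:

```
# hedged back-of-envelope: time per step spent in a plain data-parallel
# gradient all-reduce, given a bus bandwidth as reported by all_reduce_bench.py
# (busbw follows the nccl-tests convention: busbw = algbw * 2*(n-1)/n)
n_params   = 176e9   # hypothetical model size in parameters
grad_bytes = 2       # bf16/fp16 gradients (assumption)
n_ranks    = 384     # hypothetical number of GPUs doing pure data parallelism
busbw_gbps = 50      # example measured bus bandwidth in Gbps

payload_bits = n_params * grad_bytes * 8                       # gradient volume each rank must reduce
algbw_bps = busbw_gbps * 1e9 / (2 * (n_ranks - 1) / n_ranks)   # undo the bus-bandwidth scaling
print(f"~{payload_bits / algbw_bps:.0f}s per step in gradient all-reduce, if not overlapped")
```

With these made-up values the estimate lands at roughly two minutes per step, which is one way to see why, on a slower network, the activation-only traffic of the frameworks discussed next is so attractive.
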
Frameworks that shard the weights and optimizer states, like [Deepspeed](https://github.com/microsoft/DeepSpeed) with ZeRO Stage-3, generate a lot more traffic than frameworks like [Megatron-Deepspeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed), which do tensor and pipeline parallelism in addition to data parallelism. The latter only send activations across and thus don't need as much bandwidth, but they are much more complicated to set up and run.

Of course, an efficient framework will overlap communication and compute, so that while one stage is fetching data, the other stage runs its computation in parallel. As long as the communication overhead is smaller than the compute, the network requirements are satisfied and don't have to be outstanding.

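As a minimal illustration of that overlap (a sketch only, not how any particular framework implements it), PyTorch lets you launch the all-reduce asynchronously and wait for it only after independent compute has been queued:

```
# minimal sketch of overlapping communication with compute; assumes an
# already-initialized torch.distributed process group and a CUDA device
import torch
import torch.distributed as dist

grads = torch.randn(1024, 1024, device="cuda")        # stand-in for a gradient bucket
activations = torch.randn(4096, 4096, device="cuda")  # stand-in for independent compute work

handle = dist.all_reduce(grads, async_op=True)  # kick off the reduction without blocking

out = activations @ activations                 # compute that doesn't depend on the reduced grads

handle.wait()                                   # only now block until the communication is done
# from here on it is safe to use `grads` (and `out`)
```

Data-parallel implementations such as PyTorch DDP do essentially this automatically: gradients are bucketed and each bucket's all-reduce is launched while the backward pass is still producing the next one.
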
## Vector and matrix size divisibility

throughput/all_reduce_bench.py

Lines changed: 69 additions & 0 deletions

# this version has been derived from @jeffra's gist: https://gist.github.com/jeffra/b5e80466b4c86be00ea3b6f130fb7a36
# which in turn is derived from https://github.com/NVIDIA/nccl-tests
#
# to do a quick test on 2 gpus of a single node:
# python -m torch.distributed.run --nproc_per_node=2 all_reduce_bench.py
# (see the README for a multi-node launch)
#
# the printed results are already n_gpu-agnostic (i.e. averaged for the world size)

import fcntl
import os
import socket
import time
import torch
import torch.distributed as dist

TRIALS = 5

# benchmark payload: an N x M fp32 tensor (~4GB)
N = 500000
M = 2000

def printflock(*msgs):
    """print from concurrent processes without interleaving the output"""
    with open(__file__, "r") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            print(*msgs)
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)

def timed_allreduce(mat, id):
    pre = time.perf_counter()
    dist.all_reduce(mat)
    printflock(f"ignore me {int(mat[0][0])}") # required due to async CUDA execution: accessing the value forces a sync
    duration = time.perf_counter() - pre
    tput = ((M*N*4*2)/duration)*8 # 4 bytes per fp32 element, *2 is for send + receive, *8 to get bits/second
    size = M * N * 4 # 4 is fp32
    n = dist.get_world_size()
    # "bus bandwidth" following the nccl-tests convention: algo bandwidth scaled by the all-reduce factor 2*(n-1)/n
    busbw = (size / duration) * (2 * (n - 1) / n) * 8
    printflock(f"{id}:\n",
               f"duration: {duration:.4f} sec\n",
               f"algo throughput: {tput:.4f} bps, {tput/1e9:.4f} Gbps\n",
               f"busbw: {busbw / 1e9:.4f} Gbps"
    )

def run(local_rank):
    hostname = socket.gethostname()
    id = f"{hostname}:{local_rank}"
    global_rank = dist.get_rank()

    printflock(f"{id} data size: {M*N*4/1e9} GB")
    mat = torch.rand(N, M, dtype=torch.float32).cuda(local_rank)

    for i in range(TRIALS):
        dist.barrier()
        if global_rank == 0:
            print(f"\n\n\n-----------trial-{i}----------------")
        timed_allreduce(mat, id)

def init_processes(local_rank, fn, backend='nccl'):
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend)
    fn(local_rank)


if __name__ == "__main__":
    rank = int(os.environ["LOCAL_RANK"])
    printflock("local_rank: %d" % rank)
    init_processes(local_rank=rank, fn=run)
