
Commit 466e909

reproducible + network
1 parent be999b6 commit 466e909

2 files changed (+97 -4 lines)

throughput/README.md

Lines changed: 28 additions & 4 deletions

In general, maximizing throughput is all about running many experiments and measuring the outcome.

## Crucial reproducibility requirements

The most important requirement for a series of successful experiments is the ability to reproduce the experiment environment again and again while changing only one or a few setup variables.

Therefore, when you try to figure out whether a given change improves performance or makes it worse, you must figure out how to keep everything else stable.

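To make this concrete, here is a minimal, hypothetical sketch of the "change one variable at a time" workflow: everything except the swept setting stays pinned, and each run's measured throughput is recorded for comparison. The script name, flags and output format below are placeholders, not something from this repo:

```
# hypothetical sweep harness: vary a single setting, pin everything else,
# and record each run's measured throughput for later comparison
import csv
import subprocess

micro_batch_sizes = [1, 2, 4, 8]                    # the one variable being changed
fixed_args = ["--tp-size", "4", "--pp-size", "12"]  # placeholder flags - keep identical across runs

with open("throughput_sweep.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["micro_batch_size", "samples_per_sec"])
    for mbs in micro_batch_sizes:
        # "train.py" stands in for whatever launches your actual training run
        result = subprocess.run(
            ["python", "train.py", "--micro-batch-size", str(mbs), *fixed_args],
            capture_output=True, text=True, check=True,
        )
        # assumes the training script prints a final line like "samples_per_sec: 123.4"
        line = [l for l in result.stdout.splitlines() if l.startswith("samples_per_sec:")][-1]
        writer.writerow([mbs, line.split(":")[1].strip()])
```

The exact harness doesn't matter; what matters is that only one knob moves between runs.
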
For example, you need to find a way to keep the network usage from fluctuating. When we were doing performance optimizations for the [108B pre-BLOOM experiments](https://github.com/bigscience-workshop/bigscience/tree/master/train/tr8-104B-wide) it was close to impossible to do, since we were on a shared internode network and the exact same setup would yield different throughput depending on how many other users were using the network. It simply did not work. During BLOOM-176B we were given a dedicated SLURM partition with an isolated network where the only traffic was ours, and doing performance optimization in such an environment was just perfect.

## Network throughput

It's critical to understand your particular model size and framework requirements with regard to network bandwidth, throughput and latency. If you underpay for the network you will end up with idle GPUs, wasting money and time. If you overpay for a very fast network but your GPUs are slow, then again you have wasted money and time.

If your network is very slow, your training is likely to be network-bound, and many improvements to the training setup will not help improve performance.

Here is a simple all-reduce benchmark that you can use to quickly measure the throughput of your internode network:

[all_reduce_bench.py](./all_reduce_bench.py)

Usually benchmarking at least 4 nodes is recommended, but, of course, if you already have access to all the nodes you will be using during the training, benchmark using all of them.

To run it on 4 nodes (assuming 4 GPUs per node and that `$MASTER_ADDR` holds the hostname of one of the nodes; adjust both to your setup):

```
python -m torch.distributed.run --nnodes=4 --nproc_per_node=4 --rdzv_endpoint=$MASTER_ADDR:6000 --rdzv_backend=c10d all_reduce_bench.py
```

Launch the same command on each of the 4 nodes, or let your cluster's launcher (e.g. SLURM's `srun`) do it for you.

You may get results anywhere between 5Gbps and 1600Gbps (as of this writing). The minimal speed needed to avoid being network-bound will depend on your particular training framework, but typically you'd want at least 400Gbps or higher. Though we trained BLOOM on 50Gbps.

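For a feel of what these numbers imply, here is a hedged back-of-envelope sketch (every value below is hypothetical and purely illustrative) of how long a plain data-parallel gradient all-reduce would take at a given measured bus bandwidth:

```
# hedged back-of-envelope: time per step spent in a plain data-parallel
# gradient all-reduce, given a bus bandwidth as reported by all_reduce_bench.py
# (busbw follows the nccl-tests convention: busbw = algbw * 2*(n-1)/n)
n_params   = 176e9   # hypothetical model size in parameters
grad_bytes = 2       # bf16/fp16 gradients (assumption)
n_ranks    = 384     # hypothetical number of GPUs doing pure data parallelism
busbw_gbps = 50      # example measured bus bandwidth in Gbps

payload_bits = n_params * grad_bytes * 8                       # gradient volume each rank must reduce
algbw_bps = busbw_gbps * 1e9 / (2 * (n_ranks - 1) / n_ranks)   # undo the bus-bandwidth scaling
print(f"~{payload_bits / algbw_bps:.0f}s per step in gradient all-reduce, if not overlapped")
```

With these made-up values the estimate lands at roughly two minutes per step, which is one way to see why, on a slower network, the activation-only traffic of the frameworks discussed next is so attractive.
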
Frameworks that shard the weights and optimizer states, like [Deepspeed](https://github.com/microsoft/DeepSpeed) with ZeRO Stage-3, generate a lot more traffic than frameworks like [Megatron-Deepspeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed), which do tensor and pipeline parallelism in addition to data parallelism. The latter only send activations across and thus don't need as much bandwidth, but they are much more complicated to set up and run.

Of course, an efficient framework will overlap communication and compute, so that while one stage is fetching data, the other stage runs its computation in parallel. As long as the communication overhead is smaller than the compute, the network requirements are satisfied and don't have to be outstanding.

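As a minimal illustration of that overlap (a sketch only, not how any particular framework implements it), PyTorch lets you launch the all-reduce asynchronously and wait for it only after independent compute has been queued:

```
# minimal sketch of overlapping communication with compute; assumes an
# already-initialized torch.distributed process group and a CUDA device
import torch
import torch.distributed as dist

grads = torch.randn(1024, 1024, device="cuda")        # stand-in for a gradient bucket
activations = torch.randn(4096, 4096, device="cuda")  # stand-in for independent compute work

handle = dist.all_reduce(grads, async_op=True)  # kick off the reduction without blocking

out = activations @ activations                 # compute that doesn't depend on the reduced grads

handle.wait()                                   # only now block until the communication is done
# from here on it is safe to use `grads` (and `out`)
```

Data-parallel implementations such as PyTorch DDP do essentially this automatically: gradients are bucketed and each bucket's all-reduce is launched while the backward pass is still producing the next one.
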
## Vector and matrix size divisibility

throughput/all_reduce_bench.py

Lines changed: 69 additions & 0 deletions

# this version has been derived from @jeffra's gist: https://gist.github.com/jeffra/b5e80466b4c86be00ea3b6f130fb7a36
# which in turn is derived from https://github.com/NVIDIA/nccl-tests
#
# to do a quick test on 2 gpus of a single node:
# python -m torch.distributed.run --nproc_per_node=2 all_reduce_bench.py
# (see the README for a multi-node launch)
#
# the printed results are already n_gpu-agnostic (i.e. averaged for the world size)

import fcntl
import os
import socket
import time
import torch
import torch.distributed as dist

TRIALS = 5

# benchmark payload: an N x M fp32 tensor (~4GB)
N = 500000
M = 2000

def printflock(*msgs):
    """print from concurrent processes without interleaving the output"""
    with open(__file__, "r") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            print(*msgs)
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)

def timed_allreduce(mat, id):
    pre = time.perf_counter()
    dist.all_reduce(mat)
    printflock(f"ignore me {int(mat[0][0])}") # required due to async CUDA execution: accessing the value forces a sync
    duration = time.perf_counter() - pre
    tput = ((M*N*4*2)/duration)*8 # 4 bytes per fp32 element, *2 is for send + receive, *8 to get bits/second
    size = M * N * 4 # 4 is fp32
    n = dist.get_world_size()
    # "bus bandwidth" following the nccl-tests convention: algo bandwidth scaled by the all-reduce factor 2*(n-1)/n
    busbw = (size / duration) * (2 * (n - 1) / n) * 8
    printflock(f"{id}:\n",
               f"duration: {duration:.4f} sec\n",
               f"algo throughput: {tput:.4f} bps, {tput/1e9:.4f} Gbps\n",
               f"busbw: {busbw / 1e9:.4f} Gbps"
    )

def run(local_rank):
    hostname = socket.gethostname()
    id = f"{hostname}:{local_rank}"
    global_rank = dist.get_rank()

    printflock(f"{id} data size: {M*N*4/1e9} GB")
    mat = torch.rand(N, M, dtype=torch.float32).cuda(local_rank)

    for i in range(TRIALS):
        dist.barrier()
        if global_rank == 0:
            print(f"\n\n\n-----------trial-{i}----------------")
        timed_allreduce(mat, id)

def init_processes(local_rank, fn, backend='nccl'):
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend)
    fn(local_rank)


if __name__ == "__main__":
    rank = int(os.environ["LOCAL_RANK"])
    printflock("local_rank: %d" % rank)
    init_processes(local_rank=rank, fn=run)
