From f76fe8d898f8cbb366097f35800f789108e41e25 Mon Sep 17 00:00:00 2001
From: Zhanyuan Zhang <32000378+zhanyuanucb@users.noreply.github.com>
Date: Sat, 25 Feb 2023 15:22:05 -0800
Subject: [PATCH] Added explanation on the result (#887)

Co-authored-by: zhanyuan.zhang
---
 docs/gallery/tutorials/pipeshard_parallelism.py | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/docs/gallery/tutorials/pipeshard_parallelism.py b/docs/gallery/tutorials/pipeshard_parallelism.py
index 0cc1339e0..220c34e35 100644
--- a/docs/gallery/tutorials/pipeshard_parallelism.py
+++ b/docs/gallery/tutorials/pipeshard_parallelism.py
@@ -244,3 +244,20 @@ def loss_func(params):
                     atol=5e-3)
 
 alpa.shutdown()
+
+################################################################################
+# Interpret the Results
+# ---------------------
+# **Some basic concepts**
+# - Cluster mesh and submeshes
+#   - A cluster mesh is a computer cluster that contains GPUs. An ``N×M`` cluster mesh means the cluster has ``N`` physical machines, each with ``M`` GPUs.
+#   - Submeshes are obtained by slicing the cluster mesh. For example, given an ``N×M`` cluster mesh, the submesh ``(1, M)`` means using all GPUs of one physical machine.
+#   - For more details on how Alpa uses submeshes to solve *inter-operator parallelism*, see **Section 5: Inter-Operator Parallelism** in the `Alpa paper <https://arxiv.org/abs/2201.12023>`_.
+# - Device mesh and logical mesh
+#   - A device mesh is a 2-dimensional logical view of a set of physical devices.
+#   - A set of physical devices admits multiple logical views. For example, given 2 nodes with 8 GPUs per node (i.e., 16 devices in total), we can view them as a 2×8, 1×16, 4×4, 8×2, or 16×1 device mesh.
+#   - The mapping between physical devices and the logical device mesh view is optimized by the inter-op pass.
+#   - Hence, you can see ``Result mesh_shapes`` and the corresponding ``Result logical_mesh_shapes`` in the optimization output.
+#
+# With these basic concepts in mind, you can now better understand the ``ModuleProfileResult``:
+# - ``ModuleProfileResult``: ``result[(i, j, s, c), m]`` means this stage contains forward layers ``i, i+1, ..., j`` and the corresponding backward layers, and runs under the ``s``-th submesh with the ``c``-th auto sharding config for that submesh. ``m = 0`` means the result is for the forward pass, and ``m = 1`` for the backward pass.
\ No newline at end of file
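
As a quick illustration of the "multiple logical views" point in the patch above: reshaping one set of physical devices into different logical meshes can be sketched with plain NumPy. This is a minimal sketch of the idea, not Alpa's API; the device ranks and shapes are just the 2-node, 8-GPU example from the text.

import numpy as np

# 2 nodes x 8 GPUs per node = 16 physical devices, identified by rank.
devices = np.arange(16)

# The same 16 devices admit several 2-D logical mesh views.
for shape in [(2, 8), (1, 16), (4, 4), (8, 2), (16, 1)]:
    logical_mesh = devices.reshape(shape)
    print(f"logical mesh {shape}: {logical_mesh.tolist()}")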
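Likewise, the ``((i, j, s, c), m)`` key layout described in the last bullet can be decoded mechanically. The helper below is hypothetical (the function name and the sample key are illustrative, not part of Alpa's API); it only restates the key convention from the patch text.

def describe_profile_key(key):
    """Decode a ((i, j, s, c), m) ModuleProfileResult key into prose."""
    (i, j, s, c), m = key
    direction = "forward" if m == 0 else "backward"
    return (f"forward layers {i}..{j} (plus their backward layers), "
            f"submesh #{s}, auto sharding config #{c}, {direction} pass")

# Example: a stage covering forward layers 0..1 on submesh 2 with
# auto sharding config 0, profiled for the forward pass.
print(describe_profile_key(((0, 1, 2, 0), 0)))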