
Distributed layers #1270

Open: angeloskath wants to merge 9 commits from distributed-layers into main

Conversation

angeloskath (Member)

Adds linear layers that allow training and inference of a model sharded across several devices. The main things added are:

  • float16/bfloat16 reductions for MPI
  • AllToShardedLinear and its quantized sibling
  • ShardedToAllLinear and its quantized sibling

Simply changing linear layers to the above results in a model that works out of the box with distributed inference and training.
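
As a rough illustration of the intended usage (a sketch only; it assumes the new layers are exposed under mlx.nn and accept a group keyword, which this PR does not spell out here), an MLP block could be sharded by swapping its two linear layers:

```python
import mlx.core as mx
import mlx.nn as nn

# Assumed: the default distributed group is used for both layers.
group = mx.distributed.init()


class ShardedMLP(nn.Module):
    def __init__(self, dims: int, hidden: int):
        super().__init__()
        # Column-parallel: every rank holds a slice of the hidden dimension.
        self.up = nn.AllToShardedLinear(dims, hidden, group=group)
        # Row-parallel: the partial outputs are combined across ranks.
        self.down = nn.ShardedToAllLinear(hidden, dims, group=group)

    def __call__(self, x):
        return self.down(nn.relu(self.up(x)))
```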

I am starting it as a draft so that we can iterate a bit on the design. The downside of the above design is that we have yet another linear layer to think about when implementing LoRA and friends, or new quantizations, for instance. Perhaps it would be better to build the above layers around an internal linear layer, so that model surgery that swaps linear layers would still work out of the box (a hypothetical sketch of that alternative follows).
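
A minimal sketch of that alternative, purely hypothetical and not part of this PR, would wrap whatever linear-like layer is already there and only add the collective around it:

```python
import mlx.core as mx
import mlx.nn as nn


class ShardedToAllWrapper(nn.Module):
    """Hypothetical wrapper: the inner layer stays a plain (or quantized,
    or LoRA) linear whose weights have been sharded along the input
    features, so surgery that swaps linear layers keeps working."""

    def __init__(self, inner: nn.Module, group=None):
        super().__init__()
        self.inner = inner
        self.group = group or mx.distributed.init()

    def __call__(self, x):
        # Each rank computes a partial output from its shard of the input
        # features; summing across ranks recovers the full result.
        return mx.distributed.all_sum(self.inner(x), group=self.group)
```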

@awni (Member) commented Jul 17, 2024

I kind of like this design. I like that it's all quite simple and easy to follow, and we have a lot of control over how to shard the model (as in ml-explore/mlx-examples#890). We could possibly find a way to reduce the code needed for adding a new custom linear-like layer, but the simplicity is nice; I wouldn't want to give that up.

@angeloskath force-pushed the distributed-layers branch 2 times, most recently from 061d214 to b32ce2c on August 29, 2024
@angeloskath force-pushed the distributed-layers branch 2 times, most recently from ab26116 to 3d431c0 on September 6, 2024
@awni mentioned this pull request on September 16, 2024
@angeloskath force-pushed the distributed-layers branch 5 times, most recently from 2298954 to 1697581 on November 5, 2024
@awni force-pushed the distributed-layers branch 3 times, most recently from 31ba022 to 60e7e02 on January 18, 2025
@awni force-pushed the distributed-layers branch 2 times, most recently from 07b5bd5 to 794eb42 on February 6, 2025
@angeloskath force-pushed the distributed-layers branch 3 times, most recently from 517eb95 to a323642 on March 4, 2025
@angeloskath (Member, Author)

I am marking this ready for review. The main things that are new since I started the branch:

Exposing mx.contiguous. This ensures both that the array is contiguous and that it occupies at most x.size() * x.itemsize() + 16384 bytes. The main consequence is that a slice which is already contiguous, but still views a larger buffer, will be copied.
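
For instance (a sketch; the shapes are illustrative and the exact call signature is whatever this PR exposes):

```python
import mlx.core as mx

w = mx.zeros((4096, 4096))
shard = w[:1024]              # contiguous view, but it references the full 4096x4096 buffer
shard = mx.contiguous(shard)  # forces a compact copy, so only ~1024x4096 elements are retained
mx.eval(shard)
```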

shard_linear convenience function and shard_inplace. The first one just creates the appropriate linear layer, quantized or not. The second actually shards the parameters in place, which allows us to shard any layer and apply the collective operations as we see fit. It is used, for instance, to shard the single stream transformer blocks in FLUX while performing only one communication (ml-explore/mlx-examples#1325).
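
A rough usage sketch (the import path and the shard_linear signature are guesses inferred from the diff in this PR, not a documented API):

```python
import mlx.core as mx
import mlx.nn as nn

# Hypothetical import path for the helpers introduced in this PR.
from mlx.nn.layers.distributed import shard_inplace, shard_linear

group = mx.distributed.init()

# Replace a regular (or quantized) linear with its sharded counterpart.
proj = nn.Linear(4096, 4096)
proj = shard_linear(proj, "all-to-sharded", group=group)

# Or shard the parameters of an existing layer in place and handle the
# collectives ourselves, e.g. a single all_sum after a whole block.
block = nn.Linear(4096, 11008)
shard_inplace(block, "sharded-to-all", group=group)
```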

The sharding functions now also take a groups argument. This assumes the linear layer is a fused one and splits it according to the groups argument (evenly or percentage-wise). I think the argument name may need improving here.

@angeloskath marked this pull request as ready for review on March 6, 2025
Comment on lines +47 to +48
# The multiplication with 1 forces a copy, perhaps change to
# something better when available.

awni (Member): Nit: remove comment?

angeloskath (Member, Author): Oops

Comment on lines +76 to +77
# The multiplication with 1 forces a copy, perhaps change to
# something better when available.

awni (Member): Same.

if not isinstance(parameters[k], mx.array):
    continue

axis = max(parameters[k].ndim - 2, 0)

awni (Member): Should that just always be 0? Maybe it would work for conv in that case as well?

angeloskath (Member, Author): Well, it assumes linear layers (as in fully connected layers). It isn't 0 so that it can work with Switch layers and their variants. Perhaps, combining with your comment below, we could make this more general and support both.

Comment on lines +104 to +115
def shard_inplace(
    module: Module,
    sharding: str,
    *,
    groups: Union[int, list] = 1,
    group: Optional[mx.distributed.Group] = None,
):
    _check_sharding(sharding)
    shard_function = (
        _all_to_sharded if sharding == "all-to-sharded" else _sharded_to_all
    )
    module.update(shard_function(module.parameters(), groups=groups, group=group))

awni (Member): Would it make sense to have this take a callable which returns a sharding based on a key? It would be more like nn.quantize and capable of one-shot sharding a Module in place with a given policy.

angeloskath (Member, Author): Possibly you are right. Ideally we can keep the same interface but provide this as an extra. Returning a tuple (or named tuple) with axis and groups (renamed), given the path and weight, would be a nice interface I think.

@awni (Member) commented Mar 11, 2025

The sharding functions now also take a groups argument. This assumes the linear layer is a fused one and splits it according to the groups argument (evenly or percentage-wise).

The purpose there is to allow uneven shardings? I think it would be good to think of a name that is more distinct from group.

@angeloskath (Member, Author)

The purpose there is to allow uneven shardings?

Totally agree that we should name it something different. It isn't for uneven shardings in the sense that one node can take 70% of the computation; that isn't supported in this API. It is for weights that comprise several concatenated weights. In this case, for the sharded linear to be valid, we need to split, shard, and concatenate. Otherwise one node will get all the queries and no keys, and so on.
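
For example (a sketch with made-up shapes, just to illustrate the split-shard-concatenate step for a fused QKV weight; the rank and group size are illustrative):

```python
import mlx.core as mx

n_ranks, rank = 4, 1                 # illustrative: 4 ranks, this is rank 1
dims = 1024
w_qkv = mx.zeros((3 * dims, dims))   # fused [Wq; Wk; Wv], as in many attention blocks

# Sharding the fused weight directly would give rank 0 only query rows, etc.
# With groups=3 we split into the three sub-weights, shard each, and re-concatenate.
parts = mx.split(w_qkv, 3, axis=0)
local = mx.concatenate(
    [mx.split(p, n_ranks, axis=0)[rank] for p in parts], axis=0
)
print(local.shape)  # (3 * dims // n_ranks, dims): every rank gets q, k and v rows
```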

@awni (Member) commented Mar 11, 2025

Otherwise one node will get all the queries and no keys and so on.

Ah, that makes sense now. Some suggestions on alternative names:

  • shards
  • segments
  • sections
  • splits

Maybe it makes sense to prefix one of those with sub, like sub_shards?
