Neuron support in Axlearn #566

Open · wants to merge 4 commits into base: main
Conversation

@apoorvtintin (Author) commented:

This PR enables the use of Neuron devices in AXLearn for model training.

  • Chooses the correct mesh for TRN devices for Fuji 7B via the mesh selector flag --mesh_selector=neuron-trn1.32xlarge-64 (see the selection sketch below).
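For concreteness, here is a minimal sketch of how a --mesh_selector value picks a mesh rule. The rule strings come from this PR's diff; `select_mesh` and the plain-dict mesh shapes are illustrative stand-ins rather than AXLearn's actual API, and `TRN_MODEL_AXIS_SIZE` is assumed to be 8.

```python
import re

# Hypothetical matcher: the first rule whose regex fully matches the
# selector string wins. Rule strings are taken from this PR's diff.
MESH_RULES = [
    ("gpu-(p5.48xlarge|p4de.24xlarge)-(256|512|1024)",
     dict(data=-1, fsdp=8)),
    ("neuron-(trn1.32xlarge|trn1n.32xlarge)-(32|64|256|512|1024|2048)",
     dict(data=-1, model=8)),  # assumes TRN_MODEL_AXIS_SIZE == 8
]

def select_mesh(mesh_selector: str) -> dict:
    for pattern, mesh in MESH_RULES:
        if re.fullmatch(pattern, mesh_selector):
            return mesh
    raise ValueError(f"no mesh rule matches {mesh_selector!r}")

print(select_mesh("neuron-trn1.32xlarge-64"))  # {'data': -1, 'model': 8}
```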

@ruomingp (Contributor) left a comment:

Thanks.

axlearn/common/utils.py: outdated review comments (resolved)
```diff
@@ -167,6 +167,10 @@ def get_trainer_kwargs(
             "gpu-(p5.48xlarge|p4de.24xlarge)-(256|512|1024)",
             mesh_shape_from_axes(data=-1, fsdp=8),
         ),
+        (
+            "neuron-(trn1.32xlarge|trn1n.32xlarge)-(32|64|256|512|1024|2048)",
+            mesh_shape_from_axes(data=-1, model=TRN_MODEL_AXIS_SIZE),
+        ),
```
Contributor:

How does model=8 compare to fsdp=8? Usually we find fsdp to be more efficient.

Contributor:

Might also be worth listing the step times for different configurations, similar to the other mesh rules.

Reply:

> How does model=8 compare to fsdp=8? Usually we find fsdp to be more efficient.

I am launching an fsdp=8 job with 8 nodes; it is currently blocked on AWS capacity. I hope to have some data to share by Friday.

The previous response from AWS was that FSDP is slower due to higher communication overhead.

Reply:

Tensor parallel (model) is more performant on the trn1 architecture.
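To make the model=8 vs. fsdp=8 discussion concrete, here is a minimal sketch of the two mesh shapes on a 64-core cluster (e.g. two trn1.32xlarge nodes). `resolve_mesh` is a stand-in for the "-1 absorbs the remaining devices" semantics, not AXLearn's `mesh_shape_from_axes` itself.

```python
def resolve_mesh(total_devices: int, **axes: int) -> dict:
    # Resolve -1 to whatever device count remains after the fixed axes.
    fixed = 1
    for size in axes.values():
        if size != -1:
            fixed *= size
    return {name: (total_devices // fixed if size == -1 else size)
            for name, size in axes.items()}

# Mesh proposed in this PR: 8-way tensor parallelism within each node.
print(resolve_mesh(64, data=-1, model=8))  # {'data': 8, 'model': 8}
# The fsdp alternative raised above: same shape, different sharding axis.
print(resolve_mesh(64, data=-1, fsdp=8))   # {'data': 8, 'fsdp': 8}
```

Either way the mesh has 8-way data parallelism; the open question above is whether the second axis is used for FSDP-style weight sharding or tensor (model) parallelism, which is what the requested step-time comparison would settle.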

axlearn/experiments/text/gpt/fuji.py: outdated review comments (resolved)
axlearn/experiments/text/gpt/common.py: outdated review comments (resolved)
```diff
@@ -267,12 +269,17 @@ def model_config(
         batch_axis_names=batch_axis_names,
         seq_axis_names="seq",
     )

+    device_platform = np.asarray(jax.devices())[0].platform
```
Contributor:

jax.devices() during config building may be an unexpected dependency on global state -- should we take a platform arg or similar?

@apoorvtintin (Author) replied on Jul 24, 2024:

We could change it, but I followed the pattern already used here:

`devices = jax.devices()`

Please let me know if the platform flag is necessary, and I can add it. Thanks!
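A minimal sketch of the reviewer's suggestion, assuming a `platform` keyword argument is threaded into config building; the signature and fallback below are illustrative, not the final AXLearn API.

```python
from typing import Optional

import jax

def model_config(*, platform: Optional[str] = None, **kwargs):
    # Prefer an explicit platform so config building stays deterministic;
    # fall back to the current global-state lookup when none is given.
    if platform is None:
        platform = jax.devices()[0].platform
    if platform == "neuron":
        ...  # apply TRN-specific settings here
```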

@kelvin-zou (Contributor) commented:

@apoorvtintin I see this PR has been stale for some time. If there is no objection, I'd like to have @Ruixuan, who is working on Trn on our end, port your change and continue iterating on it.

@ptoulme-aws commented:

> @apoorvtintin I see this PR has been stale for some time. If there is no objection, I'd like to have @Ruixuan, who is working on Trn on our end, port your change and continue iterating on it.

Apoorv is on PTO right now. I am OK with you all taking over this PR. Can you add us as reviewers when you finish? Thanks!

@apoorvtintin (Author) commented:

Thanks for all the reviews. I have addressed most of the comments on the PR.
