
Conversation

danielvegamyhre
Contributor

@danielvegamyhre danielvegamyhre commented Sep 27, 2025

Summary

  • With these changes, the mxfp8 a2a is working e2e in torchtitan Llama4 training (using this PR in torchtitan: [mxfp8 MoE training] Support mxfp8 all to all in expert parallel torchtitan#1765)
  • Perf is currently worse than the bf16 baseline due to a d2h sync caused by an aten::item call when extracting the actual tokens from the overallocated symmetric memory grad_output buffer. This sym mem buffer must be overallocated to account for the fact that the corresponding output from the a2a fwd will be variable size.
  • Therefore, my thinking is:
    1. Use this impl in the experimental DSV3 model: This impl is more suitable for the experimental DSV3 no-sync model, which natively supports this preallocation method by passing the full padded output/grad_input to downstream ops like grouped_mm, scatter_add, etc. unmodified. With a couple of small changes to this mxfp8 impl (e.g., just returning the full padded output and grad_input) we can use it there. This method is experimental because, while it avoids the d2h sync, if there is enough skew in expert routing the job will crash due to insufficient sym mem buffer space to write to during the token exchange.
    2. Add new impl for non-experimental DSV3/Llama4 models: we can add a simpler mxfp8 a2a impl that just kicks off 2 async all_to_all_single_autograds on the e4m3 data and e8m0 scales.
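To make the d2h sync mentioned above concrete, here is a minimal CPU sketch (function name and shapes are hypothetical, and the real buffers are CUDA symmetric memory tensors; the mechanism is the same):

```python
import torch

def extract_actual_tokens(padded_out: torch.Tensor, num_tokens: torch.Tensor) -> torch.Tensor:
    """Slice the real tokens out of an overallocated a2a output buffer.

    When num_tokens lives on the GPU, calling .item() to get a Python int
    forces a device-to-host copy (the aten::item d2h sync visible in the
    profile), because the slice length must be known on the host before
    the next kernel can be launched.
    """
    n = num_tokens.item()  # d2h sync point when num_tokens is on CUDA
    return padded_out[:n]

# toy CPU example: 8-row overallocated buffer, only 5 rows are real tokens
buf = torch.arange(8, dtype=torch.float32).reshape(8, 1)
n = torch.tensor(5)
out = extract_actual_tokens(buf, n)
```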

Changes

  • When integrating the mxfp8 a2a kernel with torchtitan I hit some CUDA illegal memory access (IMA) errors, which I've fixed with more robust bounds checking.
  • Disable compile for forward()/backward() at method level, due to compile not playing nicely with class variables
  • Compile to_mx and to_dtype in fwd and bwd
  • Fix name of unit test
  • Update bench script to measure the real default_a2a and mxfp8_a2a impls used in torchtitan (being added in [mxfp8 MoE training] Support mxfp8 all to all in expert parallel torchtitan#1765)
  • Add option to profile run in bench script

Benchmarks

input_shape         num_splits    bf16_ms    mxfp8_ms
----------------  ------------  ---------  ----------
(16, 8192, 5120)             8     10.684     62.2852

Limitations

  • Extracting actual tokens from overallocated device buffer at end of forward() and backward() causes d2h syncs, hurting perf. Need to think about ways to avoid this.
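One possible direction is the padded-output approach from the PR description: return the full overallocated buffer plus a device-side token count, and let downstream ops mask out padding on device, so no .item() call is needed. A minimal CPU sketch, with a hypothetical function name and masking scheme:

```python
import torch

def a2a_padded_output(padded_buf: torch.Tensor, num_tokens: torch.Tensor):
    """Return the full overallocated buffer plus a device-side count,
    instead of slicing with num_tokens.item() (which forces a d2h sync).

    Downstream ops must then consume the padded tensor unmodified; here
    the padding rows are simply zeroed with an on-device mask.
    """
    mask = torch.arange(padded_buf.shape[0], device=padded_buf.device) < num_tokens
    return padded_buf * mask.unsqueeze(-1).to(padded_buf.dtype), num_tokens

# 8-row overallocated buffer, 5 real token rows
buf = torch.ones(8, 4)
out, n = a2a_padded_output(buf, torch.tensor(5))
```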


pytorch-bot bot commented Sep 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3088

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit dcc9237 with merge base 0d3217d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 27, 2025
@danielvegamyhre danielvegamyhre force-pushed the improve branch 3 times, most recently from 2fb7cb0 to 0250ba0 on September 27, 2025 16:53
@danielvegamyhre danielvegamyhre added the topic: not user facing Use this tag if you don't want this PR to show up in release notes label Sep 27, 2025
@danielvegamyhre
Contributor Author

danielvegamyhre commented Sep 30, 2025

@kwen2501 @vkuzo I need to try a different approach to get better perf (see PR description), but would like to land these incremental changes, which contain an mxfp8 a2a impl that is now at least e2e functional in torchtitan training.

@danielvegamyhre danielvegamyhre changed the title [mxfp8 moe training] fix CUDA IMA and improve bench + test scripts [mxfp8 moe training] mxfp8 a2a working e2e in torchtitan llama4 training; improve tests + bench scripts Sep 30, 2025
@vkuzo
Contributor

vkuzo commented Sep 30, 2025

we can add a simpler mxfp8 a2a impl that just kicks off 2 async all_to_all_single_autograds on the e4m3 data and e8m0 scales

for a general utility I'd start with this and then iterate, sounds simpler

Contributor

@vkuzo vkuzo left a comment


stamp

@danielvegamyhre danielvegamyhre merged commit cbd3adb into main Sep 30, 2025
18 checks passed

Labels

CLA Signed · moe · mx · topic: not user facing
