Re-implement FlashAttention with new Xe atoms #547
base: main
Conversation
I will break up this large commit into self-contained smaller commits after review is complete.
Why is this here? This isn't FlashAttention-specific, is it?
No, it's not. These started as some simple helpers to make copying to/from SLM easier for the epilogue. We could move them, maybe to include/cute/algorithm/cute.hpp, though they should be made more sophisticated (use smaller/larger block sizes as appropriate, automatic fallback to scatter/gather, etc.).
```cpp
// No diagnostics/error will be issued by the compiler if it is not.
template <typename T>
CUTE_HOST_DEVICE void
set_wi_value(T &x, int i, T val)
```
Why don't you take `i` as a compile-time value to make this safer? The usage is on line 137, where the input comes from the unrolled loop index. If you replace the loop with `for_each`, you have a compile-time constant.
That is an option -- I did it this way since compile-time unrolling of the loop is, IMO, harder to use and harder to read.
I opened a compiler ticket for the lack of diagnostics, and they have a patch under review now to address it.
I see. As long as we have a diagnostic, that's fine. The current solution won't compile at -O0, though. Not sure whether that matters.
Good point on -O0. How important is it to support -O0 operation? Does the rest of CUTLASS work OK under -O0? (I know SYCL in general has had some functionality issues at -O0.)
I believe the SYCL issues with -O0 have been resolved. I'm not aware of a good reason to compile with -O0; even for debugging, -O1/-Og tends to be better. I think other parts of CUTLASS are correct at -O0, but IGC used to crash for larger kernels compiled at -O0. I haven't tried it with more recent IGC versions.
This PR updates FlashAttention to the new copy/MMA atoms.
Changes:
Current status: prefill/decode examples almost all working, with similar or better performance compared to the old examples.
Known issues:
Additional features (causal masking, variable sequence lengths, etc.) to be added later.
Reminder: the new atoms require a very recent driver due to necessary IGC fixes/enhancements. Recommended version: ci-comp_igc-30613.