
Conversation


@jiyang1011 jiyang1011 commented Dec 1, 2025

Description

This RFC mainly proposes a copy API from G->S, and considers whether there is a possibility of gaining a benefit when SLM is introduced into the GEMM pipeline; experiments are needed to confirm this.

Type

  • [ ] Bug
  • [ ] Feature
  • [ ] Performance
  • [ ] Refactor

Testing

  • [ ] Tests pass
  • [ ] Xe12
  • [ ] Xe20

Performance

| Metric | Before | After |
| --- | --- | --- |

References

Fixes #

Checklist

  • [ ] Copyright
  • [ ] Co-pilot Review
  • [ ] Deprecated APIs not used

@Antonyvance

@jiyang1011 Could you please add the following:

- What the new APIs are
- Which examples should be modified to show the new APIs
- Pipeline: pseudocode showing the flow

Thank you.


## (Pros) Register->SLM->Register is more efficient than L1 path
- The key improvement is in the global -> register -> SLM part. The amount of data each thread needs to copy from global memory is smaller because threads collaborate to share the work of copying/reordering/dequantizing the data.
- Subgroup specification in each workgroup could avoid duplicated data tile movement.


Do you mean specialization? If not, what is "specification"? If so, why would there be duplicated data movement without specialization?

Author


I mean warp specialization in NVIDIA terminology. First, we use part of the subgroups to load data from G->S, and the other part to do the matrix math. Second, if all subgroups follow the full workflow (G->S, S->R, R->DPAS), there is obviously no gain compared with the existing pipeline. Finally, we could certainly use a prefetch-like method to load data into SLM to avoid duplicated data movement, but in most cases the prefetches have overlapping data to satisfy the cache-line constraint.

Author


I fixed the typo:

- Gathering and reordering through controllable and high-bandwidth memory

## (Pros) Register->SLM->Register is more efficient than L1 path
- The key improvement is in the global -> register -> SLM part. The amount of data each thread needs to copy from global memory is smaller because threads collaborate to share the work of copying/reordering/dequantizing the data.


This isn't true in general, is it? For cases without dequantizing, the cooperative prefetch makes sure that the global->L1 transfer also happens only once per WG. Of course, we can't reuse dequant work with the prefetch pipeline.


@jiyang1011 jiyang1011 Dec 2, 2025


Yes, you are correct. In most cases prefetch has the same effect, so there is a chance to gain a benefit with the SLM pipeline when dequantizing / reordering / gathering etc. is involved.

## (Pros) Register->SLM->Register is more efficient than L1 path
- The key improvement is in the global -> register -> SLM part. The amount of data each thread needs to copy from global memory is smaller because threads collaborate to share the work of copying/reordering/dequantizing the data.
- Subgroup specialization in each workgroup could avoid duplicated data tile movement.
- In the current mainloop, each subgroup copies the same data from L1 -> register and does any necessary reorder/data conversion/dequantization on that data.


This isn't true, is it? Subgroups only copy the tiles they need, which is only the same data if the subgroups actually need the same data. And if they do need the same data, an SLM pipeline will also need to copy from SLM to registers multiple times.

- In the current mainloop, each subgroup copies the same data from L1 -> register and does any necessary reorder/data conversion/dequantization on that data.
- To make an SLM flow efficient, we need heavy pipelining, typically triple or quadruple buffering in SLM.
- We can also combine the SLM flow with prefetch to L1, though prefetch is much less crucial in this case.
- Due to the lack of asynchronous hardware features on the Xe3p platform, the programming model should switch to a producer-consumer pattern: producer subgroups only copy from global memory to SLM.


It isn't at all clear that producer-consumer would be beneficial. Probably needs experiments or a good analysis based on prior data.

Author


Yes, we need more experiments to confirm this argument.


@sanchitintel sanchitintel Dec 3, 2025


@jiyang1011, can you please use scaledMMs for experimentation on BMG, as in #633?

I had not used subgroup specialization for my experiments, BTW.

On BMG, in each mainloop iteration, I had tried all subgroups loading equal parts of the B workgroup tile (32 subgroups loaded 16x16 tiles for a 256x32 workgroup-level B tile), converting to BF16/FP16 VNNI-16 format, applying scales, and caching in SLM. However, I used inefficient SLM <-> registers transfers.

I'll also try:

  1. Loading equal portions of a subgroup-level MMA tile that a specific number of subgroups need. For example, if 4 subgroups need the same 32x32 tile for MMA, they will each load 8x32 or 16x16 tiles, convert to BF16/FP16 VNNI-16 format, apply scales, and then cache them in SLM.
  2. Using more efficient SLM <-> registers transfers, as Ji Yang described in this PR.

Thanks!

@tdeng5 tdeng5 changed the title SLM pipelined GEMM [WIP] SLM pipelined GEMM design Dec 2, 2025

jiyang1011 commented Dec 2, 2025

@jiyang1011 Could you please add the following: what the new APIs are, which examples should be modified to show the new APIs, and pseudocode showing the pipeline flow. Thank you.

Hi Antony, we have the necessary components. Please refer to these tests:

void copy_kernel_local(TensorS S, TensorD D, TiledCopy Op) {

and https://github.com/intel/sycl-tla/blob/main/test/unit/cute/intel_xe/copy_1d.cpp#L50
Block 2D copy and universalCopy/1-D copy can be integrated into the G->S copy operator to meet the demand.
