
Conversation


@jiyang1011 jiyang1011 commented Dec 1, 2025

Description

This RFC mainly proposes a copy API from G->S, and considers whether there is a possibility of gaining a benefit when SLM is introduced into the GEMM pipeline; experiments are needed to confirm this.

Type

  • [ ] Bug
  • [ ] Feature
  • [ ] Performance
  • [ ] Refactor

Testing

  • [ ] Tests pass
  • [ ] Xe12
  • [ ] Xe20

Performance

| Metric | Before | After |
| --- | --- | --- |

References

Fixes #

Checklist

  • [ ] Copyright
  • [ ] Co-pilot Review
  • [ ] Deprecated APIs not used

@Antonyvance

@jiyang1011 Could you please add the following:

- What the new APIs are
- Which examples should be modified to show the new APIs
- Pipeline: pseudocode showing the flow

Thank you.


## (Pros) Register->SLM->Register is more efficient than L1 path
- The key improvement is in the global -> register -> SLM part. The amount of data each thread needs to copy from global memory is smaller because threads collaborate to share the work of copying/reordering/dequantizing the data.
- Subgroup specification in each workgroup could avoid duplicated data tile movement.


Do you mean specialization? If not, what is "specification"? If so, why would there be duplicated data movement without specialization?

Author


I mean warp specialization in NVIDIA terminology. First, we use part of the subgroups to load data from G->S, and the other part to do the matrix math. Second, if all subgroups follow the full workflow (G->S, S->R, R->DPAS), there is obviously no gain compared with the existing pipeline. Finally, we could certainly use a prefetch-like method to load data into SLM to avoid duplicated data movement, but in most cases the prefetches have overlapping data to satisfy the cache-line constraint.

Author


I fixed the typo:

- Gathering and reordering through controllable and high-bandwidth memory

## (Pros) Register->SLM->Register is more efficient than L1 path
- The key improvement is in the global -> register -> SLM part. The amount of data each thread needs to copy from global memory is smaller because threads collaborate to share the work of copying/reordering/dequantizing the data.


This isn't true in general, is it? For cases without dequantizing, the cooperative prefetch makes sure that the global->L1 transfer also happens only once per WG. Of course, we can't reuse dequant work with the prefetch pipeline.


@jiyang1011 jiyang1011 Dec 2, 2025


Yes, you are correct. In most cases prefetch has the same effect, so there is a chance to gain a benefit with the SLM pipeline when dequantizing / reordering / gathering etc. is involved.

## (Pros) Register->SLM->Register is more efficient than L1 path
- The key improvement is in the global -> register -> SLM part. The amount of data each thread needs to copy from global memory is smaller because threads collaborate to share the work of copying/reordering/dequantizing the data.
- Subgroup specialization in each workgroup could avoid duplicated data tile movement.
- In the current mainloop, each subgroup copies the same data from L1 -> register and does any necessary reorder/data conversion/dequantization on that data.


This isn't true, is it? Subgroups only copy the tiles they need, which is only the same data if the subgroups actually need the same data. And if they do need the same data, an SLM pipeline will also need to copy from SLM to registers multiple times.

- In the current mainloop, each subgroup copies the same data from L1 -> register and does any necessary reorder/data conversion/dequantization on that data.
- To make an SLM flow efficient, we need heavy pipelining, typically triple or quadruple buffering in SLM.
- We can also combine the SLM flow with prefetch to L1, though prefetch is much less crucial in this case.
- Due to the lack of asynchronous hardware features on the Xe3p platform, the programming model should switch to a producer-consumer pattern: producer subgroups only copy from global memory to SLM.


It isn't at all clear that producer-consumer would be beneficial. Probably needs experiments or a good analysis based on prior data.

Author


Yes, we need more experiments to confirm this argument.


@sanchitintel sanchitintel Dec 3, 2025


@jiyang1011, can you please use scaledMMs for experimentation on BMG, as in #633?

I had not used subgroup specialization for my experiments, BTW.

On BMG, in each mainloop iteration, I had tried all subgroups loading equal parts of the B workgroup tile (32 subgroups loaded 16x16 tiles for a 256x32 workgroup-level B tile), converting to BF16/FP16 VNNI-16 format, applying scales, and caching in SLM. However, I used inefficient SLM <-> registers transfers.

I'll also try:

  1. Loading equal portions of a subgroup-level MMA tile that a specific number of subgroups need. For example, if 4 subgroups need the same 32x32 tile for MMA, they will each load 8x32 or 16x16 tiles, convert to BF16/FP16 VNNI-16 format, apply scales, and then cache them in SLM.
  2. Using more efficient SLM <-> registers transfers, as Ji Yang described in this PR.

Thanks!

@tdeng5 tdeng5 changed the title SLM pipelined GEMM [WIP] SLM pipelined GEMM design Dec 2, 2025

jiyang1011 commented Dec 2, 2025

@jiyang1011 Could you please add the following: what the new APIs are, which examples should be modified to show the new APIs, and pseudocode showing the pipeline flow. Thank you.

Hi Antony, we have the necessary components. Please refer to these tests:

void copy_kernel_local(TensorS S, TensorD D, TiledCopy Op) {

and https://github.com/intel/sycl-tla/blob/main/test/unit/cute/intel_xe/copy_1d.cpp#L50
Block 2D copy and universalCopy/1-D copy can be integrated into the G->S copy operator to meet the demand.
