[WIP] SLM pipelined GEMM design #653
base: main
Conversation
@jiyang1011 Could you please add the following
> ## (Pros) Register->SLM->Register is more efficient than L1 path
> - The key improvement is in the global -> register -> SLM part. The amount of data each thread needs to copy from global memory is smaller because threads collaborate to share the work of copying/reordering/dequantizing the data.
> - Subgroup specification in each workgroup could avoid duplicated data tile movement.
Do you mean specialization? If not what's specification? If so why would there be duplicated data movement without specialization?
I mean warp specialization in Nvidia terminology. First, we use one portion of the subgroups to load data from G->S, and another portion to do the matrix math. Second, suppose all subgroups followed the same workflow (G->S, S->R, R->dpas); then there is obviously no gain compared with the existing pipeline. Finally, we could certainly use a prefetch-like method to load data into SLM to avoid duplicated data movement, but in most cases prefetch loads overlapping data to satisfy the cache-line granularity.
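To make the producer/consumer split concrete, here is a minimal sequential-host sketch of the idea: one "producer" thread stands in for the copy subgroups (G->S only), one "consumer" thread stands in for the math subgroups, and a plain vector stands in for SLM. This is an illustrative model only, not sycl-tla or SYCL kernel code; all names are hypothetical.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Toy model of subgroup specialization: the producer copies tiles from
// "global" memory into an SLM-like shared buffer, the consumer reads each
// tile and accumulates a reduction standing in for the MMA work.
long long producer_consumer_sum(const std::vector<int>& global, std::size_t tile) {
    std::vector<int> slm(tile);        // stand-in for shared local memory
    std::mutex m;
    std::condition_variable cv;
    std::size_t count = 0;
    bool full = false, done = false;
    long long acc = 0;

    std::thread producer([&] {
        for (std::size_t off = 0; off < global.size(); off += tile) {
            std::size_t n = std::min(tile, global.size() - off);
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !full; });   // G->S only when SLM is free
            std::copy(global.begin() + off, global.begin() + off + n, slm.begin());
            count = n;
            full = true;
            cv.notify_all();
        }
        std::lock_guard<std::mutex> lk(m);
        done = true;
        cv.notify_all();
    });

    std::thread consumer([&] {
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return full || done; });
            if (full) {
                for (std::size_t i = 0; i < count; ++i)
                    acc += slm[i];                // "math" stage reads from SLM
                full = false;
                cv.notify_all();
            } else {
                break;                            // producer finished, SLM drained
            }
        }
    });

    producer.join();
    consumer.join();
    return acc;
}
```

On real hardware the hand-off would of course use barriers/named fences rather than a mutex; the point is only the role split.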
I fixed the typo.
> - Gathering and reordering through controllable and high-bandwidth memory
> ## (Pros) Register->SLM->Register is more efficient than L1 path
> - The key improvement is in the global -> register -> SLM part. The amount of data each thread needs to copy from global memory is smaller because threads collaborate to share the work of copying/reordering/dequantizing the data.
This isn't true in general, is it? For cases without dequantizing, the cooperative prefetch makes sure that the global->L1 transfer also only happens once per WG. Of course we can't reuse dequant work with the prefetch pipeline.
Yes, you are correct. In most cases prefetch has the same effect, so there is a chance to gain a benefit with the SLM pipeline when dequantizing / reordering / gathering etc. exist.
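The dequant-sharing argument can be made concrete with a small host-side model: with the L1 path, every one of N subgroups that consumes a tile dequantizes its own register copy (tile_elems * N conversions); with the SLM path the subgroups split the tile, convert each element once into SLM, and all read converted data. A hypothetical sketch, not sycl-tla code:

```cpp
#include <cstdint>
#include <vector>

// Model of cooperative dequantization into SLM: `sgs` subgroups split the
// quantized tile round-robin, each element is converted exactly once, and
// `ops` counts dequantize operations (the L1 path would cost
// q.size() * sgs instead). Names and the int8+scale scheme are
// illustrative assumptions.
std::vector<float> dequant_shared(const std::vector<int8_t>& q, float scale,
                                  int sgs, long long& ops) {
    std::vector<float> slm(q.size());  // stand-in for shared local memory
    for (int sg = 0; sg < sgs; ++sg) {
        // Static round-robin split of the tile across model subgroups.
        for (std::size_t i = sg; i < q.size(); i += sgs) {
            slm[i] = scale * static_cast<float>(q[i]);  // dequantize once
            ++ops;
        }
    }
    return slm;  // every subgroup now reads already-converted data
}
```

The saving is exactly the subgroup-sharing factor, which is why the SLM pipeline only pays off when such per-element conversion work exists.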
> ## (Pros) Register->SLM->Register is more efficient than L1 path
> - The key improvement is in the global -> register -> SLM part. The amount of data each thread needs to copy from global memory is smaller because threads collaborate to share the work of copying/reordering/dequantizing the data.
> - Subgroup specification in each workgroup could avoid duplicated data tile movement.
> - In the current mainloop, each subgroup copies the same data from L1 -> register and does any necessary reorder/data conversion/dequantization on that data.
This isn't true, is it? Subgroups only copy the tiles they need, which is only the same data if the subgroups actually need the same tiles. And if they do need the same data, an SLM pipeline will also need to copy from SLM to registers multiple times.
> - In the current mainloop, each subgroup copies the same data from L1 -> register and does any necessary reorder/data conversion/dequantization on that data.
> - To make an SLM flow efficient, we need heavy pipelining, typically triple or quadruple buffering in SLM.
> - We can also combine the SLM flow with prefetch to L1, though prefetch is much less crucial in this case.
> - Due to the lack of asynchronous hardware features on the Xe3p platform, the programming model should be switched to a producer-consumer pattern. Producer subgroups only copy from global memory to SLM.
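The triple/quadruple buffering mentioned in the quoted text boils down to ring-index bookkeeping: a prologue issues the first `Stages - 1` G->S copies, then each mainloop iteration consumes one stage while the next copy lands in the stage just freed. A minimal sequential simulation of that schedule (illustrative only, no real SLM or asynchrony):

```cpp
#include <array>
#include <vector>

// Stages-deep ring buffer schedule: consume stage k % Stages, refill it
// with tile k + Stages - 1. Returns the tiles in the order the "MMA"
// stage saw them, so the schedule can be checked for correctness.
template <int Stages>
std::vector<int> pipeline(const std::vector<int>& tiles) {
    std::array<int, Stages> slm{};   // stand-in for Stages SLM buffers
    std::vector<int> consumed;
    int n = static_cast<int>(tiles.size());
    // Prologue: issue the first Stages - 1 G->S copies.
    for (int i = 0; i < Stages - 1 && i < n; ++i)
        slm[i % Stages] = tiles[i];
    // Mainloop: each iteration overlaps one consume with one copy.
    for (int k = 0; k < n; ++k) {
        consumed.push_back(slm[k % Stages]);        // S->R, then MMA
        int next = k + Stages - 1;
        if (next < n)
            slm[next % Stages] = tiles[next];       // overlapped G->S copy
    }
    return consumed;
}
```

With `Stages = 3` this is the triple buffering from the bullet above: at any point one buffer is being consumed while up to two copies are in flight.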
It isn't at all clear that producer-consumer would be beneficial. Probably needs experiments or a good analysis based on prior data.
Yes, we need more experiments to confirm this argument.
@jiyang1011, can you please use scaledMMs for experimentation on BMG, as in #633?
I had not used subgroup specialization for my experiments, BTW.
On BMG, in each mainloop iteration, I had tried all subgroups loading equal parts of the B workgroup tile (32 subgroups loaded 16x16 tiles for a 256x32 workgroup-level B tile), converting to BF16/FP16 VNNI-16 format, applying scales, and caching in SLM. However, I used inefficient SLM <-> registers transfers.
I'll also try:
- loading equal portions of the subgroup-level MMA tile that a specific number of subgroups need. For example, if 4 subgroups need the same 32x32 tile for MMA, then they will each load 8x32 or 16x16 tiles, convert to BF16/FP16 VNNI-16 format, apply scales, and then cache them in SLM.
- Using more efficient SLM <-> registers transfers, as Ji Yang described in this PR.
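The tile split in the first bullet is just offset arithmetic; a hypothetical helper (not sycl-tla API) showing the row-wise variant, where 4 subgroups sharing a 32x32 tile each get an 8x32 slice:

```cpp
// Slice of one shared MMA tile assigned to model subgroup `sg` out of
// `sharers` cooperating subgroups, splitting by rows. Assumes tile_rows
// divides evenly by sharers, as in the 32x32 / 4 -> 8x32 example above.
struct Slice {
    int row;   // starting row of this subgroup's portion
    int rows;  // rows this subgroup loads/converts
    int cols;  // full tile width
};

Slice slice_for(int sg, int sharers, int tile_rows, int tile_cols) {
    int rows = tile_rows / sharers;
    return {sg * rows, rows, tile_cols};
}
```

Each subgroup then loads only its slice, converts and scales it, and writes it to SLM, so the whole tile is produced exactly once per workgroup.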
Thanks!
Hi Antony, we have the necessary components. Please refer to these tests and https://github.com/intel/sycl-tla/blob/main/test/unit/cute/intel_xe/copy_1d.cpp#L50. Block 2D copy and universalCopy/1-D copy can be integrated into the G->S copy operator to meet the demand.
Description
This RFC mainly supplies a proposal for a G->S copy API, and considers whether there is a possibility of gaining a benefit when SLM is introduced into the GEMM pipeline, which needs experiments to confirm.