# Xe SLM Pipelined GEMM

## Limitations of the L1 pipelined GEMM

* The current L1-pipelined GEMM implementation does not perform well; an SLM-pipelined GEMM may achieve better performance.

## Goals

The goal is to introduce Shared Local Memory ([SLM](https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2025-2/shared-local-memory.html#id-d26320e84)) into the pipelined GEMM. In more detail, we want to:

* Fully control Shared Local Memory
  - Programmers take over and optimize the shared memory path.
  - Avoid contention between multiple work items.

* Avoid bank conflicts using existing CUTLASS components (see the swizzle sketch after this list)
  - There is potential to reuse existing components (see `include/cute/swizzle.hpp`) in CUTLASS.
  - Using SLM as a software-managed cache enables higher parallelism.

* Ease of gathering
  - The L1 cache is non-deterministic.
  - Gathering and reordering go through controllable, high-bandwidth memory.

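To make the bank-conflict goal concrete, here is a minimal, standalone sketch of the kind of XOR swizzle that `include/cute/swizzle.hpp` provides. The parameters `B`, `M`, `S`, the 16-element row width, and the 16-bank count are illustrative assumptions, not the values a real kernel would use.

```cpp
#include <cstdio>

// Illustrative XOR swizzle in the spirit of cute::Swizzle<B, M, S>:
// B bits taken from position M + S are XOR'ed into the B bits at
// position M, so consecutive rows of a tile land in different banks.
// B = 3, M = 0, S = 4 are assumed values for a 16-element-wide row of
// 32-bit data; a real kernel derives them from its tile shape.
constexpr int B = 3;  // number of swizzled bits
constexpr int M = 0;  // bits below M stay untouched
constexpr int S = 4;  // distance between the two bit groups

constexpr int swizzle(int offset) {
  const int mask = ((1 << B) - 1) << (M + S);  // the "row" bits
  return offset ^ ((offset & mask) >> S);      // fold them into the "bank" bits
}

int main() {
  // Column 0 of 8 consecutive 16-element rows: unswizzled, every element
  // would hit the same bank; swizzled, they spread across 8 banks
  // (assuming 16 four-byte banks purely for illustration).
  for (int row = 0; row < 8; ++row) {
    const int linear = row * 16;  // row-major offset in 32-bit elements
    std::printf("row %d -> slm element %3d (bank %2d)\n",
                row, swizzle(linear), swizzle(linear) % 16);
  }
  return 0;
}
```

In the actual mainloop, the same transformation would be baked into the SLM layout so that both the register->SLM stores and the SLM->register loads apply it consistently.
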
## (Pros) Register->SLM->Register is more efficient than the L1 path

- The key improvement is in the global -> register -> SLM part: the amount of data each thread needs to copy from global memory is smaller because threads collaborate to share the work of copying/reordering/dequantizing the data.
- Subgroup specification within each workgroup could avoid duplicated data tile movement.
- In the current mainloop, each subgroup copies the same data from L1 -> register and does any necessary reorder/data conversion/dequantization on that data.
- To make an SLM flow efficient, we need heavy pipelining, typically triple or quadruple buffering in SLM (see the schedule sketch after this list).
- We can also combine the SLM flow with prefetch to L1, though prefetch is much less crucial in this case.
- Due to the lack of asynchronous hardware features on the Xe3p platform, the programming model should switch to a producer-consumer pattern: producer subgroups only copy from global memory to SLM.

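To make the buffering requirement concrete, below is a host-side sketch (plain C++, no SYCL or CUTLASS types) of a triple-buffered mainloop schedule. `Stages`, `TileK`, and the stub copy/compute/barrier functions are illustrative assumptions; in the kernel, `copy_tile_to_slm` would be the subgroup-cooperative global -> register -> SLM copy and `barrier()` a workgroup barrier between producers and consumers.

```cpp
#include <array>
#include <cstdio>
#include <vector>

// Host-side sketch of a triple-buffered SLM mainloop schedule.
// Stages, TileK and the stub functions are illustrative assumptions.
constexpr int Stages = 3;   // triple buffering in SLM
constexpr int TileK  = 4;   // elements of K covered by one tile (toy size)

using Tile = std::array<float, TileK>;

Tile copy_tile_to_slm(const std::vector<float>& gmem, int k_tile) {
  Tile t{};
  for (int i = 0; i < TileK; ++i) t[i] = gmem[k_tile * TileK + i];
  return t;                       // stands in for the G->R->SLM path
}

float compute_on_tile(const Tile& t) {
  float acc = 0.f;
  for (float v : t) acc += v;     // stands in for the dpas work
  return acc;
}

void barrier() { /* workgroup barrier between producers and consumers */ }

int main() {
  const int k_tiles = 8;
  std::vector<float> gmem(k_tiles * TileK, 1.0f);
  std::array<Tile, Stages> slm{};   // ring of SLM buffers
  float acc = 0.f;

  // Prologue: fill Stages - 1 buffers before any compute.
  for (int s = 0; s < Stages - 1; ++s)
    slm[s] = copy_tile_to_slm(gmem, s);
  barrier();

  // Steady state: load tile k + Stages - 1 while computing tile k.
  for (int k = 0; k < k_tiles; ++k) {
    if (k + Stages - 1 < k_tiles)
      slm[(k + Stages - 1) % Stages] = copy_tile_to_slm(gmem, k + Stages - 1);
    acc += compute_on_tile(slm[k % Stages]);
    barrier();                      // make the next buffer visible
  }
  std::printf("acc = %f (expect %d)\n", acc, k_tiles * TileK);
  return 0;
}
```

The same schedule generalizes to quadruple buffering by changing `Stages`; the prologue depth and the index arithmetic stay the same.
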
## (Cons) Cannot fully utilize hardware engines in SLM pipelines

- Type conversion/dequantization is expensive and cannot be hidden behind dpas (or increases power), for instance e4m3 -> f16/bf16 upconversion on BMG. We need an SLM pipeline to reach full performance.
- Loading is expensive. Generally this happens whenever we cannot use block 2-D atoms: sometimes the loads themselves are slow (e.g. non-4-byte-aligned loads), and sometimes so many individual loads are needed that the hardware is unable to keep them all in flight.

## Implementation of SLM Pipelined GEMM

### Provide a high-efficiency global memory to SLM interface

* The copy operator from global memory to SLM can be obtained by wrapping a 2-D block copy (G->R) together with a vectorized copy / 1-D instruction (R->SLM); a sketch of this composition follows below.
  - The 2-D block instruction already exists in CUTLASS.
  - The 1-D instruction and vectorized copy already exist in CUTLASS.
  - Reordering the shared local memory layout can avoid bank conflicts.
  - Meeting the 32-bit alignment rule is a fundamental requirement for the vectorized copy / 1-D instruction; in most cases it can be satisfied by tuning the number of stages for the various data types.

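Below is a host-side sketch of the two-step composition, assuming a 16-bit element type (`uint16_t` as a stand-in for bf16) and toy tile sizes; none of the names are taken from the CUTLASS interface. Step 1 mimics the 2-D block load (global -> register fragment) and step 2 mimics the vectorized 1-D store (register -> SLM) as packed 32-bit words, which is where the 32-bit alignment rule comes from.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Host-side sketch of the two-step copy; tile sizes and the uint16_t
// "bf16" stand-in are illustrative assumptions.
constexpr int TileRows = 4;
constexpr int TileCols = 8;  // 8 x 2 bytes = 16 bytes per row
static_assert((TileCols * sizeof(std::uint16_t)) % 4 == 0,
              "rows must be 32-bit aligned for the vectorized SLM store");

int main() {
  std::vector<std::uint16_t> gmem(64 * 64, 0x3F80);        // fake global tensor
  std::vector<std::uint32_t> slm(TileRows * TileCols / 2); // SLM as 32-bit words

  std::uint16_t frag[TileRows * TileCols];                 // register fragment

  // Step 1: "2-D block" load of a TileRows x TileCols tile at (row0, col0).
  const int row0 = 8, col0 = 16, ld = 64;
  for (int r = 0; r < TileRows; ++r)
    std::memcpy(&frag[r * TileCols], &gmem[(row0 + r) * ld + col0],
                TileCols * sizeof(std::uint16_t));

  // Step 2: vectorized store to SLM, two 16-bit elements per 32-bit word.
  for (std::size_t w = 0; w < slm.size(); ++w) {
    std::uint32_t packed;
    std::memcpy(&packed, &frag[2 * w], sizeof packed);
    slm[w] = packed;  // a swizzled index could be applied here (see above)
  }

  std::printf("copied %zu 32-bit words to SLM\n", slm.size());
  return 0;
}
```

In the kernel, both steps would be expressed with the existing CUTLASS copy atoms rather than `memcpy`, and a swizzled SLM index (as in the earlier sketch) would be applied at the store to avoid bank conflicts.
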
### Producer-Consumer Programming Model

* A kernel-level design pattern in which different subgroups are statically assigned to asymmetric roles (a minimal sketch follows below):
  - Producers – issue asynchronous copy instructions (block 2-D load / vectorized load) that move data from global memory into shared local memory / registers.
  - Consumers – issue compute instructions (dpas, math, etc.) that read the freshly delivered data and produce results.
* Key properties
  - Subgroup-specialised: only a subset of the subgroups in a workgroup act as producers; the rest are consumers.
  - Scale-out: the same kernel can run with 2 producer subgroups + 6 consumer subgroups, or 4 + 12, etc.

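A minimal sketch of the static role split; the subgroup counts and stub functions are illustrative assumptions. In the kernel, `sg_id` would be the subgroup index within the workgroup, and all subgroups would run concurrently rather than in a loop.

```cpp
#include <cstdio>

// Sketch of the static role split; NumProducerSgs / NumConsumerSgs and the
// stub functions are illustrative assumptions, not the CUTLASS interface.
constexpr int NumProducerSgs = 2;   // e.g. 2 producers + 6 consumers
constexpr int NumConsumerSgs = 6;

void produce(int sg_id) { std::printf("sg %d: global -> SLM copy\n", sg_id); }
void consume(int sg_id) { std::printf("sg %d: SLM -> register, dpas\n", sg_id); }

int main() {
  for (int sg_id = 0; sg_id < NumProducerSgs + NumConsumerSgs; ++sg_id) {
    const bool is_producer = sg_id < NumProducerSgs;  // static role assignment
    if (is_producer)
      produce(sg_id);   // only issues copy instructions
    else
      consume(sg_id);   // only issues compute instructions
  }
  return 0;
}
```

Because the split is static, the producer-to-consumer ratio (2 + 6, 4 + 12, ...) can be chosen per problem shape without changing the kernel structure.
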
<!-- ### Subgroup Specification
* `Subgroup Specification` originates from the concept of `Warp Specialization` on NVIDIA Hopper.
  - Warp groups (warpgroup) – 4 consecutive warps (128 threads) that act as one schedulable unit for the matrix-multiply-and-accumulate instruction WGMMA.
  - Asynchronous special-purpose units – TMA (Tensor Memory Accelerator) for global→shared bulk copy and WGMMA for shared→register MMA; both run without stalling the issuing warp.
  - Role declaration + resource budget – warps are statically labelled as producer (TMA), consumer (WGMMA), reduction, etc., and the register file can be re-allocated per warpgroup with the new PTX directive. -->
