13 | 13 | - [X] mat_transpose_f32x4_shared_row2col_kernel (float4 vectorized version, shared memory)
14 | 14 | - [X] mat_transpose_f32x4_shared_bcf_col2row_kernel (float4 vectorized version, shared memory, bank-conflict-free)
15 | 15 | - [X] mat_transpose_f32x4_shared_bcf_row2col_kernel (float4 vectorized version, shared memory, bank-conflict-free)
16 | | -- CuTe kernel and configurations |
17 | | - - mat_transpose_cute_reg_kernel |
18 | | - - [X] mat_transpose_cute_row2col_reg |
19 | | - - [X] mat_transpose_cute_col2row_reg |
20 | | - - mat_transpose_cute_smem_kernel (smem) |
21 | | - - [X] mat_transpose_cute_col_smem |
22 | | - - [X] mat_transpose_cute_row_smem |
23 | | - - [X] mat_transpose_cute_col_smem_swizzled (bank conflict free) |
24 | | - - [X] mat_transpose_cute_row_smem_swizzled |
25 | | - - mat_transpose_cute_smem_vectorized_kernel (float4) |
26 | | - - [X] mat_transpose_cute_row_cvectorized |
27 | | - - [X] mat_transpose_cute_row_cvectorized_swizzled |
28 | | - - [X] mat_transpose_cute_row_rvectorized |
29 | | - - [X] mat_transpose_cute_row_rvectorized_swizzled |
| 16 | +- [X] mat_transpose_cute_row2col_reg |
| 17 | +- [X] mat_transpose_cute_col2row_reg |
| 18 | +- [X] mat_transpose_cute_col_smem |
| 19 | +- [X] mat_transpose_cute_row_smem |
| 20 | +- [X] mat_transpose_cute_col_smem_swizzled (bank conflict free) |
| 21 | +- [X] mat_transpose_cute_row_smem_swizzled |
| 22 | +- [X] mat_transpose_cute_row_cvectorized |
| 23 | +- [X] mat_transpose_cute_row_cvectorized_swizzled |
| 24 | +- [X] mat_transpose_cute_row_rvectorized |
| 25 | +- [X] mat_transpose_cute_row_rvectorized_swizzled |
30 | 26 | - [X] PyTorch bindings
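31 | 27 |
32 | 28 | Although matrix transpose is a basic operation, it is well suited for practice: it is somewhat easier than matrix multiplication, yet most of the optimization techniques used there can be adapted here.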