This repository is an ongoing archive of CUDA kernels I've been implementing.
The kernels live under the `include/` directory, and the benchmarks live in `src/`. Run `make` to compile, then run the executables in the `build/` directory.
Kernels in this repo:
- The `include/gemm/` folder contains naive, blocktiled, and threadtiled implementations of GEneral Matrix Multiply (GEMM), in both FP32 and FP16 formats (SGEMM and HGEMM, respectively). A minimal naive SGEMM sketch follows this list.
- The `include/attn/` folder contains kernels for the transpose operation, the softmax operation, and the flash attention forward pass, in both FP16 and FP32 formats. A softmax sketch also follows this list.
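For orientation, here is a minimal sketch of what the naive GEMM variant computes: one thread per element of C = alpha * A * B + beta * C. This is an illustrative sketch under assumed row-major layouts and a hypothetical kernel name, not the repo's actual code.

```cuda
// Hypothetical naive SGEMM sketch: one thread computes one element of
// C = alpha * A * B + beta * C, with A (MxK), B (KxN), C (MxN) row-major.
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float *A, const float *B,
                            float beta, float *C) {
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}

// Example launch (names assumed): 16x16 threads, one block per output tile.
// dim3 block(16, 16), grid((N + 15) / 16, (M + 15) / 16);
// sgemm_naive<<<grid, block>>>(M, N, K, alpha, dA, dB, beta, dC);
```

The blocktiled and threadtiled variants improve on this by staging tiles of A and B in shared memory and computing several output elements per thread.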
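Likewise, a hedged sketch of a numerically stable row-wise softmax, the building block of the attention kernels: subtract the row maximum before exponentiating, then normalize by the row sum. The kernel name, launch shape, and the power-of-two `blockDim.x` assumption are mine, not the repo's.

```cuda
// Hypothetical row-wise softmax sketch: one block per row, shared-memory
// reductions for the row max and the sum of exponentials.
// Assumes blockDim.x is a power of two.
__global__ void softmax_rows(const float *in, float *out, int rows, int cols) {
    extern __shared__ float buf[];
    const int row = blockIdx.x;
    const int tid = threadIdx.x;

    // 1) Reduce the row maximum (for numerical stability).
    float m = -INFINITY;
    for (int c = tid; c < cols; c += blockDim.x)
        m = fmaxf(m, in[row * cols + c]);
    buf[tid] = m;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] = fmaxf(buf[tid], buf[tid + s]);
        __syncthreads();
    }
    const float row_max = buf[0];
    __syncthreads();

    // 2) Reduce the sum of exponentials.
    float sum = 0.0f;
    for (int c = tid; c < cols; c += blockDim.x)
        sum += expf(in[row * cols + c] - row_max);
    buf[tid] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    const float row_sum = buf[0];

    // 3) Normalize.
    for (int c = tid; c < cols; c += blockDim.x)
        out[row * cols + c] = expf(in[row * cols + c] - row_max) / row_sum;
}

// Example launch (names assumed): one block per row, dynamic shared memory
// sized to one float per thread.
// softmax_rows<<<rows, 256, 256 * sizeof(float)>>>(d_in, d_out, rows, cols);
```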
Next steps:
- Add benchmarking for the attention kernels.
- Improve the SGEMM benchmarks, and add HGEMM benchmarks.
- Further optimizations: double buffering, vectorized loads, tensor cores, etc. (a vectorized-load sketch follows this list).
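To illustrate the vectorized-loads item: the idea is to move 128 bits per memory instruction via `float4` instead of four scalar `float` accesses. The standalone copy kernel below is a hypothetical example of the pattern, assuming 16-byte-aligned pointers and `N` divisible by 4; in a GEMM kernel the same trick applies when staging tiles from global to shared memory.

```cuda
// Hypothetical sketch of vectorized global-memory access: each thread
// moves one float4 (128 bits) instead of four separate floats.
// Assumes src and dst are 16-byte aligned and N is a multiple of 4.
__global__ void copy_vectorized(const float *src, float *dst, int N) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (4 * i + 3 < N) {
        // One 128-bit load and one 128-bit store per thread.
        float4 v = reinterpret_cast<const float4 *>(src)[i];
        reinterpret_cast<float4 *>(dst)[i] = v;
    }
}
```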
My main reference so far has been this incredibly useful tutorial.