Making the official Triton documentation tutorials actually comprehensible by heavily commenting, in detail, on every little thing that's happening. Follow them in order of filename and check out the accompanying videos.
Note: these tutorials were all tested and benchmarked on an NVIDIA RTX 4060 Ti. On different GPUs your mileage may vary, and on GPUs with less VRAM or SRAM you may even receive errors. I've also seen older GPUs (e.g. the RTX 3090) produce incorrect values when running the exact same Triton code, so I recommend using at least a 40-series card.
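If you're not sure how your card compares, here's a minimal sketch for checking what you're running on before you start. It assumes PyTorch is installed alongside Triton, and `describe_gpu` is a hypothetical helper name, not something from this repo. It only reports the device name, total VRAM, and compute capability; it can't tell you whether a given tutorial will fit or give correct values.

```python
def describe_gpu(device_index=0):
    """Return basic facts about a CUDA GPU, or None if none is usable."""
    try:
        import torch  # assumption: PyTorch is installed alongside Triton
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(device_index)
    return {
        "name": props.name,
        # total_memory is in bytes; convert to GiB for readability
        "vram_gib": round(props.total_memory / 2**30, 1),
        "compute_capability": (props.major, props.minor),
    }


if __name__ == "__main__":
    info = describe_gpu()
    if info is None:
        print("No usable CUDA GPU found (or PyTorch is not installed).")
    else:
        print(info)
```

For reference, an RTX 4060 Ti reports compute capability (8, 9) and an RTX 3090 reports (8, 6), so a lower number than (8, 9) is a hint that you may hit the issues mentioned above.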
- of course the official Triton documentation
- here's a flash-attention implementation by one of my fav YouTubers that comes with an 8-hour video
- and the original flash-attention papers v1 & v2 (you only really need v2)
- here's a wider set of GPU kernel guides that includes an intro to Triton in lesson 14
