This repository is a living log of my journey into GPU programming with CUDA, Triton, and ONNX, inspired by the book Programming Massively Parallel Processors (PMPP).
The goal is simple:
- Learn GPU computing step by step.
- Document everything I practice and read.
- Share code, notes, and resources so others can follow along.
Whether you're new to GPU programming or brushing up, you'll find tutorials, experiments, and resources here.
- GPU: NVIDIA RTX / GTX or any CUDA-capable GPU
- Drivers: Install the latest NVIDIA GPU drivers
# On Ubuntu WSL
sudo apt update
sudo apt install nvidia-cuda-toolkit
nvcc --version   # verify installation
Official CUDA Toolkit Install Guide
pip install torch triton
- Install VS Code
- Install Remote - WSL extension
- Connect to Ubuntu from VS Code (this repo is developed on WSL)
- Installed CUDA Toolkit and set up VS Code with WSL.
- Learned about threads, blocks, and grids in GPU execution.
- Practiced first kernel: vector addition on GPU.
- Worked on Pinned and Unified Memory.
Code: Vector_addition.cu
Code: Pinned_memory_Vector_addition.cu
Code: Unified_memory_Vector_addition.cu
Resources:
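The thread/block/grid mapping above is easiest to see in the index arithmetic. As a CPU-side sketch (plain Python, not the repo's `Vector_addition.cu`), this mimics how each CUDA thread computes its global index from `blockIdx.x * blockDim.x + threadIdx.x` and guards against out-of-bounds work:

```python
def simulated_vector_add(a, b, block_dim=256):
    """CPU simulation of a CUDA vector-add launch: loop over the grid and
    the threads in each block, computing the same global index a kernel would."""
    n = len(a)
    out = [0.0] * n
    grid_dim = (n + block_dim - 1) // block_dim  # ceil-divide, as in a real launch
    for block_idx in range(grid_dim):            # plays the role of blockIdx.x
        for thread_idx in range(block_dim):      # plays the role of threadIdx.x
            idx = block_idx * block_dim + thread_idx
            if idx < n:                          # same bounds guard as the kernel
                out[idx] = a[idx] + b[idx]
    return out
```

The `if idx < n` guard matters because the grid is rounded up to whole blocks, so the last block usually has threads past the end of the array.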
- Studied CUDA memory hierarchy.
- Benchmarked performance differences between global and shared memory.
- Wrote kernel for matrix multiplication using shared memory.
Code: day2_matrix_multiplication.cu
Resources:
- PMPP Chapter 2 (Memory Hierarchy)
- YouTube: CUDA Memory Hierarchy Explained
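The shared-memory idea behind the Day 2 kernel can be sketched on the CPU: instead of streaming every element from "global" memory per multiply, the kernel stages small tiles and accumulates partial products. This NumPy version (an illustration, not the repo's `day2_matrix_multiplication.cu`) mirrors that tiling structure:

```python
import numpy as np

TILE = 16  # tile edge, analogous to the blockDim of the shared-memory kernel

def tiled_matmul(A, B):
    """Compute A @ B by accumulating TILE x TILE sub-blocks, mirroring how a
    shared-memory CUDA kernel stages tiles of A and B before the partial dot."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, TILE):
        for j0 in range(0, m, TILE):
            for t0 in range(0, k, TILE):
                # these two slices play the role of the __shared__ tiles
                a_tile = A[i0:i0 + TILE, t0:t0 + TILE]
                b_tile = B[t0:t0 + TILE, j0:j0 + TILE]
                C[i0:i0 + TILE, j0:j0 + TILE] += a_tile @ b_tile
    return C
```

On a GPU the win comes from each tile being loaded from global memory once and then reused TILE times from fast shared memory, cutting global-memory traffic by roughly a factor of TILE.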
- Installed Triton and ran first kernel.
- Compared Triton vs CUDA in terms of ease of use.
- Implemented vector add in Triton.
Code: day3_triton_vector_add.py
Resources:
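Triton's programming model differs from CUDA's: each "program" handles a whole block of offsets at once, with a mask guarding the ragged tail (`tl.program_id`, `tl.arange`, and masked `tl.load`/`tl.store` in real Triton). Since Triton itself needs a CUDA GPU, here is a CPU mock of that pattern in NumPy, not the repo's `day3_triton_vector_add.py`:

```python
import numpy as np

BLOCK_SIZE = 128  # elements handled per Triton "program" (kernel instance)

def triton_style_add(x, y):
    """CPU mock of Triton's vector-add pattern: one program per block of
    offsets, with a boolean mask standing in for tl.load(..., mask=...)."""
    n = x.shape[0]
    out = np.empty_like(x)
    num_programs = (n + BLOCK_SIZE - 1) // BLOCK_SIZE  # like triton.cdiv
    for pid in range(num_programs):                    # like tl.program_id(0)
        offsets = pid * BLOCK_SIZE + np.arange(BLOCK_SIZE)  # like tl.arange
        mask = offsets < n                             # guard the ragged tail
        valid = offsets[mask]
        out[valid] = x[valid] + y[valid]
    return out
```

Compared with CUDA, where you reason per-thread, Triton has you reason per-block of data, which is a large part of why it feels easier to use.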
- Learned about warps (32 threads) in CUDA.
- Used `nvprof` to analyze kernel occupancy.
- Started optimizing matrix multiplication.
Code: day4_kernel_optimizations.cu
Resources:
- PMPP Chapter 3 (Performance)
- Blog: NVIDIA CUDA Best Practices Guide
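The warp and occupancy numbers above can be computed by hand. This sketch uses illustrative per-SM limits (48 resident warps, 16 resident blocks; the real values vary by compute capability, and register/shared-memory pressure can lower occupancy further):

```python
import math

def warps_per_block(threads_per_block, warp_size=32):
    # CUDA schedules threads in warps of 32; a partial warp still occupies a slot
    return math.ceil(threads_per_block / warp_size)

def occupancy(threads_per_block, max_warps_per_sm=48, max_blocks_per_sm=16):
    """Rough occupancy estimate, ignoring register and shared-memory limits:
    the fraction of the SM's warp slots filled by resident blocks."""
    wpb = warps_per_block(threads_per_block)
    resident_blocks = min(max_blocks_per_sm, max_warps_per_sm // wpb)
    return resident_blocks * wpb / max_warps_per_sm
```

For example, 256-thread blocks give 8 warps each, so 6 blocks fill all 48 warp slots, while 1024-thread blocks (32 warps) leave a third of the slots empty under these assumed limits.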
- Exported a PyTorch model to ONNX.
- Ran inference using ONNX Runtime GPU Execution Provider.
- Benchmarked CPU vs GPU latency.
Code: day5_onnx_runtime_gpu.py
Resources:
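For the CPU-vs-GPU latency comparison, the timing harness matters as much as the model: warmup runs absorb one-time costs (session setup, allocator warm-up), and the median is more robust than the mean. A minimal stdlib-only sketch of such a harness (not the exact code in `day5_onnx_runtime_gpu.py`):

```python
import time

def measure_latency_ms(fn, warmup=5, iters=50):
    """Median wall-clock latency of fn() in milliseconds.

    Warmup runs absorb one-time costs before timing begins; the median of
    the remaining samples damps outliers from OS scheduling noise.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    return samples[len(samples) // 2]
```

With ONNX Runtime you would typically time something like `lambda: session.run(None, inputs)` twice, once for a session created with the CPU execution provider and once with the CUDA execution provider, and compare the two medians.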
- Programming Massively Parallel Processors (PMPP) by David B. Kirk & Wen-mei W. Hwu