
🚀 My GPU Programming Journey

This repository is a living log of my journey into GPU programming with CUDA, Triton, and ONNX, inspired by the book Programming Massively Parallel Processors (PMPP).

The goal is simple:

  • Learn GPU computing step by step.
  • Document everything I practice and read.
  • Share code, notes, and resources so others can follow along.

Whether you’re new to GPU programming or brushing up, you’ll find tutorials, experiments, and resources here.


πŸ› οΈ Setup

1. Hardware + Drivers

  • GPU: NVIDIA RTX / GTX or any CUDA-capable GPU
  • Drivers: Install the latest NVIDIA GPU drivers

2. CUDA Toolkit

# On Ubuntu (WSL); the distro package may lag NVIDIA's official releases
sudo apt update
sudo apt install nvidia-cuda-toolkit
nvcc --version  # verify the installation

Official CUDA Toolkit Install Guide

3. Python + Triton

pip install torch triton
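
A quick sanity check that PyTorch can see the GPU (assumes a CUDA-capable device and a working driver):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"  # expect True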

4. VS Code + WSL

  • Install VS Code
  • Install Remote - WSL extension
  • Connect to Ubuntu from VS Code (this repo is developed on WSL)

📖 Daily Log

🟢 Day 1: Getting Started with CUDA

  • Installed CUDA Toolkit and set up VS Code with WSL.
  • Learned about threads, blocks, and grids in GPU execution.
  • Practiced my first kernel: vector addition on the GPU (see the sketch below).
  • Worked with pinned and unified memory.
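
For reference, a minimal vector-add sketch in the spirit of the files below. It uses unified memory (cudaMallocManaged) for brevity; the repo files also cover the explicit-copy and pinned-memory variants, so details may differ from the actual code.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one element; blockIdx/blockDim/threadIdx map the
// grid of blocks onto the flat array.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard the ragged tail
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));  // unified memory: one pointer
    cudaMallocManaged(&b, n * sizeof(float));  // valid on host and device
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();       // wait for the kernel to finish

    printf("c[0] = %f\n", c[0]);   // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}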

📂 Code: Vector_addition.cu
📂 Code: Pinned_memory_Vector_addition.cu
📂 Code: Unified_memory_Vector_addition.cu

🔗 Resources:


🟢 Day 2: Memory Hierarchy (Registers, Shared, Global, L1/L2)

  • Studied CUDA memory hierarchy.
  • Benchmarked the performance difference between global and shared memory.
  • Wrote a matrix-multiplication kernel using shared memory (tiling sketch below).
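
A sketch of the tiling idea (a generic shared-memory matmul, not necessarily identical to the file below): each block stages TILE x TILE tiles of A and B in shared memory, so each global value is loaded once per tile rather than once per multiply.

#define TILE 16

// C = A * B for n x n matrices.
// Launch with dim3 block(TILE, TILE), grid((n+TILE-1)/TILE, (n+TILE-1)/TILE).
__global__ void matMulShared(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        // Cooperative load: each thread fetches one element of each tile,
        // zero-padding anything out of range.
        As[threadIdx.y][threadIdx.x] = (row < n && t * TILE + threadIdx.x < n)
            ? A[row * n + t * TILE + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t * TILE + threadIdx.y < n && col < n)
            ? B[(t * TILE + threadIdx.y) * n + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // done reading before the next load overwrites
    }
    if (row < n && col < n) C[row * n + col] = acc;
}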

📂 Code: day2_matrix_multiplication.cu
🔗 Resources:


🟢 Day 3: Triton Basics

  • Installed Triton and ran first kernel.
  • Compared Triton and CUDA for ease of use.
  • Implemented vector add in Triton (sketch below).
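
The kernel follows the standard Triton vector-add pattern (roughly what the file below implements; exact names may differ):

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE chunk; the mask covers
    # the ragged tail that CUDA handles with an explicit bounds check.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.rand(1 << 20, device="cuda")
y = torch.rand(1 << 20, device="cuda")
assert torch.allclose(add(x, y), x + y)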

📂 Code: day3_triton_vector_add.py
🔗 Resources:


🟢 Day 4: Optimizing Kernels (Occupancy & Warps)

  • Learned about warps (groups of 32 threads) in CUDA.
  • Used nvprof to analyze kernel occupancy (example invocation below).
  • Started optimizing matrix multiplication.
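
An example invocation, assuming the Day 4 code is compiled to ./day4_kernel_optimizations. Note that nvprof is deprecated on Volta and newer GPUs, where Nsight Compute (ncu) replaces it.

nvcc -O3 day4_kernel_optimizations.cu -o day4_kernel_optimizations
nvprof --metrics achieved_occupancy ./day4_kernel_optimizations  # per-kernel achieved occupancy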

📂 Code: day4_kernel_optimizations.cu
🔗 Resources:


🟢 Day 5: ONNX Runtime + GPU Execution

  • Exported a PyTorch model to ONNX.
  • Ran inference using ONNX Runtime GPU Execution Provider.
  • Benchmarked CPU vs GPU latency (sketch below).
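
A minimal sketch of the export-and-benchmark flow (the model and names here are placeholders, not necessarily what the file below uses):

import time
import numpy as np
import torch
import onnxruntime as ort

# Placeholder model; swap in your own module.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU()).eval()
dummy = torch.randn(1, 512)
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["output"])

# ONNX Runtime uses the first provider it can load, falling back to CPU
# if the CUDA provider is unavailable.
sess = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

x = np.random.randn(1, 512).astype(np.float32)
sess.run(None, {"input": x})  # warm-up run
start = time.perf_counter()
for _ in range(100):
    sess.run(None, {"input": x})
print(f"avg latency: {(time.perf_counter() - start) / 100 * 1e3:.3f} ms")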

📂 Code: day5_onnx_runtime_gpu.py
🔗 Resources:


📚 Learning Resources

📖 Books

  • Programming Massively Parallel Processors (PMPP) by David Kirk & Wen-mei Hwu

🎥 YouTube Channels

πŸ§‘β€πŸ’» Blogs & Docs
