
🚀 My GPU Programming Journey

This repository is a living log of my journey into GPU programming with CUDA, Triton, and ONNX, inspired by the book Programming Massively Parallel Processors (PMPP).

The goal is simple:

  • Learn GPU computing step by step.
  • Document everything I practice and read.
  • Share code, notes, and resources so others can follow along.

Whether you’re new to GPU programming or brushing up, you’ll find tutorials, experiments, and resources here.


πŸ› οΈ Setup

1. Hardware + Drivers

  • GPU: NVIDIA RTX / GTX or any CUDA-capable GPU
  • Drivers: Install the latest NVIDIA GPU drivers

2. CUDA Toolkit

# On Ubuntu (WSL); the distro package may lag NVIDIA's official releases
sudo apt update
sudo apt install nvidia-cuda-toolkit
nvcc --version  # verify the installation

Official CUDA Toolkit Install Guide

3. Python + Triton

pip install torch triton
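
A quick sanity check that PyTorch can see the GPU (assumes a CUDA-capable device and a working driver):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"  # expect True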

4. VS Code + WSL

  • Install VS Code
  • Install Remote - WSL extension
  • Connect to Ubuntu from VS Code (this repo is developed on WSL)

📖 Daily Log

🟢 Day 1: Getting Started with CUDA

  • Installed CUDA Toolkit and set up VS Code with WSL.
  • Learned about threads, blocks, and grids in GPU execution.
  • Practiced my first kernel: vector addition on the GPU (see the sketch below).
  • Worked with pinned and unified memory.
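
For reference, a minimal vector-add sketch in the spirit of the files below. It uses unified memory (cudaMallocManaged) for brevity; the repo files also cover the explicit-copy and pinned-memory variants, so details may differ from the actual code.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one element; blockIdx/blockDim/threadIdx map the
// grid of blocks onto the flat array.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard the ragged tail
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));  // unified memory: one pointer
    cudaMallocManaged(&b, n * sizeof(float));  // valid on host and device
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();       // wait for the kernel to finish

    printf("c[0] = %f\n", c[0]);   // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}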

📂 Code: Vector_addition.cu
📂 Code: Pinned_memory_Vector_addition.cu
📂 Code: Unified_memory_Vector_addition.cu

🔗 Resources:


🟢 Day 2: Memory Hierarchy (Registers, Shared, Global, L1/L2)

  • Studied CUDA memory hierarchy.
  • Benchmarked the performance difference between global and shared memory.
  • Wrote a matrix-multiplication kernel using shared memory (tiling sketch below).
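
A sketch of the tiling idea (a generic shared-memory matmul, not necessarily identical to the file below): each block stages TILE x TILE tiles of A and B in shared memory, so each global value is loaded once per tile rather than once per multiply.

#define TILE 16

// C = A * B for n x n matrices.
// Launch with dim3 block(TILE, TILE), grid((n+TILE-1)/TILE, (n+TILE-1)/TILE).
__global__ void matMulShared(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        // Cooperative load: each thread fetches one element of each tile,
        // zero-padding anything out of range.
        As[threadIdx.y][threadIdx.x] = (row < n && t * TILE + threadIdx.x < n)
            ? A[row * n + t * TILE + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t * TILE + threadIdx.y < n && col < n)
            ? B[(t * TILE + threadIdx.y) * n + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // done reading before the next load overwrites
    }
    if (row < n && col < n) C[row * n + col] = acc;
}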

📂 Code: day2_matrix_multiplication.cu
🔗 Resources:


🟢 Day 3: Triton Basics

  • Installed Triton and ran first kernel.
  • Compared Triton and CUDA for ease of use.
  • Implemented vector add in Triton (sketch below).
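
The kernel follows the standard Triton vector-add pattern (roughly what the file below implements; exact names may differ):

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE chunk; the mask covers
    # the ragged tail that CUDA handles with an explicit bounds check.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.rand(1 << 20, device="cuda")
y = torch.rand(1 << 20, device="cuda")
assert torch.allclose(add(x, y), x + y)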

📂 Code: day3_triton_vector_add.py
🔗 Resources:


🟢 Day 4: Optimizing Kernels (Occupancy & Warps)

  • Learned about warps (groups of 32 threads) in CUDA.
  • Used nvprof to analyze kernel occupancy (example invocation below).
  • Started optimizing matrix multiplication.
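
An example invocation, assuming the Day 4 code is compiled to ./day4_kernel_optimizations. Note that nvprof is deprecated on Volta and newer GPUs, where Nsight Compute (ncu) replaces it.

nvcc -O3 day4_kernel_optimizations.cu -o day4_kernel_optimizations
nvprof --metrics achieved_occupancy ./day4_kernel_optimizations  # per-kernel achieved occupancy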

📂 Code: day4_kernel_optimizations.cu
🔗 Resources:


🟢 Day 5: ONNX Runtime + GPU Execution

  • Exported a PyTorch model to ONNX.
  • Ran inference using ONNX Runtime GPU Execution Provider.
  • Benchmarked CPU vs GPU latency (sketch below).
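
A minimal sketch of the export-and-benchmark flow (the model and names here are placeholders, not necessarily what the file below uses):

import time
import numpy as np
import torch
import onnxruntime as ort

# Placeholder model; swap in your own module.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU()).eval()
dummy = torch.randn(1, 512)
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["output"])

# ONNX Runtime uses the first provider it can load, falling back to CPU
# if the CUDA provider is unavailable.
sess = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

x = np.random.randn(1, 512).astype(np.float32)
sess.run(None, {"input": x})  # warm-up run
start = time.perf_counter()
for _ in range(100):
    sess.run(None, {"input": x})
print(f"avg latency: {(time.perf_counter() - start) / 100 * 1e3:.3f} ms")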

📂 Code: day5_onnx_runtime_gpu.py
🔗 Resources:


📚 Learning Resources

📖 Books

  • Programming Massively Parallel Processors (PMPP) by David Kirk & Wen-mei Hwu

🎥 YouTube Channels

πŸ§‘β€πŸ’» Blogs & Docs
