Repository of lessons exploring image diffusion models, focused on understanding and education.
This series is heavily inspired by Andrej Karpathy's Zero to Hero series of videos. Well, actually, we are straight up copying that series, because those videos are so good. Seriously, if you haven't followed his videos, go do that now - lots of great stuff in there!
Each lesson contains an explanatory video that walks you through the lesson and the code, a Colab notebook that corresponds to the video material, and a pointer to the runnable code on GitHub. All of the code is designed to run on a minimal GPU. We test everything on T4 instances, since that is what Colab provides at the free tier, and they are cheap to run on AWS as standalone instances. In theory, each lesson should run on any GPU with 8GB or more of memory, since every lesson is designed to be trained in real time on minimal hardware so that we can really dive into the code.
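If you want to confirm what PyTorch sees on your machine before diving in, a quick optional check like the one below prints the detected GPU and its memory. This snippet is not part of any lesson; it is just a convenience.

```python
import torch

# Optional sanity check: report the GPU PyTorch detects and its memory.
# The lessons target roughly 8 GB of GPU memory; a Colab T4 reports ~16 GB.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA device found - the lessons will be very slow on CPU.")
```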
Each lesson is in its own subdirectory, and we have ordered the lessons chronologically (from oldest to newest) so that it's easy to trace the development of the research and see the historical progress of the field.
Since every lesson is meant to be trained in real time at minimal cost, most of the lessons are restricted to training on the MNIST dataset, simply because it is quick to train on and easy to visualize.
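For context, the sketch below shows the kind of MNIST pipeline the lessons build on, using torchvision. The exact transforms, normalization, and batch sizes vary from lesson to lesson, so treat this as illustrative rather than the canonical loader.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Illustrative MNIST pipeline; individual lessons define their own variants.
transform = transforms.Compose([
    transforms.ToTensor(),                 # floats in [0, 1], shape (1, 28, 28)
    transforms.Normalize((0.5,), (0.5,)),  # rescale to roughly [-1, 1]
])
dataset = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
loader = DataLoader(dataset, batch_size=128, shuffle=True)

images, _ = next(iter(loader))
print(images.shape)  # torch.Size([128, 1, 28, 28])
```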
For even more diffusion models, including audio and video diffusion models, check out the xdiffusion repository, which is a unified modeling framework for image, audio, and video diffusion modeling.
All lessons are built using PyTorch and written in Python 3. To set up an environment to run all of the lessons, we suggest using conda or venv:
> python3 -m venv mindiffusion_env
> source mindiffusion_env/bin/activate
> pip install --upgrade pip
> pip install -r requirements.txt
All lessons are designed to be run in the lesson directory, not the root of the repository.
- Emu (abstract)
- CogView (abstract)
- CogView 2 (abstract)
- CogView 3 (abstract)
- Consistency Models (abstract)
- Latent Consistency Models (abstract)
- Scalable Diffusion Models with State Space Backbone (abstract)
- Palette: Image-to-Image Diffusion Models (abstract)
- MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation (abstract)
- Matryoshka Diffusion Models (abstract)
- On the Importance of Noise Scheduling for Diffusion Models (abstract)
- Analyzing and Improving the Training Dynamics of Diffusion Models (abstract)
- Elucidating the Design Space of Diffusion-Based Generative Models (abstract)
- Flow Matching for Generative Modeling (abstract)
- U-ViT: All are Worth Words: A ViT Backbone for Diffusion Models (abstract)
- MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer (abstract)
- DiffiT: Diffusion Vision Transformers for Image Generation (abstract)
- Scaling Vision Transformers to 22 Billion Parameters (abstract)
- DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention (abstract)
- DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis (abstract)
- IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models (abstract)
- JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation (abstract)
- Adversarial Diffusion Distillation (abstract)
- Discrete Predictor-Corrector Diffusion Models for Image Synthesis (abstract)
- One-step Diffusion with Distribution Matching Distillation (abstract)
- Salient Object-Aware Background Generation using Text-Guided Diffusion Models (abstract)
- Versatile Diffusion (abstract)
- D3PM: Structured Denoising Diffusion Models in Discrete State-Spaces (abstract)
Most of the implementations have been consolidated into a single image and video diffusion repository, which is configurable through YAML files.
If you are interested in video diffusion models, take a look at the video diffusion models repository, where we are adding implementations of the latest video diffusion model papers, trained on an MNIST-equivalent video dataset.