A PyTorch implementation of Denoising Diffusion Probabilistic Models (DDPM) for conditional MNIST digit generation. This project demonstrates how diffusion models can learn to generate high-quality images by gradually denoising random noise.
This implementation includes:
- Custom U-Net architecture with time and class conditioning
- Forward diffusion process that gradually adds noise to images
- Reverse diffusion process that learns to denoise and generate new images
- Conditional generation - generate specific digits (0-9)
- GIF visualization of the complete denoising process
The model learns to transform pure noise into recognizable MNIST digits. Below are GIFs showing the complete diffusion process for each digit class:
The core model is a U-Net architecture with the following components:
```python
# Sinusoidal position embedding for timesteps
self.time_mlp = nn.Sequential(
    SinusoidalPositionEmbedding(time_emb_dim),
    nn.Linear(time_emb_dim, time_emb_dim * 4),
    nn.GELU(),
    nn.Linear(time_emb_dim * 4, time_emb_dim)
)

# Learnable embedding for digit classes (0-9)
self.label_emb = nn.Embedding(num_classes, time_emb_dim)
```

- 4 DownBlocks with progressively increasing channels (64 → 128 → 256 → 512)
- Each block contains:
  - 2 ResNet blocks with time/label conditioning
  - Attention mechanism (applied to even-indexed layers)
  - Space-to-depth downsampling
- Bottleneck: 2 ResNet blocks + 1 Attention block, processing the most compressed representation
- 4 UpBlocks with skip connections from encoder
- Each block contains:
  - Transpose convolution for upsampling
  - Concatenation with the skip connection
  - 2 ResNet blocks with conditioning
  - Attention mechanism
- Weight Standardized Convolutions: Improves training stability
- Group Normalization: Better than BatchNorm for small batches
- SiLU Activation: Smooth, differentiable activation function
- Residual Connections: Helps with gradient flow
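The sinusoidal timestep embedding mentioned above can be sketched in plain Python. This follows the standard transformer-style formulation (frequency base 10000, sin/cos halves); the exact layout in this project's `SinusoidalPositionEmbedding` class may differ:

```python
import math

def sinusoidal_embedding(t, dim):
    """Encode an integer timestep t as a dim-length vector of sin/cos
    values at log-spaced frequencies, so nearby timesteps get similar
    embeddings while distant ones stay distinguishable. dim must be even."""
    half = dim // 2
    # frequencies decay geometrically from 1 down to 1/10000
    freqs = [math.exp(-math.log(10000.0) * i / (half - 1)) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

vec = sinusoidal_embedding(t=50, dim=128)  # matches the 128-dim setting below
```

The MLP in the snippet above then projects this fixed encoding into a learned conditioning vector.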
The forward process gradually corrupts images with Gaussian noise:
```
# At timestep t, add noise according to:
x_t = sqrt(ᾱ_t) * x_0 + sqrt(1 - ᾱ_t) * ε
```

Where:
- `x_0` is the original image
- `ᾱ_t` is the cumulative product of the noise schedule (ᾱ_t = ∏_{s≤t} α_s with α_s = 1 - β_s)
- `ε` is standard Gaussian noise, ε ~ N(0, I)
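The closed-form noising step can be checked numerically with a linear β schedule (the 1e-4 → 0.02 range and T = 1000 come from the hyperparameters listed below; this is a framework-free sketch on a single scalar "pixel", not the project's tensor code):

```python
import math
import random

T = 1000
# linear beta schedule: 1e-4 -> 0.02 over T steps
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

# cumulative product: alpha_bar_t = prod_{s<=t} (1 - beta_s)
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

def q_sample(x0, t, eps):
    """Forward process in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    ab = alpha_bars[t]
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps

x0 = 0.7                      # one pixel value
eps = random.gauss(0.0, 1.0)  # Gaussian noise
x_early = q_sample(x0, 10, eps)    # mostly signal
x_late = q_sample(x0, T - 1, eps)  # almost pure noise
```

Note that ᾱ_t decays from ≈1 toward ≈0, which is exactly why x_T is indistinguishable from noise.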
The model learns to reverse this process by predicting the noise:
```
# Model predicts noise ε_θ(x_t, t, class)
# Then we can recover x_{t-1} using:
x_{t-1} = (1/√α_t) * (x_t - (β_t/√(1-ᾱ_t)) * ε_θ(x_t, t))
```

Two noise schedules are implemented:
- Linear Schedule (used in training):

  ```
  β_t = linear_interpolation(1e-4, 0.02, num_timesteps)
  ```

- Cosine Schedule (alternative):

  ```
  ᾱ_t = cos²((t/T + s)/(1 + s) * π/2)
  ```
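The cosine schedule defines ᾱ_t directly rather than through the β_t increments. A minimal sketch, using the offset s = 0.008 and the normalization by ᾱ_0 from Nichol & Dhariwal (the project's own constants may differ):

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    """alpha_bar_t = cos^2(((t/T + s)/(1 + s)) * pi/2).
    The small offset s keeps beta_t from vanishing near t = 0."""
    return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2

T = 1000
# normalize so that alpha_bar_0 is exactly 1 (no noise at t = 0)
alpha_bars = [cosine_alpha_bar(t, T) / cosine_alpha_bar(0, T) for t in range(T + 1)]
```

Compared with the linear schedule, this decays ᾱ_t more gently at both ends, which tends to preserve more information in the middle of the trajectory.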
The model is trained to predict the noise added at each timestep:
```python
def compute_loss(model, x0, t, labels=None, noise=None):
    if noise is None:
        noise = torch.randn_like(x0)
    x_t = sample_q(x0, t, noise)             # Add noise via the forward process
    predicted_noise = model(x_t, t, labels)  # Predict the added noise
    loss = F.l1_loss(noise, predicted_noise) # L1 loss
    return loss
```

- Sample a batch of images and labels
- Sample random timesteps t for each image
- Add noise according to forward process
- Predict noise using the model
- Compute L1 loss between actual and predicted noise
- Backpropagate and update model weights
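These six steps can be mocked end-to-end without a framework. The `toy_model` below is a hypothetical stand-in that just predicts zeros, so only the data flow is illustrated, not actual learning; schedule constants match the hyperparameters below:

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

def toy_model(x_t, t, label):
    # placeholder: a real U-Net would predict the noise from (x_t, t, label)
    return [0.0] * len(x_t)

def training_step(x0, label):
    t = random.randrange(T)                      # 2. random timestep
    noise = [random.gauss(0, 1) for _ in x0]     # 3a. sample Gaussian noise
    ab = alpha_bars[t]
    x_t = [math.sqrt(ab) * p + math.sqrt(1 - ab) * n
           for p, n in zip(x0, noise)]           # 3b. forward process
    pred = toy_model(x_t, t, label)              # 4. predict noise
    loss = sum(abs(n - p) for n, p in zip(noise, pred)) / len(x0)  # 5. L1 loss
    return loss                                  # 6. backprop would follow

loss = training_step([0.5] * 16, label=7)
```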
The model learns to generate specific digits by conditioning on class labels:
- Label embeddings are added to time embeddings
- This allows controlled generation: "Generate a digit 7"
- Start with pure noise: x_T ~ N(0, I)
- Iteratively denoise for T steps:

  ```python
  for t in range(T, 0, -1):
      x = sample_p(model, x, t, labels)  # x_{t-1} from x_t
  ```

- Final result: clean image x_0
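The sampling loop can be sketched for a single scalar "pixel". Here `sample_p` is a hypothetical stand-in combining the posterior mean from the reverse-process formula above with fresh Gaussian noise σ_t·z (σ_t = √β_t, z omitted at t = 0), and the noise predictor is a zero-returning placeholder, so the output is meaningless noise, but the control flow matches the sampler:

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars = []
prod = 1.0
for a in alphas:
    prod *= a
    alpha_bars.append(prod)

def predict_noise(x_t, t, label):
    # placeholder for the trained U-Net eps_theta(x_t, t, label)
    return 0.0

def sample_p(x_t, t, label):
    """One reverse step: posterior mean plus sigma_t * z (no noise at t = 0)."""
    eps = predict_noise(x_t, t, label)
    mean = (x_t - betas[t] / math.sqrt(1 - alpha_bars[t]) * eps) / math.sqrt(alphas[t])
    if t == 0:
        return mean
    return mean + math.sqrt(betas[t]) * random.gauss(0, 1)

x = random.gauss(0, 1)          # x_T ~ N(0, I)
for t in range(T - 1, -1, -1):  # iterate t = T-1 ... 0
    x = sample_p(x, t, label=7)
```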
- Conditional sampling: Generate specific digit classes
- DDIM sampling: Faster sampling with fewer steps (not implemented yet)
- Classifier-free guidance: Could be added for better conditional generation
- Time embedding dimension: 128
- ResNet depth: 4 layers
- Image size: 32×32 (upscaled from 28×28 MNIST)
- Input channels: 1 (grayscale MNIST)
- Number of classes: 10 (digits 0-9)
- Timesteps: 1000
- Learning rate: 1e-4
- Batch size: 64
- Optimizer: Adam
- Loss function: L1 (mean absolute error)
- Epochs: 1000
- Model size: ~50M parameters
- Training time: on the order of hours on a single GPU
- Inference time: ~30 seconds per batch (1000 steps)
```
DDPM-diffusion/
├── custom_diffusion_model_experiments.ipynb  # Main development notebook
├── custom_diffusion_model_training.py        # Standalone training script
├── generating_gif.ipynb                      # GIF generation code
├── saved_model.pth                           # Trained model weights
├── data/MNIST/                               # MNIST dataset
├── GIFs/                                     # Generated diffusion GIFs
├── results/                                  # Training samples
└── requirements.txt                          # Dependencies
```
- SpaceToDepth: Efficient downsampling using channel dimension
- WeightStandardizedConv2d: Normalized convolutions for stability
- SinusoidalPositionEmbedding: Time encoding for diffusion steps
- ResnetBlock: Residual blocks with time/label conditioning
- Attention: Self-attention for capturing long-range dependencies
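To illustrate the SpaceToDepth idea, here is a pure-Python sketch on a nested-list "image" with block size 2 (the real module would do the same rearrangement with tensor reshapes on (B, C, H, W) inputs):

```python
def space_to_depth(img, r=2):
    """Rearrange an H x W grid into r*r channels of size (H/r) x (W/r):
    each output channel collects one position within every r x r block,
    trading spatial resolution for channel depth without losing pixels."""
    h, w = len(img), len(img[0])
    out = []
    for dy in range(r):
        for dx in range(r):
            out.append([[img[y * r + dy][x * r + dx] for x in range(w // r)]
                        for y in range(h // r)])
    return out

img = [[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]]
channels = space_to_depth(img)  # 4 channels of shape 2 x 2
```

Unlike strided convolution or pooling, this downsampling is lossless: the original image can be reassembled exactly from the channels.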
- Gradient clipping: Prevents exploding gradients
- Exponential moving averages: Smoother model updates (could be added)
- Progressive training: Start with fewer timesteps (could be implemented)
The project includes comprehensive visualization:
- Training samples: Saved every 1000 batches
- Diffusion GIFs: Complete denoising process visualization
- Loss tracking: Monitor training progress
- Conditional samples: Generate specific digit classes
- DDIM Sampling: Faster inference with deterministic sampling
- Classifier-free Guidance: Better conditional generation
- Progressive Training: Start with fewer timesteps
- FID/IS Metrics: Quantitative evaluation
- Higher Resolution: Scale to larger images
- Other Datasets: CIFAR-10, CelebA, etc.
- DDPM Paper: "Denoising Diffusion Probabilistic Models" (Ho et al., 2020)
- Improved DDPM: "Improved Denoising Diffusion Probabilistic Models" (Nichol & Dhariwal, 2021)
- DDIM: "Denoising Diffusion Implicit Models" (Song et al., 2020)
- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- Run training:

  ```shell
  python custom_diffusion_model_training.py
  ```

- Generate samples:

  ```python
  # Load the trained model and sample one image per class
  model.eval()
  samples = sampling(model, (10, 1, 32, 32), labels=torch.arange(10))
  ```

- Create GIFs: run `generating_gif.ipynb` to produce the visualization GIFs.
This implementation demonstrates the power of diffusion models for high-quality image generation with the added benefit of conditional control over the generated content.