Diffusion models have emerged as a powerful approach in generative AI, producing state-of-the-art results in image, audio, and video generation. In this in-depth technical article, we’ll explore how diffusion models work, their key innovations, and why they’ve become so successful. We’ll cover the mathematical foundations, training process, sampling algorithms, and cutting-edge applications of this exciting new technology.
Introduction to Diffusion Models
Diffusion models are a class of generative models that learn to gradually denoise data by reversing a diffusion process. The core idea is to start with pure noise and iteratively refine it into a high-quality sample from the target distribution.
This approach was inspired by non-equilibrium thermodynamics – specifically, the process of reversing diffusion to recover structure. In the context of machine learning, we can think of it as learning to reverse the gradual addition of noise to data.
Some key advantages of diffusion models include:
- State-of-the-art image quality, surpassing GANs in many cases
- Stable training without adversarial dynamics
- Highly parallelizable training, since the loss at each timestep can be computed independently of the others
- Flexible architecture – any model that maps inputs to outputs of the same dimensionality can be used
- Strong theoretical grounding
Let’s dive deeper into how diffusion models work.
Stochastic Differential Equations govern the forward and reverse processes in diffusion models. The forward SDE adds noise to the data, gradually transforming it into a noise distribution. The reverse SDE, guided by a learned score function, progressively removes noise, leading to the generation of realistic images from random noise. This approach is key to achieving high-quality generative performance in continuous state spaces.
The Forward Diffusion Process
The forward diffusion process starts with a data point x₀ sampled from the real data distribution, and gradually adds Gaussian noise over T timesteps to produce increasingly noisy versions x₁, x₂, …, xT.
At each timestep t, we add a small amount of noise according to:
x_t = √(1 - β_t) * x_{t-1} + √(β_t) * ε
Where:
- β_t is a variance schedule that controls how much noise is added at each step
- ε is random Gaussian noise
This process continues until xT is nearly pure Gaussian noise.
Mathematically, we can describe this as a Markov chain:
q(x_t | x_{t-1}) = N(x_t; √(1 - β_t) * x_{t-1}, β_t * I)
Where N denotes a Gaussian distribution.
The β_t schedule is typically chosen to be small for early timesteps and increase over time. Common choices include linear, cosine, or sigmoid schedules.
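As a rough illustration of the schedules mentioned above, the sketch below builds a linear and a cosine β_t schedule in plain Python and tracks how much signal survives. The endpoint values 1e-4 and 0.02, the offset s = 0.008, and the clipping at 0.999 are commonly used defaults that we assume here for illustration; the function names are our own.

```python
import math

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced betas from beta_start to beta_end (assumed defaults)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def cosine_beta_schedule(T, s=0.008, max_beta=0.999):
    """Betas derived from a squared-cosine curve for the cumulative signal level."""
    def alpha_bar(u):  # u in [0, 1] is the fraction of the process completed
        return math.cos((u + s) / (1 + s) * math.pi / 2) ** 2
    betas = []
    for t in range(T):
        ratio = alpha_bar((t + 1) / T) / alpha_bar(t / T)
        betas.append(min(1 - ratio, max_beta))
    return betas

def cumulative_alpha(betas):
    """Running product of (1 - beta_t): the fraction of signal variance left."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alpha(betas)
print(f"signal variance left after 1000 steps: {alpha_bars[-1]:.2e}")
```

The running product is what tells us whether xT is close to pure noise: once it is near zero, almost nothing of x₀ remains.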
The Reverse Diffusion Process
The goal of a diffusion model is to learn the reverse of this process – to start with pure noise xT and progressively denoise it to recover a clean sample x₀.
We model this reverse process as:
p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), σ_θ^2(x_t, t) * I)
Where μ_θ and σ_θ^2 are learned functions (typically neural networks) parameterized by θ.
The key innovation is that we don’t need to explicitly model the full reverse distribution. Instead, we can parameterize it in terms of the forward process, which we know.
Specifically, we can show that the optimal reverse process mean μ* is:
μ* = 1/√(1 - β_t) * (x_t - β_t/√(1 - ᾱ_t) * ε_θ(x_t, t))
Where:
- α_t = 1 – β_t
- ᾱ_t = α_1 · α_2 · … · α_t is the cumulative product of the α values up to step t
- ε_θ is a learned noise prediction network
This gives us a simple objective – train a neural network ε_θ to predict the noise that was added at each step.
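For readers who want to see where this mean comes from, here is a short derivation sketch using the standard DDPM identities, stated without proof. We write ᾱ_t for the cumulative product of the α values, and μ̃_t for the mean of the true posterior q(x_{t-1} | x_t, x₀):

```latex
\begin{aligned}
x_t &= \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,
  \qquad \bar\alpha_t = \prod_{s=1}^{t}\alpha_s,\quad \alpha_s = 1-\beta_s \\
\tilde\mu_t(x_t, x_0) &= \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0
  + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t \\
x_0 &= \frac{1}{\sqrt{\bar\alpha_t}}\left(x_t - \sqrt{1-\bar\alpha_t}\,\epsilon\right) \\
\Rightarrow\quad \tilde\mu_t &= \frac{1}{\sqrt{\alpha_t}}
  \left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon\right)
\end{aligned}
```

Substituting the third line into the second eliminates x₀, so the only unknown left is the noise ε, which is exactly what the network is asked to predict.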
Training Objective
The training objective for diffusion models can be derived from variational inference. After some simplification, we arrive at a simple L2 loss:
L = E_{t, x₀, ε} [ ||ε - ε_θ(x_t, t)||² ]
Where:
- t is sampled uniformly from 1 to T
- x₀ is sampled from the training data
- ε is sampled Gaussian noise
- x_t is constructed by adding noise to x₀ according to the forward process, which has the closed form x_t = √(ᾱ_t) * x₀ + √(1 - ᾱ_t) * ε, with ᾱ_t = α_1 · α_2 · … · α_t
In other words, we’re training the model to predict the noise that was added at each timestep.
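The objective can be sketched in a few lines of NumPy. This is a minimal illustration rather than a full training loop: `diffusion_loss`, `alpha_bars` (the cumulative products ᾱ_t), and the placeholder `zero_model` are names we introduce here, and the linear schedule is an assumed default.

```python
import numpy as np

def diffusion_loss(model, x0, alpha_bars, rng):
    """Monte Carlo estimate of E ||eps - eps_theta(x_t, t)||^2 for one batch."""
    batch = x0.shape[0]
    T = len(alpha_bars)
    # Sample one timestep per example, uniformly (0-indexed here)
    t = rng.integers(0, T, size=batch)
    # Sample the noise the network will be asked to recover
    eps = rng.standard_normal(x0.shape)
    # Closed-form forward process: jump from x_0 straight to x_t
    a = alpha_bars[t].reshape(batch, *([1] * (x0.ndim - 1)))
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    # Mean squared error between true and predicted noise
    return np.mean((eps - model(x_t, t)) ** 2)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0 = rng.standard_normal((8, 3, 32, 32))        # stand-in for a batch of images
zero_model = lambda x_t, t: np.zeros_like(x_t)  # placeholder for the network
print(diffusion_loss(zero_model, x0, alpha_bars, rng))
```

A model that always predicts zero scores a loss close to 1 (the variance of ε), which is a useful baseline when checking that a real network is learning anything.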
Model Architecture
Source: Ronneberger et al.
The U-Net architecture is central to the denoising step in the diffusion model. It features an encoder-decoder structure with skip connections that help preserve fine-grained details during the reconstruction process. The encoder progressively downsamples the input image while capturing high-level features, and the decoder up-samples the encoded features to reconstruct the image. This architecture is particularly effective in tasks requiring precise localization, such as image segmentation.
The noise prediction network ε_θ can use any architecture that maps inputs to outputs of the same dimensionality. U-Net style architectures are a popular choice, especially for image generation tasks.
A typical architecture might look like:
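import torch
import torch.nn as nn

class UNetBlock(nn.Module):
    # Minimal stand-in block (an assumption): one conv plus a time-embedding bias
    def __init__(self, in_ch, out_ch, t_dim=128):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.t_proj = nn.Linear(t_dim, out_ch)
        self.act = nn.SiLU()

    def forward(self, x, t_emb):
        return self.act(self.conv(x) + self.t_proj(t_emb)[:, :, None, None])

class DiffusionUNet(nn.Module):
    def __init__(self, t_dim=128):
        super().__init__()
        # Timestep embedding: a small MLP on the scalar timestep
        self.time_embedding = nn.Sequential(
            nn.Linear(1, t_dim), nn.SiLU(), nn.Linear(t_dim, t_dim))
        self.pool = nn.MaxPool2d(2)
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        # Downsampling
        self.down1 = UNetBlock(3, 64, t_dim)
        self.down2 = UNetBlock(64, 128, t_dim)
        self.down3 = UNetBlock(128, 256, t_dim)
        # Bottleneck
        self.bottleneck = UNetBlock(256, 512, t_dim)
        # Upsampling (input channels include the concatenated skip connection)
        self.up3 = UNetBlock(512 + 256, 256, t_dim)
        self.up2 = UNetBlock(256 + 128, 128, t_dim)
        self.up1 = UNetBlock(128 + 64, 64, t_dim)
        # Output
        self.out = nn.Conv2d(64, 3, 1)

    def forward(self, x, t):
        # Embed timestep: (B,) -> (B, t_dim)
        t_emb = self.time_embedding(t.float().unsqueeze(-1))
        # Downsample
        d1 = self.down1(x, t_emb)
        d2 = self.down2(self.pool(d1), t_emb)
        d3 = self.down3(self.pool(d2), t_emb)
        # Bottleneck
        bottleneck = self.bottleneck(self.pool(d3), t_emb)
        # Upsample and concatenate the matching skip connection
        u3 = self.up3(torch.cat([self.upsample(bottleneck), d3], dim=1), t_emb)
        u2 = self.up2(torch.cat([self.upsample(u3), d2], dim=1), t_emb)
        u1 = self.up1(torch.cat([self.upsample(u2), d1], dim=1), t_emb)
        # Output
        return self.out(u1)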
The key components are:
- U-Net style architecture with skip connections
- Time embedding to condition on the timestep
- Flexible depth and width
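The time embedding is often built from sinusoidal features, in the style of Transformer positional encodings, before being passed through a small MLP. The article does not specify which embedding it has in mind, so treat the following as one plausible choice; the function name and the `max_period` default are assumptions.

```python
import math

def sinusoidal_embedding(t, dim=8, max_period=10000.0):
    """Map a timestep t to a vector of `dim` sines and cosines."""
    half = dim // 2
    # Geometrically spaced frequencies, starting at 1 and decaying toward 1/max_period
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

print(sinusoidal_embedding(0))
# [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
```

Because every component stays in [-1, 1], the network sees a well-scaled input whether t is 3 or 900.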
Sampling Algorithm
Once we’ve trained our noise prediction network ε_θ, we can use it to generate new samples. The basic sampling algorithm is:
- Start with pure Gaussian noise xT
- For t = T down to 1:
  - Predict noise: ε_θ(x_t, t)
  - Compute mean: μ = 1/√(1-β_t) * (x_t - β_t/√(1-ᾱ_t) * ε_θ(x_t, t)), where ᾱ_t = α_1 · α_2 · … · α_t
  - Sample: x_{t-1} ~ N(μ, σ_t^2 * I), with no noise added at the final step t = 1
- Return x₀
This process gradually denoises the sample, guided by our learned noise prediction network.
In practice, there are various sampling techniques that can improve quality or speed:
- DDIM sampling: A deterministic variant that allows for fewer sampling steps
- Ancestral sampling with learned variance: Uses a variance σ_θ^2 predicted by the network instead of a fixed σ_t^2
- Truncated sampling: Stops early for faster generation
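To make the DDIM idea concrete, here is a minimal NumPy sketch of one deterministic update (the η = 0 case). The names `ddim_step`, `a_t`, and `a_prev` are ours; they denote the cumulative product ᾱ = α_1 · α_2 · … at the current and the target timestep.

```python
import numpy as np

def ddim_step(x_t, eps_pred, a_t, a_prev):
    """One deterministic DDIM update (eta = 0) from signal level a_t to a_prev.

    a_t and a_prev are cumulative alpha products at the current and the
    target timestep; the target may be many steps earlier.
    """
    # Estimate the clean sample implied by the predicted noise
    x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps_pred) / np.sqrt(a_t)
    # Re-noise that estimate to the target level along the same noise direction
    return np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps_pred

# Sanity check with a perfect noise predictor
rng = np.random.default_rng(0)
x0, eps = rng.standard_normal(4), rng.standard_normal(4)
a_t, a_prev = 0.1, 0.6
x_t = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps
x_prev = ddim_step(x_t, eps, a_t, a_prev)
print(np.allclose(x_prev, np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps))  # True
```

Since no fresh noise is injected, nothing ties the update to adjacent timesteps, which is what lets DDIM skip from, say, step 1000 to step 950 in one go.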
Here’s a basic implementation of the sampling algorithm:
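import torch

@torch.no_grad()
def sample(model, n_samples, device, T=1000):
    # Linear variance schedule (assumed; must match the one used in training)
    betas = torch.linspace(1e-4, 0.02, T, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Start with pure noise x_T
    x = torch.randn(n_samples, 3, 32, 32, device=device)
    for t in reversed(range(T)):
        t_batch = torch.full((n_samples,), t, device=device)
        # Predict the noise contained in the current sample x_t
        pred_noise = model(x, t_batch)
        # Mean of x_{t-1} given x_t, using the predicted noise
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * pred_noise) / torch.sqrt(alphas[t])
        # Add fresh noise with variance beta_t, except at the final step (t=0)
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        else:
            x = mean
    return x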
The Mathematics Behind Diffusion Models
To truly understand diffusion models, it’s crucial to delve deeper into the mathematics that underpin them. Let’s explore some key concepts in more detail:
Markov Chain and Stochastic Differential Equations
The forward diffusion process in diffusion models can be viewed as a Markov chain or, in the continuous limit, as a stochastic differential equation (SDE). The SDE formulation provides a powerful theoretical framework for analyzing and extending diffusion models.
The forward SDE can be written as:
dx = f(x,t)dt + g(t)dw
Where:
- f(x,t) is the drift term
- g(t) is the diffusion coefficient
- dw is the increment of a Wiener process (Brownian motion)
Different choices of f and g lead to different types of diffusion processes. For example:
- Variance Exploding (VE) SDE: dx = √(d[σ²(t)]/dt) dw
- Variance Preserving (VP) SDE: dx = -0.5 β(t) x dt + √(β(t)) dw
Understanding these SDEs allows us to derive optimal sampling strategies and extend diffusion models to new domains.
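A quick way to build intuition for the VP SDE is to simulate it with the Euler-Maruyama method. The sketch below is illustrative only: the function name, the time interval [0, 1], and the linear β(t) running from 0.1 to 20 are assumptions on our part, not values taken from this article.

```python
import numpy as np

def simulate_vp_sde(x0, beta_fn, n_steps=1000, rng=None):
    """Euler-Maruyama simulation of dx = -0.5*beta(t)*x dt + sqrt(beta(t)) dw on [0, 1]."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        beta = beta_fn(i * dt)
        drift = -0.5 * beta * x
        noise = rng.standard_normal(x.shape)
        # Deterministic drift step plus a Brownian increment of variance beta * dt
        x = x + drift * dt + np.sqrt(beta * dt) * noise
    return x

beta_fn = lambda t: 0.1 + (20.0 - 0.1) * t   # assumed linear schedule
x0 = np.full(10000, 3.0)                      # every particle starts at 3.0
x1 = simulate_vp_sde(x0, beta_fn, rng=np.random.default_rng(0))
print(x1.mean(), x1.var())
```

Even though all particles start at the same point, the printed mean should land near 0 and the variance near 1, which is the "variance preserving" behaviour the name refers to.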
Score Matching and Denoising Score Matching
The connection between diffusion models and score matching provides another valuable perspective. The score function is defined as the gradient of the log-probability density:
s(x) = ∇_x log p(x)
Denoising score matching aims to estimate this score function by training a model to denoise slightly perturbed data points. This objective turns out to be equivalent to the diffusion model training objective in the continuous limit.
This connection allows us to leverage techniques from score-based generative modeling, such as annealed Langevin dynamics for sampling.
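The link between noise prediction and the score can be checked numerically. For the Gaussian forward kernel q(x_t | x₀) = N(√ᾱ_t x₀, (1 - ᾱ_t) I), where ᾱ_t is the cumulative product of (1 - β), the score is just a rescaled copy of the noise. The function names below are ours, and the check uses the true noise in place of a trained network.

```python
import numpy as np

def score_from_noise(eps_pred, alpha_bar_t):
    """Convert a noise prediction into a score estimate: s = -eps / sqrt(1 - alpha_bar_t)."""
    return -eps_pred / np.sqrt(1.0 - alpha_bar_t)

def gaussian_kernel_score(x_t, x0, alpha_bar_t):
    """Exact gradient w.r.t. x_t of log N(x_t; sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    return -(x_t - np.sqrt(alpha_bar_t) * x0) / (1.0 - alpha_bar_t)

rng = np.random.default_rng(0)
x0, eps = rng.standard_normal(5), rng.standard_normal(5)
alpha_bar_t = 0.3
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
print(np.allclose(score_from_noise(eps, alpha_bar_t),
                  gaussian_kernel_score(x_t, x0, alpha_bar_t)))  # True
```

In other words, a trained ε_θ can be dropped into any score-based sampler after dividing by -√(1 - ᾱ_t).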
Advanced Training Techniques
Importance Sampling
The standard diffusion model training samples timesteps uniformly. However, not all timesteps are equally important for learning. Importance sampling techniques can be used to focus training on the most informative timesteps.
One approach is to use a non-uniform distribution over timesteps, weighted by the expected squared L2 norm of the score:
p(t) ∝ E[||s(x_t, t)||²]
This can lead to faster training and improved sample quality.
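Here is a small sketch of how such a sampler might look in plain Python. It assumes we already maintain a running estimate of E[||s(x_t, t)||²] for each timestep; the numbers below are made up, and both function names are our own. The returned weight 1 / (T · p(t)) is the standard importance-sampling correction that keeps the loss estimate unbiased relative to uniform sampling.

```python
import random

def timestep_distribution(sq_norm_estimates):
    """Normalize per-timestep estimates of E||s(x_t, t)||^2 into probabilities p(t)."""
    total = sum(sq_norm_estimates)
    return [v / total for v in sq_norm_estimates]

def sample_timestep(probs, rng=random):
    """Draw t ~ p(t) and return it together with the weight 1 / (T * p(t))."""
    T = len(probs)
    t = rng.choices(range(T), weights=probs, k=1)[0]
    return t, 1.0 / (T * probs[t])

estimates = [4.0, 2.0, 1.0, 1.0]  # made-up running estimates for T = 4
probs = timestep_distribution(estimates)
print(probs)  # [0.5, 0.25, 0.125, 0.125]
t, weight = sample_timestep(probs)
```

During training, the per-example loss at timestep `t` would be multiplied by `weight` before averaging.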
Progressive Distillation
Progressive distillation is a technique to create faster sampling models without sacrificing quality. The process works as follows:
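1. Train a base (teacher) diffusion model with many timesteps (e.g. 1000)
2. Create a student model that samples with fewer timesteps, typically half as many (e.g. 500)
3. Train the student so that one of its steps matches two steps of the teacher’s denoising process
4. Repeat steps 2-3 with the student as the new teacher, progressively reducing timesteps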
This allows for high-quality generation with significantly fewer denoising steps.
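To see how quickly the step count shrinks, the toy function below lists the sampler sizes visited when every distillation round halves the number of steps, as in the original progressive distillation recipe. The function name and the starting value of 1024 are illustrative assumptions.

```python
def distillation_schedule(start_steps, final_steps):
    """List the sampler step counts visited when each round halves the count."""
    steps = [start_steps]
    while steps[-1] // 2 >= final_steps:
        steps.append(steps[-1] // 2)
    return steps

print(distillation_schedule(1024, 4))
# [1024, 512, 256, 128, 64, 32, 16, 8, 4]
```

Going from 1024 steps to 4 takes only eight rounds, although each round still requires training a new student.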
Architectural Innovations
Transformer-based Diffusion Models
While U-Net architectures have been popular for image diffusion models, recent work has explored using transformer architectures. Transformers offer several potential advantages:
- Better handling of long-range dependencies
- More flexible conditioning mechanisms
- Easier scaling to larger model sizes
Models like DiT (Diffusion Transformers) have shown promising results, potentially offering a path to even higher quality generation.
Hierarchical Diffusion Models
Hierarchical diffusion models generate data at multiple scales, allowing for both global coherence and fine-grained details. The process typically involves:
- Generating a low-resolution output
- Progressively upsampling and refining
This approach can be particularly effective for high-resolution image generation or long-form content generation.
Advanced Topics
Classifier-Free Guidance
Classifier-free guidance is a technique to improve sample quality and controllability. The key idea is to obtain two noise predictions, usually from a single network trained with the conditioning randomly dropped:
- An unconditional prediction ε_θ(x_t)
- A conditional prediction ε_θ(x_t | y), where y is some conditioning information (e.g. text prompt)
During sampling, we extrapolate from the unconditional prediction toward the conditional one:
ε_θ = (1 + w) * ε_θ(x_t | y) - w * ε_θ(x_t)
Where w > 0 is a guidance scale that controls how much to emphasize the conditional model.
This allows for stronger conditioning without having to retrain the model. It’s been crucial for the success of text-to-image models like DALL-E 2 and Stable Diffusion.
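The guidance formula is a one-liner in code. In the sketch below, scalars stand in for full noise tensors, and the function name is our own; the same expression works unchanged on NumPy arrays or PyTorch tensors.

```python
def guided_noise(eps_cond, eps_uncond, w):
    """Classifier-free guidance: (1 + w) * eps_cond - w * eps_uncond."""
    return (1.0 + w) * eps_cond - w * eps_uncond

# Scalars stand in for full noise tensors here
print(guided_noise(eps_cond=1.0, eps_uncond=0.5, w=0.0))  # 1.0 (plain conditional prediction)
print(guided_noise(eps_cond=1.0, eps_uncond=0.5, w=2.0))  # 2.0 (pushed further from the unconditional one)
```

Note that each sampling step now needs two forward passes, one with and one without the conditioning, which is the main cost of the technique.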
Latent Diffusion
Source: Rombach et al.
The Latent Diffusion Model (LDM) process encodes input data into a latent space, and the diffusion process takes place there rather than in pixel space. The model progressively adds noise to the latent representation of the image, and the resulting noisy latent is then denoised using a U-Net architecture. The U-Net, guided by cross-attention mechanisms, integrates information from various conditioning sources like semantic maps, text, and image representations, and a decoder finally reconstructs the image in pixel space. This process is pivotal in generating high-quality images with a controlled structure and desired attributes.
This offers several advantages:
- Faster training and sampling
- Better handling of high-resolution images
- Easier to incorporate conditioning
The process works as follows:
- Train an autoencoder to compress images to a latent space
- Train a diffusion model in this latent space
- For generation, sample in latent space and decode to pixels
This approach has been highly successful, powering models like Stable Diffusion.
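Some back-of-the-envelope arithmetic shows why working in latent space is cheaper. The sketch assumes an autoencoder that downsamples each side by 8 and produces 4 latent channels, which matches the Stable Diffusion v1 configuration; other models use different factors, and the function names are ours.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (C, H, W) of the latent for an image of the given size."""
    return (latent_channels, height // downsample, width // downsample)

def compression_ratio(height, width, channels=3, downsample=8, latent_channels=4):
    """How many times fewer values the diffusion model has to process."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (channels * height * width) / (c * h * w)

print(latent_shape(512, 512))       # (4, 64, 64)
print(compression_ratio(512, 512))  # 48.0
```

Every denoising step therefore operates on 48 times fewer values than it would in pixel space, and the autoencoder is only run once at the start or end.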
Consistency Models
Consistency models are a recent innovation that aims to improve the speed and quality of diffusion models. The key idea is to train a single model that can map from any noise level directly to the final output, rather than requiring iterative denoising.
This is achieved through a carefully designed loss function that enforces consistency between predictions at different noise levels. The result is a model that can generate high-quality samples in a single forward pass, dramatically speeding up inference.
Practical Tips for Training Diffusion Models
Training high-quality diffusion models can be challenging. Here are some practical tips to improve training stability and results:
- Gradient clipping: Use gradient clipping to prevent exploding gradients, especially early in training.
- EMA of model weights: Keep an exponential moving average (EMA) of model weights for sampling, which can lead to more stable and higher-quality generation.
- Data augmentation: For image models, simple augmentations like random horizontal flips can improve generalization.
- Noise scheduling: Experiment with different noise schedules (linear, cosine, sigmoid) to find what works best for your data.
- Mixed precision training: Use mixed precision training to reduce memory usage and speed up training, especially for large models.
- Conditional generation: Even if your end goal is unconditional generation, training with conditioning (e.g. on image classes) can improve overall sample quality.
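Of these tips, the EMA of model weights is the easiest to get subtly wrong, so here is a minimal sketch. Parameters are represented as a plain dict of floats to keep the example framework-free; in a real training loop the same update would be applied to each weight tensor. The function name and the decay values are assumptions.

```python
def ema_update(ema_params, params, decay=0.999):
    """Return decay * ema + (1 - decay) * params, parameter by parameter."""
    return {name: decay * ema_params[name] + (1.0 - decay) * params[name]
            for name in params}

ema = {"w": 0.0}
for step in range(3):
    # Pretend the live weight stays at 1.0 for three optimizer steps
    ema = ema_update(ema, {"w": 1.0}, decay=0.5)
print(ema)  # {'w': 0.875}
```

The EMA copy is updated after every optimizer step and used only for sampling; gradients are never computed through it.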
Evaluating Diffusion Models
Properly evaluating generative models is crucial but challenging. Here are some common metrics and approaches:
Fréchet Inception Distance (FID)
FID is a widely used metric for evaluating the quality and diversity of generated images. It compares the statistics of generated samples to real data in the feature space of a pre-trained classifier (typically InceptionV3).
Lower FID scores indicate better quality and more realistic distributions. However, FID has limitations and shouldn’t be the only metric used.
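For reference, FID fits a Gaussian to the features of real and generated images and measures the Fréchet distance between the two. Writing (μ_r, Σ_r) and (μ_g, Σ_g) for the feature means and covariances of real and generated data (our notation):

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
```

The first term penalizes a shift in average features, while the trace term penalizes a mismatch in their spread, which is how the metric captures diversity as well as quality.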
Inception Score (IS)
Inception Score measures both the quality and diversity of generated images. It uses a pre-trained Inception network to compute:
IS = exp(E_x[KL(p(y|x) || p(y))])
Where p(y|x) is the conditional class distribution for generated image x, and p(y) is the marginal class distribution averaged over all generated images.
Higher IS indicates better quality and diversity, but it has known limitations, especially for datasets very different from ImageNet.
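The formula is easy to compute once the classifier outputs are available. The sketch below takes a list of per-image class distributions directly, so the Inception network itself is out of scope; the function name and the toy inputs are ours.

```python
import math

def inception_score(probs):
    """Inception Score from a list of per-image class distributions p(y|x)."""
    n, k = len(probs), len(probs[0])
    # Marginal class distribution p(y), averaged over images
    p_y = [sum(p[j] for p in probs) / n for j in range(k)]
    # Average KL(p(y|x) || p(y)), skipping zero-probability terms
    mean_kl = sum(
        sum(pj * math.log(pj / p_y[j]) for j, pj in enumerate(p) if pj > 0)
        for p in probs
    ) / n
    return math.exp(mean_kl)

# Completely uncertain predictions: the lowest possible score
print(inception_score([[0.5, 0.5], [0.5, 0.5]]))  # 1.0
# Confident predictions spread evenly over 2 classes: close to the maximum of 2
print(inception_score([[1.0, 0.0], [0.0, 1.0]]))
```

The score ranges from 1 up to the number of classes, reaching the maximum only when each image is classified confidently and all classes appear equally often.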
Negative Log-likelihood (NLL)
For diffusion models, we can compute the negative log-likelihood of held-out data. This provides a direct measure of how well the model fits the true data distribution.
However, NLL can be computationally expensive to estimate accurately for high-dimensional data.
Human Evaluation
For many applications, especially creative ones, human evaluation remains crucial. This can involve:
- Side-by-side comparisons with other models
- Turing test-style evaluations
- Task-specific evaluations (e.g. image captioning for text-to-image models)
While subjective, human evaluation can capture aspects of quality that automated metrics miss.
Diffusion Models in Production
Deploying diffusion models in production environments presents unique challenges. Here are some considerations and best practices:
Optimization for Inference
- ONNX export: Convert models to ONNX format for faster inference across different hardware.
- Quantization: Use techniques like INT8 quantization to reduce model size and improve inference speed.
- Caching: For text-conditional models, cache results that do not change between sampling steps, such as the embedding of the empty (unconditional) prompt used at every step of classifier-free guidance.
- Batch processing: Leverage batching to make efficient use of GPU resources.
Scaling
- Distributed inference: For high-throughput applications, implement distributed inference across multiple GPUs or machines.
- Adaptive sampling: Dynamically adjust the number of sampling steps based on the desired quality-speed tradeoff.
- Progressive generation: For large outputs (e.g. high-res images), generate progressively from low to high resolution to provide faster initial results.
Safety and Filtering
- Content filtering: Implement robust content filtering systems to prevent generation of harmful or inappropriate content.
- Watermarking: Consider incorporating invisible watermarks into generated content for traceability.
Applications
Diffusion models have found success in a wide range of generative tasks:
Image Generation
Image generation is where diffusion models first gained prominence. Some notable examples include:
- DALL-E 2: OpenAI’s text-to-image model, combining a CLIP text encoder with a diffusion image decoder
- Stable Diffusion: An open-source latent diffusion model for text-to-image generation
- Imagen: Google’s text-to-image diffusion model
These models can generate highly realistic and creative images from text descriptions, outperforming previous GAN-based approaches.
Video Generation
Diffusion models have also been applied to video generation:
- Video Diffusion Models: Generating video by treating time as an additional dimension in the diffusion process
- Make-A-Video: Meta’s text-to-video diffusion model
- Imagen Video: Google’s text-to-video diffusion model
These models can generate short video clips from text descriptions, opening up new possibilities for content creation.
3D Generation
Recent work has extended diffusion models to 3D generation:
- DreamFusion: Text-to-3D generation using 2D diffusion models
- Point-E: OpenAI’s point cloud diffusion model for 3D object generation
These approaches enable the creation of 3D assets from text descriptions, with applications in gaming, VR/AR, and product design.
Challenges and Future Directions
While diffusion models have shown remarkable success, there are still several challenges and areas for future research:
Computational Efficiency
The iterative sampling process of diffusion models can be slow, especially for high-resolution outputs. Approaches like latent diffusion and consistency models aim to address this, but further improvements in efficiency are an active area of research.
Controllability
While techniques like classifier-free guidance have improved controllability, there’s still work to be done in allowing more fine-grained control over generated outputs. This is especially important for creative applications.
Multi-Modal Generation
Current diffusion models excel at single-modality generation (e.g. images or audio). Developing truly multi-modal diffusion models that can seamlessly generate across modalities is an exciting direction for future work.
Theoretical Understanding
While diffusion models have strong empirical results, there’s still more to understand about why they work so well. Developing a deeper theoretical understanding could lead to further improvements and new applications.
Conclusion
Diffusion models represent a step forward in generative AI, offering high-quality results across a range of modalities. By learning to reverse a noise-adding process, they provide a flexible and theoretically grounded approach to generation.
From creative tools to scientific simulations, the ability to generate complex, high-dimensional data has the potential to transform many fields. However, it’s important to approach these powerful technologies thoughtfully, considering both their immense potential and the ethical challenges they present.