Diffusion models
Generative models based on denoising diffusion.
Definition
Diffusion models are a class of generative models that learn to produce data by reversing a gradual noising process. During training, the model learns to predict and remove noise that was incrementally added to real data over many timesteps. At inference, starting from pure Gaussian noise, the model iteratively denoises to produce a sample from the target data distribution. This approach has become the dominant paradigm for high-quality image generation, with landmark systems including DALL·E 2, Stable Diffusion, and Imagen.
The key theoretical insight comes from score matching and stochastic differential equations: the model learns the score function (gradient of the log-density) of the data distribution at each noise level. The forward (noising) process is fixed and has a closed form, so training is straightforward — sample a real data point, corrupt it to a random noise level, train a U-Net (or transformer) to predict the added noise. The loss is a simple mean squared error between predicted and actual noise.
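The training recipe above can be sketched in a few lines. The following toy example (NumPy only, with an assumed DDPM-style linear β-schedule and a stand-in model in place of a real U-Net) shows the noise-prediction MSE objective:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed linear noise schedule (DDPM-style)
alphas_bar = np.cumprod(1.0 - betas)   # cumulative signal-retention factor

def training_loss(model, x0):
    """One training step: corrupt x0 to a random noise level, regress the noise."""
    t = int(rng.integers(0, T))                     # pick a random timestep
    eps = rng.standard_normal(x0.shape)             # noise the model must predict
    a = alphas_bar[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps  # closed-form forward corruption
    eps_pred = model(x_t, t)                        # network's noise prediction
    return float(np.mean((eps_pred - eps) ** 2))    # simple MSE loss

# Stand-in "model" that always predicts zero noise, just to exercise the loss
loss = training_loss(lambda x, t: np.zeros_like(x), np.ones((4, 4)))
```

In a real system the lambda is replaced by a U-Net (or transformer) and the loss is backpropagated; the structure of the objective is unchanged.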
Unlike GANs, diffusion training is stable with no adversarial dynamic. Unlike VAEs, samples are sharp and diverse because generation traces a rich trajectory through the data manifold rather than decoding from a bottlenecked latent. The main practical trade-off is inference speed: the reverse process requires many denoising steps (50–1000 for DDPM). Distillation (e.g., Consistency Models) and efficient schedulers (DDIM, DPM-Solver) have reduced this to as few as 1–4 steps with only a modest quality loss. See case study: DALL-E.
How it works
Forward process (data → noise)
A real sample x₀ is progressively corrupted by adding Gaussian noise over T timesteps to produce x₁, x₂, …, xT. After enough steps, xT is approximately pure Gaussian noise. The forward process has a closed form: given any timestep t, you can compute xₜ directly from x₀ without running all intermediate steps.
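This closed-form jump can be checked numerically. A minimal sketch (NumPy, with an assumed linear β-schedule) computes xₜ directly and shows that by t = T essentially no signal from x₀ survives:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear schedule
alphas_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, eps):
    """x_t in one shot: sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = rng.standard_normal(10_000)                        # stand-in "data"
x_T = q_sample(x0, T - 1, rng.standard_normal(10_000))
signal = np.sqrt(alphas_bar[-1])        # how much of x0 remains at t = T
print(f"signal coefficient at T: {signal:.4f}")         # tiny: x_T is nearly pure noise
```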
Reverse process (noise → data)
A neural network (typically a U-Net with attention layers) learns to predict the noise εθ(xₜ, t) that was added at step t. Training minimizes the difference between predicted and actual noise. At generation time, the model starts from random xT and iteratively applies the denoising network to recover x₀.
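The generation loop can be sketched as DDPM ancestral sampling. In the sketch below (NumPy), the trained εθ network is replaced by a stand-in that predicts zero noise, purely to exercise the loop structure:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

def ddpm_sample(eps_model, shape):
    """Ancestral sampling: start from pure noise, denoise one step at a time."""
    x = rng.standard_normal(shape)                  # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        eps = eps_model(x, t)                       # predicted noise at step t
        # Posterior mean: subtract the scaled noise prediction, then rescale
        mean = (x - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise        # fresh noise except at the last step
    return x

sample = ddpm_sample(lambda x, t: np.zeros_like(x), (8,))
```

With a trained noise predictor in place of the lambda, this loop is exactly the "start from random xT and iteratively denoise" procedure described above.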
Conditioning and guidance
For conditional generation (e.g., text-to-image), classifier-free guidance (CFG) trains the model both with and without the condition, then blends the conditional and unconditional score at inference. Higher guidance weight increases prompt adherence at the cost of diversity.
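The blend itself is a simple extrapolation of the two noise predictions. A minimal sketch (the arrays below are stand-ins for model outputs, not real predictions):

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward (and past) the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 2.0])   # stand-in conditional noise prediction
eps_u = np.array([0.5, 0.5])   # stand-in unconditional noise prediction

# scale = 1 recovers the conditional prediction exactly; larger scales
# push the sample harder toward the condition (more adherence, less diversity)
print(cfg_noise(eps_c, eps_u, 1.0))
print(cfg_noise(eps_c, eps_u, 7.5))
```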
Latent diffusion
Stable Diffusion runs the diffusion process in the latent space of a pretrained VAE encoder, not in pixel space. This dramatically reduces computation while preserving quality. The VAE encoder compresses the image; the diffusion model denoises in the latent space; the VAE decoder reconstructs the final image.
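The compute saving is easy to quantify. For Stable Diffusion v1 the VAE downsamples each spatial dimension by 8× into 4 latent channels, so each denoising step operates on far fewer values than it would in pixel space:

```python
# 512x512 RGB image in pixel space vs. the SD v1 latent space
# (8x spatial downsampling, 4 latent channels)
pixel_elems = 512 * 512 * 3
latent_elems = (512 // 8) * (512 // 8) * 4
ratio = pixel_elems // latent_elems
print(pixel_elems, latent_elems, ratio)   # each step touches ~48x fewer elements
```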
When to use / When NOT to use
| Scenario | Use diffusion when | Avoid diffusion when |
|---|---|---|
| High-quality, diverse image generation | State-of-the-art quality and diversity matter | Inference speed is critical and lower quality is acceptable |
| Text-to-image or text-to-audio generation | You need flexible, powerful conditioning via CFG | You need a fast single-forward-pass generator |
| Image editing and inpainting | You want masked denoising, which diffusion supports natively | A GAN-based editing pipeline is already established |
| Real-time generation (e.g., game assets) | — | Multi-step inference adds too much latency |
| Tabular or low-dimensional data | — | Usually overkill; simpler models work better |
Comparisons
| Model | Sample quality | Diversity | Training stability | Inference speed |
|---|---|---|---|---|
| Diffusion | Excellent | Excellent | Stable | Slow (many steps) |
| GAN | High (sharp) | Low (mode collapse) | Difficult | Fast (1 step) |
| VAE | Medium (blurry) | Good | Stable | Fast |
| Flow-based | High | Good | Stable | Medium |
Pros and cons
| Pros | Cons |
|---|---|
| Stable training — no adversarial dynamic | Slow inference — requires many denoising steps |
| Excellent sample diversity and coverage | High compute cost for both training and sampling |
| Flexible conditioning (text, class, image) | Requires careful noise schedule tuning |
| Strong theoretical foundations in score matching | Latent diffusion adds VAE compression artifacts |
Code examples
Text-to-image generation using the Hugging Face Diffusers library:
```python
from diffusers import StableDiffusionPipeline
import torch

# Load Stable Diffusion v1.5 (fp16 for GPU efficiency)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Generate an image from a text prompt
prompt = "A photorealistic mountain landscape at golden hour, 4K"
negative_prompt = "blurry, low quality, artifacts"
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,  # Denoising steps
    guidance_scale=7.5,      # CFG strength
    height=512,
    width=512,
).images[0]
image.save("output.png")
```

Practical resources
- Denoising Diffusion Probabilistic Models (Ho et al., 2020) — Foundational DDPM paper establishing the modern training objective
- Hugging Face Diffusers — Production library for diffusion pipelines, fine-tuning, and inference
- DDIM (Song et al., 2020) — Deterministic sampling that reduces steps from 1000 to 50 without retraining
- Classifier-Free Guidance (Ho & Salimans, 2022) — The key technique behind conditional generation in modern diffusion models