Diffusion models

Generative models based on denoising diffusion.

Definition

Diffusion models are a class of generative models that learn to produce data by reversing a gradual noising process. During training, the model learns to predict and remove noise that was incrementally added to real data over many timesteps. At inference, starting from pure Gaussian noise, the model iteratively denoises to produce a sample from the target data distribution. This approach has become the dominant paradigm for high-quality image generation, with landmark systems including DALL·E 2, Stable Diffusion, and Imagen.

The key theoretical insight comes from score matching and stochastic differential equations: the model learns the score function (gradient of the log-density) of the data distribution at each noise level. The forward (noising) process is fixed and has a closed form, so training is straightforward — sample a real data point, corrupt it to a random noise level, train a U-Net (or transformer) to predict the added noise. The loss is a simple mean squared error between predicted and actual noise.
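
As a rough sketch of that objective (illustrative PyTorch-style code, not any particular library's API; the model and the linear noise schedule here are placeholder assumptions):

import torch
import torch.nn.functional as F

# Placeholder noise schedule: beta_t rises linearly; alpha_bar_t is the
# cumulative product of (1 - beta_t) used by the closed-form forward process.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def training_loss(model, x0):
    # Sample a random timestep and Gaussian noise for each example in the batch
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a = alpha_bars[t].view(-1, 1, 1, 1)      # broadcast over image-shaped x0 (B, C, H, W)
    # Corrupt x0 directly to noise level t (closed-form forward process)
    x_t = torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * noise
    # Simple MSE between the predicted and the actually added noise
    return F.mse_loss(model(x_t, t), noise)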

Unlike GANs, diffusion training is stable with no adversarial dynamic. Unlike VAEs, samples are sharp and diverse because generation traces a rich trajectory through the data manifold rather than decoding from a bottlenecked latent. The main practical trade-off is inference speed: the reverse process requires many denoising steps (50–1000 for DDPM). Efficient samplers (DDIM, DPM-Solver) cut this to roughly 10–25 steps, and distillation methods (e.g., Consistency Models) reduce it to as few as 1–4 steps with little quality loss. See case study: DALL-E.

How it works

Forward process (data → noise)

A real sample x₀ is progressively corrupted by adding Gaussian noise over T timesteps to produce x₁, x₂, …, xT. After enough steps, xT is approximately pure Gaussian noise. The forward process has a closed form: given any timestep t, you can compute xₜ directly from x₀ without running all intermediate steps.
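
In the standard DDPM parameterization, this closed form is xₜ = √ᾱₜ · x₀ + √(1 − ᾱₜ) · ε with ε ~ N(0, I), where ᾱₜ is the cumulative product of the per-step noise-retention coefficients, so any xₜ can be sampled from x₀ with a single draw of Gaussian noise.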

Reverse process (noise → data)

A neural network (typically a U-Net with attention layers) learns to predict the noise εθ(xₜ, t) that was used to corrupt the sample to noise level t. Training minimizes the difference between predicted and actual noise. At generation time, the model starts from random xT and iteratively applies the denoising network to recover x₀.
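
A schematic version of this loop (a sketch of DDPM ancestral sampling, assuming a hypothetical eps_model that returns the predicted noise, with the same placeholder schedule as in the training sketch above):

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(eps_model, shape):
    x = torch.randn(shape)                        # x_T: pure Gaussian noise
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t)
        eps = eps_model(x, t_batch)               # predicted noise at this step
        # Estimate the mean of x_{t-1} by removing the predicted noise contribution
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)  # inject fresh sampling noise
        else:
            x = mean                              # final step is deterministic
    return x                                      # approximate sample of x_0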

Conditioning and guidance

For conditional generation (e.g., text-to-image), classifier-free guidance (CFG) trains the model both with and without the condition, then blends the conditional and unconditional score at inference. Higher guidance weight increases prompt adherence at the cost of diversity.
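
In code, the blend is one line; the sketch below assumes a hypothetical eps_model that accepts an optional conditioning input:

def guided_noise(eps_model, x_t, t, text_embedding, guidance_scale=7.5):
    # Run the hypothetical denoiser twice: once without and once with the condition
    eps_uncond = eps_model(x_t, t, cond=None)
    eps_cond = eps_model(x_t, t, cond=text_embedding)
    # Weight > 1 pushes the prediction toward the condition (more prompt adherence, less diversity)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)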

Latent diffusion

Stable Diffusion runs the diffusion process in the latent space of a pretrained VAE, not in pixel space. This dramatically reduces computation while preserving quality. The VAE encoder compresses the image; the diffusion model denoises in the latent space; the VAE decoder reconstructs the final image.
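
Schematically, the text-to-image path looks like this (the helper names are hypothetical placeholders, not the actual Stable Diffusion components):

import torch

def generate_with_latent_diffusion(vae_decoder, denoise_in_latent_space, text_cond):
    # Hypothetical helpers: denoise_in_latent_space runs the reverse diffusion loop,
    # vae_decoder maps latents back to pixel space.
    latents = torch.randn(1, 4, 64, 64)                      # 64x64x4 latents stand in for a 512x512 image
    latents = denoise_in_latent_space(latents, text_cond)    # all denoising steps happen in the small latent space
    return vae_decoder(latents)                              # decode once at the end to get pixels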

When to use / When NOT to use

| Scenario | Use diffusion | Avoid diffusion |
| --- | --- | --- |
| High-quality, diverse image generation | Yes — state-of-the-art quality and diversity | If inference speed is critical and lower quality is acceptable |
| Text-to-image or text-to-audio generation | Yes — conditioning via CFG is flexible and powerful | If you need a fast single-forward-pass generator |
| Image editing and inpainting | Yes — diffusion naturally supports masked denoising | If a GAN-based editing pipeline is already established |
| Real-time generation (e.g., game assets) | No — multi-step inference adds latency | |
| Tabular or low-dimensional data | No — overkill; simpler models work better | |

Comparisons

| Model | Sample quality | Diversity | Training stability | Inference speed |
| --- | --- | --- | --- | --- |
| Diffusion | Excellent | Excellent | Stable | Slow (many steps) |
| GAN | High (sharp) | Low (mode collapse) | Difficult | Fast (1 step) |
| VAE | Medium (blurry) | Good | Stable | Fast |
| Flow-based | High | Good | Stable | Medium |

Pros and cons

| Pros | Cons |
| --- | --- |
| Stable training — no adversarial dynamic | Slow inference — requires many denoising steps |
| Excellent sample diversity and coverage | High compute cost for both training and sampling |
| Flexible conditioning (text, class, image) | Requires careful noise schedule tuning |
| Strong theoretical foundations in score matching | Latent diffusion adds VAE compression artifacts |

Code examples

Text-to-image generation using the Hugging Face Diffusers library:

from diffusers import StableDiffusionPipeline
import torch

# Load Stable Diffusion v1.5 (fp16 for GPU efficiency)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Generate an image from a text prompt
prompt = "A photorealistic mountain landscape at golden hour, 4K"
negative_prompt = "blurry, low quality, artifacts"

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,    # Denoising steps
    guidance_scale=7.5,        # CFG strength
    height=512,
    width=512,
).images[0]

image.save("output.png")
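
Inference can be sped up by swapping the pipeline's scheduler; continuing the example above, this sketch uses the DPM-Solver multistep scheduler from the same Diffusers library, which typically needs far fewer steps:

from diffusers import DPMSolverMultistepScheduler

# Replace the default scheduler with a faster multistep solver
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Roughly comparable quality in about 20 steps instead of 50
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=20,
    guidance_scale=7.5,
).images[0]

image.save("output_fast.png")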
