Variational autoencoders (VAEs)
Probabilistic autoencoders for generation and representation.
Definition
Variational Autoencoders (VAEs), introduced by Kingma & Welling in 2013, are a class of generative models that learn a structured latent space by combining an autoencoder architecture with variational Bayesian inference. The encoder maps input data to a probability distribution over latent codes (rather than a single point), and the decoder maps sampled latent codes back to reconstructed outputs. This probabilistic formulation forces the latent space to be smooth and continuous, enabling meaningful interpolation and unconditional generation.
The training objective is the Evidence Lower Bound (ELBO); in practice the negative ELBO is minimized as a loss with two parts: a reconstruction term that pushes decoded outputs to match inputs, plus a KL divergence term that regularizes the latent distribution toward a prior (typically a standard normal). The reparameterization trick — sampling z = μ + σ·ε where ε ~ N(0, I) — allows gradients to flow through the sampling operation, making end-to-end training with backpropagation possible.
Compared to GANs, VAEs are easier to train (no adversarial dynamics), provide an explicit (though approximate) likelihood, and offer a well-structured latent space suited to interpolation and representation learning. However, the KL regularization and the pixel-wise reconstruction loss (typically MSE or BCE) tend to produce blurrier samples than GANs or diffusion models. VAEs remain a workhorse for anomaly detection and controllable generation, and serve as the compression backbone in latent diffusion (Stable Diffusion wraps its diffusion process in a VAE encoder/decoder).
How it works
Encoder
The encoder q(z|x) maps input x to the parameters of a Gaussian distribution over the latent variable z: a mean vector μ and a log-variance vector log σ². This is implemented as a neural network with two output heads.
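In PyTorch, the two heads are simply two linear layers on a shared hidden representation. A minimal sketch (dimensions are illustrative and match the MNIST example further below):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)      # head for the mean vector mu
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # head for log-variance log sigma^2

    def forward(self, x):
        h = F.relu(self.hidden(x))
        return self.fc_mu(h), self.fc_logvar(h)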
Reparameterization and sampling
A latent vector z is sampled as z = μ + σ · ε, where ε ~ N(0, I). This keeps sampling differentiable so gradients flow back through μ and σ to the encoder weights.
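This is the reparameterize method from the full example under Code examples; ε is resampled on every forward pass while all learnable quantities stay inside μ and σ:

def reparameterize(self, mu, logvar):
    std = torch.exp(0.5 * logvar)  # sigma = exp((log sigma^2) / 2)
    eps = torch.randn_like(std)    # eps ~ N(0, I) carries all the randomness
    return mu + eps * std          # differentiable with respect to mu and std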
Decoder
The decoder p(x|z) maps z back to the data space, producing the reconstructed output x̂. The reconstruction quality is measured by a reconstruction loss (MSE for continuous data, BCE for binary).
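The matching decode method from the example below; the final sigmoid keeps outputs in [0, 1], which the BCE reconstruction loss expects:

def decode(self, z):
    h = F.relu(self.fc3(z))            # expand the latent code back toward data space
    return torch.sigmoid(self.fc4(h))  # pixel intensities in [0, 1] for BCE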
Training objective (ELBO)
Loss = Reconstruction loss + β · KL(q(z|x) || p(z))

The KL term penalizes the encoder for deviating from the prior N(0, I), ensuring the latent space is compact and smooth. β-VAE uses β > 1 to increase disentanglement.
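As a sketch, β weighting is a one-parameter change to the elbo_loss function defined in the example below (the default β here is illustrative; β = 1 recovers the standard VAE):

def beta_elbo_loss(recon_x, x, mu, logvar, beta=4.0):
    bce = F.binary_cross_entropy(recon_x, x.view(-1, 784), reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + beta * kl  # beta > 1 strengthens the pull toward the prior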
Generation at inference
To generate new samples, z is drawn from the prior N(0, I) — bypassing the encoder entirely — and passed through the decoder. Because the KL term ensures the latent space is dense and smooth, most random z vectors produce coherent outputs.
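Concretely, generation is a prior sample plus one decoder pass, as in the tail of the full example (assuming a trained model with latent_dim=20):

with torch.no_grad():          # no gradients needed at inference
    z = torch.randn(16, 20)    # draw 16 codes from the prior N(0, I)
    samples = model.decode(z)  # the encoder is never used for generation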
When to use / When NOT to use
| Scenario | Recommendation | Why |
|---|---|---|
| Smooth latent interpolation needed | Use a VAE | The continuous, regularized latent space supports meaningful interpolation, which GANs do not guarantee |
| Anomaly detection via reconstruction error | Use a VAE | High reconstruction error signals anomalies (see the scoring sketch after this table); prefer a simple discriminative threshold when labeled data makes one easy |
| Representation learning with uncertainty | Use a VAE | The probabilistic encoder captures input uncertainty; a deterministic encoder (e.g., SimCLR) may suffice otherwise |
| Photorealistic image generation | Avoid | Outputs tend to be blurry compared to GANs and diffusion models |
| Applications requiring exact likelihood | Use with caution | The ELBO is a lower bound on the likelihood, not an exact value |
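A minimal sketch of the reconstruction-error scoring referenced in the table, assuming a trained model like the one in the example below; the threshold tau is hypothetical and would be tuned on validation data:

import torch
import torch.nn.functional as F

def anomaly_score(model, x):
    # Per-example reconstruction error; inputs far from the training
    # distribution reconstruct poorly and receive high scores.
    with torch.no_grad():
        recon, mu, logvar = model(x)
        return F.binary_cross_entropy(
            recon, x.view(x.size(0), -1), reduction="none"
        ).sum(dim=1)

tau = 150.0  # hypothetical, dataset-dependent threshold
# is_anomaly = anomaly_score(model, x_batch) > tau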
Comparisons
| Model | Latent structure | Sample sharpness | Training | Likelihood |
|---|---|---|---|---|
| VAE | Smooth, regularized | Blurry | Stable | Approximate (ELBO) |
| GAN | Unstructured (no inference network) | Sharp | Unstable (adversarial) | None |
| Diffusion | Implicit (noise schedule) | Very sharp | Stable | Approximate |
| AE (plain) | Unregularized | Sharp | Stable | None |
Pros and cons
| Pros | Cons |
|---|---|
| Stable, principled training via ELBO | Samples are often blurrier than GANs or diffusion |
| Explicit (lower-bound) likelihood for evaluation | KL term can over-regularize, reducing expressiveness |
| Smooth latent space supports interpolation | Posterior collapse — q(z|x) collapses to the prior and the decoder ignores z when the decoder is too powerful |
| Useful for anomaly detection and representation learning | Reconstruction loss is a proxy; may not match perceptual quality |
Code examples
Minimal VAE on MNIST using PyTorch:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
class VAE(nn.Module):
def __init__(self, latent_dim=20):
super().__init__()
# Encoder
self.fc1 = nn.Linear(784, 400)
self.fc_mu = nn.Linear(400, latent_dim)
self.fc_logvar = nn.Linear(400, latent_dim)
# Decoder
self.fc3 = nn.Linear(latent_dim, 400)
self.fc4 = nn.Linear(400, 784)
def encode(self, x):
h = F.relu(self.fc1(x))
return self.fc_mu(h), self.fc_logvar(h)
def reparameterize(self, mu, logvar):
std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)
return mu + eps * std
def decode(self, z):
h = F.relu(self.fc3(z))
return torch.sigmoid(self.fc4(h))
def forward(self, x):
mu, logvar = self.encode(x.view(-1, 784))
z = self.reparameterize(mu, logvar)
return self.decode(z), mu, logvar
def elbo_loss(recon_x, x, mu, logvar):
    # Negative ELBO: reconstruction term plus KL regularizer.
    bce = F.binary_cross_entropy(recon_x, x.view(-1, 784), reduction="sum")
    # Closed-form KL between N(mu, sigma^2) and the prior N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kl
model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = DataLoader(
datasets.MNIST(".", download=True, transform=transforms.ToTensor()),
batch_size=128, shuffle=True
)
for epoch in range(5):
total_loss = 0
for x, _ in loader:
recon, mu, logvar = model(x)
loss = elbo_loss(recon, x, mu, logvar)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
total_loss += loss.item()
print(f"Epoch {epoch+1} | Loss: {total_loss / len(loader.dataset):.2f}")
# Generate new samples
with torch.no_grad():
z = torch.randn(16, 20)
    samples = model.decode(z).view(16, 1, 28, 28)

Practical resources
- Auto-Encoding Variational Bayes (Kingma & Welling, 2013) — Original VAE paper introducing ELBO and the reparameterization trick
- PyTorch – VAE example — Official minimal implementation
- β-VAE (Higgins et al., 2017) — Extension for disentangled latent representations
- Latent Diffusion Models (Rombach et al.) — How VAE compression underpins Stable Diffusion