STANFORD XCS236 · DEEP GENERATIVE MODELS
Variational Autoencoders
Week 2–3 · ELBO, Latent Space & VAE Variants
01

VAE Architecture

Variational autoencoders combine autoencoders with probabilistic inference. The encoder q_φ(z|x) maps an input x to a distribution over latents, the decoder p_θ(x|z) reconstructs from latent samples, and the prior p(z) regularizes the latent space. Training maximizes the Evidence Lower Bound (ELBO), which balances reconstruction accuracy against keeping q_φ(z|x) close to the prior.

VAEs learn continuous, interpretable latent representations that enable smooth interpolation and generative sampling. Unlike deterministic autoencoders, VAEs explicitly model uncertainty and provide a principled framework for learning generative models through amortized variational inference.

02

Loss Function

The VAE loss is the negative ELBO: a reconstruction term (MSE or BCE in practice) plus a Kullback-Leibler divergence. The ELBO is E_{q_φ(z|x)}[log p_θ(x|z)] − D_KL(q_φ(z|x) || p(z)): the first term rewards reconstruction fidelity and the second penalizes posteriors that drift from the prior. Weighting the KL term by a coefficient β controls the trade-off between reconstruction quality and latent space structure.
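The relationship between the two terms and the data log-likelihood can be written out as the standard identity below. It uses the symbols from this section; the true posterior p_θ(z|x) is the only quantity not named in the text above.

```latex
\begin{aligned}
\log p_\theta(x)
  &= \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x|z)\,p(z)}{q_\phi(z|x)}\right]
   + D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right) \\
  &\ge \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right]
   - D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p(z)\right)
   = \mathrm{ELBO}(\theta,\phi;x)
\end{aligned}
```

The inequality holds because the dropped KL term is non-negative, and the bound is tight exactly when q_φ(z|x) matches the true posterior.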

β-VAE introduces a hyperparameter β to weight the KL term, enabling control over disentanglement. Setting β = 1 recovers the standard ELBO; β > 1 encourages more independent latent factors, usually at some cost in reconstruction quality. Understanding this decomposition reveals why VAEs learn both compression and generation simultaneously, and how to tune for different applications.
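A minimal NumPy sketch of the β-weighted loss, assuming a Bernoulli decoder, a diagonal Gaussian posterior, and a standard normal prior. The function names (`gaussian_kl`, `bernoulli_nll`, `beta_vae_loss`) and the toy inputs are illustrative, not taken from the course material.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def bernoulli_nll(x, x_prob, eps=1e-7):
    """Binary cross-entropy summed over data dims; x_prob are decoder means."""
    p = np.clip(x_prob, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p), axis=-1)

def beta_vae_loss(x, x_prob, mu, logvar, beta=1.0):
    """Batch means of (total loss, reconstruction NLL, KL).

    With beta=1 the total is a single-sample estimate of the negative ELBO.
    """
    recon = bernoulli_nll(x, x_prob)
    kl = gaussian_kl(mu, logvar)
    return float(np.mean(recon + beta * kl)), float(np.mean(recon)), float(np.mean(kl))

# Toy batch of one example (values are made up for illustration).
x = np.array([[1.0, 0.0, 1.0, 1.0]])
x_prob = np.array([[0.9, 0.2, 0.8, 0.7]])
mu = np.array([[0.5, -0.5]])
logvar = np.array([[0.0, 0.0]])

loss, recon, kl = beta_vae_loss(x, x_prob, mu, logvar, beta=4.0)
print(f"loss={loss:.4f} recon={recon:.4f} kl={kl:.4f}")
```

Returning the two terms separately makes it easy to monitor how a change in β shifts the balance between them during training.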

03

Encoder Design

The encoder outputs the mean μ and log-variance log σ² of a Gaussian posterior q_φ(z|x). A diagonal covariance keeps the parameter count linear in the latent dimension and makes both sampling and the KL term cheap to compute. The reparameterization trick z = μ + σ ⊙ ε, with ε ~ N(0, I), moves the randomness into ε so that gradients flow through μ and σ, making the sampling step differentiable with respect to φ.
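The reparameterization step can be sketched in a few lines of NumPy. This only illustrates the sampling formula; NumPy has no automatic differentiation, so the gradient path is described in the note after the code. The name `reparameterize` and the toy values are assumptions for the example.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    sigma = np.exp(0.5 * logvar)           # log-variance -> standard deviation
    eps = rng.standard_normal(mu.shape)    # noise independent of the parameters
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.full((100_000, 2), 1.5)
logvar = np.full((100_000, 2), np.log(0.25))  # sigma = 0.5

z = reparameterize(mu, logvar, rng)
# Empirical mean and std should be close to 1.5 and 0.5 respectively.
print(z.mean(axis=0), z.std(axis=0))
```

In an autodiff framework ε is treated as a constant input, so ∂z/∂μ = 1 and ∂z/∂σ = ε, which is what lets the reconstruction gradient reach the encoder parameters.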

Amortized inference means the encoder learns a mapping x → q(z|x), avoiding per-sample optimization. This enables fast inference and scalable training. Architecture choices (depth, width, skip connections) affect expressiveness and posterior collapse risk. Modern variants use more flexible posteriors like normalizing flows or hierarchical structures.
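To make the idea of amortization concrete, here is a minimal sketch of an encoder whose weights are shared across all inputs, so a single forward pass yields posterior parameters for a whole batch. The class `TinyEncoder`, its layer sizes, and the random initialization are hypothetical choices for illustration; no training is shown.

```python
import numpy as np

class TinyEncoder:
    """One-hidden-layer MLP mapping x to (mu, logvar) of q(z|x)."""

    def __init__(self, x_dim, h_dim, z_dim, rng):
        self.W1 = rng.normal(0.0, 0.1, (x_dim, h_dim))
        self.b1 = np.zeros(h_dim)
        self.W_mu = rng.normal(0.0, 0.1, (h_dim, z_dim))
        self.b_mu = np.zeros(z_dim)
        self.W_lv = rng.normal(0.0, 0.1, (h_dim, z_dim))
        self.b_lv = np.zeros(z_dim)

    def __call__(self, x):
        h = np.tanh(x @ self.W1 + self.b1)      # shared hidden features
        mu = h @ self.W_mu + self.b_mu          # posterior means
        logvar = h @ self.W_lv + self.b_lv      # posterior log-variances
        return mu, logvar

rng = np.random.default_rng(0)
enc = TinyEncoder(x_dim=784, h_dim=64, z_dim=8, rng=rng)
x_batch = rng.random((32, 784))

# One forward pass gives 32 posteriors; no per-example optimization loop.
mu, logvar = enc(x_batch)
print(mu.shape, logvar.shape)
```

The contrast is with classical variational inference, which would fit a separate (μ, σ) pair for each data point by running an inner optimization.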

04

Decoder Design

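The decoder p_θ(x|z) defines the likelihood of data given latents. For binary data the decoder outputs Bernoulli parameters, and the negative log-likelihood is the BCE loss; for continuous data it typically outputs a Gaussian mean with fixed or learned variance, and with fixed variance the negative log-likelihood reduces to a scaled MSE. Perceptual losses (LPIPS, VGG) complement pixel-level metrics when reconstruction must preserve semantic content rather than exact pixel values.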
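The two likelihood choices can be sketched as log-likelihood functions. This assumes the Bernoulli decoder emits logits and the Gaussian decoder emits a mean together with a log-variance (pass a constant for the fixed-variance case); the function names are illustrative.

```python
import numpy as np

def bernoulli_log_lik(x, logits):
    """log p(x|z) for binary x, summed over data dims.

    Uses x*logits - softplus(logits); logaddexp keeps it numerically stable.
    """
    return np.sum(x * logits - np.logaddexp(0.0, logits), axis=-1)

def gaussian_log_lik(x, mean, logvar):
    """log p(x|z) under N(mean, diag(exp(logvar))), summed over data dims."""
    return -0.5 * np.sum(
        np.log(2.0 * np.pi) + logvar + (x - mean) ** 2 / np.exp(logvar),
        axis=-1,
    )

x_bin = np.array([1.0, 0.0])
print(bernoulli_log_lik(x_bin, np.zeros(2)))   # logits of 0 mean p = 0.5 per pixel

x_cont = np.array([0.2, -0.1, 0.4])
mean = np.array([0.0, 0.0, 0.5])
print(gaussian_log_lik(x_cont, mean, 0.0))      # fixed unit variance
```

With the log-variance held at 0, differences in `gaussian_log_lik` between two reconstructions equal minus half the difference in their squared errors, which is why MSE is the usual stand-in for a fixed-variance Gaussian decoder.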

Decoder architecture impacts generation quality and stability. Transposed convolutions, upsampling, or progressive generation can improve visual quality. The decoder must balance model capacity against overfitting. Output activations (sigmoid for [0,1], tanh for [-1,1]) and variance parameterization (fixed vs. learned) significantly affect generation.

05

Latent Space Properties

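VAE latent spaces tend to be smooth: interpolating between z₁ and z₂ usually produces semantically meaningful intermediate reconstructions. Latent arithmetic allows attribute manipulation: adding the difference between a smiling and a neutral encoding of one face to another face's encoding approximates a smiling version of that face (z_other_smiling ≈ z_other_face + z_smiling − z_neutral). The prior p(z) = N(0, I), enforced through the KL term, encourages a structure in which nearby latent points decode to similar data.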
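Interpolation and attribute arithmetic both reduce to simple vector operations on latent codes. The sketch below assumes the codes have already been produced by an encoder and would afterwards be passed to a decoder, neither of which is shown; `lerp` and `apply_attribute` are hypothetical helper names.

```python
import numpy as np

def lerp(z1, z2, steps):
    """Return `steps` points on the straight line from z1 to z2, endpoints included."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * z1 + t * z2

def apply_attribute(z, z_with, z_without, strength=1.0):
    """Shift z along the direction (z_with - z_without), e.g. a 'smile' direction."""
    return z + strength * (z_with - z_without)

z1 = np.array([0.0, 0.0, 0.0])
z2 = np.array([2.0, -2.0, 4.0])
path = lerp(z1, z2, steps=5)
print(path)  # each row would be decoded to one intermediate sample

z_smiling = np.array([1.0, 0.5, 0.0])
z_neutral = np.array([0.0, 0.5, 0.0])
z_other = np.array([-1.0, 2.0, 0.3])
print(apply_attribute(z_other, z_smiling, z_neutral))
```

Linear interpolation is the simplest choice; spherical interpolation is sometimes preferred in high-dimensional Gaussian latent spaces, but that variant is not shown here.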

Latent space quality depends on encoder capacity, KL weight, and architectural choices. Disentanglement (independent factors) improves interpretability but requires careful tuning. Exploring the manifold reveals learned invariances and variations. A posterior variance close to the prior's (σ² ≈ 1) in some dimension suggests that dimension carries little information about x; a variance near zero means the encoder is acting almost deterministically there.
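One rough way to inspect which latent dimensions carry information is to compute the batch-averaged KL for each dimension separately. This is a diagnostic sketch under the diagonal Gaussian and N(0, I) assumptions used in this document; the threshold of 0.01 nats and the names `per_dim_kl` and `active_dims` are arbitrary choices for illustration.

```python
import numpy as np

def per_dim_kl(mu, logvar):
    """Batch mean of KL(q(z_j|x) || N(0, 1)) for each latent dim j."""
    return np.mean(0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar), axis=0)

def active_dims(mu, logvar, threshold=0.01):
    """Boolean mask of dims whose average KL exceeds the (assumed) threshold."""
    return per_dim_kl(mu, logvar) > threshold

# Toy posterior parameters for a batch of 4 inputs and 2 latent dims.
# Dim 0: means vary with the input and variance is small (informative).
# Dim 1: mean 0 and log-variance 0 everywhere (matches the prior exactly).
mu = np.array([[1.0, 0.0], [-1.0, 0.0], [2.0, 0.0], [-2.0, 0.0]])
logvar = np.array([[-2.0, 0.0]] * 4)

print(per_dim_kl(mu, logvar))
print(active_dims(mu, logvar))
```

A dimension with near-zero KL is indistinguishable from the prior, so the decoder cannot be using it to tell inputs apart.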

06

VAE Variants

Conditional VAE (CVAE) incorporates class labels or other context, typically by concatenating them to the encoder and decoder inputs. VQ-VAE replaces continuous latents with discrete codebook vectors: each encoder output is snapped to its nearest codebook entry. Hierarchical VAEs structure latents across scales, modeling coarse-to-fine variation. NVAE is a deep hierarchical VAE whose architecture and training techniques scale this approach to high-resolution images; normalizing flows are an optional addition for more expressive posteriors.
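The quantization step of VQ-VAE is a nearest-neighbor lookup, which can be sketched as follows. This covers only the forward lookup; the straight-through gradient estimator and the codebook and commitment losses used in training are omitted. The function name `quantize` and the toy codebook are assumptions.

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder output to its nearest codebook vector (Euclidean)."""
    # Pairwise squared distances, shape (num_vectors, num_codes).
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
z_e = np.array([[0.9, 1.2], [-0.1, 0.1], [-0.8, 0.7]])

z_q, idx = quantize(z_e, codebook)
print(idx)   # indices of the selected codes
print(z_q)   # quantized latents fed to the decoder
```

The integer indices are the discrete latent representation; a separate prior model over these indices is typically fit afterwards to enable sampling.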

Variants extend the core VAE framework for specific constraints. β-TCVAE targets disentanglement by penalizing the total correlation between latent dimensions. SoothingVAE manages posterior collapse through temporal smoothing. Duplex VAE handles asymmetric data noise. Each variant addresses specific failure modes or application requirements while building on the fundamental ELBO objective.

07

Posterior Collapse

Posterior collapse occurs when the encoder learns q_φ(z|x) ≈ p(z), causing the KL term to vanish. The decoder then learns to ignore z and models the data without it, resulting in uninformative latents and degraded generation quality. Common causes include an overly expressive decoder (for example, an autoregressive one), a KL penalty that dominates early in training before the latents carry useful information, and weak encoder initialization.

Mitigation strategies include KL annealing (β increases from 0 over training), free bits (a minimum KL allowance per latent dimension or group), cyclical schedules (β is ramped up repeatedly), and decoder regularization. Weakening the decoder via dropout or capacity limits forces it to use latent information. Hierarchical VAEs spread information across scales, which can help, although their upper layers can also collapse. Understanding collapse mechanisms is crucial for training VAEs successfully.
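The schedule-based mitigations are easy to express as small functions of the training step. The sketch below shows a linear warm-up, a cyclical ramp, and a free-bits floor applied to each latent dimension's batch-averaged KL, which is one common formulation. All names and default values (`warmup_steps`, `ramp_frac`, `free_bits=0.5`) are illustrative assumptions.

```python
import numpy as np

def linear_anneal(step, warmup_steps, beta_max=1.0):
    """Ramp beta linearly from 0 to beta_max over warmup_steps, then hold."""
    return beta_max * min(1.0, step / warmup_steps)

def cyclical_anneal(step, cycle_len, ramp_frac=0.5, beta_max=1.0):
    """Restart the ramp every cycle_len steps; ramp during the first ramp_frac."""
    pos = (step % cycle_len) / cycle_len
    return beta_max * min(1.0, pos / ramp_frac)

def free_bits_kl(kl_per_dim, free_bits=0.5):
    """Sum of per-dim KL values, each clamped from below at free_bits nats.

    Dims already below the floor contribute a constant, so they receive
    no KL gradient pushing them further toward the prior.
    """
    return float(np.sum(np.maximum(kl_per_dim, free_bits)))

for step in (0, 25, 50, 75, 100, 125):
    print(step, linear_anneal(step, 100), cyclical_anneal(step, 100))

print(free_bits_kl(np.array([0.1, 2.0]), free_bits=0.5))
```

In a training loop the scheduled β would multiply the KL term of the loss at each step, and the free-bits value would replace the raw KL sum.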

08

Applications

VAEs are widely used for image generation on datasets such as MNIST and CelebA, where they can learn interpretable attributes such as facial features. Drug discovery uses VAEs for molecule generation and property prediction. Anomaly detection leverages reconstruction error: out-of-distribution samples typically have higher error. Representation learning via VAE encoders provides features for downstream tasks.
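Reconstruction-error anomaly detection can be sketched as a thresholding rule. The example assumes reconstructions have already been produced by a trained VAE (not shown) and picks the threshold as a high quantile of the errors on normal training data; the 0.99 quantile and the function names are assumptions for illustration.

```python
import numpy as np

def recon_error(x, x_hat):
    """Per-example mean squared reconstruction error."""
    return np.mean((x - x_hat) ** 2, axis=-1)

def fit_threshold(train_errors, quantile=0.99):
    """Choose the threshold as a quantile of errors on in-distribution data."""
    return float(np.quantile(train_errors, quantile))

def flag_anomalies(errors, threshold):
    """True where the error exceeds the threshold."""
    return errors > threshold

# Stand-in for errors measured on normal training data.
train_errors = np.linspace(0.0, 1.0, 101)
threshold = fit_threshold(train_errors, quantile=0.99)

# Two test inputs with their (hypothetical) reconstructions.
x = np.array([[0.0, 1.0], [1.0, 1.0]])
x_hat = np.array([[0.0, 1.0], [-1.0, -1.0]])
errors = recon_error(x, x_hat)
print(threshold, errors, flag_anomalies(errors, threshold))
```

The quantile controls the false-positive rate on normal data, so it is usually tuned on a held-out validation set.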

Sequential VAEs model temporal structure in time series. Semi-supervised VAEs combine labeled and unlabeled data. VAE-GAN hybrids combine the sample quality of GANs with the principled inference of VAEs. Recommendation systems use VAEs for collaborative filtering. The learned latent space enables visualization, clustering, and transfer learning across domains.

09

References & Further Reading

The Variational Autoencoder framework provides a principled approach to learning latent representations through probabilistic inference. This section gathers key resources for deeper understanding of VAE architectures, training dynamics, and applications across diverse domains.

From foundational papers to modern extensions, these references document the evolution and impact of VAEs in generative modeling research.