Stanford XCS236 · Deep Generative Models
MLE & Latent Variables
Week 2 · ELBO, Variational Inference & Reparameterization
01

Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) is the foundation of much of modern statistical learning. Given observed data x, we seek the parameters θ that make that data most probable under the model: θ* = argmax_θ log p(x|θ). Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood, the loss behind many standard training objectives (for example, cross-entropy for classification).

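For complex models there is usually no closed-form solution, so we use gradient-based optimization: compute ∇_θ log p(x|θ) and take stochastic gradient ascent steps on the log-likelihood (equivalently, stochastic gradient descent on the negative log-likelihood). The gradient is the direction of steepest local increase in likelihood. This approach scales to high-dimensional parameter spaces and forms the basis of deep learning. Working with the log-likelihood is also numerically preferable: products of many small probabilities become sums of logs, avoiding underflow.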
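
As a minimal sketch of gradient-based MLE, the example below fits a Bernoulli model p(x = 1 | θ) = sigmoid(θ) to simulated coin flips. The model, the data, and the names `fit_bernoulli` and `avg_log_likelihood` are illustrative assumptions, not part of the course material. For simplicity it uses full-batch gradient ascent; a stochastic version would compute the same gradient on a random minibatch.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def avg_log_likelihood(logit, data):
    """Average log p(x | theta) for a Bernoulli model with p = sigmoid(logit)."""
    p = sigmoid(logit)
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in data) / len(data)

def fit_bernoulli(data, lr=0.5, steps=500):
    """Gradient ascent on the average log-likelihood (hypothetical helper)."""
    logit = 0.0
    mean = sum(data) / len(data)
    for _ in range(steps):
        # d/d(logit) of the average log-likelihood is mean(x) - sigmoid(logit).
        grad = mean - sigmoid(logit)
        logit += lr * grad  # ascent: move in the direction of the gradient
    return logit

random.seed(0)
data = [1 if random.random() < 0.7 else 0 for _ in range(1000)]  # simulated flips
logit_hat = fit_bernoulli(data)
p_hat = sigmoid(logit_hat)
print("fitted p:", p_hat, "sample mean:", sum(data) / len(data))
```

For this model the maximizer is known in closed form (the sample mean), which makes it a convenient check that the gradient updates converge to the MLE.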

log p(x|θ): likelihood
∇_θ: gradient flow
SGD: optimization
Scalable to high dimensions
02

Introducing Latent Variables

Many phenomena are better modeled with hidden variables z that generate the observed data x. A latent variable model defines the joint distribution p(x,z) = p(x|z)p(z). The key challenge: we never observe z directly. To fit these models by maximum likelihood, we must marginalize z out: p(x) = ∫ p(x|z)p(z) dz. This integral is typically intractable for complex models.

Intractability arises because z may be high-dimensional or because the integral has no closed form, as is typical when p(x|z) is parameterized by a neural network. We cannot apply MLE directly because computing log p(x) requires evaluating this integral. This motivates variational inference: instead of exact inference, we approximate the posterior p(z|x) and optimize a lower bound on log p(x).
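
To make the marginal concrete, here is a small sketch that estimates p(x) = E_{p(z)}[p(x|z)] by sampling z from the prior. The toy model (z ~ N(0, 1), x | z ~ N(z, 1)) is an assumption chosen because its marginal is known exactly, x ~ N(0, 2), so the estimate can be checked. In one dimension this naive estimator works well; with high-dimensional z it generally needs impractically many samples, which is the difficulty described above.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def marginal_mc(x, num_samples, rng):
    """Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x | z)] for the toy model."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)          # z ~ p(z) = N(0, 1)
        total += normal_pdf(x, z, 1.0)   # p(x | z) = N(x; z, 1)
    return total / num_samples

rng = random.Random(0)
x = 1.5
estimate = marginal_mc(x, 200_000, rng)
exact = normal_pdf(x, 0.0, 2.0)  # closed form for this toy model: x ~ N(0, 2)
print("Monte Carlo:", estimate, "exact:", exact)
```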

p(x,z): joint
∫ p(x|z)p(z) dz: marginal
p(z|x): posterior
Intractable in general
03

Gaussian Mixtures & EM

Gaussian mixture models (GMMs) exemplify latent variable models. A discrete latent z ∈ {1,...,K} selects which Gaussian component generates x. The joint is p(x,z) = p(x|z)p(z), where p(x|z) = N(x|μ_z, Σ_z) and p(z) is categorical. This factorization is natural: each component explains a distinct data mode.

The expectation-maximization (EM) algorithm fits GMMs by iterating two steps: the E-step computes the posterior p(z|x) under the current parameters, and the M-step updates the parameters using the expected sufficient statistics under that posterior. EM is a special case of variational inference in which the variational posterior is set exactly equal to the true posterior in the E-step, so the bound is tight before each M-step and the likelihood never decreases from one iteration to the next.
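
The two steps can be sketched for a one-dimensional, two-component mixture as follows. The simulated data, the initial parameters, and the helper names (`em_step`, `log_likelihood`) are assumptions made for illustration, and the variance floor is a common safeguard rather than part of the algorithm itself.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    """Sum over data of log p(x), with z marginalized out by direct summation."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )

def em_step(data, weights, means, variances):
    # E-step: responsibilities resp[i][j] = p(z = j | x_i) under current parameters.
    resp = []
    for x in data:
        joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate parameters from responsibility-weighted statistics.
    n = len(data)
    new_w, new_m, new_v = [], [], []
    for j in range(len(weights)):
        nj = sum(r[j] for r in resp)
        mean_j = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var_j = sum(r[j] * (x - mean_j) ** 2 for r, x in zip(resp, data)) / nj
        new_w.append(nj / n)
        new_m.append(mean_j)
        new_v.append(max(var_j, 1e-6))  # floor guards against a collapsing component
    return new_w, new_m, new_v

rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) for _ in range(300)] + [rng.gauss(3.0, 0.5) for _ in range(200)]
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])  # weights, means, variances
history = [log_likelihood(data, *params)]
for _ in range(30):
    params = em_step(data, *params)
    history.append(log_likelihood(data, *params))
print("fitted means:", params[1])
```

Tracking `history` makes the monotonic behavior visible: each entry should be no smaller than the one before it.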

K components
Discrete latent z
EM algorithm
Monotonic increase
04

Evidence Lower Bound

The evidence lower bound (ELBO) provides a tractable objective for latent variable models. For any distribution q(z|x), non-negativity of the KL divergence, KL(q(z|x) || p(z|x)) ≥ 0, implies: log p(x) ≥ E_q[log p(x,z)] - E_q[log q(z|x)]. The right-hand side is the ELBO. The gap between log p(x) and the ELBO is exactly KL(q(z|x) || p(z|x)), so the bound is tight when q(z|x) = p(z|x), and its tightness measures how well q approximates the true posterior.
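
One standard way to see where the bound comes from is to write log p(x) as an expectation under q and split the log ratio. The derivation below uses only the definitions above and the identity p(x,z) = p(z|x) p(x):

```latex
\begin{aligned}
\log p(x)
  &= \mathbb{E}_{q(z|x)}\!\left[\log p(x)\right]
   = \mathbb{E}_{q(z|x)}\!\left[\log \frac{p(x,z)}{p(z|x)}\right] \\
  &= \mathbb{E}_{q(z|x)}\!\left[\log \frac{p(x,z)}{q(z|x)}\right]
   + \mathbb{E}_{q(z|x)}\!\left[\log \frac{q(z|x)}{p(z|x)}\right] \\
  &= \underbrace{\mathbb{E}_{q}[\log p(x,z)] - \mathbb{E}_{q}[\log q(z|x)]}_{\text{ELBO}}
   + \underbrace{\mathrm{KL}\big(q(z|x)\,\|\,p(z|x)\big)}_{\ge 0}.
\end{aligned}
```

Dropping the non-negative KL term gives the inequality, and the term vanishes exactly when q(z|x) = p(z|x).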

Maximizing the ELBO maximizes a lower bound on the log-likelihood rather than the likelihood itself. Using p(x,z) = p(x|z)p(z), the ELBO splits into two terms: the reconstruction term E_q[log p(x|z)] encourages q to place mass on z values that explain x, while the term E_q[log p(z)/q(z|x)] = -KL(q(z|x) || p(z)) penalizes q for moving away from the prior. This decomposition pairs a data-fit objective (reconstruction) with a regularizer (KL), providing a principled framework for generative models.
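
A small numerical check can make the bound tangible. The sketch below assumes a discrete latent with three values and made-up numbers for p(z) and p(x|z) at one fixed x, so that log p(x) and the true posterior can be computed exactly and compared against the ELBO for several choices of q.

```python
import math

prior = [0.5, 0.3, 0.2]           # p(z), illustrative values
likelihood = [0.10, 0.60, 0.30]   # p(x | z) for one fixed x, illustrative values

def elbo(q):
    """ELBO = E_q[log p(x|z)] - KL(q || p(z)) for a discrete latent."""
    recon = sum(qz * math.log(lik) for qz, lik in zip(q, likelihood))
    kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, prior) if qz > 0)
    return recon - kl

evidence = sum(pz * lik for pz, lik in zip(prior, likelihood))   # p(x)
log_evidence = math.log(evidence)
posterior = [pz * lik / evidence for pz, lik in zip(prior, likelihood)]  # p(z | x)

elbo_prior = elbo(prior)                    # q = prior: KL term is zero, bound is loose
elbo_uniform = elbo([1 / 3, 1 / 3, 1 / 3])  # another arbitrary q
elbo_posterior = elbo(posterior)            # q = true posterior: bound is tight

print("log p(x):", log_evidence)
print("ELBO with q = prior:", elbo_prior)
print("ELBO with q = uniform:", elbo_uniform)
print("ELBO with q = posterior:", elbo_posterior)
```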

ELBO: lower bound
KL gap: tightness
Recon + KL: decomposed
q(z|x): approximate posterior
05

Variational Inference

Variational inference replaces the intractable posterior p(z|x) with a tractable approximation q(z|x) and maximizes the ELBO over q. We restrict q to a function class that keeps computation cheap (e.g., a diagonal Gaussian, which factorizes fully across dimensions). The approximate posterior takes the observed x as input, so it can adapt to each data point while sharing one set of parameters across the whole dataset. This design choice is called amortization.

Amortization is powerful: instead of solving a separate optimization problem per data point, we train a single neural network to output the parameters of q. Amortized inference is fast at test time and enables scalable learning. The variational objective becomes: maximize E_q[log p(x|z)] - KL(q(z|x) || p(z)). For common choices such as a Gaussian q and a Gaussian prior, the KL term has a closed form, and the reconstruction term can be estimated with samples from q, allowing gradient-based optimization.
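
As one concrete case, the sketch below computes the KL term in closed form for a diagonal Gaussian q(z|x) = N(μ, diag(σ²)) against a standard normal prior, and checks it against a sampling estimate. The standard normal prior is an assumption (the slides do not fix p(z)), and `mu` and `log_var` stand in for the outputs a hypothetical encoder network would produce for one x.

```python
import math
import random

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def kl_monte_carlo(mu, log_var, num_samples, rng):
    """Sampling estimate of the same KL: average of log q(z) - log p(z), z ~ q."""
    total = 0.0
    for _ in range(num_samples):
        for m, lv in zip(mu, log_var):
            var = math.exp(lv)
            z = m + math.sqrt(var) * rng.gauss(0.0, 1.0)
            log_q = -0.5 * (math.log(2.0 * math.pi) + lv + (z - m) ** 2 / var)
            log_p = -0.5 * (math.log(2.0 * math.pi) + z * z)
            total += log_q - log_p
    return total / num_samples

mu = [0.5, -1.0]        # hypothetical encoder outputs for one data point
log_var = [0.0, -1.0]
rng = random.Random(0)
kl_exact = kl_diag_gaussian_to_standard_normal(mu, log_var)
kl_sampled = kl_monte_carlo(mu, log_var, 50_000, rng)
print("closed form:", kl_exact, "Monte Carlo:", kl_sampled)
```

Parameterizing the variance through its logarithm is a common convention because it keeps σ² positive without constraining the network output.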

q(z|x): variational posterior
Amortized inference
KL divergence minimization
Neural network q-encoder
06

Reparameterization Trick

A fundamental challenge in variational inference is backpropagating gradients through the sampling operation: drawing z ~ q(z|x) is not a differentiable function of q's parameters. The reparameterization trick solves this by expressing sampling as a deterministic transformation of an auxiliary random variable. For a Gaussian q with mean μ and standard deviation σ, instead of sampling z ~ q directly, we write z = μ + σ⊙ε where ε ~ N(0,I). The randomness now lives in ε, and gradients flow through the deterministic functions μ(x) and σ(x).

This trick enables end-to-end gradient flow through sampling. Losses like the ELBO become differentiable with respect to the parameters of q (μ, σ). The resulting gradient estimator typically has much lower variance than the alternative score-function (REINFORCE) estimator, because it uses the derivative of the integrand along the deterministic path. This is essential for training variational autoencoders (VAEs), where both encoder q and decoder p are neural networks jointly optimized by gradient descent.
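
To illustrate, the sketch below estimates the gradient of E_q[f(z)] for the toy choice f(z) = z² with a one-dimensional Gaussian q, where the exact answers are known (d/dμ = 2μ, d/dσ = 2σ). It compares the reparameterized estimator with the score-function estimator on the same noise draws. The test function and all helper names are assumptions made for the example, and the derivatives are written by hand instead of using an autodiff library.

```python
import random
import statistics

def f(z):
    return z * z

def reparam_grad_mu(mu, sigma, eps):
    z = mu + sigma * eps
    return 2.0 * z            # f'(z) * dz/dmu, with dz/dmu = 1

def reparam_grad_sigma(mu, sigma, eps):
    z = mu + sigma * eps
    return 2.0 * z * eps      # f'(z) * dz/dsigma, with dz/dsigma = eps

def score_grad_mu(mu, sigma, eps):
    z = mu + sigma * eps
    return f(z) * (z - mu) / sigma ** 2   # f(z) * d log q(z) / d mu

rng = random.Random(0)
mu, sigma = 1.0, 1.0
noise = [rng.gauss(0.0, 1.0) for _ in range(50_000)]   # eps ~ N(0, 1)

rep_mu = [reparam_grad_mu(mu, sigma, e) for e in noise]
rep_sigma = [reparam_grad_sigma(mu, sigma, e) for e in noise]
score_mu = [score_grad_mu(mu, sigma, e) for e in noise]

print("d/dmu    reparam mean:", statistics.fmean(rep_mu))
print("d/dmu    score   mean:", statistics.fmean(score_mu))
print("d/dsigma reparam mean:", statistics.fmean(rep_sigma))
print("per-sample variance (reparam vs score):",
      statistics.variance(rep_mu), statistics.variance(score_mu))
```

Both estimators are unbiased for the same gradient, so their means agree; the difference shows up in the per-sample variance, which is what determines how noisy each training step is.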

z = μ + σ⊙ε: reparameterized
Low-variance gradient
Deterministic path
End-to-end differentiable
07

Importance Weighting

The ELBO is a lower bound on log p(x), and when q is a poor approximation to the posterior the gap can be large. Importance-weighted autoencoder (IWAE) bounds use multiple samples from q to tighten it. With M independent samples z_1,...,z_M ~ q(z|x), the IWAE bound is: log p(x) ≥ E[log (1/M) Σ_m p(x,z_m)/q(z_m|x)]. Under mild conditions on the importance weights, this converges to log p(x) as M→∞. With M=1, it reduces to the standard ELBO.
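
That this quantity is a lower bound follows from Jensen's inequality, using the fact that each importance weight is an unbiased estimate of p(x). The notation w_m is introduced here for convenience:

```latex
\begin{aligned}
w_m &= \frac{p(x, z_m)}{q(z_m|x)}, \qquad
\mathbb{E}_{z_m \sim q(z|x)}[w_m]
   = \int q(z|x)\,\frac{p(x,z)}{q(z|x)}\,dz = p(x), \\
\mathbb{E}\!\left[\log \frac{1}{M}\sum_{m=1}^{M} w_m\right]
  &\le \log \mathbb{E}\!\left[\frac{1}{M}\sum_{m=1}^{M} w_m\right]
   = \log p(x).
\end{aligned}
```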

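The IWAE bound is at least as tight as the ELBO for every M ≥ 1 (with equality at M = 1) and is non-decreasing in M, at a computational cost that grows linearly in M. This provides a principled way to trade computation for tighter bounds. There is a tradeoff, however: more samples reduce the bias of the bound, but a tighter bound does not automatically give better gradients, and the gradient signal for the encoder q can become noisier relative to its magnitude as M grows. In practice, modest M (5-50) often suffices. IWAE shows that the tightness of the bound is itself a tunable quantity, opening pathways to more effective variational training.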
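
The effect of M can be sketched on a toy model where log p(x) is known exactly. The example assumes z ~ N(0, 1), x | z ~ N(z, 1), and a deliberately poor proposal q(z|x) = N(0, 1) that ignores x, so the M = 1 bound is loose and the tightening is easy to see. The function names are illustrative, and the outer average over `num_outer` repetitions estimates the expectation in the bound.

```python
import math
import random

def log_normal_pdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

def iwae_bound(x, num_importance, num_outer, rng):
    """Estimate E[log (1/M) sum_m p(x, z_m) / q(z_m | x)] with M = num_importance."""
    total = 0.0
    for _ in range(num_outer):
        log_weights = []
        for _ in range(num_importance):
            z = rng.gauss(0.0, 1.0)                       # z ~ q(z|x) = N(0, 1)
            log_joint = log_normal_pdf(x, z, 1.0) + log_normal_pdf(z, 0.0, 1.0)
            log_q = log_normal_pdf(z, 0.0, 1.0)
            log_weights.append(log_joint - log_q)
        total += log_mean_exp(log_weights)
    return total / num_outer

rng = random.Random(0)
x = 2.0
bounds = {m: iwae_bound(x, m, 2000, rng) for m in (1, 5, 50)}
log_px = log_normal_pdf(x, 0.0, 2.0)   # exact for this toy model: x ~ N(0, 2)
for m, value in bounds.items():
    print("M =", m, "bound:", value)
print("exact log p(x):", log_px)
```

Working in log space with `log_mean_exp` avoids underflow when individual importance weights are very small.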

M samples: IWAE
Tighter than ELBO
log Σ: averaging
Bias-variance tradeoff
08

Training Deep Latent Models

Scaling latent variable models to high-dimensional data reveals practical challenges. Posterior collapse occurs when the learned q(z|x) approaches the prior p(z) for every x, driving the KL term to near zero. This typically happens when the decoder p(x|z) is powerful enough to model x without z, leaving the latent variables unused. The model ignores the bottleneck.

Addressing posterior collapse requires careful design: (1) use warm-up schedules that gradually increase the weight on the KL term, allowing reconstruction to stabilize first; (2) employ more expressive posteriors; (3) make z more useful to the decoder, for example by limiting decoder capacity so that it cannot model x without z. Another challenge is optimization: latent variable models have non-convex objectives with many local optima. Best practices include careful initialization, batch normalization, appropriate learning rates, and architectural choices that encourage information flow. Understanding these pitfalls is essential for successfully training deep generative models.
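
A KL warm-up schedule can be as simple as a linear ramp on the KL weight. The sketch below is one such schedule under assumed names (`kl_weight`, `annealed_negative_elbo`, `warmup_steps`); the linear shape and the final weight of 1.0 are illustrative choices, not a prescription from the course.

```python
def kl_weight(step, warmup_steps, max_weight=1.0):
    """Linear warm-up: weight rises from 0 to max_weight over warmup_steps."""
    if warmup_steps <= 0:
        return max_weight
    return max_weight * min(1.0, step / warmup_steps)

def annealed_negative_elbo(recon_log_lik, kl, step, warmup_steps):
    """Training loss: -E_q[log p(x|z)] + weight(step) * KL(q(z|x) || p(z))."""
    return -recon_log_lik + kl_weight(step, warmup_steps) * kl

# Example: the same (recon, KL) values are penalized differently over training.
for step in (0, 500, 1000, 2000):
    loss = annealed_negative_elbo(recon_log_lik=-10.0, kl=4.0, step=step, warmup_steps=1000)
    print("step", step, "weight", kl_weight(step, 1000), "loss", loss)
```

Early in training the loss is dominated by reconstruction, so the encoder has a chance to put information into z before the KL penalty reaches full strength. Holding the weight at a constant value other than 1 instead of ramping it is the β-VAE style of weighting.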

Posterior collapse
Warm-up schedule
β-VAE weighting
Stable training
09

References & Further Reading

This section establishes the mathematical foundations connecting maximum likelihood estimation, latent variable models, and variational inference. The references below guide deeper study of these core principles underlying modern generative models.

From mixture models and expectation-maximization to variational autoencoders, these techniques form the backbone of learning with unobserved structure.