MLE & Latent Variables

Stanford XCS236 · Deep Generative Models

Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) is a cornerstone of statistical learning. Given observed data x, we seek the parameters θ that maximize the probability the model assigns to that data: θ* = argmax_θ log p(x|θ). Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood, the loss that underlies many supervised learning objectives (cross-entropy, for example, is a negative log-likelihood).

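For complex models there is usually no closed-form solution, so we use gradient-based optimization: compute ∇_θ log p(x|θ) on minibatches and take stochastic gradient ascent steps on the log-likelihood (equivalently, stochastic gradient descent on the negative log-likelihood). The gradient points in the direction of steepest local increase in likelihood. This approach scales to high-dimensional parameter spaces and forms the basis of deep learning. Working with the log-likelihood is also preferred numerically: for independent data points the product of probabilities becomes a sum of log-probabilities, which avoids underflow.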
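As a minimal sketch of this procedure, the snippet below fits the mean of a unit-variance Gaussian by stochastic gradient descent on the negative log-likelihood and compares the result with the closed-form MLE (the sample mean). The synthetic data, the learning rate, the batch size, and the helper name `nll_grad` are illustrative assumptions, not part of the course material.

```python
import random

random.seed(0)
# Synthetic data: 500 draws from N(3, 1). The true mean 3.0 is an illustrative choice.
data = [random.gauss(3.0, 1.0) for _ in range(500)]

def nll_grad(mu, batch):
    """Gradient w.r.t. mu of the average negative log-likelihood under N(mu, 1)."""
    # NLL per point is 0.5 * (x - mu)^2 + const, so its derivative is (mu - x).
    return sum(mu - x for x in batch) / len(batch)

mu = 0.0   # initial guess
lr = 0.1   # learning rate (hypothetical setting)
for step in range(200):
    batch = random.sample(data, 32)   # a random minibatch makes the gradient stochastic
    mu -= lr * nll_grad(mu, batch)    # descent on NLL = ascent on log-likelihood

mle_closed_form = sum(data) / len(data)   # exact MLE of a Gaussian mean
print(f"SGD estimate: {mu:.3f}  closed-form MLE: {mle_closed_form:.3f}")
```

The SGD estimate fluctuates around the closed-form answer because each step sees only a minibatch; a smaller learning rate or a larger batch would shrink that noise.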

Key terms: log p(x|θ) (likelihood); ∇_θ (gradient flow); SGD (optimization); scalable to high dimensions.

Introducing Latent Variables

Many phenomena are naturally modeled with hidden variables z that generate the observed data x. A latent variable model specifies a prior p(z) and a conditional p(x|z), which together define the joint distribution p(x,z) = p(x|z)p(z). The key challenge is that we never observe z directly. To fit the model by maximum likelihood, we must marginalize z out: p(x) = ∫ p(x|z)p(z) dz. For complex models this integral is typically intractable.

Intractability arises because z may be high-dimensional or because the integral has no closed form, as when p(x|z) is parameterized by a neural network. MLE cannot be applied directly, since evaluating log p(x) requires this integral. This motivates variational inference: instead of exact inference, we approximate the posterior p(z|x) and optimize a lower bound on log p(x).
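To make the marginalization concrete, here is a small sketch on a toy model chosen so that the answer is known: z ~ N(0, 1) and x|z ~ N(z, 1), which gives p(x) = N(x; 0, 2) exactly. The code estimates the integral by sampling z from the prior and averaging p(x|z). The model, the observation `x = 1.5`, and the sample count are assumptions made for illustration.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

random.seed(0)
x = 1.5             # a single observed value (illustrative)
n_samples = 20000

# Naive Monte Carlo: p(x) = E_{z ~ p(z)}[p(x|z)], approximated by a sample average.
zs = [random.gauss(0.0, 1.0) for _ in range(n_samples)]
p_x_mc = sum(normal_pdf(x, z, 1.0) for z in zs) / n_samples

# For this toy model the marginal is available in closed form: N(x; 0, 1 + 1).
p_x_exact = normal_pdf(x, 0.0, 2.0)
print(f"Monte Carlo: {p_x_mc:.4f}  exact: {p_x_exact:.4f}")
```

In one dimension the naive estimate is accurate. In high dimensions, samples from the prior rarely land where p(x|z) is large, so the estimator's variance grows quickly, which is one practical face of the intractability described above.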

Key terms: p(x,z) (joint); ∫ p(x|z)p(z) dz (marginal); p(z|x) (posterior); intractable in general.

Gaussian Mixtures & EM

Gaussian mixture models (GMMs) exemplify latent variable models. A discrete latent z ∈ {1,...,K} selects which Gaussian component generates x. The joint is p(x,z) = p(x|z)p(z), where p(x|z) = N(x|μ_z, Σ_z) and p(z) is categorical. This factorization is natural: each component explains a distinct data mode.

The expectation-maximization (EM) algorithm fits GMMs by iterating two steps. The E-step computes the posterior p(z|x) under the current parameters; the M-step updates the parameters using the expected sufficient statistics under that posterior. EM is a special case of variational inference in which the E-step sets the variational distribution exactly equal to the true posterior, so the bound is tight and no iteration decreases the likelihood.
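The two steps can be sketched for a one-dimensional, two-component mixture as follows. This is a minimal illustration under stated assumptions: the synthetic data, the initial parameters, the variance floor, and the function names `em_step` and `log_likelihood` are all hypothetical choices, and no attempt is made at numerical robustness beyond that floor.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    """Sum over data points of log sum_k pi_k N(x | mu_k, var_k)."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )

def em_step(data, weights, means, variances):
    # E-step: responsibilities resp[n][k] = p(z = k | x_n) under the current parameters.
    resp = []
    for x in data:
        joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate each component from responsibility-weighted statistics.
    new_w, new_m, new_v = [], [], []
    for k in range(len(weights)):
        n_k = sum(r[k] for r in resp)
        mean_k = sum(r[k] * x for r, x in zip(resp, data)) / n_k
        var_k = sum(r[k] * (x - mean_k) ** 2 for r, x in zip(resp, data)) / n_k
        new_w.append(n_k / len(data))
        new_m.append(mean_k)
        new_v.append(max(var_k, 1e-6))   # floor guards against a collapsing component
    return new_w, new_m, new_v

random.seed(0)
# Synthetic data from two well-separated Gaussians (illustrative choice).
data = [random.gauss(-2.0, 0.7) for _ in range(200)] + [random.gauss(3.0, 1.0) for _ in range(200)]

weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
history = [log_likelihood(data, weights, means, variances)]
for _ in range(30):
    weights, means, variances = em_step(data, weights, means, variances)
    history.append(log_likelihood(data, weights, means, variances))

print("estimated means:", [round(m, 2) for m in means])
```

The list `history` records the log-likelihood after every iteration, so the monotonic increase can be checked directly by comparing consecutive entries.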

Key terms: mixture components; discrete latent z; EM algorithm; monotonic likelihood increase.

Evidence Lower Bound

The evidence lower bound (ELBO) provides a tractable objective for latent variable models. For any distribution q(z|x), the fact that KL(q(z|x) || p(z|x)) ≥ 0 implies: log p(x) ≥ E_q[log p(x,z)] - E_q[log q(z|x)]. The right-hand side is the ELBO. The gap between log p(x) and the ELBO is exactly KL(q(z|x) || p(z|x)), so the bound is tight when q(z|x) = p(z|x), and its tightness measures how well q approximates the true posterior.
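The inequality follows in three lines from the identity p(x) = p(x,z)/p(z|x), which holds for every z. The derivation below uses only the symbols already defined in this section.

```latex
\begin{align*}
\log p(x)
  &= \mathbb{E}_{q(z|x)}\!\left[\log \frac{p(x,z)}{p(z|x)}\right] \\
  &= \mathbb{E}_{q(z|x)}\!\left[\log \frac{p(x,z)}{q(z|x)}\right]
   + \mathbb{E}_{q(z|x)}\!\left[\log \frac{q(z|x)}{p(z|x)}\right] \\
  &= \underbrace{\mathbb{E}_{q}[\log p(x,z)] - \mathbb{E}_{q}[\log q(z|x)]}_{\text{ELBO}}
   + \underbrace{\mathrm{KL}\big(q(z|x)\,\|\,p(z|x)\big)}_{\ge 0}
\end{align*}
```

Dropping the non-negative KL term gives the bound, and the same line shows that the gap vanishes exactly when q(z|x) = p(z|x).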

Maximizing the ELBO maximizes a lower bound on the log-likelihood rather than the log-likelihood itself; the two coincide only when q matches the true posterior. The ELBO decomposes into two terms: the reconstruction term E_q[log p(x|z)] encourages q to place mass on z values that explain x, while the term E_q[log p(z)/q(z|x)] = -KL(q(z|x) || p(z)) pulls q toward the prior. Reconstruction plays the role of a data-fitting loss and the KL term the role of a regularizer, which gives a principled objective for training generative models.
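The decomposition can be checked numerically on a model small enough to enumerate. The sketch below uses a hypothetical discrete latent z with three states and one fixed observation; the probability tables are made-up numbers chosen only so that every quantity can be computed exactly.

```python
import math

# Hypothetical toy model: z takes values in {0, 1, 2}; x is one fixed observation.
prior = [0.5, 0.3, 0.2]            # p(z)
likelihood = [0.10, 0.40, 0.70]    # p(x | z) evaluated at the observed x
q = [0.2, 0.3, 0.5]                # an arbitrary variational distribution q(z | x)

# Exact marginal and posterior, available here because z is enumerable.
p_x = sum(pz * px for pz, px in zip(prior, likelihood))
posterior = [pz * px / p_x for pz, px in zip(prior, likelihood)]

# ELBO = reconstruction term - KL(q || prior).
reconstruction = sum(qz * math.log(px) for qz, px in zip(q, likelihood))
kl_to_prior = sum(qz * math.log(qz / pz) for qz, pz in zip(q, prior))
elbo = reconstruction - kl_to_prior

# The gap to log p(x) should equal KL(q || true posterior).
gap = sum(qz * math.log(qz / post) for qz, post in zip(q, posterior))

print(f"log p(x) = {math.log(p_x):.4f}")
print(f"ELBO     = {elbo:.4f}")
print(f"ELBO + KL(q || posterior) = {elbo + gap:.4f}")
```

Setting `q = posterior` drives `gap` to zero and makes the ELBO equal to log p(x), which is the tightness condition stated above.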

Key terms: ELBO (lower bound); KL gap (tightness); reconstruction + KL (decomposition); q(z|x) (approximate posterior).

Variational Inference

Variational inference replaces the intractable posterior p(z|x) with a tractable approximation q(z|x) and optimizes the ELBO over q. We restrict q to a family that keeps computation cheap, such as a fully factorized (diagonal-covariance) Gaussian. Because q takes the observed x as input, it can adapt to each data point. Sharing one parameterized mapping from x to the parameters of q across all data points is called amortization.

Amortization is powerful: instead of solving a separate optimization problem per data point, we train a single neural network (the encoder) that outputs the parameters of q. Inference at test time is then a single forward pass, which makes learning scalable. The variational objective becomes: maximize E_q[log p(x|z)] - KL(q(z|x) || p(z)). For a Gaussian q and a Gaussian prior the KL term has a closed form, and the reconstruction term can be estimated with samples from q, so the whole objective can be optimized with gradients.
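For the common case of a diagonal Gaussian q(z|x) = N(μ, diag(σ²)) and a standard normal prior, the KL term is 0.5 Σ (μ² + σ² - 1 - log σ²). The sketch below implements that formula and checks it against a Monte Carlo estimate of E_q[log q(z|x) - log p(z)]. The values of `mu` and `log_var` stand in for encoder outputs and are hypothetical, as are the function names.

```python
import math
import random

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def log_normal_diag(z, mu, log_var):
    """Log-density of a diagonal Gaussian, parameterized by log-variances."""
    return sum(
        -0.5 * (math.log(2 * math.pi) + lv + (zi - m) ** 2 / math.exp(lv))
        for zi, m, lv in zip(z, mu, log_var)
    )

random.seed(0)
mu, log_var = [0.5, -1.0], [0.0, -0.5]   # hypothetical encoder outputs for one x
kl_exact = kl_diag_gaussian_to_standard_normal(mu, log_var)

# Monte Carlo check: KL = E_q[log q(z|x) - log p(z)] with z drawn from q.
n = 20000
total = 0.0
for _ in range(n):
    z = [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
    total += log_normal_diag(z, mu, log_var) - log_normal_diag(z, [0.0, 0.0], [0.0, 0.0])
kl_mc = total / n

print(f"closed form: {kl_exact:.4f}  Monte Carlo: {kl_mc:.4f}")
```

Parameterizing the variance through its logarithm is a common convention because it keeps σ² positive without a constraint; it is a design choice of this sketch rather than something the text prescribes.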

Key terms: q(z|x) (variational); amortized inference; KL divergence minimization; neural network q-encoder.

Reparameterization Trick

A fundamental challenge in variational inference is backpropagating gradients through the sampling operation z ~ q(z|x). The reparameterization trick solves this by expressing sampling as a deterministic transformation of an auxiliary random variable. For a Gaussian q with mean μ and standard deviation σ, instead of sampling z ~ q directly, we write z = μ + σ⊙ε where ε ~ N(0,I). The randomness now lives in ε, which does not depend on the parameters, so gradients flow through the deterministic functions μ(x) and σ(x).

This trick enables end-to-end gradient flow through sampling. Losses like the ELBO become differentiable with respect to the parameters of q (μ, σ). The resulting gradient estimator typically has much lower variance than the score-function (REINFORCE) alternative, because it differentiates the integrand along a deterministic path. This is essential for training variational autoencoders (VAEs), where the encoder q and the decoder p are both neural networks optimized jointly by gradient descent.
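A small numerical sketch makes the variance claim tangible. It estimates the gradient of E[f(z)] with respect to μ for z ~ N(μ, σ²) in two ways: with the reparameterized sample z = μ + σε, and with the score-function estimator f(z) · ∂log q(z)/∂μ. The test function f(z) = z², for which the true gradient is 2μ, and the values of `mu` and `sigma` are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)
mu, sigma = 1.0, 0.5    # hypothetical variational parameters
n = 20000

reparam, score = [], []
for _ in range(n):
    eps = random.gauss(0.0, 1.0)
    z = mu + sigma * eps                  # reparameterized sample
    # Pathwise gradient: df/dz * dz/dmu = 2z * 1.
    reparam.append(2.0 * z)
    # Score-function gradient: f(z) * d/dmu log N(z; mu, sigma^2) = z^2 * (z - mu) / sigma^2.
    score.append(z ** 2 * (z - mu) / sigma ** 2)

true_grad = 2.0 * mu
print(f"true gradient:         {true_grad:.3f}")
print(f"reparameterized mean:  {statistics.fmean(reparam):.3f}  variance: {statistics.variance(reparam):.2f}")
print(f"score-function mean:   {statistics.fmean(score):.3f}  variance: {statistics.variance(score):.2f}")
```

Both estimators are unbiased, but in this example the per-sample variance of the score-function estimator is more than an order of magnitude larger. The size of that difference depends on f and on q, so it should be read as an illustration rather than a general guarantee.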

Key terms: z = μ + σ⊙ε (reparameterized); low-variance gradient; deterministic path; end-to-end differentiable.

Importance Weighting

Unless q equals the exact posterior, the ELBO lies strictly below log p(x). The importance-weighted autoencoder (IWAE) bound uses multiple samples from q to tighten it. With M independent samples z_1,...,z_M ~ q(z|x), the IWAE bound is: log p(x) ≥ E[log (1/M) Σ_{m=1}^{M} p(x,z_m)/q(z_m|x)]. As M→∞, the bound converges to log p(x) under mild conditions on the importance weights. With M=1, it reduces to the standard ELBO.

The IWAE bound is at least as tight as the ELBO for any M≥1 and never loosens as M grows, at a computational cost that is linear in M. This provides a principled way to trade computation for a tighter bound. There is a tradeoff, however: more samples reduce the bias of the bound, but the gradient signal reaching the inference network q can degrade as M grows, so a tighter bound does not always train a better encoder. In practice, modest M (5-50) often suffices. IWAE shows that the tightness of the bound can be tuned with computation rather than being a fixed property of the model.
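The effect of M can be seen on a toy model where log p(x) is known: z ~ N(0, 1) and x|z ~ N(z, 1), so p(x) = N(x; 0, 2). The sketch below deliberately uses a mismatched proposal q(z|x) = N(x/2, 1), which has the right mean but twice the true posterior variance, and estimates the bound for several M. The model, the proposal, the observation, and the repeat count are all illustrative assumptions.

```python
import math
import random

def log_normal(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def iwae_bound(x, num_samples, num_repeats, rng):
    """Monte Carlo estimate of E[log (1/M) sum_m p(x, z_m) / q(z_m | x)]."""
    q_mean, q_var = x / 2.0, 1.0   # mismatched proposal (true posterior variance is 0.5)
    total = 0.0
    for _ in range(num_repeats):
        log_w = []
        for _ in range(num_samples):
            z = rng.gauss(q_mean, math.sqrt(q_var))
            log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
            log_w.append(log_joint - log_normal(z, q_mean, q_var))
        total += log_mean_exp(log_w)
    return total / num_repeats

rng = random.Random(0)
x = 1.5
log_p_x = log_normal(x, 0.0, 2.0)   # exact log marginal for this toy model
bounds = {m: iwae_bound(x, m, 4000, rng) for m in (1, 5, 50)}

print(f"log p(x) = {log_p_x:.4f}")
for m, value in bounds.items():
    print(f"M = {m:2d}: bound = {value:.4f}")
```

With M = 1 the estimate is the ordinary ELBO, and the gap to log p(x) shrinks as M increases. Averaging the weights in log space with `log_mean_exp` avoids underflow when individual weights are very small.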

Key terms: M samples (IWAE); tighter than ELBO; log Σ (averaging); bias-variance tradeoff.

Training Deep Latent Models

Scaling latent variable models to high-dimensional data exposes practical challenges. Posterior collapse occurs when the learned q(z|x) matches the prior p(z) regardless of x, driving the KL term to nearly zero. It typically arises when the decoder p(x|z) is powerful enough to model x without using z, which leaves the latent variables unused. The model ignores the bottleneck.

Addressing posterior collapse requires careful design: (1) use warm-up schedules that gradually increase the weight on the KL term, allowing reconstruction to stabilize first; (2) employ more expressive posteriors; (3) strengthen the signal that z provides, so that the decoder cannot easily ignore it. A second challenge is optimization: latent variable models have non-convex objectives with many local optima. Best practices include careful initialization, batch normalization, appropriate learning rates, and architectural choices that encourage information flow. Understanding these pitfalls is essential for successfully training deep generative models.
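A warm-up schedule of the kind described in (1) can be as simple as a linear ramp on the KL weight. The sketch below is one possible form; the function names, the default of 10,000 warm-up steps, and the final weight `beta_max` are hypothetical settings, not values taken from the course.

```python
def kl_weight(step, warmup_steps=10_000, beta_max=1.0):
    """Linear KL warm-up: 0 at step 0, rising to beta_max at warmup_steps, then constant."""
    if warmup_steps <= 0:
        return beta_max
    return beta_max * min(1.0, step / warmup_steps)

def weighted_negative_elbo(reconstruction_log_lik, kl, step):
    """Loss to minimize: -E_q[log p(x|z)] + beta(step) * KL(q(z|x) || p(z))."""
    return -reconstruction_log_lik + kl_weight(step) * kl

# Example: the same batch statistics are penalized differently as training progresses.
for step in (0, 2_500, 5_000, 10_000, 20_000):
    loss = weighted_negative_elbo(reconstruction_log_lik=-100.0, kl=5.0, step=step)
    print(f"step {step:6d}  beta = {kl_weight(step):.2f}  loss = {loss:.2f}")
```

Once the weight reaches 1 the loss is the ordinary negative ELBO. Holding the weight at a fixed value other than 1 instead of ramping it gives the β-VAE style of weighting.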

Key terms: posterior collapse; warm-up schedule; β-VAE weighting; stable training.

References & Further Reading

This section establishes the mathematical foundations connecting maximum likelihood estimation, latent variable models, and variational inference. The references below guide deeper study of these core principles underlying modern generative models.

From mixture models and expectation-maximization to variational autoencoders, these techniques form the backbone of learning with unobserved structure.
