Maximum likelihood estimation (MLE) seeks parameters θ that maximize the probability assigned to observed data D = {x_1,...,x_N}. Assuming the x_i are i.i.d., the objective is L(θ) = Π_i p(x_i|θ), or equivalently (and more numerically stable) log L(θ) = Σ_i log p(x_i|θ). This is the log-likelihood.
For differentiable models, we compute the gradient ∇_θ log p(x|θ) and use gradient ascent (or descent on negative log-likelihood). For large datasets, stochastic gradient descent approximates the full gradient using minibatches, so the cost of each update does not grow with N; this is what makes optimization practical for very large datasets and models with billions of parameters.
Gradient-Based Optimization
At each step: θ ← θ + α ∇_θ log p(x|θ). The step size α (learning rate) controls update magnitude. Adaptive methods (Adam, RMSprop) scale learning rates per parameter, improving convergence. Modern deep learning frameworks (PyTorch, TensorFlow) automatically differentiate through complex computational graphs, enabling optimization of massive models.
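As a minimal sketch of the update rule above, the following fits a Bernoulli model by gradient ascent on the average log-likelihood. The logit parameterization, the toy data, the learning rate, and the name `fit_bernoulli_logit` are illustrative assumptions, not taken from any framework.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_bernoulli_logit(data, lr=0.5, steps=2000):
    """Gradient ascent on the mean log-likelihood of a Bernoulli model.

    The model is p(x=1 | theta) = sigmoid(theta); the gradient of the mean
    log-likelihood with respect to theta is mean(x_i - sigmoid(theta)).
    """
    theta = 0.0
    n = len(data)
    for _ in range(steps):
        p = sigmoid(theta)
        grad = sum(x - p for x in data) / n
        theta += lr * grad  # ascent step: theta <- theta + alpha * gradient
    return theta

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical observations
theta_hat = fit_bernoulli_logit(data)
p_hat = sigmoid(theta_hat)
```

For this model the MLE has a closed form (the sample mean of the data), which gives a simple check that the iterates end up in the right place.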
Why Log-Likelihood
Taking logs turns the product Π_i p(x_i|θ) into the sum Σ_i log p(x_i|θ), avoiding numerical underflow when p values are small. Mathematically, log is a strictly increasing transformation, so maximizing log L(θ) is equivalent to maximizing L(θ): both have the same maximizer. This simplification is essential: multiplying many small probabilities produces values that underflow to zero in floating point.
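The underflow problem is easy to reproduce. The sketch below uses made-up per-example likelihoods of 1e-5 for 100 data points; the values are assumptions chosen only to trigger underflow in double precision.

```python
import math

probs = [1e-5] * 100  # hypothetical per-example likelihoods p(x_i | theta)

# Direct product: the true value 1e-500 is below the smallest positive double.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: the same quantity on a log scale, comfortably representable.
log_likelihood = sum(math.log(p) for p in probs)
```

Here `product` is exactly `0.0`, so taking its log would fail, while `log_likelihood` remains an ordinary finite number.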
Connection to Deep Learning
Supervised learning (e.g., classification) fits p(y|x;θ) via MLE. With one-hot labels, the cross-entropy loss in classification is precisely -log p(y|x;θ), so minimizing cross-entropy is maximum likelihood; this is why MLE underlies so much of modern machine learning. Extending MLE to unsupervised and generative settings requires handling unobserved variables, which is the domain of latent variable models.
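To make the identity concrete, here is a small sketch that computes cross-entropy against a one-hot target and compares it with the negative log-probability of the true class. The logits are arbitrary illustrative numbers and the function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def negative_log_likelihood(logits, label):
    """-log p(y = label | x) under a softmax classifier."""
    return -math.log(softmax(logits)[label])

def cross_entropy_one_hot(logits, one_hot):
    """Cross-entropy H(t, p) = -sum_k t_k log p_k for a one-hot target t."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

logits = [2.0, 0.5, -1.0]  # hypothetical model outputs for one input x
nll = negative_log_likelihood(logits, 0)
ce = cross_entropy_one_hot(logits, [1, 0, 0])
```

The two values agree because every term of the cross-entropy sum except the true class is multiplied by zero.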
A latent variable z is a hidden factor we never directly observe but assume influences x. Examples: the semantic content (topic) of a document, the underlying emotion in an image, the latent factors in matrix factorization. The joint distribution p(x,z) factorizes as p(x|z)p(z), where p(x|z) is the likelihood and p(z) is the prior.
To fit a latent variable model via MLE, we must marginalize out z: p(x) = ∫ p(x|z)p(z) dz. This marginal likelihood represents the probability of x under the model, averaging over all possible explanations z. For continuous z or discrete z with many states, this integral is often intractable.
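As an illustration of the marginalization, the sketch below estimates p(x) = ∫ p(x|z)p(z) dz by simple Monte Carlo, averaging p(x|z) over samples from the prior. The model (z ~ N(0,1), x|z ~ N(z,1)) is an assumption chosen because its marginal N(x; 0, 2) is known in closed form, so the estimate can be checked.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mc_marginal_likelihood(x, num_samples, rng):
    """Estimate p(x) = E_{p(z)}[p(x|z)] by sampling z from the prior."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)          # z ~ p(z) = N(0, 1)
        total += normal_pdf(x, z, 1.0)   # p(x | z) = N(x; z, 1)
    return total / num_samples

rng = random.Random(0)
x_obs = 1.0
estimate = mc_marginal_likelihood(x_obs, 100_000, rng)
exact = normal_pdf(x_obs, 0.0, 2.0)      # closed form for this toy model
```

This works in one dimension, but sampling from the prior scales poorly: in high dimensions very few prior samples land where p(x|z) is non-negligible, which is one way to see why the integral is regarded as intractable.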
Intractability and Its Implications
Common scenarios: (1) z is high-dimensional, making the integral multidimensional and expensive; (2) p(x|z) and p(z) are neural networks, so no closed form exists; (3) even simple models can yield intractable integrals (e.g., mixture of non-conjugate distributions). Unable to compute log p(x), we cannot directly optimize via MLE.
Graphical Model View
The latent variable structure is often depicted as a directed graphical model: p(z) → z → p(x|z) → x. This causal interpretation is intuitive: z is generated first (from prior), then x is generated conditioned on z. This structure appears in Bayes nets, Markov chains, and structural causal models.
Why Latent Variables Matter
Latent variables enable compact representations of complex distributions. A simple prior (e.g., standard normal) combined with a learned decoder can represent multimodal, high-dimensional distributions. Latent models also facilitate interpretability: z often captures semantically meaningful factors. This is why they're central to generative modeling and unsupervised learning.
A Gaussian mixture model (GMM) models p(x) as a weighted sum of K Gaussian components. Latent z ∈ {1,...,K} indicates which component generated x. The joint is p(x,z) = p(z)p(x|z) where p(z=k) = π_k and p(x|z=k) = N(x|μ_k, Σ_k). Marginalizing: p(x) = Σ_k π_k N(x|μ_k, Σ_k).
GMMs are interpretable (each component is a Gaussian) yet flexible (mixing allows non-Gaussian marginals). Fitting GMMs via direct MLE requires optimizing Σ_i log Σ_k π_k N(x_i|μ_k, Σ_k), a non-convex problem in which the log of a sum prevents the objective from decomposing per component. The expectation-maximization (EM) algorithm provides a principled alternative.
The EM Algorithm
EM alternates two steps, both tractable: (1) E-step: compute γ_{ik} = p(z_i=k|x_i) (responsibility of component k for x_i) using current parameters; (2) M-step: update π_k, μ_k, Σ_k using weighted data where x_i has weight γ_{ik} in component k. Intuitively, E-step "soft-assigns" data to components, M-step re-estimates component parameters.
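The two steps can be sketched for a one-dimensional, two-component mixture as follows. The synthetic data, the initial values, the variance floor, and all function names are assumptions made for illustration; this is not a production implementation.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(xs, pis, mus, variances):
    return sum(
        math.log(sum(p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)))
        for x in xs
    )

def em_step(xs, pis, mus, variances):
    K = len(pis)
    # E-step: responsibilities gamma[i][k] = p(z_i = k | x_i) under current parameters.
    gamma = []
    for x in xs:
        w = [pis[k] * normal_pdf(x, mus[k], variances[k]) for k in range(K)]
        total = sum(w)
        gamma.append([wk / total for wk in w])
    # M-step: weighted maximum-likelihood updates for each component.
    new_pis, new_mus, new_vars = [], [], []
    for k in range(K):
        n_k = sum(g[k] for g in gamma)
        mu_k = sum(g[k] * x for g, x in zip(gamma, xs)) / n_k
        var_k = sum(g[k] * (x - mu_k) ** 2 for g, x in zip(gamma, xs)) / n_k
        new_pis.append(n_k / len(xs))
        new_mus.append(mu_k)
        new_vars.append(max(var_k, 1e-6))  # assumed floor to avoid degenerate components
    return new_pis, new_mus, new_vars

rng = random.Random(0)
xs = [rng.gauss(-2.0, 0.5) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]
pis, mus, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
history = [log_likelihood(xs, pis, mus, variances)]
for _ in range(50):
    pis, mus, variances = em_step(xs, pis, mus, variances)
    history.append(log_likelihood(xs, pis, mus, variances))
```

Tracking `history` is a cheap sanity check: the log-likelihood should never decrease from one iteration to the next, and a drop usually signals a bug in one of the two steps.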
EM as Variational Inference
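EM optimizes the ELBO (to be defined later) by coordinate ascent with an unrestricted variational family: q(z_i) is a full categorical distribution over the K components, whose probabilities are the soft responsibilities γ_{ik}. The E-step sets q to the true posterior p(z_i|x_i), which makes the bound tight; the M-step maximizes the ELBO over parameters with q held fixed. Restricting q to one-hot (hard) assignments instead gives hard EM, of which k-means is the classic example. This makes EM a special case of variational EM, revealing connections between classic and modern methods.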
Guarantees and Limitations
EM guarantees that the likelihood never decreases: L(θ^{t+1}) ≥ L(θ^t). However, it converges only to a stationary point, typically a local optimum. Initialization matters: poor starting points yield suboptimal solutions, so multiple restarts are common. EM also requires knowing K a priori. Model selection (choosing K) typically uses information criteria (BIC, AIC) or cross-validation.
The ELBO is the heart of modern generative modeling. Start with any distribution q(z|x). By Bayes' rule, p(z|x) = p(x|z)p(z) / p(x) = p(x,z) / p(x). Taking logs and rearranging: log p(x) = log p(x,z) - log p(z|x), which holds for every z. Take the expectation under q(z|x) (the left side does not depend on z) and add zero in the form E_q[log q(z|x) - log q(z|x)]: log p(x) = E_q[log p(x,z)] - E_q[log p(z|x)] + E_q[log q(z|x) - log q(z|x)].
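Regrouping terms: log p(x) = E_q[log p(x,z) - log q(z|x)] + KL(q(z|x) || p(z|x)). Since KL ≥ 0, we have the ELBO: log p(x) ≥ E_q[log p(x,z) - log q(z|x)] := ELBO(q). Equivalently, using p(x,z) = p(x|z)p(z), ELBO = E_q[log p(x|z)] + E_q[log p(z) - log q(z|x)] = E_q[log p(x|z)] - KL(q(z|x) || p(z)) = Recon - KL. Note the two different KL terms: the gap to log p(x) involves the true posterior p(z|x), while the ELBO itself involves only the prior p(z).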
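For reference, the same derivation written out as a displayed calculation. The notation follows the text; the underbrace labels are added here only for readability.

```latex
\begin{align*}
\log p(x)
  &= \mathbb{E}_{q(z|x)}\big[\log p(x,z) - \log p(z|x)\big] \\
  &= \mathbb{E}_{q(z|x)}\big[\log p(x,z) - \log q(z|x)\big]
   + \mathbb{E}_{q(z|x)}\big[\log q(z|x) - \log p(z|x)\big] \\
  &= \underbrace{\mathbb{E}_{q(z|x)}\big[\log p(x,z) - \log q(z|x)\big]}_{\mathrm{ELBO}(q)}
   + \underbrace{\mathrm{KL}\big(q(z|x)\,\|\,p(z|x)\big)}_{\ge 0},
\\[6pt]
\mathrm{ELBO}(q)
  &= \mathbb{E}_{q(z|x)}\big[\log p(x|z)\big]
   + \mathbb{E}_{q(z|x)}\big[\log p(z) - \log q(z|x)\big] \\
  &= \underbrace{\mathbb{E}_{q(z|x)}\big[\log p(x|z)\big]}_{\text{reconstruction}}
   - \underbrace{\mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)}_{\text{regularization}}.
\end{align*}
```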
Decomposition: Reconstruction + Regularization
The reconstruction term E_q[log p(x|z)] measures how well z explains x. High reconstruction means placing mass q(z|x) on z values for which p(x|z) is large. The KL term penalizes divergence from the prior: large KL means q(z|x) deviates from p(z), which is costly. This decomposition mirrors classic ML: supervised loss (reconstruction) + regularization (KL).
Tightness of the ELBO
The gap between log p(x) and ELBO is the KL divergence: log p(x) - ELBO = KL(q || p(z|x)). When q(z|x) = p(z|x) (true posterior), the ELBO is tight (gap = 0). Otherwise, the gap reflects how poorly q approximates the true posterior. Tightness motivates: (1) expressive variational families for q, (2) importance weighting for tighter bounds, (3) iterative refinement of q.
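The identity log p(x) - ELBO = KL(q || p(z|x)) can be verified numerically when z is discrete and everything is computable exactly. The three-state prior, the likelihood values for a single fixed observation, and the choice of q below are arbitrary illustrative numbers.

```python
import math

prior = [0.5, 0.3, 0.2]          # p(z = k), hypothetical
likelihood = [0.10, 0.40, 0.70]  # p(x | z = k) for one fixed observation x

joint = [pz * px for pz, px in zip(prior, likelihood)]   # p(x, z = k)
evidence = sum(joint)                                    # p(x)
posterior = [j / evidence for j in joint]                # p(z = k | x)

def elbo(q):
    """E_q[log p(x, z) - log q(z)] for a discrete q with full support."""
    return sum(qk * (math.log(jk) - math.log(qk)) for qk, jk in zip(q, joint))

def kl(a, b):
    return sum(ak * math.log(ak / bk) for ak, bk in zip(a, b))

q = [0.2, 0.5, 0.3]              # an arbitrary approximate posterior
gap = math.log(evidence) - elbo(q)
kl_to_posterior = kl(q, posterior)
elbo_at_posterior = elbo(posterior)
```

`gap` and `kl_to_posterior` coincide up to rounding, and `elbo_at_posterior` equals `math.log(evidence)`, which is the tightness statement above.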
Optimization as Likelihood Maximization
Optimizing ELBO over q and parameters θ of p(x|z) and p(z) increases a lower bound on log p(x). While not guaranteed to reach the global optimum (the ELBO may have local optima), each step makes progress on a principled objective. This is the foundation of variational autoencoders and modern generative models.
Variational inference replaces the true posterior p(z|x) with a tractable approximation q(z|x). We restrict q to a family Q (e.g., diagonal Gaussian, factorized, mixture) ensuring both computational efficiency and differentiability. The goal: find q* ∈ Q minimizing KL(q || p(z|x)), or equivalently, maximizing ELBO_q.
Unlike EM (which solves a separate posterior for each x), we parameterize q as a function of x: q(z|x; ϕ) where ϕ are neural network weights. This amortization means a single encoder produces posteriors for all data points. Training the encoder (and decoder) jointly via ELBO optimization is scalable and enables end-to-end learning.
Amortized Inference
Amortization trades (1) some accuracy (for a given x, the encoder's output can be worse than a q optimized for that x alone) for (2) computational gain (a single forward pass instead of per-example optimization) and (3) generalization (the encoder learns patterns applicable to new data). For most applications, this tradeoff is favorable: per-example optimization is expensive and must be rerun for every new data point, whereas a learned encoder handles unseen inputs immediately.
Variational Families
Common choices: (1) Mean-field: q(z|x) = ∏_d q_d(z_d|x), each dimension independent; (2) Gaussian: q(z|x) = N(z|μ(x), diag(σ(x)^2)), with neural network μ, σ; (3) Hierarchical: stacked layers of random variables; (4) Autoregressive: q(z_1|x) q(z_2|z_1,x) ... More expressive families better approximate true posteriors but incur higher computational cost.
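For the diagonal Gaussian family with a standard normal prior, the KL term of the ELBO has a well-known closed form, KL = 0.5 Σ_d (σ_d^2 + μ_d^2 - 1 - log σ_d^2). The sketch below computes it and cross-checks it against a Monte Carlo estimate; the standard normal prior, the log-variance parameterization, and the example numbers are assumptions.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)

def log_normal(x, mean, var):
    return -0.5 * (LOG_2PI + math.log(var) + (x - mean) ** 2 / var)

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def kl_monte_carlo(mu, log_var, num_samples, rng):
    """Estimate E_q[log q(z) - log p(z)] by sampling z ~ q, dimension by dimension."""
    total = 0.0
    for _ in range(num_samples):
        for m, lv in zip(mu, log_var):
            var = math.exp(lv)
            z = m + math.sqrt(var) * rng.gauss(0.0, 1.0)
            total += log_normal(z, m, var) - log_normal(z, 0.0, 1.0)
    return total / num_samples

mu, log_var = [0.5, -1.0], [math.log(0.25), 0.0]   # hypothetical encoder outputs
closed_form = kl_to_standard_normal(mu, log_var)
estimate = kl_monte_carlo(mu, log_var, 50_000, random.Random(0))
```

Having the KL in closed form means only the reconstruction term of the ELBO needs to be estimated by sampling, which is one practical reason this family is a common default.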
Connection to Posterior Approximation
The KL objective is asymmetric. KL(q || p) places high penalty when q has mass where p has low mass (mode-seeking), while KL(p || q) penalizes missing modes (mode-covering). Modern VI uses KL(q || p), so q tends to underestimate posterior variance (an overconfident approximation) and can miss modes entirely. Importance weighting and other techniques mitigate this.
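A small discrete example makes the asymmetry visible. The bimodal target p and the two candidate approximations below are made-up distributions over five bins, chosen so that one candidate covers a single mode and the other spreads mass evenly.

```python
import math

def kl(a, b):
    """KL(a || b) for discrete distributions with b > 0 wherever a > 0."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)

p = [0.49, 0.005, 0.01, 0.005, 0.49]      # bimodal target (hypothetical)
q_one_mode = [0.96, 0.01, 0.01, 0.01, 0.01]  # concentrates on the first mode
q_spread = [0.2, 0.2, 0.2, 0.2, 0.2]         # covers both modes, and the valley

# Reverse KL, KL(q || p): the objective used in variational inference.
reverse_one_mode = kl(q_one_mode, p)
reverse_spread = kl(q_spread, p)

# Forward KL, KL(p || q): penalizes q for missing mass that p has.
forward_one_mode = kl(p, q_one_mode)
forward_spread = kl(p, q_spread)
```

Under the reverse KL the single-mode candidate scores better, while under the forward KL the spread-out candidate does, matching the mode-seeking and mode-covering descriptions.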
The core challenge: gradients cannot flow through sampling operations z ~ q. Backward differentiation requires z = f(x; ϕ) to be deterministic in ϕ. The reparameterization trick solves this: express sampling as a deterministic transformation of an auxiliary random variable.
For Gaussian q(z|x) = N(μ(x), σ(x)^2), instead of sampling z directly, sample ε ~ N(0,I) and set z = μ(x) + σ(x) ⊙ ε. Now z depends deterministically on μ, σ through the affine transformation. Gradients ∂z/∂μ = 1 and ∂z/∂σ = ε flow freely through the network to update encoder parameters.
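A minimal sketch of the trick on a toy objective: estimate the gradient of E_q[f(z)] for f(z) = z^2 with respect to μ and σ by differentiating through z = μ + σε. The test function f is an assumption, chosen because E[z^2] = μ^2 + σ^2 gives exact gradients 2μ and 2σ to compare against.

```python
import random

def pathwise_grads(mu, sigma, num_samples, rng):
    """Reparameterized gradient estimates of E[f(z)], f(z) = z^2, z ~ N(mu, sigma^2)."""
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # auxiliary noise, independent of mu and sigma
        z = mu + sigma * eps        # deterministic in (mu, sigma) given eps
        df_dz = 2.0 * z             # derivative of f at the sampled z
        grad_mu += df_dz * 1.0      # chain rule with dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule with dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

rng = random.Random(0)
mu, sigma = 1.5, 0.8
grad_mu, grad_sigma = pathwise_grads(mu, sigma, 100_000, rng)
exact_grad_mu, exact_grad_sigma = 2.0 * mu, 2.0 * sigma
```

In an autodiff framework the chain-rule lines are handled automatically once the sample is written as μ + σε; they are spelled out here only to show where ∂z/∂μ and ∂z/∂σ enter.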
Gradient Estimator and Variance
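The ELBO gradient becomes ∇_ϕ ELBO = E_ε[∇_ϕ log p(x, f(x,ε;ϕ)) - ∇_ϕ log q(f(x,ε;ϕ)|x;ϕ)]. This pathwise gradient estimator (not to be confused with the score function estimator; Gumbel-Softmax extends the pathwise idea to discrete z through a continuous relaxation) typically has low variance because it uses the derivative of the integrand through the transformation. Contrast with score function estimators (REINFORCE), which rely on the identity ∇_ϕ E_q[f(z)] = E_q[f(z) ∇_ϕ log q(z|x;ϕ)] and need only evaluations of f, not its derivative; these typically have high variance and are usually paired with control variates (baselines).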
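The variance difference between the two estimators can be observed directly on a toy problem. The sketch below estimates ∂/∂μ E[z^2] for z ~ N(μ, σ^2) with both estimators; the objective, the parameter values, and the helper names are illustrative assumptions.

```python
import random

def gradient_samples(mu, sigma, num_samples, rng):
    """Per-sample estimates of d/dmu E[z^2], z ~ N(mu, sigma^2), from two estimators."""
    pathwise, score = [], []
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        pathwise.append(2.0 * z)                      # f'(z) * dz/dmu
        score.append(z * z * (z - mu) / sigma ** 2)   # f(z) * d/dmu log q(z)
    return pathwise, score

def mean_and_variance(values):
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, var

rng = random.Random(0)
pathwise, score = gradient_samples(1.5, 0.8, 100_000, rng)
mean_path, var_path = mean_and_variance(pathwise)
mean_score, var_score = mean_and_variance(score)
```

Both estimators are unbiased for the true gradient 2μ, but on this example the per-sample variance of the score function estimator is substantially larger, so it needs many more samples for the same accuracy.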
Beyond Gaussians
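The trick extends to other distributions: (1) Beta: z = F^{-1}(u) where u ~ Uniform(0,1) and F is the beta CDF (its inverse has no closed form, so it must be approximated numerically); (2) Exponential: z = -log(u)/λ; (3) Discrete (Gumbel-Softmax): add Gumbel noise to the logits and apply a temperature-controlled softmax, giving a continuous relaxation of a one-hot sample. An argmax recovers a discrete value but is not differentiable, so it is used only in the forward pass (straight-through variants) or at evaluation time. Even when the inverse CDF is expensive, inverse transform sampling with numerical approximation often works.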
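Two of these constructions are short enough to sketch with the standard library: inverse transform sampling for the exponential distribution and a Gumbel-Softmax relaxed sample. The rate, class probabilities, temperature, and clamping constant are illustrative assumptions.

```python
import math
import random

def sample_exponential(rate, rng):
    u = 1.0 - rng.random()            # u in (0, 1], so log(u) is always defined
    return -math.log(u) / rate        # inverse CDF of Exponential(rate)

def sample_gumbel(rng):
    # Clamp u away from 0 and 1 so both logs are finite (assumed safeguard).
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = [(l + sample_gumbel(rng)) / temperature for l in logits]
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
exp_mean = sum(sample_exponential(2.0, rng) for _ in range(50_000)) / 50_000

probs = [0.2, 0.3, 0.5]
logits = [math.log(p) for p in probs]
relaxed = gumbel_softmax(logits, temperature=0.5, rng=rng)

# The argmax of a relaxed sample follows the original categorical distribution.
counts = [0, 0, 0]
for _ in range(20_000):
    y = gumbel_softmax(logits, temperature=0.5, rng=rng)
    counts[y.index(max(y))] += 1
freqs = [c / 20_000 for c in counts]
```

Lower temperatures push `relaxed` closer to a one-hot vector at the price of larger gradient variance, so the temperature is usually treated as a tuning knob.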
Impact on Deep Generative Models
The reparameterization trick is essential for training VAEs, where encoder and decoder are jointly optimized via gradient descent. Without it, backprop cannot traverse encoder → sample → decoder path. This enables end-to-end learning of complex generative models, from pixel-level VAEs to hierarchical probabilistic models. The trick's simplicity and effectiveness made modern variational generative modeling practical.
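The standard ELBO is typically estimated with a single sample from q: ELBO ≈ log p(x, z) - log q(z|x) where z ~ q. This estimate is noisy. Importance weighting (IWAE) tightens the bound itself using multiple samples z_1,...,z_M ~ q(z|x). The bound is: log p(x) ≥ E_{z_1,...,z_M ~ q}[log (1/M) Σ_m p(x,z_m) / q(z_m|x)]. Note that the log sits inside the expectation; with the log outside, the right-hand side would equal log p(x) exactly.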
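This bound is at least as tight as the standard ELBO for any M ≥ 1 (the two coincide at M = 1) and is non-decreasing in M. Mathematically, let w_m = p(x,z_m)/q(z_m|x), so that E_q[w_m] = p(x). By Jensen's inequality, log p(x) = log E[Σ_m w_m / M] ≥ E[log(Σ_m w_m / M)]. As M→∞, the bound approaches log p(x) (it becomes tight), provided the importance weights are well behaved (e.g., bounded). The IWAE bound is thus a rigorous way to trade computation for tightness.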
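The effect of M can be seen on a toy model where log p(x) is known. The sketch below assumes z ~ N(0,1), x|z ~ N(z,1), and a deliberately poor proposal q(z|x) = N(0,1) that ignores x; it averages the M-sample bound over many replicates. All of these choices are illustrative.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)

def log_normal(x, mean, var):
    return -0.5 * (LOG_2PI + math.log(var) + (x - mean) ** 2 / var)

def iwae_bound(x, num_importance_samples, num_replicates, rng):
    """Monte Carlo estimate of E[log (1/M) sum_m w_m] with w_m = p(x, z_m) / q(z_m | x)."""
    total = 0.0
    for _ in range(num_replicates):
        weight_sum = 0.0
        for _ in range(num_importance_samples):
            z = rng.gauss(0.0, 1.0)                           # z ~ q(z|x) = N(0, 1)
            log_w = (log_normal(z, 0.0, 1.0)                  # log p(z)
                     + log_normal(x, z, 1.0)                  # log p(x|z)
                     - log_normal(z, 0.0, 1.0))               # log q(z|x)
            weight_sum += math.exp(log_w)  # safe here; real models need log-sum-exp
        total += math.log(weight_sum / num_importance_samples)
    return total / num_replicates

rng = random.Random(0)
x_obs = 2.0
bounds = {m: iwae_bound(x_obs, m, 4000, rng) for m in (1, 5, 50)}
log_px = log_normal(x_obs, 0.0, 2.0)       # exact: p(x) = N(x; 0, 2) for this model
elbo_exact = -0.5 * LOG_2PI - 2.5          # E_q[log p(x|z)] with q equal to the prior
```

The estimated bounds increase with M and approach `log_px` from below, while the M = 1 value matches the ordinary ELBO.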
Bias and Variance Tradeoff
With M=1, the bound has a gap to the true log p(x), which acts as a bias when the bound is used as an estimate of log p(x). Increasing M shrinks this gap monotonically. However, a tighter bound does not automatically give better gradients: the signal-to-noise ratio of the encoder's gradient estimate can degrade as M grows. In practice, modest M (5-50) often balances the two: enough samples to substantially tighten the bound without degrading the gradient signal or the compute budget too much.
Implementation Considerations
The importance weights are computed in log space, log w_m = log p(x,z_m) - log q(z_m|x), so the bound is evaluated as log Σ_m exp(log w_m) - log M. Computing this numerically requires the log-sum-exp trick to avoid overflow and underflow. Gradients through the IWAE bound use pathwise estimators (via reparameterization), making the bound differentiable. Modern implementations parallelize over samples for efficiency. Some work suggests β-weighted combinations (ELBO vs IWAE) during training can accelerate learning.
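A minimal sketch of the log-sum-exp trick applied to log-weights: subtract the maximum before exponentiating, then add it back. The function names and the example log-weights are illustrative; the inputs are assumed to be finite.

```python
import math

def log_mean_exp(log_weights):
    """Stable log((1/M) * sum_m exp(log_w_m)) for finite log-weights."""
    m = max(log_weights)
    shifted_sum = sum(math.exp(lw - m) for lw in log_weights)  # largest term is exp(0) = 1
    return m + math.log(shifted_sum) - math.log(len(log_weights))

def naive_log_mean_exp(log_weights):
    return math.log(sum(math.exp(lw) for lw in log_weights) / len(log_weights))

log_ws = [-1000.0, -1001.0, -1002.0]   # typical magnitudes for high-dimensional x
stable = log_mean_exp(log_ws)
try:
    naive = naive_log_mean_exp(log_ws)
except ValueError:
    naive = None   # every exp() underflowed to 0.0, so log(0.0) raised
```

Frameworks ship this as a built-in reduction; writing it by hand is mostly useful for understanding why the shift by the maximum leaves the result unchanged.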
Theoretical Significance
IWAE reveals that the tightness of a variational bound is not fixed by the choice of q alone: it can be bought with computation, and the bound can be made arbitrarily tight by sampling more. This motivates research into other tight bounds (ladder VAEs, Hamiltonian VI) and demonstrates that the ELBO, while useful, is just one point on a spectrum of bounds. This perspective has driven recent advances in variational generative modeling.
Scaling to high-dimensional, high-capacity models (e.g., deep convolutional VAE on ImageNet) exposes practical challenges. Posterior collapse is the most notorious: the learned encoder q(z|x) converges to the prior p(z), rendering latents unused. This occurs when the decoder becomes powerful enough to model x without information from z, so the optimizer can drive the KL term to zero without hurting the reconstruction term.
Posterior collapse is problematic because: (1) latents carry no information about x; (2) sampling z from the prior p(z) has no influence on the generated output; (3) the model degenerates into an unconditional decoder that ignores z, rather than a useful latent variable model. Diagnosis: monitor KL(q(z|x) || p(z)) during training; if it drops to near zero early, suspect collapse. Remedies: (1) weight the KL term by a factor β < 1 (a β-VAE-style objective), which weakens the pull toward the prior; (2) warm-up schedules gradually increase the KL weight; (3) use more expressive posteriors; (4) free bits stop penalizing the KL once it falls below a threshold, so the optimizer has no incentive to push it to zero.
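The diagnosis and the free-bits remedy can be sketched for a diagonal Gaussian encoder with a standard normal prior. The per-dimension, batch-averaged form of free bits shown here is one common variant; the thresholds, the toy encoder outputs, and the function names are assumptions.

```python
import math

def per_dim_kl(mu, log_var):
    """KL(N(mu_d, exp(log_var_d)) || N(0, 1)) for each latent dimension d."""
    return [0.5 * (math.exp(lv) + m * m - 1.0 - lv) for m, lv in zip(mu, log_var)]

def batch_mean_kl(batch):
    """Average per-dimension KL over a batch of (mu, log_var) encoder outputs."""
    dims = len(batch[0][0])
    sums = [0.0] * dims
    for mu, log_var in batch:
        for d, kl_d in enumerate(per_dim_kl(mu, log_var)):
            sums[d] += kl_d
    return [s / len(batch) for s in sums]

def collapsed_dims(mean_kl, threshold=0.01):
    """Dimensions whose average KL is near zero, i.e. q matches the prior."""
    return [d for d, kl_d in enumerate(mean_kl) if kl_d < threshold]

def free_bits_kl(mean_kl, free_bits=0.1):
    """KL penalty that is constant (gradient-free) below `free_bits` nats per dimension."""
    return sum(max(free_bits, kl_d) for kl_d in mean_kl)

# Hypothetical encoder outputs: dimension 0 is informative, dimension 1 has collapsed.
batch = [([1.2, 0.0], [-2.0, 0.0]), ([-0.8, 0.0], [-2.0, 0.0])]
mean_kl = batch_mean_kl(batch)
dead = collapsed_dims(mean_kl)
penalty = free_bits_kl(mean_kl)
```

Logging the per-dimension values rather than only the summed KL makes partial collapse visible, where a few dimensions carry all the information and the rest sit at the prior.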
Warm-up Schedules
A common strategy: initialize KL weight β=0, gradually increase to 1 over training (e.g., over first 100k steps). Early reconstruction loss stabilizes the decoder; later, KL loss activates latents. This mimics curriculum learning: master simple (reconstruction) before hard (compression). Variants include cyclical warm-up and divergence schedules. Empirically, warm-up often improves final model quality and reduces the risk of collapse, though it does not guarantee that collapse is avoided.
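A warm-up schedule is only a few lines of code. The sketch below shows a linear ramp and a cyclical variant; the function names, the default of 100k steps, and the ramp fraction are illustrative assumptions rather than recommended settings.

```python
def linear_kl_weight(step, warmup_steps=100_000):
    """Linear ramp of the KL weight from 0 to 1 over `warmup_steps`, then constant."""
    return min(1.0, step / warmup_steps)

def cyclical_kl_weight(step, cycle_length, ramp_fraction=0.5):
    """Repeat the ramp every `cycle_length` steps; hold at 1 for the rest of each cycle."""
    position = (step % cycle_length) / cycle_length
    return min(1.0, position / ramp_fraction)

def annealed_negative_elbo(recon_log_lik, kl, beta):
    """Training loss: negative of (reconstruction term - beta * KL)."""
    return -(recon_log_lik - beta * kl)

# Usage inside a training loop (values are placeholders):
beta = linear_kl_weight(step=50_000)
loss = annealed_negative_elbo(recon_log_lik=-10.0, kl=2.0, beta=beta)
```

When comparing models, the loss should be reported with beta = 1, since the annealed objective is not a bound on log p(x) while beta < 1 only in the sense of a differently weighted trade-off and values at different beta are not comparable.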
Optimization Landscape
The training objective of a latent variable model is non-convex, with many local optima. Careful initialization helps: warm-starting from a pre-trained autoencoder, orthogonal weight initialization, and batch normalization stabilize training. Adaptive optimizers (Adam) with proper learning rates (typically 1e-3 to 1e-4) usually outperform vanilla SGD. Some architectural choices (residual or skip connections) improve gradient flow.
Model-Specific Insights
For VAEs on images, architectural choices matter: (1) downsampling reduces spatial dimensions and concentrates information in channels, which helps the decoder; (2) skip connections in encoder/decoder preserve details; (3) a larger latent dimension d tends to increase KL but improve reconstruction. β-VAE with β>1 encourages disentanglement (latent factors correspond to interpretable attributes), though at reconstruction cost; β<1 does the opposite, favoring reconstruction and helping against posterior collapse. Choice of β is problem-dependent; values of β=0.1-0.5 are common for image data when reconstruction quality is the priority.