Stanford CS
Deep Generative Models
XCS236 · Comprehensive Deep-Dive Architecture
01

Foundations of Generative Modeling

Generative models learn to sample from and reason about high-dimensional data distributions. Unlike discriminative models, which learn the conditional p(y|x), generative models learn p(x), the full data distribution. This enables sampling, density estimation, missing-data imputation, and representation learning.

The core challenge is modeling complex, high-dimensional distributions over data such as images, text, or audio. No approach is universally best: model families trade off likelihood tractability, sampling speed, mode coverage, and training stability. The course explores eight major families, each with distinct strengths and limitations.

Sampling
Generate new data
Likelihood
Density estimation
Representation
Learn structure
8 Families
Different trades
02

Autoregressive Models

Autoregressive models factor the joint distribution with the chain rule: p(x) = ∏ᵢ p(xᵢ|x₁:ᵢ₋₁). The factorization is exact, so the likelihood is tractable: log p(x) is the sum of the conditional log-probabilities. Models like PixelCNN and WaveNet parameterize each conditional with a neural network, predicting one dimension at a time given all previous dimensions.

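Strength: exact likelihood. Weakness: slow generation, because sampling n dimensions takes n sequential steps, each conditioned on the values already drawn. This sequential cost is inherent to the factorization. MADE uses masked weights so that a single forward pass yields all n conditionals, which speeds up likelihood evaluation but not sampling. RNNs and Transformers also fit this paradigm.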
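The asymmetry between likelihood evaluation and sampling can be sketched with a tiny binary autoregressive model. This is an illustrative toy, not a model from the course: each conditional is assumed to be a logistic regression on the earlier dimensions, and the names `make_params`, `log_likelihood`, and `sample` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def make_params(n, seed=0):
    """Random parameters; W is strictly lower triangular so x_i only sees x_1..x_{i-1}."""
    rng = np.random.default_rng(seed)
    W = np.tril(rng.normal(size=(n, n)), k=-1)  # the mask: zero on and above the diagonal
    b = rng.normal(size=n)
    return W, b

def log_likelihood(x, W, b):
    """log p(x) = sum_i log p(x_i | x_<i); all n conditionals come from one matrix product."""
    p = sigmoid(W @ x + b)
    return float(np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p)))

def sample(W, b, rng):
    """Ancestral sampling: n sequential steps, each using the values already drawn."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        p_i = sigmoid(W[i, :i] @ x[:i] + b[i])
        x[i] = float(rng.random() < p_i)
    return x

W, b = make_params(4)
x = sample(W, b, np.random.default_rng(1))
print(x, log_likelihood(x, W, b))
```

Because the factorization is exact, the probabilities of all 2ⁿ binary vectors sum to one, which is an easy sanity check for small n.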

Chain rule: p(x)=∏p(xᵢ|x₁:ᵢ₋₁)
PixelCNN, WaveNet
Tractable likelihood
Sequential generation
03

Variational Autoencoders

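Variational Autoencoders introduce latent variables z. The generative process: z ~ N(0, I), then x ~ p_θ(x|z). Because log p_θ(x) is intractable, we maximize the Evidence Lower Bound (ELBO) with an approximate posterior q_λ(z|x): ELBO(x; θ, λ) = 𝔼_{q_λ(z|x)}[log (p_θ(x,z) / q_λ(z|x))] = 𝔼_{q_λ(z|x)}[log p_θ(x|z)] - KL(q_λ(z|x) || p(z)) ≤ log p_θ(x). The reparameterization trick writes z as a deterministic function of λ and independent noise, for example z = μ_λ(x) + σ_λ(x) ⊙ ε with ε ~ N(0, I), which enables low-variance gradient estimates.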

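VAEs learn a latent space in which interpolation tends to produce smooth transitions. The KL term regularizes q_λ(z|x) toward the prior; when it dominates, the encoder can ignore x altogether, a failure mode known as posterior collapse. Trade-off: reconstructions and samples are often blurry, which is usually attributed to the simple decoder likelihood (e.g., a factorized Gaussian) and the looseness of the bound rather than to the KL term alone.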
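A minimal sketch of a Monte Carlo ELBO estimate, under assumptions that go beyond the text: a diagonal-Gaussian q_λ(z|x) given by `mu` and `log_var`, a unit-variance Gaussian decoder whose mean is supplied by a hypothetical `decode_mean` function, and the closed-form KL between a diagonal Gaussian and N(0, I).

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I); the randomness is independent of (mu, log_var)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * float(np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var))

def gaussian_log_lik(x, mean):
    """log N(x; mean, I): the assumed unit-variance decoder likelihood."""
    return -0.5 * x.size * np.log(2.0 * np.pi) - 0.5 * float(np.sum((x - mean) ** 2))

def elbo_estimate(x, mu, log_var, decode_mean, rng, n_samples=1):
    """Monte Carlo reconstruction term minus the analytic KL term."""
    recon = np.mean([
        gaussian_log_lik(x, decode_mean(reparameterize(mu, log_var, rng)))
        for _ in range(n_samples)
    ])
    return float(recon - kl_to_standard_normal(mu, log_var))

# Toy model: z ~ N(0, 1), x | z ~ N(z, 1), so p(x) = N(0, 2) and the true posterior is N(x/2, 1/2).
rng = np.random.default_rng(0)
x = np.array([1.0])
log_px = -0.5 * np.log(2.0 * np.pi * 2.0) - 0.25
loose = elbo_estimate(x, np.array([0.0]), np.array([0.0]), lambda z: z, rng, n_samples=20000)
tight = elbo_estimate(x, np.array([0.5]), np.array([np.log(0.5)]), lambda z: z, rng, n_samples=20000)
print(loose, tight, log_px)
```

In this toy model the bound is loose when q equals the prior and becomes tight (up to Monte Carlo noise) when q equals the true posterior, which is the sense in which maximizing the ELBO over λ improves the approximation.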

Latent z
Smooth latent space
ELBO
Tractable bound
Reparameterization
Gradient trick
Reconstruction+KL
Dual objectives
04

Normalizing Flows

Normalizing flows use an invertible transformation f_θ to map samples from a simple base distribution (e.g., a standard Gaussian) to the data space: x = f_θ(z). By the change-of-variables formula: p_x(x) = p_z(f_θ⁻¹(x)) |det(∂f_θ⁻¹(x)/∂x)|. The Jacobian determinant must be cheap to compute. Coupling layers, as used in RealNVP, satisfy this constraint with a triangular Jacobian while still allowing expressive transformations.

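Strengths: exact likelihood, efficient sampling. Weakness: invertibility forces the latent and data dimensions to match and restricts the architecture, which limits the expressiveness of each layer. Glow adds invertible 1×1 convolutions within a multi-scale architecture, and Flow++ adds variational dequantization.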
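An affine coupling layer in the style of RealNVP can be sketched as follows. The conditioner here is an assumed single linear layer with a tanh on the log-scale, chosen only to keep the example short; `to_latent` plays the role of f_θ⁻¹ and `to_data` the role of f_θ, and both names are hypothetical.

```python
import numpy as np

def _scale_and_shift(x_a, params):
    """Hypothetical conditioner: any function of x_a works, since it is never inverted."""
    W_s, W_t = params
    return np.tanh(x_a @ W_s), x_a @ W_t  # (log-scale, shift)

def to_latent(x, params):
    """z = f^{-1}(x): keep the first half, affinely transform the second half."""
    d = x.shape[-1] // 2
    x_a, x_b = x[..., :d], x[..., d:]
    log_s, t = _scale_and_shift(x_a, params)
    z_b = x_b * np.exp(log_s) + t
    log_det = np.sum(log_s, axis=-1)  # triangular Jacobian: log|det| is the sum of log-scales
    return np.concatenate([x_a, z_b], axis=-1), log_det

def to_data(z, params):
    """x = f(z): exact inverse of to_latent, at the same cost."""
    d = z.shape[-1] // 2
    z_a, z_b = z[..., :d], z[..., d:]
    log_s, t = _scale_and_shift(z_a, params)
    x_b = (z_b - t) * np.exp(-log_s)
    return np.concatenate([z_a, x_b], axis=-1)

def log_prob(x, params):
    """log p_x(x) = log N(f^{-1}(x); 0, I) + log|det d f^{-1} / dx|."""
    z, log_det = to_latent(x, params)
    log_pz = -0.5 * np.sum(z ** 2, axis=-1) - 0.5 * z.shape[-1] * np.log(2.0 * np.pi)
    return log_pz + log_det

rng = np.random.default_rng(0)
params = (rng.normal(size=(2, 2)), rng.normal(size=(2, 2)))
x = rng.normal(size=4)
z, log_det = to_latent(x, params)
print(np.allclose(to_data(z, params), x), log_prob(x, params))
```

A single layer leaves half of the dimensions unchanged, so practical flows stack several layers and alternate which half is transformed.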

Change of variables
Invertible f_θ
RealNVP, Glow
Exact likelihood + fast sampling
05

Generative Adversarial Networks

GANs set up an adversarial game between a generator G and a discriminator D. G tries to fool D; D tries to distinguish real data from generated data. The objective: min_G max_D 𝔼_{x~p_data}[log D(x)] + 𝔼_{z~p(z)}[log(1 - D(G(z)))]. At the Nash equilibrium (when it exists), the distribution of G(z) matches the data distribution.

Strength: high-quality samples, no likelihood requirement. Weakness: training instability, mode collapse, and difficulty evaluating convergence. Techniques such as the Wasserstein objective, spectral normalization, and gradient penalties mitigate these issues.
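The value of the objective at the optimal discriminator can be checked numerically in a one-dimensional toy setting. The sketch below assumes Gaussian data and a Gaussian generator so that the optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_g(x)) has a closed form; the function names and the choice of distributions are illustrative, not part of the course material.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def optimal_discriminator(x, mu_data, mu_gen, std=1.0):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) for a fixed generator."""
    p_d = gaussian_pdf(x, mu_data, std)
    p_g = gaussian_pdf(x, mu_gen, std)
    return p_d / (p_d + p_g)

def gan_value(mu_data, mu_gen, rng, n=100_000, std=1.0):
    """Monte Carlo estimate of E[log D*(x)] + E[log(1 - D*(G(z)))]."""
    x_real = rng.normal(mu_data, std, size=n)
    x_fake = mu_gen + std * rng.standard_normal(n)  # toy generator: G(z) = mu_gen + std * z
    d_real = optimal_discriminator(x_real, mu_data, mu_gen, std)
    d_fake = optimal_discriminator(x_fake, mu_data, mu_gen, std)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

rng = np.random.default_rng(0)
print(gan_value(0.0, 0.0, rng), -np.log(4.0))  # generator matches the data
print(gan_value(0.0, 2.0, rng))                # mismatched generator gives a larger value
```

When the generator matches the data, D* equals 1/2 everywhere and the value is -log 4, its minimum over generators; any mismatch raises the value, which is what the generator's minimization pushes against.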

Min-max game
Generator vs Discriminator
No likelihood
Implicit model
High-quality samples
Mode coverage issues
Training tricks
Stability critical
06

Energy-Based & Score-Based Models

Energy-based models define p(x) = exp(-E(x)) / Z, where E is an energy function and Z = ∫ exp(-E(x)) dx is the partition function, which is generally intractable. Maximum-likelihood learning is commonly approximated with contrastive divergence: update parameters to lower the energy of data and raise the energy of model samples. Score-based models instead learn the score function ∇ₓ log p(x) via score matching; because Z does not depend on x, the score avoids estimating Z entirely.

Scores enable sampling via Langevin dynamics: x_{t+1} = x_t + (ε/2)∇ log p(x_t) + √ε · ζ_t, where ε is the step size and ζ_t ~ N(0, I). Strength: a flexible model class whose training does not require Z. Weakness: sampling requires many steps. Score-based and diffusion models are deeply connected.
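The Langevin update above can be sketched directly. To keep the example self-contained, the score is assumed to be known in closed form (a one-dimensional Gaussian with mean 2 and standard deviation 0.5) rather than learned by score matching; `langevin_sample` and `gaussian_score` are hypothetical names.

```python
import numpy as np

def langevin_sample(score, x0, step, n_steps, rng):
    """Iterate x <- x + (step / 2) * score(x) + sqrt(step) * noise."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + 0.5 * step * score(x) + np.sqrt(step) * noise
    return x

def gaussian_score(x, mean=2.0, std=0.5):
    """Score of N(mean, std^2): d/dx log p(x) = -(x - mean) / std^2."""
    return -(x - mean) / std ** 2

rng = np.random.default_rng(0)
# 5000 independent chains, all started at 0, run in parallel as one array.
samples = langevin_sample(gaussian_score, np.zeros(5000), step=0.01, n_steps=1000, rng=rng)
print(samples.mean(), samples.std())
```

With a finite step size the chain is slightly biased (the stationary variance is a little too large), so practical samplers shrink the step or anneal the noise level; the many iterations per sample are the cost noted above.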

EBM: p(x)=exp(-E(x))/Z
Score: ∇log p(x)
Contrastive divergence
Langevin dynamics
07

Diffusion Models

Diffusion models define a forward process q(x_t|x_0) that gradually adds noise to the data, and learn the reverse process p_θ(x_{t-1}|x_t) with a denoising neural network. DDPM (Denoising Diffusion Probabilistic Models) shows that its training objective is equivalent to denoising score matching with a weighting across noise levels. DDIM accelerates sampling via a deterministic, non-Markovian path that needs fewer steps. Discrete diffusion extends the approach to categorical and sequential data.

Strengths: simple training, stable, scalable, state-of-the-art sample quality (e.g., DALL-E 2, Stable Diffusion). Weakness: slow sampling, which takes hundreds of network evaluations with the original samplers. The connection to score-based models reveals a unified framework spanning multiple model families.
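The forward process and the simplified training loss can be sketched under the variance-preserving parameterization used by DDPM, which the text above does not spell out: q(x_t|x_0) = N(√ᾱ_t · x_0, (1 - ᾱ_t) I) with ᾱ_t = ∏ₛ (1 - β_s). The linear schedule values are the DDPM defaults; `eps_model` is a hypothetical stand-in for the denoising network.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and the cumulative products alpha_bar_t."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot, without simulating the t intermediate steps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

def noise_prediction_loss(eps_model, x0, t, alpha_bars, rng):
    """Simplified objective: mean squared error between the true and predicted noise."""
    x_t, eps = q_sample(x0, t, alpha_bars, rng)
    return float(np.mean((eps - eps_model(x_t, t)) ** 2))

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = np.full(10_000, 5.0)
x_T, _ = q_sample(x0, len(betas) - 1, alpha_bars, rng)
print(alpha_bars[-1], x_T.mean(), x_T.std())  # x_T is close to N(0, I) regardless of x0
print(noise_prediction_loss(lambda x_t, t: np.zeros_like(x_t), x0, 500, alpha_bars, rng))
```

Training draws a random t per example and minimizes this loss, so no sampling chain is needed during training; the slow part is generation, which runs the learned reverse process step by step.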

Forward: add noise
q(x_t|x_0)
Reverse: denoise
p_θ(x_{t-1}|x_t)
DDPM, DDIM
Training & sampling
Score matching
Unified view
08

Evaluating & Comparing Models

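No single metric captures model quality. Models with tractable likelihoods (autoregressive models, flows) can report exact log p(x) on test sets; VAEs report a lower bound, the ELBO. For models without a tractable likelihood, such as GANs, and commonly for diffusion models as well, evaluation relies on sample-based metrics: Inception Score (IS) and Fréchet Inception Distance (FID) compare statistics of Inception-network features. Each collapses quality and diversity into a single number, so neither isolates mode coverage, and IS never looks at the real data at all. Precision and Recall metrics separate the two: precision measures sample quality and recall measures mode coverage.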
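The distance underlying FID can be sketched as the Fréchet distance between two Gaussians fitted to feature vectors: ‖μ₁ - μ₂‖² + Tr(Σ₁ + Σ₂ - 2(Σ₁Σ₂)^{1/2}). In practice the features come from a pretrained Inception network; the sketch below assumes plain NumPy arrays of shape (n_samples, n_features) instead, and `frechet_distance` is a hypothetical helper rather than a reference implementation.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two sets of feature vectors."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of square roots of the eigenvalues of
    # cov_a @ cov_b, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    mean_term = float(np.sum((mu_a - mu_b) ** 2))
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
print(frechet_distance(real, real))        # identical sets: distance near 0
print(frechet_distance(real, real + 3.0))  # pure mean shift of 3 in 4 dims: near 36
```

Because only means and covariances enter the formula, two sample sets with matching first and second moments score as identical even if their distributions differ, which is one reason to report additional metrics.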

Best practice: report multiple metrics, use human evaluation, and visualize samples. Different applications prioritize differently: likelihood for compression, sample quality for generation, diversity for data augmentation. Understanding trade-offs between families informs model selection.

Likelihood: log p(x)
IS: sample quality
FID: distribution match
Precision/Recall: coverage
09

References & Further Reading

This course covers eight major families of generative models, each with rich theory and practical applications. The references below provide foundational papers, recent advances, and comprehensive surveys spanning the field.

From variational inference to score-based models, these works document the evolution of generative modeling and offer entry points for deeper study of specific architectures and extensions.