XCS236 Deep Generative Models

Stanford CS

Foundations of Generative Modeling

Generative models learn to sample from and reason about high-dimensional data distributions. Unlike discriminative models, which learn p(y|x), generative models learn p(x), the full data distribution. This enables sampling, density estimation, missing data imputation, and learning meaningful representations.

The core challenge: modeling complex, high-dimensional distributions like images, text, or audio. There is no universal best approach. Different model families trade off likelihood tractability, sampling speed, mode coverage, and stability. The course explores eight major families, each with distinct strengths and limitations.

Sampling: generate new data

Likelihood: density estimation

Representation: learn structure

8 families: different trade-offs

Autoregressive Models

Autoregressive models factor the joint distribution as p(x) = ∏ᵢ p(xᵢ|x₁:ᵢ₋₁). This chain-rule decomposition is exact, so the likelihood is tractable: log p(x) is simply the sum of the conditional log-probabilities. Models like PixelCNN and WaveNet parameterize each conditional term with a neural network, predicting one dimension at a time given all previous dimensions.

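Strength: exact likelihood. Weakness: slow generation, because sampling n dimensions takes n sequential steps (O(n) network evaluations per sample). This sequential dependence is inherent to the factorization. MADE uses masked weights so that one forward pass of an autoencoder yields all n conditionals, which speeds up likelihood evaluation but not sampling. RNNs and Transformers also fit this paradigm.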
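To make the chain-rule factorization concrete, here is a minimal sketch on binary vectors. Each conditional is a logistic function of the prefix; the names `conditional_prob_one`, `weights`, and `biases` and their values are illustrative assumptions, not anything taken from PixelCNN or WaveNet. It shows that the likelihood is an exact sum of conditional log-probabilities, while sampling needs one step per dimension.

```python
import math
import random

def conditional_prob_one(prefix, weights, bias):
    """p(x_i = 1 | x_1..x_{i-1}) as a logistic function of the prefix (toy model)."""
    logit = bias + sum(w * x for w, x in zip(weights, prefix))
    return 1.0 / (1.0 + math.exp(-logit))

def log_likelihood(x, weights, biases):
    """Exact log p(x) = sum_i log p(x_i | x_<i)."""
    total = 0.0
    for i, xi in enumerate(x):
        p1 = conditional_prob_one(x[:i], weights[i], biases[i])
        total += math.log(p1 if xi == 1 else 1.0 - p1)
    return total

def sample(n, weights, biases, rng):
    """Ancestral sampling: n sequential steps, one per dimension."""
    x = []
    for i in range(n):
        p1 = conditional_prob_one(x, weights[i], biases[i])
        x.append(1 if rng.random() < p1 else 0)
    return x

# Hypothetical parameters: weights[i] has one entry per earlier dimension.
n = 3
weights = [[], [1.5], [-1.0, 0.5]]
biases = [0.0, -0.5, 0.2]

rng = random.Random(0)
x = sample(n, weights, biases, rng)
print(x, log_likelihood(x, weights, biases))

# Sanity check: the probabilities of all 2^n binary vectors sum to 1.
total_prob = sum(
    math.exp(log_likelihood([(k >> i) & 1 for i in range(n)], weights, biases))
    for k in range(2 ** n)
)
print(total_prob)
```

Because every conditional is a properly normalized distribution, the joint is normalized by construction, which is why no partition function appears anywhere.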

Chain rule: p(x)=∏p(xᵢ|x₁:ᵢ₋₁); PixelCNN, WaveNet; tractable likelihood; sequential generation

Variational Autoencoders

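Variational Autoencoders introduce latent variables z. The generative process: z ~ N(0, I), then x ~ p_θ(x|z). Because the marginal log p_θ(x) is intractable, we instead maximize the Evidence Lower Bound (ELBO), which satisfies ELBO ≤ log p_θ(x): ELBO(x; θ, λ) = 𝔼_{q_λ(z|x)}[log (p_θ(x,z) / q_λ(z|x))] = 𝔼_{q_λ(z|x)}[log p_θ(x|z)] - KL(q_λ(z|x) || p(z)). The reparameterization trick writes z = μ_λ(x) + σ_λ(x) ⊙ ε with ε ~ N(0, I), so gradients with respect to λ can pass through the sampling step.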

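VAEs learn a meaningful latent space where interpolation produces smooth transitions. The KL term pulls q(z|x) toward the prior, which regularizes the latent space; when it dominates, the decoder can ignore z altogether (posterior collapse), a failure mode that KL annealing is commonly used to mitigate. Trade-off: reconstructions tend to be blurry, an effect usually attributed to the simple (typically Gaussian) decoder likelihood and the looseness of the bound rather than to the KL term alone.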
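The ELBO and the reparameterization trick can be sketched in one dimension. This is a toy setup under stated assumptions: the decoder is a hypothetical linear Gaussian p(x|z) = N(w·z + b, 1), chosen only because its exact marginal and posterior are known in closed form, so the bound can be checked numerically. The function names and the values of `w`, `b`, and `x` are illustrative.

```python
import math
import random

def gaussian_log_pdf(x, mean, std):
    """Log density of N(mean, std^2) at x."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1))."""
    return 0.5 * (mu ** 2 + sigma ** 2 - 1.0) - math.log(sigma)

def elbo_estimate(x, mu, sigma, w, b, rng, num_samples=20000):
    """Monte Carlo ELBO: E_q[log p(x|z)] - KL(q(z|x) || p(z))."""
    recon = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps  # reparameterization: z is a deterministic function of (mu, sigma, eps)
        recon += gaussian_log_pdf(x, w * z + b, 1.0)
    return recon / num_samples - kl_to_standard_normal(mu, sigma)

rng = random.Random(0)
x, w, b = 1.0, 2.0, 0.0

# For this linear Gaussian toy model the marginal is x ~ N(b, w^2 + 1).
log_px = gaussian_log_pdf(x, b, math.sqrt(w ** 2 + 1.0))

# Exact posterior q(z|x) makes the bound tight; using the prior as q leaves a gap.
post_mean = w * (x - b) / (w ** 2 + 1.0)
post_std = math.sqrt(1.0 / (w ** 2 + 1.0))
elbo_exact_q = elbo_estimate(x, post_mean, post_std, w, b, rng)
elbo_prior_q = elbo_estimate(x, 0.0, 1.0, w, b, rng)
print(log_px, elbo_exact_q, elbo_prior_q)
```

The gap between log p(x) and the ELBO equals KL(q(z|x) || p(z|x)), so a better approximate posterior tightens the bound. In a real VAE, `mu` and `sigma` would be outputs of an encoder network rather than fixed numbers.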

Latent z: Disentangled space

ELBO: Tractable bound

Reparameterization: Gradient trick

Reconstruction+KL: Dual objectives

Normalizing Flows

Normalizing flows use invertible transformations f_θ to map a simple base distribution (e.g., a standard Gaussian) to a complex one. By the change-of-variables formula: p_x(x) = p_z(f_θ⁻¹(x)) |det(∂f_θ⁻¹(x)/∂x)|. The Jacobian determinant must be cheap to compute. Coupling layers, the building block of RealNVP, satisfy this constraint: their Jacobian is triangular, so the determinant is the product of its diagonal entries, while the transformation remains expressive.

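Strengths: exact likelihood, efficient sampling. Weakness: every layer must be invertible with a tractable Jacobian determinant, and the latent space must have the same dimensionality as the data, which limits expressiveness per layer. Glow and Flow++ extend the framework with multi-scale architectures, invertible 1×1 convolutions, and improved dequantization.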
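The change-of-variables computation can be sketched with a single two-dimensional affine coupling layer in the style of RealNVP. The `scale` and `shift` functions below are hypothetical fixed stand-ins for what would normally be learned networks; the point is only that the inverse and the log-determinant are both cheap.

```python
import math

def scale(u):
    """Hypothetical log-scale function s(.); a neural network in practice."""
    return math.tanh(u)

def shift(u):
    """Hypothetical shift function t(.); a neural network in practice."""
    return 0.5 * u

def forward(z):
    """x = f(z): keep z1, transform z2 conditioned on z1."""
    z1, z2 = z
    return (z1, z2 * math.exp(scale(z1)) + shift(z1))

def inverse(x):
    """z = f^{-1}(x) and log|det d f^{-1} / dx|."""
    x1, x2 = x
    z = (x1, (x2 - shift(x1)) * math.exp(-scale(x1)))
    log_det = -scale(x1)  # triangular Jacobian: the determinant is the diagonal product
    return z, log_det

def standard_normal_log_pdf_2d(z):
    return -math.log(2.0 * math.pi) - 0.5 * (z[0] ** 2 + z[1] ** 2)

def log_prob(x):
    """log p_x(x) = log p_z(f^{-1}(x)) + log|det d f^{-1} / dx|."""
    z, log_det = inverse(x)
    return standard_normal_log_pdf_2d(z) + log_det

z = (0.3, -1.2)
x = forward(z)            # sampling: one cheap forward pass
z_back, log_det = inverse(x)
print(x, z_back, log_prob(x))
```

Note that `scale` and `shift` never need to be inverted themselves, which is why coupling layers can use arbitrary networks inside them. This layer is not volume-preserving, since the log-determinant varies with x1.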

Change of variables; invertible f_θ; RealNVP, Glow; exact likelihood + fast sampling

Generative Adversarial Networks

GANs introduce an adversarial game between a generator G and a discriminator D. G tries to fool D; D tries to distinguish real from generated data. The objective: min_G max_D 𝔼_{x~p_data}[log D(x)] + 𝔼_{z~p(z)}[log(1 - D(G(z)))]. At the Nash equilibrium (when it exists), G matches the data distribution and D outputs 1/2 everywhere.

Strengths: high-quality samples, no likelihood requirement. Weaknesses: training instability, mode collapse, and difficulty assessing convergence. Techniques such as the Wasserstein objective, spectral normalization, and gradient penalties mitigate these issues.
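The min-max objective can be evaluated exactly on a small discrete example. The sketch below uses two made-up distributions over three outcomes and the standard result that, for a fixed generator, the optimal discriminator is D*(x) = p_data(x) / (p_data(x) + p_g(x)). The distributions and function names are illustrative assumptions.

```python
import math

def value_function(p_data, p_gen, disc):
    """V(D, G) = E_data[log D(x)] + E_gen[log(1 - D(x))] over a finite support."""
    return sum(
        pd * math.log(d) + pg * math.log(1.0 - d)
        for pd, pg, d in zip(p_data, p_gen, disc)
    )

def optimal_discriminator(p_data, p_gen):
    """D*(x) = p_data(x) / (p_data(x) + p_gen(x)) for a fixed generator."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_gen)]

# Hypothetical distributions with full support, so every log is finite.
p_data = [0.5, 0.3, 0.2]
p_gen_bad = [0.2, 0.3, 0.5]

d_star = optimal_discriminator(p_data, p_gen_bad)
v_bad = value_function(p_data, p_gen_bad, d_star)

# When the generator matches the data, D* is 1/2 everywhere and V = -log 4.
v_matched = value_function(p_data, p_data, optimal_discriminator(p_data, p_data))
print(d_star, v_bad, v_matched, -math.log(4.0))
```

A mismatched generator gives the optimal discriminator a value above -log 4, and the excess is twice the Jensen-Shannon divergence between the two distributions. Real GAN training never has access to D*, which is one source of the instability described above.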

Min-max game: Generator vs Discriminator

No likelihood: Implicit model

High-quality samples: Mode coverage issues

Training tricks: Stability critical

Energy-Based & Score-Based Models

Energy-based models define p(x) = exp(-E(x)) / Z, where E is an energy function and Z = ∫ exp(-E(x)) dx is the partition function, which is usually intractable. Maximum-likelihood learning is commonly approximated with contrastive divergence: update parameters to lower the energy on data and raise it on model samples. Score-based models learn the score function ∇_x log p(x) = -∇_x E(x) via score matching. Because the score does not depend on Z, this avoids estimating Z entirely.

Scores enable sampling via Langevin dynamics: x_{t+1} = x_t + (ε/2)∇_x log p(x_t) + √ε · ζ_t, where ζ_t ~ N(0, I) and ε is the step size. Strengths: flexible model class, training that never requires Z. Weakness: sampling requires many iterative steps. Score-based and diffusion models are deeply connected.
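The Langevin update above can be tried on a target whose score is known in closed form. This is a minimal sketch: the target is a one-dimensional Gaussian chosen for illustration, and in a score-based model `gaussian_score` would be replaced by a learned network. The step size, step count, and number of chains are arbitrary assumptions.

```python
import math
import random

def gaussian_score(x, mean, std):
    """Score of N(mean, std^2): d/dx log p(x) = -(x - mean) / std^2."""
    return -(x - mean) / std ** 2

def langevin_sample(score_fn, x0, step_size, num_steps, rng):
    """Iterate x <- x + (eps / 2) * score(x) + sqrt(eps) * noise."""
    x = x0
    for _ in range(num_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step_size * score_fn(x) + math.sqrt(step_size) * noise
    return x

rng = random.Random(0)
target_mean, target_std = 3.0, 1.0

# Run independent chains, all started far from the target at x0 = 0.
samples = [
    langevin_sample(
        lambda x: gaussian_score(x, target_mean, target_std),
        x0=0.0, step_size=0.1, num_steps=200, rng=rng,
    )
    for _ in range(2000)
]

sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(sample_mean, sample_var)
```

Only the score is needed, never the normalizing constant. With a finite step size the chain's stationary distribution is slightly biased (here, a variance a little above the target), and each sample costs many sequential steps, which is the weakness noted above.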

EBM: p(x)=exp(-E(x))/Z; Score: ∇log p(x); contrastive divergence; Langevin dynamics

Diffusion Models

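Diffusion models define a forward process q(x_t|x_0) that gradually adds noise to the data until it is approximately pure noise. A neural network learns the reverse process p_θ(x_{t-1}|x_t), which denoises step by step. DDPM (Denoising Diffusion Probabilistic Models) trains this network to predict the added noise, an objective equivalent to denoising score matching with a particular weighting across noise levels. DDIM accelerates sampling via a deterministic sampling path that can skip steps. Discrete diffusion extends the approach to categorical and sequential data.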

Strengths: simple training, stable, scalable, state-of-the-art sample quality (DALL-E 2, Stable Diffusion). Weakness: slow sampling (hundreds of network evaluations per sample without acceleration). The connection to score-based models reveals a unified framework spanning multiple model families.
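The forward process has a convenient closed form in the DDPM setting, sketched below. The assumptions are a variance-preserving process with a linear β schedule from 1e-4 to 0.02 over 1000 steps (values used in the DDPM paper, but any schedule would do) and the closed form q(x_t|x_0) = N(√ᾱ_t · x_0, (1 - ᾱ_t) I) with ᾱ_t = ∏ₛ(1 - β_s). The function names are illustrative.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variances beta_1..beta_T, increasing linearly."""
    return [
        beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        for t in range(num_steps)
    ]

def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s <= t} (1 - beta_s): remaining signal fraction."""
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; also return the noise used."""
    ab = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, noise)]
    return x_t, noise

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))

x0 = [1.0, -2.0, 0.5]                      # toy "data point"
x_t, noise = q_sample(x0, 500, alpha_bars, rng)
print(alpha_bars[0], alpha_bars[500], alpha_bars[-1])
print(x_t)
```

In DDPM-style training, a network receives `x_t` and `t` and is trained to predict `noise` with a squared-error loss. Because `q_sample` jumps directly to any step, training never has to simulate the full chain; only sampling does.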

Forward (add noise): q(x_t|x_0)

Reverse (denoise): p_θ(x_{t-1}|x_t)

DDPM, DDIM: Training & sampling

Score matching: Unified view

Evaluating & Comparing Models

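No single metric captures model quality. Likelihood-based models can report log p(x) on held-out test sets: exactly for autoregressive models and flows, and as a lower bound (the ELBO) for VAEs. Sample-based evaluation, the standard for GANs and common for diffusion models, uses Inception Score (IS) and Fréchet Inception Distance (FID), both computed from the outputs of a pretrained Inception network. Each collapses quality and diversity into a single number, so neither can tell a model that drops modes apart from one that produces lower-fidelity samples. Precision and recall metrics separate the two: precision measures sample quality, recall measures mode coverage.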
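The Fréchet distance behind FID can be sketched in one dimension. This is a deliberate simplification: real FID fits multivariate Gaussians to Inception features and computes ‖μ₁ - μ₂‖² + Tr(Σ₁ + Σ₂ - 2(Σ₁Σ₂)^{1/2}), which for scalar features reduces to (μ₁ - μ₂)² + (σ₁ - σ₂)². The "features" below are synthetic numbers, not network activations.

```python
import random
import statistics

def frechet_distance_1d(features_a, features_b):
    """Frechet distance between Gaussians fitted to two scalar feature sets."""
    mean_a, mean_b = statistics.fmean(features_a), statistics.fmean(features_b)
    std_a, std_b = statistics.pstdev(features_a), statistics.pstdev(features_b)
    return (mean_a - mean_b) ** 2 + (std_a - std_b) ** 2

rng = random.Random(0)

# Synthetic stand-ins for features of real and generated samples.
real = [rng.gauss(0.0, 1.0) for _ in range(5000)]
good_model = [rng.gauss(0.05, 1.0) for _ in range(5000)]    # close to the real statistics
collapsed_model = [rng.gauss(0.0, 0.2) for _ in range(5000)]  # too little diversity

fd_good = frechet_distance_1d(real, good_model)
fd_collapsed = frechet_distance_1d(real, collapsed_model)
print(fd_good, fd_collapsed)
```

The distance compares only the first two moments of the feature distributions, which is why a single score cannot say whether a gap comes from poor fidelity or from missing modes.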

Best practice: report multiple metrics, use human evaluation, and visualize samples. Different applications prioritize differently: likelihood for compression, sample quality for generation, diversity for data augmentation. Understanding trade-offs between families informs model selection.

Likelihood: log p(x); IS: sample quality; FID: distribution match; Precision/Recall: coverage

References & Further Reading

This course covers eight major families of generative models, each with rich theory and practical applications. The references below provide foundational papers, recent advances, and comprehensive surveys spanning the field.

From variational inference to score-based models, these works document the evolution of generative modeling and offer entry points for deeper study of specific architectures and extensions.
