A generative model learns p(x), the probability distribution over observed data x. This is fundamentally different from discriminative models, which learn p(y|x). The advantage: once you have p(x), you can sample, compute likelihoods, perform inference over missing variables, and learn useful representations.
Density Estimation and Sampling
The two core capabilities are: (1) density estimation—computing log p(x) for data x; (2) sampling—drawing new samples x ~ p(x). Different model families make different trade-offs. Autoregressive models have tractable likelihood but slow sampling. Flows have both tractable likelihood and efficient sampling. GANs produce excellent samples but provide no likelihood. Diffusion models achieve state-of-the-art sample quality but require many sampling steps.
Latent Variable Models
Many generative models use latent variables z that capture hidden structure. The generative process: sample z from a simple prior p(z), then draw x from p_θ(x|z). This induces the marginal p_θ(x) = ∫ p_θ(x|z)p(z) dz. VAEs and normalizing flows use latent variables to enable efficient inference and sampling. Estimating z from x (inference) is often as important as sampling x.
Taxonomy: Eight Model Families
The course covers: (1) autoregressive models—chain rule factorization, exact likelihood, slow generation; (2) VAEs—latent variables, ELBO optimization, blurry samples; (3) normalizing flows—invertible transformations, exact likelihood, constrained expressiveness; (4) GANs—adversarial training, high-quality samples, mode collapse risk; (5) EBMs—energy functions, contrastive training; (6) score-based models—learn score function ∇log p(x); (7) diffusion models—iterative denoising, state-of-the-art quality; (8) discrete diffusion—for categorical and sequence data.
Learning Objectives
Most generative models optimize some form of likelihood or divergence: maximum likelihood (autoregressive), ELBO (VAE), exact likelihood with constraints (flows), adversarial divergence (GAN), contrastive divergence (EBM), or score matching (score-based). Understanding these objectives and their properties is central to applying and extending generative models.
Autoregressive models decompose the joint distribution using the chain rule: p(x) = p(x₁)·p(x₂|x₁)·p(x₃|x₁,x₂)·...·p(x_n|x₁,...,x_{n-1}). Each conditional term p(xᵢ|x₁:ᵢ₋₁) is parameterized by a neural network. To train, compute -log p(x) on the training set and optimize via maximum likelihood.
Tractable Likelihood
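The likelihood decomposes into a product of conditionals, each of which is tractable. Computing log p(x) requires evaluating n conditional terms (one per dimension) but no integrals or approximations. Because every xᵢ is observed at training time, masked or causal architectures can evaluate all n conditionals in a single parallel forward pass. This makes maximum likelihood training straightforward: minimize -∑ᵢ log p(xᵢ|x₁:ᵢ₋₁), averaged over the training set.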
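As a minimal sketch of the factorization, the NumPy snippet below evaluates log p(x) for a toy autoregressive model over binary vectors. The logistic parameterization, the names `W`, `b`, and `log_likelihood`, and the tiny dimension are assumptions made for illustration, not something these notes prescribe. The strictly lower-triangular weight matrix plays the role of a mask.

```python
import itertools
import numpy as np

def log_likelihood(x, W, b):
    """log p(x) for a binary vector x under a toy autoregressive model.

    Assumed parameterization:
    p(x_i = 1 | x_{<i}) = sigmoid(b[i] + W[i, :i] @ x[:i]).
    """
    W = np.tril(W, k=-1)                          # drop weights on x_i and later entries
    probs = 1.0 / (1.0 + np.exp(-(b + W @ x)))    # all n conditionals from one product
    return float(np.sum(x * np.log(probs) + (1 - x) * np.log(1 - probs)))

rng = np.random.default_rng(0)
n = 4
W = rng.normal(size=(n, n))
b = rng.normal(size=n)

# Sanity check: the probabilities of all 2^n binary vectors should sum to 1.
total = sum(
    np.exp(log_likelihood(np.array(bits, dtype=float), W, b))
    for bits in itertools.product([0, 1], repeat=n)
)
print(total)
```

The enumeration over all 2^n vectors is only feasible for tiny n. It is there to confirm that a product of normalized conditionals is itself normalized, with no partition function to compute.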
PixelCNN and WaveNet
PixelCNN models images by predicting pixels in raster scan order. A masked convolutional network ensures each prediction depends only on previous pixels. WaveNet extends the idea to audio, with dilated convolutions to increase receptive field. Both achieve competitive likelihoods and generate sharp samples (though sometimes repetitive due to exposure bias).
The Sequential Bottleneck
Sampling requires generating one dimension at a time: first sample x₁, then x₂ given x₁, then x₃ given x₁, x₂, and so on. This O(n) sequential process is inherently slow. For a 256×256 grayscale image (65,536 dimensions; three times that for RGB), drawing one sample requires about 65k neural network evaluations. Parallelization is difficult because each dimension depends on all earlier ones.
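The sequential dependence can be sketched with ancestral sampling from a toy binary model. The logistic conditionals and the names `W`, `b`, and `sample` are illustrative assumptions; a real model would replace the dot product with a neural network evaluation.

```python
import numpy as np

def sample(W, b, rng):
    """Ancestral sampling with p(x_i = 1 | x_{<i}) = sigmoid(b[i] + W[i, :i] @ x[:i])."""
    n = len(b)
    x = np.zeros(n)
    n_evals = 0
    for i in range(n):                        # step i cannot start before step i-1 ends
        logit = b[i] + W[i, :i] @ x[:i]       # uses only the already-sampled prefix
        p_i = 1.0 / (1.0 + np.exp(-logit))
        x[i] = float(rng.random() < p_i)
        n_evals += 1                          # one conditional evaluation per dimension
    return x, n_evals

rng = np.random.default_rng(0)
n = 8
W = rng.normal(size=(n, n))
b = rng.normal(size=n)
x, n_evals = sample(W, b, rng)
print(x, n_evals)
```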
MADE and Masking
MADE (Masked Autoencoder for Distribution Estimation) improves efficiency by using masked linear layers, making the computation fully parallelizable during training. The mask ensures output dimension i depends only on input dimensions 1 to i-1. Despite parallelizable training, sampling remains sequential.
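One way to build such masks for a single hidden layer is sketched below, following the usual degree-assignment construction. The layer sizes and the helper name `made_masks` are assumptions for illustration. The product of the two masks counts the paths from each input to each output, so the autoregressive property can be checked directly.

```python
import numpy as np

def made_masks(D, H, rng):
    """Masks for a one-hidden-layer MADE with D inputs/outputs and H hidden units."""
    deg_in = np.arange(1, D + 1)                 # input/output i has degree i
    deg_hid = rng.integers(1, D, size=H)         # hidden degrees in 1..D-1
    # A hidden unit may see inputs whose degree does not exceed its own.
    mask_in = (deg_hid[:, None] >= deg_in[None, :]).astype(float)    # shape (H, D)
    # An output may see hidden units of strictly smaller degree.
    mask_out = (deg_in[:, None] > deg_hid[None, :]).astype(float)    # shape (D, H)
    return mask_in, mask_out

rng = np.random.default_rng(0)
D, H = 5, 16
mask_in, mask_out = made_masks(D, H, rng)

# connectivity[i, j] = number of paths from input j to output i.
connectivity = mask_out @ mask_in
print(connectivity)
```

In a full model the masks would be multiplied elementwise into the weight matrices of each linear layer. The first output has no incoming paths, so it is modeled by its bias alone.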
VAEs combine latent variable models with variational inference. The generative model: z ~ N(0, I), then x ~ p_θ(x|z), where a neural network decoder parameterizes p_θ(x|z). The challenge: the marginal likelihood log p_θ(x) = log ∫ p_θ(x|z)p(z) dz is intractable. VAEs address this by introducing an encoder q_λ(z|x) that approximates the true posterior p_θ(z|x).
The ELBO Objective
The Evidence Lower Bound (ELBO) is: ELBO(x; θ, λ) = 𝔼_{z ~ q_λ(z|x)}[log p_θ(x|z)] - KL(q_λ(z|x) || p(z)). It satisfies ELBO ≤ log p_θ(x), with a gap equal to KL(q_λ(z|x) || p_θ(z|x)). The first term is the reconstruction term (the negative of the reconstruction loss): the encoder produces a distribution over z, the decoder reconstructs x from z, and we measure fidelity. The second term is the KL divergence penalty: it pushes q_λ(z|x) toward the standard normal prior, encouraging a structured latent space.
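The two ELBO terms can be computed on a toy model where the exact marginal is known. In this sketch the unit-variance Gaussian decoder, the one-dimensional linear model, and the function names are all assumptions chosen so that the bound can be compared with the true log p(x).

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def elbo(x, mu, log_var, decoder_mean, rng, n_samples=2000):
    """Monte Carlo ELBO with an assumed unit-variance Gaussian decoder."""
    eps = rng.standard_normal((n_samples, mu.size))
    z = mu + np.exp(0.5 * log_var) * eps                   # samples from q(z|x)
    sq_err = np.sum((x - decoder_mean(z)) ** 2, axis=1)
    log_px_given_z = -0.5 * sq_err - 0.5 * x.size * np.log(2 * np.pi)
    return log_px_given_z.mean() - kl_to_standard_normal(mu, log_var)

# Toy model: z ~ N(0, 1) and x | z ~ N(z, 1), so the exact marginal is x ~ N(0, 2).
rng = np.random.default_rng(0)
x = np.array([1.0])
log_px = -0.5 * np.log(2 * np.pi * 2.0) - x[0] ** 2 / 4.0

# A deliberately poor encoder: q(z|x) equal to the prior N(0, 1).
bound = elbo(x, mu=np.zeros(1), log_var=np.zeros(1), decoder_mean=lambda z: z, rng=rng)
print(bound, log_px)
```

With this encoder the KL penalty is zero but the reconstruction term is poor, so the bound sits well below log p(x). An encoder closer to the true posterior would narrow the gap.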
The Reparameterization Trick
To backpropagate through the sampling operation z ~ q_λ(z|x), we reparameterize: z = μ_λ(x) + ε ⊙ σ_λ(x), where ε ~ N(0, I) and ⊙ is element-wise product. Now z is a deterministic function of ε (which is non-trainable), so gradients flow through μ and σ.
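A small numerical check of the idea, using a test function whose expectation is known in closed form. The choice f(z) = z² and the scalar parameters are assumptions for illustration; in a VAE, f would be the decoder log-likelihood and the derivatives would come from automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.5
eps = rng.standard_normal(200_000)   # noise, independent of the parameters
z = mu + sigma * eps                 # reparameterized sample from N(mu, sigma^2)

# f(z) = z^2, so df/dz = 2z. Apply the chain rule through z = mu + sigma * eps.
grad_mu = np.mean(2 * z * 1.0)       # dz/dmu = 1
grad_sigma = np.mean(2 * z * eps)    # dz/dsigma = eps

# Analytic reference: E[z^2] = mu^2 + sigma^2, so the gradients are 2*mu and 2*sigma.
print(grad_mu, 2 * mu)
print(grad_sigma, 2 * sigma)
```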
Disentangled Representation
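VAEs can learn a latent space in which individual dimensions capture interpretable features, although disentanglement is not guaranteed and usually depends on extra regularization or inductive bias. Interpolation between two latent codes often produces meaningful transitions (e.g., a smooth rotation of a face). One reason is that the KL penalty keeps the posteriors close to the standard normal prior, which is isotropic, so codes for different inputs fill a shared, smooth region of latent space.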
Blurriness and KL Annealing
VAEs tend to produce blurry reconstructions. One cause is the tension between the two ELBO terms: a high β (weight on the KL term) regularizes the posterior aggressively, which degrades reconstructions and in the extreme causes posterior collapse (q(z|x) ≈ p(z), so z is ignored). A low β gives sharper reconstructions but lets q(z|x) drift from the prior, so samples drawn from p(z) decode poorly. KL annealing, which gradually increases β from near 0 during training, helps balance these objectives.
Normalizing flows construct flexible generative models by composing simple invertible transformations. A base distribution (e.g., Gaussian) is transformed through a sequence of invertible functions to model complex distributions. The log-likelihood is tractable through the change-of-variables formula. Writing x = f_K ∘ ... ∘ f_1(z) with z_0 = z and z_i = f_i(z_{i-1}): log p_X(x) = log p_Z(z_0) - Σ_{i=1}^{K} log|det J_{f_i}(z_{i-1})|, where z_0 = f⁻¹(x) and J_{f_i} is the Jacobian of the i-th transformation.
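The change-of-variables formula can be checked on the simplest possible flow, a stack of scalar affine maps, where the resulting density is known exactly. The layer parameters and function names below are illustrative assumptions.

```python
import numpy as np

def std_normal_logpdf(z):
    return -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)

def flow_logpdf(x, layers):
    """log p_X(x) for x = f_K(...f_1(z)), where f_i(z) = a_i * z + b_i and a_i != 0."""
    log_det_sum = 0.0
    for a, b in reversed(layers):         # invert the layers from last to first
        x = (x - b) / a
        log_det_sum += np.log(abs(a))     # log|det J| of a scalar affine map
    return std_normal_logpdf(x) - log_det_sum

# Two layers: x = -0.5 * (2 z + 1) + 3 = -z + 2.5, so X ~ N(2.5, 1).
layers = [(2.0, 1.0), (-0.5, 3.0)]
x = np.array([0.0, 2.5, 4.0])
log_p = flow_logpdf(x, layers)
expected = -0.5 * (x - 2.5) ** 2 - 0.5 * np.log(2 * np.pi)
print(log_p, expected)
```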
Invertibility Constraint
For flows to be generative (capable of both sampling and likelihood evaluation), each transformation must be invertible. Additionally, the determinant of the Jacobian must be efficiently computable. This excludes standard neural network layers and requires specialized architectures like coupling layers and autoregressive transformations.
RealNVP and Coupling Layers
Real-valued Non-Volume Preserving (RealNVP) uses coupling layers where dimensions are split, and one subset is transformed affinely based on the other subset. This preserves invertibility while enabling expressive transformations. Each coupling layer has tractable Jacobian determinant (a product of diagonal elements), and stacking layers increases expressiveness.
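A minimal sketch of one affine coupling layer follows. The `tanh` and linear conditioners stand in for the neural networks used in practice, and all names and sizes are assumptions. The finite-difference Jacobian at the end is only a check that the closed-form log-determinant (the sum of log-scales) is right.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 4, 2                                   # total dimension and split point
Ws = rng.normal(size=(D - d, d))
Wt = rng.normal(size=(D - d, d))

def scale_and_shift(x1):
    """Stand-in conditioners; any function of x1 keeps the layer invertible."""
    return np.tanh(Ws @ x1), Wt @ x1

def coupling_forward(x):
    x1, x2 = x[:d], x[d:]
    s, t = scale_and_shift(x1)
    y = np.concatenate([x1, x2 * np.exp(s) + t])
    return y, np.sum(s)                       # log|det J| = sum of log-scales

def coupling_inverse(y):
    y1, y2 = y[:d], y[d:]
    s, t = scale_and_shift(y1)                # y1 equals x1, so s and t are recoverable
    return np.concatenate([y1, (y2 - t) * np.exp(-s)])

x = rng.normal(size=D)
y, log_det = coupling_forward(x)
x_rec = coupling_inverse(y)

# Central finite differences: column j holds the derivative with respect to x_j.
h = 1e-6
J = np.stack(
    [(coupling_forward(x + h * e)[0] - coupling_forward(x - h * e)[0]) / (2 * h)
     for e in np.eye(D)],
    axis=1,
)
sign, numeric_log_det = np.linalg.slogdet(J)
print(log_det, numeric_log_det)
```

Note that the inverse never inverts the conditioners themselves, which is why they can be arbitrary networks.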
Sampling and Likelihood
Coupling-layer flows enable efficient sampling: draw z ~ p_Z, then compute x = f(z) in a single forward pass. Likelihood computation is equally efficient, requiring one pass through f⁻¹. (Autoregressive flows are fast in only one of the two directions.) This dual tractability makes flows well suited to tasks requiring both sampling and density estimation, such as variational inference or likelihood-based model selection.
Limitations
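The invertibility and tractable-Jacobian constraints limit the expressiveness of each layer, and the latent z must have the same dimensionality as x. (Additive coupling layers, as in NICE, are also volume-preserving, with Jacobian determinant exactly 1; RealNVP's affine coupling lifts that restriction.) Flows therefore require deep architectures to model complex distributions, increasing computational cost. Extensions such as neural spline flows and less constrained transformations address these limitations, at additional computational cost.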
GANs formulate generative modeling as a two-player game. A generator G learns to produce samples from noise, while a discriminator D learns to distinguish real samples from generated ones. The objective is: min_G max_D V(D, G) = 𝔼_x[log D(x)] + 𝔼_z[log(1 - D(G(z)))]. At the Nash equilibrium, G matches the data distribution and D cannot distinguish them.
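The value function can be evaluated exactly when data and generator distributions live on a small finite set. The three-outcome distributions below are made-up numbers for illustration. For a fixed G, the discriminator that maximizes V is D*(x) = p_data(x) / (p_data(x) + p_g(x)).

```python
import numpy as np

def value(D, p_data, p_g):
    """V(D, G) over a finite set; D[k] is the discriminator output on outcome k."""
    return np.sum(p_data * np.log(D)) + np.sum(p_g * np.log(1.0 - D))

p_data = np.array([0.5, 0.3, 0.2])
p_g = np.array([0.2, 0.3, 0.5])            # a generator that has not converged

D_star = p_data / (p_data + p_g)           # optimal discriminator for this G
v_star = value(D_star, p_data, p_g)
v_uninformed = value(np.full(3, 0.5), p_data, p_g)

# At equilibrium p_g = p_data, D* is 1/2 everywhere and V = -log 4.
v_eq = value(np.full(3, 0.5), p_data, p_data)
print(v_star, v_uninformed, v_eq, -np.log(4.0))
```

The gap between `v_star` and -log 4 is twice the Jensen-Shannon divergence between p_data and p_g, which is the quantity the generator implicitly minimizes.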
Training Dynamics
The adversarial setup creates delicate dynamics. G learns only through D's gradients, so the two networks must stay roughly balanced. A poorly trained discriminator provides uninformative gradients, slowing G's training. A discriminator that becomes too accurate saturates the log(1 - D(G(z))) term, and G's gradients vanish. Mode collapse (G generates only a few modes) is a further endemic problem. Together these make the original formulation unstable to train.
Wasserstein GANs and Improvements
Wasserstein distance provides a more stable training objective than binary classification. WGAN-GP (gradient penalty) enforces Lipschitz continuity without weight clipping. Spectral normalization constrains the discriminator's Lipschitz constant. These improvements enable more stable training and higher-quality samples on complex datasets.
Implicit Distribution Learning
Unlike VAEs and flows, GANs do not explicitly model p(x). Instead, G implicitly defines a distribution through its sampling procedure. This avoids intractable likelihood computation but prevents direct likelihood-based evaluation. GANs excel at perceptual quality but provide no density estimates and, because of mode collapse, often have weaker mode coverage.
Conditional Generation and Applications
Conditional GANs (cGANs) enable class-conditional generation: G and D both receive class labels. StyleGAN introduces style injection for fine-grained control. Pix2Pix and CycleGAN enable image-to-image translation. These extensions demonstrate GANs' versatility for structured generation tasks.
Energy-based models (EBMs) define distributions via energy functions: p(x) = exp(-E(x)) / Z. The energy function assigns low values to high-probability regions and high values to low-probability regions. Maximum likelihood learning requires the partition function Z = ∫ exp(-E(x)) dx, or at least its gradient, which is intractable in high dimensions. Contrastive divergence approximates that gradient with samples from short Markov chain Monte Carlo runs initialized at training data.
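In one dimension the partition function can still be computed by brute force on a grid, which makes the definition concrete. The two energy functions and the helper name `normalized_density` are assumptions for illustration. The same grid approach needs exponentially many points as the dimension grows, which is the intractability referred to above.

```python
import numpy as np

def normalized_density(energy, grid):
    """Return p(x) = exp(-E(x)) / Z on a uniform 1-D grid, with Z from a Riemann sum."""
    dx = grid[1] - grid[0]
    unnorm = np.exp(-energy(grid))
    Z = np.sum(unnorm) * dx
    return unnorm / Z, Z

grid = np.linspace(-10.0, 10.0, 20_001)
dx = grid[1] - grid[0]

# Quadratic energy: the model is a standard normal, so Z should be sqrt(2 * pi).
_, Z_gauss = normalized_density(lambda x: 0.5 * x ** 2, grid)

# Double-well energy: two low-energy (high-probability) regions near x = -1 and x = +1.
p_dw, Z_dw = normalized_density(lambda x: (x ** 2 - 1.0) ** 2, grid)
mode = grid[int(np.argmax(p_dw))]
print(Z_gauss, np.sqrt(2 * np.pi), mode)
```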
Score Matching and Score-Based Models
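Score-based models learn the score function ∇_x log p(x), the gradient of the log-density with respect to x. Because Z does not depend on x, the score of an EBM is -∇_x E(x) and Z drops out. Score matching minimizes the expected squared difference between the model score and the data score. The data score is unknown, but integration by parts (or the denoising variant, which matches scores of noise-perturbed data) yields an equivalent objective that depends only on the model. Denoising and sliced variants make training practical on high-dimensional data.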
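A one-dimensional sketch of the integration-by-parts form of score matching: the loss E[ds/dx + ½ s(x)²] involves only the model score and its derivative, never the data score or Z. The Gaussian-shaped score model s(x) = -(x - m)/v and the grid search over v are assumptions made so that the answer can be checked against the sample variance.

```python
import numpy as np

def implicit_score_matching_loss(x, m, v):
    """Score matching loss for the 1-D model score s(x) = -(x - m) / v."""
    score = -(x - m) / v
    d_score_dx = -1.0 / v                 # derivative of the score with respect to x
    return np.mean(d_score_dx + 0.5 * score ** 2)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=50_000)

m_hat = x.mean()
v_grid = np.linspace(0.5, 5.0, 451)       # candidate variances, spacing 0.01
losses = [implicit_score_matching_loss(x, m_hat, v) for v in v_grid]
v_best = v_grid[int(np.argmin(losses))]
print(v_best, x.var())
```

The minimizer lands on the sample variance even though the loss never evaluates a normalized density.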
Langevin Dynamics and Sampling
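Sampling from score-based models uses Langevin dynamics: x_{t+1} = x_t + (ε/2)∇ log p(x_t) + √ε ζ_t, where ε > 0 is the step size and ζ_t ~ N(0, I) is Gaussian noise. This iterative process requires many steps (often 1000+) to mix thoroughly. However, the framework is flexible and theoretically grounded: under mild regularity conditions, the iterates converge to samples from p(x) as ε → 0 and the number of steps grows.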
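The update rule can be tried on a target whose score is known exactly. In this sketch the Gaussian target, step size, and chain count are illustrative assumptions; a score-based model would substitute a learned network for `score`.

```python
import numpy as np

def langevin_sample(score, x0, step, n_steps, rng):
    """Unadjusted Langevin dynamics, applied to many chains in parallel."""
    x = x0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + 0.5 * step * score(x) + np.sqrt(step) * noise
    return x

mu, sigma = 3.0, 0.5

def score(x):
    return -(x - mu) / sigma ** 2          # exact score of N(mu, sigma^2)

rng = np.random.default_rng(0)
x0 = rng.uniform(-5.0, 5.0, size=5000)     # 5000 independent chains, arbitrary start
samples = langevin_sample(score, x0, step=0.01, n_steps=1000, rng=rng)
print(samples.mean(), samples.std())
```

With a finite step size the chains settle on a distribution slightly wider than the target. Shrinking the step reduces that bias at the cost of more iterations.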
Connections to Diffusion Models
Diffusion and score-based models are deeply connected. Diffusion models can be viewed as learning scores for perturbed distributions at different noise levels. Song et al. showed that the reverse SDE (stochastic differential equation) of diffusion is determined by score functions, unifying multiple generative frameworks under one mathematical structure.
Advantages and Challenges
Score-based models offer theoretical elegance and flexibility. They scale well to high-dimensional data and avoid mode-collapse issues. However, sampling is slow due to iterative procedures. Recent work combines score-based and flow-based approaches to achieve both efficiency and flexibility.
Diffusion models define a forward process that gradually adds Gaussian noise to data: q(x_t|x_0) = N(x_t; √(ᾱ_t) x_0, (1-ᾱ_t) I), or equivalently x_t = √(ᾱ_t) x_0 + √(1-ᾱ_t) ε with ε ~ N(0, I), where ᾱ_t = ∏_{s=1}^t (1-β_s) and β_t is a fixed or learnable noise schedule. After sufficiently many steps, x_T is approximately distributed as N(0, I). Learning involves training a neural network p_θ(x_{t-1}|x_t) to reverse this process, predicting either x_0, the noise ε_θ(x_t, t), or the score ∇ log p(x_t).
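The closed form for the forward process means x_t can be drawn directly from x_0 without simulating the intermediate steps. The sketch below assumes a linear β schedule from 1e-4 to 0.02 over T = 1000 steps (the setting used by Ho et al., 2020) and a toy constant "dataset"; the helper name `q_sample` is illustrative.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)         # alpha_bar[t-1] = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

rng = np.random.default_rng(0)
x0 = np.full(100_000, 5.0)                  # toy data: every coordinate equals 5
x_mid, _ = q_sample(x0, 250, rng)
x_T, _ = q_sample(x0, T, rng)
print(alpha_bar[-1], x_mid.mean(), x_T.mean(), x_T.std())
```

By the final step the signal coefficient √ᾱ_T is close to zero, so x_T is nearly indistinguishable from standard Gaussian noise regardless of x_0.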
DDPM: Denoising Diffusion Probabilistic Models
DDPM (Ho et al., 2020) formulated diffusion as a latent variable model with a Markov chain: p_θ(x_{0:T}) = p(x_T) ∏_{t=1}^{T} p_θ(x_{t-1}|x_t). Training maximizes a variational bound on log p_θ(x_0), which, up to per-step weighting, reduces to minimizing the noise prediction error at each step. DDPM achieved impressive image generation results, particularly on CIFAR-10 and CelebA, rivaling GANs in quality.
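Written out, the noise prediction objective in its commonly used simplified (unweighted) form is shown below. The notation follows the forward process defined earlier; sampling t uniformly is the usual convention and is stated here as an assumption.

```latex
L_{\text{simple}}(\theta)
  = \mathbb{E}_{t,\,x_0,\,\epsilon}
    \left[ \left\lVert \epsilon
      - \epsilon_\theta\!\left( \sqrt{\bar\alpha_t}\,x_0
      + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t \right) \right\rVert^2 \right],
\qquad t \sim \mathrm{Uniform}\{1,\dots,T\},
\quad \epsilon \sim \mathcal{N}(0, I)
```

Each training step therefore needs only one noised sample and one network evaluation, not a pass through the whole chain.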
Scalability and DDIM
DDPM requires many sampling steps (often 1000), making generation slow. DDIM (Denoising Diffusion Implicit Models) accelerates sampling by replacing the stochastic reverse steps with a deterministic trajectory that can skip timesteps, reducing the step count to 50-100 with modest quality loss and without retraining. This made diffusion models far more practical, although 50-100 network evaluations per sample is still slower than a single GAN forward pass.
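A sketch of the deterministic DDIM update on a 50-step subsequence of a 1000-step schedule. The linear schedule is an assumption, and since no trained network is available here, the true noise is used as an oracle in place of ε_θ(x_t, t). With a perfect noise prediction the trajectory lands exactly on x_0, which gives a simple correctness check.

```python
import numpy as np

T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))   # assumed linear schedule

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0) from noise level abar_t to abar_prev."""
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_pred

rng = np.random.default_rng(0)
x0 = rng.normal(size=8)
eps = rng.standard_normal(8)
x = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1.0 - alpha_bar[-1]) * eps   # x_T

# Visit only 50 of the 1000 timesteps. A trained model would supply eps_pred;
# here the true noise stands in as an oracle.
timesteps = np.linspace(T - 1, 0, 50).astype(int)
for i, t in enumerate(timesteps):
    abar_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
    x = ddim_step(x, eps, alpha_bar[t], abar_prev)

print(np.max(np.abs(x - x0)))
```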
Discrete and Structured Diffusion
While continuous diffusion targets images and audio, discrete diffusion models extend to categorical sequences and structured data. D3PM (Discrete Denoising Diffusion Probabilistic Models) replaces Gaussian noise with categorical transition matrices, for example uniform or absorbing-state (masking) corruption. These extensions enable diffusion models for text, molecules, and protein sequences.
Recent Advances: Stable Diffusion, DALL-E 3
Recent models like Stable Diffusion (which runs diffusion in the latent space of a pretrained autoencoder) and DALL-E 3 achieve photorealistic image generation from text. Training on very large datasets (up to billions of image-text pairs) with careful engineering yields high sample quality and close adherence to the text prompt. Diffusion models have largely displaced GANs as the default approach for visual generation.
Evaluating generative models is non-trivial because no single metric captures all aspects of quality. Different models excel at different objectives. A model with excellent likelihood may produce blurry samples; one with sharp samples may cover only a few modes of the data. Comprehensive evaluation requires multiple metrics aligned with specific use cases.
Likelihood-Based Metrics
Models with a tractable p(x) (autoregressive models, flows) can report exact negative log-likelihood (NLL) or bits per dimension (BPD); VAEs report an upper bound on NLL through the ELBO. Lower is better. Likelihood directly assesses density estimation and is interpretable: -log₂ p(x) is the number of bits needed to encode x under the model. However, likelihood alone doesn't guarantee perceptual quality; a model with good likelihood can still produce blurry images.
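Converting between NLL in nats and bits per dimension is a one-line calculation: divide by the number of dimensions and by ln 2. The image size and NLL values below are hypothetical, picked only to make the arithmetic easy to follow.

```python
import numpy as np

def bits_per_dim(nll_nats, n_dims):
    """Convert a per-example NLL in nats to bits per dimension."""
    return nll_nats / (n_dims * np.log(2.0))

n_dims = 32 * 32 * 3                           # e.g. a 32x32 RGB image

# Baseline: a uniform distribution over 256 intensity levels costs 8 bits per dimension.
nll_uniform = n_dims * np.log(256.0)
bpd_uniform = bits_per_dim(nll_uniform, n_dims)

# A hypothetical model NLL that corresponds to 3 bits per dimension.
nll_model = 3.0 * n_dims * np.log(2.0)
bpd_model = bits_per_dim(nll_model, n_dims)
print(bpd_uniform, bpd_model)
```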
Sample Quality Metrics: IS and FID
Inception Score (IS) measures sample quality by passing generated images through a pre-trained Inception-v3 classifier. High IS indicates that individual samples are confidently classified and that the predicted classes are diverse across samples. Fréchet Inception Distance (FID) fits a Gaussian to Inception-v3 features of real images and another to features of generated images, then computes the Fréchet distance between the two; lower is better. FID is more robust and preferred for comparing models because, unlike IS, it uses real data. Both rely on ImageNet-trained features and may be unreliable for domains far from natural images.
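The distance at the core of FID can be sketched on arbitrary feature vectors. This snippet assumes the features have already been extracted (random vectors stand in for Inception-v3 activations) and covers only the Fréchet distance between the two Gaussian fits: ‖μ_a - μ_b‖² + Tr(Σ_a + Σ_b - 2(Σ_a Σ_b)^{1/2}). The function name is illustrative.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (n_samples, n_features) arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) is the sum of square roots of the eigenvalues of
    # cov_a @ cov_b, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals, 0.0, None)))
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
feats = rng.normal(size=(2000, 5))               # stand-in for real-image features
shift = np.array([1.0, 0.0, 0.0, 0.0, 2.0])

d_same = frechet_distance(feats, feats)          # identical sets
d_shift = frechet_distance(feats, feats + shift) # same covariance, shifted mean
print(d_same, d_shift)
```

When only the mean differs, the covariance terms cancel and the distance reduces to the squared norm of the shift, here 1² + 2² = 5.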
Mode Coverage: Precision and Recall
Precision measures what fraction of generated samples lie close to the real data. Recall measures what fraction of the real data distribution is covered by generated samples. A model can achieve high precision with low recall by generating a few high-quality samples (mode collapse), or high recall with low precision by covering all modes with low-quality samples. Simultaneously optimizing both requires careful training.
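One common way to estimate these quantities is with k-nearest-neighbor balls, in the spirit of Kynkäänniemi et al. (2019). The sketch below is a simplified version that works on raw 2-D points instead of deep features; the two-cluster data, the value k = 3, and the function names are assumptions for illustration.

```python
import numpy as np

def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbor within the same set."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]            # column 0 is the point itself

def coverage(queries, support, k=3):
    """Fraction of queries inside at least one k-NN ball around a support point."""
    radii = knn_radii(support, k)
    d = np.linalg.norm(queries[:, None, :] - support[None, :, :], axis=-1)
    return float(np.mean(np.any(d <= radii[None, :], axis=1)))

rng = np.random.default_rng(0)
real = np.concatenate([
    rng.normal([-5.0, 0.0], 0.5, size=(500, 2)),     # real mode 1
    rng.normal([5.0, 0.0], 0.5, size=(500, 2)),      # real mode 2
])
fake = rng.normal([5.0, 0.0], 0.5, size=(1000, 2))   # mode-collapsed generator

precision = coverage(fake, real)    # generated samples that lie near real data
recall = coverage(real, fake)       # real samples that lie near generated data
print(precision, recall)
```

The collapsed generator scores high precision because everything it produces looks real, but recall stays near one half because one of the two real modes is never generated.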
Human Evaluation and Domain-Specific Metrics
Human evaluation remains the gold standard for perceptual quality. Observers rate sample quality, realism, and diversity. For specific domains (medical imaging, code generation), task-specific metrics matter more than generic ones. Downstream task performance (e.g., using generated data as training data) offers practical evaluation. Best practice: combine multiple metrics, use human studies, and visualize samples for thorough assessment.
Course Materials
Foundational Papers by Topic
Autoregressive Models
Variational Autoencoders
Normalizing Flows
Generative Adversarial Networks
Score-Based & Diffusion Models
Evaluation Metrics