Generative models learn to represent and sample from data distributions. Rather than predicting labels or features, they estimate p(x)—the probability density of observed data. This enables generation of novel samples and provides rich representations useful for downstream tasks.
At their core, generative models pursue density estimation through learned parametric distributions. Whether through autoencoders, flow models, or diffusion processes, the goal remains consistent: discover latent structure in high-dimensional data and synthesize realistic continuations.
02
Taxonomy of Approaches
Generative modeling encompasses diverse algorithmic families with distinct tradeoffs. Autoregressive models factor joint distributions as products of conditionals; latent variable models introduce unobserved structure; flow-based models apply invertible transformations; implicit models skip explicit likelihoods entirely.
Each approach balances tractability, expressiveness, and sampling efficiency. Understanding this taxonomy clarifies why certain methods suit certain domains—transformers for text, score-based diffusion for images, normalizing flows for molecular design.
03
Maximum Likelihood Principle
Maximum likelihood estimation remains the dominant training objective. By maximizing log p_θ(x), we implicitly minimize KL divergence from data to model, pushing learned distributions toward empirical observations. This principled approach justifies why MLE works across nearly all modern architectures.
Alternative divergences exist—Wasserstein, f-divergences, reverse KL in VAEs—each reflecting different geometric assumptions. Yet forward KL via MLE dominates due to computational tractability and its natural alignment with our information-theoretic intuitions about learning.
04
Sampling & Generation
Sampling lies at the heart of generative modeling's utility. Ancestral sampling proceeds autoregressively through learned conditional distributions; rejection sampling accepts proposals with probability proportional to the ratio of target to proposal density; MCMC explores high-dimensional spaces through iterative refinement.
Each technique trades off computational cost, mixing time, and sample quality. Modern diffusion models reframe sampling as reversing a noise schedule; transformer-based models leverage top-k filtering; adversarial approaches rely on implicit sampling through generator networks.
05
Evaluation Metrics
Evaluating generative models requires both computational metrics and perceptual quality measures. Log-likelihood gauges model fit; Fréchet Inception Distance captures perceptual similarity; Inception Score rewards confident, diverse predictions; precision-recall balances mode coverage against spurious novel modes.
No single metric suffices. Likelihood alone ignores sample quality; FID suits images but not all modalities; human evaluation remains essential. Modern benchmarks combine multiple metrics, acknowledging that generative quality involves multiple dimensions of fidelity.
06
Computational Tradeoffs
Computing exact log-likelihoods is intractable for most architectures. Autoregressive models achieve tractability through factorization; flow models via Jacobian determinants; VAEs through learned variational bounds. Implicit models abandon likelihood entirely, complicating evaluation but enabling flexible expressiveness.
Latent variable models trade approximate inference for expressive posteriors. Recognizing this spectrum—from fully tractable autoregressive to fully implicit GANs—explains architectural choices and guides practitioners toward methods matching their constraints and requirements.
07
Historical Context
Generative modeling's modern renaissance builds on decades of foundational work. Boltzmann machines and RBMs pioneered energy-based approaches; deep belief networks unified supervised and unsupervised learning; VAEs and GANs introduced latent variable models at scale; transformers and diffusion models now dominate.
Each era discovered fundamental principles—the variational autoencoder's latent space trade-off, the adversarial game's nonconvergence challenges, diffusion's connection to score matching. These aren't historical footnotes but active principles guiding current research directions.
08
Course Roadmap
This course spans the generative modeling landscape from foundations through modern applications. Weeks 1-3 establish core concepts: density estimation, MLE, sampling, and evaluation. Weeks 4-6 build toward powerful architectures: autoencoders, flows, and diffusion. Weeks 7-8 synthesize with adversarial training and applications.
Each approach's tradeoffs become clear through hands-on implementation. Autoregressive models train quickly but sample slowly; flow models offer exact likelihood but limited expressiveness; diffusion models match state-of-the-art image generation despite iterative sampling. Understanding these tensions prepares you for designing novel architectures.
09
References & Further Reading
This course draws on seminal papers, textbooks, and pedagogical resources spanning decades of generative modeling research. The references below provide both foundational theory and modern implementations, enabling deeper exploration of each architectural family and their applications.
Key resources include Stanford's course materials, classic papers introducing VAEs, GANs, and diffusion models, as well as contemporary blog posts and tutorials that bridge theory and practice. This section guides further learning beyond lecture content.
Section 01
What Are Generative Models
A generative model p_θ(x) estimates the probability distribution of observed data. Unlike discriminative models that learn p(y|x), generative models learn the full joint distribution, enabling both inference and sample synthesis. This fundamental difference opens entire research directions unavailable to supervised learning alone.
Density estimation connects to representation learning through shared optimization goals. Learning p(x) forces the model to discover meaningful structure—latent factors of variation, hierarchical abstractions, compositional building blocks—that compress data distribution into learned parameters.
Core Concepts
Generative models pursue three interconnected objectives: fit observed data distribution through tractable p_θ(x), enable sample synthesis from p_θ, and discover latent representations explaining data variation. These goals sometimes conflict, creating the design tensions that make generative modeling challenging and interesting.
The probability distribution p(x) has no simple closed form for high-dimensional data like images or text. Generative models approximate it through neural network parameterization, learning transformations from simple noise distributions to complex data distributions. The learned map reveals data structure.
Density Estimation
Learning p_θ(x) through parametric models on high-dimensional data spaces with flexible neural networks.
Sample Synthesis
Drawing realistic samples from learned distributions to generate novel, coherent data points matching training distribution.
Representation Learning
Discovering latent factors of variation and hierarchical structure through unsupervised distribution learning.
Likelihood Optimization
Maximizing model likelihood as principled objective for learning distributions that both fit and generalize well.
Key Distinctions
Generative vs. discriminative is fundamental. Discriminative models learn decision boundaries; generative models learn full distributions. This enables generative models to perform conditional generation (p(x|z)), imputation, and reasoning about unobserved variables—capabilities invisible to classifiers.
Explicit vs. implicit models matter computationally. Explicit models have tractable densities p_θ(x); implicit models define distributions through sampling procedures. GANs exemplify implicit models—no closed-form likelihood, only a learned sampler. This choice cascades through architecture and evaluation decisions.
Foundation Principle
Generative modeling is unsupervised learning of data distributions. Everything else—sampling, synthesis, representation learning—flows from this core mission to discover what makes data probable.
Learning Objectives
Maximum likelihood drives most architectures: minimize -E_x[log p_θ(x)] across training data. Alternative objectives like adversarial losses or divergence-based criteria exist, but MLE's information-theoretic grounding and optimization tractability keep it dominant.
Section 02
Taxonomy of Approaches
Generative modeling comprises several architectural families, each making distinct choices about tractability and expressiveness. Understanding this landscape prevents wasted effort on mismatched approaches and clarifies why certain architectures dominate specific domains.
The taxonomy's axes are tractability (explicit vs. implicit likelihood) and conditioning (autoregressive, latent-variable, flow-based). Each combination yields models with characteristic strengths—some train easily but sample slowly; others sample quickly but train unpredictably.
Autoregressive Models
Autoregressive models factor joint distributions as products of conditionals: p(x) = ∏_i p(x_i | x_{<i}). Transformers, PixelCNN, and WaveNet exemplify this approach. Likelihood is tractable (just sum the log-conditionals), but sampling requires sequential generation, which slows inference.
Autoregressive remains the workhorse for text (transformers), audio (WaveNet), and increasingly images (ViT-based models). Sequential generation means each token depends on previous tokens, capturing long-range dependencies through attention mechanisms.
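As a minimal sketch of the factorization, assume a toy first-order model over a three-symbol vocabulary. The tables `p_first` and `p_next` are hypothetical stand-ins for the conditionals a neural network would produce; a real autoregressive model conditions on the whole prefix rather than only the previous symbol.

```python
import numpy as np

# Hypothetical toy model: 3 symbols, each conditional depends only on the previous symbol.
p_first = np.array([0.5, 0.3, 0.2])            # p(x_1)
p_next = np.array([[0.7, 0.2, 0.1],            # p(x_i | x_{i-1} = 0)
                   [0.1, 0.8, 0.1],            # p(x_i | x_{i-1} = 1)
                   [0.3, 0.3, 0.4]])           # p(x_i | x_{i-1} = 2)

def log_likelihood(seq):
    """Sum of log-conditionals: log p(x) = sum_i log p(x_i | x_{<i})."""
    total = np.log(p_first[seq[0]])
    for prev, cur in zip(seq[:-1], seq[1:]):
        total += np.log(p_next[prev, cur])
    return total

seq = [0, 0, 1, 1]
ll = log_likelihood(seq)   # log(0.5 * 0.7 * 0.2 * 0.8)
```

Evaluating the likelihood needs only one pass over the sequence, which is why training parallelizes well even though sampling is sequential.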
Latent Variable Models
Latent variable models introduce unobserved variables z to explain data: p(x) = ∫ p(x|z)p(z) dz. VAEs use learned inference networks to approximate posterior q(z|x); GANs skip inference, learning only generator p_θ(x|z). This flexibility enables powerful expressiveness.
The cost: the posterior becomes intractable, requiring approximation (VAE) or abandoning likelihood (GAN). VAE training is more stable than GAN training thanks to the principled ELBO objective, but VAEs typically underperform newer approaches on likelihood benchmarks.
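The bound that VAEs optimize can be written out as follows. This is the standard decomposition, stated here with an assumed notation: q_φ(z|x) for the inference network and p_θ(x|z) for the decoder.

```latex
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q_\phi(z\mid x)}\big[\log p_\theta(x\mid z)\big]
      - D_{\mathrm{KL}}\big(q_\phi(z\mid x)\,\|\,p(z)\big)}_{\text{ELBO}(x;\,\theta,\phi)}
  + D_{\mathrm{KL}}\big(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)\big)
  \;\ge\; \text{ELBO}(x;\,\theta,\phi)
```

The gap between the log-likelihood and the ELBO is exactly the KL divergence from the approximate posterior to the true posterior, so a better inference network tightens the bound.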
Flow-Based Models
Normalizing flows transform simple distributions through invertible functions: x = f_θ(z), z ~ p(z), where f_θ is invertible. Likelihoods become tractable via change of variables: log p_θ(x) = log p(z) - log |det ∂f_θ/∂z|, with z = f_θ^{-1}(x). For a composition of K layers, the log-determinant is a sum over layers, ∑_k log |det ∂f_k/∂z_{k-1}|. Sampling remains fast (one forward pass).
Flows trade expressiveness for tractability—computing Jacobian determinants constrains architectures. Real NVP, Glow, and flow matching represent successful instantiations, excelling in low-dimensional problems and molecular design where Jacobian computation remains feasible.
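A minimal sketch of the change-of-variables computation, assuming a hypothetical one-dimensional affine flow x = a·z + b with a standard normal base distribution. Real flows stack many learned invertible layers, but the bookkeeping is the same.

```python
import numpy as np

def standard_normal_logpdf(z):
    return -0.5 * (z ** 2 + np.log(2 * np.pi))

# Hypothetical flow parameters; any a != 0 keeps the map invertible.
a, b = 2.0, 1.0

def flow_log_prob(x):
    """log p(x) = log p_z(f^{-1}(x)) - log |df/dz|."""
    z = (x - b) / a                   # invert the flow
    log_det = np.log(np.abs(a))       # log-determinant of the (1x1) Jacobian
    return standard_normal_logpdf(z) - log_det

def flow_sample(rng, n):
    """Sampling is a single forward pass through f."""
    return a * rng.standard_normal(n) + b

rng = np.random.default_rng(0)
xs = flow_sample(rng, 10_000)
```

Here the result can be checked by hand: the flow maps N(0, 1) to N(b, a²), so `flow_log_prob` should agree with the Gaussian log-density with mean 1 and standard deviation 2.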
Score-Based & Diffusion Models
Diffusion models reverse noise corruption: iteratively denoise corrupted data back to clean samples. Score-based models learn gradients of log-density; diffusion models learn denoising functions directly. Both reframe sampling as Markov chain traversal through a reverse schedule.
These models have emerged as state-of-the-art for image generation (DDPM, EDM, Consistency Models). Sampling requires many steps but achieves unprecedented quality-diversity tradeoffs. Training is stable and unsupervised, avoiding GAN instability and VAE approximation gaps.
Implicit Models (GANs)
Implicit models define distributions through sampling procedures rather than explicit densities. GANs pit generator against discriminator in an adversarial game. No likelihood computation, but expressiveness remains unmatched—generators learn highly realistic distributions.
Training instability and mode collapse plagued early GANs. Architectural advances (Spectral Normalization, Progressive Growing, StyleGAN) and alternative objectives (Wasserstein distance, relativistic loss) stabilized training but didn't eliminate sensitivity to hyperparameters.
Advantages
Each approach excels in specific domains
Multiple families reduce architectural lock-in
Hybrid approaches combine complementary strengths
Tradeoffs
No universal best approach—context dependent
Choosing wisely requires understanding each family's assumptions
Implementation complexity varies dramatically
Section 03
Maximum Likelihood Principle
Maximum likelihood estimation drives modern generative modeling. Given data D = {x_1, ..., x_n}, we maximize L(θ) = ∑_i log p_θ(x_i). This simple objective has profound implications: it minimizes KL divergence from data to model, grounding MLE in information-theoretic principles.
The KL divergence identity D_KL(p_data || p_θ) = -E_x[log p_θ(x)] - H(p_data) shows that minimizing the negative log-likelihood is equivalent to pushing the model toward the data distribution. The entropy term H(p_data) does not depend on θ, so it is irrelevant for optimization but important conceptually.
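A small numerical check of how KL divergence splits into cross-entropy minus entropy, using two hypothetical discrete distributions. Only the cross-entropy term depends on the model, which is why minimizing negative log-likelihood minimizes the KL divergence.

```python
import numpy as np

p_data = np.array([0.5, 0.3, 0.2])     # hypothetical data distribution
p_model = np.array([0.4, 0.4, 0.2])    # hypothetical model distribution

kl = np.sum(p_data * np.log(p_data / p_model))        # D_KL(p_data || p_model)
cross_entropy = -np.sum(p_data * np.log(p_model))     # -E_x[log p_model(x)]
entropy = -np.sum(p_data * np.log(p_data))            # H(p_data), independent of the model

# kl equals cross_entropy - entropy up to floating-point error.
gap = kl - (cross_entropy - entropy)
```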
KL Divergence Equivalence
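KL divergence measures distributional discrepancy: D_KL(P||Q) = E_p[log(p/q)]. It is asymmetric, unlike a true distance. Forward KL D_KL(data||model) penalizes regions where data has high probability but the model does not (mode-covering, also called mass-covering). Reverse KL D_KL(model||data) penalizes the opposite, regions where the model places mass but the data does not (mode-seeking).
This asymmetry matters profoundly. Forward KL forces the model to cover every data mode, even at the cost of placing mass between modes; reverse KL lets the approximation ignore modes and concentrate on one. Maximizing the VAE's ELBO minimizes a reverse KL between the approximate and true posteriors while still maximizing a likelihood bound in data space; together with simple decoder distributions, this is a common explanation for the characteristic "fuzzy" VAE generations compared with GANs, whose original objective corresponds to a Jensen-Shannon divergence rather than forward KL.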
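The asymmetry can be seen in a small experiment: fit a single Gaussian to a bimodal target under each direction of the KL divergence. This is a sketch under stated assumptions: the target mixture, the integration grid, and the brute-force parameter search are all hypothetical choices made for illustration.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 2001)     # integration grid
dx = x[1] - x[0]

def log_normal(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

# Hypothetical bimodal target: equal mixture of N(-3, 0.5^2) and N(3, 0.5^2).
log_p = np.logaddexp(log_normal(x, -3.0, 0.5), log_normal(x, 3.0, 0.5)) + np.log(0.5)
p = np.exp(log_p)

def forward_kl(mu, sigma):
    """D_KL(p || q): expectation under the target."""
    log_q = log_normal(x, mu, sigma)
    return np.sum(p * (log_p - log_q)) * dx

def reverse_kl(mu, sigma):
    """D_KL(q || p): expectation under the approximation."""
    log_q = log_normal(x, mu, sigma)
    return np.sum(np.exp(log_q) * (log_q - log_p)) * dx

# Brute-force search over a small grid of Gaussian parameters.
grid = [(mu, s) for mu in np.linspace(-4, 4, 33) for s in np.linspace(0.25, 5.0, 20)]
fwd_mu, fwd_sigma = min(grid, key=lambda g: forward_kl(*g))
rev_mu, rev_sigma = min(grid, key=lambda g: reverse_kl(*g))
```

Under these assumptions the forward-KL fit is a wide Gaussian centered between the modes (covering both), while the reverse-KL fit is a narrow Gaussian sitting on a single mode.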
Why MLE Dominates
MLE's dominance stems from four properties: first, principled grounding in information theory; second, computational tractability (works with standard autodiff); third, established statistical guarantees; fourth, compatibility with virtually all architectures.
Alternative divergences and losses exist. Wasserstein distances suit high-dimensional spaces where KL becomes uninformative. Chi-squared divergences handle rare events. Yet MLE's combination of simplicity, principled justification, and empirical success keeps it standard across domains.
Optimization Landscape
MLE optimization is nonconvex for neural networks, but several factors make it tractable. First, overparameterization—neural networks typically have enough capacity to fit training data, making the landscape benign. Second, stochastic gradient descent with momentum, Adam, and other adaptive methods handle high-dimensionality well.
Challenges arise in specific architectures. GANs optimize a min-max game rather than pure MLE, creating instability. VAEs optimize lower bounds (ELBO) rather than likelihood directly, introducing approximation gaps. Diffusion models train denoising networks at many noise levels, creating coupled objectives.
Information Theory Principle
Maximizing likelihood equals minimizing the KL divergence from true data distribution to the learned model. This connects optimization to information-theoretic optimality, grounding practical algorithms in theoretical principles.
Forward KL (MLE)
Mode-covering behavior: penalty when model misses data. Used in autoregressive models, diffusion. Creates broad, diverse distributions that cover all data modes.
Reverse KL (VAE)
Mode-seeking: penalty when the approximation assigns mass where the target has little. Creates narrower approximate posteriors that may miss modes. Trade-off with expressiveness.
Wasserstein Distance
Optimal transport perspective. Better suited for high-dimensional spaces where KL becomes uninformative or infinite. Used in Wasserstein GANs.
Implicit Objectives
GANs and denoising score matching avoid explicit likelihood evaluation. They optimize adversarial or auxiliary objectives instead. More flexible but harder to analyze.
Practical Considerations
Numerical stability matters in likelihood computation. Computing log p(x) directly for images requires careful handling of mixed discrete-continuous distributions. Log-sum-exp tricks prevent underflow. For high-dimensional problems, likelihoods become extremely small—storing in log-space is essential.
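A minimal sketch of the log-sum-exp trick mentioned above, with hypothetical log-likelihood values chosen so that naive exponentiation underflows.

```python
import numpy as np

def logsumexp(a):
    """Compute log(sum(exp(a))) by factoring out the maximum before exponentiating."""
    a = np.asarray(a, dtype=float)
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

log_probs = np.array([-1000.0, -1001.0, -1002.0])    # hypothetical log-likelihoods

with np.errstate(divide="ignore"):
    naive = np.log(np.sum(np.exp(log_probs)))         # exp underflows to 0, so this is -inf

stable = logsumexp(log_probs)                          # finite, close to -999.59
```

The same pattern is used when marginalizing over mixture components or latent samples: keep everything in log-space and subtract the maximum before exponentiating.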
Batch size affects optimization. Small batches add gradient noise (sometimes helpful for escaping poor local minima); large batches yield lower-variance gradients at a higher cost per step. Learning rates interact with batch size, model architecture, and data scale, requiring careful tuning.
Section 04
Sampling & Generation
Sampling draws from learned distributions to generate novel data. Three main families exist: ancestral sampling follows conditional chains; rejection sampling accepts/rejects proposals; MCMC iteratively refines samples through Markov transitions. Each trades off computational cost, mixing time, and sample quality.
Modern generative modeling emphasizes different sampling regimes. Autoregressive models sample sequentially, drawing each token from its conditional distribution given the previous tokens. Diffusion models reverse noise through iterative denoising. Flow models sample in one forward pass. GANs define distributions implicitly through generator networks.
Ancestral Sampling
Ancestral sampling leverages factorization: sample x_1 ~ p(x_1), then x_2 ~ p(x_2|x_1), etc. For autoregressive models, this is the natural procedure. Transformers sampling text follow this: sample first token, then next token given prefix, iterating until stopping token.
This approach guarantees correctness: samples exactly follow the model distribution. But it is sequential: generating an N-dimensional vector requires N forward passes. For large N (e.g., 1024×1024 images), this becomes prohibitively slow. Top-k and nucleus sampling truncate each conditional to its most probable tokens; they trade exactness and diversity for sample quality rather than speed.
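The two truncation schemes can be sketched as follows, assuming a hypothetical next-token distribution `probs` that a model would normally produce from its logits. Both functions zero out the tail of the distribution and renormalize before sampling.

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = np.argsort(probs)[-k:]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def nucleus_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = np.argsort(probs)[::-1]               # tokens sorted from most to least probable
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1      # number of tokens kept
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])   # hypothetical next-token distribution
rng = np.random.default_rng(0)
token = rng.choice(len(probs), p=top_k_filter(probs, k=2))
```

Because the tail is discarded, samples no longer follow the model distribution exactly; that is the price paid for avoiding low-probability tokens.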
Rejection Sampling
Rejection sampling draws proposals x ~ q(x) and accepts each with probability p(x)/(M q(x)), where the constant M satisfies p(x) ≤ M q(x) for all x. It is efficient when the proposal closely matches the target, and accepted samples follow the target exactly. But for high-dimensional distributions M grows rapidly and acceptance rates plummet: most proposals get rejected.
Rarely used in modern deep learning due to poor scaling. Important theoretically: shows how to convert samples from simple distributions to complex ones. Variants like importance sampling and Metropolis-Hastings remain valuable in Bayesian inference.
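A minimal sketch of rejection sampling, assuming a hypothetical one-dimensional target (the Beta(2, 2) density) and a uniform proposal. The envelope constant `M` is chosen so that the target never exceeds `M` times the proposal density.

```python
import numpy as np

def target_pdf(x):
    return 6.0 * x * (1.0 - x)      # Beta(2, 2) density on [0, 1]

M = 1.5   # envelope constant: the maximum of target_pdf / proposal_pdf, attained at x = 0.5

def rejection_sample(rng, n):
    samples = []
    proposed = 0
    while len(samples) < n:
        x = rng.uniform()                          # proposal q = Uniform(0, 1), so q(x) = 1
        proposed += 1
        if rng.uniform() < target_pdf(x) / M:      # accept with probability p(x) / (M q(x))
            samples.append(x)
    return np.array(samples), n / proposed

rng = np.random.default_rng(0)
samples, acceptance_rate = rejection_sample(rng, 5000)
```

The expected acceptance rate is 1/M, about 0.67 here. In high dimensions the smallest valid M typically grows very quickly, which is the scaling problem described above.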
Markov Chain Monte Carlo
MCMC samples through iterative refinement: x_{t+1} ~ T(x_{t+1}|x_t) where T is transition kernel with stationary distribution p(x). Metropolis-Hastings, Gibbs sampling, Hamiltonian MC exemplify this. After burn-in period, samples approximately follow target distribution.
MCMC enables sampling from distributions we can only evaluate (not sample). High-dimensional posteriors in Bayesian inference rely on MCMC. But mixing time—iterations until convergence—grows exponentially with dimension in naive implementations, limiting practical applicability.
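A minimal random-walk Metropolis-Hastings sketch. The target here is a hypothetical unnormalized standard normal, chosen so the result is easy to check; the step size and burn-in length are illustrative assumptions, not tuned values.

```python
import numpy as np

def log_target(x):
    """Unnormalized log-density; only ratios of the target are ever needed."""
    return -0.5 * x ** 2

def metropolis_hastings(rng, n_steps, step_size=1.0, x0=0.0):
    x = x0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        proposal = x + step_size * rng.standard_normal()    # symmetric random-walk proposal
        log_accept = log_target(proposal) - log_target(x)
        if np.log(rng.uniform()) < log_accept:              # accept with probability min(1, ratio)
            x = proposal
        chain[t] = x                                        # a rejected step repeats the current state
    return chain

rng = np.random.default_rng(0)
chain = metropolis_hastings(rng, n_steps=20_000, x0=5.0)
samples = chain[2000:]     # discard burn-in
```

Note that the sampler only evaluates `log_target`, never samples from it directly, and that consecutive samples are correlated, so the effective sample size is smaller than the chain length.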
Diffusion Model Sampling
Diffusion models reframe generation as reversing a noise schedule. Forward process: x_t = √(1-β_t) x_{t-1} + √(β_t) ε with ε ~ N(0, I), i.e., q(x_t|x_{t-1}) = N(√(1-β_t) x_{t-1}, β_t I), gradually corrupts data. Reverse process: p_θ(x_{t-1}|x_t) denoises iteratively. Starting from pure noise, successive denoising recovers high-probability data.
This converts a hard sampling problem into tractable denoising. A neural network learns ε_θ(x_t, t), predicting the noise to remove. During sampling, iterate the DDPM update x_{t-1} = (x_t - β_t/√(1-ᾱ_t) ε_θ(x_t,t))/√(1-β_t) + σ_t z, where ᾱ_t = ∏_{s≤t} (1-β_s), z ~ N(0, I), and σ_t² = β_t is a common choice (no noise is added at the final step). Fast sampling through distillation is an active research area.
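A sketch of DDPM-style sampling under stated assumptions: a hypothetical linear noise schedule with 100 steps, the variance choice σ_t² = β_t, and an oracle noise predictor standing in for a trained network. The oracle is exact only because the toy data distribution is a single point, `DATA_POINT`; the names `q_sample`, `p_sample_loop`, and `oracle_eps` are illustrative.

```python
import numpy as np

T = 100
betas = np.linspace(1e-4, 0.02, T)       # hypothetical linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)          # cumulative products of (1 - beta_s)

def q_sample(x0, t, eps):
    """Closed-form forward process: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def p_sample_loop(eps_model, shape, rng):
    """Start from pure noise and apply the reverse update once per timestep."""
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps_hat = eps_model(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0   # no noise at the final step
        x = mean + np.sqrt(betas[t]) * noise
    return x

# Stand-in for a trained network: exact noise prediction when every data point equals DATA_POINT.
DATA_POINT = 2.0

def oracle_eps(x, t):
    return (x - np.sqrt(alpha_bars[t]) * DATA_POINT) / np.sqrt(1.0 - alpha_bars[t])

rng = np.random.default_rng(0)
samples = p_sample_loop(oracle_eps, shape=(4,), rng=rng)
```

With the oracle predictor, every sample lands on `DATA_POINT`. In practice `eps_model` would be a neural network trained to predict the noise that `q_sample` added.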
Advantages
Diverse sampling methods for different tradeoffs
Theoretically grounded procedures
Diffusion enables stable, high-quality generation
Limitations
Sequential sampling inherently slow for long sequences
MCMC mixing times grow with dimension
Diffusion requires many iterations per sample
Section 05
Evaluation Metrics
Evaluating generative models requires multiple perspectives: computational metrics (likelihood, bits/dim) measure model fit; perceptual metrics (FID, IS) capture visual quality; sample-level metrics (precision, recall) assess diversity and fidelity balance. No single metric suffices.
Log-likelihood alone ignores sample quality: a model that mixes a good density with a large fraction of noise loses only a small constant in log-likelihood yet generates mostly gibberish. FID alone misses generalization: a model that memorizes training images achieves near-perfect FID against the training set while producing nothing new. Comprehensive evaluation combines metrics.
Log-Likelihood & Bits/Dimension
Log-likelihood L = ∑_i log p_θ(x_i) measures model fit directly. Normalized per-dimension, "bits/dimension" (bpd) is -log₂ p_θ(x) / D, interpreting likelihood through information-theoretic lens. Lower bpd is better. MNIST achieves ~0.3 bpd with modern models; CIFAR-10 ~3.5 bpd.
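The conversion from a log-likelihood in nats to bits per dimension is a one-liner. The numbers below are hypothetical, chosen only to illustrate the arithmetic for a 32×32×3 image.

```python
import numpy as np

def bits_per_dim(log_likelihood_nats, num_dims):
    """Convert a per-example log-likelihood in nats to bits per dimension."""
    return -log_likelihood_nats / (num_dims * np.log(2.0))

# Hypothetical example: a 32x32x3 image with log p(x) = -6400 nats.
bpd = bits_per_dim(-6400.0, 32 * 32 * 3)    # roughly 3.0 bits per dimension
```

Dividing by log 2 changes the base of the logarithm from e to 2, and dividing by the number of dimensions makes values comparable across image resolutions.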
Likelihood has blind spots. Models can achieve good likelihood while generating poor samples (VAE mode-covering posterior). Conversely, implicit models (GANs) may generate excellent samples without tractable likelihoods. Likelihood benchmarks matter for researchers but less for practitioners judging sample quality.
Fréchet Inception Distance (FID)
FID measures similarity between generated and real distributions in ImageNet-pretrained feature space. Extract activations from generated and real images, compute Fréchet distance between multivariate Gaussians fitted to distributions: FID = ||μ_real - μ_gen||² + Tr(Σ_real + Σ_gen - 2(Σ_real Σ_gen)^{1/2}).
FID correlates reasonably with human perception, makes evaluation computationally tractable, and doesn't require training separate classifiers. Limitations: depends on feature extractor choice, favors models trained on ImageNet, struggles with out-of-distribution domains. Yet it's become standard for image generation benchmarking.
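A sketch of the Fréchet distance computation on precomputed feature vectors. The random arrays below are hypothetical stand-ins for network activations; extracting real Inception features is outside the scope of this example. The matrix square root is avoided by using the eigenvalues of the covariance product, which is one of several possible implementation choices.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Tr((S1 S2)^{1/2}) equals the sum of square roots of the eigenvalues of S1 S2.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt

def fid_from_features(feats_real, feats_gen):
    """Fit a Gaussian to each feature set (rows are samples) and compare them."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)

rng = np.random.default_rng(0)
real = rng.standard_normal((2000, 8))           # hypothetical stand-in for real-image features
fake = rng.standard_normal((2000, 8)) + 0.5     # same spread, shifted mean
score = fid_from_features(real, fake)
```

With identical statistics the distance is zero; here the mean shift of 0.5 in each of 8 dimensions contributes about 8 × 0.25 = 2 to the score.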
Inception Score (IS) & Precision/Recall
Inception Score IS = exp(E_x[KL(p(y|x) || p(y))]) rewards confident, diverse predictions. Generated images should yield high-confidence predictions in different classes. Precision-Recall curves measure quality-diversity tradeoff directly: precision = % of generated images in training manifold; recall = coverage of training manifold.
IS has fallen out of favor because it can be gamed: models can produce images that elicit confident, varied classifier predictions without realistic content, and the score never compares against real data. Precision-Recall offers more nuanced evaluation but requires defining manifold boundaries. Together, they reveal complementary information about generation quality.
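The Inception Score formula can be sketched directly from class-probability vectors. The two probability tables below are hypothetical extremes used to show the range of the score; in practice `probs` would come from a pretrained classifier applied to generated images.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs has shape (n_images, n_classes) and holds p(y|x) for each generated image."""
    marginal = probs.mean(axis=0, keepdims=True)          # p(y), averaged over images
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

confident_diverse = np.eye(4)               # each image confidently assigned to a different class
uninformative = np.full((4, 4), 0.25)       # every prediction is uniform

best = inception_score(confident_diverse)    # close to the number of classes, 4
worst = inception_score(uninformative)       # 1, the minimum possible score
```

The score depends only on classifier outputs for generated images, which illustrates why it cannot detect a model that ignores the real data distribution.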
Sample Quality Metrics
Human evaluation remains gold standard but is expensive and subjective. Mechanical Turk studies aggregate human judgments on sample realism, diversity, and correctness. Inter-rater agreement often surprisingly low, highlighting evaluation difficulty. Cost limits sample size and metric diversity.
Automated proxies include LPIPS (learned perceptual metric using pretrained networks), CLIP similarity (learned vision-language embeddings), and domain-specific metrics (BLEU for text, mel-cepstral distance for audio). Each introduces biases but enables rapid iteration.
Evaluation Principle
No single metric captures generation quality comprehensively. Likelihood, sample fidelity, and coverage offer different perspectives. Compare models across multiple metrics and modalities; aggregate human evaluation for final validation.
Log-Likelihood
Measures p_θ(x) directly. Metric: bits/dimension. Blind to sample quality but informative theoretically.
FID Score
Perceptual similarity via inception features. Lower is better (0=perfect). Standard for image benchmarks. Computationally efficient.
Inception Score
Confidence × diversity via classifier logits. Higher is better. Gamed by adversarial examples; declining popularity.
Precision/Recall
Coverage vs. quality tradeoff. Precision: does model stay on manifold? Recall: does model cover all modes? Complementary perspectives.
Context-Specific Metrics
Different modalities require specialized metrics. Text generation uses BLEU, ROUGE, METEOR (sequence similarity); BLEURT (learned metric). Molecular generation evaluates validity, novelty, diversity. Audio uses MOS (mean opinion score), mel-cepstral distortion. Metrics shape optimization targets—optimizing for BLEU produces different translations than optimizing for human judgment.
1990s–2000s
Log-likelihood becomes standard for probabilistic models. Limited computational ability restricts evaluation to small datasets.
2010s Early
GANs emerge; likelihood-free methods require new evaluation paradigms. Inception Score introduced; human evaluation becomes standard.
2017–2019
FID proposed and becomes dominant for image generation. Precision-Recall metrics provide complementary perspectives on coverage-quality tradeoff.
2020+
Diffusion dominance drives return to likelihood metrics. Multiple metrics standard: likelihood + FID + human evaluation for comprehensive assessment.
Section 06
Computational Tradeoffs
Every generative model faces fundamental tradeoffs between exact and approximate likelihood, tractable and intractable posteriors, training efficiency and inference speed. Recognizing these tensions explains architectural choices and guides method selection.
The central tension: models with tractable likelihoods typically have restrictions limiting expressiveness. Models with flexible expressiveness often sacrifice likelihood tractability. Understanding this spectrum—from fully tractable autoregressive to fully implicit GANs—prepares practitioners for principled design choices.
Exact vs. Approximate Likelihood
Autoregressive models achieve exact likelihood through factorization: p(x) = ∏_i p(x_i | x_{<i}). Flows achieve exact likelihood via change of variables but require invertible architectures. VAEs optimize the ELBO, a lower bound, rather than the exact likelihood. GANs have no explicit likelihood.
Exact likelihood enables proper statistical interpretation and direct likelihood-ratio testing. But architectural constraints to achieve exactness (autoregressive, invertible) may limit expressiveness. The tradeoff often favors approximate methods when expressiveness matters more than interpretability.
Tractable vs. Intractable Posteriors
Latent variable models introduce unobserved variables z. In VAEs, the true posterior p_θ(z|x) is intractable, so we learn an approximate inference network q(z|x). GANs skip posteriors entirely. Accepting this intractability enables flexible generator-encoder pairs but sacrifices posterior interpretability and principled uncertainty quantification.
Flow models maintain tractable posteriors because z = f_θ^{-1}(x) is deterministic, but this constrains architectural flexibility. Autoregressive models over jointly distributed variables implicitly have tractable conditionals (though these might be uninteresting as representations). Different architectures make different posterior-expressiveness tradeoffs visible.
Training Efficiency
Autoregressive models train efficiently with standard supervised learning: each conditional is a standard classification/regression problem. Diffusion models train denoising networks, straightforward with standard losses. GANs train adversarially, balancing generator and discriminator updates. VAEs train by maximizing a variational bound (the ELBO).
Training stability varies. Autoregressive and diffusion: stable, predictable loss curves. GANs: historically unstable, requiring careful architecture/hyperparameter choice. VAEs: mode collapse in posteriors, posterior collapse when decoder too powerful. Modern techniques stabilize most approaches.
Inference (Sampling) Speed
Autoregressive models require sequential generation—one forward pass per generated token. For 1024-token sequences, 1024 forward passes. Flows sample in single pass. Diffusion requires iterative denoising (50-1000 steps typically). GANs sample in one forward pass.
This creates practical application boundaries. Real-time applications (voice, interactive generation) favor single-pass methods (flows, GANs) or accelerated autoregressive decoding (e.g., speculative decoding). Batch applications (offline image generation) tolerate iterative approaches if quality improves enough to justify latency.
Autoregressive
Tractable likelihood, efficient training, sequential sampling. Dominates text; slow for long high-dimensional outputs.
Diffusion
Stable training, excellent quality-diversity, iterative sampling. Current state-of-art images, emerging text approaches.
Scalability to High Dimensions
High-dimensional data (images, text) poses challenges. Autoregressive factors naturally but samples slowly. Flows require Jacobian computation, expensive in high dimensions. Latent variable models compress to lower-dimensional latent space, reducing dimensionality. Diffusion operates in data space but iteratively.
Practical solutions: VAEs compress through bottleneck; autoencoders-with-diffusion operate in latent space (Latent Diffusion Models); hierarchical approaches decompose computation across scales. Understanding bottlenecks guides architecture choice.
Section 07
Historical Context
Modern generative modeling stands on decades of foundational work. Boltzmann machines and RBMs pioneered energy-based approaches; deep belief networks unified supervised and unsupervised learning; VAEs and GANs democratized neural generative models; transformers and diffusion models now dominate. Each era discovered principles still guiding research.
Understanding history prevents reinventing wheels and highlights which challenges are fundamental vs. incidental to specific architectures. The struggle to train deep nets (solved by pretraining and layer normalization), the balance between likelihood and generation quality (fundamental to KL divergence), and adversarial training instability (partly architectural, partly inherent) echo across decades.
Energy-Based Models Era (1980s-2000s)
Boltzmann machines modeled distributions through energy functions: p(x) ∝ exp(-E_θ(x)). Hopfield networks, restricted Boltzmann machines (RBMs), deep belief networks followed. Training via Gibbs sampling and contrastive divergence was computationally expensive but theoretically principled.
RBMs enabled pretraining deep networks layer-by-layer (Hinton's 2006 breakthrough). Though pretraining became less necessary after better initialization and normalization, energy-based principles remain valuable. Modern score-based models directly learn energy gradients.
Deep Learning Revolution (2010s Early)
Deep learning enabled training deeper networks directly. Convolutional networks, RNNs, and attention mechanisms transformed discriminative modeling. For generation, autoregressive models (PixelCNN, WaveNet) and autoencoders emerged. Variational Autoencoders (2014) combined deep learning with principled probabilistic inference.
VAEs introduced the reparameterization trick: expressing posterior samples as differentiable transformations of parameter-free noise (for a Gaussian, z = μ + σ·ε with ε ~ N(0, I)), which enables backpropagation through stochastic nodes. This unified deep learning with Bayesian inference, creating a tractable approximate likelihood objective through the ELBO. Limitations include posterior collapse and blurry reconstructions, but the approach was foundational for subsequent work.
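A small numerical illustration of the reparameterization idea, assuming a one-dimensional Gaussian and the hypothetical test function f(z) = z². Writing z = μ + σ·ε moves the randomness into ε, so gradients with respect to μ and σ can be estimated by differentiating through the sample.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8                       # hypothetical posterior parameters
eps = rng.standard_normal(200_000)         # noise that does not depend on mu or sigma
z = mu + sigma * eps                       # reparameterized samples z ~ N(mu, sigma^2)

# For f(z) = z^2, the chain rule gives df/dmu = 2z * 1 and df/dsigma = 2z * eps.
grad_mu = np.mean(2.0 * z)                 # Monte Carlo estimate; exact value is 2 * mu
grad_sigma = np.mean(2.0 * z * eps)        # Monte Carlo estimate; exact value is 2 * sigma
```

Since E[z²] = μ² + σ², the exact gradients are 2μ = 3.0 and 2σ = 1.6, which the estimates approach as the number of noise samples grows.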
Adversarial Training & GANs (2014-2018)
GANs (Goodfellow et al., 2014) introduced adversarial training: pit generator against discriminator in minimax game. Elegant formulation, powerful results, but training instability and mode collapse plagued early versions. Wasserstein GANs, spectral normalization, progressive growing, and StyleGAN progressively stabilized training.
GANs achieved unprecedented sample quality on high-resolution image generation. Yet implicit models (no tractable likelihood) complicated evaluation and theoretical analysis. The adversarial game creates instability absent in supervised learning, a tension unresolved despite architectural improvements.
Attention & Transformer Era (2017-2019)
Transformers (Vaswani et al., 2017) revolutionized NLP through attention mechanisms and self-supervised pretraining. For generation, autoregressive transformers (GPT series) dominated text; Vision Transformers extended to images. Attention's ability to capture long-range dependencies suited sequential generation.
Transformers also enabled hybrid approaches: BERT-style pretraining, decoder-only models, multimodal models. For generative modeling specifically, autoregressive transformers remain standard for text despite diffusion models' recent image-generation dominance.
Score-Based & Diffusion Resurgence (2019-2022)
Diffusion models (DDPM, 2020) emerged from score-based generative modeling. Key insight: reversing Markovian noise corruption yields tractable sampling and stable training. Denoising score matching (Song & Ermon, 2019) connected to diffusion, clarifying the landscape.
Diffusion's rise parallels diminishing returns elsewhere: VAE blurriness persisted despite improvements, GAN instability remained despite tricks, and autoregressive sampling stayed slow. Diffusion traded sampling speed for stability and quality, a tradeoff that has recently become more favorable through distillation and acceleration techniques.
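The Markovian noise corruption mentioned above has a convenient closed form for Gaussian noise: x_t can be sampled directly from x_0 without stepping through every intermediate state. The sketch below assumes a linear variance schedule with 1000 steps and endpoints 1e-4 and 0.02, values chosen for illustration; the helper names `linear_beta_schedule` and `q_sample` are hypothetical.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_1..beta_T (illustrative linear schedule)."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise  # the noise is the usual regression target for the denoiser

betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative fraction of signal retained

rng = np.random.default_rng(0)
x0 = np.ones((8, 8))                              # stand-in for an image
x_early, _ = q_sample(x0, 0, alpha_bar, rng)      # almost unchanged
x_late, _ = q_sample(x0, 999, alpha_bar, rng)     # close to pure Gaussian noise
```

Training then amounts to regressing the returned noise from `x_t` and `t`, an ordinary supervised loss, which is one way to read the stability claim above.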
Hinton's breakthrough: RBMs enable layer-by-layer pretraining of deep networks. Contrastive divergence accelerates training but remains expensive.
2012-2014
Deep learning revolution: CNNs, RNNs, autoencoders. Variational Autoencoders are introduced with the reparameterization trick, combining deep learning and Bayesian inference.
2014-2018
GANs achieve state-of-the-art sample quality through adversarial training. Variants stabilize training: Wasserstein GANs, Progressive Growing, StyleGAN.
2017+
Transformers revolutionize NLP and extend to images. Autoregressive models dominate text; attention captures long-range dependencies.
2019-2022
Score-based models and diffusion emerge. Iterative denoising enables stable training and high-quality generation. They become the current state of the art for images.
Lasting Principles
Throughout this history, certain principles persist. Likelihood-based models face tractability-expressiveness tradeoffs (visible in RBMs→VAEs→Flows). Generative quality improves with model capacity and data, but overfitting requires regularization (seen across all eras). Scalability demands hierarchical approaches—from Hinton's layer-wise pretraining to Latent Diffusion Models' learned autoencoders.
Understanding this history contextualizes current debates. Diffusion's recent dominance doesn't invalidate autoregressive or flow principles—each remains valid for specific problems. The field matures not by replacement but accumulation: each architecture's lessons enrich the landscape.
Section 08
Course Roadmap
XCS236 spans foundational concepts through cutting-edge architectures, integrating theory with hands-on implementation. Week 1 establishes fundamentals; subsequent weeks deepen into specific families; later weeks synthesize hybrid approaches and applications. The progression mirrors field maturation: from basic principles to specialized techniques to modern integration.
Each major architecture occupies a different point in the tradeoff among likelihood tractability, expressiveness, and sampling speed. Learning these tradeoffs prepares you to design novel methods that match your problem constraints. The course emphasizes understanding rather than memorization: which questions each approach answers and which it leaves open.
Weeks 1-2: Foundations
Week 1 covers density estimation, MLE, KL divergence, sampling basics, and evaluation metrics—the conceptual bedrock. Week 2 introduces the model taxonomy: autoregressive, latent variable, flow, implicit, diffusion. You'll implement simple versions of each, gaining intuition for architectural choices.
Key questions for Weeks 1-2: What does it mean to learn a distribution? Why is MLE natural even though its objective is nonconvex? How do we evaluate models without human judgment? Answering these in depth prepares you to understand specific architectures.
Weeks 3-4: Autoregressive & VAEs
Autoregressive models form the foundation for modern NLP. Week 3 covers PixelCNN/WaveNet and transformers for generation. VAEs (Week 4) introduce latent variables and approximate inference—the ELBO, reparameterization trick, posterior collapse. You'll implement both and recognize their complementary strengths.
Hands-on goal: generate text with autoregressive transformers; generate images with VAEs. Compare quality, speed, likelihood. Why is VAE output blurry while transformers produce sharper but slower samples? These empirical observations ground theoretical understanding.
Weeks 5-6: Flows & Diffusion Models
Normalizing flows (Week 5) achieve exact likelihood through invertible transformations. Real NVP, Glow, and flow matching exemplify different approaches. Diffusion models (Week 6) reverse noise through iterative denoising—DDPM, improved schedulers, conditioning mechanisms.
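To make "exact likelihood through invertible transformations" concrete, here is a minimal sketch of a single affine coupling layer in the style of Real NVP. The `scale_fn` and `shift_fn` functions are hypothetical stand-ins for learned networks, and the base distribution is assumed to be a standard normal.

```python
import numpy as np

def coupling_forward(x, scale_fn, shift_fn):
    """Map data x to latent y; the first half passes through unchanged."""
    x1, x2 = np.split(x, 2, axis=-1)
    s, t = scale_fn(x1), shift_fn(x1)
    y2 = x2 * np.exp(s) + t
    log_det = np.sum(s, axis=-1)  # log |det Jacobian| of a triangular map
    return np.concatenate([x1, y2], axis=-1), log_det

def coupling_inverse(y, scale_fn, shift_fn):
    """Exact inverse: recompute s and t from the untouched half."""
    y1, y2 = np.split(y, 2, axis=-1)
    s, t = scale_fn(y1), shift_fn(y1)
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2], axis=-1)

def log_prob(x, scale_fn, shift_fn):
    """Change of variables: log p(x) = log N(f(x); 0, I) + log |det df/dx|."""
    z, log_det = coupling_forward(x, scale_fn, shift_fn)
    log_base = -0.5 * np.sum(z**2, axis=-1) - 0.5 * z.shape[-1] * np.log(2 * np.pi)
    return log_base + log_det

# Hypothetical stand-ins for the learned scale and shift networks.
scale_fn = lambda h: np.tanh(h)
shift_fn = lambda h: 0.5 * h

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))
y, log_det = coupling_forward(x, scale_fn, shift_fn)
x_rec = coupling_inverse(y, scale_fn, shift_fn)  # recovers x up to float error
```

Because half of the input is left untouched, the Jacobian is triangular and its log-determinant is just a sum, which is what keeps the likelihood both exact and cheap to compute.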
Hands-on goal: implement Real NVP for density estimation; implement DDPM for image generation. Observe diffusion's training stability compared to GANs, and the exact likelihood of flows compared to implicit models. Which domains favor which approaches?
Hands-on goal: implement WGAN with spectral normalization; understand mode collapse and training dynamics empirically. By course end, you'll recognize when each architecture suits problems and how to combine them for novel applications.
Key Concepts Throughout
Likelihood vs. Generation Quality: High likelihood doesn't guarantee sample quality, and implicit models achieve quality without likelihoods.
Factorization Enables Tractability: Autoregressive models factorize the joint distribution into conditionals; flows use invertibility; VAEs use learned inference networks.
Sampling-Training Tradeoff: Some models train easily but sample slowly (autoregressive); others sample fast but train unpredictably (GANs); modern approaches (diffusion, flows) aim to balance both.
Each week builds on previous understanding. Early weeks teach why specific objectives (MLE, ELBO, adversarial loss) make sense. Later weeks show how different architectures implement these objectives with distinct tradeoffs. Final weeks synthesize: combining architectures (diffusion in VAE latent space), conditioning mechanisms (text-to-image), and applications.
Strengths
Progressive deepening from foundations to state-of-art
Implementation complexity varies across architectures
Learning Outcomes
By course end, you will: (1) understand fundamental principles connecting all generative models; (2) implement multiple major architectures from scratch; (3) recognize architectural tradeoffs and choose methods matching problem constraints; (4) read recent papers and adapt methods to new domains; (5) design novel hybrid approaches combining complementary strengths.
Generative modeling is simultaneously mature (decades of foundations) and nascent (major architectural breakthroughs within months). This course positions you at that intersection—understanding established principles while remaining flexible for emerging developments. The goal isn't mastering specific methods but mastering the conceptual landscape enabling rapid adaptation.