STANFORD XCS236 · PROBLEM SET 3
PS3: GANs & Diffusion
80 Points · Adversarial, Energy-Based & Diffusion Models
01

PS3 Overview

Problem Set 3 is an 80-point assignment covering the final three weeks of XCS236, exploring three paradigms for deep generative modeling: Generative Adversarial Networks (GANs), Energy-Based Models (EBMs), and Diffusion Models. These approaches represent distinct theoretical foundations and practical tradeoffs in how machines learn to generate high-quality samples from data distributions.

The assignment progresses from core adversarial training dynamics and a GAN implementation, through diagnosing failure modes such as mode collapse and building intuition for energy landscapes and contrastive learning, to diffusion-based generation, now the dominant approach for high-quality image synthesis. Each paradigm reveals different insights about the geometry of learned representations and the cost of sample generation.

80 Points · 3 Weeks · 8 Sections · 3 Paradigms
02

GAN Implementation

The GAN framework consists of two neural networks in adversarial competition: a Generator that transforms noise into synthetic samples, and a Discriminator that attempts to distinguish real from generated data. The generator aims to fool the discriminator, while the discriminator learns to become increasingly discerning. This minimax game drives both networks toward better representations.

The implementation requires designing appropriate network architectures, typically using convolutional layers for image generation. The generator maps from a latent vector (usually 100-dimensional Gaussian noise) through transposed convolutions to produce images, while the discriminator uses standard convolutions to output a binary classification. PyTorch provides the tools to define these networks and manage the training loop efficiently with alternating updates.

Generator Architecture

Transposed convolutions upsample latent codes to image resolution. Normalization layers (batch normalization in DCGAN-style generators) stabilize training dynamics and improve sample quality.

Discriminator Architecture

Standard convolutional layers downsample images to extract features. Binary classification head predicts real vs. generated likelihood.
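
To make the upsampling and downsampling concrete, the sketch below traces spatial sizes through a hypothetical DCGAN-style layer stack. The layer settings (kernel 4, stride 2, padding 1) and the 64-pixel target are assumptions for illustration, not the assignment's required architecture. The two size formulas are the standard ones for dilation 1.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a standard convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride, padding, output_padding=0):
    """Output length of a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# Hypothetical layer stacks, one (kernel, stride, padding) tuple per layer.
GEN_LAYERS = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1)]
DISC_LAYERS = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 0)]


def trace_sizes(start, layers, size_fn):
    """Return the spatial size before the first layer and after each layer."""
    sizes = [start]
    for kernel, stride, padding in layers:
        sizes.append(size_fn(sizes[-1], kernel, stride, padding))
    return sizes


gen_sizes = trace_sizes(1, GEN_LAYERS, conv_transpose_out)
disc_sizes = trace_sizes(64, DISC_LAYERS, conv_out)
print(gen_sizes)   # [1, 4, 8, 16, 32, 64]
print(disc_sizes)  # [64, 32, 16, 8, 4, 1]
```

Under these settings the generator doubles resolution at every stride-2 layer, and the discriminator mirrors it back down to a single logit.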

Loss Functions

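Binary cross-entropy between discriminator predictions and real/fake labels drives both networks. The generator usually uses the non-saturating form (maximize log D(G(z))), which gives stronger gradients early in training than the original minimax term log(1 - D(G(z))).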
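
As a rough illustration of the losses on this card, the sketch below computes them in plain Python on lists of discriminator outputs. The function names are hypothetical, and the split into a "non-saturating" and a "minimax" generator loss follows the common convention rather than anything the assignment prescribes.

```python
import math


def bce(prob, label, eps=1e-12):
    """Binary cross-entropy for one probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def discriminator_loss(d_real, d_fake):
    """d_real, d_fake: discriminator outputs in (0, 1) on real and generated batches."""
    real_term = sum(bce(p, 1) for p in d_real) / len(d_real)
    fake_term = sum(bce(p, 0) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss_nonsaturating(d_fake):
    """Label fakes as real: minimizes -log D(G(z))."""
    return sum(bce(p, 1) for p in d_fake) / len(d_fake)


def generator_loss_minimax(d_fake, eps=1e-12):
    """Original minimax term: minimizes log(1 - D(G(z)))."""
    return sum(math.log(1.0 - min(p, 1.0 - eps)) for p in d_fake) / len(d_fake)


# An undecided discriminator (all outputs 0.5) gives 2*ln(2) and ln(2).
print(discriminator_loss([0.5, 0.5], [0.5, 0.5]))
print(generator_loss_nonsaturating([0.5, 0.5]))
```

When the discriminator confidently rejects fakes (outputs near 0), the non-saturating loss is large while the minimax term is nearly flat, which is the usual argument for the former.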

Training Loop

Discriminator gradient step on real and fake batches, then generator step. Careful initialization and learning rates prevent collapse and instability.
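
The alternating update order can be sketched on a deliberately tiny problem. Everything below is a toy assumption: a one-dimensional dataset N(4, 1), a linear generator G(z) = a·z + b, a logistic discriminator D(x) = sigmoid(w·x + c), and hand-derived gradients in place of PyTorch autograd. It shows the structure of the loop only and makes no claim about convergence.

```python
import math
import random


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(steps=500, batch=32, lr=0.05, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator G(z) = a * z + b
    w, c = 0.0, 0.0   # discriminator D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        real = [rng.gauss(4.0, 1.0) for _ in range(batch)]  # toy data: N(4, 1)
        zs = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fake = [a * z + b for z in zs]

        # Discriminator step: descend BCE on real (label 1) and fake (label 0).
        gw = gc = 0.0
        for x in real:
            d = sigmoid(w * x + c)
            gw += (d - 1.0) * x
            gc += d - 1.0
        for x in fake:
            d = sigmoid(w * x + c)
            gw += d * x
            gc += d
        w -= lr * gw / batch
        c -= lr * gc / batch

        # Generator step: descend the non-saturating loss -log D(G(z)).
        ga = gb = 0.0
        for z in zs:
            d = sigmoid(w * (a * z + b) + c)
            ga += (d - 1.0) * w * z
            gb += (d - 1.0) * w
        a -= lr * ga / batch
        b -= lr * gb / batch
    return a, b, w, c


a, b, w, c = train_toy_gan()
print(f"generator: a={a:.3f}, b={b:.3f}; discriminator: w={w:.3f}, c={c:.3f}")
```

In a real implementation each hand-written gradient block would be replaced by a backward pass and an optimizer step, with the generator's output detached during the discriminator update.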

03

GAN Training Challenges

Mode collapse is the best-known failure mode in GAN training: the generator learns to produce only a narrow subset of the target distribution, often repeating similar-looking samples. This happens because the generator can exploit discriminator weaknesses to fool it without exploring the full data manifold. Diagnosing mode collapse involves visual inspection of generated samples for diversity and computing metrics such as Inception Score (IS) and Fréchet Inception Distance (FID).
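
To give a feel for how a Fréchet-style metric reacts to collapse, the sketch below computes the Fréchet distance between one-dimensional Gaussians fitted to raw sample values. This is a toy stand-in: real FID fits multivariate Gaussians to Inception features, and the sample lists here are invented for illustration.

```python
import statistics


def frechet_distance_1d(xs, ys):
    """Squared Fréchet distance between 1D Gaussians fitted to xs and ys."""
    mu_x, mu_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.pstdev(xs), statistics.pstdev(ys)
    return (mu_x - mu_y) ** 2 + (sd_x - sd_y) ** 2


real = [-2.0, -1.9, -2.1, 2.0, 1.9, 2.1]         # two modes, near -2 and +2
diverse = [-2.05, -1.95, -2.0, 2.05, 1.95, 2.0]  # covers both modes
collapsed = [2.0, 1.9, 2.1, 2.0, 1.95, 2.05]     # only the +2 mode

print(frechet_distance_1d(real, diverse))    # small
print(frechet_distance_1d(real, collapsed))  # much larger
```

Because the metric only compares fitted means and spreads, it flags this collapse but would miss one that happens to preserve both statistics, which is one reason visual inspection is still listed alongside it.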

The Wasserstein distance (Earth-Mover distance) provides a smoother loss landscape than the Jensen-Shannon divergence underlying the original GAN objective, giving useful gradients even when the real and generated distributions barely overlap. Wasserstein GANs (WGAN) clip the critic's weights to enforce a Lipschitz constraint, which makes training more stable and reduces, though does not eliminate, mode collapse. Gradient penalty methods (WGAN-GP) replace clipping with a penalty that pushes the norm of the critic's input gradient toward 1, which enforces the constraint more smoothly and supports deeper networks with faster convergence and higher-quality samples.
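
The two ways of constraining the critic can be sketched in a few lines. This is a one-dimensional illustration under stated assumptions: the critic is any scalar function, its input gradient is estimated by finite differences instead of autograd, and the coefficient 10 and clip value 0.01 are the commonly quoted defaults, used here only as placeholders.

```python
import random


def grad_1d(f, x, h=1e-5):
    """Central finite-difference estimate of df/dx (stand-in for autograd)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def gradient_penalty(critic, real, fake, lam=10.0, rng=None):
    """Penalty lam * mean((|d critic/dx| - 1)^2) at random real/fake interpolates."""
    rng = rng or random.Random(0)
    total = 0.0
    for x_r, x_f in zip(real, fake):
        t = rng.random()                    # uniform mixing weight in [0, 1)
        x_hat = t * x_r + (1.0 - t) * x_f   # point between a real and a fake sample
        total += (abs(grad_1d(critic, x_hat)) - 1.0) ** 2
    return lam * total / len(real)


def clip_weights(weights, c=0.01):
    """Weight-clipping alternative: clamp each critic weight to [-c, c]."""
    return [max(-c, min(c, w)) for w in weights]


real, fake = [1.0, 2.0, 3.0], [0.0, 0.5, 4.0]
print(gradient_penalty(lambda x: x, real, fake))        # slope 1: penalty near 0
print(gradient_penalty(lambda x: 3.0 * x, real, fake))  # slope 3: 10 * (3 - 1)^2
print(clip_weights([0.5, -0.5, 0.001]))
```

The penalty is zero exactly when the critic has unit slope at the sampled points, so it constrains the function directly rather than its parameters, which is the contrast with clipping.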

Mode Collapse · Wasserstein Distance · Gradient Penalty · Spectral Norm
04

Energy-Based Model Theory

Energy-Based Models represent probability distributions through energy functions: p(x) ∝ exp(-E(x)). Rather than generating samples directly, EBMs define scalar-valued functions that assign lower energies to high-probability regions and higher energies to low-probability regions. The normalization constant Z (partition function) integrates over all possible configurations, making exact likelihood computation intractable but enabling flexible model specification.

One way to train EBMs is score matching, which fits the model score, the gradient of the log-density ∇_x log p(x) = -∇_x E(x), to the score of the data distribution without computing the partition function. This avoids the computational bottleneck that plagued earlier energy models. The learned score points in the direction of increasing probability, enabling both sampling via Langevin dynamics and likelihood-free inference. Energy functions can be deep neural networks, making EBMs an expressive function class for capturing complex data distributions.
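
The reason the partition function drops out of the score can be written in two lines. The subscript θ for model parameters is notation introduced here for clarity. Z depends on θ but not on x, so its gradient with respect to x vanishes:

```latex
\begin{aligned}
p_\theta(x) &= \frac{\exp(-E_\theta(x))}{Z_\theta}, \qquad Z_\theta = \int \exp(-E_\theta(x))\,dx, \\
\nabla_x \log p_\theta(x) &= -\nabla_x E_\theta(x) - \nabla_x \log Z_\theta = -\nabla_x E_\theta(x).
\end{aligned}
```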

Key Insight

Energy-based models decouple representation learning from partition function normalization. By learning only the score function, we avoid the intractable sum over all configurations while maintaining expressive probability models.

05

EBM Training

EBM training commonly employs contrastive divergence, which approximates the gradient of the log-likelihood using samples drawn from the model distribution by short MCMC chains. The training objective lowers energy on real data while raising energy on model-generated samples. Early in training, model samples may be poor, so Langevin MCMC is used to refine them, turning the energy function into a sampler through iterative gradient descent on the energy (equivalently, ascent on log p) plus noise.

Stochastic Gradient Langevin Dynamics (SGLD) combines gradient updates with Gaussian noise whose variance is proportional to the step size, enabling efficient approximate sampling while training. The noise scale must be carefully calibrated: too high produces poor samples, too low traps chains in local modes. Replay buffers can store good negative samples from previous iterations, accelerating training by reducing Langevin steps per batch. This combination makes EBM training practical despite requiring expensive MCMC inference during learning.

Contrastive Loss
Minimize E(x_real) - E(x_fake), lowering energy on the data manifold and raising it on model samples.
Langevin Sampling
Generate negative samples via Langevin dynamics: x ← x + (α/2)∇_x log p(x) + √α ξ, with ξ ~ N(0, I) and ∇_x log p(x) = -∇_x E(x).
SGLD Training
Combine gradient descent with an annealed noise schedule for stable, efficient learning.
Replay Buffer
Cache previous Langevin chains to amortize sampling cost and stabilize gradients.
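
The Langevin update and the contrastive loss above can be tried on a toy problem. The sketch assumes a fixed one-dimensional energy E(x) = (x - 2)²/2, so the target density is N(2, 1) and the score is known in closed form. In actual EBM training the energy would be a neural network whose parameters are updated with this loss. Function names, step size, and chain length are illustrative choices.

```python
import math
import random


def energy(x):
    """Toy energy with its minimum at x = 2, so exp(-E(x)) is proportional to N(2, 1)."""
    return 0.5 * (x - 2.0) ** 2


def score(x):
    """Gradient of log p(x), which equals -dE/dx for the toy energy."""
    return -(x - 2.0)


def langevin_sample(x0, steps=200, alpha=0.1, rng=random):
    """Iterate x <- x + (alpha/2) * score(x) + sqrt(alpha) * xi, with xi ~ N(0, 1)."""
    x = x0
    for _ in range(steps):
        x = x + 0.5 * alpha * score(x) + math.sqrt(alpha) * rng.gauss(0.0, 1.0)
    return x


def contrastive_loss(real_batch, fake_batch):
    """Mean E(x_real) minus mean E(x_fake), the quantity minimized in training."""
    e_real = sum(energy(x) for x in real_batch) / len(real_batch)
    e_fake = sum(energy(x) for x in fake_batch) / len(fake_batch)
    return e_real - e_fake


rng = random.Random(0)
samples = [langevin_sample(rng.uniform(-5.0, 5.0), rng=rng) for _ in range(1000)]
mean = sum(samples) / len(samples)
print(f"sample mean after Langevin: {mean:.2f} (target mean is 2)")
print(contrastive_loss([2.0], [5.0]))  # data at the minimum, negative far away
```

A replay buffer would replace the uniform initial points with chain states saved from earlier iterations, so each new chain starts closer to the model distribution.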
06

Diffusion Model Theory

Diffusion Models reverse a process that gradually corrupts data into pure noise, learning to denoise at each step. The forward process adds Gaussian noise according to a variance schedule, moving from the data distribution p(x_0) to p(x_T) ≈ N(0, I) over T steps. The reverse process learns the inverse transitions, using a neural network to predict the noise (or the mean and variance) conditioned on the timestep, and reconstructs the original distribution through iterative refinement.
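
A variance schedule is easy to inspect numerically. The sketch below assumes a linear schedule with per-step variances β_t from 1e-4 to 0.02 over T = 1000 steps, a commonly used setting chosen here as a placeholder. It computes the cumulative product ᾱ_t = ∏(1 - β_s), which measures how much of the original signal survives after t steps.

```python
def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_1..beta_T, linearly spaced."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)
print(alpha_bars[0], alpha_bars[-1])  # near 1 at the first step, near 0 at the last
```

With ᾱ_T close to zero, almost nothing of x_0 remains at step T, which is what justifies starting the reverse process from a standard Gaussian.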

The training objective comes from the Evidence Lower Bound (ELBO), which decomposes into a weighted sum of denoising losses, one per timestep. Denoising Diffusion Probabilistic Models (DDPM) train a U-Net to predict the noise added at each step, conditioned on timestep embeddings that encode the noise level. In practice DDPM drops the ELBO weights and uses a plain MSE over uniformly sampled timesteps; relative to the ELBO, this down-weights small timesteps (low noise), where denoising is easy, and so puts more emphasis on the harder high-noise timesteps. Sampling reverses the process, starting from pure noise and iteratively denoising to produce high-quality samples.

Advantages

  • Stable training without adversarial dynamics or mode collapse
  • Theoretically principled through probabilistic interpretation
  • Scales to high-resolution image generation with architectural improvements
  • Enables guided generation via classifier gradients or guidance weights

Challenges

  • Sampling requires many sequential denoising steps (typically 50-1000)
  • Higher computational cost than GAN inference (single forward pass)
  • Noise schedule and timestep embeddings require careful tuning
  • Understanding of failure modes is less developed than for GANs or EBMs
07

Diffusion Implementation

A diffusion model implementation centers on a U-Net architecture that processes images at multiple scales and is conditioned on the timestep through sinusoidal embeddings of the kind used for transformer positional encodings, typically injected into each residual block. Timestep embeddings encode the current noise level (and thus position in the denoising process), allowing the network to adapt its predictions appropriately. The network includes skip connections and attention blocks to preserve spatial detail while modeling long-range dependencies.

The training loop computes forward diffusion by sampling timesteps uniformly and computing noisy versions using the closed-form formula x_t = √(ᾱ_t) x_0 + √(1 - ᾱ_t) ε. The network predicts the noise ε, and the loss is the MSE between predicted and actual noise. Sampling implements the reverse process, starting from x_T ~ N(0, I) and iteratively applying the learned reverse transition. At each step, the network predicts the noise in the current sample; a scaled copy of that prediction is subtracted, and fresh Gaussian noise with a carefully chosen variance is added, gradually denoising toward the data distribution.
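
One training step can be sketched without any deep learning library. The code below assumes a short hypothetical schedule and uses a placeholder predictor that always outputs zero noise where a real implementation would call the U-Net. The names q_sample, training_loss, and predict_noise are illustrative.

```python
import math
import random


def q_sample(x0, alpha_bar_t, eps):
    """Closed-form forward diffusion: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps."""
    return [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
            for x, e in zip(x0, eps)]


def training_loss(x0, alpha_bars, predict_noise, rng=random):
    """One Monte Carlo sample of the noise-prediction MSE for a single datum."""
    t = rng.randrange(len(alpha_bars))        # uniform timestep index (0-based)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]   # target noise
    x_t = q_sample(x0, alpha_bars[t], eps)
    eps_hat = predict_noise(x_t, t)           # network call in a real model
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(x0)


# Hypothetical 100-step linear schedule, for illustration only.
alpha_bars, prod = [], 1.0
for i in range(100):
    prod *= 1.0 - (1e-4 + (0.05 - 1e-4) * i / 99)
    alpha_bars.append(prod)


def zero_predictor(x_t, t):
    """Placeholder for the network: always predicts zero noise."""
    return [0.0] * len(x_t)


loss = training_loss([0.5, -1.0, 2.0], alpha_bars, zero_predictor, rng=random.Random(0))
print(f"loss with the placeholder predictor: {loss:.3f}")
```

Because x_t is available in closed form, each step needs only one network evaluation at a random timestep and never simulates the full forward chain.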

U-Net Architecture

Encoder-decoder structure with skip connections preserves spatial information while capturing global context through attention blocks.

Timestep Embeddings

Sinusoidal positional encodings of timestep t condition network behavior on noise level and signal-to-noise ratio.
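
A minimal version of such an embedding is sketched below. The dimension of 8 and the maximum period of 10000 are assumptions borrowed from the usual transformer convention, and dim is assumed to be even.

```python
import math


def timestep_embedding(t, dim=8, max_period=10000.0):
    """Sinusoidal embedding of timestep t: sines then cosines at geometric frequencies."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


print(timestep_embedding(0))   # sines are 0, cosines are 1
print(timestep_embedding(50))
```

Nearby timesteps get similar vectors while distant ones differ across several frequencies, which gives the network a smooth handle on the noise level.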

Training Loss

MSE between predicted and true noise, averaged over uniformly sampled timesteps. DDPM's simplified loss drops the ELBO weights, which shifts emphasis toward the harder, noisier timesteps.

Sampling Loop

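Reverse Markov chain: x_{t-1} = (1/√α_t)[x_t - ((1 - α_t)/√(1 - ᾱ_t)) ε̂] + σ_t z, where ε̂ is the predicted noise, α_t = 1 - β_t is the per-step signal retention, ᾱ_t is the cumulative product of the α values up to step t, and z ~ N(0, I) (z = 0 on the final step).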
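
The reverse chain can be exercised end to end on a toy case where the ideal noise prediction is known. The sketch assumes one-dimensional Gaussian data x_0 ~ N(3, 0.5²), for which the expected noise given x_t has a closed form that stands in for the trained network. The 200-step schedule and the choice σ_t² = β_t are illustrative assumptions.

```python
import math
import random

T = 200
betas = [1e-4 + (0.1 - 1e-4) * i / (T - 1) for i in range(T)]  # hypothetical schedule
alphas = [1.0 - b for b in betas]
alpha_bars, prod = [], 1.0
for a in alphas:
    prod *= a
    alpha_bars.append(prod)

MU, SD = 3.0, 0.5  # toy data distribution x_0 ~ N(MU, SD^2)


def predict_noise(x_t, t):
    """Exact expected noise given x_t for Gaussian data; stands in for the network."""
    ab = alpha_bars[t]
    var_t = ab * SD ** 2 + (1.0 - ab)
    return math.sqrt(1.0 - ab) * (x_t - math.sqrt(ab) * MU) / var_t


def sample(rng):
    x = rng.gauss(0.0, 1.0)                        # start from x_T ~ N(0, 1)
    for t in reversed(range(T)):                   # 0-based index t is step t + 1
        eps_hat = predict_noise(x, t)
        scale = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = (x - scale * eps_hat) / math.sqrt(alphas[t])
        z = rng.gauss(0.0, 1.0) if t > 0 else 0.0  # no noise on the final step
        x = mean + math.sqrt(betas[t]) * z         # sigma_t^2 = beta_t
    return x


rng = random.Random(0)
samples = [sample(rng) for _ in range(1000)]
mean = sum(samples) / len(samples)
print(f"mean of generated samples: {mean:.2f} (data mean is {MU})")
```

Starting from pure noise centered at 0, the chain ends with samples centered near the data mean, which is the behavior the update rule is designed to produce when the noise predictor is accurate.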

08

Key Takeaways

The three generative modeling paradigms occupy distinct points in the tradeoff between architecture, theory, and practice. GANs offer fast sampling through a single forward pass but suffer from training instability and mode collapse, which call for stabilization techniques such as Wasserstein losses and gradient penalties. Energy-based models provide an elegant theoretical framework with flexible energy functions and principled contrastive learning, but expensive MCMC sampling during training and inference limits their practical scalability. Diffusion models achieve strong sample quality and training stability through iterative refinement and now dominate state-of-the-art generative AI applications.

Understanding all three deepens appreciation for generative modeling's landscape: adversarial training explores minimax game theory, energy models reveal information-theoretic perspectives on probability, and diffusion processes show how multi-step refinement enables better optimization and sample quality. The field continues evolving with hybrid approaches combining insights from each paradigm, acceleration techniques for faster diffusion sampling, and conditional generation methods leveraging all three model classes. Practitioners should understand each approach's strengths to select appropriate models for specific generative tasks and datasets.

GANs: Fast Sampling · EBMs: Principled Theory · Diffusion: State-of-the-Art · Hybrid: Future
09

References & Further Reading

Problem Set 3 covers the final major paradigms in modern generative modeling: adversarial networks, energy-based models, and diffusion models. These represent the cutting edge of generative AI research, with diffusion models now achieving state-of-the-art results across image, audio, and video generation. The references below span foundational papers, technical overviews, and implementation resources for all three approaches.

Start with the seminal papers (Goodfellow et al. for GANs, Ho et al. and Song et al. for diffusion) to understand core concepts. Use Lilian Weng's blog for clear explanations bridging theory and intuition. Consult implementation guides when coding. Energy-based models are particularly useful for understanding probabilistic modeling from first principles, while GANs and diffusion models represent the practical state of the art in generative AI. Understanding all three gives a complete perspective on modern generative modeling.