Problem Set 3 represents the capstone exploration of deep generative modeling within the XCS236 course structure. The assignment spans 80 points across eight detailed sections, each building upon conceptual and practical foundations established in earlier problem sets. The problem set covers material from weeks 5–8, which represent a progression through three increasingly powerful and nuanced approaches to learning generative distributions.
The three major paradigms—Generative Adversarial Networks, Energy-Based Models, and Diffusion Models—emerged at different times in machine learning history and reflect evolving understanding of how neural networks can learn to generate data. GANs introduced adversarial training in 2014, fundamentally changing how we think about neural network competition and game-theoretic learning dynamics. Energy-Based Models have roots in statistical physics and Boltzmann machines but gained renewed attention through score matching innovations. Diffusion Models represent the newest paradigm, drawing inspiration from non-equilibrium thermodynamics and gaining dominance in recent years through superior empirical results.
Problem Set Structure
The assignment progresses logically from GAN implementation through training challenge mitigation, then introduces energy-based perspectives with both theoretical foundations and practical training procedures, before culminating in diffusion model theory and implementation. This sequencing allows students to build intuition gradually, moving from sample generation through adversarial means, to energy landscape reasoning, to multi-step refinement processes. By the conclusion, students have implemented representatives from each paradigm, encountering the unique challenges and insights each offers.
Learning Objectives
Upon completing this problem set, students should be able to: (1) implement functional GAN architectures and diagnose failure modes through visual inspection and quantitative metrics; (2) understand adversarial training dynamics including mode collapse and its remediation through Wasserstein objectives and gradient penalties; (3) formulate probability distributions through energy functions and derive score matching training objectives; (4) apply contrastive divergence and Langevin sampling for efficient EBM inference; (5) implement diffusion-based generation through noise scheduling and U-Net architectures; (6) compare and contrast the three paradigms regarding sample quality, training stability, inference cost, and theoretical interpretability; (7) select appropriate generative models for different application domains and constraints.
The foundational GAN architecture consists of two neural networks designed to operate in complementary roles within an adversarial training framework. The Generator function G(z) takes a latent code z (typically sampled from a standard normal distribution) and produces synthetic samples that match the dimensionality and statistical properties of real data. The Discriminator function D(x) outputs a scalar probability estimate indicating whether input x is real or synthetically generated. Both networks are implemented as deep convolutional neural networks with distinct architectural patterns suited to their respective roles.
Generator architecture typically employs transposed convolution operations (also called deconvolutions or fractionally-strided convolutions) to progressively upsample latent vectors into full-resolution images. A common pattern starts with a dense layer that reshapes noise vectors into spatial tensors, then applies successive transposed convolutions with stride 2 to double spatial dimensions at each layer. Batch normalization between layers stabilizes training by reducing internal covariate shift, while ReLU activations maintain non-linearity. The final layer applies tanh activation to map outputs to [-1, 1], matching preprocessing conventions for real images.
Discriminator Implementation
Discriminators use standard convolutional layers that progressively downsample input images through stride 2 operations, extracting increasingly abstract features. Each convolution applies learned kernels followed by batch normalization and leaky ReLU activations (with negative slope ~0.2 to prevent dying ReLU issues). After spatial dimensions reduce to 1×1 through the network, a flattening operation precedes fully-connected layers that compute the final discrimination score. Unlike generators which typically output real-valued samples, the discriminator outputs a scalar through sigmoid activation, producing a binary probability compatible with cross-entropy loss.
Training Loop Mechanics
Each training iteration alternates between discriminator and generator updates. For discriminator training, real samples and generated samples (from fresh generator outputs) are fed separately through the network. Real samples receive labels 1.0 (indicating genuine data) while generated samples receive labels 0.0 (indicating synthetic). Binary cross-entropy loss is computed separately for each group, summed, and backpropagated to update discriminator weights. For generator training, new samples are generated and passed through the fixed discriminator, but with labels 1.0 (fooling objective). The generator is updated only through this feedback, never seeing real data directly.
Practical Implementation Considerations
Successful GAN implementation requires careful hyperparameter selection and initialization. Learning rates for discriminator and generator may differ; typical ranges are 0.0001–0.0005 for Adam optimizer. Batch sizes substantially impact training stability—larger batches reduce gradient noise but increase memory requirements. Weight initialization schemes matter significantly; techniques like spectral normalization or careful Gaussian initialization with layer-dependent scaling can prevent training collapse. The latent dimension (typically 100–512) affects sample diversity and generator expressiveness. Training from scratch often takes 10,000–100,000 iterations depending on dataset size and model capacity, with progress monitored through generated sample visualization at regular intervals.
Mode collapse represents the most pervasive failure mode in GAN training, where the generator converges to producing samples from only a narrow subset of the target distribution. Rather than learning to generate diverse samples across the entire data manifold, the generator discovers that it can "trick" the discriminator by repeatedly producing nearly-identical high-quality samples from a single mode. This occurs because early in training the discriminator is weak, and the generator exploits this by over-specializing to fool it without exploring diversity. Once the discriminator improves, the generator may lack sufficient optimization landscape to escape this local solution.
The fundamental cause of mode collapse stems from the minimax formulation underlying standard GANs. As the discriminator improves, its loss becomes nearly zero on easy-to-distinguish samples, providing uninformative gradients to the generator. The generator thus faces vanishing gradients despite the discriminator incorrectly classifying some generated samples. This feedback breakdown is exacerbated when generator and discriminator capacities are mismatched—an overly-powerful discriminator can perfectly classify and provide no learning signal, while an under-capacity generator cannot express the full distribution.
Wasserstein Distance & WGAN
Wasserstein GANs (WGAN) address mode collapse by replacing the Jensen-Shannon divergence underlying standard GANs with the Wasserstein distance (Earth-Mover distance). The Wasserstein distance measures the minimum cost of transporting probability mass from generated to real distributions, providing a continuous, differentiable metric even when distributions don't overlap. This enables the discriminator (reframed as a critic) to provide meaningful gradients even when samples are far from real data, preventing gradient starvation. Training becomes more stable because the loss landscape becomes smoother and more informative throughout the entire sample space.
Gradient Penalty for Lipschitz Enforcement
Practical Wasserstein GANs implement the Wasserstein objective by enforcing a Lipschitz constraint on the critic function through gradient penalty. Rather than clipping weights (which damages expressiveness), gradient penalty adds a regularization term to the discriminator loss: λ·E[(||∇_x D(x)||_2 - 1)²], encouraging the gradient norm to stay near 1. This constraint ensures the critic is 1-Lipschitz, which is necessary for the Wasserstein estimate. Gradient penalty enables training with much larger learning rates and more discriminator steps per generator step, significantly stabilizing training dynamics.
Diagnosis & Metrics
Detecting mode collapse requires visual inspection of generated sample diversity across training iterations. Computing Inception Score (IS) and Fréchet Inception Distance (FID) provides quantitative metrics: IS measures sample quality and diversity using a pre-trained image classifier, while FID compares feature distributions between generated and real images. Significantly improving FID or IS during training while sample diversity decreases visually is a strong indicator of mode collapse. Monitoring discriminator and generator losses during training also reveals problems—stable training shows gradually changing loss curves, while mode collapse often causes sudden loss changes or discriminator loss plateauing at near-zero values.
Energy-Based Models represent probability distributions through unnormalized energy functions, where p(x) = exp(-E(x))/Z with partition function Z = ∫exp(-E(x))dx. Low energy regions correspond to high probability regions in the data distribution, while high energy regions correspond to low probability areas. This perspective is borrowed from statistical physics, particularly Boltzmann machines, but generalized to arbitrary neural network energy functions. The beauty of this formulation is that we can design flexible energy functions without explicitly constraining them to be valid probability densities—the exponential family framework automatically produces normalized distributions.
The energy function E(x) is typically implemented as a deep neural network that maps high-dimensional data to scalar-valued energies. Unlike autoencoders or classifiers, the output is a single scalar representing the "energy cost" of configuration x. The network learns to assign low energies to regions containing real data samples and high energies to regions containing random noise or out-of-distribution samples. This scalar output is crucial because it enables computing gradients with respect to the input, yielding the score function ∇_x log p(x) = -∇_x E(x). The score points toward regions of increasing probability density, making it a fundamental quantity for both sampling and likelihood-free inference.
Partition Function Challenge
The normalization constant Z=∫exp(-E(x))dx is intractable to compute directly, requiring integration over all possible configurations. This intractability was historically the primary limitation of energy models—computing gradients of log p(x) requires ∇_x Z, which involves integrating exponential functions over the entire space. For binary variables, this requires summing 2^N terms; for continuous variables, it requires high-dimensional integration. This computational barrier motivated alternative approaches like GANs that avoid explicit density computation.
Score Matching Solution
Score matching circumvents the partition function problem by training only the score function ∇log p(x) without computing Z. The key insight is that the score matching objective matches model score ∇_θ log p_θ(x) to data score ∇ log p_data(x). Through integration by parts, this objective becomes ∫p_data(x)[||∇_θ log p_θ(x)||² - 2Tr(∇²_θ log p_θ(x))]dx, involving only derivatives of the energy function without the partition function. In practice, we estimate this using mini-batches: for each data sample x, compute the score matching loss as ||∇_x E(x)||² + Tr(∇²_x E(x)), where derivatives flow through the network. This makes training tractable while learning distributions with the same expressiveness as traditional energy models.
Sampling via Langevin Dynamics
Once trained, the score function enables sampling through Langevin dynamics: x_{t+1} = x_t + (α/2)∇_x log p(x_t) + √α ξ_t, where α is step size and ξ_t is Gaussian noise. This iterative process performs gradient ascent in log probability space with noise injection, gradually moving from random initialization toward high-probability regions. The noise term prevents the chain from converging to isolated modes and ensures ergodicity. After sufficient iterations and annealing the step size, samples approximately follow the learned distribution. The Langevin dynamics provide a principled connection between the score function and actual sample generation, making EBMs inherently generative once trained.
Training energy-based models employs contrastive divergence, a principle that combines maximum likelihood objective with approximation. Rather than directly minimizing KL divergence between model and data distributions (which requires the partition function), contrastive divergence minimizes the difference between data and model samples in the energy landscape. The training objective becomes minimizing E_data[E(x)] - E_model[E(x)], pushing model energies downward on data and upward on generated samples. This contrastive formulation motivates why these are called "contrastive" divergence—we explicitly contrast positive examples (data) against negative examples (model samples).
The critical challenge is obtaining model samples efficiently. Early in training, model samples are terrible, so running full Langevin dynamics from scratch for every batch would be computationally prohibitive. The solution combines initialization strategy with replay buffers. A replay buffer maintains a pool of generated samples from previous training iterations, whose energies have been partially optimized. At each training step, we retrieve samples from the buffer, run a few Langevin refinement steps, and update them in place. This amortizes the expensive MCMC inference cost across multiple training iterations, and the buffer ensures negative samples improve continuously as the energy function evolves.
Langevin MCMC Implementation
Implementing Langevin sampling for EBM training requires the Stochastic Gradient Langevin Dynamics (SGLD) update rule: x ← x + (α/2)∇_x E(x) + √α ξ, where we compute energy gradients with respect to input coordinates rather than parameters. This differs from standard gradient descent because we negate the energy gradient (moving toward low energy/high probability) and inject controlled noise. The step size α and noise magnitude must be calibrated carefully: too-large step sizes cause instability, while too-small sizes require many iterations for convergence. Temperature-based annealing helps—starting with larger step sizes to escape local modes, then reducing gradually to refine samples.
Replay Buffer Management
The replay buffer is a central data structure that caches negative samples across training iterations. Typically, it contains 10,000–100,000 samples organized by their learned energies. At each training step, we sample a minibatch from the buffer (often biased toward higher-energy samples to focus on decision boundary refinement), run Langevin steps to refine them, and update them in the buffer. To maintain sample freshness and diversity, we periodically inject new random samples. The buffer size and refresh rate are hyperparameters that significantly impact training speed and final model quality. Larger buffers provide more diverse negatives but require more memory, while smaller buffers train faster but may become stale.
Practical Training Dynamics
EBM training can be unstable if energies are allowed to grow unboundedly, as refinement becomes difficult. Regularization terms like L2 weight penalties or explicit energy regularization help keep energies well-scaled. The ratio of Langevin refinement steps to training steps is critical—too few refinement steps yield poor negative samples, while too many make training prohibitively slow. Typically, 10–100 Langevin steps per training batch is reasonable depending on dimensionality and energy landscape complexity. Monitoring the energy gap (E_data - E_model) provides crucial training diagnostics: positive gaps indicate progress, while stagnation suggests hyperparameter adjustment is needed.
Diffusion Models reverse a forward process that gradually corrupts data through Gaussian noise injection. The forward (diffusion) process defines a Markov chain q(x_1:T | x_0) = ∏_{t=1}^T q(x_t | x_{t-1}) where x_0 is real data and each transition adds Gaussian noise with fixed variance: q(x_t|x_{t-1}) = N(x_t; √(1-β_t)x_{t-1}, β_t I). The sequence of variances β_1,...,β_T is the noise schedule, which controls how quickly noise accumulates. Starting from data, successive applications of noise progressively degrade samples until x_T ≈ N(0,I) is pure noise. This diffusion process is fixed (non-trainable) but enables efficient sampling of q(x_t|x_0) in closed form.
The reverse process p_θ(x_{0:T}) learns to invert the forward diffusion, with trainable transitions p_θ(x_{t-1}|x_t) = N(x_{t-1}; μ_θ(x_t,t), Σ_θ(x_t,t)). Starting from x_T ∼ N(0,I) and applying reverse transitions iteratively, samples are progressively denoised back toward high-probability data regions. The reverse process is learned through neural networks parameterized by θ, typically with both mean and variance modeled (though variance is often fixed for simplicity). This formulation elegantly leverages Markov chain theory: if we know the reverse transition probabilities, we can sample from the data distribution by reversing a known diffusion process.
Evidence Lower Bound (ELBO) Derivation
Training diffusion models is formulated through the variational lower bound on log p(x_0): log p(x_0) ≥ ∑_t E_q[log(p_θ(x_{t-1}|x_t)/q(x_t|x_{t-1}))]. This ELBO decomposes into several terms: a reconstruction term comparing p(x_0|x_1) to q(x_1|x_0), KL divergence terms comparing learned reverse transitions to forward transitions, and a prior matching term for the final timestep. Importantly, the reconstruction and KL terms involve only single timestep distributions, making them tractable to evaluate. In practice, diffusion models weight these terms differently—down-weighting early timesteps (where predictions are easy due to high noise) and up-weighting late timesteps (where fine details matter).
Noise Prediction Formulation
Rather than predicting means and variances directly, effective diffusion models learn to predict the noise ε added at each step. Given the forward process generates x_t = √(ᾱ_t)x_0 + √(1-ᾱ_t)ε with ᾱ_t being the cumulative product of (1-β_t), we can rewrite the denoising objective as predicting the noise component. The network ε_θ(x_t, t) predicts the noise, and loss is computed as E[||ε - ε_θ(x_t,t)||²]. This formulation is more stable than directly predicting means because noise predictions are less sensitive to input scale and easier for networks to learn. The relationship between predicted noise and the mean prediction is: μ_θ = (1/√(ᾱ_t))[x_t - (√(1-ᾱ_t)/√(ᾱ_t))ε_θ(x_t,t)].
Noise Schedules
The variance schedule {β_t} fundamentally shapes the diffusion process and learning difficulty. Linear schedules increase β_t uniformly from small to large values; cosine schedules create smoother transitions preserving signal-to-noise ratio longer into training. The cumulative products ᾱ_t = ∏_{s=1}^t (1-β_s) determine how much signal remains at each timestep. Well-designed schedules ensure that x_T is nearly pure noise while x_0 remains nearly identical to the original data. Poor schedules either leave significant signal at x_T (making final timesteps unnecessary) or corrupt data so quickly early timesteps learn trivial denoising tasks. The schedule's choice significantly impacts model quality and sampling speed, with recent research showing that careful schedule design enables faster sampling without quality loss.
The neural network at the heart of diffusion models is typically a U-Net architecture designed for image-to-image translation tasks. U-Nets combine an encoder that downsamples spatial dimensions through strided convolutions with a symmetric decoder that upsamples through transposed convolutions, connected by skip connections. This architecture preserves fine-grained spatial information necessary for high-quality image generation while maintaining receptive fields large enough for global context. For diffusion models, U-Nets operate on noisy input images and output predictions of the same dimensionality (either denoised images or noise coefficients).
Timestep conditioning in diffusion models uses learned embeddings similar to positional encodings in transformers. The continuous timestep t is first mapped to a fixed-dimensional embedding vector via sinusoidal embeddings and learned dense layers. This embedding is then injected into the U-Net at multiple points—typically through addition with intermediate feature maps or through FiLM (Feature-wise Linear Modulation) layers that scale and shift features based on the timestep embedding. This conditioning allows the network to adapt its denoising behavior based on the current noise level: early timesteps require coarse denoising of very noisy inputs, while late timesteps refine fine details in nearly-clean images.
U-Net Architecture Details
Encoder blocks typically consist of convolution → batch normalization → ReLU → convolution operations with stride 2 for downsampling. Each downsampling reduces spatial dimensions by half while increasing channel count. The decoder mirrors this structure with transposed convolutions for upsampling, progressively increasing spatial resolution. Skip connections concatenate features from corresponding encoder levels to the decoder, maintaining fine-grained information that would otherwise be lost through downsampling. Attention blocks further enhance the U-Net, enabling long-range spatial dependencies modeling through self-attention mechanisms. These attention blocks operate on flattened feature maps, compute attention weights globally, then reshape back to spatial structure.
Training Pipeline
Diffusion model training is remarkably straightforward. Each training iteration: (1) Sample a batch of real images from the dataset. (2) Sample random timesteps t uniformly from 1 to T. (3) Sample noise vectors ε from N(0,I). (4) Compute noisy images x_t = √(ᾱ_t)x_0 + √(1-ᾱ_t)ε using precomputed schedule constants. (5) Pass x_t and t to the U-Net to predict noise ε_θ(x_t,t). (6) Compute MSE loss between predicted and actual noise. (7) Backpropagate and update model weights. The simplicity enables stable training without adversarial dynamics—larger models and more data directly improve results. Training for high-resolution images may require 100,000+ iterations on large GPUs, with checkpoint-based training enabling convergence monitoring.
Sampling Implementation
Sampling reverses the diffusion process starting from x_T ∼ N(0,I). The sampling loop executes the reverse transition T times: for each timestep t from T down to 1, predict noise ε_θ(x_t,t), then update x_{t-1} = (1/√ᾱ_t)[x_t - (√(1-ᾱ_t)/√ᾱ_t)ε_θ(x_t,t)] + σ_t ξ where ξ ∼ N(0,I) and σ_t is the reverse process variance. The variance σ_t can be set to the forward process variance β_t or optimized. Sampling a single image requires T forward passes through the network, each taking ~ms on modern hardware, so complete sampling takes seconds per image. Advanced techniques like DDIM (Denoising Diffusion Implicit Models) skip timesteps without sacrificing much quality, reducing sampling time to milliseconds, enabling real-time applications.
The three generative modeling paradigms explored in this problem set represent distinct philosophical and practical approaches to the fundamental problem of learning data distributions. GANs frame generation as an adversarial game where two networks with opposing objectives drive each other toward equilibrium. This game-theoretic perspective enables single-step sampling—once trained, a single forward pass through the generator produces complete samples. However, adversarial training introduces instability and mode collapse risks, requiring careful tuning and architectural innovations like Wasserstein objectives and gradient penalties. The paradigm excels when fast sampling is critical and computational resources for training are abundant.
Energy-Based Models provide a thermodynamic perspective grounded in statistical physics, where probability distributions emerge from scalar-valued energy landscapes. This principled theoretical foundation enables flexible model design and interpretable probability assignments. Training through score matching and contrastive divergence avoids some pitfalls of adversarial training, enabling more stable learning. However, the requirement to sample via iterative Langevin dynamics during training and inference makes EBMs computationally expensive for high-dimensional problems. The paradigm shines when interpretability and theoretical grounding are priorities, or when conditional generation and energy-based control are desired.
Diffusion Models and Modern Generative AI
Diffusion Models have emerged as the state-of-the-art generative approach in recent years, dominating in text-to-image generation (DALL-E, Stable Diffusion, Midjourney) and image editing applications. The multi-step iterative refinement through denoising provides remarkable stability—training doesn't suffer from adversarial dynamics or mode collapse, instead reliably improving with more data and model capacity. The probabilistic formulation through the ELBO provides principled training objectives and enables principled likelihood-based inference. Sample quality substantially exceeds GANs, though computational requirements (1-50 seconds per sample) exceed GANs but are acceptable for many applications. Recent acceleration techniques progressively narrow the inference gap.
Selecting Models for Applications
Model selection depends on application constraints and objectives: Choose GANs when single-step sampling speed is critical and training stability is acceptable—typical for real-time interactive systems or embedded applications. Choose EBMs when the application requires energy-based control (optimizing the energy landscape), when interpretability and theoretical grounding are valued, or when conditional generation requires explicit probability modeling. Choose Diffusion Models when sample quality is paramount and inference time constraints are moderate, for applications where iterative refinement enhances perception, or when scaling to larger models and datasets is planned.
Hybrid Approaches and Future Directions
Cutting-edge research increasingly explores hybrid approaches combining insights from all three paradigms. Energy-Guided Diffusion uses energy functions to guide diffusion sampling toward high-probability regions. Adversarial Diffusion Models train diffusion networks with auxiliary discriminator objectives for enhanced quality. Diffusion-based EBM sampling accelerates EBM inference through diffusion processes. Understanding the distinct strengths—GANs' fast sampling, EBMs' principled theory, diffusion's stability—enables creative combinations achieving unprecedented capabilities. Practitioners should view this problem set as the foundation for evaluating these tools as generative AI continues its rapid evolution across academia and industry.
Landmark Papers — GANs
Landmark Papers — Energy-Based Models
Landmark Papers — Diffusion Models
Intuitive Explanations & Blogs
- Lilian Weng: What are Diffusion Models? — Clear, accessible explanation of diffusion model theory with connections to VAEs and score matching; excellent starting point.
- Lilian Weng: Blog (All Posts) — Comprehensive blog covering GANs, diffusion, energy models, and all major generative modeling paradigms; highly recommended across problem sets.
Mathematical Background
Implementation & Practical Guides
Course Context
Reading Strategy for PS3: Start with Lilian Weng's blog posts for intuitive understanding. Read Goodfellow et al. (2014) for GAN foundations, then Arjovsky et al. (2017) for Wasserstein improvements. Read Ho et al. (2020) for diffusion models—the DDPM paper is the most important. Use Song et al. (2021) for advanced understanding and SDEs. Reference papers while coding; diffusion model architecture decisions are well-motivated by the theory. Energy-based models have less recent literature; understanding them deepens appreciation for why diffusion and GANs dominate current practice.