STANFORD XCS236 · DEEP GENERATIVE MODELS
Diffusion Models
Week 7–8 · DDPM, Score Matching & Conditional Generation
01

Forward Process — q(xt|xt−1)

The forward process gradually adds Gaussian noise to data over T timesteps, transforming clean samples into pure noise via a Markov chain. Each step applies a small amount of noise controlled by a variance schedule βt, which determines the signal-to-noise ratio at each timestep.

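Each transition is stochastic and Gaussian: q(xt|xt−1) = 𝒩(xt; √(1−βt)xt−1, βtI). Because Gaussians compose, the marginal is available in closed form, q(xt|x₀) = 𝒩(xt; √α̅t x₀, (1−α̅t)I), with cumulative product α̅t = ∏ᵢ₌₁ᵗ(1−βi). This tractable marginal lets training sample xt at any timestep directly from x₀, so the loss can be computed without iterating the chain.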
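A minimal NumPy sketch of the closed-form marginal q(xt|x₀), assuming a linear schedule with the endpoints reported by Ho et al. (2020). The function names `linear_beta_schedule` and `q_sample` are illustrative choices, not part of any library, and the random array stands in for real data.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (assumed endpoints)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is a 0-based index into alpha_bar."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

T = 1000
betas = linear_beta_schedule(T)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative product of (1 - beta_i)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))      # stand-in for a batch of clean data
x_t, eps = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

Since α̅t shrinks monotonically toward zero, late timesteps are almost pure noise, which is what lets sampling start from 𝒩(0, I).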

02

DDPM — Denoising Diffusion Probabilistic Models

DDPM (Ho et al., 2020) frames image generation as learning the reverse of the forward diffusion process. By iteratively denoising a Gaussian sample, the model generates data without adversarial training, which gives stable optimization and good mode coverage.

The simplified MSE objective trains the model to predict noise, yielding strong empirical results on image benchmarks. DDPM demonstrated that diffusion could compete with GANs and VAEs, launching the modern era of diffusion-based generative models with applications across image, audio, and video synthesis.

03

Noise Prediction — ε-prediction & U-Net

Instead of predicting the mean directly, the model learns to predict the noise ε added at each step. This ε-prediction reparameterization simplifies the loss and improves sample quality compared to predicting the mean directly. The U-Net architecture, with skip connections and multi-scale feature extraction, is the dominant backbone for diffusion models.
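The link between noise prediction and mean prediction can be written out explicitly. The following is a sketch in the DDPM parameterization, writing αt = 1−βt (a symbol not used elsewhere in these notes) and εθ for the network output.

```latex
% Forward marginal, reparameterized with \epsilon \sim \mathcal{N}(0, I):
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon

% Reverse-process mean expressed through the predicted noise:
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}
  \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t) \right),
\qquad \alpha_t = 1 - \beta_t
```

Predicting ε therefore determines the mean, so the two parameterizations describe the same family of reverse steps; they differ in how errors are weighted during training.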

U-Net designs combine downsampling and upsampling paths, self-attention layers, and time embeddings that condition the network on the timestep. Residual connections and normalization layers (group normalization in the original DDPM) stabilize training. Cross-attention blocks enable conditioning on text prompts (as in Stable Diffusion) or other modalities, making the architecture flexible for guided generation tasks.
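One common way to build the time embedding is a Transformer-style sinusoidal encoding of the timestep. The sketch below assumes that choice; `timestep_embedding`, `dim`, and `max_period` are illustrative names, and real implementations usually pass the result through a small MLP before injecting it into the U-Net blocks.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Map integer timesteps t (shape [B]) to sinusoidal features of shape [B, dim].

    Assumes dim is even: the first half holds sines, the second half cosines.
    """
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

emb = timestep_embedding(np.array([0, 10, 999]), dim=128)
```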

04

Reverse Process — pθ(xt−1|xt)

The reverse process learns pθ(xt−1|xt) to map noise back to data. The forward posterior q(xt−1|xt, x₀) is Gaussian with mean and variance available in closed form, which provides the training target. In DDPM the reverse variance is fixed to a schedule-dependent constant, so the model learns only the mean, predicting it as a function of xt and timestep t. Sampling iterates from xT ∼ 𝒩(0, I) down to x₀.
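The sampling loop can be sketched as follows, assuming the DDPM choice of fixed variance σt² = βt and a mean computed from a noise prediction. `eps_model` is a hypothetical placeholder that returns zeros so the snippet runs without a trained network, and the 50-step schedule is only there to keep the demo fast.

```python
import numpy as np

def eps_model(x_t, t):
    """Hypothetical placeholder for a trained noise predictor; returns zeros."""
    return np.zeros_like(x_t)

def p_sample_loop(shape, betas, rng):
    """Ancestral sampling from x_T ~ N(0, I) down to x_0 with fixed variance beta_t."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)                 # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):        # 0-based: T-1, ..., 0
        eps_hat = eps_model(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    return x

betas = np.linspace(1e-4, 0.02, 50)                # short demo schedule (assumption)
sample = p_sample_loop((2, 8), betas, np.random.default_rng(0))
```

Each iteration needs one network evaluation, which is why the number of steps dominates sampling cost.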

The noise schedule (e.g., linear, cosine) controls the denoising trajectory. Learning the reverse variance with a separate output head can improve likelihood, though fixed variances work well empirically. The Markov structure makes sampling a simple sequential loop with one network evaluation per step; T is typically 1000 in DDPM, while accelerated samplers use far fewer steps.
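To make the difference between schedules concrete, the sketch below compares a linear schedule with the cosine schedule proposed by Nichol and Dhariwal (2021). The offset `s = 0.008` and the clipping value `0.999` follow that paper; the function names are illustrative.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal level alpha_bar_t for t = 0..T under the cosine schedule."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f / f[0]                                 # alpha_bar_0 = 1

def cosine_betas(T, s=0.008, max_beta=0.999):
    """Per-step variances beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped for stability."""
    ab = cosine_alpha_bar(T, s)
    return np.clip(1.0 - ab[1:] / ab[:-1], 0.0, max_beta)

T = 1000
betas_linear = np.linspace(1e-4, 0.02, T)
betas_cosine = cosine_betas(T)
ab_linear = np.cumprod(1.0 - betas_linear)
ab_cosine = np.cumprod(1.0 - betas_cosine)
```

Midway through the chain the cosine schedule retains noticeably more signal than the linear one, so fewer timesteps are spent on nearly pure noise.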

05

Training Objective — Variational Bound & MSE Loss

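The training objective is the variational lower bound (ELBO), which decomposes into a prior-matching term, a sum of per-timestep KL divergences between the forward posterior q(xt−1|xt, x₀) and the model pθ(xt−1|xt), and a reconstruction term for x₀. The per-timestep terms dominate. Because both distributions are Gaussian, each KL reduces to a squared error between the true and predicted means, and therefore between the true and predicted noise at that noise level.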
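Written out, the decomposition of the bound takes the following form (the negative ELBO as given by Ho et al., 2020; the labels L_T, L_{t−1}, and L_0 are the names used in that paper, not in these notes).

```latex
L = \mathbb{E}_q \Big[
      \underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big)}_{L_T}
    + \sum_{t > 1} \underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}}
    \underbrace{-\, \log p_\theta(x_0 \mid x_1)}_{L_0}
  \Big]
```

L_T has no trainable parameters when the forward schedule is fixed, so training effort goes into the L_{t−1} terms and L_0.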

The simplified MSE loss L_simple drops the timestep-dependent weighting coefficients of the bound and samples timesteps uniformly, making optimization simpler while maintaining effectiveness. The connection to score matching shows that the diffusion objective is equivalent, up to weighting, to denoising score matching: the network learns the score ∇x log q(xt) of the noise-perturbed data distribution at each noise level. This gives the method theoretical grounding and links it to energy-based models.
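A minimal sketch of one evaluation of L_simple, assuming a linear schedule and a hypothetical `eps_model` placeholder that returns zeros in place of a trained network. The helper `score_from_eps` illustrates the score-matching link through the identity ∇ log q(xt|x₀) = −ε/√(1−α̅t); both function names are illustrative.

```python
import numpy as np

def eps_model(x_t, t):
    """Hypothetical placeholder for the noise-prediction network; returns zeros."""
    return np.zeros_like(x_t)

def simple_loss(x0, alpha_bar, rng):
    """Monte Carlo estimate of L_simple = E[ ||eps - eps_model(x_t, t)||^2 ]."""
    B = x0.shape[0]
    t = rng.integers(0, len(alpha_bar), size=B)            # uniform timesteps
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bar[t].reshape(B, *([1] * (x0.ndim - 1)))   # broadcast over data dims
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps       # closed-form q(x_t | x_0)
    return np.mean((eps - eps_model(x_t, t)) ** 2)

def score_from_eps(eps_hat, alpha_bar_t):
    """Convert a noise prediction into a score estimate at noise level alpha_bar_t."""
    return -eps_hat / np.sqrt(1.0 - alpha_bar_t)

T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))
rng = np.random.default_rng(0)
x0 = rng.standard_normal((256, 16))                        # stand-in for training data
loss = simple_loss(x0, alpha_bar, rng)
```

With the zero placeholder the loss is close to 1, the variance of the target noise; a trained network would drive it below that baseline.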

06

Sampling Acceleration — DDIM, DPM-Solver & Distillation

Running all T reverse steps at sampling time is slow. DDIM (Denoising Diffusion Implicit Models) uses a non-Markovian formulation whose reverse process can be made deterministic, which allows skipping steps and reduces the count from 1000 to 20–50 with little loss in quality. DPM-Solver and other ODE-based methods apply numerical solvers to the probability flow ODE, further accelerating sampling.
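The deterministic DDIM update (the η = 0 case) can be sketched as below: each step forms an estimate of x₀ from the predicted noise, then re-noises it to the next, lower noise level on a subsampled timestep grid. `eps_model` is again a hypothetical zero-returning placeholder, and the evenly spaced grid is one simple choice among several.

```python
import numpy as np

def eps_model(x_t, t):
    """Hypothetical placeholder for a trained noise predictor; returns zeros."""
    return np.zeros_like(x_t)

def ddim_sample(shape, alpha_bar, num_steps, rng):
    """Deterministic DDIM sampling over num_steps timesteps taken from a T-step schedule."""
    T = len(alpha_bar)
    timesteps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = rng.standard_normal(shape)                       # x_T ~ N(0, I)
    for i, t in enumerate(timesteps):
        ab_t = alpha_bar[t]
        # After the last grid point, jump to the clean sample (alpha_bar = 1).
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        eps_hat = eps_model(x, t)
        x0_pred = (x - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)
        x = np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_hat
    return x

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
sample = ddim_sample((2, 8), alpha_bar, num_steps=20, rng=np.random.default_rng(0))
```

No fresh noise is injected inside the loop, so the output is a deterministic function of the initial xT.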

Distillation trains a student model to match a slow, many-step teacher sampler in far fewer steps, moving toward real-time generation; progressive distillation repeatedly halves the step count. Consistency models learn to map any noise level directly to clean data in a single step, trading some sample quality for speed. Latent-space diffusion (operating in compressed VAE embeddings) and caching further reduce computational cost, making diffusion practical for deployment.

07

Conditional Generation — Classifier & Classifier-Free Guidance

Classifier guidance trains a separate classifier on noisy images to predict class labels, then uses the gradient of its log-probability with respect to xt to steer the diffusion trajectory toward regions where the target class is likely. A scale parameter w controls the resulting trade-off between fidelity and diversity: higher w gives sharper, more class-aligned samples at the cost of reduced variety.
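In the noise-prediction form, the guidance step shifts the predicted noise by the scaled classifier gradient, following the ε-space formulation of Dhariwal and Nichol (2021). The sketch below uses a toy linear-softmax classifier with random weights `W` purely so the gradient can be written analytically; a real classifier would be a neural network conditioned on the timestep.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def log_prob(x, y, W):
    """log p(y | x) for a toy linear-softmax classifier (hypothetical stand-in)."""
    return log_softmax(x @ W.T)[np.arange(len(y)), y]

def log_prob_grad(x, y, W):
    """Analytic gradient of log p(y | x) with respect to x, shape [B, D]."""
    p = np.exp(log_softmax(x @ W.T))      # class probabilities, [B, K]
    onehot = np.eye(W.shape[0])[y]        # [B, K]
    return (onehot - p) @ W

def guided_eps(eps_hat, x_t, y, W, alpha_bar_t, w):
    """Shift the noise prediction against the classifier gradient, scaled by w."""
    return eps_hat - w * np.sqrt(1.0 - alpha_bar_t) * log_prob_grad(x_t, y, W)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 8))           # toy classifier weights (assumption)
x_t = rng.standard_normal((4, 8))
y = np.array([0, 1, 2, 1])
eps_hat = rng.standard_normal((4, 8))     # stand-in for the diffusion model output
eps_guided = guided_eps(eps_hat, x_t, y, W, alpha_bar_t=0.5, w=2.0)
```

Setting w = 0 recovers the unguided prediction, and larger w pushes samples harder toward the target class.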

Classifier-free guidance eliminates the need for a separate classifier by training a single diffusion model both conditionally and unconditionally, randomly dropping the label during training. At inference, the conditional and unconditional predictions are combined linearly, usually extrapolating past the conditional prediction, which achieves similar steering without a separate classifier. This approach powers text-to-image models like Stable Diffusion, where text embeddings replace explicit class labels.
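Both halves of classifier-free guidance fit in a few lines. This sketch assumes the convention in which w = 0 is unconditional and w = 1 is purely conditional; the original paper writes the same combination with a shifted scale, so values of w are not directly comparable across codebases. `drop_labels`, `cfg_eps`, and the null label `-1` are illustrative choices.

```python
import numpy as np

def drop_labels(labels, p_uncond, null_label, rng):
    """Training time: replace each label with a null token with probability p_uncond."""
    mask = rng.random(labels.shape) < p_uncond
    return np.where(mask, null_label, labels)

def cfg_eps(eps_uncond, eps_cond, w):
    """Inference time: move from the unconditional prediction toward (and past) the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=16)
train_labels = drop_labels(labels, p_uncond=0.1, null_label=-1, rng=rng)

eps_u = rng.standard_normal((2, 8))       # stand-in for the unconditional output
eps_c = rng.standard_normal((2, 8))       # stand-in for the conditional output
eps_guided = cfg_eps(eps_u, eps_c, w=3.0)
```

Each guided sampling step needs two forward passes of the same network, one with the condition and one with the null token.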

08

State of the Art — Stable Diffusion, DALL-E, Imagen

Stable Diffusion (Latent Diffusion) operates in VAE latent space, reducing computational cost while maintaining quality. DALL-E 3 and Imagen use diffusion for text-to-image synthesis at scale. Cascaded models first generate low-resolution images, then super-resolve with diffusion, improving efficiency and coherence.
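A quick back-of-the-envelope calculation shows where the latent-space savings come from. The shapes below are the commonly reported Stable Diffusion v1 configuration (512×512 RGB images, an 8× spatial downsampling autoencoder, 4 latent channels) and are stated here as an assumption rather than taken from these notes.

```python
import math

image_shape = (512, 512, 3)    # pixel-space tensor (assumed SD v1 resolution)
latent_shape = (64, 64, 4)     # 8x downsampled, 4 latent channels (assumed)

image_values = math.prod(image_shape)
latent_values = math.prod(latent_shape)
compression = image_values // latent_values

print(f"pixel values: {image_values}, latent values: {latent_values}, ratio: {compression}x")
```

The denoising network therefore processes roughly 48 times fewer values per step than a pixel-space model at the same output resolution, with the autoencoder run only once to encode or decode.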

Diffusion extends to video (frame-by-frame or frame-interpolation), 3D (point clouds, meshes, NeRFs), and discrete data (text, graphs) via discrete diffusion or multinomial diffusion. Diffusion-based models now achieve state-of-the-art results across modalities, with applications in inpainting, editing, super-resolution, and beyond. The field continues to evolve with advances in speed, control, and multimodal synthesis.

09

References & Further Reading

Diffusion models have emerged as the dominant paradigm for generative modeling in the 2020s. This section compiles foundational papers, key methods, and resources for understanding forward and reverse processes, training objectives, and modern applications that have achieved state-of-the-art results across multiple domains.

From DDPM to Stable Diffusion and beyond, these materials document the rapid evolution and widespread adoption of diffusion-based generative models.