Diffusion Models — Deep Dive

STANFORD XCS236 · DEEP GENERATIVE MODELS

Diffusion Models

Forward Process — q(x_t|x_t−1)

The forward process gradually adds Gaussian noise to data over T timesteps, transforming clean samples into (nearly) pure noise via a Markov chain. Each step adds a small amount of noise set by the variance schedule β_t, which determines the signal-to-noise ratio at each timestep.

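The process is fixed rather than learned: each step is stochastic, but it has no trainable parameters. It is defined by q(x_t|x_t−1) = 𝒩(x_t; √(1−β_t)x_t−1, β_tI). Because every step is Gaussian, the marginal q(x_t|x₀) = 𝒩(x_t; √α̅_t x₀, (1−α̅_t)I) is available in closed form, using the accumulated product α̅_t = ∏ᵢ(1−β_i) taken over i = 1, …, t. This lets x_t be sampled directly from x₀, so the loss can be computed efficiently without iterating the chain.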
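To make the closed-form marginal concrete, here is a minimal NumPy sketch. The function names `linear_beta_schedule` and `q_sample` are illustrative choices, not taken from the course material, and the schedule endpoints (1e-4 to 0.02 with T = 1000) are the defaults reported in the DDPM paper, assumed here for illustration.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (assumed DDPM-style defaults)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is a 0-based index into alpha_bar."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

T = 1000
betas = linear_beta_schedule(T)
alpha_bar = np.cumprod(1.0 - betas)  # alpha_bar[t] = product of (1 - beta_i) up to step t

rng = np.random.default_rng(0)
x0 = np.ones((4, 8))                 # toy "data" batch
x_mid, _ = q_sample(x0, 499, alpha_bar, rng)     # partly noised
x_last, _ = q_sample(x0, T - 1, alpha_bar, rng)  # almost pure noise
```

Since `alpha_bar` shrinks toward zero, the signal term vanishes at large t and `x_last` is close to a standard normal draw, without ever running the 1000-step chain.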

DDPM — Denoising Diffusion Probabilistic Models

DDPM (Ho et al., 2020) frames image generation as learning the reverse of the forward diffusion process. The model generates data by iteratively denoising a Gaussian sample. Sampling needs no explicit likelihood evaluation and training needs no adversarial game, which gives stable optimization and good mode coverage.

The simplified MSE objective trains the model to predict noise, yielding strong empirical results on image benchmarks. DDPM demonstrated that diffusion could compete with GANs and VAEs, launching the modern era of diffusion-based generative models with applications across image, audio, and video synthesis.

Noise Prediction — ε-prediction & U-Net

Instead of predicting the mean directly, the model learns to predict the noise ε that was added to produce x_t. This ε-prediction reparameterization simplifies the loss and, in the DDPM experiments, improves sample quality compared to predicting the posterior mean. A U-Net architecture with skip connections and multi-scale feature extraction has been the dominant backbone for diffusion models.
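Predicting the noise is interchangeable with predicting the clean sample, because the forward marginal x_t = √α̅_t x₀ + √(1−α̅_t) ε can be inverted for x₀. The sketch below illustrates this; `predict_x0_from_eps` is a hypothetical helper name, and the true noise is used as a stand-in for a perfect model.

```python
import numpy as np

def predict_x0_from_eps(x_t, eps_hat, alpha_bar_t):
    """Invert x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps for x0, given a noise estimate."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((2, 3))
eps = rng.standard_normal((2, 3))
alpha_bar_t = 0.3  # arbitrary noise level chosen for illustration

x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
x0_hat = predict_x0_from_eps(x_t, eps, alpha_bar_t)  # true eps plays the role of the network
```

With the exact noise, `x0_hat` matches `x0` up to floating-point error. With a learned network the estimate is only approximate, and it is least reliable at high noise levels where α̅_t is small.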

U-Net designs employ downsampling and upsampling stages, self-attention layers, and time embeddings that condition the network on the timestep. Residual connections and normalization layers (group normalization in the original DDPM U-Net) stabilize training. Cross-attention blocks enable conditioning on text prompts (as in Stable Diffusion) or other modalities, making the architecture flexible for guided generation tasks.
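The text does not say what form the time embedding takes. One common choice, assumed here, is the sinusoidal embedding borrowed from Transformer position encodings. The function name, the `max_period` default, and the sin-then-cos layout are illustrative assumptions.

```python
import numpy as np

def sinusoidal_embedding(timesteps, dim, max_period=10000.0):
    """Map integer timesteps to vectors of length dim (dim assumed even).

    The first half of each vector holds sines and the second half cosines,
    over geometrically spaced frequencies.
    """
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(timesteps, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

emb = sinusoidal_embedding([0, 10, 999], dim=64)  # shape (3, 64)
```

In a typical design this vector is passed through a small MLP and added to the feature maps inside each residual block, so every layer knows the current noise level.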

Reverse Process — p_θ(x_t−1|x_t)

The reverse process learns p_θ(x_t−1|x_t) to map noise back to data. The forward-process posterior q(x_t−1|x_t, x₀) is Gaussian with a mean and variance that can be derived analytically. DDPM therefore fixes the reverse variance to a value set by the schedule and learns only the mean, predicting it as a function of x_t and the timestep t. Sampling iterates from x_T ∼ 𝒩(0, I) down to x₀.
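A single ancestral sampling step can be sketched as follows, assuming the ε-parameterization of the mean and the fixed variance σ_t² = β_t (one of the two fixed choices discussed in the DDPM paper). The names `p_sample_step`, `sample`, and `dummy_eps_model` are hypothetical, and the dummy model stands in for a trained network.

```python
import numpy as np

def p_sample_step(x_t, t, eps_hat, betas, alpha_bar, rng):
    """One reverse step x_t -> x_{t-1} with fixed variance beta_t (t is 0-based)."""
    alpha_t = 1.0 - betas[t]
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_t)
    if t == 0:
        return mean  # no noise is added on the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

def sample(eps_model, shape, betas, rng):
    """Iterate from x_T ~ N(0, I) down to x_0, calling the model once per step."""
    alpha_bar = np.cumprod(1.0 - betas)
    x = rng.standard_normal(shape)
    for t in reversed(range(len(betas))):
        x = p_sample_step(x, t, eps_model(x, t), betas, alpha_bar, rng)
    return x

def dummy_eps_model(x, t):
    """Placeholder for a trained noise-prediction network (always predicts zero noise)."""
    return np.zeros_like(x)

betas = np.linspace(1e-4, 0.02, 50)  # short toy schedule
out = sample(dummy_eps_model, (2, 4), betas, np.random.default_rng(0))
```

The loop makes the cost visible: one network evaluation per timestep, strictly in sequence.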

The noise schedule (e.g., linear, cosine) controls the denoising trajectory. Learning the reverse variance, for example as an additional network output, can improve likelihood, though fixed variances work well empirically. The Markov structure makes sampling sequential: each step needs one network evaluation, so generation costs T evaluations. T is 1000 in the original DDPM, while accelerated samplers use far fewer steps (roughly 20–50).
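The linear and cosine schedules mentioned above can be compared directly through their cumulative products α̅_t and the resulting signal-to-noise ratio α̅_t / (1 − α̅_t). The sketch below follows the cosine formulation from the Improved DDPM paper (offset s = 0.008, β clipped at 0.999); these constants and the function name are assumptions for illustration.

```python
import numpy as np

def cosine_beta_schedule(T, s=0.008, max_beta=0.999):
    """Betas derived from alpha_bar(t) = f(t) / f(0), f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2)."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1.0 + s) * np.pi / 2.0) ** 2
    alpha_bar = f / f[0]
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)

T = 1000
linear = np.linspace(1e-4, 0.02, T)
cosine = cosine_beta_schedule(T)

ab_lin = np.cumprod(1.0 - linear)
ab_cos = np.cumprod(1.0 - cosine)
snr_lin = ab_lin / (1.0 - ab_lin)  # signal-to-noise ratio per timestep
snr_cos = ab_cos / (1.0 - ab_cos)
```

Under these settings the cosine schedule keeps more signal at the midpoint of the chain than the linear one (`ab_cos[499]` is larger than `ab_lin[499]`), which is the motivation usually given for it.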

Training Objective — Variational Bound & MSE Loss

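The training objective is the variational lower bound (ELBO), which decomposes into a prior-matching term at t = T, a reconstruction term at t = 0, and one denoising term per intermediate timestep. Each denoising term is a KL divergence between the analytic posterior q(x_t−1|x_t, x₀) and the learned p_θ(x_t−1|x_t). These denoising terms dominate the objective and measure how well the model predicts the noise at each noise level.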

The simplified MSE loss L_simple drops the per-timestep weighting coefficients of the bound and samples timesteps uniformly, making optimization simpler while remaining effective. The connection to score matching shows that, up to a per-timestep weighting, the noise-prediction objective is equivalent to denoising score matching: the network learns the score ∇_x log q_t(x) of the noise-perturbed data distribution at each noise level. This links diffusion models to score-based and energy-based models and grounds them in ideas from statistical physics.
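A minimal sketch of one evaluation of L_simple is given below, together with the rescaling that turns a noise estimate into a score estimate. The names `simple_loss`, `score_from_eps`, and `dummy_eps_model` are hypothetical, and the dummy model stands in for a trained U-Net.

```python
import numpy as np

def simple_loss(eps_model, x0, alpha_bar, rng):
    """Monte Carlo estimate of L_simple = E[ ||eps - eps_theta(x_t, t)||^2 ] with uniform t."""
    batch = x0.shape[0]
    t = rng.integers(0, len(alpha_bar), size=batch)          # uniform timesteps
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t].reshape(batch, *([1] * (x0.ndim - 1)))   # broadcast over data dims
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps            # closed-form forward sample
    return np.mean((eps - eps_model(x_t, t)) ** 2)

def score_from_eps(eps_hat, alpha_bar_t):
    """Score estimate implied by a noise estimate: -eps_hat / sqrt(1 - alpha_bar_t)."""
    return -eps_hat / np.sqrt(1.0 - alpha_bar_t)

def dummy_eps_model(x_t, t):
    """Placeholder network that always predicts zero noise."""
    return np.zeros_like(x_t)

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
rng = np.random.default_rng(0)
x0 = rng.standard_normal((256, 16))
loss = simple_loss(dummy_eps_model, x0, alpha_bar, rng)
```

A model that always predicts zero gets a loss near 1, the variance of the noise, which is a useful baseline when checking that a real model is learning.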

Sampling Acceleration — DDIM, DPM-Solver & Distillation

Running all T reverse denoising steps is slow, since each one requires a network evaluation. DDIM (Denoising Diffusion Implicit Models) uses a non-Markovian formulation whose reverse process can be made deterministic, which allows it to skip timesteps and reduce the count from 1000 to roughly 20–50 while largely maintaining quality. DPM-Solver and other ODE-based methods apply numerical solvers to the probability flow ODE, further accelerating sampling.
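The deterministic DDIM update (the η = 0 case) can be sketched as below: estimate x₀ from the current sample, then re-noise it to the earlier noise level using the same predicted ε. The function names, the evenly spaced timestep subsequence, and the dummy model are assumptions for illustration.

```python
import numpy as np

def ddim_step(x_t, eps_hat, ab_t, ab_prev):
    """Deterministic DDIM update from noise level ab_t to the less noisy level ab_prev."""
    x0_hat = (x_t - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps_hat

def ddim_sample(eps_model, shape, alpha_bar, num_steps, rng):
    """Sample using only num_steps of the original timesteps."""
    T = len(alpha_bar)
    ts = np.linspace(T - 1, 0, num_steps).round().astype(int)  # e.g. 999, ..., 0
    x = rng.standard_normal(shape)
    for i, t in enumerate(ts):
        # After the last visited timestep, jump to the clean level alpha_bar = 1.
        ab_prev = alpha_bar[ts[i + 1]] if i + 1 < num_steps else 1.0
        x = ddim_step(x, eps_model(x, t), alpha_bar[t], ab_prev)
    return x

def dummy_eps_model(x, t):
    """Placeholder for a trained noise-prediction network."""
    return np.zeros_like(x)

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
out = ddim_sample(dummy_eps_model, (2, 4), alpha_bar, 20, np.random.default_rng(0))
```

No fresh noise is drawn inside the loop, so the output is a deterministic function of the initial x_T. This is what makes skipping steps and interpolating in latent space workable.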

Distillation trains a student model to reproduce the output of a many-step teacher sampler in fewer steps, moving generation toward real time. Progressive distillation repeats this, reducing the step count in stages. Consistency models learn to map a sample at any noise level directly to clean data, allowing single-step generation at some cost in sample quality. Latent-space diffusion (operating in compressed VAE embeddings) and caching of intermediate computations also reduce computational cost, making diffusion practical for deployment.

Conditional Generation — Classifier & Classifier-Free Guidance

Classifier guidance trains a separate classifier on noisy images to predict class labels, then uses the gradient of its log-probability with respect to x_t to steer the diffusion trajectory toward regions where the target class is likely. A scale parameter w controls the resulting trade-off between fidelity and diversity: higher w gives sharper, more class-aligned samples at the cost of reduced variety.
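In the ε-parameterization, classifier guidance amounts to shifting the predicted noise by the scaled classifier gradient, ε̂ = ε_θ − w √(1−α̅_t) ∇ log p(y|x_t). The sketch below assumes that form. The "classifier" is a toy Gaussian log-probability with an analytic gradient, a hypothetical stand-in for a trained noisy-image classifier, and all function names are illustrative.

```python
import numpy as np

def toy_log_prob_grad(x_t, class_mean):
    """Gradient of a toy log p(y | x_t) = -||x_t - class_mean||^2 / 2 + const."""
    return class_mean - x_t

def classifier_guided_eps(eps_hat, grad_log_p, alpha_bar_t, w):
    """Shift the noise estimate against the classifier gradient, scaled by w."""
    return eps_hat - w * np.sqrt(1.0 - alpha_bar_t) * grad_log_p

rng = np.random.default_rng(0)
x_t = rng.standard_normal((2, 3))
eps_hat = rng.standard_normal((2, 3))  # stand-in for the diffusion model's output
class_mean = np.full((2, 3), 2.0)      # made-up class centre for the toy classifier

grad = toy_log_prob_grad(x_t, class_mean)
guided = classifier_guided_eps(eps_hat, grad, alpha_bar_t=0.5, w=3.0)
```

Setting w = 0 returns the unguided prediction. Larger w pushes samples harder toward the classifier's preferred region, which is the fidelity side of the trade-off.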

Classifier-free guidance eliminates the need for a separate classifier by training a single diffusion model both with and without labels, typically by randomly dropping the label during training. At inference, the conditional and unconditional predictions are combined, usually by extrapolating from the unconditional prediction past the conditional one, which achieves similar steering without extra parameters. This approach powers text-to-image models like Stable Diffusion, where text embeddings replace explicit class labels.
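The two ingredients of classifier-free guidance can be sketched as below. The combination rule uses the convention ε̃ = ε_uncond + w (ε_cond − ε_uncond), in which w = 1 recovers the plain conditional prediction; other papers parameterize the scale differently. The function names, the drop probability, and the `null_label` value are illustrative assumptions.

```python
import numpy as np

def drop_labels(labels, p_uncond, null_label, rng):
    """Training side: replace each label with a null token with probability p_uncond."""
    mask = rng.random(labels.shape[0]) < p_uncond
    return np.where(mask, null_label, labels)

def cfg_eps(eps_cond, eps_uncond, w):
    """Inference side: move from the unconditional prediction toward (and past) the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
labels = np.array([3, 1, 4, 1, 5, 9, 2, 6])
train_labels = drop_labels(labels, p_uncond=0.1, null_label=-1, rng=rng)

eps_cond = np.array([1.0, 2.0])    # stand-ins for the two network outputs
eps_uncond = np.array([0.0, 1.0])
guided = cfg_eps(eps_cond, eps_uncond, w=2.0)
```

Each guided sampling step needs two forward passes, one with the condition and one with the null token, which is the main runtime cost of the method.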

State of the Art — Stable Diffusion, DALL-E, Imagen

Stable Diffusion (Latent Diffusion) runs the diffusion process in the latent space of a VAE rather than in pixel space, reducing computational cost while maintaining quality. DALL-E 3 and Imagen use diffusion for text-to-image synthesis at scale. Cascaded models such as Imagen first generate low-resolution images, then super-resolve them with further diffusion models, improving efficiency and coherence.
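To get a rough sense of the savings from working in latent space, the arithmetic below compares tensor sizes. The shapes are an assumption: they correspond to the commonly cited Stable Diffusion v1 configuration (512×512 RGB images, 8× spatial downsampling, 4 latent channels) and are not stated in the text.

```python
import math

def num_elements(shape):
    """Total number of scalar values in a tensor of the given shape."""
    return math.prod(shape)

pixel_shape = (512, 512, 3)   # assumed image resolution
latent_shape = (64, 64, 4)    # assumed VAE latent size (8x downsampling, 4 channels)

ratio = num_elements(pixel_shape) / num_elements(latent_shape)
print(ratio)  # 48.0
```

The denoising network therefore processes about 48 times fewer values per step under these assumptions. The actual speedup depends on the architecture and is not simply proportional to this ratio.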

Diffusion extends to video (frame-by-frame generation or frame interpolation), 3D (point clouds, meshes, NeRFs), and discrete data (text, graphs) via discrete or multinomial diffusion. Diffusion-based models now achieve strong, often state-of-the-art, results across these modalities, with applications in inpainting, editing, super-resolution, and beyond. The field continues to evolve with advances in speed, control, and multimodal synthesis.

References & Further Reading

Diffusion models have emerged as a leading paradigm for generative modeling in the 2020s. This section compiles foundational papers, key methods, and resources for understanding the forward and reverse processes, training objectives, and modern applications.

From DDPM to Stable Diffusion and beyond, these materials document the rapid evolution and widespread adoption of diffusion-based generative models.

Forward Process

Linear Schedule

Cosine Schedule

Learned Schedule

Logarithmic Schedule

Signal-to-Noise Ratio

DDPM — Denoising Diffusion Probabilistic Models

Simplicity

Mode Coverage

Flexibility

Computational Cost

Empirical Results

Noise Prediction — ε-prediction & U-Net Architecture

Skip Connections

Self-Attention

Time Embedding

Cross-Attention

Design Variations

Reverse Process — pθ(xt−1|xt)

Training Objective — Variational Bound & MSE Loss

Connection to Score Matching

Interpretation

Theoretical Grounding

Stability

Scalability

Alternative Objectives

Sampling Acceleration — DDIM, DPM-Solver & Distillation

DDIM

DPM-Solver

Consistency Models

Latent Diffusion

Knowledge Distillation

Conditional Generation — Classifier & Classifier-Free Guidance

Classifier-Free Advantage

Scale Interpretation

Text Conditioning

Multi-Modal Conditioning

Advanced Guidance Strategies

State of the Art — Stable Diffusion, DALL-E, Imagen, Video & Discrete Diffusion

Latent Diffusion (Stable)

Text Conditioning

Cascaded Models

Negative Prompts

Video & 3D Diffusion

Current Capabilities & Limitations

References & Further Reading

Foundational Papers

Key Concepts

Acceleration & Efficiency

Conditioning & Control

Modern Applications

Learning Resources
