Score-Based Generative Models

STANFORD XCS236 · DEEP GENERATIVE MODELS

The Score Function

The score function is the gradient of the log probability: s(x) = ∇_x log p(x). It points in the direction of increasing probability density, encoding the shape and geometry of the data distribution without explicitly computing probabilities. The score captures local information about where the data manifold resides in high-dimensional space.

Unlike density estimation, which requires evaluating a normalized probability, score-based methods avoid the intractable partition function: for p(x) = exp(-E(x))/Z, the constant Z vanishes under the gradient, so ∇_x log p(x) = -∇_x E(x). This quantity, sometimes called the Stein score, connects to both classical probability theory and modern variational inference. Understanding scores is fundamental because they link gradient-based sampling, optimal transport, and generative modeling.
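As a small illustration of why the normalizing constant never matters, the sketch below compares the closed-form score of a Gaussian with a finite-difference gradient of its unnormalized log-density. The function names and the specific values of mu, sigma, and x are hypothetical choices made for this example only.

```python
import numpy as np

def log_unnormalized_density(x, mu, sigma):
    """Log of an unnormalized Gaussian density: the constant -log Z is omitted."""
    return -0.5 * np.sum((x - mu) ** 2) / sigma**2

def analytic_score(x, mu, sigma):
    """Closed-form score of N(mu, sigma^2 I): grad_x log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma**2

def numerical_score(log_density, x, h=1e-5):
    """Central finite-difference gradient of a scalar log-density at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (log_density(x + step) - log_density(x - step)) / (2 * h)
    return grad

# Illustrative values (assumptions for this sketch)
mu = np.array([1.0, -2.0])
sigma = 0.5
x = np.array([0.3, 0.7])

score_exact = analytic_score(x, mu, sigma)
score_fd = numerical_score(lambda v: log_unnormalized_density(v, mu, sigma), x)
print(score_exact, score_fd)
```

Both vectors agree even though the log-density was never normalized, and both point from x toward the mean mu, the direction of increasing density.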

∇_x log p(x)

Score Formula

Gradient Space

Function Domain

No Partition Z

Key Advantage

Local Density

Geometric Info

Score Matching Objectives

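Score matching minimizes the Fisher divergence between the true and estimated score functions, so no normalized density ever has to be evaluated. The key insight is an integration-by-parts trick: under mild boundary conditions, ½·E_p[||∇_x log p(x) - s_θ(x)||²] equals E_p[½·||s_θ(x)||² + ∇·s_θ(x)] plus a constant that does not depend on θ. The rewritten objective requires only the model score and its divergence (the trace of its Jacobian) evaluated at data samples, not the score of the true distribution.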
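For reference, the integration-by-parts step can be written out as follows. This is a sketch of the standard argument, assuming that p(x)·s_θ(x) vanishes as ||x|| → ∞ so the boundary term drops out; the factor ½ is a common convention and does not change the minimizer.

```latex
\begin{aligned}
\tfrac12\,\mathbb{E}_{p}\!\left[\lVert \nabla_x \log p(x) - s_\theta(x)\rVert^2\right]
&= \tfrac12\,\mathbb{E}_{p}\!\left[\lVert s_\theta(x)\rVert^2\right]
 - \mathbb{E}_{p}\!\left[ s_\theta(x)^\top \nabla_x \log p(x)\right]
 + \underbrace{\tfrac12\,\mathbb{E}_{p}\!\left[\lVert \nabla_x \log p(x)\rVert^2\right]}_{\text{constant in }\theta} \\[4pt]
\mathbb{E}_{p}\!\left[ s_\theta(x)^\top \nabla_x \log p(x)\right]
&= \int s_\theta(x)^\top \nabla_x p(x)\,dx
 = -\int p(x)\,\nabla_x\!\cdot s_\theta(x)\,dx
 = -\,\mathbb{E}_{p}\!\left[\nabla_x\!\cdot s_\theta(x)\right] \\[4pt]
\Longrightarrow\quad
\tfrac12\,\mathbb{E}_{p}\!\left[\lVert \nabla_x \log p(x) - s_\theta(x)\rVert^2\right]
&= \mathbb{E}_{p}\!\left[\tfrac12\lVert s_\theta(x)\rVert^2 + \nabla_x\!\cdot s_\theta(x)\right] + \text{const.}
\end{aligned}
```

The second line uses p(x)·∇_x log p(x) = ∇_x p(x), which is what removes the unknown data score from the objective.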

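The resulting objective is called implicit score matching because it never references the true score. Its divergence term still requires the Jacobian of s_θ, which is expensive in high dimensions; stochastic estimators based on random projections (as in sliced score matching) avoid forming it explicitly. This family of methods forms the foundation for scalable training of score-based generative models, enabling learning without synthetic data or adversarial objectives. The mathematical elegance lies in transforming an intractable objective into a tractable one through differential calculus.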
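To make the tractable objective concrete, here is a minimal sketch that fits a one-dimensional Gaussian model by minimizing a Monte Carlo estimate of E[½·s_θ(x)² + ds_θ/dx] over a parameter grid. The model family, the grid, and the data-generating parameters are assumptions chosen for illustration; a real model would use gradient descent on a neural network instead of a grid search.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=2.0, size=5_000)  # samples from the "unknown" p(x)

def implicit_sm_loss(mu, var, x):
    """Monte Carlo estimate of E[0.5 * s(x)^2 + ds/dx] for the model score s(x) = -(x - mu) / var."""
    score = -(x - mu) / var
    divergence = -1.0 / var          # d/dx of the model score (constant in x for this model)
    return np.mean(0.5 * score**2 + divergence)

# Brute-force search over a small, hypothetical parameter grid
mus = np.linspace(0.0, 3.0, 61)
variances = np.linspace(1.0, 8.0, 141)
losses = np.array([[implicit_sm_loss(m, v, data) for v in variances] for m in mus])
i, j = np.unravel_index(np.argmin(losses), losses.shape)
mu_hat, var_hat = mus[i], variances[j]

print(mu_hat, var_hat)            # score-matching estimates
print(data.mean(), data.var())    # sample moments, for comparison
```

The loss only touches the model score and its derivative at the data points, yet its minimizer lands next to the sample mean and variance.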

Fisher Divergence

Measures disagreement between score fields in a probabilistically principled way.

Integration by Parts

Replaces the cross term involving the unknown data score with the divergence of the model score.

Implicit Matching

The objective never references the true score; stochastic estimators of the divergence avoid explicit Jacobian computation.

Scalability

No discriminator and no adversarial training: a single regression-style objective on gradients.

Denoising Score Matching

Denoising score matching exploits the observation that the score of a noise-perturbed distribution is easier to estimate than the score of the data itself. For perturbed data x_t = x + σ·ε with ε ~ N(0, I), the conditional score has the closed form ∇_{x_t} log p(x_t | x) = -(x_t - x)/σ² = -ε/σ, and regressing s_θ(x_t) onto this target recovers the marginal score ∇_x log p_t(x) of the noisy distribution. This connects to denoising autoencoders: training a network to predict the noise added during corruption estimates that score up to the factor -1/σ.

This elegant reformulation avoids explicit Hessian computation while gaining robustness through noise. The denoising perspective reveals a deep connection: a denoising network that predicts noise is simultaneously learning the score function. Multiple noise levels σ₁ > σ₂ > ... > σ_L can be used to learn scores across scales, forming the foundation for annealed sampling and powerful generative models.
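The claim that regressing onto the conditional target recovers the marginal score can be checked numerically in a case where everything is known. The sketch below assumes standard normal data, so the noisy distribution is N(0, 1 + σ²) with score -x_t/(1 + σ²), and fits a hypothetical one-parameter model s(x_t) = a·x_t by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.8
n = 200_000

x = rng.normal(size=n)                 # clean data, x ~ N(0, 1)
eps = rng.normal(size=n)               # standard Gaussian noise
x_t = x + sigma * eps                  # perturbed data, x_t ~ N(0, 1 + sigma^2)

target = -eps / sigma                  # DSM regression target, equal to -(x_t - x) / sigma^2

# Least-squares fit of the one-parameter model s(x_t) = a * x_t to the target
a_hat = np.sum(x_t * target) / np.sum(x_t**2)
a_true = -1.0 / (1.0 + sigma**2)       # slope of the true noisy score -x_t / (1 + sigma^2)

print(a_hat, a_true)
```

The fitted slope matches the slope of the marginal score even though each individual target depends on the clean sample x, which the model never sees.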

x_t = x + σ·ε

Noise Perturbation

Predict ε

Denoising Task

∇ log p_t

Learned Score

Multi-scale

Noise Ladder

Noise Conditional Networks

Noise Conditional Score Networks (NCSNs) introduce a single neural network conditioned on the noise level σ. Rather than training a separate network for each noise scale, s_θ(x, σ) learns to output the score for any level in the set {σ₁, ..., σ_L}. Training samples a noise level uniformly from this set and applies the denoising objective at that scale, typically weighted by σ² so that every level contributes comparably to the loss. Sharing one set of weights across scales is far cheaper than training L networks and lets the model capture multi-scale structure in the data.
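A minimal sketch of this training loss is shown below, assuming a geometric noise ladder and a σ² weighting. The names geometric_sigmas, ncsn_loss, and score_fn are hypothetical, and score_fn stands in for a neural network s_θ(x, σ); the toy data is a point mass at the origin so that the exact noisy score is known.

```python
import numpy as np

def geometric_sigmas(sigma_max, sigma_min, num_levels):
    """Decreasing geometric sequence sigma_1 > ... > sigma_L."""
    return np.geomspace(sigma_max, sigma_min, num_levels)

def ncsn_loss(score_fn, x, sigmas, rng):
    """Denoising score matching loss with one random noise level per example."""
    idx = rng.integers(0, len(sigmas), size=x.shape[0])   # uniform over the L levels
    sigma = sigmas[idx][:, None]                          # shape (batch, 1)
    eps = rng.normal(size=x.shape)
    x_noisy = x + sigma * eps
    # sigma^2 * || s + eps / sigma ||^2 rewritten as || sigma * s + eps ||^2
    residual = sigma * score_fn(x_noisy, sigma) + eps
    return 0.5 * np.mean(np.sum(residual**2, axis=1))

rng = np.random.default_rng(0)
sigmas = geometric_sigmas(1.0, 0.01, 10)
x = np.zeros((4096, 2))                                   # toy data: point mass at the origin

exact_score = lambda x_noisy, sigma: -x_noisy / sigma**2  # true score of N(0, sigma^2 I)
zero_score = lambda x_noisy, sigma: np.zeros_like(x_noisy)

loss_exact = ncsn_loss(exact_score, x, sigmas, rng)
loss_zero = ncsn_loss(zero_score, x, sigmas, rng)
print(loss_exact, loss_zero)
```

With the σ² weighting, an uninformative model incurs roughly the same loss at every noise level, which is why no single scale dominates training.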

The NCSN framework unifies score estimation across noise scales. During sampling, annealed Langevin dynamics starts at the highest noise level σ₁ to establish coarse structure, then steps down through the levels to the smallest, σ_L, to fill in details. This hierarchical approach mirrors the intuition of coarse-to-fine generation. Empirically, NCSNs produced image samples comparable in quality to those of GANs before diffusion models became dominant, establishing score-based methods as competitive.

Condition on σ

Single network learns all noise scales through explicit conditioning mechanism.

Efficient Training

One model trained on mixed noise levels replaces separate per-scale networks.

Annealed Sampling

Iteratively refine samples: start coarse (high σ), gradually sharpen (low σ).

Hierarchical Generation

Coarse-to-fine structure naturally emerges from noise scheduling strategy.

Langevin Dynamics Sampling

Langevin dynamics is a Markov chain Monte Carlo (MCMC) algorithm for sampling from a distribution given only its score. The update rule is x_{t+1} = x_t + ε·∇_x log p(x_t) + √(2ε)·z_t, where ε is the step size, z_t ~ N(0,I), and the drift term ∇_x log p is exactly the score. With a finite step size the chain carries a small discretization bias; as ε → 0 and the number of iterations → ∞, it converges to the true distribution under mild regularity conditions.
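The update rule can be sketched in a few lines. The example below assumes a one-dimensional Gaussian target whose score is known in closed form; langevin_sample and the chosen step size and chain count are hypothetical, and in a score-based model the lambda would be replaced by a learned network.

```python
import numpy as np

def langevin_sample(score_fn, x0, step_size, num_steps, rng):
    """Unadjusted Langevin dynamics: x <- x + eps * score(x) + sqrt(2 * eps) * z."""
    x = np.array(x0, dtype=float)
    for _ in range(num_steps):
        z = rng.normal(size=x.shape)
        x = x + step_size * score_fn(x) + np.sqrt(2.0 * step_size) * z
    return x

# Target N(mu, sigma^2), accessed only through its score
mu, sigma = 3.0, 0.5
gaussian_score = lambda x: -(x - mu) / sigma**2

rng = np.random.default_rng(0)
x0 = rng.uniform(-10.0, 10.0, size=5000)          # 5000 independent chains, arbitrary start
samples = langevin_sample(gaussian_score, x0, step_size=1e-3, num_steps=2000, rng=rng)
print(samples.mean(), samples.std())
```

The chains forget their starting points and settle near the target's mean and standard deviation, using nothing but score evaluations.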

Annealed Langevin dynamics runs Langevin sampling at a sequence of decreasing noise levels, initializing each level from the samples of the previous one. Starting with high noise (σ large) smooths the distribution and enables global exploration; reducing the noise progressively refines details. This hierarchical strategy mitigates the slow mixing between well-separated modes that plain Langevin dynamics suffers from in high dimensions. With a learned score network s_θ(x,σ), sampling becomes entirely score-based, replacing explicit density models with gradient estimation.
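The following sketch shows annealed Langevin dynamics on a toy bimodal target: two point masses at ±4, whose noise-perturbed score is available in closed form. The function names, the noise ladder, and the rule that scales the step size with σ² are assumptions for illustration; the analytic mixture_score plays the role of a trained s_θ(x, σ).

```python
import numpy as np

def mixture_score(x, sigma, mode=4.0):
    """Score of 0.5*N(-mode, sigma^2) + 0.5*N(+mode, sigma^2): two point masses blurred by noise sigma."""
    return (mode * np.tanh(mode * x / sigma**2) - x) / sigma**2

def annealed_langevin(score_fn, x0, sigmas, steps_per_level, step_scale, rng):
    """Run Langevin dynamics at each noise level in turn, from largest sigma to smallest."""
    x = np.array(x0, dtype=float)
    for sigma in sigmas:
        step = step_scale * sigma**2               # step size shrinks with the noise level
        for _ in range(steps_per_level):
            z = rng.normal(size=x.shape)
            x = x + step * score_fn(x, sigma) + np.sqrt(2.0 * step) * z
    return x

rng = np.random.default_rng(0)
sigmas = np.geomspace(10.0, 0.1, 10)               # sigma_1 > ... > sigma_L
x0 = rng.normal(scale=sigmas[0], size=4000)
samples = annealed_langevin(mixture_score, x0, sigmas,
                            steps_per_level=100, step_scale=0.05, rng=rng)

print(np.mean(samples > 0))        # fraction of samples in the right-hand mode
print(np.mean(np.abs(samples)))    # typical distance from the origin
```

At the highest noise level the two modes merge into one broad bump that chains cross freely; by the time the modes separate, the samples are already split between them in roughly the right proportion.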

x := x + ε·∇log p

Score Drift

+ √(2ε)·z

Stochastic Diffusion

ε → 0

Convergence Limit

Annealed σ

Multi-scale Hierarchy

Stochastic Differential Equations

Score-based modeling naturally extends to continuous-time SDEs: dx = f(x,t)dt + g(t)dw, where f is the drift, g(t) is the diffusion coefficient, and w is standard Brownian motion. The corresponding reverse-time SDE is dx = [f(x,t) - g(t)²·∇_x log p_t(x)]dt + g(t)dw̄, where time runs backward from T to 0 and w̄ is a Brownian motion in reverse time. The only unknown quantity is the score ∇_x log p_t(x), which appears in the drift of the reverse process.
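A minimal sketch of sampling with the reverse-time SDE follows, using Euler-Maruyama steps. It assumes a VP-type forward SDE with constant rate, dx = -½·β·x dt + √β dw, and Gaussian toy data so the marginal score is exact; beta, mu0, s0, and the function names are illustrative choices, not values from the course.

```python
import numpy as np

beta = 8.0            # constant noise rate of the forward SDE dx = -0.5*beta*x dt + sqrt(beta) dw
mu0, s0 = 2.0, 0.5    # toy data distribution N(mu0, s0^2), so every p_t stays Gaussian
T, num_steps = 1.0, 1000

def marginal_score(x, t):
    """Exact score of p_t = N(mu0*exp(-beta*t/2), s0^2*exp(-beta*t) + 1 - exp(-beta*t))."""
    decay = np.exp(-beta * t)
    mean = mu0 * np.sqrt(decay)
    var = s0**2 * decay + 1.0 - decay
    return -(x - mean) / var

def reverse_sde_sample(num_samples, rng):
    """Euler-Maruyama integration of the reverse-time SDE from t = T down to t = 0."""
    dt = T / num_steps
    x = rng.normal(size=num_samples)               # prior: p_T is close to N(0, 1)
    for k in range(num_steps, 0, -1):
        t = k * dt
        drift = -0.5 * beta * x - beta * marginal_score(x, t)   # f - g^2 * score
        x = x - drift * dt + np.sqrt(beta * dt) * rng.normal(size=num_samples)
    return x

samples = reverse_sde_sample(20_000, np.random.default_rng(0))
print(samples.mean(), samples.std())
```

Starting from pure noise, the reverse process recovers samples whose mean and spread match the data distribution; swapping marginal_score for a trained network gives the generative sampler.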

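Several choices of f and g fit this framework: the Variance-Preserving (VP) SDE shrinks the signal while adding noise so that the total variance stays bounded; the Variance-Exploding (VE) SDE leaves the signal unscaled and lets the noise variance grow large; the sub-VP SDE is a variant whose variance is bounded above by that of the VP SDE at every time. Each SDE also admits a probability flow ODE, dx = [f(x,t) - ½·g(t)²·∇_x log p_t(x)]dt, which removes the stochasticity yet has the same marginal distributions p_t. The SDE view provides mathematical rigor, a connection to probability theory, and alternative sampling strategies such as black-box ODE solvers.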
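The probability flow ODE can be sketched for the VE case, where f = 0 and g(t)² = d[σ(t)²]/dt, so that dx/dt = -½·g(t)²·∇_x log p_t(x). The geometric noise schedule, the Gaussian toy data, and all names below are assumptions for illustration, and plain Euler steps stand in for a proper ODE solver.

```python
import numpy as np

sigma_min, sigma_max = 0.01, 20.0
mu0, s0 = 2.0, 0.5                      # toy data distribution N(mu0, s0^2)

def sigma_of_t(t):
    """VE noise schedule: geometric interpolation between sigma_min and sigma_max."""
    return sigma_min * (sigma_max / sigma_min) ** t

def marginal_score(x, t):
    """Exact score of p_t = N(mu0, s0^2 + sigma(t)^2) under the VE SDE."""
    return -(x - mu0) / (s0**2 + sigma_of_t(t) ** 2)

def probability_flow_sample(x_T, num_steps=1000):
    """Euler integration of dx/dt = -0.5 * g(t)^2 * score from t = 1 down to t = 0."""
    dt = 1.0 / num_steps
    x = np.array(x_T, dtype=float)
    for k in range(num_steps, 0, -1):
        t = k * dt
        g2 = 2.0 * sigma_of_t(t) ** 2 * np.log(sigma_max / sigma_min)   # d sigma^2 / dt
        dx_dt = -0.5 * g2 * marginal_score(x, t)
        x = x - dx_dt * dt                                              # step backward in time
    return x

rng = np.random.default_rng(0)
x_T = rng.normal(scale=sigma_max, size=20_000)     # prior N(0, sigma_max^2)
samples = probability_flow_sample(x_T)
samples_again = probability_flow_sample(x_T)       # same noise in, same samples out
print(samples.mean(), samples.std())
```

No noise is injected during integration, so each prior draw maps to exactly one sample, while the population still ends up with approximately the data mean and spread.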

Forward SDE

Gradually corrupt data: dx = f(x,t)dt + g(t)dw.

Reverse-time SDE

Denoise via score: reverse drift uses ∇ log p_t(x).

VP, VE, sub-VP

Different schedules trade variance, SNR, and sampling speed.

Probability Flow ODE

Deterministic sampling path with same marginals; fast inference.

Score-Based SDE Framework

The unified score-based SDE framework integrates all prior ideas: train a time-conditional score network s_θ(x,t) to estimate ∇_x log p_t(x) at every time t, where t indexes the noise level. During generation, integrate the reverse-time SDE or the probability flow ODE with the learned score in place of the true one. Diffusion models, score matching, denoising autoencoders, and Langevin dynamics all appear as instances or components of this single paradigm.
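One common way to write the training objective for s_θ(x,t) is as a time-averaged denoising score matching loss. The notation below is an assumption rather than taken from this page: λ(t) is a positive weighting function and p_{0t}(x_t | x_0) is the transition kernel of the forward SDE, which is Gaussian with known mean and variance for the VP, VE, and sub-VP choices.

```latex
\theta^\ast = \arg\min_\theta\;
\mathbb{E}_{t \sim \mathcal{U}(0,T)}\Big[\lambda(t)\,
\mathbb{E}_{x_0 \sim p_0}\,\mathbb{E}_{x_t \sim p_{0t}(x_t \mid x_0)}
\big\lVert s_\theta(x_t, t) - \nabla_{x_t} \log p_{0t}(x_t \mid x_0)\big\rVert^2\Big]
```

Because the kernel is Gaussian, the regression target has a closed form, so each training step needs only a data sample, a random time, and one draw of noise.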

Key advantages include flexible time-conditioning and unified handling of multiple noise schedules, and the continuous-time formulation makes the sampler amenable to convergence analysis. The probability flow ODE variant enables deterministic generation and, through the instantaneous change-of-variables formula, log-likelihood computation for the learned model. Practical techniques such as an Exponential Moving Average (EMA) of the weights, importance weighting over time, and continuous-time scheduling further improve sample quality and efficiency.

s_θ(x,t)

Time-Conditional Network

VP/VE/sub-VP

SDE Variants

Reverse SDE

Generative Direction

Probability ODE

Deterministic Sampling

Connections to Diffusion

Denoising diffusion probabilistic models (DDPM) emerge as a special case of the score-based SDE framework: their forward process is a discretization of the VP SDE with a specific noise schedule. DDPM applies a sequence of discrete noise additions: q(x_t | x_{t-1}) = N(x_t; √(1-β_t)·x_{t-1}, β_t·I). The reverse process is learned via denoising: predicting the noise ε at each step recovers the score through ∇ log p_t(x_t) ≈ -ε_θ(x_t, t)/√(1-ᾱ_t), where ᾱ_t = ∏_{s≤t}(1-β_s). This connection reveals DDPM training as denoising score matching with a particular noise schedule and loss weighting.
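The link between the step-by-step kernel, its closed-form marginal, and the noise-to-score conversion can be checked numerically. The sketch below assumes a linear β schedule with 1000 steps, which is a common choice but not one stated on this page; q_sample and eps_to_score are hypothetical helper names.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)        # assumed linear beta schedule
alpha_bar = np.cumprod(1.0 - betas)          # alpha_bar[t-1] = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, eps):
    """Closed-form forward sample x_t = sqrt(alpha_bar_t)*x0 + sqrt(1 - alpha_bar_t)*eps (t is 1-indexed)."""
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def eps_to_score(eps, t):
    """Convert a noise value (or a network's noise prediction) into the score of q(x_t | x0)."""
    return -eps / np.sqrt(1.0 - alpha_bar[t - 1])

rng = np.random.default_rng(0)
t, x0 = 200, 1.5

# Step-by-step simulation of q(x_s | x_{s-1}) for s = 1..t, starting from a fixed x0
x = np.full(50_000, x0)
for s in range(t):
    x = np.sqrt(1.0 - betas[s]) * x + np.sqrt(betas[s]) * rng.normal(size=x.shape)

# Score of the Gaussian q(x_t | x0), computed directly and via the noise
eps = rng.normal(size=5)
x_t = q_sample(x0, t, eps)
conditional_score = -(x_t - np.sqrt(alpha_bar[t - 1]) * x0) / (1.0 - alpha_bar[t - 1])

print(x.mean(), np.sqrt(alpha_bar[t - 1]) * x0)
print(x.var(), 1.0 - alpha_bar[t - 1])
```

The simulated chain matches the closed-form mean and variance, and the conditional score equals -ε/√(1-ᾱ_t), which is the rescaling that turns a noise predictor into a score estimator.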

Classifier guidance extends both frameworks: by Bayes' rule, ∇_x log p(x_t | y) = ∇_x log p(x_t) + ∇_x log p(y | x_t), so adding the gradient of a noise-aware classifier to the unconditional score enables conditional generation. The unified view also explains how sampling can be accelerated: because the same learned score defines a continuous-time ODE, higher-order or adaptive solvers can take fewer, larger steps than the original discrete diffusion chain. Recent work builds on this connection: probability flow ODEs, consistency models, and flow matching all draw on the view of diffusion as score-based SDE sampling.
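The Bayes-rule decomposition behind classifier guidance can be verified on a toy problem where every term is analytic. The sketch assumes two equally likely classes with unit-variance Gaussian class conditionals at ±m; the function names and the scale parameter are hypothetical, and in practice both the score and the classifier would be learned networks evaluated on noisy inputs.

```python
import numpy as np

m = 2.0   # class y=+1 has p(x | y) = N(+m, 1), class y=-1 has N(-m, 1), equal priors

def unconditional_score(x):
    """Score of the mixture 0.5*N(-m, 1) + 0.5*N(+m, 1)."""
    return m * np.tanh(m * x) - x

def classifier_grad(x, y):
    """grad_x log p(y | x), where p(y | x) = sigmoid(2*m*y*x) for y in {-1, +1}."""
    return 2.0 * m * y * (1.0 - 1.0 / (1.0 + np.exp(-2.0 * m * y * x)))

def guided_score(x, y, scale=1.0):
    """Unconditional score plus a scaled classifier gradient; scale=1 gives the exact conditional score."""
    return unconditional_score(x) + scale * classifier_grad(x, y)

x = np.linspace(-5.0, 5.0, 11)
print(guided_score(x, +1))     # guided score for class +1
print(-(x - m))                # exact score of N(+m, 1), for comparison
```

With scale=1 the guided score coincides with the class-conditional score; values above 1 push samples further toward regions the classifier assigns to y, trading diversity for class fidelity.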

DDPM Foundation

DDPM is a discretization of the VP SDE with a specific β schedule.

Noise Prediction

Predicting the noise ε is equivalent, up to the scale factor -1/√(1-ᾱ_t), to predicting the score ∇ log p_t(x).

Classifier Guidance

Adding the classifier gradient ∇_x log p(y | x_t) to the score enables class-conditional generation.

Fast Sampling

ODE solvers, distillation, and consistency models reduce the number of score evaluations per sample.

References & Further Reading

Score-based generative modeling offers a unified framework connecting score matching, diffusion models, and energy-based methods. This section compiles key papers and resources for understanding score functions, training objectives, and sampling algorithms that form the backbone of modern generative models.

From classical score matching to recent SDE formulations, these materials trace the theoretical and practical advances in score-based modeling.

