Stanford XCS236 · Deep Generative Models
Energy-Based Models
Week 5–6 · Energy Functions, MCMC & Contrastive Divergence
01

Energy Functions

Energy-based models (EBMs) define probability distributions through an energy function E(x). The probability of a data point x is proportional to the exponential of its negative energy: p(x) ∝ exp(−E(x)). This formulation lets us specify complex densities up to a normalizing constant by designing energy functions that assign low energy to probable data and high energy to improbable data.

The partition function Z normalizes the distribution: p(x) = exp(−E(x)) / Z, where Z = ∑_x exp(−E(x)) for discrete x and Z = ∫ exp(−E(x)) dx for continuous x. Computing Z is generally intractable in high-dimensional spaces, because it requires summing or integrating over every configuration, which is why EBMs are often left unnormalized in practice. Energy functions can be parameterized as neural networks. The model remains a valid probability distribution, but its density is known only up to the constant Z, so exact likelihood evaluation and direct sampling are unavailable and must be replaced by approximate methods.
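As a small illustration of the relation p(x) = exp(−E(x)) / Z, the sketch below enumerates a toy discrete state space where Z can be summed exactly. The five-state grid and the quadratic energy are assumptions chosen for illustration, not part of the course material.

```python
import numpy as np

def boltzmann_probs(energies):
    """Turn a vector of energies into normalized probabilities p = exp(-E) / Z."""
    energies = np.asarray(energies, dtype=float)
    # Subtracting the minimum energy leaves p unchanged but avoids overflow.
    unnormalized = np.exp(-(energies - energies.min()))
    Z = unnormalized.sum()  # partition function of the shifted energies
    return unnormalized / Z

# Hypothetical toy model: five states with energy E(x) = x^2 / 2.
states = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
energies = 0.5 * states**2
probs = boltzmann_probs(energies)
```

The lowest-energy state (x = 0) receives the most mass, and probability ratios depend only on energy differences. For continuous or high-dimensional x, the sum inside `boltzmann_probs` has no tractable counterpart, which is the difficulty the rest of the material addresses.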

p(x) ∝ exp(−E(x))
Core Relation
Z = ∑ exp(−E)
Partition Function
Intractable
Normalization
Implicit Models
NN-Parameterized
02

Boltzmann Distribution

The Boltzmann distribution originates from statistical mechanics, where E(x) is the energy of a system state x at thermal equilibrium. The distribution p(x) = exp(−E(x)/T) / Z depends on temperature T, with Z itself a function of T: as T → 0, probability concentrates on the lowest-energy states; as T → ∞, all states of a finite state space become equally likely. This temperature parameter enables annealing strategies in sampling and provides an intuitive physical interpretation of model behavior.
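The effect of temperature can be checked numerically on a finite state space. The following is a minimal sketch; the four energy values and the three temperatures are illustrative assumptions.

```python
import numpy as np

def boltzmann_at_temperature(energies, T):
    """Return p(x) = exp(-E(x)/T) / Z over a finite set of states."""
    scaled = -np.asarray(energies, dtype=float) / T
    scaled -= scaled.max()          # stabilize the exponentials
    weights = np.exp(scaled)
    return weights / weights.sum()

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability states."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

energies = np.array([0.0, 1.0, 2.0, 4.0])   # illustrative energies
cold = boltzmann_at_temperature(energies, T=0.05)
warm = boltzmann_at_temperature(energies, T=1.0)
hot = boltzmann_at_temperature(energies, T=1000.0)
```

At T = 0.05 nearly all mass sits on the lowest-energy state, while at T = 1000 the distribution is close to uniform, and the entropy increases monotonically across the three settings.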

The Helmholtz free energy F = −T log Z summarizes the whole distribution, and the negative log-probability of a state decomposes as −T log p(x) = E(x) + T log Z = E(x) − F. Maximizing the likelihood of data therefore means lowering the energy of observed states relative to the free energy of the model. The Boltzmann framework connects probabilistic modeling to physics, allowing researchers to draw on decades of statistical mechanics insights. Temperature controls the sharpness of the distribution, and raising it helps a sampler escape local energy minima.

Temperature Scaling

Higher T smooths distribution; lower T concentrates mass on modes. Enables simulated annealing.

Free Energy

F = −T log Z = ⟨E⟩ − TS links mean energy ⟨E⟩ and entropy S. Lowering data energy relative to F raises likelihood.

Thermal Equilibrium

System reaches a stationary distribution: the Boltzmann distribution defined by the energy.

Physical Interpretation

Statistical mechanics provides intuition for training dynamics and model behavior.

03

Training EBMs

Training EBMs is challenging because the partition function Z is intractable. The gradient of the log-likelihood contains the term ∇_θ log Z, which is an expectation under the model distribution. Contrastive divergence (CD) approximates this expectation by running MCMC for only a few steps starting from data samples. The resulting gradient estimate is biased, but it is far cheaper than running chains to convergence at every update. Score matching provides an alternative: it fits the score function ∇_x log p(x) = −∇_x E(x), which does not depend on Z, and so sidesteps normalization entirely.
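The role of log Z in the gradient can be made explicit with a short derivation. The parameter symbol θ and the k-step sample x̃ are notation introduced here for illustration.

```latex
\begin{align*}
\log p_\theta(x) &= -E_\theta(x) - \log Z_\theta,
  \qquad Z_\theta = \int \exp\bigl(-E_\theta(x')\bigr)\,dx' \\
\nabla_\theta \log Z_\theta
  &= \frac{1}{Z_\theta}\int \exp\bigl(-E_\theta(x')\bigr)\,
     \bigl(-\nabla_\theta E_\theta(x')\bigr)\,dx'
   = -\,\mathbb{E}_{x' \sim p_\theta}\bigl[\nabla_\theta E_\theta(x')\bigr] \\
\nabla_\theta \log p_\theta(x)
  &= -\nabla_\theta E_\theta(x)
     + \mathbb{E}_{x' \sim p_\theta}\bigl[\nabla_\theta E_\theta(x')\bigr] \\
  &\approx -\nabla_\theta E_\theta(x) + \nabla_\theta E_\theta(\tilde{x}),
  \qquad \tilde{x} = \text{result of } k \text{ MCMC steps started at } x
\end{align*}
```

The third line is exact; the last line is the contrastive divergence approximation, which replaces the model expectation with a single short-chain sample. The update pushes energy down at data points and up at points the model currently generates.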

Noise contrastive estimation (NCE) and its variants reformulate density estimation as classification: a classifier is trained to distinguish data from samples of a known noise distribution, with the normalizing constant treated as a learnable parameter. Objectives based on other divergence measures (f-divergences, Wasserstein distance) have also been developed. Each method trades off computational cost, bias, and convergence properties. Practitioners often use contrastive divergence for its simplicity, though score matching and NCE offer theoretical advantages in certain settings.
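The classification view of NCE can be sketched in a few lines. This is a minimal illustration, assuming equal numbers of data and noise samples and 1-D Gaussians for both the model family and the noise; the function names and all distribution parameters are hypothetical.

```python
import numpy as np

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -np.logaddexp(0.0, -z)

def nce_loss(log_model_data, log_noise_data, log_model_noise, log_noise_noise):
    """Binary NCE loss with equal numbers of data and noise samples.

    Each argument holds log-densities of the model or the noise distribution,
    evaluated on data samples or on noise samples.
    """
    logit_data = log_model_data - log_noise_data      # classifier logit on data
    logit_noise = log_model_noise - log_noise_noise   # classifier logit on noise
    # Data should be labeled 1, noise should be labeled 0.
    return -(np.mean(log_sigmoid(logit_data)) + np.mean(log_sigmoid(-logit_noise)))

def gaussian_logpdf(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=5000)    # hypothetical data distribution
noise = rng.normal(0.0, 3.0, size=5000)   # hypothetical noise distribution

def loss_for_model_mean(mu):
    # The model here is an already-normalized Gaussian, so no extra log Z
    # parameter appears; in general NCE learns it as one more parameter.
    return nce_loss(gaussian_logpdf(data, mu, 1.0), gaussian_logpdf(data, 0.0, 3.0),
                    gaussian_logpdf(noise, mu, 1.0), gaussian_logpdf(noise, 0.0, 3.0))

loss_at_truth = loss_for_model_mean(2.0)
loss_off = loss_for_model_mean(-1.0)
```

The loss is lower when the model mean matches the data mean than when it is far off, which is the signal a gradient-based optimizer would follow. When model and noise log-densities coincide, every logit is zero and the loss equals 2 log 2.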

Key Training Methods

Contrastive divergence: fast approximation via short MCMC chains. Score matching: match gradients, avoid partition function. Noise contrastive estimation: reframe as classification task. Each balances tractability with statistical efficiency.

Contrastive Divergence
Fast, Practical
Score Matching
Gradient-Based
Noise Contrastive
Classification
Divergence Estimates
Flexible
04

MCMC Sampling

Sampling from EBMs requires Markov chain Monte Carlo (MCMC) because we cannot draw samples directly. Langevin dynamics uses the score function (the gradient of the log-density, which equals −∇_x E(x) and does not involve Z) to guide a random walk: x_{t+1} = x_t + (ε/2) ∇_x log p(x_t) + √ε · z_t, where ε is the step size and z_t is standard Gaussian noise. This converges to the target distribution as ε → 0 and t → ∞. Hamiltonian Monte Carlo (HMC) augments the state with momentum, enabling larger steps and faster mixing, which matters most in high-dimensional problems.
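The Langevin update can be written directly from the formula. The sketch below is a minimal version without a Metropolis correction, run on a standard normal target where the score is known in closed form; the target, step size, and chain count are assumptions made for the example.

```python
import numpy as np

def langevin_sample(grad_log_p, x0, step_size, n_steps, rng):
    """Unadjusted Langevin dynamics: x <- x + (eps/2) * score(x) + sqrt(eps) * z."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * grad_log_p(x) + np.sqrt(step_size) * noise
    return x

# Illustrative target: standard normal, E(x) = x^2 / 2, so score(x) = -x.
def score_standard_normal(x):
    return -x

rng = np.random.default_rng(0)
x0 = np.full(5000, 10.0)   # 5000 independent chains, all started far from the mode
samples = langevin_sample(score_standard_normal, x0, step_size=0.05, n_steps=1000, rng=rng)
```

After 1000 steps the chains have forgotten the starting point, and the sample mean and variance are close to 0 and 1. A finite step size leaves a small bias in the stationary distribution, which shrinks as ε decreases.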

Mixing time (how quickly the chain forgets its initialization) and burn-in (discarding initial samples) are critical practical concerns. Slow mixing can make sampling prohibitively expensive. Techniques such as parallel tempering (also known as replica exchange) run chains at several temperatures and swap states between them to accelerate mixing. Practitioners should check convergence with diagnostics such as autocorrelation, effective sample size, and the Gelman–Rubin statistic. Understanding MCMC properties is essential for reliable EBM use, as poor sampling corrupts both training and evaluation.
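Two of the diagnostics named above are short enough to sketch. The code below assumes scalar chains stored as rows of an array and uses the basic (non-split) form of the Gelman–Rubin statistic; the synthetic "mixed" and "stuck" chains are made up for illustration.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for an array of shape (n_chains, n_samples)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    within = chains.var(axis=1, ddof=1).mean()       # W: mean within-chain variance
    between = n * chain_means.var(ddof=1)            # B: between-chain variance
    pooled = (n - 1) / n * within + between / n      # pooled variance estimate
    return np.sqrt(pooled / within)

def autocorrelation(x, lag):
    """Sample autocorrelation of a 1-D series at a positive lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

rng = np.random.default_rng(0)
mixed = rng.standard_normal((4, 2000))                        # chains that agree
stuck = mixed + np.array([[-3.0], [-1.0], [1.0], [3.0]])      # chains stuck in different regions
rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
```

Values of R-hat near 1 indicate that the chains agree with each other; the chains offset into different regions give a value well above 1, which is the pattern one would expect from a sampler trapped in separate modes.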

Initialization
Start from arbitrary x₀; run MCMC for burn-in steps.
Langevin Dynamics
Gradient-guided random walk; converges for small step size.
Hamiltonian MC
Momentum-based; larger steps, faster mixing than Langevin.
Diagnostics
Check convergence; assess effective sample size post-burn-in.
05

Restricted Boltzmann Machines

Restricted Boltzmann machines (RBMs) are bipartite graphical models with visible units v and hidden units h. The energy function E(v,h) = −v^T W h − b^T v − c^T h couples visible and hidden variables through weight matrix W, while lacking intra-layer connections. This bipartite structure is key: given visible units, hidden units are conditionally independent, and vice versa. p(h|v) and p(v|h) factor into independent Bernoulli distributions, enabling efficient block Gibbs sampling.
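The factorization of p(h|v) can be verified by brute force on a tiny model. The sketch below assumes binary units and compares the closed-form sigmoid conditional with direct enumeration of all hidden states; the layer sizes and random parameters are hypothetical.

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_energy(v, h, W, b, c):
    """E(v, h) = -v^T W h - b^T v - c^T h for binary vectors v and h."""
    return -(v @ W @ h) - b @ v - c @ h

def p_h_given_v(v, W, c):
    """Factorized conditional: p(h_j = 1 | v) = sigmoid(c_j + sum_i v_i W_ij)."""
    return sigmoid(c + v @ W)

def p_v_given_h(h, W, b):
    """Factorized conditional: p(v_i = 1 | h) = sigmoid(b_i + sum_j W_ij h_j)."""
    return sigmoid(b + W @ h)

# Tiny hypothetical RBM: 3 visible units, 2 hidden units, random parameters.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
b = rng.normal(size=3)
c = rng.normal(size=2)
v = np.array([1.0, 0.0, 1.0])

# Brute-force check: enumerate all hidden states and normalize exp(-E) directly.
hidden_states = [np.array(h, dtype=float) for h in itertools.product([0, 1], repeat=2)]
weights = np.array([np.exp(-rbm_energy(v, h, W, b, c)) for h in hidden_states])
posterior = weights / weights.sum()
brute_force = sum(p * h for p, h in zip(posterior, hidden_states))  # E[h | v]
```

Here `brute_force` agrees with `p_h_given_v(v, W, c)` to numerical precision. Enumeration costs 2^(number of hidden units) energy evaluations, whereas the factorized form needs one matrix-vector product, which is what makes block Gibbs sampling cheap.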

RBMs serve as building blocks for deeper models and as standalone density models. Training uses contrastive divergence: sample h from p(h|v_data), then sample v' from p(v|h), then h' from p(h|v'). The weight update is proportional to v_data ⊗ h − v' ⊗ h', contrasting data-driven and model-driven statistics. While largely superseded by modern deep learning, RBMs remain pedagogically important and are still used in some applications, especially when interpretability of latent structure is valued.
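The three sampling steps and the weight update described above can be sketched as a single CD-1 function. This is a minimal version for binary units; using hidden probabilities rather than sampled hidden states in the update is a common variance-reduction choice and an assumption here, as are the learning rate, batch, and the single repeated training pattern.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v_data, W, b, c, lr, rng):
    """One CD-1 step on a batch of binary visible vectors (rows of v_data)."""
    # Positive phase: hidden units driven by the data.
    ph_data = sigmoid(c + v_data @ W)
    h_data = (rng.random(ph_data.shape) < ph_data).astype(float)
    # Negative phase: one block Gibbs step, v' ~ p(v | h), then p(h | v').
    pv_model = sigmoid(b + h_data @ W.T)
    v_model = (rng.random(pv_model.shape) < pv_model).astype(float)
    ph_model = sigmoid(c + v_model @ W)
    # Contrast data-driven and model-driven statistics, averaged over the batch.
    n = v_data.shape[0]
    W_new = W + lr * (v_data.T @ ph_data - v_model.T @ ph_model) / n
    b_new = b + lr * (v_data - v_model).mean(axis=0)
    c_new = c + lr * (ph_data - ph_model).mean(axis=0)
    return W_new, b_new, c_new

rng = np.random.default_rng(0)
# Hypothetical data: every example is the same 4-bit pattern.
pattern = np.array([1.0, 0.0, 1.0, 0.0])
v_data = np.tile(pattern, (64, 1))
W = 0.01 * rng.normal(size=(4, 3))
b = np.zeros(4)
c = np.zeros(3)
for _ in range(500):
    W, b, c = cd1_update(v_data, W, b, c, lr=0.1, rng=rng)

# Reconstruction from the data's hidden activations should match the pattern.
reconstruction = sigmoid(b + sigmoid(c + pattern @ W) @ W.T)
```

On this degenerate dataset the reconstruction probabilities move above 0.5 for the "on" bits and below 0.5 for the "off" bits. Real use would add mini-batching over varied data, and often momentum and weight decay.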

Strengths

  • Tractable conditional distributions; fast sampling.
  • Interpretable latent features in hidden layer.
  • Solid theoretical foundation; well-studied dynamics.
  • Useful for layer-wise pre-training of deep networks.

Limitations

  • Bipartite structure limits expressiveness.
  • Contrastive divergence introduces bias.
  • Standard form assumes binary units; continuous data needs variants such as Gaussian-Bernoulli RBMs.
  • Fewer parameters than modern neural networks.
06

Deep Boltzmann Machines

Deep Boltzmann machines (DBMs) stack multiple layers of hidden units, with connections between adjacent layers only and none within a layer. Unlike deep belief networks (DBNs), which are directed except for an undirected top pair of layers, DBMs are undirected throughout. The energy function couples adjacent layers (bias terms omitted): E = −v^T W^(1) h^(1) − h^(1)T W^(2) h^(2) − .... Exact inference becomes intractable: even with the visible units fixed, hidden units in one layer remain coupled through the layers above and below, so the posterior p(h|v) does not factorize. Variational inference uses a mean-field approximation, treating each hidden unit's posterior as an independent Bernoulli whose parameter is updated iteratively.
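The iterative mean-field update can be sketched for a DBM with two hidden layers. The code assumes the bias-free energy E = −v^T W^(1) h^(1) − h^(1)T W^(2) h^(2); the layer sizes, weight scale, and stopping tolerance are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_field(v, W1, W2, n_iters=50, tol=1e-6):
    """Mean-field posterior for a DBM with two hidden layers and no bias terms.

    Returns the Bernoulli means mu1 = q(h1 = 1) and mu2 = q(h2 = 1).
    """
    mu1 = np.full(W1.shape[1], 0.5)
    mu2 = np.full(W2.shape[1], 0.5)
    for _ in range(n_iters):
        # h1 receives input from both neighbors: v below and h2 above.
        new_mu1 = sigmoid(v @ W1 + W2 @ mu2)
        # h2 only sees h1.
        new_mu2 = sigmoid(new_mu1 @ W2)
        change = max(np.abs(new_mu1 - mu1).max(), np.abs(new_mu2 - mu2).max())
        mu1, mu2 = new_mu1, new_mu2
        if change < tol:
            break
    return mu1, mu2

rng = np.random.default_rng(0)
W1 = 0.5 * rng.normal(size=(6, 4))   # visible (6) to first hidden layer (4)
W2 = 0.5 * rng.normal(size=(4, 3))   # first hidden layer (4) to second (3)
v = rng.integers(0, 2, size=6).astype(float)
mu1, mu2 = mean_field(v, W1, W2)
```

The first hidden layer receives both bottom-up and top-down input, which is the difference from an RBM, where a single feed-forward pass gives the exact posterior. The loop stops at an approximate fixed point of the two coupled updates.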

Training DBMs is complex: greedy layer-wise pre-training is followed by variational EM, in which mean-field inference supplies the data-dependent statistics. Mean-field updates iterate until convergence at each training step, adding computational overhead. Despite their ability to learn hierarchical representations, DBMs have fallen out of favor relative to deep autoencoders and variational autoencoders, which are easier to train and comparably expressive. However, they offer theoretical insights into hierarchical probabilistic modeling and remain relevant for understanding energy-based deep learning.

Undirected Hierarchy
Adjacent Layers Coupled
Mean-Field Inference
Variational
Greedy Pre-Training
Initialization
Complex Training
High Computation
07

Modern EBMs

Modern EBMs use neural networks as flexible energy function approximators. Instead of explicit latent structures (RBM/DBM), the energy E_θ(x) is a learnable neural network, often a CNN or ResNet. This enables implicit generative modeling: samples are drawn by MCMC on x, with no latent posterior to compute. The usual sampler is Langevin dynamics on the input, often called stochastic gradient Langevin dynamics (SGLD) in this literature: each step follows −∇_x E_θ(x) and adds Gaussian noise. Chains are typically short and are initialized from noise or from a buffer of earlier samples to keep the cost manageable.

Joint energy models (JEM) unify discriminative (classification) and generative (density) modeling in a single network. The classifier logits f_θ(x)[y] are read as negative joint energies, E_θ(x, y) = −f_θ(x)[y]. Classification uses the usual softmax p(y|x), while marginalizing over y gives E_θ(x) = −log ∑_y exp(f_θ(x)[y]), whose gradient −∇_x E_θ(x) drives Langevin sampling for generation. Recent work explores connections to score-based models (diffusion models), where the score function ∇_x log p(x) is learned directly via score matching rather than obtained by differentiating an explicit energy. Autoregressive models, normalizing flows, and other neural density estimators provide complementary approaches, though EBMs offer unique advantages in flexible density specification and control via energy design.
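How one set of classifier outputs can serve both tasks is easiest to see in code. The sketch below assumes the usual JEM reading of logits as negative joint energies; the logit symbol f, the function names, and the example logit values are introduced here for illustration.

```python
import numpy as np

def log_sum_exp(a):
    """Stable log(sum(exp(a))) over the last axis."""
    a_max = a.max(axis=-1, keepdims=True)
    return (a_max + np.log(np.exp(a - a_max).sum(axis=-1, keepdims=True))).squeeze(-1)

def class_probs(logits):
    """p(y | x): the usual softmax over classifier logits f(x)[y]."""
    return np.exp(logits - log_sum_exp(logits)[..., None])

def energy_of_x(logits):
    """E(x) = -log sum_y exp(f(x)[y]), from marginalizing y out of E(x, y) = -f(x)[y]."""
    return -log_sum_exp(logits)

# Hypothetical logits for two inputs over three classes.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0]])
probs = class_probs(logits)
energies = energy_of_x(logits)
```

Adding a constant to all logits of an input leaves p(y|x) unchanged but lowers E(x) by that constant. This otherwise unused degree of freedom in a standard classifier is what carries the density over x.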

Implicit Generative Models

Neural network energy functions; sample via MCMC Langevin dynamics.

SGLD Sampling

Noisy gradient steps on x; short chains, often started from a sample buffer.

Joint Energy Models

Unified discriminative and generative; same energy for both tasks.

Score-Based Models

Learn ∇_x log p directly via score matching; connection to diffusion.

08

EBM Connections

Energy-based models connect deeply to score-based generative models: the score function is the negative gradient of the energy, ∇_x log p(x) = −∇_x E(x), since Z does not depend on x. Denoising score matching trains a neural network to predict the score of a noise-perturbed data distribution by regressing onto the score of the noise kernel. Diffusion models learn scores at multiple noise scales and use them to reverse a gradual noising process. The connection reveals that diffusion models are implicitly learning energy landscapes across scales, bridging EBMs and modern generative AI.
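Denoising score matching can be checked on a case where the answer is known in closed form. The sketch below assumes 1-D standard normal data, a single noise level, and a one-parameter linear score model fitted by least squares; all of these are simplifications chosen for illustration rather than a practical training setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200_000, 0.5

x = rng.standard_normal(n)                 # clean data, here x ~ N(0, 1)
x_noisy = x + sigma * rng.standard_normal(n)

# Denoising score matching target: score of the Gaussian noise kernel,
# grad_{x_noisy} log q(x_noisy | x) = -(x_noisy - x) / sigma^2.
target = -(x_noisy - x) / sigma**2

# Hypothetical one-parameter score model s(x_noisy) = a * x_noisy, fitted by
# least squares on the DSM objective E[(s(x_noisy) - target)^2].
a_hat = np.dot(x_noisy, target) / np.dot(x_noisy, x_noisy)

# The noisy marginal is N(0, 1 + sigma^2), whose true score is -x / (1 + sigma^2).
a_true = -1.0 / (1.0 + sigma**2)
```

Although each regression target depends on the individual clean sample, the fitted slope `a_hat` lands close to `a_true`, the slope of the score of the noisy marginal. Neither the density nor Z is ever evaluated.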

Classifier guidance (conditional generation) leverages classifier gradients to shape the energy landscape. A classifier score ∇_x log p(y|x) can be combined with unconditional scores to steer sampling toward desired classes. Score-based models have achieved state-of-the-art generation quality, suggesting energy-based thinking remains central to deep generative modeling. Future directions include better understanding MCMC sampling for high-dimensional EBMs, hybrid approaches combining explicit energy functions with learned scores, and applications to structured prediction and reasoning tasks where explicit energy design provides interpretability advantages.
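The way a classifier gradient combines with an unconditional score follows from Bayes' rule. The guidance weight w in the last line is a commonly used scaling factor and is an assumption added here, not something stated in the text above.

```latex
\begin{align*}
p(x \mid y) &= \frac{p(x)\,p(y \mid x)}{p(y)} \\
\nabla_x \log p(x \mid y)
  &= \nabla_x \log p(x) + \nabla_x \log p(y \mid x)
     - \underbrace{\nabla_x \log p(y)}_{=\,0} \\
  &= \nabla_x \log p(x) + \nabla_x \log p(y \mid x) \\
\text{guided score} &= \nabla_x \log p(x) + w\,\nabla_x \log p(y \mid x),
  \qquad w \ge 0
\end{align*}
```

In energy terms, conditioning on y adds −log p(y|x) to the unconditional energy, so the classifier lowers the energy of inputs it assigns to the desired class.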

Theoretical Unification

EBMs, diffusion models, and score-based approaches are fundamentally aligned: all learn probability distributions through energy or score functions. Modern generative AI extensively leverages these principles, emphasizing the enduring importance of energy-based thinking for understanding deep learning.

Score Functions
∇_x log p
Denoising Score Matching
Training Signal
Diffusion Models
Noise Schedules
Classifier Guidance
Conditional Control
09

References & Further Reading

Energy-based models provide a powerful framework connecting statistical physics, probabilistic inference, and modern deep learning. This section compiles key references for understanding energy functions, Boltzmann machines, and the recent resurgence of EBMs in generative modeling.

From foundational theory to contemporary applications, these materials document how EBMs complement and connect to VAEs, diffusion models, and score-based approaches.