Energy-Based Models

Stanford XCS236 · Deep Generative Models

Energy Functions

Energy-based models (EBMs) define probability distributions through an energy function E(x). The probability of a data point x is proportional to the exponential of its negative energy: p(x) ∝ exp(−E(x)). This formulation lets us specify complex, unnormalized densities by designing energy functions that assign low energy to probable data and high energy to improbable data.

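The partition function Z normalizes the distribution: p(x) = exp(−E(x)) / Z, where Z = ∑_x exp(−E(x)) for discrete x, or the corresponding integral ∫ exp(−E(x)) dx for continuous x. Computing Z is generally intractable in high-dimensional spaces because it sums or integrates over every possible configuration, which is why EBMs are often left unnormalized in practice. Energy functions can be parameterized as neural networks. The model is still a probability distribution, but the density is defined only implicitly, up to the constant Z. This buys flexible density modeling at the cost of not being able to evaluate exact likelihoods directly.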
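For a small discrete state space the normalization can be carried out exactly, which makes the roles of E(x), Z, and p(x) concrete. The following is a minimal sketch; the four energy values are made up for illustration and do not come from any real model.

```python
import numpy as np

# Hypothetical energies for a 4-state toy system (illustrative values only).
energies = np.array([0.5, 1.0, 2.0, 4.0])

# Unnormalized weights exp(-E(x)); lower energy gives a larger weight.
weights = np.exp(-energies)

# Partition function: sum of the weights over every state.
Z = weights.sum()

# Normalized probabilities p(x) = exp(-E(x)) / Z.
probs = weights / Z

# Probability ratios never need Z: p(x0) / p(x1) = exp(E(x1) - E(x0)).
ratio = np.exp(energies[1] - energies[0])
```

With only four states the sum is trivial. The difficulty in practice is that the number of terms grows exponentially with the dimension of x, while ratios such as `ratio` stay cheap, which is what MCMC samplers exploit.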

p(x) ∝ exp(−E(x))

Core Relation

Z = ∑ exp(−E)

Partition Function

Intractable

Normalization

Implicit Models

NN-Parameterized

Boltzmann Distribution

The Boltzmann distribution originates from statistical mechanics, where energy E(x) is related to system states at thermal equilibrium. The distribution p(x) = exp(−E(x)/T) / Z depends on temperature T: as T → 0, probability concentrates on low-energy states; as T → ∞, all states become equally likely. This temperature parameter enables annealing strategies in sampling and provides an intuitive physical interpretation of model behavior.
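The effect of temperature can be checked numerically on a toy system. This sketch assumes a finite set of states with made-up energies; the helper name `boltzmann` is illustrative.

```python
import numpy as np

def boltzmann(energies, T):
    """Return p(x) = exp(-E(x)/T) / Z for a finite set of states."""
    logits = -np.asarray(energies, dtype=float) / T
    logits -= logits.max()          # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

energies = [0.0, 1.0, 2.0]          # hypothetical toy energies
cold = boltzmann(energies, T=0.05)  # nearly all mass on the lowest-energy state
hot = boltzmann(energies, T=100.0)  # close to uniform over the three states
```

Annealing schedules move between these two regimes: sampling starts hot, where the landscape is flat and easy to explore, and cools toward the temperature of interest.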

With p(x) = exp(−E(x)/T) / Z, taking logs gives −T log p(x) = E(x) + T log Z = E(x) − F, where F = −T log Z is the (Helmholtz) free energy of the system. Equivalently, F = ⟨E⟩ − T·H[p]: the average energy minus temperature times entropy. Maximizing the likelihood of a data point therefore means lowering its energy relative to the free energy. The Boltzmann framework connects probabilistic modeling to physics, allowing researchers to draw on decades of statistical mechanics results. Temperature controls the sharpness of the distribution, and annealing it from high to low helps samplers escape local minima of the energy.
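The sign conventions around free energy are easy to get wrong, so it can help to work them out directly from p(x) = exp(−E(x)/T) / Z. The derivation below uses the standard definitions F = −T log Z and H[p] = −∑_x p(x) log p(x); the bracket notation ⟨E⟩_p for the expected energy is an added convention.

```latex
\begin{align}
\log p(x) &= -\frac{E(x)}{T} - \log Z \\
-T \log p(x) &= E(x) + T \log Z \;=\; E(x) - F,
  \qquad F \equiv -T \log Z \\
T\,H[p] &= -T \sum_x p(x) \log p(x) \;=\; \langle E \rangle_p - F
  \;\;\Longrightarrow\;\;
  F = \langle E \rangle_p - T\,H[p]
\end{align}
```

The second line shows that log p(x) = (F − E(x)) / T, so a point becomes more likely exactly when its energy falls relative to the free energy of the whole system.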

Temperature Scaling

Higher T smooths distribution; lower T concentrates mass on modes. Enables simulated annealing.

Free Energy

F = −T log Z = ⟨E⟩ − T·H ties energy to entropy. Likelihood rises as a point's energy drops relative to F.

Thermal Equilibrium

At equilibrium the distribution over states is stationary; for an EBM this is the Boltzmann distribution that a sampler targets.

Physical Interpretation

Statistical mechanics provides intuition for training dynamics and model behavior.

Training EBMs

Training EBMs is challenging because the partition function Z is intractable. The log-likelihood gradient contains ∇_θ log Z, which is an expectation under the model itself. Contrastive divergence (CD) approximates this term by running MCMC for only a few steps starting from data samples. The short chains make the estimate biased, but far cheaper than running chains to convergence at every update. Score matching provides an alternative: it fits the score function ∇_x log p(x) = −∇_x E(x), which does not depend on Z because Z is constant in x, sidestepping normalization entirely.
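The reason sampling enters training at all can be seen by differentiating the log-likelihood. The derivation below is a sketch for a discrete x (replace the sum with an integral for continuous x); the parameter symbol θ and the notation E_θ, Z_θ, p_θ are added here to make the dependence on parameters explicit.

```latex
\begin{align}
\log p_\theta(x) &= -E_\theta(x) - \log Z_\theta,
  \qquad Z_\theta = \sum_{x'} e^{-E_\theta(x')} \\
\nabla_\theta \log Z_\theta
  &= \frac{1}{Z_\theta} \sum_{x'} e^{-E_\theta(x')} \bigl(-\nabla_\theta E_\theta(x')\bigr)
   = -\,\mathbb{E}_{x' \sim p_\theta}\!\bigl[\nabla_\theta E_\theta(x')\bigr] \\
\nabla_\theta \log p_\theta(x)
  &= -\nabla_\theta E_\theta(x)
     + \mathbb{E}_{x' \sim p_\theta}\!\bigl[\nabla_\theta E_\theta(x')\bigr]
\end{align}
```

The first term lowers the energy of the data point; the second raises the energy of model samples. Contrastive divergence estimates the second term with samples from a short chain started at the data, which is where its bias comes from.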

Noise contrastive estimation (NCE) and its variants reformulate density estimation as classification: a classifier learns to distinguish data from samples of a known noise distribution, and log Z can be treated as one more learnable parameter. Other objectives based on different divergence measures (f-divergences, Wasserstein distance) have also been developed. Each method trades off computational cost, bias, and convergence properties. Practitioners often use contrastive divergence for its simplicity, though score matching and NCE offer theoretical advantages in certain settings.
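Of these methods, score matching is the easiest to check in closed form. The sketch below assumes a one-dimensional quadratic energy E(x) = (x − mu)² / (2 var) and evaluates the Hyvarinen objective E[½ s(x)² + s'(x)], where s is the model score. The data, the function name, and the parameter names are illustrative.

```python
import numpy as np

def score_matching_loss(x, mu, var):
    """Score matching objective for the energy E(x) = (x - mu)^2 / (2 var).

    The model score is s(x) = -(x - mu) / var and its derivative is -1 / var,
    so the objective mean(0.5 * s(x)^2 + s'(x)) needs no partition function.
    """
    s = -(x - mu) / var
    ds_dx = -1.0 / var
    return np.mean(0.5 * s**2 + ds_dx)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)  # synthetic data

# The objective is minimized at the sample mean and variance.
best = score_matching_loss(x, x.mean(), x.var())
worse_mean = score_matching_loss(x, x.mean() + 1.0, x.var())
worse_var = score_matching_loss(x, x.mean(), 2.0 * x.var())
```

For this energy the minimizer coincides with the maximum-likelihood fit, yet Z never had to be evaluated. For neural network energies the derivative term becomes a trace of second derivatives, which is the main computational cost of the method.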

Key Training Methods

Contrastive divergence: fast approximation via short MCMC chains. Score matching: match gradients, avoid partition function. Noise contrastive estimation: reframe as classification task. Each balances tractability with statistical efficiency.

Contrastive Divergence

Fast, Practical

Score Matching

Gradient-Based

Noise Contrastive

Classification

Divergence Estimates

Flexible

MCMC Sampling

Sampling from EBMs requires Markov chain Monte Carlo (MCMC) because samples cannot be drawn directly from an unnormalized density. Langevin dynamics uses the score function (the gradient of the log-density, which equals −∇_x E(x) and needs no Z) to guide a random walk: x_{t+1} = x_t + (ε/2) ∇_x log p(x_t) + √ε · z_t, where ε is the step size and z_t is standard Gaussian noise. The chain converges to the target distribution as ε → 0 and t → ∞. Hamiltonian Monte Carlo (HMC) augments the state with a momentum variable, enabling larger steps and faster mixing, which matters in high-dimensional problems.
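The Langevin update can be written in a few lines. This is a minimal sketch of the unadjusted version (no Metropolis correction), assuming a toy energy E(x) = x²/2 whose target is a standard Gaussian; the function names and settings are illustrative.

```python
import numpy as np

def grad_energy(x):
    """Gradient of the toy energy E(x) = x^2 / 2 (a standard Gaussian target)."""
    return x

def langevin_sample(x0, step_size, n_steps, rng):
    """Unadjusted Langevin dynamics; since log p = -E + const, the score is -grad E."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - 0.5 * step_size * grad_energy(x) + np.sqrt(step_size) * noise
    return x

rng = np.random.default_rng(0)
x0 = np.full(5000, 3.0)  # 5000 parallel chains, all started away from the mode
samples = langevin_sample(x0, step_size=0.05, n_steps=1000, rng=rng)
```

With a finite step size the chain settles on a slightly wider distribution than the target; shrinking the step reduces this discretization bias at the cost of slower mixing.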

Mixing time (how quickly the chain forgets its initialization) and burn-in (discarding initial samples) are critical practical concerns. Slow mixing can make sampling prohibitively expensive. Techniques such as parallel tempering, also known as replica exchange, accelerate mixing by running chains at several temperatures and swapping states between them. Convergence should be checked with diagnostics such as autocorrelation and the Gelman–Rubin statistic. Understanding MCMC properties is essential for reliable EBM use, as poor sampling corrupts both training and evaluation.
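The Gelman–Rubin statistic compares the variance between chains with the variance within them. The sketch below implements the classic (non-split) form for a scalar quantity; the synthetic chains stand in for real sampler output, and the function name is illustrative.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for an array of shape (m chains, n draws)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    within = chains.var(axis=1, ddof=1).mean()       # W: mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)    # B: variance of chain means, scaled by n
    pooled = (n - 1) / n * within + between / n      # pooled variance estimate
    return np.sqrt(pooled / within)

rng = np.random.default_rng(0)
mixed = rng.standard_normal((4, 1000))                   # four chains from the same target
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])   # two chains stuck in another mode

r_mixed = gelman_rubin(mixed)
r_stuck = gelman_rubin(stuck)
```

Values near 1 are consistent with convergence, while values well above 1 indicate that the chains disagree. A value near 1 is necessary but not sufficient: chains that all miss the same mode will still agree with each other.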

Initialization

Start from arbitrary x₀; run MCMC for burn-in steps.

Langevin Dynamics

Gradient-guided random walk; converges for small step size.

Hamiltonian MC

Momentum-based; larger steps, faster mixing than Langevin.

Diagnostics

Check convergence; assess effective sample size post-burn-in.

Restricted Boltzmann Machines

Restricted Boltzmann machines (RBMs) are bipartite graphical models with visible units v and hidden units h. The energy function E(v,h) = −v^T W h − b^T v − c^T h couples visible and hidden variables through the weight matrix W, with no connections within a layer. This bipartite structure is key: given the visible units, the hidden units are conditionally independent, and vice versa. For binary units, p(h|v) and p(v|h) factor into independent Bernoulli distributions, enabling efficient block Gibbs sampling.
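The factorization follows from the energy being linear in each unit once the other layer is fixed. Written out for binary units, with σ the logistic sigmoid and the index names i, j added here for illustration:

```latex
\begin{align}
p(h \mid v) &= \prod_j p(h_j \mid v),
  & p(h_j = 1 \mid v) &= \sigma\Bigl(c_j + \sum_i v_i W_{ij}\Bigr) \\
p(v \mid h) &= \prod_i p(v_i \mid h),
  & p(v_i = 1 \mid h) &= \sigma\Bigl(b_i + \sum_j W_{ij} h_j\Bigr)
\end{align}
```

Each conditional is a single matrix product followed by a sigmoid, so an entire layer can be sampled in one step.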

RBMs serve as building blocks for deeper models and as standalone density models. Training uses contrastive divergence: sample h from p(h|v_data), then sample v' from p(v|h), then h' from p(h|v'). The weight update is proportional to v_data ⊗ h − v' ⊗ h', contrasting data-driven and model-driven statistics. While largely superseded by modern deep learning, RBMs remain pedagogically important and are still used in some applications, especially when interpretability of latent structure is valued.
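One CD-1 update can be sketched as follows. This is an illustrative implementation under stated assumptions, not a reference one: units are binary, the data are random bits, and the statistics use hidden probabilities in place of sampled hidden states, a common lower-variance variant of the v ⊗ h outer products. All function and variable names are made up for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(v_data, W, b, c, lr, rng):
    """One CD-1 update for a Bernoulli RBM with E(v,h) = -v^T W h - b^T v - c^T h.

    v_data has shape (batch, n_visible); W has shape (n_visible, n_hidden).
    """
    # Positive phase: hidden probabilities and a sample h ~ p(h | v_data).
    ph_data = sigmoid(v_data @ W + c)
    h = (rng.random(ph_data.shape) < ph_data).astype(float)
    # Negative phase: one block Gibbs step, v' ~ p(v | h), then p(h' | v').
    pv_model = sigmoid(h @ W.T + b)
    v_model = (rng.random(pv_model.shape) < pv_model).astype(float)
    ph_model = sigmoid(v_model @ W + c)
    # Contrast data-driven and model-driven statistics, averaged over the batch.
    n = v_data.shape[0]
    W = W + lr * (v_data.T @ ph_data - v_model.T @ ph_model) / n
    b = b + lr * (v_data - v_model).mean(axis=0)
    c = c + lr * (ph_data - ph_model).mean(axis=0)
    return W, b, c

rng = np.random.default_rng(0)
v = (rng.random((64, 6)) < 0.5).astype(float)   # hypothetical binary data
W = 0.01 * rng.standard_normal((6, 3))
b = np.zeros(6)
c = np.zeros(3)
W, b, c = cd1_step(v, W, b, c, lr=0.1, rng=rng)
```

Running more Gibbs steps in the negative phase (CD-k) reduces the bias of the update at proportionally higher cost.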

Strengths

Tractable conditional distributions; fast sampling.
Interpretable latent features in hidden layer.
Solid theoretical foundation; well-studied dynamics.
Useful for layer-wise pre-training of deep networks, where pre-training is needed.

Limitations

Bipartite structure limits expressiveness.
Contrastive divergence introduces bias.
Standard form assumes binary units; extending to continuous data is non-trivial.
Fewer parameters than modern neural networks.

Deep Boltzmann Machines

Deep Boltzmann machines (DBMs) stack multiple layers of hidden units into a hierarchy in which each layer connects only to its neighbors, with no connections within a layer. Unlike deep belief networks (DBNs), whose lower layers use directed connections, DBMs are undirected throughout. The energy function couples adjacent layers: E = −v^T W^(1) h^(1) − (h^(1))^T W^(2) h^(2) − .... Exact inference becomes intractable: even given the visible units, hidden units in different layers remain coupled, so p(h|v) no longer factorizes. Variational inference uses a mean-field approximation, treating each hidden unit's posterior as an independent Bernoulli with parameters updated iteratively.
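The mean-field iteration can be sketched for the smallest interesting case. The code assumes a DBM with two hidden layers, binary units, and no bias terms, so that E = −v^T W1 h1 − h1^T W2 h2; the weights are random and the function name is illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mean_field(v, W1, W2, n_iters=50, tol=1e-6):
    """Mean-field posterior for a two-hidden-layer DBM without bias terms.

    Energy assumed: E = -v^T W1 h1 - h1^T W2 h2.
    Returns mu1, mu2 where mu_k[j] approximates q(h_k[j] = 1).
    """
    mu1 = np.full(W1.shape[1], 0.5)
    mu2 = np.full(W2.shape[1], 0.5)
    for _ in range(n_iters):
        # h1 receives input from both neighbours: v below and h2 above.
        new_mu1 = sigmoid(v @ W1 + W2 @ mu2)
        # h2 only sees h1.
        new_mu2 = sigmoid(new_mu1 @ W2)
        delta = max(np.abs(new_mu1 - mu1).max(), np.abs(new_mu2 - mu2).max())
        mu1, mu2 = new_mu1, new_mu2
        if delta < tol:
            break
    return mu1, mu2

rng = np.random.default_rng(0)
v = np.array([1.0, 0.0, 1.0, 1.0, 0.0])      # hypothetical binary visible vector
W1 = 0.5 * rng.standard_normal((5, 4))
W2 = 0.5 * rng.standard_normal((4, 3))
mu1, mu2 = mean_field(v, W1, W2)
```

The top-down term `W2 @ mu2` is what distinguishes this from a single feed-forward pass: the first hidden layer is inferred using evidence from both directions, and the loop repeats until the two layers agree.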

Training DBMs is complex: greedy layer-wise pre-training is followed by variational EM. Mean-field updates iterate until convergence at each training step, adding computational overhead. Despite their ability to learn hierarchical representations, DBMs have fallen out of favor compared to deep autoencoders and variational autoencoders, which are easier to train and comparably expressive. They still offer theoretical insight into hierarchical probabilistic modeling and remain relevant for understanding energy-based deep learning.

Undirected Hierarchy

Adjacent-Layer Links

Mean-Field Inference

Variational

Greedy Pre-Training

Initialization

Complex Training

High Computation

Modern EBMs

Modern EBMs use neural networks as flexible energy function approximators. Instead of explicit latent structures (RBM/DBM), the energy E_θ(x) is a learnable neural network, often a CNN or ResNet. Generation is implicit: samples are drawn by MCMC on x, and no latent posterior has to be computed. The usual sampler is a Langevin update, referred to in this literature as stochastic gradient Langevin dynamics (SGLD): each step follows −∇_x E_θ(x) plus Gaussian noise, and many chains can be run in parallel as a mini-batch.

Joint energy models (JEM) unify discriminative (classification) and generative (density) modeling in a single framework. A classifier's logits f_θ(x)[y] define a joint energy E_θ(x, y) = −f_θ(x)[y]. Classification uses p(y|x), a softmax over the logits in which the partition function cancels, while generation uses the marginal energy E_θ(x) = −log ∑_y exp(f_θ(x)[y]) and its gradient −∇_x E_θ(x) inside Langevin sampling. Recent work explores connections to score-based models (diffusion models), where the score function ∇_x log p(x) is learned directly via score matching, avoiding explicit energy function design. Autoregressive flows and other neural density estimators provide complementary approaches, though EBMs offer unique advantages in flexible density specification and control via energy design.
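In the JEM construction, both quantities are read off the same vector of classifier logits: the joint energy is E(x, y) = −f(x)[y], and the marginal energy is the negative log-sum-exp of the logits. The sketch below uses made-up logits for a single input in place of a real network.

```python
import numpy as np

def logsumexp(a):
    """Numerically stable log(sum(exp(a))) for a 1-D array."""
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

# Hypothetical classifier logits f(x)[y] for one input x and three classes.
logits = np.array([2.0, 0.5, -1.0])

# Joint energy per class: E(x, y) = -f(x)[y].
energy_xy = -logits

# Classification: p(y | x) is a softmax over -E(x, y); the partition function cancels.
p_y_given_x = np.exp(logits - logsumexp(logits))

# Generation: marginalizing over y gives E(x) = -logsumexp_y f(x)[y].
energy_x = -logsumexp(logits)
```

Adding a constant to every logit leaves `p_y_given_x` unchanged but shifts `energy_x`, so the degree of freedom a plain classifier ignores is what carries the density over x.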

Implicit Generative Models

Neural network energy functions; sample via MCMC Langevin dynamics.

SGLD Sampling

Langevin updates with injected Gaussian noise; chains run in parallel as a mini-batch.

Joint Energy Models

Unified discriminative and generative; same energy for both tasks.

Score-Based Models

Learn ∇_x log p directly via score matching; connection to diffusion.

EBM Connections

Energy-based models connect deeply to score-based generative models: the score function ∇_x log p(x) equals −∇_x E(x), the negative gradient of the energy. Denoising score matching trains a neural network to estimate the score of a noise-perturbed data distribution by regressing onto the score of the known perturbation kernel. Diffusion models learn scores at multiple noise scales, which lets sampling run backward from pure noise to data. Seen this way, diffusion models implicitly learn energy landscapes across noise scales, bridging EBMs and modern generative AI.
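Denoising score matching at a single noise level reduces to a regression problem. The sketch below assumes Gaussian perturbation noise with standard deviation sigma and one-dimensional N(0, 1) data, for which the score of the noisy marginal is known in closed form; the function names and settings are illustrative.

```python
import numpy as np

def dsm_loss(score_fn, x, sigma, rng):
    """Denoising score matching loss at a single noise level sigma.

    Perturb x with Gaussian noise and regress score_fn(x_noisy) onto the score
    of the perturbation kernel, -(x_noisy - x) / sigma^2.
    """
    noise = rng.standard_normal(x.shape)
    x_noisy = x + sigma * noise
    target = -(x_noisy - x) / sigma**2
    return 0.5 * np.mean((score_fn(x_noisy) - target) ** 2)

sigma = 0.5
x = np.random.default_rng(0).standard_normal(50_000)   # synthetic N(0, 1) data

# For N(0, 1) data the noisy marginal is N(0, 1 + sigma^2), whose score is known.
true_score = lambda z: -z / (1.0 + sigma**2)
wrong_score = lambda z: -z                              # ignores the added noise

# Same seed for both calls so the two losses see identical noise draws.
loss_true = dsm_loss(true_score, x, sigma, np.random.default_rng(1))
loss_wrong = dsm_loss(wrong_score, x, sigma, np.random.default_rng(1))
```

The loss does not reach zero even for the true score, because the regression target is noisy; what matters is that the minimizer is the score of the perturbed distribution. Diffusion models repeat this objective across a range of sigma values.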

Classifier guidance (conditional generation) leverages classifier gradients to shape the energy landscape. A classifier score ∇_x log p(y|x) can be combined with unconditional scores to steer sampling toward desired classes. Score-based models have achieved state-of-the-art generation quality, suggesting energy-based thinking remains central to deep generative modeling. Future directions include better understanding MCMC sampling for high-dimensional EBMs, hybrid approaches combining explicit energy functions with learned scores, and applications to structured prediction and reasoning tasks where explicit energy design provides interpretability advantages.
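The way the two scores combine follows from Bayes' rule. In the derivation below, the guidance weight w is an added assumption reflecting common practice; w = 1 recovers the exact conditional score.

```latex
\begin{align}
\log p(x \mid y) &= \log p(x) + \log p(y \mid x) - \log p(y) \\
\nabla_x \log p(x \mid y) &= \nabla_x \log p(x) + \nabla_x \log p(y \mid x) \\
\text{guided score:}\quad
  s_w(x, y) &= \nabla_x \log p(x) + w \, \nabla_x \log p(y \mid x)
\end{align}
```

In energy terms this corresponds to E(x | y) = E(x) − log p(y | x) up to a constant: the classifier adds a term that lowers the energy of inputs it assigns to class y.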

Theoretical Unification

EBMs, diffusion models, and score-based approaches are fundamentally aligned: all learn probability distributions through energy or score functions. Modern generative AI extensively leverages these principles, emphasizing the enduring importance of energy-based thinking for understanding deep learning.

Score Functions

∇_x log p

Denoising Score Matching

Training Signal

Diffusion Models

Noise Schedules

Classifier Guidance

Conditional Control

References & Further Reading

Energy-based models provide a powerful framework connecting statistical physics, probabilistic inference, and modern deep learning. This section compiles key references for understanding energy functions, Boltzmann machines, and the recent resurgence of EBMs in generative modeling.

From foundational theory to contemporary applications, these materials document how EBMs complement and connect to VAEs, diffusion models, and score-based approaches.
