Energy-based models define a probability distribution implicitly through an energy function E(x). The core principle is the Boltzmann relation: p(x) ∝ exp(−E(x)). This means high-probability regions correspond to low-energy configurations, and vice versa. The unnormalized form sidesteps the partition function Z, which is generally intractable to compute or estimate.
Formally, the normalized distribution is p(x) = exp(−E(x)) / Z where Z = ∫ exp(−E(x)) dx (or ∑ for discrete x). Computing Z requires integrating (or summing) over all possible states, and the cost of doing so by brute force grows exponentially with dimension. For high-dimensional continuous spaces this is almost always infeasible, so EBMs are usually trained and sampled in unnormalized form.
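To make the cost of Z concrete, here is a minimal sketch that computes it by brute force for a small binary model. The pairwise energy and the names `energy`, `partition_function`, `J`, and `b` are illustrative assumptions rather than anything defined in the text; the point is only that the enumeration visits 2^n states.

```python
import itertools
import numpy as np

def energy(x, J, b):
    # Hypothetical pairwise energy over binary states: E(x) = -x^T J x - b^T x.
    return -x @ J @ x - b @ x

def partition_function(J, b):
    n = len(b)
    # Enumerate all 2^n binary configurations; this is the exponential cost.
    states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    energies = np.array([energy(s, J, b) for s in states])
    return np.exp(-energies).sum(), states, energies

rng = np.random.default_rng(0)
n = 10
J = rng.normal(scale=0.1, size=(n, n))
b = rng.normal(scale=0.1, size=n)

Z, states, energies = partition_function(J, b)
probs = np.exp(-energies) / Z          # normalized p(x) = exp(-E(x)) / Z
print(f"{len(states)} states enumerated, Z = {Z:.3f}")
```

At n = 10 this is 1024 terms; at n = 100 the same loop would need about 10^30 terms, which is why the rest of the discussion avoids computing Z directly.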
Energy functions can be arbitrary differentiable mappings. A simple quadratic form E(x) = ½ x^T Σ^{−1} x recovers a Gaussian density. More complex energies capture multi-modal, heavy-tailed, or otherwise intricate distributions. When parameterized by neural networks E_θ(x), the model becomes implicit: we specify no explicit generative process, only an energy landscape. Sampling then requires MCMC, making generation slower than autoregressive or flow-based models but offering flexibility in density design.
The key challenge is training: how do we adjust θ to fit data p_data(x)? The maximum likelihood gradient is ∇_θ log p_θ(x_data) = −∇_θ E_θ(x_data) + 𝔼_{p_θ}[∇_θ E_θ(x)]. The first term pushes energy down on data; the second pushes energy up on samples drawn from the model, and evaluating it requires sampling from p_θ. This dependence on model samples makes training fundamentally different from supervised learning and motivates techniques like contrastive divergence, score matching, and noise contrastive estimation (NCE), which approximate or avoid this expectation.
Energy Design Principles
Design energy functions to be smooth, avoiding sharp discontinuities that complicate gradient-based sampling. Ensure the energy is bounded below and grows fast enough that exp(−E(x)) is integrable, so that the distribution is well defined and its low-energy regions coincide with the data. Add regularization (e.g., L2 on weights) to prevent unbounded energy scaling. When using neural networks, architecture (depth, width, skip connections) affects expressiveness and optimization landscape.
Computational Considerations
Forward pass (computing E(x)) is typically fast (single neural network evaluation). The bottleneck is sampling: MCMC chains must mix well. For high-dimensional data (images, text), poor mixing means expensive training. Pre-training or better initialization helps. Hybrid approaches combining learned energy with auxiliary latent variables (e.g., variational EBMs) can accelerate both training and sampling.
The Boltzmann distribution p(x) = exp(−E(x)/T) / Z stems from equilibrium statistical mechanics. In this context, E is a physical energy, T is absolute temperature, and the distribution describes the probability of microstates at thermal equilibrium. The factor 1/T controls how sharply the distribution concentrates on low-energy states: as T → 0, probability mass accumulates at the global minimum; as T → ∞, all states become nearly equiprobable.
Setting T = 1 (without loss of generality in probabilistic models) recovers our standard form. But temperature is useful conceptually: during simulated annealing, we start with high T (broad, exploratory distribution) and gradually decrease T, sharpening toward the mode. This helps escape local minima in both sampling and optimization. Temperature also appears in tempering methods for MCMC, where parallel chains at different temperatures exchange information to improve mixing.
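The effect of temperature is easy to see on a handful of discrete states. The following is a small sketch with made-up energies; `boltzmann` is a hypothetical helper name.

```python
import numpy as np

def boltzmann(energies, T):
    # Subtracting the minimum energy only improves numerical stability;
    # the shift cancels when we normalize.
    logits = -(energies - energies.min()) / T
    weights = np.exp(logits)
    return weights / weights.sum()

energies = np.array([0.0, 1.0, 2.0, 5.0])   # illustrative values
cold = boltzmann(energies, T=0.05)          # mass concentrates on the minimum
unit = boltzmann(energies, T=1.0)           # the standard form with T = 1
hot = boltzmann(energies, T=100.0)          # close to uniform

for name, p in [("cold", cold), ("unit", unit), ("hot", hot)]:
    print(name, np.round(p, 3))
```

An annealing schedule amounts to sweeping T from the "hot" regime toward the "cold" one while sampling.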
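Taking logarithms of the Boltzmann distribution gives −T log p(x) = E(x) + T log Z: the energy of a state plus a state-independent offset. That offset is the negative of the Helmholtz free energy F = −T log Z, which also satisfies F = ⟨E⟩ − T S, where ⟨E⟩ is the mean energy and S the entropy of the distribution (in units where the Boltzmann constant is 1). Minimizing −T log p(x) with respect to x at fixed T finds the mode; minimizing its average over the data with respect to θ is maximum likelihood under p_θ. More probable states have lower energy relative to F. This framework links machine learning (maximizing likelihood means lowering the energy of the data relative to the free energy) with physics, providing intuition and enabling transfer of techniques from statistical mechanics to deep learning.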
Variational Bayes and mean-field methods use free energy to derive tractable approximations. The KL divergence between an approximate and true posterior is related to a variational free energy bound. Many modern machine learning algorithms (VAE, belief propagation) minimize variational bounds on free energy. This deep connection means understanding Boltzmann distributions strengthens intuition for probabilistic inference generally.
Temperature and Phase Transitions
In some models, varying T reveals phase transitions: qualitative changes in the distribution's structure. The critical temperature marks transitions between behaviors (e.g., order-disorder in spin systems). While less prominent in deep learning models, phase transitions in neural networks have been observed, highlighting complex energy landscapes shaped by network architecture and data.
Connection to Thermodynamic Limit
Statistical mechanics studies limits as system size N → ∞. Free energy per unit F/N becomes well-defined (self-averaging). In machine learning, this perspective suggests studying how learned distributions behave as data dimension and model capacity grow. Large-N limits often exhibit rich behavior: critical phenomena, emergent structure, robustness properties.
The fundamental challenge in EBM training is the intractable partition function. The maximum likelihood gradient is ∇_θ log p_θ(x) = −∇_θ E_θ(x) + 𝔼_{p_θ}[∇_θ E_θ(x)]. The second term requires sampling from the model p_θ, which is expensive. Exact sampling is infeasible, so we use MCMC approximations, introducing bias that must be managed or explicitly corrected.
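On a model small enough to enumerate, the two terms of this gradient can be computed exactly and checked against finite differences. The sketch below assumes a hypothetical linear energy E_θ(x) = −θ·x over binary vectors, chosen only because ∇_θ E_θ(x) = −x is trivial to write down.

```python
import itertools
import numpy as np

def log_prob(theta, x, states):
    # log p_theta(x) = -E_theta(x) - log Z, with E_theta(x) = -theta . x
    log_Z = np.log(np.exp(states @ theta).sum())
    return x @ theta - log_Z

def ml_gradient(theta, x_data, states):
    weights = np.exp(states @ theta)
    probs = weights / weights.sum()
    grad_E_data = -x_data                  # grad_theta E at the data point
    grad_E_model = -(probs @ states)       # exact E_{p_theta}[grad_theta E]
    return -grad_E_data + grad_E_model

states = np.array(list(itertools.product([0, 1], repeat=4)), dtype=float)
theta = np.array([0.5, -1.0, 0.2, 0.0])
x_data = np.array([1.0, 0.0, 1.0, 1.0])

grad = ml_gradient(theta, x_data, states)

# Central finite differences on log p_theta(x_data) as an independent check.
h = 1e-6
fd = np.array([
    (log_prob(theta + h * e, x_data, states)
     - log_prob(theta - h * e, x_data, states)) / (2 * h)
    for e in np.eye(len(theta))
])
print(np.round(grad, 4), np.round(fd, 4))
```

In a realistic EBM the exact model expectation is unavailable, and the `grad_E_model` line is where MCMC samples would be substituted.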
Contrastive divergence (CD) sidesteps this by running MCMC for just k steps starting from data (k-step CD, typically k = 1). Start with x ~ p_data, run k MCMC steps to get x', then update the weights as if the chain had converged, using x' in place of a model sample. This introduces bias because the chain has not reached its stationary distribution; the bias shrinks as k grows, and in practice it is often small enough for learning to succeed. CD is fast and practical, which made it the de facto standard for RBM and early EBM training. Variants like persistent CD maintain a chain across parameter updates rather than restarting from data, further reducing bias.
Score matching avoids sampling by directly fitting the score function ∇_x log p(x), which does not depend on Z. The objective 𝔼_x[‖∇_x log p_θ(x) − ∇_x log p_data(x)‖²] appears to require the unknown data score, but integration by parts rewrites it, up to a positive scale factor and a constant independent of θ, as 𝔼_x[tr(∇_x² log p_θ(x)) + ½ ‖∇_x log p_θ(x)‖²], which involves only the model. The Hessian trace is expensive in high dimensions, which motivates cheaper variants. Denoising score matching adds noise to the data and matches the score of the noisy distribution, so neither the Hessian nor the true data score is needed. This approach naturally connects to diffusion models and has proven highly effective in modern generative modeling.
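As a sanity check on the idea that score matching needs no normalizer, the following sketch fits a one-parameter model by the integration-by-parts form of the objective, 𝔼[∂_x s_θ(x) + ½ s_θ(x)²] with s_θ the model score. The model E_θ(x) = ½ λ x² and the grid search are assumptions made for illustration only.

```python
import numpy as np

def ism_objective(lam, x):
    # Model energy E(x) = 0.5 * lam * x^2, so the score is s(x) = -lam * x
    # and its derivative is ds/dx = -lam. No partition function appears.
    score = -lam * x
    dscore = -lam
    return np.mean(dscore + 0.5 * score ** 2)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=50_000)       # data with variance 4

grid = np.linspace(0.05, 1.0, 96)           # candidate precisions, step 0.01
values = [ism_objective(lam, x) for lam in grid]
lam_hat = grid[int(np.argmin(values))]
print("estimated precision:", lam_hat)
```

The minimizer lands near 1/4, the precision of the data, even though the objective never evaluates Z or the data score.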
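Noise contrastive estimation (NCE) reformulates density estimation as binary classification: discriminate data from noise samples. Given a noise distribution p_n that can be both sampled and evaluated, define the logit h_θ(x) = log p_θ(x) − log p_n(x) = −E_θ(x) − c − log p_n(x), where c is a learned scalar standing in for log Z. With equal numbers of data and noise samples, NCE maximizes 𝔼_data[log σ(h_θ(x))] + 𝔼_noise[log(1 − σ(h_θ(x)))], where σ is the sigmoid. Under certain conditions (for example, p_n nonzero wherever p_data is), this yields a consistent estimator of p(x). Variants using different divergences (f-divergence, Wasserstein) have been proposed, offering flexibility in training objectives.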
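A minimal sketch of an NCE loss follows, assuming the standard logit h(x) = −E(x) − c − log p_n(x) with c a free parameter that plays the role of log Z. The toy energy, the noise distribution, and the name `nce_loss` are illustrative assumptions.

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    return -np.logaddexp(0.0, -z)

def nce_loss(c, x_data, x_noise, energy, log_noise_pdf):
    # Logit of "x came from data" under equal numbers of data and noise samples.
    h_data = -energy(x_data) - c - log_noise_pdf(x_data)
    h_noise = -energy(x_noise) - c - log_noise_pdf(x_noise)
    # log(1 - sigmoid(h)) equals log_sigmoid(-h).
    return -(log_sigmoid(h_data).mean() + log_sigmoid(-h_noise).mean())

rng = np.random.default_rng(0)
x_data = rng.normal(0.0, 1.0, size=20_000)      # data: standard normal
x_noise = rng.normal(0.0, 2.0, size=20_000)     # noise: N(0, 2^2)

energy = lambda x: 0.5 * x ** 2
log_noise_pdf = lambda x: -0.5 * (x / 2.0) ** 2 - np.log(2.0 * np.sqrt(2.0 * np.pi))

c_true = 0.5 * np.log(2.0 * np.pi)              # log Z of exp(-x^2 / 2)
loss_true = nce_loss(c_true, x_data, x_noise, energy, log_noise_pdf)
loss_low = nce_loss(c_true - 2.0, x_data, x_noise, energy, log_noise_pdf)
loss_high = nce_loss(c_true + 2.0, x_data, x_noise, energy, log_noise_pdf)
print(loss_true, loss_low, loss_high)
```

The loss is lowest near the true log-normalizer, which is the sense in which NCE estimates Z as a by-product of classification.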
Mini-Batch Training with Replay Buffer
In practice, MCMC chains are expensive to restart from scratch each iteration. A replay buffer maintains a pool of samples from previous iterations, initialized once and reused. New samples are drawn from the buffer, MCMC-updated a few steps, then returned to buffer. This amortizes computational cost and improves stability by maintaining diverse samples across batches.
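The buffer logic described above can be sketched in a few lines. The class name `ReplayBuffer`, the uniform initialization, and the small `reinit_prob` (restarting a fraction of chains from noise) are assumptions for illustration; the toy quadratic energy stands in for a learned network.

```python
import numpy as np

class ReplayBuffer:
    """Pool of persistent MCMC samples (hypothetical helper)."""

    def __init__(self, capacity, dim, rng):
        self.rng = rng
        self.dim = dim
        self.samples = rng.uniform(-1.0, 1.0, size=(capacity, dim))

    def draw(self, batch_size, reinit_prob=0.05):
        idx = self.rng.integers(0, len(self.samples), size=batch_size)
        batch = self.samples[idx].copy()
        # Assumption: occasionally restart a chain from noise to keep the pool diverse.
        fresh = self.rng.random(batch_size) < reinit_prob
        batch[fresh] = self.rng.uniform(-1.0, 1.0, size=(int(fresh.sum()), self.dim))
        return idx, batch

    def store(self, idx, batch):
        self.samples[idx] = batch

def langevin_refine(x, grad_energy, eps, n_steps, rng):
    # A few Langevin steps applied to every sample in the batch.
    for _ in range(n_steps):
        x = x - 0.5 * eps * grad_energy(x) + np.sqrt(eps) * rng.normal(size=x.shape)
    return x

rng = np.random.default_rng(0)
buffer = ReplayBuffer(capacity=1000, dim=2, rng=rng)
grad_energy = lambda x: x                 # toy energy E(x) = 0.5 * ||x||^2

for _ in range(200):                      # stands in for training iterations
    idx, batch = buffer.draw(batch_size=64)
    batch = langevin_refine(batch, grad_energy, eps=0.1, n_steps=5, rng=rng)
    buffer.store(idx, batch)

print("buffer std:", buffer.samples.std())
```

Each sample accumulates many short refinements over time, so the pool drifts toward the model distribution without any single long chain.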
Divergence and Regularization
Without regularization, energy can scale arbitrarily (e.g., E_θ(x) → −∞ for all x). Add L2 regularization on weights, entropy regularization on the energy, or normalize energy magnitudes. Some methods use constrained optimization to keep energy bounded. Proper regularization is crucial for stability and preventing mode collapse in EBM training.
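Sampling from an unnormalized EBM p(x) ∝ exp(−E(x)) requires Markov chain Monte Carlo. MCMC constructs a Markov chain, usually a reversible one, whose stationary distribution is p(x). The Metropolis-Hastings algorithm is general: propose a new state via a proposal distribution q(x' | x), then accept with probability min(1, p(x') q(x|x') / (p(x) q(x'|x))). Only the ratio p(x')/p(x) = exp(E(x) − E(x')) appears, so Z cancels and never needs to be computed. When a proposal is rejected, the chain stays at the current state and that state is counted again; this is what makes the chain satisfy detailed balance with respect to p(x).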
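A minimal sketch of Metropolis-Hastings with a symmetric Gaussian random-walk proposal follows. With a symmetric q the proposal terms cancel and the acceptance ratio reduces to exp(E(x) − E(x')). The quadratic energy and the step size are illustrative assumptions.

```python
import numpy as np

def metropolis_hastings(energy, x0, n_steps, step, rng):
    x, e = x0, energy(x0)
    chain = np.empty(n_steps)
    for t in range(n_steps):
        x_new = x + step * rng.normal()          # symmetric proposal
        e_new = energy(x_new)
        # Accept with probability min(1, exp(E(x) - E(x_new))); Z never appears.
        if np.log(rng.random()) < e - e_new:
            x, e = x_new, e_new
        chain[t] = x                             # a rejected step repeats the current state
    return chain

rng = np.random.default_rng(0)
energy = lambda x: 0.5 * x ** 2                  # unnormalized standard normal
chain = metropolis_hastings(energy, x0=3.0, n_steps=20_000, step=1.0, rng=rng)
samples = chain[1_000:]                          # discard burn-in
print(samples.mean(), samples.var())
```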
For EBMs with continuous state spaces and smooth energies, Langevin dynamics is highly effective. The update is x_{t+1} = x_t + (ε/2) ∇_x log p(x_t) + √ε · z_t, where z_t ~ N(0, I) and ε is the step size; since ∇_x log p(x) = −∇_x E(x), this is the same as x_{t+1} = x_t − (ε/2) ∇_x E(x_t) + √ε · z_t. As ε → 0 and t → ∞, the chain converges to p(x) (under mild conditions); for finite ε the unadjusted chain carries a discretization bias unless a Metropolis-Hastings correction is added. Langevin is computationally cheap: one gradient evaluation per step. The negative gradient −∇_x E(x) points downhill toward low energy; the noise enables exploration and keeps chains from collapsing onto the modes.
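The Langevin update can be sketched directly from the formula. The example below runs many independent chains in parallel on a toy quadratic energy; the function name and all numerical settings are assumptions for illustration, and no Metropolis correction is applied.

```python
import numpy as np

def langevin_sample(grad_energy, x0, eps, n_steps, rng):
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.normal(size=x.shape)
        # x_{t+1} = x_t - (eps / 2) * grad E(x_t) + sqrt(eps) * z_t
        x = x - 0.5 * eps * grad_energy(x) + np.sqrt(eps) * noise
    return x

rng = np.random.default_rng(0)
grad_energy = lambda x: x                  # E(x) = 0.5 * ||x||^2, standard normal target
x0 = np.full((2000, 2), 5.0)               # 2000 chains, all started far from the mode
samples = langevin_sample(grad_energy, x0, eps=0.1, n_steps=500, rng=rng)
print(samples.mean(), samples.var())
```

The sample variance comes out slightly above 1; that small excess is the finite-step discretization bias of the unadjusted sampler.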
Hamiltonian Monte Carlo (HMC) augments state with auxiliary momentum variables, enabling larger steps and faster mixing. The dynamics follow Hamiltonian equations, combining potential energy E(x) and kinetic energy K(p) = ½ p^T M^{−1} p. A leapfrog integrator discretizes dynamics with step size ε. Proposals are accepted via Metropolis-Hastings. HMC is more complex than Langevin but mixes much faster (fewer correlated samples), critical for high-dimensional problems.
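One HMC transition, with a leapfrog integrator and a Metropolis-Hastings accept step, can be sketched as follows. The mass matrix is taken to be the identity, and the target, step size, and trajectory length are illustrative assumptions.

```python
import numpy as np

def hmc_step(x, energy, grad_energy, eps, n_leapfrog, rng):
    p0 = rng.normal(size=x.shape)                # momentum with M = I
    x_new, p = x.copy(), p0.copy()
    p -= 0.5 * eps * grad_energy(x_new)          # initial half step for momentum
    for i in range(n_leapfrog):
        x_new += eps * p                         # full step for position
        if i < n_leapfrog - 1:
            p -= eps * grad_energy(x_new)        # full step for momentum
    p -= 0.5 * eps * grad_energy(x_new)          # final half step for momentum
    h_old = energy(x) + 0.5 * p0 @ p0            # Hamiltonian = potential + kinetic
    h_new = energy(x_new) + 0.5 * p @ p
    if np.log(rng.random()) < h_old - h_new:     # Metropolis-Hastings correction
        return x_new, True
    return x, False

rng = np.random.default_rng(0)
energy = lambda x: 0.5 * x @ x
grad_energy = lambda x: x
x = np.array([3.0, -3.0])
draws, n_accept = [], 0
for _ in range(3000):
    x, accepted = hmc_step(x, energy, grad_energy, eps=0.2, n_leapfrog=10, rng=rng)
    draws.append(x)
    n_accept += accepted
samples = np.array(draws[500:])                  # discard burn-in
accept_rate = n_accept / 3000
print(accept_rate, samples.var())
```

Because the leapfrog integrator nearly conserves the Hamiltonian, most proposals are accepted even though each one moves far from the current state.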
Mixing time is how quickly the chain forgets initialization. Fast mixing (low autocorrelation) means shorter burn-in and higher effective sample size. Slow mixing (high autocorrelation) wastes samples and can make sampling prohibitively expensive. Diagnosing mixing requires autocorrelation plots, effective sample size computation, and multiple-chain diagnostics (the Gelman–Rubin R̂ statistic). Poor mixing often signals model problems (ill-specified energy, wrong temperature) or algorithmic issues (step size too small).
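A rough effective-sample-size estimate can be computed from the empirical autocorrelation. The sketch below truncates the sum at the first non-positive autocorrelation, which is one simple convention among several, and tests it on a synthetic AR(1) chain whose autocorrelation is known.

```python
import numpy as np

def autocorrelation(chain, max_lag):
    x = chain - chain.mean()
    denom = x @ x
    return np.array([x[: len(x) - k] @ x[k:] / denom for k in range(max_lag + 1)])

def effective_sample_size(chain, max_lag=200):
    rho = autocorrelation(chain, max_lag)
    # Assumption: truncate the sum at the first non-positive autocorrelation.
    nonpositive = np.flatnonzero(rho[1:] <= 0.0)
    cutoff = nonpositive[0] + 1 if nonpositive.size else max_lag + 1
    tau = 1.0 + 2.0 * rho[1:cutoff].sum()        # integrated autocorrelation time
    return len(chain) / tau

rng = np.random.default_rng(0)
n, phi = 50_000, 0.9
noise = rng.normal(size=n)
slow = np.empty(n)
slow[0] = noise[0]
for t in range(1, n):                            # AR(1): strongly autocorrelated chain
    slow[t] = phi * slow[t - 1] + np.sqrt(1 - phi ** 2) * noise[t]
fast = rng.normal(size=n)                        # independent draws for comparison

ess_slow = effective_sample_size(slow)
ess_fast = effective_sample_size(fast)
print(ess_slow, ess_fast)
```

For the AR(1) chain the theoretical value is n (1 − φ)/(1 + φ), roughly 2600 here, so most of the 50,000 draws carry redundant information.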
Parallel Tempering and Replica Exchange
Run multiple chains at different temperatures T_1 < T_2 < ... < T_K, with T_1 the target temperature (hot chains see smoother, flatter energy landscapes). Periodically propose swapping the states of adjacent chains i and j, accepting with probability min(1, exp((1/T_i − 1/T_j)(E(x_i) − E(x_j)))); hotter chains mix faster and can help colder chains escape local minima. This can dramatically improve effective mixing, especially for multimodal distributions. Parallel tempering is also known as replica exchange MCMC, and the swap move can be layered on top of any within-chain sampler (Metropolis-Hastings, Langevin, HMC).
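The following sketch runs parallel tempering on a double-well energy, using random-walk Metropolis within each chain and the swap rule min(1, exp((1/T_i − 1/T_j)(E(x_i) − E(x_j)))) between adjacent chains. The energy, temperature ladder, and step scaling are assumptions chosen for illustration.

```python
import numpy as np

def energy(x):
    # Double well with minima at x = -1 and x = +1 and a barrier of height 8 at x = 0.
    return 8.0 * (x ** 2 - 1.0) ** 2

def parallel_tempering(temps, n_sweeps, step, rng):
    K = len(temps)
    x = np.full(K, -1.0)                         # every chain starts in the left well
    cold = np.empty(n_sweeps)
    for s in range(n_sweeps):
        # Within-chain random-walk Metropolis at each temperature.
        prop = x + step * np.sqrt(temps) * rng.normal(size=K)
        accept = np.log(rng.random(K)) < (energy(x) - energy(prop)) / temps
        x = np.where(accept, prop, x)
        # Propose a swap between one random adjacent pair of chains.
        i = rng.integers(0, K - 1)
        log_a = (1.0 / temps[i] - 1.0 / temps[i + 1]) * (energy(x[i]) - energy(x[i + 1]))
        if np.log(rng.random()) < log_a:
            x[i], x[i + 1] = x[i + 1], x[i]
        cold[s] = x[0]                           # record the chain at the target temperature
    return cold

rng = np.random.default_rng(0)
temps = np.array([1.0, 2.0, 4.0, 8.0])
cold = parallel_tempering(temps, n_sweeps=20_000, step=0.3, rng=rng)
frac_right = (cold[2_000:] > 0).mean()
print("fraction of cold samples in the right well:", frac_right)
```

The target is symmetric, so a well-mixed cold chain should spend about half its time in each well; a single chain at T = 1 started at x = −1 would cross the barrier far less often.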
Burn-In and Equilibration
Initial samples depend on the starting point and should be discarded (burn-in phase). Adaptive methods adjust the step size during burn-in to reach a target acceptance rate. For Metropolis-adjusted Langevin (MALA), the asymptotically optimal acceptance rate is about 57%, so tuning toward roughly 50 to 60% is common; unadjusted Langevin has no accept/reject step, and its step size is chosen to balance discretization bias against mixing speed. Properly configured burn-in is essential for valid inference; insufficient burn-in biases estimates toward the initial conditions.
Restricted Boltzmann machines are bipartite graphical models: visible units v ∈ {0,1}^n and hidden units h ∈ {0,1}^m. The energy function is E(v,h) = −v^T W h − b^T v − c^T h, where W is an n × m weight matrix, b and c are biases. The restriction: no connections within layers (no v_i − v_j or h_i − h_j edges). This bipartite structure ensures a critical property: given visible units, hidden units are conditionally independent, and vice versa.
The conditional distributions factor: p(h|v) = ∏_i p(h_i|v), p(v|h) = ∏_j p(v_j|h). With v indexed by j and h by i, p(h_i = 1|v) = σ(c_i + ∑_j W_{ji} v_j) and p(v_j = 1|h) = σ(b_j + ∑_i W_{ji} h_i), where σ is the sigmoid. These conditionals are Bernoulli; computing a whole layer's conditionals is one matrix-vector product, O(nm) time. Block Gibbs sampling, which alternately resamples all h given v and then all v given h, is therefore efficient and provides a tractable sampling algorithm.
Training RBMs uses contrastive divergence with block Gibbs. Start with v_data from training set. (1) Sample h ~ p(h|v_data). (2) Sample v' ~ p(v|h). (3) Sample h' ~ p(h|v'). Weight update: ΔW ∝ v_data ⊗ h − v' ⊗ h' (and analogously for biases). This contrasts statistics under the data distribution (v_data ⊗ h) with model statistics (v' ⊗ h'). The update drives weights to increase model probability on data and decrease it elsewhere.
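The three sampling steps and the weight update can be sketched as follows for a binary RBM with W of shape (n, m). The toy dataset, learning rate, and function names are assumptions for illustration; the update uses sampled hidden states, as in the description above, rather than hidden probabilities.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h(v, W, c, rng):
    p = sigmoid(v @ W + c)                       # p(h = 1 | v), shape (batch, m)
    return p, (rng.random(p.shape) < p).astype(float)

def sample_v(h, W, b, rng):
    p = sigmoid(h @ W.T + b)                     # p(v = 1 | h), shape (batch, n)
    return p, (rng.random(p.shape) < p).astype(float)

def cd1_update(v_data, W, b, c, lr, rng):
    _, h = sample_h(v_data, W, c, rng)           # (1) h ~ p(h | v_data)
    pv1, v1 = sample_v(h, W, b, rng)             # (2) v' ~ p(v | h)
    _, h1 = sample_h(v1, W, c, rng)              # (3) h' ~ p(h | v')
    n = len(v_data)
    W += lr * (v_data.T @ h - v1.T @ h1) / n     # data statistics minus model statistics
    b += lr * (v_data - v1).mean(axis=0)
    c += lr * (h - h1).mean(axis=0)
    return np.mean((v_data - pv1) ** 2)          # reconstruction error, for monitoring only

rng = np.random.default_rng(0)
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
v_data = np.repeat(patterns, 50, axis=0)         # toy dataset: two patterns, 50 copies each

W = rng.normal(scale=0.01, size=(6, 4))
b, c = np.zeros(6), np.zeros(4)
errors = [cd1_update(v_data, W, b, c, lr=0.1, rng=rng) for _ in range(1500)]
print("first error:", errors[0], "late error:", np.mean(errors[-50:]))
```

Reconstruction error is not the quantity CD optimizes, but on a toy problem like this it is a convenient sign that the weights have picked up the two patterns.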
RBMs were pivotal in the deep learning renaissance (Hinton et al., 2006). Stacking RBMs greedily (training each layer using hidden units of the previous as visible units) provided a pre-training strategy for deep networks, addressing the vanishing gradient problem of the time. While modern techniques (batch norm, ReLU, skip connections) have obviated the need for RBM pre-training in many applications, RBMs remain theoretically important and useful when interpretability of latent features is desired.
Gaussian RBMs and Extensions
Extending RBMs to continuous visible units (Gaussian RBMs) involves modifying conditionals: v follows a Gaussian, h remains Bernoulli. The energy becomes E(v,h) = ½ v^T Λ^{−1} v − v^T W h − b^T v − c^T h. Training procedure is similar but requires handling Gaussian variables. More complex extensions (mixed Bernoulli-Gaussian, spike-and-slab) increase expressiveness but complicate inference and training.
Applications and Modern View
Modern use of RBMs: collaborative filtering (Netflix Prize context), feature learning for downstream tasks, anomaly detection, and theoretical study. They serve as building blocks in energy-based models, often paired with other techniques. The RBM literature is rich with insights on graphical models, sampling, and probabilistic inference that generalize beyond RBMs themselves.
Deep Boltzmann machines extend RBMs to multiple layers of hidden units. Energy function: E = −v^T W^(1) h^(1) − (h^(1))^T W^(2) h^(2) − ... − (h^(K−1))^T W^(K) h^(K) − b^T v − ∑_k (c^(k))^T h^(k). Unlike RBMs (which have two layers) or DBNs (directed hierarchy), DBMs are fully undirected. All coupling is through weight matrices; no intra-layer edges.
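Exact inference in DBMs is intractable: the posterior p(h|v) over all hidden layers does not factorize. Each layer is conditionally independent given its neighboring layers, so p(h^(k) | h^(k−1), h^(k+1)) is a product of sigmoids (with h^(0) = v), but the posterior of a layer given v alone requires marginalizing over the other hidden layers, which couples the units. Variational inference approximates the posterior using a mean-field assumption: q(h|v) = ∏_k ∏_i q(h_i^(k)|v). Writing μ^(k) for the vector of q(h_i^(k) = 1|v), the update for an interior layer is μ^(k) = σ((W^(k))^T μ^(k−1) + W^(k+1) μ^(k+1) + c^(k)), with μ^(0) = v and the W^(k+1) term dropped for the top layer. Iterating these updates until convergence gives a fixed point that approximates the true posterior.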
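For a DBM with two hidden layers, the mean-field fixed point can be found by iterating sigmoid updates in which each layer receives input from both of its neighbors. The sketch below assumes W1 of shape (n, m1) and W2 of shape (m1, m2), random small weights, and a hypothetical function name `mean_field`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_field(v, W1, W2, c1, c2, max_iters=100, tol=1e-8):
    mu2 = np.full(W2.shape[1], 0.5)              # uninformative start for the top layer
    mu1 = sigmoid(v @ W1 + W2 @ mu2 + c1)
    for _ in range(max_iters):
        # Layer 1 receives bottom-up input from v and top-down input from layer 2.
        mu1_new = sigmoid(v @ W1 + W2 @ mu2 + c1)
        # Layer 2 (the top layer) receives only bottom-up input from layer 1.
        mu2_new = sigmoid(mu1_new @ W2 + c2)
        change = max(np.abs(mu1_new - mu1).max(), np.abs(mu2_new - mu2).max())
        mu1, mu2 = mu1_new, mu2_new
        if change < tol:
            break
    return mu1, mu2

rng = np.random.default_rng(0)
n, m1, m2 = 6, 5, 4
W1 = rng.normal(scale=0.1, size=(n, m1))
W2 = rng.normal(scale=0.1, size=(m1, m2))
c1, c2 = np.zeros(m1), np.zeros(m2)
v = rng.integers(0, 2, size=n).astype(float)

mu1, mu2 = mean_field(v, W1, W2, c1, c2)
print(np.round(mu1, 3), np.round(mu2, 3))
```

The top-down term `W2 @ mu2` is what distinguishes this from a single feed-forward pass through stacked RBMs.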
Training uses a variational EM-style procedure: the E-step computes the mean-field approximation of the posterior over all hidden units for each training example; the M-step updates the weights by a gradient step on the resulting variational lower bound. The data-dependent statistics come from the mean-field posterior, while the model-dependent expectation still has to be estimated by MCMC, typically with persistent chains. The E-step is itself iterative, making training slow. Pre-training each layer as an RBM (greedy layer-wise pre-training) provides initialization. Despite slower training, DBMs can learn hierarchical representations where higher layers capture abstract, increasingly invariant features.
DBMs have been applied to image generation, feature learning, and modeling complex dependencies. However, they have largely been eclipsed by alternatives: VAEs offer stable training, autoregressive models and flows provide tractable likelihoods, and diffusion models achieve state-of-the-art sample quality. Nevertheless, DBMs offer unique theoretical insights: they show how to stack energy-based layers, the importance of mean-field approximation in inference, and tradeoffs between model expressiveness and training tractability.
Pre-Training and Initialization
Greedy pre-training: train DBM layer 1 (v, h^(1)) as RBM, then treat h^(1) as visible and train (h^(1), h^(2)) as RBM, etc. This initialization avoids poor local minima and accelerates convergence. However, it's computationally expensive and must be coordinated carefully to ensure coherence across layers. Modern deep learning usually skips pre-training in favor of better optimizers and architectures.
Relation to Other Hierarchical Models
DBNs (deep belief networks) are partly directed, partly undirected, offering a middle ground. Autoencoders and VAEs provide deterministic or variational hierarchies with explicit encoder-decoder structure. Normalizing flows stack invertible transformations. Each approach trades off tractability (compute likelihood?), expressiveness, and interpretability (can we visualize learned features?).
Modern EBMs use neural networks E_θ(x) as flexible energy function approximators, without explicit latent structure (unlike RBMs/DBMs). The network output is a scalar energy. For images, architectures like ResNets or U-Nets capture spatial structure. The model remains implicit: no sampling mechanism is built in; sampling requires MCMC. Inference (both training and generation) involves Langevin dynamics or HMC, scaling these algorithms to high dimensions via learned energy parametrizations.
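Stochastic gradient Langevin dynamics (SGLD) was introduced for Bayesian posterior sampling over parameters, where the gradient of the log-posterior is estimated on mini-batches of data. In the EBM literature the name is commonly used for Langevin sampling in data space: a batch of chains is updated in parallel as x_{t+1} = x_t − (ε/2) ∇_x E_θ(x_t) + √ε · z_t, with ∇_x E_θ obtained by backpropagation through the energy network. The noise term is crucial: without it the update is plain gradient descent on the energy, and the chains collapse onto local minima instead of sampling from p_θ. In practice the step size and noise scale are often tuned separately and chains are run for only a few steps, so the samples are only approximately distributed as p_θ. This sampler is practical for large-scale data and has been applied to image generation, achieving reasonable quality with deep ResNets as energy functions.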
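Joint Energy Models (JEMs) unify discriminative and generative objectives. A single energy function E_θ(x, y) describes both, typically by reading a classifier's logits f_θ(x)[y] as negative energies, E_θ(x, y) = −f_θ(x)[y]. For discriminative tasks, p(y|x) = exp(−E_θ(x, y)) / ∑_{y'} exp(−E_θ(x, y')), the usual softmax; in the binary case the log-odds are log p(y=1|x) − log p(y=0|x) = E_θ(x, y=0) − E_θ(x, y=1), a relative energy difference. For generative tasks, marginalizing over y gives E_θ(x) = −log ∑_y exp(−E_θ(x, y)), and samples from p(x) are drawn via Langevin dynamics using ∇_x E_θ(x). Training combines supervised classification loss (with data labels) and unsupervised density matching (with generated samples). JEMs achieve reasonable classification and generation with a single architecture, though performance is typically worse than specialized models.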
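The relationship between classifier logits and energies can be sketched in a few lines, assuming the convention E(x, y) = −logits[y], so that p(y|x) is a softmax over logits and the marginal energy is E(x) = −logsumexp(logits). The logit values below are made up for illustration.

```python
import numpy as np

def logsumexp(a):
    # Stable log-sum-exp over the last axis.
    m = a.max(axis=-1, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=-1, keepdims=True)))[..., 0]

def jem_quantities(logits):
    # Convention assumed here: E(x, y) = -logits[y].
    lse = logsumexp(logits)
    log_p_y_given_x = logits - lse[..., None]    # softmax classifier
    energy_x = -lse                              # marginal energy, y summed out
    return log_p_y_given_x, energy_x

logits = np.array([[2.0, -1.0],                  # two inputs, two classes
                   [0.5, 0.5]])
log_p, energy_x = jem_quantities(logits)

joint_energy = -logits
log_odds = log_p[:, 1] - log_p[:, 0]             # log p(y=1|x) - log p(y=0|x)
energy_gap = joint_energy[:, 0] - joint_energy[:, 1]
print(log_odds, energy_gap, energy_x)
```

The same network output therefore serves two roles: differences between logits drive classification, while their log-sum-exp defines the unnormalized density over x.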
Connections to modern generative modeling abound. Diffusion models learn score functions ∇_x log p_t(x) at multiple noise scales via score matching. When the learned score field is the gradient of a scalar function, this is equivalent to learning energy landscapes {E_t} with ∇_x log p_t(x) = −∇_x E_t(x). Noise-conditional networks trained to predict the noise in a corrupted image implicitly learn scaled scores, that is, scaled negative energy gradients. The connection reveals that diffusion, a dominant modern approach, is inherently energy-based, learning probability densities through energy (or score) functions. This unification suggests future directions: hybrids of EBMs and diffusion, conditional EBMs for structured prediction, and using energy design for interpretability in deep generative models.
Why Neural Network Energies?
Parametric energies offer flexibility: design a network for your data structure (CNN for images, Transformer for sequences). Neural networks are universal approximators; with enough capacity they can represent essentially any continuous energy landscape, though whether training finds it is a separate question. This contrasts with hand-crafted energies (e.g., Ising model form), which are rigid but interpretable. The tradeoff: flexibility vs. interpretability.
Computational Bottlenecks
MCMC sampling is slow, especially for high-resolution images or long sequences. Techniques to address this: (1) Better initialization (start from previous samples or noise in replay buffer). (2) Hybrid approaches (combine with learned denoiser). (3) Better MCMC (HMC, parallel tempering, ensemble methods). (4) Approximations (few Langevin steps, approximate gradients). Each trades off sample quality for speed.
The score function ∇_x log p(x) is fundamentally an energy gradient: since log p(x) = −E(x) − log Z and Z does not depend on x, ∇_x log p(x) = −∇_x E(x) exactly. Learning the score is equivalent to learning the energy landscape's shape. Score matching minimizes 𝔼_x[‖∇_x log p_θ(x) − ∇_x log p_data(x)‖²], directly fitting score functions. Denoising score matching corrupts each data point with Gaussian noise of scale σ, x̃ = x + σ z with z ~ N(0, I), and trains a network s_θ(x̃) to match the score of the corruption kernel, ∇_x̃ log q_σ(x̃ | x) = −(x̃ − x)/σ². The minimizer is the score of the noisy data distribution q_σ(x̃), which approximates the true score for small σ. This trick sidesteps the need for true data scores.
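Denoising score matching can be checked in closed form on Gaussian data. The sketch below assumes a Gaussian corruption x̃ = x + σ z, whose regression target is −(x̃ − x)/σ², and a one-parameter linear score model s(x̃) = −a x̃; both choices, and all variable names, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_data, sigma_noise = 1.0, 0.5
x = rng.normal(0.0, sigma_data, size=200_000)
x_noisy = x + sigma_noise * rng.normal(size=x.shape)

# DSM regression target: the score of the Gaussian corruption kernel q(x_noisy | x).
target = -(x_noisy - x) / sigma_noise ** 2

# Linear score model s(x) = -a * x. Minimizing the squared error
# sum((-a * x_noisy - target)^2) over a has this closed-form solution.
a_hat = -np.sum(x_noisy * target) / np.sum(x_noisy ** 2)

# The noisy data are N(0, sigma_data^2 + sigma_noise^2), whose score is -x / variance.
a_true = 1.0 / (sigma_data ** 2 + sigma_noise ** 2)
print(a_hat, a_true)
```

The fitted slope matches the score of the noisy distribution rather than that of the clean data, which is why small noise scales (or a schedule of scales) are used in practice.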
Diffusion models train score networks at multiple noise scales. Start with data, progressively add Gaussian noise over T timesteps, then reverse the process by learning ∇_x log p_t(x) at each scale t. The reverse diffusion process (removing noise iteratively) samples from p(x). Since each step is a slight perturbation, sampling is stable. Deep generative models based on diffusion (DDPM, EDM, etc.) have achieved state-of-the-art sample quality, suggesting score-based approaches are highly effective.
The connection: diffusion models are learning energy landscapes across noise scales. The score ∇_x log p_t(x) at scale t points in the direction of steepest increase of the log-density of the noisy distribution p_t. The sequence {∇_x log p_t} as t decreases from T to 0 forms a gradient field guiding samples from noise toward data. This is precisely energy-based thinking applied across scales: use energy/score information to navigate high-dimensional spaces.
Classifier guidance leverages classifier score gradients to shape generation. If we have a classifier p(y|x), its score ∇_x log p(y|x) indicates regions where class y is probable. Combining unconditional score ∇_x log p(x) with classifier guidance ∇_x log p(y|x) yields ∇_x log(p(x) p(y|x)^τ), focusing samples on x that are both realistic and likely to belong to class y. The guidance strength τ controls conditioning strength. This technique has proven powerful for conditional generation in diffusion models, further validating energy/score-based approaches.
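The guided score ∇_x log p(x) + τ ∇_x log p(y|x) can be illustrated on a one-dimensional toy problem. The sketch assumes a standard normal p(x), a logistic classifier p(y=1|x) = σ(w x), and plain Langevin sampling at a single noise level; a real diffusion sampler would use a noise-conditional score and classifier.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def guided_score(x, tau, w=2.0):
    prior_score = -x                                   # grad_x log N(x; 0, 1)
    classifier_score = w * (1.0 - sigmoid(w * x))      # grad_x log sigmoid(w * x)
    return prior_score + tau * classifier_score

def langevin(score, x, eps, n_steps, rng):
    for _ in range(n_steps):
        x = x + 0.5 * eps * score(x) + np.sqrt(eps) * rng.normal(size=x.shape)
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(size=5000)
unguided = langevin(lambda x: guided_score(x, tau=0.0), x0, eps=0.05, n_steps=400, rng=rng)
guided = langevin(lambda x: guided_score(x, tau=2.0), x0, eps=0.05, n_steps=400, rng=rng)
print(unguided.mean(), guided.mean())
```

With τ = 0 the samples stay centered at zero; with τ = 2 they shift toward the region the classifier assigns to y = 1, while the Gaussian term keeps them from drifting arbitrarily far.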
Hybrid Approaches: EBMs + Diffusion
Combining explicit energy functions with diffusion-based sampling: train an energy network E_θ(x), then use SGLD or similar at different noise scales to generate samples. Alternatively, distill diffusion model gradients into an energy function for faster sampling. Such hybrids leverage strengths of both: energy design provides interpretability and control, diffusion provides stable training and sampling.
Future Directions
Applications to structured prediction (learn energy over both inputs and outputs, use optimization at test time to find best output). Interpretability (design energies for domain knowledge, e.g., enforcing constraints). Continual learning and adaptation (adjust energy function as new data arrives). Combining with symbolic reasoning (energy-based models for logic, planning). The enduring principle—specify distributions through energy or score functions—will likely remain central to generative modeling and beyond.
Foundational References
Architectures & Models
- Restricted Boltzmann Machines (RBM) — Bipartite energy with efficient inference and sampling
- Deep Boltzmann Machines (DBM) — Deep hierarchical energy-based generative models
- Modern EBMs — Deep convolutional energy functions with SGLD sampling
Training Algorithms
- Maximum Likelihood — Direct log-likelihood maximization
- Contrastive Divergence — Approximating gradient through MCMC chains
- Persistent Chains — Warm-starting MCMC chains for efficient training
- Score Matching — Training via gradient matching instead of explicit likelihood
Connections to Other Models
- Score-Based Models — Gradients of log-density as alternative energy specification
- Diffusion Models — Connecting energy functions to noise-based generation
- VAEs — Energy-based perspective on ELBO lower bounds
Learning Resources