Normalizing flows fundamentally rest on change of variables: if we have a random variable x with known density p(x), and apply an invertible transformation z = f(x), the density q(z) of the transformed variable follows a deterministic relationship through the Jacobian determinant of f.
The key formula is: q(z) = p(f^{-1}(z)) · |det J_{f^{-1}}(z)|, where J is the Jacobian matrix of the inverse transformation. This elegant principle transforms probability density through determinants—the foundation of all normalizing flow models.
z = f(x)
Transform
|det J|
Density Scaling
Invertible
Requirement
Tractable
Likelihood
02
Normalizing Flow Idea
A normalizing flow chains multiple invertible transformations: z₀ → z₁ → ... → z_K. Starting from a simple base distribution (e.g., standard Gaussian), each layer applies a transformation with a tractable Jacobian determinant. The final log-density log p_K(z_K) is the base log-density minus the sum of the per-layer log-determinants; backpropagation through this sum provides the gradients for training.
The power of flows lies in composability: highly flexible densities can be built from shallow primitives. Coupling-based flows offer efficient sampling and exact likelihood computation at the same time, a combination that autoregressive models (which sample sequentially) and most other generative models lack.
Base Distribution
Simple, standard Gaussian N(0,I) or uniform distribution.
Invertible Layers
Transformation f with tractable Jacobian determinant.
Composition
Chain multiple layers; determinants multiply (add in log space).
Flexibility
Approximate arbitrary densities via sufficient depth.
03
Planar & Radial Flows
Planar flows apply simple invertible transformations: z' = z + u h(w^T z + b), where u and w are vectors, b is a scalar, and h is a tanh nonlinearity. The Jacobian is a rank-1 update of the identity, so the matrix determinant lemma gives its determinant in O(d) time. Each layer adds only a little expressiveness but provides a foundation for understanding flow mechanics.
Radial flows center transformations on a reference point: z' = z + β(r)(z - r₀), where r is the distance from r₀ and β is a learnable radial scaling function. These simple flows have limited expressiveness but illustrate the principle: weakly expressive layers can be stacked, though many layers are needed to model complex distributions.
Planar
Direction-wise
Radial
Distance-wise
O(d) Det
Complexity
Limited
Expressiveness
04
RealNVP Architecture
Real-valued Non-Volume Preserving (RealNVP) flows introduced affine coupling layers: partition the variables into two groups and apply an affine transformation to one group conditioned on the other. The Jacobian is triangular, making the determinant trivial (the product of the diagonal entries). Successive layers alternate which group is transformed, ensuring all variables are eventually modified while maintaining computational efficiency.
RealNVP enabled high-resolution image generation by combining multiple coupling layers with a multi-scale architecture. The inverse is explicit (subtract the translation, then multiply by the exponential of the negated log-scale), and the log-determinant is a sum of the scale outputs, costing O(d) per layer. This practical efficiency made flows competitive with other deep generative models of the time.
Coupling Layers
Partition and affine-transform, maintaining triangular Jacobian.
Easy Inversion
Direct formula; no iterative solver required.
O(d) Determinant
Triangular structure; determinant = product of diagonal.
Practical Scaling
Enables image generation at reasonable resolutions.
05
GLOW & Invertible CNNs
GLOW (Generative Flow with Invertible 1×1 Convolutions) replaced the fixed channel permutations used between RealNVP's coupling layers with learned invertible 1×1 convolutions. These mix information across channels more flexibly; their log-determinant is H·W·log|det W| for an H×W feature map, which is cheap when W is LU-parameterized. Actnorm (activation normalization) initializes a per-channel scale and bias so that activations on the first minibatch have zero mean and unit variance, stabilizing the training of deep flows.
GLOW retained RealNVP's multi-scale architecture: flows are applied at progressively coarser resolutions, and part of the channels is factored out at each scale. This hierarchical approach improves training stability and generation quality for high-resolution images. GLOW achieved striking visual results on face synthesis and manipulation, demonstrating that flows could compete with GANs for image generation.
MAF excels for density estimation (tractable likelihood on test data); IAF for efficient sampling in VAEs as a posterior approximator. The choice reflects the application: autoregressive orderings impose structure that can either accelerate likelihood (MAF) or sampling (IAF), but not both—a fundamental tradeoff in autoregressive decompositions.
MAF Strengths
Likelihood in a single parallel pass
Excellent density estimation
Parallel data-to-noise direction
No density-evaluation bottleneck
MAF Limitations
O(d) forward sampling cost
Sequential sampling required
Slower for latent sampling
07
Continuous Flows
Neural ODE-based flows (FFJORD) replace discrete layers with continuous differential equations: dz/dt = f_θ(z(t), t). The change in log-density follows d(log p)/dt = -tr(∂f_θ/∂z), so only the trace of the Jacobian is needed rather than its determinant. Stochastic trace estimation with random probe vectors (Hutchinson's method) makes FFJORD scalable to high dimensions.
Continuous flows offer theoretical elegance and empirical flexibility—no need to design discrete architectures. However, integration cost can exceed discrete flows; hybrid approaches (e.g., alternating continuous layers with discrete coupling) balance expressiveness and efficiency. Continuous normalizing flows blur the boundary between flows and energy-based models, suggesting deeper connections in generative modeling.
Trace Estimation
Computing tr(∂f/∂z) exactly requires one differentiation pass per dimension, which is expensive in high dimensions. Hutchinson's estimator uses random Rademacher or Gaussian vectors v: tr(J) = E_v[v^T · ∂f/∂z · v], approximated with a small number of samples, reducing the cost from O(d²) to O(d).
08
Flow Applications
Flows excel in applications requiring exact likelihood: maximum likelihood training on observational data, model selection via marginal likelihood, and variational inference as powerful posterior approximators. Hybrid models (flow + VAE) combine flows as flexible posteriors with VAE training. Density ratios estimated via flows enable likelihood-free inference and simulation-based calibration in scientific domains.
Limitations include computational cost at scale (discrete flows require many layers; continuous flows integrate ODEs) and architectural constraints for high-dimensional data. Despite GAN and diffusion dominance in images, flows remain indispensable for tabular data, small-scale image generation, and any application prioritizing exact, differentiable likelihood over sample quality. Recent work on diffusion flow hybrids and score-based flows signals renewed integration of flow principles into modern generative modeling.
Normalizing flows provide exact likelihood computation through invertible transformations. This section compiles key papers and resources for understanding change-of-variables formula, flow architectures, and applications in generative modeling and probabilistic inference.
From classical theory to modern architectures like RealNVP and GLOW, these materials document the evolution of flow-based generative models.
01
Change of Variables
The mathematical foundation of all normalizing flows rests on a simple principle: if a random variable x has probability density p(x), and we apply an invertible transformation z = f(x), the probability density of z is determined by the Jacobian determinant of f.
Formally: q(z) = p(f^{-1}(z)) |det J_{f^{-1}}(z)|, where J is the Jacobian matrix of the inverse transformation. In log space: log q(z) = log p(f^{-1}(z)) + log |det J_{f^{-1}}(z)|. This relationship enables us to transform any simple base distribution (e.g., Gaussian) into a complex target distribution by learning f and tracking how volume changes.
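The formula can be checked numerically in one dimension. The sketch below assumes a standard normal base density and the illustrative map f(x) = exp(x); both choices, and all function names, are assumptions made for this example rather than anything prescribed above.

```python
import math

def std_normal_logpdf(x):
    """Log of the base density p(x) = N(x; 0, 1)."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def f(x):
    """Illustrative invertible map z = f(x) = exp(x) (an assumption)."""
    return math.exp(x)

def f_inv(z):
    """Inverse map x = f^{-1}(z) = log(z), defined for z > 0."""
    return math.log(z)

def log_abs_det_jac_inv(z):
    """log |d f^{-1} / dz| = log(1/z) = -log(z) for z > 0."""
    return -math.log(z)

def log_q(z):
    """log q(z) = log p(f^{-1}(z)) + log |det J_{f^{-1}}(z)|."""
    return std_normal_logpdf(f_inv(z)) + log_abs_det_jac_inv(z)

def q(z):
    return math.exp(log_q(z))

def total_mass(upper=200.0, n=200_000):
    """Midpoint-rule check that q integrates to roughly 1 on (0, upper]."""
    step = upper / n
    return sum(q((i + 0.5) * step) for i in range(n)) * step

if __name__ == "__main__":
    print("q(1.0)     =", q(1.0))
    print("total mass =", total_mass())
```

Without the Jacobian term the transformed function would no longer integrate to one; with it, the numerical mass stays close to 1.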
Jacobian Determinant Intuition
The Jacobian determinant quantifies local volume scaling. A transformation that stretches a region expands volume (|det J| > 1); compression shrinks it (|det J| < 1). To maintain normalized densities, we must account for this change: if volume expands by factor k, density must decrease by 1/k.
For high-dimensional data, computing determinants naively is O(d³) via LU decomposition. Efficient flows therefore exploit structure: triangular Jacobians (RealNVP), rank-1 updates (planar flows), or trace estimation (continuous flows) to reduce this cost.
Invertibility Requirement
f must be invertible—given any z, we must uniquely recover x = f^{-1}(z). This is stricter than many neural networks; we can't simply apply arbitrary non-linearities. Successful flow designs either construct explicit inverses (coupling layers, 1×1 convolutions with known determinants) or learn invertible parameterizations that are provably reversible.
02
Normalizing Flow Idea
A normalizing flow chains K invertible transformations to progressively warp a simple base distribution into a complex target. We denote the sequence as: z₀ ~ p₀(z₀), z₁ = f₁(z₀), z₂ = f₂(z₁), ..., z_K = f_K(z_{K-1}).
The log-density at the final step compounds all intermediate log-determinants: log p_K(z_K) = log p₀(z₀) - Σₖ log|det J_k|, where J_k = ∂f_k/∂z_{k-1} is the Jacobian of layer k evaluated at its input. Backpropagation computes gradients w.r.t. model parameters, enabling maximum likelihood training directly on observed data.
Composability: The Core Insight
Because each layer's determinant is tractable and invertibility is preserved under composition, we can build highly expressive densities from shallow, simple primitives. This differs fundamentally from VAEs (which provide only a lower bound on the likelihood) and GANs (no tractable likelihood). Flows unite sample generation and density estimation.
The product rule for determinants is crucial: the Jacobian of a composition is the product of the layer Jacobians, and |det(J_a J_b)| = |det J_a| · |det J_b|. Adding layers multiplies determinants, so in log space we simply add log-determinants. This composability enables scaling to deep models.
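A minimal sketch of this bookkeeping, using two hypothetical element-wise layer types (an affine map and z + a·tanh(z)) chosen only because their log-determinants are easy to write down. Each layer returns its output together with its log|det J|, and the flow subtracts the running sum from the base log-density.

```python
import math

def std_normal_logpdf(z):
    """Log-density of N(0, I) for a list of floats."""
    return sum(-0.5 * zi * zi - 0.5 * math.log(2.0 * math.pi) for zi in z)

def affine_layer(scale, shift):
    """Element-wise z -> scale * z + shift; log|det J| = d * log|scale|."""
    def forward(z):
        out = [scale * zi + shift for zi in z]
        return out, len(z) * math.log(abs(scale))
    return forward

def tanh_residual_layer(a):
    """Element-wise z -> z + a * tanh(z); invertible for a > -1,
    because the derivative 1 + a * (1 - tanh(z)^2) stays positive."""
    def forward(z):
        out = [zi + a * math.tanh(zi) for zi in z]
        log_det = sum(math.log(1.0 + a * (1.0 - math.tanh(zi) ** 2)) for zi in z)
        return out, log_det
    return forward

def flow_log_prob(z0, layers):
    """Push z0 through all layers; return (z_K, log p_K(z_K))."""
    z, total_log_det = list(z0), 0.0
    for layer in layers:
        z, log_det = layer(z)
        total_log_det += log_det  # determinants multiply, so logs add
    return z, std_normal_logpdf(z0) - total_log_det

# Hypothetical three-layer flow used for illustration.
LAYERS = [affine_layer(2.0, 1.0), tanh_residual_layer(0.5), affine_layer(0.5, -1.0)]

if __name__ == "__main__":
    z_k, log_p = flow_log_prob([0.3, -1.1], LAYERS)
    print("z_K =", z_k, "log p_K(z_K) =", log_p)
```

In a real model the layers would carry trainable parameters and the same accumulated sum would be differentiated during training.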
Flexible Density Approximation
Universality results exist for several flow families: given sufficient depth and width, and under regularity conditions, they can approximate a broad class of smooth probability densities arbitrarily well. In practice, we learn p_data(x) by minimizing KL divergence: KL(p_data || p_model) = -E_x[log p_model(x)] + const, which is the negative log-likelihood objective up to a constant.
03
Planar & Radial Flows
Planar flows introduce minimal computational overhead while illustrating flow mechanics. Each layer applies: z' = z + u h(w^T z + b), where u, w ∈ ℝ^d are vectors, h is a tanh nonlinearity, and b is a scalar bias. The transformation moves points along the direction u, by an amount that depends on the projection w^T z.
The Jacobian ∂z'/∂z = I + u h'(w^T z + b) w^T is a rank-1 update to the identity. Using the matrix determinant lemma: det(I + uv^T) = 1 + v^T u. Thus, det J = 1 + h'(w^T z + b) w^T u, computable in O(d) time. For h = tanh, the layer is invertible when w^T u ≥ -1. This tractability makes planar flows educational and practical for low-dimensional problems.
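The closed-form determinant can be compared against a brute-force finite-difference Jacobian. The sketch below does this in two dimensions; the parameter values and the helper `numeric_log_det_2d` are made up for this illustration.

```python
import math

def planar_forward(z, u, w, b):
    """Planar layer z' = z + u * tanh(w^T z + b).
    Returns (z', log|det J|) using det J = 1 + h'(w^T z + b) * w^T u."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    h = math.tanh(a)
    h_prime = 1.0 - h * h
    z_new = [zi + ui * h for zi, ui in zip(z, u)]
    det = 1.0 + h_prime * sum(wi * ui for wi, ui in zip(w, u))
    return z_new, math.log(abs(det))

def numeric_log_det_2d(fn, z, eps=1e-6):
    """Finite-difference log|det J| of a 2-D map fn (for checking only)."""
    cols = []
    for j in range(2):
        zp, zm = list(z), list(z)
        zp[j] += eps
        zm[j] -= eps
        fp, fm = fn(zp), fn(zm)
        cols.append([(fp[i] - fm[i]) / (2 * eps) for i in range(2)])
    # cols[j][i] holds d f_i / d z_j
    det = cols[0][0] * cols[1][1] - cols[1][0] * cols[0][1]
    return math.log(abs(det))

if __name__ == "__main__":
    u, w, b = [0.5, -0.3], [1.0, 2.0], 0.1  # w^T u = -0.1 >= -1, so invertible
    z = [0.2, -0.4]
    _, analytic = planar_forward(z, u, w, b)
    numeric = numeric_log_det_2d(lambda v: planar_forward(v, u, w, b)[0], z)
    print("analytic:", analytic, "numeric:", numeric)
```

The analytic route costs two dot products, whereas the numeric check needs a full Jacobian, which is the O(d) versus O(d³) gap in miniature.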
Radial Flows
Radial flows apply distance-based transformations: z' = z + β(r)(z - r₀), where r = ||z - r₀|| and β(r) is a learnable radial scaling function. These contract or expand rings around reference point r₀. Like planar flows, they're simple pedagogically but limited in expressiveness—each layer expands or compresses radially, requiring many layers to model complex distributions.
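As a sketch, the example below assumes one specific scaling function, β(r) = β / (α + r) with α > 0. This form is an assumption for the example, since the text leaves β(r) generic. For that choice the Jacobian has one eigenvalue along z - r₀ and d - 1 equal eigenvalues perpendicular to it, which gives a closed-form determinant that is checked here against finite differences in 2-D.

```python
import math

def radial_forward(z, center, alpha, beta):
    """Radial layer z' = z + beta * h(r) * (z - center), with h(r) = 1 / (alpha + r).
    Returns (z', log|det J|), where
    det J = (1 + beta*h)^(d-1) * (1 + beta*h + beta*h'(r)*r)."""
    diff = [zi - ci for zi, ci in zip(z, center)]
    r = math.sqrt(sum(x * x for x in diff))
    h = 1.0 / (alpha + r)
    h_prime = -1.0 / (alpha + r) ** 2
    z_new = [zi + beta * h * di for zi, di in zip(z, diff)]
    d = len(z)
    log_det = (d - 1) * math.log(abs(1.0 + beta * h)) \
        + math.log(abs(1.0 + beta * h + beta * h_prime * r))
    return z_new, log_det

def numeric_log_det_2d(fn, z, eps=1e-6):
    """Finite-difference log|det J| of a 2-D map fn (for checking only)."""
    cols = []
    for j in range(2):
        zp, zm = list(z), list(z)
        zp[j] += eps
        zm[j] -= eps
        fp, fm = fn(zp), fn(zm)
        cols.append([(fp[i] - fm[i]) / (2 * eps) for i in range(2)])
    det = cols[0][0] * cols[1][1] - cols[1][0] * cols[0][1]
    return math.log(abs(det))

if __name__ == "__main__":
    center, alpha, beta = [0.5, -0.5], 1.0, 0.8  # illustrative values
    z = [1.0, 0.3]
    _, analytic = radial_forward(z, center, alpha, beta)
    numeric = numeric_log_det_2d(lambda v: radial_forward(v, center, alpha, beta)[0], z)
    print("analytic:", analytic, "numeric:", numeric)
```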
Limitations and Stacking
Individual planar and radial flows have limited expressiveness; they can only perform gentle deformations. However, stacking many layers (e.g., 64 planar flows) can approximate considerably more complex distributions. The cost is computational: evaluating a 64-layer stack can be slower than a handful of more expressive layers such as coupling layers. In practice, planar/radial flows serve as pedagogical stepping stones to understanding more sophisticated architectures like RealNVP and GLOW.
04
RealNVP Architecture
RealNVP (Real Valued Non-Volume Preserving) introduced affine coupling layers, enabling practical flow-based image generation. Each layer partitions the d-dimensional vector z into two disjoint sets: z = [z¹, z²]. One set (say z¹) passes through unchanged; the other (z²) is transformed affinely:
z'² = z² ⊙ exp(s(z¹)) + t(z¹), where s and t are neural networks outputting scale and translation, conditioned on z¹. The symbol ⊙ denotes element-wise multiplication. The Jacobian is triangular (z'¹ doesn't depend on z²), so det J = exp(Σᵢ s_i(z¹)), i.e. log|det J| = Σᵢ s_i(z¹), computable in O(d) time without differentiating s or t.
Alternating Masking
To ensure all variables are eventually transformed, consecutive layers alternate which variables are masked (held fixed). Layer k masks variables {1,3,5,...} if k is odd, {2,4,6,...} if k is even. After 2L layers, each variable has been transformed L times, building expressiveness through depth while preserving computational efficiency.
Invertibility and Efficiency
Inversion is trivial: given z', set z¹ = z'¹ directly, then z² = (z'² - t(z¹)) ⊙ exp(-s(z¹)). No iterative solver is required, unlike continuous flows, and s and t never need to be inverted themselves. This explicit invertibility, combined with O(d) Jacobian determinants, made RealNVP one of the first practical flow models for high-dimensional data.
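A minimal sketch of one affine coupling layer follows. The functions `s_net` and `t_net` are toy closed-form stand-ins for the neural networks s and t (hypothetical, chosen only so the example runs without a deep learning library); the forward pass, inverse, and log-determinant follow the coupling structure described above.

```python
import math

def s_net(z1, n_out):
    """Toy stand-in for the scale network s(z1) (hypothetical)."""
    return [math.tanh(0.3 * sum(z1) + 0.1 * i) for i in range(n_out)]

def t_net(z1, n_out):
    """Toy stand-in for the translation network t(z1) (hypothetical)."""
    return [0.5 * sum(z1) - 0.2 * i for i in range(n_out)]

def coupling_forward(z):
    """z = [z1, z2] -> [z1, z2 * exp(s(z1)) + t(z1)]; returns (z', log|det J|)."""
    k = len(z) // 2
    z1, z2 = z[:k], z[k:]
    s, t = s_net(z1, len(z2)), t_net(z1, len(z2))
    out2 = [z2i * math.exp(si) + ti for z2i, si, ti in zip(z2, s, t)]
    return z1 + out2, sum(s)  # log|det J| is the sum of the log-scales

def coupling_inverse(z_out):
    """Exact inverse: z1 is copied, z2 = (z2' - t(z1)) * exp(-s(z1))."""
    k = len(z_out) // 2
    z1, out2 = z_out[:k], z_out[k:]
    s, t = s_net(z1, len(out2)), t_net(z1, len(out2))
    z2 = [(yi - ti) * math.exp(-si) for yi, si, ti in zip(out2, s, t)]
    return z1 + z2

if __name__ == "__main__":
    z = [1.0, -2.0, 0.5, 3.0]
    z_out, log_det = coupling_forward(z)
    print("forward:", z_out, "log|det J| =", log_det)
    print("round trip:", coupling_inverse(z_out))
```

Note that the inverse calls `s_net` and `t_net` in the forward direction only, which is why these networks can be arbitrarily complex.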
Multi-scale Architecture
For images, RealNVP applies flows at multiple resolutions. After each flow block, a squeeze operation reshapes each 2×2 spatial block into channels (halving the spatial resolution and quadrupling the channel count), and half the channels are factored out and modeled directly while the rest pass to the next scale. This hierarchical structure (inspired by image pyramids) improves training stability and sample quality.
05
GLOW & Invertible CNNs
GLOW (Generative Flow with Invertible 1×1 Convolutions) advanced flows by replacing the fixed channel permutations between RealNVP's coupling layers with learned 1×1 convolutions. A 1×1 convolution is a linear transformation W applied independently at each spatial location: z'(h,w) = W z(h,w). If W is invertible, so is the full operation. Its Jacobian determinant equals det(W)^(H·W) for an H×W feature map, so the log-determinant is H·W·log|det W|.
The determinant of a c×c matrix W costs O(c³) to compute directly; an LU parameterization reduces this to O(c) and keeps W invertible as long as the diagonal entries are nonzero. This learned mixing of channels improves information flow compared to RealNVP's fixed patterns, enabling more efficient expressiveness.
Actnorm (Activation Normalization)
Actnorm applies a per-channel scale and bias, initialized from data so that activations have zero mean and unit variance on an initial minibatch. After initialization, the scale and bias are trained as ordinary parameters, so the layer is a plain affine transformation that does not depend on batch statistics. This stabilizes training dynamics, which is crucial for deep flows. The combination of 1×1 convolutions and actnorm improved stability and sample quality on large image datasets.
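The data-dependent initialization can be sketched as follows. The example treats each sample as a flat list of per-channel activations; the function names, the `eps` constant, and the toy batch are assumptions for illustration. For an H×W feature map, the per-channel log-scale contributes once per spatial location to the log-determinant.

```python
import math

def actnorm_init(batch, eps=1e-6):
    """Compute per-channel (scale, bias) so the given batch is normalized.
    batch: list of samples, each a list of C channel activations."""
    n, c = len(batch), len(batch[0])
    mean = [sum(x[j] for x in batch) / n for j in range(c)]
    var = [sum((x[j] - mean[j]) ** 2 for x in batch) / n for j in range(c)]
    scale = [1.0 / (math.sqrt(v) + eps) for v in var]
    bias = [-m * s for m, s in zip(mean, scale)]
    return scale, bias

def actnorm_forward(x, scale, bias, height=1, width=1):
    """y = scale * x + bias per channel; returns (y, log|det J|)."""
    y = [s * xi + b for xi, s, b in zip(x, scale, bias)]
    log_det = height * width * sum(math.log(abs(s)) for s in scale)
    return y, log_det

def actnorm_inverse(y, scale, bias):
    return [(yi - b) / s for yi, s, b in zip(y, scale, bias)]

if __name__ == "__main__":
    batch = [[1.0, 10.0], [2.0, 14.0], [3.0, 18.0], [6.0, 22.0]]
    scale, bias = actnorm_init(batch)
    normalized = [actnorm_forward(x, scale, bias)[0] for x in batch]
    print("scale:", scale, "bias:", bias)
    print("normalized batch:", normalized)
```

After this one-time initialization, `scale` and `bias` would simply be updated by the optimizer like any other parameters.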
Multi-scale Hierarchy
GLOW applies flows at multiple scales, following RealNVP: at each level, a squeeze operation trades each 2×2 spatial block for 4× as many channels (leaving 4× fewer spatial locations), K flow steps are applied, and the channels are split. Half the channels are factored out as latent variables; the rest continue through the next, coarser level. This multi-scale design yields high-quality generation on images up to 256×256.
Results and Impact
GLOW achieved striking visual results on face generation and interpolation, competing with GANs in visual quality while maintaining exact likelihood computation. It demonstrated that flows could scale to complex, high-resolution visual data—a turning point in establishing flows as serious contenders in generative modeling.
Invertible 1×1 Convolution
A 1×1 conv applies linear transformation W independently at each spatial location. For invertibility, W must be non-singular. GLOW parameterizes W via LU decomposition: W = P L (U + diag(d)), where P is a fixed permutation matrix, L is lower-triangular with unit diagonal, U is strictly upper-triangular, and d is a vector of diagonal entries. Then log|det W| = Σᵢ log|d_i|, which is cheap to compute, and W is invertible whenever every d_i is nonzero.
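To make the LU shortcut concrete, the sketch below builds a 3-channel W from hypothetical P, L, U, and d values and confirms that the sum of log|d_i| matches the log-determinant computed directly from W. The matrices and helper names are illustrative only.

```python
import math

def matmul(a, b):
    """Plain matrix product for small lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def det3(m):
    """Direct 3x3 determinant, used only to check the LU shortcut."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Hypothetical parameters for a 3-channel 1x1 convolution.
P = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]          # fixed permutation
L = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [-0.3, 0.2, 1.0]]         # unit lower-triangular
U_STRICT = [[0.0, 0.4, -0.1], [0.0, 0.0, 0.7], [0.0, 0.0, 0.0]]  # strictly upper
D = [1.5, -0.8, 2.0]                                             # nonzero diagonal

def build_w(p, l, u_strict, d):
    """W = P L (U + diag(d))."""
    n = len(d)
    upper = [[u_strict[i][j] + (d[i] if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    return matmul(p, matmul(l, upper))

def conv1x1_log_det(d, height, width):
    """Log-determinant of the full layer: H * W * sum_i log|d_i|."""
    return height * width * sum(math.log(abs(di)) for di in d)

def conv1x1_apply(w, pixel):
    """Apply W to the channel vector at one spatial location."""
    return [sum(w[i][j] * pixel[j] for j in range(len(pixel))) for i in range(len(w))]

if __name__ == "__main__":
    W = build_w(P, L, U_STRICT, D)
    print("log|det W| direct:", math.log(abs(det3(W))))
    print("log|det W| via d :", conv1x1_log_det(D, 1, 1))
    print("log-det for a 4x4 map:", conv1x1_log_det(D, 4, 4))
```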
In a masked autoregressive flow (MAF), the Jacobian is lower-triangular (z'_i depends only on z_{≤i}).
Inverse Autoregressive Flow (IAF)
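IAF inverts the dependency structure: z'_i = f_i(z_i; θ_i(z_{<i})), where the parameters θ_i are conditioned on the preceding input (noise) variables rather than on previously generated outputs. The Jacobian is still triangular, but sampling now takes a single parallel pass, since all θ_i can be computed at once given z. Evaluating the likelihood of an arbitrary data point requires inverting the transformation one dimension at a time (d sequential steps), trading fast sampling for slow density evaluation.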
Application Tradeoff
MAF excels for density estimation (likelihood-free inference, anomaly detection) where we evaluate densities on test data frequently. IAF suits VAEs and other generative models where sampling is primary and likelihood evaluation rare. This asymmetry reflects a fundamental tension: autoregressive orderings impose structure that accelerates one direction of computation at the expense of the other.
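The asymmetry can be seen in a small affine autoregressive sketch. The `conditioner` below is a toy closed-form stand-in for a masked network (hypothetical). In the MAF direction every dimension's parameters depend only on the observed x, so each loop iteration is independent of the others; sampling must generate x one dimension at a time.

```python
import math

def conditioner(prefix):
    """Toy stand-in for a masked network: returns (mu_i, alpha_i) from x_{<i}."""
    s = sum(prefix)
    return 0.5 * s, math.tanh(0.1 * s)

def maf_to_noise(x):
    """Density direction x -> z. Each z_i needs only the observed x_{<i},
    so all iterations are independent and could run in parallel."""
    z, log_det = [], 0.0
    for i in range(len(x)):
        mu, alpha = conditioner(x[:i])
        z.append((x[i] - mu) * math.exp(-alpha))
        log_det -= alpha  # log|det dz/dx| = -sum_i alpha_i
    return z, log_det

def maf_log_prob(x):
    """log p(x) = log N(z; 0, I) + log|det dz/dx|."""
    z, log_det = maf_to_noise(x)
    base = sum(-0.5 * zi * zi - 0.5 * math.log(2.0 * math.pi) for zi in z)
    return base + log_det

def maf_sample(z):
    """Sampling direction z -> x. x_i needs the already generated x_{<i},
    so this loop is inherently sequential (d steps)."""
    x = []
    for i in range(len(z)):
        mu, alpha = conditioner(x[:i])
        x.append(z[i] * math.exp(alpha) + mu)
    return x

if __name__ == "__main__":
    noise = [0.3, -1.2, 0.8]
    x = maf_sample(noise)
    print("sample:", x)
    print("recovered noise:", maf_to_noise(x)[0])
    print("log p(x):", maf_log_prob(x))
```

An IAF uses the same pair of computations with the roles swapped: the parallel map becomes the sampler and the sequential loop becomes the density evaluation.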
07
Continuous Flows
Neural ODEs enable continuous normalizing flows by parameterizing densities via differential equations. Instead of discrete layers z_k = f_k(z_{k-1}), we model a continuous trajectory z(t) satisfying: dz/dt = f_θ(z(t), t). The density evolves via the instantaneous change of variables formula:
d(log p(z(t)))/dt = -tr(∂f_θ/∂z). Integrating from t=0 (base distribution) to t=T (target): log p(z(T)) = log p(z(0)) - ∫₀^T tr(∂f/∂z(t)) dt. This eliminates explicit Jacobian computation—we only need the trace of the Jacobian, not the Jacobian itself.
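As a one-dimensional sketch, the code below integrates the state and the accumulated log-density change together with a fixed-step RK4 solver. The vector field f(z, t) = tanh(z) is a hypothetical stand-in for f_θ, and a fixed-step solver is used here only for simplicity. In 1-D the trace is just df/dz, so the result can be compared with the log-derivative of the flow map.

```python
import math

def field(z, t):
    """Hypothetical dynamics f_theta(z, t) = tanh(z)."""
    return math.tanh(z)

def trace_jac(z, t):
    """In 1-D the Jacobian trace is df/dz = 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

def integrate(z0, t_end=1.0, steps=200):
    """RK4 on the augmented state (z, delta_logp), where
    d(delta_logp)/dt = -tr(df/dz). Returns (z(T), log p(z(T)) - log p(z(0)))."""
    z, delta_logp, t = z0, 0.0, 0.0
    h = t_end / steps
    for _ in range(steps):
        k1z, k1l = field(z, t), -trace_jac(z, t)
        z2 = z + 0.5 * h * k1z
        k2z, k2l = field(z2, t + 0.5 * h), -trace_jac(z2, t + 0.5 * h)
        z3 = z + 0.5 * h * k2z
        k3z, k3l = field(z3, t + 0.5 * h), -trace_jac(z3, t + 0.5 * h)
        z4 = z + h * k3z
        k4z, k4l = field(z4, t + h), -trace_jac(z4, t + h)
        z += h * (k1z + 2.0 * k2z + 2.0 * k3z + k4z) / 6.0
        delta_logp += h * (k1l + 2.0 * k2l + 2.0 * k3l + k4l) / 6.0
        t += h
    return z, delta_logp

if __name__ == "__main__":
    z_t, delta = integrate(0.4)
    print("z(T) =", z_t, "change in log-density =", delta)
```

The log-density change is negative here because this field pushes points apart, so density is diluted along the trajectory.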
Trace Estimation
Computing tr(J) = Σᵢ ∂f_i/∂z_i exactly requires d differentiation passes (one per dimension). Hutchinson's trace estimator uses random probe vectors with zero mean and identity covariance, e.g. v ~ N(0,I): tr(J) = E_v[v^T (∂f/∂z) v]. A single sample needs only one vector-Jacobian product, at roughly the cost of one backward pass. This is crucial for scalability to high dimensions.
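A small sketch of the estimator follows, using Rademacher probes and an explicit 3×3 matrix in place of a network Jacobian. The matrix and the `matvec` callback are assumptions for this illustration; in a real model the product would come from automatic differentiation and the Jacobian would never be formed.

```python
import random

def exact_trace(a):
    return sum(a[i][i] for i in range(len(a)))

def hutchinson_trace(matvec, dim, n_probes=1000, rng=None):
    """Estimate tr(J) as the average of v^T (J v) over random probes v.
    matvec(v) must return J v; entries of v are +1 or -1 (Rademacher)."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        jv = matvec(v)
        total += sum(vi * ji for vi, ji in zip(v, jv))
    return total / n_probes

# Illustrative "Jacobian" with trace 2.0 - 1.0 + 3.0 = 4.0.
A = [[2.0, 0.5, -1.0],
     [0.3, -1.0, 0.2],
     [0.7, 0.1, 3.0]]

def a_matvec(v):
    return [sum(A[i][j] * v[j] for j in range(3)) for i in range(3)]

if __name__ == "__main__":
    print("exact trace:", exact_trace(A))
    print("estimate   :", hutchinson_trace(a_matvec, 3, n_probes=20000))
```

Each probe costs one matrix-vector product regardless of dimension; the price is variance, which comes from the off-diagonal entries and shrinks as more probes are averaged.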
FFJORD and Variants
FFJORD (Free-form Jacobian of Reversible Dynamics) applies Neural ODE flows to generative modeling. The trajectory is integrated using an adaptive ODE solver (e.g., a Runge-Kutta 4(5) method), and gradients are obtained by differentiating through the solve, typically with the adjoint method. While theoretically elegant, ODE integration has computational overhead (often tens to hundreds of evaluations of f per solve), making FFJORD slower than discrete RealNVP or GLOW for many applications.
Hybrid Approaches
Recent work combines continuous and discrete elements: alternating Neural ODE flows with discrete coupling layers, or using score-based diffusion (continuous score matching) alongside flows. These hybrids balance expressiveness (continuous dynamics) with computational efficiency (discrete shortcuts), suggesting the future of generative modeling may integrate these paradigms.
Instantaneous Change of Variables
The trace formula d(log p)/dt = -tr(∂f/∂z) comes from differentiating the change of variables formula w.r.t. time. It's a remarkable identity: we don't need the Jacobian's eigenvectors, only its eigenvalue sum (trace), reducing a dense computation to a tractable one.
08
Flow Applications
Normalizing flows shine in applications requiring exact, differentiable log-likelihoods. In density estimation, flows enable maximum likelihood training on observational data without variational approximations. For model selection, exact held-out log-likelihoods (rather than ELBO bounds) can guide architecture and hyperparameter choices.
In variational inference, flows serve as flexible posterior approximators. A VAE with a flow-based posterior (e.g., a Gaussian encoder followed by IAF or RealNVP layers) can yield tighter variational bounds than a plain Gaussian posterior. The log-density of the approximate posterior is computed exactly for each sample, enabling unbiased ELBO gradient estimates w.r.t. the flow parameters.
Density Ratio Estimation
Flows can estimate a density ratio p(x)/q(x) by fitting one flow to samples from each distribution and dividing the two exact densities. Applications include simulation-based calibration (comparing simulated and real data in scientific inference), adversarial robustness evaluation (likelihood ratio for out-of-distribution detection), and importance weighting in Monte Carlo.
Limitations at Scale
Despite theoretical appeal, flows face practical limitations. Discrete flows (RealNVP, GLOW) require many layers to achieve complex densities, increasing computational cost. Continuous flows (Neural ODEs) require expensive ODE integration. Both struggle with very high-dimensional data like ImageNet-scale images; GANs and diffusion models currently dominate that regime due to better empirical scaling.
Current Trajectory
Recent work integrates flows into broader generative ecosystems: score-based diffusion models leverage flow intuitions (vector field learning); flow matching combines continuous flows with optimal transport; hybrid models alternate flows and diffusion. This cross-pollination suggests normalizing flows remain foundational to deep generative modeling's theory and practice, even as application niches shift.
Exact Likelihood
Compute log p(x) exactly (up to floating-point precision), enabling unbiased MLE and model comparison via exact log-likelihoods.
Fast Sampling
Coupling-based flows sample x ~ p(x) in a single forward pass; no iterative MCMC or rejection sampling needed.
Invertible Transform
Encode x → z and decode z → x losslessly, enabling latent space interpolation and arithmetic.
End-to-End Gradient Flow
Gradients flow through all transformations; integrate flows into larger probabilistic models.