Normalizing Flows — Deep Dive

01

Change of Variables

Normalizing flows rest on the change-of-variables formula: if a random variable x has known density p(x) and we apply an invertible, differentiable transformation z = f(x), the density q(z) of the transformed variable is fully determined by p and the Jacobian determinant of f.

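The key formula is q(z) = p(f^{-1}(z)) · |det J_{f^{-1}}(z)|, where J_{f^{-1}} is the Jacobian matrix of the inverse transformation. Equivalently, q(z) = p(x) · |det J_f(x)|^{-1} with x = f^{-1}(z). The determinant measures how much f locally stretches or compresses volume, and the density is rescaled to compensate. This is the foundation of all normalizing flow models.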
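The formula can be checked numerically in one dimension. The sketch below assumes a standard normal base density and the illustrative transformation f(x) = exp(x); both choices, and all function names, are assumptions made for the example rather than anything prescribed by the formula.

```python
import math

def base_density(x):
    """Density p(x) of a standard normal base variable."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def f_inverse(z):
    """Inverse of the assumed transformation f(x) = exp(x); defined for z > 0."""
    return math.log(z)

def abs_det_jacobian_inverse(z):
    """|d f^{-1}/dz| = 1/z; in one dimension the Jacobian is a scalar."""
    return abs(1.0 / z)

def transformed_density(z):
    """q(z) = p(f^{-1}(z)) * |det J_{f^{-1}}(z)|."""
    return base_density(f_inverse(z)) * abs_det_jacobian_inverse(z)

# Sanity check: q should integrate to approximately 1 over z > 0.
# Midpoint rule on (0, 50); the tail beyond 50 carries negligible mass.
step = 1e-3
total = sum(transformed_density(step * (k + 0.5)) * step for k in range(50_000))
print(f"q(1.0) = {transformed_density(1.0):.4f}, integral ~ {total:.4f}")
```

Without the 1/z factor the result would not integrate to one, which is exactly the volume correction the Jacobian term provides.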

z = f(x)

Transform

|det J|

Density Scaling

Invertible

Requirement

Tractable

Likelihood

02

Normalizing Flow Idea

A normalizing flow chains multiple invertible transformations: z₀ → z₁ → ... → z_K. Starting from a simple base distribution (e.g., standard Gaussian), each layer applies a transformation f_k with a tractable Jacobian determinant. The final log-density accumulates the log-determinants of all layers: log p_K(z_K) = log p_0(z_0) − Σ_k log |det J_{f_k}(z_{k−1})|. Because every term is differentiable, the layer parameters can be trained by backpropagating through this sum.

The power of flows lies in composability: highly flexible densities can be built by stacking simple layers. Unlike autoregressive models, whose sampling is sequential, coupling-based flows offer efficient sampling and exact likelihood computation at the same time, a combination that most generative models lack.
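A minimal sketch of how log-determinants accumulate along a chain is shown here for scalar layers. The two layer types (a fixed affine map and a tanh residual map) and all names are illustrative assumptions, chosen only because their derivatives are easy to write down.

```python
import math

# Each layer is a pair (forward, log_abs_derivative) acting on a scalar.
def affine_forward(z):
    return 2.0 * z + 1.0

def affine_log_det(z):
    return math.log(2.0)  # the derivative is the constant 2

def tanh_residual_forward(z):
    return z + 0.5 * math.tanh(z)  # strictly increasing, hence invertible

def tanh_residual_log_det(z):
    return math.log(1.0 + 0.5 * (1.0 - math.tanh(z) ** 2))

LAYERS = [
    (affine_forward, affine_log_det),
    (tanh_residual_forward, tanh_residual_log_det),
    (affine_forward, affine_log_det),
]

def log_standard_normal(z):
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def flow_forward(z0):
    """Push z0 through all layers; return z_K and log p_K(z_K)."""
    z = z0
    log_p = log_standard_normal(z0)
    for forward, log_det in LAYERS:
        log_p -= log_det(z)  # log|f_k'| evaluated at the layer input z_{k-1}
        z = forward(z)
    return z, log_p

z_K, log_p_K = flow_forward(0.3)
print(z_K, log_p_K)
```

In higher dimensions the scalar derivative becomes a Jacobian determinant, but the bookkeeping is the same: one running sum alongside the forward pass.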

Base Distribution

Simple, standard Gaussian N(0,I) or uniform distribution.

Invertible Layers

Transformation f with tractable Jacobian determinant.

Composition

Chain multiple layers; Jacobian determinants multiply, so log-determinants add.

Flexibility

Approximate arbitrary densities via sufficient depth.

03

Planar & Radial Flows

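Planar flows apply a simple invertible transformation z' = z + u h(w^T z + b), where u and w are vectors in R^d, b is a scalar, and h is a smooth nonlinearity such as tanh. Each layer expands or contracts space along the direction perpendicular to the hyperplane w^T z + b = 0. The Jacobian is a rank-1 update of the identity, I + u ψ(z)^T with ψ(z) = h'(w^T z + b) w, so the matrix determinant lemma gives det J = 1 + u^T ψ(z) in O(d) time. Each layer adds only a little expressiveness, but planar flows provide a foundation for understanding flow mechanics.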
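The rank-1 structure of a planar layer can be sketched as follows, assuming h = tanh. The parameter values are arbitrary illustrations; for tanh, w·u ≥ −1 is a sufficient condition for invertibility, and the values below satisfy it.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def planar_forward(z, u, w, b):
    """Return z' = z + u * tanh(w.z + b) and log|det J| for one planar layer."""
    a = dot(w, z) + b
    h = math.tanh(a)
    h_prime = 1.0 - h * h
    z_new = [zi + ui * h for zi, ui in zip(z, u)]
    # J = I + u psi^T with psi = h'(a) * w, so det J = 1 + psi.u
    # by the matrix determinant lemma: O(d) work instead of O(d^3).
    det = 1.0 + h_prime * dot(w, u)
    return z_new, math.log(abs(det))

# Illustrative parameters in d = 2; here w.u = 0.1, which is >= -1.
z = [0.5, -1.0]
u = [1.0, 0.5]
w = [0.3, -0.4]
b = 0.1
z_new, log_det = planar_forward(z, u, w, b)
print(z_new, log_det)
```

The same function works unchanged for any dimension d, since it only ever needs two dot products.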

Radial flows center the transformation on a reference point r₀: z' = z + β(r)(z − r₀), where r = ‖z − r₀‖ is the distance from r₀ and β is a learnable radial scaling function. These simple flows have limited expressiveness but illustrate the principle: weakly expressive layers can be stacked, though many layers are needed to model complex distributions.

Planar

Direction-wise

Radial

Distance-wise

O(d) Det

Complexity

Limited

Expressiveness

04

RealNVP Architecture

Real-valued Non-Volume Preserving (RealNVP) flows introduced affine coupling layers: partition the variables into two groups, then scale and shift one group using parameters computed from the other. The Jacobian is triangular, so its determinant is simply the product of the diagonal entries. Successive layers alternate which group is transformed, so every variable is eventually modified while computation stays cheap.

RealNVP enabled image generation at higher resolutions by combining many coupling layers with a multi-scale architecture. Inversion is explicit: subtract the shift and divide by the scale, reusing the same conditioning network without ever inverting it. The log-determinant is just the sum of the log-scales, an O(d) computation per layer. This practical efficiency helped make flows competitive with other deep generative models.
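An affine coupling layer can be sketched in a few lines. The `conditioner` below is a toy stand-in for the neural network that would normally produce the log-scale s and shift t; it and the other names are assumptions for illustration, not the RealNVP reference implementation.

```python
import math

def conditioner(x_a, n_out):
    """Toy stand-in for the network producing log-scales s and shifts t.
    It can be arbitrarily complex because it is never inverted."""
    total = sum(x_a)
    s = [0.5 * math.tanh(total + i) for i in range(n_out)]
    t = [0.1 * total - i for i in range(n_out)]
    return s, t

def coupling_forward(x):
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    s, t = conditioner(x_a, len(x_b))
    y_b = [xb * math.exp(si) + ti for xb, si, ti in zip(x_b, s, t)]
    # Triangular Jacobian: the diagonal holds 1s (for x_a) and exp(s_i) (for x_b),
    # so the log-determinant is a single O(d) sum.
    log_det = sum(s)
    return x_a + y_b, log_det

def coupling_inverse(y):
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    s, t = conditioner(y_a, len(y_b))  # y_a equals x_a, so s and t are recovered exactly
    x_b = [(yb - ti) * math.exp(-si) for yb, si, ti in zip(y_b, s, t)]
    return y_a + x_b

x = [0.2, -1.3, 0.7, 2.0]
y, log_det = coupling_forward(x)
x_rec = coupling_inverse(y)
print(y, log_det, x_rec)
```

Because the first group passes through unchanged, a stack of such layers has to swap or re-mask the groups between layers so that every variable is transformed at some point.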

Coupling Layers

Partition and affine-transform, maintaining triangular Jacobian.

Easy Inversion

Direct formula; no iterative solver required.

O(d) Jacobian

Triangular structure; determinant = product of diagonal.

Practical Scaling

Enables image generation at reasonable resolutions.

05

GLOW & Invertible CNNs

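GLOW (Generative Flow with Invertible 1×1 Convolutions) replaced the fixed channel permutations used between RealNVP's coupling layers with learned invertible 1×1 convolutions. These mix information across channels more flexibly than a fixed permutation. For an h × w feature map with c channels and weight matrix W, the log-determinant is h · w · log|det W|, which costs O(c³) directly and can be reduced to O(c) with an LU parameterization of W. Actnorm (activation normalization) applies a per-channel scale and bias, initialized from the first minibatch so that activations start with zero mean and unit variance, which stabilizes training even with small batch sizes.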
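As a rough sketch, an invertible 1×1 convolution is a single c × c matrix applied to the channel vector of every pixel, so its log-determinant is the per-pixel value log|det W| repeated h · w times. The nested-list image layout and helper names below are assumptions for illustration; a practical implementation would use a tensor library and an LU or similar parameterization.

```python
import math

def det(m):
    """Determinant by Laplace expansion; adequate for the tiny matrices used here."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total

def conv1x1(x, W):
    """Apply W to each pixel's channel vector.
    x is indexed as x[row][col][channel]; W is a c x c matrix."""
    height, width = len(x), len(x[0])
    y = [
        [[sum(W[i][k] * pixel[k] for k in range(len(pixel))) for i in range(len(W))]
         for pixel in row]
        for row in x
    ]
    # Same linear map at every spatial position: h * w copies of log|det W|.
    log_det = height * width * math.log(abs(det(W)))
    return y, log_det

# A 2 x 2 image with 2 channels and an illustrative diagonal weight matrix.
image = [[[1.0, 1.0], [2.0, 0.0]],
         [[0.0, 1.0], [-1.0, 3.0]]]
W = [[2.0, 0.0],
     [0.0, 3.0]]
y, log_det = conv1x1(image, W)
print(y[0][0], log_det)
```

A fixed channel permutation is the special case where W is a permutation matrix, for which |det W| = 1 and the log-determinant vanishes.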

GLOW builds on the multi-scale architecture introduced with RealNVP: after each block of flow steps, a squeeze operation trades spatial resolution for channels, and half of the variables are factored out as latents while the rest continue through further layers. This hierarchy reduces computation and improves generation quality for high-resolution images. GLOW achieved striking visual results on face synthesis and attribute manipulation, showing that likelihood-based flows could generate realistic high-resolution images.

1×1 Conv

Mixing

Actnorm

Stabilization

Multi-scale

Hierarchy

High-res

Images

06

Autoregressive Flows

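Autoregressive flows (MAF, IAF) factorize p(z) = ∏ᵢ p(z_i | z_{<i}), so each variable is transformed conditioned only on the variables that precede it in a fixed ordering.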

MAF excels at density estimation because the likelihood of observed data is computed in a single pass; IAF suits efficient sampling, for example as a posterior approximator in VAEs. The choice reflects the application: an autoregressive ordering imposes structure that can accelerate either likelihood evaluation (MAF) or sampling (IAF), but not both, a fundamental tradeoff in autoregressive decompositions.
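The asymmetry between the two directions can be made concrete with a small MAF-style sketch. The `conditioner` is a toy stand-in for a masked network, and the Gaussian base distribution and all names are assumptions made for the example.

```python
import math

def conditioner(prefix):
    """Toy stand-in for a masked network: returns (mu_i, alpha_i) from x_{<i}."""
    total = sum(prefix)
    return 0.5 * total, 0.3 * math.tanh(total)

def maf_to_latent(x):
    """Density direction. Every u_i depends only on the already known x, so the
    d conditioner calls are independent (one parallel pass in a real MAF)."""
    u, log_det = [], 0.0
    for i in range(len(x)):
        mu, alpha = conditioner(x[:i])
        u.append((x[i] - mu) * math.exp(-alpha))
        log_det -= alpha
    return u, log_det

def maf_sample(u):
    """Sampling direction. x_i needs x_{<i}, so the d steps are inherently sequential."""
    x = []
    for ui in u:
        mu, alpha = conditioner(x)
        x.append(ui * math.exp(alpha) + mu)
    return x

def maf_log_prob(x):
    """log p(x) = log N(u; 0, I) + log|det du/dx|."""
    u, log_det = maf_to_latent(x)
    log_base = sum(-0.5 * ui * ui - 0.5 * math.log(2.0 * math.pi) for ui in u)
    return log_base + log_det

noise = [0.1, -0.7, 1.2]
x = maf_sample(noise)
recovered, _ = maf_to_latent(x)
print(x, recovered, maf_log_prob(x))
```

IAF uses the same equations but conditions on the noise prefix u_{<i} instead of x_{<i}, which makes sampling the parallel direction and the density of externally supplied data the sequential one.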

MAF Strengths

Likelihood of data in one parallel pass
Excellent density estimation
Data-to-latent direction parallel across dimensions
No bottleneck during maximum-likelihood training

MAF Limitations

O(d) forward sampling cost
Sequential sampling required
Slower for latent sampling

07

Continuous Flows

Neural ODE-based flows such as FFJORD replace discrete layers with a continuous-time differential equation: dz/dt = f_θ(z(t), t). The log-density evolves as d(log p)/dt = −tr(∂f_θ/∂z), so the log-determinant of a discrete layer is replaced by the time integral of a Jacobian trace. FFJORD estimates this trace stochastically with Hutchinson's estimator, which makes the approach scalable to high dimensions.
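The coupled evolution of state and log-density can be sketched with a fixed-step Euler integrator in one dimension. This is a deliberate simplification: FFJORD uses adaptive ODE solvers and a learned f_θ, whereas the dynamics, step count, and function names here are illustrative assumptions.

```python
import math

def integrate_cnf(z0, log_p0, dynamics, divergence, t1=1.0, steps=1000):
    """Euler-integrate dz/dt = f(z, t) together with d(log p)/dt = -tr(df/dz)."""
    dt = t1 / steps
    z, log_p, t = z0, log_p0, 0.0
    for _ in range(steps):
        log_p -= divergence(z, t) * dt  # uses the state before z is updated
        z += dynamics(z, t) * dt
        t += dt
    return z, log_p

# Illustrative 1-D dynamics f(z, t) = tanh(z); in one dimension the Jacobian
# trace reduces to the scalar derivative 1 - tanh(z)^2.
def dynamics(z, t):
    return math.tanh(z)

def divergence(z, t):
    return 1.0 - math.tanh(z) ** 2

z1, log_p1 = integrate_cnf(0.3, 0.0, dynamics, divergence)
print(z1, log_p1)
```

For linear dynamics f(z) = a·z the exact answer is known in closed form (z grows by a factor e^{a·t} and the log-density drops by a·t), which gives a convenient check on any integrator.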

Continuous flows offer theoretical elegance and empirical flexibility: f_θ can be an unrestricted neural network, with no need for specially structured invertible layers. However, the cost of numerical integration can exceed that of discrete flows; hybrid approaches (e.g., alternating continuous layers with discrete coupling) balance expressiveness and efficiency. Continuous normalizing flows also blur the boundary between flows and energy-based models, suggesting deeper connections in generative modeling.

Trace Estimation

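Computing tr(∂f/∂z) exactly takes d backward passes, one per diagonal entry, which is expensive in high dimensions. Hutchinson's estimator instead draws random Rademacher or Gaussian vectors v with E[v v^T] = I and uses the identity tr(J) = E_v[v^T · ∂f/∂z · v]. Each sample needs only one vector-Jacobian product, reducing the cost from O(d²) to O(d) at the price of added variance.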
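A minimal sketch of the estimator with Rademacher vectors follows. The `matvec` callable is a hypothetical interface standing in for an automatic-differentiation product with the Jacobian; the explicit 3 × 3 matrix exists only so that the estimate can be compared with the exact trace.

```python
import random

def hutchinson_trace(matvec, dim, num_samples=1000, seed=0):
    """Estimate tr(J) as the average of v^T (J v) over Rademacher vectors v.
    `matvec` (hypothetical name) returns J v without ever forming J."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        jv = matvec(v)
        total += sum(vi * jvi for vi, jvi in zip(v, jv))
    return total / num_samples

# Small explicit matrix used only to check the estimator against the exact trace.
J = [[1.0, 2.0, 0.0],
     [-1.0, 3.0, 0.5],
     [0.0, 0.5, -2.0]]

def matvec(v):
    return [sum(row[k] * v[k] for k in range(len(v))) for row in J]

estimate = hutchinson_trace(matvec, dim=3, num_samples=20000)
exact = sum(J[i][i] for i in range(3))
print(f"estimate = {estimate:.3f}, exact = {exact:.3f}")
```

The estimate is unbiased, and its variance comes entirely from the off-diagonal entries of J; for a diagonal matrix a single Rademacher sample already returns the exact trace.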

08

Flow Applications

Flows excel in applications requiring exact likelihood: maximum likelihood training on observational data, model selection via marginal likelihood, and variational inference as powerful posterior approximators. Hybrid models (flow + VAE) combine flows as flexible posteriors with VAE training. Density ratios estimated via flows enable likelihood-free inference and simulation-based calibration in scientific domains.

Limitations include computational cost at scale (discrete flows require many layers; continuous flows must integrate ODEs) and architectural constraints for high-dimensional data. Despite the dominance of GANs and diffusion models in image generation, flows remain valuable for tabular data, small-scale image generation, and any application that prioritizes exact, differentiable likelihood over sample quality. Recent work on diffusion-flow hybrids and score-based flows signals renewed integration of flow principles into modern generative modeling.

Likelihood Estimation, Posterior Approximation, Density Ratio, Latent Sampling, VAE Hybrid, Sim-based Inference

09

References & Further Reading

Normalizing flows provide exact likelihood computation through invertible transformations. This section compiles key papers and resources for understanding the change-of-variables formula, flow architectures, and applications in generative modeling and probabilistic inference.

From classical theory to modern architectures like RealNVP and GLOW, these materials document the evolution of flow-based generative models.

Foundational Papers

Flow Architectures

Key Concepts

Learning Resources