STANFORD XCS236 · DEEP GENERATIVE MODELS
Normalizing Flows
Week 3–4 · Invertible Transforms & Exact Likelihood
01

Change of Variables

Normalizing flows rest on the change-of-variables formula: if a random variable x has a known density p(x) and we apply an invertible, differentiable transformation z = f(x), the density q(z) of the transformed variable is determined by p and the Jacobian determinant of f.

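The key formula is q(z) = p(f^{-1}(z)) · |det J_{f^{-1}}(z)|, where J_{f^{-1}} is the Jacobian matrix of the inverse transformation. Equivalently, q(z) = p(x) / |det J_f(x)| at x = f^{-1}(z): the determinant measures how much f locally stretches or compresses volume, and the density is rescaled to compensate. This rule is the foundation of all normalizing flow models.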
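As a minimal sketch of the formula in one dimension, the snippet below pushes a standard normal x through z = exp(x). The choice of f and all function names are illustrative assumptions, not part of the course material. In 1-D the Jacobian is a single derivative, so |det J_{f^{-1}}(z)| = 1/z.

```python
import math

def standard_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def f(x):
    """Illustrative invertible map z = f(x) = exp(x)."""
    return math.exp(x)

def f_inv(z):
    return math.log(z)

def abs_det_jac_f_inv(z):
    """|d f^{-1} / dz| = 1 / z for z > 0 (a 1x1 Jacobian)."""
    return 1.0 / z

def q(z):
    """Density of z = f(x): q(z) = p(f^{-1}(z)) * |det J_{f^{-1}}(z)|."""
    return standard_normal_pdf(f_inv(z)) * abs_det_jac_f_inv(z)

def integrate(fn, lo, hi, n=200_000):
    """Midpoint rule, used only to check that q is a valid density."""
    h = (hi - lo) / n
    return sum(fn(lo + (i + 0.5) * h) for i in range(n)) * h

z = f(0.5)
density_at_z = q(z)                       # equals p(0.5) / exp(0.5)
total_mass = integrate(q, 1e-9, 200.0)    # should be close to 1
print(density_at_z, total_mass)
```

The numerical integral is only a sanity check that the transformed density still integrates to one; the formula itself never requires it.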

Transform: z = f(x)
Density Scaling: |det J|
Requirement: Invertible
Likelihood: Tractable
02

Normalizing Flow Idea

A normalizing flow chains multiple invertible transformations: z₀ → z₁ → ... → z_K, with z_k = f_k(z_{k-1}). Starting from a simple base distribution (e.g., standard Gaussian), each layer applies a transformation with a tractable Jacobian determinant. The final log-density accumulates the log-determinants of all layers: log p_K(z_K) = log p₀(z₀) − Σ_k log |det ∂f_k/∂z_{k-1}|.

The power of flows lies in composability: flexible densities can be built by stacking simple layers. Flows give exact likelihoods, and coupling-based flows additionally sample in a single pass, whereas autoregressive models must generate one dimension at a time. Few other families of generative models offer both efficient sampling and exact likelihood.
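The way log-determinants add up across a chain can be sketched as follows. The two layer types here (an elementwise affine map and an elementwise sinh) are hypothetical examples chosen because their Jacobians are diagonal; real flows use richer layers, but the bookkeeping is the same.

```python
import math

def log_standard_normal(z):
    """Log-density of N(0, I) at the point z (a list of floats)."""
    return sum(-0.5 * zi * zi - 0.5 * math.log(2.0 * math.pi) for zi in z)

def make_affine(scale, shift):
    """Elementwise layer z -> scale * z + shift; log|det J| = sum(log|scale_i|)."""
    def layer(z):
        out = [s * zi + b for s, zi, b in zip(scale, z, shift)]
        log_det = sum(math.log(abs(s)) for s in scale)
        return out, log_det
    return layer

def make_sinh():
    """Elementwise layer z -> sinh(z); the Jacobian is diag(cosh(z_i))."""
    def layer(z):
        out = [math.sinh(zi) for zi in z]
        log_det = sum(math.log(math.cosh(zi)) for zi in z)
        return out, log_det
    return layer

def flow_log_density(z0, layers):
    """Push z0 through the chain and return (z_K, log p_K(z_K))."""
    z, log_p = z0, log_standard_normal(z0)
    for layer in layers:
        z, log_det = layer(z)
        log_p -= log_det          # each layer subtracts its log|det J|
    return z, log_p

layers = [
    make_affine([2.0, 0.5], [1.0, -1.0]),
    make_sinh(),
    make_affine([3.0, 3.0], [0.0, 0.0]),
]
z_K, log_p_K = flow_log_density([0.3, -1.2], layers)
print(z_K, log_p_K)
```

For a chain of purely affine layers the result can be checked by hand, since the output is again Gaussian.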

Base Distribution

Simple, standard Gaussian N(0,I) or uniform distribution.

Invertible Layers

Transformation f with tractable Jacobian determinant.

Composition

Chain multiple layers; determinants multiply (add in log space).

Flexibility

Approximate arbitrary densities via sufficient depth.

03

Planar & Radial Flows

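Planar flows apply a simple invertible transformation to z ∈ R^d: z' = z + u h(w^T z + b), where u and w are vectors, b is a scalar, and h is a smooth nonlinearity such as tanh. The Jacobian is the identity plus a rank-1 update, I + u ψ(z)^T with ψ(z) = h'(w^T z + b) w, so the matrix determinant lemma gives det J = 1 + u^T ψ(z) in O(d) time. For h = tanh, the map is invertible when w^T u ≥ −1. Each layer only bends the density along the single direction w, so it adds little expressiveness, but it provides a foundation for understanding flow mechanics.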
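A minimal sketch of one planar layer with h = tanh, using the matrix determinant lemma for the log-determinant. The parameter values are arbitrary illustrations, and the finite-difference helper exists only to check the closed form in two dimensions.

```python
import math

def planar_forward(z, u, w, b):
    """Planar layer z' = z + u * tanh(w^T z + b), with its log|det J|."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    h = math.tanh(a)
    h_prime = 1.0 - h * h                       # derivative of tanh
    z_new = [zi + ui * h for zi, ui in zip(z, u)]
    # Matrix determinant lemma: det(I + u psi^T) = 1 + psi^T u, psi = h'(a) w.
    det = 1.0 + h_prime * sum(ui * wi for ui, wi in zip(u, w))
    return z_new, math.log(abs(det))

def numeric_det_2d(fn, z, eps=1e-6):
    """Central-difference Jacobian determinant of a 2-D map (check only)."""
    cols = []
    for j in range(2):
        zp, zm = list(z), list(z)
        zp[j] += eps
        zm[j] -= eps
        fp, fm = fn(zp), fn(zm)
        cols.append([(fp[i] - fm[i]) / (2.0 * eps) for i in range(2)])
    # cols[j][i] holds d f_i / d z_j
    return cols[0][0] * cols[1][1] - cols[1][0] * cols[0][1]

u, w, b = [0.8, -0.3], [1.0, 0.5], 0.1          # w^T u = 0.65 >= -1
z = [0.4, -0.7]
z_new, log_det = planar_forward(z, u, w, b)
numeric = numeric_det_2d(lambda p: planar_forward(p, u, w, b)[0], z)
print(math.exp(log_det), numeric)
```

The closed form needs one dot product per layer, while the numerical check needs a full Jacobian, which is the cost the lemma avoids.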

Radial flows center the transformation on a reference point z₀: z' = z + β(r)(z − z₀), where r = ‖z − z₀‖ is the distance from z₀ and β(r) is a learnable radial scaling function (commonly β · h(α, r) with h(α, r) = 1/(α + r) and learnable scalars α > 0 and β). The layer contracts or expands the density around z₀. Planar and radial flows have limited expressiveness individually, but they illustrate the principle: simple layers can be stacked, although complex distributions may need many of them.
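A minimal sketch of a radial layer, assuming the common parameterization z' = z + β h(α, r)(z − z₀) with h(α, r) = 1/(α + r); this specific form of the scaling function is an assumption layered on the description above. Under it, the determinant is (1 + βh)^(d−1) · (1 + βh + βh'r), which the snippet checks numerically in two dimensions.

```python
import math

def radial_forward(z, z0, alpha, beta):
    """Radial layer around z0; returns (z', log|det J|)."""
    diff = [zi - ci for zi, ci in zip(z, z0)]
    r = math.sqrt(sum(di * di for di in diff))
    h = 1.0 / (alpha + r)
    h_prime = -1.0 / (alpha + r) ** 2           # dh/dr
    z_new = [zi + beta * h * di for zi, di in zip(z, diff)]
    d = len(z)
    det = (1.0 + beta * h) ** (d - 1) * (1.0 + beta * h + beta * h_prime * r)
    return z_new, math.log(abs(det))

def numeric_det_2d(fn, z, eps=1e-6):
    """Central-difference Jacobian determinant of a 2-D map (check only)."""
    cols = []
    for j in range(2):
        zp, zm = list(z), list(z)
        zp[j] += eps
        zm[j] -= eps
        fp, fm = fn(zp), fn(zm)
        cols.append([(fp[i] - fm[i]) / (2.0 * eps) for i in range(2)])
    return cols[0][0] * cols[1][1] - cols[1][0] * cols[0][1]

z0, alpha, beta = [0.5, -0.2], 1.0, 0.7         # illustrative parameters
z = [1.1, 0.6]                                  # distance 1 from z0
z_new, log_det = radial_forward(z, z0, alpha, beta)
numeric = numeric_det_2d(lambda p: radial_forward(p, z0, alpha, beta)[0], z)
print(math.exp(log_det), numeric)
```

With these values r = 1, h = 0.5, and h' = −0.25, so the determinant works out to 1.35 × 1.175 by hand.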

Planar: Direction-wise
Radial: Distance-wise
Complexity: O(d) determinant
Expressiveness: Limited
04

RealNVP Architecture

Real-valued Non-Volume Preserving (RealNVP) flows introduced affine coupling layers: partition the variables into two groups and apply an affine transformation to one group, with scale and shift computed from the other. The Jacobian is triangular, so its determinant is simply the product of the diagonal entries. Successive layers alternate which group is transformed, ensuring all variables are eventually modified while maintaining computational efficiency.

RealNVP scaled flows to natural images by combining many coupling layers with a multi-scale architecture. The inverse is explicit: with y₂ = x₂ ⊙ exp(s(x₁)) + t(x₁), we recover x₂ = (y₂ − t(x₁)) ⊙ exp(−s(x₁)) without ever inverting the networks s and t. The log-determinant is the sum of the log-scales s(x₁), which costs O(d) and is already available from the forward pass. This practical efficiency helped make flows competitive with other deep generative models.
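A minimal sketch of one affine coupling layer on a 4-dimensional input split into two halves. The functions scale_net and shift_net are hypothetical stand-ins for the learned networks; any function of x₁ would do, which is exactly why the layer stays invertible.

```python
import math

def scale_net(x1):
    """Hypothetical stand-in for the learned log-scale network s(x1)."""
    return [math.tanh(0.5 * x1[0] - 0.2 * x1[1]), math.tanh(0.3 * x1[1])]

def shift_net(x1):
    """Hypothetical stand-in for the learned shift network t(x1)."""
    return [x1[0] + x1[1], 0.1 * x1[0] ** 2]

def coupling_forward(x):
    """y1 = x1, y2 = x2 * exp(s(x1)) + t(x1). Returns (y, log|det J|)."""
    x1, x2 = x[:2], x[2:]
    s, t = scale_net(x1), shift_net(x1)
    y2 = [xi * math.exp(si) + ti for xi, si, ti in zip(x2, s, t)]
    # Triangular Jacobian with diagonal (1, 1, exp(s_1), exp(s_2)).
    return x1 + y2, sum(s)

def coupling_inverse(y):
    """x1 = y1, x2 = (y2 - t(y1)) * exp(-s(y1)); s and t are never inverted."""
    y1, y2 = y[:2], y[2:]
    s, t = scale_net(y1), shift_net(y1)
    x2 = [(yi - ti) * math.exp(-si) for yi, si, ti in zip(y2, s, t)]
    return y1 + x2

x = [0.3, -1.0, 2.0, 0.5]
y, log_det = coupling_forward(x)
x_back = coupling_inverse(y)
print(y, log_det, x_back)
```

Because the first half passes through unchanged, a real model stacks several such layers and swaps the roles of the two halves between them.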

Coupling Layers

Partition and affine-transform, maintaining triangular Jacobian.

Easy Inversion

Direct formula; no iterative solver required.

O(d) Jacobian

Triangular structure; determinant = product of diagonal.

Practical Scaling

Enables image generation at reasonable resolutions.

05

GLOW & Invertible CNNs

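GLOW (Generative Flow with Invertible 1×1 Convolutions) replaced RealNVP's fixed channel permutations with learned invertible 1×1 convolutions. These mix information across channels at each spatial location; for a c × c weight matrix W applied to an h × w feature map, the log-determinant is h · w · log |det W|, which costs O(c³) and can be reduced to O(c) with an LU parameterization. Actnorm (activation normalization) is a per-channel affine layer whose scale and bias are initialized from the first minibatch so that its outputs have zero mean and unit variance; it replaces batch normalization and keeps training stable at small batch sizes.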
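A minimal sketch of the two layers on a tiny feature map stored as nested lists of shape [h][w][c]. It assumes c = 2 so the determinant can be written by hand, and it uses a single feature map in place of a minibatch for the actnorm initialization; both are simplifications for illustration.

```python
import math

def conv1x1(x, W):
    """Apply the same c x c matrix W at every pixel; returns (y, log|det J|)."""
    h, w, c = len(x), len(x[0]), len(W)
    y = [[[sum(W[o][k] * x[i][j][k] for k in range(c)) for o in range(c)]
          for j in range(w)] for i in range(h)]
    det_W = W[0][0] * W[1][1] - W[0][1] * W[1][0]    # c = 2 in this sketch
    return y, h * w * math.log(abs(det_W))

def conv1x1_inverse(y, W):
    """Invert by applying W^{-1} at every pixel."""
    det_W = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    W_inv = [[W[1][1] / det_W, -W[0][1] / det_W],
             [-W[1][0] / det_W, W[0][0] / det_W]]
    return conv1x1(y, W_inv)[0]

def actnorm_init(x):
    """Data-dependent init: per-channel (scale, bias) giving zero mean, unit variance."""
    pixels = [p for row in x for p in row]
    n, c = len(pixels), len(pixels[0])
    mean = [sum(p[k] for p in pixels) / n for k in range(c)]
    std = [math.sqrt(sum((p[k] - mean[k]) ** 2 for p in pixels) / n) for k in range(c)]
    return [1.0 / s for s in std], [-m / s for m, s in zip(mean, std)]

def actnorm(x, scale, bias):
    """y = scale * x + bias per channel; log|det J| = h * w * sum(log|scale|)."""
    y = [[[scale[k] * p[k] + bias[k] for k in range(len(scale))] for p in row] for row in x]
    return y, len(x) * len(x[0]) * sum(math.log(abs(s)) for s in scale)

x = [[[1.0, 2.0], [3.0, -1.0]], [[0.5, 4.0], [-2.0, 0.0]]]   # h = w = c = 2
W = [[2.0, 1.0], [0.5, 1.5]]                                  # det W = 2.5
y, conv_log_det = conv1x1(x, W)
x_back = conv1x1_inverse(y, W)
scale, bias = actnorm_init(x)
normed, act_log_det = actnorm(x, scale, bias)
print(conv_log_det, act_log_det)
```

After initialization the actnorm parameters are trained like any other weights; only the starting point depends on the data.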

GLOW adopts the multi-scale architecture introduced with RealNVP: each level squeezes spatial resolution into channels, applies several flow steps, and then factors out half of the variables as latents, so later levels operate on progressively coarser representations. This hierarchy reduces computation and improves generation quality for high-resolution images. GLOW achieved striking visual results on face synthesis and attribute manipulation, showing that flows could produce samples approaching the quality of contemporary GANs.

Mixing: 1×1 Conv
Stabilization: Actnorm
Hierarchy: Multi-scale
Images: High-res
06

Autoregressive Flows

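Autoregressive flows (MAF, IAF) factorize p(z) = ∏ᵢ p(z_i | z_{<i}), so each variable is transformed conditioned only on the variables that precede it in a fixed ordering, which makes the Jacobian triangular.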

MAF excels at density estimation (tractable likelihood on test data in one pass); IAF excels at sampling, for example as a posterior approximator in VAEs. The choice reflects the application: an autoregressive ordering lets one direction run in parallel, either likelihood evaluation (MAF) or sampling (IAF), while the other direction requires d sequential steps. This is a fundamental tradeoff in autoregressive decompositions.
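The asymmetry between the two directions can be sketched with a MAF-style layer, assuming the usual Gaussian parameterization x_i = u_i · exp(α_i) + μ_i, where μ_i and α_i depend only on x_{<i}. The conditioner below is a hypothetical stand-in for a masked network.

```python
import math

def conditioner(prefix):
    """Hypothetical stand-in for a masked network: (mu_i, alpha_i) from x_{<i}."""
    s = sum(prefix)
    return 0.5 * s, math.tanh(0.3 * s)

def maf_log_density(x):
    """Data -> noise. Every (mu_i, alpha_i) depends only on the observed x, so a
    masked network could produce all of them in one pass; the loop is for clarity."""
    u, log_p = [], 0.0
    for i in range(len(x)):
        mu, alpha = conditioner(x[:i])
        ui = (x[i] - mu) * math.exp(-alpha)
        u.append(ui)
        # Standard normal log-density of u_i plus log|du_i/dx_i| = -alpha.
        log_p += -0.5 * ui * ui - 0.5 * math.log(2.0 * math.pi) - alpha
    return u, log_p

def maf_sample(u):
    """Noise -> data. x_i needs x_{<i}, which must already have been generated,
    so this loop is inherently sequential (d steps)."""
    x = []
    for ui in u:
        mu, alpha = conditioner(x)
        x.append(ui * math.exp(alpha) + mu)
    return x

u0 = [0.2, -1.0, 0.7]
x = maf_sample(u0)
u1, log_p = maf_log_density(x)
print(x, u1, log_p)
```

IAF uses the same arithmetic but conditions on the noise u_{<i} instead of x_{<i}, which swaps which direction is the parallel one.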

MAF Strengths

  • Likelihood in a single parallel pass
  • Excellent density estimation
  • Data-to-noise direction parallelizes across dimensions
  • No sequential bottleneck during training

MAF Limitations

  • O(d) forward sampling cost
  • Sequential sampling required
  • Slower for latent sampling
07

Continuous Flows

Neural ODE-based flows (FFJORD) replace discrete layers with a continuous-time differential equation: dz/dt = f_θ(z(t), t). The log-density evolves as d(log p)/dt = −tr(∂f_θ/∂z), so the log-determinant of a discrete layer is replaced by the trace of the Jacobian, integrated along the trajectory. Estimating this trace with Hutchinson's method, which uses random probe vectors, makes FFJORD scalable to high dimensions.

Continuous flows offer theoretical elegance and architectural freedom: f_θ can be any sufficiently smooth network, with no need for coupling or autoregressive structure. However, numerical integration can cost more than evaluating a discrete flow; hybrid approaches (e.g., alternating continuous layers with discrete coupling) balance expressiveness and efficiency. Continuous normalizing flows blur the boundary between flows and energy-based models, suggesting deeper connections in generative modeling.
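A minimal sketch of integrating the state and the log-density together, using a fixed-step Euler scheme on a scalar linear field dz/dt = a·z. Both the solver and the field are illustrative assumptions (practical implementations typically rely on adaptive ODE solvers); the linear case is chosen because the exact answer is known.

```python
import math

def integrate_cnf(z0, log_p0, field, divergence, t1=1.0, n_steps=1000):
    """Euler-integrate dz/dt = field(z, t) and d(log p)/dt = -divergence(z, t)."""
    z, log_p = z0, log_p0
    dt = t1 / n_steps
    for k in range(n_steps):
        t = k * dt
        dz = field(z, t)
        dlogp = -divergence(z, t)   # in 1-D the trace is just df/dz
        z += dt * dz
        log_p += dt * dlogp
    return z, log_p

a = -0.5                                            # illustrative contraction rate
z0 = 1.3
log_p0 = -0.5 * z0 * z0 - 0.5 * math.log(2.0 * math.pi)   # z0 ~ N(0, 1)

z_T, log_p_T = integrate_cnf(z0, log_p0, lambda z, t: a * z, lambda z, t: a)

# Exact solution: z(T) = z0 * exp(a T), and z(T) ~ N(0, exp(2 a T)).
z_exact = z0 * math.exp(a)
log_p_exact = (-0.5 * (z_exact / math.exp(a)) ** 2
               - a - 0.5 * math.log(2.0 * math.pi))
print(z_T, z_exact, log_p_T, log_p_exact)
```

A contracting field (a < 0) raises the log-density along the trajectory, matching the sign in d(log p)/dt = −tr(∂f/∂z).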

Trace Estimation

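Computing tr(∂f/∂z) exactly requires d vector-Jacobian products, which is O(d²) and becomes expensive in high dimensions. Hutchinson's estimator uses random Rademacher or Gaussian vectors v with E[v v^T] = I, for which tr(J) = E_v[v^T · ∂f/∂z · v]. Averaging v^T J v over a few samples gives an unbiased estimate, and each sample needs only one vector-Jacobian product, reducing the cost from O(d²) to O(d).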
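A minimal sketch of the estimator with Rademacher probes. For checking purposes the Jacobian is given as an explicit small matrix; in a real model J is never formed and the product with v would come from automatic differentiation. The matrix values are arbitrary.

```python
import random

def hutchinson_trace(J, n_samples, rng):
    """Monte Carlo estimate of tr(J) as the average of v^T J v."""
    d = len(J)
    total = 0.0
    for _ in range(n_samples):
        v = [rng.choice((-1.0, 1.0)) for _ in range(d)]   # Rademacher probe
        Jv = [sum(J[i][j] * v[j] for j in range(d)) for i in range(d)]
        total += sum(v[i] * Jv[i] for i in range(d))
    return total / n_samples

J = [[2.0, 0.5, -0.3],
     [0.1, -1.0, 0.4],
     [0.2, -0.6, 3.0]]
exact = sum(J[i][i] for i in range(3))        # 4.0

rng = random.Random(0)
estimate = hutchinson_trace(J, 20_000, rng)
print(exact, estimate)
```

With Rademacher probes the diagonal terms are reproduced exactly in every sample, so all of the variance comes from the off-diagonal entries of J.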

08

Flow Applications

Flows excel in applications requiring exact likelihood: maximum likelihood training on observational data, model selection via marginal likelihood, and variational inference as powerful posterior approximators. Hybrid models (flow + VAE) combine flows as flexible posteriors with VAE training. Density ratios estimated via flows enable likelihood-free inference and simulation-based calibration in scientific domains.
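Maximum likelihood training can be sketched on the smallest possible flow, a 1-D affine map x = exp(s)·z + μ with z ~ N(0, 1). The data, learning rate, and step count below are illustrative assumptions; the point is only that the exact log-likelihood is the training objective.

```python
import math

def fit_affine_flow(data, lr=0.1, n_steps=2000):
    """Gradient descent on the mean negative log-likelihood of x = exp(s) * z + mu."""
    mu, s = 0.0, 0.0                      # s = log(sigma) keeps the scale positive
    n = len(data)
    for _ in range(n_steps):
        u = [(x - mu) * math.exp(-s) for x in data]      # data -> noise
        # Per-point NLL is 0.5 * u^2 + s + const, which gives these gradients:
        grad_mu = sum(-ui * math.exp(-s) for ui in u) / n
        grad_s = 1.0 - sum(ui * ui for ui in u) / n
        mu -= lr * grad_mu
        s -= lr * grad_s
    return mu, math.exp(s)

data = [1.0, 2.0, 4.0, 5.0, 3.0]          # sample mean 3, sample variance 2
mu_hat, sigma_hat = fit_affine_flow(data)
print(mu_hat, sigma_hat)
```

For this one-layer flow the optimum is the sample mean and standard deviation, which makes the result easy to verify; deeper flows use the same objective with the log-determinants of all layers summed.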

Limitations include computational cost at scale (discrete flows require many layers; continuous flows integrate ODEs) and architectural constraints for high-dimensional data, since every layer must be invertible and preserve dimensionality. Although GANs and diffusion models dominate image generation, flows remain a strong choice for tabular data, small-scale image generation, and any application prioritizing exact, differentiable likelihood over sample quality. Recent work on diffusion-flow hybrids and score-based flows signals renewed integration of flow principles into modern generative modeling.

Likelihood Estimation · Posterior Approximation · Density Ratio · Latent Sampling · VAE Hybrid · Sim-based Inference
09

References & Further Reading

Normalizing flows provide exact likelihood computation through invertible transformations. This section compiles key papers and resources for understanding the change-of-variables formula, flow architectures, and applications in generative modeling and probabilistic inference.

From classical theory to modern architectures like RealNVP and GLOW, these materials document the evolution of flow-based generative models.