PS2: VAE & Flows — Deep Dive

STANFORD XCS236 · PROBLEM SET 2

Problem Set Overview

Problem Set 2 is a comprehensive 80-point assignment covering variational autoencoders (VAEs) and normalizing flows, two foundational generative modeling techniques taught in weeks 3–4 of XCS236. Students implement the core VAE components: an encoder network, a decoder network, the reparameterization trick, and the evidence lower bound (ELBO) loss. The assignment emphasizes understanding latent variable models, approximate inference, and the variational principle underlying VAEs.

The second half explores normalizing flows, which provide exact density estimation through invertible transformations. Students build RealNVP-style coupling layers and train flow models on synthetic and real data. The problem set culminates in comparing VAEs and flows across multiple dimensions: likelihood accuracy, sample quality, training stability, and computational efficiency. This dual focus demonstrates how approximate and exact generative models trade off expressiveness with tractability.

Points Total: 80

Coverage: Weeks 3–4

Model Types: VAEs and normalizing flows

Framework: PyTorch

VAE Theory Review

Variational autoencoders perform approximate inference over latent variables z given data x. The ELBO (evidence lower bound) decomposes into a reconstruction term and a KL divergence regularizer. The reconstruction term encourages the decoder to reconstruct x accurately from z, while the KL term pulls the approximate posterior q(z|x) toward the standard normal prior p(z). The balance between the two matters: if the KL term dominates, q(z|x) collapses onto the prior and the latent code carries no information about x (posterior collapse); if it is too weak, the latent space loses the regular structure that makes sampling from the prior meaningful.
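For reference, the decomposition described above can be written out as follows. This is a sketch assuming a diagonal-Gaussian encoder and a standard normal prior; the parameter subscripts θ and φ and the latent dimension d are notation introduced here, not taken from the problem set.

```latex
\log p_\theta(x) \;\ge\; \mathrm{ELBO}(x)
  = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction term}}
  \;-\; \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)}_{\text{KL regularizer}}

% Closed form when q_\phi(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) and p(z) = \mathcal{N}(0, I):
D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
  = \frac{1}{2}\sum_{j=1}^{d}\big(\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1\big)
```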

The reparameterization trick enables backpropagation through stochastic sampling by writing z = μ + σ ⊙ ε, where ε ~ N(0, I). This reformulation eliminates the gradient barrier around the sampling operation. The encoder learns to predict mean μ and log-variance log(σ²) for each data point, allowing efficient training of the full model end-to-end with standard optimizers.
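A minimal PyTorch sketch of the sampling step described above. The function name `reparameterize` and the toy shapes are illustrative assumptions, not the assignment's starter code.

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    std = torch.exp(0.5 * logvar)   # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(std)     # the noise itself carries no gradient
    return mu + std * eps

# Toy check: gradients flow back to mu and logvar through the sample.
mu = torch.zeros(4, 2, requires_grad=True)
logvar = torch.zeros(4, 2, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()
```

Because the randomness enters only through `eps`, `z` is a deterministic, differentiable function of `mu` and `logvar`, which is what lets a standard optimizer train the encoder.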

VAE Implementation

The VAE encoder is a neural network that maps data x to the parameters (μ, log σ²) of the approximate posterior. Typically a 2–4 layer MLP or CNN depending on data type, it outputs a d-dimensional mean vector and a d-dimensional log-variance vector. The decoder network mirrors this architecture, mapping latent z back to data space and producing the parameters of the output distribution (e.g., Bernoulli logits for binary data, Gaussian parameters for continuous data). Both networks are trained jointly via the ELBO objective.
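The encoder and decoder described above might look like the following MLP sketch. The layer widths, the 784-dimensional input, the 2-dimensional latent, and the class names are all assumptions chosen for illustration.

```python
import torch
from torch import nn

class Encoder(nn.Module):
    """Maps x to the mean and log-variance of q(z|x)."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(x_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, h_dim), nn.ReLU(),
        )
        self.mu_head = nn.Linear(h_dim, z_dim)
        self.logvar_head = nn.Linear(h_dim, z_dim)

    def forward(self, x):
        h = self.body(x)
        return self.mu_head(h), self.logvar_head(h)

class Decoder(nn.Module):
    """Maps z to Bernoulli logits over the data dimensions."""
    def __init__(self, z_dim=2, h_dim=256, x_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, x_dim),  # raw logits, no sigmoid here
        )

    def forward(self, z):
        return self.net(z)

x = torch.rand(8, 784)                 # stand-in batch of flattened images
enc, dec = Encoder(), Decoder()
mu, logvar = enc(x)
logits = dec(mu)                       # decode the posterior mean
```

Returning logits rather than probabilities lets the loss use a numerically stable cross-entropy on logits.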

Each training step runs a forward pass through the encoder, draws z by reparameterized sampling, runs a forward pass through the decoder, and computes the reconstruction loss and the KL divergence together. Common practice uses β-scheduling (KL annealing) to gradually increase the KL weight from 0 to 1, which helps prevent early posterior collapse. Optimization typically uses Adam with learning rates between 1e-4 and 1e-3. Implementation challenges include numerical stability (predicting the log-variance so that the variance is always positive) and monitoring the posterior variance to ensure it doesn't degenerate.
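The loss and KL-weight schedule can be sketched as below, assuming a Bernoulli decoder and a linear warm-up. The function names, the warm-up length, and the stand-in tensors are hypothetical.

```python
import torch
import torch.nn.functional as F

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dims."""
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=1)

def negative_elbo(x, logits, mu, logvar, beta=1.0):
    """Batch mean of reconstruction + beta * KL; also returns both parts for logging."""
    rec = F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(dim=1)
    kl = kl_to_standard_normal(mu, logvar)
    return (rec + beta * kl).mean(), rec.mean(), kl.mean()

def linear_beta(step, warmup_steps):
    """Linear KL warm-up from 0 to 1 over warmup_steps optimizer steps."""
    return min(1.0, step / max(1, warmup_steps))

# Stand-ins for a data batch and for encoder/decoder outputs.
x = torch.bernoulli(torch.full((8, 784), 0.5))
logits = torch.zeros(8, 784)
mu, logvar = torch.zeros(8, 2), torch.zeros(8, 2)
beta = linear_beta(step=500, warmup_steps=1000)
loss, rec, kl = negative_elbo(x, logits, mu, logvar, beta=beta)
```

Logging `rec` and `kl` separately is a cheap way to spot posterior collapse: `kl` sitting near zero while `rec` plateaus is the usual signature.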

VAE Analysis

Latent space visualization reveals structure learned by VAEs. Plotting 2D latent codes with their reconstructions shows that similar data points cluster nearby in z-space. Interpolation between latent codes produces smooth transitions in data space, demonstrating that VAEs learn continuous meaningful representations. Reconstruction quality is measured by per-pixel or per-sample error, with well-trained VAEs achieving low MSE or binary cross-entropy on held-out test sets.
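Latent interpolation itself is a one-liner; the sketch below builds the path in z-space, assuming straight-line interpolation. The helper name and the example codes are made up, and decoding each point with a trained decoder is left out.

```python
import torch

def interpolate_latents(z_start, z_end, steps=8):
    """Return `steps` points on the straight line from z_start to z_end, endpoints included."""
    t = torch.linspace(0.0, 1.0, steps).unsqueeze(1)   # shape (steps, 1)
    return (1.0 - t) * z_start + t * z_end             # shape (steps, z_dim)

z_a = torch.tensor([-2.0, 0.0])   # e.g. the encoder mean of one test image
z_b = torch.tensor([2.0, 1.0])    # e.g. the encoder mean of another
path = interpolate_latents(z_a, z_b, steps=5)
```

Passing each row of `path` through the decoder and plotting the outputs side by side gives the smooth-transition figure described above.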

The KL divergence term measures how far the posterior departs from the prior. High KL values suggest the model is using latent capacity; KL near zero indicates posterior collapse, where q(z|x) matches the prior and the decoder models the data while ignoring z. Posterior variance is also diagnostic: healthy VAEs have non-trivial variance per dimension (neither shrinking toward zero nor stuck at the prior's variance of 1 for every input), ensuring stochasticity in sampling. Log-likelihood estimates can be computed via importance sampling, which in expectation gives a lower bound on the true log marginal likelihood; this bound is at least as tight as the ELBO and tightens as the number of samples grows.
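The importance-sampling estimate can be illustrated on a toy model whose marginal is known in closed form. The one-dimensional model, the helper name `iwae_log_likelihood`, and the sample counts are assumptions for this sketch; in a real VAE, `log_joint` would come from the decoder and prior, and the proposal from the encoder.

```python
import math
import torch
from torch.distributions import Normal

def iwae_log_likelihood(log_joint, q, k=64):
    """log (1/k) sum_i p(x, z_i) / q(z_i) with z_i ~ q.
    In expectation this is a lower bound on log p(x)."""
    z = q.sample((k,))
    log_w = log_joint(z) - q.log_prob(z)              # log importance weights
    return torch.logsumexp(log_w, dim=0) - math.log(k)

# Toy model: z ~ N(0, 1), x | z ~ N(z, 1), hence x ~ N(0, 2).
x = torch.tensor(0.7)
log_joint = lambda z: Normal(0.0, 1.0).log_prob(z) + Normal(z, 1.0).log_prob(x)
exact = Normal(0.0, math.sqrt(2.0)).log_prob(x)

torch.manual_seed(0)
posterior = Normal(x / 2, math.sqrt(0.5))             # the exact posterior p(z | x)
est_posterior = iwae_log_likelihood(log_joint, posterior, k=64)
est_prior = iwae_log_likelihood(log_joint, Normal(0.0, 1.0), k=5000)
```

With the exact posterior as proposal every weight equals p(x), so the estimate is exact; a cruder proposal such as the prior needs many more samples to get close.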

Normalizing Flows Theory

Normalizing flows transform a simple base distribution (e.g., standard normal) through a sequence of invertible transformations. If z₀ ~ p₀(z₀) and z_i = f_i(z_{i-1}) for i = 1, …, k, the log-density of the transformed variable follows the change-of-variables formula: log p_k(z_k) = log p₀(z₀) - Σ_{i=1}^{k} log |det ∇f_i(z_{i-1})|, where ∇f_i is the Jacobian of the i-th transformation. Because these Jacobian determinants can be computed, flows allow exact likelihood evaluation, a key advantage over VAEs, which only lower-bound the likelihood.
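A small numeric check of the change-of-variables formula, using two one-dimensional affine maps so the answer is known in closed form. The specific scales and shifts are arbitrary values chosen for illustration.

```python
import math
import torch
from torch.distributions import Normal

base = Normal(0.0, 1.0)
scales = [2.0, 0.5]    # a_1, a_2 in f_i(z) = a_i * z + b_i
shifts = [-1.0, 3.0]   # b_1, b_2

def flow_log_prob(x):
    """log p_0(z_0) minus the sum of log|det grad f_i|, inverting the last map first."""
    z, sum_log_det = x, 0.0
    for a, b in reversed(list(zip(scales, shifts))):
        z = (z - b) / a                      # invert f_i
        sum_log_det += math.log(abs(a))      # Jacobian of a 1-D affine map is a_i
    return base.log_prob(z) - sum_log_det

x = torch.tensor([0.3, 2.5, 4.2])
log_px_flow = flow_log_prob(x)

# Composing the two maps gives x = a_1*a_2 * z_0 + (a_2*b_1 + b_2), a Gaussian again.
loc = scales[1] * shifts[0] + shifts[1]
scale = abs(scales[0] * scales[1])
log_px_exact = Normal(loc, scale).log_prob(x)
```

The two results agree, which is the formula at work: the density of the base sample, corrected by how much each map stretches or shrinks volume.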

Coupling layers (as in RealNVP) split latent dimensions into two groups: one is transformed by a neural network function, the other passes through. This design allows tractable Jacobian computation (triangular structure) while preserving expressiveness through composition. Affine coupling layers apply learnable scale and shift parameters, making them flexible yet computationally efficient. Building a flow requires stacking these layers with careful architectural choices to balance expressiveness and stability.

Flow Implementation

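A RealNVP flow consists of alternating coupling layers, each parameterized by a scale network and a translation network. The scale network outputs a log-scale α and the translation network outputs a shift β, both conditioned only on the pass-through (unchanged) dimensions x_B. The forward pass applies x_A' = x_A ⊙ exp(α(x_B)) + β(x_B) and x_B' = x_B, where A and B are complementary dimension groups. Because x_B passes through unchanged, the inverse is available in closed form, x_A = (x_A' - β(x_B)) ⊙ exp(-α(x_B)), and the log-determinant of the Jacobian is simply the sum of the entries of α(x_B). This makes both sampling and likelihood evaluation efficient.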
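One way to sketch such a coupling layer in PyTorch is shown below. The class name, the single shared network producing both outputs, the hidden width, and the tanh bound on the log-scale are assumptions of this sketch rather than details from the problem set; `log_s` and `t` play the roles of the scale and translation network outputs.

```python
import torch
from torch import nn

class AffineCoupling(nn.Module):
    def __init__(self, dim, mask, hidden=64):
        super().__init__()
        # mask == 1 marks pass-through dims (group B), mask == 0 the transformed dims (group A)
        self.register_buffer("mask", mask)
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * dim))

    def _scale_shift(self, x_b):
        log_s, t = self.net(x_b).chunk(2, dim=1)
        log_s = torch.tanh(log_s) * (1 - self.mask)   # bounded log-scale, zero on group B
        t = t * (1 - self.mask)                       # no shift on group B
        return log_s, t

    def forward(self, x):
        log_s, t = self._scale_shift(x * self.mask)
        y = x * torch.exp(log_s) + t
        return y, log_s.sum(dim=1)                    # output and log|det Jacobian|

    def inverse(self, y):
        log_s, t = self._scale_shift(y * self.mask)   # group B is unchanged, so same inputs
        return (y - t) * torch.exp(-log_s)

torch.manual_seed(0)
mask = torch.tensor([1.0, 0.0, 1.0, 0.0])
layer = AffineCoupling(dim=4, mask=mask)
x = torch.randn(5, 4)
y, log_det = layer(x)
x_rec = layer.inverse(y)
```

Since the networks only ever see the pass-through dimensions, they can be arbitrarily complex without affecting invertibility or the cost of the log-determinant.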

Training uses maximum likelihood: maximize log p(x) = log p_base(f⁻¹(x)) - Σ log |det ∇f_i|, with each Jacobian evaluated at the corresponding intermediate point. Implementation requires careful handling of dimension masking so that successive layers alternate which group is transformed, numerically stable exponentials (predict the log-scale and work in log-space), and batch normalization between coupling layers for improved training stability. Common architectures use 8–16 coupling layers with 2–3 hidden layers per network, achieving strong likelihoods on standard benchmarks. Computational cost scales linearly with depth and is typically higher than VAE training.
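The training objective can be sketched end to end on synthetic 2-D data. Everything below (class names, four layers, the tanh-bounded log-scale, the toy dataset, 200 Adam steps) is an illustrative assumption, and batch normalization is omitted to keep the example short.

```python
import math
import torch
from torch import nn

class Coupling(nn.Module):
    """Affine coupling f: keeps the masked dims, scales and shifts the others."""
    def __init__(self, dim, mask, hidden=64):
        super().__init__()
        self.register_buffer("mask", mask)
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * dim))

    def inverse(self, y):
        """Return f^{-1}(y) and log|det grad f| for this layer."""
        log_s, t = self.net(y * self.mask).chunk(2, dim=1)
        log_s = torch.tanh(log_s) * (1 - self.mask)
        t = t * (1 - self.mask)
        return (y - t) * torch.exp(-log_s), log_s.sum(dim=1)

class Flow(nn.Module):
    def __init__(self, dim=2, n_layers=4):
        super().__init__()
        # Shift the mask by one position per layer so the two groups alternate.
        masks = [((torch.arange(dim) + i) % 2).float() for i in range(n_layers)]
        self.layers = nn.ModuleList([Coupling(dim, m) for m in masks])

    def log_prob(self, x):
        z, sum_log_det = x, x.new_zeros(x.shape[0])
        for layer in reversed(list(self.layers)):     # undo the last layer first
            z, log_det = layer.inverse(z)
            sum_log_det = sum_log_det + log_det
        log_base = -0.5 * (z.pow(2).sum(dim=1) + z.shape[1] * math.log(2 * math.pi))
        return log_base - sum_log_det                 # log p_base(f^{-1}(x)) - sum log|det grad f_i|

torch.manual_seed(0)
data = torch.randn(256, 2) * torch.tensor([2.0, 0.5]) + 1.0   # toy anisotropic Gaussian
flow = Flow()
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
losses = []
for _ in range(200):
    loss = -flow.log_prob(data).mean()                # negative log-likelihood
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

Only the inverse direction is needed for training; sampling would run the layers in the opposite order starting from base noise.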

VAE vs Flow Comparison

Normalizing flows compute exact likelihood p(x), while VAEs only bound it from below via ELBO. On standard datasets, flows often achieve higher test likelihoods, particularly on images with sharp pixel distributions. However, VAEs train faster and require less careful hyperparameter tuning. Sample quality differs subtly: flow samples from invertible transformations can exhibit mode-covering behavior, while VAEs with diverse decoders produce perceptually coherent but sometimes blurrier samples. The tradeoff between exact vs. approximate inference manifests in both theory and practice.

VAEs excel at representation learning due to structured latent space; flows provide accurate density estimation for likelihood-based applications. VAEs are more stable during training with fewer hyperparameter sensitivities, while flows demand careful architectural design and batch normalization tuning. In terms of computational cost, VAEs are typically 2–4× faster per epoch. For downstream tasks like classification or anomaly detection using latent codes, VAEs remain preferable; for tasks requiring accurate likelihood estimation (e.g., model selection, probability weighting), flows are superior.

Key Takeaways

Latent variable models are central to modern deep generative modeling. VAEs offer interpretable inference networks and structured latent spaces but sacrifice exact likelihood. Normalizing flows provide exact likelihood through clever architectural design (coupling layers) but sacrifice interpretability and latent structure. Understanding the ELBO, reparameterization trick, and change-of-variables formula forms a foundation for advanced topics like hierarchical VAEs, conditional flows, and neural ODE-based generative models.

Practical insights: posterior collapse is the main failure mode of VAEs; address it with β-scheduling or free bits. Flows require stable Jacobian computation and careful normalization. Neither model is universally best; the choice depends on the application. For representation learning and unsupervised discovery, choose VAEs; for likelihood-based model evaluation and density estimation, choose flows. Future directions include combining both paradigms (e.g., using flows as more flexible VAE posteriors or priors) and scaling to high-dimensional data via hierarchical architectures and more efficient invertible designs.
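Free bits can be sketched as a floor on the KL term. The version below is one common variant (per-dimension KL averaged over the batch, then clamped); the function name and the threshold of 0.5 nats are assumptions.

```python
import torch

def free_bits_kl(mu, logvar, free_nats=0.5):
    """KL to N(0, I) where each latent dimension contributes at least free_nats."""
    kl_per_dim = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0)   # (batch, z_dim)
    kl_dim_mean = kl_per_dim.mean(dim=0)                           # batch average per dimension
    return torch.clamp(kl_dim_mean, min=free_nats).sum()

# A fully collapsed posterior: every dimension sits at the floor.
mu = torch.zeros(16, 4, requires_grad=True)
logvar = torch.zeros(16, 4, requires_grad=True)
kl_collapsed = free_bits_kl(mu, logvar)
kl_collapsed.backward()

# A posterior that uses its first dimension: that dimension exceeds the floor.
mu_active = torch.zeros(16, 4)
mu_active[:, 0] = 2.0
kl_active = free_bits_kl(mu_active, torch.zeros(16, 4))
```

Below the floor the clamp passes no gradient, so the optimizer has no incentive to push a dimension's KL all the way to zero.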

References & Further Reading

Problem Set 2 covers two foundational paradigms in generative modeling: approximate inference via variational autoencoders and exact likelihood estimation via normalizing flows. The references below span theoretical foundations, landmark papers, intuitive explanations, and implementation resources. Understanding both approaches deeply—their strengths, limitations, and when each is appropriate—is essential for modern generative AI research and applications.

Start with conceptual resources to build intuition about the ELBO, reparameterization trick, and invertible transformations. Progress to landmark papers for rigorous mathematics and empirical validation. Use PyTorch documentation and implementation blogs when coding. The comparative perspective between VAEs and flows is crucial: both are powerful, but they represent different points in the expressiveness-tractability tradeoff space.
