STANFORD XCS236 · PROBLEM SET 1
PS1: Autoregressive Models
40 Points · Introduction, PyTorch & Density Estimation

PS1 Overview — 40 Points

Problem Set 1 introduces the foundational concepts of generative modeling through autoregressive density estimation. Students implement a basic autoregressive model in PyTorch, learning how to decompose joint distributions as conditional products: p(x) = ∏ᵢ p(xᵢ | x₁, ..., xᵢ₋₁).

The problem set covers probability fundamentals (chain rule, Bayes' rule, KL divergence), PyTorch workflows (tensors, autograd, datasets, training loops), and two key architectures—MADE and PixelCNN—that efficiently compute likelihoods on high-dimensional data. By the end, students understand how to construct masked networks, evaluate density models, and sample from learned distributions.

Probability Review & Foundations

Probability theory is the mathematical backbone of generative modeling. The chain rule decomposes joint probabilities into conditional products, enabling tractable factorizations of high-dimensional distributions. Bayes' rule relates the posterior to the likelihood and the prior, which is foundational for understanding model inference. KL divergence quantifies how much one distribution differs from another; it is non-negative and zero only when the two distributions match, but it is asymmetric and therefore not a true distance. It motivates the variational objectives used in more advanced models.

Students review conditional independence, marginal distributions, and the relationship between likelihood and cross-entropy loss. These concepts appear throughout deep generative modeling, especially in variational inference, diffusion models, and transformer-based architectures. Mastering probability notation and manipulation is essential for reading research papers and deriving custom model objectives.
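
The link between KL divergence and cross-entropy can be checked numerically. The sketch below uses only the Python standard library; the two distributions `p` and `q` and all function names are made up for illustration. It shows that KL is asymmetric and that cross-entropy equals entropy plus KL.

```python
import math

def entropy(p):
    """Entropy H(p) in nats of a discrete distribution given as a list."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_i p_i log q_i, in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i log(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative distributions over three outcomes (assumed values).
p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print(f"KL(p||q) = {kl_pq:.4f} nats")
print(f"KL(q||p) = {kl_qp:.4f} nats")
print(f"H(p, q) - H(p) = {cross_entropy(p, q) - entropy(p):.4f} nats")
```

Because H(p) does not depend on the model q, minimizing cross-entropy against data drawn from p is the same as minimizing KL(p || q), which is why maximum likelihood training and cross-entropy loss coincide.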

PyTorch Fundamentals

PyTorch provides the computational tools for deep learning. Tensors are the core data structure—multi-dimensional arrays supporting vectorized operations on GPU. Autograd implements automatic differentiation, computing gradients via backpropagation without explicit derivative code. Datasets and DataLoaders handle batching, shuffling, and memory-efficient loading of training data. Custom training loops tie these together: forward pass, loss computation, backward pass, and parameter updates.

Students implement standard practices: defining models as nn.Module subclasses, using optimizers like Adam, and organizing code with reproducibility and modularity in mind. Understanding PyTorch deeply—not just as a black box—enables debugging, custom loss functions, and novel model architectures that will be central to later problem sets in the course.
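
The pieces named above (tensors, an nn.Module subclass, a DataLoader, Adam, and the forward/loss/backward/update cycle) fit together roughly as in the following sketch. The synthetic regression data, the `TinyRegressor` class, and the hyperparameters are assumptions chosen for illustration and are not taken from the problem set.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # fix the seed for reproducibility

# Synthetic data (assumed): y = 3x - 1 plus a little Gaussian noise.
x = torch.linspace(-1, 1, 256).unsqueeze(1)
y = 3 * x - 1 + 0.05 * torch.randn_like(x)
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

class TinyRegressor(nn.Module):
    """Hypothetical one-layer model, defined as an nn.Module subclass."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, inputs):
        return self.linear(inputs)

model = TinyRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

def full_dataset_loss():
    # Evaluate without tracking gradients.
    with torch.no_grad():
        return loss_fn(model(x), y).item()

initial_loss = full_dataset_loss()
for epoch in range(20):
    for batch_x, batch_y in loader:
        optimizer.zero_grad()                    # clear old gradients
        loss = loss_fn(model(batch_x), batch_y)  # forward pass and loss
        loss.backward()                          # backward pass (autograd)
        optimizer.step()                         # parameter update
final_loss = full_dataset_loss()

print(f"loss before training: {initial_loss:.4f}")
print(f"loss after training:  {final_loss:.4f}")
```

The same loop structure carries over to density models: only the model and the loss (negative log-likelihood instead of mean squared error) change.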

Autoregressive Density Estimation

Autoregressive density estimation models the joint distribution p(x) as a product of conditionals: p(x₁, x₂, ..., xₙ) = p(x₁) · p(x₂|x₁) · p(x₃|x₁,x₂) · ... · p(xₙ|x₁,...,xₙ₋₁). This factorization is always valid (it follows from the chain rule), but the architecture must respect the chosen ordering: each variable's prediction may depend only on the variables that precede it. Minimizing the negative log-likelihood (NLL) directly maximizes the probability the model assigns to the data, making training straightforward and evaluation unambiguous.
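
To make the factorization concrete, here is a minimal sketch with three binary variables and hand-written conditional tables. The probability values and function names are invented for illustration; in the problem set a neural network would produce these conditionals instead.

```python
import itertools
import math

# Hypothetical conditionals: each returns P(x_i = 1 | earlier variables).
def p_x1_is_one():
    return 0.6

def p_x2_is_one(x1):
    return 0.9 if x1 == 1 else 0.2

def p_x3_is_one(x1, x2):
    table = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.3}
    return table[(x1, x2)]

def bernoulli_log_prob(x, p_one):
    """Log-probability of a binary value x under Bernoulli(p_one)."""
    return math.log(p_one if x == 1 else 1.0 - p_one)

def log_prob(x):
    """log p(x1, x2, x3) as a sum of conditional log-probabilities."""
    x1, x2, x3 = x
    return (bernoulli_log_prob(x1, p_x1_is_one())
            + bernoulli_log_prob(x2, p_x2_is_one(x1))
            + bernoulli_log_prob(x3, p_x3_is_one(x1, x2)))

# The product of valid conditionals is a valid joint: it sums to 1.
total = sum(math.exp(log_prob(x))
            for x in itertools.product([0, 1], repeat=3))

# Average NLL (in nats) over a small made-up dataset.
data = [(1, 1, 0), (1, 1, 1), (0, 0, 0), (1, 0, 1)]
nll = -sum(log_prob(x) for x in data) / len(data)

print(f"sum over all 8 outcomes: {total:.6f}")
print(f"average NLL: {nll:.4f} nats")
```

Working in log space turns the product of conditionals into a sum, which is numerically stable and is exactly what the NLL loss computes.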

The autoregressive framework is powerful and general: it works for both continuous and discrete data, supports exact likelihood evaluation, and naturally generates samples sequentially. Students implement neural networks that learn these conditional distributions, experiment with different orderings, and compute bits-per-dimension (bits/dim) metrics to evaluate generative quality. This approach underlies autoregressive sequence models built from recurrent neural networks and transformers.
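
Sequential sampling draws x₁ first, then each later variable conditioned on the values already drawn. The sketch below illustrates this with a hypothetical conditional (`conditional_p_one`) that tends to repeat the previous bit; a trained network would take its place.

```python
import random

def conditional_p_one(prefix):
    """Hypothetical P(x_i = 1 | prefix): favors repeating the last bit."""
    if not prefix:
        return 0.5
    return 0.8 if prefix[-1] == 1 else 0.2

def sample(num_dims, rng):
    """Draw one sample by conditioning each variable on those before it."""
    x = []
    for _ in range(num_dims):
        p_one = conditional_p_one(x)
        x.append(1 if rng.random() < p_one else 0)
    return x

rng = random.Random(0)
samples = [sample(8, rng) for _ in range(2000)]

# Fraction of adjacent pairs that agree; the conditional above makes
# each bit repeat its predecessor with probability 0.8.
pairs = [(s[i - 1], s[i]) for s in samples for i in range(1, len(s))]
repeat_rate = sum(a == b for a, b in pairs) / len(pairs)

print(samples[0])
print(f"empirical repeat rate: {repeat_rate:.3f}")
```

Sampling needs one model evaluation per variable, which is why generation is slower than likelihood evaluation in architectures that compute all conditionals in a single pass.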

MADE: Masked Autoencoders

MADE (Masked Autoencoder for Distribution Estimation) enforces the autoregressive ordering constraint through layer-wise masking. Each hidden unit is assigned a degree (a variable index), and connections are masked so that a hidden unit only receives information from inputs, or earlier hidden units, whose index does not exceed its own degree, while the output for xᵢ only connects to hidden units whose degree is strictly less than i. This single-pass architecture computes all conditional distributions p(xᵢ | x₁, ..., xᵢ₋₁) at once.

The key technical challenge is designing the mask matrices to enforce ordering while maintaining hidden layer expressiveness. Students implement mask generation algorithms, understand why masking matters (preventing information leakage from future variables), and appreciate the efficiency gains. MADE showcases how architectural constraints can be elegantly baked into the network structure, a pattern repeated in PixelCNN and many modern architectures.
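
A mask generation routine for one hidden layer can be sketched in plain Python as follows. This is an illustration under stated assumptions: hidden degrees are assigned cyclically over 1..D-1 here for determinism (random assignment is also common), and the names `made_masks` and `connectivity` are hypothetical. Multiplying the masks confirms that output i can be reached only from inputs with a smaller index.

```python
def made_masks(num_inputs, num_hidden):
    """Build input->hidden and hidden->output masks for one hidden layer."""
    input_degrees = list(range(1, num_inputs + 1))
    # Assumption: cycle hidden degrees through 1..D-1.
    hidden_degrees = [1 + (k % (num_inputs - 1)) for k in range(num_hidden)]
    # A hidden unit with degree m may see inputs with index d <= m.
    mask_in = [[1 if m >= d else 0 for d in input_degrees]
               for m in hidden_degrees]
    # Output d may see hidden units with degree m < d (strict).
    mask_out = [[1 if d > m else 0 for m in hidden_degrees]
                for d in input_degrees]
    return mask_in, mask_out

def connectivity(mask_in, mask_out):
    """Count paths from each input (column) to each output (row)."""
    num_outputs = len(mask_out)
    num_hidden = len(mask_in)
    num_inputs = len(mask_in[0])
    return [[sum(mask_out[i][k] * mask_in[k][j] for k in range(num_hidden))
             for j in range(num_inputs)]
            for i in range(num_outputs)]

mask_in, mask_out = made_masks(num_inputs=4, num_hidden=6)
paths = connectivity(mask_in, mask_out)
for row in paths:
    print(row)
```

The path-count matrix is strictly lower triangular: zeros on and above the diagonal mean no output can see its own input or any later one, which is the information-leakage check described above. The first row is all zeros because p(x₁) has no inputs and is modeled by the output bias alone.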

PixelCNN Basics

PixelCNN adapts the autoregressive factorization to image data by modeling pixels in raster order (top-to-bottom, left-to-right). Masked convolutions—where kernel weights are zeroed to prevent future pixel access—enforce the ordering constraint while preserving spatial structure and parameter efficiency. The receptive field of masked convolutions grows with depth, enabling long-range dependencies within the image. PixelCNN naturally generates coherent image samples by conditioning each pixel on all previous ones.

Students implement masked convolutions, visualize receptive fields to verify causality, and sample from PixelCNN-trained models. The architecture demonstrates how to build generative models for high-dimensional structured data. While later variants (PixelCNN++, Gated PixelCNN) add refinements, the core masked convolution mechanism remains central to efficient autoregressive image modeling and illustrates the broader principle of enforcing constraints through architectural design.
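
The kernel mask itself is simple to construct. The sketch below builds it as a nested list for a single-channel image, following the usual convention in the PixelCNN literature: a type "A" mask (first layer) also hides the center pixel, while a type "B" mask (later layers) keeps it. The function name `causal_mask` is hypothetical, and masking across color channels is left out for brevity.

```python
def causal_mask(kernel_size, mask_type):
    """Return a kernel_size x kernel_size mask for raster-order causality."""
    if mask_type not in ("A", "B"):
        raise ValueError("mask_type must be 'A' or 'B'")
    center = kernel_size // 2
    mask = [[0] * kernel_size for _ in range(kernel_size)]
    for row in range(kernel_size):
        for col in range(kernel_size):
            above = row < center                        # rows already generated
            left_in_row = row == center and col < center
            at_center = row == center and col == center
            if above or left_in_row or (at_center and mask_type == "B"):
                mask[row][col] = 1
    return mask

for mask_type in ("A", "B"):
    print(f"type {mask_type}:")
    for row in causal_mask(3, mask_type):
        print(" ", row)
```

In a PyTorch implementation this mask would typically be stored as a buffer and multiplied element-wise with the convolution weights before each forward pass, so that the masked weights never contribute to the output.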

Training & Evaluation

Training a generative model minimizes the negative log-likelihood (NLL) loss, which directly maximizes the probability the model assigns to training data. Evaluation uses multiple metrics: NLL on held-out test data, bits-per-dimension (NLL in bits, normalized by data dimension), and qualitative sample quality. NLL is an exact measure that is comparable across models evaluated on the same data; bits/dim normalizes by dimensionality so that data of different sizes can be compared fairly. Sampling, that is, drawing from the learned distribution, is a practical check on quality (do the samples look good?), while exact likelihood evaluation is the distinctive advantage of models with tractable likelihoods.

Students implement training loops with learning rate scheduling, regularization, and checkpoint saving. They compute and interpret test NLL/bits/dim, generate samples via sequential conditioning, and analyze failure cases. Understanding when models underfit or overfit, and how architectural choices affect both likelihood and sample quality, builds intuition for designing better generative models. These practices—careful evaluation, metric selection, and ablation studies—carry through to every subsequent problem set.
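
Converting NLL to bits/dim is a change of units followed by a division. The sketch below assumes the NLL is the total for one example, measured in nats (natural logarithm), and uses a 28×28 image size purely as an example. A uniform model over 256 pixel values serves as a sanity check, since it must cost exactly 8 bits per dimension.

```python
import math

def bits_per_dim(nll_nats, num_dims):
    """Convert a per-example NLL in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2))

# Sanity check (assumed setup): a uniform model over 256 values per pixel
# assigns log(1/256) to every dimension of a 28x28 image.
num_dims = 28 * 28
uniform_nll = num_dims * math.log(256)
uniform_bpd = bits_per_dim(uniform_nll, num_dims)

print(f"uniform baseline: {uniform_bpd:.2f} bits/dim")
```

A trained model on 8-bit image data should score below this 8 bits/dim baseline; a value above it usually signals a bug in the likelihood computation or the unit conversion.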

Key Takeaways

Problem Set 1 establishes core competencies: probability notation and manipulations, PyTorch proficiency, and the autoregressive factorization framework. Students see how elegant mathematical ideas (chain rule, conditional independence) translate into concrete algorithms and PyTorch code. The MADE and PixelCNN implementations demystify how theoretical constraints become practical network designs through masking and structured computation. The training and evaluation workflow—loss optimization, metric tracking, sample generation—is replicated and refined throughout the course.

These foundations are essential for later problem sets tackling variational autoencoders, diffusion models, flow-based models, and energy-based models. The autoregressive perspective—modeling complex distributions as products of simpler conditionals—remains a core technique in modern deep generative modeling. Mastery of this problem set positions students to understand and innovate in the rapidly evolving landscape of generative AI.

References & Further Reading

This section provides curated references for deeper understanding of autoregressive models and foundational generative modeling concepts. The papers, blogs, and resources listed below cover key topics from Problem Set 1, including chain rule and probability fundamentals, autoregressive density estimation, masked architectures (MADE, PixelCNN), and PyTorch implementation. Lilian Weng's blog posts offer intuitive explanations of complex concepts; arXiv papers provide rigorous mathematical foundations and empirical results. Stanford's course materials supplement the problem set and offer additional perspective on generative modeling as a field.

Students are encouraged to read these materials actively: work through derivations on paper, implement algorithms from scratch when possible, and compare different authors' explanations to build robust understanding. The references span theoretical foundations (probability theory, neural networks) to practical implementation (PyTorch, model architecture design) to empirical evaluation (likelihood metrics, sample quality assessment). This breadth reflects the reality of modern generative modeling research, which requires depth in mathematics, deep learning, and experimental methodology.