STANFORD XCS236 · DEEP GENERATIVE MODELS
Autoregressive Models
Week 1–2 · Chain Rule, MADE, PixelCNN & Transformers
01

Chain Rule Decomposition

Autoregressive models factorize the joint probability distribution p(x₁, x₂, ..., xₙ) using the chain rule of probability. Rather than modeling the entire joint distribution directly, we decompose it as a product of conditional distributions: p(x) = ∏ᵢ p(xᵢ|x₁, ..., xᵢ₋₁). This factorization is exact for any distribution, and the likelihood is tractable whenever each conditional is.

Ordering matters: any ordering of the variables gives a valid factorization, but different orderings produce different conditionals, and some are easier to model or more natural for a given domain (for example, left-to-right for text). Once an ordering is fixed, an autoregressive model parameterizes each conditional p(xᵢ|x₁, ..., xᵢ₋₁) with a neural network, which supports both density estimation (multiply the conditionals) and generation (sample the variables one at a time in order).
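
As a small numerical check of the factorization, the sketch below builds a random joint table over three binary variables, derives each conditional from it, and confirms that the sum of log conditionals equals the log of the joint. The helper names (`random_joint`, `prefix_prob`, `conditional`) are made up for this illustration.

```python
import itertools
import math
import random


def random_joint(n, seed=0):
    """Random joint distribution over n binary variables, as {state: prob}."""
    rng = random.Random(seed)
    states = list(itertools.product([0, 1], repeat=n))
    weights = [rng.random() + 1e-3 for _ in states]
    total = sum(weights)
    return {s: w / total for s, w in zip(states, weights)}


def prefix_prob(joint, prefix):
    """Marginal probability that the leading variables equal `prefix`."""
    k = len(prefix)
    return sum(p for s, p in joint.items() if s[:k] == prefix)


def conditional(joint, x, i):
    """p(x_i | x_1, ..., x_{i-1}) read off the joint table (i is 0-based)."""
    return prefix_prob(joint, x[: i + 1]) / prefix_prob(joint, x[:i])


def chain_rule_log_prob(joint, x):
    """Sum of log conditionals; by the chain rule this equals log p(x)."""
    return sum(math.log(conditional(joint, x, i)) for i in range(len(x)))


joint = random_joint(3)
x = (1, 0, 1)
log_p_direct = math.log(joint[x])
log_p_chain = chain_rule_log_prob(joint, x)
print(log_p_direct, log_p_chain)
```

A learned autoregressive model replaces the table lookups in `conditional` with a neural network, but the way the conditionals combine is the same.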

Exact Likelihood
Tractable Training
Sequential Generation
Order Dependent
02

NADE — Neural Density Estimation

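The Neural Autoregressive Density Estimator (NADE) is one of the first efficient neural autoregressive models. The hidden layer used to predict xᵢ sees only x₁, ..., xᵢ₋₁, because it uses only the first i−1 columns of a shared input-to-hidden weight matrix; the remaining columns are effectively masked out. The key innovation is this weight sharing: the hidden pre-activation for step i+1 is the one for step i plus a single column of the matrix, so all n conditionals of a fully observed x can be evaluated in one O(n·d) sweep for d hidden units, rather than recomputing each hidden layer from scratch.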

Because the input-to-hidden weights are reused across all conditional distributions, the number of parameters grows linearly in n rather than quadratically, as it would with a separate network per conditional. For binary data, NADE demonstrates strong empirical results on benchmarks like MNIST and Caltech101. Its likelihood is exact, which makes it useful for both density estimation and generative modeling.

Masked Weights

The prediction of xᵢ uses only the first i−1 columns of the shared weight matrix, so it depends only on x₁,...,xᵢ₋₁.

Weight Sharing

Efficient parameter sharing across all conditional distributions reduces model complexity significantly.

Single Pass

For a fully observed input, all conditionals are computed in one forward sweep, and the likelihood is exact, with no partition function to estimate as in Boltzmann machines.

Binary Data

Originally designed for binary data but extended to continuous and categorical variables later.
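
The weight-sharing recursion can be sketched in a few lines. This is a minimal, untrained NADE for binary data with random parameters; the parameter names (`W`, `c`, `V`, `b`) and the helper `init_nade` are assumptions made for this illustration, not notation from the slides.

```python
import itertools
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_nade(n, d, seed=0):
    """Random parameters for n binary inputs and d hidden units."""
    rng = random.Random(seed)

    def rand():
        return rng.uniform(-0.5, 0.5)

    return {
        "W": [[rand() for _ in range(n)] for _ in range(d)],  # d x n, shared
        "c": [rand() for _ in range(d)],                      # hidden bias
        "V": [[rand() for _ in range(d)] for _ in range(n)],  # n x d, output
        "b": [rand() for _ in range(n)],                      # output bias
    }


def nade_log_prob(params, x):
    """log p(x) as a sum of log conditionals, computed in O(n * d)."""
    W, c, V, b = params["W"], params["c"], params["V"], params["b"]
    d = len(c)
    a = list(c)  # pre-activation; at step i it depends only on x_1..x_{i-1}
    log_p = 0.0
    for i, x_i in enumerate(x):
        h = [sigmoid(a_k) for a_k in a]
        p_i = sigmoid(b[i] + sum(V[i][k] * h[k] for k in range(d)))
        log_p += math.log(p_i if x_i == 1 else 1.0 - p_i)
        # Weight sharing: extend the pre-activation with column i of W
        # instead of recomputing the hidden layer from scratch.
        for k in range(d):
            a[k] += W[k][i] * x_i
    return log_p


params = init_nade(n=4, d=3)
# The conditionals define a proper distribution: probabilities sum to one.
total = sum(
    math.exp(nade_log_prob(params, x))
    for x in itertools.product([0, 1], repeat=4)
)
print(total)
```

Sampling still proceeds one variable at a time; the single sweep applies to evaluating the likelihood of a fully observed input.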

03

MADE — Masked Autoencoder

MADE (Masked Autoencoder for Distribution Estimation) turns an ordinary feed-forward autoencoder into an autoregressive model by masking its weights. Each hidden unit h_j is assigned an integer m_j between 1 and n−1, the number of leading inputs it is allowed to see. A connection into h_j is kept only if the source unit's index does not exceed m_j (input xₖ has index k), and the output for xᵢ is connected only to hidden units with m_j < i. Every path from an input to the output for xᵢ therefore starts at one of x₁, ..., xᵢ₋₁.

Compared with NADE, MADE supports deep networks while keeping the likelihood tractable: one forward pass yields all n conditionals. The masks are binary matrices multiplied elementwise with the weight matrices, zeroing every connection that would violate the autoregressive property. The same principle carries over to other layer types, such as the masked convolutions of PixelCNN, as long as the masks are applied consistently.
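
The mask construction can be checked directly. The sketch below builds the two masks of a one-hidden-layer MADE with randomly assigned hidden-unit indices and verifies that output i is connected only to inputs with a smaller index. Function names are illustrative, and the masks would be multiplied elementwise with the weight matrices in a real network.

```python
import random


def made_masks(n_inputs, n_hidden, seed=0):
    """Binary masks for a one-hidden-layer MADE (requires n_inputs >= 2)."""
    rng = random.Random(seed)
    in_deg = list(range(1, n_inputs + 1))  # input x_d has index d
    hid_deg = [rng.randint(1, n_inputs - 1) for _ in range(n_hidden)]
    # Hidden unit with index m may see inputs 1..m.
    mask_in = [[1 if m >= d else 0 for d in in_deg] for m in hid_deg]
    # Output for x_d may use hidden units with index strictly below d.
    mask_out = [[1 if d > m else 0 for m in hid_deg] for d in in_deg]
    return mask_in, mask_out


def connectivity(mask_in, mask_out):
    """Number of input-to-output paths: entry [i][j] counts x_j -> output i."""
    n_out, n_hid, n_in = len(mask_out), len(mask_in), len(mask_in[0])
    return [
        [sum(mask_out[i][k] * mask_in[k][j] for k in range(n_hid))
         for j in range(n_in)]
        for i in range(n_out)
    ]


mask_in, mask_out = made_masks(n_inputs=5, n_hidden=16)
paths = connectivity(mask_in, mask_out)
for row in paths:
    print(row)
```

The printed matrix is strictly lower triangular. Its first row is all zeros, so the first output has no inputs and models the unconditional p(x₁) through its bias alone.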

Deep Networks
Masked Connections
Arbitrary Architectures
Elegant Design
04

PixelCNN — Image Generation

PixelCNN applies autoregressive modeling to images by factorizing p(image) = ∏ᵢ p(pixelᵢ|pixel₁, ..., pixelᵢ₋₁), with pixels ordered in raster scan (row by row from top to bottom, left to right within each row). The network uses masked convolutional filters to respect this ordering while processing the whole image in parallel during training; the Gated PixelCNN variant adds gated activations. Color channels (R, G, B) follow a strict order: red depends on prior pixels, green on prior pixels and the current red value, and blue on prior pixels and the current red and green values.
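
Written out for a single pixel i, with channel values denoted x_{i,R}, x_{i,G}, x_{i,B} (notation assumed here for illustration), the channel ordering described above corresponds to:

```latex
p(x_i \mid x_{<i}) =
  p(x_{i,R} \mid x_{<i}) \,
  p(x_{i,G} \mid x_{<i}, x_{i,R}) \,
  p(x_{i,B} \mid x_{<i}, x_{i,R}, x_{i,G})
```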

The masks zero out the filter weights on positions that come after the current pixel in raster order (and, in the first layer, on the current pixel itself), so no layer can leak information from pixels that have not been generated yet. The receptive field grows with each layer, allowing predictions to incorporate increasingly distant context. Gated units (multiplicative gates) improve expressiveness. Sampling is sequential, with one forward pass per generated value, which is slow but draws exactly from the model distribution. The model achieves high-quality generation on datasets like CIFAR-10 and ImageNet, though generation speed remains a practical limitation.

Masked Convolutions

Convolutional filters masked to respect pixel ordering, preventing information flow from future pixels.

Gated Units

Multiplicative gates amplify expressiveness by mixing feature channels nonlinearly in each layer.

Receptive Field

Grows systematically across layers, allowing distant spatial context to inform pixel predictions.

Sequential Generation

Pixels generated one-at-a-time left-to-right, top-to-bottom, enabling exact sampling from distribution.
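
A minimal sketch of the mask applied to a square convolution filter, for a single channel. The "A"/"B" naming follows the usual PixelCNN convention (type A also hides the center pixel and is used in the first layer; type B keeps it and is used in later layers). The function name is made up, and the R/G/B sub-ordering is ignored here.

```python
def pixelcnn_mask(kernel_size, mask_type):
    """Binary mask for a kernel_size x kernel_size filter in raster order."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd so the filter has a center")
    if mask_type not in ("A", "B"):
        raise ValueError("mask_type must be 'A' or 'B'")
    center = kernel_size // 2
    mask = [[0] * kernel_size for _ in range(kernel_size)]
    for row in range(kernel_size):
        for col in range(kernel_size):
            above = row < center                   # rows already generated
            left = row == center and col < center  # earlier in the same row
            itself = row == center and col == center and mask_type == "B"
            if above or left or itself:
                mask[row][col] = 1
    return mask


for name in ("A", "B"):
    print(name)
    for row in pixelcnn_mask(3, name):
        print(row)
```

The mask is multiplied elementwise with the filter weights before every convolution, so the masked weights never contribute to the output.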

05

WaveNet — Audio Synthesis

WaveNet adapts autoregressive modeling to raw audio waveforms, modeling p(audio) as a product of conditionals over timesteps. Its key innovation is a stack of dilated (atrous) causal convolutions, which enlarges the receptive field without enlarging the filters. Successive layers use dilation rates 1, 2, 4, 8, ..., 2^k, so the receptive field grows exponentially with depth while the parameter count grows only linearly, and each prediction can depend on a long history with few layers.

Causal convolutions ensure that the prediction for the sample at time t depends only on samples at earlier timesteps (< t), maintaining autoregressive validity. Gated activations similar to those in PixelCNN enhance expressiveness. WaveNet generates audio by sampling from the predicted distribution at each timestep, producing high-fidelity speech, music, and instrument synthesis. Conditioning on mel-spectrograms or text enables voice conversion and TTS applications. Although sequential generation is slow, the model's quality and conditioning flexibility made it highly influential.
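
The effect of dilation on the receptive field can be seen with a toy stack of causal convolutions. The sketch assumes filters of width 2 with fixed weights of 1.0 and no nonlinearities or gating, purely to trace which inputs reach which outputs; the function names are illustrative.

```python
def causal_dilated_conv(x, w_prev, w_curr, dilation):
    """y[t] = w_curr * x[t] + w_prev * x[t - dilation], zero-padded on the left."""
    return [
        w_curr * x[t] + w_prev * (x[t - dilation] if t >= dilation else 0.0)
        for t in range(len(x))
    ]


def receptive_field(dilations, kernel_size=2):
    """Number of input steps that can influence one output step."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def stack(x, dilations):
    """Apply one causal layer per dilation rate, all weights set to 1.0."""
    h = list(x)
    for d in dilations:
        h = causal_dilated_conv(h, 1.0, 1.0, d)
    return h


dilations = [1, 2, 4, 8]
impulse = [1.0] + [0.0] * 19
response = stack(impulse, dilations)
# The impulse at step 0 reaches exactly the next receptive_field(...) outputs.
print(receptive_field(dilations), response)
```

In this sketch the output at index t depends on inputs at indices ≤ t. To use it as a predictor of the next sample, the output at index t is read as the distribution over the sample at t+1 (equivalently, the input is shifted by one step).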

Dilated Convolutions
Causal Masking
Raw Waveforms
μ-law Encoding
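
The μ-law encoding named above compresses each amplitude logarithmically before quantizing it, so that a small set of integer classes covers the waveform's range. A minimal sketch follows, assuming μ = 255 (256 levels, the standard 8-bit setting) and samples already scaled to [-1, 1]; the function names are made up for this example.

```python
import math


def mu_law_encode(x, mu=255):
    """Map an amplitude in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((compressed + 1.0) / 2.0 * mu))


def mu_law_decode(q, mu=255):
    """Approximate inverse: map an integer class back to an amplitude."""
    compressed = 2.0 * q / mu - 1.0
    magnitude = math.expm1(abs(compressed) * math.log1p(mu)) / mu
    return math.copysign(magnitude, compressed)


for amplitude in (-1.0, -0.01, 0.01, 0.5, 1.0):
    q = mu_law_encode(amplitude)
    print(amplitude, q, mu_law_decode(q))
```

Quiet samples get finer resolution than loud ones, and the model can then predict a categorical distribution over the integer classes at each timestep.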
06

Transformer Language Models

Transformer language models apply the autoregressive factorization to text using a decoder-only Transformer, as in the GPT family. Tokens are generated sequentially, and each token's distribution p(tᵢ|t₁,...,tᵢ₋₁) is modeled using self-attention with a causal mask. The mask sets the attention scores for future positions to −∞ before the softmax, so their attention weights are exactly zero and the representation used to predict tᵢ is computed only from t₁,...,tᵢ₋₁.

Transformers scale well with model size and data. Empirical scaling laws show that test loss (and hence perplexity) decreases predictably, approximately as a power law, with model size, dataset size, and compute. Large language models (GPT-2, GPT-3, etc.) achieve remarkable few-shot performance through in-context learning: they adapt their behavior to examples in the prompt without fine-tuning. The combination of parallel training, predictable scaling, and abilities that emerge at scale makes them dominant in modern generative AI.

Causal Attention

Self-attention mask prevents tokens from attending to future positions, enforcing autoregressive property.

Scaling Laws

Perplexity decreases predictably with compute; larger models outperform smaller ones when given enough data and training compute.

In-Context Learning

Models learn from prompt examples at inference time without parameter updates or fine-tuning.

Efficient Computation

Training is parallel across positions even though generation is sequential; self-attention links any two positions in a single layer, at a cost that grows quadratically with sequence length.
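
The causal mask can be illustrated on a raw score matrix. The sketch below takes a square matrix of attention scores (one row per query position) and returns the attention weights after masking and softmax; it omits the query/key/value projections and multiple heads, and the function name is illustrative.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax of scores[i][j] with future positions j > i masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions get a score of -inf, which the softmax maps to 0.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        peak = max(masked)  # finite, because position i itself is never masked
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [
    [0.2, 1.5, -0.3],
    [0.7, 0.1, 2.0],
    [1.0, 1.0, 1.0],
]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Position i attends to positions up to and including i, and its output is used to predict the token at position i+1. Because the mask is applied to the whole matrix at once, all positions can be trained in parallel.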

07

Sampling Strategies

Autoregressive generation requires a sampling strategy at inference time. Greedy decoding selects the highest-probability token at each step, which is fast but often repetitive. Temperature scaling divides the logits by a temperature T before the softmax: T > 1 flattens the distribution (more randomness), and T < 1 sharpens it (more deterministic). Top-k sampling restricts sampling to the k most likely tokens and renormalizes, removing the low-probability tail while preserving diversity.
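
Greedy decoding, temperature scaling, and top-k filtering can each be expressed as a small function acting on the logits or probabilities for one step. This is a minimal sketch over a hypothetical four-token vocabulary; the function names are illustrative.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(probs):
    """Index of the most likely token."""
    return max(range(len(probs)), key=probs.__getitem__)


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize; the rest get 0."""
    keep = set(sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def sample(probs, rng):
    """Draw one token index from a probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for a 4-token vocabulary
sharp = softmax(logits, temperature=0.5)
base = softmax(logits)
flat = softmax(logits, temperature=2.0)
filtered = top_k_filter(base, k=2)
rng = random.Random(0)
print(greedy(base), sample(filtered, rng))
print([round(p, 3) for p in sharp], [round(p, 3) for p in flat])
```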

Nucleus (top-p) sampling keeps the smallest set of tokens whose cumulative probability is ≥ p, renormalizes, and samples from that set, so the number of candidates adapts to the shape of the distribution. Beam search explores multiple hypotheses in parallel, keeping the b best partial sequences at each step and returning the highest-scoring complete sequence. These strategies trade off quality (determinism), diversity (randomness), and computational cost (inference speed).
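
A sketch of the nucleus filter on two hypothetical next-token distributions shows how the size of the kept set adapts: a flat distribution keeps several candidates, a peaked one keeps a single token. The function name is illustrative.

```python
def top_p_filter(probs, p):
    """Keep the smallest top-ranked set with cumulative mass >= p; renormalize."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    kept_set = set(kept)
    return [probs[i] / total if i in kept_set else 0.0 for i in range(len(probs))]


spread = [0.5, 0.3, 0.15, 0.05]    # hypothetical flat-ish distribution
peaked = [0.97, 0.01, 0.01, 0.01]  # hypothetical confident distribution

spread_kept = top_p_filter(spread, p=0.9)
peaked_kept = top_p_filter(peaked, p=0.9)
print(spread_kept)
print(peaked_kept)
```

Sampling then draws from the filtered vector exactly as it would from the full distribution.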

Greedy: Fast, Dull
Beam Search: Quality-Focused
Top-k/p: Diversity
Temperature: Stochasticity
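
Beam search can be sketched against a toy next-token model. The bigram table below is entirely hypothetical and chosen so that the greedy choice at the first step leads to a lower-probability sequence than the one beam search finds. The sketch omits length normalization and other refinements used in practice.

```python
import math

# Hypothetical toy model: p(next token | previous token).
BIGRAM = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"c": 0.4, "d": 0.35, "e": 0.25},
    "b": {"c": 0.9, "d": 0.1},
    "c": {"</s>": 1.0},
    "d": {"</s>": 1.0},
    "e": {"</s>": 1.0},
}


def beam_search(model, beam_width, max_len=10, bos="<s>", eos="</s>"):
    """Return (tokens, log_prob) of the best sequence found."""
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = [
            (tokens + [tok], score + math.log(prob))
            for tokens, score in beams
            for tok, prob in model[tokens[-1]].items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[1])


greedy_tokens, greedy_score = beam_search(BIGRAM, beam_width=1)
beam_tokens, beam_score = beam_search(BIGRAM, beam_width=2)
print(greedy_tokens, math.exp(greedy_score))
print(beam_tokens, math.exp(beam_score))
```

With a beam width of 1 the procedure reduces to greedy decoding; widening the beam multiplies the per-step cost by roughly the beam width.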
08

Strengths & Limitations

Autoregressive models offer fundamental strengths. They provide exact likelihoods, enabling principled comparison via log-likelihood on held-out test sets. Sampling is exact as well: ancestral sampling draws each variable from its conditional in order, with no approximation. Conditioning on a prefix of the ordering, or on external context fed to the network, is straightforward, although conditioning on an arbitrary subset of variables is generally not tractable. The likelihood also gives a clear training objective, with no variational bounds or adversarial losses.

However, the limitations are significant. Sequential generation is slow: generating N tokens requires N forward passes. Inductive biases are weak compared with more structured models, and the model must commit to one variable ordering (e.g., word order in text, raster order in images), a choice that affects results. Long-range dependencies are computationally expensive to model. Modern large language models compensate through massive scale and data, but the fundamental cost of sequential generation remains. Other generative model families (diffusion, flow, VAE) make different trade-offs, such as faster generation, conditional generation, or latent structure learning.

Strengths

  • Exact, tractable likelihood for principled evaluation
  • Exact ancestral sampling with no approximations
  • Straightforward conditioning on a prefix or external context
  • Proven scaling laws with emergent capabilities

Limitations

  • Sequential generation is slow (O(N) forward passes)
  • Order-dependent; sensitivity to variable ordering
  • Weak inductive bias vs. structured models
  • Expensive long-range dependencies
09

References & Further Reading

Autoregressive models have deep roots in probability theory and machine learning. This section provides key references for further study of chain rule factorization, neural density estimation, and practical applications in image and audio generation.

The papers below establish both theoretical foundations and efficient implementations that have become standard tools for generative modeling across multiple domains.