Autoregressive Models

STANFORD XCS236 · DEEP GENERATIVE MODELS

Chain Rule Decomposition

Autoregressive models factorize the joint probability distribution p(x₁, x₂, ..., xₙ) using the chain rule of probability. Rather than modeling the entire joint distribution directly, we decompose it as a product of conditional distributions: p(x) = ∏ᵢ p(xᵢ|x₁, ..., xᵢ₋₁). This factorization is theoretically exact and allows tractable computation of likelihood.

The key insight is that ordering matters: different orderings yield different conditional factorizations, though all are valid. Some orderings may be more efficient or natural depending on domain structure. Once an ordering is fixed, an autoregressive model learns each conditional distribution p(xᵢ|x₁, ..., xᵢ₋₁) using a neural network, enabling both density estimation and generation.

Exact Likelihood

Tractable Training

Sequential Generation

Order Dependent
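To make the chain-rule factorization concrete, here is a minimal NumPy sketch of a toy autoregressive model over binary variables. It is an illustration under assumptions, not a model from the course: each conditional is a logistic regression on the earlier variables, and the names `log_likelihood`, `sample`, `W`, and `b` are hypothetical. It shows that the product of conditionals gives a normalized joint and that sampling proceeds one variable at a time.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(x, W, b):
    """log p(x) = sum_i log p(x_i | x_1..x_{i-1}) for a binary vector x."""
    # Strictly lower-triangular weights: row i only sees x_j with j < i.
    logits = np.tril(W, k=-1) @ x + b
    p = sigmoid(logits)                      # p(x_i = 1 | x_<i)
    return float(np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)))

def sample(W, b, rng):
    """Ancestral sampling: draw x_1, then x_2 given x_1, and so on."""
    n = len(b)
    x = np.zeros(n)
    L = np.tril(W, k=-1)
    for i in range(n):
        p_i = sigmoid(L[i] @ x + b[i])       # uses only already-sampled entries
        x[i] = float(rng.random() < p_i)
    return x

rng = np.random.default_rng(0)
n = 4
W = rng.normal(size=(n, n))                  # hypothetical, untrained parameters
b = rng.normal(size=n)

# Summing the product of conditionals over all 2^n configurations gives 1.
total = sum(
    np.exp(log_likelihood(np.array(bits, dtype=float), W, b))
    for bits in np.ndindex(*(2,) * n)
)
x = sample(W, b, rng)
```

Evaluating `log_likelihood` needs no sequential loop because all inputs are observed, whereas `sample` must loop because each variable depends on the ones drawn before it.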

NADE — Neural Density Estimation

The Neural Autoregressive Distribution Estimator (NADE, also expanded as Neural Autoregressive Density Estimation) is one of the first efficient neural autoregressive models. The hidden activation used to predict xᵢ is computed only from x₁, ..., xᵢ₋₁, which is equivalent to masking out the columns of a shared input weight matrix that belong to later variables. The key innovation is weight sharing: consecutive hidden pre-activations differ by a single column of that matrix, so all n conditionals can be evaluated in one sweep over the inputs instead of running n separate networks.

NADE learns a shared hidden layer representation that is reused across all conditional distributions. For binary data, it demonstrates strong empirical results on benchmarks like binarized MNIST and Caltech-101 Silhouettes. Because parameters are shared, the model scales better than one that gives each conditional its own network, and it provides exact likelihood computation, making it valuable for both density estimation and generative modeling tasks.

Masked Weights

Ensures xᵢ prediction depends only on x₁,...,xᵢ₋₁ through careful masking of weight matrices during training.

Weight Sharing

Efficient parameter sharing across all conditional distributions reduces model complexity significantly.

Single Pass

Computes all conditionals in one sweep over the inputs, with no need for the approximate likelihood estimation that Boltzmann machines require.

Binary Data

Originally designed for binary data but extended to continuous and categorical variables later.
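The weight sharing and single-sweep evaluation described above can be sketched as follows. This is a minimal NumPy illustration for binary data under assumed shapes, not a reference implementation; the names `nade_log_likelihood`, `W`, `V`, `b`, and `c` are hypothetical, and the parameters are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nade_log_likelihood(x, W, V, b, c):
    """Log-likelihood of a binary vector x under a NADE-style model.

    W: (H, D) shared input-to-hidden weights, V: (D, H) output weights,
    b: (D,) output biases, c: (H,) hidden biases.
    """
    a = c.copy()                 # hidden pre-activation before any input is seen
    log_p = 0.0
    for i in range(len(x)):
        h = sigmoid(a)                       # depends only on x_0 .. x_{i-1}
        p_i = sigmoid(b[i] + V[i] @ h)       # p(x_i = 1 | x_<i)
        log_p += x[i] * np.log(p_i) + (1 - x[i]) * np.log(1 - p_i)
        a += W[:, i] * x[i]                  # O(H) update reuses earlier work
    return float(log_p)

rng = np.random.default_rng(0)
D, H = 5, 3                                  # illustrative sizes
W = rng.normal(size=(H, D))
V = rng.normal(size=(D, H))
b = rng.normal(size=D)
c = rng.normal(size=H)
x = rng.integers(0, 2, size=D).astype(float)
ll = nade_log_likelihood(x, W, V, b, c)
```

The running update of `a` is what makes the total cost O(D·H): recomputing `W[:, :i] @ x[:i]` from scratch for every i would cost O(D²·H) instead.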

MADE — Masked Autoencoder

MADE (Masked Autoencoder for Distribution Estimation) extends the NADE idea to more expressive architectures. It turns an ordinary autoencoder into an autoregressive model by masking its weights. Each hidden unit h_j is assigned an integer index m_j between 1 and n−1, meaning it may depend only on the inputs x₁, ..., x_{m_j}. A weight into h_j is kept only if its source has index at most m_j, and the output for xᵢ connects only to hidden units with m_j < i, so every path from input to output respects the autoregressive ordering.

MADE improves upon NADE by supporting deep networks while maintaining tractability: all conditionals come out of a single forward pass. The approach is elegant: masks are binary matrices, multiplied elementwise with the weights, that zero out connections violating the autoregressive property. The original formulation targets fully connected layers, but the same masking principle carries over to other layer types (for example, masked convolutions) as long as the masks are applied consistently.

Deep Networks

Masked Connections

Arbitrary Architectures

Elegant Design
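The masking idea can be sketched for a single hidden layer as follows. This is a minimal NumPy illustration, assuming the natural input ordering and randomly assigned hidden indices; `made_masks`, `made_forward`, and the parameter names are hypothetical.

```python
import numpy as np

def made_masks(D, H, rng):
    """Build input-to-hidden and hidden-to-output masks for one hidden layer."""
    m_in = np.arange(1, D + 1)               # input/output indices 1..D
    m_h = rng.integers(1, D, size=H)         # hidden indices in 1..D-1
    # Hidden unit k may see input d only if m_h[k] >= m_in[d].
    M1 = (m_h[:, None] >= m_in[None, :]).astype(float)    # (H, D)
    # Output d may see hidden unit k only if m_in[d] > m_h[k] (strict).
    M2 = (m_in[:, None] > m_h[None, :]).astype(float)     # (D, H)
    return M1, M2

def made_forward(x, W1, b1, W2, b2, M1, M2):
    """Masked forward pass; returns one logit per variable x_i."""
    h = np.maximum(0.0, (W1 * M1) @ x + b1)  # ReLU hidden layer
    return (W2 * M2) @ h + b2

rng = np.random.default_rng(0)
D, H = 6, 16                                 # illustrative sizes
M1, M2 = made_masks(D, H, rng)
W1, b1 = rng.normal(size=(H, D)), rng.normal(size=H)
W2, b2 = rng.normal(size=(D, H)), rng.normal(size=D)
x = rng.normal(size=D)
logits = made_forward(x, W1, b1, W2, b2, M1, M2)

# Entry (i, j) counts the paths from input j to output i.
connectivity = M2 @ M1
```

If the masks are correct, `connectivity` is strictly lower triangular: output i has no path from input i or from any later input, which is exactly the autoregressive property.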

PixelCNN — Image Generation

PixelCNN applies autoregressive modeling to image generation by factorizing p(image) = ∏ᵢ p(pixelᵢ|pixel₁, ..., pixelᵢ₋₁), where pixels are ordered in raster scan (row by row from top to bottom, left to right within each row). The network uses masked convolutional filters to respect the ordering while processing images efficiently; the Gated PixelCNN variant adds gated activation units on top of the masked convolutions. Color channels (R, G, B) follow strict dependencies: red depends on prior pixels, green on prior pixels and the current red, blue on prior pixels and the current red and green.

PixelCNN's masked convolutions use a receptive field carefully designed to avoid causality violations. Each layer's receptive field grows, allowing predictions to incorporate increasingly distant context. Gated units (multiplicative gates) improve expressiveness. The model generates pixels sequentially, which is slow but guarantees valid distributions. High-quality image generation is achieved on datasets like CIFAR-10 and ImageNet, though generation speed remains a practical limitation.

Masked Convolutions

Convolutional filters masked to respect pixel ordering, preventing information flow from future pixels.

Gated Units

Multiplicative gates amplify expressiveness by mixing feature channels nonlinearly in each layer.

Receptive Field

Grows systematically across layers, allowing distant spatial context to inform pixel predictions.

Sequential Generation

Pixels generated one-at-a-time left-to-right, top-to-bottom, enabling exact sampling from distribution.
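A masked convolution can be sketched as below. This is a simplified single-channel NumPy illustration that ignores the R, G, B sub-ordering; the function names are hypothetical. It follows the usual convention in which a type "A" mask (first layer) also hides the center pixel, while a type "B" mask (later layers) keeps it.

```python
import numpy as np

def pixelcnn_mask(k, mask_type):
    """k x k mask that hides the current (type A only) and all future pixels."""
    assert k % 2 == 1 and mask_type in ("A", "B")
    mask = np.ones((k, k))
    center = k // 2
    start = center + 1 if mask_type == "B" else center
    mask[center, start:] = 0     # center row: hide from the center (A) or just after it (B)
    mask[center + 1:, :] = 0     # hide every row below the center
    return mask

def masked_conv2d(image, kernel, mask):
    """Naive zero-padded 2D cross-correlation with a masked kernel."""
    k = kernel.shape[0]
    padded = np.pad(image, k // 2)
    w = kernel * mask
    out = np.zeros(image.shape)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            out[r, c] = np.sum(w * padded[r:r + k, c:c + k])
    return out

mask_a = pixelcnn_mask(3, "A")
# [[1, 1, 1],
#  [1, 0, 0],
#  [0, 0, 0]]
```

With a type A mask, the output at a pixel depends only on pixels that come strictly earlier in raster order, so perturbing one input pixel leaves all outputs at that pixel and before it unchanged.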

WaveNet — Audio Synthesis

WaveNet adapts autoregressive modeling to raw audio waveforms, modeling p(audio) as a product of conditionals over time-steps. The innovation is dilated (atrous) causal convolutions that efficiently expand receptive field size without increasing parameter count. Each layer applies convolutions at different dilation rates (1, 2, 4, 8, ..., 2^k), allowing each timestep's prediction to depend on exponentially many prior timesteps with few layers.

Causal convolutions ensure that the prediction for the sample at time t depends only on samples at timesteps < t, maintaining autoregressive validity. Gated activations similar to those in PixelCNN enhance expressiveness. WaveNet generates audio by sampling from the predicted distribution at each timestep, producing high-fidelity speech, music, and instrument synthesis. Conditioning on mel-spectrograms or text enables voice conversion and TTS applications. Despite the slowness of sequential generation, the model's quality and conditioning flexibility made it highly influential.

Dilated Convolutions

Causal Masking

Raw Waveforms

μ-law Encoding
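The following NumPy sketch illustrates a dilated causal convolution, the receptive field of a stack of such layers, and μ-law quantization to 256 levels. It is a simplified single-channel illustration with hypothetical function names, not WaveNet's actual implementation. Each layer's output at index t reads inputs at indices ≤ t; in an autoregressive setup that output is then used to predict the next sample.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """y[t] = sum_j w[j] * x[t - j * dilation], treating x as zero before t = 0."""
    T = len(x)
    y = np.zeros(T)
    for j, w_j in enumerate(w):
        shift = j * dilation
        if shift < T:
            y[shift:] += w_j * x[:T - shift]
    return y

def receptive_field(kernel_size, dilations):
    """Number of input timesteps that can influence one output of the stack."""
    return 1 + (kernel_size - 1) * sum(dilations)

def mu_law_encode(x, mu=255):
    """Compand a waveform in [-1, 1] and quantize it to integers in 0..mu."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

# Ten layers with kernel size 2 and dilations 1, 2, 4, ..., 512.
dilations = [2 ** i for i in range(10)]
rf = receptive_field(2, dilations)           # 1024 timesteps
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth, while each layer still has only `kernel_size` weights per channel.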

Transformer Language Models

Transformer language models apply autoregressive factorization to text, as in the GPT architecture. Tokens are generated sequentially, and each token's distribution p(tᵢ|t₁,...,tᵢ₋₁) is modeled using self-attention with a causal mask. The mask sets the attention scores for future positions to −∞ before the softmax, so their attention weights are exactly zero and no position can attend to the tokens it is supposed to predict.

Transformers scale well with model size and data. Empirical scaling laws show that test loss, and hence perplexity, decreases predictably (roughly as a power law) with model size, data, and compute. Large language models (GPT-2, GPT-3, etc.) achieve remarkable few-shot learning through in-context learning: adapting behavior based on prompt examples without fine-tuning. The combination of transformer efficiency, scaling law predictability, and emergent abilities makes them dominant in modern generative AI.

Causal Attention

Self-attention mask prevents tokens from attending to future positions, enforcing autoregressive property.

Scaling Laws

Perplexity decreases predictably with compute; larger models consistently outperform smaller ones.

In-Context Learning

Models learn from prompt examples at inference time without parameter updates or fine-tuning.

Efficient Computation

Parallelizable training despite sequential generation; self-attention enables long-range dependencies efficiently.
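Causal attention can be sketched for a single head in NumPy as follows. This is a minimal illustration with a hypothetical function name and without learned projections, multiple heads, or batching.

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Scaled dot-product attention where position i attends only to j <= i.

    Q, K, V: arrays of shape (T, d). Returns (outputs, attention_weights).
    """
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True where j > i
    scores = np.where(future, -np.inf, scores)           # block future positions
    scores = scores - scores.max(axis=-1, keepdims=True) # numerical stability
    weights = np.exp(scores)                             # exp(-inf) = 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d = 5, 8                                  # illustrative sizes
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
out, weights = causal_self_attention(Q, K, V)
```

Because the mask is applied to the full T×T score matrix at once, all positions are processed in parallel during training, even though generation still emits one token at a time.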

Sampling Strategies

Autoregressive generation requires sampling strategies at inference time. Greedy decoding selects the highest-probability token at each step—fast but often repetitive. Temperature scaling adjusts probability distributions: higher temperatures flatten distributions (more randomness), lower temperatures sharpen them (more deterministic). Top-k sampling restricts sampling to the k most likely tokens, eliminating very low-probability tail noise while preserving diversity.
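Greedy decoding, temperature scaling, and top-k filtering can each be written in a few lines operating on a vector of next-token logits. The sketch below uses NumPy with hypothetical function names and a made-up logit vector.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)              # subtract max for stability
    e = np.exp(z)
    return e / e.sum()

def greedy(logits):
    """Pick the single most likely token."""
    return int(np.argmax(logits))

def temperature_probs(logits, temperature):
    """temperature > 1 flattens the distribution, temperature < 1 sharpens it."""
    return softmax(np.asarray(logits, dtype=float) / temperature)

def top_k_probs(logits, k):
    """Keep the k largest logits, drop the rest, and renormalize."""
    logits = np.asarray(logits, dtype=float)
    keep = np.argsort(logits)[-k:]
    filtered = np.full_like(logits, -np.inf)
    filtered[keep] = logits[keep]
    return softmax(filtered)

def sample_token(probs, rng):
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0])     # illustrative values
token = sample_token(top_k_probs(logits / 0.8, k=2), rng)
```

The last line combines both controls, which is common in practice: the logits are first divided by the temperature and then restricted to the k best candidates before sampling.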

Nucleus (top-p) sampling selects the smallest set of tokens with cumulative probability ≥ p, adapting dynamically to distribution shape. Beam search explores multiple hypothesis sequences in parallel, keeping the b best partial sequences at each step and returning the highest-scoring complete sequence. These strategies trade between quality (determinism), diversity (randomness), and computational cost (inference speed).
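Nucleus filtering can be sketched as follows: sort tokens by probability, keep the shortest prefix whose cumulative mass reaches p, and renormalize. The function name and example distributions are hypothetical.

```python
import numpy as np

def top_p_probs(probs, p):
    """Keep the smallest set of tokens with cumulative probability >= p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                      # most likely first
    cumulative = np.cumsum(probs[order])
    # Index of the first prefix whose mass is >= p, converted to a count.
    cutoff = min(int(np.searchsorted(cumulative, p)) + 1, len(probs))
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

flat = top_p_probs([0.5, 0.3, 0.15, 0.05], p=0.7)        # keeps 2 tokens
peaked = top_p_probs([0.97, 0.01, 0.01, 0.01], p=0.9)    # keeps 1 token
```

The two calls show the adaptive behavior: with the same style of threshold, a flatter distribution retains more candidates than a sharply peaked one, which a fixed k cannot do.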

Greedy

Fast, Dull

Beam Search

Quality-Focused

Top-k/p

Diversity

Temperature

Stochasticity
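Beam search can be sketched on a toy model. In the example below, the "model" is an assumed first-order transition table over three tokens (so the next-token distribution depends only on the last token), all sequences have a fixed length, and the names are hypothetical. A beam width of 1 reduces to greedy decoding.

```python
import numpy as np

def beam_search(log_trans, start, length, beam_width):
    """Return (log-probability, sequence) of the best sequence found.

    log_trans[a, b] is log p(next = b | previous = a) in this toy model.
    """
    beams = [(0.0, [start])]
    for _ in range(length):
        candidates = []
        for score, seq in beams:
            for token, lp in enumerate(log_trans[seq[-1]]):
                candidates.append((score + lp, seq + [token]))
        # Keep only the beam_width highest-scoring partial sequences.
        candidates.sort(key=lambda cand: cand[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Illustrative transition probabilities; each row sums to 1.
trans = np.array([
    [0.10, 0.50, 0.40],
    [0.34, 0.33, 0.33],
    [0.90, 0.05, 0.05],
])
log_trans = np.log(trans)

greedy_score, greedy_seq = beam_search(log_trans, start=0, length=2, beam_width=1)
beam_score, beam_seq = beam_search(log_trans, start=0, length=2, beam_width=2)
```

Here greedy decoding commits to token 1 (probability 0.5) and ends with total probability 0.17, while a beam of width 2 keeps token 2 alive and finds the sequence 0, 2, 0 with probability 0.36.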

Strengths & Limitations

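Autoregressive models offer fundamental strengths: they provide exact likelihood computation, enabling principled comparison via log-likelihood on test sets. Ancestral sampling draws exact samples from the model distribution, one variable at a time, with no approximation. Conditioning is straightforward on any prefix of the ordering (for example, a text prompt) or on external side information, although conditioning on an arbitrary subset of variables is not tractable in general. Likelihood provides a clear optimization target during training without variational bounds or adversarial losses.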

However, limitations are significant. Sequential generation is slow: generating N tokens requires N forward passes. The model must commit to a variable ordering, and that choice (e.g., word order in text, pixel ordering in images) affects results. Long-range dependencies are expensive to model computationally. Modern large language models compensate through massive scale and data, but the fundamental cost of sequential generation remains. Other generative model families (diffusion, flow, VAE) make different trade-offs, such as faster generation, conditional generation, or latent structure learning.

Strengths

Exact, tractable likelihood for principled evaluation
Sample-accurate generation with no approximations
Flexible conditioning on prefixes and side information
Proven scaling laws with emergent capabilities
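Because the likelihood is exact, held-out evaluation reduces to averaging log-probabilities. The sketch below shows two standard summaries: perplexity for token sequences and bits per dimension for images or audio. The function names are hypothetical, and the log-probabilities are assumed to be natural logs produced by some trained model.

```python
import numpy as np

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token (natural log)."""
    return float(np.exp(-np.mean(token_log_probs)))

def bits_per_dim(example_log_probs, num_dims):
    """Average negative log-likelihood per dimension, converted to bits.

    example_log_probs: total log p(x) in nats for each example.
    num_dims: number of variables per example (e.g. pixels times channels).
    """
    mean_nll = -np.mean(example_log_probs)
    return float(mean_nll / (num_dims * np.log(2)))

# A model that is uniform over 4 tokens has perplexity 4 and costs 2 bits per token.
uniform_lp = np.log(np.full(10, 0.25))
ppl = perplexity(uniform_lp)
bpd = bits_per_dim(np.array([uniform_lp.sum()]), num_dims=10)
```

Lower is better for both numbers, and because they come from exact likelihoods they are directly comparable across autoregressive models evaluated on the same data and tokenization.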

Limitations

Sequential generation is slow (O(N) forward passes)
Order-dependent; sensitivity to variable ordering
Weak inductive bias vs. structured models
Expensive long-range dependencies

References & Further Reading

Autoregressive models have deep roots in probability theory and machine learning. This section provides key references for further study of chain rule factorization, neural density estimation, and practical applications in image and audio generation.

The papers below establish both theoretical foundations and efficient implementations that have become standard tools for generative modeling across multiple domains.

