Autoregressive models factorize the joint probability distribution p(x₁, x₂, ..., xₙ) using the chain rule of probability. Rather than modeling the entire joint distribution directly, we decompose it as a product of conditional distributions: p(x) = ∏ᵢ p(xᵢ|x₁, ..., xᵢ₋₁). This factorization is theoretically exact and allows tractable computation of likelihood.
The key insight is that ordering matters—different orderings yield different conditional factorizations, though all are valid. Some orderings may be more efficient or natural depending on domain structure. Once established, an autoregressive model learns each conditional distribution p(xᵢ|x₁, ..., xᵢ₋₁) using a neural network, enabling both density estimation and generation.
Exact Likelihood
Tractable Training
Sequential Generation
Order Dependent
02
NADE — Neural Density Estimation
The Neural Autoregressive Distribution Estimator (NADE) is one of the first efficient neural autoregressive models. It ties the weights used by all n conditionals so that the prediction of xᵢ depends only on x₁, ..., xᵢ₋₁, which is equivalent to masking out the weights attached to xᵢ, ..., xₙ. The key innovation is this weight sharing: hidden pre-activations are accumulated incrementally, so the whole distribution is evaluated in a single sweep rather than with n separate networks.
NADE learns a shared hidden layer representation that is reused across all conditional distributions. For binary data, it demonstrates strong empirical results on benchmarks like MNIST and Caltech101. The model scales better than fully general density estimators and provides exact likelihood computation, making it valuable for both density estimation and generative modeling tasks.
Masked Weights
Ensures xᵢ prediction depends only on x₁,...,xᵢ₋₁ through careful masking of weight matrices during training.
Weight Sharing
Efficient parameter sharing across all conditional distributions reduces model complexity significantly.
Single Pass
Computes all conditionals in one forward pass, with no intractable partition function to estimate as in Boltzmann machines.
Binary Data
Originally designed for binary data but extended to continuous and categorical variables later.
03
MADE — Masked Autoencoder
MADE (Masked Autoencoder for Density Estimation) extends NADE with more expressive architectures. It applies masking to general neural networks, using a clever assignment of ordering masks to hidden and output units. Each hidden unit h_j is assigned an ordering mask m_j that ensures dependencies respect the autoregressive ordering. The output prediction for xᵢ only uses hidden units whose masks allow it.
MADE improves upon NADE by supporting deep networks while maintaining tractability. It can be trained with any architecture (fully connected, convolutional) as long as masking is properly applied. The approach is elegant: masks are binary matrices that zero out connections violating the autoregressive property, making it universally applicable to arbitrary neural network architectures.
Deep Networks
Masked Connections
Arbitrary Architectures
Elegant Design
04
PixelCNN — Image Generation
PixelCNN applies autoregressive modeling to image generation by factorizing p(image) = ∏ᵢ p(pixelᵢ|pixel₁, ..., pixelᵢ₋₁), where pixels are ordered in raster scan (top-to-bottom, left-to-right). The network uses masked convolutional filters, combined with gated activations in the Gated PixelCNN variant, to respect the ordering while processing images efficiently. Color channels (R, G, B) follow strict dependencies: red depends on prior pixels, green on prior pixels and current red, blue on prior pixels and current red/green.
PixelCNN's masked convolutions use a receptive field carefully designed to avoid causality violations. Each layer's receptive field grows, allowing predictions to incorporate increasingly distant context. Gated units (multiplicative gates) improve expressiveness. The model generates pixels sequentially, which is slow but guarantees valid distributions. High-quality image generation is achieved on datasets like CIFAR-10 and ImageNet, though generation speed remains a practical limitation.
Masked Convolutions
Convolutional filters masked to respect pixel ordering, preventing information flow from future pixels.
Gated Units
Multiplicative gates amplify expressiveness by mixing feature channels nonlinearly in each layer.
Receptive Field
Grows systematically across layers, allowing distant spatial context to inform pixel predictions.
Sequential Generation
Pixels generated one-at-a-time left-to-right, top-to-bottom, enabling exact sampling from distribution.
05
WaveNet — Audio Synthesis
WaveNet adapts autoregressive modeling to raw audio waveforms, modeling p(audio) as a product of conditionals over time-steps. The innovation is dilated (atrous) causal convolutions that efficiently expand receptive field size without increasing parameter count. Each layer applies convolutions at different dilation rates (1, 2, 4, 8, ..., 2^k), allowing each timestep's prediction to depend on exponentially many prior timesteps with few layers.
Causal convolutions ensure that the prediction for the sample at time t depends only on samples from earlier timesteps (< t), maintaining autoregressive validity. Gated activations similar to PixelCNN enhance expressiveness. WaveNet generates audio by sampling from predicted distributions at each timestep, producing high-fidelity speech, music, and instrument synthesis. Conditioning on mel-spectrograms or text enables voice conversion and TTS applications. Despite sequential generation's slowness, the model's quality and conditioning flexibility made it highly influential.
Dilated Convolutions
Causal Masking
Raw Waveforms
μ-law Encoding
06
Transformer Language Models
Transformer Language Models apply autoregressive factorization to text using the GPT architecture. Tokens are generated sequentially where each token's distribution p(tᵢ|t₁,...,tᵢ₋₁) is modeled using self-attention with a causal mask. The causal attention mask zeroes out attention weights for future positions, ensuring tokens cannot attend to tokens they should predict.
Transformers scale dramatically with model size and data, discovering emergent capabilities. Scaling laws show perplexity decreases predictably with model size and compute. Large language models (GPT-2, GPT-3, etc.) achieve remarkable few-shot learning through in-context learning—adapting behavior based on prompt examples without fine-tuning. The combination of transformer efficiency, scaling law predictability, and emergent abilities makes them dominant in modern generative AI.
Causal Attention
Self-attention mask prevents tokens from attending to future positions, enforcing autoregressive property.
In-Context Learning
Models learn from prompt examples at inference time without parameter updates or fine-tuning.
Efficient Computation
Parallelizable training despite sequential generation; self-attention enables long-range dependencies efficiently.
07
Sampling Strategies
Autoregressive generation requires sampling strategies at inference time. Greedy decoding selects the highest-probability token at each step—fast but often repetitive. Temperature scaling adjusts probability distributions: higher temperatures flatten distributions (more randomness), lower temperatures sharpen them (more deterministic). Top-k sampling restricts sampling to the k most likely tokens, eliminating very low-probability tail noise while preserving diversity.
Nucleus (top-p) sampling selects the smallest set of tokens with cumulative probability ≥ p, adapting dynamically to distribution shape. Beam search explores multiple hypothesis sequences in parallel, keeping the b best partial sequences at each step and returning the highest-scoring complete sequence. These strategies trade between quality (determinism), diversity (randomness), and computational cost (inference speed).
Greedy: Fast, Dull
Beam Search: Quality-Focused
Top-k/p: Diversity
Temperature: Stochasticity
08
Strengths & Limitations
Autoregressive models offer fundamental strengths: they provide exact likelihood computation, enabling principled comparison via log-likelihood on test sets. They generate sample-by-sample by exact ancestral sampling, and can be conditioned directly on any prefix of the ordering or on side information. Likelihood provides a clear optimization target during training without variational bounds or adversarial losses.
However, limitations are significant. Sequential generation is slow: generating N tokens requires N forward passes. The model is sensitive to the chosen ordering (e.g., word order in text, pixel ordering in images), and the factorization itself supplies little inductive bias about which ordering is best. Long-range dependencies are expensive to model computationally. Modern large language models compensate through massive scale and data, but the fundamental sequential generation cost remains. Other generative model families (diffusion, flow, VAE) make different trade-offs: faster generation, conditional generation, or latent structure learning.
Strengths
Exact, tractable likelihood for principled evaluation
Sample-accurate generation with no approximations
Flexible conditioning on prefixes and side information
Proven scaling laws with emergent capabilities
Limitations
Sequential generation is slow (O(N) forward passes)
Order-dependent; sensitivity to variable ordering
Weak inductive bias vs. structured models
Expensive long-range dependencies
09
References & Further Reading
Autoregressive models have deep roots in probability theory and machine learning. This section provides key references for further study of chain rule factorization, neural density estimation, and practical applications in image and audio generation.
The papers below establish both theoretical foundations and efficient implementations that have become standard tools for generative modeling across multiple domains.
Section 01
Chain Rule Decomposition
The chain rule of probability states: p(x₁, x₂, ..., xₙ) = p(x₁) · p(x₂|x₁) · p(x₃|x₁,x₂) · ... · p(xₙ|x₁,...,xₙ₋₁). This is not an approximation—it is an exact identity. Autoregressive models exploit this factorization to make tractable the task of learning complex high-dimensional distributions.
Unlike energy-based models (Boltzmann machines) or implicit generative models (GANs), autoregressive factorization gives an explicit, tractable form. Computing p(x) requires evaluating n simple conditional distributions rather than a partition function. The conditioning structure naturally emerges from the problem: to generate xᵢ, we condition on already-generated x₁, ..., xᵢ₋₁.
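As a small illustration of the point about partition functions, the sketch below evaluates a three-variable binary joint purely as a product of conditionals. The conditional tables (`p_x1`, `p_x2_given`, `p_x3_given`) are made-up numbers chosen for the example, not taken from any model discussed here.

```python
import math
from itertools import product

# Hypothetical conditionals for three binary variables (values chosen arbitrarily).
def p_x1(x1):
    return 0.7 if x1 == 1 else 0.3

def p_x2_given(x2, x1):
    p1 = 0.9 if x1 == 1 else 0.2          # P(x2 = 1 | x1)
    return p1 if x2 == 1 else 1.0 - p1

def p_x3_given(x3, x1, x2):
    p1 = 0.1 + 0.4 * x1 + 0.3 * x2        # P(x3 = 1 | x1, x2), always in [0.1, 0.8]
    return p1 if x3 == 1 else 1.0 - p1

def log_joint(x):
    """log p(x1, x2, x3) as a sum of n = 3 conditional log-probabilities."""
    x1, x2, x3 = x
    return (math.log(p_x1(x1))
            + math.log(p_x2_given(x2, x1))
            + math.log(p_x3_given(x3, x1, x2)))

# The product of normalized conditionals is already normalized,
# so the joint sums to 1 over all 2^3 outcomes without a partition function.
total = sum(math.exp(log_joint(x)) for x in product([0, 1], repeat=3))
print(round(total, 12))
```

Each call to `log_joint` touches exactly n conditionals, which is the cost the text refers to.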
Ordering and Dependency Structure
Different orderings produce different factorizations. For image pixels, raster-scan ordering (left-to-right, top-to-bottom) is natural because spatial locality matches the dependency structure. For sequences (text, audio), temporal ordering is canonical. However, any ordering is mathematically valid.
Some orderings may admit more efficient models. For example, if the true data distribution has weak long-range dependencies, an ordering that clusters related variables could reduce the context needed for accurate prediction. Conversely, poor orderings force models to capture unnecessary long-range patterns, increasing model complexity and reducing sample efficiency.
Fully-Observed Conditioning
During training, all xᵢ are fully observed, so the context x₁,...,xᵢ₋₁ for each conditional p(xᵢ|x₁,...,xᵢ₋₁) is read directly from the data. At generation time no data is available, so we sample xᵢ ~ p(xᵢ|x₁ˢᵃᵐᵖˡᵉᵈ,...,xᵢ₋₁ˢᵃᵐᵖˡᵉᵈ), conditioning on the values already generated and building up the sequence iteratively. This sample-then-condition approach ensures generated sequences follow the learned distribution.
Training vs. Generation
During training, we minimize -log p(x₁, x₂, ..., xₙ) = -Σᵢ log p(xᵢ|x₁,...,xᵢ₋₁), summed over training data. Each term is a standard supervised learning problem: predict xᵢ from context, minimize cross-entropy. This parallels teacher forcing in sequence models.
At generation time, we autoregress: sample x₁ ~ p(x₁), then x₂ ~ p(x₂|x₁ˢᵃᵐᵖˡᵉᵈ), etc. Sampling from learned conditionals ensures diversity. The entire generation process is exact, sampling from the learned distribution without approximation.
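The contrast between the two modes can be sketched in a few lines. Here `cond_prob_one` is a hypothetical stand-in for a learned conditional (its formula is arbitrary); only the way the context is supplied differs between `nll` and `sample`.

```python
import math
import random

def cond_prob_one(prefix):
    """Hypothetical stand-in for a learned conditional P(x_i = 1 | x_1..x_{i-1})."""
    score = -0.5 + 0.8 * sum(prefix) - 0.3 * len(prefix)
    return 1.0 / (1.0 + math.exp(-score))

def nll(x):
    """Training-time loss: every context is read from the observed sequence x."""
    total = 0.0
    for i, xi in enumerate(x):
        p1 = cond_prob_one(x[:i])              # true prefix (teacher forcing)
        total -= math.log(p1 if xi == 1 else 1.0 - p1)
    return total

def sample(n, rng):
    """Generation: each context is the prefix the model has sampled so far."""
    x = []
    for _ in range(n):
        p1 = cond_prob_one(x)
        x.append(1 if rng.random() < p1 else 0)
    return x

rng = random.Random(0)
x = sample(5, rng)
print(x, nll(x))
```

In `nll` the terms are independent of each other given the data, so a real implementation can compute them in parallel; in `sample` each step has to wait for the previous one.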
Section 02
NADE — Neural Autoregressive Density Estimation
NADE (Larochelle & Murray, 2011; extended by Uria et al., 2016) is a foundational autoregressive neural density estimator. It learns the conditional distributions p(xᵢ|x₁,...,xᵢ₋₁) using a single set of shared hidden-layer weights, which amounts to using masked weight matrices. The key innovation: instead of computing n separate networks, NADE computes all conditionals in one sweep through parameter sharing and masking.
For binary data, the i-th conditional is computed as hᵢ = σ(W[:, <i] x_{<i} + c) and p(xᵢ=1|x₁,...,xᵢ₋₁) = σ(V[i, :] hᵢ + bᵢ), where σ is the logistic sigmoid, W and V are the shared input and output weight matrices, and c and b are biases. Using only the first i-1 columns of W is equivalent to multiplying W element-wise (⊙) by a binary mask that is 0 for every input index j ≥ i, so xᵢ can use neither itself nor any later variable as input.
Masked Weight Matrices
The masking is simple but powerful. Each weight matrix is multiplied by a fixed binary mask during forward/backward propagation. The mask structure is: M[i,j] = 1 if j < i (predecessors), 0 otherwise. Zeroing these weights enforces causality without changing the optimizer.
The elegance: during backpropagation, gradients flow only through allowed connections. A weight connecting xⱼ to unit computing xᵢ with j ≥ i receives zero gradient, automatically enforcing the ordering constraint without explicit loss terms.
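A minimal sketch of the mask M[i,j] = 1 if j < i, applied to a single square linear layer. The weights are arbitrary, and the single-layer setup is a simplification chosen for illustration rather than the full NADE architecture. It checks that an output is unchanged when only later inputs change.

```python
def strict_lower_mask(n):
    """M[i][j] = 1 if j < i (x_j precedes x_i), else 0."""
    return [[1 if j < i else 0 for j in range(n)] for i in range(n)]

def masked_linear(W, M, x):
    """Compute (W * M) x with the mask applied element-wise to W."""
    n = len(x)
    return [sum(W[i][j] * M[i][j] * x[j] for j in range(n)) for i in range(n)]

n = 4
W = [[0.1 * (i + 1) + 0.01 * j for j in range(n)] for i in range(n)]  # arbitrary weights
M = strict_lower_mask(n)

x = [1.0, 0.0, 1.0, 1.0]
x_changed = [1.0, 0.0, 0.0, 0.0]   # same first two entries, different last two

out = masked_linear(W, M, x)
out_changed = masked_linear(W, M, x_changed)

# The first three outputs only see inputs that precede them,
# so they agree whenever those earlier inputs agree.
print(out[:3] == out_changed[:3])   # True
```

Because the masked weights are multiplied by zero in the forward pass, their gradients are zero as well, which is the behaviour described above.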
Single Forward Pass
Traditional approaches might evaluate p(xᵢ|context) via separate networks for each i, requiring n forward passes. NADE evaluates all n conditionals in one forward pass through the shared hidden layer, with different output neurons predicting each conditional. This efficiency enabled scalability to high-dimensional data.
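One way to see why a single sweep suffices is to keep a running hidden pre-activation and add one input column per step. The sketch below assumes a binary NADE-style model with sigmoid hidden and output units; the parameter names (`W`, `c`, `V`, `b`) and the random initialisation are illustrative only.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nade_log_likelihood(x, W, c, V, b):
    """log p(x) for binary x under a NADE-style model.

    W: H x D input weights shared by all conditionals, c: H hidden biases,
    V: D x H output weights, b: D output biases.
    """
    H, D = len(W), len(x)
    a = list(c)                      # running pre-activation: c + W[:, :i] @ x[:i]
    log_p = 0.0
    for i in range(D):
        h = [sigmoid(a_k) for a_k in a]                      # depends on x[:i] only
        p1 = sigmoid(b[i] + sum(V[i][k] * h[k] for k in range(H)))
        log_p += math.log(p1 if x[i] == 1 else 1.0 - p1)
        for k in range(H):                                   # O(H) update for the next step
            a[k] += W[k][i] * x[i]
    return log_p

rng = random.Random(0)
D, H = 4, 3
W = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(H)]
V = [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(D)]
c = [0.0] * H
b = [0.0] * D

print(nade_log_likelihood([1, 0, 1, 1], W, c, V, b))
```

The total cost is O(H·D) for all D conditionals, compared with O(H·D²) if each conditional recomputed its hidden layer from scratch.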
Extension to Continuous Data
The original NADE uses Bernoulli outputs (binary data). Extensions model continuous variables using Gaussian outputs (diagonal covariance), mixture of Gaussians, or copula approaches. For categorical variables, softmax outputs replace sigmoid. Modern variants combine these for mixed-type data.
NADE was originally motivated by mean-field inference in restricted Boltzmann machines, and it directly inspired later masked models such as MADE. Its simplicity and efficiency made it popular for density estimation tasks, though transformer-based models have since supplanted it for high-dimensional structured data.
Section 03
MADE — Masked Autoencoder for Density Estimation
MADE (Germain et al., 2015) generalizes autoregressive masking to arbitrary neural network architectures. Rather than restricting to single hidden layers like NADE, MADE uses deep networks while maintaining autoregressive validity through masks applied to all weight matrices.
The key idea: assign each hidden unit hⱼ an ordering label m(hⱼ) ∈ {1, ..., n-1} and each input xᵢ a label m(xᵢ) = i. Then mask the weight matrices so that a connection from a unit with label a into a hidden unit with label b exists only if a ≤ b, and a connection from a hidden unit with label a into the output for xᵢ exists only if a < i. The strict inequality at the output ensures that information flows strictly forward in the ordering.
Masking Strategy
During training, hidden unit masks are typically sampled randomly in {1, ..., n-1} (uniform or by other strategies). This randomization over masks encourages the model to learn robust representations invariant to mask choice. Different mask assignments can be used across training epochs.
For output units predicting xᵢ, only hidden units with m(h) < i contribute. This guarantees that p̂(xᵢ|context) depends only on inputs with indices less than i. The same idea carries over to other layer types, such as convolutions, as long as every connection respects the ordering.
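The label-based masking can be checked mechanically. The sketch below builds masks for one hidden layer, using the rule that a hidden unit sees inputs with label ≤ its own and an output sees hidden units with label strictly below its own. It then verifies that no unmasked route connects input j to output i unless j < i. The layer sizes and the uniform label sampling are assumptions made for the example.

```python
import random

def made_masks(n, hidden, rng):
    """Masks for a one-hidden-layer MADE over n inputs (labels 1..n)."""
    m_in = list(range(1, n + 1))
    m_hid = [rng.randint(1, n - 1) for _ in range(hidden)]
    # Hidden unit k may see input j only if label(k) >= label(j).
    mask_in = [[1 if m_hid[k] >= m_in[j] else 0 for j in range(n)]
               for k in range(hidden)]
    # Output i may see hidden unit k only if label(i) > label(k).
    mask_out = [[1 if m_in[i] > m_hid[k] else 0 for k in range(hidden)]
                for i in range(n)]
    return mask_in, mask_out

def paths(mask_in, mask_out):
    """paths[i][j] = number of unmasked input-j -> hidden -> output-i routes."""
    n, hidden = len(mask_out), len(mask_in)
    return [[sum(mask_out[i][k] * mask_in[k][j] for k in range(hidden))
             for j in range(n)] for i in range(n)]

mask_in, mask_out = made_masks(n=5, hidden=8, rng=random.Random(0))
P = paths(mask_in, mask_out)

# Autoregressive property: no route from input j to output i unless j < i.
print(all(P[i][j] == 0 for i in range(5) for j in range(5) if j >= i))   # True
```

The check holds for any label assignment, because a route requires i > m(h) ≥ j. The first output has no incoming routes at all, so it models the unconditional p(x₁) through its bias.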
Deep Architectures
Unlike NADE's single hidden layer, MADE supports multiple stacked layers. Deeper networks learn hierarchical representations: lower layers capture simple patterns, higher layers complex interactions. Depth can add expressiveness without a proportional increase in parameters.
Practical Implementation
Implementation is straightforward: multiply weight matrices by binary masks during forward propagation. Masks can be precomputed or sampled. The approach integrates seamlessly with standard autodiff frameworks. MADE serves as a building block for later autoregressive density models such as MAF (Masked Autoregressive Flow), a normalizing flow for density estimation.
Section 04
PixelCNN — Autoregressive Image Generation
PixelCNN (van den Oord et al., 2016) applies autoregressive factorization to images using masked convolutional neural networks. The model predicts p(image) = ∏ᵢ p(pixelᵢ | pixels_{1..i-1}) where pixels are ordered raster-scan: top row left-to-right, then next row, etc. This ordering is natural for images and respects spatial locality.
The innovation is masked convolutions: convolutional kernels are masked to prevent receptive fields from including future pixels. A "Type A" mask excludes the center pixel (used for the first layer), and a "Type B" mask includes it (used for subsequent layers). Gated activation units (a tanh filter multiplied element-wise by a sigmoid gate), introduced in the Gated PixelCNN follow-up, provide multiplicative interactions between features.
Masked Convolutional Kernels
Standard convolutions process all spatial neighbors. In PixelCNN, kernels are masked to zero out weights corresponding to pixels that should not influence the prediction. For a 3×3 kernel predicting center pixel (i,j), only the top row and the left position of the center row have nonzero weights; the position to the right of the center and the whole bottom row are zeroed, and the center itself is zeroed under Type A and kept under Type B.
Mask types by depth: the first layer uses a Type A mask (excluding the center, so a pixel never sees its own value), and all later layers use Type B masks (including the center, which by then carries only information from earlier pixels). Stacking layers lets predictions incorporate progressively broader context while respecting causality.
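The two mask types are easy to write down explicitly. This is a sketch for a single-channel k×k kernel in raster-scan order; the function name `pixelcnn_mask` is made up for the example, and the per-channel RGB masking is left out.

```python
def pixelcnn_mask(k, mask_type):
    """k x k binary mask (k odd) for a single-channel masked convolution.

    Type "A" zeroes the center tap; type "B" keeps it.
    """
    assert k % 2 == 1 and mask_type in ("A", "B")
    c = k // 2
    mask = [[0] * k for _ in range(k)]
    for r in range(k):
        for col in range(k):
            above = r < c                                   # rows above the center
            left_in_row = (r == c and col < c)              # same row, to the left
            center = (r == c and col == c and mask_type == "B")
            if above or left_in_row or center:
                mask[r][col] = 1
    return mask

for row in pixelcnn_mask(3, "A"):
    print(row)
# [1, 1, 1]
# [1, 0, 0]
# [0, 0, 0]
```

The mask is multiplied element-wise with the kernel weights before each convolution, so masked taps contribute nothing in the forward pass and receive zero gradient.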
Residual and Skip Connections
Deep PixelCNN networks use residual connections to enable effective training. Skip connections aggregate information across layers. The architecture typically interleaves masked convolutions with gated units and skip connections, helping gradients propagate through many layers during backpropagation.
Color Channel Dependencies
For RGB images, channels have an ordering: red is predicted from prior pixels only, green from prior pixels plus current red, blue from prior pixels plus current red and green. In other words, each channel can depend on all channels of spatially earlier pixels and on the earlier channels of the current pixel.
Sampling and Generation
Generation is purely autoregressive: predict a distribution over colors for the top-left pixel, sample, move right, repeat. After reaching the end of a row, proceed to the next row. High-resolution generation is slow (O(H × W) forward passes) but produces high-quality images. Later refinements, such as parallel multiscale decoding or the discretized mixture-of-logistics output used in PixelCNN++, improve speed or sample quality.
Section 05
WaveNet — Raw Audio Synthesis
WaveNet (van den Oord et al., 2016) models raw audio waveforms autoregressively, predicting p(audio) = ∏ₜ p(aₜ | a₁, ..., aₜ₋₁). Rather than conditioning on engineered features like MFCCs, WaveNet operates on raw sample values (μ-law encoded). This end-to-end approach learns all feature representations.
The core innovation is dilated (atrous) causal convolutions. Standard convolutions have fixed receptive fields. Dilated convolutions skip samples: dilation=1 uses all samples, dilation=2 uses every other sample, etc. By exponentially increasing dilation across layers, WaveNet grows receptive field exponentially with depth, enabling long-range dependencies with fewer parameters and layers.
Dilated Causal Convolutions
A causal convolution ensures the output at time t depends only on inputs up to time t. Dilation d makes the kernel skip d-1 samples: output_t = activation(kernel ∘ [input_{t}, input_{t-d}, input_{t-2d}, ...]). With dilation 1, 2, 4, 8, ..., after L layers receptive field ≈ 2^L samples.
Example: 10 layers with exponential dilation (1,2,4,...,512) achieve receptive field ≈ 1024 timesteps. For 16kHz audio, this is ~64ms context, sufficient for phoneme-level dependencies in speech. The architecture is efficient: few parameters, parallelizable training (teacher forcing), yet expressive.
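The receptive-field arithmetic and the causality property can both be checked with a few lines. The sketch assumes kernel size 2 (as the 2^L figure implies) and zero padding before the start of the signal; `causal_dilated_conv` is an unoptimised illustration, not WaveNet's implementation.

```python
def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence one output of the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def causal_dilated_conv(x, w, dilation):
    """y[t] = sum_k w[k] * x[t - k * dilation], with zeros before the start."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, wk in enumerate(w):
            idx = t - k * dilation
            if idx >= 0:
                acc += wk * x[idx]
        y.append(acc)
    return y

dilations = [2 ** i for i in range(10)]        # 1, 2, 4, ..., 512
rf = receptive_field(dilations)
print(rf, 1000.0 * rf / 16000)                 # 1024 samples, 64.0 ms at 16 kHz

# Causality check: an impulse at t = 5 never affects outputs before t = 5.
x = [0.0] * 12
x[5] = 1.0
y = causal_dilated_conv(x, w=[0.5, 0.5], dilation=2)
print(y)
```

With dilation 2 the impulse shows up at t = 5 and t = 7 only, which makes the "skip every other sample" behaviour visible.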
Gated Activation Units
Like PixelCNN, WaveNet uses gated units: h_t = tanh(W_f * x + b_f) ⊙ σ(W_g * x + b_g), where ⊙ is element-wise multiplication, W_f/W_g are filter/gate weights. Gates modulate filter outputs nonlinearly, increasing expressiveness beyond standard ReLU activations.
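To make the gating behaviour concrete, the sketch below applies tanh(filter) ⊙ σ(gate) to precomputed pre-activations. The convolutions that would produce `filter_pre` and `gate_pre` are omitted, and the input values are arbitrary.

```python
import math

def gated_activation(filter_pre, gate_pre):
    """Element-wise tanh(filter) * sigmoid(gate) over two equal-length lists."""
    return [math.tanh(f) / (1.0 + math.exp(-g))
            for f, g in zip(filter_pre, gate_pre)]

# Same filter value, three gate values: nearly closed, half open, nearly open.
print(gated_activation([2.0, 2.0, 2.0], [-10.0, 0.0, 10.0]))
```

The gate scales the filter output between 0 and its full value, so each unit can learn when to pass a feature through and when to suppress it.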
Conditioning and Synthesis
WaveNet supports conditioning on auxiliary information: mel-spectrograms (for TTS), speaker embeddings (for multi-speaker synthesis), or other features. Conditioning is implemented by projecting the conditioning signal and adding it inside the filter and gate pre-activations of each gated unit. This flexibility enables voice conversion, speech synthesis, and music generation.
Generation requires sampling: predict distribution over 256 μ-law values at each timestep, sample, feed into next timestep. Sequential sampling means generating 1 second of 16kHz audio requires 16,000 forward passes—slow but necessary for exact sampling.
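The 256 output classes come from μ-law companding followed by uniform quantization. The sketch below uses the standard companding formula with μ = 255; the rounding convention and the function names are choices made for this example.

```python
import math

MU = 255  # gives 256 quantization levels

def mu_law_encode(x, mu=MU):
    """Map a sample x in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((compressed + 1.0) / 2.0 * mu + 0.5)

def mu_law_decode(index, mu=MU):
    """Approximate inverse of mu_law_encode."""
    compressed = 2.0 * index / mu - 1.0
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)

for x in (-1.0, -0.01, 0.0, 0.01, 0.5, 1.0):
    k = mu_law_encode(x)
    print(x, k, round(mu_law_decode(k), 4))
```

The logarithmic curve spends more of the 256 levels on small amplitudes, which is why a categorical output over so few classes works for audio.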
Section 06
Transformer Language Models (GPT)
Transformer language models (GPT architecture) apply autoregressive factorization using self-attention with causal masking. Unlike CNNs (PixelCNN, WaveNet) or RNNs, transformers process sequences in parallel during training yet maintain autoregressive generation at inference. Token prediction p(tᵢ | t₁, ..., tᵢ₋₁) uses self-attention over all preceding tokens.
The causal attention mask is a triangular matrix: element [i,j] is -∞ for j > i (future tokens), 0 otherwise. This forces softmax attention to put zero probability on future positions. During training, all positions are processed in parallel (teacher forcing). At generation, tokens are sampled sequentially, updating the context for next token prediction.
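A small numerical sketch of the additive mask: scores for future positions are replaced by -∞ before a row-wise softmax. The score matrix is arbitrary, and the query/key/value projections of a real attention layer are omitted.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax of a square score matrix after adding a causal mask."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Additive mask: 0 for j <= i, -inf for j > i.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]    # exp(-inf) evaluates to 0.0
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

scores = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 3.0],
          [0.5, 0.5, 0.5]]
for row in causal_attention_weights(scores):
    print([round(w, 3) for w in row])
```

Every row still sums to 1, but all weight above the diagonal is exactly zero, so position i cannot read from positions after it even though all rows are computed at once.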
Scaling Laws and Emergence
A remarkable empirical discovery: language model test loss (and hence perplexity) follows approximate power laws in model size, data, and compute. The size of the gain from doubling the model depends on the fitted exponent and on whether data and compute are scaled along with it, so no single percentage holds across settings. No theoretical explanation fully accounts for this, but it's reproducible and has become a design principle.
Larger models exhibit emergent capabilities absent in smaller ones: few-shot in-context learning, chain-of-thought reasoning, translation without specific training. Reported thresholds vary by task, often in the range of billions of parameters or more. Scaling appears to be a primary lever for capability.
In-Context Learning
GPT models adapt to tasks through prompt examples without fine-tuning or gradient updates. Given a prompt like "Translate to French: hello = bonjour. How to say 'goodbye'?", the model generates correct translations. This appears to be implicit task learning through next-token prediction on the prompt.
In-context learning is remarkable because it's unintended: models are only trained to predict the next token. Yet they implicitly extract task structure from context. Larger models are dramatically better at in-context learning, suggesting it's an emergent property of scale.
Efficient Training and Generation
Transformer training is highly parallelizable: all positions compute attention in parallel, all tokens' gradients flow simultaneously. This contrasts with RNNs' sequential timesteps. However, generation requires sequential sampling (like all autoregressive models), limiting inference speed. Methods like speculative decoding attempt to parallelize sampling.
Section 07
Sampling Strategies at Inference Time
Autoregressive generation requires choosing tokens from the learned conditional distributions. Different strategies trade quality, diversity, and computation. The choice significantly affects generation characteristics: greedy decoding is deterministic but dull; stochastic sampling adds diversity but may reduce coherence.
Greedy Decoding
At each step, select argmax(p(xᵢ | context)). Simple, fast (no randomness), but prone to repetition and mode collapse. For language, greedy often generates generic text lacking diversity. For images/audio, greedy tends toward averaged-out, blurry samples. Rarely optimal despite efficiency.
Temperature Scaling
Rescale logits by temperature τ: p(xᵢ) ∝ exp(logit_xᵢ / τ). Temperature τ < 1 sharpens distribution (more concentrated on high-probability tokens, more "greedy"). Temperature τ > 1 flattens distribution (more uniform, higher entropy, more random). τ = 1 is the original distribution.
In practice, τ ∈ [0.5, 2] is typical. Lower temperatures (0.7-0.8) generate more coherent text; higher temperatures (1.2-1.5) generate more creative/diverse samples. This simple scaling provides control over the stochasticity-coherence trade-off.
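The effect of τ is easy to see numerically. The sketch below applies the rescaling to an arbitrary set of logits and reports the entropy of the resulting distribution; the helper names are illustrative.

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Softmax of logits / tau, computed with the usual max-subtraction for stability."""
    assert tau > 0
    scaled = [l / tau for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

logits = [2.0, 1.0, 0.0, -1.0]
for tau in (0.5, 1.0, 2.0):
    p = softmax_with_temperature(logits, tau)
    print(tau, [round(q, 3) for q in p], round(entropy(p), 3))
```

Entropy rises with τ: the ranking of tokens never changes, only how much probability the top choices hold.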
Top-k Sampling
Rather than sampling from the full vocabulary, sample only from the k most likely tokens: the remaining probabilities are set to zero and the kept ones are renormalized. This eliminates very low-probability tail noise while maintaining diversity. For language models, k=40 or k=50 is typical. It prevents sampling absurd tokens with near-zero probability yet allows flexibility within likely options.
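A minimal sketch of the truncate-and-renormalize step, followed by one draw. The probabilities are made up, and sorting the whole vocabulary is used for clarity rather than speed.

```python
import random

def top_k_filter(probs, k):
    """Keep the k largest probabilities, zero the rest, and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    out = [0.0] * len(probs)
    for i in keep:
        out[i] = probs[i] / kept_mass
    return out

probs = [0.5, 0.2, 0.15, 0.1, 0.05]
filtered = top_k_filter(probs, k=2)
print(filtered)

rng = random.Random(0)
token = rng.choices(range(len(filtered)), weights=filtered, k=1)[0]
print(token)
```

Only tokens 0 and 1 can ever be drawn here, and their relative odds (5 to 2) are the same as in the original distribution.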
Nucleus (Top-p) Sampling
Instead of a fixed k, select the smallest set of tokens with cumulative probability ≥ p (the nucleus) and renormalize within it. For example, with p = 0.9 and sorted probabilities 0.6, 0.25, 0.1, 0.05, the running totals are 0.6, 0.85, 0.95, so the nucleus is the first three tokens. The effective k adapts dynamically to the shape of the distribution.
Nucleus sampling often produces higher quality samples than fixed top-k because it responds to entropy: when the top token dominates (low entropy), the nucleus includes few tokens (similar to argmax); when entropy is high, the nucleus includes more. It is widely used when sampling from GPT-style models.
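The adaptive behaviour can be seen by applying the same p to a peaked and a flat distribution. This is a sketch with made-up probabilities; `top_p_filter` is a name chosen for the example.

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    out = [0.0] * len(probs)
    for i in keep:
        out[i] = probs[i] / mass
    return out

peaked = [0.6, 0.25, 0.1, 0.05]
flat = [0.25, 0.25, 0.25, 0.25]
print(sum(1 for q in top_p_filter(peaked, 0.9) if q > 0))   # 3
print(sum(1 for q in top_p_filter(flat, 0.9) if q > 0))     # 4
```

A fixed k would have to pick one size for both cases, whereas the nucleus shrinks or grows with the entropy of the distribution.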
Beam Search
Maintain b partial sequences (the "beam"), keeping the b highest-scoring partial sequences at each step. At step t, score all b×vocab extensions and select the top b by score. Continue until the desired length or an end-of-sequence token. Return the top-scoring complete sequence(s).
Beam search finds higher-likelihood sequences than greedy but requires more computation (b forward passes per step). For language, b=5 is typical. Beam search optimizes for joint likelihood over the sequence, not per-token accuracy, often producing more coherent outputs despite lower immediate probabilities.
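The difference from greedy decoding shows up even on a two-token toy problem. In the sketch below, `toy_model` is a hypothetical next-token distribution constructed so that the locally best first token leads to a worse sequence overall. Fixed-length decoding without an end-of-sequence token is a simplification.

```python
import math

def beam_search(next_log_probs, beam_width, length):
    """Return the best (tokens, log-prob) pair found by beam search.

    next_log_probs(prefix) -> list of log-probabilities, one per vocabulary item.
    """
    beams = [((), 0.0)]                             # (prefix, cumulative log-prob)
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for tok, lp in enumerate(next_log_probs(prefix)):
                candidates.append((prefix + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]             # keep the b best partial sequences
    return beams[0]

def toy_model(prefix):
    """Hypothetical 2-token model where the greedy first choice is a trap."""
    if prefix == ():
        return [math.log(0.6), math.log(0.4)]
    if prefix[-1] == 0:
        return [math.log(0.5), math.log(0.5)]
    return [math.log(0.9), math.log(0.1)]

greedy = beam_search(toy_model, beam_width=1, length=2)   # beam width 1 is greedy decoding
wider = beam_search(toy_model, beam_width=2, length=2)
print(greedy[0], round(math.exp(greedy[1]), 2))
print(wider[0], round(math.exp(wider[1]), 2))
```

Greedy commits to token 0 (probability 0.6) and ends with a sequence of probability 0.30, while a beam of width 2 keeps token 1 alive and finds the sequence (1, 0) with probability 0.36.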
Section 08
Strengths, Limitations & Alternatives
Autoregressive models are powerful yet face inherent trade-offs. Understanding these helps guide model choice for specific applications.
Core Strengths
Exact Likelihood. Computing p(x) via the chain rule is exact, not approximate. This enables principled comparison: log-likelihood on test sets measures how well the model captures the data distribution. Unlike GANs (no density) or VAEs (a lower bound on the likelihood), autoregressive models give the exact likelihood assigned by the model.
Sample Accuracy. Ancestral sampling from the learned conditionals draws exactly from the model's distribution, with no Markov chain to run to convergence (as in energy-based models) and no reliance on an approximate posterior (as in VAE training). Autoregressive generation is exact with respect to the learned model.
Flexible Conditioning. Any prefix of the ordering can be conditioned on directly. For images in raster-scan order: given the top half, predict the bottom half. For audio: given a partial waveform, predict the future. Conditioning on an arbitrary subset (for example, the left half of an image) requires a different ordering or approximate inference, while side information such as class labels or text can be supplied as extra network input.
Proven Scaling Laws. Language models exhibit reproducible scaling laws, with perplexity improving predictably with scale. Emergent capabilities (in-context learning, reasoning) arise at scale, and continued scaling has so far remained rewarding. This contrasts with GANs (training instability) or some flows (training complexity).
Fundamental Limitations
Sequential Generation. Generating N tokens requires N forward passes (or N sequential samples from conditionals). For 1000-token sequences or high-resolution images (1M+ pixels), this is slow. Inference time is O(N), making real-time or interactive applications challenging. This is inherent to autoregressive factorization: later tokens depend on earlier ones.
Order Sensitivity. Different orderings of variables produce different models. For text, left-to-right word order is natural, but for images, raster-scan is arbitrary: diagonal or columnar orderings would give different models. The factorization itself provides no inductive bias toward a good ordering, so the model must commit to one ordering and learn whatever long-range structure that ordering induces.
Long-Range Dependencies. Modeling dependencies over many steps is computationally expensive. Models with fixed receptive fields or recurrent state tend to lose information about distant context. Transformers mitigate this with attention, but cost grows quadratically in sequence length.
Mode Coverage vs. Likelihood. Models maximizing likelihood tend toward averaging predictions across modes, producing blurry or generic samples. Balancing mode coverage and sharp modes requires careful training (e.g., mixture of logistics, discrete latent variables).
Alternative Generative Models
Diffusion Models: Gradually add noise to data, then reverse the process with a learned denoiser. They avoid an autoregressive ordering and update all dimensions in parallel at each denoising step (a fixed number of steps rather than N). Recent empirical success in images (DALL-E 3, Midjourney). Sampling is slower than with GANs, but training is more stable than adversarial training. Emerging leader in image generation.
Flow-Based Models: Learn invertible transformations from simple prior (Gaussian) to data. Exact likelihood like autoregressive models, but support parallel generation (apply inverse transform in parallel). Require careful architecture design and training can be unstable. Less popular than diffusion or autoregressive recently.
GANs: Adversarial training generates samples without explicit likelihood. Fast sampling (one-shot), flexible architectures. Training instability, mode collapse, and lack of likelihood are drawbacks. Prominent in specialized domains (e.g., StyleGAN for faces) but less dominant than diffusion or transformers for general generation.
VAEs: Variational autoencoders learn latent representations with principled likelihood lower bound. Amortized inference (fast encoder), support for structured latents. Posterior approximation error (ELBO gap) and tendency to ignore latents (posterior collapse) limit expressiveness. Less competitive on density/generation quality than autoregressive or diffusion models.
Key Takeaway
Autoregressive models excel at likelihood estimation and structured generation with flexible conditioning. Their sequential generation is slow but exact. Modern dominance in language modeling reflects these properties. For faster generation or avoiding ordering assumptions, diffusion or flow-based alternatives are worth exploring.