Problem Set 1 is the first graded assignment in XCS236 (Generative Models) and anchors the course in foundational concepts and practical implementation. Spanning weeks 1–2, it covers probability theory, PyTorch fundamentals, and two flagship autoregressive architectures: MADE and PixelCNN. The assignment is worth 40 points and includes both coding and written components that test understanding of how probability theory, deep learning frameworks, and architectural design combine to build working generative models.
The problem set is structured in stages: first, students implement a simple autoregressive model from scratch to understand the mechanics of factorizing joint distributions. Next, they implement MADE, learning how to enforce autoregressive ordering through masking. Third, they implement PixelCNN for image generation, applying masked convolutions. Finally, they train, evaluate, and analyze both models on real datasets, computing likelihoods and generating samples. This progression builds from theory to implementation to empirical understanding.
Key learning outcomes include: (1) fluency with autoregressive factorization and chain rule manipulations; (2) ability to implement PyTorch models with custom layers and loss functions; (3) understanding of how masking enforces architectural constraints; (4) experience with training generative models and interpreting likelihood-based metrics; (5) familiarity with sampling, likelihood evaluation, and qualitative assessment of generative quality.
The problem set emphasizes both correctness and insight. Students are expected to debug their implementations, verify that likelihoods are computed correctly, and understand why certain design choices matter. Written responses ask students to explain concepts like bits/dim, the motivation for masking, and the tradeoffs between different orderings. This balance—code and reasoning—reflects the reality of modern generative modeling research, where strong intuition and working implementations go hand-in-hand.
The chain rule of probability is the most important result for this problem set. For a joint distribution p(x₁, x₂, ..., xₙ), the chain rule decomposes this as: p(x₁, x₂, ..., xₙ) = p(x₁) · p(x₂|x₁) · p(x₃|x₁, x₂) · ... · p(xₙ|x₁, ..., xₙ₋₁). This factorization is always valid—it follows directly from the definition of conditional probability—and is the mathematical foundation for all autoregressive models. The order of variables can be arbitrary; different orderings yield different factorizations but represent the same joint distribution.
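The claim that every ordering gives the same joint can be checked numerically. The sketch below is only an illustration: the three-variable joint table, the helper names `marginal` and `chain_rule_product`, and the test point are all assumptions made up for this example, not part of the assignment.

```python
import itertools
import math

# Hypothetical joint distribution over three binary variables (x1, x2, x3).
# The weights are arbitrary illustrative numbers, normalized to sum to 1.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
states = list(itertools.product([0, 1], repeat=3))
joint = {s: w / sum(weights) for s, w in zip(states, weights)}


def marginal(assignment):
    """Probability that the variables in `assignment` ({position: value}) take those values."""
    return sum(p for s, p in joint.items()
               if all(s[i] == v for i, v in assignment.items()))


def chain_rule_product(x, order):
    """Multiply p(x_i | variables earlier in `order`) over all positions in `order`."""
    prob = 1.0
    seen = {}
    for i in order:
        denom = marginal(seen)          # p(earlier variables); 1 when nothing is seen yet
        seen[i] = x[i]
        prob *= marginal(seen) / denom  # p(x_i | earlier variables)
    return prob


x = (1, 0, 1)
forward = chain_rule_product(x, order=[0, 1, 2])
reverse = chain_rule_product(x, order=[2, 1, 0])
print(joint[x], forward, reverse)  # all three agree up to floating-point error
```

The individual conditionals differ between the two orderings, but their product is the same joint probability.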
Bayes' rule relates the posterior distribution p(y|x) to the likelihood p(x|y), prior p(y), and marginal likelihood: p(y|x) = p(x|y) · p(y) / p(x). While PS1 focuses on likelihood (forward modeling), Bayes' rule appears in later problem sets on inference and variational autoencoders. Understanding how likelihood and prior combine—and how normalization (the marginal p(x)) ensures a valid posterior—is fundamental to probabilistic modeling.
KL divergence measures the difference between two distributions: D_KL(p || q) = Σ_x p(x) log(p(x) / q(x)). It is always non-negative, zero only when p = q, and not symmetric in its arguments. In generative modeling, KL divergence quantifies how well a learned distribution q approximates the true data distribution p. The reverse KL divergence D_KL(q || p) appears in variational inference and is mode-seeking: q is penalized for placing mass where p is low, so it tends to concentrate on a subset of the modes of p. The forward KL divergence D_KL(p || q) is mass-covering: q is penalized for assigning low probability anywhere p is high, so it spreads out to cover all of the data. Maximum likelihood training implicitly minimizes the forward KL divergence between the data and model distributions.
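A small numerical sketch can make the asymmetry of KL divergence concrete. The three-outcome distributions below are invented for illustration: `p` is a bimodal target, `q_mode` sits on one mode, and `q_cover` spreads uniformly.

```python
import math


def kl(p, q):
    """D_KL(p || q) for discrete distributions given as lists (terms with p_i = 0 contribute 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.495, 0.01, 0.495]      # bimodal target distribution (illustrative)
q_mode = [0.98, 0.01, 0.01]   # concentrates on a single mode of p
q_cover = [1 / 3, 1 / 3, 1 / 3]  # spreads mass over every outcome

# Forward KL, D_KL(p || q): the candidate that covers both modes scores better.
print("forward:", kl(p, q_mode), kl(p, q_cover))
# Reverse KL, D_KL(q || p): the candidate that commits to one mode scores better.
print("reverse:", kl(q_mode, p), kl(q_cover, p))
```

With these numbers the forward direction prefers `q_cover` and the reverse direction prefers `q_mode`, which is the usual mass-covering versus mode-seeking contrast.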
Conditional independence structures simplify probability computations and enable efficient algorithms. For example, if variables are independent given a condition, their joint conditional factors into a product. Markov chains satisfy the Markov property: given the present state, future states are independent of the past. These structures appear explicitly in hidden Markov models (HMMs) and in the hidden-state updates of recurrent neural networks (RNNs), and later in diffusion models and transformer attention masks. Recognizing conditional independence in a problem often suggests architectural simplifications and computational savings.
Cross-entropy and negative log-likelihood (NLL) are the practical losses for training generative models. For a model p_θ(x) with parameters θ, the NLL loss is -log p_θ(x). Training on a dataset minimizes average NLL across samples, which is equivalent to maximum likelihood estimation (MLE). The cross-entropy H(p, q) = -Σ_x p(x) log q(x) reduces to NLL when p is the empirical data distribution. Understanding why NLL is the right loss (it directly optimizes the probability the model assigns to data) motivates the designs of MADE and PixelCNN.
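The equivalence between average NLL and cross-entropy against the empirical distribution can be checked on a toy example. The dataset and the model probabilities below are made-up values used only for illustration.

```python
import math
from collections import Counter

# Toy dataset and a hypothetical model distribution over three symbols.
data = ["a", "b", "a", "c", "a", "b"]
model = {"a": 0.5, "b": 0.3, "c": 0.2}

# Average negative log-likelihood over the dataset (in nats).
avg_nll = -sum(math.log(model[x]) for x in data) / len(data)

# Cross-entropy H(p_hat, q), with p_hat the empirical distribution of the data.
empirical = {k: c / len(data) for k, c in Counter(data).items()}
cross_entropy = -sum(empirical[k] * math.log(model[k]) for k in empirical)

# Entropy of the empirical distribution: the lowest cross-entropy any model can reach.
entropy = -sum(p * math.log(p) for p in empirical.values())

print(avg_nll, cross_entropy, entropy)
```

The first two numbers coincide, and both are bounded below by the third, which is what the maximum likelihood solution (the empirical distribution itself) attains.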
Students review these concepts through both theory (deriving factorizations, applying Bayes' rule) and practice (computing likelihoods, implementing loss functions). Clear probability notation is non-negotiable: confusion between p(x|y) and p(y|x), or between joint and conditional, leads to incorrect implementations and misinterpreted results. Careful notation and step-by-step derivations build the precision required for deep learning research.
Tensors are PyTorch's fundamental data structure—multi-dimensional arrays that support GPU computation and automatic differentiation. A tensor can be 0-D (scalar), 1-D (vector), 2-D (matrix), or higher. Key properties include shape (dimensions), dtype (data type, e.g., float32), and device (CPU or GPU). Tensors are created via constructors (torch.tensor, torch.zeros, torch.randn) or loaded from data. Vectorized operations on tensors are dramatically faster than Python loops, especially on GPUs, making tensor-centric code essential for efficient deep learning.
Autograd enables automatic differentiation via backpropagation. When a tensor is created with requires_grad=True and used in computations, PyTorch builds a computational graph tracking operations. Calling .backward() on a scalar loss propagates gradients backward through the graph. The .grad attribute of each leaf tensor (such as a model parameter) accumulates the gradient of the loss with respect to that tensor. This abstraction is powerful: complex models with millions of parameters compute gradients correctly without explicit derivative code. Students must understand when operations are recorded (any operation involving a requires_grad tensor), when gradients are actually computed (only on .backward()), why they must be zeroed between updates (optimizer.zero_grad(), because .grad accumulates), and how with torch.no_grad() disables graph tracking for inference.
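The accumulation behavior of .grad is easiest to see on a two-parameter example. This is a minimal sketch with made-up numbers; the variable names are illustrative.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)  # leaf tensor tracked by autograd
x = torch.tensor([3.0, 4.0])                       # plain data, not tracked

loss = ((w * x).sum() - 1.0) ** 2   # (3 - 8 - 1)^2 = 36
loss.backward()
# d loss / d w = 2 * (w.x - 1) * x = 2 * (-6) * [3, 4] = [-36, -48]
grad_first = w.grad.clone()

# A second backward pass ADDS to .grad instead of overwriting it.
loss_again = ((w * x).sum() - 1.0) ** 2
loss_again.backward()
grad_accumulated = w.grad.clone()

w.grad.zero_()  # what optimizer.zero_grad() does for each parameter

# Inside no_grad, no graph is recorded, so the result cannot be backpropagated.
with torch.no_grad():
    y = (w * x).sum()

print(grad_first, grad_accumulated, y.requires_grad)
```

Forgetting the zeroing step would make every update use the sum of all gradients seen so far, which is the bug described in the text.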
Datasets and DataLoaders handle data organization and batching. A Dataset is a wrapper around data that provides __len__ and __getitem__ methods, returning individual samples. A DataLoader wraps the dataset, applying batching, shuffling, and parallel loading. For PS1, students use standard datasets (MNIST, CIFAR-10) or implement custom datasets. The DataLoader abstraction decouples model code from data handling, making experiments reproducible and code modular. Properly configured dataloaders are essential for avoiding memory overflow and ensuring training efficiency.
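As a rough sketch of the Dataset and DataLoader contract, the class below wraps random binary vectors. `BinaryVectors` and its parameters are hypothetical names invented for this example; the assignment's actual datasets may be organized differently.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class BinaryVectors(Dataset):
    """Hypothetical toy dataset: `num_samples` random binary vectors of length `dim`."""

    def __init__(self, num_samples=10, dim=4, seed=0):
        generator = torch.Generator().manual_seed(seed)
        self.data = (torch.rand(num_samples, dim, generator=generator) > 0.5).float()

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        return self.data[idx]  # one sample, shape (dim,)


dataset = BinaryVectors()
loader = DataLoader(dataset, batch_size=4, shuffle=True)

# The loader stacks samples into batches; the last batch holds the remainder.
batch_shapes = [tuple(batch.shape) for batch in loader]
print(batch_shapes)
```

With 10 samples and a batch size of 4, the final batch contains only 2 samples; passing drop_last=True to the DataLoader would discard it instead.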
Neural network modules are defined as subclasses of nn.Module. The __init__ method defines layers and parameters; the forward method specifies computation. PyTorch's nn library provides standard layers (Linear, Conv2d, BatchNorm, etc.) with proper initialization and gradient support. Students design models by composing these layers, and PyTorch automatically tracks parameters for optimization. Custom layers can be implemented by subclassing nn.Module and defining forward computation—useful for implementing masked linear layers or masked convolutions in PS1.
Optimizers update model parameters based on gradients. The Adam optimizer, widely used in deep learning, maintains moving averages of gradients and second moments, adapting learning rates per parameter. SGD (stochastic gradient descent) and other optimizers are also available. A typical training loop: (1) forward pass through model, (2) compute loss, (3) zero gradients, (4) backprop, (5) optimizer step. Learning rate scheduling adjusts the learning rate during training, often decreasing it over epochs to refine model parameters. Early stopping, checkpoint saving, and validation monitoring are standard practices for avoiding overfitting and selecting the best model.
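The five-step loop above can be sketched on the smallest possible likelihood model: a categorical distribution parameterized by three logits and fit by NLL. The data, learning rate, and step count are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

data = torch.tensor([0, 0, 0, 1, 1, 2])  # toy categorical observations
logits = nn.Parameter(torch.zeros(3))    # the model's only parameters
optimizer = torch.optim.Adam([logits], lr=0.05)

losses = []
for step in range(300):
    log_probs = torch.log_softmax(logits, dim=0)  # (1) forward pass
    loss = -log_probs[data].mean()                # (2) average NLL
    optimizer.zero_grad()                         # (3) clear old gradients
    loss.backward()                               # (4) backpropagate
    optimizer.step()                              # (5) update parameters
    losses.append(loss.item())

probs = torch.softmax(logits.detach(), dim=0)
print(losses[0], losses[-1], probs)
```

Because the maximum likelihood solution for a categorical model is the vector of empirical frequencies, the learned probabilities should approach 3/6, 2/6, and 1/6.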
Students implement these concepts hands-on: creating tensors, performing operations, building models, running forward-backward passes, and training on real data. Debugging tools include print(tensor.shape) to verify dimensions, assert statements to validate assumptions, and examining gradients to detect vanishing/exploding gradients. PyTorch errors often point to shape mismatches; developing intuition for tensor shapes and operations is invaluable. The goal is fluency: PyTorch code should be written and read as naturally as basic Python, enabling focus on model design and algorithmic innovation.
The autoregressive factorization decomposes a joint distribution p(x₁, x₂, ..., xₙ) into a product of conditionals following a fixed ordering. Mathematically: p(x₁, x₂, ..., xₙ) = p(x₁) · p(x₂|x₁) · p(x₃|x₁, x₂) · ... · p(xₙ|x_{<n}), where x_{<i} denotes (x₁, ..., xᵢ₋₁).
To model this factorization, a neural network must satisfy the autoregressive property: the network's output for variable i (representing p(xᵢ|x_{<i})) must depend only on the preceding inputs x₁, ..., xᵢ₋₁.
Training an autoregressive model minimizes negative log-likelihood (NLL). For a dataset {x⁽¹⁾, x⁽²⁾, ...}, the empirical loss is (1/N) Σ ᵢ -log p_θ(x⁽ⁱ⁾). Computing -log p_θ(x) factorizes as -log Π ᵢ p_θ(xᵢ|x_{<i}) = -Σ ᵢ log p_θ(xᵢ|x_{<i}).
Evaluation focuses on likelihood-based metrics. The test negative log-likelihood (NLL) measures how well the model generalizes; lower is better. Bits-per-dimension (bits/dim) normalizes NLL by the data dimensionality: bits/dim = NLL / (N_dims · log 2), where NLL is the average per-example negative log-likelihood in nats and N_dims is the number of dimensions per example (for images, height × width × channels). This metric enables comparison across datasets of different dimensionality. For example, a bits/dim of 3 means the model needs on average 3 bits per dimension to encode the data. Comparing bits/dim across models and datasets, as reported in papers and leaderboards, provides a standard evaluation framework. Sampling is also crucial: sequential generation of samples (x₁ ~ p_θ(x₁), then x₂ ~ p_θ(x₂|x₁), etc.) yields model-generated data. Qualitative assessment (do samples look realistic?) complements quantitative metrics.
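The bits/dim conversion is a one-line computation. The sketch below assumes the NLL is measured in nats (natural logarithm), which is what PyTorch's log-based losses produce; the function name is illustrative.

```python
import math


def bits_per_dim(nll_nats, num_dims):
    """Convert a per-example NLL in nats into bits per dimension."""
    return nll_nats / (num_dims * math.log(2))


# Sanity check: a model that is uniform over 256 values for each of 784 pixels
# assigns every image an NLL of 784 * ln(256) nats.
num_dims = 28 * 28
uniform_nll = num_dims * math.log(256)
print(bits_per_dim(uniform_nll, num_dims))
```

The uniform model comes out at 8 bits/dim (up to floating-point rounding), a useful reference point: any trained model on 8-bit data should land below it.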
Students implement simple autoregressive models (e.g., for 1-D or 2-D synthetic data), verify that likelihoods are computed correctly via numerical tests, and understand the ordering's effect on convergence and expressiveness. They experiment with different architectures and orderings, observing how these choices affect training dynamics and sample quality. This hands-on experience with autoregressive density estimation—from theory to implementation to empirical analysis—builds intuition for more complex models like RNNs, transformers, and diffusion models that leverage autoregressive principles.
The Masked Autoencoder for Distribution Estimation (MADE) solves the efficiency problem of autoregressive modeling: naively applying a neural network to enforce autoregressive ordering requires multiple forward passes (one per variable) or inefficient sequential computation. MADE's key insight is to give each input and output unit its variable index (1 to n) and each hidden unit an index between 1 and n−1, then mask connections so that information flows only from lower to higher indices: a hidden unit with index k receives only from units in the previous layer with indices ≤ k, and the output unit for xᵢ receives only from hidden units with indices < i. This single-pass architecture computes all n conditionals p(xᵢ|x_{<i}).
Implementing the mask is straightforward but requires care. For the input-to-hidden layer, the mask matrix has shape (hidden_dim, input_dim); entry (j, i) is 1 if hidden neuron j may receive input i, and 0 otherwise. Writing m_j for the index assigned to hidden neuron j and m_i for the position of input i in the chosen data ordering, the connection is allowed when m_i ≤ m_j. Hidden-to-hidden layers follow the same rule: a neuron may receive from neurons in the previous layer whose indices are less than or equal to its own. For the hidden-to-output layer, the output representing xᵢ has index i and may receive only from hidden neurons with indices strictly less than i; this strict inequality is what keeps xᵢ out of its own conditional. Mask generation algorithms (Germain et al., 2015) efficiently construct these masks, often randomly permuting indices to enhance model expressiveness.
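The mask rules can be sketched in a few lines of plain Python for a single hidden layer. This follows the convention of Germain et al. (2015), where hidden units use a non-strict comparison (≥) and outputs use a strict one (>); the function names, sizes, and seed are assumptions chosen for illustration.

```python
import random


def made_masks(n_inputs, n_hidden, seed=0):
    """Build input-to-hidden and hidden-to-output masks for a one-hidden-layer MADE."""
    rng = random.Random(seed)
    input_idx = list(range(1, n_inputs + 1))                          # natural ordering 1..n
    hidden_idx = [rng.randint(1, n_inputs - 1) for _ in range(n_hidden)]  # indices in 1..n-1
    # Hidden unit k may see input i when hidden_idx[k] >= input_idx[i].
    mask_in = [[int(hk >= di) for di in input_idx] for hk in hidden_idx]
    # Output d may see hidden unit k when d > hidden_idx[k] (strict).
    mask_out = [[int(d > hk) for hk in hidden_idx] for d in input_idx]
    return mask_in, mask_out


def connectivity(mask_out, mask_in):
    """Entry [d][i] counts the masked paths from input i to output d."""
    n_hidden = len(mask_in)
    return [[sum(mask_out[d][k] * mask_in[k][i] for k in range(n_hidden))
             for i in range(len(mask_in[0]))]
            for d in range(len(mask_out))]


mask_in, mask_out = made_masks(n_inputs=5, n_hidden=16)
paths = connectivity(mask_out, mask_in)
for row in paths:
    print(row)  # nonzero entries appear only strictly below the diagonal
```

A strictly lower-triangular path matrix means output d is connected only to inputs that precede it, which is exactly the autoregressive property. The first output has no incoming paths at all, so p(x₁) is modeled by its bias alone.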
The power of MADE is computational: a single forward pass through the network computes all conditional distributions, enabling efficient likelihood evaluation. For a dataset sample x, the forward pass outputs parameters for all p(xᵢ|x_{<i}).
Students implement masked linear layers, understanding the mask structure and how to apply it (multiplying the weights by the mask in every forward pass, which also zeroes the gradients of the masked entries). They verify the mask is correct by checking that no output depends on its own input or on inputs later in the ordering; a simple assertion on mask patterns, or on the Jacobian of the outputs with respect to the inputs, is enough. They implement the full MADE architecture: input → masked hidden layers → output. They compute likelihoods on validation data and generate samples, comparing learned distributions to baseline models. Debugging the mask is critical; incorrect masking leads to valid-looking code that produces meaningless likelihoods (for example, a leaky mask lets the model copy xᵢ from its input, so the reported NLL looks implausibly good while samples are poor).
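One possible shape for a masked linear layer, together with a Jacobian-based check of the autoregressive property, is sketched below. `MaskedLinear` is a hypothetical name, the network is untrained, and the masks use the Germain et al. (2015) convention (≥ for hidden units, > for outputs); the assignment's starter code may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinear(nn.Linear):
    """Linear layer whose weight matrix is multiplied by a fixed binary mask."""

    def __init__(self, in_features, out_features, mask):
        super().__init__(in_features, out_features)
        # A buffer moves with the module (e.g. to GPU) but is not trained.
        self.register_buffer("mask", mask.float())  # shape (out_features, in_features)

    def forward(self, x):
        # Masking inside forward keeps masked weights inactive after every update.
        return F.linear(x, self.weight * self.mask, self.bias)


torch.manual_seed(0)
n, hidden = 4, 8
in_idx = torch.arange(1, n + 1)              # input/output indices 1..n
hid_idx = torch.randint(1, n, (hidden,))     # hidden indices in 1..n-1
mask_in = hid_idx[:, None] >= in_idx[None, :]   # (hidden, n)
mask_out = in_idx[:, None] > hid_idx[None, :]   # (n, hidden)

model = nn.Sequential(
    MaskedLinear(n, hidden, mask_in),
    nn.ReLU(),
    MaskedLinear(hidden, n, mask_out),
)

x = torch.rand(n)
# jac[d, i] is the derivative of output d with respect to input i.
jac = torch.autograd.functional.jacobian(model, x)
is_autoregressive = bool(torch.all(torch.triu(jac) == 0))
print(is_autoregressive)
```

The check passes when the Jacobian is zero on and above the diagonal, meaning no output responds to its own input or to later ones.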
MADE demonstrates a broader principle in deep learning: architectural constraints can be elegantly enforced through structured connectivity and masking, rather than explicit variable ordering in the computation graph. This principle appears in PixelCNN (masked convolutions), transformers (causal attention masks), and many other architectures. Understanding MADE deeply—not just using it as a black box—builds the architectural intuition needed to design novel models and debug existing ones. The assignment asks students to experiment with different mask generation strategies, analyze the effect of network width and depth on expressiveness, and compare MADE's efficiency to naive autoregressive models.
PixelCNN (van den Oord et al., 2016) extends the autoregressive framework to images by modeling pixels in raster order: left-to-right, top-to-bottom (or other canonical orderings). The factorization is p(image) = Π_{row, col} p(pixel_{row,col} | all previous pixels). Convolutional networks are natural for image modeling because they exploit spatial structure and are parameter-efficient. However, naive convolutions cannot implement the autoregressive constraint: a standard convolutional layer has a receptive field that includes current and future pixels, violating causality. PixelCNN solves this through masked convolutions: convolutional kernels are masked so that the central pixel and future pixels have zero weight, allowing each pixel's prediction to depend only on previous pixels.
Masked convolutions are implemented via kernel masking (multiplying the kernel weights by a binary mask inside every forward pass; masking only at initialization is not enough, because gradient updates would make the masked weights nonzero again) or architectural masking (shifting or padding the input, or shaping the kernel, so that the constraint holds by construction). The receptive field is crucial: a single masked convolution has a small receptive field, limited to the pixels inside the kernel window that lie above or to the left of the current one. To model long-range dependencies (e.g., object boundaries far away), the network must stack many masked convolutions, gradually expanding the receptive field. Visualizing receptive fields (which previous pixels contribute to a given pixel's prediction) verifies that the causality constraint is satisfied. Deep PixelCNN models have receptive fields spanning most of the preceding image, enabling rich conditional distributions.
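A kernel-masked convolution and a gradient-based causality check might look like the sketch below. It assumes single-channel images (so the ordering among colour channels is ignored) and uses the common "A"/"B" mask naming, where type A also hides the centre pixel and type B keeps it for later layers; `MaskedConv2d` and the tiny two-layer network are illustrative, not the assignment's reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is zeroed at the current (type A only) and future positions."""

    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        _, _, kh, kw = self.weight.shape
        mask = torch.ones(kh, kw)
        start = kw // 2 + (1 if mask_type == "B" else 0)
        mask[kh // 2, start:] = 0   # centre row: centre pixel (type A) and everything to its right
        mask[kh // 2 + 1:, :] = 0   # every row below the centre
        self.register_buffer("mask", mask[None, None])

    def forward(self, x):
        # Apply the mask on every forward pass so training cannot revive masked weights.
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)


torch.manual_seed(0)
net = nn.Sequential(
    MaskedConv2d("A", 1, 4, kernel_size=3, padding=1),
    nn.ReLU(),
    MaskedConv2d("B", 4, 1, kernel_size=3, padding=1),
)

# Backpropagate from one output pixel and inspect which input pixels it depends on.
x = torch.rand(1, 1, 5, 5, requires_grad=True)
net(x)[0, 0, 2, 2].backward()
grad = x.grad[0, 0]

future = torch.zeros(5, 5, dtype=torch.bool)
future[2, 2:] = True   # the pixel itself and the rest of its row
future[3:, :] = True   # all rows below
leaks = bool((grad[future].abs() > 1e-6).any())
print(leaks)
```

Plotting `grad != 0` as an image gives the receptive-field visualization mentioned in the text; any nonzero entry inside `future` would indicate a masking bug.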
The output of PixelCNN is a categorical distribution over pixel values for each location. For 8-bit images (256 possible values per pixel), the network outputs logits over 256 classes for each pixel, which a softmax turns into probabilities. The loss is cross-entropy: for pixel x_{row,col} with true value v, the loss is -log p(x_{row,col} = v | previous pixels). Summing over all pixels and batch samples yields the total loss. Training proceeds with standard backpropagation and optimization, and all pixels of a training image are scored in a single forward pass. Sampling is sequential: sample the first pixel from p(x₁), then the second from p(x₂|x₁), and so on. Sampling an entire image requires H × W forward passes (one per pixel), because each pixel must be drawn before the pixels that condition on it can be predicted; unlike likelihood evaluation, sampling cannot be parallelized across pixels, although some faster variants reduce the cost by caching intermediate activations.
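The sequential sampling loop and the per-pixel cross-entropy can be sketched with a deliberately tiny stand-in model. Everything here is an assumption for illustration: `CausalConv` is a single untrained type-A masked convolution, images are single-channel with only 4 intensity levels, and the helper name `sample` is hypothetical.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConv(nn.Conv2d):
    """3x3 convolution that ignores the current pixel and all later ones (type-A mask)."""

    def __init__(self, in_ch, out_ch):
        super().__init__(in_ch, out_ch, kernel_size=3, padding=1)
        mask = torch.tensor([[1.0, 1.0, 1.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
        self.register_buffer("mask", mask[None, None])

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias, padding=1)


@torch.no_grad()
def sample(model, num_images, height, width, num_levels):
    """Draw pixels in raster order, running one forward pass per pixel."""
    images = torch.zeros(num_images, 1, height, width)
    for r in range(height):
        for c in range(width):
            logits = model(images)[:, :, r, c]               # (B, num_levels)
            probs = torch.softmax(logits, dim=1)
            level = torch.multinomial(probs, num_samples=1)  # (B, 1) integer levels
            images[:, 0, r, c] = level.squeeze(1).float() / (num_levels - 1)
    return images


torch.manual_seed(0)
K = 4                              # number of intensity levels (256 for 8-bit images)
model = CausalConv(1, K)           # untrained stand-in for a full PixelCNN
samples = sample(model, num_images=2, height=4, width=4, num_levels=K)

# Likelihood evaluation needs only ONE forward pass over the whole image.
targets = (samples[:, 0] * (K - 1)).round().long()                # (B, H, W)
nll = F.cross_entropy(model(samples), targets, reduction="sum")  # total NLL in nats
bpd = nll.item() / (targets.numel() * math.log(2))
print(samples.shape, bpd)
```

The contrast between the two halves is the point: sampling costs H × W forward passes, while scoring an existing image costs one.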
Students implement PixelCNN by defining masked convolutions as custom modules, stacking them into a network, and training on image datasets (e.g., MNIST, CIFAR-10). They verify masking by visualizing receptive fields and checking that masked layers preserve causality. They compute test likelihood and bits/dim, comparing to baseline models. They generate samples by sequential sampling, analyzing the quality and diversity of generated images. PixelCNN naturally produces images pixel-by-pixel, often capturing low-level texture details early in generation and higher-level structure later—a phenomenon worth observing and discussing.
PixelCNN showcases how to apply autoregressive principles to structured high-dimensional data. Images contain thousands to millions of pixel values, yet PixelCNN's factorization and efficient masking make training tractable. Later refinements (Gated PixelCNN, PixelCNN++) add gated activations, improved conditional distributions (a discretized mixture of logistics), and skip connections, but the core idea of masked convolutions enforcing causal structure remains. Understanding PixelCNN deeply prepares students for other spatially-structured autoregressive models and the architectural patterns used in diffusion models and other modern generative systems.
Training a generative model is relatively straightforward: initialize parameters, iterate over batches of data, compute loss, backpropagate, and update parameters. The loss is negative log-likelihood (NLL). For MADE and PixelCNN, the forward pass computes model outputs (parameters of conditional distributions), and the backward pass computes gradients of the NLL with respect to parameters. A key practice is gradient monitoring: check that gradients are not vanishing (near zero) or exploding (very large), which indicate pathological behavior. Learning rate is typically constant or decayed over epochs. Checkpointing saves the best model (on validation data) and enables resuming interrupted training.
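Gradient monitoring can be as simple as logging the global gradient norm after each backward pass. The sketch below uses a throwaway linear classifier and random data purely as placeholders; `torch.nn.utils.clip_grad_norm_` returns the norm measured before clipping, so it doubles as a monitoring tool.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

model = nn.Linear(4, 2)                 # placeholder model
x = torch.randn(8, 4)                   # placeholder batch
y = torch.randint(0, 2, (8,))

loss = F.cross_entropy(model(x), y)
loss.backward()

# Returns the total gradient norm BEFORE clipping and rescales gradients in place
# so that their combined norm does not exceed max_norm.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Recompute the norm after clipping to confirm the bound.
clipped_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
print(float(total_norm), float(clipped_norm))
```

A norm that stays near zero for many steps suggests vanishing gradients, while sudden spikes of several orders of magnitude suggest instability.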
Hyperparameter tuning affects convergence and generalization. Key hyperparameters include network architecture (number of layers, width, kernel sizes for convolutions), learning rate, batch size, regularization (L2 weight decay, dropout), and training duration. Too small a learning rate slows convergence; too large causes instability. Too small a batch size gives noisy gradient estimates; too large can exhaust GPU memory. Regularization limits overfitting, which is critical when fitting complex models on finite data. Students tune hyperparameters via grid search, random search, or Bayesian optimization, comparing validation loss across configurations. This exercise builds intuition for balancing model capacity, data size, and computational budget.
Evaluation relies on multiple metrics. Test NLL is the primary metric: lower NLL on held-out test data indicates better generalization. Since NLL depends on the number of dimensions, bits-per-dimension (NLL / (num_dims · log(2)), with NLL in nats) provides a normalized, comparable metric. For example, a constant model that predicts the same distribution for every dimension gives a baseline bits/dim (8 bits/dim for a uniform distribution over 256 pixel values); the learned model should do better. Tracking both train and test metrics reveals overfitting: train loss decreases while test loss plateaus or increases. Plotting learning curves (loss vs. epoch) helps diagnose training dynamics and select checkpoint criteria (e.g., early stopping based on validation loss).
Sampling is both a practical output (generate images) and an evaluation tool. Sequential sampling—generating one variable at a time conditioned on previous ones—is the standard method and works for any autoregressive model. For images, this means generating pixel-by-pixel. Sampling quality is assessed qualitatively (do images look realistic?) and quantitatively (e.g., via Inception Score, FID, or other metrics, though these are more advanced). Comparing samples from different model architectures and training procedures highlights which design choices improve generative quality. Students generate samples at different stages of training, noting how sample quality improves as the model trains.
Common pitfalls include: (1) incorrect likelihood computation (often due to masking bugs in MADE/PixelCNN), leading to likelihoods that either fail to improve despite training or look implausibly good; (2) shape mismatches in loss computation (ensure the batch dimension is handled consistently); (3) forgetting to zero gradients between updates (which causes gradients to accumulate across steps); (4) overfitting due to insufficient regularization or excessive model capacity; (5) numerical instability (large logits in softmax, log of zero). Debugging involves printing intermediate tensor shapes and values, checking loss components (individual conditionals' losses), and verifying that the model is actually learning by monitoring validation loss. These practices (careful metric tracking, multiple evaluation perspectives, and systematic debugging) are essential for any deep learning project.
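Pitfall (5) can be reproduced in a few lines. The sketch below uses deliberately extreme logits to show why computing log-probabilities with a fused log-softmax is safer than taking the log of a softmax; the specific values are illustrative.

```python
import torch

logits = torch.tensor([1000.0, 0.0, -1000.0])  # deliberately extreme values

# Two-step version: softmax underflows to exactly 0 for the small logits,
# and log(0) then produces -inf, which poisons the loss and its gradients.
naive = torch.log(torch.softmax(logits, dim=0))

# Fused version: computes logits - logsumexp(logits) without leaving log space.
stable = torch.log_softmax(logits, dim=0)

print(naive)
print(stable)
```

Losses such as `torch.nn.functional.cross_entropy` take raw logits and apply the stable form internally, so passing probabilities that have already been through a softmax is both unnecessary and incorrect.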
Problem Set 1 provides a complete end-to-end experience in generative modeling. Students learn that probability theory—specifically, the chain rule and conditional factorization—provides a principled framework for modeling complex distributions. They see how neural networks, carefully constrained via masking, can parameterize these conditional distributions. They implement and train working models, observing the progression from theory to code to empirical results. This trajectory—from mathematics to implementation to evaluation—repeats throughout the course and mirrors how research in generative modeling is conducted.
The autoregressive perspective is foundational. It applies to any data type (text, images, audio, structured data) provided an ordering is defined. RNNs and Transformers, the dominant architectures in NLP, are typically trained as autoregressive models. Diffusion models and flow-based models, though different in mechanism, also draw on autoregressive ideas (for example, autoregressive flows). Mastery of autoregressive factorization and the techniques for enforcing causality in networks opens doors to understanding and developing models across these domains. The specific architectures (MADE, PixelCNN) will be refined or replaced by newer methods, but the underlying principles are enduring.
Likelihood-based evaluation is a core advantage of autoregressive models. Unlike generative adversarial networks (GANs, covered later), autoregressive models compute the exact likelihood they assign to data and optimize it directly. This enables unambiguous model comparison, principled hyperparameter selection, and theoretical analysis. However, likelihood and sample quality do not always correlate perfectly: a model with high likelihood may produce blurry samples, while a low-likelihood model may generate sharp but unrealistic samples. Understanding this tradeoff, and the different evaluation criteria appropriate for different applications, is crucial for deploying generative models responsibly.
Implementation skills are as important as theory. Debugging neural networks requires careful attention to shapes, values, and gradients. Writing modular, well-commented code enables fast iteration and collaboration. PyTorch's abstractions (autograd, modules, optimizers) handle much of the complexity, but understanding what happens under the hood—how gradients are computed, how parameters are updated—deepens capability. Students who implement from scratch, rather than blindly copying code, develop intuition and debugging skills that prove invaluable when facing novel problems or debugging sophisticated models.
Problem Set 1 is a foundation on which the course builds. Later problem sets introduce variational autoencoders (where we cannot directly compute likelihood but learn to lower-bound it), diffusion models (where the generative process is learned step-by-step), flow-based models (which compose invertible transformations), and energy-based models (which use unnormalized distributions). Each approach has strengths and weaknesses. Understanding autoregressive models deeply—their efficiency, expressiveness, limitations, and practical considerations—provides context and intuition for comparing these alternatives. The skills, concepts, and code patterns from PS1 recur and extend throughout the course, making depth of understanding a lasting asset.
Foundational Papers
Intuitive Explanations & Blogs
Background & Mathematical Foundations
Implementation Resources
Course Materials
How to Use These References: Start with conceptual blogs (Lilian Weng) for intuition, then dive into papers for rigorous mathematics. Use PyTorch documentation as needed for implementation. Return to Wikipedia articles to refresh background concepts. The full course website and Stanford materials contextualize PS1 within the broader generative modeling landscape.