Stanford XCS229 · Problem Set 4
PS4: KL & Deep Learning
50 Points · Divergence, MNIST, Spam & Double Descent
01

PS4 Overview

Problem Set 4 represents the most substantial assignment in XCS229, totaling 50 points across four interconnected problems. This problem set synthesizes material from weeks 5–7, covering advanced topics in information theory, deep learning, and generalization phenomena. Students engage with theoretical foundations alongside empirical implementations, deepening understanding of how machine learning systems learn and generalize.

The assignment progresses logically from foundational information-theoretic concepts through practical neural network implementation on real-world datasets, concluding with exploration of surprising empirical phenomena. Each problem builds on prior knowledge while introducing novel complexity that mirrors research-level challenges. The heaviest weighting encourages thorough engagement with deep learning fundamentals and their mathematical underpinnings.

50 pts
Total Weight
4
Problems
Weeks 5–7
Coverage
MNIST
Dataset
02

KL Divergence

Kullback-Leibler (KL) divergence, denoted D_KL(P||Q), quantifies the difference between two probability distributions. Unlike symmetric distance metrics, KL divergence is fundamentally asymmetric: D_KL(P||Q) ≠ D_KL(Q||P). This asymmetry reflects an important interpretation: KL divergence measures the information loss when using distribution Q to approximate the true distribution P.

The mathematical definition D_KL(P||Q) = Σ_x P(x) log(P(x)/Q(x)) reveals that KL divergence is always non-negative, equaling zero if and only if P and Q are identical. This fundamental property emerges from Gibbs' inequality and connects deeply to information theory. KL divergence appears throughout machine learning: in the Evidence Lower Bound (ELBO) of variational inference, as the connection between maximum likelihood estimation and cross-entropy loss, and in understanding model mismatch penalties.

Definition & Properties

Non-negativity, asymmetry, equality condition P = Q, connection to relative entropy, measurability implications.

Intuition

Quantifies surprise or information loss when using approximation Q instead of true P. Asymmetry reflects directional interpretation.

Mathematical Foundation

Emerges from Gibbs inequality, relates to mutual information, connects to Jensen's inequality via log concavity.

Practical Significance

Central to variational inference, maximum likelihood estimation, and cross-entropy loss derivations in neural networks.

03

KL Applications

The connection between KL divergence and maximum likelihood estimation reveals why cross-entropy loss is ubiquitous in deep learning. Minimizing KL divergence D_KL(P_data||P_model) is equivalent to maximizing the log-likelihood of the model on observed data. This equivalence explains why cross-entropy loss L = -Σ_i y_i log(ŷ_i) emerges naturally as the objective for classification tasks.

In variational inference, KL divergence provides the mathematical foundation for the Evidence Lower Bound (ELBO). The ELBO = E_q[log p(x,z)] - E_q[log q(z|x)] decomposes into a reconstruction term and a KL regularizer: E_q[log p(x|z)] - D_KL(q(z|x)||p(z)). This decomposition guides many modern generative models, including Variational Autoencoders (VAEs). The KL term encourages the learned posterior q(z|x) to stay close to the prior p(z), balancing expressivity with regularization. Understanding these connections enables deeper insights into why these loss functions work and how they relate to fundamental information-theoretic principles.

MLE
Minimizes KL
Cross-Entropy
NLL Connection
ELBO
VAE Foundation
Regularization
KL Penalty
04

MNIST Neural Network

The MNIST problem—classifying handwritten digits into 10 categories—serves as the canonical deep learning benchmark. The problem involves training a neural network on 60,000 labeled training images of 28×28 pixels, with a test set of 10,000 unseen images. Successful implementations require thoughtful choices about network architecture, activation functions, optimization hyperparameters, and regularization strategies.

Architecture design involves decisions about layer width, depth, and non-linearities. A simple feedforward network with one or two hidden layers of 128–512 units often suffices for MNIST, though the problem admits interesting explorations of deeper networks. ReLU activations typically outperform sigmoid/tanh, providing better gradient flow and faster training. Hyperparameter tuning—learning rate, batch size, weight decay—significantly impacts convergence speed and final accuracy. Batch normalization can stabilize training and allow higher learning rates. The training loop requires careful implementation of forward passes, loss computation, backpropagation, and parameter updates, often using frameworks like PyTorch or TensorFlow that automate differentiation.

Architecture Choices

Layer widths (128–512), depth (1–3 hidden layers), activation functions (ReLU preferred), batch normalization.

Training Loop

Forward pass, cross-entropy loss, backpropagation, SGD or Adam optimization, learning rate schedules.

Hyperparameters

Learning rate (0.001–0.01), batch size (32–128), epochs (10–50), weight decay for L2 regularization.

Implementation Details

Input normalization (pixel values 0–255 → 0–1), random initialization, validation/test split, checkpoint saving.

05

MNIST Analysis

After training, systematic evaluation requires detailed accuracy analysis and interpretation of failure modes. On MNIST, well-trained networks achieve 97–99% test accuracy. However, aggregate accuracy masks important patterns: certain digit pairs (e.g., 4 vs 9, 3 vs 8) are frequently confused, while others (0 vs 1) are rarely misclassified. Confusion matrices reveal these patterns, exposing which digits the model struggles with and which confusions occur systematically.

Misclassification analysis uncovers deeper insights. Examining images the network misclassifies reveals that some errors are genuinely ambiguous (handwritten digits with unusual stroke patterns), while others reflect real weaknesses (systematic confusion of morphologically similar digits). Per-class accuracy metrics show which digit categories the model masters and which remain challenging. These analyses inform data collection strategies, augmentation approaches, and whether additional architectural modifications might help. Visualization of learned features—examining convolutional filters or hidden activations—reveals that early layers learn edge detectors while deeper layers build digit-specific shape detectors, confirming expected hierarchical feature learning.

97–99%
Test Accuracy
10
Digit Classes
100
Confusion Pairs
Per-Class
Metrics
06

Spam Classification

Spam detection demonstrates machine learning on text data, comparing Naive Bayes and logistic regression approaches. Both models treat documents as vectors of word frequencies (bag-of-words representation), ignoring word order but capturing presence and frequency of informative terms. Naive Bayes assumes conditional independence of features given the class, while logistic regression learns discriminative decision boundaries without independence assumptions.

Naive Bayes offers interpretability: examining log-odds log(P(spam)/P(ham)) for each word reveals which terms most strongly indicate spam (e.g., "viagra," "click," "free") versus legitimate email. Logistic regression learns similar patterns but through discriminative optimization, often achieving slightly better accuracy by fitting the decision boundary directly. Feature engineering impacts both models: lemmatization, stop word removal, and tf-idf weighting improve performance. The tradeoff between model simplicity (Naive Bayes) and accuracy (logistic regression) depends on dataset size, sparsity, and computational constraints. Both models achieve 95%+ accuracy on standard benchmarks, making them practical baselines for production systems.

Naive Bayes

Generative model, conditional independence assumption, interpretable word-class associations, efficient training.

Logistic Regression

Discriminative model, no independence assumption, better calibration, linear decision boundary in feature space.

Feature Engineering

Bag-of-words representation, lemmatization, stop word removal, tf-idf weighting, vocabulary pruning.

Evaluation

Accuracy, precision/recall (false positive cost), ROC curves, calibration assessment, dataset balance considerations.

07

Double Descent

The double descent phenomenon contradicts classical statistical theory, which predicts that test error decreases as model complexity increases up to an interpolation threshold, then increases monotonically due to overfitting. Double descent reveals a second descent: test error initially increases beyond interpolation (as expected), but then decreases again as model complexity continues to increase. This surprising U-shaped curve with a second descent challenges traditional bias-variance tradeoff intuition.

This phenomenon emerges in neural networks trained to zero training error, particularly with early stopping removed. The interpolation threshold—where training error reaches zero—represents the critical transition point. Beyond this threshold, classical theory predicts increasing overfitting, yet empirically test error improves. Understanding double descent requires careful analysis of implicit regularization from stochastic gradient descent (SGD). Even without explicit regularization (dropout, weight decay), SGD's stochastic sampling provides regularization, preventing pathological overfitting. The phenomenon reflects that extremely high-capacity models trained to interpolation often generalize better than moderately-sized models, challenging conventional wisdom about the bias-variance tradeoff. This benign overfitting—interpolation without degradation—has profound implications for understanding deep learning's success.

2
Descent Phases
Interpolation
Threshold
SGD
Regularization
Benign
Overfitting
08

Key Takeaways

Problem Set 4 bridges theory and practice, revealing how information-theoretic principles ground modern machine learning. The KL divergence foundation explains why cross-entropy loss emerges naturally from maximum likelihood, connecting abstract mathematics to practical loss functions. This theoretical understanding deepens when implementing neural networks on MNIST, where the training process becomes concrete: forward passes compute likelihoods, backpropagation optimizes the cross-entropy objective, and regularization prevents overfitting by penalizing model complexity.

The assignment culminates in double descent, a phenomenon that simultaneously humbles classical theory and inspires deeper analysis. By exploring how model complexity, implicit regularization, and training dynamics interact, students develop sophisticated mental models of machine learning. The spam classification problem reinforces that generative (Naive Bayes) and discriminative (logistic regression) paradigms offer complementary perspectives, each with distinct advantages. Together, these problems illustrate that deep learning's success emerges not from any single principle but from careful interplay of architecture, optimization, regularization, and understanding the fundamental limits and surprises of high-dimensional statistics.

Strengths & Insights

  • Information theory grounds practical machine learning loss functions
  • Empirical exploration challenges classical statistical intuitions
  • Multiple modeling paradigms illuminate complementary aspects
  • Deep understanding of generalization emerges from careful analysis

Challenges & Open Questions

  • Intuition for why benign overfitting occurs remains incomplete
  • Scaling insights from MNIST to real-world images is non-trivial
  • Feature engineering for text remains partially art, not science
  • Understanding implicit regularization in SGD requires deep theory
09

References & Further Reading

Essential resources for KL divergence, information theory, neural networks, and modern generalization phenomena. Stanford CS229 lectures cover information theory and loss functions weeks 5-7. Papers on double descent (Belkin et al., Bartlett et al.) are foundational. Murphy's textbook provides intuitive coverage of all topics. ESL remains invaluable for statistical learning theory and classical baselines.

Wikipedia articles quickly establish definitions and intuitions for KL divergence, cross-entropy, neural networks, and backpropagation. The MNIST dataset documentation and PyTorch/TensorFlow tutorials provide implementation guidance. For double descent, consult modern deep learning papers and surveys on implicit regularization in SGD.