Problem Set Structure
Problem Set 1 contains three problems worth 25 total points, distributed across weeks 1-2 material from XCS229. Each problem targets a distinct learning area but builds toward a unified understanding of supervised learning foundations.
Problem 1 (GLM Convexity): Theoretical—prove that the negative log-likelihood of any GLM is convex. This problem emphasizes mathematical rigor and understanding exponential family distributions. Expect to work with Hessians, positive semi-definite matrices, and properties of the log-partition function.
Problem 2 (Feature Maps): Mixed—implement explicit feature maps for polynomial kernels, compare computational costs, and explain the kernel trick. This problem bridges theory and implementation, requiring both mathematical insight and coding ability.
Problem 3 (Stability): Theoretical/computational—prove uniform stability bounds for regularized learning, compute leave-one-out errors, and connect stability to generalization. This problem requires understanding algorithmic stability and its implications for generalization.
Key Themes
All three problems emphasize why we use particular techniques in machine learning. Convexity explains why we can optimize GLMs efficiently. Feature maps show how to build expressive models while managing dimensionality. Stability theory justifies regularization and predicts generalization.
Time Management
Allocate your time strategically. Problem 1 is the most theoretical—allow sufficient time for careful proofs. Problem 2 combines theory and coding—test your implementations thoroughly. Problem 3 requires connecting abstract concepts to concrete algorithms.
Exponential Family Distributions
GLMs are built on exponential family distributions, which have the canonical form: p(y|x,θ) = b(y) exp(θ^T T(y) - A(θ)) where T(y) is the sufficient statistic, A(θ) is the log-partition function, and b(y) is the base measure.
For supervised learning, we parameterize θ as a linear function of the input: θ = θ(w,x) = w^T φ(x) for some feature map φ. The log-likelihood for a single example becomes log p(y|x,w) = log b(y) + w^T φ(x) T(y) - A(w^T φ(x)).
Negative Log-Likelihood
The NLL for a single example is: ℓ(w) = -log p(y|x,w) = -log b(y) - w^T φ(x) T(y) + A(w^T φ(x)). For a dataset of n examples, the total NLL is the sum: L(w) = Σ ℓ_i(w).
The structure of the NLL is key: it's the log-partition function A(w^T φ(x)) minus a linear term. The log-partition function is always convex for exponential families, and the linear term doesn't affect convexity. Therefore, the NLL is convex.
Hessian Analysis
To prove convexity rigorously, compute the Hessian ∇²ℓ(w). For a single example: ∇²ℓ(w) = ∇² A(w^T φ(x)) = A''(w^T φ(x)) φ(x) φ(x)^T. Here, A''(w^T φ(x)) is the second derivative of the log-partition function with respect to its argument.
The key fact: A''(η) = Var[T(y)|η] ≥ 0 for any exponential family. This variance term is always non-negative, making ∇²ℓ(w) = A''(w^T φ(x)) φ(x) φ(x)^T a positive semi-definite matrix (outer product scaled by a non-negative scalar).
Proof Strategy
Your proof should: (1) Write the NLL for a GLM, (2) Compute the Hessian, (3) Recognize it as a variance-weighted outer product, (4) Conclude positive semi-definiteness and thus convexity. Connect each step to exponential family properties.
Convexity Definition
A function f is convex if for all x, y and λ ∈ [0,1]: f(λx + (1-λ)y) ≤ λf(x) + (1-λ)f(y). Equivalently, for twice-differentiable f: the Hessian ∇²f(x) is positive semi-definite for all x.
For the NLL of a GLM, we use the Hessian criterion. Compute ∇²L(w) where L(w) = Σ ℓ_i(w) is the total NLL. The Hessian is a sum of individual Hessians: ∇²L(w) = Σ ∇²ℓ_i(w).
Step-by-Step Proof Outline
Step 1: Write the single-example NLL: ℓ_i(w) = -w^T φ(x_i) T(y_i) + A(w^T φ(x_i)). The first term is linear in w (Hessian is 0). The second term's Hessian is what matters.
Step 2: Differentiate once: ∇ℓ_i(w) = -φ(x_i) T(y_i) + A'(w^T φ(x_i)) φ(x_i). Here, A'(η) = E[T(y)|η] is the expected sufficient statistic.
Step 3: Differentiate again: ∇²ℓ_i(w) = A''(w^T φ(x_i)) φ(x_i) φ(x_i)^T. Recognize φ(x_i) φ(x_i)^T as an outer product (positive semi-definite for any vector).
Step 4: Note that A''(η) = Var[T(y)|η] ≥ 0. This is a fundamental property of log-partition functions: they're convex, so their second derivative is non-negative.
Step 5: Conclude: ∇²ℓ_i(w) = A''(w^T φ(x_i)) φ(x_i) φ(x_i)^T is a non-negative scalar times a positive semi-definite matrix, hence positive semi-definite.
Step 6: The total Hessian ∇²L(w) = Σ ∇²ℓ_i(w) is a sum of positive semi-definite matrices, hence positive semi-definite. Therefore L(w) is convex.
Key Insights
The proof hinges on the exponential family property that A''(η) = Var[T(y)|η]. This variance is always non-negative, which is why all GLMs have convex loss functions. This guarantees that any local minimum is a global minimum, enabling efficient optimization via gradient descent.
Common Mistakes
Don't confuse the Hessian of A(η) with the Hessian of the NLL. The NLL involves A(w^T φ(x)), so you need to apply chain rule: the Hessian of the composition is A''(w^T φ(x)) φ(x) φ(x)^T, not just A''(w^T φ(x)). Pay careful attention to matrix calculus.
Explicit vs. Implicit Representations
A feature map φ: ℝ^d → ℝ^m maps input vectors to higher-dimensional feature vectors. For example, the polynomial degree-2 feature map on ℝ^2 is φ(x) = [x_1, x_2, x_1², x_1 x_2, x_2²]^T, producing a 5-dimensional output.
The explicit feature map approach: compute φ(x) directly, then apply a linear model w^T φ(x). The kernel method: implicitly work in feature space by computing inner products ⟨φ(x), φ(x')⟩ without explicitly constructing φ(x).
Polynomial Kernels
The polynomial kernel of degree d is: K(x, x') = (⟨x, x'⟩ + 1)^d. This kernel implicitly corresponds to a feature map that includes all monomials of degree ≤ d. For d=2 in ℝ^2: K(x,x') = (x_1x_1' + x_2x_2' + 1)² expands to exactly the inner product of explicit features [1, x_1, x_2, x_1², √2 x_1 x_2, x_2²].
In this problem, you'll work with explicit polynomial features, computing the feature dimension and the computational cost of evaluating features and their inner products.
Computational Analysis
For a degree-d polynomial feature map on d-dimensional inputs, the number of features (feature dimension) is (d+d choose d) = O(d^d), which grows very quickly. For d=10 input features and degree 3, this becomes thousands of features.
The kernel trick avoids constructing these features explicitly. Instead, compute K(x_i, x_j) = (⟨x_i, x_j⟩ + 1)^d directly in input space. For each pair of examples, this takes O(d) time (computing the inner product) plus O(d) time (raising to power d), versus O(d^d) time to construct both feature vectors explicitly.
Implementation Hint
When implementing polynomial features, use recursive definitions or combinatorial generation to construct all monomials systematically. Compute feature vectors explicitly for small examples to verify correctness, then time the operations and compare to kernel method timings.
Dimensionality Growth
The number of polynomial features of degree exactly d (not ≤ d) in p dimensions is the number of ways to choose d items from p with replacement: (p+d-1 choose d) ≈ p^d / d! for fixed p as d grows. For all features of degree ≤ d, sum over all degrees: ≈ O((p+d choose p)).
For concrete example: 10 input features, polynomial degree 3 gives (10+3 choose 3) = 286 features. Degree 4 gives (10+4 choose 4) = 1001 features. Degree 5 gives 2002 features. This rapid growth is why explicit feature maps become impractical for high-degree polynomials.
Computational Cost Comparison
Explicit approach: Construct feature vector for one example: O(number of features). For n examples, building the feature matrix is O(n × number of features). Training a linear model is O(n × m²) where m is number of features.
Kernel method: Computing the kernel matrix is O(n² × d) where d is input dimension. Training using kernel methods (e.g., kernel ridge regression) is O(n³) for solving the normal equations. The advantage: d is small and fixed, while the number of features can be huge.
The Kernel Trick
The key insight: many learning algorithms depend only on inner products between examples, not on the examples themselves. If you can compute ⟨φ(x_i), φ(x_j)⟩ efficiently (via a kernel function), you never need to construct φ explicitly.
For polynomial features, K(x,x') = (⟨x, x'⟩ + 1)^d computes the inner product in feature space in O(d) time instead of O(d^d) time. This is the computational breakthrough that makes kernel methods practical.
When Explicit vs. Implicit?
Use explicit features when: the feature dimension is moderate (< 1000), you want interpretable features, or you're using algorithms that don't depend only on inner products. Use kernels when: the feature dimension would be very large, you're using SVMs or kernel ridge regression, or you want to use non-polynomial kernels (RBF, etc.).
Analysis Checklist
Your analysis should quantify: the exact feature dimension for polynomial degree d and input dimension p, the time to compute one feature vector, the time for n feature vectors, and compare to kernel method timing. Include actual numbers for a concrete example.
Algorithmic Stability Concept
An algorithm A is stable if small changes to the training set cause small changes to the learned hypothesis. Formally, uniform stability requires: for all training sets S and S' differing in one example, and all test examples z, |A(S)(z) - A(S')(z)| ≤ β for some small β.
Stable algorithms generalize well: if training error equals test error on average (which stability ensures), then low training error implies low test error. This is why regularization works—it enforces stability by preventing overfitting to particular training examples.
Leave-One-Out Cross-Validation
Compute the leave-one-out error: for each training example (x_i, y_i), remove it from the training set, train on the remaining n-1 examples, and measure error on the held-out example. The leave-one-out error is an empirical estimate of test error and reveals how stable the algorithm is.
For a stable algorithm, leave-one-out error ≈ training error. For an unstable algorithm (e.g., memorizing the training set), leave-one-out error >> training error. This reveals overfitting and suggests increasing regularization.
Stability-Generalization Connection
Theorem (informal): If an algorithm is uniformly stable with stability parameter β, then with high probability over random training sets, test error ≤ training error + O(β√(log(1/δ)/n)). This bound depends on stability β and sample size n, not directly on model complexity.
This is remarkable: instead of bounding complexity via VC dimension or Rademacher complexity, we bound generalization via algorithmic properties. This explains why regularization (which increases stability) improves generalization even for high-complexity model classes.
Why Stability Matters
Stability connects optimization, regularization, and generalization. Convex objectives (from earlier in this problem set!) enable stable optimization. Regularization increases stability. Stability guarantees generalization. This forms the complete picture of supervised learning: optimize a convex, stable objective to achieve good generalization.
Uniform Stability in Regularized Learning
For regularized linear regression: w = (X^T X + λI)^(-1) X^T y. Uniform stability can be proved by analyzing how this solution changes when one example is removed. The key is the matrix inversion lemma and properties of regularization.
Proof sketch: When example i is removed, the solution changes by Δw = (X^T X + λI)^(-1) x_i e_i where e_i is the residual. The magnitude of this change is bounded by (1/λ) × ||x_i|| × |e_i|. The bound holds uniformly across all examples, establishing stability.
Regularization Parameter and Stability
Larger λ (stronger regularization) leads to smaller stability parameter β. This makes intuitive sense: stronger regularization makes the solution less sensitive to individual examples. In the limit λ → ∞, the algorithm ignores the data (very stable but high bias). The optimal λ balances stability and fitting.
The leave-one-out error for ridge regression has a closed form: LOO error = Σ (y_i - ŷ_i)² / (1 - H_ii)² where H_ii is the i-th diagonal of the hat matrix X(X^T X + λI)^(-1) X^T. Examples with high H_ii (high leverage) contribute more to LOO error, revealing which examples most affect stability.
Computing Stability Bounds
For ridge regression: The uniform stability parameter is β = O(||x||²_max / λ). This makes the dependency explicit: stability improves (β decreases) with stronger regularization and decreases with larger input norms.
Generalization bound: With stability β, the generalization error is at most training error + O(β√log(1/δ)/n) with probability 1-δ. For ridge regression with λ proportional to √n, this gives a bound of O(1/√n), which matches standard generalization rates.
Practical Implications
Use leave-one-out error or cross-validation to estimate stability empirically. If LOO error >> training error, the algorithm is unstable—increase regularization. If LOO error ≈ training error, the algorithm is stable. This test is more direct than model complexity arguments and works for any learning algorithm.
Verification Strategy
Compute leave-one-out error for small datasets with different λ values. Observe how LOO error decreases as λ increases, then increases again (bias-variance tradeoff). The optimal λ minimizes LOO error, revealing the stability-bias balance.
The Three Pillars of Supervised Learning
Optimization: GLMs have convex loss functions, enabling efficient gradient-based optimization to global optima. This isn't specific to GLMs—convex objectives are central to machine learning because they guarantee optimal solutions can be found efficiently.
Representation: Feature maps (explicit or implicit via kernels) provide flexible expressivity while managing computational complexity. The kernel trick bridges high-dimensional feature spaces with efficient input-space computation, enabling powerful non-linear learning.
Generalization: Stability theory explains why regularized learning generalizes. By making the algorithm less sensitive to individual training examples, regularization ensures that training error reflects test error. This provides a unified view of why regularization works across different models.
Connections Across Problems
From Problem 1 to Problem 2: Convex optimization of GLMs is only useful if you can represent the data effectively. Feature maps let you build expressive models while maintaining the convexity that enables efficient optimization.
From Problem 2 to Problem 3: Expressive models risk overfitting. Stability theory explains how regularization controls expressivity—by enforcing stability, regularization ensures that more flexible models don't sacrifice generalization. This explains the bias-variance tradeoff from a stability perspective.
Unified View: Start with a flexible model class (features). Make the loss convex (GLMs). Add regularization for stability (generalization). Optimize efficiently (gradient descent on convex objectives). This recipe, grounded in theory, is how modern machine learning works.
Advanced Connections
Kernels and Stability: Kernel methods inherit stability from the learning algorithm. RBF kernels with regularization are stable because they're used with regularized learning algorithms. The expressivity of the kernel (controlled by bandwidth parameter) trades off with stability.
Neural Networks and Stability: Deep learning doesn't always use convex objectives or explicit stability analysis, yet it generalizes. This remains an open area of research. Some hypotheses: implicit regularization from SGD provides stability, the overparameterized regime has implicit stability properties, or margin-based stability arguments apply.
Beyond Supervised Learning: The three pillars extend to other settings. In unsupervised learning, optimization and representation remain key, while stability takes different forms (e.g., robustness to perturbations). Understanding these foundations prepares you for advanced machine learning research.
Synthesis Challenge
Write a page explaining how a regularized GLM (e.g., logistic regression with L2 regularization) instantiates all three pillars: why its loss is convex (optimization), what feature map it implicitly uses (representation), and how regularization provides stability (generalization). This synthesis demonstrates mastery of Problem Set 1.
Stanford CS229 Course Materials
Stanford CS229: Machine Learning — Official course page with lecture notes, problem sets, and syllabus. Essential resource for all topics covered in this problem set.
CS229 Lecture Notes & Sections — Comprehensive notes on supervised learning, covering exponential families, GLMs, kernels, and generalization theory. Weeks 1-2 notes are directly relevant to this problem set.
Core Concepts: GLMs and Convexity
Wikipedia: Generalized Linear Models — Overview of GLM framework, exponential family parameterization, and link functions.
Wikipedia: Exponential Family — Mathematical foundations of exponential families, sufficient statistics, and log-partition functions essential for understanding GLM convexity.
Feature Maps and Kernels
Wikipedia: Kernel Methods — Introduction to kernel tricks, feature maps, and kernel matrices. Explains the connection between explicit and implicit feature representations.
Wikipedia: Polynomial Kernel — Specific coverage of polynomial kernels and their corresponding feature maps.
Stability and Generalization
Wikipedia: Algorithmic Stability — Introduction to stability definitions and connections to generalization bounds.
The Elements of Statistical Learning (ESL) - Free PDF — Comprehensive reference for convex optimization, regularization, kernel methods, and generalization theory. Chapters 2-4 and Chapter 12 (Unsupervised Learning) are particularly relevant.
Advanced References
Wikipedia: Convex Functions — Mathematical foundations of convexity, Hessian analysis, and positive semi-definiteness.
Wikipedia: Hessian Matrix — Detailed treatment of Hessian matrices and their role in determining function properties.
How to Use These References
Start with the Stanford CS229 course page and relevant lecture notes for your problem area. Use Wikipedia for quick refreshers on mathematical concepts. Consult ESL for rigorous proofs and practical context. As you work through each problem, cross-reference the relevant sections to deepen understanding and verify your approach.