STANFORD XCS229 · PROBLEM SET 3
PS3: Poisson & Kernels
35 Points · GLM, Kernel Methods & Perceptron
01

PS3 Overview

Problem Set 3 is a 35-point assessment covering three core topics: Poisson regression as a generalized linear model, kernel methods and positive semi-definite kernels, and the perceptron algorithm with convergence guarantees. These problems span weeks 3–4 and connect distributional assumptions to online learning.

The assignment builds toward a deeper understanding of how different model families (exponential family distributions, kernel-defined hypothesis spaces, and online learners) each offer unique perspectives on supervised and unsupervised learning.

02

Poisson Regression

Poisson regression models count data by assuming responses follow a Poisson distribution with mean λ = e^(θ^T x). The exponential family parameterization reveals why the log link is natural: it ensures λ > 0 and linearizes the relationship between features and the rate parameter.

Problem 01 derives the MLE and explores how Poisson assumptions differ from Gaussian regression, with implications for heteroscedasticity and prediction intervals on count-valued outcomes.

03

Poisson MLE & GLM

The MLE for Poisson regression follows from the exponential family framework, yielding update rules resembling gradient descent on the Poisson loss. The generalized linear model (GLM) perspective unifies Poisson, logistic, and linear regression under one framework: link function, variance function, and canonical parameter.

Key insight: the log link is canonical for Poisson; Newton–Raphson updates simplify when working with the natural parameterization, and convergence properties depend on the design matrix and response scale.

04

Kernel Construction

Valid kernels must satisfy Mercer's theorem: they correspond to inner products in a possibly infinite-dimensional feature space. Kernel composition (sums, products, powers of kernels) preserves validity; polynomial and RBF kernels arise from explicit or implicit feature maps, allowing efficient computation in high-dimensional spaces.

Problem 02 verifies kernel properties through positive semi-definiteness and explores dual representations. The kernel trick enables algorithms (SVM, kernel ridge regression) to avoid explicit feature construction.

05

Kernel Properties

A kernel K(x, x') is valid if and only if its Gram matrix is positive semi-definite for all finite point sets. Polynomial kernels (x^T x' + 1)^d arise from polynomial feature expansions; RBF kernels exp(-γ ||x - x'||^2) are infinite-dimensional and nonlinear. Both satisfy Mercer's conditions when parameters are appropriately chosen.

Understanding the spectrum and trace of kernel matrices reveals expressive capacity: larger eigenvalues indicate richer feature representations, guiding regularization and hyperparameter selection.

06

Perceptron Algorithm

The perceptron is an online algorithm that iteratively corrects misclassified examples. On linearly separable data with margin γ, it achieves at most R^2 / γ^2 mistakes, where R is the data radius. This mistake bound reveals a fundamental trade-off: smaller margins require more corrections.

The perceptron is not a probabilistic model; it directly optimizes a non-convex loss and lacks natural confidence estimates. Its simplicity and online nature make it suitable for streaming data and theoretical analysis of online learning.

07

Perceptron Convergence

Convergence proof sketches show that if data is linearly separable with margin γ and radius R, the algorithm terminates in O(R^2 / γ^2) iterations. Each mistake reduces the margin between the weight vector and any separating direction by at least γ; once sufficiently many corrections accumulate, perfect separation emerges.

The geometric interpretation centers on the angle between the current weight vector and the true separator: mistakes occur when this angle is too large, pushing the weight toward alignment with the separator.

08

Key Takeaways

Problem Set 3 connects three perspectives: exponential family distributions (Poisson GLM) model heteroscedastic count responses; kernel methods parameterize hypothesis spaces and enable nonlinear boundaries; online algorithms (perceptron) learn from streaming data with provable efficiency. Together, they form the conceptual bridge from probabilistic models to nonparametric learning to online adaptation.

Mastery of these topics prepares you for advanced subjects: variational inference, structured prediction, and deep learning—all of which rely on understanding distributional assumptions, representation learning, and optimization dynamics.

09

References & Further Reading

Key references for Poisson regression, kernel methods, and the perceptron algorithm. Stanford CS229 lectures cover GLMs including Poisson regression, kernel machines, and online learning. Murphy's textbook provides excellent intuitions on exponential families and kernel methods. Hastie, Tibshirani, and Friedman's ESL is authoritative for both theory and practice. For perceptron convergence proofs, see classic online learning theory papers and texts on statistical learning.

Wikipedia articles cover Poisson distributions, kernel methods, positive semi-definiteness, and the perceptron. These provide quick refreshers on definitions and properties. Academic papers on kernel theory (Mercer's theorem) and online learning bounds (Novikoff's theorem) are available through Stanford's repository.