This problem set comprises three problems worth 35 points total, distributed across weeks 3–4 of the XCS229 curriculum. Each problem targets specific learning objectives and connects core concepts from supervised learning, probabilistic modeling, and online algorithms.
Problem Breakdown
Problem 01 (Poisson Regression) — 12 points. Derive MLEs and update rules for count data modeling using Poisson distributions. Verify exponential family properties and compare with Gaussian regression assumptions.
Problem 02 (Kernels) — 12 points. Verify kernel validity through Mercer's theorem and positive semi-definiteness. Construct kernel combinations and analyze implicit feature maps. Implement kernel methods in a toy problem.
Problem 03 (Perceptron) — 11 points. Implement the perceptron algorithm on linearly separable data. Analyze convergence bounds as a function of margin and radius. Compare with gradient-based methods and discuss online vs. batch paradigms.
Learning Objectives
- Understand exponential family distributions and their role in GLM frameworks
- Master kernel methods, Mercer's theorem, and implicit feature spaces
- Analyze online learning algorithms and convergence rates
- Connect probabilistic models, nonparametric methods, and algorithmic guarantees
Prerequisites & Setup
Familiarity with linear algebra, calculus, and probability is essential. Most problems require Python implementations using NumPy and Matplotlib. Some analytical derivations require careful handling of matrix derivatives and Lagrange multipliers.
Poisson regression extends linear regression to count-valued responses by assuming the response y follows a Poisson distribution with parameter λ (the mean count). Unlike Gaussian regression, Poisson models naturally enforce non-negativity and heteroscedasticity: variance equals the mean.
Exponential Family Formulation
The Poisson distribution belongs to the exponential family with natural parameter θ = log(λ). The canonical form is:
p(y | θ) = exp(θ · y - log(y!) - exp(θ))
Here, the log-partition function A(θ) = exp(θ) ensures the distribution sums to 1. The mean is E[y] = exp(θ), and the variance is Var(y) = exp(θ)—the variance-mean equality characterizes Poisson data.
Log-Link and Linearity
To relate predictors to the Poisson mean, we use the log link: log(λ) = θ^T x, or equivalently, λ = exp(θ^T x). This choice:
- Ensures λ is always positive (no negative rate predictions)
- Is the canonical link for the exponential family, simplifying the exponential family form to linear predictors
- Introduces nonlinear feature-response relationships, enabling richer model expressivity
When to Use Poisson Regression
Poisson regression is appropriate when the response is a count (events in a fixed interval), variance is proportional to mean, and the observations are independent conditioned on predictors. Overdispersion (variance exceeding the mean) suggests negative binomial or quasi-Poisson alternatives.
The maximum likelihood estimator for Poisson regression parameters emerges naturally from the exponential family, and the optimization can be solved via iteratively reweighted least squares (IRLS)—a special case of the generalized linear model (GLM) framework.
Likelihood and MLE Derivation
Given n i.i.d. samples (x_i, y_i), the likelihood is:
L(θ) = ∏ exp(θ^T x_i · y_i - log(y_i!) - exp(θ^T x_i))
The log-likelihood is:
ℓ(θ) = Σ [θ^T x_i · y_i - exp(θ^T x_i) - log(y_i!)]
Setting the gradient to zero yields the normal equations. Unlike linear regression, they are nonlinear in θ, requiring iterative solvers.
Connection to GLM
The GLM framework consists of three components:
- Random component: distribution family (Poisson, Gaussian, binomial, etc.)
- Link function: relates linear predictor to the mean (log for Poisson, identity for Gaussian, logit for binomial)
- Variance function: specifies how variance depends on the mean (μ for Poisson, constant for Gaussian, μ(1-μ) for binomial)
Poisson regression fits this framework with log link and variance function V(μ) = μ. The IRLS algorithm applies to all GLMs, making Poisson regression a natural extension of familiar least-squares methods.
Convergence & Practical Considerations
Newton–Raphson iteration in the natural parameterization converges quickly (often in 3–10 iterations) on well-conditioned problems. The Hessian of the log-likelihood is the Fisher information matrix, which is always positive definite for Poisson regression, guaranteeing convexity and unique MLEs (when they exist).
A kernel K(x, x') defines a similarity measure between inputs and implicitly induces a feature map φ(x) such that K(x, x') = ⟨φ(x), φ(x')⟩. Mercer's theorem characterizes which functions are valid kernels: a symmetric function is a kernel if and only if its Gram matrix is positive semi-definite for all finite point sets.
Kernel Composition Rules
Valid kernels can be combined to form new valid kernels:
- Sum: If K₁ and K₂ are kernels, so is K₁ + K₂
- Scalar multiple: If K is a kernel and c > 0, then cK is a kernel
- Product: If K₁ and K₂ are kernels, so is K₁ · K₂
- Limit: If Kₙ are kernels and Kₙ → K pointwise, then K is a kernel
- Exponentiation: If K is a kernel, then exp(K) is a kernel
Common Kernels
Polynomial kernel: K(x, x') = (⟨x, x'⟩ + 1)^d corresponds to explicit polynomial features of degree d. Its eigenvalues decay, controlling complexity.
RBF (Gaussian) kernel: K(x, x') = exp(-γ ||x - x'||²) is universal—dense in the space of continuous functions—and implicitly maps to an infinite-dimensional space. It is useful for nonlinear boundaries and can fit arbitrary smooth patterns with appropriate regularization.
Linear kernel: K(x, x') = ⟨x, x'⟩ (or with bias: ⟨x, x'⟩ + c) is the simplest and is equivalent to linear regression in the original feature space.
Choosing Kernels
The choice of kernel encodes domain knowledge about what similarities matter. High-dimensional domains often benefit from RBF kernels due to universality; structured data (sequences, trees) may require custom kernels (string kernels, graph kernels). Cross-validation or held-out tests guide empirical kernel selection.
A kernel is valid if its Gram matrix G[i,j] = K(x_i, x_j) is positive semi-definite (PSD) for all finite sets {x₁, ..., xₙ}. This property ensures the kernel corresponds to an inner product in some Hilbert space, enabling efficient kernel methods without explicit feature computation.
Proving PSD for Polynomial Kernels
For the polynomial kernel K(x, x') = (⟨x, x'⟩ + 1)^d, expand using the multinomial theorem:
K(x, x') = Σ_{|α|≤d} (d choose α) ⟨x, x'⟩^|α|
This is a non-negative combination of inner products and constants, each of which corresponds to a PSD Gram matrix. Since positive combinations of PSD matrices are PSD, the polynomial kernel is valid.
RBF Kernel Proof Sketch
The RBF kernel can be written as:
K(x, x') = exp(-γ ||x - x'||²) = exp(-γ ||x||²) · exp(2γ ⟨x, x'⟩) · exp(-γ ||x'||²)
The middle term, exp(2γ ⟨x, x'⟩), is a function of the inner product and is thus PSD (it is a power series of the linear kernel with positive coefficients). The outer terms are scalars depending only on individual points, which do not affect positive semi-definiteness when multiplying element-wise (pointwise multiplication preserves PSD).
Spectral Perspective
The eigenvalues of a kernel's Gram matrix reflect the "richness" of the feature representation. For RBF kernels, the spectrum decays; for polynomial kernels, decay depends on the degree and the data. High eigenvalues correspond to directions of large variance in the feature space, guiding dimensionality reduction and regularization strategies.
The perceptron is a simple online algorithm for binary linear classification. It maintains a weight vector and iteratively updates it when it makes a mistake on a new example. Despite its simplicity, it has strong theoretical guarantees on linearly separable data.
Algorithm Description
Initialize θ = 0
For each example (x, y) (where y ∈ {-1, +1}):
If sign(θ^T x) ≠ y:
θ ← θ + y · x [Correction step]
Else:
No update
Return θ
The update rule is remarkably simple: add the example (scaled by its label) to the weight vector when mispredicted. This online nature makes the perceptron suitable for streaming data and provides a form of natural regularization through sequential averaging.
Geometric Intuition
Each weight vector θ defines a hyperplane {x : θ^T x = 0}. When the algorithm makes a mistake, the update rotates θ toward the misclassified example, bringing it closer to the separator. Over many mistakes, θ "spirals" toward the true separator's direction, achieving alignment.
Advantages & Limitations
- Advantages: Simple to implement, fast per-iteration, no hyperparameter tuning, theoretically grounded
- Limitations: Requires linear separability; no probability estimates; oscillates on non-separable data; doesn't minimize a well-defined objective
Comparing to Gradient Descent
Gradient descent on a smooth convex loss (e.g., hinge loss, logistic regression) converges to a global optimum and works on non-separable data. The perceptron, lacking an explicit loss, is not a gradient method and does not guarantee optimality or convergence on non-separable data. However, its online nature and computational efficiency are unmatched in certain streaming scenarios.
The perceptron convergence theorem provides a mistake bound: on linearly separable data with margin γ and radius R, the algorithm makes at most R² / γ² mistakes before converging to a separating classifier. This bound reveals fundamental trade-offs in learning: smaller margins require more data or iterations.
Formal Statement
Assume there exists θ* with ||θ*|| = 1 and margin γ > 0 such that y_i · θ*^T x_i ≥ γ for all i (all points are correctly classified with margin γ). If the data has radius R (i.e., ||x_i|| ≤ R), then the perceptron makes at most R² / γ² mistakes.
Proof Sketch
Let θ_t denote the weight vector after t updates. The proof uses two key inequalities:
- Alignment grows: After each mistake on example i, the inner product with the true separator increases: ⟨θ_{t+1}, θ*⟩ ≥ ⟨θ_t, θ*⟩ + γ. Over t mistakes, ⟨θ_t, θ*⟩ ≥ γ · t.
- Norm grows slowly: Each update increases the norm by at most R²: ||θ_{t+1}||² ≤ ||θ_t||² + 2R². Over t mistakes, ||θ_t||² ≤ R² · t.
By Cauchy–Schwarz, ⟨θ_t, θ*⟩ ≤ ||θ_t|| · ||θ*|| = ||θ_t||. Combining bounds: γ · t ≤ √(R² · t), which gives t ≤ R² / γ².
Implications for Learning
The bound is distribution-free and depends only on the geometry (margin and radius). Large margins lead to fast convergence; tiny margins can require arbitrarily many mistakes. This motivates the margin-maximization principle underlying support vector machines, which explicitly optimize γ to ensure efficient learning.
Problem Set 3 unites three major learning paradigms: probabilistic models (Poisson GLM) for heteroscedastic responses, kernel methods for nonlinear feature representations, and online algorithms (perceptron) for streaming data. Each offers distinct advantages and connects to modern machine learning.
Exponential Family Perspective
Poisson regression exemplifies the power of exponential family distributions. By choosing the appropriate distribution and link function, we elegantly handle diverse response types (counts, binary outcomes, continuous values, ordinal ranks) within a unified GLM framework. The canonical link simplifies optimization, and the variance function captures heteroscedasticity. This perspective extends beyond regression: it powers graphical models, variational inference, and neural network architectures.
Kernel Methods & Representation Learning
Kernels separate the representation (feature map) from the algorithm, allowing us to compute similarities without explicitly constructing high- or infinite-dimensional features. Mercer's theorem provides a principled way to design valid kernels, and kernel composition rules enable flexible domain-specific kernels. This abstraction is foundational: support vector machines, kernel ridge regression, and Gaussian processes all exploit the kernel trick.
Online Learning & Convergence Analysis
The perceptron and its convergence bounds introduce the online learning framework: algorithms that adapt to streaming data and have provable mistake bounds or regret guarantees. Unlike batch learning, online methods see each example once and must decide immediately whether to update. The margin-based convergence bound of the perceptron informs modern optimization and learning theory.
Connections Across Topics
GLM to Kernels: While Poisson regression is parametric, kernel methods enable nonparametric extensions (e.g., kernel-based Poisson regression with reproducing kernel Hilbert spaces).
Kernels to Online: Kernel-based online learning algorithms (e.g., kernel perceptron, online SVM) combine kernel representations with online updates, inheriting both nonlinearity and streaming efficiency.
GLM to Online: Online gradient descent on GLM losses (e.g., stochastic gradient descent on logistic regression) bridges probabilistic models and online optimization.
Looking Ahead
Future topics build on these foundations: structured prediction (beyond binary classification), deep learning (implicit feature learning via gradient descent), Bayesian nonparametrics (combining GLMs and kernels), and reinforcement learning (online decision-making under uncertainty). Mastery of exponential families, kernels, and online algorithms provides a toolkit for understanding and developing advanced machine learning systems.
Stanford CS229 Course Materials
Stanford CS229: Machine Learning — Official course homepage with lectures covering GLMs, kernel methods, and online learning algorithms. Weeks 3-4 materials directly cover Problem Set 3 topics.
CS229 Lecture Notes & Sections — Detailed notes on Poisson regression and GLMs, kernel methods and Mercer's theorem, and online learning and the perceptron algorithm.
Generalized Linear Models & Poisson Regression
Wikipedia: Generalized Linear Models — Overview of GLM framework, link functions, variance functions, and exponential family parameterization essential for understanding Poisson regression.
Wikipedia: Poisson Regression — Specific coverage of Poisson regression for count data, log link, and likelihood-based estimation.
Kernel Methods & Mercer's Theorem
Wikipedia: Kernel Methods — Comprehensive introduction to kernel tricks, kernel matrices, and the relationship between kernels and feature spaces.
Wikipedia: Positive-Definite Matrix — Mathematical foundations of positive semi-definite matrices and their role in characterizing valid kernels via Mercer's theorem.
Wikipedia: Reproducing Kernel Hilbert Space — Advanced topic on RKHS theory that underpins kernel methods and kernel ridge regression.
Perceptron & Online Learning
Wikipedia: Perceptron — Historical and modern treatment of the perceptron algorithm, convergence conditions, and limitations.
Wikipedia: Online Learning — Introduction to online learning frameworks, regret bounds, and applications to streaming data.
Comprehensive Textbooks
The Elements of Statistical Learning (ESL) - Free PDF — Authoritative reference covering GLMs (Chapter 4), kernel methods (Chapter 12), and learning theory. Highly recommended for rigorous treatment of all PS3 topics.
Wikipedia: Exponential Family — Deep mathematical foundation for understanding exponential families, sufficient statistics, log-partition functions, and moment-generating functions essential for GLM theory.
Study Path
Start with Stanford CS229 lectures on GLMs and Poisson regression. Review exponential family and link function concepts. For kernels, understand positive semi-definiteness and Mercer's theorem via Wikipedia and ESL Chapter 12. Study the perceptron algorithm and convergence proof through CS229 lectures on online learning. Finally, integrate all three perspectives by examining kernel perceptron and online GLM optimization.