SCPD ML Course
Kernel Methods
01

The Feature Map Idea

The fundamental insight behind kernel methods is that linear classifiers become far more powerful when applied to nonlinearly transformed data. If we map input points x from their original space to a higher-dimensional feature space via φ(x), linear separation becomes possible for problems that are impossible in the original space.

Classical example: the XOR problem. Two classes are perfectly separable in 2D by a circle, but no linear boundary works. However, map the points to 3D via φ(x) = (x₁, x₂, x₁·x₂), and the classes separate via a plane. The cost is the explicit computation and storage of these higher-dimensional features—yet the kernel trick eliminates this overhead.

x ∈ ℝᵈ
Original Space
φ(x) ∈ ℝᴰ
Feature Space
D ≫ d
Often Exponential
O(D)
Naive Cost
02

The Kernel Trick

Many algorithms in machine learning depend only on inner products ⟨x, y⟩ between data points. The kernel trick exploits this: instead of explicitly computing φ(x) and φ(y), define K(x, y) = ⟨φ(x), φ(y)⟩, a kernel function that directly computes the inner product in feature space without materializing the features.

Practical example: polynomial kernel K(x, y) = (⟨x, y⟩ + c)ᵈ implicitly maps to all monomials of degree ≤ d. For 100D input and d=3, this represents ~171,700 features, yet K is computed in O(100) time. The kernel acts as a similarity function: K(x, y) measures closeness in feature space, encoding the geometry of the problem.

K(x,y)
Similarity
φ(x)·φ(y)
Inner Product
O(D)→O(d)
Complexity Gain
≈171k
Implicit Features
03

Common Kernels

Several kernels appear repeatedly in practice. The linear kernel K(x, y) = ⟨x, y⟩ requires no feature map. The polynomial kernel K(x, y) = (⟨x, y⟩ + c)ᵈ encodes polynomial features up to degree d, controlled by offset c and degree d.

The RBF (Gaussian) kernel K(x, y) = exp(−γ·‖x − y‖²) is the most widely used. It implicitly maps to infinite-dimensional space, making it remarkably flexible. The parameter γ controls bandwidth: small γ gives high-variance (wiggly) boundaries; large γ produces simpler, smoother fits. String kernels and graph kernels extend the framework to structured data.

Linear: K(x,y) = ⟨x,y⟩ Polynomial: (⟨x,y⟩+c)ᵈ RBF: exp(−γ‖x−y‖²) Sigmoid: tanh(κ⟨x,y⟩+θ)
04

Mercer's Theorem & Validity

Not all symmetric functions are valid kernels. Mercer's theorem states: a symmetric function K is a valid kernel if and only if it is positive semi-definite. This means for any finite set of points x₁,...,xₙ, the Gram matrix G with entries Gᵢⱼ = K(xᵢ, xⱼ) has all non-negative eigenvalues.

Mercer's theorem guarantees the existence of a feature space and feature map φ such that K(x, y) = ⟨φ(x), φ(y)⟩. This is the theoretical foundation: valid kernels are precisely those that correspond to some (possibly infinite-dimensional) Hilbert space. New kernels can be constructed from old ones: sums, products, and compositions of valid kernels yield valid kernels.

PSD
Gram Matrix
λᵢ ≥ 0
All Eigenvalues
Σ + K
Valid Combo
K₁·K₂
Valid Product
05

Kernel SVM

Support Vector Machines are the canonical kernel method. The SVM finds the maximum-margin hyperplane that separates two classes. In the dual form, the optimization depends only on inner products K(xᵢ, xⱼ) between training points, never on φ explicitly.

The soft-margin SVM allows misclassification via slack variables, controlled by regularization parameter C. The decision boundary is a weighted combination of kernel evaluations at support vectors: f(x) = Σᵢ αᵢ·yᵢ·K(xᵢ, x) + b. Support vectors are the critical points near the boundary; in many cases, only a small fraction are nonzero, enabling efficient prediction.

α, b
Dual Variables
max-margin
Objective
f(x)=ΣαᵢyᵢK
Decision Rule
O(n³)
Training Cost
06

Beyond SVMs

The kernel trick applies to any algorithm relying solely on inner products. Kernel PCA performs principal component analysis in feature space, extracting nonlinear components. Kernel ridge regression applies ridge regression post-φ, achieving flexible nonlinear regression with closed-form solutions.

Gaussian Processes treat K as a covariance function, enabling probabilistic inference with automatic uncertainty quantification. Kernel k-means and kernel CCA extend classical methods to nonlinear settings. These generalizations demonstrate the kernel trick's universality: it's not specific to SVMs but a general framework for nonlinear learning.

Kernel PCA Kernel Ridge Regression Gaussian Processes Kernel k-Means Kernel CCA
07

Limitations & Modern Context

Kernel methods' greatest strength becomes a limitation at scale. Computing and storing the n×n Gram matrix costs O(n²) memory; inverting it costs O(n³) time. For datasets with millions of points, this is prohibitive. Modern alternatives include random Fourier features (approximating kernels via random projections) and neural networks.

Recent theory reveals a deep connection: neural tangent kernels (NTKs) link infinitely-wide neural networks to kernel methods. Deep learning succeeds by implicitly discovering useful feature maps. The future likely combines both: kernels for interpretability and theory, deep learning for scale and flexibility. Kernels remain indispensable for structured data, small-to-medium problems, and domains requiring uncertainty quantification.

O(n²)
Memory
O(n³)
Inversion
RFF
Approximation
NTK
Deep ↔ Kernels
08

Sources & References

References and sources for further study on the topics covered in this deep dive.