Kernel Methods — Deep Dive

SCPD ML Course

Kernel Methods

The Feature Map Idea

The fundamental insight behind kernel methods is that linear classifiers become far more powerful when applied to nonlinearly transformed data. If we map input points x from their original space to a higher-dimensional feature space via φ(x), linear separation becomes possible for problems that are impossible in the original space.

Classical example: the XOR problem. Two classes are perfectly separable in 2D by a circle, but no linear boundary works. However, map the points to 3D via φ(x) = (x₁, x₂, x₁·x₂), and the classes separate via a plane. The cost is the explicit computation and storage of these higher-dimensional features—yet the kernel trick eliminates this overhead.

x ∈ ℝᵈ

Original Space

φ(x) ∈ ℝᴰ

Feature Space

D ≫ d

Often Exponential

O(D)

Naive Cost

The Kernel Trick

Many algorithms in machine learning depend only on inner products ⟨x, y⟩ between data points. The kernel trick exploits this: instead of explicitly computing φ(x) and φ(y), define K(x, y) = ⟨φ(x), φ(y)⟩, a kernel function that directly computes the inner product in feature space without materializing the features.

Practical example: polynomial kernel K(x, y) = (⟨x, y⟩ + c)ᵈ implicitly maps to all monomials of degree ≤ d. For 100D input and d=3, this represents ~171,700 features, yet K is computed in O(100) time. The kernel acts as a similarity function: K(x, y) measures closeness in feature space, encoding the geometry of the problem.

K(x,y)

Similarity

φ(x)·φ(y)

Inner Product

O(D)→O(d)

Complexity Gain

≈171k

Implicit Features

Common Kernels

Several kernels appear repeatedly in practice. The linear kernel K(x, y) = ⟨x, y⟩ requires no feature map. The polynomial kernel K(x, y) = (⟨x, y⟩ + c)ᵈ encodes polynomial features up to degree d, controlled by offset c and degree d.

The RBF (Gaussian) kernel K(x, y) = exp(−γ·‖x − y‖²) is the most widely used. It implicitly maps to infinite-dimensional space, making it remarkably flexible. The parameter γ controls bandwidth: small γ gives high-variance (wiggly) boundaries; large γ produces simpler, smoother fits. String kernels and graph kernels extend the framework to structured data.

Linear: K(x,y) = ⟨x,y⟩ Polynomial: (⟨x,y⟩+c)ᵈ RBF: exp(−γ‖x−y‖²) Sigmoid: tanh(κ⟨x,y⟩+θ)

Mercer's Theorem & Validity

Not all symmetric functions are valid kernels. Mercer's theorem states: a symmetric function K is a valid kernel if and only if it is positive semi-definite. This means for any finite set of points x₁,...,xₙ, the Gram matrix G with entries Gᵢⱼ = K(xᵢ, xⱼ) has all non-negative eigenvalues.

Mercer's theorem guarantees the existence of a feature space and feature map φ such that K(x, y) = ⟨φ(x), φ(y)⟩. This is the theoretical foundation: valid kernels are precisely those that correspond to some (possibly infinite-dimensional) Hilbert space. New kernels can be constructed from old ones: sums, products, and compositions of valid kernels yield valid kernels.

PSD

Gram Matrix

λᵢ ≥ 0

All Eigenvalues

Σ + K

Valid Combo

K₁·K₂

Valid Product

Kernel SVM

Support Vector Machines are the canonical kernel method. The SVM finds the maximum-margin hyperplane that separates two classes. In the dual form, the optimization depends only on inner products K(xᵢ, xⱼ) between training points, never on φ explicitly.

The soft-margin SVM allows misclassification via slack variables, controlled by regularization parameter C. The decision boundary is a weighted combination of kernel evaluations at support vectors: f(x) = Σᵢ αᵢ·yᵢ·K(xᵢ, x) + b. Support vectors are the critical points near the boundary; in many cases, only a small fraction are nonzero, enabling efficient prediction.

α, b

Dual Variables

max-margin

Objective

f(x)=ΣαᵢyᵢK

Decision Rule

O(n³)

Training Cost

Beyond SVMs

The kernel trick applies to any algorithm relying solely on inner products. Kernel PCA performs principal component analysis in feature space, extracting nonlinear components. Kernel ridge regression applies ridge regression post-φ, achieving flexible nonlinear regression with closed-form solutions.

Gaussian Processes treat K as a covariance function, enabling probabilistic inference with automatic uncertainty quantification. Kernel k-means and kernel CCA extend classical methods to nonlinear settings. These generalizations demonstrate the kernel trick's universality: it's not specific to SVMs but a general framework for nonlinear learning.

Kernel PCA Kernel Ridge Regression Gaussian Processes Kernel k-Means Kernel CCA

Limitations & Modern Context

Kernel methods' greatest strength becomes a limitation at scale. Computing and storing the n×n Gram matrix costs O(n²) memory; inverting it costs O(n³) time. For datasets with millions of points, this is prohibitive. Modern alternatives include random Fourier features (approximating kernels via random projections) and neural networks.

Recent theory reveals a deep connection: neural tangent kernels (NTKs) link infinitely-wide neural networks to kernel methods. Deep learning succeeds by implicitly discovering useful feature maps. The future likely combines both: kernels for interpretability and theory, deep learning for scale and flexibility. Kernels remain indispensable for structured data, small-to-medium problems, and domains requiring uncertainty quantification.

O(n²)

Memory

O(n³)

Inversion

RFF

Approximation

NTK

Deep ↔ Kernels

Sources & References

References and sources for further study on the topics covered in this deep dive.

The Feature Map Idea

The Kernel Trick

Common Kernels

Mercer's Theorem & Validity

Kernel SVM

Beyond SVMs

Limitations & Modern Context

Sources & References

The Feature Map Idea

Why Feature Maps Help

The Computational Problem

The Kernel Trick

Worked Example: Polynomial Kernel

RBF Kernel and Infinite Dimensions

Common Kernels

Linear Kernel

Polynomial Kernel

RBF (Gaussian) Kernel

Sigmoid Kernel

Kernels on Structured Data

Mercer's Theorem & Validity

Statement

Positive Semi-Definiteness Test

Building New Kernels

Kernel SVM

The Dual Formulation

Decision Boundary and Support Vectors

Hyperparameter Tuning

Beyond SVMs

Kernel PCA

Kernel Ridge Regression

Gaussian Processes

Kernel k-Means

Limitations & Modern Context

Computational Scalability

Kernel Selection and Interpretability

Random Fourier Features

Neural Tangent Kernels and Deep Learning

When Kernels Remain Essential

Sources & References

Foundational Papers & Textbooks

Core Concepts & Proofs

Kernel Design & Applications