Stanford XCS229 · Machine Learning
Support Vector Machines
Week 3–4 · Kernels, Margins & Optimization
01

Margins Intuition

The margin represents the geometric distance between a hyperplane and the nearest data points from either class. A larger margin improves generalization by creating a wider decision boundary, reducing sensitivity to small perturbations in training data. This principle—maximizing the margin between classes—forms the core geometric intuition behind support vector machines.

The functional margin (γ̂) measures confidence based on the decision function output, while the geometric margin (γ) normalizes by the weight vector's magnitude. For separable data, the maximum margin classifier finds weights w and bias b that maximize γ, creating the most robust linear boundary. Support vectors are the training points lying exactly on the margin boundary.

γ
Geometric Margin
γ̂
Functional Margin
||w||
Weight Magnitude
SVs
Support Vectors
Geometric Distance Hyperplane Separation Linearly Separable
02

Feature Maps

Feature mapping transforms input data φ(x) into higher-dimensional spaces where previously nonlinear patterns become linearly separable. A polynomial feature map φ(x) = [x, x², x₁x₂, ...] creates quadratic features, while Gaussian features can map to infinite dimensions. The classic XOR problem—unsolvable with linear separation in 2D—becomes trivial after polynomial feature mapping to 3D space.

However, explicitly computing φ(x) becomes computationally expensive in high dimensions. For d-dimensional input with polynomial degree 2, the feature space dimensionality grows from O(d) to O(d²). This curse of dimensionality motivated the kernel trick: instead of computing features directly, we use a kernel function K(x,z) that computes inner products in feature space without materializing the transformed vectors.

Polynomial Features

φ(x) = [x, x²] raises input dimensionality, enabling quadratic boundaries. Degree-k polynomials grow feature space exponentially.

Gaussian Features

RBF kernel implicitly maps to infinite dimensions through exp(-γ||x-z||²), creating flexible nonlinear separators.

XOR Example

Data (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0 becomes linearly separable in φ-space but not in original 2D.

Dimensionality

Feature maps can increase problem complexity dramatically; kernels avoid this by computing inner products implicitly.

03

The Kernel Trick

The kernel trick computes K(x,z) = φ(x)ᵀφ(z)—the inner product in feature space—without explicitly mapping to φ(x) or φ(z). This elegant insight reduces computation from O(d') to O(d), where d' is the high (or infinite) feature space dimension and d is the original input dimension. For polynomial kernels K(x,z) = (xᵀz + c)^p, the trick evaluates a p-degree polynomial in original space.

The SVM decision function becomes f(x) = sign(Σᵢ αᵢ yᵢ K(xᵢ, x) + b), depending only on kernel evaluations between the new point x and training examples xᵢ. This kernel-dependent formulation enables learning in arbitrary high-dimensional (or infinite-dimensional) feature spaces efficiently. Common kernels include polynomial K(x,z) = (xᵀz + 1)^d, Gaussian (RBF) K(x,z) = exp(-γ||x-z||²), and sigmoid K(x,z) = tanh(κxᵀz + θ).

Input Space x, z φ(x), φ(z) Feature Space (d' dimensions) K(x,z) Scalar φ(x)ᵀφ(z) Kernel trick: compute inner product without materializing φ space
04

Kernel Properties

A valid kernel must satisfy Mercer's theorem: the kernel matrix K (with entries K(xᵢ, xⱼ)) must be positive semi-definite (PSD). This ensures the kernel genuinely corresponds to some feature mapping φ(x) and that the optimization problem remains convex. Testing PSD is done by checking that all eigenvalues are non-negative; symmetric, real-valued kernels are PSD if and only if they can be decomposed as K = φ(x)ᵀφ(z).

Kernels can be combined using simple algebraic rules: the product of two valid kernels is valid K₁(x,z) · K₂(x,z), and positive scaling α·K(x,z) where α > 0 preserves validity. Sum and difference operations also preserve the PSD property under specific conditions. These composition rules allow engineers to design custom kernels for domain-specific problems while guaranteeing mathematical consistency and optimization convexity.

PSD
Positive Semi-Def
λ ≥ 0
Eigenvalues
K₁·K₂
Valid Product
Mercer
Theorem

Mercer's Theorem

A symmetric function K(x,z) defined on X×X corresponds to an inner product in some feature space if and only if K is positive semi-definite: for all finite sequences x₁,...,xₙ and coefficients c₁,...,cₙ, Σᵢⱼ cᵢcⱼK(xᵢ,xⱼ) ≥ 0.

05

Optimal Margin Classifier

The optimal margin classifier solves the constrained optimization problem: maximize γ subject to yᵢ(wᵀφ(xᵢ) + b) ≥ γ for all training points. This can be reformulated as: minimize ½||w||² subject to yᵢ(wᵀφ(xᵢ) + b) ≥ 1. The geometric margin γ = 1/||w||, so maximizing γ is equivalent to minimizing ||w||. The constraint ensures points stay on the correct side of the margin.

Using Lagrange multipliers, we construct L = ½||w||² - Σᵢ αᵢ[yᵢ(wᵀφ(xᵢ) + b) - 1], taking derivatives and setting them to zero. The KKT conditions require complementary slackness: αᵢ > 0 only if point i is on the margin (yᵢ(wᵀφ(xᵢ) + b) = 1). This sparse structure means only support vectors matter; non-margin points have αᵢ = 0 and don't affect the solution.

Decision Boundary (Optimal Margin) Margin γ = 1/||w|| Support Vectors (αᵢ > 0)
06

Dual Formulation

The dual problem reformulates the primal optimization into the space of Lagrange multipliers αᵢ: maximize Σᵢ αᵢ - ½Σᵢⱼ αᵢαⱼyᵢyⱼK(xᵢ, xⱼ) subject to Σᵢ αᵢyᵢ = 0 and αᵢ ≥ 0. This dual form depends only on kernel evaluations K(xᵢ, xⱼ), never requiring explicit feature vectors φ(xᵢ). The primal solution w = Σᵢ αᵢyᵢφ(xᵢ) becomes a weighted sum of feature-mapped support vectors.

The sparsity property emerges from complementary slackness: only points with αᵢ > 0 contribute to the solution. For most data, αᵢ = 0, meaning the decision boundary depends on a small subset of support vectors—typically 1–10% of training data. This sparsity enables efficient prediction: computing f(x) requires only evaluating K(xᵢ, x) for support vectors, not all training examples. The dual formulation naturally handles nonlinear kernels and scales better than the primal for high-dimensional feature spaces.

Dual Advantages

  • Kernel-based: no explicit φ(x) needed
  • Sparse solution: only support vectors matter
  • Scales to infinite-dimensional spaces
  • Convex optimization, globally optimal

Dual Challenges

  • O(n²) kernel matrix memory for n samples
  • O(n³) training time (without SMO tricks)
  • Quadratic programming required
  • Not all αᵢ interpretable as confidence
07

Regularization & Soft Margin

Real-world data is rarely linearly separable, even in high-dimensional feature spaces. The soft margin SVM allows misclassifications by introducing slack variables ξᵢ ≥ 0. The modified constraint yᵢ(wᵀφ(xᵢ) + b) ≥ 1 - ξᵢ permits ξᵢ > 0 for points inside or crossing the margin. The objective becomes: minimize ½||w||² + C·Σᵢ ξᵢ, where C balances margin width against training error.

The parameter C acts as a regularization knob: large C penalizes slack heavily, fitting more aggressively to training data (risk of overfitting), while small C tolerates more errors, favoring larger margins (better generalization). The hinge loss L(y, ŷ) = max(0, 1 - y·ŷ) emerges naturally from soft-margin SVMs: it equals ξ for each point. The dual problem is identical except αᵢ ≤ C, and KKT conditions link ξᵢ to dual variables. Cross-validation selects optimal C, balancing training accuracy and test generalization.

ξᵢ
Slack Variables
C
Regularization Param
1 - ξᵢ
Margin Constraint
Hinge
Loss Function
Soft Margin SVM: C Parameter Trade-off Small C: Large Margin Higher bias, lower variance Large C: Tight Margin Lower bias, higher variance C = 0.01 C = 1 (balanced) C = 100
08

The SMO Algorithm

Sequential Minimal Optimization (SMO), developed by John Platt, breaks the QP problem into subproblems. Instead of optimizing all n alphas simultaneously, SMO selects two multipliers αᵢ and αⱼ, optimizes them analytically while holding others fixed, then repeats. The two-variable problem has a closed-form solution, avoiding expensive quadratic programming. The constraint Σ αᵢyᵢ = 0 becomes: αᵢyᵢ + αⱼyⱼ = -Σₖ≠ᵢ,ⱼ αₖyₖ, reducing the problem to a single variable.

SMO works by repeatedly selecting promising pairs (i, j) to optimize. Heuristics choose variables with largest error differences: find i where yᵢ(fᵢ - yᵢ) is maximum (most violated), then find j where yⱼ(fⱼ - yⱼ) is minimum. After updating (αᵢ, αⱼ), bias b is recalculated to satisfy KKT conditions for support vectors. Convergence occurs when no pair violates KKT conditions by more than tolerance ε. SMO reduces training from O(n³) to O(n²) in practice, enabling SVMs to scale to millions of samples.

Coordinate Ascent Working Set Selection Closed-form Updates KKT Convergence O(n²) Complexity
Step 1
Choose pair (αᵢ, αⱼ) with maximum KKT violations.
Step 2
Solve the two-variable subproblem analytically.
Step 3
Update bias b using KKT conditions for SVs.
Step 4
Repeat until convergence (all KKT satisfied ±ε).