Support Vector Machines & Kernels

01

Margins Intuition

The margin represents the geometric distance between a hyperplane and the nearest data points from either class. A larger margin improves generalization by creating a wider decision boundary, reducing sensitivity to small perturbations in training data. This principle—maximizing the margin between classes—forms the core geometric intuition behind support vector machines.

The functional margin (γ̂) measures confidence based on the decision function output, while the geometric margin (γ) normalizes by the weight vector's magnitude. For separable data, the maximum margin classifier finds weights w and bias b that maximize γ, creating the most robust linear boundary. Support vectors are the training points lying exactly on the margin boundary.

γ

Geometric Margin

γ̂

Functional Margin

||w||

Weight Magnitude

SVs

Support Vectors

Geometric Distance Hyperplane Separation Linearly Separable

02

Feature Maps

Feature mapping transforms input data φ(x) into higher-dimensional spaces where previously nonlinear patterns become linearly separable. A polynomial feature map φ(x) = [x, x², x₁x₂, ...] creates quadratic features, while Gaussian features can map to infinite dimensions. The classic XOR problem—unsolvable with linear separation in 2D—becomes trivial after polynomial feature mapping to 3D space.

However, explicitly computing φ(x) becomes computationally expensive in high dimensions. For d-dimensional input with polynomial degree 2, the feature space dimensionality grows from O(d) to O(d²). This curse of dimensionality motivated the kernel trick: instead of computing features directly, we use a kernel function K(x,z) that computes inner products in feature space without materializing the transformed vectors.

Polynomial Features

φ(x) = [x, x²] raises input dimensionality, enabling quadratic boundaries. Degree-k polynomials grow feature space exponentially.

Gaussian Features

RBF kernel implicitly maps to infinite dimensions through exp(-γ||x-z||²), creating flexible nonlinear separators.

XOR Example

Data (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0 becomes linearly separable in φ-space but not in original 2D.

Dimensionality

Feature maps can increase problem complexity dramatically; kernels avoid this by computing inner products implicitly.

03

The Kernel Trick

The kernel trick computes K(x,z) = φ(x)ᵀφ(z)—the inner product in feature space—without explicitly mapping to φ(x) or φ(z). This elegant insight reduces computation from O(d') to O(d), where d' is the high (or infinite) feature space dimension and d is the original input dimension. For polynomial kernels K(x,z) = (xᵀz + c)^p, the trick evaluates a p-degree polynomial in original space.

The SVM decision function becomes f(x) = sign(Σᵢ αᵢ yᵢ K(xᵢ, x) + b), depending only on kernel evaluations between the new point x and training examples xᵢ. This kernel-dependent formulation enables learning in arbitrary high-dimensional (or infinite-dimensional) feature spaces efficiently. Common kernels include polynomial K(x,z) = (xᵀz + 1)^d, Gaussian (RBF) K(x,z) = exp(-γ||x-z||²), and sigmoid K(x,z) = tanh(κxᵀz + θ).

04

Kernel Properties

A valid kernel must satisfy Mercer's theorem: the kernel matrix K (with entries K(xᵢ, xⱼ)) must be positive semi-definite (PSD). This ensures the kernel genuinely corresponds to some feature mapping φ(x) and that the optimization problem remains convex. Testing PSD is done by checking that all eigenvalues are non-negative; symmetric, real-valued kernels are PSD if and only if they can be decomposed as K = φ(x)ᵀφ(z).

Kernels can be combined using simple algebraic rules: the product of two valid kernels is valid K₁(x,z) · K₂(x,z), and positive scaling α·K(x,z) where α > 0 preserves validity. Sum and difference operations also preserve the PSD property under specific conditions. These composition rules allow engineers to design custom kernels for domain-specific problems while guaranteeing mathematical consistency and optimization convexity.

PSD

Positive Semi-Def

λ ≥ 0

Eigenvalues

K₁·K₂

Valid Product

Mercer

Theorem

Mercer's Theorem

A symmetric function K(x,z) defined on X×X corresponds to an inner product in some feature space if and only if K is positive semi-definite: for all finite sequences x₁,...,xₙ and coefficients c₁,...,cₙ, Σᵢⱼ cᵢcⱼK(xᵢ,xⱼ) ≥ 0.

05

Optimal Margin Classifier

The optimal margin classifier solves the constrained optimization problem: maximize γ subject to yᵢ(wᵀφ(xᵢ) + b) ≥ γ for all training points. This can be reformulated as: minimize ½||w||² subject to yᵢ(wᵀφ(xᵢ) + b) ≥ 1. The geometric margin γ = 1/||w||, so maximizing γ is equivalent to minimizing ||w||. The constraint ensures points stay on the correct side of the margin.

Using Lagrange multipliers, we construct L = ½||w||² - Σᵢ αᵢ[yᵢ(wᵀφ(xᵢ) + b) - 1], taking derivatives and setting them to zero. The KKT conditions require complementary slackness: αᵢ > 0 only if point i is on the margin (yᵢ(wᵀφ(xᵢ) + b) = 1). This sparse structure means only support vectors matter; non-margin points have αᵢ = 0 and don't affect the solution.

06

Dual Formulation

The dual problem reformulates the primal optimization into the space of Lagrange multipliers αᵢ: maximize Σᵢ αᵢ - ½Σᵢⱼ αᵢαⱼyᵢyⱼK(xᵢ, xⱼ) subject to Σᵢ αᵢyᵢ = 0 and αᵢ ≥ 0. This dual form depends only on kernel evaluations K(xᵢ, xⱼ), never requiring explicit feature vectors φ(xᵢ). The primal solution w = Σᵢ αᵢyᵢφ(xᵢ) becomes a weighted sum of feature-mapped support vectors.

The sparsity property emerges from complementary slackness: only points with αᵢ > 0 contribute to the solution. For most data, αᵢ = 0, meaning the decision boundary depends on a small subset of support vectors—typically 1–10% of training data. This sparsity enables efficient prediction: computing f(x) requires only evaluating K(xᵢ, x) for support vectors, not all training examples. The dual formulation naturally handles nonlinear kernels and scales better than the primal for high-dimensional feature spaces.

Dual Advantages

Kernel-based: no explicit φ(x) needed
Sparse solution: only support vectors matter
Scales to infinite-dimensional spaces
Convex optimization, globally optimal

Dual Challenges

O(n²) kernel matrix memory for n samples
O(n³) training time (without SMO tricks)
Quadratic programming required
Not all αᵢ interpretable as confidence

07

Regularization & Soft Margin

Real-world data is rarely linearly separable, even in high-dimensional feature spaces. The soft margin SVM allows misclassifications by introducing slack variables ξᵢ ≥ 0. The modified constraint yᵢ(wᵀφ(xᵢ) + b) ≥ 1 - ξᵢ permits ξᵢ > 0 for points inside or crossing the margin. The objective becomes: minimize ½||w||² + C·Σᵢ ξᵢ, where C balances margin width against training error.

The parameter C acts as a regularization knob: large C penalizes slack heavily, fitting more aggressively to training data (risk of overfitting), while small C tolerates more errors, favoring larger margins (better generalization). The hinge loss L(y, ŷ) = max(0, 1 - y·ŷ) emerges naturally from soft-margin SVMs: it equals ξ for each point. The dual problem is identical except αᵢ ≤ C, and KKT conditions link ξᵢ to dual variables. Cross-validation selects optimal C, balancing training accuracy and test generalization.

ξᵢ

Slack Variables

C

Regularization Param

1 - ξᵢ

Margin Constraint

Hinge

Loss Function

08

The SMO Algorithm

Sequential Minimal Optimization (SMO), developed by John Platt, breaks the QP problem into subproblems. Instead of optimizing all n alphas simultaneously, SMO selects two multipliers αᵢ and αⱼ, optimizes them analytically while holding others fixed, then repeats. The two-variable problem has a closed-form solution, avoiding expensive quadratic programming. The constraint Σ αᵢyᵢ = 0 becomes: αᵢyᵢ + αⱼyⱼ = -Σₖ≠ᵢ,ⱼ αₖyₖ, reducing the problem to a single variable.

SMO works by repeatedly selecting promising pairs (i, j) to optimize. Heuristics choose variables with largest error differences: find i where yᵢ(fᵢ - yᵢ) is maximum (most violated), then find j where yⱼ(fⱼ - yⱼ) is minimum. After updating (αᵢ, αⱼ), bias b is recalculated to satisfy KKT conditions for support vectors. Convergence occurs when no pair violates KKT conditions by more than tolerance ε. SMO reduces training from O(n³) to O(n²) in practice, enabling SVMs to scale to millions of samples.

Coordinate Ascent Working Set Selection Closed-form Updates KKT Convergence O(n²) Complexity

Step 1

Choose pair (αᵢ, αⱼ) with maximum KKT violations.

Step 2

Solve the two-variable subproblem analytically.

Step 3

Update bias b using KKT conditions for SVs.

Step 4

Repeat until convergence (all KKT satisfied ±ε).

09

References & Further Reading

Key References:

Margins Intuition

Feature Maps

Polynomial Features

Gaussian Features

XOR Example

Dimensionality

The Kernel Trick

Kernel Properties

Mercer's Theorem

Optimal Margin Classifier

Dual Formulation

Dual Advantages

Dual Challenges

Regularization & Soft Margin

The SMO Algorithm

References & Further Reading

Margins Intuition

Geometric vs. Functional Margin

Support Vectors and the Margin Boundary

Margin Intuition Summary

Feature Maps

The XOR Problem Solved by Feature Mapping

Computational Cost and the Curse of Dimensionality

Explicit Features

Implicit Features

Dimensionality Growth

Generalization

The Kernel Trick

Common Kernels and Their Properties

Kernel Evaluation Complexity

Why Kernels Matter

Kernel Properties

Testing for Positive Semi-Definiteness

Kernel Composition Rules

Kernel Design Strategies

PSD Kernels Guarantee

Non-PSD Issues

Optimal Margin Classifier

The Lagrangian and KKT Conditions

Support Vectors and Sparsity

Geometric Interpretation in Feature Space

Why Margins Matter for Generalization

Dual Formulation

Sparse Solutions via Support Vectors

Recovery of Primal Solution

Computational Complexity

Dual Form Elegance

Dual Form Challenge

Sparse Support Vectors

Hyperparameter Tuning

Regularization & Soft Margin

The C Parameter: Regularization Trade-off

Connection to Hinge Loss

KKT Conditions for Soft Margin

Cross-Validation for C Selection

Soft Margin Philosophy

The SMO Algorithm

The Two-Variable Optimization Problem

Working Set Selection: Choosing the Pair (i, j)

Convergence and Stopping Criteria

Implementation Details

Why SMO Works

Sources & References

Course Materials

Textbooks & Papers