The geometric margin γ represents the shortest perpendicular distance from any training point to a separating hyperplane wᵀx + b = 0. Maximizing this distance ensures the decision boundary is as "centered" as possible between the two classes, creating a buffer zone around the separating surface. This buffer makes the classifier robust: small perturbations in test examples are less likely to cross the boundary, improving generalization.
The functional margin γ̂ = yᵢ(wᵀxᵢ + b) measures confidence by the raw decision function value; negative values indicate misclassification. The relationship γ = γ̂/||w|| normalizes functional margin by weight magnitude. For a fixed functional margin, smaller ||w|| yields larger geometric margin. The maximum margin classifier simultaneously minimizes ||w|| and enforces yᵢ(wᵀxᵢ + b) ≥ 1, creating the widest-margin hyperplane.
Geometric vs. Functional Margin
Consider a simple 2D example: y₁ = +1 at (1, 0) and y₂ = -1 at (-1, 0) with boundary wᵀx + b = 0. If w = [1, 0] and b = 0, the functional margins are γ̂₁ = 1 and γ̂₂ = 1, with ||w|| = 1, giving geometric margin γ = 1. If instead w = [2, 0] and b = 0, functional margins double (γ̂₁ = 2, γ̂₂ = 2), but ||w|| = 2, so geometric margin remains γ = 1. This shows why ||w|| scaling matters: it can inflate functional margins while leaving geometry unchanged.
Support Vectors and the Margin Boundary
Points exactly on the margin boundary satisfy yᵢ(wᵀxᵢ + b) = 1. In the maximum margin solution, a few (often just n+1 in n dimensions) training examples touch the margin—these are support vectors. Non-margin points satisfy yᵢ(wᵀxᵢ + b) > 1 and do not affect the solution if perturbed slightly. This sparsity property is crucial: the decision boundary depends only on the handful of critical points, making SVMs memory-efficient and enabling elegant kernel-based solutions.
Margin Intuition Summary
Larger margins are better for generalization because they create robustness to small changes. The max-margin classifier finds the hyperplane that keeps the two classes as far apart as possible, reducing the risk that future test examples near the decision boundary will be misclassified.
A feature map φ: ℝᵈ → ℝᵈ' is a function that transforms raw input x into a higher-dimensional representation φ(x). The goal is to make nonlinear patterns in the original space become linearly separable in the feature space. For example, φ(x) = [x, x²] for univariate x maps the parabola y = x² (nonlinearly separable in 1D) into a line in 2D space.
Polynomial feature maps are defined as φ(x) = [x₁, x₂, ..., xₙ, x₁², x₁x₂, ..., xₙ²] for degree 2. For d input dimensions and degree k, the number of features grows as O(dᵏ)—exponential growth limiting practical use. Gaussian (RBF) features are implicitly infinite-dimensional: the RBF kernel corresponds to expanding x as a Taylor series of Gaussian basis functions centered at training points.
The XOR Problem Solved by Feature Mapping
XOR data cannot be separated linearly in 2D: (0,0)→0, (1,0)→1, (0,1)→1, (1,1)→0. No line separates the +1 points {(1,0), (0,1)} from -1 points {(0,0), (1,1)}. However, with φ(x₁, x₂) = [x₁, x₂, x₁x₂], we get φ(0,0) = [0,0,0], φ(1,0) = [1,0,0], φ(0,1) = [0,1,0], φ(1,1) = [1,1,1]. In 3D, the plane w₁ + w₂ - 3w₃ = 0.5 separates them: at (0,0,0) it gives -0.5 (negative, correct), at (1,1,1) it gives 2w₁ + 2w₂ - 3 ≈ -0.5 (negative, correct), and (1,0,0), (0,1,0) give positive values.
Computational Cost and the Curse of Dimensionality
If d = 100 and we use degree-2 polynomial features, the feature dimension d' = O(10,000). Computing w = Σ αᵢyᵢφ(xᵢ) requires storing a 10,000-dimensional vector. Prediction on a test point requires φ(x) to be computed (O(d') time) and then the inner product wᵀφ(x) (O(d') time). For large feature spaces, this becomes prohibitive. The kernel trick avoids this by computing K(x, z) = φ(x)ᵀφ(z) in O(d) time without materializing φ.
Explicit Features
Compute φ(x) explicitly, then use standard linear classifiers. Simple but slow in high dimensions.
Implicit Features
Use kernels to compute inner products in feature space without materializing φ(x). Faster, cleaner.
Dimensionality Growth
Polynomial degree-k features grow as O(dᵏ). RBF features are infinite-dimensional. Trade-off between expressiveness and computation.
Generalization
Higher feature dimension risks overfitting unless regularized (margin-based or ridge). Kernels + soft margin control complexity.
The kernel trick is a computational sleight-of-hand: instead of explicitly mapping input x to φ(x) and computing φ(x)ᵀφ(z), define a kernel function K(x, z) that directly computes the inner product. If K(x, z) = φ(x)ᵀφ(z) holds for some feature map φ (verified later by Mercer's theorem), then K can be used everywhere the inner product appears, avoiding the cost of computing φ.
The magic becomes apparent in the SVM decision function. In the dual formulation, the solution w = Σᵢ αᵢyᵢφ(xᵢ) involves a sum of feature vectors. For prediction, f(x) = wᵀφ(x) + b = Σᵢ αᵢyᵢφ(xᵢ)ᵀφ(x) + b = Σᵢ αᵢyᵢK(xᵢ, x) + b. The weight vector never needs to be explicitly computed; the classifier is determined entirely by kernel evaluations.
Common Kernels and Their Properties
Linear kernel: K(x, z) = xᵀz. No feature mapping, classical linear SVM.
Polynomial kernel: K(x, z) = (xᵀz + c)ᵈ. Corresponds to degree-d polynomial feature expansion. The parameter d controls polynomial degree; c shifts the constant term, affecting low-degree contributions.
Radial Basis Function (RBF): K(x, z) = exp(-γ||x - z||²). Nonparametric kernel with implicit infinite-dimensional mapping. γ controls the radius of the Gaussian; larger γ makes the kernel more localized (sharp peaks), smaller γ makes it smoother.
Sigmoid kernel: K(x, z) = tanh(κxᵀz + θ). Motivated by neural networks; not always a valid PSD kernel for all parameter values, so care is needed.
Kernel Evaluation Complexity
For polynomial kernel K(x, z) = (xᵀz + 1)ᵈ with d-dimensional input: the direct approach (xᵀz computes in O(d), then raising to power d takes O(log d) by fast exponentiation, total O(d + log d) ≈ O(d)). Explicit degree-d feature mapping would give φ(x) of dimension ≈ dᵈ, impossible to compute for large d. The kernel trick evaluates in mere O(d) time.
For RBF kernel K(x, z) = exp(-γ||x - z||²): computing ||x - z||² takes O(d) time, then exp is O(1), total O(d). Without kernels, the RBF feature space is infinite-dimensional (infinite features); kernels make it tractable.
Why Kernels Matter
Kernels decouple feature engineering from the learning algorithm. Instead of hand-crafting features φ(x), practitioners choose a kernel K(x,z) that measures similarity. This separation enables flexibility: different problems use different kernels, but the optimization machinery (SVM solver) remains unchanged.
Mercer's theorem provides the bridge between kernels and feature maps. It states: a symmetric function K: X × X → ℝ is a valid kernel if and only if the kernel matrix K (with entries Kᵢⱼ = K(xᵢ, xⱼ)) is positive semi-definite for all finite sets {x₁, ..., xₙ} in X. PSD means all eigenvalues are non-negative, equivalently, for any vector c ∈ ℝⁿ, cᵀKc ≥ 0.
The importance of Mercer's theorem is that it guarantees any symmetric PSD kernel corresponds to some feature space where a valid inner product is defined. This ensures the SVM optimization problem remains convex, with a unique global minimum. Without PSD verification, a kernel might not correspond to any feature space, and the optimization could become non-convex with multiple local minima.
Testing for Positive Semi-Definiteness
For a finite set of training points, the kernel matrix K ∈ ℝⁿˣⁿ can be checked for PSD by computing its eigenvalues using standard linear algebra. If all eigenvalues λ₁, ..., λₙ satisfy λᵢ ≥ 0, the matrix is PSD. Alternatively, Cholesky decomposition: K = LLᵀ where L is lower triangular. If decomposition succeeds without encountering negative diagonal entries, K is PSD.
Kernel Composition Rules
Product: If K₁ and K₂ are valid kernels, then K(x,z) = K₁(x,z) · K₂(x,z) is valid. Proof: their kernel matrices have non-negative eigenvalues; the Kronecker product of two PSD matrices is PSD.
Positive scalar multiple: If K is valid and α > 0, then αK is valid (scaling a PSD matrix by α > 0 preserves PSD).
Sum: If K₁ and K₂ are valid, then K₁ + K₂ is valid (sum of PSD matrices is PSD).
Exponential: If K is valid, then exp(K) = I + K + K²/2! + K³/3! + ... is valid (infinite series of products and sums preserves validity).
Kernel Design Strategies
Engineers design custom kernels by composing simple building blocks. Start with a base kernel (polynomial, RBF, linear). Combine via products, sums, or exponentials. Test PSD by eigenvalue check on a sample of data. Validate on cross-validation: a kernel might be theoretically valid but empirically poor for a specific task. Parameter tuning (e.g., RBF's γ) is crucial; cross-validation selects γ from a grid {0.001, 0.01, 0.1, 1, 10, ...}.
PSD Kernels Guarantee
- Convex optimization (unique global minimum)
- Valid feature space exists
- Mercer expansion is well-defined
- Robust theoretical foundation
Non-PSD Issues
- Non-convex, multiple local minima
- No guaranteed feature space
- Convergence is slower, unreliable
- Unpredictable generalization
The optimal margin classifier is defined by the constrained optimization problem:
minimize ½||w||² subject to yᵢ(wᵀφ(xᵢ) + b) ≥ 1 for all i
This formulation is equivalent to maximizing the geometric margin γ = 1/||w|| while ensuring all points are correctly classified with margin at least 1 (in functional margin terms). The constraint yᵢ(wᵢᵀφ(xᵢ) + b) ≥ 1 combines two requirements: (1) correct classification (sign of yᵢ · (wᵀφ(xᵢ) + b) is positive), and (2) margin (value is at least 1).
The Lagrangian and KKT Conditions
To solve the constrained problem, form the Lagrangian:
L(w, b, α) = ½||w||² - Σᵢ αᵢ[yᵢ(wᵀφ(xᵢ) + b) - 1]
where αᵢ ≥ 0 are Lagrange multipliers. At the optimum, the gradient with respect to w and b equals zero:
∂L/∂w = w - Σᵢ αᵢyᵢφ(xᵢ) = 0 ⟹ w = Σᵢ αᵢyᵢφ(xᵢ)
∂L/∂b = -Σᵢ αᵢyᵢ = 0 ⟹ Σᵢ αᵢyᵢ = 0
The KKT complementary slackness condition states: αᵢ > 0 only if the constraint is tight, i.e., yᵢ(wᵀφ(xᵢ) + b) = 1. This means only points on the margin boundary (support vectors) have non-zero αᵢ; interior points have αᵢ = 0.
Support Vectors and Sparsity
In a typical SVM solution, support vectors comprise only 1–10% of training data. For a 1000-point dataset, maybe 50–100 points are support vectors. The decision boundary is determined entirely by these few critical points: w = Σᵢ∈SV αᵢyᵢφ(xᵢ). If a non-support-vector point is moved (even deleted), the solution does not change. This sparsity property is computationally attractive: prediction f(x) = Σᵢ∈SV αᵢyᵢK(xᵢ, x) + b depends only on evaluations involving support vectors.
Geometric Interpretation in Feature Space
After feature mapping φ, the data becomes linearly separable (with high probability in high dimensions). The optimal hyperplane in feature space is the one that maximizes the margin. Support vectors are the points closest to the decision boundary; they are the "hardest" points to classify and determine where the boundary should be positioned. Non-support vectors are safely on the correct side with margin > 1 and do not constrain the solution.
Why Margins Matter for Generalization
The margin represents a zone of low density around the decision boundary. Large margins imply the classifier is "confident" in its predictions far from the boundary. Points in the margin zone are uncertain, but if the margin is wide, few (if any) test examples will fall there, reducing misclassification risk.
The dual problem is derived by substituting the KKT conditions w = Σᵢ αᵢyᵢφ(xᵢ) and Σᵢ αᵢyᵢ = 0 back into the Lagrangian:
max: Σᵢ αᵢ - ½Σᵢⱼ αᵢαⱼyᵢyⱼφ(xᵢ)ᵀφ(xⱼ) subject to Σᵢ αᵢyᵢ = 0, αᵢ ≥ 0
Using the kernel trick, φ(xᵢ)ᵀφ(xⱼ) = K(xᵢ, xⱼ), the dual becomes:
max: Σᵢ αᵢ - ½Σᵢⱼ αᵢαⱼyᵢyⱼK(xᵢ, xⱼ) subject to Σᵢ αᵢyᵢ = 0, αᵢ ≥ 0
This dual form depends only on kernel evaluations K(xᵢ, xⱼ), never on explicit feature vectors. The problem is a quadratic program (QP): the objective is quadratic in α, and constraints are linear.
Sparse Solutions via Support Vectors
Complementary slackness from KKT: αᵢ[yᵢ(wᵀφ(xᵢ) + b) - 1] = 0. So either αᵢ = 0 or yᵢ(wᵀφ(xᵢ) + b) = 1. In the solution, most αᵢ = 0 (non-support vectors). Only points with αᵢ > 0 matter; these are support vectors on the margin boundary.
Why sparsity helps: the decision boundary f(x) = Σᵢ αᵢyᵢK(xᵢ, x) + b needs only inner products involving support vectors. For n training points with m ≪ n support vectors, prediction is O(m) time instead of O(n). Storage is also O(m) instead of O(n).
Recovery of Primal Solution
After solving the dual (finding optimal α), the primal solution w is recovered by w = Σᵢ αᵢyᵢφ(xᵢ). The bias b is found using any support vector xₛ where yₛ(wᵀφ(xₛ) + b) = 1. Solving for b: b = yₛ - wᵀφ(xₛ) = yₛ - Σᵢ αᵢyᵢK(xᵢ, xₛ). To improve numerical stability, average over all support vectors: b = 1/|SV| Σₛ∈SV [yₛ - Σᵢ αᵢyᵢK(xᵢ, xₛ)].
Computational Complexity
The dual QP has n variables (α₁, ..., αₙ) and one equality constraint. Standard QP solvers (e.g., interior-point methods) take O(n³) time. For n = 10,000, this is 10¹² operations—feasible but slow. The SMO algorithm reduces this to O(n²) by solving subproblems rather than the full QP.
Dual Form Elegance
Kernel-based, nonlinear-capable, sparsity guarantees. Decision boundary depends only on support vectors and kernel.
Dual Form Challenge
Quadratic program hard to solve for large n. Requires QP solver; O(n³) baseline complexity before optimizations.
Sparse Support Vectors
Typical: 1–10% of training data are support vectors. Prediction is fast; storage is compact.
Hyperparameter Tuning
Kernel parameters (e.g., RBF γ) chosen by cross-validation on dual objective, not by inspecting weight vector w.
The hard-margin SVM assumes data is linearly separable in feature space. In practice, data is noisy, and even high-dimensional spaces may contain outliers. The soft-margin SVM relaxes separability by introducing slack variables ξᵢ ≥ 0:
minimize ½||w||² + C·Σᵢ ξᵢ subject to yᵢ(wᵀφ(xᵢ) + b) ≥ 1 - ξᵢ, ξᵢ ≥ 0
Slack ξᵢ quantifies the margin violation: ξᵢ = 0 means the point is on the correct side of the margin; 0 < ξᵢ < 1 means the point is inside the margin but correctly classified; ξᵢ ≥ 1 means the point is misclassified. The parameter C trades off margin width against training error.
The C Parameter: Regularization Trade-off
Large C (e.g., C = 100): Heavily penalizes slack, forcing the algorithm to fit training data tightly. The margin becomes narrower, and the solution tends to overfit. Use when data is clean and accurate modeling is critical.
Small C (e.g., C = 0.01): Tolerates slack, allowing many points inside or across the margin. The margin becomes wider, promoting generalization. Use when data is noisy or when you prioritize test set accuracy over training accuracy.
Balanced C (e.g., C = 1): A middle ground. Practical choice when C is unknown; cross-validation refines it.
Connection to Hinge Loss
The soft-margin formulation is equivalent to minimizing the hinge loss:
min: Σᵢ max(0, 1 - yᵢ(wᵀφ(xᵢ) + b)) + λ||w||²
where λ = 1/(2C). The hinge loss L(y, ŷ) = max(0, 1 - y·ŷ) is zero if the prediction ŷ is confidently correct (yŷ > 1), and linearly increases with margin violation. This loss is a convex upper bound on 0–1 loss, making the optimization tractable. The L₂ regularization term λ||w||² prevents overfitting by penalizing large weights.
KKT Conditions for Soft Margin
The dual QP becomes:
max: Σᵢ αᵢ - ½Σᵢⱼ αᵢαⱼyᵢyⱼK(xᵢ, xⱼ) subject to Σᵢ αᵢyᵢ = 0, 0 ≤ αᵢ ≤ C
The constraint 0 ≤ αᵢ ≤ C (instead of just αᵢ ≥ 0) caps each Lagrange multiplier. KKT conditions now include:
• If αᵢ = 0: point i is not a support vector; yᵢ(wᵀφ(xᵢ) + b) > 1 (margin satisfied with slack = 0).
• If 0 < αᵢ < C: point i is a margin support vector; yᵢ(wᵀφ(xᵢ) + b) = 1 (on margin boundary).
• If αᵢ = C: point i is a non-margin or margin-violating support vector; yᵢ(wᵀφ(xᵢ) + b) ≤ 1 (inside or across margin).
Cross-Validation for C Selection
C is a hyperparameter chosen by cross-validation. Grid search over C ∈ {0.001, 0.01, 0.1, 1, 10, 100, 1000} is standard. For each C, train on K-1 folds, validate on the held-out fold. Compute validation accuracy, then average over K folds. Select the C with highest average accuracy. This approach avoids overfitting to the training set.
Soft Margin Philosophy
Real data is messy. Perfect separation is neither achievable nor desirable; some training error is acceptable. The soft-margin SVM embraces this pragmatism: find a wide margin that tolerates a few misclassified training examples, achieving better test generalization.
Sequential Minimal Optimization (SMO), introduced by John Platt in 1998, is a breakthrough algorithm for training SVMs efficiently. Instead of solving the entire quadratic program at once, SMO repeatedly selects a pair of Lagrange multipliers (αᵢ, αⱼ) and optimizes them jointly while holding all others fixed. The key insight: a two-variable QP problem has a closed-form analytical solution, avoiding expensive numerical optimization.
The Two-Variable Optimization Problem
With all αₖ (k ≠ i, j) fixed, the constraint Σᵢ αᵢyᵢ = 0 becomes: αᵢyᵢ + αⱼyⱼ = -Σₖ≠ᵢ,ⱼ αₖyₖ = γ (constant). So αⱼyⱼ = (γ - αᵢyᵢ)/yⱼ, allowing αⱼ to be expressed in terms of αᵢ. The quadratic objective in α becomes a univariate function of αᵢ; its minimum can be found analytically by taking the derivative and solving.
The updated α̂ᵢ is clipped to ensure 0 ≤ α̂ᵢ ≤ C (or 0 ≤ αᵢ ≤ C - yᵢyⱼ(C - αⱼ) due to constraints). After updating (αᵢ, αⱼ), the bias b is recalculated using KKT conditions for support vectors.
Working Set Selection: Choosing the Pair (i, j)
Naive approach: try all (n choose 2) pairs per iteration—expensive. SMO uses heuristics to choose promising pairs:
First choice (i): Find the point with the maximum KKT violation. Define the decision function error Eᵢ = fᵢ - yᵢ = (Σₖ αₖyₖK(xₖ, xᵢ) + b) - yᵢ. The KKT conditions are violated if:
• yᵢ·Eᵢ < -τ and αᵢ < C (margin violation, should increase αᵢ), or
• yᵢ·Eᵢ > τ and αᵢ > 0 (should decrease αᵢ).
Choose i with maximum |yᵢ·Eᵢ|.
Second choice (j): Find j that maximizes the step size in the two-variable update. This is typically the point with minimum Eⱼ if yᵢ > 0, or maximum Eⱼ if yᵢ < 0. The idea: make the step as large as possible to improve the objective quickly.
Convergence and Stopping Criteria
SMO converges when all KKT conditions are satisfied within tolerance ε (typically ε = 0.001). The algorithm iterates until no pair violates KKT by more than ε. Empirically, convergence occurs in O(n²) time instead of the O(n³) of standard QP solvers, making SVMs practical for large datasets.
Implementation Details
SMO caches kernel evaluations K(xᵢ, xⱼ) to avoid recomputing them; memory is O(n²) but manageable. The algorithm maintains the decision function values Eᵢ for all points, updating them incrementally after each (αᵢ, αⱼ) change. Modern libraries (libsvm, scikit-learn) use SMO with further optimizations like shrinking (reducing the working set size as training progresses).
Platt 1998
Coordinate Ascent
Closed-Form Updates
KKT Convergence Check
O(n²) Practical Time
Why SMO Works
By breaking the QP into bite-sized two-variable subproblems with closed-form solutions, SMO avoids the O(n³) cost of general QP solvers. It's a beautiful application of divide-and-conquer, making SVMs trainable on datasets with millions of examples—previously infeasible.