Bias-Variance Tradeoff
The bias-variance tradeoff is fundamental to understanding generalization in machine learning. Bias measures how much the average prediction of a model differs from the true target, reflecting systematic errors from model simplicity. Variance measures how much predictions fluctuate across different training datasets, reflecting sensitivity to small changes in the data.
Expected test error decomposes into three components: squared bias, variance, and irreducible noise. The bias-variance curve typically exhibits a U-shape: high-complexity models (e.g., deep neural networks) have low bias but high variance, while simple models (e.g., linear regression) have high bias but low variance. Finding the optimal model complexity that minimizes this tradeoff is central to building effective predictors.
Mathematical Decomposition
Formally, consider a regression problem with true function f and predictions ŷ. The expected squared error can be decomposed as E[(ŷ - y)²] = Bias²(ŷ) + Var(ŷ) + σ², where σ² is the irreducible noise in the target. Bias²(ŷ) = (E[ŷ] - f)² measures distance from the true function, while Var(ŷ) = E[(ŷ - E[ŷ])²] measures variation across training sets.
This decomposition reveals why simple models underfit: they cannot express the true function, creating large bias. Complex models overfit: they fit noise in the training data, creating large variance. The derivation assumes the data is drawn from a true underlying distribution, and both bias and variance are computed in expectation over training set samples. Understanding this breakdown is essential for diagnosing where a model fails and choosing appropriate complexity.
Double Descent
A striking modern discovery challenges the traditional bias-variance intuition: the double descent phenomenon. When models are overparameterized (parameter count exceeds training samples), test error can decrease again after passing the interpolation threshold where models fit all training data perfectly. This suggests the classical U-curve might be incomplete.
Double descent occurs in deep neural networks, kernel methods, and random forests. The regime where model capacity greatly exceeds sample size—called the benign overfitting regime—can achieve low generalization error despite perfect training fit. This challenges the notion that fitting noise is inherently bad. The mechanism involves implicit regularization in the learning algorithm itself: SGD and early stopping naturally select smooth solutions even in overparameterized settings, providing a form of automatic capacity control.
Sample Complexity Bounds
PAC (Probably Approximately Correct) learning provides finite-sample guarantees for generalization. For a hypothesis class H with size |H|, to guarantee error at most ε with confidence 1-δ, we need sample complexity m = O(log|H|/ε²δ). This bound decreases with hypothesis class size but increases logarithmically—suggesting larger hypothesis classes need only modestly more samples.
The VC (Vapnik-Chervonenkis) dimension refines this for infinite hypothesis classes by capturing expressive capacity. A hypothesis class shatters a set of points if it can realize all 2^d labelings on d points. The VC dimension is the size of the largest shattered set. For linear classifiers in d dimensions, VC dimension equals d+1. The fundamental theorem of VC theory bounds generalization gap by VC dimension, showing why simpler models (lower VC dimension) generalize better with fewer samples.
Regularization Methods
Regularization adds a penalty term to the loss function to control model complexity. L2 regularization (ridge) penalizes squared parameter magnitudes: J(w) = L(w) + λ||w||². This smooths decision boundaries and shrinks weights toward zero, reducing variance at the cost of introducing bias. L1 regularization (lasso) uses J(w) = L(w) + λ||w||₁, inducing sparsity by setting some weights exactly to zero, providing feature selection.
The regularization strength λ controls the bias-variance tradeoff directly: large λ increases bias but decreases variance, while small λ increases variance. Elastic Net combines L1 and L2: J(w) = L(w) + λ₁||w||₁ + λ₂||w||². From a Bayesian perspective, L2 regularization corresponds to Gaussian priors on weights, while L1 corresponds to Laplace priors. Regularization paths visualize how solution coefficients change with λ, revealing which features remain important at different complexity levels.
Implicit Regularization
Modern deep learning benefits from implicit regularization—built-in capacity control through the learning algorithm itself, even without explicit penalties. Early stopping halts training before overfitting begins, effectively limiting the number of gradient descent steps and thus model complexity. This simple technique often outperforms explicit L2 regularization, particularly in neural networks.
Stochastic gradient descent (SGD) provides implicit regularization through gradient noise. The random minibatch sampling introduces noise that biases optimization toward flatter minima and smoother solutions, favoring better-generalizing models. Overparameterized networks trained with SGD often exhibit implicit bias toward low-rank or simple solutions despite their capacity. Batch normalization and layer normalization also contribute to regularization by stabilizing gradients and smoothing the loss landscape. These mechanisms help explain why overparameterized neural networks trained with SGD and early stopping generalize well despite not using explicit L1/L2 penalties.
Key Insight
Implicit regularization shows that the training algorithm itself can control capacity and prevent overfitting. This fundamentally changed how practitioners approach deep learning, moving away from excessive explicit regularization toward designing algorithms with good implicit bias.
Model Selection
Model selection involves choosing hyperparameters (λ, network depth, etc.) and model architecture to minimize generalization error. The standard approach uses a three-way split: training set for fitting, validation set for model selection, and test set for final evaluation. K-fold cross-validation partitions data into k disjoint folds, training on k-1 folds and validating on the remaining fold, then averaging performance. Leave-One-Out Cross-Validation (LOOCV) uses each point as a validation set, providing low-bias but high-variance estimates.
Practical model selection often uses validation curves, plotting training and validation error as hyperparameters vary. Large gaps indicate overfitting. Grid search exhaustively evaluates a predefined hyperparameter space, while random search and Bayesian optimization sample more efficiently. Information criteria like AIC and BIC balance fit and complexity, useful when validation data is scarce. Cross-validation remains the gold standard: it fully leverages limited data and provides reliable estimates of generalization performance across different hyperparameter settings.
Bayesian Perspective
Bayesian machine learning interprets regularization as prior knowledge about parameters. L2 regularization J(w) = L(w) + λ||w||² corresponds to placing a Gaussian prior p(w) ∝ exp(-λ||w||²) on weights, with MAP (maximum a posteriori) estimation equivalent to regularized empirical risk. L1 regularization corresponds to Laplace priors. This perspective naturally encodes domain knowledge: strong priors (large λ) express high uncertainty about parameters, while weak priors (small λ) allow data to speak freely.
Bayesian model comparison uses marginal likelihood (evidence) to compare models: p(D|M) = ∫ p(D|w,M)p(w|M)dw. This automatically penalizes complexity—the integral over large parameter spaces is smaller—without requiring a separate validation set. The Bayesian approach also provides uncertainty quantification: posterior distributions p(w|D) encode parameter uncertainty, enabling prediction intervals and principled decision-making under uncertainty. Variational inference and MCMC methods approximate intractable posteriors, making Bayesian methods scalable to modern problems.
Prior as Regularizer
Strong priors (high certainty) lead to high regularization pressure, while weak priors allow data-driven learning. This unifies frequentist and Bayesian perspectives.
Model Comparison
Marginal likelihood automatically balances fit and complexity without validation data, providing principled model selection.
Uncertainty Quantification
Posterior distributions over parameters provide credible intervals and prediction uncertainty, essential for risk-sensitive applications.
Hierarchical Priors
Empirical Bayes and hierarchical models learn hyperparameters from data, automating the bias-variance tradeoff.
References & Further Reading
Key References: