Bias-Variance & Regularization

STANFORD XCS229 · MACHINE LEARNING

Bias-Variance & Regularization

Bias-Variance Tradeoff

The bias-variance tradeoff is fundamental to understanding generalization in machine learning. Bias measures how much the average prediction of a model differs from the true target, reflecting systematic errors from model simplicity. Variance measures how much predictions fluctuate across different training datasets, reflecting sensitivity to small changes in the data.

Expected test error decomposes into three components: squared bias, variance, and irreducible noise. The bias-variance curve typically exhibits a U-shape: high-complexity models (e.g., deep neural networks) have low bias but high variance, while simple models (e.g., linear regression) have high bias but low variance. Finding the optimal model complexity that minimizes this tradeoff is central to building effective predictors.

Bias

Systematic Error

Variance

Instability

Noise

Irreducible

U-Curve

Optimal Point

Mathematical Decomposition

Formally, consider a regression problem with true function f and predictions ŷ. The expected squared error can be decomposed as E[(ŷ - y)²] = Bias²(ŷ) + Var(ŷ) + σ², where σ² is the irreducible noise in the target. Bias²(ŷ) = (E[ŷ] - f)² measures distance from the true function, while Var(ŷ) = E[(ŷ - E[ŷ])²] measures variation across training sets.

This decomposition reveals why simple models underfit: they cannot express the true function, creating large bias. Complex models overfit: they fit noise in the training data, creating large variance. The derivation assumes the data is drawn from a true underlying distribution, and both bias and variance are computed in expectation over training set samples. Understanding this breakdown is essential for diagnosing where a model fails and choosing appropriate complexity.

Double Descent

A striking modern discovery challenges the traditional bias-variance intuition: the double descent phenomenon. When models are overparameterized (parameter count exceeds training samples), test error can decrease again after passing the interpolation threshold where models fit all training data perfectly. This suggests the classical U-curve might be incomplete.

Double descent occurs in deep neural networks, kernel methods, and random forests. The regime where model capacity greatly exceeds sample size—called the benign overfitting regime—can achieve low generalization error despite perfect training fit. This challenges the notion that fitting noise is inherently bad. The mechanism involves implicit regularization in the learning algorithm itself: SGD and early stopping naturally select smooth solutions even in overparameterized settings, providing a form of automatic capacity control.

Threshold Benign Overfitting Classical U

Sample Complexity Bounds

PAC (Probably Approximately Correct) learning provides finite-sample guarantees for generalization. For a hypothesis class H with size |H|, to guarantee error at most ε with confidence 1-δ, we need sample complexity m = O(log|H|/ε²δ). This bound decreases with hypothesis class size but increases logarithmically—suggesting larger hypothesis classes need only modestly more samples.

The VC (Vapnik-Chervonenkis) dimension refines this for infinite hypothesis classes by capturing expressive capacity. A hypothesis class shatters a set of points if it can realize all 2^d labelings on d points. The VC dimension is the size of the largest shattered set. For linear classifiers in d dimensions, VC dimension equals d+1. The fundamental theorem of VC theory bounds generalization gap by VC dimension, showing why simpler models (lower VC dimension) generalize better with fewer samples.

PAC

Probably Approx. Correct

VC Dim

Capacity Measure

O(log|H|)

Sample Complex.

Tighter

Bounds

Regularization Methods

Regularization adds a penalty term to the loss function to control model complexity. L2 regularization (ridge) penalizes squared parameter magnitudes: J(w) = L(w) + λ||w||². This smooths decision boundaries and shrinks weights toward zero, reducing variance at the cost of introducing bias. L1 regularization (lasso) uses J(w) = L(w) + λ||w||₁, inducing sparsity by setting some weights exactly to zero, providing feature selection.

The regularization strength λ controls the bias-variance tradeoff directly: large λ increases bias but decreases variance, while small λ increases variance. Elastic Net combines L1 and L2: J(w) = L(w) + λ₁||w||₁ + λ₂||w||². From a Bayesian perspective, L2 regularization corresponds to Gaussian priors on weights, while L1 corresponds to Laplace priors. Regularization paths visualize how solution coefficients change with λ, revealing which features remain important at different complexity levels.

Implicit Regularization

Modern deep learning benefits from implicit regularization—built-in capacity control through the learning algorithm itself, even without explicit penalties. Early stopping halts training before overfitting begins, effectively limiting the number of gradient descent steps and thus model complexity. This simple technique often outperforms explicit L2 regularization, particularly in neural networks.

Stochastic gradient descent (SGD) provides implicit regularization through gradient noise. The random minibatch sampling introduces noise that biases optimization toward flatter minima and smoother solutions, favoring better-generalizing models. Overparameterized networks trained with SGD often exhibit implicit bias toward low-rank or simple solutions despite their capacity. Batch normalization and layer normalization also contribute to regularization by stabilizing gradients and smoothing the loss landscape. These mechanisms help explain why overparameterized neural networks trained with SGD and early stopping generalize well despite not using explicit L1/L2 penalties.

Key Insight

Implicit regularization shows that the training algorithm itself can control capacity and prevent overfitting. This fundamentally changed how practitioners approach deep learning, moving away from excessive explicit regularization toward designing algorithms with good implicit bias.

Model Selection

Model selection involves choosing hyperparameters (λ, network depth, etc.) and model architecture to minimize generalization error. The standard approach uses a three-way split: training set for fitting, validation set for model selection, and test set for final evaluation. K-fold cross-validation partitions data into k disjoint folds, training on k-1 folds and validating on the remaining fold, then averaging performance. Leave-One-Out Cross-Validation (LOOCV) uses each point as a validation set, providing low-bias but high-variance estimates.

Practical model selection often uses validation curves, plotting training and validation error as hyperparameters vary. Large gaps indicate overfitting. Grid search exhaustively evaluates a predefined hyperparameter space, while random search and Bayesian optimization sample more efficiently. Information criteria like AIC and BIC balance fit and complexity, useful when validation data is scarce. Cross-validation remains the gold standard: it fully leverages limited data and provides reliable estimates of generalization performance across different hyperparameter settings.

Bayesian Perspective

Bayesian machine learning interprets regularization as prior knowledge about parameters. L2 regularization J(w) = L(w) + λ||w||² corresponds to placing a Gaussian prior p(w) ∝ exp(-λ||w||²) on weights, with MAP (maximum a posteriori) estimation equivalent to regularized empirical risk. L1 regularization corresponds to Laplace priors. This perspective naturally encodes domain knowledge: strong priors (large λ) express high uncertainty about parameters, while weak priors (small λ) allow data to speak freely.

Bayesian model comparison uses marginal likelihood (evidence) to compare models: p(D|M) = ∫ p(D|w,M)p(w|M)dw. This automatically penalizes complexity—the integral over large parameter spaces is smaller—without requiring a separate validation set. The Bayesian approach also provides uncertainty quantification: posterior distributions p(w|D) encode parameter uncertainty, enabling prediction intervals and principled decision-making under uncertainty. Variational inference and MCMC methods approximate intractable posteriors, making Bayesian methods scalable to modern problems.

Prior as Regularizer

Strong priors (high certainty) lead to high regularization pressure, while weak priors allow data-driven learning. This unifies frequentist and Bayesian perspectives.

Model Comparison

Marginal likelihood automatically balances fit and complexity without validation data, providing principled model selection.

Uncertainty Quantification

Posterior distributions over parameters provide credible intervals and prediction uncertainty, essential for risk-sensitive applications.

Hierarchical Priors

Empirical Bayes and hierarchical models learn hyperparameters from data, automating the bias-variance tradeoff.

References & Further Reading

Key References:

E_D[(y - h(x,D))²] = E_D[y² - 2y·h(x,D) + h(x,D)²] = E[y²] - 2E[y]E_D[h(x,D)] + E_D[h(x,D)²] = E[y²] - E[y]² + E[y]² - 2E[y]E_D[h(x,D)] + E_D[h(x,D)²] - E_D[h(x,D)]² + E_D[h(x,D)]² = Var(y) + (E[y] - E_D[h])² + E_D[(h - E_D[h])²]

Bias-Variance Tradeoff

Mathematical Decomposition

Double Descent

Sample Complexity Bounds

Regularization Methods

Implicit Regularization

Key Insight

Model Selection

Bayesian Perspective

Prior as Regularizer

Model Comparison

Uncertainty Quantification

Hierarchical Priors

References & Further Reading

Understanding Model Capacity

Formal Decomposition

Implications

Deriving the Error Decomposition

Example: Polynomial Regression

Why Cross-Validation Works

Beyond the U-Curve: Modern Phenomena

The Interpolation Threshold

Implicit Regularization Explanation

Practical Implications

How Much Data Do You Really Need?

VC Dimension and Infinite Classes

Practical Use

Controlling Complexity Explicitly

L2 Regularization (Ridge)

L1 Regularization (Lasso)

Elastic Net

Choosing λ

Regularization Without Penalties

Early Stopping

SGD Noise

Depth and Width

Practical Value

Choosing Architectures and Hyperparameters

The Three-Way Split

K-Fold Cross-Validation

Hyperparameter Tuning

Validation Curves

Information Criteria

Priors, Posteriors, and Uncertainty

Regularization as Prior

Model Comparison via Evidence

Uncertainty Quantification

Empirical Bayes

Computational Challenge

Sources & References

Course Materials

Textbooks & Papers