STANFORD XCS229 · MACHINE LEARNING
Linear Regression
Week 1–2 · Gradient Descent & Normal Equations
01

Supervised Learning Setup

Supervised learning begins with a training set of m examples, each pair of input x and output y. For regression tasks, we assume a linear relationship: our hypothesis function is h(x) = θ₀ + θ₁x₁ + ... + θₙxₙ, parameterized by weights θ. The design matrix X stacks examples as rows, and we estimate θ to minimize the gap between predictions and true values.

The core objective is to find parameters that minimize the cost function J(θ) = ½ Σᵢ(hθ(xⁱ) − yⁱ)². This sum-of-squared-errors formulation penalizes large deviations equally regardless of direction. The problem is convex, meaning any local minimum is global, guaranteeing a unique solution.

m
Training Examples
n
Features per Example
θ ∈ ℝⁿ
Parameter Space
Convex
Optimization Type
Design Matrix X Hypothesis h(x) Cost Function J(θ) Sum of Squares
02

The LMS Algorithm

The Least Mean Squares (LMS) algorithm uses gradient descent to iteratively update parameters toward the minimum. The gradient of the cost function is ∇J(θ) = Xᵀ(Xθ − y). At each iteration, we move in the negative gradient direction scaled by learning rate α: θ := θ − α ∇J(θ). Smaller α converges slowly but safely; larger α risks overshooting the minimum.

The learning rate α is critical and often problem-dependent. Too small and convergence takes forever; too large and oscillation prevents convergence. A practical heuristic is to start with α around 0.01 and adjust based on empirical convergence speed. The algorithm terminates when gradients become negligible or iteration count reaches a preset limit.

Update Rule

θ := θ − α Xᵀ(Xθ − y) per iteration, steering downhill.

Convergence Guarantee

Convex objective ensures global minimum; rate depends on condition number of X.

α
Learning Rate
0.01–0.1
Typical Range
Iterative
Update Type
≈1/κ(X)
Optimal Rate
03

Batch vs Stochastic GD

Batch gradient descent processes all m training examples per iteration, computing the full gradient ∇J(θ) = Xᵀ(Xθ − y). This guarantees downhill motion at every step and smooth convergence, but becomes prohibitively slow for large datasets (m = millions). Each iteration costs O(nm) time.

Stochastic gradient descent (SGD) processes one example at a time: θ := θ − α(hθ(xⁱ) − yⁱ)xⁱ. This is much faster—O(n) per iteration—but the gradient estimate is noisy, causing oscillation around the optimum. Mini-batch GD balances both: process k examples per iteration (k = 32, 256, etc.), reducing variance while remaining efficient on GPUs.

Batch GD

  • Smooth, monotonic convergence
  • Better use of vectorization
  • No hyperparameter noise

Batch GD

  • O(nm) cost per iteration
  • Impractical for large m
  • Can get stuck in poor local regions

SGD

  • O(n) cost per iteration
  • Scales to massive datasets
  • Escapes local minima easily

SGD

  • Noisy gradient estimates
  • Oscillates around optimum
  • Learning rate tuning critical
04

Normal Equations

Instead of iterating, we can solve the optimization problem in closed form. Setting the gradient to zero—∇J(θ) = Xᵀ(Xθ − y) = 0—yields θ = (XᵀX)⁻¹Xᵀy. This normal equation gives the exact least-squares solution without iteration. If XᵀX is invertible, there is a unique solution.

The computational cost is O(n³) due to matrix inversion, making this impractical for very high-dimensional problems (n > 100,000). However, for moderate n, it avoids hyperparameter tuning and convergence criteria. Numerical stability can be an issue if XᵀX is ill-conditioned (condition number κ >> 1), in which case small perturbations in data cause large changes in θ.

O(n³)
Inversion Cost
1 iteration
Solution Time
Exact
Solution Quality
κ(X)
Stability Factor
Closed-Form (X^T X)^{-1} No Iteration Numerical Stability
05

Matrix Derivatives

Deriving the normal equations requires matrix calculus. Key identities include ∂/∂A tr(AB) = Bᵀ, ∂/∂A tr(AᵀB) = B, and ∂/∂A tr(AᵀAB) = 2AB. The cost function can be written as J(θ) = ½ tr((Xθ − y)ᵀ(Xθ − y)). Expanding and differentiating: ∇J(θ) = XᵀXθ − Xᵀy.

These trace tricks transform matrix equations into scalar traces, enabling standard calculus rules. The key insight is that ∂/∂θ θᵀAθ = (A + Aᵀ)θ, and when A is symmetric (as XᵀX is), this becomes 2Aθ. Mastering these identities is essential for deriving optimization algorithms in machine learning.

Key Matrix Identities

∂/∂A tr(AB) = Bᵀ, ∂/∂A tr(AᵀAB) = 2AB, ∂/∂θ θᵀAθ = 2Aθ (A symmetric). These are derivable from the scalar product expansion and the definition of the gradient.

Trace Properties Chain Rule Symmetry Scalar Expansion
06

Probabilistic Interpretation

Assume observations follow a Gaussian model: yⁱ = θᵀxⁱ + εⁱ, where εⁱ ~ N(0, σ²) are i.i.d. noise. Then yⁱ | xⁱ; θ ~ N(θᵀxⁱ, σ²). The likelihood of the entire dataset is the product of individual Gaussians: L(θ) = ∏ᵢ (1/(√(2πσ²))) exp(−(yⁱ − θᵀxⁱ)² / (2σ²)).

Maximizing the log-likelihood ℓ(θ) = − m/(2σ²) J(θ) + const is equivalent to minimizing J(θ), the sum-of-squared-errors cost. This probabilistic view justifies the choice of quadratic loss and connects it to maximum likelihood estimation. It also reveals assumptions (Gaussianity, homoscedasticity) underlying least-squares regression.

Gaussian Noise

Assumption: ε ~ N(0, σ²) ensures optimality of squared-error loss under MLE.

MLE Equivalence

Minimizing sum-of-squares is equivalent to maximum likelihood when errors are Gaussian.

σ²
Noise Variance
N(0,σ²)
Error Distribution
L(θ)
Likelihood
MLE
Solution Method
07

Locally Weighted Regression

Locally Weighted Regression (LWR) is a non-parametric method that weights training examples based on proximity to the query point. For prediction at x, we solve min_θ Σᵢ wⁱ(yⁱ − θᵀxⁱ)², where weight wⁱ = exp(−(||xⁱ − x||² / (2τ²))). Examples near x have weight ≈ 1; distant examples decay exponentially. The bandwidth τ controls the locality radius.

LWR requires fitting a new model θ for each query point, making it O(m) at test time—much slower than parametric methods but more flexible. It adapts to local data structure without assuming global linearity. However, it is prone to overfitting in high dimensions ("curse of dimensionality") and lacks explicit regularization. The bandwidth τ is crucial: small τ fits only nearby points (high variance, low bias); large τ averages many points (low variance, high bias).

LWR Strengths

  • Flexible, non-parametric
  • Adapts to local structure
  • No global assumption needed

LWR Weaknesses

  • O(m) prediction cost
  • High-dimensional curse
  • τ tuning critical
τ
Bandwidth Parameter
O(m)
Test Cost
Non-parametric
Model Type
Gaussian
Kernel Shape
08

Practical Considerations

In practice, feature scaling dramatically improves convergence. If features have vastly different ranges (e.g., age 0–100 vs. income 0–1,000,000), gradient descent becomes inefficient because the loss surface is elongated. Standardizing features to zero mean and unit variance—x̃ = (x − μ) / σ—makes the loss surface more circular, enabling larger, safer learning rates.

Convergence criteria matter: monitor the norm of parameter changes ||θ^(t+1) − θ^(t)|| or the gradient norm ||∇J(θ)||. When these drop below a threshold, training has effectively converged. Regularization (L2 or L1) is often added to prevent overfitting: J(θ) + λ||θ||² or + λ||θ||₁. This introduces a bias-variance tradeoff; larger λ reduces overfitting but increases bias.

Common Pitfalls

Choosing learning rate too large causes divergence; too small causes slow convergence. Forgetting feature scaling can require 10x more iterations. Using an ill-conditioned design matrix (features are linearly dependent) makes (XᵀX)⁻¹ numerically unstable. Always check data quality and feature correlations before training.

μ, σ
Normalization Stats
λ
Regularization Weight
10⁻⁶
Gradient Threshold
Cross-Validation
Tuning Tool