Linear Regression & Gradient Descent

01

Supervised Learning Setup

Supervised learning begins with a training set of m examples, each pair of input x and output y. For regression tasks, we assume a linear relationship: our hypothesis function is h(x) = θ₀ + θ₁x₁ + ... + θₙxₙ, parameterized by weights θ. The design matrix X stacks examples as rows, and we estimate θ to minimize the gap between predictions and true values.

The core objective is to find parameters that minimize the cost function J(θ) = ½ Σᵢ(hθ(xⁱ) − yⁱ)². This sum-of-squared-errors formulation penalizes large deviations equally regardless of direction. The problem is convex, meaning any local minimum is global, guaranteeing a unique solution.

m

Training Examples

n

Features per Example

θ ∈ ℝⁿ

Parameter Space

Convex

Optimization Type

Design Matrix X Hypothesis h(x) Cost Function J(θ) Sum of Squares

02

The LMS Algorithm

The Least Mean Squares (LMS) algorithm uses gradient descent to iteratively update parameters toward the minimum. The gradient of the cost function is ∇J(θ) = Xᵀ(Xθ − y). At each iteration, we move in the negative gradient direction scaled by learning rate α: θ := θ − α ∇J(θ). Smaller α converges slowly but safely; larger α risks overshooting the minimum.

The learning rate α is critical and often problem-dependent. Too small and convergence takes forever; too large and oscillation prevents convergence. A practical heuristic is to start with α around 0.01 and adjust based on empirical convergence speed. The algorithm terminates when gradients become negligible or iteration count reaches a preset limit.

Update Rule

θ := θ − α Xᵀ(Xθ − y) per iteration, steering downhill.

Convergence Guarantee

Convex objective ensures global minimum; rate depends on condition number of X.

α

Learning Rate

0.01–0.1

Typical Range

Iterative

Update Type

≈1/κ(X)

Optimal Rate

03

Batch vs Stochastic GD

Batch gradient descent processes all m training examples per iteration, computing the full gradient ∇J(θ) = Xᵀ(Xθ − y). This guarantees downhill motion at every step and smooth convergence, but becomes prohibitively slow for large datasets (m = millions). Each iteration costs O(nm) time.

Stochastic gradient descent (SGD) processes one example at a time: θ := θ − α(hθ(xⁱ) − yⁱ)xⁱ. This is much faster—O(n) per iteration—but the gradient estimate is noisy, causing oscillation around the optimum. Mini-batch GD balances both: process k examples per iteration (k = 32, 256, etc.), reducing variance while remaining efficient on GPUs.

Batch GD

Smooth, monotonic convergence
Better use of vectorization
No hyperparameter noise

Batch GD

O(nm) cost per iteration
Impractical for large m
Can get stuck in poor local regions

SGD

O(n) cost per iteration
Scales to massive datasets
Escapes local minima easily

SGD

Noisy gradient estimates
Oscillates around optimum
Learning rate tuning critical

04

Normal Equations

Instead of iterating, we can solve the optimization problem in closed form. Setting the gradient to zero—∇J(θ) = Xᵀ(Xθ − y) = 0—yields θ = (XᵀX)⁻¹Xᵀy. This normal equation gives the exact least-squares solution without iteration. If XᵀX is invertible, there is a unique solution.

The computational cost is O(n³) due to matrix inversion, making this impractical for very high-dimensional problems (n > 100,000). However, for moderate n, it avoids hyperparameter tuning and convergence criteria. Numerical stability can be an issue if XᵀX is ill-conditioned (condition number κ >> 1), in which case small perturbations in data cause large changes in θ.

O(n³)

Inversion Cost

1 iteration

Solution Time

Exact

Solution Quality

κ(X)

Stability Factor

Closed-Form (X^T X)^{-1} No Iteration Numerical Stability

05

Matrix Derivatives

Deriving the normal equations requires matrix calculus. Key identities include ∂/∂A tr(AB) = Bᵀ, ∂/∂A tr(AᵀB) = B, and ∂/∂A tr(AᵀAB) = 2AB. The cost function can be written as J(θ) = ½ tr((Xθ − y)ᵀ(Xθ − y)). Expanding and differentiating: ∇J(θ) = XᵀXθ − Xᵀy.

These trace tricks transform matrix equations into scalar traces, enabling standard calculus rules. The key insight is that ∂/∂θ θᵀAθ = (A + Aᵀ)θ, and when A is symmetric (as XᵀX is), this becomes 2Aθ. Mastering these identities is essential for deriving optimization algorithms in machine learning.

Key Matrix Identities

∂/∂A tr(AB) = Bᵀ, ∂/∂A tr(AᵀAB) = 2AB, ∂/∂θ θᵀAθ = 2Aθ (A symmetric). These are derivable from the scalar product expansion and the definition of the gradient.

Trace Properties Chain Rule Symmetry Scalar Expansion

06

Probabilistic Interpretation

Assume observations follow a Gaussian model: yⁱ = θᵀxⁱ + εⁱ, where εⁱ ~ N(0, σ²) are i.i.d. noise. Then yⁱ | xⁱ; θ ~ N(θᵀxⁱ, σ²). The likelihood of the entire dataset is the product of individual Gaussians: L(θ) = ∏ᵢ (1/(√(2πσ²))) exp(−(yⁱ − θᵀxⁱ)² / (2σ²)).

Maximizing the log-likelihood ℓ(θ) = − m/(2σ²) J(θ) + const is equivalent to minimizing J(θ), the sum-of-squared-errors cost. This probabilistic view justifies the choice of quadratic loss and connects it to maximum likelihood estimation. It also reveals assumptions (Gaussianity, homoscedasticity) underlying least-squares regression.

Gaussian Noise

Assumption: ε ~ N(0, σ²) ensures optimality of squared-error loss under MLE.

MLE Equivalence

Minimizing sum-of-squares is equivalent to maximum likelihood when errors are Gaussian.

σ²

Noise Variance

N(0,σ²)

Error Distribution

L(θ)

Likelihood

MLE

Solution Method

07

Locally Weighted Regression

Locally Weighted Regression (LWR) is a non-parametric method that weights training examples based on proximity to the query point. For prediction at x, we solve min_θ Σᵢ wⁱ(yⁱ − θᵀxⁱ)², where weight wⁱ = exp(−(||xⁱ − x||² / (2τ²))). Examples near x have weight ≈ 1; distant examples decay exponentially. The bandwidth τ controls the locality radius.

LWR requires fitting a new model θ for each query point, making it O(m) at test time—much slower than parametric methods but more flexible. It adapts to local data structure without assuming global linearity. However, it is prone to overfitting in high dimensions ("curse of dimensionality") and lacks explicit regularization. The bandwidth τ is crucial: small τ fits only nearby points (high variance, low bias); large τ averages many points (low variance, high bias).

LWR Strengths

Flexible, non-parametric
Adapts to local structure
No global assumption needed

LWR Weaknesses

O(m) prediction cost
High-dimensional curse
τ tuning critical

τ

Bandwidth Parameter

O(m)

Test Cost

Non-parametric

Model Type

Gaussian

Kernel Shape

08

Practical Considerations

In practice, feature scaling dramatically improves convergence. If features have vastly different ranges (e.g., age 0–100 vs. income 0–1,000,000), gradient descent becomes inefficient because the loss surface is elongated. Standardizing features to zero mean and unit variance—x̃ = (x − μ) / σ—makes the loss surface more circular, enabling larger, safer learning rates.

Convergence criteria matter: monitor the norm of parameter changes ||θ^(t+1) − θ^(t)|| or the gradient norm ||∇J(θ)||. When these drop below a threshold, training has effectively converged. Regularization (L2 or L1) is often added to prevent overfitting: J(θ) + λ||θ||² or + λ||θ||₁. This introduces a bias-variance tradeoff; larger λ reduces overfitting but increases bias.

Common Pitfalls

Choosing learning rate too large causes divergence; too small causes slow convergence. Forgetting feature scaling can require 10x more iterations. Using an ill-conditioned design matrix (features are linearly dependent) makes (XᵀX)⁻¹ numerically unstable. Always check data quality and feature correlations before training.

μ, σ

Normalization Stats

λ

Regularization Weight

10⁻⁶

Gradient Threshold

Cross-Validation

Tuning Tool

09

References & Further Reading

Key References:

Supervised Learning Setup

The LMS Algorithm

Update Rule

Convergence Guarantee

Batch vs Stochastic GD

Batch GD

Batch GD

SGD

SGD

Normal Equations

Matrix Derivatives

Key Matrix Identities

Probabilistic Interpretation

Gaussian Noise

MLE Equivalence

Locally Weighted Regression

LWR Strengths

LWR Weaknesses

Practical Considerations

Common Pitfalls

References & Further Reading

Supervised Learning Setup

The Design Matrix and Feature Representation

Convexity and Global Optimality

Key Insight: Convexity

The LMS Algorithm

Convergence Analysis

Learning Rate Selection

Practical Tip: Learning Rate Scheduling

Batch vs Stochastic Gradient Descent

Mini-Batch Gradient Descent

Epoch and Learning Rate Schedules

When to Use Which

Normal Equations

Computational Complexity

Non-Invertibility and Regularization

Ridge Regression (L2 Regularization)

Matrix Derivatives

Key Matrix Identities

Hessian and Convexity Verification

Symmetry Simplifies Derivatives

Probabilistic Interpretation

Maximum Likelihood Estimation

Uncertainty Quantification

Assumptions Hidden in Least Squares

Locally Weighted Regression

Bandwidth Selection

Curse of Dimensionality

Non-Parametric Advantage

Practical Considerations

Feature Scaling and Normalization

Convergence Criteria and Early Stopping

Regularization to Prevent Overfitting

Model Selection and Cross-Validation

Common Mistakes to Avoid

Workflow Summary

Sources & References

Course Materials

Textbooks & Papers