Supervised learning begins with a training set of m examples, each a pair of an input x and an output y. For regression tasks, we assume a linear relationship: our hypothesis function is h(x) = θ₀ + θ₁x₁ + ... + θₙxₙ, parameterized by weights θ. The design matrix X stacks examples as rows, and we estimate θ to minimize the gap between predictions and true values.
The core objective is to find parameters that minimize the cost function J(θ) = ½ Σᵢ(hθ(xⁱ) − yⁱ)². This sum-of-squared-errors formulation penalizes positive and negative deviations equally and penalizes large deviations more heavily than small ones. The problem is convex, meaning any local minimum is global; the minimizer is unique when X has full column rank.
m
Training Examples
n
Features per Example
θ ∈ ℝⁿ
Parameter Space
Convex
Optimization Type
Design Matrix X · Hypothesis h(x) · Cost Function J(θ) · Sum of Squares
02
The LMS Algorithm
The Least Mean Squares (LMS) algorithm uses gradient descent to iteratively update parameters toward the minimum. The gradient of the cost function is ∇J(θ) = Xᵀ(Xθ − y). At each iteration, we move in the negative gradient direction scaled by learning rate α: θ := θ − α ∇J(θ). Smaller α converges slowly but safely; larger α risks overshooting the minimum.
The learning rate α is critical and often problem-dependent. Too small and convergence takes forever; too large and oscillation prevents convergence. A practical heuristic is to start with α around 0.01 and adjust based on empirical convergence speed. The algorithm terminates when gradients become negligible or iteration count reaches a preset limit.
Convex objective ensures global minimum; rate depends on condition number of X.
α
Learning Rate
0.01–0.1
Typical Range
Iterative
Update Type
2/(λ₁ + λₙ)
Optimal Rate
03
Batch vs Stochastic GD
Batch gradient descent processes all m training examples per iteration, computing the full gradient ∇J(θ) = Xᵀ(Xθ − y). With a sufficiently small learning rate, this guarantees downhill motion at every step and smooth convergence, but it becomes prohibitively slow for large datasets (m in the millions). Each iteration costs O(nm) time.
Stochastic gradient descent (SGD) processes one example at a time: θ := θ − α(hθ(xⁱ) − yⁱ)xⁱ. This is much faster—O(n) per iteration—but the gradient estimate is noisy, causing oscillation around the optimum. Mini-batch GD balances both: process k examples per iteration (k = 32, 256, etc.), reducing variance while remaining efficient on GPUs.
Batch GD Strengths
Smooth, monotonic convergence
Better use of vectorization
No gradient sampling noise
Batch GD Weaknesses
O(nm) cost per iteration
Impractical for large m
Each update waits for a full pass over the data
SGD Strengths
O(n) cost per iteration
Scales to massive datasets
Noise can help escape local minima (non-convex problems only)
SGD Weaknesses
Noisy gradient estimates
Oscillates around optimum
Learning rate tuning critical
04
Normal Equations
Instead of iterating, we can solve the optimization problem in closed form. Setting the gradient to zero—∇J(θ) = Xᵀ(Xθ − y) = 0—yields θ = (XᵀX)⁻¹Xᵀy. This normal equation gives the exact least-squares solution without iteration. If XᵀX is invertible, there is a unique solution.
The computational cost is O(n³) due to matrix inversion, making this impractical for very high-dimensional problems (n > 100,000). However, for moderate n, it avoids hyperparameter tuning and convergence criteria. Numerical stability can be an issue if XᵀX is ill-conditioned (condition number κ >> 1), in which case small perturbations in data cause large changes in θ.
Deriving the normal equations requires matrix calculus. Key identities include ∂/∂A tr(AB) = Bᵀ, ∂/∂A tr(AᵀB) = B, and ∂/∂A tr(AᵀAB) = A(B + Bᵀ), which reduces to 2AB when B is symmetric. The cost function can be written as J(θ) = ½ tr((Xθ − y)ᵀ(Xθ − y)). Expanding and differentiating: ∇J(θ) = XᵀXθ − Xᵀy.
These trace tricks transform matrix equations into scalar traces, enabling standard calculus rules. The key insight is that ∂/∂θ θᵀAθ = (A + Aᵀ)θ, and when A is symmetric (as XᵀX is), this becomes 2Aθ. Mastering these identities is essential for deriving optimization algorithms in machine learning.
Key Matrix Identities
∂/∂A tr(AB) = Bᵀ, ∂/∂A tr(AᵀAB) = A(B + Bᵀ) (= 2AB for symmetric B), ∂/∂θ θᵀAθ = 2Aθ (A symmetric). These are derivable from the scalar product expansion and the definition of the gradient.
Assume observations follow a Gaussian model: yⁱ = θᵀxⁱ + εⁱ, where εⁱ ~ N(0, σ²) are i.i.d. noise. Then yⁱ | xⁱ; θ ~ N(θᵀxⁱ, σ²). The likelihood of the entire dataset is the product of individual Gaussians: L(θ) = ∏ᵢ (1/(√(2πσ²))) exp(−(yⁱ − θᵀxⁱ)² / (2σ²)).
Maximizing the log-likelihood ℓ(θ) = −(1/σ²) J(θ) + const is equivalent to minimizing J(θ), the sum-of-squared-errors cost. This probabilistic view justifies the choice of quadratic loss and connects it to maximum likelihood estimation. It also reveals the assumptions (Gaussianity, homoscedasticity) underlying least-squares regression.
Gaussian Noise
Assumption: ε ~ N(0, σ²) ensures optimality of squared-error loss under MLE.
MLE Equivalence
Minimizing sum-of-squares is equivalent to maximum likelihood when errors are Gaussian.
σ²
Noise Variance
N(0,σ²)
Error Distribution
L(θ)
Likelihood
MLE
Solution Method
07
Locally Weighted Regression
Locally Weighted Regression (LWR) is a non-parametric method that weights training examples based on proximity to the query point. For prediction at x, we solve min_θ Σᵢ wⁱ(yⁱ − θᵀxⁱ)², where weight wⁱ = exp(−(||xⁱ − x||² / (2τ²))). Examples near x have weight ≈ 1; distant examples decay exponentially. The bandwidth τ controls the locality radius.
LWR requires fitting a new model θ for each query point, so each prediction costs time linear in m (O(m) for fixed n). This is much slower than prediction with a parametric model but more flexible: LWR adapts to local data structure without assuming global linearity. However, it is prone to overfitting in high dimensions ("curse of dimensionality") and lacks explicit regularization. The bandwidth τ is crucial: small τ fits only nearby points (high variance, low bias); large τ averages many points (low variance, high bias).
LWR Strengths
Flexible, non-parametric
Adapts to local structure
No global assumption needed
LWR Weaknesses
O(m) prediction cost
High-dimensional curse
τ tuning critical
τ
Bandwidth Parameter
O(m)
Test Cost
Non-parametric
Model Type
Gaussian
Kernel Shape
08
Practical Considerations
In practice, feature scaling dramatically improves convergence. If features have vastly different ranges (e.g., age 0–100 vs. income 0–1,000,000), gradient descent becomes inefficient because the loss surface is elongated. Standardizing features to zero mean and unit variance—x̃ = (x − μ) / σ—makes the loss surface more circular, enabling larger, safer learning rates.
Convergence criteria matter: monitor the norm of parameter changes ||θ^(t+1) − θ^(t)|| or the gradient norm ||∇J(θ)||. When these drop below a threshold, training has effectively converged. Regularization (L2 or L1) is often added to prevent overfitting: J(θ) + λ||θ||² or + λ||θ||₁. This introduces a bias-variance tradeoff; larger λ reduces overfitting but increases bias.
Common Pitfalls
Choosing a learning rate that is too large causes divergence; one that is too small causes slow convergence. Forgetting feature scaling can multiply the number of iterations needed. A design matrix whose features are linearly dependent, or nearly so, makes XᵀX singular or ill-conditioned, so (XᵀX)⁻¹ is undefined or numerically unstable. Always check data quality and feature correlations before training.
The foundation of linear regression rests on a training dataset D = {(x¹, y¹), ..., (xᵐ, yᵐ)}, where each xⁱ ∈ ℝⁿ is a feature vector and yⁱ ∈ ℝ is a scalar label. The design matrix X ∈ ℝᵐˣⁿ stacks training examples as rows: X = [x¹ᵀ; x²ᵀ; ...; xᵐᵀ]. We augment each example with a bias term, setting x₀ⁱ = 1, so the effective feature dimension becomes n+1 and θ ∈ ℝⁿ⁺¹.
The hypothesis function h(x) = θᵀx models the relationship between inputs and outputs. Our goal is to find parameters θ that minimize prediction error. We quantify error using the sum-of-squared-errors cost function:
J(θ) = ½ Σᵢ₌₁ᵐ (hθ(xⁱ) − yⁱ)² = ½ ||Xθ − y||²
This cost function is convex, so every local minimum is a global minimum and there are no saddle points; the minimizer is unique when X has full column rank. The factor of ½ is a convention that cancels the 2 produced by differentiation. Squaring the residual hθ(xⁱ) − yⁱ makes the cost symmetric in the sign of the error and penalizes large errors more heavily than small ones.
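As a minimal sketch of the vectorized cost, assuming NumPy and a tiny made-up dataset (the function name `cost` and the data are illustrative, not from the source), J(θ) = ½ ||Xθ − y||² can be computed as:

```python
import numpy as np

def cost(X, y, theta):
    """Sum-of-squared-errors cost J(theta) = 0.5 * ||X @ theta - y||^2."""
    residual = X @ theta - y
    return 0.5 * residual @ residual

# Toy data (made up for illustration): a bias column of ones plus one feature.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])  # generated exactly by y = 1 + 2x

J_exact = cost(X, y, np.array([1.0, 2.0]))  # parameters that fit perfectly
J_zero = cost(X, y, np.zeros(2))            # all-zero parameters
```

With the perfectly fitting parameters every residual is zero, so the cost is zero; with θ = 0 the residuals are just −y, and the cost is half the sum of the squared labels.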
The Design Matrix and Feature Representation
Organizing data into the design matrix X enables efficient vectorized computations. Each row is an example; each column is a feature. The matrix formulation Xθ − y computes all residuals in one operation. This is crucial for modern machine learning, where datasets contain millions of examples.
Feature engineering—constructing meaningful features from raw data—is essential. For example, if predicting house prices from floor area and age, we might include polynomial features (area², area × age) to capture non-linear relationships. Linear regression on engineered features can approximate complex functions.
Convexity and Global Optimality
The Hessian of J(θ) is H = XᵀX, which is positive semi-definite. This guarantees convexity: J is a bowl-shaped function, and any local minimum is the global minimum. The minimizer is unique when X has full column rank, and gradient descent with a sufficiently small learning rate converges to it. This is a fundamental advantage of linear regression over non-convex problems like deep neural networks.
Key Insight: Convexity
The sum-of-squared-errors cost function is convex in θ because the Hessian H = XᵀX is positive semi-definite. This guarantees that any optimization method (gradient descent, normal equations) finds the global optimum, not a local one.
02
The LMS Algorithm
The Least Mean Squares (LMS) algorithm, also called the Widrow-Hoff rule, updates parameters by moving in the negative gradient direction. The gradient of J(θ) = ½ ||Xθ − y||² is:
∇J(θ) = Xᵀ(Xθ − y)
This gradient vector points in the direction of steepest increase in cost. Moving opposite to it—proportional to −∇J(θ)—decreases cost. The update rule is:
θ := θ − α∇J(θ) = θ − αXᵀ(Xθ − y)
The scalar α > 0 is the learning rate, controlling step size. At each iteration, we make a small adjustment toward the optimum. Provided α is small enough (α < 2/λ₁, where λ₁ is the largest eigenvalue of XᵀX), the sequence of costs J(θ⁽ᵗ⁾) decreases monotonically and converges to J(θ*), where θ* is the optimal solution.
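The update rule above can be sketched in a few lines of NumPy. The function name `batch_gradient_descent`, the toy data, and the values α = 0.1 and 2000 iterations are assumptions chosen for illustration; α is small enough for this particular X, not in general.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha, num_iters):
    """Repeat theta := theta - alpha * X^T (X theta - y), recording the cost."""
    theta = np.zeros(X.shape[1])
    costs = []
    for _ in range(num_iters):
        residual = X @ theta - y
        costs.append(0.5 * residual @ residual)   # J(theta) before the step
        theta = theta - alpha * (X.T @ residual)  # step along the negative gradient
    return theta, costs

# Toy data (made up): bias column plus one feature, labels from y = 1 + 2x.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

theta, costs = batch_gradient_descent(X, y, alpha=0.1, num_iters=2000)
```

On this noise-free data the iterates approach θ = (1, 2), and the recorded costs shrink from one iteration to the next.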
Convergence Analysis
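The convergence rate depends on the conditioning of XᵀX. Define the eigenvalues λ₁ ≥ λ₂ ≥ ... ≥ λₙ ≥ 0 of XᵀX. A safe fixed step size is α = 1/λ₁ (the reciprocal of the largest eigenvalue); when λₙ > 0, the fixed step size with the best worst-case rate is α = 2/(λ₁ + λₙ). The condition number κ = λ₁/λₙ of XᵀX measures how ill-conditioned the problem is. If κ is large, convergence is slow because the loss surface is elongated: the step size is limited by the steepest direction, so progress along the flattest direction is slow.
To ensure convergence, α must satisfy 0 < α < 2/λ₁. Because λ₁ grows with the number of examples and the scale of the features, a value such as α = 0.01 to 0.1 is only a reasonable starting point when features are standardized and the gradient is averaged over the m examples; computing λ₁ exactly can be expensive for large n. Adaptive methods like AdaGrad or Adam rescale the step size per parameter using accumulated gradient statistics, although they still require a base learning rate.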
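The bound 0 < α < 2/λ₁ can be checked numerically. The sketch below, on a made-up 3×2 design matrix, computes the eigenvalues of XᵀX and runs gradient descent just below and just above the bound; the helper `final_cost` and the factors 0.9 and 1.1 are illustrative assumptions.

```python
import numpy as np

# Toy data (made up): bias column plus one feature, labels from y = 1 + 2x.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eigvals = np.linalg.eigvalsh(X.T @ X)
lam_min, lam_max = eigvals[0], eigvals[-1]

alpha_max = 2.0 / lam_max              # convergence requires alpha < alpha_max
alpha_opt = 2.0 / (lam_max + lam_min)  # best worst-case fixed step size
kappa = lam_max / lam_min              # condition number of X^T X

def final_cost(alpha, num_iters=200):
    """Run gradient descent from theta = 0 and return the final cost."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y))
    residual = X @ theta - y
    return 0.5 * residual @ residual

initial_cost = final_cost(alpha=0.0, num_iters=0)
cost_stable = final_cost(0.9 * alpha_max)    # inside the bound: converges
cost_unstable = final_cost(1.1 * alpha_max)  # outside the bound: diverges
```

Here the step size just inside the bound drives the cost toward zero, while the one just outside makes the cost grow far above its starting value.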
Learning Rate Selection
Choosing the learning rate α is a key practical decision. If α is too small, training is inefficient; if it is too large, θ oscillates wildly and can diverge from the optimum. A common strategy is inverse-time decay: start with α₀ and decrease it over time as αₜ = α₀ / (1 + t). This allows large early steps and fine-tuning near convergence.
Another approach is line search: for each iteration, find the step size that minimizes cost in that direction. This is more expensive per iteration but may reduce total iterations. Modern practice often uses adaptive methods (Adam, RMSprop) that adjust learning rates per parameter based on gradient history.
Practical Tip: Learning Rate Scheduling
Start with a moderate learning rate (α = 0.01) and monitor training loss. If loss decreases smoothly, increase α slightly. If it oscillates, decrease α. After warmup iterations, use decay schedules to refine the solution near convergence.
03
Batch vs Stochastic Gradient Descent
Batch gradient descent processes the entire training set per iteration. The cost is J(θ) = ½ Σᵢ₌₁ᵐ (hθ(xⁱ) − yⁱ)², and each update is:
θ := θ − α Xᵀ(Xθ − y)
This requires O(nm) time per iteration, which becomes prohibitive when m is in the billions. However, for a suitably small α every iteration decreases the cost (monotonic decrease), and convergence is smooth. The method uses the full dataset before making each update.
Stochastic gradient descent (SGD) updates after processing a single example:
θ := θ − α(hθ(xⁱ) − yⁱ)xⁱ
This costs O(n) per iteration, 1/m the cost of batch GD. In one pass over the data, SGD performs m parameter updates versus 1 for batch GD. The downside: the gradient estimate is noisy (a single example's loss direction may not align with the overall loss direction), causing oscillation and, measured per update, slower convergence.
Mini-Batch Gradient Descent
Mini-batch GD compromises: process k examples (e.g., k = 32 or 256) per iteration. The update is:
θ := θ − (α/k) Σᵢ∈ℬ (hθ(xⁱ) − yⁱ)xⁱ, where ℬ is a randomly drawn batch of k examples.
This offers a sweet spot: lower variance than SGD (k samples average noise), faster than batch GD (can iterate many times with mini-batches before seeing all data). Modern training almost always uses mini-batch GD with typical batch sizes 32–512.
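A minimal mini-batch sketch follows, assuming NumPy and synthetic data. The function name `minibatch_sgd`, the seeds, and the settings α = 0.1, k = 32, and 200 epochs are illustrative choices, not values from the source. Setting `batch_size=1` gives plain SGD, and `batch_size=m` gives batch GD with an averaged gradient.

```python
import numpy as np

def minibatch_sgd(X, y, alpha, batch_size, num_epochs, seed=0):
    """Mini-batch gradient descent with the gradient averaged over each batch."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_epochs):
        order = rng.permutation(m)  # reshuffle the examples every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ theta - y[idx]
            theta -= alpha * (X[idx].T @ residual) / len(idx)
    return theta

# Synthetic data (made up): y = 1 + 2x plus a little Gaussian noise.
rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, size=200)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.01 * rng.standard_normal(200)

theta = minibatch_sgd(X, y, alpha=0.1, batch_size=32, num_epochs=200)
```

The last batch of each epoch may be smaller than `batch_size`, which is why the gradient is divided by `len(idx)` rather than by a fixed k.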
Epoch and Learning Rate Schedules
An epoch is one pass through the entire dataset. For mini-batch GD with batch size k, one epoch consists of about m/k iterations. After each epoch, typical practice is to decay the learning rate: αₜ = α₀ · r^(epoch_number) with decay factor r ≈ 0.99. This allows larger steps early and finer refinement later.
Practical considerations: shuffle data randomly before each epoch to avoid biased mini-batches. Fix the random seed for reproducibility. Monitor validation loss to detect overfitting and stop early if it increases while training loss decreases.
When to Use Which
Batch GD: small datasets (m < 10k), need smooth convergence, memory available. SGD: huge datasets, online learning, noisy gradients acceptable. Mini-batch: most practical choice; size depends on GPU memory and convergence preferences.
04
Normal Equations
Rather than iterating, we can solve the optimization problem analytically. Taking the gradient ∇J(θ) = Xᵀ(Xθ − y) and setting it to zero:
Xᵀ(Xθ − y) = 0
XᵀXθ = Xᵀy
If the matrix XᵀX is invertible (full rank), we can solve for θ directly:
θ = (XᵀX)⁻¹Xᵀy
This is the closed-form solution, requiring only one matrix inversion. It avoids all hyperparameter tuning (no learning rate α) and convergence criteria. For moderate-sized datasets, this is often faster and simpler than gradient descent.
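As a hedged sketch on made-up data, the normal equations can be solved with `np.linalg.solve`, which solves the linear system XᵀXθ = Xᵀy directly instead of forming the inverse explicitly:

```python
import numpy as np

# Toy data (made up): bias column plus one feature, labels from y = 1 + 2x.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

# Solve (X^T X) theta = X^T y; this assumes X^T X is invertible.
theta = np.linalg.solve(X.T @ X, X.T @ y)

# At the solution the gradient X^T (X theta - y) should vanish.
gradient = X.T @ (X @ theta - y)
```

Solving the system is mathematically equivalent to θ = (XᵀX)⁻¹Xᵀy when XᵀX is invertible, and it is the usual way to evaluate that formula numerically.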
Computational Complexity
The bottleneck is forming XᵀX and computing (XᵀX)⁻¹. The matrix multiplication XᵀX costs O(n²m), and inversion costs O(n³) using direct methods (Gaussian elimination, LU decomposition). For n = 100, n³ is about 10⁶ operations, which is fast. For n = 10,000, n³ is about 10¹² operations, which is far more expensive and also requires storing an n × n matrix.
In practice, we use numerically stable algorithms like QR or SVD decomposition rather than explicit inversion. SVD(X) = UΣVᵀ gives θ = VΣ⁻¹Uᵀy. SVD also reveals the rank of X and condition number κ(X) = σ₁/σₙ (ratio of largest to smallest singular value). High κ indicates numerical instability.
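The SVD route can be sketched as follows, assuming X has full column rank so that every singular value is nonzero. The toy data is made up; `np.linalg.lstsq` is shown alongside as NumPy's built-in least-squares solver.

```python
import numpy as np

# Toy data (made up): bias column plus one feature, labels from y = 1 + 2x.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

# Thin SVD: X = U @ diag(s) @ Vt, with singular values s in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# theta = V diag(1/s) U^T y (valid only when all singular values are nonzero).
theta_svd = Vt.T @ ((U.T @ y) / s)

# NumPy's least-squares solver for comparison.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Condition number of X: largest over smallest singular value.
kappa = s[0] / s[-1]
```

Both routes return the same θ here. For a rank-deficient X, dividing by `s` would fail on the zero singular values, whereas `lstsq` returns the minimum-norm solution.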
Non-Invertibility and Regularization
If X has rank < n (features are linearly dependent), XᵀX is singular and non-invertible. This occurs when n > m (more features than examples) or when features are perfectly correlated. In such cases, the solution is not unique; infinitely many θ achieve the minimum cost.
To handle this, we add regularization: minimize J(θ) + (λ/2)||θ||² instead, where the factor ½ matches the ½ in J. The modified normal equation becomes (XᵀX + λI)θ = Xᵀy. The matrix XᵀX + λI is positive definite, and therefore invertible, for any λ > 0, stabilizing the solution. Larger λ pushes θ toward zero, reducing model complexity. This is called Tikhonov regularization or L2 regularization.
Ridge Regression (L2 Regularization)
θ = (XᵀX + λI)⁻¹Xᵀy. The term λI ensures invertibility and shrinks weights toward zero. λ is chosen via cross-validation. This trades bias (regularization) for variance reduction, improving generalization.
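To illustrate, the sketch below builds a made-up design matrix whose second column is exactly twice the first, so XᵀX is singular, and then solves the regularized system (XᵀX + λI)θ = Xᵀy. The helper name `ridge_fit` and the λ values are assumptions for illustration. For simplicity every coefficient is penalized; in practice the bias term is often left unpenalized.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Made-up data with linearly dependent features: column 2 = 2 * column 1.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

rank = np.linalg.matrix_rank(X)  # rank 1, so X^T X is singular

theta_small = ridge_fit(X, y, lam=0.1)   # light regularization
theta_large = ridge_fit(X, y, lam=10.0)  # heavy regularization shrinks theta
```

Even though the unregularized problem has infinitely many minimizers here, each λ > 0 picks out a single finite θ, and the larger λ gives the smaller-norm solution.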
05
Matrix Derivatives
Deriving the normal equations requires matrix calculus. The cost function is:
J(θ) = ½ ||Xθ − y||² = ½ (Xθ − y)ᵀ(Xθ − y)
Expanding the inner product:
J(θ) = ½ (θᵀXᵀXθ − 2θᵀXᵀy + yᵀy)
The last term is constant w.r.t. θ, so it vanishes during differentiation. Differentiating the remaining terms:
∇J(θ) = ½ (2XᵀXθ − 2Xᵀy) = XᵀXθ − Xᵀy
This uses the matrix derivative rule: ∂/∂θ θᵀAθ = (A + Aᵀ)θ. When A is symmetric (as XᵀX is), this simplifies to 2Aθ.
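A quick way to sanity-check the derived gradient is to compare it against central finite differences. The sketch below does this on random made-up data; the helper names and the step `eps = 1e-6` are illustrative assumptions.

```python
import numpy as np

def cost(X, y, theta):
    residual = X @ theta - y
    return 0.5 * residual @ residual

def analytic_grad(X, y, theta):
    """Gradient from the derivation: X^T X theta - X^T y."""
    return X.T @ X @ theta - X.T @ y

def numerical_grad(f, theta, eps=1e-6):
    """Central-difference approximation of the gradient of f at theta."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    return grad

# Random made-up problem: 20 examples, 3 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
theta = rng.standard_normal(3)

g_analytic = analytic_grad(X, y, theta)
g_numeric = numerical_grad(lambda t: cost(X, y, t), theta)
```

Because J is quadratic, central differences are exact up to rounding error, so the two gradients should agree closely.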
Key Matrix Identities
The following identities are fundamental to machine learning derivations:
1. tr(A) = Σᵢ Aᵢᵢ (trace: sum of diagonal)
2. ∂/∂A tr(AB) = Bᵀ
3. ∂/∂A tr(AᵀB) = B
4. ∂/∂A tr(AᵀAB) = A(B + Bᵀ), which equals 2AB when B is symmetric (useful for squared norms)
5. ∂/∂A tr(ABAC) = CᵀAᵀBᵀ + BᵀAᵀCᵀ (product rule)
These rules transform matrix expressions into scalar traces, enabling standard calculus. For example, the squared error can be rewritten as ||Xθ − y||² = tr((Xθ − y)ᵀ(Xθ − y)), and these rules apply directly.
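Trace identities are easy to get wrong, so a numerical check can help. The sketch below uses random 3×3 matrices and central differences to check three gradients with respect to A: tr(AB) gives Bᵀ, tr(AᵀAB) gives A(B + Bᵀ), and tr(ABAC) gives CᵀAᵀBᵀ + BᵀAᵀCᵀ. The helper `numerical_grad` is an illustrative assumption.

```python
import numpy as np

def numerical_grad(f, A, eps=1e-6):
    """Central-difference gradient of a scalar function f with respect to matrix A."""
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            G[i, j] = (f(A + E) - f(A - E)) / (2.0 * eps)
    return G

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))  # deliberately not symmetric
C = rng.standard_normal((3, 3))

g_trAB = numerical_grad(lambda M: np.trace(M @ B), A)
g_trAtAB = numerical_grad(lambda M: np.trace(M.T @ M @ B), A)
g_trABAC = numerical_grad(lambda M: np.trace(M @ B @ M @ C), A)

expected_trAB = B.T
expected_trAtAB = A @ (B + B.T)
expected_trABAC = C.T @ A.T @ B.T + B.T @ A.T @ C.T
```

Using a non-symmetric B matters: with a symmetric B, A(B + Bᵀ) and 2AB coincide, and the check could not tell them apart.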
Hessian and Convexity Verification
The second derivative (Hessian) of J(θ) is:
H = ∇²J(θ) = XᵀX
The Hessian is constant—independent of θ—and equals XᵀX. For any matrix X with full column rank, XᵀX is positive definite, meaning all eigenvalues are strictly positive. A positive definite Hessian confirms strict convexity: J is a bowl-shaped function with a unique global minimum.
The eigenvalues of XᵀX are the squared singular values of X. Large eigenvalues correspond to directions of high curvature (steep slopes), while small eigenvalues correspond to directions of low curvature (flat slopes). The condition number κ = λₘₐₓ/λₘᵢₙ measures how "skewed" the optimization landscape is.
Symmetry Simplifies Derivatives
When A is symmetric, ∂/∂θ θᵀAθ = 2Aθ. This is why (XᵀX) often appears in regression—it's symmetric, and its derivative is clean. Non-symmetric matrices require more care: ∂/∂θ θᵀAθ = (A + Aᵀ)θ.
06
Probabilistic Interpretation
Least-squares regression can be justified through a probabilistic model. Assume the output is generated as:
yⁱ = θᵀxⁱ + εⁱ, where εⁱ ~ N(0, σ²) i.i.d.
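That is, the true relationship is linear plus Gaussian noise. Then the conditional distribution of y given x is:
yⁱ | xⁱ; θ ~ N(θᵀxⁱ, σ²)
Because the examples are independent, the likelihood of the dataset is the product of the individual densities:
L(θ) = ∏ᵢ₌₁ᵐ (1/√(2πσ²)) exp(−(yⁱ − θᵀxⁱ)² / (2σ²))
Taking the logarithm gives the log-likelihood:
ℓ(θ) = m log(1/√(2πσ²)) − (1/σ²) · ½ Σᵢ₌₁ᵐ (yⁱ − θᵀxⁱ)² = const − J(θ)/σ²
Maximizing ℓ(θ) is equivalent to minimizing J(θ), the sum-of-squared-errors cost. The constant term and the positive factor 1/σ² do not affect the optimal θ. This is the maximum likelihood estimate (MLE) of θ under Gaussian noise.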
The probabilistic view reveals assumptions: we assume errors are Gaussian (symmetric around the regression line), independent across examples, and have constant variance σ² (homoscedasticity). When these hold, least-squares is statistically optimal.
Uncertainty Quantification
Beyond point estimates of θ, the probabilistic model enables uncertainty quantification. Under the Gaussian noise model, the least-squares estimator θ̂ = (XᵀX)⁻¹Xᵀy has a sampling distribution with mean θ and covariance Cov(θ̂) = σ²(XᵀX)⁻¹. From the covariance, we compute standard errors of parameters: SE(θ̂ⱼ) = σ√([(XᵀX)⁻¹]ⱼⱼ), with σ replaced by an estimate from the residuals when it is unknown.
Confidence intervals follow: θⱼ ± 1.96 · SE(θⱼ) is approximately a 95% confidence interval (under Gaussianity). These intervals are essential in statistics for assessing which parameters are significantly different from zero. If a 95% CI excludes zero, the parameter is "significant" at the 5% level.
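The standard-error and interval formulas can be sketched on synthetic data. Everything below (sample size, true parameters, noise level, seed) is made up for illustration. Since σ is treated as unknown, it is estimated from the residuals with m − n degrees of freedom, where n counts the fitted parameters.

```python
import numpy as np

# Synthetic data (made up): y = 1 + 2x + Gaussian noise with sigma = 0.5.
rng = np.random.default_rng(0)
m = 500
x = rng.uniform(-1.0, 1.0, size=m)
X = np.column_stack([np.ones(m), x])
theta_true = np.array([1.0, 2.0])
y = X @ theta_true + 0.5 * rng.standard_normal(m)

# Least-squares estimate via the normal equations.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Estimate the noise variance from the residuals.
residual = y - X @ theta_hat
sigma2_hat = residual @ residual / (m - X.shape[1])

# Covariance of the estimator, standard errors, and approximate 95% intervals.
cov_theta = sigma2_hat * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov_theta))
ci_low = theta_hat - 1.96 * se
ci_high = theta_hat + 1.96 * se
```

With this many examples the estimated variance should land near the true value of 0.25, and the intervals should be narrow and centered close to the true parameters.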
Assumptions Hidden in Least Squares
Least-squares is optimal when errors are Gaussian, independent, and have equal variance. If errors are skewed, outliers exist, or variance depends on x (heteroscedasticity), other methods (robust regression, weighted least-squares) may be preferable.
07
Locally Weighted Regression
Locally Weighted Regression (LWR) is a non-parametric method that adapts to local data structure without assuming global linearity. To predict y at a query point x, we solve:
min_θ Σᵢ wⁱ(yⁱ − θᵀxⁱ)²
The weight wⁱ depends on distance from xⁱ to the query x. A common choice is the Gaussian kernel:
wⁱ = exp(−||xⁱ − x||² / (2τ²))
Here τ > 0 is the bandwidth. Points close to x (small ||xⁱ − x||) have weight wⁱ ≈ 1; distant points decay exponentially. The parameter θ is fit using only nearby examples, making the model local.
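A minimal LWR sketch is given below, assuming NumPy. It solves the weighted normal equations XᵀWXθ = XᵀWy at a single query point, where W is the diagonal matrix of weights; that closed form is the standard solution of the weighted objective rather than something stated in the text above. The function name `lwr_predict`, the data y = x², and τ = 0.3 are illustrative choices.

```python
import numpy as np

def lwr_predict(X, y, x_query, tau):
    """Locally weighted linear regression prediction at one query point."""
    # Gaussian kernel weights based on squared distance to the query.
    sq_dist = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-sq_dist / (2.0 * tau ** 2))
    # Weighted normal equations: (X^T W X) theta = X^T W y, with W = diag(w).
    XtW = X.T * w
    theta = np.linalg.solve(XtW @ X, XtW @ y)
    return x_query @ theta

# Made-up non-linear data: y = x^2 on a grid, with a bias column in X.
x = np.linspace(-3.0, 3.0, 61)
X = np.column_stack([np.ones_like(x), x])
y = x ** 2

x_query = np.array([1.0, 2.0])  # bias term 1 and feature value x = 2
pred_local = lwr_predict(X, y, x_query, tau=0.3)

# A single global linear fit for comparison.
theta_global, *_ = np.linalg.lstsq(X, y, rcond=None)
pred_global = x_query @ theta_global
```

The true value at x = 2 is 4. The local fit lands close to it, while the global straight line cannot follow the curvature and predicts a value near the mean of y instead.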
Bandwidth Selection
The bandwidth τ controls the "locality" of the regression:
• Small τ: only examples very close to x are weighted significantly. Model fits local structure tightly but has high variance (overfits).
• Large τ: examples far from x still influence the fit. Model is smoother (low variance) but may miss local patterns (high bias).
• Optimal τ: balances bias and variance, found via cross-validation.
Unlike parametric methods that store a single θ, LWR stores the entire training set and fits a new model at each prediction point. This gives flexibility but at computational cost: prediction is O(m), compared to O(n) for linear regression.
Curse of Dimensionality
LWR suffers from the curse of dimensionality: as feature dimension n grows, examples become sparse in the feature space. Most examples are far from the query x, so their weights are tiny, and the effective sample size drops. To maintain locality, τ must increase, which defeats the purpose.
For high-dimensional data, LWR requires very large m to work well. Parametric methods like linear regression scale better because they impose global structure (linearity), trading flexibility for efficiency. LWR is most effective in low dimensions (n ≤ 10) with large training sets (m >> 1000).
Non-Parametric Advantage
LWR doesn't assume global linearity. If the underlying function is non-linear, LWR can approximate it locally without feature engineering. The downside: no single global θ (a new θ is fit per query), the training data must be stored, and prediction costs O(m).
08
Practical Considerations
Successful machine learning requires attention to details beyond algorithm choice. Feature scaling, convergence monitoring, and regularization are crucial in practice.
Feature Scaling and Normalization
When features have vastly different ranges, gradient descent becomes inefficient. Suppose x₁ ∈ [0, 100] (age) and x₂ ∈ [0, 1,000,000] (income). The cost surface is highly elongated: a unit change in the income weight θ₂ changes the cost far more than a unit change in the age weight θ₁. The learning rate must be small enough to avoid overshooting along the steep θ₂ direction, so progress along the flat θ₁ direction is slow and can require thousands of iterations.
Standardization rescales features to zero mean and unit variance:
x̃ⱼ = (xⱼ − μⱼ) / σⱼ
where μⱼ and σⱼ are computed from training data. After standardization, the cost surface is more spherical, allowing larger learning rates and faster convergence. Always standardize before training, and apply the same transformation (using training statistics) to test data.
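Standardization with training statistics can be sketched as follows. The age and income values are made up; the point is that μⱼ and σⱼ come from the training set only and are then reused unchanged on test data.

```python
import numpy as np

# Made-up training data: columns are age (years) and income (currency units).
X_train = np.array([[25.0, 30000.0],
                    [40.0, 90000.0],
                    [55.0, 60000.0]])
X_test = np.array([[40.0, 45000.0]])

# Per-feature statistics from the training set only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population standard deviation (ddof=0)

# Apply the same transformation to both sets.
X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma
```

After the transformation each training column has zero mean and unit variance. The test row is expressed in the same units, so its age of 40 (the training mean) maps to exactly 0.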
Convergence Criteria and Early Stopping
When should you stop training? Several strategies:
1. Fixed iterations: stop after a preset number of epochs (simple but requires tuning).
2. Parameter change: stop when ||θ^(t+1) − θ^(t)|| < ε (e.g., ε = 10⁻⁶). Indicates convergence to a stationary point.
3. Gradient norm: stop when ||∇J(θ)|| < ε. Similar idea, more direct.
4. Validation loss: split data into train/validation. Stop when validation loss increases for k consecutive epochs while training loss decreases. Prevents overfitting.
In practice, use a combination: set a maximum iteration count, monitor validation loss, and stop early if it plateaus or worsens.
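Two of the stopping rules above, a gradient-norm threshold and a maximum iteration count, can be combined in one loop. The sketch below leaves out validation-based early stopping, which would need a held-out split; the function name, the tolerance, and the toy data are illustrative assumptions.

```python
import numpy as np

def gradient_descent(X, y, alpha, tol=1e-8, max_iters=10_000):
    """Gradient descent that stops on a small gradient norm or an iteration cap."""
    theta = np.zeros(X.shape[1])
    for iteration in range(1, max_iters + 1):
        grad = X.T @ (X @ theta - y)
        if np.linalg.norm(grad) < tol:
            return theta, iteration, "converged"
        theta = theta - alpha * grad
    return theta, max_iters, "max_iters reached"

# Toy data (made up): bias column plus one feature, labels from y = 1 + 2x.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

theta, iterations, reason = gradient_descent(X, y, alpha=0.1)
```

For a fixed α, the parameter change ||θ^(t+1) − θ^(t)|| equals α times the gradient norm, so the two convergence tests differ only in the scale of the threshold.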
Regularization to Prevent Overfitting
Overfitting occurs when a model fits training noise rather than underlying patterns. High-dimensional models (many features) are prone to overfitting, especially with small training sets. Regularization adds a penalty on model complexity:
J(θ) + λ Ω(θ)
Common penalties:
• L2 (Ridge): Ω(θ) = ||θ||² = Σⱼ θⱼ². Shrinks all weights toward zero gradually.
• L1 (Lasso): Ω(θ) = ||θ||₁ = Σⱼ |θⱼ|. Drives many weights exactly to zero (sparsity).
The regularization parameter λ ≥ 0 controls the strength. Larger λ simplifies the model; smaller λ fits training data more closely. Use cross-validation to choose λ: train on fold 1-4, validate on fold 5, repeat for all folds, and pick λ minimizing average validation error.
Model Selection and Cross-Validation
The true test of a model is performance on unseen data. To estimate this, use k-fold cross-validation: split data into k disjoint folds, train on k−1 folds, test on the remaining fold, repeat k times, and average test errors. Typical choices: k = 5 or 10.
Use cross-validation to select hyperparameters (learning rate, regularization λ, batch size). For each candidate value, run k-fold CV and pick the value minimizing average validation error. This is more reliable than a single train/test split, especially for small datasets.
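A hand-rolled k-fold loop for choosing the ridge parameter λ might look like the sketch below. The helper names, the synthetic data, the candidate grid, and the seeds are assumptions made for illustration; a library routine could replace the manual fold handling.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def kfold_cv_error(X, y, lam, k=5, seed=0):
    """Average validation mean-squared error over k disjoint folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        theta = ridge_fit(X[train_idx], y[train_idx], lam)
        residual = X[val_idx] @ theta - y[val_idx]
        errors.append(np.mean(residual ** 2))
    return float(np.mean(errors))

# Synthetic data (made up): 100 examples, 5 features, small Gaussian noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.standard_normal(100)

candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_errors = {lam: kfold_cv_error(X, y, lam) for lam in candidates}
best_lam = min(cv_errors, key=cv_errors.get)
```

Using the same seed for every candidate keeps the fold assignment fixed, so the candidates are compared on identical splits. On this strongly linear, low-noise data, heavy regularization (λ = 100) shrinks the weights too much and gives a clearly worse validation error than light regularization.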
Common Mistakes to Avoid
1. Not scaling features before training; 2. Choosing hyperparameters based on test loss (causes overfitting to test set); 3. Using the same learning rate for all parameters (vary with feature scale); 4. Ignoring the condition number of XᵀX (leads to numerical instability); 5. Training until zero training loss (fits noise).
Workflow Summary
1. Load and explore data. 2. Standardize features. 3. Choose initial hyperparameters (α, λ). 4. Train on fold 1-4, validate on fold 5 (repeat). 5. Select hyperparameters minimizing validation error. 6. Train on all data with selected hyperparameters. 7. Evaluate on held-out test set. 8. Deploy and monitor.
Section 09 — Sources
Sources & References
Course Materials
CS229 Main Lecture Notes — Stanford's authoritative course notes with rigorous mathematical treatment and practical algorithms for linear regression and gradient descent