SCPD ML Course
Linear Algebra for ML
Foundations for Machine Learning · Eight Sections
01

Vectors & Spaces

Vectors are the fundamental objects in linear algebra. While often visualized as arrows in 2D or 3D space, vectors are more abstractly elements of a vector space—any collection of objects where you can add elements together and scale them by numbers (scalars). In machine learning, vectors represent data: each feature is a dimension, and each data point is a vector in high-dimensional space.

A vector space requires closure under addition and scalar multiplication, contains a zero element, and satisfies associativity and distributivity. The span of vectors is the set of all linear combinations you can create from them. A basis is a minimal set of linearly independent vectors that span the entire space. Understanding these concepts is crucial: PCA finds new basis vectors that capture maximum variance; neural networks perform calculations in learned vector spaces; attention mechanisms compute similarities between vectors.

n
Dimensions
d·d
Dot Product
Basis
Coordinate System
Linear
Independence
02

Matrices & Transformations

A matrix is a rectangular array of numbers that represents a linear transformation. Instead of thinking of matrices as just static tables, understand them as functions: multiplying a vector by a matrix transforms that vector according to specific rules. Matrix multiplication composes transformations—if you apply transformation A then transformation B, you're computing BA (note the order reversal).

Every matrix has four fundamental subspaces: the column space (range) is where the transformation maps to; the null space (kernel) contains vectors that map to zero; the row space and left null space complete the picture. The rank is the dimension of the column space, directly measuring how much information the transformation preserves. This matters in ML because: neural networks compose matrices to create deep transformations; regression solves least-squares through matrix operations; dimensionality reduction eliminates null space directions.

Column Space

Range of the matrix; tells you what vectors the transformation can reach.

Null Space

Vectors mapping to zero; represents "missing" dimensions in the output.

Rank

Dimension of column space; how much information survives transformation.

Invertibility

Full rank matrices are invertible; uniquely solve systems Ax=b.

03

Systems of Equations

Solving linear systems Ax = b is one of the most common computational tasks in ML. Gaussian elimination (with row reduction) systematically finds solutions through row echelon form and reduced row echelon form (RREF). Understanding whether systems have unique, infinite, or no solutions depends on rank: if rank(A) = rank([A|b]), solutions exist; if rank(A) = n, the solution is unique.

Most real ML problems are overdetermined—more equations than unknowns—because you have more data points than parameters. When no exact solution exists, least squares finds the best approximate solution by minimizing ||Ax - b||². This appears everywhere: linear regression, neural network training, computer vision. Underdetermined systems (fewer equations than unknowns) have infinitely many solutions; regularization (L1, L2) picks the "simplest" one.

Full Rank
Unique Solution
Rank < n
Infinite Solutions
Rank < m
Overdetermined
Least Sq
Best Fit
04

Eigenvalues & Eigenvectors

An eigenvector of a matrix A is a special vector that doesn't change direction when multiplied by A—it only gets scaled by a number called the eigenvalue: Av = λv. Geometrically, eigenvectors are the "natural directions" of a transformation. If you align your coordinate system with the eigenvectors, the matrix becomes diagonal, which is far simpler to analyze.

Eigenvalues reveal fundamental properties. For a covariance matrix, eigenvalues are the variance captured along each eigenvector direction—this is PCA's core insight. For a Markov chain's transition matrix, the eigenvector with eigenvalue 1 is the stationary distribution. For graph Laplacians in spectral clustering, small eigenvalues reveal cluster structure. Deep learning optimizers exploit eigenvalue structure of the Hessian (second derivatives). Understanding eigendecomposition is essential for understanding what models actually learn.

Characteristic Polynomial

det(A - λI) = 0 finds eigenvalues; roots of this polynomial.

Diagonalization

A = PDP⁻¹ where D diagonal; converts matrix to simplest form.

Spectral Theorem

Symmetric matrices have real eigenvalues and orthogonal eigenvectors.

ML Applications

PCA, spectral clustering, power method, PageRank, neural network Hessians.

05

Singular Value Decomposition

The Singular Value Decomposition (SVD) is arguably the most important matrix decomposition for data science. Every matrix A can be written as A = UΣV^T where U and V are orthogonal (columns are orthonormal) and Σ is diagonal with non-negative entries called singular values. Unlike eigendecomposition (which only works for square matrices), SVD works for any shape matrix.

SVD reveals the structure of any dataset: the singular values (in decreasing order) show how much variance exists along each direction; U columns are the "left singular vectors" (principal directions in the data); V columns are the "right singular vectors" (principal directions in the features). Truncating small singular values gives low-rank approximation—the mathematical foundation of PCA. This compression appears in: image compression, recommendation systems (Netflix problem), noise reduction, and approximate computation of expensive operations. The rank of a matrix equals its number of non-zero singular values.

U
Left Singular Vectors
Σ
Singular Values (σᵢ)
V^T
Right Singular Vectors
rank
Non-zero singular values
06

Matrix Calculus

Machine learning optimization requires computing gradients of scalar loss functions with respect to matrix parameters. Matrix calculus extends single-variable calculus to matrices and vectors. Key identities you'll see constantly: ∇(x^T A x) = (A + A^T)x (or 2Ax if A is symmetric); ∇(||Ax - b||²) = 2A^T(Ax - b); chain rule for matrices: dL/dX = (dL/dY)(dY/dX) where the multiplication is careful matrix multiplication.

Every ML algorithm that optimizes parameters uses these derivatives. Backpropagation in neural networks is repeated application of matrix calculus chain rules. Gradient descent takes steps in the direction of -∇L(θ). Understanding these derivatives helps you reason about optimization behavior: the Hessian (matrix of second derivatives) determines curvature and convergence speed. Ridge regression adds λI to the cost to improve conditioning. Batch normalization uses statistics of matrix operations to stabilize learning.

∇(x^TAx)
Quadratic Forms
∇(||Ax-b||²)
Least Squares
Chain Rule
Backpropagation
Hessian
Second Derivatives
07

Norms & Inner Products

A norm is a function that measures vector "size" or "length." The L2 norm (Euclidean distance) ||x||₂ = √(∑xᵢ²) is most familiar from geometry. The L1 norm ||x||₁ = ∑|xᵢ| measures Manhattan distance and appears in LASSO regression (encourages sparsity). The L∞ norm ||x||∞ = max|xᵢ| is the maximum absolute element. Different norms emphasize different aspects: L2 penalizes large individual values; L1 penalizes non-zero entries more equally, encouraging many features to be exactly zero.

Inner products generalize the dot product. The standard Euclidean inner product ⟨x,y⟩ = x^T y measures similarity; orthogonal vectors have inner product zero. More generally, any symmetric positive-definite matrix K defines an inner product ⟨x,y⟩_K = x^T K y. This connects to kernel methods: kernel functions implicitly compute inner products in high-dimensional spaces without explicit coordinates. Gram-Schmidt orthogonalization converts any basis into an orthonormal basis by repeatedly subtracting projections. This appears in QR decomposition and stabilizing numerical computations.

L1 Norm

∑|xᵢ|; sparsity-inducing, robust to outliers, Manhattan distance.

L2 Norm

√(∑xᵢ²); Euclidean distance, geometrically intuitive, smooth.

Inner Products

Measure similarity; define orthogonality; generalized via kernels.

Orthonormality

Orthonormal bases simplify calculations; appear in SVD, QR, eigendecomposition.

08

Key Decompositions for ML

Beyond SVD and eigendecomposition, several other matrix decompositions appear frequently in ML algorithms. The QR decomposition factors any matrix as A = QR where Q is orthogonal and R is upper triangular. This is numerically stable for solving least squares and appears in the QR algorithm for computing eigenvalues. The LU decomposition factors square matrices as A = LU (lower and upper triangular); it's efficient for solving multiple systems with the same coefficient matrix. The Cholesky decomposition A = LL^T exists only for symmetric positive-definite matrices and is very efficient, appearing in sampling Gaussian distributions and optimizing quadratic functions.

Positive definite matrices are special: all eigenvalues are positive, they have a unique Cholesky decomposition, and x^T A x > 0 for all nonzero x. Positive definite matrices appear everywhere in ML: covariance matrices are positive semi-definite (eigenvalues ≥ 0); the Hessian of a convex function is positive semi-definite at all points; regularized loss functions add λI (positive definite) to improve conditioning. The condition number (ratio of largest to smallest singular value) measures numerical stability; ill-conditioned problems require careful algorithms. These decompositions connect the entire field: SVD is the most general; eigendecomposition handles square matrices; Cholesky and QR solve specific problems efficiently.

Efficiency

  • Cholesky: O(n³/3) for positive definite matrices
  • QR: Numerically stable for least squares
  • LU: Efficient for multiple right-hand sides
  • SVD: Universal but most expensive

Limitations

  • Cholesky requires positive definite matrices
  • LU may need pivoting for stability
  • QR less intuitive than eigendecomposition
  • SVD expensive for very large matrices
09

Sources & References

References and sources for further study on the topics covered in this deep dive.