Linear Algebra for ML

SCPD ML Course

Linear Algebra for ML

Vectors & Spaces

Vectors are the fundamental objects in linear algebra. While often visualized as arrows in 2D or 3D space, vectors are more abstractly elements of a vector space—any collection of objects where you can add elements together and scale them by numbers (scalars). In machine learning, vectors represent data: each feature is a dimension, and each data point is a vector in high-dimensional space.

A vector space requires closure under addition and scalar multiplication, contains a zero element, and satisfies associativity and distributivity. The span of vectors is the set of all linear combinations you can create from them. A basis is a minimal set of linearly independent vectors that span the entire space. Understanding these concepts is crucial: PCA finds new basis vectors that capture maximum variance; neural networks perform calculations in learned vector spaces; attention mechanisms compute similarities between vectors.

Dimensions

d·d

Dot Product

Basis

Coordinate System

Linear

Independence

Matrices & Transformations

A matrix is a rectangular array of numbers that represents a linear transformation. Instead of thinking of matrices as just static tables, understand them as functions: multiplying a vector by a matrix transforms that vector according to specific rules. Matrix multiplication composes transformations—if you apply transformation A then transformation B, you're computing BA (note the order reversal).

Every matrix has four fundamental subspaces: the column space (range) is where the transformation maps to; the null space (kernel) contains vectors that map to zero; the row space and left null space complete the picture. The rank is the dimension of the column space, directly measuring how much information the transformation preserves. This matters in ML because: neural networks compose matrices to create deep transformations; regression solves least-squares through matrix operations; dimensionality reduction eliminates null space directions.

Column Space

Range of the matrix; tells you what vectors the transformation can reach.

Null Space

Vectors mapping to zero; represents "missing" dimensions in the output.

Rank

Dimension of column space; how much information survives transformation.

Invertibility

Full rank matrices are invertible; uniquely solve systems Ax=b.

Systems of Equations

Solving linear systems Ax = b is one of the most common computational tasks in ML. Gaussian elimination (with row reduction) systematically finds solutions through row echelon form and reduced row echelon form (RREF). Understanding whether systems have unique, infinite, or no solutions depends on rank: if rank(A) = rank([A|b]), solutions exist; if rank(A) = n, the solution is unique.

Most real ML problems are overdetermined—more equations than unknowns—because you have more data points than parameters. When no exact solution exists, least squares finds the best approximate solution by minimizing ||Ax - b||². This appears everywhere: linear regression, neural network training, computer vision. Underdetermined systems (fewer equations than unknowns) have infinitely many solutions; regularization (L1, L2) picks the "simplest" one.

Full Rank

Unique Solution

Rank < n

Infinite Solutions

Rank < m

Overdetermined

Least Sq

Best Fit

Eigenvalues & Eigenvectors

An eigenvector of a matrix A is a special vector that doesn't change direction when multiplied by A—it only gets scaled by a number called the eigenvalue: Av = λv. Geometrically, eigenvectors are the "natural directions" of a transformation. If you align your coordinate system with the eigenvectors, the matrix becomes diagonal, which is far simpler to analyze.

Eigenvalues reveal fundamental properties. For a covariance matrix, eigenvalues are the variance captured along each eigenvector direction—this is PCA's core insight. For a Markov chain's transition matrix, the eigenvector with eigenvalue 1 is the stationary distribution. For graph Laplacians in spectral clustering, small eigenvalues reveal cluster structure. Deep learning optimizers exploit eigenvalue structure of the Hessian (second derivatives). Understanding eigendecomposition is essential for understanding what models actually learn.

Characteristic Polynomial

det(A - λI) = 0 finds eigenvalues; roots of this polynomial.

Diagonalization

A = PDP⁻¹ where D diagonal; converts matrix to simplest form.

Spectral Theorem

Symmetric matrices have real eigenvalues and orthogonal eigenvectors.

ML Applications

PCA, spectral clustering, power method, PageRank, neural network Hessians.

Singular Value Decomposition

The Singular Value Decomposition (SVD) is arguably the most important matrix decomposition for data science. Every matrix A can be written as A = UΣV^T where U and V are orthogonal (columns are orthonormal) and Σ is diagonal with non-negative entries called singular values. Unlike eigendecomposition (which only works for square matrices), SVD works for any shape matrix.

SVD reveals the structure of any dataset: the singular values (in decreasing order) show how much variance exists along each direction; U columns are the "left singular vectors" (principal directions in the data); V columns are the "right singular vectors" (principal directions in the features). Truncating small singular values gives low-rank approximation—the mathematical foundation of PCA. This compression appears in: image compression, recommendation systems (Netflix problem), noise reduction, and approximate computation of expensive operations. The rank of a matrix equals its number of non-zero singular values.

Left Singular Vectors

Singular Values (σᵢ)

V^T

Right Singular Vectors

rank

Non-zero singular values

Matrix Calculus

Machine learning optimization requires computing gradients of scalar loss functions with respect to matrix parameters. Matrix calculus extends single-variable calculus to matrices and vectors. Key identities you'll see constantly: ∇(x^T A x) = (A + A^T)x (or 2Ax if A is symmetric); ∇(||Ax - b||²) = 2A^T(Ax - b); chain rule for matrices: dL/dX = (dL/dY)(dY/dX) where the multiplication is careful matrix multiplication.

Every ML algorithm that optimizes parameters uses these derivatives. Backpropagation in neural networks is repeated application of matrix calculus chain rules. Gradient descent takes steps in the direction of -∇L(θ). Understanding these derivatives helps you reason about optimization behavior: the Hessian (matrix of second derivatives) determines curvature and convergence speed. Ridge regression adds λI to the cost to improve conditioning. Batch normalization uses statistics of matrix operations to stabilize learning.

∇(x^TAx)

Quadratic Forms

∇(||Ax-b||²)

Least Squares

Chain Rule

Backpropagation

Hessian

Second Derivatives

Norms & Inner Products

A norm is a function that measures vector "size" or "length." The L2 norm (Euclidean distance) ||x||₂ = √(∑xᵢ²) is most familiar from geometry. The L1 norm ||x||₁ = ∑|xᵢ| measures Manhattan distance and appears in LASSO regression (encourages sparsity). The L∞ norm ||x||∞ = max|xᵢ| is the maximum absolute element. Different norms emphasize different aspects: L2 penalizes large individual values; L1 penalizes non-zero entries more equally, encouraging many features to be exactly zero.

Inner products generalize the dot product. The standard Euclidean inner product ⟨x,y⟩ = x^T y measures similarity; orthogonal vectors have inner product zero. More generally, any symmetric positive-definite matrix K defines an inner product ⟨x,y⟩_K = x^T K y. This connects to kernel methods: kernel functions implicitly compute inner products in high-dimensional spaces without explicit coordinates. Gram-Schmidt orthogonalization converts any basis into an orthonormal basis by repeatedly subtracting projections. This appears in QR decomposition and stabilizing numerical computations.

L1 Norm

∑|xᵢ|; sparsity-inducing, robust to outliers, Manhattan distance.

L2 Norm

√(∑xᵢ²); Euclidean distance, geometrically intuitive, smooth.

Inner Products

Measure similarity; define orthogonality; generalized via kernels.

Orthonormality

Orthonormal bases simplify calculations; appear in SVD, QR, eigendecomposition.

Key Decompositions for ML

Beyond SVD and eigendecomposition, several other matrix decompositions appear frequently in ML algorithms. The QR decomposition factors any matrix as A = QR where Q is orthogonal and R is upper triangular. This is numerically stable for solving least squares and appears in the QR algorithm for computing eigenvalues. The LU decomposition factors square matrices as A = LU (lower and upper triangular); it's efficient for solving multiple systems with the same coefficient matrix. The Cholesky decomposition A = LL^T exists only for symmetric positive-definite matrices and is very efficient, appearing in sampling Gaussian distributions and optimizing quadratic functions.

Positive definite matrices are special: all eigenvalues are positive, they have a unique Cholesky decomposition, and x^T A x > 0 for all nonzero x. Positive definite matrices appear everywhere in ML: covariance matrices are positive semi-definite (eigenvalues ≥ 0); the Hessian of a convex function is positive semi-definite at all points; regularized loss functions add λI (positive definite) to improve conditioning. The condition number (ratio of largest to smallest singular value) measures numerical stability; ill-conditioned problems require careful algorithms. These decompositions connect the entire field: SVD is the most general; eigendecomposition handles square matrices; Cholesky and QR solve specific problems efficiently.

Efficiency

Cholesky: O(n³/3) for positive definite matrices
QR: Numerically stable for least squares
LU: Efficient for multiple right-hand sides
SVD: Universal but most expensive

Limitations

Cholesky requires positive definite matrices
LU may need pivoting for stability
QR less intuitive than eigendecomposition
SVD expensive for very large matrices

Sources & References

References and sources for further study on the topics covered in this deep dive.

Vectors & Spaces

Matrices & Transformations

Column Space

Null Space

Rank

Invertibility

Systems of Equations

Eigenvalues & Eigenvectors

Characteristic Polynomial

Diagonalization

Spectral Theorem

ML Applications

Singular Value Decomposition

Matrix Calculus

Norms & Inner Products

L1 Norm

L2 Norm

Inner Products

Orthonormality

Key Decompositions for ML

Efficiency

Limitations

Sources & References

Vectors & Spaces

Vector Space Examples in ML

Linear Independence and Basis

Dot Product and Projections

Worked Example: 3D Vector Space

Matrices & Transformations

Matrix Multiplication as Composition

Rank and Full Rank Conditions

Four Fundamental Subspaces

Worked Example: Rank and Column Space

Systems of Equations

Overdetermined vs Underdetermined Systems

Gaussian Elimination with Worked Example

Least Squares and Normal Equations

Eigenvalues & Eigenvectors

Diagonalization

Spectral Theorem and Symmetric Matrices

Worked Example: 2×2 Matrix

ML Applications: PCA

Singular Value Decomposition

Interpretation and Low-Rank Approximation

Connection to PCA and Eigendecomposition

Worked Example: 3×2 Matrix

Practical Applications

Matrix Calculus

Chain Rule for Matrices

Common Derivatives in ML

Worked Example: Least Squares Gradient

Vector-by-Matrix Notation

Norms & Inner Products

Inner Products and Orthogonality

Kernels and Generalized Inner Products

Frobenius and Spectral Norms for Matrices

Worked Example: Gram-Schmidt on 3D Vectors

Key Decompositions for ML

LU Decomposition

Cholesky Decomposition

Positive Definite Matrices

Condition Number and Numerical Stability

Worked Example: Ridge Regression via Cholesky

Decomposition Comparison

Sources & References

Linear Algebra Textbooks

Online Resources

Machine Learning Applications