Vectors are the fundamental objects in linear algebra. While often visualized as arrows in 2D or 3D space, vectors are more abstractly elements of a vector space—any collection of objects where you can add elements together and scale them by numbers (scalars). In machine learning, vectors represent data: each feature is a dimension, and each data point is a vector in high-dimensional space.
A vector space requires closure under addition and scalar multiplication, contains a zero element, and satisfies associativity and distributivity. The span of vectors is the set of all linear combinations you can create from them. A basis is a minimal set of linearly independent vectors that span the entire space. Understanding these concepts is crucial: PCA finds new basis vectors that capture maximum variance; neural networks perform calculations in learned vector spaces; attention mechanisms compute similarities between vectors.
n
Dimensions
d·d
Dot Product
Basis
Coordinate System
Linear
Independence
02
Matrices & Transformations
A matrix is a rectangular array of numbers that represents a linear transformation. Instead of thinking of matrices as just static tables, understand them as functions: multiplying a vector by a matrix transforms that vector according to specific rules. Matrix multiplication composes transformations—if you apply transformation A then transformation B, you're computing BA (note the order reversal).
Every matrix has four fundamental subspaces: the column space (range) is where the transformation maps to; the null space (kernel) contains vectors that map to zero; the row space and left null space complete the picture. The rank is the dimension of the column space, directly measuring how much information the transformation preserves. This matters in ML because: neural networks compose matrices to create deep transformations; regression solves least-squares through matrix operations; dimensionality reduction eliminates null space directions.
Column Space
Range of the matrix; tells you what vectors the transformation can reach.
Null Space
Vectors mapping to zero; represents "missing" dimensions in the output.
Rank
Dimension of column space; how much information survives transformation.
Invertibility
Full rank matrices are invertible; uniquely solve systems Ax=b.
03
Systems of Equations
Solving linear systems Ax = b is one of the most common computational tasks in ML. Gaussian elimination (with row reduction) systematically finds solutions through row echelon form and reduced row echelon form (RREF). Understanding whether systems have unique, infinite, or no solutions depends on rank: if rank(A) = rank([A|b]), solutions exist; if rank(A) = n, the solution is unique.
Most real ML problems are overdetermined—more equations than unknowns—because you have more data points than parameters. When no exact solution exists, least squares finds the best approximate solution by minimizing ||Ax - b||². This appears everywhere: linear regression, neural network training, computer vision. Underdetermined systems (fewer equations than unknowns) have infinitely many solutions; regularization (L1, L2) picks the "simplest" one.
Full Rank
Unique Solution
Rank < n
Infinite Solutions
Rank < m
Overdetermined
Least Sq
Best Fit
04
Eigenvalues & Eigenvectors
An eigenvector of a matrix A is a special vector that doesn't change direction when multiplied by A—it only gets scaled by a number called the eigenvalue: Av = λv. Geometrically, eigenvectors are the "natural directions" of a transformation. If you align your coordinate system with the eigenvectors, the matrix becomes diagonal, which is far simpler to analyze.
Eigenvalues reveal fundamental properties. For a covariance matrix, eigenvalues are the variance captured along each eigenvector direction—this is PCA's core insight. For a Markov chain's transition matrix, the eigenvector with eigenvalue 1 is the stationary distribution. For graph Laplacians in spectral clustering, small eigenvalues reveal cluster structure. Deep learning optimizers exploit eigenvalue structure of the Hessian (second derivatives). Understanding eigendecomposition is essential for understanding what models actually learn.
Characteristic Polynomial
det(A - λI) = 0 finds eigenvalues; roots of this polynomial.
Diagonalization
A = PDP⁻¹ where D diagonal; converts matrix to simplest form.
Spectral Theorem
Symmetric matrices have real eigenvalues and orthogonal eigenvectors.
ML Applications
PCA, spectral clustering, power method, PageRank, neural network Hessians.
05
Singular Value Decomposition
The Singular Value Decomposition (SVD) is arguably the most important matrix decomposition for data science. Every matrix A can be written as A = UΣV^T where U and V are orthogonal (columns are orthonormal) and Σ is diagonal with non-negative entries called singular values. Unlike eigendecomposition (which only works for square matrices), SVD works for any shape matrix.
SVD reveals the structure of any dataset: the singular values (in decreasing order) show how much variance exists along each direction; U columns are the "left singular vectors" (principal directions in the data); V columns are the "right singular vectors" (principal directions in the features). Truncating small singular values gives low-rank approximation—the mathematical foundation of PCA. This compression appears in: image compression, recommendation systems (Netflix problem), noise reduction, and approximate computation of expensive operations. The rank of a matrix equals its number of non-zero singular values.
U
Left Singular Vectors
Σ
Singular Values (σᵢ)
V^T
Right Singular Vectors
rank
Non-zero singular values
06
Matrix Calculus
Machine learning optimization requires computing gradients of scalar loss functions with respect to matrix parameters. Matrix calculus extends single-variable calculus to matrices and vectors. Key identities you'll see constantly: ∇(x^T A x) = (A + A^T)x (or 2Ax if A is symmetric); ∇(||Ax - b||²) = 2A^T(Ax - b); chain rule for matrices: dL/dX = (dL/dY)(dY/dX) where the multiplication is careful matrix multiplication.
Every ML algorithm that optimizes parameters uses these derivatives. Backpropagation in neural networks is repeated application of matrix calculus chain rules. Gradient descent takes steps in the direction of -∇L(θ). Understanding these derivatives helps you reason about optimization behavior: the Hessian (matrix of second derivatives) determines curvature and convergence speed. Ridge regression adds λI to the cost to improve conditioning. Batch normalization uses statistics of matrix operations to stabilize learning.
∇(x^TAx)
Quadratic Forms
∇(||Ax-b||²)
Least Squares
Chain Rule
Backpropagation
Hessian
Second Derivatives
07
Norms & Inner Products
A norm is a function that measures vector "size" or "length." The L2 norm (Euclidean distance) ||x||₂ = √(∑xᵢ²) is most familiar from geometry. The L1 norm ||x||₁ = ∑|xᵢ| measures Manhattan distance and appears in LASSO regression (encourages sparsity). The L∞ norm ||x||∞ = max|xᵢ| is the maximum absolute element. Different norms emphasize different aspects: L2 penalizes large individual values; L1 penalizes non-zero entries more equally, encouraging many features to be exactly zero.
Inner products generalize the dot product. The standard Euclidean inner product ⟨x,y⟩ = x^T y measures similarity; orthogonal vectors have inner product zero. More generally, any symmetric positive-definite matrix K defines an inner product ⟨x,y⟩_K = x^T K y. This connects to kernel methods: kernel functions implicitly compute inner products in high-dimensional spaces without explicit coordinates. Gram-Schmidt orthogonalization converts any basis into an orthonormal basis by repeatedly subtracting projections. This appears in QR decomposition and stabilizing numerical computations.
L1 Norm
∑|xᵢ|; sparsity-inducing, robust to outliers, Manhattan distance.
Measure similarity; define orthogonality; generalized via kernels.
Orthonormality
Orthonormal bases simplify calculations; appear in SVD, QR, eigendecomposition.
08
Key Decompositions for ML
Beyond SVD and eigendecomposition, several other matrix decompositions appear frequently in ML algorithms. The QR decomposition factors any matrix as A = QR where Q is orthogonal and R is upper triangular. This is numerically stable for solving least squares and appears in the QR algorithm for computing eigenvalues. The LU decomposition factors square matrices as A = LU (lower and upper triangular); it's efficient for solving multiple systems with the same coefficient matrix. The Cholesky decomposition A = LL^T exists only for symmetric positive-definite matrices and is very efficient, appearing in sampling Gaussian distributions and optimizing quadratic functions.
Positive definite matrices are special: all eigenvalues are positive, they have a unique Cholesky decomposition, and x^T A x > 0 for all nonzero x. Positive definite matrices appear everywhere in ML: covariance matrices are positive semi-definite (eigenvalues ≥ 0); the Hessian of a convex function is positive semi-definite at all points; regularized loss functions add λI (positive definite) to improve conditioning. The condition number (ratio of largest to smallest singular value) measures numerical stability; ill-conditioned problems require careful algorithms. These decompositions connect the entire field: SVD is the most general; eigendecomposition handles square matrices; Cholesky and QR solve specific problems efficiently.
Efficiency
Cholesky: O(n³/3) for positive definite matrices
QR: Numerically stable for least squares
LU: Efficient for multiple right-hand sides
SVD: Universal but most expensive
Limitations
Cholesky requires positive definite matrices
LU may need pivoting for stability
QR less intuitive than eigendecomposition
SVD expensive for very large matrices
09
Sources & References
References and sources for further study on the topics covered in this deep dive.
Section 01 — Details
Vectors & Spaces
At their core, vectors are not geometric objects—they are algebraic structures. A vector space V over a scalar field (usually ℝ) is a set where two operations are defined: addition (x + y ∈ V for any x,y ∈ V) and scalar multiplication (αx ∈ V for any scalar α and vector x). Eight axioms must hold: closure, associativity, commutativity of addition, existence of identity and inverse elements, distributivity, associativity of scalar multiplication, and the identity scalar 1·x = x.
Vector Space Examples in ML
ℝⁿ is the most familiar: n-dimensional space of real numbers. In machine learning, each data point is a vector in ℝᵈ where d is the number of features. Space of all m×n matrices, where addition and scalar multiplication are element-wise operations. Space of all polynomial functions of degree ≤ n. Even solution spaces of differential equations form vector spaces. This abstraction unifies different types of data: images are vectors (pixel intensities), text can be vectors (word embeddings), time series are vectors.
The concept of linear combination is fundamental: a vector y is a linear combination of vectors v₁, v₂, ..., vₖ if y = c₁v₁ + c₂v₂ + ... + cₖvₖ for some scalars cᵢ. The span of a set of vectors is the collection of all linear combinations you can form from them. For example, in ℝ³, the span of (1,0,0) and (0,1,0) is the xy-plane; every vector in that plane can be written as a linear combination of these two vectors.
Linear Independence and Basis
A set of vectors {v₁, v₂, ..., vₖ} is linearly independent if the only solution to c₁v₁ + c₂v₂ + ... + cₖvₖ = 0 is c₁ = c₂ = ... = cₖ = 0. Otherwise, the set is linearly dependent—some vector can be written as a combination of others. Linear independence is crucial: if you have a linearly dependent set, you have redundancy. In PCA, we remove linearly dependent or nearly dependent directions to reduce dimensionality.
A basis of a vector space V is a linearly independent set of vectors whose span equals V. In ℝⁿ, any basis has exactly n vectors (the dimension of the space). The standard basis is {e₁, e₂, ..., eₙ} where eᵢ has a 1 in position i and 0s elsewhere, but any linearly independent set of n vectors forms a valid basis. The crucial insight: every vector x ∈ V can be uniquely represented as x = α₁v₁ + α₂v₂ + ... + αₙvₙ when {v₁, ..., vₙ} is a basis. The coefficients αᵢ are the coordinates of x with respect to that basis.
Dot Product and Projections
The dot product (inner product) of two vectors x and y is defined as ⟨x,y⟩ = x^T y = x₁y₁ + x₂y₂ + ... + xₙyₙ. This operation has geometric meaning: the dot product equals ||x|| ||y|| cos(θ) where θ is the angle between vectors. When vectors are orthogonal (perpendicular), the dot product is zero. When parallel, the dot product is maximized. This enables similarity computation: high dot product means vectors point in similar directions (used in attention mechanisms, recommender systems, nearest neighbor search).
The projection of vector x onto vector u is proj_u(x) = (⟨x,u⟩ / ⟨u,u⟩) u. This finds the component of x in the direction of u. Projections appear in: least squares regression (projecting the data vector onto the column space of the design matrix), principal component analysis (finding components along directions of maximum variance), and Gram-Schmidt orthogonalization (repeatedly projecting to maintain orthogonality).
Worked Example: 3D Vector Space
Consider vectors v₁ = (1, 0, 0), v₂ = (0, 1, 0), v₃ = (1, 1, 0). Are they linearly independent? Check: c₁(1,0,0) + c₂(0,1,0) + c₃(1,1,0) = (0,0,0) gives c₁ + c₃ = 0, c₂ + c₃ = 0, which has non-trivial solutions (e.g., c₁ = 1, c₂ = 1, c₃ = -1). So they're linearly dependent—v₃ = v₁ + v₂. The set {v₁, v₂} is linearly independent, so {v₁, v₂} forms a basis for the xy-plane (a 2D subspace of ℝ³).
Now project the vector x = (2, 3, 0) onto v₁. The projection is proj_{v₁}(x) = (⟨x,v₁⟩ / ||v₁||²) v₁ = (2/1)(1,0,0) = (2,0,0). Project onto the xy-plane span(v₁,v₂): since x is already in this plane, its projection is itself (2,3,0). These projections are used in least squares: given data y and design matrix A, the least squares solution minimizes ||y - Ax||². This happens when Ax is the projection of y onto the column space of A.
Section 02 — Details
Matrices & Transformations
A matrix A ∈ ℝᵐˣⁿ can be interpreted as a linear transformation from ℝⁿ to ℝᵐ. Instead of thinking of it as a static table of numbers, think of it as a function f(x) = Ax that takes an n-dimensional input and produces an m-dimensional output. This transformation has specific properties: f(x + y) = f(x) + f(y) (additivity) and f(αx) = αf(x) (homogeneity). These linearity properties are what make matrices so useful—and what limit them.
The column vectors of A are fundamental: A = [a₁ a₂ ... aₙ]. When you compute Ax, you're computing a linear combination of the columns: Ax = x₁a₁ + x₂a₂ + ... + xₙaₙ. The image (column space) of A is the set of all vectors reachable by the transformation: Im(A) = {Ax : x ∈ ℝⁿ}. This space has dimension equal to the rank of A, which is at most min(m,n). Vectors that map to zero form the null space (kernel): Null(A) = {x : Ax = 0}. By the rank-nullity theorem, rank(A) + nullity(A) = n. These subspaces completely characterize the transformation's behavior.
Matrix Multiplication as Composition
Matrix multiplication isn't commutative (AB ≠ BA in general), but it is associative. More importantly, matrix multiplication composes transformations. If A transforms ℝⁿ → ℝᵐ and B transforms ℝᵐ → ℝᵖ, then BA transforms ℝⁿ → ℝᵖ. Note the order: (BA)x = B(Ax) means you apply A first, then B. In neural networks, layers compose: output = σ(W₂σ(W₁x + b₁) + b₂) applies W₁, then nonlinearity σ, then W₂, then σ, then bias. Each matrix multiplication is a linear transformation; the nonlinearity breaks linearity and enables expressivity.
Rank and Full Rank Conditions
The rank of a matrix is the dimension of its column space: rank(A) = dim(Im(A)). Equivalently, rank equals the number of linearly independent rows or columns. For an m×n matrix, rank(A) ≤ min(m,n). A matrix is full rank if rank(A) = min(m,n). For a square matrix A ∈ ℝⁿˣⁿ, full rank means rank(A) = n, which is equivalent to A being invertible (having a unique inverse A⁻¹ such that AA⁻¹ = A⁻¹A = I).
Rank determines solvability of systems. For Ax = b: if rank(A) = rank([A|b]), solutions exist (may be multiple if rank < n). If rank(A) = n, the solution is unique. If rank(A) < rank([A|b]), no solution exists. In machine learning, when you have more samples m than features n, typically rank(A) = n (assuming features are not redundant), and the system is overdetermined. When you have more features than samples (common in high-dimensional settings), rank(A) < n, and the system is underdetermined—infinitely many solutions exist, and regularization chooses among them.
Four Fundamental Subspaces
Every matrix A ∈ ℝᵐˣⁿ generates four subspaces: (1) Column space: Col(A) ⊆ ℝᵐ with dimension rank(A); (2) Null space: Null(A) ⊆ ℝⁿ with dimension n - rank(A); (3) Row space: Row(A) ⊆ ℝⁿ with dimension rank(A) (equals the column space of A^T); (4) Left null space: Null(A^T) ⊆ ℝᵐ with dimension m - rank(A). These subspaces are orthogonal complements: Col(A) ⊥ Null(A^T) and Row(A) ⊥ Null(A). Every vector in ℝⁿ decomposes uniquely as a sum of a component in Row(A) and a component in Null(A). Understanding these subspaces is crucial for understanding what information a linear transformation preserves or loses.
Worked Example: Rank and Column Space
Consider A = [1 2; 2 4; 3 6]. The first column is (1, 2, 3) and second is (2, 4, 6) = 2×(1, 2, 3). They're linearly dependent, so rank(A) = 1 (only the first column is independent). The column space is the line spanned by (1, 2, 3) in ℝ³. The null space: solve Ax = 0. We need x₁ + 2x₂ = 0, giving x₁ = -2x₂. So Null(A) = {(-2t, t) : t ∈ ℝ}, a 1-dimensional line in ℝ². This confirms rank-nullity: rank(A) + nullity(A) = 1 + 1 = 2 = number of columns. For systems Ax = b where b = (b₁, b₂, b₃): solutions exist only if b is in the column space, i.e., if b = c(1, 2, 3) for some c, meaning b₁ = b₂/2 = b₃/3. If this holds, infinitely many solutions exist (a 1-dimensional line, since we can add any element of Null(A) to a particular solution).
Section 03 — Details
Systems of Equations
A system of linear equations can be written as Ax = b where A ∈ ℝᵐˣⁿ, x ∈ ℝⁿ is the unknown, and b ∈ ℝᵐ is the known right-hand side. Gaussian elimination solves this by transforming the augmented matrix [A|b] into row echelon form (REF) or reduced row echelon form (RREF) through elementary row operations: swapping rows, scaling rows, and adding multiples of one row to another. These operations don't change solutions because they're reversible.
The process has three stages: (1) Forward elimination reduces to row echelon form where the first nonzero entry in each row is to the right of the first nonzero entry in the row above. (2) Back substitution solves from the bottom up. (3) Further reduction to RREF makes each pivot (leading nonzero) into 1 and zeros out above and below. The rank of A equals the number of pivots. If rank(A) = rank([A|b]), at least one solution exists. If rank(A) = n additionally, the solution is unique. If rank(A) < n, free variables correspond to null space directions; infinitely many solutions exist.
Overdetermined vs Underdetermined Systems
When m > n (more equations than unknowns), the system is overdetermined. In general, no exact solution exists—the system is inconsistent. This is common in machine learning: you have many data points (equations) but few parameters (unknowns). The least squares solution minimizes the residual: x* = argmin ||Ax - b||². This optimal x satisfies the normal equations A^T A x = A^T b. If A has full column rank, x* = (A^T A)⁻¹ A^T b. This is the foundation of linear regression: each equation is a data point, A is the design matrix (features), b is the target, and x* is the fitted coefficients.
When m < n (more unknowns than equations), the system is underdetermined and typically has infinitely many solutions. The solution set is an affine subspace (translation of a subspace). Least norm solution: x* = A^T(AA^T)⁻¹b finds the solution with minimum ||x||. In machine learning, underdetermined settings appear when features >> samples. Regularization (adding λI to A^T A) picks a specific solution: x* = (A^T A + λI)⁻¹ A^T b. Ridge regression uses L2 regularization, preferring smaller coefficients. LASSO uses L1 regularization, encouraging sparsity (some coefficients exactly zero).
Gaussian Elimination with Worked Example
Consider the system: 2x + y - z = 8, -3x - y + 2z = -11, -2x + y + 2z = -3. Augmented matrix:
[ 2 1 -1 | 8 ]
[ -3 -1 2 | -11 ]
[ -2 1 2 | -3 ]
Row reduce: R₂ ← R₂ + (3/2)R₁ and R₃ ← R₃ + R₁:
[ 2 1 -1 | 8 ]
[ 0 0.5 0.5 | 1 ]
[ 0 2 1 | 5 ]
Pivot in second column: R₃ ← R₃ - 4R₂:
[ 2 1 -1 | 8 ]
[ 0 0.5 0.5 | 1 ]
[ 0 0 -1 | 1 ]
Back substitution: -z = 1 gives z = -1. Then 0.5y + 0.5(-1) = 1 gives y = 3. Finally, 2x + 3 - (-1) = 8 gives x = 2. Solution: (x,y,z) = (2,3,-1). The rank is 3 (three pivots), matching the number of unknowns, so the solution is unique.
Least Squares and Normal Equations
For overdetermined Ax = b (m > n), the least squares solution minimizes ||Ax - b||². Taking the derivative with respect to x: ∇(||Ax - b||²) = 2A^T(Ax - b) = 0 gives the normal equations: A^T A x = A^T b. If A has full column rank (rank(A) = n), then A^T A is invertible, and x* = (A^T A)⁻¹ A^T b. The matrix (A^T A)⁻¹ A^T is called the pseudo-inverse.
In linear regression, we fit a model y = Xβ where X is the design matrix (samples × features) and y is targets. The least squares estimator is β̂ = (X^T X)⁻¹ X^T y. The fitted values are ŷ = Xβ̂ = X(X^T X)⁻¹ X^T y = Py where P = X(X^T X)⁻¹ X^T is the projection matrix onto the column space of X. The residuals are r = y - ŷ = (I - P)y, orthogonal to the column space (r^T X = 0). The residual sum of squares is ||r||² = ||y - Xβ̂||², which measures fit quality.
Section 04 — Details
Eigenvalues & Eigenvectors
An eigenvector of a square matrix A is a nonzero vector v such that Av = λv for some scalar λ called the eigenvalue. Geometrically, multiplying by A doesn't change v's direction—it only scales it by λ. If λ > 1, the transformation stretches along v. If 0 < λ < 1, it contracts. If λ < 0, it reverses direction and scales. If λ = 0, v is in the null space. These special directions reveal the "natural" structure of the transformation.
To find eigenvalues, we solve det(A - λI) = 0. This determinant is a polynomial of degree n (the characteristic polynomial). Its roots are the eigenvalues. For each eigenvalue λ, we solve (A - λI)v = 0 to find the eigenvectors—the null space of (A - λI). This is a linear system; the eigenspace for eigenvalue λ is the null space of (A - λI), which may be multi-dimensional. Multiplicity of an eigenvalue is how many times it appears as a root of the characteristic polynomial; geometric multiplicity is the dimension of the eigenspace.
Diagonalization
If A has n linearly independent eigenvectors v₁, v₂, ..., vₙ (which is guaranteed for symmetric matrices, less so for general matrices), we can write A = PDP⁻¹ where P = [v₁ v₂ ... vₙ] (columns are eigenvectors) and D is diagonal with eigenvalues on the diagonal. This is diagonalization. Computing powers becomes easy: A^k = PD^kP⁻¹, and D^k is just the diagonal entries raised to power k. This enables efficient computation of matrix exponentials A^k v for large k, solving differential equations, and analyzing long-term behavior of iterative processes.
Spectral Theorem and Symmetric Matrices
For symmetric matrices (A = A^T), the spectral theorem guarantees: (1) All eigenvalues are real. (2) Eigenvectors corresponding to different eigenvalues are orthogonal. (3) A always has n orthonormal eigenvectors (a complete orthonormal basis). (4) A = QΛQ^T where Q is orthogonal (columns are orthonormal eigenvectors) and Λ is diagonal with eigenvalues. This is orthogonal diagonalization, which avoids computing inverses and is numerically stable.
Symmetric positive definite (SPD) matrices have all positive eigenvalues. The quadratic form x^T A x equals ∑λᵢ(x^T qᵢ)² where qᵢ are orthonormal eigenvectors, showing it's always positive for nonzero x. SPD matrices appear everywhere: covariance matrices (symmetric by definition, positive definite since x^T Σ x = E[||x^T Z||²] ≥ 0), Hessians of convex functions at optima, and regularization terms in machine learning.
For λ₁ = 2: (A - 2I)v = 0 gives [1 1; 1 1]v = 0, so v = (1, -1) (or any scalar multiple). Normalized: v₁ = (1/√2, -1/√2).
For λ₂ = 4: (A - 4I)v = 0 gives [-1 1; 1 -1]v = 0, so v = (1, 1). Normalized: v₂ = (1/√2, 1/√2).
So A = [1/√2 1/√2; -1/√2 1/√2] [2 0; 0 4] [1/√2 -1/√2; 1/√2 1/√2]. The eigenvectors form an orthonormal basis where the matrix is diagonal. In this basis, the transformation is simply: scale the (1,-1) direction by 2 and the (1,1) direction by 4.
ML Applications: PCA
Principal Component Analysis finds the directions of maximum variance in data. Given centered data X ∈ ℝᵐˣⁿ (m samples, n features), the covariance matrix is Σ = (1/m) X^T X. By the spectral theorem, Σ = QΛQ^T where Λ contains eigenvalues (variances) in decreasing order on the diagonal and Q contains the corresponding eigenvectors (principal components). The first principal component is the eigenvector with largest eigenvalue—it's the direction along which data varies most. Projecting data onto the first k principal components gives k-dimensional approximation. PCA thus reduces from n to k dimensions while preserving maximum variance.
Section 05 — Details
Singular Value Decomposition
The Singular Value Decomposition is arguably the most powerful and widely applicable matrix decomposition. Every matrix A ∈ ℝᵐˣⁿ can be factored as A = UΣV^T where: U ∈ ℝᵐˣʳ has orthonormal columns (left singular vectors); Σ ∈ ℝʳˣʳ is diagonal with non-negative entries σ₁ ≥ σ₂ ≥ ... ≥ σᵣ > 0 (singular values); V ∈ ℝⁿˣʳ has orthonormal columns (right singular vectors); and r = rank(A) ≤ min(m,n).
The SVD reveals the structure of any dataset. The singular values measure how much variance exists along each principal direction. The ratio σ₁/σᵣ is the condition number—how ill-conditioned the matrix is. The columns of U are directions in the data space (ℝᵐ) along which data varies most; columns of V are directions in the feature space (ℝⁿ). Unlike eigendecomposition (which requires square matrices and may have complex eigenvalues), SVD always exists and uses real, orthogonal matrices—numerically stable.
Interpretation and Low-Rank Approximation
We can write A = ∑ᵢ σᵢ uᵢ vᵢ^T as a sum of rank-1 matrices, each with "weight" σᵢ. Truncating to the first k terms gives A_k = ∑ᵢ₌₁ᵏ σᵢ uᵢ vᵢ^T, a rank-k approximation. By Eckart-Young theorem, this is the best rank-k approximation in Frobenius norm: ||A - A_k||_F ≤ ||A - B||_F for any rank-k matrix B. This compression is the mathematical foundation of PCA (applied to covariance matrices), latent factor models, and matrix completion. In image compression, you keep the first few singular vectors and discard the rest; in recommendation systems (Netflix problem), you factorize the user-item matrix to find latent factors.
Connection to PCA and Eigendecomposition
Given centered data X ∈ ℝᵐˣⁿ (m samples, n features), the SVD is X = UΣV^T. The covariance matrix is Σ_data = (1/m)X^T X = (1/m)VΣ^T Σ V^T = (1/m)VΣ²V^T. So the eigendecomposition of the covariance matrix uses V as eigenvectors and σᵢ²/m as eigenvalues (variance along principal components). PCA is just SVD applied to centered data; you keep the first k columns of V to project to k dimensions.
Worked Example: 3×2 Matrix
Consider A = [2 0; 0 1; 0 0]. Compute A^T A = [4 0; 0 1]. Eigenvalues of A^T A: λ₁ = 4 and λ₂ = 1. Eigenvectors: v₁ = (1,0) and v₂ = (0,1). Singular values: σ₁ = √4 = 2, σ₂ = √1 = 1. Left singular vectors: u₁ = Av₁/σ₁ = (2,0,0)/2 = (1,0,0); u₂ = Av₂/σ₂ = (0,1,0)/1 = (0,1,0). So A = [1 0; 0 1; 0 0] [2 0; 0 1] [1 0; 0 1]. Rank is 2 (two nonzero singular values). Truncating to rank 1: A₁ = 2 · (1,0,0) · (1,0) = [2 0; 0 0; 0 0], which preserves the largest singular value contribution.
Practical Applications
Matrix completion (Netflix problem): given a partially observed user-item matrix with many missing entries, find the low-rank approximation. SVD provides the optimal rank-k reconstruction. Image compression: represent images as matrices; SVD separates important features (large σᵢ) from noise (small σᵢ). Noise reduction: add Gaussian noise to data; SVD of noisy matrix recovers the signal by truncating small singular values. Solving least squares: for Ax ≈ b, x* = V Σ⁻¹ U^T b (pseudo-inverse) is numerically more stable than normal equations. Dimensionality reduction: keep first k columns of U or V to reduce dimension while preserving maximum variance.
Section 06 — Details
Matrix Calculus
Machine learning requires optimizing loss functions with respect to matrix parameters. Matrix calculus is the calculus of multivariate functions where variables are vectors or matrices. The gradient ∇_x f(x) is the vector of partial derivatives; for a function f(X) where X is a matrix, the gradient is a matrix of partial derivatives ∂f/∂X_{ij}. Critical optimization operations reduce to computing these gradients, then taking steps in the negative gradient direction (gradient descent) to minimize the loss.
Key matrix calculus identities appear repeatedly in machine learning. For a quadratic form f(x) = x^T A x, the gradient is ∇_x(x^T A x) = (A + A^T)x, which simplifies to 2Ax if A is symmetric. For least squares f(x) = ||Ax - b||² = (Ax - b)^T(Ax - b), the gradient is ∇_x f(x) = 2A^T(Ax - b). These identities appear in linear regression, neural network training, and every convex optimization algorithm.
Chain Rule for Matrices
The chain rule extends to matrices. If L = f(Y) is a scalar loss (Y ∈ ℝᵖˣᵍ), and Y = g(X) (X ∈ ℝᵐˣⁿ), then ∂L/∂X = (∂L/∂Y) · (∂Y/∂X) where · denotes careful matrix multiplication. More precisely, treating everything as vectors and using the vec operator (stacking matrix columns into a vector), this is matrix-vector multiplication. In backpropagation, you compute gradients layer-by-layer: if z = W y + b, then ∂L/∂W = (∂L/∂z) y^T and ∂L/∂y = W^T (∂L/∂z). The transpose appears because of how derivatives respect dimensions.
Common Derivatives in ML
Linear regression: L = ||Xβ - y||² with Hessian ∇²_β L = 2X^T X. Logistic regression: cross-entropy loss L = -y^T log(σ(Xβ)) - (1-y)^T log(1 - σ(Xβ)) where σ is sigmoid; gradient ∇_β L = X^T(σ(Xβ) - y). Neural networks: for layer z = σ(W x + b), the chain rule propagates gradients backward. The Hessian (matrix of second derivatives) reveals curvature: positive definite Hessian means local convexity (gradient descent works well); large eigenvalues mean steep curvature (slow convergence); small eigenvalues mean flat directions (fast convergence). Newton's method uses the Hessian: x_{new} = x_old - (∇²L)⁻¹ ∇L, converging faster than gradient descent but with higher per-iteration cost.
Worked Example: Least Squares Gradient
We minimize L(β) = ||Xβ - y||². Expand: L = (Xβ - y)^T(Xβ - y) = β^T X^T X β - 2β^T X^T y + y^T y.
Taking derivative with respect to β: ∂L/∂β = 2X^T X β - 2X^T y. Setting to zero: X^T X β = X^T y, the normal equations. If X^T X is invertible, β* = (X^T X)⁻¹ X^T y. The Hessian ∇²_β L = 2X^T X (constant, independent of β). For least squares, the Hessian is always positive semi-definite (often positive definite when X has full column rank), guaranteeing convexity and unique minimum.
Gradient descent updates: β_{t+1} = β_t - α∇_β L|_{β=β_t} = β_t - 2α(X^T X β_t - X^T y). Each iteration moves in the direction of negative gradient by step size α. The eigenvalues of X^T X determine convergence rate: large eigenvalues (steep directions) converge fast; small eigenvalues (flat directions) converge slowly. Conditioning X^T X (via regularization or preconditioning) improves convergence.
Vector-by-Matrix Notation
When you see ∂L/∂X for X ∈ ℝᵐˣⁿ, you're taking derivative of scalar L with respect to each element of X independently. The result is an m×n matrix. The numerator layout convention puts the result shape matching the denominator. Under this convention: ∂(Ax)/∂x = A^T (for vector x); ∂(x^T A)/∂x = A (treating x as column vector); ∂(x^T A y)/∂A = x y^T. These conventions matter in implementations; most papers and PyTorch use specific conventions. Always check documentation or verify with small examples.
Section 07 — Details
Norms & Inner Products
A norm is a function ||·|| : ℝⁿ → ℝ that measures vector "size" and satisfies: (1) ||x|| ≥ 0 with equality iff x = 0 (positive definiteness); (2) ||αx|| = |α| ||x|| for scalars α (homogeneity); (3) ||x + y|| ≤ ||x|| + ||y|| (triangle inequality). Different norms emphasize different aspects of vectors. The L_p norm is ||x||_p = (∑|x_i|^p)^{1/p}. Common choices: L₁ (p=1), L₂ (p=2, Euclidean), L∞ (p=∞, maximum absolute value).
L₁ norm ||x||₁ = ∑|x_i| measures Manhattan distance. It's less sensitive to outliers than L₂ and promotes sparsity (LASSO regression using L₁ penalty drives many coefficients exactly to zero). L₂ norm ||x||₂ = √(∑x_i²) is the familiar Euclidean distance; smooth and geometrically intuitive. L∞ norm ||x||∞ = max_i |x_i| is minimax—the maximum absolute value. In robust optimization, L∞ constraints bound all coordinates. The choice of norm affects optimization: L₁ prefers sparse solutions, L₂ prefers balanced solutions, L∞ prefers balanced maxima.
Inner Products and Orthogonality
An inner product ⟨·,·⟩ : V × V → ℝ is a function satisfying: (1) bilinearity (linear in each argument); (2) symmetry ⟨x,y⟩ = ⟨y,x⟩; (3) positive definiteness ⟨x,x⟩ > 0 for x ≠ 0. The standard Euclidean inner product is ⟨x,y⟩ = x^T y = ∑x_i y_i. Vectors x and y are orthogonal if ⟨x,y⟩ = 0; orthonormal if additionally ||x|| = ||y|| = 1. Orthogonal matrices Q satisfy Q^T Q = I, preserving inner products: ⟨Qx, Qy⟩ = (Qx)^T(Qy) = x^T Q^T Q y = x^T y = ⟨x,y⟩.
The Gram matrix G_{ij} = ⟨v_i, v_j⟩ encodes all pairwise inner products of vectors v₁, ..., v_n. For orthonormal vectors, G = I. The Gram-Schmidt process orthogonalizes any basis: given v₁, v₂, ..., v_n, construct orthonormal u₁, u₂, ..., u_n by: u₁ = v₁/||v₁||; u_{k} = (v_k - ∑_{j<k} ⟨v_k, u_j⟩ u_j) / ||v_k - ∑_{j<k} ⟨v_k, u_j⟩ u_j||. This repeatedly removes components along previously constructed vectors, ensuring orthogonality. QR decomposition uses Gram-Schmidt: A = QR where Q has orthonormal columns and R is upper triangular.
Kernels and Generalized Inner Products
A kernel function k(x,y) maps pairs of vectors to scalars. For a positive definite kernel, there exists a feature space (possibly infinite-dimensional) where k(x,y) = φ(x)^T φ(y) implicitly computes the inner product of transformed features φ(x) and φ(y). This is powerful: support vector machines use kernels to work in high-dimensional spaces without explicitly computing φ(x). Common kernels: linear k(x,y) = x^T y; polynomial k(x,y) = (x^T y + c)^d; RBF k(x,y) = exp(-γ||x-y||²). The kernel trick avoids the curse of dimensionality by avoiding explicit feature expansion.
Frobenius and Spectral Norms for Matrices
For matrices, the Frobenius norm ||A||_F = √(∑_ij A_{ij}²) is like L₂ norm applied to all entries; it equals √(tr(A^T A)) and equals √(∑_i σ_i²) where σ_i are singular values. The spectral norm (operator norm) ||A||_2 = max_{||x||=1} ||Ax|| equals the largest singular value σ₁. These norms measure matrix "size" differently: Frobenius treats it like a high-dimensional vector; spectral is the maximum stretching factor. In machine learning, Frobenius norm regularizes all elements; spectral norm (used in spectral normalization) constrains the largest singular value, stabilizing GAN training.
Result: {(1,0,0), (0,1,0), (0,0,1)} — the standard basis! Matrix form: A = [1 1 1; 0 1 1; 0 0 1] = QR where Q = I and R = [1 1 1; 0 1 1; 0 0 1]. The QR form reveals the structure: columns of Q form orthonormal basis; R describes how original vectors combine from basis.
Section 08 — Details
Key Decompositions for ML
Beyond SVD, several matrix decompositions address specific problem structures. The QR decomposition factors any m×n matrix as A = QR where Q ∈ ℝᵐˣⁿ has orthonormal columns (or ℝᵐˣᵐ with orthogonal columns if m < n) and R ∈ ℝⁿˣⁿ is upper triangular. QR is numerically stable for solving least squares problems: Ax ≈ b becomes QRx ≈ b. Since Q^T Q = I, multiply by Q^T: Rx ≈ Q^T b. Solving the triangular system Rx = Q^T b is fast and accurate. QR also reveals the rank: zero diagonal entries in R indicate zero singular values. The QR algorithm repeatedly applies QR decomposition to compute eigenvalues: A_{k+1} = R_k Q_k where A_k = Q_k R_k, and the sequence A_k converges to upper triangular (eigenvalue) form.
LU Decomposition
The LU decomposition factors a square matrix A as A = LU where L is lower triangular with 1s on diagonal and U is upper triangular. LU exists (possibly after row permutation) for any nonsingular matrix. LU is efficient for solving multiple systems Ax = b_i with the same A but different right-hand sides: factor once (O(n³)), then solve each system (O(n²)) by forward substitution (L y = b_i) then back substitution (U x = y). For a single system, QR and SVD are more stable. LU requires careful pivoting (reordering rows) for numerical stability; partial pivoting (swapping rows to maximize pivot) is standard. Cholesky factorization is a special case of LU for symmetric positive definite matrices, using only half the work: A = LL^T.
Cholesky Decomposition
For a symmetric positive definite (SPD) matrix A, the Cholesky decomposition A = LL^T exists uniquely where L is lower triangular with positive diagonal entries. Cholesky is roughly twice as fast as LU (O(n³/3) vs O(n³/2.67)) and avoids pivoting. It appears when solving: normal equations in least squares (solve A^T A x = A^T b via Cholesky); convex optimization (A is Hessian); sampling from multivariate Gaussians (L defines the covariance: if y ~ N(0,I), then x = μ + Ly ~ N(μ, A)). Cholesky fails if A is not positive definite, providing a way to detect non-convexity.
The Cholesky update/downdate algorithms efficiently update a factorization when adding or removing data points, important in online learning. Regularization (adding λI to ill-conditioned matrices) makes them positive definite, enabling Cholesky: A + λI becomes well-conditioned. Ridge regression uses this: (X^T X + λI)⁻¹ X^T y is computed via Cholesky of X^T X + λI.
Positive Definite Matrices
A symmetric matrix A is positive definite (PD) if x^T A x > 0 for all nonzero x, equivalently: all eigenvalues are positive, all leading principal minors are positive, or Cholesky decomposition exists. Positive semi-definite (PSD) relaxes the inequality to ≥, allowing zero eigenvalues. These matrices appear ubiquitously in ML: covariance matrices Σ = E[(X - μ)(X - μ)^T] are always PSD (eigenvalues are variances ≥ 0); Hessians ∇²f of convex functions are PSD everywhere (convexity); kernel matrices K_{ij} = k(x_i, x_j) from valid kernels are PSD (Mercer's theorem). The set of PD matrices forms a cone: if A, B are PD, so is A + B (hence adding regularization λI to an indefinite matrix can make it PD).
Condition Number and Numerical Stability
The condition number of a matrix measures how sensitive solutions to changes in input. For A ∈ ℝⁿˣⁿ, κ(A) = σ_max/σ_min (ratio of largest to smallest singular value). Well-conditioned matrices have κ ≈ 1 (changes in input cause proportional changes in output). Ill-conditioned matrices have κ ≫ 1 (small input changes cause large output swings). For solving Ax = b, relative error in x is roughly κ(A) times relative error in b. Ill-conditioned systems require special care: use direct methods (QR, Cholesky) not normal equations; use regularization (add λI); use preconditioning in iterative solvers (multiply by matrix approximating A⁻¹).
Worked Example: Ridge Regression via Cholesky
In ridge regression, we minimize ||Xβ - y||² + λ||β||² with solution β* = (X^T X + λI)⁻¹ X^T y. Computing: First, compute M = X^T X + λI. Second, apply Cholesky: M = LL^T. Third, solve Ly = X^T y via forward substitution. Fourth, solve L^T β* = y via back substitution. This avoids explicit inversion and is numerically stable. The regularization parameter λ > 0 ensures M is positive definite (even if X^T X is singular), enabling Cholesky. As λ increases, condition number κ(M) decreases (better conditioning), but bias increases (β* moves away from true optimum). This bias-variance tradeoff is fundamental in regularized estimation.
Decomposition Comparison
SVD: Works for any matrix; reveals singular values; used for PCA, low-rank approximation, pseudo-inverse. Costs O(mn²) or O(m²n). Eigendecomposition: Works for square matrices; reveals eigenvalues; used for diagonalization, understanding transformations. Costs O(n³). QR: Orthogonal decomposition; numerically stable for least squares; used in iterative eigenvalue algorithms. Costs O(mn²). LU: Fast for multiple right-hand sides; requires pivoting for stability. Costs O(n³). Cholesky: Fastest for SPD matrices; avoids pivoting; used in optimization. Costs O(n³/3). Choose decomposition based on matrix structure, numerical stability needs, and problem size.