K-Means Clustering
K-Means is a simple yet powerful algorithm for unsupervised partitioning of data into k clusters. The algorithm iteratively assigns each point to the nearest cluster centroid, then recomputes centroids as the mean of assigned points. This coordinate-wise optimization minimizes the within-cluster distortion, or sum of squared distances from points to their assigned centroids.
Convergence is guaranteed due to the non-increasing distortion objective: in each iteration, either assignment improves (lower distortion) or centroids move to lower-distortion positions. Initialization matters significantly—k-means++ seeds initial centers by probabilistically selecting points that are far from existing centers, greatly improving solution quality and convergence speed. Although k-means finds local optima rather than global optima, careful initialization mitigates this limitation.
Mixture of Gaussians
The Gaussian Mixture Model (GMM) extends k-means by treating cluster assignment probabilistically. Rather than hard assignments (point belongs to cluster k), GMM computes soft responsibilities—the posterior probability that a point was generated by each Gaussian component. Each component has its own mean μ_k, covariance Σ_k, and mixing weight π_k that sums to one.
GMM is a latent variable model where cluster identity z is unobserved; the likelihood marginalizes over z. Maximizing this likelihood requires the EM algorithm since there is no closed-form solution. The soft assignments naturally handle ambiguity in cluster membership and provide uncertainty estimates. When covariances collapse to spherical and equal, EM becomes equivalent to k-means, showing the fundamental connection between hard and soft clustering.
Soft Assignments
Each point has a probability distribution over clusters, not a single discrete label. Reflects uncertainty in cluster membership naturally.
Probabilistic Framework
Well-defined likelihood model with explicit assumptions on data generation. Enables model selection via AIC/BIC.
Covariance Structure
Different shapes and sizes per cluster. Generalizes k-means spherical assumption to full covariance matrices.
EM Optimization
Expectation-Maximization provides guaranteed convergence to local maximum of marginal likelihood.
Jensen's Inequality
Jensen's Inequality is the mathematical foundation of the EM algorithm. For a concave function f and random variable X, it states E[f(X)] ≤ f(E[X]). The proof uses the definition of concavity: a concave function lies above its tangent line. In the context of EM, applying Jensen's Inequality to the log function (which is concave) creates a lower bound on the log-likelihood—the Evidence Lower Bound or ELBO.
The ELBO is key: log p(x) = log Σ_z p(x, z) ≥ Σ_z q(z) log(p(x,z)/q(z)), where q(z) is any distribution over latent variables z. Equality holds when q(z) = p(z|x), the true posterior. This insight drives EM: the E-step optimizes q to match the posterior (making the bound tight), and the M-step optimizes model parameters θ to increase the ELBO. Monotonically improving the ELBO guarantees convergence.
The EM Algorithm
The Expectation-Maximization algorithm solves maximum likelihood problems with latent variables. In the E-step, we compute the posterior distribution over latent variables z given observed data x and current parameters θ: γ_nk = p(z_n = k | x_n, θ). For GMM, this is the soft responsibility of component k for point n. In the M-step, we maximize the expected complete-data log-likelihood, treating responsibilities as fixed weights.
For GMM, the M-step updates are intuitive: π_k becomes the average responsibility, μ_k becomes the responsibility-weighted mean of points, and Σ_k becomes the responsibility-weighted covariance. Each update has a closed form. The algorithm alternates E and M steps, monotonically increasing the marginal likelihood until convergence (typically a few dozen iterations). EM avoids the direct optimization of the intractable marginal likelihood by decomposing it into tractable expectations and parameter optimization.
General EM Framework
EM is a general framework applicable to any latent variable model, not just GMM. The key insight is coordinate ascent on the ELBO in two steps. The E-step finds the q(z) that maximizes ELBO for fixed θ, yielding q(z) = p(z|x, θ). The M-step finds θ that maximizes ELBO for fixed q, equivalent to maximizing Σ_z q(z) log p(x, z | θ). Crucially, this is a guaranteed improvement because we first make the bound tight, then increase it.
EM exhibits monotonic improvement: L(θ^(t)) ≤ L(θ^(t+1)) where L is the marginal likelihood. However, EM finds local maxima, not global ones—initialization matters. The gap between ELBO and true likelihood is KL(q(z)||p(z|x,θ)), minimized to zero by the E-step. In practice, running EM multiple times from different initializations and selecting the best can improve solutions. The framework extends beyond clustering: it powers latent factor models, missing-data imputation, and mixed-effect models.
Strengths
- Guaranteed monotonic improvement in likelihood
- Handles missing data and latent variables naturally
- Closed-form M-step for exponential families
- Principled probabilistic framework
Limitations
- Finds local maxima, not global optima
- Can be slow (many E-M iterations)
- Initialization sensitivity
- May overfit if not regularized
Factor Analysis
Factor Analysis is a latent variable model for continuous hidden factors. Unlike GMM's discrete cluster assignments, Factor Analysis assumes each observation x ∈ ℝ^d is generated from a low-dimensional latent factor z ∈ ℝ^k (k << d) via a linear transformation: x = Λz + ε, where Λ is a d × k loading matrix and ε ~ N(0, Ψ) is diagonal noise. This assumes x has a low-rank covariance structure Σ = ΛΛ^T + Ψ, useful when d >> n (more features than samples).
The model assumes z ~ N(0, I) and ε is diagonal, so each observed dimension's variance decomposes into shared factor variance (captured by Λz) and independent noise (ε). EM fits Factor Analysis by alternating: E-step computes posterior over z given x, M-step updates Λ and Ψ. Unlike PCA (which uses SVD on centered data), Factor Analysis is probabilistic and provides uncertainty estimates for factor values. It's essential in psychometrics, finance (factor models), and domains with high-dimensional data where interpretable latent factors are valued.
Variational Inference
Variational Inference generalizes EM by parameterizing the posterior distribution q(z|x) as a flexible distribution from a family Q, then optimizing over Q to minimize KL(q||p(z|x)). Unlike EM's E-step which computes the exact posterior (only tractable for certain models), variational inference trades accuracy for tractability. A common approach is mean-field approximation: q(z) = ∏_i q_i(z_i), assuming conditional independence among latent variables. This factorization enables efficient updates: optimize each q_i while holding others fixed.
Variational Autoencoders (VAEs) combine variational inference with deep neural networks: the encoder q_φ(z|x) and decoder p_θ(x|z) are neural networks parameterized by φ and θ. The ELBO decomposes as E_q[log p(x|z)] - KL(q_φ||p(z)), combining reconstruction loss and a regularizer pushing q toward the prior. VAEs enable scalable learning on high-dimensional data and generative modeling. The reparameterization trick allows backpropagation through stochastic nodes, making end-to-end training feasible. Variational methods are foundational in modern deep generative modeling.
Mean-Field Approximation
Factorizes posterior into independent factors. Computationally tractable but may underestimate posterior variance (q too sharp).
KL Minimization
Minimizes reverse KL: q close to true posterior. Direct optimization of variational objective avoids EM iterations.
Neural Network Parameterization
Use neural networks to parameterize q and p, enabling scalability to large datasets and complex distributions.
Reparameterization Trick
z = μ + σ ⊙ ε allows gradient flow through stochastic sampling. Essential for end-to-end deep learning.
Practical Clustering
Choosing the number of clusters k is crucial but non-trivial. The elbow method plots distortion or inertia versus k and looks for a "kink"—beyond which increasing k yields diminishing improvements. Silhouette coefficient measures how similar each point is to points in its own cluster versus other clusters (range [-1,1]; higher is better). These heuristics are intuitive but subjective. For GMM, information criteria like BIC (Bayesian Information Criterion) and AIC trade likelihood against model complexity, providing principled model selection. Cross-validation can estimate generalization performance on held-out data.
Beyond k-means and GMM, spectral clustering exploits the eigendecomposition of the similarity matrix to uncover non-convex clusters. It first constructs a graph where edge weights encode similarity, then performs graph Laplacian eigendecomposition, and applies k-means to the embedded points. Spectral methods are robust to elongated clusters and non-spherical shapes. Hierarchical clustering (agglomerative or divisive) builds a dendrogram showing how clusters merge, useful for exploring structure at multiple scales. Density-based methods like DBSCAN find clusters of arbitrary shape by grouping points with many neighbors. Practical clustering requires domain knowledge, multiple evaluations, and iterative refinement.
References & Further Reading
Key References: