Unsupervised Learning

01

K-Means Clustering

K-Means is a simple yet powerful algorithm for unsupervised partitioning of data into k clusters. The algorithm iteratively assigns each point to the nearest cluster centroid, then recomputes centroids as the mean of assigned points. This coordinate-wise optimization minimizes the within-cluster distortion, or sum of squared distances from points to their assigned centroids.

Convergence is guaranteed due to the non-increasing distortion objective: in each iteration, either assignment improves (lower distortion) or centroids move to lower-distortion positions. Initialization matters significantly—k-means++ seeds initial centers by probabilistically selecting points that are far from existing centers, greatly improving solution quality and convergence speed. Although k-means finds local optima rather than global optima, careful initialization mitigates this limitation.

O(nkd)

Per Iteration

Convex

Subproblems

k-means++

Better Init

Local Min

Guaranteed

02

Mixture of Gaussians

The Gaussian Mixture Model (GMM) extends k-means by treating cluster assignment probabilistically. Rather than hard assignments (point belongs to cluster k), GMM computes soft responsibilities—the posterior probability that a point was generated by each Gaussian component. Each component has its own mean μ_k, covariance Σ_k, and mixing weight π_k that sums to one.

GMM is a latent variable model where cluster identity z is unobserved; the likelihood marginalizes over z. Maximizing this likelihood requires the EM algorithm since there is no closed-form solution. The soft assignments naturally handle ambiguity in cluster membership and provide uncertainty estimates. When covariances collapse to spherical and equal, EM becomes equivalent to k-means, showing the fundamental connection between hard and soft clustering.

Soft Assignments

Each point has a probability distribution over clusters, not a single discrete label. Reflects uncertainty in cluster membership naturally.

Probabilistic Framework

Well-defined likelihood model with explicit assumptions on data generation. Enables model selection via AIC/BIC.

Covariance Structure

Different shapes and sizes per cluster. Generalizes k-means spherical assumption to full covariance matrices.

EM Optimization

Expectation-Maximization provides guaranteed convergence to local maximum of marginal likelihood.

03

Jensen's Inequality

Jensen's Inequality is the mathematical foundation of the EM algorithm. For a concave function f and random variable X, it states E[f(X)] ≤ f(E[X]). The proof uses the definition of concavity: a concave function lies above its tangent line. In the context of EM, applying Jensen's Inequality to the log function (which is concave) creates a lower bound on the log-likelihood—the Evidence Lower Bound or ELBO.

The ELBO is key: log p(x) = log Σ_z p(x, z) ≥ Σ_z q(z) log(p(x,z)/q(z)), where q(z) is any distribution over latent variables z. Equality holds when q(z) = p(z|x), the true posterior. This insight drives EM: the E-step optimizes q to match the posterior (making the bound tight), and the M-step optimizes model parameters θ to increase the ELBO. Monotonically improving the ELBO guarantees convergence.

Concave

Requirement

ELBO

Lower Bound

Equality

At Posterior

Gap

KL(q||p)

04

The EM Algorithm

The Expectation-Maximization algorithm solves maximum likelihood problems with latent variables. In the E-step, we compute the posterior distribution over latent variables z given observed data x and current parameters θ: γ_nk = p(z_n = k | x_n, θ). For GMM, this is the soft responsibility of component k for point n. In the M-step, we maximize the expected complete-data log-likelihood, treating responsibilities as fixed weights.

For GMM, the M-step updates are intuitive: π_k becomes the average responsibility, μ_k becomes the responsibility-weighted mean of points, and Σ_k becomes the responsibility-weighted covariance. Each update has a closed form. The algorithm alternates E and M steps, monotonically increasing the marginal likelihood until convergence (typically a few dozen iterations). EM avoids the direct optimization of the intractable marginal likelihood by decomposing it into tractable expectations and parameter optimization.

E-Step

Compute γ_nk = p(z_n = k | x_n, θ^(t)). Responsibility of component k for data point n.

M-Step

Maximize Q(θ,θ^(t)) = E_z[log p(x,z|θ)]. Update π_k, μ_k, Σ_k with closed-form solutions.

Convergence

Repeat until log p(x|θ) or parameters change by less than threshold ε.

05

General EM Framework

EM is a general framework applicable to any latent variable model, not just GMM. The key insight is coordinate ascent on the ELBO in two steps. The E-step finds the q(z) that maximizes ELBO for fixed θ, yielding q(z) = p(z|x, θ). The M-step finds θ that maximizes ELBO for fixed q, equivalent to maximizing Σ_z q(z) log p(x, z | θ). Crucially, this is a guaranteed improvement because we first make the bound tight, then increase it.

EM exhibits monotonic improvement: L(θ^(t)) ≤ L(θ^(t+1)) where L is the marginal likelihood. However, EM finds local maxima, not global ones—initialization matters. The gap between ELBO and true likelihood is KL(q(z)||p(z|x,θ)), minimized to zero by the E-step. In practice, running EM multiple times from different initializations and selecting the best can improve solutions. The framework extends beyond clustering: it powers latent factor models, missing-data imputation, and mixed-effect models.

Strengths

Guaranteed monotonic improvement in likelihood
Handles missing data and latent variables naturally
Closed-form M-step for exponential families
Principled probabilistic framework

Limitations

Finds local maxima, not global optima
Can be slow (many E-M iterations)
Initialization sensitivity
May overfit if not regularized

06

Factor Analysis

Factor Analysis is a latent variable model for continuous hidden factors. Unlike GMM's discrete cluster assignments, Factor Analysis assumes each observation x ∈ ℝ^d is generated from a low-dimensional latent factor z ∈ ℝ^k (k << d) via a linear transformation: x = Λz + ε, where Λ is a d × k loading matrix and ε ~ N(0, Ψ) is diagonal noise. This assumes x has a low-rank covariance structure Σ = ΛΛ^T + Ψ, useful when d >> n (more features than samples).

The model assumes z ~ N(0, I) and ε is diagonal, so each observed dimension's variance decomposes into shared factor variance (captured by Λz) and independent noise (ε). EM fits Factor Analysis by alternating: E-step computes posterior over z given x, M-step updates Λ and Ψ. Unlike PCA (which uses SVD on centered data), Factor Analysis is probabilistic and provides uncertainty estimates for factor values. It's essential in psychometrics, finance (factor models), and domains with high-dimensional data where interpretable latent factors are valued.

Low-Rank

Covariance

Σ = ΛΛ^T+Ψ

Structure

Continuous

Latent z

Diagonal Ψ

Assumption

07

Variational Inference

Variational Inference generalizes EM by parameterizing the posterior distribution q(z|x) as a flexible distribution from a family Q, then optimizing over Q to minimize KL(q||p(z|x)). Unlike EM's E-step which computes the exact posterior (only tractable for certain models), variational inference trades accuracy for tractability. A common approach is mean-field approximation: q(z) = ∏_i q_i(z_i), assuming conditional independence among latent variables. This factorization enables efficient updates: optimize each q_i while holding others fixed.

Variational Autoencoders (VAEs) combine variational inference with deep neural networks: the encoder q_φ(z|x) and decoder p_θ(x|z) are neural networks parameterized by φ and θ. The ELBO decomposes as E_q[log p(x|z)] - KL(q_φ||p(z)), combining reconstruction loss and a regularizer pushing q toward the prior. VAEs enable scalable learning on high-dimensional data and generative modeling. The reparameterization trick allows backpropagation through stochastic nodes, making end-to-end training feasible. Variational methods are foundational in modern deep generative modeling.

Mean-Field Approximation

Factorizes posterior into independent factors. Computationally tractable but may underestimate posterior variance (q too sharp).

KL Minimization

Minimizes reverse KL: q close to true posterior. Direct optimization of variational objective avoids EM iterations.

Neural Network Parameterization

Use neural networks to parameterize q and p, enabling scalability to large datasets and complex distributions.

Reparameterization Trick

z = μ + σ ⊙ ε allows gradient flow through stochastic sampling. Essential for end-to-end deep learning.

08

Practical Clustering

Choosing the number of clusters k is crucial but non-trivial. The elbow method plots distortion or inertia versus k and looks for a "kink"—beyond which increasing k yields diminishing improvements. Silhouette coefficient measures how similar each point is to points in its own cluster versus other clusters (range [-1,1]; higher is better). These heuristics are intuitive but subjective. For GMM, information criteria like BIC (Bayesian Information Criterion) and AIC trade likelihood against model complexity, providing principled model selection. Cross-validation can estimate generalization performance on held-out data.

Beyond k-means and GMM, spectral clustering exploits the eigendecomposition of the similarity matrix to uncover non-convex clusters. It first constructs a graph where edge weights encode similarity, then performs graph Laplacian eigendecomposition, and applies k-means to the embedded points. Spectral methods are robust to elongated clusters and non-spherical shapes. Hierarchical clustering (agglomerative or divisive) builds a dendrogram showing how clusters merge, useful for exploring structure at multiple scales. Density-based methods like DBSCAN find clusters of arbitrary shape by grouping points with many neighbors. Practical clustering requires domain knowledge, multiple evaluations, and iterative refinement.

09

References & Further Reading

Key References:

K-Means Clustering

Mixture of Gaussians

Soft Assignments

Probabilistic Framework

Covariance Structure

EM Optimization

Jensen's Inequality

The EM Algorithm

General EM Framework

Strengths

Limitations

Factor Analysis

Variational Inference

Mean-Field Approximation

KL Minimization

Neural Network Parameterization

Reparameterization Trick

Practical Clustering

References & Further Reading

K-Means Clustering

Mixture of Gaussians

Geometric Intuition

Jensen's Inequality

The EM Algorithm

E-Step: Responsibilities

M-Step: MLE

Monotonic Improvement

Probabilistic Interpretation

General EM Framework

Intuition: Two-Level Optimization

Factor Analysis

Variational Inference

Mean-Field VI

Reparameterization Trick

Neural Parameterization

Generative & Inference

Practical Clustering

Best Practices

Common Pitfalls

Sources & References

Course Materials

Textbooks & Papers