PCA & ICA — Dimensionality Reduction & Source Separation

STANFORD XCS229 · MACHINE LEARNING

PCA & ICA

Dimensionality Reduction

High-dimensional data poses computational and statistical challenges due to the "curse of dimensionality." As the number of features grows, data becomes sparse, distances become less meaningful, and models require exponentially more training samples. Dimensionality reduction addresses this by projecting data onto lower-dimensional subspaces while preserving essential structure.

Two complementary objectives motivate reduction techniques: compression reduces storage and computation costs by removing redundancy, while denoising improves model generalization by filtering noise. Principal Component Analysis (PCA) assumes linear structure and finds orthogonal directions of maximum variance. Independent Component Analysis (ICA) relaxes linearity assumptions and seeks components that are statistically independent. Both are unsupervised methods applied in signal processing, image compression, feature extraction, and visualization.

n → k

Reduction Ratio

Orthogonal

PCA Constraint

Independent

ICA Constraint

Unsupervised

Both Methods

PCA Formulation

PCA seeks to find a set of orthogonal directions (principal components) that capture the maximum variance in the data. Given centered data X ∈ ℝ^(m×n), the first principal component is the direction u₁ that maximizes Var(Xu₁). The second principal component is orthogonal to the first and maximizes variance in the remaining directions. This greedy optimization is equivalent to finding eigenvectors of the sample covariance matrix Σ = (1/m)X^T X, sorted by decreasing eigenvalue.

Mathematically, the problem reduces to eigendecomposition: Σv = λv. The eigenvalues λ represent variance explained per component, and eigenvectors v form the principal directions. PCA is optimal in a least-squares sense: if we project data onto the k-dimensional subspace spanned by the top-k eigenvectors, we minimize reconstruction error ‖X − X̂‖²_F. This elegant formulation connects variance maximization, eigendecomposition, and least-squares fitting into a unified framework.

Orthogonality

Principal components are mutually perpendicular, ensuring they capture independent directions of variance.

Eigendecomposition

Solution found via eigendecomposition of covariance matrix; top-k eigenvectors are optimal projection directions.

Variance Explained

Eigenvalues quantify the amount of variance captured by each component; larger eigenvalues indicate more important directions.

Centered Data

Requires zero-mean data. Centering via X := X − mean(X) removes translation, ensuring variance-optimal fit.

PCA Algorithm

PCA implementation involves three steps: (1) center the data X := X − mean(X), (2) compute the covariance matrix Σ = (1/m)X^T X or its singular value decomposition, and (3) extract eigenvectors and eigenvalues. The direct eigendecomposition approach requires O(n³) time but is stable for moderate dimensions. For high-dimensional problems (n >> m), Singular Value Decomposition (SVD) of X itself is preferred: X = UΣV^T. The right singular vectors V directly give principal components, and the singular values σᵢ relate to eigenvalues via λᵢ = σᵢ²/(m−1).

Numerical stability of SVD exceeds covariance matrix eigendecomposition, especially when X is ill-conditioned. SVD avoids explicitly forming X^T X, which squares the condition number and risks numerical errors. After SVD, projecting data onto the first k components yields the low-rank approximation X̂ = UₖΣₖVₖ^T, where Uₖ, Σₖ, Vₖ contain the top-k singular vectors and values. This compressed representation achieves k(m+n) storage instead of m·n, enabling efficient computation and storage for large datasets.

Choosing k

Selecting the number of components k requires balancing model complexity against information preservation. The scree plot visualizes eigenvalues (or explained variance per component) in descending order. An "elbow" point—where eigenvalue decrease sharply flattens—heuristically suggests k. A more principled approach uses the explained variance ratio (EVR): the cumulative fraction of total variance captured by the top k components. A common rule is selecting k such that the cumulative EVR exceeds 95%, though domain-specific requirements may demand different thresholds.

Cross-validation on downstream tasks (e.g., classification, reconstruction) provides task-specific guidance. For compression, minimize reconstruction error ‖X − X̂‖²_F / ‖X‖²_F; for visualization, use k = 2 or 3. Information-theoretic criteria like Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC) balance model fit against complexity penalties. Empirically, many datasets show steep drop-offs in the first few components (e.g., 80–90% variance in 5–10 components), enabling significant dimensionality reduction without severe information loss.

ICA Problem Setup

Independent Component Analysis solves the source separation problem: given mixed signals s = As, recover both the unknown mixing matrix A and the original independent sources s. The classic example is the "cocktail party problem"—separating overlapping speeches from multiple speakers recorded by multiple microphones. Unlike PCA (which assumes orthogonality and Gaussian noise), ICA assumes sources are statistically independent and non-Gaussian. These assumptions align with many real-world scenarios: speech signals, economic indices, brain activity (fMRI), and sensor data exhibit complex statistical structure beyond second-order moments.

ICA formulates the problem as: given observed data x = As (where A is unknown m×n mixing matrix and s is unknown n-dimensional source vector), estimate W = A⁻¹ such that recovered sources ŝ = Wx are mutually independent. Independence is measured via mutual information I(ŝ₁; ŝ₂; ...; ŝₙ) = 0, or equivalently, joint density factors as p(ŝ) = ∏pᵢ(ŝᵢ). This contrasts with PCA's variance-based objective and enables extraction of sources beyond second-order structure—critical for non-Gaussian data.

Key Difference: PCA vs ICA

PCA: Orthogonal directions of maximum variance. Assumes Gaussian, additive noise. Uses second-order statistics (covariance). ICA: Statistically independent components. Assumes non-Gaussian sources. Uses higher-order statistics (kurtosis, negentropy). Both are unsupervised; ICA is more powerful but requires stronger assumptions.

ICA Ambiguities

ICA recovery of sources is inherently ambiguous up to three non-uniqueness issues. First, scaling ambiguity: if A and s are a solution, then (A/c)·(c·s) is equally valid for any scalar c > 0. The recovered sources can be scaled arbitrarily; only their relative structure matters. Second, permutation ambiguity: the order of independent components is arbitrary. Swapping columns of A and corresponding entries of s yields an equivalent solution. These ambiguities reflect the fundamental structure of the problem: only statistical independence is preserved, not amplitude or ordering.

Third, and most restrictive: at most one source can be Gaussian. If two sources are jointly Gaussian, they cannot be separated using independence criteria—jointly Gaussian random variables with zero mean are characterized entirely by their covariance, which ICA cannot distinguish. This constraint ensures identifiability: for n sources, at most one may be Gaussian. Practically, this means ICA requires sufficiently non-Gaussian data. In applications where data is Gaussian or nearly Gaussian (e.g., whitened PCA components), ICA fails or requires regularization. Preprocessing (whitening, non-linear transformations) mitigates this issue.

Advantages

Recovers independent, non-Gaussian components.
More powerful than PCA for non-linear structure.
Applicable to diverse domains (speech, biomedical).

Limitations & Ambiguities

Scale and permutation ambiguities; requires domain knowledge to interpret.
At most one Gaussian source; fails on nearly-Gaussian data.
Requires careful preprocessing and hyperparameter tuning.

ICA Algorithm

ICA optimization frames source recovery as a maximum likelihood (ML) or maximum entropy problem. The log-likelihood for unknown densities is ∑ log p(ŝᵢ) + log|detW|, where p(ŝᵢ) are marginal source densities and detW is the Jacobian. With non-parametric density estimates (e.g., histograms or kernel methods), ML becomes infeasible. Instead, negentropy—the KL divergence between ŝᵢ and a Gaussian with matching variance—approximates non-Gaussianity. The FastICA algorithm minimizes ∑negentropy(ŝᵢ) via fixed-point iteration on W, achieving rapid convergence (cubic in favorable conditions) compared to gradient descent.

Preprocessing is critical: whitening (sphering) the data via X_w = Σ^{-1/2}X ensures spherical covariance, reducing the problem dimensionality and improving numerical stability. After whitening, ICA works on the whitened data, simplifying W to be orthogonal initially, then refined. Initialization matters—poor starting points may converge to local minima. Practical implementations use multiple random initializations and select the solution with highest negentropy. Extensions handle noisy mixing (ŝ = Ax + noise) and nonlinear mixing via kernel ICA, though complexity increases significantly.

Self-Supervised & Foundation Models

Self-supervised learning leverages unlabeled data to learn useful representations via pretext tasks. Autoencoders—neural networks that encode input x into latent code z = f(x) then reconstruct x̂ = g(z)—perform learned dimensionality reduction. Variational Autoencoders (VAE) add a probabilistic structure, minimizing reconstruction loss plus a KL divergence regularizer on the latent distribution. Contrastive methods (e.g., SimCLR, BYOL) learn representations by pulling similar examples together and pushing dissimilar examples apart in latent space.

Foundation models (e.g., BERT, GPT, Vision Transformers) scale self-supervised pretraining to massive unlabeled corpora. Vision pretraining learns rich image representations via masked image modeling or contrastive objectives; language pretraining learns from next-token prediction. These pretrained representations transfer exceptionally well to downstream tasks—fine-tuning on minimal labeled data often outperforms training from scratch. Zero-shot capability emerges: models generalize to unseen classes via text descriptions (CLIP) or few-shot examples without any task-specific training. This paradigm shift—learning once, adapting widely—represents the frontier of modern dimensionality reduction and representation learning.

Autoencoders

Neural encoder-decoder networks learning nonlinear dimensionality reduction via reconstruction loss.

Variational Autoencoders

Probabilistic variant enforcing latent distribution structure; useful for generation and interpolation.

Contrastive Learning

Learn similarity structure without labels; powers foundation models and transfer learning.

Foundation Models

Large-scale pretrained models (BERT, GPT, ViT) learn universal representations transferable to diverse downstream tasks.

References & Further Reading

Key References:

Dimensionality Reduction

PCA Formulation

Orthogonality

Eigendecomposition

Variance Explained

Centered Data

PCA Algorithm

Choosing k

ICA Problem Setup

Key Difference: PCA vs ICA

ICA Ambiguities

Advantages

Limitations & Ambiguities

ICA Algorithm

Self-Supervised & Foundation Models

Autoencoders

Variational Autoencoders

Contrastive Learning

Foundation Models

References & Further Reading

Dimensionality Reduction

Why Reduction Matters

Mathematical Foundations

PCA Formulation

Geometric Interpretation

Connection to SVD

PCA Algorithm

Computational Complexity

Numerical Stability

Choosing k

Information-Theoretic Criteria

Cross-Validation Approach

Practical Guidelines

ICA Problem Setup

Independence vs Orthogonality

Linear vs Nonlinear Mixing

ICA Ambiguities

Identifiability Conditions

ICA Algorithm

Whitening Step

Nonlinearity Selection

Self-Supervised & Foundation Models

Foundation Models & Transfer Learning

Sources & References

Course Materials

Textbooks & Papers