STANFORD XCS229 · PROBLEM SET 5
PS5: PCA, EM & ICA
55 Points · Unsupervised Learning & Implicit Regularization
01

Problem Set 5 Overview

XCS229 Problem Set 5 is a 55-point assignment spanning weeks 8–10 and focusing on unsupervised learning and modern generalization theory. Four comprehensive problems cover Principal Component Analysis (PCA), semi-supervised learning with Expectation–Maximization (EM), Independent Component Analysis (ICA), and implicit regularization in neural networks. Together, they form a unified exploration of dimensionality reduction, learning from limited labeled data, source separation, and the theoretical underpinnings of why overparameterized models generalize.

This problem set bridges classical unsupervised learning with contemporary machine learning practice. Students implement core algorithms from scratch, analyze their mathematical properties, and apply them to real data. The emphasis is on understanding latent structure, combining labeled and unlabeled information efficiently, and discovering why simple learning rules (like SGD) work so well in practice.

55
Total Points
4
Problems
3
Weeks
Unsupervised
Focus
02

PCA Implementation

Problem 01 requires implementing Principal Component Analysis from first principles. Students derive the eigendecomposition of the covariance matrix, compute principal components, and make informed decisions about the number of components k. The implementation involves centering data, computing the covariance matrix, solving the eigenvalue problem, and projecting data onto the principal subspace to achieve dimensionality reduction.

The problem emphasizes the mathematical intuition: PCA finds the directions of maximum variance in the data. Students compute reconstruction error to measure information loss, visualize the principal components and their loadings, and understand how variance explained relates to the eigenvalues. This foundation prepares for downstream analysis of how well PCA captures the true latent structure.

Eigendecomposition

Solve U Λ UT = Σ for the covariance matrix using numpy or scipy.

Choosing k

Use cumulative explained variance to decide dimensionality, typically 90%+ of total.

Reconstruction Error

Measure ||X − X̂|| to quantify information loss from dimensionality reduction.

Visualization

Plot original 2D/3D data and PCA projections to inspect cluster structure.

03

PCA Analysis

Problem 01 continues with detailed analysis of PCA's behavior on real datasets. Students produce scree plots showing eigenvalues (or cumulative explained variance) and interpret them: steep drops indicate lower-dimensional structure, while flat tails suggest noise. The analysis reveals whether the dataset is intrinsically low-rank or genuinely high-dimensional.

Students also investigate how PCA discovers latent structure: do the principal components align with known class labels? How well do they separate clusters? What aspects of the data does each component capture? This interpretive work bridges theory and application, demonstrating that PCA is not just a black-box tool but a lens for understanding data geometry and hidden factors.

Scree Plot
Diagnostic
Variance
Explained
Component
Loadings
Intrinsic Dim.
Estimation
04

Semi-supervised EM

Problem 02 introduces the Expectation–Maximization algorithm extended to semi-supervised learning. Given a small set of labeled examples and a large set of unlabeled examples, students implement a modified EM that simultaneously fits a mixture model (or other latent-variable model) and leverages the labeled data to guide cluster assignment. The E-step incorporates both labeled and unlabeled likelihoods; the M-step updates parameters using the expected sufficient statistics from both groups.

The key innovation is that unlabeled data provides information about the global structure—the location and covariance of clusters—while labeled data provides anchor points. Students implement the convergence criterion (log-likelihood or parameter changes), handle the cold-start problem, and compare the semi-supervised fit to fully unsupervised and fully supervised baselines. This is where theory meets practice in learning with limited labels.

Modified E-Step

Compute posterior responsibilities using both labeled constraints and soft assignments.

M-Step Integration

Update means, covariances, and mixing weights from weighted labeled/unlabeled contributions.

Convergence

Monitor log-likelihood or parameter changes; typically 10–50 iterations suffice.

Baseline Comparison

Benchmark against pure unsupervised and supervised methods to validate benefit.

05

Semi-supervised Analysis

Problem 02 concludes with analysis of when and why semi-supervised learning works. Students measure label efficiency: how much does adding unlabeled data improve over using only a small labeled set? They explore the regime where unlabeled data helps most (cluster structure is strong, label budget is tight) versus when it adds little (data is noise-like or labels are abundant). The analysis also examines sensitivity to initialization and hyperparameters like the mixture model order k.

This section bridges theory and practice by investigating practical considerations: computational cost, robustness to model misspecification (e.g., true clusters don't match the assumed mixture model), and the "when does unlabeled help?" question that machine learning practitioners face daily. Understanding these tradeoffs is essential for deploying semi-supervised methods responsibly.

Label Efficiency
Gain
Optimal k
Selection
Convergence
Speed
Robustness
Eval
06

ICA Source Separation

Problem 03 focuses on Independent Component Analysis (ICA), a technique to recover independent source signals from their mixed observations—the classic "cocktail party" problem. Unlike PCA which finds directions of variance, ICA finds directions of statistical independence. Students implement or use an ICA algorithm (e.g., FastICA) on synthetic or real data: given a mixed signal matrix X = AS where A is unknown mixing matrix and S contains independent sources, recover S and A.

The problem covers preprocessing (whitening via PCA to pre-condition the data), the non-Gaussianity assumption (ICA exploits higher-order moments, not just covariance), and the lack of identifiability up to permutation and sign. Students verify solution quality by checking how well recovered sources are independent (via mutual information or kurtosis) and compare recovered sources to ground truth. This problem demonstrates a radically different perspective on dimensionality and latent structure.

Whitening

Pre-process via PCA to make covariance identity; reduces complexity of ICA.

Non-Gaussianity

Maximize kurtosis or negentropy to find independent components from mixed signals.

FastICA Algorithm

Fixed-point iteration to find independent components; numerically efficient and stable.

Permutation Ambiguity

Recovered components may appear in any order; loss of identifiability is unavoidable.

07

Implicit Regularization

Problem 04 investigates implicit regularization in overparameterized neural networks—why do models with far more parameters than training examples still generalize? The problem guides students through empirical and theoretical exploration: train neural networks on synthetic data, measure training and test loss, study how early stopping affects generalization, and compare SGD's implicit bias with explicit regularization (L2, dropout).

Key insights include: SGD finds a solution biased toward small norm (minimum norm solution among those that fit training data), early stopping acts as a regularizer by interrupting the trajectory before overfitting, and overparameterization provides a beneficial implicit bias without the computational cost of explicit regularization. This problem bridges classical learning theory and modern deep learning, showing why simple algorithms work surprisingly well.

SGD Bias
Toward Simplicity
Early Stopping
Regularization
Overparameterization
Generalization
Implicit vs. Explicit
Regularization
08

Key Takeaways

Across these four problems, several unifying themes emerge. Unsupervised methods (PCA, ICA) discover latent structure without labels, enabling data exploration and feature engineering. Semi-supervised learning combines labeled and unlabeled data synergistically, dramatically improving efficiency in label-scarce regimes. Implicit regularization explains why modern neural networks generalize despite overparameterization, fundamentally changing our understanding of learning.

The problem set reinforces that machine learning is not purely algorithmic: success requires understanding data geometry (PCA), leveraging structure assumptions (ICA's non-Gaussianity), designing learning objectives (semi-supervised EM's hybrid loss), and appreciating optimization dynamics (SGD's implicit bias). These principles generalize far beyond PS5, underpinning practical machine learning and ongoing research in representation learning, weakly-supervised learning, and neural network theory.

Big Picture

XCS229 Problem Set 5 is a capstone on unsupervised and semi-supervised learning. By implementing, analyzing, and comparing these methods, students gain deep intuition for when each technique applies, how to interpret results, and why modern machine learning works so well in practice.

09

References & Further Reading

Key resources for PCA, EM algorithm, semi-supervised learning, ICA, and implicit regularization theory. Stanford CS229 lectures on unsupervised learning and regularization cover all four problem areas. Murphy's textbook provides intuitive explanations of PCA and EM. ESL Chapter 14 covers unsupervised learning. For implicit regularization, consult modern deep learning theory papers and Belkin et al.'s work on benign overfitting extended to unsupervised settings.

Wikipedia articles establish definitions for PCA, eigendecomposition, expectation-maximization, and independent component analysis. Research papers on implicit bias of SGD (Liang et al., Barrett & Hardt) and neural network implicit regularization are essential for understanding Problem 04. Scikit-learn documentation provides practical implementation examples for PCA and clustering.