XCS229 Problem Set 5 is a comprehensive exploration of unsupervised and semi-supervised learning, with 55 points distributed across four main problems. Each problem builds on core concepts from earlier problem sets while introducing new algorithmic and theoretical challenges. The assignment spans weeks 8–10 and serves as a bridge between classical statistical learning and modern deep learning practice.
The four problems are tightly integrated: PCA provides dimensionality reduction and feature discovery; semi-supervised EM shows how to leverage unlabeled data; ICA demonstrates an alternative notion of latent structure based on independence rather than variance; and implicit regularization explains why overparameterized models generalize. Together, they form a coherent narrative about learning from complex, high-dimensional data with limited supervision.
Problem Structure
Problem 01 (PCA): Implement PCA from scratch, analyze scree plots, and investigate latent structure discovery. Approximately 15 points.
Problem 02 (Semi-supervised EM): Extend EM to combine labeled and unlabeled data. Analyze label efficiency and convergence. Approximately 15 points.
Problem 03 (ICA): Apply or implement ICA for source separation, understanding independence as an alternative to variance. Approximately 15 points.
Problem 04 (Implicit Regularization): Study why overparameterized networks generalize and how SGD implicitly regularizes. Approximately 10 points.
Learning Objectives
- Understand the mathematics of dimensionality reduction via eigendecomposition and variance maximization.
- Implement and extend probabilistic latent-variable models (mixture models, EM algorithm).
- Discover and interpret latent structure in unlabeled and semi-labeled data.
- Appreciate the diversity of approaches to finding structure: variance, independence, implicit bias.
- Gain intuition for when unsupervised, semi-supervised, and supervised learning are appropriate.
Resources & Prerequisites
Strong familiarity with linear algebra (eigendecomposition, matrix decomposition), probability (mixture models, expectation-maximization), and numpy/scipy is essential. References include Murphy's "Machine Learning: A Probabilistic Perspective" (chapters on PCA, EM, ICA) and recent papers on implicit regularization in neural networks (e.g., Bartlett et al., Gunasekar et al.).
The PCA implementation problem asks students to build a complete PCA pipeline: data centering, covariance computation, eigendecomposition, and projection. This foundational work develops intuition for how PCA discovers the directions of maximum variance and why these directions form an effective low-dimensional representation.
Algorithm Steps
Step 1: Center the Data Subtract the mean from each feature so that the data is centered at the origin. This ensures the covariance matrix is computed correctly.
Step 2: Compute Covariance Matrix Calculate Σ = (1/n) XT X (for centered data) or use numpy's cov function. This captures pairwise feature correlations.
Step 3: Eigendecomposition Solve Σ v = λ v using numpy.linalg.eig or scipy.linalg.eigh. Sort eigenvalues in descending order and corresponding eigenvectors; these are the principal components.
Step 4: Select k Components Choose k by examining cumulative explained variance. Typical heuristics: retain 90% of variance, elbow method on scree plot, or domain knowledge about target dimensionality.
Step 5: Project Data Compute Z = X U_k where U_k contains the top k eigenvectors. This k-dimensional representation preserves the most variance in the original data.
Step 6: Reconstruction & Error Reconstruct via X̂ = Z U_kT and compute MSE loss ||X − X̂||2. Lower reconstruction error indicates the chosen k captures essential structure.
Practical Considerations
- Numerical Stability: Use scipy.linalg.eigh (symmetric eigendecomposition) rather than numpy.linalg.eig when possible; it's more stable.
- Centering: Always center data before computing covariance, and store the mean to apply the same transformation at test time.
- Interpretation: Examine the loadings (entries of U_k) to understand which original features contribute most to each principal component.
- Scaling: If features have different units or ranges, standardize (divide by std. dev.) before PCA to avoid bias toward high-variance features.
Expected Outcomes
On well-behaved datasets (e.g., MNIST, Iris), PCA should recover meaningful structure: clusters separate in 2D or 3D projections, reconstruction error plateaus, and scree plots show clear elbows. Reconstruction error decreases as k increases, asymptotically approaching zero as k approaches the original dimension.
The analysis phase turns PCA implementation into interpretation. Students produce detailed plots, compute diagnostic statistics, and draw conclusions about the latent structure revealed by principal components. This is where the algorithm becomes a tool for data understanding.
Scree Plot & Explained Variance
Plot eigenvalues λ_i (or cumulative explained variance Σ_{j≤i} λ_j / Σ_j λ_j) against component index. Steep drops indicate a low-intrinsic-dimensional structure; flat tails suggest the data is genuinely high-dimensional or noisy. Interpret the plot: if the curve elbows sharply at k=5, the data is well-approximated in 5 dimensions.
Visualization of Projections
For datasets with known class labels, plot projections onto the first two or three principal components and color-code by class. Do clusters separate cleanly? Are there overlaps or outliers? Perfect separation suggests PCA has discovered the true underlying structure; overlap indicates either non-linear structure (where PCA fails) or noisy data.
Component Loadings
Analyze the loadings matrix U_k: each column is a principal component (direction in original feature space). Visualize as a heatmap or bar chart: large positive/negative loadings indicate features that strongly influence that component. This reveals which original features are "important" for capturing variance.
Intrinsic Dimensionality
Use PCA results to estimate the intrinsic dimensionality of the data. If 90% of variance is captured by k = 5 components out of 100, the intrinsic dimension is approximately 5, suggesting the data lies near a 5-dimensional subspace of the full ambient space. This has profound implications for downstream learning and generalization.
Comparative Analysis
Compare PCA on the full dataset versus PCA per-class, PCA on subsamples, or PCA after adding noise. How robust is the structure? How sensitive are the principal directions to data perturbations? This builds confidence in findings and identifies potential artifacts.
Interpretation Template
- Summarize the scree plot: at what k do you see diminishing returns?
- Describe cluster structure in the 2D/3D projection: well-separated, overlapping, non-convex?
- Identify the top 3 principal components and interpret their loadings in terms of the original features.
- Estimate intrinsic dimensionality and discuss what this means for the data.
- Discuss any surprising findings or aspects of the data not captured by PCA.
Semi-supervised learning bridges unsupervised and supervised learning by combining a small set of labeled examples with a much larger set of unlabeled examples. The semi-supervised EM algorithm leverages both: the labeled data anchors the cluster centers, while the unlabeled data refines the estimates by providing information about the overall density and structure.
Problem Setup
Given: X_L (labeled data, n_L examples with known cluster assignments), X_U (unlabeled data, n_U examples). Assume a mixture model p(x,z) = Σ_k π_k N(x | μ_k, Σ_k) where z ∈ {1,...,K} is the cluster assignment. Goal: estimate θ = {π_k, μ_k, Σ_k} leveraging both labeled and unlabeled information.
Modified E-Step
For labeled examples: responsibility r_{ik} is deterministic. If x_i has known label z_i = k, set r_{ik} = 1 and r_{ij} = 0 for j ≠ k. For unlabeled examples: compute soft responsibilities using Bayes' rule: r_{ik} = π_k N(x_i | μ_k, Σ_k) / Σ_j π_j N(x_i | μ_j, Σ_j). This blends hard (labeled) and soft (unlabeled) assignments.
Modified M-Step
Update parameters using weighted contributions from both groups. For cluster k:
π_k ← (Σ_i r_{ik}) / (n_L + n_U) (averaged over all examples)
μ_k ← (Σ_i r_{ik} x_i) / (Σ_i r_{ik}) (weighted mean)
Σ_k ← (Σ_i r_{ik} (x_i − μ_k)(x_i − μ_k)T) / (Σ_i r_{ik}) (weighted covariance)
The labeled examples "vote" with certainty; unlabeled examples "vote" softly.
Convergence & Implementation
Iterate E and M steps until convergence, typically monitored by log-likelihood change < ε or parameter changes < ε. Log-likelihood combines labeled and unlabeled likelihoods:
ℓ = Σ_i log p(x_i, z_i | θ) + Σ_i log p(x_i | θ). Typical runtime: 10–50 iterations on most datasets. Initialize using k-means on unlabeled data or by centering labeled examples at their cluster means.
Practical Issues
- Initialization: Sensitive to starting point; try multiple random initializations and select the best log-likelihood.
- Covariance Singularity: If a cluster has few examples, Σ_k can become singular. Regularize by adding a small diagonal term (e.g., 0.01 * I).
- Unlabeled Data Help: Significant gains occur when clusters are well-separated and label budget is tight (e.g., 5 labels per class with 1000s of unlabeled). Diminishing returns if labels are abundant or clusters overlap severely.
- Model Mismatch: If the true data distribution is not a mixture of Gaussians, the method may underperform. Consider non-parametric alternatives (e.g., semi-supervised k-NN) if Gaussian assumption is violated.
Evaluation Metrics
Compare semi-supervised EM against: (1) fully supervised learning on X_L alone; (2) fully unsupervised EM on X_L ∪ X_U; (3) standard supervised baselines (logistic regression, SVM). Metric: classification accuracy on a held-out test set, or log-likelihood on test data. Gains from semi-supervised learning are most pronounced in low-label regimes.
The analysis phase examines when, why, and how much semi-supervised learning helps. By varying the number of labeled examples and evaluating performance, students gain intuition for the label efficiency gains and the conditions under which unlabeled data is most beneficial.
Label Efficiency Curves
Plot test accuracy as a function of the number of labeled examples (e.g., 1, 5, 10, 50, 100 per class). Compare three curves: (1) supervised learning using only labeled data; (2) semi-supervised EM using labeled + unlabeled; (3) fully supervised baseline (e.g., using all data as labeled). The gap between (1) and (2) quantifies the benefit of unlabeled data. Typical finding: semi-supervised is strongest when labeled budget is 1–10% of total, converging toward supervised performance as labeled data increases.
Data Regime Analysis
Helpful Regime: Clear cluster structure, well-separated classes, and tight label budget. Unlabeled data provides strong prior on cluster geometry; combining with few labels yields near-supervised performance.
Unhelpful Regime: Highly overlapping clusters, non-Gaussian data, or abundant labels. Unlabeled data provides little new information; method approaches supervised baseline.
Harmful Regime: Model misspecification (true data is not a mixture of Gaussians). Wrong cluster structure can degrade performance compared to supervised-only; label budget is wasted fitting incorrect models.
Sensitivity to k (number of clusters)
Evaluate semi-supervised EM for k = 2, 3, ..., 20 on a fixed labeled/unlabeled split. Plot accuracy vs. k. Ideally, peak accuracy occurs at the true number of clusters. If the peak is sharp, the method is sensitive to k (requiring careful tuning); if broad, the method is robust. Use model selection criteria (BIC, AIC, held-out validation) to select k automatically.
Convergence & Stability
Plot log-likelihood vs. iteration number for both labeled and unlabeled portions. Monitor for: (1) smooth convergence (sign of stable algorithm); (2) log-likelihood plateaus (sign of convergence); (3) oscillations or divergence (sign of numerical issues or poor initialization). Test multiple random initializations and report the mean and std. dev. of final accuracy.
Visualization of Learned Structure
After fitting semi-supervised EM, visualize the learned cluster centers (μ_k) and covariances (Σ_k) in PCA space (using PCA from the full unlabeled dataset as a pre-processing step). Color-code labeled examples by their true labels; color-code unlabeled examples by their soft cluster assignments (e.g., assign to the argmax_k r_{ik} cluster). Do the learned clusters align with true labels? Are there misaligned unlabeled examples?
Comparative Analysis
Compare semi-supervised EM against:
- k-NN (supervised): Simple baseline using only labeled data.
- Semi-supervised k-NN: Use unlabeled examples as pseudo-labeled neighbors; measure accuracy.
- Self-training: Iteratively label high-confidence unlabeled examples and re-train; compare with EM approach.
- Co-training: If features split naturally, train separate classifiers and iteratively label; measure performance.
Summarize which methods work best in which regimes and why. Discuss computational cost: semi-supervised EM is O(n_L + n_U) × K per iteration, making it scalable to large n_U; k-NN is O(n_U × n_L) for inference, potentially slower.
Independent Component Analysis (ICA) seeks to recover independent source signals from their linear mixture. Unlike PCA which finds directions of maximum variance, ICA finds directions of maximum independence. The problem is framed as: given X = A S where X is observed (n × m), A is unknown mixing (n × k), and S is unknown sources (k × m), recover S and A (up to permutation and scaling).
Mathematical Formulation
Assume each source s_j is independently distributed with non-Gaussian distribution p(s_j). The ICA model is x_i = Σ_j a_{ij} s_j + noise. The goal is to find a demixing matrix W such that y_j = Σ_i w_{ji} x_i estimates the sources s_j. If W = A−1 (up to permutation/scaling), we recover the sources.
Non-Gaussianity & Independence
Why Non-Gaussianity? By the Central Limit Theorem, a sum of independent random variables approaches Gaussianity. Therefore, the mixture X (a sum of independent non-Gaussian sources) is "more Gaussian" than the original sources S. ICA exploits this: find directions where data is least Gaussian (most non-Gaussian).
Measures of Non-Gaussianity: Kurtosis (4th moment), negentropy (differential entropy relative to Gaussian), or mutual information. Typical approach: maximize absolute kurtosis or negentropy subject to orthogonality (after whitening).
Algorithm: FastICA
Step 1: Whitening Apply PCA to X to obtain X̃ = X U D−1/2 UT where U D UT is eigendecomposition of Cov(X). This decorrelates X and normalizes variance to 1, pre-conditioning the problem.
Step 2: Initialization Start with random orthogonal matrix W₀ (via QR decomposition of random matrix).
Step 3: Fixed-Point Iteration For each component j, iterate:
w_j ← E[x̃ g(w_jT x̃)] − E[g'(w_jT x̃)] w_j
where g is the non-linearity (e.g., g(u) = tanh(u), g'(u) = 1 − tanh²(u)), then orthogonalize w_j against other components. Repeat until convergence (typically 10–50 iterations).
Step 4: De-mixing Recover sources: S = W X̃ (for whitened data) or transform back to original scale if needed.
Identifiability Issues
ICA recovers sources up to: (1) Permutation (component order is arbitrary); (2) Scaling (source amplitude is ambiguous: s_j and −s_j are equally valid). The model cannot distinguish which source appeared first or whether a source is scaled by ±λ. These are fundamental identifiability limits, not algorithmic failures.
Evaluation & Validation
On synthetic data (e.g., mixed audio signals or natural images):
1. Independence Check: Compute mutual information between recovered components; should be near zero for truly independent sources. Alternatively, compute correlation or higher-order dependence measures.
2. Signal Quality: If ground-truth sources are available, measure MSE or correlation between recovered and true sources (up to permutation and sign).
3. Permutation Matching: Recovered components may appear in any order; match them to ground truth via max-correlation / min-distance heuristics.
Practical Challenges
- Non-Gaussianity Requirement: ICA fails if sources are Gaussian (e.g., pure noise); other methods needed.
- Scaling Ambiguity: Source magnitude is recovered up to sign/scale; absolute amplitude cannot be determined from data alone.
- Initialization Sensitivity: FastICA can converge to local optima; try multiple random initializations and select best by non-Gaussianity criterion.
- Number of Sources: Must specify k (number of sources) in advance; over/underspecifying k degrades results.
Applications
ICA is powerful for: cocktail party effect (separating mixed speech signals), biomedical signal processing (EEG, fMRI artifact removal), image separation, and anomaly detection (directions of independence may reveal hidden structure).
A striking modern phenomenon: neural networks with far more parameters than training examples (e.g., 1M parameters, 10K examples) generalize well despite having zero training error. Traditional theory predicts severe overfitting. The explanation: Stochastic Gradient Descent (SGD) has an implicit bias toward simple solutions, finding the minimum-norm solution that fits the data—a form of automatic regularization.
Empirical Setup
Train a neural network (e.g., 2-3 hidden layers, widths much larger than dataset size) on a classification or regression task. Plot training loss and test loss vs. iteration. Typical pattern: training loss decreases to zero; test loss initially decreases, then plateaus or increases slightly (overfitting). Measure final test accuracy and compare to explicit L2 regularization (weight decay), dropout, or early stopping.
The Implicit Bias of SGD
Theory (Bartlett et al., Gunasekar et al.) shows: over sufficiently many iterations, SGD finds a solution close to the minimum-norm solution—the parameter vector θ* = argmin ||θ|| subject to fitting the training data. This is as if SGD automatically applied L2 regularization, but without tuning a regularization coefficient. The norm ||θ|| correlates with model complexity; simpler models generalize better.
Early Stopping as Regularization
Instead of running SGD to convergence, stop early (e.g., when validation loss plateaus or after a fixed number of iterations). Early stopping interrupts the optimization before SGD converges to the minimum-norm solution, landing at a nearby point with better test performance. This is equivalent to implicit regularization without explicit regularization machinery.
Measure test accuracy for different stopping times: t = 10, 50, 100, 500, 1000, ... iterations. Typically, there is an "Goldilocks" zone where test accuracy peaks; stopping too early gives high bias, stopping too late gives high variance.
Explicit Regularization Comparison
Train the same network with explicit L2 regularization (weight decay λ = 0, 0.01, 0.1, 1.0) and measure test accuracy. How do SGD's implicit regularization and explicit L2 regularization compare? Typical finding: explicit regularization helps a little, but SGD's implicit bias is the dominant effect. The two mechanisms are complementary.
Overparameterization & Generalization
Train networks of increasing width (width = 50, 100, 500, 1000, 5000) on the same dataset. Plot test accuracy vs. width. Classical theory predicts accuracy decreases (overfitting). However, modern empirical results often show: test accuracy improves with width, even to extreme overparameterization. This reversal of classical understanding is explained by implicit regularization: wider networks are easier to optimize (SGD finds good minima faster) and implicitly regularize toward simpler solutions.
Theoretical Framework
For linear models y = θT x, SGD finds the minimum-norm solution θ* = argmin ||θ|| s.t. Xθ = y (least-squares fit). For nonlinear models, the implicit bias is toward solutions with small norm in a particular metric (related to the network's activation patterns). The bias strength depends on step size, number of iterations, and data geometry.
Practical Implications
- Implicit Regularization is Free: Neural networks regularize automatically; no need to carefully tune L2 coefficients.
- Early Stopping Matters: Generalization often improves by stopping before convergence; monitor validation performance.
- Overparameterization Helps: Wider networks have implicit regularization and easier optimization landscape; don't fear overfitting.
- Step Size & Iterations: Implicit regularization depends on both; larger step sizes and more iterations lead to sparser, simpler solutions.
Research Frontiers
Current research extends these ideas: characterizing implicit bias for different optimizers (Adam, momentum), understanding when implicit bias fails (e.g., double descent), connecting implicit regularization to data geometry and model architecture. This is an active frontier bridging optimization, statistics, and deep learning.
XCS229 Problem Set 5 synthesizes four major areas of machine learning, each offering a distinct lens on learning from data. Together, they form a comprehensive view of modern machine learning practice.
Unsupervised Learning: Discovering Structure
PCA & ICA are foundational unsupervised methods that discover latent structure without labels. PCA finds directions of maximum variance; ICA finds independent components. Both reveal hidden structure, enable visualization, and provide features for downstream supervised learning. The lesson: not all structure is about predicting labels; understanding data geometry is intrinsically valuable.
Semi-supervised Learning: Leveraging Unlabeled Data
Semi-supervised EM shows how to combine labeled and unlabeled data synergistically. The key insight: unlabeled data reveals the global structure (cluster locations, densities), while labeled data anchors this structure. This regime—abundant unlabeled data, scarce labels—is increasingly common in practice (e.g., web-scale data, pre-training). Semi-supervised learning is a practical necessity, not just a theoretical curiosity.
Implicit Regularization: Why Simple Algorithms Work
Implicit regularization explains a fundamental mystery of modern machine learning: why do overparameterized neural networks generalize despite fitting all training data? The answer: optimization algorithms (SGD) have built-in bias toward simple solutions. This has profound implications: we don't need complex regularization machinery; we can trust simple, fast algorithms; and overparameterization is not inherently bad—it aids both optimization and generalization.
Unifying Themes
1. Structure & Inductive Bias: All methods encode assumptions about data structure: PCA assumes low-rank structure, ICA assumes independence, semi-supervised EM assumes mixture structure, SGD assumes simplicity. Success requires matching the method's assumptions to the true data structure.
2. Latent Variables & Representation Learning: Beneath complex high-dimensional data lie simpler latent factors. Discovering these factors (via PCA, ICA, EM) is at the heart of representation learning, increasingly central to modern deep learning (autoencoders, VAEs, diffusion models).
3. Scaling & Practical Considerations: Algorithms that scale to large data (e.g., online EM, approximate ICA, SGD) often have advantages beyond speed. They naturally regularize, avoid overfitting to the full dataset, and adapt to streaming settings. Theory and practice align: scalable algorithms often have the best generalization.
When to Use What
- PCA: Dimensionality reduction, visualization, preprocessing for other algorithms, understanding variance-based structure.
- ICA: Source separation (cocktail party problem), artifact removal, discovering independent factors that PCA would miss.
- Semi-supervised EM: Learning with few labels but abundant unlabeled data; mixture model assumptions are reasonable; need probabilistic outputs (e.g., cluster probabilities).
- Implicit Regularization / SGD: Training deep neural networks; expect generalization despite overfitting; monitor test performance for early stopping.
Looking Forward
These four problem areas are not isolated. Modern deep learning combines unsupervised pre-training (learning representations without labels, akin to PCA/ICA), semi-supervised fine-tuning (leveraging limited labeled data), and implicit regularization (via SGD). Representation learning, self-supervised learning, and zero-shot/few-shot learning all build on concepts from PS5. Mastering these fundamentals prepares for research and practice at the frontier of machine learning.
Resources for Further Learning
- Books: Murphy "Machine Learning: A Probabilistic Perspective", Goodfellow et al. "Deep Learning"
- Papers: Bartlett et al. on implicit regularization, Gunasekar et al. on gradient flow, Cohen et al. on implicit bias of overparameterization
- Courses: Stanford CS229 (full course), Stanford CS224N/231N (deep learning), MIT 6.869 (advanced vision)
- Practice: Kaggle competitions, academic datasets (UCI, OpenML), preprints on arxiv.org
Stanford CS229 Course Materials
Stanford CS229: Machine Learning — Official course page with complete lectures on unsupervised learning, including PCA, clustering, EM algorithm, ICA, and regularization theory.
CS229 Lecture Notes & Sections — Detailed notes on all unsupervised learning topics and advanced generalization theory. Weeks 8-10 materials directly support Problem Set 5.
Dimensionality Reduction & PCA
Wikipedia: Principal Component Analysis — Comprehensive treatment of PCA, eigendecomposition, variance explained, and interpretation of principal components.
Wikipedia: Eigenvalues and Eigenvectors — Mathematical foundations essential for understanding PCA's eigendecomposition of the covariance matrix.
Expectation-Maximization & Mixture Models
Wikipedia: Expectation-Maximization Algorithm — Detailed treatment of EM algorithm, convergence properties, and applications to latent variable models.
Wikipedia: Mixture Model — Overview of mixture models, Gaussian mixture models, and their role in unsupervised learning.
Source Separation & ICA
Wikipedia: Independent Component Analysis — Introduction to ICA, independence assumption, contrast functions, and blind source separation problem.
Wikipedia: Blind Source Separation — Applications of ICA including cocktail party problem and artifact removal in signal processing.
Regularization & Implicit Bias
Wikipedia: Regularization — Mathematical foundations of regularization, L1/L2 penalties, and their effects on learning.
Wikipedia: Stochastic Gradient Descent — Overview of SGD algorithm, learning rate schedules, and implicit regularization properties.
Comprehensive References
The Elements of Statistical Learning (ESL) - Free PDF — Authoritative reference covering unsupervised learning (Chapter 14), dimensionality reduction, clustering, and regularization theory. Essential reading for all PS5 topics.
Deep Learning by Goodfellow, Bengio, Courville — Modern textbook covering neural networks, regularization techniques, and implicit bias in deep learning.
Machine Learning: A Probabilistic Perspective by Kevin Murphy — Comprehensive probabilistic treatment of EM, mixture models, latent variable models, and dimensionality reduction.
Research Papers on Implicit Regularization
arXiv.org — Search for: "implicit regularization" + "SGD", "benign overfitting", "implicit bias gradient descent". Key authors: Bartlett, Belkin, Liang, Gunasekar, Cohen.
Study Roadmap for PS5
Start with Stanford CS229 lectures on unsupervised learning. Review PCA and eigendecomposition fundamentals via Wikipedia. For EM algorithm, consult CS229 notes and ESL Chapter 8. For ICA, study contrast functions and independence testing. For implicit regularization, read recent papers on benign overfitting and SGD bias. Finally, synthesize all topics by understanding how modern deep learning combines unsupervised pre-training, semi-supervised fine-tuning, and implicit regularization from SGD.