XCS229 Machine Learning

SCPD · Stanford

XCS229 Machine Learning

Supervised Learning Foundations

Linear regression and logistic regression form the foundation of supervised learning. Linear regression minimizes squared error: min ‖y − Xβ‖². The closed-form solution is β* = (XᵀX)⁻¹Xᵀy (normal equation). Alternatively, gradient descent iteratively updates: β ← β − α∇J(β), where J(β) is the loss.

Logistic regression applies the same principle to classification. It models P(y=1|x) = σ(θᵀx) where σ is the sigmoid. Gradient descent on log-loss drives the decision boundary toward the data. Both methods are probabilistically motivated: linear regression via Gaussian error, logistic regression via Bernoulli likelihood. These form the conceptual and computational backbone of modern ML.

β*=(XᵀX)⁻¹Xᵀy

Normal Equation

β ← β−α∇J

Gradient Descent

σ(θᵀx)

Logistic (Classification)

J(θ)=−Σylog(ŷ)

Log-Loss

Generative vs Discriminative

Generative models learn the joint distribution P(x, y) by modeling P(x|y)P(y). They can generate new samples and naturally handle missing data. Discriminative models learn P(y|x) directly, focusing on the decision boundary. GDA (Gaussian Discriminant Analysis) assumes x|y follows a multivariate Gaussian; Naive Bayes assumes conditional independence of features given y.

Naive Bayes on text data: P(y|x₁,...,xd) ∝ P(y)∏P(xᵢ|y). Laplace smoothing adds pseudocounts to avoid zero probabilities. Softmax regression extends logistic regression to K > 2 classes. The asymptotic efficiency (as n→∞) often favors discriminative models, but with small n and strong priors, generative models excel. Understanding both perspectives reveals complementary strengths.

GDA: P(x|y)P(y) Naive Bayes: ∏P(xᵢ|y) Softmax: K-class logistic Laplace: +1 smoothing

Kernel Methods & SVMs

Kernel methods extend linear classifiers to nonlinear problems by implicitly mapping data to high-dimensional feature spaces. The kernel trick K(x,y) = ⟨φ(x), φ(y)⟩ enables efficient computation without materializing the features. Support Vector Machines find the maximum-margin separating hyperplane in feature space, solving via dual optimization that depends only on K(xᵢ,xⱼ).

Common kernels: polynomial K(x,y) = (⟨x,y⟩+c)ᵈ encodes degree-d features; RBF K(x,y) = exp(−γ‖x−y‖²) maps to infinite-dimensional space. Mercer's theorem characterizes valid kernels as positive semi-definite functions. SVMs are unmatched for small-to-medium datasets and structured data; they remain essential tools for interpretable nonlinear learning with uncertainty quantification.

K(x,y)

Kernel

max-margin

Objective

αᵢ≠0

Support Vectors

PSD

Valid (Mercer)

Deep Learning Foundations

Neural networks are compositions of nonlinear functions: h₁ = σ(W₁x + b₁), h₂ = σ(W₂h₁ + b₂), ..., f = Wₖhₖ₋₁. Backpropagation computes gradients via the chain rule: ∂L/∂W = (∂L/∂z)(∂z/∂W), enabling efficient training of deep networks. Modern architectures (CNNs, RNNs, Transformers) inject inductive biases: convolution for images, recurrence for sequences, attention for long-range dependencies.

Deep learning excels on large, unstructured datasets by learning hierarchical feature representations. The bias-variance trade-off remains critical: regularization (dropout, L2, early stopping) prevents overfitting. Neural networks are universal approximators but require careful hyperparameter tuning and can be unstable to train. Their expressiveness comes at the cost of interpretability and computational demand.

h = σ(Wx+b) Backpropagation: chain rule ReLU, Tanh activations Dropout, L2 regularization

Unsupervised Learning

K-means clustering partitions data into k clusters by minimizing Σᵢ‖xᵢ − μc(i)‖². The algorithm alternates: assign points to nearest centroid, recompute centroids. EM (Expectation-Maximization) generalizes this to probabilistic models. Mixture of Gaussians models data as drawn from k Gaussians; EM iteratively estimates responsibilities (soft assignments) and parameters. Factor Analysis discovers latent factors underlying observed data.

ICA (Independent Component Analysis) recovers hidden independent sources from mixed observations, crucial for blind source separation. These methods reveal structure when labels are unavailable. Unsupervised learning is often the first step: dimensionality reduction, anomaly detection, data exploration. The lack of ground truth makes evaluation challenging, relying on internal metrics (silhouette, gap statistic) and domain expertise.

K-means

Hard Clustering

EM Algorithm

Soft Clustering

MoG

Probabilistic

ICA

Blind Separation

Dimensionality Reduction

PCA (Principal Component Analysis) finds orthogonal directions of maximum variance. Compute SVD of centered data: X = UΣVᵀ. The top k singular vectors form a projection matrix; ẑ = Vₖᵀx projects x to k dimensions. PCA is unsupervised, interpretable, and linear. ICA recovers independent components by assuming non-Gaussian sources and maximizing statistical independence.

These methods address the curse of dimensionality: high-dimensional spaces are sparse and distances become meaningless. Reduction speeds computation, aids visualization, and often improves generalization by removing noise. Trade-off: loss of information. Modern alternatives (VAEs, autoencoders) learn nonlinear reductions. Understanding PCA's limitations (assumes linearity, sensitive to scale) is crucial for applying dimensionality reduction effectively.

PCA: variance maximization SVD: X = UΣVᵀ ICA: independence Whitening, centering

Generalization Theory

The bias-variance decomposition reveals why models fail. High bias: underfitting (model too simple). High variance: overfitting (model too flexible). Regularization penalizes complexity: L2 adds λ‖β‖², L1 adds λ‖β‖₁. Cross-validation estimates test error without a held-out set: k-fold splits data into k folds, trains k times, averages error.

VC dimension measures model expressiveness: a set is shattered if the model can separate any dichotomy. Theoretical generalization bounds via Rademacher complexity show overfitting risk grows with model capacity and shrinks with data size. Practical debugging: plot train vs test error to diagnose bias vs variance. Error analysis: manually examine mistakes to identify systematic issues. These tools guide model selection and hyperparameter tuning.

Train Error

Bias

Test − Train

Variance

Unbiased Est

VC Dim

Capacity

Course Roadmap & Problem Sets

XCS229 spans 10 weeks and 5 problem sets (PS1–PS5, 200 pts total). Weeks 1–2 cover regression and gradient descent (PS1). Weeks 3–4 introduce classification, generative models, SVMs (PS2). Week 5–6 focus on clustering and EM (PS3). Weeks 7–8 build neural networks, backprop, and regularization (PS4). Weeks 9–10 synthesize generalization theory and practical debugging (PS5).

Problem sets are the course's heart: linear algebra, probability, coding implementations. Start early, read error messages, test incrementally. Office hours and piazza offer guidance. The course progresses from simple (linear models) to complex (neural networks, VC theory), building intuition at each stage. Success requires implementing algorithms, understanding mathematics, and developing debugging skills. By the end, you'll have built classifiers, trained neural networks, and applied unsupervised learning—the foundation for research and industry applications.

10 weeks 5 problem sets 200 points ~5 hrs/week coding

Sources & References

References and sources for further study on the topics covered in this deep dive.

Supervised Learning Foundations

Generative vs Discriminative

Kernel Methods & SVMs

Deep Learning Foundations

Unsupervised Learning

Dimensionality Reduction

Generalization Theory

Course Roadmap & Problem Sets

Sources & References

Supervised Learning Foundations

Linear Regression

Logistic Regression

Probabilistic Interpretation

Gradient Descent Variants

Generative vs Discriminative

Gaussian Discriminant Analysis (GDA)

Naive Bayes

Softmax Regression (K-class)

Kernel Methods & SVMs

Common Kernels

Support Vector Machines

Mercer's Theorem

Deep Learning Foundations

Architecture

Backpropagation

Regularization & Generalization

Unsupervised Learning

K-Means Clustering

Expectation-Maximization (EM)

Factor Analysis & ICA

Dimensionality Reduction

Principal Component Analysis (PCA)

Independent Component Analysis (ICA)

Trade-offs

Generalization Theory

Bias-Variance Decomposition

Cross-Validation

VC Dimension & Rademacher Complexity

Regularization

Course Roadmap & Problem Sets

Week-by-Week Breakdown

Problem Set Tips

Conceptual Connections

Sources & References

Course Materials & Lectures

Foundational Textbooks

Visual Learning Resources

Reference Materials