SCPD · Stanford
XCS229 Machine Learning
01

Supervised Learning Foundations

Linear regression and logistic regression form the foundation of supervised learning. Linear regression minimizes squared error: min ‖y − Xβ‖². The closed-form solution is β* = (XᵀX)⁻¹Xᵀy (normal equation). Alternatively, gradient descent iteratively updates: β ← β − α∇J(β), where J(β) is the loss.

Logistic regression applies the same principle to classification. It models P(y=1|x) = σ(θᵀx) where σ is the sigmoid. Gradient descent on log-loss drives the decision boundary toward the data. Both methods are probabilistically motivated: linear regression via Gaussian error, logistic regression via Bernoulli likelihood. These form the conceptual and computational backbone of modern ML.

β*=(XᵀX)⁻¹Xᵀy
Normal Equation
β ← β−α∇J
Gradient Descent
σ(θᵀx)
Logistic (Classification)
J(θ)=−Σylog(ŷ)
Log-Loss
02

Generative vs Discriminative

Generative models learn the joint distribution P(x, y) by modeling P(x|y)P(y). They can generate new samples and naturally handle missing data. Discriminative models learn P(y|x) directly, focusing on the decision boundary. GDA (Gaussian Discriminant Analysis) assumes x|y follows a multivariate Gaussian; Naive Bayes assumes conditional independence of features given y.

Naive Bayes on text data: P(y|x₁,...,xd) ∝ P(y)∏P(xᵢ|y). Laplace smoothing adds pseudocounts to avoid zero probabilities. Softmax regression extends logistic regression to K > 2 classes. The asymptotic efficiency (as n→∞) often favors discriminative models, but with small n and strong priors, generative models excel. Understanding both perspectives reveals complementary strengths.

GDA: P(x|y)P(y) Naive Bayes: ∏P(xᵢ|y) Softmax: K-class logistic Laplace: +1 smoothing
03

Kernel Methods & SVMs

Kernel methods extend linear classifiers to nonlinear problems by implicitly mapping data to high-dimensional feature spaces. The kernel trick K(x,y) = ⟨φ(x), φ(y)⟩ enables efficient computation without materializing the features. Support Vector Machines find the maximum-margin separating hyperplane in feature space, solving via dual optimization that depends only on K(xᵢ,xⱼ).

Common kernels: polynomial K(x,y) = (⟨x,y⟩+c)ᵈ encodes degree-d features; RBF K(x,y) = exp(−γ‖x−y‖²) maps to infinite-dimensional space. Mercer's theorem characterizes valid kernels as positive semi-definite functions. SVMs are unmatched for small-to-medium datasets and structured data; they remain essential tools for interpretable nonlinear learning with uncertainty quantification.

K(x,y)
Kernel
max-margin
Objective
αᵢ≠0
Support Vectors
PSD
Valid (Mercer)
04

Deep Learning Foundations

Neural networks are compositions of nonlinear functions: h₁ = σ(W₁x + b₁), h₂ = σ(W₂h₁ + b₂), ..., f = Wₖhₖ₋₁. Backpropagation computes gradients via the chain rule: ∂L/∂W = (∂L/∂z)(∂z/∂W), enabling efficient training of deep networks. Modern architectures (CNNs, RNNs, Transformers) inject inductive biases: convolution for images, recurrence for sequences, attention for long-range dependencies.

Deep learning excels on large, unstructured datasets by learning hierarchical feature representations. The bias-variance trade-off remains critical: regularization (dropout, L2, early stopping) prevents overfitting. Neural networks are universal approximators but require careful hyperparameter tuning and can be unstable to train. Their expressiveness comes at the cost of interpretability and computational demand.

h = σ(Wx+b) Backpropagation: chain rule ReLU, Tanh activations Dropout, L2 regularization
05

Unsupervised Learning

K-means clustering partitions data into k clusters by minimizing Σᵢ‖xᵢ − μc(i)‖². The algorithm alternates: assign points to nearest centroid, recompute centroids. EM (Expectation-Maximization) generalizes this to probabilistic models. Mixture of Gaussians models data as drawn from k Gaussians; EM iteratively estimates responsibilities (soft assignments) and parameters. Factor Analysis discovers latent factors underlying observed data.

ICA (Independent Component Analysis) recovers hidden independent sources from mixed observations, crucial for blind source separation. These methods reveal structure when labels are unavailable. Unsupervised learning is often the first step: dimensionality reduction, anomaly detection, data exploration. The lack of ground truth makes evaluation challenging, relying on internal metrics (silhouette, gap statistic) and domain expertise.

K-means
Hard Clustering
EM Algorithm
Soft Clustering
MoG
Probabilistic
ICA
Blind Separation
06

Dimensionality Reduction

PCA (Principal Component Analysis) finds orthogonal directions of maximum variance. Compute SVD of centered data: X = UΣVᵀ. The top k singular vectors form a projection matrix; ẑ = Vₖᵀx projects x to k dimensions. PCA is unsupervised, interpretable, and linear. ICA recovers independent components by assuming non-Gaussian sources and maximizing statistical independence.

These methods address the curse of dimensionality: high-dimensional spaces are sparse and distances become meaningless. Reduction speeds computation, aids visualization, and often improves generalization by removing noise. Trade-off: loss of information. Modern alternatives (VAEs, autoencoders) learn nonlinear reductions. Understanding PCA's limitations (assumes linearity, sensitive to scale) is crucial for applying dimensionality reduction effectively.

PCA: variance maximization SVD: X = UΣVᵀ ICA: independence Whitening, centering
07

Generalization Theory

The bias-variance decomposition reveals why models fail. High bias: underfitting (model too simple). High variance: overfitting (model too flexible). Regularization penalizes complexity: L2 adds λ‖β‖², L1 adds λ‖β‖₁. Cross-validation estimates test error without a held-out set: k-fold splits data into k folds, trains k times, averages error.

VC dimension measures model expressiveness: a set is shattered if the model can separate any dichotomy. Theoretical generalization bounds via Rademacher complexity show overfitting risk grows with model capacity and shrinks with data size. Practical debugging: plot train vs test error to diagnose bias vs variance. Error analysis: manually examine mistakes to identify systematic issues. These tools guide model selection and hyperparameter tuning.

Train Error
Bias
Test − Train
Variance
CV
Unbiased Est
VC Dim
Capacity
08

Course Roadmap & Problem Sets

XCS229 spans 10 weeks and 5 problem sets (PS1–PS5, 200 pts total). Weeks 1–2 cover regression and gradient descent (PS1). Weeks 3–4 introduce classification, generative models, SVMs (PS2). Week 5–6 focus on clustering and EM (PS3). Weeks 7–8 build neural networks, backprop, and regularization (PS4). Weeks 9–10 synthesize generalization theory and practical debugging (PS5).

Problem sets are the course's heart: linear algebra, probability, coding implementations. Start early, read error messages, test incrementally. Office hours and piazza offer guidance. The course progresses from simple (linear models) to complex (neural networks, VC theory), building intuition at each stage. Success requires implementing algorithms, understanding mathematics, and developing debugging skills. By the end, you'll have built classifiers, trained neural networks, and applied unsupervised learning—the foundation for research and industry applications.

10 weeks 5 problem sets 200 points ~5 hrs/week coding
09

Sources & References

References and sources for further study on the topics covered in this deep dive.