Stanford XCS229 · Problem Set 2
PS2: Classification
35 Points · Linear Models, PU Learning & Imbalance
01

PS2 Overview

Problem Set 2 (35 points) spans linear classifiers, positive-only learning, and imbalanced data—three fundamental challenges in supervised learning. This problem set bridges weeks 2-3 material and prepares you for real-world classification tasks where data is messy.

Three problems build complementary skills: logistic regression versus Gaussian discriminant analysis, learning from incomplete labels, and handling class imbalance. Together, they form a comprehensive foundation in practical classification.

02

Linear Classification

Problem 01 compares two core linear classification algorithms on the same dataset. Logistic regression models the conditional probability directly; GDA makes a generative assumption about the class distributions.

Both produce linear decision boundaries, but their inductive biases differ. Understanding when each shines and when they fail is essential for model selection in practice.

03

Decision Boundary Analysis

Decision boundaries are where the classifier changes its prediction. In high dimensions, linear boundaries may be insufficient, requiring feature engineering or nonlinear models. This section explores when linearity works and when you need to augment the feature space.

Visualizing boundaries reveals classifier behavior, class overlap, and margin structure—intuition that guides algorithm choice and hyperparameter tuning.

04

Positive-Only Learning

Positive-only (PU) learning handles a realistic scenario: you observe positive labels but cannot distinguish between unlabeled and negative examples. This is common in medical diagnosis, churn prediction, and recommendation systems.

The key insight is that unlabeled data carries information. By assuming SCAR (selected completely at random), you can estimate p(y=1|x) from observed frequencies and label densities.

05

PU Learning Theory

PU learning theory formalizes identifiability. Under SCAR, you can recover the true classifier even from biased labels—provided the unlabeled population is representative. Estimating the class prior π = p(s=1|y=1) is the bottleneck.

Label frequency and the constant-factor assumption unlock closed-form solutions. Theory validates the intuition that unlabeled data is not wasted.

06

Imbalanced Datasets

Class imbalance occurs when one class vastly outnumbers the other. Standard accuracy is misleading; a majority-class baseline trivializes the problem. Precision, recall, F1, and ROC-AUC reveal true classifier performance.

Imbalance is rampant in fraud detection, disease screening, and anomaly detection. Addressing it requires both evaluation metrics and training strategies.

07

Resampling & Reweighting

Resampling (over- and undersampling) and reweighting adjust the class distribution seen during training. Oversampling the minority class or undersampling the majority can balance gradients. Cost-sensitive learning directly penalizes minority errors more heavily.

SMOTE (Synthetic Minority Oversampling) generates synthetic samples, mitigating overfitting to the (small) minority set. Each approach trades off computational cost, variance, and bias.

08

Key Takeaways

This problem set unifies three perspectives: algorithmic (logistic regression vs. GDA), information-theoretic (PU learning's use of unlabeled data), and practical (imbalance handling). Mastering these lays a strong foundation for subsequent classification tasks.

The real lesson: classification is not monolithic. Your choice of model, loss, metrics, and training strategy must align with data characteristics and business requirements.

09

References & Further Reading

Comprehensive references for classification theory and practice. Stanford CS229 covers logistic regression, GDA, and loss functions. For positive-only learning, see lecture notes on semi-supervised learning and the foundational PU learning papers. The Elements of Statistical Learning provides rigorous treatment of classifiers, evaluation metrics, and imbalance. Scikit-learn and imbalanced-learn documentation offer practical implementation guidance.

Key Wikipedia articles cover logistic regression, Gaussian discriminant analysis, precision/recall, and ROC curves. For SMOTE and advanced resampling techniques, consult the original papers and modern survey articles on class imbalance.