Problem Set 2 (35 points) contains three complementary problems that deepen your understanding of supervised classification. Each problem targets a different aspect of real-world complexity.
Problem Structure
01—Linear Classification (12 pts): Compare logistic regression and Gaussian Discriminant Analysis on the same synthetic data. Implement both, analyze decision boundaries, and understand their inductive biases and convergence properties.
02—Positive-Only Learning (12 pts): Train a classifier when you observe positive labels and unlabeled data (no explicit negatives). Estimate the true p(y=1|x) under the SCAR assumption, deriving the relationship between observed and true label frequencies.
03—Imbalanced Data (11 pts): Address a dataset where one class dominates. Implement resampling and reweighting strategies, evaluate using appropriate metrics (F1, ROC-AUC), and justify your approach.
Learning Outcomes
By the end, you will understand linear classifier trade-offs, how to learn from incomplete label information, and practical strategies for imbalanced data. You will also recognize that evaluation metrics must match problem structure.
Recommended Resources
Review lecture notes on logistic regression, GDA, and loss functions. Consult the textbook sections on semi-supervised learning and class imbalance. The sklearn documentation for imbalance handling (imbalanced-learn) is also invaluable.
Linear classifiers (logistic regression and GDA) form the backbone of supervised learning. Both partition the input space with a hyperplane, but they make different assumptions about the data-generating process.
Logistic Regression
Logistic regression directly models p(y=1|x) = σ(θ^T x). It uses maximum likelihood estimation and empirical risk minimization. The learned boundary is the set where p(y=1|x) = 0.5. Logistic regression is robust and requires fewer assumptions; it works well even when the data does not follow any particular distributional form.
Gaussian Discriminant Analysis
GDA assumes class-conditional distributions p(x|y) are Gaussian with shared covariance. By modeling the joint distribution p(x,y), GDA derives a posterior p(y|x) that is logistic. GDA can be more sample-efficient if the Gaussian assumption holds, but it is sensitive to violating this assumption.
Empirical Comparison
On the same synthetic dataset, plot the learned boundaries for both models. Compare: (1) convergence speed, (2) decision boundary shape, (3) robustness to outliers, and (4) classification accuracy on held-out data. Document your findings in the writeup.
Key Insight
Logistic regression is a discriminative model (learns p(y|x) directly), while GDA is generative (learns the full joint). Generative models can be more data-efficient but are less flexible. Discriminative models are more robust but may require more data.
The decision boundary is the set of points where the classifier's prediction changes. Understanding boundary geometry reveals classifier strengths and weaknesses.
Linearity and Feature Space
Both logistic regression and GDA with a diagonal covariance matrix produce linear boundaries in the original feature space. If the data is not linearly separable, you must either augment the feature space (add polynomial or interaction terms) or use a nonlinear model.
Visualization Strategies
For 2D data, plot the boundary as a contour of p(y=1|x) = 0.5. For higher dimensions, use dimensionality reduction (PCA, t-SNE) or plot marginal 2D slices. Color-code training points by true label and mark misclassifications.
Margin and Confidence
The margin is the distance from a point to the decision boundary. Points far from the boundary are classified with high confidence. Margins reveal which regions are well-learned and which are ambiguous. Small margins indicate potential overfitting or overlapping class distributions.
Feature Engineering
If the boundary is nonlinear, engineer features: polynomial terms (x_1^2, x_1 x_2), radial basis functions, or domain-specific transformations. Re-train the linear classifier in the augmented space. Document which features improve the boundary.
Positive-only (PU) learning addresses a practical scenario: you observe positive examples explicitly, but the unlabeled set contains both positives and negatives. You cannot directly observe the negative class.
Problem Setup
You observe labeled pairs (x, s) where s = 1 indicates a positive label, and s = 0 indicates unlabeled (unknown true label y). The true label y ∈ {0, 1}, and the relationship is s = 1 ⟹ y = 1 (labeled positives are true positives) but y = 0 ⟹ s = 0 (negatives are unlabeled).
SCAR Assumption
Under Selected Completely At Random (SCAR), the probability of observing a label s=1 depends only on the true label y, not on x: p(s=1|x,y) = p(s=1|y). This means the labeling process is uninformative given the true label. SCAR is a strong but often reasonable assumption in applications like crowdsourcing or medical diagnosis.
Core Insight
Observed frequency f(s=1|x) = p(s=1|x) = p(s=1|y=1) · p(y=1|x) = π · p(y=1|x), where π is the label frequency. By estimating π from aggregate statistics, you can recover p(y=1|x) = f(s=1|x) / π.
Practical Steps
Estimate π (the fraction of observed positives among true positives), using methods like label frequency or Elkan-Noto constant-factor assumptions. Then, train a classifier on the observed positives vs. rest, then adjust its scores by π. Finally, evaluate on a held-out set with true labels.
PU learning theory formalizes when and how you can recover the true classifier from positive-only observations. The core question: is p(y=1|x) identifiable from p(s=1|x)?
Identifiability
Under SCAR, p(s=1|x) = π · p(y=1|x) where π = p(s=1|y=1) is a constant. This relationship is linear in p(y=1|x), so you can invert it: p(y=1|x) = p(s=1|x) / π. Identifiability holds as long as π > 0 (at least some positives are labeled) and we can estimate π accurately.
Estimating the Class Prior π
π can be estimated using several methods: (1) label frequency (if you trust domain knowledge about the overall prevalence of positives), (2) constant-factor assumption (the labeled positive rate divided by a known or assumed factor), or (3) mixture model fitting (assume the unlabeled set is a mixture of positives and negatives, then infer π via EM).
Error Bounds
Misestimating π introduces systematic bias. If true π = 0.3 but you estimate π = 0.2, your recovered p(y=1|x) will be 1.5× too large. This affects threshold selection and calibration. The problem statement may provide guidance on π estimation.
Practical Robustness
PU learning is robust to moderate π misestimation if you use cross-validation and tuning on a held-out labeled set. Sensitivity analysis (vary π and observe performance) quantifies robustness. Always validate on a separate test set with true labels.
Class imbalance arises when one class is much rarer than the other. Standard accuracy is a poor metric; even a baseline that always predicts the majority class achieves high accuracy.
The Problem
In fraud detection, disease screening, or anomaly detection, the positive class is often 1–5% of the data. Training a standard classifier on raw data biases it toward the majority class; the gradient for the minority class is overwhelmed. The model learns to "mostly predict negative" and achieves high accuracy despite uselessness.
Appropriate Metrics
Precision: of predicted positives, how many are truly positive? Recall (Sensitivity): of true positives, how many did we find? F1 Score: harmonic mean of precision and recall; balances both. ROC-AUC: area under the receiver operating characteristic curve; threshold-independent measure of rank-ordering quality. PR-AUC: precision-recall curve; focuses on the positive class, especially valuable for extreme imbalance.
Why Accuracy Fails
If 99% of data is negative and the model always predicts negative, accuracy = 99% but recall = 0% (catching no true positives). In cost-sensitive settings, missing a positive may be far more expensive than a false alarm.
Evaluation Discipline
Always report multiple metrics. Use stratified k-fold cross-validation to preserve class distribution in each fold. On imbalanced data, random split can lead to leakage (e.g., all positives in test set). Use a held-out test set with the true class distribution.
Resampling and reweighting adjust the training distribution to address imbalance. Each approach has trade-offs in computational cost, variance, and bias.
Oversampling
Replicate minority-class examples until the class distribution is balanced. Pros: no data loss, works with any classifier. Cons: increases dataset size and computational cost; can lead to overfitting if you memorize replicated examples. Mitigation: add small noise to replicated examples (jittering).
Undersampling
Randomly remove majority-class examples to balance the distribution. Pros: reduces dataset size and training time. Cons: information loss (you discard potentially useful majority-class samples); may increase bias. Best used when data is very large or when combined with ensemble methods.
SMOTE (Synthetic Minority Oversampling Technique)
Generate synthetic minority examples by interpolating between existing minority samples. SMOTE creates new examples in the feature space, reducing memorization risk compared to simple replication. Variants exist for edge cases (borderline SMOTE, SVM-SMOTE).
Cost-Sensitive Learning
Assign higher misclassification cost to the minority class. In logistic regression, weight the loss: loss = w+ · (minority errors) + w− · (majority errors). Set w+ > w− to prioritize minority performance. This avoids the data-manipulation of resampling, directly optimizing the imbalanced objective.
Ensemble Methods
Train multiple classifiers on different resampled versions of the data (e.g., balanced random forests). Averaging predictions often reduces variance and improves robustness to resampling artifacts.
Problem Set 2 synthesizes algorithmic, information-theoretic, and practical perspectives on classification. These three problems are not isolated; they interact in real applications.
Algorithmic Perspective: Model Choice
Logistic regression and GDA are both linear classifiers, but they embody different inductive biases. Logistic regression is discriminative and robust; GDA is generative and can be sample-efficient. Neither is universally best; context dictates choice.
Information-Theoretic Perspective: Unlabeled Data
Unlabeled data is not wasted; under SCAR, it carries information about p(y=1|x). PU learning formalizes this insight. Semi-supervised and self-supervised learning build on this principle. Whenever labels are scarce, unlabeled data is your ally.
Practical Perspective: Evaluation and Balance
Accuracy is a red herring for imbalanced data. Precision, recall, F1, and ROC-AUC tell the true story. Addressing imbalance requires a combined strategy: appropriate metrics, resampling/reweighting, and validation discipline.
Integration
A real-world classification task may combine all three themes: you might use logistic regression (linear classifier), adjust for label imbalance (reweighting), and even have some positive-only labels (semi-supervised). Mastery of these fundamentals prepares you for such complexity.
Next Steps
After PS2, deepen your understanding with nonlinear classifiers (neural networks, kernel methods), advanced semi-supervised techniques (consistency regularization, pseudo-labeling), and domain-specific applications (NLP, computer vision, time series). The foundations you build here will anchor everything ahead.
Stanford CS229 Course Materials
Stanford CS229: Machine Learning — Official course page with lectures on logistic regression, GDA, and loss functions. See weeks 2-3 lecture notes for material directly relevant to PS2.
Linear Classification Theory
Wikipedia: Logistic Regression — Comprehensive treatment of logistic regression, maximum likelihood, and regularization.
Wikipedia: Linear Discriminant Analysis — Overview of GDA and discriminant analysis methods. The Gaussian assumption and derivation of linear boundaries are clearly explained.
Positive-Only Learning
Wikipedia: Positive and Unlabeled Learning — Foundational overview of PU learning, SCAR assumption, and unlabeled data usage.
Imbalanced Data and Evaluation Metrics
Wikipedia: Precision and Recall — Detailed explanation of precision, recall, F1-score, and their relationship to confusion matrices.
Wikipedia: ROC Curve — ROC curves, AUC, and threshold-performance trade-offs essential for evaluating imbalanced classifiers.
Wikipedia: Oversampling and Undersampling — Overview of resampling strategies for class imbalance.
Advanced References
The Elements of Statistical Learning (ESL) - Free PDF — Authoritative reference for classification, including logistic regression (Chapter 4), GAM models, and evaluation metrics (Chapter 11). Highly recommended for rigorous treatment of linear classifiers and imbalance.
imbalanced-learn Documentation — Python library for handling imbalanced datasets. Includes SMOTE, random over/undersampling, and advanced resampling techniques.
Learning Path
Start with Stanford CS229 lectures on logistic regression and GDA. Review Wikipedia articles on precision/recall and ROC curves to understand evaluation metrics. Consult ESL Chapter 4 for rigorous mathematical treatment of linear classifiers. Use imbalanced-learn documentation for practical implementation. For PU learning, search for foundational papers by Elkan & Noto on label bias correction.