XCS229 Problem Set 2

01 — Overview

PS2 Overview

Problem Set 2 (35 points) contains three complementary problems that deepen your understanding of supervised classification. Each problem targets a different aspect of real-world complexity.

Problem Structure

01—Linear Classification (12 pts): Compare logistic regression and Gaussian Discriminant Analysis on the same synthetic data. Implement both, analyze decision boundaries, and understand their inductive biases and convergence properties.

02—Positive-Only Learning (12 pts): Train a classifier when you observe positive labels and unlabeled data (no explicit negatives). Estimate the true p(y=1|x) under the SCAR assumption, deriving the relationship between observed and true label frequencies.

03—Imbalanced Data (11 pts): Address a dataset where one class dominates. Implement resampling and reweighting strategies, evaluate using appropriate metrics (F1, ROC-AUC), and justify your approach.

Learning Outcomes

By the end, you will understand linear classifier trade-offs, how to learn from incomplete label information, and practical strategies for imbalanced data. You will also recognize that evaluation metrics must match problem structure.

Recommended Resources

Review lecture notes on logistic regression, GDA, and loss functions. Consult the textbook sections on semi-supervised learning and class imbalance. The sklearn documentation for imbalance handling (imbalanced-learn) is also invaluable.

02 — Linear Classification

Linear Classification

Linear classifiers (logistic regression and GDA) form the backbone of supervised learning. Both partition the input space with a hyperplane, but they make different assumptions about the data-generating process.

Logistic Regression

Logistic regression directly models p(y=1|x) = σ(θ^T x). It uses maximum likelihood estimation and empirical risk minimization. The learned boundary is the set where p(y=1|x) = 0.5. Logistic regression is robust and requires fewer assumptions; it works well even when the data does not follow any particular distributional form.

Gaussian Discriminant Analysis

GDA assumes class-conditional distributions p(x|y) are Gaussian with shared covariance. By modeling the joint distribution p(x,y), GDA derives a posterior p(y|x) that is logistic. GDA can be more sample-efficient if the Gaussian assumption holds, but it is sensitive to violating this assumption.

Empirical Comparison

On the same synthetic dataset, plot the learned boundaries for both models. Compare: (1) convergence speed, (2) decision boundary shape, (3) robustness to outliers, and (4) classification accuracy on held-out data. Document your findings in the writeup.

Key Insight

Logistic regression is a discriminative model (learns p(y|x) directly), while GDA is generative (learns the full joint). Generative models can be more data-efficient but are less flexible. Discriminative models are more robust but may require more data.

03 — Decision Boundary Analysis

Decision Boundary Analysis

The decision boundary is the set of points where the classifier's prediction changes. Understanding boundary geometry reveals classifier strengths and weaknesses.

Linearity and Feature Space

Both logistic regression and GDA with a diagonal covariance matrix produce linear boundaries in the original feature space. If the data is not linearly separable, you must either augment the feature space (add polynomial or interaction terms) or use a nonlinear model.

Visualization Strategies

For 2D data, plot the boundary as a contour of p(y=1|x) = 0.5. For higher dimensions, use dimensionality reduction (PCA, t-SNE) or plot marginal 2D slices. Color-code training points by true label and mark misclassifications.

Margin and Confidence

The margin is the distance from a point to the decision boundary. Points far from the boundary are classified with high confidence. Margins reveal which regions are well-learned and which are ambiguous. Small margins indicate potential overfitting or overlapping class distributions.

Feature Engineering

If the boundary is nonlinear, engineer features: polynomial terms (x_1^2, x_1 x_2), radial basis functions, or domain-specific transformations. Re-train the linear classifier in the augmented space. Document which features improve the boundary.

04 — Positive-Only Learning

Positive-Only Learning

Positive-only (PU) learning addresses a practical scenario: you observe positive examples explicitly, but the unlabeled set contains both positives and negatives. You cannot directly observe the negative class.

Problem Setup

You observe labeled pairs (x, s) where s = 1 indicates a positive label, and s = 0 indicates unlabeled (unknown true label y). The true label y ∈ {0, 1}, and the relationship is s = 1 ⟹ y = 1 (labeled positives are true positives) but y = 0 ⟹ s = 0 (negatives are unlabeled).

SCAR Assumption

Under Selected Completely At Random (SCAR), the probability of observing a label s=1 depends only on the true label y, not on x: p(s=1|x,y) = p(s=1|y). This means the labeling process is uninformative given the true label. SCAR is a strong but often reasonable assumption in applications like crowdsourcing or medical diagnosis.

Core Insight

Observed frequency f(s=1|x) = p(s=1|x) = p(s=1|y=1) · p(y=1|x) = π · p(y=1|x), where π is the label frequency. By estimating π from aggregate statistics, you can recover p(y=1|x) = f(s=1|x) / π.

Practical Steps

Estimate π (the fraction of observed positives among true positives), using methods like label frequency or Elkan-Noto constant-factor assumptions. Then, train a classifier on the observed positives vs. rest, then adjust its scores by π. Finally, evaluate on a held-out set with true labels.

05 — PU Learning Theory

PU Learning Theory

PU learning theory formalizes when and how you can recover the true classifier from positive-only observations. The core question: is p(y=1|x) identifiable from p(s=1|x)?

Identifiability

Estimating the Class Prior π

π can be estimated using several methods: (1) label frequency (if you trust domain knowledge about the overall prevalence of positives), (2) constant-factor assumption (the labeled positive rate divided by a known or assumed factor), or (3) mixture model fitting (assume the unlabeled set is a mixture of positives and negatives, then infer π via EM).

Error Bounds

Misestimating π introduces systematic bias. If true π = 0.3 but you estimate π = 0.2, your recovered p(y=1|x) will be 1.5× too large. This affects threshold selection and calibration. The problem statement may provide guidance on π estimation.

Practical Robustness

PU learning is robust to moderate π misestimation if you use cross-validation and tuning on a held-out labeled set. Sensitivity analysis (vary π and observe performance) quantifies robustness. Always validate on a separate test set with true labels.

06 — Imbalanced Datasets

Imbalanced Datasets

Class imbalance arises when one class is much rarer than the other. Standard accuracy is a poor metric; even a baseline that always predicts the majority class achieves high accuracy.

The Problem

In fraud detection, disease screening, or anomaly detection, the positive class is often 1–5% of the data. Training a standard classifier on raw data biases it toward the majority class; the gradient for the minority class is overwhelmed. The model learns to "mostly predict negative" and achieves high accuracy despite uselessness.

Appropriate Metrics

Precision: of predicted positives, how many are truly positive? Recall (Sensitivity): of true positives, how many did we find? F1 Score: harmonic mean of precision and recall; balances both. ROC-AUC: area under the receiver operating characteristic curve; threshold-independent measure of rank-ordering quality. PR-AUC: precision-recall curve; focuses on the positive class, especially valuable for extreme imbalance.

Why Accuracy Fails

If 99% of data is negative and the model always predicts negative, accuracy = 99% but recall = 0% (catching no true positives). In cost-sensitive settings, missing a positive may be far more expensive than a false alarm.

Evaluation Discipline

Always report multiple metrics. Use stratified k-fold cross-validation to preserve class distribution in each fold. On imbalanced data, random split can lead to leakage (e.g., all positives in test set). Use a held-out test set with the true class distribution.

07 — Resampling & Reweighting

Resampling & Reweighting

Resampling and reweighting adjust the training distribution to address imbalance. Each approach has trade-offs in computational cost, variance, and bias.

Oversampling

Replicate minority-class examples until the class distribution is balanced. Pros: no data loss, works with any classifier. Cons: increases dataset size and computational cost; can lead to overfitting if you memorize replicated examples. Mitigation: add small noise to replicated examples (jittering).

Undersampling

Randomly remove majority-class examples to balance the distribution. Pros: reduces dataset size and training time. Cons: information loss (you discard potentially useful majority-class samples); may increase bias. Best used when data is very large or when combined with ensemble methods.

SMOTE (Synthetic Minority Oversampling Technique)

Generate synthetic minority examples by interpolating between existing minority samples. SMOTE creates new examples in the feature space, reducing memorization risk compared to simple replication. Variants exist for edge cases (borderline SMOTE, SVM-SMOTE).

Cost-Sensitive Learning

Assign higher misclassification cost to the minority class. In logistic regression, weight the loss: loss = w+ · (minority errors) + w− · (majority errors). Set w+ > w− to prioritize minority performance. This avoids the data-manipulation of resampling, directly optimizing the imbalanced objective.

Ensemble Methods

Train multiple classifiers on different resampled versions of the data (e.g., balanced random forests). Averaging predictions often reduces variance and improves robustness to resampling artifacts.

08 — Key Takeaways

Key Takeaways

Problem Set 2 synthesizes algorithmic, information-theoretic, and practical perspectives on classification. These three problems are not isolated; they interact in real applications.

Algorithmic Perspective: Model Choice

Logistic regression and GDA are both linear classifiers, but they embody different inductive biases. Logistic regression is discriminative and robust; GDA is generative and can be sample-efficient. Neither is universally best; context dictates choice.

Information-Theoretic Perspective: Unlabeled Data

Unlabeled data is not wasted; under SCAR, it carries information about p(y=1|x). PU learning formalizes this insight. Semi-supervised and self-supervised learning build on this principle. Whenever labels are scarce, unlabeled data is your ally.

Practical Perspective: Evaluation and Balance

Accuracy is a red herring for imbalanced data. Precision, recall, F1, and ROC-AUC tell the true story. Addressing imbalance requires a combined strategy: appropriate metrics, resampling/reweighting, and validation discipline.

Integration

A real-world classification task may combine all three themes: you might use logistic regression (linear classifier), adjust for label imbalance (reweighting), and even have some positive-only labels (semi-supervised). Mastery of these fundamentals prepares you for such complexity.

Next Steps

After PS2, deepen your understanding with nonlinear classifiers (neural networks, kernel methods), advanced semi-supervised techniques (consistency regularization, pseudo-labeling), and domain-specific applications (NLP, computer vision, time series). The foundations you build here will anchor everything ahead.

09 — References

References & Further Reading

Stanford CS229 Course Materials

Stanford CS229: Machine Learning — Official course page with lectures on logistic regression, GDA, and loss functions. See weeks 2-3 lecture notes for material directly relevant to PS2.

Linear Classification Theory

Wikipedia: Logistic Regression — Comprehensive treatment of logistic regression, maximum likelihood, and regularization.

Wikipedia: Linear Discriminant Analysis — Overview of GDA and discriminant analysis methods. The Gaussian assumption and derivation of linear boundaries are clearly explained.

Positive-Only Learning

Wikipedia: Positive and Unlabeled Learning — Foundational overview of PU learning, SCAR assumption, and unlabeled data usage.

Imbalanced Data and Evaluation Metrics

Wikipedia: Precision and Recall — Detailed explanation of precision, recall, F1-score, and their relationship to confusion matrices.

Wikipedia: ROC Curve — ROC curves, AUC, and threshold-performance trade-offs essential for evaluating imbalanced classifiers.

Wikipedia: Oversampling and Undersampling — Overview of resampling strategies for class imbalance.

Advanced References

The Elements of Statistical Learning (ESL) - Free PDF — Authoritative reference for classification, including logistic regression (Chapter 4), GAM models, and evaluation metrics (Chapter 11). Highly recommended for rigorous treatment of linear classifiers and imbalance.

imbalanced-learn Documentation — Python library for handling imbalanced datasets. Includes SMOTE, random over/undersampling, and advanced resampling techniques.

Learning Path

Start with Stanford CS229 lectures on logistic regression and GDA. Review Wikipedia articles on precision/recall and ROC curves to understand evaluation metrics. Consult ESL Chapter 4 for rigorous mathematical treatment of linear classifiers. Use imbalanced-learn documentation for practical implementation. For PU learning, search for foundational papers by Elkan & Noto on label bias correction.