ML Problem Sets — Classical ML Algorithms

SCPD ML Course

ML Problem Sets

Generative vs Discriminative

Gaussian Discriminant Analysis

Logistic Regression

Naive Bayes

Support Vector Machines

Softmax & Multiclass

Decision Boundaries

When to Use What

Sources & References

Generative vs Discriminative

The choice between generative and discriminative models fundamentally affects how algorithms learn from data. Ng & Jordan (2001) showed there are two distinct regimes: generative classifiers learn faster with small data, discriminative classifiers have lower asymptotic error with large data.

Generative Approach

Models: P(x, y) = P(y) · P(x|y). Learns class-conditional distributions and class priors.

Examples: Gaussian Discriminant Analysis, Naive Bayes.

Decision rule: argmax P(y|x) = argmax P(x|y)P(y) / P(x). Can sample new x from P(x|y).

Sample efficiency: Requires fewer samples to reach asymptotic error (faster convergence rate).

Assumption burden: Makes strong assumptions about P(x|y) (e.g., Gaussian, conditional independence).

Discriminative Approach

Models: P(y|x) directly. Learns decision boundary.

Examples: Logistic Regression, Support Vector Machines.

Decision rule: argmax P(y|x). Directly optimizes what we care about.

Lower asymptotic error: Fewer modeling assumptions → lower asymptotic error with infinite data.

Flexibility: Makes no assumptions about P(x), only P(y|x).

The Tradeoff

Generative Wins When:

• Small datasets (n < 1000)
• High-dimensional sparse data (text, NLP)
• Missing data (can impute via P(x|y))
• Interpretability of class distributions matters

Discriminative Wins When:

• Large datasets (n > 10,000)
• Dense features
• Asymptotic performance critical
• Direct P(y|x) fit is flexible

The convergence rate is O(1/m) for generative (m = sample size), but asymptotic error is higher. Discriminative has slower convergence O(1/√m) but lower asymptotic error. They cross over around m ≈ 1-5K samples depending on problem.

Gaussian Discriminant Analysis

Gaussian Discriminant Analysis (GDA) models class-conditional distributions as multivariate Gaussians. It is a generative model: we learn P(y) and P(x|y), then use Bayes rule to compute P(y|x).

Model Formulation

Class prior: P(y) = φ^y(1-φ)^(1-y) for y ∈ {0, 1}.

Class-conditional: P(x|y=k) = N(μ_k, Σ_k). Gaussian with mean μ_k and covariance Σ_k.

Posterior (Bayes rule): P(y|x) = P(x|y)P(y) / P(x).

MLE Parameter Estimates

φ = (1/m) Σ 1{y^(i)=1}
μ_k = Σ_{i: y^(i)=k} x^(i) / |{i: y^(i)=k}|
Σ_k = Σ_{i: y^(i)=k} (x^(i) - μ_k)(x^(i) - μ_k)^T / |{i: y^(i)=k}|

Each class has its own covariance matrix Σ_k. This is Quadratic Discriminant Analysis (QDA).

Linear Discriminant Analysis (LDA)

If we assume all classes share the same covariance Σ_k = Σ ∀k, the decision boundary becomes linear in x.

Shared covariance estimate: Σ = (1/m) Σ_{i=1}^m (x^(i) - μ_{y^(i)})(x^(i) - μ_{y^(i)})^T

With shared Σ, the log-odds log P(y=1|x)/P(y=0|x) is linear in x, producing a hyperplane decision boundary.

Decision Boundary Geometry

QDA (separate covariances): Decision boundary is a quadratic surface. In 2D, it is a conic section (ellipse, hyperbola, parabola).

LDA (shared covariance): Decision boundary is a line (in 2D) or hyperplane (in d dimensions). Simpler model, fewer parameters.

When to Use GDA

Small-to-medium datasets (Gaussian assumption is strong)
When interpretable class distributions help understanding
Dense features (not sparse high-dimensional data)
LDA as baseline: simple, linear, fewer parameters

Logistic Regression

Logistic Regression directly models P(y|x) using the sigmoid function. It is a discriminative classifier, making no assumptions about the feature distribution P(x).

The Sigmoid Function

Definition: σ(z) = 1 / (1 + e^(-z)). Maps R → (0, 1).

Properties: σ(0) = 0.5, σ'(z) = σ(z)(1 - σ(z)), lim_{z→∞} σ(z) = 1, lim_{z→-∞} σ(z) = 0.

Model

Hypothesis: h_θ(x) = σ(w^T x + b) = 1 / (1 + e^(-(w^T x + b)))

Interpretation: h_θ(x) is the probability P(y=1|x; w, b).

Loss Function: Cross-Entropy

L(ŷ, y) = -y log(ŷ) - (1-y) log(1-ŷ)
         = -y log(σ(w^T x + b)) - (1-y) log(1 - σ(w^T x + b))

Why cross-entropy? It is the negative log-likelihood under Bernoulli distribution. Convex in w and b, guarantees global optimum.

Empirical loss: J(w, b) = (1/m) Σ_{i=1}^m L(h_θ(x^(i)), y^(i))

Optimization: Gradient Descent

Gradient of sigmoid: ∂σ(z)/∂z = σ(z)(1 - σ(z))

Gradient of cross-entropy: ∂J/∂w = (1/m) Σ_{i=1}^m (σ(w^T x^(i) + b) - y^(i)) x^(i)

Update rule (batch GD): w := w - α (∂J/∂w), b := b - α (∂J/∂b)

α (learning rate): Typical values 0.001 to 0.1. Smaller → slower but safer convergence.

Newton's method: Faster convergence (quadratic) if Hessian is tractable. θ := θ - H^(-1) ∇J.

Regularization

L2 (Ridge): J_L2 = J + (λ/2m) ||w||^2. Shrinks weights, reduces complexity.

L1 (Lasso): J_L1 = J + (λ/m) ||w||_1. Produces sparse weights (feature selection).

L2 Regularization

• Smooth penalty on all weights
• Weights approach 0 gradually
• All features retained (smaller)
• Faster, smoother gradient

L1 Regularization

• Sharp corners at 0
• Drives some weights to exactly 0
• Automatic feature selection
• Less sensitive to scale

Feature Engineering Matters

Logistic Regression is linear in the feature space. Add polynomial features (x₁², x₁x₂, etc.) to capture nonlinear relationships. This increases model capacity but risks overfitting.

Naive Bayes

Naive Bayes is a generative classifier that applies Bayes theorem with a strong conditional independence assumption: given the class y, all features x_j are independent.

The Naive Independence Assumption

Bayes rule: P(y|x) = P(x|y)P(y) / P(x)

Naive assumption: P(x|y) = P(x_1|y) P(x_2|y) ... P(x_d|y)

Why "naive"? In reality, features are often correlated. But empirically, Naive Bayes often performs well despite this violation.

Parameter Estimation

Class prior: P(y=k) = count(y^(i)=k) / m

Feature likelihood: P(x_j|y=k) depends on feature type (Bernoulli, Multinomial, Gaussian).

Bernoulli NB: P(x_j|y) ∈ {0, 1} binary feature
Multinomial NB: x_j ∈ {1,...,V} categorical
Gaussian NB: P(x_j|y) = N(μ_{jk}, σ²_{jk})

Laplace Smoothing

Problem: If feature j never appears with class k in training data, P(x_j|y=k) = 0, making the entire P(y|x) = 0 due to multiplication.

Solution: Add pseudocount α (usually α=1 for Laplace smoothing):

P(x_j=v|y=k) = (count(x_j=v, y=k) + α) / (count(y=k) + α·|V_j|)

where |V_j| is the number of unique values of feature j.

Text Classification Example

Task: Classify email as spam (y=1) or ham (y=0).

Features: Word indicators (binary). x_j = 1 if word j appears, 0 otherwise.

Training: Count word frequencies in spam vs. ham emails. Estimate P(word_j|spam) and P(word_j|ham).

Prediction: P(spam|email) ∝ P(email|spam)P(spam) = P(spam) ∏ P(word_j|spam)^{x_j}

Log-odds trick: Avoid underflow by computing log-probabilities: log P(y=1|x) = log P(y=1) + Σ log P(x_j|y=1)

Naive Bayes vs. Logistic Regression

Naive Bayes

• Generative (learns P(x|y))
• Fast training (just count)
• Works with small data
• Clear probability estimates
• Less flexible decision boundary

Logistic Regression

• Discriminative (learns P(y|x))
• Gradient-based optimization
• Needs more data for good fit
• No explicit P(x) model
• More flexible, can fit better

Naive Bayes learns faster with small data (O(n) samples), but logistic regression has lower asymptotic error as n → ∞ (both converge, but LR has lower limit).

Support Vector Machines

Support Vector Machines (SVMs) find the hyperplane that maximizes the margin between classes. They use hinge loss and can employ the kernel trick for nonlinear classification.

The Margin and Hard-Margin SVM

Geometric margin: Distance from point x to hyperplane w^T x + b = 0 is |w^T x + b| / ||w||.

Margin γ: Minimum distance of any training point to the hyperplane. We want to maximize γ.

Hard-margin SVM (linearly separable):

min   (1/2) ||w||²
 w,b
s.t.  y^(i)(w^T x^(i) + b) ≥ 1  ∀i

The constraint ensures points are at least distance 1/||w|| from the hyperplane. Minimizing ||w|| maximizes margin.

Soft-Margin SVM (Non-separable Data)

Introduce slack variables ξ_i ≥ 0 to allow violations:

min   (1/2) ||w||² + C Σ ξ_i
 w,b,ξ
s.t.  y^(i)(w^T x^(i) + b) ≥ 1 - ξ_i  ∀i
      ξ_i ≥ 0

Parameter C: Regularization trade-off. Large C → favor margin, allow few violations. Small C → allow more misclassifications.

Hinge Loss Formulation

Soft-margin SVM minimizes:

J(w, b) = (1/2) ||w||² + C Σ max(0, 1 - y^(i)(w^T x^(i) + b))

The term max(0, 1 - y·f(x)) is hinge loss. It is 0 when the point is correctly classified with margin ≥ 1.

Dual Form and Kernel Trick

Primal: Find w, b. Hard to solve for large d (# features).

Dual (Lagrangian):

max  Σ α_i - (1/2) Σ α_i α_j y^(i) y^(j) (x^(i) · x^(j))
α
s.t. 0 ≤ α_i ≤ C,  Σ α_i y^(i) = 0

Kernel trick: Replace inner products x^(i) · x^(j) with K(x^(i), x^(j)) = φ(x^(i)) · φ(x^(j)) without computing φ explicitly.

Common Kernels

Linear: K(x,x') = x · x'
Polynomial: K(x,x') = (γ x·x' + r)^d
RBF (Gaussian): K(x,x') = exp(-γ ||x - x'||²)
Sigmoid: K(x,x') = tanh(γ x·x' + r)

SVM vs. Logistic Regression

SVMs: Better for small-medium datasets, clearer geometric interpretation
Logistic Reg: Probabilistic outputs, scales to large data, more interpretable weights
SVMs with RBF kernel: Nonlinear, but slow to predict. Logistic with kernel: Slow to train

Softmax & Multiclass

Softmax Regression generalizes logistic regression from binary to K-class classification. It models P(y|x) as a categorical distribution using the softmax function.

The Softmax Function

Definition: σ_k(z) = e^(z_k) / Σ_j e^(z_j)

Input: z ∈ R^K (K logits, one per class).

Output: Vector of K probabilities that sum to 1.

Properties: Differentiable, maintains probability interpretation, generalizes sigmoid (K=2 case).

Model

z_k = w_k^T x + b_k   for k = 1,...,K
P(y=k|x) = e^(z_k) / Σ_j e^(z_j) = softmax(z)_k

Parameters: w_1,...,w_K ∈ R^d and b_1,...,b_K ∈ R (total (d+1)K parameters).

Cross-Entropy Loss

For one sample: L(y, ŷ) = -Σ_k 1{y=k} log ŷ_k = -log ŷ_y

If y=2 and ŷ = [0.1, 0.7, 0.2], loss = -log(0.7) ≈ 0.357.

Batch loss: J(W, b) = (1/m) Σ_i L(y^(i), softmax(W^T x^(i) + b))

Optimization

Gradient: ∂J/∂w_k = (1/m) Σ_i (σ_k(z^(i)) - 1{y^(i)=k}) x^(i)

Gradient update: w_k := w_k - α ∂J/∂w_k (standard gradient descent)

Softmax is convex, so gradient descent converges to global optimum.

One-vs-All (OVA) vs. Softmax

One-vs-All

• Train K binary classifiers
• Class k vs. all others
• Takes max confidence
• Simple but less unified
• K independent models

Softmax

• Single K-class model
• Directly optimizes joint distribution
• Probabilities sum to 1
• More principled (MLE)
• Shared representation

Connection to Neural Networks

Softmax is the standard output layer in neural networks for multiclass classification. Hidden layers learn nonlinear features, softmax outputs class probabilities. Logistic regression is the single-layer version (no hidden layers).

Decision Boundaries Compared

The shape of the decision boundary reveals each algorithm's implicit assumptions about the data. Here we compare them on a 2D synthetic dataset.

Theoretical Decision Surfaces

Linear (Logistic, LDA): Straight line or hyperplane. Models P(y|x) or P(x|y) with shared covariance.
Quadratic (QDA): Conic sections (ellipse, parabola). Each class has own Gaussian covariance.
Nonlinear (SVM + kernel): Complex curved boundary. Kernel trick maps to higher-dimensional space.
Axis-aligned (Naive Bayes): Rectangular regions. Independence assumption creates orthogonal splits.

Bias-Variance Perspective

High bias, low variance: Logistic regression / LDA. Simple linear model, stable but may underfit.

Low bias, high variance: SVM with RBF, QDA. Flexible models, fit the data well but may overfit on small n.

Middle ground: Naive Bayes. Makes strong independence assumptions but learns quickly from small data.

Regularization Shapes Decision Boundaries

In logistic regression, increasing λ (L2 penalty) pushes weights toward 0, tilting the boundary closer to the origin and making the decision surface more conservative. Decreasing λ allows the boundary to fit the training data more tightly.

In SVMs, smaller C (penalty for margin violations) allows more slack, softening the boundary. Larger C hardens the boundary to fit training data precisely.

When to Use What

This section provides a practical flowchart for selecting the right algorithm. No single algorithm is best—context matters.

Quick Decision Grid

Small Dataset (n < 1K)

First choice: Naive Bayes or LDA.
Generative models converge faster. Low parameter count, don't need much data to estimate Gaussian.

Large Dataset (n > 10K)

First choice: Logistic Regression or SVM.
Discriminative models have lower asymptotic error. Scale well with good solvers (SGD).

High Dimensions (d > 1000)

Best: Logistic Regression.
L2 regularization handles many features. Linear models scale. Avoid dense covariance matrices.

Sparse Features (text, NLP)

Best: Naive Bayes or Logistic Reg + L1.
NB excels with conditional independence. L1 does automatic feature selection.

Interpretability Critical

Best: Logistic Regression or LDA.
Weights/means directly show feature importance. Clear probability estimates. SVM margins are geometric but less intuitive.

Training Speed Matters

Fastest: Naive Bayes (just count frequencies).
Logistic Reg (convex, fast convergence). SVM slower due to quadratic solver. QDA medium.

Nonlinear Separation

Best: SVM + kernel or Neural Networks.
Logistic Reg + polynomial features works but manual feature engineering. Kernels are automatic.

Class Imbalance

Strategies: Use class_weight=balanced in SVM/LR. Oversample minority. Adjust C or cost parameters.

Algorithm Verdict Matrix

Logistic Regression

Interpretable weights
Fast training + prediction
Probabilistic outputs
Scales to huge data (SGD)
L1/L2 regularization clear

Logistic Regression

Linear boundaries only
Polynomial features manual
Needs feature scaling
Assumes independence
No built-in kernel trick

Naive Bayes

Extremely fast training
Works with small n
Text/NLP gold standard
Handles missing data
Generates P(x|y) samples

Naive Bayes

Violates independence often
Lower asymptotic performance
Feature scaling doesnt help
Limited nonlinearity
Hard to tune hyperparams

Gaussian Discriminant Analysis

Elegant probabilistic model
Learns class distributions
LDA is simple baseline
Works with small data
Fast prediction (distance)

Gaussian Discriminant Analysis

Gaussian assumption often wrong
QDA: many parameters
High-dimensional failure
Covariance estimation hard (n < d)
Less flexible than SVM

Support Vector Machines

Excellent at margins
Kernel trick → nonlinear
Works well on medium data
Sparse solutions possible
Strong theoretical foundation

Support Vector Machines

Slow on huge datasets
Hard to interpret (kernels)
Hyperparameter tuning (C, γ)
No probability outputs (custom)
Memory: n² in dual form

Classical Flowchart

Is data linearly separable? Yes → Logistic Reg or SVM (linear kernel). No → Continue.
Is n small (<1K) or sparse? Yes → Naive Bayes. No → Continue.
Is interpretability critical? Yes → LDA or Logistic Reg. No → Continue.
Willing to engineer polynomial features? Yes → Logistic + poly. No → Continue.
Use SVM + RBF kernel: Flexible, strong regularization via C.

References and sources for further study on the topics covered in this deep dive.