The choice between generative and discriminative models fundamentally affects how algorithms learn from data. Ng & Jordan (2001) showed there are two distinct regimes: generative classifiers learn faster with small data, discriminative classifiers have lower asymptotic error with large data.
Models: P(x, y) = P(y) · P(x|y). Learns class-conditional distributions and class priors.
Examples: Gaussian Discriminant Analysis, Naive Bayes.
Decision rule: argmax P(y|x) = argmax P(x|y)P(y) / P(x). Can sample new x from P(x|y).
Sample efficiency: Requires fewer samples to reach asymptotic error (faster convergence rate).
Assumption burden: Makes strong assumptions about P(x|y) (e.g., Gaussian, conditional independence).
Models: P(y|x) directly. Learns decision boundary.
Examples: Logistic Regression, Support Vector Machines.
Decision rule: argmax P(y|x). Directly optimizes what we care about.
Lower asymptotic error: Fewer modeling assumptions → lower asymptotic error with infinite data.
Flexibility: Makes no assumptions about P(x), only P(y|x).
• Small datasets (n < 1000)
• High-dimensional sparse data (text, NLP)
• Missing data (can impute via P(x|y))
• Interpretability of class distributions matters
• Large datasets (n > 10,000)
• Dense features
• Asymptotic performance critical
• Direct P(y|x) fit is flexible
The convergence rate is O(1/m) for generative (m = sample size), but asymptotic error is higher. Discriminative has slower convergence O(1/√m) but lower asymptotic error. They cross over around m ≈ 1-5K samples depending on problem.
Gaussian Discriminant Analysis (GDA) models class-conditional distributions as multivariate Gaussians. It is a generative model: we learn P(y) and P(x|y), then use Bayes rule to compute P(y|x).
Class prior: P(y) = φ^y(1-φ)^(1-y) for y ∈ {0, 1}.
Class-conditional: P(x|y=k) = N(μ_k, Σ_k). Gaussian with mean μ_k and covariance Σ_k.
Posterior (Bayes rule): P(y|x) = P(x|y)P(y) / P(x).
φ = (1/m) Σ 1{y^(i)=1}
μ_k = Σ_{i: y^(i)=k} x^(i) / |{i: y^(i)=k}|
Σ_k = Σ_{i: y^(i)=k} (x^(i) - μ_k)(x^(i) - μ_k)^T / |{i: y^(i)=k}|
Each class has its own covariance matrix Σ_k. This is Quadratic Discriminant Analysis (QDA).
If we assume all classes share the same covariance Σ_k = Σ ∀k, the decision boundary becomes linear in x.
Shared covariance estimate: Σ = (1/m) Σ_{i=1}^m (x^(i) - μ_{y^(i)})(x^(i) - μ_{y^(i)})^T
With shared Σ, the log-odds log P(y=1|x)/P(y=0|x) is linear in x, producing a hyperplane decision boundary.
QDA (separate covariances): Decision boundary is a quadratic surface. In 2D, it is a conic section (ellipse, hyperbola, parabola).
LDA (shared covariance): Decision boundary is a line (in 2D) or hyperplane (in d dimensions). Simpler model, fewer parameters.
Logistic Regression directly models P(y|x) using the sigmoid function. It is a discriminative classifier, making no assumptions about the feature distribution P(x).
Definition: σ(z) = 1 / (1 + e^(-z)). Maps R → (0, 1).
Properties: σ(0) = 0.5, σ'(z) = σ(z)(1 - σ(z)), lim_{z→∞} σ(z) = 1, lim_{z→-∞} σ(z) = 0.
Hypothesis: h_θ(x) = σ(w^T x + b) = 1 / (1 + e^(-(w^T x + b)))
Interpretation: h_θ(x) is the probability P(y=1|x; w, b).
L(ŷ, y) = -y log(ŷ) - (1-y) log(1-ŷ)
= -y log(σ(w^T x + b)) - (1-y) log(1 - σ(w^T x + b))
Why cross-entropy? It is the negative log-likelihood under Bernoulli distribution. Convex in w and b, guarantees global optimum.
Empirical loss: J(w, b) = (1/m) Σ_{i=1}^m L(h_θ(x^(i)), y^(i))
Gradient of sigmoid: ∂σ(z)/∂z = σ(z)(1 - σ(z))
Gradient of cross-entropy: ∂J/∂w = (1/m) Σ_{i=1}^m (σ(w^T x^(i) + b) - y^(i)) x^(i)
Update rule (batch GD): w := w - α (∂J/∂w), b := b - α (∂J/∂b)
α (learning rate): Typical values 0.001 to 0.1. Smaller → slower but safer convergence.
Newton's method: Faster convergence (quadratic) if Hessian is tractable. θ := θ - H^(-1) ∇J.
L2 (Ridge): J_L2 = J + (λ/2m) ||w||^2. Shrinks weights, reduces complexity.
L1 (Lasso): J_L1 = J + (λ/m) ||w||_1. Produces sparse weights (feature selection).
• Smooth penalty on all weights
• Weights approach 0 gradually
• All features retained (smaller)
• Faster, smoother gradient
• Sharp corners at 0
• Drives some weights to exactly 0
• Automatic feature selection
• Less sensitive to scale
Logistic Regression is linear in the feature space. Add polynomial features (x₁², x₁x₂, etc.) to capture nonlinear relationships. This increases model capacity but risks overfitting.
Naive Bayes is a generative classifier that applies Bayes theorem with a strong conditional independence assumption: given the class y, all features x_j are independent.
Bayes rule: P(y|x) = P(x|y)P(y) / P(x)
Naive assumption: P(x|y) = P(x_1|y) P(x_2|y) ... P(x_d|y)
Why "naive"? In reality, features are often correlated. But empirically, Naive Bayes often performs well despite this violation.
Class prior: P(y=k) = count(y^(i)=k) / m
Feature likelihood: P(x_j|y=k) depends on feature type (Bernoulli, Multinomial, Gaussian).
Bernoulli NB: P(x_j|y) ∈ {0, 1} binary feature
Multinomial NB: x_j ∈ {1,...,V} categorical
Gaussian NB: P(x_j|y) = N(μ_{jk}, σ²_{jk})
Problem: If feature j never appears with class k in training data, P(x_j|y=k) = 0, making the entire P(y|x) = 0 due to multiplication.
Solution: Add pseudocount α (usually α=1 for Laplace smoothing):
P(x_j=v|y=k) = (count(x_j=v, y=k) + α) / (count(y=k) + α·|V_j|)
where |V_j| is the number of unique values of feature j.
Task: Classify email as spam (y=1) or ham (y=0).
Features: Word indicators (binary). x_j = 1 if word j appears, 0 otherwise.
Training: Count word frequencies in spam vs. ham emails. Estimate P(word_j|spam) and P(word_j|ham).
Prediction: P(spam|email) ∝ P(email|spam)P(spam) = P(spam) ∏ P(word_j|spam)^{x_j}
Log-odds trick: Avoid underflow by computing log-probabilities: log P(y=1|x) = log P(y=1) + Σ log P(x_j|y=1)
• Generative (learns P(x|y))
• Fast training (just count)
• Works with small data
• Clear probability estimates
• Less flexible decision boundary
• Discriminative (learns P(y|x))
• Gradient-based optimization
• Needs more data for good fit
• No explicit P(x) model
• More flexible, can fit better
Naive Bayes learns faster with small data (O(n) samples), but logistic regression has lower asymptotic error as n → ∞ (both converge, but LR has lower limit).
Support Vector Machines (SVMs) find the hyperplane that maximizes the margin between classes. They use hinge loss and can employ the kernel trick for nonlinear classification.
Geometric margin: Distance from point x to hyperplane w^T x + b = 0 is |w^T x + b| / ||w||.
Margin γ: Minimum distance of any training point to the hyperplane. We want to maximize γ.
Hard-margin SVM (linearly separable):
min (1/2) ||w||² w,b s.t. y^(i)(w^T x^(i) + b) ≥ 1 ∀i
The constraint ensures points are at least distance 1/||w|| from the hyperplane. Minimizing ||w|| maximizes margin.
Introduce slack variables ξ_i ≥ 0 to allow violations:
min (1/2) ||w||² + C Σ ξ_i
w,b,ξ
s.t. y^(i)(w^T x^(i) + b) ≥ 1 - ξ_i ∀i
ξ_i ≥ 0
Parameter C: Regularization trade-off. Large C → favor margin, allow few violations. Small C → allow more misclassifications.
Soft-margin SVM minimizes:
J(w, b) = (1/2) ||w||² + C Σ max(0, 1 - y^(i)(w^T x^(i) + b))
The term max(0, 1 - y·f(x)) is hinge loss. It is 0 when the point is correctly classified with margin ≥ 1.
Primal: Find w, b. Hard to solve for large d (# features).
Dual (Lagrangian):
max Σ α_i - (1/2) Σ α_i α_j y^(i) y^(j) (x^(i) · x^(j)) α s.t. 0 ≤ α_i ≤ C, Σ α_i y^(i) = 0
Kernel trick: Replace inner products x^(i) · x^(j) with K(x^(i), x^(j)) = φ(x^(i)) · φ(x^(j)) without computing φ explicitly.
Linear: K(x,x') = x · x'Polynomial: K(x,x') = (γ x·x' + r)^dRBF (Gaussian): K(x,x') = exp(-γ ||x - x'||²)Sigmoid: K(x,x') = tanh(γ x·x' + r)Softmax Regression generalizes logistic regression from binary to K-class classification. It models P(y|x) as a categorical distribution using the softmax function.
Definition: σ_k(z) = e^(z_k) / Σ_j e^(z_j)
Input: z ∈ R^K (K logits, one per class).
Output: Vector of K probabilities that sum to 1.
Properties: Differentiable, maintains probability interpretation, generalizes sigmoid (K=2 case).
z_k = w_k^T x + b_k for k = 1,...,K P(y=k|x) = e^(z_k) / Σ_j e^(z_j) = softmax(z)_k
Parameters: w_1,...,w_K ∈ R^d and b_1,...,b_K ∈ R (total (d+1)K parameters).
For one sample: L(y, ŷ) = -Σ_k 1{y=k} log ŷ_k = -log ŷ_y
If y=2 and ŷ = [0.1, 0.7, 0.2], loss = -log(0.7) ≈ 0.357.
Batch loss: J(W, b) = (1/m) Σ_i L(y^(i), softmax(W^T x^(i) + b))
Gradient: ∂J/∂w_k = (1/m) Σ_i (σ_k(z^(i)) - 1{y^(i)=k}) x^(i)
Gradient update: w_k := w_k - α ∂J/∂w_k (standard gradient descent)
Softmax is convex, so gradient descent converges to global optimum.
• Train K binary classifiers
• Class k vs. all others
• Takes max confidence
• Simple but less unified
• K independent models
• Single K-class model
• Directly optimizes joint distribution
• Probabilities sum to 1
• More principled (MLE)
• Shared representation
Softmax is the standard output layer in neural networks for multiclass classification. Hidden layers learn nonlinear features, softmax outputs class probabilities. Logistic regression is the single-layer version (no hidden layers).
The shape of the decision boundary reveals each algorithm's implicit assumptions about the data. Here we compare them on a 2D synthetic dataset.
Linear (Logistic, LDA): Straight line or hyperplane. Models P(y|x) or P(x|y) with shared covariance.Quadratic (QDA): Conic sections (ellipse, parabola). Each class has own Gaussian covariance.Nonlinear (SVM + kernel): Complex curved boundary. Kernel trick maps to higher-dimensional space.Axis-aligned (Naive Bayes): Rectangular regions. Independence assumption creates orthogonal splits.High bias, low variance: Logistic regression / LDA. Simple linear model, stable but may underfit.
Low bias, high variance: SVM with RBF, QDA. Flexible models, fit the data well but may overfit on small n.
Middle ground: Naive Bayes. Makes strong independence assumptions but learns quickly from small data.
In logistic regression, increasing λ (L2 penalty) pushes weights toward 0, tilting the boundary closer to the origin and making the decision surface more conservative. Decreasing λ allows the boundary to fit the training data more tightly.
In SVMs, smaller C (penalty for margin violations) allows more slack, softening the boundary. Larger C hardens the boundary to fit training data precisely.
This section provides a practical flowchart for selecting the right algorithm. No single algorithm is best—context matters.
First choice: Naive Bayes or LDA.
Generative models converge faster. Low parameter count, don't need much data to estimate Gaussian.
First choice: Logistic Regression or SVM.
Discriminative models have lower asymptotic error. Scale well with good solvers (SGD).
Best: Logistic Regression.
L2 regularization handles many features. Linear models scale. Avoid dense covariance matrices.
Best: Naive Bayes or Logistic Reg + L1.
NB excels with conditional independence. L1 does automatic feature selection.
Best: Logistic Regression or LDA.
Weights/means directly show feature importance. Clear probability estimates. SVM margins are geometric but less intuitive.
Fastest: Naive Bayes (just count frequencies).
Logistic Reg (convex, fast convergence). SVM slower due to quadratic solver. QDA medium.
Best: SVM + kernel or Neural Networks.
Logistic Reg + polynomial features works but manual feature engineering. Kernels are automatic.
Strategies: Use class_weight=balanced in SVM/LR. Oversample minority. Adjust C or cost parameters.
References and sources for further study on the topics covered in this deep dive.