Logistic Regression

01

Binary Classification Problem

Binary classification extends regression to predict discrete labels: 0 or 1, negative or positive class. Unlike regression where we predict continuous values, classification requires discrete outputs and a natural threshold at probability 0.5. The central challenge arises because linear regression can produce values outside [0,1], making it unsuitable for probability predictions.

The decision boundary—the separating surface where the classifier transitions from predicting class 0 to class 1—is fundamental to classification geometry. In logistic regression, this boundary is linear in the feature space, enabling efficient learning while maintaining interpretability. Classification accuracy depends critically on feature engineering and the separability of the two classes.

Binary

Output Space

Threshold

At 0.5

Linear

Decision Boundary

XCS229

Week 2

02

The Sigmoid Function

The sigmoid function σ(z) = 1/(1+e^(-z)) elegantly maps all real numbers to the interval [0,1], making it ideal for probability estimation. The sigmoid is monotonically increasing, differentiable everywhere, and asymptotically approaches 0 and 1 at the extremes. Its S-shaped curve reflects the intuitive behavior of decision-making: uncertain predictions near z=0, confident predictions far from zero.

Logistic regression applies the sigmoid to the linear predictor: P(y=1|x) = σ(θᵀx). This probabilistic interpretation allows Bayesian reasoning and principled uncertainty quantification. The decision boundary occurs where σ(θᵀx) = 0.5, equivalently where θᵀx = 0, making classification geometrically equivalent to linear separation in the original feature space.

Sigmoid Properties

Monotonic, differentiable, squashes all reals to [0,1], with σ'(z) = σ(z)(1−σ(z)).

Probabilistic View

Output represents P(y=1|x,θ), enabling Bayesian interpretation and uncertainty estimates.

Decision Boundary

Occurs at σ(θᵀx) = 0.5, equivalent to linear boundary θᵀx = 0 in feature space.

Gradient Friendly

Smooth derivative enables efficient gradient-based optimization without discrete jumps.

03

Maximum Likelihood Estimation

Maximum likelihood estimation frames logistic regression as finding parameters that maximize the probability of observing the training data. For binary classification with Bernoulli labels, the likelihood is L(θ) = ∏ᵢ P(yᵢ|xᵢ,θ), where P(y=1|x) = σ(θᵀx) and P(y=0|x) = 1−σ(θᵀx). This can be rewritten as a single expression: P(y|x) = σ(θᵀx)^y · (1−σ(θᵀx))^(1−y).

Taking logarithms transforms the product into a sum: ℓ(θ) = Σᵢ [yᵢ log σ(θᵀxᵢ) + (1−yᵢ) log(1−σ(θᵀxᵢ))]. This is the log-likelihood or cross-entropy loss. Maximizing log-likelihood is equivalent to minimizing cross-entropy error, providing the objective function for optimization. The gradient ∂ℓ/∂θ = Σᵢ (yᵢ − σ(θᵀxᵢ)) xᵢ has an elegant form that suggests gradient ascent.

Bernoulli

Label Distribution

Log-likelihood

Objective

Cross-entropy

Loss Equiv.

y ∈ {0,1}

Output Space

04

Gradient Update Rule

The gradient ascent update rule for logistic regression closely resembles the least-mean-squares (LMS) rule from linear regression: θ := θ + α Σᵢ (yᵢ − hθ(xᵢ)) xᵢ. However, here hθ(x) = σ(θᵀx) rather than θᵀx, making this a nonlinear optimization problem despite the linear decision boundary. Each iteration moves parameters in the direction of increasing likelihood, with step size α controlling convergence speed.

Stochastic gradient ascent updates on individual examples: θ := θ + α (yᵢ − σ(θᵀxᵢ)) xᵢ. This variant converges faster in practice and handles large datasets efficiently. The algorithm terminates when the gradient norm falls below a threshold or after a fixed number of iterations. Unlike the closed-form solution for linear regression, logistic regression requires iterative optimization, but convergence is guaranteed for convex objectives like log-likelihood.

Key Insight

The update rule closely mirrors LMS, but the sigmoid nonlinearity makes this a harder optimization problem that generally requires iterative methods rather than closed-form solutions.

Convex objective Guaranteed convergence Stochastic variant Differentiable everywhere

05

The Perceptron Algorithm

The perceptron algorithm predates logistic regression and uses a simple classification rule: predict 1 if θᵀx ≥ 0, else predict 0. During training, update parameters only when misclassifications occur: θ := θ + y xᵢ when y(θᵀx) ≤ 0. This greedy approach has provable convergence guarantees when the data is linearly separable, formalized in the perceptron convergence theorem.

The perceptron's strength lies in simplicity and geometric interpretation: it finds any separating hyperplane when one exists. However, it produces no probabilistic output, provides no natural measure of classification confidence, and guarantees nothing when data isn't linearly separable. Logistic regression extends perceptron ideas by outputting probabilities and optimizing a principled loss function, making it more robust and interpretable than the perceptron.

Strengths

Convergence guaranteed on separable data
Simple geometric interpretation
Efficient training with no hyperparameters
Linear decision boundary

Weaknesses

No probabilistic output
Fails on non-separable data
No confidence estimates
Solution non-unique on separable data

06

Multi-class Softmax Regression

Extending classification to K classes (K > 2) requires replacing the binomial Bernoulli distribution with the multinomial distribution. Softmax regression outputs a K-dimensional probability vector where each component represents P(y=k|x) = exp(θₖᵀx) / Σⱼ exp(θⱼᵀx). The softmax function normalizes K scores into valid probabilities summing to 1, generalizing the sigmoid which is softmax for binary (K=2) classes.

The multinomial log-likelihood extends naturally: ℓ(θ) = Σᵢ Σₖ 𝟙[yᵢ=k] log P(y=k|xᵢ). This is also called categorical cross-entropy loss, the fundamental objective for multi-class classification. The gradient and optimization follow the same principled approach as binary logistic regression. One-vs-all (OVA) offers an alternative: train K independent binary classifiers and predict argmax of the K scores, though softmax is more principled.

07

Newton's method is a second-order optimization technique that converges faster than gradient descent by incorporating curvature information via the Hessian matrix H, where H_ij = ∂²ℓ/∂θᵢ∂θⱼ. The update rule is θ := θ + H⁻¹ ∇ℓ(θ). For logistic regression, the Hessian is H = −XᵀD(σ(Xθ))X where D is a diagonal matrix of sigmoid derivatives, making H positive definite and the objective strictly convex.

Newton's method typically converges in far fewer iterations than gradient ascent, often achieving Newton-level quadratic convergence near the optimum. However, computing and inverting the Hessian scales as O(d³) in dimension d, making it expensive for high-dimensional problems. In practice, quasi-Newton methods like BFGS approximate the Hessian efficiently. For logistic regression on moderate-sized datasets, Newton's method and its variants are standard in optimization libraries due to superior convergence speed.

Convergence Benefit

Newton's method exhibits quadratic convergence (error squared each iteration), vastly faster than the linear convergence of gradient ascent. The trade-off is higher cost per iteration due to Hessian computation.

08

Linear & Nonlinear Boundaries

The decision boundary of logistic regression is linear: the set {x : θᵀx = 0}. This linear boundary directly corresponds to the geometry where σ(θᵀx) = 0.5, the natural classification threshold. Linear boundaries are geometrically interpretable and computationally efficient, but they lack expressiveness for data with complex, nonlinear separation patterns.

Nonlinear boundaries arise by feature engineering: applying nonlinear transformations φ(x) before classification. For example, adding polynomial features like x₁², x₁x₂, x₂² to a 2D input creates a 5D feature space where logistic regression can learn curved boundaries in the original space. This approach preserves the convexity and efficient optimization of logistic regression while gaining expressiveness. Deep neural networks generalize this idea, learning hierarchical feature transformations automatically, though at the cost of losing the convexity guarantee.

09

References & Further Reading

Key References:

Binary Classification Problem

The Sigmoid Function

Sigmoid Properties

Probabilistic View

Decision Boundary

Gradient Friendly

Maximum Likelihood Estimation

Gradient Update Rule

Key Insight

The Perceptron Algorithm

Strengths

Weaknesses

Multi-class Softmax Regression

Convergence Benefit

Linear & Nonlinear Boundaries

References & Further Reading

Why Linear Regression Fails

Decision Boundaries and Geometry

Class Imbalance

Feature Scaling

Regularization

Confidence Calibration

σ(z) = 1/(1+e−z)

Historical Context and Motivation

Bernoulli Log-Likelihood

Deriving the Gradient

Convexity Guarantee

θ := θ + α∇ℓ(θ)

Convergence Analysis

Batch Update

Stochastic Update

Mini-Batch

Early Stopping

Threshold Decision Rule

Perceptron vs. Logistic Regression

K-Class Multinomial Classification

One-vs-All Alternative

Numerical Stability

Second-Order Optimization

Convergence Rates and Computational Cost

Linear & Nonlinear Separation

Nonlinear Boundaries via Feature Engineering

Linear Boundaries

Nonlinear Boundaries

Sources & References

Course Materials

Textbooks & Papers

σ(z) = 1/(1+e^−z)