Binary classification extends regression to predict discrete labels: 0 or 1, negative or positive class. Unlike regression where we predict continuous values, classification requires discrete outputs and a natural threshold at probability 0.5. The central challenge arises because linear regression can produce values outside [0,1], making it unsuitable for probability predictions.
The decision boundary—the separating surface where the classifier transitions from predicting class 0 to class 1—is fundamental to classification geometry. In logistic regression, this boundary is linear in the feature space, enabling efficient learning while maintaining interpretability. Classification accuracy depends critically on feature engineering and the separability of the two classes.
Binary
Output Space
Threshold
At 0.5
Linear
Decision Boundary
XCS229
Week 2
02
The Sigmoid Function
The sigmoid function σ(z) = 1/(1+e^(-z)) elegantly maps all real numbers to the interval [0,1], making it ideal for probability estimation. The sigmoid is monotonically increasing, differentiable everywhere, and asymptotically approaches 0 and 1 at the extremes. Its S-shaped curve reflects the intuitive behavior of decision-making: uncertain predictions near z=0, confident predictions far from zero.
Logistic regression applies the sigmoid to the linear predictor: P(y=1|x) = σ(θᵀx). This probabilistic interpretation allows Bayesian reasoning and principled uncertainty quantification. The decision boundary occurs where σ(θᵀx) = 0.5, equivalently where θᵀx = 0, making classification geometrically equivalent to linear separation in the original feature space.
Sigmoid Properties
Monotonic, differentiable, squashes all reals to [0,1], with σ'(z) = σ(z)(1−σ(z)).
Probabilistic View
Output represents P(y=1|x,θ), enabling Bayesian interpretation and uncertainty estimates.
Decision Boundary
Occurs at σ(θᵀx) = 0.5, equivalent to linear boundary θᵀx = 0 in feature space.
Gradient Friendly
Smooth derivative enables efficient gradient-based optimization without discrete jumps.
03
Maximum Likelihood Estimation
Maximum likelihood estimation frames logistic regression as finding parameters that maximize the probability of observing the training data. For binary classification with Bernoulli labels, the likelihood is L(θ) = ∏ᵢ P(yᵢ|xᵢ,θ), where P(y=1|x) = σ(θᵀx) and P(y=0|x) = 1−σ(θᵀx). This can be rewritten as a single expression: P(y|x) = σ(θᵀx)^y · (1−σ(θᵀx))^(1−y).
Taking logarithms transforms the product into a sum: ℓ(θ) = Σᵢ [yᵢ log σ(θᵀxᵢ) + (1−yᵢ) log(1−σ(θᵀxᵢ))]. This is the log-likelihood or cross-entropy loss. Maximizing log-likelihood is equivalent to minimizing cross-entropy error, providing the objective function for optimization. The gradient ∂ℓ/∂θ = Σᵢ (yᵢ − σ(θᵀxᵢ)) xᵢ has an elegant form that suggests gradient ascent.
Bernoulli
Label Distribution
Log-likelihood
Objective
Cross-entropy
Loss Equiv.
y ∈ {0,1}
Output Space
04
Gradient Update Rule
The gradient ascent update rule for logistic regression closely resembles the least-mean-squares (LMS) rule from linear regression: θ := θ + α Σᵢ (yᵢ − hθ(xᵢ)) xᵢ. However, here hθ(x) = σ(θᵀx) rather than θᵀx, making this a nonlinear optimization problem despite the linear decision boundary. Each iteration moves parameters in the direction of increasing likelihood, with step size α controlling convergence speed.
Stochastic gradient ascent updates on individual examples: θ := θ + α (yᵢ − σ(θᵀxᵢ)) xᵢ. This variant converges faster in practice and handles large datasets efficiently. The algorithm terminates when the gradient norm falls below a threshold or after a fixed number of iterations. Unlike the closed-form solution for linear regression, logistic regression requires iterative optimization, but convergence is guaranteed for convex objectives like log-likelihood.
Key Insight
The update rule closely mirrors LMS, but the sigmoid nonlinearity makes this a harder optimization problem that generally requires iterative methods rather than closed-form solutions.
The perceptron algorithm predates logistic regression and uses a simple classification rule: predict 1 if θᵀx ≥ 0, else predict 0. During training, update parameters only when misclassifications occur: θ := θ + y xᵢ when y(θᵀx) ≤ 0. This greedy approach has provable convergence guarantees when the data is linearly separable, formalized in the perceptron convergence theorem.
The perceptron's strength lies in simplicity and geometric interpretation: it finds any separating hyperplane when one exists. However, it produces no probabilistic output, provides no natural measure of classification confidence, and guarantees nothing when data isn't linearly separable. Logistic regression extends perceptron ideas by outputting probabilities and optimizing a principled loss function, making it more robust and interpretable than the perceptron.
Strengths
Convergence guaranteed on separable data
Simple geometric interpretation
Efficient training with no hyperparameters
Linear decision boundary
Weaknesses
No probabilistic output
Fails on non-separable data
No confidence estimates
Solution non-unique on separable data
06
Multi-class Softmax Regression
Extending classification to K classes (K > 2) requires replacing the binomial Bernoulli distribution with the multinomial distribution. Softmax regression outputs a K-dimensional probability vector where each component represents P(y=k|x) = exp(θₖᵀx) / Σⱼ exp(θⱼᵀx). The softmax function normalizes K scores into valid probabilities summing to 1, generalizing the sigmoid which is softmax for binary (K=2) classes.
The multinomial log-likelihood extends naturally: ℓ(θ) = Σᵢ Σₖ 𝟙[yᵢ=k] log P(y=k|xᵢ). This is also called categorical cross-entropy loss, the fundamental objective for multi-class classification. The gradient and optimization follow the same principled approach as binary logistic regression. One-vs-all (OVA) offers an alternative: train K independent binary classifiers and predict argmax of the K scores, though softmax is more principled.
07
Newton's method is a second-order optimization technique that converges faster than gradient descent by incorporating curvature information via the Hessian matrix H, where H_ij = ∂²ℓ/∂θᵢ∂θⱼ. The update rule is θ := θ + H⁻¹ ∇ℓ(θ). For logistic regression, the Hessian is H = −XᵀD(σ(Xθ))X where D is a diagonal matrix of sigmoid derivatives, making H positive definite and the objective strictly convex.
Newton's method typically converges in far fewer iterations than gradient ascent, often achieving Newton-level quadratic convergence near the optimum. However, computing and inverting the Hessian scales as O(d³) in dimension d, making it expensive for high-dimensional problems. In practice, quasi-Newton methods like BFGS approximate the Hessian efficiently. For logistic regression on moderate-sized datasets, Newton's method and its variants are standard in optimization libraries due to superior convergence speed.
Convergence Benefit
Newton's method exhibits quadratic convergence (error squared each iteration), vastly faster than the linear convergence of gradient ascent. The trade-off is higher cost per iteration due to Hessian computation.
08
Linear & Nonlinear Boundaries
The decision boundary of logistic regression is linear: the set {x : θᵀx = 0}. This linear boundary directly corresponds to the geometry where σ(θᵀx) = 0.5, the natural classification threshold. Linear boundaries are geometrically interpretable and computationally efficient, but they lack expressiveness for data with complex, nonlinear separation patterns.
Nonlinear boundaries arise by feature engineering: applying nonlinear transformations φ(x) before classification. For example, adding polynomial features like x₁², x₁x₂, x₂² to a 2D input creates a 5D feature space where logistic regression can learn curved boundaries in the original space. This approach preserves the convexity and efficient optimization of logistic regression while gaining expressiveness. Deep neural networks generalize this idea, learning hierarchical feature transformations automatically, though at the cost of losing the convexity guarantee.
Linear regression predicts unbounded real values hθ(x) = θᵀx, which is fundamentally incompatible with classification where we need discrete outputs. When applied naively to binary classification, linear regression often produces predictions well outside [0,1], making probability interpretation impossible. A threshold like "predict 1 if hθ(x) ≥ 0.5" is arbitrary and lacks principled justification.
Consider the geometric interpretation: linear regression minimizes squared error between predictions and labels {0,1}. This loss function is symmetric and unbounded, driving predictions toward extremes. Outliers in one class push the decision boundary far from the natural separation between classes. An input far from the training data might receive predictions like −10 or +50, nonsensical probabilities.
Decision Boundaries and Geometry
The decision boundary is the set of points where the classifier transitions between predictions. For logistic regression with P(y=1|x) = σ(θᵀx), the natural boundary is where σ(θᵀx) = 0.5. Since the sigmoid is monotonic, this occurs exactly where θᵀx = 0, a d-dimensional hyperplane in feature space. This linear boundary has elegant geometric properties: it's determined by only d parameters, interpretable as the normal vector indicating the direction of increasing probability.
Separability and margins determine classification difficulty. Data is linearly separable if a hyperplane perfectly divides the classes, enabling perceptron convergence and logistic regression with zero training error. When classes overlap, no linear boundary perfectly separates them, but logistic regression still learns the best linear boundary in maximum likelihood sense. The margin—minimum distance from decision boundary to training points—quantifies robustness: larger margins indicate more confident, generalizable classifiers.
Class Imbalance
Unequal class frequencies bias optimization toward the majority class. Cost-sensitive learning or weighted loss functions correct this imbalance.
Feature Scaling
Gradient ascent convergence depends on feature scales. Normalization to unit variance accelerates optimization and stabilizes learning.
Regularization
L1 (Lasso) and L2 (Ridge) penalties prevent overfitting by constraining parameter magnitudes. L1 enables feature selection.
Confidence Calibration
Logistic regression outputs should reflect true probabilities. Platt scaling or Isotonic regression improves calibration post-hoc.
Sigmoid Function
σ(z) = 1/(1+e−z)
The sigmoid function σ(z) = 1/(1+e⁻ᶻ) is defined for all real z and outputs values in (0,1), making it ideal for modeling probabilities. Algebraically, it satisfies the identity 1 − σ(z) = σ(−z), a symmetry useful in derivations. The sigmoid is bounded, monotonic, and smooth—properties essential for reliable optimization.
Differentiating the sigmoid yields σ'(z) = σ(z)(1−σ(z)), a remarkably clean form. This derivative is positive everywhere, confirming monotonicity, and reaches maximum 0.25 at z=0, where uncertainty is highest. The derivative depends only on σ(z), enabling efficient computation without re-evaluating the exponential. This mathematical elegance makes sigmoid the canonical choice for binary classification.
Historical Context and Motivation
The sigmoid emerges naturally from multiple perspectives. In information theory, it's the canonical link function for Bernoulli distributions in generalized linear models (GLMs), meaning it connects linear predictors to class probabilities optimally under maximum likelihood. In neuroscience, it approximates the activation function of biological neurons, bounded between resting and firing rates. These foundations explain why sigmoid is ubiquitous across machine learning and deep learning.
Alternatives like the hyperbolic tangent tanh(z) = (eᶻ − e⁻ᶻ)/(eᶻ + e⁻ᶻ) produce outputs in (−1,1) and are sometimes used in neural networks. The logit function logit(p) = log(p/(1−p)) is the inverse sigmoid, converting probabilities to unbounded scores. These related functions share the sigmoid's smoothness and mathematical convenience, though sigmoid dominates binary classification due to its direct probabilistic interpretation.
Monotonic increasingOutputs in (0,1)Smooth everywhereσ'(z) = σ(z)(1−σ(z))Symmetric: 1−σ(z)=σ(−z)
Maximum Likelihood
Bernoulli Log-Likelihood
Maximum likelihood estimation (MLE) selects parameters that maximize the probability of observing the training data. For binary classification, assuming each label yᵢ is drawn independently from Bernoulli(σ(θᵀxᵢ)), the likelihood is L(θ) = ∏ᵢ P(yᵢ|xᵢ,θ). This product expands to L(θ) = ∏ᵢ σ(θᵀxᵢ)^yᵢ · (1−σ(θᵀxᵢ))^(1−yᵢ).
Taking logarithms converts the intractable product into a tractable sum: ℓ(θ) = log L(θ) = Σᵢ [yᵢ log σ(θᵀxᵢ) + (1−yᵢ) log(1−σ(θᵀxᵢ))]. This is the log-likelihood. Equivalently, we minimize the negative log-likelihood (NLL) or cross-entropy loss J(θ) = −ℓ(θ). Cross-entropy measures the divergence between the predicted distribution σ(θᵀxᵢ) and the true distribution yᵢ, motivating the name.
Deriving the Gradient
To optimize, we compute ∂ℓ/∂θ = Σᵢ [yᵢ · σ'(θᵀxᵢ)xᵢ/σ(θᵀxᵢ) − (1−yᵢ) · σ'(θᵀxᵢ)xᵢ/(1−σ(θᵀxᵢ))]. Substituting σ'(z) = σ(z)(1−σ(z)) and simplifying yields ∂ℓ/∂θ = Σᵢ (yᵢ − σ(θᵀxᵢ)) xᵢ. This elegant gradient has a clear interpretation: update proportional to the error (yᵢ − prediction), weighted by the input xᵢ. Large errors drive large updates; zero errors stop updates.
The objective ℓ(θ) is strictly concave in θ (for logistic regression), guaranteeing a unique global optimum. This convexity is a major advantage over many ML models: any local optimum found by gradient ascent is the global optimum. Combined with the smooth gradient everywhere, this ensures reliable, well-behaved optimization without saddle points or multiple local minima to confuse learning algorithms.
Convexity Guarantee
The log-likelihood objective is strictly concave, ensuring any local optimum is global. Gradient ascent reliably converges to the unique best solution.
Step 1
Define Bernoulli model P(y|x,θ) = σ(θᵀx)^y (1−σ(θᵀx))^(1−y)
Step 2
Form likelihood L(θ) = ∏ᵢ P(yᵢ|xᵢ,θ)
Step 3
Take logarithm: ℓ(θ) = Σᵢ log P(yᵢ|xᵢ,θ)
Step 4
Compute gradient: ∇ℓ(θ) = Σᵢ (yᵢ − σ(θᵀxᵢ)) xᵢ
Step 5
Optimize via gradient ascent with guaranteed convergence
Gradient Ascent Update
θ := θ + α∇ℓ(θ)
The gradient ascent algorithm iteratively moves parameters in the direction of increasing likelihood. Starting from an initial θ₀ (often zero), repeatedly apply θₜ₊₁ := θₜ + α ∇ℓ(θₜ). The learning rate α controls step size: too large risks divergence, too small causes slow convergence. For convex objectives like log-likelihood, careful choice of α ensures convergence to the optimum.
Stochastic gradient ascent (SGA) updates on mini-batches or single examples: θ := θ + α(yᵢ − σ(θᵀxᵢ))xᵢ. This variant introduces noise but converges faster in practice, requires less memory, and scales to massive datasets. Modern implementations use adaptive learning rates (Adam, RMSprop) that adjust α per parameter based on gradient history, further accelerating convergence without careful manual tuning.
Convergence Analysis
Batch gradient ascent on convex, L-smooth objectives converges at rate O(1/t), meaning error decreases inversely with iterations. With diminishing step sizes αₜ = 1/t, convergence is guaranteed. Accelerated methods like Nesterov momentum achieve O(1/t²) rates. For strictly convex objectives like logistic regression, convergence to the unique global optimum is guaranteed for any positive α smaller than 2/L where L is the Lipschitz constant of ∇ℓ.
In practice, algorithms detect convergence when ||∇ℓ(θ)|| < ε for small ε, or after a fixed iteration budget. For large datasets, stochastic variants with warm-start initialization (using a model trained on a subset) provide excellent practical performance. The challenge in production systems is often not convergence but avoiding overfitting through regularization and validation.
Batch Update
θ := θ + α Σᵢ (yᵢ − σ(θᵀxᵢ)) xᵢ accumulates gradients over all examples, stable but slow on huge datasets.
Stochastic Update
θ := θ + α(yᵢ − σ(θᵀxᵢ)) xᵢ updates per-example, noisy but faster and more memory-efficient.
Mini-Batch
Process B examples per update, balancing stability of batch and speed of stochastic methods. B ∈ [32,256] is typical.
Early Stopping
Monitor validation error and stop if it increases, preventing overfitting without explicit regularization penalty.
Perceptron Algorithm
Threshold Decision Rule
The perceptron, introduced by Rosenblatt in 1958, is the simplest linear classifier. It outputs a discrete prediction: hθ(x) = 1 if θᵀx ≥ 0, else hθ(x) = 0. During training, parameters update only on misclassifications: if hθ(xᵢ) ≠ yᵢ, apply θ := θ + yᵢ xᵢ (where y ∈ {−1,+1} or equivalently offset by 1). This greedy update moves the decision boundary directly toward misclassified points.
The perceptron convergence theorem states that if data is linearly separable (exists θ* with yᵢ(θ*ᵀxᵢ) > 0 for all i), the perceptron algorithm terminates with a separating hyperplane in O(1/γ²) iterations, where γ is the margin of θ*. This is a remarkable guarantee: finite convergence without probabilistic assumptions, just linear separability. However, the theorem says nothing about non-separable data—the algorithm loops forever if no perfect separator exists.
Perceptron vs. Logistic Regression
Logistic regression improves upon the perceptron by outputting probabilities and optimizing a principled loss function over all examples, not just misclassifications. This principled approach enables several advantages: probabilistic predictions for uncertainty quantification, robust performance on non-separable data, direct optimization of a global objective rather than local greedy updates, and superior generalization through regularization. Logistic regression effectively combines the geometric appeal of perceptrons with the statistical rigor of maximum likelihood.
Modern deep learning has revived interest in perceptron-like architectures: neural networks are essentially nested perceptrons with nonlinear activations. The perceptron's simplicity also makes it useful for understanding fundamental concepts in learning theory, such as the VC dimension, linear separability, and generalization bounds. For practical classification, however, logistic regression and gradient-boosted trees dominate due to superior performance and robustness.
Softmax regression generalizes logistic regression to K ≥ 2 classes via the softmax function. Given K score vectors θ₁,...,θₖ ∈ ℝᵈ, the predicted probabilities are P(y=k|x) = exp(θₖᵀx) / Σⱼ exp(θⱼᵀx). The softmax normalizes K scores into a valid probability distribution (summing to 1, each ∈ [0,1]). For binary classification (K=2), softmax reduces to logistic regression via the equivalence P(y=1|x) = σ(θ₁ᵀx).
The multinomial log-likelihood is ℓ(θ) = Σᵢ log P(yᵢ|xᵢ) = Σᵢ θ_{yᵢ}ᵀxᵢ − Σᵢ log Σⱼ exp(θⱼᵀxᵢ). This objective is convex, enabling globally optimal solutions via gradient-based optimization. The gradient ∂ℓ/∂θₖ = Σᵢ xᵢ (𝟙[yᵢ=k] − P(y=k|xᵢ)) parallels binary logistic regression: update based on prediction error for class k.
One-vs-All Alternative
An older approach, one-vs-all (OVA), trains K independent binary classifiers: one for each class against all others. During prediction, output the class with highest score. OVA is simpler to implement and parallelize but lacks principled probabilistic interpretation. The K classifiers optimize independently without enforcing that predictions sum to 1. Softmax is theoretically superior and now standard due to better probabilities and unified optimization, though OVA remains pragmatically useful when training many binary models is easier than joint optimization.
Softmax generalizes further in neural networks as the output layer for multi-class problems. Combined with cross-entropy loss, it provides excellent calibration (output probabilities reflect true class frequencies). Numerical stability is crucial: directly computing softmax risks overflow/underflow. The standard trick is to subtract max(scores) before exponentiation: softmax(s) = softmax(s − max(s)), maintaining numerical precision.
Numerical Stability
Compute softmax(scores) = exp(scores − max(scores)) / sum(exp(scores − max(scores))) to avoid overflow on large scores and underflow on small scores.
K classes
Output Space
Softmax
Normalization
Convex
Objective
Cross-entropy
Loss Equiv.
Newton's Method
Second-Order Optimization
Newton's method is a classical optimization algorithm that uses both first-order (gradient) and second-order (Hessian) information. The update rule is θ := θ − H⁻¹∇ℓ(θ), where H is the Hessian matrix Hᵢⱼ = ∂²ℓ/∂θᵢ∂θⱼ. Geometrically, Newton's method fits a quadratic approximation to the objective at the current point and jumps directly to its optimum. For quadratic objectives, this reaches the optimum in one step; for general objectives, it converges rapidly near the solution.
For logistic regression, the Hessian is H = −XᵀDX where D is a diagonal matrix with entries Dᵢᵢ = σ(θᵀxᵢ)(1−σ(θᵀxᵢ)). Since D is positive definite and X has full rank, H is negative definite, confirming the objective is strictly concave. The Newton step direction H⁻¹∇ℓ(θ) is guaranteed to be an ascent direction, and the step size is automatically appropriate (no learning rate tuning).
Convergence Rates and Computational Cost
Newton's method exhibits quadratic convergence: in a neighborhood of the optimum, the error squares each iteration, resulting in exponential convergence. Gradient descent has linear convergence (error multiplied by constant < 1). For a target error ε, Newton's method needs O(log log(1/ε)) iterations vs. O(log(1/ε)) for gradient descent—a dramatic improvement. However, each Newton iteration requires computing and inverting the d×d Hessian, costing O(d³) operations. For d > 1000, this is prohibitive.
Quasi-Newton methods like BFGS and L-BFGS approximate the Hessian using gradient information, reducing cost to O(d²) per iteration while maintaining superlinear convergence. L-BFGS, limited-memory BFGS, further reduces memory from O(d²) to O(d), making it practical for high-dimensional problems. In practice, modern solvers for logistic regression (sklearn, statsmodels, TensorFlow) use L-BFGS or stochastic variants, balancing convergence speed and computational feasibility.
Quadratic convergenceO(d³) per iterationH = HessianH⁻¹∇ℓ stepNo learning rate
Decision Boundaries
Linear & Nonlinear Separation
The decision boundary of logistic regression applied to features x is the hyperplane {x : θᵀx = 0}. This linear boundary directly corresponds to the threshold σ(θᵀx) = 0.5. The normal vector θ points in the direction of increasing probability; moving along θ increases σ(θᵀx). The distance from a point x to the boundary is |θᵀx| / ||θ||, related to the margin in support vector machine terminology.
Linear boundaries are geometrically interpretable and computationally efficient—just d parameters to store and O(d) to evaluate. However, they lack expressiveness for data with complex separation patterns. XOR is the canonical example: in 2D, no line separates {(0,0), (1,1)} from {(0,1), (1,0)}, so linear classifiers fail. Nonlinear boundaries solve this via feature engineering.
Nonlinear Boundaries via Feature Engineering
Applying nonlinear feature transformations φ(x) before logistic regression enables curved decision boundaries in the original space. For example, adding polynomial features φ(x) = [x₁, x₂, x₁², x₁x₂, x₂²]ᵀ to 2D input creates a 5D space where logistic regression can learn curves. Kernel methods like support vector machines implicitly apply infinite-dimensional feature maps. Deep neural networks learn feature hierarchies automatically, progressively transforming raw inputs into linearly separable representations.
The trade-off is critical: feature engineering preserves convexity and efficient optimization (logistic regression remains convex), but manual feature design is labor-intensive and requires domain expertise. Neural networks remove this burden but sacrifice convexity, introducing local minima and convergence challenges. In practice, the choice depends on data size (small: engineered features; large: deep learning) and problem domain (images/text: deep learning; tabular: both viable).