Stanford XCS229 · Machine Learning
Generalized Linear Models
Week 2–3 · The Exponential Family & GLM Construction
01

The Exponential Family Form

The exponential family is a class of probability distributions that share a common mathematical structure. Every distribution in this family can be written in the form p(y;η) = b(y)exp(ηᵀT(y) - a(η)), where η is the natural parameter, T(y) is the sufficient statistic, a(η) is the log-partition function, and b(y) is the base measure. This elegant formulation reveals deep connections between seemingly different distributions and provides a unified framework for building models.

The natural parameter η indexes the members of a family: once T, a, and b are fixed, each value of η picks out one distribution. The log-partition function a(η) handles normalization: the factor exp(-a(η)) is the normalizing constant that makes p(y;η) sum or integrate to one. The sufficient statistic T(y) summarizes the data: it contains all the information in y needed to estimate η. With these pieces in hand, we can derive regression models from first principles using only three assumptions about the conditional distribution of the response.
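
To make the normalizing role of a(η) concrete, here is a short derivation sketch. It assumes a continuous response y; for a discrete y the integral would become a sum.

```latex
\begin{aligned}
1 &= \int p(y;\eta)\,dy
   = e^{-a(\eta)} \int b(y)\,\exp\!\big(\eta^{\top} T(y)\big)\,dy \\
\Longrightarrow\quad
a(\eta) &= \log \int b(y)\,\exp\!\big(\eta^{\top} T(y)\big)\,dy
\end{aligned}
```

So a(η) is the logarithm of the constant that the unnormalized density b(y)exp(ηᵀT(y)) must be divided by.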

Unified Framework: ∞ models
η: Natural Parameter
a(η): Log-Partition Function
T(y): Sufficient Statistic
02

Bernoulli as Exponential Family

The Bernoulli distribution, which models binary outcomes, fits neatly into the exponential family form. Starting from p(y;φ) = φʸ(1-φ)¹⁻ʸ, we can rewrite it as p(y;φ) = exp(y log(φ/(1-φ)) + log(1-φ)). This shows that the natural parameter is η = log(φ/(1-φ)), the log-odds. The sufficient statistic is T(y) = y, the base measure is b(y) = 1, and the log-partition function is a(η) = -log(1-φ) = log(1 + e^η). Assuming η = θᵀx then yields logistic regression.

The probability φ is recovered from the natural parameter by φ = 1/(1 + e^(-η)), the logistic sigmoid function. This is not an arbitrary choice: it follows from inverting η = log(φ/(1-φ)). The expected value E[y] = φ equals the derivative of a(η) with respect to η, which shows how the mean parameter relates to the natural parameter. This structure simplifies both maximum likelihood estimation and interpretation.
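
The Bernoulli identities above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the value φ = 0.3 are illustrative choices, not part of the course material.

```python
import math

def bernoulli_pmf(y, phi):
    """Standard form: phi^y * (1 - phi)^(1 - y) for y in {0, 1}."""
    return phi ** y * (1.0 - phi) ** (1 - y)

def log_partition(eta):
    """a(eta) = log(1 + e^eta) for the Bernoulli."""
    return math.log1p(math.exp(eta))

def bernoulli_expfam(y, eta):
    """Exponential family form with b(y) = 1 and T(y) = y."""
    return math.exp(eta * y - log_partition(eta))

def sigmoid(eta):
    """Inverse of the log-odds map: phi = 1 / (1 + e^(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

phi = 0.3                          # illustrative success probability
eta = math.log(phi / (1.0 - phi))  # natural parameter (log-odds)

# Both forms should agree on y = 0 and y = 1.
for y in (0, 1):
    assert abs(bernoulli_pmf(y, phi) - bernoulli_expfam(y, eta)) < 1e-12

# Central-difference estimate of da/deta, which should match phi.
h = 1e-6
mean_from_a = (log_partition(eta + h) - log_partition(eta - h)) / (2 * h)
```

Here `mean_from_a` and `sigmoid(eta)` should both agree with `phi` up to numerical error, which is the statement E[y] = da(η)/dη.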

Natural Parameter

η = log(φ/(1-φ)) maps the probability φ in (0,1) to the log-odds, which range over the whole real line.

Log-Partition

a(η) = log(1 + e^η) normalizes the distribution; its first and second derivatives give the mean and the variance.

03

Gaussian as Exponential Family

The Gaussian distribution also belongs to the exponential family. Writing the normal density p(y;μ,σ²) = (1/√(2πσ²))exp(-(y-μ)²/(2σ²)) in exponential family form requires some algebra. Treating σ² as a fixed constant and focusing on μ, we get η = μ/σ², T(y) = y, and a(η) = η²σ²/2, with everything that does not involve μ absorbed into b(y). The sufficient statistic is simply the observation itself, and the natural parameter scales inversely with the variance. This structure underlies ordinary least squares regression.

For Gaussian models with unknown variance, we can use the two-parameter exponential family form, with natural parameters η₁ = μ/σ² and η₂ = -1/(2σ²) and sufficient statistics T(y) = (y, y²). In the fixed-variance case, the mean E[y] = μ is recovered from the log-partition function by differentiation: da(η)/dη = ησ² = μ. Unlike the Bernoulli's nonlinear sigmoid relationship, the Gaussian mean is linear in the natural parameter (and equal to it when σ² = 1), which gives the identity link used in linear regression. Setting η = θᵀx then makes the predicted mean linear in the features, which is exactly the model that OLS fits.
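
The algebra for the fixed-variance case can be sketched by expanding the square in the exponent and separating the terms that involve μ from those that do not:

```latex
\begin{aligned}
p(y;\mu,\sigma^2)
 &= \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{(y-\mu)^2}{2\sigma^2}\right) \\
 &= \underbrace{\frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{y^2}{2\sigma^2}\right)}_{b(y)}
    \exp\!\Big(\underbrace{\frac{\mu}{\sigma^2}}_{\eta}\,
               \underbrace{y}_{T(y)}
               - \underbrace{\frac{\mu^2}{2\sigma^2}}_{a(\eta)}\Big),
\qquad
a(\eta) = \frac{\mu^2}{2\sigma^2} = \frac{\eta^2\sigma^2}{2}.
\end{aligned}
```

Differentiating gives da(η)/dη = ησ² = μ, consistent with E[y] = μ.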

η = μ/σ²
T(y) = y
a(η) = η²σ²/2
E[y|x] = μ
04

Constructing GLMs

The generalized linear model rests on three key assumptions. First, the response y follows a distribution from the exponential family, y|x;θ ~ ExpFamily(η). Second, the mean of the response is a function of the linear predictor, E[y|x] = g⁻¹(θᵀx), where g is the link function. Third, the natural parameter is linear in the features, η = θᵀx. These three assumptions bind together the exponential family distribution, the linear predictor, and the actual predictions in a consistent framework.

The link function g maps from the mean space to the linear predictor space. For Bernoulli models, the canonical link is the logit: g(μ) = log(μ/(1-μ)). For Gaussian models, the canonical link is the identity: g(μ) = μ. The canonical link is the one for which the linear predictor equals the natural parameter, η = θᵀx, which simplifies maximum likelihood estimation. By choosing an appropriate exponential family and link function, we can build models for continuous responses, binary outcomes, count data, and more, all from the same three assumptions.
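
As a small illustration of how the link and its inverse fit together, the sketch below pairs each family with its canonical link and computes E[y|x] = g⁻¹(θᵀx). The dictionary `CANONICAL_LINKS`, the helper `predict_mean`, and the numbers are hypothetical names and values chosen for this example.

```python
import math

def logit(mu):
    """Canonical Bernoulli link: mean -> log-odds."""
    return math.log(mu / (1.0 - mu))

def sigmoid(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

def identity(mu):
    """Canonical Gaussian link (its own inverse)."""
    return mu

# (link g, inverse link g^-1) pairs, keyed by response distribution.
CANONICAL_LINKS = {
    "bernoulli": (logit, sigmoid),
    "gaussian": (identity, identity),
}

def predict_mean(family, theta, x):
    """E[y|x] = g^-1(theta^T x) for the chosen family."""
    _, inverse_link = CANONICAL_LINKS[family]
    eta = sum(t * xi for t, xi in zip(theta, x))
    return inverse_link(eta)

theta = [0.5, -1.0]
x = [2.0, 1.0]  # theta^T x = 0.0
p = predict_mean("bernoulli", theta, x)
m = predict_mean("gaussian", theta, x)
```

The same linear predictor produces a probability under the Bernoulli family and an unconstrained real value under the Gaussian family; only the inverse link differs.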

Features x → Linear Predictor θᵀx → Inverse Link Function g⁻¹ → Prediction E[y|x] (Mean Response)
05

Ordinary Least Squares as GLM

Ordinary least squares fits naturally into the GLM framework when we assume the response follows a Gaussian distribution with constant variance. The three GLM assumptions become: (1) y|x;θ ~ N(μ, σ²) with σ² fixed; (2) E[y|x] = μ = θᵀx (identity link); (3) η = θᵀx, which matches η = μ/σ² when we take σ² = 1 (the value of σ² does not affect the fitted θ). Since the natural parameter is then the mean itself, the regression coefficients directly control the natural parameter. Taking the negative log-likelihood with σ² held fixed gives the familiar sum of squared errors criterion, up to an additive constant and a positive scale factor.
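
The step from the Gaussian likelihood to squared error can be written out as follows. The training pairs (x_i, y_i), i = 1, ..., n, are notation assumed for this sketch; σ² is treated as a fixed constant.

```latex
\begin{aligned}
-\log L(\theta)
 &= -\sum_{i=1}^{n} \log\!\left[
      \frac{1}{\sqrt{2\pi\sigma^2}}
      \exp\!\left(-\frac{(y_i - \theta^{\top}x_i)^2}{2\sigma^2}\right)\right] \\
 &= \frac{n}{2}\log(2\pi\sigma^2)
    + \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \theta^{\top}x_i)^2 .
\end{aligned}
```

The first term does not depend on θ and the factor 1/(2σ²) is positive, so minimizing the negative log-likelihood over θ is the same as minimizing the sum of squared errors.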

Maximum likelihood estimation under the Gaussian GLM yields a closed-form solution, θ = (XᵀX)⁻¹Xᵀy, provided XᵀX is invertible. The interpretation is transparent: each regression coefficient is the change in the expected response for a unit increase in the corresponding feature, holding the others constant. The Gaussian assumption with identity link creates a direct, interpretable relationship between features and predictions. This is why linear regression is a standard choice for continuous responses and a useful starting point for understanding GLMs with nonlinear links, such as logistic regression.
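
For a single feature plus an intercept, the normal equations are a 2×2 system that can be solved by hand. The sketch below does this in plain Python; `fit_ols_1d` and the toy data are illustrative, and a real analysis would normally rely on a linear algebra library.

```python
def fit_ols_1d(xs, ys):
    """Solve the normal equations X^T X theta = X^T y for rows (1, x)."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Determinant of X^T X = [[n, sx], [sx, sxx]]; zero if all x are equal.
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular; the closed form does not apply")
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

# Toy data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_ols_1d(xs, ys)
```

Because the toy data are noise-free, the fitted intercept and slope recover the generating line.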

Advantages

  • Closed-form solutions
  • Direct interpretation
  • Efficient computation
  • Well-understood theory

Limitations

  • Assumes normality
  • Unbounded predictions
  • Sensitive to outliers
  • Equal variance assumption
06

Logistic Regression as GLM

Logistic regression is the GLM for binary classification, using a Bernoulli likelihood with the logit link. The assumptions are: (1) y|x;θ ~ Bernoulli(φ); (2) E[y|x] = φ = σ(θᵀx) where σ is the sigmoid; (3) η = θᵀx. The sigmoid function σ(z) = 1/(1 + e⁻ᶻ) maps the unbounded linear predictor to a probability in the open interval (0,1). The sigmoid is the canonical response function (the inverse of the logit link) and emerges from the exponential family structure of the Bernoulli: it satisfies φ = da(η)/dη. Maximizing the likelihood is equivalent to minimizing the cross-entropy loss, which drives the model to assign high probability to the correct class.

The decision boundary occurs where θᵀx = 0, a hyperplane in the feature space. Points with θᵀx > 0 are classified as class 1, with P(y=1|x) greater than 0.5, while points with θᵀx < 0 are classified as class 0. The model's confidence depends on the magnitude of θᵀx: when |θᵀx| is large, predictions are near 0 or 1; when it is near 0, the model is uncertain. The exponential family structure guarantees that the log-likelihood is concave in θ, so any local maximum is a global maximum and can be found with gradient ascent or Newton's method (a finite maximizer exists when the classes are not perfectly separable). Logistic regression shows how GLMs provide principled, interpretable models for classification as well as regression.
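
A minimal sketch of fitting logistic regression by batch gradient ascent on the log-likelihood is shown below. The learning rate, step count, and toy data are assumptions made for illustration; the data are deliberately not linearly separable so that a finite maximizer exists.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(theta, X, y):
    """Bernoulli log-likelihood (the negative of the cross-entropy loss)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Batch gradient ascent; the gradient is sum_i (y_i - sigmoid(theta^T x_i)) x_i."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(t * v for t, v in zip(theta, xi)))
            for j, v in enumerate(xi):
                grad[j] += err * v
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

# Toy data: each row is (1, x); the classes overlap, so they are not separable.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.0],
     [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 0, 1, 1, 0, 1]
theta = fit_logistic(X, y)

# Predicted class is 1 exactly when theta^T x > 0.
predictions = [1 if sum(t * v for t, v in zip(theta, xi)) > 0 else 0 for xi in X]
```

Newton's method would typically converge in far fewer iterations; gradient ascent is used here only because it needs no matrix inversion.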

[Figure: sigmoid curve σ(z) = 1/(1+e⁻ᶻ); horizontal axis z = θᵀx, vertical axis P(y=1|x) from 0 to 1; decision boundary at θᵀx = 0]
07

Softmax Regression as GLM

Softmax regression extends logistic regression to multiclass problems with K > 2 classes. The GLM assumptions become: (1) y|x;θ ~ Multinomial(φ₁,...,φ_K), with a single trial; (2) E[T(y)|x] = φ with φₖ = exp(θₖᵀx)/Σⱼexp(θⱼᵀx), where T(y) is the one-hot indicator of the class; (3) ηₖ = θₖᵀx for each class. The softmax function generalizes the sigmoid by exponentiating each linear predictor and normalizing by the sum of all exponentials, producing a probability distribution over the K classes. It is the canonical response function when the natural parameters are ηₖ = θₖᵀx. The model is usually written with K sets of coefficients, one per class, although only K-1 of them are identifiable, because adding the same vector to every θₖ leaves the probabilities unchanged.

Maximum likelihood estimation minimizes the multinomial cross-entropy loss, which penalizes low probability assigned to the true class. The decision boundary between classes k and j is where θₖᵀx = θⱼᵀx, a hyperplane in feature space. Unlike multiclass SVM or one-vs-rest approaches, softmax regression directly models the probability of each class, and the predicted probabilities sum to one across classes. The exponential family framework ensures the log-likelihood is concave, allowing efficient optimization. Softmax regression shows how GLMs extend to prediction problems where the output is one of K categories.
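
The softmax response function can be sketched in a few lines. The version below subtracts the maximum score before exponentiating, a common numerical-stability trick that leaves the result unchanged; the parameter values are made up for illustration.

```python
import math

def softmax(z):
    """Map K linear predictors to K probabilities that sum to one."""
    m = max(z)  # shifting by the max avoids overflow and does not change the output
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(thetas, x):
    """phi_k = softmax(theta_k^T x) over the K classes."""
    z = [sum(t * v for t, v in zip(theta_k, x)) for theta_k in thetas]
    return softmax(z)

# Three classes, two features; values are illustrative only.
thetas = [[0.0, 1.0], [0.0, -1.0], [0.0, 0.0]]
x = [1.0, 2.0]  # linear predictors are 2, -2, 0
probs = predict_proba(thetas, x)
predicted_class = probs.index(max(probs))
```

With K = 2 and the second predictor fixed at zero, `softmax([z, 0.0])[0]` reduces to the sigmoid of z, which is how logistic regression appears as a special case.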

K classes
softmax(z)
multinomial
cross-entropy
hyperplane boundaries
08

Properties of GLMs

All GLMs share fundamental properties derived from the exponential family structure. The canonical response function always satisfies E[y|x] = da(η)/dη, linking the mean to the natural parameter. Variance is determined by the second derivative: Var(y|x) = d²a(η)/dη². For Bernoulli, Var = φ(1-φ); for Gaussian, Var = σ². For families such as the Bernoulli, the variance is therefore a deterministic function of the mean, so heteroscedasticity is built into the model structure rather than being an optional feature; the fixed-variance Gaussian is the exception, with constant variance. With a canonical link, maximum likelihood estimation reduces to minimizing a convex loss function (the negative log-likelihood), so any minimum an optimization algorithm converges to is globally optimal.
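
The variance identity can be checked numerically with a finite-difference second derivative of a(η). This is a rough sketch; the step size h, the value of η, and the fixed variance `sigma2` are arbitrary choices for the example.

```python
import math

def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def a_bernoulli(eta):
    """Bernoulli log-partition: log(1 + e^eta)."""
    return math.log1p(math.exp(eta))

sigma2 = 2.5  # assumed fixed Gaussian variance

def a_gaussian(eta):
    """Fixed-variance Gaussian log-partition: eta^2 * sigma^2 / 2."""
    return 0.5 * sigma2 * eta * eta

eta = 0.8
phi = 1.0 / (1.0 + math.exp(-eta))  # Bernoulli mean at this eta

var_bernoulli = second_derivative(a_bernoulli, eta)  # compare with phi * (1 - phi)
var_gaussian = second_derivative(a_gaussian, eta)    # compare with sigma2
```

The Bernoulli estimate changes with η (and hence with the mean), while the Gaussian estimate stays at σ² for every η.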

The sufficient statistics for an exponential family capture all the information needed to estimate its parameters. For Bernoulli, the sufficient statistic is simply the observed y; for a Gaussian with unknown variance, it is the pair y and y². This parsimony enables efficient parameter estimation: you need not store the raw data, only the sufficient statistics. The exponential family parameterization also explains why certain link functions are canonical: they make η = θᵀx directly, so the model is linear in its parameters on the natural-parameter scale, which keeps the log-likelihood concave and, in the Gaussian case, gives a closed-form solution. Non-canonical links lose this direct identification and complicate the optimization, but they can improve predictions when the mean is not well described by the canonical response function. Understanding these properties helps practitioners choose appropriate models and interpret results correctly.

Key Insight

The three GLM assumptions (exponential family response, mean given by the inverse link, natural parameter linear in the features) unify linear regression, binary and multiclass classification, and models for count data under a single mathematical framework, each differing only in the choice of distribution and link function.