Stanford XCS229 · Machine Learning
Generative Algorithms
Week 3–4 · GDA, Naive Bayes & Text Classification
01

Discriminative vs Generative Models

Discriminative models directly estimate the posterior class probability p(y|x) given the data x. Generative models instead estimate the class-conditional likelihood p(x|y) and class prior p(y), then use Bayes rule to compute p(y|x). Discriminative models often require fewer assumptions about the distribution of features, while generative models capture the underlying data distribution. The choice between them depends on application requirements, data availability, and computational constraints.

When discriminative models are preferred: abundant labeled data, computational efficiency is critical. When generative models are preferred: modeling the data distribution, handling missing data, generation tasks, strong prior knowledge about feature distributions. Both approaches have their merits, and modern machine learning often combines ideas from both paradigms for optimal performance.

02

Gaussian Discriminant Analysis

Gaussian Discriminant Analysis assumes that the feature vectors in each class are drawn from a multivariate Gaussian (normal) distribution. The model assumes that p(x|y=0) and p(x|y=1) are both multivariate Gaussians with shared covariance matrix Σ but different means μ₀ and μ₁. This strong assumption allows for closed-form maximum likelihood solutions and interpretable decision boundaries. GDA is particularly effective when this Gaussian assumption is reasonable and samples are limited.

The shared covariance assumption leads to a linear decision boundary between classes, making GDA interpretable and efficient. Each feature's contribution is weighted by the inverse covariance, which captures correlations between features. For well-separated, normally distributed data, GDA provides excellent generalization. However, if the Gaussian assumption is violated, model performance may suffer relative to more flexible approaches.

03

GDA Model Fitting

Fitting the GDA model involves computing maximum likelihood estimates for the class means μ₀ and μ₁, shared covariance Σ, and class prior φ. The prior φ is simply the empirical proportion of positive examples: φ = (1/m) Σ I(y⁽ⁱ⁾=1). The means are computed as the average of feature vectors within each class. The shared covariance is the pooled sample covariance across both classes. These closed-form solutions avoid iterative optimization, making GDA computationally efficient and suitable for real-time applications.

Once fitted, the decision boundary is determined by comparing class-conditional densities weighted by the prior. The boundary is the set of points where p(y=1|x) = 0.5, which forms a linear hyperplane in the feature space (or a quadratic boundary if covariances are not shared). This linearity of the decision boundary is a key characteristic of GDA, providing interpretability and stability on test data with similar distributions to the training data.

04

GDA vs Logistic Regression

GDA and logistic regression both produce linear decision boundaries but via different routes. GDA makes stronger distributional assumptions (multivariate Gaussian with shared covariance), while logistic regression makes weaker assumptions and directly optimizes the log-likelihood of the posterior. With sufficient data and correct model assumptions, both approaches asymptotically converge to the same decision boundary. However, with finite data, they can differ significantly in their estimates.

GDA excels when the Gaussian assumption holds and data is limited, as it leverages the assumed structure. Logistic regression is more robust to model misspecification and generally performs better when the Gaussian assumption is violated. Logistic regression requires iterative optimization but is more flexible. In practice, logistic regression is often preferred due to its robustness, but GDA remains valuable when strong assumptions are justified by domain knowledge or exploratory data analysis.

05

Naive Bayes Classifier

Naive Bayes simplifies the feature model by assuming conditional independence: given the class y, all features x₁, x₂, ..., xₙ are independent. This allows the joint likelihood p(x|y) to factor as ∏ᵢ p(xᵢ|y), dramatically reducing the number of parameters from O(n²) to O(n). While the conditional independence assumption is often violated in practice ("naive"), Naive Bayes remains effective due to this efficiency and its natural fit to classification. It is particularly powerful for high-dimensional discrete data like text.

Naive Bayes predictions are made by computing the posterior for each class using Bayes rule and selecting the class with maximum posterior probability. For text classification, each feature typically represents word counts or presence, and the conditional probabilities are estimated from training data. The simplicity of Naive Bayes makes it interpretable—feature probabilities directly show which words are indicative of each class. Despite violating its core assumption in most applications, Naive Bayes achieves competitive performance and scales linearly with both training set size and feature count.

06

Laplace Smoothing

The zero-frequency problem arises when a word never appears in the training data for a given class. The naive estimate for p(word|class) would be zero, which causes the entire class probability to become zero even if other features strongly support that class. Laplace smoothing (add-one smoothing) avoids this by adding a count of one to each feature in each class, effectively treating each feature as if it appeared at least once. This simple regularization has a profound impact on Naive Bayes text classification performance.

Formally, with smoothing: p(xᵢ|y) = (count(xᵢ,y) + 1) / (count(y) + |V|), where |V| is the vocabulary size. This ensures all probabilities are positive and sums to one. While this biases probability estimates toward uniform distributions, it prevents zero probabilities and improves generalization to unseen words. Laplace smoothing is a practical necessity for Naive Bayes on text, making the difference between models that catastrophically fail on new words and robust classifiers.

07

Event Models for Text

Two event models describe how text is generated: the multivariate Bernoulli model treats each word as present or absent (1 or 0), while the multinomial model treats text as a sequence of word occurrences (bag-of-words with counts). In the Bernoulli model, φᵢ|ᵧ = P(word i appears | class y). In the multinomial model, word positions are sampled independently from a per-class vocabulary distribution, and φᵢ|ᵧ = P(word i is chosen | class y). The multinomial model typically outperforms Bernoulli because it captures word frequency information, though Bernoulli is computationally simpler and sometimes preferred for sparse text.

Both models enable feature selection and dimensionality reduction by pruning low-information words. Information gain or mutual information scores identify discriminative words that differ significantly between classes. Removing uninformative words reduces noise, improves computational efficiency, and sometimes improves generalization. The choice between models, smoothing parameters, and features to include defines the Naive Bayes pipeline and requires careful tuning via cross-validation. Despite these details, the core algorithm remains elegant and interpretable.

08

Practical Applications

Naive Bayes powers real-world systems across domains. Spam filtering is the canonical application: emails are classified as spam or ham based on word presence/frequency, where strong spam indicators (pharmaceutical keywords, money-related phrases) have high p(word|spam). Sentiment analysis treats reviews or tweets as documents, classifying them as positive or negative based on emotional language. The learned word probabilities reveal sentiment: positive words like "excellent" have high p(word|positive), while "disappointing" has high p(word|negative). Document classification assigns documents to categories (news topics, support tickets) based on their content.

Modern production systems often use Naive Bayes as a baseline or combine it with deep learning models for ensemble performance. Its speed makes it suitable for streaming or real-time systems. Interpretability is a key advantage: explaining why a document was classified requires showing which words triggered each class. This transparency is valuable in regulated domains (e.g., credit decisions) and for model debugging. Naive Bayes also handles imbalanced datasets gracefully and naturally extends to multi-class problems, making it a durable foundation for text classification pipelines.