STANFORD XCS229 · MACHINE LEARNING
Decision Trees & Ensembles
Week 4 · Trees, Random Forests & Boosting
01

Decision Trees

Decision trees recursively partition the feature space into increasingly homogeneous regions, creating a hierarchical structure of binary or multi-way splits. At each node, the algorithm selects the feature and threshold that best separates the data according to a chosen criterion, building a tree where leaves represent final predictions and internal nodes represent decision rules. Trees excel at capturing non-linear relationships and interactions without feature scaling.

The interpretability of decision trees is unmatched: each path from root to leaf traces a human-readable rule like "if x₁ ≤ 5.2 and x₂ > 3.1 then predict class A." This transparency makes trees invaluable in applications requiring explainability, from medical diagnosis to lending decisions. However, individual trees suffer from high variance and tend to overfit on training data, motivating the use of ensemble methods.

02

Information Gain

Information gain measures the reduction in entropy from a split, quantifying how much a feature decreases uncertainty about the target variable. Entropy H(S) = -Σ pᵢ log₂(pᵢ) represents the impurity of a set S; a lower entropy means more homogeneous labels. The information gain from splitting on feature X is IG(S,X) = H(S) - Σ |Sᵥ|/|S| · H(Sᵥ), where the sum weights the entropy of child nodes by their relative sizes. This criterion, used in algorithms like ID3 and C4.5, greedily selects splits that maximize information gain.

The greedy approach ensures efficient tree construction but does not guarantee globally optimal trees. Information gain has a bias toward high-cardinality features with many possible values, since they create more child nodes and often achieve lower weighted entropy purely by fragmentation. Practitioners often use gain ratio (normalizing by split entropy) or other criteria to mitigate this effect and build more balanced, interpretable trees.

03

Gini Impurity

Gini impurity provides an alternative to entropy, measuring the probability of misclassifying a randomly selected element if it were labeled according to the class distribution in a node. Defined as Gini(S) = 1 - Σ pᵢ², it ranges from 0 (pure) to 0.5 (maximum impurity at binary classification). The Gini-based split criterion selects the feature that minimizes weighted child impurity: Gini(S,X) = Σ |Sᵥ|/|S| · Gini(Sᵥ). Gini is computationally simpler than entropy since it avoids logarithms and is the default in scikit-learn's DecisionTreeClassifier.

Empirically, Gini and entropy typically produce similar trees despite their conceptual differences. Gini tends to be slightly faster to compute and creates marginally more balanced splits in certain datasets. Neither method is universally superior; the choice often depends on domain conventions and computational constraints. Both are greedy heuristics that work well in practice but occasionally miss the globally optimal structure.

04

Pruning & Regularization

Trees grown to full depth invariably overfit: they memorize training noise and fail to generalize. Pre-pruning constrains tree growth by stopping expansion when splits no longer improve validation performance or when constraints are hit—such as maximum tree depth, minimum samples per leaf, or minimum impurity decrease. These hyperparameters directly control model complexity and are easily tuned via cross-validation. Post-pruning, conversely, grows a full tree then removes nodes whose deletion reduces cost-complexity loss, removing the least valuable subtrees first in a sequence from which the best size is selected via validation.

Cost-complexity pruning balances tree size against training error: CCP(T) = Error(T) + α·|Leaves(T)|, where α controls the penalty on tree size. Larger α favors smaller trees; tuning α on a validation set identifies the optimal complexity. Pre-pruning is faster and often sufficient in practice. Post-pruning, implemented via the pruning path, can sometimes find better trees by exploring the full complexity spectrum. Modern libraries like scikit-learn automate both approaches, making pruning a standard best practice for production trees.

05

Bagging

Bootstrap aggregation (bagging) reduces variance by training multiple models on bootstrap samples—random samples drawn with replacement from the training set—then averaging their predictions. Each bootstrap sample has approximately 63.2% unique examples; the remaining 36.8% are duplicates. Bagging is trivial to parallelize since each model trains independently, making it scalable to large datasets. For regression, predictions average; for classification, voting aggregates class labels. Bagging particularly benefits high-variance, low-bias models like unpruned decision trees, turning an ensemble into a robust predictor.

Out-of-bag (OOB) error estimates generalization without a separate validation set: for each training example, aggregate predictions from only the bootstrap samples that did not contain it (roughly one-third of the ensemble). OOB error is an unbiased estimate nearly equivalent to leave-one-out cross-validation but computed at negligible cost. Bagging cannot reduce bias; if the base learner is systematically biased, the ensemble inherits that bias. Despite this limitation, bagging is simple, effective, and forms the foundation for random forests, which add feature subsampling to further improve generalization.

06

Random Forests

Random forests extend bagging by introducing feature subsampling at each split: at each node, only a random subset of √d features (for classification) or d/3 features (for regression) is eligible for splitting, reducing correlation between trees. This decorrelation is critical—if all trees make the same splits on the same features, their predictions are highly correlated and ensemble variance reduction is minimal. By forcing each tree to explore different feature subsets, random forests generate a more diverse ensemble, reducing the effective correlation and amplifying variance reduction.

Random forests are robust, minimal-tuning algorithms that often perform competitively without hyperparameter tuning. Key hyperparameters are the number of trees (more is safer but slower), max_features (controlling feature subsampling), max_depth, and min_samples_leaf. Feature importance is easily extracted: the total impurity decrease across all splits for each feature, averaged over all trees, quantifies predictive value. Random forests scale to large datasets and handle high-dimensional data gracefully, making them a practical go-to ensemble for many tasks and a strong baseline for benchmarking.

07

Boosting

Boosting trains weak learners sequentially, with each new learner focused on examples the previous ensemble mislabeled. AdaBoost maintains sample weights, initially uniform; after each weak learner trains, weights on misclassified examples increase, forcing subsequent learners to focus on hard cases. The final ensemble is a weighted combination where each learner's weight depends on its training accuracy: α_m = 0.5 · log((1 - err_m) / err_m), where err_m is the weighted error rate. This exponential weighting of mistakes creates a strong feedback loop that drives the ensemble toward low training error.

Boosting reduces bias by iteratively correcting residuals and errors, complementing bagging's variance reduction. The sequential nature precludes parallelization, but boosting typically needs far fewer base learners than bagging to achieve equivalent generalization. The exponential loss minimized by AdaBoost magnifies the penalty on hard examples, making boosting sensitive to outliers and mislabeled data. Modern variants like LogitBoost and multiclass extensions improve stability. Boosting's power to turn weak learners into strong predictors—combined with its interpretability—made it a foundational technique in machine learning, preceding deep learning as the dominant ensemble method.

08

Gradient Boosting

Gradient boosting generalizes boosting to arbitrary loss functions by framing ensemble learning as gradient descent in function space. Instead of reweighting samples, gradient boosting fits each new model to the negative gradient (residuals) of the loss function with respect to the current ensemble's predictions. For squared error loss, residuals are simply actual − predicted; for log-loss, they are pseudo-residuals ∂L/∂f. The new tree is scaled by a learning rate (shrinkage) η before adding to the ensemble, controlling step size and improving generalization. Regularization (tree depth, min_samples_leaf, learning rate) is crucial to prevent overfitting.

XGBoost, LightGBM, and CatBoost are production-grade implementations adding tree-specific regularization (L1/L2 penalties on leaf weights, tree complexity penalties), efficient column-wise splitting, and learned categorical encodings. These libraries dominate Kaggle competitions and real-world applications because of speed, accuracy, and robustness. Gradient boosting subsumes AdaBoost and enables custom loss functions, from quantile regression to ranking metrics. With careful regularization tuning, gradient boosting consistently achieves state-of-the-art performance on tabular data, rivaling deep learning on structured tasks and requiring far less computational resources.