Decision trees recursively partition the feature space into increasingly homogeneous regions, creating a hierarchical structure of binary or multi-way splits. At each node, the algorithm selects the feature and threshold that best separates the data according to a chosen criterion, building a tree where leaves represent final predictions and internal nodes represent decision rules. Trees excel at capturing non-linear relationships and interactions without feature scaling.
The interpretability of decision trees is unmatched: each path from root to leaf traces a human-readable rule like "if x₁ ≤ 5.2 and x₂ > 3.1 then predict class A." This transparency makes trees invaluable in applications requiring explainability, from medical diagnosis to lending decisions. However, individual trees suffer from high variance and tend to overfit on training data, motivating the use of ensemble methods.
02
Information Gain
Information gain measures the reduction in entropy from a split, quantifying how much a feature decreases uncertainty about the target variable. Entropy H(S) = -Σ pᵢ log₂(pᵢ) represents the impurity of a set S; a lower entropy means more homogeneous labels. The information gain from splitting on feature X is IG(S,X) = H(S) - Σ |Sᵥ|/|S| · H(Sᵥ), where the sum weights the entropy of child nodes by their relative sizes. This criterion, used in algorithms like ID3 and C4.5, greedily selects splits that maximize information gain.
The greedy approach ensures efficient tree construction but does not guarantee globally optimal trees. Information gain has a bias toward high-cardinality features with many possible values, since they create more child nodes and often achieve lower weighted entropy purely by fragmentation. Practitioners often use gain ratio (normalizing by split entropy) or other criteria to mitigate this effect and build more balanced, interpretable trees.
03
Gini Impurity
Gini impurity provides an alternative to entropy, measuring the probability of misclassifying a randomly selected element if it were labeled according to the class distribution in a node. Defined as Gini(S) = 1 - Σ pᵢ², it ranges from 0 (pure) to 0.5 (maximum impurity at binary classification). The Gini-based split criterion selects the feature that minimizes weighted child impurity: Gini(S,X) = Σ |Sᵥ|/|S| · Gini(Sᵥ). Gini is computationally simpler than entropy since it avoids logarithms and is the default in scikit-learn's DecisionTreeClassifier.
Empirically, Gini and entropy typically produce similar trees despite their conceptual differences. Gini tends to be slightly faster to compute and creates marginally more balanced splits in certain datasets. Neither method is universally superior; the choice often depends on domain conventions and computational constraints. Both are greedy heuristics that work well in practice but occasionally miss the globally optimal structure.
04
Pruning & Regularization
Trees grown to full depth invariably overfit: they memorize training noise and fail to generalize. Pre-pruning constrains tree growth by stopping expansion when splits no longer improve validation performance or when constraints are hit—such as maximum tree depth, minimum samples per leaf, or minimum impurity decrease. These hyperparameters directly control model complexity and are easily tuned via cross-validation. Post-pruning, conversely, grows a full tree then removes nodes whose deletion reduces cost-complexity loss, removing the least valuable subtrees first in a sequence from which the best size is selected via validation.
Cost-complexity pruning balances tree size against training error: CCP(T) = Error(T) + α·|Leaves(T)|, where α controls the penalty on tree size. Larger α favors smaller trees; tuning α on a validation set identifies the optimal complexity. Pre-pruning is faster and often sufficient in practice. Post-pruning, implemented via the pruning path, can sometimes find better trees by exploring the full complexity spectrum. Modern libraries like scikit-learn automate both approaches, making pruning a standard best practice for production trees.
05
Bagging
Bootstrap aggregation (bagging) reduces variance by training multiple models on bootstrap samples—random samples drawn with replacement from the training set—then averaging their predictions. Each bootstrap sample has approximately 63.2% unique examples; the remaining 36.8% are duplicates. Bagging is trivial to parallelize since each model trains independently, making it scalable to large datasets. For regression, predictions average; for classification, voting aggregates class labels. Bagging particularly benefits high-variance, low-bias models like unpruned decision trees, turning an ensemble into a robust predictor.
Out-of-bag (OOB) error estimates generalization without a separate validation set: for each training example, aggregate predictions from only the bootstrap samples that did not contain it (roughly one-third of the ensemble). OOB error is an unbiased estimate nearly equivalent to leave-one-out cross-validation but computed at negligible cost. Bagging cannot reduce bias; if the base learner is systematically biased, the ensemble inherits that bias. Despite this limitation, bagging is simple, effective, and forms the foundation for random forests, which add feature subsampling to further improve generalization.
06
Random Forests
Random forests extend bagging by introducing feature subsampling at each split: at each node, only a random subset of √d features (for classification) or d/3 features (for regression) is eligible for splitting, reducing correlation between trees. This decorrelation is critical—if all trees make the same splits on the same features, their predictions are highly correlated and ensemble variance reduction is minimal. By forcing each tree to explore different feature subsets, random forests generate a more diverse ensemble, reducing the effective correlation and amplifying variance reduction.
Random forests are robust, minimal-tuning algorithms that often perform competitively without hyperparameter tuning. Key hyperparameters are the number of trees (more is safer but slower), max_features (controlling feature subsampling), max_depth, and min_samples_leaf. Feature importance is easily extracted: the total impurity decrease across all splits for each feature, averaged over all trees, quantifies predictive value. Random forests scale to large datasets and handle high-dimensional data gracefully, making them a practical go-to ensemble for many tasks and a strong baseline for benchmarking.
07
Boosting
Boosting trains weak learners sequentially, with each new learner focused on examples the previous ensemble mislabeled. AdaBoost maintains sample weights, initially uniform; after each weak learner trains, weights on misclassified examples increase, forcing subsequent learners to focus on hard cases. The final ensemble is a weighted combination where each learner's weight depends on its training accuracy: α_m = 0.5 · log((1 - err_m) / err_m), where err_m is the weighted error rate. This exponential weighting of mistakes creates a strong feedback loop that drives the ensemble toward low training error.
Boosting reduces bias by iteratively correcting residuals and errors, complementing bagging's variance reduction. The sequential nature precludes parallelization, but boosting typically needs far fewer base learners than bagging to achieve equivalent generalization. The exponential loss minimized by AdaBoost magnifies the penalty on hard examples, making boosting sensitive to outliers and mislabeled data. Modern variants like LogitBoost and multiclass extensions improve stability. Boosting's power to turn weak learners into strong predictors—combined with its interpretability—made it a foundational technique in machine learning, preceding deep learning as the dominant ensemble method.
08
Gradient Boosting
Gradient boosting generalizes boosting to arbitrary loss functions by framing ensemble learning as gradient descent in function space. Instead of reweighting samples, gradient boosting fits each new model to the negative gradient (residuals) of the loss function with respect to the current ensemble's predictions. For squared error loss, residuals are simply actual − predicted; for log-loss, they are pseudo-residuals ∂L/∂f. The new tree is scaled by a learning rate (shrinkage) η before adding to the ensemble, controlling step size and improving generalization. Regularization (tree depth, min_samples_leaf, learning rate) is crucial to prevent overfitting.
XGBoost, LightGBM, and CatBoost are production-grade implementations adding tree-specific regularization (L1/L2 penalties on leaf weights, tree complexity penalties), efficient column-wise splitting, and learned categorical encodings. These libraries dominate Kaggle competitions and real-world applications because of speed, accuracy, and robustness. Gradient boosting subsumes AdaBoost and enables custom loss functions, from quantile regression to ranking metrics. With careful regularization tuning, gradient boosting consistently achieves state-of-the-art performance on tabular data, rivaling deep learning on structured tasks and requiring far less computational resources.
A decision tree is a hierarchical model that partitions the feature space through recursive binary (or multi-way) splits. Starting at the root, each internal node tests a feature against a threshold; the true branch follows if the condition is satisfied, otherwise the false branch is traversed. This process continues until reaching a leaf, which outputs a class label (classification) or continuous value (regression). The learned rules are transparent and mimic human decision-making, making trees invaluable for interpretable machine learning.
Tree construction greedily selects splits to minimize impurity—entropy or Gini impurity—at each node. This greedy strategy is computationally efficient (O(dn log n) for n samples and d features) but does not guarantee globally optimal trees. Early stopping via max_depth, min_samples_leaf, or min_impurity_decrease prevents overfitting by constraining tree growth. Unpruned trees memorize training data, achieving near-zero training error but poor test performance. Pruning—removing the least valuable subtrees—recovers generalization at the cost of training accuracy.
Tree Structure & Splitting
At each split, the algorithm scans candidate thresholds for each feature and selects the split minimizing the impurity of the resulting child nodes. For a split on feature X at threshold t, the weighted child impurity is |L|/|S| · Imp(L) + |R|/|S| · Imp(R), where L and R are left and right child subsets and S is the parent. Trees naturally handle missing values through surrogate splits and categorical features through one-hot encoding or ordinal encoding.
Advantages & Limitations
Trees are interpretable, require no feature scaling, capture interactions automatically, and handle mixed feature types. Conversely, individual trees are high-variance (small data changes cause large tree structure changes), prone to overfitting, and create non-convex decision boundaries that may miss simple linear patterns. Trees also have a bias toward high-cardinality features, making feature engineering and regularization essential for good generalization.
Overfitting & Regularization
A fully grown tree achieves zero training error by creating leaves with pure labels, but this memorization does not generalize. Regularization via max_depth, min_samples_leaf, or cost-complexity pruning trades training accuracy for test accuracy. The optimal tree depth is found via cross-validation; shallow trees underfit while deep trees overfit. Post-pruning grows a full tree then removes nodes in order of cost-complexity, selecting the subtree that minimizes validation error.
02
Information Gain
Information theory provides a principled framework for measuring split quality via information gain. Entropy quantifies the uncertainty in a set of labels: H(S) = -Σ p_c log₂(p_c), where p_c is the proportion of class c in S. A pure set (all same class) has H = 0; a balanced binary set has H = 1. When a split partitions S into subsets S_v, the weighted child entropy is Σ |S_v|/|S| · H(S_v). Information gain IG(S, X) = H(S) - Σ |S_v|/|S| · H(S_v) quantifies entropy reduction—the information learned from splitting on feature X.
Tree construction greedily selects the feature and threshold maximizing information gain at each node. This approach, used in ID3 and C4.5, is computationally efficient and often produces good trees. However, information gain is biased toward features with many possible values: a feature with k distinct values creates k potential splits, each potentially lowering weighted entropy simply through fragmentation rather than genuine purity improvement. This bias can lead to overly complex splits and poor generalization.
Information Gain Formula & Computation
For a feature X and threshold t, the information gain is IG(S, X ≤ t) = H(S) - [|S_left|/|S| · H(S_left) + |S_right|/|S| · H(S_right)]. Computing IG for all candidate thresholds—typically the midpoints between consecutive sorted values of X—requires O(n log n) sorting and O(n) IG computation per feature, totaling O(dn log n) for d features. In practice, fast libraries use histogram-based approximations, partitioning feature values into bins to reduce computation.
Gain Ratio & Intrinsic Value
To mitigate the bias toward high-cardinality features, gain ratio normalizes information gain by the intrinsic value of the split: GainRatio(S, X) = IG(S, X) / H_split(X), where H_split(X) = -Σ |S_v|/|S| · log₂(|S_v|/|S|) measures how evenly a feature partitions the set. Features that create many balanced child subsets have high intrinsic value and lower gain ratio. C4.5 uses gain ratio to prefer simpler, more interpretable splits over those fragmenting data into many small groups.
Limitations & Extensions
Information gain is a greedy heuristic that does not account for future splits or interactions. A split with lower immediate IG might enable better downstream splits, but greedy selection misses this. Also, IG assumes discrete or binned features; continuous features require threshold scanning. Modern tree libraries (XGBoost, LightGBM) use histogram-based approximations and second-order statistics, trading fine-grained IG for computational efficiency. Despite its limitations, IG remains a powerful, intuitive split criterion that aligns well with information-theoretic principles and produces interpretable, accurate trees in practice.
03
Gini Impurity
Gini impurity measures the probability of misclassifying a random element if labeled according to the class distribution of a node: Gini(S) = 1 - Σ p_c². For binary classification with balanced classes, Gini = 0.5; for pure class, Gini = 0. Unlike entropy (which uses logarithms), Gini is algebraically simpler and does not require transcendental functions. The split criterion selects the feature minimizing weighted child Gini: Gini(S, X) = Σ |S_v|/|S| · Gini(S_v). In scikit-learn's DecisionTreeClassifier and RandomForestClassifier, Gini is the default.
Empirically, Gini and entropy produce very similar trees and generalization performance. Gini favors slightly more balanced splits, while entropy is more sensitive to the largest minority classes. For high-dimensional problems, Gini's computational simplicity (avoiding log operations) provides marginal speedup. The choice between Gini and entropy is more a matter of convention and computational budget than fundamental performance differences; most practitioners use whichever criterion is the library default and focus on regularization hyperparameters for performance gains.
Gini Index & Misclassification Error
Gini(S) = 1 - Σ p_c² can also be interpreted as Gini = 2 · p(A) · p(B) for binary classification, the probability of selecting two examples of different classes at random. This interpretation connects Gini to the expected misclassification rate if a random label is chosen. For k classes, Gini ranges from 0 (pure) to (k-1)/k. Classification error (1 - max(p_c)) is another impurity measure but is less sensitive to class imbalance and less suitable for building trees; Gini and entropy are preferred in practice.
Computational Efficiency
Computing Gini requires only frequencies p_c and their squares, no logarithms. For a dataset of n samples, Gini(S) is O(k) where k is the number of classes (typically small). Scanning all candidate thresholds for d features costs O(dn log n), identical to entropy. However, Gini's simpler operations reduce constant factors, providing 5–10% speedup in tree construction. For very large datasets where tree construction is a bottleneck, this efficiency gain justifies Gini's use. For typical problems, the choice is immaterial, and regularization hyperparameters have far larger impact on generalization.
Bias Toward Balanced Splits
Gini impurity, like entropy, is biased toward features with many distinct values. A feature with k values can create k potential splits; each fragment reduces weighted Gini through increased granularity, not necessarily through purer classification. Practitioners often limit max_features or use gain ratio analogues to counter this bias. Some tree libraries use alternative split criteria like XGBoost's second-order statistics (gradient and hessian), which are more robust to feature cardinality.
Modern Extensions
While Gini and entropy dominate classical trees, modern gradient boosting libraries use custom loss functions tailored to the prediction task. XGBoost minimizes regularized learning objectives; LightGBM optimizes Leaf-wise splits using second-order approximations; CatBoost leverages categorical feature interactions. These extensions trade the simplicity of Gini for improved accuracy on complex, high-dimensional datasets. For foundational understanding and most practical tasks, Gini remains the standard, reliable choice.
04
Pruning & Regularization
Fully grown trees achieve near-zero training error by memorizing noise. Regularization constrains growth to recover generalization. Pre-pruning stops expansion early using hard constraints: max_depth limits tree height, min_samples_leaf ensures leaves contain enough samples, min_impurity_decrease requires splits to improve purity beyond a threshold, min_weight_fraction_leaf bounds the fraction of training samples. These hyperparameters are tuned via cross-validation; typical values are max_depth ∈ [3, 15] and min_samples_leaf ∈ [1, 20].
Post-pruning grows an unpruned tree, then removes nodes whose deletion reduces cost-complexity loss. Cost-complexity pruning scores each subtree as CCP(T) = Error(T) + α·|Leaves(T)|, where α is a regularization parameter. Larger α penalizes tree size, favoring smaller trees; α=0 prefers the original tree. A sequence of trees from maximum complexity (α=0) to minimum complexity (single root) is computed efficiently via a pruning path. Scikit-learn's cost_complexity_pruning_path() and ccp_alpha parameter implement this; the optimal α is selected via cross-validation on the validation set's CCP values.
Pre-Pruning Strategy
Pre-pruning is faster (no need to grow a full tree) and often sufficient. Common strategies: (1) Grid search over max_depth, min_samples_leaf, min_impurity_decrease using k-fold cross-validation; (2) Early stopping during tree growth if validation error increases for consecutive splits; (3) Monitoring OOB error in bagging/forests. Pre-pruning is the practical choice for production systems where training time matters. The hyperparameter search space is modest (few knobs), and default values (max_depth=None, min_samples_leaf=1) are permissive starting points.
Post-Pruning & Cost-Complexity
Post-pruning can sometimes find subtly better trees by exploring the full complexity spectrum. The pruning path generates a sequence T₀ ⊃ T₁ ⊃ ... ⊃ T_root where T_i+1 is obtained by removing the node that minimizes the increase in training error per leaf removed. For each α, the tree in the path minimizing CCP(T, α) is optimal for that regularization strength. Cross-validation error on the validation set selects the best α. The cost-complexity framework is theoretically elegant and often used in academic settings; for practitioners, pre-pruning is usually sufficient.
Hyperparameter Tuning
GridSearchCV or RandomizedSearchCV automate hyperparameter tuning. Example: GridSearchCV(DecisionTreeClassifier(), param_grid={'max_depth': [3,5,7,10], 'min_samples_leaf': [1,5,10]}, cv=5). Cross-validation ensures the selected hyperparameters generalize to unseen data. For large datasets, randomized search reduces computation. For production systems, early stopping (monitoring validation error during training) or online tuning (adapting hyperparameters as data arrives) can dynamically optimize complexity.
Regularization Best Practices
Start with weak regularization (large max_depth, small min_samples_leaf) and gradually increase constraints until cross-validation error stabilizes. Use feature selection (selecting high-variance, low-noise features) to reduce the model's capacity to fit noise. Ensemble methods (bagging, boosting) often obviate the need for aggressive pre-pruning by averaging predictions across diverse subtrees. For critical applications (medical, financial), prioritize interpretability: smaller, shallower trees with aggressive min_samples_leaf constraints ensure that every split is statistically robust.
05
Bagging
Bootstrap aggregation (bagging) trains multiple models on bootstrap samples—random samples drawn with replacement from the training data—then aggregates predictions. Each bootstrap sample has size n (same as original data) but contains roughly 63.2% unique examples; 36.8% are duplicates. For regression, predictions are averaged; for classification, majority voting is used (or soft voting averaging class probabilities). Bagging is trivial to parallelize since each bootstrap sample is independent, enabling scalability to massive datasets via distributed training.
Bagging reduces variance without biasing predictions. Mathematically, if models have variance σ² and correlation ρ, the ensemble variance is ρσ² + (1 - ρ)σ²/m where m is the ensemble size. Uncorrelated models (ρ ≈ 0) achieve near σ²/m variance, a factor-m reduction. Bagging works best with high-variance, low-bias base learners like unpruned decision trees, turning an ensemble of overfitting trees into a robust predictor. Bagging cannot reduce bias; if the base learner systematically underestimates (or overestimates), the ensemble inherits that bias.
Out-of-Bag Error Estimation
A key advantage of bagging is free validation via out-of-bag (OOB) error. For each training example x_i, only ~37% of bootstrap samples exclude it; aggregate predictions from those samples (those that did not contain x_i). OOB error is an unbiased estimate of test error, nearly equivalent to leave-one-out cross-validation but computed at negligible cost during training. OOB error is plotted versus ensemble size to identify the number of models needed for convergence; typically, error stabilizes after 50–100 trees.
Feature Importance from Bagging
Average feature importances across the ensemble: sum the impurity decrease (or other importance metrics) from all splits on each feature, dividing by the ensemble size. OOB importances permute each feature in OOB samples and measure prediction degradation—features whose shuffling increases OOB error are important. OOB importances are often more stable and unbiased than impurity-based importances, especially for correlated features. For high-dimensional datasets, importance analysis identifies which features truly drive predictions, guiding feature engineering and domain understanding.
Bias-Variance Trade-off
Bagging is a variance reduction technique; it does not reduce bias. Averaging m models with variance σ² and correlation ρ yields ensemble variance ρσ² + (1-ρ)σ²/m. If ρ = 0 (uncorrelated models), variance drops by factor m; if ρ = 1 (perfectly correlated), variance is unchanged. With high-variance, low-bias learners, bagging creates diverse models (ρ is small), achieving significant variance reduction. With low-variance, high-bias learners (e.g., shallow trees, linear models), diversity is limited (ρ is large), and bagging provides minimal benefit.
Comparison to Single Models
Bagging always reduces variance (never increases it); test error improves if variance reduction outweighs any bias increase (usually small). For unpruned trees, bagging is dramatically superior: single trees overfit; bagged trees generalize well. For already-regularized models (pruned trees, linear regression with L2 penalty), bagging provides marginal gains since the base learner's bias is already low and variance is controlled. Bagging is a simple, powerful technique for high-variance models, especially when training parallelization is feasible.
06
Random Forests
Random forests extend bagging by adding feature subsampling at each split: instead of considering all d features, only a random subset is eligible for splitting. Typical choices are √d features for classification and d/3 for regression. This forced diversity decorrelates trees—if all bagged trees explore the same features, their splits are correlated and variance reduction is minimal. By constraining each tree to a random feature subset, random forests generate diverse, uncorrelated trees, reducing the ensemble correlation ρ and amplifying variance reduction by factor (1-ρ)/ρ.
Random forests are robust, minimal-tuning algorithms that often perform competitively on tabular data without extensive hyperparameter tuning. The default max_features (√d for classification) is nearly optimal for many datasets. Other key hyperparameters are n_estimators (number of trees, typically 100–1000), max_depth (None for full growth), min_samples_leaf (typically 1–5), and max_samples (bootstrap sample size). Random forests scale well to high-dimensional data, handle mixed feature types, require no feature scaling, and naturally encode feature importances.
Feature Subsampling Strategy
At each node, a random subset of max_features features is evaluated for splitting. This is different from feature selection (dropping irrelevant features globally); instead, each tree dynamically considers different features, even if globally available. For datasets with many irrelevant features, feature subsampling prevents trees from finding spurious correlations. With d features and max_features = √d, the probability a strong feature is ignored at a node is ~1 - √d/d, ensuring that most important features appear in most trees but with sufficient variation for diversity.
Hyperparameter Tuning Guidelines
n_estimators: More trees reduce variance at the cost of computation; 100–1000 is typical. Plot OOB error vs n_estimators; once stable (usually by 100 trees), increasing further provides diminishing returns. max_features: Default (√d or d/3) is near-optimal; tuning rarely helps. max_depth: Default (None, grow full) often works; shallow trees (max_depth=10) add regularization if overfitting is severe. min_samples_leaf: Higher values (5–20) reduce variance but increase bias; useful for noisy datasets. GridSearchCV tunes these systematically, but random forests are often competitive without tuning.
Feature Importance & Interpretability
Random forests extract global feature importance: for each feature, sum the impurity decrease from all splits across all trees, normalized by the number of trees. Features appearing in many splits with large impurity drops are important; those rarely selected have low importance. Permutation importance, computed on OOB samples, shuffles each feature and measures OOB error increase—robust to feature correlations. Feature importance guides feature engineering, data collection priorities, and domain understanding, making random forests interpretable despite their ensemble nature.
Advantages Over Single Trees & Bagging
Random forests reduce correlation between trees via feature subsampling, achieving better variance reduction than standard bagging at minimal computational cost. For datasets with many irrelevant features, random forests are dramatically superior to bagging, since feature subsampling prevents individual trees from picking up spurious correlations. Random forests are industry standard for tabular data: easy to use (few hyperparameters), robust (minimal tuning required), scalable (parallelizable), and interpretable (feature importance). They serve as excellent baselines and are often competitive with more complex methods like gradient boosting.
07
Boosting
Boosting trains weak learners sequentially, with each new learner focused on examples the previous ensemble mislabeled. Unlike bagging, which trains learners in parallel on random samples, boosting creates a sequential dependency: each learner's training distribution depends on the prior learner's errors. AdaBoost, the canonical boosting algorithm, maintains sample weights initialized uniformly; after each weak learner trains, weights on misclassified examples are increased exponentially, forcing the next learner to focus on hard cases. The final prediction is a weighted combination of weak learners, where each learner's weight depends on its training accuracy.
Boosting reduces bias by iteratively correcting residuals, complementing bagging's variance reduction. The exponential weighting of mistakes creates a powerful feedback loop: a weak learner (slightly better than random, ≥50% accuracy on weighted data) reduces weighted error, and the ensemble's training error decreases exponentially with the number of boosting rounds. Unlike bagging, which can be parallelized, boosting requires sequential training (each learner depends on priors). Boosting is also sensitive to outliers and mislabeled data—examples that many learners misclassify receive very high weights, potentially driving the ensemble astray.
AdaBoost Algorithm & Exponential Loss
AdaBoost minimizes exponential loss: L(y, f) = exp(-y·f), where y ∈ {-1, +1} and f is the ensemble prediction. Exponential loss magnifies the penalty on hard examples (those with large |y·f|), driving the algorithm to focus on difficult cases. The weight update after learning weak learner h_m with weighted error ε_m is: w_i^(m+1) ∝ w_i^(m) · exp(-α_m · y_i · h_m(x_i)), where α_m = 0.5 · log((1-ε_m)/ε_m). Learners with low error (ε_m < 0.5) receive positive α_m, amplifying their contribution; errors above 50% are rejected (α_m < 0). The ensemble is f(x) = Σ α_m · h_m(x).
Weak Learner Requirements
A weak learner is any classifier achieving >50% accuracy on the weighted distribution (better than random guessing in binary classification). Decision stumps (depth-1 trees) are the canonical choice: simple, fast, and surprisingly effective when boosted. Shallow trees (max_depth=2 or 3) also work well. Weak learners must be responsive to sample weights (skew learning toward high-weight examples); linear models, SVMs, and neural networks typically support this via sample_weight parameters. Weak learners should have low bias but high variance; boosting reduces variance through averaging, so overfitting on weighted data is acceptable.
Multiclass & Loss Functions
Binary AdaBoost extends to multiclass via one-vs-rest or one-vs-one schemes. SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss) directly handles K-class problems, replacing the binary exponential loss with a generalized version. LogitBoost replaces exponential loss with log-loss (logistic regression loss), which is more robust to outliers. Gentle AdaBoost uses regression (least-squares fitting of residuals) instead of classification, often producing smoother boosting curves. These variants trade some of AdaBoost's theoretical guarantees for improved robustness.
Boosting vs Bagging
Bagging trains in parallel, reducing variance without changing bias; boosting trains sequentially, reducing bias. Boosting typically requires fewer learners than bagging for equivalent performance (20–50 boosting rounds vs 100+ bags) but is slower to train. Boosting is sensitive to noise and outliers (large weights on mislabeled examples); bagging is more robust. For clean data, boosting is preferred; for noisy data, bagging or robust boosting variants (LogitBoost, gentle AdaBoost) are safer. Modern practice increasingly favors gradient boosting, which generalizes both and offers additional flexibility.