Ensemble Methods¶

Stats 507, Fall 2021

James Henderson, PhD
November 11, 2021

Overview¶

  • Decision Trees
  • Ensemble Methods
  • Random Forests
  • Gradient Boosted Trees
  • Takeaways

    As previously, these slides are intended to be read in conjunction with the "Isolet Demo" from the course repo.

Decision Trees¶

  • A decision tree is a model in which features are used one at a time to recursively divide the data into parts so that "like" cases are together.
  • Decision trees are highly flexible and particularly good for models with many interactions between features.
  • Single tree models are especially prone to overfitting (illustrated in the sketch below).
  • Let's review Figure 8.9 from ESL.
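
    Below is a minimal sketch of a single decision tree fit with scikit-learn's DecisionTreeClassifier; the synthetic data from make_classification() is an illustrative assumption (it is not the Isolet data) and is reused in later sketches.
In [ ]:
# a small synthetic classification problem, used only for illustration
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# an unpruned tree will typically fit the training data (nearly) perfectly
tree = DecisionTreeClassifier(max_depth=None, random_state=42)
tree.fit(X_train, y_train)
print(tree.score(X_train, y_train), tree.score(X_test, y_test))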

Ensemble Methods¶

  • An ensemble method or model is one in which a collection of ML models is trained on a common task and then combined to form a single estimator for that task.
  • Two important categories of ensemble methods are bagging and boosting.
  • Scikit-learn collects ensemble methods in the sklearn.ensemble API.

Bagging¶

  • Bagging, short for bootstrap aggregation, is one of the earliest ensemble methods.
  • Its intention is to improve the stability and reduce the variance of ML models such as decision trees.
  • In a nutshell, individual models are fit repeatedly to bootstrap replications of the training data and then averaged to form predictions (see the sketch below).
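
    A minimal sketch of bagging using scikit-learn's BaggingClassifier, whose default base estimator is a decision tree; the train/test split is assumed to come from the earlier synthetic sketch.
In [ ]:
from sklearn.ensemble import BaggingClassifier

bag = BaggingClassifier(
    n_estimators=100,  # number of bootstrap replications (trees, by default)
    bootstrap=True,    # resample the training data with replacement
    oob_score=True,    # score each case using trees for which it was left out
    random_state=42
)
bag.fit(X_train, y_train)
print(bag.oob_score_, bag.score(X_test, y_test))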

Boosting¶

  • Boosting builds an additive classifier or regression function from a collection of simpler estimators such as linear models or trees.
  • Three key ideas:
    1. Simpler classifiers such as trees are combined sequentially;
    2. At each step, samples are re-weighted to give more influence to high-residual or misclassified samples;
    3. The final model is a weighted sum of the models learned in each step.
  • Let's review figure 10.1 from The Elements of Statistical Learning (ESL).

Adaboost¶

  • Conceptually, it is helpful to refer to the Adaboost algorithm to understand how boosting works.
  • Adaboost increases the weight on points not well-fit by our model and (consequently) decreases the weight on well-fit points.
  • Individual classifiers are weighted by their performance on the (re-weighted) samples they were trained on; a short sklearn sketch follows below.
  • Let's review Algorithm 10.1 from ESL.
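
    A minimal sketch using scikit-learn's AdaBoostClassifier, which boosts depth-1 trees ("stumps") by default; the hyper-parameter values and the reused synthetic split are illustrative assumptions.
In [ ]:
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(
    n_estimators=200,   # number of boosting rounds
    learning_rate=1.0,  # shrinkage applied to each classifier's weight
    random_state=42
)
ada.fit(X_train, y_train)  # each round re-weights the training samples
print(ada.score(X_test, y_test))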

Gradient Boosting¶

  • Gradient boosting is similar to Adaboost, but at each stage fits a model to the negative gradient of the loss function (the "pseudo-residuals") rather than explicitly re-weighting the samples; a by-hand sketch for squared-error loss follows below.
  • Let's review Algorithm 10.3 in ESL.
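
    Below is a by-hand sketch (not the scikit-learn estimator) of gradient boosting for squared-error loss, where the negative gradient is just the residual y - F(x); the synthetic regression data, learning rate, and tree depth are illustrative assumptions.
In [ ]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=(500, 1))
y_reg = np.sin(x).ravel() + rng.normal(scale=0.3, size=500)

learning_rate = 0.1
F = np.full_like(y_reg, y_reg.mean())  # initial constant prediction
for _ in range(100):
    resid = y_reg - F                  # negative gradient of squared-error loss
    t = DecisionTreeRegressor(max_depth=2).fit(x, resid)
    F += learning_rate * t.predict(x)  # add a shrunken correction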

Random Forests¶

  • Random forests are one of the easiest to use ML tools for tabular data.
  • Random forests implement bagging with a key insight to help limit overfitting: each split considers only a random subset of the features.
  • Random forests often perform well even with minimal tuning.
  • The algorithm allows for an "out-of-bag" (OOB) estimate of the error rate, which serves as a built-in estimate of how well the model will generalize.

Random Forests in SKL¶

  • The RandomForestClassifier from sklearn.ensemble implements RF for classification.
  • Similarly, RandomForestRegressor can be used for regression problems.
In [ ]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier

RF Hyper-parameters¶

  • n_estimators - how many trees to use,
  • max_depth - how far down to grow the trees,
    • min_samples_split and min_samples_leaf are alternatives,
  • max_features - maximum number of features to consider at each split,
  • max_samples - fit each tree to a subsample rather than a full bootstrap sample.
  • Read more in the scikit-learn documentation.
In [ ]:
rf = RandomForestClassifier(
    n_estimators=500,     # number of trees
    criterion='entropy',  # split criterion
    max_depth=None,       # None = no depth limit
    max_features='sqrt',  # features considered at each split
    oob_score=True,       # use CV otherwise
    max_samples=0.5,      # smaller yields more regularization
    n_jobs=2              # number of parallel jobs for fitting
)
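
    A usage sketch, assuming the X_train / y_train / X_test / y_test split from the earlier synthetic example; with oob_score=True the fitted forest exposes an out-of-bag accuracy estimate.
In [ ]:
rf.fit(X_train, y_train)
print(rf.oob_score_)             # OOB estimate of generalization accuracy
print(rf.score(X_test, y_test))  # held-out accuracy for comparison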

RF Hyper-parameters¶

  • Empirically, using $\sqrt{p}$ for max_features has been found to work well.
  • Focus on tuning: the number of trees, the depth of trees, and the subsample size (a cross-validation sketch follows below).
  • Introduce more regularization by using fewer trees, shallower trees, or a smaller subsample size.
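
    A sketch of tuning these hyper-parameters by cross-validation with GridSearchCV; the grid values below are illustrative, not recommendations.
In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    RandomForestClassifier(max_features='sqrt', n_jobs=2, random_state=42),
    param_grid={
        'n_estimators': [100, 500],  # number of trees
        'max_depth': [None, 8, 16],  # depth of trees
        'max_samples': [0.5, None]   # subsample size (None = full bootstrap)
    },
    cv=5
)
grid.fit(X_train, y_train)
print(grid.best_params_)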

RF OOB Score¶

  • When using a bootstrap sample or subsample, each tree uses only a subset of the training data.
  • Unused samples (for that tree) are called "out-of-bag" (OOB).
  • A "sub-estimator" can be constructed for each training case by using only the trees for which it is OOB.
  • The average performance of these sub-estimators can be used to assess how well a random forest model will generalize (see the sketch below).
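
    A sketch of inspecting the per-case OOB predictions, assuming rf is the fitted forest from the earlier usage sketch; oob_decision_function_ holds, for each training case, the class probabilities averaged over the trees for which that case was out-of-bag.
In [ ]:
import numpy as np

oob_prob = rf.oob_decision_function_            # shape (n_train, n_classes)
oob_pred = rf.classes_[oob_prob.argmax(axis=1)]
print(np.mean(oob_pred == y_train))             # should match rf.oob_score_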

Gradient Boosted Trees¶

  • Scikit-learn implements gradient boosting in the estimators GradientBoostingClassifier and GradientBoostingRegressor.
  • The xgboost implementation is also popular.
In [ ]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingClassifier

GBT Hyper-parameters¶

  • n_estimators - the number of boosting rounds (e.g. number of trees)
  • learning_rate - value by which to scale each new regressor
  • subsample - fraction of sample to use in each boosting round
  • max_depth - maximum depth of trees, see also: min_impurity_decrease, min_samples_split, and min_samples_leaf for implicit control of tree depth
  • max_features - maximum number of features to consider at each split
In [ ]:
gb1 = GradientBoostingClassifier(
    loss='deviance',      # logistic (log) loss for classification
    n_estimators=100,     # number of boosting rounds / trees
    learning_rate=.1,     # shrinkage applied to each new tree
    subsample=1,          # 1 = use the full sample in each round
    max_depth=16,         # depth of the individual trees
    max_features='sqrt',  # features considered at each split
    verbose=0             # no progress output
)
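
    A usage sketch, again assuming the synthetic X_train / y_train / X_test / y_test split from earlier.
In [ ]:
gb1.fit(X_train, y_train)
print(gb1.score(X_train, y_train), gb1.score(X_test, y_test))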

GBT Tuning¶

  • If the learning rate isn't small enough, the model may fail to "learn" and may not even fit the training data well.
  • If the learning rate is too small, the number of rounds must be increased to compensate.
  • My preferred strategy: find a suitable learning rate, then select the number of boosting rounds that perform best for that learning rate.
  • Other hyper-parameters are similar to random forests.

GBT Tuning¶

  • The method .staged_predict_proba() can be used to get estimates from the working model after each boosting round ("stage").
  • Useful for comparing out-of-sample performance on validation data for different numbers of rounds (aka "early stopping"); a sketch follows below.
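
    A sketch of "early stopping" by hand using staged_predict_proba(), assuming gb1 has been fit as above and that X_valid / y_valid name a held-out validation split (these names are illustrative assumptions).
In [ ]:
import numpy as np
from sklearn.metrics import log_loss

# validation loss after each boosting round ("stage")
losses = [log_loss(y_valid, prob) for prob in gb1.staged_predict_proba(X_valid)]
best_rounds = int(np.argmin(losses)) + 1
print(best_rounds, losses[best_rounds - 1])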

Takeaways¶

  • Random Forests and Gradient Boosted Trees are two of the best general-purpose classifiers (regressors) for ML with tabular data.
  • Random Forests are complex models but easy to tune.
  • Carefully tuned gradient boosted trees often give the best classification performance and are typically better calibrated.
  • Consider both as candidate models for most ML projects on tabular data, and at least one (likely RF) as a "baseline" model.
  • Set the random_state parameter to make results reproducible.