Ensemble Methods¶

Stats 507, Fall 2021

James Henderson, PhD
November 11, 2021

Overview¶

  • Decision Trees
  • Ensemble Methods
  • Random Forests
  • Gradient Boosted Trees
  • Takeaways

    As previously, these slides are intended to be read in conjunction with the "Isolet Demo" from the course repo.

Decision Trees¶

  • A decision tree is a model in which features are used one at a time to recursively divide the data into parts so that "like" cases are together.
  • Decision trees are highly flexible and particularly good for models with many interactions between features.
  • Single tree models are especially prone to overfitting (illustrated in the sketch below).
  • Let's review Figure 8.9 from ESL.
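
    Below is a minimal sketch of a single decision tree fit with scikit-learn's DecisionTreeClassifier; the synthetic data from make_classification() is an illustrative assumption (it is not the Isolet data) and is reused in later sketches.
In [ ]:
# a small synthetic classification problem, used only for illustration
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# an unpruned tree will typically fit the training data (nearly) perfectly
tree = DecisionTreeClassifier(max_depth=None, random_state=42)
tree.fit(X_train, y_train)
print(tree.score(X_train, y_train), tree.score(X_test, y_test))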

Ensemble Methods¶

  • An ensemble method or model is one in which a collection of ML models is trained on a common task and then combined to form a single estimator for that task.
  • Two important categories of ensemble methods are bagging and boosting.
  • Scikit-learn collects ensemble methods in the sklearn.ensemble API.

Bagging¶

  • Bagging, short for bootstrap aggregation, is one of the earliest ensemble methods.
  • Its intention is to improve the stability and reduce the variance of ML models such as decision trees.
  • In a nutshell, individual models are fit repeatedly to bootstrap replications of the training data and then averaged to form predictions (see the sketch below).
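
    A minimal sketch of bagging using scikit-learn's BaggingClassifier, whose default base estimator is a decision tree; the train/test split is assumed to come from the earlier synthetic sketch.
In [ ]:
from sklearn.ensemble import BaggingClassifier

bag = BaggingClassifier(
    n_estimators=100,  # number of bootstrap replications (trees, by default)
    bootstrap=True,    # resample the training data with replacement
    oob_score=True,    # score each case using trees for which it was left out
    random_state=42
)
bag.fit(X_train, y_train)
print(bag.oob_score_, bag.score(X_test, y_test))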

Boosting¶

  • Boosting builds an additive classifier or regression function from a collection of simpler estimators such as linear models or trees.
  • Three key ideas:
    1. Simpler classifiers such as trees are combined sequentially;
    2. At each step, samples are re-weighted to give more influence to high-residual or misclassified samples;
    3. The final model is a weighted sum of the models learned in each step.
  • Let's review figure 10.1 from The Elements of Statistical Learning (ESL).

Adaboost¶

  • Conceptually, it is helpful to refer to the Adaboost algorithm to understand how boosting works.
  • Adaboost increases the weight on points not well-fit by our model and (consequently) decreases the weight on well-fit points.
  • Individual classifiers are weighted by their performance on the (re-weighted) samples they were trained on; a short sklearn sketch follows below.
  • Let's review Algorithm 10.1 from ESL.
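
    A minimal sketch using scikit-learn's AdaBoostClassifier, which boosts depth-1 trees ("stumps") by default; the hyper-parameter values and the reused synthetic split are illustrative assumptions.
In [ ]:
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(
    n_estimators=200,   # number of boosting rounds
    learning_rate=1.0,  # shrinkage applied to each classifier's weight
    random_state=42
)
ada.fit(X_train, y_train)  # each round re-weights the training samples
print(ada.score(X_test, y_test))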

Gradient Boosting¶

  • Gradient boosting is similar to Adaboost, but at each stage fits a model to the negative gradient of the loss function (the "pseudo-residuals") rather than explicitly re-weighting the samples; a by-hand sketch for squared-error loss follows below.
  • Let's review Algorithm 10.3 in ESL.
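
    Below is a by-hand sketch (not the scikit-learn estimator) of gradient boosting for squared-error loss, where the negative gradient is just the residual y - F(x); the synthetic regression data, learning rate, and tree depth are illustrative assumptions.
In [ ]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=(500, 1))
y_reg = np.sin(x).ravel() + rng.normal(scale=0.3, size=500)

learning_rate = 0.1
F = np.full_like(y_reg, y_reg.mean())  # initial constant prediction
for _ in range(100):
    resid = y_reg - F                  # negative gradient of squared-error loss
    t = DecisionTreeRegressor(max_depth=2).fit(x, resid)
    F += learning_rate * t.predict(x)  # add a shrunken correction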

Random Forests¶

  • Random forests are one of the easiest to use ML tools for tabular data.
  • Random forests implement bagging with a key insight to help limit overfitting: each split considers only a random subset of the features.
  • Random forests often perform well even with minimal tuning.
  • The algorithm allows for an "out-of-bag" (OOB) estimate of the error rate, which serves as a built-in estimate of how well the model will generalize.

Random Forests in SKL¶

  • The RandomForestClassifier from sklearn.ensemble implements RF for classification.
  • Similarly, RandomForestRegressor can be used for regression problems.
In [ ]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier

RF Hyper-parameters¶

  • n_estimators - how many trees to use,
  • max_depth - how far down to grow the trees,
    • min_samples_split and min_samples_leaf are alternatives,
  • max_features - maximum number of features to consider at each split,
  • max_samples - fit each tree to a subsample rather than a full bootstrap sample.
  • Read more in the scikit-learn documentation.
In [ ]:
rf = RandomForestClassifier(
    n_estimators=500,     # number of trees
    criterion='entropy',  # split criterion
    max_depth=None,       # None = no depth limit
    max_features='sqrt',  # features considered at each split
    oob_score=True,       # use CV otherwise
    max_samples=0.5,      # smaller yields more regularization
    n_jobs=2              # number of parallel jobs for fitting
)
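
    A usage sketch, assuming the X_train / y_train / X_test / y_test split from the earlier synthetic example; with oob_score=True the fitted forest exposes an out-of-bag accuracy estimate.
In [ ]:
rf.fit(X_train, y_train)
print(rf.oob_score_)             # OOB estimate of generalization accuracy
print(rf.score(X_test, y_test))  # held-out accuracy for comparison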

RF Hyper-parameters¶

  • Empirically, using $\sqrt{p}$ for max_features has been found to work well.
  • Focus on tuning: the number of trees, the depth of trees, and the subsample size (a cross-validation sketch follows below).
  • Introduce more regularization by using fewer trees, shallower trees, or a smaller subsample size.
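
    A sketch of tuning these hyper-parameters by cross-validation with GridSearchCV; the grid values below are illustrative, not recommendations.
In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    RandomForestClassifier(max_features='sqrt', n_jobs=2, random_state=42),
    param_grid={
        'n_estimators': [100, 500],  # number of trees
        'max_depth': [None, 8, 16],  # depth of trees
        'max_samples': [0.5, None]   # subsample size (None = full bootstrap)
    },
    cv=5
)
grid.fit(X_train, y_train)
print(grid.best_params_)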

RF OOB Score¶

  • When using a bootstrap sample or subsample, each tree uses only a subset of the training data.
  • Unused samples (for that tree) are called "out-of-bag" (OOB).
  • A "sub-estimator" can be constructed for each training case by using only the trees for which it is OOB.
  • The average performance of these sub-estimators can be used to assess how well a random forest model will generalize (see the sketch below).
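
    A sketch of inspecting the per-case OOB predictions, assuming rf is the fitted forest from the earlier usage sketch; oob_decision_function_ holds, for each training case, the class probabilities averaged over the trees for which that case was out-of-bag.
In [ ]:
import numpy as np

oob_prob = rf.oob_decision_function_            # shape (n_train, n_classes)
oob_pred = rf.classes_[oob_prob.argmax(axis=1)]
print(np.mean(oob_pred == y_train))             # should match rf.oob_score_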

Gradient Boosted Trees¶

  • Scikit-learn implements gradient boosting in the estimators GradientBoostingClassifier and GradientBoostingRegressor.
  • The xgboost implementation is also popular.
In [ ]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingClassifier

GBT Hyper-parameters¶

  • n_estimators - the number of boosting rounds (e.g. number of trees)
  • learning_rate - value by which to scale each new regressor
  • subsample - fraction of sample to use in each boosting round
  • max_depth - maximum depth of trees, see also: min_impurity_decrease, min_samples_split, and min_samples_leaf for implicit control of tree depth
  • max_features - maximum number of features to consider at each split
In [ ]:
gb1 = GradientBoostingClassifier(
    loss='deviance',      # logistic (log) loss for classification
    n_estimators=100,     # number of boosting rounds / trees
    learning_rate=.1,     # shrinkage applied to each new tree
    subsample=1,          # 1 = use the full sample in each round
    max_depth=16,         # depth of the individual trees
    max_features='sqrt',  # features considered at each split
    verbose=0             # no progress output
)
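
    A usage sketch, again assuming the synthetic X_train / y_train / X_test / y_test split from earlier.
In [ ]:
gb1.fit(X_train, y_train)
print(gb1.score(X_train, y_train), gb1.score(X_test, y_test))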

GBT Tuning¶

  • If the learning rate isn't small enough, the model may fail to "learn" and may not even fit the training data well.
  • If the learning rate is too small, the number of rounds must be increased to compensate.
  • My preferred strategy: find a suitable learning rate, then select the number of boosting rounds that perform best for that learning rate.
  • Other hyper-parameters are similar to random forests.

GBT Tuning¶

  • The method .staged_predict_proba() can be used to get estimates from the working model after each boosting round ("stage").
  • Useful for comparing out-of-sample performance on validation data for different numbers of rounds (aka "early stopping"); a sketch follows below.
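
    A sketch of "early stopping" by hand using staged_predict_proba(), assuming gb1 has been fit as above and that X_valid / y_valid name a held-out validation split (these names are illustrative assumptions).
In [ ]:
import numpy as np
from sklearn.metrics import log_loss

# validation loss after each boosting round ("stage")
losses = [log_loss(y_valid, prob) for prob in gb1.staged_predict_proba(X_valid)]
best_rounds = int(np.argmin(losses)) + 1
print(best_rounds, losses[best_rounds - 1])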

Takeaways¶

  • Random Forests and Gradient Boosted Trees are two of the best general-purpose classifiers (regressors) for ML with tabular data.
  • Random Forests are complex models but easy to tune.
  • Carefully tuned gradient boosted trees often give the best classification performance and are typically better calibrated.
  • Consider both as candidate models for most ML projects on tabular data, and at least one (likely RF) as a "baseline" model.
  • Set the random_state parameter to make results reproducible.