AdaBoost, short for adaptive boosting, is a meta-algorithm used to improve the performance of other learning algorithms. In boosting, sets of weak learners (such as decision stumps) are aggregated to yield a strong learner. In AdaBoost, the error rate of the current set of weak learners is used to reweight the data for the subsequent weak learners, so that harder examples receive more attention. AdaBoost is generally described as less susceptible to overfitting than many learning algorithms, yet it is sensitive to noisy data and outliers.
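For reference, here is a sketch of the standard two-class (discrete) AdaBoost update, which the SAMME algorithm used later generalizes to multiple classes. At iteration \(t\), the weak learner \(h_t\) has weighted error \(\epsilon_t\) and receives weight \(\alpha_t\), and the example weights \(w_i\) are updated so that misclassified examples count more in the next round:
\[
\alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}, \qquad
w_i \leftarrow \frac{w_i \exp\left(-\alpha_t\, y_i\, h_t(x_i)\right)}{Z_t}, \qquad
H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t\, h_t(x)\right),
\]
where \(y_i \in \{-1, +1\}\) and \(Z_t\) normalizes the weights to sum to one.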
In this tutorial, we will provide an overview of AdaBoost implementations, focusing in particular on applying them to the problem of classifying occupancy in UCI’s Occupancy Detection Data Set.
The three tutorials below contain overviews of AdaBoost methods in Python, R, and SAS.
In scikit-learn, the ensemble module provides two useful AdaBoost estimators: AdaBoostClassifier and AdaBoostRegressor. As the names suggest, AdaBoostClassifier is for classification and AdaBoostRegressor is for regression.
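As a minimal sketch of the two interfaces (using small synthetic data sets here rather than the occupancy data used later):
## Minimal illustration of AdaBoostClassifier and AdaBoostRegressor
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor
from sklearn.datasets import make_classification, make_regression

X_clf, y_clf = make_classification(n_samples=100, random_state=0)
clf = AdaBoostClassifier(n_estimators=50).fit(X_clf, y_clf)
print(clf.score(X_clf, y_clf))   # classification accuracy on the training data

X_reg, y_reg = make_regression(n_samples=100, random_state=0)
reg = AdaBoostRegressor(n_estimators=50).fit(X_reg, y_reg)
print(reg.score(X_reg, y_reg))   # R^2 on the training data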
In this section, I will focus on AdaBoostClassifier. I have generated four different kinds of data sets: a linearly separable data set, a nonlinearly separable data set, an inseparable data set, and a multi-class data set. The code to generate the data is shown below. I will also try AdaBoostClassifier on real data.
## import packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_gaussian_quantiles
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
Linearly separable data set
## Construct linear separable data set
def generate_linear_data(n):
    w0 = np.random.rand(2) # choose true linear separator at random
    X = np.random.randn(n, 2) # examples follow gaussian distribution
    Y = (np.sign(X.dot(w0)) + 1)/2 # select signs using true separator
    X = X + 0.1*np.outer(Y, w0) # increase the margin a little bit
    return X, Y
X_linear, Y_linear = generate_linear_data(400)
Nonlinearly separable data set
## Construct nonlinearly separable data set
def generate_nonlinear_data(n):
    X1, Y1 = make_gaussian_quantiles(cov = 2., n_samples = n, n_features = 2, n_classes = 2, random_state = 1)
    X2, Y2 = make_gaussian_quantiles(mean = (3, 3), cov = 1.5, n_samples = n, n_features = 2, n_classes = 2, random_state = 1)
    X = np.concatenate((X1, X2))
    Y = np.concatenate((Y1, - Y2 + 1))
    return X, Y
X_nonlinear, Y_nonlinear = generate_nonlinear_data(200)
Inseparable data set
## Construct inseparable data set
def generate_inseparable_data(n):
    w0 = np.random.rand(2) # choose true logistic parameter
    X = np.random.randn(n, 2) # examples follow gaussian distribution
    # compute probabilities using the logistic transformation
    probs = np.exp(5*X.dot(w0))/(1 + np.exp(5*X.dot(w0)))
    # generate Y[i] to be 1 with probability probs[i]
    # and 0 otherwise
    Y = (np.random.rand(n) < probs).astype(int)
    return X, Y
X_inseparable, Y_inseparable = generate_inseparable_data(400)
Multi-class data set
def generate_multiclass_data(n):
    X, Y = make_gaussian_quantiles(n_samples=n, n_features=10, n_classes=3, random_state=1)
    n_split = n // 4  # use a quarter of the samples for training
    X_train, X_test = X[:n_split], X[n_split:]
    Y_train, Y_test = Y[:n_split], Y[n_split:]
    return X_train, X_test, Y_train, Y_test
X_multi_train, X_multi_test, Y_multi_train, Y_multi_test = generate_multiclass_data(1200)
There are two groups of parameters in AdaBoostClassifier: Framework Parameters and Weak Learner Parameters.
AdaBoostClassifier(base_estimator=None, n_estimators=50, learning_rate=1.0, algorithm="SAMME.R")
The base_estimator parameter specifies the weak learner. In theory you can choose any classification learner, but it must support sample weights. AdaBoostClassifier defaults to a CART classification tree (DecisionTreeClassifier). If the algorithm is SAMME.R, the weak learner must also support class-probability predictions.
Scikit-learn implements two AdaBoost classification algorithms, SAMME and SAMME.R. SAMME uses the weak learner's classification error to update the sample weights, whereas SAMME.R uses the predicted class probabilities. The default algorithm for AdaBoostClassifier is SAMME.R.
The n_estimators parameter is the maximum number of weak learners (boosting iterations). In general, too small an n_estimators leads to under-fitting, while too large an n_estimators leads to over-fitting. The default is 50.
The learning_rate parameter is the shrinkage applied to each weak learner's contribution. A smaller learning_rate requires more iterations to achieve the same fit, so n_estimators and learning_rate should be tuned together.
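As a minimal sketch of tuning these two framework parameters jointly by cross-validation (the grid values here are illustrative; any of the two-class data sets generated above could be substituted):
## Sketch: choosing n_estimators and learning_rate together with a grid search
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [50, 100, 200],    # illustrative grid values
              "learning_rate": [0.1, 0.5, 1.0]}
search = GridSearchCV(AdaBoostClassifier(), param_grid, cv=5)
search.fit(X_linear, Y_linear)                   # any two-class data set works here
print(search.best_params_)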
There are many different kinds of weak learners to choose from. I will take DecisionTreeClassifier as an example to introduce its parameters here.
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_features=None, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0, max_leaf_nodes=None), n_estimators=50, learning_rate=1.0, algorithm="SAMME.R")
max_features is the maximum number of features to consider when splitting a node. "log2" means considering \(\log_2 N\) features; "sqrt" or "auto" means considering \(\sqrt{N}\) features; an integer is the absolute number of features to consider; a float is the fraction of features. The default, None, means considering all features.
max_depth is the maximum depth of the decision tree. When the sample size is small, we do not need to restrict the depth. When the sample size is large, values between 10 and 100 are commonly used.
min_samples_split: if a node contains fewer samples than min_samples_split, it will not be split further. The default is 2. If the sample size is not large, this value does not need tuning.
min_samples_leaf: if a split would leave a leaf with fewer samples than min_samples_leaf, the leaf is pruned along with its sibling. The default is 1. An integer gives a count; a float gives a fraction. If the sample size is small, this value does not need tuning.
min_weight_fraction_leaf: if the total sample weight in a leaf falls below min_weight_fraction_leaf, the leaf is pruned along with its sibling. The default is 0. In general, this parameter becomes relevant when many samples have missing values or very unequal weights.
max_leaf_nodes: limiting the maximum number of leaf nodes can prevent over-fitting. The default is None. With few features this value may not need tuning; otherwise we can choose max_leaf_nodes by cross-validation.
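For example, a customized decision tree (with illustrative parameter values) can be passed as the weak learner; note that in recent scikit-learn releases this argument is named estimator rather than base_estimator:
## Sketch: customizing the weak learner passed to AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

stump = DecisionTreeClassifier(max_depth=1,         # a decision stump
                               min_samples_leaf=5,  # illustrative value
                               max_features=None)   # consider all features at each split
bdt = AdaBoostClassifier(base_estimator=stump, n_estimators=50, learning_rate=1.0)
bdt.fit(X_linear, Y_linear)                         # any of the data sets above works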
We now use AdaBoostClassifier to classify the three two-class data sets. The first two data sets are separable; the last one is inseparable.
This function calculates the correct classification rate (training accuracy). It takes X, Y, max_depth = 2, algorithm = "SAMME", and n_estimators = 10 as parameters: max_depth configures the weak learner, while algorithm and n_estimators configure the boosting framework.
## Calculate correct rate
def calculate_rate(X, Y, max_depth = 2, algorithm = "SAMME", n_estimators = 10):
    bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth = max_depth), algorithm = algorithm, n_estimators = n_estimators)
    bdt.fit(X, Y)
    n = sum(bdt.predict(X)==Y)
    return float(n)/len(X)
This function visualizes the classification result on two-class data. It takes X, Y, max_depth = 2, algorithm = "SAMME", and n_estimators = 10 as parameters: max_depth configures the weak learner, while algorithm and n_estimators configure the boosting framework.
## Draw the classification plot
def draw_classification(X, Y, max_depth=2, algorithm="SAMME", n_estimators=10):
    bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth = max_depth), algorithm = algorithm, n_estimators = n_estimators)
    bdt.fit(X, Y)
    ## Plot the decision boundaries
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.01))
    Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap = plt.cm.Paired)
    plt.axis("tight")
    # Plot the training points
    class_names = "AB"
    plot_colors = ['orange', 'blue']
    for i, n, c in zip(range(2), class_names, plot_colors):
        idx = np.where(Y == i)
        plt.scatter(X[idx, 0], X[idx, 1], c = c, cmap = plt.cm.Paired, s = 20, edgecolor = 'k', label = "Class %s" % n)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.legend(loc='upper right')
    plt.xlabel('x')
    plt.ylabel('y')
print(calculate_rate(X_linear, Y_linear))
print(calculate_rate(X_nonlinear, Y_nonlinear))
print(calculate_rate(X_inseparable, Y_inseparable))
1.0
0.8725
0.6525
plt.figure(figsize=(10, 4))
plt.subplot(131)
draw_classification(X_linear, Y_linear)
plt.title("Linear Data")
plt.subplot(132)
draw_classification(X_nonlinear, Y_nonlinear)
plt.title("Nonlinear Data")
plt.subplot(133)
draw_classification(X_inseparable, Y_inseparable)
plt.title("Inseparable Data")
_ = plt.tight_layout()
plt.show()
As the classification of the second data set is not satisfactory, I change the parameters to improve the performance.
print(calculate_rate(X_nonlinear, Y_nonlinear, 3, "SAMME", 10))
print(calculate_rate(X_nonlinear, Y_nonlinear, 4, "SAMME", 10))
print(calculate_rate(X_nonlinear, Y_nonlinear, 5, "SAMME", 10))
0.91
0.995
1.0
plt.figure(figsize=(10, 4))
plt.subplot(131)
draw_classification(X_nonlinear, Y_nonlinear, max_depth=3)
plt.title("Nonlinear Data with Depth=4")
plt.subplot(132)
draw_classification(X_nonlinear, Y_nonlinear, max_depth=4)
plt.title("Nonlinear Data with Depth=6")
plt.subplot(133)
draw_classification(X_nonlinear, Y_nonlinear, max_depth=5)
plt.title("Nonlinear Data with Depth=8")
_ = plt.tight_layout()
plt.show()
Let's also examine how changing the parameters improves the apparent performance on the last data set; because these data are inseparable, this also illustrates over-fitting.
print(calculate_rate(X_inseparable, Y_inseparable, 4, "SAMME", 10))
print(calculate_rate(X_inseparable, Y_inseparable, 6, "SAMME", 10))
print(calculate_rate(X_inseparable, Y_inseparable, 8, "SAMME", 10))
0.9275
0.9975
1.0
plt.figure(figsize=(10, 4))
plt.subplot(131)
draw_classification(X_inseparable, Y_inseparable, max_depth=4)
plt.title("Inseparable Data with Depth=4")
plt.subplot(132)
draw_classification(X_inseparable, Y_inseparable, max_depth=6)
plt.title("Inseparable Data with Depth=6")
plt.subplot(133)
draw_classification(X_inseparable, Y_inseparable, max_depth=8)
plt.title("Inseparable Data with Depth=8")
_ = plt.tight_layout()
plt.show()
For the real data, we use the occupancy data set, which is already divided into a training set and a test set; evaluating on held-out data lets us detect over-fitting.
## Read the occupancy training and test data (CSV files with a header row)
X_train, Y_train = [], []
with open('./datatraining.txt', 'r') as f_train:
    for line in f_train.readlines()[1:]:
        fields = line.strip('\n').split(',')
        X_train.append([float(x) for x in fields[2:6]])
        Y_train.append(float(fields[-1]))
X_test, Y_test = [], []
with open('./datatest.txt', 'r') as f_test:
    for line in f_test.readlines()[1:]:
        fields = line.strip('\n').split(',')
        X_test.append([float(x) for x in fields[2:6]])
        Y_test.append(float(fields[-1]))
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth = 2), algorithm = "SAMME", n_estimators = 20)
bdt.fit(X_train, Y_train)
n_train = sum(bdt.predict(X_train)==Y_train)
print(float(n_train)/len(X_train))
n_test = sum(bdt.predict(X_test)==Y_test)
print(float(n_test)/len(X_test))
0.994350976299
0.960225140713
We now compare the convergence of the SAMME and SAMME.R algorithms on the multi-class data set generated earlier, using staged_predict to track the test error after each boosting iteration.
bdt_real = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2), n_estimators=600, learning_rate=1)
bdt_discrete = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2), n_estimators=600, learning_rate=1.5, algorithm="SAMME")
bdt_real.fit(X_multi_train, Y_multi_train)
bdt_discrete.fit(X_multi_train, Y_multi_train)
real_test_errors = []
discrete_test_errors = []
for real_test_predict, discrete_test_predict in zip(
        bdt_real.staged_predict(X_multi_test), bdt_discrete.staged_predict(X_multi_test)):
    real_test_errors.append(1. - accuracy_score(real_test_predict, Y_multi_test))
    discrete_test_errors.append(1. - accuracy_score(discrete_test_predict, Y_multi_test))
n_trees_discrete = len(bdt_discrete)
n_trees_real = len(bdt_real)
discrete_estimator_errors = bdt_discrete.estimator_errors_[:n_trees_discrete]
real_estimator_errors = bdt_real.estimator_errors_[:n_trees_real]
discrete_estimator_weights = bdt_discrete.estimator_weights_[:n_trees_discrete]
real_estimator_weights = bdt_real.estimator_weights_[:n_trees_real]
plt.figure(figsize=(15, 5))
plt.subplot(131)
plt.plot(range(1, n_trees_discrete + 1), discrete_test_errors, c='black', label='SAMME')
plt.plot(range(1, n_trees_real + 1), real_test_errors, c='black', linestyle='dashed', label='SAMME.R')
plt.legend()
plt.ylim(0.18, 0.62)
plt.ylabel('Test Error')
plt.xlabel('Number of Trees')
plt.subplot(132)
plt.plot(range(1, n_trees_discrete + 1), discrete_estimator_errors, "b", label='SAMME', alpha=.5)
plt.plot(range(1, n_trees_real + 1), real_estimator_errors, "r", label='SAMME.R', alpha=.5)
plt.legend()
plt.ylabel('Error')
plt.xlabel('Number of Trees')
plt.ylim((.2, max(real_estimator_errors.max(), discrete_estimator_errors.max()) * 1.2))
plt.xlim((-20, len(bdt_discrete) + 20))
plt.subplot(133)
plt.plot(range(1, n_trees_discrete + 1), discrete_estimator_weights, "b", label='SAMME')
plt.legend()
plt.ylabel('Weight')
plt.xlabel('Number of Trees')
plt.ylim((0, discrete_estimator_weights.max() * 1.2))
plt.xlim((-20, n_trees_discrete + 20))
plt.subplots_adjust(wspace=0.25)
plt.show()
First, we read the data sets and use the R package adabag to generate the adaboost classifier using the training data set:
library(DRR)
library(RcppRoll)
library(caret)
library(adabag)
## read two test data sets and one training data
test1 <- read.table("datatest.txt",header = T,sep = ",")
test1 <- test1[,-1]
test1$Occupancy <- as.factor(test1$Occupancy)
test2 <- read.table("datatest2.txt",header = T,sep = ",")
test2 <- test2[,-1]
test2$Occupancy <- as.factor(test2$Occupancy)
training <- read.table("datatraining.txt",header = T, sep = ",")
training <- training[,-1]
training$Occupancy <- as.factor(training$Occupancy)
# Use boosting to fit the model and calculate error rate
a <- boosting(Occupancy~.,data=training)
z0 <- table(training[,6],predict(a,training)$class)
z0
##
## 0 1
## 0 6414 0
## 1 0 1729
sprintf("The error rate on training data is %f",(sum(z0)-sum(diag(z0)))/sum(z0))
## [1] "The error rate on training data is 0.000000"
The error rate on the training data is zero, indicating perfect classification. Next, we use two test data sets to test the classifier and calculate the test error rate:
z1 <- table(test1[,6],predict(a,test1)$class)
z1
##
## 0 1
## 0 1646 47
## 1 132 840
sprintf("The error rate on test data 1 is %f",(sum(z1)-sum(diag(z1)))/sum(z1))
## [1] "The error rate on test data 1 is 0.067167"
z2 <- table(test2[,6],predict(a,test2)$class)
z2
##
## 0 1
## 0 7410 293
## 1 233 1816
The first table shows that 47 observations whose Occupancy is actually 0 and 132 observations whose Occupancy is actually 1 are misclassified, giving the error rate of 0.067167 reported above. The second test data set gives similar results:
sprintf("The error rate on test data 2 is %f",(sum(z2)-sum(diag(z2)))/sum(z2))
## [1] "The error rate on test data 2 is 0.053938"
First, we can see the error trend in the training data set for our adaboost algorithm:
b<-errorevol(a,training)
plot(b$error,type="l",xlab="Number of trees",ylab="error rate",main="AdaBoost error vs number of trees",mgp=c(2,1,0))
We can see that after the seventh iteration the error rate becomes zero. Then we can use test data to draw the plot:
c<-errorevol(a,test1)
plot(c$error,type="l",xlab="Number of trees",ylab="error rate",main="AdaBoost error vs number of trees",mgp=c(2,1,0))
After around 40 iterations, the error rate changes minimally.
By setting the coeflearn argument we can fit the model with the "Breiman", "Freund", or "Zhu" (SAMME) weight-update formulas. We can then compare the approaches using the test data:
learn<-function(coef)
{
a<-boosting(Occupancy~.,data=training,coeflearn = coef)
b<-errorevol(a,test1)
plot(b$error,type="l",ylim=c(0.02,1),xlab="Number of trees",ylab="error rate",main="AdaBoost error vs number of trees",mgp=c(2,1,0))
}
learn("Breiman")
legend(85,0.04,c("Breiman","Freund","Zhu"),lty=c(1,1,1),col=c("Black","Red","Blue"))
learnline<-function(coef,color)
{
a<-boosting(Occupancy~.,data=training,coeflearn = coef)
b<-errorevol(a,test1)
lines(b$error,col=color)
}
learnline("Freund","Red")
learnline("Zhu","Blue")
We can see that the three methods behave differently: the "Breiman" update converges quickly but has a relatively high error rate, the "Freund" update reaches a lower error rate but converges somewhat more slowly, and the "Zhu" update performs worst on this data set.
This is an implementation of Real AdaBoost, a variant of AdaBoost that uses a slightly different reweighting procedure. The implementation relies on a macro developed by Paul Edwards, Dina Duhon, and Suhail Shergill (linked at the end of this section).
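For reference, in the usual Real AdaBoost formulation (Friedman, Hastie, and Tibshirani), each round fits a class-probability estimate \(p_t(x)\) on the weighted training data and adds the contribution
\[
f_t(x) = \tfrac{1}{2}\ln\frac{p_t(x)}{1 - p_t(x)}, \qquad
w_i \leftarrow w_i\, e^{-y_i f_t(x_i)},
\]
with \(y_i \in \{-1, +1\}\) and the weights renormalized after each round; the final score is \(\sum_t f_t(x)\). The macro's exact implementation details are described in the linked source.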
Prior to running AdaBoost, it is important to make sure your dataset of interest is partitioned into training, validation, and test sets. The occupancy data used here for demonstration are already partitioned.
Running the AdaBoost macro is simple, but demands that the user specify the following inputs:
Data - Training data set.
Target - Response variable of interest, which should be a binary variable. In the case of the occupancy data, the response variable is an indicator of whether a building is occupied.
Var - Predictors of interest. With the occupancy data, the predictors of interest include Temperature, Humidity, Light, CO2, and HumidityRatio. Note that, when specifying the variables in the macro call, they should be separated only by a space.
Scoreme - Data sets used to evaluate the AdaBoost performance, such as a test and validation set. At least one test or validation set should be provided.
Seed - Randomization seed. Setting a randomization seed will allow for replication of results.
Ntree - Number of trees to use in the model.
Treedepth - Maximum depth of each tree. Depth here refers to the number of leaves for each tree.
Outada - The name of the object storing the adaboost model.
Learningrate - Controls the rate of model optimization. Should be set between 0 and 1. Generally speaking, a lower learning rate improves model fit but increases computational expense.
Samplprop - Proportion of the training population to bag at each iteration. For a given model, the i-th tree is fit using this proportion of the training data, while the remaining data is used to prune the tree. If samplprop equals 1, bagging and tree pruning are disabled.
%adaboost(data=WORK.TRAIN,
target=Occupancy,
var= Temperature Humidity Light CO2 HumidityRatio,
scoreme= work.TEST work.VALIDATE,
seed=1234,
ntree=10,
treedepth=3,
outada=outadaboost,
learningrate=0.2,
samplprop=1);
The adaboost macro can also generate diagnostics that indicate, among other things, the relative importance of the different variables in predicting the response. In this case, the importance scores produced by the call below indicate that Light and CO2, for example, are the most significant predictors, while Humidity is less pertinent.
%adaimp(inmodel=outadaboost);
ods select none;
ods output Measures=MM;
proc freq data=Validate_scr ;
tables Occupancy * ( adascore ) / measures noprint;
run;
ods select all;
The adaboost macro also produces summary statistics on classification error, which can be plotted using proc sgplot in SAS. The plot below shows training and validation misclassification rates across different numbers of leaves per tree. According to these results, the optimal number of leaves per tree is roughly 30.
proc sgplot data=_vsq;
title 'Leaves and Classification Error';
series x=_NW_ y=_MISC_ / LEGENDLABEL="Training Set";
series x=_NW_ y=_VMISC_ / LEGENDLABEL="Validation Set";
YAXIS LABEL = "Misclassification Rate";
run;
To test different parameterizations of the adaboost model, it is useful to compare test prediction performance by looping through different parameter selections. To do this, one can use %do loops within a SAS macro. Specify some variable (in this case, Q) to loop over, and use this loop to test different parameter values.
The example and plot below demonstrate this comparison by evaluating the prediction performance obtained with different numbers of trees. As the plot suggests, with this particular parameterization, there is negligible difference in prediction accuracy across models with 10-20 trees. One can also iterate through different learning rates, sample proportions, and tree depths in order to find an optimal model.
%Macro TestParams;
proc datasets lib=work;
delete Testerror_1-Testerror_30;
run;
%DO Q = 10 %to 20;
%adaboost(data=WORK.TRAIN,
target=Occupancy,
var= Temperature Humidity Light CO2 HumidityRatio,
scoreme= work.VALIDATE work.TEST,
seed=1234, ntree=&Q,
treedepth=8,
outada=outadaboost,
learningrate=.2,
samplprop=.7);
Data work.Testerror_&Q;
set Work.Test_scr;
ParamLevel = &Q;
PredCorrect = Occupancy*adapredict_Occupancy;
keep PredCorrect ParamLevel Occupancy adapredict_Occupancy;
run;
%end;
data work.Testerror;
set work.Testerror_:;
run;
proc sql;
create table MeanPredRate as
select distinct
ParamLevel,
mean(PredCorrect) as PredAccuracy
from work.TestError
group by ParamLevel;
quit;
%Mend TestParams;
%TestParams;
proc sgplot data=Meanpredrate;
title "Trees and Test Accuracy";
series x=ParamLevel y=PredAccuracy / LEGENDLABEL="Prediction Accuracy";
YAXIS LABEL = "Accurate Prediction Rate" MIN=0 MAX=1;
XAXIS LABEL = "Number of Trees";
run;
The real adaboost macro can be found at the following github site: https://github.com/pedwardsada/real_adaboost/blob/master/adaboost.sas