Data: Diamonds
Languages:
SAS - Yung-Chun Lee
Python - Yuming Yang
R - Sandra Nwogu
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal, but they are also a popular tool in machine learning.
In general, decision tree models are simple to understand and to interpret because the fitted trees can be visualised. However, they are prone to problems such as overfitting and instability.
(Note that this tutorial focuses on constructing and fitting a decision tree rather than on full diagnostics of the fitted model.)
To further conceptualize the idea of a decision tree, consider this example using a set of graphics to partition a sample space.
Suppose we have two sets of points (20 red and 20 blue) distributed in the \(\mathbb{R}^2\) plane. We can carry out a sequence of steps to classify the data into groups and subgroups, essentially building a tree with color-coded branches.
Step 1: We define all points where the x value is greater than 4.7 as blue, and those that do not match this criterion as red.
Step 2: Within the subspace where x > 4.7 from the previous step, we further define points as blue if y > 1.5, and the rest as red.
Step 3: Within the region where x > 4.7 and y > 2.5, we add another branch: points with y greater than 3.4 are assigned blue, and the rest red.
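These three steps are nothing more than nested if/else rules on the coordinates, which is exactly how a decision tree partitions a space. Below is a minimal R sketch of that idea; the function is purely illustrative (it is not used in the later analysis) and simply takes the cut-offs 4.7, 1.5, 2.5, and 3.4 literally from the step descriptions above.
# Nested splits corresponding to Steps 1-3 above
classify_point <- function(x, y) {
  if (x <= 4.7) return("red")        # Step 1: split on x at 4.7
  if (y <= 1.5) return("red")        # Step 2: split the x > 4.7 region on y at 1.5
  if (y <= 2.5) return("blue")       # this band is not re-split in Step 3
  if (y > 3.4) "blue" else "red"     # Step 3: split the y > 2.5 region on y at 3.4
}
classify_point(5.0, 3.0)   # x > 4.7 and 2.5 < y <= 3.4, so classified "red"
Each additional step grows the tree one level deeper over a smaller and smaller region of the plane.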
The dataset used in this tutorial is the 'Diamonds' dataset, a data frame with 53940 observations on 10 variables, including:
- price in $ (continuous)
- carat weight (numeric)
- cut quality (factor)
- diamond color (factor)
- clarity (factor)
- length variables (numeric)
- depth and width variables (numeric)
For this analysis, we define price as the response variable, with carat, cut, color, and clarity as the predictor variables. In the original data set, price is a continuous variable, but here we have categorized it into four levels and replaced it with the factor variable price_cat:
\[
\begin{split}
\text{price\_cat} =
\begin{cases}
\text{<5000} & \text{if } price < 5000\\
\text{5000~10000} & \text{if } 5000 \leq price < 10000\\
\text{10000~15000} & \text{if } 10000 \leq price < 15000\\
\text{>=15000} & \text{if } price \geq 15000\\
\end{cases}
\end{split}
\] We also split the data into train and test datasets to run a prediction and test our model; the ratio of the split is 80:20 for train and test respectively.
library(data.table)
diamond = fread("diamonds.csv")
##splitting data
set.seed(4*12*2017)
test.index = sort(sample(1:dim(diamond)[1], 11000))
train.diamond = diamond[-test.index,]
test.diamond = diamond[test.index,]
write.csv(test.diamond, file="test.diamond.csv")
write.csv(train.diamond, file="train.diamond.csv")
proc import datafile= 'M:/STAT506/SAS/data/train.diamond.csv' out= diamond_train;
run;
Use proc import datafile = [DATA_PATH] out = [NAME_YOUR_DATA];
to import a CSV file into the SAS working environment. Make sure to add run;
to execute the code above.
In the original data set, price is a continuous variable; we coded it into four levels, set it as the response variable, and selected the following variables as predictors:
Variable | Input Type | Desired Type | Levels |
---|---|---|---|
carat | numeric | continuous | x |
cut | character | categorical | 5 |
clarity | character | categorical | 8 |
color | character | categorical | 7 |
Using the SAS code:
data train;
set diamond_train;
length price_cat $11;  /* set the length first so the longer labels are not truncated */
price_cat = '>=15000';
if ( price <15000 and price >=10000 )then price_cat= '10000~15000';
if ( price <10000 and price >=5000 )then price_cat= '5000~10000';
if price <5000 then price_cat = '<5000';
drop V1 depth table x y z var1;
run;
We can either use drop [VAR_TO_BE_DROP_1] [VAR_TO_BE_DROP_2] ...;
to remove variables, or keep [VAR_TO_BE_KEEP_1] [VAR_TO_BE_KEEP_2] ...;
to preserve the variables we want.
To see the output of the data, we can use:
proc print data=train(obs=10) ;
run;
proc hpsplit data=train seed=2017;
target price_cat;
input carat / level=INT ;
input color cut clarity / level =NOM;
run;
PROC HPSPLIT
is the SAS procedure for fitting decision trees.
Here we specify seed = [CONSTANT]
so that the results are reproducible.
TARGET [RESPONSE];
specifies the single response variable.
Recall that we selected four variables, carat, cut, clarity, and color, as predictors. We can declare the desired data type of each predictor in the INPUT
statement, INPUT [PREDICTOR] / LEVEL= (INT/NOM)
, where INT marks an interval (continuous) variable and NOM a nominal one.
Another syntax to fit PROC HPSPLIT:
/* same approach different syntax, similar to proc glm*/
proc hpsplit data=train seed=2017 ;
class price_cat carat color cut clarity;
model price_cat = carat color cut clarity;
run;
The above syntax gives an identical result to the one we specified earlier. It shares syntax with PROC GLM
, which uses CLASS
to declare which variables are categorical, and MODEL [RESPONSE] = [PREDICTOR_1] [PREDICTOR_2] ...;
to specify the model.
As we observe in the output: since we did not specify any constraints on the model, the default options tend to pick a large tree, which may overfit.
/* control tree depth/branch */
proc hpsplit data=train seed=2017 maxbranch=2 maxdepth=3;
target price_cat;
input carat / level=INT ;
input color cut clarity / level =NOM;
run;
Compared with the model in I, the only difference is in the splitting options: MAXBRANCH = [NUM]
and MAXDEPTH = [NUM]
control the maximum number of branches and the maximum tree depth, respectively. See the SAS documentation for HPSPLIT for more statement options.
By imposing constraints on the decision tree, we get a simpler model:
/* pruning */
proc hpsplit data=train seed=2017 ;
target price_cat;
input carat / level=INT ;
input color cut clarity / level =NOM;
prune misc / N <= 8; /*default= entropy*/
run;
Once again, we added one extra statement compared to the model we fitted in I: PRUNE [BY-METRIC] / {METRIC STATEMENT};
[BY-METRIC]
can be one of several statistics, such as ASE
, ENTROPY
(the default), GINI
and MISC
; see the HPSPLIT documentation for details.
{METRIC STATEMENT}
takes the form until operator value, also described in the documentation. From the output of PROC HPSPLIT
we can examine and adjust the model. Notice that there is NO generic procedure in SAS for prediction with a decision tree; instead, we can use this output to help prune the tree and set up appropriate constraints.
While the decision tree method in PROC HPSPLIT
is intuitive and easy to carry out, the fitted tree can vary from trial to trial; you might consider PROC FOREST
to fit a random forest and base predictions on that.
In this section, I use the tree module in sklearn to do the decision tree classification.
Here I import the packages first. Make sure these packages are installed.
import numpy as np
import pandas as pd
from sklearn import tree
import bisect
from sklearn import preprocessing
import graphviz
Use read_csv function in pandas to read the train and test data.
train = pd.read_csv('train.diamond.csv')
test = pd.read_csv('test.diamond.csv')
The tree module in sklearn doesn't support categorical predictors. Since the variables color, cut, and clarity are all categorical, we need to encode them as numeric values. Here I use LabelEncoder from sklearn.preprocessing.
# define the encoders
le_color = preprocessing.LabelEncoder()
le_clarity = preprocessing.LabelEncoder()
le_cut = preprocessing.LabelEncoder()
# fit the encoders
le_color.fit(['J', 'I', 'H', 'G', 'F', 'E', 'D'])
le_clarity.fit(['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'])
le_cut.fit(['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'])
# transform our variables (LabelEncoder expects 1-d arrays)
train['color'] = le_color.transform(train['color'])
train['cut'] = le_cut.transform(train['cut'])
train['clarity'] = le_clarity.transform(train['clarity'])
test['color'] = le_color.transform(test['color'])
test['cut'] = le_cut.transform(test['cut'])
test['clarity'] = le_clarity.transform(test['clarity'])
Then we can build our train and test matrices. I use bisect to code the price into 4 levels.
# define the bounds
sec = [0, 5000, 10000, 15000]
# select useful columns to get X
train_X = train[['carat', 'color', 'clarity', 'cut']].values
# code the price
train_Y = []
for value in train[['price']].values:
train_Y.append(bisect.bisect_left(sec, value[0]))
test_X = test[['carat','color','clarity','cut']].values
test_Y = []
for value in test[['price']].values:
test_Y.append(bisect.bisect_left(sec, value[0]))
print(train_X)
Then the final train predictor matrix train_X looks like:
[[ 0.21 1. 2. 3. ]
[ 0.23 1. 4. 1. ]
[ 0.29 5. 5. 3. ]
...,
[ 0.72 0. 2. 2. ]
[ 0.72 0. 2. 1. ]
[ 0.7 0. 2. 4. ]]
We can build our decision tree model now. Here we set max_depth=3 to get a simple tree structure.
clf = tree.DecisionTreeClassifier(criterion = "gini", random_state = 100, max_depth=3, min_samples_leaf=4)
clf = clf.fit(train_X, train_Y)
We can now make predictions on the test data and check the accuracy of the model by comparing the predicted classes with the true classes.
yfit = clf.predict(test_X)
# proportion of test observations classified correctly
accuracy = np.mean(yfit == np.array(test_Y))
print("Accuracy")
print(accuracy)
Accuracy
0.8574545454545455
A tree graph makes the decision tree classification process easier to understand. We draw the graph using graphviz. Make sure the Graphviz executables (such as dot) are on your system PATH.
dot_data = tree.export_graphviz(clf, out_file=None,
feature_names=['carat','color','clarity','cut'],
class_names=['0-5000','5000-10000','10000-15000','15000+'],
filled=True, rounded=True,
special_characters=True, impurity=False, proportion=True)
graph = graphviz.Source(dot_data)
graph.render("DecisionTree")   # writes DecisionTree.pdf
The graph looks like:
In this graph, the different frame colors represent different classification results. We can see that carat and clarity are the two critical variables in the decision process with max_depth = 3.
Building a decision tree with the Python sklearn package is easy but not very flexible.
Pros:
1. The tree graph is easy to draw and makes the model simple to understand and interpret.
2. The model is easy to build, and parameters (like max_depth) are convenient to adjust.
Cons:
1. Categorical variables are not supported and little built-in analysis of the final model is provided, so a fair amount of preprocessing and manual analysis is needed.
2. Few summary statistics are reported, making the model harder to assess and tune.
Using the same diamonds dataset, we will replicate the decision tree in R.
library(ggplot2)
library(rpart)
library(rpart.plot)
library(dplyr)
library(RColorBrewer)
library(party)
library(tree)
library(rattle)
#Defining price levels as a factor variable with 4 levels
diamonds <- diamonds %>%
mutate(price_cat = ifelse(price <= 5000, "1",
ifelse(price > 5000 & price <= 10000, "2",
ifelse(price > 10000 & price <= 15000, "3",
ifelse(price > 15000, "4", "N/A")))))
diamonds$price_cat <- factor(diamonds$price_cat, levels = c(1, 2, 3, 4),
labels = c("<= $5000",
"$5000 - $10000",
"$10000 - $15000",
"> $15000"))
#Splitting data into train and test subsets so we can test how well our model performs
set.seed(4*12*2017)
test.index = sort(sample(1:dim(diamonds)[1], 11000))
train.diamond = diamonds[-test.index,]
test.diamond = diamonds[test.index,]
I will be using the “rpart” package here. The “rpart” package is a standard package for classification by decision trees and can also be used to generate regression trees.
“Rpart” works by first finding the single variable which best splits the data into two groups. The data is separated and the same process is applied to each group recursively until the groups either reach a minimum size or no improvement can be made. The split is made on the independent variable that results in the largest possible reduction in heterogeneity of the dependent variable.
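For classification trees, rpart's default measure of heterogeneity is the Gini index. As a brief sketch in standard notation (not taken verbatim from the rpart documentation): for a node \(t\) with class proportions \(p_1, \dots, p_K\),
\[
G(t) = 1 - \sum_{k=1}^{K} p_k^2,
\]
and a candidate split of \(t\) into children \(t_L\) and \(t_R\) containing fractions \(n_L/n\) and \(n_R/n\) of the node's observations is scored by the impurity reduction
\[
\Delta G = G(t) - \frac{n_L}{n}\,G(t_L) - \frac{n_R}{n}\,G(t_R),
\]
with the split that gives the largest \(\Delta G\) chosen at each stage.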
rpart_fit <- rpart(price_cat ~ carat + cut + color + clarity, train.diamond, method="class")
#Plot decision tree
fancyRpartPlot(rpart_fit, main="Classification Tree", sub="")
prp(rpart_fit)
#Other options include "igraph's" graph.tree()
Here I used fancyRpartPlot() and prp() for comparison. prp() is from the “rpart.plot” package, and fancyRpartPlot() comes with the “rattle” package. I’ve found base R’s plot() harder to work with, especially for deep trees like this.
Note that the algorithm makes the best possible choice at each fork in the decision space without any consideration for whether those choices remain the optimal choices in future stages.
To avoid overfitting the data and to keep the model no more complex than it needs to be, we can prune the tree by removing sections with little or no power to classify instances. This can be done in a number of ways, most of which involve cross validation. Here, I’m going to use cross validation to prune the tree at the size with the least cross-validated error. I’ll do this using the “rpart” package first and then show how to do it with the “tree” package.
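The complexity parameter cp reported by printcp() below comes from cost-complexity pruning. Roughly (a sketch of the usual scaled formulation, assuming rpart's defaults), a subtree \(T\) is scored by
\[
R_{cp}(T) = R(T) + cp \times |T| \times R(T_1),
\]
where \(R(T)\) is the misclassification risk of \(T\), \(|T|\) is its number of splits, and \(R(T_1)\) is the risk of the root tree with no splits. Pruning at a given cp keeps only those splits whose improvement in risk exceeds that threshold, and the xerror column reports the cross-validated relative error at each cp value.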
#We can prune the tree by using the cp with the least cross validated error so we aren't overfitting the data.
printcp(rpart_fit)
##
## Classification tree:
## rpart(formula = price_cat ~ carat + cut + color + clarity, data = train.diamond,
## method = "class")
##
## Variables actually used in tree construction:
## [1] carat clarity color
##
## Root node error: 11677/42940 = 0.27194
##
## n= 42940
##
## CP nsplit rel error xerror xstd
## 1 0.315321 0 1.00000 1.00000 0.0078962
## 2 0.097200 1 0.68468 0.68468 0.0069078
## 3 0.030145 3 0.49028 0.49054 0.0060337
## 4 0.024236 5 0.42999 0.43025 0.0057039
## 5 0.020382 6 0.40575 0.40601 0.0055616
## 6 0.012589 7 0.38537 0.38794 0.0054514
## 7 0.010000 10 0.34761 0.35360 0.0052316
#The lowest cp is 0.01, but the next lowest, 0.012589, has a similar cross-validated error, so I'll use it as my cp bound and prune the tree to 8 terminal nodes.
ctree <- prune(rpart_fit, cp= 0.012589)
fancyRpartPlot(ctree, main = "Pruned Tree", sub="")
A classification tree can also be viewed as implicitly testing the null hypothesis of independence between each predictor variable and the response; conditional inference trees (ctree() in the “party” package loaded above) make this explicit by splitting on the variable with the lowest p-value, the one with the strongest evidence for rejecting that hypothesis. In this case that variable is ‘carat’.
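As a minimal sketch of that idea (purely for comparison, assuming the “party” package loaded above; the object gets a new name so it does not clash with the pruned rpart tree stored in ctree earlier):
#Conditional inference tree: splits chosen by association tests rather than impurity reduction
cond_tree <- party::ctree(price_cat ~ carat + cut + color + clarity,
                          data = train.diamond)
plot(cond_tree)
The resulting tree is read the same way as the rpart trees above, except that each split also carries the test statistic and p-value that justified it.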
#Tree model
plot.new()
treemod <- tree(price_cat ~ carat + cut + color + clarity, train.diamond)
plot(treemod)
text(treemod, cex=.5)
#Using the cv.tree() function
cvTree <- cv.tree(treemod, FUN=prune.misclass)
plot(cvTree, xlim = c(0, 11))
#Setting the number of terminal nodes with the best= argument; the lowest point of the curve is at 9-10 nodes, but the difference at 8 nodes is minuscule
prunedTree <- prune.misclass(treemod, best=8)
plot(prunedTree)
text(prunedTree, cex=.5)
In the cv.tree plot, the lower x-axis is the number of terminal nodes and the upper x-axis is the corresponding value of the cost-complexity parameter used in the cross validation; the curve shows how the cross-validated misclassification error varies with these. The plot suggests that pruning at 8 terminal nodes gives essentially the same misclassification error as 10 nodes, so to keep the tree as simple as possible while keeping the misclassification error as low as possible, I pruned at 8 nodes, just as with the “rpart” package.
#Get column index of predicted variable in dataset
ColNum <- grep("price_cat", names(diamonds))
#Predict on test data
pred <- predict(ctree, test.diamond[,-ColNum], type="class")
#Find proportion of correct predictions using the test set
mean(pred==test.diamond$price_cat)
## [1] 0.8950909
#Confusion matrix
table(pred=pred, true=test.diamond$price_cat)
## true
## pred <= $5000 $5000 - $10000 $10000 - $15000 > $15000
## <= $5000 7519 109 0 0
## $5000 - $10000 444 1801 237 14
## $10000 - $15000 0 37 281 94
## > $15000 0 24 195 245
#How does the pruned model stack up against the original?
pred <- predict(rpart_fit, test.diamond[,-ColNum], type="class")
#Find proportion of correct predictions using the test set
mean(pred==test.diamond$price_cat)
## [1] 0.9058182
#Confusion matrix
table(pred=pred, true=test.diamond$price_cat)
## true
## pred <= $5000 $5000 - $10000 $10000 - $15000 > $15000
## <= $5000 7693 165 0 0
## $5000 - $10000 270 1745 237 14
## $10000 - $15000 0 37 281 94
## > $15000 0 24 195 245
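For a per-class view of either confusion matrix above, one small sketch (tab here is just the table computed for the current pred) is:
#Per-class recall: proportion of each true class that was predicted correctly
tab <- table(pred = pred, true = test.diamond$price_cat)
diag(tab) / colSums(tab)
This makes it easy to see that the cheapest category is predicted far more reliably than the middle and upper price categories.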