Data: Diamonds
Languages:
SAS - Yung-Chun Lee
Python - Yuming Yang
R - Sandra Nwogu

Introduction to Decision Trees

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal, but they are also a popular tool in machine learning.

In general, decision tree models are simple to understand and to interpret because trees can be visualised. However, they are prone to problems such as overfitting and instability.

(Note that in this tutorial we focus on an introduction to constructing and fitting a decision tree, rather than on full diagnostics of the fitted model.)

Examples

To further conceptualize the idea of a decision tree, consider this example using a set of graphics to partition a sample space.

Suppose we have two sets of points (20 red and 20 blue) distributed in the \(\mathbb{R}^2\) plane. We can carry out a sequence of splits to classify the points into groups and subgroups, essentially building a tree with color-coded branches.

[Figure: flowchart of the recursive partition and the corresponding tree]
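To make this concrete, here is a minimal R sketch (not part of the diamonds analysis; the data frame toy and its columns x1 and x2 are made up for illustration) that simulates 20 red and 20 blue points and lets rpart recover an axis-aligned partition of the plane:

library(rpart)

set.seed(1)
toy <- data.frame(
  x1     = c(rnorm(20, mean = 0), rnorm(20, mean = 2)),
  x2     = c(rnorm(20, mean = 0), rnorm(20, mean = 2)),
  colour = factor(rep(c("red", "blue"), each = 20))
)

# each split is a cut on x1 or x2, so the leaves tile the plane into rectangles
toy_tree <- rpart(colour ~ x1 + x2, data = toy, method = "class",
                  control = rpart.control(minsplit = 5, cp = 0.01))
plot(toy_tree, margin = 0.1)
text(toy_tree)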

Description of Data

The dataset used in this tutorial will be the ‘Diamonds’ dataset, a data frame with 53,940 observations on 10 variables:
Price in US dollars (continuous)
Carat weight (numeric)
Cut quality (factor)
Diamond color (factor)
Clarity (factor)
Length, width, and depth in mm: x, y, z (numeric)
As well as total depth percentage and table width (numeric).

For this analysis, we define price as the response variable, with carat, cut, clarity, and color as the predictor variables.
In the original data set, price is a continuous variable, but here we have categorized it into four levels and replaced it with the factor variable price_cat:
\[
\text{price\_cat} =
\begin{cases}
\text{<5000} & \text{if } price < 5000\\
\text{5000~10000} & \text{if } 5000 \leq price < 10000\\
\text{10000~15000} & \text{if } 10000 \leq price < 15000\\
\text{>=15000} & \text{if } price \geq 15000
\end{cases}
\]
We also split the data into train and test datasets so we can make predictions and test our model; the ratio of the split is roughly 80:20 for train and test respectively.

library(data.table)
diamond = fread("diamonds.csv")

##splitting data: hold out 11,000 observations (~20%) as the test set
set.seed(4*12*2017)
test.index = sort(sample(1:dim(diamond)[1], 11000))
train.diamond = diamond[-test.index,]
test.diamond = diamond[test.index,]
write.csv(test.diamond, file="test.diamond.csv")
write.csv(train.diamond, file="train.diamond.csv")
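For reference, the recoding above can also be done directly in R with cut(); this is only a sketch of the same four-level factor, and the SAS, Python, and R sections below each rebuild it with their own tools.

## sketch: half-open intervals [a, b) match the definition of price_cat above
diamond$price_cat <- cut(diamond$price,
                         breaks = c(-Inf, 5000, 10000, 15000, Inf),
                         labels = c("<5000", "5000~10000", "10000~15000", ">=15000"),
                         right  = FALSE)
table(diamond$price_cat)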

Decision Tree Practice in Different Languages

via SAS

Input data in SAS

proc import datafile= 'M:/STAT506/SAS/data/train.diamond.csv' out= diamond_train;
run;

Use proc import datafile = [DATA_PATH] out = [NAME_YOUR_DATA]; to import a csv file into the SAS working environment. Make sure to add run; to execute the step above.

Extract data and convert variables

In the original data set, price is a continuous variable; we coded it into four levels and set it as the response variable, and selected the following variables as predictors:

Variable    Input Type    Desired Type    Levels
carat       numeric       continuous      x
cut         character     categorical     5
clarity     character     categorical     8
color       character     categorical     7

Using the SAS code:

data train;
  length price_cat $ 11;  /* set the length first so longer labels are not truncated */
  set diamond_train;
  price_cat = '>=15000';
  if ( price <15000 and price >=10000 )then price_cat= '10000~15000';
  if ( price <10000 and price >=5000 )then price_cat= '5000~10000';
  if price <5000 then price_cat = '<5000';
  drop V1 depth table x y z var1;
run;

We can either use drop [VAR_TO_DROP_1] [VAR_TO_DROP_2] ...; to discard variables, or keep [VAR_TO_KEEP_1] [VAR_TO_KEEP_2] ...; to preserve only the variables we want.

Data display

To see the output of the data, we can use:

proc print data=train(obs=10) ; 
run;
[Output: first 10 observations of data=train]

Decision tree fitting (PROC HPSPLIT)

I. First, we fit the decision tree without setting any constraints.
proc hpsplit data=train seed=2017;
   target price_cat;
   input carat / level=INT ; 
   input color cut clarity / level =NOM;
run;
  • PROC HPSPLIT is the SAS procedure for fitting decision trees.

  • Here we specify the seed to be a fixed number, seed = [CONSTANT], so that the results are reproducible.

  • TARGET [RESPONSE]: here we plug in a single response variable.

  • Recall that we selected four variables (carat, cut, clarity, and color) as predictors; we specify the desired data type in the INPUT statement, INPUT [PREDICTOR] / LEVEL= (INT/NOM), marking each predictor as interval (continuous) or nominal (categorical).

Another Syntax to fit PROC HPSPLIT

/* same approach different syntax, similar to proc glm*/
proc hpsplit data=train seed=2017 ; 
   class price_cat carat color cut clarity;
   model price_cat = carat color cut clarity; 
run;

The above syntax gives an identical result to the one we specified before. It shares some syntax with PROC GLM: CLASS declares which variables are categorical, and MODEL [RESPONSE] = [PREDICTOR_1] [PREDICTOR_2] ...; specifies the model.

As we observe the output:

[Figure: unconstrained tree output showing overfitting]

Since we did not specify any constraints on the model, the default options tend to grow a large tree, which may overfit.

II. Fitting the model with some constraints.
/* control tree depth/branch */
proc hpsplit data=train seed=2017 maxbranch=2 maxdepth=3;
   target price_cat;
   input carat / level=INT ; 
   input color cut clarity / level =NOM;
run;

Compared with the model in I, the only difference is in the splitting options: MAXBRANCH = [NUM] and MAXDEPTH = [NUM] control the maximum number of branches per split and the maximum depth of the tree, respectively. See the SAS HPSPLIT documentation for more statement options.

By imposing constraints on decision tree we get a simpler model:

[Figure: tree fitted with constraints]

III. Pruning
/* pruning */
proc hpsplit data=train seed=2017 ;
   target price_cat;
   input carat / level=INT ; 
   input color cut clarity / level =NOM;
   prune misc / N <= 8; /*default= entropy*/
run;

Once again, we added one extra statement compared to the model we fitted in I. PRUNE [BY-METRIC] / {METRIC STATEMENT}: [BY-METRIC] has several options using different statistics, such as ASE, ENTROPY (default), GINI, and MISC; see details here.

  • {METRIC STATEMENT} := until operator value; details here
IV. Accessing output
[Figure: PROC HPSPLIT output]

Here we can inspect the example output from PROC HPSPLIT and use it to analyze and adjust the model. Notice that there is NO generic procedure in SAS for prediction with a decision tree; we can use this information to help us tune the pruning parameters and set up appropriate constraints.

Things to consider

  • While the decision tree method in PROC HPSPLIT is intuitive and easy to carry out, the fitted tree may vary from trial to trial; you might consider PROC FOREST to fit a random forest and base predictions on that.

via Python

On this page, I use the tree module in sklearn to do the decision tree classification.
First, import the required packages. Make sure these packages are installed.

import numpy as np
import pandas as pd
from sklearn import tree
import bisect
from sklearn import preprocessing
import graphviz

Read the dataset

Use read_csv function in pandas to read the train and test data.

train = pd.read_csv('train.diamond.csv')
test = pd.read_csv('test.diamond.csv')

Preprocessing

The tree module in sklearn doesn’t support categorical predictors. Since the variables color, cut, and clarity are all categorical, we need to encode them as numeric values. Here I use LabelEncoder from sklearn.preprocessing.

# define the encoders
le_color = preprocessing.LabelEncoder()
le_clarity = preprocessing.LabelEncoder()
le_cut = preprocessing.LabelEncoder()

# fit the encoders
le_color.fit(['J', 'I', 'H', 'G', 'F', 'E', 'D'])
le_clarity.fit(['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'])
le_cut.fit(['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'])

# transform our variables (LabelEncoder expects 1-D input, so pass the columns directly)
train['color'] = le_color.transform(train['color'])
train['cut'] = le_cut.transform(train['cut'])
train['clarity'] = le_clarity.transform(train['clarity'])

test['color'] = le_color.transform(test['color'])
test['cut'] = le_cut.transform(test['cut'])
test['clarity'] = le_clarity.transform(test['clarity'])

Then we can build our train and test matrices. I use bisect to code the price into 4 levels.

# define the bounds
sec = [0, 5000, 10000, 15000]

# select useful columns to get X
train_X = train[['carat', 'color', 'clarity', 'cut']].values

# code the price; bisect_right matches the half-open intervals used to define price_cat
train_Y = [bisect.bisect_right(sec, p) for p in train['price'].values]

test_X = test[['carat', 'color', 'clarity', 'cut']].values
test_Y = [bisect.bisect_right(sec, p) for p in test['price'].values]

print(train_X)

Then the final train predictor matrix train_X looks like:

[[ 0.21  1.    2.    3.  ]
 [ 0.23  1.    4.    1.  ]
 [ 0.29  5.    5.    3.  ]
 ..., 
 [ 0.72  0.    2.    2.  ]
 [ 0.72  0.    2.    1.  ]
 [ 0.7   0.    2.    4.  ]]

Do classification with decision tree

We can build our decision tree model now. Here we set max_depth=3 to get a simple tree structure.

clf = tree.DecisionTreeClassifier(criterion = "gini", random_state = 100, max_depth=3, min_samples_leaf=4)
clf = clf.fit(train_X, train_Y)

We can now predict on the test data and check the accuracy of the model by comparing the predicted labels with the true labels.

yfit = clf.predict(test_X)
accuracy = np.mean(yfit == np.array(test_Y))  # proportion of correct predictions
print("Accuracy")
print(accuracy)

Draw the tree graph to see the structure

A tree graph makes the decision tree classification process easier to understand. We draw the graph using graphviz. Make sure the Graphviz executables (e.g., dot) are on your system PATH.

dot_data = tree.export_graphviz(clf, out_file=None,
                         feature_names=['carat','color','clarity','cut'],
                         class_names=['0-5000','5000-10000','10000-15000','15000+'],
                         filled=True, rounded=True,
                         special_characters=True, impurity=False, proportion=True)
graph = graphviz.Source(dot_data)
graph.render("DecisionTree")  # writes DecisionTree.pdf (render appends the file extension)

The graph looks like:

In this graph, different frame colors represent different classification results. We can see that carat and clarity are the two critical variables in the decision process with tree max_depth = 3.


Things to consider

Building a decision tree with Python's sklearn package is easy but not very flexible.
Pros:
1. The tree graph is easy to draw and makes the model simple to understand and interpret.
2. The model is easy to build, and parameters (like max_depth) are convenient to adjust.
Cons:
1. Categorical variables are not supported directly and little summary of the final model is provided, so a fair amount of preprocessing and manual analysis is needed.
2. It offers relatively few tuning options and diagnostic statistics, which makes the model harder to assess and adjust.

via R

Using the same diamonds dataset, we will replicate the decision tree in R.

Packages used

library(ggplot2)
library(rpart)
library(rpart.plot)
library(dplyr)
library(RColorBrewer)
library(party)
library(tree)
library(rattle)

Preparing the data

#Defining price levels as a factor variable with 4 levels

diamonds <- diamonds %>%
  mutate(price_cat = ifelse(price <= 5000, "1", 
                            ifelse(price > 5000 & price <= 10000, "2", 
                                   ifelse(price > 10000 & price <= 15000, "3", 
                                          ifelse(price > 15000, "4", "N/A")))))
                                          
diamonds$price_cat <- factor(diamonds$price_cat, levels = c(1, 2, 3, 4), 
                                                 labels = c("<= $5000", 
                                                            "$5000 - $10000",
                                                            "$10000 - $15000", 
                                                            "> $15000"))
#Splitting data into train and test subsets so we can test how well our model performs
set.seed(4*12*2017)
test.index = sort(sample(1:dim(diamonds)[1], 11000))
train.diamond = diamonds[-test.index,]
test.diamond = diamonds[test.index,]

I will be using the “rpart” package here. The “rpart” package is a standard package for classification by decision trees and can also be used to generate regression trees.

“Rpart” works by first finding the single variable which best splits the data into two groups. The data is separated and the same process is applied to each group recursively until the groups either reach a minimum size or no improvement can be made. The split is made on the independent variable that results in the largest possible reduction in heterogeneity of the dependent variable.


We define our model and plot the decision tree

rpart_fit <- rpart(price_cat ~ carat + cut + color + clarity, train.diamond, method="class")
#Plot decision tree
fancyRpartPlot(rpart_fit, main="Classification Tree", sub="")

prp(rpart_fit)

#Other options include "igraph's" graph.tree()

Here I used fancyRpartPlot() and prp() for comparison. prp() is from the “rpart.plot” package, and fancyRpartPlot() comes with the “rattle” package. I’ve found base R’s plot() harder to work with, especially for deep trees like this.

Note that the algorithm makes the best possible choice at each fork in the decision space without any consideration for whether those choices remain the optimal choices in future stages.

To avoid overfitting the data and ensure we have a reasonable model, we can prune the tree by removing sections with little or no power to classify instances. This can be done in a number of ways, most of which involve cross validation. Here, I’m going to use cross validation to prune to the subtree with the least cross-validated error. I’ll do this using the “rpart” package and then show how to do the same using the “tree” package.

#We can prune the tree by using the cp with the least cross validated error so we aren't overfitting the data.
printcp(rpart_fit)
## 
## Classification tree:
## rpart(formula = price_cat ~ carat + cut + color + clarity, data = train.diamond, 
##     method = "class")
## 
## Variables actually used in tree construction:
## [1] carat   clarity color  
## 
## Root node error: 11677/42940 = 0.27194
## 
## n= 42940 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.315321      0   1.00000 1.00000 0.0078962
## 2 0.097200      1   0.68468 0.68468 0.0069078
## 3 0.030145      3   0.49028 0.49054 0.0060337
## 4 0.024236      5   0.42999 0.43025 0.0057039
## 5 0.020382      6   0.40575 0.40601 0.0055616
## 6 0.012589      7   0.38537 0.38794 0.0054514
## 7 0.010000     10   0.34761 0.35360 0.0052316
#The lowest cp is 0.01, but the next value, 0.0126, is so close that I'll use it as my cp bound and prune the tree to 8 terminal nodes. 

ctree <- prune(rpart_fit, cp= 0.012589)
fancyRpartPlot(ctree, main = "Pruned Tree", sub="")

A split can also be viewed through a hypothesis-testing lens: conditional inference trees (implemented in the “party” package loaded above) test the null hypothesis of independence between each predictor and the response, and split on the variable with the lowest p-value, the one with the strongest evidence for rejecting that hypothesis. rpart instead chooses the split that most reduces node impurity; either way, ‘carat’ dominates the first split here.
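Since the “party” package is already loaded, a conditional inference tree that splits exactly this way can be sketched as follows (the maxdepth = 3 cap is arbitrary, just to keep the plot readable):

#Conditional inference tree: each split is chosen by a permutation test on the
#predictor with the smallest p-value, and splitting stops when no association is significant
ci_tree <- ctree(price_cat ~ carat + cut + color + clarity,
                 data = train.diamond,
                 controls = ctree_control(maxdepth = 3))
plot(ci_tree)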


Using the “tree” package

#Tree model
plot.new()
treemod <- tree(price_cat ~ carat + cut + color + clarity, train.diamond)

plot(treemod)
text(treemod, cex=.5)

#Using the cv.tree() function
cvTree <- cv.tree(treemod, FUN=prune.misclass)
plot(cvTree, xlim = c(0, 11))

#Setting the number of nodes using the best= parameter; as you can see, the lowest point is at 9-10 nodes, but the dip at 8 nodes is minuscule
prunedTree <- prune.misclass(treemod, best=8)
plot(prunedTree)
text(prunedTree, cex=.5)

In the cv.tree plot, the lower x-axis is the number of terminal nodes and the upper x-axis is the corresponding cost-complexity parameter k; the y-axis shows how the cross-validated misclassification error varies with tree size. The plot suggests pruning at 8 terminal nodes, where the misclassification error is essentially the same as at 10 nodes. To keep the tree as simple as possible while keeping the misclassification error as low as possible, I pruned at 8 nodes, just as with the “rpart” package.
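If you prefer numbers to the plot, the same information is stored in the cvTree object; a quick way to inspect it:

#Cross-validation results: size = number of terminal nodes, dev = CV misclassification count
data.frame(size = cvTree$size, misclass = cvTree$dev)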


Let’s predict on the test data to evaluate our model

#Get column index of predicted variable in dataset
ColNum <- grep("price_cat", names(diamonds))
#Predict on test data
pred <- predict(ctree, test.diamond[,-ColNum], type="class")
#Find proportion of correct predictions using test set
mean(pred==test.diamond$price_cat) 
## [1] 0.8950909
#Confusion matrix
table(pred=pred, true=test.diamond$price_cat)
##                  true
## pred              <= $5000 $5000 - $10000 $10000 - $15000 > $15000
##   <= $5000            7519            109               0        0
##   $5000 - $10000       444           1801             237       14
##   $10000 - $15000        0             37             281       94
##   > $15000               0             24             195      245
#How does the pruned model stack up against the original?
pred <- predict(rpart_fit, test.diamond[,-ColNum], type="class")
#Find proportion of correct predictions using test set
mean(pred==test.diamond$price_cat) 
## [1] 0.9058182
#Confusion matrix
table(pred=pred, true=test.diamond$price_cat)
##                  true
## pred              <= $5000 $5000 - $10000 $10000 - $15000 > $15000
##   <= $5000            7693            165               0        0
##   $5000 - $10000       270           1745             237       14
##   $10000 - $15000        0             37             281       94
##   > $15000               0             24             195      245

Things to consider

  1. It might seem like the unpruned model is better for prediction, but each model needs to be tested on several different training and test splits to make sure that conclusion holds up (see the sketch below).
  2. Consider using “jitter” for deeper trees to make the visualization better.
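As a rough sketch of point 1, the code below repeats the random 80:20 split a few times and records the test accuracy of the unpruned and pruned rpart trees; the helper split_once, the five repeats, and the reuse of cp = 0.012589 are arbitrary choices for illustration.

#Repeat the 80:20 split several times and compare unpruned vs. pruned accuracy
split_once <- function(seed) {
  set.seed(seed)
  idx   <- sort(sample(1:nrow(diamonds), 11000))
  train <- diamonds[-idx, ]
  test  <- diamonds[idx, ]
  fit   <- rpart(price_cat ~ carat + cut + color + clarity, train, method = "class")
  pfit  <- prune(fit, cp = 0.012589)  # reuse the cp bound found above for simplicity
  c(full   = mean(predict(fit,  test, type = "class") == test$price_cat),
    pruned = mean(predict(pfit, test, type = "class") == test$price_cat))
}

sapply(1:5, split_once)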