Linear Discriminant Analysis (LDA) is one of the most commonly used techniques for applications such as decision making, classification, and dimension reduction. The main goal of LDA is to find the discriminant functions: linear combinations of the independent variables that separate two or more categories of the dependent variable as well as possible. LDA lets us examine whether significant differences exist among the groups in terms of the independent variables, and it also evaluates how accurate the resulting classification is.
Generally speaking, LDA is intended for data sets in which there is a categorical response variable, which allows us to group the observations into categories based on the values of their independent variables. LDA is applicable for classification whenever the categorical response defines two or more groups. It should also be noted that LDA assumes the independent variables are multivariate normally distributed within each group and that their variances are equal across groups.
For our data analysis, we will be using the “Seeds” dataset, found on UC Irvine’s Machine Learning Repository. This dataset includes observations from 3 different varieties of wheat: Kama, Rosa, and Canadian. Each group consists of 70 randomly selected observations, for a total of 210 observations. Aside from the grouping variable, there are 7 geometric measurements recorded for each observation:
Number | Variable Name | Variable Description |
---|---|---|
1 | area | Area |
2 | peri | Perimeter |
3 | comp | Compactness |
4 | l | Length of kernel |
5 | w | Width of kernel |
6 | asym | Asymmetry coefficient |
7 | lgroove | Length of kernel groove |
It is reasonable to use this data set for LDA, firstly because there is a categorical response, namely the seed variety. In terms of normality, a cursory look at the independent variables indicated that their empirical distributions by group were all roughly symmetric. While not perfectly normal, they are close enough that the normality assumption is reasonable. We also choose to standardize so that the variances of the independent variables are on the same scale, which makes the equal-variance assumption reasonable as well (a quick sketch of these checks follows below). Hence, this data set appears to be a reasonable candidate for LDA.
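As an illustration, here is a minimal sketch in R of the assumption checks described above, previewing the dataset and column names used in the R tutorial later in this piece:
# look at each predictor's distribution by group (roughly symmetric?)
seeds = read.csv("./seeds.csv", sep="\t", header=TRUE)
boxplot(area ~ group, data = seeds)
# compare group-wise variances of every predictor
aggregate(. ~ group, data = seeds, FUN = var)
# standardizing puts the predictors on a common scale
seeds_std = data.frame(scale(seeds[, 1:7]), group = seeds$group)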
Note that the only major data processing change to the data set was to add column names. We did this manually in the csv itself so that each language uses the same names. The (slightly modified) csv can be obtained from the project’s Github site.
For this analysis, we present three tutorials, each using a different statistical software. The languages presented here are R, STATA, and SAS.
For our actual tutorials, we will start by demonstrating how to load the Seeds data in each language and how to display some basic summary statistics. We will then split the data into a training and a test set through randomization, with a roughly 70-30 split. The training set will then be used to estimate discriminant functions using the independent variables from the Seeds data set. We can then use the discriminant functions from the training set to classify observations in the test set based on their values for the independent variables. This will allow us to estimate prediction error, as we will compare the predicted classes from the LDA model to the actual classes of observations in the test set.
In our analysis, it is necessary to split the data into training and test sets; we do this to estimate discriminant functions based on the training data and then apply these functions to the test data to obtain estimated prediction errors. Due to different randomization mechanisms in the different languages, the training and test sets differ in each tutorial. As such, the prediction errors and estimates for the discriminant functions differ across languages. The results are similar, as the core task shown in each tutorial is the same, but the numerical summaries differ somewhat.
There are a few key libraries to install and load for our analysis. To perform Linear Discriminant Analysis in R we will make use of the lda function in the package MASS. We will also use candisc for standardization, and ggplot2 and car for graphing purposes.
library(MASS)
library(candisc)
library(ggplot2)
library(car)
After reading in our dataset and converting the grouping variable to a labeled factor, we can view the summaries of our variables of interest using the summary function.
# read the tab-delimited data and label the grouping variable
seeds = read.csv("./seeds.csv", sep="\t", header=TRUE)
seeds$group = as.factor(seeds$group)
levels(seeds$group) = c("Kama", "Rosa", "Canadian")
summary(seeds)
## area peri comp l
## Min. :10.59 Min. :12.41 Min. :0.8081 Min. :4.899
## 1st Qu.:12.27 1st Qu.:13.45 1st Qu.:0.8569 1st Qu.:5.262
## Median :14.36 Median :14.32 Median :0.8734 Median :5.524
## Mean :14.85 Mean :14.56 Mean :0.8710 Mean :5.629
## 3rd Qu.:17.30 3rd Qu.:15.71 3rd Qu.:0.8878 3rd Qu.:5.980
## Max. :21.18 Max. :17.25 Max. :0.9183 Max. :6.675
## w Asym lgroove group
## Min. :2.630 Min. :0.7651 Min. :4.519 Kama :70
## 1st Qu.:2.944 1st Qu.:2.5615 1st Qu.:5.045 Rosa :70
## Median :3.237 Median :3.5990 Median :5.223 Canadian:70
## Mean :3.259 Mean :3.7002 Mean :5.408
## 3rd Qu.:3.562 3rd Qu.:4.7687 3rd Qu.:5.877
## Max. :4.033 Max. :8.4560 Max. :6.550
scatterplotMatrix(seeds[1:7])
We notice through the scatterplot matrix that some variables are collinear with each other, especially the variables area and peri with l and w. This makes sense intuitively, as length and width largely determine area and perimeter. Normally this collinearity would be a problem, but since our goal is to use all given data for classification purposes only, we can ignore the issue as long as we achieve high accuracy in our testing procedure.
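As a quick numerical check of this collinearity, we can also inspect the correlation matrix of the predictors directly:
# pairwise correlations among the 7 predictors
round(cor(seeds[1:7]), 2)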
We want to find the linear combinations of the variables that provide the best separation of the three varieties of seeds. The number of combinations needed depends on the number of groups and the number of variables: here there are \(3\) groups and \(7\) variables. The greatest number of useful discriminant functions that can separate the wheat types by their geometric properties is the lesser of “number of groups minus one” and “number of predictors”. As a result, we can expect to use 2 discriminant functions to classify our wheat types.
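That is, with \(g\) groups and \(p\) predictors, the number of useful discriminant functions is at most \[\min(g - 1,\; p) = \min(3 - 1,\; 7) = 2.\]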
In order to use LDA, we need to first split the data into a part used to train the classifier and another part used to test it. For this example we will use a 70:30 split:
# fix the RNG seed, then sample 70% of the row indices for training
set.seed(123)
seedss = sample.int(n = nrow(seeds), size = floor(.7*nrow(seeds)), replace = F)
train = seeds[seedss, ]
test = seeds[-seedss, ]
We are then able to train our classifier in the following way:
lseeds = lda(group~., train)
lseeds
## Call:
## lda(group ~ ., data = train)
##
## Prior probabilities of groups:
## Kama Rosa Canadian
## 0.3333333 0.3469388 0.3197279
##
## Group means:
## area peri comp l w Asym lgroove
## Kama 14.40959 14.34306 0.8784694 5.539224 3.242184 2.841982 5.117714
## Rosa 18.44157 16.17196 0.8847667 6.157922 3.695314 3.639667 6.025902
## Canadian 11.77851 13.20085 0.8486149 5.215064 2.839617 4.952319 5.107383
##
## Coefficients of linear discriminants:
## LD1 LD2
## area 0.16603150 4.6227939
## peri 3.29964578 -9.2444535
## comp 14.49254194 -91.4585465
## l -6.44409488 -8.0380276
## w -2.73151849 0.5525972
## Asym -0.03354983 0.3187115
## lgroove 3.14388132 6.8455566
##
## Proportion of trace:
## LD1 LD2
## 0.6862 0.3138
This means that the first discriminant function is the linear combination \[0.166*Area + 3.300*Perimeter + ... + 3.144*Groove\] and the second is \[4.623*Area - 9.244*Perimeter + ... + 6.846*Groove\]
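To make the mapping from coefficients to scores concrete, here is a minimal sketch that reconstructs the LD scores by hand, assuming MASS’s convention of centering the data at the prior-weighted average of the group means before projecting:
# prior-weighted grand mean of the predictors
ctr = colSums(lseeds$prior * lseeds$means)
# center the training predictors and project onto the coefficient matrix
scores = scale(as.matrix(train[, 1:7]), center = ctr, scale = FALSE) %*% lseeds$scaling
head(scores)  # should agree with predict(lseeds)$x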
The lda function in R doesn’t automatically standardize the coefficients in each function. We can see what the standardized functions look like using the candisc function:
# multivariate regression of the predictors on group; candisc then
# extracts the canonical discriminant analysis for the group term
x = lm(cbind(area, peri, comp, l, w, Asym, lgroove) ~ group, train)
y = candisc(x, term = "group")
summary(y)
##
## Canonical Discriminant Analysis for group:
##
## CanRsq Eigenvalue Difference Percent Cumulative
## 1 0.86771 6.5592 3.5591 68.615 68.615
## 2 0.75001 3.0002 3.5591 31.385 100.000
##
## Class means:
##
## Can1 Can2
## Kama -1.0297 -2.32227
## Rosa 3.3695 0.58218
## Canadian -2.5828 1.78936
##
## std coefficients:
## Can1 Can2
## area 0.197043 5.486233
## peri 1.767951 -4.953181
## comp 0.249791 -1.576367
## l -1.410953 -1.759949
## w -0.473699 0.095831
## Asym -0.041526 0.394480
## lgroove 0.713234 1.553012
This makes our first standardized function \[0.197*Area + 1.768*Perimeter + ... + 0.713*Groove\] and the second \[5.486*Area - 4.953*Perimeter + ... + 1.553*Groove\]
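As a cross-check, here is a minimal sketch of how the standardized coefficients relate to the raw ones, assuming the usual convention of multiplying each raw coefficient by the pooled within-group standard deviation of its variable:
# pooled within-group covariance matrix of the predictors
n = nrow(train); g = nlevels(train$group)
W = Reduce(`+`, lapply(split(train[, 1:7], train$group),
                       function(d) cov(d) * (nrow(d) - 1))) / (n - g)
# scale each row of the raw coefficients by that variable's pooled SD
lseeds$scaling * sqrt(diag(W))  # compare to candisc's std coefficients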
Now, with our model trained on the training data, we can predict classifications for the test data:
lseeds.values = predict(lseeds, test[,1:7])
This function will attempt to classify each of these observations, based solely on their quantitative attributes.
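The object returned by predict contains, among other things, the predicted classes in lseeds.values$class (used below for the confusion matrix) and the discriminant scores in lseeds.values$x:
# peek at the predicted classes and the LD1/LD2 scores
head(lseeds.values$class)
head(lseeds.values$x)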
plot.data = data.frame(LD1=lseeds.values$x[,1], LD2=lseeds.values$x[,2], WheatType=test$group)
head(plot.data)
## LD1 LD2 WheatType
## 2 -0.95512934 -4.243118 Kama
## 3 -1.07492098 -2.963890 Kama
## 4 -2.15775577 -3.307814 Kama
## 7 -0.38321769 -1.766544 Kama
## 17 -1.03053264 -1.248880 Kama
## 18 0.07430078 -3.303365 Kama
p <- ggplot(data=plot.data, aes(x=LD1, y=LD2)) +
geom_point(aes(color=WheatType)) +
theme_bw()
p
We can easily see the separation of wheat types from the scatterplot. The first discriminant function (LD1) separates Rosa from the other types very well, but does not perfectly separate Kama and Canadian. The second discriminant function (LD2) shows a good separation of Kama and Canadian, a fair separation of Rosa and Kama, and very little separation of Rosa and Canadian. In order to have the best separation of the three types of wheat, we should use both the first discriminant function and the second discriminant function based on these findings.
Now we can check the error rate of our classification test:
error_tab = table(test$group,lseeds.values$class)
Actual \ Predicted | Kama | Rosa | Canadian |
---|---|---|---|
Kama | 20 | 0 | 1 |
Rosa | 0 | 19 | 0 |
Canadian | 0 | 0 | 23 |
As we see from the table, only 1 of the 63 observations in our test data was incorrectly classified (the one ‘Kama’ seed that was classified as ‘Canadian’). This gives an estimated prediction error rate of 1.59%, which is very low. This gives us further reason to believe that LDA works well for our classification task.
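For completeness, the error rate can be computed directly from the confusion matrix:
# estimated error rate: off-diagonal share of the confusion matrix
1 - sum(diag(error_tab)) / sum(error_tab)  # 1/63, about 1.59%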
Now, we perform our analysis using STATA. First, we read in the Seeds data set, using “import delimited.” The dataset is already processed, so there is no additional step to perform there. We then obtain some summary statistics with “summarize” to get an initial understanding of the data with which we will be working:
// import the seeds data from the local directory and summarize
import delimited seeds.csv
summarize area peri comp l w asym lgroove
Next, we create a matrix of graphs of correlations using “graph matrix” to see correlations between the potential explanatory variables to gain an initial understanding of how they are related:
// graph correlations between the variables of interest
graph matrix area peri comp l w asym lgroove
We see some strong correlations among the predictors, as several pairs of explanatory variables have a reasonably strong, approximately linear relationship.
We should note that the idea of the linear discriminant analysis (LDA) in this analysis is primarily classification. We are hoping to find linear combinations of the 7 potential predictors in the data set that yield the best possible separation among seeds based on their wheat variety. In other words, we want the best separation such that each group of seeds can be identified based on their other characteristics.
Note that we want to separate the seeds by their wheat variety. There are three possible wheat varieties, so the number of groups here is 3, and the number of possible predictors is 7. Since the maximum number of significant discriminant functions that we can have is the minimum of “number of groups minus one” and “number of predictors”, there are at most two discriminant functions that can be used to classify the seeds in this analysis.
To perform an LDA analysis, it is necessary to split the data into a training set and a test set such that the performance of the LDA functions can be measured. To do this, we select random entries from the data set such that 70% of the data can be used for training and the remaining 30% can be used for testing. We use “generate random” and “sort random” followed by “generate trainsamp” to obtain random indices in order to create a training set composed of 70% of the data, saving the remaining 30% separately into a test set:
// separate into training and test set
set seed 123
generate random = runiform()
sort random
generate trainsamp = _n <= 147
// save test set for later
preserve
keep if !trainsamp
save test.dta, replace
restore
// keep training
keep if trainsamp
We then use the “candisc” function on our training set to perform our LDA analysis:
// perform LDA analysis
candisc area peri comp l w asym lgroove, group(group)
As seen above, STATA provides a considerable amount of output for the LDA from the function “candisc.” For our purposes here, the tables we are most concerned with are the “Canonical Linear Discriminant Analysis” table and the “Standardized canonical discriminant function coefficients” table. The “Canonical Linear Discriminant Analysis” table indicates how many discriminant functions are necessary here, and the “Standardized canonical discriminant function coefficients” table yields the (standardized) coefficient estimates for our LDA functions. We see from the first table that there are two discriminant dimensions, and the F-ratio tests show that both are significant. This indicates that both dimensions are needed to describe the differences between the seeds.
From the next table, then, our two discriminant functions, estimated from the training data, are: \[-0.494*Area + 1.997*Perimeter + ... + 0.721*Groove\] and \[4.888*Area - 4.471*Perimeter + ... + 1.598*Groove\]
The other tables in the “candisc” output yield means and classification percentages (i.e. the proportion classified correctly) for the training data.
Now, to visualize the separation, we use the function “scoreplot.” This function displays the test data, with the first discriminant function on the x-axis and the second discriminant function on the y-axis. The observations are then labeled according to seed type. To make the output more readable, we use the “label” function to relabel the observations such that they are only displayed as the first letter of the seed variety (i.e. “C” for Canadian). We then have:
// change labels for future plotting
label define lab2 1 K 2 R 3 C
label values group lab2
// generate plot to show the separation
scoreplot, msymbol(i)
We see that the first discriminant function appears to separate Rosa seeds from the other types well, though it does not seem to separate Canadian and Kama seeds very much.
The second discriminant function appears to separate the Kama seeds fairly well from the others, but Rosa and Canadian seeds are not well-separated. The plot suggests that both discriminant functions are necessary here to differentiate between the seed types.
Finally, we apply our estimated functions from the training data to the test data to get an idea of the prediction error that comes from this LDA model. The code below yields the number of observations in the test set that were misclassified. We use the “predict” function, which applies our LDA model to the test.dta data set to predict seed variety. Then, we count the number of seeds that were misclassified using “count if”:
// calculate prediction error by finding incorrect predictions for test set
use test.dta, clear
predict outcome, classification
label values outcome labseed
count if outcome != group
Running the code above yields a result of 3, indicating that of the 63 observations in the test set, only 3 were incorrectly predicted by our trained LDA model. As such, the prediction error here is only 4.76%. Hence, the LDA approach appears to work well for this data, which is supported not only by the low prediction error, but also by the plot, which shows fairly clear separation for the seed varieties.
Finally, we perform our analysis using SAS. We use the proc format statement to define a format for the grouping variable and import the data into SAS:
/* file name */
filename seeds '~/stat506/project/seeds.csv';
/* format for grouping variable */
proc format;
value group_f 1 = 'K'
2 = 'R'
3 = 'C';
run;
/* import data */
data seeds_original;
infile seeds delimiter = ' ' MISSOVER firstobs=2;
input area peri comp l w asym lgroove group;
format group group_f.;
run;
Next, we use the proc means statement to summarize the whole dataset:
/* summary of the dataset */
proc means data=seeds_original n mean std min max;
var area peri comp l w asym lgroove;
run;
There are 210 observations and 7 predictors in the whole dataset.
We then use the proc means statement again to summarize the dataset by group:
/* summary of the dataset by group */
proc means data=seeds_original n mean std;
class group;
var area peri comp l w asym lgroove;
run;
K represents Kama; R represents Rosa; C represents Canadian. The resulting table shows the mean of each predictor by group, and we can see that several variables show clear differences in means across the groups.
Next, we use the proc corr statement to produce a correlation table:
/* correlation table */
proc corr data=seeds_original;
var area peri comp l w asym lgroove;
run;
According to the correlation table above, most predictors in this dataset are strongly correlated with one another.
We use the proc surveyselect statement to do a 70:30 train/test split, creating training and testing datasets. We use the training dataset to obtain the discriminant functions and then use the test set for classification, to evaluate the performance of the model:
/* train and test split by group */
proc surveyselect data=seeds_original rate=.3 outall out=seeds_select;
strata group;
run;
/* training set*/
data seeds_train;
set seeds_select;
where Selected = 0;
drop Selected SelectionProb SamplingWeight;
run;
/* testing set*/
data seeds_test;
set seeds_select;
where Selected = 1;
drop Selected SelectionProb SamplingWeight;
run;
We then fit the LDA model using the proc discrim statement:
/* analysis */
proc discrim data=seeds_train testdata=seeds_test testout=test_out out=discrim_out can;
class group;
var area peri comp l w asym lgroove;
run;
There are three groups and seven predictors, which indicates that the number of discriminant dimensions is \(\min(3-1,\ 7) = 2\). The two discriminant dimensions are both statistically significant based on the table above (the p-values of the F-tests are very small). As a result, both dimensions are useful for describing the differences between the groups of seeds. The canonical correlations for the two dimensions are 0.93 and 0.87.
The three tables below provide more detail on the model fitted to the training dataset: the discriminant coefficients (raw and standardized) and the class means for each dimension.
The two discriminant functions are (based on the standardized coefficients):
\(discriminant_1 = 0.838*Area + 0.651*Perimeter + ... + 0.940*Groove\)
\(discriminant_2 = 4.310*Area - 4.142*Perimeter + ... + 1.360*Groove\)
The table below summarizes the classification of the testing dataset using the discriminant functions generated from the training set. There are 63 observations in this dataset. Based on this table, only two observations are classified incorrectly, so the testing error (3.17%) is very small and the accuracy is close to 1. The model predicts the testing data very well.
Finally, we build the classification plot with proc template and render it using the proc sgrender statement:
/* making plot */
data plotclass;
merge test_out discrim_out;
run;
proc template;
define statgraph classify;
begingraph;
layout overlay;
contourplotparm x=Can1 y=Can2 z=_into_ / contourtype=fill
nhint = 30 gridded = false;
scatterplot x=Can1 y=Can2 / group=group includemissinggroup=false
markercharactergroup = group;
endlayout;
endgraph;
end;
run;
proc sgrender data = plotclass template = classify;
title "Discriminant function score";
run;
From the plot above, we can see that the two discriminant dimensions classify the three varieties of wheat very well: “can1” separates Rosa from the other two groups well, and “can2” separates Canadian and Kama well.
In all three languages, even with the differences in training and test sets, we saw low estimated prediction errors. For STATA, only 3 of 63 seeds in the test set were misclassified; for SAS, only 2 of 63; and for R, only 1. It should also be noted that in all three languages, the first discriminant function appeared to separate Rosa seeds from the other two varieties well, but did not separate the Canadian and Kama seeds. On the other hand, the second LDA function appeared to separate Canadian and Kama seeds well, but not Canadian and Rosa seeds. In any case, it appears that LDA performed quite well at classifying the seeds.