Linear Discriminant Analysis

Introduction

What is Linear Discriminant Analysis?

Linear Discriminant Analysis (LDA) is a widely used technique with applications in decision making, classification, and dimension reduction. The main goal of LDA is to find discriminant functions, linear combinations of the independent variables that separate two or more categories of the dependent variable as well as possible. LDA lets us examine whether significant differences exist among the groups in terms of the independent variables, and it also allows us to evaluate how accurate the resulting classification is.
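Concretely, each discriminant function is a weighted sum of the \(p\) independent variables, where the weights (the discriminant coefficients) are estimated from the data:

\[d = b_1 x_1 + b_2 x_2 + \dots + b_p x_p\]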

When to Use LDA

Generally speaking, LDA is intended for data sets in which there is a categorical response variable, which allows us to group observations into categories based on the values of their independent variables. LDA is applicable for classification whenever the categorical response defines two or more groups. It should also be noted that LDA assumes the independent variables are multivariate normally distributed within each group and that the groups share a common variance-covariance matrix.

Dataset: Geometric Properties of Wheat Seeds

For our data analysis, we will be using the “Seeds” dataset, found in UC Irvine’s Machine Learning Repository. This dataset includes observations from 3 varieties of wheat: Kama, Rosa, and Canadian. Each variety contributes 70 randomly selected observations, for a total of 210. Aside from the grouping variable, each observation has 7 geometric measurements:

Number   Variable Name   Variable Description
1        area            Area
2        peri            Perimeter
3        comp            Compactness
4        l               Length of kernel
5        w               Width of kernel
6        asym            Asymmetry coefficient
7        lgroove         Length of kernel groove

It is reasonable to use this data set for LDA, first because there is a categorical response, namely the seed variety. In terms of normality, a cursory look at the independent variables indicated that their empirical distributions within each group were all roughly symmetric; while not perfectly normal, they are close enough that the normality assumption is defensible. We also choose to standardize the independent variables so that their variances are on the same scale, which makes the group variances close enough to be treated as equal. Hence, this data set appears to be reasonable for our use of LDA.

Note that the only major data processing change to the data set was to add column names. We did this manually in the csv itself such that each language uses the same names. The (slightly modified) csv can be obtained from: Github site

Analysis

For this analysis, we present three tutorials, each using a different statistical software. The languages presented here are R, STATA, and SAS.

For our actual tutorials, we will start by demonstrating how to load the Seeds data in each language and how to display some basic summary statistics. We will then split the data into a training and a test set through randomization, with a roughly 70-30 split. The training set will then be used to estimate discriminant functions using the independent variables from the Seeds data set. We can then use the discriminant functions from the training set to classify observations in the test set based on their values for the independent variables. This will allow us to estimate prediction error, as we will compare the predicted classes from the LDA model to the actual classes of observations in the test set.

Differences Across Languages

In our analysis, it is necessary to split the data into training and test sets; we do this to estimate discriminant functions from the training data and then apply these functions to the test data to obtain estimated prediction errors. Because the languages use different randomization mechanisms, the training and test sets differ in each tutorial. As such, the prediction errors and the estimated discriminant functions differ somewhat across languages, although the results are similar since the core task shown in each tutorial is the same.

Languages

R

There are a few key libraries to install and load for our analysis. To perform Linear Discriminant Analysis in R we will make use of the lda function in the package MASS. We will also use candisc for standardization, and ggplot2 and car for graphing purposes.

library(MASS)
library(candisc)
library(ggplot2)
library(car)

Description of Data

After reading in our dataset and converting the grouping variable to a labeled factor, we can view summaries of our variables of interest using the summary function.

seeds = read.csv("./seeds.csv", sep="\t", header=TRUE)
seeds$group = as.factor(seeds$group)
levels(seeds$group) = c("Kama", "Rosa", "Canadian")
summary(seeds)
##       area            peri            comp              l        
##  Min.   :10.59   Min.   :12.41   Min.   :0.8081   Min.   :4.899  
##  1st Qu.:12.27   1st Qu.:13.45   1st Qu.:0.8569   1st Qu.:5.262  
##  Median :14.36   Median :14.32   Median :0.8734   Median :5.524  
##  Mean   :14.85   Mean   :14.56   Mean   :0.8710   Mean   :5.629  
##  3rd Qu.:17.30   3rd Qu.:15.71   3rd Qu.:0.8878   3rd Qu.:5.980  
##  Max.   :21.18   Max.   :17.25   Max.   :0.9183   Max.   :6.675  
##        w              Asym           lgroove           group   
##  Min.   :2.630   Min.   :0.7651   Min.   :4.519   Kama    :70  
##  1st Qu.:2.944   1st Qu.:2.5615   1st Qu.:5.045   Rosa    :70  
##  Median :3.237   Median :3.5990   Median :5.223   Canadian:70  
##  Mean   :3.259   Mean   :3.7002   Mean   :5.408                
##  3rd Qu.:3.562   3rd Qu.:4.7687   3rd Qu.:5.877                
##  Max.   :4.033   Max.   :8.4560   Max.   :6.550
scatterplotMatrix(seeds[1:7])

We notice from the scatterplot matrix that some variables are collinear with each other, especially area and peri with l and w. This makes sense intuitively, as length and width determine area and perimeter. Normally this would be a problem, but since our goal is to use all of the given data for classification only, we can ignore this issue as long as we achieve high accuracy in our testing procedure.
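As a quick numeric check on this observation, we can also print the correlation matrix of the seven measurements (a minimal sketch; the output, omitted here, confirms the strong pairwise correlations):

round(cor(seeds[1:7]), 2)  # pairwise correlations of the 7 measurements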

We want to find the linear combinations of the variables that provide the best separation of the three kinds of seeds. To determine how many combinations are needed to best differentiate our classifications, we take the numbers of groups and variables into account: there are \(3\) groups and \(7\) variables. The greatest number of useful discriminant functions that can separate the wheat types by geometric properties is the lesser of “number of groups minus one” and “number of predictors”. As a result, we can expect to use 2 discriminant functions to classify our wheat types.
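In symbols, with \(g = 3\) groups and \(p = 7\) predictors, the number of discriminant functions is

\[\min(g - 1,\ p) = \min(2,\ 7) = 2.\]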

Train/Test Sequence

In order to use LDA, we first need to split the data into a part used to train the classifier and another part used to test it. For this example we will use a 70:30 split:

set.seed(123) 
seedss = sample.int(n = nrow(seeds), size = floor(.7*nrow(seeds)), replace = F)
train = seeds[seedss, ]
test  = seeds[-seedss, ]
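As a quick sanity check, the split sizes should be \(\lfloor 0.7 \times 210 \rfloor = 147\) training and 63 test observations:

nrow(train)  # 147
nrow(test)   # 63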

We are then able to train our classifier in the following way:

lseeds = lda(group~., train)
lseeds
## Call:
## lda(group ~ ., data = train)
## 
## Prior probabilities of groups:
##      Kama      Rosa  Canadian 
## 0.3333333 0.3469388 0.3197279 
## 
## Group means:
##              area     peri      comp        l        w     Asym  lgroove
## Kama     14.40959 14.34306 0.8784694 5.539224 3.242184 2.841982 5.117714
## Rosa     18.44157 16.17196 0.8847667 6.157922 3.695314 3.639667 6.025902
## Canadian 11.77851 13.20085 0.8486149 5.215064 2.839617 4.952319 5.107383
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## area     0.16603150   4.6227939
## peri     3.29964578  -9.2444535
## comp    14.49254194 -91.4585465
## l       -6.44409488  -8.0380276
## w       -2.73151849   0.5525972
## Asym    -0.03354983   0.3187115
## lgroove  3.14388132   6.8455566
## 
## Proportion of trace:
##    LD1    LD2 
## 0.6862 0.3138

This means that the first discriminant function is the linear combination \[0.166*Area + 3.300*Perimeter + \dots + 3.144*Groove\] and the second is \[4.623*Area - 9.244*Perimeter + \dots + 6.846*Groove\]

The lda function in R doesn’t automatically standardize the coefficients in each function. We can see what the standardized functions look like using the candisc function:

x=lm(cbind(area,peri,comp,l,w,Asym,lgroove)~group, train)
y=candisc(x, terms="Groups")
summary(y)
## 
## Canonical Discriminant Analysis for group:
## 
##    CanRsq Eigenvalue Difference Percent Cumulative
## 1 0.86771     6.5592     3.5591  68.615     68.615
## 2 0.75001     3.0002     3.5591  31.385    100.000
## 
## Class means:
## 
##             Can1     Can2
## Kama     -1.0297 -2.32227
## Rosa      3.3695  0.58218
## Canadian -2.5828  1.78936
## 
##  std coefficients:
##              Can1      Can2
## area     0.197043  5.486233
## peri     1.767951 -4.953181
## comp     0.249791 -1.576367
## l       -1.410953 -1.759949
## w       -0.473699  0.095831
## Asym    -0.041526  0.394480
## lgroove  0.713234  1.553012

This makes our first standardized function \[0.197*Area + 1.768*Perimeter + \dots + 0.713*Groove\] and the second \[5.486*Area - 4.953*Perimeter + \dots + 1.553*Groove\]

Now, with the model fit on the training data, we can predict the classifications of the test data:

lseeds.values = predict(lseeds, test[,1:7])

This function will attempt to classify each of these observations, based solely on their quantitative attributes.
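The returned object is a list: its class component holds the predicted wheat types, posterior holds the posterior probabilities for each group, and x holds the discriminant scores. For instance:

head(lseeds.values$class)      # predicted classes for the first test observations
head(lseeds.values$posterior)  # posterior probabilities by group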

plot.data = data.frame(LD1=lseeds.values$x[,1], LD2=lseeds.values$x[,2], WheatType=test$group)
head(plot.data)                       
##            LD1       LD2 WheatType
## 2  -0.95512934 -4.243118      Kama
## 3  -1.07492098 -2.963890      Kama
## 4  -2.15775577 -3.307814      Kama
## 7  -0.38321769 -1.766544      Kama
## 17 -1.03053264 -1.248880      Kama
## 18  0.07430078 -3.303365      Kama
p <- ggplot(data=plot.data, aes(x=LD1, y=LD2)) +
  geom_point(aes(color=WheatType)) +
  theme_bw()
p

We can easily see the separation of wheat types from the scatterplot. The first discriminant function (LD1) separates Rosa from the other types very well, but does not perfectly separate Kama and Canadian. The second discriminant function (LD2) shows a good separation of Kama and Canadian, a fair separation of Rosa and Kama, and very little separation of Rosa and Canadian. In order to have the best separation of the three types of wheat, we should use both the first discriminant function and the second discriminant function based on these findings.

Now we can check the error rate of our classification test:

error_tab = table(test$group,lseeds.values$class)
Table 1. Classifications of the test group. Each row is the true group; each column shows how the observations were classified by the model.

           Kama   Rosa   Canadian
Kama         20      0          1
Rosa          0     19          0
Canadian      0      0         23

As we see from the table, only 1 of the 63 observations in our test data was misclassified (the one ‘Kama’ seed that was classified as ‘Canadian’). This gives an estimated prediction error of 1.59%, which is very low and gives us further reason to believe that LDA is well suited to this classification task.
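The error rate can also be computed directly from the confusion matrix stored in error_tab, as the proportion of off-diagonal entries:

error_rate = 1 - sum(diag(error_tab)) / sum(error_tab)
error_rate  # 1/63, approximately 0.0159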

STATA

Description of Data

Now, we perform our analysis using STATA. First, we read in the Seeds data set, using “import delimited.” The dataset is already processed, so there is no additional step to perform there. We then obtain some summary statistics with “summarize” to get an initial understanding of the data with which we will be working:

// import the seeds data from local directory, relabel, and summarize
import delimited seeds.csv
summarize area peri comp l w asym lgroove

Next, we create a scatterplot matrix using “graph matrix” to gain an initial understanding of how the potential explanatory variables are related:

// graph correlations between the variables of interest
graph matrix area peri comp l w asym lgroove

We see some strong correlations among the predictors, as several pairs of explanatory variables display a reasonably strong, approximately linear relationship.

We should note that the primary aim of linear discriminant analysis (LDA) in this analysis is classification. We are hoping to find linear combinations of the 7 potential predictors in the data set that yield the best possible separation among seeds based on their wheat variety; in other words, we want separation good enough that each seed's variety can be identified from its other characteristics.

Note that we want to separate the seeds by their wheat variety. There are three possible wheat varieties, so the number of groups here is 3, and the number of possible predictors is 7. Since the maximum number of significant discriminant functions that we can have is the minimum of “number of groups minus one” and “number of predictors”, there are at most two discriminant functions that can be used to classify the seeds in this analysis.

Train/Test Sequence

To perform an LDA analysis, it is necessary to split the data into a training set and a test set such that the performance of the LDA functions can be measured. To do this, we select random entries from the data set such that 70% of the data can be used for training and the remaining 30% can be used for testing. We use “generate random” and “sort random” followed by “generate trainsamp” to obtain random indices in order to create a training set composed of 70% of the data, saving the remaining 30% separately into a test set:

// separate into training and test set
set seed 123
generate random = runiform()
sort random
generate trainsamp = _n <= 147

// save test set for later
preserve
keep if !trainsamp
save test.dta, replace
restore

// keep training 
keep if trainsamp

We then use the “candisc” function on our training set to perform our LDA analysis:

// perform LDA analysis
candisc area peri comp l w asym lgroove, group(group)

As seen above, STATA provides a considerable amount of output for the LDA analysis from “candisc.” For our purposes, the tables of most interest are the “Canonical Linear Discriminant Analysis” table and the “Standardized canonical discriminant function coefficients” table. The former indicates how many discriminant functions are necessary here, and the latter yields the (standardized) coefficient estimates for our LDA functions. We see from the first table that there are two discriminant dimensions, and the F-ratio tests indicate that both are significant. This suggests that both dimensions are needed to describe the differences between the seeds.

From the next table, our two discriminant functions, estimated from the training data, are: \[-0.494*Area + 1.997*Perimeter + \dots + 0.721*Groove\] and: \[4.888*Area - 4.471*Perimeter + \dots + 1.598*Groove\]

The other tables in the “candisc” output yield means and classification percentages (i.e. the proportion classified correctly) for the training data.

Now, to visualize the separation, we use the function “scoreplot.” This function displays the test data, with the first discriminant function on the x-axis and the second discriminant function on the y-axis. The observations are then labeled according to seed type. To make the output more readable, we use the “label” function to relabel the observations such that they are only displayed as the first letter of the seed variety (i.e. “C” for Canadian). We then have:

// change labels for future plotting
label define lab2 1 K 2 R 3 C 
label values group lab2
// generate plot to show the separation
scoreplot, msymbol(i)

We see that the first discriminant function appears to separate Rosa seeds from the other types well, though it does not seem to separate Canadian and Kama seeds very much. The second discriminant function appears to separate the Kama seeds fairly well from the others, but Rosa and Canadian seeds are not well-separated. The plot suggests that both discriminant functions are necessary to differentiate between the seed types.

Finally, we apply our estimated functions from the training data to the test data to get an idea of the prediction error that comes from this LDA model. The code below yields the number of observations in the test set that were misclassified. We use the “predict” function, which applies our LDA model to the test.dta data set to predict seed variety, and then count the number of misclassified seeds using “count if”:

// calculate prediction error by finding incorrect predictions for test set
use test.dta, clear
predict outcome, classification
// the comparison uses the underlying numeric codes, so value labels are not needed
count if outcome != group

Running the code above yields a result of 3, indicating that of the 63 observations in the test set, only 3 were incorrectly predicted by our trained LDA model. As such, the prediction error here is only 4.76%. Hence, the LDA approach appears to work well for this data, which is supported not only by the low prediction error but also by the plot, which shows fairly clear separation between the seed varieties.
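For reference, the error rate works out as the fraction of misclassified test observations:

\[\frac{3}{63} \approx 0.0476 = 4.76\%\]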

SAS

Description of Data

  • Create a format for “group” using proc format and import the data into SAS
/* file name */
filename seeds '~/stat506/project/seeds.csv';

/* format for grouping variable */
proc format;
    value group_f 1 = 'K' 
                  2 = 'R'
                  3 = 'C';
run;

/* import data */
data seeds_original;
    infile seeds delimiter='09'x MISSOVER firstobs=2; /* the file is tab-delimited; '09'x is the tab character */
    input area peri comp l w asym lgroove group;
    format group group_f.;
run;
  • Summary of the dataset (using proc means statement)
/* summary of the dataset*/
proc means data=seeds_original n mean std min max;
  var area peri comp l w asym lgroove;
run;

There are 210 observations and 7 predictors in the whole dataset.


  • Summary of dataset by group (using proc means statement)
/* summary of the dataset by group*/
proc means data=seeds_original n mean std;
  class group;
  var area peri comp l w asym lgroove;
run;

K represents Kama, R represents Rosa, and C represents Canadian. The resulting table shows the mean of each predictor by group; we can notice that several variables show clear differences in group means.


  • Correlation between predictors (using proc corr statement)
/* correlation table */
proc corr data=seeds_original;
  var area peri comp l w asym lgroove;
run;

According to the correlation table above, most predictors in this dataset are strongly correlated with one another.


Train/Test Sequence

  • Use the proc surveyselect statement to perform a 70:30 “train and test split,” creating training and testing datasets. We use the training dataset to estimate the discriminant functions and then use the test set to classify observations and assess the performance of the model (the resulting set sizes are worked out after the code below).
/* train and test split by group */
proc surveyselect data=seeds_original rate=.3 outall out=seeds_select;
    strata group;
run;
/* training set*/
data seeds_train;
    set seeds_select;
    where Selected = 0;
    drop Selected SelectionProb SamplingWeight;
run;
/* testing set*/
data seeds_test;
    set seeds_select;
    where Selected = 1;
    drop Selected SelectionProb SamplingWeight;
run;
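Because the selection is stratified by group with rate = 0.3, each variety contributes \(0.3 \times 70 = 21\) seeds to the test set, giving \(3 \times 21 = 63\) test observations and \(210 - 63 = 147\) training observations, matching the 70:30 split used in the other tutorials.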


Discriminant Analysis (using proc discrim statement)

/* analysis */
proc discrim data=seeds_train testdata=seeds_test testout=test_out out=discrim_out can;
  class group;
  var area peri comp l w asym lgroove;
run;

There are three groups and seven predictors, so the number of discriminant dimensions is \(\min(3-1, 7) = 2\). Both discriminant dimensions are statistically significant based on the table above (the p-values of the F-tests are very small). As a result, both dimensions are useful for describing the differences between the groups of seeds. The canonical correlations for the two dimensions are 0.93 and 0.87.


The three tables below provide more details of the model fit on the training dataset, including the discriminant coefficients (raw and standardized) and the class means for each dimension.

The two discriminant functions (based on the standardized coefficients) are:

\(discriminant_1 = 0.838*Area + 0.651*Perimeter + \dots + 0.940*Groove\)

\(discriminant_2 = 4.310*Area - 4.142*Perimeter + \dots + 1.360*Groove\)


The table below summarizes the classification of the test dataset using the discriminant functions generated from the training set. There are 63 observations in the test set. Only two are classified incorrectly, so the testing error (3.17%) is very small and the accuracy is close to 1. The model predicts the test data very well.


Plot the classification

  • Make the plot using the proc template and proc sgrender statements
/* making plot */
data plotclass;
  merge test_out discrim_out;
run;

proc template;
  define statgraph classify;
    begingraph;
      layout overlay;
        contourplotparm x=Can1 y=Can2 z=_into_ / contourtype=fill  
                         nhint = 30 gridded = false;
        scatterplot x=Can1 y=Can2 / group=group includemissinggroup=false
                            markercharactergroup = group;
      endlayout;
    endgraph;
  end;
run;

proc sgrender data = plotclass template = classify;
    title "Discriminant function score";
run;

From the plot above, we can see that the two discriminant dimensions classify the three wheat varieties very well: “Can1” separates Rosa from the other two groups well, and “Can2” separates Canadian and Rosa well.

Conclusions

In all three languages, even with the differences in training and test sets, we saw low estimated prediction errors. For STATA, only 3 of the 63 test seeds were misclassified; for SAS, only 2 of 63; and for R, only 1 of 63. It should also be noted that in all three languages, the first discriminant function appeared to separate Rosa seeds from the other two varieties well but did not separate the Canadian and Kama seeds, while the second discriminant function appeared to separate Canadian and Kama seeds well but not Canadian and Rosa seeds. In any case, LDA appears to have performed well at classifying the seeds.