The purpose of this tutorial is to demonstrate multinomial logistic regression in R (multinom), Stata (mlogit), and SAS (proc logistic).
The following is a brief summary of multinomial logistic regression (all vs. reference). One way to implement a multi-category logistic regression model is to run K-1 independent binary logistic regressions over all K possible outcomes. One of the categories is designated the reference (base) category, and each of the other K-1 categories is regressed against it in turn. If we choose outcome K as the reference category, we obtain the following formulas:
\[ln\frac{Pr(Y_{i} = 1)}{Pr(Y_{i} = K)} = \beta_{1}\cdot X_{i}\] \[ln\frac{Pr(Y_{i} = 2)}{Pr(Y_{i} = K)} = \beta_{2}\cdot X_{i}\] \[\cdots\] \[ln\frac{Pr(Y_{i} = K-1)}{Pr(Y_{i} = K)} = \beta_{K-1}\cdot X_{i}\] Note that we have introduced a separate vector of regression coefficients for each possible outcome. Exponentiating both sides of each formula gives: \[Pr(Y_{i} = 1) = Pr(Y_{i} = K) \cdot e^{\beta_{1}\cdot X_{i}}\] \[Pr(Y_{i} = 2) = Pr(Y_{i} = K) \cdot e^{\beta_{2}\cdot X_{i}}\] \[\cdots\] \[Pr(Y_{i} = K-1) = Pr(Y_{i} = K) \cdot e^{\beta_{K-1}\cdot X_{i}}\] Since the probabilities must sum to 1, we get: \[Pr(Y_{i} = K) = 1 - \sum^{K-1}_{k=1}Pr(Y_{i} = K)\cdot e^{\beta_{k}\cdot X_{i}}\] \[Pr(Y_{i} = K) = \frac{1}{1 + \sum^{K-1}_{k=1}e^{\beta_{k}\cdot X_{i}}}\] We can use this to find the other probabilities: \[Pr(Y_{i} = 1) = \frac{e^{\beta_{1}\cdot X_{i}}}{1 + \sum^{K-1}_{k=1}e^{\beta_{k}\cdot X_{i}}}\] \[Pr(Y_{i} = 2) = \frac{e^{\beta_{2}\cdot X_{i}}}{1 + \sum^{K-1}_{k=1}e^{\beta_{k}\cdot X_{i}}}\] \[\cdots\] \[Pr(Y_{i} = K-1) = \frac{e^{\beta_{K-1}\cdot X_{i}}}{1 + \sum^{K-1}_{k=1}e^{\beta_{k}\cdot X_{i}}}\] \[Pr(Y_{i} = K) = \frac{1}{1 + \sum^{K-1}_{k=1}e^{\beta_{k}\cdot X_{i}}}\]
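The probability formulas above can be sketched numerically. The following is a minimal sketch, with illustrative (not fitted) coefficient vectors: given K-1 coefficient vectors and a feature vector, it returns the K class probabilities, with the reference class K handled implicitly.

```python
import math

def multinomial_probs(betas, x):
    """Probabilities for K classes given K-1 coefficient vectors.

    betas: list of K-1 coefficient vectors (one per non-reference class);
    the K-th (reference) class has an implicit coefficient vector of zeros.
    x: feature vector of the same length as each coefficient vector.
    """
    # Linear predictors beta_k . x for the K-1 non-reference classes
    scores = [sum(b * xi for b, xi in zip(beta, x)) for beta in betas]
    denom = 1.0 + sum(math.exp(s) for s in scores)  # 1 + sum_k e^{beta_k . x}
    probs = [math.exp(s) / denom for s in scores]   # Pr(Y = k), k = 1..K-1
    probs.append(1.0 / denom)                       # Pr(Y = K), reference class
    return probs

# Illustrative coefficients: K = 3 classes, 2 features
p = multinomial_probs([[0.5, -1.0], [1.2, 0.3]], [1.0, 2.0])
```

By construction the returned probabilities are each between 0 and 1 and sum to 1, matching the derivation above.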
A multinomial logistic regression model is useful for classifying subjects of interest into several categories based on the values of predictor variables. Compared with binary logistic regression, it is more general, since the response variable is not restricted to only two categories.
In this tutorial, we work with the Iris flower data set, a multivariate data set introduced by Ronald Fisher in 1936 and therefore also referred to as the famous Fisher's Iris data set. The data set covers three species of Iris, with 50 samples per species. Each sample has four characteristics: the width and length of the sepals and petals, in centimeters. The iris data set is thus a 150-row, 5-column table, and it is commonly used for classification of iris flowers from their sepal and petal measurements. Our task is to build a classifier that uses the four characteristics of a sample to decide whether it belongs to the setosa, versicolor, or virginica species.
Below are pictures of three species of Iris flowers.
We can obtain the Iris data from the link in the references. The data set is also built into R, so simply loading it makes the data available. It contains 150 records under five attributes: petal length, petal width, sepal length, sepal width, and species. Descriptions of the variables are as follows.
Variables | Description | Data type
---|---|---
Sepal.Length | sepal length in centimeters | positive real number
Sepal.Width | sepal width in centimeters | positive real number
Petal.Length | petal length in centimeters | positive real number
Petal.Width | petal width in centimeters | positive real number
Species | species | categorical (setosa/versicolor/virginica)
Our task is to model the two equations below using R (multinom), Stata (mlogit), and SAS (proc logistic) with the Iris data set.
\[ln\frac{Pr(Species = setosa)}{Pr(Species = virginica)} = \beta_{10} + \beta_{11}\cdot Sepal.Length + \beta_{12}\cdot Sepal.Width + \beta_{13}\cdot Petal.Length + \beta_{14}\cdot Petal.Width\] \[ln\frac{Pr(Species = versicolor)}{Pr(Species = virginica)} = \beta_{20} + \beta_{21}\cdot Sepal.Length + \beta_{22}\cdot Sepal.Width + \beta_{23}\cdot Petal.Length + \beta_{24}\cdot Petal.Width\]
First, we load the iris data into R. The data can be downloaded from the link in the references, but the iris data set is already built into R, so simply loading it makes it available for use.
data(iris)
This is what the first 6 rows look like.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
We can get basic descriptives for the entire data set by using summary.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
We partition the data into two parts: a training set consisting of the first 40 observations for each species and a testing set of the remaining 10 observations for each species. We use the training set to build and estimate the model and the testing set to see how well the model does on new data not used in the construction of the model.
train_1 = iris[1:40,]; train_2 = iris[51:90,]; train_3 = iris[101:140,]
train = rbind(train_1,train_2); train = rbind(train, train_3)
test_1 = iris[41:50,]; test_2 = iris[91:100,]; test_3 = iris[141:150,]
test = rbind(test_1,test_2); test = rbind(test, test_3)
Below we use the multinom function from the nnet package to estimate a multinomial logistic regression model. Before running the model, we choose the level of the outcome that we wish to use as the baseline and specify it with the relevel function; we pick "virginica" as our baseline. Then we run the model using multinom. The nnet package does not report p-values for the regression coefficients, so we calculate p-values ourselves using Wald tests (here, z-tests).
library(nnet)
train$species2 = relevel(train$Species, ref = "virginica")
model = multinom(species2 ~ Sepal.Length + Sepal.Width +
Petal.Length + Petal.Width , data = train)
## # weights: 18 (10 variable)
## initial value 131.833475
## iter 10 value 10.705803
## iter 20 value 6.137992
## iter 30 value 5.954540
## iter 40 value 5.942417
## iter 50 value 5.938196
## iter 60 value 5.929857
## iter 70 value 5.925361
## iter 80 value 5.924819
## final value 5.923148
## converged
summary(model)
## Call:
## multinom(formula = species2 ~ Sepal.Length + Sepal.Width + Petal.Length +
## Petal.Width, data = train)
##
## Coefficients:
## (Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa 5.28918 10.978504 16.793690 -23.702171 -18.17830
## versicolor 41.55886 2.416804 6.591888 -9.215024 -17.93605
##
## Std. Errors:
## (Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa 18.62301 18.505809 57.575716 75.538379 40.361463
## versicolor 25.43734 2.385384 4.460913 4.693846 9.756292
##
## Residual Deviance: 11.8463
## AIC: 31.8463
z = summary(model)$coefficients/summary(model)$standard.errors
z
## (Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa 0.2840132 0.5932464 0.2916801 -0.3137765 -0.4503876
## versicolor 1.6337740 1.0131715 1.4776994 -1.9632137 -1.8384089
p = (1 - pnorm(abs(z), 0, 1)) * 2
p
## (Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa 0.7764003 0.5530163 0.7705312 0.75369076 0.65243097
## versicolor 0.1023063 0.3109783 0.1394882 0.04962134 0.06600218
We first see that some output is generated by running the model, even though we assign the model to a new R object. This model-running output includes the iteration history and the final negative log-likelihood, 5.923148. This value multiplied by two appears in the model summary as the Residual Deviance.
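As a quick consistency check on those reported values (plain arithmetic, no library needed):

```python
# The residual deviance reported in the summary is twice the final
# negative log-likelihood from the iteration history, and the AIC adds
# twice the number of estimated parameters (10 here, per "# weights: 18
# (10 variable)" in the output).
neg_log_lik = 5.923148            # "final value" from the output above
deviance = 2 * neg_log_lik        # Residual Deviance
aic = deviance + 2 * 10           # AIC
print(round(deviance, 4), round(aic, 4))  # 11.8463 31.8463
```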
The model summary output has a block of coefficients and a block of standard errors. Each block has one row of values per model equation. Two models are fit in this multinomial regression: one comparing membership in setosa versus virginica, and one comparing membership in versicolor versus virginica. They correspond to the two equations below:
\[ \begin{split} ln\frac{Pr(Species = setosa)}{Pr(Species = virginica)} = & 5.28918 +10.978504\times Sepal.Length + 16.793690\times Sepal.Width\\ & -23.702171 \times Petal.Length -18.17830\times Petal.Width \end{split} \] \[ \begin{split} ln\frac{Pr(Species = versicolor)}{Pr(Species = virginica)} = & 41.55886 + 2.416804\times Sepal.Length + 6.591888\times Sepal.Width\\ & -9.215024\times Petal.Length -17.93605\times Petal.Width \end{split} \]
From the p-values, we see that none of the coefficients of the first model (setosa vs. virginica) is significant, while some of the coefficients of the second model (versicolor vs. virginica) are significant.
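The Wald z-test used for these p-values can be reproduced from first principles. A minimal sketch, using the versicolor Petal.Length coefficient and standard error from the summary above:

```python
import math

def wald_p_value(coef, se):
    """Two-sided p-value for a Wald z-test of H0: coefficient = 0."""
    z = coef / se
    # Standard normal CDF evaluated via the error function
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * (1 - phi)

# Versicolor equation, Petal.Length (values taken from the output above)
p = wald_p_value(-9.215024, 4.693846)
print(round(p, 4))  # close to the 0.04962134 reported by R
```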
Now we can test our model on the test set. We compute the accuracy rate, and the result is 1. Good performance!
## Test the accuracy of the model on the test set
library(data.table)
# Predicted probabilities for each species from the model
a = predict(model, newdata = test, "probs")
a = data.table(a)
# Predicted species: the one with the highest predicted probability
b = rep(1, 30)                                   # 1 = setosa
b[which(a$versicolor == apply(a, 1, max))] = 2   # 2 = versicolor
b[which(a$virginica == apply(a, 1, max))] = 3    # 3 = virginica
# True species labels of the test observations
c = c(rep(1, 10), rep(2, 10), rep(3, 10))
# Accuracy rate
accuracy = sum(b == c)/30
accuracy
## [1] 1
Multinomial logistic regression algorithm description: https://en.wikipedia.org/wiki/Multinomial_logistic_regression
Iris data description: https://en.wikipedia.org/wiki/Iris_flower_data_set
Pictures of iris: http://www.lac.inpe.br/~rafael.santos/Docs/R/CAP394/WholeStory-Iris.html
Multinomial logistic regression in R: https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/
Multinomial logistic regression in SAS: https://stats.idre.ucla.edu/sas/dae/multinomiallogistic-regression/
Multinomial logistic regression in Stata: https://stats.idre.ucla.edu/stata/dae/multinomiallogistic-regression/
We use proc import to import the dataset.
FILENAME REFFILE '/Iris.csv';
PROC IMPORT DATAFILE=REFFILE
DBMS=CSV
OUT=work.import;
GETNAMES=YES;
RUN;
We get basic descriptive statistics for the data set, by species, using proc summary.
proc summary data=work.import min q1 mean q3 max std n;
var Sepal_Length Sepal_Width Petal_Length Petal_Width;
by Species;
output out = iris_summary;
proc print data = iris_summary;
title 'Summary of Iris data set';
run;
We separate the data into two parts: a training set consisting of the first 40 observations for each species and a testing set of the remaining 10 observations for each species. We use the training set to build and estimate the model and the testing set to see how well the model does on new data not used in the construction of the model.
data train;
set work.import;
if _N_ in (1:40, 51:90, 101:140) then output;
run;
data test;
set work.import;
if _N_ in (41:50, 91:100, 141:150) then output;
run;
Below we use proc logistic to estimate a multinomial logistic regression model. We specify the baseline category for species with (ref = "virginic"); note that in this import the species names are stored truncated to eight characters ("versicol", "virginic"). The link = glogit option requests the generalized logit model, giving the detailed fitting results. Finally, we use the score statement to predict the species of the test set.
proc logistic data = train;
class species (ref = "virginic");
model species = Sepal_Length Sepal_Width Petal_Length Petal_Width / link = glogit;
score data=test out=valpred;
title 'Multinomial Logistic Regression Model';
run;
In the output above, the likelihood ratio chi-square of 251.8012 with a p-value < 0.0001 tells us that our model as a whole fits significantly better than an empty model (i.e., a model with no predictors).
Two models are fit in this multinomial regression: one comparing membership in setosa versus virginica, and one comparing membership in versicolor versus virginica. They correspond to the two equations below:
\[ \begin{split} ln\frac{Pr(Species = setosa)}{Pr(Species = virginica)} = & 35.8688 +2.3092\times Sepal.Length + 13.1311\times Sepal.Width\\ &- 12.5865\times Petal.Length - 23.8865\times Petal.Width \end{split} \] \[ \begin{split} ln\frac{Pr(Species = versicolor)}{Pr(Species = virginica)} = & 41.1253 +2.4109\times Sepal.Length + 6.5280\times Sepal.Width\\ &- 9.1397\times Petal.Length - 17.7683\times Petal.Width \end{split} \]
Based on the fitted model, we can now predict each species' probability for the testing data set. Each predicted probability is between 0 and 1, so we take the species with the highest probability as the predicted species.
Comparing with the true species, we compute an accuracy rate of 1. Good performance!
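The classify-by-highest-probability step that each implementation performs can be sketched as follows, using made-up probability rows rather than the actual SAS output:

```python
# Toy predicted-probability rows (setosa, versicolor, virginica);
# illustrative values only, not actual model output.
probs = [
    [0.97, 0.02, 0.01],
    [0.10, 0.85, 0.05],
    [0.01, 0.30, 0.69],
]
true_labels = [0, 1, 2]  # 0 = setosa, 1 = versicolor, 2 = virginica

# Predicted species = index of the highest probability in each row
pred = [row.index(max(row)) for row in probs]
# Accuracy rate = fraction of predictions matching the true labels
accuracy = sum(p == t for p, t in zip(pred, true_labels)) / len(true_labels)
```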
/*Test the accuracy of our model on the test set*/
data prediction;
set valpred;
species_pre = 0;
species_ori = 0;
accuracy = 1;
/*Label the original species as 1 (setosa), 2 (versicolor), 3 (virginica)
and save that label in a new variable species_ori*/
if species = 'setosa' then species_ori=1;
if species = 'versicol' then species_ori=2;
if species = 'virginic' then species_ori=3;
/*Choose the species with the highest predicted probability as the final result*/
if P_setosa > P_versicol and P_setosa > P_virginic then species_pre=1;
if P_versicol > P_setosa and P_versicol > P_virginic then species_pre=2;
if P_virginic > P_setosa and P_virginic > P_versicol then species_pre=3;
/*Compare the original species with the predicted species and keep only
the observations where they agree*/
if species_ori ne species_pre then delete;
keep species accuracy;
run;
/*Compute the accuracy rate*/
proc summary data=prediction;
output out=accuracy
sum(accuracy) = accuracy;
run;
data final;
set accuracy;
accuracy = accuracy/30;
run;
proc print data=final;
title 'Test the Accuracy by Test Set';
run;
We start by loading the data and then display the first 5 rows.
clear all
* Loading the data
import delimited iris.csv
list in 1/5
Then we get basic descriptive statistics for both the predictors and the response.
* Description of the data
summarize sepal_length sepal_width petal_length petal_width
tab species
We partition the data into two parts: a training set consisting of the first 40 observations for each species and a testing set of the remaining 10 observations for each species. We use the training set to build and estimate the model and the testing set to see how well the model does on new data not used in the construction of the model.
* Separate the data into a training set and a test set
generate seqnum = _n
generate training = 1 if seqnum <= 40 | (seqnum >= 51 & seqnum <= 90) | (seqnum >= 101 & seqnum <= 140)
replace training = 0 if training == .
Below we use the mlogit command to estimate a multinomial logistic regression model. We pick virginica as the baseline category and then fit the model.
* Pick virginica as the baseline category (base(3))
* mlogit requires a numeric response, so encode the string variable
encode species, generate(new_species)
mlogit new_species sepal_length sepal_width petal_length petal_width if training == 1, base(3)
From the output, the likelihood ratio chi-square is 251.82 with a p-value < 0.0001, which means our model fits significantly better than a model with no predictors.
From the regression coefficients, we see that petal length and petal width are negatively related to the relative log odds in both equations, while sepal length and sepal width are positively related in both. For example, a one-unit increase in sepal_length is associated with a 2.46 increase in the relative log odds of being setosa vs. virginica.
The relative risk is defined as the ratio of the probability of one outcome category to the probability of the baseline category. In this case, the relative risk for setosa vs. virginica is the ratio of the probability of setosa to the probability of virginica. We can use the mlogit, rrr command to redisplay the regression results as relative risk ratios.
mlogit, rrr
In fact, the relative risk ratios in the table can be obtained by exponentiating the regression coefficients above. For instance, the relative risk ratio for a one-unit increase in sepal_length is 11.698 in setosa vs. virginica, which can be obtained as exp(2.459417) from the coefficient output.
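This exponentiation can be verified directly (using the coefficient value from the regression output above):

```python
import math

# Relative risk ratio for a one-unit increase in sepal_length
# (setosa vs. virginica) = exp(regression coefficient)
coef = 2.459417
rrr = math.exp(coef)
print(round(rrr, 3))  # approximately 11.698, matching the rrr table
```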
We test our model using the testing data set. We predict the probability of each species for each test observation based on the fitted model, and take the species with the highest probability as the predicted species. Comparing the predicted species with the original ones, we find no difference between them. Good performance!
* Get predictions on the test data set
keep if training == 0
predict setosa, equation(setosa)
predict versicolor, equation(versicolor)
predict virginica, equation(virginica)
* Encode the original species as numbers
gen species_ori = 1 if species == "setosa"
replace species_ori = 2 if species == "versicolor"
replace species_ori = 3 if species == "virginica"
* Take the category with the highest probability as the predicted species
gen species_pre = 1 if setosa > versicolor & setosa > virginica
replace species_pre = 2 if versicolor > setosa & versicolor > virginica
replace species_pre = 3 if virginica > setosa & virginica > versicolor
* Count disagreements between prediction and the original category
count if species_pre != species_ori
display 1 - r(N)/_N
The results are summarized below. The coefficients from the three implementations differ. Interestingly, the coefficients in the model comparing setosa with virginica are quite different across implementations, while the coefficients in the model comparing versicolor with virginica are quite similar, up to small errors.
Setosa

 | intercept | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width
---|---|---|---|---|---
R | 5.29 | 10.98 | 16.79 | -23.70 | -18.18
SAS | 35.87 | 2.31 | 13.13 | -12.59 | -23.89
Stata | 32.23 | 2.46 | 22.67 | -19.65 | -32.76

Versicolor

 | intercept | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width
---|---|---|---|---|---
R | 41.56 | 2.42 | 6.59 | -9.22 | -17.94
SAS | 41.13 | 2.41 | 6.53 | -9.14 | -17.77
Stata | 41.79 | 2.41 | 6.61 | -9.25 | -17.99
This may be because none of the coefficients in the setosa vs. virginica model is significant at all, while some of the coefficients in the versicolor vs. virginica model are significant. It suggests we might combine setosa and virginica into one category, in which case the model reduces to a binomial logistic regression model.
p-values calculated in R

 | intercept | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width
---|---|---|---|---|---
Setosa | 0.7764003 | 0.5530163 | 0.7705312 | 0.75369076 | 0.65243097
Versicolor | 0.1023063 | 0.3109783 | 0.1394882 | 0.04962134 | 0.06600218

p-values calculated in SAS

 | intercept | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width
---|---|---|---|---|---
Setosa | 0.8338 | 0.9582 | 0.8020 | 0.8001 | 0.7977
Versicolor | 0.1014 | 0.3102 | 0.1399 | 0.0486 | 0.0655

p-values calculated in Stata

 | intercept | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width
---|---|---|---|---|---
Setosa | 1.000 | 1.000 | 1.000 | 1.000 | 1.000
Versicolor | 0.312 | 0.140 | 0.050 | 0.066 | 0.102
Finally, we test the model in all three languages using the same test data set and find that the accuracy rate is 1, which means our model performs well.