Introduction to ordinal logistic regression

Ordinal logistic regression is applied when the dependent variable is measured at the ordinal level, given one or more independent variables, which could be ordinal, continuous and categorical. Usually, there are two assumptions that needs to be checked for the ordinal logistic regression, no multicollinearity and existence of proportional odds. By no multicollinearity assumption, the variables are not highly correlated with each other; the proportional odds mean that the independent variables have identical effects at each dependent variable levels.

Description of the Data

In this tutorial, we will use the “soup” data from the R package “Ordinal”. The dataset represents the ratings of sample soup from 185 respondents, including ‘sureness’ which is the respondents ratings of soup samples, and other descriptive variables, such as ‘souptype’, ‘cold’ and ‘soupfreq’ (‘souptype’, with three types canned, dry-mix and self-made; cold‘, which is a Yes/No variable where Yes indicates that respondent have a cold;’soupfreq’ with 3 levels, which is the frequency with which the respondent consumes soup). By fitting ordinal logistic regression model to the data, we desire to determine the factors that will significantly influence the soup rating, and also explore the marginal effects of the factors of interest. We will display the data manipulation and analysis in three different languages, R, STATA and SAS. A link to the data descripition could be found here.

Implementation

Ordinal Logistic Regression in SAS

Similar in R and STATA, the first step is to import the data into the workspace.

proc import datafile='/folders/myfolders/soup.csv'
out=work.soup
dbms=CSV;
run;

To view first ten rows of data.

proc print data=soup (obs=10);
run;

As we planned to fit the model based on the dependent variable ‘sureness’, it would be desired to explore the frequency and means of the variable ‘sureness’.

proc freq data=soup;
tables SURENESS;
run;

proc means data=soup;
var SURENESS;
run;

From the table above, it’s clear that ‘sureness’ has higher proportion in higher end, 5 or 6, which will results in a high mean value as 4.3768.

Next, we pick the factors of interest and check whether they will have potential effects on the dependent variable ‘sureness’.

proc freq data=soup;
tables SURENESS*SOUPTYPE / norow nocol missprint;
tables SURENESS*SOUPFREQ / norow nocol missprint;
tables SURENESS*COLD / norow nocol missprint;
run;

The tables above suggests that the binary variable ‘cold’ and categorical variable ‘souptype’ cause some difference in ‘sureness, while ’soupfreq’ does not show effects on ‘sureness’. Therefore, a model ‘sureness ~ souptype + cold’ is fitted to the data.

proc logistic data=soup desc;
class SOUPTYPE(ref='Self-made') COLD(ref='No') / param=reference;
model SURENESS= SOUPTYPE COLD ;
estimate "Pr prob SURENESS=2 at SOUPTYPE='Canned' " intercept 1  SOUPTYPE 1 COLD 1/ ilink category='2';
estimate "Pr prob SURENESS=2 at SOUPTYPE='Dry-mix' " intercept 1 SOUPTYPE 2 COLD 1/ ilink category='2';

run;

From the test above, because the p-value is smaller than 0.05, the null hypothesis can be rejected and the proportional odds assumption is valid for the model.

From the table above, the ‘souptype’ has p-value smaller than 0.05 while ‘cold’ does not, which indicates that only ‘souptype’ is a significant variable in explaining the variance. In terms of the coefficients, for example, when ‘souptype’ changes from ‘self-made’ to ‘canned’, the sureness will decrease by 0.2050, given that all of the other variables in the model are held constant.

Then table above shows that for sureness =2, the predicted probability for ‘dry-mix’ and ‘canned’ soup are 0.8582 and 0.8814 correspondingly. These numbers are different from those in R and STATA, because the order of variable ‘sureness’ is different, as indicated by the opposite sign of intercepts in SAS from those in R and STATA.

Ordinal Logistic Regression in R

The first step is to import the data into the workspace.

data(soup)
soup_test = data.table(soup)

Next we looked at the first ten rows of the data set.

soup_test[1:10,]
##     RESP PROD PRODID SURENESS DAY SOUPTYPE SOUPFREQ COLD EASY GENDER
##  1:    1  Ref      1        6   1   Canned  >1/week  Yes    7 Female
##  2:    1 Test      2        5   1   Canned  >1/week  Yes    7 Female
##  3:    1  Ref      1        5   1   Canned  >1/week  Yes    7 Female
##  4:    1 Test      3        6   1   Canned  >1/week  Yes    7 Female
##  5:    1  Ref      1        5   2   Canned  >1/week  Yes    7 Female
##  6:    1 Test      6        5   2   Canned  >1/week  Yes    7 Female
##  7:    1 Test      2        2   2   Canned  >1/week  Yes    7 Female
##  8:    1 Test      4        5   2   Canned  >1/week  Yes    7 Female
##  9:    1 Test      5        5   2   Canned  >1/week  Yes    7 Female
## 10:    1  Ref      1        2   2   Canned  >1/week  Yes    7 Female
##     AGEGROUP LOCATION
##  1:    51-65 Region 1
##  2:    51-65 Region 1
##  3:    51-65 Region 1
##  4:    51-65 Region 1
##  5:    51-65 Region 1
##  6:    51-65 Region 1
##  7:    51-65 Region 1
##  8:    51-65 Region 1
##  9:    51-65 Region 1
## 10:    51-65 Region 1

This data set has a six level variable called “sureness”, which is the respondents ratings of soup samples, that we will use as our outcome variable. It would be desired to explore the frequency and mean of it.

summary(soup_test$SURENESS)
##   1   2   3   4   5   6 
## 228 260 115  98 277 869
mean(as.numeric(soup_test$SURENESS))
## [1] 4.376827

From the output, it’s clear that ‘sureness’ has higher proportion in higher end, 5 or 6, which will results in a high mean value as 4.3768.

Then, we pick some factors of interest and check whether they will have potential effects on the dependent variable ‘sureness’. Let’s start with the descriptive statistics of these variables “souptype”, “cold” and “soupfreq”.

lapply(soup_test[,c("SOUPTYPE", "SOUPFREQ", "COLD", "SURENESS")], table)
## $SOUPTYPE
## 
## Self-made    Canned   Dry-mix 
##       958       589       300 
## 
## $SOUPFREQ
## 
##   >1/week 1-4/month  <1/month 
##       799       948       100 
## 
## $COLD
## 
##   No  Yes 
## 1608  239 
## 
## $SURENESS
## 
##   1   2   3   4   5   6 
## 228 260 115  98 277 869
ftable(xtabs(~ SOUPTYPE + SOUPFREQ + COLD + SURENESS, data = soup_test))
##                          SURENESS   1   2   3   4   5   6
## SOUPTYPE  SOUPFREQ  COLD                                 
## Self-made >1/week   No             53  54  21  19  55 178
##                     Yes             3  13   3   2   8  41
##           1-4/month No             48  42  30  22  57 170
##                     Yes             6  10   7   2  13  41
##           <1/month  No              4   8   4   2  13  29
##                     Yes             0   0   0   0   0   0
## Canned    >1/week   No             28  16   7   4  18  76
##                     Yes             7  12   2   2  15  22
##           1-4/month No             50  49  26  24  50 141
##                     Yes             0   3   0   1   1   5
##           <1/month  No              1   8   5   2   6   8
##                     Yes             0   0   0   0   0   0
## Dry-mix   >1/week   No              8  17   4   8  18  75
##                     Yes             0   2   0   0   4   4
##           1-4/month No             20  21   6  10  14  69
##                     Yes             0   1   0   0   1   8
##           <1/month  No              0   4   0   0   4   2
##                     Yes             0   0   0   0   0   0

The tables above shows that there are some rows are all 0 because of the “soupfreq”, so we could not use “soupfreq” as predictor. Therefore, a model ‘sureness ~ souptype + cold’ is fitted to the data.

Below we use the polr command from the MASS package to estimate an ordered logistic regression model.

m = polr(SURENESS ~ SOUPTYPE + COLD, data = soup_test, Hess = TRUE)
summary(m)
## Call:
## polr(formula = SURENESS ~ SOUPTYPE + COLD, data = soup_test, 
##     Hess = TRUE)
## 
## Coefficients:
##                   Value Std. Error t value
## SOUPTYPECanned  -0.2050    0.09579  -2.140
## SOUPTYPEDry-mix  0.1922    0.12443   1.545
## COLDYes          0.2400    0.12850   1.868
## 
## Intercepts:
##     Value    Std. Error t value 
## 1|2  -1.9705   0.0845   -23.3184
## 2|3  -1.0316   0.0700   -14.7460
## 3|4  -0.7306   0.0675   -10.8200
## 4|5  -0.4967   0.0663    -7.4932
## 5|6   0.1164   0.0653     1.7825
## 
## Residual Deviance: 5535.993 
## AIC: 5551.993

Next, we test the proportional odds assumption:

sf = function(y) {
  c('Y>=1' = qlogis(mean(y >= 1)),
    'Y>=2' = qlogis(mean(y >= 2)),
    'Y>=3' = qlogis(mean(y >= 3)),
    'Y>=4' = qlogis(mean(y >= 4)),
    'Y>=5' = qlogis(mean(y >= 5)),
    'Y>=6' = qlogis(mean(y >= 6)))
}

s = with(soup_test, 
          summary(as.numeric(SURENESS) ~ SOUPTYPE + COLD, fun = sf))
s
## as.numeric(SURENESS)     N= 1847 
## 
## +--------+---------+----+----+--------+---------+---------+---------+-----------+
## |        |         |N   |Y>=1|Y>=2    |Y>=3     |Y>=4     |Y>=5     |Y>=6       |
## +--------+---------+----+----+--------+---------+---------+---------+-----------+
## |SOUPTYPE|Self-made| 958|Inf |2.001954|1.0902789|0.7564595|0.5387604|-0.08355589|
## |        |Canned   | 589|Inf |1.766243|0.8692232|0.5609500|0.3254224|-0.29065384|
## |        |Dry-mix  | 300|Inf |2.273598|1.1344906|0.9610567|0.6781843| 0.10676798|
## +--------+---------+----+----+--------+---------+---------+---------+-----------+
## |COLD    |No       |1608|Inf |1.884780|1.0046160|0.6987494|0.4528575|-0.13952941|
## |        |Yes      | 239|Inf |2.634583|1.1609554|0.9016919|0.7630169| 0.02510592|
## +--------+---------+----+----+--------+---------+---------+---------+-----------+
## |Overall |         |1847|Inf |1.960218|1.0241890|0.7241701|0.4915250|-0.11816654|
## +--------+---------+----+----+--------+---------+---------+---------+-----------+

We can use the values in this table to help us assess whether the proportional odds assumption is reasonable for our model. The difference between each set of categories of the dependent variable remains similar. So the proportional odds assumption is valid for the model.

To add p-values, first we store the coefficient table, then calculate the pvalues and combine back with the table.

ctable = coef(summary(m))
p = pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable = cbind(ctable, "p value" = p))
##                      Value Std. Error    t value       p value
## SOUPTYPECanned  -0.2049982 0.09579422  -2.139985  3.235594e-02
## SOUPTYPEDry-mix  0.1922225 0.12443500   1.544762  1.224038e-01
## COLDYes          0.2400353 0.12850292   1.867936  6.177093e-02
## 1|2             -1.9704728 0.08450303 -23.318368 2.886946e-120
## 2|3             -1.0316108 0.06995880 -14.745977  3.265604e-49
## 3|4             -0.7306223 0.06752547 -10.819950  2.769276e-27
## 4|5             -0.4967487 0.06629286  -7.493246  6.719055e-14
## 5|6              0.1164024 0.06530217   1.782520  7.466443e-02

We can also get confidence intervals for the parameter estimates. If the 95% CI does not cross 0, the parameter estimate is statistically significant.

(ci = confint(m))
##                       2.5 %      97.5 %
## SOUPTYPECanned  -0.39269077 -0.01711287
## SOUPTYPEDry-mix -0.05050130  0.43752670
## COLDYes         -0.01016799  0.49393046

From the table above, the CIs of ‘souptype’ canned do not include 0 while the CIs of ‘cold’ does, which indicates that only ‘souptype’ is a significant variable in explaining the variance.

To get the odd ratios and confidence intervals, we just exponentiate the estimates and confidence intervals.

exp(coef(m))
##  SOUPTYPECanned SOUPTYPEDry-mix         COLDYes 
##       0.8146488       1.2119401       1.2712940
exp(cbind(OR = coef(m), ci))
##                        OR     2.5 %    97.5 %
## SOUPTYPECanned  0.8146488 0.6752375 0.9830327
## SOUPTYPEDry-mix 1.2119401 0.9507527 1.5488717
## COLDYes         1.2712940 0.9898835 1.6387446

These coefficients are called proportional odds ratios and we would interpret these pretty much as we would odds ratios from a binary logistic regression. For cold, we would say that for a one unit increase in cold, the odds of sureness 1 versus sureness higher rate combined are 1.27 greater, given that all of the other variables in the model are held constant.

Last, we can obtain predicted probabilities, which are usually easier to understand than either the coefficients or the odds ratios.

newdat = data.frame(SOUPTYPE = soup_test$SOUPTYPE,
                    COLD = soup_test$COLD)
newdat_p = cbind(newdat, predict(m, newdat, type = "probs"))
deduped.data = unique(newdat_p[, 1:8])
deduped.data
##      SOUPTYPE COLD          1         2          3          4         5
## 1      Canned  Yes 0.11862563 0.1374155 0.06137775 0.05268165 0.1502297
## 11  Self-made   No 0.12233811 0.1404338 0.06228626 0.05324684 0.1507628
## 31     Canned   No 0.14610601 0.1582558 0.06717630 0.05603728 0.1520902
## 51    Dry-mix   No 0.10315084 0.1241117 0.05711605 0.04988336 0.1467921
## 181 Self-made  Yes 0.09881082 0.1201648 0.05577387 0.04895833 0.1454233
## 660   Dry-mix  Yes 0.08296471 0.1049116 0.05026802 0.04498196 0.1385598
##             6
## 1   0.4796699
## 11  0.4709322
## 31  0.4203344
## 51  0.5189459
## 181 0.5308689
## 660 0.5783138

Then table above shows that for sureness = 2 and cold = Yes, the predicted probability for ‘dry-mix’ and ‘canned’ soup are 0.1049 and 0.1374 correspondingly. The results are same with the results from Stata.

Ordinal Logistic Regression in Stata

The first part of the code was to read in the data:

import delimited "soup_data.csv"

Next we looked at the first ten rows to get an idea of the data set

list if v1 <=10

The next step was to look at the variable sureness as we were using this variable as the response variable. We created some frequency and contingency tables for the variable with possible predictor variables:

tab sureness
tabstat sureness, s(n mean SD)

From the table above, having a mean of 4.377, shows that there is a higher proportion of values on the higher end of the sureness scale.

tab sureness souptype2

tab sureness cold2

The predictor variables we chose were souptype and cold so the model was fit as: sureness ~ souptype + cold.

Souptype and cold were recoded as numbers so they would be easier to use as factors.

gen cold2 = 1 if cold == "Yes"
replace cold2 = 0 if cold == "No"

gen souptype2 = 0 if souptype == "Self-made"
replace souptype2 = 1 if souptype == "Canned"
replace souptype2 = 2 if souptype == "Dry-mix"

The summary for the model is shown below:

ologit sureness i.souptype2 i.cold2

Looking at the chi-squared test since p-value is equal to .0045, which is less than .05, we can reject the null hypothesis that no relationship exists.

Below is a prediction for the percent for sureness equal to two and souptype equal to canned and dry-mix and the respondent having a cold, as well as a graph with a 95% confidence interval for the prediction of sureness given the souptype and cold.

margins souptype2 cold2
margins, at(souptype2 = (1/2) cold2 = 1) predict(outcome(2)) 
marginsplot, recast(line) recastci(rarea) xtitle("Soup Type") ytitle("Prob.")

The plot shows the difference in probabilities, between canned and dry-mix, of sureness equal to two, when the respondent has a cold. When the souptype is canned the probability of sureness being equal to two is .105 and when souptype is dry-mix the probability of sureness being equal to two is .137.

Summary

The results of the model showed that a change from self-made to canned soup would have a negative effect on sureness, where as changing from self-made to a dry-mix would have a positive effect. Also, the respondent having a cold would also have a positive effect when compared to a respondent who does not have a cold.

Ordinal logistic regression is used when there are more than two ordered response categories. After obtaining a model, we can also obtain predicted probabilities at given values, which was shown at the end of each tutorial. These probabilities are usually easier to understand and interpret than what is given in the model.

References

UCLA Institute for Digital Research and Education - Ordinal Logistic Regression
Statistics.laerd.com. (2018). How to perform an Ordinal Regression in SPSS | Laerd Statistics.