Introduction

The goal of this tutorial is to demonstrate the use of logistic regression and the model diagnostics for this type of regression. We start by explaining the algorithm and the modeling behind logistic regression, followed by an illustrative example in three statistical software languages: R, Python, and STATA.

Logistic Regression Algorithm

Logistic regression is a predictive modeling algorithm that is used when the response variable \(y\) is a binary categorical variable, i.e. \(y\) can only take two values: 0 and 1. It is defined by the probability mass function: \[P(y_i=1|x_i=x) = \frac{exp(\beta'x)}{1+exp(\beta'x)}\] which implies that \[P(y_i=0|x_i=x) = \frac{1}{1+exp(\beta'x)}\]

In logistic regression, the logit function is defined as \(logit(x) = log \frac{x}{1-x}\) and the expit function, the inverse of the logit function, is defined as \(expit(x) = \frac{exp(x)}{1+exp(x)}\). The logit function maps the unit interval onto the real line and the expit function does the inverse. In logistic regression, it is the expit function that maps the linear predictor \(\beta'x\) to a probability. As a result, the linear predictor \(\beta'x\) can be interpreted as the conditional log odds: \[log \frac{P(y_i=1|x_i=x)}{P(y_i=0|x_i=x)} = \beta'x\]
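In R, the expit and logit functions are available as plogis() and qlogis(); a minimal sketch, using a hypothetical value of the linear predictor:

eta = 1.5                 # hypothetical value of the linear predictor beta'x
p = plogis(eta)           # expit: exp(eta)/(1 + exp(eta)), approximately 0.818
qlogis(p)                 # logit: log(p/(1 - p)) recovers eta = 1.5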

Data Summary

In this tutorial, we work on the pima dataset from the package faraway v1.0.7 in R, which has 9 variables. The data comes from a National Institute of Diabetes and Digestive and Kidney Diseases study on 768 adult female Pima Indians living near Phoenix. Different measurements of the subjects were taken and the following variables were recorded:

pregnant: Number of times pregnant
glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
diastolic: Diastolic blood pressure (mm Hg)
triceps: Triceps skin fold thickness (mm)
insulin: 2-Hour serum insulin (mu U/ml)
bmi: Body mass index (weight in kg divided by the square of height in m)
diabetes: Diabetes pedigree function
age: Age (years)
test: Whether the patient shows signs of diabetes, coded 0 if negative and 1 if positive.

Since this dataset contains a binary categorical variable test, logistic regression can be used to create a predictive model using the rest of the variables as predictors to determine the probability that a person will show positive signs of diabetes.

Model Diagnostics

After fitting a logistic regression model, we will conduct variable selection using backward elimination with the Bayesian Information Criterion (BIC) as the selection criterion to determine the best model for the given data. Backward elimination is appropriate here as we have only a few candidate predictors. Using the BIC discourages overfitting by penalizing the addition of parameters to the model. The performance of the selected model will be assessed using the Receiver Operating Characteristic (ROC) curve.

The Hosmer-Lemeshow test will be conducted to assess the model’s goodness of fit. The test compares the expected probabilities against the actual observations by subgroups of the model population. The null hypothesis is that the expected proportions from the model are the same as those observed across all subgroups, i.e., the model is a good fit; the alternative hypothesis is that the observed and expected proportions are not the same. The Hosmer-Lemeshow statistic is given by: \[H = \sum_{g=1}^G\frac{(O_{1g}-E_{1g})^2}{N_g\pi_g(1-\pi_g)}\] where \(O_{1g}\) and \(E_{1g}\) are the observed and expected numbers of \(Y=1\) events in the \(g\)th group, respectively; \(N_g\) is the number of observations in the \(g\)th group; \(\pi_g\) is the predicted risk for the \(g\)th risk decile group; and \(G\) is the number of groups.

We then compare this test statistic to the \(\chi ^{2}\) distribution with \(G − 2\) degrees of freedom to calculate the p-value of the test. A high p-value indicates no evidence of a discrepancy between the fitted and observed proportions, hence a good fit.
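For instance, with \(G = 10\) groups, the p-value for a given statistic can be computed in R as follows (the value of H below is hypothetical):

H = 7.6                                     # hypothetical Hosmer-Lemeshow statistic
G = 10                                      # number of risk decile groups
pchisq(H, df = G - 2, lower.tail = FALSE)   # upper-tail p-value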

Languages

R

In order to run this analysis in R, we need to install two key packages: faraway and ResourceSelection. The former is required to access the data used in this illustration and the latter is used to run the model diagnostics. Once installed with the command install.packages("packagename"), they can be loaded using library().

Loading Dataset and Numerical Summaries

Once the dataframe pima is loaded, we can view summaries of the variables by using the summary() function.

library(faraway)
data(pima)
summary(pima)
##     pregnant         glucose        diastolic         triceps     
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     insulin           bmi           diabetes           age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725   Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
##       test      
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

The variables diastolic, insulin, glucose, and bmi exhibit zero values. Since zero is not physically possible for these measurements, we conclude that these are measurement errors or that missing values were coded as 0. To clean up the data, we remove the rows containing these values.

library(dplyr)
pima_clean = filter(pima, diastolic > 0 & bmi > 0) %>% filter(insulin > 0 & glucose > 0)
summary(pima_clean)
##     pregnant         glucose        diastolic         triceps     
##  Min.   : 0.000   Min.   : 56.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.:21.00  
##  Median : 2.000   Median :119.0   Median : 70.00   Median :29.00  
##  Mean   : 3.301   Mean   :122.6   Mean   : 70.66   Mean   :29.15  
##  3rd Qu.: 5.000   3rd Qu.:143.0   3rd Qu.: 78.00   3rd Qu.:37.00  
##  Max.   :17.000   Max.   :198.0   Max.   :110.00   Max.   :63.00  
##     insulin            bmi           diabetes           age       
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0850   Min.   :21.00  
##  1st Qu.: 76.75   1st Qu.:28.40   1st Qu.:0.2697   1st Qu.:23.00  
##  Median :125.50   Median :33.20   Median :0.4495   Median :27.00  
##  Mean   :156.06   Mean   :33.09   Mean   :0.5230   Mean   :30.86  
##  3rd Qu.:190.00   3rd Qu.:37.10   3rd Qu.:0.6870   3rd Qu.:36.00  
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
##       test       
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.3316  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

Model Selection using the Backward Elimination Method

We use the backward elimination method to select the model specification, with the BIC as a guide; a lower BIC indicates a preferred model. We therefore focus on the individual p-values for the variables and the BIC for the model.

We start with a logistic model with test as the response variable and all the other variables as explanatory variables. The variable test is binary, i.e., it takes the values 0 and 1. We term a positive diabetes detection a “success” (test = 1) and a negative detection a “failure” (test = 0).

Fit = glm(test~ ., family = binomial(link="logit"), data = pima_clean)
summary(Fit)
## 
## Call:
## glm(formula = test ~ ., family = binomial(link = "logit"), data = pima_clean)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7823  -0.6603  -0.3642   0.6409   2.5612  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.004e+01  1.218e+00  -8.246  < 2e-16 ***
## pregnant     8.216e-02  5.543e-02   1.482  0.13825    
## glucose      3.827e-02  5.768e-03   6.635 3.24e-11 ***
## diastolic   -1.420e-03  1.183e-02  -0.120  0.90446    
## triceps      1.122e-02  1.708e-02   0.657  0.51128    
## insulin     -8.253e-04  1.306e-03  -0.632  0.52757    
## bmi          7.054e-02  2.734e-02   2.580  0.00989 ** 
## diabetes     1.141e+00  4.274e-01   2.669  0.00760 ** 
## age          3.395e-02  1.838e-02   1.847  0.06474 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 498.10  on 391  degrees of freedom
## Residual deviance: 344.02  on 383  degrees of freedom
## AIC: 362.02
## 
## Number of Fisher Scoring iterations: 5
BIC(Fit)
## [1] 397.7626

Stepwise elimination of the individually statistically insignificant variables, guided by the BIC criterion, results in the final model specification with four explanatory variables: glucose, bmi, diabetes, and age.
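The same elimination can also be automated with step(), using the penalty k = log(n) so that the criterion corresponds to the BIC. This is only a sketch and may not follow exactly the same dropping order as the manual procedure:

n = nrow(pima_clean)
Fit_bic = step(Fit, direction = "backward", k = log(n), trace = 0)
formula(Fit_bic)   # compare with the manually selected specification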

Fit_final = glm(test~ glucose+bmi+diabetes+age, family = binomial(link="logit"), data = pima_clean)
summary(Fit_final)
## 
## Call:
## glm(formula = test ~ glucose + bmi + diabetes + age, family = binomial(link = "logit"), 
##     data = pima_clean)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8228  -0.6617  -0.3759   0.6702   2.5881  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.092018   1.080251  -9.342  < 2e-16 ***
## glucose       0.036189   0.004982   7.264 3.76e-13 ***
## bmi           0.074449   0.020267   3.673 0.000239 ***
## diabetes      1.087129   0.419408   2.592 0.009541 ** 
## age           0.053012   0.013439   3.945 8.00e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 498.10  on 391  degrees of freedom
## Residual deviance: 347.23  on 387  degrees of freedom
## AIC: 357.23
## 
## Number of Fisher Scoring iterations: 5
BIC(Fit_final)
## [1] 377.0913

The estimated coefficients indicate the change in the log odds of the response when a predictor increases by one unit, holding the other predictors fixed. For example, the log odds of testing positive for diabetes increases by 0.053 when age increases by one year. In addition, we can obtain confidence intervals for the coefficient estimates as given below. (Note that confint() applied to a glm object computes profile-likelihood intervals; Wald intervals based on the standard errors and a normality assumption can be obtained with confint.default().)

#computing confidence intervals:
confint(Fit_final)
##                    2.5 %      97.5 %
## (Intercept) -12.32530580 -8.07943034
## glucose       0.02676551  0.04635419
## bmi           0.03564664  0.11539762
## diabetes      0.28298910  1.92733798
## age           0.02715532  0.08002198
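Because the coefficients are on the log-odds scale, exponentiating them (and the interval endpoints) expresses the results as odds ratios, which are often easier to communicate:

# odds ratios and corresponding confidence intervals
exp(coef(Fit_final))
exp(confint(Fit_final))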

Hosmer-Lemeshow Goodness of Fit Test

We apply the Hosmer-Lemeshow test to assess the fit of the model. Essentially, we are testing the null hypothesis that the specified logistic regression model is the correct model. Since the p-value is high, we fail to reject the null hypothesis.

library(ResourceSelection)
h1 = hoslem.test(Fit_final$y, fitted(Fit_final), g = 10)
h1
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  Fit_final$y, fitted(Fit_final)
## X-squared = 7.617, df = 8, p-value = 0.4718

In addition, the Hosmer–Lemeshow test forms the subgroups as deciles of the fitted probability values. Models for which the expected and observed counts in the subgroups are similar are considered to be well calibrated. The expected and observed counts below are close in every decile, which provides evidence that our model is well calibrated.

library(ResourceSelection)
#inspect the expected and observed values
h2 = cbind(h1$expected, h1$observed)
h2
##                      yhat0     yhat1 y0 y1
## [0.00869,0.0451] 38.735269  1.264731 39  1
## (0.0451,0.0804]  36.602126  2.397874 38  1
## (0.0804,0.116]   35.113616  3.886384 37  2
## (0.116,0.161]    33.586721  5.413279 34  5
## (0.161,0.233]    31.402819  7.597181 31  8
## (0.233,0.325]    28.491642 10.508358 25 14
## (0.325,0.431]    24.456071 14.543929 26 13
## (0.431,0.638]    18.137699 20.862301 13 26
## (0.638,0.794]    10.755510 28.244490 13 26
## (0.794,0.992]     4.718525 35.281475  6 34

Confusion Matrix

The confusion matrix provides another tool to evaluate the model. Its off-diagonal cells count the false positives and false negatives (analogous to Type I and Type II errors), while the diagonal counts the instances in which the model predicts correctly. Inspecting the confusion matrix for our model (using a 0.5 probability threshold), we deduce that the overall prediction accuracy is around 80 percent.

p = predict(Fit_final, pima_clean, type = "response")
Con_table = table(p > 0.5, pima_clean$test)
Con_table
##        
##           0   1
##   FALSE 233  51
##   TRUE   29  79
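The overall accuracy can be read off the table as the diagonal counts divided by the total:

sum(diag(Con_table)) / sum(Con_table)   # (233 + 79)/392, approximately 0.80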

Receiver Operating Characteristic (ROC) Curve

Finally, we inspect the ROC curve. This curve plots the true positive rate against the false positive rate over a range of threshold probabilities. The larger the area under the curve, and hence the farther the ROC curve is from the diagonal, the better the model performance. Computing the area under the curve (AUC) allows us to quantitatively evaluate the model performance, and it serves as a useful tool for model comparison as well.

library(ROCR)
p = predict(Fit_final, pima_clean, type = "response")
pred = prediction(p, pima_clean$test)
roc = performance(pred, "tpr", "fpr")
plot(roc, col = 'red',
     main = "ROC Curve")
abline(a=0, b=1)
#Higher area under the curve the better the fit (AUC)
auc = performance(pred, "auc")
auc = unlist(slot(auc, "y.values"))
auc = round(auc, 2)
legend(.6, .2, auc, title = "Area under the Curve", cex = .75)

Summary

In this tutorial we illustrated how to implement logistic regression and conduct model diagnostics. We used the pima dataset to study how the diabetes test result is associated with variables such as age, BMI, glucose, and the diabetes pedigree score of adult female Pima Indians. We used the backward elimination method to decide on the specification of the logistic regression model. To assess the fit of our model we employed techniques such as the Hosmer-Lemeshow test and the ROC curve, which led us to the conclusion that our model specification is a good fit for these data and hence our findings are meaningful.

References

Hosmer, D. W., & Lemeshow, S. (2000). Applied Logistic Regression (2nd ed.). New York: John Wiley & Sons, Inc.

UCLA Institute for Digital Research and Education: https://stats.idre.ucla.edu/r/dae/logit-regression/

Python

This page uses the following packages in Python. Please make sure that you can load them before trying to run the examples on this page.

import pandas as pd                                ## data analysis package
import matplotlib.pyplot as plt                    ## plot package
import warnings
warnings.filterwarnings("ignore")                
import statsmodels.formula.api as smf             
import statsmodels.api as sm                       ## statistical model package
from sklearn.metrics import confusion_matrix       ## logistic regression report packages
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import classification_report
from scipy.stats import chi2                       ## chi-square 

Loading Dataframe and Numerical Summaries

In this tutorial, we use the pima data as our dataset. To fit the model in Python, we divide the data into the dependent variable and the independent variables. Since all independent variables here are numeric, we do not need to convert them into factors. However, sm.Logit() in Python does not include an intercept by default, so we add a new column named intercept to the independent variables. The code is as follows:

## Load data
data=pd.read_csv("pima.txt",
   delim_whitespace=True,
   skipinitialspace=True)
data['intercept'] = 1.0

Once the dataframe is loaded, we can view the summaries of the variables by using the describe() method, data.describe().

## Summary of data
pd.set_option('display.max_columns', None)
describe = data.describe()
print(describe)
data summary before data cleaning

Then we remove the rows in which bmi, diastolic, glucose, triceps, or insulin equals 0 with the following code:

## Data Cleaning
data = data[data['bmi']>0]
data = data[data['diastolic']>0]
data = data[data['glucose']>0]
data = data[data['triceps']>0]
data = data[data['insulin']>0]
y=data['test']
X=data.drop(['test'], axis=1)

After we do the data cleaning, we have the result as follows:

## Summary of data after data cleaning
describe = data.describe()
print(describe)
data summary after data cleaning

Model Selection Using the Backward Elimination Method

We first fit a logistic model with test as the response variable and all the other variables as explanatory variables.

## Logistic regression
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())
Logistic regression

In the output above, we obtain the estimated coefficients, standard errors, confidence intervals for the coefficients, z-values (Wald z-statistics) and their corresponding p-values. Here, glucose, bmi, diabetes, and the intercept are statistically significant, while pregnant, triceps, insulin, age, and diastolic are not (at the 5% level). The result above can be interpreted as follows:

\[ log \frac{p}{1-p} =-10.04+ 0.082\times pregnant+ 0.038 \times glucose -0.001 \times diastolic +0.011 \times triceps -0.0008 \times insulin + 0.071 \times bmi+ 1.141 \times diabetes +0.034 \times age \]
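To convert the linear predictor above into a predicted probability we apply the expit transformation; a brief sketch with hypothetical covariate values (not taken from the data):

import numpy as np

# hypothetical subject: 2 pregnancies, glucose 120, diastolic 70, triceps 30,
# insulin 100, bmi 32, diabetes pedigree 0.5, age 35
eta = (-10.04 + 0.082*2 + 0.038*120 - 0.001*70 + 0.011*30
       - 0.0008*100 + 0.071*32 + 1.141*0.5 + 0.034*35)
prob = 1/(1 + np.exp(-eta))   # expit of the linear predictor
print(round(prob, 3))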

We use backward stepwise selection with the BIC criterion: starting with all candidate variables, we test the deletion of each variable, delete the variable (if any) whose removal most reduces the BIC, and repeat this process until no further deletion lowers the BIC.

There is no built-in method in statsmodels for stepwise model selection, so we compare the change in BIC from deleting each feature ourselves (a helper loop is sketched below) and drop variables one at a time, starting with triceps:
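A small helper loop like the following (a sketch using the X and y defined above) reports the BIC obtained after dropping each candidate column in turn:

## BIC after dropping each column in turn (sketch)
for col in X.columns.drop('intercept'):
    bic = sm.Logit(y, X.drop(columns=[col])).fit(disp=0).bic
    print(col, round(bic, 1))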

## Backward stepwise with BIC
# Triceps individually highly insignificant so drop the variable and assess the model
X1=X.drop(['triceps'], axis=1)
logit_model1=sm.Logit(y,X1)
result1=logit_model1.fit()
print(result1.bic)
BIC after deleting triceps

The BIC of the original model is 398; after deleting triceps, the BIC declines to 392. We follow the same step and delete diastolic, with the following result:

# Drop diastolic (individually statistically insignificant) and assess the model
X2=X.drop(['triceps','diastolic'], axis=1)
logit_model2=sm.Logit(y,X2)
result2=logit_model2.fit()
print(result2.bic)
BIC after deleting triceps and diastolic

After deleting diastolic, the BIC declines to 386. We follow the same step and delete insulin, with the following result:

# Drop insulin (individually insignificant) and assess the model
X3=X.drop(['triceps','diastolic','insulin'], axis=1)
logit_model3=sm.Logit(y,X3)
result3=logit_model3.fit()
print(result3.bic)
BIC after deleting triceps, diastolic and insulin

After deleting insulin, the BIC declines to 381. We follow the same step once more and delete pregnant, which gives the final model:

# Drop pregnant (individually insignificant) and assess the model
X4=X.drop(['triceps','diastolic','insulin','pregnant'], axis=1)
logit_model4=sm.Logit(y,X4)
result4=logit_model4.fit()
print(result4.summary2())
Logistic regression

We find that the final model contains only four variables plus the intercept, all of which are statistically significant. The result above can be interpreted as follows: \[log \frac{p}{1-p} =-10.092+ 0.036 \times glucose+ 0.074 \times bmi+ 1.087 \times diabetes +0.053 \times age\]

Model Fitting

In this part, we compute the fitted values from the model derived in the previous part.

# Fitting value
y_prob = result4.predict(X4)

We first calculate the accuracy of our classification model:

## Test accuracy
pred = result4.predict(X4)
pred[pred > 0.5] = 1
pred[pred <= 0.5] = 0
test_acc = (y == pred).mean()
print('In-sample accuracy of logistic regression classifier: {:.2f}'.format(test_acc))

Hosmer-Lemeshow Goodness of Fit Test

The Hosmer-Lemeshow Test is a goodness-of-fit test for logistic regression models. We use this to compare the fitted values and observed values from the dataset.

# Hosmer-Lemeshow Test
y_prob = pd.DataFrame(y_prob)
y_prob1 = pd.concat([y_prob, y], axis =1)
y_prob1['decile'] = pd.qcut(y_prob1[0], 10)
obsevents_pos = y_prob1['test'].groupby(y_prob1.decile).sum()
obsevents_neg = y_prob1[0].groupby(y_prob1.decile).count() - obsevents_pos
expevents_pos = y_prob1[0].groupby(y_prob1.decile).sum()
expevents_neg = y_prob1[0].groupby(y_prob1.decile).count() - expevents_pos
hl = ((obsevents_neg - expevents_neg)**2/expevents_neg).sum()+((obsevents_pos - expevents_pos)**2/expevents_pos).sum()
print('chi-square: {:.2f}'.format(hl))
## df = group-2
pvalue=1-chi2.cdf(hl,8)
print('p-value: {:.2f}'.format(pvalue))
obsevents_pos = pd.DataFrame(obsevents_pos)
obsevents_neg = pd.DataFrame(obsevents_neg)
expevents_pos = pd.DataFrame(expevents_pos)
expevents_neg = pd.DataFrame(expevents_neg)
final = pd.concat([obsevents_pos, obsevents_neg, expevents_pos, expevents_neg], axis =1)
final.columns=['obs_pos','obs_neg','exp_pos', 'exp_neg']
print(final)

The observed number of successes and failures in each interval are obtained by counting the subjects in that interval. The expected number of successes in an interval is the sum of the probability of success for the subjects in that interval.

Table of observed and expected counts

The Hosmer-Lemeshow statistic is calculated from the table above, and the p-value follows from the \(\chi^2\) distribution with 8 degrees of freedom, as computed in the code above.

Since the p-value is high, we fail to reject the null hypothesis that the expected and observed probabilities are equal.

Confusion Matrix and Classification Report

# Confusion matrix (rows: true class; columns: predicted class)
cm = confusion_matrix(y, pred)
print(cm)

In scikit-learn’s convention, rows correspond to the true class and columns to the predicted class. The (1,1) entry counts observations with true value 0 predicted as 0, the (1,2) entry true value 0 predicted as 1, the (2,1) entry true value 1 predicted as 0, and the (2,2) entry true value 1 predicted as 1. Thus, we have 233+79 correct predictions and 29+51 false predictions.
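The overall accuracy follows directly from the matrix (cm is the array computed above):

accuracy = (cm[0, 0] + cm[1, 1])/cm.sum()   # (233 + 79)/392
print(round(accuracy, 3))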

# classification report
print(classification_report(y, pred))
classification report

Receiver Operating Characteristic (ROC) Curve

When the model predicts well, the ROC curve bows toward the upper-left corner. Shifting the diagonal line until it is tangent to the ROC curve identifies a threshold at which the TPR is large and the FPR is small. The better the model, the farther the ROC curve lies from the diagonal. In the extreme case, the ROC curve passes through the point (0, 1), that is, every positive example is classified as positive and every negative example as negative. The area under the ROC curve (AUC) quantitatively summarizes the model's performance: the larger the AUC, the better the model.

# ROC curve
logit_roc_auc = roc_auc_score(y, y_prob)   # use predicted probabilities, not the 0/1 predictions
fpr, tpr, thresholds = roc_curve(y, y_prob)
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
ROC curve

Summary

In this tutorial we illustrated how to implement logistic regression, select a model, and conduct model diagnostics in Python. We fit a model using backward elimination with the BIC criterion, then assessed the resulting model using the confusion matrix, the Hosmer-Lemeshow test, and the ROC curve. The resulting model was found to have good fit and accuracy.

STATA

For STATA, all functions for the tests in this tutorial are built-in, so there is no need to install or load any additional packages before performing the analysis.

Loading Dataset and Numerical Summaries

The dataset was exported from R into a delimited text file, pima.txt. We can import and summarize the pima dataset using the following:

import delimited using pima.txt, clear
summarize

We see from the summary that the data for bmi, diastolic, glucose, triceps, and insulin contain 0 values, which are not possible measures of these variables. We must therefore drop rows with these values from our dataset. We can do this as follows:

drop if glucose == 0 | diastolic == 0 | triceps ==0 | insulin == 0 | bmi == 0
summarize 

We can see that we now have 392 observations from the initial 768, and all values are now within reasonable range.

Model Selection using the Backward Elimination method

For this model, our response variable is test which indicates whether or not a patient tested positive for diabetes. Note that the rest of the variables are numeric and continuous, so we can use them as-is in our regression model. Using all other variables as independent variables, we can generate the full model using this code:

logit test pregnant glucose diastolic triceps insulin bmi diabetes age

We now want to determine the best fit model for the given data. Starting with the full model, we use backward elimination to drop variables one-by-one and the Bayesian Information Criterion (BIC) to determine if the model is an improvement from the previous model.

We first get the BIC of the model using the following command:

estat ic

To determine the order in which to drop the variables, we can use Stata’s stepwise prefix with a 0.05 significance level for removal.

stepwise, pr(0.05): logit test pregnant glucose diastolic triceps insulin bmi diabetes age

From this, we get the dropping order diastolic, insulin, triceps, pregnant.

First, we drop diastolic:

// Model 1 (drop diastolic):
quietly logit test pregnant glucose triceps insulin bmi diabetes age
estat ic

From this, we get a BIC of 391.8057, an improvement from the full model. We then drop insulin:

// Model 2 (drop insulin):
quietly logit test pregnant glucose triceps bmi diabetes age
estat ic

Our BIC is now 386.223. Dropping the variable triceps, we get another model iteration:

// Model 3 (drop triceps): 
quietly logit test pregnant glucose bmi diabetes age
estat ic

This gives us a BIC of 380.7127, still an improvement from the previous. Lastly, dropping pregnant, we use the following code and get the model summary:

// Model 4 (drop pregnant): 
logit test glucose bmi diabetes age
estat ic

We see that all variables are now significant and we get a minimized BIC of 377.0913 for this model.
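If odds ratios are preferred over coefficients on the log-odds scale, the final model can be refitted with the or option, which reports exponentiated coefficients (a brief sketch):

logit test glucose bmi diabetes age, or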

Now that we have a final model, we can create a new column p which contains the fitted values of our model using the following command:

predict p

Hosmer-Lemeshow Goodness of Fit Test

We use the Hosmer-Lemeshow test to assess the goodness-of-fit of the model. Our null hypothesis is that the fitted values from the model are the same as what is observed.

We group the data into deciles and perform the Hosmer-Lemeshow Test using the following code:

estat gof, group(10) table

Since the p-value is greater than \(\alpha = 0.05\), we fail to reject the null hypothesis that the observed and expected proportions are the same across all groups. There is no evidence of a discrepancy between the observed and expected proportions, suggesting that the selected model is a good fit.

Confusion Matrix and Classification Report

To determine the accuracy of our model, we can generate a classification report using the following command:

estat classification

The results show a model accuracy of about 80%.

Receiver Operating Characteristic (ROC) Curve

These results, specifically the False Positive Rate and True Positive Rate, can be plotted on a Receiver Operating Characteristic (ROC) curve using the following command:

lroc, title("Receiving Operator Characteristic Curve") xtitle("False Positive Rate") ytitle("True Positive Rate")
ROC Curve

We get an area under the ROC curve of 0.8605, which is close to 1.00, indicative of good model accuracy.

Summary

This tutorial illustrated how logistic regression can be used on the pima dataset and how diagnostics can be done on the fitted model in STATA. The test variable was used as the response variable to fit a model that predicts the probability of showing positive signs of diabetes based on the rest of the variables in the dataset. Using backward elimination, we determined the best model, which had glucose, bmi, diabetes, and age as explanatory variables. The Hosmer-Lemeshow test showed that the model predicted values close to those observed. The classification report and ROC curve both indicated that the model had high accuracy.