The goal of this tutorial is to demonstrate the use of Logistic Regression, and the model diagnostics for this type of regression. We will start this tutorial by explaining the algorithm and the modeling behind Logistic Regression. This would be followed by an illustrative example using three statistical software languages: Python, R, and STATA.
Logistic regression is a predictive modeling algorithm that is used when the response variable \(y\) is a binary categorical variable, i.e. \(y\) can only take two values: 0 and 1. It is defined by the probability mass function: \[P(y_i=1|x_i=x) = \frac{exp(\beta'x)}{1+exp(\beta'x)}\] which implies that \[P(y_i=0|x_i=x) = \frac{1}{1+exp(\beta'x)}\]
In logistic regression, the logit function is defined as \(logit(x) = log \frac{x}{1-x}\) and the expit function, the inverse of logit function, is defined as \(expit(x) = \frac{exp(x)}{1+exp(x)}\). The logit function maps unit interval onto real line and the expit function does the inverse. Here, in logistic regression, the logit function maps \(\beta'x\) to a probablity. As a result, the linear predictor \(\beta'x\) can be interpreted in logistic regression as the conditional log odds: \[log \frac{P(y_i=1|x_i=x)}{P(y_i=0|x_i=x)} = \beta'x\]
In this tutorial, we work on the pima dataset from the package faraway v1.0.7 in R, which has 9 variables. The data comes from a National Institute of Diabetes and Digestive and Kidney Diseases study on 768 adult female Pima Indians living near Phoenix. Different measurements of the subjects were taken and the following variables were recorded:
pregnant
: Number of times pregnant
glucose
: Plasma glucose concentration, which is the result of a 2 hours oral glucose tolerance test
diastolic
: Diastolic blood pressure (mm Hg)
triceps
: Triceps skin fold thickness (mm)
insulin
: 2-Hour serum insulin (mu U/ml)
bmi
: Body mass index (weight in kg/(height in m2))
diabetes
: Diabetes pedigree function
age
: Age (years)
test
: A test whether the patient shows signs of diabetes , coded 0 if negative, 1 if positive.
Since this dataset contains a binary categorical variable test
, logistic regression can be used to create a predictive model using the rest of the variables as predictors to determine the probability that a person will show positive signs of diabetes.
Subsequent to fitting a logistic regression model, we will conduct variable selection using backwards elimination and the Bayesian Information Criterion (BIC) as the selection criterion to determine the best model for the given data. Backward elimination is appropriate for the given data as we only have a few parameters available. Using BIC prevents overfitting by penalizing adding parameters to the model. The performance of the selected model will be assessed using the Receiver Operating Characteristic (ROC) Curve.
The Hosmer-Lemeshow Test will be conducted to assess the model’s goodness of fit. The test compares the expected probabilities against the actual observations by subgroups of the model population. The null hypothesis for the test is that the expected proportions from the model are the same as those observed across all subgroups, i.e., the model is a good fit. The alternative hypothesis is that the observed and expected proportions are not the same. The Hosmer-Lemeshow statistic is given by: \[H = \sum_{g=1}^G\frac{(O_{1g}-E_{1g})^2}{N_g\pi_g(1-\pi_g)}\] where \(O_{1g}\) and \(E_{1g}\) are the observed and expected Y=1 events, respectively; \(N_g\) is the total number of observations; \(\pi_g\) is the predicted risk for the \(g\)th risk decile group; and \(G\) is the number of groups.
We then compare this test statistic to the \(\chi ^{2}\) distribution with \(G − 2\) degrees of freedom to calculate for the p-value of the test. A high p-value would mean that the model generates fitted values equal to observed values, hence indicating a good fit.
In order to run this analysis in R, we need to install two key packages: “Faraway” and “ResourceSelection” The former is required to access the data used in this illustration and the latter is used to run the model diagnostics. Once installed with the command, install.packages(“packagename”), they can be loaded using library( ).
Once the dataframe pima
is loaded, we can view the summaries of the variables by using the function, summary.
library(faraway)
data(pima)
summary(pima)
## pregnant glucose diastolic triceps
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## insulin bmi diabetes age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## test
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
The variables: diastolic
, insulin
, glucose
and bmi
exhibit zero values, since that is not physically possible we conclude that these are measurement errors or that missing values might have been coded as 0. In an attempt to clean up the data, we remove these values.
library(dplyr)
pima_clean = filter(pima, diastolic > 0 & bmi > 0) %>% filter(insulin > 0 & glucose > 0)
summary(pima_clean)
## pregnant glucose diastolic triceps
## Min. : 0.000 Min. : 56.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.:21.00
## Median : 2.000 Median :119.0 Median : 70.00 Median :29.00
## Mean : 3.301 Mean :122.6 Mean : 70.66 Mean :29.15
## 3rd Qu.: 5.000 3rd Qu.:143.0 3rd Qu.: 78.00 3rd Qu.:37.00
## Max. :17.000 Max. :198.0 Max. :110.00 Max. :63.00
## insulin bmi diabetes age
## Min. : 14.00 Min. :18.20 Min. :0.0850 Min. :21.00
## 1st Qu.: 76.75 1st Qu.:28.40 1st Qu.:0.2697 1st Qu.:23.00
## Median :125.50 Median :33.20 Median :0.4495 Median :27.00
## Mean :156.06 Mean :33.09 Mean :0.5230 Mean :30.86
## 3rd Qu.:190.00 3rd Qu.:37.10 3rd Qu.:0.6870 3rd Qu.:36.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
## test
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.3316
## 3rd Qu.:1.0000
## Max. :1.0000
We use the Backward elimination method to select the model specification, with the BIC as a guide. Please note that a lower BIC indicates improved fit of the model. So, we will be focusing on the individual p-values for the variables and the BIC for the model.
We start with a logistic model with test
as the response variable and all the other variables as explanatory. The variable test is binary i.e. takes the value 0 and 1. We term a positive diatebetes detection as a “success” (test = 1) and the negative detection as a “failure” (test = 0).
Fit = glm(test~ ., family = binomial(link="logit"), data = pima_clean)
summary(Fit)
##
## Call:
## glm(formula = test ~ ., family = binomial(link = "logit"), data = pima_clean)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7823 -0.6603 -0.3642 0.6409 2.5612
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.004e+01 1.218e+00 -8.246 < 2e-16 ***
## pregnant 8.216e-02 5.543e-02 1.482 0.13825
## glucose 3.827e-02 5.768e-03 6.635 3.24e-11 ***
## diastolic -1.420e-03 1.183e-02 -0.120 0.90446
## triceps 1.122e-02 1.708e-02 0.657 0.51128
## insulin -8.253e-04 1.306e-03 -0.632 0.52757
## bmi 7.054e-02 2.734e-02 2.580 0.00989 **
## diabetes 1.141e+00 4.274e-01 2.669 0.00760 **
## age 3.395e-02 1.838e-02 1.847 0.06474 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 498.10 on 391 degrees of freedom
## Residual deviance: 344.02 on 383 degrees of freedom
## AIC: 362.02
##
## Number of Fisher Scoring iterations: 5
BIC(Fit)
## [1] 397.7626
The stepwise elimination of variables that are individually statistically insignificant and using the BIC criterion, results in the final model specification with four independent variables: age
, bmi
, glucose
, and diabetes
.
Fit_final = glm(test~ glucose+bmi+diabetes+age, family = binomial(link="logit"), data = pima_clean)
summary(Fit_final)
##
## Call:
## glm(formula = test ~ glucose + bmi + diabetes + age, family = binomial(link = "logit"),
## data = pima_clean)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8228 -0.6617 -0.3759 0.6702 2.5881
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.092018 1.080251 -9.342 < 2e-16 ***
## glucose 0.036189 0.004982 7.264 3.76e-13 ***
## bmi 0.074449 0.020267 3.673 0.000239 ***
## diabetes 1.087129 0.419408 2.592 0.009541 **
## age 0.053012 0.013439 3.945 8.00e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 498.10 on 391 degrees of freedom
## Residual deviance: 347.23 on 387 degrees of freedom
## AIC: 357.23
##
## Number of Fisher Scoring iterations: 5
BIC(Fit_final)
## [1] 377.0913
The estimated coefficients indicate the change in the log odds of the response when the predictors change by one unit. For example, the log odds of testing positive for diabetes increases by 0.053 when age increases by one unit. In addition, we can obtain the confidence intervals for the coefficient estimates as given below (Note, that these confidence intervals are based on the standard errors and rely on the normality assumption).
#computing confidence intervals:
confint(Fit_final)
## 2.5 % 97.5 %
## (Intercept) -12.32530580 -8.07943034
## glucose 0.02676551 0.04635419
## bmi 0.03564664 0.11539762
## diabetes 0.28298910 1.92733798
## age 0.02715532 0.08002198
We apply the Hosmer-Lemeshow test to assess the fit of the model. Essentially, we are testing the null hypothesis that the specified logistic regression model is the correct model. Since the p-value is high, we fail to reject the null hypothesis.
library(ResourceSelection)
h1 = hoslem.test(Fit_final$y, fitted(Fit_final), g = 10)
h1
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: Fit_final$y, fitted(Fit_final)
## X-squared = 7.617, df = 8, p-value = 0.4718
In addition, the Hosmer–Lemeshow test specifically identifies subgroups as the deciles of fitted probability values. Models for which the expected and observed probabilities in subgroups are similar are considered to be well calibrated. This provides evidence that our model is well calibrated.
library(ResourceSelection)
#inspect the expected and observed values
h2 = cbind(h1$expected, h1$observed)
h2
## yhat0 yhat1 y0 y1
## [0.00869,0.0451] 38.735269 1.264731 39 1
## (0.0451,0.0804] 36.602126 2.397874 38 1
## (0.0804,0.116] 35.113616 3.886384 37 2
## (0.116,0.161] 33.586721 5.413279 34 5
## (0.161,0.233] 31.402819 7.597181 31 8
## (0.233,0.325] 28.491642 10.508358 25 14
## (0.325,0.431] 24.456071 14.543929 26 13
## (0.431,0.638] 18.137699 20.862301 13 26
## (0.638,0.794] 10.755510 28.244490 13 26
## (0.794,0.992] 4.718525 35.281475 6 34
This provides us with another tool to evaluate the model. It gives us a measure of the Type 1 and Type 2 error in our modeling. The diagonal gives the count of the instances when our model predicts correctly. Inspecting the Confusion matrix for our model we deduce that the overall predicted accurancy for our model is around 80 percent.
p = predict(Fit_final, pima_clean, type = "response")
Con_table = table(p > 0.5, pima_clean$test)
Con_table
##
## 0 1
## FALSE 233 51
## TRUE 29 79
Finally, we inspect the ROC curve. This curve plots the false positive rate against the true probablistic prediction for a range of threshold probabilities. The area under the curve is viewed as a measure of prediction accuracy. The larger the area under the curve, and hence the farther away the ROC curve is from the diagonal, the better the model performance. Computing the area under the curve (AUC) allows us to quantitatively evaluate the model performance. This could serve as a useful tool for model comparision as well.
library(ROCR)
p = predict(Fit_final, pima_clean, type = "response")
pred = prediction(p, pima_clean$test)
roc = performance(pred, "tpr", "fpr")
plot(roc, col = 'red',
main = "ROC Curve")
abline(a=0, b=1)
#Higher area under the curve the better the fit (AUC)
auc = performance(pred, "auc")
auc = unlist(slot(auc, "y.values"))
auc = round(auc, 2)
legend(.6, .2, auc, title = "Area under the Curve", cex = .75)
In this tutorial we illustrated how to implement logistic regression and conduct model diagnostics. We used the pima
dataset to study how the diabetes detection indicator is associated to variables, such as: age, BMI, glucose and the diabetes pedigree score of women belonging to a Native Indian tribe. We used the Backward Elimination method to decide on the model specification for the logistic regression model. In order to assess the fit of our model we employed techniques such the Hosmer-Lemeshow Test and the ROC Curve, which led us to the conclusion that our model specification is a good fit for these data and hence our findings are meaningful.
Hosmer, D. & Lemeshow, S. (2000). Applied Logistic Regression (Second Edition). New York: John Wiley & Sons, Inc. UCLA Institute for Digital Research and Education website: https://stats.idre.ucla.edu/r/dae/logit-regression/
This page uses the following packages in Python. Please make sure that you can load them before trying to run the examples on this page.
import pandas as pd ## data analysis package
import matplotlib.pyplot as plt ## plot package
import warnings
warnings.filterwarnings("ignore")
import statsmodels.formula.api as smf
import statsmodels.api as sm ## statistical model package
from sklearn.metrics import confusion_matrix ## logistic regression report packages
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import classification_report
from scipy.stats import chi2 ## chi-square
In this tutorial, we use pima
data as our dataset. In order to fit the package in Python. We first divide our model into dependent variable and independent variables. Besides, since all independent variables here are numeric, we do not need to change them into factors. However, when utilizing sm.Logit()
in Python, it doesn’t include intercept, so we add a new column named intercept in the independent variable. The code is as following:
## Load data
data=pd.read_csv("pima.txt",
delim_whitespace=True,
skipinitialspace=True)
data['intercept'] = 1.0
Once the dataframe pima
is loaded, we can view the summaries of the variables by using the function in Python, \(data.describe()\).
## Summary of data
pd.set_option('display.max_columns', None)
describe = data.describe()
print(describe)
Then we clean the rows whose bmi
, diastolic
, glucose
, triceps
and insulin
equals 0 with following code:
## Data Cleaning
data = data[data['bmi']>0]
data = data[data['diastolic']>0]
data = data[data['glucose']>0]
data = data[data['triceps']>0]
data = data[data['insulin']>0]
y=data['test']
X=data.drop(['test'], axis=1)
After we do the data cleaning, we have the result as follows:
## Summary of data after data cleaning
describe = data.describe()
print(describe)
We first use a logistic model with test
as the response variable and all the other variables as explanatory.
## Logistic regression
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())
In the output above, we can gain the estimated coefficients, standard error, confidence interval of coefficients, z-value (Wald z-statistic) and its corresponding p-values. Here, we can find that variable pregnant
, glucose
, bmi
, diabete
, intercept
are all statistically significant, while triceps
, insulin
, age
, diastolic
are not. The result above can be interpreted as following:
\[ log \frac{p}{1-p} =-10.04+ 0.082\times pregnant+ 0.038 \times glucose -0.001 \times diastolic +0.011 \times triceps -0.0008 \times insulin + 0.071 \times bmi+ 1.141 \times diabetes +0.034 \times age \]
Backward stepwise with BIC criterion, which involves starting with all candidate variables, testing the deletion of each variable using a chosen model fit criterion, deleting the variable (if any) whose loss gives the most BIC deterioration of the model fit, and repeating this process until no further variables can be deleted without a BIC loss of fit.
There is no built-in method for choosing a regression model by forward selection. So we try to compare change in BIC when deleting each feature and we have the result as following
## Backward stepwise with BIC
# Triceps individually highly insignificant so drop the variable and assess the model
X1=X.drop(['triceps'], axis=1)
logit_model1=sm.Logit(y,X1)
result1=logit_model1.fit()
print(result1.bic)
The BIC of original model is 398, after deleting triceps
, BIC declines to 392. We follow the same step and delete diastolic
and have the result as following:
# Drop diastolic (individually statistically insignificant) and assess the model
X2=X.drop(['triceps','diastolic'], axis=1)
logit_model2=sm.Logit(y,X2)
result2=logit_model2.fit()
print(result2.bic)
After deleting diastolic
, BIC declines to 386. We follow the same step and delete insulin
and have the result as following:
# Drop insulin (indiviadually insignificant) and assess the model
X3=X.drop(['triceps','diastolic','insulin'], axis=1)
logit_model3=sm.Logit(y,X3)
result3=logit_model3.fit()
print(result3.bic)
After deleting insulin
, BIC declines to 381. We now follow the same step and delete pregnant
and finally have the model as:
# Drop age (indiviadually insignificant) and assess the model
X4=X.drop(['triceps','diastolic','insulin', 'age'], axis=1)
logit_model4=sm.Logit(y,X4)
result4=logit_model4.fit()
print(result4.summary2())
We can find that the final model only contain 4 variables and intercept. All of them are statistically significant. The result above can be interpreted as following: \[log \frac{p}{1-p} =-10.092+ 0.036 \times glucose+ 0.074 \times bmi+ 1.087 \times diabetes +0.053 \times age\]
In this part, we fit model with the model we derived in the last part.
# Fitting value
y_prob = result4.predict(X4)
We first calculate the accuracy of our classification model:
## Test accuracy
pred = result4.predict(X4)
pred[pred > 0.5] = 1
pred[pred <= 0.5] = 0
test_acc = (y == pred).mean()
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(test_acc))
The Hosmer-Lemeshow Test is a goodness-of-fit test for logistic regression models. We use this to compare the fitted values and observed values from the dataset.
# Hoslem-Lemeshow Test
y_prob = pd.DataFrame(y_prob)
y_prob1 = pd.concat([y_prob, y], axis =1)
y_prob1['decile'] = pd.qcut(y_prob1[0], 10)
obsevents_pos = y_prob1['test'].groupby(y_prob1.decile).sum()
obsevents_neg = y_prob1[0].groupby(y_prob1.decile).count() - obsevents_pos
expevents_pos = y_prob1[0].groupby(y_prob1.decile).sum()
expevents_neg = y_prob1[0].groupby(y_prob1.decile).count() - expevents_pos
hl = ((obsevents_neg - expevents_neg)**2/expevents_neg).sum()+((obsevents_pos - expevents_pos)**2/expevents_pos).sum()
print('chi-square: {:.2f}'.format(hl))
## df = group-2
pvalue=1-chi2.cdf(hl,8)
print('p-value: {:.2f}'.format(pvalue))
obsevents_pos = pd.DataFrame(obsevents_pos)
obsevents_neg = pd.DataFrame(obsevents_neg)
expevents_pos = pd.DataFrame(expevents_pos)
expevents_neg = pd.DataFrame(expevents_neg)
final = pd.concat([obsevents_pos, obsevents_neg, expevents_pos, expevents_neg], axis =1)
final.columns=['obs_pos','obs_neg','exp_pos', 'exp_neg']
print(final)
The observed number of successes and failures in each interval are obtained by counting the subjects in that interval. The expected number of successes in an interval is the sum of the probability of success for the subjects in that interval.
The Hosmer-Lemeshow statistic is calculated from the above table and we final calculate p-value as follows:
Since the p-value is high, we fail to reject the null hypothesis that the expected and observed probabilities are equal.
# Confusion matrix
confusion_matrix = confusion_matrix(y, pred)
print(confusion_matrix)
The (1,1) entry means real value equals 1 and predict value equals 1. The (1,2) entry means real value equals 1 and predict value equals 0. The (2,1) entry means real value equals 0 and predict value equals 1. The (2,2) entry means real value equals 0 and predict value equals 0. Thus, we have 233+79 correct predictions and 51+29 false predictions.
# classification report
print(classification_report(y, pred))
When the prediction effect is good, the ROC curve is convex to the apex of the upper left corner. The diagonal line in the translation is tangent to the ROC curve to obtain a point where the TPR is large and the FPR is small. The better the model effect, the farther away the ROC curve is from the diagonal. In the extreme case, the ROC curve passes through the (0,1) point, that is, the positive example is all positive and the negative example is all negative. The area under the ROC curve can quantitatively evaluate the effect of the model, which is recorded as AUC. The larger the AUC, the better the model effect.
# ROC curve
logit_roc_auc = roc_auc_score(y, pred)
fpr, tpr, thresholds = roc_curve(y, y_prob)
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
In this tutorial we illustrated how to implement logistic regression, select a model and conduct model diagnostics in Python. We were able to fit a model using Backward Elimination and the BIC criterion, then assess the resulting model using the classification matrix, Hosmer-Lemeshow test, and ROC curve. The resulting model was found to have good fit and accuracy.
For STATA, all functions for the tests in this tutorials are built-in, so there is no need to install or load any additional packages prior to performing the analysis in this tutorial.
The data set was exported from R into a comma-separated variable file pima.csv
. We can import and summarize the pima dataset from a CSV file using the following:
import delimited using pima.txt, clear
summarize
We see from the summary that the data for bmi
, diastolic
, glucose
, and triceps
contain 0 values, which are not possible measures of these parameters. We must therefore clear rows with these values from our dataset. We can do this as follows:
drop if glucose == 0 | diastolic == 0 | triceps ==0 | insulin == 0 | bmi == 0
summarize
We can see that we now have 392 observations from the initial 768, and all values are now within reasonable range.
For this model, our response variable is test
which indicates whether or not a patient tested positive for diabetes. Note that the rest of the variables are numeric and continuous, so we can use them as-is in our regression model. Using all other variables as independent variables, we can generate the full model using this code:
logit test pregnant glucose diastolic triceps insulin bmi diabetes age
We now want to determine the best fit model for the given data. Starting with the full model, we use backward elimination to drop variables one-by-one and the Bayesian Information Criterion (BIC) to determine if the model is an improvement from the previous model.
We first get the BIC of the model using the following command:
estat ic
To determine the order at which to drop the variables, we can use Stata’s stepwise
function using a 95% confidence level.
stepwise, pr(0.05): logit test pregnant glucose diastolic triceps insulin bmi diabetes age
From this, we get the dropping order diastolic
, insulin
, triceps
, pregnant
.
First, we drop diastolic
:
// Model 1 (drop diastolic):
quietly logit test pregnant glucose triceps insulin bmi diabetes age
estat ic
From this, we get a BIC of 391.8057, an improvement from the full model. We then drop insulin
:
// Model 2 (drop insulin):
quietly logit test pregnant glucose triceps bmi diabetes age
estat ic
Our BIC is now 386.223. Dropping the variable triceps
, we get another model iteration:
// Model 3 (drop triceps):
quietly logit test pregnant glucose bmi diabetes age
estat ic
This gives us a BIC of 380.7127, still an improvement from the previous. Lastly, dropping pregnant
, we use the following code and get the model summary:
// Model 4 (drop pregnant):
logit test glucose bmi diabetes age
estat ic
We see that all variables are now signifcant and we get a minimized BIC of 377.0913 for this model.
Now that we have a final model, we can create a new column p
which contains the fitted values of our model using the following command:
predict p
We use the Hosmer-Lemeshow test to assess the goodness-of-fit of the model. Our null hypothesis is that the fitted values from the model are the same as what is observed.
We group the data into deciles and perform the Hosmer-Lemeshow Test using the following code:
estat gof, group(10) table
Since the p-value is greater than \(\alpha = 0.05\), we fail to reject the null hypothesis that the observed and expected proportions are the same across all groups. We can conclude that the observed and expected proportions are equal, meaning that the model selected is a good fit.
To determine the accuracy of our model, we can generate a classification report using the following command:
estat classification
The results show around a model accuracy of about 80%.
These results, specifically the False Positive Rate and True Positive Rate can be plotted on a Receiving Operator Characteristic (ROC) Curve using the following command:
lroc, title("Receiving Operator Characteristic Curve") xtitle("False Positive Rate") ytitle("True Positive Rate")
We get an area under the ROC curve of 0.8605, which is close to 1.00, indicative of good model accuracy.
This tutorial illustrated how to logistic regression can be used on the pima
dataset, and how diagnostics can be done on the fitted model using STATA. The test
variable was used as the response variable to fit a model that predicts the probability of showing positive signs of diabetes based on the rest of the variables in the dataset. Using Backwards Elimination, we determined the best model which had glucose
, bmi
, diabetes
, and age
as explanatory variables. The Hosmer-Lemeshow test showed that the model predicted values close to those observed. The classification matrix and ROC Curve both indicated that the model had high accuracy.