The purpose of our group is to compare the performance between Probit Regression and Logistic Regression in ‘Communities and Crime’ Data Set with Stata, R and Python.
Introduction to Probit Regression and Logistic Regression
When dealing with the dataset that includes the binary dependent variable, the logistic regression can also be called logit model. The logit model and probit model are both used to do nonlinear estimation on dichotomous or binary dependent variables. These two models both share the same basic idea of applying a link function to binary dependent variables to transform it to continuous variables on the set of real numbers. After the transformation, linear regression can be applied to the new data with the original independent variables and transformed dependent variables. The main difference between the logit model and probit model is the link function they used. The link function of probit model is the inversed cumulative distribution function of standard normal distribution \[\Phi^{-1}(p)\] The link function of logit model is the log of odds ratio (log odds) \[log(\frac{p}{1-p})\] The logit function’s shape is very similar to the standard normal distribution, but the normal distribution has heavier tails. After the transformation, the maximum likelihood estimation will be adopted to estimate the parameters that can maximize the probability of what has been observed.
Description of Data
The dataset we used is ‘Communities and Crime’ Data Set from UCI Machine Learning Repository. This data set is made up of US Census socio-economic data from 1990, US LEMAS survey data about law enforcement from 1990, and information about crime from the 1995 FBI UCR.
We chose ViolentCrimesPerPop as response variable and Population, perCapInc, PctPopUnderPov, pctUrban as predictor variables. Those variables are listed with their definitions in the table below. ViolentCrimesPerPop is a continuous variable but both logit and probit regression require a binary outcome. We made our response binary by classifying observations of 0.5 or higher as high crime communities and observations of less than 0.5 as low crime communities. We chose 0.5 as the cut off because it corresponds to 1 violent crime for every 200,000 people in the community. Our task is to figure out what is more likely to cause high violent crime rate based on community information. The probit and logistic regression are carried out to handle this challenge.
Name | Definition |
---|---|
Population | population for community |
PctUrban | percentage of people living in areas classified as urban |
PerCapInc | per capita income |
PctPopUnderPov | percentage of people under the poverty level |
ViolentCrimesPerPop | total number of violent crimes per 100K popuation |
Outline of Examples
Below are three examples using probit and logit regression. They all include a probit regression and a discussion of results and diagnostics followed by a logit regression with a similar discussion. One of the diagnostic tests we run is the Hosmer-Lemeshow (HL) test which is also dicussed by Group 5. You could look at their project if you would like more examples.
The different languages have different capabilities for tests, so the examples don’t match exactly. First is an example in Python created by Jinwen. She created her own function to run the HL test, so look there for a function by hand and more technical information about the test. Next is an example in R by Xian. This example expands on Jinwen’s. Last is an example in Stata by Olivia. Stata has the easiest access to the tests and statistics we were interested in, so this example includes the longest discussion.
Python Part
1) Import libraries needed
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
import pysal
from scipy.stats import chi2, norm
2) Load and clean the data
Before implementing the models, we first load the data in our working directory. Having noticed that this dataset used ‘?’ as missing values, we replace it with np.nan so that the missing values can be figured out by Python. And then we select the independent variables we need which are ‘population’ (col6), ‘pctUrban’ (col17), ‘perCapInc’ (col26), ‘PctPopUnderPov’ (col34) and dependent variable ‘ViolentCrimesPerPop’ (col128). And then we drop all the lines that include missing values. And the top 5 lines of the dataset are shown below.
with open('communities.data','rb') as f:
file_content = list(item.decode() for item in f)
raw_data = []
for item in file_content:
item_rp = item.rstrip('\r\n')
item_sp = item_rp.split(',')
raw_data.append(item_sp)
col_names = range(128)
f = pd.read_csv('communities.data', header = None)
raw_data = f.replace('?', np.nan)
mydata = raw_data.iloc[:,[6-1, 17-1, 26-1, 34-1, 128-1]]
mydata.columns = ['population', 'pctUrban',
'perCapInc', 'PctPopUnderPov',
'ViolentCrimesPerPop']
data = mydata.dropna()
data.head()
3) Recode the data
As this tutorial mainly focuses on the binary outcome variable, we create a cut-off point for the dependent variable to transform it from continuous to binary. We set the value of ViolentCrimesPerPop (total number of violent crimes per 100K popuation) to be 1 when it is larger than or equal to 0.5 and to be 0 when it is smaller than 0.5. The following table is the summary of the recoded data. We can see that the data is standardized — the range of each variables are [0, 1].
data['ViolentCrimesPerPop'][ data['ViolentCrimesPerPop'] > 0.5 ] = 1
data['ViolentCrimesPerPop'][ data['ViolentCrimesPerPop'] <= 0.5 ] = 0
print(data.describe())
X = data.iloc[:,:4].values
Y = data.iloc[:, 4].values[:, None]
4) Define some functions for future models
1) Function for HL-test
We will apply Hosmer-Lemeshow test to compare the goodness of fit in probit and logit model. However, Python doesn’t have any package for this test, So, we define the HL test function based on its concept and output the chi-squared statistic and p-value for it.
def hl_test(data, g):
'''
Hosmer-Lemeshow test to judge the goodness of fit for binary data
Input: dataframe(data), integer(num of subgroups divided)
Output: float
'''
data_st = data.sort_values('prob')
data_st['dcl'] = pd.qcut(data_st['prob'], g)
ys = data_st['ViolentCrimesPerPop'].groupby(data_st.dcl).sum()
yt = data_st['ViolentCrimesPerPop'].groupby(data_st.dcl).count()
yn = yt - ys
yps = data_st['prob'].groupby(data_st.dcl).sum()
ypt = data_st['prob'].groupby(data_st.dcl).count()
ypn = ypt - yps
hltest = ( ((ys - yps)**2 / yps) + ((yn - ypn)**2 / ypn) ).sum()
pval = 1 - chi2.cdf(hltest, g-2)
df = g-2
print('\n HL-chi2({}): {}, p-value: {}\n'.format(df, hltest, pval))
2) Function for p-values
Due to the lack of the function for outputting the significance of each variables in sklearn logistic packages, we write a function to print out p-values for each independent variables in logistic model in sklearn package.
def logit_p(skm, x):
'''
Print the p-value for sklearn logit model
(The function written below is mainly based on the stackoverflow website -- P-value function for sklearn[3])
Input: model, nparray(df of independent variables)
Output: none
'''
pb = skm.predict_proba(x)
n = len(pb)
m = len(skm.coef_[0]) + 1
coefs = np.concatenate([skm.intercept_, skm.coef_[0]])
x_full = np.matrix(np.insert(np.array(x), 0, 1, axis = 1))
result = np.zeros((m, m))
for i in range(n):
result = result + np.dot(np.transpose(x_full[i, :]),
x_full[i, :]) * pb[i,1] * pb[i, 0]
vcov = np.linalg.inv(np.matrix(result))
se = np.sqrt(np.diag(vcov))
t = coefs/se
pval = (1 - norm.cdf(abs(t))) * 2
print(pd.DataFrame(pval,
index=['intercept','population','pctUrban',
'perCapInc', 'PctPopUnderPov'],
columns=['p-value']))
5) Fit the model
Now we can fit our different models for the manipulated dataset and apply the HL test.
1) Probit model
pbt = pysal.spreg.Probit(Y, X,
name_x=['population', 'pctUrban',
'perCapInc', 'PctPopUnderPov'],
name_y='ViolentCrimesPerPop',
name_ds='Coummunities and Crime')
data['prob'] = pbt.predy
print(pbt.summary)
p_probit = hl_test(data, 5)
From the summary of the probit model above, we can see that perCapInc variable is insignificant, while other 3 variables are significant. According to the meaning of these varibales, the population for community, percentage of people living in areas and the percentage of people under the poverty level do have enough influence on the crime rate in communities, while the income of people doesn’t have significant influence on the crime rate. And according to the HL test, we can see that the p-value is very close to 0, so we have to reject the null hypthesis. Thus, the probit model didn’t fit the data well.
2) Logit model
lr = LogisticRegression(C=100.0, random_state=0)
lr.fit(X, Y)
data['prob'] = lr.predict_proba(X)[:, 1]
print(logit_p(lr, X))
p_logit = hl_test(data, 5)
From the summary of logit model, we can also see that except variable perCapInc, other 3 variables are significant. And from the figure above, we can see that the p-value of the HL test in logit model is also very close to 0. So, we have to reject the null hypothesis and concludes that it is a poor fit. Compared the result of HL test in logit model with that of the probit model, we can see that there is slight difference. Thus, the logit model and probit model both didn’t fit the dataset well.
R Part
1) Loading data
head(df3)
## Population PctUrban PerCapInc PctPopUnderPov ViolentCrimesPerPop
## 1 0.19 1.0 0.40 0.19 0.20
## 2 0.00 1.0 0.37 0.24 0.67
## 3 0.00 0.0 0.27 0.27 0.43
## 4 0.04 1.0 0.36 0.10 0.12
## 5 0.01 0.9 0.43 0.06 0.03
## 6 0.02 1.0 0.72 0.12 0.14
## Highcrime
## 1 0
## 2 1
## 3 0
## 4 0
## 5 0
## 6 0
2) Probit model
The code below estimates a probit regression model using the glm (generalized linear model) function.
myprobit = glm(Highcrime~Population+PctUrban+PerCapInc+PctPopUnderPov, family=binomial(link = "probit"), data=df3)
summary(myprobit)
##
## Call:
## glm(formula = Highcrime ~ Population + PctUrban + PerCapInc +
## PctPopUnderPov, family = binomial(link = "probit"), data = df3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1337 -0.4633 -0.2877 -0.1829 2.9356
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.5832 0.2234 -11.561 < 2e-16 ***
## Population 1.6337 0.2691 6.070 1.28e-09 ***
## PctUrban 0.7187 0.1072 6.706 2.00e-11 ***
## PerCapInc -0.5417 0.4027 -1.345 0.179
## PctPopUnderPov 2.6762 0.2596 10.310 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1607.1 on 1993 degrees of freedom
## Residual deviance: 1204.4 on 1989 degrees of freedom
## AIC: 1214.4
##
## Number of Fisher Scoring iterations: 6
The deviance residuals are a measure of model fit. This part of output shows the distribution of the deviance residuals for individual cases used in the model. The next part of the output shows the coefficients, their standard errors, the z-statistic and the associated p-values.
From the result we can see that “Population, PctPopUnderPov, pctUrban” are statistically significant while “perCapInc” doesn’t have good performance.
Below the table of coefficients are fit indices, including the null and deviance residuals and the AIC. Also below we show an example of how you can use these values to help assess model fit. In addition, we can obtain confidence intervals for the coefficient estimates that is based on the standard error and the normal assumption.
confint(myprobit)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -3.0167613 -2.1492940
## Population 1.1273463 2.1510192
## PctUrban 0.5160104 0.9259521
## PerCapInc -1.3582403 0.2144111
## PctPopUnderPov 2.1746639 3.1801548
We can test for an overall effect of “Population, PctPopUnderPov, pctUrban, perCapInc” using the wald.test function.
wald.test(b = coef(myprobit), Sigma=vcov(myprobit), Terms=2:5)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 322.7, df = 4, P(> X2) = 0.0
The chi-squared test statistic of 322.7 with 4 degrees of freedom is associated with a p-value of less than 0.001 indicating that the overall effect of the 4 predictors is statistically significant. Then we can compute the difference in deviance and test it with chi-square test.
df3[, c("p","se")]=predict(myprobit, df3, type = "response", se.fit = TRUE)[-3]
with(myprobit, pchisq(null.deviance-deviance, df.null-df.residual, lower.tail = FALSE))
## [1] 7.037114e-86
We can also see measures of how well our model fits. The chi-square of 305.8052 with 3 degrees of freedom and an associated p-value of less than 0.001 tells us that our model as a whole fits significantly better than an empty model.
Then we use the Hosmer-Lemeshow test to test the probit model.
hoslem.test(myprobit$y, fitted(myprobit), g=5)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: myprobit$y, fitted(myprobit)
## X-squared = 29.956, df = 3, p-value = 1.41e-06
The group number is set as 5 because this will be easy to compare the results with other language. The p-value is 1.41e-06 which is very close to zero. This means we have reject the null hypothesis for the HL test that the model has a good fit. The H-L test is not reliable. Therefore we should trust the result of Wald test instead of the H-L test.
3) Logistic model
Just like the probit regression, the code below estimates a logit regression model using the glm (generalized linear model) function.
mylogit=glm(Highcrime~Population+PctUrban+PerCapInc+PctPopUnderPov, data=df3, family = "binomial")
summary(mylogit)
##
## Call:
## glm(formula = Highcrime ~ Population + PctUrban + PerCapInc +
## PctPopUnderPov, family = "binomial", data = df3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1657 -0.4651 -0.3097 -0.2128 2.8491
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.3882 0.4413 -9.944 < 2e-16 ***
## Population 2.8216 0.4667 6.046 1.48e-09 ***
## PctUrban 1.3755 0.2004 6.864 6.71e-12 ***
## PerCapInc -1.3555 0.8156 -1.662 0.0965 .
## PctPopUnderPov 4.5111 0.4839 9.323 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1607.1 on 1993 degrees of freedom
## Residual deviance: 1218.2 on 1989 degrees of freedom
## AIC: 1228.2
##
## Number of Fisher Scoring iterations: 6
confint.default(mylogit)
## 2.5 % 97.5 %
## (Intercept) -5.2531112 -3.5232938
## Population 1.9069412 3.7362885
## PctUrban 0.9826963 1.7682535
## PerCapInc -2.9540894 0.2430434
## PctPopUnderPov 3.5627340 5.4595066
We can see that Logistic model has similar result as the Probit model. All the predictors are statistically significant except for perCapInc. (The p-value for perCapInc in Logistic model is lower than in Probit model)
wald.test(b=coef(mylogit), Sigma=vcov(mylogit), Terms=2:5)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 285.1, df = 4, P(> X2) = 0.0
The chi-squared test statistic of 285.1 ( a little bit lower than the one in probit regression) with 4 degrees of freedom is associated with a p-value of less than 0.001 indicating that the overall effect of the 4 predictors is statistically significant.
hoslem.test(mylogit$y, fitted(mylogit), g=5)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: mylogit$y, fitted(mylogit)
## X-squared = 27.669, df = 3, p-value = 4.263e-06
The group number is set as 5 as the probit model. From the result of H-L test in Logistic model, we can see that the p-value is 4.263e-06 which is similar with the result in Probit model. We should take the result of wald-test and give up the result of unreliable H-L test.
STATA Part
In this example, the models will be evaluated using Pseudo \(R^2\), Wald tests, Pearson’s test and the Hosmer-Lemeshow test. All of these diagnostic tests or values are more easily available in Stata than they are in R or python, so we thought it was worth while to explore all of them briefly below.
1) Import data set
Below is the code to import the data set to Stata. The data loaded is a subset of the original data that only includes the relevant columns. The columns are all named according to their attribute names from the data description on the UCI Machine Learning Repository. Additionally, the data set includes the response variable highcrime. Highcrime was made by taking ViolentCrimesPerPop and classifying observations of 0.5 or higher as high crime communities (1) and observations of less than 0.5 as low crime communities (0). In total, the data set includes the variables highcrime, percapinc, pctpopunderpov, pcturban, and population. The file CommunitiesCrimeData.csv is available in the github repository.
clear
import delimited CommunitiesCrimeData.csv
2) Probit model
First, we run a probit regression with the data. The syntax for this command is to give the dependent variable followed by any independent variables. If, in and weight options are also avaible, but we didn’t use them in this example.
probit highcrime percapinc pctpopunderpov pcturban population
Here is the output for the probit regression. One measure of goodness of fit that can be accessed at this point is Pseudo \(R^2\) which is 0.2487 for this model. Pseudo \(R^2\) is used for probit regression because the typical ordinary least squares (OLS) \(R^2\) does not apply. This measure should be used cautiously because it is difficult to interpret and does not represent the proportion of the variance explained by the independent variables like the OLS \(R^2\).
The coefficients for the probit model can be interpreted as the difference in Z score associated with each one-unit difference in the predictor variable. Three of the four coefficients are significant according to a wald test. The coefficient for per capita income is close to significant but not quite there.
We can access more information on goodness of fit by using a postestimation command that depends on the most recent model. The command is gof for goodness of fit.
estat gof
This command performs a Pearson’s goodness of fit test. The p-value for this test (the Prob > chi2 line) is 1, so we fail to reject the null that the model is a good fit.
We can also perform a Hosmer-Lemeshow test for goodness of fit. The code is very similar to the code for the Pearson test, but we specify the number of groups. Adding table to the command tells Stata to make the table of observation counts for the different groups.
estat gof, group(5) table
Typically, the Hosmer-Lemeshow specifies 10 groups. However, in the case when there are small group counts, it is recommended that we choose a smaller number of groups. Here we’ve reduced the number of groups here to 5. Even with the reduced number of groups, the first two groups still have low counts of high crime communities. These low group counts mean that the results of this test are unreliable.
The p-value for this test is > 0.001, which tells us to reject the null hypothesis that the model is a good fit. This contradicts the conclusion we came to with Pearson’s test. We showed above that this test may not be reliable, so we should trust the Pearson’s test results instead of these results.
3) Logit model
Next we run a logistic regression. The syntax for this command is the same as for the probit command.
logistic highcrime percapinc pctpopunderpov pcturban population
Here is the output for the logistic regression. Again, Pseudo \(R^2\) is available. For this model it is 0.2402. Similar to probit regression, Pseudo \(R^2\) is used for probit regression because the typical OLS \(R^2\) does not apply.
The coefficients are given as odds ratios. Exponentiating those quantities will give us information on the relationship between the independent variables and the odds of being a highcrime community. According to wald tests for significance, all four independent variables are significant.
We can access more information on goodness of fit by using a postestimation command that depends on the most recent model.
estat gof
This command performs a Pearson’s goodness of fit test. The p-value for this test (the Prob > chi2 line) is 1, so we fail to reject the null that the model is a good fit.
We can also perform a Hosmer-Lemeshow test for goodness of fit.
estat gof, group(5) table
Similarly to the probit case, we’ve reduced the number of groups here to 5. Even with the reduced number of groups, the first two groups still have low counts of high crime communities. The low group counts mean that the results of this test are unreliable.
The p-value for this test is > 0.001, which tells us to reject the null hypothesis that the model is a good fit. However, this contradicts the conclusion we came to with Pearson’s test, and because we established that this test may not be reliable, we should trust the Pearson’s test results.
Overall, the logistic and probit models perform similarly. The psuedo \(R^2\) is slightly higher for the probit model in this situation. Three of the four predictors are significant in the probit model, while all four are significant in the logistic model. This difference may lead to slightly different final models if we were using these models as intermediate steps in our model building process. Both models have good fit according the Pearson’s goodness of fit test. Both models were poor fitting according to the Hosmer-Lemeshow test, but this test is likely to be unreliable due to low group counts. The major difference is the link function and not performance.
Conclusion
In all three examples, you can see that logistic and probit regression performed similarly. Running the models and doing the diagnostic tests were relatively easy and short in both Stata and R. However, it was inconvient in Python because Python doesn’t have any packages for tests such as the HL test. Also, logistic regression in the sklearn package doesn’t include the functions to access the values for statistical inference such as the p-values of each independent variable. It is necessary to code the tests from scratch. And the p-value computed in probit model using the Pysal package is comparatively inaccurate, which also illustrates that Python is not as good as R and Stata for dealing with statistics problems. This may influence your language choice if you are running similar analysis.
Your choice of link function may depend on the audience you will present to. Logit models are common in biological and epidemiological settings because odds ratios are common in those settings. Probit models more often arise in econometric or political science settings.
Reference