Algorithm Description

The purpose of this tutorial is to demonstrate logistic regression in Stata, R and Python. The following is a brief summary of logistic regression.

We have a binary output variable \(Y\), and we want to model the conditional probability \(Pr(Y = 1|X = x)\) as a function of \(x\); any unknown parameters in the function are to be estimated by maximum likelihood. By now, it will not surprise you to learn that statisticians have approached this problem by asking themselves “how can we use linear regression to solve this?”

The easiest transformation of the probability \(p\) that has an unbounded range is the logistic (or logit) transformation \[ \log\frac{p}{1-p} . \] We can make this a linear function of \(x\) without fear of nonsensical results. (Of course the results could still happen to be wrong, but they’re not guaranteed to be wrong.)

Formally, the logistic regression model is that \[ \log\frac{p(x)}{1-p(x)} = \beta_{0} + \sum_{i=1}^{k} \beta_{i}x_{i} , \] where \(k\) is the number of predictors. Solving for \(p\), this gives \[ p(x) = \frac{e^{\beta_{0} + \sum_{i=1}^{k} \beta_{i}x_{i}}}{1+e^{\beta_{0} + \sum_{i=1}^{k} \beta_{i}x_{i}}} = \frac{1}{1+e^{-(\beta_{0} + \sum_{i=1}^{k} \beta_{i}x_{i})}} . \] Notice that the overall specification is a lot easier to grasp in terms of the transformed probability than in terms of the untransformed probability.
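
To see how the two forms fit together, here is a small illustrative R sketch (the coefficient values are made up for the example, not estimated from any data) that turns a linear predictor into a probability and back:

### R ###
# Inverse logit: map a linear predictor eta = b0 + sum(b_i * x_i) to a probability
inv_logit = function(eta) 1 / (1 + exp(-eta))

# Illustrative (made-up) coefficients beta_0, beta_1, beta_2 and one observation x
beta = c(-1.5, 0.8, 2.0)
x    = c(1.0, 0.5)

eta = beta[1] + sum(beta[-1] * x)  # linear predictor: -1.5 + 0.8*1.0 + 2.0*0.5 = 0.3
p   = inv_logit(eta)               # P(Y = 1 | X = x), about 0.574
log(p / (1 - p))                   # the logit transformation recovers eta = 0.3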

Data Summary

The sinking of the RMS Titanic was a huge tragedy, killing 1502 of the 2224 passengers and crew. One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people, such as women, children, and the upper class, were more likely to survive than others.

The sinking Titanic

We obtain the passenger records from here. The dataset includes 891 passengers. In addition to the passenger ID and survival status, there are ten other variables; their descriptions are listed below.

Variable Description Data_Type
Survived Whether the corresponding passenger survived. 1 represents yes, and 0 represents no. Categorical. 0/1
Pclass The cabin class of the corresponding passenger. Categorical. 1/2/3
Name Name of the passenger. String.
Sex Gender of the passenger. Categorical. male/female
Age Age of the passenger. Positive real number.
SibSp # of siblings / spouses aboard. Non-negative integer.
Parch # of parents / children aboard. Non-negative integer.
Ticket Ticket number. String.
Fare Ticket fare. Non-negative real number.
Cabin Cabin number of the passenger. String.
Embarked Port of embarkation. Categorical. S/C/Q

Our task is to figure out who was more likely to survive this tragedy based on their personal information. Logistic regression is carried out to handle this challenge.

R

Steps

i) Loading data

This dataset has a binary response variable ‘Survived’ (1 for survival, 0 for death). Based on the data types and experience, we choose ‘Pclass’, ‘Sex’, ‘Age’, ‘SibSp’, ‘Parch’ and ‘Fare’ as predictors. We treat ‘Pclass’ and ‘Sex’ as categorical factors and the others as continuous. ‘Pclass’ takes on values 1, 2 and 3, where class 1 represents the highest rank of cabin and 3 the lowest.

### R ### 
# Loading data
titanic = read.csv('~/Downloads/Titanic_train.csv')
# View the first few rows of data
head(titanic)[,c(1:3,5:8,10)]
##   PassengerId Survived Pclass    Sex Age SibSp Parch    Fare
## 1           1        0      3   male  22     1     0  7.2500
## 2           2        1      1 female  38     1     0 71.2833
## 3           3        1      3 female  26     0     0  7.9250
## 4           4        1      1 female  35     1     0 53.1000
## 5           5        0      3   male  35     0     0  8.0500
## 6           6        0      3   male  NA     0     0  8.4583

ii) Obtaining Summary Statistics

1. Data Summary

For simplicity, we recode Sex so that ‘male’ becomes 1 and ‘female’ becomes 0. We can get basic descriptive statistics for the entire data set by using the summary function.

### R ###
# Redefine Sex: 'male' as 1 and 'female' as 0
titanic$Sex = (titanic$Sex=='male')*1

# Subset the data including the columns we want
titanic_sub = subset(titanic,select = c(Survived,Pclass, Sex, 
                                        Age, SibSp, Parch, Fare))
# Summary data
summary(titanic_sub)
##     Survived          Pclass           Sex              Age       
##  Min.   :0.0000   Min.   :1.000   Min.   :0.0000   Min.   : 0.42  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:20.12  
##  Median :0.0000   Median :3.000   Median :1.0000   Median :28.00  
##  Mean   :0.3838   Mean   :2.309   Mean   :0.6476   Mean   :29.70  
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:38.00  
##  Max.   :1.0000   Max.   :3.000   Max.   :1.0000   Max.   :80.00  
##                                                    NA's   :177    
##      SibSp           Parch             Fare       
##  Min.   :0.000   Min.   :0.0000   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   Max.   :512.33  
## 

2. Two-way Contingency Table

A two-way contingency table gives us a direct, basic view of the relationship between the response and a predictor. For example, we tabulate ‘Survived’ and ‘Pclass’ to show the impact of cabin class on the probability of survival. From the table, we can roughly conclude that the odds of survival are larger when the cabin class is higher.

### R ###
# 2-way contingency table 
xtabs(~Survived + Pclass, data = titanic_sub)
##         Pclass
## Survived   1   2   3
##        0  80  97 372
##        1 136  87 119
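
For instance, using the counts in the table above, we can compute the raw odds of survival within each cabin class directly:

### R ###
# Odds of survival by cabin class, from the contingency table above
136 / 80    # class 1: odds of about 1.70
 87 / 97    # class 2: odds of about 0.90
119 / 372   # class 3: odds of about 0.32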

3. Missing Values and Training/Testing Sets

Missing values appear frequently when we deal with big datasets. There are many different ways to handle this problem, such as deleting the affected rows or imputing the missing values. Here we replace the missing values in ‘Age’ by the average age.

Holding out a testing set from the original dataset is a common way to evaluate the goodness of fit. The ratio of training data to testing data is usually 8:2 or 9:1. Here, we use the first 90% of the data as training data and the last 10% as testing data.

### R ###
# Replace NAs in Age by the average age
titanic_sub$Age[is.na(titanic_sub$Age)] = mean(titanic_sub$Age
                                               [!is.na(titanic_sub$Age)])
# Define the training set and testing set
train_set = titanic_sub[1:802,]
test_set = titanic_sub[803:891,]

iii) Logistic Model

1. Fitting Model, Summary and Interpretation.

Before fitting the logistic model, we need to declare the categorical variables (Pclass and Sex) as factors. To get the detailed fitting results, we use the summary command.

### R ###
# set the class of Pclass and Sex as factor
train_set$Pclass = as.factor(train_set$Pclass)
train_set$Sex = as.factor(train_set$Sex)

# fit the logistic model
logit_fit = glm(Survived ~ ., data = train_set, family = "binomial")

# Summary the output
summary(logit_fit)
## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = train_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6455  -0.6173  -0.4280   0.6028   2.3910  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.754748   0.468403   8.016 1.09e-15 ***
## Pclass2     -0.953840   0.309265  -3.084  0.00204 ** 
## Pclass3     -2.113095   0.305253  -6.922 4.44e-12 ***
## Sex1        -2.793594   0.210181 -13.291  < 2e-16 ***
## Age         -0.036724   0.008206  -4.475 7.64e-06 ***
## SibSp       -0.307275   0.114790  -2.677  0.00743 ** 
## Parch       -0.140971   0.126600  -1.114  0.26549    
## Fare         0.002485   0.002412   1.030  0.30282    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1069.22  on 801  degrees of freedom
## Residual deviance:  712.05  on 794  degrees of freedom
## AIC: 728.05
## 
## Number of Fisher Scoring iterations: 5

The first step in interpreting the output is to assess the goodness of fit of the model based on the deviance. The residual deviance measures the distance between the current model and the saturated model; under the hypothesis that the model fits, it approximately follows a \(\chi^{2}(n-p)\) distribution. Because the p-value of the residual deviance is 0.98, the logistic model is not statistically different from the saturated model. Then, we see that the difference between the null deviance and the residual deviance is huge, which means the current model is significantly better than the null model. Hence, the current logistic model fits the data well.
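
As a quick check, both deviance-based p-values mentioned above can be computed from \(\chi^{2}\) tail probabilities, using the deviances and degrees of freedom reported in the summary:

### R ###
# Residual deviance vs. saturated model: 712.05 on 794 degrees of freedom
pchisq(712.05, df = 794, lower.tail = FALSE)
## about 0.98
# Null model vs. fitted model: drop in deviance of 1069.22 - 712.05 on 801 - 794 = 7 df
pchisq(1069.22 - 712.05, df = 801 - 794, lower.tail = FALSE)
## essentially 0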

Above the deviance, we obtain the estimated coefficients, standard errors, z-values (Wald z-statistics) and their corresponding p-values. From the p-values, we conclude that Pclass, Sex, Age and SibSp are statistically significant, while Parch and Fare are not.

Each estimated coefficient indicates the change in the log odds of the response when the corresponding predictor changes by one unit, holding the others fixed. For example, the log odds of survival decrease by 0.037 when Age increases by one unit. The difference in the log odds of survival is 0.954 between class 1 and class 2, and 2.113 between class 1 and class 3, which agrees with the outcome of the two-way contingency table.
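
To interpret the coefficients on the odds scale, we can exponentiate them (a short sketch reusing the logit_fit object fitted above):

### R ###
# Odds ratios: exponentiate the estimated coefficients
exp(coef(logit_fit))
# e.g. exp(-0.036724) is about 0.964: each additional year of age
# multiplies the odds of survival by roughly 0.96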

Below the deviance, the AIC is computed, which can be used for model selection (the lower the AIC, the better the model fits). In addition, we can obtain confidence intervals for the coefficient estimates, based on the standard errors and a normality assumption.

### R ###
# Compute the confidence intervals of the coefficients
ci = confint.default(logit_fit); ci
##                    2.5 %       97.5 %
## (Intercept)  2.836695150  4.672800458
## Pclass2     -1.559988966 -0.347690854
## Pclass3     -2.711379272 -1.514810311
## Sex1        -3.205541580 -2.381647364
## Age         -0.052807923 -0.020639917
## SibSp       -0.532259491 -0.082290253
## Parch       -0.389101686  0.107160130
## Fare        -0.002241897  0.007212022

2. Prediction and Error Rate

Based on the fitted model, we can now predict the survival probability for the testing data. The predicted survival probability is between 0 and 1, hence we classify a passenger as a survivor when p > 0.5, and as a non-survivor when p <= 0.5.

Comparing the predictions to the true survival status, we compute the error rate (the proportion of misclassified passengers), which is approximately 0.18. Not a bad performance!

### R ###
# Prediction on the test set
test_set$Pclass = as.factor(test_set$Pclass)
test_set$Sex = as.factor(test_set$Sex)

surv_prob = predict(logit_fit, newdata = test_set, type = 'response')
head(surv_prob)
##        803        804        805        806        807        808 
## 0.56615518 0.21633191 0.10658066 0.09354815 0.38436076 0.73104753
# The error rate 
mean(round(surv_prob)!=test_set$Survived)
## [1] 0.1797753
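
As a complementary check on the error rate, we can cross-tabulate the predicted and actual outcomes on the testing set (a small sketch reusing surv_prob and test_set from above):

### R ###
# Confusion table: predicted vs. actual survival on the testing set
table(predicted = round(surv_prob), actual = test_set$Survived)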

Summary

To sum up, we can see that the performance of logistic regression is not bad. Logistic regression is the simplest method for handling 0-1 classification problems, and we can easily perform it in R, Stata and Python. However, the interpretation of the results is less direct than in linear regression, due to the non-linear relationship between the response and the predictors. For more complicated scenarios, more specialized tools, such as probit regression, can be used instead.

Stata

Steps

i) Loading data

We start by loading the data and presenting the first 5 observations.

* Importing data
clear
import delimited Titanic_train.csv
list if passengerid <= 5
First 5 observations

ii) Obtaining Summary Statistics

1. Data Summary

This dataset has a binary response Survived (1 for survived, 0 for not survived). Based on the data types and experience, we choose Pclass, Sex, Age, SibSp, Parch and Fare as predictors. Pclass and Sex are treated as categorical predictors, and the others as continuous. Specifically, Pclass takes integer values from 1 (the highest rank of cabin) to 3 (the lowest rank). For simplicity, the variable sex is recoded so that ‘male’ becomes 1 and ‘female’ becomes 0.

After these steps, we obtain the following summary table of the predictors.

* Change sex to numeric value
gen gender = 1 if sex == "male"
replace gender = 0 if sex == "female"

* Select potential variables
drop name cabin embarked sex ticket

* Summarize data
summarize pclass gender age sibsp parch fare
Variable Summary

2. Two-way Contingency Table

A two-way contingency table gives us a direct, basic view of the relationship between the response and a predictor. Here we tabulate Survived and Pclass, showing the distribution of survival status within each cabin class. From the table, we can roughly conclude that the odds of survival are larger when the cabin class is higher.

* Contingency Table
tab survived pclass
Two way contingency Table

3. Missing Values and Training/Testing Sets

Missing values appear frequently when we deal with big datasets. There are many different ways to handle this problem. Here, for example, we replace the missing values in Age by the average age.

In order to evaluate our model, we divide the dataset into a training part and a testing part. The ratio of these two parts is usually 8:2 or 9:1. Here, we choose the 9:1 split, giving us 802 training samples and 89 testing samples.

* Deal with missing value in variable age
egen ave_age = mean(age)
replace age = ave_age if missing(age)
drop ave_age

* Set an indicator for the training set
gen training = 1 if passengerid < 803
replace training = 0 if passengerid > 802

iii) Logistic Model

1. Fitting Model, Summary and Interpretation.

Before fitting the logistic model, we need to declare the categorical variables (Pclass and Sex). In Stata, a variable is treated as categorical by adding the ‘i.’ prefix to it; the logit command then automatically treats this variable as categorical. The logit command also presents the regression summary as soon as the estimation finishes.

* Fitting Model
logit survived i.pclass i.gender age sibsp parch fare if training == 1
Regression

In the output above, we obtain the estimated coefficients, standard errors, z-values (Wald z-statistics), their corresponding p-values and the confidence intervals. From the log likelihood we can calculate the residual deviance, \(\text{Residual Deviance} = -2 \times \text{Log Likelihood} = 712.05\), which approximately follows a \(\chi^{2}(n-p)\) distribution; the corresponding p-value is \(0.98\). This means that our model is not statistically different from the saturated model. The p-value for the test between our model and the null model is \(0\), meaning that our model is statistically different from the null model that has no predictors. Therefore, we can conclude that our model is satisfactory. Also, we can tell that Pclass, Sex, Age and SibSp are statistically significant, while Parch and Fare are not.

Each estimated coefficient indicates the change in the log odds of the response when the corresponding predictor changes by one unit, holding the others fixed. For example, the log odds of survival decrease by 0.037 when Age increases by one unit. The difference in the log odds of survival is 0.954 between class 1 and class 2, and 2.113 between class 1 and class 3, which agrees with the outcome of the two-way contingency table.

2. Prediction and Error Rate

Based on the fitted model, we can now predict the survival probability for the testing data. The predicted survival probability is between 0 and 1, hence we classify a passenger as survived when p \(>\) 0.5, and as not survived when p \(\leq\) 0.5.

Comparing the predictions to the true survival status, we compute the error rate (the proportion of misclassified passengers), which is approximately 0.18. Not a bad performance!

* Prediction
predict psurvived if training == 0

* Testing Error
gen prediction = 1 if psurvived > 0.5 & passengerid > 802
replace prediction = 0 if psurvived <= 0.5 & passengerid > 802

* Error Rate
gen error = prediction != survived
mean error if passengerid > 802
Error Rate

Summary

To sum up, we can see that the performance of logistic regression is not bad. Logistic regression is the simplest method for handling 0-1 classification problems, and we can easily perform it in R, Stata and Python. However, the interpretation of the results is less direct than in linear regression, due to the non-linear relationship between the response and the predictors. For more complicated scenarios, more specialized tools, such as probit regression, can be used instead.

Python

Steps

i) Loading data

This dataset has a binary response variable ‘Survived’ (1 for survival, 0 for death). Based on the data types and experience, we choose ‘Pclass’, ‘Sex’, ‘Age’, ‘SibSp’, ‘Parch’ and ‘Fare’ as predictors. We treat ‘Pclass’ and ‘Sex’ as categorical factors and the others as continuous. ‘Pclass’ takes on values 1, 2 and 3, where class 1 represents the highest rank of cabin and 3 the lowest.

We start by loading the data, selecting the variables we will use, and presenting the first 5 observations.

### Python ###
# Library packages
import pandas as pd
import statsmodels.api as sm
# Read data
tdata = pd.read_csv('Titanic_train.csv')
# get the subdata we need to use
pdpre = tdata.iloc[:,[1,2,4,5,6,7,9]]
# present the head
pdpre.head()
First 5 observations

ii) Obtaining Summary Statistics

1. Data Summary

For simplicity, we recode Sex so that ‘male’ becomes 1 and ‘female’ becomes 0. We can get basic descriptive statistics for the entire data set by using the describe method.

### Python ###
# change ‘male’ to 1, ‘female’ to 0
def sex_to_numeric(x):
    if x == 'male':
        return 1
    if x=='female':
        return 0
pdpre['Sex'] = pdpre['Sex'].apply(sex_to_numeric)

# Summary
sumy = pdpre.iloc[:,1:].describe(include='all')
Variable Summary

2. Two-way Contingency Table

A two-way contingency table gives us a direct, basic view of the relationship between the response and a predictor. For example, we tabulate ‘Survived’ and ‘Pclass’ to show the impact of cabin class on the probability of survival. From the table, we can roughly conclude that the odds of survival are larger when the cabin class is higher.

### Python ###
#Two-way contingency table
pd.crosstab(pdpre['Survived'], pdpre['Pclass'], margins=True)
Two way contingency Table

3. Missing Values and Training/Testing Sets

Missing values appear frequently when we deal with big datasets. There are many different ways to handle this problem. For example, here we replace the missing values in ‘Age’ by the average age.

Before fitting the logistic model, we also need to encode the categorical variables (Pclass and Sex) as dummy variables.

Holding out a testing set from the original dataset is a common way to evaluate the goodness of fit. The ratio of training data to testing data is usually 8:2 or 9:1. Here, we use the first 90% of the data as training data and the last 10% as testing data.

### Python ###
# Filling missing value
pdpre["Age"].fillna(pdpre["Age"].mean(), inplace=True)

# Transfer categorical variables to dummy variables
pc = pd.get_dummies(pdpre['Pclass'], prefix='Pclass')
sx = pd.get_dummies(pdpre['Sex'], prefix='Sex')
cols_to_keep = ['Survived', 'Age', 'SibSp', 'Parch', 'Fare']
df_temp = pdpre[cols_to_keep].join(pc.loc[: , 'Pclass_2' : ])
df = df_temp.join(sx['Sex_1'])

# Divide to training and test data
train = df.iloc[ : 802, : ]
test = df.iloc[802 : , : ]
train_y = train['Survived']
train_x = train.iloc[ : , 1 : ]
test_y = test['Survived']
test_x = test.iloc[ : , 1 : ]

iii) Logistic Model

1. Fitting Model, Summary and Interpretation.

We now fit the logistic model. In order to get the detailed fitting result, we use the summary method.

### Python ###
# Fitting model
train_x['intercept'] = 1.0
logit = sm.Logit(train_y, train_x)
result = logit.fit()
result.summary()
Regression

In the output above, we obtain the estimated coefficients, standard errors, confidence intervals of the coefficients, z-values (Wald z-statistics) and their corresponding p-values. Pclass_2, Pclass_3, Sex_1, Age and SibSp are statistically significant, while Parch and Fare are not.

Each estimated coefficient indicates the change in the log odds of the response when the corresponding predictor changes by one unit, holding the others fixed. For example, the log odds of survival decrease by 0.037 when Age increases by one unit. The difference in the log odds of survival is 0.954 between class 1 and class 2, and 2.113 between class 1 and class 3, which agrees with the outcome of the two-way contingency table.

2. Prediction and Error Rate

Based on the fitted model, we can now predict the survival probability for the testing data. The predicted survival probability is between 0 and 1, hence we classify a passenger as a survivor when p > 0.5, and as a non-survivor when p <= 0.5.

Comparing the predictions to the true survival status, we compute the error rate (the proportion of misclassified passengers), which is approximately 0.18. Not a bad performance!

### Python ###
# Predict
test_x['intercept'] = 1.0
pred = result.predict(test_x)
pred[pred > 0.5] = 1
pred[pred <= 0.5] = 0
test_err = 1 - (test_y == pred).mean()
print("test error rate: " + str(test_err))
Error Rate

Summary

To sum up, we can see that the performance of logistic regression is not bad. Logistic regression is the simplest method for handling 0-1 classification problems, and we can easily perform it in R, Stata and Python. However, the interpretation of the results is less direct than in linear regression, due to the non-linear relationship between the response and the predictors. For more complicated scenarios, more specialized tools, such as probit regression, can be used instead.
