Since the topic we chose is easy to state but involved in practice, the code and discussion for each programming language are rather long :(
The data we use is the Medical Cost Personal dataset, available on Kaggle (https://www.kaggle.com/mirichoi0218/insurance).
library(data.table)
setwd("~/Desktop/group_project/group_project/final_version/")
insurance <- fread("insurance.csv")
head(insurance)
## age sex bmi children smoker region charges
## 1: 19 female 27.900 0 yes southwest 16884.924
## 2: 18 male 33.770 1 no southeast 1725.552
## 3: 28 male 33.000 3 no southeast 4449.462
## 4: 33 male 22.705 0 no northwest 21984.471
## 5: 32 male 28.880 0 no northwest 3866.855
## 6: 31 female 25.740 0 no southeast 3756.622
age is the person's age. sex is the gender. bmi is the Body Mass Index (BMI). children is the number of children one has. smoker indicates whether the person smokes. region records which of four regions (southwest, southeast, northeast, northwest) one lives in. charges is the insurance cost in dollars. The data visualization is done in all three languages. In this tutorial, our goal is to estimate a linear model that predicts the insurance charges from the above dataset.
This is the normal equation, which the least squares coefficients solve: \(\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y\).
Select the model that maximizes the adjusted \(R^2\) value.
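For reference, with \(n\) observations and \(p\) predictors, the adjusted \(R^2\) penalizes model size:
\[ R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}. \]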
In each example, we start with data visualization. We then pick a strategy to scale the response, convert the categorical predictors to dummy variables, and regress on the full dataset. It turns out there is a clear linear pattern in the residuals, so we modify our model to reduce it. Starting from the original model, we add interaction predictors where two variables seem to have a joint effect on the response (for example, smoker:age), and then raise the power of the continuous predictors. This diagnostic process reduces the linear pattern in the residuals. Finally, we regress the selected predictors against the original response and perform model selection.
1. Xie, Y. Yihui Xie's knitr examples
2. A regression example on Kaggle
3. Faraway, J. 2009. Linear Models with R. University of British Columbia, Canada.
4. Shedden, K. Professor Shedden's notes on model diagnostics
5. Shedden, K. Professor Shedden's notes on model selection
The following packages are used in the Python version of the tutorial.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.regression.linear_model import OLS
from statsmodels.tools import add_constant
#read the data into a dataframe (read_csv already returns one)
ds0 = pd.read_csv('~/Desktop/506/groupproject/insurance.csv')
#have a glimpse of the data
print(ds0.head())
## age sex bmi children smoker region charges
## 0 19 female 27.900 0 yes southwest 16884.92400
## 1 18 male 33.770 1 no southeast 1725.55230
## 2 28 male 33.000 3 no southeast 4449.46200
## 3 33 male 22.705 0 no northwest 21984.47061
## 4 32 male 28.880 0 no northwest 3866.85520
print(ds0.columns)
## Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')
#check if there are any nulls
ds0.isnull().sum()
#no nulls :)
## age 0
## sex 0
## bmi 0
## children 0
## smoker 0
## region 0
## charges 0
## dtype: int64
# recode the character columns as 0/1 indicators
ds0.sex = ds0.sex.map({'female':1, 'male':0})
ds0.smoker = ds0.smoker.map({'yes':1, 'no':0})
ds_reg = pd.get_dummies(ds0['region'])
# with dummy variables, one column must be dropped to avoid perfect collinearity
ds0 = ds0.join(ds_reg.iloc[:,0:3])  # keeps northeast, northwest, southeast
ds0['logcharges'] = np.log(ds0.charges)
ds1 = ds0.drop(['charges','region'],axis=1)
# As we can see, the charges have a heavy right tail, so we take the log.
sns.distplot(ds0['charges'], color='b')
# Data in every other column looks fine right now.
f,ax = plt.subplots(2,3,figsize=(10,8))
sns.distplot(ds1["age"], kde=False, ax=ax[0,0])
sns.boxplot(x='sex',y='charges', data=ds0, ax=ax[0,1])
sns.distplot(ds1['logcharges'], ax=ax[0,2], kde=False, color='b')
# The logcharges are now approximately normally distributed.
sns.distplot(ds1['bmi'],ax=ax[1,0], kde=False, color='b')
sns.countplot('children',data=ds1, ax=ax[1,1])
sns.countplot('region',data=ds0, ax=ax[1,2])
ax[0,0].set_title('Distribution of Ages')
ax[0,1].set_title('Charges boxplot by Sex')
ax[0,2].set_title('Distribution of log charges')
ax[1,0].set_title('Distribution of bmi')
ax[1,1].set_title('Distribution of children')
ax[1,2].set_title('Distribution of regions')
Standardizing (centering and scaling) the covariates is actually not a must; it would not change the substance of the regression results. In this example, we won't do so.
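To see why: replacing a covariate \(x_j\) by \(\tilde{x}_j = (x_j - \bar{x}_j)/s_j\) merely rescales its slope to \(\tilde{\beta}_j = s_j\,\beta_j\) and shifts the intercept, leaving the fitted values, residuals, and \(R^2\) unchanged.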
def centralize(arra):
    # subtract the mean, then divide by the sample standard deviation
    cen = arra - np.mean(arra)
    sd = np.sqrt(np.sum(cen**2)/(len(arra)-1))
    return cen/sd

# applied to ds0 only; the regressions below use ds1, which is a separate copy
ds0.bmi = centralize(ds0.bmi)
ds0.age = centralize(ds0.age)
Firstly, we define the dependent variable y and the covariates X for future convenience.
y = ds1.logcharges
X = ds1.drop(['logcharges'], axis = 1)
#Conduct the first regression!
regr_1 = OLS(y, add_constant(X)).fit()
def variance_inflation_factor(exog, exog_idx):
    # regress column exog_idx on all the other columns and return 1/(1 - R^2)
    k_vars = exog.shape[1]
    x_i = exog.iloc[:, exog_idx]
    mask = np.arange(k_vars) != exog_idx
    x_noti = exog.iloc[:, mask]
    r_squared_i = OLS(x_i, x_noti).fit().rsquared
    return 1. / (1. - r_squared_i)
# We skip the constant column.
VIF = [variance_inflation_factor(add_constant(X), i) for i in range(1,X.shape[1]+1)]
print(VIF)
## [1.0168221490038107, 1.0089001621005733, 1.106629732428617, 1.004010642137024, 1.0120736649061481, 1.5262104138696806, 1.524748317955475, 1.5971027909244977]
The VIF of each column is fine: all are smaller than 5, in fact smaller than 2.
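For reference, the VIF computed above regresses predictor \(j\) on all the others and reports
\[ VIF_j = \frac{1}{1 - R_j^2}, \]
where \(R_j^2\) is the \(R^2\) of that auxiliary regression; values above 5 (or 10) are the usual warning signs.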
sns.distplot(regr_1.resid)
The residuals look approximately normal, which is good. Since they are already close to normal, a Box-Cox transformation of the response is not necessary; the quick check below confirms it changes little.
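For reference, the Box-Cox family of response transformations indexed by \(\lambda\) is \(y^{(\lambda)} = (y^{\lambda}-1)/\lambda\) for \(\lambda \neq 0\) and \(y^{(\lambda)} = \log y\) for \(\lambda = 0\); the quick probe below uses \(\lambda = 0.1\).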
lam = 0.1
# Box-Cox transform of the response with lambda = 0.1
regr_test = OLS((y**lam - 1)/lam, add_constant(X)).fit()
sns.jointplot((y**lam - 1)/lam, regr_test.resid)
sns.distplot(regr_test.resid)
If the regression works well, the residuals should show no systematic structure; with the Gaussian prerequisite, independence is what we expect.
sns.jointplot(y, regr_1.resid)
The residual plot looks very strange: y and the residuals are highly dependent. Perhaps the model was not linear in the first place. Since there is explicit nonlinearity in this model, we have to add some nonlinear covariates to it.
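As a complementary check (a sketch of ours, not part of the original analysis), the residuals can also be plotted against the fitted values, which should show no trend for an adequate mean model:
# residuals vs. fitted values; a visible trend signals a misspecified mean model
sns.jointplot(regr_1.fittedvalues, regr_1.resid)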
A partial residual plot attempts to show how a covariate is related to the dependent variable when we control for the effects of all the other covariates.
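Concretely, for covariate \(j\) the partial residual plot graphs
\[ \hat{\beta}_j x_{ij} + r_i \quad \text{against} \quad x_{ij}, \]
where \(r_i\) are the residuals of the full fit; this is exactly what the code below computes.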
f, ax = plt.subplots(1, 2, figsize=(10, 8))
# partial residuals beta_j * x_j + r, plotted against x_j
# (jointplot creates its own figure, so we draw on the subplot axes with regplot)
sns.regplot(X.bmi, regr_1.params.bmi * X.bmi + regr_1.resid, ax=ax[0])
sns.regplot(X.age, regr_1.params.age * X.age + regr_1.resid, ax=ax[1])
Partial residual plots show that both bmi and age could significantly affect the charges.
In this part, we try adding variables and dropping the ones that are useless. The primary concern is to add variables until the residuals are relatively independent of y.
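As a convenience, here is a small helper of ours (not part of the original code; it assumes the y, X, OLS, and add_constant objects defined above) that scores a candidate term by adjusted \(R^2\) and BIC:
def try_term(y, X_base, term=None, name='term'):
    # refit with an optional extra column and report (adjusted R^2, BIC)
    X_try = X_base.copy()
    if term is not None:
        X_try[name] = term
    fit = OLS(y, add_constant(X_try)).fit()
    return fit.rsquared_adj, fit.bic

print(try_term(y, X))                              # baseline
print(try_term(y, X, X.smoker * X.bmi, 'sm_bm'))   # with the smoker:bmi interaction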
print(regr_1.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: logcharges R-squared: 0.768
## Model: OLS Adj. R-squared: 0.767
## Method: Least Squares F-statistic: 549.8
## Date: Fri, 07 Dec 2018 Prob (F-statistic): 0.00
## Time: 02:24:25 Log-Likelihood: -808.52
## No. Observations: 1338 AIC: 1635.
## Df Residuals: 1329 BIC: 1682.
## Df Model: 8
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 6.8262 0.076 90.168 0.000 6.678 6.975
## age 0.0346 0.001 39.655 0.000 0.033 0.036
## sex 0.0754 0.024 3.091 0.002 0.028 0.123
## bmi 0.0134 0.002 6.381 0.000 0.009 0.017
## children 0.1019 0.010 10.085 0.000 0.082 0.122
## smoker 1.5543 0.030 51.333 0.000 1.495 1.614
## northeast 0.1290 0.035 3.681 0.000 0.060 0.198
## northwest 0.0652 0.035 1.863 0.063 -0.003 0.134
## southeast -0.0282 0.034 -0.819 0.413 -0.096 0.039
## ==============================================================================
## Omnibus: 463.882 Durbin-Watson: 2.046
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 1673.760
## Skew: 1.679 Prob(JB): 0.00
## Kurtosis: 7.330 Cond. No. 330.
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
logcharges = constant + age + sex + bmi + children + smoker + northeast + northwest + southeast
X_2 = X.copy()  # work on a copy so X stays intact
X_2['sm_bm'] = X_2.smoker * X_2.bmi  # smoker:bmi interaction
regr_test = OLS(y, add_constant(X_2)).fit()
print(regr_test.summary())
# adding smoker:bmi clearly improves the model
## OLS Regression Results
## ==============================================================================
## Dep. Variable: logcharges R-squared: 0.784
## Model: OLS Adj. R-squared: 0.782
## Method: Least Squares F-statistic: 534.0
## Date: Fri, 07 Dec 2018 Prob (F-statistic): 0.00
## Time: 02:24:25 Log-Likelihood: -762.05
## No. Observations: 1338 AIC: 1544.
## Df Residuals: 1328 BIC: 1596.
## Df Model: 9
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 7.1128 0.079 90.254 0.000 6.958 7.267
## age 0.0348 0.001 41.281 0.000 0.033 0.036
## sex 0.0871 0.024 3.688 0.000 0.041 0.133
## bmi 0.0034 0.002 1.502 0.133 -0.001 0.008
## children 0.1031 0.010 10.569 0.000 0.084 0.122
## smoker 0.1564 0.146 1.071 0.284 -0.130 0.443
## northeast 0.1375 0.034 4.062 0.000 0.071 0.204
## northwest 0.0664 0.034 1.964 0.050 8.84e-05 0.133
## southeast -0.0252 0.033 -0.757 0.449 -0.091 0.040
## sm_bm 0.0456 0.005 9.773 0.000 0.036 0.055
## ==============================================================================
## Omnibus: 520.045 Durbin-Watson: 2.036
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 2150.321
## Skew: 1.846 Prob(JB): 0.00
## Kurtosis: 7.994 Cond. No. 661.
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
X_2['sm_ag'] = X_2.smoker*X_2.age
regr_test = OLS(y, add_constant(X_2)).fit()
print(regr_test.summary())
# adding smoker:age increases the performance of the model significantly
## OLS Regression Results
## ==============================================================================
## Dep. Variable: logcharges R-squared: 0.825
## Model: OLS Adj. R-squared: 0.824
## Method: Least Squares F-statistic: 625.0
## Date: Fri, 07 Dec 2018 Prob (F-statistic): 0.00
## Time: 02:24:25 Log-Likelihood: -620.29
## No. Observations: 1338 AIC: 1263.
## Df Residuals: 1327 BIC: 1320.
## Df Model: 10
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 6.8960 0.072 95.828 0.000 6.755 7.037
## age 0.0416 0.001 48.930 0.000 0.040 0.043
## sex 0.0856 0.021 4.030 0.000 0.044 0.127
## bmi 0.0012 0.002 0.563 0.573 -0.003 0.005
## children 0.1063 0.009 12.106 0.000 0.089 0.124
## smoker 1.2793 0.146 8.769 0.000 0.993 1.566
## northeast 0.1509 0.030 4.953 0.000 0.091 0.211
## northwest 0.0860 0.030 2.827 0.005 0.026 0.146
## southeast 0.0061 0.030 0.202 0.840 -0.053 0.065
## sm_bm 0.0510 0.004 12.130 0.000 0.043 0.059
## sm_ag -0.0334 0.002 -17.698 0.000 -0.037 -0.030
## ==============================================================================
## Omnibus: 860.664 Durbin-Watson: 1.998
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 8198.732
## Skew: 2.969 Prob(JB): 0.00
## Kurtosis: 13.574 Cond. No. 744.
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Since we only have two continuous covariates, we can try raising them to a higher power.
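The idea: adding a fractional power lets the effect of a continuous covariate bend rather than stay a straight line; for bmi, the mean model gains a term
\[ \cdots + \beta_1\,\mathrm{bmi} + \beta_2\,\mathrm{bmi}^{1.5} + \cdots \]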
X_2['bmi^1.5'] = X_2.bmi ** 1.5
regr_test = OLS(y, add_constant(X_2)).fit()
print(regr_test.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: logcharges R-squared: 0.826
## Model: OLS Adj. R-squared: 0.825
## Method: Least Squares F-statistic: 574.1
## Date: Fri, 07 Dec 2018 Prob (F-statistic): 0.00
## Time: 02:24:25 Log-Likelihood: -614.16
## No. Observations: 1338 AIC: 1252.
## Df Residuals: 1326 BIC: 1315.
## Df Model: 11
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 5.9730 0.274 21.823 0.000 5.436 6.510
## age 0.0414 0.001 48.940 0.000 0.040 0.043
## sex 0.0859 0.021 4.060 0.000 0.044 0.127
## bmi 0.0928 0.026 3.528 0.000 0.041 0.144
## children 0.1065 0.009 12.175 0.000 0.089 0.124
## smoker 1.2759 0.145 8.782 0.000 0.991 1.561
## northeast 0.1564 0.030 5.147 0.000 0.097 0.216
## northwest 0.0848 0.030 2.797 0.005 0.025 0.144
## southeast 0.0149 0.030 0.497 0.619 -0.044 0.074
## sm_bm 0.0513 0.004 12.243 0.000 0.043 0.060
## sm_ag -0.0335 0.002 -17.815 0.000 -0.037 -0.030
## bmi^1.5 -0.0110 0.003 -3.494 0.000 -0.017 -0.005
## ==============================================================================
## Omnibus: 861.038 Durbin-Watson: 1.995
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 8243.481
## Skew: 2.968 Prob(JB): 0.00
## Kurtosis: 13.612 Cond. No. 4.89e+03
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 4.89e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
X_2['age^1.5'] = X_2.age ** 1.5
regr_test = OLS(y, add_constant(X_2)).fit()
print(regr_test.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: logcharges R-squared: 0.828
## Model: OLS Adj. R-squared: 0.827
## Method: Least Squares F-statistic: 532.2
## Date: Fri, 07 Dec 2018 Prob (F-statistic): 0.00
## Time: 02:24:25 Log-Likelihood: -607.51
## No. Observations: 1338 AIC: 1241.
## Df Residuals: 1325 BIC: 1309.
## Df Model: 12
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 5.5668 0.294 18.907 0.000 4.989 6.144
## age 0.0776 0.010 7.785 0.000 0.058 0.097
## sex 0.0855 0.021 4.060 0.000 0.044 0.127
## bmi 0.0919 0.026 3.508 0.000 0.041 0.143
## children 0.0966 0.009 10.583 0.000 0.079 0.114
## smoker 1.2704 0.145 8.784 0.000 0.987 1.554
## northeast 0.1566 0.030 5.179 0.000 0.097 0.216
## northwest 0.0858 0.030 2.844 0.005 0.027 0.145
## southeast 0.0148 0.030 0.497 0.619 -0.044 0.073
## sm_bm 0.0514 0.004 12.326 0.000 0.043 0.060
## sm_ag -0.0335 0.002 -17.881 0.000 -0.037 -0.030
## bmi^1.5 -0.0108 0.003 -3.466 0.001 -0.017 -0.005
## age^1.5 -0.0039 0.001 -3.639 0.000 -0.006 -0.002
## ==============================================================================
## Omnibus: 895.912 Durbin-Watson: 1.992
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 9255.812
## Skew: 3.102 Prob(JB): 0.00
## Kurtosis: 14.293 Cond. No. 9.48e+03
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 9.48e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
Finally, as outlined at the start, we regress the selected predictors against the original (untransformed) charges.
y_2 = ds0.charges
regr_test = OLS(y_2, add_constant(X_2)).fit()
print(regr_test.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: charges R-squared: 0.845
## Model: OLS Adj. R-squared: 0.843
## Method: Least Squares F-statistic: 600.3
## Date: Fri, 07 Dec 2018 Prob (F-statistic): 0.00
## Time: 02:24:25 Log-Likelihood: -13232.
## No. Observations: 1338 AIC: 2.649e+04
## Df Residuals: 1325 BIC: 2.656e+04
## Df Model: 12
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const -7665.5804 3687.192 -2.079 0.038 -1.49e+04 -432.209
## age -320.3823 124.794 -2.567 0.010 -565.197 -75.567
## sex 509.3387 263.692 1.932 0.054 -7.960 1026.637
## bmi 1056.6806 328.082 3.221 0.001 413.063 1700.298
## children 678.7408 114.266 5.940 0.000 454.579 902.902
## smoker -2.029e+04 1811.135 -11.200 0.000 -2.38e+04 -1.67e+04
## northeast 1288.7634 378.774 3.402 0.001 545.701 2031.825
## northwest 616.4165 377.748 1.632 0.103 -124.634 1357.467
## southeast 122.5000 374.253 0.327 0.743 -611.693 856.693
## sm_bm 1444.5800 52.238 27.654 0.000 1342.101 1547.059
## sm_ag -3.7854 23.430 -0.162 0.872 -49.750 42.179
## bmi^1.5 -123.8640 39.080 -3.170 0.002 -200.529 -47.199
## age^1.5 62.3346 13.294 4.689 0.000 36.256 88.413
## ==============================================================================
## Omnibus: 737.144 Durbin-Watson: 2.068
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 4743.381
## Skew: 2.581 Prob(JB): 0.00
## Kurtosis: 10.645 Cond. No. 9.48e+03
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 9.48e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
Now that we have added all the covariates we reasonably can, we can consider dropping some of them. As we can see from the summary, the region parameters do not seem that important.
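For comparison, here is a minimal greedy backward-elimination sketch by BIC (our illustration, not the tutorial's procedure, which drops terms by hand; it assumes the OLS and add_constant imports above):
def backward_bic(y, X):
    # greedily drop the column whose removal lowers BIC, until no drop helps
    cols = list(X.columns)
    best = OLS(y, add_constant(X[cols])).fit().bic
    improved = True
    while improved and len(cols) > 1:
        improved = False
        for c in list(cols):
            trial = [k for k in cols if k != c]
            bic = OLS(y, add_constant(X[trial])).fit().bic
            if bic < best:
                best, cols, improved = bic, trial, True
    return cols

print(backward_bic(y_2, X_2))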
X_3 = X_2.drop(columns = ['northwest', 'southeast'])
regr_3 = OLS(y_2, add_constant(X_3)).fit()
print(regr_3.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: charges R-squared: 0.844
## Model: OLS Adj. R-squared: 0.843
## Method: Least Squares F-statistic: 719.5
## Date: Fri, 07 Dec 2018 Prob (F-statistic): 0.00
## Time: 02:24:25 Log-Likelihood: -13233.
## No. Observations: 1338 AIC: 2.649e+04
## Df Residuals: 1327 BIC: 2.655e+04
## Df Model: 10
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const -7462.5827 3670.004 -2.033 0.042 -1.47e+04 -262.941
## age -322.6577 124.831 -2.585 0.010 -567.546 -77.770
## sex 509.7468 263.786 1.932 0.054 -7.736 1027.230
## bmi 1075.2760 326.981 3.288 0.001 433.821 1716.732
## children 681.9706 114.227 5.970 0.000 457.886 906.056
## smoker -2.032e+04 1811.634 -11.219 0.000 -2.39e+04 -1.68e+04
## northeast 1036.4906 309.503 3.349 0.001 429.323 1643.658
## sm_bm 1444.4375 52.251 27.644 0.000 1341.934 1546.941
## sm_ag -3.0043 23.396 -0.128 0.898 -48.902 42.893
## bmi^1.5 -126.7912 38.882 -3.261 0.001 -203.068 -50.514
## age^1.5 62.5812 13.298 4.706 0.000 36.495 88.668
## ==============================================================================
## Omnibus: 740.495 Durbin-Watson: 2.062
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 4801.892
## Skew: 2.592 Prob(JB): 0.00
## Kurtosis: 10.698 Cond. No. 9.43e+03
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 9.43e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
The AIC and BIC actually get smaller.
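For reference, with maximized log-likelihood \(\ell\), \(k\) estimated parameters, and \(n\) observations,
\[ AIC = -2\ell + 2k, \qquad BIC = -2\ell + k \log n, \]
so smaller values indicate a better fit-complexity trade-off, with BIC penalizing size more heavily.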
Similarly, smoker:age is not that important either.
X_4 = X_3.drop(columns = ['sm_ag'])
regr_4 = OLS(y_2, add_constant(X_4)).fit()
print(regr_4.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: charges R-squared: 0.844
## Model: OLS Adj. R-squared: 0.843
## Method: Least Squares F-statistic: 800.1
## Date: Fri, 07 Dec 2018 Prob (F-statistic): 0.00
## Time: 02:24:26 Log-Likelihood: -13233.
## No. Observations: 1338 AIC: 2.649e+04
## Df Residuals: 1328 BIC: 2.654e+04
## Df Model: 9
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const -7441.2915 3664.898 -2.030 0.043 -1.46e+04 -251.670
## age -323.1930 124.715 -2.591 0.010 -567.854 -78.532
## sex 509.8762 263.686 1.934 0.053 -7.411 1027.163
## bmi 1075.1073 326.857 3.289 0.001 433.895 1716.320
## children 681.6865 114.163 5.971 0.000 457.727 905.646
## smoker -2.043e+04 1630.461 -12.527 0.000 -2.36e+04 -1.72e+04
## northeast 1036.8028 309.378 3.351 0.001 429.879 1643.726
## sm_bm 1443.9491 52.093 27.719 0.000 1341.755 1546.143
## bmi^1.5 -126.7498 38.866 -3.261 0.001 -202.996 -50.504
## age^1.5 62.5735 13.292 4.707 0.000 36.497 88.650
## ==============================================================================
## Omnibus: 740.467 Durbin-Watson: 2.063
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 4799.401
## Skew: 2.592 Prob(JB): 0.00
## Kurtosis: 10.695 Cond. No. 9.41e+03
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 9.41e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
After the model selection part, we can now examine the residuals:
sns.residplot(regr_4.fittedvalues, y_2)  # fitted values against the raw response
This is much more acceptable than before.
For now, the best model would be:
charges = constant + age + sex + bmi + children + smokeryes + northeastornot + sm_bm + \(bmi^{1.5}\) + \(age^{1.5}\)
The coefficients are given in the table above. The adjusted \(R^{2}\) is 0.843, much better than the 0.767 of the original model. So far, we have solved the residual problem and increased the adjusted \(R^{2}\) significantly: a fairly complete model diagnostics and model selection process.
Caution: as you may see, the coefficient of smoker is negative. Please don't take this to mean smoking lowers your costs. Since smoker also appears in the smoker:bmi interaction, and bmi is far from zero for everyone, the harmful influence of smoking has been transferred to the sm_bm term.
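A quick back-of-the-envelope check (our sketch, using the regr_4 fit above): for a smoker, the net effect combines the main effect and the interaction, and it is strongly positive at any realistic bmi:
b = regr_4.params
# net smoking effect at bmi = 30: about -20430 + 1444*30, i.e. roughly +22900 dollars
print(b['smoker'] + b['sm_bm'] * 30)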
The following packages are used in the R version of the tutorial.
# libraries--------------------------------------------------------------------
library(faraway)
library(readr)
library(dplyr)
# Load data--------------------------------------------------------------------
ds0 <- read_csv("insurance.csv")
## Parsed with column specification:
## cols(
## age = col_integer(),
## sex = col_character(),
## bmi = col_double(),
## children = col_integer(),
## smoker = col_character(),
## region = col_character(),
## charges = col_double()
## )
names(ds0)
## [1] "age" "sex" "bmi" "children" "smoker" "region"
## [7] "charges"
#have a glimpse of the data
head(ds0)
## # A tibble: 6 x 7
## age sex bmi children smoker region charges
## <int> <chr> <dbl> <int> <chr> <chr> <dbl>
## 1 19 female 27.9 0 yes southwest 16885.
## 2 18 male 33.8 1 no southeast 1726.
## 3 28 male 33 3 no southeast 4449.
## 4 33 male 22.7 0 no northwest 21984.
## 5 32 male 28.9 0 no northwest 3867.
## 6 31 female 25.7 0 no southeast 3757.
# Check if there are any nulls--------------------------------------------------
colSums(is.na(ds0))
No missing values; we're good.
# Recode: sex: female = 1, male = 0; smoker: yes = 1, no = 0
ds0$sex[ds0$sex == "male"]="0"
ds0$sex[ds0$sex == "female"]="1"
ds0$smoker[ds0$smoker == "no"]="0"
ds0$smoker[ds0$smoker == "yes"]="1"
# Examining the distribution of each variables---------------------------------
hist(ds0$age,xlab="Age", main="Distribution of Age")
hist(ds0$bmi,xlab="BMI", main="Distribution of BMI")
hist(ds0$children,xlab="Children", main="Distribution of Children")
hist(ds0$charges,xlab="Charges", main="Distribution of Charges")
# Take log of charges since it has a heavy tail--------------------------------
ds0$logcharges <- log(ds0$charges + 1)
hist(ds0$logcharges, breaks = 10)
Firstly, we define the dependent variable y and the covariates X. In this analysis, the response is logcharges and the predictors are age, sex, bmi, children, smoker, and region.
# Conduct the first regression!
fit0 <- lm(logcharges ~ age + sex + bmi + children + smoker + as.factor(region), data = ds0)
# Check the variance inflation factors: all are below 5, most below 2.
X <- model.matrix(fit0)[, -1]
round(vif(X), 2)
## age sex1
## 1.02 1.01
## bmi children
## 1.11 1.00
## smoker1 as.factor(region)northwest
## 1.01 1.52
## as.factor(region)southeast as.factor(region)southwest
## 1.65 1.53
hist(fit0$residuals, xlab="Residuals")
plot(fit0$res, xlab="Residuals")
abline(h=0) # acting like normal, so it's good
plot(fit0)  # standard diagnostic plots
# Partial residual plots-------------------------------------------------------
# These attempt to show how a covariate is related to the dependent variable
# when we control for the effects of all the other covariates.
# The partial residual plots look acceptable.
fit <- lm(logcharges~ bmi, data=ds0)
plot(fit)
In this part, we try adding variables and dropping the ones that are useless. The primary concern is to add variables until the residuals are relatively independent of y.
fit0 <- lm(logcharges ~age+sex+bmi+children+smoker+as.factor(region), data=ds0)
summary(fit0)
##
## Call:
## lm(formula = logcharges ~ age + sex + bmi + children + smoker +
## as.factor(region), data = ds0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.07141 -0.19836 -0.04914 0.06601 2.16602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.9556815 0.0732494 94.959 < 2e-16 ***
## age 0.0345744 0.0008719 39.655 < 2e-16 ***
## sex1 0.0753873 0.0243966 3.090 0.002043 **
## bmi 0.0133740 0.0020956 6.382 2.41e-10 ***
## children 0.1018259 0.0100976 10.084 < 2e-16 ***
## smoker1 1.5541456 0.0302739 51.336 < 2e-16 ***
## as.factor(region)northwest -0.0637739 0.0348992 -1.827 0.067867 .
## as.factor(region)southeast -0.1571539 0.0350762 -4.480 8.09e-06 ***
## as.factor(region)southwest -0.1289211 0.0350206 -3.681 0.000241 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4442 on 1329 degrees of freedom
## Multiple R-squared: 0.768, Adjusted R-squared: 0.7666
## F-statistic: 549.8 on 8 and 1329 DF, p-value: < 2.2e-16
AIC(fit0)
## [1] 1636.536
BIC(fit0)
## [1] 1688.526
fit1 <-lm(logcharges ~age+sex+bmi+children+smoker+as.factor(region)+smoker*bmi, data=ds0)
summary(fit1)
##
## Call:
## lm(formula = logcharges ~ age + sex + bmi + children + smoker +
## as.factor(region) + smoker * bmi, data = ds0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.91486 -0.17764 -0.05144 0.04640 2.22282
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.2508346 0.0769477 94.231 < 2e-16 ***
## age 0.0347880 0.0008427 41.280 < 2e-16 ***
## sex1 0.0870349 0.0236027 3.688 0.000236 ***
## bmi 0.0034055 0.0022672 1.502 0.133309
## children 0.1031176 0.0097574 10.568 < 2e-16 ***
## smoker1 0.1562914 0.1459708 1.071 0.284497
## as.factor(region)northwest -0.0711167 0.0337288 -2.108 0.035176 *
## as.factor(region)southeast -0.1626838 0.0338962 -4.799 1.77e-06 ***
## as.factor(region)southwest -0.1374810 0.0338491 -4.062 5.16e-05 ***
## bmi:smoker1 0.0455727 0.0046624 9.775 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4292 on 1328 degrees of freedom
## Multiple R-squared: 0.7835, Adjusted R-squared: 0.7821
## F-statistic: 534.1 on 9 and 1328 DF, p-value: < 2.2e-16
BIC(fit1)
## [1] 1602.769
# which certainly improves the performance of the model
fit2 <-lm(logcharges ~age+sex+bmi+children+smoker+as.factor(region)+smoker*bmi+as.numeric(smoker)*age, data=ds0)
summary(fit2)
##
## Call:
## lm(formula = logcharges ~ age + sex + bmi + children + smoker +
## as.factor(region) + smoker * bmi + as.numeric(smoker) * age,
## data = ds0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65395 -0.15404 -0.06305 -0.01126 2.34914
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0475291 0.0701861 100.412 < 2e-16 ***
## age 0.0415643 0.0008495 48.928 < 2e-16 ***
## sex1 0.0855867 0.0212384 4.030 5.90e-05 ***
## bmi 0.0011518 0.0020440 0.563 0.5732
## children 0.1062970 0.0087818 12.104 < 2e-16 ***
## smoker1 1.2788788 0.1458651 8.768 < 2e-16 ***
## as.factor(region)northwest -0.0649078 0.0303520 -2.139 0.0327 *
## as.factor(region)southeast -0.1448205 0.0305173 -4.746 2.30e-06 ***
## as.factor(region)southwest -0.1508956 0.0304676 -4.953 8.26e-07 ***
## as.numeric(smoker) NA NA NA NA
## bmi:smoker1 0.0510336 0.0042067 12.132 < 2e-16 ***
## age:as.numeric(smoker) -0.0333924 0.0018870 -17.696 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3862 on 1327 degrees of freedom
## Multiple R-squared: 0.8249, Adjusted R-squared: 0.8235
## F-statistic: 625 on 10 and 1327 DF, p-value: < 2.2e-16
BIC(fit2)
## [1] 1326.493
# which increases the performance of the model significantly
Since we only have two continuous covariates, we can try raising them to a higher power.
# add bmi^1.5; inside an lm formula, powers must be wrapped in I() and terms
# joined with +, not a comma (a comma silently passes the term as subset)
fit3 <- lm(logcharges ~ age + sex + bmi + children + smoker + as.factor(region) +
           smoker*bmi + as.numeric(smoker)*age + I(bmi^1.5), data = ds0)
summary(fit3)
##
## Call:
## lm(formula = logcharges ~ age + sex + bmi + children + smoker +
## as.factor(region) + smoker * bmi, data = ds0, subset = as.numeric(smoker) *
## age + bmi^1.5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.03798 -0.17402 -0.03470 0.05657 2.04224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.1763930 0.0720456 99.609 < 2e-16 ***
## age 0.0375651 0.0007998 46.967 < 2e-16 ***
## sex1 0.1190553 0.0224451 5.304 1.32e-07 ***
## bmi -0.0015026 0.0023231 -0.647 0.5179
## children 0.1225808 0.0093304 13.138 < 2e-16 ***
## smoker1 0.0513262 0.1383354 0.371 0.7107
## as.factor(region)northwest 0.0100724 0.0315101 0.320 0.7493
## as.factor(region)southeast -0.0649199 0.0337726 -1.922 0.0548 .
## as.factor(region)southwest -0.1401404 0.0330701 -4.238 2.41e-05 ***
## bmi:smoker1 0.0495246 0.0045698 10.837 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4012 on 1328 degrees of freedom
## Multiple R-squared: 0.8077, Adjusted R-squared: 0.8064
## F-statistic: 619.8 on 9 and 1328 DF, p-value: < 2.2e-16
BIC(fit3)
## [1] 1422.136
fit4 <- lm(logcharges ~ age + sex + bmi + children + smoker + as.factor(region) +
           smoker*bmi + as.numeric(smoker)*age + I(bmi^1.5) + I(age^1.5), data = ds0)
summary(fit4)
##
## Call:
## lm(formula = logcharges ~ age + sex + bmi + children + smoker +
## as.factor(region) + smoker * bmi, data = ds0, subset = as.numeric(smoker) *
## age + bmi^1.5 + age^1.5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.91636 -0.18144 -0.05810 0.02661 2.13811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.2645976 0.0841646 86.314 < 2e-16 ***
## age 0.0347150 0.0008966 38.720 < 2e-16 ***
## sex1 0.1085825 0.0249587 4.350 1.46e-05 ***
## bmi 0.0027220 0.0024320 1.119 0.2632
## children 0.0792971 0.0107489 7.377 2.84e-13 ***
## smoker1 0.1785370 0.1539089 1.160 0.2463
## as.factor(region)northwest -0.0821657 0.0375473 -2.188 0.0288 *
## as.factor(region)southeast -0.0423219 0.0348873 -1.213 0.2253
## as.factor(region)southwest -0.0817111 0.0350003 -2.335 0.0197 *
## bmi:smoker1 0.0419282 0.0048774 8.596 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4515 on 1328 degrees of freedom
## Multiple R-squared: 0.7465, Adjusted R-squared: 0.7447
## F-statistic: 434.4 on 9 and 1328 DF, p-value: < 2.2e-16
BIC(fit4)
## [1] 1738.278
fit5 <- lm(charges ~ age + sex + bmi + children + smoker + as.factor(region) +
           smoker*bmi + as.numeric(smoker)*age + I(bmi^1.5) + I(age^1.5), data = ds0)
summary(fit5)
##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker +
## as.factor(region) + smoker * bmi, data = ds0, subset = as.numeric(smoker) *
## age + bmi^1.5 + age^1.5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9501.1 -2050.8 -1444.9 -458.8 24540.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1730.677 911.048 -1.900 0.057695 .
## age 257.277 9.705 26.510 < 2e-16 ***
## sex1 1003.023 270.168 3.713 0.000214 ***
## bmi -3.801 26.325 -0.144 0.885205
## children 238.749 116.353 2.052 0.040372 *
## smoker1 -20194.774 1666.003 -12.122 < 2e-16 ***
## as.factor(region)northwest -921.164 406.435 -2.266 0.023584 *
## as.factor(region)southeast -335.442 377.641 -0.888 0.374564
## as.factor(region)southwest -641.983 378.864 -1.694 0.090406 .
## bmi:smoker1 1422.834 52.796 26.949 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4887 on 1328 degrees of freedom
## Multiple R-squared: 0.8247, Adjusted R-squared: 0.8235
## F-statistic: 694.2 on 9 and 1328 DF, p-value: < 2.2e-16
BIC(fit5)
## [1] 26597.19
Caution: this is not a typical model selection procedure, and AIC/BIC values are not comparable between models with different responses (logcharges vs. charges). But if we take off the log, the fit as measured by adjusted \(R^2\) certainly gets better.
Now that we have added all the covariates we reasonably can, we can consider dropping some of them. As we can see from the summary, it seems that the region parameter is not that important, so we only keep northeast from the region column.
new_ds0 <- ds0[which(ds0$region=="northeast"),]
#Recode: sex: female = 1, male = 0; smoker: yes = 1, no = 0
new_ds0$sex[new_ds0$sex == "male"]="0"
new_ds0$sex[new_ds0$sex == "female"]="1"
new_ds0$smoker[new_ds0$smoker == "no"]="0"
new_ds0$smoker[new_ds0$smoker == "yes"]="1"
# Now fit the model on the observations where region is northeast.
fit6 <- lm(charges ~age+sex+bmi+children+smoker+smoker*bmi+
as.numeric(smoker)*age+I(bmi^1.5)+I(age^1.5), data=new_ds0)
summary(fit6)
##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker +
## smoker * bmi + as.numeric(smoker) * age + I(bmi^1.5) + I(age^1.5),
## data = new_ds0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9406.3 -2174.0 -1538.2 -518.7 21804.4
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5132.11 8108.96 -0.633 0.52726
## age -99.62 266.41 -0.374 0.70870
## sex1 792.23 563.31 1.406 0.16060
## bmi 577.85 748.29 0.772 0.44056
## children 716.01 247.08 2.898 0.00402 **
## smoker1 -18454.30 3635.64 -5.076 6.62e-07 ***
## as.numeric(smoker) NA NA NA NA
## I(bmi^1.5) -58.07 91.07 -0.638 0.52417
## I(age^1.5) 36.39 28.46 1.279 0.20201
## bmi:smoker1 1444.18 115.75 12.477 < 2e-16 ***
## age:as.numeric(smoker) -45.95 52.15 -0.881 0.37901
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4997 on 314 degrees of freedom
## Multiple R-squared: 0.8084, Adjusted R-squared: 0.8029
## F-statistic: 147.2 on 9 and 314 DF, p-value: < 2.2e-16
AIC(fit6)
## [1] 6450.074
BIC(fit6)
## [1] 6491.662
The AIC and BIC get much smaller (though note that fit6 is fit on the northeast subset only, so the comparison with fit5 is loose).
Similarly, smoker*age is not that important either.
fit7 <- lm(charges ~age+sex+bmi+children+smoker+smoker*bmi+region+I(bmi^1.5)+I(age^1.5),data=new_ds0)
summary(fit7)
##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker +
## smoker * bmi + region + I(bmi^1.5) + I(age^1.5), data = new_ds0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11786.7 -1716.1 -1351.6 -665.2 30838.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7441.29 3664.90 -2.030 0.042513 *
## age -323.19 124.72 -2.591 0.009662 **
## sex1 509.88 263.69 1.934 0.053369 .
## bmi 1075.11 326.86 3.289 0.001031 **
## children 681.69 114.16 5.971 3.02e-09 ***
## smoker1 -20425.25 1630.46 -12.527 < 2e-16 ***
## region1 1036.80 309.38 3.351 0.000827 ***
## I(bmi^1.5) -126.75 38.87 -3.261 0.001138 **
## I(age^1.5) 62.57 13.29 4.707 2.77e-06 ***
## bmi:smoker1 1443.95 52.09 27.719 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4795 on 1328 degrees of freedom
## Multiple R-squared: 0.8443, Adjusted R-squared: 0.8432
## F-statistic: 800.1 on 9 and 1328 DF, p-value: < 2.2e-16
AIC(fit7)
## [1] 26488.94
BIC(fit7)
## [1] 26546.13
After dropping smoker*age, the AIC and BIC values stay essentially the same. For now, the best model would be: charges = constant + age + sex + bmi + children + smokeryes + northeastornot + sm_bm + \(bmi^{1.5}\) + \(age^{1.5}\)
Caution: as you may see, the coefficient of smoker1 is negative. Please don't take this to mean smoking lowers your costs. Since smoker also appears in the bmi:smoker1 interaction, and bmi is far from zero for everyone, the harmful influence of smoking has been transferred to the bmi:smoker1 term.
insurance = readtable('insurance.csv');
data = insurance;
%Convert the categorical columns to numeric values
X = data(:,1:6);
Y = data(:,7);
[GN, ~, G] = unique(X(:,2));
X.sex = G-1;
X.smoker = strcmp(X.smoker, 'yes');
head(X)
Result:
age sex bmi children smoker region
___ ___ ______ ________ ______ ___________
19 0 27.9 0 true 'southwest'
18 1 33.77 1 false 'southeast'
28 1 33 3 false 'southeast'
33 1 22.705 0 false 'northwest'
32 1 28.88 0 false 'northwest'
31 0 25.74 0 false 'southeast'
46 0 33.44 1 false 'southeast'
37 0 27.74 3 false 'northwest'
Now we are ready to look at the distribution of each column and select a strategy to estimate our model:
figure(1)
histogram(data.age,'BinWidth',5);
xlabel('Age');
ylabel('Frequency');
The distribution of age:
The distribution of sex:
figure(2)
C = categorical(X.sex,[0 1],{'Female','Male'});
histogram(C,'BarWidth',0.5);
xlabel('Sex');
ylabel('Frequency');
The distribution of bmi:
figure(3)
histogram(data.bmi);
xlabel('bmi');
ylabel('Frequency');
The distribution of Children:
figure(4)
histogram(data.children);
xlabel('children');
ylabel('Frequency');
The distribution of Smoker:
figure(5)
C0 = categorical(X.smoker,[0 1],{'No','Yes'});
histogram(C0, 'BarWidth',0.5);
xlabel('smoker');
ylabel('Frequency');
The distribution of Region:
figure(6)
C1 = categorical(data.region);
histogram(C1, 'BarWidth',0.5);
xlabel('Region');
ylabel('Frequency');
The distribution of Charges:
figure(7)
histogram(data.charges,'BinWidth',5000);
xlabel('charges');
ylabel('Frequency');
Since Y is the response and it has a heavy-tail distribution, one general way to fit it is to take the log:
Y.charges = log(Y.charges);
figure(8)
histogram(Y.charges,'BinWidth',0.5);
xlabel('ln-charges');
ylabel('Frequency');
The log charge histogram:
Now it has a bell-shaped distribution. To regress on the full dataset, we have to convert region into dummy variables:
X.southwest = ( strcmp(X.region ,'southwest'));
X.northwest = ( strcmp(X.region ,'northwest'));
X.southeast = ( strcmp(X.region ,'southeast'));
X = removevars(X,{'region'});
head(X)
Result
age sex bmi children smoker southwest northwest southeast
___ ___ ______ ________ ______ _________ _________ _________
19 0 27.9 0 true true false false
18 1 33.77 1 false false false true
28 1 33 3 false false false true
33 1 22.705 0 false false true false
32 1 28.88 0 false false true false
31 0 25.74 0 false false false true
46 0 33.44 1 false false false true
37 0 27.74 3 false false true false
Note: in this section each \(X_n\) corresponds to the \(n^{th}\) predictor in the following list: {age, sex, bmi, children, smoker, southwest, northwest, southeast, smokerbmi, smokerage}. We start by regressing on the full set of predictors:
x = X{:,:};
[m,p] = size(x);
y = Y{:,:};
n = length(y);
mdl = fitlm(x,y)
mdl.ModelCriterion
mdl =
Linear regression model:
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
Estimated Coefficients:
Estimate SE tStat pValue
_________ __________ _______ ___________
(Intercept) 7.0306 0.072396 97.112 0
x1 0.034582 0.00087205 39.655 1.3701e-227
x2 -0.075416 0.024401 -3.0907 0.0020384
x3 0.013375 0.002096 6.3811 2.4234e-10
x4 0.10186 0.010099 10.085 4.2411e-23
x5 1.5543 0.030279 51.333 1.1199e-317
x6 -0.12895 0.035027 -3.6815 0.00024116
x7 -0.063788 0.034906 -1.8274 0.06786
x8 -0.1572 0.035083 -4.4807 8.0776e-06
Number of observations: 1338, Error degrees of freedom: 1329
Root Mean Squared Error: 0.444
R-squared: 0.768, Adjusted R-Squared 0.767
F-statistic vs. constant model: 550, p-value = 0
AIC and BIC
AIC: 1.6350e+03
AICc: 1.6352e+03
BIC: 1.6818e+03
CAIC: 1.6908e+03
Plot the residual:
%Plot residual
figure(9)
plotResiduals(mdl);
%Normal probability plot
figure(10)
plotResiduals(mdl,'probability');
figure(11)
plotResiduals(mdl,'fitted');
It turns out that the residuals show some linear structure, so we have to modify our model.
We use the formula from Belsley, D. A., E. Kuh, and R. E. Welsch, Regression Diagnostics (Hoboken, NJ: John Wiley & Sons, 1980), as implemented in an open toolbox, to check the VIFs. It turns out they are bounded, and most of them are lower than 5.
function [V] = vif(X)
    R0 = corrcoef(X);    % correlation matrix of the predictors
    V = diag(inv(R0))';  % the diagonal of its inverse gives the VIFs
end
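This works because, for the correlation matrix \(R_0\) of the predictors, the \(j\)-th diagonal entry of its inverse equals the usual variance inflation factor:
\[ \left[R_0^{-1}\right]_{jj} = \frac{1}{1 - R_j^2} = VIF_j. \]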
Our first try is to add an interaction term to our model, smoker:bmi:
x1 = X;
x1.smokerbmi = x1.smoker.*x1.bmi;
x11 = x1{:,:};
[m1,p1] = size(x11);
mdl2 = fitlm(x11,y);
mdl2.ModelCriterion;
Regression Output
mdl2 =
Linear regression model:
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
Estimated Coefficients:
Estimate SE tStat pValue
_________ __________ _______ ___________
(Intercept) 7.3374 0.076673 95.697 0
x1 0.034795 0.00084289 41.281 2.4521e-240
x2 -0.087064 0.023607 -3.688 0.00023512
x3 0.003406 0.0022676 1.502 0.13334
x4 0.10315 0.0097594 10.569 3.9892e-25
x5 0.15642 0.146 1.0714 0.2842
x6 -0.13751 0.033856 -4.0617 5.1563e-05
x7 -0.071131 0.033735 -2.1085 0.035176
x8 -0.16273 0.033903 -4.7998 1.7681e-06
x9 0.045574 0.0046633 9.773 7.8431e-22
Number of observations: 1338, Error degrees of freedom: 1328
Root Mean Squared Error: 0.429
R-squared: 0.784, Adjusted R-Squared 0.782
F-statistic vs. constant model: 534, p-value = 0
AIC and BIC
AIC: 1.5441e+03
AICc: 1.5443e+03
BIC: 1.5961e+03
CAIC: 1.6061e+03
We see an increase in Adjusted R-square and a decrease in BIC, so we retain this new interactive predictor.
Our second try is to add an interactive predictor: smoker:age
x2 = X;
x2.smokerbmi = x2.smoker.*x2.bmi;
x2.smokerage = x2.smoker.*x2.age;
x12 = x2{:,:};
[m2,p2] = size(x12);
mdl3 = fitlm(x12,y);
mdl3.ModelCriterion;
Regression output
mdl3 =
Linear regression model:
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10
Estimated Coefficients:
Estimate SE tStat pValue
_________ __________ _______ ___________
(Intercept) 7.1326 0.069955 101.96 0
x1 0.041573 0.00084966 48.93 2.0913e-299
x2 -0.085616 0.021242 -4.0305 5.8833e-05
x3 0.0011516 0.0020444 0.56331 0.57332
x4 0.10633 0.0087834 12.106 4.5087e-32
x5 1.2793 0.14589 8.7689 5.423e-18
x6 -0.15093 0.030473 -4.9529 8.2515e-07
x7 -0.06492 0.030357 -2.1385 0.032657
x8 -0.14486 0.030523 -4.7459 2.3003e-06
x9 0.051037 0.0042074 12.13 3.4458e-32
x10 -0.033401 0.0018873 -17.698 4.3552e-63
Number of observations: 1338, Error degrees of freedom: 1327
Root Mean Squared Error: 0.386
R-squared: 0.825, Adjusted R-Squared 0.824
F-statistic vs. constant model: 625, p-value = 0
AIC and BIC
AIC: 1.2626e+03
AICc: 1.2628e+03
BIC: 1.3198e+03
CAIC: 1.3308e+03
We retain this predictor for a similar reason. Now, since there are only two continuous predictors, one useful idea is to raise their power, since we expect the continuous variables to explain more of the variation in the response than the logical ones. Add bmi^1.5:
x3 = x2;
x3.bmi = x3.bmi+x3.bmi.^1.5;  % fold the bmi^1.5 term into the bmi column
x13 = x3{:,:};
[m3,p3] = size(x13);
mdl4 = fitlm(x13,y);
mdl4.ModelCriterion;
Regression output
mdl4 =
Linear regression model:
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10
Estimated Coefficients:
Estimate SE tStat pValue
__________ __________ _______ ___________
(Intercept) 7.152 0.05642 126.76 0
x1 0.041601 0.00084936 48.979 8.7652e-300
x2 -0.085515 0.021244 -4.0254 6.0099e-05
x3 6.9951e-05 0.00021768 0.32136 0.74799
x4 0.10637 0.0087842 12.109 4.3441e-32
x5 1.2662 0.14588 8.6796 1.144e-17
x6 -0.15044 0.030469 -4.9375 8.9176e-07
x7 -0.064937 0.030361 -2.1388 0.032633
x8 -0.14327 0.030536 -4.6916 2.9908e-06
x9 0.051497 0.0042081 12.238 1.0536e-32
x10 -0.033429 0.0018876 -17.71 3.634e-63
Number of observations: 1338, Error degrees of freedom: 1327
Root Mean Squared Error: 0.386
R-squared: 0.825, Adjusted R-Squared 0.824
F-statistic vs. constant model: 625, p-value = 0
AIC and BIC
AIC: 1.2628e+03
AICc: 1.2630e+03
BIC: 1.3200e+03
CAIC: 1.3310e+03
Add age^1.5
x4 = x3;
x4.age = x4.age+x4.age.^1.5;  % fold the age^1.5 term into the age column
x14 = x4{:,:};
mdl5 = fitlm(x14,y);
mdl5.ModelCriterion;
Regression output
mdl5 =
Linear regression model:
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10
Estimated Coefficients:
Estimate SE tStat pValue
_________ __________ _______ ___________
(Intercept) 7.5956 0.05336 142.35 0
x1 0.0039719 8.3054e-05 47.824 5.3991e-291
x2 -0.08619 0.02157 -3.9958 6.8006e-05
x3 6.153e-05 0.00022107 0.27832 0.78081
x4 0.11671 0.0089159 13.091 6.4801e-37
x5 1.2569 0.14813 8.4854 5.6652e-17
x6 -0.15002 0.030937 -4.8492 1.3859e-06
x7 -0.065807 0.030828 -2.1347 0.032969
x8 -0.14376 0.031006 -4.6367 3.8901e-06
x9 0.051251 0.0042726 11.995 1.5076e-31
x10 -0.032981 0.0019157 -17.216 4.1606e-60
Number of observations: 1338, Error degrees of freedom: 1327
Root Mean Squared Error: 0.392
R-squared: 0.819, Adjusted R-Squared 0.818
F-statistic vs. constant model: 602, p-value = 0
AIC and BIC
AIC: 1.3036e+03
AICc: 1.3038e+03
BIC: 1.3608e+03
CAIC: 1.3718e+03
Now, plot the residuals against the fitted values. We find that, although some collinearity remains, fewer scatter points now lie on the linear line above.
Now, let’s try to fit previous preditors with the original response: Refit the model:
x_1 = x4;
x_11 = x_1{:,:};
y_1 = data.charges;
mdl6 = fitlm(x_11,y_1);
mdl6.ModelCriterion;
Regression output:
mdl6 =
Linear regression model:
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10
Estimated Coefficients:
Estimate SE tStat pValue
________ ______ ________ ___________
(Intercept) 773.81 656.17 1.1793 0.2385
x1 25.752 1.0213 25.215 6.1455e-115
x2 -500.25 265.25 -1.886 0.059515
x3 1.6724 2.7185 0.61517 0.53855
x4 582.65 109.64 5.3143 1.2549e-07
x5 -20379 1821.5 -11.188 7.8924e-28
x6 -1224.5 380.43 -3.2186 0.0013191
x7 -588.75 379.09 -1.5531 0.12064
x8 -1185.7 381.28 -3.1099 0.0019113
x9 1448.5 52.54 27.569 1.1927e-132
x10 -5.0746 23.558 -0.21541 0.82948
Number of observations: 1338, Error degrees of freedom: 1327
Root Mean Squared Error: 4.82e+03
R-squared: 0.843, Adjusted R-Squared 0.841
F-statistic vs. constant model: 710, p-value = 0
AIC and BIC
AIC: 2.6504e+04
AICc: 2.6504e+04
BIC: 2.6561e+04
CAIC: 2.6572e+04
We see an increase in both BIC and the adjusted R-squared value. So, if we are selecting the model with respect to adjusted \(R^2\), we should stop here; otherwise we continue to the next step. It turns out that region is not a very significant predictor, so we drop southwest and southeast and keep only northwest. Refit the model. Note that \(X_n\) now corresponds to {age, sex, bmi, children, smoker, northwest, smokerbmi, smokerage, age15, bmi15}:
x_2 = x2;
x_2.age15 = x_2.age.^1.5;
x_2.bmi15 = x_2.bmi.^1.5;
x_2 = removevars(x_2,{'southwest','southeast'});
x_12 = x_2{:,:};
y_1 = data.charges;
mdl7 = fitlm(x_12,y_1);
mdl7.ModelCriterion;
Regression result
mdl7 =
Linear regression model:
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10
Estimated Coefficients:
Estimate SE tStat pValue
________ ______ ________ ___________
(Intercept) -6048.4 3674.8 -1.6459 0.10002
x1 -320.92 125.36 -2.5601 0.010575
x2 -504.03 264.88 -1.9029 0.057269
x3 1020 328.62 3.104 0.0019494
x4 672.29 114.72 5.8603 5.8227e-09
x5 -20140 1818.8 -11.073 2.558e-27
x6 133.45 311.08 0.42899 0.668
x7 1439.6 52.452 27.446 1.0272e-131
x8 -3.7246 23.494 -0.15854 0.87406
x9 62.475 13.353 4.6786 3.1841e-06
x10 -121.14 39.114 -3.0972 0.0019944
Number of observations: 1338, Error degrees of freedom: 1327
Root Mean Squared Error: 4.82e+03
R-squared: 0.843, Adjusted R-Squared 0.842
F-statistic vs. constant model: 712, p-value = 0
AIC and BIC
AIC: 2.6500e+04
AICc: 2.6500e+04
BIC: 2.6557e+04
CAIC: 2.6568e+04
It turns out that smoker:age has a really high p-value, so we drop it. Now the predictors are:
{age, sex, bmi, children, smoker, northwest, smokerbmi, age15, bmi15}
x_3 = x_2;
x_3 = removevars(x_3,{'smokerage'});
x_13 = x_3{:,:};
y_1 = data.charges;
mdl8 = fitlm(x_13,y_1);
mdl8.ModelCriterion;
Regression result
mdl8 =
Linear regression model:
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
Estimated Coefficients:
Estimate SE tStat pValue
________ ______ _______ ___________
(Intercept) -6021.6 3669.5 -1.641 0.10104
x1 -321.58 125.24 -2.5678 0.010345
x2 -504.19 264.78 -1.9042 0.057097
x3 1019.8 328.49 3.1046 0.0019457
x4 671.94 114.66 5.8604 5.8154e-09
x5 -20265 1636.9 -12.38 2.1457e-33
x6 132.94 310.95 0.42753 0.66907
x7 1439 52.293 27.518 2.7926e-132
x8 62.466 13.348 4.6796 3.1676e-06
x9 -121.1 39.098 -3.0972 0.0019944
AIC and BIC
AIC: 2.6498e+04
AICc: 2.6498e+04
BIC: 2.6550e+04
CAIC: 2.6560e+04
After dropping smokerage, the AIC and BIC values are essentially unchanged (both slightly smaller). For now, the best model would be:
charges = constant + age + sex + bmi + children + smokeryes + northwestornot + sm_bm + \(bmi^{1.5}\) + \(age^{1.5}\)