Since the topic we chose is easy to state but involved in practice, the code and discussion for each programming language are rather long :(
The data we use is the Medical Cost Personal dataset, available on Kaggle (https://www.kaggle.com/mirichoi0218/insurance).
library(data.table)
setwd("~/Desktop/group_project/group_project/final_version/")
insurance <- fread("insurance.csv")
head(insurance)
## age sex bmi children smoker region charges
## 1: 19 female 27.900 0 yes southwest 16884.924
## 2: 18 male 33.770 1 no southeast 1725.552
## 3: 28 male 33.000 3 no southeast 4449.462
## 4: 33 male 22.705 0 no northwest 21984.471
## 5: 32 male 28.880 0 no northwest 3866.855
## 6: 31 female 25.740 0 no southeast 3756.622
age is the person's age. sex is the gender. bmi is the Body Mass Index (BMI). children is the number of children one has. smoker indicates whether the person smokes. region records which of four regions (southwest, southeast, northeast, northwest) one lives in. charges is the insurance cost in dollars. The data visualization is done in all three languages. In this tutorial, our goal is to estimate a linear model that predicts the insurance charges from the above dataset.
This is the normal equation, which the least squares coefficients solve: \(\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y\).
Select the model that maximizes the adjusted \(R^2\) value.
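For reference, with \(n\) observations and \(p\) predictors, the adjusted \(R^2\) penalizes model size:
\[ R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}. \]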
In each example, we start with data visualization. We then pick a strategy to scale the response, convert the categorical predictors to dummy variables, and regress on the full dataset. It turns out there is a clear linear pattern in the residuals, so we modify our model to reduce it. Starting from the original model, we add interaction predictors where two variables seem to have a joint effect on the response (for example, smoker:age), and then raise the power of the continuous predictors. This diagnostic process reduces the linear pattern in the residuals. Finally, we regress the selected predictors against the original response and perform model selection.
1. Xie, Y. Yihui Xie's knitr examples
2. A regression example on Kaggle
3. Faraway, J. 2009. Linear Models with R. University of British Columbia, Canada.
4. Shedden, K. Professor Shedden's notes on model diagnostics
5. Shedden, K. Professor Shedden's notes on model selection
The following packages are used in the Python version of the tutorial.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.regression.linear_model import OLS
from statsmodels.tools import add_constant
#read the data into a dataframe (read_csv already returns one)
ds0 = pd.read_csv('~/Desktop/506/groupproject/insurance.csv')
#have a glimpse of the data
print(ds0.head())
## age sex bmi children smoker region charges
## 0 19 female 27.900 0 yes southwest 16884.92400
## 1 18 male 33.770 1 no southeast 1725.55230
## 2 28 male 33.000 3 no southeast 4449.46200
## 3 33 male 22.705 0 no northwest 21984.47061
## 4 32 male 28.880 0 no northwest 3866.85520
print(ds0.columns)
## Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')
#check if there are any nulls
ds0.isnull().sum()
#no nulls :)
## age 0
## sex 0
## bmi 0
## children 0
## smoker 0
## region 0
## charges 0
## dtype: int64
# recode the character columns as 0/1 indicators
ds0.sex = ds0.sex.map({'female':1, 'male':0})
ds0.smoker = ds0.smoker.map({'yes':1, 'no':0})
ds_reg = pd.get_dummies(ds0['region'])
# with dummy variables, one column must be dropped to avoid perfect collinearity
ds0 = ds0.join(ds_reg.iloc[:,0:3])  # keeps northeast, northwest, southeast
ds0['logcharges'] = np.log(ds0.charges)
ds1 = ds0.drop(['charges','region'],axis=1)
# As we can see, the charges have a heavy right tail, so we take the log.
sns.distplot(ds0['charges'], color='b')
# Data in every other column looks fine right now.
f,ax = plt.subplots(2,3,figsize=(10,8))
sns.distplot(ds1["age"], kde=False, ax=ax[0,0])
sns.boxplot(x='sex',y='charges', data=ds0, ax=ax[0,1])
sns.distplot(ds1['logcharges'], ax=ax[0,2], kde=False, color='b')
# The logcharges are now approximately normally distributed.
sns.distplot(ds1['bmi'],ax=ax[1,0], kde=False, color='b')
sns.countplot('children',data=ds1, ax=ax[1,1])
sns.countplot('region',data=ds0, ax=ax[1,2])
ax[0,0].set_title('Distribution of Ages')
ax[0,1].set_title('Charges boxplot by Sex')
ax[0,2].set_title('Distribution of log charges')
ax[1,0].set_title('Distribution of bmi')
ax[1,1].set_title('Distribution of children')
ax[1,2].set_title('Distribution of regions')
Standardizing (centering and scaling) the covariates is actually not a must; it would not change the substance of the regression results. In this example, we won't do so.
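To see why: replacing a covariate \(x_j\) by \(\tilde{x}_j = (x_j - \bar{x}_j)/s_j\) merely rescales its slope to \(\tilde{\beta}_j = s_j\,\beta_j\) and shifts the intercept, leaving the fitted values, residuals, and \(R^2\) unchanged.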
def centralize(arra):
    # subtract the mean, then divide by the sample standard deviation
    cen = arra - np.mean(arra)
    sd = np.sqrt(np.sum(cen**2)/(len(arra)-1))
    return cen/sd

# applied to ds0 only; the regressions below use ds1, which is a separate copy
ds0.bmi = centralize(ds0.bmi)
ds0.age = centralize(ds0.age)
Firstly, we define the dependent variable y and the covariates X for future convenience.
y = ds1.logcharges
X = ds1.drop(['logcharges'], axis = 1)
#Conduct the first regression!
regr_1 = OLS(y, add_constant(X)).fit()
def variance_inflation_factor(exog, exog_idx):
    # regress column exog_idx on all the other columns and return 1/(1 - R^2)
    k_vars = exog.shape[1]
    x_i = exog.iloc[:, exog_idx]
    mask = np.arange(k_vars) != exog_idx
    x_noti = exog.iloc[:, mask]
    r_squared_i = OLS(x_i, x_noti).fit().rsquared
    return 1. / (1. - r_squared_i)
# We skip the constant column.
VIF = [variance_inflation_factor(add_constant(X), i) for i in range(1,X.shape[1]+1)]
print(VIF)
## [1.0168221490038107, 1.0089001621005733, 1.106629732428617, 1.004010642137024, 1.0120736649061481, 1.5262104138696806, 1.524748317955475, 1.5971027909244977]
The VIF of each column is fine: all are smaller than 5, in fact smaller than 2.
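For reference, the VIF computed above regresses predictor \(j\) on all the others and reports
\[ VIF_j = \frac{1}{1 - R_j^2}, \]
where \(R_j^2\) is the \(R^2\) of that auxiliary regression; values above 5 (or 10) are the usual warning signs.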
sns.distplot(regr_1.resid)
The residuals look approximately normal, which is good. Since they are already close to normal, a Box-Cox transformation of the response is not necessary; the quick check below confirms it changes little.
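For reference, the Box-Cox family of response transformations indexed by \(\lambda\) is \(y^{(\lambda)} = (y^{\lambda}-1)/\lambda\) for \(\lambda \neq 0\) and \(y^{(\lambda)} = \log y\) for \(\lambda = 0\); the quick probe below uses \(\lambda = 0.1\).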
lam = 0.1
# Box-Cox transform of the response with lambda = 0.1
regr_test = OLS((y**lam - 1)/lam, add_constant(X)).fit()
sns.jointplot((y**lam - 1)/lam, regr_test.resid)
sns.distplot(regr_test.resid)
If the regression works well, the residuals should show no systematic structure; with the Gaussian prerequisite, independence is what we expect.
sns.jointplot(y, regr_1.resid)
The residual plot looks very strange: y and the residuals are highly dependent. Perhaps the model was not linear in the first place. Since there is explicit nonlinearity in this model, we have to add some nonlinear covariates to it.
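As a complementary check (a sketch of ours, not part of the original analysis), the residuals can also be plotted against the fitted values, which should show no trend for an adequate mean model:
# residuals vs. fitted values; a visible trend signals a misspecified mean model
sns.jointplot(regr_1.fittedvalues, regr_1.resid)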
A partial residual plot attempts to show how a covariate is related to the dependent variable when we control for the effects of all the other covariates.
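Concretely, for covariate \(j\) the partial residual plot graphs
\[ \hat{\beta}_j x_{ij} + r_i \quad \text{against} \quad x_{ij}, \]
where \(r_i\) are the residuals of the full fit; this is exactly what the code below computes.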
f, ax = plt.subplots(1, 2, figsize=(10, 8))
# partial residuals beta_j * x_j + r, plotted against x_j
# (jointplot creates its own figure, so we draw on the subplot axes with regplot)
sns.regplot(X.bmi, regr_1.params.bmi * X.bmi + regr_1.resid, ax=ax[0])
sns.regplot(X.age, regr_1.params.age * X.age + regr_1.resid, ax=ax[1])
Partial residual plots show that both bmi and age could significantly affect the charges.
In this part, we try adding variables and dropping the ones that are useless. The primary concern is to add variables until the residuals are relatively independent of y.
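As a convenience, here is a small helper of ours (not part of the original code; it assumes the y, X, OLS, and add_constant objects defined above) that scores a candidate term by adjusted \(R^2\) and BIC:
def try_term(y, X_base, term=None, name='term'):
    # refit with an optional extra column and report (adjusted R^2, BIC)
    X_try = X_base.copy()
    if term is not None:
        X_try[name] = term
    fit = OLS(y, add_constant(X_try)).fit()
    return fit.rsquared_adj, fit.bic

print(try_term(y, X))                              # baseline
print(try_term(y, X, X.smoker * X.bmi, 'sm_bm'))   # with the smoker:bmi interaction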
print(regr_1.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: logcharges R-squared: 0.768
## Model: OLS Adj. R-squared: 0.767
## Method: Least Squares F-statistic: 549.8
## Date: Fri, 07 Dec 2018 Prob (F-statistic): 0.00
## Time: 02:24:25 Log-Likelihood: -808.52
## No. Observations: 1338 AIC: 1635.
## Df Residuals: 1329 BIC: 1682.
## Df Model: 8
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 6.8262 0.076 90.168 0.000 6.678 6.975
## age 0.0346 0.001 39.655 0.000 0.033 0.036
## sex 0.0754 0.024 3.091 0.002 0.028 0.123
## bmi 0.0134 0.002 6.381 0.000 0.009 0.017
## children 0.1019 0.010 10.085 0.000 0.082 0.122
## smoker 1.5543 0.030 51.333 0.000 1.495 1.614
## northeast 0.1290 0.035 3.681 0.000 0.060 0.198
## northwest 0.0652 0.035 1.863 0.063 -0.003 0.134
## southeast -0.0282 0.034 -0.819 0.413 -0.096 0.039
## ==============================================================================
## Omnibus: 463.882 Durbin-Watson: 2.046
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 1673.760
## Skew: 1.679 Prob(JB): 0.00
## Kurtosis: 7.330 Cond. No. 330.
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
logcharges = constant + age + sex + bmi + children + smoker + northeast + northwest + southeast
X_2 = X.copy()  # work on a copy so X stays intact
X_2['sm_bm'] = X_2.smoker * X_2.bmi  # smoker:bmi interaction
regr_test = OLS(y, add_constant(X_2)).fit()
print(regr_test.summary())
# adding smoker:bmi clearly improves the model
## OLS Regression Results
## ==============================================================================
## Dep. Variable: logcharges R-squared: 0.784
## Model: OLS Adj. R-squared: 0.782
## Method: Least Squares F-statistic: 534.0
## Date: Fri, 07 Dec 2018 Prob (F-statistic): 0.00
## Time: 02:24:25 Log-Likelihood: -762.05
## No. Observations: 1338 AIC: 1544.
## Df Residuals: 1328 BIC: 1596.
## Df Model: 9
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 7.1128 0.079 90.254 0.000 6.958 7.267
## age 0.0348 0.001 41.281 0.000 0.033 0.036
## sex 0.0871 0.024 3.688 0.000 0.041 0.133
## bmi 0.0034 0.002 1.502 0.133 -0.001 0.008
## children 0.1031 0.010 10.569 0.000 0.084 0.122
## smoker 0.1564 0.146 1.071 0.284 -0.130 0.443
## northeast 0.1375 0.034 4.062 0.000 0.071 0.204
## northwest 0.0664 0.034 1.964 0.050 8.84e-05 0.133
## southeast -0.0252 0.033 -0.757 0.449 -0.091 0.040
## sm_bm 0.0456 0.005 9.773 0.000 0.036 0.055
## ==============================================================================
## Omnibus: 520.045 Durbin-Watson: 2.036
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 2150.321
## Skew: 1.846 Prob(JB): 0.00
## Kurtosis: 7.994 Cond. No. 661.
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
X_2['sm_ag'] = X_2.smoker*X_2.age
regr_test = OLS(y, add_constant(X_2)).fit()
print(regr_test.summary())
# adding smoker:age increases the performance of the model significantly
## OLS Regression Results
## ==============================================================================
## Dep. Variable: logcharges R-squared: 0.825
## Model: OLS Adj. R-squared: 0.824
## Method: Least Squares F-statistic: 625.0
## Date: Fri, 07 Dec 2018 Prob (F-statistic): 0.00
## Time: 02:24:25 Log-Likelihood: -620.29
## No. Observations: 1338 AIC: 1263.
## Df Residuals: 1327 BIC: 1320.
## Df Model: 10
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 6.8960 0.072 95.828 0.000 6.755 7.037
## age 0.0416 0.001 48.930 0.000 0.040 0.043
## sex 0.0856 0.021 4.030 0.000 0.044 0.127
## bmi 0.0012 0.002 0.563 0.573 -0.003 0.005
## children 0.1063 0.009 12.106 0.000 0.089 0.124
## smoker 1.2793 0.146 8.769 0.000 0.993 1.566
## northeast 0.1509 0.030 4.953 0.000 0.091 0.211
## northwest 0.0860 0.030 2.827 0.005 0.026 0.146
## southeast 0.0061 0.030 0.202 0.840 -0.053 0.065
## sm_bm 0.0510 0.004 12.130 0.000 0.043 0.059
## sm_ag -0.0334 0.002 -17.698 0.000 -0.037 -0.030
## ==============================================================================
## Omnibus: 860.664 Durbin-Watson: 1.998
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 8198.732
## Skew: 2.969 Prob(JB): 0.00
## Kurtosis: 13.574 Cond. No. 744.
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Since we only have two continuous covariates, we can try raising them to a higher power.
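The idea: adding a fractional power lets the effect of a continuous covariate bend rather than stay a straight line; for bmi, the mean model gains a term
\[ \cdots + \beta_1\,\mathrm{bmi} + \beta_2\,\mathrm{bmi}^{1.5} + \cdots \]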
X_2['bmi^1.5'] = X_2.bmi ** 1.5
regr_test = OLS(y, add_constant(X_2)).fit()
print(regr_test.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: logcharges R-squared: 0.826
## Model: OLS Adj. R-squared: 0.825
## Method: Least Squares F-statistic: 574.1
## Date: Fri, 07 Dec 2018 Prob (F-statistic): 0.00
## Time: 02:24:25 Log-Likelihood: -614.16
## No. Observations: 1338 AIC: 1252.
## Df Residuals: 1326 BIC: 1315.
## Df Model: 11
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 5.9730 0.274 21.823 0.000 5.436 6.510
## age 0.0414 0.001 48.940 0.000 0.040 0.043
## sex 0.0859 0.021 4.060 0.000 0.044 0.127
## bmi 0.0928 0.026 3.528 0.000 0.041 0.144
## children 0.1065 0.009 12.175 0.000 0.089 0.124
## smoker 1.2759 0.145 8.782 0.000 0.991 1.561
## northeast 0.1564 0.030 5.147 0.000 0.097 0.216
## northwest 0.0848 0.030 2.797 0.005 0.025 0.144
## southeast 0.0149 0.030 0.497 0.619 -0.044 0.074
## sm_bm 0.0513 0.004 12.243 0.000 0.043 0.060
## sm_ag -0.0335 0.002 -17.815 0.000 -0.037 -0.030
## bmi^1.5 -0.0110 0.003 -3.494 0.000 -0.017 -0.005
## ==============================================================================
## Omnibus: 861.038 Durbin-Watson: 1.995
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 8243.481
## Skew: 2.968 Prob(JB): 0.00
## Kurtosis: 13.612 Cond. No. 4.89e+03
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 4.89e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
X_2['age^1.5'] = X_2.age ** 1.5
regr_test = OLS(y, add_constant(X_2)).fit()
print(regr_test.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: logcharges R-squared: 0.828
## Model: OLS Adj. R-squared: 0.827
## Method: Least Squares F-statistic: 532.2
## Date: Fri, 07 Dec 2018 Prob (F-statistic): 0.00
## Time: 02:24:25 Log-Likelihood: -607.51
## No. Observations: 1338 AIC: 1241.
## Df Residuals: 1325 BIC: 1309.
## Df Model: 12
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 5.5668 0.294 18.907 0.000 4.989 6.144
## age 0.0776 0.010 7.785 0.000 0.058 0.097
## sex 0.0855 0.021 4.060 0.000 0.044 0.127
## bmi 0.0919 0.026 3.508 0.000 0.041 0.143
## children 0.0966 0.009 10.583 0.000 0.079 0.114
## smoker 1.2704 0.145 8.784 0.000 0.987 1.554
## northeast 0.1566 0.030 5.179 0.000 0.097 0.216
## northwest 0.0858 0.030 2.844 0.005 0.027 0.145
## southeast 0.0148 0.030 0.497 0.619 -0.044 0.073
## sm_bm 0.0514 0.004 12.326 0.000 0.043 0.060
## sm_ag -0.0335 0.002 -17.881 0.000 -0.037 -0.030
## bmi^1.5 -0.0108 0.003 -3.466 0.001 -0.017 -0.005
## age^1.5 -0.0039 0.001 -3.639 0.000 -0.006 -0.002
## ==============================================================================
## Omnibus: 895.912 Durbin-Watson: 1.992
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 9255.812
## Skew: 3.102 Prob(JB): 0.00
## Kurtosis: 14.293 Cond. No. 9.48e+03
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 9.48e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
Finally, as outlined at the start, we regress the selected predictors against the original (untransformed) charges.
y_2 = ds0.charges
regr_test = OLS(y_2, add_constant(X_2)).fit()
print(regr_test.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: charges R-squared: 0.845
## Model: OLS Adj. R-squared: 0.843
## Method: Least Squares F-statistic: 600.3
## Date: Fri, 07 Dec 2018 Prob (F-statistic): 0.00
## Time: 02:24:25 Log-Likelihood: -13232.
## No. Observations: 1338 AIC: 2.649e+04
## Df Residuals: 1325 BIC: 2.656e+04
## Df Model: 12
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const -7665.5804 3687.192 -2.079 0.038 -1.49e+04 -432.209
## age -320.3823 124.794 -2.567 0.010 -565.197 -75.567
## sex 509.3387 263.692 1.932 0.054 -7.960 1026.637
## bmi 1056.6806 328.082 3.221 0.001 413.063 1700.298
## children 678.7408 114.266 5.940 0.000 454.579 902.902
## smoker -2.029e+04 1811.135 -11.200 0.000 -2.38e+04 -1.67e+04
## northeast 1288.7634 378.774 3.402 0.001 545.701 2031.825
## northwest 616.4165 377.748 1.632 0.103 -124.634 1357.467
## southeast 122.5000 374.253 0.327 0.743 -611.693 856.693
## sm_bm 1444.5800 52.238 27.654 0.000 1342.101 1547.059
## sm_ag -3.7854 23.430 -0.162 0.872 -49.750 42.179
## bmi^1.5 -123.8640 39.080 -3.170 0.002 -200.529 -47.199
## age^1.5 62.3346 13.294 4.689 0.000 36.256 88.413
## ==============================================================================
## Omnibus: 737.144 Durbin-Watson: 2.068
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 4743.381
## Skew: 2.581 Prob(JB): 0.00
## Kurtosis: 10.645 Cond. No. 9.48e+03
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 9.48e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
Now that we have added all the covariates we reasonably can, we can consider dropping some of them. As we can see from the summary, the region parameters do not seem that important.
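For comparison, here is a minimal greedy backward-elimination sketch by BIC (our illustration, not the tutorial's procedure, which drops terms by hand; it assumes the OLS and add_constant imports above):
def backward_bic(y, X):
    # greedily drop the column whose removal lowers BIC, until no drop helps
    cols = list(X.columns)
    best = OLS(y, add_constant(X[cols])).fit().bic
    improved = True
    while improved and len(cols) > 1:
        improved = False
        for c in list(cols):
            trial = [k for k in cols if k != c]
            bic = OLS(y, add_constant(X[trial])).fit().bic
            if bic < best:
                best, cols, improved = bic, trial, True
    return cols

print(backward_bic(y_2, X_2))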
X_3 = X_2.drop(columns = ['northwest', 'southeast'])
regr_3 = OLS(y_2, add_constant(X_3)).fit()
print(regr_3.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: charges R-squared: 0.844
## Model: OLS Adj. R-squared: 0.843
## Method: Least Squares F-statistic: 719.5
## Date: Fri, 07 Dec 2018 Prob (F-statistic): 0.00
## Time: 02:24:25 Log-Likelihood: -13233.
## No. Observations: 1338 AIC: 2.649e+04
## Df Residuals: 1327 BIC: 2.655e+04
## Df Model: 10
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const -7462.5827 3670.004 -2.033 0.042 -1.47e+04 -262.941
## age -322.6577 124.831 -2.585 0.010 -567.546 -77.770
## sex 509.7468 263.786 1.932 0.054 -7.736 1027.230
## bmi 1075.2760 326.981 3.288 0.001 433.821 1716.732
## children 681.9706 114.227 5.970 0.000 457.886 906.056
## smoker -2.032e+04 1811.634 -11.219 0.000 -2.39e+04 -1.68e+04
## northeast 1036.4906 309.503 3.349 0.001 429.323 1643.658
## sm_bm 1444.4375 52.251 27.644 0.000 1341.934 1546.941
## sm_ag -3.0043 23.396 -0.128 0.898 -48.902 42.893
## bmi^1.5 -126.7912 38.882 -3.261 0.001 -203.068 -50.514
## age^1.5 62.5812 13.298 4.706 0.000 36.495 88.668
## ==============================================================================
## Omnibus: 740.495 Durbin-Watson: 2.062
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 4801.892
## Skew: 2.592 Prob(JB): 0.00
## Kurtosis: 10.698 Cond. No. 9.43e+03
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 9.43e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
The AIC and BIC actually get smaller.
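For reference, with maximized log-likelihood \(\ell\), \(k\) estimated parameters, and \(n\) observations,
\[ AIC = -2\ell + 2k, \qquad BIC = -2\ell + k \log n, \]
so smaller values indicate a better fit-complexity trade-off, with BIC penalizing size more heavily.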
Similarly, smoker:age is not that important either.
X_4 = X_3.drop(columns = ['sm_ag'])
regr_4 = OLS(y_2, add_constant(X_4)).fit()
print(regr_4.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: charges R-squared: 0.844
## Model: OLS Adj. R-squared: 0.843
## Method: Least Squares F-statistic: 800.1
## Date: Fri, 07 Dec 2018 Prob (F-statistic): 0.00
## Time: 02:24:26 Log-Likelihood: -13233.
## No. Observations: 1338 AIC: 2.649e+04
## Df Residuals: 1328 BIC: 2.654e+04
## Df Model: 9
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const -7441.2915 3664.898 -2.030 0.043 -1.46e+04 -251.670
## age -323.1930 124.715 -2.591 0.010 -567.854 -78.532
## sex 509.8762 263.686 1.934 0.053 -7.411 1027.163
## bmi 1075.1073 326.857 3.289 0.001 433.895 1716.320
## children 681.6865 114.163 5.971 0.000 457.727 905.646
## smoker -2.043e+04 1630.461 -12.527 0.000 -2.36e+04 -1.72e+04
## northeast 1036.8028 309.378 3.351 0.001 429.879 1643.726
## sm_bm 1443.9491 52.093 27.719 0.000 1341.755 1546.143
## bmi^1.5 -126.7498 38.866 -3.261 0.001 -202.996 -50.504
## age^1.5 62.5735 13.292 4.707 0.000 36.497 88.650
## ==============================================================================
## Omnibus: 740.467 Durbin-Watson: 2.063
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 4799.401
## Skew: 2.592 Prob(JB): 0.00
## Kurtosis: 10.695 Cond. No. 9.41e+03
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 9.41e+03. This might indicate that there are
## strong multicollinearity or other numerical problems.
After the model selection part, we can now examine the residuals:
sns.residplot(regr_4.fittedvalues, y_2)  # fitted values against the raw response
This is much more acceptable than before.
For now, the best model would be:
charges = constant + age + sex + bmi + children + smokeryes + northeastornot + sm_bm + \(bmi^{1.5}\) + \(age^{1.5}\)
The coefficients are given in the table above. The adjusted \(R^{2}\) is 0.843, much better than the 0.767 of the original model. So far, we have solved the residual problem and increased the adjusted \(R^{2}\) significantly: a fairly complete model diagnostics and model selection process.
Caution: as you may see, the coefficient of smoker is negative. Please don't take this to mean smoking lowers your costs. Since smoker also appears in the smoker:bmi interaction, and bmi is far from zero for everyone, the harmful influence of smoking has been transferred to the sm_bm term.
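A quick back-of-the-envelope check (our sketch, using the regr_4 fit above): for a smoker, the net effect combines the main effect and the interaction, and it is strongly positive at any realistic bmi:
b = regr_4.params
# net smoking effect at bmi = 30: about -20430 + 1444*30, i.e. roughly +22900 dollars
print(b['smoker'] + b['sm_bm'] * 30)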
The following packages are used in the R version of the tutorial.
# libraries--------------------------------------------------------------------
library(faraway)
library(readr)
library(dplyr)
# Load data--------------------------------------------------------------------
ds0 <- read_csv("insurance.csv")
## Parsed with column specification:
## cols(
## age = col_integer(),
## sex = col_character(),
## bmi = col_double(),
## children = col_integer(),
## smoker = col_character(),
## region = col_character(),
## charges = col_double()
## )
names(ds0)
## [1] "age" "sex" "bmi" "children" "smoker" "region"
## [7] "charges"
#have a glimpse of the data
head(ds0)
## # A tibble: 6 x 7
## age sex bmi children smoker region charges
## <int> <chr> <dbl> <int> <chr> <chr> <dbl>
## 1 19 female 27.9 0 yes southwest 16885.
## 2 18 male 33.8 1 no southeast 1726.
## 3 28 male 33 3 no southeast 4449.
## 4 33 male 22.7 0 no northwest 21984.
## 5 32 male 28.9 0 no northwest 3867.
## 6 31 female 25.7 0 no southeast 3757.
# Check if there are any nulls--------------------------------------------------
colSums(is.na(ds0))
No missing values; we're good.
# Recode: sex: female = 1, male = 0; smoker: yes = 1, no = 0
ds0$sex[ds0$sex == "male"]="0"
ds0$sex[ds0$sex == "female"]="1"
ds0$smoker[ds0$smoker == "no"]="0"
ds0$smoker[ds0$smoker == "yes"]="1"
# Examining the distribution of each variables---------------------------------
hist(ds0$age,xlab="Age", main="Distribution of Age")
hist(ds0$bmi,xlab="BMI", main="Distribution of BMI")
hist(ds0$children,xlab="Children", main="Distribution of Children")
hist(ds0$charges,xlab="Charges", main="Distribution of Charges")
# Take log of charges since it has a heavy tail--------------------------------
ds0$logcharges <- log(ds0$charges + 1)
hist(ds0$logcharges, breaks = 10)
Firstly, we define the dependent variable y and the covariates X. In this analysis, the response is logcharges and the predictors are age, sex, bmi, children, smoker, and region.
# Conduct the first regression!
fit0 <- lm(logcharges ~ age + sex + bmi + children + smoker + as.factor(region), data = ds0)
# Check the variance inflation factors: all are below 5, most below 2.
X <- model.matrix(fit0)[, -1]
round(vif(X), 2)
## age sex1
## 1.02 1.01
## bmi children
## 1.11 1.00
## smoker1 as.factor(region)northwest
## 1.01 1.52
## as.factor(region)southeast as.factor(region)southwest
## 1.65 1.53
hist(fit0$residuals, xlab="Residuals")
plot(fit0$res, xlab="Residuals")
abline(h=0) # acting like normal, so it's good
plot(fit0)  # standard diagnostic plots
# Partial residual plots-------------------------------------------------------
# These attempt to show how a covariate is related to the dependent variable
# when we control for the effects of all the other covariates.
# The partial residual plots look acceptable.
fit <- lm(logcharges~ bmi, data=ds0)
plot(fit)
In this part, we try adding variables and dropping the ones that are useless. The primary concern is to add variables until the residuals are relatively independent of y.
fit0 <- lm(logcharges ~age+sex+bmi+children+smoker+as.factor(region), data=ds0)
summary(fit0)
##
## Call:
## lm(formula = logcharges ~ age + sex + bmi + children + smoker +
## as.factor(region), data = ds0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.07141 -0.19836 -0.04914 0.06601 2.16602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.9556815 0.0732494 94.959 < 2e-16 ***
## age 0.0345744 0.0008719 39.655 < 2e-16 ***
## sex1 0.0753873 0.0243966 3.090 0.002043 **
## bmi 0.0133740 0.0020956 6.382 2.41e-10 ***
## children 0.1018259 0.0100976 10.084 < 2e-16 ***
## smoker1 1.5541456 0.0302739 51.336 < 2e-16 ***
## as.factor(region)northwest -0.0637739 0.0348992 -1.827 0.067867 .
## as.factor(region)southeast -0.1571539 0.0350762 -4.480 8.09e-06 ***
## as.factor(region)southwest -0.1289211 0.0350206 -3.681 0.000241 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4442 on 1329 degrees of freedom
## Multiple R-squared: 0.768, Adjusted R-squared: 0.7666
## F-statistic: 549.8 on 8 and 1329 DF, p-value: < 2.2e-16
AIC(fit0)
## [1] 1636.536
BIC(fit0)
## [1] 1688.526
fit1 <-lm(logcharges ~age+sex+bmi+children+smoker+as.factor(region)+smoker*bmi, data=ds0)
summary(fit1)
##
## Call:
## lm(formula = logcharges ~ age + sex + bmi + children + smoker +
## as.factor(region) + smoker * bmi, data = ds0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.91486 -0.17764 -0.05144 0.04640 2.22282
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.2508346 0.0769477 94.231 < 2e-16 ***
## age 0.0347880 0.0008427 41.280 < 2e-16 ***
## sex1 0.0870349 0.0236027 3.688 0.000236 ***
## bmi 0.0034055 0.0022672 1.502 0.133309
## children 0.1031176 0.0097574 10.568 < 2e-16 ***
## smoker1 0.1562914 0.1459708 1.071 0.284497
## as.factor(region)northwest -0.0711167 0.0337288 -2.108 0.035176 *
## as.factor(region)southeast -0.1626838 0.0338962 -4.799 1.77e-06 ***
## as.factor(region)southwest -0.1374810 0.0338491 -4.062 5.16e-05 ***
## bmi:smoker1 0.0455727 0.0046624 9.775 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4292 on 1328 degrees of freedom
## Multiple R-squared: 0.7835, Adjusted R-squared: 0.7821
## F-statistic: 534.1 on 9 and 1328 DF, p-value: < 2.2e-16
BIC(fit1)
## [1] 1602.769
# which certainly improves the performance of the model
fit2 <-lm(logcharges ~age+sex+bmi+children+smoker+as.factor(region)+smoker*bmi+as.numeric(smoker)*age, data=ds0)
summary(fit2)
##
## Call:
## lm(formula = logcharges ~ age + sex + bmi + children + smoker +
## as.factor(region) + smoker * bmi + as.numeric(smoker) * age,
## data = ds0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65395 -0.15404 -0.06305 -0.01126 2.34914
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0475291 0.0701861 100.412 < 2e-16 ***
## age 0.0415643 0.0008495 48.928 < 2e-16 ***
## sex1 0.0855867 0.0212384 4.030 5.90e-05 ***
## bmi 0.0011518 0.0020440 0.563 0.5732
## children 0.1062970 0.0087818 12.104 < 2e-16 ***
## smoker1 1.2788788 0.1458651 8.768 < 2e-16 ***
## as.factor(region)northwest -0.0649078 0.0303520 -2.139 0.0327 *
## as.factor(region)southeast -0.1448205 0.0305173 -4.746 2.30e-06 ***
## as.factor(region)southwest -0.1508956 0.0304676 -4.953 8.26e-07 ***
## as.numeric(smoker) NA NA NA NA
## bmi:smoker1 0.0510336 0.0042067 12.132 < 2e-16 ***
## age:as.numeric(smoker) -0.0333924 0.0018870 -17.696 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3862 on 1327 degrees of freedom
## Multiple R-squared: 0.8249, Adjusted R-squared: 0.8235
## F-statistic: 625 on 10 and 1327 DF, p-value: < 2.2e-16
BIC(fit2)
## [1] 1326.493
# which increases the performance of the model significantly
Since we only have two continuous covariates, we can try raising them to a higher power.
# add bmi^1.5; inside an lm formula, powers must be wrapped in I() and terms
# joined with +, not a comma (a comma silently passes the term as subset)
fit3 <- lm(logcharges ~ age + sex + bmi + children + smoker + as.factor(region) +
           smoker*bmi + as.numeric(smoker)*age + I(bmi^1.5), data = ds0)
summary(fit3)
##
## Call:
## lm(formula = logcharges ~ age + sex + bmi + children + smoker +
## as.factor(region) + smoker * bmi, data = ds0, subset = as.numeric(smoker) *
## age + bmi^1.5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.03798 -0.17402 -0.03470 0.05657 2.04224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.1763930 0.0720456 99.609 < 2e-16 ***
## age 0.0375651 0.0007998 46.967 < 2e-16 ***
## sex1 0.1190553 0.0224451 5.304 1.32e-07 ***
## bmi -0.0015026 0.0023231 -0.647 0.5179
## children 0.1225808 0.0093304 13.138 < 2e-16 ***
## smoker1 0.0513262 0.1383354 0.371 0.7107
## as.factor(region)northwest 0.0100724 0.0315101 0.320 0.7493
## as.factor(region)southeast -0.0649199 0.0337726 -1.922 0.0548 .
## as.factor(region)southwest -0.1401404 0.0330701 -4.238 2.41e-05 ***
## bmi:smoker1 0.0495246 0.0045698 10.837 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4012 on 1328 degrees of freedom
## Multiple R-squared: 0.8077, Adjusted R-squared: 0.8064
## F-statistic: 619.8 on 9 and 1328 DF, p-value: < 2.2e-16
BIC(fit3)
## [1] 1422.136
fit4 <- lm(logcharges ~ age + sex + bmi + children + smoker + as.factor(region) +
           smoker*bmi + as.numeric(smoker)*age + I(bmi^1.5) + I(age^1.5), data = ds0)
summary(fit4)
##
## Call:
## lm(formula = logcharges ~ age + sex + bmi + children + smoker +
## as.factor(region) + smoker * bmi, data = ds0, subset = as.numeric(smoker) *
## age + bmi^1.5 + age^1.5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.91636 -0.18144 -0.05810 0.02661 2.13811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.2645976 0.0841646 86.314 < 2e-16 ***
## age 0.0347150 0.0008966 38.720 < 2e-16 ***
## sex1 0.1085825 0.0249587 4.350 1.46e-05 ***
## bmi 0.0027220 0.0024320 1.119 0.2632
## children 0.0792971 0.0107489 7.377 2.84e-13 ***
## smoker1 0.1785370 0.1539089 1.160 0.2463
## as.factor(region)northwest -0.0821657 0.0375473 -2.188 0.0288 *
## as.factor(region)southeast -0.0423219 0.0348873 -1.213 0.2253
## as.factor(region)southwest -0.0817111 0.0350003 -2.335 0.0197 *
## bmi:smoker1 0.0419282 0.0048774 8.596 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4515 on 1328 degrees of freedom
## Multiple R-squared: 0.7465, Adjusted R-squared: 0.7447
## F-statistic: 434.4 on 9 and 1328 DF, p-value: < 2.2e-16
BIC(fit4)
## [1] 1738.278
fit5 <- lm(charges ~ age + sex + bmi + children + smoker + as.factor(region) +
           smoker*bmi + as.numeric(smoker)*age + I(bmi^1.5) + I(age^1.5), data = ds0)
summary(fit5)
##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker +
## as.factor(region) + smoker * bmi, data = ds0, subset = as.numeric(smoker) *
## age + bmi^1.5 + age^1.5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9501.1 -2050.8 -1444.9 -458.8 24540.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1730.677 911.048 -1.900 0.057695 .
## age 257.277 9.705 26.510 < 2e-16 ***
## sex1 1003.023 270.168 3.713 0.000214 ***
## bmi -3.801 26.325 -0.144 0.885205
## children 238.749 116.353 2.052 0.040372 *
## smoker1 -20194.774 1666.003 -12.122 < 2e-16 ***
## as.factor(region)northwest -921.164 406.435 -2.266 0.023584 *
## as.factor(region)southeast -335.442 377.641 -0.888 0.374564
## as.factor(region)southwest -641.983 378.864 -1.694 0.090406 .
## bmi:smoker1 1422.834 52.796 26.949 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4887 on 1328 degrees of freedom
## Multiple R-squared: 0.8247, Adjusted R-squared: 0.8235
## F-statistic: 694.2 on 9 and 1328 DF, p-value: < 2.2e-16
BIC(fit5)
## [1] 26597.19
Caution: this is not a typical model selection procedure, and AIC/BIC values are not comparable between models with different responses (logcharges vs. charges). But if we take off the log, the fit as measured by adjusted \(R^2\) certainly gets better.
Now that we have added all the covariates we reasonably can, we can consider dropping some of them. As we can see from the summary, it seems that the region parameter is not that important, so we only keep northeast from the region column.
new_ds0 <- ds0[which(ds0$region=="northeast"),]
#Recode: sex: female = 1, male = 0; smoker: yes = 1, no = 0
new_ds0$sex[new_ds0$sex == "male"]="0"
new_ds0$sex[new_ds0$sex == "female"]="1"
new_ds0$smoker[new_ds0$smoker == "no"]="0"
new_ds0$smoker[new_ds0$smoker == "yes"]="1"
# Now fit the model on the observations where region is northeast.
fit6 <- lm(charges ~age+sex+bmi+children+smoker+smoker*bmi+
as.numeric(smoker)*age+I(bmi^1.5)+I(age^1.5), data=new_ds0)
summary(fit6)
##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker +
## smoker * bmi + as.numeric(smoker) * age + I(bmi^1.5) + I(age^1.5),
## data = new_ds0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9406.3 -2174.0 -1538.2 -518.7 21804.4
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5132.11 8108.96 -0.633 0.52726
## age -99.62 266.41 -0.374 0.70870
## sex1 792.23 563.31 1.406 0.16060
## bmi 577.85 748.29 0.772 0.44056
## children 716.01 247.08 2.898 0.00402 **
## smoker1 -18454.30 3635.64 -5.076 6.62e-07 ***
## as.numeric(smoker) NA NA NA NA
## I(bmi^1.5) -58.07 91.07 -0.638 0.52417
## I(age^1.5) 36.39 28.46 1.279 0.20201
## bmi:smoker1 1444.18 115.75 12.477 < 2e-16 ***
## age:as.numeric(smoker) -45.95 52.15 -0.881 0.37901
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4997 on 314 degrees of freedom
## Multiple R-squared: 0.8084, Adjusted R-squared: 0.8029
## F-statistic: 147.2 on 9 and 314 DF, p-value: < 2.2e-16
AIC(fit6)
## [1] 6450.074
BIC(fit6)
## [1] 6491.662
The AIC and BIC get much smaller (though note that fit6 is fit on the northeast subset only, so the comparison with fit5 is loose).
Similarly, smoker*age is not that important either.
fit7 <- lm(charges ~age+sex+bmi+children+smoker+smoker*bmi+region+I(bmi^1.5)+I(age^1.5),data=new_ds0)
summary(fit7)
##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker +
## smoker * bmi + region + I(bmi^1.5) + I(age^1.5), data = new_ds0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11786.7 -1716.1 -1351.6 -665.2 30838.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7441.29 3664.90 -2.030 0.042513 *
## age -323.19 124.72 -2.591 0.009662 **
## sex1 509.88 263.69 1.934 0.053369 .
## bmi 1075.11 326.86 3.289 0.001031 **
## children 681.69 114.16 5.971 3.02e-09 ***
## smoker1 -20425.25 1630.46 -12.527 < 2e-16 ***
## region1 1036.80 309.38 3.351 0.000827 ***
## I(bmi^1.5) -126.75 38.87 -3.261 0.001138 **
## I(age^1.5) 62.57 13.29 4.707 2.77e-06 ***
## bmi:smoker1 1443.95 52.09 27.719 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4795 on 1328 degrees of freedom
## Multiple R-squared: 0.8443, Adjusted R-squared: 0.8432
## F-statistic: 800.1 on 9 and 1328 DF, p-value: < 2.2e-16
AIC(fit7)
## [1] 26488.94
BIC(fit7)
## [1] 26546.13
After dropping smoker*age, the AIC and BIC values stay essentially the same. For now, the best model would be: charges = constant + age + sex + bmi + children + smokeryes + northeastornot + sm_bm + \(bmi^{1.5}\) + \(age^{1.5}\)
Caution: as you may see, the coefficient of smoker1 is negative. Please don't take this to mean smoking lowers your costs. Since smoker also appears in the bmi:smoker1 interaction, and bmi is far from zero for everyone, the harmful influence of smoking has been transferred to the bmi:smoker1 term.
insurance = readtable('insurance.csv');
data = insurance;
%Convert the categorical columns to numeric values
X = data(:,1:6);
Y = data(:,7);
[GN, ~, G] = unique(X(:,2));
X.sex = G-1;
X.smoker = strcmp(X.smoker, 'yes');
head(X)
Result:
age sex bmi children smoker region
___ ___ ______ ________ ______ ___________
19 0 27.9 0 true 'southwest'
18 1 33.77 1 false 'southeast'
28 1 33 3 false 'southeast'
33 1 22.705 0 false 'northwest'
32 1 28.88 0 false 'northwest'
31 0 25.74 0 false 'southeast'
46 0 33.44 1 false 'southeast'
37 0 27.74 3 false 'northwest'
Now we are ready to look at the distribution of each column and select a strategy to estimate our model:
figure(1)
histogram(data.age,'BinWidth',5);
xlabel('Age');
ylabel('Frequency');
The distribution of age:
The distribution of sex:
figure(2)
C = categorical(X.sex,[0 1],{'Female','Male'});
histogram(C,'BarWidth',0.5);
xlabel('Sex');
ylabel('Frequency');
The distribution of bmi:
figure(3)
histogram(data.bmi);
xlabel('bmi');
ylabel('Frequency');
The distribution of Children:
figure(4)
histogram(data.children);
xlabel('children');
ylabel('Frequency');
The distribution of Smoker:
figure(5)
C0 = categorical(X.smoker,[0 1],{'No','Yes'});
histogram(C0, 'BarWidth',0.5);
xlabel('smoker');
ylabel('Frequency');
The distribution of Region:
figure(6)
C1 = categorical(data.region);
histogram(C1, 'BarWidth',0.5);
xlabel('Region');
ylabel('Frequency');
The distribution of Charges:
figure(7)
histogram(data.charges,'BinWidth',5000);
xlabel('charges');
ylabel('Frequency');
Since Y is the response and it has a heavy-tail distribution, one general way to fit it is to take the log:
Y.charges = log(Y.charges);
figure(8)
histogram(Y.charges,'BinWidth',0.5);
xlabel('ln-charges');
ylabel('Frequency');
The log charge histogram:
Now it has a bell-shaped distribution. To regress on the full dataset, we have to convert region into dummy variables:
X.southwest = ( strcmp(X.region ,'southwest'));
X.northwest = ( strcmp(X.region ,'northwest'));
X.southeast = ( strcmp(X.region ,'southeast'));
X = removevars(X,{'region'});
head(X)
Result
age sex bmi children smoker southwest northwest southeast
___ ___ ______ ________ ______ _________ _________ _________
19 0 27.9 0 true true false false
18 1 33.77 1 false false false true
28 1 33 3 false false false true
33 1 22.705 0 false false true false
32 1 28.88 0 false false true false
31 0 25.74 0 false false false true
46 0 33.44 1 false false false true
37 0 27.74 3 false false true false
Note: in this section each \(X_n\) corresponds to the \(n^{th}\) predictor in the following list: {age, sex, bmi, children, smoker, southwest, northwest, southeast, smokerbmi, smokerage}. We start by regressing on the full set of predictors:
x = X{:,:};
[m,p] = size(x);
y = Y{:,:};
n = length(y);
mdl = fitlm(x,y)
mdl.ModelCriterion
mdl =
Linear regression model:
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
Estimated Coefficients:
Estimate SE tStat pValue
_________ __________ _______ ___________
(Intercept) 7.0306 0.072396 97.112 0
x1 0.034582 0.00087205 39.655 1.3701e-227
x2 -0.075416 0.024401 -3.0907 0.0020384
x3 0.013375 0.002096 6.3811 2.4234e-10
x4 0.10186 0.010099 10.085 4.2411e-23
x5 1.5543 0.030279 51.333 1.1199e-317
x6 -0.12895 0.035027 -3.6815 0.00024116
x7 -0.063788 0.034906 -1.8274 0.06786
x8 -0.1572 0.035083 -4.4807 8.0776e-06
Number of observations: 1338, Error degrees of freedom: 1329
Root Mean Squared Error: 0.444
R-squared: 0.768, Adjusted R-Squared 0.767
F-statistic vs. constant model: 550, p-value = 0
AIC and BIC
AIC: 1.6350e+03
AICc: 1.6352e+03
BIC: 1.6818e+03
CAIC: 1.6908e+03
Plot the residual:
%Plot residual
figure(9)
plotResiduals(mdl);
%Normal probability plot
figure(10)
plotResiduals(mdl,'probability');
figure(11)
plotResiduals(mdl,'fitted');
It turns out that the residuals show some linear structure, so we have to modify our model.
We use the formula from Belsley, D. A., E. Kuh, and R. E. Welsch, Regression Diagnostics (Hoboken, NJ: John Wiley & Sons, 1980), as implemented in an open toolbox, to check the VIFs. It turns out they are bounded, and most of them are lower than 5.
function [V] = vif(X)
    R0 = corrcoef(X);    % correlation matrix of the predictors
    V = diag(inv(R0))';  % the diagonal of its inverse gives the VIFs
end
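This works because, for the correlation matrix \(R_0\) of the predictors, the \(j\)-th diagonal entry of its inverse equals the usual variance inflation factor:
\[ \left[R_0^{-1}\right]_{jj} = \frac{1}{1 - R_j^2} = VIF_j. \]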
Our first try is to add an interaction term to our model, smoker:bmi:
x1 = X;
x1.smokerbmi = x1.smoker.*x1.bmi;
x11 = x1{:,:};
[m1,p1] = size(x11);
mdl2 = fitlm(x11,y);
mdl2.ModelCriterion;
Regression Output
mdl2 =
Linear regression model:
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
Estimated Coefficients:
Estimate SE tStat pValue
_________ __________ _______ ___________
(Intercept) 7.3374 0.076673 95.697 0
x1 0.034795 0.00084289 41.281 2.4521e-240
x2 -0.087064 0.023607 -3.688 0.00023512
x3 0.003406 0.0022676 1.502 0.13334
x4 0.10315 0.0097594 10.569 3.9892e-25
x5 0.15642 0.146 1.0714 0.2842
x6 -0.13751 0.033856 -4.0617 5.1563e-05
x7 -0.071131 0.033735 -2.1085 0.035176
x8 -0.16273 0.033903 -4.7998 1.7681e-06
x9 0.045574 0.0046633 9.773 7.8431e-22
Number of observations: 1338, Error degrees of freedom: 1328
Root Mean Squared Error: 0.429
R-squared: 0.784, Adjusted R-Squared 0.782
F-statistic vs. constant model: 534, p-value = 0
AIC and BIC
AIC: 1.5441e+03
AICc: 1.5443e+03
BIC: 1.5961e+03
CAIC: 1.6061e+03
We see an increase in Adjusted R-square and a decrease in BIC, so we retain this new interactive predictor.
Our second try is to add an interactive predictor: smoker:age
x2 = X;
x2.smokerbmi = x2.smoker.*x2.bmi;
x2.smokerage = x2.smoker.*x2.age;
x12 = x2{:,:};
[m2,p2] = size(x12);
mdl3 = fitlm(x12,y);
mdl3.ModelCriterion;
Regression output
mdl3 =
Linear regression model:
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10
Estimated Coefficients:
Estimate SE tStat pValue
_________ __________ _______ ___________
(Intercept) 7.1326 0.069955 101.96 0
x1 0.041573 0.00084966 48.93 2.0913e-299
x2 -0.085616 0.021242 -4.0305 5.8833e-05
x3 0.0011516 0.0020444 0.56331 0.57332
x4 0.10633 0.0087834 12.106 4.5087e-32
x5 1.2793 0.14589 8.7689 5.423e-18
x6 -0.15093 0.030473 -4.9529 8.2515e-07
x7 -0.06492 0.030357 -2.1385 0.032657
x8 -0.14486 0.030523 -4.7459 2.3003e-06
x9 0.051037 0.0042074 12.13 3.4458e-32
x10 -0.033401 0.0018873 -17.698 4.3552e-63
Number of observations: 1338, Error degrees of freedom: 1327
Root Mean Squared Error: 0.386
R-squared: 0.825, Adjusted R-Squared 0.824
F-statistic vs. constant model: 625, p-value = 0
AIC and BIC
AIC: 1.2626e+03
AICc: 1.2628e+03
BIC: 1.3198e+03
CAIC: 1.3308e+03
We retain this predictor for a similar reason. Now, since there are only two continuous predictors, one useful idea is to raise their power, since we expect the continuous variables to explain more of the variation in the response than the logical ones. Add bmi^1.5:
x3 = x2;
x3.bmi = x3.bmi+x3.bmi.^1.5;  % fold the bmi^1.5 term into the bmi column
x13 = x3{:,:};
[m3,p3] = size(x13);
mdl4 = fitlm(x13,y);
mdl4.ModelCriterion;
Regression output
mdl4 =
Linear regression model:
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10
Estimated Coefficients:
Estimate SE tStat pValue
__________ __________ _______ ___________
(Intercept) 7.152 0.05642 126.76 0
x1 0.041601 0.00084936 48.979 8.7652e-300
x2 -0.085515 0.021244 -4.0254 6.0099e-05
x3 6.9951e-05 0.00021768 0.32136 0.74799
x4 0.10637 0.0087842 12.109 4.3441e-32
x5 1.2662 0.14588 8.6796 1.144e-17
x6 -0.15044 0.030469 -4.9375 8.9176e-07
x7 -0.064937 0.030361 -2.1388 0.032633
x8 -0.14327 0.030536 -4.6916 2.9908e-06
x9 0.051497 0.0042081 12.238 1.0536e-32
x10 -0.033429 0.0018876 -17.71 3.634e-63
Number of observations: 1338, Error degrees of freedom: 1327
Root Mean Squared Error: 0.386
R-squared: 0.825, Adjusted R-Squared 0.824
F-statistic vs. constant model: 625, p-value = 0
AIC and BIC
AIC: 1.2628e+03
AICc: 1.2630e+03
BIC: 1.3200e+03
CAIC: 1.3310e+03
Add age^1.5
x4 = x3;
x4.age = x4.age+x4.age.^1.5;  % fold the age^1.5 term into the age column
x14 = x4{:,:};
mdl5 = fitlm(x14,y);
mdl5.ModelCriterion;
Regression output
mdl5 =
Linear regression model:
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10
Estimated Coefficients:
Estimate SE tStat pValue
_________ __________ _______ ___________
(Intercept) 7.5956 0.05336 142.35 0
x1 0.0039719 8.3054e-05 47.824 5.3991e-291
x2 -0.08619 0.02157 -3.9958 6.8006e-05
x3 6.153e-05 0.00022107 0.27832 0.78081
x4 0.11671 0.0089159 13.091 6.4801e-37
x5 1.2569 0.14813 8.4854 5.6652e-17
x6 -0.15002 0.030937 -4.8492 1.3859e-06
x7 -0.065807 0.030828 -2.1347 0.032969
x8 -0.14376 0.031006 -4.6367 3.8901e-06
x9 0.051251 0.0042726 11.995 1.5076e-31
x10 -0.032981 0.0019157 -17.216 4.1606e-60
Number of observations: 1338, Error degrees of freedom: 1327
Root Mean Squared Error: 0.392
R-squared: 0.819, Adjusted R-Squared 0.818
F-statistic vs. constant model: 602, p-value = 0
AIC and BIC
AIC: 1.3036e+03
AICc: 1.3038e+03
BIC: 1.3608e+03
CAIC: 1.3718e+03
Now, plot the residuals against the fitted values. We find that, although some collinearity remains, fewer scatter points now lie on the linear line above.
Now, let’s try to fit previous preditors with the original response: Refit the model:
x_1 = x4;
x_11 = x_1{:,:};
y_1 = data.charges;
mdl6 = fitlm(x_11,y_1);
mdl6.ModelCriterion;
Regression output:
mdl6 =
Linear regression model:
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10
Estimated Coefficients:
Estimate SE tStat pValue
________ ______ ________ ___________
(Intercept) 773.81 656.17 1.1793 0.2385
x1 25.752 1.0213 25.215 6.1455e-115
x2 -500.25 265.25 -1.886 0.059515
x3 1.6724 2.7185 0.61517 0.53855
x4 582.65 109.64 5.3143 1.2549e-07
x5 -20379 1821.5 -11.188 7.8924e-28
x6 -1224.5 380.43 -3.2186 0.0013191
x7 -588.75 379.09 -1.5531 0.12064
x8 -1185.7 381.28 -3.1099 0.0019113
x9 1448.5 52.54 27.569 1.1927e-132
x10 -5.0746 23.558 -0.21541 0.82948
Number of observations: 1338, Error degrees of freedom: 1327
Root Mean Squared Error: 4.82e+03
R-squared: 0.843, Adjusted R-Squared 0.841
F-statistic vs. constant model: 710, p-value = 0
AIC and BIC
AIC: 2.6504e+04
AICc: 2.6504e+04
BIC: 2.6561e+04
CAIC: 2.6572e+04
We see an increase in both BIC and the adjusted R-squared value. So, if we are selecting the model with respect to adjusted \(R^2\), we should stop here; otherwise we continue to the next step. It turns out that region is not a very significant predictor, so we drop southwest and southeast and keep only northwest. Refit the model. Note that \(X_n\) now corresponds to {age, sex, bmi, children, smoker, northwest, smokerbmi, smokerage, age15, bmi15}:
x_2 = x2;
x_2.age15 = x_2.age.^1.5;
x_2.bmi15 = x_2.bmi.^1.5;
x_2 = removevars(x_2,{'southwest','southeast'});
x_12 = x_2{:,:};
y_1 = data.charges;
mdl7 = fitlm(x_12,y_1);
mdl7.ModelCriterion;
Regression result
mdl7 =
Linear regression model:
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10
Estimated Coefficients:
Estimate SE tStat pValue
________ ______ ________ ___________
(Intercept) -6048.4 3674.8 -1.6459 0.10002
x1 -320.92 125.36 -2.5601 0.010575
x2 -504.03 264.88 -1.9029 0.057269
x3 1020 328.62 3.104 0.0019494
x4 672.29 114.72 5.8603 5.8227e-09
x5 -20140 1818.8 -11.073 2.558e-27
x6 133.45 311.08 0.42899 0.668
x7 1439.6 52.452 27.446 1.0272e-131
x8 -3.7246 23.494 -0.15854 0.87406
x9 62.475 13.353 4.6786 3.1841e-06
x10 -121.14 39.114 -3.0972 0.0019944
Number of observations: 1338, Error degrees of freedom: 1327
Root Mean Squared Error: 4.82e+03
R-squared: 0.843, Adjusted R-Squared 0.842
F-statistic vs. constant model: 712, p-value = 0
AIC and BIC
AIC: 2.6500e+04
AICc: 2.6500e+04
BIC: 2.6557e+04
CAIC: 2.6568e+04
It turns out that smoker:age has a really high p-value, so we drop it. Now the predictors are:
{age, sex, bmi, children, smoker, northwest, smokerbmi, age15, bmi15}
x_3 = x_2;
x_3 = removevars(x_3,{'smokerage'});
x_13 = x_3{:,:};
y_1 = data.charges;
mdl8 = fitlm(x_13,y_1);
mdl8.ModelCriterion;
Regression result
mdl8 =
Linear regression model:
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
Estimated Coefficients:
Estimate SE tStat pValue
________ ______ _______ ___________
(Intercept) -6021.6 3669.5 -1.641 0.10104
x1 -321.58 125.24 -2.5678 0.010345
x2 -504.19 264.78 -1.9042 0.057097
x3 1019.8 328.49 3.1046 0.0019457
x4 671.94 114.66 5.8604 5.8154e-09
x5 -20265 1636.9 -12.38 2.1457e-33
x6 132.94 310.95 0.42753 0.66907
x7 1439 52.293 27.518 2.7926e-132
x8 62.466 13.348 4.6796 3.1676e-06
x9 -121.1 39.098 -3.0972 0.0019944
AIC and BIC
AIC: 2.6498e+04
AICc: 2.6498e+04
BIC: 2.6550e+04
CAIC: 2.6560e+04
After dropping smokerage, the AIC and BIC values are essentially unchanged (both slightly smaller). For now, the best model would be:
charges = constant + age + sex + bmi + children + smokeryes + northwestornot + sm_bm + \(bmi^{1.5}\) + \(age^{1.5}\)