Model selection is the process of selecting a model from a set of candidate models. Many statistical techniques involve model selection either implicitly or explicitly: e.g., hypothesis tests require selecting between a null hypothesis and alternative hypothesis model; an autoregressive model requires selecting the order p; in this tutorial, a regression model requires selecting specific predictors.
A good model selection technique will balance goodness of fit with simplicity, which means under a certain performance requirement set by researchers, the best model should be as simple as possible. If a model includes too many predictors, the most common issue would be overfitting, in which case the model gives good predictions to training data but performs much worse when it comes to data not used in fitting the model. Overfitting model normally gives low bias but high variance. Conversely, when there are too few predictors in a model, the issue of underfitting may show up, and gives poor predictions to data used or not used for model fitting. Underfitting models normally have low variance but high bias.
Testing based and criterion-based approaches are the two main approaches for model (variable) selection. Testing-based approaches include backward elimination, forward selection, stepwise regression, etc. In this category, variables are selected based on whether they are significant or not when they are added/removed. For criterion-based approaches, we have some idea about the purpose for which a model is intended, so we might propose some measures of how well a given model meets that purpose. Then a model that optimizes a criterion which balances goodness-of-fit will be chosen. Some examples of criterion-based approaches include AIC/BIC, adjusted R2, Mallow’s cp, etc.
More introduction and details about model selection can be found in Statistics: An Introduction for Climate Scientists By DelSole and Tippett Chapter 11
Our tutorial mainly introduce R, Stata and Python implementation of three model selection methods: stepwise regression, Akaike information criterion(AIC) and Bayesian information criterion(BIC).
The idea of stepwise regression is to build a regression model by adding/removing predictors step-by-step, until the pre-set significance level is met for all predictors. Each step in the stepwise regression procedure can be identified as “forward selection” or “backward selection”, which mean adding a predictor or removing a existing predictor respectively.
Forward selection starts with a null model, and adds variables to the model one-by-one. The criteria of which variable to include each time is as following: 1) test each variable that is not already in the model, 2) check the significance of all of them to see if their P-value is below certain level, and 3) choose the one that is the most significant. The drawbacks of the forward selection is: every time we add one variable into the model, we ignore the fact that the new variable may render some of the existing variables to be non-significant. Unlike forward selection, the procedure of backward selection method is: 1) start with a model of all variables, 2) check the significance of each variable, and 3) drop one with the least significance each time until all the variables remained are statistically significant.
As two important examples of criterion-based model selection approach, AIC and BIC have two similar objective functions to minimize as shown below.\[AIC=n\ln(\frac{RSS}{n})+2(p+1)\]\[BIC=n\ln(\frac{RSS}{n})+(p+1)\ln n\] where n is the number of training data and p is the number of parameters in the model.
AIC is founded on information theory: it offers an estimate of the relative information lost when a given model is used to represent the process that generated the data. In doing so, it deals with the trade-off between the goodness of fit of the model and the complexity of the model. BIC on the other hand, puts a larger penalty on the complexity of the model as training data gets larger. As a result, AIC is generally considered better when prediction is the aim, despite there still is some debate over the topic.
It is worth noting that neither AIC nor BIC provides a test of a model in the sense of testing a null hypothesis. It tells nothing about the absolute quality of a model, only the quality relative to other models. Thus, if all the candidate models fit poorly, they will not give any warning of that.
Our data is Home Sale Prices from King County. The introduction and documentation of the data can be found in the prior link. For the purpose of showing the model selection methods properly. We choose a subset of the predictors and observations.
# Load the library
library(tidyverse)
# Import the data
lmdata<-read_csv('kc_house_data.csv')
# Get a subset of the predictors(p=13)
lmdata<-select(lmdata,-c(id,date,zipcode,sqft_basement,lat,long,sqft_living15,sqft_lot15))
# Get a subset of the observations(n=200) randomly
set.seed(100)
lmdata<-lmdata[sample(nrow(lmdata),200),]
# Show the head of the data
head(lmdata)
## # A tibble: 6 x 13
## price bedrooms bathrooms sqft_living sqft_lot floors waterfront view
## <dbl> <int> <dbl> <int> <int> <dbl> <int> <int>
## 1 176000 2 1.00 770 5200 1 0 0
## 2 370000 2 1.00 850 6213 1 0 0
## 3 875909 4 2.50 3610 13292 2 0 0
## 4 100000 2 1.00 770 17334 1 0 0
## 5 580000 3 1.75 1850 2797 1 0 0
## 6 577000 5 2.75 1940 5000 2 0 0
## # ... with 5 more variables: condition <int>, grade <int>,
## # sqft_above <int>, yr_built <int>, yr_renovated <int>
First and foremost, we should build a linear model with all the available predictors included, so that we can have an understanding of the model, as well as to use the result of this model in the upcoming model selections. price~. inside the lm function means the linear model includes all the columns in the data as predictors other than price.
lm_result1<-lm(price~.,data=lmdata)
summary(lm_result1)
##
## Call:
## lm(formula = price ~ ., data = lmdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -648573 -107534 -22206 101899 1465406
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.314e+06 1.697e+06 1.954 0.052239 .
## bedrooms -3.655e+04 2.499e+04 -1.463 0.145274
## bathrooms 2.091e+04 4.585e+04 0.456 0.648881
## sqft_living 1.987e+02 5.935e+01 3.348 0.000985 ***
## sqft_lot 6.143e-01 3.566e-01 1.722 0.086637 .
## floors 5.236e+04 4.253e+04 1.231 0.219814
## waterfront 5.919e+05 1.965e+05 3.013 0.002948 **
## view 1.094e+05 2.606e+04 4.198 4.16e-05 ***
## condition 3.178e+04 3.088e+04 1.029 0.304676
## grade 1.801e+05 2.707e+04 6.652 3.10e-10 ***
## sqft_above -1.149e+02 5.412e+01 -2.124 0.035003 *
## yr_built -2.280e+03 8.685e+02 -2.626 0.009367 **
## yr_renovated 1.119e+02 5.249e+01 2.131 0.034378 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 233200 on 187 degrees of freedom
## Multiple R-squared: 0.7281, Adjusted R-squared: 0.7106
## F-statistic: 41.73 on 12 and 187 DF, p-value: < 2.2e-16
For stepwise regression, the function step should be called and the direction is set to both so that the algorithm can add and drop predictors in every iteration. Once it is called, the iterating process will proceed by itself.
From the summary of the first iteration where we include all possible predictors, we can see that the model dropped bathrooms, which is the predictor with the highest P-value in this model. As a result, in the second iteration when we analyze the impact of each predictor, the variable bathrooms has a plus sign instead of a minus sign in front of it, meaning the impact is measured when the variable bathrooms is added to our model.
Finally, when dropping/adding any variable will not give a positive impact to our model in terms of performance, the stepwise process is done.
lm_result2<-step(lm_result1, direction = "both")
## Start: AIC=4956.35
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
## waterfront + view + condition + grade + sqft_above + yr_built +
## yr_renovated
##
## Df Sum of Sq RSS AIC
## - bathrooms 1 1.1307e+10 1.0177e+13 4954.6
## - condition 1 5.7594e+10 1.0224e+13 4955.5
## - floors 1 8.2401e+10 1.0248e+13 4956.0
## <none> 1.0166e+13 4956.4
## - bedrooms 1 1.1628e+11 1.0282e+13 4956.6
## - sqft_lot 1 1.6129e+11 1.0327e+13 4957.5
## - sqft_above 1 2.4520e+11 1.0411e+13 4959.1
## - yr_renovated 1 2.4692e+11 1.0413e+13 4959.2
## - yr_built 1 3.7476e+11 1.0541e+13 4961.6
## - waterfront 1 4.9341e+11 1.0659e+13 4963.8
## - sqft_living 1 6.0927e+11 1.0775e+13 4966.0
## - view 1 9.5793e+11 1.1124e+13 4972.4
## - grade 1 2.4056e+12 1.2572e+13 4996.8
##
## Step: AIC=4954.57
## price ~ bedrooms + sqft_living + sqft_lot + floors + waterfront +
## view + condition + grade + sqft_above + yr_built + yr_renovated
##
## Df Sum of Sq RSS AIC
## - condition 1 6.1696e+10 1.0239e+13 4953.8
## <none> 1.0177e+13 4954.6
## - floors 1 1.0616e+11 1.0283e+13 4954.6
## - bedrooms 1 1.0659e+11 1.0284e+13 4954.7
## - sqft_lot 1 1.6070e+11 1.0338e+13 4955.7
## + bathrooms 1 1.1307e+10 1.0166e+13 4956.4
## - sqft_above 1 2.6894e+11 1.0446e+13 4957.8
## - yr_renovated 1 2.7164e+11 1.0449e+13 4957.8
## - yr_built 1 3.7749e+11 1.0555e+13 4959.9
## - waterfront 1 5.2357e+11 1.0701e+13 4962.6
## - sqft_living 1 8.6293e+11 1.1040e+13 4968.9
## - view 1 9.6019e+11 1.1138e+13 4970.6
## - grade 1 2.4627e+12 1.2640e+13 4995.9
##
## Step: AIC=4953.78
## price ~ bedrooms + sqft_living + sqft_lot + floors + waterfront +
## view + grade + sqft_above + yr_built + yr_renovated
##
## Df Sum of Sq RSS AIC
## - bedrooms 1 8.1438e+10 1.0320e+13 4953.4
## - floors 1 1.0220e+11 1.0341e+13 4953.8
## <none> 1.0239e+13 4953.8
## - sqft_lot 1 1.3277e+11 1.0372e+13 4954.4
## + condition 1 6.1696e+10 1.0177e+13 4954.6
## + bathrooms 1 1.5409e+10 1.0224e+13 4955.5
## - yr_renovated 1 2.2466e+11 1.0464e+13 4956.1
## - sqft_above 1 2.6560e+11 1.0505e+13 4956.9
## - waterfront 1 5.3244e+11 1.0771e+13 4961.9
## - yr_built 1 6.1399e+11 1.0853e+13 4963.4
## - sqft_living 1 8.8466e+11 1.1124e+13 4968.4
## - view 1 9.4802e+11 1.1187e+13 4969.5
## - grade 1 2.4192e+12 1.2658e+13 4994.2
##
## Step: AIC=4953.37
## price ~ sqft_living + sqft_lot + floors + waterfront + view +
## grade + sqft_above + yr_built + yr_renovated
##
## Df Sum of Sq RSS AIC
## - floors 1 8.3526e+10 1.0404e+13 4953.0
## <none> 1.0320e+13 4953.4
## + bedrooms 1 8.1438e+10 1.0239e+13 4953.8
## - sqft_lot 1 1.3519e+11 1.0456e+13 4954.0
## + condition 1 3.6543e+10 1.0284e+13 4954.7
## + bathrooms 1 3.6993e+09 1.0317e+13 4955.3
## - yr_renovated 1 2.2141e+11 1.0542e+13 4955.6
## - sqft_above 1 2.6091e+11 1.0581e+13 4956.4
## - waterfront 1 5.9042e+11 1.0911e+13 4962.5
## - yr_built 1 6.1250e+11 1.0933e+13 4962.9
## - sqft_living 1 8.0750e+11 1.1128e+13 4966.4
## - view 1 9.2945e+11 1.1250e+13 4968.6
## - grade 1 2.6416e+12 1.2962e+13 4996.9
##
## Step: AIC=4952.98
## price ~ sqft_living + sqft_lot + waterfront + view + grade +
## sqft_above + yr_built + yr_renovated
##
## Df Sum of Sq RSS AIC
## <none> 1.0404e+13 4953.0
## - sqft_lot 1 1.2203e+11 1.0526e+13 4953.3
## + floors 1 8.3526e+10 1.0320e+13 4953.4
## + bedrooms 1 6.2765e+10 1.0341e+13 4953.8
## + condition 1 3.6150e+10 1.0368e+13 4954.3
## - sqft_above 1 1.8826e+11 1.0592e+13 4954.6
## + bathrooms 1 1.9052e+10 1.0385e+13 4954.6
## - yr_renovated 1 2.0830e+11 1.0612e+13 4954.9
## - yr_built 1 5.2994e+11 1.0934e+13 4960.9
## - waterfront 1 5.8177e+11 1.0986e+13 4961.9
## - sqft_living 1 7.2398e+11 1.1128e+13 4964.4
## - view 1 9.9859e+11 1.1403e+13 4969.3
## - grade 1 2.7933e+12 1.3197e+13 4998.5
For AIC and BIC, we use step function again and specify the penalty for the number of freedom. For instance, we have \(k=2\) in AIC and \(k=log(n)\) in BIC, implying that BIC normally gives a larger penalty on the number of parameters in the model according to the definition of BIC. Once the model selection is done, we should expect a smaller number of predictors in BIC than AIC below.
lm_result3<-step(lm_result1, k=2)
## Start: AIC=4956.35
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
## waterfront + view + condition + grade + sqft_above + yr_built +
## yr_renovated
##
## Df Sum of Sq RSS AIC
## - bathrooms 1 1.1307e+10 1.0177e+13 4954.6
## - condition 1 5.7594e+10 1.0224e+13 4955.5
## - floors 1 8.2401e+10 1.0248e+13 4956.0
## <none> 1.0166e+13 4956.4
## - bedrooms 1 1.1628e+11 1.0282e+13 4956.6
## - sqft_lot 1 1.6129e+11 1.0327e+13 4957.5
## - sqft_above 1 2.4520e+11 1.0411e+13 4959.1
## - yr_renovated 1 2.4692e+11 1.0413e+13 4959.2
## - yr_built 1 3.7476e+11 1.0541e+13 4961.6
## - waterfront 1 4.9341e+11 1.0659e+13 4963.8
## - sqft_living 1 6.0927e+11 1.0775e+13 4966.0
## - view 1 9.5793e+11 1.1124e+13 4972.4
## - grade 1 2.4056e+12 1.2572e+13 4996.8
##
## Step: AIC=4954.57
## price ~ bedrooms + sqft_living + sqft_lot + floors + waterfront +
## view + condition + grade + sqft_above + yr_built + yr_renovated
##
## Df Sum of Sq RSS AIC
## - condition 1 6.1696e+10 1.0239e+13 4953.8
## <none> 1.0177e+13 4954.6
## - floors 1 1.0616e+11 1.0283e+13 4954.6
## - bedrooms 1 1.0659e+11 1.0284e+13 4954.7
## - sqft_lot 1 1.6070e+11 1.0338e+13 4955.7
## - sqft_above 1 2.6894e+11 1.0446e+13 4957.8
## - yr_renovated 1 2.7164e+11 1.0449e+13 4957.8
## - yr_built 1 3.7749e+11 1.0555e+13 4959.9
## - waterfront 1 5.2357e+11 1.0701e+13 4962.6
## - sqft_living 1 8.6293e+11 1.1040e+13 4968.9
## - view 1 9.6019e+11 1.1138e+13 4970.6
## - grade 1 2.4627e+12 1.2640e+13 4995.9
##
## Step: AIC=4953.78
## price ~ bedrooms + sqft_living + sqft_lot + floors + waterfront +
## view + grade + sqft_above + yr_built + yr_renovated
##
## Df Sum of Sq RSS AIC
## - bedrooms 1 8.1438e+10 1.0320e+13 4953.4
## - floors 1 1.0220e+11 1.0341e+13 4953.8
## <none> 1.0239e+13 4953.8
## - sqft_lot 1 1.3277e+11 1.0372e+13 4954.4
## - yr_renovated 1 2.2466e+11 1.0464e+13 4956.1
## - sqft_above 1 2.6560e+11 1.0505e+13 4956.9
## - waterfront 1 5.3244e+11 1.0771e+13 4961.9
## - yr_built 1 6.1399e+11 1.0853e+13 4963.4
## - sqft_living 1 8.8466e+11 1.1124e+13 4968.4
## - view 1 9.4802e+11 1.1187e+13 4969.5
## - grade 1 2.4192e+12 1.2658e+13 4994.2
##
## Step: AIC=4953.37
## price ~ sqft_living + sqft_lot + floors + waterfront + view +
## grade + sqft_above + yr_built + yr_renovated
##
## Df Sum of Sq RSS AIC
## - floors 1 8.3526e+10 1.0404e+13 4953.0
## <none> 1.0320e+13 4953.4
## - sqft_lot 1 1.3519e+11 1.0456e+13 4954.0
## - yr_renovated 1 2.2141e+11 1.0542e+13 4955.6
## - sqft_above 1 2.6091e+11 1.0581e+13 4956.4
## - waterfront 1 5.9042e+11 1.0911e+13 4962.5
## - yr_built 1 6.1250e+11 1.0933e+13 4962.9
## - sqft_living 1 8.0750e+11 1.1128e+13 4966.4
## - view 1 9.2945e+11 1.1250e+13 4968.6
## - grade 1 2.6416e+12 1.2962e+13 4996.9
##
## Step: AIC=4952.98
## price ~ sqft_living + sqft_lot + waterfront + view + grade +
## sqft_above + yr_built + yr_renovated
##
## Df Sum of Sq RSS AIC
## <none> 1.0404e+13 4953.0
## - sqft_lot 1 1.2203e+11 1.0526e+13 4953.3
## - sqft_above 1 1.8826e+11 1.0592e+13 4954.6
## - yr_renovated 1 2.0830e+11 1.0612e+13 4954.9
## - yr_built 1 5.2994e+11 1.0934e+13 4960.9
## - waterfront 1 5.8177e+11 1.0986e+13 4961.9
## - sqft_living 1 7.2398e+11 1.1128e+13 4964.4
## - view 1 9.9859e+11 1.1403e+13 4969.3
## - grade 1 2.7933e+12 1.3197e+13 4998.5
lm_result4<-step(lm_result1, k=log(nrow(lmdata)))
## Start: AIC=4999.23
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
## waterfront + view + condition + grade + sqft_above + yr_built +
## yr_renovated
##
## Df Sum of Sq RSS AIC
## - bathrooms 1 1.1307e+10 1.0177e+13 4994.2
## - condition 1 5.7594e+10 1.0224e+13 4995.1
## - floors 1 8.2401e+10 1.0248e+13 4995.5
## - bedrooms 1 1.1628e+11 1.0282e+13 4996.2
## - sqft_lot 1 1.6129e+11 1.0327e+13 4997.1
## - sqft_above 1 2.4520e+11 1.0411e+13 4998.7
## - yr_renovated 1 2.4692e+11 1.0413e+13 4998.7
## <none> 1.0166e+13 4999.2
## - yr_built 1 3.7476e+11 1.0541e+13 5001.2
## - waterfront 1 4.9341e+11 1.0659e+13 5003.4
## - sqft_living 1 6.0927e+11 1.0775e+13 5005.6
## - view 1 9.5793e+11 1.1124e+13 5011.9
## - grade 1 2.4056e+12 1.2572e+13 5036.4
##
## Step: AIC=4994.15
## price ~ bedrooms + sqft_living + sqft_lot + floors + waterfront +
## view + condition + grade + sqft_above + yr_built + yr_renovated
##
## Df Sum of Sq RSS AIC
## - condition 1 6.1696e+10 1.0239e+13 4990.1
## - floors 1 1.0616e+11 1.0283e+13 4990.9
## - bedrooms 1 1.0659e+11 1.0284e+13 4990.9
## - sqft_lot 1 1.6070e+11 1.0338e+13 4992.0
## - sqft_above 1 2.6894e+11 1.0446e+13 4994.1
## - yr_renovated 1 2.7164e+11 1.0449e+13 4994.1
## <none> 1.0177e+13 4994.2
## - yr_built 1 3.7749e+11 1.0555e+13 4996.1
## - waterfront 1 5.2357e+11 1.0701e+13 4998.9
## - sqft_living 1 8.6293e+11 1.1040e+13 5005.1
## - view 1 9.6019e+11 1.1138e+13 5006.9
## - grade 1 2.4627e+12 1.2640e+13 5032.2
##
## Step: AIC=4990.06
## price ~ bedrooms + sqft_living + sqft_lot + floors + waterfront +
## view + grade + sqft_above + yr_built + yr_renovated
##
## Df Sum of Sq RSS AIC
## - bedrooms 1 8.1438e+10 1.0320e+13 4986.3
## - floors 1 1.0220e+11 1.0341e+13 4986.8
## - sqft_lot 1 1.3277e+11 1.0372e+13 4987.3
## - yr_renovated 1 2.2466e+11 1.0464e+13 4989.1
## - sqft_above 1 2.6560e+11 1.0505e+13 4989.9
## <none> 1.0239e+13 4990.1
## - waterfront 1 5.3244e+11 1.0771e+13 4994.9
## - yr_built 1 6.1399e+11 1.0853e+13 4996.4
## - sqft_living 1 8.8466e+11 1.1124e+13 5001.3
## - view 1 9.4802e+11 1.1187e+13 5002.5
## - grade 1 2.4192e+12 1.2658e+13 5027.2
##
## Step: AIC=4986.35
## price ~ sqft_living + sqft_lot + floors + waterfront + view +
## grade + sqft_above + yr_built + yr_renovated
##
## Df Sum of Sq RSS AIC
## - floors 1 8.3526e+10 1.0404e+13 4982.7
## - sqft_lot 1 1.3519e+11 1.0456e+13 4983.7
## - yr_renovated 1 2.2141e+11 1.0542e+13 4985.3
## - sqft_above 1 2.6091e+11 1.0581e+13 4986.0
## <none> 1.0320e+13 4986.3
## - waterfront 1 5.9042e+11 1.0911e+13 4992.2
## - yr_built 1 6.1250e+11 1.0933e+13 4992.6
## - sqft_living 1 8.0750e+11 1.1128e+13 4996.1
## - view 1 9.2945e+11 1.1250e+13 4998.3
## - grade 1 2.6416e+12 1.2962e+13 5026.6
##
## Step: AIC=4982.66
## price ~ sqft_living + sqft_lot + waterfront + view + grade +
## sqft_above + yr_built + yr_renovated
##
## Df Sum of Sq RSS AIC
## - sqft_lot 1 1.2203e+11 1.0526e+13 4979.7
## - sqft_above 1 1.8826e+11 1.0592e+13 4981.0
## - yr_renovated 1 2.0830e+11 1.0612e+13 4981.3
## <none> 1.0404e+13 4982.7
## - yr_built 1 5.2994e+11 1.0934e+13 4987.3
## - waterfront 1 5.8177e+11 1.0986e+13 4988.2
## - sqft_living 1 7.2398e+11 1.1128e+13 4990.8
## - view 1 9.9859e+11 1.1403e+13 4995.7
## - grade 1 2.7933e+12 1.3197e+13 5024.9
##
## Step: AIC=4979.7
## price ~ sqft_living + waterfront + view + grade + sqft_above +
## yr_built + yr_renovated
##
## Df Sum of Sq RSS AIC
## - sqft_above 1 1.3663e+11 1.0663e+13 4977.0
## - yr_renovated 1 1.9207e+11 1.0718e+13 4978.0
## <none> 1.0526e+13 4979.7
## - waterfront 1 5.7032e+11 1.1096e+13 4985.0
## - yr_built 1 6.2041e+11 1.1146e+13 4985.9
## - sqft_living 1 7.0639e+11 1.1232e+13 4987.4
## - view 1 9.4044e+11 1.1466e+13 4991.5
## - grade 1 2.8067e+12 1.3333e+13 5021.7
##
## Step: AIC=4976.98
## price ~ sqft_living + waterfront + view + grade + yr_built +
## yr_renovated
##
## Df Sum of Sq RSS AIC
## - yr_renovated 1 1.8378e+11 1.0846e+13 4975.1
## <none> 1.0663e+13 4977.0
## - waterfront 1 4.6763e+11 1.1130e+13 4980.3
## - yr_built 1 7.2588e+11 1.1389e+13 4984.9
## - sqft_living 1 7.7333e+11 1.1436e+13 4985.7
## - view 1 1.3804e+12 1.2043e+13 4996.0
## - grade 1 2.6865e+12 1.3349e+13 5016.6
##
## Step: AIC=4975.1
## price ~ sqft_living + waterfront + view + grade + yr_built
##
## Df Sum of Sq RSS AIC
## <none> 1.0846e+13 4975.1
## - waterfront 1 4.8667e+11 1.1333e+13 4978.6
## - sqft_living 1 7.6051e+11 1.1607e+13 4983.4
## - yr_built 1 9.3472e+11 1.1781e+13 4986.3
## - view 1 1.2958e+12 1.2142e+13 4992.4
## - grade 1 2.7877e+12 1.3634e+13 5015.5
As predicted before, AIC gives an optimal model with 8 predictors while BIC only has 5. Note that this time we start with the full model and only consider dropping predictors until dropping any one would increase model’s AIC value.
In stata, stepwise selection method is achieved by stepwise command. pe(#) is for forward selection and pr(#) is for backward selection, where # is the number of significance level wanted. The following regress command is used to specify the regression model where “price” is placed at first as the dependent variable before the 12 variables ready to be added.
. stepwise, pr(.2) pe(.15): regress price bedrooms bathrooms sqft_living sqft_l
> ot floors waterfront view condition grade sqft_above yr_built yr_renovated
The result is as following:
. stepwise, pr(.2) pe(.15): regress price bedrooms bathrooms sqft_living sqft_l
> ot floors waterfront view condition grade sqft_above yr_built yr_renovated
begin with full model
p = 0.6489 >= 0.2000 removing bathrooms
p = 0.2871 >= 0.2000 removing condition
p = 0.2217 >= 0.2000 removing bedrooms
p = 0.2165 >= 0.2000 removing floors
Source | SS df MS Number of obs = 200
-------------+---------------------------------- F(8, 191) = 61.92
Model | 2.6984e+13 8 3.3730e+12 Prob > F = 0.0000
Residual | 1.0404e+13 191 5.4471e+10 R-squared = 0.7217
-------------+---------------------------------- Adj R-squared = 0.7101
Total | 3.7388e+13 199 1.8788e+11 Root MSE = 2.3e+05
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sqft_above | -92.09825 49.54009 -1.86 0.065 -189.8142 5.617694
yr_renovated | 97.98611 50.10715 1.96 0.052 -.8483454 196.8206
sqft_living | 169.9357 46.6129 3.65 0.000 77.99353 261.8779
sqft_lot | .5255762 .3511394 1.50 0.136 -.167033 1.218185
grade | 188573.9 26333.23 7.16 0.000 136632.7 240515.2
waterfront | 632038.3 193397.7 3.27 0.001 250568.6 1013508
view | 110982.1 25920.41 4.28 0.000 59855.11 162109.2
yr_built | -2116.374 678.5175 -3.12 0.002 -3454.724 -778.0237
_cons | 3055019 1271305 2.40 0.017 547417.9 5562620
------------------------------------------------------------------------------
Command regress is used for building a regression model with dependent variable as “price” and predictors as the rest of variables following “price”. Command estat ic is used for showing the AIC and BIC numbers. 1. The regression model with all 13 predictors
. regress price bedrooms bathrooms sqft_living sqft_lot floors waterfront view
> condition grade sqft_above yr_built yr_renovated
Source | SS df MS Number of obs = 200
-------------+---------------------------------- F(12, 187) = 41.73
Model | 2.7222e+13 12 2.2685e+12 Prob > F = 0.0000
Residual | 1.0166e+13 187 5.4364e+10 R-squared = 0.7281
-------------+---------------------------------- Adj R-squared = 0.7106
Total | 3.7388e+13 199 1.8788e+11 Root MSE = 2.3e+05
------------------------------------------------------------------------------
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bedrooms | -36553.76 24993.5 -1.46 0.145 -85859.22 12751.69
bathrooms | 20907.86 45845.26 0.46 0.649 -69532.51 111348.2
sqft_living | 198.7002 59.35365 3.35 0.001 81.61147 315.789
sqft_lot | .6143092 .3566439 1.72 0.087 -.0892533 1.317872
floors | 52361.89 42530.95 1.23 0.220 -31540.23 136264
waterfront | 591865.1 196460.3 3.01 0.003 204301.7 979428.5
view | 109398.2 26061.41 4.20 0.000 57986.1 160810.4
condition | 31784.07 30879.78 1.03 0.305 -29133.44 92701.58
grade | 180078.9 27071.28 6.65 0.000 126674.6 233483.3
sqft_above | -114.9439 54.12243 -2.12 0.035 -221.7129 -8.174887
yr_built | -2280.251 868.4847 -2.63 0.009 -3993.538 -566.9644
yr_renovated | 111.8681 52.49038 2.13 0.034 8.318696 215.4175
_cons | 3314377 1696541 1.95 0.052 -32442.59 6661197
------------------------------------------------------------------------------
. estat ic
Akaike's information criterion and Bayesian information criterion
-----------------------------------------------------------------------------
Model | Obs ll(null) ll(model) df AIC BIC
-------------+---------------------------------------------------------------
. | 200 -2879.193 -2748.963 13 5523.927 5566.805
-----------------------------------------------------------------------------
Note: N=Obs used in calculating BIC; see [R] BIC note.
The regression model with all predictors without bathrooms ```{} . regress price bedrooms sqft_living sqft_lot floors waterfront view condition > grade sqft_above yr_built yr_renovated
Source | SS df MS Number of obs = 200 ————-+———————————- F(11, 188) = 45.69 Model | 2.7211e+13 11 2.4737e+12 Prob > F = 0.0000 Residual | 1.0177e+13 188 5.4135e+10 R-squared = 0.7278 ————-+———————————- Adj R-squared = 0.7119 Total | 3.7388e+13 199 1.8788e+11 Root MSE = 2.3e+05
price | Coef. Std. Err. t P>|t| [95% Conf. Interval] |
————-+—————————————————————- |
bedrooms | -34316.58 24455.69 -1.40 0.162 -82559.4 13926.24 |
sqft_living | 210.9357 52.83233 3.99 0.000 106.7153 315.156 |
sqft_lot | .6131552 .3558829 1.72 0.087 -.0888818 1.315192 |
floors | 57396.66 40986.52 1.40 0.163 -23455.93 138249.2 |
waterfront | 604034.7 194229.2 3.11 0.002 220885.9 987183.5 |
view | 109521.1 26005.06 4.21 0.000 58221.88 160820.3 |
condition | 32809.21 30732.92 1.07 0.287 -27816.47 93434.9 |
grade | 181304.5 26880.76 6.74 0.000 128277.9 234331.2 |
sqft_above | -118.857 53.32526 -2.23 0.027 -224.0498 -13.66425 |
yr_built | -2139.773 810.3121 -2.64 0.009 -3738.246 -541.3011 |
yr_renovated | 115.7673 51.68016 2.24 0.026 13.81976 217.7148 |
_cons | 3034493 1578314 1.92 0.056 -78987.29 6147973 |
. estat ic
Akaike’s information criterion and Bayesian information criterion
Model | Obs ll(null) ll(model) df AIC BIC ————-+————————————————————— . | 200 -2879.193 -2749.074 12 5522.149 5561.729 |
---|
Note: N=Obs used in calculating BIC; see R BIC note. ``` |
3. The regression model with all predictors without bathrooms, condition ```{} . regress price bedrooms sqft_living sqft_lot floors waterfront view grade sqft > _above yr_built yr_renovated |
Source | SS df MS Number of obs = 200 ————-+———————————- F(10, 189) = 50.11 Model | 2.7149e+13 10 2.7149e+12 Prob > F = 0.0000 Residual | 1.0239e+13 189 5.4175e+10 R-squared = 0.7261 ————-+———————————- Adj R-squared = 0.7117 Total | 3.7388e+13 199 1.8788e+11 Root MSE = 2.3e+05 |
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
————-+—————————————————————- bedrooms | -29475.27 24040.42 -1.23 0.222 -76897.29 17946.74 sqft_living | 213.3751 52.80239 4.04 0.000 109.2174 317.5329 sqft_lot | .5494504 .350974 1.57 0.119 -.1428791 1.24178 floors | 56297.45 40988.73 1.37 0.171 -24556.71 137151.6 waterfront | 608962.1 194246.1 3.14 0.002 225793.1 992131 view | 108787.2 26005.58 4.18 0.000 57488.73 160085.7 grade | 179223.8 26819.91 6.68 0.000 126319 232128.6 sqft_above | -118.1052 53.34031 -2.21 0.028 -223.3241 -12.88639 yr_built | -2492.205 740.2874 -3.37 0.001 -3952.493 -1031.918 yr_renovated | 101.8994 50.0394 2.04 0.043 3.191932 200.6069 _cons | 3838972 1387310 2.77 0.006 1102370 6575573 ——————————————————————————
. estat ic
Akaike’s information criterion and Bayesian information criterion
Model | Obs ll(null) ll(model) df AIC BIC ————-+————————————————————— . | 200 -2879.193 -2749.679 11 5521.358 5557.639 |
---|
Note: N=Obs used in calculating BIC; see R BIC note. ``` |
4. The regression model with all predictors without bathrooms, condition, bedrooms```{} . regress price sqft_living sqft_lot floors waterfront view grade sqft_above yr > _built yr_renovated |
Source | SS df MS Number of obs = 200 ————-+———————————- F(9, 190) = 55.37 Model | 2.7067e+13 9 3.0075e+12 Prob > F = 0.0000 Residual | 1.0320e+13 190 5.4318e+10 R-squared = 0.7240 ————-+———————————- Adj R-squared = 0.7109 Total | 3.7388e+13 199 1.8788e+11 Root MSE = 2.3e+05 |
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
————-+—————————————————————- sqft_living | 189.5472 49.16078 3.86 0.000 92.57617 286.5182 sqft_lot | .5543912 .3514153 1.58 0.116 -.1387854 1.247568 floors | 50562.59 40774.86 1.24 0.216 -29866.98 130992.2 waterfront | 636850.4 193165.1 3.30 0.001 255826.9 1017874 view | 107647.1 26023.35 4.14 0.000 56315.34 158978.9 grade | 184681.7 26482.9 6.97 0.000 132443.4 236920 sqft_above | -117.0426 53.40385 -2.19 0.030 -222.3832 -11.70196 yr_built | -2489.166 741.263 -3.36 0.001 -3951.328 -1027.003 yr_renovated | 101.1534 50.10192 2.02 0.045 2.325913 199.9808 _cons | 3748296 1387171 2.70 0.008 1012061 6484530 ——————————————————————————
. estat ic
Akaike’s information criterion and Bayesian information criterion
Model | Obs ll(null) ll(model) df AIC BIC ————-+————————————————————— . | 200 -2879.193 -2750.471 10 5520.942 5553.925 |
---|
Note: N=Obs used in calculating BIC; see R BIC note. ``` |
5. The regression model with all predictors without bathrooms, condition, bedrooms, floors |
```{} . regress price sqft_living sqft_lot waterfront view grade sqft_above yr_built > yr_renovated |
Source | SS df MS Number of obs = 200 ————-+———————————- F(8, 191) = 61.92 Model | 2.6984e+13 8 3.3730e+12 Prob > F = 0.0000 Residual | 1.0404e+13 191 5.4471e+10 R-squared = 0.7217 ————-+———————————- Adj R-squared = 0.7101 Total | 3.7388e+13 199 1.8788e+11 Root MSE = 2.3e+05 |
price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
————-+—————————————————————- sqft_living | 169.9357 46.6129 3.65 0.000 77.99353 261.8779 sqft_lot | .5255762 .3511394 1.50 0.136 -.167033 1.218185 waterfront | 632038.3 193397.7 3.27 0.001 250568.6 1013508 view | 110982.1 25920.41 4.28 0.000 59855.11 162109.2 grade | 188573.9 26333.23 7.16 0.000 136632.7 240515.2 sqft_above | -92.09825 49.54009 -1.86 0.065 -189.8142 5.617694 yr_built | -2116.374 678.5175 -3.12 0.002 -3454.724 -778.0237 yr_renovated | 97.98611 50.10715 1.96 0.052 -.8483454 196.8206 _cons | 3055019 1271305 2.40 0.017 547417.9 5562620 ——————————————————————————
. estat ic
Akaike’s information criterion and Bayesian information criterion
Model | Obs ll(null) ll(model) df AIC BIC |
————-+————————————————————— |
. | 200 -2879.193 -2751.277 9 5520.554 5550.239 |
Note: N=Obs used in calculating BIC; see [R] BIC note.
```
The reg procedure is one of many regression procedures in the SAS System. It is a general-purpose procedure for regression, while other SAS regression procedures provide more specialized applications. For simple regression, we just need model statement in reg procedure. First we build a linear model using all predictors.
proc reg data = lmdata;
model price = bedrooms bathrooms sqft_living sqft_lot floors waterfront
view condition grade sqft_above yr_built yr_renovated;
run;
Then SAS will return very detailed results(not shown in the tutorial) of the linear model which includes analysis variance, fit statistics, parameter estimates, fit diagnostics(plots) and residual plots.
Option selection in the model statement is for specifiing model selection methods. The first argument of the selection must be one of the following: adjrsq, b, backward, cp, maxr, minr, none, requare, stepwise. Besides, all the predictors have an assumed entry and exit significance level \(\alpha\) in the stepwise regression. An entry significance level \(\alpha\) = 0.15, specified in the slentry=0.15 option, means a variable must have a \(p-value \le 0.15\) to enter the model during stepwise regression. An exit significance level of 0.15, specified in the slstay=0.15 option, means a variable must have a \(p-value \ge 0.15\) to leave the model during stepwise regression.
/* Use stepwise to select variables */
proc reg data = lmdata outest=lmdata_result1;
model price = bedrooms bathrooms sqft_living sqft_lot floors waterfront
view condition grade sqft_above yr_built yr_renovated / slstay=0.15 slentry=0.15 selection = stepwise;
run;
The result is shown as follows and the model we get by AIC is \[price\sim grade+view+sqft\_living+yr\_built+waterfront+yr\_renovated+sqft\_above+sqft\_lot\]
Now change the arguments of the selection to rsquare aic bic and then we can get the model selected by AIC and BIC.
/* Use AIC BIC to select variables */
proc reg data = lmdata outest=lmdata_result2;
model price = bedrooms bathrooms sqft_living sqft_lot floors waterfront
view condition grade sqft_above yr_built yr_renovated / selection = rsquare aic bic;
/* Sort the data by value of AIC */
proc sort data=lmdata_result2;
by _aic_;
/* Drop useless columns */
data lmdata_AIC;
set lmdata_result2;
drop Obs _MODEL_ _TYPE_ _DEPVAR_ _RMSE_ price _IN_ _P_ _EDF_ _RSQ_ _BIC_;
/* We can get the best n model by setting obs=n */
proc print data=lmdata_AIC(obs=4);
run;
The result is shown as follows and the model we get by AIC is \[price\sim sqft\_living+sqft\_lot+waterfront+view+grade+sqft\_above+yr\_built+yr\_renovated\]
/* Sort the data by value of BIC */
proc sort data=lmdata_result2;
by _bic_;
/* Drop useless columns */
data lmdata_BIC;
set lmdata_result2;
drop Obs _MODEL_ _TYPE_ _DEPVAR_ _RMSE_ price _IN_ _P_ _EDF_ _RSQ_ _AIC_;
/* We can get the best n model by setting obs=n */
proc print data=lmdata_BIC(obs=4);
run;
The result is shown as follows and the model we get by BIC is \[price\sim sqft\_living+sqft\_lot+waterfront+view+grade+sqft\_above+yr\_built+yr\_renovated\]
PS: The result is the same as the result we get by AIC, which is different from what we get in R. The reason is that SAS looks like calculate the BIC for every possible subset of variables for models so that it can get a better result.