Project Introduction

Model selection is the process of selecting a model from a set of candidate models. Many statistical techniques involve model selection either implicitly or explicitly: e.g., hypothesis tests require selecting between a null hypothesis and alternative hypothesis model; an autoregressive model requires selecting the order p; in this tutorial, a regression model requires selecting specific predictors.

A good model selection technique will balance goodness of fit with simplicity, which means under a certain performance requirement set by researchers, the best model should be as simple as possible. If a model includes too many predictors, the most common issue would be overfitting, in which case the model gives good predictions to training data but performs much worse when it comes to data not used in fitting the model. Overfitting model normally gives low bias but high variance. Conversely, when there are too few predictors in a model, the issue of underfitting may show up, and gives poor predictions to data used or not used for model fitting. Underfitting models normally have low variance but high bias.

Testing based and criterion-based approaches are the two main approaches for model (variable) selection. Testing-based approaches include backward elimination, forward selection, stepwise regression, etc. In this category, variables are selected based on whether they are significant or not when they are added/removed. For criterion-based approaches, we have some idea about the purpose for which a model is intended, so we might propose some measures of how well a given model meets that purpose. Then a model that optimizes a criterion which balances goodness-of-fit will be chosen. Some examples of criterion-based approaches include AIC/BIC, adjusted R2, Mallow’s cp, etc.

More introduction and details about model selection can be found in Statistics: An Introduction for Climate Scientists By DelSole and Tippett Chapter 11

Our tutorial mainly introduce R, Stata and Python implementation of three model selection methods: stepwise regression, Akaike information criterion(AIC) and Bayesian information criterion(BIC).

Method Introduction

Stepwise Regression

The idea of stepwise regression is to build a regression model by adding/removing predictors step-by-step, until the pre-set significance level is met for all predictors. Each step in the stepwise regression procedure can be identified as “forward selection” or “backward selection”, which mean adding a predictor or removing a existing predictor respectively.

Forward selection starts with a null model, and adds variables to the model one-by-one. The criteria of which variable to include each time is as following: 1) test each variable that is not already in the model, 2) check the significance of all of them to see if their P-value is below certain level, and 3) choose the one that is the most significant. The drawbacks of the forward selection is: every time we add one variable into the model, we ignore the fact that the new variable may render some of the existing variables to be non-significant. Unlike forward selection, the procedure of backward selection method is: 1) start with a model of all variables, 2) check the significance of each variable, and 3) drop one with the least significance each time until all the variables remained are statistically significant.

AIC & BIC

As two important examples of criterion-based model selection approach, AIC and BIC have two similar objective functions to minimize as shown below.\[AIC=n\ln(\frac{RSS}{n})+2(p+1)\]\[BIC=n\ln(\frac{RSS}{n})+(p+1)\ln n\] where n is the number of training data and p is the number of parameters in the model.

AIC is founded on information theory: it offers an estimate of the relative information lost when a given model is used to represent the process that generated the data. In doing so, it deals with the trade-off between the goodness of fit of the model and the complexity of the model. BIC on the other hand, puts a larger penalty on the complexity of the model as training data gets larger. As a result, AIC is generally considered better when prediction is the aim, despite there still is some debate over the topic.

It is worth noting that neither AIC nor BIC provides a test of a model in the sense of testing a null hypothesis. It tells nothing about the absolute quality of a model, only the quality relative to other models. Thus, if all the candidate models fit poorly, they will not give any warning of that.

Data Manipulation

Our data is Home Sale Prices from King County. The introduction and documentation of the data can be found in the prior link. For the purpose of showing the model selection methods properly. We choose a subset of the predictors and observations.

# Load the library

library(tidyverse)



# Import the data

lmdata<-read_csv('kc_house_data.csv')



# Get a subset of the predictors(p=13)

lmdata<-select(lmdata,-c(id,date,zipcode,sqft_basement,lat,long,sqft_living15,sqft_lot15))



# Get a subset of the observations(n=200) randomly

set.seed(100)

lmdata<-lmdata[sample(nrow(lmdata),200),]



# Show the head of the data

head(lmdata)
## # A tibble: 6 x 13

##    price bedrooms bathrooms sqft_living sqft_lot floors waterfront  view

##    <dbl>    <int>     <dbl>       <int>    <int>  <dbl>      <int> <int>

## 1 176000        2      1.00         770     5200      1          0     0

## 2 370000        2      1.00         850     6213      1          0     0

## 3 875909        4      2.50        3610    13292      2          0     0

## 4 100000        2      1.00         770    17334      1          0     0

## 5 580000        3      1.75        1850     2797      1          0     0

## 6 577000        5      2.75        1940     5000      2          0     0

## # ... with 5 more variables: condition <int>, grade <int>,

## #   sqft_above <int>, yr_built <int>, yr_renovated <int>

Implementation

R

First and foremost, we should build a linear model with all the available predictors included, so that we can have an understanding of the model, as well as to use the result of this model in the upcoming model selections. price~. inside the lm function means the linear model includes all the columns in the data as predictors other than price.

lm_result1<-lm(price~.,data=lmdata)

summary(lm_result1)
## 

## Call:

## lm(formula = price ~ ., data = lmdata)

## 

## Residuals:

##     Min      1Q  Median      3Q     Max 

## -648573 -107534  -22206  101899 1465406 

## 

## Coefficients:

##                Estimate Std. Error t value Pr(>|t|)    

## (Intercept)   3.314e+06  1.697e+06   1.954 0.052239 .  

## bedrooms     -3.655e+04  2.499e+04  -1.463 0.145274    

## bathrooms     2.091e+04  4.585e+04   0.456 0.648881    

## sqft_living   1.987e+02  5.935e+01   3.348 0.000985 ***

## sqft_lot      6.143e-01  3.566e-01   1.722 0.086637 .  

## floors        5.236e+04  4.253e+04   1.231 0.219814    

## waterfront    5.919e+05  1.965e+05   3.013 0.002948 ** 

## view          1.094e+05  2.606e+04   4.198 4.16e-05 ***

## condition     3.178e+04  3.088e+04   1.029 0.304676    

## grade         1.801e+05  2.707e+04   6.652 3.10e-10 ***

## sqft_above   -1.149e+02  5.412e+01  -2.124 0.035003 *  

## yr_built     -2.280e+03  8.685e+02  -2.626 0.009367 ** 

## yr_renovated  1.119e+02  5.249e+01   2.131 0.034378 *  

## ---

## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 

## Residual standard error: 233200 on 187 degrees of freedom

## Multiple R-squared:  0.7281, Adjusted R-squared:  0.7106 

## F-statistic: 41.73 on 12 and 187 DF,  p-value: < 2.2e-16

Stepwise regression

For stepwise regression, the function step should be called and the direction is set to both so that the algorithm can add and drop predictors in every iteration. Once it is called, the iterating process will proceed by itself.

From the summary of the first iteration where we include all possible predictors, we can see that the model dropped bathrooms, which is the predictor with the highest P-value in this model. As a result, in the second iteration when we analyze the impact of each predictor, the variable bathrooms has a plus sign instead of a minus sign in front of it, meaning the impact is measured when the variable bathrooms is added to our model.

Finally, when dropping/adding any variable will not give a positive impact to our model in terms of performance, the stepwise process is done.

lm_result2<-step(lm_result1, direction = "both")
## Start:  AIC=4956.35

## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 

##     waterfront + view + condition + grade + sqft_above + yr_built + 

##     yr_renovated

## 

##                Df  Sum of Sq        RSS    AIC

## - bathrooms     1 1.1307e+10 1.0177e+13 4954.6

## - condition     1 5.7594e+10 1.0224e+13 4955.5

## - floors        1 8.2401e+10 1.0248e+13 4956.0

## <none>                       1.0166e+13 4956.4

## - bedrooms      1 1.1628e+11 1.0282e+13 4956.6

## - sqft_lot      1 1.6129e+11 1.0327e+13 4957.5

## - sqft_above    1 2.4520e+11 1.0411e+13 4959.1

## - yr_renovated  1 2.4692e+11 1.0413e+13 4959.2

## - yr_built      1 3.7476e+11 1.0541e+13 4961.6

## - waterfront    1 4.9341e+11 1.0659e+13 4963.8

## - sqft_living   1 6.0927e+11 1.0775e+13 4966.0

## - view          1 9.5793e+11 1.1124e+13 4972.4

## - grade         1 2.4056e+12 1.2572e+13 4996.8

## 

## Step:  AIC=4954.57

## price ~ bedrooms + sqft_living + sqft_lot + floors + waterfront + 

##     view + condition + grade + sqft_above + yr_built + yr_renovated

## 

##                Df  Sum of Sq        RSS    AIC

## - condition     1 6.1696e+10 1.0239e+13 4953.8

## <none>                       1.0177e+13 4954.6

## - floors        1 1.0616e+11 1.0283e+13 4954.6

## - bedrooms      1 1.0659e+11 1.0284e+13 4954.7

## - sqft_lot      1 1.6070e+11 1.0338e+13 4955.7

## + bathrooms     1 1.1307e+10 1.0166e+13 4956.4

## - sqft_above    1 2.6894e+11 1.0446e+13 4957.8

## - yr_renovated  1 2.7164e+11 1.0449e+13 4957.8

## - yr_built      1 3.7749e+11 1.0555e+13 4959.9

## - waterfront    1 5.2357e+11 1.0701e+13 4962.6

## - sqft_living   1 8.6293e+11 1.1040e+13 4968.9

## - view          1 9.6019e+11 1.1138e+13 4970.6

## - grade         1 2.4627e+12 1.2640e+13 4995.9

## 

## Step:  AIC=4953.78

## price ~ bedrooms + sqft_living + sqft_lot + floors + waterfront + 

##     view + grade + sqft_above + yr_built + yr_renovated

## 

##                Df  Sum of Sq        RSS    AIC

## - bedrooms      1 8.1438e+10 1.0320e+13 4953.4

## - floors        1 1.0220e+11 1.0341e+13 4953.8

## <none>                       1.0239e+13 4953.8

## - sqft_lot      1 1.3277e+11 1.0372e+13 4954.4

## + condition     1 6.1696e+10 1.0177e+13 4954.6

## + bathrooms     1 1.5409e+10 1.0224e+13 4955.5

## - yr_renovated  1 2.2466e+11 1.0464e+13 4956.1

## - sqft_above    1 2.6560e+11 1.0505e+13 4956.9

## - waterfront    1 5.3244e+11 1.0771e+13 4961.9

## - yr_built      1 6.1399e+11 1.0853e+13 4963.4

## - sqft_living   1 8.8466e+11 1.1124e+13 4968.4

## - view          1 9.4802e+11 1.1187e+13 4969.5

## - grade         1 2.4192e+12 1.2658e+13 4994.2

## 

## Step:  AIC=4953.37

## price ~ sqft_living + sqft_lot + floors + waterfront + view + 

##     grade + sqft_above + yr_built + yr_renovated

## 

##                Df  Sum of Sq        RSS    AIC

## - floors        1 8.3526e+10 1.0404e+13 4953.0

## <none>                       1.0320e+13 4953.4

## + bedrooms      1 8.1438e+10 1.0239e+13 4953.8

## - sqft_lot      1 1.3519e+11 1.0456e+13 4954.0

## + condition     1 3.6543e+10 1.0284e+13 4954.7

## + bathrooms     1 3.6993e+09 1.0317e+13 4955.3

## - yr_renovated  1 2.2141e+11 1.0542e+13 4955.6

## - sqft_above    1 2.6091e+11 1.0581e+13 4956.4

## - waterfront    1 5.9042e+11 1.0911e+13 4962.5

## - yr_built      1 6.1250e+11 1.0933e+13 4962.9

## - sqft_living   1 8.0750e+11 1.1128e+13 4966.4

## - view          1 9.2945e+11 1.1250e+13 4968.6

## - grade         1 2.6416e+12 1.2962e+13 4996.9

## 

## Step:  AIC=4952.98

## price ~ sqft_living + sqft_lot + waterfront + view + grade + 

##     sqft_above + yr_built + yr_renovated

## 

##                Df  Sum of Sq        RSS    AIC

## <none>                       1.0404e+13 4953.0

## - sqft_lot      1 1.2203e+11 1.0526e+13 4953.3

## + floors        1 8.3526e+10 1.0320e+13 4953.4

## + bedrooms      1 6.2765e+10 1.0341e+13 4953.8

## + condition     1 3.6150e+10 1.0368e+13 4954.3

## - sqft_above    1 1.8826e+11 1.0592e+13 4954.6

## + bathrooms     1 1.9052e+10 1.0385e+13 4954.6

## - yr_renovated  1 2.0830e+11 1.0612e+13 4954.9

## - yr_built      1 5.2994e+11 1.0934e+13 4960.9

## - waterfront    1 5.8177e+11 1.0986e+13 4961.9

## - sqft_living   1 7.2398e+11 1.1128e+13 4964.4

## - view          1 9.9859e+11 1.1403e+13 4969.3

## - grade         1 2.7933e+12 1.3197e+13 4998.5

Akaike information criterion (AIC)

For AIC and BIC, we use step function again and specify the penalty for the number of freedom. For instance, we have \(k=2\) in AIC and \(k=log(n)\) in BIC, implying that BIC normally gives a larger penalty on the number of parameters in the model according to the definition of BIC. Once the model selection is done, we should expect a smaller number of predictors in BIC than AIC below.

lm_result3<-step(lm_result1, k=2)
## Start:  AIC=4956.35

## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 

##     waterfront + view + condition + grade + sqft_above + yr_built + 

##     yr_renovated

## 

##                Df  Sum of Sq        RSS    AIC

## - bathrooms     1 1.1307e+10 1.0177e+13 4954.6

## - condition     1 5.7594e+10 1.0224e+13 4955.5

## - floors        1 8.2401e+10 1.0248e+13 4956.0

## <none>                       1.0166e+13 4956.4

## - bedrooms      1 1.1628e+11 1.0282e+13 4956.6

## - sqft_lot      1 1.6129e+11 1.0327e+13 4957.5

## - sqft_above    1 2.4520e+11 1.0411e+13 4959.1

## - yr_renovated  1 2.4692e+11 1.0413e+13 4959.2

## - yr_built      1 3.7476e+11 1.0541e+13 4961.6

## - waterfront    1 4.9341e+11 1.0659e+13 4963.8

## - sqft_living   1 6.0927e+11 1.0775e+13 4966.0

## - view          1 9.5793e+11 1.1124e+13 4972.4

## - grade         1 2.4056e+12 1.2572e+13 4996.8

## 

## Step:  AIC=4954.57

## price ~ bedrooms + sqft_living + sqft_lot + floors + waterfront + 

##     view + condition + grade + sqft_above + yr_built + yr_renovated

## 

##                Df  Sum of Sq        RSS    AIC

## - condition     1 6.1696e+10 1.0239e+13 4953.8

## <none>                       1.0177e+13 4954.6

## - floors        1 1.0616e+11 1.0283e+13 4954.6

## - bedrooms      1 1.0659e+11 1.0284e+13 4954.7

## - sqft_lot      1 1.6070e+11 1.0338e+13 4955.7

## - sqft_above    1 2.6894e+11 1.0446e+13 4957.8

## - yr_renovated  1 2.7164e+11 1.0449e+13 4957.8

## - yr_built      1 3.7749e+11 1.0555e+13 4959.9

## - waterfront    1 5.2357e+11 1.0701e+13 4962.6

## - sqft_living   1 8.6293e+11 1.1040e+13 4968.9

## - view          1 9.6019e+11 1.1138e+13 4970.6

## - grade         1 2.4627e+12 1.2640e+13 4995.9

## 

## Step:  AIC=4953.78

## price ~ bedrooms + sqft_living + sqft_lot + floors + waterfront + 

##     view + grade + sqft_above + yr_built + yr_renovated

## 

##                Df  Sum of Sq        RSS    AIC

## - bedrooms      1 8.1438e+10 1.0320e+13 4953.4

## - floors        1 1.0220e+11 1.0341e+13 4953.8

## <none>                       1.0239e+13 4953.8

## - sqft_lot      1 1.3277e+11 1.0372e+13 4954.4

## - yr_renovated  1 2.2466e+11 1.0464e+13 4956.1

## - sqft_above    1 2.6560e+11 1.0505e+13 4956.9

## - waterfront    1 5.3244e+11 1.0771e+13 4961.9

## - yr_built      1 6.1399e+11 1.0853e+13 4963.4

## - sqft_living   1 8.8466e+11 1.1124e+13 4968.4

## - view          1 9.4802e+11 1.1187e+13 4969.5

## - grade         1 2.4192e+12 1.2658e+13 4994.2

## 

## Step:  AIC=4953.37

## price ~ sqft_living + sqft_lot + floors + waterfront + view + 

##     grade + sqft_above + yr_built + yr_renovated

## 

##                Df  Sum of Sq        RSS    AIC

## - floors        1 8.3526e+10 1.0404e+13 4953.0

## <none>                       1.0320e+13 4953.4

## - sqft_lot      1 1.3519e+11 1.0456e+13 4954.0

## - yr_renovated  1 2.2141e+11 1.0542e+13 4955.6

## - sqft_above    1 2.6091e+11 1.0581e+13 4956.4

## - waterfront    1 5.9042e+11 1.0911e+13 4962.5

## - yr_built      1 6.1250e+11 1.0933e+13 4962.9

## - sqft_living   1 8.0750e+11 1.1128e+13 4966.4

## - view          1 9.2945e+11 1.1250e+13 4968.6

## - grade         1 2.6416e+12 1.2962e+13 4996.9

## 

## Step:  AIC=4952.98

## price ~ sqft_living + sqft_lot + waterfront + view + grade + 

##     sqft_above + yr_built + yr_renovated

## 

##                Df  Sum of Sq        RSS    AIC

## <none>                       1.0404e+13 4953.0

## - sqft_lot      1 1.2203e+11 1.0526e+13 4953.3

## - sqft_above    1 1.8826e+11 1.0592e+13 4954.6

## - yr_renovated  1 2.0830e+11 1.0612e+13 4954.9

## - yr_built      1 5.2994e+11 1.0934e+13 4960.9

## - waterfront    1 5.8177e+11 1.0986e+13 4961.9

## - sqft_living   1 7.2398e+11 1.1128e+13 4964.4

## - view          1 9.9859e+11 1.1403e+13 4969.3

## - grade         1 2.7933e+12 1.3197e+13 4998.5

Bayes information criterion (BIC)

lm_result4<-step(lm_result1, k=log(nrow(lmdata)))
## Start:  AIC=4999.23

## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 

##     waterfront + view + condition + grade + sqft_above + yr_built + 

##     yr_renovated

## 

##                Df  Sum of Sq        RSS    AIC

## - bathrooms     1 1.1307e+10 1.0177e+13 4994.2

## - condition     1 5.7594e+10 1.0224e+13 4995.1

## - floors        1 8.2401e+10 1.0248e+13 4995.5

## - bedrooms      1 1.1628e+11 1.0282e+13 4996.2

## - sqft_lot      1 1.6129e+11 1.0327e+13 4997.1

## - sqft_above    1 2.4520e+11 1.0411e+13 4998.7

## - yr_renovated  1 2.4692e+11 1.0413e+13 4998.7

## <none>                       1.0166e+13 4999.2

## - yr_built      1 3.7476e+11 1.0541e+13 5001.2

## - waterfront    1 4.9341e+11 1.0659e+13 5003.4

## - sqft_living   1 6.0927e+11 1.0775e+13 5005.6

## - view          1 9.5793e+11 1.1124e+13 5011.9

## - grade         1 2.4056e+12 1.2572e+13 5036.4

## 

## Step:  AIC=4994.15

## price ~ bedrooms + sqft_living + sqft_lot + floors + waterfront + 

##     view + condition + grade + sqft_above + yr_built + yr_renovated

## 

##                Df  Sum of Sq        RSS    AIC

## - condition     1 6.1696e+10 1.0239e+13 4990.1

## - floors        1 1.0616e+11 1.0283e+13 4990.9

## - bedrooms      1 1.0659e+11 1.0284e+13 4990.9

## - sqft_lot      1 1.6070e+11 1.0338e+13 4992.0

## - sqft_above    1 2.6894e+11 1.0446e+13 4994.1

## - yr_renovated  1 2.7164e+11 1.0449e+13 4994.1

## <none>                       1.0177e+13 4994.2

## - yr_built      1 3.7749e+11 1.0555e+13 4996.1

## - waterfront    1 5.2357e+11 1.0701e+13 4998.9

## - sqft_living   1 8.6293e+11 1.1040e+13 5005.1

## - view          1 9.6019e+11 1.1138e+13 5006.9

## - grade         1 2.4627e+12 1.2640e+13 5032.2

## 

## Step:  AIC=4990.06

## price ~ bedrooms + sqft_living + sqft_lot + floors + waterfront + 

##     view + grade + sqft_above + yr_built + yr_renovated

## 

##                Df  Sum of Sq        RSS    AIC

## - bedrooms      1 8.1438e+10 1.0320e+13 4986.3

## - floors        1 1.0220e+11 1.0341e+13 4986.8

## - sqft_lot      1 1.3277e+11 1.0372e+13 4987.3

## - yr_renovated  1 2.2466e+11 1.0464e+13 4989.1

## - sqft_above    1 2.6560e+11 1.0505e+13 4989.9

## <none>                       1.0239e+13 4990.1

## - waterfront    1 5.3244e+11 1.0771e+13 4994.9

## - yr_built      1 6.1399e+11 1.0853e+13 4996.4

## - sqft_living   1 8.8466e+11 1.1124e+13 5001.3

## - view          1 9.4802e+11 1.1187e+13 5002.5

## - grade         1 2.4192e+12 1.2658e+13 5027.2

## 

## Step:  AIC=4986.35

## price ~ sqft_living + sqft_lot + floors + waterfront + view + 

##     grade + sqft_above + yr_built + yr_renovated

## 

##                Df  Sum of Sq        RSS    AIC

## - floors        1 8.3526e+10 1.0404e+13 4982.7

## - sqft_lot      1 1.3519e+11 1.0456e+13 4983.7

## - yr_renovated  1 2.2141e+11 1.0542e+13 4985.3

## - sqft_above    1 2.6091e+11 1.0581e+13 4986.0

## <none>                       1.0320e+13 4986.3

## - waterfront    1 5.9042e+11 1.0911e+13 4992.2

## - yr_built      1 6.1250e+11 1.0933e+13 4992.6

## - sqft_living   1 8.0750e+11 1.1128e+13 4996.1

## - view          1 9.2945e+11 1.1250e+13 4998.3

## - grade         1 2.6416e+12 1.2962e+13 5026.6

## 

## Step:  AIC=4982.66

## price ~ sqft_living + sqft_lot + waterfront + view + grade + 

##     sqft_above + yr_built + yr_renovated

## 

##                Df  Sum of Sq        RSS    AIC

## - sqft_lot      1 1.2203e+11 1.0526e+13 4979.7

## - sqft_above    1 1.8826e+11 1.0592e+13 4981.0

## - yr_renovated  1 2.0830e+11 1.0612e+13 4981.3

## <none>                       1.0404e+13 4982.7

## - yr_built      1 5.2994e+11 1.0934e+13 4987.3

## - waterfront    1 5.8177e+11 1.0986e+13 4988.2

## - sqft_living   1 7.2398e+11 1.1128e+13 4990.8

## - view          1 9.9859e+11 1.1403e+13 4995.7

## - grade         1 2.7933e+12 1.3197e+13 5024.9

## 

## Step:  AIC=4979.7

## price ~ sqft_living + waterfront + view + grade + sqft_above + 

##     yr_built + yr_renovated

## 

##                Df  Sum of Sq        RSS    AIC

## - sqft_above    1 1.3663e+11 1.0663e+13 4977.0

## - yr_renovated  1 1.9207e+11 1.0718e+13 4978.0

## <none>                       1.0526e+13 4979.7

## - waterfront    1 5.7032e+11 1.1096e+13 4985.0

## - yr_built      1 6.2041e+11 1.1146e+13 4985.9

## - sqft_living   1 7.0639e+11 1.1232e+13 4987.4

## - view          1 9.4044e+11 1.1466e+13 4991.5

## - grade         1 2.8067e+12 1.3333e+13 5021.7

## 

## Step:  AIC=4976.98

## price ~ sqft_living + waterfront + view + grade + yr_built + 

##     yr_renovated

## 

##                Df  Sum of Sq        RSS    AIC

## - yr_renovated  1 1.8378e+11 1.0846e+13 4975.1

## <none>                       1.0663e+13 4977.0

## - waterfront    1 4.6763e+11 1.1130e+13 4980.3

## - yr_built      1 7.2588e+11 1.1389e+13 4984.9

## - sqft_living   1 7.7333e+11 1.1436e+13 4985.7

## - view          1 1.3804e+12 1.2043e+13 4996.0

## - grade         1 2.6865e+12 1.3349e+13 5016.6

## 

## Step:  AIC=4975.1

## price ~ sqft_living + waterfront + view + grade + yr_built

## 

##               Df  Sum of Sq        RSS    AIC

## <none>                      1.0846e+13 4975.1

## - waterfront   1 4.8667e+11 1.1333e+13 4978.6

## - sqft_living  1 7.6051e+11 1.1607e+13 4983.4

## - yr_built     1 9.3472e+11 1.1781e+13 4986.3

## - view         1 1.2958e+12 1.2142e+13 4992.4

## - grade        1 2.7877e+12 1.3634e+13 5015.5

As predicted before, AIC gives an optimal model with 8 predictors while BIC only has 5. Note that this time we start with the full model and only consider dropping predictors until dropping any one would increase model’s AIC value.

Stata

Stepwise Regression

In stata, stepwise selection method is achieved by stepwise command. pe(#) is for forward selection and pr(#) is for backward selection, where # is the number of significance level wanted. The following regress command is used to specify the regression model where “price” is placed at first as the dependent variable before the 12 variables ready to be added.

. stepwise, pr(.2) pe(.15): regress price bedrooms bathrooms sqft_living sqft_l

> ot floors waterfront view condition grade sqft_above yr_built yr_renovated

The result is as following:

. stepwise, pr(.2) pe(.15): regress price bedrooms bathrooms sqft_living sqft_l

> ot floors waterfront view condition grade sqft_above yr_built yr_renovated

                      begin with full model

p = 0.6489 >= 0.2000  removing bathrooms

p = 0.2871 >= 0.2000  removing condition

p = 0.2217 >= 0.2000  removing bedrooms

p = 0.2165 >= 0.2000  removing floors



      Source |       SS           df       MS      Number of obs   =       200

-------------+----------------------------------   F(8, 191)       =     61.92

       Model |  2.6984e+13         8  3.3730e+12   Prob > F        =    0.0000

    Residual |  1.0404e+13       191  5.4471e+10   R-squared       =    0.7217

-------------+----------------------------------   Adj R-squared   =    0.7101

       Total |  3.7388e+13       199  1.8788e+11   Root MSE        =    2.3e+05



------------------------------------------------------------------------------

       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

  sqft_above |  -92.09825   49.54009    -1.86   0.065    -189.8142    5.617694

yr_renovated |   97.98611   50.10715     1.96   0.052    -.8483454    196.8206

 sqft_living |   169.9357    46.6129     3.65   0.000     77.99353    261.8779

    sqft_lot |   .5255762   .3511394     1.50   0.136     -.167033    1.218185

       grade |   188573.9   26333.23     7.16   0.000     136632.7    240515.2

  waterfront |   632038.3   193397.7     3.27   0.001     250568.6     1013508

        view |   110982.1   25920.41     4.28   0.000     59855.11    162109.2

    yr_built |  -2116.374   678.5175    -3.12   0.002    -3454.724   -778.0237

       _cons |    3055019    1271305     2.40   0.017     547417.9     5562620

------------------------------------------------------------------------------

AIC and BIC

Command regress is used for building a regression model with dependent variable as “price” and predictors as the rest of variables following “price”. Command estat ic is used for showing the AIC and BIC numbers. 1. The regression model with all 13 predictors

. regress price bedrooms bathrooms sqft_living sqft_lot floors waterfront view 

> condition grade sqft_above yr_built yr_renovated



      Source |       SS           df       MS      Number of obs   =       200

-------------+----------------------------------   F(12, 187)      =     41.73

       Model |  2.7222e+13        12  2.2685e+12   Prob > F        =    0.0000

    Residual |  1.0166e+13       187  5.4364e+10   R-squared       =    0.7281

-------------+----------------------------------   Adj R-squared   =    0.7106

       Total |  3.7388e+13       199  1.8788e+11   Root MSE        =    2.3e+05



------------------------------------------------------------------------------

       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------

    bedrooms |  -36553.76    24993.5    -1.46   0.145    -85859.22    12751.69

   bathrooms |   20907.86   45845.26     0.46   0.649    -69532.51    111348.2

 sqft_living |   198.7002   59.35365     3.35   0.001     81.61147     315.789

    sqft_lot |   .6143092   .3566439     1.72   0.087    -.0892533    1.317872

      floors |   52361.89   42530.95     1.23   0.220    -31540.23      136264

  waterfront |   591865.1   196460.3     3.01   0.003     204301.7    979428.5

        view |   109398.2   26061.41     4.20   0.000      57986.1    160810.4

   condition |   31784.07   30879.78     1.03   0.305    -29133.44    92701.58

       grade |   180078.9   27071.28     6.65   0.000     126674.6    233483.3

  sqft_above |  -114.9439   54.12243    -2.12   0.035    -221.7129   -8.174887

    yr_built |  -2280.251   868.4847    -2.63   0.009    -3993.538   -566.9644

yr_renovated |   111.8681   52.49038     2.13   0.034     8.318696    215.4175

       _cons |    3314377    1696541     1.95   0.052    -32442.59     6661197

------------------------------------------------------------------------------



. estat ic



Akaike's information criterion and Bayesian information criterion



-----------------------------------------------------------------------------

       Model |        Obs  ll(null)  ll(model)      df         AIC        BIC

-------------+---------------------------------------------------------------

           . |        200 -2879.193  -2748.963      13    5523.927   5566.805

-----------------------------------------------------------------------------

               Note: N=Obs used in calculating BIC; see [R] BIC note.
  1. The regression model with all predictors without bathrooms ```{} . regress price bedrooms sqft_living sqft_lot floors waterfront view condition > grade sqft_above yr_built yr_renovated

    Source | SS df MS Number of obs = 200 ————-+———————————- F(11, 188) = 45.69 Model | 2.7211e+13 11 2.4737e+12 Prob > F = 0.0000 Residual | 1.0177e+13 188 5.4135e+10 R-squared = 0.7278 ————-+———————————- Adj R-squared = 0.7119 Total | 3.7388e+13 199 1.8788e+11 Root MSE = 2.3e+05

price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
————-+—————————————————————-
bedrooms | -34316.58 24455.69 -1.40 0.162 -82559.4 13926.24
sqft_living | 210.9357 52.83233 3.99 0.000 106.7153 315.156
sqft_lot | .6131552 .3558829 1.72 0.087 -.0888818 1.315192
floors | 57396.66 40986.52 1.40 0.163 -23455.93 138249.2
waterfront | 604034.7 194229.2 3.11 0.002 220885.9 987183.5
view | 109521.1 26005.06 4.21 0.000 58221.88 160820.3
condition | 32809.21 30732.92 1.07 0.287 -27816.47 93434.9
grade | 181304.5 26880.76 6.74 0.000 128277.9 234331.2
sqft_above | -118.857 53.32526 -2.23 0.027 -224.0498 -13.66425
yr_built | -2139.773 810.3121 -2.64 0.009 -3738.246 -541.3011
yr_renovated | 115.7673 51.68016 2.24 0.026 13.81976 217.7148
_cons | 3034493 1578314 1.92 0.056 -78987.29 6147973

. estat ic

Akaike’s information criterion and Bayesian information criterion

Model | Obs ll(null) ll(model) df AIC BIC ————-+————————————————————— . | 200 -2879.193 -2749.074 12 5522.149 5561.729
Note: N=Obs used in calculating BIC; see R BIC note. ```
3. The regression model with all predictors without bathrooms, condition ```{} . regress price bedrooms sqft_living sqft_lot floors waterfront view grade sqft > _above yr_built yr_renovated
Source | SS df MS Number of obs = 200 ————-+———————————- F(10, 189) = 50.11 Model | 2.7149e+13 10 2.7149e+12 Prob > F = 0.0000 Residual | 1.0239e+13 189 5.4175e+10 R-squared = 0.7261 ————-+———————————- Adj R-squared = 0.7117 Total | 3.7388e+13 199 1.8788e+11 Root MSE = 2.3e+05
   price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

————-+—————————————————————- bedrooms | -29475.27 24040.42 -1.23 0.222 -76897.29 17946.74 sqft_living | 213.3751 52.80239 4.04 0.000 109.2174 317.5329 sqft_lot | .5494504 .350974 1.57 0.119 -.1428791 1.24178 floors | 56297.45 40988.73 1.37 0.171 -24556.71 137151.6 waterfront | 608962.1 194246.1 3.14 0.002 225793.1 992131 view | 108787.2 26005.58 4.18 0.000 57488.73 160085.7 grade | 179223.8 26819.91 6.68 0.000 126319 232128.6 sqft_above | -118.1052 53.34031 -2.21 0.028 -223.3241 -12.88639 yr_built | -2492.205 740.2874 -3.37 0.001 -3952.493 -1031.918 yr_renovated | 101.8994 50.0394 2.04 0.043 3.191932 200.6069 _cons | 3838972 1387310 2.77 0.006 1102370 6575573 ——————————————————————————

. estat ic

Akaike’s information criterion and Bayesian information criterion

Model | Obs ll(null) ll(model) df AIC BIC ————-+————————————————————— . | 200 -2879.193 -2749.679 11 5521.358 5557.639
Note: N=Obs used in calculating BIC; see R BIC note. ```
4. The regression model with all predictors without bathrooms, condition, bedrooms```{} . regress price sqft_living sqft_lot floors waterfront view grade sqft_above yr > _built yr_renovated
Source | SS df MS Number of obs = 200 ————-+———————————- F(9, 190) = 55.37 Model | 2.7067e+13 9 3.0075e+12 Prob > F = 0.0000 Residual | 1.0320e+13 190 5.4318e+10 R-squared = 0.7240 ————-+———————————- Adj R-squared = 0.7109 Total | 3.7388e+13 199 1.8788e+11 Root MSE = 2.3e+05
   price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

————-+—————————————————————- sqft_living | 189.5472 49.16078 3.86 0.000 92.57617 286.5182 sqft_lot | .5543912 .3514153 1.58 0.116 -.1387854 1.247568 floors | 50562.59 40774.86 1.24 0.216 -29866.98 130992.2 waterfront | 636850.4 193165.1 3.30 0.001 255826.9 1017874 view | 107647.1 26023.35 4.14 0.000 56315.34 158978.9 grade | 184681.7 26482.9 6.97 0.000 132443.4 236920 sqft_above | -117.0426 53.40385 -2.19 0.030 -222.3832 -11.70196 yr_built | -2489.166 741.263 -3.36 0.001 -3951.328 -1027.003 yr_renovated | 101.1534 50.10192 2.02 0.045 2.325913 199.9808 _cons | 3748296 1387171 2.70 0.008 1012061 6484530 ——————————————————————————

. estat ic

Akaike’s information criterion and Bayesian information criterion

Model | Obs ll(null) ll(model) df AIC BIC ————-+————————————————————— . | 200 -2879.193 -2750.471 10 5520.942 5553.925
Note: N=Obs used in calculating BIC; see R BIC note. ```
5. The regression model with all predictors without bathrooms, condition, bedrooms, floors
```{} . regress price sqft_living sqft_lot waterfront view grade sqft_above yr_built > yr_renovated
Source | SS df MS Number of obs = 200 ————-+———————————- F(8, 191) = 61.92 Model | 2.6984e+13 8 3.3730e+12 Prob > F = 0.0000 Residual | 1.0404e+13 191 5.4471e+10 R-squared = 0.7217 ————-+———————————- Adj R-squared = 0.7101 Total | 3.7388e+13 199 1.8788e+11 Root MSE = 2.3e+05
   price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

————-+—————————————————————- sqft_living | 169.9357 46.6129 3.65 0.000 77.99353 261.8779 sqft_lot | .5255762 .3511394 1.50 0.136 -.167033 1.218185 waterfront | 632038.3 193397.7 3.27 0.001 250568.6 1013508 view | 110982.1 25920.41 4.28 0.000 59855.11 162109.2 grade | 188573.9 26333.23 7.16 0.000 136632.7 240515.2 sqft_above | -92.09825 49.54009 -1.86 0.065 -189.8142 5.617694 yr_built | -2116.374 678.5175 -3.12 0.002 -3454.724 -778.0237 yr_renovated | 97.98611 50.10715 1.96 0.052 -.8483454 196.8206 _cons | 3055019 1271305 2.40 0.017 547417.9 5562620 ——————————————————————————

. estat ic

Akaike’s information criterion and Bayesian information criterion

Model | Obs ll(null) ll(model) df AIC BIC
————-+—————————————————————
. | 200 -2879.193 -2751.277 9 5520.554 5550.239
           Note: N=Obs used in calculating BIC; see [R] BIC note.

```

SAS

The reg procedure is one of many regression procedures in the SAS System. It is a general-purpose procedure for regression, while other SAS regression procedures provide more specialized applications. For simple regression, we just need model statement in reg procedure. First we build a linear model using all predictors.

proc reg data = lmdata;

   model price = bedrooms bathrooms sqft_living sqft_lot floors waterfront

                 view condition grade sqft_above yr_built yr_renovated;

run;

Then SAS will return very detailed results(not shown in the tutorial) of the linear model which includes analysis variance, fit statistics, parameter estimates, fit diagnostics(plots) and residual plots.

Stepwise Regression

Option selection in the model statement is for specifiing model selection methods. The first argument of the selection must be one of the following: adjrsq, b, backward, cp, maxr, minr, none, requare, stepwise. Besides, all the predictors have an assumed entry and exit significance level \(\alpha\) in the stepwise regression. An entry significance level \(\alpha\) = 0.15, specified in the slentry=0.15 option, means a variable must have a \(p-value \le 0.15\) to enter the model during stepwise regression. An exit significance level of 0.15, specified in the slstay=0.15 option, means a variable must have a \(p-value \ge 0.15\) to leave the model during stepwise regression.

/* Use stepwise to select variables */



proc reg data = lmdata outest=lmdata_result1;

   model price = bedrooms bathrooms sqft_living sqft_lot floors waterfront

                 view condition grade sqft_above yr_built yr_renovated / slstay=0.15 slentry=0.15 selection = stepwise;

run;

The result is shown as follows and the model we get by AIC is \[price\sim grade+view+sqft\_living+yr\_built+waterfront+yr\_renovated+sqft\_above+sqft\_lot\]

AIC and BIC

Now change the arguments of the selection to rsquare aic bic and then we can get the model selected by AIC and BIC.

/* Use AIC BIC to select variables */



proc reg data = lmdata outest=lmdata_result2;

   model price = bedrooms bathrooms sqft_living sqft_lot floors waterfront

                 view condition grade sqft_above yr_built yr_renovated / selection = rsquare aic bic;



/* Sort the data by value of AIC */

proc sort data=lmdata_result2; 

   by _aic_;

   

/* Drop useless columns */

data lmdata_AIC;

   set lmdata_result2;

   drop Obs _MODEL_ _TYPE_ _DEPVAR_ _RMSE_ price _IN_ _P_ _EDF_ _RSQ_ _BIC_;



/* We can get the best n model by setting obs=n */

proc print data=lmdata_AIC(obs=4); 

run;

The result is shown as follows and the model we get by AIC is \[price\sim sqft\_living+sqft\_lot+waterfront+view+grade+sqft\_above+yr\_built+yr\_renovated\]

/* Sort the data by value of BIC */

proc sort data=lmdata_result2; 

   by _bic_;



/* Drop useless columns */

data lmdata_BIC;

   set lmdata_result2;

   drop Obs _MODEL_ _TYPE_ _DEPVAR_ _RMSE_ price _IN_ _P_ _EDF_ _RSQ_ _AIC_;



/* We can get the best n model by setting obs=n */

proc print data=lmdata_BIC(obs=4); 

run;

The result is shown as follows and the model we get by BIC is \[price\sim sqft\_living+sqft\_lot+waterfront+view+grade+sqft\_above+yr\_built+yr\_renovated\]

PS: The result is the same as the result we get by AIC, which is different from what we get in R. The reason is that SAS looks like calculate the BIC for every possible subset of variables for models so that it can get a better result.