Introduction

Ridge regression is one of the so-called “shrinkage methods”, which are usually applied to a regression model when there is instability resulting from collinearity of the predictors.

In ordinary least squares (OLS) regression, the estimate of the coefficient vector \(\beta\) is given by: \[\hat\beta=(X^TX)^{-1}X^Ty.\] When the predictors are collinear or nearly collinear, the matrix \(X^TX\) becomes singular (rarely the case) or nearly singular, and the inverse then responds sensitively to small errors in the data, which makes predictions from such a model unstable.

Ridge regression, in contrast, makes a trade-off between bias and variance in prediction. By introducing a relatively small bias, you may obtain a large reduction in the variance, and thus in the mean-squared error: \[MSE=E(\hat\beta-\beta)^2=(E(\hat\beta-\beta))^2+E(\hat\beta-E\hat\beta)^2=\mathrm{bias^2+variance}.\] This is achieved by adding a penalty term to the loss function: \[(y-X\beta)^T(y-X\beta)+\lambda\sum_j\beta_j^2.\] The \(\beta\) that minimizes the new loss function is the ridge regression estimate of \(\beta\): \[\hat\beta=(X^TX+\lambda I)^{-1}X^Ty.\] It is clearly biased, since \(E(\hat\beta)=(X^TX+\lambda I)^{-1}(X^TX)\beta\) (the coefficients are shrunk towards \(0\)), but note that \((X^TX)^{-1}\) is replaced by \((X^TX+\lambda I)^{-1}\), which is far better conditioned and hence much less unstable.
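As a quick illustration of these formulas, here is a minimal sketch that computes the OLS and ridge estimates directly from their closed forms on simulated, nearly collinear data (the simulated data, the variable names and the value \(\lambda=1\) are purely illustrative; in practice the intercept is usually left unpenalized and the predictors standardized first, as discussed below):

set.seed(1)
n  = 50
x1 = rnorm(n)
x2 = x1 + rnorm(n, sd = 0.01)        # nearly collinear with x1
y  = 1 + 2 * x1 - x2 + rnorm(n)
X  = cbind(1, x1, x2)                # design matrix with an intercept column
lambda = 1                           # illustrative value only
beta_ols   = solve(t(X) %*% X, t(X) %*% y)
beta_ridge = solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
cbind(OLS = drop(beta_ols), ridge = drop(beta_ridge))

Comparing the two columns, the ridge coefficients are pulled towards zero and are far less sensitive to the near-collinearity of x1 and x2.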

The Example Dataset

In the following sections, the method will be illustrated with the dataset seatpos from package faraway in R, where the sign of the response hipcenter is flipped so that it takes positive values.

library(faraway)
data(seatpos)
seatpos$hipcenter = -seatpos$hipcenter
head(seatpos)
##   Age Weight HtShoes    Ht Seated  Arm Thigh  Leg hipcenter
## 1  46    180   187.2 184.9   95.2 36.1  45.3 41.3   206.300
## 2  31    175   167.5 165.5   83.8 32.9  36.5 35.9   178.210
## 3  23    100   153.6 152.2   82.9 26.0  36.6 31.0    71.673
## 4  19    185   190.3 187.4   97.3 37.4  44.1 41.0   257.720
## 5  23    159   178.0 174.1   93.9 29.5  40.1 36.9   173.230
## 6  47    170   178.7 177.0   92.4 36.0  43.2 37.4   185.150

The condition number of \(X^TX\) (where \(X\) is the design matrix) is large, suggesting strong collinearity of the predictors:

X = as.matrix(seatpos[, -9])
e = eigen(t(X) %*% X)$val
max(e)/min(e)
## [1] 500625.7

Fitting a Ridge Regression Model

Before fitting a ridge regression model, it is recommended to standardize the predictors. You can still carry out ridge regression without doing so, but standardization improves its effect, as it makes the shrinkage treat each coefficient fairly. Luckily, the function that we are going to use here standardizes the data automatically, so we do not need to do the standardization ourselves.
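(If you did want to standardize the predictors by hand, for instance to use a fitting function that does not do it for you, a one-liner like the following sketch would do; the name seatpos_std is just for illustration and is not used below.)

seatpos_std = data.frame(scale(seatpos[, -9]), hipcenter = seatpos$hipcenter)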

To carry out ridge regression in R, you will need the function lm.ridge in package MASS. (The function penalized in package penalized should also work.)

library(MASS)
fit = lm.ridge(hipcenter ~ ., seatpos, lambda = seq(0, .4, 1e-3))

We can observe how the coefficients shrink as \(\lambda\) grows larger:

par(mar = c(4, 4, 0, 0), cex = 0.7, las = 1)
matplot(fit$lambda, coef(fit), type = "l", ylim = c(-1, 3),
        xlab = expression(lambda), ylab = expression(hat(beta)))

The fitted model is an object of class "ridgelm". It has a number of components: note in particular that xm and scales hold the means and the scales with which the input values are centered and scaled.
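For example, a quick look at the fitted object (the component names below are those returned by lm.ridge):

names(fit)      # components of the "ridgelm" object
fit$xm          # means used to center the predictors
fit$scales      # scales used to standardize the predictors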

Selection of \(\lambda\)

As \(\lambda\) grows larger, the coefficients (as well as the prediction variance) shrink, while the bias increases. Thus we have to select a \(\lambda\) that makes a good trade-off, so as to control the overall prediction error. However, this choice is not made for us automatically: the range of lambda in the code above is supplied as an argument. So, although package MASS can work out the best \(\lambda\) within a range, we still need a sensible range to start with.

Notice that the eigenvalues of \(X^TX+\lambda I\) are a good reference here; they can be obtained by adding \(\lambda\) to the eigenvalues of \(X^TX\) (computed below for the standardized predictors):

X = sapply(seatpos[, -9], scale)
round(eigen(t(X) %*% X)$val, 3)
## [1] 209.908  45.761  17.159   8.915   7.186   5.149   1.863   0.059

So, in order to reduce the collinearity (i.e. to reduce the ratio between the largest and smallest eigenvalues of \(X^TX+\lambda I\)) without introducing too much bias, a range up to roughly \(10^1\sim10^2\) seems to be a good point to start with.
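As a quick sanity check (a small sketch reusing the standardized X from above, with a few illustrative values of \(\lambda\)), we can see how adding \(\lambda\) flattens that ratio:

e = eigen(t(X) %*% X)$val                             # eigenvalues of the standardized X^T X
sapply(c(0, 1, 10, 30), function(l) max(e + l) / min(e + l))

With this in mind, we refit the model with \(\lambda\) ranging from 0 to 30: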

fit = lm.ridge(hipcenter ~ ., seatpos, lambda = seq(0, 30, 1e-3))

Then, to select an appropriate \(\lambda\), we can apply the command select:

select(fit)
## modified HKB estimator is 5.425415 
## modified L-W estimator is 3.589434 
## smallest value of GCV  at 22.301

You may want to specify the function as MASS::select to avoid mistakenly calling dplyr::select instead.

Here the HKB estimator and the L-W estimator are two different estimates of the ridge constant, i.e. two suggested values which \(\lambda\) can be set to directly, while the location of the smallest value of the GCV (generalized cross-validation) score provides a third suggestion for \(\lambda\).

const = as.numeric(names(which.min(fit$GCV)))   # lambda with the smallest GCV score
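The HKB and L-W estimates can also be read directly off the fitted object, which stores them as the components kHKB and kLW:

fit$kHKB    # HKB estimate of the ridge constant
fit$kLW     # L-W estimate of the ridge constant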

To check how the value of the GCV score varies, you can also plot the GCV scores with

par(mar = c(4, 4, 0, 0), cex = 0.7, las = 1)
plot(fit$lambda, fit$GCV, type = 'l',
     xlab = expression(lambda), ylab = "GCV Score")

You can also plot the in-sample error against \(\lambda\) to get more information before selecting \(\lambda\), as sketched below.
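A minimal sketch of such a plot, reconstructing the fitted values for every \(\lambda\) from the ym, xm, scales and coef components of the fit (coef holds one column of coefficients per value of \(\lambda\)):

Xs = scale(seatpos[, -9], center = fit$xm, scale = fit$scales)
fitted_all = fit$ym + Xs %*% fit$coef                   # one column of fitted values per lambda
in_mse = colMeans((fitted_all - seatpos$hipcenter)^2)   # in-sample MSE for each lambda
plot(fit$lambda, in_mse, type = "l",
     xlab = expression(lambda), ylab = "In-sample MSE")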

Prediction with a Fitted Ridge Regression Model

Package MASS does not provide a predict method for "ridgelm" objects, so we have to work out the predicted values explicitly.

For later convenience (to assess the out-of-sample prediction performance of ridge regression on this dataset), in this section we split the dataset into a training set and a test set:

test_obs = sample(nrow(seatpos), 5)   # no seed is set here, so the sampled rows may differ
test = seatpos[test_obs, ]
training = seatpos[-test_obs, ]
test
##    Age Weight HtShoes    Ht Seated  Arm Thigh  Leg hipcenter
## 35  19    150   172.2 169.9   89.4 34.0  39.7 38.1    268.32
## 38  56    158   173.8 171.5   90.0 36.1  39.2 35.5    176.45
## 11  47    143   171.9 169.1   87.8 32.9  39.2 36.9    196.15
## 30  40    115   153.8 151.6   80.3 27.5  37.6 31.7    102.20
## 22  33    293   201.2 198.4  101.6 39.6  44.2 43.1    279.15

Fit the model with training data:

trained = lm.ridge(hipcenter ~ ., training, lambda = seq(0, 30, 1e-3))
select(trained)
## modified HKB estimator is 7.232087 
## modified L-W estimator is 3.898503 
## smallest value of GCV  at 23.501

Work out the predicted values with the fitted coefficients and the scaled predictors:

# center and scale the test predictors with the training means and scales, apply the
# coefficients at the GCV-selected lambda, then add back the mean of the response
predicted = trained$ym +
  scale(test[, -9], center = trained$xm, scale = trained$scales) %*%
  trained$coef[, which.min(trained$GCV)]
as.numeric(predicted)
## [1] 175.27525 159.71639 155.78005  95.43672 278.75054

The fit is not perfect, in fact. To assess its performance, we can compute the square root of the MSE, i.e. the RMSE, on the test data:

sqrt(mean((predicted - test$hipcenter)^2))
## [1] 46.07162
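If you need such predictions repeatedly, a small helper function (hypothetical, not part of MASS) can wrap the centering, scaling and coefficient lookup; it assumes the fit was made over more than one value of \(\lambda\):

# hypothetical helper: predictions from a "ridgelm" fit at the lambda with the smallest GCV
predict_ridgelm = function(object, newdata) {
  Xs = scale(as.matrix(newdata), center = object$xm, scale = object$scales)
  drop(object$ym + Xs %*% object$coef[, which.min(object$GCV)])
}
predict_ridgelm(trained, test[, -9])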

Comparison with OLS

In terms of the mean-squared error of the coefficient estimates, Hoerl and Kennard (1970) proved that with an appropriate value of \(\lambda\) ridge regression always performs better than OLS, i.e. there is always some \(\lambda>0\) that gives a smaller MSE.

However, ridge regression does not necessarily promise better performance in terms of out-of-sample error.

fit.OLS = lm(hipcenter ~ ., training)
sqrt(mean((predict(fit.OLS, test) - test$hipcenter)^2))
## [1] 45.92622

It may well turn out to be even smaller than that of ridge regression, as it is in this case; ridge regression does not show a clear advantage on this data set.

Reflection

This data set is rather small, and many factors may contribute to the poor performance of ridge regression on it. It is a pity that we cannot present the power of ridge regression in a clearer and more impressive way here.

References

Faraway, Julian J. Linear models with R. CRC Press, 2014.

Inoue, Takakatsu. “Improving the ‘HKB’ ordinary type ridge estimator.” Journal of the Japan Statistical Society 31.1 (2001): 67-83.

Brown, Philip J. Measurement, regression, and calibration. Oxford University Press, 1993.

Hoerl, Arthur E., and Robert W. Kennard. “Ridge regression: Biased estimation for nonorthogonal problems.” Technometrics 12.1 (1970): 55-67.

Hoerl, Arthur E., and Robert W. Kennard. “Ridge regression: applications to nonorthogonal problems.” Technometrics 12.1 (1970): 69-82.