Ridge regression is one of the so-called “shrinkage methods”. It is typically applied to a regression model whose estimates are unstable because the predictors are collinear or nearly so.
In ordinary least squares (OLS) regression, the estimate of the coefficient vector \(\beta\) is given by \[\hat\beta=(X^TX)^{-1}X^Ty.\] When the predictors are collinear or nearly collinear, the matrix \(X^TX\) becomes singular (rarely the case) or nearly singular, so its inverse responds very sensitively to small perturbations in the data, which makes predictions from such a model unstable.
Ridge regression, in contrast, trades bias for variance: by accepting a relatively small bias, we can hope for a large reduction in the variance, and hence in the mean-squared error \[\mathrm{MSE}=E\big[(\hat\beta-\beta)^2\big]=\big(E[\hat\beta]-\beta\big)^2+E\big[(\hat\beta-E[\hat\beta])^2\big]=\mathrm{bias^2+variance}.\] This is achieved by adding a penalty term to the loss function: \[(y-X\beta)^T(y-X\beta)+\lambda\sum_j\beta_j^2.\] The \(\beta\) that minimizes this penalized loss is the ridge regression estimate \[\hat\beta=(X^TX+\lambda I)^{-1}X^Ty.\] It is clearly biased, since \(E(\hat\beta)=(X^TX+\lambda I)^{-1}(X^TX)\beta\) (the coefficients are shrunk towards \(0\)), but note that \((X^TX)^{-1}\) has been replaced by \((X^TX+\lambda I)^{-1}\), which is much better conditioned.
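As a concrete, purely illustrative sketch of this closed form in R (assuming a generic numeric design matrix X whose columns have already been centered and scaled, and a numeric response vector y; the function name is made up for illustration):
# Minimal sketch of the closed-form ridge estimate
ridge_beta = function(X, y, lambda) {
  # solve (X^T X + lambda I) beta = X^T y
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}
In practice the function lm.ridge used below takes care of all of this, including the standardization.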
In the following sections, the method will be illustrated with the dataset faraway::seatpos in R, where the response hipcenter is negated so that its values are positive.
library(faraway)
data(seatpos)
seatpos$hipcenter = -seatpos$hipcenter
head(seatpos)
## Age Weight HtShoes Ht Seated Arm Thigh Leg hipcenter
## 1 46 180 187.2 184.9 95.2 36.1 45.3 41.3 206.300
## 2 31 175 167.5 165.5 83.8 32.9 36.5 35.9 178.210
## 3 23 100 153.6 152.2 82.9 26.0 36.6 31.0 71.673
## 4 19 185 190.3 187.4 97.3 37.4 44.1 41.0 257.720
## 5 23 159 178.0 174.1 93.9 29.5 40.1 36.9 173.230
## 6 47 170 178.7 177.0 92.4 36.0 43.2 37.4 185.150
The condition number of \(X^TX\) (where \(X\) is the design matrix) is large, suggesting strong collinearity of the predictors:
X = as.matrix(seatpos[, -9])
e = eigen(t(X) %*% X)$val
max(e)/min(e)
## [1] 500625.7
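Equivalently (a side note), kappa(X, exact = TRUE) returns the exact condition number of X itself, which is the square root of this ratio:
kappa(X, exact = TRUE)   # condition number of X = sqrt of the ratio above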
Before carrying out ridge regression, it is recommended to standardize the predictors. You can still run ridge regression without doing so, but standardization makes the shrinkage act fairly on each coefficient and thus improves its effect. Conveniently, the function we are going to use standardizes the data automatically, so we do not need to do it ourselves.
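If you use a ridge implementation that does not standardize internally, a manual standardization along the following lines would do (a sketch; note that lm.ridge itself scales by the root mean square rather than the sample standard deviation, but the idea is the same):
# Center and scale the predictors by hand (illustrative; not needed for lm.ridge)
X_std = scale(seatpos[, -9])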
To carry out ridge regression in R, you will need the function lm.ridge in the package MASS. (The function penalized in the package penalized should also work.)
library(MASS)
fit = lm.ridge(hipcenter ~ ., seatpos, lambda = seq(0, .4, 1e-3))
We can observe how the coefficients shrink as \(\lambda\) grows larger:
par(mar = c(4, 4, 0, 0), cex = 0.7, las = 1)
matplot(fit$lambda, coef(fit), type = "l", ylim = c(-1, 3),
        xlab = expression(lambda), ylab = expression(hat(beta)))
The fitted model is an object of class "ridgelm". It has many components: note in particular that xm and scales give the means and the scales with which the input values are centered and scaled.
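As a quick illustration of these components, the coefficients can be mapped back to the original scale of the data by hand (this is essentially what coef(fit) already does for you; the variable names below are just for illustration):
best  = which.min(fit$GCV)              # index of the lambda minimizing GCV
beta  = fit$coef[, best] / fit$scales   # slopes on the original scale
alpha = fit$ym - sum(fit$xm * beta)     # intercept on the original scale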
As \(\lambda\) grows larger, the coefficients (and the prediction variance) shrink, while the bias increases. We therefore have to choose a \(\lambda\) that balances the two, so as to control the overall prediction error. This choice is not made for us automatically: the range of lambda in the code above is supplied as an argument. So, although we can work out the best \(\lambda\) with the package MASS, we still need a sensible range to start from.
The eigenvalues of \(X^TX+\lambda I\) are a good reference here; they can be obtained by simply adding \(\lambda\) to the eigenvalues of \(X^TX\):
X = sapply(seatpos[, -9], scale)
round(eigen(t(X) %*% X)$val, 3)
## [1] 209.908 45.761 17.159 8.915 7.186 5.149 1.863 0.059
So, in order to reduce the collinearity (i.e. to reduce the ratio between the largest and smallest eigenvalues of \(X^TX+\lambda I\)) without introducing too much bias, \(10^1\sim10^2\) seems to be a good region to start with.
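As a rough check of this reasoning (a sketch, reusing the standardized X computed above), we can see how the ratio between the largest and smallest eigenvalues of \(X^TX+\lambda I\) shrinks for a few candidate values of \(\lambda\):
e = eigen(t(X) %*% X)$values
# ratio of largest to smallest eigenvalue of X^T X + lambda I
sapply(c(0, 1, 10, 100), function(lambda) (max(e) + lambda) / (min(e) + lambda))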
fit = lm.ridge(hipcenter ~ ., seatpos, lambda = seq(0, 30, 1e-3))
Then, to select an appropriate \(\lambda\), we can apply the command select:
select(fit)
## modified HKB estimator is 5.425415
## modified L-W estimator is 3.589434
## smallest value of GCV at 22.301
You may want to write the function as MASS::select to avoid mistakenly calling dplyr::select instead.
Here the HKB estimator and the L-W estimator are two different estimates of the ridge constant, i.e. suggested values of \(\lambda\), while the location of the smallest GCV (generalized cross-validation) score provides yet another suggestion for \(\lambda\).
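These two estimates are also stored in the fitted object, as kHKB and kLW, so they can be extracted directly:
c(HKB = fit$kHKB, LW = fit$kLW)
The \(\lambda\) at which the GCV score is smallest can be extracted as follows: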
const = as.numeric(names(which.min(fit$GCV)))
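Equivalently, since fit$GCV follows the same order as the lambda grid, the same value can be obtained more directly:
const = fit$lambda[which.min(fit$GCV)]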
To check how the GCV score varies with \(\lambda\), you can also plot it:
par(mar = c(4, 4, 0, 0), cex = 0.7, las = 1)
plot(fit$lambda, fit$GCV, type = 'l',
     xlab = expression(lambda), ylab = "GCV Score")
You can also plot the in-sample error against \(\lambda\) to get more information before selecting \(\lambda\); a sketch of this is given below.
The package MASS does not provide a predict method for "ridgelm" objects, so we have to work out the predicted values explicitly.
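For instance, here is a sketch of how the in-sample RMSE could be computed for every \(\lambda\) on the grid and plotted, using the stored centering (xm, ym) and scaling (scales); the variable names are purely illustrative:
# In-sample fitted values for all lambdas at once (one column per lambda)
Xs = scale(seatpos[, -9], center = fit$xm, scale = fit$scales)
fitted_all = fit$ym + Xs %*% fit$coef
rmse_in = sqrt(colMeans((fitted_all - seatpos$hipcenter)^2))
plot(fit$lambda, rmse_in, type = "l",
     xlab = expression(lambda), ylab = "In-sample RMSE")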
To assess the out-of-sample prediction performance of ridge regression on this dataset later on, we split the data into a training set and a test set:
test_obs = sample(nrow(seatpos), 5)
test = seatpos[test_obs, ]
training = seatpos[-test_obs, ]
test
## Age Weight HtShoes Ht Seated Arm Thigh Leg hipcenter
## 35 19 150 172.2 169.9 89.4 34.0 39.7 38.1 268.32
## 38 56 158 173.8 171.5 90.0 36.1 39.2 35.5 176.45
## 11 47 143 171.9 169.1 87.8 32.9 39.2 36.9 196.15
## 30 40 115 153.8 151.6 80.3 27.5 37.6 31.7 102.20
## 22 33 293 201.2 198.4 101.6 39.6 44.2 43.1 279.15
Fit the model with training data:
trained = lm.ridge(hipcenter ~ ., training, lambda = seq(0, 30, 1e-3))
select(trained)
## modified HKB estimator is 7.232087
## modified L-W estimator is 3.898503
## smallest value of GCV at 23.501
Work out the predicted values from the fitted coefficients and the scaled predictors:
predicted = trained$ym +
  scale(test[, -9], center = trained$xm, scale = trained$scales) %*%
  trained$coef[, which.min(trained$GCV)]
as.numeric(predicted)
## [1] 175.27525 159.71639 155.78005 95.43672 278.75054
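As a sanity check, the same predictions can be obtained from the original-scale coefficients returned by coef():
# Predictions from the original-scale coefficients (intercept + slopes)
beta_best = coef(trained)[which.min(trained$GCV), ]
as.numeric(cbind(1, as.matrix(test[, -9])) %*% beta_best)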
The fit is not perfect, in fact. To assess its performance, we can compute the square root of the MSE on the test set, i.e. the RMSE:
sqrt(mean((predicted - test$hipcenter)^2))
## [1] 46.07162
Hoerl and Kennard (1970) proved that there always exists a value of \(\lambda>0\) for which the ridge estimator has smaller MSE than the OLS estimator.
However, ridge regression does not necessarily promise better performance in terms of out-of-sample error.
fit.OLS = lm(hipcenter ~ ., training)
sqrt(mean((predict(fit.OLS, test) - test$hipcenter)^2))
## [1] 45.92622
Here it even turns out to be slightly smaller than that of ridge regression, and ridge regression does not seem to perform well on this data set.
The data set is quite small, and many factors may contribute to the weak performance of ridge regression here. It is a pity that this example cannot present the power of ridge regression in a clearer and more impressive way.
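One way to reduce the dependence on a single random split is to repeat the split many times and average the test RMSEs of the two methods. A rough sketch follows; the exact numbers depend on the seed, and no particular outcome is claimed:
set.seed(1)   # illustrative seed
rmse_pair = replicate(100, {
  idx = sample(nrow(seatpos), 5)
  tr = seatpos[-idx, ]
  te = seatpos[idx, ]
  rfit = lm.ridge(hipcenter ~ ., tr, lambda = seq(0, 30, 0.1))
  pred = rfit$ym +
    scale(te[, -9], center = rfit$xm, scale = rfit$scales) %*%
    rfit$coef[, which.min(rfit$GCV)]
  ofit = lm(hipcenter ~ ., tr)
  c(ridge = sqrt(mean((pred - te$hipcenter)^2)),
    OLS = sqrt(mean((predict(ofit, te) - te$hipcenter)^2)))
})
rowMeans(rmse_pair)   # average test RMSE of ridge vs OLS over the splits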
Faraway, Julian J. Linear Models with R. CRC Press, 2014.
Inoue, Takakatsu. “Improving the ‘HKB’ ordinary type ridge estimator.” Journal of the Japan Statistical Society 31.1 (2001): 67-83.
Brown, Philip J. Measurement, Regression, and Calibration. Oxford University Press, 1993.
Hoerl, Arthur E., and Robert W. Kennard. “Ridge regression: Biased estimation for nonorthogonal problems.” Technometrics 12.1 (1970): 55-67.
Hoerl, Arthur E., and Robert W. Kennard. “Ridge regression: Applications to nonorthogonal problems.” Technometrics 12.1 (1970): 69-82.