The purpose of this tutorial is to demonstrate weighted least squares in SAS, R, and Python. The data set used in the example below is available here. The three approaches to weighting that will be used are among those outlined here (one of the approaches is modified slightly).
The goal of the model will be to estimate an abalone’s number of rings as a function of its length. For context, the number of rings an abalone has is a way of measuring its age.
We begin by reading in the data set, fitting a simple linear model, and examining the plot of residuals against fitted values. We need to enable graphics in SAS in order to be able to view diagnostic plots.
/* Enable graphics to view diagnostic plots */
ods graphics on;
ods pdf file="diagnostics.pdf";
/* Read in data */
proc import datafile="abalone.csv" out=abalone dbms=csv replace;
getnames=no;
run;
/* Provide more descriptive names to variables */
data abalone;
set abalone;
rename var1 = sex
var2 = length
var3 = diameter
var4 = height
var5 = whole_wt
var6 = shucked_wt
var7 = visc_wt
var8 = shell_wt
var9 = rings;
run;
/* First, fit OLS model of rings (age) as a function of length */
proc reg data=abalone plots = residualbypredicted;
ods select residualbypredicted;
model rings = length;
run;
ods pdf close;
The resulting plot is shown below, alongside the regression output. It displays a prominent “megaphone” shape, which is indicative of nonconstant error variance. This violates the constant-variance (homoscedasticity) assumption underpinning ordinary least squares regression. Weighted regression is designed to address this issue.
The regression output is shown below as well. Note that the coefficient on length is significant and \(R^2 = 0.3099\) gives some idea of the quality of the fit.
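For reference, weighted least squares chooses the coefficients to minimize a weighted sum of squared errors, where ideally each observation's weight is inversely proportional to its error variance:

\[
(\hat{\beta}_0, \hat{\beta}_1) = \operatorname*{arg\,min}_{\beta_0,\,\beta_1} \sum_{i=1}^{n} w_i \left(y_i - \beta_0 - \beta_1 x_i\right)^2, \qquad w_i \propto \frac{1}{\sigma_i^2}.
\]

Since the \(\sigma_i^2\) are unknown in practice, each of the methods below constructs an estimate of them (or a proxy for them) from the OLS fit.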
The crux of weighted regression lies in determining the appropriate weights to use. A number of approaches are outlined at the link above, three of which will be demonstrated here (one in slightly modified form). All of these approaches make use of the OLS output in some form, so we will run the regression procedure again, this time saving the fitted values and residuals as part of the data set.
/* Save fitted values from regression to use as weights */
proc reg data=abalone plots = residualbypredicted;
model rings = length;
output out = abalone
p = yhat_1 /* Saves fitted values to output data set */
r = resids; /* Saves residuals to output data set */
run;
1.
The simplest approach demonstrated here weights each observation by the inverse of its fitted OLS value. This is a slight modification of an approach outlined at the link above, which suggests weighting by the value of the predictor.
The first step is to add the inverse of each fitted value to the data set:
/* Compute weights */
data abalone;
set abalone;
wt_1 = 1/yhat_1;
run;
Once this has been accomplished, the weighted regression can be fitted.
/* Fit regression weighted by inverse of fitted values */
proc reg data=abalone;
weight wt_1;
model rings = length;
run;
This results in the following output:
The predictor remains statistically significant (its t-statistic actually increases from 43.30 to 50.62), and its coefficient increases in magnitude from 14.94641 to 15.52974.
2.
The next two approaches will make use of the squared residuals from the OLS model and the absolute value of the residuals from the OLS model, so we will add those to our data set as well.
/* Create variables corresponding to square and absolute value of residuals */
data abalone;
set abalone;
resids_sq = resids**2;
resids_abs = abs(resids);
run;
The first of these approaches regresses the absolute value of the residuals from the OLS model on the fitted values from the OLS model. The fitted values from this auxiliary model estimate the standard deviation \(\sigma_i\) of each observation's error, so the weights are taken to be \(1/\hat{\sigma}_i^2\).
/* Fit absolute value of residuals against fitted values */
proc reg data=abalone;
model resids_abs = yhat_1;
output out = abalone
p = yhat_2;
run;
/* Store weight in data set */
data abalone;
set abalone;
wt_2 = 1/yhat_2**2;
run;
These weights are then used to fit another model.
/* Fit regression weighted by fitted values from the absolute-residuals model */
proc reg data=abalone;
weight wt_2;
model rings = length;
run;
This produces the following output:
In this case, the predictor remains statistically significant (with a still larger t-statistic of 58.05) and its coefficient increases slightly again to 15.99657.
3.
The final method repeats the process from the second approach, but replaces the absolute value of the residuals with their square. The squared residuals from the OLS model are regressed on the fitted values from the OLS model; since the fitted values from this auxiliary model estimate the error variances \(\sigma_i^2\), their reciprocals (in absolute value, because a fitted variance can be negative) are stored for use as weights.
/* Fit square of residuals against fitted values */
proc reg data=abalone;
model resids_sq = yhat_1;
output out = abalone
p = yhat_3;
run;
/* Store weight in data set */
data abalone;
set abalone;
wt_3 = 1/abs(yhat_3); /* reciprocal of the estimated variance */
run;
The weights are then used to fit a third regression model:
/* Fit regression weighted by fitted values from the squared-residuals model */
proc reg data=abalone;
weight wt_3;
model rings = length;
run;
This produces the following output:
Yet again, the coefficient increases, this time to 16.15641, and remains statistically significant with a t-statistic of 68.01.
Overall, then, when we encounter nonconstant residual variance, weighted least squares regression is an effective way to improve the parameter estimates. Note that weighting methods are not limited to the three demonstrated here.
As in the SAS example, we begin by reading in the data set, fitting a simple linear model, and examining the plot of residuals against fitted values.
abalone <- read.csv("abalone.data.txt", header=FALSE)
colnames(abalone)<-c("sex","length","diameter","height",
"whole_weight","shucked_weight","viscera_weight",
"shell_weight","rings")
g<-lm(rings~length,data=abalone)
summary(g)
##
## Call:
## lm(formula = rings ~ length, data = abalone)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9665 -1.6961 -0.7423 0.8733 16.6776
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1019 0.1855 11.33 <2e-16 ***
## length 14.9464 0.3452 43.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.679 on 4175 degrees of freedom
## Multiple R-squared: 0.3099, Adjusted R-squared: 0.3098
## F-statistic: 1875 on 1 and 4175 DF, p-value: < 2.2e-16
plot(fitted(g), resid(g), xlab = "fitted", ylab = "residuals")
abline(h=0,col="red")
From the original OLS regression model, we see that the length variable is significant, yet the \(R^2\) is low (0.3099). The plot of residuals versus fitted values makes clear that we have a heteroscedasticity problem, which violates a key OLS assumption.
One way to address this problem is weighted least squares regression; below, we use three different weighting methods in R to demonstrate how it works.
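In symbols, writing \(\hat{y}_i\) for the OLS fitted values and \(e_i\) for the OLS residuals, the three weights used below are

\[
w^{(1)}_i = \frac{1}{\hat{y}_i}, \qquad
w^{(2)}_i = \frac{1}{\widehat{|e_i|}^{\,2}}, \qquad
w^{(3)}_i = \frac{1}{\bigl|\widehat{e_i^2}\bigr|},
\]

where \(\widehat{|e_i|}\) and \(\widehat{e_i^2}\) denote fitted values from auxiliary regressions of \(|e_i|\) and \(e_i^2\) on length.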
1.
Here the weight is \(1/\hat{y}_i\), the inverse of the fitted values from the original OLS model.
abalone$yhat=fitted(g)
abalone$wt_1=1/abalone$yhat
g_wt1<-lm(rings~length,data=abalone, weights = wt_1)
summary(g_wt1)
##
## Call:
## lm(formula = rings ~ length, data = abalone, weights = wt_1)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.7477 -0.5495 -0.2372 0.2992 5.1862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7962 0.1577 11.39 <2e-16 ***
## length 15.5297 0.3068 50.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8363 on 4175 degrees of freedom
## Multiple R-squared: 0.3803, Adjusted R-squared: 0.3802
## F-statistic: 2562 on 1 and 4175 DF, p-value: < 2.2e-16
With this weighting, the parameter estimate for length increases from 14.9464 to 15.5297 and remains significant. The t-statistic increases as well, so the evidence for the predictor is even stronger.
2.
Here we regress the absolute residuals \(|e_i|\) from the original model on the predictor length, and use \(1/\text{fitted}^2\) from that auxiliary regression as the weight.
abalone$resid_abs=abs(resid(g))
fit2<-lm(resid_abs~length,data=abalone)
abalone$wt_2=1/fitted(fit2)^2
g_wt2<-lm(rings~length,data=abalone, weights = wt_2)
summary(g_wt2)
##
## Call:
## lm(formula = rings ~ length, data = abalone, weights = wt_2)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -2.6265 -0.8813 -0.3858 0.5080 8.2189
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.5641 0.1349 11.60 <2e-16 ***
## length 15.9966 0.2755 58.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.343 on 4175 degrees of freedom
## Multiple R-squared: 0.4467, Adjusted R-squared: 0.4465
## F-statistic: 3370 on 1 and 4175 DF, p-value: < 2.2e-16
Here the parameter estimate of length increases slightly again, to 15.9966, and remains significant.
3.
Here we regress the squared residuals \(e_i^2\) from the original model on the predictor length, and use \(1/|\text{fitted}|\) from that auxiliary regression as the weight.
abalone$resid_sq=(abalone$resid_abs)^2
fit3<-lm(resid_sq~length,data = abalone)
abalone$wt_3=abs(1/fitted(fit3))
g_wt3<-lm(rings~length,data = abalone,weights = wt_3)
summary(g_wt3)
##
## Call:
## lm(formula = rings ~ length, data = abalone, weights = wt_3)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -2.5320 -0.6439 -0.2828 0.3761 6.0253
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4688 0.1133 12.96 <2e-16 ***
## length 16.1564 0.2375 68.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9906 on 4175 degrees of freedom
## Multiple R-squared: 0.5256, Adjusted R-squared: 0.5255
## F-statistic: 4626 on 1 and 4175 DF, p-value: < 2.2e-16
Here the parameter estimate of length is now 16.1564, compared to 14.9464 from the original model, and remains significant.
Overall, then, when we encounter nonconstant residual variance, weighted least squares regression is an effective way to improve the parameter estimates. Note that weighting methods are not limited to the three demonstrated here.
In Python, the first step is to load the appropriate packages, read in the data, and fit a simple OLS model.
## Load Packages
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm
from statsmodels.graphics.gofplots import ProbPlot
## Load data
data = pd.read_csv('abalone.csv',na_values=['?'])
## selecting only variables we use
data = data.loc[:,['Length', 'Rings']]
## fit the OLS model and check the result
model_ols = smf.ols("Rings ~ Length", data=data).fit()
print(model_ols.summary())
Here the t-statistic is 43.303. However, diagnostics should be performed to confirm that this model is appropriate.
# fitted values from the OLS model
model_fitted_y = model_ols.fittedvalues
# model residuals
model_residuals = model_ols.resid
# normalized residuals
model_norm_residuals = model_ols.get_influence().resid_studentized_internal
# absolute residuals
model_abs_resid = np.abs(model_residuals)
plot_lm_1 = plt.figure(1)
plot_lm_1.set_figheight(8)
plot_lm_1.set_figwidth(12)
plot_lm_1.axes[0] = sns.residplot(x=model_fitted_y, y='Rings', data=data,
lowess=True,
scatter_kws={'alpha': 0.5},
line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plot_lm_1.axes[0].set_title('Residuals vs Fitted')
plot_lm_1.axes[0].set_xlabel('Fitted values')
plot_lm_1.axes[0].set_ylabel('Residuals')
# annotations
abs_resid = model_abs_resid.sort_values(ascending=False)
abs_resid_top_3 = abs_resid[:3]
for i in abs_resid_top_3.index:
plot_lm_1.axes[0].annotate(i,
xy=(model_fitted_y[i],
model_residuals[i]));
As shown in the Residuals vs Fitted plot, there is a megaphone shape, which indicates that non-constant variance is likely to be an issue. We will try to use weighted least squares to address this. We are going to use three different weights (based on the approaches suggested at the link above).
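As a supplement to the visual check (this test is not part of the original tutorial), a Breusch-Pagan test from statsmodels can confirm the heteroscedasticity formally; a minimal sketch:
## Optional: Breusch-Pagan test for heteroscedasticity
## (a small p-value rejects the hypothesis of constant error variance)
from statsmodels.stats.diagnostic import het_breuschpagan
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(model_ols.resid,
                                                    model_ols.model.exog)
print("Breusch-Pagan LM statistic: %.2f, p-value: %.3g" % (lm_stat, lm_pval))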
Weighted Least Squares
1.
The weight here is the inverse of the fitted values obtained from the OLS model.
## An intercept is not included by default, so we have to add it manually
y = data["Rings"].tolist()
x = data["Length"].tolist()
## add an intercept column
x = sm.add_constant(x)
## Compute the weight and add it to the column named "weight_1"
data["weight_1"] = model_fitted_y
data["weight_1"] = data["weight_1"]**-1
model_wls_1 = sm.WLS(y, x, weights=data['weight_1'])
mod_res_1 = model_wls_1.fit()
print(mod_res_1.summary())
After putting the weight in the model, the estimate of the coefficient changes slightly and the t-statistic increases to 50.621.
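As an optional sanity check (not in the original tutorial), we can plot the whitened residuals from the WLS fit, which statsmodels exposes as wresid (the residuals scaled by the square root of the weights); if the weights are appropriate, the megaphone shape should be much reduced:
## Optional: check whether the weighting stabilized the variance
plt.figure()
plt.scatter(mod_res_1.fittedvalues, mod_res_1.wresid, alpha=0.5)
plt.axhline(0, color='red', lw=1)
plt.xlabel('Fitted values')
plt.ylabel('Whitened residuals')
plt.title('WLS (weight_1): whitened residuals vs fitted')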
2.
Here we regress the absolute values of the residuals against the predictor (which is equivalent to regressing against the fitted values, since we have only one predictor). The fitted values from this regression estimate the standard deviation of the errors, so we use the inverse of their squares as the weights.
## Compute the weight and add it to the column named "weight_2"
data["temp"] = model_abs_resid
model_temp = smf.ols("temp ~ Length", data=data).fit()
weight_2 = model_temp.fittedvalues
weight_2 = weight_2**-2
data['weight_2'] = weight_2
mod_wls = sm.WLS(y, x, weights=data['weight_2'])
mod_res = mod_wls.fit()
print(mod_res.summary())
Now \(t = 58.054\), an increase compared to the previous model.
3.
Here we regress the squared residuals against the predictor (again equivalent to regressing against the fitted values). The fitted values from this regression estimate the error variances, so the inverse of their absolute values is used as the weight.
## Compute the weight and add it to the column named "weight_3"
data["temp"] = model_residuals**2
model_temp = smf.ols("temp ~ Length", data=data).fit()
weight_3 = model_temp.fittedvalues
weight_3 = abs(weight_3)
weight_3 = weight_3**-1
data['weight_3'] = weight_3
mod_wls = sm.WLS(y, x, weights=data['weight_3'])
mod_res = mod_wls.fit()
print(mod_res.summary())
Compared to the previous models, we now have the largest t-statistic, 68.013. The results suggest that weighted least squares is a good way to improve parameter estimates when we face a non-constant variance problem. There are many ways to choose the weights, and the best choice may differ from dataset to dataset.
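To wrap up, here is a short consolidation sketch (assuming data, x, and y from above are still in scope) that refits each model and tabulates the Length coefficient and t-statistic side by side:
## Compare the Length coefficient and t-statistic across weighting schemes
results = {"OLS": sm.OLS(y, x).fit()}
for col in ["weight_1", "weight_2", "weight_3"]:
    results[col] = sm.WLS(y, x, weights=data[col]).fit()
comparison = pd.DataFrame({name: {"coef": res.params[1], "t": res.tvalues[1]}
                           for name, res in results.items()}).T
print(comparison)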