The purpose of this tutorial is to demonstrate weighted least squares in SAS, R, and Python. The data set used in the example below is available here. The three approaches to weighting that will be used are among those outlined here (one of the approaches is modified slightly).
The goal of the model will be to estimate an abalone’s number of rings as a function of its length. For context, the number of rings an abalone has is a way of measuring its age.
We begin by reading in the data set, fitting a simple linear model, and examining the plot of residuals against fitted values. We need to enable graphics in SAS in order to be able to view diagnostic plots.
/* Enable graphics to view diagnostic plots */
ods graphics on;
ods pdf file="diagnostics.pdf";
/* Read in data */
proc import datafile="abalone.csv" out=abalone dbms=csv replace;
getnames=no;
run;
/* Provide more descriptive names to variables */
data abalone;
set abalone;
rename var1 = sex
var2 = length
var3 = diameter
var4 = height
var5 = whole_wt
var6 = shucked_wt
var7 = visc_wt
var8 = shell_wt
var9 = rings;
run;
/* First, fit OLS model of rings (age) as a function of length */
proc reg data=abalone plots = residualbypredicted;
ods select residualbypredicted;
model rings = length;
run;
ods pdf close;
The resulting plot is shown below, alongside the regression output. It displays a prominent “megaphone” shape, which is indicative of nonconstant error variance. This violates the constant-variance (homoscedasticity) assumption underpinning ordinary least squares regression. Weighted regression is designed to address this issue.
The regression output is shown below as well. Note that the coefficient on length is significant and \(R^2 = 0.3099\) gives some idea of the quality of the fit.
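For reference, weighted least squares chooses the coefficients to minimize a weighted sum of squared errors, where ideally each observation's weight is inversely proportional to its error variance:

\[
(\hat{\beta}_0, \hat{\beta}_1) = \operatorname*{arg\,min}_{\beta_0,\,\beta_1} \sum_{i=1}^{n} w_i \left(y_i - \beta_0 - \beta_1 x_i\right)^2, \qquad w_i \propto \frac{1}{\sigma_i^2}.
\]

Since the \(\sigma_i^2\) are unknown in practice, each of the methods below constructs an estimate of them (or a proxy for them) from the OLS fit.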
The crux of weighted regression lies in determining the appropriate weights to use. A number of approaches are outlined at the link above, three of which will be demonstrated here (one in slightly modified form). All of these approaches make use of the OLS output in some form, so we will run the regression procedure again, this time saving the fitted values and residuals as part of the data set.
/* Save fitted values from regression to use as weights */
proc reg data=abalone plots = residualbypredicted;
model rings = length;
output out = abalone
p = yhat_1 /* Saves fitted values to output data set */
r = resids; /* Saves residuals to output data set */
run;
1.
The simplest approach demonstrated here weights each observation by the inverse of its fitted OLS value. This is a slight modification of an approach outlined at the link above, which suggests weighting by the value of the predictor.
The first step is to add the inverse of each fitted value to the data set:
/* Compute weights */
data abalone;
set abalone;
wt_1 = 1/yhat_1;
run;
Once this has been accomplished, the weighted regression can be fitted.
/* Fit regression weighted by inverse of fitted values */
proc reg data=abalone;
weight wt_1;
model rings = length;
run;
This results in the following output:
The predictor remains statistically significant (its t-statistic actually increases from 43.30 to 50.62), and its coefficient increases in magnitude from 14.94641 to 15.52974.
2.
The next two approaches will make use of the squared residuals from the OLS model and the absolute value of the residuals from the OLS model, so we will add those to our data set as well.
/* Create variables corresponding to square and absolute value of residuals */
data abalone;
set abalone;
resids_sq = resids**2;
resids_abs = abs(resids);
run;
The first of these approaches regresses the absolute value of the residuals from the OLS model on the fitted values from the OLS model. The fitted values from this auxiliary model estimate the standard deviation \(\sigma_i\) of each observation's error, so the weights are taken to be \(1/\hat{\sigma}_i^2\).
/* Fit absolute value of residuals against fitted values */
proc reg data=abalone;
model resids_abs = yhat_1;
output out = abalone
p = yhat_2;
run;
/* Store weight in data set */
data abalone;
set abalone;
wt_2 = 1/yhat_2**2;
run;
These weights are then used to fit another model.
/* Fit regression weighted by fitted values from the absolute-residuals model */
proc reg data=abalone;
weight wt_2;
model rings = length;
run;
This produces the following output:
In this case, the predictor remains statistically significant (with a still larger t-statistic of 58.05) and its coefficient increases slightly again to 15.99657.
3.
The final method repeats the process from the second approach, but replaces the absolute value of the residuals with their square. The squared residuals from the OLS model are regressed on the fitted values from the OLS model; since the fitted values from this auxiliary model estimate the error variances \(\sigma_i^2\), their reciprocals (in absolute value, because a fitted variance can be negative) are stored for use as weights.
/* Fit square of residuals against fitted values */
proc reg data=abalone;
model resids_sq = yhat_1;
output out = abalone
p = yhat_3;
run;
/* Store weight in data set */
data abalone;
set abalone;
wt_3 = 1/abs(yhat_3); /* reciprocal of the estimated variance */
run;
The weights are then used to fit a third regression model:
/* Fit regression weighted by fitted values from the squared-residuals model */
proc reg data=abalone;
weight wt_3;
model rings = length;
run;
This produces the following output:
Yet again, the coefficient increases, this time to 16.15641, and remains statistically significant with a t-statistic of 68.01.
Overall, then, when we encounter nonconstant residual variance, weighted least squares regression is an effective way to improve the parameter estimates. Note that weighting methods are not limited to the three demonstrated here.
As in the SAS example, we begin by reading in the data set, fitting a simple linear model, and examining the plot of residuals against fitted values.
abalone <- read.csv("abalone.data.txt", header=FALSE)
colnames(abalone)<-c("sex","length","diameter","height",
"whole_weight","shucked_weight","viscera_weight",
"shell_weight","rings")
g<-lm(rings~length,data=abalone)
summary(g)
##
## Call:
## lm(formula = rings ~ length, data = abalone)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9665 -1.6961 -0.7423 0.8733 16.6776
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1019 0.1855 11.33 <2e-16 ***
## length 14.9464 0.3452 43.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.679 on 4175 degrees of freedom
## Multiple R-squared: 0.3099, Adjusted R-squared: 0.3098
## F-statistic: 1875 on 1 and 4175 DF, p-value: < 2.2e-16
plot(fitted(g), resid(g), xlab = "fitted", ylab = "residuals")
abline(h=0,col="red")
From the original OLS regression model, we see that the length variable is significant, yet the \(R^2\) is low (0.3099). The plot of residuals versus fitted values makes clear that we have a heteroscedasticity problem, which violates a key OLS assumption.
One way to address this problem is weighted least squares regression; below, we use three different weighting methods in R to demonstrate how it works.
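In symbols, writing \(\hat{y}_i\) for the OLS fitted values and \(e_i\) for the OLS residuals, the three weights used below are

\[
w^{(1)}_i = \frac{1}{\hat{y}_i}, \qquad
w^{(2)}_i = \frac{1}{\widehat{|e_i|}^{\,2}}, \qquad
w^{(3)}_i = \frac{1}{\bigl|\widehat{e_i^2}\bigr|},
\]

where \(\widehat{|e_i|}\) and \(\widehat{e_i^2}\) denote fitted values from auxiliary regressions of \(|e_i|\) and \(e_i^2\) on length.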
1.
Here the weight is \(1/\hat{y}_i\), the inverse of the fitted values from the original OLS model.
abalone$yhat=fitted(g)
abalone$wt_1=1/abalone$yhat
g_wt1<-lm(rings~length,data=abalone, weights = wt_1)
summary(g_wt1)
##
## Call:
## lm(formula = rings ~ length, data = abalone, weights = wt_1)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.7477 -0.5495 -0.2372 0.2992 5.1862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7962 0.1577 11.39 <2e-16 ***
## length 15.5297 0.3068 50.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8363 on 4175 degrees of freedom
## Multiple R-squared: 0.3803, Adjusted R-squared: 0.3802
## F-statistic: 2562 on 1 and 4175 DF, p-value: < 2.2e-16
With this weighting, the parameter estimate for length increases from 14.9464 to 15.5297 and remains significant. The t-statistic increases as well, so the evidence for the predictor is even stronger.
2.
Here we regress the absolute residuals \(|e_i|\) from the original model on the predictor length, and use \(1/\text{fitted}^2\) from that auxiliary regression as the weight.
abalone$resid_abs=abs(resid(g))
fit2<-lm(resid_abs~length,data=abalone)
abalone$wt_2=1/fitted(fit2)^2
g_wt2<-lm(rings~length,data=abalone, weights = wt_2)
summary(g_wt2)
##
## Call:
## lm(formula = rings ~ length, data = abalone, weights = wt_2)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -2.6265 -0.8813 -0.3858 0.5080 8.2189
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.5641 0.1349 11.60 <2e-16 ***
## length 15.9966 0.2755 58.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.343 on 4175 degrees of freedom
## Multiple R-squared: 0.4467, Adjusted R-squared: 0.4465
## F-statistic: 3370 on 1 and 4175 DF, p-value: < 2.2e-16
Here the parameter estimate of length increases slightly again, to 15.9966, and remains significant.
3.
Here we regress the squared residuals \(e_i^2\) from the original model on the predictor length, and use \(1/|\text{fitted}|\) from that auxiliary regression as the weight.
abalone$resid_sq=(abalone$resid_abs)^2
fit3<-lm(resid_sq~length,data = abalone)
abalone$wt_3=abs(1/fitted(fit3))
g_wt3<-lm(rings~length,data = abalone,weights = wt_3)
summary(g_wt3)
##
## Call:
## lm(formula = rings ~ length, data = abalone, weights = wt_3)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -2.5320 -0.6439 -0.2828 0.3761 6.0253
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4688 0.1133 12.96 <2e-16 ***
## length 16.1564 0.2375 68.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9906 on 4175 degrees of freedom
## Multiple R-squared: 0.5256, Adjusted R-squared: 0.5255
## F-statistic: 4626 on 1 and 4175 DF, p-value: < 2.2e-16
Here the parameter estimate of length is now 16.1564, compared to 14.9464 from the original model, and remains significant.
Overall, then, when we encounter nonconstant residual variance, weighted least squares regression is an effective way to improve the parameter estimates. Note that weighting methods are not limited to the three demonstrated here.
In Python, the first step is to load the appropriate packages, read in the data, and fit a simple OLS model.
## Load Packages
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm
from statsmodels.graphics.gofplots import ProbPlot
## Load data
data = pd.read_csv('abalone.csv',na_values=['?'])
## selecting only variables we use
data = data.loc[:,['Length', 'Rings']]
## fit the OLS model and check the result
model_ols = smf.ols("Rings ~ Length", data=data).fit()
print(model_ols.summary())
Here the t-statistic is 43.303. However, diagnostics should be performed to confirm that this model is appropriate.
# fitted values from the OLS model
model_fitted_y = model_ols.fittedvalues
# model residuals
model_residuals = model_ols.resid
# normalized residuals
model_norm_residuals = model_ols.get_influence().resid_studentized_internal
# absolute residuals
model_abs_resid = np.abs(model_residuals)
plot_lm_1 = plt.figure(1)
plot_lm_1.set_figheight(8)
plot_lm_1.set_figwidth(12)
plot_lm_1.axes[0] = sns.residplot(x=model_fitted_y, y='Rings', data=data,
lowess=True,
scatter_kws={'alpha': 0.5},
line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plot_lm_1.axes[0].set_title('Residuals vs Fitted')
plot_lm_1.axes[0].set_xlabel('Fitted values')
plot_lm_1.axes[0].set_ylabel('Residuals')
# annotations
abs_resid = model_abs_resid.sort_values(ascending=False)
abs_resid_top_3 = abs_resid[:3]
for i in abs_resid_top_3.index:
plot_lm_1.axes[0].annotate(i,
xy=(model_fitted_y[i],
model_residuals[i]));
As shown in the Residuals vs Fitted plot, there is a megaphone shape, which indicates that non-constant variance is likely to be an issue. We will try to use weighted least squares to address this. We are going to use three different weights (based on the approaches suggested at the link above).
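As a supplement to the visual check (this test is not part of the original tutorial), a Breusch-Pagan test from statsmodels can confirm the heteroscedasticity formally; a minimal sketch:
## Optional: Breusch-Pagan test for heteroscedasticity
## (a small p-value rejects the hypothesis of constant error variance)
from statsmodels.stats.diagnostic import het_breuschpagan
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(model_ols.resid,
                                                    model_ols.model.exog)
print("Breusch-Pagan LM statistic: %.2f, p-value: %.3g" % (lm_stat, lm_pval))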
Weighted Least Squares
1.
The weight here is the inverse of the fitted values obtained from the OLS model.
## An intercept is not included by default, so we have to add it manually
y = data["Rings"].tolist()
x = data["Length"].tolist()
## add an intercept column
x = sm.add_constant(x)
## Compute the weight and add it to the column named "weight_1"
data["weight_1"] = model_fitted_y
data["weight_1"] = data["weight_1"]**-1
model_wls_1 = sm.WLS(y, x, weights=data['weight_1'])
mod_res_1 = model_wls_1.fit()
print(mod_res_1.summary())
After putting the weight in the model, the estimate of the coefficient changes slightly and the t-statistic increases to 50.621.
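As an optional sanity check (not in the original tutorial), we can plot the whitened residuals from the WLS fit, which statsmodels exposes as wresid (the residuals scaled by the square root of the weights); if the weights are appropriate, the megaphone shape should be much reduced:
## Optional: check whether the weighting stabilized the variance
plt.figure()
plt.scatter(mod_res_1.fittedvalues, mod_res_1.wresid, alpha=0.5)
plt.axhline(0, color='red', lw=1)
plt.xlabel('Fitted values')
plt.ylabel('Whitened residuals')
plt.title('WLS (weight_1): whitened residuals vs fitted')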
2.
Here we regress the absolute values of the residuals against the predictor (which is equivalent to regressing against the fitted values, since we have only one predictor). The fitted values from this regression estimate the standard deviation of the errors, so we use the inverse of their squares as the weights.
## Compute the weight and add it to the column named "weight_2"
data["temp"] = model_abs_resid
model_temp = smf.ols("temp ~ Length", data=data).fit()
weight_2 = model_temp.fittedvalues
weight_2 = weight_2**-2
data['weight_2'] = weight_2
mod_wls = sm.WLS(y, x, weights=data['weight_2'])
mod_res = mod_wls.fit()
print(mod_res.summary())
Now \(t = 58.054\), an increase compared to the previous model.
3.
Here we regress the squared residuals against the predictor (again equivalent to regressing against the fitted values). The fitted values from this regression estimate the error variances, so the inverse of their absolute values is used as the weight.
## Compute the weight and add it to the column named "weight_3"
data["temp"] = model_residuals**2
model_temp = smf.ols("temp ~ Length", data=data).fit()
weight_3 = model_temp.fittedvalues
weight_3 = abs(weight_3)
weight_3 = weight_3**-1
data['weight_3'] = weight_3
mod_wls = sm.WLS(y, x, weights=data['weight_3'])
mod_res = mod_wls.fit()
print(mod_res.summary())
Compared to the previous models, we now have the largest t-statistic, 68.013. The results suggest that weighted least squares is a good way to improve parameter estimates when we face a non-constant variance problem. There are many ways to choose the weights, and the best choice may differ from dataset to dataset.
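To wrap up, here is a short consolidation sketch (assuming data, x, and y from above are still in scope) that refits each model and tabulates the Length coefficient and t-statistic side by side:
## Compare the Length coefficient and t-statistic across weighting schemes
results = {"OLS": sm.OLS(y, x).fit()}
for col in ["weight_1", "weight_2", "weight_3"]:
    results[col] = sm.WLS(y, x, weights=data[col]).fit()
comparison = pd.DataFrame({name: {"coef": res.params[1], "t": res.tvalues[1]}
                           for name, res in results.items()}).T
print(comparison)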