Robust Regression

What is Robust Regression

Linear regression models assume that the errors are normally distributed. If the distribution of errors is prone to outliers, the results of ordinary least squares regression become unreliable. By assigning a weight to each observation, robust fitting is less sensitive to outliers than ordinary least squares. Weighting is done automatically and iteratively using a process called iteratively reweighted least squares (IRLS). In the first iteration, each point is assigned equal weight and the model coefficients are estimated using ordinary least squares. At each subsequent iteration, the weights are recomputed so that points farther from the previous iteration's predictions, that is, points with larger residuals, receive lower weight, and the model coefficients are re-estimated using weighted least squares. The process continues until the coefficient estimates converge within a specified tolerance.
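
To make the procedure concrete, here is a minimal sketch of IRLS with Huber weights in R, run on simulated data. It is purely illustrative: the simulated variables, the tuning constant, and the simple MAD-based scale estimate are our own choices, and the built-in routines used later on this page (Stata's rreg, R's rlm, MATLAB's robustfit) handle scale estimation, tuning constants, and convergence checking more carefully.

# Minimal illustration of iteratively reweighted least squares (IRLS) with Huber weights.
# Simulated data; not the exact algorithm used by rreg, rlm, or robustfit.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
y[1:5] <- y[1:5] + 25                              # inject a few outliers

X <- cbind(1, x)                                   # design matrix with intercept
w <- rep(1, n)                                     # iteration 1: equal weights (plain OLS)
k <- 1.345                                         # Huber tuning constant (in units of sigma)

for (it in 1:50) {
  b_old <- if (it == 1) rep(0, ncol(X)) else b
  b <- solve(t(X) %*% (w * X), t(X) %*% (w * y))   # weighted least squares
  r <- as.vector(y - X %*% b)                      # residuals from the current fit
  s <- median(abs(r)) / 0.6745                     # simple robust (MAD-type) scale estimate
  u <- r / s                                       # standardized residuals
  w <- ifelse(abs(u) <= k, 1, k / abs(u))          # recompute Huber weights
  if (max(abs(b - b_old)) < 1e-8) break            # stop when coefficients converge
}
b                                                  # should be close to the true values 2 and 3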

Dataset: Employee Compensation in San Francisco

For our data analysis we will use the Employee Compensation in San Francisco data set. The data are maintained by the San Francisco Controller's Office and record the salary and benefits paid to City employees for fiscal years 2013 to 2017. The full data set can be found on the San Francisco Dataset page. We will focus on the variables Total Benefits (totalbenefits), Salaries (salaries), and Overtime (overtime) for fiscal year 2014, all in U.S. dollars. This subset of the data has 40,868 observations, and the Employee Identifier uniquely identifies each observation within a single year. We will use Salaries and Overtime to predict Total Benefits.

Stata

Version info: Code for this page was tested in Stata 14.2

Stata’s rreg command first runs the OLS regression, computes Cook’s distance for each observation, and drops any observation with Cook’s distance greater than 1. The iteration process starts with weights calculated from the absolute residuals and stops when the maximum change in weights from one iteration to the next falls below the tolerance. Two types of weights are used in rreg: Huber weights and biweights (Tukey bisquare weights). With Huber weighting, observations with small residuals receive a weight of 1, and the weight decreases as the residual grows. With biweighting, every case with a non-zero residual is down-weighted at least a little. Used together, the most influential points are dropped first, and then cases with large absolute residuals are down-weighted.
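
For reference, the two weight functions are commonly written in terms of the standardized residual \(u\) (the residual divided by a robust estimate of scale) and a tuning constant \(k\). The constants quoted below are the conventional defaults used by MASS's rlm in R; Stata's rreg has its own tuning defaults, so treat these values as illustrative:

\[
w_{\text{Huber}}(u) = \begin{cases} 1 & \text{if } |u| \le k \\ k/|u| & \text{if } |u| > k \end{cases}
\qquad
w_{\text{bisquare}}(u) = \begin{cases} \left[\,1 - (u/k)^2\,\right]^2 & \text{if } |u| \le k \\ 0 & \text{if } |u| > k \end{cases}
\]

with \(k \approx 1.345\) for Huber weighting and \(k \approx 4.685\) for bisquare weighting.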

Description of Data

We can view our variables of interest with summarize. Note that some values of salaries, overtime, and total benefits are negative.

*import delimited sfsalary.csv
*save sfsalary.dta

use sfsalary.dta, clear

summarize salaries overtime totalbenefits

OLS Regression

We start with OLS regression and some diagnostics.

regress totalbenefits salaries overtime

We can use Cook's distance to help identify high-leverage points. Using the predict command with the cooksd option, we create a new variable called c_dist1 containing the Cook's distance values. A conventional cut-off point is 4/n, where n is the number of observations; here n = 40,868, so the cut-off is 4/40,868 ≈ 0.000098, and we use it to select which values to display. The gsort command then sorts in descending order of Cook's distance so we can see the points with the largest values. If any point had a Cook's distance greater than 1, robust regression with rreg would drop it; in this data set no point exceeds 1.

predict c_dist1, cooksd
gsort -c_dist1
clist employeeidentifier totalbenefits salaries overtime c_dist1 if c_dist1 > 4/40868

Again, we can use the predict command to look at the residuals and generate the variable absr1 holding the absolute value of the standardized residuals. The gsort command then sorts in descending order so we can see which employees have the largest residuals; these will be weighted less in our robust regression.

predict r1, rstandard
gen absr1 = abs(r1)
gsort -absr1
clist employeeidentifier absr1 in 1/10, noobs

Robust Regression

Now we run the robust regression and see how the results compare. We use rreg and generate a new variable to store the final weights. The iteration history for both types of weights, Huber and biweight, appears at the top of the robust regression output.

rreg totalbenefits salaries overtime, gen(weight)

Comparing the OLS regression and robust regression models, we can see that the results differ; notably, the coefficients have changed. You will also notice that rreg reports no R-squared, adjusted R-squared, or root MSE. However, we can easily show the R-squared with display, and we see that it has increased to 0.905 in the robust regression.

display "R2 = " e(r2)

Now we can look at observations with relatively small weights and those with relatively large weights using sort and gsort.

sort weight
clist employeeidentifier weight absr1 c_dist1 in 1/10, noobs
gsort -weight
clist employeeidentifier weight absr1 c_dist1 in 1/10, noobs

For the most part, as the residuals go down, the weights go up; that is, cases with larger residuals are down-weighted more in the robust regression. In OLS every case has a weight of 1, so the more cases in the robust regression that keep a weight of 1, the closer the robust results are to the OLS results. We can see this above by looking at the observations with the highest weights and those with the lowest.

We can do some post-estimation after rreg with the margins command. For example, we can get predicted values over a set of values of salaries, holding overtime at its mean.

margins, at(salaries=(0(50000)533985)) vsquish

Here we can see, in increments of $50,000, that as salaries increases, the predicted total benefits also increases.

R

There are a few key libraries to install and load for our regression, particularly foreign and MASS. Once installed with install.packages("packagename"), they can easily be loaded using require() or library().

require(foreign)
require(MASS)
library(dplyr)
library(data.table)
library(tidyverse)
library(faraway)

Description of data

After subsetting our data to select just the year 2014, we can view the summaries of our variables of interest using summary.

rawG0 <- read.csv("/Users/maraudersmap/Documents/Employee_Compensation.csv")
rawG1 <- data.frame(rawG0)
# Keep only the columns we need: year, employee identifier, salaries, overtime, total benefits
rawG2 <- cbind(rawG1$Year, rawG1$Employee.Identifier, rawG1$Salaries, rawG1$Overtime, rawG1$Total.Benefits)

colnames(rawG2) <- c("year", "Employee ID", "Salaries", "Overtime", "Total_Benefit")
rawG3 <- data.frame(rawG2)   # data.frame() converts "Employee ID" to the valid name "Employee.ID"
# Subset to fiscal year 2014
rawG4 <- rawG3[rawG3$year == "2014", ]

summary(rawG4)
##       year       Employee.ID       Salaries         Overtime     
##  Min.   :2014   Min.   :    1   Min.   :-68772   Min.   :-12309  
##  1st Qu.:2014   1st Qu.:14250   1st Qu.: 21967   1st Qu.:     0  
##  Median :2014   Median :28448   Median : 61482   Median :     0  
##  Mean   :2014   Mean   :28519   Mean   : 61914   Mean   :  4272  
##  3rd Qu.:2014   3rd Qu.:42897   3rd Qu.: 91853   3rd Qu.:  2434  
##  Max.   :2014   Max.   :56987   Max.   :308633   Max.   :172612  
##  Total_Benefit   
##  Min.   :-19814  
##  1st Qu.:  9074  
##  Median : 30327  
##  Mean   : 26395  
##  3rd Qu.: 38855  
##  Max.   : 97107

OLS regression analysis

We begin with OLS regression and some diagnostics to view the general relationship in our data. The lm function fits our OLS regression.

OLS_reg <- lm(Total_Benefit ~ Salaries+Overtime, data = rawG4)
summary(OLS_reg)
## 
## Call:
## lm(formula = Total_Benefit ~ Salaries + Overtime, data = rawG4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -53669  -3931    464   4201  39403 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.115e+03  4.881e+01   84.31   <2e-16 ***
## Salaries    3.558e-01  6.670e-04  533.42   <2e-16 ***
## Overtime    5.927e-02  2.701e-03   21.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5720 on 40865 degrees of freedom
## Multiple R-squared:  0.8849, Adjusted R-squared:  0.8848 
## F-statistic: 1.57e+05 on 2 and 40865 DF,  p-value: < 2.2e-16

Plotting the regression object gives us four diagnostic plots for examining residuals, possible outliers, and leverage points.

par(mfrow = c(2,2), oma = c(0, 0, 1, 0))

plot(OLS_reg)

From these plots, we can identify several observations that may be problematic for our model. We can use cooks.distance to see which observations have relatively large Cook's distances and could be high-leverage points. Here we also calculate the standardized residuals using stdres from MASS. The cut-off point typically used is \(\frac{4}{n}\), where n is the number of observations; for us n = 40,868, so the cut-off is about 0.000098, and we can view which Cook's distances exceed it.

d1 <- cooks.distance(OLS_reg)
r <- stdres(OLS_reg)
a <- cbind(rawG4, d1, r)
head(a[d1 > 4/40868, ])
##     year Employee.ID  Salaries Overtime Total_Benefit           d1
## 66  2014       38185  51537.90 31739.97      33068.00 0.0001575752
## 149 2014       12337  65524.98 23116.82      39764.46 0.0001247798
## 260 2014       52668 133687.85 27563.62      42442.62 0.0001997214
## 351 2014       38853 112703.72 51838.63      42482.38 0.0001157340
## 443 2014        1514  65035.81 23916.87      38771.09 0.0001131363
## 444 2014       55058 178922.62     0.00      61164.32 0.0001020372
##              r
## 66   1.5275071
## 149  1.9175602
## 260 -1.9000409
## 351 -0.8395243
## 443  1.7660279
## 444 -1.1548373

Using our calculated residuals above, we can find the absolute residuals and sort them, viewing the top 10 values with the highest residuals. These should be weighted differently when we do our robust regression.

rabs <- abs(r)
a <- cbind(rawG4, d1, r, rabs)
asorted <- a[order(-rabs), ]
asorted[1:10, ]
##        year Employee.ID Salaries Overtime Total_Benefit          d1
## 128897 2014       43131 146508.0  6834.03       2973.59 0.003428444
## 191970 2014       36865 134173.9     0.00        244.11 0.002948959
## 197853 2014       28348 137301.9 14225.79       2648.15 0.002709523
## 12328  2014       25111 132799.8   122.03       2627.55 0.002549898
## 200734 2014       12583 131403.1     0.00       2308.82 0.002473906
## 96190  2014        1782 130867.3     0.00       2181.83 0.002441626
## 117967 2014       51860 129589.1  2570.50       2298.57 0.002137925
## 175562 2014       55763 129378.2   174.04       2251.88 0.002297903
## 163288 2014        2953 173870.2   647.53      18265.61 0.004874941
## 129400 2014       36396 175221.8     0.00      18863.73 0.005018636
##                r     rabs
## 128897 -9.383179 9.383179
## 191970 -9.022357 9.022357
## 197853 -8.943982 8.943982
## 12328  -8.521434 8.521434
## 200734 -8.489013 8.489013
## 96190  -8.477879 8.477879
## 117967 -8.404555 8.404555
## 175562 -8.374798 8.374798
## 163288 -8.347814 8.347814
## 129400 -8.320630 8.320630

Robust Regression

The rlm function in the MASS package performs robust regression and accepts different weighting functions as an argument.

Huber Weights

First we will use Huber weights, which are based on the size of the residual: observations with larger residuals receive smaller weights, and observations with smaller residuals receive weights close to 1 (the weight given to every observation in OLS).

summary(rr.huber <- rlm(Total_Benefit ~ Salaries+Overtime, data = rawG4))
## 
## Call: rlm(formula = Total_Benefit ~ Salaries + Overtime, data = rawG4)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -54850.70  -3624.17     39.03   3849.59  39603.95 
## 
## Coefficients:
##             Value     Std. Error t value  
## (Intercept) 3685.1676   45.1237    81.6681
## Salaries       0.3678    0.0006   596.5197
## Overtime       0.0374    0.0025    14.9704
## 
## Residual standard error: 5460 on 40865 degrees of freedom
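
The fit above uses rlm's default Huber weighting. If you want to make the weighting explicit or adjust the tuning constant, a call along the lines of the following sketch should be equivalent (k = 1.345 is the documented default for MASS's psi.huber, and the object name rr.huber2 is just for illustration):

# Equivalent fit with the Huber psi function and its tuning constant written out explicitly
rr.huber2 <- rlm(Total_Benefit ~ Salaries + Overtime, data = rawG4,
                 psi = psi.huber, k = 1.345)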

We also create a data frame holding the weight of each observation. We can then see that, typically, as the absolute residual goes down, the weight goes up. The more cases that are weighted at 1, the closer our robust regression is to the OLS regression.

hweights <- data.frame(Employee_ID = rawG4$Employee.ID, resid = rr.huber$resid, weight = rr.huber$w)
hweights2 <- hweights[order(rr.huber$w), ]
hweights2[1:10, ]
##        Employee_ID     resid    weight
## 128897       43131 -54850.70 0.1338990
## 191970       36865 -52788.44 0.1391299
## 197853       28348 -52066.53 0.1410588
## 12328        25111 -49904.20 0.1471711
## 200734       12583 -49704.70 0.1477617
## 96190         1782 -49634.60 0.1479704
## 163288        2953 -49390.95 0.1487015
## 129400       36396 -49265.72 0.1490795
## 117967       51860 -49143.83 0.1494480
## 175562       55763 -49023.38 0.1498153

Bisquare weighting function

We can run the robust regression again with rlm, this time using bisquare weighting.

rr.bisquare <- rlm(Total_Benefit ~ Salaries+Overtime, data = rawG4, psi = psi.bisquare)
summary(rr.bisquare)
## 
## Call: rlm(formula = Total_Benefit ~ Salaries + Overtime, data = rawG4, 
##     psi = psi.bisquare)
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -55296.74  -3483.18    -86.47   3735.59  39705.63 
## 
## Coefficients:
##             Value     Std. Error t value  
## (Intercept) 3495.3639   45.6423    76.5817
## Salaries       0.3724    0.0006   597.1466
## Overtime       0.0314    0.0025    12.4411
## 
## Residual standard error: 5239 on 40865 degrees of freedom

Again we can look at the weights and see that as the absolute residual increases, the weight decreases. Note that the lowest-weighted observations differ between the bisquare weighting and the Huber weighting methods.

biweights <- data.frame(Employee_ID = rawG4$Employee.ID, resid = rr.bisquare$resid, weight = rr.bisquare$w)
biweights2 <- biweights[order(rr.bisquare$w), ]
(biweights2[1:10, ])
##       Employee_ID     resid weight
## 723         53309 -37576.00      0
## 4766        18937 -26629.01      0
## 6268        35553 -31513.25      0
## 7045        25271  38758.98      0
## 7105        35263 -30621.00      0
## 7267        17896 -26817.50      0
## 7368        50656 -28855.09      0
## 7810         9623 -24855.96      0
## 10373        2244 -36244.98      0
## 10866       49877 -29452.24      0
(biweights2[40000:40005, ])
##        Employee_ID     resid    weight
## 92351        34735 -208.8075 0.9998550
## 123472       37070  208.3145 0.9998551
## 204843       13671  208.2139 0.9998553
## 192906       24186 -208.5813 0.9998558
## 109647       56953  207.3208 0.9998564
## 70853        32550  207.3303 0.9998564

In both of our robust regressions we can see a change in the coefficients from the OLS regression.
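
To make the comparison concrete, a small convenience snippet (not part of the original analysis, and assuming the objects OLS_reg, rr.huber, and rr.bisquare created above are still in the workspace) lines up the coefficient estimates from the three fits:

# Side-by-side coefficient estimates from the OLS, Huber, and bisquare fits
round(cbind(OLS = coef(OLS_reg),
            Huber = coef(rr.huber),
            Bisquare = coef(rr.bisquare)), 4)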

MATLAB

MATLAB is not primarily known for its statistical applications, but it is still powerful enough to perform basic statistical regressions.

This script shows how to do robust regression in MATLAB (version R2015b), following the help documentation for the robustfit function.

Robustfit

In MATLAB, B = robustfit(X,Y) returns the vector B of regression coefficients, obtained by performing robust regression to estimate the linear model Y = Xb. X is an n-by-p matrix of predictor variables, and Y is an n-by-1 vector of observations. The algorithm uses iteratively reweighted least squares with the bisquare weighting function. By default, robustfit adds a column of ones to X, corresponding to a constant term in the first element of B. Do not enter a column of ones directly into the X matrix.

  • [B,STATS] = robustfit(…) also returns a STATS structure containing the following fields:
    • 'ols_s': sigma estimate (RMSE) from the least-squares fit
    • 'robust_s': robust estimate of sigma
    • 'mad_s': MAD estimate of sigma, used for scaling residuals during the iterative fitting
    • 's': final estimate of sigma, the larger of robust_s and a weighted average of ols_s and robust_s
    • 'se': standard errors of the coefficient estimates
    • 't': ratio of B to stats.se
    • 'p': p-values for stats.t
    • 'covb': estimated covariance matrix for the coefficient estimates
    • 'coeffcorr': estimated correlation of the coefficient estimates
    • 'w': vector of weights for the robust fit
    • 'h': vector of leverage values for the least-squares fit
    • 'dfe': degrees of freedom for error
    • 'R': R factor in the QR decomposition of the X matrix

Employee Compensation Example

Load Data

We use the “Import Data” option in the toolstrip to load the data; it is convenient to import selected columns as column vectors. Here we import Year, Overtime, Salaries1, and TotalBenefits into our workspace and select the 2014 data.

Overtime = Overtime(Year==2014);
Salaries1 = Salaries1(Year==2014);
TotalBenefits = TotalBenefits(Year==2014);
x = [Overtime,Salaries1];
y = TotalBenefits;

Let’s compare the result of robust fit to a standard least squares fit.

[bls,STATSls] = robustfit(x,y,'ols'); %least-squares
[brob,STATSrob] = robustfit(x,y); %robust

OLS Regression

First we look at the OLS regression results using STATSls and bls, along with some basic diagnostics, to get a view of the relationship in the data.

display(STATSls);
display(bls);
STATSls =

        ols_s: 5.7200e+03
     robust_s: 5.7200e+03
        mad_s: 5.9720e+03
            s: 5.7200e+03
        resid: [40868x1 double]
        rstud: [40868x1 double]
           se: [3x1 double]
         covb: [3x3 double]
    coeffcorr: [3x3 double]
            t: [3x1 double]
            p: [3x1 double]
            w: [40868x1 double]
           Qy: [3x1 double]
            R: [3x3 double]
          dfe: 40865
            h: [40868x1 double]
         Rtol: 1.3939e-04

bls =

   1.0e+03 *

    4.1151
    0.0001
    0.0004

We can check any of the fields in the STATS structure using dot notation (STATSls.fieldname). For example, we can see the p-values of the predictors using .p

STATSls.p
ans =

  1.0e-105 *

         0
    0.3970
         0

We can also view how each observation is weighted in OLS regression.

hist(STATSls.w);

All the observations are weighted at 1 in OLS regression. Our Robust regression will weight them differently depending on the residuals.

For some quick diagnostics, we can do a QQ plot to check the normality.

qqplot(STATSls.resid)

From the QQ plot, we can clearly see that the errors are not normally distributed, so there may be issues with our OLS regression on this data.

Robust Regression

We can examine our robust regression using STATSrob and brob. The coefficients change a little compared with the OLS fit.

display(STATSrob);
display(brob);
STATSrob =

        ols_s: 5.7200e+03
     robust_s: 5.3485e+03
        mad_s: 5.2382e+03
            s: 5.3486e+03
        resid: [40868x1 double]
        rstud: [40868x1 double]
           se: [3x1 double]
         covb: [3x3 double]
    coeffcorr: [3x3 double]
            t: [3x1 double]
            p: [3x1 double]
            w: [40868x1 double]
           Qy: [3x1 double]
            R: [3x3 double]
          dfe: 40865
            h: [40868x1 double]
         Rtol: 1.3939e-04

brob =

   1.0e+03 *

    3.4950
    0.0000
    0.0004

We can also view how each observation is weighted using a histogram of the weights.

histogram(STATSrob.w);

As noted, not all the observations are weighted the same in our robust regression. In fact, the weights of the points with the largest residuals are reduced to zero.

STATSrob.w(find(STATSls.resid==max(STATSls.resid)))
ans =

     0