Linear regression models assume that the errors are normally distributed. If the error distribution is prone to outliers, the results of linear regression become unreliable. By assigning a weight to each observation, the robust fitting method is less sensitive to outliers than ordinary least squares. Weighting is done automatically and iteratively using a process called iteratively reweighted least squares (IRLS). In the first iteration, each point is assigned equal weight and the model coefficients are estimated using ordinary least squares. At subsequent iterations, weights are recomputed so that points farther from the previous iteration's predictions, that is, points with larger residuals, are given lower weight, and the model coefficients are re-estimated using weighted least squares. The process continues until the values of the coefficient estimates converge within a specified tolerance.
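Schematically, each IRLS step solves a weighted least squares problem. Writing \(r_i^{(t)} = y_i - x_i^\top \hat\beta^{(t)}\) for the residual of observation \(i\) at iteration \(t\) and \(w(\cdot)\) for the chosen weight function, the update is

\[
\hat\beta^{(t+1)} = \arg\min_{\beta} \sum_{i=1}^{n} w\!\left(r_i^{(t)}\right) \left(y_i - x_i^\top \beta\right)^2 .
\]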
For our data analysis we will use the Employee Compensation in San Francisco data set. This data is maintained by the San Francisco government, specifically the San Francisco Controller's Office, and is a database of the salary and benefits paid to City employees for fiscal years 2013 to 2017. The full data set can be found on the San Francisco Dataset page. We will be focusing on the variables for Total Benefits (totalbenefits), Salaries (salaries), and Overtime (overtime) for the fiscal year 2014, all in U.S. dollars. This subset of the data has 40,868 observations, and we can use each Employee Identifier to identify a unique observation within a single year. We will use Salaries and Overtime to predict Total Benefits.
Version info: Code for this page was tested in Stata 14.2
Stata's rreg command first runs the OLS regression, gets the Cook's Distance for each observation, and then drops any observations with Cook's Distance greater than 1. The iteration process starts with weights calculated from the absolute residuals and stops when the maximum change in the weights from one iteration to the next falls below the tolerance. Two types of weighting are used in rreg: Huber weighting and biweighting. Huber weighting gives observations with small residuals a weight of 1, and weights observations less as their residuals grow larger. In biweighting, all cases with a non-zero residual are down-weighted at least a little. Using both, the most influential points are dropped first, and then cases with large absolute residuals are down-weighted.
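For reference, the two weight functions have the following standard forms, where \(r\) is a scaled residual and \(c\) is a tuning constant whose default value differs across implementations:

\[
w_{\text{Huber}}(r) =
\begin{cases}
1 & \text{if } |r| \le c \\
c/|r| & \text{if } |r| > c
\end{cases}
\qquad
w_{\text{biweight}}(r) =
\begin{cases}
\left[\,1 - (r/c)^2\,\right]^2 & \text{if } |r| \le c \\
0 & \text{if } |r| > c
\end{cases}
\]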
We can view our variables of interest with summarize. Note here we have negative values for salaries, overtime, and total benefits.
*import delimited sfsalary.csv
*save sfsalary.dta
use sfsalary.dta, clear
summarize salaries overtime totalbenefits
We start with OLS regression and some diagnostics.
regress totalbenefits salaries overtime
We can use Cook's Distance to help identify leverage points. Using the predict command with the cooksd option, we create a new variable called c_dist1 containing the values of Cook's Distance. A conventional cut-off point is 4/n, where n is the number of observations in the data set; here n = 40,868, and we use this to select which values to display. The gsort command then sorts in descending order of Cook's Distance so we can see the points with the largest values. If we had points with Cook's Distance greater than 1, our robust regression using rreg would drop them. For this data set we do not have any points with Cook's Distance greater than 1.
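As a reminder, Cook's Distance for observation \(i\) measures how much all fitted values change when that observation is deleted; with \(p\) estimated coefficients and mean squared error \(s^2\), one standard form is

\[
D_i = \frac{\sum_{j=1}^{n}\left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p\, s^2},
\]

where \(\hat{y}_{j(i)}\) is the fitted value for observation \(j\) when observation \(i\) is excluded from the fit.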
predict c_dist1, cooksd
gsort -c_dist1
clist employeeidentifier totalbenefits salaries overtime c_dist1 if c_dist1 > 4/40868
Again, we can use the predict command to look at the residuals, generating the variable absr1 for the absolute value of the standardized residuals. The gsort command then sorts these in descending order so we can see which employees have the largest residuals; these will be weighted less in our robust regression.
predict r1, rstandard
gen absr1 = abs(r1)
gsort -absr1
clist employeeidentifier absr1 in 1/10, noobs
Now we will run our robust regression and see how the two models compare. We will use rreg and generate a new variable to store the final weights. You can see the iteration history of both types of weights, Huber and biweight, at the top of the robust regression output.
rreg totalbenefits salaries overtime, gen(weight)
Comparing the OLS regression and robust regression models, we can see that the results differ; notably, the coefficients have changed. You will also notice that there is no R-squared, adjusted R-squared, or root MSE in the rreg output. However, we can easily compute the R-squared with display, and notice that its value has increased to 0.905 in our robust regression.
display "R2 = " e(r2)
Now we can look at observations with relatively small weights and those with relatively large weights using sort and gsort.
sort weight
clist employeeidentifier weight absr1 c_dist1 in 1/10, noobs
gsort -weight
clist employeeidentifier weight absr1 c_dist1 in 1/10, noobs
For the most part, as the residuals go down, the weights increase. That is, observations with larger residuals are down-weighted more in our robust regression. In OLS, all cases have a weight of 1. The more cases in the robust regression that have a weight of 1, the closer the results of OLS and robust regression are. We can see this above by looking at both the observations with the highest weights and those with the lowest.
We can do some post-estimation after rreg with the margins command. For example, we can get predicted values over a set of values of the variable salaries, holding overtime at its mean.
margins, at(salaries=(0(50000)533985)) vsquish
Here we can see, in increments of $50,000, that as salaries increase, the predicted total benefits also increase.
There are a few key libraries to install and load for our regression, particularly foreign and MASS. Once installed with install.packages("packagename"), they can easily be loaded using require() or library().
require(foreign)
require(MASS)
library(dplyr)
library(data.table)
library(tidyverse)
library(faraway)
After subsetting our data to select just the year 2014, we can view the summaries of our variables of interest using summary.
rawG0 <- read.csv("/Users/maraudersmap/Documents/Employee_Compensation.csv")
rawG1 <- data.frame(rawG0)
# keep only the year, employee ID, and the three variables of interest
rawG2 <- cbind(rawG1$Year, rawG1$Employee.Identifier, rawG1$Salaries, rawG1$Overtime, rawG1$Total.Benefits)
colnames(rawG2) <- c("year", "Employee.ID", "Salaries", "Overtime", "Total_Benefit")
rawG3 <- data.frame(rawG2)
# subset to fiscal year 2014
rawG4 <- rawG3[rawG3$year == 2014, ]
summary(rawG4)
## year Employee.ID Salaries Overtime
## Min. :2014 Min. : 1 Min. :-68772 Min. :-12309
## 1st Qu.:2014 1st Qu.:14250 1st Qu.: 21967 1st Qu.: 0
## Median :2014 Median :28448 Median : 61482 Median : 0
## Mean :2014 Mean :28519 Mean : 61914 Mean : 4272
## 3rd Qu.:2014 3rd Qu.:42897 3rd Qu.: 91853 3rd Qu.: 2434
## Max. :2014 Max. :56987 Max. :308633 Max. :172612
## Total_Benefit
## Min. :-19814
## 1st Qu.: 9074
## Median : 30327
## Mean : 26395
## 3rd Qu.: 38855
## Max. : 97107
We begin with OLS regression and some diagnostics to view the general relationship in our data. The lm function fits our OLS regression.
OLS_reg <- lm(Total_Benefit ~ Salaries+Overtime, data = rawG4)
summary(OLS_reg)
##
## Call:
## lm(formula = Total_Benefit ~ Salaries + Overtime, data = rawG4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53669 -3931 464 4201 39403
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.115e+03 4.881e+01 84.31 <2e-16 ***
## Salaries 3.558e-01 6.670e-04 533.42 <2e-16 ***
## Overtime 5.927e-02 2.701e-03 21.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5720 on 40865 degrees of freedom
## Multiple R-squared: 0.8849, Adjusted R-squared: 0.8848
## F-statistic: 1.57e+05 on 2 and 40865 DF, p-value: < 2.2e-16
Plotting the fitted model gives us four diagnostic plots for examining the residuals, possible outliers, and leverage points.
par(mfrow = c(2,2), oma = c(0, 0, 1, 0))
plot(OLS_reg)
From these plots, we can identify several observations that may be problematic for our model. We can look at our Cook's Distances using cooks.distance to see which values are relatively large and could be high leverage points. Here we also calculate our standardized residuals using stdres from the MASS package. Typically the cut-off point used is \(\frac{4}{n}\), where n is the number of observations; for us n is 40,868, and we can view which Cook's Distances are above this point.
d1 <- cooks.distance(OLS_reg)
r <- stdres(OLS_reg)
a <- cbind(rawG4, d1, r)
head(a[d1 > 4/40868, ])
## year Employee.ID Salaries Overtime Total_Benefit d1
## 66 2014 38185 51537.90 31739.97 33068.00 0.0001575752
## 149 2014 12337 65524.98 23116.82 39764.46 0.0001247798
## 260 2014 52668 133687.85 27563.62 42442.62 0.0001997214
## 351 2014 38853 112703.72 51838.63 42482.38 0.0001157340
## 443 2014 1514 65035.81 23916.87 38771.09 0.0001131363
## 444 2014 55058 178922.62 0.00 61164.32 0.0001020372
## r
## 66 1.5275071
## 149 1.9175602
## 260 -1.9000409
## 351 -0.8395243
## 443 1.7660279
## 444 -1.1548373
Using our calculated residuals above, we can find the absolute residuals and sort them, viewing the top 10 values with the highest residuals. These should be weighted differently when we do our robust regression.
rabs <- abs(r)
a <- cbind(rawG4, d1, r, rabs)
asorted <- a[order(-rabs), ]
asorted[1:10, ]
## year Employee.ID Salaries Overtime Total_Benefit d1
## 128897 2014 43131 146508.0 6834.03 2973.59 0.003428444
## 191970 2014 36865 134173.9 0.00 244.11 0.002948959
## 197853 2014 28348 137301.9 14225.79 2648.15 0.002709523
## 12328 2014 25111 132799.8 122.03 2627.55 0.002549898
## 200734 2014 12583 131403.1 0.00 2308.82 0.002473906
## 96190 2014 1782 130867.3 0.00 2181.83 0.002441626
## 117967 2014 51860 129589.1 2570.50 2298.57 0.002137925
## 175562 2014 55763 129378.2 174.04 2251.88 0.002297903
## 163288 2014 2953 173870.2 647.53 18265.61 0.004874941
## 129400 2014 36396 175221.8 0.00 18863.73 0.005018636
## r rabs
## 128897 -9.383179 9.383179
## 191970 -9.022357 9.022357
## 197853 -8.943982 8.943982
## 12328 -8.521434 8.521434
## 200734 -8.489013 8.489013
## 96190 -8.477879 8.477879
## 117967 -8.404555 8.404555
## 175562 -8.374798 8.374798
## 163288 -8.347814 8.347814
## 129400 -8.320630 8.320630
The rlm function in the MASS package performs robust regression and can take different weighting methods as an argument.
First we will use Huber weights, which weight observations based on the size of the residual. Observations with larger residuals are weighted less, and those with smaller residuals are weighted closer to 1 (as all observations are in OLS).
summary(rr.huber <- rlm(Total_Benefit ~ Salaries+Overtime, data = rawG4))
##
## Call: rlm(formula = Total_Benefit ~ Salaries + Overtime, data = rawG4)
## Residuals:
## Min 1Q Median 3Q Max
## -54850.70 -3624.17 39.03 3849.59 39603.95
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 3685.1676 45.1237 81.6681
## Salaries 0.3678 0.0006 596.5197
## Overtime 0.0374 0.0025 14.9704
##
## Residual standard error: 5460 on 40865 degrees of freedom
We also create a new variable for the weight of each observation. We can then see that, typically, as the absolute residual goes down, the weight goes up. The more cases that are weighted at 1, the closer our robust regression is to the OLS regression.
hweights <- data.frame(Employee_ID = rawG4$Employee.ID, resid = rr.huber$resid, weight = rr.huber$w)
hweights2 <- hweights[order(rr.huber$w), ]
hweights2[1:10, ]
## Employee_ID resid weight
## 128897 43131 -54850.70 0.1338990
## 191970 36865 -52788.44 0.1391299
## 197853 28348 -52066.53 0.1410588
## 12328 25111 -49904.20 0.1471711
## 200734 12583 -49704.70 0.1477617
## 96190 1782 -49634.60 0.1479704
## 163288 2953 -49390.95 0.1487015
## 129400 36396 -49265.72 0.1490795
## 117967 51860 -49143.83 0.1494480
## 175562 55763 -49023.38 0.1498153
We can run the robust regression with rlm again, this time using bisquare weighting.
rr.bisquare <- rlm(Total_Benefit ~ Salaries+Overtime, data = rawG4, psi = psi.bisquare)
summary(rr.bisquare)
##
## Call: rlm(formula = Total_Benefit ~ Salaries + Overtime, data = rawG4,
## psi = psi.bisquare)
## Residuals:
## Min 1Q Median 3Q Max
## -55296.74 -3483.18 -86.47 3735.59 39705.63
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 3495.3639 45.6423 76.5817
## Salaries 0.3724 0.0006 597.1466
## Overtime 0.0314 0.0025 12.4411
##
## Residual standard error: 5239 on 40865 degrees of freedom
Again we can look at the weights and see that as the absolute residual increases, the weight decreases. We can also see that our lowest-weighted observations here, using bisquare weighting, differ from those under the Huber weighting method.
biweights <- data.frame(Employee_ID = rawG4$Employee.ID, resid = rr.bisquare$resid, weight = rr.bisquare$w)
biweights2 <- biweights[order(rr.bisquare$w), ]
(biweights2[1:10, ])
## Employee_ID resid weight
## 723 53309 -37576.00 0
## 4766 18937 -26629.01 0
## 6268 35553 -31513.25 0
## 7045 25271 38758.98 0
## 7105 35263 -30621.00 0
## 7267 17896 -26817.50 0
## 7368 50656 -28855.09 0
## 7810 9623 -24855.96 0
## 10373 2244 -36244.98 0
## 10866 49877 -29452.24 0
(biweights2[40000:40005, ])
## Employee_ID resid weight
## 92351 34735 -208.8075 0.9998550
## 123472 37070 208.3145 0.9998551
## 204843 13671 208.2139 0.9998553
## 192906 24186 -208.5813 0.9998558
## 109647 56953 207.3208 0.9998564
## 70853 32550 207.3303 0.9998564
In both of our robust regressions we can see a change in the coefficients from the OLS regression; a quick side-by-side comparison is sketched below.
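As a minimal sketch using the model objects fitted above, we can place the three sets of coefficient estimates side by side with coef():

coef_compare <- cbind(OLS = coef(OLS_reg),
                      Huber = coef(rr.huber),
                      Bisquare = coef(rr.bisquare))
round(coef_compare, 4)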
MATLAB is not best known for its statistical applications, but it is still powerful enough to perform basic statistical regressions.
This script shows how to perform robust regression in MATLAB (version R2015b), following the help documentation for the robustfit function.
In MATLAB, B = robustfit(X,Y) returns the vector B of regression coefficients, obtained by performing robust regression to estimate the linear model Y = Xb. X is an n-by-p matrix of predictor variables, and Y is an n-by-1 vector of observations. The algorithm uses iteratively reweighted least squares with the bisquare weighting function. By default, robustfit adds a column of ones to X, corresponding to a constant term in the first element of B. Do not enter a column of ones directly into the X matrix.
We use the "Import Data" option in the toolstrip to load the data. It is very convenient to select specific columns to import as column vectors. Here we import Year, Overtime, Salaries1, and TotalBenefits into our workspace and select the data for 2014.
Overtime = Overtime(Year==2014);
Salaries1 = Salaries1(Year==2014);
TotalBenefits = TotalBenefits(Year==2014);
x = [Overtime,Salaries1];
y = TotalBenefits;
Let’s compare the result of robust fit to a standard least squares fit.
[bls,STATSls] = robustfit(x,y,'ols'); %least-squares
[brob,STATSrob] = robustfit(x,y); %robust
First we start with an OLS regression, using STATSls and bls, and some basic diagnostics to get a view of the relationship in the data.
display(STATSls);
display(bls);
We can check any field in the STATS structure using dot notation (STATSls.[field]). For example, we can see the p-values of the predictors using .p.
STATSls.p
We can also view how each observation is weighted in OLS regression.
hist(STATSls.w);
All the observations are weighted at 1 in OLS regression. Our Robust regression will weight them differently depending on the residuals.
For some quick diagnostics, we can do a QQ plot to check the normality.
qqplot(STATSls.resid)
From the QQ plot, we can clearly see that the errors are not normally distributed, so there may be issues with our OLS regression on this data.
We can now examine our robust regression using STATSrob and brob. The coefficients change a little compared with the OLS fit.
display(STATSrob);
display(brob);
We can also view how each observation is weighted by using a histogram of the weights.
histogram(STATSrob.w);
As noted, not all observations are weighted the same in our robust regression. In fact, the weights of the points with the largest residuals are reduced to zero.
STATSrob.w(find(STATSls.resid==max(STATSls.resid)))