Introduction

General Description

In statistics, a model whose parameters are fixed, non-random quantities is called a fixed effects model.

In general, data can be divided into groups based on some observed factors. The group means can be assumed to be constant or non-constant across groups. In a fixed effects model, as its name implies, each group mean is a group-specific fixed quantity.

Furthermore, a fixed effects model allows the group-specific effects to be correlated with the independent variables.

Thus, in a fixed effects model, unobserved heterogeneity that is constant over time can be controlled for. It can be removed from the data by differencing: for instance, taking a first difference eliminates any time-invariant component of the model.

Panel Data

In this tutorial, we will focus on fixed effects models for panel data.

Panel data (also known as longitudinal or cross-sectional time-series data) is a dataset in which the behavior of entities is observed across time. The entities could be states, companies, individuals, countries, etc.

In panel data, fixed effects stand for the subject-specific means. In panel data analysis, the fixed effects estimator refers to an estimator for the coefficients in a regression model that includes those fixed effects, in other words, one time-invariant intercept for each subject.

Classical Representation

The linear unobserved effects model for \(N\) individuals observed over \(T\) time periods is:

\[y_{it}=X_{it}\beta+\alpha_i+\mu_{it}, \quad t=1,\dots,T, \quad i=1,\dots,N\]

Where:

\(y_{it}\) is the dependent variable observed for individual \(i\) at time \(t\).

\(X_{it}\) is the time-variant \(1\times k\) regressor vector (\(k\) is the number of independent variables).

\(\beta\) is the \(k\times 1\) vector of parameters.

\(\alpha _{i}\) is the unobserved time-invariant individual effect.

\(\mu_{it}\) is the error term.
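
As a short worked step in the notation above, the following sketch shows why \(\alpha_i\) drops out. The bar notation \(\bar{y}_i=\frac{1}{T}\sum_{t=1}^{T}y_{it}\) (and likewise \(\bar{X}_i\), \(\bar{\mu}_i\)) is introduced here for illustration only. Averaging the model over \(t\) for a given \(i\) gives

\[\bar{y}_{i}=\bar{X}_{i}\beta+\alpha_i+\bar{\mu}_{i}\]

Subtracting this from the original equation (the within transformation) removes the time-invariant effect:

\[y_{it}-\bar{y}_{i}=(X_{it}-\bar{X}_{i})\beta+(\mu_{it}-\bar{\mu}_{i})\]

Taking a first difference achieves the same thing for \(t=2,\dots,T\):

\[y_{it}-y_{i,t-1}=(X_{it}-X_{i,t-1})\beta+(\mu_{it}-\mu_{i,t-1})\]

In both cases \(\beta\) can be estimated without knowing \(\alpha_i\), even when \(\alpha_i\) is correlated with \(X_{it}\).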

Overview

In this tutorial, we will use R, SAS and STATA to fit fixed effects models and compare them with ordinary linear regression models.

In R, we use base R (lm) together with the lfe and plm packages. In SAS, we use proc glm, and both the class and absorb statements are shown. In STATA, we use the commands areg, xtreg and reghdfe to do the regression.

The dataset Cigar is a built-in dataset in the plm package in R. It is clean enough for us to do the data analysis directly.

Example Dataset: Cigar

The dataset Cigar is a panel of 46 US states observed from 1963 to 1992, recording cigarette consumption.

The total number of observations is 1380.

The panel data Cigar looks like this (first 10 observations):

state year price pop pop16 cpi ndi sales pimin
1 63 28.6 3383 2236.5 30.6 1558.305 93.9 26.1
1 64 29.8 3431 2276.7 31.0 1684.073 95.4 27.5
1 65 29.8 3486 2327.5 31.5 1809.842 98.5 28.9
1 66 31.5 3524 2369.7 32.4 1915.160 96.4 29.5
1 67 31.6 3533 2393.7 33.4 2023.546 95.5 29.6
1 68 35.6 3522 2405.2 34.8 2202.486 88.4 32.0
1 69 36.6 3531 2411.9 36.7 2377.335 90.1 32.8
1 70 39.6 3444 2394.6 38.8 2591.039 89.8 34.3
1 71 42.7 3481 2443.5 40.5 2785.316 95.4 35.8
1 72 42.3 3511 2484.7 41.8 3034.808 101.1 37.4

Variables:

The variables used for the regression and the fixed effects model:

Dependent variable:

sales: cigarette sales in packs per capita.

Independent variables (may be transformed):

pop: population.
                    
pop16: population above the age of 16.

price: price per pack of cigarettes.

cpi: consumer price index (1983=100).

ndi: per capita disposable income.

Fixed effects variables:

state (46 levels): state abbreviation.

year (30 levels, 63 to 92): the year.

Why Fixed Effects Models

Heterogeneity in fixed effects models means that group means differ across categories such as states and years. When the data can be grouped by such categories and there is evidence of heterogeneity, OLS is not sufficient to control for the effects of these unobservable factors. Fixed effects models, however, can control for and estimate these effects. Moreover, if these unobservable factors are time-invariant, fixed effects regression eliminates the resulting omitted variable bias.
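
The omitted variable bias described above can be illustrated with a small simulation. This is only a sketch on made-up data, not the Cigar dataset: the group effect alpha, the true slope of -1 and all variable names are assumptions chosen for illustration. The regressor x is built to be correlated with the unobserved group effect, so pooled OLS is biased while the dummy variable (fixed effects) regression is not.

```r
set.seed(1)
n_groups  <- 50
n_periods <- 20
group <- rep(seq_len(n_groups), each = n_periods)

# Unobserved, time-invariant group effect (hypothetical)
alpha <- rnorm(n_groups, sd = 3)

# Regressor correlated with the group effect
x <- alpha[group] + rnorm(n_groups * n_periods)

# True slope on x is -1; the group effect enters y directly
y <- -1 * x + 2 * alpha[group] + rnorm(n_groups * n_periods)

# Pooled OLS ignores the group effect
pooled_slope <- unname(coef(lm(y ~ x))["x"])

# Fixed effects via one dummy per group
fe_slope <- unname(coef(lm(y ~ x + factor(group)))["x"])

print(c(pooled = pooled_slope, fixed_effects = fe_slope))
```

In this setup the pooled slope even has the wrong sign, while the fixed effects slope stays close to the true value of -1.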

Heterogeneity across year

The above graph shows that mean sales differ from year to year.

Heterogeneity across state

We can also observe heterogeneity across states in the above graph. Therefore, a fixed effects model is a natural choice.
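
The group means behind this kind of comparison can also be inspected numerically. The following is a minimal sketch, assuming the plm package is installed so that Cigar can be loaded. The object names year_means and state_means are illustrative.

```r
year_means  <- NULL
state_means <- NULL

# Only run when plm (which ships the Cigar data) is available
if (requireNamespace("plm", quietly = TRUE)) {
  data("Cigar", package = "plm")

  # Mean per-capita sales for each year and for each state
  year_means  <- tapply(Cigar$sales, Cigar$year, mean)
  state_means <- tapply(Cigar$sales, Cigar$state, mean)

  # A wide range of group means is a sign of heterogeneity
  print(range(year_means))
  print(range(state_means))
}
```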

Tutorial in R

Data Manipulation

Import the data:

# data: the dataset 'Cigar' is available inside the 'plm' package
library(plm)
data(Cigar)

Transform the variables:

# Deflate price and disposable income by cpi to 
# get their dollar values in 1983
Cigar$price_adj = (Cigar$price / Cigar$cpi) * 100
Cigar$income_adj = (Cigar$ndi / Cigar$cpi) * 100
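
As a quick check of the adjustment above, the same arithmetic can be applied by hand to the first row of the data shown earlier (state 1, year 63). The scalar names below are illustrative only.

```r
# Values from the first row of Cigar (state 1, year 63)
price <- 28.6
cpi   <- 30.6
ndi   <- 1558.305

# Express both quantities in 1983 dollars (cpi = 100 in 1983)
price_adj  <- (price / cpi) * 100
income_adj <- (ndi / cpi) * 100

print(round(c(price_adj = price_adj, income_adj = income_adj), 2))
```

Because cpi was about 30.6 in 1963, a nominal price of 28.6 corresponds to roughly 93.46 in 1983 terms, and the nominal income of 1558.305 corresponds to 5092.5.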

OLS regression

Fit an OLS regression model with sales as the response and price_adj, pop, pop16 and income_adj as predictors:

# Run ordinary linear regression without fixed effect
ols = lm(sales ~ price_adj + pop + pop16 + income_adj, data = Cigar)
summary(ols)
## 
## Call:
## lm(formula = sales ~ price_adj + pop + pop16 + income_adj, data = Cigar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -73.905 -12.834  -2.860   7.873 162.438 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.897e+02  5.254e+00  36.111  < 2e-16 ***
## price_adj   -1.247e+00  5.378e-02 -23.185  < 2e-16 ***
## pop          1.040e-02  2.447e-03   4.248 2.30e-05 ***
## pop16       -1.495e-02  3.276e-03  -4.564 5.46e-06 ***
## income_adj   5.278e-03  4.379e-04  12.054  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26 on 1375 degrees of freedom
## Multiple R-squared:  0.2981, Adjusted R-squared:  0.2961 
## F-statistic:   146 on 4 and 1375 DF,  p-value: < 2.2e-16

From the summary above, we see that the coefficients of price_adj, pop, pop16 and income_adj are -1.247e+00, 1.040e-02, -1.495e-02 and 5.278e-03 respectively.

Fixed Effects Models

We fit a fixed effects model with sales as the response, price_adj, pop, pop16 and income_adj as independent variables, and state and year as fixed effects variables.

There are three ways to do this in R: the base function lm, felm in the lfe package, or plm in the plm package. They all produce the same coefficient estimates.

lm generates dummy variables for state and year and then runs a linear regression. felm and plm, in contrast, absorb the fixed effects and do not report their individual estimates.

If we just want to control for the fixed effects and only care about the coefficients of interest, either felm or plm is a good choice. But if we want to know the effect of specific groups, lm is preferred.
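
To see why the dummy variable approach and absorption agree, here is a minimal sketch in base R on a small simulated balanced panel (not the Cigar data). The helper demean2 and all variable names are hypothetical. It removes unit and time means by hand, which is one way to describe what absorbing two fixed effects does in a balanced panel, and compares the slope with the dummy variable regression.

```r
set.seed(42)
n_units <- 8
n_times <- 6
panel <- expand.grid(time = seq_len(n_times), unit = seq_len(n_units))

# Simulated unit and time effects, both correlated with x
unit_eff <- rnorm(n_units)[panel$unit]
time_eff <- rnorm(n_times)[panel$time]
panel$x <- unit_eff + time_eff + rnorm(nrow(panel))
panel$y <- 1.5 * panel$x + 2 * unit_eff - time_eff + rnorm(nrow(panel))

# Dummy variable regression: one dummy per unit and per time period
lsdv <- lm(y ~ x + factor(unit) + factor(time), data = panel)
beta_lsdv <- unname(coef(lsdv)["x"])

# Two-way within transformation (this simple formula assumes a balanced panel)
demean2 <- function(v, unit, time) {
  v - ave(v, unit) - ave(v, time) + mean(v)
}
panel$y_w <- demean2(panel$y, panel$unit, panel$time)
panel$x_w <- demean2(panel$x, panel$unit, panel$time)
beta_within <- unname(coef(lm(y_w ~ x_w - 1, data = panel))["x_w"])

# The two slopes should agree up to numerical tolerance
print(all.equal(beta_lsdv, beta_within))
```

The slopes match, but the standard errors from the hand-demeaned regression would not, because lm does not know that degrees of freedom were used up by the removed means. Dedicated functions such as felm and plm handle that correction.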

Basic R:

The summary of lm shows individual fixed effects estimates for every year and every state. For convenience, we only show the estimated coefficients of the independent variables and the first five estimated year effects.

# Fixed effects using Least squares dummy variable model
ols_fixed = lm(sales ~ price_adj + pop + pop16 + income_adj +factor(year) + factor(state), data = Cigar)
summary(ols_fixed)$coefficients[1:10,]
##                     Estimate   Std. Error     t value      Pr(>|t|)
## (Intercept)    254.871577992 8.0047498533  31.8400428 5.576540e-165
## price_adj       -1.474957838 0.0712776661 -20.6931276  1.840630e-82
## pop              0.001908401 0.0018072543   1.0559672  2.911793e-01
## pop16           -0.002180001 0.0021201973  -1.0282068  3.040437e-01
## income_adj      -0.002320871 0.0007200866  -3.2230442  1.299815e-03
## factor(year)64  -0.675374024 2.5764491002  -0.2621337  7.932599e-01
## factor(year)65   1.071170096 2.6189757232   0.4090034  6.826045e-01
## factor(year)66   5.615730191 2.6730827193   2.1008441  3.584666e-02
## factor(year)67   5.353523777 2.7114760535   1.9743946  4.854810e-02
## factor(year)68   7.346161519 2.7803640858   2.6421581  8.336775e-03

Package lfe:

# Use the lfe package: treat state and year as fixed effects variables and absorb them
library(lfe)
felm_fixed = felm(sales ~ price_adj + pop + pop16 + income_adj |factor(year) + factor(state), data = Cigar)
summary(felm_fixed)
## 
## Call:
##    felm(formula = sales ~ price_adj + pop + pop16 + income_adj |      factor(year) + factor(state), data = Cigar) 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -63.049  -5.140  -0.117   5.525 108.515 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## price_adj  -1.4749578  0.0712777 -20.693   <2e-16 ***
## pop         0.0019084  0.0018073   1.056   0.2912    
## pop16      -0.0021800  0.0021202  -1.028   0.3040    
## income_adj -0.0023209  0.0007201  -3.223   0.0013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.29 on 1301 degrees of freedom
## Multiple R-squared(full model): 0.8515   Adjusted R-squared: 0.8426 
## Multiple R-squared(proj model): 0.2545   Adjusted R-squared: 0.2098 
## F-statistic(full model):95.67 on 78 and 1301 DF, p-value: < 2.2e-16 
## F-statistic(proj model):   111 on 4 and 1301 DF, p-value: < 2.2e-16

Package plm:

# Use the plm package: treat state and year as fixed effects variables and fit a two-way within model
# index takes the individual identifier first and the time identifier second
library(plm)
plm_md = plm(sales ~ price_adj + pop + pop16 + income_adj, data = Cigar,
          index = c("state", "year"), model = "within", effect = "twoways")
summary(plm_md)
## Twoways effects Within Model
## 
## Call:
## plm(formula = sales ~ price_adj + pop + pop16 + income_adj, data = Cigar, 
##     effect = "twoways", model = "within", index = c("state", "year"))
## 
## Balanced Panel: n = 46, T = 30, N = 1380
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -63.04920  -5.13997  -0.11695   5.52514 108.51496 
## 
## Coefficients:
##               Estimate  Std. Error  t-value Pr(>|t|)    
## price_adj  -1.47495784  0.07127767 -20.6931   <2e-16 ***
## pop         0.00190840  0.00180725   1.0560   0.2912    
## pop16      -0.00218000  0.00212020  -1.0282   0.3040    
## income_adj -0.00232087  0.00072009  -3.2230   0.0013 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    263760
## Residual Sum of Squares: 196630
## R-Squared:      0.25451
## Adj. R-Squared: 0.20982
## F-statistic: 111.043 on 4 and 1301 DF, p-value: < 2.22e-16

Summary

From the summaries, we see that the coefficients of price_adj, pop, pop16 and income_adj change after adding state and year as fixed effects. The most obvious change is that the coefficient of income_adj flips sign, from 5.278e-03 to -2.321e-03. The coefficients of price_adj, pop and pop16 become -1.475e+00, 1.908e-03 and -2.180e-03 respectively. In addition, pop and pop16 are no longer statistically significant in the fixed effects models.
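
The residual degrees of freedom reported by the fixed effects fits (1301) can be checked by counting parameters. This is a small arithmetic sketch with illustrative variable names: the panel has 46 states and 30 years (63 to 92 inclusive), and one level of each factor serves as the reference.

```r
n_obs <- 46 * 30            # balanced panel: states x years

n_slopes        <- 4        # price_adj, pop, pop16, income_adj
n_state_dummies <- 46 - 1   # one state is the reference level
n_year_dummies  <- 30 - 1   # one year is the reference level

# Intercept + slopes + dummies
n_params <- 1 + n_slopes + n_state_dummies + n_year_dummies
resid_df <- n_obs - n_params

print(c(n_obs = n_obs, n_params = n_params, resid_df = resid_df))
```

This gives 1380 observations, 79 estimated parameters and 1301 residual degrees of freedom, which matches the "78 and 1301 DF" in the full-model F-statistic (78 parameters besides the intercept).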

Tutorial in SAS

Data Manipulation

Import the data:

/* read the data file  */
proc import datafile=".\Cigar.csv" 
out=mydata dbms=csv replace; 
getnames=yes; 
run;

Transform the variables:

/*change the price, and income with cpi to get the dollar value in 1983 */
data Cigar; set mydata;
price_adj = (price/cpi)*100;
income_adj = (ndi/cpi)*100;
run;

OLS regression

proc reg data=Cigar; 
 model sales = price_adj pop pop16 income_adj;
 run;
quit;

Fixed Effects Models

In SAS, proc glm can be used to fit fixed effects models. Within proc glm, we can use either the class statement or the absorb statement to specify the fixed effects variables.

If we want to see the fixed effects estimates for every state and every year, class is the first choice. It automatically generates a set of dummy variables for each level of the variables state and year.

If we only want the estimates for the independent variables of interest, we can use absorb instead of class. However, absorb cannot handle two crossed effects such as state and year together, so we absorb only state and keep year as a class variable. When using absorb here, we also suppress the intercept (noint) to avoid the dummy variable trap.

Use class:

For convenience, we only show the estimated coefficients of the independent variables and the first five estimated year effects. In fact, SAS displays the fixed effects estimates for every state and every year.

/* Fixed effects by class, generating a set of dummy variables */
/* for each level of the variable state and year               */
proc glm data=Cigar;
 class year state; 
 model sales = price_adj pop pop16 income_adj year state / solution;
run;
quit;

Use absorb:

In SAS, because we absorb the variable state, only the fixed effects estimates for each year are displayed. We only show the first five estimated year effects.

/* Absorb the variable state and generate dummies for year. */
/* absorb requires the data to be sorted by state.          */
proc glm data=Cigar;
 absorb state; 
 class year;
 model sales = price_adj pop pop16 income_adj year / solution noint;
run;
quit;

Summary

The estimates for the independent variables price_adj, pop, pop16 and income_adj are the same as those in R.

We will find that the estimates for years differ from those in R. This is because R automatically treats the first level of each factor as the reference level (here year 63 and state 1) and folds it into the estimated intercept, whereas proc glm in SAS takes the last level of each class variable as the reference.

Although the reported numbers differ, simple calculations show that the estimates are equivalent.
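
The "simple calculations" can be sketched in R on simulated data (not the Cigar data). The example assumes that the only difference between the two outputs is the choice of reference level: first level, as in R, versus last level, as proc glm is documented to use. All names and the data are hypothetical. Changing the reference shifts every year effect by the same constant and moves that constant into the intercept, while the slope is untouched.

```r
set.seed(7)
d <- data.frame(year = factor(rep(63:67, each = 20)), x = rnorm(100))
year_shift <- c(0, 1, 3, 2, 5)[as.integer(d$year)]
d$y <- 2 * d$x + year_shift + rnorm(100)

# R default: the first level (63) is the reference
fit_first <- lm(y ~ x + year, data = d)

# Mimic a last-level reference by making 67 the reference
d$year_last <- relevel(d$year, ref = "67")
fit_last <- lm(y ~ x + year_last, data = d)

b_first <- coef(fit_first)
b_last  <- coef(fit_last)

# The slope on x does not depend on the reference level
same_slope <- all.equal(unname(b_first["x"]), unname(b_last["x"]))

# Effect of 64 relative to 67 = (64 vs 63) - (67 vs 63)
same_shift <- all.equal(unname(b_last["year_last64"]),
                        unname(b_first["year64"] - b_first["year67"]))

# The intercept absorbs the effect of the new reference level
same_intercept <- all.equal(unname(b_last["(Intercept)"]),
                            unname(b_first["(Intercept)"] + b_first["year67"]))

print(c(same_slope, same_shift, same_intercept))
```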

Tutorial in STATA

Data Manipulation

Import the data:

import delimited Cigar.csv, clear

Transform the variables:

* Deflate price and income by cpi to get their dollar values in 1983
generate price_adj = (price/cpi)*100
generate income_adj = (ndi/cpi)*100

OLS regression

reg sales price_adj pop pop16 income_adj

Fixed Effects Models

There are three ways to do this in STATA: the commands areg, xtreg and reghdfe. They all produce the same coefficient estimates.

areg and xtreg cannot absorb more than one fixed effect, but we can still include a factor variable i.var for the second one. This can be computationally inefficient, since the coefficients for those dummy variables are actually calculated and reported. However, it can be helpful when we want to see the effect of one specific group.

If we just want to control for the fixed effects and only care about the other coefficients of interest, reghdfe is the best option.

Command areg:

* Absorb the variable state and generate dummies for year
areg sales price_adj pop pop16 income_adj i.year, absorb(state)

Command xtreg:

* Declare the panel structure, then absorb state and generate dummies for year
xtset state year
xtreg sales price_adj pop pop16 income_adj i.year, fe

Command reghdfe:

Install the packages:

* Install reghdfe and also ftools from SSC
ssc install reghdfe
ssc install ftools

Regression:

* Absorb both state and year using reghdfe
reghdfe sales price_adj pop pop16 income_adj, absorb(state year)

Summary

The estimates for the independent variables price_adj, pop, pop16 and income_adj are the same as those in R and SAS.

Like R, STATA takes year 63 and state 1 as the default reference levels, so the estimated year effects equal those in R. Because STATA absorbs variables, the estimated intercepts differ, but the models are in fact the same.

Discussion and Summary

Compare Fixed Effects Model to OLS

The results of OLS and the fixed effects model differ substantially. Specifically, with fixed effects the negative effect of price on sales is larger in magnitude than under OLS (-1.475 versus -1.247), and the coefficient on income flips sign.

Importance of Fixed Effects Model

If we fit OLS instead of a fixed effects model, we underestimate the effect of price on cigarette sales and even draw the wrong conclusion about the influence of income on sales. This highlights the importance of controlling for fixed effects.

Absorption or Not

When computing fixed effects estimates, we must choose whether to absorb the fixed effects, depending on our aim. Absorption is computationally fast and gives concise output, but the individual fixed effects estimates are not shown. To obtain every individual fixed effects estimate, the preferred method is “no absorption”, which automatically generates a set of dummy variables for each level of each fixed effects variable.
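
Absorbed effects are hidden rather than lost. As a minimal sketch in base R on simulated data with a single grouping variable (all names and values are hypothetical), the group intercepts can be recovered after absorption as the group mean of y minus the slope times the group mean of x, and they match the dummy variable estimates.

```r
set.seed(3)
g <- rep(1:5, each = 12)
x <- rnorm(60) + g
y <- 0.5 * x + c(10, 20, 30, 40, 50)[g] + rnorm(60)

# Absorbed fit: remove group means, then regress without an intercept
x_w <- x - ave(x, g)
y_w <- y - ave(y, g)
beta <- unname(coef(lm(y_w ~ x_w - 1)))

# Recover each group intercept: mean(y) - beta * mean(x) within the group
alpha_hat <- as.numeric(tapply(y, g, mean) - beta * tapply(x, g, mean))

# Dummy variable fit with one intercept per group (no common intercept)
lsdv <- lm(y ~ x + factor(g) - 1)
alpha_lsdv <- unname(coef(lsdv)[-1])

# The recovered and the directly estimated intercepts should agree
print(all.equal(alpha_hat, alpha_lsdv))
```

For fitted models, the packages provide their own extractors for this purpose (fixef in plm and getfe in lfe), so the manual step is mainly useful for understanding what those functions return.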

References

Wikipedia: Fixed effects model

Dataset: Cigar

R Package: plm

R Package: lfe

STATA Package: reghdfe

Notes: Panel Data using R

Notes: Fixed Effects in SAS