In statistics, a model in which the parameters are fixed, non-random quantities is called a fixed effects model.
In general, data can be divided into groups based on some observed factors. The group means can be assumed to be constant or non-constant across groups. In a fixed effects model, as the name implies, each group mean is a group-specific fixed quantity.
Furthermore, a fixed effects model allows the group-specific effects to be correlated with the independent variables.
Thus, if the unobserved heterogeneity is constant over time, a fixed effects model can control for it. This heterogeneity can be removed from the data by differencing: for instance, taking a first difference removes any time-invariant components of the model.
In this tutorial, we will focus on fixed effects models with panel data.
Panel data (also known as longitudinal or cross-sectional time-series data) is a dataset in which the behavior of entities is observed across time. The entities could be states, companies, individuals, countries, etc.
In panel data, fixed effects stand for the subject-specific means. In panel data analysis, the fixed effects estimator refers to an estimator for the coefficients in a regression model that includes those fixed effects, in other words, one time-invariant intercept for each subject.
The linear unobserved effects model for \(N\) individuals and \(T\) time periods:
\[y_{it}=X_{it}\beta+\alpha_i+\mu_{it}, \quad \text{for } t=1,\dots,T \text{ and } i=1,\dots,N\]
Where:
\(y_{it}\) is the dependent variable observed for individual i at time t.
\(X_{it}\) is the time-variant \(1\times k\) regressor vector (\(k\) is the number of independent variables).
\(\beta\) is the \(k\times 1\) vector of parameters.
\(\alpha _{i}\) is the unobserved time-invariant individual effect.
\(\mu_{it}\) is the error term.
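To make the role of \(\alpha_i\) concrete, the following short derivation sketches how a time-invariant effect drops out of the model above. The bar notation (\(\bar{y}_i\), \(\bar{X}_i\), \(\bar{\mu}_i\)) for averages over time within individual \(i\) is introduced here for illustration and is not part of the original notation. Averaging the model over \(t\) gives
\[\bar{y}_{i}=\bar{X}_{i}\beta+\alpha_i+\bar{\mu}_{i}\]
and subtracting this from the original equation (the within transformation) removes \(\alpha_i\):
\[y_{it}-\bar{y}_{i}=(X_{it}-\bar{X}_{i})\beta+(\mu_{it}-\bar{\mu}_{i})\]
Taking a first difference has the same effect, for \(t=2,\dots,T\):
\[y_{it}-y_{i,t-1}=(X_{it}-X_{i,t-1})\beta+(\mu_{it}-\mu_{i,t-1})\]
In both cases \(\beta\) can be estimated by ordinary least squares on the transformed data without observing \(\alpha_i\).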
In this tutorial, we will use R, SAS and STATA to fit fixed effects models and compare them with ordinary linear regression models.
In R, we use base R (lm) and the packages lfe and plm. In SAS, we use the procedure glm, and both the class and absorb statements are shown. In STATA, we use the commands areg, xtreg, and reghdfe to run the regressions.
The dataset Cigar is a built-in dataset in the plm package in R. It is clean enough for us to do the data analysis directly.
The dataset Cigar is a panel of 46 U.S. states observed from 1963 to 1992 on cigarette consumption.
The total number of observations is 1380.
The panel data Cigar looks like this (first 10 observations):
state | year | price | pop | pop16 | cpi | ndi | sales | pimin |
---|---|---|---|---|---|---|---|---|
1 | 63 | 28.6 | 3383 | 2236.5 | 30.6 | 1558.305 | 93.9 | 26.1 |
1 | 64 | 29.8 | 3431 | 2276.7 | 31.0 | 1684.073 | 95.4 | 27.5 |
1 | 65 | 29.8 | 3486 | 2327.5 | 31.5 | 1809.842 | 98.5 | 28.9 |
1 | 66 | 31.5 | 3524 | 2369.7 | 32.4 | 1915.160 | 96.4 | 29.5 |
1 | 67 | 31.6 | 3533 | 2393.7 | 33.4 | 2023.546 | 95.5 | 29.6 |
1 | 68 | 35.6 | 3522 | 2405.2 | 34.8 | 2202.486 | 88.4 | 32.0 |
1 | 69 | 36.6 | 3531 | 2411.9 | 36.7 | 2377.335 | 90.1 | 32.8 |
1 | 70 | 39.6 | 3444 | 2394.6 | 38.8 | 2591.039 | 89.8 | 34.3 |
1 | 71 | 42.7 | 3481 | 2443.5 | 40.5 | 2785.316 | 95.4 | 35.8 |
1 | 72 | 42.3 | 3511 | 2484.7 | 41.8 | 3034.808 | 101.1 | 37.4 |
The variables used for the regression and the fixed effects model:
Dependent variable:
sales: cigarette sales in packs per capita.
Independent variables (may be transformed):
pop: population.
pop16: population above the age of 16.
price: price per pack of cigarettes.
cpi: consumer price index (1983=100).
ndi: per capita disposable income.
Fixed effects variables:
state (46 levels): state abbreviation.
year (30 levels): the year, from 1963 to 1992.
Heterogeneity in fixed effects models means that the means differ among categories such as states and years. When the data can be grouped by such categories and there is evidence of heterogeneity, OLS is not sufficient to control for the effects of these unobservable factors. Fixed effects models, however, can control for and estimate these effects. Moreover, if these unobservable factors are time-invariant, the omitted variable bias they cause can be eliminated by fixed effects regression.
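The omitted variable bias described above can be illustrated with a small simulation in base R. This is only a sketch on made-up data, not the Cigar dataset: the group sizes, the true slope of -1 and all variable names are assumptions chosen for illustration. The regressor x is built to be correlated with an unobserved, time-invariant group effect alpha, so pooled OLS is biased while a regression with group dummies is not.

```r
set.seed(42)

n_groups <- 50   # hypothetical number of groups (e.g. states)
n_times  <- 20   # hypothetical number of periods per group
group <- rep(seq_len(n_groups), each = n_times)

# Unobserved, time-invariant group effect
alpha <- rnorm(n_groups, sd = 5)

# The regressor is correlated with the group effect
x <- alpha[group] + rnorm(n_groups * n_times)

# True slope on x is -1; the group effect also shifts y
y <- -1 * x + 3 * alpha[group] + rnorm(n_groups * n_times)

pooled <- lm(y ~ x)                   # ignores the group effect
fixed  <- lm(y ~ x + factor(group))   # one dummy per group

slopes <- c(pooled = unname(coef(pooled)["x"]),
            fixed  = unname(coef(fixed)["x"]))
print(round(slopes, 3))
```

In this setup the pooled slope is pushed far away from -1 (it even takes the wrong sign), while the slope from the model with group dummies stays close to the true value.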
A plot of mean sales by year shows that the mean of sales differs across years.
A plot of mean sales by state shows similar heterogeneity across states. Therefore, a fixed effects model is a suitable choice.
Import the data:
# data: the dataset 'Cigar' is available inside the 'plm' package
library(plm)
data(Cigar)
Transform the variables:
# Adjust price and disposable income by cpi to
# express them in 1983 dollars
# (columns are referenced explicitly instead of using attach())
Cigar$price_adj = (Cigar$price / Cigar$cpi) * 100
Cigar$income_adj = (Cigar$ndi / Cigar$cpi) * 100
Fit an OLS regression model with sales as the response and price_adj, pop, pop16 and income_adj as predictors:
# Run ordinary linear regression without fixed effect
ols = lm(sales ~ price_adj + pop + pop16 + income_adj, data = Cigar)
summary(ols)
##
## Call:
## lm(formula = sales ~ price_adj + pop + pop16 + income_adj, data = Cigar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.905 -12.834 -2.860 7.873 162.438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.897e+02 5.254e+00 36.111 < 2e-16 ***
## price_adj -1.247e+00 5.378e-02 -23.185 < 2e-16 ***
## pop 1.040e-02 2.447e-03 4.248 2.30e-05 ***
## pop16 -1.495e-02 3.276e-03 -4.564 5.46e-06 ***
## income_adj 5.278e-03 4.379e-04 12.054 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26 on 1375 degrees of freedom
## Multiple R-squared: 0.2981, Adjusted R-squared: 0.2961
## F-statistic: 146 on 4 and 1375 DF, p-value: < 2.2e-16
From the summary above, we see that the coefficients of price_adj, pop, pop16 and income_adj are -1.247e+00, 1.040e-02, -1.495e-02 and 5.278e-03 respectively.
We fit a fixed effects model with sales as the response, price_adj, pop, pop16 and income_adj as independent variables, and state and year as fixed effects variables.
There are three ways to do this in R: the base function lm, felm in the lfe package, or plm in the plm package. They produce the same coefficient estimates.
lm generates dummy variables for state and year and then runs a linear regression. felm and plm, by contrast, absorb the fixed effects and do not report the individual fixed effects estimates.
If we just want to control for the fixed effects and only care about the coefficients of interest, either felm or plm is a good choice. But if we want to know the effect of some specific groups, lm is preferred.
The summary of lm shows individual fixed effects estimates for every year and every state. For convenience, we only show the estimated coefficients of the independent variables and the first five estimated effects for years.
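What "absorbing" the fixed effects means can be sketched in base R on a small simulated balanced panel. The data, the variable names and the helper demean2 are hypothetical and only for illustration; the sketch is not how felm or plm are implemented internally. For a balanced panel, subtracting the individual mean and the time mean and adding back the overall mean gives the same slope as the dummy variable regression.

```r
set.seed(1)

n_id <- 10   # hypothetical number of individuals
n_t  <- 6    # hypothetical number of time periods
panel <- expand.grid(t = seq_len(n_t), id = seq_len(n_id))

id_eff <- rnorm(n_id)[panel$id]   # individual effects
t_eff  <- rnorm(n_t)[panel$t]     # time effects

panel$x <- id_eff + t_eff + rnorm(nrow(panel))
panel$y <- 1.5 * panel$x + 2 * id_eff - t_eff + rnorm(nrow(panel))

# 1) Least squares dummy variable (LSDV) regression
lsdv <- lm(y ~ x + factor(id) + factor(t), data = panel)

# 2) Two-way within transformation (valid for a balanced panel):
#    v - individual mean - time mean + overall mean
demean2 <- function(v, id, t) v - ave(v, id) - ave(v, t) + mean(v)
panel$y_w <- demean2(panel$y, panel$id, panel$t)
panel$x_w <- demean2(panel$x, panel$id, panel$t)
within <- lm(y_w ~ x_w - 1, data = panel)

slope_lsdv   <- unname(coef(lsdv)["x"])
slope_within <- unname(coef(within)["x_w"])
print(all.equal(slope_lsdv, slope_within))
```

The slopes agree, but the standard errors reported by the second lm call would not match the LSDV ones, because that call does not account for the degrees of freedom used up by the absorbed effects. Dedicated functions such as felm and plm handle that correction.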
# Fixed effects using Least squares dummy variable model
ols_fixed = lm(sales ~ price_adj + pop + pop16 + income_adj +factor(year) + factor(state), data = Cigar)
summary(ols_fixed)$coefficients[1:10,]
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 254.871577992 8.0047498533 31.8400428 5.576540e-165
## price_adj -1.474957838 0.0712776661 -20.6931276 1.840630e-82
## pop 0.001908401 0.0018072543 1.0559672 2.911793e-01
## pop16 -0.002180001 0.0021201973 -1.0282068 3.040437e-01
## income_adj -0.002320871 0.0007200866 -3.2230442 1.299815e-03
## factor(year)64 -0.675374024 2.5764491002 -0.2621337 7.932599e-01
## factor(year)65 1.071170096 2.6189757232 0.4090034 6.826045e-01
## factor(year)66 5.615730191 2.6730827193 2.1008441 3.584666e-02
## factor(year)67 5.353523777 2.7114760535 1.9743946 4.854810e-02
## factor(year)68 7.346161519 2.7803640858 2.6421581 8.336775e-03
# Use lfe package, treat *state* and *year* as fixed effects variables, and fit a model
library(lfe)
felm_fixed = felm(sales ~ price_adj + pop + pop16 + income_adj |factor(year) + factor(state), data = Cigar)
summary(felm_fixed)
##
## Call:
## felm(formula = sales ~ price_adj + pop + pop16 + income_adj | factor(year) + factor(state), data = Cigar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.049 -5.140 -0.117 5.525 108.515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## price_adj -1.4749578 0.0712777 -20.693 <2e-16 ***
## pop 0.0019084 0.0018073 1.056 0.2912
## pop16 -0.0021800 0.0021202 -1.028 0.3040
## income_adj -0.0023209 0.0007201 -3.223 0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.29 on 1301 degrees of freedom
## Multiple R-squared(full model): 0.8515 Adjusted R-squared: 0.8426
## Multiple R-squared(proj model): 0.2545 Adjusted R-squared: 0.2098
## F-statistic(full model):95.67 on 78 and 1301 DF, p-value: < 2.2e-16
## F-statistic(proj model): 111 on 4 and 1301 DF, p-value: < 2.2e-16
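# Use plm package, treat *state* and *year* as fixed effects variables, and fit a model
# Note: plm expects index = c(individual, time). Listing "year" first only swaps
# the roles of n and T in the printed summary; the two-way within estimates are unchanged.
library(plm)
plm_md = plm(sales ~ price_adj + pop + pop16 + income_adj, data = Cigar,
index = c("year", "state"), model = "within", effect = "twoways")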
summary(plm_md)
## Twoways effects Within Model
##
## Call:
## plm(formula = sales ~ price_adj + pop + pop16 + income_adj, data = Cigar,
## effect = "twoways", model = "within", index = c("year", "state"))
##
## Balanced Panel: n = 30, T = 46, N = 1380
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -63.04920 -5.13997 -0.11695 5.52514 108.51496
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## price_adj -1.47495784 0.07127767 -20.6931 <2e-16 ***
## pop 0.00190840 0.00180725 1.0560 0.2912
## pop16 -0.00218000 0.00212020 -1.0282 0.3040
## income_adj -0.00232087 0.00072009 -3.2230 0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 263760
## Residual Sum of Squares: 196630
## R-Squared: 0.25451
## Adj. R-Squared: 0.20982
## F-statistic: 111.043 on 4 and 1301 DF, p-value: < 2.22e-16
From the summaries, we see that the coefficients of price_adj, pop, pop16 and income_adj change after adding state and year as fixed effects. The most obvious change is that the coefficient of income_adj flips sign: it changes from 5.278e-03 to -2.321e-03. The coefficients of price_adj, pop and pop16 are -1.475e+00, 1.908e-03 and -2.180e-03 respectively. In addition, pop and pop16 become statistically insignificant in the fixed effects models.
Import the data:
/* read the data file */
proc import datafile=".\Cigar.csv"
out=mydata dbms=csv replace;
getnames=yes;
run;
Transform the variables:
/*change the price, and income with cpi to get the dollar value in 1983 */
data Cigar; set mydata;
price_adj = (price/cpi)*100;
income_adj = (ndi/cpi)*100;
run;
proc reg data=Cigar;
model sales = price_adj pop pop16 income_adj;
run;
quit;
In SAS, the procedure glm is used to fit fixed effects models. In glm, we can use either class or absorb to specify the fixed effects variables.
If we want to see the fixed effects estimates for every state and every year, class is the first choice. The class statement automatically generates a set of dummy variables for each level of the variables state and year.
If we only want the estimates for the independent variables of interest, we can use absorb instead of class. But it can only absorb one variable at a time. And to use absorb, we need to suppress the intercept to avoid the dummy variable trap.
SAS displays the individual fixed effects estimates for every state and every year. For convenience, we only show the estimated coefficients of the independent variables and the first five estimated effects for years.
/* Fixed effects by class, generating a set of dummy variables */
/* for each level of the variable state and year */
proc glm data=Cigar;
class year state;
model sales = price_adj pop pop16 income_adj year state / solution;
run;
quit;
Because we absorb the variable state, SAS only displays the individual fixed effects estimates for every year. We only show the first five estimated effects for years.
/* Absorbing the variable *state* and generating dummies of years */
proc glm data=Cigar;
absorb state;
class year;
model sales = price_adj pop pop16 income_adj year / solution noint;
run;
quit;
The estimates for the independent variables price_adj, pop, pop16 and income_adj are the same as in R.
The estimates for years, however, differ from those in R. This is because R automatically treats one level of each factor as the reference level (here year 63 and state 1) and incorporates it into the estimated intercept, whereas SAS does not use the same reference levels.
Although the reported numbers differ, a simple calculation shows that the estimates are equivalent.
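The "simple calculation" can be sketched in base R on a small simulated dataset. The data and names are hypothetical, and the choice of the last year as the alternative reference level is an assumption made for illustration. Changing the reference level shifts every year effect by the same constant and moves that constant into the intercept, while the slope and the differences between years stay the same.

```r
set.seed(7)

d <- data.frame(year = factor(rep(63:67, each = 8)), x = rnorm(40))
d$y <- 0.5 * d$x + as.integer(d$year) + rnorm(40)

# Reference level = first year (R's default for factors)
fit_first <- lm(y ~ x + year, data = d)

# Reference level = last year (assumed alternative convention)
d$year_last <- relevel(d$year, ref = "67")
fit_last <- lm(y ~ x + year_last, data = d)

# Year effects for 63..67 under each convention (reference level = 0)
eff_first <- c(0, unname(coef(fit_first)[paste0("year", 64:67)]))
eff_last  <- c(unname(coef(fit_last)[paste0("year_last", 63:66)]), 0)

# Re-express the second set relative to year 63
eff_last_rebased <- eff_last - eff_last[1]

print(all.equal(eff_first, eff_last_rebased))
print(all.equal(unname(coef(fit_first)["x"]), unname(coef(fit_last)["x"])))
```

The intercepts differ by exactly the year 67 effect from the first fit, which is why the two outputs look different even though they describe the same fitted model.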
Import the data:
import delimited Cigar.csv, clear
Transform the variables:
* Adjust price and income by cpi to get the dollar value in 1983
g price_adj = (price/cpi)*100
g income_adj = (ndi/cpi)*100
reg sales price_adj pop pop16 income_adj
There are three ways to do this in STATA: the commands areg, xtreg, or reghdfe. They produce the same coefficient estimates.
areg and xtreg cannot absorb more than one fixed effect, but we can still include the second one as a factor variable with i.var. This can be computationally inefficient, since the coefficients for those dummy variables are actually calculated and reported. However, it can be helpful if we want to see the effect of one specific group.
If we just want to control for the fixed effects and only care about the other coefficients of interest, reghdfe is the best option.
* Absorbing the variable state and generating dummies of years with areg
areg sales price_adj pop pop16 income_adj i.year, absorb(state)
* Declaring the panel, then absorbing state and generating dummies of years with xtreg
xtset state year
xtreg sales price_adj pop pop16 income_adj i.year, fe
Install the packages:
* Install reghdfe and ftools
ssc install reghdfe
ssc install ftools
Run the regression:
* Absorbing the variables state and year using reghdfe
reghdfe sales price_adj pop pop16 income_adj, absorb(state year)
The estimates for the independent variables price_adj, pop, pop16 and income_adj are the same as in R and SAS.
Like R, STATA takes year 63 and state 1 as the default reference levels, so the estimated effects for years equal those from R. Because STATA absorbs variables, the estimated intercepts are different, but the models are the same.
The results of the OLS and the fixed effects models are very different. More specifically, with fixed effects the negative effect of price on sales is larger in magnitude than under OLS, and the coefficient on income flips sign.
If we fit OLS instead of a fixed effects model, we underestimate the effect of price on cigarette sales and even reach the wrong conclusion about the influence of income on sales. This highlights the importance of controlling for fixed effects.
When computing fixed effects estimates, we must choose whether to absorb the fixed effects or not, depending on our aim. Absorption is computationally fast and gives concise output, but the individual fixed effects estimates are not shown. To get every individual fixed effects estimate, the preferred method is "no absorption", which automatically generates a set of dummy variables for each level of the fixed effects variable.