Topic Introduction

Beijing’s air pollution poses a serious threat to the health of local residents. The PM 2.5 Index is an important indicator of air quality, so an effective way of predicting it would be an important step toward addressing the pollution. However, many factors, such as dew point, temperature, and pressure, may influence PM 2.5, and the relationships between them remain unclear, which makes accurate prediction challenging. To address this problem, we adopt a linear mixed effects model to predict the response variable (i.e., air quality) from the numerous meteorological explanatory variables.

Model Introduction

Ecological and meteorological data are often complex and messy because of the inclusion of grouping factors like gender, population, continent, and season. When fitting a complicated model with many parameters and observations, the data points might not be truly independent of each other; for example, all observations in a meteorological dataset could be collected at the same site. Mixed models were developed to overcome this problem and allow us to use all of our data. They are applied in many disciplines where multiple correlated measurements are made on each unit of interest. The core of a mixed model is that it incorporates both fixed and random effects: a fixed effect is an effect that is constant across a given population, while a random effect is an effect that varies across the units of that population.

A typical linear regression model takes the form \[y = X\beta+\epsilon\] where \(X\) is the design matrix of the fixed effects with coefficients \(\beta\), and \(\epsilon\) is random noise with \(\epsilon \sim \mathcal{N}(0,\sigma^2 I)\).

In a linear mixed effects model, a random-effects term \(Z\gamma\) is added, where \(Z\) is the design matrix of the random effects with coefficients \(\gamma\). The model has the form \[y = X\beta + Z\gamma + \epsilon\] with \(\gamma \sim \mathcal{N}(0,\sigma_\gamma^2 I)\) independent of \(\epsilon\).
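For example, with a single grouping factor (such as the wind direction used below) and a random intercept only, the model can be written observation by observation; this is a minimal illustration of the notation above, where \(j\) indexes the group and \(i\) the observation within group \(j\):

\[y_{ij} = x_{ij}^\top\beta + \gamma_j + \epsilon_{ij}, \qquad \gamma_j \sim \mathcal{N}(0,\sigma_\gamma^2), \quad \epsilon_{ij} \sim \mathcal{N}(0,\sigma^2)\]

In this case \(Z\) is simply the indicator matrix of group membership, and \(\gamma\) collects the group-level intercept deviations.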

Data Description

The Beijing PM2.5 Data Set is from the UCI Machine Learning Repository. The dataset includes 43824 observations and 12 attributes. Here is a table of the main attributes:

Variable   Description                      Data Type
pm2.5      PM2.5 concentration (ug/m^3)     numerical
DEWP       Dew point                        numerical
TEMP       Temperature                      numerical
PRES       Pressure (hPa)                   numerical
cbwd       Combined wind direction          categorical
Iws        Cumulated wind speed (m/s)       numerical
Is         Cumulated hours of snow          numerical
Ir         Cumulated hours of rain          numerical

Overview

For this project, we use the packages lme4 and nlme in R and the package statsmodels in Python to evaluate the important factors that contribute to PM2.5 in Beijing, China. Because the distribution of PM2.5 is heavily skewed within each level of the variable “cbwd”, we apply a log transformation to the response variable. We select the model by plotting log(PM2.5) against every predictor within each wind direction. We also summarize and compare our results in the conclusion.

Data Preprocessing

We first preprocess the data before fitting a model.

library(dplyr)    # for filter() and mutate()
library(ggplot2)  # for qplot()
AQ = read.csv("Beijing.csv")
# divide the months into four seasons
seasons = c( 'winter',
             'winter',
             rep('spring',3),
             rep('summer',3),
             rep('fall',3),
             'winter'
)
# delete the missing values and non-positive readings
AQ = AQ %>%
  filter(!is.na(pm2.5) & pm2.5 > 0) %>%
  mutate(season=factor(month, 1:12, seasons)) 
# boxplot for response grouped by cbwd
qplot(cbwd, pm2.5, facets = . ~ cbwd, 
      colour = cbwd, geom = "boxplot", data = AQ)

We can see that the data are heavily right-skewed, so we need a transformation. We draw a Box-Cox plot for the data.

# draw the Box-Cox plot
library(MASS)  # provides boxcox()
g = lm(pm2.5 ~ ., data = AQ)
boxcox(g,plotit=T)
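To read the Box-Cox plot numerically, we can also extract the \(\lambda\) that maximizes the profile log-likelihood; a small sketch (the object name bc is ours):

# extract the lambda that maximizes the profile log-likelihood
bc = boxcox(g, plotit = FALSE)
bc$x[which.max(bc$y)]  # a value near 0 supports a log transformation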

# the plot suggests a log transformation for the response
AQ=
  AQ%>%
  mutate(pm2.5_log=log(pm2.5)) %>%
  filter(is.finite(pm2.5_log))

The Box-Cox plot suggests a log transformation for the response variable. We then drop the outliers (producing the trimmed data set AQ_de) and draw the boxplot for the response grouped by the variable cbwd.
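The R code does not show the outlier-removal step that produces AQ_de; here is a minimal sketch, assuming the same per-group cutoffs as in the Python section later in this report:

# drop outliers in log(pm2.5) within each wind-direction group
# (the cutoffs mirror the Python section; they were read off the boxplots)
AQ_de = AQ %>%
  filter(!(cbwd == "cv" & (pm2.5_log <= 2.2 | pm2.5_log >= 6.8)),
         !(cbwd == "NE" & pm2.5_log <= 0.5),
         !(cbwd == "NW" & pm2.5_log <= 0.5),
         !(cbwd == "SE" & (pm2.5_log <= 0.5 | pm2.5_log >= 6.291569)))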

qplot(cbwd, pm2.5_log, facets = . ~ cbwd, 
      colour = cbwd, geom = "boxplot", data = AQ_de)

We can see that the response differs across groups, so it is reasonable to establish the following model:

\[\log(\mathrm{pm2.5})=year+month+day+hour+DEWP+TEMP+PRES+Is+Ir+(1\,|\,cbwd)+\epsilon\]

In the model above, we assumed that the effect of each predictor is the same for all wind directions. However, the effects might differ between groups. What we need, therefore, is a random slope model, in which the wind-direction groups are allowed to have different slopes as well as different intercepts. We plot the response log(pm2.5) against each predictor within each wind direction.

The slopes of hour and PRES differ markedly between the combined-wind-direction groups. Therefore, according to the plots, we choose the following model:

\[\log(\mathrm{pm2.5})=year+month+day+hour+DEWP+TEMP+PRES+Is+Ir+(1+hour+PRES\,|\,cbwd)+\epsilon\]
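In lme4's formula syntax (used in the next part), the two specifications differ only in the parenthesized random-effects term. A stripped-down illustration with only two fixed effects (the names m0 and m1 are ours, not part of the original analysis):

library(lme4)
# random intercept only: each wind direction has its own baseline level
m0 = lmer(pm2.5_log ~ hour + PRES + (1 | cbwd), data = AQ_de)
# random slopes: the hour and PRES effects also vary by wind direction
m1 = lmer(pm2.5_log ~ hour + PRES + (1 + hour + PRES | cbwd), data = AQ_de)
anova(m0, m1)  # refits both models by ML and compares them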

Three Approaches

lme4 in R

library(lme4)
library(lmerTest)  # adds Satterthwaite p-values to the lmer summary
res = lmer(pm2.5_log ~ year+month+day+hour+DEWP+TEMP+PRES+Is+Ir+(1+hour+PRES|cbwd), 
           REML = TRUE,
           data = AQ_de)
summary(res)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: pm2.5_log ~ year + month + day + hour + DEWP + TEMP + PRES +  
##     Is + Ir + (1 + hour + PRES | cbwd)
##    Data: AQ_de
## 
## REML criterion at convergence: 97209.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.6470 -0.6713  0.0145  0.6886  4.2271 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr     
##  cbwd     (Intercept) 11.7312  3.425             
##           hour         1.4075  1.186    0.06     
##           PRES         4.9209  2.218    0.06 1.00
##  Residual              0.6084  0.780             
## Number of obs: 41441, groups:  cbwd, 4
## 
## Fixed effects:
##               Estimate Std. Error         df  t value Pr(>|t|)    
## (Intercept)  3.546e+01  5.778e+00  1.886e+03    6.137 1.02e-09 ***
## year        -5.606e-03  2.731e-03  4.142e+04   -2.053   0.0401 *  
## month       -2.080e-02  1.186e-03  4.142e+04  -17.535  < 2e-16 ***
## day          6.421e-03  4.369e-04  4.142e+04   14.697  < 2e-16 ***
## hour         1.212e-02  5.938e-01  1.710e-02    0.020   0.9974    
## DEWP         6.006e-02  5.533e-04  4.090e+04  108.554  < 2e-16 ***
## TEMP        -7.755e-02  7.056e-04  4.099e+04 -109.897  < 2e-16 ***
## PRES        -1.895e-02  1.110e+00  1.648e-02   -0.017   0.9978    
## Is          -3.512e-02  4.980e-03  4.142e+04   -7.054 1.77e-12 ***
## Ir          -8.035e-02  2.750e-03  4.141e+04  -29.216  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##       (Intr) year   month  day    hour   DEWP   TEMP   PRES   Is    
## year  -0.946                                                        
## month  0.022  0.003                                                 
## day    0.001  0.001  0.006                                          
## hour   0.017  0.000  0.000  0.000                                   
## DEWP  -0.104  0.069 -0.239 -0.031  0.000                            
## TEMP   0.033 -0.103 -0.067  0.001  0.000 -0.522                     
## PRES   0.017  0.000  0.000  0.000  1.000  0.000  0.000              
## Is    -0.010  0.012  0.052  0.038  0.000 -0.085  0.098  0.000       
## Ir    -0.015  0.012  0.004  0.003  0.000 -0.154  0.099  0.000  0.017
## convergence code: 0
## unable to evaluate scaled gradient
## Model failed to converge: degenerate  Hessian with 2 negative eigenvalues

We can see hour and PRES are not significant predictors.
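The summary also reports a convergence failure (a degenerate Hessian) and a 1.00 correlation between the hour and PRES random slopes, which suggests the random-effects structure is over-parameterized. One common remedy, shown here only as a sketch and not as the model used in this report, is to drop the random-effect correlations with lme4's double-bar syntax:

# uncorrelated random intercept and slopes (a sketch, not refit here)
res2 = lmer(pm2.5_log ~ year + month + day + hour + DEWP + TEMP + PRES +
              Is + Ir + (1 + hour + PRES || cbwd),
            REML = TRUE, data = AQ_de)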

# plot residuals against fitted values
dat = data.frame(fitted.values = fitted(res), residuals = resid(res))
ggplot(data = dat, aes(x = fitted.values, y = residuals)) +
  geom_point(color = "red", alpha = 0.1) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Since the smooth line looks flat, the residuals appear to be unrelated to the fitted values. We then test whether the random effects are warranted:

resb = lm(pm2.5_log ~ year+month+day+hour+DEWP+TEMP+PRES+Is+Ir, data = AQ_de)
dev1 = -2*logLik(res);dev0 = -2*logLik(resb)
devdiff = as.numeric(dev0-dev1)
dfdiff <- attr(dev1,"df")-attr(dev0,"df"); dfdiff
## [1] 6
cat('Chi-square =', devdiff, '(df=', dfdiff,'), p =', 
    pchisq(devdiff,dfdiff,lower.tail=FALSE))
## Chi-square = 5146.406 (df= 6 ), p = 0
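Strictly speaking, REML likelihoods are not comparable across models with different random-effects structures, and res was fit by REML. A sketch of the same test after refitting by maximum likelihood (res_ml is our name):

# refit the mixed model by ML for a like-for-like deviance comparison
res_ml = update(res, REML = FALSE)
devdiff_ml = as.numeric(-2*logLik(resb) - (-2*logLik(res_ml)))
pchisq(devdiff_ml, df = dfdiff, lower.tail = FALSE)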

and compare the BIC:

BIC(resb, res) # compare the BIC of simple linear regression and linear mixed model
##      df       BIC
## resb 11 102472.79
## res  17  97390.18

Both the likelihood-ratio test and the BIC comparison indicate that the random effects are significant.

nlme in R

We follow the same steps as in the first part to clean and preprocess the original data. We then use nlme, the mixed-model package that ships with R, to fit the model.

library(nlme)
resc=lme(pm2.5_log ~ year+month+day+hour+DEWP+TEMP+PRES+Is+Ir, random=~1+hour+PRES|cbwd,  
            method = 'ML', data = AQ_de)
summary(resc)
## Linear mixed-effects model fit by maximum likelihood
##  Data: AQ_de 
##        AIC      BIC    logLik
##   97430.82 97577.56 -48698.41
## 
## Random effects:
##  Formula: ~1 + hour + PRES | cbwd
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev       Corr        
## (Intercept) 6.765895e-07 (Intr) hour 
## hour        3.974904e-03 0.321       
## PRES        2.692868e-04 0.929  0.611
## Residual    7.832873e-01             
## 
## Fixed effects: pm2.5_log ~ year + month + day + hour + DEWP + TEMP + PRES +      Is + Ir 
##                Value Std.Error    DF    t-value p-value
## (Intercept) 32.75997  5.532499 41428    5.92137  0.0000
## year        -0.00380  0.002740 41428   -1.38581  0.1658
## month       -0.02053  0.001190 41428  -17.25218  0.0000
## day          0.00634  0.000439 41428   14.44928  0.0000
## hour         0.01252  0.002086 41428    6.00036  0.0000
## DEWP         0.06012  0.000555 41428  108.36199  0.0000
## TEMP        -0.07805  0.000708 41428 -110.29278  0.0000
## PRES        -0.01990  0.000729 41428  -27.28603  0.0000
## Is          -0.03148  0.004987 41428   -6.31223  0.0000
## Ir          -0.07823  0.002759 41428  -28.35042  0.0000
##  Correlation: 
##       (Intr) year   month  day    hour   DEWP   TEMP   PRES   Is    
## year  -0.991                                                        
## month  0.022  0.003                                                 
## day    0.000  0.002  0.006                                          
## hour  -0.010  0.012 -0.011 -0.001                                   
## DEWP  -0.106  0.068 -0.240 -0.031  0.071                            
## TEMP   0.032 -0.101 -0.068  0.001 -0.075 -0.521                     
## PRES  -0.091 -0.039 -0.202 -0.022  0.089  0.293  0.502              
## Is    -0.007  0.009  0.055  0.039 -0.006 -0.087  0.102 -0.018       
## Ir    -0.014  0.010  0.004  0.004 -0.013 -0.155  0.102  0.024  0.014
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -4.51245458 -0.68091829  0.01432061  0.69152643  4.28382111 
## 
## Number of Observations: 41441
## Number of Groups: 4
# plot residuals vs. fitted values
plot(resc)

# plot qqplot
qqnorm(resid(resc))
qqline(resid(resc))

According to the Q-Q plot and the residuals vs. fitted values plot, the model fits well.

library(stargazer)
stargazer(resc, type = "text",
          digits = 3,
          star.cutoffs = c(0.05, 0.01, 0.001),
          digit.separator = "")
## 
## =================================================
##                          Dependent variable:     
##                     -----------------------------
##                               pm2.5_log          
## -------------------------------------------------
## year                           -0.004            
##                                (0.003)           
##                                                  
## month                         -0.021***          
##                                (0.001)           
##                                                  
## day                           0.006***           
##                               (0.0004)           
##                                                  
## hour                          0.013***           
##                                (0.002)           
##                                                  
## DEWP                          0.060***           
##                                (0.001)           
##                                                  
## TEMP                          -0.078***          
##                                (0.001)           
##                                                  
## PRES                          -0.020***          
##                                (0.001)           
##                                                  
## Is                            -0.031***          
##                                (0.005)           
##                                                  
## Ir                            -0.078***          
##                                (0.003)           
##                                                  
## Constant                      32.760***          
##                                (5.532)           
##                                                  
## -------------------------------------------------
## Observations                    41441            
## Log Likelihood               -48698.410          
## Akaike Inf. Crit.             97430.820          
## Bayesian Inf. Crit.           97577.560          
## =================================================
## Note:               *p<0.05; **p<0.01; ***p<0.001

From the table we can see that only year is not significant, which differs from the conclusion we drew with lme4 in the first part.

BIC(res, resc)
##      df      BIC
## res  17 97390.18
## resc 17 97577.56

The coefficients are close to those obtained with lme4, and the lme4 fit has the smaller BIC.
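One caveat: res was fit by REML while resc was fit by ML, so the two BICs are not strictly comparable. A minimal sketch of a like-for-like comparison (resc_reml is our name):

# refit the nlme model by REML before comparing information criteria
resc_reml = update(resc, method = "REML")
BIC(res, resc_reml)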

Python

This is the data processing part. We add the grouping variable “season” according to the month, apply the log transformation to the response variable as suggested by the Box-Cox plot drawn in R above, and then remove the outliers according to the boxplot of log(pm2.5) against cbwd.

import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
AQ = pd.read_csv('Beijing.csv')#read data
dic = {1: "Winter",
       2: "Winter",
       3: "Spring",
       4: "Spring",
       5: "Spring",
       6: "Summer",
       7: "Summer",
       8: "Summer",
       9: "Fall",
       10: "Fall",
       11: "Fall",
       12: "Winter"}
AQ['season'] = AQ['month'].map(dic)
AQ = AQ.dropna()
AQ = AQ[AQ['pm2.5'] > 0]  # remove non-positive response values
AQ['pm25_log'] = np.log(AQ['pm2.5']) #do the log transformation on the response variable
# remove the outliers
AQ_cv = AQ[AQ['cbwd'] == 'cv']
AQ_cv = AQ_cv[(AQ_cv['pm25_log'] > 2.2) & (AQ_cv['pm25_log'] < 6.8)]
AQ_NE = AQ[AQ['cbwd'] == 'NE']
AQ_NE = AQ_NE[(AQ_NE['pm25_log'] > 0.5)]
AQ_NW = AQ[AQ['cbwd'] == 'NW']
AQ_NW = AQ_NW[(AQ_NW['pm25_log'] > 0.5)]
AQ_SE = AQ[AQ['cbwd'] == 'SE']
# sort to find the upper cutoff for SE (the largest retained value is 6.291569)
AQ_SE = AQ_SE.sort_values(['pm25_log'], ascending=[False])
AQ_SE = AQ_SE[(AQ_SE['pm25_log'] > 0.5) & (AQ_SE['pm25_log'] < 6.291569)]
AQ_new = pd.concat([AQ_cv, AQ_NE, AQ_NW, AQ_SE])

Then we fit the model and print the summary of the fit.

#fit the model
mixed = smf.mixedlm("pm25_log ~ year+month+day+hour+DEWP+TEMP+PRES+Is+Ir", AQ_new, groups = AQ_new["cbwd"], re_formula="~hour+PRES")
mixed_fit = mixed.fit()
#print the summary
print(mixed_fit.summary())

Here is the output of the linear mixed effects model in Python. The coefficients are close to the results from the two R packages, lme4 and nlme. Only hour and PRES are not significant, which agrees with the result from lme4.

This is the plot of residuals against fitted values:

import matplotlib.pyplot as plt
# fitted values = observed response minus residuals
plt.scatter(mixed_fit.fittedvalues, mixed_fit.resid, alpha = 0.5)
plt.title("Residual vs. Fitted in Python")
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.savefig('python_plot.png',dpi=300)
plt.show()

Conclusion

Here is a comparison of the fitted coefficients from the three tools:

             R (lme4)        R (nlme)        Python
Intercept    35.46 (***)     32.760 (***)    38.066 (***)
Year         -0.0056 (*)     -0.004          -0.007 (*)
Month        -0.0208 (***)   -0.021 (***)    -0.022 (***)
Day          0.006 (***)     0.006 (***)     0.006 (***)
Hour         0.012           0.013 (***)     0.012
DEWP         0.060 (***)     0.060 (***)     0.061 (***)
TEMP         -0.078 (***)    -0.078 (***)    -0.079 (***)
PRES         -0.019          -0.020 (***)    -0.019
Is           -0.035 (***)    -0.031 (***)    -0.035 (***)
Ir           -0.080 (***)    -0.078 (***)    -0.081 (***)

*p<0.05; **p<0.01; ***p<0.001

In this study, we applied a linear mixed effects model to the Beijing air pollution (pm2.5) dataset. After removing some of the outliers, we chose the variable “cbwd” (combined wind direction) as our grouping variable and modeled it as a random effect. This grouping variable was selected based on common sense and on the plots that helped us find the random slopes. As shown in the analyses above, we fit the model in three ways, in R and in Python. The results strongly suggest that there are random effects in the dataset, and the BIC comparison shows that the linear mixed effects model fits better than a simple linear regression.

We also draw some suggestions for future research using this method. First, the method may not work well without thorough knowledge of the dataset; for example, this model fits worse without cleaning the data and removing the outliers. Second, when there are many variables, it is harder to decide which ones to put in the random-slope part of the model formula (the term of the form (1 + a + b + … + z | group)), so model selection deserves extra attention when applying a linear mixed effects model. Finally, different R packages should be compared when fitting the model: the lmer function (from the lme4 library) and the lme function (from the nlme library) gave approximately the same results here.
