Zero-inflated Poisson model

Motivation

First let’s recall poisson regression model:

Y|X ~ Poisson(exp{X\(\beta\)})

This model will not be able to deal with some situations we often see in count data as zero-inflation:

There are many more zeros than would be predicted by a Poisson model. A mild amount of zero inflation can be handled by the negative binomial model. However, a large amount will result in a poor fit, especially if the distribution becomes bimodal.

Model Introduction

To handle this type of data we will fit a mixture model: \[ Yi=\left\{ \begin{aligned} & Xi &with~~p\\ & 0 &with~~1-p \end{aligned} \right. \]

Here Xi is some distribution on the integers, most commonly Poisson in our model. According to this expression, we can also regard Yi as a product of a Poisson variable and a Bernoulli variable.

Simple Derivation

According to the expression above, we can see zero-inflated poisson model is a kind of latent variable model. We suppose there is some sequence of unobserved Zi~Bernoulli(fi), and that Yi = Xi*Zi.

Thus computing the likelihood P(Y) = P(Y1, … ,Yn) is intractible because it requires marginalizing over all \(2^{n}\) possible values of Z. Hence, we must use a different strategy to obtain the MLE:

We can approximately maximize the likelihood using the EM algorithm. Since it is complicated, we just explain some key ideas as a simple derivation:

Zi ~ Bernoulli(\(\phi\))

Xi ~ Poisson(\(\lambda\))

Yi = XiZi

The conditional expectation of Zi given Yi is:

P(Zi = 1 | Yi > 0) = 1

P(Zi = 1 | Yi = 0) = \(\frac{P(Yi=0,Zi=1)}{P(Yi=0)}\) = \(\frac{\phi e^{-\lambda}}{(1-\phi)+\phi e^{-\lambda}}\) =: \(\omega(\phi,\lambda)\)

The complete log-likelihood is:

Log P(Y,Z) = \(\sum_{i}(1-Zi)log[(1-\phi)1(Yi=0)] + Zi[log\phi+logf(yi)]\)

Taking E(Z|X,\(\phi0\),\(\lambda0\)), we get:

Q(Y;\(\phi\), \(\lambda\)) = \(\sum_{Yi>0}[log\phi+logf(Yi)]+\sum_{Yi=0}[1-\omega(\phi0,\lambda0)]log(1-\phi)+\omega(\phi0,\lambda0)[log(\phi)-\lambda]\)

We maximize Q(Y) to get a new estimate \(\phi0\), and we iterate.

Fitting Zero-inflated model in R

Description of the data

This data comes from a study of nesting horseshoe crabs (J. Brockmann, Ethology,102: 121, 1996).

Each female horseshoe crab in the study had a male crab attached to her in her nest. The study investigated factors that affect whether the female crab had any other males, called satellites, residing nearby her. The response outcome for each female crab is her number of satellites.

The explanatory variables thought possibly to affect this was the female crabs shell width and weight (continuous variables). Moreover, the color and type of spine (catagorical variables) are also thought possible as predictors.

Analysis with R

Descriptive analysis

crabs = read.table('http://www.stat.ufl.edu/~aa/glm/data/Crabs.dat', header=T)
save(crabs, file='crabs.RData')
load('crabs.RData')
ggplot(crabs, aes(x=y)) + stat_count() + xlab("Number of satellites")

We can see there are many observations with value 0, which is appropriate to apply zero-inflated model here.

Model Fitting

To fit zero-inflated poisson model in R, we can use function pscl::zeroinfl()

fit = zeroinfl(y ~ weight+width|color+spine, crabs)
summary(fit)
## 
## Call:
## zeroinfl(formula = y ~ weight + width | color + spine, data = crabs)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6349 -0.9258 -0.3129  0.6313  4.3622 
## 
## Count model coefficients (poisson with log link):
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  2.55738    1.01692   2.515  0.01191 * 
## weight       0.47758    0.18395   2.596  0.00943 **
## width       -0.08580    0.05331  -1.609  0.10752   
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.1164     0.6699  -3.159 0.001582 ** 
## color         0.8146     0.2330   3.496 0.000472 ***
## spine        -0.2202     0.2258  -0.975 0.329484    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 11 
## Log-likelihood: -370.1 on 6 Df

Comparing with basic possion regression model

poissonfit = glm(y ~ weight + width + color + spine, family = poisson, data = crabs)
vuong(poissonfit, fit)
## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A    p-value
## Raw                   -5.208586 model2 > model1 9.5143e-08
## AIC-corrected         -5.146224 model2 > model1 1.3289e-07
## BIC-corrected         -5.047902 model2 > model1 2.2334e-07

The Vuong test compares the zero-inflated model with an ordinary Poisson regression model. In this example, we can see that our test statistic is significant, indicating that the zero-inflated model is superior to the standard Poisson model.

Visualization

In zero-inflated model, we found that only weight and color are significant in this model. We can make some plots to confirm this result:

df = data.frame(weight=crabs$weight, color=crabs$color, fitted=fitted(fit))
ggplot(df, aes(x=weight, y=fitted, color=factor(color))) + geom_point()

df2 = data.frame(weight=crabs$weight, fitted=fitted(fit))
ggplot(df2, aes(x=weight, y=fitted)) + geom_point()

It’s easy to see that there’s an obvious positive correlation between number of satellites and weight. And different colors have different regression lines (nearly parallel).

STATA

Description of Data

The file we will use is Crabs.txt. There are five variables y, weight, width, color, and spine in our dataset. We will use STATA to build a Zero-inflated Poisson model to illustrate the practical use. Let¡¯s look at the data.

First, table 1 shows the top 10 rows. Most observations’ y are zero, so it is reasonable to use zero-inflated poisson regression.

. insheet using Crabs.txt, clear delimiter(";")
(7 vars, 173 obs)

. drop n

. list in 1/10
Table 1

Table 1

Second, in order to inspect y in general, I generate a histogram of y.

. histogram y, discrete freq
(start=0, width=1)
Figure 1

Figure 1

It is clear that there are many 0s in y. According to the data, color and spine are also categorical variables. But the following result suggests that we should not use them as response of the model because they are positive.

. tab1 color spine
Table 2

Table 2

Ok, now we have learned about the data, the next part is regression.

Regression

Having data loaded in the STATA environment, we will run the zip command with weight and width as predictors of the y, color and spine as the predictor of the excess zeros.

. zip y weight width, inflate(color spine)
Figure 2

Figure 2

Although the result is clear, we cannot know whether the model is better than poisson model. For comparison, we have included the vuong option which provides a test of the zero-inflated model versus the standard poisson model.

. zip y weight width, inflate(color spine) vuong
Figure 3

Figure 3

The output includes the following items:

  • Begins with the iteration log giving the values of the log likelihoods starting with a model that has no predictors. The last value in the log is the final value of the log likelihood for the full model and is repeated below.

  • Next comes the header information. On the right-hand side the number of observations used , number of nonzero observations are given along with the likelihood ratio chi-squared. This compares the full model to a model without count predictors, giving a difference of two degrees of freedom. This is followed by the p-value for the chi-square. The model, as a whole, is statistically significant.

  • Below the header you will find the Poisson regression coefficients for each of the count predicting variables along with standard errors, z-scores, p-values and 95% confidence intervals for the coefficients.

  • Following these are logit coefficients for the variable predicting excess zeros along with its standard errors, z-scores, p-values and confidence intervals.

  • Below the various coefficients you will find the results of the Vuong test. The Vuong test compares the zero-inflated model with an ordinary poisson regression model. A significant z-test indicates that the zero-inflated model is better.

Robust Method

Cameron and Trivedi (2009) recommend robust standard errors for poisson models. We will rerun the model with the vce(robust) option. We did not include this option in the second model because robust and vuong options cannot be used in the same model.

. zip y weight width, inflate(color spine) vce(robust)
Figure 4

Figure 4

Now we can move on to the specifics of the individual results.

  • Using the robust option has resulted in a fairly large change in the model chi-square, which is now a Wald chi-square. This statistic is based on log pseudo-likelihoods instead of log-likelihoods.
  • The coefficients for weight and width can be interpreted as follows:
    • For each unit increase of weight the expected log count of the response variable increases by .478.
    • For each unit increase of width the expected log count of the response variable decreases by .086.
  • The inflate coefficient for spine suggests that for each unit increase in person the log odds of an inflated zero decrease by .220.

Marginal Analysis

We can use the margins to help understand our model. One margins command will give the expected y for values of spine from zero to three at both levels of color.

. margins, at(color=(1(1)4) spine=(1(1)3)) vsquish
Figure 5

Figure 5

The following marginsplot shows the 95% CIs for each color and spine.

. marginsplot
Figure 6

Figure 6

Python

Model determination

from __future__ import division
import seaborn as sns
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
import statsmodels.miscmodels.count as smm

We first want to check what regression model would be properly suited to our data. If we are looking for Zero-Inflated regressions in particular, we can check this by developing a histogram. This allows us to observe the counts for which zero is observed.

import pandas as pd
from matplotlib import pyplot as plt
Crabs_datanew = pd.read_csv('C:/Users/XXXX/Downloads/Crabs.dat', sep = '\s+', header = None, skiprows = 1)
plt.hist(Crabs_datanew[1])
#plt.show()

So, we can see that there are several values that are occur as 0 in the response variable. This implies that we need to apply a Zero-inflated regression to understand which variables can explain the large amount of 0s in the response. Since we do not have a large difference between the values that are zero and the values that are not, we can assume that we must use Zero-inflated Poisson Regression. We can conduct this regression by using the Maximum Likelihood Estimator for Zero-inflated Poisson. This can be found using the PoissonZiGMLE function in the statsmodels package, and then use this MLE to fit the data.

Regression

import pandas as pd
import statsmodels.miscmodels.count as smm
Crabs_datanew = pd.read_csv('C:/Users/XXXXX/Downloads/Crabs.dat', sep = '\s+', header = None, skiprows = 1)
X1 = Crabs_datanew[[2,3,4,5]]
MLE = smm.PoissonZiGMLE(Crabs_datanew[1], X1)
ZiP_model = MLE.fit()
ZiP_model.summary()

The results that the Zero-inflated Poisson Regression give show a few results. We see that the width variable is the only variable that is a significant predictor for the response. Our p-values for the predictors weight, color, and spine are relatively large, showing us that these predictors are not significant. These results do not correspond with the results observed with the corresponding functions in R or STATA, which we will discuss in the next section.

Inconsistencies

Now, it can be seen that the results do not match the results given by the other languages. The issue with the Zero-inflated Poisson regression function in statsmodels is that the function does not let you select which of the predictors are predictors for the excess zeros. We can explore the function to study this.

def zip_pmf(x, pi=pi, lambda_=lambda_):
    if pi < 0 or pi > 1 or lambda_ <= 0:
        return np.zeros_like(x)
    else:
        return (x == 0) * pi + (1 - pi) * stats.poisson.pmf(x, lambda_)
class ZeroInflatedPoisson(GenericLikelihoodModel):
    def __init__(self, endog, exog=None, **kwds):
        if exog is None:
            exog = np.zeros_like(endog)
            
        super(ZeroInflatedPoisson, self).__init__(endog, exog, **kwds)
    
    def nloglikeobs(self, params):
        pi = params[0]
        lambda_ = params[1]

        return -np.log(zip_pmf(self.endog, pi=pi, lambda_=lambda_))
    
    def fit(self, start_params=None, maxiter=10000, maxfun=5000, **kwds):
        if start_params is None:
            lambda_start = self.endog.mean()
            excess_zeros = (self.endog == 0).mean() - stats.poisson.pmf(0, lambda_start)
            
            start_params = np.array([excess_zeros, lambda_start])
            
        return super(ZeroInflatedPoisson, self).fit(start_params=start_params,
                                                    maxiter=maxiter, maxfun=maxfun, **kwds)

As we can see in the python code, the function uses the endogenous and exogenous variables to create the Maximum Likelihood Estimates, but it does not let the user select which variables are the predictors that can be predictors for the excess zeros. The function is fundamentally varied from the corresponding functions in the other languages we looked at.