Poisson Regression

Poisson regression is a form of the generalized linear model and it is used to model count data and contingency tables. Poisson regression assumes that the response variable Y has a Poisson distribution (its mean is equal to its variance), and that the logarithm of its expected value can be modeled by a linear combination of unknown parameters.

In this tutorial, we will use Poisson regression to study the relationship between the number of satellites (male crabs residing near a female crab), for a female horseshoe crab and the different feasures of the female horseshoe crab, such as its color, the width of its back, etc.

Example: Horseshoe Crabs and Satellites

This example refers to the horseshoe crabs and satellites dataset from J. Brockmann’s work on nesting horseshoe crabs in 1996. See also Section 4.3 of Agresti’s study in 1996 and 2002.

Description of the Horseshoe Crab data

In their studies, a female horseshoe crab always had a male crab that was attached to the female nest. The purpose of this study is to investigate factors that affect whether a female crab had any other “Satellites” males that resided near the female crab. In this dataset, there are 173 female crabs in total. There are various variables that are considered to affect this fact including the female’s color (C), spine condition (S), weight (Wt), and carapace width (W). The response variable is the number of satellites male crabs (Sa) for each female crab.

R

Let’s get started by fitting a Poisson regression model with only one variable, the width of female’s carapace (W) to predict the number of satellites crabs attached to females. Hereby, let’s try GLM() first.

# Load data
library(plyr)
library(ggplot2)
# Rename the columns to be observations(Obs), crab's color(C), spine condition(S),
# carapace width(W), weight(Wt), and number of satellites for female crabs(Sa)
colnames(crab) <- c("Obs", "C", "S", "W", "Wt", "Sa")
summary(crab)
##       Obs               C               S               W        
##  Min.   :  2.00   Min.   :1.000   Min.   :1.000   Min.   :21.00  
##  1st Qu.: 44.75   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:24.88  
##  Median : 87.50   Median :2.000   Median :3.000   Median :26.10  
##  Mean   : 87.50   Mean   :2.442   Mean   :2.483   Mean   :26.29  
##  3rd Qu.:130.25   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:27.70  
##  Max.   :173.00   Max.   :4.000   Max.   :3.000   Max.   :33.50  
##        Wt              Sa       
##  Min.   :1.200   Min.   : 0.00  
##  1st Qu.:2.000   1st Qu.: 0.00  
##  Median :2.325   Median : 2.00  
##  Mean   :2.434   Mean   : 2.89  
##  3rd Qu.:2.812   3rd Qu.: 5.00  
##  Max.   :5.200   Max.   :15.00

Data Statistics Summary

From the summary of descriptive statistics of all variables, each variable has 173 valid observations. The variance of the response variable, Sa, is higher than its mean, which does not satisfy the assumption for a Poisson regression. The Negative Binomial model could account for this situation. However, since we are focusing on Poisson regression in this tutorial, we will proceed with poisson model nonetheless.

Next, we propose to use tapply function to investigate the summary statistics of Sa by color type. From the statistic results, we can see that the conditional means and variances of the number of satellites are quite different between different color types, which suggests that color is a good candidate for predicting the number of satellites. The histogram of satellite counts for each female crab with different colors is shown as below.

# Factorize variable "C": crab color
crab$C=as.factor(crab$C)
with(crab, tapply(Sa, C, function(x) {
  sprintf("M (SD) = %1.2f (%1.2f)", mean(x), sd(x))
}))
##                      1                      2                      3 
## "M (SD) = 4.08 (3.12)" "M (SD) = 3.24 (3.18)" "M (SD) = 2.23 (2.60)" 
##                      4 
## "M (SD) = 2.05 (3.62)"
# Plot 
ggplot(crab, aes(Sa, fill = C)) +
  geom_histogram(binwidth=.5, position="dodge")

Poisson Regression of Satellite Number on Carapace Width

Here we will fit a Poisson regression model using the glm() function in R.

# Poisson regression of Sa on W
poireg=glm(crab$Sa~crab$W+1,family=poisson(link=log))
summary(poireg)
## 
## Call:
## glm(formula = crab$Sa ~ crab$W + 1, family = poisson(link = log))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8358  -1.9858  -0.4898   1.0964   4.9364  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.25639    0.54520  -5.973 2.33e-09 ***
## crab$W       0.16195    0.02009   8.062 7.53e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 626.77  on 171  degrees of freedom
## Residual deviance: 564.34  on 170  degrees of freedom
## AIC: 919.7
## 
## Number of Fisher Scoring iterations: 6

The summary output of the this simple Poisson regression model is shown above. The prediction model is: \(log(Sa) = -3.30476 + 0.16405 \times W\). We can see that the estimated coefficient for the width of female’s carapace is 0.16405 and its standard error is 0.01997, which is pretty small. Moreover, the p-value is significantly low. Since the coefficient of carapace width is positive, a female crab with wider carapace will have more male satellites on an order of \(e^{0.1640} = 1.18\), which mean by one unit of width increase, Sa will be 1.18 times its original value.

From the model statistics summary, we know that the residual deviance is 567.88 on 171 degrees of freedom, and the scale factor residual deviance/df: 567.88/171=3.321, is larger than one, indicating a potential problem with overdispersion. This result indicates that Thus we can see that the model doesn’t fit very well.

model_fit1=data.frame(crab, pred=poireg$fitted)
head(model_fit1)
##   Obs C S    W   Wt Sa     pred
## 1   2 3 3 26.0 2.60  4 2.596669
## 2   3 3 3 25.6 2.15  0 2.433792
## 3   4 4 2 21.0 1.85  0 1.155455
## 4   5 2 3 29.0 3.00  1 4.220994
## 5   6 1 2 25.0 2.30  3 2.208431
## 6   7 4 3 26.2 1.30  0 2.682151
#poireg$linear.predictors
#exp(poireg$linear.predictors)

As shown above, we can also get count predictions for each observation and the linear predictor values from R output. We can tell from the output results that the count prefictions (crab$fitted) and linear predictor values are the log of response satellite counts. Take the first observation for example, pred is 3.81 and linear.predictors is 1.3377, and \(e^{1.3378}=3.810\).

Dealing with Overdispersion

As discussed in the above discussion, there is an issue with overdispersion. One possible reason for the overdispersion is heterogeneity caused by the differences between subjects within each covariate combination. For example, some female crabs with similar carapace width may still have diversed number of satellites residing her. To address this problem and improve fitness of model, we assume that we only have one predictor: the carapace width.

#tapply(crab$Sa, crab$W, function(x)c(mean=mean(x), variance=var(x)))
disp_fit=glm(crab$Sa~crab$W, family=quasipoisson(link=log), data=crab)
summary.glm(disp_fit)
## 
## Call:
## glm(formula = crab$Sa ~ crab$W, family = quasipoisson(link = log), 
##     data = crab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8358  -1.9858  -0.4898   1.0964   4.9364  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.2564     0.9744  -3.342  0.00102 ** 
## crab$W        0.1620     0.0359   4.510  1.2e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 3.194402)
## 
##     Null deviance: 626.77  on 171  degrees of freedom
## Residual deviance: 564.34  on 170  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6
summary.glm(disp_fit)$dispersion
## [1] 3.194402

Since the dispersion estimation is 3.182205, we know that the variance of the response variable, namely the number of satellites for each female with a specific carapace width, is about 3.18 times of its mean value. From this output, the mean and variance are different for the random component, thus the response “Sa”" doesn’t follow a Poisson distribution.

We will then apply library(MASS) to fit a negative binomial model and theta is the dispersion parameter. The example R code below is to estimate the dispersion parameter. Assume that carapace width is the only predictor in our assumption here.

library(MASS)
# disp.est=glm(Sa~W, data=crab, family=negative.binomial(theta=1, link="identity"), start=poireg$coef)
# summary(disp.est)
# glm.nb() function is used to estimate dispersion parameter.
disp.est2=glm.nb(Sa~W, data=crab, init.theta = 1, link=identity, start=poireg$coef)
summary(disp.est2)
## 
## Call:
## glm.nb(formula = Sa ~ W, data = crab, start = poireg$coef, init.theta = 0.9216988823, 
##     link = identity)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7819  -1.4068  -0.2551   0.4389   2.1075  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -11.56008    1.07296  -10.77   <2e-16 ***
## W             0.55048    0.05109   10.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.9217) family taken to be 1)
## 
##     Null deviance: 214.14  on 171  degrees of freedom
## Residual deviance: 193.94  on 170  degrees of freedom
## AIC: 747.1
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.922 
##           Std. Err.:  0.166 
## 
##  2 x log-likelihood:  -741.102

Other than the reason of overdispersion, missing important preditors can also lead to poor fit of Poisson regression model. By adding other candidate predictors, we can make comparisons and see if goodness of fit can be improved or not.

Adding Another Predictor “Crab Color”-“C”

Hereby, we’ve already factorized the variable crab color. The optimized model can be fitted as below:

poireg_2=glm(Sa~W+C, family=poisson(link=log), data=crab)
summary(poireg_2)
## 
## Call:
## glm(formula = Sa ~ W + C, family = poisson(link = log), data = crab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0400  -1.9312  -0.5192   1.0042   4.7781  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.60793    0.59117  -4.411 1.03e-05 ***
## W            0.14780    0.02096   7.051 1.78e-12 ***
## C2          -0.21275    0.15391  -1.382   0.1669    
## C3          -0.43800    0.17638  -2.483   0.0130 *  
## C4          -0.44981    0.20915  -2.151   0.0315 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 626.77  on 171  degrees of freedom
## Residual deviance: 556.12  on 167  degrees of freedom
## AIC: 917.48
## 
## Number of Fisher Scoring iterations: 6

The residual deviance is 559.34 on 168 degrees of freedom. The scale factor residual deviance/df: 559.34/168=3.329405, is larger than one, Thus, adding another predictor crab color doesn’t improve goodness of fit for the model. The overdispersion seems to be mainly caused by heterogeneity of response variable. It can also be shown that the model does not fit very well by ploting the fitted values vs. true values.

#### Make a plot of the results
plot(crab$Sa, type="o", bty="l",col="red", lwd=1.5 , pch=17,
     main="Plot of Observed and Predicted Sa", 
     xlab="Width", ylab="Number of Satellites")
lines(poireg_2$fitted, lwd=1.5, pch=19, type="o", col="blue")
legend(10,15, c("obs","pred"), pch = c(17,19), col=c("red","blue"))

Stata

Let’s get started by fitting a Poisson regression model with only one variable, the width of female’s carapace (W) to predict the number of satellites crabs attached to females.

qui log using ex, replace

import delimited crab.csv, clear

rename v1 Obs
rename v2 C
rename v3 S
rename v4 W
rename v5 Wt
rename v6 Sa

sum Sa C S W Wt

Data Statistics Summary

From the summary of descriptive statistics of all variables, each variable has 173 valid observations. The variance of the response variable, Sa, is higher than its mean , which does not satisfy the assumption for a Poisson regression. The Negative Binomial model could account for this situation. However, since we are focusing on Poisson regression in this tutorial, we will proceed with poisson model nonetheless.

Let’s continue by examining the variations in the number of satellites within each categorical variable. There are two categorical variables in this dataset, Color and Spine. The table belows show the mean of the number of satellites by color, or Spine condition. The result suggests that the variable Color is a good candidate for predicting the nubmer of satellites, since the mean of the response variable appears to vary by Color.r. The histogram of satellite counts for each female crab with different colors is shown as below.

tabstat Sa, by(C) stats(mean sd n)

tabstat Sa, by(S) stats(mean sd n)

histogram Sa, discrete freq scheme(s1mono)

Poisson Regression of Satellite Number on Carapace Width

Below we use the poisson command to estimate a Poisson regression model. We use the vce(robust) option to obtain robust standard errors for the parameter estimates as recommended by Cameron and Trivedi (2009) to control for mild violation of underlying assumptions.

poisson Sa W, vce(robust)

The output begins with the iteration log, which gives the values of the log of pseudolikelihoods starting with the null model. The last value in the iteration log is the final value of the log of pseudolikelihood for the full model and is displayed again. Because we asked for robust standard errors, the maximized likelihood is actually a pseudolikelihood. The estimates of the parameters are maximum likelihood estimates and the estimation of the variance-covariance matrix of the parameter estimates leads to the pseudolikelihood. Log pseudolikelihood values can be used to compare models.

The header information is presented next. On the right-hand side, the number of observations used in the analysis (173) is given, along with the Wald chi-square statistic with three degrees of freedom for the full model, followed by the p-value for the chi-square. This is a test that all of the estimated coefficients are equal to zero-a test of the model as a whole. From the p-value, we can see that the model is statistically significant. The header also includes a pseudo-R2, which is 0.07 in this example.

Below the header you will find the Poisson regression coefficients for each of the variables along with robust standard errors, z-scores, p-values and 95% confidence intervals for the coefficients. For example, the coefficient for Width is 0.164. This means that the expected increase in log count for a one-unit increase in Width is .164.

Based on the model, the relationship between the number of satellites and the width of the female crab can be described by \[ log(Sa) = -3.305+0.164 \times Width \]

To help assess the fit of the model, the estat gof command can be used to obtain the goodness-of-fit chi-squared test. This is not a test of the model coefficients (which we saw in the header information), but a test of the model form: Does the poisson model form fit our data?

estat gof

We conclude that the model does not fit very well because the goodness-of-fit chi-squared test is statistically significant. We may try to if our linearity assumption holds and/or if there is an issue of over-dispersion. For our data, the problem lies in the over-dispersion, since the conditional variance exceeds the conditional mean. We can use Negative Binomial model to account for the over-dispersion.

Dealing with Overdispersion

Here we fit a Negative Binomial model to the variable Width.

nbreg Sa W

On the coefficients table, /lnalpha is the estimated log-transformed over- dispersion parameter. For a Poisson model, this alpha value is constrained to zero. Stata finds the maximum likelihood estimate of the log of alpha and then use this to calculate alpha.

Below the table of coefficients, you will find a likelihood ratio test that alpha equals zero-the likelihood ratio test comparing this model to a Poisson model. In this example the associated chi-squared value is 171.89 with one degree of freedom. This strongly suggests that alpha is non-zero and the negative binomial model is more appropriate than the Poisson model.

Adding Another Predictor “Crab Color”-“C”

Besides over-dispersion, the poor fit of the model could also arise from a lack of predictor variables. We can include a categorical variable, Color, to the model and see if it can improve the fit of the model. The i. before Color indicates that it is a factor variable (i.e., categorical variable), and that it should be included in the model as a series of indicator variables.

poisson Sa W i.C, vce(robust)

We can use fitstat command to estimate the goodness of fit of the model.

fitstat

From the output, we can see that adding another variable does not improve the fit. Therefore, we can conclude that the poor fit mainly comes from the over-dispersion of the data.

Sometimes, we might want to present the regression results as incident rate ratios, we can use the irr option. These IRR values are equal to our coefficients from the output above exponentiated.

poisson, irr

Predictions Using the Poisson Regression Model

We can use the Poisson regression model to predict the number of satellites given the Width and the Color of the female horseshoe crab.

Below we use the margins command to calculate the predicted counts at each level of Color, holding all other variables (in this example, Widdth) in the model at their means.

margins C, atmeans

Below we will obtain the predicted counts for values of Width that range from 20 to 30 in increments of 5.

margins, at(W=(20(5)30)) vsquish

We can graph the predicted number of Satellites with the commands below. The graph indicates that the most satelliles are predicted for crabs with Color = 1, especially if the female horseshoe crab has a large carapace width. The lowest number of predicted awards is for crabs with Color = 3 and Color = 4.

predict c
separate c, by(C)
twoway scatter c1 c2 c3 c4 W, connect(l l l l) sort ///
       ytitle("Predicted Count") ylabel( ,nogrid) legend(rows(4)) ///
       legend(ring(0) position(10)) scheme(s1mono)

Python

Let’s get started by fitting a Poisson regression model with only one variable, the width of female’s carapace (W) to predict the number of satellites crabs attached to females.

First of all, we need to import all the packages and load the data.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import poisson
from statsmodels.formula.api import negativebinomial
import numpy as np
import matplotlib.pyplot as plt
import math
crab = pd.read_csv("crab.csv", names = ['Obs', 'C', 'S', 'W', 'Wt', 'Sa'])
crab.head(5)

Data Statistics Summary

There are 173 valid observations and their distributions seem quite reasonable. For the response variable Sa, i.e. the number of satellites for female crabs, the mean and variance are not that close to each other. But since we want to use Poisson Regression, we still assume that mean and variance of out response variable are roughly equal and see what will happen if we use Poisson Regression on this dataset.

crab.describe()

Then we compute the average numbers of satellites by color type and the result seems to suggest that color type is a good candidate for predicting the number of satellites, our outcome variable, because the mean value of the outcome appears to vary by color. However, the means and variances within each level of C, i.e. the conditional means and variances, are not quite similar. For example, the average number of satellites with color type 1 is about 4.08, whereas the standard deviation is around 3.12, which indicates the variance is about 9.73.

uniqCols = sorted(crab['C'].unique())
mn_c = []
for elem in uniqCols:
    mn_c.append(crab[crab['C'] == elem]['Sa'].mean())
std_c = []
for elem in uniqCols:
    std_c.append(crab[crab['C'] == elem]['Sa'].std())
    
mn_c = np.array(mn_c)
std_c = np.array(std_c)
mn_std = pd.DataFrame(mn_c, columns = ["mean"])
mn_std["SD"] = std_c

print mn_std

We can also plot a conditional histogram separated out by color type to show the distribution.

histData = []
for elem in uniqCols:
    histData.append(crab[crab['C'] == elem]['Sa'].values)
    
plt.hist(tuple(histData), bins = 10, normed = True, histtype = 'bar', label = map(lambda x: 'Color' + str(x), uniqCols))
plt.legend()
plt.ylabel('Count')
plt.title('Histogram for each color')
plt.show()

Poisson Regression of Satellite Number on Carapace Width

Now, we are ready to perform our Poisson model analysis using the poisson function imported from the package, statsmodels.formula.api.

First, regress the response, Sa, on the two continuous predictors, W and Wt. Store this model in the object m1 and then summarize the model.

m1 = poisson('Sa ~ W', data = crab).fit()
print m1.summary()

The summary output of the this simple Poisson regression model is shown above. The prediction model is: \(log(Sa) = -3.30476 + 0.16405 \times W\). We can see that the estimated coefficient for the width of female’s carapace is 0.164 and its standard error is 0.02, which is pretty small. Moreover, the p value is significantly low. Since the coefficient of carapace width is positive, a female crab with wider carapace will have more male satellites on an order of \(e^{0.1640} = 1.18\), which mean by one unit of width increase, Sa will increase and be multiplied by 1.18.

The fitted values are shown below. Although the predictor, W, appears to be significant, the fitted values are not close to the true values.

model_fit1 = crab
preds_1 = m1.predict()
model_fit1['preds'] = preds_1

print model_fit1.head(5)

Also, from the model statistics summary, The information on LLR test is displayed. We see that llR p value is very small, which indicates that the model doesn’t fit well. The lack of fit maybe due to missing data, covariates or overdispersion. Here, it may be caused by overdispersion, since according to the data summary part, the mean and variance of Sa are not quite similar.

Dealing with Overdispersion

As discussed in the above discussion, there can be an issue with overdispersion. Then, it may be more appropriate to fit a negative binomial model in this case.

m_nb = negativebinomial('Sa ~ W', data = crab).fit()

Adding Another Predictor “Crab Color”-“C”

Besides the impact of overdispersion, there are some other influential factors, such as the lack of predictors. We want to check whether the lack of fit is caused by the lack of the number of predictors. Then, the variable, C, is added to the model.

First, we need to create a data frame with dummy variables.

col_dummies = pd.get_dummies(crab['C']).rename(columns = lambda x: 'Col' + str(x))
dataWithDummies = pd.concat([crab, col_dummies], axis = 1) 
dataWithDummies.drop(['C', 'Col1'], inplace = True, axis = 1)
dataWithDummies = dataWithDummies.applymap(np.int)
dataWithDummies.head(5)

Let color type 1 be the reference level. Compute the design matrix, X and put the response aside, then store it into Y.

columns = ['W', 'Col2', 'Col3', 'Col4']
X = [elem for elem in dataWithDummies[columns].values]
X = sm.add_constant(X, prepend = False)
Y = [elem for elem in dataWithDummies['Sa'].values]

Fit a poisson model using X and Y. Then summarize the model.

m2 = sm.Poisson(Y, X).fit()
print m2.summary()

From the output, LLR p value is still significant, which indicates that adding another predictor crab color doesn’t improve goodness of fit for the model.

We can also plot the true values of the response vs the fitted values to see the goodness of the fit.

preds = m2.predict()
plt.plot(range(len(Y)), Y, 'r*-', range(len(Y)), preds, 'bo-')
plt.title('True Sa vs fitted values')
plt.legend(['Real Values', 'Fitted Values'])
plt.show()