Introduction

Zero-truncated Negative Binomial Regression is used to model count data for which the value zero cannot occur and for which over dispersion exists. There are a lot of count variables that cannot have a value of 0, such as the duration patients are in hospital and the age(measured in years) of an animal. When people want to use regression on these count variables, they may want to use Negative Binomial Regression first because it is a useful model for the count data. However, the underlying assumption of Negative Binomial distributions may cause a problem because the range of these distributions include zero. If the mean of the response is small, and it does not contain zeros, then the estimated parameters and standard errors obtained by GLM may be biased, which may cause bad effects when interpreting the results using the esimated parameters. In this situation, the Zero-Truncated Negative Binomial Regression model can be used to solve this problem. The formula of the Zero-Truncated Negative Binomial Regression model is: \[ Y_{i} \sim NB(\mu_{i}, k) \] \[ E(Y_{i}) = \mu_{i}, Var(Y_{i}) = \mu_{i}+\frac{\mu_{i}^2}{k}\] The link function: \[ ln(\mu_{i}) = \eta_{i}\]

The linear predictor: \[ \eta_{i} = \beta_{0} + \beta_{1}X_{i1} +...+ \beta_{p}X_{ip}\] .

Data Background

The data used in this tutorial is the Abalone Dataset, which comes from an original study called “The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait” by Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn and Wes B Ford (1994). The link of the data: https://www.kaggle.com/rodolfomendes/abalone-dataset . This data has 4177 rows and 9 columns. We will focusing on the following variables: Rings(Can give the age of the Abalone in years by adding 1.5), Sex(M, F, and I (infant)) and Length(Longest shell measurement). The response variable is Rings. Since every abalone have Rings(as a symbol that can represent the age of abalone), we think Rings are zero-truncated and may suitable to be treated as the response variable of the Zero-Truncated Negative Binomial Regression model.

Overview of the examples analyzing the Abalone data

We will use three languages: R(using packages VGAM, tidyverse and MASS), SAS and Stata to analyze the abalone data. In all of these three parts, we do the following things: load and clean the data first, and then visualize the response Rings to test whether the data is suitable to use the Zero-Truncated Negative Binomial Regression model. In addition, we use both Zero-Truncated Negative Binomial Regression and Negative Binomial Regression on this data and use log-likelihood to compare. At last, we interpret the fitting results of the Zero-Truncated Negative Binomial Regression. Besides, we plot the prediction to visualize the regression results, and produce a nicely formatted table to show the regression results.
The R code can be seen here: https://github.com/tlwangzi123/Stats-506-Group-Project-Group-6/blob/master/gp.R
The SAS code can be seen here: https://github.com/tlwangzi123/Stats-506-Group-Project-Group-6/blob/master/gp.sas
The Stata code can be seen here: https://github.com/tlwangzi123/Stats-506-Group-Project-Group-6/blob/master/gp.do

Languages

R

Data cleaning

Load the data into R using read.csv

library(VGAM)
library(tidyverse)
library(MASS)
abalone_full = read.csv("abalone.csv")

Use the function distinct from dplyr package to select the variable that we want and remove replicate rows. Here, we also sort the data by all variables in incresing order, because in the SAS part we use the command proc sort with option noduplicate to get a data that do not have duplicate rows. Besides, we show the dimension and the summary of the new data named abalone.

abalone = abalone_full %>%
  distinct(Rings, Sex, Length) %>%
  arrange(Rings, Sex, Length)
dim(abalone)
## [1] 1670    3
summary(abalone)
##  Sex         Length           Rings      
##  F:561   Min.   :0.0750   Min.   : 1.00  
##  I:464   1st Qu.:0.4350   1st Qu.: 8.00  
##  M:645   Median :0.5300   Median :10.00  
##          Mean   :0.5153   Mean   :10.86  
##          3rd Qu.:0.6150   3rd Qu.:13.00  
##          Max.   :0.8150   Max.   :29.00

Show the numbers of duplicate rows:

dim(abalone_full)[1] - dim(abalone)[1]
## [1] 2507

So, we drop 2507 duplicate rows.

Show the first 10 rows of the data abalone:

abalone[1:10,]
##    Sex Length Rings
## 1    I  0.075     1
## 2    I  0.150     2
## 3    I  0.110     3
## 4    I  0.130     3
## 5    I  0.140     3
## 6    I  0.160     3
## 7    I  0.165     3
## 8    I  0.180     3
## 9    I  0.190     3
## 10   I  0.195     3

Data exploration

Now, we are going to visualize the response Rings to test whether our dataset is suitable to use the Zero-Truncated Negative Binomial Regression model.
At first, from the summary of the data and the decription of the variables, the response Rings is a count variable that cannot have a value of 0.
In addition, show the mean, standard deviation and histogram of the response Rings:

sprintf("The Mean of Rings = %4.3f, and the SD of Rings = %4.3f", 
        mean(abalone$Rings), sd(abalone$Rings))
## [1] "The Mean of Rings = 10.863, and the SD of Rings = 4.085"
hist(abalone$Rings, xlab = "Rings", main = "Histogram of Rings")

The results show that the variable Rings has small means, and exists over dispersion. Therefore, the Zero-Truncated Negative Binomial Regression model is suitable for modeling our dataset.

ZTNB Regression

Use function vglm in VGAM package to use Zero-Truncated Negative Binomial Regression on the data abalone

t_nb1= vglm(Rings ~ Sex + Length, 
            family = posnegbinomial(), data = abalone)
summary(t_nb1)
## 
## Call:
## vglm(formula = Rings ~ Sex + Length, family = posnegbinomial(), 
##     data = abalone)
## 
## 
## Pearson residuals:
##                Min      1Q  Median     3Q    Max
## loge(munb)  -2.023 -0.7443 -0.2020 0.5512 4.3377
## loge(size) -11.145 -0.1913  0.3326 0.6262 0.7432
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1  1.63153    0.04093  39.857  < 2e-16 ***
## (Intercept):2  5.19065    0.59778   8.683  < 2e-16 ***
## SexI          -0.14064    0.02238  -6.285 3.28e-10 ***
## SexM          -0.02083    0.01743  -1.196    0.232    
## Length         1.50445    0.06727  22.365  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Number of linear predictors:  2 
## 
## Names of linear predictors: loge(munb), loge(size)
## 
## Log-likelihood: -4294.985 on 3335 degrees of freedom
## 
## Number of iterations: 7 
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):2'

The Log-likelihood of the model is -4294.895

Now, interpret the fitting result.
From the summary of the Zero-Truncated Negative Binomial Regression model, SexI has a estimate coefficient -0.14, and SexI is significant(because the p-value of SexI is 3.28e-10 < 0.05), which means the log count of Rings for a Infant Abalone is obviously less compared to Female Abalone by 0.14 when the length of these Abalones are the same.
SexM has a estimate coefficient -0.02, and SexM is not significant(because the p-value of SexM is 0.232 > 0.05), which means the log count of Rings for a Infant Abalone is inconspicuously less compared to Female Abalone by 0.02 when the length of these Abalones are the same.
Length has a estimate coefficient 1.5, and Length is not significant(because the p-value of Length < 2e-16 < 0.05), which means the log count of Rings has a obviously increase by 1.5 for each mm increase in Longest shell measurement when the sex keeps the same.
The first intercept((Intercept):1) is 1.63, this shows that the log count of the Rings of the Abalone 1.63 when all predictors equal zero.
The second intercept((Intercept):2) is 5.19, which means the value of the over dispersion parameter k in the Zero-Truncated Negative Binomial Regression model is 5.19.

NB Regression and Comparison of the two models

Fit the Negative Binomial Regression using glm.nb in MASS package on the data abalone

nb1 = glm.nb(Rings ~ Sex + Length, data = abalone)
summary(nb1)
## 
## Call:
## glm.nb(formula = Rings ~ Sex + Length, data = abalone, init.theta = 182.7883298, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2991  -0.7801  -0.2045   0.5340   3.6605  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.63290    0.04085  39.970  < 2e-16 ***
## SexI        -0.14041    0.02235  -6.282 3.33e-10 ***
## SexM        -0.02084    0.01741  -1.197    0.231    
## Length       1.50211    0.06713  22.376  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(182.7883) family taken to be 1)
## 
##     Null deviance: 2365.9  on 1669  degrees of freedom
## Residual deviance: 1533.4  on 1666  degrees of freedom
## AIC: 8600.7
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  183 
##           Std. Err.:  103 
## 
##  2 x log-likelihood:  -8590.748

The Log-likelihood of the model is -8590.748/2 = -4295.874

Compare these two models using Log-likelihood:
Because -4294.895 > -4295.874, we can conclude that the Zero-Truncated Negative Binomial Regression model fits better than the Negative Binomial Regression model.

SAS

Data cleaning

Firstly, we need to load our dataset into sas. Here, because we directly use the csv file, we will use the command infile, and dlm as ‘,’. We will start at the second line, so we have firstobs = 2. We use dsd here in order to avoid separation caused by comma in characters. We use missover to avoid jumpng to the next line when reading a short line. In sas, we also need to set variable names for all the variables in the dataset. $ is for character variable.

data abalone_full;
  infile './abalone.csv' dlm = ',' dsd missover firstobs = 2;
  input Sex $
        Length
        Diameter
        Height
        Whole_weight
        Shucked_weight
        Viscera_weight
        Shell_weight
        Rings;
run;

We only want to learn the relationship between age, sex and length, thus we only keep these variables. Then we do not want to have duplicate rows, so we use the command noduplicates to remove those lines appear more than once. We eliminated 2507 duplicate observations here. We only want to find out precise relationship between age and other factors. Observations with exactly save results will put weight on that observation and influence the model we want to get. Thus, we remove duplicate observations here. Then to make the dataset beautiful, we sort the dataset by Rings, Sex and Length.

data abalone;
set abalone_full;
keep Rings Sex Length;
proc sort data = abalone out = abalone noduplicates;
by Rings Sex Length;
run;

Here, we want to see what the dataset looks like. We list 10 obervations here.

proc print data = abalone(obs=10);
run;

Below is the result.

Data exploration

Next, we should have a basic impression of each variable. We start with Sex, and we use command proc freq and tables to create a frequency table for variable Sex.

proc freq data = abalone;
  tables Sex;
run;

We then use command proc means to create summary of Length.

proc means data = abalone;
var Length;
run;

Similarly, we use command proc means to create summary of Rings.

proc means data = abalone;
var Rings;
run;

We want to do a univariate summary about the variable Rings, and then draw a histogram about the variable. We do not want the summary to be shown in the results, so we use proc univariate with the option noprint here.

proc univariate data = abalone noprint;
var Rings;
histogram;
run;

Here we do factorization about variable Sex.

data abalone;
set abalone;
  if Sex = 'F' then SexF = 1;
    else SexF = 0;
  if Sex = 'I' then SexI = 1;
    else SexI = 0;
  if Sex = 'M' then SexM = 1;
    else SexM = 0;
run;

ZTNB Regression

We use the command proc nlmixed to run zero-truncated negative binomial regression about the dataset.

proc nlmixed data = abalone;
  log_mu = intercept + b_SexI*SexI + b_SexM*SexM + b_Length*Length;
  mu = exp(log_mu);
  het = 1/alpha;
  ll = lgamma(Rings+het) - lgamma(Rings+1) - lgamma(het) 
       - het*log(1+alpha*mu) + Rings*log(alpha*mu) 
       - Rings*log(1+alpha*mu) - log(1 - (1 + alpha * mu)**-het);
  model Rings ~ general(ll);
run;

From the summary of the Zero-Truncated Negative Binomial Regression model above, we can see that at first the intercept is 1.64, which is not 0 here. Because we use the variable ‘Rings’ to make prediction about the age of abalone, thus there is a basic value for that, as Rings plus 1.5 is the age.
This value is significant, because the p-value is less than 0.0001 less than 0.05, which is pretty fit for the fact that the adjustment is 1.5 because the intercept is close to that.
The coefficient of ‘b_Sexl’ is -0.14, which is significant because p-value is less than 0.0001 less than 0.05. This value indicates that the log count rings of infant abalones is 0.14 less than the log count of female abalones when all else factors remaining the same. This makes sense because the age of infant abalone should be smaller than adult abalone. The coefficient of ‘b_SexM’ is -0.02, which is not significant because the p-value is 0.21 greater than 0.05. This means that the log counts of rings between male and female abalones has not significant differences, when other factors are the same.
Thus, we do not need to use different models to predict the age of abalone for different sexs.
The coefficient of ‘b_Length’ is 1.49, which is significant because p-value is less than 0.0001 less than 0.05. Thus, when the length of abalone increases one unit, the log count of its rings will increases 1.49 unit. This gives us the idea that larger ablones for most of the time should be older.
We also have estimate of alpha here. The estimate is close to zero, which indicates that we can also use zero-truncated poisson regression to fit the model in this example.

NB Regression and Comparison of the two models

We also want to compare the results of zero-truncted negative binomial regression, and normal negative binomial regression. We use command proc genmod to do the normal negative binomial regression.

proc genmod data = abalone;
  class Sex (param = ref ref = first);
  model Rings = Sex Length / type3 dist = negbin;
run;

The Log-likelihood of the zero-truncated negative binomial regression model is -4296.75. The log-likelihood of the negative binomial regression model is -4295.38

From above, we can also see the results from normal negative binomial regression. When comparing the log-likelihood of these two models, we can see that because -4296.75 < -4295.38, the zero-truncated negative binomial regression model should be better than the normal one.
However, we should notice that the difference is not very big. Thus, for this dataset, choosing either one of the models will not make big difference to the final results. Thus, we can say that it is not very necessary for us to use zero-truncated negative binomial regression in this example.

When comparing to the results from the other two languages, we can see that we have similar results. Thus, choosing either language to use zero-truncated negative binomial regression model will be appropriate. In this way, it is better for us to choose the one that is the easiest one to do.

Stata

Data cleaning

Load the data into Stata using import.

clear
import delimited abalone.csv, case(preserve)

We are only interested in these three variables: Rings, Sex and Length. We only keep these variables in our dataset, sort them by value and remove duplicates.

keep Rings Sex Length
sort Rings Sex Length 
duplicates drop

Generate a new column named Sex1 that denotes the genre. 0 for Female, 1 for Infant, 2 for Male.

gen Sex1 = 0
replace Sex1 = 1 if Sex == "I"
replace Sex1 = 2 if Sex == "M"
label define Sex_codes 0 "F" 1 "I" 2 "M", replace

Relabel Sex1. Save the new dataset.

label values Sex1 Sex_codes
save abalone, replace

Visualize the tabulation of Sex1.

tab1 Sex1

Data exploration

Visualize the summarize of Length and Rings.

summarize Length Rings

Then, visualize the histogram of Rings.

histogram Rings, discrete

ZTNB Regression

The tnbreg command will analyze models that are left truncated on any value not just zero. The ztnb command previously was used for zero-truncated negative binomial regression, but is no longer supported in Stata12 and has been superseded by tnbreg.

tnbreg Rings i.Sex1 Length, ll(0)

The output looks very much like the output from an OLS regression:

  • It begins with the iteration log giving the values of the log likelihoods starting with a model that has no predictors.

  • The last value in the log (-4294.9849) is the final value of the log likelihood for the full model and is repeated below.

  • Next comes the header information. On the right-hand side the number of observations used (1670) is given along with the likelihood ratio chi-squared with three degrees of freedom for the full model, followed by the p-value for the chi-square. The model, as a whole, is statistically significant.

  • The header also includes a pseudo-R2, which is quite low in this example (0.08).

  • Below the header you will find the zero-truncated negative binomial coefficients for each of the variables along with standard errors, z-scores, p-values and 95% confidence intervals for each coefficient.

  • The output also includes an ancillary parameter /lnalpha which is the natural log of the over dispersion parameter.

  • Below that, is the the overdispersion parameter alpha along with its standard error and 95% confidence interval.

  • Finally, the last line of output is the likelihood-ratio chi-square test that alpha is equal to zero along with its p-value.

Looking through the results we see the following:

  • The value of the coefficient for Sex1_M, -0.02, suggests that the log count of Rings decreases by 0.02 for each unit increase in Sex1_I group. This coefficient is not statistically significant.

  • The coefficient for Sex1_I, -0.14, is significant and indicates that the log count of Rings for Infants is 0.14 less than for non-Infants.

  • The log count of length is 1.50 more.

  • The value of the constant (_cons), 1.63 is log count of the Rings when all of the predictors equal zero.

NB Regression and Comparison of the two models

We fit it again using Negative Binomial Regression using nbreg.

nbreg Rings i.Sex1 Length

Compare these two models using Log-likelihood:
Because -4294.9849 > -4295.3738, we can conclude that the Zero-Truncated Negative Binomial Regression model fits better than the Negative Binomial Regression model.

Conclusion

Produce a nicely formatted table to show the regression results:

fit_res = tibble(
                 Item = c("First_intercept", "Second_intercept", "SexI", 
                          "SexM", "Length"),
                 Estimate = c(
                   round(as.numeric(t_nb1@coefficients["(Intercept):1"]),2), 
                   round(as.numeric(t_nb1@coefficients["(Intercept):2"]),2), 
                   round(as.numeric(t_nb1@coefficients["SexI"]),2),
                   round(as.numeric(t_nb1@coefficients["SexM"]),2),
                   round(as.numeric(t_nb1@coefficients["Length"]),2)),
                 Standard_Error = c(
                   round(as.numeric((summary(t_nb1)@coef3[,"Std. Error"])["(Intercept):1"]),2),
                   round(as.numeric((summary(t_nb1)@coef3[,"Std. Error"])["(Intercept):2"]),2),
                   round(as.numeric((summary(t_nb1)@coef3[,"Std. Error"])["SexI"]),2),
                   round(as.numeric((summary(t_nb1)@coef3[,"Std. Error"])["SexM"]),2),
                   round(as.numeric((summary(t_nb1)@coef3[,"Std. Error"])["Length"]),2))
)
cap = '**Table 1.** The Estimate and Standard Error of Regression Coefficients.'
knitr::kable(fit_res, align = 'r', caption = cap)
Table 1. The Estimate and Standard Error of Regression Coefficients.
Item Estimate Standard_Error
First_intercept 1.63 0.04
Second_intercept 5.19 0.60
SexI -0.14 0.02
SexM -0.02 0.02
Length 1.50 0.07

Besides, we can plot the prediction to visualize the regression results.

Prediction_Rings = predict(t_nb1, abalone, type = "response")
Prediction_Rings = cbind(Prediction_Rings,abalone)

ggplot(Prediction_Rings, aes(x = Length, y = Prediction_Rings, col = Sex)) +
  geom_line() +
  facet_wrap(~ Sex)

From this plot, we can see: At first, the Rings for Infant Abalone is obviously less compared to Female Abalone when the length of these Abalones are the same, while the Rings for Male Abalone is inconspicuously less compared to Female Abalone when the length of these Abalones are the same. Secondly, the Rings for Abalone increases as the length of Abalone increases. This is consistent with the interpretion fitting results.

Things to Consider

From the example above, we can see that even if the Log-likelihood criterion shows the Zero-Truncated Negative Binomial Regression model fits better than the Negative Binomial Regression model, the estimated parameters and standard errors obtained by Negative Binomial Regression and Zero-Truncated Negative Binomial Regression is very similar.
Thus, we can conclude that if the mean is not very close to 0, the truncation problem can be ignored if there is no strict limit of the accuracy for the regression results. In this case, both the Negative Binomial Regression model and the Zero-Truncated Negative Binomial Regression model are good models.

References

  1. Mixed Effects Models and Extensions in Ecology with R (2009) Zuur, Ieno, Walker, Saveliev, Smith
    (link to the book is here)
  2. Wikipedia: Truncated regression model
  3. https://stats.idre.ucla.edu/r/dae/zero-truncated-negative-binomial/
  4. https://stats.idre.ucla.edu/sas/dae/zero-truncated-negative-binomial/
  5. https://stats.idre.ucla.edu/stata/dae/zero-truncated-negative-binomial/