Marginalized Zero-Altered Models for Longitudinal Count Data

Tabb, Loni Philip; Tchetgen, Eric J. Tchetgen; Wellenius, Greg A.; Coull, Brent A.

doi:10.1007/s12561-015-9136-6

Marginalized Zero-Altered Models for Longitudinal Count Data

Published: 22 September 2015

Volume 8, pages 181–203, (2016)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Statistics in Biosciences Aims and scope Submit manuscript

Marginalized Zero-Altered Models for Longitudinal Count Data

Download PDF

Loni Philip Tabb¹,
Eric J. Tchetgen Tchetgen^2,4,
Greg A. Wellenius³ &
…
Brent A. Coull⁴

374 Accesses
3 Citations
Explore all metrics

Abstract

Count data often exhibit more zeros than predicted by common count distributions like the Poisson or negative binomial. In recent years, there has been considerable interest in methods for analyzing zero-inflated count data in longitudinal or other correlated data settings. A common approach has been to extend zero-inflated Poisson models to include random effects that account for correlation among observations. However, these models have been shown to have a few drawbacks, including interpretability of regression coefficients and numerical instability of fitting algorithms even when the data arise from the assumed model. To address these issues, we propose a model that parameterizes the marginal associations between the count outcome and the covariates as easily interpretable log relative rates, while including random effects to account for correlation among observations. One of the main advantages of this marginal model is that it allows a basis upon which we can directly compare the performance of standard methods that ignore zero inflation with that of a method that explicitly takes zero inflation into account. We present simulations of these various model formulations in terms of bias and variance estimation. Finally, we apply the proposed approach to analyze toxicological data of the effect of emissions on cardiac arrhythmias.

Maximum likelihood based analysis of equally spaced longitudinal count data with first-order antedependence and overdispersion

Article Open access 08 November 2016

A comparison of zero-inflated and hurdle models for modeling zero-inflated count data

Article Open access 24 June 2021

Conditional Poisson models: a flexible alternative to conditional logistic case cross-over analysis

Article Open access 24 November 2014

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Count data often exhibit more zeros than that predicted by one of the common count distributions, such as the Poisson or negative binomial. Such data are common in many applications and can be difficult to analyze when many subjects have zero observations while others have much larger observations [16]. For example, in counting disease lesions on plants, a plant may have no lesions either because it is resistant to the disease, or simply because no disease spores have landed on it. This is often the distinction between structural zeros (i.e., zeros are inevitable) and sampling zeros (i.e., zeros occur by chance) [20]. There are a wide variety of application areas in which zero inflation occurs, such as road safety and accident frequencies [24], the manufacturing industry [12], and clinical trials [14].

A large literature now exists on longitudinal and other correlated data extensions of methods for zero-inflated Poisson (ZIP) counts [4, 6, 25, 27]. This work, almost all of which builds on the standard ZIP models initially proposed [12], models the data as arising from a mixture of a point mass at zero and a Poisson distribution. In this framework, one typically relates the covariates to both the probability that an observation arises from the zero point mass (via logistic regression), as well as to the mean parameter associated with the Poisson component (via a log-linear model). In the event that the Poisson assumption of equality between the mean and variance fails, and zero inflation remains, zero-inflated negative binomial (ZINB) models have also been developed. In fact, score tests for ZIP regression versus ZINB regression have been proposed [5, 21].

These ZIP and ZINB formulations have been the standard way to account for zero inflation in counts for quite sometime and have proven useful in many applications. On the other hand, these models have a few disadvantages as well. A drawback of zero inflation regression is that it can be susceptible to instability or even non-existence of the maximum likelihood estimates for the regression coefficients on the mixing probability. This problem is essentially one of complete separation [2, 13] that occurs in logistic regression when the linear predictor specifies covariate patterns for which the responses are all either zero or non-zero. Second, the parameters are difficult to interpret, as one must distinguish whether an effect relates to a change in the probability of a zero count or to the mean count. Based on these considerations, two alternative approaches were proposed [16], a hurdle model and an ordinal regression model applied to a categorized version of the count response, that yield more parsimonious summaries of a given covariate effect. As noted in [16], there is a large literature on the hurdle formulation for zero-inflated distributions. The model itself is framed in two stages: (1) model the probability an outcome is equal to zero; (2) conditional that it is not zero, model the distribution of the count using a conditional distribution derived from a traditional distribution for count data.

Even if one uses the hurdle model formulation to obtain a more parsimonious description of covariate effects, interpretation issues still arise when one includes random effects to account for correlation among clustered responses. In standard Poisson or negative binomial models (i.e., no zero inflation) with a log link, inclusion of random effects does not affect the marginal interpretation of the fixed effects [28]. However, in zero-inflated extensions of the model, this is no longer true. The fixed effects now represent conditional (i.e., within-cluster) effects. As pointed out in the literature, this interpretation can be problematic for covariates that do not vary within a subject in that the data do not directly reflect this within-subject change [9]. Thus, it is of interest to formulate a version of the zero-inflated count models that estimates the marginal effect of each covariate directly. Although recent work explored a marginalized hurdle model and a marginalized zero-inflated model for correlated and overdispersed count data with excess zero observations [10], this research does not present the operating characteristics of their proposed model and how they compare to current modeling frameworks.

The primary purpose of this research is to introduce a marginalized zero-altered model [18] and compare its performance to existing regression models for count data. Section 2 reviews existing methodology for zero-inflated count data. Section 3 outlines our proposed model and maximum likelihood estimation of the model parameters. Section 4 presents the results of two simulation studies designed to compare the performance of the proposed model to that of existing modeling approaches for longitudinal count data, which includes Poisson and negative binomial mixed-effect models and generalized estimating equations. In Sect. 5, we present a case study that applies the proposed marginalized zero-altered model to toxicological data on the cardiovascular effects of exposure to coal-fired power plant emissions and compares the results to simpler approaches. We conclude with a discussion and our recommendations in Sect. 6.

2 Existing Methodology

Our proposed model combines the conditional zero-altered random effect model [16] with a marginal regression approach [8, 9, 23]. That is, the model accounts for possible zero inflation/deflation and correlation among counts while relating the marginal mean of the outcome directly to covariates of interest.

In this section, we briefly outline the zero-altered model proposed by [16]. Let $Y_{ij}$ and $X_{ij}$ denote the count response and $1 \times p$ vector of covariates, respectively, for subject $i(i=1,\ldots ,n)$ at time $j(j=1,\ldots ,n_{i})$. The model is as follows:

$$\begin{aligned}&\hbox {log}[\mu ^b_{ij}] = X_{ij} \beta + b_{i} \\&\hbox {log}[-\hbox {log}[P(Y_{ij}=0|X_{ij},b_{i})]] = \gamma _{1} + \gamma _{2} \left( X_{ij} \beta \right) + b_{i} \\&P(Y_{ij}=y_{ij}|X_{ij},b_{i}) = I(y_{ij} > 0) [1 - P(Y_{ij}=0|X_{ij},b_{i})] \frac{g(y_{ij};\mu _{ij}^{b})}{1 - g(0;\mu _{ij}^{b})} , \end{aligned}$$

where $\mu _{ij}^b$ is the mean of a random variable with un-truncated distribution defined by density g, conditional on $X_{ij}$ and $b_{i}$; $b_{i} \mathop {\sim }\limits ^{iid} N(0, \sigma ^{2})$ is a subject-specific random intercept; $\gamma _{2} \ge 0$, $\gamma _{1}, \beta $ are unrestricted; and $g(\cdot ; \cdot )$ is assumed to be a count density, such as the Poisson or negative binomial. The first equation relates the expectation of the response variable to both measured covariates and random effects. The next two equations model the probability of having a zero count and having a count greater than zero, respectively, conditional on the covariates and the random effects. The appeal of this model is that if $g(\cdot ; \cdot )$ is assumed to be Poisson, the standard Poisson model is a special case when $\gamma _{1}=0$ and $\gamma _{2}=1$, and it directly parameterizes zero inflation/deflation via $\gamma _{1}$ and $\gamma _{2}$. However, as noted previously [16], one drawback of this model is that there is no single parameter describing the association between the covariates of interest, $X_{ij}$, and the unconditional mean count, $E(Y_{ij}|X_{ij})$. In the next section, we propose a marginalized zero-altered model that specifies a simple marginal relationship between the response variable and covariates of interest while adjusting for possible zero inflation/deflation.

3 Marginalized Zero-Altered Count Model

Section 3.1 outlines the marginalized zero-altered formulation. Section 3.2 describes maximum likelihood estimation and software implementation of the proposed model.

3.1 Statistical Model

Again let $Y_{ij}$ and $X_{ij}$ be the response variable and vector of covariates, respectively, for subject $i(i=1,\ldots ,n)$ at time $j(j=1,\ldots ,n_{i})$. We model the marginal mean, $\mu _{ij}^{Y} = E(Y_{ij}|X_{ij})$, conditional on covariates using a log link function,

$$\begin{aligned} \hbox {log}[\mu _{ij}^{Y}]= & {} X_{ij} \beta , \end{aligned}$$

(1)

where $\beta $ is a $p \times 1$ vector of estimated coefficients and $X_{ij}$ is a $1 \times p$ vector of covariates. To capture the dependence among the longitudinal measurements, we model the conditional mean response, $\mu _{ij}^{b}$, of a random variable distributed according to count density $g(\cdot ; \cdot )$, using the conditional model,

$$\begin{aligned} \hbox {log}[\mu _{ij}^{b}]= & {} \Delta _{ij} + b_{i}, \end{aligned}$$

(2)

where the $\{ b_{i} \}$ are subject-specific random effects. The random effects are assumed to have an independent structure, such that $b_{i} \mathop {\sim }\limits ^{iid} N(0, \sigma ^{2})$. $\Delta _{ij}$ is a function of the model parameters, $\Delta _{ij} = \Delta _{ij}(\beta , \gamma _{1}, \gamma _{2}, \sigma )$, where $\sigma $ is the standard deviation of the subject-specific random effects. To account for zero inflation/deflation, we explicitly model the probability of having a zero count and a count greater than zero:

$$\begin{aligned} P(Y_{ij}=0|X_{ij},b_{i})= & {} h \left[ \gamma _{1} + \gamma _{2} \left( \Delta _{ij} \right) + b_{i} \right] \end{aligned}$$

(3)

$$\begin{aligned} P(Y_{ij}=y_{ij}|X_{ij},b_{i})= & {} [1 - P(Y_{ij}=0|X_{ij},b_{i})] \frac{g(y_{ij};\mu _{ij}^{b})}{1 - g(0;\mu _{ij}^{b})}, y = 1, 2, \cdots \end{aligned}$$

(4)

where $\gamma _{2} \ge 0$ and $\gamma _{1}, \beta $ are unrestricted. Collectively, $\gamma _{1}$ and $\gamma _{2}$ measure the zero inflation/deflation, such that $\gamma _{1} < 0$ and $\gamma _{2} < 1$ introduce zero inflation in an additive and multiplicative fashion, respectively (and zero deflation in a counter fashion). The ratio in (4) is the conditional probability of a response greater than zero, where $g(\cdot ; \mu _{ij}^b)$ again represents the density for, say, a Poisson random variable with mean $\mu _{ij}^b$. This framework can also accommodate overdispersion in the non-zero counts via the negative binomial assumption, in which case we use $g(y_{ij}; \mu _{ij}^{b}) = \frac{\Gamma {(y_{ij} + 1/ \alpha )}}{\Gamma (y_{ij} + 1) \Gamma (1/ \alpha )} \left( \frac{1/ \alpha }{1/ \alpha + \mu _{ij}^{b}}\right) ^{1/ \alpha } \left( \frac{\mu _{ij}^{b}}{1/ \alpha + \mu _{ij}^{b}}\right) ^{y_{ij}}$, and $\alpha > 0$ is the overdispersion parameter.

The quantities $\Delta _{ij} (\theta )$ are indirectly defined and can be obtained as the solution to the convolution equation, which links the marginal and conditional means,

$$\begin{aligned} \mu _{ij}^{Y}= & {} \hbox {exp}( X_{ij} \beta ) \nonumber \\= & {} \int [1-P(Y_{ij}=0|X_{ij},b_{i})] \frac{\mu _{ij}^{b}}{[1 - g(0;\mu _{ij}^{b})]} \phi (b_{i}|\sigma ) \mathrm{d}b_{i}, \end{aligned}$$

(5)

where the integral defined is one-dimensional (see Appendix 1 for details). Given a fixed value of $\theta $, we solve for $\Delta _{ij}(\theta )$ after approximating (5) using numerical quadrature, such as Gauss-Hermite quadrature [15, 23]. Equation (5) expresses the marginal mean of the outcome as the conditional mean (given the random effects $b_i$) marginalized over the assumed distribution of the random effects. Because these solutions are analytically intractable, we use a simple Newton–Raphson algorithm to solve for them numerically.

We propose the choice of the h function vary depending on the chosen count distribution g. If we let $h(x) = g(0;x)$, then the model given by (1–4) reduces to the standard count distribution g in the special case of $\gamma _1=0$ and $\gamma _2=1$, which has the advantage that the presence of zero inflation can be formally tested against a standard count distribution via likelihood ratio testing. [16] used this strategy for the Poisson distribution with

$$\begin{aligned} P(Y_{ij}=0|X_{ij},b_{i}) = \text{ exp }\left\{ -\text{ exp } \left[ \gamma _{1} + \gamma _{2} \left( \Delta _{ij} \right) + b_{i} \right] \right\} , \end{aligned}$$

the inverse of which yields the log-negative-log link function in their original model formulation. In the negative binomial model, the link that results from this strategy depends on the overdispersion parameter $\alpha $. The zero-altered component of the model becomes

$$\begin{aligned} P(Y_{ij}=0|X_{ij},b_{i}) = \left( 1/ [1 + \alpha \hbox { exp}(\gamma _{1} + \gamma _{2}(\Delta _{ij}) + b_{i})]\right) ^ {1/\alpha }. \end{aligned}$$

We refer to the general formulation given by (1) - (4) as the marginalized zero-altered count (MZAC) model, with model parameters $\theta $ = ($\beta $, $\gamma _{1}$, $\gamma _{2}$, $\sigma $). Because the MZAC model can assume either a Poisson or a negative binomial distribution, we will also refer to either a MZAP or a MZANB model, respectively, throughout the remainder of the paper.

The above formulation follows that proposed by [16] in that the parameter $\gamma _2$ scales the component, in this case $\Delta _{ij}$, that involves the covariates $X_{ij}$ only. This structure corresponds to a scaling factor of 1 for the random effects, which assumes that the scales of the random effects in the zero-altered and non-zero components of the model are equal. We also consider alternative scalings that relax this assumption. Specifically, we also consider models in which the full predictor $\Delta _{ij} + b_i$ is scaled by $\gamma _{2}$,

$$\begin{aligned} h^{-1} [P(Y_{ij}=0|X_{ij},b_{i}) ] = \gamma _{1} + \gamma _{2} (\Delta _{ij} + b_{i}), \end{aligned}$$

which assumes that the scaling of covariate component $\Delta _{ij}$ and the random effects $b_i$ are scaled equally for the zero-altered component of the model. We also consider the more flexible formulation

$$\begin{aligned} h^{-1} [P(Y_{ij}=0|X_{ij},b_{i}) ] = \gamma _{1} + \gamma _{2} (\Delta _{ij}) + \gamma _{3}(b_{i}), \end{aligned}$$

which makes no assumptions about how the scalings of either the covariate component $\Delta _{ij}$ and $b_i$ are related. We use standard likelihood-based fit statistics, such as Akaike information criterion (AIC) [1], to choose between these three scaling models.

A key feature of this model is the relationship given in (1), which specifies a simple log-linear relationship between the response and corresponding covariates of interest. Moreover, the mean model is separate from the association model. As noted in previous research [9], this separation typically yields inferences for the regression coefficient parameters that are less sensitive to misspecification of the association model for the repeated measures. Another advantage of this model lies in (3), which explicitly parameterizes zero inflation/deflation through $\gamma _{1}$ and $\gamma _{2}$. To provide an interpretation of the $\gamma _1$ and $\gamma _2$ parameters governing the amount of zero inflation in the model, Fig. 1 presents a simulated dataset from each of six values of $(\gamma _1, \gamma _2)$, for an intercept only model with $\beta _0=1.50$ and $\sigma ^2=0.10$. Data were generated from a MZAP model with the following values of $\gamma = (\gamma _1, \gamma _2)$: (0, 1), ($-$0.25, 1), and ($-$0.50, 1) in column 1, and (0, 0.25), ($-$0.25, 0.25), and ($-$0.5, 0.25) in column 2. Therefore, the first histogram in Fig. 1 represents data generated from the Poisson distribution with $\mu =\hbox {exp}(1.50) = 4.48$, and the other five plots show how this distribution changes as one increases the amount of zero inflation either additively (i.e., changing $\gamma _1$), multiplicatively (i.e., changing $\gamma _2$), or both. Moving down the first column, which represents increasing amounts of zero inflation on the additive scale, the frequency of zeros increases to approximately 15 and 30, respectively, while the range of the distribution decreases slightly and the spike at zero becomes evident in the $\gamma = (\gamma _{1}, \gamma _{2}) = ($-0.5, 1) case. The first plot in the second column represents zero inflation on the multiplicative scale only, and the frequency of zeros is about 30. When $\gamma = (\gamma _{1}, \gamma _{2}) = ($-0.25, 0.25), zero inflation is present on both the additive and multiplicative scales, and, not only does the frequency of zeros increase, but the distribution becomes more skewed to the right, with the maximum count close to 20. The last histogram, when $\gamma = (\gamma _{1}, \gamma _{2}) = ($-0.5, 0.25), shows an even larger increase in the frequency of zeros with a much thicker right tail. Therefore, when zero inflation is present on the additive scale only, the frequency of zeros increases, whereas zero inflation on both the additive and multiplicative scales increases both the amount of zero inflation and the skewness of the distribution, allowing for more extreme observations. Sections 4 and 5 present some evidence that this property of the MZAC distribution may result in regression estimates that are less sensitive to outliers.

3.2 Maximum Likelihood Estimation

We estimate the parameters of the MZAC model via maximum likelihood estimation. The likelihood contribution from subject i is constructed under the standard assumptions that the response counts are independent given the random effects. The marginal likelihood contribution for subject i is

$$\begin{aligned} L_{i}= & {} \int \prod _{j=1}^{n_{i}} P(Y_{ij} | X_{ij}, \beta , \gamma _{1}, \gamma _{2}, b_{i}) \phi (b_{i}|\sigma ) \mathrm{d}b_{i} \\= & {} \int \prod _{j=1}^{n_{i}} [P(Y_{ij}=0|X_{ij},b_{i})]^{I(Y_{ij}=0)} \\&\times \left[ [1-P(Y_{ij}=0|X_{ij},b_{i})] \frac{g(y_{ij};\mu _{ij}^{b})}{1 - g(0;\mu _{ij}^{b})} \right] ^{I(Y_{ij}>0)} \phi (b_{i}|\sigma ) \mathrm{d}b_{i} \end{aligned}$$

where $\phi (b_{i}|\sigma )$ is the probability density function for a normal random variable with mean zero and standard deviation $\sigma $. The conditional (on the random effects) likelihood $P(Y_{ij} | X_{ij}, \beta , \gamma _{1}, \gamma _{2}, b_{i})$ is the product of two terms. The first term depends on the probability that $Y_{ij}$ is equal to zero and the second term corresponds to the probability that $Y_{ij}$ is greater than zero. Because the integral in this marginal likelihood can not be evaluated analytically, numerical methods are required for approximation. Any number of software packages that allow fitting of nonlinear mixed models can maximize this likelihood. We use SAS PROC NLMIXED (sample code provided in Appendix 2), which uses fixed Gaussian quadrature to approximate $\text{ log } (L_{i})$ and a dual quasi-Newton algorithm to maximize this approximation summed over all subjects, to obtain estimates for all model parameters. The standard errors associated with these parameters are based on the Hessian matrix computed using the quadrature approximation of the marginal likelihood.

Our primary interest is in estimating the exposure effect, along with other potential confounders. Because the MZAC model allows separate marginal and conditional mean structures, the interpretation of the $\beta $ coefficients is straightforward. In general, an element of $\beta $ represents the log relative change in response mean related to a one unit difference in the corresponding predictor. For example, if we consider a single binary covariate, such as exposure level, then $\beta = \hbox {log}[E(Y_{ij} | \hbox {Exp}_{i} = 1)] - \hbox {log}[E(Y_{ij} | \hbox {Exp}_{i} = 0)]$ measures the average change in, say, frequency of our outcome between subjects exposed as opposed to subjects unexposed.

4 Simulation Studies

There is no previous work on the impact of zero inflation on estimators that naively ignore this feature of the count data. The primary reason for this gap in the literature is the fact that the parameters in the most common models for zero-inflated count data do not have a marginal interpretation, and so it is difficult to compare results from one of these models to those from a marginal one. One advantage of our marginalized parameterization of the zero-altered count model is that the resulting regression coefficient estimators can be directly compared to those from other marginal models. We conducted two simulation studies to investigate how the estimators from the MZAP model perform as compared to those from more traditional models, namely the negative binomial mixed-effect (NBME) model, Poisson mixed-effect (PME) model and generalized estimating equations (GEE). The NBME, PME, and GEE models are commonly used to analyze longitudinal count data, but do not explicitly adjust and/or estimate zero inflation. The NBME and PME models are a special case of a generalized linear mixed model, and takes the form

$$\begin{aligned} \text{ log }[\mu ^b_{ij}] = X_{ij} \beta + b_i. \end{aligned}$$

In this model, the regression coefficients have a marginal interpretation, and so coefficient estimates are directly comparable to their MZAC counterparts.

Because the likelihood function in NBME and PME models include integrals that are analytically intractable, maximum likelihood estimation for this model also depends on quadrature, and we implemented this approach in SAS PROC NLMIXED. To provide consistent results in the presence of misspecification [3], we also utilized the sandwich estimation option for the standard errors for these two approaches. The GEE approach [28], on the other hand, solves the set of estimating equations,

$$\begin{aligned} s(\beta ) = \sum _{i=1}^{n} D_{i}^{'} V_{i}^{-1} (Y_{i} - \mu ^m_{i}) = 0, \end{aligned}$$

where $Y_{i} = (Y_{i1}, \ldots , Y_{in_{i}})^{'}$, $\mu ^m_{i} = (\mu ^m_{i1}, \ldots , \mu ^m_{in_{i}})^{'}$, $D_{i} = \partial \mu _{i} / \partial \beta $, and $V_{i} = A_{i}^{\frac{1}{2}} R A_{i}^{\frac{1}{2}}$, with $A_{i}$ being the diagonal matrix of marginal variances for the ith cluster and $R = R(\alpha )$ being the working correlation matrix. We fit the log-linear GEE

$$\begin{aligned} \text{ log }[\mu ^m_{ij}] = X_{ij} \beta , \end{aligned}$$

to each simulated dataset.

The first simulation study assessed the performance of each method, in terms of bias, efficiency, and variance estimation, while increasing the amount of zero inflation in the data. Motivated by an empirical observation that GEE approaches may be overly sensitive to a single outlying count (see Sect. 5), we conducted a second simulation study that assessed the sensitivity of the four methods to response outliers. In carrying out these simulations, we note that although the GEE is the correct model, both the PME and NBME are misspecified but with consistent score equations.

4.1 Simulation 1

We simulated datasets containing 100 clusters, each with a cluster size of 5, from the proposed MZAP model. Each dataset contained 50 “exposed” subjects and 50 “unexposed” subjects. The marginal model was $\mu _{ij}^{Y} = \hbox {exp} (\beta _{0} + \beta _{1}\hbox {Time}_{j} + \beta _{2}\hbox {Exp}_{i} + \beta _{3}\hbox {Exp}_{i} *\hbox {Time}_{j})$. Taken to closely match the results from a similar data analysis presented in Sect. 5, the true $\beta $ values were $(\beta _{0}, \beta _{1}, \beta _{2}, \beta _{3}) = (1.12, $-0.87, 0.29, 1.10) and the covariates of interest are exposure ($\hbox {Exp}_{i}: 1=\hbox {exposed}, 0=\hbox {unexposed}$) and time ($\hbox {Time}_{ij} = (j-1)*0.25$). The subject-specific random effects, $b_{i}$, were generated from a normal distribution with mean 0 and variance 0.10. Equivalently, in a setting where there are only two exposure groups and 5 observation points, there will be 10 different values of $\Delta _{ij}$ for each time and exposure combination; therefore, the values of $\Delta _{ij}$ that correspond to the assumed $\beta $ and $\sigma ^{2}$ values, as well as assuming no zero inflation, are $\Delta _{ij}$ = (1.36, 1.42, 1.47, 1.53, 1.59, 1.07, 0.85, 0.63, 0.42, 0.20), where the first five values are for the exposed subject and the last five are for the unexposed subject. To introduce different amounts of zero inflation, we varied $\gamma _{1}$ and $\gamma _{2}$, resulting in three different cases: (A) for $\gamma _{2}=1$, $\gamma _{1} = (0, -0.25, -0.50, -0.75)$, (B) for $\gamma _{2}=0.50$, $\gamma _{1} = (0, -0.25, -0.50, -0.75)$, and (C) for $\gamma _{2}=0.25$, $\gamma _{1} = (0, -0.25)$. Recall that $\gamma _{1}$ and $\gamma _{2}$ measure zero inflation/deflation in an additive and multiplicative fashion, respectively; therefore, the smaller $\gamma _{1}$ (i.e., $\gamma _{1} < 0$) and $\gamma _{2}$ (i.e., $\gamma _{2} < 1$), the more zero inflation is present. The amounts of zero inflation increase within each case as $\gamma _{1}$ gets smaller, and, overall, the amount of zero inflation increases from case A to case B to case C, because $\gamma _{2}$ is also getting smaller. Only two values for $\gamma _1$ are reported for case C due to the fact that, when $\gamma _2=0.25$, larger negative values of $\gamma _1$ yielded a distribution with such an overwhelming proportion of zeros that none of the four estimation methods considered were able to reliably estimate the model parameters.

Table 1 Simulation 1

Full size table

Table 1 presents the results, which are based on 500 simulated datasets, for only three specific zero inflation settings (no zero inflation, $\gamma = (\gamma _{2}, \gamma _{1} = (1, 0))$, moderate zero inflation, $\gamma = (\gamma _{2}, \gamma _{1} = (1,-0.75))$, and severe zero inflation, $\gamma = (\gamma _{2}, \gamma _{1} = (-0.25, 0.25))$). A detailed table, with 10 different zero inflation settings, is presented in Sect. 8 (Online Resource 1). To compare the four different modeling approaches for the three different cases (A–C), we focus on the simulated mean and variability (standard deviation) of the estimates, as well as the mean standard errors. We also present the 95% coverage rates for the $\beta $ coefficients. Across all four methods, the $\beta $ coefficients display little to no bias, even as the amount of zero inflation increases; similarly, the coverage is approximately the same across the four methods. As the amount of zero inflation increases, though, the mean of the PME estimator of $\sigma ^2$ does not match the true value, but this is to be expected as it is estimating cluster heterogeneity on the log-linear scale and not within the zero-altered Poisson framework. As the zero inflation moderately increases, the variance component for the NBME performs somewhat similar to the variance component for the PME in Case A, in that the simulated variance is larger than the true value. However, as the amount of zero inflation significantly increases, the mean of the NBME estimator of $\sigma ^{2}$ decreases, approaching 0.02 in the most extreme case. For all three simulation settings, the MZAP estimators for $\gamma _1$, $\gamma _2$, and the between-subject variability, $\sigma ^{2}$, display little bias, suggesting that the proposed software implementation yields adequate inferences when the model is correctly specified.

Table 2 Simulation 2

Full size table

4.2 Simulation 2

To compare the sensitivity of the four estimators to outlying observations, we simulated 30 clusters, with 5 observations each. The marginal mean model for each subject was $\mu _{i}^{m} = \hbox {exp} (\beta _{0} + \beta _{1}\hbox {Time}_{i} + \beta _{2}\hbox {Exp}_{i})$, where the true $\beta $ values are $(\beta _{0}, \beta _{1}, \beta _{2}) = (1.12, -0.87, 0.29)$. The random effects were normally distributed with zero mean and variance equal to 0.10. The true $\gamma $ values were $\gamma _{1}=-0.25$ and $\gamma _{2}=0.25$, corresponding to a significant amount of zero inflation. To induce contamination, we increased the last response count of one exposed subject, such that, $y_{ij}^{*} = y_{ij} + c, c \epsilon (0, 20, 40, 60, 80, 100)$, where $i=1$ and $j=5$. Table 2 presents the results of this contamination simulation. Due to the small number of clusters, there is a small amount of finite sample bias. Focusing on the coefficient estimates for the exposure covariate, as the final observation in this exposed subject becomes more extreme, the bias in the GEE estimator is approximately linear in the amount of contamination present in the response, agreeing with established results on the form of the influence function for GEE estimators [19]. MZAP, NBME, and PME estimators incurs less bias as compared to its GEE counterpart, with the amount of this bias being relatively small. These simulation results suggest that, in the presence of zero inflation, the MZAP estimators can be more robust to extreme observations as compared to their GEE counterparts.

5 Analysis of Arrhythmia Data

We use the proposed MZANB and MZAP models to analyze data collected as part of the toxicological evaluation of realistic emissions of source aerosols (TERESA) study [26]. In this study, rats with an induced acute myocardial infarction were exposed to either stack emissions ($n=15$) or filtered air ($n=14$) for up to five hours. The exposed rats were assessed while in photochemical chambers filled with stack emissions composed of SO$_{2}$, NO, and primary particulate matter. The control group of rats was exposed to room air filtered through a high-efficiency particulate air (HEPA) filter. The outcome of interest was arrhythmia frequency per hour, with electrocardiogram output inspected manually by an investigator blinded to the exposure status of each animal. Because each rat did not have a measured outcome for all five hours, we assumed that the data are missing completely at random, which is reasonable due to the fact that shorter exposures were likely due to technical issues unrelated to the outcome. Figure 2 shows the histogram of the frequency of arrhythmias per hour for the rats in the study. This histogram shows that there are potentially more observed zeros than expected from a standard distribution for count data, as indicated by the significant spike at zero. It also shows that the distribution of the frequency of arrhythmias per hour is extremely skewed to the right.

We compare the results from the MZANB and MZAP models to several traditional count models and a hurdle model with random effects. Preliminary analyses showed no evidence of an interaction between exposure and time, so to assess the association between the frequency of arrhythmias with exposure level and time, we considered the following marginal model:

$$\begin{aligned} \hbox {log}[\mu _{ij}^{Y}]= & {} \beta _{0} + \beta _{1}\hbox {Exp}_{i} + \beta _{2}I(\hbox {Time}_{ij}=1) + \beta _{3}I(\hbox {Time}_{ij}=2) \\&+ \beta _{4}I(\hbox {Time}_{ij}=3) + \beta _{5}I(\hbox {Time}_{ij}=4) , \end{aligned}$$

for subject $i(i=1,\ldots , 29)$ at hour $j (j= 1, 2, 3, 4, 5)$. To complete the MZANB and MZAP model specification, we assume

$$\begin{aligned} \hbox {log}[\mu _{ij}^{b}] = \Delta _{ij} + b_{i}, \end{aligned}$$

where $b_{i} \sim N(0, \sigma ^{2})$, and considered all three possible scaling formulations outlined in Sect. 3.

Table 3 presents the maximum likelihood estimates, the corresponding standard errors, and p values from the MZANB, MZAP, NBME, PME, and GEE models, as well as the AIC and likelihood ratio test statistics to compare model fit. A more detailed table along with results from the hurdle model with random effects as presented in [16] is given in Sect. 9 (Online Resource 2) , even though the parameters in this mixture model formulation do not directly compare to those from the other marginal models considered. Here, we only present results from the most flexible scaling model,

$$\begin{aligned} h^{-1} [P(Y_{ij}=0|X_{ij},b_{i})] = \gamma _{1} + \gamma _{2} (\Delta _{ij}) + \gamma _{3}(b_{i}), \end{aligned}$$

because both the MZANB and the MZAP models had the lowest AIC values in comparison to the other two scaling models for these data. Both the MZANB and MZAP results suggest that there is a significant (at the $\alpha =0.05$ level) effect of exposure ($\beta _{E} \approx 0.7$), with the MZANB producing the best fit as indicated by the lowest AIC. For both of the MZANB and MZAP models, the estimated relative risk, $\hbox {exp}(0.7) = 2.01$, of arrhythmias is doubled for those rats exposed to stack emissions as opposed to those rats exposed to HEPA filtered air. Although the MZANB maximum likelihood estimate for the heterogeneity parameter is $\sigma ^{2} = 0.187$, indicating a small amount of subject-to-subject variation in the frequency of arrhythmias, the $\gamma _{3}$ parameter that measures the overall effect of the subject-specific random effects is large. It appears that, in this case, because the NBME and PME models do not account for zero inflation, they underestimate the effect of exposure and overestimate the corresponding standard error. The GEE approach yields an estimated effect of exposure similar to those from the MZANB and MZAP models, but as we see in the next table this estimate can be sensitive to the inclusion or exclusion of a few extreme observations.

Table 3 Model estimates, standard errors (SE), and p values (p) for the frequency of arrhythmias, with $-$2 log likelihood ($-2$lnL) and Akaike information criterion (AIC) values for the marginalized zero-altered negative binomial (MZANB), marginalized zero-altered Poisson (MZAP), negative binomial mixed-effect (NBME), Poisson mixed-effect (PME), and generalized estimating equations (GEE) models

Full size table

Due to the small sample size of the toxicity study, the zero-altered parameters, $\gamma _1$, $\gamma _2$, and $\gamma _3$, exhibit large standard errors. Although these quantities indicate evidence of high uncertainty, these marginalized zero-altered models consistently perform better than their standard counterpart models, NBME and PME, as denoted by the AIC values as well as a formal likelihood ratio test. More specifically, we have compared the MZANB against the NBME as well as the MZAP against the PME. The likelihood ratio test statistic for both tests are 15.3 and 39, respectively, highlighting significant zero inflation as well. Both tests also indicate better model fit for our proposed marginalized models, as opposed to the models that ignore zero inflation.

The original data also included an extreme observation, which was removed from the dataset prior to analysis. The largest number of arrhythmias did not exceed 20 for any rat, with the exception of this one outlier, which was 59. In addition to the primary analyses above, we analyzed the dataset with this extreme observation using all marginal models presented as well as the hurdle model with random effects. Table 4 displays the results for the exposure coefficient only, and a detailed table including the hurdle results is presented in Sect. 10 (Online Resource 3). The results show that the MZANB and MZAP models are more robust in the presence of this extreme observation compared to the NBME, PME, GEE, and hurdle approaches (considering the percent change across both portions of this mixture model). The MZANB and MZAP have the smallest percent change, 16.596% and 1.349%, respectively, in analyzing the toxicity dataset with and without the outlier, and the overall conclusions of these analyses do not change. In contrast, the GEE has the largest percent change, 67.428%, coupled with a large change in the strength of this association as well.

Table 4 Model estimates (standard errors) for exposure and % change for the marginalized zero-altered negative binomial (MZANB), marginalized zero-altered Poisson (MZAP), negative binomial mixed effect (NBME), Poisson mixed-effect (PME), and generalized estimating equations (GEE) models with and without an outlier

Full size table

6 Discussion

We proposed marginalized zero-altered count models for longitudinal data with excessive zeros. This class of models allowed us to address the issue of zero inflation in a simple, clear, and parsimonious way. Current methods, such as zero-inflated regression models, adjust for excessive zeros but often at the expense of interpretability.

Our statistical framework has a few advantages. First, because our model includes a marginal mean structure, model parameters are easily interpretable. Unlike zero-inflated regression models, the interpretation of our regression coefficients is of the marginal form. Second, we are able to adjust for the zero inflation/deflation in the data via the parameters $\gamma _{1}$ and $\gamma _{2}$. Estimates of these parameters reflect the degree of the zero inflation/deflation. Third, the MZANB and MZAP models simplify to a negative binomial and Poisson model, respectively, when $\gamma _{1} = 0$ and $\gamma _{2} = 1$, which allows us to test whether $\gamma _{1}$ is different from zero and $\gamma _{2}$ (and $\gamma _3$ in models containing this additional scale parameter) is different from one. We also carried out two simulation studies to test the performance of our proposed MZAP model to that of standard approaches for longitudinal data, such as the negative binomial and Poisson mixed-effect models and generalized estimating equations. Results suggested that, when compared to GEE, the MZAP model was more robust in the presence of outlying response observations and, compared to the NBME and PME models, more accurately reflected estimation uncertainty in the presence of zero inflation.

We also presented the use of the marginalized zero-altered models in a toxicity study where the study population exhibited a significant number of zero outcomes in the arrhythmia response. Because our interest moves beyond simply modeling whether or not the arrhythmias exists as well as a more parsimonious environment for interpretation, our marginalized zero-altered models proved more appropriate than either traditional random effect models as well as the hurdle model with random effects. Our proposed models displayed better model fit in the presence of significant zero inflation when compared to the use of these other statistical models for longitudinal data.

Finally, [10] considers marginalized versions of the hurdle model that specify two regression equations, one for the probability of a zero count and one for the truncated non-zero distribution, with distinct model parameters. The zero-altered model that we consider assumes that a single set of regression coefficients, $\varvec{\beta }$, appear in both regression equations but are additively and multiplicatively scaled by $\gamma _1$ and $\gamma _2$, respectively. Just as importantly, our zero-altered model uses a complimentary log–log function in the model related to the probability of a zero count. The combination of these two features ($\gamma _1$ and $\gamma _2$ scaling plus log–log link) yield the important feature of the model that an ordinary Poisson mixed model is nested within this model, yielding tests of zero-alteration. Moreover, we derive and provide the analogous novel link that leads to this nesting property when the truncated count distribution is the negative binomial model, thereby extending the methods to cases in which the non-zero counts exhibit overdispersion. To our knowledge this link function has not been previously considered in the literature.

As with any statistical framework, our proposed model has a few limitations. Because each of our parameters of interest rely on the evaluation of the implicit $\Delta _{ij}$ function, computing time increases significantly as the number of clusters and observations per cluster increases. Also, if the correlated data structure has multiple nested levels, our marginal likelihood could be multi-dimensional, and model fitting will be computational expensive. However, the simulations fitting the proposed marginalized zero-altered count models suggested the models were not prohibitively slow for a reasonable number of clusters (i.e., $N=100$). We estimate that for this sample size the models took approximately 80 minutes per 100 datasets, suggesting the model is feasible for larger problems.

Current research focuses on extending this framework to analyze spatially correlated zero-inflated count data. Although one could fit this model via maximum likelihood estimation, we develop Bayesian methods of model fitting. One primary reason for turning to a Bayesian framework is because spatial data often have a multi-level structure. For instance, breast cancer incidence rates may be nested within census tracts, which are then nested within neighborhoods, which are then nested within cities. Because of this structure, there is a possible correlation at multiple levels, which then implies the use of multiple sets of random effects in a conditional mean model. In a similar fashion, the zero-altered models considered in [16] have the advantage that they are more parsimonious than alternative versions that use distinct parameters for the two parts of the model, but this can also be a source of lack of fit. In some situations, it would be of interest to include distinct sets of random effects in which case alternative estimation methods, such as Bayesian approaches, may be preferred.

Also, semi-continuous longitudinal data, whereby data are either continuous or zero with observations that are correlated, presents similar statistical challenges when in the presence of large zeros and strong skewness. Current methods include modeling this type of data via two-part models [11, 17]. Although different, there are close parallels; therefore, extending our proposed marginalized approach to this semi-continuous longitudinal data is of interest.

Finally, it is well known that due to the linearity of the influence functions [7, 22] associated with a GEE that is non-robust to outliers, that mixed models have the potential to be more robust because the influence functions have nonlinear form. Although we do not have any specific analytic results in this regard for the proposed class of models considered in this paper, we now acknowledge this as one possible reason for the difference in performance between the two methods and note that this is a possible future research direction.

References

Akaike H (1987) Factor analysis and AIC. Psychometrika 52(3):317–332
Article MathSciNet MATH Google Scholar
Albert A, Anderson JA (1984) On the existence of maximum likelihood estimates in logistic regression models. Biometrika 71(1):1–10
Article MathSciNet MATH Google Scholar
Everitt BS (1998) The cambridge dictionary of statistics. Cambridge University Press, Cambridge
MATH Google Scholar
Hall DB (2000) Zero-inflated poisson and binomial regression with random effects: a case study. Biometrics 56(4):1030–1039
Article MathSciNet MATH Google Scholar
Hall DB, Berenhaut KS (2002) Score tests for heterogeneity and overdispersion in zero-inflated poisson and binomial regression models. Can J Stat 30(3):415–430
Article MathSciNet MATH Google Scholar
Hall DB, Zhang Z (2004) Marginal models for zero inflated clustered data. Stat Model 4(3):161–180
Article MathSciNet MATH Google Scholar
Hampel FR (1974) The influence curve and its role in robust estimation. J Am Stat Assoc 69(346):383–393
Article MathSciNet MATH Google Scholar
Heagerty PJ (1999) Marginally specified logistic-normal models for longitudinal binary data. Biometrics 55(3):688–698
Article MathSciNet MATH Google Scholar
Heagerty PJ, Zeger SL (2000) Marginalized multilevel models and likelihood inference (with comments and a rejoinder by the authors). Stat Sci 15(1):1–26
MathSciNet Google Scholar
Kassahun W, Neyens T, Molenberghs G, Faes C, Verbeke G (2014) Marginalized multilevel hurdle and zero-inflated models for overdispersed and correlated count data with excess zeros. Stat Med 33(25):4402–4419
Article MathSciNet Google Scholar
Lachenbruch PA (2002) Analysis of data with excess zeros. Stat Methods Med Res 11(4):297–302
Article MATH Google Scholar
Lambert D (1992) Zero-inflated poisson regression, with an application to defects in manufacturing. Technometrics 34(1):1–14
Article MATH Google Scholar
Lesaffre E, Albert A (1989) Partial separation in logistic discrimination. J R Stat Soc Ser B 51:109–116
MathSciNet MATH Google Scholar
Lu S-E, Lin Y, Shih W-CJ (2004) Analyzing excessive no changes in clinical trials with clustered data. Biometrics 60(1):257–267
Article MathSciNet MATH Google Scholar
Miglioretti DL, Heagerty PJ (2004) Marginal modeling of multilevel binary data with time-varying covariates. Biostatistics 5(3):381–398
Article MATH Google Scholar
Min Y, Agresti A (2005) Random effect models for repeated measures of zero-inflated count data. Stat Model 5(1):1–19
Article MathSciNet MATH Google Scholar
Olsen MK, Schafer JL (2001) A two-part random-effects model for semicontinuous longitudinal data. J Am Stat Assoc 96(454):730–745
Article MathSciNet MATH Google Scholar
Philip LP (2010) Multilevel models for zero-inflated count data in environmental health and health disparities research. Ph.D. thesis, Harvard University
Qu A, Song PX-K (2004) Assessing robustness of generalised estimating equations and quadratic inference functions. Biometrika 91(2):447–459
Article MathSciNet MATH Google Scholar
Ridout M, Demétrio CGB, Hinde J (1998) Models for count data with many zeros. Proceedings of the sixth international biometric conference, vol. 19, pp 179–192
Ridout M, Hinde J, DemeAtrio CGB (2001) A score test for testing a zero-inflated poisson regression model against zero-inflated negative binomial alternatives. Biometrics 57(1):219–223
Article MathSciNet MATH Google Scholar
Rousseeuw FR, Hampel EM, Ronchetti PJ, Stahel WA (1986) Robust statistics: the approach based on influence functions. Wiley, New York
MATH Google Scholar
Schildcrout JS, Heagerty PJ (2007) Marginalized models for moderate to long series of longitudinal binary response data. Biometrics 63(2):322–331
Article MathSciNet MATH Google Scholar
Shankar V, Milton J, Mannering F (1997) Modeling accident frequencies as zero-altered probability processes: an empirical inquiry. Accid Anal Prev 29(6):829–837
Article Google Scholar
Tooze JA, Grunwald GK, Jones RH (2002) Analysis of repeated measures data with clumping at zero. Stat Methods Med Res 11(4):341–355
Article MATH Google Scholar
Wellenius Gregory A, Diaz Edgar A, Gupta Tarun, Ruiz Pablo A, Long Mark, Kang Choong Min, Coull Brent A, Godleski John J (2011) Electrocardiographic and respiratory responses to coal-fired power plant emissions in a rat model of acute myocardial infarction: results from the toxicological evaluation of realistic emissions of source aerosols study. Inhal Toxicol 23(S2):84–94
Article Google Scholar
Yau KKW, Lee AH (2001) Zero-inflated poisson regression with random effects to evaluate an occupational injury prevention programme. Stat Med 20(19):2907–2920
Article MathSciNet Google Scholar
Zeger SL, Liang K-Y, Albert PS (1988) Models for longitudinal data: a generalized estimating equation approach. Biometrics 44:1049–1060
Article MathSciNet MATH Google Scholar

Download references

Acknowledgments

The authors would like to acknowledge the following NIH Grants: ES07142, CA134294, ES012044, and ES00002.

Author information

Authors and Affiliations

Department of Epidemiology & Biostatistics, School of Public Health, Drexel University, Philadelphia, PA, USA
Loni Philip Tabb
Department of Epidemiology, Harvard School of Public Health, Boston, MA, USA
Eric J. Tchetgen Tchetgen
Department of Community Health, Brown University, Boston, MA, USA
Greg A. Wellenius
Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
Eric J. Tchetgen Tchetgen & Brent A. Coull

Authors

Loni Philip Tabb
View author publications
You can also search for this author in PubMed Google Scholar
Eric J. Tchetgen Tchetgen
View author publications
You can also search for this author in PubMed Google Scholar
Greg A. Wellenius
View author publications
You can also search for this author in PubMed Google Scholar
Brent A. Coull
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Loni Philip Tabb.

Ethics declarations

Conflicts of interest

No conflict of interest.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 40 KB)

Appendices

Appendix 1: Calculation of $\Delta _{ij}$

In solving for $\Delta _{ij}$, we need to solve the convolution equation that links the marginal (i.e, $\mu _{ij}^{Y} = \hbox {E}(Y_{ij}|X_{ij}) = \hbox {exp} (\beta X_{ij})$) and conditional means, where, assuming the MZAP setting,

$$\begin{aligned} \mu _{ij}^{Y}= & {} \hbox {exp}(X_{ij} \beta ) \\= & {} \int [1-P(Y_{ij}=0|X_{ij},b_{i})] \frac{\mu _{ij}^{b}}{[1 - \hbox {exp}(-\mu _{ij}^{b})]} \phi (b_{i}|\sigma ) \mathrm{d}b_{i} \\= & {} \int \frac{[1-\hbox {exp}[-\hbox {exp}(\gamma _{1} + \gamma _{2} (\Delta _{ij} + b_{i}))]]}{[1 - \hbox {exp}[-\hbox {exp}(\Delta _{ij} + b_{i})]]} \hbox {exp}(\Delta _{ij} + b_{i}) \phi (b_{i}|\sigma ) \mathrm{d}b_{i} . \end{aligned}$$

Estimates of $\Delta _{ij}$ can be obtained using a Newton–Raphson algorithm, such that

$$\begin{aligned} \Delta _{ij}^{(t+1)}= & {} \Delta _{ij}^{(t)} - \left( \frac{\partial f(\Delta _{ij})}{\partial \Delta _{ij}}\right) ^{-1} \times f(\Delta _{ij}) , \end{aligned}$$

where $\Delta _{ij} = \Delta _{ij} (\beta , \gamma _{1}, \gamma _{2}, \sigma )$ and $f(\Delta _{ij})$ refers to the convolution equation above. The derivative needed for the Newton–Raphson algorithm is as follows

$$\begin{aligned} \frac{\partial }{\partial \Delta _{ij}} \mu _{ij}^{Y}= & {} \frac{\partial }{\partial \Delta _{ij}} \int \frac{[1-\hbox {exp}[-\hbox {exp}(\gamma _{1} + \gamma _{2} (\Delta _{ij} + b_{i}))]]}{[1 - \hbox {exp}[-\hbox {exp}(\Delta _{ij} + b_{i})]]} \hbox {exp}(\Delta _{ij} + b_{i}) \phi (b_{i}|\sigma ) \mathrm{d}b_{i} \\= & {} \int \left\{ \frac{\partial }{\partial \Delta _{ij}} \frac{[1-\hbox {exp}[-\hbox {exp}(\gamma _{1} + \gamma _{2} (\Delta _{ij} + b_{i}))]]}{[1 - \hbox {exp}[-\hbox {exp}(\Delta _{ij} + b_{i})]]} \hbox {exp}(\Delta _{ij} + b_{i}) \right\} \phi (b_{i}|\sigma ) \mathrm{d}b_{i} . \end{aligned}$$

After using the chain rule, $\diamond = \left\{ \frac{\partial }{\partial \Delta _{ij}} \frac{[1-\hbox {exp}[-\hbox {exp}(\gamma _{1} + \gamma _{2} (\Delta _{ij} + b_{i}))]]}{[1 - \hbox {exp}[-\hbox {exp}(\Delta _{ij} + b_{i})]]} \hbox {exp}(\Delta _{ij} + b_{i}) \right\} $ results in

$$\begin{aligned} \diamond= & {} \{ 1 - e^{-e^{\gamma _{1} + \gamma _{2} (\Delta _{ij} + b_{i})}} \} * \left\{ \frac{[1 - e^{-e^{\Delta _{ij} + b_{i}}}]e^{\Delta _{ij} + b_{i}} - e^{\Delta _{ij} + b_{i}}[-e^{-e^{\Delta _{ij} + b_{i}}}][-e^{\Delta _{ij} + b_{i}}]}{[1 - e^{-e^{\Delta _{ij} + b_{i}}}]^{2}} \right\} \\&+ \left\{ \frac{e^{\Delta _{ij} + b_{i}}}{1 - e^{-e^{\Delta _{ij} + b_{i}}}} \right\} * \{ e^{-e^{\gamma _{1} + \gamma _{2} (\Delta _{ij} + b_{i})}} [e^{\gamma _{1} + \gamma _{2} (\Delta _{ij} + b_{i})}] \gamma _{2} \} \\= & {} \frac{e^{\Delta _{ij} + b_{i}}}{1 - e^{-e^{\Delta _{ij} + b_{i}}}} * [ \{ 1 - e^{-e^{\gamma _{1} + \gamma _{2} (\Delta _{ij} + b_{i})}} \}* \left\{ 1 - \frac{[e^{-e^{\Delta _{ij} + b_{i}}}] [e^{\Delta _{ij} + b_{i}}]}{1 - e^{-e^{\Delta _{ij} + b_{i}}}} \right\} \\&+ \{ e^{-e^{\gamma _{1} + \gamma _{2} (\Delta _{ij} + b_{i})}} [e^{\gamma _{1} + \gamma _{2} (\Delta _{ij} + b_{i})}] \gamma _{2} \} ] . \end{aligned}$$

Gauss–Hermite quadrature can be used to evaluate this one-dimensional integral.

Appendix 2: SAS Execution via PROC NLMIXED

Marginalized zero-altered Poisson (MZAP) and marginalized zero-altered negative binomial (MZANB) models

Rights and permissions

Reprints and permissions

About this article

Cite this article

Tabb, L.P., Tchetgen, E.J.T., Wellenius, G.A. et al. Marginalized Zero-Altered Models for Longitudinal Count Data. Stat Biosci 8, 181–203 (2016). https://doi.org/10.1007/s12561-015-9136-6

Download citation

Received: 03 March 2015
Accepted: 08 September 2015
Published: 22 September 2015
Issue Date: October 2016
DOI: https://doi.org/10.1007/s12561-015-9136-6

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Marginalized Zero-Altered Models for Longitudinal Count Data

Abstract

Similar content being viewed by others

Maximum likelihood based analysis of equally spaced longitudinal count data with first-order antedependence and overdispersion

A comparison of zero-inflated and hurdle models for modeling zero-inflated count data

Conditional Poisson models: a flexible alternative to conditional logistic case cross-over analysis

1 Introduction

2 Existing Methodology