1 Introduction

In a cross-sectional cluster setup, the responses from the individuals in a given cluster are correlated because they share a common random cluster effect. In a longitudinal setup, the repeated responses collected from an individual form a cluster, and these responses are correlated because they are likely to follow a dynamic relationship. Thus the correlation structures under cross-sectional clusters and longitudinal clusters are expected to differ. In both setups, the primary objective is to examine the regression effects (of the associated covariates) after accommodating the respective correlation structure.

To facilitate the discussion on the cluster regression models in a cross-sectional setup, suppose that there are I independent clusters and that, for cluster i (i = 1,…,I), ni denotes its size. Let yij denote a binary response from the j-th (j = 1,…,ni) individual of the i-th cluster. Further, let xij be a p-dimensional fixed covariate vector, and \(\boldsymbol {\beta }=(\beta _{1},\ldots ,\beta _{u},\ldots ,\beta _{p})^{\prime }\) be the regression effect of xij on yij, for all i = 1,…,I; j = 1,…,ni. Notice that in this cluster setup, there is likely to be a cluster effect on the responses belonging to the same cluster. Let γi denote the random cluster effect of the i-th cluster, which is shared by the responses belonging to this cluster. Thus, on top of β, there is an influence of γi on the responses ({yij, j = 1,…,ni}) belonging to the i-th cluster. This additional influence, along with the influence of xij, is reflected on the binary response yij through a cluster-specific conditional mean model given by

$$ \begin{array}{@{}rcl@{}} E[Y_{ij}|\gamma_{i}]&=&Pr[Y_{ij}=1|\boldsymbol{x}_{ij},\gamma_{i}] \\ &=& p^{*}_{ij}(\boldsymbol{\beta},\gamma_{i})=\exp(\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta}+\gamma_{i})/ [1+\exp(\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta}+\gamma_{i})], \end{array} $$
(1)

where it is customarily assumed that \(\gamma _{i} {\stackrel {iid}{\sim }} (0,\sigma ^{2}_{\gamma }).\) Notice that Eq. 1 is a marginal model for yij conditional on γi. Hence this model may be referred to as the marginal-conditional (MC) binary logistic model. This model in Eq. 1 is also known as the random effects model, where \(\sigma ^{2}_{\gamma }\) plays multiple roles. More specifically, depending on the distributional assumption of γi or \(p^{*}_{ij}(\gamma _{i}),\) (1) the unconditional mean, that is, E[Yij] = Pr[Yij = 1|xij], may or may not be a function of \(\sigma ^{2}_{\gamma };\) (2) similarly var[Yij] may or may not exhibit overdispersion (McCullagh and Nelder, 1989); but (3) because γi is the common cluster effect shared by all responses {yij and yik, for k ≠ j; j,k = 1,…,ni}, yij and yik are correlated, and this within cluster correlation must be a function of \(\sigma ^{2}_{\gamma }.\) For the third reason (3), \(\sigma ^{2}_{\gamma }\) may preferably be referred to as the cluster correlation parameter.
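Point (3) can be illustrated numerically. The following is a minimal sketch, not part of the cited works, which assumes that γi is normal and that the responses in a cluster are independent given γi, so that E[YijYik] is the expectation of the product of the two conditional probabilities in Eq. 1. The linear predictor values 0.3 and −0.5 and the function name `pair_moments` are hypothetical, and Gauss-Hermite quadrature is only one possible way to evaluate the integrals.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def expit(z):
    """Logistic function exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + np.exp(-z))


def pair_moments(eta_j, eta_k, sigma, n_nodes=60):
    """Means of Y_ij, Y_ik and their correlation when gamma_i ~ N(0, sigma^2).

    eta_j and eta_k stand for the fixed parts x_ij' beta and x_ik' beta
    (hypothetical values are used below).
    """
    nodes, weights = hermegauss(n_nodes)        # Gauss-Hermite rule for exp(-z^2 / 2)
    w = weights / np.sqrt(2.0 * np.pi)          # rescale to standard normal weights
    p_j = expit(eta_j + sigma * nodes)          # conditional probability p*_ij at each node
    p_k = expit(eta_k + sigma * nodes)          # conditional probability p*_ik at each node
    mu_j = np.sum(w * p_j)
    mu_k = np.sum(w * p_k)
    cov = np.sum(w * p_j * p_k) - mu_j * mu_k   # E[Y_ij Y_ik] - E[Y_ij] E[Y_ik]
    corr = cov / np.sqrt(mu_j * (1 - mu_j) * mu_k * (1 - mu_k))
    return mu_j, mu_k, corr


for sigma in (0.0, 0.5, 1.0, 2.0):
    mu_j, mu_k, corr = pair_moments(0.3, -0.5, sigma)
    print(f"sigma={sigma:.1f}  mu_j={mu_j:.4f}  mu_k={mu_k:.4f}  corr={corr:.4f}")
```

Under these assumptions the printed correlation is zero when σγ = 0 and grows with σγ, which is the sense in which \(\sigma ^{2}_{\gamma }\) acts as a cluster correlation parameter.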

As far as the applications of model (1) are concerned, there are many practical situations where one needs to analyze cluster specific binary data following (1). For example, in a chronic obstructive pulmonary disease (COPD) study (Liang et al. 1992; Ekholm et al. 1995; Sutradhar and Mukerjee, 2005), yij denotes the impaired pulmonary function (IPF) status (yes or no), and xij is the vector of covariates such as gender, race, age, and smoking status, for the j-th sibling of the i-th COPD patient. In this problem it is likely that the IPF status for the ni siblings is influenced by an unobservable random effect (γi) due to the i-th COPD patient. This common random effect makes the binary responses of any two siblings of the same patient correlated. It is of scientific interest to find the effects of the covariates on the binary responses (i.e., β in the mean function, E[Yij] = Pr[Yij = 1|xij]), after taking the within cluster correlations into account. Thus, it is desired to derive the formula for Pr[Yij = 1|xij] = E[Yij|xij] from Eq. 1 under a suitable distribution for γi, or using a nonparametric density. In this paper, we will confine our discussion to a parametric setup.

With regard to constructing a marginal (fixed or mixed) model for E[Yij] = Pr[Yij = 1|xij], we remark that because γi in Eq. 1 may be treated as an additive random covariate in the linear predictor \(\boldsymbol {x}^{\prime }_{ij}\boldsymbol {\beta }+\gamma _{i},\) it would be highly reasonable to assume that γi follows a normal (N) distribution, specifically \(\gamma _{i} {\stackrel {iid}{\sim }} N(0,\sigma ^{2}_{\gamma })\) (Breslow and Clayton, 1993; Lee and Nelder, 1996; Sutradhar, 2004). This normality assumption produces a marginal mixed effects (MME) model. For convenience in the discussion that follows, we name this model cluster model A (CM-A). Some studies such as Wang and Louis (2003) assume a so-called “bridge” distribution for γi, which provides a marginal fixed effects (MFE) model. We name this model CM-B-1. Some other studies such as Prentice (1986) (see also Haseman and Kuper, 1979) assumed a beta-binary distribution, which also produces a MFE model. We name this model CM-B-2.

There exists another group of studies (Zeger et al. (1988, Section 3.1), Neuhaus et al. (1991, Eqn. (4)), and Chen et al. (2011, Sections 2.1, 3.1)) which, without justifying how the cluster effect contributes to the modeling of the mean, variance and correlations, assume a subject specific (SS) arbitrary marginal fixed effects (AMFE) model for this mean function, and further assume that a user-chosen ‘working’ correlation structure can be used for the estimation of the marginal fixed effects parameter. Thus, in this approach both means and correlations have ‘working’ models. This is a naive approach, and it is bound to produce invalid (for example, inconsistent) regression and correlation estimates in many practical situations where the true means and correlations are generated by normal random cluster effects, γi. We name this naive/working model CM-C. Section 2 gives a brief review of the advantages and drawbacks of all four cluster models (CM-A, CM-B-1, CM-B-2, CM-C) and their respective inferences.

We now consider the clustered binary data in a longitudinal setup, where a cluster is formed with repeated responses from the same individual. For convenience, we consider I independent individuals, whereas in the cross-sectional cluster setup the same I was used to represent the total number of independent clusters. To form the i-th (i = 1,…,I) cluster for individual i with repeated responses, we assume that these responses are recorded over a small period of time T, such as T = 4 or 5 weeks/months/years. Hence we denote the binary response recorded at time t (t = 1,…,T) from the i-th individual by yit, whereas in the cross-sectional cluster setup yij, j = 1,…,ni, was used to represent the binary response from the j-th member of the i-th cluster. Next we denote by xit a time dependent covariate vector corresponding to yit. Here it is natural to expect that these repeated responses {yit, t = 1,…,T} will be correlated, most likely through a dynamic relationship similar to that of time series data.

Similar to the cross-sectional cluster setup, it is of primary interest in this longitudinal setup to find the effect of xit on the binary response yit. This is equivalent to computing the effect of xit on E[Yit|xit] = Pr[Yit = 1|xit]. Note however that, unlike in the cross-sectional setup, in some situations it may be of interest to find the effects of the past history on the current response yit. This is equivalent to computing the effect of the covariate history

$$ H_{i,t}(\cdot)\equiv [\boldsymbol{x}_{i1},\ldots,\boldsymbol{x}_{i,t-1},\boldsymbol{x}_{i,t}] $$

on yit, i.e., to compute E[Yit|Hit] = Pr[Yit = 1|Hit]. Suppose that the effect of xit on yit is measured by β, which plays a role similar to, but not the same as, its role in the cross-sectional case, where it represents the effect of xij on yij for the j-th individual in the cluster. On top of this difference, a major difference between the models in the two setups (cross-sectional and longitudinal clusters) arises because of the different nature of the binary responses under their respective clusters. More specifically, in the longitudinal setup, the correlation between yit and yi,t− 1, for t = 2,…,T, arises because these responses are likely to be directly related through a dynamic dependence relationship, whereas in the cross-sectional setup, yij and yik, for j ≠ k; j,k = 1,…,ni, are correlated because they share a common random cluster effect γi. For this reason, a large body of existing studies (Laird and Ware, 1982; Stiratelli et al. 1984; Neuhaus, 2002; Parzen et al. 2011), in which longitudinal binary responses are analyzed using random/mixed effects models, fails to accommodate longitudinal correlations adequately. In particular, these random effects based models have limited or no value for addressing the dynamic dependence among repeated responses.

As far as the marginal models for time specific binary means are concerned, there exist many situations where a marginal fixed effects (MFE) model involving only regression parameters (β) can be used for the marginal means. This is similar to the cross-sectional setup, but the correlation models are quite different under the two setups. For MFE based longitudinal models, one may refer to specific dynamic models such as the multivariate binary density (MBD) based model of Bahadur (1961) (see also Cox, 1972), the observation driven dynamic (ODD) model of Kanter (1975), and the linear dynamic conditional probability (LDCP) model of Zeger et al. (1985) (see Sutradhar (2011, Section 7.2) for details). All these models yield the marginal mean function, i.e., the formula for the unconditional means (E[Yit|xit] = Pr[Yit = 1|xit]) for all t = 1,…,T, in terms of the β parameter only. For our discussion involving a MFE model, we consider, for example, the AR(1) (auto-regressive order 1) type linear dynamic model from Zeger et al. (1985), given by

$$ \begin{array}{@{}rcl@{}} &&{}Pr[Y_{i1}=1|\boldsymbol{x}_{i1}]=\tilde{p}_{i1}(\boldsymbol{\beta}) \\ &&{}Pr[Y_{it}=1|y_{i,t-1},\boldsymbol{x}_{it},\boldsymbol{x}_{i,t-1}]=\tilde{p}_{it}(\boldsymbol{\beta})+ \rho(y_{i,t-1}-\tilde{p}_{i,t-1}), t=2,\ldots,T, \end{array} $$
(2)

with \(\tilde {p}_{it}(\boldsymbol {\beta })=\exp (\boldsymbol {x}^{\prime }_{it}\boldsymbol {\beta })/[1+\exp (\boldsymbol {x}^{\prime }_{it}\boldsymbol {\beta })],\) yielding the marginal means and variances as functions of β only; that is, they are free of the dynamic dependence parameter ρ. We refer to this MFE model as longitudinal model 1 (LM(1)), and express it as

$$ \begin{array}{@{}rcl@{}} && \text{LM(1): A marginal fixed effects (MFE) model} \\ &&E[Y_{it}]=Pr[Y_{it}=1]=\tilde{p}_{it}(\boldsymbol{\beta})=\exp(\boldsymbol{x}^{\prime}_{it}\boldsymbol{\beta})/ [1+\exp(\boldsymbol{x}^{\prime}_{it}\boldsymbol{\beta})] \\ && \text{var}[Y_{it}]=\tilde{p}_{it}(\boldsymbol{\beta}) (1-\tilde{p}_{it}(\boldsymbol{\beta})). \end{array} $$
(3)
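That the marginal means in Eq. 3 are free of ρ can be checked directly from the conditional model (2) by summing over all binary paths. The sketch below is illustrative only: the covariate matrix `x`, the vector `beta`, the value ρ = 0.3 and the function name `ldcp_marginal_means` are hypothetical choices, with ρ chosen so that every conditional probability in Eq. 2 stays between 0 and 1.

```python
import itertools
import numpy as np


def expit(z):
    """Logistic function exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + np.exp(-z))


def ldcp_marginal_means(p_tilde, rho):
    """Exact E[Y_t] under the linear dynamic model (2), by enumerating all 2^T paths."""
    T = len(p_tilde)
    means = np.zeros(T)
    total = 0.0
    for path in itertools.product([0, 1], repeat=T):
        # probability of the first response
        prob = p_tilde[0] if path[0] == 1 else 1.0 - p_tilde[0]
        for t in range(1, T):
            # conditional probability of Y_t = 1 given the previous response
            cond = p_tilde[t] + rho * (path[t - 1] - p_tilde[t - 1])
            if not 0.0 <= cond <= 1.0:
                raise ValueError("rho is outside the range allowed by the covariates")
            prob *= cond if path[t] == 1 else 1.0 - cond
        means += prob * np.array(path)
        total += prob
    return means, total


# hypothetical covariates x_it (T = 4) and regression parameter beta
x = np.array([[1.0, 0.2], [1.0, 0.5], [1.0, -0.3], [1.0, 0.8]])
beta = np.array([0.1, 0.7])
p_tilde = expit(x @ beta)

means, total = ldcp_marginal_means(p_tilde, rho=0.3)
print(np.round(means, 6))
print(np.round(p_tilde, 6))
```

For any admissible ρ the two printed vectors agree, since E[Yit] in model (2) reduces to the logistic mean by induction on t.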

There are, however, many other situations where the MFE models are not appropriate for the marginal means of the longitudinal binary data. This mostly happens when yit depends on the history Hit, rather than on xit alone. In this case, the marginal means will be functions of both β and the dynamic dependence parameter ρ. In the cross-sectional cluster setup, a MME (marginal mixed effects) model (CM-A), with means involving β and \(\sigma ^{2}_{\gamma },\) was used to represent the marginal means, but in the present longitudinal setup it is more appropriate to refer to the model as the MD (marginal dynamic) or MR (marginal recursive) model. More specifically, this MD/MR model for marginal means is derived from a non-linear conditional dynamic logit model (Sutradhar and Farrell, 2007) given by

$$ \begin{array}{@{}rcl@{}} &&{}Pr[Y_{i1}=1|\boldsymbol{x}_{i1}]=\tilde{p}_{i1}(\boldsymbol{\beta}) =\exp(\boldsymbol{x}^{\prime}_{i1}\boldsymbol{\beta})/[1+\exp(\boldsymbol{x}^{\prime}_{i1}\boldsymbol{\beta})] \\ &&{}Pr[Y_{it}=1|y_{i,t-1},\boldsymbol{x}_{it}]=\frac{\exp(\boldsymbol{x}^{\prime}_{it}\boldsymbol{\beta}+\rho y_{i,t-1})}{1+\exp(\boldsymbol{x}^{\prime}_{it}\boldsymbol{\beta}+\rho y_{i,t-1})}, t=2,\ldots,T, \end{array} $$
(4)

(see also Loredo-Osti and Sutradhar, 2012; Fokianos and Kedem, 2003 in a time series setup) yielding the marginal dynamic/recursive (MD/MR) model:

$$ \begin{array}{@{}rcl@{}} \text{LM(2)} &:& \text{marginal dynamic/recursive (MD/MR) model} \\ \mu_{i1}(\boldsymbol{\beta}) &=& E[Y_{i1}|\boldsymbol{x}_{i1}]=Pr[Y_{i1}=1|\boldsymbol{x}_{i1}]=\tilde{p}_{i1}(\boldsymbol{\beta}) \\ \mu_{it}(\boldsymbol{\beta},\rho) &=& E[Y_{it}|H_{it}]=\tilde{p}_{it}(\boldsymbol{\beta}) +\mu_{i,t-1}(\cdot)({\tilde{\tilde{p}}}_{it}(\boldsymbol{\beta},\rho)- \tilde{p}_{it}(\boldsymbol{\beta})), \quad t=2,\ldots,T, \end{array} $$
(5)

[Sutradhar and Farrell (2007), Sutradhar (2011, Section 7.7.2)] where \({\tilde {\tilde {p}}}_{it}(\boldsymbol {\beta },\rho ) =\exp (\boldsymbol {x}^{\prime }_{it}\boldsymbol {\beta }+\rho )/[1+\exp (\boldsymbol {x}^{\prime }_{it}\boldsymbol {\beta }+\rho )].\)
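The recursion in Eq. 5 follows from Eq. 4 by averaging the conditional probability over the two possible values of yi,t−1. As a hedged illustration, the sketch below computes the marginal means both by the recursion and by brute-force enumeration of all binary paths under Eq. 4. The linear predictor values in `eta`, the value ρ = 0.8 and the function names are hypothetical.

```python
import itertools
import numpy as np


def expit(z):
    """Logistic function exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + np.exp(-z))


def md_recursive_means(eta, rho):
    """Marginal means from the recursion in Eq. 5; eta[t] stands for x_it' beta."""
    p0 = expit(eta)          # conditional probability when the previous response is 0
    p1 = expit(eta + rho)    # conditional probability when the previous response is 1
    mu = np.empty_like(p0)
    mu[0] = p0[0]
    for t in range(1, len(eta)):
        mu[t] = p0[t] + mu[t - 1] * (p1[t] - p0[t])
    return mu


def md_enumerated_means(eta, rho):
    """Marginal means from Eq. 4 by summing over all 2^T binary paths."""
    T = len(eta)
    means = np.zeros(T)
    for path in itertools.product([0, 1], repeat=T):
        prob = 1.0
        prev = 0  # no lagged response enters at t = 1
        for t in range(T):
            p = expit(eta[t] + rho * prev)
            prob *= p if path[t] == 1 else 1.0 - p
            prev = path[t]
        means += prob * np.array(path)
    return means


eta = np.array([0.2, -0.1, 0.5, 0.3])   # hypothetical linear predictors, T = 4
print(np.round(md_recursive_means(eta, 0.8), 6))
print(np.round(md_enumerated_means(eta, 0.8), 6))
print(np.round(expit(eta), 6))           # logistic means that ignore rho
```

The first two printed vectors coincide, while the third differs from them for t ≥ 2 whenever ρ ≠ 0, which is why a MFE model is not appropriate for the marginal means in this setting.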

For the sake of completeness, we also include another MFE model (in addition to LM(1)) where, similar to CM-C in the cross-sectional cluster setup, an arbitrary MFE (AMFE) model is used for the marginal means in terms of β, and longitudinal correlations are not modeled but substituted by certain ‘working’ correlations (Liang and Zeger, 1986) for the inference about β. We name this AMFE based model LM(3). We briefly review these models LM(1), LM(2), and LM(3) in Section 3, along with available approaches for estimating their parameters. The advantages and drawbacks of these models and estimation approaches are also discussed.

Furthermore, because CM-A, as opposed to CM-B-1, CM-B-2 and CM-C in the cross-sectional cluster setup, shows under the normality assumption for the random cluster effect (γi) that the marginal means contain both β and \(\sigma ^{2}_{\gamma }\) parameters, we consider this general model further in Section 4 and demonstrate how to construct a GQL (generalized quasi-likelihood) approach, which is computationally simpler than the maximum likelihood (ML) approach, for the estimation of β for known \(\sigma ^{2}_{\gamma }.\) When \(\sigma ^{2}_{\gamma }\) is unknown we provide a consistent MM (method of moments) approach for its estimation. Asymptotic properties such as consistency of these GQL and MM estimators are also given in the same section. As far as the estimation of β and ρ under the general longitudinal model LM(2) is concerned, one may refer to Sutradhar and Farrell (2007) for their GQL and MM based estimation. In Section 5, we provide a brief review of the use of the GLMM (generalized linear mixed model) in a Bayesian framework for inferences for correlated binary data in both cluster and longitudinal setups. Apart from computational complexity, it is outlined that because the random effects based models, in general, do not produce time lag dependent correlations, choosing the necessary proper prior distributions under the longitudinal setup may be problematic. To tackle this problem to some extent, there appear to be a few studies using dynamic models for longitudinal binary data in a Bayesian framework. This approach is discussed in brief as well. The paper concludes in Section 6.

2 Existing Marginal Models and Estimation for Cross-Sectional Clustered Binary Data

2.1 CM-A: Population Average (PA) Based Marginal Mixed Effects (MME) Models

Refer to the marginal-conditional model (1), which is written by adding the random cluster effect γi to the linear predictor \(\boldsymbol {x}^{\prime }_{ij}\boldsymbol {\beta }\) used in the binary logistic probability function. This model is also known as a random effects model for binary data. Under the assumption that γi has a suitable probability distribution with probability density \(g_{D}(\gamma _{i}|0,\sigma ^{2}_{\gamma }),\) one may then write the likelihood function as

$$ \begin{array}{@{}rcl@{}} &&L(\boldsymbol{\beta},\sigma^{2}_{\gamma})={\Pi}^{I}_{i=1}Pr((y_{i1},\ldots,y_{ij},\ldots,y_{in_{i}}) |\boldsymbol{\beta},\sigma^{2}_{\gamma}) \\ &=&{\Pi}^{I}_{i=1}{\int}_{\gamma_{i}}{\Pi}^{n_{i}}_{j=1}Pr(y_{ij}|\gamma_{i})g_{D}(\gamma_{i})d\gamma_{i} ={\Pi}^{I}_{i=1}{\int}_{\gamma_{i}}\\ &&{\Pi}^{n_{i}}_{j=1}[p^{*}_{ij}(\boldsymbol{\beta},\gamma_{i})]^{y_{ij}} [1-p^{*}_{ij}(\boldsymbol{\beta},\gamma_{i})]^{1-y_{ij}}g_{D}(\gamma_{i})d\gamma_{i} \\ &=&{\Pi}^{I}_{i=1}\int\frac{\exp\{{\sum}^{n_{i}}_{j=1}y_{ij}(\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta} +\gamma_{i})\}}{{\Pi}^{n_{i}}_{j=1}\{1+\exp(\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta}+\gamma_{i})\}} g_{D}(\gamma_{i})d\gamma_{i}, \end{array} $$
(6)

which, along with modifications such as penalized quasi-likelihood and hierarchical likelihood, has been exploited by many researchers over the last four decades under various forms of gD(⋅), mainly for the estimation of β and \(\sigma ^{2}_{\gamma }.\) Among the various forms of gD(⋅), the normality based gN(γi) is widely used. See, for example, Stiratelli et al. (1984, Eqn. (3.1)), Breslow and Clayton (1993), Lee and Nelder (1996), Sutradhar and Mukerjee (2005). Some authors have used a specialized “bridge” (b) distribution with density, say gb(⋅) (Wang and Louis (2003, Eqns. (4.1)-(4.2)), Parzen et al. (2011)), which, unlike the normal distribution (gN(⋅)), yields a marginal fixed effects model for the marginal means. But this “bridge” distribution appears to be restrictive and too technical for practical use. When gD(⋅) ≡ gN(⋅), one obtains a MME (marginal mixed effects) based mean model given by

$$ \begin{array}{@{}rcl@{}} &&E[Y_{ij}]=Pr[Y_{ij}=1|\boldsymbol{x}_{ij}] j=1,\ldots,n_{i} \\ &=&{\int}_{\gamma_{i}}p^{*}_{ij}(\boldsymbol{\beta},\gamma_{i})g_{N}(\gamma_{i})d\gamma_{i} =\int \left[\frac{\exp(\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta}+\gamma_{i})} {[1+\exp(\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta}+\gamma_{i})]}\right]dG_{N}(\gamma_{i},\sigma^{2}_{\gamma}) \\ &=&\mu_{ij}(\boldsymbol{\beta},\sigma^{2}_{\gamma}), \text{(say), for all} j=1,\ldots,n_{i} \end{array} $$
(7)
$$ \begin{array}{@{}rcl@{}} & \neq& \frac{\exp(\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta})}{[1+\exp(\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta})]}=p_{ij}(\boldsymbol{\beta}). \end{array} $$
(8)
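When gD(⋅) ≡ gN(⋅), the integral in the likelihood (6) has no closed form, but each cluster contributes only a one-dimensional integral. The following minimal sketch evaluates it by Gauss-Hermite quadrature; this is one possible numerical scheme chosen for illustration and is not claimed to be the method used in the cited works. The function names, the number of nodes and the example values are hypothetical.

```python
import itertools
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def cluster_likelihood(y, eta, sigma, n_nodes=40):
    """Approximate one cluster's factor in Eq. 6 under gamma_i ~ N(0, sigma^2).

    y   : array of 0/1 responses in the cluster
    eta : array of fixed parts x_ij' beta, one per cluster member
    """
    nodes, weights = hermegauss(n_nodes)
    w = weights / np.sqrt(2.0 * np.pi)            # standard normal quadrature weights
    # rows index quadrature nodes, columns index cluster members
    p = 1.0 / (1.0 + np.exp(-(eta[None, :] + sigma * nodes[:, None])))
    kernel = np.prod(np.where(y[None, :] == 1, p, 1.0 - p), axis=1)
    return float(np.sum(w * kernel))


def log_likelihood(beta, sigma, X_list, y_list):
    """Log of Eq. 6: sum of log cluster factors over independent clusters."""
    return sum(np.log(cluster_likelihood(y, X @ beta, sigma))
               for X, y in zip(X_list, y_list))


# sanity check on a hypothetical cluster of size 3: the 2^3 outcome probabilities sum to 1
eta = np.array([0.2, -0.4, 1.0])
total = sum(cluster_likelihood(np.array(y), eta, 1.3)
            for y in itertools.product([0, 1], repeat=3))
print(round(total, 10))
```

A general purpose optimizer could then be applied to `log_likelihood` as a function of β and σγ, although the cost of doing so grows with the number of clusters and quadrature nodes.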

Notice that the gN(⋅) based likelihood estimates for β and \(\sigma ^{2}_{\gamma }\) (Sutradhar and Mukerjee, 2005), obtained by maximizing the likelihood function (6), can be used in the marginal mean \(\mu _{ij}(\boldsymbol {\beta },\sigma ^{2}_{\gamma })\) to interpret the effects of xij on the binary mean response E[Yij] = Pr[Yij = 1|xij]. However, some studies attempt to estimate β in pij(β) and interpret the effects of xij on the mean response. But clearly this would be an incorrect or inconsistent estimate under normal random cluster effects, as β in Eq. 7 cannot be estimated without estimating \(\sigma ^{2}_{\gamma }\) at least consistently. This is also evident from Zeger et al. (1988): under normality, the population average (PA) based β, that is, βPA in \(\mu _{ij}({\boldsymbol {\beta }}^{PA},\sigma ^{2}_{\gamma })\) in Eq. 7, has an approximate relationship with the subject specific (SS) β, i.e., βSS in \(p_{ij}({\boldsymbol {\beta }}^{SS})\) in Eq. 8, as

$$ \begin{array}{@{}rcl@{}} {\boldsymbol{\beta}}^{PA} \approx {\boldsymbol{\beta}}^{SS}/[\sqrt{1+\left( \frac{16}{15}\right)^{2} \frac{3}{\pi^{2}}\sigma^{2}_{\gamma}}]. \end{array} $$
(9)

Thus, the desired βPA cannot be estimated without estimating \(\sigma ^{2}_{\gamma }\) under the present cluster setup.
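The scaling factor in Eq. 9 can be examined numerically. The sketch below, which is an illustration under stated assumptions rather than a result from the cited works, compares the normal-mixture mean in Eq. 7 (evaluated by Gauss-Hermite quadrature) with a logistic function whose linear predictor is divided by the square root factor of Eq. 9. The linear predictor values, the σγ values and the function names are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# constant multiplying sigma^2 under the square root in Eq. 9
C2 = (16.0 / 15.0) ** 2 * 3.0 / np.pi ** 2


def expit(z):
    """Logistic function exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + np.exp(-z))


def marginal_mean(eta, sigma, n_nodes=60):
    """Integral of expit(eta + gamma) over gamma ~ N(0, sigma^2), as in Eq. 7."""
    nodes, weights = hermegauss(n_nodes)
    w = weights / np.sqrt(2.0 * np.pi)
    return float(np.sum(w * expit(eta + sigma * nodes)))


def attenuated_logistic(eta, sigma):
    """Logistic function of eta divided by the square root factor in Eq. 9."""
    return float(expit(eta / np.sqrt(1.0 + C2 * sigma ** 2)))


for eta in (-2.0, 0.0, 1.5):
    for sigma in (0.5, 1.5):
        exact = marginal_mean(eta, sigma)
        approx = attenuated_logistic(eta, sigma)
        naive = float(expit(eta))
        print(f"eta={eta:+.1f} sigma={sigma:.1f}  "
              f"mixture={exact:.4f}  attenuated={approx:.4f}  unscaled={naive:.4f}")
```

In such a comparison the attenuated logistic value tracks the mixture mean closely, whereas the unscaled logistic value drifts away from it as σγ grows, which illustrates why \(\sigma ^{2}_{\gamma }\) cannot be ignored.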

However, there remain two relatively complex issues in this β = βPA estimation. First, \(\mu _{ij}(\boldsymbol {\beta },\sigma ^{2}_{\gamma })\) is an implicit function, hence it is not easy to interpret the role of β in this mean function in the presence of an estimate of \(\sigma ^{2}_{\gamma }.\) Second, the likelihood estimation for β and \(\sigma ^{2}_{\gamma }\) is complex. As a remedy, following a binomial approximation (BA) to the normal distribution of γi (Sutradhar (2011, Chapter 5, Eqn. (4.24))), one may compute this mean function μij(⋅) as follows and interpret it as a function of β for given \(\sigma ^{2}_{\gamma }.\) Use \(\gamma ^{*}_{i}=\gamma _{i}/\sigma _{\gamma }\) in Eq. 7 and express \(p^{*}_{ij}(\boldsymbol {\beta },\gamma _{i})\) as \(p^{*}_{ij}(\boldsymbol {x}_{ij};\boldsymbol {\beta },\sigma _{\gamma } \gamma ^{*}_{i}).\) Further consider vi as a binomial variable with parameters V and 1/2, i.e., \(v_{i} \sim \text {binomial} (V,1/2).\) Next using

$$ \gamma^{*}_{i}=\frac{v_{i}-V(1/2)}{\sqrt{V(1/2)(1/2)}}\equiv h(v_{i}) $$
(10)

we may express the MME based mean function in Eq. 7 as

$$ \begin{array}{@{}rcl@{}} \mu^{BA}_{ij}(\boldsymbol{\beta},\sigma^{2}_{\gamma}) ={\sum}^{V}_{v_{i}=0}p^{*}_{ij}(\boldsymbol{x}_{ij};\boldsymbol{\beta},\sigma_{\gamma} h(v_{i}))\begin{pmatrix}V \\ v_{i}\end{pmatrix}(1/2)^{v_{i}}(1/2)^{V-v_{i}}, \end{array} $$
(11)

where V is assumed to be relatively large, such as V = 10. Note that the formula in Eq. 11 for the computation of the BA-based individual specific mean function is explicit, whereas the mean function was implicitly defined in Eq. 7. Thus one may substitute the likelihood estimates of β and \(\sigma ^{2}_{\gamma },\) obtained by exploiting the gN(γi)-based likelihood function (6) (Sutradhar and Mukerjee, 2005), into the mean formula in Eq. 11, and easily examine/interpret the effects of the individual covariate xij on the binary mean function Pr[Yij = 1|xij] = E[Yij|xij]. We remark however that, as the likelihood estimation is relatively complex, in Section 4 we demonstrate how one can develop a GQL approach (which produces consistent and highly efficient estimates, the ML estimate being optimal) for the estimation of the main regression parameters, and a MM approach for consistent estimation of \(\sigma ^{2}_{\gamma }.\) These GQL and MM approaches exploit moments of the clustered binary data up to order 2, containing all squared and pairwise (from 2 individuals in the cluster) products of the binary responses. The following section provides a brief discussion of some other (than ML and GQL) existing estimation approaches along with their limitations.
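The explicit formula in Eq. 11 is straightforward to evaluate. As a rough check, the sketch below compares the binomial approximation with a Gauss-Hermite evaluation of the integral in Eq. 7. The quadrature benchmark, the function names and the values of the linear predictor and σγ are assumptions made for illustration only.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def mean_binomial_approx(eta, sigma, V=10):
    """Eq. 11: average of the conditional probability over v ~ binomial(V, 1/2)."""
    v = np.arange(V + 1)
    h = (v - V / 2.0) / math.sqrt(V / 4.0)                  # standardised v, Eq. 10
    probs = np.array([math.comb(V, int(k)) for k in v]) * 0.5 ** V
    p_star = 1.0 / (1.0 + np.exp(-(eta + sigma * h)))       # p*_ij at gamma = sigma * h(v)
    return float(np.sum(probs * p_star))


def mean_quadrature(eta, sigma, n_nodes=60):
    """Benchmark for Eq. 7: Gauss-Hermite integration over gamma ~ N(0, sigma^2)."""
    nodes, weights = hermegauss(n_nodes)
    w = weights / np.sqrt(2.0 * np.pi)
    return float(np.sum(w / (1.0 + np.exp(-(eta + sigma * nodes)))))


eta, sigma = 0.5, 1.0   # hypothetical x_ij' beta and sigma_gamma
for V in (5, 10, 50):
    print(V, round(mean_binomial_approx(eta, sigma, V), 5),
          round(mean_quadrature(eta, sigma), 5))
```

Since the standardized binomial variable has mean 0 and variance 1 for every V, the approximation is already reasonable for moderate V in this example.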

2.1.1 Some Highly Competing Estimation Approaches in the Cross-Sectional Cluster Setup and their Drawbacks

A BLUP (Best Linear Unbiased Prediction) Approach

Under normality, i.e., when gD(γi) ≡ gN(γi) in Eq. 6, many authors such as Stiratelli et al. (1984, Eqn. (3.1)), Schall (1991), Karim and Zeger (1992), Breslow and Clayton (1993), McGilchrist (1994), Kuk (1995), Lin and Breslow (1996), and Lee and Nelder (1996) have used a BLUP analogue estimation approach, where cluster/familial random effects are treated as fixed effects [Henderson (1963)] and the regression and variance components of the mixed model (6) are estimated based on the so-called estimates of the random effects. Because γi has to be estimated using the data from the i-th cluster only, in general this BLUP procedure may yield a biased estimate of γi, especially when the size of the i-th cluster is small. This may subsequently produce biased regression and variance estimates, the variance estimates being more adversely affected than the regression estimates. In order to remove biases in the estimates, Kuk (1995) and Lin and Breslow (1996), among others, provided certain asymptotic bias corrections both for the regression and the variance component estimates. However, as Breslow and Lin (1995, p. 90) have shown, the bias corrections appear to improve the asymptotic performance of the uncorrected quantities only when the true variance component is small, more specifically, less than or equal to 0.25. In practice, the variance component can be much larger. We further remark that the above BLUP analogue approaches essentially use a likelihood technique for the present non-linear binary regression analysis. For example, Breslow and Clayton (1993) specifically use a PQL (penalized quasi-likelihood) approach, and similarly Lee and Nelder (1996) use a HL (hierarchical likelihood) approach. These two approaches are similar, because in the first step both the PQL and HL approaches estimate the regression parameters and the random effects. The difference between the two approaches is that the PQL approach estimates them by maximizing a penalized quasi-likelihood function, whereas the HL approach maximizes a hierarchical likelihood function. In the second step, in estimating the variance of the random effects, the PQL approach maximizes a profile quasi-likelihood function, whereas the HL approach maximizes an adjusted profile hierarchical likelihood function. Thus both approaches encounter biases in the estimates in a similar way.

Another major drawback of the above mentioned BLUP oriented likelihood approaches is that no attempt is made to compute the marginal means from the respective likelihood function, whereas this computation of the marginal means is essential to interpret the effects of the covariates xij on the marginal means \(Pr[Y_{ij}=1|\boldsymbol {x}_{ij}]=E[Y_{ij}|\boldsymbol {x}_{ij}]=\mu _{ij} (\boldsymbol {x}_{ij};\boldsymbol {\beta },\sigma ^{2}_{\gamma }).\)

2.2 CM-B-1: Subject Specific (SS) Marginal Fixed Effects (MFE) Model Based on “bridge” Random Cluster Effects

In some situations depending on the assumption about the distribution of the random effects (γi), the PA-based mixed model may yield a fixed effects model for the marginal means. More specifically, by using a slightly different (than \(p^{*}_{ij}(\boldsymbol {\beta },\gamma _{i})\) in Eq. 7) marginal-conditional probability given by

$$ \begin{array}{@{}rcl@{}} &&Pr[Y_{ij}=1|\boldsymbol{x}_{ij},\gamma_{i}]=p^{**}_{ij}(\boldsymbol{\beta},\phi(\sigma^{2}_{\gamma}),\gamma_{i}) \\ &=&\exp(\{\phi(\sigma^{2}_{\gamma})\}^{-1}\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta}+\gamma_{i})/ [1+\exp(\{\phi(\sigma^{2}_{\gamma})\}^{-1}\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta}+\gamma_{i})],\\ 0&<&\phi(\sigma^{2}_{\gamma})<1, \end{array} $$
(12)

Wang and Louis (2003, Eqn. (4.2)) have shown that

$$ \begin{array}{@{}rcl@{}} &&Pr[Y_{ij}=1|\boldsymbol{x}_{ij}]={\int}^{\infty}_{-\infty}p^{**}_{ij} (\boldsymbol{\beta},\gamma_{i})g_{D}(\gamma_{i})d\gamma_{i} \\ &=&p_{ij}(\boldsymbol{\beta}) =\exp(\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta})/[1+\exp(\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta})], \end{array} $$
(13)

a MFE (marginal fixed effects) based model involving only β parameters, when

$$ g_{D}(\gamma_{i}) \Rightarrow g_{b}(\gamma_{i}), $$

gb(γi) being the so-called “bridge” density of the form

$$ \begin{array}{@{}rcl@{}} &&{}g_{b}(\gamma_{i})=\frac{1}{2\pi}\frac{\sin(\phi \pi)}{\cosh(\phi \gamma_{i})+\cos(\phi \pi)}; \quad 0<\phi(\sigma^{2}_{\gamma})<1, \quad -\infty <\gamma_{i} < \infty, \end{array} $$
(14)

where ϕ is related to \(\sigma ^{2}_{\gamma }\) through the relationship, \(\sigma ^{2}_{\gamma }=\pi ^{2}(\phi ^{-2}-1)/3.\)
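The three properties claimed for the bridge density, namely that it integrates to one, that its variance is π²(ϕ⁻² − 1)/3, and that mixing the conditional logistic probability over it returns the logistic form in Eq. 13, can be checked by simple numerical integration. The sketch below assumes the conditional probability in Eq. 12 has the usual logistic form exp(⋅)/[1 + exp(⋅)]; the values ϕ = 0.6 and 0.8 for the linear predictor, the integration grid and the function names are hypothetical choices.

```python
import numpy as np


def expit(z):
    """Logistic function exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + np.exp(-z))


def bridge_density(gamma, phi):
    """Bridge density of Eq. 14 for 0 < phi < 1."""
    return np.sin(phi * np.pi) / (
        2.0 * np.pi * (np.cosh(phi * gamma) + np.cos(phi * np.pi)))


phi = 0.6    # hypothetical value of phi
eta = 0.8    # hypothetical value of x_ij' beta

# uniform grid wide enough that the exponentially decaying tails are negligible
grid = np.linspace(-150.0, 150.0, 60001)
dx = grid[1] - grid[0]
g = bridge_density(grid, phi)

total = np.sum(g) * dx                               # should be close to 1
variance = np.sum(grid ** 2 * g) * dx                # should match pi^2 (phi^-2 - 1) / 3
marginal = np.sum(expit(eta / phi + grid) * g) * dx  # should match expit(eta), Eq. 13

print(total)
print(variance, np.pi ** 2 * (phi ** -2 - 1.0) / 3.0)
print(marginal, expit(eta))
```

The same check can be repeated for other values of ϕ in (0, 1), widening the grid as ϕ approaches zero because the tails of the density then decay more slowly.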

We remark that the MFE model for the marginal means given in Eq. 13 is simpler than the MME model (7) to interpret the effects of the covariates xij on the SS binary means E[Yij|xij] as it is \(\sigma ^{2}_{\gamma }\) free. Also the likelihood estimate of β obtained by exploiting the likelihood function

$$ \begin{array}{@{}rcl@{}} L(\boldsymbol{\beta},\phi/\sigma^{2}_{\gamma}) ={\Pi}^{I}_{i=1}\int\frac{\exp\{{\sum}^{n_{i}}_{j=1}y_{ij}(\phi^{-1}\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta} +\gamma_{i})\}}{{\Pi}^{n_{i}}_{j=1}\{1+\exp(\phi^{-1}\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta}+\gamma_{i})\}} g_{b}(\gamma_{i})d\gamma_{i}, \end{array} $$
(15)

can be used for β in the MFE model (13). This is because the MFE model (13) can be obtained from the joint probability function used in the likelihood function (15). That is, βPA = βSS.

However, some of the major drawbacks of this “bridge” random effects based fixed model are:

(i) Notice that γi involved in the linear mixed predictor in the conditional probability function in Eq. 12 may be treated as a random covariate, whereas xij’s are known to be fixed covariates. As far as its distributional properties are concerned, even though the bridge distribution (14) (which has a complex trigonometrical ratio form) technically yields the marginal fixed effects model, the suitability of this distributional assumption, as opposed to the normality assumption (e.g., Breslow and Clayton, 1993; Lee and Nelder, 1996 in GLMM setup) in practical contexts, is not discussed adequately in the literature.

(ii) Even though the MFE model (13) does not contain \(\phi /\sigma ^{2}_{\gamma },\) this parameter has to be estimated anyway, as it is contained in the likelihood function and in any possible correlation structure. Moreover, the likelihood estimation using the likelihood function (15) would be much more complex than using the normal cluster effects based likelihood function.

(iii) The β parameter in the MFE model (13) could be estimated using a QL (quasi-likelihood) approach rather than the complicated likelihood approach, provided one could compute the pair-wise correlations among the clustered binary responses. The computation of these correlations appears to be complicated under this “bridge” cluster effects assumption.

2.3 CM-B-2: SS Marginal Fixed Effects (MFE) Model Based on Beta-Binary Random Clustered Probability Function

Similar to the CM-B-1 model (Wang and Louis, 2003, 2004), there exist some early studies (Prentice, 1986; Haseman and Kuper, 1979) in the context of clustered binary regression analysis in a longitudinal setup, where, in order to obtain a MFE model for binary means, an extended assumption about the distribution of a function of γi, specifically of \(p^{*}_{ij}(\boldsymbol {\beta },\gamma _{i}),\) was used. For a bounded scale parameter τ, taken as a function of \(\sigma ^{2}_{\gamma },\) say \(\tau (\sigma ^{2}_{\gamma }),\) satisfying the range \(0<\tau (\sigma ^{2}_{\gamma })<1,\) suppose that \(p^{*}_{ij}(\boldsymbol {\beta },\gamma _{i})\) in Eq. 1 (see also Eq. 7) follows a beta-distribution (\(\tilde {g}_{B}\)) of first kind with parameters \((\{\tau (\sigma ^{2}_{\gamma })\}^{-1}-1)p_{ij}(\boldsymbol {\beta })\) and \((\{\tau (\sigma ^{2}_{\gamma })\}^{-1}-1)q_{ij}(\boldsymbol {\beta }),\) with qij(β) = 1 − pij(β). More specifically,

$$ \begin{array}{@{}rcl@{}} &&{}{\text{Distributional assumption for the random logistic function }p^{*}_{ij}(\boldsymbol{\beta},\gamma_{i})} : \\ &&{}{\text{ A beta-distribution of first kind }} \\ &&{}\tilde{g}_{B}(p^{*}_{ij};\tau,p_{ij}(\boldsymbol{\beta}))=\frac{{p^{*}_{ij}}^{(\tau^{-1}-1)p_{ij}-1} (1-p^{*}_{ij})^{(\tau^{-1}-1)q_{ij}-1}}{B((\tau^{-1}-1)p_{ij},(\tau^{-1}-1)q_{ij})}; 0\le p^{*}_{ij} \le 1, \end{array} $$
(16)

which yields the marginal probability

$$ \begin{array}{@{}rcl@{}} &&Pr[Y_{ij}=1|\boldsymbol{x}_{ij}]={{\int}^{1}_{0}}p^{*}_{ij}(\boldsymbol{\beta},\gamma_{i}) \tilde{g}_{B}(p^{*}_{ij})dp^{*}_{ij} \\ & &=p_{ij}(\boldsymbol{\beta}) =\exp(\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta})/[1+\exp(\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta})], \end{array} $$
(17)

as in Eq. 13 under CM-B-1 model, which is the same as the marginal probability in Eq. 8.
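The step from Eq. 16 to Eq. 17 uses only the mean of a beta distribution: with first parameter a = (τ⁻¹ − 1)pij and second parameter b = (τ⁻¹ − 1)qij, the mean a/(a + b) equals pij. By the standard beta variance formula the variance of the random probability is then τ pij qij. The sketch below checks both moments analytically and by simulation; the values of τ and of the linear predictor, the seed and the function names are hypothetical.

```python
import numpy as np


def beta_parameters(p, tau):
    """Parameters of the beta distribution in Eq. 16 for mean p and scale tau."""
    scale = 1.0 / tau - 1.0
    return scale * p, scale * (1.0 - p)


def beta_mean_var(a, b):
    """Mean and variance of a beta(a, b) variable (standard formulas)."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var


tau = 0.3                              # hypothetical scale parameter, 0 < tau < 1
p = 1.0 / (1.0 + np.exp(-0.4))         # p_ij for a hypothetical x_ij' beta = 0.4
a, b = beta_parameters(p, tau)
mean, var = beta_mean_var(a, b)

rng = np.random.default_rng(0)
draws = rng.beta(a, b, size=200_000)   # simulated values of the random probability

print(mean, p)                         # analytic mean equals p_ij
print(var, tau * p * (1.0 - p))        # analytic variance equals tau * p_ij * q_ij
print(draws.mean(), draws.var())       # Monte Carlo counterparts
```

The marginal mean therefore involves β only, while τ enters through the variance of the random probability.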

Consequently, under this mixed model approach, one may examine the effects of xij on the marginal response means E[Yij|xij] = Pr[Yij = 1|xij] by computing β parameter involved in the simpler MFE model Eq. 17, i.e., in \(p_{ij}(\boldsymbol {\beta }) =\exp (\boldsymbol {x}^{\prime }_{ij}\boldsymbol {\beta })/[1+\exp (\boldsymbol {x}^{\prime }_{ij}\boldsymbol {\beta })].\) This estimation can be achieved by maximizing the likelihood function

$$ \begin{array}{@{}rcl@{}} L(\boldsymbol{\beta},\tau)&=&{\Pi}^{I}_{i=1}{\Pi}^{n_{i}}_{j=1}{{\int}^{1}_{0}} \frac{{p^{*}_{ij}}^{\{(\tau^{-1}-1)p_{ij}+y_{ij}\}-1} (1 - p^{*}_{ij})^{\{(\tau^{-1}-1)q_{ij}+1-y_{ij}\}-1}}{B((\tau^{-1}-1)p_{ij},(\tau^{-1}-1)q_{ij})} dp^{*}_{ij} \\ &=&{\Pi}^{I}_{i=1}{\Pi}^{n_{i}}_{j=1}\frac{\Gamma\{(\tau^{-1}-1)p_{ij}+y_{ij}\}\, \Gamma\{(\tau^{-1}-1)q_{ij}+1-y_{ij}\}} {\Gamma\{(\tau^{-1} - 1)+1\}\,B((\tau^{-1} - 1)p_{ij},(\tau^{-1} - 1)q_{ij})}, \end{array} $$
(18)

with respect to β and τ.

However, some of the major drawbacks of this beta-binary approach based marginal fixed model are:

(i) The likelihood estimation by exploiting the likelihood function (18) is complex. See, for example, Sutradhar and Das (1997), for an approximate QL approach estimation in a similar setup.

(ii) The assumption that the whole conditional probability \(p^{*}_{ij}(\boldsymbol {\beta },\gamma _{i})=\exp (\boldsymbol {x}^{\prime }_{ij}\boldsymbol {\beta }+\gamma _{i}) /[1+\exp (\boldsymbol {x}^{\prime }_{ij}\boldsymbol {\beta }+\gamma _{i})]\) in Eq. 1 follows a beta distribution, rather than assuming a distribution for γi such as normality, appears to be too restrictive and hence it may be impractical, in order to obtain a marginal fixed model.

(iii) The pair-wise correlations among the clustered responses are not understood as they may not be easy to compute. This is because such a computation first requires the computation of the correlations between \(p^{*}_{ij}(\boldsymbol {\beta },\gamma _{i}) =\exp (\boldsymbol {x}^{\prime }_{ij}\boldsymbol {\beta }+\gamma _{i})/[1+\exp (\boldsymbol {x}^{\prime }_{ij}\boldsymbol {\beta }+\gamma _{i})]\) and \(p^{*}_{ik}(\boldsymbol {\beta },\gamma _{i})=\exp (\boldsymbol {x}^{\prime }_{ik}\boldsymbol {\beta }+\gamma _{i}) /[1+\exp (\boldsymbol {x}^{\prime }_{ik}\boldsymbol {\beta }+\gamma _{i})]\) for j ≠ k; j,k = 1,…,ni, which is not possible without making further joint, say bivariate, distributional assumptions for \(p^{*}_{ij}\) and \(p^{*}_{ik}.\)

(iv) Even though the ML approach may give an estimate for τ, estimating \(\sigma ^{2}_{\gamma },\) the cluster variance, is not possible without knowing the specific relationship between τ and \(\sigma ^{2}_{\gamma },\) \(\tau (\sigma ^{2}_{\gamma })\) being currently an implicit function only.

2.4 CM-C: A SS Arbitrary Marginal Fixed Effects (AMFE) Model

Some authors, for example Zeger et al. (1988, Section 3.1) in an early study, considered a clustered binary data analysis and suggested using the MFE model for the marginal means, given by

$$ \begin{array}{@{}rcl@{}} \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!&&E[Y_{ij}] = Pr[Y_{ij} = 1|\boldsymbol{x}_{ij}] = p_{ij}(\boldsymbol{\beta}) = \exp(\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta})/ [1+\exp(\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta})], \end{array} $$
(19)

for all j = 1,…,ni. Because the clustered responses are correlated, for inferences about β, these authors have suggested the use of a ‘working’ correlation structure based GEE (generalized estimating equations) approach discussed by Liang and Zeger (1986). It is clear that neither the means nor the correlations were modeled under this approach. Hence, the MFE model in Eq. 19 is purely an arbitrary fixed effects model. Notice that even though this model in Eq. 19 appears to be the same as the marginal models in Eqs. 13 and 17, it is, however, an assumed model whose connection with the marginal-conditional model (1) is not shown, whereas the models in Eqs. 13 and 17 were derived from Eq. 1 under certain distributional (“bridge” and beta-binary) assumptions for the cluster effects γi. Furthermore, as shown by Eq. 7 (see also Eq. 11), this marginal model (19) cannot be derived from Eq. 1 under the normality assumption for γi, and in such normal based cases, the MFE model (19) would produce biased and hence inconsistent regression estimates because \(\sigma ^{2}_{\gamma }\) is ignored in the marginal mean function. By the same token, we remark that the marginal model (19) suggested by Zeger et al. (1988, Section 3.1) gives the wrong impression that it can be used for any clustered correlated binary data. This impression also appears in a recent paper by Chen et al. (2011, Section 2.1), where this marginal model (19) was used under a clustered correlated ‘response process’ without misclassification, and was generalized for a possible ‘misclassification process’. The difference between their models is that Zeger et al. (1988, Section 3.1) suggested a ‘working’ correlation structure to construct their GEE, whereas Chen et al. (2011, Section 2.1) suggested a ‘working’ odds ratio based ‘working’ covariance (or bivariate probability) structure to develop the GEE for β estimation.
However, in a longitudinal setup for binary data, it is well known that these GEE approaches may produce inefficient estimates as compared to the so-called independence assumption based simpler MM and QL approaches (Sutradhar and Das (1999), Sutradhar (2011, Section 7.3.6), Sutradhar and Zheng (2018), Sutradhar (2014, Section 4.2)), which is a serious inference drawback.

To get a feel for the possible adverse performance of the odds ratio based GEE approach in the present cross-sectional cluster setup, we consider the most likely practical case with normal random cluster effects \((\gamma _{i} \sim N(0,\sigma ^{2}_{\gamma }))\) as discussed in Section 2.1, and compute the odds ratio as follows to examine whether one can express the log of this odds ratio in a linear form as suggested in Chen et al. (2011, Section 2.1). Because the odds ratio for yij and yik (j ≠ k; j,k = 1,…,ni) has the formula

$$ \begin{array}{@{}rcl@{}} \psi_{ijk}&=&\frac{Pr(Y_{ij}=1,Y_{ik}=1)Pr(Y_{ij}=0,Y_{ik}=0)} {Pr(Y_{ij}=1,Y_{ik}=0)Pr(Y_{ij}=0,Y_{ik}=1)}, \end{array} $$
(20)

we compute these joint probabilities involved in Eq. 20, by exploiting the independence of yij and yik conditional on γi in Eq. 1 and then taking population average over the normal distribution of γi. More specifically, for \(\gamma ^{*}_{i}=\frac {v_{i}-V(1/2)}{\sqrt {V(1/2)(1/2)}}\equiv h(v_{i})\) as in Eq. 10, following Eq. 11, we write

$$ \begin{array}{@{}rcl@{}} \lambda^{(1,1)}_{ijk}(\boldsymbol{\beta},\sigma^{2}_{\gamma})&=&Pr(Y_{ij}=1,Y_{ik}=1) \\ &=&{\sum}^{V}_{v_{i}=0}\frac{\exp((\boldsymbol{x}^{\prime}_{ij}+\boldsymbol{x}^{\prime}_{ik})\boldsymbol{\beta}+2\sigma_{\gamma} h(v_{i}))} {[1+\exp(\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta}+\sigma_{\gamma} h(v_{i}))][1+\exp(\boldsymbol{x}^{\prime}_{ik}\boldsymbol{\beta}+\sigma_{\gamma} h(v_{i}))]} \begin{pmatrix}V \\ v_{i}\end{pmatrix}(1/2)^{v_{i}}(1/2)^{V-v_{i}}, \end{array} $$
(21)

$$ \begin{array}{@{}rcl@{}} \lambda^{(0,0)}_{ijk}(\boldsymbol{\beta},\sigma^{2}_{\gamma})&=&Pr(Y_{ij}=0,Y_{ik}=0) \\ &=&{\sum}^{V}_{v_{i}=0}\frac{1} {[1+\exp(\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta}+\sigma_{\gamma} h(v_{i}))][1+\exp(\boldsymbol{x}^{\prime}_{ik}\boldsymbol{\beta}+\sigma_{\gamma} h(v_{i}))]} \begin{pmatrix}V \\ v_{i}\end{pmatrix}(1/2)^{v_{i}}(1/2)^{V-v_{i}}, \end{array} $$
(22)

$$ \begin{array}{@{}rcl@{}} \lambda^{(1,0)}_{ijk}(\boldsymbol{\beta},\sigma^{2}_{\gamma})&=&Pr(Y_{ij}=1,Y_{ik}=0) \\ &=&{\sum}^{V}_{v_{i}=0}\frac{\exp(\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta}+\sigma_{\gamma} h(v_{i}))} {[1+\exp(\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta}+\sigma_{\gamma} h(v_{i}))][1+\exp(\boldsymbol{x}^{\prime}_{ik}\boldsymbol{\beta}+\sigma_{\gamma} h(v_{i}))]} \begin{pmatrix}V \\ v_{i}\end{pmatrix}(1/2)^{v_{i}}(1/2)^{V-v_{i}}, \end{array} $$
(23)

$$ \begin{array}{@{}rcl@{}} \lambda^{(0,1)}_{ijk}(\boldsymbol{\beta},\sigma^{2}_{\gamma})&=&Pr(Y_{ij}=0,Y_{ik}=1) \\ &=&{\sum}^{V}_{v_{i}=0}\frac{\exp(\boldsymbol{x}^{\prime}_{ik}\boldsymbol{\beta}+\sigma_{\gamma} h(v_{i}))} {[1+\exp(\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta}+\sigma_{\gamma} h(v_{i}))][1+\exp(\boldsymbol{x}^{\prime}_{ik}\boldsymbol{\beta}+\sigma_{\gamma} h(v_{i}))]} \begin{pmatrix}V \\ v_{i}\end{pmatrix}(1/2)^{v_{i}}(1/2)^{V-v_{i}}, \end{array} $$
(24)

yielding the odds ratio as

$$ \begin{array}{@{}rcl@{}} \psi_{ijk}&=&\frac{\lambda^{(1,1)}_{ijk}(\boldsymbol{\beta},\sigma^{2}_{\gamma}) \lambda^{(0,0)}_{ijk}(\boldsymbol{\beta},\sigma^{2}_{\gamma})} {\lambda^{(1,0)}_{ijk}(\boldsymbol{\beta},\sigma^{2}_{\gamma})\lambda^{(0,1)}_{ijk} (\boldsymbol{\beta},\sigma^{2}_{\gamma})}. \end{array} $$
(25)
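The four binomial-approximation sums (Eqs. 21 to 24) and the resulting odds ratio (25) are straightforward to evaluate numerically. The following is a minimal sketch in which the linear predictors x′ijβ and x′ikβ are passed in as the numbers `eta_j` and `eta_k`. The function names and the default number of binomial trials V = 10 are illustrative assumptions.

```python
import math

def joint_probs(eta_j, eta_k, sigma, V=10):
    """Binomial-approximation joint probabilities of (Y_ij, Y_ik), Eqs. 21-24."""
    lam = {(1, 1): 0.0, (0, 0): 0.0, (1, 0): 0.0, (0, 1): 0.0}
    for v in range(V + 1):
        h = (v - V / 2) / math.sqrt(V / 4)      # standardized binomial point h(v_i)
        w = math.comb(V, v) * 0.5 ** V          # binomial(V, 1/2) probability of v
        pj = 1 / (1 + math.exp(-(eta_j + sigma * h)))   # conditional Pr(Y_ij = 1)
        pk = 1 / (1 + math.exp(-(eta_k + sigma * h)))   # conditional Pr(Y_ik = 1)
        # Responses are independent given the cluster effect, so probabilities multiply.
        lam[(1, 1)] += w * pj * pk
        lam[(0, 0)] += w * (1 - pj) * (1 - pk)
        lam[(1, 0)] += w * pj * (1 - pk)
        lam[(0, 1)] += w * (1 - pj) * pk
    return lam

def odds_ratio(eta_j, eta_k, sigma, V=10):
    """Odds ratio psi_ijk of Eq. 25."""
    lam = joint_probs(eta_j, eta_k, sigma, V)
    return lam[(1, 1)] * lam[(0, 0)] / (lam[(1, 0)] * lam[(0, 1)])
```

With σγ = 0 the responses are independent and the odds ratio equals 1. For σγ > 0, evaluating `math.log(odds_ratio(...))` over a grid of linear predictors gives a direct way to see how the log odds ratio depends on β and σγ under this model.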

In Chen et al. (2011), these joint probabilities are unknown, and \(\lambda ^{(1,1)}_{ijk}(\boldsymbol {\beta },\sigma ^{2}_{\gamma })\) is expressed as a function of ψijk, pij(β), and pik(β); this joint probability is then estimated by using an estimate of ψijk. For the estimation, they use a ‘working’ log linear model, namely

$$ \psi_{ijk}=\exp(\boldsymbol{u}^{\prime}_{ijk}\boldsymbol{\alpha}), $$

where uijk is a set of suitable covariates and α is a set of new regression parameters. Notice however that the odds ratio computed by Eq. 25, which is known under the present model, is far different from what is modeled using a log linear relationship. Thus, this example demonstrates that the ‘working’ odds ratio approach, which fits a log linear model for odds ratio estimation, may yield an inconsistent estimate for the joint probability, restricting its use for GEE construction.

3 Existing Marginal Models and Estimation for Longitudinal Clustered Binary Data

As opposed to cross-sectional clustered data collection, in a longitudinal setup a cluster is formed with repeated responses over a period of time T, from an individual i, for all i = 1,…,I. As explained in Section 1, specifically in Eqs. 2 and 4, the correlations among repeated responses arise through certain dynamic relationships between past and present responses of the same individual. We refer to Sutradhar (2010, Section 2.2) and Sutradhar and Zheng (2018), for example, for some low order non-stationary (time dependent covariates based) correlation such as AR(1) (auto-regressive order 1), MA(1) (moving average order 1), and exchangeable/equi-correlation structures for repeated binary data. Similar but ‘working’ correlation structures for stationary/non-stationary repeated binary data are also found in Liang and Zeger (1986), Zeger et al. (1985), and Lin and Carroll (2001), for example.

As far as the marginal models for the binary means at a given time t are concerned, similar to the cross-sectional clustered binary models, these models can be (1) MFE model (LM(1)) such as in Eq. 1.4 obtained from a linear dynamic conditional model, or (2) MD/MR model (LM(2)) such as in Eq. 5 obtained from a non-linear dynamic conditional logits model, or (3) AMFE (arbitrary marginal fixed effects) model (LM(3)), where correlations are thought not to play any roles in mean specification. For convenience, we refer to Zeger et al. (1985), Sutradhar (2010, 2011), and Sutradhar and Zheng (2018), for LM(1) type MFE model; Fokianos and Kedem (2003), Sutradhar and Farrell (2007), and Sutradhar (2011, Section 7.7.2), for LM(2) type MD/MR (marginal dynamic/recursive) model; and Liang and Zeger (1986), and Lin and Carroll (2001), for the AMFE model.

We further remark that some studies (e.g., Laird and Ware, 1982; Stiratelli et al., 1984; Parzen et al., 2011) have used random effects models that are similar to the cross-sectional clustered models discussed in Section 2. However, these models can accommodate only EQC/exchangeable type correlations, and hence they have limited or no value for longitudinal data where one encounters correlations through time series type dynamic models. As discussed in Section 2, these models also have limitations for specification of marginal means for the cross-sectional cluster binary data. Thus, we do not consider these models any further in our discussion.

3.1 LM(1): Time Specific (TS) Marginal Fixed Effects Model

Recall from Section 1 that the linear dynamic conditional probability (Pr[Yit = 1|yi,t− 1]) model (2) relates yi,t− 1 to yit, for t = 2,…,T, through an AR(1) type relationship. This model produces a MFE model for the binary means at time t as in Eq. 1.4, specifically it yields

$$ E[Y_{it}]=Pr[Y_{it}=1]=\tilde{p}_{it}=\exp(\boldsymbol{x}^{\prime}_{it}\boldsymbol{\beta}) /[1+\exp(\boldsymbol{x}^{\prime}_{it}\boldsymbol{\beta})], $$

where β explains the effects of the fixed covariates xit on yit, specifically on E[Yit]. Notice that the conditional model (2) also produces the lag (t − u) correlations between two responses yiu and yit, for u < t, say, as

$$ \begin{array}{@{}rcl@{}} \text{corr}(Y_{iu},Y_{it})&=& \rho^{t-u}[\frac{\sigma_{iuu}}{\sigma_{itt}}]^{\frac{1}{2}}, \end{array} $$
(26)

[Sutradhar (2011, Eqn. (7.73))], where σitt is the variance of yit, for all t = 1,…,T, and is given in Eq. 1.4 as \(\sigma _{itt}=\tilde {p}_{it}(1-\tilde {p}_{it}).\)
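As a small illustration of Eq. 26, the lag correlation can be evaluated directly from the two marginal probabilities and ρ. In the sketch below the linear predictors x′iuβ and x′itβ are passed in as numbers; the function names are illustrative assumptions.

```python
import math

def p_tilde(eta):
    """Marginal probability exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def lag_corr(eta_u, eta_t, rho, lag):
    """Eq. 26: corr(Y_iu, Y_it) = rho**(t - u) * sqrt(sigma_iuu / sigma_itt), lag = t - u."""
    pu, pt = p_tilde(eta_u), p_tilde(eta_t)
    sigma_uu = pu * (1 - pu)     # variance of y_iu
    sigma_tt = pt * (1 - pt)     # variance of y_it
    return rho ** lag * math.sqrt(sigma_uu / sigma_tt)
```

When the covariates are time independent, so that the two variances coincide, the expression reduces to the familiar stationary AR(1) form ρ^(t − u); with time dependent covariates the variance ratio makes the correlation non-stationary.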

As far as the estimation of β is concerned, if one is willing to ignore the correlation structure (26) (which is equivalent to using ρ = 0 in Eq. 26), then one may solve (1) the MM (method of moments) estimating equation

$$ \begin{array}{@{}rcl@{}} {\sum}^{I}_{i=1}{\sum}^{T}_{t=1}\frac{\partial \tilde{p}_{it}}{\partial \boldsymbol{\beta}}(y_{it}-\tilde{p}_{it})=0, \end{array} $$
(27)

or (2) a QL (quasi-likelihood) estimating equation

$$ \begin{array}{@{}rcl@{}} {\sum}^{I}_{i=1}{\sum}^{T}_{t=1}\frac{\partial \tilde{p}_{it}}{\partial \boldsymbol{\beta}}\sigma^{-1}_{itt}(y_{it}-\tilde{p}_{it})=0, \end{array} $$
(28)

[Wedderburn (1974)] to obtain MM or QL estimate for β. Note that both MM (27) and QL (28) estimating equations are unbiased as \(E[Y_{it}]=\tilde {p}_{it}(\boldsymbol {\beta })\) yielding \(E[Y_{it}-\tilde {p}_{it}(\boldsymbol {\beta })]=0.\) Consequently, as \(I \rightarrow \infty ,\) MM and QL estimators will be consistent under some mild regularity conditions. But these estimators will be inefficient as compared to other moments based estimators obtained by accommodating the underlying correlation structure (26).
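For the logistic mean, the derivative of the marginal probability with respect to β equals σitt xit, so the QL estimating equation (28) reduces to the sum of xit(yit − p̃it) over i and t being zero, which can be solved by Newton iterations. The following is a minimal sketch for a model with an intercept and one scalar covariate. The function names, the two-parameter restriction, and the simulated (independent) data are illustrative assumptions only.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate(n, b0, b1, seed=0):
    """Illustrative pooled data: n binary responses with one covariate each."""
    rng = random.Random(seed)
    x = [rng.uniform(-1, 1) for _ in range(n)]
    y = [1 if rng.random() < logistic(b0 + b1 * xt) else 0 for xt in x]
    return x, y

def ql_fit(x, y, n_iter=25):
    """Solve sum x_it (y_it - p_it(beta)) = 0 for beta = (b0, b1) by Newton's method."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        s0 = s1 = h00 = h01 = h11 = 0.0
        for xt, yt in zip(x, y):
            p = logistic(b0 + b1 * xt)
            w = p * (1 - p)            # binary variance sigma_itt
            s0 += yt - p               # estimating function, intercept component
            s1 += xt * (yt - p)        # estimating function, slope component
            h00 += w                   # entries of sum w * x x'
            h01 += w * xt
            h11 += w * xt * xt
        det = h00 * h11 - h01 * h01
        b0 += (h11 * s0 - h01 * s1) / det   # explicit 2x2 Newton step
        b1 += (h00 * s1 - h01 * s0) / det
    return b0, b1

x, y = simulate(400, 0.5, 1.0, seed=1)
b0_hat, b1_hat = ql_fit(x, y)
```

The MM equation (27) differs only in that each term carries the additional weight σitt; both equations ignore the correlation structure (26), which is the source of the efficiency loss noted above.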

Let Σi(β,ρ) = (σiut(⋅)) denote the T × T covariance matrix constructed based on the correlation structure from Eq. 26. One may then obtain a highly efficient estimate of β by solving the GQL estimating equation

$$ \begin{array}{@{}rcl@{}} {\sum}^{I}_{i=1}\frac{\partial \tilde{\boldsymbol{p}}^{\prime}_{i}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}{\boldsymbol{\Sigma}}^{-1}_{i}(\boldsymbol{\beta},\rho)(\boldsymbol{y}_{i}-\tilde{\boldsymbol{p}}_{i}(\boldsymbol{\beta}))=0, \end{array} $$
(29)

[Sutradhar (2003, Section 3)] where

$$ \boldsymbol{y}_{i}=(y_{i1},\ldots,y_{it},\ldots,y_{iT})^{\prime}, \tilde{\boldsymbol{p}}_{i}(\boldsymbol{\beta})= (\tilde{p}_{i1},\ldots,\tilde{p}_{it},\ldots, \tilde{p}_{iT})^{\prime}. $$

Alternatively, one may obtain an optimal estimate of β by solving a likelihood estimating equation for \(\boldsymbol {\theta }=(\boldsymbol {\beta }^{\prime },\rho )^{\prime }\) given by

$$ \begin{array}{@{}rcl@{}} &&\frac{\partial \log L(\boldsymbol{\beta},\rho)}{\partial \boldsymbol{\theta}}=0, \end{array} $$
(30)

where

$$ L(\boldsymbol{\beta},\rho)={\Pi}^{I}_{i=1}[f(y_{i1}){\Pi}^{T}_{t=2}f(y_{it}|y_{i,t-1})], $$
(31)

with

$$ f(y_{i1})={\tilde{p}_{i1}}^{y_{i1}}[1-\tilde{p}_{i1}]^{1-y_{i1}} $$

as the binary density at t = 1, and conditional density of the form

$$ f(y_{it}|y_{i,t-1})=[\lambda^{*}_{it}(\boldsymbol{\beta},\rho|y_{i,t-1})]^{y_{it}} [1-\lambda^{*}_{it}(\boldsymbol{\beta},\rho|y_{i,t-1})]^{1-y_{it}}, $$
(32)

for t = 2,…,T, with \(\lambda ^{*}_{it}(\boldsymbol {\beta },\rho |y_{i,t-1})=P[y_{it}=1|y_{i,t-1}]\) as the conditional probability as given in Eq. 2.

Note that as the likelihood estimation is more complex than the GQL estimation approach, GQL approach becomes practically useful as it also provides more efficient estimates than the MM and QL approaches. We further remark that under this MFE model (LM(1)), where correlations are specified by Eq. 26, the so-called GEE approach (Liang and Zeger (1986)) becomes redundant because no ‘working’ correlation structure is needed when true correlation structure is known.

3.2 LM(2): Time Specific (TS) Marginal Dynamic/Recursive (MD/MR) Model

Many existing studies such as Liang and Zeger (1986), Zeger et al. (1988, Section 3.1), Lipsitz et al. (1991), and Yi and Cook (2002), among others, have specified the marginal binary means as a function of regression parameters only, specifically as

$$ E[Y_{it}]=Pr[Y_{it}=1]=\tilde{p}_{it}=\exp(\boldsymbol{x}^{\prime}_{it}\boldsymbol{\beta}) /[1+\exp(\boldsymbol{x}^{\prime}_{it}\boldsymbol{\beta})], $$
(33)

which is similar to Eq. 19 in a cross-sectional cluster setup, and estimated β using the ‘working’ correlations based so-called GEE approach. Thus, it is clear that these and other follow-up works modeled neither the marginal means nor the correlation structure for the underlying longitudinal binary responses. Between these two specifications, i.e., specifying the marginal means by Eq. 33, and specifying a ‘working’ correlation matrix, for repeated binary data, the former specification can seriously affect the validity of the regression estimates when the marginal means for correlated binary data cannot be specified as a function of regression parameters only. One such important situation is indicated by Eq. 5 under Section 1, where marginal means for the longitudinal binary data appear to involve both regression (β) and correlation (ρ) parameters. More specifically, the dynamic logit model (4) yields

$$ \begin{array}{@{}rcl@{}} \!\!\!\!\!\!\!\!\!\!\!\!\mu_{i1}(\boldsymbol{\beta})\!\!\!&=&\!\!\!E[Y_{i1}|\boldsymbol{x}_{i1}]=\tilde{p}_{i1}(\boldsymbol{\beta}) =\exp(\boldsymbol{x}^{\prime}_{i1}\boldsymbol{\beta})/[1+\exp(\boldsymbol{x}^{\prime}_{i1}\boldsymbol{\beta})] \\ \!\!\!\!\!\!\!\!\!\!\!\!\mu_{i2}(\boldsymbol{\beta},\rho)\!\!\!&=&\!\!\!E[Y_{i2}|H_{i2}]=\frac{\exp(\boldsymbol{x}^{\prime}_{i2}\boldsymbol{\beta})} {[1+\exp(\boldsymbol{x}^{\prime}_{i2}\boldsymbol{\beta})]} \\ &&+\mu_{i1}(\boldsymbol{\beta})\left( \frac{\exp(\boldsymbol{x}^{\prime}_{i2}\boldsymbol{\beta}+\rho)} {[1+\exp(\boldsymbol{x}^{\prime}_{i2}\boldsymbol{\beta}+\rho)]} -\frac{\exp(\boldsymbol{x}^{\prime}_{i2}\boldsymbol{\beta})}{[1+\exp(\boldsymbol{x}^{\prime}_{i2}\boldsymbol{\beta})]}\right) \\ \!\!\!\!\!\!\!\!\!\!\!\!\mu_{i3}(\boldsymbol{\beta},\rho)\!\!\!&=&\!\!\!E[Y_{i3}|H_{i3}]=\frac{\exp(\boldsymbol{x}^{\prime}_{i3}\boldsymbol{\beta})} {[1+\exp(\boldsymbol{x}^{\prime}_{i3}\boldsymbol{\beta})]} \\ &&\!\!\!+\mu_{i2}(\boldsymbol{\beta},\rho)\left( \frac{\exp(\boldsymbol{x}^{\prime}_{i3}\boldsymbol{\beta}+\rho)} {[1+\exp(\boldsymbol{x}^{\prime}_{i3}\boldsymbol{\beta}+\rho)]} -\frac{\exp(\boldsymbol{x}^{\prime}_{i3}\boldsymbol{\beta})}{[1+\exp(\boldsymbol{x}^{\prime}_{i3}\boldsymbol{\beta})]}\right), \end{array} $$
(34)

and so on. Clearly, these means show a recursive relationship. These marginal means take the so-called fixed effects model form, i.e., \(\mu _{it}(\cdot )=\tilde {p}_{it}(\boldsymbol {\beta })=\exp (\boldsymbol {x}^{\prime }_{it}\boldsymbol {\beta }) /[1+\exp (\boldsymbol {x}^{\prime }_{it}\boldsymbol {\beta })], \text {for all } t=1,\ldots ,T,\) only when ρ = 0. Otherwise, μi2(⋅) (marginal mean at time point t = 2) is \(\tilde {p}_{i2}(\boldsymbol {\beta })\) plus an increment or decrement due to ρ weighted by the previous mean μi1, and so on. It is then clear that one can no longer estimate the regression effects β by using the so-called GEE approach (Liang and Zeger, 1986). This is because the binary means (5) under this BDL model are not free of the correlation parameter, whereas the GEE approach is developed for the estimation of fixed effects based marginal means containing only the regression parameters β, the correlations being nuisance parameters. In summary, estimating β from the mean model (33) when in fact the mean model in Eq. 5 is true would produce inconsistent regression estimates, which is a serious inference issue.
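The recursion in Eq. 34 can be traced numerically. The following minimal sketch computes μi1,…,μiT from a list of linear predictors x′itβ and a value of ρ; the function names and the numerical inputs are illustrative assumptions.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def recursive_means(etas, rho):
    """Marginal means mu_i1, ..., mu_iT of Eq. 34, with etas[t] = x_it' beta."""
    mu = [logistic(etas[0])]                    # mu_i1 = p~_i1(beta)
    for eta in etas[1:]:
        p0 = logistic(eta)                      # p~_it(beta)
        p1 = logistic(eta + rho)                # logistic mean shifted by rho
        mu.append(p0 + mu[-1] * (p1 - p0))      # recursion of Eq. 34
    return mu

etas = [0.2, -0.4, 0.7]                         # illustrative linear predictors
mu_indep = recursive_means(etas, 0.0)           # reduces to the fixed effects means
mu_dep = recursive_means(etas, 0.8)             # means now depend on rho as well
```

Comparing `mu_indep` with `mu_dep` shows the increment produced by a positive ρ at every time point after the first, which is the part of the mean that a model of the form (33) leaves out.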

For the regression analysis of the BDL (binary dynamic logit) model (4), which produces the marginal recursive (MR) means as in Eq. 5 involving both β and ρ, Sutradhar and Farrell (2007) (see also Amemiya (1985, p. 422) in a time series setup) have developed a GQL (generalized quasi-likelihood) estimation approach which exploits the true correlation structure of the data. For u < t, the formula for the lag (t − u) auto-correlation between yiu and yit is given by

$$ \begin{array}{@{}rcl@{}} \!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\text{corr}(Y_{iu},Y_{it})\!\!\!&=&\!\!\!\tilde{\rho}_{t-u}(\boldsymbol{\beta},\rho) \\ \!\!\!&=&\!\!\!\sqrt{\frac{\mu_{iu}(\cdot)(1-\mu_{iu}(\cdot))} {\mu_{it}(\cdot)(1-\mu_{it}(\cdot))}}{\Pi}^{t}_{v=u+1} ({\tilde{\tilde{p}}}_{iv}(\boldsymbol{\beta},\rho)-\tilde{p}_{iv}(\boldsymbol{\beta})), \end{array} $$
(35)

[Sutradhar and Farrell (2007)] where

$$ \begin{array}{@{}rcl@{}} \tilde{p}_{it}(\boldsymbol{\beta})&=&\exp(\boldsymbol{x}^{\prime}_{it}\boldsymbol{\beta}) /[1+\exp(\boldsymbol{x}^{\prime}_{it}\boldsymbol{\beta})], \text{for all} t=1,\ldots,T, \\ {\tilde{\tilde{p}}}_{it}(\boldsymbol{\beta},\rho)&=&\exp(\boldsymbol{x}^{\prime}_{it}\boldsymbol{\beta}+\rho) /[1+\exp(\boldsymbol{x}^{\prime}_{it}\boldsymbol{\beta}+\rho)], \text{for all} t=2,\ldots,T, \\ \mu_{i1}(\boldsymbol{\beta})&=&\tilde{p}_{i1}(\boldsymbol{\beta}), \end{array} $$

and μit(β,ρ) for t = 2,…,T, have the recursive/dynamic formula as in Eq. 5. Subsequently, one may construct the T × T true covariance matrix of the response vector \(\boldsymbol {y}_{i}=(y_{i1},\ldots ,y_{it},\ldots ,y_{iT})^{\prime },\) of the i-th individual, as

$$ \begin{array}{@{}rcl@{}} \tilde{\boldsymbol{\Sigma}}_{i}(\boldsymbol{\beta},\rho)&=&\text{cov}[\boldsymbol{Y}_{i}]={\boldsymbol{A}}^{\frac{1}{2}}_{i} (\boldsymbol{\beta},\rho)\tilde{\boldsymbol{\rho}}_{M}(\boldsymbol{\beta},\rho) {\boldsymbol{A}}^{\frac{1}{2}}_{i}(\boldsymbol{\beta},\rho), \end{array} $$
(36)

where

$$ \begin{array}{@{}rcl@{}} \boldsymbol{A}_{i}(\boldsymbol{\beta},\rho)&=& \text{diag}[\mu_{i1}(\cdot)(1-\mu_{i1}(\cdot)),\ldots,\mu_{iT}(\cdot)(1-\mu_{iT}(\cdot))] \\ \tilde{\boldsymbol{\rho}}_{M}(\boldsymbol{\beta},\rho)&=&\begin{pmatrix}1 & \tilde{\rho}_{1} & \tilde{\rho}_{2} & {\ldots} &\tilde{\rho}_{\ell} & {\ldots} & \tilde{\rho}_{T-2} & \tilde{\rho}_{T-1} \\ \cdot & 1 & \tilde{\rho}_{1} & {\ldots} & \tilde{\rho}_{\ell -1}& {\ldots} & \tilde{\rho}_{T-3} & \tilde{\rho}_{T-2} \\ {\vdots} & {\vdots} & \vdots & {\ldots} & {\vdots} & {\ldots} & {\vdots} & {\vdots} \\ \cdot &\cdot & \cdot & {\ldots} & \cdot & {\ldots} & 1 & \tilde{\rho}_{1} \\ \cdot &\cdot & \cdot & {\ldots} & \cdot & {\ldots} & \cdot & 1 \\\end{pmatrix}. \end{array} $$
(37)
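The covariance matrix in Eq. 36 can be assembled from the recursive means and the correlation formula (35), and checked against a brute-force enumeration of all 2^T response paths. The sketch below assumes that the BDL conditional probability is the logistic function of x′itβ + ρ yi,t−1 for t ≥ 2 and of x′i1β for t = 1, which is the form implied by the two logistic probabilities defined after Eq. 35. The function names and inputs are illustrative assumptions.

```python
import math
from itertools import product

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def bdl_moments(etas, rho):
    """Means (Eq. 34) and covariance matrix (Eqs. 35-36); etas[t] = x_it' beta."""
    T = len(etas)
    p0 = [logistic(e) for e in etas]            # p~_it(beta)
    p1 = [logistic(e + rho) for e in etas]      # logistic mean shifted by rho
    mu = [p0[0]]
    for t in range(1, T):
        mu.append(p0[t] + mu[-1] * (p1[t] - p0[t]))
    var = [m * (1 - m) for m in mu]             # diagonal of A_i
    cov = [[0.0] * T for _ in range(T)]
    for u in range(T):
        cov[u][u] = var[u]
        prod = 1.0
        for t in range(u + 1, T):
            prod *= p1[t] - p0[t]               # product over v = u+1, ..., t
            corr = math.sqrt(var[u] / var[t]) * prod            # Eq. 35
            cov[u][t] = cov[t][u] = corr * math.sqrt(var[u] * var[t])
    return mu, cov

def exact_cov(etas, rho):
    """Covariance matrix obtained by enumerating all 2**T binary response paths."""
    T = len(etas)
    m1 = [0.0] * T
    m2 = [[0.0] * T for _ in range(T)]
    for ys in product((0, 1), repeat=T):
        prob, prev = 1.0, 0                     # prev = 0 gives logistic(etas[0]) at t = 1
        for t, yt in enumerate(ys):
            p = logistic(etas[t] + rho * prev)
            prob *= p if yt == 1 else 1 - p
            prev = yt
        for u in range(T):
            m1[u] += prob * ys[u]
            for t in range(T):
                m2[u][t] += prob * ys[u] * ys[t]
    return [[m2[u][t] - m1[u] * m1[t] for t in range(T)] for u in range(T)]
```

The enumeration is only feasible for small T, but it offers an independent check that the closed-form entries agree with the moments of the assumed dynamic model.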

One may then exploit the mean vector μi(β,ρ) = E[Yi] = (μi1(β),μi2(β,ρ), \(\ldots ,\mu _{iT}(\boldsymbol {\beta },\rho ))^{\prime }\) and the above covariance matrix \(\tilde {\boldsymbol {\Sigma }}_{i}(\boldsymbol {\beta },\rho )\) from Eq. 36, for the GQL estimation of the main regression parameter β. The dynamic dependence or correlation index parameter ρ can be estimated by using the method of moments (MM). See Section 5, for specific GQL and MM estimating equations for these parameters. In the same section, it is shown that as \(I \rightarrow \infty ,\) the GQL estimator of β and the MM estimator of ρ are consistent under some mild regularity conditions. The asymptotic normality of the GQL estimator of the main parameter β is also given for convenience of the construction of confidence intervals, when needed.

3.3 LM(3): A TS (Time Specific) Arbitrary Marginal Fixed Effects (AMFE) Model for Longitudinal Binary Data

This model is similar to the AMFE model (19) under the cross-sectional cluster setup. More specifically, the AMFE model under the longitudinal setup is written as in Eq. 33, i.e.,

$$ E[Y_{it}]=Pr[Y_{it}=1]=\tilde{p}_{it}(\boldsymbol{\beta}) =\exp(\boldsymbol{x}^{\prime}_{it}\boldsymbol{\beta})/[1+\exp(\boldsymbol{x}^{\prime}_{it}\boldsymbol{\beta})], $$

without writing any correlation structures for its derivation, as the underlying structure is assumed to be unknown. Some possible correlation models that might yield the above marginal mean model (as in Eq. 33) are: (a) the AR(1) type model given in Eq. 2, (b) the MA(1) (moving average order 1) model, and (c) the EQC (equi-correlation)/exchangeable model. We may refer to Sutradhar (2011, Sections 7.4.1 to 7.4.3) for a detailed discussion about these three basic longitudinal models, all yielding the same marginal fixed effects based mean model.

Notice that to obtain a consistent estimate of β involved in the above marginal mean \(\tilde {p}_{it}(\boldsymbol {\beta }),\) one may solve the MM or QL estimating equations shown in Eqs. 27 and 28, as they are unbiased estimating equations. These MM and QL estimating equations are free of correlations and hence the regression estimates obtained from them are bound to be less efficient than those from any moments based equations involving the true correlations. However, for the cases where true correlation models are unknown, Liang and Zeger (1986) proposed a ‘working’ correlation matrix based GEE (generalized estimating equations) approach for efficient estimation of β. More specifically, for efficient β estimation, they define a ‘working’ correlation matrix as Ri(α), α being a set of working correlation index parameters, and solve the GEE given by

$$ \begin{array}{@{}rcl@{}} &&{\sum}^{I}_{i=1}\frac{\partial \tilde{\boldsymbol{p}}^{\prime}_{i}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}{\boldsymbol{V}}^{-1}_{i}(\boldsymbol{\beta},\alpha)(\boldsymbol{y}_{i}-\tilde{\boldsymbol{p}}_{i}(\boldsymbol{\beta}))=0, \end{array} $$
(38)

where \(\tilde {\boldsymbol {p}}_{i}(\boldsymbol {\beta })=(\tilde {p}_{i1}(\boldsymbol {\beta }),\ldots , \tilde {p}_{it}(\boldsymbol {\beta }),\ldots ,\tilde {p}_{iT}(\boldsymbol {\beta }))^{\prime },\) and \(\boldsymbol {V}_{i}(\boldsymbol {\beta },\alpha )={\tilde {\boldsymbol {A}}}^{\frac {1}{2}}_{i}(\boldsymbol {\beta }) \boldsymbol {R}_{i}(\alpha ) {\tilde {\boldsymbol {A}}}^{\frac {1}{2}}_{i}(\boldsymbol {\beta }),\) with \(\tilde {\boldsymbol {A}}_{i}(\boldsymbol {\beta })=\text {diag}[\tilde {p}_{i1}(1-\tilde {p}_{i1}),\ldots ,\tilde {p}_{it}(1-\tilde {p}_{it}),\ldots , \tilde {p}_{iT}(1-\tilde {p}_{iT})].\) Because this GEE approach was ambitiously aimed at dealing with any type of correlated binary data, it was used by hundreds of researchers over two decades or so until it was discovered that this approach may in fact yield less efficient estimates than an independence assumption-based estimating equation approach (Sutradhar and Das (1999), Sutradhar (2011, Section 7.3.6); see also Sutradhar and Zheng (2018) under a semi-parametric setup) such as the QL approach in Eq. 28 (which may also be referred to as independence based GEE, GEE(I)).

Further note that, as pointed out in the last section, one cannot use the marginal fixed effects (MFE) based GEE approach at all for certain longitudinal binary data where correlation parameters enter into the formulas for the binary marginal means. More specifically, if GEE is used in such cases, it will produce inconsistent regression estimates. For example, suppose that the longitudinal responses follow the BDL (binary dynamic logit) model (4) yielding the marginal mean models as in Eq. 5. Under this BDL model, the response vector \(\boldsymbol {y}_{i}=(y_{i1},\ldots ,y_{it},\ldots ,y_{iT})^{\prime }\) has the mean \(\boldsymbol {\mu }_{i}(\boldsymbol {\beta },\rho )=E[\boldsymbol {Y}_{i}]=(\mu _{i1} (\boldsymbol {\beta }),\mu _{i2}(\boldsymbol {\beta },\rho ),\ldots ,\mu _{iT}(\boldsymbol {\beta },\rho ))^{\prime }\) (see Eq. 5) and the covariance matrix Σi(β,ρ) as in Eq. 36, i.e.,

$$ \boldsymbol{Y}_{i} \sim(\boldsymbol{\mu}_{i}(\boldsymbol{\beta},\rho),\boldsymbol{\Sigma}_{i}(\boldsymbol{\beta},\rho)), $$
(39)

where the marginal means are functions of both β and ρ, whereas the GEE approach will specify the mean vector as \(\tilde {\boldsymbol {p}}_{i}(\boldsymbol {\beta })=(\tilde {p}_{i1}(\boldsymbol {\beta }),\ldots , \tilde {p}_{it}(\boldsymbol {\beta }),\ldots ,\tilde {p}_{iT}(\boldsymbol {\beta }))^{\prime },\) with \(\tilde {p}_{it}(\boldsymbol {\beta })=\exp (\boldsymbol {x}^{\prime }_{it}\boldsymbol {\beta })/[1+\exp (\boldsymbol {x}^{\prime }_{it}\boldsymbol {\beta })].\)

Now to examine the convergence of \(\hat {\boldsymbol {\beta }}_{GEE}\) obtained from Eq. 38 when it is known that yi has the true mean vector and covariance matrix as in Eq. 39, we first write the iterative equation to obtain \(\hat {\boldsymbol {\beta }}_{GEE}\), as follows:

$$ \begin{array}{@{}rcl@{}} &&\hat{\boldsymbol{\beta}}_{GEE}(r+1)=\hat{\boldsymbol{\beta}}_{GEE}(r) \\ \!\!\!\!&+&\!\!\!\!\left[\left\{{\sum}^{I}_{i=1}\frac{\partial \tilde{\boldsymbol{p}}^{\prime}_{i}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}{\boldsymbol{V}}^{-1}_{i}(\boldsymbol{\beta},\hat{\alpha})\frac{\partial \tilde{\boldsymbol{p}}_{i}(\boldsymbol{\beta})}{\partial {\boldsymbol{\beta}}^{\prime}}\right\}^{-1}\right.\\&&\left. {\sum}^{I}_{i=1}\frac{\partial \tilde{\boldsymbol{p}}^{\prime}_{i}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}{\boldsymbol{V}}^{-1}_{i}(\boldsymbol{\beta},\hat{\alpha})(\boldsymbol{y}_{i} -\tilde{\boldsymbol{p}}_{i}(\boldsymbol{\beta}))\right]_{\boldsymbol{\beta}=\hat{\boldsymbol{\beta}}_{GEE}(r)}. \end{array} $$
(40)

Notice that because the mean vector and covariance matrix of yi are functions of both β and ρ, and because \(\hat {\alpha }\) is usually a moment estimator, it then follows that \(\hat {\alpha }\) will converge to a quantity, say α0, which must be a function of ρ. That is,

$$ \hat{\alpha} \rightarrow \alpha_{0}(\rho) $$
(41)

[Crowder (1995), Sutradhar and Das (1999)]. Thus, one may approximate the limiting (as \(I \rightarrow \infty \)) difference between \(\hat {\boldsymbol {\beta }}_{GEE}\) and true parameter β, as

$$ \begin{array}{@{}rcl@{}} &&\!\!\!\!\lim_{I \rightarrow \infty}[\hat{\boldsymbol{\beta}}_{GEE}-\boldsymbol{\beta}] \\ &\approx &\!\!\!\!\lim_{I \rightarrow \infty}\left[\left\{{\sum}^{I}_{i=1}\frac{\partial \tilde{\boldsymbol{p}}^{\prime}_{i}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}{\boldsymbol{V}}^{-1}_{i}(\boldsymbol{\beta},\alpha_{0}(\rho))\frac{\partial \tilde{\boldsymbol{p}}_{i}(\boldsymbol{\beta})}{\partial {\boldsymbol{\beta}}^{\prime}}\right\}^{-1} \right.\\&&\left.{\sum}^{I}_{i=1}\frac{\partial \tilde{\boldsymbol{p}}^{\prime}_{i}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}{\boldsymbol{V}}^{-1}_{i}(\boldsymbol{\beta},\alpha_{0}(\rho))(\boldsymbol{y}_{i} -\tilde{\boldsymbol{p}}_{i}(\boldsymbol{\beta}))\right] \\ &\rightarrow &\!\!\!\! E_{\boldsymbol{y}}\left[\left\{{\sum}^{I}_{i=1}\frac{\partial \tilde{\boldsymbol{p}}^{\prime}_{i}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}{\boldsymbol{V}}^{-1}_{i}(\boldsymbol{\beta},\alpha_{0}(\rho))\frac{\partial \tilde{\boldsymbol{p}}_{i}(\boldsymbol{\beta})}{\partial {\boldsymbol{\beta}}^{\prime}}\right\}^{-1} \right.\\&&\left.{\sum}^{I}_{i=1}\frac{\partial \tilde{\boldsymbol{p}}^{\prime}_{i}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}{\boldsymbol{V}}^{-1}_{i}(\boldsymbol{\beta},\alpha_{0}(\rho))(\boldsymbol{y}_{i} -\tilde{\boldsymbol{p}}_{i}(\boldsymbol{\beta}))\right] \\ &=&\!\!\!\!\left[\left\{{\sum}^{I}_{i=1}\frac{\partial \tilde{\boldsymbol{p}}^{\prime}_{i}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}{\boldsymbol{V}}^{-1}_{i}(\boldsymbol{\beta},\alpha_{0}(\rho))\frac{\partial \tilde{\boldsymbol{p}}_{i}(\boldsymbol{\beta})}{\partial {\boldsymbol{\beta}}^{\prime}}\right\}^{-1} \right.\\&&\left.{\sum}^{I}_{i=1}\frac{\partial \tilde{\boldsymbol{p}}^{\prime}_{i}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}{\boldsymbol{V}}^{-1}_{i}(\boldsymbol{\beta},\alpha_{0}(\rho))(\boldsymbol{\mu}_{i}(\boldsymbol{\beta},\rho) -\tilde{\boldsymbol{p}}_{i}(\boldsymbol{\beta}))\right] \ne 0, \end{array} $$
(42)

because under the BDL model (4), E[Yi] = μi(β,ρ) as in Eq. 34 (see also Eq. 39), which is quite different from \(\tilde {\boldsymbol {p}}_{i}(\boldsymbol {\beta }).\) Thus \(\hat {\boldsymbol {\beta }}_{GEE}\) obtained from Eq. 38 is asymptotically biased and cannot converge to β unless ρ = 0, which is unlikely to happen in the longitudinal setup.
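The size of this asymptotic bias can be seen in closed form in a deliberately simple special case, which is an assumption made here for illustration and not the general setting of Eq. 42: an intercept-only model (x′itβ = β for all t) fitted with an independence ‘working’ correlation. The estimating equation then sets the fitted probability equal to the overall sample proportion, whose limit under the BDL model is the average of the recursive means μit(β,ρ) over t. The function names and numerical values below are illustrative.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1 - p))

def gee_i_limit(beta, rho, T):
    """Limit of the independence-based GEE estimate of an intercept beta
    when the T repeated responses actually follow the BDL recursion (Eq. 34)."""
    p0, p1 = logistic(beta), logistic(beta + rho)
    mu = [p0]                                    # mu_i1 = p~_i1(beta)
    for _ in range(1, T):
        mu.append(p0 + mu[-1] * (p1 - p0))       # recursive means mu_it(beta, rho)
    # The fitted probability converges to the time-averaged true mean.
    return logit(sum(mu) / T)

limit_no_dependence = gee_i_limit(beta=0.5, rho=0.0, T=4)   # equals the true beta
limit_dependence = gee_i_limit(beta=0.5, rho=1.0, T=4)      # shifted away from beta
```

In this special case the limit coincides with β only when ρ = 0, in line with the conclusion drawn from Eq. 42.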

4 Further Estimation and Asymptotic Properties in Cross-sectional Cluster Setup

4.1 GQL and MM Estimation

Recall from Section 2 that except for the MME based general cluster model A (CM-A), the remaining MFE based cluster models were developed either under restrictive assumptions about the distribution of the random effects, such as the “bridge” distribution leading to the fixed effects model CM-B-1 and the beta-binary distribution leading to the fixed effects model CM-B-2, or using ‘working’ specifications both for means and correlations leading to the AMFE model (CM-C). As discussed in detail in the same section, these latter three MFE based models have limited practical use; in particular, the AMFE model (CM-C) cannot be trusted at all, as it does not justify how a fixed effects based marginal mean model can be derived from the conditional random effects model (1). For these reasons, we concentrate in this section only on the estimation of the parameters of the MME based CM-A model.

More specifically, we turn back to the CM-A model described in Section 2.1. The model parameters β and \(\sigma ^{2}_{\gamma }\) are involved in the marginal mean function \(\mu _{ij}(\boldsymbol {\beta },\sigma ^{2}_{\gamma })\) in Eq. 7, which has its BA (binomial approximation) based computational version given by Eq. 11. For the estimation of these parameters, as discussed in Section 2.1.1, there exist several likelihood based approaches (exact likelihood, PQL (penalized quasi-likelihood), HL (hierarchical likelihood)), but they are either computationally involved or they produce biased and hence mean squared error inconsistent estimates, especially for large values of \(\sigma ^{2}_{\gamma }.\) In this section, following the GQL approach of Sutradhar (2004) developed under the GLMM (generalized linear mixed model) setup, we simplify the binomial approximation (to standard normal random effects) based GQL estimating equation for β, and the MM estimating equation for \(\sigma ^{2}_{\gamma }.\) Furthermore, because in practice one (especially the statistical agencies) deals with a large number of clusters, each containing a large number of individuals, so that \({\sum }^{I}_{i=1}n_{i} \rightarrow \infty ,\) in the next section we verify, for the benefit of these practitioners, that the GQL estimator of β and the MM estimator of \(\sigma ^{2}_{\gamma }\) are consistent. In Section 4.2.2, we show that the GQL estimator of the main regression parameters β has an asymptotically normal distribution, providing an opportunity for confidence interval construction when needed.

4.1.1 GQL Estimation of β

Once the mean function is specified, one requires the true covariance/correlation structure to construct the desired GQL estimating equation (Sutradhar (2003, Section 3.1)) for the parameter of interest. Under the conditional cluster model (1) with normal random cluster effects (γi), the mean function of the i-th cluster response vector \(\boldsymbol {y}_{i}=(y_{i1},\ldots ,y_{ij},\ldots ,y_{in_{i}})^{\prime }\) is computed as in Eq. 39. More specifically, by using the BA (binomial approximation) to the standard normal cluster effect \(\gamma ^{*}_{i}\) as in Eq. 10, we write the BA based mean function as

$$ \begin{array}{@{}rcl@{}} &&E[\boldsymbol{Y}_{i}]={\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma}) \\ &=&(\mu^{BA}_{i1}(\boldsymbol{\beta},\sigma^{2}_{\gamma}),\ldots, \mu^{BA}_{ij}(\boldsymbol{\beta},\sigma^{2}_{\gamma}),\ldots,\mu^{BA}_{in_{i}}(\boldsymbol{\beta},\sigma^{2}_{\gamma}))^{\prime}, \end{array} $$
(43)

where by Eq. 11,

$$ \mu^{BA}_{ij}(\boldsymbol{\beta},\sigma^{2}_{\gamma}) ={\sum}^{V}_{v_{i}=0}p^{*}_{ij}(\boldsymbol{x}_{ij};\boldsymbol{\beta},\sigma_{\gamma} h(v_{i}))\begin{pmatrix}V \\ v_{i}\end{pmatrix}(1/2)^{v_{i}}(1/2)^{V-v_{i}}, $$

for all j = 1,…,ni, V being a large number such as V = 10 or 15. It immediately follows that

$$ \text{var}(Y_{ij})=\sigma^{BA}_{i,jj}(\boldsymbol{\beta},\sigma^{2}_{\gamma}) =\mu^{BA}_{ij}(\boldsymbol{\beta},\sigma^{2}_{\gamma}) (1-\mu^{BA}_{ij}(\boldsymbol{\beta},\sigma^{2}_{\gamma})). $$
(44)
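As a numerical illustration of Eqs. 43 and 44, the following Python sketch evaluates the BA based means and variances for a single cluster. It is only a sketch under stated assumptions: Eq. 10 is not reproduced in this section, so the code assumes the standardized binomial form h(v) = (v − V/2)/√(V/4); the function names `ba_nodes` and `mu_ba`, as well as the covariate and parameter values, are hypothetical.

```python
import numpy as np
from math import comb

def ba_nodes(V=10):
    """Support points h(v) and binomial weights for v = 0, ..., V.

    Assumption: h(v) = (v - V/2) / sqrt(V/4), i.e. a standardized
    Binomial(V, 1/2) variate standing in for the N(0, 1) effect.
    """
    v = np.arange(V + 1)
    h = (v - V / 2.0) / np.sqrt(V / 4.0)
    w = np.array([comb(V, k) for k in v], dtype=float) * 0.5 ** V
    return h, w

def mu_ba(X, beta, sigma, V=10):
    """BA based marginal means for one cluster; X is n_i x p."""
    h, w = ba_nodes(V)
    eta = X @ beta                                   # n_i linear predictors
    # p*_ij evaluated at every support point sigma * h(v): n_i x (V + 1)
    P = 1.0 / (1.0 + np.exp(-(eta[:, None] + sigma * h[None, :])))
    return P @ w                                     # weighted sum over v

# Hypothetical cluster of size 3 with an intercept and one covariate
X = np.array([[1.0, 0.2], [1.0, -0.5], [1.0, 1.0]])
beta = np.array([0.3, 0.8])
mu = mu_ba(X, beta, sigma=1.0)
var = mu * (1.0 - mu)                                # binary variances, Eq. 44
print(mu, var)
```

Setting `sigma=0.0` in this sketch reduces the BA means to the usual fixed effects logistic probabilities, which is a convenient sanity check.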

We now turn to the computation of the ni × ni covariance matrix of yi. For two responses yij and yik, j ≠ k; j,k = 1,…,ni, by Eq. 1, we first write

$$ \begin{array}{@{}rcl@{}} &&\lambda_{i,jk}(\boldsymbol{\beta},\sigma^{2}_{\gamma})= E[Y_{ij}Y_{ik}]=E_{\gamma_{i}}E[Y_{ij}Y_{ik}|\gamma_{i}] \\ &=&E_{\gamma_{i}}[E(Y_{ij}|\gamma_{i})E(Y_{ik}|\gamma_{i})] =\int p^{*}_{ij}(\boldsymbol{\beta},\gamma_{i}) p^{*}_{ik}(\boldsymbol{\beta},\gamma_{i})g_{N}(\gamma_{i})d\gamma_{i}, \end{array} $$
(45)

where \(p^{*}_{ij}(\boldsymbol {\beta },\gamma _{i})=\exp (\boldsymbol {x}^{\prime }_{ij}\boldsymbol {\beta }+\gamma _{i}) /[1+\exp (\boldsymbol {x}^{\prime }_{ij}\boldsymbol {\beta }+\gamma _{i})],\) and \(g_{N}(\gamma _{i}) \equiv [\gamma _{i} \sim N(0,\sigma ^{2}_{\gamma })].\) Notice that the normal integration (45) of a complex function in γi, can be computed as in Eq. 11 using the BA. More specifically, for \(\gamma ^{*}_{i}=\gamma _{i}/\sigma _{\gamma } \equiv h(v_{i})\) as in Eq. 10, the integration in Eq. 45 is approximated as

$$ \begin{array}{@{}rcl@{}} &&\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\lambda^{BA}_{i,jk}(\boldsymbol{\beta},\sigma^{2}_{\gamma}) \\ &\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!=&\!\!\!\!\!\!\!\!\!\!\!\!{\sum}^{V}_{v_{i}=0}p^{*}_{ij}(\boldsymbol{x}_{ij};\boldsymbol{\beta},\sigma_{\gamma} h(v_{i}))p^{*}_{ik}(\boldsymbol{x}_{ik};\boldsymbol{\beta},\sigma_{\gamma} h(v_{i}))\begin{pmatrix}V \\ v_{i}\end{pmatrix}(1/2)^{v_{i}}(1/2)^{V-v_{i}}, \end{array} $$
(46)

yielding the covariance between yij and yik as

$$ \begin{array}{@{}rcl@{}} \!\!\!\!\!\!\!\!\text{cov}[Y_{ij},Y_{ik}]\!\!\!\!&=&\!\!\!\!E[Y_{ij}Y_{ik}]-E[Y_{ij}]E[Y_{ik}] \\ \!\!\!\!&=&\!\!\!\!\lambda^{BA}_{i,jk}(\boldsymbol{\beta},\sigma^{2}_{\gamma})-\mu^{BA}_{ij}(\boldsymbol{\beta},\sigma^{2}_{\gamma}) \mu^{BA}_{ik}(\boldsymbol{\beta},\sigma^{2}_{\gamma})=\sigma^{BA}_{i,jk}(\boldsymbol{\beta},\sigma^{2}_{\gamma}), \end{array} $$
(47)

where the formula for \(\mu ^{BA}_{ij}(\boldsymbol {\beta },\sigma ^{2}_{\gamma }),\) for example, is given by Eq. 43 (see also Eq. 11). Subsequently, combining Eqs. 44 and 47 we obtain the ni × ni covariance matrix of yi, as

$$ \begin{array}{@{}rcl@{}} \text{cov}[\boldsymbol{Y}_{i}]&=&{\boldsymbol{\Sigma}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma}) =(\sigma^{BA}_{i,jk}): n_{i} \times n_{i}. \end{array} $$
(48)
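The construction of the covariance matrix in Eqs. 44 to 48 can be sketched numerically as follows. This is an illustration only, again assuming the standardized binomial form h(v) = (v − V/2)/√(V/4) for Eq. 10; the function name `sigma_ba` and the example inputs are hypothetical.

```python
import numpy as np
from math import comb

def sigma_ba(X, beta, sigma, V=10):
    """Return (mu^BA_i, Sigma^BA_i) for one cluster with design X (n_i x p)."""
    v = np.arange(V + 1)
    h = (v - V / 2.0) / np.sqrt(V / 4.0)             # assumed form of h(v)
    w = np.array([comb(V, k) for k in v], dtype=float) * 0.5 ** V
    P = 1.0 / (1.0 + np.exp(-((X @ beta)[:, None] + sigma * h[None, :])))
    mu = P @ w                                       # Eq. 43
    lam = (P * w) @ P.T                              # lam[j, k] as in Eq. 46
    cov = lam - np.outer(mu, mu)                     # off-diagonals, Eq. 47
    np.fill_diagonal(cov, mu * (1.0 - mu))           # diagonals, Eq. 44
    return mu, cov

# Hypothetical cluster of size 3
X = np.array([[1.0, 0.2], [1.0, -0.5], [1.0, 1.0]])
beta = np.array([0.3, 0.8])
mu, cov = sigma_ba(X, beta, sigma=1.0)
print(np.round(cov, 4))
```

With `sigma=0.0` the off-diagonal entries vanish, reflecting that the responses in a cluster are correlated only through the shared random effect.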

Because \(E[\boldsymbol {Y}_{i}]={\boldsymbol {\mu }}^{BA}_{i}(\boldsymbol {\beta },\sigma ^{2})\) and \(\text {cov}[\boldsymbol {Y}_{i}]={\boldsymbol {\Sigma }}^{BA}_{i}(\boldsymbol {\beta },\sigma ^{2}_{\gamma })\) can be computed by Eqs. 43 and 48, respectively, following Sutradhar (2003, Section 3.1), for given \(\sigma ^{2}_{\gamma },\) one may then construct the desired GQL estimating equation for β as

$$ \begin{array}{@{}rcl@{}} &&{\sum}^{I}_{i=1}\frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})]^{\prime}}{\partial \boldsymbol{\beta}}[{\boldsymbol{\Sigma}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{-1} (\boldsymbol{y}_{i}-{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}))=0. \end{array} $$
(49)

Let the solution of this GQL estimating Eq. 49 be denoted by \(\hat {\boldsymbol {\beta }}_{GQL}.\) For practical benefit, the asymptotic properties of this estimator must be studied. The consistency of this estimator is examined in Section 4.2.1, along with its asymptotic normality property in Section 4.2.2. To solve the estimating Eq. 49, it remains to compute the matrix derivative involved in the equation, which we derive as follows.

Computation of the Derivative \(\frac {\partial [{\boldsymbol {\mu }}^{BA}_{i}(\boldsymbol {\beta },\sigma ^{2})]^{\prime }}{\partial \boldsymbol {\beta }}\)

For this matrix computation it is sufficient to compute the derivative vector, \(\frac {\partial \mu ^{BA}_{ij}(\boldsymbol {\beta },\sigma ^{2})}{\partial \boldsymbol {\beta }},\) which, following Eq. 43, can be derived as

$$ \begin{array}{@{}rcl@{}} \frac{\partial \mu^{BA}_{ij}(\boldsymbol{\beta},\sigma^{2}_{\gamma})}{\partial \boldsymbol{\beta}} &=& {\sum}^{V}_{v_{i}=0}\frac{\partial p^{*}_{ij}(\boldsymbol{x}_{ij};\boldsymbol{\beta},\sigma_{\gamma} h(v_{i}))}{\partial \boldsymbol{\beta}}\begin{pmatrix}V \\ v_{i}\end{pmatrix}(1/2)^{v_{i}}(1/2)^{V-v_{i}} \\ &=&{\sum}^{V}_{v_{i}=0}\boldsymbol{x}_{ij} p^{*}_{ij}(\boldsymbol{x}_{ij};\boldsymbol{\beta},\sigma_{\gamma} h(v_{i}))q^{*}_{ij}(\boldsymbol{x}_{ij};\boldsymbol{\beta},\sigma_{\gamma} h(v_{i}))\begin{pmatrix}V \\ v_{i}\end{pmatrix}(1/2)^{v_{i}}(1/2)^{V-v_{i}}: p \times 1, \end{array} $$
(50)

where \(q^{*}_{ij}(\boldsymbol {x}_{ij};\boldsymbol {\beta },\sigma _{\gamma } h(v_{i}))=1-p^{*}_{ij}(\boldsymbol {x}_{ij};\boldsymbol {\beta },\sigma _{\gamma } h(v_{i}))=[1+\exp (\boldsymbol {x}^{\prime }_{ij}\boldsymbol {\beta }+\sigma _{\gamma } h(v_{i}))]^{-1}.\)
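To make the use of Eqs. 49 and 50 concrete, the sketch below solves the GQL estimating equation for β by a Gauss-Newton type iteration, β ← β + [Σ D′Σ⁻¹D]⁻¹ Σ D′Σ⁻¹(y − μ), for a known value of σγ. The iteration scheme is a common choice for such estimating equations and is an assumption here, not a procedure stated in the text. The form of h(v) is assumed as the standardized binomial variate, and the function names and simulated data are hypothetical.

```python
import numpy as np
from math import comb

V = 10
_v = np.arange(V + 1)
H = (_v - V / 2.0) / np.sqrt(V / 4.0)     # assumed h(v): standardized Binomial(V, 1/2)
W = np.array([comb(V, k) for k in _v], dtype=float) * 0.5 ** V

def cluster_moments(X, beta, sigma):
    """BA mean (Eq. 43), covariance (Eq. 48) and derivative (Eq. 50)."""
    P = 1.0 / (1.0 + np.exp(-((X @ beta)[:, None] + sigma * H[None, :])))
    mu = P @ W
    cov = (P * W) @ P.T - np.outer(mu, mu)
    np.fill_diagonal(cov, mu * (1.0 - mu))
    D = ((P * (1.0 - P)) @ W)[:, None] * X   # row j holds d mu_ij / d beta'
    return mu, cov, D

def gql_score(Xs, ys, beta, sigma):
    """Left hand side of Eq. 49 and the matrix sum_i D_i' Sigma_i^{-1} D_i."""
    p = len(beta)
    U, J = np.zeros(p), np.zeros((p, p))
    for X, y in zip(Xs, ys):
        mu, cov, D = cluster_moments(X, beta, sigma)
        A = np.linalg.solve(cov, D)          # Sigma_i^{-1} D_i
        U += A.T @ (y - mu)
        J += D.T @ A
    return U, J

def gql_beta(Xs, ys, sigma, beta0, n_iter=30):
    """Iterate the Gauss-Newton type update for a fixed number of steps."""
    beta = np.asarray(beta0, dtype=float).copy()
    for _ in range(n_iter):
        U, J = gql_score(Xs, ys, beta, sigma)
        beta = beta + np.linalg.solve(J, U)
    return beta

# Simulated data (hypothetical): 400 clusters of size 4, normal cluster effects
rng = np.random.default_rng(1)
beta_true, sigma_true = np.array([0.5, -1.0]), 1.0
Xs, ys = [], []
for _ in range(400):
    X = np.column_stack([np.ones(4), rng.normal(size=4)])
    gamma = sigma_true * rng.normal()
    prob = 1.0 / (1.0 + np.exp(-(X @ beta_true + gamma)))
    Xs.append(X)
    ys.append(rng.binomial(1, prob).astype(float))

beta_hat = gql_beta(Xs, ys, sigma_true, beta0=np.zeros(2))
print(beta_hat)
```

In practice σγ would be replaced by an estimate, and the β and σγ updates would be alternated until both stabilize; the fixed iteration count used here is a simplification.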

4.1.2 MM Estimation of \(\sigma ^{2}_{\gamma }\)

Notice that the GQL estimating Eq. 49 for β was developed for known \(\sigma ^{2}_{\gamma },\) which is, however, unknown in practice. Similar to Sutradhar (2004) (see also Jiang, 1998), in this section we estimate this parameter by exploiting second order binary responses, whereas β was estimated using the first order responses. However, because \(\sigma ^{2}_{\gamma }\) is a parameter of secondary interest, as opposed to the GQL approach for \(\sigma ^{2}_{\gamma }\) estimation by Sutradhar (2004), for simplicity, we use the well known method of moments (MM). It is shown in Section 4.2.3 that this simpler MM estimation produces a consistent estimate of \(\sigma ^{2}_{\gamma },\) similar to the consistency property of the GQL regression estimator \(\hat {\boldsymbol {\beta }}_{GQL}.\) This MM estimator of \(\sigma ^{2}_{\gamma }\) is expected to be less efficient than its GQL counterpart, but this loss of efficiency is not a concern because \(\sigma ^{2}_{\gamma }\) is a parameter of secondary interest.

Let the second order response vectors under the present clustered binary setup, be denoted by

$$ \begin{array}{@{}rcl@{}} \boldsymbol{g}_{i}&=&(y^{2}_{i1},\ldots,y^{2}_{ij},\ldots,y^{2}_{in_{i}})^{\prime} \equiv(y_{i1},\ldots,y_{ij},\ldots,y_{in_{i}})^{\prime}=\boldsymbol{y}_{i}:n_{i} \times 1 \\ \boldsymbol{q}_{i} &=& (y_{i1}y_{i2},\ldots,y_{ij}y_{ik},\ldots, y_{i(n_{i}-1)}y_{in_{i}})^{\prime}: j<k: \frac{n_{i}(n_{i}-1)}{2} \times 1 \\ &\equiv& (q_{i,12},\ldots,q_{i,jk},\ldots,q_{i,(n_{i}-1)n_{i}})^{\prime}, \end{array} $$
(51)

containing all possible squared and pair-wise responses. Because, gi = yi, clearly \(E[\boldsymbol {G}_{i}]={\boldsymbol {\mu }}^{BA}_{i}(\boldsymbol {\beta },\sigma ^{2}_{\gamma })\) as in Eq. 43. Next, by Eqs. 45 and 46, we write

$$ \begin{array}{@{}rcl@{}} E[\boldsymbol{Q}_{i}]&=&(\lambda^{BA}_{i,12}(\boldsymbol{\beta},\sigma^{2}_{\gamma}),\ldots, \lambda^{BA}_{i,jk}(\boldsymbol{\beta},\sigma^{2}_{\gamma}), \ldots,\lambda^{BA}_{i,(n_{i}-1)n_{i}}(\boldsymbol{\beta},\sigma^{2}_{\gamma}))^{\prime} \\ &=&\boldsymbol{\lambda}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma}), \text{(say)}, \end{array} $$
(52)

where the BA based formula for \(E[Y_{ij}Y_{ik}]=\lambda ^{BA}_{i,jk}(\boldsymbol {\beta },\sigma ^{2}_{\gamma })\) is given in Eq. 46. Because both \(\boldsymbol {\mu }^{BA}_{i}(\cdot )\) and \(\boldsymbol {\lambda }^{BA}_{i}(\cdot )\) contain \(\sigma ^{2}_{\gamma }\) on top of β, we may construct an MM estimating equation for \(\sigma ^{2}_{\gamma }\) as

$$ \begin{array}{@{}rcl@{}} && {\sum}^{I}_{i=1}\left[\frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta}, \sigma^{2}_{\gamma})]^{\prime} }{\partial \sigma^{2}_{\gamma}} (\boldsymbol{y}_{i}-{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta}, \sigma^{2}_{\gamma}) ) \right. \\ &+& \left. \frac{\partial [{\boldsymbol{\lambda}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{\prime}}{\partial \sigma^{2}_{\gamma}} (\boldsymbol{q}_{i}-{\boldsymbol{\lambda}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma}))\right]=0, \end{array} $$
(53)

where

$$ \begin{array}{@{}rcl@{}} \!\!\!\!\!\!\!\frac{\partial {\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma})}{\partial \sigma^{2}_{\gamma}} = (\frac {\partial \mu^{BA}_{i1}(\boldsymbol{\beta},\sigma^{2}_{\gamma})}{\partial \sigma^{2}_{\gamma}},\ldots, \frac{\partial \mu^{BA}_{ij}(\boldsymbol{\beta},\sigma^{2}_{\gamma})}{\partial \sigma^{2}_{\gamma}},\ldots,\frac{\partial \mu^{BA}_{in_{i}}(\boldsymbol{\beta},\sigma^{2}_{\gamma})}{\partial \sigma^{2}_{\gamma}})^{\prime}, \end{array} $$
(54)

with

$$ \begin{array}{@{}rcl@{}} \frac{\partial \mu^{BA}_{ij}(\boldsymbol{\beta},\sigma^{2}_{\gamma})}{\partial \sigma^{2}_{\gamma}} &=&{\sum}^{V}_{v_{i}=0}\frac{1}{2\sigma_{\gamma}}h(v_{i}) p^{*}_{ij}(\boldsymbol{x}_{ij};\boldsymbol{\beta},\sigma_{\gamma} h(v_{i}))q^{*}_{ij}(\boldsymbol{x}_{ij};\boldsymbol{\beta},\sigma_{\gamma} h(v_{i}))\begin{pmatrix}V \\ v_{i}\end{pmatrix}(1/2)^{v_{i}}(1/2)^{V-v_{i}}, \end{array} $$
(55)

and

$$ \begin{array}{@{}rcl@{}} \frac{\partial {\boldsymbol{\lambda}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma})}{\partial \sigma^{2}_{\gamma}}&=& (\frac{\partial \lambda^{BA}_{i,12}(\boldsymbol{\beta},\sigma^{2}_{\gamma})}{\partial \sigma^{2}_{\gamma}},\ldots,\frac{\partial \lambda^{BA}_{i,jk}(\boldsymbol{\beta},\sigma^{2}_{\gamma})}{\partial \sigma^{2}_{\gamma}}, \ldots,\\&&\frac{\partial \lambda^{BA}_{i,(n_{i}-1)n_{i}}(\boldsymbol{\beta},\sigma^{2}_{\gamma})}{\partial \sigma^{2}_{\gamma}})^{\prime}, \end{array} $$
(56)

where, it follows from Eq. 46 that

$$ \begin{array}{@{}rcl@{}} &&\!\!\!\!\!\!\!\!\!\!\!\!\frac{\partial \lambda^{BA}_{i,jk}(\boldsymbol{\beta},\sigma^{2}_{\gamma})}{\partial \sigma^{2}_{\gamma}}={\sum}^{V}_{v_{i}=0}\frac{1}{2\sigma_{\gamma}}h(v_{i}) p^{*}_{ij}(\boldsymbol{x}_{ij};\boldsymbol{\beta},\sigma_{\gamma} h(v_{i}))p^{*}_{ik}(\boldsymbol{x}_{ik};\boldsymbol{\beta},\sigma_{\gamma} h(v_{i})) \\ &\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\times &\!\!\!\!\!\!\!\!\!\!\!\! [q^{*}_{ij}(\boldsymbol{x}_{ij};\boldsymbol{\beta},\sigma_{\gamma} h(v_{i})) +q^{*}_{ik}(\boldsymbol{x}_{ik};\boldsymbol{\beta},\sigma_{\gamma} h(v_{i}))]\begin{pmatrix}V \\ v_{i}\end{pmatrix}(1/2)^{v_{i}}(1/2)^{V-v_{i}}, \end{array} $$
(57)

for all j < k;j,k = 1,…,ni.

Let \(\hat {\sigma }^{2}_{\gamma , MM}\) denote the solution of the moment equation (53). In Section 4.2.3 we show that this MM estimator for \(\sigma ^{2}_{\gamma }\) is a consistent estimator under some mild regularity conditions.
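The MM estimating function in Eq. 53, together with the derivatives in Eqs. 55 and 57, can be sketched in Python as below. The root is located by simple bisection, which is a choice made here for illustration rather than a method prescribed by the text. The demonstration is a noise-free check: the exact BA moments at a chosen value of σ²γ are fed in as pseudo-data, so the equation should return that value. The form of h(v) is assumed to be the standardized binomial variate, and all function names and inputs are hypothetical.

```python
import numpy as np
from math import comb

V = 10
_v = np.arange(V + 1)
H = (_v - V / 2.0) / np.sqrt(V / 4.0)     # assumed h(v): standardized Binomial(V, 1/2)
W = np.array([comb(V, k) for k in _v], dtype=float) * 0.5 ** V

def mm_pieces(X, beta, s2):
    """mu^BA, lambda^BA (pairs j < k) and their derivatives in sigma^2_gamma."""
    sigma = np.sqrt(s2)
    P = 1.0 / (1.0 + np.exp(-((X @ beta)[:, None] + sigma * H[None, :])))
    Q = 1.0 - P
    j, k = np.triu_indices(X.shape[0], k=1)       # all pairs with j < k
    c = H * W / (2.0 * sigma)                     # h(v) * weight / (2 sigma_gamma)
    mu = P @ W                                    # Eq. 43
    lam = (P[j] * P[k]) @ W                       # Eq. 46
    dmu = (P * Q) @ c                             # Eq. 55
    dlam = (P[j] * P[k] * (Q[j] + Q[k])) @ c      # Eq. 57
    return mu, lam, dmu, dlam

def pairwise_products(y):
    """q_i of Eq. 51: products y_ij * y_ik for j < k."""
    j, k = np.triu_indices(len(y), k=1)
    return y[j] * y[k]

def mm_function(Xs, ys, qs, beta, s2):
    """Left hand side of Eq. 53 for given first and second order responses."""
    total = 0.0
    for X, y, q in zip(Xs, ys, qs):
        mu, lam, dmu, dlam = mm_pieces(X, beta, s2)
        total += dmu @ (y - mu) + dlam @ (q - lam)
    return total

def bisect(f, lo, hi, n_iter=80):
    """Plain bisection; requires a sign change of f on [lo, hi]."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("no sign change on the bracket")
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

# Noise-free check with a hypothetical cluster of size 4
X = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 0.5]])
beta = np.array([0.5, 1.0])
s2_0 = 1.0
mu0, lam0, _, _ = mm_pieces(X, beta, s2_0)
root = bisect(lambda s2: mm_function([X], [mu0], [lam0], beta, s2), 0.5, 2.0)
print(root)
```

With observed data one would pass `ys` and `[pairwise_products(y) for y in ys]` instead of the exact moments, and choose the bracket after checking that the estimating function changes sign on it.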

4.2 Consistency and Asymptotic Normality

4.2.1 Consistency of \(\hat {\boldsymbol {\beta }}_{GQL}\) Obtained from Eq. 49

We first apply a first order Taylor series expansion to the GQL estimating function in the left hand side of Eq. 49 and obtain

$$ \begin{array}{@{}rcl@{}} &&\hat{\boldsymbol{\beta}}_{GQL}-\boldsymbol{\beta} \simeq -\left[{\sum}^{I}_{i=1} \frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})]^{\prime}}{\partial \boldsymbol{\beta}}[{\boldsymbol{\Sigma}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{-1} \frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})]}{\partial \boldsymbol{\beta}^{\prime}}\right]^{-1} \\ &\times &\left[{\sum}^{I}_{i=1}\frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})]^{\prime}}{\partial \boldsymbol{\beta}}[{\boldsymbol{\Sigma}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{-1} (\boldsymbol{y}_{i}-{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}))\right] \\&&+o_{p}(1/\sqrt{N}), \end{array} $$
(58)

where \(N={\sum }^{I}_{i=1}n_{i}.\) Let GN be an N-dependent finite and bounded quantity that increases as N gets larger. Notice that the p × p matrix in the first term on the right hand side of Eq. 58 is free from the responses {y}. Suppose that this p × p matrix satisfies the regularity condition

$$ \begin{array}{@{}rcl@{}} \frac{1}{{\sum}^{I}_{i=1}n_{i}}{\sum}^{I}_{i=1} |\frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})]^{\prime}}{\partial \boldsymbol{\beta}}[{\boldsymbol{\Sigma}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{-1} \frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})]}{\partial \boldsymbol{\beta}^{\prime}}| \le G_{N}, \end{array} $$
(59)

implying that

$$ \begin{array}{@{}rcl@{}} {\sum}^{I}_{i=1} \frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})]^{\prime}}{\partial \boldsymbol{\beta}}[{\boldsymbol{\Sigma}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{-1} \frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})]}{\partial \boldsymbol{\beta}^{\prime}}\equiv O(NG_{N}). \end{array} $$
(60)

Next the second term in the right hand side of Eq. 58 converges to zero as

$$ \begin{array}{@{}rcl@{}} &&{}\quad{\sum}^{I}_{i=1}\frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})]^{\prime}}{\partial \boldsymbol{\beta}}[{\boldsymbol{\Sigma}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{-1} (\boldsymbol{y}_{i}-{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})) \\ &&{} \rightarrow E\left[{\sum}^{I}_{i=1}\frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})]^{\prime}}{\partial \boldsymbol{\beta}}[{\boldsymbol{\Sigma}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{-1} (\boldsymbol{y}_{i}-{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}))\right]=0, \end{array} $$
(61)

in the order of

$$ \begin{array}{@{}rcl@{}} &&{}\quad\left[|\text{cov}\left\{ {\sum}^{I}_{i=1}\frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})]^{\prime}}{\partial \boldsymbol{\beta}}[{\boldsymbol{\Sigma}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{-1} (\boldsymbol{y}_{i}-{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})) \right \}|\right]^{\frac{1}{2}} \\ &&{}=|\left[ {\sum}^{I}_{i=1}\frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})]^{\prime}}{\partial \boldsymbol{\beta}}[{\boldsymbol{\Sigma}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{-1} \frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})]}{\partial \boldsymbol{\beta}^{\prime}} \right]^{\frac{1}{2}}|\\&&{}\simeq O_{p}(\sqrt{NG_{N}}), \text{ by Eq. (60)}. \end{array} $$
(62)

Hence applying Eqs. 60 and 62 into 58, we obtain

$$ \begin{array}{@{}rcl@{}} [\hat{\boldsymbol{\beta}}_{GQL}-\boldsymbol{\beta} ] &=& O(N^{-1}G^{-1}_{N})O_{p}(\sqrt{NG_{N}})+o_{p}(1/\sqrt{N}) \\ &=&O_{p}((1/\sqrt{N})G^{-\frac{1}{2}}_{N})+o_{p}(1/\sqrt{N}) \equiv o_{p}(1/\sqrt{N}), \end{array} $$
(63)

because GN is a finite and bounded quantity. It then follows that

$$ \begin{array}{@{}rcl@{}} &&\lim_{N \rightarrow \infty} [\hat{\boldsymbol{\beta}}_{GQL}-\boldsymbol{\beta} ] \rightarrow 0. \end{array} $$
(64)

Thus, \(\hat {\boldsymbol {\beta }}_{GQL}\) obtained from Eq. 49 is consistent for β.

Note that as \(\hat {\boldsymbol {\beta }}_{GQL}\) is asymptotically unbiased for β, it follows from Eq. 58 that its asymptotic covariance matrix is given by

$$ \begin{array}{@{}rcl@{}} &&{}\lim_{I \rightarrow \infty}\text{cov}(\hat{\boldsymbol{\beta}}_{GQL}) \\&&{}=\left[ {\sum}^{I}_{i=1}\frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})]^{\prime}}{\partial \boldsymbol{\beta}}[{\boldsymbol{\Sigma}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{-1} \frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})]}{\partial \boldsymbol{\beta}^{\prime}}\right]^{-1}, \end{array} $$
(65)

which can be estimated by replacing β and \(\sigma ^{2}_{\gamma }\) with \(\hat {\boldsymbol {\beta }}_{GQL}\) and \(\hat {\sigma }^{2}_{\gamma ,MM},\) respectively, provided \(\hat {\sigma }^{2}_{\gamma ,MM}\) is a consistent estimator of \(\sigma ^{2}_{\gamma }.\) This latter consistency property is examined in Section 4.2.3. Further note that the aforementioned estimate for \(\text {cov}(\hat {\boldsymbol {\beta }}_{GQL})\) becomes more useful when confidence interval construction for the β parameter is needed. However, for such a confidence interval construction one needs to examine the asymptotic distribution of \(\hat {\boldsymbol {\beta }}_{GQL},\) which we do in the following section.

4.2.2 Asymptotic Normality of \(\hat {\boldsymbol {\beta }}_{GQL}\)

We outline the derivation of the asymptotic distribution as follows. Notice from Eq. 58 that, for large I, the estimator \(\hat {\boldsymbol {\beta }}_{GQL}\) satisfies the approximation

$$ \begin{array}{@{}rcl@{}} &&{}\quad\hat{\boldsymbol{\beta}}_{GQL}-\boldsymbol{\beta} \simeq -\left[\frac{1}{I} {\sum}^{I}_{i=1}\frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})]^{\prime}}{\partial \boldsymbol{\beta}}[{\boldsymbol{\Sigma}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{-1} \frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})]}{\partial \boldsymbol{\beta}^{\prime}} \right]^{-1} \\ &&{}\times \left[\frac{1}{I} {\sum}^{I}_{i=1}\frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})]^{\prime}}{\partial \boldsymbol{\beta}}[{\boldsymbol{\Sigma}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{-1} (\boldsymbol{y}_{i}-{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}))\right], \end{array} $$
(66)

which, writing \(\boldsymbol {f}_{i}(\boldsymbol {\beta }|\sigma ^{2}_{\gamma },\boldsymbol {y}_{i})\) for the i-th summand of the GQL estimating function in Eq. 49, we re-express as

$$ \begin{array}{@{}rcl@{}} \hat{\boldsymbol{\beta}}_{GQL}-\boldsymbol{\beta} \simeq -\left[{\sum}^{I}_{i=1}\frac{\partial \boldsymbol{f}_{i}(\boldsymbol{\beta}|\sigma^{2}_{\gamma}, \boldsymbol{y}_{i})}{\partial \boldsymbol{\beta}^{\prime}}\right]^{-1} \left[{\sum}^{I}_{i=1}\boldsymbol{f}_{i}(\boldsymbol{\beta}|\sigma^{2}_{\gamma}, \boldsymbol{y}_{i})\right]. \end{array} $$
(67)

Let

$$ \begin{array}{@{}rcl@{}} \bar{\boldsymbol{f}}_{I}(\boldsymbol{\beta}|\sigma^{2}_{\gamma})=\frac{1}{I}{\sum}^{I}_{i=1} \boldsymbol{f}_{i}(\boldsymbol{\beta}|\sigma^{2}_{\gamma},\boldsymbol{y}_{i}), \end{array} $$
(68)

where fi’s are clearly independent because y1,…,yi,…,yI are independent vectors from I independent clusters. But, they are not identically distributed because of the fact that

$$ \{\boldsymbol{Y}_{i}: n_{i} \times 1\} \sim ({\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma}),{\boldsymbol{\Sigma}}^{BA}_{i} (\boldsymbol{\beta},\sigma^{2}_{\gamma})), $$
(69)

by Eqs. 43 and 48, i.e., the means, variances and covariances vary from cluster to cluster. Notice from Eqs. 66-68 that \(\bar {\boldsymbol {f}}_{I}(\boldsymbol {\beta }|\sigma ^{2}_{\gamma })\) in Eq. 68 has the mean vector and covariance matrix given by

$$ \begin{array}{@{}rcl@{}} &&E[\bar{\boldsymbol{f}}_{I}(\boldsymbol{\beta})]=0 , \text{and} \\ && \text{cov}[\bar{\boldsymbol{f}}_{I}(\boldsymbol{\beta})] =\frac{1}{I^{2}}{\sum}^{I}_{i=1} \frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})]^{\prime}}{\partial \boldsymbol{\beta}}[{\boldsymbol{\Sigma}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{-1} \frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2})]}{\partial \boldsymbol{\beta}^{\prime}} \\ &=&\frac{1}{I^{2}} {\boldsymbol{V}}^{*}_{I}(\boldsymbol{\beta},\sigma^{2}_{\gamma}), \text{(say)}. \end{array} $$
(70)

Next we assume that the multivariate version of Lindeberg’s condition, that is,

$$ \begin{array}{@{}rcl@{}} {\lim}_{I \rightarrow \infty}{\boldsymbol{V}^{*}}^{-1}_{I} {\sum}^{I}_{i=1}{\sum}_{\{\boldsymbol{f}^{\prime}_{i}{\boldsymbol{V}^{*}}^{-1}_{I}\boldsymbol{f}_{i}\}> \epsilon}\boldsymbol{f}_{i}\boldsymbol{f}^{\prime}_{i}p^{\dagger}(\boldsymbol{f}_{i})=0 \end{array} $$
(71)

holds for all 𝜖 > 0, \(p^{\dagger }(\cdot )\) being the probability distribution of fi. Then the Lindeberg-Feller central limit theorem [Amemiya (1985, Theorem 3.3.6), McDonald (2005, Theorem 2.2)] implies the following convergence in distribution \((\rightarrow _{d}):\)

$$ \begin{array}{@{}rcl@{}} &&\boldsymbol{Z}_{I}=I[{\boldsymbol{V}}^{*}_{I}]^{-\frac{1}{2}}\bar{\boldsymbol{f}}_{I}(\boldsymbol{\beta})\rightarrow_{d} N_{p}(0,I_{p}). \end{array} $$
(72)

Ip being the p × p identity matrix.

By using the notations from Eq. 68 it follows from Eqs. 67 and 72 that

$$ \begin{array}{@{}rcl@{}} &&\hat{\boldsymbol{\beta}}_{GQL}-\boldsymbol{\beta} \simeq -\left[{\sum}^{I}_{i=1}\frac{\partial \boldsymbol{f}_{i}(\boldsymbol{\beta})} {\partial \boldsymbol{\beta}^{\prime}}\right]^{-1} \left[{\sum}^{I}_{i=1}\boldsymbol{f}_{i}\right] \\ &=&[\boldsymbol{V}^{*}_{I}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{-1} [\boldsymbol{V}^{*}_{I}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{\frac{1}{2}} [\boldsymbol{V}^{*}_{I}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{-\frac{1}{2}}I\bar{\boldsymbol{f}}_{I}(\boldsymbol{\beta}) \\ &=&[\boldsymbol{V}^{*}_{I}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{-\frac{1}{2}}\boldsymbol{Z}_{I}. \end{array} $$
(73)

Clearly, by Eq. 72, the quantity in Eq. 73 converges in distribution, as

$$ \begin{array}{@{}rcl@{}} &&[\boldsymbol{V}^{*}_{I}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{-\frac{1}{2}}\boldsymbol{Z}_{I} \rightarrow_{d} N_{p}(0, [\boldsymbol{V}^{*}_{I}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{-1}). \end{array} $$
(74)

Notice that this normal covariance matrix \([\boldsymbol {V}^{*}_{I}(\boldsymbol {\beta },\sigma ^{2}_{\gamma })]^{-1}\) is the same as the limiting covariance matrix in Eq. 65, as expected.
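As a practical illustration of Eqs. 65 and 74, the following sketch computes the matrix V*_I, its inverse as the estimated covariance of the GQL estimator, and normal theory (Wald) confidence intervals for the components of β. It is a sketch under stated assumptions: h(v) is taken as the standardized binomial variate, the plug-in values `beta_hat` and `sigma_hat` are hypothetical numbers rather than estimates from real data, and the function names are invented for this example.

```python
import numpy as np
from math import comb

V = 10
_v = np.arange(V + 1)
H = (_v - V / 2.0) / np.sqrt(V / 4.0)     # assumed h(v): standardized Binomial(V, 1/2)
W = np.array([comb(V, k) for k in _v], dtype=float) * 0.5 ** V

def information(Xs, beta, sigma):
    """V*_I = sum_i D_i' [Sigma^BA_i]^{-1} D_i, as in Eqs. 65 and 70."""
    p = len(beta)
    Vstar = np.zeros((p, p))
    for X in Xs:
        P = 1.0 / (1.0 + np.exp(-((X @ beta)[:, None] + sigma * H[None, :])))
        mu = P @ W
        cov = (P * W) @ P.T - np.outer(mu, mu)
        np.fill_diagonal(cov, mu * (1.0 - mu))
        D = ((P * (1.0 - P)) @ W)[:, None] * X    # Eq. 50, stacked by rows
        Vstar += D.T @ np.linalg.solve(cov, D)
    return Vstar

def wald_intervals(beta_hat, Vstar, z=1.959963984540054):
    """Approximate 95% intervals beta_hat +/- z * se, se from [V*_I]^{-1}."""
    se = np.sqrt(np.diag(np.linalg.inv(Vstar)))
    return np.column_stack([beta_hat - z * se, beta_hat + z * se])

# Hypothetical design: 50 clusters of size 3, intercept plus one covariate
rng = np.random.default_rng(0)
Xs = [np.column_stack([np.ones(3), rng.normal(size=3)]) for _ in range(50)]
beta_hat = np.array([0.4, -0.9])          # hypothetical plug-in estimate
sigma_hat = 0.8                           # hypothetical plug-in estimate
Vstar = information(Xs, beta_hat, sigma_hat)
print(wald_intervals(beta_hat, Vstar))
```

Because V*_I is a sum over clusters, doubling the number of clusters with the same design doubles V*_I and shortens each interval by a factor of √2.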

4.2.3 Consistency of \(\hat {\sigma }^{2}_{\gamma , MM}\) Obtained from Eq. 53

Consistency property of \(\hat {\sigma }^{2}_{\gamma ,MM}\) can be established in a similar way as that of \(\hat {\boldsymbol {\beta }}_{GQL}\) discussed in Section 4.2.1. For convenience we, however, highlight the main steps below. Because \(\hat {\sigma }^{2}_{\gamma , MM}\) is the solution of the MM estimating Eq. 53, a first order Taylor series expansion of the estimating function in the left hand side of Eq. 53 about \(\sigma ^{2}_{\gamma }\) provides

$$ \begin{array}{@{}rcl@{}} &&[\hat{\sigma}^{2}_{\gamma,MM}-\sigma^{2}_{\gamma}] \\ &\simeq & - \left[{\sum}^{I}_{i=1}\left\{\frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta}, \sigma^{2}_{\gamma})]^{\prime} }{\partial \sigma^{2}_{\gamma}} \frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta}, \sigma^{2}_{\gamma})] }{\partial \sigma^{2}_{\gamma}} + \frac{\partial [{\boldsymbol{\lambda}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{\prime}}{\partial \sigma^{2}_{\gamma}} \frac{\partial [{\boldsymbol{\lambda}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]}{\partial \sigma^{2}_{\gamma}}\right\}\right]^{-1} \\ &\times & \left[{\sum}^{I}_{i=1}\left\{\frac{\partial [{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta}, \sigma^{2}_{\gamma})]^{\prime} }{\partial \sigma^{2}_{\gamma}} (\boldsymbol{y}_{i}-{\boldsymbol{\mu}}^{BA}_{i}(\boldsymbol{\beta}, \sigma^{2}_{\gamma}) )+ \frac{\partial [{\boldsymbol{\lambda}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]^{\prime}}{\partial \sigma^{2}_{\gamma}} (\boldsymbol{q}_{i}-{\boldsymbol{\lambda}}^{BA}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma}))\right\}\right] \\ &+&o_{p}(1/\sqrt{N}), \end{array} $$
(75)

where \(N={\sum }^{I}_{i=1}n_{i}.\) For convenience of further calculations, we re-express the equation in (75) as

$$ \begin{array}{@{}rcl@{}} &&\hat{\sigma}^{2}_{\gamma,MM}-\sigma^{2}_{\gamma} \simeq -S^{-1}_{1}S_{2,y}+o_{p}(1/\sqrt{N}). \end{array} $$
(76)

Suppose that HN is an N-dependent increasing but finite and bounded quantity, and S1 in Eq. 76 satisfies the following regularity condition:

$$ \frac{1}{{\sum}^{I}_{i=1}n_{i}}S_{1} \le H_{N}, $$
(77)

implying that

$$ \begin{array}{@{}rcl@{}} &&S_{1} \approx O(NH_{N}). \end{array} $$
(78)

Notice that EY[S2,y] = 0. It then follows that \(S_{2,y} \rightarrow _{p} E_{Y}[S_{2,y}]=0,\) at a rate of the order of \([\text {var}(S_{2,y})]^{\frac {1}{2}}.\) To compute this variance formula, it is convenient to re-express S2,y, by using Eq. 75, as

$$ \begin{array}{@{}rcl@{}} &&S_{2,y}={\sum}^{I}_{i=1}\left\{\frac{{\sum}^{n_{i}}_{j=1}\partial [\mu^{BA}_{ij}(\boldsymbol{\beta}, \sigma^{2}_{\gamma})] }{\partial \sigma^{2}_{\gamma}} (y_{ij}-\mu^{BA}_{ij}(\boldsymbol{\beta}, \sigma^{2}_{\gamma}) ) \right. \\ &+& \left. {\sum}^{n_{i}}_{j <k} \frac{\partial [\lambda^{BA}_{i,jk}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]}{\partial \sigma^{2}_{\gamma}} (y_{ij}y_{ik}-\lambda^{BA}_{i,jk}(\boldsymbol{\beta},\sigma^{2}_{\gamma}))\right\}, \end{array} $$
(79)

and obtain its variance as

$$ \begin{array}{@{}rcl@{}} &&\text{var}(S_{2,y}) \\ &=&{\sum}^{I}_{i=1}\left[\left\{{\sum}^{n_{i}}_{j=1}\left( \frac{\partial [\mu^{BA}_{ij}(\boldsymbol{\beta}, \sigma^{2}_{\gamma})] }{\partial \sigma^{2}_{\gamma}}\right)^{2}\text{var}(Y_{ij})\right. \right. \\&&+2{\sum}^{n_{i}}_{j <k} \left( \frac{\partial [\mu^{BA}_{ij}(\boldsymbol{\beta}, \sigma^{2}_{\gamma})] }{\partial \sigma^{2}_{\gamma}}\frac{\partial [\mu^{BA}_{ik}(\boldsymbol{\beta}, \sigma^{2}_{\gamma})] }{\partial \sigma^{2}_{\gamma}}\right) \\ &\times & \left. \text{cov}\left( Y_{ij},Y_{ik}\right) \right\} +\left\{{\sum}^{n_{i}}_{j<k}\left( \frac{\partial [\lambda^{BA}_{i,jk}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]}{\partial \sigma^{2}_{\gamma}}\right)^{2}\text{var}(Y_{ij}Y_{ik}) \right. \\ &+& \left. {\sum}^{n_{i}}_{j<k}{\sum}^{n_{i}}_{\ell<m}\left( \frac{\partial [\lambda^{BA}_{i,jk}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]}{\partial \sigma^{2}_{\gamma}}\frac{\partial [\lambda^{BA}_{i,\ell m}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]}{\partial \sigma^{2}_{\gamma}}\right) \text{cov}(Y_{ij}Y_{ik},Y_{i\ell}Y_{im}) \right\} \\ &+&\left. \left\{{\sum}^{n_{i}}_{j=1}{\sum}^{n_{i}}_{k< \ell}\left( \frac{\partial [\mu^{BA}_{ij}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]}{\partial \sigma^{2}_{\gamma}}\frac{\partial [\lambda^{BA}_{i,k \ell}(\boldsymbol{\beta},\sigma^{2}_{\gamma})]}{\partial \sigma^{2}_{\gamma}}\right) \text{cov}(Y_{ij},Y_{ik}Y_{i\ell}) \right\} \right] \\ &=&{\sum}^{I}_{i=1}{\Omega}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma}), \text{(say)}, \end{array} $$
(80)

where \(\text {var}(Y_{ij})=\sigma ^{BA}_{i,jj}(\boldsymbol {\beta },\sigma ^{2}_{\gamma })\) and \(\text {cov}\left (Y_{ij},Y_{ik}\right )=\sigma ^{BA}_{i,jk}(\boldsymbol {\beta },\sigma ^{2}_{\gamma })\) are given by Eqs. 44 and 47, respectively. The computational formulas for the remaining third and fourth order moments, i.e., for \(\text {var}(Y_{ij}Y_{ik})=\omega ^{BA}_{i,jjkk}(\boldsymbol {\beta },\sigma ^{2}_{\gamma }),\) \(\text {cov}(Y_{ij}Y_{ik},Y_{i\ell }Y_{im}) =\omega ^{BA}_{i,jk\ell m}(\boldsymbol {\beta },\sigma ^{2}_{\gamma }),\) and \(\text {cov}(Y_{ij},Y_{ik}Y_{i\ell })=\phi ^{BA}_{i,jk\ell } (\boldsymbol {\beta },\sigma ^{2}_{\gamma }),\) are relatively lengthy and are given in Appendix A.

Suppose that for an N-dependent finite and bounded quantity KN, var (S2,y) satisfies the regularity condition

$$ \begin{array}{@{}rcl@{}} &&\frac{1}{{\sum}^{I}_{i=1}n_{i}}{\sum}^{I}_{i=1}{\Omega}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma}) \le K_{N}, \end{array} $$
(81)

implying that

$$ \begin{array}{@{}rcl@{}} &&\left[\text{var}(S_{2,y})\right]^{\frac{1}{2}}=O_{p}(\sqrt{NK_{N}}). \end{array} $$
(82)

Now by applying Eqs. 77 and 82 to 76, one obtains

$$ \begin{array}{@{}rcl@{}} \hat{\sigma}^{2}_{\gamma,MM}-\sigma^{2}_{\gamma} &\simeq & O(N^{-1}H^{-1}_{N})O_{p}(\sqrt{NK_{N}})+o_{p}(1/\sqrt{N}) \\ &=&O_{p}(N^{-\frac{1}{2}}\frac{\sqrt{K_{N}}}{H_{N}})+o_{p}(1/\sqrt{N})=o_{p}(1/\sqrt{N}), \end{array} $$
(83)

because both HN and KN are finite and bounded. Hence,

$$ {\lim}_{N \rightarrow \infty}[\hat{\sigma}^{2}_{\gamma,MM}-\sigma^{2}_{\gamma}] \rightarrow_{p} 0, $$
(84)

justifying that \(\hat {\sigma }^{2}_{\gamma ,MM}\) is consistent for \(\sigma ^{2}_{\gamma },\) and that it can be used in the GQL estimating Eq. 49 while estimating β.

We remark that because by Eq. 84, \(\hat {\sigma }^{2}_{\gamma ,MM}\) is asymptotically unbiased for \(\sigma ^{2}_{\gamma },\) one may then compute the asymptotic variance of \(\hat {\sigma }^{2}_{\gamma ,MM}\) by exploiting Eqs. 75 and 76. More specifically,

$$ \begin{array}{@{}rcl@{}} \text{var}[\hat{\sigma}^{2}_{\gamma,MM}]&=&S^{-1}_{1}(\boldsymbol{\beta},\sigma^{2}_{\gamma})\text{var}[S_{2,y}] S^{-1}_{1}(\boldsymbol{\beta},\sigma^{2}_{\gamma}) \\ &=&S^{-1}_{1}(\boldsymbol{\beta},\sigma^{2}_{\gamma}){\sum}^{I}_{i=1}{\Omega}_{i}(\boldsymbol{\beta},\sigma^{2}_{\gamma}) S^{-1}_{1}(\boldsymbol{\beta},\sigma^{2}_{\gamma}), \end{array} $$
(85)

by Eq. 80, which can be estimated by replacing β with \(\hat {\boldsymbol {\beta }}_{GQL},\) and \(\sigma ^{2}_{\gamma }\) with \(\hat {\sigma }^{2}_{\gamma ,MM}.\)

5 On the Bayesian Approach for Correlated Binary Data

We continue discussing the mixed effects models (1) for cross-sectional cluster binary data and the time dynamic fixed effects models (4) for longitudinal cluster data. However, as opposed to the parametric correlation structure based regression analysis, here we focus on some of the existing alternative studies using the Bayesian approach where, without specifying the correlation structures, multilevel conditional models are used to estimate the main parameters, such as the individual level covariate effects (β), and cluster specific parameters, such as the cluster variation \(\sigma ^{2}_{\gamma }\) in Eq. 1 under the cross-sectional cluster model (e.g., McCulloch, 1997) or the dynamic dependence parameter (ρ) in Eq. 4 under the longitudinal cluster model (e.g., Chib and Jeliazkov, 2006). We remark that because the mixed effects models are also used by some authors, such as Stiratelli et al. (1984) and Zeger et al. (1988), for binary longitudinal data, we also include these models under the longitudinal setup on top of the dynamic fixed effects models.

5.1 Monte Carlo Based Likelihood Estimation for Cluster Binary Data

We keep focussing on the clustered binary model but, instead of a common random cluster effect γi, consider a more general situation using γij as the cluster specific individual random effect for the j-th individual in the i-th cluster. Let \({z}^{*}_{ij}\) denote a cluster-specific scalar covariate associated with the random cluster effect γij. Thus, \({z}^{*}_{ij}=1,\) and γij = γi for all j = 1,…,ni, would refer to the basic cluster-specific mean model (1). Similar to Stiratelli et al. (1984, Eqn. (2.2)), Zeger et al. (1988, Eqn. (2.1)), and Daniels and Gatsonis (1999, Eqns. (1)-(2)), we may write the logit link for this general case as

$$ \begin{array}{@{}rcl@{}} &&\text{logit}(p^{*}_{ij}(\boldsymbol{\beta},{\gamma}_{ij}))\equiv \ell(p^{*}_{ij})=\boldsymbol{x}^{\prime}_{ij}\boldsymbol{\beta}+z^{*}_{ij}{\gamma}_{ij}, \end{array} $$
(86)

for j = 1,…,ni; i = 1,…,I. Notice that Daniels and Gatsonis (1999) have expressed the logit link as \(\ell (p^{*}_{ij})={\boldsymbol {x}^{*}}^{\prime }_{ij}\boldsymbol {\alpha }_{i}.\) For our discussion it is convenient to use the notation in Eq. 86. Write \(\boldsymbol {\gamma }_{i}=(\gamma _{i1},\ldots ,\gamma _{ij},\ldots ,\gamma _{in_{i}})^{\prime }.\) Similar to the normality assumption in Eq. 7, and also in the aforementioned studies, one may assume that

$$ \begin{array}{@{}rcl@{}} &&\boldsymbol{\gamma}_{i} \sim N(0, \boldsymbol{D}_{i}), \end{array} $$
(87)

where Di is the ni × ni covariance matrix.
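As an illustration of the model (86)–(87), the following minimal Python sketch simulates the binary responses of one cluster. The function name `simulate_cluster`, the parameter values, and the choice of a rank-one Di (which reproduces the shared effect γij = γi of model (1)) are assumptions made only for this example.

```python
import numpy as np

def simulate_cluster(x, z, beta, D, rng):
    """Draw one cluster's binary responses under the logit model (86)-(87).

    x    : (n_i, p) covariate matrix with rows x_ij'
    z    : (n_i,) scalar covariates z*_ij
    beta : (p,) regression effects
    D    : (n_i, n_i) covariance matrix D_i of gamma_i
    """
    n_i = x.shape[0]
    gamma = rng.multivariate_normal(np.zeros(n_i), D)  # gamma_i ~ N(0, D_i)
    eta = x @ beta + z * gamma                         # right-hand side of (86)
    p = 1.0 / (1.0 + np.exp(-eta))                     # inverse logit
    y = rng.binomial(1, p)                             # conditionally independent binaries
    return y, gamma, p

rng = np.random.default_rng(0)
n_i = 5
beta = np.array([0.5, -1.0])           # illustrative values only
x = rng.normal(size=(n_i, 2))
z = np.ones(n_i)                       # z*_ij = 1
sigma2_gamma = 0.8
# Rank-one D_i: all components of gamma_i coincide, i.e. gamma_ij = gamma_i.
D_shared = sigma2_gamma * np.ones((n_i, n_i))
y, gamma, p = simulate_cluster(x, z, beta, D_shared, rng)
```

Replacing `D_shared` by a full-rank covariance matrix gives individual-specific random effects within the cluster.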

Recall from Eq. 6 that a closed-form likelihood function cannot be obtained because of the integration over the distribution of the random effect γi. To handle this integration problem, numerical algorithms have been developed in which γi is treated as missing data and is drawn from the conditional distribution of γi|y by using the so-called Metropolis algorithm (Gelfand and Carlin, 1993), which does not require specification of the unconditional density of the binary data y. More specifically, the Metropolis algorithm is used to simulate the random effects, and the so-called expectation-maximization (EM) or Newton-Raphson (NR) technique is used to maximize the Monte Carlo (MC) based approximate likelihood function for the estimation of the regression effects β. We may refer to McCulloch (1997), for example, for these MCEM and MCNR approaches. Some authors such as Daniels and Gatsonis (1999), instead of the normality in Eq. 87, have assumed a more general symmetric multivariate t distribution for γi given by

$$ \begin{array}{@{}rcl@{}} &&\boldsymbol{\gamma}_{i} \sim t_{\nu}(\boldsymbol{G}_{i}\gamma,\boldsymbol{D}_{i}), \end{array} $$
(88)

where Gi is a cluster level covariates dependent known matrix of dimension ni × q, and γ is a q-dimensional vector of suitable parameters, and ν is the unknown degrees of freedom parameter. Next, using suitable proper prior distributions for γ and Di, Daniels and Gatsonis (1999, Section 2.2) used the Markov Chain Monte Carlo (MCMC) approach for the desired model fitting.
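The Metropolis step mentioned above, in which γi is drawn from the conditional distribution of γi|y, may be sketched as follows. This is only an illustration under simplifying assumptions: a shared normal random intercept as in model (1), known values of β and \(\sigma ^{2}_{\gamma },\) and a Gaussian random-walk proposal, which is not necessarily the proposal used in the cited papers. The function names are hypothetical.

```python
import numpy as np

def log_target(gamma, y, eta_fixed, sigma2):
    """Log of the unnormalized density of gamma_i given y_i.

    eta_fixed holds the fixed-effects part x_ij' beta for j = 1,...,n_i.
    """
    eta = eta_fixed + gamma
    # Bernoulli log-likelihood under the logit link: y*eta - log(1 + exp(eta))
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    logprior = -0.5 * gamma**2 / sigma2        # N(0, sigma2) kernel
    return loglik + logprior

def metropolis_gamma(y, eta_fixed, sigma2, n_draws=5000, step=1.0, seed=0):
    """Random-walk Metropolis draws of gamma_i | y_i (assumed proposal)."""
    rng = np.random.default_rng(seed)
    draws = np.empty(n_draws)
    gamma = 0.0
    lp = log_target(gamma, y, eta_fixed, sigma2)
    for k in range(n_draws):
        prop = gamma + step * rng.normal()
        lp_prop = log_target(prop, y, eta_fixed, sigma2)
        # Accept with probability min(1, target(prop) / target(gamma)).
        if np.log(rng.uniform()) < lp_prop - lp:
            gamma, lp = prop, lp_prop
        draws[k] = gamma
    return draws

y = np.array([1, 1, 1, 0, 1, 1])       # illustrative cluster responses
eta_fixed = np.zeros(6)                # x_ij' beta = 0 for simplicity
draws = metropolis_gamma(y, eta_fixed, sigma2=1.0)
post_mean = draws[1000:].mean()        # discard the first 1000 draws as burn-in
```

Only the ratio of target densities enters the acceptance step, which is why the unconditional density of y is never needed. In an MCEM or MCNR scheme such draws would be regenerated at every update of β and \(\sigma ^{2}_{\gamma }.\)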

However, as expected, the above Monte Carlo based likelihood inference approach is computationally expensive. Moreover, the selection of proper prior distributions is a challenge in this approach. For example, while under the normality assumption for γi (87) it is reasonable to consider the so-called Wishart distribution as the prior for \(\boldsymbol {D}^{-1}_{i},\) this may not be a proper prior distribution when γi follows the multivariate t distribution as in Eq. 88. This is because, as Sutradhar and Ali (1989), for example, have shown, the Wishart distribution derived under the multivariate t-model is different from the usual normality based Wishart distribution. More specifically, it also depends on the degrees of freedom of the t-distribution.

Turning back to the parametric inferences discussed in Section 4, when the normality assumption in Eq. 87 holds, using the so-called Binomial approximation (BA), one may easily construct the correlation structure as explained in Section 4.1.1 and apply the generalized quasi-likelihood (GQL) estimation approach (49) to obtain a consistent and highly efficient estimate for β, and the method of moments (MM) estimation approach (53) to obtain consistent estimates for the variance parameters involved in Di. Alternatively, as also pointed out by Daniels and Gatsonis (1999, Section 1), one may use other parametric approaches such as the penalized quasi-likelihood (PQL) approach of Breslow and Clayton (1993) or the hierarchical likelihood (HQL) approach of Lee and Nelder (1996), which are simpler than the MCMC based Bayesian approach.

5.2 Monte Carlo Based Likelihood Estimation for Longitudinal Binary Data

With regard to the analysis of longitudinal binary data in the Bayesian setup, most of the existing studies used the same random effects based logit link model (86) with some modifications as follows. Using the notation from Section 3, we re-write the logit link model for longitudinal data as

$$ \begin{array}{@{}rcl@{}} &&\text{logit}(\tilde{p}_{it}(\boldsymbol{\beta},{\gamma}_{i}))\equiv \ell(\tilde{p}_{it})=\boldsymbol{x}^{\prime}_{it}\boldsymbol{\beta}+\tilde{z}_{it}{\gamma}_{i}, \end{array} $$
(89)

for the binary response yit, recorded at time t (t = 1,…,T), for the i-th individual. Here both xit and \(\tilde {z}_{it}\) are time dependent covariates. Under this model, the binary responses yiu at time u, and yit at time t, become correlated under the assumption that the i-th individual’s random effect remains the same over time.

We remark that in a longitudinal setup, irrespective of whether the responses are linear, count or binary, it is expected in practice that the correlation between two responses decreases as the time lag between them increases. Two dynamic models considered in the literature for binary responses, namely the AR(1) type dynamic model (2) and the binary dynamic logit (BDL) model (4), satisfy this lag dependent decaying correlation property. Specifically, the AR(1) model (2) produces the correlations \(\text {corr}(Y_{iu},Y_{it})= \rho ^{t-u}[\frac {\sigma _{iuu}}{\sigma _{itt}}]^{\frac {1}{2}}\) as in Eq. 26, which (a) become smaller as |t − u| increases, and (b) contain time varying covariates through σitt. Similarly, the BDL model (4) produces the correlations

$$ \text{corr}(Y_{iu},Y_{it})=\sqrt{\frac{\mu_{iu}(\cdot)(1-\mu_{iu}(\cdot))} {\mu_{it}(\cdot)(1-\mu_{it}(\cdot))}}{\Pi}^{t}_{v=u+1} ({\tilde{\tilde{p}}}_{iv}(\boldsymbol{\beta},\rho)-\tilde{p}_{iv}(\boldsymbol{\beta})) $$

as in Eq. 35, which decay as |t − u| increases. This is because

$$ 0<({\tilde{\tilde{p}}}_{iv}(\boldsymbol{\beta},\rho)-\tilde{p}_{iv}(\boldsymbol{\beta}))<1, $$

for all v = (u + 1),…,t. Also it contains time varying covariates.
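The decay of the BDL correlations with the lag can be checked numerically. The sketch below rests on assumptions that should be compared with Eqs. 4 and 35: \(\tilde {p}_{iv}\) and \({\tilde {\tilde {p}}}_{iv}\) are taken to be the conditional probabilities of Yiv = 1 given yi,v−1 = 0 and yi,v−1 = 1, respectively, the marginal means follow the recursion μit = p̃it + μi,t−1(p̃̃it − p̃it) with μi1 = p̃i1, and ρ > 0. The function name and the numerical values are illustrative only.

```python
import numpy as np

def expit(a):
    return 1.0 / (1.0 + np.exp(-a))

def bdl_lag_corr(eta, rho, u, t):
    """corr(Y_iu, Y_it) for 1 <= u < t <= T under the assumed BDL form.

    eta : (T,) array of linear predictors x_it' beta for t = 1,...,T
    rho : dynamic dependence parameter (assumed positive here)
    """
    p0 = expit(eta)              # assumed P(Y_it = 1 | y_{i,t-1} = 0)
    p1 = expit(eta + rho)        # assumed P(Y_it = 1 | y_{i,t-1} = 1)
    mu = np.empty_like(eta)
    mu[0] = p0[0]                # assumed initial mean mu_i1
    for s in range(1, len(eta)):
        mu[s] = p0[s] + mu[s - 1] * (p1[s] - p0[s])   # recursive marginal mean
    var = mu * (1.0 - mu)
    # Product over v = u+1,...,t (1-based) uses array indices u,...,t-1.
    prod = np.prod(p1[u:t] - p0[u:t])
    return np.sqrt(var[u - 1] / var[t - 1]) * prod

eta = np.full(6, -0.3)           # covariates held constant over T = 6 times
corrs = np.array([bdl_lag_corr(eta, 1.0, 1, t) for t in range(2, 7)])
print(np.round(corrs, 4))        # correlations of Y_i1 with Y_i2,...,Y_i6
```

Each additional lag multiplies the correlation by a further factor lying strictly between 0 and 1, so the sequence shrinks roughly geometrically.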

Note that the random effects model (89), for example, does not satisfy decaying correlation property (a) mentioned above. It, however, satisfies (b) indicating that correlations contain time varying covariates. This is because under the assumption that \(\gamma _{i} {\stackrel {iid}{\sim }} N(0,\tilde {\sigma }^{2}_{\gamma }),\) we can compute

$$ \begin{array}{@{}rcl@{}} E[Y_{it}]&=&E_{\gamma_{i}}\left[E(Y_{it}|\gamma_{i})\right] =\tilde{\mu}_{it}(\boldsymbol{x}_{it},z^{*}_{it},\boldsymbol{\beta},\tilde{\sigma}^{2}_{\gamma}) \\ &=&\int \left[\frac{\exp(\boldsymbol{x}^{\prime}_{it}\boldsymbol{\beta}+z^{*}_{it}\gamma_{i})} {[1+\exp(\boldsymbol{x}^{\prime}_{it}\boldsymbol{\beta}+z^{*}_{it}\gamma_{i})]}\right]dG_{N}(\gamma_{i},\tilde{\sigma}^{2}_{\gamma}), \end{array} $$
(90)

and

$$ \begin{array}{@{}rcl@{}} E[Y_{iu}Y_{it}]&=&E_{\gamma_{i}}\left[E(Y_{iu}|\gamma_{i})E(Y_{it}|\gamma_{i})\right] =\tilde{\lambda}_{iut}(\boldsymbol{x}_{iu},\boldsymbol{x}_{it},z^{*}_{iu},z^{*}_{it},\boldsymbol{\beta}, \tilde{\sigma}^{2}_{\gamma}) \\ &=& \int \left[\left\{\frac{\exp(\boldsymbol{x}^{\prime}_{iu}\boldsymbol{\beta}+z^{*}_{iu}\gamma_{i})} {[1+\exp(\boldsymbol{x}^{\prime}_{iu}\boldsymbol{\beta}+z^{*}_{iu}\gamma_{i})]}\right\} \left\{\frac{\exp(\boldsymbol{x}^{\prime}_{it}\boldsymbol{\beta}+z^{*}_{it}\gamma_{i})} {[1+\exp(\boldsymbol{x}^{\prime}_{it}\boldsymbol{\beta}+z^{*}_{it}\gamma_{i})]}\right\}\right] dG_{N}(\gamma_{i},\tilde{\sigma}^{2}_{\gamma}), \end{array} $$
(91)

yielding the lag (t − u) correlation as

$$ \begin{array}{@{}rcl@{}} \text{corr}[Y_{iu},Y_{it}]&=&\frac{ \tilde{\lambda}_{iut}(\boldsymbol{x}_{iu},\boldsymbol{x}_{it},z^{*}_{iu},z^{*}_{it},\boldsymbol{\beta}, \tilde{\sigma}^{2}_{\gamma})}{[\tilde{\mu}_{iu}(\cdot)(1-\tilde{\mu}_{iu}(\cdot)) \tilde{\mu}_{it}(\cdot)(1-\tilde{\mu}_{it}(\cdot))]^{\frac{1}{2}}} \\ & - &\frac{[\tilde{\mu}_{iu}(\cdot) \tilde{\mu}_{it}(\cdot)]^{\frac{1}{2}}} {[(1-\tilde{\mu}_{iu}(\cdot)) (1-\tilde{\mu}_{it}(\cdot))]^{\frac{1}{2}}}, \end{array} $$
(92)

which contains time varying covariates but provides equi-correlations for any (u,t) when these covariates are the same over time. Thus, it does not show any decay in the correlations when the lag |t − u| increases.
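The equi-correlation property can be verified numerically by evaluating the integrals in Eqs. 90 and 91 with Gauss-Hermite quadrature and then forming the correlation in Eq. 92. The following Python sketch assumes a shared normal random intercept with unit covariate (so the term involving γi is simply γi) and uses illustrative parameter values; the function names are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def marginal_moments(eta, sigma2, n_nodes=40):
    """Return mu_t = E[Y_it] and lambda_ut = E[Y_iu Y_it] (u != t).

    eta : (T,) array of fixed-effects linear predictors x_it' beta
    """
    # hermegauss integrates f(x) exp(-x^2 / 2); rescale to a N(0,1) expectation.
    nodes, weights = hermegauss(n_nodes)
    weights = weights / np.sqrt(2.0 * np.pi)
    gam = np.sqrt(sigma2) * nodes                      # gamma_i at the nodes
    p = 1.0 / (1.0 + np.exp(-(eta[:, None] + gam[None, :])))   # shape (T, K)
    mu = p @ weights                                   # Eq. 90
    lam = (p * weights) @ p.T                          # Eq. 91, off-diagonal entries
    return mu, lam

def corr_matrix(eta, sigma2):
    mu, lam = marginal_moments(eta, sigma2)
    sd = np.sqrt(mu * (1.0 - mu))
    R = (lam - np.outer(mu, mu)) / np.outer(sd, sd)    # algebraically equal to Eq. 92
    np.fill_diagonal(R, 1.0)   # the formula applies to u != t only
    return R

eta = np.full(5, 0.2)          # covariates held constant over T = 5 times
R = corr_matrix(eta, sigma2=1.0)
off_diag = R[~np.eye(5, dtype=bool)]
```

With constant covariates every off-diagonal entry of `R` is the same, whatever the lag, in contrast with the decaying pattern of the dynamic models.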

Some authors in the past, such as Stiratelli et al. (1984, Eqns. (2.2), (3.1)–(3.2)), recognized that serial correlations play an important role for longitudinal binary data. To reflect such correlations, as opposed to Eq. 89, they used a logit link random effects model that is similar to, but different from, Eq. 86. More specifically,

$$ \begin{array}{@{}rcl@{}} &&\text{logit}(\tilde{p}_{it}(\boldsymbol{\beta},{\gamma}_{it}))\equiv \ell(\tilde{p}_{it})=\boldsymbol{x}^{\prime}_{it}(y_{i,t-1},\ldots,y_{i1})\boldsymbol{\beta}+\tilde{z}_{it}{\gamma}_{it}, \end{array} $$
(93)

where the covariate vector xit includes the past binary responses as given covariates, and \(\tilde {\boldsymbol {\gamma }}_{i}=(\gamma _{i1},\ldots ,\gamma _{it},\ldots ,\gamma _{iT})^{\prime }\) denotes the time varying random effects of the i-th individual over the time period T. As far as the distribution of \(\tilde {\boldsymbol {\gamma }}_{i}\) is concerned, the authors have considered

$$ \begin{array}{@{}rcl@{}} \tilde{\boldsymbol{\gamma}}_{i} \sim N(0,\tilde{D}), \end{array} $$
(94)

similar to Eq. 87, and estimated β and \(\tilde {D}: T \times T,\) using the so-called empirical Bayes estimation approach. Notice that the dimension of β in Eqs. 93 and 89 is different. This is because β in Eq. 93 also contains the regression effects/parameters of the past binary responses.

As compared to the logit link model (93), Chib and Jeliazkov (2006, Eqns. (1),(6)) have used a more general semi-parametric dynamic mixed model for longitudinal binary data, constructed based on a latent linear semi-parametric dynamic mixed model. This allows one either to use logit or probit links. More specifically, suppose that \(y^{*}_{it}\) is an unobservable continuous variable satisfying a linear semi-parametric dynamic mixed model, as

$$ \begin{array}{@{}rcl@{}} g^{*}_{it}=E[Y^{*}_{it}|\cdot] &=& \boldsymbol{x}^{\prime}_{it}\boldsymbol{\beta}+\tilde{z}_{it}\gamma_{it}+\phi_{1} y_{i,t-1}+\ldots+\phi_{m} y_{i,t-m}\\&&+g(s_{it})+\epsilon_{it}, \end{array} $$
(95)

where yi,t−j is the binary response that occurred in the past at time (t − j), with regression effect ϕj (j = 1,…,m), g(sit) is a smooth non-parametric function of the covariate sit, and 𝜖it is the model error. Next, suppose that the binary response yit is determined by the relationship

$$ \begin{array}{@{}rcl@{}} y_{it}&=&\left\{\begin{array}{ll} 1 & \text{if } y^{*}_{it} > 0 \\ 0 & \text{otherwise.} \end{array} \right. \end{array} $$
(96)

Note that if \(y^{*}_{it}\) follows a logistic (L) distribution (e.g., Johnson and Kotz (1970)) with mean \(g^{*}_{it}\) as in Eq. 95, and variance \(\frac {\pi ^{2}}{3},\) then by using the condition in Eq. 96, one can compute the binary probability as

$$ \begin{array}{@{}rcl@{}} Pr[Y_{it}=1|g^{*}_{it}]&=& {\int}^{g^{*}_{it}}_{-\infty}f_{L}(y^{*}_{it})dy^{*}_{it}=\frac{\exp(g^{*}_{it})}{1+ \exp(g^{*}_{it})}=\pi^{**}_{it}(\cdot), \end{array} $$
(97)

which has the same form as in Eq. 4, with a difference in the formula for \(g^{*}_{it}.\) Specifically, in Eq. 4 \(g^{*}_{it}\) has the dynamic form, whereas \(g^{*}_{it}\) in Eq. 95, considered by Chib and Jeliazkov (2006), has the dynamic mixed model form, which is a logit link function \((\text {logit}(\pi ^{**}_{it})=g^{*}_{it})\) for clustered longitudinal data (see Sutradhar (2011, Chapter 11) for similar familial/cluster longitudinal binary data). For linear data, \(g^{*}_{it}\) itself is the linear link function, which has been studied by some authors such as Das et al. (2013) in a Bayesian setup. For more discussion on binary dynamic mixed models similar to that of Chib and Jeliazkov (2006), one may refer to Sutradhar et al. (2010), for example, in a parametric setup, and Congdon (2014, Section 7.1.1, p. 287), among others, in a Bayesian framework.
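The passage from the latent variable in Eq. 95 to the logistic probability in Eq. 97, through the threshold rule (96), can be checked by simulation. In this sketch the value of \(g^{*}_{it}\) is an arbitrary illustrative constant, and the latent response is generated as \(g^{*}_{it}\) plus a standard logistic error, whose variance is \(\frac {\pi ^{2}}{3}.\)

```python
import numpy as np

rng = np.random.default_rng(1)
g = 0.7                                    # illustrative value of g*_it
n = 200_000

# Standard logistic error: mean 0, variance pi^2 / 3.
eps = rng.logistic(loc=0.0, scale=1.0, size=n)
y_star = g + eps                           # latent continuous response
y = (y_star > 0).astype(int)               # threshold rule (96)

empirical = y.mean()                       # simulated Pr[Y_it = 1 | g*_it]
theoretical = np.exp(g) / (1.0 + np.exp(g))  # right-hand side of Eq. 97
print(round(empirical, 3), round(theoretical, 3))
```

Replacing the logistic error by a standard normal error would give the probit link instead, which is why the latent formulation accommodates either link.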

6 Concluding Remarks

This review paper clarifies at least two main misconceptions around the analysis of correlated binary data in both cross-sectional cluster and longitudinal cluster data setups. First, many authors over the past forty years used random effects models to model the correlations for longitudinal binary data. This approach is either misleading or too restrictive. This is because, similar to time series modeling, the longitudinal correlations are best modeled through suitable dynamic models relating repeated responses from the same individual. More precisely, a common individual random effect among longitudinal responses is unable to address the time effects on the binary responses; rather, it generates an equi-correlation type structure among the repeated responses, which is too restrictive.

Second, in both cross-sectional and longitudinal cluster setups, many studies have pre-specified the marginal means as a function of the regression parameters only, which may lead to inconsistent regression estimates when a mixed effects model is the true model for the marginal means. To clarify this issue in the cross-sectional cluster setup, we have considered three different important situations where fixed effects based marginal means may or may not be appropriate. (A) In the first approach, random cluster effects are assumed to follow a normal distribution, and a likelihood function is constructed by averaging (referred to as population average (PA)) the conditional likelihood function (which is a product of independent binary distributions conditional on the cluster effects) over the normal random effects; the likelihood estimates of the regression and cluster variance parameters are then obtained and interpreted. Under this approach, the binary response means were shown to have a marginal mixed effects (MME) model. Thus any fixed effects based marginal mean specification in such cases is bound to produce inconsistent regression estimates, which is a serious inference issue. (B) In the second approach, certain suitable distributions for the random cluster effects were technically developed so that they provide a marginal fixed effects (MFE) model for the binary means involving only the regression parameters (referred to as the subject specific (SS) regression effects), which may be estimated and interpreted by using the likelihood estimates computed in the same way as in (A). However, these distributional assumptions, such as the so-called “bridge” or beta-binary distributions for the random effects or their functions, are too narrow or restrictive for practical use. (C) In the third approach, no assumption is made about the distribution of the random cluster effects; instead, an arbitrary MFE (AMFE) model was used for the means involving only regression parameters. Also, in this approach no attempts were made to develop any correlation structure or likelihood function; instead, a ‘working’ correlation structure based GEE (generalized estimating equations) approach was used for the estimation of the SS regression parameters. This approach is misleading because under (A) one never gets a fixed mean model, and under (B) only a limited number of technically restrictive assumptions for the distribution of the random effects may lead to a marginal fixed effects based mean model. In summary, because the normal random effects based cluster model (A) is quite practical, we have given details for the estimation of the mean model (involving both β and \(\sigma ^{2}_{\gamma }\)) using the so-called GQL approach. Because the asymptotic properties of such estimators were not available, their consistency and asymptotic normality are discussed in detail.

The paper has also studied the longitudinal clustered models for binary data where repeated responses from an individual are collected over a short period. Longitudinal correlations arise due to a dynamic relationship among the present and past binary responses, and they are different from clustered correlations. Similar to the cluster setup, it is shown that in many situations an MFE model cannot be used to study the regression effects. For example, an alternative MD/MR (marginal dynamic/recursive) model does not produce a fixed effects based mean model. The existing GEE approach is not useful in such a situation because the recursive means contain both regression and correlation parameters, whereas GEE is based on fixed effects based marginal means.

Furthermore, there also exist some studies dealing with clustered and/or panel/longitudinal binary data in a Bayesian setup. These studies are based on generalized linear mixed models with certain suitable link functions to reflect the correlations of the clustered and/or longitudinal data. However, the inferences are not based on any correlation structures; rather, they exploit the conditional likelihood using Monte Carlo techniques. We have highlighted some of these important studies in this paper.