1 Introduction

In a regression setup, the responses, whether linear (continuous), count, or binary, are generated as a function of certain suitable covariates. If the bulk of the responses appear to be close to the mean function of the responses, with a few remaining responses appearing at a significant distance from the mean function, then these latter few responses are considered to be potential outliers. In general, these outliers occur because the corresponding covariates are contaminated in some way; such outliers are referred to as mean shifted outliers. In some situations, a response may be considered an outlier because of its inflated variance as compared to the bulk of the responses. The main interest lies in understanding the regression model appropriate for the bulk of the good responses, but the use of a few outlying responses may distort the inference for that bulk. There are at least two ways this inference problem has been tackled in the literature.

First, one may attempt to detect the outliers and exclude them from the overall inference. For some justification of this approach, one may be referred to Hampel et al. (1986, Sect. 1.4), among others. For this purpose, many researchers have discussed the so-called maximum studentized residual (MSR) and maximum normed residual (MNR) tests for the detection of outliers in a linear regression setup for independent data. For example, one may refer to the work of Srikantan (1961), Stefansky (1971, 1972), Tietjen et al. (1973), Prescott (1975), Lund (1975), Bailey (1977), Johnson and Prescott (1975), Ellenberg (1973, 1976), Cook and Prescott (1981), Doornbos (1981), and Beckman and Cook (1983, Sect. 4), among others. The powers of these two statistics in detecting outliers may also be affected by the way the parameters of the regression model are estimated; for a discussion, see, for example, the relatively recent work by Sutradhar et al. (2007). In the second approach, a robust weighted distance function is constructed such that the suspected outliers receive smaller weights, and this distance function is then minimized for the estimation of the regression effects. Some of the existing widely used robust procedures are minimax estimation, M-estimation, L-estimation, and R-estimation. For details on these procedures, see, for example, Hampel et al. (1986), Rousseeuw and Leroy (1987), and Huber (2004), and the references therein.

In the independent setup, some authors, such as Cantoni and Ronchetti (2001), among others, have suggested a Mallows type quasi-likelihood (MQL) robust estimation approach to obtain a consistent estimate of the regression effects involved in the model. In their MQL construction, they use Huber's robust function but do not use the inverse of the variance of this function, which would make the MQL fully standardized. Recently, Bari and Sutradhar (2010a) have improved this estimating equation and introduced a fully standardized MQL (FSMQL) estimating equation that provides regression estimates with smaller bias. In this paper, we review these MQL and FSMQL approaches for the estimation of the regression effects involved in generalized linear models (GLMs), for example, for binary and count data.

There have also been some studies using QL or generalized estimating equations (GEE) approaches for robust regression estimation in the longitudinal setup. For example, Preisser and Qaqish (1999) have used a resistant GEE (REGEE) approach, which was improved by Cantoni (2004) (see also Sinha 2006 for a random effects approach) by using a semi-standardized MQL (SSMQL; see also Bari and Sutradhar 2010b) approach. In the second part of the paper, we review these approaches, including the robust GQL (RGQL) approach discussed by Bari and Sutradhar (2010b), and point out their advantages and drawbacks. Both count and binary longitudinal models are considered.

2 Robust Inference in Regression Models in Independent Setup

2.1 Inference for Linear Models

There exists a vast literature on robust inference in linear models for independent data in the presence of one or more outliers. See, for example, Rousseeuw and Leroy (1987), Huber (2004, Chap. 7), and a relatively recent paper by Sutradhar et al. (2007). These studies mainly deal with outliers in normal responses. For simplicity, consider the linear regression model

$$\displaystyle{y = X\beta +\epsilon,}$$
(1)

where \(y = {(y_{1},\ldots,y_{i},\ldots,y_{K})}^{\prime}\) is a K ×1 response vector, X is a known design matrix of order K ×p, β is a p ×1 vector of unknown parameters, and ε is a K ×1 error vector distributed as \(\epsilon \sim N(0,\sigma^{2}I_{K})\), I K being the K ×K identity matrix. Usually, each observation in a realization (y, X) contributes to the evaluation of the regression coefficient β. The contribution of one observation, however, may be discordant to the point of substantially determining the value of a regression parameter. Such an observation is said to be an outlier. To see how an outlier can perturb the linear model (1), two types of outliers are generally considered. They are (a) mean shifted outliers, also referred to as additive outliers, and (b) variance inflated outliers, also referred to as innovative or multiplicative outliers.

To construct an additive outlier model, one can perturb the linear model (1) and write

$$\displaystyle{y = X\beta +\tilde{\epsilon},}$$
(2)

where \(\tilde{\epsilon}= {(\tilde{\epsilon}_{1},\ldots,\tilde{\epsilon}_{i},\ldots,\tilde{\epsilon}_{K})}^{\prime}\) is related to ε in (1) as

$$\displaystyle\begin{array}{rcl} \tilde{\epsilon}_{j} = \left \{\begin{array}{ll} \epsilon _{j} +\delta _{1},&\mbox{for}\;j = i \\ \epsilon _{j}, &\mbox{for}\;j\neq i,\\ \end{array} \right.& &{}\end{array}$$
(3)

where for \(\vert \delta _{1}\vert> 0\), \(y_{i} = x^{\prime}_{i}\beta +\tilde{\epsilon} _{i}\) is certainly a discordant observation when compared to the other K − 1 observations. It is clear from (1) and (3) that

$$\displaystyle{\tilde{\epsilon} \sim N(\delta,\sigma^{2}I_{K}),}$$

where

$$\displaystyle{\delta = [0 \cdot 1^{\prime}_{i-1},\;\delta _{1},\;0 \cdot 1^{\prime}_{K-i}]^{\prime}.}$$

To construct a variance inflated outlier model, one can perturb the model (1) as

$$\displaystyle{y = X\beta +\epsilon^{\ast},}$$
(4)

where \({\epsilon}^{{\ast}} = {(\epsilon _{1}^{{\ast}},\ldots,\epsilon _{i}^{{\ast}},\ldots,\epsilon _{K}^{{\ast}})}^{\prime}\) is related to ε in (1) as

$$\displaystyle{\epsilon _{j}^{{\ast}} = \left \{\begin{array}{ll} \epsilon _{j}/\sqrt{\omega},&\mbox{for}\;j = i \\ \epsilon _{j}, &\mbox{for}\;j\neq i,\\ \end{array} \right.}$$
(5)

where for ω → 0, the ith observation y i has an inflated variance, making this observation an outlier. It is clear from (1) and (5) that

$$\displaystyle{\epsilon^{\ast} \sim N(0,\,V _{\omega} = \sigma^{2}\mbox{diag}[1^{\prime}_{i-1},1/\omega,1^{\prime}_{K-i}]).}$$

Thus, under model (2), the bulk (K − 1) of the error variables follow the N(0, σ 2) distribution and one follows N(δ 1, σ 2). This is equivalent to saying that the \(\tilde{\epsilon}_{i}\) in model (2) are independent and identically distributed with the common underlying distribution

$$\displaystyle{F(\tilde{\epsilon}) = (1 - \frac{1} {K})\Phi \left (\frac{\tilde{\epsilon}-0} {\sigma} \right ) + \frac{1} {K}\Phi \left (\frac{\tilde{\epsilon}-\delta _{1}} {\sigma} \right ),}$$

(Huber 2004, Example 1.1), where Φ( ⋅) is the standard normal cumulative distribution function. Similarly, one may say that the ε i  ∗  under model (4) are independent and identically distributed with the common underlying distribution

$$\displaystyle{F({\epsilon}^{{\ast}}) = (1 - \frac{1} {K})\Phi \left (\frac{{\epsilon}^{{\ast}}- 0} {\sigma} \right ) + \frac{1} {K}\Phi \left (\frac{{\epsilon}^{{\ast}}- 0} {\sigma /\sqrt{\omega}} \right ).}$$

2.1.1 Robust Estimation of Regression Effects

It is understandable that the ordinary least squares (LS) estimator

$$\displaystyle{\hat{\beta}_{LS} = {[X^{\prime}X]}^{-1}X^{\prime}y}$$
(6)

is biased for β under model (2)–(3), and unbiased but inefficient under model (4)–(5). There exist various robust approaches for the consistent estimation of β irrespective of whether the underlying model is (2)–(3) or (4)–(5). Here we briefly describe two of these approaches.

2.1.1.1 Huber’s Robust Weights Based Iterative Re-weighted Least Squares Approach

This estimate is obtained via an iterative re-weighted least squares (RWLS) method (Street et al. 1988). For the p components of β, in this approach one solves the robust weights based estimating equation

$$\displaystyle{\sum _{j=1}^{K}\xi _{j}x_{ju}(y_{j} - x_{j}^{\prime}\beta ) = 0,\ u = 1,\ldots,p,}$$
(7)

where x ju is the uth component of the x j vector, and

$$\displaystyle{\xi _{j} = \frac{\psi (r_{j})} {r_{j}},}$$
(8)

with ψ(r j ) as Huber’s bounded function of r j , given by

$$\displaystyle{\psi (z) = \mbox{max}\left [-a,\mbox{min}(z,a)\right ],\;\mbox{with}\;a = 1.25,}$$

where \(r_{j} = (y_{j} - x_{j}^{\prime}\beta _{r(0)}^{{\ast}})/\tilde{s}\) for j = 1, …, K, with \(\beta _{r(0)}^{{\ast}}\) as an initial robust estimate of β, which may be obtained by minimizing the L 1 distance \(\sum _{j=1}^{K}\vert y_{j} - x_{j}^{\prime}\beta \vert\), and \(\tilde{s}\) as a robust estimate of σ given by

$$\displaystyle{\tilde{s} = \mbox{Median}\Big\{\mbox{largest } K - p + 1\ \mbox{of the}\ \frac{\vert y_{j} - x_{j}^{\prime}\beta _{r(0)}^{{\ast}}\vert} {0.6745} \Big\}.}$$

Note that if r j  = 0, one uses ξ j  = 1. The solution to (7) may then be obtained as

$$\displaystyle{\beta _{r(1)}^{{\ast}} = {({X}^{\prime}\Omega X)}^{-1}{X}^{\prime}\Omega y,}$$
(9)

where \(\Omega = \mbox{diag}[\xi _{1},\ldots,\xi _{K}]\). This \(\beta _{r(1)}^{{\ast}}\) replaces \(\beta _{r(0)}^{{\ast}}\) and provides us with a new start and new weights for an improved estimate of β to be obtained by (9). This cycle of iterations continues until convergence. Let the final solution be denoted by \(\hat{\beta}_{r(1)}\).
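To fix ideas, the following is a minimal Python sketch of the RWLS cycle (7)–(9) with Huber weights; the function names and the use of numpy are our own illustration, and the initial L 1 (least absolute deviations) estimate is assumed to be supplied by the user.

```python
import numpy as np

def huber_psi(z, a=1.25):
    # Huber's bounded function: psi(z) = max[-a, min(z, a)]
    return np.clip(z, -a, a)

def robust_scale(resid, p):
    # s-tilde: median of the largest K - p + 1 values of |residual| / 0.6745
    K = len(resid)
    largest = np.sort(np.abs(resid) / 0.6745)[-(K - p + 1):]
    return np.median(largest)

def huber_rwls(X, y, beta_init, tol=1e-8, max_iter=100):
    # Iterative re-weighted least squares (7)-(9); beta_init is an initial
    # robust estimate, e.g., obtained by minimizing the L1 distance.
    K, p = X.shape
    beta = np.asarray(beta_init, dtype=float)
    for _ in range(max_iter):
        resid = y - X @ beta
        r = resid / robust_scale(resid, p)
        xi = np.ones(K)                      # xi_j = 1 when r_j = 0
        nz = r != 0.0
        xi[nz] = huber_psi(r[nz]) / r[nz]    # xi_j = psi(r_j) / r_j
        Omega = np.diag(xi)
        beta_new = np.linalg.solve(X.T @ Omega @ X, X.T @ Omega @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```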

2.1.1.2 An Alternative Weights Based Iterative RWLS Approach

Rousseeuw and Leroy (1987, Chap. 5) suggest a least median of squares (LMS) approach in which the scale parameter used to compute the residuals is estimated using robust weights different from the Huber weights used in the last section. In fact, one can use the iterative least squares approach discussed in the last section by replacing Huber's weights with the new weights suggested by Rousseeuw and Leroy (1987, p. 202). See, for example, Sutradhar et al. (2007) for a comparison between RWLS approaches using Huber's and Rousseeuw and Leroy's weights. To be specific, the Rousseeuw and Leroy robust weights are defined as

$$\displaystyle{\tilde{w}_{j} = \left \{\begin{array}{ll} 1,&\mbox{if}\ \vert d_{j(\beta _{r(0)}^{{\ast}})}/\tilde{s}_{0}\vert \leq 2.5 \\ 0,&\mbox{otherwise},\\ \end{array} \right.}$$
(10)

where \(d_{j(\beta _{r(0)}^{{\ast}})} = y_{j} - x_{j}^{\prime}\beta _{r(0)}^{{\ast}}\) and \(\tilde{s}_{0}\) is given by

$$\displaystyle{\tilde{s}_{0} = 1.4826(1 + 5/(K - p))\sqrt{\mbox{Median} \ d_{j(\beta _{r(0)}^{{\ast}})}^{2}}.}$$

These robust weights in (10) are then used to compute an \(\tilde{\Omega}\) matrix as

$$\displaystyle{\tilde{\Omega} = diag[\tilde{w}_{1},\ldots,\tilde{w}_{j},\ldots,\tilde{w}_{K}],}$$

which is then used to obtain a first step improved robust estimate for β as

$$\displaystyle{\beta _{r(1)}^{{\ast}{\ast}} = {({X}^{\prime}\tilde{\Omega}X)}^{-1}{X}^{\prime}\tilde{\Omega}y.}$$
(11)

The cycle of iterations continues until convergence. Let this final RWLS estimate be denoted by \(\hat{\beta}_{r(2)}\).
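A corresponding sketch for this alternative scheme, under the same illustrative conventions as above: the 0/1 weights (10) are recomputed from the current estimate, and the weighted LS step (11) is repeated until the estimate stabilizes.

```python
import numpy as np

def rl_weights(X, y, beta_cur):
    # Rousseeuw-Leroy 0/1 weights (10), based on the robust scale s0-tilde
    K, p = X.shape
    d = y - X @ beta_cur
    s0 = 1.4826 * (1.0 + 5.0 / (K - p)) * np.sqrt(np.median(d ** 2))
    return (np.abs(d / s0) <= 2.5).astype(float)

def rl_rwls_step(X, y, beta_cur):
    # One improvement step (11): weighted LS with the current 0/1 weights
    Omega = np.diag(rl_weights(X, y, beta_cur))
    return np.linalg.solve(X.T @ Omega @ X, X.T @ Omega @ y)
```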

2.1.2 Robust Estimation of Variance Component

Note that in the linear model setup, the LS estimate of σ 2 is obtained by computing the residual sum of squares based on the least squares estimate of β. That is, \(\hat{\sigma}_{ls}^{2} =\sum _{j=1}^{K}{(y_{j} - x_{j}^{\prime}\hat{\beta}_{ls})}^{2}/(K - p)\). Under the linear model in the presence of outliers, one may obtain the LS estimate of σ 2 simply by replacing \(\hat{\beta}_{ls}\) with \(\hat{\beta}_{r(1)}\) or \(\hat{\beta}_{r(2)}\) obtained in the last section. Thus the LS estimator for σ 2 has the formula

$$\displaystyle{\tilde{\sigma}_{ls(1)}^{2} =\sum _{j=1}^{K}{(y_{j} - x_{j}^{\prime}\hat{\beta}_{r(1)})}^{2}/(K - p),}$$
(12)

or

$$\displaystyle{\tilde{\sigma}_{ls(2)}^{2} =\sum _{j=1}^{K}{(y_{j} - x_{j}^{\prime}\hat{\beta}_{r(2)})}^{2}/(K - p).}$$
(13)
2.1.2.1 Huber’s Robust Weights Based Iterative RWLS Estimator for σ 2

Following Street et al. (1988), one obtains this estimator as

$$\displaystyle{\hat{\sigma}_{r(1)}^{2} =\sum _{j=1}^{K}\xi _{j}{\Big[y_{j} - x_{j}^{\prime}\hat{\beta}_{ls}\Big]}^{2}\Big/\Big(\sum _{j=1}^{K}\xi _{j} - \mbox{tr}\{({X}^{\prime}{\Omega}^{2}X){({X}^{\prime}\Omega X)}^{-1}\}\Big),}$$
(14)

where ξ j (j = 1, …, K) is the jth robust weight to protect the estimate against possible outliers, and \(\Omega = diag(\xi _{1},\ldots,\xi _{j},\ldots,\xi _{K})\). To be specific, ξ j is defined as \(\xi _{j} =\psi (r_{j})/r_{j}\) with \(r_{j} = (y_{j} - x_{j}^{\prime}\hat{\beta}_{ls})/{s}^{{\ast}}\), where

$$\displaystyle{{s}^{{\ast}} = \mbox{Median}\Big\{\mbox{largest } K - p + 1\ \mbox{of the}\ \frac{\vert y_{j} - x_{j}^{\prime}\hat{\beta}_{ls}\vert} {0.6745} \Big\}.}$$

Note that the ψ function involved in ξ j in (14) is the same Huber robust function as used in (8).
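As an illustration, the estimator (14) may be coded as follows; this is a sketch under the same conventions as the earlier snippets, with the LS estimate of β assumed to be available.

```python
import numpy as np

def huber_sigma2(X, y, beta_ls, a=1.25):
    # Robust variance estimator (14) with Huber weights xi_j = psi(r_j) / r_j
    K, p = X.shape
    resid = y - X @ beta_ls
    s_star = np.median(np.sort(np.abs(resid) / 0.6745)[-(K - p + 1):])
    r = resid / s_star
    xi = np.ones(K)
    nz = r != 0.0
    xi[nz] = np.clip(r[nz], -a, a) / r[nz]
    Omega = np.diag(xi)
    # denominator: sum(xi) - tr{(X' Omega^2 X)(X' Omega X)^{-1}}
    correction = np.trace(X.T @ Omega @ Omega @ X @ np.linalg.inv(X.T @ Omega @ X))
    return np.sum(xi * resid ** 2) / (xi.sum() - correction)
```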

2.1.2.2 Rousseeuw and Leroy Weights Based Robust Estimator for σ 2

This robust estimator is computed following Rousseeuw and Leroy (1987, p. 202, Eq. (1.5)). More specifically, in this approach, robust weights are defined as

$$\displaystyle{w_{j} = \left \{\begin{array}{ll} 1,&\mbox{if}\ \vert d_{j(\hat{\beta}_{ls})}/s_{0}\vert \leq 2.5 \\ 0,&\mbox{otherwise},\\ \end{array} \right.}$$

where \(d_{j(\hat{\beta}_{ls})} = y_{j} - x_{j}^{\prime}\hat{\beta}_{ls}\) and s 0 is given by

$$\displaystyle{s_{0} = 1.4826(1 + 5/(K - p))\sqrt{\mbox{Median} \ d_{j(\hat{\beta}_{ls} )}^{2}}.}$$

Next, these weights are exploited to compute the estimator, say \(\hat{\sigma}_{r(2)}^{2}\), as

$$\displaystyle{\hat{\sigma}_{r(2)}^{2} =\Big (\sum _{j=1}^{K}w_{j}d_{j(\hat{\beta}_{ls})}^{2}\Big)\Big/\Big(\sum _{j=1}^{K}w_{j} - p\Big).}$$
(15)
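A corresponding sketch of (15); the scale s 0 and the 0/1 weights are recomputed here from the LS estimate so that the snippet stays self-contained.

```python
import numpy as np

def rl_sigma2(X, y, beta_ls):
    # Robust variance estimator (15) with Rousseeuw-Leroy 0/1 weights
    K, p = X.shape
    d = y - X @ beta_ls
    s0 = 1.4826 * (1.0 + 5.0 / (K - p)) * np.sqrt(np.median(d ** 2))
    w = (np.abs(d / s0) <= 2.5).astype(float)
    return np.sum(w * d ** 2) / (w.sum() - p)
```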

2.1.3 Finite Sample Performance of the Robust Estimators: An Illustration

Sutradhar et al. (2007) conducted a simulation study to examine the performance of the robust methods as compared to the LS method in estimating the parameters of a linear model when the data contain a few variance inflated outliers. Here, we refer to some of the results of this study. Consider a linear model with p = 2 covariates, so that \(\beta = {(\beta _{1},\beta _{2})}^{\prime}\). For the associated K ×2 design matrix X, consider their design configuration:

$$\displaystyle{D_{2}:\; x_{1} = 1,\;x_{2} = 0,\;\mbox{all other }x\mbox{'s at 0.5.}}$$

With regard to the sample size, consider \(K(\equiv n) = 6,8,10,\ \mbox{and}\ 20\) to examine the effect of small as well as moderately large samples on the estimation. Furthermore, select two locations for the possible outlier, namely locations at i = 2 and 3 for K = 6; i = 2 and 4 for K = 8; i = 2 and 6 for K = 10; and i = 2 and 11 for K = 20. Also, without any loss of generality, choose σ 2 = 1, β 1 = 1, and β 2 = 0.5. For variance inflation, eight values of ω i , namely ω i  = 0.001, 0.005, 0.01, 0.05, 0.10, 0.25, 0.50, and 1.0, were considered. Note that ω i  = 1.0 represents the case where the data do not contain any outliers, whereas a small value of ω i indicates that y i is generated with a large variance, implying that y i can be an influential outlier. The data were simulated 10,000 times. Under each simulation, the LS estimates of β and σ 2 were obtained, denoted by \(\hat{\beta}_{ls} = {(\hat{\beta}_{ls,1},\hat{\beta}_{ls,2})}^{\prime}\) and \(\hat{\sigma}_{ls}^{2}\), respectively. As far as the robust estimation of β and σ 2 is concerned, these parameters were estimated by using the two robust approaches described above. More specifically, \(\hat{\beta}_{r(1)} = {(\hat{\beta}_{r(1),1},\hat{\beta}_{r(1),2})}^{\prime}\) is obtained by using (9), \(\hat{\beta}_{r(2)} = {(\hat{\beta}_{r(2),1},\hat{\beta}_{r(2),2})}^{\prime}\) is obtained by using (11), and similarly \(\tilde{\sigma}_{r(1)}^{2}\) and \(\tilde{\sigma}_{r(2)}^{2}\) are obtained from (14) and (15), respectively. The mean squared errors (MSEs) of these estimators based on 10,000 simulations are displayed in Figs. 1–3, for the estimates of β 1, β 2, and σ 2, respectively.

Fig. 1

Mean squared error (MSE) of \(\hat{\beta}_{ls,1}\) (LS estimator of β 1), \(\hat{\beta}_{r(1),1}\) (first robust estimator of β 1), and \(\hat{\beta}_{r(2),1}\) (second robust estimator of β 1)

Fig. 2

Mean squared error (MSE) of \(\hat{\beta}_{ls,2}\) (LS estimator of β 2), \(\hat{\beta}_{r(1),2}\) (first robust estimator of β 2), and \(\hat{\beta}_{r(2),2}\) (second robust estimator of β 2)

Fig. 3

Mean squared error (MSE) of \(\hat{\sigma}_{ls}^{2}\) (LS estimator of σ 2), \(\tilde{\sigma}_{r(1)}^{2}\) (first robust estimator of σ 2), and \(\tilde{\sigma}_{r(2)}^{2}\) (second robust estimator of σ 2)

In summary, the results of this simulation study indicate that in the presence of a variance inflated outlier, the second robust approach performs worse than the first robust and LS methods in estimating β 1 and β 2. In estimating σ 2, the LS method performs very poorly when compared with the robust methods.

2.2 Robust Estimation in GLM Setup For Independent Discrete Data

As opposed to linear models for normal or other continuous exponential family variables, robust inference for discrete data in the GLM setup, such as count and binary data, has not been adequately discussed in the literature. For i = 1, …, K, let y i be a discrete response, such as a count or binary response, collected from the ith individual, and \(x_{i} = (x_{i1},\ldots,x_{iu},\ldots,x_{ip})^{\prime}\) be the corresponding p-dimensional observed covariate vector. Note that when the data contain a single outlier, any of the K responses \(y_{1},\ldots,y_{i},\ldots,y_{K}\) can be that outlier. Now, in the spirit of the mean shifted linear outlier model (2)–(3), suppose that we consider y j , for some j between 1 and K, to be the outlier because the covariate vector for the jth individual, namely x j , is contaminated. Note that if \(\tilde{x}_{i} = {(\tilde{x}_{i1},\ldots,\tilde{x}_{iu},\ldots,\tilde{x}_{ip})}^{\prime}\) denotes the p-dimensional uncontaminated covariate vector corresponding to y i for all i = 1, …, K, then for a positive vector δ, the observed covariates {x i } may be related to the uncontaminated covariates \(\{\tilde{x}_{i}\}\) as

$$\displaystyle\begin{array}{rcl} x_{j}& =& \tilde{x}_{j}+\delta, \\ \mbox{but}\;x_{i}& =& \tilde{x}_{i},\;\mbox{for}\;i\neq j,\;i = 1,\ldots,K.{}\end{array}$$
(16)

It is of primary interest to estimate \(\beta = {(\beta _{1},\ldots,\beta _{u},\ldots,\beta _{p})}^{\prime}\), the effects of the uncontaminated covariates \(\tilde{x}_{i}\) on the responses y i . But, as not all the \(\tilde{x}_{i}\)’s are observed, one cannot use them to estimate β; instead, the observed contaminated x i ’s are used, which causes bias and hence inconsistency in the estimators.

2.2.1 Understanding Outliers in Count and Binary Data

2.2.1.1 K Count Observations with a Single Outlier

First assume that in the absence of outliers, \(y_{1},\ldots,y_{i},\ldots,y_{K}\) are generated following the Poisson density \(P(Y _{i} = y_{i}) = [\exp (-\mu _{i})\mu _{i}^{y_{i}}]/y_{i}!\), where \(\mu _{i} =\exp (\tilde{x}_{i}^{\prime}\beta )\) with \(\tilde{x}_{i} = {(\tilde{x}_{i1},\tilde{x}_{i2})}^{\prime}\). Suppose that the values of these two covariates arise from

$$\displaystyle{\tilde{x}_{i1}\stackrel{iid} \sim N(0.5,0.25)\;\mbox{and}\;\tilde{x}_{i2}\stackrel{iid} \sim N(0.5,0.5),}$$

respectively, for all i = 1, , K. Suppose that j is the index for the outlying observation that takes a value between 1 and K.

Now, to consider y j as an outlying value, that is, to have a data set of size K with one outlier, one may then shift the values of \(\tilde{x}_{j1}\) and \(\tilde{x}_{j2}\) as

$$\displaystyle{x_{j1} =\tilde{x}_{j1} +\delta \; \mbox{and}\;x_{j2} =\tilde{x}_{j2}+\delta,\;\delta> 0,}$$

respectively, but retain \(x_{i1} =\tilde{x}_{i1}\;\mbox{and}\;x_{i2} =\tilde{x}_{i2}\), for all i ≠ j. As far as the shifting is concerned, suppose that δ = 2.0. Thus, y 1, …, y K refer to a sample of K count observations with y j as the single outlier.
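The following sketch generates such a sample. It is an illustration only: the second parameter of the normal covariate distributions above is treated as a variance, and β = (1.0, 0.5)′ is borrowed from the simulation design reported later.

```python
import numpy as np

rng = np.random.default_rng(2010)

def counts_with_one_outlier(K, beta=(1.0, 0.5), delta=2.0, j=0):
    # Generate K Poisson counts whose j-th covariate pair is shifted by delta
    x1 = rng.normal(0.5, np.sqrt(0.25), K)   # x_i1, treating 0.25 as a variance
    x2 = rng.normal(0.5, np.sqrt(0.5), K)    # x_i2, treating 0.5 as a variance
    x1[j] += delta                           # contaminate the j-th covariates
    x2[j] += delta
    mu = np.exp(beta[0] * x1 + beta[1] * x2) # mu_i = exp(x_i' beta)
    y = rng.poisson(mu)                      # y_j is the single outlier
    return y, np.column_stack([x1, x2])
```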

2.2.1.2 K Binary Observations with a Single Outlier

Note that the existing literature (Copas 1988, p. 226; Carroll and Pederson 1993; Sinha 2004) does not provide a clear definition of outliers in binary data. Remark that Cantoni and Ronchetti (2001) have suggested a practically useful MQL robust inference technique for independent data subject to outliers in the GLM setup. However, even though GLMs include count and binary models, their definitions of outliers are appropriate only for the Poisson and binomial cases, since the concordant counts (the bulk of the observations of similar nature) in the Poisson case and the concordant success numbers in the binomial case can be exploited in a similar way to recognize any possible outliers in the respective data sets. Thus, even though binary is a special case of the binomial setup, Cantoni and Ronchetti's (2001) robust inference development does not appear to be appropriate for binary data. In view of these difficulties with regard to robust inference in the binary case, Bari and Sutradhar (2010a) have provided a new definition of outliers for binary data. More specifically, they dealt with one and two sided outliers in binary data. For convenience, these definitions are summarized as follows.

One sided outlier:

For

$$\displaystyle{Pr[Y _{i} = 1] = E[Y _{i}] =\mu _{i} = \frac{exp(x_{i}^{\prime}\beta )} {1 + exp(x_{i}^{\prime}\beta )},}$$

and

$$\displaystyle{p_{sb} = max\{\mu _{i}\},\;p_{lb} = min\{\mu _{i}\},}$$

suppose that the bulk (K − 1) of the binary observations occur with small probabilities such that

$$\displaystyle{Pr[Y _{i} = 1] = \left \{\begin{array}{ll} \leq p_{sb}&\mbox{for}\;i\neq j,i = 1,\ldots,K, \\> p_{sb}&\mbox{for}\;i = j,\\ \end{array} \right.}$$
(17)

or, with large probabilities such that

$$\displaystyle{Pr[Y _{i} = 1] = \left \{\begin{array}{ll} \geq p_{lb}&\mbox{for}\;i\neq j,i = 1,\ldots,K, \\ <p_{lb}&\mbox{for}\;i = j,\\ \end{array} \right.}$$
(18)

Here the binary response y j , whether 1 or 0, is referred to as an upper sided outlier if it satisfies (17), and as a lower sided outlier if it satisfies (18), whereas the remaining K − 1 responses, denoted by y i for i ≠ j, constitute a group of “concordant” observations.

Two sided outlier:

It may happen in practice that probabilities for the bulk of the observations lie in the range \(p_{sb} \leq P(Y _{i} = 1) \leq p_{lb}\), leading to a situation where one may encounter a two sided outlier. To be specific, y j  = 0 or 1 will be an outlier if either P(Y j  = 1) > p lb or P(Y j  = 1) < p sb .

Generation of K binary observations with an outlier:

We now illustrate the generation of K binary observations including one outlier. For this purpose, one may first generate K binary responses \(y_{1},\ldots,y_{i},\ldots,y_{K}\) assuming that they do not contain any outliers. To be specific, generate these K “good” responses following the binary logistic model \(P(Y _{i} = 1) = [\exp (\tilde{x}_{i}^{\prime}\beta )]/[1 +\exp (\tilde{x}_{i}^{\prime}\beta )]\), with two covariates, so that \(\tilde{x}_{i} = {(\tilde{x}_{i1},\tilde{x}_{i2})}^{\prime}\) and \(\beta = {(\beta _{1},\beta _{2})}^{\prime}\). As far as the covariate values are concerned, similar to the Poisson case, consider the two covariates \(\tilde{x}_{i1}\) and \(\tilde{x}_{i2}\) as

$$\displaystyle{\tilde{x}_{i1}\stackrel{iid} \sim N(-1.0,0.25)\;\mbox{and}\;\tilde{x}_{i2}\stackrel{iid} \sim N(-1.0,0.5),}$$

respectively, for i = 1, , K.

Next, to create an outlier y j where j can take any value between 1 and K, change the corresponding covariate values \(\tilde{x}_{j1}\) and \(\tilde{x}_{j2}\) as

$$\displaystyle{x_{j1} =\tilde{x}_{j1} +\delta _{1}\;\mbox{and}\;x_{j2} =\tilde{x}_{j2} +\delta _{2},\;\delta _{1},\delta _{2}> 0,}$$

respectively. Note that for large positive δ 1 and δ 2, these modified covariates will be increased in magnitude, yielding a larger probability for y j  = 1. One may then treat y j as an outlier. For convenience, suppose that one uses δ 1 = 3.0 and δ 2 = 4.0. As far as the remaining covariates are concerned, they are kept unchanged. That is, for i ≠ j (i = 1, …, K), consider \(x_{i1} =\tilde{x}_{i1}\;\mbox{and}\;x_{i2} =\tilde{x}_{i2}\).
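A companion sketch for the binary case, under the same illustrative assumptions (variance parameterization of the normal covariates; β = (1.0, 0.5)′ as in Table 1 below):

```python
import numpy as np

rng = np.random.default_rng(2010)

def binary_with_one_outlier(K, beta=(1.0, 0.5), delta=(3.0, 4.0), j=0):
    # Generate K logistic binary responses; the j-th covariates are shifted
    x1 = rng.normal(-1.0, np.sqrt(0.25), K)
    x2 = rng.normal(-1.0, np.sqrt(0.5), K)
    x1[j] += delta[0]
    x2[j] += delta[1]
    eta = beta[0] * x1 + beta[1] * x2
    p = np.exp(eta) / (1.0 + np.exp(eta))    # P(Y_i = 1) under the logistic model
    y = rng.binomial(1, p)                   # y_j is treated as the outlier
    return y, np.column_stack([x1, x2])
```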

2.2.2 Naive and Existing Robust QL Estimation Approaches

2.2.2.1 Naive QL (NQL) Estimation of β

Had there been no outliers, one could have obtained a consistent estimate of β by solving the well-known quasi-likelihood (QL) estimating equation

$$\displaystyle{\sum _{i=1}^{K}\left [\frac{\partial \tilde{\mu}_{i}} {\partial \beta} {V}^{-1}(\tilde{\mu}_{i})(y_{i} -\tilde{\mu}_{i})\right ] = 0,}$$
(19)

(see Wedderburn 1974; McCullagh and Nelder 1989; Heyde 1997) where, for example, \(\tilde{\mu}_{i} = E[Y _{i}] =\exp (\tilde{x}^{\prime}_{i}\beta )\;\mbox{and}\;V (\tilde{\mu}_{i}) = var[Y _{i}] =\tilde{\mu} _{i}\) for Poisson count data; and \(\tilde{\mu}_{i} = E[Y _{i}] =\exp (\tilde{x}^{\prime}_{i}\beta )/[1 +\exp (\tilde{x}^{\prime}_{i}\beta )]\;\mbox{and}\;V (\tilde{\mu}_{i}) = var[Y _{i}] =\tilde{\mu} _{i}(1 -\tilde{\mu}_{i})\) for binary data. But, as the uncontaminated \(\tilde{x}_{i}\)’s are unobserved, it is not possible to use (19) for the estimation of β. Now suppose that following (19) but by using the observed covariates {x i }, one writes the naive quasi-likelihood (NQL) estimating equation for β given by

$$\displaystyle{\sum _{i=1}^{K}\left [\frac{\partial \mu _{i}} {\partial \beta} {V}^{-1}(\mu _{i})(y_{i} -\mu _{i})\right ] = 0,}$$
(20)

where, for example, \(\mu _{i} =\exp (x^{\prime}_{i}\beta )\;\mbox{and}\;V (\mu _{i}) =\mu _{i}\) for Poisson count data; and \(\mu _{i} =\exp (x^{\prime}_{i}\beta )/[1 +\exp (x^{\prime}_{i}\beta )]\;\mbox{and}\;V (\mu _{i}) =\mu _{i}(1 -\mu _{i})\) for binary data. Since β is the effect of \(\tilde{x}_{i}\) on y i for all i = 1, , K, it then follows that the quasi-likelihood estimator obtained from (20) will be biased and hence inconsistent for β.
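To fix ideas, for the Poisson log-linear case the NQL equation (20) reduces to \(\sum _{i}x_{i}(y_{i} -\mu _{i}) = 0\), since \(\partial \mu _{i}/\partial \beta =\mu _{i}x_{i}\) and V (μ i ) = μ i . A minimal Fisher scoring sketch (our own illustration; the estimator it returns is precisely the NQL estimator whose bias under contamination is discussed above) is:

```python
import numpy as np

def nql_poisson(X, y, tol=1e-8, max_iter=50):
    # Solve sum_i x_i (y_i - mu_i) = 0 with mu_i = exp(x_i' beta) by Fisher scoring
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)
        info = X.T @ (mu[:, None] * X)       # Fisher information: sum_i mu_i x_i x_i'
        step = np.linalg.solve(info, score)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```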

2.2.2.2 Partly Standardized Mallows Type QL (PSMQL) Estimation of β

As a remedy to the inconsistency of the quasi-likelihood estimator obtained from (20), Cantoni and Ronchetti (2001) (see also references therein), among others, have suggested a Mallows type quasi-likelihood (MQL) robust estimation approach to obtain a consistent estimate of the regression effects β. For this purpose, for \(r_{i} = \frac{y_{i}-\mu _{i}} {\sqrt{V (\mu _{i} )}}\), they first define the Huber robust function as

$$\displaystyle{\psi _{c}(r_{i}) = \left \{\begin{array}{ll} r_{i}, &\vert r_{i}\vert \leq c, \\ c\ \mbox{sign}(r_{i}),&\vert r_{i}\vert> c,\\ \end{array} \right.}$$
(21)

where c is the so-called tuning constant. This robust function is then used to construct the MQL estimating equation given by

$$\displaystyle{\sum _{i=1}^{K}\left [w(x_{i})\frac{\partial \mu _{i}} {\partial \beta} {V}^{-\frac{1} {2}}(\mu _{i})\psi _{c}(r_{i}) - a(\beta )\right ] = 0,}$$
(22)

where \(a(\beta ) = \frac{1} {K}\sum _{i=1}^{K}w(x_{i})\frac{\partial \mu _{i}} {\partial \beta} {V}^{-\frac{1} {2}}(\mu _{i})E[\psi _{c}(r_{i})]\), with \(\mu _{i} = E(Y _{i})\), \(V (\mu _{i}) = var(Y _{i})\), and w(x i ) = 1 for the binomial data as in Huber’s linear regression case, but \(w(x_{i}) = \sqrt{(1 - h_{i} )}\) for the Poisson data, where h i is the ith diagonal element of the hat matrix \(H = X{({X}^{\prime}X)}^{-1}{X}^{\prime}\), with \(X = {(x_{1},\ldots,x_{i},\ldots,x_{K})}^{\prime}\) being the K ×p covariate matrix.

Note that in order to minimize the robust distance function \(\psi _{c}(r_{i})\), the MQL estimating equation (22) was constructed by using the variance \(V (\mu _{i}) = var(Y _{i})\) as the weight function and \(\frac{\partial \mu _{i}} {\partial \beta}\) as the gradient function, whereas a proper estimating equation should use \(var(\psi _{c}(r_{i}))\) and \(\frac{\partial \psi _{c}(r_{i})} {\partial \beta}\) as the weight and gradient functions, respectively. One may therefore refer to (22) as a partly standardized MQL (PSMQL) estimating equation. This PSMQL estimating equation provides regression estimates with smaller bias than the traditional maximum likelihood or NQL estimating equation (20). But, as discussed in Bari and Sutradhar (2010a), this improvement does not appear to be significant enough to recommend the use of the PSMQL estimation approach. Moreover, this PSMQL approach is not suitable for inferences in binary regression models.
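For reference, the two building blocks of (21)–(22), namely the Huber function applied to the Pearson residuals and the weight w(x i ) = √(1 − h i ) used in the Poisson case, can be sketched as follows (an illustration; c = 1.4 matches the tuning value used in the Table 1 simulation below):

```python
import numpy as np

def psi_c(y, mu, V, c=1.4):
    # Huber robust function (21) applied to r_i = (y_i - mu_i)/sqrt(V(mu_i))
    r = (y - mu) / np.sqrt(V)
    return np.clip(r, -c, c)

def mallows_weights(X):
    # w(x_i) = sqrt(1 - h_i), h_i the diagonal of H = X (X'X)^{-1} X'
    H = X @ np.linalg.solve(X.T @ X, X.T)
    return np.sqrt(1.0 - np.diag(H))
```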

2.2.2.3 FSMQL Estimation of β

As an improvement over the PSMQL estimation, Bari and Sutradhar (2010a) have proposed an FSMQL estimation approach in which the regression effect vector β is estimated by solving the FSMQL estimating equation

$$\displaystyle\begin{array}{rcl} & & \sum _{i=1}^{K}\left [w(x_{i})\frac{\partial} {\partial \beta}\left \{\psi _{c}(r_{i}) - \frac{1} {K}\sum _{i=1}^{K}E\left (\psi _{c}(r_{i})\right )\right \}{\left \{var\left (\psi _{c}(r_{i})\right )\right \}}^{-1}\right. \\ & \times & \left.\left \{\psi _{c}(r_{i}) - \frac{1} {K}\sum _{i=1}^{K}E\left (\psi _{c}(r_{i})\right )\right \}\right ] = 0. {}\end{array}$$
(23)

Note that this FSMQL estimating equation (23) is constructed by replacing the “working” variance and gradient functions V (μ i ) and \(\frac{\partial \mu _{i}} {\partial \beta}\) in (22) with the true variance and gradient functions \(var(\psi _{c}(r_{i}))\) and \(\frac{\partial \psi _{c}(r_{i})} {\partial \beta}\), respectively. Also, \(w(x_{i}) = \sqrt{(1 - h_{i} )}\) is used in both the binary and Poisson cases. Furthermore, the specific formulas for the true weight function \(var(\psi _{c}(r_{i}))\) and the gradient function \(\frac{\partial \psi _{c}(r_{i})} {\partial \beta}\) for the count and binary cases are available from Bari and Sutradhar (2010a, Sects. 2.1 and 2.2).

Bari and Sutradhar (2010a) also considered another version of the FSMQL estimating equation (23), developed by using the deviance \(\psi _{c}(r_{i}) - E(\psi _{c}(r_{i}))\) instead of \(\psi _{c}(r_{i}) - \frac{1} {K}\sum _{i=1}^{K}E(\psi _{c}(r_{i}))\). This alternative FSMQL estimating equation has the form

$$\displaystyle{\sum _{i=1}^{K}\left [w(x_{i})\frac{\partial} {\partial \beta}\left \{\psi _{c}(r_{i}) - E\left (\psi _{c}(r_{i})\right )\right \}{\left \{var\left (\psi _{c}(r_{i})\right )\right \}}^{-1}\left \{\psi _{c}(r_{i}) - E\left (\psi _{c}(r_{i})\right )\right \}\right ] = 0.}$$
(24)

For convenience, one may refer to (23) and (24) as the FSMQL1 and FSMQL2 estimating equations, respectively.

2.2.2.3.1 Robust Function and Properties for Count Data

For the count data, consider the Huber robust function ψ c (r i ) as in (21). The expectation and variance of this function are available from Cantoni and Ronchetti (2001, Appendix A, p. 1028). The gradient of the robust function and its expectation may then be computed as follows (see also Bari and Sutradhar 2010a, Appendix):

$$\displaystyle{\frac{\partial \psi _{c}(r_{i})} {\partial \beta} = \left \{\begin{array}{ll} - \frac{\mu _{i}} {{V}^{\frac{1} {2}}(\mu _{i})}x_{i},&\vert r_{i}\vert \leq c, \\ 0, &\vert r_{i}\vert> c,\\ \end{array} \right.}$$
(25)

and

$$\displaystyle\begin{array}{rcl} \frac{\partial E(\psi _{c}(r_{i}))} {\partial \beta} & =& -c\left [\frac{\partial} {\partial \beta}F_{Y _{i}}(i_{2}) + \frac{\partial} {\partial \beta}F_{Y _{i}}(i_{1})\right ] + \frac{\mu _{i}} {{V}^{\frac{1} {2}}(\mu _{i})}\left [\left \{x_{i}P(Y _{i} = i_{1}) + \frac{\partial} {\partial \beta}P(Y _{i} = i_{1})\right \}\right. \\ & & \left.-\left \{x_{i}P(Y _{i} = i_{2}) + \frac{\partial} {\partial \beta}P(Y _{i} = i_{2})\right \}\right ], {}\end{array}$$
(26)

where

$$\displaystyle\begin{array}{rcl} & & \frac{\partial} {\partial \beta}P(Y _{i} = i_{1}) = P(Y _{i} = i_{1})(i_{1} -\mu _{i})x_{i},\; \frac{\partial} {\partial \beta}P(Y _{i} = i_{2}) = P(Y _{i} = i_{2})(i_{2} -\mu _{i})x_{i}, {}\\ & & \frac{\partial} {\partial \beta}F_{Y _{i}}(i_{1}) =\sum _{j=0}^{i_{1}} \frac{\partial} {\partial \beta}P(Y _{i} = j),\;\;\mbox{and}\;\;\frac{\partial} {\partial \beta}F_{Y _{i}}(i_{2}) =\sum _{j=0}^{i_{2}} \frac{\partial} {\partial \beta}P(Y _{i} = j). {}\\ \end{array}$$
2.2.2.3.2 Robust Function and Properties for Binary Data
2.2.2.3.3 (a) Robust function in the presence of one sided outlier

Suppose that the bulk of the binary observations occur with small probabilities. In this case, the robust function ψ c (r i ) (i = 1, …, K) may be defined as

$$\displaystyle{\psi _{c}(r_{i}) = \left \{\begin{array}{ll} \frac{y_{i}-\mu _{i}} {{V}^{\frac{1} {2}}(\mu _{i})}, &P(Y _{i} = 1) \leq p_{sb},i\neq j,i = 1,\ldots,K, \\ \frac{y_{i}-\mu _{i}^{(c_{1})}} {{{V}^{(c_{1})}}^{\frac{1} {2}}(\mu _{i}^{(c_{1})})},&P(Y _{i} = 1)> p_{sb},i = j, \\ \end{array} \right.}$$
(27)

where \(\mu _{i} = \frac{exp(x_{i}^{\prime}\beta )} {1+exp(x_{i}^{\prime}\beta )}\), \(V (\mu _{i}) =\mu _{i}(1 -\mu _{i})\) for all i = 1, …, K, and \(p_{sb} = max\{\mu _{i}\}\), i ≠ j, is a bound for all K − 1 small probabilities.

Note that, as opposed to the case given in (27), if the bulk of the binary observations occur with large probabilities, then the robust function \(\psi _{c}(r_{i})\) (i = 1, …, K) is defined as

$$\displaystyle{\psi _{c}(r_{i}) = \left \{\begin{array}{ll} \frac{y_{i}-\mu _{i}} {{V}^{\frac{1} {2}}(\mu _{i})}, &P(Y _{i} = 1) \geq p_{lb},i\neq j,i = 1,\ldots,K, \\ \frac{y_{i}-\mu _{i}^{(c_{2})}} {{{V}^{(c_{2})}}^{\frac{1} {2}}(\mu _{i}^{(c_{2})})},&P(Y _{i} = 1) <p_{lb},i = j, \\ \end{array} \right.}$$
(28)

where \(p_{lb} = min\{\mu _{i}\}\), i ≠ j, is a bound for all K − 1 large probabilities.

2.2.2.3.4 (b) Robust function in the presence of two sided outlier

In this case, the robust function \(\psi _{c}(r_{i})\) (i = 1, …, K) may be defined as

$$\displaystyle{\psi _{c}(r_{i}) = \left \{\begin{array}{ll} \frac{y_{i}-\mu _{i}^{(c_{1})}} {{{V}^{(c_{1})}}^{\frac{1} {2}}(\mu _{i}^{(c_{1})})},&P(Y _{i} = 1)> p_{lb},i = j, \\ \frac{y_{i}-\mu _{i}} {{V}^{\frac{1} {2}}(\mu _{i})}, &p_{sb} \leq P(Y _{i} = 1) \leq p_{lb},i\neq j,i = 1,\ldots,K, \\ \frac{y_{i}-\mu _{i}^{(c_{2})}} {{{V}^{(c_{2})}}^{\frac{1} {2}}(\mu _{i}^{(c_{2})})},&P(Y _{i} = 1) <p_{sb},i = j, \\ \end{array} \right.}$$
(29)

where \(\mu _{j}^{(c_{1})}\) and \({V}^{(c_{1})}(\mu _{j}^{(c_{1})})\) are defined as in (27), whereas \(\mu _{j}^{(c_{2})}\) and \({V}^{(c_{2})}(\mu _{j}^{(c_{2})})\) are defined as in (28).

2.2.2.3.5 (b(i)) Basic properties of the robust function ψ c (r i ): Binary case

It is convenient to write these properties for the two sided outlier case. The results for the one sided outlier may be obtained as a special case. The expectation, variance, and gradient of the robust function in the presence of a two sided outlier are available from Bari and Sutradhar (2010a, Appendix). For convenience, these properties are summarized as follows.

Let \(\psi _{c}(r_{i})\) denote the robust function defined as in (29). The expectation and variance of ψ c (r i ) are given by

$$\displaystyle{E(\psi _{c}(r_{i})) = \frac{\mu _{i} -\mu _{i}^{(c_{1})}} {{{V}^{(c_{1})}}^{\frac{1} {2}}(\mu _{i}^{(c_{1})})}P_{1} + \frac{\mu _{i} -\mu _{i}^{(c_{2})}} {{{V}^{(c_{2})}}^{\frac{1} {2}}(\mu _{i}^{(c_{2})})}P_{3},}$$
(30)

and

$$\displaystyle{var(\psi _{c}(r_{i})) = \frac{(1 - 2\mu _{i}^{(c_{1})})\mu _{i} +{\mu _{i}^{(c_{1})}}^{2}} {{V}^{(c_{1})}(\mu _{i}^{(c_{1})})} P_{1}+P_{2}+\frac{(1 - 2\mu _{i}^{(c_{2})})\mu _{i} +{\mu _{i}^{(c_{2})}}^{2}} {{V}^{(c_{2})}(\mu _{i}^{(c_{2})})} P_{3}-{\left [E(\psi _{c}(r_{i}))\right ]}^{2},}$$
(31)

where P 1, P 2, and P 3 are the probabilities for a binary observation to satisfy the conditions \(P(Y _{i} = 1)> p_{lb}\), \(p_{sb} \leq P(Y _{i} = 1) \leq p_{lb}\), and \(P(Y _{i} = 1) <p_{sb}\), respectively. In practice, the probabilities P 1, P 2, and P 3 may be computed from the data by using the sample proportions given by, for example,

$$\displaystyle{P_{1} = \frac{\mbox{Number of observations satisfying}\;P(Y _{i} = 1)> p_{lb}} {\mbox{Total number of observations}\;(K)}.}$$

The gradient of the robust function ψ c (r i ) [defined in (29)] and its expectation are given by

$$\displaystyle\begin{array}{rcl} \frac{\partial \psi _{c}(r_{i})} {\partial \beta} = \left \{\begin{array}{ll} 0, &P(Y _{i} = 1)> p_{lb},i = j, \\ \frac{-\mu _{i}(1-\mu _{i})x_{i}} {{V}^{\frac{1} {2}}(\mu _{i})},&p_{sb} \leq P(Y _{i} = 1) \leq p_{lb},i\neq j,i = 1,\ldots,K, \\ 0, &P(Y _{i} = 1) <p_{sb},i = j,\\ \end{array} \right.& &{}\end{array}$$
(32)

and

$$\displaystyle{\frac{\partial E(\psi _{c}(r_{i}))} {\partial \beta} = \frac{(1 -\mu _{i})\mu _{i}x_{i}} {{{V}^{(c_{1})}}^{\frac{1} {2}}(\mu _{i}^{(c_{1})})}P_{1} + \frac{(1 -\mu _{i})\mu _{i}x_{i}} {{{V}^{(c_{2})}}^{\frac{1} {2}}(\mu _{i}^{(c_{2})})}P_{3}.}$$
(33)

To illustrate the finite sample based relative performance of the competing robust approaches, namely the PSMQL (22), FSMQL1 (23), and FSMQL2 (24) approaches, we refer to some of the simulation results from Bari and Sutradhar (2010a). In the presence of a single outlier, the count and binary data were generated as in Sect. 2.2.1. With K = 60 observations including an outlier, the relative bias (RB) of an estimator, for example, for β k (k = 1, …, p), given by

$$\displaystyle{\mbox{RB}\;(\hat{\beta}_{k}) = \frac{\vert \hat{\beta}_{k} -\beta _{k}\vert} {\mbox{s.e.}\;(\hat{\beta}_{k})} \times 100,}$$
(34)

was computed based on 1,000 simulations. The results are shown in Table 1.

Table 1 (For count and binary data with one outlier) Simulated means (SM), simulated standard errors (SSE), and relative biases (RB) of the PSMQL, FSMQL1, and FSMQL2 estimates of the regression parameters β 1 = 1.0 and β 2 = 0.5, for sample size 60 and selected values of the tuning constant c = 1.4 under the Poisson model, and tuning constant \({\mu}^{c_{1}} = 0.9\) under the binary model, in the presence of one outlier

The results of the table show that both fully standardized robust procedures, FSMQL1 and FSMQL2, perform much better in estimating β than the existing PSMQL robust approach.
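As an aside, the RB measure (34) is computed from the simulated estimates as in the following sketch, where the simulated mean plays the role of \(\hat{\beta}_{k}\) and the simulated standard error plays the role of s.e.\((\hat{\beta}_{k})\):

```python
import numpy as np

def relative_bias(estimates, true_value):
    # RB (34): 100 * |SM - true value| / SSE, where SM is the simulated mean
    # and SSE the simulated standard error over all simulation replicates
    estimates = np.asarray(estimates)
    return 100.0 * abs(estimates.mean() - true_value) / estimates.std(ddof=1)
```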

3 Robust Inference in Longitudinal Setup

3.1 Existing GEE Approaches for Robust Inferences

Let \(\mu _{i}(x_{i}) = E(Y _{i}) = {(\mu _{i1},\ldots,\mu _{it},\ldots,\mu _{iT})}^{\prime}\) denote the mean vector, and \(\Sigma _{i}(x_{i},\rho ): T \times T\) the true covariance matrix of the response vector y i , where x i represents all true covariates, i.e., \(x_{i} \equiv (x_{i1},\ldots,x_{it},\ldots,x_{iT})\). For convenience, the covariance matrix \(\Sigma _{i}(x_{i},\rho )\) is often expressed as \(\Sigma _{i}(x_{i},\rho ) = A_{i}^{\frac{1} {2}}C_{i}(\rho )A_{i}^{\frac{1} {2}}\), where \(A_{i} = \mbox{diag}[\sigma _{i11},\ldots,\sigma _{itt},\ldots,\sigma _{iTT}]\) and C i (ρ) is the correlation matrix for the repeated binary or count data. Note that if the longitudinal data do not contain any outliers, then one may obtain a consistent and highly efficient estimate of β by solving the GQL estimating equation

$$\displaystyle{\sum _{i=1}^{K}\left [\frac{\partial \mu _{i}^{\prime}(x_{i})} {\partial \beta} \Sigma _{i}^{-1}(x_{i},\hat{\rho})(y_{i} -\mu _{i}(x_{i}))\right ] = 0,}$$
(35)

(see Sutradhar 2003), where \(\hat{\rho}\) is a suitable consistent estimate of ρ, for example, a moment estimate.

Note that in practice it may, however, happen that a small percentage, such as 1%, of the longitudinal observations are suspected to be outliers. Suppose that m of the KT responses are referred to as outliers when their corresponding covariates are shifted by an amount δ, δ being a real valued vector. For convenience, we denote the new set of covariates as

$$\displaystyle{\tilde{x}_{it} = \left \{\begin{array}{ll} x_{it} &\mbox{for}\;(i,t)\not\equiv ({i}^{\prime},{t}^{\prime}) \\ x_{it}+\delta &\mbox{for}\;(i,t) \equiv ({i}^{\prime},{t}^{\prime})\\ \end{array} \right.,}$$

and use these observed covariates \(\tilde{x}_{it}\) for the estimation of β. It is, therefore, clear that since β is the effect of the true covariate x it on y it , the solution of the naive GQL (NGQL) estimating equation based on the observed covariates \(\tilde{x}_{i}\), namely

$$\displaystyle{\sum _{i=1}^{K}\left [\frac{\partial \mu _{i}^{\prime}(\tilde{x}_{i})} {\partial \beta} \Sigma _{i}^{-1}(\tilde{x}_{i},\hat{\rho})(y_{i} -\mu _{i}(\tilde{x}_{i}))\right ] = 0,}$$
(36)

will produce a biased and hence inconsistent estimate of β. To overcome this inconsistency problem, Preisser and Qaqish (1999), among others, have proposed solving a resistant generalized estimating equation (REGEE) given by

$$\displaystyle{\sum _{i=1}^{K}\left [\frac{\partial \mu _{i}^{\prime}(\tilde{x}_{i})} {\partial \beta} V _{i}^{-1}(\tilde{x}_{i},\alpha )(\psi _{i}^{{\ast}}- c_{i})\right ] = 0,}$$
(37)

where \(\psi _{i}^{{\ast}}\) is a down-weighting function, \(c_{i} = E(\psi _{i}^{{\ast}})\), and \(V _{i}(\tilde{x}_{i},\alpha )\) is a “working” covariance matrix (Liang and Zeger 1986). Note that the REGEE in (37) does not appear to be a proper weighted estimating equation. This is because, first, \(V _{i}(\tilde{x}_{i},\alpha )\) is only a substitute for the \(\Sigma _{i}(\tilde{x}_{i},\rho )\) matrix, whereas in the presence of outliers, one needs to use \(\Omega _{i}^{{\ast}} = var(\psi _{i}^{{\ast}})\) in order to obtain efficient regression estimates. Second, the REGEE (37) uses \(\frac{\partial \mu _{i}^{\prime}(\tilde{x}_{i})} {\partial \beta}\) as the gradient function, whereas the consistency of the estimates may depend on the proper gradient function constructed by taking the derivative of the \(\psi _{i}^{{\ast}}- c_{i}\) function with respect to β.

Cantoni (2004) has provided an improvement over the REGEE by introducing the proper gradient function in the estimating equation. To be specific, as compared to Preisser and Qaqish (1999) (see also (37)), Cantoni (2004) constructed an improved resistant generalized estimating equation (IREGEE) given by

$$\displaystyle{\sum _{i=1}^{K}\left [E{\left \{\frac{\partial (\psi _{i}^{{\ast}}- c_{i})} {{\partial \beta}^{\prime}} \right \}}^{\prime}V _{i}^{-1}(\tilde{x}_{i},\alpha )(\psi _{i}^{{\ast}}- c_{i})\right ] = 0,}$$
(38)

where \(E\left [\frac{\partial (\psi _{i}^{{\ast}}-c_{i})} {{\partial \beta}^{\prime}} \right ]\) is a proper gradient of the robust function \(\psi _{i}^{{\ast}}- c_{i}\), with

$$\displaystyle{E\left [\frac{\partial (\psi _{i}^{{\ast}}- c_{i})} {{\partial \beta}^{\prime}} \right ] = E\left [\frac{\partial (\psi _{i}^{{\ast}}- c_{i})} {\partial \mu _{i}^{\prime}(\tilde{x}_{i})} \right ]\frac{\partial \mu _{i}} {{\partial \beta}^{\prime}} = \Gamma _{i}\frac{\partial \mu _{i}} {{\partial \beta}^{\prime}}.}$$

Note that the estimating equation (38) still uses a “working” covariance matrix \(V _{i}(\tilde{x}_{i},\alpha )\), whereas an efficient estimating equation (Sutradhar and Das 1999) should use the proper covariance matrix of the robust function, namely \(\Omega _{i}^{{\ast}} = var(\psi _{i}^{{\ast}})\). Further, similar to Cantoni (2004), Sinha (2006) has attempted to develop certain robust inferences to deal with outliers in longitudinal data. But Sinha (2006) has modeled the longitudinal correlations through random effects, which therefore addresses a different problem than the longitudinal data problem considered here.

Recently, Bari and Sutradhar (2010b) have proposed an auto-correlation class based robust GQL (RGQL) approach for inferences in binary and count panel data models in the presence of outliers. This RGQL approach produces consistent and highly efficient regression estimates, and it is a generalization of the FSMQL approach for independent data to the longitudinal setup. The RGQL approach is summarized in the next section.

3.2 RGQL Approach for Robust Inferences in Longitudinal Setup

Note that when the covariates are stationary, that is, time independent, one may develop a general auto-correlation class based robust GQL estimation approach. Bari and Sutradhar (2010b) have, however, considered non-stationary covariates and exploited the most likely AR(1) type correlation structures for both count and binary data. These correlation structures are discussed in detail in Sutradhar (2010); see also Sutradhar (2011). For convenience, we summarize these correlation structures as follows.

Recall that \(x_{it} = {(x_{it1},\ldots,x_{itu},\ldots,x_{itp})}^{\prime}\) is the p ×1 vector of covariates corresponding to y it when the data do not contain any outliers, and that β denotes the effects of the covariate x it on y it . The AR(1) correlation models for the repeated responses \(y_{i1},\ldots,y_{it},\ldots,y_{iT}\) based on the uncontaminated covariates \(x_{i1},\ldots,x_{it},\ldots,x_{iT}\), for binary and count data, are given below.

AR(1) model for repeated binary data:

For \(\mu _{it} = \frac{exp(x_{it}^{\prime}\beta )} {1+exp(x_{it}^{\prime}\beta )}\), for all t = 1, …, T, the AR(1) model for the binary data may be written as

$$\displaystyle\begin{array}{rcl} & & y_{i1} \sim bin(\mu _{i1})\;\;\mbox{and}\;\; \\ & & y_{it}\vert y_{i,t-1} \sim bin[\mu _{it} +\rho (y_{i,t-1} -\mu _{i,t-1})],{}\end{array}$$
(39)

(Zeger et al. 1985; Qaqish 2003) where ρ is a correlation index parameter. The binary AR(1) model (39) has the auto-correlation structure given by

$$\displaystyle{corr(Y _{iu},Y _{it}) = \left \{\begin{array}{ll} {\rho}^{t-u}{\left [\frac{\sigma _{iuu}} {\sigma _{itt}} \right ]}^{1/2},&\mbox{for $u <t$} \\ {\rho}^{u-t}{\left [ \frac{\sigma _{itt}} {\sigma _{iuu}}\right ]}^{1/2},&\mbox{for $u> t$} \end{array} \right.,}$$
(40)

where \(\sigma _{iuu} =\mu _{iu}(1 -\mu _{iu})\), for example, is the variance of y iu . Note that the ρ parameter in (39)–(40) must satisfy the range restriction

$$\displaystyle{max\left [- \frac{\mu _{it}} {1 -\mu _{i,t-1}},-\frac{1 -\mu _{it}} {\mu _{i,t-1}} \right ] \leq \rho \leq min\left [ \frac{1 -\mu _{it}} {1 -\mu _{i,t-1}}, \frac{\mu _{it}} {\mu _{i,t-1}}\right ].}$$
(41)
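A direct simulation sketch of the binary AR(1) model (39); it assumes that ρ has been chosen to satisfy (41), so that the conditional probabilities stay in [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(2010)

def simulate_binary_ar1(mu, rho):
    # Binary AR(1) model (39): y_1 ~ bin(mu_1), and for t >= 2
    # y_t | y_{t-1} ~ bin(mu_t + rho * (y_{t-1} - mu_{t-1}))
    T = len(mu)
    y = np.empty(T, dtype=int)
    y[0] = rng.binomial(1, mu[0])
    for t in range(1, T):
        p_t = mu[t] + rho * (y[t - 1] - mu[t - 1])   # valid under restriction (41)
        y[t] = rng.binomial(1, p_t)
    return y
```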

AR(1) model for repeated count data:

As opposed to the binary AR(1) model (39), the AR(1) model for the count data is defined as

$$\displaystyle\begin{array}{rcl} y_{i1}& \sim & Poisson(\mu _{i1}) \\ y_{it}& =& \rho {\ast}y_{i,t-1} + d_{it},\;t = 2,\ldots,T,{}\end{array}$$
(42)

(see McKenzie 1988; Sutradhar 2003), where \(y_{i,t-1} \sim Poisson(\mu _{i,t-1})\) and \(d_{it} \sim Poisson(\mu _{it} -\rho \mu _{i,t-1})\), with \(\mu _{it} = E(Y _{it}) =\exp (x^{\prime}_{it}\beta )\). In (42), d it and \(y_{i,t-1}\) are assumed to be independent. Also, for a given count \(y_{i,t-1}\),

$$\displaystyle{\rho {\ast}y_{i,t-1} =\sum _{j=1}^{y_{i,t-1}}b_{j}(\rho ),}$$

where b j (ρ) stands for a binary variable with \(P[b_{j}(\rho ) = 1] =\rho\) and \(P[b_{j}(\rho ) = 0] = 1-\rho\). The AR(1) model (42) for count data has the auto-correlation structure given by

$$\displaystyle{corr(Y _{iu},Y _{it}) {=\rho}^{t-u}\sqrt{\frac{\mu _{iu}} {\mu _{it}}},}$$
(43)

with ρ satisfying the range restriction

$$\displaystyle{0 <\rho <min\left [1, \frac{\mu _{it}} {\mu _{i,t-1}}\right ],t = 2,\cdots \,,T.}$$
(44)
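Similarly, the count AR(1) model (42) can be simulated by binomial thinning, assuming ρ satisfies (44) so that the innovation mean \(\mu _{it} -\rho \mu _{i,t-1}\) is non-negative:

```python
import numpy as np

rng = np.random.default_rng(2010)

def simulate_count_ar1(mu, rho):
    # Count AR(1) model (42): y_1 ~ Poisson(mu_1); for t >= 2,
    # y_t = rho * y_{t-1} + d_t, where rho * y_{t-1} is binomial thinning
    T = len(mu)
    y = np.empty(T, dtype=int)
    y[0] = rng.poisson(mu[0])
    for t in range(1, T):
        thinned = rng.binomial(y[t - 1], rho)        # sum of y_{t-1} Bernoulli(rho)'s
        y[t] = thinned + rng.poisson(mu[t] - rho * mu[t - 1])
    return y
```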

3.2.1 RGQL Estimating Equation

For \(\xi _{i} = {[\psi _{c}(r_{i1}),\ldots,\psi _{c}(r_{it}),\ldots,\psi _{c}(r_{iT})]}^{\prime}\), its expectation λ i is available from Cantoni and Ronchetti (2001) for the count data, and from Sect. 2.2.2 for the binary case. Recall from (38) that, based on the “working” covariance of the responses (Liang and Zeger 1986), Cantoni (2004) has suggested the IREGEE approach for estimating β in the presence of outliers. One may obtain a consistent estimate of β by solving a slightly different equation than (38), given by

$$\displaystyle{\sum _{i=1}^{K}\left [W_{i}\frac{\partial} {\partial \beta}{\left \{\xi _{i} - {K}^{-1}\sum _{i=1}^{K}\lambda _{i}\right \}}^{\prime}V _{i}^{-1}(\alpha )\left \{\xi _{i} - {K}^{-1}\sum _{i=1}^{K}\lambda _{i}\right \}\right ] = 0,}$$
(45)

where \(W_{i} = diag[w_{i1},\ldots,w_{it},\ldots,w_{iT}]\) is the T ×T covariate dependent diagonal weight matrix, so that covariates corresponding to an outlying response yield less weight for the corresponding robust function. To be specific, the t-th diagonal element of the W i matrix is computed as \(w_{it} = \sqrt{1 - h_{itt}}\), h itt being the t-th diagonal element of the hat matrix \(H_{i} = \tilde{X}_{i}{(\tilde{X}_{i}^{\prime}\tilde{X}_{i})}^{-1}\tilde{X}_{i}^{\prime}\) with \(\tilde{X}_{i} = {[\tilde{x}_{i1},\ldots,\tilde{x}_{it},\ldots,\tilde{x}_{iT}]}^{\prime}\). See, for example, Cantoni and Ronchetti (2001). Also in (45), \(V _{i}(\alpha ) = A_{i}^{\frac{1} {2}}R(\alpha )A_{i}^{\frac{1} {2}}\) is a “working” covariance matrix of y i , with R(α) as the associated “working” correlation matrix. Note that there are two problems with this estimating equation. First, for an efficiency increase, it would have been appropriate to use \(\mbox{cov}(\xi _{i}) = cov[\psi _{c}(r_{i1}),\ldots,\psi _{c}(r_{it}),\ldots,\psi _{c}(r_{iT})]\) as the weight matrix instead of the covariance matrix of the responses. Second, Cantoni (2004) did not even use the true covariance matrix \(\Sigma _{i}\) of the responses, but rather used the “working” covariance matrix \(V _{i}(\alpha )\).

To overcome this inefficiency problem encountered by Cantoni’s approach, Bari and Sutradhar (2010b) have suggested a robust function based GQL (RGQL) estimating equation for β as

$$\displaystyle{\sum _{i=1}^{K}\left [W_{i}\frac{\partial} {\partial \beta}{\left \{\xi _{i} - {K}^{-1}\sum _{i=1}^{K}\lambda _{i}\right \}}^{\prime}\Omega _{i}^{-1}\left \{\xi _{i} - {K}^{-1}\sum _{i=1}^{K}\lambda _{i}\right \}\right ] = 0,}$$
(46)

where

$$\displaystyle{\Omega _{i} = cov(\xi _{i}) = (\omega _{iut}),}$$
(47)

with

$$\displaystyle{\omega _{iut} = E\left [\psi _{c}(r_{iu})\psi _{c}(r_{it})\right ] -\left \{E(\psi _{c}(r_{iu}))E(\psi _{c}(r_{it}))\right \},}$$
(48)

where, as mentioned above, the formulas for \(E[\psi _{c}(r_{it})]\) are available for both count and binary data.

3.2.1.1 Computation of Ω i for the Binary Data

Note that the computation of the product moment \(E\left [\psi _{c}(r_{iu})\psi _{c}(r_{it})\right ]\) in (48) is manageable for the binary case, but it is extremely difficult for the count data. For example, suppose that the y it , t = 1, …, T, used in the robust functions \(\psi _{c}(r_{it})\), follow an AR(1) type correlation structure given by (40), where \(\mu _{it} = \frac{exp(x_{it}^{\prime}\beta )} {1+exp(x_{it}^{\prime}\beta )}\) and ρ is a correlation index parameter. Next, suppose that the binary data contain two sided outliers. One may then follow (29), compute all nine combinations of the product term \(\psi _{c}(r_{iu})\psi _{c}(r_{it})\) together with the expectations of all these nine terms, and derive the formula

$$\displaystyle{E\left [\psi _{c}(r_{iu})\psi _{c}(r_{it})\right ] {=\rho}^{t-u}\sigma _{iuu}a_{iut} + \left [E(\psi _{c}(r_{iu}))E(\psi _{c}(r_{it}))\right ],}$$
(49)

where

$$\displaystyle\begin{array}{rcl} a_{iut}& =& \frac{P_{1}^{2}} {\sqrt{\sigma _{itt}^{(c_{1} )}\sigma _{iuu}^{(c_{1} )}}} + P_{1}P_{2}\left [ \frac{1} {\sqrt{\sigma _{itt} \sigma _{iuu}^{(c_{1} )}}} + \frac{1} {\sqrt{\sigma _{itt}^{(c_{1} )}\sigma _{iuu}}}\right ] {}\\ & +& P_{1}P_{3}\left [ \frac{1} {\sqrt{\sigma _{itt}^{(c_{2} )}\sigma _{iuu}^{(c_{1} )}}} + \frac{1} {\sqrt{\sigma _{itt}^{(c_{1} )}\sigma _{iuu}^{(c_{2} )}}}\right ] + P_{2}P_{3}\left [ \frac{1} {\sqrt{\sigma _{itt}^{(c_{2} )}\sigma _{iuu}}} + \frac{1} {\sqrt{\sigma _{itt} \sigma _{iuu}^{(c_{2} )}}}\right ] {}\\ & +& \frac{P_{2}^{2}} {\sqrt{\sigma _{itt} \sigma _{iuu}}} + \frac{P_{3}^{2}} {\sqrt{\sigma _{itt}^{(c_{2} )}\sigma _{iuu}^{(c_{2} )}}}, {}\\ \end{array}$$

for u < t. We may then easily compute ω iut by using (49) and (48).

Further note that for the one sided outlier case, \(E\left [\psi _{c}(r_{iu})\psi _{c}(r_{it})\right ]\) can be obtained from (49) as follows. For the one sided down-weighting function \(\psi _{c}(r_{it})\) given in (28), one may compute the expectation of \(\psi _{c}(r_{iu})\psi _{c}(r_{it})\) from (49) by changing the limits, that is, by replacing p lb with 0. Similarly, the product moment based on the down-weighting function \(\psi _{c}(r_{it})\) given in (27) can be obtained from (49) by changing the limits, that is, by replacing p sb with 1.

Under the AR(1) binary correlation structure (40), the outlier based moment estimation formula for ρ, derived from (49), is given by

$$\displaystyle{\hat{\rho}_{M} = \frac{\frac{\frac{\sum _{i=1}^{K}\sum _{u=1}^{T-1}[\psi _{c}(r_{iu})-E(\psi _{c}(r_{iu}))][\psi _{c}(r_{i,u+1})-E(\psi _{c}(r_{i,u+1}))]w_{iu}w_{i,u+1}} {K(T-1)}} {\frac{\sum _{i=1}^{K}\sum _{u=1}^{T}{[\psi _{c}(r_{iu})-E(\psi _{c}(r_{iu}))]}^{2}/var[\psi _{c}(r_{iu})]} {KT}}} {\frac{\sum _{i=1}^{K}\sum _{u=1}^{T-1}\sigma _{iuu}a_{iut}w_{iu}w_{i,u+1}} {K(T-1)}}.}$$
(50)

Alternatively, for any lag 1 dependent binary or count data [irrespective of the correlation structure, such as AR(1) or MA(1)] with possible outliers, the lag 1 correlation index parameter ρ may be estimated as

$$\displaystyle{\hat{\rho}_{M} = \frac{\frac{\sum _{i=1}^{K}\sum _{u=1}^{T-1}[\psi _{c}(r_{iu})w_{iu}-\bar{\xi}_{u,w}][\psi _{c}(r_{i,u+1})w_{i,u+1}-\bar{\xi}_{u+1,w}]} {K(T-1)}} {\frac{\sum _{i=1}^{K}\sum _{u=1}^{T}{[\psi _{c}(r_{iu})w_{iu}-\bar{\xi}_{u,w}]}^{2}} {KT}},}$$
(51)

where \(\bar{\xi}_{t,w} = \frac{1} {K}\sum _{i=1}^{K}\psi _{c}(r_{it})w_{it}\).
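The moment estimator (51) is straightforward to compute once the weighted robust functions ψ c (r it )w it are tabulated; in the sketch below, psi_w is an assumed K ×T array holding these values.

```python
import numpy as np

def rho_moment(psi_w):
    # Lag 1 moment estimator (51); psi_w[i, t] = psi_c(r_it) * w_it, and the
    # column means give xi-bar_{t,w} = (1/K) sum_i psi_c(r_it) w_it
    K, T = psi_w.shape
    dev = psi_w - psi_w.mean(axis=0)
    numerator = np.sum(dev[:, :-1] * dev[:, 1:]) / (K * (T - 1))
    denominator = np.sum(dev ** 2) / (K * T)
    return numerator / denominator
```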

3.2.1.2 Computation of Ω i for Count Data

Note that as opposed to the binary case, the construction of the Ω i matrix is difficult for the count data case. One may, however, alternatively compute this Ω i matrix by using the general formula

$$\displaystyle{cov(\xi _{i}) = \Omega _{i} = A_{i\xi}^{\frac{1} {2}}C_{i\xi}A_{i\xi}^{\frac{1} {2}},}$$
(52)

where \(A_{i\xi} = \mbox{diag}\left [var(\psi _{c}(r_{i1})),\ldots,var(\psi _{c}(r_{it})),\ldots,var(\psi _{c}(r_{iT}))\right ]\) and \(C_{i\xi} = (c_{i\xi,ut})\), with \(c_{i\xi,ut} = corr[\psi _{c}(r_{iu}),\psi _{c}(r_{it})]\) for u, t = 1, …, T. For (52), the formulas for \(var[\psi _{c}(r_{it})]\) for the binary data are given in Sect. 2.2.2, and for the count data they are available from Cantoni and Ronchetti (2001, Appendix). As far as the computation of the \(C_{i\xi}\) matrix is concerned, one may approximate it by a constant matrix \(C_{\xi}^{{\ast}}\), say, by pretending that the covariates are stationary even though they are non-stationary (i.e., time dependent). Under this assumption, the (u, t)th component of the constant matrix \(C_{\xi}^{{\ast}}\) may be computed as

$$\displaystyle{C_{\xi}^{{\ast}} = (c_{\xi,ut}^{{\ast}}),}$$

where

$$\displaystyle{c_{\xi,ut}^{{\ast}} = \frac{\frac{1} {K}\sum _{i=1}^{K}[\psi _{c}(r_{iu}) -\bar{\xi}_{u}][\psi _{c}(r_{it}) -\bar{\xi}_{t}]} {\sqrt{\frac{1} {K}\sum _{i=1}^{K}{[\psi _{c}(r_{iu}) -\bar{\xi}_{u}]}^{2} \frac{1} {K}\sum _{i=1}^{K}{[\psi _{c}(r_{it}) -\bar{\xi}_{t}]}^{2}}},}$$
(53)

with \(\bar{\xi}_{t} = \frac{1} {K}\sum _{i=1}^{K}\psi _{c}(r_{it})\), for all t = 1, …, T.
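The approximation (53) amounts to the sample correlation matrix of the robust functions across individuals; in the sketch below, psi is an assumed K ×T array with entries ψ c (r it ).

```python
import numpy as np

def c_xi_star(psi):
    # Constant correlation approximation (53); psi[i, t] = psi_c(r_it), and
    # the column means give xi-bar_t = (1/K) sum_i psi_c(r_it)
    K, T = psi.shape
    dev = psi - psi.mean(axis=0)
    cov = dev.T @ dev / K                    # (u,t) entry: (1/K) sum_i dev_iu dev_it
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)
```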

Note that the REGEE approach encounters convergence problems, and it also produces regression estimates with much larger relative biases than the RGQL approach. See, for example, the finite sample relative performance of the RGQL and REGEE approaches shown through the intensive simulation studies reported in Bari and Sutradhar (2010b).