1 Introduction

Many real-life problems require suitable techniques to describe some response data through a set of related explanatory variables. Parametric regression helps the experimenter to model such scenarios by means of a pre-specified functional relationship between the response and the explanatory variables, described through a set of real parameters. The most widely used regression model is linear regression for continuous responses. In practice, though, many other types of response data, such as count data and binary responses, arise frequently in real-life experiments. The generalized linear model is a general tool that can be used with all such types of response variables. It allows the experimenter to model the response variables by any distribution within a large family of distributions, namely the exponential family, and the expected response by any (suitably smooth) function of a linear combination of the explanatory variables. Ordinary linear regression is a special case of the above.

The classical estimation procedure in this context is the maximum likelihood estimation method, which is asymptotically efficient but lacks robustness against outliers and model misspecification. In many real-life experiments, outliers show up as a matter of routine; they influence the maximum likelihood estimators (MLEs) and often produce nonsensical results. So, there is a real need for robust estimation procedures for the generalized linear regression model. Although there is a crowded field of robust estimators for the ordinary linear regression problem, only a few robust estimators exist for the generalized linear model. Cantoni and Ronchetti (2001) and Hosseinian (2009) present and discuss approaches that bound the Pearson residuals. Another pathway in the literature consists of bounding the unscaled deviance components in some special cases; see, e.g., Bianco and Yohai (1996), Croux and Haesbroeck (2003) and Bianco et al. (2013). Aeberhard et al. (2014) provide a comparison between the two approaches in the case of negative binomial responses. However, most of these approaches consider the explanatory variables to be stochastic.

In this paper, we will develop an estimation procedure for the generalized linear model from a design perspective, where we assume that the explanatory variables are fixed; each response is independent and follows the distribution specified by the generalized linear model, but with different distributional parameters depending on the values of the corresponding explanatory variables. The idea is motivated by the work of Ghosh and Basu (2013), where a robust minimum divergence estimation procedure was developed under the general setup of independent but non-homogeneous observations using the density power divergence; that work considered the case of simple linear regression in detail. Here, we will follow a similar approach to develop the minimum density power divergence estimators of the parameters of the generalized linear model, which will be highly robust in the presence of influential observations while retaining comparably high efficiency. Like Cantoni and Ronchetti (2001) and Hosseinian (2009), our approach also bounds the Pearson residuals; hence the term “robustness” in this paper refers to bounded-influence robustness.

The rest of the paper is organized as follows. In Sect. 2 we briefly describe the model and develop the corresponding minimum density power divergence estimators. We also establish the asymptotic properties and present the influence function analysis of the proposed estimator, along with a short discussion on a data-driven choice of the tuning parameter \(\alpha \) in Sect. 2.5. We then explore the special cases of Poisson regression for count data and logistic regression for binary data in Sects. 3 and 4, respectively. Section 5 contains the application of the proposed method to two real-life data sets. A brief comparison with some of the existing robust estimators is provided in Sect. 6. Finally, the paper ends with some concluding remarks in Sect. 7. Some additional numerical examples are provided in the Supplementary Material.

2 The minimum density power divergence estimator in generalized linear models

2.1 The generalized linear model (GLM)

In generalized linear models, the response variables \(Y_i\) are independent and follow the general exponential family of distributions having density

$$\begin{aligned} f(y_i;\theta _i,\phi ) = \exp \left\{ \frac{y_i\theta _i - b(\theta _i)}{a(\phi )} + c(y_i,\phi ) \right\} , \end{aligned}$$
(1)

where the canonical parameter \(\theta _i\) is a measure of location depending on the fixed predictor \(x_i\) and \(\phi \) is the nuisance scale parameter. The mean \(\mu _i\) of \(Y_i\) satisfies the relation \(g(\mu _i) = \eta _i = x_i^T\beta ,\) where g is a monotone and differentiable link function and \(\eta _i = x_i^T\beta \) is the linear predictor. Our main parameter of interest is the regression coefficient \(\beta \) and \(\phi \) acts as the nuisance parameter which shows up only in the error variance. Clearly the generalized linear model allows us to choose several possible densities f from the exponential family and the link function g to form a wide variety of regression models.

By choosing f to be the normal density and g to be the identity link function, the generalized linear model reduces to the usual normal linear regression model. Further, choosing f as the Poisson density and g as the log link \(g(\mu ) = \log (\mu )\), we get the Poisson regression model, which is useful in modeling count data. For binomial f, choosing the logit link function \(g(\mu )=\log (\mu /(1-\mu ))\) or the probit link function \(g(\mu )=\varPhi ^{-1}(\mu )\) generates the logistic and the probit regression models, respectively, which are useful in modeling binary response variables.

2.2 The minimum density power divergence estimator (MDPDE)

We will define the minimum density power divergence estimators (MDPDEs) for the GLM with general density f and link function g so that we can estimate the regression coefficients for any regression model as a special case of it by substituting the form of f and g. In the later sections, we will consider some of these special cases in detail. The density power divergence (DPD) measure was developed by Basu et al. (1998) in terms of a tuning parameter \(\alpha \ge 0\); the divergence between two densities h and f is given by

$$\begin{aligned} d_\alpha (h,f)= & {} \int f^{1+\alpha } - \frac{1+\alpha }{\alpha } \int f^\alpha h + \frac{1}{\alpha } \int h^{1+\alpha }, \quad \hbox { if } ~~ \alpha > 0, \end{aligned}$$

and \(d_0(h,f) = \lim _{\alpha \rightarrow 0} d_\alpha (h,f) = \int h \log (h/f).\) In practice, h represents the data density and f represents the model density (which depends on the unknown parameter). One then minimizes this divergence over the parameter space to get the minimum divergence estimate of the parameter. The situation is substantially simplified in the case of the DPD because the data distribution may then be represented by the ordinary empirical distribution rather than a smoothed version of it.
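For illustration, the following short R sketch (ours, for exposition only; not part of the original development) evaluates \(d_\alpha (h,f)\) by numerical integration for two univariate densities.

```r
## Minimal sketch: numerical evaluation of the density power divergence
## d_alpha(h, f) between two univariate densities, for a given alpha > 0
## (alpha = 0 is the Kullback-Leibler limit, handled separately).
dpd <- function(h, f, alpha) {
  stopifnot(alpha > 0)
  integrand <- function(y) f(y)^(1 + alpha) -
    ((1 + alpha) / alpha) * h(y) * f(y)^alpha +
    (1 / alpha) * h(y)^(1 + alpha)
  integrate(integrand, -Inf, Inf)$value
}

## Example: h = N(0,1) plays the data density, f = N(1,1) the model density
h <- function(y) dnorm(y, mean = 0)
f <- function(y) dnorm(y, mean = 1)
dpd(h, f, alpha = 0.5)   # decreases to 0 as the model mean moves toward 0
```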

Suppose we have a data set \((y_i,~x_i)\); \(i=1, \ldots , n\) from the GLM with density f given by Eq. (1) and a general link function \(g(\mu _i) = \eta _i = x_i^T\beta \). Further assume that the independent variables \(x_i\) are given and fixed so that we are indeed considering a fixed carrier generalized linear model. Then we have the setup of independent but non-homogeneous observations, where \(y_1, \ldots , y_n\) are independent and \(y_i\) has density \(f_i(.;(\beta , \phi )) = f(y_i;\theta _i,\phi )\) for all \(i=1,\ldots , n\). Hence, we can use the approach of Ghosh and Basu (2013), where the MDPDE for the independent but non-homogeneous observations was defined. Following this approach, the MDPDE of \((\beta , \phi )\) has to be obtained by minimizing

$$\begin{aligned} H_n(\beta , \phi )= & {} \frac{1}{n} \sum _{i=1}^n V_i(Y_i;(\beta , \phi )), \\ \hbox {where }\quad V_i(Y_i;(\beta , \phi ))= & {} \int f_i(y;(\beta , \phi ))^{1+\alpha } \mathrm{d}y - \left( 1+\frac{1}{\alpha }\right) f_i(Y_i;(\beta , \phi ))^\alpha . \end{aligned}$$

Note that in the usual GLM estimation, one conventionally uses a robust estimate of the scale parameter \(\phi \) and then estimates the regression parameter \(\beta \). One can perform simultaneous robust estimation of both parameters, as in Huber’s Proposal 2 (Huber 1983) in the linear case; however, such exceptions to the above convention are rare. Here, in the proposed minimum DPD estimation, we do estimate \(\beta \) and \(\phi \) simultaneously and robustly by minimizing \(H_n(\beta , \phi )\) with respect to both parameters. The estimating equations for the parameters are then given by \(\sum _{i=1}^n \nabla V_i(Y_i;(\beta , \phi )) = 0,\) or,

$$\begin{aligned} \sum _{i=1}^n \left[ \int u_i(y;(\beta , \phi )) f_i(y;(\beta , \phi ))^{1+\alpha }\mathrm{d}y - u_i(Y_i;(\beta , \phi )) f_i(Y_i;(\beta , \phi ))^\alpha \right] = 0, \end{aligned}$$

where \(u_i(y;(\beta , \phi )) = \nabla \log f_i(y;(\beta , \phi ))\); \(\nabla \) represents the derivative with respect to \((\beta , \phi )\), with \(\nabla _\beta \) and \(\nabla _\phi \) denoting the indicated individual derivatives. Then, a simple calculation shows that

$$\begin{aligned} \nabla _\beta \log f_i(y_i;(\beta , \phi ))= & {} \frac{(y_i - \mu _i)}{\mathrm{Var}(y_i)g'(\mu _i)} x_i = K_{1i}(y_i;(\beta , \phi )) x_i, \\ \nabla _\phi \log f_i(y_i;(\beta , \phi ))= & {} -\frac{(y_i\theta _i - b(\theta _i))}{a^2(\phi )}a'(\phi ) +\frac{\partial }{\partial \phi }c(y_i,\phi ) = K_{2i}(y_i;(\beta , \phi )), \end{aligned}$$

where \(K_{1i}\), \(K_{2i}\) are the indicated functions. Thus, our estimating equations become

$$\begin{aligned} \sum _{i=1}^n x_i ~ \left[ \int K_{1i}(y;(\beta , \phi )) f_i(y;(\beta , \phi ))^{1+\alpha }\mathrm{d}y - K_{1i}(y_i;(\beta , \phi )) f_i(y_i;(\beta , \phi ))^\alpha \right] = 0,\nonumber \\ \end{aligned}$$
(2)
$$\begin{aligned} \sum _{i=1}^n \left[ \int K_{2i}(y;(\beta , \phi )) f_i(y;(\beta , \phi ))^{1+\alpha }\mathrm{d}y - K_{2i}(y_i;(\beta , \phi )) f_i(y_i;(\beta , \phi ))^\alpha \right] = 0.\nonumber \\ \end{aligned}$$
(3)

However, if we want to ignore the nuisance parameter \(\phi \), as per the usual practice, and estimate \(\beta \) with \(\phi \) held fixed (or substituted suitably), it is enough to consider only the estimating Eq. (2). Further, for \(\alpha =0\), we have

$$\begin{aligned} \int \frac{(y - \mu _i)}{\mathrm{Var}(y_i)g'(\mu _i)} x_i f_i(y;(\beta , \phi ))\,\mathrm{d}y = 0, \end{aligned}$$

and hence the estimating equations for \(\beta \) (ignoring \(\phi \)) simplify to

$$\begin{aligned} \sum _{i=1}^n \frac{(Y_i - \mu _i)}{\mathrm{Var}(Y_i)g'(\mu _i)} x_i = 0. \end{aligned}$$

Note that this is just the maximum likelihood estimating equation, and it also coincides with the ordinary least squares (OLS) estimating equation for \(\beta \) when \(\phi \) is held fixed. Thus, the MDPDE of \(\beta \) with \(\alpha =0\) equals the maximum likelihood estimator as well as the OLS estimator of \(\beta \). That is, the MDPDE proposed here is a natural generalization of the MLE.

Also it is interesting to note that if our density f is such that \(\int f(y;\theta _i,\phi )^{1+\alpha } \mathrm{d}y\) is independent of the location parameter \(\theta _i\), as for the normal density, then we have \(\int \frac{(y - \mu _i)}{\mathrm{Var}(y_i)g'(\mu _i)} x_i f_i(y;(\beta , \phi ))^{1+\alpha }\mathrm{d}y=0\) and hence the estimating Eq. (2) simplifies to

$$\begin{aligned} \sum _{i=1}^n \frac{(Y_i - \mu _i)}{\mathrm{Var}(Y_i)g'(\mu _i)} x_i f_i(Y_i;(\beta , \phi ))^\alpha = 0. \end{aligned}$$
(4)

2.3 Asymptotic properties

We will now derive the joint asymptotic distribution of the minimum density power divergence estimator \(({\hat{\beta }}, {\hat{\phi }})\) of the parameters \((\beta , \phi )\) obtained by solving the estimating Eqs. (2) and (3). For simplicity, we will assume that the true data-generating distribution also belongs to the model family, with parameters \((\beta ^g, \phi ^g)\). Define, for \(i=1, \ldots , n\) and \(j, k =1, 2\),

$$\begin{aligned} \gamma _{ji} = \gamma _{ji}^{1+\alpha }(\beta , \phi )= & {} \int K_{ji}(y;(\beta , \phi )) f_i(y;(\beta , \phi ))^{1+\alpha }\mathrm{d}y, \\ \hbox {and }~\gamma _{jki} = \gamma _{jki}^{1+\alpha }(\beta , \phi )= & {} \int K_{ji}(y;(\beta , \phi )) K_{ki}(y;(\beta , \phi )) f_i(y;(\beta , \phi ))^{1+\alpha }\mathrm{d}y, \\ \hbox {so that }~~ N_i^{1+\alpha }(\beta , \phi )= & {} \int u_i(y;(\beta , \phi )) f_i(y;(\beta , \phi ))^{1+\alpha }\mathrm{d}y = \left( \begin{array}{c} \gamma _{1i} x_i \\ \gamma _{2i} \end{array} \right) , \\ M_i^{1+\alpha }(\beta , \phi )= & {} \int u_i(y;(\beta , \phi ))u_i(y;(\beta , \phi ))^T f_i(y;(\beta , \phi ))^{1+\alpha }\mathrm{d}y \\= & {} \left( \begin{array}{c c} \gamma _{11i} x_ix_i^T &{}\quad \gamma _{12i} x_i \\ \gamma _{12i} x_i^T &{}\quad \gamma _{22i} \end{array} \right) . \end{aligned}$$

Now, put \(\varGamma _{j}^{(\alpha )} = \mathrm{Diag}(\gamma _{ji})_{i=1,\ldots ,n}\) and \(\varGamma _{jk}^{(\alpha )} = \mathrm{Diag}(\gamma _{jki})_{i=1,\ldots ,n}\) for \(j,k=1,2\) and \(X^T = [x_1, \ldots , x_n]\). Then we have

$$\begin{aligned} \varPsi _n(\beta , \phi )= & {} \frac{1}{n}\sum _{i=1}^n M_i^{1+\alpha }(\beta , \phi ) = \frac{1}{n} \left( \begin{array}{c c} X^T\varGamma _{11}^{(\alpha )}X &{}\quad X^T\varGamma _{12}^{(\alpha )}\mathbf {1} \\ \mathbf {1}^T\varGamma _{12}^{(\alpha )}X &{}\quad \mathbf {1}^T\varGamma _{22}^{(\alpha )}\mathbf {1} \end{array} \right) ; \end{aligned}$$
(5)
$$\begin{aligned} \varOmega _n(\beta , \phi )= & {} \frac{1}{n}\sum _{i=1}^n \left[ M_i^{1+2\alpha }(\beta , \phi ) - N_i^{1+\alpha }(\beta , \phi ) (N_i^{1+\alpha }(\beta , \phi ))^T \right] \end{aligned}$$
(6)
$$\begin{aligned}= & {} \frac{1}{n} \left( \begin{array}{c c} X^T[\varGamma _{11}^{(2\alpha )}-\varGamma _{1}^{(\alpha )T}\varGamma _{1}^{(\alpha )}]X &{}\quad X^T[\varGamma _{12}^{(2\alpha )}-\varGamma _{1}^{(\alpha )}\varGamma _{2}^{(\alpha )}]\mathbf {1} \\ \mathbf {1}^T[\varGamma _{12}^{(2\alpha )}-\varGamma _{1}^{(\alpha )}\varGamma _{2}^{(\alpha )}]X &{}\quad \mathbf {1}^T[\varGamma _{22}^{(2\alpha )}-\varGamma _{2}^{(\alpha )T}\varGamma _{2}^{(\alpha )}]\mathbf {1} \end{array} \right) . \end{aligned}$$
(7)

Then, the asymptotic distribution of \(({\hat{\beta }}, {\hat{\phi }})\) follows along the lines of Theorem 3.1 of Ghosh and Basu (2013), provided Assumptions (A1)–(A7) hold in the case of the generalized linear model. These assumptions are presented in the Supplementary material to this paper. Note that Assumptions (A1)–(A3) hold directly from the properties of the exponential family of distributions.

Theorem 1

Under Assumptions (A1)–(A7) of Ghosh and Basu (2013), there exists a consistent sequence \(({\hat{\beta }}_n, {\hat{\phi }}_n)\) of roots to the minimum DPD estimating Eqs. (2) and (3). Also, the asymptotic distribution of \(\varOmega _n^{-\frac{1}{2}}\varPsi _n [\sqrt{n} (({\hat{\beta }}_n, {\hat{\phi }}_n) - (\beta ^g, \phi ^g))]\) is \((p+1)\)-dimensional normal with mean 0 and variance \(I_{p+1}\), the identity matrix of dimension \(p+1\), where \(\varPsi _n=\varPsi _n(\beta ^g, \phi ^g)\) and \(\varOmega _n=\varOmega _n(\beta ^g, \phi ^g)\).

Note that the results of Theorem 1 would have been a direct consequence of standard M-estimation results provided the covariates are assumed to be stochastic. However, in this paper we are considering fixed design cases with non-stochastic covariates and the new Assumptions (A1)–(A7) are just the corresponding generalizations of the original assumptions of Huber (1964). See Ghosh and Basu (2013) for a more detailed discussion on these new assumptions. Similar generalizations in the context of the influence function in relation to the approach of Hampel et al. (1986) will be considered in the next subsection.

It follows from the above theorem that the matrix \(\varPsi _n^{-1}\varOmega _n\varPsi _n^{-1}\) gives an estimate of the asymptotic variance, and hence of the asymptotic efficiency, of the MDPDEs \(({\hat{\beta }}_n, {\hat{\phi }}_n)\). Though this depends on the sample size n and the given covariates \(x_i\), it will give a reasonable estimate of the asymptotic efficiency for large n.

Further, note that the asymptotic covariance between \({\hat{\beta }}_n\) and \({\hat{\phi }}_n\) is not in general 0, and hence these estimators are not asymptotically independent for all GLMs. However, in some particular cases, including normal linear regression, they turn out to be independent. One possible set of sufficient conditions for their independence is \(\gamma _{12i}^{1+2\alpha }=0\) and \(\gamma _{1i}^{1+\alpha } \gamma _{2i}^{1+\alpha } =0\) for all i; these conditions hold in the normal linear regression case.

2.4 Influence function

To illustrate the robustness properties of the proposed estimation methodology for the generalized linear model, we will now consider the influence function of the MDPDE of the parameter \(\theta = (\beta , \phi )\). For this we need to view the estimator as a statistical functional at the true data-generating distribution \(\underline{\mathbf {G}}=(G_1,\ldots ,G_n)\). Let \(T_\alpha ^\beta (\underline{\mathbf {G}})\) and \(T_\alpha ^\phi (\underline{\mathbf {G}})\) denote the minimum DPD functionals for the parameters \(\beta \) and \(\phi \), respectively. Let \(T_\alpha (\underline{\mathbf {G}}) = ( T_\alpha ^\beta (\underline{\mathbf {G}})^T , T_\alpha ^\phi (\underline{\mathbf {G}}))^T,\) which is defined by

$$\begin{aligned} \frac{1}{n} \sum _{i=1}^n d_\alpha (g_i(.),f_i(.;T_\alpha (\underline{\mathbf {G}}))) = \min _{\theta \in \varTheta } ~ \frac{1}{n} \sum _{i=1}^n d_\alpha (g_i(.),f_i(.;\theta )), \end{aligned}$$

where \(g_i\) is the probability density function corresponding to \(G_i\). We consider the contaminated density \(g_{i,\epsilon } = (1-\epsilon ) g_i + \epsilon \delta _{t_i}\), where \(t_i\) is the point of contamination and \(G_{i,\epsilon }\) denotes the corresponding distribution function, for all \(i=1, \ldots , n\). Let \(\theta _\epsilon ^{i_0} = T_\alpha (G_1,\ldots ,G_{i_0-1},G_{i_0,\epsilon },\ldots , G_n)\) be the minimum DPD functional with contamination only in the \(i_0\)th direction. Then a fairly straightforward (albeit lengthy and tedious) calculation shows that the influence function of \(T_\alpha \) for contamination in the direction \(i_0\) is

$$\begin{aligned} \mathrm{IF}_{i_0}(t_{i_0}, T_\alpha , \underline{\mathbf {G}})= & {} \varPsi _n^{-1} \frac{1}{n}[ f_{i_0}(t_{i_0};(\beta , \phi ))^\alpha u_{i_0}(t_{i_0};(\beta , \phi )) - N_{i_0}^{1+\alpha } ] \\= & {} \varPsi _n^{-1} \frac{1}{n} \left( \begin{array}{c} [f_{i_0}(t_{i_0};(\beta , \phi ))^\alpha K_{1i_0}(t_{i_0};(\beta , \phi )) - \gamma _{1i_0}] x_{i_0} \\ \\ f_{i_0}(t_{i_0};(\beta , \phi ))^\alpha K_{2i_0}(t_{i_0};(\beta , \phi )) - \gamma _{2i_0} \end{array} \right) . \end{aligned}$$

Note that, for any fixed sample size n and any given (finite) values of the \(x_i\)s, if \(\varPsi _n\) and the \(\gamma _{ji_0}\)s are bounded, the influence function of the MDPDE of \((\beta , \phi )\) will be bounded with respect to contamination in any direction \(i_0\) provided the terms \(f_{i}(t_{i};(\beta , \phi ))^\alpha K_{ji}(t_{i};(\beta , \phi ))\) are bounded for all i and \(j=1,2\). Under Assumptions (A1)–(A7) of Ghosh and Basu (2013), \(\varPsi _n\) and the \(\gamma _{ji_0}\)s are necessarily bounded. The boundedness of the remaining terms can be seen to hold for the majority of GLMs with \(\alpha >0\) because of the exponential nature of the density function and the polynomial nature of the functions \( K_{ji}(t_{i};(\beta , \phi ))\). This demonstrates the robust nature of the MDPDE in most GLMs with \(\alpha >0\). However, for \(\alpha =0\) the term \(f_{i}(t_{i};(\beta , \phi ))^\alpha K_{ji}(t_{i};(\beta , \phi )) = K_{ji}(t_{i};(\beta , \phi ))\) is clearly unbounded, implying the non-robust nature of the MLE in any GLM.

As discussed before, in the particular case when \(\gamma _{12i}^{1+2\alpha }=0\) and \(\gamma _{1i}^{1+\alpha } \gamma _{2i}^{1+\alpha } =0\) for all i (as in the normal linear regression case), the minimum density power divergence estimators of \(\beta \) and \(\phi \) become asymptotically independent, and we can also separate the influence functions of the two estimators. Due to the special form of the matrix \(\varPsi _n\) in this case, these two influence functions simplify, respectively, to

$$\begin{aligned} \mathrm{IF}_{i_0}(t_{i_0}, T_\alpha ^\beta , \underline{\mathbf {G}})= & {} (X^T\varGamma _{11}^{(\alpha )}X)^{-1} x_{i_0} [f_{i_0}(t_{i_0};(\beta , \phi ))^\alpha K_{1i_0}(t_{i_0};(\beta , \phi )) - \gamma _{1i_0}], \\ \hbox {and }~~ \mathrm{IF}_{i_0}(t_{i_0}, T_\alpha ^\phi , \underline{\mathbf {G}})= & {} (\mathbf {1}^T\varGamma _{22}^{(\alpha )}\mathbf {1})^{-1} [f_{i_0}(t_{i_0};(\beta , \phi ))^\alpha K_{2i_0}(t_{i_0};(\beta , \phi )) - \gamma _{2i_0}]. \nonumber \end{aligned}$$

As in Ghosh and Basu (2013), in this context also we can define some measures of sensitivity based on the influence function; these are presented in the Supplementary material to this paper.

2.5 A data-driven choice of the tuning parameter \(\alpha \)

The minimum DPD estimators depend on the choice of the tuning parameter \(\alpha \ge 0\) defining the divergence. The properties of the MDPDE in the case of independent and identically distributed data have been studied extensively in the literature, and it is well known that there is a trade-off between efficiency and robustness as \(\alpha \) varies: increasing \(\alpha \) leads to greater robustness at the cost of efficiency. Ghosh and Basu (2013, 2015) observed similar trade-offs for the linear regression case with fixed covariates. Here, we have observed the same phenomenon in the context of the proposed MDPDE for the Poisson and the logistic regression models (see Sects. 3 and 4 below). Therefore, the tuning parameter \(\alpha \) must be chosen carefully when using the MDPDE in any GLM. In this section, we present a possible approach to choosing the optimum value of \(\alpha \) based on the observed data at hand.

In the context of i.i.d. data, data-driven choices of the optimum tuning parameter in minimum DPD estimation have been proposed by Hong and Kim (2001) and Warwick and Jones (2005). Ghosh and Basu (2015) extended these approaches to the case of independent but non-homogeneous data and illustrated them for linear regression through detailed simulation studies. In the present paper, we consider the GLM from its design perspective, so that, given the values of the explanatory variables \(x_i\), the responses \(y_i\) are independent but not identically distributed. So, we can apply the results of Ghosh and Basu (2015) to make a data-driven choice of the optimum tuning parameter \(\alpha \). Accordingly, we need to choose \(\alpha \) by minimizing a consistent estimate of the mean square error (MSE) of the MDPDE \({\hat{\theta }}_\alpha = ({\hat{\beta }}_\alpha ,~ {\hat{\phi }}_\alpha )\) of the true parameter value \(\theta ^g=(\beta ^g,~\phi ^g)\), defined as \(E[ ({\hat{\theta }}_\alpha - \theta ^g)^T({\hat{\theta }}_\alpha - \theta ^g)]\). The true parameter represents the larger component of a possible mixture distribution, in the spirit of Warwick and Jones (2005). It follows from the asymptotic distribution of the MDPDE that, asymptotically,

$$\begin{aligned} E[ ({\hat{\theta }}_\alpha - \theta ^g)^T({\hat{\theta }}_\alpha - \theta ^g)] = ({\theta }_\alpha - \theta ^g)^T({\theta }_\alpha - \theta ^g) + \frac{1}{n} \mathrm{Trace}[\varPsi _n^{-1} \varOmega _n \varPsi _n^{-1} ], \end{aligned}$$
(8)

where \(\varPsi _n\) and \(\varOmega _n\) are as defined in Sect. 2.3 and \(\theta _\alpha = (\beta _\alpha ,~\phi _\alpha )\) is the parameter value minimizing the DPD measure between the true and model densities for the tuning parameter \(\alpha \). Further, from the expressions of \(\varPsi _n\) and \(\varOmega _n\), it is sufficient to find consistent estimators of the quantities \(\gamma _{ji}\) and \(\gamma _{jki}\) for \(j, k = 1,2\) and \( i=1, \ldots , n\), which can be done by replacing the parameter value \((\beta ,~\phi )\) in their expressions by the corresponding MDPDEs \(({\hat{\beta }}_\alpha ,~ {\hat{\phi }}_\alpha )\). Let us denote the resulting matrices by \(\hat{\varPsi }_n\) and \(\hat{\varOmega }_n\). To estimate the bias term, we will use \(({\hat{\beta }}_\alpha ,~ {\hat{\phi }}_\alpha )\) as a consistent estimate of \(({\beta }_\alpha ,~{\phi }_\alpha )\). For estimating \(\theta ^g\), we can use several “pilot” estimators, which will in turn affect the final choice of the tuning parameter. Ghosh and Basu (2015) suggested, on the basis of an extensive simulation study, the MDPDE with \(\alpha =0.5\) as a reasonable pilot estimator. For any particular generalized linear model, we can find such a “good” pilot estimator through some simulation studies, and then use the observed data to choose the corresponding optimum tuning parameter value. Some examples illustrating this approach of choosing the tuning parameter in the GLM are provided in the supplementary material.
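Schematically, the selection evaluates the right-hand side of (8) on a grid of \(\alpha \) values. The following R sketch (ours, for exposition) computes the criterion from its ingredients; in the commented usage, fit_mdpde(), Psi_hat() and Omega_hat() are hypothetical helpers to be implemented for the model at hand (e.g., along the lines of the Poisson sketch in Sect. 3.1).

```r
## Minimal sketch of the criterion (8): estimated squared bias against a
## pilot estimate plus the estimated (trace) variance term.
mse_criterion <- function(theta_alpha, theta_pilot, Psi_hat, Omega_hat, n) {
  bias2 <- sum((theta_alpha - theta_pilot)^2)
  sandwich <- solve(Psi_hat) %*% Omega_hat %*% solve(Psi_hat)
  bias2 + sum(diag(sandwich)) / n
}

## Hypothetical usage: pick the alpha minimizing the estimated MSE, with
## the alpha = 0.5 MDPDE as the pilot estimator; fit_mdpde(), Psi_hat()
## and Omega_hat() are assumed helpers, not defined here.
## alphas <- seq(0, 1, by = 0.05)
## crit <- sapply(alphas, function(a)
##   mse_criterion(fit_mdpde(a), fit_mdpde(0.5), Psi_hat(a), Omega_hat(a), n))
## alpha_opt <- alphas[which.min(crit)]
```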

Another perspective on the criterion (8) can be obtained by noting its similarity with the robust version of the AIC (Heritier et al. 2009, p. 73, Eq. (3.31)) and its generalized version, the GAIC (Heritier et al. 2009, p. 159). Although the trace term is different, the formulations are clearly in a similar spirit, which gives another interesting interpretation of this criterion.

Note that the robustness of the proposed MDPDE, when the tuning parameter \(\alpha \) is estimated from the data, also depends directly on the robustness of the estimation of \(\alpha \). Using the chain rule of derivatives, the robustness of the MDPDE with a data-driven \(\alpha \) can be quantified by noting that its influence function is a multiple of the influence function of the fixed-\(\alpha \) estimator obtained in Sect. 2.4, where the multiplier is the influence function of the estimator of the optimum \(\alpha \) itself. For the tuning parameter selection process described above, it can be verified empirically that the robustness of the optimum-\(\alpha \) estimator depends directly on that of the “pilot” estimator used; see Ghosh and Basu (2015) and Section 3 of the supplementary material of this paper for some numerical illustrations. Our experience in this regard indicates that the suggested choice of the MDPDE with \(\alpha =0.5\) as the “pilot” estimator is quite robust with respect to contamination in the data, leading to a robust selection of \(\alpha \). However, further research including more detailed empirical studies will improve our understanding of this complex issue; we hope to take up such research in the future.

Another important issue requires consideration in connection with the data-driven choice of the tuning parameter \(\alpha \). The asymptotic properties derived in Sect. 2.3 pertain to a fixed \(\alpha \); what about the asymptotic results under this data-driven choice? Clearly, in that case the final result will depend on the process of selecting the parameter \(\alpha \), and on how good that process is. When the assumed model holds and the data are pure, the classical method works well, and the chosen tuning parameter \(\alpha \) should preferably remain close to 0. Large-scale simulation studies, not presented here, indicate that when the tuning parameter is adaptively selected using the Warwick and Jones (2005) approach and the data are pure, the chosen \(\alpha \) is equal to or close to 0 in the overwhelming majority of cases; this phenomenon is observed for several different GLMs. While we do not have a general proof at this moment, our conjecture is that the estimator corresponding to the adaptively chosen tuning parameter will be asymptotically equivalent to the maximum likelihood estimator under the model; at the least, the distribution of the estimator chosen through this adaptive routine will provide a good large sample approximation to that of the maximum likelihood estimator.

The description of the previous paragraph parallels the result of Theorem 1, where the asymptotic distribution for fixed \(\alpha \) is provided under the model. However, when the model is misspecified or the data are contaminated, the description becomes more complicated. The theoretical optimal \(\alpha \) then corresponds to the estimator which minimizes the sum of the square of the theoretical bias and the trace of the covariance matrix. We feel that whenever the data-driven estimate of \(\alpha \) is consistent for the true (fixed) optimal value, the large sample asymptotic distribution of the fixed \(\alpha \) estimator will provide a good approximation for the distribution of the adaptively chosen estimator. Clearly more research is needed on this topic.

3 Special Case I: Poisson regression for count data

The most useful regression tool for count data is the Poisson regression model where, given the values of explanatory variables, the response variables independently follow the Poisson distribution but with different mean parameters depending on the corresponding values of the explanatory variable. More precisely, let \((y_1,x_1), \ldots , (y_n, x_n)\) be the sample observations from the Poisson regression model. Assume that the values \(x_i\) of the explanatory variable are fixed. Then, in the Poisson regression model, the count variables \(y_i\) are assumed to be independent and have Poisson distributions with

$$\begin{aligned} E(y_i|x_i) = \mathrm{e}^{(x_i^T\beta )} \end{aligned}$$

and we want to estimate the parameter \(\beta \) efficiently and robustly.

3.1 The MDPDE for Poisson regression

Poisson regression is indeed a special case of the GLM with known scale parameter \(\phi =1\) and \(\theta _i = \eta _i = x_i^T\beta \), \(b(\theta _i) = \mathrm{e}^{\theta _i}\) and \(c(y_i) = - \log (y_i !)\). Since here the mean is \(\mu _i = \mathrm{e}^{(x_i^T\beta )}=\mathrm{e}^{\eta _i}\), the link function g is the natural logarithm and the variance of \(y_i\) is also \(\mathrm{e}^{(x_i^T\beta )}\). Thus, we can estimate the unknown parameter \(\beta \) using our minimum density power divergence estimation procedure as described earlier. Using the above notation and the form of the Poisson distribution, the minimum DPD estimating equation for \(\alpha \ge 0\) becomes

$$\begin{aligned} \sum _{i=1}^n [ \gamma _{1i}(\beta ) - (y_i- \mathrm{e}^{(x_i^T\beta )})f_i(y_i;\beta )^{\alpha } ]x_i = 0, \end{aligned}$$
(9)

where \(f_i(\cdot ;\beta )\) is the probability mass function of the Poisson distribution with mean \(\mathrm{e}^{(x_i^T\beta )}\). In particular, for \(\alpha =0\), the above estimating equation simplifies to the maximum likelihood estimating equation given by

$$\begin{aligned} \sum _{i=1}^n (y_i- \mathrm{e}^{(x_i^T\beta )})x_i = 0. \end{aligned}$$
(10)

However, for \(\alpha > 0\), there is no simplified form for \(\gamma _{1i}\) and \(\gamma _{11i}\), so we need to compute these quantities numerically and then solve the estimating Eq. (9) numerically with respect to \(\beta \).
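For concreteness, the following R sketch (our illustration, not the authors' production code) carries out this numerical scheme: the integral in Eq. (9) becomes a sum over the Poisson support, which we truncate at an assumed point ymax, and we minimize the objective \(H_n(\beta )\) of Sect. 2.2, whose stationary points solve Eq. (9).

```r
## Minimal sketch: MDPDE for Poisson regression via direct minimization of
## H_n(beta); the integral over y is a sum over the Poisson support,
## truncated at ymax (an assumption of this sketch).
hn_pois <- function(beta, y, X, alpha, ymax = 500) {
  mu <- exp(drop(X %*% beta))
  ys <- 0:ymax
  v <- vapply(seq_along(y), function(i) {
    int <- sum(dpois(ys, mu[i])^(1 + alpha))           # integral term
    int - (1 + 1 / alpha) * dpois(y[i], mu[i])^alpha
  }, numeric(1))
  mean(v)
}

## Simulated illustration with a few gross outliers in the counts
set.seed(1)
n <- 50
X <- cbind(1, runif(n))
y <- rpois(n, exp(drop(X %*% c(1, 1))))
y[1:3] <- y[1:3] + 30
fit <- optim(c(0, 0), hn_pois, y = y, X = X, alpha = 0.5)
fit$par   # expected to stay close to (1, 1) despite the outliers
```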

3.2 Properties of the MDPDE

The asymptotic properties of the MDPDE of \(\beta \) under the Poisson regression model follow directly from Theorem 1 (see Section 2 of the supplementary material for the derivation).

Corollary 1

Under Assumptions (A1)–(A7) of Ghosh and Basu (2013), there exists a consistent sequence \({\hat{\beta }}_n={\hat{\beta }}_n^{(\alpha )}\) of roots to the minimum DPD estimating Eq. (9) for the tuning parameter \(\alpha \). Also, asymptotically,

$$\begin{aligned} (X^T[\varGamma _{11}^{(2\alpha )}(\beta ^g)-\varGamma _{1}^{(\alpha )2}(\beta ^g)]X)^{-\frac{1}{2}} (X^T\varGamma _{11}^{(\alpha )}(\beta ^g)X)({\hat{\beta }}_n - \beta ^g) \sim N_p(0, I_p). \end{aligned}$$

Thus, the asymptotic efficiency of the MDPDEs \({\hat{\beta }}_n = {\hat{\beta }}_n^{(\alpha )}\) of \(\beta \) can be measured through the asymptotic variance

$$\begin{aligned} \mathrm{AV}_\alpha (\beta ^g) = (X^T\varGamma _{11}^{(\alpha )}(\beta ^g)X)^{-1} (X^T[\varGamma _{11}^{(2\alpha )}(\beta ^g)-\varGamma _{1}^{(\alpha )2}(\beta ^g)]X) (X^T\varGamma _{11}^{(\alpha )}(\beta ^g)X)^{-1}, \end{aligned}$$

which can be consistently estimated by replacing \(\beta ^g\) with \({\hat{\beta }}_n\) in its expression, i.e., \({\widehat{\mathrm{AV}}}_\alpha = \mathrm{AV}_\alpha ({\hat{\beta }}_n)\). Thus, an estimate of the relative efficiency of the MDPDE of the \(i{\mathrm{th}}\) component of the parameter vector \(\beta \) with respect to its MLE (or the OLS estimator) is given by

$$\begin{aligned} {\widehat{\mathrm{RE}}}_{i,\alpha } = \frac{i{\mathrm{th}} \hbox { diagonal entry of } {\widehat{\mathrm{AV}}}_0}{i{\mathrm{th}} \hbox { diagonal entry of } {\widehat{\mathrm{AV}}}_\alpha } \times 100. \end{aligned}$$

Clearly, the above estimate of the relative efficiency depends on the sample size n and the choice of the given explanatory variables \(x_i\). However, the consistency of the estimator \({\hat{\beta }}_n\) implies that the above measure gives a consistent estimator of the asymptotic relative efficiency if the \(x_i\)s are chosen suitably (for example, such that \(X^TX\) remains bounded). We have presented the empirical value of this measure of relative efficiency for sample size \(n=50\) under several different cases in Table 1; the same for \(n=100\) is provided in the supplementary material. We have reported six cases, defined by the true values of the regression coefficients \(\beta =(\beta _0, \beta _1, \ldots , \beta _p)\) and the given values of the explanatory variables \(x_i\) (\(i=1,\ldots ,n\)) as follows: \(p=2\) in the first four cases; Cases I and II have \(x_i = (1, \sqrt{i})\) while Cases III and IV have \(x_i = \left( 1, \frac{1}{i}\right) \); \(\beta =(1, 1)\) for Cases I and III and \(\beta =(1, 0.5)\) for Cases II and IV. In Cases V and VI, \(p=3\) with common \(x_i = \left( 1, \sqrt{i}, \frac{1}{i^2}\right) \) and \(\beta =(1, 1, 1)\) and \(\beta =(2, 1, 0.5)\), respectively. All the simulations are based on 1000 replications. It is clear from the tables that the loss of efficiency is quite negligible for the MDPDE with small positive \(\alpha \) under each of the cases considered here. Even for larger positive \(\alpha \) near 0.5, we can get quite high efficiency if the \(x_i\)s are relatively small.
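A plug-in computation of these quantities is straightforward; the following R sketch (ours, with the integrals truncated at an assumed point ymax) estimates \(\mathrm{AV}_\alpha \) and the relative efficiency measure above for the Poisson model.

```r
## Sketch: plug-in estimate of AV_alpha for Poisson regression and the
## resulting relative efficiencies against the MLE (alpha = 0).
av_pois <- function(beta, X, alpha, ymax = 500) {
  mu <- exp(drop(X %*% beta))
  ys <- 0:ymax
  p <- ncol(X)
  Psi <- Omega <- matrix(0, p, p)
  for (i in seq_len(nrow(X))) {
    fi <- dpois(ys, mu[i])
    k  <- ys - mu[i]                        # K_{1i}(y) = y - mu_i
    g11a  <- sum(k^2 * fi^(1 + alpha))      # gamma_{11i}, exponent 1 + alpha
    g112a <- sum(k^2 * fi^(1 + 2 * alpha))  # gamma_{11i}, exponent 1 + 2 alpha
    g1a   <- sum(k * fi^(1 + alpha))        # gamma_{1i},  exponent 1 + alpha
    xx <- tcrossprod(X[i, ])
    Psi   <- Psi + g11a * xx
    Omega <- Omega + (g112a - g1a^2) * xx
  }
  solve(Psi) %*% Omega %*% solve(Psi)       # AV_alpha at the supplied beta
}

re_hat <- function(beta_hat, X, alpha)      # estimated relative efficiency (%)
  100 * diag(av_pois(beta_hat, X, 0)) / diag(av_pois(beta_hat, X, alpha))
```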

Table 1 The estimated relative efficiencies of the MDPDE for various values of the tuning parameter \(\alpha \) under different cases of Poisson regression with sample size \(n = 50\)

Next, to examine the robustness of the MDPDE under the Poisson regression model, we will use the results of Sect. 2.4. The influence function of the MDPDE in the direction \(i_0\) simplifies to

$$\begin{aligned} \mathrm{IF}_{i_0}(t_{i_0}, T_\alpha ^\beta , \underline{\mathbf {G}}) = (X^T\varGamma _{11}^{(\alpha )}X)^{-1} x_{i_0} \left[ \frac{(t_{i_0} - \mathrm{e}^{(x_{i_0}^T\beta )})}{(t_{i_0}!)^\alpha } \mathrm{e}^{\alpha [t_{i_0}(x_{i_0}^T\beta ) - \mathrm{e}^{(x_{i_0}^T\beta )}]} - \gamma _{1i_0}\right] . \end{aligned}$$

Clearly, whenever the inverse of the first matrix exists, this influence function is bounded in \(t_{i_0}\) for any \(\alpha > 0\) implying the robustness of the MDPDE with \(\alpha > 0\). However, at \(\alpha =0\), \(\gamma _{1i_0}=0\) and hence the influence function above further simplifies to

$$\begin{aligned} \mathrm{IF}_{i_0}(t_{i_0}, T_0^\beta , \underline{\mathbf {G}}) = (X^T\varGamma _{11}^{(0)}X)^{-1} x_{i_0} (t_{i_0} - \mathrm{e}^{(x_{i_0}^T\beta )}), \end{aligned}$$

which is linear, and hence unbounded, in \(t_{i_0}\). This indicates the non-robustness of the MLE (equivalently, the OLS estimator) of the regression parameter in the Poisson regression model. Figure 1 shows the influence function of the MDPDE for different \(\alpha \) under several specific Poisson regression models for sample size \(n=50\); the same for \(n=100\) is presented in the Supplementary material. The redescending nature of the influence function with increasing \(\alpha \) is quite clear in all the figures.

Fig. 1

Plot of the influence function of MDPDEs of the slope parameter \(\beta _1\) for different \(\alpha \) (solid line \(\alpha =0\), dotted line \(\alpha =0.1\), dashed line \(\alpha =0.5\) and dashed-dotted line \(\alpha =1\)) and direction \(i_0\) of contamination with \(n=50\). Here, Model (I)–(III) have \(x_i= (1, \sqrt{i})^T\), \(x_i= \left( 1, \frac{1}{i}\right) ^T\) and \(x_i= \left( 1, \frac{1}{i}, \frac{1}{i}\right) ^T\), respectively, with \(\beta _j=1\) for all j. a Model I, \(i_0 = 1\). b Model I, \(i_0= 20\). c Model II, \(i_0 = 1\). d Model II, \(i_0= 20\). e Model III, \(i_0= 1\). f Model III, \(i_0= 20\)
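The computation behind such plots is simple; the following R sketch (ours) evaluates the t-dependent bracketed factor of the influence function above for a single direction \(i_0\), with the constant matrix factor \((X^T\varGamma _{11}^{(\alpha )}X)^{-1}x_{i_0}\) omitted since it is free of t, and the integral truncated at an assumed point ymax.

```r
## Sketch: t-dependent factor of the Poisson influence function in one
## direction; bounded (redescending) for alpha > 0, linear for alpha = 0.
if_factor <- function(t, mu, alpha, ymax = 500) {
  ys <- 0:ymax
  gamma1 <- sum((ys - mu) * dpois(ys, mu)^(1 + alpha))  # gamma_{1 i_0}
  (t - mu) * dpois(t, mu)^alpha - gamma1
}

tt <- 0:60
plot(tt, sapply(tt, if_factor, mu = 5, alpha = 0.5), type = "b",
     xlab = "contamination point t", ylab = "IF factor")     # redescends
lines(tt, sapply(tt, if_factor, mu = 5, alpha = 0), lty = 2) # unbounded
```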

These implications, visible in Table 1 and Fig. 1, are worth stating explicitly. There is a clear trade-off between robustness and efficiency over increasing \(\alpha \) in this context: small values of \(\alpha \) provide a high degree of asymptotic efficiency, while large values of \(\alpha \) provide greater bounded-influence robustness, as evidenced by their highly stable influence functions.

4 Special Case II: logistic regression for binary data

Another important special case of the GLM is the logistic regression model, which is used to model a binary (or binarized categorical) dependent variable in terms of explanatory variables. Given the value of the explanatory variable \(x_i\), the binary outcome variable \(y_i\) (or the binary transform of the categorical variable) is assumed to follow a Bernoulli distribution with success probability \(\pi _i\) depending on \(x_i\) (for each \(i=1, \ldots , n\)). To ensure that the predicted values of \(\pi _i\) lie in the interval (0, 1), the logistic model assumes that

$$\begin{aligned} \pi _i = \pi (x_i) = \frac{\mathrm{e}^{x_i^T\beta }}{1 + \mathrm{e}^{x_i^T\beta }}. \end{aligned}$$

We will now assume that the \(x_i\)s are fixed and consider the logistic regression model from its design perspective to estimate \(\beta \) efficiently and robustly.

4.1 The MDPDE for logistic regression

We can treat the logistic regression model as a particular case of the GLM with known scale parameter \(\phi =1\) and \(\theta _i=\eta _i=x_i^T\beta \), \(c(y_i)=0\). The distribution of \(y_i\) is Bernoulli with mean \(\mu _i = \pi _i = \frac{\mathrm{e}^{\eta _i}}{1 + \mathrm{e}^{\eta _i}}\) and \(\mathrm{var}(y_i) = \pi _i(1-\pi _i) = \frac{\mathrm{e}^{\eta _i}}{(1 + \mathrm{e}^{\eta _i})^2}.\) Thus, the link function g is the logit function, and we can use the minimum DPD estimation procedure of Sect. 2 to estimate \(\beta \) robustly. Using the above notation and the form of the Bernoulli distribution, the minimum DPD estimating equation for \(\alpha \ge 0\) is given by

$$\begin{aligned} \sum _{i=1}^n \left[ \frac{\mathrm{e}^{x_i^T\beta } (\mathrm{e}^{\alpha (x_i^T\beta )} - 1)}{(1 + \mathrm{e}^{x_i^T\beta })^{2+\alpha }} - \left( y_i - \frac{\mathrm{e}^{x_i^T\beta } }{1 + \mathrm{e}^{x_i^T\beta }}\right) \frac{\mathrm{e}^{\alpha (x_i^T\beta )y_i} }{(1 + \mathrm{e}^{x_i^T\beta })^{\alpha }}\right] x_i = 0, \end{aligned}$$
(11)

which can be further simplified to

$$\begin{aligned} \sum _{i=1}^n (1-2y_i)\mathrm{e}^{(x_i^T\beta )(1-y_i)} ~ \frac{ (\mathrm{e}^{\alpha (x_i^T\beta )} + \mathrm{e}^{x_i^T\beta })}{(1 + \mathrm{e}^{x_i^T\beta })^{2+\alpha }} x_i = 0. \end{aligned}$$
(12)

We can solve the above numerically with respect to \(\beta \) to compute the MDPDE for any \(\alpha \ge 0\). In particular, for \(\alpha =0\), Eq. (11) simplifies to

$$\begin{aligned} \sum _{i=1}^n \left( y_i - \frac{\mathrm{e}^{x_i^T\beta } }{1 + \mathrm{e}^{x_i^T\beta }}\right) x_i = 0, \end{aligned}$$
(13)

which is the maximum likelihood estimating equation. Once again the minimum DPD estimating equation is just a generalization of the maximum likelihood estimating equation.
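As in the Poisson case, the estimator can be computed by direct minimization. The following R sketch (our illustration, with simulated data of our own choosing) minimizes \(H_n(\beta )\) for the Bernoulli model, whose stationary points solve Eq. (11); here the integral over y is just a sum over \(\{0,1\}\), so no truncation is needed.

```r
## Minimal sketch: MDPDE for logistic regression via direct minimization
## of H_n(beta); the integral over y reduces to a sum over {0, 1}.
hn_logit <- function(beta, y, X, alpha) {
  prob <- drop(plogis(X %*% beta))                    # pi_i
  intf <- prob^(1 + alpha) + (1 - prob)^(1 + alpha)   # sum_y f_i(y)^(1+alpha)
  fy   <- ifelse(y == 1, prob, 1 - prob)              # f_i(y_i; beta)
  mean(intf - (1 + 1 / alpha) * fy^alpha)
}

## Simulated illustration with a few flipped responses
set.seed(1)
n <- 100
X <- cbind(1, runif(n, -2, 2))
y <- rbinom(n, 1, plogis(drop(X %*% c(0.5, 1))))
y[1:5] <- 1 - y[1:5]
fit <- optim(c(0, 0), hn_logit, y = y, X = X, alpha = 0.5)
fit$par   # expected to stay close to (0.5, 1) despite the flips
```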

4.2 Properties of the MDPDE

We will now present the asymptotic distribution of the MDPDE of \(\beta \) in the logistic regression case as it follows from Theorem 1. In this special case, we have the following result.

Corollary 2

Under Assumptions (A1)–(A7) of Ghosh and Basu (2013), there exists a consistent sequence \({\hat{\beta }}_n = {\hat{\beta }}_n^{(\alpha )}\) of roots to the minimum DPD estimating Eq. (12) at the tuning parameter \(\alpha \). Also, the asymptotic distribution of

$$\begin{aligned}&\left( \sum _{i=1}^n \mathrm{e}^{x_i^T\beta ^g} \frac{ (\mathrm{e}^{\alpha (x_i^T\beta ^g)} + \mathrm{e}^{x_i^T\beta ^g})^2}{(1 + \mathrm{e}^{x_i^T\beta ^g})^{4+2\alpha }} (x_i x_i^T)\right) ^{-\frac{1}{2}} \\&\quad \times \left( \sum _{i=1}^n \mathrm{e}^{x_i^T\beta ^g} \frac{(\mathrm{e}^{\alpha (x_i^T\beta ^g)} + \mathrm{e}^{x_i^T\beta ^g})}{(1 + \mathrm{e}^{x_i^T\beta ^g})^{3+\alpha }} (x_i x_i^T)\right) ({\hat{\beta }}_n - \beta ^g) \end{aligned}$$

is p-dimensional normal with mean 0 and variance \(I_{p}\).

As argued in Sect. 3.2 for the Poisson regression, the asymptotic efficiency of the different MDPDE \({\hat{\beta }}_n={\hat{\beta }}_n^{(\alpha )}\) of \(\beta \) for the logistic regression can also be measured in terms of its asymptotic variance \(\mathrm{AV}_\alpha (\beta ^g)\), which can be again estimated consistently by \({\widehat{\mathrm{AV}}}_\alpha = \mathrm{AV}_\alpha ({\hat{\beta }}_n)\).

As in the Poisson regression case, here also we can compute the relative efficiencies of the MDPDEs of the coefficients of the logistic regression model based on \({\widehat{\mathrm{AV}}}_\alpha \). This measure of relative efficiency clearly depends on the values of \(\beta \) and the \(x_i\)s. We present the empirical estimates of the relative efficiencies of the MDPDE for the logistic regression model in Table 2 for sample size \(n=50\); the same for \(n=100\) is presented in the Supplementary material. These are calculated from a simulation study with 1000 replications under several different cases of logistic regression. The cases are defined by the given values of the explanatory variables \(x_i\) (\(i=1,\ldots ,n\)) exactly as in the Poisson regression, but now with the true regression coefficients \(\beta =(\beta _0, \beta _1, \ldots , \beta _p)\) being (0.1, 0.1), (0.001, 0.0001), (1, 1), (0.1, 0.1), (0.1, 0.1, 0.1) and (0.01, 0.001, 0.0001), respectively, for Cases I–VI. It is clearly seen from the tables that, for any value of the parameter and the explanatory variables, the loss of efficiency is negligible for small \(\alpha >0\). Further, if the values of \(x_i^T\beta \) are small, then we can get quite high efficiency even for larger positive \(\alpha \) near 0.5.

Table 2 The estimated relative efficiencies of the MDPDE for various values of the tuning parameter \(\alpha \) under different cases of logistic regression with sample size \(n = 50\)

5 Real data examples

In this section, we explore the performance of the proposed MDPDEs in the Poisson and logistic regression models by applying them to two interesting real data sets. Applications to several other real data sets are presented in the supplementary material. For all the applications, the estimators are computed by minimizing the corresponding objective function in the software “R”; the minimization is performed using the “optim” function of R under suitable convergence criteria. The “R” code used is available from the authors.

5.1 Epilepsy data

First we consider an interesting data set on 59 epilepsy patients from Thall and Vail (1990). The data were obtained from a clinical trial carried out by Leppik et al. (1985) in which the patients were randomly assigned to the anti-epileptic drug “progabide” or a placebo. The total number of epilepsy attacks was then recorded, which we model by an appropriate set of explanatory variables through a Poisson regression model (Hosseinian 2009). The variables considered in this regard are “Base”, the eight-week baseline seizure rate prior to randomization in multiples of 4; “Age”, the patient’s age in multiples of 10 years; and “Trt”, a binary indicator for the treatment–control group. The interaction between treatment and baseline seizure rate is also important in this case, because it indicates whether the seizure rate of the treatment group is higher or lower than that of the placebo group depending on the baseline count. In fact, the drug decreases the epilepsy attacks only if the baseline count is sufficiently large with respect to some critical threshold.

The data were also analyzed by Hosseinian (2009), who compared the maximum likelihood estimator with the robust methodologies proposed in that paper and those of Cantoni and Ronchetti (2001). It was observed there that the data contain some outlying observations, due to which the interaction effect between treatment and baseline seizure rate turns out to be insignificant under the maximum likelihood estimator, whereas the robust estimators show this interaction to be significant. Here, we apply our proposed robust minimum density power divergence estimators to this epilepsy data set and examine whether they are robust enough to recover the significant interaction effect that the maximum likelihood estimator misses.

Table 3 presents the parameter estimates, their asymptotic standard errors and the corresponding p values based on the minimum density power divergence estimators with different \(\alpha \). Clearly, the estimators corresponding to \(\alpha \ge 0.3\) are quite different from the maximum likelihood estimator, and for these estimators the interaction effect is significant under the Poisson regression model. Indeed, these estimators are quite similar to the robust estimators considered in Hosseinian (2009) but, as described earlier, have superior asymptotic properties.

Table 3 The minimum density power divergence estimates, their standard errors and p values for the epilepsy data

5.2 Damaged carrots data

As an interesting data example leading to the logistic regression model, we consider the damaged-carrots data set of Phelps (1982). The data were obtained from a soil experiment trial recording the proportion of insect-damaged carrots in three blocks at eight dose levels of insecticide, and were discussed by Williams (1987). McCullagh and Nelder (1989) used these data to illustrate methods for identifying isolated departures from the model through an outlier in the y-space (the \(14\mathrm{th}\) observation; dose level 6 and block 2). Later, Cantoni and Ronchetti (2001) modeled these data by a binomial logistic model to illustrate the performance of their proposed robust estimators. It can be checked easily that observation 14 is an outlier in the y-space only and not a leverage point.

We now apply the minimum density power divergence estimation method for several different values of \(\alpha \) to explore the performance of the proposed method in the presence of an outlier only in the y-space. Table 4 presents the parameter estimates, their asymptotic standard errors and corresponding p values for different tuning parameters \(\alpha \). The estimates corresponding to \(\alpha \ge 0.3\) again turn out to be highly robust and similar to the robust estimator obtained by Cantoni and Ronchetti (2001). Also, for these estimators the indicator of Block 1 turns out to be insignificant, whereas it appears significant for the maximum likelihood estimator (corresponding to \(\alpha =0\)) due to the presence of the outlying observation.

Table 4 The minimum density power divergence estimates, their standard errors and p values for the damaged carrots data

6 Comparison with existing robust estimators in GLM

Here, we briefly compare our proposed estimators with some existing robust estimators. As noted previously, there are few robust inference procedures in the GLM literature; only the Poisson, logistic and negative binomial regression models with stochastic covariates have received some attention. In contrast, our proposal considers non-stochastic covariates and is therefore not directly comparable to the existing methods on theoretical grounds. From a practical point of view, however, those methods can be adapted to real-life problems with fixed covariates, and hence numerical comparisons can be of some interest.

Two existing methods appear to be close to our proposal in the sense of bounding the Pearson residual. One is the approach of Hosseinian (2009) who has proposed weighted likelihood-type robust estimators by following the \(L_\mathrm{q}\) quasi-likelihood approach of Morgenthaler (1992); the other is by Cantoni and Ronchetti (2001) who have considered a class of Mallows-type M-estimators as a special case of the generalized estimating equation of Preisser and Qaqish (1999). The second work is itself a special case of Cantoni (2004).

Hosseinian (2009) proposed robust estimators only for the Poisson and logistic regression cases and provided no general form for all GLMs. Further, the estimating equations proposed in Hosseinian (2009) are not asymptotically unbiased, implying that the resulting estimators are inconsistent. Our proposal does not have this theoretical flaw and, in addition, is completely general. Accordingly, further comparison with the Hosseinian (2009) work does not appear to be useful.

On the contrary, the goal of the Cantoni and Ronchetti (2001) work was not just to introduce a new robust estimator for the GLM; rather, it aimed to develop a comprehensive robust analysis (estimation, testing and model selection through the analysis of deviance) that would complement the classical analysis. Their estimators (and those proposed in the current paper) have unbiased estimating equations at the model; hence it is easy to establish theoretical consistency results for these estimators, unlike the case of Hosseinian (2009).

On the robustness issue, the estimators proposed in Cantoni and Ronchetti (2001) have bounded influence functions, as do our proposed MDPDEs. Our estimators appear to have competitive or better robustness properties compared to those of Cantoni and Ronchetti (2001). For illustration, consider the epilepsy data example modeled by Poisson regression. The analysis based on the proposed MDPDE has been presented in Table 3, which shows that the MDPDE with \(\alpha \in [0.3, 0.7]\) can successfully ignore the outliers in the data and generate robust insights. In this example, the effect of the outliers is actually on the significance of the coefficients of the variables “Age” and “Trt \(\times \) Base”. Our analysis shows that while the “Age” variable appears significant at \(\alpha =0\), this false significance is quickly turned around by moderate values of \(\alpha \). Similarly, the true significance of the coefficient of “Trt \(\times \) Base” is masked at \(\alpha = 0\), but clearly observed at larger values of \(\alpha \). The MDPDEs of the coefficient of “Age” for \(\alpha \in [0.3, 0.7]\) vary from 0.04 to 0.02 with p values of the order of 0.3, and those for “Trt \(\times \) Base” range from 0.016 to 0.013 with p values less than \(10^{-4}\). The coefficients of “Age” and “Trt \(\times \) Base” obtained by the techniques of Cantoni and Ronchetti (2001) are 0.16 and 0.012 with p values of 0.0008 and 0.02, respectively (Hosseinian 2009, Table 12, p. 125). Therefore, the proposed MDPDEs with moderate \(\alpha \) seem to produce more robust, or at least competitive, estimates compared to the Cantoni and Ronchetti (2001) estimators. Similar results can be observed for the other real data examples.

We hope to conduct an extensive simulation study in the future to obtain a more comprehensive picture of how the proposed estimators compare with the estimators mentioned in Sect. 1 on the twin goals of efficiency and robustness.

7 Conclusions

In this paper, we have proposed a new general methodology for robust estimation in generalized linear models and considered two prominent special cases: Poisson regression and logistic regression. We have established the robustness properties of the proposed method through an influence function analysis and applied it to several real data sets having different types of outliers. Our method appears to perform competitively with existing techniques in terms of its robustness properties and its applicability to all GLMs. Our method is also based on a bona fide optimization procedure; selecting the correct solution is, therefore, easier than for the estimating equation-based competitors. On the whole, we expect that the proposed estimators will help researchers in several application domains to estimate the model parameters in any generalized linear model efficiently and robustly.