1 Introduction

Means and covariance/dispersion matrices are among the most important concepts in statistics. They are essential in describing the distribution of a population or sample. They are also the building blocks of most widely used statistical methods (e.g., ANOVA, regression, correlation, factor analysis, principal component analysis, structural equation modeling, growth curves, etc.). Most topics in applied multivariate statistics can be regarded as analyses of sample means and/or covariance matrices (e.g., Johnson and Wichern 2002). However, real data tend to have heavy tails (Micceri 1989), and the sample means and covariance matrix can then be very inefficient. In particular, with missing data, even when they are all missing at random (MAR) (Rubin 1976), biases in the normal-distribution-based maximum likelihood (NML) estimates (NMLEs) can be greater than the values of the population parameters, due to the interaction between a heavy-tailed distribution and missing data (Yuan et al. 2012). In such a situation, robust estimates are desired. Robust procedures have been systematically introduced in textbooks (Hampel et al. 1986; Heritier et al. 2009; Huber 1981; Maronna et al. 2006; Wilcox 2012). Robust estimates of means and dispersion matrix with missing values have been developed using maximum likelihood (ML) based on multivariate \(t\)- or contaminated-normal distributions (Little 1988). However, neither of these ML procedures may be the best method when the underlying population distribution is unknown. Other M-estimators, S-estimators, and/or those obtained from certain hybrid methods might be preferred (see e.g., Mehrotra 1995).

When robust estimates of means and dispersion matrix are subject to further analysis, we need to have a consistent estimator of their covariance matrix to obtain consistent standard errors (SEs) for the derived parameter estimates or proper test statistics for overall model evaluation. If the robust means and dispersion matrix satisfy a set of estimating equations, then a consistent sandwich-type covariance matrix of the robust estimates directly follows from the estimating equations (Godambe 1960; Huber 1967; Yuan and Jennrich 1998). Thus, it is important to relate robust estimates to estimating equations. With complete data, robust M-estimators of means and dispersion matrix are typically defined by estimating equations (Maronna 1976). With missing data, they have been presented as the output of expectation-robust (ER) algorithms in which certain weights are attached to cases with imputed data (Little and Smith 1987; Cheng and Victoria-Feser 2002). It is also necessary to identify their corresponding estimating equations if inference is needed in their applications.

The paper has four goals: (1) generalizing the maximum likelihood estimates with missing data based on a multivariate \(t\)-distribution to M-estimators using estimating equations; (2) providing an ER algorithm to solve the estimating equations; (3) identifying the estimating equations corresponding to existing algorithms for computing robust means and dispersion matrix with missing data; (4) comparing bias and efficiency of different robust estimators defined through estimating equations with missing values. We will review relevant literature for robust estimation with missing data in the development. But comparing all the existing robust methods theoretically or numerically is not our goal. Statistical theory suggests that it is impossible to identify the best method for a real data set whose population distribution is unknown.

In Sect. 2 we extend the estimating equations based on the multivariate \(t\)-distribution to those defining general M-estimators for samples with missing values. Special cases of the equations are also satisfied by S-estimators for samples with missing values. We then give the ER algorithm for solving the estimating equations. Estimating equations corresponding to algorithms for calculating robust means and dispersion matrix in the literature are also identified and discussed. Monte Carlo results concerning the efficiency of several robust estimators are presented in Sect. 3. Applications of the results to robust analysis of linear regression and growth curve models are considered in Sect. 4. We end the paper by discussing issues related to applications of robust estimation in practice.

2 Expectation-robust algorithm and estimating equations

Let \(\mathbf{x}\) represent a population of \(p\) random variables. A sample \(\mathbf{x}_i\), \(i=1\), 2, \(\ldots \), \(n\), from \(\mathbf{x}\) is obtained. Due to missing values, \(\mathbf{x}_i\) only contains \(p_i\) marginal realizations of \(\mathbf{x}\). We are interested in estimating the means and dispersion matrix of \(\mathbf{x}\) by a robust method. Let \(\mathbf{x}_{im}\) be the vector containing the \(p-p_i\) missing values. For notational convenience, we will use \(\mathbf{x}_{ic}=(\mathbf{x}_i',\mathbf{x}_{im}')'\) to denote the complete data. Of course, the positions of missing values are not always at the end in practice. We can perform a permutation on each missing pattern so that all the algebraic operations in this article still hold. With the sample \(\mathbf{x}_i\), \(i=1\), 2, \(\ldots \), \(n\), in mind, we will first present the EM algorithm based on a multivariate \(t\)-distribution and then extend it to an ER algorithm solving general estimating equations.

Let \(Mt_{p}({\varvec{\mu }},\mathbf{\Sigma },m)\) denote the \(p\)-variate \(t\)-distribution with \(m\) degrees of freedom, where \({\varvec{\mu }}\) is the mean vector and \(\mathbf{\Sigma }\) is the dispersion matrix. When \(m>2\), the maximum likelihood estimate (MLE) of \(\mathrm{Cov}(\mathbf{x})=m\mathbf{\Sigma }/(m-2)\) can be obtained as \(m\hat{\mathbf{\Sigma }}/(m-2)\) with \(\hat{\mathbf{\Sigma }}\) being the MLE of \(\mathbf{\Sigma }\). Because the purpose of modeling with a multivariate \(t\)-distribution is mostly for robustness rather than regarding the data as truly coming from a \(t\)-distribution, many applications just directly work with \(\hat{\mathbf{\Sigma }}\) rather than \(m\hat{\mathbf{\Sigma }}/(m-2)\) in further analysis (e.g., Devlin et al. 1981). Actually, most statistical analyses based on \(\hat{\mathbf{\Sigma }}\) or a rescaling of it yield the same results.

To introduce the EM algorithm based on \(\mathbf{x}\sim Mt_p({\varvec{\mu }},\mathbf{\Sigma },m)\) with a given \(m\), let \({\varvec{\mu }}^{(j)}\) and \(\mathbf{\Sigma }^{(j)}\) be the values of \({\varvec{\mu }}\) and \(\mathbf{\Sigma }\) at the \(j\)th iteration, and let \({\varvec{\mu }}_i^{(j)}\) and \(\mathbf{\Sigma }_i^{(j)}\) be the means and dispersion matrix corresponding to the observed \(\mathbf{x}_i\). When \(p_i<p\), we have

$$\begin{aligned} {\varvec{\mu }}^{(j)}=\left( \begin{array}{l} {\varvec{\mu }}_i^{(j)}\\ {\varvec{\mu }}_{im}^{(j)} \end{array} \right) \;\;\;\mathrm{and}\;\;\; \mathbf{\Sigma }^{(j)} =\left( \begin{array}{ll} \mathbf{\Sigma }_i^{(j)}&{}\mathbf{\Sigma }_{iom}^{(j)}\\ \mathbf{\Sigma }_{imo}^{(j)}&{}\mathbf{\Sigma }_{imm}^{(j)} \end{array} \right) , \end{aligned}$$
(1)

where \({\varvec{\mu }}_{im}^{(j)}\) corresponds to the means of \(\mathbf{x}_{im}\); \(\mathbf{\Sigma }_{imm}^{(j)}\) and \(\mathbf{\Sigma }_{imo}^{(j)}\) correspond to the dispersion matrices of \(\mathbf{x}_{im}\) with itself and with \(\mathbf{x}_{i}\), respectively. Notice that the elements of \({\varvec{\mu }}^{(j)}\) and \(\mathbf{\Sigma }^{(j)}\) in (1) are the same for all the observations, and the subscript \(i\) is used to indicate that cases may have different numbers of missing values. Let

$$\begin{aligned} d_i^2=d^2(\mathbf{x}_i,{\varvec{\mu }}_i,\mathbf{\Sigma }_i)= (\mathbf{x}_i-{\varvec{\mu }}_i)'\mathbf{\Sigma }_i^{-1}(\mathbf{x}_i-{\varvec{\mu }}_i) \end{aligned}$$

be the Mahalanobis distance for the observed \(\mathbf{x}_i\), and \((d_i^{(j)})^2=d^2(\mathbf{x}_i,{\varvec{\mu }}_i^{(j)},\mathbf{\Sigma }_i^{(j)})\). The E-step of the EM algorithm based on \(\mathbf{x}\sim Mt_{p}({\varvec{\mu }},\mathbf{\Sigma },m)\) in Little (1988) obtains the weight \(w_i^{(j)}=(m+p_i)/\left[ m+\left( d_i^{(j)}\right) ^2\right] \), the conditional means

$$\begin{aligned} \hat{\mathbf{x}}_{ic}^{(j)}=E_j(\mathbf{x}_{ic}|\mathbf{x}_i)=\left( \begin{array}{l} \mathbf{x}_i\\ \hat{\mathbf{x}}_{im}^{(j)} \end{array} \right) , \end{aligned}$$
(2)

and the conditional covariance matrix

$$\begin{aligned} \mathbf{C}_i^{(j)}=\mathrm{Cov}_j(\mathbf{x}_{ic}|\mathbf{x}_{i}) =\left( \begin{array}{l@{\quad }l} \mathbf{0}&{}\,\mathbf{0}\\ \mathbf{0}&{}\,\mathbf{C}_{imm}^{(j)} \end{array} \right) , \end{aligned}$$
(3)

where

$$\begin{aligned} \hat{\mathbf{x}}_{im}^{(j)}={\varvec{\mu }}_{im}^{(j)}+\mathbf{\Sigma }_{imo}^{(j)} (\mathbf{\Sigma }_{i}^{(j)})^{-1}(\mathbf{x}_i-{\varvec{\mu }}_{i}^{(j)}) \;\;\mathrm{and}\;\; \mathbf{C}_{imm}^{(j)}=\mathbf{\Sigma }_{imm}^{(j)}-\mathbf{\Sigma }_{imo}^{(j)} (\mathbf{\Sigma }_{i}^{(j)})^{-1}\mathbf{\Sigma }_{iom}^{(j)}. \end{aligned}$$

The M-step gives

$$\begin{aligned} {\varvec{\mu }}^{(j+1)}&= \frac{\sum _{i=1}^nw_{i}^{(j)}\hat{\mathbf{x}}_{ic}^{(j)}}{\sum _{i=1}^nw_{i}^{(j)}}, \end{aligned}$$
(4)
$$\begin{aligned} \mathbf{\Sigma }^{(j+1)}&= \frac{\sum _{i=1}^n\left[ w_{i}^{(j)} (\hat{\mathbf{x}}_{ic}^{(j)}-{\varvec{\mu }}^{(j+1)}) (\hat{\mathbf{x}}_{ic}^{(j)}-{\varvec{\mu }}^{(j+1)})'+\mathbf{C}_i^{(j)}\right] }{n}. \end{aligned}$$
(5)

The robustness of an M-estimator may depend on the starting values for the EM algorithm. We will discuss choices of starting values at the end of this section. At the convergence of the EM algorithm, we obtain the MLEs \(\hat{{\varvec{\mu }}}\) and \(\hat{\mathbf{\Sigma }}\) based on the multivariate \(t\)-distribution with \(m\) degrees of freedom. Notice that the \(n\) in the denominator of (5) can be replaced by \(\sum _{i=1}^nw_{i}^{(j)}\), which makes the EM algorithm converge faster (Kent et al. 1994; Meng and van Dyk 1997).
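
To make the E-step concrete, below is a minimal NumPy sketch of the quantities in (2) and (3) and the weight \(w_i^{(j)}\) for a single case with missing entries. The function and argument names are illustrative only, and observed positions are handled through a boolean mask rather than the permutation described above.

```python
import numpy as np

def e_step_case(x, obs, mu, Sigma):
    """E-step quantities (2)-(3) for one case: x is a length-p vector with
    NaN in the missing positions and obs is the boolean mask of observed entries."""
    mis = ~obs
    S_oo = Sigma[np.ix_(obs, obs)]
    S_mo = Sigma[np.ix_(mis, obs)]
    S_mm = Sigma[np.ix_(mis, mis)]
    r = x[obs] - mu[obs]
    S_inv_r = np.linalg.solve(S_oo, r)
    d2 = float(r @ S_inv_r)                  # squared Mahalanobis distance d_i^2
    x_hat = x.copy()                         # conditional mean of x_ic, Eq. (2)
    C = np.zeros_like(Sigma)                 # conditional covariance, Eq. (3)
    if mis.any():
        x_hat[mis] = mu[mis] + S_mo @ S_inv_r
        C[np.ix_(mis, mis)] = S_mm - S_mo @ np.linalg.solve(S_oo, S_mo.T)
    return obs.sum(), d2, x_hat, C

def t_weight(d2, p_i, m):
    """Weight w_i^(j) = (m + p_i)/(m + (d_i^(j))^2) used in the E-step."""
    return (m + p_i) / (m + d2)
```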

Clearly, the \(t\)-distribution-based MLEs satisfy the estimating equations obtained by setting the score functions corresponding to \(\mathbf{x}_i\sim Mt_{p_i}({\varvec{\mu }}_i,\mathbf{\Sigma }_i,m)\) at zero. Let \(w_i=(m+p_i)/(m+d_i^{2})\),

$$\begin{aligned} \mathbf{V}_i=2^{-1}(\mathbf{\Sigma }_i^{-1}\otimes \mathbf{\Sigma }_i^{-1}), \end{aligned}$$

and \({\varvec{\sigma }}=\mathrm{vech}(\mathbf{\Sigma })\) be the vector containing the elements in the lower-triangular part of \(\mathbf{\Sigma }\). The estimating equations corresponding to \(\mathbf{x}_i\sim Mt_{p_i}({\varvec{\mu }}_i,\mathbf{\Sigma }_i,m)\), \(i=1\), 2, \(\ldots \), \(n\), are given by

$$\begin{aligned} \sum _{i=1}^nw_i\frac{\partial {\varvec{\mu }}_i'}{\partial {\varvec{\mu }}}\mathbf{\Sigma }_i^{-1}(\mathbf{x}_i-{\varvec{\mu }}_i)=\mathbf{0}\end{aligned}$$
(6)

and

$$\begin{aligned} \sum _{i=1}^n\frac{\partial \mathrm{vec}'(\mathbf{\Sigma }_i)}{\partial {\varvec{\sigma }}}\mathbf{V}_i\mathrm{vec}[w_i(\mathbf{x}_i-{\varvec{\mu }}_i) (\mathbf{x}_i-{\varvec{\mu }}_i)'-\mathbf{\Sigma }_i]=\mathbf{0}. \end{aligned}$$
(7)

Notice that Eqs. (6), (7) and others that we call estimating equations in this article only involve the observed values \(\mathbf{x}_i\), not the estimated component \(\hat{\mathbf{x}}_{im}=E(\mathbf{x}_{im}|\mathbf{x}_i,{\varvec{\mu }},\mathbf{\Sigma })\), which is consistent with the estimating equation literature (e.g., Godambe 1991; Liang and Zeger 1986; Prentice and Zhao 1991).

As noted in the introduction, the MLEs corresponding to \(\mathbf{x}\sim Mt_{p}({\varvec{\mu }},\mathbf{\Sigma },m)\) may not even be asymptotically efficient unless the true underlying population follows the multivariate \(t\)-distribution. Many approaches have been proposed to obtain robust estimates of means and dispersion matrix with complete data (e.g., Maronna 1976; Maronna and Zamar 2002; Mehrotra 1995). In particular, both M-estimators and S-estimators satisfy a set of estimating equations (Lopuhaä 1989; Rocke 1996). A natural generalization of (6) and (7) that accommodates different weights in estimating the means and dispersion matrix of the observed data is given by

$$\begin{aligned} \sum _{i=1}^nw_{i1}\frac{\partial {\varvec{\mu }}_i'}{\partial {\varvec{\mu }}} \mathbf{\Sigma }_i^{-1}(\mathbf{x}_i-{\varvec{\mu }}_i)=\mathbf{0}\end{aligned}$$
(8)

and

$$\begin{aligned} \sum _{i=1}^n\frac{\partial \mathrm{vec}'(\mathbf{\Sigma }_i)}{\partial {\varvec{\sigma }}}\mathbf{V}_i\mathrm{vec}[w_{i2}(\mathbf{x}_i-{\varvec{\mu }}_i) (\mathbf{x}_i-{\varvec{\mu }}_i)'-w_{i3}\mathbf{\Sigma }_i]=\mathbf{0}, \end{aligned}$$
(9)

where \(w_{i1}=w_{i1}(d_i)\), \(w_{i2}=w_{i2}(d_i)\) and \(w_{i3}=w_{i3}(d_i)\) are typically nonincreasing functions of \(d_i\).

Obviously, (6) and (7) are a special case of (8) and (9) when \(w_{i1}=w_{i2}=(m+p_i)/(m+d_i^2)\) and \(w_{i3}=1\). Let \(0<\varphi <1\) and \(r_i\) be the \((1-\varphi )\)th quantile corresponding to \(\chi _{p_i}\), the chi-distribution with \(p_i\) degrees of freedom. Equations (8) and (9) extend Huber-type M-estimators to samples with missing data when letting

$$\begin{aligned} w_{i1}=w_{i1}(d_i)=\left\{ \begin{array}{ll} 1, &{}\quad \mathrm{if}\;d_i\le r_i,\\ r_i/d_i, &{}\quad \mathrm{if}\;d_i>r_i, \end{array} \right. \end{aligned}$$
(10)

\(w_{i2}=w_{i1}^2/\tau _i\) and \(w_{i3}=1\), where \(\tau _i\) is a constant such that \(E[\chi _{p_i}^2w_{i1}^2(\chi _{p_i})/\tau _i]=p_i\). They also extend the elliptical-distribution-based MLEs discussed in Kano et al. (1993) to samples with missing values when \(w_{i3}=1\), and \(w_{i1}=w_{i2}=w_i(d_i)\) corresponds to the density function of the elliptical distribution.
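As an illustration, the following sketch (assuming SciPy is available) evaluates the Huber-type weights in (10) together with \(\tau _i\), which is obtained here by numerically integrating \(E[\chi _{p_i}^2w_{i1}^2(\chi _{p_i})]\); the function and argument names are illustrative only.

```python
import numpy as np
from scipy.stats import chi
from scipy.integrate import quad

def huber_weights(d, p_i, varphi=0.1):
    """Return (w_i1, w_i2, w_i3) for an observed pattern with p_i variables."""
    r = chi.ppf(1.0 - varphi, df=p_i)        # (1-varphi)th quantile of chi_{p_i}
    w1 = 1.0 if d <= r else r / d            # Eq. (10)
    # tau_i chosen so that E[chi_{p_i}^2 * w1^2(chi_{p_i}) / tau_i] = p_i
    integrand = lambda x: x**2 * (1.0 if x <= r else (r / x)**2) * chi.pdf(x, df=p_i)
    tau = quad(integrand, 0, np.inf)[0] / p_i
    w2 = w1**2 / tau
    w3 = 1.0
    return w1, w2, w3
```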

Among studies of S-estimators, Tukey’s biweight function

$$\begin{aligned} \rho _{c}(t)=\left\{ \begin{array}{ll}\displaystyle t^2/2-t^4/(2c^2)+t^6/(6c^4), &{}\quad |t|\le c,\\ \displaystyle c^2/6, &{} \quad |t|>c \end{array} \right. \end{aligned}$$
(11)

is most widely used (e.g., Lopuhaä 1989; Rocke 1996), where \(c\) is a tuning constant. With \(p_i\) variables being observed in \(\mathbf{x}_i\), let \(w_{i1}=\dot{\rho }_{c_i}(d_i)/d_i\), \(w_{i2}=p_iw_{i1}\) and \(w_{i3}=\dot{\rho }_{c_i}(d_i)d_i-\rho _{c_i}(d_i)+b_i\), where \(\dot{\rho }_{c}(d)=\partial {\rho }_c(d)/\partial d\) and \(b_i=E[\rho _{c_i}(\chi _{p_i})]\). Then, Eqs. (8) and (9) are a natural extension of equation (2.6) of Lopuhaä (1989) that S-estimators of \({\varvec{\mu }}\) and \(\mathbf{\Sigma }\) need to satisfy with complete data. In particular, for each observed pattern of the sample, the left sides of Eqs. (8) and (9) are mathematically equivalent to the left sides of the two equations in (2.6) of Lopuhaä if we let \(c_i\) be the same for all the observations within an observed pattern. Notice that the large sample breakdown point of the S-estimator for complete data is given by \(6b/c^2\). We may choose \(c_i\) so that \(6b_i/c_i^2\) is the same across all the observed patterns. Also notice that there might be multiple solutions to Eqs. (8) and (9); all of them are M-estimators, but only one is an S-estimator (Tyler 1991). Multiple starting values might be needed to find the S-estimator (Ruppert 1992), which corresponds to the minimum value of \(|\mathbf{\Sigma }|\) subject to \(|\mathbf{\Sigma }|>0\).
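The biweight-based weights can be sketched in the same way. In the sketch below the tuning constant \(c_i\) is found numerically from the condition that \(6b_i/c_i^2\) equals a target breakdown point, which is only one way to implement the choice described above; all names are illustrative and SciPy is assumed.

```python
import numpy as np
from scipy.stats import chi
from scipy.integrate import quad
from scipy.optimize import brentq

def rho(t, c):                               # Tukey's biweight, Eq. (11)
    return np.where(np.abs(t) <= c,
                    t**2 / 2 - t**4 / (2 * c**2) + t**6 / (6 * c**4),
                    c**2 / 6)

def rho_dot(t, c):                           # derivative of rho with respect to t
    return np.where(np.abs(t) <= c, t - 2 * t**3 / c**2 + t**5 / c**4, 0.0)

def biweight_weights(d, p_i, breakdown=0.2):
    """Return (w_i1, w_i2, w_i3) as described above for one observed pattern."""
    b_of_c = lambda c: quad(lambda x: rho(x, c) * chi.pdf(x, df=p_i), 0, np.inf)[0]
    # choose c_i so that the large sample breakdown point 6*b_i/c_i^2 hits the target
    c = brentq(lambda c: 6 * b_of_c(c) / c**2 - breakdown, 0.5, 50.0)
    b = b_of_c(c)
    w1 = rho_dot(d, c) / d
    w2 = p_i * w1
    w3 = rho_dot(d, c) * d - rho(d, c) + b
    return w1, w2, w3
```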

The generality of (8) and (9) is that they are simply estimating equations. Unlike (6) and (7), the estimating equations may not correspond to the score functions of a particular log likelihood. Thus, the EM algorithm based on the multivariate \(t\)-distribution in (2) to (5) does not apply to (8) and (9). However, a slight modification of (4) and (5) yields solutions to (8) and (9). Specifically, the E-step is the same as in (2) and (3). Let \(w_{i1}^{(j)}\), \(w_{i2}^{(j)}\) and \(w_{i3}^{(j)}\) be evaluated at \(d_{i}^{(j)}\). The M-step is replaced by

$$\begin{aligned}&\displaystyle {\varvec{\mu }}^{(j+1)}=\frac{\sum _{i=1}^nw_{i1}^{(j)} \hat{\mathbf{x}}_{ic}^{(j)}}{\sum _{i=1}^nw_{i1}^{(j)}},\end{aligned}$$
(12)
$$\begin{aligned}&\displaystyle \mathbf{\Sigma }^{(j+1)}=\frac{\sum _{i=1}^n[w_{i2}^{(j)} (\hat{\mathbf{x}}_{ic}^{(j)}-{\varvec{\mu }}^{(j+1)})(\hat{\mathbf{x}}_{ic}^{(j)} -{\varvec{\mu }}^{(j+1)})'+w_{i3}^{(j)}\mathbf{C}_i^{(j)}]}{\sum _{i=1}^nw_{i3}^{(j)}}. \end{aligned}$$
(13)

Following Little and Smith (1987), we will call (12) and (13) the robust (R) step. Notice that the ER algorithm in (2), (3), (12) and (13) is a special case of the iteratively reweighted least squares algorithm, whose convergence properties are studied by Green (1984). Denote by \(\hat{{\varvec{\mu }}}\) and \(\hat{\mathbf{\Sigma }}\) the converged values of the ER algorithm. The proof that \(\hat{{\varvec{\mu }}}\) and \(\hat{\mathbf{\Sigma }}\) solve Eqs. (8) and (9) is given in Appendix A.
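A minimal sketch of the whole ER iteration (2), (3), (12) and (13) follows, reusing the hypothetical e_step_case helper sketched earlier; the weight functions sketched above (Huber-type or biweight-type) can be passed as weight_fn, and the \(t\)-distribution-based EM algorithm is recovered when \(w_{i1}=w_{i2}=(m+p_i)/(m+d_i^2)\) and \(w_{i3}=1\). Names and the convergence rule are illustrative assumptions.

```python
import numpy as np

def er_estimate(X, weight_fn, mu0, Sigma0, max_iter=500, tol=1e-8):
    """ER algorithm for Eqs. (8)-(9): X is an n x p array with NaN for missing
    entries, weight_fn(d, p_i) returns (w_i1, w_i2, w_i3), and (mu0, Sigma0)
    are robust starting values."""
    mu, Sigma = np.array(mu0, dtype=float), np.array(Sigma0, dtype=float)
    for _ in range(max_iter):
        cases = []
        for x in X:
            obs = ~np.isnan(x)
            # E-step, Eqs. (2)-(3), via the e_step_case helper sketched earlier
            p_i, d2, x_hat, C = e_step_case(x, obs, mu, Sigma)
            w1, w2, w3 = weight_fn(np.sqrt(d2), p_i)
            cases.append((w1, w2, w3, x_hat, C))
        # R-step, Eq. (12): weighted mean of the completed data
        mu_new = (sum(w1 * x_hat for w1, _, _, x_hat, _ in cases)
                  / sum(w1 for w1, *_ in cases))
        # R-step, Eq. (13): weighted scatter plus the conditional covariances
        num = sum(w2 * np.outer(x_hat - mu_new, x_hat - mu_new) + w3 * C
                  for _, w2, w3, x_hat, C in cases)
        Sigma_new = num / sum(w3 for _, _, w3, _, _ in cases)
        done = (np.max(np.abs(mu_new - mu)) < tol
                and np.max(np.abs(Sigma_new - Sigma)) < tol)
        mu, Sigma = mu_new, Sigma_new
        if done:
            break
    return mu, Sigma

# Example usage (illustrative): Huber-type weights with varphi = 0.1
# mu_hat, Sigma_hat = er_estimate(X, lambda d, p_i: huber_weights(d, p_i, 0.1), mu0, Sigma0)
```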

Equations (8) and (9) can also be solved using the Newton-Raphson algorithm (Kelley 2003), which involves the derivatives of each term in (8) and (9) with respect to \({\varvec{\mu }}\) and \({\varvec{\sigma }}\). Notice that \(w_{i1}=w_{i1}(d_i)\), \(w_{i2}=w_{i2}(d_i)\), and \(w_{i3}=w_{i3}(d_i)\) are functions of \({\varvec{\mu }}\) and \({\varvec{\sigma }}\), and their derivatives have to be computed at every iteration in addition to the weights themselves. Also notice that \({\varvec{\mu }}_i\) and \(\mathbf{\Sigma }_i\) corresponding to different subsets of \(\mathbf{x}\) contain distinct elements, and the derivatives need to be coded separately for each observed pattern in addition to accounting for different observed values of \(\mathbf{x}_i\). Thus, although the Newton-Raphson algorithm can be used to solve (8) and (9), its coding is more involved than that of the ER algorithm. The Newton-Raphson algorithm may also take longer than the ER algorithm to reach convergence.

We now discuss the convergence properties of \(\hat{{\varvec{\mu }}}\) and \(\hat{\mathbf{\Sigma }}\) as \(n\rightarrow \infty \), which are different from the convergence properties of the ER or Newton-Raphson algorithm that yielded \(\hat{{\varvec{\mu }}}\) and \(\hat{\mathbf{\Sigma }}\). When \(\mathbf{x}\) follows an elliptical distribution and there are no missing values, \(\hat{{\varvec{\mu }}}\) is consistent and \(\hat{\mathbf{\Sigma }}\) converges to \(\kappa \mathbf{\Sigma }\) for a certain \(\kappa >0\), where \(\mathbf{\Sigma }\) is the dispersion matrix of the elliptical distribution (Maronna 1976). With missing values that are MAR, the estimates \(\hat{{\varvec{\mu }}}\) and \(\hat{\mathbf{\Sigma }}\) are consistent and asymptotically most efficient when the left sides of Eqs. (8) and (9) are the score functions corresponding to the true population distribution of \(\mathbf{x}_i\) obtained by ignoring the missing values (Rubin 1976). For other scenarios, let \(\mathbf{g}({\varvec{\nu }})=(\mathbf{g}_1'({\varvec{\nu }}),\mathbf{g}_{2}'({\varvec{\nu }}))'\) with \(\mathbf{g}_1({\varvec{\nu }})\) and \(\mathbf{g}_{2}({\varvec{\nu }})\) being defined as the summation of the functions on the left sides of (8) and (9), respectively, where \({\varvec{\nu }}=({\varvec{\mu }}',{\varvec{\sigma }}')'\). Then, under a set of regularity conditions (e.g., Yuan and Jennrich 1998), the estimate \(\hat{{\varvec{\nu }}}=(\hat{{\varvec{\mu }}}',\hat{{\varvec{\sigma }}}')'\) obtained by the ER algorithm converges to a vector \({\varvec{\nu }}^*\) that satisfies \(E[\mathbf{g}({\varvec{\nu }}^*)]=\mathbf{0}\), where the expectation is with respect to the true distribution of each observed \(\mathbf{x}_i\). Since the true population distribution of the observed sample is typically unknown in practice, as is the missing-data mechanism behind each missing value, it might be hard to know the exact properties of \(\hat{{\varvec{\mu }}}\) and \(\hat{\mathbf{\Sigma }}\) in a specific application. We will use Monte Carlo simulation to evaluate the efficiency of different estimators in the next section.

Little and Smith (1987) give an ER algorithm for computing robust means and dispersion matrix but they do not provide the corresponding estimating equations. Their E-step is the same as in (2) and (3), and their R-step is

$$\begin{aligned}&\qquad \qquad \qquad \qquad {\varvec{\mu }}^{(j+1)}=\frac{\sum _{i=1}^nw_{i}^{(j)} \hat{\mathbf{x}}_{ic}^{(j)}}{\sum _{i=1}^nw_{i}^{(j)}},\end{aligned}$$
(14)
$$\begin{aligned}&\mathbf{\Sigma }^{(j+1)}=\frac{\sum _{i=1}^n\left[ \left( w_{i}^{(j)}\right) ^2\left( \hat{\mathbf{x}}_{ic}^{(j)} -{\varvec{\mu }}^{(j+1)}\right) \left( \hat{\mathbf{x}}_{ic}^{(j)}-{\varvec{\mu }}^{(j+1)}\right) '+\mathbf{C}_i^{(j)}\right] }{\sum _{i=1}^n\left( w_{i}^{(j)}\right) ^2-1}, \end{aligned}$$
(15)

where the weight function \(w_i^{(j)}=w_i(d_i^{(j)})\) or \(w_i=w_i(d_i)\) is given in Little and Smith (1987). Using Eqs. (31) and (32) in the appendix of this article, it can be shown that (14) and (15) solve Eq. (6) and

$$\begin{aligned} \begin{array}{l} \displaystyle \sum _{i=1}^n\frac{\partial \mathrm{vec}'(\mathbf{\Sigma }_i)}{\partial {\varvec{\sigma }}}\mathbf{V}_i\mathrm{vec}[w_i^2(\mathbf{x}_i-{\varvec{\mu }}_i) (\mathbf{x}_i-{\varvec{\mu }}_i)'-\mathbf{\Sigma }_i]\\ \displaystyle +\sum _{i=1}^n(1-w_i^2)\frac{\partial \mathrm{vec}'(\mathbf{\Sigma })}{\partial {\varvec{\sigma }}}\mathrm{vec}(\mathbf{\Sigma }^{-1}) +\frac{\partial \mathrm{vec}'(\mathbf{\Sigma })}{\partial {\varvec{\sigma }}}\mathrm{vec}(\mathbf{\Sigma }^{-1}) =\mathbf{0}. \end{array} \end{aligned}$$
(16)

Cheng and Victoria-Feser (2002) proposed an ER algorithm to compute S-estimators with missing data in their Eqs. (22) and (23), which can be written as

$$\begin{aligned} \sum _{i=1}^nw_{i1}\mathbf{\Sigma }_i^{-1}(\hat{\mathbf{x}}_{ic}-{\varvec{\mu }})=\mathbf{0}\end{aligned}$$
(17)

and

$$\begin{aligned} \sum _{i=1}^n\left\{ w_{i2}[(\hat{\mathbf{x}}_{ic}-{\varvec{\mu }}) (\hat{\mathbf{x}}_{ic}-{\varvec{\mu }})'+\mathbf{C}_i]-w_{i3}\mathbf{\Sigma }\right\} =\mathbf{0}. \end{aligned}$$
(18)

Using the results in Appendix A, one can show that the solution to (17) satisfies (8), and the solution to (18) satisfies

$$\begin{aligned} \sum _{i=1}^n&\Bigg \{\frac{\partial \mathrm{vec}'(\mathbf{\Sigma }_i)}{\partial {\varvec{\sigma }}}\mathbf{V}_i\mathrm{vec}\{w_{i2}[(\mathbf{x}_i-{\varvec{\mu }}_i) (\mathbf{x}_i-{\varvec{\mu }}_i)'-\mathbf{\Sigma }_i]\}\nonumber \\&\quad \quad +(w_{i2}-w_{i3}) \frac{\partial \mathrm{vec}'(\mathbf{\Sigma })}{\partial {\varvec{\sigma }}}\mathrm{vec}(\mathbf{\Sigma })\Bigg \}= \mathbf{0}. \end{aligned}$$
(19)

Cheng and Victoria-Feser (2002) also proposed a modification to (15) of Little and Smith’s R-step, which is given by

$$\begin{aligned} \mathbf{\Sigma }^{(j+1)}=\frac{\sum _{i=1}^n(w_{i}^{(j)})^2\left[ \left( \hat{\mathbf{x}}_{ic}^{(j)} -{\varvec{\mu }}^{(j+1)}\right) \left( \hat{\mathbf{x}}_{ic}^{(j)}-{\varvec{\mu }}^{(j+1)}\right) '+\mathbf{C}_i^{(j)}\right] }{\sum _{i=1}^n(w_{i}^{(j)})^2-1}. \end{aligned}$$
(20)

It can be shown that (20) corresponds to the estimating equation

$$\begin{aligned} \sum _{i=1}^n\frac{\partial \mathrm{vec}'(\mathbf{\Sigma }_i)}{\partial {\varvec{\sigma }}} \mathbf{V}_i\mathrm{vec}\{w_{i}^2[(\mathbf{x}_i-{\varvec{\mu }}_i)(\mathbf{x}_i-{\varvec{\mu }}_i)'-\mathbf{\Sigma }_i]\} +\frac{\partial \mathrm{vec}'(\mathbf{\Sigma })}{\partial {\varvec{\sigma }}}\mathrm{vec}(\mathbf{\Sigma }^{-1})= \mathbf{0}.\nonumber \\ \end{aligned}$$
(21)

Because a weight is attached to \((\mathbf{x}_i-{\varvec{\mu }}_i)(\mathbf{x}_i-{\varvec{\mu }}_i)'\) in (16), (19) and (21), solutions to each of the three equations might be robust. However, these three equations are not as natural as (9) when considered as generalizations of (7) or of the equations satisfied by M- and S-estimators as well as by elliptical-distribution-based MLEs for samples without missing values (Kano et al. 1993; Lopuhaä 1989; Maronna 1976; Rocke 1996). Cheng and Victoria-Feser (2002) called (17) and (18) estimating equations. Clearly, (17) and (18) involve the imputed/estimated data \(\hat{\mathbf{x}}_{im}=E(\mathbf{x}_{im}|\mathbf{x}_i,{\varvec{\mu }},\mathbf{\Sigma })\), whereas (8) and (9) do not. Equations (8) and (9) are not only consistent with the literature but also easily generalizable. When structural models \({\varvec{\mu }}({\varvec{\theta }}_1)\) and \(\mathbf{\Sigma }({\varvec{\theta }}_2)\) are of interest and there are no overlapping parameters between \({\varvec{\theta }}_1\) and \({\varvec{\theta }}_2\), the corresponding estimating equations are obtained after replacing the \({\varvec{\mu }}\) in the denominator of (8) and the \({\varvec{\sigma }}\) in (9) by \({\varvec{\theta }}_1\) and \({\varvec{\theta }}_2\), respectively. When \({\varvec{\theta }}_1\) and \({\varvec{\theta }}_2\) share common parameters, let \({\varvec{\theta }}\) be the vector of all the parameters; the corresponding estimating equation is obtained after replacing both the \({\varvec{\mu }}\) in the denominator of (8) and the \({\varvec{\sigma }}\) in (9) by \({\varvec{\theta }}\), and adding the two equations. It is not clear how to generalize (17) and (18) to structural models.

The identification of the estimating equations for each ER algorithm allows us to obtain a consistent estimate of the covariance matrix of the resulting robust means and dispersion matrix. Let \(\mathbf{g}_i({\varvec{\nu }})=(\mathbf{g}_{i1}'({\varvec{\nu }}),\mathbf{g}_{i2}'({\varvec{\nu }}))'\) with

$$\begin{aligned} \mathbf{g}_{i1}({\varvec{\nu }})=w_{i1}(d_i)\frac{\partial {\varvec{\mu }}_i'}{\partial {\varvec{\mu }}}\mathbf{\Sigma }_i^{-1}(\mathbf{x}_i-{\varvec{\mu }}_i) \end{aligned}$$

and

$$\begin{aligned} \mathbf{g}_{i2}({\varvec{\nu }})=\frac{\partial \mathrm{vec}'(\mathbf{\Sigma }_i)}{\partial {\varvec{\sigma }}}\mathbf{V}_i\mathrm{vec}[w_{i2}(d_i)(\mathbf{x}_i-{\varvec{\mu }}_i) (\mathbf{x}_i-{\varvec{\mu }}_i)'-w_{i3}(d_i)\mathbf{\Sigma }_i]. \end{aligned}$$

According to the theory of estimating equations (Godambe 1960; Huber 1967), under standard regularity conditions (Yuan and Jennrich 1998), the asymptotic covariance matrix of \(\hat{{\varvec{\nu }}}=(\hat{{\varvec{\mu }}}',\hat{{\varvec{\sigma }}}')'\) obtained at the convergence of (2), (3), (12) and (13) is consistently estimated by

$$\begin{aligned} \hat{\Gamma }=\left[ \sum _{i=1}^n\frac{\partial \mathbf{g}_i(\hat{{\varvec{\nu }}})}{\partial \hat{{\varvec{\nu }}}'}\right] ^{-1} \left[ \sum _{i=1}^n\mathbf{g}_i(\hat{{\varvec{\nu }}})\mathbf{g}_i'(\hat{{\varvec{\nu }}})\right] \left[ \sum _{i=1}^n\frac{\partial \mathbf{g}_i'(\hat{{\varvec{\nu }}})}{\partial \hat{{\varvec{\nu }}}}\right] ^{-1}, \end{aligned}$$
(22)

where \(w_{i1}\), \(w_{i2}\) and \(w_{i3}\) are also functions of \({\varvec{\nu }}\) when evaluating the derivatives. Consistent covariance matrices can also be obtained for the estimators satisfying (16), (19) or (21) by properly defining \(\mathbf{g}_i({\varvec{\nu }})\) such that these equations can be written as \(\sum _{i=1}^n\mathbf{g}_i({\varvec{\nu }})=\mathbf{0}\).
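A generic numerical sketch of (22) is given below: it takes a user-supplied function returning the \(n\) stacked vectors \(\mathbf{g}_i({\varvec{\nu }})\) and forms the sandwich with a forward-difference Jacobian of the summed estimating function. A careful implementation would use the analytic derivatives implied above; the finite-difference step and the names here are illustrative assumptions.

```python
import numpy as np

def sandwich_cov(g, nu_hat, eps=1e-6):
    """g(nu) returns an n x q array whose i-th row is g_i'(nu); nu_hat is the
    length-q solution of the estimating equations."""
    G = g(nu_hat)                          # rows g_i'(nu_hat)
    q = G.shape[1]
    meat = G.T @ G                         # sum_i g_i g_i'
    # forward-difference Jacobian of sum_i g_i(nu) at nu_hat
    J = np.zeros((q, len(nu_hat)))
    g_sum = G.sum(axis=0)
    for k in range(len(nu_hat)):
        nu_k = nu_hat.copy()
        nu_k[k] += eps
        J[:, k] = (g(nu_k).sum(axis=0) - g_sum) / eps
    bread = np.linalg.inv(J)
    return bread @ meat @ bread.T          # Eq. (22)
```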

Now we turn to starting values for the EM or ER algorithm. Since Eqs. (6) and (7) or (8) and (9) may have multiple solutions, robust starting values might be needed for the EM algorithm in (2) to (5) or the ER algorithm in (2), (3), (12) and (13) to yield estimates that are least affected by data contamination or outliers. The minimum covariance determinant (MCD) estimator has been suggested as the starting value for \(\mathbf{\Sigma }\) because its breakdown point is close to 50 % (e.g., Cheng and Victoria-Feser 2002). Instead, we prefer the estimates proposed by Mehrotra (1995), because no iteration is needed in their calculation. In Mehrotra's proposal, each mean is estimated by the marginal median, each dispersion \(\sigma _{jj}\) is obtained by rescaling the median absolute deviation (MAD) from the median, and \(\sigma _{jk}\) is obtained by combining MAD and the median of all pairwise slopes of the form \((x_{i_2j}-x_{i_1j})/(x_{i_2k}-x_{i_1k})\), \(1\le i_1<i_2\le n\), excluding cases with \(x_{i_2k}=x_{i_1k}\). As an estimator of the slope of the regression of \(x_j\) on \(x_k\), the median of the pairwise slopes was originally proposed by Theil (1950) and Sen (1968), and has been shown by Wilcox (1998) to enjoy not only good robustness properties but also good small-sample efficiency. Thus, we will use Mehrotra's proposal to obtain starting values for the EM and ER algorithms in the simulation study in the next section, where the marginal medians and pairwise slopes are computed from the observed data. In particular, when the starting \(\mathbf{\Sigma }^{(0)}\) is not positive definite, an eigenvalue decomposition of \(\mathbf{\Sigma }^{(0)}\) is performed, and a new \(\mathbf{\Sigma }^{(0)}\) is obtained by replacing all the eigenvalues smaller than .01 with .01 in the decomposition. Such a process was referred to as “filtering” by Mehrotra (1995); it is typically not needed unless \(n\) is small or the missing data proportion is high, together with a nearly singular population \(\mathbf{\Sigma }\).
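The following is a rough sketch of such starting values in the spirit of Mehrotra's proposal. The consistency factor 1.4826 for the MAD and the particular combination used for \(\sigma _{jk}\) (the median pairwise slope of \(x_j\) on \(x_k\) times the MAD-based variance of \(x_k\), then symmetrized) are assumptions of this sketch rather than Mehrotra's (1995) exact formulas; the eigenvalue "filtering" follows the description above, and all names are illustrative.

```python
import numpy as np

def theil_sen_slope(xj, xk):
    """Median of pairwise slopes (x_{i2,j}-x_{i1,j})/(x_{i2,k}-x_{i1,k}) over
    pairs with both variables observed and x_{i2,k} != x_{i1,k}."""
    ok = ~np.isnan(xj) & ~np.isnan(xk)
    xj, xk = xj[ok], xk[ok]
    slopes = []
    for i1 in range(len(xk)):
        for i2 in range(i1 + 1, len(xk)):
            if xk[i2] != xk[i1]:
                slopes.append((xj[i2] - xj[i1]) / (xk[i2] - xk[i1]))
    return np.median(slopes)

def starting_values(X, min_eig=0.01):
    p = X.shape[1]
    mu0 = np.nanmedian(X, axis=0)                   # marginal medians
    mad = np.nanmedian(np.abs(X - mu0), axis=0)     # median absolute deviations
    sd = 1.4826 * mad                               # MAD rescaled (assumed consistency factor)
    Sigma0 = np.diag(sd**2)
    for j in range(p):
        for k in range(p):
            if j != k:
                Sigma0[j, k] = theil_sen_slope(X[:, j], X[:, k]) * sd[k]**2
    Sigma0 = (Sigma0 + Sigma0.T) / 2                # symmetrize
    # "filtering": replace eigenvalues below min_eig to keep Sigma0 positive definite
    vals, vecs = np.linalg.eigh(Sigma0)
    vals = np.maximum(vals, min_eig)
    return mu0, vecs @ np.diag(vals) @ vecs.T
```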

3 Monte Carlo results

Although the main purpose of the paper is to establish the relationship between estimating equations and the ER algorithm for estimating means and dispersion matrix with missing data, it is informative to see how different estimators perform when the underlying population varies. A Monte Carlo study is conducted for such a purpose. Seven estimators are compared in the study: NMLEs; \(t\)-distribution-based MLEs with degrees of freedom 3 and 1, respectively; M-estimators satisfying Eqs. (8) and (9) with \(w_{i3}=1\) and with \(w_{i1}\) and \(w_{i2}\) determined by the Huber-type weights in (10) with \(\varphi =0.1\) and \(0.2\), respectively; and M/S-estimators satisfying Eqs. (8) and (9) with \(w_{i1}\), \(w_{i2}\) and \(w_{i3}\) determined by the biweight function in (11) with large sample breakdown points \(6b_i/c_i^2=0.1\) and 0.2, respectively. They are denoted respectively by \(Nm\), \(t(3)\), \(t(1)\), H(0.1), H(0.2), B(0.1), and B(0.2) in our presentation.

Let \(\mathbf{1}_p\) be a vector of \(p\) 1s, and \(\mathbf{I}_p\) be the identity matrix of size \(p\). We chose \(p=5\) with population mean vector \({\varvec{\mu }}_0=\mathbf{1}_5\) and covariance matrix \(\mathbf{\Sigma }_0=0.5(\mathbf{I}_5+\mathbf{1}_5\mathbf{1}_5')\), which is also a correlation matrix with all the correlations equal to 0.5.

Five distribution conditions are used to generate samples with missing data. Let \(\mathbf{A}\) be the lower triangular matrix satisfying \(\mathbf{A}\mathbf{A}'=\mathbf{\Sigma }_0\). The five conditions are respectively (C1) the normal distribution according to \(\mathbf{x}=\mathbf{A}\mathbf{z}+{\varvec{\mu }}_0\), where \(\mathbf{z}\sim N_5(\mathbf{0},\mathbf{I}_5)\); (C2) an elliptical distribution according to \(\mathbf{x}=r\mathbf{A}\mathbf{z}+{\varvec{\mu }}_0\), where \(r\) follows the standardized exponential distribution and is independent of \(\mathbf{z}\); (C3) a skew distribution according to \(\mathbf{x}=r\mathbf{A}\mathbf{u}+{\varvec{\mu }}_0\), where \(r\) follows the same distribution as in C2, \(\mathbf{u}=(u_1,u_2,\ldots , u_5)'\), the \(u_j\)s are independent of each other and of \(r\), and each \(u_j\) follows the standardized gamma distribution with shape parameter 3; (C4) a contaminated normal distribution with 10 % of the sample from C1 being multiplied by 3; (C5) a contaminated normal distribution with 20 % of the sample from C1 being multiplied by 3. It is easy to see that \(E(\mathbf{x})={\varvec{\mu }}_0\) and \(\mathrm{Cov}(\mathbf{x})=\mathbf{\Sigma }_0\) for the population in C1. It is also straightforward to show that the population mean vector and covariance matrix in C2 and C3 are given by \({\varvec{\mu }}_0\) and \(\mathbf{\Sigma }_0\), respectively. In C4 and C5, the majority of the cases correspond to \(N_5({\varvec{\mu }}_0,\mathbf{\Sigma }_0)\), and the observed samples are skewed in distribution. Notice that, with \(p=5\), the breakdown point of an M-estimator is limited by \((p+1)^{-1}=0.167\). C4 and C5 are chosen to examine how the robust estimators perform when the percentage of contamination is below and above this breakdown point. Clearly, only NML is asymptotically optimal for the normal distribution in C1; no other method is known to work best for any of the five conditions. Such a design is motivated by the fact that we typically do not know which method works best for a given data set whose population distribution is unknown.

Three sample sizes are used: \(n=100\), 300, 500. For each sample, \(x_1\) and \(x_2\) are fully observed; \(x_3\), \(x_4\) and \(x_5\) are missing when \(x_1+x_2\) is greater than certain threshold values. Thus, there are two observed patterns for each sample and the missing values are MAR. The threshold values are chosen so that \((x_3,x_4,x_5)\) are missing at rates of about 10, 20 and 30 %, respectively. Data contamination in C4 and C5 is done after each sample with missing values is obtained, and the percentage of contamination is proportional to the number of cases in each observed pattern. For each combination of population distribution, sample size and missing data proportion, \(N_r=1000\) replications are used.
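For concreteness, one replication under C2 with a roughly 20 % missing rate could be generated as follows. The threshold value 3.46 (calibrated to the normal condition C1) and the reading of the standardized exponential as an exponential(1) variable shifted to mean zero and unit variance are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 300
mu0 = np.ones(p)
Sigma0 = 0.5 * (np.eye(p) + np.ones((p, p)))
A = np.linalg.cholesky(Sigma0)                 # lower triangular, A A' = Sigma0

z = rng.standard_normal((n, p))
r = rng.exponential(1.0, size=n) - 1.0         # standardized exponential: mean 0, variance 1
X = mu0 + r[:, None] * (z @ A.T)               # condition C2: x = r*A*z + mu0

threshold = 3.46                               # about 20% missing under C1; differs slightly under C2
miss = X[:, 0] + X[:, 1] > threshold           # x3, x4, x5 are MAR given x1 + x2
X[miss, 2:] = np.nan
```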

Notice that, under the assumption of an elliptical population distribution and without missing values, an M- or S-estimator \(\hat{\mathbf{\Sigma }}\) is known to converge to a matrix that is proportional to \(\mathbf{\Sigma }_0\) (Lopuhaä 1989; Maronna 1976), and the proportionality factor depends on the underlying population and the weights used in the estimation process. It is therefore not proper to directly compare the bias in different estimates of the dispersion matrix. In our evaluation of different estimators, we put all the estimators on the same scale by obtaining the corresponding correlation matrix from each estimate of the dispersion matrix. Let \(\hat{{\varvec{\mu }}}\) be the vector of the estimates of the 5 means and \(\hat{{\varvec{\rho }}}\) be the vector of estimates of the 10 correlations obtained for each sample by one of the 7 estimation methods. With \(\hat{{\varvec{\gamma }}}_i=(\hat{{\varvec{\mu }}}_i',\hat{{\varvec{\rho }}}_i')'\) for the \(i\)th replication, the bias, variance and mean square error (MSE) for the \(j\)th element of \(\hat{{\varvec{\gamma }}}\) are calculated as

$$\begin{aligned} \mathrm{Bias}_j&= \bar{\gamma }_j-\gamma _{j0},\\ \mathrm{Var}_j&= \frac{1}{N_r-1}\sum _{i=1}^{N_r}(\hat{\gamma }_{ij}-\bar{\gamma }_{j})^2, \end{aligned}$$

and

$$\begin{aligned} \mathrm{MSE}_j=\frac{1}{N_r}\sum _{i=1}^{N_r}(\hat{\gamma }_{ij}-\gamma _{j0})^2, \end{aligned}$$

respectively, where \(\bar{\gamma }_j=\sum _{i=1}^{N_r}\hat{\gamma }_{ij}/N_r\). Notice that our study includes 3 missing-data conditions, 3 sample-size conditions, 5 distribution conditions, and 7 estimation methods. With a total of \(3\times 3\times 5\times 7=315\) conditions and 15 parameter estimates, many tables are needed if we report the biases, variances and MSEs of the estimates for individual parameters. To save space, we choose to report the average of absolute bias, variance and MSE across the 15 parameters according to

$$\begin{aligned} \mathrm{Bias}=\frac{1}{15}\sum _{j=1}^{15}|\mathrm{Bias}_j|, \;\; \mathrm{Var}=\frac{1}{15}\sum _{j=1}^{15}\mathrm{Var}_j, \;\; \mathrm{MSE}=\frac{1}{15}\sum _{j=1}^{15}\mathrm{MSE}_j. \end{aligned}$$

These are contained in 5 tables corresponding to the 5 distribution conditions. Because most of the quantities are in the 3rd decimal place, they are multiplied by 10 in the tables for us to see more details and to save space.
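As a small illustration of these summary measures, the following sketch computes them from an array of replication estimates; the names are illustrative only.

```python
import numpy as np

def summarize(gamma_hat, gamma0):
    """gamma_hat: N_r x 15 array of estimates; gamma0: length-15 true values."""
    gamma_bar = gamma_hat.mean(axis=0)
    bias_j = gamma_bar - gamma0
    var_j = gamma_hat.var(axis=0, ddof=1)              # divides by N_r - 1
    mse_j = ((gamma_hat - gamma0) ** 2).mean(axis=0)   # divides by N_r
    # averages across the 15 parameters, multiplied by 10 as in the tables
    return 10 * np.abs(bias_j).mean(), 10 * var_j.mean(), 10 * mse_j.mean()
```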

Table 1 contains the results of bias, variance and MSE of \(\hat{{\varvec{\gamma }}}\) when \(\mathbf{x}\) is normally distributed (C1). For easy comparison, the smallest entry among the 7 estimation methods is underlined and the largest entry is put in bold. For each method, the average across the 9 conditions (3-sample-size by 3-missing-proportion) is also included on the right of the table. It is clear that NMLEs (those following \(Nm\)) enjoy the smallest bias, variance and MSE across all the conditions. The estimates following B(0.2) have the largest bias; whereas those following \(t(1)\) have the largest variance and also the largest MSE on average. But estimates following B(0.2) have the largest MSE at \(n=300\) and 500. Missing data proportion has little effect on the performance of the 7 estimation methods.

Table 1 Averages of empirical absolute \(\mathrm{bias}\times 10\), \(\mathrm{variance}\times 10\) and \(\mathrm{MSE}\times 10\) of \(\hat{{\varvec{\mu }}}\) and \(\hat{\rho }_{ij}\) by seven methods, (C1) normally distributed population

Table 2 contains the results of bias, variance and MSE of \(\hat{{\varvec{\gamma }}}\) when \(\mathbf{x}\) is elliptically distributed (C2). While estimates following B(0.2) continue to have the largest bias in 8 out of the 9 conditions, the method yielding estimates with the smallest bias varies across conditions of sample size and missing data proportion. Estimates following \(Nm\) have the largest variance and MSE on average, whereas estimates with the smallest variance and MSE are given by ML based on \(\mathbf{x}\sim Mt_5({\varvec{\mu }},\mathbf{\Sigma },1)\).

Table 2 Averages of empirical absolute \(\mathrm{bias}\times 10\), variance \(\times 10\) and MSE \(\times 10\) of \(\hat{{\varvec{\mu }}}\) and \(\hat{\rho }_{ij}\) by seven methods, (C2) elliptically distributed population

The results under a skew population distribution (C3) are in Table 3. As in Table 2, estimates following \(Nm\) have the largest variance and MSE across the 9 conditions. But estimates following \(Nm\) have the smallest bias at \(n=100\), and B(0.2) enjoys the smallest bias at \(n=300\) and \(500\). Estimates following \(t(1)\) have the smallest variances and also the smallest MSEs in 8 out of the 9 conditions. Missing data proportion has little effect on the performance of different estimation methods. Notice that the robust estimators may not be consistent when the population distribution is skewed. However, the average MSEs in Table 3 following all the robust methods are smaller than those following \(Nm\). Biases following B(0.2) at \(n=300\) and 500 are also smaller than those following \(Nm\).

Table 3 Averages of empirical absolute \(\mathrm{bias}\times 10\), variance \(\times 10\) and MSE \(\times 10\) of \(\hat{{\varvec{\mu }}}\) and \(\hat{\rho }_{ij}\) by seven methods, (C3) skew distributed population

Table 4 contains the results when 10 % of the sample from a normally distributed population is contaminated (C4). The estimates following \(Nm\) are most biased, least efficient and consequently have the largest MSEs. Least biased estimates are given by \(t(1)\) or \(t(3)\), whereas most efficient estimates are given by \(t(3)\) or H(0.2), depending on missing data proportion. Estimates with least MSEs are given by \(t(3)\) in 8 out of the 9 conditions, and \(t(1)\) enjoys the smallest MSE at \(n=500\) and 10 % of missing data.

Table 4 Averages of empirical absolute \(\mathrm{bias}\times 10\), variance \(\times 10\) and MSE \(\times 10\) of \(\hat{{\varvec{\mu }}}\) and \(\hat{\rho }_{ij}\) by seven methods, (C4) 10 % of normally distributed samples are contaminated

Table 5 contains the results when 20 % of the sample from a normally distributed population is contaminated (C5). Again, the NMLEs (those following \(Nm\)) are most biased, least efficient and consequently have the largest MSEs. The estimates following \(t(1)\) are least biased and have the smallest MSEs, whereas estimates following \(t(3)\) have the least variances. Missing data proportion or sample size has little effect on the performance of the different methods.

Table 5 Averages of empirical absolute \(\mathrm{bias}\times 10\), variance \(\times 10\) and MSE \(\times 10\) of \(\hat{{\varvec{\mu }}}\) and \(\hat{\rho }_{ij}\) by seven methods, (C5) 20 % of normally distributed samples are contaminated

Comparing the averaged numbers (the last column) in each table and across the 5 tables, we may notice that NMLEs differ from each of the robust estimates substantially in both bias and variance. The robust estimates differ more in bias than in variance. Except for those following \(t(1)\), all the other estimates attain the largest averaged bias and MSE under C5, implying that the contaminated distribution is the least favorable for them among the 5 distribution conditions. Although the population distribution in C3 is skewed, the sizes of the average MSEs in Table 3 following all the robust methods are comparable to those in Tables 1 and 2. The average biases corresponding to B(0.1) and B(0.2) in Table 3 are even smaller than those in Tables 1 and 2.

In summary, NML is most preferable when \(\mathbf{x}\sim N_p({\varvec{\mu }},\mathbf{\Sigma })\), but it can perform badly when data are nonnormally distributed or contaminated. Each robust method also has its pros and cons, depending on the underlying population distribution of the sample. With 20 % of the sample from a normally distributed population being contaminated, we might expect B(0.2) to perform better than it does in Table 5. This underperformance can be understood from the definition of the breakdown point, which is the proportion of extreme observations an estimator can tolerate before becoming arbitrarily large or small; it is not related to optimizing bias, variance or MSE. The reason for not observing an advantage of B(0.2) in Table 5 is probably not that the contaminated observations are not extreme enough, since we also studied the condition of multiplying 20 % of normally distributed samples from C1 by 5 and found results similar to those in Table 5.

For contaminated data as in C5, it is very likely that B(\(\alpha \)) with an \(\alpha \) greater than 0.2 would work better than B(0.2). Similarly, other degrees of freedom for the multivariate \(t\)-distribution and other tuning parameters in the Huber-type weights may yield more efficient and less biased estimates. Our results are consistent with what Richardson and Welsh (1995) found in the context of mixed linear models, where no method performs best across all conditions.

4 Applications

As mentioned in the introduction, means and covariance/dispersion matrices are behind many commonly used statistical methods. In this section, we discuss applications of robust means and dispersion matrix in linear regression and growth curve models, both of which are widely used in various disciplines.

4.1 Regression models

Consider the regression model

$$\begin{aligned} y_i=\alpha +\mathbf{u}_i'{\varvec{\beta }}+e_i,\;\;\;i=1,2,\ldots , n. \end{aligned}$$
(23)

When all the \(n\) observations are completely observed, let \(s_{yy}\), \(\mathbf{s}_{uy}\), \(\mathbf{S}_{uu}\) be the sample variance of \(y_i\), vector of covariances of \(\mathbf{u}_i\) with \(y_i\), and covariance matrix of \(\mathbf{u}_i\), respectively. Then

$$\begin{aligned} \hat{{\varvec{\beta }}}=\mathbf{S}_{uu}^{-1}\mathbf{s}_{uy} \end{aligned}$$
(24)

is the MLE of \({\varvec{\beta }}\) under the assumption \(e_i\sim N(0,\sigma ^2)\) and that \(\mathbf{u}_i\) and \(e_i\) are independent. Without missing values and when the \(\mathbf{u}_i\)s are not subject to data contamination, a robust estimate of \({\varvec{\beta }}\) can be defined using estimating equations by assigning smaller weights to cases with larger residuals \(e_i=y_i-\alpha -\mathbf{u}_i'{\varvec{\beta }}\) (e.g., Hampel et al. 1986, pp. 311–312). However, with real data, both \(y_i\) and \(\mathbf{u}_i\) may contain missing values. If \(y_i\) is missing in (23), then \(e_i\) is not available even when \(\alpha \) and \({\varvec{\beta }}\) are known. If certain elements of \(\mathbf{u}_i\) are missing and \(y_i\) is observed, then the meaning of \(e_i\) based on observed \(\mathbf{u}_i\) is different from that in (23). Thus, it is not clear how to generalize robust regression from complete data to missing data by downweighting large residuals.

Notice that a robust estimate of \({\varvec{\beta }}\) parallel to (24) can still be obtained as long as robust estimates of \(\mathbf{\Sigma }_{uu}=\mathrm{Cov}(\mathbf{u}_i)\) and \({\varvec{\sigma }}_{uy}=\mathrm{Cov}(\mathbf{u}_i,y_i)\) or the corresponding dispersion matrices are available. With missing values, Little (1988) parameterizes \({\varvec{\sigma }}_{uy}=\mathbf{\Sigma }_{uu}{\varvec{\beta }}\) in formulating the EM algorithms for robust regression based on multivariate \(t\)-distributions. Such a parameterization is mathematically equivalent to letting \({\varvec{\sigma }}_{uy}\) and \(\mathbf{\Sigma }_{uu}\) be free parameters. Similarly, with \(\mathbf{x}_i=(y_i,\mathbf{u}_i')'\), the \({\varvec{\mu }}_i\) and \(\mathbf{\Sigma }_i\) in (8) and (9) can also be reparameterized using \(\alpha \), \({\varvec{\mu }}_u=E(\mathbf{u}_i)\), \({\varvec{\beta }}\), \(\mathbf{\Sigma }_{uu}\), and \(\sigma ^2=\mathrm{Var}(e_i)\). Since the ER algorithm corresponding to (8) and (9) is easier to program, there is no foreseeable advantage of using the regression parameterization. In particular, at the convergence of the ER algorithm, we obtain robust estimates of \(\alpha \) and \({\varvec{\beta }}\) as

$$\begin{aligned} \hat{\alpha }=\hat{\mu }_y-\hat{{\varvec{\mu }}}_u'\hat{{\varvec{\beta }}},\;\;\;\mathrm{and}\;\;\; \hat{{\varvec{\beta }}}=\hat{\mathbf{\Sigma }}_{uu}^{-1}\hat{{\varvec{\sigma }}}_{uy}. \end{aligned}$$

Since \(\hat{\alpha }\) and \(\hat{{\varvec{\beta }}}\) are functions of \(\hat{{\varvec{\mu }}}\) and \(\hat{\mathbf{\Sigma }}\), they will be consistent as long as \(\hat{{\varvec{\mu }}}\) is consistent and \(\hat{\mathbf{\Sigma }}\) converges to \(\kappa \mathbf{\Sigma }\) for a certain \(\kappa >0\), as \(n\rightarrow \infty \). With missing values that are MAR, although it is not clear under what conditions robust estimators are consistent beyond those noted in Sect. 2, the Monte Carlo results in Sect. 3 imply that robust estimators can perform much better than NMLEs under many conditions. With the \(\hat{\Gamma }\) in (22), consistent SEs of \(\hat{{\varvec{\beta }}}\) can be obtained from the so-called delta-method. Noticing that \({\varvec{\sigma }}=\mathrm{vech}(\mathbf{\Sigma })=(\sigma _{yy},{\varvec{\sigma }}_{uy}',\mathrm{vech}'(\mathbf{\Sigma }_{uu}))'\), we have

$$\begin{aligned} \dot{{\varvec{\beta }}}_1(\mathbf{\Sigma })&= \partial {\varvec{\beta }}(\mathbf{\Sigma })/\partial {\varvec{\mu }}'=\mathbf{0},\;\;\; \dot{{\varvec{\beta }}}_2(\mathbf{\Sigma })=\partial {\varvec{\beta }}(\mathbf{\Sigma })/ \partial {\varvec{\sigma }}'\nonumber \\&= (\mathbf{0},\mathbf{\Sigma }_{uu}^{-1}, -({\varvec{\beta }}'\otimes \mathbf{\Sigma }_{uu}^{-1})\mathbf{D}_{p-1}), \end{aligned}$$
(25)

where \(\mathbf{D}_{p-1}=\partial \mathrm{vec}(\mathbf{\Sigma }_{uu})/\partial \mathrm{vech}'(\mathbf{\Sigma }_{uu})\) is the \((p-1)^2\times [p(p-1)/2]\) duplication matrix (see e.g., Schott 2005, p. 313). It follows from the delta-method that the covariance matrix of \(\hat{{\varvec{\beta }}}\) is consistently estimated by

$$\begin{aligned} \mathrm{Cov}(\hat{{\varvec{\beta }}})=\dot{{\varvec{\beta }}}(\hat{\mathbf{\Sigma }})\hat{\Gamma }\ \dot{{\varvec{\beta }}}'(\hat{\mathbf{\Sigma }}) \;\;\;\mathrm{or}\;\;\; \mathrm{Cov}(\hat{{\varvec{\beta }}})=\frac{n}{n-p}\dot{{\varvec{\beta }}}(\hat{\mathbf{\Sigma }}) \hat{\Gamma }\dot{{\varvec{\beta }}}'(\hat{\mathbf{\Sigma }}), \end{aligned}$$
(26)

where \(\dot{{\varvec{\beta }}}(\hat{\mathbf{\Sigma }})=(\dot{{\varvec{\beta }}}_1(\hat{\mathbf{\Sigma }}),\dot{{\varvec{\beta }}}_2(\hat{\mathbf{\Sigma }}))\).
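To make the delta-method step concrete, the following sketch computes \(\hat{\alpha }\), \(\hat{{\varvec{\beta }}}\) and the covariance in (26) from robust estimates of the means and dispersion matrix, assuming \(\hat{\Gamma }\) is ordered as \(({\varvec{\mu }}',{\varvec{\sigma }}')'\) with \(y\) as the first variable of \(\mathbf{x}\); the helper names are illustrative and the finite-sample correction \(n/(n-p)\) is omitted.

```python
import numpy as np

def duplication_matrix(q):
    """D_q such that vec(S) = D_q vech(S) for a symmetric q x q matrix S."""
    D = np.zeros((q * q, q * (q + 1) // 2))
    col = 0
    for j in range(q):               # column index of the lower triangle
        for i in range(j, q):        # row index, i >= j
            D[j * q + i, col] = 1.0  # position of S[i, j] in column-major vec
            D[i * q + j, col] = 1.0  # position of S[j, i]
            col += 1
    return D

def regression_from_moments(mu_hat, Sigma_hat, Gamma_hat):
    """Robust (alpha, beta) for x = (y, u')' and the delta-method covariance of
    beta_hat implied by Eqs. (25)-(26)."""
    p = len(mu_hat)
    S_uu, s_uy = Sigma_hat[1:, 1:], Sigma_hat[1:, 0]
    beta = np.linalg.solve(S_uu, s_uy)                 # Eq. (24) with robust moments
    alpha = mu_hat[0] - mu_hat[1:] @ beta
    S_uu_inv = np.linalg.inv(S_uu)
    D = duplication_matrix(p - 1)
    dbeta_dmu = np.zeros((p - 1, p))                   # beta does not depend on mu
    dbeta_dsigma = np.hstack([np.zeros((p - 1, 1)),    # d beta / d sigma_yy = 0
                              S_uu_inv,                # d beta / d sigma_uy'
                              -np.kron(beta[None, :], S_uu_inv) @ D])
    dbeta = np.hstack([dbeta_dmu, dbeta_dsigma])       # Eq. (25)
    return alpha, beta, dbeta @ Gamma_hat @ dbeta.T    # Eq. (26)
```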

When dummy coded categorical variables such as experimental condition, gender or race are present and are completely observed, robust means and dispersion matrix can be estimated for each group. After \(\hat{{\varvec{\mu }}}\)s and \(\hat{\mathbf{\Sigma }}\)s are obtained for all the groups, regression analysis can be done for each group separately when there is no constraint on parameters across the groups. With constraints on parameters across the groups, regression analysis can be done by fitting the regression models simultaneously to the \(\hat{{\varvec{\mu }}}\)s and \(\hat{\mathbf{\Sigma }}\)s under the constraints. When the \(\mathbf{u}_i\)s in (23) contain categorical variables that are missing, the robust methods described here may not be appropriate. Maximum likelihood estimates of the regression coefficients can be obtained if one can correctly specify the distribution of the categorical variables and the conditional distribution of the continuous variables given the categorical variables (see e.g., Little and Schluchter 1985). The resulting estimators will enjoy certain robust properties if the specified distribution accounts for heavy tails in the observed data. More studies in this direction are needed.

4.2 Growth curve models

Let \(y_{it}\) be the observed outcome of person \(i\) at time \(t\), \(t=1\), 2, \(\ldots \), \(T\); \(i=1\), \(2\), \(\ldots \), \(n\); \(\mathbf{u}_i\) be a vector that contains background variables (e.g., treatment conditions) for person \(i\). For complete data, let the linear growth curve model be

$$\begin{aligned} y_{it}=\beta _{i0}+\beta _{i1}t+\varepsilon _{it}, \;\;\; \beta _{i0}={\varvec{\gamma }}_{0}'\mathbf{u}_i+\delta _{i0}, \;\;\; \beta _{i1}={\varvec{\gamma }}_{1}'\mathbf{u}_i+\delta _{i1}, \end{aligned}$$
(27)

where \(E(\varepsilon _{it})=0\), \(\mathrm{Var}(\varepsilon _{it})=\psi _{tt}\), \(\mathrm{Cov}(\varepsilon _{is}, \varepsilon _{it})=\psi _{st}=0\) when \(s\ne t\); \(E(\delta _{i0})=E(\delta _{i1})=0\), \(\mathrm{Var}(\delta _{i0})=\phi _{00}\), \(\mathrm{Var}(\delta _{i1})=\phi _{11}\), \(\mathrm{Cov}(\delta _{i0},\delta _{i1})=\phi _{01}\); and \(\mathbf{u}_i\), \(\varepsilon _{it}\) and \((\delta _{i0},\delta _{i1})\) are independent. Then the structured means and covariances of \(\mathbf{x}_i=(\mathbf{y}_i',\mathbf{u}_i')'\), with \(\mathbf{y}_i=(y_{i1},y_{i2},\ldots ,y_{iT})'\), are

$$\begin{aligned} E(y_{it})&= {\varvec{\gamma }}_0'{\varvec{\mu }}_u+{\varvec{\gamma }}_1'{\varvec{\mu }}_u t, \;\;\; E(\mathbf{u}_i)={\varvec{\mu }}_u, \\ \mathrm{Cov}(y_{is}, y_{it})&= ({\varvec{\gamma }}_0+s{\varvec{\gamma }}_1)'\mathbf{\Sigma }_{uu} ({\varvec{\gamma }}_0+t{\varvec{\gamma }}_1) +\phi _{00}+(s+t)\phi _{01}+st\phi _{11}+\psi _{st},\\ \mathrm{Cov}(\mathbf{u}_i)&= \mathbf{\Sigma }_{uu}, \;\;\; \mathrm{Cov}(y_{it},\mathbf{u}_i)=({\varvec{\gamma }}_0+t{\varvec{\gamma }}_1)'\mathbf{\Sigma }_{uu}; \end{aligned}$$

and the vector of model parameters is

$$\begin{aligned} {\varvec{\theta }}=({\varvec{\gamma }}_0', {\varvec{\gamma }}_1', \phi _{00},\phi _{01},\phi _{11}, \psi _{11},\psi _{22},\ldots ,\psi _{TT},{\varvec{\mu }}_u',\mathrm{vech}'(\mathbf{\Sigma }_{uu}))'. \end{aligned}$$

With missing values, let \({\varvec{\mu }}({\varvec{\theta }})\) and \(\mathbf{\Sigma }({\varvec{\theta }})\) represent the mean and covariance structural models given above, and \(\hat{{\varvec{\mu }}}\) and \(\hat{\mathbf{\Sigma }}\) be proper robust estimates that satisfy (8) and (9). Robust estimates of \({\varvec{\theta }}\) can be obtained by minimizing a discrepancy function between \((\hat{{\varvec{\mu }}},\hat{\mathbf{\Sigma }})\) and \(({\varvec{\mu }}({\varvec{\theta }}),\mathbf{\Sigma }({\varvec{\theta }}))\). A commonly used discrepancy function is derived from the likelihood ratio statistic of testing \(({\varvec{\mu }}({\varvec{\theta }}),\mathbf{\Sigma }({\varvec{\theta }}))\) nested within \(({\varvec{\mu }},\mathbf{\Sigma })\) by assuming \(\hat{{\varvec{\mu }}}\) and \(\hat{\mathbf{\Sigma }}\) as the sample means and covariance matrix based on \(\mathbf{x}\sim N({\varvec{\mu }},\mathbf{\Sigma })\). Consistent SEs of the resulting \(\hat{{\varvec{\theta }}}\) can be obtained using a sandwich-type covariance matrix involving the \(\hat{\Gamma }\) in (22). Details of fitting mean and covariance structural models in general are given in Yuan and Zhang (2012), which contains multiple test statistics for overall model evaluation.
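For concreteness, a sketch of the model-implied moments \(({\varvec{\mu }}({\varvec{\theta }}),\mathbf{\Sigma }({\varvec{\theta }}))\) for the linear growth model (27) follows, assuming time scores \(t=1,\ldots ,T\) and writing \(\mathbf{\Phi }\) for the \(2\times 2\) covariance matrix of \((\delta _{i0},\delta _{i1})\); a two-stage estimate of \({\varvec{\theta }}\) would then minimize a discrepancy between these moments and \((\hat{{\varvec{\mu }}},\hat{\mathbf{\Sigma }})\). All names are illustrative.

```python
import numpy as np

def growth_moments(gamma0, gamma1, Phi, psi, mu_u, Sigma_uu, T):
    """gamma0, gamma1: length-q coefficient vectors; Phi: 2 x 2 covariance of
    (delta_0, delta_1); psi: length-T vector of Var(eps_t); returns the
    model-implied mean vector and covariance matrix of x = (y', u')'."""
    q = len(mu_u)
    p = T + q
    mu = np.empty(p)
    Sigma = np.empty((p, p))
    load = [gamma0 + (t + 1) * gamma1 for t in range(T)]   # gamma_0 + t*gamma_1, t = 1..T
    for s in range(T):
        mu[s] = load[s] @ mu_u                             # E(y_t)
        for t in range(T):
            Sigma[s, t] = (load[s] @ Sigma_uu @ load[t]    # Cov(y_s, y_t)
                           + Phi[0, 0] + (s + 1 + t + 1) * Phi[0, 1]
                           + (s + 1) * (t + 1) * Phi[1, 1]
                           + (psi[s] if s == t else 0.0))
        Sigma[s, T:] = load[s] @ Sigma_uu                  # Cov(y_s, u)
        Sigma[T:, s] = Sigma[s, T:]
    mu[T:] = mu_u
    Sigma[T:, T:] = Sigma_uu
    return mu, Sigma
```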

With complete data, we can define a robust M-estimator of \({\varvec{\theta }}\) for (27) by estimating equations in which cases with large \(\varepsilon _{it}\) and/or \(\delta _{ij}\) are downweighted. Estimating equations for the structural model can also be formulated when the \(y_{it}\)s are partially observed and the \(\mathbf{u}_i\)s contain no missing values. When the \(\mathbf{u}_i\)s contain missing values, however, it is not clear how to define estimating equations by downweighting cases with large \(\varepsilon _{it}\) or \(\delta _{ij}\). The two-stage approach of obtaining \(\hat{{\varvec{\mu }}}\) and \(\hat{\mathbf{\Sigma }}\) first and then fitting \(({\varvec{\mu }}({\varvec{\theta }}),\mathbf{\Sigma }({\varvec{\theta }}))\) to \((\hat{{\varvec{\mu }}},\hat{\mathbf{\Sigma }})\) provides a robust procedure for growth curve modeling with missing data. In mean and covariance structure analysis with missing values, it has been shown that a two-stage approach that estimates the saturated means and covariances using NML first and then fits them by the structural models works better than direct NML (Savalei and Falk 2014). We expect that fitting \(({\varvec{\mu }}({\varvec{\theta }}),\mathbf{\Sigma }({\varvec{\theta }}))\) to \((\hat{{\varvec{\mu }}},\hat{\mathbf{\Sigma }})\) will work as well as, if not better than, robust estimators of \({\varvec{\theta }}\) for (27) directly defined through estimating equations.

Similar to regression, when dummy coded variables such as group membership or treatment conditions exist, means and dispersion matrix can be robustly estimated for each group. Growth curve modeling can then be done separately for each group, or simultaneously when across-group constraints exist. When the \(\mathbf{u}_i\)s contain categorical variables that are missing, a method based on a mixed distribution of continuous and categorical variables is needed to properly model the joint distribution of the observed \(\mathbf{u}_i\) and \(\mathbf{y}_i\). Robustness of the method depends on the extent to which the mixed distribution can account for heavy tails in the observed \(\mathbf{x}_i=(\mathbf{y}_i',\mathbf{u}_i')'\). More development in this direction is needed.

5 Discussions

Most classical estimation methods generate consistent parameter estimates when data are complete. With missing data that are MAR, only MLEs are known to be consistent in general. However, with real data, it is hard to specify a correct likelihood function to generate true MLEs. When data have heavy tails, by adjusting the degrees of freedom (Liu 1997), a \(t\)-distribution might describe the underlying population better than the normal distribution. Estimating equations provide even more flexibility in modeling the distribution of the data, and they have become important tools in many areas when modeling practical data whose population distributions are unknown or cannot be described by a familiar parametric family (e.g., Godambe 1991; Liang and Zeger 1986; Prentice and Zhao 1991). The flexibility of estimating Eqs. (8) and (9) lies in the fact that \(w_{i1}\), \(w_{i2}\), and \(w_{i3}\) can have different forms. By properly choosing these weights, the estimating equations may closely approximate those obtained by setting the true but unknown score functions at zero. Then, the resulting estimates will be close to being consistent and asymptotically most efficient. Since the size of the bias in parameter estimates is unlikely to be known with real data, we may select the weights according to the size of the variances corresponding to a set of invariant model parameters (see Yuan et al. 2004), with the hope that the estimates also have minimal biases when their variances are close to the smallest. The variances can be estimated using the asymptotic covariance matrix in (22) or the bootstrap (Efron and Tibshirani 1993). Of course, to identify nearly optimal weights for a given data set, one needs to include a variety of procedures. In particular, when a high percentage of data contamination is suspected, S-estimators or other high-breakdown-point estimators need to be included in the comparison. For most of the conditions in Tables 1, 2, 3, 4 and 5, the methods that achieve the smallest variances also generate either the smallest biases or the smallest MSEs. For the few exceptions where the smallest variances and the smallest MSEs or biases do not go with the same method, the biases or MSEs corresponding to the smallest variances are close to being the smallest.

When data contamination or outliers are suspected, an alternative procedure is to use NML following outlier removal based on influence analysis (Poon and Poon 2002). Such a procedure may generate more efficient estimates if the population is normally distributed without contamination. If the heavy tails in a sample are not just due to outliers, a robust method might perform better.

Statistical theory for robust estimation with complete data is primarily developed under the assumption of symmetric or elliptical distributions, mainly because the resulting parameter estimates are then consistent. However, with missing data that are MAR, it is not clear whether the consistency property still holds when the population distribution is elliptical but the left sides of (8) and (9) are not the score functions derived from the elliptical-distribution-based log likelihood function. Even if the consistency property can be established within the class of elliptical distributions, it is not clear how to use such a result in practice. This is because, even when the population is elliptically distributed, the observed data can be skewed under an MAR mechanism, and there does not exist an effective procedure to tell the difference between skewness caused by the MAR mechanism and skewness caused by a skewed underlying population distribution. When it is not clear which method to choose for a given data set, many applied researchers just go with NML, and robust methods offer viable alternatives to NML. The Monte Carlo results in Sect. 3 of this article showed that robust estimates are more accurate than the NMLEs when the population distribution has heavy tails or the data are contaminated. References cited in this article and elsewhere have repeatedly shown the advantage of robust methods over NML with real data.