Abstract
Deletion diagnostics have been widely adopted to evaluate the influence of one or more observations on the adjustment outputs. Both the case-deletion model and the mean-shift outlier model can be used to develop multiple case-deletion diagnostics for linear models. These two multiple outlier detection models are identical from the statistical point of view. However, the mean-shift outlier model, in which the underlying observations are implicitly deleted, outperforms the case-deletion model in terms of computational efficiency. The influence of outliers on the adjustment outputs is also addressed. The analysis reveals that the precision, the minimal detectable bias (MDB) measure and the dilution of precision (DOP) metric are all over-optimistic when outliers exist but are neglected, under the assumption that the a priori variance factor is known.
1 Introduction
When outliers are present in a data set, a least-squares (LS) adjustment may not be possible or will produce poor or invalid results (Wolf and Ghilani 1997). Many approaches to mitigate or even eliminate the deteriorating effect of outlying observations on the parameter estimates have been developed (Cook 1977; Koch 1999; Monhor and Verö 2011), although there is no universally accepted definition of an outlier (Barnett and Lewis 1994; Monhor and Takemoto 2005; Monhor and Verö 2011).
There are two essential approaches to control the corrupting effects of outliers: conventional outlier detection test procedures developed in the geodetic literature (Baarda 1968; Pope 1976) and robust methods (Huber 1981; Hampel et al. 1986; Rousseeuw and Leroy 1987; Koch 1999; Yang 1999; Hekimoglu and Koch 2000; Xu 2005; Hekimoglu 2005). However, the conventional test procedures are only applicable under the assumption that no more than one outlier is present. In the case of multiple outliers, the most practical strategy is to employ the iterative data snooping presented by Kok (1984), whilst procedures for detecting all outliers at once have also been proposed (Hadi and Simonoff 1993; Snow and Schaffrin 2003; Baselga 2011).
To evaluate the influence of one or more observations on the adjustment outputs, deletion diagnostics have been extensively adopted (Cook 1977, 1979; Chatterjee and Hadi 1988). There are two ways to implement the diagnostics, namely, to delete the underlying observation(s) either explicitly or implicitly. The explicit approach is the case-deletion model and the implicit one is referred to as the mean-shift outlier model (Hekimoglu et al. 2012). The aim of this contribution is twofold: first, to prove the equivalence of these two methods; second, to address the influence of outlying observations on the quality measures.
The paper is organized as follows: the equivalence of the two multiple outlier detection models is investigated, followed by the computational considerations in performing the mean-shift outlier model. Furthermore, theoretical analyses show that the precision, the Minimal Detectable Bias (MDB) measure and the Dilution of Precision (DOP) metric are all over-optimistic when the outlying observations should have been taken into account but were neglected.
2 Model description
Let us consider a linear Gauss-Markov model defined by Koch (1999)
where L is the n×1 vector of observations, A the n×u design matrix with full column rank, and X the u×1 vector of unknowns. \(\sigma _{0}^{2}\) is the a priori variance factor of unit weight, and P the symmetric positive-definite weight matrix. Whenever necessary, the observations are supposed to be normally distributed.
Then, the (weighted) LS estimate of the unknowns in model Eq. (1) reads (Koch 1999)
The corresponding residual vector is readily obtained as
where \(\boldsymbol{R} = \boldsymbol{I}_{n} - \boldsymbol{A}(\boldsymbol{A}^{T}\boldsymbol{PA})^{-1}\boldsymbol{A}^{T}\boldsymbol{P}\) maps the original observational vector onto the residual vector as a result of the LS adjustment (Schaffrin 1997; Guo et al. 2011). The matrix \(\boldsymbol{R}\) plays an important role in linear adjustment techniques since it contains extremely useful information (Huber 1981; Guo et al. 2007, 2010). One can easily verify that \(\boldsymbol{R}\) is idempotent and has the following useful properties
the weighted sum of squares of the LS residuals reads
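The quantities of this section can be checked numerically. The following is a minimal sketch on randomly generated data, not the author's implementation. The problem sizes, the random cofactor matrix `Q` and the sign convention `V = A X_hat - L` are assumptions made for illustration; the properties tested (idempotence, `R A = 0`, symmetry of `P R`, `trace(R) = n - u`) are standard facts about this matrix and may differ from the list given in the original equation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, u = 8, 3                                  # hypothetical problem size
A = rng.normal(size=(n, u))                  # design matrix (full column rank almost surely)
B = rng.normal(size=(n, n))
Q = B @ B.T + n * np.eye(n)                  # symmetric positive-definite cofactor matrix
P = np.linalg.inv(Q)                         # weight matrix of correlated observations
L = rng.normal(size=n)                       # observation vector

N = A.T @ P @ A                              # normal matrix
X_hat = np.linalg.solve(N, A.T @ P @ L)      # weighted LS estimate
R = np.eye(n) - A @ np.linalg.solve(N, A.T @ P)
V = A @ X_hat - L                            # residuals (assumed sign convention)
omega = V @ P @ V                            # weighted sum of squared residuals

checks = {
    "R is idempotent": np.allclose(R @ R, R),
    "R A = 0": np.allclose(R @ A, 0.0),
    "P R is symmetric": np.allclose(P @ R, (P @ R).T),
    "trace(R) = n - u": np.isclose(np.trace(R), n - u),
    "V = -R L": np.allclose(V, -R @ L),
    "omega = L^T P R L": np.isclose(omega, L @ P @ R @ L),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

The last check does not depend on the sign chosen for the residuals, because the quadratic form is unchanged when `V` is negated.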
3 Multiple outlier detection models
The LS method is known to be very susceptible to outliers (Wolf and Ghilani 1997; Koch 1999; Guo et al. 2010). There are two procedures to implement the deletion diagnostics, namely, the case-deletion model and the mean-shift outlier model.
Let us assume that the \(i_{1}\)th, the \(i_{2}\)th, …, and the \(i_{m}\)th observations are to be deleted, while the \(i_{m+1}\)th, the \(i_{m+2}\)th, …, and the \(i_{n}\)th observations are the remaining ones.
3.1 Mean-shift outlier model
For convenience we introduce the following notations,
where \(\boldsymbol{h}_{i}\) denotes the \(i\)th canonical unit \(n\)-vector, having a 1 as its \(i\)th entry and zeros elsewhere. It can be seen that \((\boldsymbol{H}_{b},\boldsymbol{H}_{r})\) is a permutation matrix (Strang and Borre 1997). Since a permutation matrix is orthogonal, one can obtain
and
it follows immediately that
Accordingly, the corresponding mean-shift outlier model reads
in which \((\boldsymbol{A},\boldsymbol{H}_{b})\) is of full column rank.
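The selection matrices used in this subsection can be built and checked in a few lines. The sketch below is illustrative only: the helper name `selection_matrices`, the 0-based indices and the random design matrix are assumptions, not part of the original paper. It verifies that the stacked matrix is orthogonal and that the augmented design matrix has full column rank for this example.

```python
import numpy as np

def selection_matrices(n, deleted):
    """Return (H_b, H_r): columns of I_n for the deleted and the remaining
    observations. Indices are 0-based (hypothetical helper)."""
    deleted = list(deleted)
    deleted_set = set(deleted)
    remaining = [i for i in range(n) if i not in deleted_set]
    identity = np.eye(n)
    return identity[:, deleted], identity[:, remaining]

n, u = 6, 2
deleted = [1, 4]                              # observations to be deleted
H_b, H_r = selection_matrices(n, deleted)
H = np.hstack([H_b, H_r])                     # permutation matrix (H_b, H_r)

orthogonal = np.allclose(H.T @ H, np.eye(n))
completeness = np.allclose(H_b @ H_b.T + H_r @ H_r.T, np.eye(n))
disjoint = np.allclose(H_b.T @ H_r, 0.0)

rng = np.random.default_rng(0)
A = rng.normal(size=(n, u))                   # hypothetical design matrix
rank_aug = np.linalg.matrix_rank(np.hstack([A, H_b]))
print(orthogonal, completeness, disjoint, rank_aug)
```

Full column rank of the augmented matrix requires that the remaining rows of the design matrix still have rank `u`, which is the same condition needed for the case-deletion model to be solvable.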
Based on the LS principle, one can obtain the following normal equation:
with which and denoting
we have
It can be verified that \(\boldsymbol{R}_{\boldsymbol{H}_{b}}\) is idempotent and has the following useful properties
The corresponding residual vector is
and thus
with
3.2 Multiple case-deletion model
Under the same condition, the multiple case-deletion model reads
with which one can obtain the LS estimator as follows
where
The permutation matrix \((\boldsymbol{H}_{b},\boldsymbol{H}_{r})\) is invertible. Therefore, one can obtain
which in combination with Eq. (8) yields
By virtue of Eqs. (7), (13), (19) and (21), we have
It follows that
The weighted sum of squares of the LS residuals in this multiple case-deletion model reads
and thus
It can be seen from Eqs. (23) and (24) that the mean-shift outlier model is equivalent to the multiple case-deletion model. In other words, the adjustment outputs are identical no matter whether the (potential) outliers are deleted explicitly or implicitly, even if the removed observations are correlated with the remaining ones.
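The equivalence can also be illustrated numerically. The following sketch uses random data with a fully populated cofactor matrix, so the deleted observations are correlated with the remaining ones. Two points are assumptions of this sketch rather than statements from the paper: the mean-shift model is written as `L = A X + H_b nabla + e`, and the weight matrix of the remaining observations is taken as the inverse of their cofactor submatrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n, u = 10, 3
deleted = [2, 7]                               # hypothetical outlier positions (0-based)
remaining = [i for i in range(n) if i not in deleted]

A = rng.normal(size=(n, u))
B = rng.normal(size=(n, n))
Q = B @ B.T + n * np.eye(n)                    # correlated cofactor matrix
P = np.linalg.inv(Q)
L = rng.normal(size=n)
H_b = np.eye(n)[:, deleted]
H_r = np.eye(n)[:, remaining]

# Mean-shift outlier model: augment the design matrix with H_b.
A_aug = np.hstack([A, H_b])
est = np.linalg.solve(A_aug.T @ P @ A_aug, A_aug.T @ P @ L)
X_ms = est[:u]
res_ms = L - A_aug @ est
omega_ms = res_ms @ P @ res_ms

# Case-deletion model: drop the rows explicitly.
A_r = H_r.T @ A
L_r = H_r.T @ L
P_r = np.linalg.inv(H_r.T @ Q @ H_r)           # assumed weight matrix of remaining data
X_cd = np.linalg.solve(A_r.T @ P_r @ A_r, A_r.T @ P_r @ L_r)
res_cd = L_r - A_r @ X_cd
omega_cd = res_cd @ P_r @ res_cd

print(np.allclose(X_ms, X_cd), np.isclose(omega_ms, omega_cd))
```

Both the parameter estimates and the weighted sums of squared residuals agree to rounding error in this example.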
3.3 Computational consideration
With Eq. (12), one has to deal with two matrix inversions of orders u and m, as opposed to two matrix inversions of orders u and n−m in Eq. (18). Therefore, Eq. (12) outperforms Eq. (18) in terms of computational efficiency, because in most applications the number of outliers m is small relative to the number of original observations n.
However, the computational burden can be further reduced by taking the partitioned structure of the normal matrix in Eq. (10) into account. In fact, the normal equation (10) can also be solved as
or in more explicit form
with which we obtain
and
Evidently, in this situation the only extra work is the inversion of the m×m normal matrix \(\boldsymbol{H}_{b}^{T}\boldsymbol{PRH}_{b}\), since the inverse of the original u×u normal matrix is already available from the adjustment without outliers. As a by-product, the estimate of the vector of disturbance parameters ∇ can also be obtained with Eq. (26). This is a sufficient reason for choosing the mean-shift outlier model over the case-deletion model from the computational point of view.
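A minimal sketch of this partitioned solution is given below. It is an illustration under assumptions, not the paper's code: the function name `mean_shift_update` is hypothetical, the model is assumed to be `L = A X + H_b nabla + e` (which fixes the signs of the update), and `R` is formed explicitly for readability even though a production implementation would avoid the full n×n matrix. The result is compared with a direct solution of the augmented normal equations.

```python
import numpy as np

def mean_shift_update(A, P, L, deleted):
    """Update an existing LS solution for m suspected outliers using only
    one additional m x m solve (hypothetical helper)."""
    n, u = A.shape
    H_b = np.eye(n)[:, deleted]
    N_inv = np.linalg.inv(A.T @ P @ A)          # available from the original adjustment
    X0 = N_inv @ A.T @ P @ L                    # estimate without outlier parameters
    R = np.eye(n) - A @ N_inv @ A.T @ P
    PR = P @ R
    M = H_b.T @ PR @ H_b                        # m x m matrix H_b^T P R H_b
    nabla = np.linalg.solve(M, H_b.T @ PR @ L)  # estimated disturbance parameters
    X = X0 - N_inv @ A.T @ P @ H_b @ nabla      # corrected parameter estimate
    omega = L @ PR @ L - nabla @ M @ nabla      # updated weighted sum of squares
    return X, nabla, omega

rng = np.random.default_rng(2)
n, u = 12, 4
deleted = [3, 8, 10]
A = rng.normal(size=(n, u))
B = rng.normal(size=(n, n))
P = np.linalg.inv(B @ B.T + n * np.eye(n))
L = rng.normal(size=n)

X_fast, nabla_fast, omega_fast = mean_shift_update(A, P, L, deleted)

# Reference: solve the full (u + m) x (u + m) augmented normal equations.
A_aug = np.hstack([A, np.eye(n)[:, deleted]])
est = np.linalg.solve(A_aug.T @ P @ A_aug, A_aug.T @ P @ L)
res = L - A_aug @ est
omega_ref = res @ P @ res

print(np.allclose(X_fast, est[:u]), np.allclose(nabla_fast, est[u:]))
```

The design choice here is to reuse everything computed for the outlier-free adjustment, so that testing a different candidate set of outliers only changes the small matrix `M`.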
4 Quality assessment of outlying observations
With the Sherman-Morrison-Woodbury-Schur formula (Strang and Borre 1997), we have
This formula quantifies the apparent increase in precision that results when the outlying observations should have been taken into account but were neglected, under the assumption that the a priori variance factor is known (Schaffrin 1997).
Since the second term of Eq. (29) is a quadratic form, it follows that
This inequality shows that all types of DOP metrics (Strang and Borre 1997) will be over-optimistic if the outliers are ignored, even when the outlying observations are correlated with the remaining ones.
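The loss of precision can be made concrete with a small numerical sketch. The update formula coded below is the standard block-inverse (Woodbury-type) expression for the cofactor matrix of the parameters in the augmented model; it is assumed, not confirmed, to match the paper's Eq. (29). The "DOP-like" scalar is simply the square root of the trace of the full cofactor matrix, whereas real DOP metrics use specific coordinate blocks.

```python
import numpy as np

rng = np.random.default_rng(3)
n, u = 10, 3
deleted = [0, 5]                                # hypothetical outlier positions
remaining = [i for i in range(n) if i not in deleted]
A = rng.normal(size=(n, u))
B = rng.normal(size=(n, n))
Q = B @ B.T + n * np.eye(n)
P = np.linalg.inv(Q)
H_b = np.eye(n)[:, deleted]

N_inv = np.linalg.inv(A.T @ P @ A)              # cofactor matrix when outliers are neglected
R = np.eye(n) - A @ N_inv @ A.T @ P
M = H_b.T @ P @ R @ H_b
G = N_inv @ A.T @ P @ H_b
Q_x_with = N_inv + G @ np.linalg.solve(M, G.T)  # cofactor matrix with outliers modelled

# Cross-check against the explicit case-deletion model.
A_r = A[remaining]
P_r = np.linalg.inv(Q[np.ix_(remaining, remaining)])
Q_x_cd = np.linalg.inv(A_r.T @ P_r @ A_r)

diff = Q_x_with - N_inv                         # should be positive semi-definite
min_eig = np.linalg.eigvalsh((diff + diff.T) / 2).min()
dop_neglected = np.sqrt(np.trace(N_inv))        # DOP-like scalar (assumption)
dop_modelled = np.sqrt(np.trace(Q_x_with))
print(min_eig >= -1e-9, dop_neglected <= dop_modelled)
```

Because the difference is positive semi-definite, every diagonal element, and hence any trace-based metric, can only grow when the outliers are accounted for.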
After some matrix manipulation, it follows that
where \(\boldsymbol{R}_{r} = \boldsymbol{I}_{r} - \boldsymbol{H}_{r}^{T}\boldsymbol{A} \cdot (\boldsymbol{A}^{T}\boldsymbol{H}_{r}\boldsymbol{P}_{r}\boldsymbol{H}_{r}^{T}\boldsymbol{A})^{ - 1} \cdot \boldsymbol{A}^{T}\boldsymbol{H}_{r} \cdot \boldsymbol{P}_{r}\).
By virtue of Eqs. (28) and (31) and since the two quadratic forms, \(\Omega_{\nabla}\) and \(\Omega_{r}\), are equivalent for any realization of the random observational vector \(\boldsymbol{L}\), we have
Obviously, the \(k\)th observation in the multiple case-deletion model is just the \(i_{m+k}\)th one in the original linear Gauss–Markov model. Consequently, we get
where \(\tilde{\boldsymbol{h}}_{k}\) denotes the \(k\)th \((n-m)\)-dimensional canonical unit vector, with 1 as its \(k\)th entry.
The kth Baarda’s w-test in the multiple case-deletion model reads (Baarda 1968)
The corresponding MDB measure is given by
which in combination with Eq. (32) yields
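This indicates that the MDB measures of all remaining observations become larger once the outlying observations are accounted for. Equivalently, the MDBs computed while neglecting the outliers are over-optimistic.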
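The growth of the MDBs can be illustrated as follows. This is a sketch under stated assumptions: the MDB of observation k is taken in the common Baarda form \(\sigma_{0}\sqrt{\lambda_{0}/(\boldsymbol{h}_{k}^{T}\boldsymbol{PRh}_{k})}\), which may differ in notation from the paper's equation; the non-centrality parameter 17.075 is a commonly quoted value for a significance level of 0.1 % and a power of 80 %; and the helper names are hypothetical.

```python
import numpy as np

def redundancy_matrix(A, P):
    """R = I - A (A^T P A)^{-1} A^T P."""
    N = A.T @ P @ A
    return np.eye(A.shape[0]) - A @ np.linalg.solve(N, A.T @ P)

def mdb(A, P, sigma0, lambda0):
    """Minimal detectable bias for every observation (assumed Baarda form)."""
    PR = P @ redundancy_matrix(A, P)
    return sigma0 * np.sqrt(lambda0 / np.diag(PR))

rng = np.random.default_rng(4)
n, u = 10, 3
deleted = [1, 6]                                # hypothetical outlier positions
remaining = [i for i in range(n) if i not in deleted]
A = rng.normal(size=(n, u))
B = rng.normal(size=(n, n))
Q = B @ B.T + n * np.eye(n)                     # correlated cofactor matrix
P = np.linalg.inv(Q)
sigma0 = 1.0                                    # a priori standard deviation of unit weight
lambda0 = 17.075                                # assumed non-centrality parameter

# MDBs of the remaining observations when the outliers are neglected.
mdb_neglected = mdb(A, P, sigma0, lambda0)[remaining]

# MDBs of the same observations in the case-deletion model.
A_r = A[remaining]
P_r = np.linalg.inv(Q[np.ix_(remaining, remaining)])
mdb_deleted = mdb(A_r, P_r, sigma0, lambda0)

print(np.all(mdb_deleted >= mdb_neglected - 1e-12))
```

The comparison needs positive redundancy in the reduced model, that is n − m > u, otherwise some diagonal elements vanish and the MDB is undefined.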
5 Conclusions
Both the case-deletion model and the mean-shift outlier model can be employed to perform multiple case-deletion diagnostics for linear models. The advantage of the case-deletion model is its intuitive appeal, for the suspicious observations are removed explicitly. The mean-shift outlier model, in which the underlying observations are implicitly deleted, has found wider acceptance because of its computational simplicity. Nevertheless, the two models are equivalent from the mathematical point of view. Under the assumption that the a priori variance factor is known, theoretical analyses indicate that the precision, the MDB measure and all kinds of DOP metrics are over-optimistic when outliers are neglected.
References
Baarda W (1968) A testing procedure for use in geodetic networks. Publication on geodesy, vol 2(5). Netherlands Geodetic Commission, Delft
Barnett V, Lewis T (1994) Outliers in statistical data, 3rd edn. Wiley, New York
Baselga S (2011) Acta Geod Geophys Hung 46(4):401–416
Chatterjee S, Hadi AS (1988) Sensitivity analysis in linear regression. Wiley, New York
Cook RD (1977) Technometrics 19(1):15–18
Cook RD (1979) J Am Stat Assoc 74(365):169–174
Guo J, Ou J, Wang H (2007) J Surv Eng 133(3):129–133
Guo J, Ou J, Wang H (2010) J Geod 84(4):243–250
Guo J, Ou J, Yuan Y (2011) J Surv Eng 137(1):9–13
Hadi AS, Simonoff JS (1993) J Am Stat Assoc 88(424):1264–1272
Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA (1986) Robust statistics: the approach based on influence functions. Wiley, New York
Hekimoglu S (2005) ZFV, Z Vermess.wes 130(3):174–180
Hekimoglu S, Koch KR (2000) AVN 107(7):247–253
Hekimoglu S, Erdogan B, Erenoglu RC (2012) Exp Tech. doi:10.1111/j.1747-1567.2012.00876.x
Huber PJ (1981) Robust statistics. Wiley, New York
Kern M, Preimesberger T, Allesch M, Pail R, Bouman J, Koop R (2010) J Geod 78(9):509–519
Koch KR (1999) Parameter estimation and hypothesis testing in linear models, 2nd edn. Springer, Berlin
Kok JJ (1984) On data snooping and multiple outlier testing. NOAA technical report, NOS NGS 30, Rockville, MD
Monhor D, Takemoto S (2005) Earth Planets Space 57(11):1009–1018
Monhor D, Verö J (2011) Acta Geod Geophys Hung 46(1):84–92
Pope AJ (1976) The statistics of residuals and the detection of outliers. NOAA technical report, NOS 65, NGS 1 Rockville, MD
Rousseeuw PJ, Leroy AM (1987) Robust regression and outlier detection. Wiley, New York
Schaffrin B (1997) J Surv Eng 123(3):126–137
Snow KB, Schaffrin B (2003) GPS Solut 7(2):130–139
Strang G, Borre K (1997) Linear algebra, geodesy, and GPS. Wellesley-Cambridge Press, Wellesley
Wolf PR, Ghilani CD (1997) Adjustment computations: statistics and least squares in surveying and GIS, 3rd edn. Wiley, New York
Xu P (2005) J Geod 79(1–3):146–159
Yang Y (1999) J Geod 73(5):268–274
Acknowledgements
This research was sponsored by the National Key Basic Research Program of China (2012CB825604) and the Natural Science Foundation of China (Grant No. 40874007). The author is also supported by the China Scholarship Council (File No. 2011317045).
Cite this article
Guo, J. The case-deletion and mean-shift outlier models: equivalence and beyond. Acta Geod Geophys 48, 191–197 (2013). https://doi.org/10.1007/s40328-013-0017-5
DOI: https://doi.org/10.1007/s40328-013-0017-5