1 Introduction

In many geodetic data processing tasks, large sets of observations are recorded or sampled, and it is nearly impossible for such datasets to be free from outliers (Lehmann 2013; Rofatto et al. 2020a). When outliers occur, least squares (LS) estimation, which is typically applied in geodetic data processing, may lose the properties of unbiasedness and minimum variance of the parameter estimates (Koch 1999; Teunissen 2000). Thus, one of the major challenges of geodetic data analysis is to deal with outliers properly. In this paper, the term outlier follows the definition given by Lehmann (2013): outliers are observations that are contaminated by gross errors. In contrast, observations without any contamination are defined as inliers (Hawkins 1980). Generally, two main categories of methods have been developed to deal with outliers: outlier diagnostics and robust estimation (Rousseeuw and Leroy 1987).

The methods of outlier diagnostics aim to pinpoint outliers in the data, after which these outliers are removed or corrected, followed by an LS analysis of the remaining cases. Outlier diagnostics has a long history, dating back to Thompson (1935), Pearson and Sekar (1936), Nair (1948), and Grubbs (1969), who studied outliers based on the normalized or studentized residuals of LS. Their results were followed up by many researchers who detected and identified outliers in normal samples or linear regressions, e.g., Daniel (1960), Anscombe (1960), Quesenberry and David (1961), Ferguson (1961), Srikantan (1961), David and Paulson (1965), Stefansky (1972), Ellenberg (1973, 1976), Rosner (1975), Galpin and Hawkins (1981), and Fischler and Bolles (1981). An extensive review of the subject is given by Beckman and Cook (1983).

In geodesy, considering the characteristics of geodetic data, outlier diagnostics methods have also been studied extensively. After over half a century of development, data snooping has become one of the best-established outlier diagnostics methods and has been used as a standard procedure for quality control. Data snooping with its associated reliability measure originated in the pioneering work of Baarda (1967, 1968) and was later extended by Pope (1976) to the case in which the precision of the observations is unknown; see also, e.g., Alberda (1976), Kok (1984), Teunissen (1985), Xu (1987a, b) and Koch (1999). Data snooping usually consists of three steps. First, test statistics are used to detect the presence of outliers in the observation system (Teunissen 2000; Koch 2015). Then, model selection is carried out to identify the most likely outlier by screening each observation (Förstner 1983; Yang et al. 2013). Finally, parameter estimation is carried out after excluding the tested outlier. The statistical test used in data snooping is well known as the w-test. If the observations are independent of each other, the w-test statistics are equivalent to the normalized or studentized least squares residuals.

For modern geodetic applications, there are typically large datasets that are very likely to contain multiple outliers. In this case, one of the most commonly used methods is to implement the data snooping procedure iteratively, processing outliers one by one (Mickey et al. 1967; Barnett and Lewis 1978; Gentle 1978), which is known as iterative data snooping (IDS) (Kok 1984; Lehmann and Scheffler 2011; Rofatto et al. 2017; Klein et al. 2022). However, this approach is not theoretically rigorous, since it assumes that there is only one outlier in each iteration, an assumption that is overthrown immediately in the next iteration (Lehmann and Lösler 2016). In fact, in the case of multiple outliers, testing methods are hampered by the masking and swamping effects. Specifically, the masking effect means that multiple outliers can easily mask each other, which increases the difficulty of outlier detection (McMillan 1971). The swamping effect means that if the suspected maximum number of outliers is large, the statistical test tends to declare more outliers than are actually present (Fieller 1976).

Another kind of method, robust estimation, aims to attain a solution with higher robustness by modifying the score function of LS. Since Box (1953) coined the term ‘robustness’, an enormous number of studies have been published on this subject, e.g., M-estimation (Huber 1964; Hampel et al. 1986), L-estimation (Sarhan and Greenberg 1956), R-estimation (Hodges and Lehmann 1963; Jaeckel 1972; Duchnowski 2013), S-estimation (Rousseeuw and Yohai 1984), median estimation (Stigler 1977; Duchnowski 2010), quantile regression (Koenker and Hallock 2001), Msplit estimation (Wiśniewski 2009, 2010), the least absolute values method (Edgeworth 1887; Khodabandeh and Amiri-Simkooei 2011), the least median of squares method (Rousseeuw 1984; Rousseeuw and Leroy 1987), sign-constrained robust least squares (Xu 2005), and various generalized versions of them. Judged by the theory of the breakdown point (Hodges 1967; Donoho and Huber 1983), these estimators have achieved satisfactory performance when dealing with multiple outliers.

Among all of these robust estimators, the subclass of M-estimation is particularly attractive in geodesy, since it is computationally efficient and can be easily implemented in existing geodetic adjustment software (Yang et al. 2002; Koch 2013). Compared to other robust estimators, M-estimation is usually consistent with the LS solution in the absence of outliers, so it retains a high precision of the parameter estimates in most cases. Due to the nonlinear property of M-estimators, the iteratively reweighted least squares (IRLS) procedure has been developed for practical applications (Holland and Welsch 1977; Koch 1999). The principle of IRLS is to adapt the weight of each observation based on the residuals of the previous adjustment. Such weights reduce or eliminate the effect of outliers on the final parameter estimate via many different down-weighting strategies, e.g., Huber’s estimator (Huber 1964), the Danish method (Krarup et al. 1980), Hampel’s estimator (Hampel et al. 1986), and the IGGIII method (Yang 1994; Yang et al. 2002). However, since these methods are based on the initial LS residuals, the masking and swamping effects unavoidably lead to incorrect down-weighting decisions (Hekimoğlu 1997, 1999).

The present contribution investigates the cause of the masking and swamping effects and proposes a new method to mitigate these phenomena. First, based on the data division, an extended form of the w-test with its reliability measure is presented, and a theoretical reinterpretation of data snooping and IDS is provided. Then, a new outlier diagnostic method and its iterative form are proposed, namely data refining and iterative data refining (IDR). In general, if the total observations are initially divided into an inlying set and an outlying set, data snooping can be considered a process of selecting outliers from the inlying set and moving them to the outlying set. Conversely, data refining is the reverse process of transferring inliers from the outlying set to the inlying one. For IDS, all data are usually assumed to be inliers in the initial stage. In this case, the inlying set is probably contaminated when there are multiple outliers. Consequently, a contaminated inlying set might invalidate the test decision, which causes the masking and swamping effects. In the initial stage of IDR, however, the suspected outliers are moved out of the inlying set. Therefore, a reliable inlying set largely guarantees the validity of the subsequent tests, thereby effectively alleviating the masking and swamping effects.

This contribution is structured as follows: In Sect. 2, the Gauss–Markov model is briefly reviewed, and Baarda’s w-test and data snooping are introduced. In Sect. 3, based on the data division, the extended w-test with its associated reliability measure is first presented; then, data snooping is reinterpreted and a new method called data refining is proposed. In Sect. 4, the iterative forms of data snooping and data refining are presented, which are IDS and IDR, respectively, and the differences between IDS and IDR are discussed in detail. In Sect. 5, a linear fitting example is used to analyze the properties of IDR for dealing with outliers. Finally, Sect. 6 concludes this contribution and discusses future work.

2 Baarda’s data snooping in the linear models

In this section, the linear model with least squares estimates is briefly reviewed. Then, the w-test and data snooping with its associated reliability measure are introduced.

2.1 Linear models and least squares (LS) estimate

Consider the linear model of observation equations (Koch 1999; Teunissen 2000):

$$ {\rm{E}}\left( {\varvec{y}} \right) = {\varvec{A}}{\varvec{x}},\;{\rm{D}}\left( {\varvec{y}} \right) = \sigma^{2} {\varvec{Q}}. $$
(1)

Here, \(\rm{E}\left( \cdot \right)\) and \(\rm{D}\left( \cdot \right)\) are the expectation and dispersion operators, respectively. \({\varvec{A}} \in {\mathbb{R}}^{{{{m}}\; \times \;{{n}}}}\) is the non-random design matrix with \({ {\text{rank}}}\left( {\varvec{A}} \right)\; = \;{{n}}\; < \;{{m}}\). \({\varvec{y}} \in {\mathbb{R}}^{{{m}}}\) is the normally distributed vector of observations, and \({\varvec{x}} \in {\mathbb{R}}^{{{n}}}\) is the vector of unknown parameters. The symmetric positive-definite matrix \({\varvec{Q}} \in {\mathbb{R}}^{{{{m}}\; \times \;{{m}}}}\) denotes the variance–covariance cofactor matrix of \({\varvec{y}}\), and \(\sigma^{2}\) denotes the variance of unit weight. The LS estimate of the unknown parameters \({\varvec{x}}\) is given by:

$$ \hat{\varvec{x}} = \left( {{\varvec{A}}^{{ {\text{T}}}} {\varvec{Q}}^{ - 1} {\varvec{A}}} \right)^{ - 1} {\varvec{A}}^{{ {\text{T}}}} {\varvec{Q}}^{ - 1} {\varvec{y}}, $$
(2)

with the cofactor matrix of variance–covariance:

$$ {\varvec{Q}}_{{\varvec{\hat{x}\hat{x}}}} = \left( {{\varvec{A}}^{{ {\text{T}}}} {\varvec{Q}}^{ - 1} {\varvec{A}}} \right)^{ - 1} . $$
(3)

The residual vector of \({\varvec{y}}\) can be given as:

$$ \hat{\varvec{e}} = \left[ {{\varvec{I}}_{{{m}}} - {\varvec{A}}\left( {{\varvec{A}}^{{ {\text{T}}}} {\varvec{Q}}^{ - 1} {\varvec{A}}} \right)^{ - 1} {\varvec{A}}^{{ {\text{T}}}} {\varvec{Q}}^{ - 1} } \right]{\varvec{y}}, $$
(4)

with the cofactor matrix of variance–covariance:

$$ {\varvec{Q}}_{{\varvec{\hat{e}\hat{e}}}} = {\varvec{Q}} - {\varvec{A}}\left( {{\varvec{A}}^{{ {\text{T}}}} {\varvec{Q}}^{ - 1} {\varvec{A}}} \right)^{ - 1} {\varvec{A}}^{{ {\text{T}}}} . $$
(5)
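For illustration, the quantities in Eqs. (2)–(5) can be computed with a few lines of NumPy. The following is only a minimal sketch under the assumptions of this section; the function name ls_adjustment and the toy data at the end are illustrative and not taken from the paper.

```python
import numpy as np

def ls_adjustment(A, Q, y):
    """LS estimate, its cofactor, the residuals and their cofactor for the
    Gauss-Markov model E(y) = A x, D(y) = sigma^2 Q, cf. Eqs. (2)-(5)."""
    Qinv = np.linalg.inv(Q)
    Qxx = np.linalg.inv(A.T @ Qinv @ A)      # Eq. (3)
    x_hat = Qxx @ A.T @ Qinv @ y             # Eq. (2)
    e_hat = y - A @ x_hat                    # Eq. (4), written as y minus the adjusted observations
    Qee = Q - A @ Qxx @ A.T                  # Eq. (5)
    return x_hat, Qxx, e_hat, Qee

# illustrative toy data: a straight-line fit with five observations
A = np.column_stack([np.ones(5), np.arange(1, 6)])
Q = np.eye(5)
y = np.array([0.1, 2.0, 2.9, 4.2, 5.0])
x_hat, Qxx, e_hat, Qee = ls_adjustment(A, Q, y)
```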

For condition equations, the linear model in Eq. (1) can be equivalently written as (Koch 1999; Teunissen 2000):

$$ {\varvec{B}}^{{ {\text{T}}}} {\rm{E}}\left( {\varvec{y}} \right) = 0,\;{\rm{D}}\left( {\varvec{y}} \right) = \sigma^{2} {\varvec{Q}}. $$
(6)

Here, \({\varvec{B}} \in {\mathbb{R}}^{{{{m}}\; \times \;\left( {{{m}} - {{n}}} \right)}}\) is a basis matrix satisfied that \({\varvec{B}}^{{ {\text{T}}}} {\varvec{A}} = 0\) and \({ {\text{rank}}}\left( {\varvec{B}} \right) = {{m}} - {{n}}\). In this case, the residual vector can be given as:

$$ \hat{\varvec{e}} = {\varvec{QB}}\left( {{\varvec{B}}^{{ {\text{T}}}} {\varvec{QB}}} \right)^{ - 1} {\varvec{B}}^{{ {\text{T}}}} {\varvec{y}}, $$
(7)

with the cofactor matrix of variance–covariance:

$$ {\varvec{Q}}_{{\varvec{\hat{e}\hat{e}}}} = {\varvec{QB}}\left( {{\varvec{B}}^{{ {\text{T}}}} {\varvec{QB}}} \right)^{ - 1} {\varvec{B}}^{{ {\text{T}}}} {\varvec{Q}}. $$
(8)

Particularly, if \(\sigma^{2}\) is unknown, it can be estimated as:

$$ \hat{\sigma }^{2} = \frac{\hat{\varvec{e}}^{\text{T}} {\varvec{Q}}^{ - 1} \hat{\varvec{e}}}{m - n}. $$
(9)

In general, LS is the best linear unbiased estimator (BLUE) for linear models (Koch 1999; Teunissen 2000). However, the optimal properties of LS might be compromised once the observations are contaminated by gross errors, resulting in the presence of outliers (Rousseeuw and Leroy 1987; Lehmann 2013). Therefore, statistical test procedures for outlier diagnostics have been developed.

2.2 w-test and data snooping

Suppose there is a suspected gross error in the kth observation, then the linear model in Eqs. (1) or (6) becomes (Koch 1999; Teunissen 2000):

$$ \left\{ {\begin{array}{*{20}c} {\rm{E}}\left( {\varvec{y}} \right) = {\varvec{A}}{\varvec{x}} + {\varvec{c}}_{{{k}}} \nabla_{{{k}}} , \;{\rm{D}}\left( {\varvec{y}} \right) = \sigma^{2} {\varvec{Q}} \\ {{\varvec{B}}^{{ {\text{{T}}}}} {\rm{E}}\left( {\varvec{y}} \right) = {\varvec{B}}^{{ {\text{T}}}} {\varvec{c}}_{{{k}}} \nabla_{{{k}}} ,\;{\rm{D}}\left( {\varvec{y}} \right) = \sigma^{2} {\varvec{Q}}} \\ \end{array} } \right., \;{ {k}} \in \left\{ {1, \ldots ,{ {m}}} \right\}, $$
(10)

where \(\nabla_{{ {k}}}\) is the size of the gross error in the kth observation. \({\varvec{c}}_{{ {k}}} = \left[ {0, \ldots ,0,1,0, \ldots ,0} \right]^{{ {\text{T}}}}\) is a unit vector with the kth element equal to one. The w-test statistic for the kth observation can then be formed as follows (Baarda 1967; Teunissen 2000):

$$ w_{{ {k}}} = \frac{{\hat{\nabla }_{{ {k}}} }}{{\sigma \sqrt {q_{{\hat{\nabla }_{{ {k}}} }}^{2} } }} = \frac{{{\varvec{c}}_{{ {k}}}^{{ {\text{{T}}}}} {\varvec{M}}{\varvec{y}}}}{{\sigma \sqrt {{\varvec{c}}_{{ {k}}}^{{ {\text{{T}}}}} {\varvec{M}}{\varvec{c}}_{{ {k}}} } }}, $$
(11)

with

$$ {\varvec{M}} = {\varvec{Q}}^{ - 1} - {\varvec{Q}}^{ - 1} {\varvec{A}}\left( {{\varvec{A}}^{{ {\text{{T}}}}} {\varvec{Q}}^{ - 1} {\varvec{A}}} \right)^{ - 1} {\varvec{A}}^{{ {\text{T}}}} {\varvec{Q}}^{ - 1} = {\varvec{B}}\left( {{\varvec{B}}^{{ {\text{T}}}} {\varvec{QB}}} \right)^{ - 1} {\varvec{B}}^{{ {\text{T}}}} = {\varvec{Q}}^{ - 1} {\varvec{Q}}_{{\varvec{\hat{e}\hat{e}}}} {\varvec{Q}}^{ - 1} . $$
(12)

Here, \(\hat{\nabla }_{{{k}}} = {\varvec{c}}_{{{k}}}^{{{\text{{T}}}}} {\varvec{M}}{\varvec{y}}/{\varvec{c}}_{{{k}}}^{{{\text{{T}}}}} {\varvec{M}}{\varvec{c}}_{{{k}}}\) is the LS estimate of \(\nabla_{{{k}}}\), and \(q_{{\hat{\nabla }_{{{k}}} }}^{2} = 1/{\varvec{c}}_{{{k}}}^{{{\text{{T}}}}} {\varvec{M}}{\varvec{c}}_{{{k}}}\) is the variance cofactor of \(\hat{\nabla }_{{{k}}}\).

Specifically, \(w_{{{k}}}\) follows a normal distribution, that is \(w_{{{k}}} \sim N\left( {\delta ,1} \right)\), where \(\delta = \frac{{\nabla_{{{k}}} }}{{\sigma \sqrt {q_{{\hat{\nabla }_{{{k}}} }}^{2} } }}\). If there is no gross error in observation \(y_{{{k}}}\), then \(\delta = 0\); otherwise, \(\delta \ne 0\). Therefore, one can test whether the observation \(y_{{{k}}}\) is an inlier or an outlier according to the significance of \(w_{{{k}}}\). The so-called w-test is organized as follows. Given a critical value \(k_{\alpha }\), if \(\left| {w_{{{k}}} } \right| > {{k}}_{\alpha }\), then \(y_{{{k}}}\) is tested as an outlier; otherwise, \(y_{{{k}}}\) is tested as an inlier. Here, \({{k}}_{\alpha }\) is calculated as the quantile of \(N\left( {0,1} \right)\) for a significance level \(\alpha\), that is \(k_{\alpha } = N_{1 - \alpha /2} \left( {0,1} \right)\).
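As an illustration of Eqs. (11) and (12), the following minimal NumPy/SciPy sketch computes the w-test statistic for a single observation and the corresponding decision for the known-\(\sigma\) case; the function name w_test, the 0-based index convention, and the default significance level are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def w_test(A, Q, y, sigma, k, alpha=0.001):
    """w-test statistic of Eq. (11) for the k-th observation (0-based) and the
    decision of the test; True means that y_k is flagged as an outlier."""
    Qinv = np.linalg.inv(Q)
    M = Qinv - Qinv @ A @ np.linalg.inv(A.T @ Qinv @ A) @ A.T @ Qinv   # Eq. (12)
    c = np.zeros(len(y)); c[k] = 1.0
    w = (c @ M @ y) / (sigma * np.sqrt(c @ M @ c))                     # Eq. (11)
    k_alpha = stats.norm.ppf(1 - alpha / 2)                            # critical value
    return w, abs(w) > k_alpha
```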

Note that when \(\sigma\) in Eq. (11) is unknown, then it can be estimated as:

$$ \hat{\sigma }_{{{k}}}^{2} = \frac{{\hat{\varvec{e}}_{{{k}}}^{{{\text{{T}}}}} {\varvec{Q}}^{ - 1} \hat{\varvec{e}}_{{{k}}} }}{{{{m}} - {{n}} - 1}} = \frac{{{\varvec{y}}^{{{\text{{T}}}}} \left[ {{\varvec{M}} - {\varvec{M}}{\varvec{c}}_{{{k}}} \left( {{\varvec{c}}_{{{k}}}^{{{\text{{T}}}}} {\varvec{M}}{\varvec{c}}_{{{k}}} } \right)^{ - 1} {\varvec{c}}_{{{k}}}^{{{\text{{T}}}}} {\varvec{M}}} \right]{\varvec{y}}}}{{{{m}} - {{n}} - 1}}, $$
(13)

where \(\hat{\varvec{e}}_{{{k}}}\) is the vector of observation residuals calculated by Eq. (10). In this case, \(w_{{{k}}} = \frac{{\hat{\nabla }_{{{k}}} }}{{\hat{\sigma }_{{{k}}} \sqrt {q_{{\hat{\nabla }_{{{k}}} }}^{2} } }}\) follows a Student’s t-distribution with \({{m}} - {{n}} - 1\) degrees of freedom and noncentrality parameter \(\delta\), that is \(w_{{{k}}} \sim t\left( {{{m}} - {{n}} - 1,\delta } \right)\) (Xu 1987a, b). Correspondingly, the critical value of the w-test becomes \(k_{\alpha } = t_{1 - \alpha /2} \left( {{{m}} - {{n}} - 1,0} \right)\). Also, \(w_{{{k}}}\) can be equivalently transformed into another test statistic that follows the \(\tau\) distribution with 1 and \({{m}} - {{n}} - 1\) degrees of freedom, that is \(\frac{{\sqrt {{{m}} - {{n}}} w_{{{k}}} }}{{\sqrt {{{m}} - {{n}} - 1 + w_{{{k}}}^{2} } }}\sim \tau \left( {1,{{m}} - {{n}} - 1,\delta } \right)\) (Pope 1976; Koch 1999; Lehmann 2012).

Furthermore, considering the reliability measure, the minimal detectable bias (MDB) can be given by (Baarda 1967, 1968; Teunissen 2000):

$$ {{\rm{MDB}}}_{{{k}}} = \delta_{0} \sigma \sqrt {q_{{\hat{\nabla }_{{ {k}}} }}^{2} } = \frac{{\delta_{0} \sigma }}{{\sqrt {{\varvec{c}}_{{{k}}}^{{ {\text{{T}}}}} {\varvec{M}}{\varvec{c}}_{{ {k}}} } }}. $$
(14)

Here, \(\delta_{0}\) is the theoretical noncentrality parameter, which can be computed via \(\delta_{0} = N_{{1 - \frac{\alpha }{2}}} \left( {0,1} \right) - N_{\beta } \left( {0,1} \right)\) or \(\delta_{0} = t_{{1 - \frac{\alpha }{2}}} \left( {{ {m}} - { {n}} - 1,0} \right) - t_{\beta } \left( {{ {m}} - { {n}} - 1,0} \right)\), where \(\alpha\) and \(\beta\) are the significance level and the power of the test, respectively.
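A minimal sketch of Eq. (14) is given below; it assumes that the noncentrality parameter \(\delta_{0}\) has already been fixed from the chosen \(\alpha\) and \(\beta\) as described above, and the function name mdb is illustrative.

```python
import numpy as np

def mdb(A, Q, sigma, delta0):
    """Minimal detectable bias of Eq. (14) for every observation; delta0 is
    the noncentrality parameter chosen beforehand from alpha and beta."""
    Qinv = np.linalg.inv(Q)
    M = Qinv - Qinv @ A @ np.linalg.inv(A.T @ Qinv @ A) @ A.T @ Qinv
    # c_k^T M c_k is simply the k-th diagonal element of M
    return delta0 * sigma / np.sqrt(np.diag(M))
```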

By letting k run from 1 to m, one can screen the whole data set for potential outliers. The significance of the test statistics is assessed by comparing them with the critical value. Thus, the observation with the largest absolute value of the test statistic is tested as an outlier when:

$$ \mathop {\max }\limits_{{{ {k}} \in \left\{ {1, \ldots ,{ {m}}} \right\}}} \left| {w_{{ {k}}} } \right| > {{k}}_{\alpha } . $$
(15)

The procedure for screening each observation for an outlier is known as “data snooping” (Kok 1984; Teunissen 2000). Furthermore, in the case of multiple outliers, the data snooping procedure can be implemented iteratively to process outliers one by one, which is known as iterative data snooping (IDS) (Kok 1984; Lehmann and Scheffler 2011; Rofatto et al. 2017; Klein et al. 2022).

Note that if observations are independent of each other, the w-test statistics are equivalent to the normalized or studentized least squares residuals used by Thompson (1935), Pearson and Sekar (1936), Nair (1948), and Grubbs (1969), as:

$$ w_{{ {k}}} = \frac{{\hat{e}_{{ {k}}} }}{{\sigma \sqrt {q_{{\hat{e}_{{ {k}}} }}^{2} } }} \;{ {\rm{or}}} \;\frac{{\hat{e}_{{ {k}}} }}{{\hat{\sigma }_{{ {k}}} \sqrt {q_{{\hat{e}_{{ {k}}} }}^{2} } }}, $$
(16)

where \(\hat{e}_{{ {k}}}\) is the kth element of the LS residuals \(\hat{\varvec{e}}\), and \(q_{{\hat{e}_{{ {k}}} }}^{2}\) is the variance cofactor of \(\hat{e}_{{ {k}}}\).

3 Data division and extended w-test

In this section, to distinguish inliers and outliers, the data division is first discussed. Based on the data division, an extended w-test with its associated reliability is presented. Then, data snooping is reinterpreted, and a new method called data refining is proposed.

3.1 Data division

Measured on a suitably standardized scale, observations can be divided into two groups according to their inlying or outlying character, referred to as inliers and outliers (Hawkins 1980; Xu 1987a, b). In other words, if all data come from a complete set \(\left\{ {y_{1} , \ldots ,y_{{ {m}}} } \right\}\), they can always be divided into two sets, called the inlying set and the outlying set. Generally, for an observation system, the observations in the inlying set are to be retained, while those in the outlying set are to be excluded.

Here, a data division method is proposed as follows. Given an outlier number q, all of the candidate pairs of the inlying and outlying sets can be listed as follows:

$$ { {\text{I}}}_{{ {i}}} \cup { {\text{O}}}_{{ {i}}} = \left\{ {y_{1} , \ldots ,y_{{ {m}}} } \right\},\;{ {i}} \in \left\{ {1, \ldots , \left( {\begin{array}{*{20}c} { {m}} \\ { {q}} \\ \end{array} } \right)} \right\}. $$
(17)

Each pair of the inlying and outlying sets corresponds to a candidate model (Teunissen 2000, 2018; Lehmann and Lösler 2016):

$$ \left\{ {\begin{array}{*{20}c} {{\rm{E}}\left( {\varvec{y}} \right) = {\varvec{A}}{\varvec{x}} + {\varvec{C}}_{{ {i}}} {\varvec{b}}_{{ {i}}} , \;{\rm{D}}\left( {\varvec {y}} \right) = \sigma^{2} {\varvec{Q}}} \\ {{\varvec{B}}^{{ {\text{T}}}} {\rm{E}}\left( {\varvec{y}} \right) = {\varvec{B}}^{{ {\text{T}}}} {\varvec{C}}_{{ {i}}} {\varvec{b}}_{{ {i}}} ,\;{\rm{D}}\left( {\varvec{y}} \right) = \sigma^{2} {\varvec{Q}}} \\ \end{array} } \right.,\;{ {i}} \in \left\{ {1, \ldots , \left( {\begin{array}{*{20}c} { {m}} \\ { {q}} \\ \end{array} } \right)} \right\}, $$
(18)

where \({\varvec{C}}_{{ {i}}} \in {\mathbb{R}}^{{{ {m}}\; \times \;{ {q}}}}\) is a design matrix consisting of unit vectors generated by the outlying set \({ {\text{O}}}_{{ {i}}}\), and \({\varvec{b}}_{{ {i}}} \in {\mathbb{R}}^{{ {q}}}\) is the vector of gross error sizes of the data in the outlying set. Based on the principle of model selection, one can find the most likely pair of inlying and outlying sets, I and O, as the one with the smallest weighted sum of squared residuals:

$$ \begin{aligned}{\varvec{C}} = &\mathop {{ {\text{argmin}}}}\limits_{{{\varvec{C}}_{{ {i}}} ,\; { {i}}\in \left\{ {1, \ldots , \left( {\begin{array}{*{20}c} { {m}} \\{ {q}} \\ \end{array} } \right)} \right\}}} \left\{{\hat{\varvec{e}}_{{ {i}}}^{{ {\text{T}}}} {\varvec{Q}}^{ - 1}\hat{\varvec{e}}_{{ {i}}} } \right\}\;\\=& { \;\mathop {{ {\text{argmin}}}}\limits_{{{\varvec{C}}_{{ {i}}},\;{ {i}} \in \left\{ {1, \ldots , \left( {\begin{array}{*{20}c}{ {m}} \\ { {q}} \\ \end{array} } \right)} \right\}}}\left\{ {{\varvec{y}}^{{ {\text{T}}}} \left[ {{\varvec{M}} -{\varvec{MC}}_{{ {i}}} \left({{\varvec{C}}_{{ {i}}}^{{ {\text{T}}}} {\varvec{MC}}_{{ {i}}} }\right)^{ - 1} {\varvec{C}}_{{ {i}}}^{{ {\text{T}}}} {\varvec{M}}}\right]{\varvec{y}}} \right\},} \\ \end{aligned}$$
(19)

where \(\hat{\varvec{e}}_{{ {i}}}\) is the observation residuals calculated via Eq. (18). Note that, when \({ {q}} =0\), there is only one candidate pair of inlying-outlying set, that is \({ {\text{I}}}_{{ {i}}} = \left\{ {y_{1} ,\ldots ,y_{{ {m}}} }\right\}\), and \({ {\text{O}}}_{i} =\emptyset\). In addition, when \({ {q}} = { {m}} -{ {n}}\), \(\hat{\varvec{e}}_{{ {i}}}\) always equals the zero vector and the data division will be invalidated. Therefore, the choice of \(q\) should satisfy that \(0 \le { {q}} \le { {m}} - { {n}}- 1\). Figure 1 gives an example of data division where \({ {m}} =5\) and \({ {q}} =2\).

Fig. 1 Procedure of data division
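The data division of Eq. (19) can be sketched as an exhaustive search over all candidate outlying sets of size q. The following Python sketch is illustrative only (a brute-force search); the function name data_division and the 0-based index convention are assumptions.

```python
import itertools
import numpy as np

def data_division(A, Q, y, q):
    """Return the most likely outlying set of size q (0-based indices) by
    exhaustively scoring every candidate model with Eq. (19)."""
    m = A.shape[0]
    Qinv = np.linalg.inv(Q)
    M = Qinv - Qinv @ A @ np.linalg.inv(A.T @ Qinv @ A) @ A.T @ Qinv
    if q == 0:
        return ()                                       # only the empty outlying set exists
    best_set, best_score = None, np.inf
    for idx in itertools.combinations(range(m), q):
        C = np.zeros((m, q)); C[list(idx), np.arange(q)] = 1.0   # unit vectors of O_i
        P = M - M @ C @ np.linalg.inv(C.T @ M @ C) @ C.T @ M
        score = y @ P @ y                               # weighted sum of squared residuals
        if score < best_score:
            best_set, best_score = idx, score
    return best_set
```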

3.2 Extended w-test

After the data division, the statistical test can then be used to test data in both the inlying set and the outlying set. This procedure usually consists of the following two phases. The initial phase is shown in Fig. 2. Using a data division method such as Eq. (19) with a presumed suspected outlier number \({ {q}}_{0}\), the total data can be initially divided into two subsets: an initial inlying set denoted as \(\rm{I_{0}}\) and an initial outlying set denoted as \(\rm{O_{0}}\). The data in \(\rm{I_{0}}\) and \(\rm{O_{0}}\) are thus considered the suspected inliers and the suspected outliers, respectively. In this case, the linear model is given by:

$$ \left\{ {\begin{array}{*{20}c} {\rm{E}}\left( {\varvec{y}} \right) = {\varvec{A}}{\varvec{x}} + {\varvec{C}}_{0} {\varvec{b}}_{0} , \;{\rm{D}}\left( {\varvec{y}} \right) = \sigma^{2} {\varvec{Q}} \\ {{\varvec{B}}^{{ {\text{T}}}} {\rm{E}}\left( {\varvec{y}} \right) = {\varvec{B}}^{{ {\text{T}}}} {\varvec{C}}_{0} {\varvec{b}}_{0} ,\;{\rm{D}}\left( {\varvec{y}} \right) = \sigma^{2} {\varvec{Q}}} \\ \end{array} } \right., $$
(20)

where \({\varvec{C}}_{0} \in {\mathbb{R}}^{{{ {m}}\; \times \;{ {q}}_{0} }}\) is a design matrix generated by \(\rm{O_{0}}\), and \({\varvec{b}}_{0} \in {\mathbb{R}}^{{{ {q}}_{0} }}\) is the gross error size vector of data in \(\rm{O_{0}}\). The model given in Eq. (20) is called the initial model. Considering the specific structure of \({\varvec{C}}_{0}\), this model is essentially equivalent to Eqs. (1) and (6) with the data in \(\rm{O_{0}}\) excluded.

Fig. 2 Initial phase of the test

In the testing phase, a statistical test can be applied to determine whether a suspected observation, originating from either the initial inlying or outlying set, should be classified as an outlier or an inlier. As shown in Fig. 3, if the suspected observation \(y_{{ {k}}}\) is selected, all data can be divided into three disjoint parts, the inlying set \({ {\text{I}}}_{{ {k}}}\), the outlying set \({ {\text{O}}}_{{ {k}}}\), and the testing set \(\left\{ {y_{{ {k}}} } \right\}\), in which the numbers of data are \({ {m}} - 1 - q_{{ {k}}}\), \({ {q}}_{{ {k}}}\), and 1, respectively. In this case, the linear model in Eq. (20) becomes:

$$\begin{aligned}& \left\{ {\begin{array}{*{20}c} {\rm{E}}\!\left( {\varvec{y}} \right) = {\varvec{A}}{\varvec{x}} + {\varvec{C}}_{{ {k}}} {\varvec{b}}_{{ {k}}} + {\varvec{c}}_{{ {k}}} \nabla_{{ {k}}} , \;{\rm{D}}\!\left( {\varvec{y}} \right) = \sigma^{2} {\varvec{Q}} \\ {{\varvec{B}}^{{ {\text{T}}}} {\rm{E}}\!\left( {\varvec{y}} \right) = {\varvec{B}}^{{ {\text{T}}}} {\varvec{C}}_{{ {k}}} {\varvec{b}}_{{ {k}}} + {\varvec{B}}^{{ {\text{T}}}} {\varvec{c}}_{{ {k}}} \nabla_{{ {k}}} ,\;{\rm{D}}\!\left( {\varvec{y}} \right) = \sigma^{2} {\varvec{Q}}} \\ \end{array} } \right., \\ &\quad { {k}} \in \left\{ {1, \ldots ,{ {m}}} \right\},\end{aligned} $$
(21)

where \({\varvec{c}}_{{ {k}}} \in {\mathbb{R}}^{{ {m}}}\) is the unit design vector generated by \(\left\{ {y_{{ {k}}} } \right\}\), \({\varvec{C}}_{{ {k}}} \in {\mathbb{R}}^{{{ {m}}\; \times \;{ {q}}_{{ {k}}} }}\) is the design matrix generated by \(\rm{O}_{{ {k}}}\). Correspondingly, \({\varvec{b}}_{{ {k}}} \in {\mathbb{R}}^{{{ {q}}_{{ {k}}} }}\) denotes the gross error sizes of data in \({ {\text{O}}}_{{ {k}}}\), and \(\nabla_{{ {k}}}\) represents the gross error size of \(y_{{ {k}}}\). This model is called the testing model, which is equivalent to Eq. (10) with the data in \({ {\text{O}}}_{{ {k}}}\) excluded.

Fig. 3 Testing phase of the test \(\left( {{ {k}} = 3} \right)\)

Using the model given in Eq. (21), the LS estimate of \(\nabla_{{ {k}}}\) is derived as:

$$ \hat{\nabla }_{{ {k}}} = \frac{{{\varvec{c}}_{{ {k}}}^{{ {\text{T}}}} \left[ {{\varvec{M}} - {\varvec{MC}}_{{ {k}}} \left( {{\varvec{C}}_{{ {k}}}^{{ {\text{T}}}} {\varvec{MC}}_{{ {k}}} } \right)^{ - 1} {\varvec{C}}_{{ {k}}}^{{ {\text{T}}}} {\varvec{M}}} \right]{\varvec{y}}}}{{{\varvec{c}}_{{ {k}}}^{{ {\text{T}}}} \left[ {{\varvec{M}} - {\varvec{MC}}_{{ {k}}} \left( {{\varvec{C}}_{{ {k}}}^{{ {\text{T}}}} {\varvec{MC}}_{{ {k}}} } \right)^{ - 1} {\varvec{C}}_{{ {k}}}^{{ {\text{T}}}} {\varvec{M}}} \right]{\varvec{c}}_{{ {k}}} }}. $$
(22)

Proof

See Appendix.

The variance cofactor of \(\hat{\nabla }_{{ {k}}}\) is given by:

$$ q_{{\hat{\nabla }_{{ {k}}} }}^{2} = \frac{1}{{{\varvec{c}}_{{ {k}}}^{{ {\text{T}}}} \left[ {{\varvec{M}} - {\varvec{MC}}_{{ {k}}} \left( {{\varvec{C}}_{{ {k}}}^{{ {\text{T}}}} {\varvec{MC}}_{{ {k}}} } \right)^{ - 1} {\varvec{C}}_{{ {k}}}^{{ {\text{T}}}} {\varvec{M}}} \right]{\varvec{c}}_{{ {k}}} }}. $$
(23)

Proof

See Appendix.

Then, according to Eqs. (22) and (23), the w-test statistics can be extended as:

$$ w_{{ {k}}} = \frac{{\hat{\nabla }_{{ {k}}} }}{{\sigma \sqrt {q_{{\hat{\nabla }_{{ {k}}} }}^{2} } }} = \frac{{{\varvec{c}}_{{ {k}}}^{{ {\text{T}}}} \left[ {{\varvec{M}} - {\varvec{MC}}_{{ {k}}} \left( {{\varvec{C}}_{{ {k}}}^{{ {\text{T}}}} {\varvec{MC}}_{{ {k}}} } \right)^{ - 1} {\varvec{C}}_{{ {k}}}^{{ {\text{T}}}} {\varvec{M}}} \right]{\varvec{y}}}}{{\sigma \sqrt {{\varvec{c}}_{{ {k}}}^{{ {\text{T}}}} \left[ {{\varvec{M}} - {\varvec{MC}}_{{ {k}}} \left( {{\varvec{C}}_{{ {k}}}^{{ {\text{T}}}} {\varvec{MC}}_{{ {k}}} } \right)^{ - 1} {\varvec{C}}_{{ {k}}}^{{ {\text{T}}}} {\varvec{M}}} \right]{\varvec{c}}_{{ {k}}} } }}, $$
(24)

Likewise, the extended w-test statistic \(w_{{ {k}}}\) satisfies that \(w_{{ {k}}} \sim N\left( {\delta ,1} \right)\), where \(\delta = \frac{{\nabla_{{ {k}}} }}{{\sigma \sqrt {q_{{\hat{\nabla }_{{ {k}}} }}^{2} } }}\). If \(y_{{ {k}}}\) is an inlier, then \(\delta = 0\); otherwise, \(\delta \ne 0\). Therefore, following the principle of significance test (Fisher 1925), the extended w-test can be organized as follows. Given a critical value \(k_{\alpha } = N_{{1 - \frac{\alpha }{2}}} \left( {0,1} \right)\), if \(\left| {w_{{{k}}} } \right| \le { {k}}_{\alpha }\), then \(y_{{{k}}}\) is tested as an inlier; otherwise, \(y_{{ {k}}}\) is tested as an outlier.

In addition, if \(\sigma\) in Eq. (24) is unknown, it can be estimated via Eq. (21), as:

$$ \hat{\sigma }_{{ {k}}}^{2} = \frac{{{\varvec{y}}^{{ {\text{T}}}} \left[ {{\varvec{M}} - {\varvec{MG}}_{{ {k}}} \left( {{\varvec{G}}_{{ {k}}}^{{ {\text{T}}}} {\varvec{MG}}_{{ {k}}} } \right)^{ - 1} {\varvec{G}}_{{ {k}}}^{{ {\text{T}}}} {\varvec{M}}} \right]{\varvec{y}}}}{{{ {m}} - { {n}} - q_{{ {k}}} - 1}}. $$
(25)

Proof

See Appendix.

Here, \({\varvec{G}}_{{ {k}}} \in {\mathbb{R}}^{{{ {m}}\; \times \;\left( {q_{{ {k}}} + 1} \right)}}\) is the design matrix generated by \({ {\text{O}}}_{{ {k}}} \cup \left\{ {y_{{ {k}}} } \right\}\), that is \({\varvec{G}}_{{ {k}}} = \left[ {\begin{array}{*{20}c} {{\varvec{C}}_{{ {k}}} } & {{\varvec{c}}_{{ {k}}} } \\ \end{array} } \right]\). Correspondingly, the test statistic and the critical value become \(w_{{ {k}}} = \frac{{\hat{\nabla }_{{ {k}}} }}{{\hat{\sigma }_{{ {k}}} \sqrt {q_{{\hat{\nabla }_{{ {k}}} }}^{2} } }}\sim t\left( {{ {m}} - { {n}} - q_{{ {k}}} - 1,\delta } \right)\) and \(k_{\alpha } = t_{1 - \alpha /2} \left( {{ {m}} - { {n}} - q_{{ {k}}} - 1,0} \right)\). Also, \(w_{{ {k}}}\) can be equivalently transformed as another test statistic, that is \(\frac{{\sqrt {{ {m}} - { {n}} - q_{{ {k}}} } w_{{ {k}}} }}{{\sqrt {{ {m}} - { {n}} - q_{{ {k}}} - 1 + w_{{ {k}}}^{2} } }}\sim \tau \left( {1,\;{ {m}} - { {n}} - q_{{ {k}}} - 1,\;\delta } \right)\).
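A minimal sketch of the extended w-test of Eqs. (22)–(25) is given below; the helper name extended_w, the 0-based index convention, and the default handling of an unknown \(\sigma\) (estimated via Eq. (25) when no value is supplied) are illustrative assumptions.

```python
import numpy as np

def extended_w(A, Q, y, k, outlying, sigma=None):
    """Extended w-test statistic of Eq. (24) for observation k, with the
    indices in `outlying` (not containing k) excluded as the set O_k; if
    sigma is None it is estimated by Eq. (25)."""
    m, n = A.shape
    Qinv = np.linalg.inv(Q)
    M = Qinv - Qinv @ A @ np.linalg.inv(A.T @ Qinv @ A) @ A.T @ Qinv
    c = np.zeros(m); c[k] = 1.0
    q_k = len(outlying)
    if q_k > 0:
        C = np.zeros((m, q_k)); C[list(outlying), np.arange(q_k)] = 1.0
        P = M - M @ C @ np.linalg.inv(C.T @ M @ C) @ C.T @ M
    else:
        P = M                                            # empty outlying set
    nabla_hat = (c @ P @ y) / (c @ P @ c)                # Eq. (22)
    if sigma is None:                                    # sigma unknown, Eq. (25)
        G = np.column_stack([C, c]) if q_k > 0 else c.reshape(-1, 1)
        PG = M - M @ G @ np.linalg.inv(G.T @ M @ G) @ G.T @ M
        sigma = np.sqrt(y @ PG @ y / (m - n - q_k - 1))
    return nabla_hat * np.sqrt(c @ P @ c) / sigma        # Eq. (24)
```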

Furthermore, considering the reliability measure, the MDB of the extended w-test can be given as:

$$ { {\text{MDB}}}_{{ {k}}} \; = \;\delta_{0} \sigma \sqrt {q_{{\hat{\nabla }_{{ {k}}} }}^{2} } \; = \;\frac{{\delta_{0} \sigma }}{{\sqrt {{\varvec{c}}_{{ {k}}}^{{ {\text{T}}}} \left[ {{\varvec{M}} - {\varvec{MC}}_{{ {k}}} \left( {{\varvec{C}}_{{ {k}}}^{{ {\text{T}}}} {\varvec{MC}}_{{ {k}}} } \right)^{ - 1} {\varvec{C}}_{{ {k}}}^{{ {\text{T}}}} {\varvec{M}}} \right]{\varvec{c}}_{{ {k}}} } }}, $$
(26)

where \(\delta_{0}\) is computed by \(N_{{1 - \frac{\alpha }{2}}} \left( {0,1} \right) - N_{\beta } \left( {0,1} \right)\) or \(t_{{1 - \frac{\alpha }{2}}} \left( {{ {m}} - { {n}} - q_{{ {k}}} - 1,0} \right) - t_{\beta } \left( {{ {m}} - { {n}} - q_{{ {k}}} - 1,0} \right)\).

Note that in the w-test for IDS or other iterative testing methods, the test statistic in Eq. (24) has been used as an equivalent form of the classical one in Eq. (11) to test data within the observation system, i.e., in the inlying set \(\rm{I_{0}}\) (Kok 1984; Teunissen 1990, 2000). Additionally, in the extended w-test, both the format and the usage of this test statistic are further extended to encompass a broader range of the testing data, specifically for data outside the observation system, i.e., in the outlying set \(\rm{O_{0}}\). For example, using the extended w-test, one can first choose either data in \(\rm{I_{0}}\) with the largest \(\left| {w_{{ {k}}} } \right|\) or that in \(\rm{O_{0}}\) with the smallest \(\left| {w_{{ {k}}} } \right|\) as the most suspected data. Then, this suspected data can be tested as an outlier or an inlier via evaluating the significance of the test statistic.

Essentially, the extended w-test is based on the principle of using the inlying set to test whether the testing data is an outlier. Consequently, the test performance depends on the number and quality of data in the inlying set. Specifically, from Eq. (26), one can see that a larger sample size of the inlying set results in a reduced MDB, thus enhancing the test power. Additionally, apart from a sufficient sample size, maintaining the purity of the inlying set is also crucial, since contamination of the inlying set might cause the masking or swamping effects. For instance, if there are outliers left in the inlying set \({\rm{I}}_{k}\) during the testing phase, the linear model in Eq. (21) becomes:

$$ \left\{ {\begin{array}{*{20}c} {{\rm{E}}\left( {\varvec{y}} \right) = {\varvec{A}}{\varvec{x}} + {\varvec{C}}_{{ {k}}} {\varvec{b}}_{{ {k}}} + {\varvec{c}}_{{ {k}}} \nabla_{{ {k}}} + {\varvec{C}}_{{ {l}}} {\varvec{b}}_{{ {l}}} , {\rm{D}}\left( {\varvec{y}} \right) = \sigma^{2} {\varvec{Q}}} \\ {{\varvec{B}}^{{ {\text{T}}}} {\rm{E}}\left( {\varvec{y}} \right) = {\varvec{B}}^{{ {\text{T}}}} {\varvec{C}}_{{ {k}}} {\varvec{b}}_{{ {k}}} + {\varvec{B}}^{{ {\text{T}}}} {\varvec{c}}_{{ {k}}} \nabla_{{ {k}}} + {\varvec{B}}^{{ {\text{T}}}} {\varvec{C}}_{{ {l}}} {\varvec{b}}_{{ {l}}} ,\;{\rm{D}}\left( {\varvec{y}} \right) = \sigma^{2} {\varvec{Q}}} \\ \end{array} } \right., $$
(27)

where \({\varvec{C}}_{{ {l}}} \in {\mathbb{R}}^{{{ {m}}\; \times \;{ {q}}_{{ {l}}} }}\) is the design matrix generated by the outliers left in \({\rm{I}}_{{ {k}}}\), and \({\varvec{b}}_{{ {l}}} \in {\mathbb{R}}^{{{ {q}}_{{ {l}}} }}\) denotes the corresponding gross error sizes of these outliers. In this case, the estimate \(\hat{\nabla }_{{ {k}}}\) in Eq. (22) will be biased:

$$ \begin{aligned} { {\rm{E}}}\left( {\hat{\nabla }_{{{k}}} } \right) = &\frac{{{\varvec{c}}_{{{k}}}^{{ {\text{T}}}} \left[ {{\varvec{M}} - {\varvec{MC}}_{{{k}}} \left( {{\varvec{C}}_{{{k}}}^{{ {\text{T}}}} {\varvec{MC}}_{{{k}}} } \right)^{ - 1} {\varvec{C}}_{{{k}}}^{{ {\text{T}}}} {\varvec{M}}} \right]{ {\rm{E}}}\left( {\varvec{y}} \right)}}{{{\varvec{c}}_{{{k}}}^{{ {\text{T}}}} \left[ {{\varvec{M}} - {\varvec{MC}}_{{{k}}} \left( {{\varvec{C}}_{{{k}}}^{{ {\text{T}}}} {\varvec{MC}}_{{{k}}} } \right)^{ - 1} {\varvec{C}}_{{{k}}}^{{ {\text{T}}}} {\varvec{M}}} \right]{\varvec{c}}_{{{k}}} }} \hfill \\ =& \nabla_{{{k}}} + \frac{{{\varvec{c}}_{{{k}}}^{{ {\text{T}}}} \left[ {{\varvec{M}} - {\varvec{MC}}_{{{k}}} \left( {{\varvec{C}}_{{{k}}}^{{ {\text{T}}}} {\varvec{MC}}_{{{k}}} } \right)^{ - 1} {\varvec{C}}_{{{k}}}^{{ {\text{T}}}} {\varvec{M}}} \right]{\varvec{C}}_{{ {l}}} {\varvec{b}}_{{ {l}}} }}{{{\varvec{c}}_{{{k}}}^{{ {\text{T}}}} \left[ {{\varvec{M}} - {\varvec{MC}}_{{{k}}} \left( {{\varvec{C}}_{{{k}}}^{{ {\text{T}}}} {\varvec{MC}}_{{{k}}} } \right)^{ - 1} {\varvec{C}}_{{{k}}}^{{ {\text{T}}}} {\varvec{M}}} \right]{\varvec{c}}_{{{k}}} }}. \hfill \\ \end{aligned} $$
(28)

One can see that the expectation of \(w_{{ {k}}} = \frac{{\hat{\nabla }_{{ {k}}} }}{{\sigma \sqrt {q_{{\hat{\nabla }_{{ {k}}} }}^{2} } }}\) will not be zero even if \(\nabla_{{ {k}}} = 0\). Consequently, some inliers would be identified as outliers, which can be called the swamping effect. Moreover, if the observation precision is unknown, the estimate of \(\sigma^{2}\) in Eq. (25) will also be biased:

$$ \begin{aligned} { {\rm{E}}}\left( {\hat{\sigma }_{{{k}}}^{2} } \right) = \;&\frac{{\rm{E}\left\{ {{\varvec{y}}^{{ {\text{T}}}} \left[ {{\varvec{M}} - {\varvec{MG}}_{{{k}}} \left( {{\varvec{G}}_{{{k}}}^{{ {\text{T}}}} {\varvec{MG}}_{{{k}}} } \right)^{ - 1} {\varvec{G}}_{{{k}}}^{{ {\text{T}}}} {\varvec{M}}} \right]{\varvec{y}}} \right\}}}{{{ {m}} - { {n}} - { {q}}_{{{k}}} - 1}} \hfill \\ = \;&\sigma^{2} + \frac{{{\varvec{b}}_{{ {l}}}^{{ {\text{T}}}} {\varvec{C}}_{{ {l}}}^{{ {\text{T}}}} \left[ {{\varvec{M}} - {\varvec{MG}}_{{{k}}} \left( {{\varvec{G}}_{{{k}}}^{{ {\text{T}}}} {\varvec{MG}}_{{{k}}} } \right)^{ - 1} {\varvec{G}}_{{{k}}}^{{ {\text{T}}}} {\varvec{M}}} \right]{\varvec{C}}_{{ {l}}} {\varvec{b}}_{{ {l}}} }}{{{ {m}} - { {n}} - q_{{{k}}} - 1}}. \hfill \\ \end{aligned} $$
(29)

Here, since \(\frac{{{\varvec{b}}_{{ {l}}}^{{ {\text{T}}}} {\varvec{C}}_{{ {l}}}^{{ {\text{T}}}} \left[ {{\varvec{M}} - {\varvec{MG}}_{{ {k}}} \left( {{\varvec{G}}_{{ {k}}}^{{ {\text{T}}}} {\varvec{MG}}_{{ {k}}} } \right)^{ - 1} {\varvec{G}}_{{ {k}}}^{{ {\text{T}}}} {\varvec{M}}} \right]{\varvec{C}}_{{ {l}}} {\varvec{b}}_{{ {l}}} }}{{{ {m}} - { {n}} - { {q}}_{{ {k}}} - 1}} > 0\), this estimate \(\hat{\sigma }_{{ {k}}}^{2}\) will be enlarged by the outliers left in \({\rm{I}}_{{ {k}}}\), thereby shrinking the size of \(w_{{ {k}}} = \frac{{\hat{\nabla }_{{ {k}}} }}{{\hat{\sigma }_{{ {k}}} \sqrt {q_{{\hat{\nabla }_{{ {k}}} }}^{2} } }}\). In this case, the outliers can be difficult to detect, which is exactly the masking effect.

3.3 Data snooping

Using the extended w-test, data snooping can be reinterpreted as follows. Assuming there is at most one outlier in the initial inlying set \(\rm{I_{0}}\), data snooping is a procedure of traversing all data in the inlying set to find out this outlier.

The procedure of data snooping is given in Fig. 4. Specifically, according to Eqs. (24) and (25), one can construct a test statistic \({{w}}_{k}\) for each data in the initial inlying set, \(y_{{ {k}}} \in \rm{I}_{0}\). In this case, for each \(w_{{ {k}}}\), the inlying set \({\rm{I}}_{{ {k}}}\), outlying set \({\rm{O}}_{{{k}}}\), and testing set \(\left\{ {y_{{ {k}}} } \right\}\) are constructed as follows:

$$ { {\rm{I}}}_{{ {k}}} \cup \left\{ {y_{{ {k}}} } \right\} = {\rm I}_0 \;{ {\rm{and}}}\; {\rm{O}}_{{ {k}}} = \rm{O}_{0} . $$
(30)
Fig. 4 Procedure of data snooping

Then, the data with the largest test statistic is considered the most suspected outlier in \(\rm{I_{0}}\). Generally, data snooping consists of the following three parts.

Detection: with a significance level \(\alpha\), the significance of the largest test statistic is tested by comparing it with the critical value \({{k}}_{\alpha }\). Once the extended w-test fails, which is given as follows:

$$ \mathop {\max }\limits_{{ {k}}} \left| {w_{{{k}}} } \right| > k_{\alpha } , y_{{{k}}} \in \rm{I_{0}} , $$
(31)

then it turns to the identification step.

Identification: the data with the largest test statistic is then identified as an outlier and put into the outlying set.

Adaptation: the LS is implemented for parameter estimation using the data in the inlying set.
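For concreteness, the detection and identification steps above can be sketched as follows for the known-\(\sigma\) case; the function name snoop_once, the index sets I0 and O0, and the default \(\alpha\) are illustrative assumptions, and the adaptation step (the final LS run) is left to the caller.

```python
import numpy as np
from scipy import stats

def snoop_once(A, Q, y, sigma, I0, O0, alpha=0.001):
    """One data snooping pass: return the index of the identified outlier in
    I0, or None if the detection test of Eq. (31) does not fire."""
    m = A.shape[0]
    Qinv = np.linalg.inv(Q)
    M = Qinv - Qinv @ A @ np.linalg.inv(A.T @ Qinv @ A) @ A.T @ Qinv
    if O0:                                               # Eq. (30): O_k = O_0 for every k
        C = np.zeros((m, len(O0))); C[list(O0), np.arange(len(O0))] = 1.0
        P = M - M @ C @ np.linalg.inv(C.T @ M @ C) @ C.T @ M
    else:
        P = M
    w = {k: (P[k] @ y) / (sigma * np.sqrt(P[k, k])) for k in I0}   # Eq. (24)
    k_star = max(w, key=lambda j: abs(w[j]))             # most suspected outlier
    if abs(w[k_star]) > stats.norm.ppf(1 - alpha / 2):   # detection
        return k_star                                    # identification
    return None
```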

3.4 Data refining

Similarly, assuming there is at most one inlier in the initial outlying set \(\rm{O}_{0}\), one can then traverse all data in the outlying set and find out this inlier. This procedure is called data refining.

The procedure of data refining is given in Fig. 5. Here, according to Eqs. (24) and (25), one can construct a test statistic \(w_{{ {k}}}\) for each data in the initial outlying set, \(y_{{ {k}}} \in { {\rm{O}}}_{0}\). In this case, for each \(w_{{ {k}}}\), we have:

$$ {\rm{O}}_{{ {k}}} \cup \left\{ {y_{{ {k}}} } \right\} = {\rm{O}}_{0} \;{ {\rm{and}}}\; {\rm {I}}_k = \rm{I}_{0} . $$
(32)
Fig. 5 Procedure of data refining

Then, the data with the smallest test statistic is considered the most suspected inlier in \(\rm{O_{0}}\). Likewise, data refining consists of the following three parts.

Detection: with a significance level \(\alpha\), the significance of the smallest w-test statistic is tested by comparing it with the critical value \({ {k}}_{\alpha }\). Once the w-test is passed, which is given as follows:

$$ \mathop {\min }\limits_{{{k}}} \left| {w_{{ {k}}} } \right| \le { {k}}_{\alpha } , y_{{ {k}}} \in \rm{O_{0}} , $$
(33)

then it turns to the identification step.

Identification: the data with the smallest test statistic is then identified as an inlier and put into the inlying set.

Adaptation: the LS is implemented for parameter estimation using the data in the inlying set.
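Analogously, one data refining pass can be sketched as follows for the known-\(\sigma\) case; the function name refine_once and its arguments are illustrative assumptions, and again the adaptation step is left to the caller.

```python
import numpy as np
from scipy import stats

def refine_once(A, Q, y, sigma, O0, alpha=0.001):
    """One data refining pass: return the index of the identified inlier in
    O0, or None if even the least suspicious candidate fails the test."""
    m = A.shape[0]
    Qinv = np.linalg.inv(Q)
    M = Qinv - Qinv @ A @ np.linalg.inv(A.T @ Qinv @ A) @ A.T @ Qinv
    w = {}
    for k in O0:
        Ok = [j for j in O0 if j != k]                   # Eq. (32): O_k = O_0 without y_k
        if Ok:
            C = np.zeros((m, len(Ok))); C[Ok, np.arange(len(Ok))] = 1.0
            P = M - M @ C @ np.linalg.inv(C.T @ M @ C) @ C.T @ M
        else:
            P = M
        w[k] = (P[k] @ y) / (sigma * np.sqrt(P[k, k]))   # extended w-test, Eq. (24)
    k_star = min(w, key=lambda j: abs(w[j]))             # most suspected inlier
    if abs(w[k_star]) <= stats.norm.ppf(1 - alpha / 2):  # detection, Eq. (33)
        return k_star                                    # identification
    return None
```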

4 Iterative data snooping and iterative data refining

After data division, data snooping and data refining can be employed to find a single outlier or inlier. However, the exact outlier number, denoted as \({ {q}}_{*}\), is often unknown in practical applications. In such cases, given a range of possible outlier numbers based on some knowledge, one can then diagnose outliers within this range. This range is typically defined by setting the minimum and maximum suspected outlier numbers, denoted as \({ {q}}_{{{ {\rm{min}}}}}\) and \({ {q}}_{{{ {\rm{max}}}}}\), respectively.

In this section, the iterative forms of data snooping and data refining are presented, which are called IDS and IDR (iterative data refining), respectively. Then, the differences between IDS and IDR are discussed from the perspectives of robustness and accuracy, choice of significance level, and computational cost.

4.1 Iterative data snooping (IDS)

If all data are divided into an outlying set and an inlying set during the initialization, then iterative data snooping (IDS) can be considered as a process of picking the data tested as outliers from the inlying set to the outlying set one by one.

As shown in Fig. 6, the initialization of IDS \(\left( {t = 0} \right)\) is organized as follows. Given the minimum suspected outlier number \({ {q}}_{{{ {\rm{min}}}}}\), then all of the candidate pairs of the inlying-outlying set can be listed as \({\rm{I}}_{{ {i}}}^{\left( 0 \right)} \cup {\rm{O}}_{{ {i}}}^{\left( 0 \right)} = \left\{ {y_{1} , \ldots ,y_{{ {m}}} } \right\},\;{ {i}} \in \left\{ {1, \ldots , \left( {\begin{array}{*{20}c} { {m}} \\ {{ {q}}_{{{ {\rm{min}}}}} } \\ \end{array} } \right)} \right\}.\) According to Eq. (19), one can find the most possible pair of inlying-outlying sets to construct the initial model in the first iteration \(\left( {t = 1} \right)\):

$$\begin{aligned}{\varvec{C}}_0^{\left( 1 \right)} = &\mathop {{\rm{argmin}}}\limits_{{\varvec{C}}_i^{\left( 0 \right)}} \left\{ {{\varvec{y}^{\rm{\text{T}}}}\left[ {\varvec{M - MC}_i^{\left( 0 \right)}{{\left( {\varvec{C}_i^{\left( 0 \right){\rm{\text{T}}}}\varvec{MC}_i^{\left( 0 \right)}} \right)}^{ - 1}}\varvec{C}_i^{\left( 0 \right){\rm{\text{T}}}}\varvec{M}} \right]\varvec{y}} \right\},\\ &\qquad \qquad \qquad i \in \left\{ {1, \ldots ,\;\left( {\begin{array}{*{20}{c}}m\\{{q_{{\rm{min}}}}}\end{array}} \right)} \right\}\end{aligned}$$
(34)

where \({\varvec{C}}_{{ {i}}}^{\left( 0 \right)} \in {\mathbb{R}}^{{{ {m}}\; \times \;{ {q}}_{{{ {\rm{min}}}}} }}\) is the design matrix generated by the outlying set \({\rm{O}}_{{ {i}}}^{\left( 0 \right)}\). Note that \({ {q}}_{{{ {\rm{min}}}}}\) is usually set as 0 to avoid dropping any inliers in most cases (Kok 1984; Lehmann and Scheffler 2011; Rofatto et al. 2017; Klein et al. 2022).

Fig. 6 Initialization procedure of IDS

As shown in Fig. 7, the iteration procedure in IDS is organized as follows. Assuming in the tth iteration, \({\rm{I_{0}}}^{\left( t \right)}\) and \({\rm{O_{0}}}^{\left( t \right)}\) are the initial inlying and outlying set, respectively, one could then construct test statistics for each data in the inlying set, which are \(w_{{ {k}}}^{\left( t \right)}\), \(y_{{ {k}}} \in {\rm{I}}_{0}^{\left( t \right)}\). In this case, for each \(w_{{ {k}}}^{\left( t \right)}\), observations are divided into three disjoint parts, outlying set \({\rm{O}}_{{ {k}}}^{\left( t \right)}\), inlying set \({\rm{I}}_{{ {k}}}^{\left( t \right)}\) and testing set \(\left\{ {y_{{ {k}}} } \right\}\), in which the elements numbers are \({ {q}}_{{{ {\rm{min}}}}} + t - 1\), \({ {m}} - { {q}}_{{{ {\rm{min}}}}} - t\) and 1, respectively. The relationship among them is given by:

$$ {\rm{I}}_{{ {k}}}^{\left( t \right)} \cup \left\{ {y_{{ {k}}} } \right\} = {\rm{I}}_{0}^{\left( t \right)} \;{ {and}}\; {\rm{O}}_{{ {k}}}^{\left( t \right)} = {\rm{O}}_{0}^{\left( t \right)} , $$
(35)
Fig. 7 Iteration procedure in IDS

According to Eq. (24), the extended w-test statistic is given as follows (Kok 1984; Teunissen 1990, 2000):

$$ w_{{ {k}}}^{\left( t \right)} = \frac{{{\varvec{c}}_{{ {k}}}^{{\left( t \right){ {\text{T}}}}} \left[ {{\varvec{M}} - {\varvec{MC}}_{{ {k}}}^{\left( t \right)} \left( {{\varvec{C}}_{{ {k}}}^{{\left( t \right){ {\text{T}}}}} {\varvec{MC}}_{{ {k}}}^{\left( t \right)} } \right)^{ - 1} {\varvec{C}}_{{ {k}}}^{{\left( t \right){ {\text{T}}}}} {\varvec{M}}} \right]{\varvec{y}}}}{{\sigma \sqrt {{\varvec{c}}_{{ {k}}}^{{\left( t \right){ {\text{T}}}}} \left[ {{\varvec{M}} - {\varvec{MC}}_{{ {k}}}^{\left( t \right)} \left( {{\varvec{C}}_{{ {k}}}^{{\left( t \right){ {\text{T}}}}} {\varvec{MC}}_{{ {k}}}^{\left( t \right)} } \right)^{ - 1} {\varvec{C}}_{{ {k}}}^{{\left( t \right){ {\text{T}}}}} {\varvec{M}}} \right]{\varvec{c}}_{{ {k}}}^{\left( t \right)} } }}, $$
(36)

where \({\varvec{c}}_{{ {k}}}^{\left( t \right)} \in {\mathbb{R}}^{{ {m}}}\) and \({\varvec{C}}_{{ {k}}}^{\left( t \right)} \in {\mathbb{R}}^{{{ {m}}\; \times \;\left( {{ {q}}_{{{ {\rm{min}}}}} + t - 1} \right)}}\) are generated by \(\left\{ {y_{{ {k}}} } \right\}\) and \({\rm{O}}_{{ {k}}}^{\left( t \right)}\), respectively. Particularly, if \(\sigma\) is unknown, it can be estimated via Eq. (25) as:

$$ \hat{\sigma }_{{ {k}}}^{2\left( t \right)} = \frac{{{\varvec{y}}^{{ {\text{T}}}} \left[ {{\varvec{M}} - {\varvec{MG}}_{{ {k}}}^{\left( t \right)} \left( {{\varvec{G}}_{{ {k}}}^{{\left( t \right){ {\text{T}}}}} {\varvec{MG}}_{{ {k}}}^{\left( t \right)} } \right)^{ - 1} {\varvec{G}}_{{ {k}}}^{{\left( t \right){ {\text{T}}}}} {\varvec{M}}} \right]{\varvec{y}}}}{{{ {m}} - { {n}} - { {q}}_{{{ {\rm{min}}}}} - t}}, $$
(37)

where \({\varvec{G}}_{{ {k}}}^{\left( t \right)} = \left[ {\begin{array}{*{20}c} {{\varvec{C}}_{{ {k}}}^{\left( t \right)} } & {{\varvec{c}}_{{ {k}}}^{\left( t \right)} } \\ \end{array} } \right]\). The procedure of IDS consists of the following three parts.

Detection: given a significance level \(\alpha\), the largest test statistic is compared to the critical value \({{k}}_{\alpha }^{\left( t \right)}\). Once the extended w-test fails, which is given as follows:

$$ \mathop {\max }\limits_{{ {k}}} \left| {w_{{ {k}}}^{\left( t \right)} } \right| > {{k}}_{\alpha }^{\left( t \right)} ,y_{{ {k}}} \in {\rm{I_{0}}}^{\left( t \right)} , $$
(38)

with

$$ k_{\alpha }^{\left( t \right)} = N_{{1 - \frac{\alpha }{2}}} \left( {0,1} \right)\;{ { or}}\; t_{{1 - \frac{\alpha }{2}}} \left( {{ {m}} - { {n}} - { {q}}_{{{ {\rm{min}}}}} - t,0} \right), $$
(39)

then it turns to the identification step.

Identification: the data with the largest test statistic in the inlying set is identified as an outlier and put into the outlying set.

Adaptation: when the iteration is terminated, the LS is implemented for parameter estimation using the data in the inlying set. The terminating condition is that the data number in the outlying set is equal to the maximum suspected outlier number \({ {q}}_{{{ {\rm{max}}}}}\), or all data in the inlying set are considered inliers. Note that \({ {q}}_{{{ {\rm{max}}}}}\) is usually less than \({ {m}} - { {n}}\) to make the parameters estimable.
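Putting the three steps together, the complete IDS loop can be sketched as follows for the known-\(\sigma\) case with \(q_{\rm{min}} = 0\); the function name ids and the default \(\alpha\) are illustrative assumptions, and the adaptation (final LS run on the inlying set) is left to the caller.

```python
import numpy as np
from scipy import stats

def ids(A, Q, y, sigma, q_max, alpha=0.001):
    """Iterative data snooping with q_min = 0 and known sigma; returns the
    final outlying index set."""
    m = A.shape[0]
    Qinv = np.linalg.inv(Q)
    M = Qinv - Qinv @ A @ np.linalg.inv(A.T @ Qinv @ A) @ A.T @ Qinv
    k_alpha = stats.norm.ppf(1 - alpha / 2)
    outlying = []
    while len(outlying) < q_max:
        if outlying:
            C = np.zeros((m, len(outlying))); C[outlying, np.arange(len(outlying))] = 1.0
            P = M - M @ C @ np.linalg.inv(C.T @ M @ C) @ C.T @ M
        else:
            P = M
        inlying = [k for k in range(m) if k not in outlying]
        w = {k: (P[k] @ y) / (sigma * np.sqrt(P[k, k])) for k in inlying}   # Eq. (36)
        k_star = max(w, key=lambda j: abs(w[j]))
        if abs(w[k_star]) <= k_alpha:         # detection of Eq. (38) does not fire
            break
        outlying.append(k_star)               # identification
    return outlying
```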

4.2 Iterative data refining (IDR)

Conversely, if all data are divided into an inlying set and an outlying set during the initialization, then iterative data refining (IDR) is a process of picking the data tested as inliers from the outlying set to the inlying set, one after another.

Likewise, the initialization of IDR \(\left( {t = 0} \right)\) is shown in Fig. 8. Given the maximum suspected outlier number \({ {q}}_{{{ {\rm{max}}}}}\), then all of the potential pairs of inlying-outlying sets can be listed as \({\rm{I}}_{{ {i}}}^{\left( 0 \right)} \cup {\rm{O}}_{{ {i}}}^{\left( 0 \right)} = \left\{ {y_{1} , \ldots ,y_{{ {m}}} } \right\},\;{ {i}} \in \left\{ {1, \ldots , \left( {\begin{array}{*{20}c} { {m}} \\ {{ {q}}_{{{ {\rm{max}}}}} } \\ \end{array} } \right)} \right\}.\) According to Eq. (19), one can find the most possible pair of inlying-outlying sets to obtain the initial model in the first iteration \(\left( {t = 1} \right)\):

$$\begin{aligned}{\varvec{C}}_0^{\left( 1 \right)} =&\mathop {{\rm{argmin}}}\limits_{{\varvec{C}}_i^{\left( 0 \right)}} \left\{ {{\varvec{y}^{\rm{\text{T}}}}\left[ {\varvec{M - MC}_i^{\left( 0 \right)}{{\left( {\varvec{C}_i^{\left( 0 \right){\rm{\text{T}}}}\varvec{MC}_i^{\left( 0 \right)}} \right)}^{ - 1}}\varvec{C}_i^{\left( 0 \right){\rm{\text{T}}}}\varvec{M}} \right]\varvec{y}} \right\},\\ & \qquad \qquad \qquad i \in \left\{ {1, \ldots ,\;\left( {\begin{array}{*{20}{c}}m\\{{q_{{\rm{max}}}}}\end{array}} \right)} \right\}\end{aligned}$$
(40)

where \({\varvec{C}}_{{ {i}}}^{\left( 0 \right)} \in {\mathbb{R}}^{{{ {m}}\; \times \;{ {q}}_{{{ {\rm{max}}}}} }}\) is the design matrix generated by the outlying set \({ {\rm{O}}}_{{ {i}}}^{\left( 0 \right)}\). Note that \({ {q}}_{{{ {\rm{max}}}}}\) should be less than \({ {m}} - { {n}}\) to make the data division effective.

Fig. 8 Initialization procedure of IDR

The iteration procedure in IDR is shown in Fig. 9. Assuming in the tth iteration, \({\rm{I}}_{0}^{\left( {t} \right)}\) and \({\rm{O}}_{0}^{\left( {t} \right)}\) are the initial inlying and outlying sets, respectively, one could then construct test statistics for each data in the outlying set, which are \(w_{{ {k}}}^{\left( t \right)}\), \(y_{{ {k}}} \in {\rm{O}_{0}^{\left(t\right)}}\). For each \(w_{k}^{\left( t \right)}\), all data are divided into three disjoint parts, outlying set \({\rm{O}}_{{ {k}}}^{\left( t \right)}\), inlying set \({\rm{I}}_{{ {k}}}^{\left( t \right)}\) and testing set \(\left\{ {y_{{ {k}}} } \right\}\), whose element numbers are \(q_{{{ {\rm{max}}}}} - t\), \(m - q_{{{ {\rm{max}}}}} + t - 1\), and 1, respectively. The relationship among them is given by:

$$ {\rm{O}}_{{ {k}}}^{\left( t \right)} \cup \left\{ {y_{{ {k}}} } \right\} = {\rm{O}}_{0}^{\left( t \right)} \; { {and}}\; {\rm{I}}_{{ {k}}}^{\left( t \right)} = {\rm{I}}_{0}^{\left( t \right)} , $$
(41)
Fig. 9 Iteration procedure in IDR

According to Eq. (24), the extended w-test statistic is given by:

$$ w_{{ {k}}}^{\left( t \right)} =\frac{{{\varvec{c}}_{{ {k}}}^{{\left( t \right){ {\text{T}}}}}\left[ {{\varvec{M}} - {\varvec{MC}}_{{ {k}}}^{\left( t \right)}\left( {{\varvec{C}}_{{ {k}}}^{{\left( t \right){ {\text{T}}}}}{\varvec{MC}}_{{ {k}}}^{\left( t \right)} } \right)^{ - 1}{\varvec{C}}_{{ {k}}}^{{\left( t \right){ {\text{T}}}}}{\varvec{M}}} \right]{\varvec{y}}}}{{\sigma \sqrt {{\varvec{c}}_{{ {k}}}^{{\left( t \right){ {\text{T}}}}} \left[{{\varvec{M}} - {\varvec{MC}}_{{ {k}}}^{\left( t \right)} \left({{\varvec{C}}_{{ {k}}}^{{\left( t \right){ {\text{T}}}}}{\varvec{MC}}_{{ {k}}}^{\left( t \right)} } \right)^{ - 1}{\varvec{C}}_{{ {k}}}^{{\left( t \right){ {\text{T}}}}}{\varvec{M}}} \right]{\varvec{c}}_{{ {k}}}^{\left( t \right)} }}}, $$
(42)

where \({\varvec{c}}_{{ {k}}}^{\left( t \right)} \in {\mathbb{R}}^{{ {m}}}\) and \({\varvec{C}}_{{ {k}}}^{\left( t \right)} \in {\mathbb{R}}^{{{ {m}}\; \times \;\left({{ {q}}_{{{ {\rm{max}}}}} - t}\right)}}\) are generated by \(\left\{ {y_{{ {k}}} }\right\}\) and \({\rm{O}}_{{ {k}}}^{\left( t \right)}\), respectively. Particularly, if \(\sigma\) is unknown, it can be estimated via Eq. (25) as:

$$ \hat{\sigma }_{{ {k}}}^{2\left( t \right)} = \frac{{{\varvec{y}}^{{ {\text{T}}}} \left[ {{\varvec{M}} -{\varvec{MG}}_{{ {k}}}^{\left( t \right)} \left({{\varvec{G}}_{{ {k}}}^{{\left( t \right){ {\text{T}}}}}{\varvec{MG}}_{{ {k}}}^{\left( t \right)} } \right)^{ - 1}{\varvec{G}}_{{ {k}}}^{{\left( t \right){ {\text{T}}}}}{\varvec{M}}} \right]{\varvec{y}}}}{{{ {m}} - { {n}} -{ {q}}_{{{ {\rm{max}}}}} + t - 1}},$$
(43)

where \({\varvec{G}}_{{ {k}}}^{\left( t \right)} = \left[ {\begin{array}{*{20}c} {{\varvec{C}}_{{ {k}}}^{\left( t \right)} } & {{\varvec{c}}_{{ {k}}}^{\left( t \right)} } \\ \end{array} } \right]\). Similarly, the procedure of IDR consists of the following three parts.

Detection: given a significance level \(\alpha\), the smallest test statistic is compared to the critical value \({{k}}_{\alpha }^{\left( t \right)}\). Once the extended w-test passes, which is given as follows:

$$ \mathop {\min }\limits_{{ {k}}} \left| {w_{{ {k}}}^{\left( t \right)} } \right| \le {{k}}_{\alpha }^{\left( t \right)} ,y_{{ {k}}} \in {\rm{O}}_{0}^{\left( t \right)} , $$
(44)

with

$$ {{k}}_{\alpha }^{\left( t \right)} = N_{{1 - \frac{\alpha }{2}}} \left( {0,1} \right)\;{ { or}}\; t_{{1 - \frac{\alpha }{2}}} \left( {{ {m}} - { {n}} - { {q}}_{{{ {\rm{max}}}}} + t - 1,0} \right), $$
(45)

then it turns to the identification step.

Identification: the data with the smallest test statistic in the outlying set is identified as an inlier and put into the inlying set.

Adaptation: when the iteration is terminated, the LS is implemented for parameter estimation using the data in the inlying set. The terminating condition is that the data number in the outlying set equals the minimum suspected outlier number \({ {q}}_{{{ {\rm{min}}}}}\), or all data in the outlying set are tested as outliers. Likewise, \({ {q}}_{{{ {\rm{min}}}}}\) is usually set as 0 to avoid dropping any inliers.
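Likewise, the complete IDR loop can be sketched as follows for the known-\(\sigma\) case with \(q_{\rm{min}} = 0\); the exhaustive initialization follows Eq. (40), the function name idr and the default \(\alpha\) are illustrative assumptions, and the final LS run is again left to the caller.

```python
import itertools
import numpy as np
from scipy import stats

def idr(A, Q, y, sigma, q_max, alpha=0.001):
    """Iterative data refining with q_min = 0 and known sigma; returns the
    final outlying index set."""
    m = A.shape[0]
    Qinv = np.linalg.inv(Q)
    M = Qinv - Qinv @ A @ np.linalg.inv(A.T @ Qinv @ A) @ A.T @ Qinv
    k_alpha = stats.norm.ppf(1 - alpha / 2)

    def projector(idx):
        if not idx:
            return M
        C = np.zeros((m, len(idx))); C[list(idx), np.arange(len(idx))] = 1.0
        return M - M @ C @ np.linalg.inv(C.T @ M @ C) @ C.T @ M

    # initialization (t = 0): exhaustive data division of Eq. (40)
    outlying = list(min(itertools.combinations(range(m), q_max),
                        key=lambda idx: y @ projector(idx) @ y))

    # iterations (t = 1, 2, ...): Eqs. (42) and (44)
    while outlying:
        w = {}
        for k in outlying:
            P = projector([j for j in outlying if j != k])
            w[k] = (P[k] @ y) / (sigma * np.sqrt(P[k, k]))
        k_star = min(w, key=lambda j: abs(w[j]))
        if abs(w[k_star]) > k_alpha:          # all remaining data are tested as outliers
            break
        outlying.remove(k_star)               # identification: move the inlier back
    return outlying
```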

4.3 Precision and robustness

The procedure diagram of IDS and IDR is illustrated in Fig. 10. Generally, for IDS, all data are usually considered inliers and put into the inlying set in the initial stage. In the following iterations, the extended w-test is used to transfer the data tested as outliers from the inlying set to the outlying set. This iterative process continues until either the amount of data in the outlying set reaches an upper threshold or there are no more suspected outliers left in the inlying set. Conversely, for IDR, the data are divided into an inlying set and a non-empty outlying set in the initial stage. In the following iterations, the extended w-test is used to move the data tested as inliers from the outlying set to the inlying set. Finally, the process is terminated when either the number of data in the outlying set reaches a lower threshold or all data in the outlying set are tested as outliers.

Fig. 10 Procedures of IDS and IDR

Generally, both IDS and IDR are based on the extended w-test, which uses the inlying set to test whether the suspected data are inliers or outliers. Therefore, as analyzed in Sect. 3.2, the test performance depends on the number and quality of data in the inlying set. For IDS, although the amount of data in the inlying set is sufficient, the inlying set is usually contaminated in the initial stage in the case of multiple outliers. As a consequence, a contaminated inlying set might invalidate the subsequent test decisions, thereby compromising the detection and identification of outliers, which is exactly the cause of the masking and swamping effects. Conversely, for IDR, the suspected outliers are moved out of the inlying set as much as possible in the initial stage. This ensures a more reliable inlying set, thereby mitigating the masking and swamping effects and enhancing the credibility of the subsequent tests. Note that this advantage is based on the premise that there are sufficient inliers in the observation system. If the inlying set contains too few data, the test results may also become untrustworthy due to the low reliability. In conclusion, compared to IDS, IDR shows stronger robustness when dealing with multiple outliers, but might pose a greater risk of precision loss if there is an insufficient amount of data in the inlying set.

4.4 Choice of significance level

Generally, the choice of significance level \(\alpha\) for both IDS and IDR should be guided by the principle of controlling the overall false alarm rate, also known as the overall type I error rate. In practice, given an adjustment network, one can first establish the mapping relations between \(\alpha\) and the overall false alarm rates. Subsequently, based on these mapping relations, the required \(\alpha\) can be determined by specifying an overall false alarm rate. Monte Carlo simulation (MCS) proves useful in this process: the false alarm rate is counted while systematically adjusting \(\alpha\) within a specified range, and the significance level \(\alpha\) corresponding to a given false alarm rate can then be derived (Lehmann 2012; Rofatto et al. 2020b). In particular, if the observation precision \(\sigma\) is known, an overall test can be implemented at the beginning of IDS and IDR to regulate the overall false alarm rate (Kok 1984; Teunissen 2000). Consequently, the subsequent determination of \(\alpha\) becomes more flexible in this regard. For example, one can choose a different \(\alpha\) in each iteration by considering the number and correlations of the test statistics.
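The MCS calibration described above can be sketched as follows. This is an illustrative sketch only: the function name calibrate_alpha, the run count, and the assumed callable diagnose(A, Q, y, sigma, alpha), which returns the set of rejected observations, are all assumptions; the IDS or IDR sketches above would be wrapped to this signature, e.g. via a small lambda that fixes \(q_{\rm{max}}\).

```python
import numpy as np

def calibrate_alpha(diagnose, A, Q, sigma, alphas, target_pfa, n_runs=2000, seed=0):
    """Empirically map candidate significance levels to overall false alarm
    rates using outlier-free simulations, and pick the alpha whose rate is
    closest to the target."""
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    L = np.linalg.cholesky(Q)
    x_true = np.zeros(A.shape[1])                  # any true parameter vector will do
    rates = []
    for alpha in alphas:
        false_alarms = 0
        for _ in range(n_runs):
            y = A @ x_true + sigma * (L @ rng.standard_normal(m))   # outlier-free sample
            if len(diagnose(A, Q, y, sigma, alpha)) > 0:            # any rejection is a false alarm
                false_alarms += 1
        rates.append(false_alarms / n_runs)
    best = int(np.argmin(np.abs(np.asarray(rates) - target_pfa)))
    return alphas[best], rates
```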

4.5 Computational cost

Both IDS and IDR are computationally expensive when the amount of data is large since there would be quantities of test statistics to be computed. From Eqs. (19) and (24), one can see that the main time-consuming task is to compute the inversion of \({\varvec{C}}_{{ {k}}}^{{ {\text{T}}}} {\varvec{MC}}_{{ {k}}}\), a symmetric matrix with dimension \(q_{{ {k}}} \; \times \;q_{{ {k}}}\). The computational complexity of such an operation is of the order \({\rm{O}}\left( {q_{{ {k}}}^{3} } \right)\) (Lehmann and Lösler 2016). Generally, for IDS, there would be \(\left( {\begin{array}{*{20}c} { {m}} \\ {{ {q}}_{{{ {\rm{min}}}}} } \\ \end{array} } \right)\) matrices of dimension \({ {q}}_{{{ {\rm{min}}}}} \; \times \;{ {q}}_{{{ {\rm{min}}}}}\) in the initial stage, and \({ {m}} - { {q}}\) matrices of dimension \({ {q}}\; \times \;{ {q}}\) during the iteration where \({ {q}}\) ranges from \({ {q}}_{{{ {\rm{min}}}}}\) to \({ {q}}_{{{ {\rm{max}}}}} - 1\). As for IDR, there would be \(\left( {\begin{array}{*{20}c} { {m}} \\ {{ {q}}_{{{ {\rm{max}}}}} } \\ \end{array} } \right)\) matrices of dimension \({ {q}}_{{{ {\rm{max}}}}} \; \times \;{ {q}}_{{{ {\rm{max}}}}}\) in the initial stage, and q matrices of dimension \({ {q}}\; \times \;{ {q}}\) during the iteration where q ranges from \({ {q}}_{{{ {\rm{max}}}}}\) to \({ {q}}_{{{ {\rm{min}}}}} + 1\). Comparatively, IDR shows more time consumption than IDS, due to the extra computational cost in the initial stage.

5 Example

In this section, an example is used to evaluate the performance of IDR for dealing with outliers. It is useful to elaborate on the theoretical considerations with a simple practical example. Thus, a numerical example of linear fitting with m equidistant data points is given. With error-free abscissae \({ {i}} = 1, \ldots ,{ {m}}\), the observations in Eq. (1) are \(y_{{ {i}}} = x_{1} + { {i}}x_{2} + e_{{ {i}}} ,e_{{ {i}}} \sim N\left( {0, \sigma^{2} } \right)\), where \({ {m}} = 10\), \({ {n}} = 2\) (Lehmann and Lösler 2016). The unknown parameters \(x_{1} = 0\) and \(x_{2} = 1\) denote the intercept and slope parameter, respectively. Correspondingly, the design matrix is given by:

$$ {\varvec{A}} = \left[ {\begin{array}{*{20}c} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & {10} \\ \end{array} } \right]^{{ {\text{T}}}} . $$
(46)

In addition, the variance–covariance cofactor matrix is given as \({\varvec{Q}} = {\varvec{I}}_{{ {m}}}\), and the standard deviation of unit weight \(\sigma\) is set as 1.

The numerical example is conducted via MCS. In each simulation, in addition to the normally distributed random errors, gross errors of different numbers and sizes are added to the observations, where the outlier number \({ {q}}_{*}\) ranges from 1 to 3 and the gross error size \(\nabla\) ranges from 1 to 30 times the observation precision. For fairness, the preset minimum outlier number \({ {q}}_{{{ {\rm{min}}}}}\) is set to 0 and the maximum outlier number \({ {q}}_{{{ {\rm{max}}}}}\) is set to 5 for both IDS and IDR. Finally, both the case of known \(\sigma\) and that of unknown \(\sigma\) are considered to cover different application scenarios.
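A minimal sketch of one MCS replication is given below, assuming the model of Eq. (46) with \(x_{1} = 0\), \(x_{2} = 1\), and \({\varvec{Q}} = {\varvec{I}}_{m}\); function and variable names are illustrative only and do not reproduce the exact simulation code used for the figures.

```python
import numpy as np

def simulate_observations(m=10, sigma=1.0, n_outliers=0, gross_size=0.0, rng=None):
    """Generate one Monte Carlo replication of the linear-fit example:
    y_i = x1 + i*x2 + e_i with x1 = 0, x2 = 1, e_i ~ N(0, sigma^2),
    plus gross errors of `gross_size` (in units of sigma) added to
    `n_outliers` randomly chosen observations."""
    rng = np.random.default_rng(rng)
    i = np.arange(1, m + 1)
    A = np.column_stack([np.ones(m), i])        # design matrix of Eq. (46)
    x_true = np.array([0.0, 1.0])               # intercept and slope
    y_bar = A @ x_true                          # true (error-free) observations
    y = y_bar + rng.normal(0.0, sigma, size=m)  # add random errors
    outlier_idx = rng.choice(m, size=n_outliers, replace=False)
    y[outlier_idx] += gross_size * sigma        # inject gross errors
    return A, y, y_bar, outlier_idx
```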

In the following discussion, the DIA probability levels (Teunissen 2018; Zaminpardaz and Teunissen 2019; Yang et al. 2021) are used to evaluate the performance of IDS and IDR; they can be calculated via MCS (Hekimoglu and Koch 1999; Rofatto et al. 2020b). Specifically, when there is no outlier, \(P_{{{ {\rm{FA}}}}}\) denotes the probability of a false alarm (FA), i.e., of rejecting any inlier. When there are outliers, \(P_{{{ {\rm{CD}}}}}\) denotes the probability of correct detection (CD), i.e., of rejecting at least one outlier, and \(P_{{{ {\rm{CI}}}}}\) denotes the probability of correct identification (CI), i.e., of rejecting all outliers. In addition, the fitting root-mean-square error (RMSE) is used to evaluate the robustness of these methods. The RMSE is calculated by \({\rm{RMSE}} = \sqrt{\frac{1}{m}\sum_{i=1}^{m} \left( \hat{y}_{i} - \overline{y}_{i} \right)^{2}}\), where \(\hat{y}_{{ {i}}}\) and \(\overline{y}_{{ {i}}}\) are the estimated and true values of the observation \(y_{{ {i}}}\), respectively.
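The counting rules above could be implemented as sketched below; the tuple layout of `results` is an assumption introduced for illustration, not the bookkeeping actually used to produce the figures.

```python
import numpy as np

def summarize_runs(results):
    """Aggregate DIA probability levels and fitting RMSE over Monte Carlo runs.

    `results` is assumed to be a list of tuples
    (rejected_idx, true_outlier_idx, y_hat, y_bar), where `rejected_idx`
    are the observations rejected by IDS/IDR in that run."""
    clean = [r for r in results if len(r[1]) == 0]   # outlier-free runs
    dirty = [r for r in results if len(r[1]) > 0]    # contaminated runs

    # P_FA: fraction of outlier-free runs in which any observation is rejected.
    p_fa = np.mean([len(rej) > 0 for rej, *_ in clean]) if clean else np.nan
    # P_CD: fraction of contaminated runs in which at least one true outlier is rejected.
    p_cd = np.mean([len(set(rej) & set(tru)) > 0 for rej, tru, *_ in dirty]) if dirty else np.nan
    # P_CI: fraction of contaminated runs in which all true outliers are rejected.
    p_ci = np.mean([set(tru) <= set(rej) for rej, tru, *_ in dirty]) if dirty else np.nan
    # Fitting RMSE averaged over all runs.
    rmse = np.mean([np.sqrt(np.mean((np.asarray(yh) - np.asarray(yb)) ** 2))
                    for *_, yh, yb in results])
    return {"P_FA": p_fa, "P_CD": p_cd, "P_CI": p_ci, "RMSE": rmse}
```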

5.1 Control of FA probability

Before the outlier diagnosis, it is very useful to determine the significance level of the statistical test by controlling \(P_{{{ {\rm{FA}}}}}\), i.e., the type I error rate (Lehmann 2012; Rofatto et al. 2020b). First, Fig. 11 gives the relationship between \(P_{{{ {\rm{FA}}}}}\) and the RMSE of IDS, IDR, and LS under different significance levels. Generally, one can see that larger significance levels lead to higher \(P_{{{ {\rm{FA}}}}}\) and RMSE for both IDS and IDR. Specifically, when \(\sigma\) is known, the \(P_{{{ {\rm{FA}}}}}\) of IDR is similar to that of IDS regardless of the significance level. Therefore, the accuracies of IDS and IDR remain at the same level, which is slightly lower than that of LS. However, in the case of unknown \(\sigma\), the \(P_{{{ {\rm{FA}}}}}\) of IDR is larger than that of IDS for the same significance level. As a consequence, IDR shows a higher RMSE than LS and IDS. This indicates that IDR poses a larger risk of precision loss than IDS in the case of data insufficiency, especially when a large significance level is chosen. To control \(P_{{{ {\rm{FA}}}}}\) and the RMSE, the significance levels of IDS and IDR under several fixed values of \(P_{{{ {\rm{FA}}}}}\) are given in Table 1. Generally, when \(\sigma\) is known, a fixed \(P_{{{ {\rm{FA}}}}}\) corresponds to the same significance level for IDS and IDR. When \(\sigma\) is unknown, the significance level of IDR is much smaller than that of IDS under the same \(P_{{{ {\rm{FA}}}}}\). In practice, if \(\sigma\) is known, the overall test can also be used at the beginning of both IDS and IDR to regulate the overall false alarm rate (Kok 1984; Teunissen 2000).

Fig. 11 \(P_{{{ {\rm{FA}}}}}\) and RMSE of IDS, IDR, and LS under different significance levels

Table 1 Significance levels of IDS and IDR under different \(P_{{{ {\rm{FA}}}}}\)

5.2 Comparison of CD, CI probabilities, and RMSE

After the determination of the significance levels in Table 1, the performances of IDS, IDR, and LS are compared for the case in which the observations are contaminated by outliers. First, to evaluate the outlier detection capability of IDS and IDR, the \(P_{{{ {\rm{CD}}}}}\) of the two methods under different outlier numbers is shown in Fig. 12. In general, one can see that for both IDS and IDR a larger significance level leads to a higher \(P_{{{ {\rm{CD}}}}}\), which indicates that although a larger significance level (i.e., a smaller critical value) tends to reject more inliers, it also improves the detection capability. In addition, when \(\sigma\) is known, the \(P_{{{ {\rm{CD}}}}}\) of both IDS and IDR increases with the gross error size and finally stabilizes at 100%. However, when \(\sigma\) is unknown, the situation is different. Although IDS and IDR still show comparable \(P_{{{ {\rm{CD}}}}}\) in the case of a single outlier, when there are multiple outliers the \(P_{{{ {\rm{CD}}}}}\) of IDS shows a downward trend as the gross error size grows, even with a large significance level. This is because, when there are multiple outliers, the inlying set of IDS is contaminated and the estimate of \(\sigma\) obtained from Eq. (34) is biased, especially when the gross errors are large. This biased estimate shrinks the size of the extended w-test statistics, thereby compromising the detection capability. It shows that, when \(\sigma\) is unknown, IDS is severely affected by the masking effect of multiple outliers. Conversely, owing to the alleviation of the masking effect, IDR still shows satisfactory outlier detection performance.

Fig. 12 \(P_{{{ {\rm{CD}}}}}\) of IDS and IDR under different significance levels and outlier numbers

To compare the outlier identification capability of IDS and IDR, Fig. 13 gives the \(P_{{{ {\rm{CI}}}}}\) of the two methods under different outlier numbers. In general, when the gross error is small, larger significance levels bring a higher \(P_{{{ {\rm{CI}}}}}\), whereas the opposite holds when the gross error is large. This is because in the case of small gross errors the outlier is hard to detect, and a larger significance level helps to detect more outliers; conversely, in the case of large gross errors the outlier is easy to detect, and a smaller significance level keeps more inliers. Specifically, in the case of a single outlier, there is no significant difference between IDS and IDR when \(\sigma\) is known, while for unknown \(\sigma\), IDS shows a higher \(P_{{{ {\rm{CI}}}}}\) than IDR regardless of the significance level. However, in the case of multiple outliers, whether \(\sigma\) is known or unknown, IDR shows a notable superiority over IDS in \(P_{{{ {\rm{CI}}}}}\). In particular, when the outlier number reaches 3, the \(P_{{{ {\rm{CI}}}}}\) of IDR still shows an upward trend with increasing gross error size, while that of IDS remains almost stable at a very low level, which means it is almost impossible for IDS to make a completely correct decision, no matter which significance level is chosen. This indicates that, in the case of multiple outliers, IDR has a stronger identification potential than IDS owing to the alleviation of the masking and swamping effects.

Fig. 13 \(P_{{{ {\rm{CI}}}}}\) of IDS and IDR with different significance levels under different outlier numbers

Finally, Fig. 14 compares the fitting RMSEs of IDS, IDR, and LS under different outlier numbers. Generally, for both IDS and IDR, when the gross error size is small, a lower significance level leads to higher accuracy, whereas when the gross error size is large, a higher significance level provides higher robustness. Specifically, when there is only a single outlier, the RMSE of LS increases dramatically as the gross error size grows, which means LS is not robust even to a single outlier. In comparison, both outlier diagnosis methods reduce the fitting errors to roughly the same extent; here, IDS keeps slightly higher accuracy than IDR, especially when \(\sigma\) is unknown. However, when there is more than one outlier, although IDS has a certain resistance to multiple outliers, its robustness degrades severely as the gross error magnitude increases. For example, when \(\sigma\) is known and there are three outliers, IDS shows even lower accuracy than LS: although the outliers can be detected by IDS in this case (see Fig. 12), many inliers are also rejected due to the low identification capability (see Fig. 13). In addition, when \(\sigma\) is unknown, IDS can hardly detect any outlier (see Fig. 12), and thus performs the same as LS. This verifies that IDS, although widely used in practical applications, may lose robustness when handling multiple outliers of considerable size. In comparison, the fitting errors of IDR are generally much smaller and remain under control even in the case of numerous and large gross errors. This reveals that, owing to the alleviation of the masking and swamping effects, IDR is more robust than IDS when dealing with multiple outliers.

Fig. 14 RMSE of LS, IDS, and IDR with different significance levels under different outlier numbers

5.3 Analysis of suspected outlier numbers

In this subsection, the influence of \({ {q}}_{{{ {\rm{max}}}}}\) on the performance of IDS and IDR is analyzed. For fairness, the \(P_{{{ {\rm{FA}}}}}\) of both IDS and IDR is chosen as \(1.0 \times 10^{ - 2}\) and the gross error is set to \(15\sigma\). First, Fig. 15 shows the significance levels of IDS and IDR under different \({ {q}}_{{{ {\rm{max}}}}}\). Generally, when \(\sigma\) is known, the significance levels of IDS and IDR under a fixed \(P_{{{ {\rm{FA}}}}}\) remain consistent and unchanged regardless of the choice of \({ {q}}_{{{ {\rm{max}}}}}\). When \(\sigma\) is unknown, the significance level of IDS is still unchanged as \({ {q}}_{{{ {\rm{max}}}}}\) grows, whereas the significance level of IDR is much smaller than that of IDS and shows a downward trend as \({ {q}}_{{{ {\rm{max}}}}}\) increases. This indicates that for IDR the significance level is not influenced by \({ {q}}_{{{ {\rm{max}}}}}\) when \(\sigma\) is known, but when \(\sigma\) is unknown a smaller significance level is needed to control \(P_{{{ {\rm{FA}}}}}\) as \({ {q}}_{{{ {\rm{max}}}}}\) becomes larger.

Fig. 15 Significance levels of IDS and IDR under a fixed \(P_{{{ {\rm{FA}}}}}\) and different \({ {q}}_{{{ {\rm{max}}}}}\)

Using these determined significance levels, the RMSEs of IDS, IDR, and LS under a fixed \(P_{{{ {\rm{FA}}}}}\) and different \({ {q}}_{{{ {\rm{max}}}}}\) are shown in Fig. 16. Note that \({ {q}}_{{{ {\rm{max}}}}}\), which ranges from 3 to 7, is always larger than \({ {q}}_{*}\) to guarantee robustness. Generally, when there is only a single outlier, IDS and IDR always show a clear improvement in robustness over LS. Comparatively, IDS keeps slightly higher accuracy than IDR, especially when a large \({ {q}}_{{{ {\rm{max}}}}}\) is chosen. In the case of multiple outliers, while the robustness of IDS degrades severely, IDR always maintains higher robustness owing to the alleviation of the masking and swamping effects. In addition, compared with IDS, the performance of IDR is more sensitive to the choice of \({ {q}}_{{{ {\rm{max}}}}}\), especially when \(\sigma\) is unknown. In other words, the RMSE of IDR shows an increasing trend as \({ {q}}_{{{ {\rm{max}}}}}\) grows, since a larger \({ {q}}_{{{ {\rm{max}}}}}\) entails a greater risk of dropping inliers. This reveals that, under the premise that \({ {q}}_{{{ {\rm{max}}}}}\) is greater than the outlier number, a smaller \({ {q}}_{{{ {\rm{max}}}}}\) is preferable to ensure the accuracy of the parameter estimates.

Fig. 16 RMSE of IDS, IDR, and LS under a fixed \(P_{{{ {\rm{FA}}}}}\) and different \({ {q}}_{{{ {\rm{max}}}}}\)

6 Conclusion

In this contribution, the causes of the masking and swamping effects are investigated, and a new method of outlier diagnostics is proposed to alleviate these phenomena. First, according to the concept of data division, an extended form of the w-test with its associated reliability measure is presented. Second, based on the extended w-test, both data snooping and IDS are reinterpreted theoretically. Finally, a new outlier diagnostic method and its iterative form are proposed, called data refining and iterative data refining, respectively. While data snooping is a process of snooping outliers from a preset inlying set into the outlying set, data refining can be considered the reverse process of refining inliers from an outlying set into the inlying one.

A linear fitting example is used to evaluate the performance of the proposed IDR when dealing with outliers. Generally, when there is a single outlier, IDR shows performance similar to that of IDS in both the probability of a correct decision and the accuracy of the parameter estimate. However, as the outlier number grows, the correct decision probability and estimation accuracy of IDS degrade dramatically due to the masking and swamping effects. Conversely, IDR still maintains a stable and satisfactory performance, even in the case of numerous and large gross errors. This shows that IDR outperforms IDS when dealing with multiple outliers, owing to the alleviation of the masking and swamping effects.

It should be noted that, compared with IDS, the application of IDR still faces several challenges. First, IDR poses a larger risk of precision loss for parameter estimation, especially when the observation precision is unknown. Therefore, it is especially important to control the false alarm rate to avoid dropping too many inliers. Second, the performance of IDR is more sensitive to the preset parameter \({ {q}}_{{{ {\rm{max}}}}}\): a larger \({ {q}}_{{{ {\rm{max}}}}}\) usually leads to lower precision of the parameter estimates once it exceeds the true outlier number. Factors to consider when choosing \({ {q}}_{{{ {\rm{max}}}}}\) include data redundancy, the observation environment, and the task requirements. Finally, IDR usually requires more computing time, because searching for a suitable model in the initial stage is computationally expensive. Therefore, how to initialize IDR reliably and efficiently will be investigated in our future work.