INTRODUCTION

In many applications, the processing and analysis of large volumes of measurements must cope with anomalous data (outliers). In this case, it is important to determine whether a new anomalous observation belongs to the same distribution as the existing ones or should be considered a manifestation of new properties or phenomena of the object under study [1–3]. In many cases, such anomalies associated with the emergence of new properties are the main purpose of the analysis of the obtained data, e.g., in cybersecurity [4, 5], medicine [6–8], biology [9], law enforcement (financial fraud) [10, 11], etc. (see [2, 12–14]). Currently, neural-network and machine-learning methods for detecting such outliers are being actively developed and implemented [15, 16]. They enable automation of the search for and analysis of anomalous outliers in a sequence of events and determination of their possible source and the degree of hazard they pose (see, for example, [17, 18]).

Anomalous outliers may also be due to measurement error or noise. When data are analyzed, such outliers can lead to an inflated error and significant distortions of parameters and statistical estimates (see, for example, [19–22]). Such "parasitic" outliers should be identified and discarded in further analysis. Statistical justification for dropping such outliers can be found in works dating back to the century before last [23]. Currently, filtering (suppression) of erroneous measurements has become an integral part of the preprocessing of experimental data (see, for example, [24, 25]). It is often automated and is part of the software of measuring equipment (see, for example, [26]).

There are many works on methods for the search and analysis of anomalous outliers (see, for example, the reviews [12, 27–30]). The earliest of them rely on the basic assumption that the data are independent and identically distributed, with the mean and the spread (dispersion/covariance) being the two most important statistical characteristics for their analysis (see [31] and references therein). If the law of the data distribution is known in advance, methods based on the Pearson criterion [32] are often used (\(\chi^{2}\) criteria; see, for example, [7, 37]). The outlier identification method based on indicators such as the interquartile range (the IQR method) [33, 8, 34, 5, 14, 35, 36] has become widely used in the practice of statistical processing of measurements because of its simplicity. With this method, erroneous measurements in the tails of the distributions can be filtered out of the experimental data. As noted above, erroneous measurements can introduce a significant error into the calculation of the statistical characteristics. This paper presents a comparative analysis of the method based on the interquartile range and the method proposed in [38], which has been used in a number of works [39–42].

TESTING OF FILTERING METHODS

Fig. 1. Histogram of instantaneous velocity measurements from [39]; the dotted lines are the boundaries of outlier cut-off by the IQR method.

Despite the widespread use of the IQR method for filtering experimental data, it has a number of disadvantages. Calculation of the quartiles requires sorting each realization, which demands significant additional memory (if the time series of measurements must be stored). In addition, the method is applicable only to distributions close to the normal (Gaussian) one. If the distribution has significant skewness \(S_{f}=\langle{\rm f}^{3}\rangle/\langle{\rm f}^{2}\rangle^{3/2}\) and/or excess \(E_{f}=\langle{\rm f}^{4}\rangle/\langle{\rm f}^{2}\rangle^{2}-3\), the IQR method leaves a significant number of outliers, which leads to errors in the determination of the statistical characteristics. As an example, Fig. 1 shows a histogram of PIV measurements of the instantaneous velocity in the flow over a hydrofoil [39] after the outliers were filtered by the algorithms of the measuring system software (see, for example, [26, 41]). It can be seen that the distribution has significant skewness (it differs strongly from the normal distribution), and outliers are visible on both sides of the distribution core, on its tails. The lines show the cut-off boundaries for these outliers: on the right, the third quartile \(Q3\) plus the interquartile range \(\Delta=Q3-Q1\); on the left, the first quartile \(Q1\) minus the interquartile range \(\Delta\). After the data to the left of the left line and to the right of the right one are discarded, some outliers still remain, primarily because of the strong skewness of the distribution. Despite the insignificant number of the remaining outliers, the error in the determination of the statistical characteristics with such filtering turns out to be large enough to hamper the determination of the actual flow pattern (see, for example, [39, Fig. 4]).
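For reference, a minimal sketch of this IQR cut-off in Python/NumPy is given below. The function iqr_filter and the width factor k are illustrative (k = 1 corresponds to the boundaries \(Q1-\Delta\) and \(Q3+\Delta\) described above, while the classical Tukey fences use k = 1.5); this is not the software of [26].

```python
import numpy as np

def iqr_filter(samples, k=1.0):
    """Discard samples outside [Q1 - k*IQR, Q3 + k*IQR].

    k = 1.0 reproduces the cut-off described in the text (Q1 - Delta, Q3 + Delta);
    the classical Tukey fences use k = 1.5.
    """
    q1, q3 = np.percentile(samples, [25, 75])
    delta = q3 - q1                                  # interquartile range
    lo, hi = q1 - k * delta, q3 + k * delta          # cut-off boundaries
    mask = (samples >= lo) & (samples <= hi)
    return samples[mask], (lo, hi)

# Usage on a strongly skewed sample with a few artificial outliers
rng = np.random.default_rng(0)
data = np.concatenate([rng.gamma(2.0, 1.0, 1000) - 2.0, [-4.0, 6.5, 8.0]])
filtered, (lo, hi) = iqr_filter(data)
print(f"{data.size - filtered.size} samples discarded, bounds = ({lo:.2f}, {hi:.2f})")
```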

In [38], a method was developed for filtering outliers in distributions with strong skewness. The method relies on the construction of a model probability density function (PDF) of the measured distribution that takes the strong skewness into account [43]. This function is a combination of two Gaussian distributions:

$$P(w)=\frac{a^{+}}{2\pi\sigma_{+}}\exp\left\{ {-\frac{(m^{+}-w)^{2}}{2\sigma_{+}^{2}}}\right\} +\frac{a^{-}}{2\pi\sigma_{-}}\exp\left\{ {-\frac{(m^{-}-w)^{2}}{2\sigma_{-}^{2}}}\right\} ,$$
(1)

where \(a^{+}\), \(a^{-}\), \(m^{+}\), \(m^{-}\), \(\sigma_{+}\), and \(\sigma_{-}\) are the parameters of the model PDF. They are calculated from the known standard deviation \(\sigma\) and skewness coefficient \(S\) of the measured distribution [43]:

$$m^{+}=\frac{\sigma}{4}\left[{S+\sqrt{S^{2}+8}}\right],\quad m^{-}=\frac{\sigma}{4}\left[{S-\sqrt{S^{2}+8}}\right],$$
$$a^{+}=-\frac{S-\sqrt{S^{2}+8}}{2\sqrt{S^{2}+8}},\quad a^{-}=\frac{S-\sqrt{S^{2}+8}}{2\sqrt{S^{2}+8}},$$
$$\left({\sigma_{+}}\right)^{2}=\frac{\sigma}{16}\left[{S+\sqrt{S^{2}+8}}\right],\quad\left({\sigma_{-}}\right)^{2}=\frac{\sigma}{16}\left[{S-\sqrt{S^{2}+8}}\right].$$
(2)

The filtering starts with a rough primary pass over the measured distribution using a Gaussian PDF. Then, if the filtered distribution still has strong skewness, an iterative process of additional filtering is started, based on model PDF (1) constructed from the first three statistical moments of the current distribution ((1) and (2) are written for centered moments). To identify and delete outliers, each bar of the data histogram is compared with the constructed model PDF. If the bar is higher than the PDF by more than \(\alpha\) events (\(\alpha\) is the filtering parameter), these events are regarded as outliers and are discarded. The value of \(\alpha\) depends on the database size: the larger the database, the smaller the \(\alpha\) value that can be taken (in practice, \(\alpha\) ranges from 10 to 1000). The procedure is repeated with an updated model PDF (constructed from the histogram after the deletion of outliers) until no bar of the filtered histogram exceeds the model PDF by more than \(\alpha\) events. Typically, no more than three such repetitions (iterations) are required to complete the filtering.
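A hedged sketch of such an iterative histogram-based filtering loop follows. It is not the implementation of [38]: the construction of model PDF (1)–(2) from the current moments is assumed to be supplied by the caller as build_model_pdf, and all samples falling into a bin that exceeds the model prediction by more than \(\alpha\) events are discarded, which is one possible reading of the procedure.

```python
import numpy as np

def iterative_pdf_filter(samples, build_model_pdf, alpha=100, bins=64, max_iter=3):
    """Iteratively discard samples in histogram bins that exceed a model PDF
    by more than `alpha` events.

    build_model_pdf : callable(samples) -> pdf(w); assumed to build the
    two-Gaussian model of Eqs. (1)-(2) from the moments of `samples`.
    """
    samples = np.asarray(samples, dtype=float)
    for _ in range(max_iter):
        counts, edges = np.histogram(samples, bins=bins)
        centers = 0.5 * (edges[:-1] + edges[1:])
        widths = np.diff(edges)
        pdf = build_model_pdf(samples)
        expected = pdf(centers) * widths * samples.size  # expected events per bin
        bad = counts > expected + alpha                  # bins treated as outliers
        if not bad.any():
            break                                        # filtering has converged
        bin_idx = np.clip(np.digitize(samples, edges) - 1, 0, bins - 1)
        samples = samples[~bad[bin_idx]]                 # drop samples in bad bins
    return samples

# Illustrative stand-in for the model of Eqs. (1)-(2): a single Gaussian
def gaussian_model(s):
    m, sd = s.mean(), s.std()
    return lambda w: np.exp(-(w - m) ** 2 / (2 * sd ** 2)) / (np.sqrt(2 * np.pi) * sd)

# e.g.: cleaned = iterative_pdf_filter(data, gaussian_model, alpha=10)
```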

Fig. 2. Test histogram constructed from the model PDF (shown by the line).

Fig. 3. Test histogram of Fig. 2 with added random outliers.

Fig. 4. Histograms of Fig. 3 filtered by the IQR method and by the method of [38].

To compare the method of [38] with IQR filtering, a test histogram of 1000 realizations was constructed with a pseudorandom number generator from model PDF (1) with strong skewness, \(S=1.157\) (see Fig. 2). For this purpose, the function \({\rm f}(p)\), the inverse of the integral \(p({\rm f})=\int\limits_{0}^{{\rm f}}P(w)\,dw\) of PDF (1) with the given parameters \(\sigma\) and \(S\), was calculated with the use of (2). The argument of this function was a pseudorandom number uniformly distributed in the range \([0;\,1]\). This yielded a sequence whose distribution is close to \(P({\rm f})\) given by (1); it is shown in Fig. 2 together with \(P({\rm f})\) (line). Twenty random outliers in the range \([-4;\,+4]\) were added to this distribution (see the histogram shown in black in Fig. 3); some of them fall on the tails of the distribution. The test distribution (the histogram with the outliers) was filtered by the method of [38]. The result of the filtering is shown in Fig. 4. The lines in the figure show the boundaries of outlier cut-off by the IQR method. It can be seen that some outliers remain after the IQR filtering. Despite their insignificant quantity (17 outliers out of 1017 measurements), their influence on the statistical characteristics of the distribution leads to a large error (see Table 1). Note that in Figs. 3 and 4 the original, test (with outliers), and filtered histograms do not coincide, not only in the tails but also in the core. This is because the histograms are normalized and centered: the addition or removal of even a few outlier realizations slightly changes the normalization and the mean, so small but noticeable discrepancies appear in the core of the histograms.
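The construction of such a test sequence can be sketched as follows. The numerical inversion of the cumulative integral stands in for the analytical inverse used in the text, and skewed_pdf is an illustrative placeholder rather than PDF (1) with the parameters of [38].

```python
import numpy as np

def sample_from_pdf(pdf, w_min, w_max, n_samples, n_grid=4096, seed=0):
    """Inverse-transform sampling: numerically invert the cumulative integral of `pdf`."""
    rng = np.random.default_rng(seed)
    w = np.linspace(w_min, w_max, n_grid)
    cdf = np.cumsum(pdf(w))
    cdf /= cdf[-1]                        # normalize the cumulative integral to [0, 1]
    u = rng.uniform(0.0, 1.0, n_samples)  # pseudorandom arguments in [0, 1]
    return np.interp(u, cdf, w)           # f(p): inverse of the integral of the PDF

def skewed_pdf(w):
    """Illustrative skewed placeholder PDF (not normalized; normalization is numerical)."""
    return np.exp(-(w + 0.4) ** 2 / 0.18) + 0.4 * np.exp(-(w - 0.6) ** 2 / 0.8)

data = sample_from_pdf(skewed_pdf, -3.0, 3.0, 1000)
rng = np.random.default_rng(1)
outliers = rng.uniform(-4.0, 4.0, 20)     # 20 random outliers in [-4, +4]
test_series = np.concatenate([data, outliers])
```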

Table 1. Statistical characteristics of the histograms: standard deviation \(\sigma=\sqrt{\langle{\rm f}^{2}\rangle}\), skewness \(S=\frac{\langle{\rm f}^{3}\rangle}{\langle{\rm f}^{2}\rangle^{3/2}}\), and excess \(E=\frac{\langle{\rm f}^{4}\rangle}{\langle{\rm f}^{2}\rangle^{2}}-3\)

Histogram                               \(\sigma\)         \(S\)            \(E\)
Original (model) histogram              0.5858             1.157            0.775
Test histogram with outliers            0.672 (15%)        0.421 (63%)      4.887 (530%)
Histogram filtered by method of [38]    0.5870 (0.2%)      1.144 (1%)       0.74 (5%)
Histogram filtered by IQR method        0.6270 (7%)        0.932 (20%)      2.507 (223%)

Table 1 shows the statistical characteristics (the standard deviation and the skewness and excess coefficients) of the original histogram, the histogram with outliers, and the histograms filtered by the method of [38] and by the IQR method. The relative filtering error \(\delta=\frac{\left|{f_{ini}-f}\right|}{f_{ini}}\) is shown in parentheses. The table demonstrates that outliers in measurements can lead to a significant error: despite their insignificant quantity (20 out of 1020), the relative error in the statistical moments grows with the moment order, from 15% for the standard deviation \(\sigma\) to 530% for the excess coefficient \(E\). It is also seen that, although the IQR method approximately halves the error in the calculated statistical characteristics, the error remains unacceptably large, especially for the higher statistical moments. The filtering method of [38] makes it possible to filter out the outliers with an error not exceeding 5% even for the excess coefficient.
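For completeness, the quantities compared in Table 1 can be computed by a straightforward transcription of these definitions applied to a centered sample; the function names below are illustrative.

```python
import numpy as np

def characteristics(samples):
    """Standard deviation, skewness S, and excess E of a sample (centered moments)."""
    f = np.asarray(samples, dtype=float)
    f = f - f.mean()                               # (1) and (2) use centered moments
    m2, m3, m4 = (f**2).mean(), (f**3).mean(), (f**4).mean()
    return np.sqrt(m2), m3 / m2**1.5, m4 / m2**2 - 3.0

def relative_error(f_ini, f):
    """delta = |f_ini - f| / f_ini, as quoted in parentheses in Table 1."""
    return abs(f_ini - f) / abs(f_ini)
```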

CONCLUSIONS

Using a model numerical series as an example, it is shown that random outliers can lead to a significant error in the statistical characteristics of the series. For instance, adding 20 random outliers to a sequence of 1000 numbers leads to an unacceptably large error in the statistical moments. Therefore, such outliers must be identified and filtered out during the processing of experimental data [19–22].

The presented results of testing the IQR filtering method on a model distribution show that this method gives a large error for distributions with significant skewness. The same conclusion can be drawn from the analysis of IQR filtering of real PIV measurements [39] of the instantaneous fluid velocity above a hydrofoil, whose distribution is also highly asymmetric. The method of [38], on the contrary, enables high-quality filtering of experimental data: the error does not exceed 5% even for the excess coefficient (the error increases with the moment order). This is shown by the testing results presented in this work, as well as by the use of this method for processing measurement data in a number of experiments [38–40, 42].

Note also that when the IQR method is used in software for filtering large databases, it requires significant additional memory for the preliminary sorting of the distributions. The method proposed in [38] involves an iterative process for each of the distributions, as well as the calculation of the model functions. However, these costs are insignificant because the parameters of these functions are obtained analytically and have a simple form, and the iterative process usually takes no more than two or three iterations [38].

FUNDING

This work was financially supported by the Russian Science Foundation (project no. 19-79-30075-P) and was carried out using the infrastructure of the Kutateladze Institute of Thermophysics, Siberian Branch, Russian Academy of Sciences.

CONFLICT OF INTEREST

The author of this work declares that he has no conflicts of interest.