1 Introduction

Understanding the flow pattern plays a fundamental role in determining the behavior of related phenomena. Experimental and field studies make it possible to anticipate the flow pattern around structures located at river bends. Spur dikes are hydraulic engineering structures used to preserve the desired water depth, deflect the main current in harbor channels and rivers, and protect river banks; they have long been an economical means of protecting the outer banks of river bends (Vaghefi et al. 2015b). To study the flow pattern around these structures, 3D flow velocities are collected with various velocimeters (Sulaiman et al. 2013; Xiekang and Xingnian 2016), and parameters such as shear stress (Vaghefi et al. 2015a), kinetic energy, and turbulence intensity (Kang 2013) are calculated. However, owing to human factors, the use of different devices for data collection, or changes in measuring conditions, some of the data are recorded as outliers (Alih and Ong 2015; Dhhan et al. 2015). Identifying these outliers and reducing their effects on measurements is essential for presenting an authentic flow pattern; outlier detection during data collection for specifying the flow pattern is therefore an undeniable necessity. In previous studies, most researchers examined errors in calculations or in relations obtained from experimental data, treated the measuring tolerance of the device as a system error, and rarely discussed the errors incurred during data collection. Many researchers, such as Nikora and Goring (2000), Goring and Nikora (2002), Cea et al. (2007), Khorsandi et al. (2012), Islam and Zhu (2013), Durgesh et al. (2014), Yafei (2015), and Hejazi et al. (2016), used filtration methods for data cleaning and for separating normal from raw data. They applied such methods to flow velocity data collected with the Vectrino, which has the same performance as the ADV. Results demonstrated that, owing to the large number of samples, errors in data collection do not strongly influence the mean velocity, whereas they may produce unrealistic values of the Reynolds shear stresses and other turbulence parameters (Vaghefi et al. 2010; Mahmoodi et al. 2013a, b). Hence, it is necessary to identify such errors and correct or remove them from the measurements.

This study aims to identify the outliers arising in data collection for flow pattern experiments using conventional data mining methods. Data mining is a branch of computer science that uses statistical models, mathematical algorithms, and learning methods to discover previously unknown knowledge, patterns, and relationships hidden in valid data (Han and Kamber 2006; Mahmoodi et al. 2013a, b). The methods discussed here include the box plot, histogram, linear regression (Shamim et al. 2015), k-nearest neighbors (kNN) (Yang et al. 2015), local outlier factor (LOF), k-medoids clustering (Alarcon-Aquino et al. 2011), multilayer perceptron (Heidari et al. 2016), and self-organizing map (Olawoyin et al. 2013).

To evaluate these methods, their performance in detecting outliers is reviewed in a case study aimed at determining the flow pattern around a T-shaped spur dike located in a 90° bend.

In this study, an outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism (Hawkins 1980). For example, Fig. 1 represents a data set with five outliers, labeled O1, O2, O3, O4, and O5. As is evident, these points are inconsistent with the rest of the samples and fall away from the overall data pattern.

Fig. 1 A data set with five outliers (O1, O2, O3, O4, and O5)

2 Materials and Methods

This section introduces the case study, the data collection device, the methods under discussion, and the criteria used to measure their precision.

2.1 Case Study

The experimental outliers under investigation emanated from experiments determining the flow pattern around a single spur dike located in a 90° bend in the Hydraulic Laboratory of Tarbiat Modares University in Iran (Ghodsian and Vaghefi 2009). Figure 2 shows a view of the laboratory and the channel. The channel is composed of a 7.1-m-long upstream and a 5.2-m-long downstream straight reach, connected by a 90° bend with external and internal radii of curvature of 2.7 and 2.1 m, respectively (Vaghefi et al. 2012). The ratio of the curvature radius to channel width is 4; the channel is 70 cm high and 60 cm wide. It is made of glass and stabilized by steel frames. The channel bed is rigid and covered with uniform sediment with an average diameter of 1.28 mm and a standard deviation of 1.3 mm. The flow discharge, adjusted by a calibrated orifice, is constant and equal to 25 l/s in this experiment (Vaghefi et al. 2009). A butterfly gate installed at the end of the channel controls the flow depth. The Froude and Reynolds numbers are 0.34 and 30,120, respectively. The spur dike, a rectangular Plexiglas plate with a T-shaped plan, has a wing length (L) and a web length (l) of 9 cm each and a height of 65 cm. It is vertical and unsubmerged and is placed at the 45° position along the bend (Vaghefi et al. 2010; Mahmoodi et al. 2013a, b).

Fig. 2 A view of the laboratory and the channels

2.2 Data Collection System

In order to determine the flow pattern, a Vectrino velocity meter is used to collect 3D velocities. The Vectrino is the new generation of ADV and an advanced device of its kind, used in laboratory research on account of its highly accurate velocity measurement and, most importantly, its ability to measure the flow velocity in three-dimensional coordinates. The device consists of two main parts: a sensor and a cylindrical case (Nortek 2004). One of its characteristics is that the flow velocity is measured 5 cm away from the sensor tip. For this reason, the side-looking sensor measures the velocity near the water surface, while the down-looking sensor is used at the other layers. The placement of the device on the channel and its two sensors are illustrated in Fig. 3. The velocity measurement range of this device is between ±0.01 and ±4 m/s, and its measurement accuracy is 1 mm/s. The sampling frequency ranges between 50 and 200 Hz (set to 50 Hz in this experiment), and the measuring time per sample is 1–5 min. Based on its users' preferences, the Vectrino can take 60,000 flow samples every 5 min in each direction and save the information as binary files on the hard drive of the computer to which it is connected. The saved data are analyzed and averaged using the Vectrino+ and Explorer V software (Nortek 2004), and the averages of the U, V, and W velocities and other relevant parameters such as shear stress and turbulent kinetic energy are obtained (Vaghefi et al. 2010; Mahmoodi et al. 2013a, b).

Fig. 3 a Placement of the Vectrino velocity meter system, b side-looking sensor, and c down-looking sensor

2.3 Data Mining Algorithms

2.3.1 Box Plot Method

Box plot (Solberg and Lahti 2005) is a graphical technique that summarizes the data distribution using five main characteristics: (1) the smallest normal observation (min), (2) the lower quartile (Q1), (3) the median, (4) the upper quartile (Q3), and (5) the largest normal observation (max). The value Q3 − Q1 specifies the interquartile range (IQR), by which normal and abnormal data can be identified. Samples smaller than Q1 − 1.5 × IQR or greater than Q3 + 1.5 × IQR can be considered outliers. These concepts are shown in Fig. 4.

Fig. 4 Box plot and its concepts
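As a minimal numerical sketch of this fence rule (illustrative only; the study's own detection code, as noted in Sect. 4, was written in MATLAB, and the velocity array below is hypothetical):

```python
import numpy as np

def boxplot_outliers(x, whisker=1.5):
    """Flag samples outside the Q1 - whisker*IQR .. Q3 + whisker*IQR fences."""
    q1, q3 = np.percentile(x, [25, 75])       # lower and upper quartiles
    iqr = q3 - q1                             # interquartile range
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return (x < lower) | (x > upper)          # boolean mask of outlier candidates

# Hypothetical univariate velocity series (e.g., the U component at one point)
u = np.random.normal(0.25, 0.02, 3000)
u[[10, 500, 1200]] = [0.9, -0.4, 1.1]         # inject artificial spikes
print(np.flatnonzero(boxplot_outliers(u)))
```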

2.3.2 Histogram Method

Histogram techniques depend on the frequency, or number, of samples and can be represented graphically. Mathematically, the histogram of a variable consists of a number of discrete bins, where the height of each bin represents the frequency (number) of samples located within it. If a bin contains fewer samples than a user-defined threshold, all samples located in that bin are candidates for outliers (Eskin 2000). For example, Fig. 5 shows the histogram of a data set with eight bins. The samples in the bin marked in the figure could be indicative of outliers; as clearly demonstrated, the frequency of this bin is considerably lower than that of the other bins.

Fig. 5 Histogram of a data set
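A simple numerical counterpart of this rule might look as follows; the bin count follows the 10-bin setting used in Sect. 4, while the frequency threshold and the data array are placeholders:

```python
import numpy as np

def histogram_outliers(x, bins=10, threshold=20):
    """Flag samples that fall into bins whose frequency is below `threshold`."""
    counts, edges = np.histogram(x, bins=bins)
    # Index of the bin each sample belongs to (interior edges give indices 0..bins-1)
    bin_idx = np.digitize(x, edges[1:-1])
    sparse_bins = np.flatnonzero(counts < threshold)
    return np.isin(bin_idx, sparse_bins)

u = np.random.normal(0.25, 0.02, 3000)
u[:5] = 1.2                                   # a few artificial spikes
print(histogram_outliers(u).sum(), "outlier candidates")
```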

2.3.3 Linear Regression Method

Regression analysis is used to determine the relationship between a dependent variable y and one (or more) independent variable x. The simplest form of regression is linear, with one dependent variable and one independent variable. Linear regression uses the straight-line model \(y_{i} = \hat{\alpha } + \hat{b}x_{i} + e_{i}\), in which the estimates \(\hat{\alpha }\) and \(\hat{b}\) are used to predict the approximate values \(\hat{y}_{i}\) from the values of x.

Values of the variables \(\hat{\alpha }\) and \(\hat{b}\) can be calculated from Eqs. (1)–(6):

$$\hat{\alpha } = \bar{y} - \hat{b}\bar{x}$$
(1)
$$\hat{b} = \frac{{S_{xy} }}{{S_{xx} }}$$
(2)
$$\bar{x} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} x_{i}$$
(3)
$$\bar{y} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} y_{i}$$
(4)
$$S_{xy} = \mathop \sum \limits_{i = 1}^{n} \left( {x_{i} - \bar{x}} \right)\left( {y_{i} - \bar{y}} \right)$$
(5)
$$S_{xx} = \mathop \sum \limits_{i = 1}^{n} \left( {x_{i} - \bar{x}} \right)^{2} ,$$
(6)

where ei specifies the residual values, or errors. The regression line must be estimated so that the sum of squared errors (SSE) is minimized; this is the least-squares method. Thus, for each observation, \(e_{i} = y_{i} - \hat{y}_{i}\) is the regression prediction error, i.e., the difference between the ith observation yi and its estimate \(\hat{y}_{i}\) on the regression line. If the error of the ith observation (ei) is remarkably larger than the errors of the other members of the sample, that observation is a candidate for being an outlier (Srimani and Koti 2012).
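A minimal Python sketch of Eqs. (1)–(6) and the residual definition (illustrative only; variable names and the example data are hypothetical, not the study's MATLAB code):

```python
import numpy as np

def linear_fit_residuals(x, y):
    """Least-squares fit of y = a + b*x following Eqs. (1)-(6); returns a, b, residuals."""
    x_bar, y_bar = x.mean(), y.mean()           # Eqs. (3) and (4)
    s_xy = np.sum((x - x_bar) * (y - y_bar))    # Eq. (5)
    s_xx = np.sum((x - x_bar) ** 2)             # Eq. (6)
    b = s_xy / s_xx                             # Eq. (2)
    a = y_bar - b * x_bar                       # Eq. (1)
    residuals = y - (a + b * x)                 # e_i = y_i - y_hat_i
    return a, b, residuals

# Hypothetical bivariate data set (e.g., one velocity component versus another)
x = np.linspace(0, 1, 200)
y = 0.3 + 0.5 * x + np.random.normal(0, 0.01, 200)
a, b, e = linear_fit_residuals(x, y)
```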

2.3.4 k-Nearest Neighbors Method

The k-nearest neighbors (kNN) algorithm is used to find the k-nearest neighbors of a point p. A point q from data set D is in the neighborhood of p if its distance from p is less than or equal to a specified distance d:

$$k\_{\text{Nearest}}\;{\text{Neighbors}} = \left\{ {q \in D|{\text{Dist}}\left( {q,p} \right) \le d} \right\}$$
(7)

In this case, q is in the d-neighborhood of p. In the above definition, Dist denotes the distance measure between p and \(q\). Euclidean distance is used in this study to measure the distance between points.

To identify outliers with this method, the number of data points located in the d-neighborhood of each data point is first calculated. If this number is less than a certain threshold \(k\), the data point is a candidate for an outlier; otherwise, it is a normal member of the data set. The values of k and d are determined based on the physical nature of the problem and by trial and error (Ramaswamy et al. 2002; Amiri et al. 2016).
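A brief sketch of this neighborhood-count rule is given below; k = 50 follows the setting reported in Sect. 4, whereas the radius d and the sample array are placeholders:

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_outliers(points, d=0.05, k=50):
    """Flag points with fewer than k neighbors inside radius d (Euclidean distance)."""
    tree = cKDTree(points)
    # Number of neighbors within d, excluding the point itself
    counts = np.array([len(tree.query_ball_point(p, d)) - 1 for p in points])
    return counts < k

# Hypothetical 2D data set, e.g., paired U-V velocity samples
uv = np.random.normal([0.25, 0.05], [0.02, 0.01], size=(3000, 2))
print(np.flatnonzero(knn_outliers(uv)))
```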

2.3.5 Local Outlier Factor Method

The local outlier factor (LOF) method (Srimani and Koti 2012) is one of the most powerful machine learning methods for identifying anomalies in data. It detects outliers by calculating the local neighborhood density of each sample and assigning to it a factor that quantifies its inconsistency with the other members of the data set. This factor is called the local outlier factor (LOF). Its value depends on how isolated a sample is compared with its local neighbors. Intuitively, large LOF values indicate an outlier, while lower values indicate normality. The LOF is calculated in the following steps:

2.3.5.1 Step One: Calculating k-Distance of p

For any object p, k-distance (p) is the distance from p to its kth nearest neighbor. To calculate this parameter, the kth nearest neighbor of p is first determined, and then its distance to p is taken as k-distance (p). This parameter gives an estimate of the local neighborhood density of p.

2.3.5.2 Step Two: Finding k-Distance Neighborhood of p

Each q whose distance from p is less than or equal to k-distance (p) is located in the k-distance neighborhood of p:

$$N_{{k - {\text{distance}}\left( p \right)}} \left( p \right) = \left\{ {q \in D\backslash \left\{ p \right\} | d\left( {p, q} \right) \le k - {\text{distance}}\left( p \right)} \right\}.$$
(8)
2.3.5.3 Step Three: Calculating the Reachability Distance of p with Respect to Object o

For any object o located within the k-distance neighborhood of p, the reachability distance of p with respect to o is defined by Eq. (9):

$${\text{Reachdist}}_{k} \left( {p, o} \right) = \hbox{max} \left\{ {k - {\text{distance}}\left( o \right), d\left( {p, o} \right)} \right\}.$$
(9)

Figure 6 shows an example of the reachability distance for k = 4. If p is located outside k-distance (o) (\(p2\) in the figure), the reachability distance is the actual distance \(d\left( {o, p2} \right)\). If the distance is less than k-distance (o), the reachability distance is equal to k-distance (o).

Fig. 6 Concepts of reachability distance for k = 4 (Breunig et al. 2000)

2.3.5.4 Step Four: Calculating the Local Reachability Density of p

The local reachability density of p is the inverse of the average reachability distance from p to its k nearest neighbors:

$$lrd_{k} \left(p \right) = \left[{\frac{{\mathop \sum \nolimits_{{o\epsilon N_{K\left(p \right)}}} {\text{Reach}} - {\text{dist}}_{K} \left({p,o} \right)}}{{\left| {N_{k} \left(p \right)} \right|}}} \right]^{- 1}$$
(10)

The LOF is calculated using the value of parameter lrdk.

2.3.5.5 Step Five: Calculating the LOF

The LOF is used to decide whether a data point is an outlier or normal. LOF(p) is the average ratio of the local reachability densities of p's k neighbors to the local reachability density of p:

$${\text{LOF}}_{k} \left(p \right) = \frac{{\mathop \sum \nolimits_{{o\epsilon N_{k} \left(p \right)}} \frac{{lrd_{k} \left(o \right)}}{{lrd_{k} \left(p \right)}}}}{{\left| {N_{k} \left(p \right)} \right|}}$$
(11)
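This whole procedure is available, for example, in scikit-learn's LocalOutlierFactor; a minimal sketch is shown below. The neighbor count of 50 and the threshold of 1.3 follow the values reported in Sect. 4, while the data array is a hypothetical stand-in:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical 2D data set, e.g., paired U-V velocity samples
uv = np.random.normal([0.25, 0.05], [0.02, 0.01], size=(3000, 2))

lof = LocalOutlierFactor(n_neighbors=50)     # k = 50 neighbors, Euclidean distance
lof.fit(uv)
scores = -lof.negative_outlier_factor_       # LOF_k(p) of Eq. (11) for each sample
outliers = np.flatnonzero(scores > 1.3)      # threshold of 1.3, as used later in this study
```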

2.3.6 k-Medoids Clustering Method

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than those in other groups (clusters). Many data mining algorithms in the literature find outliers as a by-product of clustering algorithms themselves and define outliers as points that do not lie in or are located far apart from any clusters (Agrawal et al. 1998, 1999; Liu et al. 2015; Rashedi et al. 2015; Zhang et al. 2014; Rehman et al. 2014; Zhang 2008). Thus, the clustering techniques implicitly define outliers as the background noise of clusters. Clustering algorithms can be categorized based on their cluster model. Partitioning clustering is one of the clustering categories that perform clustering by partitioning the data set into a specific number of clusters. The number of clusters to be obtained, denoted by k, is specified by human users. Partitioning clustering methods typically start with an initial partition of the data set and then iteratively optimize the objective function until it reaches the optimal for the data set (Zhang 2008).

k-medoids clustering (Kaufman and Rousseeuw 1987) is a classical partitioning technique of clustering that clusters the data set of n objects into k clusters known a priori. It is more robust to noise and outliers as compared to k-means clustering (MacQueen 1967), since it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances. A medoid can be defined as the object of a cluster, whose average dissimilarity to all the objects in the cluster is minimal, i.e., it is the most centrally located point in the cluster. A typical k-medoids algorithm for partitioning based on medoid or central objects is as follows (Theodoridis and Koutroumbas 2006):

Input: k: The number of clusters; D: A data set containing n objects.

Output: A set of k clusters that minimizes the sum of the dissimilarities of all the objects to their nearest medoid.

Method:

1. Initialize: randomly select (without replacement) k of the n data points as the medoids.

2. Associate each data point with the closest medoid ("closest" here is defined using any valid distance metric, most commonly the Euclidean distance (Deza and Deza 2009), Manhattan distance (Krause 1986), or Minkowski distance (Burago et al. 2001; Papadopoulos 2014)).

3. For each medoid m and each non-medoid data point o: swap m and o and compute the total cost of the configuration.

4. Select the configuration with the lowest cost.

5. Repeat steps 2–4 until there is no change in the medoids.
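A compact sketch of k-medoids together with the small-cluster outlier rule used in Sect. 4 is given below. It implements the simpler alternating (Voronoi-iteration) variant rather than the full swap search of steps 3–4, and the parameter values (k = 40, threshold = 19) merely echo those chosen later in this study:

```python
import numpy as np

def k_medoids(X, k=40, max_iter=100, seed=None):
    """Simplified alternating k-medoids (not the full PAM swap search), Euclidean distances."""
    rng = np.random.default_rng(seed)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distance matrix
    medoids = rng.choice(n, size=k, replace=False)                 # step 1: random initial medoids
    for _ in range(max_iter):
        labels = np.argmin(dist[:, medoids], axis=1)               # step 2: assign to closest medoid
        new_medoids = medoids.copy()
        for c in range(k):                                         # most central member becomes the medoid
            members = np.flatnonzero(labels == c)
            if members.size:
                new_medoids[c] = members[np.argmin(dist[np.ix_(members, members)].sum(axis=0))]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)): # step 5: stop when medoids stabilize
            break
        medoids = new_medoids
    return medoids, labels

def cluster_outliers(labels, threshold=19):
    """Members of clusters smaller than `threshold` are outlier candidates (rule of Sect. 4)."""
    sizes = np.bincount(labels, minlength=labels.max() + 1)
    return np.isin(labels, np.flatnonzero(sizes < threshold))
```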

2.3.7 Multilayer Perceptron (MLP)

Multilayer perceptron (MLP) is one of the most practical architectures of artificial neural networks and is capable of performing regression and classification tasks. A typical MLP consists of an input layer, a number of hidden layers, and an output layer, each having a number of processing neurons (nodes) with varying weights representing the relative influence of the different neuron inputs on the other neurons (Azari et al. 2015; Heidari et al. 2016). The number of neurons in the input and output layers is equal to the number of input and output variables, respectively. The number of hidden layers, the number of neurons in the hidden layer, and the linking weights are usually determined during training by a trial-and-error procedure. It has been proven that a single-hidden-layer MLP network, given enough hidden neurons and suitable activation functions, can approximate any nonlinear relation (Hornik 1991). In the MLP network, the output of the jth neuron (yj) can be found as follows:

$$y_{j} = f\left( {\mathop \sum \limits_{i = 1}^{M} w_{ij} x_{ij} + b_{j} } \right),$$
(12)

where wij is the link weight between the ith neuron in the previous layer and the jth neuron in the current layer (initialized randomly and adjusted during training), and xij is the input from the ith neuron to the jth neuron. M denotes the total number of neurons in the previous layer, and bj is the bias associated with the jth neuron. f is the nonlinear activation transfer function, which in the current work is the hyperbolic tangent sigmoid function. In Eq. (12), the weights and biases are unknowns; in this study, the back-propagation learning algorithm is employed to find them.

The aim here is to examine the applicability of the MLP network to outlier detection in flow pattern experiments. To do this, the best MLP model is first created for each data set. Then, for each observation, the residual value (\(e_{i} = y_{i} - \hat{y}_{i}\)), i.e., the difference between the measured value and the model output, is calculated. The best network architecture is selected based on two statistical criteria, the root-mean-squared error (RMSE) and the correlation coefficient (R), whose square is the coefficient of determination (R2), as follows:

$${\text{RMSE}} = \sqrt {\frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left( {y_{i} - \hat{y}_{i} } \right)^{2} }$$
(13)
$$R = \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {y_{i} - \bar{y}} \right)\left( {\hat{y}_{i} - \bar{\hat{y}}} \right)}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{n} \left( {y_{i} - \bar{y}} \right)^{2} \mathop \sum \nolimits_{i = 1}^{n} \left( {\hat{y}_{i} - \bar{\hat{y}}} \right)^{2} } }},$$
(14)

where n represents the total number of observations, while yi and \(\hat{y}_{i}\) are representative of real and predicted values using models, respectively. Moreover, \(\bar{y}\) and \(\bar{\hat{y}}\) are the average of mentioned data.
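As an illustrative sketch only (the network actually trained in this study follows the architecture of Table 9 and a back-propagation algorithm; here scikit-learn's MLPRegressor with its default optimizer stands in, and the data are hypothetical), residual-based screening could look like:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical bivariate data set (e.g., one velocity component as a function of another)
x = np.random.normal(0.05, 0.01, (3000, 1))
y = 0.25 + 2.0 * x[:, 0] + np.random.normal(0, 0.005, 3000)

# One hidden layer with a tanh activation, as in the paper's description
mlp = MLPRegressor(hidden_layer_sizes=(10,), activation='tanh', max_iter=2000)
mlp.fit(x, y)

residuals = y - mlp.predict(x)                   # e_i = y_i - y_hat_i
rmse = np.sqrt(np.mean(residuals ** 2))          # Eq. (13)
# Observations whose residuals deviate strongly from the rest are outlier candidates
flags = np.abs(residuals - residuals.mean()) > 2.5 * residuals.std()
```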

2.3.8 Self-Organizing Map (SOM)

Neural networks have been extensively used for outlier detection (Hoz et al. 2015; Wang et al. 2015; Fustes et al. 2013; Yan 2011), and different types of neural networks have been applied to the task. In this paper, the SOM (Olawoyin et al. 2013; Corona et al. 2010) is selected because this method has not yet been widely applied to outlier detection in flow pattern experiments. The SOM is an unsupervised neural network that clusters the input data into a fixed number of units. It consists of two layers: a one-dimensional array of input units and a two-dimensional array of output units. These units are called neurons, and the units of one layer are fully connected with the units of the other layer. If the input data set consists of n observations belonging to a d-dimensional space, the input layer must have d units, and the output layer has R × C units, where R and C represent the number of rows and columns of the SOM output array, respectively (Yan 2011). In this configuration, each map unit has a unique (i, j) coordinate, which makes it easy to reference a unit in the network and to calculate distances between units. Each unit is associated with a weight vector of the same dimension as the input data vectors and with a position in the map space. The SOM projects the input data set in a nonlinear way onto a rectangular grid laid out on a hexagonal lattice. It has a feed-forward structure with a single computational layer, applies competitive learning as opposed to error-correction learning, and uses a neighborhood function to preserve the topological properties of the input space. The general structure of SOM networks is shown in Fig. 7.

Fig. 7 A two-dimensional SOM; each circle denotes one neuron at the input and output layer

The self-organization process involves five major components (Giraudel and Lek 2001): (1) all the connection weights are initialized with small random values; (2) a vector is chosen at random from the input data set and presented to the network; (3) every unit in the network is examined to determine whose weight vector is most like the input vector, using a discriminant function (such as the Euclidean distance) that provides the basis for competition; the neuron with the smallest value of the discriminant function is declared the winner and is commonly known as the best-matching unit (BMU); (4) the radius of the neighborhood of the BMU is calculated, and the units in this neighborhood are updated by pulling them closer to the input vector; (5) the process is repeated from step (2) for N iterations.

If the input space is d-dimensional, the input patterns can be written as \(D = \left\{ {p_{i} :i = 1, 2, \ldots , d} \right\}\), and the connection weights between the input units i and the neurons j in the output layer can be written as \(W_{j} = \left\{ {w_{ji} :j = 1, \ldots ,R \times C ;i = 1, \ldots ,d} \right\}\), where R × C is the total number of neurons in the output layer. At each training step t, a sample data vector \(p\left( t \right) = \left[ {p_{1} , p_{2} , \ldots ,p_{d} } \right]\) is randomly chosen from the input data set, and the Euclidean distances between p(t) and all the weight vectors are computed. The winning neuron c (the neuron whose weight vector comes closest to the input vector) is determined by Eq. (15):

$$\|p\left( t \right) - w_{c} \left( t \right)\| = \min_{j} \left\{ {\|p\left( t \right) - w_{j} \left( t \right)\|} \right\}.$$
(15)

The equations for updating weights are:

$$W_{j} \left( {t + 1} \right) = \left\{ {\begin{array}{*{20}l} {W_{j} \left( t \right) + \alpha \left( {t,c,j} \right) \cdot \left( {p\left( t \right) - W_{j} \left( t \right)} \right), } \hfill & {{\text{if}}\quad j \in L_{c} \left( t \right) } \hfill \\ {W_{j} \left( t \right),} \hfill & { {\text{if}}\quad j \notin L_{c} \left( t \right)} \hfill \\ \end{array} } \right.,$$
(16)

where \(L_{c} \left( t \right)\) is the set of neighboring neurons of the winning neuron, and α(t, c, j) is the neighborhood kernel function (Wu and Chow 2004) around the winning neuron c at time t.

In this research, to detect outliers using the SOM method, a quasi-3δ edit rule (Yan 2011) is applied based on the two-dimensional map and its topology. Suppose that the weight vectors obtained from the SOM are \(W_{j} = \left\{ {w_{ji} :j = 1, \ldots ,R \times C ;i = 1, \ldots ,d} \right\}\). The procedure of the quasi-3δ edit rule is as follows:

Determine the median of the weight vector, Wmedian, as:

$$W_{{i,\; {\text{median}}}} = {\text{median}}\left( {W_{i,1} ,W_{i,2} , \ldots ,W_{i,j} , \ldots ,W_{i,R \times C} } \right),\quad i = 1,2, \ldots , d ,$$
(17)

where Wi,median is the ith element of Wmedian and Wi,j is the ith element of Wj.

Calculate the Euclidean distance dj between Wmedian and Wj as:

$$d_{j} = \left[ {\mathop \sum \limits_{k = 1}^{d} \left( {W_{k,j} - W_{{k, \,{\text{median}}}} } \right)^{2} } \right]^{{\frac{1}{2}}} ,\quad j = 1,2, \ldots ,R \times C.$$
(18)

Determine the median of the Euclidean distance(s), \(d_{\text{median}}\), as:

$$d_{\text{median}} = {\text{median}}\left( {d_{1} ,d_{2} , \ldots ,d_{j} , \ldots ,d_{R\, \times \,C} } \right).$$
(19)

Calculate the median absolute deviation from \(d_{\text{median}}\), dMAD, as:

$$d_{\text{MAD}} = 1.4826 \times {\text{median}}\left( {\left| {d_{1} - d_{\text{median}} } \right|, \left| {d_{2} - d_{\text{median}} } \right|, \ldots ,\left| {d_{R\, \times \,C} - d_{\text{median}} } \right|} \right).$$
(20)

Detect the outlier neurons based on the following rule:

$$\left\{ {\begin{array}{*{20}l} {{\text{Outlier}}\;{\text{neuron}},} \hfill & {{\text{if}}\;\;R_{j} = \left| {\frac{{d_{j} - d_{\text{median}} }}{{d_{\text{MAD}} }}} \right| > 3} \hfill \\ {{\text{Normal}}\;{\text{neuron, }}} \hfill & {{\text{if}}\;\;R_{j} = \left| {\frac{{d_{j} - d_{\text{median}} }}{{d_{\text{MAD}} }}} \right| \le 3} \hfill \\ \end{array} } \right.,\;\left( {j = 1,2, \ldots ,R \times C} \right).$$
(21)

The data objects projected on the outlier neuron are the outlier candidates.
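Assuming the SOM has already been trained (with any SOM library or a hand-rolled implementation) and both its weight vectors and the winning neuron of every sample are available, Eqs. (17)–(21) reduce to a few lines of numpy; the function and argument names below are hypothetical:

```python
import numpy as np

def quasi_3delta_outliers(weights, winners):
    """weights: (R*C, d) SOM weight vectors; winners: index of the BMU of each sample."""
    w_median = np.median(weights, axis=0)               # Eq. (17), element-wise median
    d_j = np.linalg.norm(weights - w_median, axis=1)    # Eq. (18)
    d_median = np.median(d_j)                           # Eq. (19)
    d_mad = 1.4826 * np.median(np.abs(d_j - d_median))  # Eq. (20)
    r_j = np.abs((d_j - d_median) / d_mad)
    outlier_neurons = np.flatnonzero(r_j > 3)           # Eq. (21)
    # Samples projected onto an outlier neuron are the outlier candidates
    return np.isin(winners, outlier_neurons)
```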

2.4 Measurement Criteria for Precision of Methods

To select the best-performing method (or methods) for identifying outliers, it is essential to define criteria for evaluating the performance of the algorithms. Algorithms for identifying anomalies in data are typically evaluated by criteria such as the detection rate and the false alarm rate (Provost and Fawcett 2001):

$${\text{Detection}}\;{\text{Rate}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}}$$
(22)
$${\text{False}}\;{\text{Alarm}}\;{\text{Rate}} = \frac{\text{FP}}{{{\text{FP}} + {\text{TN}}}},$$
(23)

where TP is the number of anomalous samples correctly diagnosed as anomalous, FN is the number of anomalous samples incorrectly diagnosed as normal, FP is the number of normal samples incorrectly diagnosed as anomalous, and TN is the number of normal samples correctly diagnosed as normal.

The detection rate provides information on the relative number of correctly detected anomalous samples, while the false alarm rate represents the relative number of normal samples mistakenly flagged as anomalous. The higher the detection rate and the lower the false alarm rate, the more accurate the method, and vice versa.
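Given boolean masks of the true outliers and of the outliers predicted by any of the methods above (hypothetical names), Eqs. (22) and (23) amount to:

```python
import numpy as np

def evaluation_rates(true_outlier, predicted_outlier):
    """Detection rate (Eq. 22) and false alarm rate (Eq. 23) from boolean masks."""
    tp = np.sum(true_outlier & predicted_outlier)
    fn = np.sum(true_outlier & ~predicted_outlier)
    fp = np.sum(~true_outlier & predicted_outlier)
    tn = np.sum(~true_outlier & ~predicted_outlier)
    detection_rate = tp / (tp + fn)
    false_alarm_rate = fp / (fp + tn)
    return detection_rate, false_alarm_rate
```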

3 Potential Errors in Data Collection

In this study, based on conducted experiments, errors are divided into three categories: inherent errors, observation errors, and statistical errors (Vaghefi et al. 2010; Mahmoodi et al. 2013a, b).

3.1 Inherent Errors

These errors occur because of the circumstances of data collection and represent the error inherent in the collected data. In the Vectrino velocity meter, inherent error can occur when the flow pattern changes. During data collection, the time required to establish quasi-steady and quasi-permanent conditions at the start of the experiment and after each restart of the pump was taken into account. Moreover, a 10-s period was allowed after every change of sensor position to damp local fluctuations before the 1-min collection at each point. In this way, this error was reduced to the minimum possible value. Owing to minor power fluctuations and their effect on the discharge produced by the pump, slight variations in the collected velocity are possible; compared with the actual value, this error is not significant, since 3000 samples (in 1 min) were collected in each direction at every point.

3.2 Observation Error

The major observation error lies in adjusting the coordinates of the measuring points of the velocity meter and in aligning the spur dike at the considered positions along the bend. Longitudinal, transverse, and vertical rulers with an accuracy of 0.1 mm are used to adjust the coordinates of the points; the user adjusts the longitudinal cart, the transverse movement rail, and the vertically movable shaft with these rulers. Any error in the coordinates of the points is therefore limited to 0.1 mm.

3.3 Statistical Error

Statistical errors include errors incurred after data collection. In the velocity data collected with the Vectrino, which has the same performance as the ADV, some recorded values lie outside the range of the other data. These errors are known as spikes.

4 Results and Discussion

In this section, outliers are identified in the data collected from the case study using the above-mentioned methods. Among the collected points, the ability of the methods to detect outliers was analyzed for the coordinates of one point (U, V, and W being the velocity values in the x, y, and z directions, respectively). Details of the studied data sets are presented in Table 1. To assess the performance of the methods, the outliers in each data set were first identified using pretests and previous studies. A method (or methods) that detects all or most of the outliers without error has the best performance and can be used in future studies to identify outliers. Using the algorithm of each method, a computer code was written in MATLAB to identify the outliers; it receives the raw data in Excel format as input and automatically saves the filtered files and outlier files in Excel format before providing them to the user.

Table 1 Details of tested data sets

The box plot is a univariate method; in other words, it is only applicable to univariate data sets. Therefore, this method can only be used with the U, V, and W data sets. In Fig. 8, the box plots of these data sets are shown. In each graph, the central line of the box represents the median, the edges of the box represent the 25th and 75th percentiles, and the whiskers extend to the most extreme samples still considered normal. Samples that fall outside this range are the outliers and are marked with the "+" sign in the figure. A summary of the outliers detected by this method for the data sets is presented in Table 2.

Fig. 8 Box plot of U, V, and W velocities

Table 2 Results of the box plot method on data sets

The histogram is also a univariate analysis. In Fig. 9, the histograms of the test data sets are shown; each graph has 10 bins, numbered from the left side of the histogram. Table 3 shows the frequency of samples located in each bin. If the frequency of a bin is considerably lower than that of the other bins, all samples located in that bin are candidates for outliers. According to this definition, the samples of the U data set located in bin 1, those of the V data set in bins 1, 7, 8, 9, and 10, and those of the W data set in bins 1, 9, and 10 are candidates for outliers.

Fig. 9 Histograms of U, V, and W velocities

Table 3 Number of data located within each data set bin

The summary of the results of outliers detected by this method for data sets is presented in Table 4.

Table 4 Results of the histogram on data sets

Simple linear regression can be applied to two-dimensional data sets; hence, this method can be used on all data sets. To identify outliers with regression models, the residual values (the differences between the actual values and the values estimated by the regression line) must be calculated. Samples whose residuals differ from those of the other samples by more than a threshold are candidates for outliers. After calculating the residual values for each data point, the following equation is used to detect outliers:

$$G = \frac{{\left| {r_{i} - \bar{r}} \right|}}{\text{SD}}.$$
(24)

In the above equation, ri is an element of the residual values, \(\bar{r}\) is their average, and SD is their standard deviation. The value of G is calculated for each residual; if it is greater than a threshold value t, the sample is a candidate for an outlier. In this study, t is taken as 2.5. For example, Fig. 10 shows the results of applying linear regression to the W and U–V data sets; the trend line on the data sets and the graph of the residuals are depicted. Owing to space limitations, the regression results are not outlined for all data sets. A summary of the identified outliers for each data set using this method is presented in Table 5.
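Continuing the illustrative regression sketch from Sect. 2.3.3 (the function name linear_fit_residuals is hypothetical), Eq. (24) with the threshold t = 2.5 used here amounts to:

```python
import numpy as np

def g_statistic_outliers(residuals, t=2.5):
    """Flag residuals whose G value (Eq. 24) exceeds the threshold t."""
    g = np.abs(residuals - residuals.mean()) / residuals.std()
    return g > t

# `e` would be the residuals returned by linear_fit_residuals() in the earlier sketch:
# flags = g_statistic_outliers(e)
```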

Fig. 10 Applying the regression on data sets: a U and b U–W

Table 5 Results of linear regression on data sets

Applying the kNN algorithm to the data sets in order to identify potential outliers requires determining the number of neighbors (k) and the radius of the neighborhood (d). As previously mentioned, the correct values of these parameters depend on the physical nature of the problem and are usually obtained through trial and error. The value of k is taken as 50, meaning that for each data item the 50 nearest neighbors define the neighborhood; the reason is that around 50 samples are collected each second, so the samples collected within 1 s are considered neighboring data. The parameter d is specified based on the nature of each sample and on pre-performed tests. The Euclidean function is used to measure the distance between points. The kNN test results on the data sets are shown in Table 6.

Table 6 Results of the kNN method on data sets

The number of neighbors (k) and the threshold parameter (t) must be determined before applying the LOF algorithm to the data sets. Here, the value of k is taken as 50. After the LOF is calculated for each sample, that sample is judged to be an outlier or normal based on this value. According to the LOF formula, if all samples are spaced exactly evenly on the plane, the LOF of all samples (except the boundary samples) will be 1. As the neighborhood density of a sample increases, the factor approaches zero; otherwise, the factor is greater than 1 and may become much larger. Therefore, a number greater than 1 should be selected as the threshold in each problem to identify outliers (Mahmoodi et al. 2013a, b). In this study, 1.3 is selected as the threshold value owing to the nature of the data; that is, samples whose neighborhood density is more than 30% below the uniform density are considered candidates for outliers. Table 7 presents the LOF test results for each data set.

Table 7 Results of the LOF method on data sets

Applying the k-medoids algorithm to a data set requires determining the number of clusters k and the distance measurement function. The most important step is finding an appropriate value of k; there are no explicit rules for this, and it depends on the nature of the input. Typically, the algorithm is applied with different values of k and the most appropriate one is selected. Here, k is taken as 40 for all data sets, and the Euclidean distance is used. If the number of data points located in a cluster is smaller than a threshold parameter t, all the data in that cluster are taken as candidates for outliers. The threshold value is set to 19 for all data sets. The results of applying this method to the data sets are presented in Table 8.

Table 8 Results of the k-medoids on data sets

Development of reliable ANN models for prediction problems requires determination of the ANN architecture, i.e., the number of hidden layers, the number of neurons in the hidden layers, the learning algorithm, and the activation transfer functions. Suitable values are selected by a trial-and-error procedure. The MLP network usually has one or more hidden layers; since, according to Bishop's study (Bishop 1995), more than one hidden layer is often unnecessary, our architectures have only one hidden layer. To determine the best MLP network architecture, several models were created with varying network parameters. The parameters of the optimum network structure and its schematic are shown in Table 9 and Fig. 11, respectively.

Table 9 Characteristics of selected MLP networks
Fig. 11 Schematic of defined MLP network

The results of the MLP method on the data sets are presented in Table 10, and the error histograms of the best obtained models are presented in Fig. 12.

Table 10 Results of the MLP method on data sets
Fig. 12 Error histogram of the best obtained models

To cluster the input data sets using the self-organizing map, a 5-by-8 two-dimensional map of 40 neurons is used; the map size was determined empirically by trial and error. Figure 13 represents the schematic of the defined SOM network. The batch SOM algorithm is used for training because it is more stable than the online version; in addition, it is faster and can be parallelized to reduce computational time (Fustes et al. 2013). The selected network parameters are shown in Table 11.

Fig. 13 Schematic of defined SOM network

Table 11 Selected SOM network parameters

Table 12 provides the results of the SOM method on the data sets. Figure 14 shows the distances between neighboring neurons for all studied data sets. This figure uses the following color coding: (1) the blue hexagons represent the neurons; (2) the red lines connect neighboring neurons; (3) the colors in the regions containing the red lines indicate the distances between neurons; (4) darker colors represent larger distances; and (5) lighter colors represent smaller distances. Figure 15 shows how many data points are associated with each neuron for all studied data sets. Neurons with fewer sample hits are outlier candidates.

Table 12 Results of the SOM method on data sets
Fig. 14 Neural network training SOM neighbor weight distances of all data sets

Fig. 15 Neural network training SOM sample hits of all data sets

A summary of the results of all tests is shown in Table 13, which provides the average false alarm rate and detection rate obtained by executing all methods on the tested data sets. The best method has the lowest average false alarm rate and the highest average detection rate. According to the results in Table 13, the local outlier factor (LOF) and the box plot methods had the best performance. The performance of the k-nearest neighbors method was acceptable, and its false alarm rate was only slightly higher than those of the LOF and box plot methods. On the other hand, the lowest performance belongs to the k-medoids method, because it was unable to cluster the data properly with the selected values of the algorithm's input parameters. The small detection rates of this method arise because a large number of outliers were mistakenly placed in normal clusters; thus, the method was not able to differentiate the data properly in most of the data sets.

Table 13 Conclusions of the results of all tests on data sets

As outlined in Table 13, most methods gave satisfactory results. It should be noted, though, that the nature of the data collected from various experiments differs. Hence, no single method is superior to the others, and a method may be highly efficient for a particular data set while not performing acceptably on other data sets. It is therefore recommended to follow the process employed in this study when working with different data.

Interestingly, the samples that these methods select as outlier candidates may not truly reflect errors in the studied system, as they may have arisen from changes in natural conditions (e.g., changes in the flow pattern). Therefore, different aspects of the falsity of outliers should be examined after identifying them, in order to either eliminate or correct them. It is also worth mentioning that, in a particular experiment, the choice of input parameters for each algorithm affects its outlier detection performance. For example, if the threshold value of the LOF algorithm is chosen above 1.3, some outliers may fall outside the flagged range and be treated as normal samples; if the threshold is set below 1.3, some normal samples may be flagged as outliers. In general, there is no rule specifying the correct choice of algorithm parameters; their correct selection depends on the physical nature of the problem, the nature of the data, and the analyst's experience.

5 Conclusions

Experimental data collection has always been associated with numerous outliers. These outliers cause problems in data analysis and lead to incorrect conclusions; hence, outlier detection is required before the data are processed. In this study, the box plot, histogram, linear regression, k-nearest neighbors, local outlier factor, k-medoids clustering, multilayer perceptron, and self-organizing map methods, and the way they are employed to identify outliers, were discussed. Their performance was analyzed in identifying the outliers in a case study whose purpose is to determine the flow pattern around a T-shaped spur dike located in a 90° bend. The outliers present in the data collected for the case study are caused by the Vectrino 3D velocimeter, changes in measuring conditions, and problems that occurred during data collection.

The results indicated that most methods gave satisfactory results, but the box plot and local outlier factor methods showed the best performance among all methods (having the lowest average false alarm rate and the highest average detection rate). The poorest performance was observed for the k-medoids method, because it was unable to cluster the data properly with the selected values of the algorithm's input parameters. It should be noted, however, that the nature of the data collected from various experiments differs; hence, no single method is superior to the others, and a method may be highly efficient for a particular data set while not performing acceptably on other data sets. The authors therefore suggest using these methods to identify outliers before analyzing the data collected from flow pattern experiments.