
1 Introduction

An ‘outlier’ is an observation that is radically different from the other data in its set. Outliers are also referred to as abnormalities, discordant observations, deviants or anomalies in data mining. According to Mendenhall et al., an outlier is an observation “that lies very far from the middle of the distribution in either direction”. Outlier detection must be performed during the preprocessing step to determine whether the data present in the drinking water dataset are normal or abnormal. It has become an active research issue in data mining, with important applications in medical care, public safety and security, image processing, sensor/video network surveillance, intrusion detection, monitoring criminal activity in e-commerce, water quality monitoring, etc.

In this research, the main goal is to identify contamination from the drinking water dataset. The basic event detection framework utilizes a data mining algorithm for studying the interactions between multivariate water quality parameters and detecting possible outliers. The SVM classifier is used for this purpose: it builds several models and detects complex outliers based on the characteristics of the support vectors obtained from the SVM models. The approach is iterative and removes severe outliers in the first iteration itself, so from the next iteration onwards it learns from “cleaner” data and thus reveals outliers that were masked in the initial models [1].

Outlier detection methods can be divided into univariate and multivariate. Univariate methods consider only one variable or parameter, whereas multivariate methods consider more than one variable and check for relationships between the variables. Univariate outliers are easy to detect and correct, but multivariate outliers are more difficult and time-consuming to detect. Methods can also be divided into parametric and non-parametric: a parametric method uses statistical models, while non-parametric methods include distance-based, clustering-based and spatial-based outlier detection.

According to [2, 3], existing methods for detecting outliers are classified, based on the availability of labels in the training data sets, into three categories: supervised, semi-supervised and unsupervised. In supervised and semi-supervised approaches the model must be trained before use, while the unsupervised approach requires no training. Additionally, in the supervised approach the training set must be labelled as anomalous or normal; in contrast, the semi-supervised approach needs labels only for the normal objects, and the unsupervised approach does not require any label information. In [4], the classification model treats outlier detection, classification and feature selection as separate steps; the proposed method combines all three.

In this research work, the outliers are detected before feature selection. Clustering is used to partition the data into large and small clusters. Based on this partition, the small clusters are considered outliers and removed safely from the dataset. The main goal of this paper is to remove outliers effectively. For the experiments, a real-time drinking water dataset is used for evaluation. Section 2 analyses the concepts of outlier detection and clustering. In Sect. 3 the proposed techniques are explained in detail. Section 4 reports the experimental results of the enhanced technique. Finally, the conclusion is presented in Sect. 5.

2 Outlier Detection (OD) and Clustering

Clustering analysis and outlier detection are two related tasks which go hand in hand. Clustering finds the major patterns in a dataset and organizes the data accordingly, whereas outlier detection aims at capturing the exceptional cases that deviate from the majority patterns. Making a binary decision on whether an object in the dataset is an outlier or not is a challenging task for real-time datasets. Some of the general terms used in clustering are [1]:

  • Cluster: an ordered list of objects which have common characteristics.

  • Distance: the distance between two points or two elements.

  • Similarity: the similarity between two documents, SIMILAR(Di, Dj).

  • Average similarity: the similarity measure computed over all document pairs (Di, Dj), except i = j, averaged into a single value.

  • Threshold: the lowest value of similarity required to join two objects in one cluster.

  • Similarity matrix: the similarity function SIMILAR(Di, Dj) evaluated for every pair of objects and represented in the form of a matrix.

  • Dissimilarity coefficient: the distance between two clusters.

  • Cluster seed: the first object or point of a cluster, which acts as the initiator of that cluster.

Clustering is an important and popular tool for outlier analysis. Most outlier techniques rely on the key assumption that normal objects belong to large or dense clusters, while small clusters are considered outliers [5, 6]. Many researchers argue that a clustering algorithm alone is not an appropriate choice for outlier detection, and there is no single, specific algorithm for detecting outliers. Therefore, many approaches have been proposed and existing algorithms have been enhanced to improve outlier detection metrics.
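To make these terms concrete, the following minimal Python sketch builds a similarity matrix and an average similarity. It is illustrative only: the paper does not prescribe a similarity function, so cosine similarity stands in for SIMILAR(Di, Dj).

```python
import numpy as np

def similarity_matrix(X):
    """Illustrative SIMILAR(Di, Dj): cosine similarity for every pair of
    rows, collected into a matrix; i == j pairs are excluded from the
    average, matching the 'average similarity' definition above."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    S = (X / norms) @ (X / norms).T            # cosine similarity matrix
    n = len(X)
    off_diag = S[~np.eye(n, dtype=bool)]       # drop the i == j entries
    return S, off_diag.mean()                  # matrix and average similarity

X = np.array([[1.0, 0.2], [0.9, 0.3], [0.1, 2.0]])
S, avg = similarity_matrix(X)
# A threshold on S (e.g., S[i, j] >= 0.9) then decides whether two
# objects may join the same cluster.
```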

These approaches fall into five major categories: distance-based, density-based, distribution-based, deviation-based and clustering-based. In the distance-based approach, outliers are detected based on the distance between points. The density-based approach helps to find non-linear shapes and structures based on density. The distribution-based approach helps to detect clusters of arbitrary shape, does not require any input parameters, and can handle large amounts of spatial data. The clustering-based approach considers small clusters as outliers. The deviation-based approach identifies outliers as objects that deviate from the selected objects [4, 7, 8].

In this paper, an enhancement of k-means is proposed in the first phase, and the second phase analyses the detection of outliers. The main focus of this work is to identify outliers using distance-based clustering, which results in discovering normal and abnormal clusters. The classic k-means algorithm is very sensitive to the selection of the initial cluster prototypes: it converges to a suboptimal solution when the initial prototypes are not chosen properly, and the value of k must be specified in advance. When solving real-world applications, fixing the value of k is difficult. For such real-time applications [8], the fixed-width clustering algorithm has been used to perform the clustering process and to classify outliers as erroneous values or interesting events. The main disadvantages of this algorithm are that it does not consider intra-cluster distances, the width must be specified before processing starts, it does not work for high-dimensional real-time data, and the required number of distance calculations is high. The selection of the centroid is also important: if it is not initialized correctly, the resulting clusters may contain outliers. To overcome this, the Bayesian Information Criterion is combined with a Modified Dynamic Validity Index (MDVI). Generally, k-means clustering is used to cluster the data and to classify normal and abnormal clusters, but its computational cost is high when applied to a high-dimensional dataset. To overcome this issue, a different distance calculation mechanism is applied. The proposed algorithm handles and solves all of these issues: k-means is enhanced so that there is no need to specify the k value, which is assigned automatically, and the centroid seed selection is also estimated automatically. The proposed research is discussed in the following section.

3 Proposed Work

k-means, one of the top ten algorithms in data mining, is simple and scalable in nature. The clustering algorithm partitions the dataset into k clusters with the main objective of making each cluster as compact as possible [9]. This paper proposes a new cost function and distance measure based on the values present in the dataset. The traditional k-means algorithm divides the data set X into k clusters and calculates the centroid of each cluster; the value of k must be assigned before the clustering process. The distance to each instance is calculated and each instance is assigned to the cluster with the nearest seed. Finally, a threshold percentage for each cluster and the distance of each point of the cluster from its centroid are calculated; when the distance is greater than the threshold value, the point is considered an “outlier” [9, 10]. The main objectives of this paper are to enhance the k-means clustering algorithm, to handle the issues of the fixed-width clustering algorithm, and to evaluate the proposed algorithm on a real-time application.
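As a point of reference, here is a minimal sketch of the baseline scheme of [9, 10] described above. The per-cluster threshold is expressed as a percentile because the paper does not fix how the threshold percentage is chosen; the percentile value is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_threshold_outliers(X, k, threshold_pct=95):
    """Baseline: cluster with k-means, then flag points whose distance to
    their own centroid exceeds a per-cluster threshold (assumed here to
    be a percentile of the in-cluster distances)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    outliers = np.zeros(len(X), dtype=bool)
    for j in range(k):
        in_j = km.labels_ == j
        cut = np.percentile(d[in_j], threshold_pct)  # threshold % per cluster
        outliers[in_j] = d[in_j] > cut
    return km.labels_, outliers
```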

In this work, the k-means clustering algorithm is used to cluster the dataset into k clusters. In the proposed algorithm the k value is generated automatically by using an enhanced Bayesian Information Criterion (BIC) along with the Modified Dynamic Validity Index (MDVI). Criteria such as the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC) or the Deviance Information Criterion (DIC) can be used to determine the number of clusters; the distance metric used in this work is the Euclidean distance. The distances from the centroids to the cluster members are recomputed until convergence is achieved. The algorithm below describes the steps of the enhanced k-means clustering; a code sketch of its key computations follows the listing.

  • Input: Dataset D with data points xi (i = 1…n)

  • Output: Clusters (C1, …, Ck)

  Step 1: Feature selection is applied.

  Step 2: K is estimated automatically using the following procedure:

    I. Pre-cluster the real-time dataset (drinking water dataset) using the BIRCH algorithm.

    II. BIC is computed for each cluster using Eq. (1), where J is a cluster.

      $$ BIC(J) = - 2\sum\limits_{j = 1}^{J} \xi_{j} + m_{J} \log (N). $$
      (1)
    III. The ratio of change in BIC at each successive merging, relative to the first merging, determines the initial estimate and is calculated using Eq. (2).

      $$ dBIC(J) = BIC(J) - BIC(J + 1). $$
      (2)

      From these initial estimates, the change ratio for cluster J is calculated using Eq. (3) as the ratio of dBIC(J) to dBIC(1) of the first merging.

      $$ R_{1} \left( J \right) = \frac{dBIC(J)}{dBIC(1)}. $$
      (3)

      If dBIC(1) < 0, then set KT = 1 and go to Step 8; otherwise calculate the inter-cluster ratio, set KT = the number of clusters for which the recorded ratio is the minimum of all, and repeat steps V, VI and VII for all KT.

    IV. Calculate the modified inter- and intra-cluster ratios between clusters Ck and Ck+1 using Eq. (4).

      $$ \mathrm{IntraRatio}(k) = \frac{\mathrm{Intra}(k)}{\mathrm{MaxIntra}}, \qquad \mathrm{InterRatio}(k) = \frac{\mathrm{Inter}(k)}{\mathrm{MaxInter}}. $$
      (4)

      where K is the pre-defined upper bound on the number of clusters.

    V. Calculate the modified dynamic validity index using Eq. (5).

      $$ MDVI = \mathop{\min}\limits_{k = 1, \ldots, K} \left\{ \mathrm{IntraRatio}(k) + \gamma \cdot \mathrm{InterRatio}(k) \right\}. $$
      (5)
    VI. KT = the number of clusters for which the dynamic validity index of Eq. (5) is minimal; set the optimal k = KT.

  Step 3: The K initial seeds (Cj) are generated automatically.

    I. Calculate K.

    II. The neighborhood of each object in D is computed using Eq. (6).

      $$ N(x_{i}) = \{ x_{j} \in D : d(x_{i}, x_{j}) \le \varepsilon \}. $$
      (6)

      where d(xi, xj) is the distance between xi and xj, calculated using DLG, and the average distance between all objects is calculated using the following equation.

    III. Compute the average distance ε between all objects using Eq. (7).

      $$ \varepsilon = \frac{1}{n\left( n - 1 \right)} \sum\limits_{i = 1}^{n - 1} \sum\limits_{j = i + 1}^{n} d(x_{i}, x_{j}). $$
      (7)
    IV. Find the neighborhood of each object in D. The coupling degree between the neighborhoods of objects xi and xj is the ratio of the number of objects neighboring both xi and xj to the number of objects neighboring either of them, and is calculated using Eq. (8).

      $$ \mathrm{Coupling}(N(x_{i}), N(x_{j})) = \frac{|N\left( x_{i} \right) \cap N\left( x_{j} \right)|}{|N\left( x_{i} \right) \cup N(x_{j})|}. $$
      (8)
    V. If Coupling(N(xi), N(xj)) < ε (the average distance between all objects), then the next centroid has been found: add it to C (the set of centroids).
    VI. If the number of centroids < k, go to Step 6; otherwise go to Step 9.

    VII. End.

  Step 4: Steps 5–9 are repeated for each point xi in D′.

  Step 5: The distance between each data point xi and all k cluster centres is calculated using Eq. (9).

    $$ DLG_{ij} = \min \left( \sum\limits_{e = 1}^{p} L\left( p_{e}, p_{e + 1} \right) \right). $$
    (9)

    where e ∈ E ranges over 1…p. Thus, DLGij satisfies the four conditions for a metric, that is, Dij = Dji; Dij ≥ 0; Dij ≤ Die + Dej for all xi, xj, xe; and Dij = 0 iff xi = xj. Thus the new measure considers both global and local consistency and can adapt itself to the data structure. The density length L is computed using Eq. (10).

    $$ L\left( {x_{i} ,x_{j} } \right) = \rho^{{dist\left( {x_{i} ,x_{j} }\right)}} - 1. $$
    (10)

    where dist(xi, xj) is the Euclidean distance between xi and xj and ρ > 1 is the flexing factor (the value 8 is used during experimentation).

  Step 6: The closest centre cj is found and xi is assigned to cluster j.

  Step 7: The label of the cluster centre j and the distance of xi to cj are stored in the arrays Cluster[] and Dist[] respectively:

    • Set Cluster[i] = j (j is the nearest cluster)

    • Set Dist[i] = DLGij (distance between xi to closest cluster centre cj)

  Step 8: Cluster centres are recalculated.

  Step 9: DLGnew of xi is computed with respect to the new cluster centres until convergence:

    • If DLGnew is less than or equal to DLGij, then xi belongs to the same cluster j

    • else

    • DLG is computed with every other cluster centre and xi is assigned to the cluster whose DLG is minimum

    • Set Cluster[i] = j and Set Distance[i] = DLGnew

  Step 10: Output the clustered results.
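The sketch below illustrates the two computational cores of the listing, under stated assumptions: the density-sensitive DLG distance of Eqs. (9)–(10), realized as a shortest-path computation over a complete graph with edge lengths ρ^dist − 1, and the coupling-degree seed selection of Eqs. (6)–(8). The coupling cut-off is a named assumption, since the listing compares the coupling degree (a ratio in [0, 1]) against ε.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

RHO = 8.0  # flexing factor rho > 1 of Eq. (10); 8 per the experiments

def dlg_matrix(X):
    """Eqs. (9)-(10): edge length L(xi, xj) = rho**dist(xi, xj) - 1;
    DLG(i, j) is the cost of the cheapest path between xi and xj."""
    euclid = cdist(X, X)                   # pairwise Euclidean distances
    L = RHO ** euclid - 1.0                # Eq. (10) on the complete graph
    # Dijkstra over the dense edge-length matrix gives Eq. (9); note that
    # zero entries (duplicate points) are read as missing edges by scipy.
    return shortest_path(L, method="D")

def initial_seeds(X, k, coupling_cut=0.5):
    """Eqs. (6)-(8): epsilon-neighbourhoods and coupling degree; objects
    weakly coupled to all current seeds become new centroids.
    coupling_cut is an assumption standing in for the listing's epsilon."""
    D = dlg_matrix(X)
    n = len(X)
    eps = D[np.triu_indices(n, 1)].mean()  # Eq. (7): average pair distance
    N = [set(np.flatnonzero(D[i] <= eps)) for i in range(n)]  # Eq. (6)
    seeds = [0]                            # start from the first object
    for j in range(1, n):
        if len(seeds) >= k:
            break
        # Eq. (8): |N(xi) ∩ N(xj)| / |N(xi) ∪ N(xj)| against current seeds
        coupling = max(len(N[s] & N[j]) / max(len(N[s] | N[j]), 1)
                       for s in seeds)
        if coupling < coupling_cut:        # weakly coupled -> new centroid
            seeds.append(j)
    return X[np.array(seeds)]
```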

Feature selection chooses distinctive features from the dataset. Selecting features helps to reduce the size of the dataset and simplifies all subsequent processing [11].
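The paper does not name a particular selector, so the following is a hypothetical stand-in for the feature-selection step, dropping near-constant water-quality parameters with a simple variance filter:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.random((100, 10))              # placeholder water-quality matrix
X[:, 3] = 0.5                          # a near-constant, uninformative column

selector = VarianceThreshold(threshold=1e-3)
X_reduced = selector.fit_transform(X)  # drops the low-variance parameter
print(X.shape, "->", X_reduced.shape)  # (100, 10) -> (100, 9)
```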

Once feature selection is over, the dataset is pre-clustered using the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm, which can handle large datasets. The ratio of change in BIC is calculated at each consecutive merging, and the intra- and inter-cluster relationships are evaluated using the formulas above. To improve the performance of the algorithm and to find a better cluster number, the MDVI is used. The next step of the algorithm is the initial seed selection for the assignment of data to clusters. The performance of initial seed selection depends on the sum of squares, the difference between members of a cluster, the cluster centre and the normalized data size [8, 9]. The inter- and intra-cluster distance validity measure allows the number of clusters to be determined automatically. For initial centroid selection, the enhanced distance measure, Reverse Nearest Neighbor (RNN) and the coupling degree are used; the coupling degree measures the similarity between two objects.
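A hedged sketch of this pre-clustering and k-estimation pipeline is given below. Since the paper does not spell out the per-cluster term ξj of Eq. (1), a standard spherical-Gaussian BIC for a k-means partition is used in its place, and the drop-off cut on the change ratio is an assumed tuning constant.

```python
import numpy as np
from sklearn.cluster import Birch, KMeans

def kmeans_bic(X, labels, centers):
    """Spherical-Gaussian BIC of a partition: a common stand-in for
    Eq. (1); the paper's per-cluster term xi_j is not spelled out."""
    n, d = X.shape
    k = len(centers)
    sse = ((X - centers[labels]) ** 2).sum()
    var = max(sse / max(n - k, 1), 1e-12)
    loglik = -0.5 * n * d * np.log(2 * np.pi * var) - 0.5 * sse / var
    m = k * d + 1                          # free parameters m_J
    return -2 * loglik + m * np.log(n)

def estimate_k(X, k_max=10, cut=0.04):
    """Eqs. (2)-(3): BIC differences across successive k and the change
    ratio relative to the first merge; cut is an assumed drop-off value."""
    leaves = Birch(n_clusters=None).fit(X).subcluster_centers_  # pre-cluster
    bic = []
    for k in range(1, min(k_max, len(leaves)) + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(leaves)
        bic.append(kmeans_bic(leaves, km.labels_, km.cluster_centers_))
    dbic = -np.diff(bic)                   # dBIC(J) = BIC(J) - BIC(J+1)
    if len(dbic) == 0 or dbic[0] < 0:      # guard from the listing
        return 1
    ratios = dbic / dbic[0]                # R1(J), Eq. (3)
    for j, r in enumerate(ratios, start=1):
        if r < cut:                        # ratio has flattened out
            return j
    return len(bic)
```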

Thus the conventional k-means algorithm begins with a decision on the value of k and an initial partition that classifies the data into k clusters. The limitation of the plain Euclidean distance is overcome in the proposed work: the Euclidean distance between xi and xj is transformed through the density length of Eq. (10), with the flexing factor value of 8 used in the experiments. The new measure considers both global and local consistency and can adapt itself to the data structure. The distances from the centroids to the cluster members are recomputed until convergence is achieved. The algorithm above outlines the automatic generation of the k value and the initial seed selection for clustering.

3.1 Outlier Detection (OD)

The second phase of the work is outlier detection (OD), a basic issue in data mining. Outlier detection removes ‘noise’ or ‘unwanted’ data from the dataset. Once the clusters have been formed with the help of the enhanced k-means clustering, the output is given as input for outlier detection. The dataset is partitioned into small and large clusters and the resulting clusters are checked for anomalies; if anomalies are present, they are safely removed from the whole dataset. For each cluster ci in the cluster set C, a set of inter-cluster distances Dci = {d(ci, cj): j = 1…(|C| − 1), j ≠ i} is computed. Here, d(ci, cj) is the Euclidean distance between the centroids of ci and cj, and |C| is the number of clusters in the cluster set C. Among the inter-cluster distances Dci for cluster ci, the shortest K (the parameter of KNN) distances are selected and used to compute the average inter-cluster distance ICDi of cluster ci using Eq. (11).

$$ \mathrm{ICD}_{i} = \begin{cases} \dfrac{1}{K}\sum\limits_{j = 1, j \ne i}^{K} d\left( c_{i}, c_{j} \right), & K \le \left| C \right| - 1 \\[2ex] \dfrac{1}{\left| C \right| - 1}\sum\limits_{j = 1, j \ne i}^{\left| C \right| - 1} d\left( c_{i}, c_{j} \right), & K > \left| C \right| - 1 \end{cases} $$
(11)

The average inter-cluster distance computation is enhanced: instead of using the whole cluster set C to compute the average inter-cluster distance ICDi for a cluster ci, the presented algorithm uses only the K nearest neighbouring clusters (KNN) of ci. The advantage of this approach is that clusters at the edge of a dense region are not unduly penalized compared to clusters in the centre of the region. The anomalous clusters Ca ⊂ C are defined as Ca = {ci ∈ C : ICDi > AVG(ICD) + SD(ICD)}, where ICD is the set of average inter-cluster distances.
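The ICD computation of Eq. (11) reduces to a few lines; the sketch below assumes the centroids are given as a NumPy array and returns the indices of the anomalous clusters per the Ca definition above.

```python
import numpy as np
from scipy.spatial.distance import cdist

def anomalous_clusters(centroids, K=3):
    """Eq. (11): average distance from each cluster to its K nearest
    neighbouring clusters; flag clusters with ICD > mean + std."""
    D = cdist(centroids, centroids)        # inter-centroid distances
    np.fill_diagonal(D, np.inf)            # exclude the j == i term
    m = len(centroids)
    k_eff = min(K, m - 1)                  # second branch of Eq. (11)
    icd = np.sort(D, axis=1)[:, :k_eff].mean(axis=1)
    return np.flatnonzero(icd > icd.mean() + icd.std())
```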

Once an anomaly is identified and removed, two clusters are merged into a single cluster if they satisfy the rules given below. A pair of clusters c1 and c2 are similar if the inter-cluster distance d(c1, c2) between their centres is less than the width w. If c1 and c2 are similar, a new cluster c3 is produced whose centre is the mean of the centres of c1 and c2 and whose number of data vectors is the sum of those in c1 and c2. In the proposed system, the merging procedure compares each cluster ci with the clusters {ci+1, ci+2, …} and merges ci with the first cluster cj such that d(ci, cj) < T and j > i. The value of T is set to 0.38 after experimentation with values ranging between 0.01 and 2.02. For detecting outliers, the k-means clustering algorithm is enhanced and KNN is used to find the nearest neighbours. The DLG measure considers both intra- and inter-distances between data points. For automatic estimation of k, BIC is enhanced using the MDVI. For centroid selection, the enhanced distance measure is combined with RNN and the coupling degree. Finally, for dimensionality reduction, Principal Component Analysis is used. The enhanced k-means clustering is able to discover clusters of arbitrary shape, works efficiently for large databases, and reduces the heuristics needed to determine the parameters; the enhanced technique therefore helps to detect outliers efficiently and accurately.
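A sketch of the merge rule under the stated T, keeping the paper's convention that the new centre is the plain mean of the two old centres and the data-vector counts add up:

```python
import numpy as np

def merge_similar(centers, sizes, T=0.38):
    """Merge each ci with the first later cj whose centre distance is
    below T; the merged centre is the mean of the two centres."""
    centers = [np.asarray(c, dtype=float) for c in centers]
    sizes = list(sizes)
    i = 0
    while i < len(centers):
        merged = False
        for j in range(i + 1, len(centers)):
            if np.linalg.norm(centers[i] - centers[j]) < T:
                centers[i] = (centers[i] + centers[j]) / 2.0  # mean centre
                sizes[i] += sizes[j]       # data vectors add up
                del centers[j]
                del sizes[j]
                merged = True
                break
        if not merged:
            i += 1                         # ci has no similar later cluster
    return centers, sizes
```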

4 Experimental Results

Clustering-based methods determine clusters of arbitrary shape, handle large databases efficiently and require few input parameters. In clustering, the exceptions are treated as “noise”, which is tolerable in some cases but sometimes leads to inaccurate results. Real-time datasets for two different seasons (summer and winter) were collected from the TWAD Board, Coimbatore, Tamil Nadu, India. The dataset contains various parameters that characterize drinking water; some of the parameters used in this research are turbidity, EC, TDS, pH, Ca, etc. The SVM and BPNN classifiers are used in this experiment for training and testing. The enhanced k-means and outlier detection are evaluated with various metrics, namely accuracy, normalized root mean square error (NRMSE) and speed.

$$ \mathrm{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \times 100 $$
$$ \mathrm{NRMSE} = \frac{\sqrt{\mathrm{mean}\left[ \left( y_{true} - y_{imp} \right)^{2} \right]}}{\mathrm{variance}\left( y_{true} \right)} $$
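The two quality metrics translate directly into code; this sketch follows the formulas above (NRMSE normalized by the variance of y_true, as written):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Percentage of correct predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * (y_true == y_pred).mean()

def nrmse(y_true, y_imp):
    """Root mean square error normalized by the variance of y_true,
    per the formula above."""
    y_true, y_imp = np.asarray(y_true, float), np.asarray(y_imp, float)
    return np.sqrt(np.mean((y_true - y_imp) ** 2)) / np.var(y_true)
```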

Table 1 presents the classification accuracy before and after OD for various percentages of outliers in the two seasons. It is evident that the classification accuracy of the proposed approach is improved compared with BPNN.

Table 1 Classification accuracy (%)

The normalized root mean square error before and after outlier detection is presented in Table 2 for the two seasons; it is evident that the NRMSE value is reduced compared with BPNN.

Table 2 Normalized root mean square error

The execution speed before and after outlier detection is listed in Table 3 for the summer and winter seasons; it is clear that the execution time is reduced compared with BPNN. Thus the experimental results show that the proposed algorithm works effectively for detecting outliers in the contaminated drinking water dataset.

Table 3 Execution speed (seconds)

5 Conclusion

This paper focused on detecting outliers, i.e., objects that are different from or contradictory to the rest of the data, and proposed an outlier detection method. The issues in the k-means clustering algorithm were handled to enhance its clustering operation and to identify the outliers in the dataset. The first step groups the data into a number of clusters, for which the average inter-cluster distance is used; outliers are then identified from the resulting clusters. The experimental results show that the classification accuracy is improved and the normalized root mean square error is minimized, which is evidence that the proposed algorithm is efficient in identifying outliers and detecting contamination quickly. In future work, a feature selection algorithm can be combined with outlier detection to further improve contamination detection.