1 Introduction

Structured data is data that is easily organized and deposited in databases, for example, data stored in an RDBMS. Semi-structured data is data that is unorganized but has some associations within it, for example, log files and XML files. Unstructured data is data that has no clear storage format, for example, image, video and audio files. Big data analytics is the practice of examining stored data to discover patterns and interdependencies among the data [1]. It can be applied in several fields where a huge amount of data is generated. Big data is characterized by three attributes, commonly known as the three V's. Velocity is the rate at which data arrives at an organization. Variety relates to the many types of data. Volume denotes the magnitude of data flowing into the organization [1].

Data mining is a process used in big data analysis to identify hidden correlations, patterns and statistical information in data repositories that are difficult to discover using traditional techniques [2]. It traverses an enormous amount of information in search of consistent patterns and then validates the findings on a further fragment of the data. Clustering is one such data mining process.

Cluster analysis originated in anthropology with Driver and Kroeber, whose work was published in 1932; it was introduced to psychology by Zubin in 1938 and Robert Tryon in 1939, and was most prominently used by Cattell beginning in 1943 for trait theory classification in personality psychology. Clustering organizes data into groups called clusters such that there is high intra-cluster similarity. The central idea of cluster analysis is to partition a given set of data objects into sections, where each section is clearly separated: objects within one cluster are similar to one another and dissimilar to objects in other clusters. It is a significant process in various applications such as machine learning, data mining, pattern recognition, image analysis and bioinformatics.

1.1 Need for Clustering?

Clustering helps organize huge, voluminous data into clusters and reveals the internal structure of the data [2]. Clustering is the act of segregating the data into clusters, and it improves the readiness of data for AI techniques [3]. As a knowledge discovery process, clustering is used either as a stand-alone tool to gain insight into the data distribution or as a pre-processing step for other algorithms.

2 Clustering and Its Types

Clustering is defined as the grouping of similar text documents into clusters, i.e., documents within a cluster have high similarity to one another but are dissimilar to documents in other clusters [4]. Since vast amounts of information have been added to the World Wide Web, it has become very important to browse and retrieve the relevant information effectively. Identifying an appropriate clustering algorithm yields optimal clusters, so it is imperative to have a means of comparing the results of different clustering techniques. Several different clustering approaches exist to address the problem from distinct angles, namely

  • Partitioned clustering

  • Density based clustering

  • Hierarchical clustering

2.1 Partitioned Clustering

Partitioning methods break the data down into a set of disjoint clusters. Given n objects, these methods produce k clusters, where k < n, using an iterative relocation technique [1]. Algorithms used in partitioned clustering include the k-means and k-medoids algorithms.

2.1.1 k-Means Algorithm

This unsupervised, iterative algorithm repeatedly refines the clusters it forms over a data set. Its main aim is to find the locations of the cluster centers that minimize the distance between each data point and its cluster. Also called "Lloyd's algorithm", it partitions m data points into some number of clusters k, where each data point belongs to the cluster with the closest mean.

Algorithm

  1. Define the number of clusters (k) to be produced and select k initial data points as centroids.

  2. Calculate the distance from every data point to each centroid and assign the point to the cluster with the minimum distance.

  3. Follow the above step for all the data points.

  4. Calculate the average of the data points present in each cluster and set it as the new centroid of that cluster.

  5. Repeat from Step 2 until the desired clusters are formed.

Since the initial centroids are selected randomly, they have a large influence on the resulting clusters. The complexity of the k-means algorithm is O(tkn), where n is the size of the data set, k the number of clusters formed and t the number of iterations needed to form the clusters [1].
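To make the steps above concrete, the following is a minimal Python/NumPy sketch of the k-means loop. The function name k_means, the parameter defaults and the empty-cluster handling are illustrative assumptions, not part of the original paper.

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Minimal k-means (Lloyd's algorithm) sketch on an (n, d) data matrix X."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Steps 2-3: assign every point to the centroid with minimum distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: the mean of each cluster becomes its new centroid
        # (an empty cluster keeps its old centroid -- an assumption made here).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Because the initial centroids are drawn at random, as noted above, running the function with different seeds can yield different clusterings.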

Advantages

  • Effortless implementation process.

  • Produces denser clusters than hierarchical methods when the clusters are spherical.

  • Appropriate for large databases.

Disadvantages

  • Inappropriate for clusters of differing density and size.

  • Results may vary across runs because of the random initialization.

  • Euclidean distance measures can weigh features unequally due to underlying scale factors.

  • Unsuccessful for non-linear and categorical data sets.

  • Noisy data and outliers are difficult to handle.

2.1.2 k-Medoids Algorithm

In this algorithm, each cluster is represented by one of its objects, located near the center of the cluster. Representative objects are repeatedly replaced by non-representative objects as long as the quality of the resulting clustering improves. The quality is estimated using a cost function that measures the average dissimilarity between an object and the representative object of its cluster. The algorithm is implemented in two steps:

Build: the initial medoids are the most centrally located objects.

Swap: a medoid is swapped with another object as long as the total cost can be reduced [5].

Algorithm

  1. Initially choose k random points from the given data set as initial medoids.

  2. Assign every data point to its closest medoid using a distance metric.

  3. For every selected object s and non-selected object n, calculate the swapping cost TCns [5].

  4. If TCns < 0, s is replaced by n.

  5. Repeat steps 2-4 until there is no change in the medoids.
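A minimal PAM-style sketch of this build-and-swap loop is shown below, assuming Python/NumPy and Euclidean dissimilarities; the function name k_medoids and its randomized initialization (a simplification of the Build step) are our own assumptions.

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """Minimal PAM-style k-medoids sketch on an (n, d) data matrix X."""
    rng = np.random.default_rng(seed)
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise dissimilarities
    medoids = rng.choice(n, size=k, replace=False)       # step 1 (Build simplified to random)
    for _ in range(max_iter):
        cost = D[:, medoids].min(axis=1).sum()           # step 2: cost of closest-medoid assignment
        best_delta, best_swap = 0.0, None
        # Steps 3-4: evaluate the swapping cost TCns for every selected/
        # non-selected pair and remember the most negative one.
        for i in range(k):
            for h in np.setdiff1d(np.arange(n), medoids):
                cand = medoids.copy()
                cand[i] = h
                delta = D[:, cand].min(axis=1).sum() - cost
                if delta < best_delta:
                    best_delta, best_swap = delta, (i, h)
        if best_swap is None:                            # step 5: no improving swap remains
            break
        medoids[best_swap[0]] = best_swap[1]
    return medoids, D[:, medoids].argmin(axis=1)         # medoid indices and point labels
```

Working on the precomputed dissimilarity matrix D is what lets the method accept arbitrary dissimilarity measures, one of the advantages listed below.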

Four characteristics to be considered are

  • Shift-out membership: an object may move from its current cluster to another.

  • Shift-in membership: an object from outside may move into the current cluster.

  • Update the current medoids: a current medoid can be replaced by a new medoid.

  • No change: objects remain in their current clusters.

Advantages

  • Easy to understand and implement.

  • Runs quickly and converges in a few steps.

  • Allows the use of general dissimilarity measures between objects.

  • Less sensitive to outliers than k-means.

Disadvantages

  • Different initial sets of medoids can produce different clusterings. It is thus advisable to run the procedure several times with different initial sets.

  • The resulting clusters may depend on the units of measurement. Variables of different magnitudes should be standardized.

2.2 Hierarchical Based Clustering

Hierarchical methods decompose a data set of n objects into groups, forming a hierarchy of clusters. The tree-like representation of hierarchical clustering is called a dendrogram [6]. The root of the dendrogram represents the entire data set and the leaves represent the individual objects. Clustering results are obtained by cutting the dendrogram at different levels. The two approaches to hierarchical clustering are agglomerative (bottom-up) and divisive (top-down).

2.2.1 Agglomerative Clustering Algorithm

This algorithm is also referred to as the bottom-up approach. It treats every data point as a single cluster and then merges the closest (most similar) pairs of clusters until a single large cluster is obtained or some termination condition is satisfied [7, 14].

Algorithm

  1. Initialize all n data points as n individual clusters.

  2. Find the cluster pair with the least (closest) distance and combine the pair into one single cluster.

  3. Calculate the pair-wise distances between the newly formed cluster and the previously available clusters.

  4. Repeat steps 2 and 3 until all data points are merged into a single large cluster of size n.
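As a usage sketch, SciPy's hierarchical clustering routines implement exactly this bottom-up merging; the toy data and the choice of single linkage below are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))       # toy 2-D data set (illustrative)

# Each merge joins the current closest pair of clusters (steps 2-3);
# Z records the full merge hierarchy, i.e. the dendrogram.
Z = linkage(X, method="single")

# Cutting the dendrogram at a chosen level yields the final clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```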

Advantages

  1. Easy to identify nested clusters.

  2. Gives better results and is easy to implement.

  3. Suitable for automation.

  4. Reduces the effect of initial values on the clustering results.

  5. Reduces the computing time and space complexity.

Disadvantages

  1. A merge, once performed, can never be undone.

  2. Difficulty in handling clusters of different sizes and convex shapes leads to increased time complexity.

  3. There is no direct minimization of an objective function.

  4. It is sometimes difficult to identify the exact number of clusters from the dendrogram.

2.2.2 Divisive Based Clustering

This approach is also referred to as the top-down approach. The entire data sample set is considered one cluster, which is then split into smaller clusters iteratively until each object is in its own cluster or a termination condition holds [8]. This method is rigid, because once a split is done, it can never be undone.

Algorithm

  1. Initiate the process with one cluster containing all the samples.

  2. Select the largest cluster, that is, the cluster with the widest diameter.

  3. Detect the data point in the cluster found in step 2 with the minimum average similarity to the other elements in that cluster.

  4. The data point found in step 3 becomes the first element added to the fragment group.

  5. Detect the element in the original group that has the highest average similarity with the fragment group.

  6. If the average similarity of the element obtained in step 5 with the fragment group is greater than its average similarity with the original group, assign the element to the fragment group and go to step 5; otherwise stop splitting this cluster.

  7. Repeat steps 2-6 until each data point is separated into an individual cluster.
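The splitting loop in steps 3-6 can be sketched as follows, assuming Python/NumPy and treating a smaller distance as a higher similarity; the helper name diana_split and its interface are our own assumptions, in the spirit of the DIANA algorithm.

```python
import numpy as np

def diana_split(D, members):
    """One DIANA-style split of a cluster (a list of point indices) into an
    original group and a fragment group, given a distance matrix D
    (smaller distance = higher similarity)."""
    original = list(members)
    # Step 3: minimum average similarity = maximum average distance to the rest.
    avg_dist = [D[i, original].mean() for i in original]
    splinter = original.pop(int(np.argmax(avg_dist)))
    fragment = [splinter]                      # step 4: seed the fragment group
    moved = True
    while moved and len(original) > 1:
        moved = False
        # Steps 5-6: move any element that is closer (more similar) to the
        # fragment group than to the remaining original group.
        for i in list(original):
            if len(original) == 1:
                break
            rest = [j for j in original if j != i]
            if D[i, fragment].mean() < D[i, rest].mean():
                original.remove(i)
                fragment.append(i)
                moved = True
    return original, fragment
```

Applying diana_split repeatedly to the widest cluster (step 2) yields the full top-down hierarchy described above.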

Advantages

  1. It produces more accurate hierarchies than the bottom-up algorithm in some circumstances.

Disadvantages

  1. The top-down approach is computationally more complex than the bottom-up approach, because a second, flat clustering algorithm is needed.

  2. Using different distance metrics for measuring the distance between clusters may generate different results.

2.3 Density Based Clustering

Density based clustering groups a database according to local densities (i.e. a local cluster criterion). Its major features are:

  • It discovers clusters of arbitrary shape.

  • It handles noisy data.

  • It examines only a local region to determine the density.

  • It requires density parameters as a termination condition.

It is categorized into two types namely:

  (a) Density based connectivity: this includes clustering techniques such as DBSCAN and DBCLASD.

  (b) Density based function: in the DENCLUE method, density clusters are obtained based on density functions.

Here, let us see DBSCAN clustering in detail.

2.3.1 DBSCAN (Density Based Spatial Clustering of Applications with Noise)

In DBSCAN, a cluster is defined as a group of highly dense data points. DBSCAN considers two parameters:

Eps: the maximum radius of the neighborhood around a point.

MinPts: the minimum number of data points required within the Eps-neighborhood of a point.

The Eps-neighborhood of a point q is defined as NEps(q) = { p ∈ D | dist(p, q) ≤ Eps }.

In order to understand density based clustering, let us follow a few definitions:

  (a) Core point: a point whose Eps-neighborhood contains at least MinPts points, where Eps and MinPts are specified by the user; such a point has a dense neighborhood.

  (b) Border point: a point that lies within the neighborhood of a core point but does not itself have a dense neighborhood; multiple core points can share the same border point.

  (c) Noise/outlier: a point that does not belong to any cluster.

  (d) Directly density reachable: a point p is directly density reachable from a point q with respect to Eps and MinPts if p belongs to NEps(q) and q satisfies the core point condition, i.e. |NEps(q)| ≥ MinPts.

  (e) Density reachable: a point p is density reachable from a point q with respect to Eps and MinPts if there is a chain of points p1, p2, …, pn with p1 = q and pn = p such that pi+1 is directly density reachable from pi.

Algorithm

  1. In order to form clusters, initially consider a random point, say point p.

  2. Find all points that are density reachable from point p with respect to Eps and MinPts. The following conditions are checked in order to form the cluster:

    (a) If point p is found to be a core point, a cluster is obtained.

    (b) If point p is found to be a border point, no points are density reachable from point p; visit the next point of the database.

  3. Continue this process until all the points are processed.
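A compact sketch of this expansion process is given below in Python/NumPy; the function name dbscan, the label convention (-1 for noise) and the quadratic distance matrix are illustrative assumptions.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: returns a cluster id per point, -1 = noise."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    labels = np.full(n, -1)                              # -1 = noise/unassigned
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for p in range(n):                                   # step 1: pick the next point
        if visited[p]:
            continue
        visited[p] = True
        neighbors = list(np.flatnonzero(D[p] <= eps))
        if len(neighbors) < min_pts:                     # step 2(b): p is not a core point
            continue                                     # it may still become a border point
        labels[p] = cluster                              # step 2(a): new cluster from core p
        seeds = neighbors
        while seeds:                                     # expand via density reachability
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster                      # border/core point joins the cluster
            if not visited[q]:
                visited[q] = True
                q_neighbors = np.flatnonzero(D[q] <= eps)
                if len(q_neighbors) >= min_pts:          # q is a core point: keep growing
                    seeds.extend(q_neighbors.tolist())
        cluster += 1
    return labels
```

For production use, scikit-learn's sklearn.cluster.DBSCAN(eps=..., min_samples=...) implements the same idea with indexing structures that avoid the full distance matrix.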

Advantages

  • It can identify outliers.

  • It does not require the number of clusters to be specified in advance.

Disadvantages

  • If the density of the data keeps changing, it is difficult to find clusters efficiently.

  • It does not suit high-dimensional data, and the user has to specify the parameters in advance.

3 Conclusion

This paper has presented detailed clustering techniques with their algorithmic flow and features. The study provides a reference point for researchers to select the most suitable form of clustering [15, 19]. The paper concludes with a comparative study of the various clustering techniques in Table 1.

Table 1 Classification of clustering algorithm