
1 Introduction

Nowadays, data mining is an essential technique for extracting the most useful information from a data set [1]. It applies automated data investigation techniques to reveal relationships and patterns within data sets. It offers various facilities in the healthcare sector: it helps in the early diagnosis of illness, provides therapeutic solutions at a lower price, and identifies the most suitable treatment methods. It also supports researchers in developing drug recommendation methods and in drafting healthcare policies [2]. Data mining methods are generally classified into supervised and unsupervised learning [3]. Supervised learning tries to label unseen data according to labelled training data. It is used in prediction to determine the value of output variables. In this technique, a model is built from a training data set of known input and output variables [4]; after that, it can predict the class label for an unknown output variable. In supervised learning, the model requires an ample amount of labelled data for the learning process. Unsupervised learning, in contrast, learns hidden patterns from unlabelled data. In this technique, there is no output variable to predict, unlike supervised learning; instead, the patterns in the data set are discovered from the relationships among the data points.

2 Clustering

Clustering is the task of partitioning objects into groups of data points such that the data points within a cluster resemble one another more than they resemble data points in other clusters [5]. Given a set of data points, clustering algorithms assign every data point to a particular group. Data points in the same cluster should have similar properties, whereas data points in different clusters should have dissimilar properties. Clustering is an unsupervised learning technique used to help professionals find hidden patterns in a data set; it exhibits the similar and dissimilar properties of the different groups. Let us understand this with an example. Suppose the head of a departmental store wants to understand the preferences of customers in order to enhance the business. It is impossible to look at the details of each customer and plan a unique business strategy for each one of them. The easier way is to cluster all customers into, say, five groups based on their shopping history and use a separate strategy for the customers in each of these five groups. Clustering algorithms can be used in different sectors, for example, for the classification of diseases in the medical domain and for identifying customer interest in market studies. There is no universal method for clustering, so various methods are used for diverse clustering purposes.

2.1 K-Means Clustering

Now that we understand what clustering is, let us take a look at the types of clustering algorithms.

2.1.1 K-Means Clustering Algorithm

This is one of the simplest clustering algorithms since it is straightforward to implement. It is a form of unsupervised learning used for data without defined groups. The algorithm works iteratively to allocate each data point to one of K groups based on the features that are provided. K-means clustering has been found to be very helpful in grouping new data. A few applications which use k-means clustering are sensor measurements, activity monitoring in a manufacturing process, audio detection, and image segmentation [6, 13].

  • Steps:

    1. Select the number of groups (K) to be used and arbitrarily initialize their centre points.

    2. Categorize every data point by evaluating the distance between that point and the centre of every group, and assign the point to the group with the nearest centre.

    3. Recompute each group centre by taking the mean of all the vectors in the group.

    4. Repeat these steps until the group centres do not change appreciably between iterations. The whole procedure can also be run several times with different random initializations of the group centres, choosing the run that provides the best result, as shown in Fig. 1 (a minimal code sketch follows the figure).

  • Pros:

    • Simple to execute.

    • It is a fast method due to fewer computations.

    • For large number of variables, K-means may be faster than hierarchical clustering (if value of K is small).

    • It may produce tighter clusters than hierarchical clustering.

  • Cons:

    • The challenging aspect is choosing the number of groups (the value of K) in advance.

    • Because the cluster centres are selected arbitrarily, the result may be inconsistent between runs.

Fig. 1
figure 1

K-means clustering algorithm
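
The steps above can be illustrated with a minimal scikit-learn sketch on synthetic data (illustrative only; this is not the chapter's experimental code, and the data set and parameter values are assumptions):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groups (assumed for illustration)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Step 1: choose K and initialize the centres randomly;
# n_init=10 restarts the algorithm with different random centres and keeps the best run
km = KMeans(n_clusters=3, init='random', n_init=10, random_state=42)

# Steps 2-4: assign each point to the nearest centre and recompute the centres until convergence
labels = km.fit_predict(X)

print(km.cluster_centers_)  # final group centres
print(km.inertia_)          # within-cluster sum of squared distances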

2.2 Mean-Shift Clustering Algorithm

Mean shift is a kind of sliding-window algorithm [7]. It is used to discover dense regions of data points and to trace the centre point of each group. Within the sliding window, it updates each candidate centre point to the mean of the points inside the window; near-duplicate candidates are then removed in a post-processing stage. The result is a final set of centre points with their related groups. The distinction between the k-means algorithm and mean shift is that in mean shift there is no need to state the number of clusters in advance, because it is determined from the data. The mean-shift clustering algorithm is mostly used in computer vision problems, image processing, video tracking, and image segmentation.

The mean-shift clustering algorithm works through the following steps (a minimal code sketch is given after the pros and cons below):

  • Step 1 − Begin with every data point forming a cluster of its own.

  • Step 2 – Calculate the mean of the points that fall within the window around each centroid.

  • Step 3 – Shift each centroid to the position of that mean.

  • Step 4 – Repeat the process so that the centroids move towards the high-density areas.

  • Step 5 − Stop once the centroids reach a position from which they cannot shift further.

  • Pros:

    • Unlike the k-means clustering algorithm, selecting the quantity of clusters is not necessary.

    • The cluster centres converge towards the points of maximum density.

    • It compares favourably with k-means because there is no need to provide the value of ‘k’, that is, the number of clusters.

    • This algorithm takes only one input, that is, the bandwidth of the window.

  • Cons:

    • It may be computationally expensive due to the large number of steps in the algorithm.

    • The selection of the bandwidth is an important issue.

    • If the bandwidth is very small, some data points may be missed and convergence may not be reached.

    • If the bandwidth is very large, some clusters might be missed entirely.
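
A minimal scikit-learn sketch of mean shift on synthetic data is given below (illustrative only; the bandwidth, the single input discussed above, is estimated from the data here, and the data set is an assumption):

from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic data (assumed for illustration)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# Estimate the window bandwidth from the data itself
bandwidth = estimate_bandwidth(X, quantile=0.2)

# The number of clusters is not specified; it emerges from the density modes
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
labels = ms.fit_predict(X)

print("Estimated clusters:", len(ms.cluster_centers_))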

2.3 DBSCAN Clustering

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is an extensively used density-based algorithm [15, 16]. DBSCAN has been found useful for discovering clusters of non-linear shape based on density.

Let P = {p1, p2, p3, ..., pn} be a set of data points. DBSCAN requires two parameters: ε (eps), the neighbourhood radius, and the minimum number of points required to form a cluster (minPts).

Step 1 – Begin with an arbitrary point not explored so far.

Step 2 – Retrieve the ε-neighbourhood of this point (data points within distance ε are neighbours).

Step 3 – If there are enough neighbours around that point, the clustering process starts and those points are marked as explored; otherwise, the point is labelled as noise (it may still become part of a cluster later).

Step 4 – If a point belongs to a cluster, then every point in its ε-neighbourhood is also a member of that cluster. So, repeat from step 2 for all ε-neighbourhood points until all the data points in the cluster are determined.

Step 5 – Then retrieve and process a new unvisited point, which leads either to a further cluster or to noise.

Step 6 – Iterate this process until all points are marked as visited (Fig. 2). A minimal code sketch is given after the pros and cons below.

Fig. 2
figure 2

DBSCAN clustering

  • Pros:

    • It is better than other clustering algorithms in that there is no need to specify the number of clusters in advance.

    • It recognizes outliers as noise.

  • Cons:

    • This clustering algorithm may not be very helpful when clusters have varying densities. When the density level changes, the threshold distance ε and the minimum number of points used to identify neighbours may no longer suit every cluster.

    • It is a challenge to determine the threshold distance ε for high-dimensional data.
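
A minimal scikit-learn sketch of DBSCAN is given below (illustrative only; eps corresponds to ε and min_samples to minPts, and the moon-shaped synthetic data set is an assumption):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Non-linearly shaped (moon-shaped) data that centroid-based methods handle poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighbourhood radius; min_samples is the minimum number of points per cluster
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

# Points labelled -1 are treated as noise (outliers)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters)
print("Noise points:", int(np.sum(labels == -1)))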

2.4 Expectation-Maximization (EM) Clustering Using Gaussian Mixture Models (GMM)

Gaussian mixture model (GMM) clustering provides more flexibility than other clustering techniques [6]. In GMM clustering, it is initially assumed that the data points are generated from Gaussian distributions. The mean and standard deviation are the main parameters that determine the shape of each cluster. Since the standard deviation can differ in both directions, each cluster can take an ellipsoidal shape in the multivariate case; the clusters do not need to be spherical. Expectation-Maximization (EM) is the optimization algorithm used [6, 14] to obtain the Gaussian parameters for every group.

Similar to the k-means clustering algorithm, the number of clusters is selected first and the Gaussian distribution parameters are initialized arbitrarily. After that, the GMM algorithm follows these steps (a minimal code sketch is given after the pros below).

  1. For each cluster, estimate the probability that every data point belongs to it. The closer a point is to the Gaussian’s centre, the more likely it is to belong to that group.

  2. Based on these probabilities, a new set of parameters is computed for the Gaussian distributions so as to maximize the probabilities of the data points within the clusters. The weight factor is an important parameter that determines the probability of a data point belonging to a particular cluster.

  3. Repeat the above steps until convergence.

  • Pros:

    • GMM uses the concept of standard deviation. So, it provides more flexibility in terms of cluster covariance.

    • In k-means, a data point relates to one and only one cluster. On the other hand, in GMM, a data point relates to each cluster to a precise degree. The degree is based on the probability of the point being produced from each cluster’s normal distribution, with cluster centre as the distribution’s mean and cluster covariance as its covariance.
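
A minimal scikit-learn sketch of EM clustering with a Gaussian mixture model is given below (illustrative only; the synthetic data and the choice of four components are assumptions):

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# Synthetic data (assumed for illustration)
X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

# Select the number of components; the means and covariances are fitted by the EM algorithm
gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=1)
gmm.fit(X)

hard_labels = gmm.predict(X)        # most probable cluster for each point
soft_labels = gmm.predict_proba(X)  # degree of membership in every cluster
print(soft_labels[:5].round(3))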

2.5 Hierarchical Agglomerative Clustering

Hierarchical clustering is an extensively used method for analysing social network data. In this clustering method, the data are compared with one another based on their resemblance. Groups of nodes are combined into larger groups based on their similarity. The core task in hierarchical agglomerative clustering is therefore to iteratively merge the two nearest clusters into a larger cluster [8].

Working of the hierarchical clustering algorithm (a minimal code sketch follows Fig. 3):

Suppose there are five data points: p, q, r, s, and t.

Step 1 – Treat each point as a single cluster and estimate the distance of each cluster from all the other clusters.

Step 2 – Now the most similar clusters are combined to form a single cluster. Suppose cluster (p) and cluster (q) are similar to each other, so they are combined in this step; similarly, clusters (s) and (t) are combined, so the clusters [(p, q), (r), (s, t)] are obtained.

Step 3 – Recompute the closeness of the clusters according to the algorithm and join the two nearest clusters [(r), (s, t)] together to form the new clusters [(p, q), (r, s, t)].

Step 4 – Lastly, the remaining clusters are merged together to form a single cluster [(p, q, r, s, t)], as depicted in Fig. 3.

  • Pros:

    • It is not necessary to specify the number of clusters in advance.

    • It is insensitive to the selection of distance metric.

  • Cons:

    • The algorithm can never undo a previous step. For example, if the algorithm merges two points and it is later noticed that the pairing was not a good one, that step cannot be undone.

    • Due to its higher time complexity, the computation time may be greater than that of methods like k-means.

Fig. 3
figure 3

Hierarchical clustering
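
A minimal SciPy sketch of the merging process on five hypothetical points standing in for p, q, r, s, and t is given below (illustrative only; the coordinates are assumptions):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Five hypothetical two-dimensional points standing in for p, q, r, s, t
points = np.array([[1.0, 1.0], [1.2, 1.1], [3.0, 3.0], [5.0, 5.1], [5.2, 5.0]])

# Ward linkage merges the two nearest clusters at every step
Z = linkage(points, method='ward')

# Cut the merge tree to obtain, for example, two clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)

# dendrogram(Z) would draw the merge tree (cf. the dendrograms in Fig. 5)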

3 Experimental Analysis

The experiment has been conducted on a Windows 7 machine with 4 GB RAM. The implementation has been done in Python. The diabetes data set from the UCI Machine Learning Repository [9] has been taken into consideration. It consists of eight attributes and 768 observations, containing the records of 500 non-diabetic persons and 268 diabetic persons (Table 1).

Table 1 Description of sample data set

Code 1: Load the Data Set

import pandas as pd

data = pd.read_csv('diabetes.csv')
data = data.drop("Outcome", axis=1)
array = data.values[:, 0:8]
print(data.head(10))

The diabetes data set has some missing values, so it is necessary to preprocess the data before using it. Data preprocessing techniques enhance the overall quality of the mined models and also decrease the time needed for the actual mining. In the diabetes data set, some entries have zero values for attributes such as glucose level, body mass index, diastolic blood pressure, skin thickness, and insulin level, which is physically impossible. Therefore, the data need to be preprocessed.
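
One possible way to handle these entries, assuming the standard column names of the diabetes CSV file (Glucose, BloodPressure, SkinThickness, Insulin, BMI), is to mark the zeros as missing values and impute them with the column median; this is an illustrative sketch, not necessarily the exact procedure used for Table 2:

import numpy as np

# Columns where a zero value is physically impossible (column names assumed)
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Replace zeros with NaN, then impute with the column median
data[zero_cols] = data[zero_cols].replace(0, np.nan)
data[zero_cols] = data[zero_cols].fillna(data[zero_cols].median())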

Preprocessing has been applied to the data set to normalize the values, as shown in Table 2.

Table 2 Preprocessed data

Code 2: Preprocess the Data

from sklearn.preprocessing import scale

X = pd.DataFrame(scale(data))
print(X.head(n=5))

After that, the k-means clustering technique is applied to identify the clusters, as shown in Fig. 4.

Fig. 4
figure 4

Clusters according to k-means clustering

Code 3: K-Means Clustering

from sklearn.cluster import KMeans

clustering = KMeans(n_clusters=4)
clustering.fit(X)

To observe the performance of hierarchical clustering, agglomerative hierarchical clustering has also been applied to the diabetes data set (Figs. 5 and 6).

Fig. 5
figure 5

Dendrograms

Fig. 6
figure 6

Cluster according to agglomerative clustering

Code 4: Apply Hierarchical Clustering

from sklearn.cluster import AgglomerativeClustering

clust = AgglomerativeClustering(n_clusters=4, affinity='euclidean', linkage='ward')
clust.fit_predict(X)

3.1 Performance Analysis

The following performance metrics have been taken into consideration for measuring the performance of the clustering algorithm.

Davies Bouldin (DB) Index

It measures the average similarity between each cluster and the cluster most similar to it. Since the clusters should be well separated from one another, a clustering that minimizes the index value is the best one [10].

Silhouette Analysis

It is employed to measure how similar an object is to its own cluster compared with objects in other clusters. Its score lies between −1 and +1, where +1 indicates that the object lies well within its own cluster and −1 indicates that it has been assigned to the wrong cluster [7, 11].
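
Both indices are available in scikit-learn. The following sketch (illustrative only, assuming X holds the preprocessed diabetes data from Code 2) computes them for several candidate cluster counts:

from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Evaluate both indices for candidate numbers of clusters
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    print(k, davies_bouldin_score(X, labels), silhouette_score(X, labels))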

As depicted in Fig. 7, according to both indices the suitable number of clusters is four.

Fig. 7
figure 7

DB index and Silhouette analysis

It has been observed from Table 3 that agglomerative clustering has better execution time as compared to k-means clustering.

Table 3 Performance comparison of k-mean and agglomerative clustering

Data set 2: To group unlabelled data, k-means clustering has been applied to the Wisconsin breast cancer data set [12]. The data set contains records of patients, and the task is to determine whether each patient had cancer at the moment the information was gathered. The clusters discovered here will be the foundation for additional research.

Code 5: Load the Breast Cancer Data Set

# Load data
from sklearn import datasets

dataset = datasets.load_breast_cancer()
X = dataset.data[:, 0:2]

After applying k-means clustering and agglomerative clustering, the obtained clusters are depicted in Figs. 8 and 9 (Table 4).

Fig. 8
figure 8

K-mean clustering

Fig. 9
figure 9

Agglomerative clustering

Table 4 Performance comparison of k-mean and agglomerative clustering

It has been observed from the results that agglomerative hierarchical clustering performs better than k-means clustering. It provides more flexibility than k-means clustering and makes fewer hidden assumptions about the distribution of the data. K-means can give unpredictable results if the data are not well separated into sphere-like clusters. In contrast, hierarchical clustering makes fewer assumptions about the distribution of the data: it typically associates nearby objects within a cluster and then successively merges nearby clusters into the most related collection. It can be computationally expensive, but it usually produces more intuitive results.

4 Conclusions

In this chapter, the performance of different clustering algorithms on healthcare data sets has been evaluated. The performance of the algorithms has been analysed according to the number of clustered instances. According to the investigation, both the k-means and the agglomerative algorithm achieve efficient intra-cluster cohesion and inter-cluster separation. The agglomerative algorithm has better accuracy and execution time than the k-means algorithm, so it is concluded from the results that the agglomerative algorithm performs better than k-means clustering.