Keywords

1 Introduction

The numerous records of healthcare data generated every day are increasing astronomically in today’s modern era [1]. The explosion of medical sensors, internet of things devices, and digitalization of medical records have created a flood of data typically landing in different medical storage repositories. Then, various kinds of operations such as analytical, process, and retrieval are performed to extract valuable insights from the raw data [2]. With the help of real-time alerts, doctors or medical practitioners will take perspective decisions about treatment at the right time [3, 4]. Therefore, big data analytics solutions can be used to save human lives, provide analysis much faster, ultimately save money and improve the efficiency of treatment [5].

The healthcare data is captured from various sources that include [6] hospitals, clinical, medical research, electronic records, and authorized websites respectively. They are stored in different formats such as text, video, audio, image, impala complex types, and sequence file respectively [7] and also make it very difficult to process and analyze all pieces of data effectively. One key strategy to solve this analytic issue is to group or cluster the big health data in a more compact format. In such a case, clustering algorithms contribute a major role to analyze the massive volume of healthcare data as small segments in a dispersed way and effectively aggregate all these data across different clusters to obtain the final processed medical data [8]. There are several clustering algorithms developed [9] to analyze the data but still, it is a challenging task which algorithm provides the best and the optimal number of clusters with respect to different data sets. Many authors have evaluated the clustering algorithms using different medical data sets with unique validation metrics [10,11,12,13]. Only a few authors [14] have been used synthetic data sets with real-time data sets to assess the variations and performance of three distinct clustering algorithms. Each data set is unique in its own way. No studies have been considered so far to estimate various clustering algorithms using the mixed type of physiological data. This type of analysis on vital parameters must require in the near future to identify the time-critical data than normal data. Therefore, this work considers only a synthetic data set instead of real-time data sets to evaluate the best number of clusters for healthcare data analysis. Moreover, the value of the raw healthcare data collected from hospitals or patients in real-time may be similar or slightly different from our synthetic data set. But the minimum and maximum values of vital data may only deviate from the considered ranges.

Despite the vast number of analysis for clustering algorithms using various healthcare data sets including heart rate [15], brain [16], body temperature [17], emotions [18], cancer [19], blood pressure, ambulatory, and emergency respectively, available in the literature. In such a case, it is very difficult for handlers to decide in advance which algorithm is most suitable one for identifying the abnormality in a given big health dataset. There are still many limitations exist in the literature that need to be addressed: (i) the unique attributes of various clustering algorithms especially for physiological data set are not analysed carefully, (ii) several clustering algorithms have been developed for healthcare domain but they were not deliberated any mixed type of vital information and (iii) only experimental analysis has been carried out to specific healthcare data set to study the significance of one algorithm over another. The aforementioned reasons are highly motivated us to inspect various clustering algorithms, especially for the mixed type physiological data set. The main contributions are outlined as follows:

  • To study three distinct types of clustering algorithms based on the theoretical perspectives.

  • To validate the different clustering algorithms using internal and stability metrics.

  • To analyze the most optimal clustering algorithm with respect to clinical perspectives.

Therefore, this article provides readers with a sufficient analysis of particular clustering algorithms by theoretically and experimentally comparing them on the synthetic physiological data set. Other sections of this paper are described as follows: The theoretical details of clustering algorithms are summarized in Sect. 2. Section 3 describes the internal and stability validation measures for various clustering algorithms. The experimental and comparative analysis of different clustering algorithms are explained in Sect. 4. Finally, Sect. 5 concludes the paper with appropriate clustering algorithm with future scope.

2 Analysis of Clustering Algorithms

Clustering is one of the best known algorithm in machine learning domain, named as an unsupervised learning algorithm [20]. The significance of clustering algorithm is to divide the large volume of data into smaller groups of data when there is no class labels available to process the datasets. Each cluster contains a set of data points where clustering algorithm mainly used to classify and group each data point into a particular cluster. Besides, the data points within the same cluster should have similar properties, while data points in the different cluster should have highly dissimilar properties and/or features [21]. Many clustering algorithms for analyzing healthcare data sets have been introduced in the existing research works [22,23,24,25,26]: K-means, K-Medoids or Partitioning Around Medoids (PAM), and Hierarchical. The main procedures of these algorithms are classified as follows.

2.1 K-means Clustering Algorithm

K-means is a simple and most general clustering algorithms which is mainly used to classify the given dataset that is unlabeled. This algorithm mainly aims to find similar clusters represented by variable k. For this purpose, this algorithm uses the mean or centroid as a metric to characterize the cluster. A centroid is a data point that indicates the center of the cluster, and it might not necessarily be a member of the dataset. So, it divides n data points into k number of clusters and then each data point n belongs to appropriate cluster with the nearest possible centroid. Next, the Euclidean distance is accurately calculated from each data point n to the centroid in a given cluster. Always, the data points in a cluster are assigned to the centroid depending on the minimum euclidean distance from that centroid point. When there no data point is available to assign, an early grouping is considered. In such case, ‘c’ new centroids are re-calculated, thus new iteration continues until the ‘c’ centroids stop changing their position.

2.2 K-medoids Clustering Algorithm

K-medoid is a variant type of algorithm which is also termed as Partition Around Medoids (PAM). In this algorithm, data point act as a medoid within a cluster that are centrally located whose disparity over all data points in the cluster is minimal. Therefore, this medoid can be used as a representative of other data points within a cluster. The main core idea of PAM is to first calculate major data point as a medoid in a specific cluster, group the set of medoids, and then each data point is assigned to the nearest medoid in a given cluster. Moreover, this algorithm generally follows two phases: build and swap phase. The role of the first phase is to select the first medoid as the data point with the lowest mean dissimilarity with respect to the whole dataset. Likewise, in the second phase, given the current set of ‘k’ medoids, all the neighbor data points are evaluated. A new medoid is created by exchanging data points in the old medoid with the data points in a new non-medoid.

2.3 Hierarchical Clustering

Hierarchical is a special type of unsupervised machine learning algorithm, also referred as Hierarchical Cluster Analysis (HCA). The goal of hierarchical cluster analysis is to cluster similar unlabeled data points into number of clusters using tree based structure. The data points in the end of tree forms a set of clusters, where each and every cluster is distinct from other clusters. Besides, the data points within a specific cluster is mostly identical to other clusters in the data set. This algorithm uses a tree-type structure (dendrogram) based on the hierarchy. Basically, there are two types of hierarchical clustering algorithms include Agglomerative hierarchical clustering or AGNES (Agglomerative Nesting) and Divisive hierarchical clustering or DIANA (Divisive Analysis). Both this algorithm is exactly the reverse of each other. The summary of various algorithms with respect to various characteristics are listed in Table 1.

Table 1. Summary of clustering algorithms

3 Validation Measures

The performance of unsupervised learning algorithms is evaluated using different internal, and stability validation metrics. The internal measures are very important for evaluating the right number of clusters and computing the quality of the appropriate clustering algorithm. This measures consider only the internal information to calculate the quality of a clusters without using any external information. The basic internal validation measurements [27] are classified into three types: Connectivity, Silhouette and Dunn index. This section briefly presents the internal validation indices used for a physiological data set.

3.1 Internal Measures

Connectivity.

This measure represents the total number of rows \( n \) (data points or observations) and columns \( m \) in a dataset. The values are always considered as numeric (e.g., a physiological parameter’s values). Let \( Y_{ni} (j) \) and \( x_{i} Y_{ni} (j) \) be the \( j^{th} \) nearest neighbor of data point \( i \) and zero, respectively, if both \( i \) and \( j \) are in the same cluster, and then \( {\raise0.7ex\hbox{$1$} \!\mathord{\left/ {\vphantom {1 j}}\right.\kern-0pt} \!\lower0.7ex\hbox{$j$}} \) otherwise. The connectivity is measured for a particular cluster \( {\mathcal{C}} = \left\{ {{\mathcal{C}}_{1} , {\mathcal{C}}_{2} \ldots .{\mathcal{C}}_{k} } \right\} \) with \( n \) data points using the below equation

$$ {\mathcal{C}} = \text{ }\mathop \sum \nolimits_{i = 1}^{n} \mathop \sum \nolimits_{j = 1}^{p} x_{i} Y_{ni} \left( j \right) $$
(1)

Where \( p \) represents a parameter value and if the connectivity measure has a value between \( 0 \) and \( \infty \), it should always be decreased.

Silhouette Coefficient.

This coefficient is a very useful metric for evaluating the performance of clustering results. This value measures how data points are grouped and computes the average distance available between the different clusters. The width of this coefficient always lies in the following interval [−1, 1] that implies the super grouped data points with values near to \( 1 \) and lower grouped data points with values near to \( - 1 \). Therefore, the coefficient for data point \( i \) is defined as

$$ S\left( i \right) = \frac{{\left( {y_{i} - x_{i} } \right)}}{{max\left( {y_{i} , x_{i} } \right)}} $$
(2)

Where \( x_{i} \) and \( y_{i} \) denote the average distance between the data points in the same cluster and the average distance between the data points in the nearest neighboring clusters which can be expressed as

$$ y_{i} = \mathop {\hbox{min} }\limits_{{{\mathcal{C}}_{k} \in \frac{{\mathcal{C}}}{{{\mathcal{C}}_{i} }}}} \mathop \sum \nolimits_{{j \in {\mathcal{C}}_{k} }} \frac{dist(i,j)}{{n{\mathcal{C}}_{k} }} $$
(3)

Where \( {\mathcal{C}}_{i} \) indicate a cluster with data point \( i \), \( dist(i,j) \) presents the distance between the data points \( i \) and \( j \), then \( n{\mathcal{C}}_{k} \) implies cardinality of the cluster \( { \mathcal{C}} \).

Dunn Index.

This is an important metric that presents the ratio of the lowest distance between the data points which is not available in the same cluster and the highest distance in the intra-cluster. The index value can be obtained as

$$ D_{{\mathcal{C}}} = \mathop {\hbox{min} }\limits_{{{\mathcal{C}}_{k} , { \mathcal{C}}_{l} \in {\mathcal{C}},{ \mathcal{C}}_{k} \ne { \mathcal{C}}_{l} }} \frac{{\left( {\mathop {\hbox{min} }\limits_{{i \in {\mathcal{C}}_{k} , j \in { \mathcal{C}}_{l} }} dist(i,j)} \right)}}{{\mathop {\hbox{max} }\limits_{{{ \mathcal{C}}_{m} \in {\mathcal{C}}}} d\left( {{ \mathcal{C}}_{m} } \right)}} $$
(4)

Where \( d\left( {{ \mathcal{C}}_{m} } \right) \) indicates a cluster \( { \mathcal{C}}_{m} \) with maximum distance and this index has a value between 0 and ∞, and it should always be increased.

3.2 Stability Measures

The stability measure is a special type of validation measure to individually evaluate the cluster results from the overall analysis by removing each column in the data set. This type of measure is very significant especially when the physiological raw data are highly correlated with others. For this purpose, this study uses stability measures to compare the consistency of raw data in the medical synthetic data set. Generally, the stability measures [28] are broadly classified into four different groups: (i) Average Proportion of Non-overlap (APN), (ii) Average Distance (AD), (iii) Average Distance between Means (ADM), and (iv) Figure of Merit (FOM).

Average Proportion of Non-overlap (APN).

This measure is used to calculate the average proportion of data point that is not located in the same cluster with a particular or single column removed. Let consider \( {\mathcal{C}}^{i,0} \) be the cluster with data point \( i \) using the original cluster and \( {\mathcal{C}}^{i,l} \) be the cluster with column \( l \) removed in the data set. Then, APN value is always varied between the following interval [0, 1]. If the APN values close to 0 that indicates the highly consistent results. For the total number of cluster set K, the APN value is measured using given formula

$$ APN\left( K \right) = \frac{1}{MN}\mathop \sum \nolimits_{i = 1}^{N} \mathop \sum \nolimits_{l = 1}^{M} \left( {1 - \frac{{n\left( {{\mathcal{C}}^{i,l} \mathop \cap \nolimits {\mathcal{C}}^{i,0} } \right)}}{{n\left( {{\mathcal{C}}^{i,0} } \right)}}} \right) $$
(5)

Average Distance (AD).

The main function of AD measure is to predict the average distance between the data points that are placed in the same cluster by considering the aforementioned two cases. If the AD has a value between zero and ∞, and then the smaller values are always considered to evaluate the results. The following given expression is used to compute AD,

$$ AD\left( K \right) = \frac{1}{MN}\mathop \sum \nolimits_{i = 1}^{N} \mathop \sum \nolimits_{l = 1}^{M} \frac{1}{{n\left( {{\mathcal{C}}^{i,l} \mathop \cap \nolimits {\mathcal{C}}^{i,0} } \right)}}\left[ {\mathop \sum \nolimits_{{i \in {\mathcal{C}}^{i,0} , j \in {\mathcal{C}}^{i,l} }} dist(i,j)} \right] $$
(6)

Average Distance Between Means (AM).

The main objective of this measure is to calculate the average distance between data points that are presented in the same cluster under the aforementioned two cases. However, only it uses the Euclidean distance with smaller values between 0 and ∞ is always preferred. Let \( \overline{x}_{{{\mathcal{C}}^{i,0} }} \) denote cluster contains average data points \( i \) and \( \overline{x}_{{{\mathcal{C}}^{i,l} }} \) indicate the cluster contains data point \( i \) with column \( l \) removed. Then, it is computed using the below formula,

$$ ADM\left( K \right) = \frac{1}{MN}\mathop \sum \nolimits_{i = 1}^{N} \mathop \sum \nolimits_{l = 1}^{M} \frac{1}{{n\left( {{\mathcal{C}}^{i,l} \mathop \cap \nolimits {\mathcal{C}}^{i,0} } \right)}}dist(\overline{x}_{{{\mathcal{C}}^{i,l} }} , \overline{x}_{{{\mathcal{C}}^{i,0} }} ) $$
(7)

Figure of Merit (FOM).

The decisive role of a FOM is to estimate the average variance of the deleted columns in different clusters and grouping is performed based on the remaining (undeleted) columns. The smaller values between 0 and ∞ are mostly preferred and also it computes the mean error rate using average number of clusters. Then, FOM predicts a particular left-out column \( l \) using the given formula

$$ FOM\left( {l, K} \right) = \sqrt {\frac{1}{N}\mathop \sum \nolimits_{k = 1}^{k} \mathop \sum \nolimits_{{i \in {\mathcal{C}}^{k,l} }} dist\left( {x_{i,l} ,\overline{x}_{{{\mathcal{C}}_{k} (l)}} } \right)} $$
(8)

Where \( x_{i,l} \) presents the value of \( i^{th} \) observation in the \( l^{th} \) column and \( \overline{x}_{{{\mathcal{C}}_{k} (l)}} \) denote the average of a cluster. Generally, FOM uses only Euclidean distance and also it is multiplied by the following adjustment factor \( \sqrt {\frac{N}{N - K}} \), to decrease the amount of cluster expansions.

4 Experimental Results

The clustering algorithms are validated by including two packages defined in R programming tool. The two major packages used in this study are clValid [29] package and NbClust package [30], respectively. Both packages are very significant to determine the best optimal number of data clusters for a given data set and validate the effective results from the clustering analysis. This analysis study uses Euclidean distance as a parameter in NbClust function. The frequency of occurrence of time-critical data is measured with respect to the range of vital parameters, which are shown in Fig. 1.

4.1 Data Set

This experiment study uses statlog heartrate real-world data set (i.e., UCI machine learning repository) as a basic data set, which consists of 130 instances and 3 variables. To validate the advantages of the synthetic dataset, this work includes 5 additional variables by utilizing the same 130 instances. The data set contains only numerical values with different attributes. The vital ranges of each attribute are incorporated based on the conditions of the patient such as normal, moderate and extremely high. The various characteristics of both real world and synthetic healthcare data sets are mentioned in Table 2.

Table 2. Various characteristics of healthcare data sets

4.2 Comparative Analysis

The aim of comparative analysis is to choose how accurately each and every algorithm can able to group similar health records from the mixed physiological data set. Further, to analyze the optimal number of the cluster’s size for every algorithm and predict which algorithm performs better than others. The analysis results of three different clustering algorithms are validated using both internal and stability measures.

Evaluating Validity.

The analysis results of various clustering algorithms based on the internal validity measures are presented in Table 3. Initially, algorithms are validated with the varying cluster size from k = 2 to k = 10.

Fig. 1.
figure 1

Frequency of occurrence of vital data

From the cluster analysis, it is observed that the K-means algorithm with two clusters provides better results using connectivity and Dunn index measures as compared to the hierarchical clustering algorithm. However, the hierarchical algorithm achieves better output according to the silhouette validity measure. Therefore, it is the second best known clustering algorithm in terms of internal validity. Moreover, the comparative analysis suggested that the K-medoids yield no clustering results in comparison to K-means and hierarchical algorithms.

Evaluating Stability.

The stability of three different clustering algorithms is validated to predict any variations in the clustering outputs based on the removal of one column in a given data set. The achieved results of stability for each clustering algorithm are displayed in Table 3. From the assessments, it is noticed that the hierarchical algorithm almost approaches the lower stability values based on the APN, ADM, and FOM respectively. Though it achieved better stability values with all three measures it is failed to provide the best result for AD measure. Further, the maximum stability value of K-means algorithm indicates that the algorithm is not able to yield better values. Likewise, the Pam algorithm is ineffective to give stable outputs in terms of stability measures. Hence the hierarchical clustering algorithm contributes the highest stability results in every aspects as compared with K-means and Pam algorithms.

Table 3. Internal validation of clustering algorithms

Evaluating Optimal Scores.

The optimal number of clusters and their scores are evaluated using two important measures such as internal and stability. The best optimal scores of every algorithms are depicted in Table 4. Based on the observations, it is clearly shown that the K-means algorithm with two optimal clusters can provide the best results in terms of connectivity, Dunn index and silhouette, respectively. In contrast, the hierarchical algorithm with different clusters often yields the highest stability values for APN, ADM, and FOM except for AD among all considered clustering algorithms.

Table 4. Stability validation of clustering algorithms
Table 5. Optimal scores for various clustering algorithms

However, the best optimal cluster size for a physiological data set is 2 and also it is significantly confirmed that the suitability for dealing with high-dimensional physiological datasets. Finally, this analysis suggested that the Pam algorithm failed to produce the optimal number of clusters on synthetic data set with high problem dimensionality, as mentioned in Table 5.

5 Conclusion and Future Work

This study provided a detailed theoretical view on clustering algorithms especially for healthcare data analysis from both theoretical and experimental perspectives. There are numerous clustering algorithms deliberated in the existing studies for analyzing healthcare data sets and also validated with different metrics. However, it is very hard to decide in advance which clustering algorithm would be the most suitable for a particular data set and what would be the best optimal number of clusters from a given a set. Based on these perceptions, this study analysed various clustering algorithms in clinical point of view and validated using internal and stability measures. The observed results reported a better solution to develop novel clustering algorithm and to recommend a specific algorithm for huge volume of physiological data set. The grouping of abnormal variations from different columns of data sets is the most significant requirement rather than grouping the normal variations when using the mixed or complicated vital data sets. In future, this study will further extend the analysis for big pandemic healthcare data sets with respect to similarity score, condition-specific, and then generic preference-based measures.