
1 Introduction

R uses collections of packages to perform diverse functions. The CRAN project provides numerous packages so that different users can choose according to their needs, and R packages contain a variety of functions for data mining approaches. This paper examines different clustering algorithms on the Hepatitis dataset using R. These clustering algorithms give different results depending on the conditions: some techniques are better suited to large datasets, while others give good results for finding clusters of arbitrary shapes. The purpose of this paper is to study and relate different data mining clustering algorithms. The algorithms under investigation are as follows: the K-Means algorithm, K-Medoids, hierarchical clustering, fuzzy clustering and hybrid clustering. We compare all of these clustering algorithms according to several factors, and after this comparison we describe which clustering algorithm should be used under which conditions to obtain the best results.

2 Related Work

Several researchers have worked on different clustering algorithms: some have implemented existing algorithms, while others have proposed and implemented new ones. Various indices have also been applied to measure the performance of different clustering techniques and to validate the clustering algorithms.

3 Clustering Analysis Using R Language

Data mining does not have to be performed with expensive tools and software; here, we use the R language. R is both a language and a platform for statistical computing and graphics. The clustering techniques used here fall into four basic categories: partitioning methods, hierarchical methods, model based methods and hybrid clustering. The hepatitis dataset is used to validate the results.

4 Clustering Concepts

Cluster analysis is the task of grouping a set of objects in such a way that objects in the same group, or cluster, are more similar to each other than to those in other clusters. It is an unsupervised learning technique that offers a view of the inherent structure of a given dataset by dividing it into a number of overlapping or disjoint groups. The different algorithms used in this paper to perform cluster analysis on the given dataset are listed below.

4.1 Partition Based Clustering

It is based on the concept of iteratively relocating the data points of the given dataset between clusters.

4.1.1 K-Means

The aim of this algorithm is to minimize an objective function. Here, the objective function considered is the squared error function:

$$ J = \mathop \sum \limits_{j = 1}^{k} \mathop \sum \limits_{i = 1}^{n} \left\| {x_{i} - c_{j} } \right\|^{2} $$

where \( \left\| {x_{i} - c_{j} } \right\|^{2} \) is the squared distance between the data point xi and the cluster centroid cj.

Algorithm Steps:

  • Consider a hepatitis dataset/data frame, load and pre-process the data

  • Place K points into the space represented by the objects that are to be clustered. These serve as the initial group centroids.

  • Here, the number of clusters is considered as 3.

  • Assign each object to the group whose centroid is closest.

  • When all objects have been assigned, recalculate the positions of the K centroids.

  • Repeat the assignment and recalculation steps until the centroids no longer move.

  • This produces a separation of the objects into groups from which the metric to be minimized can be calculated (Fig. 1); a short R sketch follows the figure.

    Fig. 1.

    K-means technique performed on Hepatitis data set in R studio.
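As an illustrative sketch only, the K-Means steps above might be carried out in R roughly as follows. The object name hep (a pre-processed, numeric hepatitis data frame with no missing values) is an assumption for illustration, not the authors' actual code.

    # Minimal K-Means sketch (assumes "hep" is a numeric, pre-processed
    # hepatitis data frame with no missing values)
    set.seed(123)                        # reproducible random initial centroids
    hep_scaled <- scale(hep)             # standardise the variables
    km <- kmeans(hep_scaled, centers = 3, nstart = 25)  # 3 clusters, 25 random starts
    km$centers                           # final centroids
    km$cluster                           # cluster assignment of each observation
    library(factoextra)                  # for cluster visualisation
    fviz_cluster(km, data = hep_scaled)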

4.1.2 K-Medoids (Partitioning Around Medoids)

The K-Medoids algorithm is a partitioning clustering method that is a slight modification of the K-Means algorithm. Both algorithms aim to minimize the squared error, but K-Medoids is more robust than K-Means.

Here, actual data points are chosen to serve as the medoids.

Algorithm steps:

  • Load the dataset and pre-process the data

  • Select k random points from the given n data points of the Hepatitis dataset to serve as the initial medoids.

  • Find the optimal number of clusters.

  • Assign each data point to the closest medoid using a distance matrix, and visualize the result using the fviz_cluster function (Fig. 2); a short R sketch follows the figure.

    Fig. 2.

    K-medoids technique performed on hepatitis dataset using R studio
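A minimal K-Medoids (PAM) sketch is given below, assuming the standardised matrix hep_scaled from the K-Means sketch; the choice k = 3 is illustrative.

    # K-Medoids (PAM) sketch; "hep_scaled" is the assumed standardised data
    library(cluster)                     # provides pam()
    library(factoextra)                  # provides fviz_nbclust() and fviz_cluster()
    fviz_nbclust(hep_scaled, pam, method = "silhouette")  # suggest an optimal k
    pm <- pam(hep_scaled, k = 3)         # medoids are actual data points
    pm$medoids                           # the chosen medoids
    fviz_cluster(pm)                     # visualise the clusters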

4.2 Hierarchy Based Clustering

This type of clustering builds a hierarchy of objects. Unlike the K-Means approach, we do not need to pre-specify the number of clusters. Hierarchical clustering is divided into two major types.

4.2.1 Agglomerative Clustering

This clustering technique is also known as AGNES (Agglomerative Nesting). It works in a bottom-up manner (Fig. 3).

Fig. 3.

Agglomerative clustering based on Hepatitis dataset in R studio

Algorithm Steps:

  • Load and pre-process the dataset, then load the factoextra, NbClust and fpc packages.

  • Assign each data object to its own cluster, so that each object initially belongs to exactly one cluster.

  • Find the nearest pair of clusters and merge them into a new node, leaving N − 1 clusters.

  • Calculate the distances between the new cluster and the remaining clusters.

  • Repeat the previous two steps until all objects are merged into a single cluster.

  • Since we start with N data objects, N clusters are formed initially.

  • Finally, the data is visualized as a tree known as a dendrogram; the distance measures used are given below, and a short R sketch follows.

    $$ d_{\text{Manhattan}} = \sum\nolimits_{i = 1}^{n} {\left| {x_{i} - y_{i} } \right|} \qquad d_{\text{Euclidean}} = \sqrt {\sum\nolimits_{i = 1}^{n} {\left( {x_{i} - y_{i} } \right)^{2} } } $$
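The following sketch shows one way the agglomerative steps could be run in R, again assuming the standardised matrix hep_scaled; the Euclidean distance and Ward linkage are illustrative choices.

    # Agglomerative (AGNES) sketch with an assumed "hep_scaled" matrix
    d_euc <- dist(hep_scaled, method = "euclidean")  # or method = "manhattan"
    hc <- hclust(d_euc, method = "ward.D2")          # bottom-up merging
    plot(hc, labels = FALSE, main = "AGNES dendrogram")
    grp <- cutree(hc, k = 3)             # cut the dendrogram into 3 clusters
    table(grp)
    library(cluster)                     # equivalent AGNES interface
    ag <- agnes(hep_scaled, method = "ward")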

4.2.2 Divisive Clustering

Divisive clustering is just the opposite of the agglomerative algorithm: it works in a top-down manner. The divisive clustering approach is also known as DIANA (Divisive Analysis) [3] (Fig. 4); a short R sketch follows the figure.

Fig. 4.

Divisive clustering based on Hepatitis dataset in R studio
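A hedged sketch of divisive clustering with diana() from the cluster package, under the same hep_scaled assumption; cutting the tree at k = 3 is illustrative.

    # Divisive (DIANA) sketch
    library(cluster)                     # provides diana()
    library(factoextra)                  # provides fviz_dend()
    dv <- diana(hep_scaled)              # top-down splitting of one large cluster
    dv$dc                                # divisive coefficient
    fviz_dend(dv, k = 3, main = "DIANA dendrogram")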

4.3 Fuzzy Clustering

Fuzzy clustering is a method in which one piece of data may belong to more than one cluster. It is based on the minimization of the following objective function:

$$ J_{m} = \sum\nolimits_{i = 1}^{N} {\sum\nolimits_{j = 1}^{C} {u_{ij}^{m} \left\| {x_{i} - c_{j} } \right\|^{2} } } \quad 1 \le m < \infty $$

where m is any real number greater than 1, uij is the degree of membership of xi in cluster j, xi is the ith of the measured data points, and cj is the centre of cluster j.

Algorithm Steps:

  • Load the dataset.

  • Load the cluster package, which provides the fanny function.

  • At step k: calculate the centre vectors \( C^{(k)} = [c_{j} ] \) using \( U^{(k)} \):

    $$ c_{j} = \frac{{\mathop \sum \nolimits_{i = 1}^{N} u_{ij}^{m} \cdot x_{i} }}{{\mathop \sum \nolimits_{i = 1}^{N} u_{ij}^{m} }} $$
  • Update \( U^{(k)} \) to \( U^{(k + 1)} \):

    $$ u_{ij} = \frac{1}{{\mathop \sum \nolimits_{k = 1}^{C} \left( {\frac{{\left\| {x_{i} - c_{j} } \right\|}}{{\left\| {x_{i} - c_{k} } \right\|}}} \right)^{{\frac{2}{m - 1}}} }} $$
  • If \( \left\| {U^{(k + 1)} - U^{(k)} } \right\| < \varepsilon \), then stop; otherwise return to Step 3.

  • Visualize the data in clustered format (Fig. 5); a short R sketch follows the figure.

    Fig. 5.

    Fuzzy clustering based on Hepatitis dataset in R studio
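A minimal sketch of the fuzzy clustering steps using fanny(); k = 3 and the fuzzifier memb.exp = 2 (corresponding to m) are illustrative values, and hep_scaled is the assumed standardised data.

    # Fuzzy clustering sketch with fanny()
    library(cluster)                     # provides fanny()
    library(factoextra)
    fz <- fanny(hep_scaled, k = 3, memb.exp = 2)
    head(fz$membership)                  # degrees of membership u_ij
    fz$clustering                        # nearest crisp assignment
    fviz_cluster(fz)                     # visualise the fuzzy partition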

4.4 Model Based Clustering

The data considered here is assumed to be generated from a mixture of two or more clusters (components).

Algorithm Steps:

  • Load and pre-process the Hepatitis dataset.

  • Install and load the MASS, ggpubr, factoextra and mclust packages in R Studio.

  • Apply the Mclust function to cluster the data, then visualize the result (Fig. 6); a short R sketch follows the figure.

    Fig. 6.

    Model based clustering based on Hepatitis dataset in R studio
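A sketch of model based clustering with Mclust(); the number of mixture components is selected automatically by BIC, so no k has to be supplied. As before, hep_scaled is an assumed pre-processed numeric matrix of the hepatitis data.

    # Model based clustering sketch with Mclust()
    library(mclust)                      # provides Mclust()
    library(factoextra)                  # provides fviz_mclust()
    mb <- Mclust(hep_scaled)             # fits Gaussian mixture models, selects by BIC
    summary(mb)                          # chosen model and number of components
    fviz_mclust(mb, "classification")    # visualise the classification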

5 Performance Analysis

5.1 Cluster Validation

The term cluster validation is used here to evaluate and compare the goodness and accuracy of the results of different clustering algorithms. Internal cluster validation uses only the internal information of the clustering process to assess the effectiveness and goodness of a cluster structure, without reference to external information. Internal measures are based on compactness, separation and connectedness. Internal validation is performed using the Silhouette width, Connectivity and the Dunn Index.

$$ \text{Index} = \frac{x \times \text{Separation}}{y \times \text{Compactness}} $$

Here x and y are weights; a short R sketch of internal validation follows.
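One way to compute these internal measures in R is the clValid package (our assumption; the paper names only the measures). The sketch below reports Connectivity, Dunn and Silhouette values for several algorithms over a range of cluster counts, again assuming the hep_scaled matrix used earlier.

    # Internal validation sketch with clValid
    library(clValid)
    iv <- clValid(hep_scaled, nClust = 2:5,
                  clMethods = c("kmeans", "pam", "hierarchical", "diana", "fanny"),
                  validation = "internal")
    summary(iv)                          # Connectivity, Dunn and Silhouette per method and k
    optimalScores(iv)                    # best score for each measure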

6 Results of Different Validation Techniques Using Dataset

See Figs. 7, 8 and 9.

Fig. 7.

K-means and K-medoids validations

Fig. 8.

Agglomerative and divisive validations

Fig. 9.

Fuzzy validation

7 Choosing the Best Algorithm

The internal validation results of the different clustering techniques are listed here (Table 1).

Table 1. Comparison of clustering algorithms

8 Conclusion

This paper describes several clustering algorithms, all of which have been implemented and visualized in R Studio. The clustering is performed on the hepatitis dataset. All the algorithms have been validated using internal measures, and the results are displayed in tabular form in terms of the Connectivity, Dunn and Silhouette indices. These measures were computed for every algorithm and then compared to determine the best one. From this comparison we conclude that K-Means is suitable for large datasets and large numbers of clusters, that fuzzy clustering is not well suited to a large number of clusters, and that K-Means has the maximum Dunn and Silhouette index values compared to all the other algorithms.