Abstract
Data mining is the extraction of interesting (non-trivial, relevant, previously unknown and potentially useful) patterns or information from very large collections of data. In other words, it is the experimental exploration of associations, links and overall patterns that exist in large datasets but are hidden or unknown. To explore the performance of different clustering techniques we used the R language, a tool that allows the user to analyse data from many different perspectives and angles in order to obtain proper experimental results and derive meaningful relationships. In this paper we study, analyse and compare various algorithms and techniques used for cluster analysis in R. Our aim is to compare five different clustering algorithms and to validate them in terms of internal and external validation measures such as the silhouette plot, the Dunn index and connectivity. Finally, on the basis of the results obtained, we analyse, compare and validate the efficiency of the different algorithms with respect to one another.
1 Introduction
R uses collections of packages to perform diverse functions. The CRAN project provides numerous packages for different users according to their needs, and R packages contain many functions for data-mining tasks. This paper compares different clustering algorithms on the Hepatitis dataset using R. These clustering algorithms give different results depending on the conditions: some clustering techniques are better suited to large datasets, while others give good results when discovering clusters of arbitrary shape. This paper is intended to study and relate different data-mining clustering algorithms. The algorithms under investigation are as follows: the K-Means algorithm, K-Medoids, hierarchical clustering, fuzzy clustering and hybrid clustering. We compare all these clustering algorithms with respect to many factors. After examining them, we describe which clustering algorithms should be used under which conditions to get the best result.
2 Related Work
Several researchers have worked on different clustering algorithms: some have implemented existing algorithms, while others have proposed new ones. They applied various indices to measure the performance of clustering techniques and to validate the clustering algorithms.
3 Clustering Analysis Using R Language
Data mining is not performed exclusively with expensive tools and software; here we have used the R language. R is both a language and a platform for statistical computing and graphics. The clustering techniques used here are of four basic types: partitioning methods, hierarchical methods, model-based methods and hybrid clustering. The Hepatitis dataset is used to validate the results.
4 Clustering Concepts
Cluster analysis is the task of grouping a set of objects in such a way that objects in the same group (or cluster) are more similar to each other than to those in other groups. It is an unsupervised learning technique, which offers a view of the inherent structure of a given dataset by dividing it into a number of overlapping or disjoint groups. The algorithms used in this paper to perform cluster analysis on the given dataset are listed below.
4.1 Partition Based Clustering
It is based on the concept of iteratively relocating data points between clusters.
4.1.1 K-Means
The aim of this algorithm is to minimize an objective function; here, the objective function considered is the squared-error function:

$$ J = \sum\nolimits_{j = 1}^{k} {\sum\nolimits_{i = 1}^{n} {\left\| {x_{i}^{(j)} - c_{j} } \right\|^{2} } } $$

Where \( \left\| {x_{i} - c_{j} } \right\|^{2} \) is the distance between a data point xi and the cluster centroid cj.
Algorithm Steps:

- Consider the Hepatitis dataset/data frame; load and pre-process the data.
- Place K points into the space represented by the objects being clustered. These are the initial group centroids; here the number of clusters is taken as 3.
- Assign each object to the group with the closest centroid.
- Once all objects have been assigned, recalculate the positions of the K centroids.
- Repeat the previous two steps until the centroids no longer move.
- This produces a separation of the objects into groups, from which the metric to be minimized can be calculated (Fig. 1).
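The steps above can be sketched in R with the base `kmeans` function. Since the Hepatitis data itself is not reproduced here, this sketch uses R's built-in iris measurements as a stand-in; the seed and `centers = 3` match the choice of three clusters above.

```r
# Minimal K-means sketch; iris stands in for the Hepatitis data frame.
data <- scale(iris[, 1:4])          # load and pre-process (standardise)

set.seed(42)                        # reproducible initial centroids
fit <- kmeans(data, centers = 3)    # assign, recompute centroids, repeat to convergence

fit$size                            # objects per cluster
fit$tot.withinss                    # minimised within-cluster sum of squares
```

The within-cluster sum of squares reported by `fit$tot.withinss` is the squared-error objective that the iteration minimizes.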
4.1.2 K-Medoids (Partitioning Around Medoids)
The K-Medoids algorithm is a partitioning clustering method obtained by slightly modifying the K-Means algorithm. Both algorithms aim to minimize the squared error, but K-Medoids is more robust than K-Means.
Here actual data points are chosen to be the medoids.
Algorithm Steps:

- Load the dataset and pre-process the data.
- Select k random points from the n data points of the Hepatitis dataset to serve as medoids.
- Find the optimal number of clusters.
- Assign each data point to the closest medoid using a distance matrix and visualize the result using the fviz_cluster function (Fig. 2).
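These steps can be sketched with `pam()` from the cluster package (which ships with R). iris again stands in for the Hepatitis data; the factoextra visualisation is left as a comment since that package may not be installed.

```r
library(cluster)                    # pam() implements Partitioning Around Medoids

data <- scale(iris[, 1:4])          # load and pre-process
fit <- pam(data, k = 3)             # k medoids refined to minimise total dissimilarity

fit$id.med                          # row indices of the data points chosen as medoids
# factoextra::fviz_cluster(fit)     # visualisation, assuming factoextra is installed
```

Unlike K-Means centroids, the medoids returned here are actual rows of the dataset, which is what makes the method robust to outliers.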
4.2 Hierarchy Based Clustering
This clustering deals with a hierarchy of objects. Unlike the K-Means approach, we need not pre-specify the number of clusters. This clustering technique is divided into two major types.
4.2.1 Agglomerative Clustering
This clustering technique is also known as AGNES (Agglomerative Nesting). It works in a bottom-up manner (Fig. 3).
Algorithm Steps:

- Load and pre-process the dataset, then load the factoextra, NbClust and fpc packages.
- Assign each data object to its own cluster, so that N data objects give N initial clusters.
- Find the nearest pair of clusters and merge them into a new node, leaving N-1 clusters.
- Calculate the distances between the new cluster and the old clusters.
- Repeat the previous two steps until all objects have been merged into a single cluster.
- Finally, visualize the data as a tree known as a dendrogram.
$$ \begin{array}{*{20}l} {{\text{d}} = \sum\nolimits_{i = 1}^{n} {\left| {x_{i} - y_{i} } \right|} } \hfill & {{\text{d}} = \sqrt {\sum\nolimits_{i = 1}^{n} {\left( {x_{i} - y_{i} } \right)^{2} } } } \hfill \\ {{\text{Manhattan}}\,{\text{Formula}}} \hfill & {{\text{Euclidean}}\,{\text{Formula}}} \hfill \\ \end{array} $$
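The agglomerative steps map directly onto base R's `dist()` and `hclust()`; the two distance formulas above correspond to `method = "manhattan"` and `method = "euclidean"`. iris stands in for the Hepatitis data in this sketch.

```r
data <- scale(iris[, 1:4])

d  <- dist(data, method = "euclidean")   # or "manhattan", per the formulas above
hc <- hclust(d, method = "average")      # repeatedly merge the nearest pair of clusters

groups <- cutree(hc, k = 3)              # cut the dendrogram into three clusters
# plot(hc)                               # draws the dendrogram
```

The `method` argument of `hclust` controls how the distance between a new cluster and the old ones is recalculated (average, single, complete linkage, etc.).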
4.2.2 Divisive Clustering
Divisive clustering is just the opposite of the agglomerative algorithm: it starts with all objects in a single cluster and splits it top-down. This approach is also known as DIANA [3] (Fig. 4).
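A corresponding sketch uses `diana()` from the cluster package; the top-down result can be converted and cut like any dendrogram. iris stands in for the Hepatitis data.

```r
library(cluster)

dv <- diana(scale(iris[, 1:4]))          # divisive: start from one cluster, split top-down
groups <- cutree(as.hclust(dv), k = 3)   # convert to an hclust tree and cut into 3 clusters
```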
4.3 Fuzzy Clustering
Fuzzy clustering is a method in which one piece of data can belong to more than one cluster. It is based on the minimization of the objective function

$$ J_{m} = \sum\nolimits_{i = 1}^{N} {\sum\nolimits_{j = 1}^{C} {u_{ij}^{m} \left\| {x_{i} - c_{j} } \right\|^{2} } } $$

Where m is a real number greater than 1, \( u_{ij} \) is the degree of membership of \( x_{i} \) in cluster j, \( x_{i} \) is the ith data point, and \( c_{j} \) is the centre of the jth cluster.
Algorithm Steps:

- Load the dataset.
- Load the fanny function.
- At step k: calculate the centre vectors \( {\text{c}}\left( {\text{k}} \right) \) from \( {\text{U}}\left( {\text{k}} \right) \):
$$ c_{j} = \frac{{\mathop \sum \nolimits_{i = 1}^{N} u_{ij}^{m} \cdot x_{i} }}{{\mathop \sum \nolimits_{i = 1}^{N} u_{ij}^{m} }} $$
- Update \( {\text{U}}\left( {\text{k}} \right) \) to \( {\text{U}}\left( {{\text{k}} + 1} \right) \):
$$ u_{ij} = \frac{1}{{\mathop \sum \nolimits_{k = 1}^{C} \left( {\frac{{\left\| {x_{i} - c_{j} } \right\|}}{{\left\| {x_{i} - c_{k} } \right\|}}} \right)^{{\frac{2}{m - 1}}} }} $$
- If \( \left\| {{\text{U}}\left( {{\text{k}} + 1} \right) - {\text{U}}\left( {\text{k}} \right)} \right\| < \varepsilon \), stop; otherwise return to the centre-calculation step.
- Visualize the data in the clustered format (Fig. 5).
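The iteration above is implemented by `fanny()` from the cluster package, where the `memb.exp` argument plays the role of the fuzzifier m. iris stands in for the Hepatitis data in this sketch.

```r
library(cluster)

fit <- fanny(scale(iris[, 1:4]), k = 3, memb.exp = 2)  # fuzzifier m = 2

head(fit$membership)   # degrees of membership u_ij; each row sums to 1
head(fit$clustering)   # hard assignment: the cluster with the highest membership
```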
4.4 Model Based Clustering
The data considered here is modelled as a mixture of two or more component distributions, each corresponding to a cluster.
Algorithm Steps:

- Load and pre-process the Hepatitis dataset.
- Install the MASS, ggpubr, factoextra and mclust packages and load them in RStudio.
- Apply the Mclust function to cluster the data, then visualize the result (Fig. 6).
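These steps can be sketched with the `Mclust` function from the CRAN mclust package (assumed installed); iris stands in for the Hepatitis data. `Mclust` fits a family of Gaussian mixture models and selects among them by BIC.

```r
library(mclust)                     # CRAN package, assumed installed

data <- scale(iris[, 1:4])
fit  <- Mclust(data)                # fits Gaussian mixture models, selects by BIC

fit$G                               # chosen number of mixture components
summary(fit)                        # model name and clustering table
# factoextra::fviz_mclust(fit)      # visualisation, if factoextra is available
```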
5 Performance Analysis
5.1 Cluster Validation
Cluster validation is used here to evaluate and compare the goodness and accuracy of the results of different clustering algorithms. Internal cluster validation uses only information internal to the clustering process to measure the goodness of a cluster structure, without reference to external information. Internal measures are based on compactness, separation and connectedness; here internal validation is done using the silhouette width, connectivity and the Dunn index.
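Internal validation can be sketched with the silhouette width from the cluster package; the clValid call that also reports connectivity and the Dunn index is shown as a comment, since clValid is a separate CRAN package that may not be installed. iris stands in for the Hepatitis data.

```r
library(cluster)

data <- scale(iris[, 1:4])
set.seed(42)
km  <- kmeans(data, centers = 3)

sil <- silhouette(km$cluster, dist(data))  # per-point silhouette widths
mean(sil[, "sil_width"])                   # average width; closer to 1 is better

# clValid::clValid(data, nClust = 3, clMethods = "kmeans",
#                  validation = "internal")   # connectivity, Dunn, silhouette
```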
7 Choosing the Best Algorithm
The internal validation results of the different clustering techniques are listed here (Table 1).
8 Conclusion
This paper defines several clustering algorithms, all of which have been implemented and visualized in RStudio on the Hepatitis dataset. All of the algorithms have been validated using internal measures, and the results are displayed in tabular form in terms of connectivity, the Dunn index and the silhouette index. These measures were computed for every algorithm and then compared to find the best one. From this we conclude that K-Means is suited to large datasets with a large number of clusters, that fuzzy clustering is not well suited to a large number of clusters, and that K-Means attains the maximum Dunn and silhouette index values compared to all the other algorithms.
References
Smith, T.F., Waterman, M.S.: Identification of Common Molecular Subsequences. J. Mol. Biol. 147, 195–197 (1981)
May, P., Ehrlich, H.-C., Steinke, T.: ZIB structure prediction pipeline: composing a complex biological workflow through web services. In: Nagel, W.E., Walter, W.V., Lehner, W. (eds.) Euro-Par 2006. LNCS, vol. 4128, pp. 1148–1158. Springer, Heidelberg (2006). https://doi.org/10.1007/11823285_121
Foster, I., Kesselman, C.: The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco (1999)
Czajkowski, K., Fitzgerald, S., Foster, I., Kesselman, C.: Grid information services for distributed resource sharing. In: 10th IEEE International Symposium on High Performance Distributed Computing, pp. 181–184. IEEE Press, New York (2001)
Foster, I., Kesselman, C., Nick, J., Tuecke, S.: The physiology of the grid: an open grid services architecture for distributed systems integration. Technical report, Global Grid Forum (2002)
National Center for Biotechnology Information. http://www.ncbi.nlm.nih.gov
Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 881–892 (2002)
Liu, Y., Li, Z., Xiong, H., Gao, X., Wu, J.: Understanding of internal clustering validation measures. In: 2010 IEEE 10th International Conference on Data Mining (ICDM), pp. 911–916. IEEE, December 2010
Liu, Y., Li, Z., Xiong, H., Gao, X., Wu, J., Wu, S.: Understanding and enhancement of internal clustering validation measures. IEEE Trans. Cybern. 43(3), 982–994 (2013)
© 2018 Springer Nature Singapore Pte Ltd.
Jayaram Reddy, A., Tripathy, B., Nimje, S., Sree Ganga, G., Varnasree, K. (2018). Performance Analysis of Clustering Algorithm in Data Mining in R Language. In: Zelinka, I., Senkerik, R., Panda, G., Lekshmi Kanthan, P. (eds) Soft Computing Systems. ICSCS 2018. Communications in Computer and Information Science, vol 837. Springer, Singapore. https://doi.org/10.1007/978-981-13-1936-5_39