
1 Introduction

Clustering means partitioning a given set of elements into homogeneous groups, based on given features, such that elements in the same group are more closely related to each other than to elements in other groups [1, 2]. It is one of the most significant unsupervised learning problems, as it is used to classify a dataset without any prior knowledge. It deals with the problem of finding patterns in a collection of unclassified data. There is no single algorithm for clustering, because the notion of what constitutes a cluster can vary significantly, and hence the output varies as well. Much research has been done in this domain [3,4,5,6,7].

Hierarchical clustering [8] is one of the most popular clustering methods; it aims to build a hierarchy of clusters. At the bottom-most level, every object lies in its own cluster, while at the topmost level, all objects are merged into one cluster. Hierarchical clustering can be done in two ways: agglomerative and divisive. In agglomerative clustering, all objects initially lie in different clusters, and clusters are joined as we move up the hierarchy; it is thus a bottom-up approach. In divisive clustering, all objects initially lie in a single cluster, and clusters are divided as we move down the hierarchy; it is thus a top-down approach. Single-linkage clustering is a technique for performing hierarchical clustering in which the similarity between two groups of objects is determined by the two objects (one in each group) that are most similar to each other. One of the major challenges in hierarchical agglomerative single-linkage clustering is the computational complexity involved. In 1973, Sibson [9] proposed SLINK, a single-linkage clustering algorithm with a time complexity of \(O(n^2)\) and a space complexity of \(O(n)\). In this paper, an alternative approach for single-linkage clustering is proposed, based on minimum spanning trees [10], having the same space and time complexity as the SLINK algorithm.

The rest of the paper is organized as follows: Sect. 2 discusses the proposed approach to cluster the e-books; Sect. 3 presents the experimental work; and Sect. 4 concludes the paper.

Fig. 1 Graph on books

2 The Proposed Approach

2.1 Problem Statement

Consider a corpus of N documents. All the documents are preprocessed by extracting keywords using the Natural Language Toolkit.Footnote 1 The entire dataset is then a set X of N records (documents). Mathematically, \(X = \{x_1, x_2, \ldots, x_N\}\), where each \(x_i = \{x_{i1}, x_{i2}, \ldots, x_{ik_i}\}\) is a set of \(k_i\) keywords (features). The aim of clustering is to divide the set X into clusters \(C_1, C_2, C_3, \ldots\), such that the minimum similarity between any two records in a cluster is not less than a threshold set by the user. The dataset is represented in the form of a graph whose nodes represent the records and whose edges represent the similarity between records; the edge weight is thus directly proportional to the similarity between the two records that the edge connects. We denote by G(X) = (V, E) the undirected complete graph on X. The edge weights are derived by counting the number of features common to two records; the weight function is denoted \(w(x_i, x_j)\).

Mathematically:

\(V = X\)

\(E = \{(x_i, x_j) \mid i \ne j,\ x_i \in X,\ x_j \in X\}\)

\(w: E \rightarrow \mathbb{N}\) (the set of natural numbers)
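For illustration, the weight of an edge can be computed as the size of the intersection of the two records' keyword sets. A minimal Python sketch (the function and variable names are ours, not from the original implementation):

```python
def edge_weight(x_i, x_j):
    """w(x_i, x_j): the number of keywords common to two records."""
    return len(set(x_i) & set(x_j))

# Example: two book records described by their keyword sets.
book_a = {"algorithms", "graphs", "clustering"}
book_b = {"graphs", "clustering", "databases"}
print(edge_weight(book_a, book_b))  # 2 (shared: "graphs", "clustering")
```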

Fig. 2 Unorganized input data

Fig. 3 Maximal spanning tree

Fig. 4 Clusters after applying Kruskal's algorithm

Figure 1 demonstrates the graphical representation of a sample of e-book records. The nodes represent the books, and the features are the keywords. The weight of the edge connecting two nodes (books) equals the number of keywords common to the two books.

2.2 Algorithm Overview

The algorithm comprises two major stages. In the first stage, we construct a maximum spanning tree of the given dataset (Fig. 2) using Prim's algorithm. In the second stage, the records are merged into clusters by applying Kruskal's algorithm to the maximum spanning tree obtained in the first stage.

2.3 Constructing the Maximum Spanning Tree (MST)

A maximum spanning tree of a connected, undirected graph is a tree that connects all the vertices of the graph such that the sum of the edge weights of the tree is maximum.

The maximum spanning tree of the graph (Fig. 3) can be formed using Prim's algorithm for minimum spanning trees [11], with the edge-selection criterion inverted to prefer the heaviest available edge. The algorithm is described as follows:

  1. Assign to each vertex v of the graph these three quantities:

     • C[v]: the highest cost of a connection to vertex v

     • Parent[v]: the node with which v has the highest-cost connection

     • Visited[v]: indicates whether v has been included in the MST

  2. Create two sets:

     • S: the set of vertices v, kept sorted in descending order with C[v] as the key (i.e., a max-priority queue)

     • E: the set of edges of the MST

  3. Initialization:

     • C[v] \(\leftarrow -\infty, \forall v \in V\)

     • Visited[v] \(\leftarrow\) false, \(\forall v \in V\)

     • Parent[v] \(\leftarrow\) null, \(\forall v \in V\)

     • S = all vertices

     • E = \(\phi\)

  4. Repeat these steps until \(S = \phi\):

     (a) Find and remove the first vertex v from S.

     (b) Set Visited[v] \(\leftarrow\) true.

     (c) If Parent[v] \(\ne\) null, add the edge (Parent[v], v) to the set E.

     (d) For every vertex u such that u \(\ne\) v and Visited[u] = false:

         i. If w(u, v) > C[u]:

            A. Set C[u] \(\leftarrow\) w(u, v)

            B. Set Parent[u] \(\leftarrow\) v

When the set S becomes empty, the set E contains the edges of the maximum spanning tree of the graph.
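A minimal Python sketch of steps 1-4, assuming the similarities are stored in a dense \(n \times n\) symmetric matrix; a linear scan over unvisited vertices stands in for the sorted set S, preserving the \(O(n^2)\) time and \(O(n)\) extra-space bounds (function and variable names are our assumptions):

```python
import math

def max_spanning_tree(w, n):
    """Prim's algorithm adapted to build a maximum spanning tree.

    w -- n x n symmetric matrix of similarities, w[i][j] = w(x_i, x_j)
    n -- number of records (vertices)
    Returns the tree edges as (weight, u, v) tuples.
    """
    C = [-math.inf] * n      # C[v]: highest-cost connection seen so far
    parent = [None] * n      # Parent[v]: endpoint of that connection
    visited = [False] * n    # Visited[v]: is v already in the tree?
    E = []                   # edges of the maximum spanning tree

    for _ in range(n):
        # Pick the unvisited vertex with the largest C[v]
        # (plays the role of the descending-sorted set S).
        v = max((u for u in range(n) if not visited[u]),
                key=lambda u: C[u])
        visited[v] = True
        if parent[v] is not None:
            E.append((C[v], parent[v], v))
        # Update the best connection of every remaining vertex.
        for u in range(n):
            if not visited[u] and w[v][u] > C[u]:
                C[u] = w[v][u]
                parent[u] = v
    return E
```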

2.4 Merging of Records into Clusters

In this stage, the edges of the maximum spanning tree are used to form clusters by a slight modification of Kruskal's algorithm [12]: the algorithm is stopped once the edge weight falls below a user-specified lower threshold on the similarity value, \(w_{critical}\). The algorithm makes use of the disjoint-set data structure, which maintains a collection of disjoint sets, each containing a finite number of elements. The disjoint sets represent the clusters, and the elements inside the sets are the records. Every set has a representative element which uniquely identifies that set. This representation makes it easy to merge two clusters efficiently and to locate the records in any cluster, thanks to two operations supported by the data structure:

Find: takes an element as input and determines which set that element belongs to.

Find(v):

  1. If Parent[v] \(\ne\) v then Parent[v] \(\leftarrow\) Find(Parent[v])

  2. Return Parent[v]

Union: takes two disjoint sets as input and joins them into a single set.

Two optimization techniques, union by rank and path compression, have been employed; together they reduce the amortized time complexity of the Union and Find operations to a near-constant factor (the inverse Ackermann function).

Union(u, v):

  1. \(set_u \leftarrow\) Find(u)

  2. \(set_v \leftarrow\) Find(v)

  3. If \(Size[set_u] < Size[set_v]\) then \(Parent[set_u] \leftarrow set_v\)

  4. If \(Size[set_u] > Size[set_v]\) then \(Parent[set_v] \leftarrow set_u\)

  5. If \(Size[set_u] = Size[set_v]\) then

     (a) \(Parent[set_u] \leftarrow set_v\)

     (b) \(Size[set_v] \leftarrow Size[set_v] + 1\)
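The two operations above translate directly into code. A sketch, assuming the records are numbered 0 to n-1 (the class name and Python representation are ours):

```python
class DisjointSet:
    """Disjoint-set (union-find) with path compression in Find
    and the rank-style Union listed above. Each set is a cluster,
    identified by its representative element."""

    def __init__(self, n):
        self.parent = list(range(n))  # Parent[v] = v initially
        self.size = [1] * n           # Size[v] = 1 initially

    def find(self, v):
        # Path compression: point v directly at its representative.
        if self.parent[v] != v:
            self.parent[v] = self.find(self.parent[v])
        return self.parent[v]

    def union(self, u, v):
        set_u, set_v = self.find(u), self.find(v)
        if set_u == set_v:
            return  # already in the same set (cannot happen on tree edges)
        if self.size[set_u] < self.size[set_v]:
            self.parent[set_u] = set_v
        elif self.size[set_u] > self.size[set_v]:
            self.parent[set_v] = set_u
        else:
            self.parent[set_u] = set_v
            self.size[set_v] += 1
```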

The modified Kruskal's algorithm is described as follows:

  1. Associate each vertex v with two quantities:

     • Parent[v]: the parent of v in its set

     • Size[v]: the size of the disjoint set which has v as its representative element

  2. Make N (the number of records) sets \(S_1, S_2, S_3, \ldots, S_N\), each set representing a cluster.

  3. Initialization:

     • \(Parent[v] \leftarrow v, \forall v \in V\)

     • \(Size[v] \leftarrow 1, \forall v \in V\)

     • \(S_i = \phi, \forall i \in \{1, 2, \ldots, N\}\)

  4. Sort the edges in the set E (the set containing the edges of the maximum spanning tree) in decreasing order of edge weight.

  5. Iterate over the edges e = (u, v) in E for as long as the edge weight satisfies \(w(u, v) > w_{critical}\):

     (a) Union(u, v)

  6. For every vertex v:

     (a) \(i \leftarrow Find(v)\)

     (b) Add vertex v to the set \(S_i\)

At the end, each non-empty set \(S_i\) represents a cluster containing records of a similar type. Figure 4 shows the graph structure after the clustering process; boxes of the same color belong to the same cluster.
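Putting the stages together, a sketch of this merging step that reuses the DisjointSet and max_spanning_tree sketches above (the function name and edge-tuple representation are our assumptions):

```python
from collections import defaultdict

def cluster(tree_edges, n, w_critical):
    """Modified Kruskal: merge records along maximum-spanning-tree edges
    whose similarity exceeds w_critical; return clusters as lists of ids."""
    dsu = DisjointSet(n)
    # Step 4: sort the tree edges in decreasing order of weight.
    for weight, u, v in sorted(tree_edges, reverse=True):
        # Step 5: stop once the similarity no longer exceeds the threshold.
        if weight <= w_critical:
            break
        dsu.union(u, v)
    # Step 6: group each vertex under its set's representative.
    clusters = defaultdict(list)
    for v in range(n):
        clusters[dsu.find(v)].append(v)
    return list(clusters.values())

# End-to-end usage on a toy 3-record similarity matrix:
w = [[0, 3, 1],
     [3, 0, 2],
     [1, 2, 0]]
edges = max_spanning_tree(w, 3)
print(cluster(edges, 3, w_critical=1))  # [[0, 1, 2]]: one cluster
```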

Fig. 5 Generated clusters

3 Experimental Results

The proposed algorithm was tested on a sample input of 50,000 records of unorganized books.Footnote 2 The records were taken as input in the following format:

Fig. 6 Time complexity

Fig. 7 Space complexity

Book Name, Author Name, \(\langle keyword_1, keyword_2, \ldots \rangle\), where the keywords are a set of words describing the book. These keywords are used as the features of each record (book). A total of 541 clusters were obtained, each cluster containing books of a similar subject category. Figure 5 shows a sample of the clusters obtained after running the algorithm on the dataset.
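For concreteness, a record in this format might be parsed along the following lines. This is a sketch under the assumption that fields are comma-separated and the keyword list is delimited by angle brackets; the exact input handling of the original implementation is not specified:

```python
def parse_record(line):
    """Parse 'Book Name, Author Name, <kw1, kw2, ...>' into its parts."""
    head, _, kw_part = line.partition("<")
    book, _, author = head.rstrip(", ").partition(",")
    keywords = {k.strip() for k in kw_part.rstrip(">").split(",")}
    return book.strip(), author.strip(), keywords

print(parse_record("Graph Theory, R. Diestel, <graphs, trees, connectivity>"))
# ('Graph Theory', 'R. Diestel', {'graphs', 'trees', 'connectivity'})
# (set ordering in the printed output may vary)
```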

The time taken and the space used to cluster inputs of different sizes are shown in Figs. 6 and 7. The analysis shows that the time complexity of the proposed algorithm is \(O(n^2)\) and the space complexity is \(O(n)\). Table 1 shows dataset sizes and the corresponding hypothetical computational cost on a typical desktop PC with a CPU speed of 2.4 GHz, which can perform approximately \(2.4 \times 10^{9}\) operations per second.
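As a worked example of this estimate (our arithmetic, under the stated model of one operation per clock cycle): clustering \(n\) records requires on the order of \(n^2\) operations, so the running time is roughly \(t(n) \approx n^2 / (2.4 \times 10^{9})\) seconds; for the full dataset of \(n = 50{,}000\) records, \(t \approx (2.5 \times 10^{9}) / (2.4 \times 10^{9}) \approx 1.04\) s.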

Table 1 Computational complexity

4 Conclusion

The proposed algorithm employs spanning-tree algorithms in two stages: in the first stage, a maximum spanning tree is constructed from the given dataset using Prim's algorithm; in the second stage, this maximum spanning tree is fed as input to a variation of Kruskal's algorithm, which then generates the clusters. Since the space complexity of the algorithm is \(O(n)\), it is efficient in terms of the space required to compute clusters even for large datasets. However, the time complexity of constructing a maximum spanning tree is \(O(n^2)\), which hinders the application of this approach to very large datasets.

Future work will explore approaches that capture the same structural information as the spanning tree with better time and space complexity.