
1 Introduction

Clustering means partitioning a given set of elements into homogeneous groups, based on given features, such that elements in the same group are more closely related to each other than to elements in other groups [1, 2]. It is one of the most significant unsupervised learning problems, as it is used to classify a dataset without any prior knowledge. It deals with the problem of finding patterns in a collection of unclassified data. There is no single algorithm for clustering, because the notion of what constitutes a cluster can vary significantly, and hence the output varies as well. Much research has been done in this domain [3,4,5,6,7].

Hierarchical clustering [8] is one of the most popular clustering methods; it aims to build a hierarchy of clusters. At the bottom-most level, every object lies in its own cluster, while at the topmost level, all objects are merged into one cluster. Hierarchical clustering can be done in two ways: agglomerative and divisive. In agglomerative clustering, all objects initially lie in different clusters, and clusters are joined as we move up the hierarchy; it is thus a bottom-up approach. In divisive clustering, all objects initially lie in a single cluster, and clusters are divided as we move down the hierarchy; it is thus a top-down approach. Single-linkage clustering is a technique for performing hierarchical clustering in which the similarity between two groups of objects is determined by the two objects (one in each group) that are most similar to each other. One of the major challenges in hierarchical agglomerative single-linkage clustering is the computational complexity involved. In 1973, Sibson [9] proposed SLINK, a single-linkage clustering algorithm with a time complexity of \(O(n^2)\) and a space complexity of \(O(n)\). In this paper, an alternative approach for single-linkage clustering is proposed, based on minimum spanning trees [10], having the same space and time complexity as the SLINK algorithm.

The rest of the paper is organized as follows: Sect. 2 discusses the proposed approach to cluster the e-books; Sect. 3 presents the experimental work; and Sect. 4 concludes the paper.

Fig. 1 Graph on books

2 The Proposed Approach

2.1 Problem Statement

Consider a corpus of N documents. All the documents are preprocessed by extracting keywords using the Natural Language Toolkit.Footnote 1 The entire dataset is then a set X of N records (documents). Mathematically, \(X = \{x_1, x_2, \ldots, x_N\}\), where each \(x_i = \{x_{i1}, x_{i2}, \ldots, x_{ik_i}\}\) is a set of \(k_i\) keywords (features). The aim of clustering is to divide the set X into clusters \(C_1, C_2, C_3, \ldots\), such that the minimum similarity between any two records in a cluster is not less than a threshold set by the user. The dataset is represented in the form of a graph whose nodes represent the records and whose edges represent the similarity between records; the edge weight is thus directly proportional to the similarity between the two records that the edge connects. We denote by G(X) = (V, E) the undirected complete graph on X. The edge weights are derived by counting the number of features common to two records; the weight function is denoted \(w(x_i, x_j)\).

Mathematically:

\(V = X\)

\(E = \{(x_i, x_j) \mid i \ne j,\ x_i \in X,\ x_j \in X\}\)

\(w: E \rightarrow \mathbb{N}\) (the set of natural numbers)
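For illustration, the weight of an edge can be computed as the size of the intersection of the two records' keyword sets. A minimal Python sketch (the function and variable names are ours, not from the original implementation):

```python
def edge_weight(x_i, x_j):
    """w(x_i, x_j): the number of keywords common to two records."""
    return len(set(x_i) & set(x_j))

# Example: two book records described by their keyword sets.
book_a = {"algorithms", "graphs", "clustering"}
book_b = {"graphs", "clustering", "databases"}
print(edge_weight(book_a, book_b))  # 2 (shared: "graphs", "clustering")
```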

Fig. 2 Unorganized input data

Fig. 3 Maximal spanning tree

Fig. 4 Clusters after applying Kruskal's algorithm

Figure 1 demonstrates the graphical representation of a sample of e-book records. The nodes represent the books, and the features are the keywords. The weight of the edge connecting two nodes (books) equals the number of keywords common to the two books.

2.2 Algorithm Overview

The algorithm comprises two major stages. In the first stage, we construct a maximum spanning tree of the given dataset (Fig. 2) using Prim's algorithm. In the second stage, the records are merged into clusters by applying Kruskal's algorithm to the maximum spanning tree obtained in the first stage.

2.3 Constructing the Maximum Spanning Tree (MST)

A maximum spanning tree of a connected, undirected graph is a tree that connects all the vertices of the graph such that the sum of the edge weights of the tree is maximum.

The maximum spanning tree of the graph (Fig. 3) can be formed using Prim's algorithm for minimum spanning trees [11], with the edge-selection criterion inverted to prefer the heaviest available edge. The algorithm is described as follows:

  1. Assign to each vertex v of the graph these three quantities:

     • C[v]: the highest cost of a connection to vertex v

     • Parent[v]: the node with which v has the highest-cost connection

     • Visited[v]: indicates whether v has been included in the MST

  2. Create two sets:

     • S: the set of vertices v, kept sorted in descending order with C[v] as the key (i.e., a max-priority queue)

     • E: the set of edges of the MST

  3. Initialization:

     • C[v] \(\leftarrow -\infty, \forall v \in V\)

     • Visited[v] \(\leftarrow\) false, \(\forall v \in V\)

     • Parent[v] \(\leftarrow\) null, \(\forall v \in V\)

     • S = all vertices

     • E = \(\phi\)

  4. Repeat these steps until \(S = \phi\):

     (a) Find and remove the first vertex v from S.

     (b) Set Visited[v] \(\leftarrow\) true.

     (c) If Parent[v] \(\ne\) null, add the edge (Parent[v], v) to the set E.

     (d) For every vertex u such that u \(\ne\) v and Visited[u] = false:

         i. If w(u, v) > C[u]:

            A. Set C[u] \(\leftarrow\) w(u, v)

            B. Set Parent[u] \(\leftarrow\) v

When the set S becomes empty, the set E contains the edges of the maximum spanning tree of the graph.
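A minimal Python sketch of steps 1-4, assuming the similarities are stored in a dense \(n \times n\) symmetric matrix; a linear scan over unvisited vertices stands in for the sorted set S, preserving the \(O(n^2)\) time and \(O(n)\) extra-space bounds (function and variable names are our assumptions):

```python
import math

def max_spanning_tree(w, n):
    """Prim's algorithm adapted to build a maximum spanning tree.

    w -- n x n symmetric matrix of similarities, w[i][j] = w(x_i, x_j)
    n -- number of records (vertices)
    Returns the tree edges as (weight, u, v) tuples.
    """
    C = [-math.inf] * n      # C[v]: highest-cost connection seen so far
    parent = [None] * n      # Parent[v]: endpoint of that connection
    visited = [False] * n    # Visited[v]: is v already in the tree?
    E = []                   # edges of the maximum spanning tree

    for _ in range(n):
        # Pick the unvisited vertex with the largest C[v]
        # (plays the role of the descending-sorted set S).
        v = max((u for u in range(n) if not visited[u]),
                key=lambda u: C[u])
        visited[v] = True
        if parent[v] is not None:
            E.append((C[v], parent[v], v))
        # Update the best connection of every remaining vertex.
        for u in range(n):
            if not visited[u] and w[v][u] > C[u]:
                C[u] = w[v][u]
                parent[u] = v
    return E
```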

2.4 Merging of Records into Clusters

In this stage, the edges of the maximum spanning tree are used to form clusters by a slight modification of Kruskal's algorithm [12]: the algorithm is stopped once the edge weight falls below a user-specified lower threshold on the similarity value, \(w_{critical}\). The algorithm makes use of the disjoint-set data structure, which maintains a collection of disjoint sets, each containing a finite number of elements. The disjoint sets represent the clusters, and the elements inside the sets are the records. Every set has a representative element which uniquely identifies that set. This representation makes it easy to merge two clusters efficiently and to locate the records in any cluster, thanks to two operations supported by the data structure:

Find: takes an element as input and determines which set that element belongs to.

Find(v):

  1. If Parent[v] \(\ne\) v then Parent[v] \(\leftarrow\) Find(Parent[v])

  2. Return Parent[v]

Union: takes two disjoint sets as input and joins them into a single set.

Two optimization techniques, union by rank and path compression, have been employed; together they reduce the amortized time complexity of the Union and Find operations to a near-constant factor (the inverse Ackermann function).

Union(u, v):

  1. \(set_u \leftarrow\) Find(u)

  2. \(set_v \leftarrow\) Find(v)

  3. If \(Size[set_u] < Size[set_v]\) then \(Parent[set_u] \leftarrow set_v\)

  4. If \(Size[set_u] > Size[set_v]\) then \(Parent[set_v] \leftarrow set_u\)

  5. If \(Size[set_u] = Size[set_v]\) then

     (a) \(Parent[set_u] \leftarrow set_v\)

     (b) \(Size[set_v] \leftarrow Size[set_v] + 1\)
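The two operations above translate directly into code. A sketch, assuming the records are numbered 0 to n-1 (the class name and Python representation are ours):

```python
class DisjointSet:
    """Disjoint-set (union-find) with path compression in Find
    and the rank-style Union listed above. Each set is a cluster,
    identified by its representative element."""

    def __init__(self, n):
        self.parent = list(range(n))  # Parent[v] = v initially
        self.size = [1] * n           # Size[v] = 1 initially

    def find(self, v):
        # Path compression: point v directly at its representative.
        if self.parent[v] != v:
            self.parent[v] = self.find(self.parent[v])
        return self.parent[v]

    def union(self, u, v):
        set_u, set_v = self.find(u), self.find(v)
        if set_u == set_v:
            return  # already in the same set (cannot happen on tree edges)
        if self.size[set_u] < self.size[set_v]:
            self.parent[set_u] = set_v
        elif self.size[set_u] > self.size[set_v]:
            self.parent[set_v] = set_u
        else:
            self.parent[set_u] = set_v
            self.size[set_v] += 1
```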

The modified Kruskal's algorithm is described as follows:

  1. Associate each vertex v with two quantities:

     • Parent[v]: the parent of v in its set

     • Size[v]: the size of the disjoint set which has v as its representative element

  2. Make N (the number of records) sets \(S_1, S_2, S_3, \ldots, S_N\), each set representing a cluster.

  3. Initialization:

     • \(Parent[v] \leftarrow v, \forall v \in V\)

     • \(Size[v] \leftarrow 1, \forall v \in V\)

     • \(S_i = \phi, \forall i \in \{1, 2, \ldots, N\}\)

  4. Sort the edges in the set E (the set containing the edges of the maximum spanning tree) in decreasing order of edge weight.

  5. Iterate over the edges e = (u, v) in E for as long as the edge weight satisfies \(w(u, v) > w_{critical}\):

     (a) Union(u, v)

  6. For every vertex v:

     (a) \(i \leftarrow Find(v)\)

     (b) Add vertex v to the set \(S_i\)

At the end, each non-empty set \(S_i\) represents a cluster containing records of a similar type. Figure 4 shows the graph structure after the clustering process; boxes of the same color belong to the same cluster.
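Putting the stages together, a sketch of this merging step that reuses the DisjointSet and max_spanning_tree sketches above (the function name and edge-tuple representation are our assumptions):

```python
from collections import defaultdict

def cluster(tree_edges, n, w_critical):
    """Modified Kruskal: merge records along maximum-spanning-tree edges
    whose similarity exceeds w_critical; return clusters as lists of ids."""
    dsu = DisjointSet(n)
    # Step 4: sort the tree edges in decreasing order of weight.
    for weight, u, v in sorted(tree_edges, reverse=True):
        # Step 5: stop once the similarity no longer exceeds the threshold.
        if weight <= w_critical:
            break
        dsu.union(u, v)
    # Step 6: group each vertex under its set's representative.
    clusters = defaultdict(list)
    for v in range(n):
        clusters[dsu.find(v)].append(v)
    return list(clusters.values())

# End-to-end usage on a toy 3-record similarity matrix:
w = [[0, 3, 1],
     [3, 0, 2],
     [1, 2, 0]]
edges = max_spanning_tree(w, 3)
print(cluster(edges, 3, w_critical=1))  # [[0, 1, 2]]: one cluster
```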

Fig. 5 Generated clusters

3 Experimental Results

The proposed algorithm was tested on a sample input of 50,000 records of unorganized books.Footnote 2 The records were taken as input in the following format:

Fig. 6 Time complexity

Fig. 7 Space complexity

Book Name, Author Name, \(\langle keyword_1, keyword_2, \ldots \rangle\), where the keywords are a set of words describing the book. These keywords are used as the features of each record (book). A total of 541 clusters were obtained, each cluster containing books of a similar subject category. Figure 5 shows a sample of the clusters obtained after running the algorithm on the dataset.
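For concreteness, a record in this format might be parsed along the following lines. This is a sketch under the assumption that fields are comma-separated and the keyword list is delimited by angle brackets; the exact input handling of the original implementation is not specified:

```python
def parse_record(line):
    """Parse 'Book Name, Author Name, <kw1, kw2, ...>' into its parts."""
    head, _, kw_part = line.partition("<")
    book, _, author = head.rstrip(", ").partition(",")
    keywords = {k.strip() for k in kw_part.rstrip(">").split(",")}
    return book.strip(), author.strip(), keywords

print(parse_record("Graph Theory, R. Diestel, <graphs, trees, connectivity>"))
# ('Graph Theory', 'R. Diestel', {'graphs', 'trees', 'connectivity'})
# (set ordering in the printed output may vary)
```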

The time taken and the space used to cluster inputs of different sizes are shown in Figs. 6 and 7. The analysis shows that the time complexity of the proposed algorithm is \(O(n^2)\) and the space complexity is \(O(n)\). Table 1 shows dataset sizes and the corresponding hypothetical computational cost on a typical desktop PC with a CPU speed of 2.4 GHz, which can perform approximately \(2.4 \times 10^{9}\) operations per second.
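As a worked example of this estimate (our arithmetic, under the stated model of one operation per clock cycle): clustering \(n\) records requires on the order of \(n^2\) operations, so the running time is roughly \(t(n) \approx n^2 / (2.4 \times 10^{9})\) seconds; for the full dataset of \(n = 50{,}000\) records, \(t \approx (2.5 \times 10^{9}) / (2.4 \times 10^{9}) \approx 1.04\) s.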

Table 1 Computational complexity

4 Conclusion

The proposed algorithm employs spanning-tree algorithms in two stages: in the first stage, a maximum spanning tree is constructed from the given dataset using Prim's algorithm; in the second stage, this maximum spanning tree is fed as input to a variation of Kruskal's algorithm, which then generates the clusters. Since the space complexity of the algorithm is \(O(n)\), it is efficient in terms of the space required to compute clusters even for large datasets. However, the time complexity of constructing a maximum spanning tree is \(O(n^2)\), which hinders the application of this approach to very large datasets.

Future work will explore approaches that capture the same structural information as the spanning tree with better time and space complexity.