
1 Introduction

Data mining is a technique for finding interesting and hidden patterns in data. It plays a dominant role in virtually every field, as it converts raw data into valuable information [1]. It is widely used in profiling practices such as scientific discovery, fraud detection, surveillance and marketing. Data mining discovers patterns in a dataset but is frequently applied only to data samples. In general, if the samples in the dataset are well represented, data mining becomes more effective; if the samples are poor, the process becomes ineffective. As increasing numbers of unstructured web pages, commercial reports and scientific publications become available through Information Retrieval (IR) systems and on the Internet, high-quality document clustering becomes indispensable in several applications such as web data management, data mining and search engines.

1.1 Tasks of Data Mining

Traditional data mining tasks are divided into four classes:

  • Classification: It organizes data into predefined groups. For instance, an email program may classify incoming messages as legitimate or spam. Algorithms such as Neural Networks (NN), Naive Bayesian Classification (NBC), Decision Tree Learning (DTL) and nearest-neighbour methods are commonly employed in many real-time applications.

  • Clustering: It is a procedure that finds similar data and groups them into clusters using similarity measures. Unlike classification, the groups are not predefined, so various procedures and parameters are used to group similar data and to identify outliers [2].

  • Regression: It is a method for assessing the relationships among variables. It includes several mechanisms for modelling and analysing variables, with an emphasis on the relationship between a dependent variable and one or more independent variables. The regression function is used to minimize the error while fitting the data [3].

  • Association Rule Mining (ARM): This is a technique used to find associations among objects. For instance, a superstore may gather knowledge on customers' buying practices. By applying ARM, the superstore can determine which products are often bought together. This is typically referred to as market basket analysis [4].

1.2 Clustering

Clustering is the procedure of exploiting the similarity that exists among data objects to divide them into groups of alike items. The data objects within a cluster have the highest similarity to each other, while the deviation between clusters is large [5]. The inter-cluster and intra-cluster distances are measured using appropriate distance measures. The most fundamental difference between clustering and classification is that clustering is an unsupervised learning process, whereas classification is supervised. For clustering, no special mechanism is needed to group the data items beforehand. Clustering is sometimes called automatic classification, but the term is misleading because the grouping is formed without prior knowledge of the classes. In classification, by contrast, the classes are predefined. Clustering finds cluster members based on the qualities and characteristics of the data, whereas in classification a classifier learns the association between data and classes from a training dataset in which the data items are labelled by a specialist; the learned model is then used to label new data.
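As a simple illustration of these distance notions, the sketch below computes the intra-cluster and inter-cluster distances for a toy two-cluster grouping; the data points, the Euclidean metric and the centroid-based definitions are illustrative assumptions rather than part of the schemes discussed later.

```python
# Minimal sketch: intra- and inter-cluster distances for a toy grouping.
# The data, the cluster assignment and the Euclidean metric are illustrative choices.
import numpy as np

points = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [4.8, 5.3]])
labels = np.array([0, 0, 1, 1])

centroids = np.array([points[labels == c].mean(axis=0) for c in (0, 1)])

# Intra-cluster distance: average distance of each point to its own centroid.
intra = np.mean([np.linalg.norm(p - centroids[c]) for p, c in zip(points, labels)])

# Inter-cluster distance: distance between the two centroids.
inter = np.linalg.norm(centroids[0] - centroids[1])

print(f"intra-cluster: {intra:.3f}, inter-cluster: {inter:.3f}")
```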

2 Related Work

Article clustering plays a dominant role in organizing documents in publishing organizations. The issues in arranging such data are not clearly understood, and an expressive, content-based labelling of the produced clusters is needed. To overcome this inadequacy, an extension of the K-means algorithm referred to as Weighted-K-means (W-K-means) is proposed by Tran et al. [6]. W-K-means performs grouping by considering the broader meaning of terms and then produces constructive labels for the resulting groups. It offers 10 times better results than the K-means algorithm by yielding higher intra-cluster similarity and lower inter-cluster similarity.

In partitioning-based document clustering algorithms, the clustering results depend mainly on the number of clusters given as input to the algorithm. If the number of clusters is not known, the quality of the clustering suffers. Agrawal and Phatak [7] have proposed a novel algorithm that determines the number of clusters automatically for any unknown textual dataset and clusters the records based on the cosine similarity between them. The system is capable of generating clusters for unknown data, but when documents of the same category are to be clustered, it suffers from over-clustering.

The fundamental task in document clustering is to evaluate the similarities among documents. A new method for calculating the similarity between documents is given by Jun et al. [8]. Several properties are embedded in this symmetric measure. The difference between the presence and absence of a feature is considered more important than the difference between the values of a present feature. The similarity increases as the difference between feature values decreases, and the contribution of that difference is suitably scaled. The similarity drops as the number of presence/absence mismatches increases, and a feature that is absent does not contribute to the similarity. The proposed measure is applied to estimate the similarity among different units of documents and is evaluated on a number of text applications, including single-label classification, multi-label classification, K-means and hierarchical agglomerative clustering; the results reveal the effectiveness of the proposed similarity measure.

Every document dataset has its own characteristics, and choosing a clustering algorithm that can handle all types of clusters is challenging. Each clustering algorithm has its own way of determining the number of clusters, and algorithms may be hybridized to overcome the weaknesses of individual methods and improve performance. Sadeghian and Nezamabadi-pour [9] have proposed Gravitational Ensemble Clustering (GEC), an approach that combines ensemble and gravitational clustering for efficient grouping, in contrast to individual approaches such as the K-means algorithm and two hierarchical clustering methods, namely Single Link (SL) and Average Link (AL). Gravitational clustering is based on Newton's law of gravitation, while ensemble clustering depends on two functions, namely the generation and consensus functions. These techniques offer better clustering accuracy, but the major shortcoming is the difficulty of optimization in continuous space.

Devi and Shanmugam [10] have proposed an effective ranking approach that ranks documents based on the TF-IDF of the keywords they contain. This reduces communication overhead, as needless documents are not downloaded. Furthermore, an effective query randomization scheme is proposed wherein multiple queries concerning the same search terms appear to be distinct.

Fan et al. [11] have proposed a method that combines the advantages of the Invasive Weed Optimization (IWO) algorithm with the low complexity of K-means by hybridizing IWO with K-means. Experiments are carried out with various documents in different categories. The results show that the IWO-K-means algorithm offers better clustering accuracy; the system finds optimal clusters but requires more time to converge.

Habibi and Popescu-Belis [12] have proposed a technique based on a specific type of Just-In-Time (JIT) retrieval for conversational environments, which helps users satisfy their information needs. The focus is on modelling users' information needs by deriving implicit queries from short dialogue fragments. These queries are built on sets of key terms extracted from the dialogue, and a clustering process is proposed to divide the set of keywords into smaller, topically independent subsets that constitute the implicit queries. To improve the performance of clustering with supervisory information, a Semi-Supervised Concept Factorization (SSCF) is proposed by Lu et al. [13]. Pairwise constraints are incorporated into CF as reward and penalty terms, which ensure that data points belonging to the same cluster in the original space remain in the same cluster in the transformed space.

Abualigah et al. [14] have proposed a feature selection scheme called Feature Selection using the PSO algorithm for Text Clustering (FSPSOTC) that generates a new subset of informative text features. This subset of features offers better performance with less computational time. Experiments are performed on six standard text datasets with numerous features that are commonly used in text clustering. The results reveal that the method enhances the effectiveness of the text clustering scheme by producing a new set of informative features.

Handa et al. [15] have proposed a method that considers the association between keywords to cluster documents. The documents within the appropriate cluster are searched instead of the whole dataset.

2.1 Clustering Methods

The hybridization of the K-means and HSM algorithms offers better clustering of documents. Forsati et al. [16] have used TF-IDF as the feature for document clustering, that is, a term-weighting scheme based on TF-IDF. In many documents, certain terms may appear only once or twice; hence, the TF of words cannot bring out the truly important terms, and this leads to the creation of tiny clusters. To overcome this, the Coverage Factor (CF) feature is used in the proposed Novel Feature Weighting and Feature Selection-based Hybrid Scheme for Document Clustering (NFW-FS-HSDC).

K-means is probably the most widely used partitioning-based clustering algorithm and is appropriate for large datasets. It is simple, easy to implement and applicable to a wide variety of applications. In this algorithm, the number of clusters must be specified at the outset, and it tends to produce local solutions because the initial cluster centres are selected randomly. Devi and Shanmugam [17] have proposed the Harmony Search Method (HSM), a meta-heuristic optimization procedure that emulates the music improvisation process and has been successful in handling optimization problems. When HSM is hybridized with K-means, HSM efficiently finds good initial cluster centres, enabling K-means to provide better results.

3 Hybridization of K-Means and Harmony Search Method (HSM)

When applied to document clustering, the K-means algorithm always performs a local search, and its solution depends on the previous step. K-means uses arbitrarily selected data objects as the centroids of the clusters, and these values change in every iteration, so the solution depends on the initial, arbitrarily produced centroid values. HSM, on the other hand, is excellent at discovering good centroid values but takes extra time to converge. To overcome these limitations of K-means and HSM, a hybrid algorithm that combines both ideas can offer effective results; the hybridization of K-means and HSM yields better results than the K-means algorithm alone [18].

The steps involved in HSM are detailed below.

  • Step 1: (i) Initialize the possible solutions as harmony and algorithm parameters.

    (ii) Calculate the fitness function for document clustering using Average Distance of Documents to the Cluster Centroid (ADDC)

  • Step 2: Initialize the Harmony Memory (HM) with arbitrarily produced centroids and evaluate them using the fitness function of K-means

  • Step 3: Improve HM by adjusting the selected centroid

  • Step 4: If the adjusted centroid has better fitness, then update the HM

  • Step 5: If all the adjusted data in HM have the same fitness value, then stop the search

Initially, the parameters of the HS algorithm are specified: the Harmony Memory Measurement (HMM), the Harmony Memory Allowing Rate (HMAR), the Pitch Tuning Rate (PTR) and the Quantity of Improvisations (QI), which serves as the stopping criterion. The HMM is a memory location where the solution vectors (sets of decision variables) are stored. HMAR and PTR are used in improving the solution vector. During HM initialization, the HM matrix is filled with as many randomly generated solution vectors as the HMM specifies. Generating a new harmony is known as ‘improvisation’.
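As a concrete illustration of Steps 1–5, the following sketch implements the improvisation loop in Python under simplifying assumptions: documents are represented as TF-IDF row vectors, ADDC serves as the fitness function, and the memory-consideration, pitch-adjustment and replacement rules are reduced to their most common textbook form. The parameter names hmm, hmar, ptr and qi mirror HMM, HMAR, PTR and QI from the text; the pitch-adjustment bandwidth is an assumed extra parameter. The best harmony returned would then seed K-means.

```python
# Minimal sketch of the HSM improvisation loop for centroid search (Steps 1-5).
# Documents are assumed to be TF-IDF row vectors; the update rules are simplified.
import numpy as np

def addc(docs, centroids, assign):
    """Average Distance of Documents to the Cluster Centroid (fitness, Step 1)."""
    return np.mean([np.linalg.norm(docs[i] - centroids[assign[i]]) for i in range(len(docs))])

def assign_clusters(docs, centroids):
    """Assign each document to its nearest centroid."""
    return np.argmin(np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2), axis=1)

def hsm_centroids(docs, k, hmm=10, hmar=0.9, ptr=0.3, qi=200, bandwidth=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n, d = docs.shape
    # Step 2: initialize the harmony memory with randomly chosen documents as centroids.
    memory = [docs[rng.choice(n, k, replace=False)].copy() for _ in range(hmm)]
    fitness = [addc(docs, c, assign_clusters(docs, c)) for c in memory]

    for _ in range(qi):
        # Step 3: improvise a new harmony (candidate centroid set).
        new = np.empty((k, d))
        for j in range(k):
            if rng.random() < hmar:                       # memory consideration
                new[j] = memory[rng.integers(hmm)][j]
                if rng.random() < ptr:                    # pitch adjustment
                    new[j] += bandwidth * rng.standard_normal(d)
            else:                                          # random selection
                new[j] = docs[rng.integers(n)]
        # Step 4: replace the worst harmony if the new one has better (lower) ADDC.
        f = addc(docs, new, assign_clusters(docs, new))
        worst = int(np.argmax(fitness))
        if f < fitness[worst]:
            memory[worst], fitness[worst] = new, f
    # The best harmony serves as the initial centroids for K-means.
    return memory[int(np.argmin(fitness))]
```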

[Figure a: algorithm for the hybridization of K-means and HSM]

3.1 Term Weight-Based Hybrid Scheme for Document Clustering (TW-HSDC)

In the Term Weight-based Hybrid Scheme for Document Clustering (TW-HSDC), TF-IDF is used as the feature for document clustering. TF denotes the number of times a distinct term appears in the document. The Vector Space Model (VSM) is commonly used in document clustering. In the VSM, each document is represented by a feature vector; traditionally, each feature corresponds to a keyword or phrase that appears in the collection of documents. Every entry in the vector stores the numerical weight of the respective feature of the document, so a document is represented by a vector of weights with ‘n’ elements.

$$ {d}_i=\left({w}_{i1},{w}_{i2},\dots ,{w}_{ij},\dots ,{w}_{in}\right) $$
(1)

where

  • w ij – weight of feature ‘j’ in document ‘i’

  • n – number of features

TF-IDF is a term-weighting scheme designed for information retrieval (e.g. ranking in search engines) and is used to determine the importance of features in document classification and clustering. The weight of a feature can be found using the following formula:

$$ {w}_{ij}=\mathrm{TF}\text{-}\mathrm{IDF}=\mathrm{TF}\left(i,j\right)\cdot \log \frac{N}{df(j)} $$
(2)

where

  • TF(i, j) – frequency of feature ‘j’ in document ‘d i’

  • N – number of documents in the collection

  • df(j) – number of documents in which feature ‘j’ appears

The TW-HSDC algorithm uses the TF of features to cluster the documents; as a result, only low-level clusters are formed. The TF-IDF model does not take into account words that appear only once or twice in a document, although such words may help define the category of a document. The significance of these words cannot be captured by TF alone.
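A minimal sketch of the VSM representation with the weighting of Eqs. (1) and (2): each document in a toy corpus is mapped to a vector of TF-IDF weights over the collection vocabulary. The toy corpus, the whitespace tokenization and the base-10 logarithm are assumptions made only for illustration.

```python
# Minimal sketch of the VSM representation with TF-IDF weights (Eqs. 1 and 2).
# The toy corpus and the base-10 logarithm are illustrative assumptions.
import math
from collections import Counter

docs = ["harmony search improves clustering",
        "k means clustering of documents",
        "harmony search and k means hybrid"]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})          # features 1..n
N = len(tokenized)                                             # number of documents
df = {t: sum(t in doc for doc in tokenized) for t in vocab}    # df(j)

def tfidf_vector(doc):
    tf = Counter(doc)
    # w_ij = TF(i, j) * log(N / df(j))
    return [tf[t] * math.log10(N / df[t]) for t in vocab]

vectors = [tfidf_vector(doc) for doc in tokenized]  # d_i = (w_i1, ..., w_in)
```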

3.2 Coverage Factor-Based Hybrid Scheme for Document Clustering (CF-HSDC)

In the Coverage Factor-based Hybrid Scheme for Document Clustering (CF-HSDC), the hybridization of K-means and HSM uses the Coverage Factor (CF) as the feature for document clustering. The drawbacks of TF-based document clustering are eliminated by using the CF. The coverage of a feature set is defined as the percentage of documents that contain at least one of the important terms.

The CF feature considers the significance of terms. CF-HSDC calculates the document frequency of words and performs word selection based on threshold values. The CF emphasizes the importance of less frequent terms by considering the percentage of documents in the dataset that contain these words. The coverage of terms is calculated from the number of input documents taken for clustering and the number of clusters, and based on the number of clusters and the percentage of term availability, the lower and upper coverage ranges are decided automatically.

[Figure b: coverage factor-based feature selection steps]

Here ‘K’ is the cluster size, so the number of clusters is approximately ‘N/K’. As the document set is typically huge, it is inefficient to perform feature extraction on the whole set of documents; to extract good features, a set of sample documents is selected at random. A stop-word list is generated so that meaningless words can be removed from the documents, and stemming is applied to merge similar words (for example, Learning, Learned and Learns are represented as Learn). Since feature reduction decreases the clustering time, Steps 4 to 6 try to minimize the number of features while retaining sensible coverage. For example, suppose the user wants each resulting cluster to comprise ‘K’ documents. Initially, features with DF equal to ‘K’ are chosen (Step 4), and the range {lower, upper} is widened repeatedly in Step 6 to ensure sufficient coverage for the resulting feature set, since the exact number of clusters is not equal to ‘N/K’. The proposed method uses a coverage threshold to guarantee that the chosen features adequately cover the important terms that occur less frequently.
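The following sketch captures the DF-range widening described above. The stop-word removal and stemming steps are omitted, and the widening rule (one unit per round) and the coverage threshold value are simplifying assumptions.

```python
# Minimal sketch of coverage-factor-based feature selection (Steps 4-6).
# docs is a list of token lists sampled from the corpus; stop-word removal and
# stemming are assumed to have been applied already.
from collections import Counter

def select_features(docs, k, coverage_threshold=0.9):
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))           # document frequency
    lower = upper = k                                           # Step 4: start with DF == K
    while True:
        features = {t for t, f in df.items() if lower <= f <= upper}
        # Coverage: fraction of documents containing at least one selected feature.
        coverage = sum(any(t in features for t in doc) for doc in docs) / n
        if coverage >= coverage_threshold or (lower <= 1 and upper >= n):
            return features
        lower, upper = max(1, lower - 1), min(n, upper + 1)     # Step 6: widen the range
```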

4 Novel Feature Weighting and Feature Selection-Based Hybrid Scheme for Document Clustering (NFW-FS-HSDC)

The proposed Novel Feature Weighting and Feature Selection-based Hybrid Scheme for Document Clustering (NFW-FS-HSDC) includes the following steps.

4.1 Optimized Kernel Matrix K-Means Algorithm

In this method, the parameters of the general kernel function, including the weight, softness and number of clusters, are tuned using PSO to improve discrimination in the higher-dimensional space. Thus, an optimized kernel is formed to improve the accuracy of document clustering. PSO is used to find the optimized kernel matrix by identifying the optimal weight and the membership values of each object. In every iteration, each particle carries a set of kernel matrices derived from the data objects along with the updated weight and membership matrices. The optimized kernels are very useful for calculating the similarity among heterogeneous features. An objective function ‘J(P)’ for fine-tuning the parameters of vector ‘P’ is defined for PSO:

$$ J(P)=\frac{\sum_{i=1}^c\sum_{j=1}^{n_i}d\left(j,i,i\right)}{\sum_{i=1}^c\sum_{j=1,j\ne i}^c\sum_{k=1}^{n_i}d\left(k,i,j\right)} $$
(3)
  • d(j, i, i) – Distance between the jth vector in the ith cluster and the centroid of the ith cluster (intra-cluster distance)

  • d(k, i, j) – Distance between the kth vector in the ith cluster and the centroid of the jth cluster (inter-cluster distance)

The distance between the ith vector in the jth cluster (u ij) and the centroid of the qth cluster in the higher-dimensional space is given below (a computational sketch follows the symbol definitions):

$$ d\left(i,j,q\right)=k\left({u}_{ij},{u}_{ij}\right)-\left(\frac{2}{n_q}\right)\sum_{k=1}^{n_q}k\left({u}_{ij},{u}_{kq}\right)+\left(\frac{1}{n_q^2}\right)\sum_{k=1}^{n_q}\sum_{l=1}^{n_q}k\left({u}_{kq},{u}_{lq}\right) $$
(4)
  • d(i, j, q) – Distance between the ith vector in the jth cluster (u ij) and the centroid of the qth cluster

  • k(u ij, u ij) – Kernel value of u ij with itself

  • u ij – ith vector in the jth cluster

  • n q – Number of vectors in cluster ‘q’

  • k(u ij, u kq) – Kernel value between the ith vector in the jth cluster and the kth vector in the qth cluster

  • \( {n}_q^2 \) – Square of the number of vectors in cluster ‘q’

  • k(u kq, u lq) – Kernel value between the kth and lth vectors in the qth cluster
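The distance of Eq. (4) is easy to evaluate once a kernel function is chosen. The sketch below uses an RBF kernel, whose form and width are illustrative assumptions, and represents a cluster simply as the list of its vectors.

```python
# Minimal sketch of the kernel-space distance of Eq. (4): distance between a
# vector and the (implicit) centroid of cluster q in the higher-dimensional space.
# The RBF kernel and its width are illustrative assumptions.
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    return np.exp(-gamma * np.linalg.norm(a - b) ** 2)

def kernel_distance(u, cluster_q, kernel=rbf_kernel):
    """d = k(u,u) - (2/n_q) * sum_k k(u, u_kq) + (1/n_q^2) * sum_k sum_l k(u_kq, u_lq)."""
    n_q = len(cluster_q)
    term1 = kernel(u, u)
    term2 = (2.0 / n_q) * sum(kernel(u, v) for v in cluster_q)
    term3 = (1.0 / n_q ** 2) * sum(kernel(v, w) for v in cluster_q for w in cluster_q)
    return term1 - term2 + term3
```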

[Figure c: optimized kernel matrix K-means algorithm]

Thus, optimization maximizes the discrimination in the higher dimensional space and increases the accuracy of the clustering process.
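Continuing the sketch above, the PSO objective of Eq. (3) is the ratio of intra-cluster to inter-cluster kernel distances for a given particle's clustering; lower values correspond to compact, well-separated clusters. The PSO update machinery itself is omitted, and kernel_distance and rbf_kernel refer to the previous sketch.

```python
# Minimal sketch of the objective J(P) of Eq. (3) evaluated for one particle.
# 'clusters' is a list of lists of vectors; kernel_distance / rbf_kernel are
# defined in the previous sketch.
def objective(clusters, kernel=rbf_kernel):
    # Numerator: distances of vectors to their own cluster centroid (intra-cluster).
    intra = sum(kernel_distance(u, clusters[i], kernel)
                for i, ci in enumerate(clusters) for u in ci)
    # Denominator: distances of vectors to the centroids of the other clusters.
    inter = sum(kernel_distance(u, clusters[j], kernel)
                for i, ci in enumerate(clusters)
                for j in range(len(clusters)) if j != i
                for u in ci)
    return intra / inter  # lower values: compact, well-separated clusters
```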

4.2 Unsupervised Constraint Kernel Matrix K-Means

Unsupervised constraint clustering generates the constraints automatically, without manual labelling. It includes two types of constraints, namely word constraints and document constraints.

  • Word constraints depend on word classifications learned from an auxiliary corpus. Must-links are added if two words are semantically close to each other, and these constraints are obtained by calculating the semantic distance between synonyms in WordNet (a sketch of WordNet-based must-link generation is given after this list). Named-entity identification and overlapping named entities are used to frame constraints for documents.

  • In the case of document constraints, the must-link constraints are built from connected entities such as persons, locations, organizations, etc.
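As an illustration of the word-constraint idea, the sketch below derives must-link pairs from WordNet similarity using NLTK's WordNet interface; the choice of path similarity and the 0.5 threshold are assumptions, and the document-level constraints from shared named entities are not shown.

```python
# Minimal sketch: derive must-link word constraints from WordNet similarity.
# Requires NLTK with the WordNet corpus downloaded (nltk.download('wordnet')).
# The path-similarity measure and the 0.5 threshold are illustrative assumptions.
from itertools import combinations
from nltk.corpus import wordnet as wn

def wordnet_similarity(w1, w2):
    """Best path similarity over all synset pairs of the two words (0 if none)."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

def must_link_constraints(words, threshold=0.5):
    """Pairs of words that are semantically close enough to be must-linked."""
    return [(w1, w2) for w1, w2 in combinations(words, 2)
            if wordnet_similarity(w1, w2) >= threshold]

# Example: must_link_constraints(["car", "automobile", "river"])
# would typically link "car" and "automobile" but not "river".
```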

5 Results and Discussion

The proposed NFW-FS-HSDC offers better Precision, Recall, F-Measure, Average Distance of Documents to the Cluster Centroid (ADDC), Entropy, Overall Similarity and Cluster Purity for the Newsgroup and Text REtrieval Conference (TREC) datasets. From these results, it is evident that the proposed scheme outperforms the existing methodologies by producing more fine-tuned clusters. Moreover, the effort and cost of using manually labelled constraints are reduced by employing the unsupervised constraints.

5.1 Performance of the Proposed NFW-FS-HSDC Scheme for Newsgroup Dataset

The clustering results of the existing and proposed approaches for the Newsgroup dataset are given in Table 1.

Table 1 Results of document clustering for Newsgroup dataset

For Newsgroup dataset, the proposed NFW-FS-HSDC scheme offers 5.6% and 4.2% better Precision in contrast to the existing TF-HSDC and CF-HSDC schemes. Similarly, the propounded NFW-FS-HSDC scheme offers 5% and 3.7% better Recall when compared to the existing TF-HSDC and CF-HSDC schemes. The benchmarked TF-HSDC and CF-HSDC schemes offer 5.3% and 4% reduced F-Measure in contrast to the proposed NFW-FS-HSDC scheme. Similarly, the benchmarked TF-HSDC and CF-HSDC schemes offer 21.3% and 10.6% reduced ADDC when compared to the proposed NFW-FS-HSDC scheme.

The proposed NFW-FS-HSDC scheme offers 9.2% and 6.8% better Entropy in contrast to the existing TF-HSDC and CF-HSDC schemes. The proposed NFW-FS-HSDC scheme offers 10% and 8.9% improved Overall Similarity in contrast to the existing TF-HSDC and CF-HSDC schemes. Similarly, the benchmarked TF-HSDC and CF-HSDC schemes offer 9% and 7.4% reduced Cluster Purity when compared to the proposed NFW-FS-HSDC scheme.

5.2 Performance of the Proposed NFW-FS-HSDC Scheme for TREC Dataset

Similarly, the results of the existing and proposed approaches for the TREC dataset are shown in Table 2.

Table 2 Results of document clustering for TREC dataset

For TREC dataset, the proposed NFW-FS-HSDC scheme offers 5.7% and 3.9% better Precision in contrast to the existing TF-HSDC and CF-HSDC schemes. Similarly, the propounded NFW-FS-HSDC scheme offers 5.3% and 4.6% better Recall when compared to the existing TF-HSDC and CF-HSDC schemes. The benchmarked TF-HSDC and CF-HSDC schemes offer 5.5% and 4.3% reduced F-Measure in contrast to the proposed NFW-FS-HSDC scheme. Similarly, the benchmarked TF-HSDC and CF-HSDC schemes offer 22.7% and 11.5% reduced ADDC when compared to the proposed NFW-FS-HSDC scheme.

The proposed NFW-FS-HSDC scheme offers 6.2% and 3.9% better Entropy in contrast to the existing TF-HSDC and CF-HSDC schemes. The proposed NFW-FS-HSDC scheme offers 9.9% and 4% improved Overall Similarity in contrast to the existing TF-HSDC and CF-HSDC schemes. Similarly, the benchmarked TF-HSDC and CF-HSDC schemes offer 6.1% and 4.4% reduced Cluster Purity when compared to the proposed NFW-FS-HSDC scheme.

It is seen that the Precision (Fig. 1), Recall (Fig. 2), F-Measure (Fig. 3), ADDC (Fig. 4), Entropy (Fig. 5), Overall Similarity (Fig. 6) and Cluster Purity (Fig. 7) of the proposed NFW-FS-HSDC scheme for the Newsgroup and TREC datasets are better than those of the benchmarked TF-HSDC and CF-HSDC schemes considered for investigation.

Fig. 1 Precision of the TF-HSDC, CF-HSDC and NFW-FS-HSDC schemes for the Newsgroup and TREC dataset

Fig. 2 Recall of the TF-HSDC, CF-HSDC and NFW-FS-HSDC schemes for the Newsgroup and TREC dataset

Fig. 3 F-Measure of the TF-HSDC, CF-HSDC and NFW-FS-HSDC schemes for the Newsgroup and TREC dataset

Fig. 4 ADDC of the TF-HSDC, CF-HSDC and NFW-FS-HSDC schemes for the Newsgroup and TREC dataset

Fig. 5 Entropy of the TF-HSDC, CF-HSDC and NFW-FS-HSDC schemes for the Newsgroup and TREC dataset

Fig. 6 Overall Similarity of the TF-HSDC, CF-HSDC and NFW-FS-HSDC schemes for the Newsgroup and TREC dataset

Fig. 7 Cluster Purity of the TF-HSDC, CF-HSDC and NFW-FS-HSDC schemes for the Newsgroup and TREC dataset

6 Conclusion

Different techniques, namely the Term Frequency-based Hybrid Scheme for Document Clustering (TF-HSDC), the Coverage Factor-based Hybrid Scheme for Document Clustering (CF-HSDC) and the Novel Feature Weighting and Feature Selection-based Hybrid Scheme for Document Clustering (NFW-FS-HSDC), are proposed in this chapter to facilitate efficient document clustering. The K-means algorithm has the limitation of generating locally optimal solutions, whereas the Harmony Search Method (HSM) takes more time to converge. In TF-HSDC, clusters are produced based only on TF counts of features. In CF-HSDC, the CF does not guarantee that all documents are considered in the clustering process. In the proposed NFW-FS-HSDC, a single viewpoint is used to cluster documents, and the effectiveness of single-viewpoint clustering depends mainly on the appropriateness of the similarity measure. The experimental results show that NFW-FS-HSDC outperforms the other two clustering methods in terms of Precision, Recall, F-Measure, ADDC, Entropy, Overall Similarity and Cluster Purity for the Newsgroup and TREC datasets.