1 Introduction

Document clustering (DC) is an unsupervised learning task that groups highly correlated documents into the same category while separating dissimilar documents into different categories [1]. Recently, DC [2,3,4, 6,7,8,9, 12,13,14,15,16] has become an active research topic. The typical structure of DC comprises two stages: text refining and knowledge distillation. In the former, a document is converted into an intermediate representation, which can be document-based or concept-based [17]. In the latter, clustering algorithms are applied to extract valuable information from this intermediate form. To extract good representations from documents, researchers have utilized diverse machine learning algorithms, and in past years various works applied text mining techniques to analyze text patterns and carry out the mining process [19]. Researchers categorized these techniques into agglomerative clustering algorithms, partitioning algorithms, and standard parametric modeling-based methods [20]. Automatic document organization, topic extraction, and fast information retrieval or filtering are common applications of document clustering [21].

One key issue in document clustering is similarity measurement [22,23,24,25]. There are two widely used geometrical similarity metrics: the cosine similarity (CS) [26,27,28] and the Euclidean distance (ED) [29]. The former ignores the magnitude difference (MD) of vectors, which encodes term frequency: CS computes the similarity level between two documents without taking into account how frequently each term occurs. The latter fails to distinguish pairs of vectors that yield the same ED and does not perform well on high-dimensional data. These two metrics are therefore not always appropriate for measuring the similarity between documents.

Another issue in document clustering is how to group documents derived from different sources. Due to the diversity and massive volume of unorganized text documents generated from diverse sources [30], extracting high-quality information from text documents is a challenging task. A dataset is referred to as multi-view data when it derives from diverse sources or modalities [31]. Data from different sources have dissimilar physical connotations and statistical properties. To describe this divergent information, several studies regard each source or modality as one “view”. Multi-view learning can thus be used, on the one hand, to join information and produce better results and, on the other hand, to minimize the effect of noise (data values that make it harder to find patterns) in the data. These properties make multi-view learning a strong candidate for text document clustering. In the last decade, several works have concentrated on multi-view document clustering challenges [2,3,4, 6, 26, 32]. Some of them consider all the data points as a single view [3]; others regard them as multiple views [32]. Multi-view clustering (MvC) aims to partition the data more accurately and robustly than any single-view clustering [33]. Two strategies are available: the distributed approach clusters the views separately and then fuses the results into a final partition, while the centralized approach first fuses all views into one and then applies a clustering algorithm. The centralized approach suffers from over-fitting and ignores the statistical properties of each view. Merging multiple views without decreasing accuracy is an ongoing challenge in multi-view clustering. Recently, [35] reported that there is no criterion to decide which MvC algorithm is the best. To that end, ensemble clustering can be applied to MvC to take advantage of different methods.

In this article, we extend our past research work [36]. Knowing that existing geometrical approaches to similarity measurement consider the magnitude, the ED, the cosine, and the direction of vectors separately, and inspired by previous research results, the current work proposes a Robust Multi-view Document Clustering (RMDC) method to address both similarity measurement and the fusion of documents derived from multiple sources in text document clustering. The major differences between the Concept-Enhanced Multi-view Clustering (CEMvC) [36] and RMDC are summarized as follows. Based on the observation that every metric has its advantages and disadvantages depending on the dataset, we not only extend CEMvC but also improve it. We instantiate five metrics instead of the three in the previous work and run them on every dataset to determine which metric is the most suitable. We apply the text preprocessing described in Sect. 4.2 to all texts in each dataset and create a list of the top n keywords with the highest TF-IDF scores from each benchmark dataset. The algorithm applies the CS, ED, TS-SS, and RDSim1-5 similarity metrics to the data matrix of each view to generate the corresponding similarity matrices. An ensemble approach then combines these matrices into a single consensus similarity matrix, and a partition-based algorithm such as spectral clustering is deployed to produce the final clustering. RMDC considers more views per dataset, whereas our previous work focused on only two views. More rigorous analyses of every metric and of every dataset are conducted in Sects. 6.3 and 6.5, respectively, which show the robustness of the proposed RMDC. The main contributions of our work are as follows:

  • The proposed RMDC tackles the drawbacks of CS and ED metrics by calculating the similarity between documents with the same ED while taking into consideration their MD.

  • Our method computes not only the similarity between documents but also their similarity level.

The rest of this paper is structured as follows: Sect. 2 surveys related work on text document clustering and the ensemble clustering method. In Sect. 3, we review some basic notions of similarity metrics and multi-view methods. We present the proposed multi-view document clustering scheme in Sect. 4 and analyze the time complexity of the proposed algorithm in Sect. 5. We conduct experimental studies in Sect. 6. Section 7 concludes the paper and outlines future work.

2 Related works

This section presents a review of recent literature on document clustering as well as multi-view document clustering, which form the foundation of our proposed method.

Recently, extensive studies on document clustering have been carried out. Priya and Priyadharshini [7] proposed a new algorithm named text clustering with feature selection, which iteratively incorporates an improved supervised feature selection to identify pertinent features (i.e., terms). In their framework, relations such as synonymy, meronymy, hypernymy, and concept relationships are represented in an ontology. The algorithm proposed in [13] works well on small data but needs to be upgraded to deal with large-scale document datasets. Yu et al. [37] combined text mining techniques and bibliometric methods to analyze the patterns of information science publications: geographic distribution, source journals, source institutes, international and inter-institutional collaboration, the document co-citation network, and the detection of reference citation bursts. Saini et al. [38] fused the self-organizing map (SOM) and a multi-objective differential evolution approach, yielding a cognitive-inspired multi-objective automatic document clustering technique. They utilized the concept of SOM to design new genetic operators for the proposed clustering technique and encoded a variable number of cluster centers in different solutions of the population to automatically determine the number of clusters from a dataset. Sherkat et al. [11] proposed an innovative visual analytic scheme for interactive document clustering. In their system, an initial clustering is established based on a user-defined number of clusters and a preferred clustering algorithm. A set of coordinated visualizations allows the examination of the dataset and the clustering results, providing the user with highlights of individual documents and an understanding of how documents evolve over the time period to which they relate. Users then steer the process by changing the key-terms that drive it according to their knowledge of the documents' domain: in key-term-based synergy, the user designates a set of keywords for each cluster to instruct the clustering algorithm. They further improved this process with a novel algorithm for choosing proper seeds for the clustering. Janani and Vijayarani [5] proposed an improved text document clustering framework based on the combination of spectral clustering with particle swarm optimization (SCPSO). By using global and local optimization functions, the algorithm aims to handle huge volumes of text documents; however, the complexity of generating the similarity graph matrix is very high. The method in [6] barely handles the overlapping-cluster and matrix-generation problems. Abualigah et al. [15] combined several objective functions and algorithms such as Krill Herd: they initially inherit solutions from the k-means clustering algorithm and the clustering agreement, then combine the two objective functions. Bisson and Grimal [2] proposed the multi-view similarity (MVSIM) framework to handle the dilemma of learning co-similarities when a set of matrices describes the connections between various items. To handle noise in the data, they set a percentage p of the smallest similarity values to zero in the document and word matrices at the end of each iteration. One drawback of their method is that this parameter relies on prior information that is not accurate. The second drawback is that it processes noise during the clustering step, which might affect the performance.

Multi-view document clustering emerged to address the problem of grouping documents derived from diverse sources. Recent progress and new challenges regarding multi-view clustering are discussed in [35, 45, 46]. Zhao et al. [46] categorized multi-view learning mechanisms into three major families: co-training style, co-regularization style, and margin-consistency style algorithms. Yang and Wang [35] classified the learning methods into multi-kernel learning, multi-view subspace, multi-task multi-view, multi-view graph clustering, and co-training algorithms. According to their study, the correctness of views, the opportune moment of fusion, incomplete MvC, and multi-task multi-view clustering are still challenging problems. Furthermore, [47,48,49,50] suggested new research directions. Wahid et al. [6] proposed a non-dominated sorting genetic algorithm for multi-view document clustering. This method generates distinct clustering solutions from the multiple views of the documents and then selects a mixture of clusters to form a final clustering. Hussain et al. [3] combined different ensemble techniques, leading to a novel multi-view document clustering algorithm. Their algorithm computes three particular similarity matrices on each dataset and aggregates them into a consensual similarity matrix, which is then used as the input of a clustering algorithm to obtain the final clustering. However, their algorithm is computationally expensive, and its accuracy relies on the multiple clustering algorithms used. Inspired by this work, we extend the same idea and compute the similarity matrices in parallel to reduce the computation cost; we detail our framework in Sect. 4. The same authors also proposed a multi-view clustering setting in the context of a co-clustering framework [32], based on the assumption that transferring similarity values from one view to the others for individual data points will enhance the clustering result. They extended a co-clustering algorithm named \(\chi\)-SIM to multi-view clustering. However, this method suffers from the problem of setting the number of iterations accurately. Detecting the multiple views in documents is still a challenging problem [51]. The multi-view concept factorization (MVCF) technique [8] incorporates a graph-regularized method to cluster documents. The MVCF algorithm preserves the local geometrical structure of the manifolds for multi-view clustering, which traditional concept factorization cannot. However, the algorithm is only suitable for small-scale datasets, and its time complexity is very high. To overcome this problem, Jia et al. [10] devised an approximate normalized-cuts algorithm that avoids eigen-decomposition for large-scale clustering. Firstly, they reduced the space requirement of the normalized cut by sampling a few data points to deduce the global features of the dataset instead of using the full affinity matrix. Secondly, they accelerated graph-cut clustering in an iterative way, using approximate weighted kernel k-means to optimize the objective function of the normalized cut. This technique avoids the direct eigen-decomposition of the Laplacian matrix.

Similarly, Yan et al. [9] proposed a novel regularized concept factorization algorithm, which focuses on two constraints: whether two documents belong to the same class (must-connected) and whether they belong to different classes (cannot-connected). It is well known that there is no criterion to decide which MvC algorithm is the best; one way to take advantage of them is to combine them in an ensemble learning method, as we discuss below.

3 Preliminaries

This section briefly reviews three commonly used similarity metrics in document clustering analysis.

Notations: Let V be the data matrix representing a dataset, with documents in its rows and words in its columns. \(V_j^i\) denotes the element of V corresponding to the intensity of association between document i and word j. For simplicity, we select two documents u and v in the data view. Table 1 gives a detailed summary of the notation used throughout this paper.

Table 1 List of symbols

3.1 Similarity metric

Measuring the similarity between two documents is a complex task in text mining, and several studies have proposed similarity measurements. Birjali et al. [22] suggested a MapReduce-based algorithm to measure similarity in a large document corpus; they discussed similarity measures based on arcs, nodes, vector spaces, and hybrids. Wagh and Anand [23] compared two approaches for finding legal document similarity: (a) CS and (b) citation-based similarity, the main difference being the use of the Jaccard similarity in the citation-based case. Their results show that the citation-based measure is more robust in determining parallels among cases but requires more connected components. Jagatheeshkumar and Brunda [24] surveyed distance-based similarity measures such as the ED, Manhattan distance, Minkowski distance, and CS, and Shirkhorshidi et al. [25] examined their role on high-dimensional datasets.

We now introduce the two most commonly used similarity metrics, CS and ED.

3.1.1 Cosine similarity

The CS, formulated in Eq. (1), computes the pairwise similarity between two documents using the dot product and the magnitudes of the document vectors \(\mathbf {u}\) and \(\mathbf {v}\) in high-dimensional space [52,53,54,55,56].

$$\begin{aligned} Cos\left( {u,v} \right) = \frac{{\sum \limits _{n = 1}^k {u\left( n \right) .v\left( n \right) } }}{{\left| u \right| .\left| v \right| }} \end{aligned}$$
(1)

CS is used as a metric for measuring similarity when the MD of the vectors is not required. Text data represented by word counts is a suitable case for this technique. For example, a word may occur more often in one document simply because that document is longer than another: the term's weight is larger for the first document, yet the two documents can still be similar. In such cases, CS is the better metric.
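A toy example (the vectors are made up) illustrates this length invariance: scaling a document's term-count vector leaves the cosine unchanged.

```python
import numpy as np

u = np.array([2.0, 1.0, 0.0])  # term counts of a short document
v = 3 * u                      # a longer document with the same word mix
cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(round(cos, 6))           # 1.0: CS ignores the magnitude difference
```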

3.1.2 Euclidean distance

The ED [57, 58] between two points u and v is the length of the straight-line segment connecting them. It is the second most widely used similarity metric.

$$\begin{aligned} ED\left( {u,v} \right) = \sqrt{\sum \limits _{n = 1}^k {{{\left( {u\left( n \right) - v\left( n \right) } \right) }^2}}} \end{aligned}$$
(2)

where u is the first document and v the second one.

The ED computes the distance between two points in n-dimensional space based on their coordinates.

3.1.3 TS-SS similarity

Heidarian and Dinneen highlighted the drawbacks of both CS and ED in [14] and proposed a new method named TS-SS. The method combines the triangle's area similarity (TS), formulated as follows:

$$\begin{aligned} TS\left( {u,v} \right) = \frac{{\left| u \right| .\left| v \right| .\sin \left( {\theta '} \right) }}{2} \end{aligned}$$
(3)

and the sector's area similarity (SS), depicted in the next formula:

$$\begin{aligned} SS\left( {u,v} \right) = \frac{{\pi .\left( {\theta '} \right) {{\left[ {ED\left( {u,v} \right) + MD\left( {u,v} \right) } \right] }^2}}}{{360}} \end{aligned}$$
(4)

TS-SS is formulated as follows:

$$\begin{aligned} TS-SS\left( {u,v} \right) = \frac{{\pi .\left| u \right| .\left| v \right| .\theta '.\sin (\theta ').{{\left( {ED\left( {u,v} \right) + MD\left( {u,v} \right) } \right) }^2}}}{{720}} \end{aligned}$$
(5)

where \(MD\left( {u,v} \right) = \left| {\sqrt{\sum \limits _{n = 1}^k {u_n^2} } - \sqrt{\sum \limits _{n = 1}^k {v_n^2} } } \right|\) and \(\theta ' = {\cos ^{ - 1}}\left( {\cos \left( {u,v} \right) } \right) + 10\), with \(\theta '\) expressed in degrees (hence the denominators 360 and 720 in Eqs. (4) and (5)).
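To make these definitions concrete, the following minimal NumPy sketch implements Eqs. (1)–(5); the function names and the clipping of the cosine to [-1, 1] (a guard against floating-point overshoot) are our own choices, not part of the original formulation.

```python
import numpy as np

def cosine_similarity(u, v):
    """Eq. (1): cosine of the angle between document vectors u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean_distance(u, v):
    """Eq. (2): straight-line distance between u and v."""
    return np.linalg.norm(u - v)

def magnitude_difference(u, v):
    """MD(u, v): absolute difference of the two vector magnitudes."""
    return abs(np.linalg.norm(u) - np.linalg.norm(v))

def theta_prime(u, v):
    """Angle between u and v in degrees, shifted by 10 so it is never zero."""
    cos = np.clip(cosine_similarity(u, v), -1.0, 1.0)
    return np.degrees(np.arccos(cos)) + 10.0

def triangle_similarity(u, v):
    """Eq. (3): area of the triangle spanned by u and v."""
    t = theta_prime(u, v)
    return np.linalg.norm(u) * np.linalg.norm(v) * np.sin(np.radians(t)) / 2.0

def sector_similarity(u, v):
    """Eq. (4): area of the sector of radius ED + MD and angle theta'."""
    t = theta_prime(u, v)
    r = euclidean_distance(u, v) + magnitude_difference(u, v)
    return np.pi * t * r ** 2 / 360.0

def ts_ss(u, v):
    """Eq. (5): product of the triangle's and the sector's area similarities."""
    return triangle_similarity(u, v) * sector_similarity(u, v)
```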

4 Proposed method

In this work, we first instantiate five similarity metrics named Robust Document Similarity metrics (\(RDSim_{1-5}\)). We then present our multi-view document clustering scheme based on the proposed new similarity metrics.

4.1 Document similarity metrics

The CS and ED metrics are not always suitable for measuring the similarity between two documents. CS is known to be one of the good geometrical similarity measures; however, it does not consider the MD of the two vectors. Both CS and ED are thus limited in how accurately they can estimate the similarity between two documents. Knowing that they complement each other, an alternative is to combine them. There is therefore a need for a novel approach to similarity calculation that copes with the drawbacks of these metrics. To that end, we devise the following robust document similarity metrics (\(RDSim_{1-5}\)):

$$\begin{aligned} RDSim1\left( {u,v} \right)= \, & {} ED\left( {u,v} \right) \times Cos\left( {u,v} \right) + TS-SS\left( {u,v} \right) \end{aligned}$$
(6)
$$\begin{aligned} RDSim2\left( {u,v} \right)= \, & {} \left[ {ED\left( {u,v} \right) + MD\left( {u,v} \right) } \right] Cos\left( {u,v} \right) \end{aligned}$$
(7)
$$\begin{aligned} RDSim3\left( {u,v} \right)= & {} \left[ {SS\left( {u,v} \right) + Cos\left( {u,v} \right) } \right] ED\left( {u,v} \right) \end{aligned}$$
(8)
$$\begin{aligned} RDSim4\left( {u,v} \right)= & {} \left[ {SS\left( {u,v} \right) + Cos\left( {u,v} \right) } \right] TS\left( {u,v} \right) \end{aligned}$$
(9)
$$\begin{aligned} RDSim5\left( {u,v} \right)= & {} ED\left( {u,v} \right) \times SS\left( {u,v} \right) \end{aligned}$$
(10)

The RDSim1 metric in Eq. (6) computes the similarity between two documents by taking into consideration their ED, cosine, and the triangle's and sector's area similarities. Since the ED is sometimes large, the RDSim2 metric in Eq. (7) weights the cosine by the sum of the ED and the MD. In RDSim3-5 we revise the ED metric: Eq. (8) scales the ED by the sum of the sector's area similarity and the cosine, Eq. (9) replaces the ED with the triangle's area similarity, and Eq. (10) combines the ED with the sector's area similarity alone.
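Using the helper functions sketched after Eq. (5), the five metrics translate directly into code; a minimal sketch follows (the function names are ours):

```python
def rdsim1(u, v):  # Eq. (6)
    return euclidean_distance(u, v) * cosine_similarity(u, v) + ts_ss(u, v)

def rdsim2(u, v):  # Eq. (7)
    return (euclidean_distance(u, v) + magnitude_difference(u, v)) * cosine_similarity(u, v)

def rdsim3(u, v):  # Eq. (8)
    return (sector_similarity(u, v) + cosine_similarity(u, v)) * euclidean_distance(u, v)

def rdsim4(u, v):  # Eq. (9)
    return (sector_similarity(u, v) + cosine_similarity(u, v)) * triangle_similarity(u, v)

def rdsim5(u, v):  # Eq. (10)
    return euclidean_distance(u, v) * sector_similarity(u, v)
```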

After computing the document similarity matrices with the above five metrics, we devise a method to aggregate the different matrices generated from the several views of each dataset.

4.2 Multi-view document cluster ensemble

We propose an ensemble technique that combines the different similarity matrices generated by the aforementioned metrics in the following steps:

Fig. 1 Framework for the proposed robust multi-view document clustering

  • Step 1: Document preprocessing

    In document preprocessing, tokenization is a crucial step: it partitions the document into an array of sentences, which are in turn split into words. Following the processing in [59], we use the word_tokenize function of the Natural Language Toolkit (nltk) to tokenize the words. This procedure generates many words that affect clustering accuracy; imprecise words such as stop-words are treated as noise, since they give rise to a collection of irrelevant similarities, and are pruned accordingly. Porter's stemming algorithm is then applied to reduce inflected words to their stems. TF-IDF is next applied to measure the term frequencies and to filter out words that appear with very low frequency throughout the corpus. To compute the similarity between two documents A and B, we convert their sentences into vectors with TF-IDF; the vectors are then equalized to the same length. These data serve as the input for the next step (an end-to-end sketch of the full pipeline is given after Algorithm 1).

  • Step 2: Similarity matrices generation

    On every dataset, the equalized view matrix is used as the input to the CS (Eq. 1), ED (Eq. 2), TS-SS (Eq. 5), and \(RDSim_{1-5}\) similarity metrics to generate the corresponding view similarity matrices. For instance, for View1 the outputs are datasetName-View1-Cosine-sim-matrix, datasetName-View1-ED-sim-matrix, datasetName-View1-TS-SS-sim-matrix, etc. For n views, we obtain 8n matrices.

  • Step 3: Inner-view similarity matrices aggregation

    We concatenate the matrices generated by every metric, view by view, to obtain the inner-view matrix, and repeat this procedure for all views. In the next step, we then combine the individual inner-view matrices of the views to improve the clustering, using the formula in Eq. (11)

    $$\begin{aligned} M^v = \frac{1}{n}\sum \limits _{i = 1}^n { (M_j^i} + M_r^i) \end{aligned}$$
    (11)

    where, for view i, \(M_j^i\) is the matrix generated by the traditional geometrical similarity metric j, \(M_r^i\) is the matrix produced by one of our five proposed metrics r, and n is the total number of document views.

  • Step 4: Inter-view similarity matrices aggregation

    We aggregate the inner-view similarity matrices to obtain a unified final similarity matrix. For \(n>2\) views, our proposed algorithm aggregates n ensemble-based similarity matrices; in this paper we fix n to 3.

  • Step 5: Final clustering

    The final similarity matrix is then used as the input of a clustering algorithm such as spectral clustering to generate the final clusters C. The clustering performance is evaluated using the accuracy (Eq. 12) and purity (Eq. 13) metrics. The overall procedure is highlighted in Fig. 1, and the pseudo-code is displayed in Algorithm 1.

Algorithm 1 Pseudo-code of the proposed RMDC method
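A minimal end-to-end sketch of Steps 1–5 follows, under simplifying assumptions: every view is given as a list of raw text documents (link-based views would feed their 0/1 matrices directly to similarity_matrices), the fusion in Eq. (11) is realized as a plain average, and the min-max rescaling of the consensus matrix into a non-negative affinity for spectral clustering is our own addition; all names are illustrative.

```python
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.cluster import SpectralClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# The 8 metrics applied per view (sketched in Sects. 3 and 4.1).
METRICS = [cosine_similarity, euclidean_distance, ts_ss,
           rdsim1, rdsim2, rdsim3, rdsim4, rdsim5]

_STEMMER, _STOP = PorterStemmer(), set(stopwords.words("english"))

def preprocess(doc):
    """Step 1: tokenize with nltk, drop stop-words, stem with Porter's algorithm."""
    return " ".join(_STEMMER.stem(t) for t in word_tokenize(doc.lower())
                    if t.isalpha() and t not in _STOP)

def similarity_matrices(X):
    """Step 2: one q x q similarity matrix per metric for a view matrix X."""
    q = X.shape[0]
    return [np.array([[m(X[a], X[b]) for b in range(q)] for a in range(q)])
            for m in METRICS]

def rmdc(views, k):
    """Steps 3-5: fuse the matrices within and across views, then cluster."""
    inner = []
    for docs in views:                                   # one list of texts per view
        X = TfidfVectorizer(preprocessor=preprocess).fit_transform(docs).toarray()
        inner.append(np.mean(similarity_matrices(X), axis=0))   # inner-view fusion
    consensus = np.mean(inner, axis=0)                          # inter-view fusion
    # Spectral clustering expects non-negative affinities; rescale to [0, 1].
    A = (consensus - consensus.min()) / (consensus.max() - consensus.min())
    return SpectralClustering(n_clusters=k, affinity="precomputed").fit_predict(A)
```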

5 Time complexity analysis

The complexity of the proposed method depends on the RDSim metric used during the similarity matrix generation. Given a dataset with n views (\(n \ge 2\)), q objects, and m similarity measurements, the overall complexity is O(qnmd), where d is the data dimensionality. To save memory, we compute the similarity matrices in parallel and store them on disk, which makes it easy to reuse a matrix without recomputing it.
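As an illustration of this parallel-and-cache strategy (the use of joblib and the file-naming scheme are our assumptions, not the paper's implementation; views maps each view name to its vectorized matrix):

```python
import os
import numpy as np
from joblib import Parallel, delayed

def cached_similarity(X, metric, path):
    """Compute one similarity matrix, or load it from disk if already cached."""
    if os.path.exists(path):
        return np.load(path)
    q = X.shape[0]
    S = np.array([[metric(X[a], X[b]) for b in range(q)] for a in range(q)])
    np.save(path, S)
    return S

# One job per (view, metric) pair; each result is memoized on disk.
matrices = Parallel(n_jobs=-1)(
    delayed(cached_similarity)(X, m, f"{name}-{m.__name__}-sim-matrix.npy")
    for name, X in views.items() for m in METRICS)
```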

6 Experiments

In this section, we conduct tests on six real-world multi-view datasets to evaluate the effectiveness of the proposed approach. We run all the experiments in Python 3 on a workstation (Windows 64-bit, Intel(R) Core(TM) i7-4600 CPU @ 2.10 GHz 2.70 GHz, 16 GB of RAM).

6.1 Data sets description

We present each dataset by specifying its views and features in Table 2. In all cases, the content view is the document-word matrix, containing 0/1 values indicating the absence or presence of a word in a document; the inbound view is the 0/1 matrix describing the inbound links between documents; and the cites view is the matrix of the numbers of citation links between documents.

Table 2 A description of datasets

CiteSeer The dataset contains 3312 documents over 6 labels (Agents, IR, DB, AI, HCI, ML). Every document is made of the following views: content, inbound, and cites. The documents are described by 3703 words in the content view and by the 4732 links between them in the inbound and cites views.

Cora contains 2708 documents over 7 labels (Neural Networks, Rule Learning, Reinforcement Learning, Probabilistic Methods, Theory, Genetic Algorithms, Case Based). It has the same views as the CiteSeer dataset: the first view records the absence/presence of words in each publication, while the other views consist of citation links between scientific publications. The documents are described by 1433 words in the content view and by the 5429 links between them in the inbound and cites views.

Cornell contains 195 documents over 5 labels (student, project, course, staff, faculty). It is made of 3 views (content, inbound, cites) on the same documents. The documents are described by 1703 words in the content view and by the 569 links between them in the inbound and cites views.

Texas is one of the four university subsets of the WEBKB dataset. The first view is a document-word matrix, while the other views correspond to document links. The documents are described by 1703 words in the content view and by the 578 links between them in the inbound and cites views, and belong to 5 classes (student, project, course, staff, faculty).

Washington contains 230 documents over 5 labels (student, project, course, staff, faculty). It is made of the content, inbound, and cites views on every document. A set of 1703 words describes the documents in the first view, with 783 links between them in the other views.

Wisconsin is an archive of 265 documents over 5 labels (student, project, course, staff, faculty). It is made of 3 views (content, inbound, cites) on the same documents. The documents consist of 1703 words in the content view, with 938 links between them in the inbound and cites views.

6.2 Evaluation metric

An external index measures the agreement between two partitions, where the first partition is the a priori known clustering labels and the second results from the clustering procedure [18]. We employ the two most widely used external validity indices, accuracy and purity, to evaluate the clustering performance (a sketch of both indices follows their definitions below).

  • Accuracy [34] measures the fraction of samples whose predicted labels, after an optimal permutation mapping, exactly match the corresponding true labels. Accuracy is defined as follows:

    $$\begin{aligned} ACC = \frac{1}{n}\sum \limits _{i = 1}^n {\delta \left( {map\left( {{c_i}} \right) ,{g_i}} \right) } \end{aligned}$$
    (12)

    where n is the total number of samples, \(g_i\) is the ground-truth label, \(\delta (u,v)\) is the delta function that equals 1 if \(u = v\) and 0 otherwise, and \(map (c_i)\) is the permutation mapping function that maps each cluster label \(c_i\) to the equivalent label from the dataset.

  • Purity [16] is an external evaluation criterion of cluster quality. It quantifies, on the unit range from 0 to 1, the extent to which cluster \(C_i\) contains points from only one ground-truth partition. Purity is written as follows:

    $$\begin{aligned} Purity = \frac{1}{D}\sum \limits _{i = 1}^k {\mathop {\max }\limits _{j = 1}^k } \left\{ {{p_{ij}}} \right\} \end{aligned}$$
    (13)

    where D is the number of all documents in the dataset, k is the number of clusters, and \(p_{ij}\) is the probability that a member of cluster j belongs to class i.
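For reference, a compact sketch of both indices follows; labels are assumed to be integer-coded from 0, and the permutation map in Eq. (12) is computed with SciPy's Hungarian solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def accuracy(y_true, y_pred):
    """Eq. (12): clustering accuracy under the best permutation mapping."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    w = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        w[p, t] += 1                         # contingency: cluster p vs class t
    rows, cols = linear_sum_assignment(-w)   # Hungarian: maximize matched pairs
    return w[rows, cols].sum() / y_true.size

def purity(y_true, y_pred):
    """Eq. (13): each cluster is credited with its majority ground-truth class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    hits = sum(np.bincount(y_true[y_pred == c]).max() for c in np.unique(y_pred))
    return hits / y_true.size
```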

6.3 Analysis of the proposed similarity metrics

In all experiments, we use spectral clustering, and the number of clusters k is set equal to the true cluster number of the original dataset. Firstly, we evaluate the performance of the traditional geometrical similarity metrics.

Fig. 2 Performance of the RDSim1-5 metrics

Secondly, we compare the accuracy values among the different variants of RDSim. The accuracy values of these metrics are shown in Table 3.

Table 3 Evaluation of the 8 similarity metrics on the multi-view datasets in terms of accuracy

From Fig. 2, it can be observed that every metric has its advantages and disadvantages depending on the dataset.

Table 3 reveals that \(RDSim_{3}\) and \(RDSim_{5}\) excel on the Washington dataset, while CiteSeer seems to be the most challenging dataset for our metrics. One can see that the \(RDSim_{2}\) metric performs better on the CiteSeer, Cornell, and Texas datasets. We deduce that \(RDSim_{3}\) and \(RDSim_{5}\) are better on data where the variety of documents is greater, and \(RDSim_{2}\) is better when this variety is lower.

The overall evaluation of the 8 metrics is shown in Fig. 3. Among the proposed metrics, \(RDSim_{5}\) yields the poorest accuracy on all datasets compared to the others. This is because it does not take the cosine into account when computing the similarity, which corroborates the hypothesis that the cosine is important but not sufficient on its own to measure document similarity. \(RDSim_{4}\) surpasses \(RDSim_{5}\) but still cannot exceed the other metrics, since it ignores the ED during the similarity computation; this result confirms the hypothesis concerning the ED. It thus appears that combining the ED and CS boosts document similarity measurement.

Fig. 3 Comparison of the 8 similarity metrics on 6 multi-view datasets

6.4 RMDC comparison with other algorithms

Table 4 Accuracy evaluation of the proposed method compared to the state-of-the-art document clustering algorithms

We compare our proposed method to the following algorithms:

  • Multi-view ensemble clustering (MVEC) [3]: The algorithm computes three different similarity matrices named cluster-based similarity matrix, affinity matrix and pair-wise dissimilarity matrix on the individual datasets and aggregates these matrices to form a combined similarity matrix, which serves as the input of a final clustering algorithm.

  • Multi-view concept factorization (MVCF) [8]: the algorithm identifies the underlying coefficient matrices for each view and then fuses them with a multi-manifold regularizer to locally preserve the geometrical structure of the data while learning the individual view weights automatically.

  • NMF model with co-orthogonal constraints (NMF-CC) [33]: NMF-CC adds a co-orthogonal constraint to the representation and basis matrices to further capture the diversity within each view and to learn appropriate basis matrices in which the basis vectors are independent of each other.

  • Cluster-based similarity partitioning algorithm (CSPA) [60]: The algorithm detects the relations among objects in the same cluster by inducing a similarity measure from the partitioning. It then calculates the pairwise similarity between objects and reclusters them using this similarity measure to determine the combined clustering.

  • Weighted hybrid clustering (WHC) [61]: the algorithm first computes a weighted kernel fusion clustering based on voting techniques to obtain individual clustering results from each data source, then combines them using a weighted ensemble clustering technique;

  • Hierarchical ensemble clustering (HEC) [62]: The objective of this algorithm is to connect partition-based and hierarchical clustering. The algorithm uses a set of dendrograms and aggregates them into a distance matrix; a consensus distance is then used to build a structured hierarchy on top of the consensus clustering.

  • Hierarchical combination clustering (HCC) [63]: The algorithm combines results from multiple views using hierarchical clustering: it converts the hierarchical clusterings into matrices describing the dendrogram distances, aggregates them into a final matrix, and uses it for the combined clustering.

  • k-means based co-clustering (kCC) [64]: The algorithm uses a greedy approach but guarantees only a locally optimal solution.

  • Diverse NMF (DiNMF) [65]: DiNMF utilizes a diversity term to explore the diversity across different views. This approach has two parameters, which we set identically to the original literature.

  • Kernel multi-view low-rank sparse subspace clustering (KMLRSSC) [66]: KMLRSSC is a spectral multi-view clustering method with low-rank and sparsity constraints, where a centroid-based scheme is used to learn the consensus matrix.

  • Multi-view clustering via multi-manifold regularized non-negative matrix factorization (MMNMF) [67]: MMNMF incorporates consensus manifold and consensus coefficient matrix with multi-manifold regularization to preserve the locally geometrical structure of the multi-view data space.

  • Multi-view clustering with soft capped norm (SCaMVC) [68]: SCaMVC learns an optimal weight for each view automatically without introducing an additive parameter as previous methods do. Furthermore, to deal with noise and outliers at different levels, it uses a soft capped norm, which caps the residual of outliers at a constant value and provides a probability of a given data point being an outlier.

  • Multi-view capped-norm k-means clustering (CaKMVC) [69]: CaKMVC utilizes a capped-norm-based residual calculation in the objective to remove the effects of outliers.

  • Self-paced and auto-weighted multi-view clustering (SAMVC) [70]: SAMVC learns the MVC model with easy examples first and then progressively considers more complex ones from each view. In addition, a soft weighting scheme of self-paced learning is designed to further reduce the negative impact of outliers and noise.

We quote the results from the original papers [32, 70] for the algorithms whose code is not publicly available. As shown in Table 4, our method outperforms the other models on four of the six multi-view document datasets.

6.5 Analysis of the proposed multi-view document clustering method

We analyze the accuracy scores of the proposed multi-view document clustering algorithm on each benchmark dataset.

Compared to the state-of-the-art algorithms, Table 4 shows that RMDC outperforms them by a significant margin. In particular, the proposed RMDC performs better than the previous methods on the Cornell, Texas, Washington, and Wisconsin datasets. The Citeseer dataset yields the lowest accuracy among all the algorithms tested; we speculate that this is because the variety of documents in this dataset is higher than in the others and the data is more subject to noise.

7 Conclusion and future work

In this work, a robust multi-view document clustering method has been proposed. To that end, we instantiated five similarity measurements and combined these similarity metrics to address the problem of similarity measurement in document clustering. Our metrics calculate the similarity between documents based on their cosine, ED, and MD. The similarity matrices are computed in parallel to diminish the computation cost. Furthermore, we proposed a robust multi-view clustering method tailored to document clustering. The experimental analysis shows that every metric of \(RDSim_{1-5}\) has its advantages and disadvantages depending on the dataset, and that the RMDC approach exceeds diverse advanced multi-view clustering schemes. Despite its good results, the proposed method consumes more space and runs slower as the data dimensionality increases. In future work, we will overcome this issue by combining diverse dimensionality reduction approaches.