8.1 Introduction

The rapid development of information technology and the abundant amount of available data have considerably contributed to the growth of studies on multi-view clustering [8, 32]. Multi-view data is observed from varying points of view, resulting in different representations (views) with distinct statistical properties. In text clustering, these views can be obtained through word frequencies, topic- and context-based representations, and many other embedding models capable of capturing syntactic information, semantic information, or both [14]. The main task of multi-view text clustering is to achieve better clustering by combining the information held by each view, information that is disregarded when only a single view is used. However, efficiently integrating different views while preserving their characteristics remains a challenge. A naive solution for multi-view clustering consists in concatenating the features from all views and then applying a single-view clustering algorithm. Nevertheless, such a combination fails to exploit the specificity of each view. Hence, multiple approaches have been proposed to optimize multi-view clustering [16, 19, 35].

This chapter reviews multi-view methods for text clustering. In fact, textual data was examined early on in the multi-view setting, particularly in cross-lingual text categorization, where the data is labeled in one view and not in another and the aim is to use the information in both views to label all data [1, 28, 30]. With the abundance of unlabeled data, this process was extended to multi-view text clustering [13].

The remainder of this chapter is organized as follows: Sect. 8.2 presents an overview of existing multi-view clustering methods, specifically for text clustering. Section 8.3 evaluates the performance of these methods on real-world textual data. Finally, Sect. 8.4 presents the conclusion and current challenges.

8.2 Overview of Multi-View Textual Data Clustering

The main challenge of multi-view clustering consists in integrating the different views while taking advantage of the characteristics of each view to improve the clustering results. An intuitive solution consists of concatenating all features from all views and applying a clustering algorithm afterwards; this, however, ignores the statistical properties of each view and can conceal valuable information [3]. To this end, according to the integration scheme, existing multi-view clustering methods can be presented under three main categories [22]. The first category, called late integration, derives clustering results from each view individually; a fusion step is then applied to reach a consensus clustering [7, 29]. The second category is based on co-training, which incorporates multi-view integration into the clustering process directly by jointly optimizing the objective function [2, 17]. The third category is based on space learning, such that views are mapped into a new space to reveal the latent data structure. We present in the following the characteristics of each category and detail a number of existing methods.

8.2.1 Late Integration Based Methods

The late integration approach, also known as late fusion, consists of applying a clustering algorithm on each view individually and subsequently combining the results into a consensus clustering. The idea is to examine the relations between the clusters derived from each view rather than the relations between data points. The combination of clustering results can be obtained using different methods, such as latent probabilistic models [7] or, more recently, ensemble methods [9, 13, 26]. Figure 8.1 presents the overall process of late integration based methods.

Fig. 8.1
figure 1

General process of late integration based methods

8.2.1.1 Ensemble Methods for Multi-View Text Clustering

Xie et al. [31] proposed multi-view clustering ensembles, a combination of multi-view clustering algorithms and ensemble clustering. The method extends both multi-view kernel K-means [27] and multi-view spectral clustering [16] to ensemble clustering and compares the two extensions. Given a data set X, different clustering results {π 1, π 2, …, π H} are obtained through different runs of the clustering algorithm. These clusterings are then combined based on plurality voting, i.e., considering the majority cluster label for each data point, to give the final clustering π.

Algorithm 1: Multi-view clustering ensembles
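As an illustration, the sketch below implements the consensus step of such an ensemble in Python (NumPy and SciPy assumed). Since cluster labels are arbitrary across runs, each partition is first aligned to a reference partition through maximum-overlap (Hungarian) matching, a common implementation choice assumed here rather than taken from [31], and the final label of each point is then obtained by plurality voting.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(reference, partition, n_clusters):
    """Relabel `partition` so its cluster ids best match `reference`
    (maximum-overlap matching via the Hungarian algorithm)."""
    reference, partition = np.asarray(reference), np.asarray(partition)
    cost = np.zeros((n_clusters, n_clusters))
    for r in range(n_clusters):
        for c in range(n_clusters):
            cost[r, c] = -np.sum((reference == r) & (partition == c))
    rows, cols = linear_sum_assignment(cost)
    mapping = {c: r for r, c in zip(rows, cols)}
    return np.array([mapping[label] for label in partition])

def plurality_voting(partitions, n_clusters):
    """Combine clusterings {pi_1, ..., pi_H} by majority vote per data point."""
    reference = np.asarray(partitions[0])
    aligned = [reference] + [align_labels(reference, p, n_clusters)
                             for p in partitions[1:]]
    aligned = np.vstack(aligned)                      # H x N matrix of labels
    return np.apply_along_axis(
        lambda col: np.bincount(col, minlength=n_clusters).argmax(),
        axis=0, arr=aligned)
```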

Hussain et al. [13] presented a late integration framework for multi-view document clustering based on ensemble methods. The proposed method first converts views into term-weighted matrices using two weighting schemes: TF-IDF and TF-ICF [24]. Hierarchical clustering is then applied on each view individually to obtain different partitions. In order to aggregate the clustering results, different ensemble techniques are adopted: Cluster Based Similarity Partitioning (CBSP) [25], Pairwise Dissimilarity (PD) [36], and the affinity matrix based technique. Each ensemble technique provides a similarity matrix, and these matrices are then aggregated into an overall similarity matrix used for the final clustering. Similarly, Fraj et al. [9] proposed multi-view ensemble methods for text clustering (MEMTC) based on multiple representations. The main idea consists of integrating different text representation models, TF-IDF, LDA, and skip-gram, to generate, respectively, syntactic, topic, and semantic views. Lastly, the ensemble techniques CBSP and Pairwise Dissimilarity are used to aggregate the different partitions yielded by each view. The main steps of MEMTC are presented in Algorithm 2.

Algorithm 2: Ensemble methods for multi-view clustering
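A minimal sketch of the aggregation stage is given below; it approximates CBSP by a co-association matrix, i.e., the per-view partitions are turned into pairwise similarity matrices that are averaged into an overall similarity matrix, which is then clustered. The use of spectral clustering on the aggregated matrix is an assumption for illustration, not necessarily the consensus function used in [9, 13].

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def coassociation_matrix(partition):
    """N x N binary matrix: 1 if two documents share a cluster in this partition."""
    p = np.asarray(partition)
    return (p[:, None] == p[None, :]).astype(float)

def ensemble_clustering(view_partitions, n_clusters):
    """Average the per-view similarity matrices into an overall matrix,
    then derive the consensus clustering from it."""
    S = np.mean([coassociation_matrix(p) for p in view_partitions], axis=0)
    model = SpectralClustering(n_clusters=n_clusters, affinity='precomputed',
                               random_state=0)
    return model.fit_predict(S)

# Example: three views, each producing its own partition of 6 documents
consensus = ensemble_clustering(
    [[0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2], [0, 1, 1, 1, 2, 2]],
    n_clusters=3)
```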

8.2.1.2 Multi-View Clustering Based on Latent Models

Bruno et al. [7] proposed an integration framework based on latent models for document clustering. In this work, documents from each view are clustered into k v clusters. The set of clusterings \(\{c_1^v,\ldots ,c_k^v\}\) is then concatenated into a K × M matrix C, such that K =∑v k v is the total number of clusters over all views. Based on C, a joint probability \(P(c_k,c_{k'})\) is derived to deduce the pairwise cluster agreement, which represents the number of documents belonging simultaneously to clusters c k and \(c_{k'}\). The joint cluster-cluster probability is defined as follows:

$$\displaystyle \begin{aligned} P(c_k,c_{k'}) &= \sum_i P(c_k|x_i)P(c_{k'}|x_i)P(x_i) \\ &=\sum_i \dfrac{P(c_k,x_i)P(c_{k'},x_i)}{P(x_i)} \end{aligned} $$
(8.1)

where the joint cluster-document probability is obtained by:

$$\displaystyle \begin{aligned} P(c_k,x_i) = \dfrac{{\mathbf{C}}_{k,i}}{MN},\quad \forall k \in [1,K], \forall i \in [1,N] \end{aligned} $$
(8.2)

To obtain the final clustering for each document, Probabilistic Latent Semantic Analysis (PLSA) [12] is adopted to derive latent variables z j such that

$$\displaystyle \begin{aligned} P(c_k,c_{k'}) = P(c_{k'})\sum_{j=1}^L P(c_k|z_j)P(z_j|c_{k'}) \end{aligned} $$
(8.3)

PLSA seeks to find the relationship between the cluster observations across different views and the latent variables z. The overall clustering is established by assigning to document x i the discrete variable z j that maximizes the following posterior probability:

$$\displaystyle \begin{aligned} P(z_j|x_i) = \dfrac{\sum_k P(z_j|c_k) P(c_k,x_i)}{P(x_i)} \end{aligned} $$
(8.4)

To estimate the latent variables, experiments were carried out using the Expectation-Maximization (EM) algorithm and Nonnegative Matrix Factorization (NMF) [11], where both methods performed similarly. Algorithm 3 summarizes the main steps of the approach.

Algorithm 3: A late fusion approach using latent models
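The following sketch illustrates Eqs. 8.1-8.4 with a simplified estimation: the membership matrix C is built from the per-view partitions, the joint probabilities are computed as above, and an off-the-shelf NMF is used as a surrogate for the PLSA/EM estimation of the latent factors. The normalization of the NMF factor into P(z|c) and the use of scikit-learn are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

def late_fusion_latent(view_partitions, cluster_counts, n_latent):
    """Late fusion of per-view clusterings using cluster-cluster probabilities
    and an NMF surrogate for the PLSA estimation (sketch in the spirit of Algorithm 3)."""
    M = len(view_partitions)               # number of views
    N = len(view_partitions[0])            # number of documents
    K = sum(cluster_counts)                # total number of clusters over views

    # Concatenated K x N membership matrix C
    C = np.zeros((K, N))
    offset = 0
    for labels, k_v in zip(view_partitions, cluster_counts):
        for i, c in enumerate(labels):
            C[offset + c, i] = 1.0
        offset += k_v

    P_cx = C / (M * N)                     # P(c_k, x_i), Eq. 8.2
    P_x = P_cx.sum(axis=0)                 # P(x_i)
    P_cc = (P_cx / P_x) @ P_cx.T           # P(c_k, c_k'), Eq. 8.1

    # NMF surrogate for PLSA: P_cc ~ W H with n_latent components (cf. Eq. 8.3)
    nmf = NMF(n_components=n_latent, init='nndsvda', max_iter=500, random_state=0)
    W = nmf.fit_transform(P_cc)            # ~ joint cluster/latent weights
    P_z_given_c = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)

    # Posterior P(z_j | x_i) (Eq. 8.4) and final assignment
    P_zx = (P_z_given_c.T @ P_cx) / P_x
    return P_zx.argmax(axis=0)
```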

8.2.2 Co-training Based Methods

Co-training based methods seek to find a consensus by maximizing the mutual agreement across all views. Co-training was introduced by Blum et al. [4] in order to tackle semi-supervised problems. Given the abundance of unlabeled data, such data can be used to enrich the labeled training set: given two views, the learning algorithm is trained on the labeled data of both views in a bootstrapping manner. Finally, based on the consensus principle, the views should agree on all labeled data. Eventually, co-training was adopted in unsupervised learning [3] and has shown good performance despite the absence of labeled data. In general, co-training based methods rely on three main assumptions: sufficiency, each view is sufficient to perform the clustering task on its own; compatibility, each pair of views predicts with high probability the same label for data points with co-occurring features; and conditional independence, the views are conditionally independent given the class label [16]. Figure 8.2 presents the general process of co-training based methods.

Fig. 8.2
figure 2

General process of co-training

8.2.2.1 Multi-View K-Means Based Methods

Multi-view clustering was advanced by Bickel et al. [3], whose empirical results show that the proposed multi-view spherical k-means improves the quality of document clustering in comparison to the single-view version of the algorithm. The presented co-training algorithm is based on the following assumptions: given two views v 1 and v 2, each view is sufficient to output clustering results by itself, and the views are conditionally independent given the class label. The clustering process starts by randomly initializing the set of parameters Θv, including the centers \(c^v_j, j= 1, \ldots , k \), where k is the desired number of clusters and v = 1 or v = 2. Documents are then assigned to the cluster with the smallest computed distance to \(c^v_j\). A two-step iterative process is applied afterwards, taking turns between views. The first step consists of updating the cluster centers such that:

$$\displaystyle \begin{aligned} c_j^v = \frac{\sum_{x^v\in\pi^v_j} x^v}{ \| \sum_{x^v \in \pi^v_j} x^v \|} \end{aligned} $$
(8.5)

where \(\pi ^v_j\) is the jth partition given the vth view. The assignment step consists of calculating the distances between documents and centers and finding the new partitions. After each iteration, partitions are exchanged so that the update and assignment steps are performed on the other view. For the final clustering, consensus centers are calculated by considering the documents that both views agree on, such that:

$$\displaystyle \begin{aligned} cons\_c_j^v = \frac{\sum_{x_i^1 \in \pi^1_j \wedge x_i^2 \in \pi^2_j} x_i^v}{ \| \sum_{x_i^1 \in \pi^1_j \wedge x_i^2 \in \pi^2_j} x_i^v \|} \end{aligned} $$
(8.6)

The final partitioning is obtained by assigning documents to the closest consensus vector. Given that the algorithm is based on alternating partitions between views, convergence is not guaranteed. The main steps of multi-view spherical k-means are presented in Algorithm 4.

Algorithm 4: Multi-view spherical k-means
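Below is a hedged NumPy sketch of the two-view case: rows are L2-normalized, centers are updated per Eq. 8.5 from the partition handed over by the other view, and consensus centers follow Eq. 8.6. The random initialization, the fixed number of iterations, and the final assignment rule (summing cosine similarities to the per-view consensus centers) are implementation assumptions.

```python
import numpy as np

def normalize_rows(X):
    return X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)

def multiview_spherical_kmeans(X1, X2, k, n_iter=20, seed=0):
    """Two-view spherical k-means with partition exchange (sketch of Algorithm 4)."""
    rng = np.random.default_rng(seed)
    views = [normalize_rows(X1), normalize_rows(X2)]
    centers = [normalize_rows(rng.normal(size=(k, views[v].shape[1])))
               for v in range(2)]
    labels = [np.argmax(views[0] @ centers[0].T, axis=1)] * 2
    labels = [labels[0].copy(), labels[0].copy()]

    for t in range(n_iter):
        v, other = t % 2, 1 - (t % 2)         # alternate between views
        # Update centers of view v from the partition handed over by the other view (Eq. 8.5)
        for j in range(k):
            members = views[v][labels[other] == j]
            if len(members):
                centers[v][j] = members.sum(axis=0)
                centers[v][j] /= max(np.linalg.norm(centers[v][j]), 1e-12)
        # Assignment step: cosine similarity equals dot product on unit vectors
        labels[v] = np.argmax(views[v] @ centers[v].T, axis=1)

    # Consensus centers over documents both views agree on (Eq. 8.6)
    cons = []
    for v in range(2):
        c = np.zeros((k, views[v].shape[1]))
        for j in range(k):
            agree = (labels[0] == j) & (labels[1] == j)
            if agree.any():
                c[j] = views[v][agree].sum(axis=0)
                c[j] /= max(np.linalg.norm(c[j]), 1e-12)
        cons.append(c)
    # Final partition: closest consensus vector (similarities summed over views)
    sim = sum(views[v] @ cons[v].T for v in range(2))
    return np.argmax(sim, axis=1)
```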

Bettoumi et al. [2] proposed a collaborative multi-view K-means (Co-K-means) that introduces an interconnection term to overcome inter-view disagreement. Views are encouraged to reach an agreement by minimizing the contradiction across partitions. To this end, the K-means objective function is altered such that:

$$\displaystyle \begin{aligned} \Omega = \sum_v \sum_i \sum_k \|{\mathbf{x}}^v_i - {\mathbf{c}}^v_k\|{}^2_2 + \mu \varphi \end{aligned} $$
(8.7)

where μ is a modulation parameter and φ is the interconnection term denoted by:

$$\displaystyle \begin{aligned} \varphi = \dfrac{1}{|V|-1}\sum_{v>v'} \sum_i^n \sum_k (\|{\mathbf{x}}^v_i - {\mathbf{c}}^v_k\|{}^2_2 -\|{\mathbf{x}}^{v'}_i - {\mathbf{c}}^{v'}_k\|{}^2_2) \end{aligned} $$
(8.8)

Similarly to the classic K-means, the proposed algorithm starts by randomly initializing the cluster centers for each view, followed by an assignment step. Then, for each view, new centers are computed. The interconnection term φ aims to reduce the distance between the partitions yielded by each view. The main steps of Co-K-means are given in Algorithm 5.

Algorithm 5: Collaborative multi-view K-means
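The snippet below simply evaluates the objective Ω and the interconnection term φ exactly as stated in Eqs. 8.7-8.8, for given data, centers, and a modulation parameter μ; it does not reproduce the update rules of [2], which are not detailed here.

```python
import numpy as np

def interconnection_term(X_views, centers):
    """Interconnection term phi (Eq. 8.8), computed exactly as stated.
    Assumes the same number of clusters k in every view."""
    V = len(X_views)
    phi = 0.0
    for v in range(V):
        for w in range(v):                       # pairs of distinct views
            d_v = ((X_views[v][:, None, :] - centers[v][None, :, :]) ** 2).sum(-1)
            d_w = ((X_views[w][:, None, :] - centers[w][None, :, :]) ** 2).sum(-1)
            phi += (d_v - d_w).sum()             # difference of point-to-center distances
    return phi / (V - 1)

def co_kmeans_objective(X_views, centers, mu):
    """Objective Omega (Eq. 8.7): per-view distortion plus mu * interconnection."""
    distortion = sum(((X[:, None, :] - C[None, :, :]) ** 2).sum()
                     for X, C in zip(X_views, centers))
    return distortion + mu * interconnection_term(X_views, centers)
```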

8.2.2.2 Self-Organizing Map Multi-View Clustering

Fraj et al. [10] proposed a multi-view clustering method based on the Self-Organizing Map (SOM) neural network [15]. Similarly to [9], each view corresponds to a text representation model, i.e., TF-IDF, LDA, and skip-gram. The views are presented as input layers, such that each document has three vector representations x = {x 1, x 2, x 3}. Documents are then mapped onto the output layer, such that each document is assigned to a node on the map. Consequently, each node (neuron) of the output layer is defined by one prototype w per view. First, the learning process consists in generating random SOM prototypes W v. Then, an overall distance is calculated between each document \({\mathbf {x}}_i^v\) in view v and the node w such that

$$\displaystyle \begin{aligned} D = \sum_{v=1}^{3} D_v(x^v,w) \end{aligned} $$
(8.9)

The node with the smallest distance is considered the Best Matching Unit (BMU), to which the document x i is assigned. The number of nodes on the output map is set empirically to boost the performance of the SOM learning; this number, however, may not coincide with the desired number of clusters, which is usually smaller. Therefore, the nodes on the map are clustered using agglomerative hierarchical clustering, and each document is assigned to the same cluster as its corresponding SOM node. The main steps of MVSOM are presented in Algorithm 6.

Algorithm 6: Self-organizing map for multi-view text clustering
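A compact sketch of this procedure is given below: each map node keeps one prototype per view, the best matching unit is selected from the summed per-view distances (Eq. 8.9), prototypes are updated with a standard Gaussian neighborhood, and the nodes are finally grouped by agglomerative hierarchical clustering. The grid size, learning rate, and neighborhood schedule are assumptions, not the settings of [10].

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def mvsom(views, grid=(5, 5), n_clusters=5, n_epochs=10, lr0=0.5, sigma0=1.5, seed=0):
    """Multi-view SOM followed by hierarchical clustering of the map nodes
    (sketch in the spirit of Algorithm 6)."""
    rng = np.random.default_rng(seed)
    n_nodes = grid[0] * grid[1]
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])
    W = [rng.normal(scale=0.1, size=(n_nodes, X.shape[1])) for X in views]  # per-view prototypes
    n_docs = views[0].shape[0]

    for epoch in range(n_epochs):
        lr = lr0 * (1 - epoch / n_epochs)
        sigma = max(sigma0 * (1 - epoch / n_epochs), 0.5)
        for i in rng.permutation(n_docs):
            # Overall distance summed over views (Eq. 8.9) and best matching unit
            dist = sum(((W[v] - views[v][i]) ** 2).sum(axis=1) for v in range(len(views)))
            bmu = dist.argmin()
            # Gaussian neighborhood update of every node, in every view
            h = np.exp(-((coords - coords[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
            for v in range(len(views)):
                W[v] += lr * h[:, None] * (views[v][i] - W[v])

    # Cluster the nodes (concatenated prototypes), then map documents to node clusters
    node_labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(np.hstack(W))
    doc_bmu = np.array([
        sum(((W[v] - views[v][i]) ** 2).sum(axis=1) for v in range(len(views))).argmin()
        for i in range(n_docs)])
    return node_labels[doc_bmu]
```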

8.2.2.3 Multi-View Spectral Clustering

Kumar et al. [16] presented a co-training based spectral clustering, where two views exchange the eigenvectors resulting from the graph Laplacian of each view. The algorithm enforces consistency across views: if two points are assigned to the same cluster in one view, they should be assigned to the same cluster in all views; conversely, if two points belong to different clusters in one view, they should be clustered separately across all views. The proposed algorithm first builds an adjacency matrix A v for each view, from which the graph Laplacian matrix L v is obtained such that:

$$\displaystyle \begin{aligned} {\mathbf{L}}^v={{\mathbf{D}}^v}^{-1/2}{\mathbf{A}}^v {{\mathbf{D}}^v}^{-1/2} \end{aligned} $$
(8.10)

where D v is the diagonal matrix such that \({\mathbf {D}}^v_{ii}=\sum \limits _j {\mathbf {A}}^v_{ij}\). The k largest eigenvectors of L v hold the discriminative information for clustering. Thus, the eigenvectors are exchanged across views to propagate the per-view clustering information, such that the largest k eigenvectors form the matrix U v. Precisely, the co-trained spectral clustering uses the eigenvectors of one view to modify the adjacency matrix of the other view and consequently the graph structure, such that each column a i of A v represents the similarity of the data point i with all points in the graph. The algorithm projects the column vectors of one view in the direction of the k eigenvectors of the other view, then back-projects them to the original space to obtain the modified graph. To obtain the updated adjacency matrix S v, a symmetrization step is performed such that:

$$\displaystyle \begin{aligned} {\mathbf{S}}^v = \mathit{sym}({\mathbf{U}}^{\bar{v}}{{\mathbf{U}}^{\bar{v}}}^T {\mathbf{A}}^v) \end{aligned} $$
(8.11)

where sym(S) = (S + S T)∕2. The new graph Laplacian L v is obtained from S v, from which the k eigenvectors and U v are deduced. The algorithm performs these steps for a defined number of iterations. The final clustering is given by the k-means algorithm performed on the matrix V, the column-wise concatenation of the U v. The main steps of co-training multi-view spectral clustering are given in Algorithm 7.

Algorithm 7: Co-training based multi-view spectral clustering
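The sketch below follows Algorithm 7 for two views: an RBF kernel is assumed for the adjacency matrices (the choice of similarity is not specified above), the normalized Laplacian of Eq. 8.10 and its top-k eigenvectors are computed, the cross-view modification of Eq. 8.11 is iterated, and k-means is run on the concatenated embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def normalized_laplacian(A):
    """L = D^{-1/2} A D^{-1/2} (Eq. 8.10)."""
    d = np.maximum(A.sum(axis=1), 1e-12)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A @ D_inv_sqrt

def top_eigenvectors(L, k):
    vals, vecs = np.linalg.eigh(L)
    return vecs[:, -k:]                       # k largest eigenvectors

def cotrain_spectral(X1, X2, k, n_iter=5, gamma=0.5):
    """Co-training based multi-view spectral clustering (sketch of Algorithm 7)."""
    A = [rbf_kernel(X1, gamma=gamma), rbf_kernel(X2, gamma=gamma)]
    U = [top_eigenvectors(normalized_laplacian(A[v]), k) for v in range(2)]

    sym = lambda S: (S + S.T) / 2.0
    for _ in range(n_iter):
        # Each view's graph is modified by the other view's eigenvectors (Eq. 8.11)
        S = [sym(U[1] @ U[1].T @ A[0]), sym(U[0] @ U[0].T @ A[1])]
        U = [top_eigenvectors(normalized_laplacian(S[v]), k) for v in range(2)]

    V = np.hstack(U)                          # column-wise concatenation of U^v
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(V)
```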

Lin et al. [20] proposed Multi-view Proximity Learning for Clustering (MVPL), a method that learns the proximity matrix based on data representatives and spectral embedding. Given a set of multi-view data \(\mathbf {X}=\{{\mathbf {X}}^1,{\mathbf {X}}^2,\ldots , {\mathbf {X}}^m\} \in \mathbb {R}^{d^v \times n}\), a data representative matrix \({\mathbf {U}}^v \in \mathbb {R}^{d^v \times n}\) is associated with each view to exploit the relations between objects within the same view. The data representatives take into account the proximity between each pair of data points; the learned similarity matrix is therefore affected by these representatives, and vice versa. On the other hand, MVPL considers the spectral embedding of the data to integrate the different views and thus incorporates the inter-view relations into the similarity matrix. The goal of MVPL is to minimize the following objective function:

$$\displaystyle \begin{aligned} & \min\limits_{\{{\mathbf{U}}^v\},\{{\mathbf{S}}^v\}, \mathbf{F}} \dfrac{1}{n}\sum_{v=1}^m \bigg( \sum_{i=1}^n \|{\mathbf{x}}_i^v - {\mathbf{u}}_i^v\|{}_2^2 + \dfrac{\alpha}{n^2} \bigg(\sum_{i,j=1}^n\|{\mathbf{u}}_i^v - {\mathbf{u}}_j^v\|{}_2^2s_{ij} + \beta \| \mathbf{S}\|{}_F^2\bigg) \bigg)\\ &\quad + \gamma \dfrac{1}{2n^2} \sum_{v=1}^m\sum_{i,j=1}^n s_{ij}^v \|{\mathbf{f}}_i -{\mathbf{f}}_j\|{}^2_2 \\ &\quad {s.t} \; \sum_{j=1}^n s_{ij}=1, s_{ij}^v \geq 0, \forall i,j, \mathbf{FF}^T= \mathbf{I} \end{aligned} $$
(8.12)

where the first term considers the impact of the data representatives, while the second term models the relation between the spectral embedding matrix F and the similarity matrices S v; γ is a trade-off parameter that balances the two terms, α controls the distance between the original data features and their representatives, and β controls the sparsity of S. Algorithm 8 describes the main steps of MVPL.

Algorithm 8: Multi-view proximity learning

8.2.3 Subspace Clustering Based Methods

The third category of multi-view clustering is based on subspace learning. Recently, more and more studies have exploited subspace clustering to extract distinct clustering features. Multi-view subspace clustering assumes that the data samples from different views share the same subspace [33]. Figure 8.3 illustrates the process of learning a shared subspace from multi-view data. The performance of subspace clustering relies on the latent representation matrix obtained from the different multi-view subspaces. Several methods have been proposed in order to identify the common subspace; we distinguish two main subcategories: NMF based methods and latent representation based methods.

Fig. 8.3
figure 3

General process of multi-view subspace clustering

8.2.3.1 Multi-View Subspace Clustering Based on Nonnegative Matrix Factorization

Liu et al. [22] proposed MultiNMF, a multi-view clustering method via joint nonnegative matrix factorization. The algorithm regularizes each view's indicator matrix towards a common consensus. Given multi-view data \({\mathbf {X}}^v \in \mathbb {R}^{d \times n}_+\), its matrix factorization is:

$$\displaystyle \begin{aligned} {\mathbf{X}}^v \approx {\mathbf{U}}^v{{\mathbf{V}}^v}^T \end{aligned} $$
(8.13)

where \({\mathbf {V}}^v \in \mathbb {R}^{n \times k}_+\) and \({\mathbf {U}}^v \in \mathbb {R}^{d \times k}_+\) represent the indicator matrix and the basis matrix of view v, respectively. MultiNMF adopts a normalization constraint so that all indicator matrices are comparable and meaningful for clustering. The problem can be defined as the joint minimization of the following objective function:

$$\displaystyle \begin{aligned} &\sum_v^m \|{\mathbf{X}}^v-{\mathbf{U}}^v{{\mathbf{V}}^v}^T\|{}_F^2 + \sum_v^m \lambda_v\|{\mathbf{V}}^v{\mathbf{Q}}^v - {\mathbf{V}}^*\|{}_F^2\\ &s.t \; v \in \{1,\ldots,m\}, {\mathbf{U}}^v \geqslant 0, {\mathbf{V}}^v \geqslant 0, {\mathbf{V}}^* \geqslant 0 \end{aligned} $$
(8.14)

where V ∗ is the consensus matrix, and Q v is a diagonal matrix such that:

$$\displaystyle \begin{aligned} {\mathbf{Q}}^v = Diag \left ( \sum_{j=1}^d {\mathbf{U}}_{j1}^v,\sum_{j=1}^d{\mathbf{U}}_{j2}^v,\ldots, \sum_{j=1}^d{\mathbf{U}}_{jk}^v \right) \end{aligned} $$
(8.15)

Finally, the clustering assignment of data point i is computed as \( \operatorname *{\mathrm {argmax}}_k {\mathbf {V}}_{ik}^*\). The main steps of MultiNMF are given in Algorithm 9.

Algorithm 9: Multi-view NMF
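For illustration, the sketch below computes the quantities of Eqs. 8.13-8.15 with independent per-view NMF factorizations and a naive consensus taken as the weighted average of the normalized indicator matrices; it is not the joint multiplicative-update optimization of [22], only a way to see how Q v, V ∗, and the objective of Eq. 8.14 fit together.

```python
import numpy as np
from sklearn.decomposition import NMF

def multinmf_sketch(views, k, lambdas=None):
    """Per-view NMF plus a naive consensus of normalized indicator matrices.
    `views` are nonnegative d x n matrices (e.g., TF-IDF), one per view."""
    m = len(views)
    lambdas = lambdas or [1.0] * m
    U, V = [], []
    for Xv in views:                          # X^v ~ U^v V^v^T (Eq. 8.13)
        nmf = NMF(n_components=k, init='nndsvda', max_iter=500, random_state=0)
        U_v = nmf.fit_transform(Xv)           # d x k basis matrix
        V_v = nmf.components_.T               # n x k indicator matrix
        U.append(U_v)
        V.append(V_v)

    # Normalization Q^v = Diag(column sums of U^v) (Eq. 8.15) makes the V^v comparable
    Q = [np.diag(U_v.sum(axis=0)) for U_v in U]
    V_norm = [V_v @ Q_v for V_v, Q_v in zip(V, Q)]

    # Naive consensus: weighted average of the normalized indicator matrices
    V_star = sum(l * Vn for l, Vn in zip(lambdas, V_norm)) / sum(lambdas)

    # Objective value of Eq. 8.14 for these (fixed) factors
    obj = sum(np.linalg.norm(X - U_v @ V_v.T, 'fro') ** 2
              for X, U_v, V_v in zip(views, U, V))
    obj += sum(l * np.linalg.norm(Vn - V_star, 'fro') ** 2
               for l, Vn in zip(lambdas, V_norm))

    labels = V_star.argmax(axis=1)            # cluster of point i = argmax_k V*_{ik}
    return labels, obj
```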

Zhang et al. [34] proposed a constrained NMF based clustering (CMVNMF) that uses inter-view must-link (ML) and cannot-link (CL) constraints in order to minimize the disagreement between each pair of views. To accomplish the clustering task, the following objective function is minimized:

$$\displaystyle \begin{aligned} \|{\mathbf{X}}^v - {\mathbf{U}}^v {{\mathbf{V}}^v}^T \| + \beta \sum_{v,v' \in [1,m]} \Delta_{v,v'} \quad s.t \; {\mathbf{U}}^v \geq 0, {\mathbf{V}}^v \geq 0 \end{aligned} $$
(8.16)

where β is a regularization parameter, and \(\Delta _{v,v'}\) measures the disagreement between views v and v′ such that:

$$\displaystyle \begin{aligned} \Delta_{v,v'} = \sum_{({\mathbf{x}}_i^v, {\mathbf{x}}_j^{v'})\in ML^{v,v'}} ({\mathbf{v}}_i- {\mathbf{v}}^{\prime}_j) + 2 \sum_{({\mathbf{x}}_i^v, {\mathbf{x}}_j^{v'})\in CL^{v,v'}} {\mathbf{v}}_i {\mathbf{v}}^{\prime}_j \end{aligned} $$
(8.17)

The must-link and cannot-link constraints are defined by matrices \({\mathbf {M}}^{vv'}\) and \({\mathbf {C}}^{vv'}\), respectively, such that:

$$\displaystyle \begin{aligned} {\mathbf{M}}_{ij}^{vv'}=\left\{\begin{matrix} 1,& (x^v_i,x_j^{v'} ) \in ML ^{v,v'} \\ 0, & \text{otherwise} \end{matrix}\right. \end{aligned}$$
$$\displaystyle \begin{aligned} {\mathbf{C}}_{ij}^{vv'}=\left\{\begin{matrix} 1, & (x^v_i,x_j^{v'} ) \in CL ^{v,v'} \\ 0,& \text{otherwise} \end{matrix}\right. \end{aligned}$$

The distance between a pair of data points from different views that belong to the same cluster is minimized through the must-link constraints, while the cannot-link constraints aim to maximize the distance between data points belonging to different views and different clusters. The main steps of CMVNMF are given in Algorithm 10.

Algorithm 10: Constrained multi-view NMF

8.2.3.2 Multi-View Subspace Clustering Based on Shared Latent Representation

Zhang et al. [33] proposed Latent Multi-view Subspace Clustering (LMSC), which is based on the assumption that multi-view data share a latent subspace representation. LMSC learns a common representation from the different views based on subspace clustering. First, the original multi-view data X v is reconstructed based on projection models P v to achieve a common latent representation H such that:

$$\displaystyle \begin{aligned} {\mathbf{x}}_i^v= {\mathbf{P}}^v {\mathbf{h}}_i + {\mathbf{e}}_i^v \end{aligned} $$
(8.18)

where \({\mathbf {e}}_i^v\) denotes the reconstruction error. Then, the latent representation is integrated into subspace clustering, such that the clustering problem is defined as:

$$\displaystyle \begin{aligned} \min\limits_{\mathbf{Z}} \mathit{L}_r (\mathbf{H,HZ}) + \alpha \Omega(\mathbf{Z}) \end{aligned} $$
(8.19)

where Z is the subspace representation matrix, L r(⋅) is the loss function of the subspace reconstruction, Ω(⋅) corresponds to the regularization term, and α balances the regularization. By introducing the parameters λ 1 and λ 2, the overall objective function of LMSC becomes:

$$\displaystyle \begin{aligned} &\min\limits_{\mathbf{P,H,Z,E_{\mathit{h}},E_{\mathit{r}}}} \|{\mathbf{E}}_h\|{}_{2,1} + \lambda_1 \|{\mathbf{E}}_r\|{}_{2,1} + \lambda_2 \|\mathbf{Z}\|{}_* \\ &s.t \quad \mathbf{X = PH + E_{\mathit{h}}}, \mathbf{H = HZ + E_{\mathit{r}}}, \text{and} \ \mathbf{PP}^T=1 \end{aligned} $$
(8.20)

The ℓ 2,1 norm ensures robustness in the presence of noise, while the nuclear norm captures the underlying clustering structure. To solve Eq. 8.20, the error matrices E h and E r are vertically concatenated, and the Augmented Lagrangian Multiplier with Alternating Direction Minimization (ALM-ADM) strategy proposed in [21] is adopted. The main steps of LMSC are given in Algorithm 11.

Algorithm 11: Latent multi-view subspace clustering

Brbic et al. [6] proposed multi-view low-rank and sparse subspace clustering (MLRSSC) with two regularization schemes: pairwise and centroid based. The first establishes a pairwise agreement across views, whereas the second coerces the representations towards a common centroid, as first introduced by Kumar et al. [17]. Both schemes are based on constructing a low-rank and sparse affinity matrix from the multi-view data. Given a set of multi-view data X v, MLRSSC aims to find a joint representation matrix C that represents an agreement across views by minimizing the following objective function:

$$\displaystyle \begin{aligned} \min_{{\mathbf{C}}^{(1)},{\mathbf{C}}^{(2)},\ldots,{\mathbf{C}}^{(m)}} &\sum_{v=1}^{m}(\beta_1\left \|{\mathbf{C}}^{v} \right \|{}_* + \beta_2\left \|{\mathbf{C}}^{v} \right \|{}_1 ) + \sum_{1\leq v, w\leq m,v\neq w}\lambda^{v}\left \|{\mathbf{C}}^{v} - {\mathbf{C}}^{w}\right \|{}_F^2\\ &\text{s.t.} \quad {\mathbf{X}}^v = {\mathbf{X}}^{v}{\mathbf{C}}^{v}, \quad diag({\mathbf{C}}^{v})=0, \quad v=1,\ldots,m, \end{aligned} $$
(8.21)

where C v is the representation matrix of view v, β 1 and β 2 are the balancing parameters of the low-rank and sparsity constraints, and λ v is the consensus parameter. In the case where all views are considered equally important, the same λ v is used for all views. Minimizing the last term enforces pairwise agreement across views. To solve the problem in Eq. 8.21, the Alternating Direction Method of Multipliers (ADMM) strategy is used [5]. Algorithms 12 and 13 summarize the steps of pairwise MLRSSC and centroid-based MLRSSC, respectively.

Algorithm 12: Pairwise MLRSSC

Algorithm 13: Centroid-based MLRSSC

8.2.4 Summary of Multi-View Methods for Text Clustering

Compared to single-view data, multi-view data presents multiple advantages given its ability to describe objects from different aspects and thus give a more comprehensive representation of the data. However, the manipulation and exploitation of multi-view data require more advanced algorithms in order to mine the complementarity between views and discover knowledge that would otherwise remain hidden in a single-view framework. Multi-view data is furthermore challenging in the case of unlabeled data, given that no prior knowledge is available. The existing multi-view clustering algorithms, such as the ones presented in this chapter, have shown good performance in dealing with different aspects of multi-view data, such as finding a consensus across views, integrating the information provided by each view, and discovering hidden patterns.

Multiple methods for multi-view text clustering rely on a single representation model, usually TF-IDF [16]. Although this model is capable of capturing the syntactic properties of text, it is unable to give insight into semantic concepts or topically related features of text data. To this end, other methods have exploited different representation models, such as TF-ICF in [13] or topic models and word embeddings [9, 10]. Table 8.1 summarizes the characteristics of multi-view clustering methods.

Table 8.1 Summarization of multi-view clustering methods

8.3 Experiments

We evaluate in this section the performance of multi-view clustering methods on text data. We select methods from each category: MEMTC [9], MVEM [13], MVKM [3], MVSOM [10], LMSC [33], pairwise MLRSSC, and centroid-based MLRSSC [6]. We also compare these methods to other baselines, such as PCA and basic spectral clustering applied to concatenated views.

8.3.1 Data Sets Description

The experiments are carried out on four commonly used data sets for multi-view text clustering. The Reuters data set is a collection of 2189 documents belonging to 8 classes. The 20 Newsgroups data set consists of 2828 news articles distributed over 20 classes. The WebKB data set is a collection of 4168 web pages collected from computer science departments, belonging to 4 classes (student, faculty, project, course). The BBC Sport data set consists of 737 documents from the BBC Sport website corresponding to sports news articles belonging to 5 areas: football, rugby, tennis, athletics, and cricket. Before applying the clustering algorithms, a preprocessing step is performed on the data sets, including stop word removal. Stop word removal consists in eliminating common words that appear frequently and offer no additional semantic value. Table 8.2 summarizes the properties of all data sets.

Table 8.2 Data sets description
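For reference, a minimal preprocessing and view construction pipeline in the spirit of the compared methods could look as follows; scikit-learn is assumed, and the exact pipelines, vocabularies, and embedding views used by the original papers may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder corpus: replace with the documents of the data set at hand
documents = ["football match report ...", "tennis final recap ...", "cricket scores ..."]

# Syntactic view: TF-IDF with English stop words removed
tfidf_view = TfidfVectorizer(stop_words='english', max_features=10000).fit_transform(documents)

# Topic view: LDA over raw term counts (number of topics is an arbitrary choice here)
counts = CountVectorizer(stop_words='english', max_features=10000).fit_transform(documents)
topic_view = LatentDirichletAllocation(n_components=20, random_state=0).fit_transform(counts)
```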

8.3.2 Evaluation Measures

To measure the quality of the clustering and compare it with existing methods, three evaluation measures are utilized: the F-measure [18], the Normalized Mutual Information (NMI) [37], and Purity [23]. Given a set of clusters C = {c 1, c 2, …, c k} and the gold standard classes G = {g 1, g 2, …, g j}:

F-measure is a trade-off between Precision and Recall such that:

$$\displaystyle \begin{aligned} F-measure(c_k,g_j) = 2 * \frac{Precision(c_k,g_j) \times Recall(c_k,g_j)}{Precision(c_k,g_j) + Recall(c_k,g_j)} \end{aligned} $$
(8.22)
$$\displaystyle \begin{aligned} Precision(c_k,g_j) = \dfrac{|c_k \cap g_j|}{|c_k|} \end{aligned} $$
(8.23)
$$\displaystyle \begin{aligned} Recall(c_k,g_j) = \dfrac{|c_k \cap g_j|}{|g_j|} \end{aligned} $$
(8.24)

Normalized Mutual Information (NMI) measures the quality of the clustering with regard to the number of clusters and their sizes. NMI is defined as:

$$\displaystyle \begin{aligned} NMI(C,G) = \dfrac{I(C,G)}{[E(C)+E(G)]/2} \end{aligned} $$
(8.25)

where I is the mutual information and E(⋅) is the entropy:

$$\displaystyle \begin{aligned} I(C,G) = \sum_k \sum_j \dfrac{|c_k \cap g_j|}{N} \log \dfrac{N|c_k \cap g_j|}{|c_k| |g_j|} \end{aligned} $$
(8.26)
$$\displaystyle \begin{aligned} E(C) = - \sum_k \frac{|c_k|}{N} \log \frac{|c_k|}{N} \end{aligned} $$
(8.27)

Purity: measures the number of correctly assigned documents, where each cluster is assigned to the dominant class in that cluster. The larger the number of clusters, the higher the Purity. Unlike NMI, Purity cannot trade off the quality of the clustering against the number of clusters:

$$\displaystyle \begin{aligned} Purity(C,G) = \dfrac{1}{N} \sum_k \max_j |c_k \cap g_j| \end{aligned} $$
(8.28)

For all measures, values range from 0 to 1, such that values closer to 0 represent poor clustering quality.
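The three measures can be computed as follows; NMI is taken from scikit-learn, while Purity and the F-measure are derived from the class-cluster contingency counts. The aggregation of the per-pair F-measure into a single score (best-matching cluster per class, weighted by class size) is a common convention assumed here.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def purity(gold, pred):
    """Fraction of documents assigned to the dominant gold class of their cluster."""
    cm = contingency_matrix(gold, pred)        # rows: gold classes, columns: clusters
    return cm.max(axis=0).sum() / cm.sum()

def f_measure(gold, pred):
    """Overall F-measure: each gold class matched to its best cluster,
    weighted by class size (Eqs. 8.22-8.24)."""
    cm = contingency_matrix(gold, pred).astype(float)
    precision = cm / np.maximum(cm.sum(axis=0, keepdims=True), 1e-12)  # |c∩g|/|c|
    recall = cm / np.maximum(cm.sum(axis=1, keepdims=True), 1e-12)     # |c∩g|/|g|
    f = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    class_sizes = cm.sum(axis=1)
    return (class_sizes * f.max(axis=1)).sum() / class_sizes.sum()

gold = [0, 0, 1, 1, 2, 2]
pred = [0, 0, 1, 2, 2, 2]
print(f_measure(gold, pred), normalized_mutual_info_score(gold, pred), purity(gold, pred))
```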

8.3.3 Experimental Results

Table 8.3 reports the performance of the different methods. Given the results, we can observe that most multi-view methods provide better clustering in comparison to concatenated views. This shows that concatenating views can result in losing the individual properties of the views and affect the overall clustering. Another noticeable observation is that all methods give their best results on the smallest data set, BBC Sport, while the overall performance is degraded on the largest data set, 20 Newsgroups. We can conclude that the size and the dimensionality of the data set can jeopardize the performance; this may be due to noise and redundant information. Although all methods yield close results, we can notice that multi-view subspace clustering methods achieve relatively better results on almost all data sets, which indicates that these methods are capable of learning a common latent representation from all views. On the other hand, both ensemble methods perform similarly; however, MEMTC has better performance, which indicates that including other representation schemes can improve the final clustering.

Table 8.3 Comparison of clustering results with multi-view methods

Overall, late integration based methods have shown good empirical performance, given that the individual clustering provided by each view can compensate for the clustering inaccuracy of another view. However, such methods can be computationally expensive, since the clustering is performed once per view and the integration phase, being independent from the clustering phase, adds to the computational cost.

Co-training based methods rely on the simultaneous optimization of one unified objective function to achieve a single clustering result from different views [2, 3]. However, having a unified objective function does not allow learning from each view independently, which can result in losing knowledge held by individual views that could later be integrated to improve the overall clustering. Furthermore, co-training based methods become intractable when the number of views exceeds three.

Another issue consists of integrating multiple views while maintaining their diversity. Precisely, reaching a consensus in the clustering process, as in co-training based clustering, can result in losing the specificity of each view. To this end, subspace clustering based algorithms can present a solution [6]. However, the challenge remains in finding a shared subspace while incorporating the diversity aspect. To summarize, these experimental results help draw the following conclusions:

  • Large and high-dimensional data sets affect the performance of multi-view methods. Therefore, considering a dimensionality reduction method can help avoid this issue.

  • Taking advantage of different representation schemes can improve the clustering performance of multi-view methods.

  • Subspace based methods have good performance, yet these methods include multiple parameters and their optimization scheme is not straightforward.

8.4 Conclusion

We have presented in this chapter a categorization of existing multi-view clustering methods based on the fusion style of multi-view data. Three main integration schemes can be distinguished: late integration, co-training based methods, and subspace based methods. For each category, we have detailed a number of multi-view clustering algorithms and the means by which they manage text data. Lastly, we have discussed the advantages and the limits of these methods and raised the following issues: the representation of multi-view text data often relies on term frequencies only; the intra-view properties of each view can be further leveraged to improve the clustering results; and incorporating the specificity of each view in the clustering process can provide a better understanding of the data. Multiple recent research studies focus on incomplete views with missing values. Other works rely on incorporating deep learning into multi-view clustering to further discover hidden patterns shared among views.