
1 Introduction

Text clustering aims to organize a collection of documents into several clusters, such that documents in the same cluster are as similar as possible, whereas those in different clusters are as dissimilar as possible [1]. This technique has been widely applied to many fields such as information retrieval [24], topic detection [2] and social network analysis [20]. Several clustering methods have been proposed in the literature [9, 15]; they can be categorized into hierarchical, partitional and neural-network-based methods. Among these categories, the Self-Organizing Map (SOM) [10] is a very popular unsupervised neural network for the analysis of high-dimensional patterns in machine learning applications [21]. SOM implements a nonlinear projection of a high-dimensional space onto a low-dimensional space called the map. The nodes on the map correspond to clusters of the input samples and can be visualized. Compared to other text clustering methods, SOM allows visualizing the similarity between documents within the low-dimensional map.

On the other hand, text clustering mainly consists of two phases: a pre-processing phase, which formats the textual data into a structured form using a representation model, and a clustering phase, which runs a clustering algorithm to partition the data samples into distinct groups of documents related to the same topics. Pre-processing is a fundamental phase, in which the representation scheme helps identify hidden patterns that might improve the clustering results. To this end, multiple text representation models have been proposed in the literature, such as the Vector Space Model [19], topic models [4] and, more recently, word embeddings [16]. However, in the absence of an optimal representation model for text, such data becomes challenging for clustering algorithms. Moreover, text exhibits multiple facets that a single representation fails to capture. More precisely, the different facets of text usually correspond to different feature spaces, i.e. representations, each of which contains a piece of information that is not captured or explicitly represented by the other features. This setting is formally referred to as multi-view learning, where each view corresponds to a feature space. Multi-view text clustering thus refers to clustering textual data with multiple representations by integrating the information held in each view, in order to improve the clustering results. Several multi-view clustering methods have been proposed in the literature [3, 11, 12, 14, 25]. For example, Bickel et al. [3] proposed multi-view versions of the K-means and EM methods, where the clustering algorithm is alternately applied on two conditionally independent views and the results are bootstrapped. Hussain et al. [8] proposed a multi-view document clustering approach which applies individual clustering on each view, then aggregates different ensemble techniques (Cluster-Based Similarity Matrix, Affinity Matrix and Pair-wise Dissimilarity Matrix) to obtain a final consensus clustering. Although the existing methods have admittedly shown good performance on textual data sets, they only consider a single representation, i.e. term frequencies, across natural splits of the data. Moreover, while multiple views certainly provide different perspectives on the same data, the abundance of information raises the high-dimensionality issue, and integrating such information is challenging.

To deal with these issues, we propose in this paper a new multi-view text clustering method which exploits different text representation models to capture and integrate the information from each view and to maintain both the syntactic and semantic aspects of the text. Moreover, the proposed method uses the Self-Organizing Map not only to overcome the issue of high dimensionality, by mapping the different views onto a low-dimensional space, but also to represent the intrinsic structure of the multi-view data by preserving the topology of each view.

The remainder of this paper is organized as follows: Sect. 2 reviews related work to subspace based methods for multi-view clustering. The proposed multi-view approach for text clustering is presented in Sect. 3. The experimental results are discussed in Sect. 4. Conclusions are given in Sect. 5.

2 Related Work

Multi-view subspace clustering methods first seek a common lower-dimensional subspace that represents the intrinsic structure of the multi-view data, and then apply a clustering algorithm to obtain the partitions. In [7], subspace learning is formulated as a convex joint optimization problem. A low-rank linear projection is performed in [6] to uncover shared information and minimize the semantic gap across views. The CoKmLDA algorithm proposed in [28] first applies K-means on each view to obtain cluster assignments; Linear Discriminant Analysis (LDA) is then used to project each view into a lower-dimensional subspace, using the clustering results from the other views to impose a similar structure on all subspaces, so that the final cluster labels obtained from each subspace are consistent across all views. In [22], sparse subspace matrices are constructed from each view and a spectral clustering algorithm is then performed. Other approaches are based on Non-negative Matrix Factorization (NMF); the idea is to find a common latent factor among multi-view data through low-rank factor matrices. Liu et al. [14] proposed MultiNMF, which regularizes the coefficient matrices towards a common consensus. Recently, Nie et al. [17] introduced MLAN, a graph-based model that learns a local manifold structure and performs clustering simultaneously. A multi-manifold regularized NMF is proposed in [30], whose main contribution is to preserve the geometric structure of the multi-view data through a consensus manifold and a consensus coefficient matrix. The approach in [26] adopts semi-NMF, a variant of NMF, and follows a deep learning strategy. More recent approaches tackle incomplete multi-view data [23, 27].

3 Self-Organizing Map for Multi-view Text Clustering

The proposed multi-view method is based on different representation models of text. When documents are clustered based on the traditional Vector Space Model, in particular the tf-idf scheme, which only considers the occurrence of words in a document, identifying semantically related documents becomes hard. Consequently, our method enriches the pre-processing step by incorporating other representation schemes, such that each representation corresponds to a view that captures a particular aspect of the text. Hence, each document has three distinct vector representations. The views are then presented as parallel layers and mapped onto a low-dimensional subspace using the SOM architecture in order to uncover a latent structure. Finally, we run a clustering algorithm to obtain distinct document clusters. The overall process is illustrated in Fig. 1.

Fig. 1. SOM for multi-view text clustering

3.1 Document Representation

Vector Space Model (VSM). The VSM is the most commonly used method for text representation [19]. It is an algebraic model in which documents and terms are represented as vectors in a multidimensional space, so that a corpus is encoded as a term-document matrix whose values are the Term Frequency-Inverse Document Frequency (tf-idf) weights [18]. TF-IDF is a weighting scheme that scores the importance of a word in a document based on the number of times it occurs in the document itself and in the entire collection of documents. The tf-idf weight is calculated from two values:

The Term Frequency (TF) measures how frequently a term appears in a document as follows:

$$\begin{aligned} tf_{(t,d)} = \dfrac{\text {Number of times term } t \text { appears in document } d}{\text {Total number of terms in document } d} \end{aligned}$$
(1)

The Inverse Document Frequency (IDF) measures the relevance of a term in the whole collection of documents; terms that are rare in the data set are considered more important.

$$\begin{aligned} idf_{(t)} = \log \left( \dfrac{\text {Total number of documents}}{\text {Number of documents containing term } t}\right) \end{aligned}$$
(2)

The intuition behind the TF-IDF weighting is that the importance of a word should increase proportionally to the number of times it appears in a document. Words such as “is”, “the” and “a” are very common in English and tend to appear more often than other words, but they carry little meaning; their values are therefore offset by their idf weights.

As a result, the total tf-idf weight is calculated as follows:

$$\begin{aligned} \text {tf-idf}(t,d) = tf_{(t,d)} \times idf_{(t)} \end{aligned}$$
(3)
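As an illustration, a tf-idf view can be built with scikit-learn's TfidfVectorizer; this is a minimal sketch with a placeholder toy corpus, and scikit-learn's idf uses a smoothed variant of Eq. (2) rather than the exact formula above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the document collection.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
]

# TfidfVectorizer combines Eqs. (1)-(3): term counts reweighted by inverse
# document frequency; stop_words='english' drops common function words.
vectorizer = TfidfVectorizer(stop_words="english")
X_tfidf = vectorizer.fit_transform(docs)  # sparse term-document matrix, shape (n_docs, n_terms)

print(X_tfidf.shape)
print(vectorizer.get_feature_names_out()[:10])
```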

Latent Dirichlet Allocation Model (LDA). Latent Dirichlet Allocation is a probabilistic model that identifies the underlying topics in a corpus of documents [4]. LDA assumes that a document is represented by a distribution over topics, whereas a topic is a distribution over words, following the generative process below:

  • Choose \(\theta \sim Dir(\alpha )\)

  • For each word position n in the document:

    – Choose a topic \(z_n \sim Multinomial(\theta )\)

    – Choose a word \(w_n \sim Multinomial(\beta _{z_n})\), where \(\beta _{z_n}\) is the word distribution of topic \(z_n\) (one of the k topics).

\(\alpha \) is the parameter of the Dirichlet prior on the per-document topic distribution.

\(\beta \) is the parameter of the Dirichlet prior on the per-topic word distribution.

\(\theta _m\) is the topic distribution of document \(d_m\).

\(z_n\) is the topic assigned to the \(n^{th}\) word of document \(d_m\).

The Dirichlet distribution is computed as follows:

$$\begin{aligned} p(\theta |\alpha ) = \dfrac{\varGamma (\sum ^k_{i=1} \alpha _i)}{\prod ^k_{i=1}\varGamma (\alpha _i)} \theta ^{\alpha _1 -1}_{1} \cdots \theta ^{\alpha _k -1}_{k} \end{aligned}$$
(4)

The LDA model is used along with Gibbs sampling [5] to discover the topics, yielding a document-topic matrix.
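As a sketch of this step (with a placeholder corpus and an assumed number of topics), a document-topic matrix can be obtained with gensim; note that gensim's LdaModel uses online variational Bayes rather than the Gibbs sampler cited above.

```python
import numpy as np
from gensim import corpora, models

# Hypothetical tokenized corpus; in practice the same pre-processed documents as before.
texts = [["cat", "sat", "mat"], ["dog", "chased", "cat"], ["stock", "markets", "fell"]]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

num_topics = 2  # assumed number of topics
lda = models.LdaModel(bow_corpus, num_topics=num_topics, id2word=dictionary, passes=10)

# Document-topic matrix: one row per document, one column per topic.
doc_topic = np.array([
    [prob for _, prob in sorted(lda.get_document_topics(bow, minimum_probability=0.0))]
    for bow in bow_corpus
])
print(doc_topic.shape)  # (n_docs, num_topics)
```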

Skip-gram Model. The Skip-gram model is a neural network that produces word embeddings while retaining both semantic and syntactic information about words [16]. The vector representations are learned through the adjustment of the network weights, such that words appearing in the same context have similar vectors. In order to obtain a vector representation for each document, a new feature space is built from the Skip-gram word representations. To this end, words are clustered using the K-means algorithm with the cosine distance. The obtained semantic cliques, i.e. the clusters of words, correspond to the new features from which a document-semantics matrix is built. The entries of this matrix are the tf-idf weights of the semantic cliques, such that:

$$\begin{aligned} w_{ij} = \dfrac{tf_{ij} \times idf_i}{\sum \limits _{\mathrm {t \in d_j}} tf_{ij} \times idf_i} \end{aligned}$$
(5)

where \(w_{ij}\) is the weight of the semantic clique \(S_i\) in document \(d_j\); \(tf_{ij}\) is the frequency of \(S_i\) in \(d_j\), i.e. the sum of the tf weights of the terms \(\mathrm {t}\) belonging to clique \(S_i\) and document \(d_j\), as expressed by (6); \(idf_i\) is the idf weight of \(S_i\), i.e. the sum of the idf weights of the terms in \(S_i\) (7). The denominator is for normalization.

$$\begin{aligned} tf_{ij} = \sum _{t_l \in S_i} tf(t_l, d_j), \quad i\in \{1,\ldots ,K\};\; j\in \{1,\ldots ,N\} \end{aligned}$$
(6)
$$\begin{aligned} idf_i = \sum _{t_l \in S_i} idf(t_l), \quad i\in \{1,\ldots ,K\} \end{aligned}$$
(7)
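A minimal sketch of this view, assuming gensim's skip-gram Word2Vec and K-means on L2-normalized word vectors (so that Euclidean K-means approximates the cosine distance); the raw clique counts below are a simplified stand-in for the clique-level tf-idf weights of Eqs. (5)-(7).

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

texts = [["cat", "sat", "mat"], ["dog", "chased", "cat"], ["stock", "markets", "fell"]]

# Skip-gram embeddings (sg=1); dimensions and window are placeholder values.
w2v = Word2Vec(sentences=texts, vector_size=50, window=5, min_count=1, sg=1)
vocab = list(w2v.wv.index_to_key)
vectors = normalize(w2v.wv[vocab])  # L2-normalized word vectors

# Semantic cliques: clusters of words in the embedding space.
n_cliques = 2  # assumed number of semantic cliques
clique_of = dict(zip(vocab, KMeans(n_clusters=n_cliques, n_init=10).fit_predict(vectors)))

# Document-semantics matrix: number of words per clique in each document
# (a simplified stand-in for the tf-idf clique weighting of Eq. (5)).
doc_sem = np.zeros((len(texts), n_cliques))
for j, doc in enumerate(texts):
    for t in doc:
        doc_sem[j, clique_of[t]] += 1
print(doc_sem)
```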

3.2 SOM for Multi-view Clustering

After the generation of the views in three different feature spaces, each document is represented by three vectors: the traditional tf-idf representation, the topical vector given by LDA, and the vector based on semantic cliques derived from Skip-gram. These vectors constitute the inputs to the SOM neural network, such that each view is an input layer. We first recall the basic concepts of SOM, then adapt the SOM architecture to multi-view document clustering.

3.2.1 Basic Concepts of SOM

The self-organizing map is an unsupervised neural network model, also known as the Kohonen network, that projects high-dimensional input vectors onto a low-dimensional space [10]. The obtained feature space is a topological map that enables the partitioning of the input data into similar groups. Generally, SOM consists of two layers: the input layer contains the input vectors as represented in their original space, and the output layer consists of the nodes of the map. Each node, i.e. neuron, on the map corresponds to a set of inputs. The learning algorithm of SOM is as follows:

  1. Initialization: Start with random values for the initial weights \(w_i\).

  2. Matching: Determine the winning neuron c at time index t according to the smallest Euclidean distance:

    $$\begin{aligned} c = \mathop {\text {arg min}}\limits _i ||x_j-w_i|| , \ i=1,2,\cdots ,M \end{aligned}$$
    (8)

    where \(x_j\) is an input vector, M is the number of neurons.

  3. Weights Updating: Adjust the weights of the winning neuron and its neighbors, such that:

    $$\begin{aligned} w_i(t+1)= w_i(t)+\epsilon (t)\times h_{ci}(t)\times (x_j(t)-w_i(t)) \end{aligned}$$
    (9)
    $$\begin{aligned} h_{ci}(t)=\exp \left( -\dfrac{||r_c-r_i||^2}{2\sigma ^2(t)}\right) \end{aligned}$$
    (10)

    where \(h_{ci}(t)\) is the neighborhood kernel function, \(\epsilon (t)\in [0,1]\) is the learning rate and \(\sigma (t)\) defines the width of the kernel; both values decrease monotonically with the time index; \(r_c\) and \(r_i\) are the position vectors of nodes c and i on the map.
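The three steps above can be sketched in a few lines of NumPy; this is a minimal single-view illustration of Eqs. (8)-(10), with placeholder grid size and decay schedules rather than the settings used in the paper.

```python
import numpy as np

def train_som(X, grid=(10, 10), n_iter=1000, eps0=0.5, sigma0=3.0, seed=0):
    """Minimal SOM training loop implementing Eqs. (8)-(10)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = grid
    M, dim = n_rows * n_cols, X.shape[1]
    W = rng.random((M, dim))                          # 1. random initial weights
    pos = np.array([(r, c) for r in range(n_rows) for c in range(n_cols)], dtype=float)

    for t in range(n_iter):
        eps = eps0 * (1.0 - t / n_iter)               # learning rate, decreasing in t
        sigma = sigma0 * (1.0 - t / n_iter) + 1e-3    # kernel width, decreasing in t
        x = X[rng.integers(len(X))]                   # pick a random input vector
        c = np.argmin(np.linalg.norm(x - W, axis=1))  # 2. matching: winning neuron (Eq. 8)
        h = np.exp(-np.sum((pos - pos[c]) ** 2, axis=1) / (2 * sigma ** 2))  # Eq. (10)
        W += eps * h[:, None] * (x - W)               # 3. weight update (Eq. 9)
    return W, pos

# Usage on a placeholder dense tf-idf matrix:
# W, pos = train_som(X_tfidf.toarray())
```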

3.2.2 Multi-view SOM

In order to adapt the SOM to multi-view clustering, the three obtained views are used as input layers instead of a single layer. To this end, each view is a separate layer that corresponds to a feature space (VSM, LDA, and Skip-gram); hence each document is represented by three numeric vectors \(x_1\), \(x_2\), and \(x_3\). Each input layer corresponds to an output layer consisting of a two-dimensional topological map, as shown in Fig. 2. The weight vectors are randomly initialized and updated after computing the winning neuron. More precisely, for each pair of corresponding input and output layers, the distances between input vectors and weight vectors are computed to find the best matching neuron, i.e. the neuron with the smallest Euclidean distance. Lastly, in order to assign a document to a final output node, an overall distance is calculated such that:

$$\begin{aligned} D = \sum _{i} D_i(x_i,n_i) , \ i=1,2,3 \end{aligned}$$
(11)
Fig. 2. The mapping of a multi-view document

After the training step of SOM, the output map consists of clusters, such that a cluster can correspond to one neuron or to a set of neurons belonging to the same class. Hence, the positions of the neurons on the map are important: neighboring neurons have similar vector representations and indicate a particular pattern of the data. Moreover, the topological map can be exploited to determine the properties of the data, since similar documents are represented by neurons located in neighboring regions of the map. The training algorithm of this step is given in Algorithm 1.

Algorithm 1. Multi-view SOM training
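A minimal sketch of the combined matching step of Eq. (11), under the assumption that the three per-view maps share the same grid of M neurons; the names and shapes are illustrative, not the paper's Algorithm 1.

```python
import numpy as np

def multiview_bmu(views, weights):
    """Best matching neuron for one document across views (Eq. 11).

    views:   per-view input vectors [x1, x2, x3]
    weights: per-view weight matrices [W1, W2, W3], each of shape (M, d_v),
             defined over the same map of M neurons.
    """
    # Sum the per-view Euclidean distances neuron by neuron, then take the argmin.
    total = sum(np.linalg.norm(x - W, axis=1) for x, W in zip(views, weights))
    return int(np.argmin(total))

# Usage with per-view SOMs trained as in the earlier sketch (placeholder names):
# c = multiview_bmu([x1, x2, x3], [W1, W2, W3])
```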

3.2.3 Clustering

Although the SOM provides clusters of documents given by the neurons on the map, the number of output neurons is larger than the desired number of clusters. A neuron on the map is associated with a single document or with a set of documents; thus one neuron may represent a cluster, while other neurons are “empty” and have no documents assigned to them. Therefore, in order to have multiple neurons representing a cluster, a clustering algorithm is applied at a second level: the output neurons on the map are grouped into a number of distinct regions corresponding to the desired number of clusters. Consequently, each input document is assigned to a cluster according to its nearest output neuron. This process is illustrated in Fig. 3.

Given that neighboring neurons have similar vector representations and distant neurons on the map are considered unrelated, a distance matrix over the neurons can be computed and used for agglomerative hierarchical clustering.
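A minimal sketch of this second-level clustering, assuming the neuron weight vectors (codebook) and per-document winning neurons from the earlier sketches, with scikit-learn's agglomerative clustering as a stand-in for the hierarchical step.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_documents(doc_bmus, codebook, n_clusters):
    """Group SOM neurons into n_clusters regions and label documents accordingly.

    doc_bmus: array of winning-neuron indices, one per document
    codebook: neuron weight matrix of shape (M, d); for the multi-view map this
              could be the concatenation np.hstack([W1, W2, W3]) (an assumption)
    """
    # Agglomerative hierarchical clustering of the neurons themselves.
    neuron_labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(codebook)
    # Each document inherits the cluster of its winning neuron.
    return neuron_labels[np.asarray(doc_bmus)]

# Usage with placeholder inputs:
# labels = cluster_documents(doc_bmus, np.hstack([W1, W2, W3]), n_clusters=8)
```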

Fig. 3. Clustering of the SOM

4 Experimental Results

In order to evaluate the performance of the proposed method, we conduct several experiments on different data sets using three evaluation measures. First, we compare the proposed method with the equivalent single-view methods. Second, it is evaluated against two existing multi-view clustering methods.

4.1 Data Sets Description

Four well-known data sets for text mining are used in the experiments. The Reuters R8 data set is a subset of the Reuters data set with 2189 documents belonging to 8 classes. The 20 Newsgroups data set consists of 2828 news articles distributed over 20 classes. The WebKB data set is a collection of 4168 web pages collected from computer science departments, belonging to 4 classes (student, faculty, project, course). The BBC Sport data set consists of 737 documents from the BBC Sport website corresponding to sports news articles in 5 areas: football, rugby, tennis, athletics and cricket. Before applying the clustering algorithms, a pre-processing step is performed on the data sets, including stop-word removal, which eliminates common words that appear frequently and offer no additional semantic value.

4.2 Evaluation Measures

To measure the quality of the clustering and compare it with existing methods, three evaluation measures are utilized: the F-measure [13], the Normalized Mutual Information (NMI) [29], and Purity [17]. Given a set of clusters \(C=\{c_1, c_2, \ldots , c_k\}\) and the gold standard classes \(G=\{g_1, g_2, \ldots , g_j\}\):

The F-measure is the harmonic mean of Precision and Recall, such that:

$$\begin{aligned} \small \text {F-measure}(c_k,g_j) = 2 \times \frac{Precision(c_k,g_j) \times Recall(c_k,g_j)}{Precision(c_k,g_j) + Recall(c_k,g_j)} \end{aligned}$$
(12)
$$\begin{aligned} Precision(c_k,g_j) = \dfrac{|c_k \cap g_j|}{|c_k|} \end{aligned}$$
(13)
$$\begin{aligned} Recall(c_k,g_j) = \dfrac{|c_k \cap g_j|}{|g_j|} \end{aligned}$$
(14)

Normalized Mutual Information (NMI) measures the quality of the clustering with regard to the number of clusters and their sizes. NMI is defined as:

$$\begin{aligned} \small NMI(C,G) = \dfrac{I(C,G)}{[E(C)+E(G)]/2} \end{aligned}$$
(15)

where I(C, G) is the mutual information and \(E(\cdot )\) denotes the entropy:

$$\begin{aligned} \small I(C,G) = \sum _k \sum _j \dfrac{|c_k \cap g_j|}{N} \log \dfrac{N|c_k \cap g_j|}{|c_k| |g_j|} \end{aligned}$$
(16)
$$\begin{aligned} \small E(C) = - \sum _k \frac{|c_k|}{N} \log \frac{|c_k|}{N} \end{aligned}$$
(17)

Purity: measures the fraction of correctly assigned documents, where each cluster is assigned to the dominant class in that cluster. The larger the number of clusters, the higher the Purity. Unlike NMI, Purity cannot trade off the quality of the clustering against the number of clusters:

$$\begin{aligned} \small Purity(C,G)=\frac{1}{N} \sum _k \underset{j}{max} \ |c_k \cap g_j | \end{aligned}$$
(18)

For all measures, the values range from 0 to 1, such that values closer to 0 represent poor quality.
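For reference, NMI and Purity can be computed with scikit-learn as below; the label arrays are placeholders, not results from the experiments.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def purity_score(true_labels, pred_labels):
    """Purity (Eq. 18): each cluster is assigned to its dominant gold class."""
    cm = contingency_matrix(true_labels, pred_labels)  # rows: classes, cols: clusters
    return np.sum(np.amax(cm, axis=0)) / np.sum(cm)

# Placeholder labels: gold classes and predicted clusters for 6 documents.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]

print("NMI   :", normalized_mutual_info_score(y_true, y_pred))
print("Purity:", purity_score(y_true, y_pred))
```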

4.3 Experimental Results

For the training, SOM requires predefining the size and structure of the network, the neighborhood function, and the learning rate. These parameters are generally selected experimentally.
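As an illustration of such a configuration (the grid size, neighborhood width and learning rate below are arbitrary placeholders, not the values tuned in the experiments), a single-view map could be set up with the MiniSom library:

```python
import numpy as np
from minisom import MiniSom

# Placeholder data standing in for a dense document-term (tf-idf) matrix.
X = np.random.RandomState(0).rand(100, 300)

som = MiniSom(
    x=10, y=10,                        # map size: a 10x10 grid, chosen experimentally
    input_len=X.shape[1],
    sigma=1.5,                         # initial neighborhood width
    learning_rate=0.5,                 # initial learning rate
    neighborhood_function="gaussian",  # neighborhood kernel (Eq. 10)
    random_seed=42,
)
som.random_weights_init(X)
som.train_random(X, num_iteration=5000)

bmus = [som.winner(x) for x in X]      # winning neuron (grid coordinates) per document
```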

For the first comparison, the proposed method is evaluated against single-view clustering. Each view corresponds to a single text representation model: the VSM, the LDA model, or the Skip-gram model. We use the classic SOM for all single-view methods.

Given the results in Table 1, we notice that the best results were obtained on the BBC Sport data set, with an F-measure, NMI and Purity of 0.797, 0.658 and 0.734 respectively, which could be related to the size of the data set. The lowest results were obtained on 20 Newsgroups, the data set with the largest number of classes. However, the LDA method scored the highest NMI and Purity on the Reuters data set, with 0.485 and 0.642 respectively. Furthermore, the results show that the performance of the single-view methods varies across data sets. The Skip-gram model outperformed the other single-view methods on three data sets (20 Newsgroups, WebKB, and BBC Sport), while VSM gave better results than LDA on BBC Sport. This shows that each view exhibits different data patterns.

Table 1. Comparison of clustering results with single view methods

To further evaluate the proposed method, we compare it against two other multi-view clustering methods: the multi-view K-means (MVKM) proposed in [3] and the multi-view ensemble method (MVEM) in [8]. The results in Table 2 are averaged over 50 iterations for each method. The proposed approach outperformed the other two methods. MVKM is based on the tf-idf representation and co-training of K-means, while MVEM is based on ensemble techniques that aggregate the individual clusterings obtained from each view. The results show that taking into account different representations and preserving the topology of each view can improve the clustering results in comparison to other multi-view clustering methods.

Table 2. Comparison of clustering results with multi-view methods

5 Conclusion

In this paper, we have proposed a novel method for multi-view text clustering. Unlike existing works, our method explores the use of three representation models (VSM, LDA, and Skip-gram) to generate different views that respectively capture the syntactic, topical, and semantic aspects of text. Moreover, we exploit the SOM to handle the issue of high dimensionality by mapping the different views onto a low-dimensional space. Furthermore, SOM helps maintain the topological properties of each view while uncovering the intrinsic structure of the multi-view data to improve the clustering results. The conducted experiments show that, in comparison to single-view clustering, using multiple views improves the clustering quality. The experiments also show that the proposed method yields better results than other multi-view clustering methods.