Abstract
Text document clustering is a key task in machine learning that partitions a collection of documents into clusters of related documents. To this end, a pre-processing step is carried out to represent text in a structured form. However, text exhibits several aspects that a single representation cannot capture. Multi-view clustering therefore presents an efficient solution that exploits and integrates the information captured by different representations, or views. Existing methods, however, are limited to views based on term frequencies, which loses valuable information and fails to capture the semantic aspects of text. To address these issues, we propose a new method for multi-view text clustering that exploits different representations of text. The proposed method applies the Self-Organizing Map to the unsupervised clustering of texts by simultaneously taking into account several views obtained from the textual data. Experiments demonstrate the improvement in clustering results compared to existing methods.
1 Introduction
Text clustering aims to organize a collection of documents into several clusters, such that documents in the same cluster are as similar as possible, whereas those in different clusters are as dissimilar as possible [1]. This technique has been widely applied in many fields such as information retrieval [24], topic detection [2], and social network analysis [20]. Several clustering methods have been proposed in the literature [9, 15], which can be categorized into hierarchical, partitional, and neural-network-based methods. Among these, the Self-Organizing Map (SOM) [10] is a very popular unsupervised neural network for the analysis of high-dimensional patterns in machine learning applications [21]. SOM implements a nonlinear projection of a high-dimensional space onto a low-dimensional space called a map. The nodes on the map correspond to clusters of the input samples and can be visualized. Compared to other text clustering methods, SOM allows visualizing the similarity between documents within the low-dimensional map.
On the other hand, text clustering mainly consists of two phases: a pre-processing phase, which formats the textual data into a structured form using a representation model, and a second phase, which runs a clustering algorithm to partition the data samples into distinct groups of documents related to the same topics. Pre-processing is a fundamental phase, in which the representation scheme helps identify the hidden patterns that might improve the clustering results. To this end, multiple text representation models have been proposed in the literature, such as the Vector Space Model [19], topic models [4], and more recently word embeddings [16]. However, in the absence of an optimal representation model for text, tackling such data becomes challenging for clustering algorithms. Moreover, text depicts multiple facets that a single representation fails to capture. More precisely, the different facets of text usually correspond to different feature spaces, i.e. representations, each of which contains a piece of information that is not captured or explicitly represented by the others. This issue is formally referred to as multi-view learning, where each view corresponds to a feature space. Multi-view text clustering thus refers to clustering textual data with multiple representations by integrating the information held in each view, in order to improve the clustering results. Several multi-view clustering methods have been proposed in the literature [3, 11, 12, 14, 25]. For example, Bickel et al. [3] proposed multi-view versions of the K-means and EM methods, where the clustering algorithm is alternately applied on two conditionally independent views and the results are bootstrapped. Hussain et al. [8] proposed a multi-view document clustering method which applies individual clustering on each view, then aggregates different ensemble techniques (Cluster Based Similarity Matrix, Affinity Matrix, and Pair-wise Dissimilarity Matrix) to obtain a final consensus clustering. Although the existing methods have admittedly shown efficient performance on textual data sets, they only consider a single representation, i.e. term frequencies, across natural splits of the data. Moreover, while multiple views certainly provide different perspectives on the same data, the abundant amount of information raises the high-dimensionality issue, and integrating such information is challenging.
To deal with these issues, we propose in this paper a new multi-view text clustering method which exploits different text representation models to capture and integrate information from each view and maintain both syntactic and semantic aspects of text. Moreover, the proposed method uses the Self-Organizing Map to not only overcome the issue of high dimensionality by mapping the different views onto a low-dimensional space, but also to represent the intrinsic structure of the multi-view data by preserving the topology of each view.
The remainder of this paper is organized as follows: Sect. 2 reviews related work to subspace based methods for multi-view clustering. The proposed multi-view approach for text clustering is presented in Sect. 3. The experimental results are discussed in Sect. 4. Conclusions are given in Sect. 5.
2 Related Work
Multi-view subspace clustering methods first seek a common lower-dimensional subspace that represents the intrinsic structure of the multi-view data, and then apply a clustering algorithm to obtain the partitions. In [7], the subspace learning is formulated as a convex joint optimization problem. A low-rank linear projection is performed in [6] to uncover shared information and minimize the semantic gap across views. The CoKmLDA algorithm proposed in [28] first applies K-means on each view to obtain cluster assignments; Linear Discriminant Analysis (LDA) is then used to project each view into a lower-dimensional subspace, using the clustering results from the other views to impose a similar structure on all subspaces, so that the final cluster labels obtained from each subspace are consistent across all views. In [22], sparse subspace matrices are constructed from each view, and then a spectral clustering algorithm is performed. Other approaches are based on Non-negative Matrix Factorization (NMF); the idea is to find a common latent factor among multi-view data through low-rank factor matrices. Liu et al. [14] proposed MultiNMF, which regularizes the coefficient matrices towards a common consensus. Recently, Nie et al. [17] introduced MLAN, a graph-based model that learns a local manifold structure and performs clustering simultaneously. [30] proposed a multi-manifold regularized NMF whose main contribution is to preserve the geometrical structure of the multi-view data through a consensus manifold and a consensus coefficient matrix. The approach in [26] adopted semi-NMF, a variant of NMF, following a deep learning strategy. More recent approaches tackle incomplete multi-view data [23, 27].
3 Self-Organizing Map for Multi-view Text Clustering
The proposed multi-view method is based on different representation models of text. When documents are clustered based on the traditional Vector Space Model, and the tf-idf scheme in particular, which only considers the occurrences of words in a document, identifying semantically related documents becomes hard. Consequently, our method enriches the pre-processing step by incorporating other representation schemes, such that each representation corresponds to a view that captures a particular aspect of the text. Hence, each document has three distinct vector representations. The views are then presented as parallel layers and mapped onto a low-dimensional subspace using the SOM architecture in order to uncover a latent structure. Finally, we run a clustering algorithm to obtain distinct document clusters. The overall process is illustrated in Fig. 1.
3.1 Document Representation
Vector Space Model (VSM). The Vector Space Model is the most commonly used method for text representation [19]. It is an algebraic model in which documents and terms are represented as vectors in a multidimensional space, collected in a term-document matrix whose values correspond to Term Frequency-Inverse Document Frequency (tf-idf) weights [18]. TF-IDF is a weighting scheme that scores the importance of a word in a document based on the number of times it occurs in the document itself and in the entire collection of documents. The tf-idf weight is calculated from two values:
The Term Frequency (TF) measures how frequently a term appears in a document:
$$\begin{aligned} tf_{ij} = \frac{n_{ij}}{\sum _{k} n_{kj}} \end{aligned}$$(1)
where \(n_{ij}\) is the number of occurrences of term \(t_i\) in document \(d_j\).
The Inverse Document Frequency (IDF) measures the relevance of a term in the whole collection of documents; terms that are rare overall in the data set are considered more important:
$$\begin{aligned} idf_i = \log \frac{N}{df_i} \end{aligned}$$(2)
where \(N\) is the total number of documents and \(df_i\) is the number of documents containing term \(t_i\).
The intuition behind the TF-IDF weighting scheme is that the importance of a word should increase proportionally to the number of times that word appears in a document. Words such as “is”, “the”, and “a” are very common in the English language and tend to appear more often than other words, but they carry little meaning; therefore their values are offset by their idf weights.
As a result, the total tf-idf weight is calculated as follows:
$$\begin{aligned} w_{ij} = tf_{ij}\times idf_i \end{aligned}$$(3)
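As a sketch, the tf-idf weighting described above can be computed directly. The following pure-Python illustration is an assumption about one common variant (tf normalized by document length, idf as \(\log (N/df)\)); the paper does not fix the exact normalization.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a term-document tf-idf matrix from tokenized documents.

    tf is the raw count normalized by document length; idf is
    log(N / df). This is one common variant, chosen for illustration.
    """
    n_docs = len(docs)
    vocab = sorted({t for d in docs for t in d})
    # document frequency: number of documents containing each term
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    matrix = []
    for d in docs:
        counts = Counter(d)
        tf = {t: counts[t] / len(d) for t in vocab}
        matrix.append([tf[t] * idf[t] for t in vocab])
    return vocab, matrix
```

Note that a word occurring in every document (e.g. a stop word) gets idf 0 and is therefore weighted out, matching the intuition discussed above.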
Latent Dirichlet Allocation Model (LDA). The Latent Dirichlet Allocation model (LDA) is a probabilistic model that identifies the underlying topics in a corpus of documents [4]. LDA assumes that a document is represented by a distribution over topics, whereas a topic is represented by a distribution over words, following the generative process below:
For each document \(d_m\):

- Choose \(\theta _m \sim Dir(\alpha )\)
- For each word \(w_n\):
  - Choose a topic \(z_n \sim Multinomial(\theta _m)\)
  - Choose a word \(w_n \sim Multinomial(\beta _{z_n})\)

where:

- \(\alpha \) is the parameter of the Dirichlet prior on the per-document topic distribution.
- \(\beta \) is the parameter of the Dirichlet prior on the per-topic word distribution.
- \(\theta _m\) is the topic distribution of document \(d_m\).
- \(z_n\) is the topic assigned to the \(n^{th}\) word of document \(d_m\).
The Dirichlet distribution over \(K\) topics is given by:
$$\begin{aligned} Dir(\theta \mid \alpha ) = \frac{\Gamma \left( \sum _{k=1}^{K}\alpha _k\right) }{\prod _{k=1}^{K}\Gamma (\alpha _k)}\prod _{k=1}^{K}\theta _k^{\alpha _k-1} \end{aligned}$$(4)
where \(\Gamma \) denotes the Gamma function.
The LDA model is used along with Gibbs sampling [5] to discover the topics, represented by a document-topic matrix.
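To make the estimation concrete, here is a toy collapsed Gibbs sampler that produces a document-topic matrix. The function name, hyperparameters (\(\alpha \), \(\beta \)), and iteration count are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def lda_gibbs(docs, n_topics, n_vocab, alpha=0.1, beta=0.01,
              n_iter=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    `docs` is a list of token-id lists. Returns the document-topic
    proportion matrix theta (n_docs x n_topics).
    """
    rng = np.random.default_rng(seed)
    n_docs = len(docs)
    ndk = np.zeros((n_docs, n_topics))   # document-topic counts
    nkw = np.zeros((n_topics, n_vocab))  # topic-word counts
    nk = np.zeros(n_topics)              # topic totals
    z = []                               # topic assignment per token
    for d, doc in enumerate(docs):       # random initialization
        zd = rng.integers(n_topics, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]              # remove the current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional p(z = k | all other assignments)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) \
                    / (nk + n_vocab * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k              # resample and restore counts
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    # smoothed document-topic proportions
    theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True)
                             + n_topics * alpha)
    return theta
```

Each row of `theta` is the topic distribution \(\theta _m\) of one document and sums to 1; this matrix plays the role of the LDA view in the proposed method.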
Skip-gram Model. The Skip-gram model is a neural network that produces word embeddings, while retaining both semantic and syntactic information of words [16]. The vector representation is learned through the adjustment of the network weights, such that words that appear in the same context have similar vector representations. In order to obtain a vector representation for each document, a new feature space is built using the Skip-gram word representations. To this end, words are clustered by applying the K-means algorithm with the cosine distance. The obtained semantic cliques, i.e. the clusters of words, correspond to the new features from which a document-semantics matrix is built. The entries of this matrix are tf-idf weights of the semantic cliques, such that:
$$\begin{aligned} w_{ij} = \frac{tf_{ij}\times idf_i}{\sqrt{\sum _{s}(tf_{sj}\times idf_s)^2}} \end{aligned}$$(5)
$$\begin{aligned} tf_{ij} = \sum _{\mathrm {t}\in S_i} tf(\mathrm {t},d_j) \end{aligned}$$(6)
$$\begin{aligned} idf_i = \sum _{w\in S_i} idf(w) \end{aligned}$$(7)
where \(w_{ij}\) is the weight of the semantic clique \(S_i\) in the document \(d_j\); \(tf_{ij}\) is the frequency of \(S_i\) in \(d_j\), that is, the sum of the tf weights of the terms \(\mathrm {t}\) belonging to clique \(S_i\) and document \(d_j\), as expressed by (6); and \(idf_i\) is the idf weight of \(S_i\), which is the sum of the idf weights of the words w in \(S_i\) (7). The denominator is for normalization.
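The clique-building step, i.e. K-means with the cosine distance over word embeddings, can be sketched as follows. This is a minimal spherical K-means illustration; the function name and the initialization and iteration scheme are assumptions, and in practice the embedding matrix would come from a trained Skip-gram model.

```python
import numpy as np

def semantic_cliques(word_vecs, n_cliques, n_iter=20, seed=0):
    """Group word-embedding rows into semantic cliques with a
    cosine-distance K-means (spherical K-means sketch).

    Returns an array of clique labels, one per word.
    """
    rng = np.random.default_rng(seed)
    # on unit vectors, cosine distance is monotone in the dot product
    X = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    centers = X[rng.choice(len(X), n_cliques, replace=False)]
    for _ in range(n_iter):
        # assign each word to the center with the largest dot product
        labels = np.argmax(X @ centers.T, axis=1)
        for k in range(n_cliques):
            members = X[labels == k]
            if len(members):
                c = members.mean(axis=0)
                centers[k] = c / np.linalg.norm(c)  # re-project to sphere
    return labels
```

The resulting labels define the cliques \(S_i\); summing the tf and idf weights of a clique's member words per document, as in Eqs. (6)-(7), then yields the document-semantics matrix.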
3.2 SOM for Multi-view Clustering
After the generation of the views in three different feature spaces, each document is represented by three vectors: the traditional representation with tf-idf, the topical vector with LDA, and the vector based on semantic cliques with Skip-gram. These vectors constitute the inputs to the SOM neural network, such that each view is an input layer. We first recall the basic concepts of SOM, then we adapt the SOM architecture for multi-view document clustering.
3.2.1 Basic Concepts of SOM
The self-organizing map is an unsupervised neural network model, also known as the Kohonen network, that projects high-dimensional input vectors onto a low-dimensional space [10]. The obtained feature space is a topological map that enables the partitioning of the input data into similar groups. Generally, SOM consists of two layers: the input layer contains the input vectors as represented in their original space, and the output layer consists of the nodes of the map. Moreover, each node, i.e. neuron, on the map corresponds to a set of inputs. The learning algorithm of SOM is as follows:
1. Initialization: start with random values for the initial weights \(w_i\).

2. Matching: determine the winning neuron c at time index t according to the smallest Euclidean distance:
$$\begin{aligned} c = \mathop {\text {arg min}}\limits _i ||x_j-w_i|| , \ i=1,2,\cdots ,M \end{aligned}$$(8)
where \(x_j\) is an input vector and M is the number of neurons.

3. Weights updating: adjust the weights of the winning neuron and its neighbors, such that:
$$\begin{aligned} w_i(t+1)= w_i(t)+\epsilon (t)\times h_{ci}(t)\times (x_j(t)-w_i(t)) \end{aligned}$$(9)
$$\begin{aligned} h_{ci}=\exp \left( -\dfrac{||r_c-r_i||^2}{2\sigma ^2(t)}\right) \end{aligned}$$(10)
where \(h_{ci}\) is the neighborhood kernel function, \(\epsilon (t)\in [0,1]\) is the learning rate, and \(\sigma (t)\) defines the width of the kernel; both values decrease monotonically with the time index. \(r_c\) and \(r_i\) are the position vectors of nodes c and i on the map.
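The three steps above can be sketched as a minimal SOM trainer. Map size, decay schedules, and the function name are illustrative assumptions; only the winner search (Eq. 8) and the neighborhood-weighted update (Eqs. 9-10) follow the algorithm as stated.

```python
import numpy as np

def train_som(X, rows, cols, n_epochs=20, eps0=0.5, sigma0=1.0, seed=0):
    """Minimal SOM: random init, Euclidean winner search, and
    Gaussian-neighborhood update with linearly decaying rate/width."""
    rng = np.random.default_rng(seed)
    n_nodes, dim = rows * cols, X.shape[1]
    W = rng.random((n_nodes, dim))              # step 1: random weights
    # grid positions r_i of the map nodes
    pos = np.array([(r, c) for r in range(rows) for c in range(cols)],
                   dtype=float)
    for t in range(n_epochs):
        eps = eps0 * (1 - t / n_epochs)         # decaying learning rate
        sigma = sigma0 * (1 - t / n_epochs) + 1e-3
        for x in X:
            d = np.linalg.norm(x - W, axis=1)
            c = np.argmin(d)                    # step 2: winning neuron
            # Gaussian neighborhood kernel h_ci over map positions
            h = np.exp(-np.linalg.norm(pos - pos[c], axis=1) ** 2
                       / (2 * sigma ** 2))
            W += eps * h[:, None] * (x - W)     # step 3: weight update
    return W
```

After training, each input vector's best matching unit (the node minimizing Eq. 8) gives its position on the map.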
3.2.2 Multi-view SOM
In order to adapt SOM for multi-view clustering, the three obtained views are used as input layers instead of a single layer. To this end, each view is a separate layer that corresponds to a feature space: VSM, LDA, or Skip-gram; hence each document is represented by three numeric vectors \(x_1\), \(x_2\), and \(x_3\). Each input layer corresponds to an output layer consisting of a two-dimensional topological map, as shown in Fig. 2. The weight vectors are randomly initialized and updated after computing the winning neuron. More precisely, for each pair of corresponding input and output layers, the distances between input vectors and weight vectors are computed to find the best matching neuron, i.e. the neuron with the smallest Euclidean distance. Lastly, in order to assign a document to a final output node, an overall distance combining the per-view distances is calculated.
After the training step of SOM, the output map consists of clusters, such that a cluster can correspond to one neuron or a set of neurons that belong to the same class. Hence, the positions of the neurons on the map are important: neighboring neurons have similar vector representations and indicate a particular pattern in the data. Moreover, the topological map can be exploited to determine the properties of the data, since similar documents are represented by neurons located in neighboring regions of the map. The training algorithm of this step is given in Algorithm 1.
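The final assignment step can be sketched as follows. Since the exact combination rule of the per-view distances is not reproduced here, this sketch assumes the overall distance is the sum of the per-view Euclidean distances; the function name and interface are also illustrative.

```python
import numpy as np

def multiview_bmu(views, weight_maps):
    """Assign each document to a final output node by an overall
    distance across views, taken here as the sum of the per-view
    Euclidean distances (an assumed combination rule).

    `views`: list of (n_docs x dim_v) arrays, one per view.
    `weight_maps`: list of (n_nodes x dim_v) arrays, same order.
    """
    n_docs = views[0].shape[0]
    n_nodes = weight_maps[0].shape[0]
    total = np.zeros((n_docs, n_nodes))
    for X, W in zip(views, weight_maps):
        # pairwise Euclidean distances between documents and nodes
        total += np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    return np.argmin(total, axis=1)  # winning node per document
```

A weighted sum (one weight per view) would be a natural variant when the views have very different scales.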
3.2.3 Clustering
Although the SOM provides clusters of documents given by the neurons on the map, the number of output neurons is, however, larger than the desired number of clusters. A neuron on the map is associated with a single document or a set of documents; thus one neuron may represent a cluster, while other neurons are “empty” and have no documents assigned to them. Therefore, in order to have multiple neurons representing a cluster, a clustering algorithm is applied at a second level. The output neurons on the map are grouped into a number of distinct regions corresponding to the desired number of clusters. Consequently, each input document is assigned to a cluster according to its nearest output neuron. This process is illustrated in Fig. 3.
Given that neighboring neurons have similar vector representations and distant neurons on the map are considered unrelated, a distance matrix can be calculated and used for agglomerative hierarchical clustering.
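The second-level grouping can be sketched as a small agglomerative procedure over the neuron distance matrix. The linkage criterion is not specified in the text, so this sketch assumes single linkage; the function name is also illustrative.

```python
import numpy as np

def cluster_neurons(W, n_clusters):
    """Group SOM weight vectors into the desired number of clusters
    with a simple single-linkage agglomerative scheme (a sketch; the
    linkage criterion is an assumption)."""
    clusters = [[i] for i in range(len(W))]
    # pairwise Euclidean distance matrix between neuron weights
    D = np.linalg.norm(W[:, None] - W[None, :], axis=2)
    while len(clusters) > n_clusters:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance of the closest member pair
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)  # merge the two closest clusters
    return clusters
```

Each returned group is a map region; every document then inherits the cluster of its nearest neuron.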
4 Experimental Results
In order to evaluate the performance of the proposed method, we conduct several experiments on different data sets with regard to three evaluation measures. First, we carry out a comparison with the equivalent single-view methods. Second, our method is evaluated against two existing methods for multi-view clustering.
4.1 Data Sets Description
Four well-known data sets for text mining are used in the experiments. The Reuters R8 data set is a subset of the Reuters data set with 2189 documents belonging to 8 classes. The 20 Newsgroups data set consists of 2828 news articles distributed across 20 classes. The WebKB data set is a collection of 4168 web pages collected from computer science departments, belonging to 4 classes (student, faculty, project, course). The BBC Sport data set consists of 737 documents from the BBC Sport website, corresponding to sports news articles in 5 areas: football, rugby, tennis, athletics, and cricket. Before applying the clustering algorithms, a preprocessing step is performed on the data sets, including stop-word removal, which eliminates common words that appear frequently and offer no additional semantic value.
4.2 Evaluation Measures
To measure the quality of the clustering and compare it with existing methods, three evaluation measures are utilized: the F-measure [13], the Normalized Mutual Information (NMI) [29], and Purity [17]. Given a set of clusters \(C=\{c_1, c_2, \ldots , c_k\}\) and the gold standard classes \(G=\{g_1, g_2, \ldots , g_j\}\):
F-measure is a trade-off between Precision and Recall, defined as their harmonic mean: \(F = \frac{2\times Precision \times Recall}{Precision + Recall}\).
Normalized Mutual Information (NMI) measures the quality of the clustering with regard to the number of clusters and their sizes. A common normalized form is \(NMI(C,G)=\frac{I(C;G)}{\sqrt{E(C)E(G)}}\), where I is the mutual information and E is the entropy.
Purity measures the fraction of correctly assigned documents, where each cluster is assigned to the dominant class in that cluster: \(Purity(C,G)=\frac{1}{N}\sum _k \max _j |c_k \cap g_j|\), with N the total number of documents. The larger the number of clusters, the higher the Purity tends to be; unlike NMI, Purity cannot trade off the quality of the clustering against the number of clusters.
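The two measures can be computed directly from the cluster and class labels. This stdlib-only sketch assumes the square-root normalization for NMI, which is one common convention; the function names are illustrative.

```python
import math
from collections import Counter

def purity(clusters, classes):
    """Fraction of documents assigned to their cluster's dominant class."""
    n = len(classes)
    total = 0
    for c in set(clusters):
        members = [g for k, g in zip(clusters, classes) if k == c]
        total += Counter(members).most_common(1)[0][1]
    return total / n

def nmi(clusters, classes):
    """Normalized mutual information I(C;G) / sqrt(E(C) * E(G))."""
    n = len(classes)

    def entropy(labels):
        return -sum((c / n) * math.log(c / n)
                    for c in Counter(labels).values())

    pc = Counter(clusters)
    pg = Counter(classes)
    i = 0.0
    for (c, g), n_cg in Counter(zip(clusters, classes)).items():
        p_cg = n_cg / n
        # mutual information term: p(c,g) * log(p(c,g) / (p(c) p(g)))
        i += p_cg * math.log(p_cg / ((pc[c] / n) * (pg[g] / n)))
    hc, hg = entropy(clusters), entropy(classes)
    return i / math.sqrt(hc * hg) if hc and hg else 0.0
```

Both functions return values in [0, 1], with 1 indicating a clustering that perfectly matches the gold-standard classes.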
For all measures, the values range from 0 to 1, such that values closer to 0 represent poor quality.
4.3 Experimental Results
For the training, SOM requires predefining the size and structure of the network, the neighborhood function, and the learning rate. These parameters are generally selected experimentally.
For the first comparison, the proposed method is evaluated against single-view clustering. Each view corresponds to a single text representation model: the VSM, the LDA model, or the Skip-gram model. We use the classic SOM for all single-view methods.
Given the results in Table 1, we notice that the best results were scored on the BBC Sport data, with F-measure, NMI, and Purity of 0.797, 0.658, and 0.734 respectively, which could be related to the size of the data set. The lowest results were scored on 20 Newsgroups, the data set with the largest number of classes. However, the LDA method scored the highest NMI and Purity on the Reuters data, with 0.485 and 0.642 respectively. Furthermore, the results show that the performance of the single-view methods varies with each data set. The Skip-gram model outperformed the other single-view methods on three data sets (20 Newsgroups, WebKB, and BBC Sport), while VSM gave better results than LDA on BBC Sport. This shows that each view presents different data patterns.
To further evaluate the proposed method, we compare it against two other multi-view clustering methods: the multi-view K-means (MVKM) proposed in [3] and the multi-view ensemble method (MVEM) in [8]. The results in Table 2 are averaged over 50 runs for each method. The proposed approach outperformed the other two methods. MVKM is based on the tf-idf representation and co-training of K-means, while MVEM is based on ensemble techniques that aggregate the individual clusterings obtained from each view. The results show that taking into account different representations and preserving the topology of each view can improve the clustering results in comparison to other multi-view clustering methods.
5 Conclusion
In this paper, we have proposed a novel method for multi-view text clustering. Different from existing works, our method explores the use of three representation models, i.e. VSM, LDA, and Skip-gram, to generate different views that respectively capture the syntactic, topical, and semantic aspects of text. Moreover, we exploit the SOM to handle the issue of high dimensionality by mapping the different views onto a low-dimensional space. Furthermore, SOM helps maintain the topological properties of each view while uncovering the intrinsic structure of the multi-view data to improve the clustering results. The conducted experiments show that, in comparison to single-view clustering, using multiple views improves the clustering quality. The experiments also show that the proposed method yields better results than other multi-view clustering methods.
References
Aggarwal, C.C., Zhai, C.: Mining Text Data. Springer, Boston (2012). https://doi.org/10.1007/978-1-4614-3223-4
Allan, J.: Topic Detection and Tracking: Event-Based Information Organization, vol. 12. Springer, Boston (2012). https://doi.org/10.1007/978-1-4615-0933-2
Bickel, S., Scheffer, T.: Multi-view clustering. In: ICDM, vol. 4, pp. 19–26 (2004)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Bolstad, W.M.: Understanding Computational Bayesian Statistics, vol. 644. Wiley, Hoboken (2010)
Ding, Z., Fu, Y.: Low-rank common subspace for multi-view learning. In: 2014 IEEE International Conference on Data Mining, pp. 110–119. IEEE (2014)
Guo, Y.: Convex subspace representation learning from multi-view data. In: AAAI, vol. 1, p. 2 (2013)
Hussain, S.F., Mushtaq, M., Halim, Z.: Multi-view document clustering via ensemble method. J. Intell. Inf. Syst. 43(1), 81–99 (2014). https://doi.org/10.1007/s10844-014-0307-6
Johnson, S.C.: Hierarchical clustering schemes. Psychometrika 32(3), 241–254 (1967)
Kohonen, T.: The self-organizing map. Proc. IEEE 78(9), 1464–1480 (1990)
Kumar, A., Daumé, H.: A co-training approach for multi-view spectral clustering. In: Proceedings of the 28th International Conference on Machine Learning (ICML 2011), pp. 393–400 (2011)
Kumar, V., Minz, S.: Multi-view ensemble learning: an optimal feature set partitioning for high-dimensional data classification. Knowl. Inf. Syst. 49(1), 1–59 (2015). https://doi.org/10.1007/s10115-015-0875-y
Larsen, B., Aone, C.: Fast and effective text mining using linear-time document clustering. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 16–22. Citeseer (1999)
Liu, J., Wang, C., Gao, J., Han, J.: Multi-view clustering via joint nonnegative matrix factorization. In: Proceedings of the 2013 SIAM International Conference on Data Mining, pp. 252–260. SIAM (2013)
MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, vol. 1, pp. 281–297 (1967)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Nie, F., Cai, G., Li, X.: Multi-view clustering and semi-supervised classification with adaptive neighbours. In: AAAI, pp. 2408–2414 (2017)
Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24(5), 513–523 (1988)
Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Commun. ACM 18(11), 613–620 (1975)
Serrat, O.: Social network analysis. In: Knowledge Solutions, pp. 39–43. Springer, Singapore (2017). https://doi.org/10.1007/978-981-10-0983-9_9
Shieh, S.L., Liao, I.E.: A new approach for data clustering and visualization using self-organizing maps. Expert Syst. Appl. 39(15), 11924–11933 (2012)
Yin, Q., Wu, S., He, R., Wang, L.: Multi-view clustering via pairwise sparse subspace representation. Neurocomputing 156, 12–21 (2015)
Yin, Q., Wu, S., Wang, L.: Unified subspace learning for incomplete and unlabeled multi-view data. Pattern Recogn. 67, 313–327 (2017)
Zhai, C., Massung, S.: Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. Association for Computing Machinery and Morgan & Claypool (2016)
Zhang, G.Y., Wang, C.D., Huang, D., Zheng, W.S., Zhou, Y.R.: Tw-co-k-means: two-level weighted collaborative k-means for multi-view clustering. Knowl.-Based Syst. 150, 127–138 (2018)
Zhao, H., Ding, Z., Fu, Y.: Multi-view clustering via deep matrix factorization. In: AAAI, pp. 2921–2927 (2017)
Zhao, L., Chen, Z., Yang, Y., Wang, Z.J., Leung, V.C.: Incomplete multi-view clustering via deep semantic mapping. Neurocomputing 275, 1053–1062 (2018)
Zhao, X., Evans, N., Dugelay, J.L.: A subspace co-training framework for multi-view clustering. Pattern Recogn. Lett. 41, 73–82 (2014)
Zhuang, F., Karypis, G., Ning, X., He, Q., Shi, Z.: Multi-view learning via probabilistic latent semantic analysis. Inf. Sci. 199, 20–30 (2012)
Zong, L., Zhang, X., Zhao, L., Yu, H., Zhao, Q.: Multi-view clustering via multi-manifold regularized non-negative matrix factorization. Neural Networks 88, 74–89 (2017)
© 2020 Springer Nature Switzerland AG

Fraj, M., Hajkacem, M.A.B., Essoussi, N. (2020). Self-Organizing Map for Multi-view Text Clustering. In: Song, M., Song, I.Y., Kotsis, G., Tjoa, A.M., Khalil, I. (eds.) Big Data Analytics and Knowledge Discovery. DaWaK 2020. Lecture Notes in Computer Science, vol. 12393. Springer, Cham. https://doi.org/10.1007/978-3-030-59065-9_30