1 Introduction

Nowadays, the web and social media have become the main sources of large amounts of text data, which surge across the Internet from channels such as social media feeds, news websites, learning forums, and electronic book libraries [1]. In order to manage and organize such large document collections, document clustering may be used. Document clustering is a technique for partitioning a set of documents into distinct clusters, mainly based on content similarity [2]. Clustering is an essential way to learn without supervision and discover the inherent structure of unlabeled data [3]. This technique facilitates comprehensive analysis in the field of text mining in different applications such as recommender systems [4,5,6], disease detection and categorization [7], spam detection [8,9,10], information retrieval [11], topic modeling [12], time series analysis [13], sentiment classification [14], and text summarization [15].

The three principal steps of a document clustering algorithm are text preprocessing, document representation, and clustering. The clustering performance largely relies on the quality of the document representation, in which the raw documents are commonly converted into numerical feature vectors to make various clustering algorithms applicable. The quality of a document representation can be assessed from two aspects: semantic quality and statistical quality [16]. Semantic quality refers to the interpretability of the feature vector and the degree to which it describes the content of the document. The most popular document representation method, known for its intuitive and simple interpretability, is the BoW model, which represents a document by a feature vector based on its word frequencies [17]. Although BoW is simple to interpret, it suffers from excessive dimensionality and, as the number of unique words increases, it cannot retain the proximity information. Statistical quality measures how well the feature vector can distinguish documents belonging to different categories. Some methods, such as convolutional neural networks (CNNs) [18], recurrent neural networks (RNNs) [19], and Doc2Vec [20], create low-dimensional vectors to represent documents that also preserve the proximity information. Nevertheless, the obtained representations are difficult to interpret because the value of each feature is computed through complicated structures of neural network weights. To overcome the weaknesses of previous document representation methods, another approach has recently been introduced, which leverages concepts [21]. After embedding the words, concepts are extracted by clustering the word vectors, and documents are represented based on these concepts. Therefore, the document vectors have reasonable dimensions.

The use of labeled data to improve the quality of clustering is attracting increasing attention [22]. The main idea is that the unlabeled data capture the overall structure of the clusters, and a small number of labeled data adjust the cluster centroids. This technique utilizes both labeled and unlabeled data to obtain a better clustering model and is referred to as semi-supervised clustering [23]. In general, semi-supervised clustering methods are divided into two categories: similarity-based and search-based approaches. In similarity-based algorithms, the underlying similarity criterion is adapted according to the constraints and labels of the supervised data. The methods proposed by Zhang et al. [24] assume that a mixture model generates the data population. In these methods, the unlabeled data is labeled and used to train the model. However, it is not clear how much data re-labeling is required and how reliable this re-labeling will be. In search-based methods, the clustering algorithm itself is modified so that the labels or restrictions provided by the user bias the search toward a proper partition. The TESC semi-supervised clustering method [25] uses a search-based clustering approach to classify texts. The primary purpose of this method is to improve the quality of classification, not clustering. It uses semi-supervised clustering to identify the components of the corpus and leverages these components to predict labels for unlabeled data.

In this paper, we propose a novel semi-supervised document clustering method that, unlike some current popular semi-supervised clustering algorithms [23, 24], uses both unlabeled data and a restricted set of labeled data simultaneously in the clustering process. To the best of our knowledge, none of the current semi-supervised clustering methods uses the conceptual representation of documents. This representation has the benefit of placing documents with common concepts in the same cluster, unlike other representation methods that construct the document vectors directly from word occurrences. It can capture the underlying characteristics and distinctive features of documents while preserving their interpretability. It also maintains the proximity information of the documents and, as shown in the experiments, significantly improves the clustering quality.

After embedding the words in the corpus, its components are identified by extracting concepts from the words, and the documents are represented based on the defined concepts. This representation of the documents is used in the semi-supervised clustering process. The proposed method can arbitrarily use the labeled data in the concept extraction phase or the clustering phase. Although document representation based on concepts has previously been adopted by some studies outside the context of semi-supervised clustering, in this paper we propose the novel idea of semi-supervised concepts for document representation. The semi-supervised concepts capture both the semantic components of the corpus and their correspondence to the class labels of the data. The proposed model is explainable, providing a deeper understanding of the corpus and a more transparent operation logic for reasoning. In addition, as the documents are expressed as a collection of labeled concepts, an intuitive understanding of the comprising semantic components along with their label correspondence is achieved. In the semi-supervised clustering phase, the unlabeled documents are used to identify the overall structure of the clusters, and a small number of labeled documents are utilized to determine the centroids more precisely.

We conduct a comprehensive evaluation of the proposed method using three well-known benchmark datasets and compare the proposed approach with multiple baseline and state-of-the-art algorithms. The evaluation results show that the proposed method has a significant advantage over previous semi-supervised document clustering methods. The main contributions of this paper are as follows:

  • Proposing a novel semi-supervised document clustering framework based on the conceptual representation of the documents.

  • Proposing approaches to involve a limited set of labeled documents in either the concept extraction or the document clustering phases.

  • Proposing the notion of semi-supervised concepts for document representation.

  • Conducting a comprehensive evaluation for analyzing and comparing the proposed methods.

The rest of the paper is organized as follows. In Section 2, we review current studies related to both document representation and semi-supervised clustering. The research objectives are introduced in Section 3. Our proposed semi-supervised document clustering framework is presented in Section 4. Experimental results, detailed analysis, and discussion are presented in Section 5. Ultimately, our work is concluded in Section 6.

2 Related work

2.1 Document representation

Word embedding [26] is a set of language modeling techniques in natural language processing where semantic relationships of words can be captured in a low-dimensional and dense vector space. One of these techniques is the Word2Vec [27] method which is based on the assumption that words and terms that occur in similar contexts tend to have similar meanings. Le et al. [20] introduced the Doc2Vec model, which uses textual information from words and documents mutually to learn the representation of documents in a continuous vector space. Research has shown that Doc2Vec is more efficient than Word2Vec in solving clustering and classification problems [28]. In addition, due to the smaller dimensions of the generated document vectors, it is more efficient than BoW. Nevertheless, Doc2Vec has low interpretability, and the logic behind the generation procedure of document vectors is unclear.

In this paper, we represent documents based on concepts. A summary of studies using conceptual factorization for document representation is provided in [29]. Kim et al. [21] proposed the Bag-of-Concepts (BoC) model, which creates concepts by clustering the word vectors generated by Word2Vec. It then creates the document vector using the frequencies of concepts in the documents. BoC is an unsupervised method that does not provide a specific solution for clustering and classification applications. Lee et al. [30] introduced a concept-based representation which derives the conceptual knowledge from an external knowledge base. They also reduce the concept ambiguity by clustering concepts in an attempt to enhance BoC. Lou et al. [31] proposed a concept-based scheme for clustering and visualization of biomedical documents in which the concept embeddings are learned through neural networks. Another study [32] presents a decomposition method that generates concept vectors by identifying semantic word communities in a weighted word co-occurrence network extracted from the short text collection. Based on the idea that each entity may have different aspects reflected in different documents, a representation scheme is proposed in [33]. Each entity is modeled through different aspects, where each aspect consists of a mixture of latent topics. The Bag-of-Senses model [34] is based on the assumption that a document is a set of senses of words, and the senses are considered instead of the words. The sense of a word is estimated based on the documents in which it occurs. In text analysis applications, the performance of feature-based techniques can be significantly hampered by word mismatch and ambiguity problems. As a solution, a concept-based approach is proposed in [35] that uses domain-specific ontologies to support automated document classification. In [36], a conceptualization method using a Tagged Bag-of-Concepts (TBoC) is presented to detect sentiment in short texts.

To the best of our knowledge, no previous study aimed at clustering documents using a semi-supervised approach has adopted and analyzed the concept representation of documents. In addition, in contrast to previous studies, we propose a semi-supervised method of concept extraction to generate labeled concepts.

2.2 Semi-supervised clustering

Semi-supervised clustering is developed as an alternative to conventional unsupervised methods where partial domain knowledge is employed in the clustering process to improve its performance. A comprehensive survey of some semi-supervised clustering algorithms is provided by Basu et al. [37].

Inspired by the purpose of label propagation for semi-supervised learning, Zheng et al. [38] developed a method that involves label prediction for unlabeled data. The combination of Naive Bayes and EM algorithms (NBEM) is also adopted for semi-supervised clustering [24]. The model iteratively labels the unlabeled data and employs this newly labeled data to re-train the model. Zhang et al. [25] developed a method called TESC for text classification using semi-supervised clustering. TESC assumes that the data set is composed of different components and uses a clustering process to capture these components.

To improve the clustering performance of documents with supervisory information, a semi-supervised approach to the factorization of concepts is proposed by Lu et al. [39]. They incorporate pairwise penalty and reward constraints in the concept factorization, which ensures that data points belonging to the same cluster in the original space remain in the same cluster in the transformed space. In another study, text clustering utilizing automatically generated constraints is used for document classification [40]. The clustering algorithm obtains a set of must-link/cannot-link constraints, and these constraints are then used as semi-supervision in the hierarchical clustering algorithm.

Li et al. [41] proposed a distributed semi-supervised clustering method. The clustering process is the same as TESC, except that distributed clustering is applied and the results of the sub-clusters are collected. In [42], a new method of selecting pairwise constraints from unlabeled data for semi-supervised document clustering is presented. A dense data group is selected from each initial cluster, and within these dense groups, the most informative data are identified by a local density estimation method. The identified data are used to form a set of constraint pairs for semi-supervised clustering. In the work of Gan et al. [43], the basic idea is that when the label of a labeled sample is risky, the predictions for the labeled sample and its nearest homogeneous unlabeled samples should be similar. This is achieved by unsupervised clustering followed by building a local graph to model the relationships between labeled samples and their nearest unlabeled samples.

In [44], a bag of phrases is used for text classification, incorporating phrases into the vector space model for the document classification task. Semi-Supervised Hierarchical Latent Dirichlet Allocation (SSHLDA) is used to extract phrases from the corpus. In [45], an embedding-based generative framework is used for semi-supervised text categorization based on the integration of labels, metadata, and text. In [46], a text clustering method based on a deep learning approach is proposed that combines (i) a Convolutional Siamese Network (CSN) based on pair constraints for representation learning, and (ii) the traditional K-Means algorithm for clustering the learned vectors without supervision. In [47], a semi-supervised approach is proposed to address the costs of labeling unlabeled data by combining a dynamic graph with a self-learning mechanism. In [48], a new selection criterion is proposed using a neighborhood construction algorithm for semi-supervised learning. Unlabeled data are selected close to the decision boundary, and to determine the correct labels for these data points, an agreement is formed between the classifier predictions and the neighborhood construction algorithm. In [49], Semi-supervised Deep Embedded Clustering (SDEC), a method based on deep learning, is presented. SDEC learns features that enhance the clustering task. It incorporates pairwise constraints in the feature learning process, such that data samples belonging to the same cluster are close together and data samples belonging to different clusters are far apart in the learned feature space. In [50], a discriminative semi-supervised NMF (DSSNMF) method is proposed which uses partial label information as a discriminative constraint. The DSSNMF method is investigated with two different cost functions and provides the relevant update rules for the optimization problems. A semi-supervised clustering approach based on deep metric learning is presented in [51]. To dynamically update unlabeled data to labeled data, they combine embedding with label propagation strategies and triplet loss with deep metric learning networks.

Although we also make simultaneous use of labeled and unlabeled data, our approach is different from previous approaches. In our proposed method and evaluations, a large amount of the data is unlabeled, and a limited number of labeled data may be used. This difference significantly reduces the cost of labeling data in real-world applications. Furthermore, most of the mentioned semi-supervised clustering methods neglect the document representation issue, which can largely affect the clustering outcome. In this paper, other than using the concept-based document representation in semi-supervised clustering, we propose the notion of labeled concepts generated through semi-supervised concept extraction.

3 Research objectives

This paper proposes a novel semi-supervised clustering method for documents based on their conceptual representation. We assume that the input document collection \(D=\left\{{D}^{l},{D}^{u}\right\}\) is divided into labeled (\({D}^{l}\)) and unlabeled (\({D}^{u}\)) documents. A labeled document \(d\) has a corresponding label belonging to the set of all labels \(L\) (\(label\left(d\right)\in L\)). Each document is composed of a set of words. The union of all words from all documents is denoted as \(W\). We aim to reach a clustering model \(C= \left\{{C}_{1},\dots ,{C}_{m}\right\}\) of the documents, such that \(\bigcup_{1\le i\le m}{C}_{i}=D\) and \({C}_{i}\cap {C}_{j}=\varnothing (1\le i\ne j\le m)\).

The main purpose of this study is to present a new framework for semi-supervised clustering of the document collection \(D\). The method of document representation plays an important role in text clustering, and this is further investigated in the experiments. The proposed method is based on the conceptual representation of documents, which offers a low-dimensional representation while reflecting the proximity information of documents. Using the concepts extracted from the set of words \(W\), documents are represented as concept vectors. Because labeling data is costly and tedious in real-world text analysis applications, one of the principal objectives of this study is to cluster unlabeled textual data using a limited number of labeled data \({D}^{l}\). For this purpose, depending on whether labeled data are used in the concept extraction or document clustering phases, we offer two approaches for semi-supervised clustering, both of which require only a small ratio of labeled data. In the former approach, we propose the notion of semi-supervised concepts, which simultaneously capture the intrinsic subjects in the collection and their label associations. We use an agglomerative hierarchical clustering algorithm for semi-supervised concept extraction in the former approach and for document clustering in the latter.

4 The proposed semi-supervised clustering framework

Figure 1 shows the detailed process of the proposed framework for semi-supervised clustering, described in terms of three phases: preprocessing, document representation, and clustering. The representation of documents is based on the concepts extracted from the corpus. The proposed clustering model is semi-supervised, benefiting from labeled documents to improve the clustering quality. Based on whether we want to use the labels in the representation phase or only in the clustering phase, two approaches are proposed. In the former, the document labels can be utilized in concept modeling, leading to the Semi-Supervised Concept Extraction (SSConE) approach. In the latter, the labels are used in the actual document clustering, leading to the Semi-Supervised Cluster Extraction (SSClusE) approach. Nevertheless, both methods yield a final clustering model of the documents. Table 1 presents the set of symbols used in the algorithm description. In the following subsections, we describe the three phases of the proposed method in detail.

Fig. 1

The overall process of the proposed semi-supervised document clustering framework. The two methods proposed in this paper are presented in two different colours: SSConE in red, and SSClusE in blue. The common steps of the two are marked in black

Table 1 Description of the symbols used in the algorithms

4.1 Preprocessing

Initially, documents are tokenized after eliminating stop-words and preprocessing the texts to obtain a set of words \(W\). The proposed algorithm requires that each word in the set \(W\) is represented with a vector in an embedded space. The embedding preserves the semantic relationships between the words. In the experiments, we use two popular word embedding models, namely Word2Vec [52] and BERT [53] to learn word associations from the input corpus.

If the document labels are to be used for concept modeling, a set of labeled words \({W}^{l}\) is required. To this end, a limited set of the most discriminating words \({W}^{l}\) is extracted from the set of labeled documents \({D}^{l}\) and labeled as follows. The TF-IDF model is used to assign a weight to each word \(w\) in each labeled document as calculated in Eq. 1, where \(t{f}_{w,d}\) denotes the number of occurrences of \(w\) in document \(d\), and \(d{f}_{w}\) is the number of labeled documents containing \(w\). A limited set of words with the highest weights is selected, and each selected word is labeled according to the document in which it occurs. To simplify the algorithm description, the unlabeled words are denoted as \({W}^{u}\), such that \({W}^{l}\cup {W}^{u}=W\). The labeled words are later used in the representation phase for semi-supervised concept extraction.

$$TF-IDF(w,d) = {tf}_{w,d}\times \mathrm{log}(\frac{|{D}^{l}|}{{df}_{w}})$$
(1)
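As an illustration of this word-selection step, the following sketch scores candidate words with Eq. 1 and keeps the highest-weighted ones together with the label of the labeled document in which they occur. The function name, the `top_k` budget, and the use of the natural logarithm are our assumptions, not prescribed by the framework.

```python
import math
from collections import Counter

def select_labeled_words(labeled_docs, top_k=50):
    """Select high TF-IDF words from labeled documents and label them.

    `labeled_docs` is a list of (tokens, label) pairs; `top_k` and the use
    of the natural logarithm are illustrative choices.
    """
    n_docs = len(labeled_docs)
    df = Counter()                                   # labeled docs containing each word
    for tokens, _ in labeled_docs:
        df.update(set(tokens))

    scored = []                                      # (weight, word, label) per occurrence
    for tokens, label in labeled_docs:
        for word, freq in Counter(tokens).items():
            weight = freq * math.log(n_docs / df[word])   # Eq. (1)
            scored.append((weight, word, label))

    labeled_words = {}                               # highest-weight label per selected word
    for weight, word, label in sorted(scored, reverse=True):
        if word not in labeled_words and len(labeled_words) < top_k:
            labeled_words[word] = label
    return labeled_words                             # the labeled word set W^l
```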

4.2 Representation

The representation of documents is based on the concepts extracted from the corpus. In the representation phase, we first extract a set of concepts \(T= \left\{{T}_{1},\dots ,{T}_{z}\right\}\) from the set of words \(W\), such that each concept consists of an exclusive set of words, i.e. \({T}_{i}\cap {T}_{j}=\varnothing (1\le i\ne j\le z)\) and \(\bigcup_{1\le i\le z}{T}_{i}=W\). The general procedure of concept extraction involves executing a clustering algorithm on the set of words \(W\) to partition it into several clusters, each representing a concept. The concepts may be extracted in a semi-supervised or unsupervised approach, as described next. The former requires the set of labeled words \({W}^{l}\). After constructing the concepts, each document \(d\) is represented by a concept vector \(\overrightarrow{d}\).

4.2.1 Semi-supervised concept extraction

The semi-supervised concept extraction employs a semi-supervised hierarchical clustering algorithm executed on the set of words \({W}^{l}\cup {W}^{u}\). Each resulting cluster is considered as a concept that may have a label based on the underlying words. The same procedure described here is later used in the semi-supervised clustering of documents in Section 4.3.2. In the clustering procedure, unlabeled data are used to capture the overall structure of data ingredients, while the labeled data are utilized to adjust the centroids of text ingredients.

Figure 2 shows the steps of semi-supervised clustering executed on a set of data points \(X\) to obtain a set of clusters \(P\). To use a consistent notation, we introduce a dummy label \(U\) assigned to the unlabeled data so that all data points have a label. Initially, a matrix is formed to store the distances between the data points. Each data point \({x}_{i}\) is considered a primary cluster \({S}_{i}\) with the same label as \({x}_{i}\). The main clustering loop executes at most \(|X|\) rounds, until the primary set of clusters has at most one member. Among all pairs of primary clusters, the two closest clusters, \({S}_{i}\) and \({S}_{j}\) (the two clusters with the smallest cosine distance between their centroids), are selected. Next, based on the labels of the primary clusters, either a new cluster is created (mergeable) or the two clusters are preserved (not mergeable). If neither cluster has the label \(U\) and label(\({S}_{i}\))\(\ne\) label(\({S}_{j}\)), the two clusters are preserved and added to the final partitioning (\(P\)). In the other cases, the two clusters are merged into a new cluster \({S}_{k}\) which replaces the selected clusters in the primary cluster set. The label of \({S}_{k}\) is the same as the label of \({S}_{i}\) and \({S}_{j}\) if they have similar labels, or otherwise the same as that of the one whose label is not \(U\). The algorithm iterates by selecting pairs of primary clusters. Final clusters with fewer members than a threshold are removed as outliers (remove noise clusters from \(P\)). The final clusters preserve their labels and represent the labeled concepts. The number of clusters \((m)\) is determined by the clustering algorithm and depends on the shape and number of the input data. Hence the number of concepts (\(z\)), and consequently the length of the document vectors (\(\left|\overrightarrow{d}\right|\)), is not predetermined and is variable.

Fig. 2

Semi-supervised clustering process
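To make the procedure of Fig. 2 concrete, a minimal sketch is given below. The function name `semi_supervised_agglomerative`, the dummy label `U = None`, and the `min_size` noise threshold are our assumptions; the exhaustive pairwise search over centroids is kept deliberately simple and reflects the cubic complexity discussed next rather than an optimized implementation.

```python
import numpy as np

U = None  # dummy label assigned to unlabeled points

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def semi_supervised_agglomerative(X, labels, min_size=3):
    """Sketch of the semi-supervised clustering of Fig. 2.

    X: (n, d) array of vectors; labels: per-point labels, with U for
    unlabeled points; min_size: assumed threshold for noise removal.
    """
    active = [{"idx": [i], "label": labels[i]} for i in range(len(X))]
    final = []

    def centroid(c):
        return X[c["idx"]].mean(axis=0)

    while len(active) > 1:
        # pick the two closest primary clusters (cosine distance of centroids)
        best = None
        for i in range(len(active)):
            for j in range(i + 1, len(active)):
                d = cosine_distance(centroid(active[i]), centroid(active[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        ci, cj = active[i], active[j]

        if ci["label"] is not U and cj["label"] is not U and ci["label"] != cj["label"]:
            # differently labeled clusters are not mergeable: move both to P
            final.extend([ci, cj])
            active = [c for k, c in enumerate(active) if k not in (i, j)]
        else:
            # mergeable: the new cluster inherits the non-dummy label, if any
            merged = {"idx": ci["idx"] + cj["idx"],
                      "label": ci["label"] if ci["label"] is not U else cj["label"]}
            active = [c for k, c in enumerate(active) if k not in (i, j)] + [merged]

    final.extend(active)                                      # at most one primary cluster remains
    return [c for c in final if len(c["idx"]) >= min_size]    # drop noise clusters
```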

As the set \(W\) is large and the complexity of the described semi-supervised clustering algorithm is \(O({\left|W\right|}^{3})\), the words may optionally be summarized before extracting the concepts to reduce the complexity (optional word summarization in Fig. 1). To this end, Spherical K-Means [54] clustering is used to cluster \(W\). The resulting clusters are further inspected: any cluster containing words belonging to more than one label is divided such that the resulting clusters are pure in terms of word labels. The cluster centroids are labeled according to their members and form the new pseudo-labeled words \({\widehat{W}}^{l}\), which are the input of the semi-supervised clustering algorithm.

4.2.2 Unsupervised concept extraction

If labeled data are not used in the concept extraction, the unsupervised Spherical K-Means clustering algorithm utilizing the cosine distance is used to cluster \(W\). The resulting clusters are considered as unlabeled concepts extracted from the words. Spherical K-Means is an unsupervised clustering method and requires a predetermined number of clusters. For this reason, the number of concepts extracted by this approach is fixed and set in advance.
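A minimal sketch of this step is shown below. It approximates Spherical K-Means by running ordinary K-Means on L2-normalized word vectors (so that Euclidean clustering on the unit sphere mirrors the cosine distance), and it keeps the tightest of several random restarts, following the protocol described in Section 5.4; the function name and default parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def extract_concepts(word_vectors, n_concepts=500, n_runs=5, seed=0):
    """Approximate Spherical K-Means: ordinary K-Means on L2-normalized
    vectors, keeping the run with the smallest within-cluster distance.
    `n_concepts`, `n_runs`, and `seed` are illustrative defaults.
    """
    X = normalize(word_vectors)                    # project word vectors onto the unit sphere
    best = None
    for r in range(n_runs):
        km = KMeans(n_clusters=n_concepts, n_init=1, random_state=seed + r).fit(X)
        if best is None or km.inertia_ < best.inertia_:
            best = km
    # labels_[i] is the concept id of word i; the centers can serve as concept prototypes
    return best.labels_, normalize(best.cluster_centers_)
```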

4.2.3 Document vector extraction

Clusters created by semi-supervised or unsupervised clustering are treated as concepts, and document vectors are constructed from these concepts. Due to the semantic space trained by the embedding method, words with similar meanings or common hypernyms are grouped in the same cluster. Consequently, each word in the text is considered a member of a concept, and each document can be considered a collection of concepts. As a concept that occurs in many documents is not a proper discriminator for machine learning applications [55], Concept Frequency-Inverse Document Frequency (CF-IDF) [17] (Eq. 2) is applied to the generated concepts to eliminate the adverse effects of concepts that are common among documents. In Eq. 2, \(c{f}_{c,d}\) denotes the number of occurrences of concept \(c\) in document \(d\), and \(d{f}_{c}\) is the number of documents containing concept \(c\).

$$CF-IDF\left(c, d\right)={cf}_{c,d}\times \mathrm{log }\frac{\left|D\right|}{d{f}_{c}}$$
(2)
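The following sketch builds CF-IDF document vectors according to Eq. 2, assuming each document has already been mapped to a list of concept ids (one id per word); the variable names are placeholders.

```python
import numpy as np

def cf_idf_vectors(docs_as_concepts, n_concepts):
    """Build CF-IDF document vectors (Eq. 2).

    `docs_as_concepts`: one list of concept ids per document (one id per word).
    """
    n_docs = len(docs_as_concepts)
    df = np.zeros(n_concepts)                      # number of documents containing each concept
    for doc in docs_as_concepts:
        for c in set(doc):
            df[c] += 1

    vectors = np.zeros((n_docs, n_concepts))
    for i, doc in enumerate(docs_as_concepts):
        for c in doc:
            vectors[i, c] += 1                     # concept frequency cf_{c,d}
        nz = vectors[i] > 0
        vectors[i, nz] *= np.log(n_docs / df[nz])  # inverse document frequency factor
    return vectors
```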

The length of the document vector may vary based on the extracted concepts. In semi-supervised concept extraction, the number of concepts is automatically determined by the semi-supervised clustering algorithm and depends on the data characteristics and the structure of the labeled data. In unsupervised concept extraction, however, the number of concepts is arbitrary and defined by the user. The effect of changing the number of concepts is evaluated in Section 5.5.1 (A). Changing the number of concepts in the latter approach changes the length of the document vectors. Therefore, the storage and computational complexities of the subsequent clustering step are under the control of the user.

4.3 Clustering

Now that the document vectors have been extracted, the document clustering task may be performed. It is expected that two documents containing similar concepts will have similar document vectors. According to whether the concepts are extracted in a semi-supervised or unsupervised manner, the clustering process of the documents is adjusted. In the former case, clustering is performed using the labeled concepts, while in the latter, a semi-supervised clustering of documents is required. The two alternative approaches are described below.

4.3.1 Clustering based on labeled concepts

Each entry in the document vector \(\overrightarrow{d}\) is associated with an extracted concept. If concepts are extracted from the set of words by the semi-supervised algorithm, each concept \({T}_{j}\) has a corresponding label. The clustering process then reduces to the task of identifying the most effective concept in each document vector. The label of this concept determines the label (cluster) of the document, \(label(d)\). To determine the most effective concept, different approaches may be employed. In this study, we propose and evaluate three approaches, as described in Section 5.4, and the most accurate of the three is formulated in Eq. 3. In this method, the label with the largest aggregate weight in the CF-IDF vector of \(d\) is assigned to the document.

$$label(d) = \underset{l}{\mathrm{argmax}}\sum_{label\left({T}_{j}\right)=l}\overrightarrow{{d}^{j}}$$
(3)
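A compact sketch of this assignment rule (Eq. 3) is given below; concepts carrying the dummy label are ignored, and the names used here are placeholders.

```python
from collections import defaultdict

def assign_cluster(doc_vector, concept_labels):
    """Assign a document to the label with the largest aggregate CF-IDF
    weight (Eq. 3). `concept_labels[j]` is the label of concept T_j, or
    None (the dummy label U), in which case the concept is ignored.
    """
    totals = defaultdict(float)
    for j, weight in enumerate(doc_vector):
        if concept_labels[j] is not None:
            totals[concept_labels[j]] += weight
    return max(totals, key=totals.get) if totals else None
```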

4.3.2 Clustering based on unlabeled concepts

If the process of extracting concepts from the set of words is performed without supervision, the concepts do not carry labels. Therefore, the document vectors are clustered with the help of the labeled documents through the semi-supervised clustering algorithm previously introduced in Fig. 2. The input of the algorithm (\(X\)) is set to the document vectors \(\bigcup_{d\in D}\overrightarrow{d}\), and the output (\(P\)) is the set of document clusters. As before, the documents without a label are assigned the dummy label \(U\). The algorithm produces clusters of documents, each carrying an appropriate label.
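Assuming the `cf_idf_vectors` and `semi_supervised_agglomerative` sketches above, this clustering step could be wired together roughly as follows; the input variables are placeholders.

```python
# doc_concept_ids: one list of concept ids per document (placeholder input)
# doc_labels: document labels, with None (the dummy label U) for unlabeled documents
doc_vectors = cf_idf_vectors(doc_concept_ids, n_concepts=1000)
clusters = semi_supervised_agglomerative(doc_vectors, doc_labels, min_size=3)
for c in clusters:
    print("cluster label:", c["label"], "size:", len(c["idx"]))
```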

Both alternative methods of semi-supervised document clustering introduced in this paper offer multiple benefits. Classic methods of text representation fail to maintain the non-linear semantic relationships between words, while the document vectors produced by some recent proposals such as Doc2Vec are not intuitive or understandable. The approach adopted in this paper not only preserves the non-linear semantic relationships between words but also has high interpretability and intuitiveness due to the extraction of concepts from the corpus. Since the concepts separate the components of the text, increasing the number of concepts can capture smaller subcategories and more fine-grained concepts of the text. As the algorithm requires only partially labeled data, the overhead and cost of tagging the documents may be kept low. Accordingly, in applications where large amounts of unlabeled data are to be clustered, such as social media analysis, the method can be effective. Finally, the logic behind the clustering process is clear, and the resulting clusters are explainable.

5 Experiments

Several experiments are conducted to show the effectiveness of the proposed method in terms of clustering quality. Three common natural language processing datasets with different numbers of samples and classes are employed in this paper: Reuters-21578, 20-Newsgroups, and WebKB. The proposed semi-supervised clustering approaches are labeled as SSConE and SSClusE in the experiments. The compared methods are divided into two categories: unsupervised methods and semi-supervised methods, which are described below.

5.1 Compared methods

5.1.1 Unsupervised algorithms

K-means

This algorithm divides the sample data into partitions (clusters) by minimizing the sum of squares within the clusters. The number of clusters should be defined before the clustering operation. This algorithm has been used in many data clustering applications.

Hierarchical clustering

The hierarchical algorithm identifies clusters by merging or splitting them successively [56]. The clustering operation is performed in a bottom-up or top-down manner by creating a tree with the root (all data from different classes) at the top and the leaves (individual data samples) at the bottom.

Spectral clustering

The basic idea behind spectral clustering is to use the standard clustering approach using eigenvectors of Laplacian matrices of data similarity [57].

Birch

Clusters are generated using a tree structure named the clustering feature tree. One of the characteristics of this method is that the nodes have many clustering features [58].

Deep embedded clustering (DEC)

This method uses deep neural network features for textual data clustering operations. Kullback-Leibler divergence is used to optimize and adjust the parameters of the model. DEC maps the data to the feature space using a stacked autoencoder [59].

5.1.2 Semi-supervised algorithms

An approach to TExt classification using Semi-supervised Clustering (TESC)

It uses an agglomerative hierarchical algorithm for clustering and then classifying documents. The algorithm identifies and labels the test data by merging the clusters using a semi-supervised approach, and then calculating the distance of the test instances to each cluster [25].

Doc2Vec

This baseline applies the same hierarchical clustering algorithm used in this paper to the document vectors extracted by Doc2Vec. This comparison can reveal the advantages of the proposed methods over Doc2Vec-based document representation. Doc2Vec converts documents into feature vectors with suitable dimensions. Due to the neural-network-based structure of the model training, there is no specific logic for the feature identification mechanism used in the document vector [20].

Discriminative semi-supervised non-negative matrix factorization for data clustering (DSSNMF)

This method uses the label information of a part of the data as supervision for the clustering process. In this research, two cost functions and new update rules are used for optimization [50].

Semi-supervised deep embedded clustering (SDEC)

This model learns features that are beneficial for clustering tasks using a deep learning approach. By considering the pairwise constraints during training, the data belonging to a cluster are located at a close distance to each other in the feature space [49].

5.2 Datasets

In order to show the performance of the proposed model and its applicability, three datasets, namely Reuters-21578, 20-Newsgroups, and WebKB, are used in the experiments.

The Reuters-21578 news dataset includes a collection of news items published on the Reuters website in 1987, which was collected and processed by Reuters’ personnel in 1991. Reuters contains 21,578 texts and 135 data classes that are manually labeled. There is an imbalance in the distribution of documents across the classes, so there may be fewer than 10 documents in one class and over 4000 in another. In this study, 9979 documents are used from the top 10 classes. A detailed breakdown of the classes and the number of documents in each class can be found in Table 2.

Table 2 Documents distribution in Reuters-21578

The second dataset is 20-Newsgroups, which includes 18,821 newsgroup posts on 20 topics. The distribution of documents across the classes is balanced. Some classes are very close to one another, while others are very far apart. The distribution of documents and classes in this dataset can be seen in Table 3.

Table 3 Documents distribution in 20-Newsgroups

The WebKB dataset contains web pages from the computer science departments of various universities. In total, 8282 web pages are categorized into six imbalanced categories (Students, Faculty, Staff, Department, Course, Project). A miscellaneous category is also included; we discarded it because its pages are very heterogeneous and cannot be compared to the rest. The Department and Staff classes are also discarded, as there were only a few pages from each university. Table 4 shows the distribution of documents in each class.

Table 4 Documents distribution in WebKB

5.3 Evaluation metrics

The Normalized Mutual Information (NMI) criterion is a clustering validation metric that assesses the quality of the clustering with respect to a given underlying labeling of the data. NMI measures how closely the clustering algorithm could reconstruct the underlying label distribution [60]. A value of 0 indicates a random cluster assignment, and values closer to one demonstrate that the clustering can recreate the true class membership. The NMI measure is defined in Eq. 4.

$$NMI= \frac{I(C;K)}{(H\left(C\right)+H(K))/2}$$
(4)

where \(I(X;Y) = H(X) - H(X|Y)\) is the mutual information between the random variables \(X\) and \(Y\), \(H(X)\) is the Shannon entropy of \(X\), and \(H(X|Y)\) is the conditional entropy of \(X\) given \(Y\).

The Silhouette Coefficient (SC) is a criterion used to assess the quality of a clustering. This criterion ranges from -1 to 1, and a higher SC score indicates that the clusters are better separated. According to Eq. 7, this criterion is calculated for each data sample in the clustering process:

$$IntraClusterDistance= \frac{1}{\left|{C}_{k}\right| \left(\left|{C}_{k}\right|-1\right)} \sum_{{w}_{i},{w}_{j} \in { C}_{k} , { w}_{i}\ne {w}_{j}}dist({w}_{i},{w}_{j})$$
(5)
$$InterClusterDistance =\frac{1}{\left|{C}_{k}\right| \left|{C}_{p}\right|} \sum_{{w}_{i} \in {C}_{k} , {w}_{j} \in {C}_{p}}dist\left({w}_{i},{w}_{j}\right)$$
(6)

where \(\left|{C}_{k}\right|\) and \(\left|{C}_{p}\right|\) are the numbers of points in clusters \(k\) and \(p\).

$$Silhouette\;Coefficient=\frac{InterClusterDistance-IntraClusterDistance}{max(IntraClusterDistance,InterClusterDistance)}$$
(7)

In these relations, \(IntraClusterDistance\) is the average distance between a data sample and all other samples in the same cluster, and \(InterClusterDistance\) is the average distance between a sample and all samples in the next closest cluster. The average of the SC scores of all data points gives the clustering SC. A score of 1 means that the clustering is very dense, while -1 indicates incorrect clustering, and 0 indicates overlapping clusters.
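Both metrics can be reproduced with standard library routines; the sketch below uses scikit-learn, with the arithmetic-mean denominator of Eq. 4 for NMI and, as an assumption, the cosine metric (the distance used throughout the clustering) for the silhouette. The variable names are placeholders for the outputs of the clustering step.

```python
from sklearn.metrics import normalized_mutual_info_score, silhouette_score

# true_labels, pred_labels: ground-truth and predicted cluster labels per document
# doc_vectors: the CF-IDF document vectors used for clustering (placeholders)
nmi = normalized_mutual_info_score(true_labels, pred_labels,
                                   average_method="arithmetic")   # Eq. 4
sc = silhouette_score(doc_vectors, pred_labels, metric="cosine")  # Eq. 7
print(f"NMI = {nmi:.4f}, SC = {sc:.4f}")
```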

5.4 Evaluation model

In the preprocessing phase, we perform typical tasks such as removing non-alphanumeric characters, stop word elimination, etc. To decrease the complexity of the word embedding, the infrequent words occurring less than five times in the entire dataset are eliminated. In the word summarization task of the semi-supervised concept extraction, according to the available resources, the word vectors are summarized to 2000 pseudo labeled vectors. Although different methods can be used for word embedding, the results are presented with the two popular Word2Vec and BERT embedding models. If unspecified, the default embedding is Word2Vec.

As described in Section 4.3.1, when concepts have labels, the task of assigning a document \(d\) to a cluster reduces to finding the most effective label in the document vector \(\overrightarrow{d}\). Three different methods are proposed and evaluated in this respect, which select the most effective label to be (I) the one having the largest weight in \(\overrightarrow{d}\), (II) the most frequent label in the concepts associated with \(\overrightarrow{d}\), and (III) the one having the largest aggregate weight in \(\overrightarrow{d}\) as described in Section 4.3.1 Eq. 3. All three methods have been tested, and the results of the clustering NMI can be seen in Fig. 3 when the percentage of labeled documents varies. As observed, the third method shows the highest NMI score in clustering the documents. In the rest of the experiments, this method is used for cluster assignment in SSConE.

Fig. 3

NMI score of document clustering for the proposed method (SSConE) with the different methods of label assignment for Reuters-21578

The selection of initial points in the spherical K-Means algorithm is essential and affects the clustering performance. To address this issue, when spherical K-Means is executed (e.g. when pseudo words are extracted), the algorithm is executed five times with random initial points, and the clustering with minimum within-cluster distance is selected. All the experiments are executed on a system with an Intel Core i5 processor and 6 GB RAM.

5.5 Results

5.5.1 Parameter study

Number of concepts

In the SSClusE model, the number of concepts defines the size of the document vector. Since larger document vectors provide a more fine-grained view of the semantic space, and the distances between documents are obtained by comparing the document vectors, the number of concepts may have a significant effect on the clustering quality. To measure this effect, an experiment is designed in which the number of concepts varies from 500 to 1500 and 25% of the data are labeled. Figure 4 and Table 5 show the values of NMI and SC for the SSClusE model, respectively. According to the results, when the number of concepts increases, resulting in larger document vectors, the quality of the created clusters improves in all three datasets.

Fig. 4

NMI score of document clustering for the proposed method (SSClusE) for different datasets when number of concepts varies (a) SSClusE (Word2Vec) (b) SSClusE (BERT)

Table 5 SC of document clustering for the proposed method (SSClusE) for different datasets when number of concepts varies

As an illustration, we opted for the highest number of concepts that could be identified in each dataset and assessed the clustering outcome using the lengthiest document vector. In all documents, words that occurred more than five times (min_freq = 5) are utilized to create concepts. This condition results in 7891, 24,301, and 6324 words remaining for Reuters-21578, 20-Newsgroups, and WebKB datasets, respectively. If we consider each word as a cluster or concept, the longest possible concept vector and document vector are created in the proposed method. We evaluated the document clustering results in this setting, and achieved the NMI values of 0.6867, 0.4796, and 0.5642 for the Reuters-21578, 20-Newsgroups, and WebKB datasets, respectively. Based on the results and Fig. 4, clustering quality improves gradually as the number of concepts increases. Therefore, if the user has sufficient computational and storage resources, incorporating more concepts can be beneficial.

According to Fig. 4(a), the NMI values for the Reuters-21578 dataset when the document vector is created based on 500 and 1500 concepts are 0.63 and 0.67, respectively. This improving trend can also be seen in the two other datasets. The values in Table 5 show that the SC score of clustering also increases with a larger number of concepts. For example, the clustering silhouette for the 20-Newsgroups dataset has increased from 0.05 to 0.08, and from 0.07 to 0.12, with the two different embeddings. Therefore, the number of extracted concepts shows a direct relationship with the clustering quality. If the number of extracted concepts is small, the concepts will be coarse-grained, each containing a large portion of the words. Such concepts may actually be combinations of several finer concepts. Therefore, describing documents in terms of these concepts results in a weaker ability to distinguish documents covering proximate yet different topics. If the number of concepts is chosen correctly, resulting in a suitable granularity of concepts, distinguishing the classes of documents is facilitated.

Embedding window size

In the proposed model, Word2Vec embedding is used to maintain the semantic relationships between words. One of the important parameters in embedding is the size of the sliding window. A larger window can capture more semantic relationships among the words, which may have a positive impact on extracting meaningful concepts. We change the window size from four to 20 and observe its effect on the document clustering quality. In this experiment, 25% of the data are labeled and the document vector size is 1000. Tables 6 and 7 present the results. The performance of the proposed model, as reported by NMI and SC, improves with larger window sizes, with the improvement being larger when changing the window size from four to eight, and marginal when increasing it to 20. For example, in Table 6, when the window size changes from four to 20, the NMI value increases from 0.59 to 0.70, indicating better quality. The SC criterion also shows an increasing trend. This quality improvement in clustering is achieved due to the better conceptual modeling of the corpus. This is explained by the Word2Vec neural network encountering more co-occurrences at each step and discovering expanded semantic relations among words. In order to reduce the time complexity, a window size of eight is used for training the Word2Vec model in the subsequent experiments.

Table 6 The performance of the SSClusE when the size of the window varies
Table 7 The performance of the SSConE when the size of the window varies

Word vector dimension

When it comes to embedding words, the dimension of the word vectors plays a crucial role in determining the outcome of document similarity measurement. Small values may not maintain semantic relationships correctly, while large values may result in a huge computational overhead. Hence, it is crucial to determine a suitable and efficient value by experimenting according to the inherent properties of each dataset. Table 8 shows the results of document clustering with the proposed method when the word vector dimension changes from 500 to 5000. In this experiment, 25% of the data are labeled and the window size is 8.

Table 8 The performance of the SSClusE when the Word2Vec dimension varies

Upon closer examination of Table 8, it becomes apparent that as the dimension of the word vectors increases, the clustering performance and quality improve. However, this increase is not consistent and reaches a point where further changes in the word vector dimension have no noticeable effect on the clustering performance. For example, in Reuters-21578, the best clustering performance is obtained with Word2Vec vectors of length 1500. This value is 2500 and 1000 for the 20-Newsgroups and WebKB datasets, respectively.
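For reference, a minimal gensim (4.x) configuration reflecting the embedding settings studied above might look as follows; window=8 and min_count=5 follow Section 5.4, vector_size=1500 is the best-performing dimension reported for Reuters-21578, and `tokenized_corpus` is a placeholder for the preprocessed token lists.

```python
from gensim.models import Word2Vec

# tokenized_corpus: a list of token lists, one per document (placeholder)
w2v = Word2Vec(sentences=tokenized_corpus, vector_size=1500, window=8,
               min_count=5, workers=4, seed=0)
# keep one vector per retained vocabulary word for the concept extraction step
word_vectors = {w: w2v.wv[w] for w in w2v.wv.index_to_key}
```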

5.5.2 Effect of the percentage of labeled documents

The number of labeled documents has a direct impact on the quality of the resulting clustering. To observe this effect, we conducted another experiment in which the size of the labeled data varied. This configuration can assess the performance of the model when encountering different ratios of labeled data. In particular, it can reveal whether the model is robust when possessing only a tiny subset of labeled data and whether it can benefit from a large number of labeled data to improve its performance.

Figure 5 shows the performance of the proposed methods compared to other semi-supervised methods when the percentage of labeled data varies. The number of concepts is set to 500 in this experiment. According to Fig. 5(a), the clustering performed with the proposed method on the Reuters-21578 dataset outperforms the other methods. When increasing the number of labeled documents, the value of NMI, and consequently the clustering quality, increases. For example, in the worst case, when 5% of the documents are labeled, the NMI values of SSConE (BERT), SSClusE (BERT), SSConE (Word2Vec), SSClusE (Word2Vec), TESC, Doc2Vec, DSSNMF, and SDEC are 0.4953, 0.4016, 0.42, 0.29, 0.19, 0.32, 0.30, and 0.43, respectively. When a large portion of the documents (30%) are labeled, the NMI value of SSClusE (BERT) reaches 0.7128, which is the highest among all methods. For the Silhouette Coefficient in Fig. 5(b), with 30% labeled documents, the value of SC for SSConE (BERT) is 0.2182, and in the worst case, with 5% labeled data, the SC is 0.13. In terms of SC, the proposed methods maintain their superiority over all other methods. The results of this experiment reveal that the proposed methods can benefit from a small amount of labeled data, while their performance improves when having access to more labeled data.

Fig. 5

The performance comparison of different models at the various percentage of labeled documents on Reuters-21578, with 500 concepts (a) NMI score (b) Silhouette coefficient

Figures 6 and 7 show the performance of the proposed methods compared to other semi-supervised document clustering methods with 1000 concepts on Reuters-21578 and 2000 concepts on 20-Newsgroups, respectively. From the diagrams in Figs. 5, 6, and 7, it can be seen that the proposed methods are superior to nearly all other methods when a small set of labeled data is available. With more labeled documents, the difference between the NMI of the proposed methods and the other methods increases. This indicates that access to more labeled data can improve the clustering quality of the proposed methods. In addition, it confirms that the algorithm design benefits from a larger labeled set in discovering the clusters in the data. Another interesting observation is that when less labeled data is available, in some cases unsupervised concept extraction (SSClusE) performs better than semi-supervised concept extraction (SSConE) on 20-Newsgroups. Gradually, with the increase in the amount of labeled data, SSConE takes over. The SC results show that the proposed methods are able to create denser and more distinct clusters than the other methods. By extracting and decomposing the text components, the proposed methods take an important step towards better clustering of documents compared to other methods. Using labeled data, the cluster centers are better tuned, and a more reliable clustering can be expected.

Fig. 6

The performance comparison of different models at the various percentage of labeled documents on Reuters-21578, with 1000 concepts (a) NMI score (b) Silhouette coefficient

Fig. 7

The performance comparison of different models at the various percentage of labeled documents on 20-Newsgroups, with 2000 concepts (a) NMI score (b) Silhouette coefficient

5.5.3 Comparison with other methods

Table 9 compares the clustering performance of the proposed method with several semi-supervised and unsupervised methods on all three datasets. For this examination, 25% of the data is labeled and used by the semi-supervised methods, and the evaluation has been done with 1000 concepts or features. Similar to the settings mentioned earlier in Section 5.4 for the Spherical K-Means algorithm, K-means is executed five times with random initial points, and the clustering with the minimum within-cluster distance is selected. The Doc2Vec and TESC methods were executed multiple times with varying parameters, and the average results of the top five clustering runs are reported in Table 9. The length of the document vectors in all methods is set to 1000, and the window sizes of the Word2Vec (for K-means clustering) and Doc2Vec methods are both set to eight. For a comprehensive evaluation of the proposed framework, independent of the embedding method, we have also used the embedded vectors of the BERT language model [53] (SSClusE-BERT and SSConE-BERT in Table 9). To substitute the Word2Vec vectors, the average of the last four layers of the pre-trained BERT-BASE model is employed for each word, yielding a vector length of 768. The proposed SSConE and SSClusE algorithms outperform all other techniques in terms of SC and NMI on all datasets, which confirms that the proposed methods can learn the semantic feature vectors of documents. Using the BERT embeddings produces higher-quality clusters.

Table 9 The performance comparison of different algorithms

For the Reuters-21578 dataset, after the proposed methods, the semi-supervised SDEC and DSSNMF methods compete with each other, followed by TESC (BoW) and Doc2Vec. Generally, the semi-supervised algorithms perform better than unsupervised clustering, especially in terms of NMI. Recall that the performance of the SSConE method does not depend on the number of concepts, because this method uses a hierarchical algorithm that automatically determines the number of concepts. For the 20-Newsgroups and WebKB datasets as well, the proposed methods achieve a performance that is superior to the other models in all configurations. This shows that using conceptual modeling for document representation is effective in document clustering.

5.5.4 Interpretability

Representation of documents using concepts provides high interpretability and a clearer understanding of the corpus. To illustrate this interpretability, we use the SSConE model, where each final cluster of documents corresponds to an extracted concept. Thus, the important words in each cluster are the same as those in the corresponding concepts. Using the TF-IDF weighting of Eq. 1, we can determine the important words in the documents of each cluster. Table 10 shows five important words in some randomly chosen concepts of Reuters-21578. By analyzing the words within each concept, it becomes clear what the concept pertains to and why the related documents are grouped together in their respective clusters. For example, in the first displayed concept, the top five words (construction, metals, hut, operatorship, coal) are all related either to raw materials or to construction, which matches the actual class label, "crude", in the Reuters-21578 dataset. As another example, the top words in the fourth concept (financial, Dubai, uneconomic, dominance, minimizing) are related to monetary and economic topics, which is consistent with the actual class label, "earn".

Table 10 Five important words in some extracted concepts of Reuters-21578

Words that occur in the same context are placed close to each other in the embedded space, and the formed concepts maintain the semantic similarities between the words. To demonstrate this more effectively, an experiment was conducted in which the average cosine distance between the words within each concept (Eq. 5) and between the words of a concept and those of its nearest neighboring concept (Eq. 6) are measured. A larger \(InterClusterDistance\) and a smaller \(IntraClusterDistance\) imply better interpretability, because the words in a concept \(k\) are closer to each other than to the words in the nearest neighboring concept \(p\). Table 11 shows the average values of \(IntraClusterDistance\), \(InterClusterDistance\), and SC over all concepts in the Reuters-21578, 20-Newsgroups, and WebKB datasets. As can be seen, the proximity of a word to the other words within its own concept is greater than its proximity to the words in the nearest neighboring concept. This demonstrates the preservation of semantic similarity among the words within a concept.

Table 11 Average values of IntraClusterDistance, InterClusterDistance, and SC for all concepts in Reuters-21578, 20-Newsgroups, and WebKB

Furthermore, we conducted an experiment to gauge the sensitivity of the clustering methods to subtle changes in the extracted concepts. By removing a percentage of words randomly from each concept, we observed the resulting changes in document clustering. The results of this experiment are presented in Table 12. At each step of the test, 10% of the words in each concept are eliminated (except for concepts with fewer than 5 words), and then the documents are clustered with these new concepts. For this examination, 25% of the data is labeled and used by the semi-supervised method, and the evaluation has been done with 400 concepts or features. The tests were conducted five times, and the average values are reported in Table 12. In the worst case, we observe a decrease of 0.0388 in NMI and 0.0177 in SC.

Table 12 NMI and SC score of document clustering for the proposed SSConE method when number of words in each concept varies

After analyzing Table 12, it is evident that the decline in clustering performance was minimal, indicating that the proposed clustering method is capable of withstanding variations in concepts.

5.5.5 Discussion

The proposed framework represents documents using dense vectors in the space of the constituent concepts of the collection. As observed in the experiments, it achieved better performance compared to methods using other document representation schemes such as TF-IDF (BoW) and Doc2Vec. Unlike neural-network-based methods [20, 46, 49], which have low interpretability, concept-based algorithms have high interpretability, and the logic of creating the document vectors is well visible. In neural-network-based document representation methods, it is not clear how each of the document vector elements is calculated and what feature it describes. The BoC [21] representation, although based on concepts, is unsupervised, does not provide a direct solution for text clustering or classification, and was shown to be less effective compared to the proposed framework. Some other representation methods require an external knowledge base [30] or a domain-specific ontology [35] beside the document collection. Other representation approaches, based on model graphs [33], adaptive sliding windows [34], and Latent Dirichlet Allocation [44], may be more complex to extend if labeled data are available and also do not provide a clustering solution. In another method presented for the conceptual representation of short texts [32], a network of semantic word communities has been used to extract concepts. Such methods may not be suitable for the longer texts used in this paper, due to their computational complexity.

From the algorithm perspective, some semi-supervised algorithms using a small set of labeled data concentrate on the classification task [45]. Some semi-supervised clustering approaches iteratively label the unlabeled data to use them in model training [24]. This labeling should continue until the model converges, and it is not clear how much labeling is required. Our model, in contrast, does not use such labeling and, being based on conceptual representation and hierarchical clustering, is shown to be superior in terms of clustering quality (Figs. 5, 6, and 7). Unlike the methods that cannot maintain the semantic similarity of words in vectors of small dimensions and that require model convergence [50, 57], the proposed method is able to maintain semantic concepts in vectors of reasonable length. In TESC [25], a semi-supervised approach is used to categorize documents using a large ratio of labeled data. We use a conceptual representation that is able to discover categories of the text at different granularities, and we use only a small amount of labeled data. These advantages can be beneficial in various real-world applications where labeled data is scarce. According to the results in Figs. 5, 6, and 7, the proposed method outperforms TESC in the face of large amounts of unlabeled data. Unlike some other studies [47, 48], our proposed framework is able to use labeled and unlabeled data simultaneously in the model training phase. It can identify the overall structure of the clusters with the unlabeled data and tune the center of each cluster with the help of the labeled data. In addition, the use of agglomerative hierarchical clustering and the discovery of concepts in the texts allow the proposed framework to benefit from labeled data in assigning unlabeled data to the appropriate cluster. Because the conceptual representation preserves the proximity information of documents as well as the non-linear relationships between words, it is more convenient to find similar documents compared to word-based approaches. Overall, the experimental results show that the proposed framework is highly effective in clustering documents and can benefit from labeled data to improve the quality of the results.

6 Conclusion

In this paper, we presented a concept-based method for semi-supervised document clustering. Our basic assumption was that the type of document representation affects the quality of document clustering and classification tasks. We adopted a concept-based representation built on concepts extracted using the semantic similarities between the corpus words. This method overcomes the limitations and weaknesses of other text representation methods based on BoW and document embedding. Additionally, the proposed method has the advantage of describing the documents in a low-dimensional space of concepts while offering high interpretability of the document vectors. Limited labeled data are used alongside unlabeled data to adjust the cluster centers. In this paper, two methods for engaging labeled data in the clustering process were introduced. In the first method, the labeled documents were employed in the concept extraction task, while in the second method, the labeled data were used in the actual document clustering step. The former method employs supervision in discovering the hidden structure in the embedded space of the words and captures labeled concepts. It allows for better discovery of the relations between the components of the text and the provided label information. Clustering is a fundamental step of many natural language processing tasks and has numerous practical applications in diverse areas. Semi-supervised clustering may be beneficial when dealing with tasks that have limited labeled data, such as social network analysis, question answering, topic modeling, concept hierarchy generation, and training deep neural networks, and when interpretability is crucial. As shown in the experiments, the results on three popular datasets demonstrate the superiority of the proposed methods, especially when a smaller number of labeled data are available. Although the experimental results are satisfying and promising, further extensions of our study can be performed. For example, introducing fuzziness in the produced concepts may yield more realistic modeling of the text components, as each word may participate in more than one concept. In addition, constructing hierarchical overlapping concepts can be investigated in accordance with hierarchical clustering. Keyword expansion may be used to enrich and generalize the set of extracted concepts when dealing with new documents.