Keywords

1 Introduction

The task of short-text processing is important due to the increase of Internet content, such as news summaries, abstracts, forum messages, social networks, twitter etc. Clustering allows to obtain a structured representation of collections with automatic grouping of topically close documents.

We are interested in clustering academic papers. The solution to this task is required for structured representation of data within scientific domains, e.g. in academic search engines [14].

Abstracts are accessed from e-libraries. Many of these libraries usually provide free access to abstracts, while full papers require subscription or a payment. However, abstracts are brief summaries of the contents and are deemed sufficient for clustering academic papers with adequate results [5].

The task of clustering abstracts is closely related to a number of problems, e.g., the task of identifying whether short texts and processed texts may belong to the same topic, which requires the use of an approach to narrow-domain short text clustering [610]. One of the main problems with this type of clustering is high data sparseness [8]. If the documents come from the same source it is probable that all of them are topically close. In this case there may be a significant overlap of common words, which complicates the clustering task even more [8]. The corpus for testing and analysis of algorithm performance includes a set of collections that are widely used in the field (CICling, SEPLN-CICling, EasyAbstracts) [510]. Experiments implemented within the framework of the present work are also based on these collections.

K-means has been chosen for testing. In the related work, k-means is considered the base or one of the base algorithms for comparison [6, 7]. We have not found any papers describing the attempts to improve the quality of k-means performance in narrow-domain short-text clustering, researchers tend to use the base version. In [6, 7] it was shown that considering document clustering as an optimization problem ensures high performance. These algorithms yield better results as compared to the base version of k-means. In [12] the advantage of combining LSI with the optimization algorithm is presented, which shows the benefit from the use of LSI in short text clustering (together with algorithms based on particle swarm optimization [6]). We study how to increase the performance of k-means by applying LSI and also by extending the feature space with word groups. The word groups extraction uses linguistic patterns (e.g., NN_NN, NN_NN_NN, JJ_NN, NNS_NN, NN_NNS, JJ_NN_NN, etc., where NN denotes Noun, singular o mass, NNS - Noun, plural, JJ - Adjective; examples of extracted phrases: cluster algorithm, binary vector, maximum entropy model method, etc. that were built on the basis of collections labeled for key phrase extraction tasks.

2 Objectives and Data

2.1 Research Tasks

For the purposes of the present study several tasks were formulated. Firstly, we check whether it is possible to improve the quality of narrow-domain short-text clustering using k-means together with LSI [12]. Secondly, we check the possibility to increase the quality of k-means algorithm performance by using patterns. Thirdly, whether it is possible to improve the clustering quality through the application of both patterns and LSI.

2.2 Testing Dataset

The test set was formed from three collectionsFootnote 1 that are often used for testing narrow-domain short text clustering algorithms. All of them are sets of abstracts divided between four clusters by the expert. For each collection the “gold standard” or the best variant of grouping is known. Each collection includes 48 documents. CICling 2002 is considered one of the most difficult collections for clustering [68].

2.3 Pattern Extraction

For the purposes of pattern extraction we employed one of the most widely used collections in the field of keyword extraction - SemEval 2010, which was used in the competition of algorithms for keyword extraction TREC 2010 [13]. The collection includes documents and keywords defined by the expert characterizing each document. We used 23 patterns based on the most frequent keywords determined by the expert. We relied on the assumption that, in this way, we will be able to select patterns that are specific to the keywords in texts and also word sequences that reflect textual semantics. We assumed that the use of such sequences would increase clustering quality.

2.4 Clustering

Data Pre-processing. Word stems were used for text representation during clustering. Stemming was performed with Porter StemmerFootnote 2. Stop-words were removed using a standard list. PoS tags were assigned to each word for the pattern-based extraction of phrases using the Stanford PoS tagger.Footnote 3 Clustering Algorithm. K-means algorithm was chosen for testing in the present work. Each document is represented as a vector in the feature space defined by the main dictionary or within the feature space formed by the extended dictionary, which includes word groups extracted using patterns. The word order was taken into account during the retrieval of word groups. Upon the extraction each word group was transformed into another group (a set of words). The calculation of term/word group weight was performed using TF-IDF, the distance was measured by the cosine similarity between feature vectors. The number of clusters was the same as in the gold standard, which means that there were 4 clusters in each case. LSI maps the feature space to a semantic space of a lower dimensionality using singular value decomposition of the text-attributes matrix. As a result, each document has a vector representation in the semantic space. To perform LSI we need to indicate the dimensionality of the space where the data are mapped.

2.5 Evaluation

Clustering quality evaluation was performed in a classic way using the combined information on Precision and Recall of the resulting clusters [14, 15]:

$$\begin{aligned}&F = \sum _{i}\frac{G_{i}}{\left| D \right| } \max _{j} F1{_{ij}}, where \ F1_{ij} = 2 \times \frac{Precision_{ij} Recall_{ij}}{Precision_{ij}+Recall_{ij}}, \\&Precision_{ij} = \frac{\left| G_{i} \bigcap C_{j} \right| }{\left| G_{i} \right| }, Recall_{ij} = \frac{\left| G_{i} \bigcap C_{j}\right| }{\left| C_{j} \right| } \end{aligned}$$

\(G = \{G_{i}\}_{i=\overline{1,m}}\) - clusters generated by the algorithm, \(C=\{C_{j}\}_{j=\overline{1,n}}\) - clusters identified by the experts, D - number of documents in the collection.

3 Experiments and Results

In the course of the experiments two feature space settings were compared: the first was based on the main dictionary of the collection (main dict.) and the second - on the combination of the main dictionary and that of word groups obtained through pattern-based extraction from texts (ext.dict.). For each of the variants we compared the performance of k-means before and after applying the LSI. For each experiment there were 500 iterations on the basis of which the best (max), the worst (min) and the average (avg) results were selected. The results are presented in the Table 1. The values of semantic space dimensionality (“num”) that yield the best results are shown.

Table 1. K-means performance before and after applying LSI (“num” stands for semantic space dimensionality value; “main. dict.” indicates all cases where the feature space defined by the main dictionary of the collection was used; “ext. dict.” indicates cases where an extended feature space was applied)

The analysis of Table 1 allows us to put forward the following assumptions. Firstly, in case of deploying k-means algorithm based on the extended dictionary, no improvement (or a small one) in clustering quality is observed. The application of LSI to this feature space does not produce better results than when we use the main dictionary alone. It turns out that in case word groups are extracted using patterns, the results contain much noise (common phrases, such as “experiment results”).

Secondly, if the feature space is not expanded with word groups, its mapping to the semantic space with the optimal dimensionality value may lead to some increase in clustering quality. In order to check the feasibility of searching the necessary dimensionality value we modeled the dynamics of conditional dependency between the quality of k-means performance and the dimensionality value.

Fig. 1.
figure 1

Dependency of k-means performance on the dimensionality of the semantic space created using LSI (CICling and SEPLN-CICling)

Figure 1 presents the dependency of the average quality of k-means clustering, according to the results of 500 iterations, on the dimensionality of the semantic space created using LSI (for CICling and SEPLN-CICling). Notation: LSI, if the feature space is based on the main dictionary alone; LSI+P, if the dictionary of word groups is added. According to the diagrams, applying LSI representation, the best result on the narrow-domain short texts can be obtained if the dimensionality of the resulting semantic space is less than 10 (num \(<\) 10). Also, the diagrams show that when applying LSI most results are in the range from 0.40 to 0.50 (F). For CICling collection the maximum value of this range causes only insignificant increase in quality value obtained before using LSI. The real effectiveness of LSI application can be observed only for pre-defined values of space dimensionality (CICling - 6, SEPLN-CICling - 3). If these values cannot be identified a priori, the use of LSI is not feasible.

4 Conclusion

The present paper considers the problem of clustering narrow-domain short texts such as abstracts to academic papers. The purpose is to check the possibility of improving k-means performance on such collections using a feature space expanded with the dictionary of pattern-extracted word groups. In addition, we examined the possibility to increase clustering quality by applying LSI to project the feature spaces onto a semantic space of lower dimensionality. The results allow us to assume that the indicated modifications cannot be deemed feasible in practical terms (except when the optimal dimensionality value can be determined) as compared to the use of the simple k-means algorithm and the feature space defined by the main dictionary of the corpus.