
1 Introduction

Large text corpora are the basic resource for many researchers in the humanities and social sciences [20]. Therefore, there is a need to automatically categorize documents in terms of subject areas. One solution to this problem is to apply supervised text classification methods. The results reported in the literature [25, 28] are very promising, especially those based on BERT [3] deep neural networks. They show that it is possible to automatically assign text documents to subject categories. However, supervised approaches are often hard to apply in real-world scenarios because, in practice, most of the corpora analysed by researchers lack a consistent set of labels. Developing such labels is a costly process that also requires annotation rules. One could use an already trained classifier to process a new dataset, but it is highly probable that the documents used for training concerned other areas, and supervised models work well only on texts similar to the training data. Therefore, unsupervised approaches such as clustering [4, 11, 29] or document similarity visualisation in 2-D space [19, 30] are essential in practice.

Clustering and document similarity visualisation are quite similar processes. In most cases, they are based on representing documents by multidimensional feature vectors (document vectorization), calculating the similarities or distances between those vectors, and then applying a clustering [6] or dimensionality reduction algorithm [1]. In this paper, we focus on the latter.

The goal of similarity visualisation is to present documents (represented by feature vectors) as points on a 2D plane. Documents that are more similar should be closer together in the plot than objects that differ. Such charts allow people to easily interpret the corpus and find potential outliers. Often, one can find non-obvious relationships between groups of texts that exhibit subtle similarities hidden to the naked eye but traceable by multidimensional statistical techniques [20]. Similarity visualisation is also a very helpful tool in the process of defining labels for future supervised learning experiments.

The main problem in the application of similarity visualisation methods is the selection of the method's parameters. First, there is a large number of possible techniques for representing documents by feature vectors, ranging from the bag-of-words technique [26], through word embeddings [9], to deep neural network models like ELMo [18] or BERT [3]. Next, there are many available pretrained language models. There are also many dimensionality reduction algorithms, like PCA [6], t-SNE [10], or UMAP [12], and each of them has many hyperparameters. This raises the question of which combination of the above should be selected for visualization. There is no easy answer, because the result of the similarity visualisation is a scatter plot (see Fig. 1) that is interpreted by people. The aim of the paper is to find metrics that are consistent with human perception and allow different approaches to generating plots to be compared automatically. Such assessments will not only make it possible to generate better visualizations but will also allow easier selection of any parameter (like the vectorization method) for a dataset that is yet to be labeled.

This work is an extension of the research presented in [30]. We added a new corpus, used new methods of generating document vectors, applied new methods of dimension reduction (originally we considered only t-SNE), and finally proposed a much larger set of evaluation metrics (originally one, now five). The metrics, vectorization, and reduction methods were evaluated on corpora in Polish labelled in terms of subject area.

The paper is structured as follows. In Sect. 2 we shortly describe the vectorization methods that we used to transform documents into feature space. Next, in Sect. 3 we describe the three dimensionality reduction methods that were used in experiments. Section 4 contains descriptions of the proposed evaluation metrics. In Sect. 5 we discuss the datasets and our results. Conclusions are at the end of the paper.

2 Vectorization Methods

2.1 TF-IDF

The TF-IDF method is based on the bag-of-words concept [22], i.e., counting the occurrences of the most common terms (words or their n-grams) in the corpus (term frequencies). Next, these frequencies are weighted by the maximum term frequency in a document and by the inverse document frequency. In the performed experiments, we used the 1000 most frequent terms (words or bigrams).
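A minimal sketch of such a setup is given below, using scikit-learn's TfidfVectorizer. Note that scikit-learn applies its own TF weighting rather than the max-frequency normalisation described above, so the sketch only approximates our configuration; the corpus is a placeholder.

```python
# A hedged TF-IDF sketch: 1000 most frequent terms (words and bigrams),
# as in the experiments; the TF weighting differs from the paper's
# max-frequency normalisation, which would need a custom transformer.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["first example document", "second example document"]  # placeholder

vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)  # sparse matrix: n_docs x (up to) 1000
```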

2.2 fastText

The big step in the area of text analysis was the introduction of the word2vec method [9]. In this approach, individual words are represented by high-dimensional feature vectors (word embeddings) trained on a large text corpus. The most common solution for generating document features is to average the vector representations of individual words. This approach is known as doc2vec [13].

Due to the large number of word forms in morphologically rich languages such as Polish, there are two main approaches: to use lemmas (the text has to be lemmatized) or the word2vec extension [5] from the fastText package. The latter uses position weights and subword information (character n-grams), which allow embeddings to be generated for unseen words.

Both doc2vec and TF-IDF ignore word order. Therefore, these methods are not aware of word contexts.
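The averaging scheme can be sketched with the official fasttext Python bindings as follows; the model file name is a placeholder, and get_sentence_vector averages (L2-normalised) word vectors, which approximates rather than reproduces our exact pipeline.

```python
# A sketch of fastText-based document vectors. Subword n-grams let the
# model produce embeddings for unseen word forms, as described above.
import fasttext
import numpy as np

model = fasttext.load_model("kgr10.bin")  # placeholder Polish model file

def doc_vector(text: str) -> np.ndarray:
    # fastText expects a single line of text, so strip newlines
    return model.get_sentence_vector(text.replace("\n", " "))
```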

2.3 ELMo

The newest approaches in language modeling are inspired by deep learning algorithms and context-aware methods. The first successful one is called ELMo [18]. ELMo word embeddings are defined by the internal states of a deep bidirectional LSTM language model (biLSTM), which is trained on a large text corpus. Importantly, ELMo looks at the whole sentence before assigning an embedding to each word in it. Therefore, the embeddings are sentence-aware and can address the problem of polysemous words (words with multiple meanings). As the document feature vector, we used the average of the vectors of every sentence in the document. Generating sentence vectors is built into the model and consists in mean pooling of all contextualized word representations. The main problem with ELMo is its slow performance, caused by the bidirectional architecture of LSTM networks.
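The document-vector scheme can be sketched as follows using the ElmoEmbedder API from older allennlp releases (up to 0.9); the model files are placeholders, and the pooling built into the model used in the experiments may differ in detail.

```python
# A sketch, assuming allennlp <= 0.9: mean-pool contextualised token
# embeddings per sentence, then average the sentence vectors.
import numpy as np
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder(options_file="options.json",   # placeholder paths
                    weight_file="weights.hdf5")

def doc_vector(sentences):  # sentences: list of token lists
    sent_vecs = []
    for tokens in sentences:
        layers = elmo.embed_sentence(tokens)        # (3, n_tokens, 1024)
        sent_vecs.append(layers.mean(axis=(0, 1)))  # pool layers and tokens
    return np.mean(sent_vecs, axis=0)
```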

2.4 BERT

The next step was the use of the transformer [27] architecture for building language models. The state-of-the-art solution is BERT [3]. Due to its bidirectional representation, jointly built on both the left and the right context, BERT looks at the whole sentence before assigning an embedding to each word in it. As the document feature vector, we have used the CLS pooling method, i.e., the embedding of the initial CLS token.
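A sketch of CLS pooling with the HuggingFace transformers API is shown below; the model name is a placeholder for a Polish BERT, not necessarily the checkpoint listed in Table 1.

```python
# A minimal CLS-pooling sketch: the document vector is the last-layer
# embedding of the initial [CLS] token.
import torch
from transformers import AutoModel, AutoTokenizer

name = "allegro/herbert-base-cased"  # placeholder Polish BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def doc_vector(text: str) -> torch.Tensor:
    inputs = tokenizer(text, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state[0, 0]  # [CLS] token embedding
```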

2.5 Method Summary

The above vectorization methods, except TF-IDF, require pretrained language models. The names used in reporting the results and the sources of the models used are presented in Table 1.

The main drawback of ELMo, and partly of BERT, is that they require a GPU even for the vector generation phase. Using ELMo on a CPU is impractical due to very long processing times. The situation is slightly better in the case of BERT, but TF-IDF and doc2vec still work much faster on a CPU than BERT.

Table 1. Document vectorization methods and sources of language models

3 Reduction Methods

The aim of the reduction is to present documents on a 2D plane in order to visualise the distances or dissimilarities between them. The distances between points should reflect similarities in the original multidimensional space of feature vectors (generated as described in Sect. 2).

There are several methods that can be used for 2D visualisation of multidimensional feature vectors. They can be divided into two categories [12]: ones preserving the distance structure within the data, such as PCA [6] or multidimensional scaling [1], and ones that preserve the local distances over the global distances, like t-SNE [10], Laplacian eigenmaps, Isomap, and the newest, UMAP [12]. Within this work, we have analysed three methods: PCA, t-SNE, and UMAP.

3.1 PCA

PCA (principal component analysis) [17] is a traditional and widely used dimensionality reduction technique. It works by identifying linear correlations in the data while preserving most of the valuable information. The PCA algorithm is based on the principal components of the covariance matrix: a set of vectors, the first of which best fits the data (explains the maximum amount of variance), while the rest are orthogonal to it. To generate the low-dimensional space, we project each data point onto the most significant principal components, ignoring the less significant ones.

3.2 T-SNE

T-SNE, proposed in [10], is a non-linear dimensionality reduction method. It preserves the similarities between points, defined as normalised Gaussians over Euclidean distances in the original space. The bandwidth of the Gaussian is set by the bisection algorithm so that the resulting perplexity is equal to some predefined value. As a result, the bandwidth, and therefore the similarity, for each point is adapted to the local density of the data. The similarities in the low-dimensional space are modeled by a normalised Student's t-distribution. The t-SNE method minimises the Kullback-Leibler divergence between the similarities in both spaces with respect to the locations of the points in the low-dimensional space.

3.3 UMAP

Uniform manifold approximation and projection (UMAP) [12] constructs a high-dimensional graph representation of the data and then optimizes a low-dimensional graph to be as structurally similar as possible. It assumes that the data is uniformly distributed on a locally connected Riemannian manifold [12]. The edges of the UMAP high-dimensional graph represent the likelihood that each pair of data points is connected. UMAP connects (with edges) only points whose local radii overlap, where each point's local radius is set based on its distances to its neighbours. The number of neighbours is a hyperparameter of the method. Several papers have shown that UMAP outperforms other methods (including PCA and t-SNE) [12, 16, 24].
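For illustration, the three reductions can be applied to a matrix of document vectors as follows, using scikit-learn and the umap-learn package; the input matrix and the hyperparameter values are illustrative placeholders, not the grid used in our experiments.

```python
# A minimal sketch projecting document feature vectors to 2-D with the
# three methods analysed in this section.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap  # umap-learn package

X = np.random.rand(200, 300)  # placeholder for document feature vectors

xy_pca = PCA(n_components=2).fit_transform(X)
xy_tsne = TSNE(n_components=2, perplexity=30.0).fit_transform(X)
xy_umap = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(X)
```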

4 Evaluation Methods

The main problem addressed in the paper is the measurement of document visualization quality. A corpus of texts is mapped to the multidimensional space by one of the methods described in Sect. 2. Next, this set of high-dimensional vectors (each representing a single document) is projected to a 2D space using one of the methods described in Sect. 3. As a result, we obtain plots like those presented in Fig. 1. The question is: which combination of document feature vector generation methods, reduction methods, and their hyperparameters should be used? Or, in other words, how can we quantify individual plots so as to be able to choose the best one? We need a metric that allows visualisation results to be compared automatically, a metric that promotes results with well-separated classes.

This problem does not have a commonly accepted quality metric. Therefore, in this section, we propose five different methods. All these metrics are based on the assumption that, for method comparison purposes, we have a set of labels assigned to the documents. We assume that the documents within the same label are similar (at least some of them, as a group does not have to be unimodal) and documents assigned to two different labels are different. In other words, the points representing documents with the same label should be placed close to each other in the low-dimensional space and far away from points representing other classes.

4.1 Closest Match (CM)

In [30], we proposed a simple coherence score defined as the average (over all points) number of the k nearest neighbours belonging to the same class, i.e.:

$$\begin{aligned} \frac{1}{nk}\sum _{p}\sum _{o\in N_k(p)}I(c(p)==c(o)), \end{aligned}$$
(1)

where n is the number of points (documents), I is the indicator function, \(N_k(p)\) is the neighbourhood of p defined by the k closest points (using Euclidean distance in the low-dimensional space), and c(p) is the class of point p. The method is parameterized by k, the number of nearest neighbours used in the analysis. It measures how many neighbours (on average) of a given point belong to the same label. In our experiments, we used k equal to 10. In [30] we showed that the value of the metric depends on k in a similar way regardless of the dataset used. Therefore, the value of k is not essential (except for extreme values) for comparison purposes.
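A sketch implementation of Eq. (1) using scikit-learn is given below; we query k+1 neighbours because the nearest neighbour of each point is the point itself.

```python
# Closest Match (CM): average fraction of a point's k nearest neighbours
# that share its label, for 2-D points `xy` and integer labels `c`.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def closest_match(xy: np.ndarray, c: np.ndarray, k: int = 10) -> float:
    nn = NearestNeighbors(n_neighbors=k + 1).fit(xy)
    _, idx = nn.kneighbors(xy)          # idx[:, 0] is the point itself
    same = c[idx[:, 1:]] == c[:, None]  # neighbour label == own label
    return float(same.mean())
```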

4.2 KNN

To measure the quality of the reduction, one could also use any classifier trained on the two-dimensional data. Therefore, we generated ten folds (90% of the data were used for training) using the stratified K-fold strategy and calculated the average accuracy (Exact Match Ratio, MR) as the final score. The formula is as follows:

$$\begin{aligned} \textit{MR} = \frac{1}{n}\sum _{i=1}^{n}I(y_i == \hat{y}_i), \end{aligned}$$
(2)
$$\begin{aligned} \textit{KNN} = \frac{1}{K}\sum _{k}MR_k, \end{aligned}$$
(3)

where K is the number of folds, I is the indicator function, \(y_i\) is a true label of a sample i and \(\hat{y}_i\) is predicted label of the same sample.

We decided to use a simple KNN classifier (using ten nearest neighbours), which makes the score similar to the one from the previous section. However, almost any classifier could be used here. We originally started with the multilayer perceptron (MLP) [6]. However, MLP has a much higher computational cost than KNN, and in a preliminary experiment it gave results close to those of KNN. A similar approach, i.e., the KNN classifier, was proposed in [12].
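The score can be sketched as follows with scikit-learn; the fold assignment will differ from our runs, so exact values may vary.

```python
# KNN score: average accuracy of a 10-NN classifier over ten stratified
# folds, as described above.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def knn_score(xy: np.ndarray, c: np.ndarray) -> float:
    accs = []
    for train, test in StratifiedKFold(n_splits=10).split(xy, c):
        clf = KNeighborsClassifier(n_neighbors=10).fit(xy[train], c[train])
        accs.append(clf.score(xy[test], c[test]))  # exact match ratio per fold
    return float(np.mean(accs))
```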

4.3 ARI

Instead of using a classification algorithm, it is possible to use any clustering method. If the clusters obtained in the lower-dimensional space match the ground-truth labels, then the clusters should be visually separated. We use the adjusted Rand index [7] to calculate the correspondence.

$$\begin{aligned} n_{ij} = |X_i \cap Y_j|, \quad a_{i} = \sum _{j=1}^{p} n_{ij}, \quad b_{j} = \sum _{i=1}^{g} n_{ij}, \end{aligned}$$
(4)
$$\begin{aligned} ARI = \frac{ \sum _{ij} \binom{n_{ij}}{2} - \left[ \sum _{i} \binom{a_{i}}{2} \sum _{j} \binom{b_{j}}{2} \right] / \binom{n}{2} }{ \frac{1}{2} \left[ \sum _{i} \binom{a_{i}}{2} + \sum _{j} \binom{b_{j}}{2} \right] - \left[ \sum _{i} \binom{a_{i}}{2} \sum _{j} \binom{b_{j}}{2} \right] / \binom{n}{2} }, \end{aligned}$$
(5)

where \(X = \{X_1, X_2, ..., X_g\}\) defines the ground-truth labels, \(Y = \{Y_1, Y_2, ..., Y_p\}\) defines the predicted labels (clusters obtained by the algorithm), g is the number of true labels, and p the number of predicted ones (a parameter of the clustering algorithm, which we set equal to g). In the performed experiments, we use agglomerative clustering [2] with Ward linkage.
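A sketch of this score with scikit-learn:

```python
# ARI score: cluster the 2-D points into as many groups as there are
# ground-truth labels (agglomerative clustering, Ward linkage) and
# compare the clustering with the labels.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

def ari_score(xy: np.ndarray, c: np.ndarray) -> float:
    g = len(np.unique(c))  # p is set equal to g, as in the text
    pred = AgglomerativeClustering(n_clusters=g, linkage="ward").fit_predict(xy)
    return adjusted_rand_score(c, pred)
```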

4.4 Internal Similarity (INT-SIM)

The next metric we propose is the mean distance between samples in the same cluster, converted into a similarity measure, i.e.:

$$\begin{aligned} D(X)= \frac{1}{|C_X|}\sum _{(x_i, x_j) \in C_X}d(x_i, x_j), \quad T = \frac{1}{|S|}\sum _{k=1}^{|S|}D(S_k), \end{aligned}$$
(6)
$$\begin{aligned} \text {INT-SIM} = \frac{1}{T+1}, \end{aligned}$$
(7)

where d is the distance between two points, \(C_X\) is the set of pairs of points in a cluster X, and S is the set of clusters defined by the labels. The maximum value of this score is obtained when all samples with the same label are gathered at a single coordinate, which can be considered a defect. There is also a problem with clusters that are made up of subgroups occurring in different places (for example, the PRESS data in Fig. 1); the score will be lower in this scenario. To overcome this, we propose to use the DBSCAN [23] algorithm for each label in the data to obtain subgroups. In the performed experiments, the eps parameter of DBSCAN was set to the tenth percentile of the distance distribution between samples in the given group.
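A sketch of INT-SIM with the DBSCAN subgrouping is given below; how DBSCAN noise points were treated is not specified above, so the sketch simply skips them.

```python
# INT-SIM: DBSCAN splits each labelled group into subgroups (eps = 10th
# percentile of the within-group distance distribution); the mean pairwise
# distance is averaged over subgroups and converted to a similarity (Eq. 7).
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.spatial.distance import pdist

def int_sim(xy: np.ndarray, c: np.ndarray) -> float:
    means = []
    for label in np.unique(c):
        pts = xy[c == label]
        if len(pts) < 2:
            continue
        eps = np.percentile(pdist(pts), 10)
        sub = DBSCAN(eps=eps).fit_predict(pts)
        for s in np.unique(sub):
            if s == -1:
                continue  # skip DBSCAN noise points
            group = pts[sub == s]
            if len(group) > 1:
                means.append(pdist(group).mean())
    t = float(np.mean(means))
    return 1.0 / (t + 1.0)
```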

4.5 External Dissimilarity (EXT-DIS)

Finally, we propose the external dissimilarity score, defined as the normalized (divided by the largest) mean distance between different groups. We calculate the distance between two groups as the average distance between samples in one cluster and samples in the other. The final formula is as follows:

$$\begin{aligned} D(X, Y)= \frac{1}{|X||Y|}\sum _{x_i \in X}\sum _{y_j \in Y} d(x_i, y_j), \end{aligned}$$
(8)
$$\begin{aligned} \text {EXT-DIS} = \frac{\overline{D(X, Y)}}{\max _{i, j \in C}(D(X_i, X_j))}, \end{aligned}$$
(9)

where X and Y are clusters defined by labels and C is the set of pairs of clusters. The score promotes reductions with similar distances between groups and might lead to solutions with points from the same group concentrated in one place (similarly to the Internal Similarity score).
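A sketch of Eqs. (8)-(9):

```python
# EXT-DIS: mean of the average between-group distances, normalised by
# the largest such distance.
import numpy as np
from itertools import combinations
from scipy.spatial.distance import cdist

def ext_dis(xy: np.ndarray, c: np.ndarray) -> float:
    groups = [xy[c == label] for label in np.unique(c)]
    dists = [cdist(a, b).mean() for a, b in combinations(groups, 2)]
    return float(np.mean(dists) / np.max(dists))
```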

Fig. 1. Exemplary plots related to the highest, lowest, and middle (median) KNN score for every dataset.

5 Experiments

5.1 Datasets

In our experiments, we used three collections of text documents in Polish: Wiki, Press, and Qual. All were labelled in terms of subject area; therefore, we can assume that the similarity analysed by the metrics introduced in Sect. 4 is a semantic one.

The Wiki corpus consists of articles extracted from the Polish-language Wikipedia. It was created by merging two publicly available collections [14] and [15]. The original corpus is labeled with 34 subject categories. For clarity of the presented pictures, we selected a subset of 10 labels, namely: computers, music, aircraft, games, football, cars, chess, coins, shipping, and animation. The resulting corpus consists of 2,959 elements.

The second corpus, Press [31], consists of Polish press news. There are 6,564 documents in total in this corpus. The texts were assigned by the press agency to 5 subject categories (diplomacy, sport, disasters, economy/business, and transportation). All subject groups are very well separated, and each group contains a reasonably large number of members (ca. 1,300 documents per label) without big differences among label sizes.

The last dataset, Qual [11], includes documents containing descriptions of qualifications from a Polish public register of the Integrated Qualifications System and descriptions of degrees from Polish universities. The descriptions mainly consist of so-called learning outcomes statements, which characterize the knowledge, skills, and attitudes required to obtain a given qualification or degree. The data were manually labeled. The labels denote the sectors to which the qualifications belong. Similarly to the Wiki corpus, we selected a subset of seven labels, namely: economy, biology, industry, electronics, music, machines, and architecture. The final corpus consists of 1,419 documents.

Table 2. Scores related to plots in Fig. 1
Table 3. Statistics of all analysed metrics (scores) for all three corpora.

5.2 Results

For every corpus and the previously described vectorization and reduction methods (and many different hyperparameters of the latter), we generated a chart. In total, we obtained almost 3,500 two-dimensional scatter plots and evaluated them using our metrics. To measure their quality and effectiveness, we conducted several experiments. The first of them is based on a visual assessment of the correlation between the proposed metrics and the actual plots. In Fig. 1, we present, for each corpus, the three pictures that had the lowest, highest, and middle KNN score. It can be noticed that the top-rated figures contain visually clear and well-separated groups, while the worst-rated ones are rather indistinct. The behavior of the KNN measure thus follows the requirements stated in Sect. 4. Table 2 shows all examined metrics for all plots from Fig. 1. The KNN, ARI, and CLOSEST MATCH (CM) scores act similarly (although overall they promote different solutions), but the tendency for the remaining scores is the opposite. This means that the best solutions are those where the samples are not too close to each other (INT-SIM) and the distances between pairs of groups are not similar (EXT-DIS). Those two methods should not be used as out-of-the-box evaluation methods.

Fig. 2. Values of KNN and ARI metrics (larger values are better) for all methods and data sets. A point represents a single experiment. Experiments differ by data set, document vector generation method, 2-D projection method, and their parameters.

Table 3 shows statistics (average, standard deviation, maximum, and minimum) for each metric and corpus. The statistics are calculated over the different vectorization methods, reduction methods, and hyperparameters of the reduction algorithms. Figure 2 presents the results of each experiment as the relation between the KNN and ARI metrics for each of the three corpora.

First, it can be noticed that PCA gives the worst results and t-SNE the best. Mind that this statement applies only to the visual aspect of the method. In this paper, we do not address the problem of how well a given method preserves the features of the high-dimensional data; we focus only on the similarities and dissimilarities between documents. Surprisingly, the results for t-SNE outperform UMAP, although in the literature UMAP is considered a method that outperforms t-SNE [12, 24].

Moreover, we can notice that there is a strong correlation between KNN and ARI for the QUAL dataset, but for the other two corpora the relation disappears. This is probably due to the existence of different subgroups within the same label (multimodal data within each label). While KNN takes this into consideration, the ARI score is reduced because it is based on a clustering algorithm with a fixed number of classes (matching the number of labels in the ground truth). The nature of the algorithm used (i.e., agglomerative clustering) might cause wrong assignments in the low-dimensional space. The ARI score should rather be used for unimodal labels (visually, points of one label should not occur in different areas), or the number of groups in clustering should be set to at least twice the number of labels.

Since the PCA results are not significant, we focus only on t-SNE and UMAP in further analysis. Figure 3 shows the correlation between the KNN and ARI metrics for the remaining reduction methods. It is hard to determine which of the vectorization methods works best. It strongly depends on the corpus (the results follow the intuitive statement that document similarity is subjective) and the reduction method used. However, some tendencies can be noticed. First of all, kgr10 (the fastText model trained on the KGR10 corpus) is always in the top three. Secondly, we can notice the high position of the simple and old-fashioned TF-IDF method (red circles), especially for the KNN metric. It can be explained by the existence of keywords in each label. For example, "aircraft" (a label from WIKI) can be simply classified by the occurrence of words such as "aircraft" or "plane".

Moreover, the results also suggest that the dataset used for training the language models has a big influence on the results. It can be noticed by comparing the results achieved by models trained on the KGR10 corpus [8] (pink and violet) to the results obtained by the default models (orange and blue, respectively). The KGR10 results outperform the base models in the case of both the fastText and BERT models. Moreover, we see that fastText (pink) outperforms the BERT-based vectorization (violet). The fact that BERT-based methods are not well suited for similarity/distance calculation is well known in the literature [21]. The results for ELMo (brown) are surprising: ELMo performs badly on the WIKI dataset compared to its quite good results on PRESS and QUAL.

The other interesting pattern noticeable in the results is the relatively small influence of the reduction method hyperparameters (i.e., perplexity, learning rate, and number of iterations for t-SNE; number of neighbours and minimal distance for UMAP) on the KNN score. Points of the same color represent results from the same vectorization method but with varying values of the reduction method hyperparameters. It can be noticed that they group together and even form vertical lines in the case of t-SNE. This is probably due to the fact that the perplexity in the case of t-SNE and the number of neighbours in the case of UMAP have a big influence on creating subgroups within each label. And, as already stated, KNN is less sensitive to this feature than ARI.

Fig. 3. Values of KNN and ARI metrics (larger values are better) for all three datasets. Points differ by the vector generation method, the reduction method (UMAP and t-SNE), and the hyperparameters used.

Comparing the results of t-SNE and UMAP, we can notice that the resulting plots have some similarities, but they differ in details. This shows that each method focuses on different aspects of the multidimensional space.

6 Conclusion

In this work, we proposed a method to assess the visual quality of two-dimensional plots of document similarity obtained using the most popular dimensionality reduction methods. We proposed five metrics for the quantification of this quality. Based on the tests performed on the three corpora, we conclude that the classifier-based (KNN) and clustering-based (ARI) approaches, as well as the simple Closest Match score, can actually determine which of the generated figures better preserves the information about document similarity. This allows for an automatic search of the parameter space to find the optimal parameters. We also showed which of the used vectorization methods perform better in this task. The results suggest that fastText-based approaches outperform the BERT-based ones and that the language models for Polish trained on KGR10 outperform the others in the analysed problem. Even though we focused on texts in Polish, our approach can be used in virtually any problem in the field of data mining.

Although we have shown a convenient way to evaluate the appearance of the plots, there are several aspects that require further research. First, although we believe that the correlation between the human perspective and our scores holds, it is necessary to verify this thesis with a larger number of people using a survey. We hope that, having plots evaluated by people, we will be able to suggest a combination of the proposed scores as a final method (especially INT-SIM and EXT-DIS, which cannot be used alone). Second, we focused on solving the problem under the assumption that ground-truth labels are given. This is not always the case in real-world scenarios, and without labels even defining the goal becomes difficult. Building such measures would probably also require using context information (other reductions) rather than individual plots alone.