1 Introduction

It is generally accepted that even such a simple unsupervised algorithm as the classic K-means achieves surprisingly good classification results if it is presented with appropriate feature vectors. Our previous research [8] confirmed that the well-established tf-idf vectors work rather well. The aim of the work presented in this paper was to test whether the recently introduced document embeddings produced by the doc2vec method [2, 4, 15] can further improve the performance.

2 Datasets

As our basic dataset, we have again picked the 20NewsGroups English corpusFootnote 1, which is widely used as a benchmark for document classification [1, 3, 5, 7, 8, 11, 12]. It contains 20 000 text documents evenly divided into 20 categories, each containing discussions about a specific topic.

The second data set, CNO, and all its sub-sets are in the Czech language. It contains approximately 68 000 articles divided into 31 categoriesFootnote 2. This corpus was created so that it is comparable to the English data set at least in size and partially also in topics.

The third group of data sets – TC and Large TC – consists of transcriptions of phone calls from the Language Consulting Center (LCC) of the Czech Language Institute of the Academy of Sciences of the Czech Republic, which provides a unique language consultancy service in matters of the Czech language. The counselors of the LCC answer questions regarding Czech language problems on a telephone line open to public calls. The data gathered from these language queries are unique in several aspects. The Language Consulting Center deals with completely new language material, so it is the only source of advice for new language problems. It also records peripheral matters that will never be explained in dictionaries and grammar books, as these are focused on the core of the language system.

In order to compare our results with the ones published previously, we have re-created two subdivisions of the 20NewsGroups corpus. The first one is created according to [12] and was also used in our previous work [8], where it is described in more detail.

The other subdivision is created in order to compare the results with the experiments described in [1, 3]. The 20NG1 data sub-set consists of 5 new categories (according to [1]) created by joining the original ones as follows: Motorcycle – Motorcycle and Autos; Hardware – Windows and MAC; Sports – Baseball and Hockey; Graphics – Computer graphics; Religion – Christianity, Atheism and misc. Furthermore, the authors divided this sub-set into three training/testing splits, using 50, 200 and 350 documents, respectively, as test data and the rest as training data.

The 20NG2 input is the whole unchanged 20NewsGroups corpus divided into training (13 000 documents) and testing (approximately 7 000 documents) data (the same division as in [3]).

The results achieved on the CNO and TC sets and sub-sets cannot be directly compared with the results of other research teams as the data are not (yet) made publicly available. However, these data are important for our own research and we decided to publish the results here to show some important properties of the doc2vec embedding (see the discussion below).

From the first Czech data – CNO – we have created the following subsets:

  • Set CNO consists of all 31 original categories. This results in approximately 68 000 documents in total.

  • Set RCNO1 consists of 11 original categories which contain at least 1000 documents.

  • Set RCNO2 consists of 10 original categories containing between 500 and 1500 documents.

  • The RCNO3 set is created from 12 categories, each containing 1000 documents randomly chosen from the original categories. This set was created so as to be similar to the 20NewsGroups corpus.

The TC and Large TC data sets were created from a corpus obtained by the LCC. These data sets consist of 607 manually transcribed parts of historical mono phone calls (each call can contain more than one part, each part with different questions about a different topic) and 3128 parts of recent stereo phone calls transcribed automatically (by an ASR systemFootnote 3), all divided into 20 categories by their topic. These 20 categories were manually assigned by counselors from the LCC (for example “semantics” or “lexicology”) and correspond to the higher level of the linguistic topic tree. The division of phone calls into categories is not uniform; some categories contain only a few parts. The setting is based on previous findings. TC consists of the mentioned 20 categories containing 3713 transcribed text parts of the phone calls. Since some of the categories are formed from a small number of texts (for example only 10), we responded by creating the Large TC data set, consisting of 10 original categories (3343 transcribed text parts) where each contains at least 100 text parts.

3 Preprocessing

The first processing step applies only to the 20NewsGroups data, where we removed all the headers except for the Subject. Then all uppercase characters were lower-cased and all digits were replaced by one universal symbol.
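A minimal sketch of this normalization step might look as follows (the regular expression and the placeholder digit symbol are our illustrative choices, not necessarily those used in the original experiments):

```python
import re

def strip_headers(raw_message):
    """For 20NewsGroups only: keep the Subject line, drop all other headers."""
    header, _, body = raw_message.partition("\n\n")
    subject = [l for l in header.splitlines() if l.lower().startswith("subject:")]
    return "\n".join(subject + [body])

def normalize(text):
    """Lower-case the text and replace every digit by one universal symbol."""
    return re.sub(r"\d", "0", text.lower())  # '0' stands in for all digits
```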

As the next processing step, we wanted to conflate different morphological forms of the given word into one representation. We opted for lemmatization. The MorphoDiTa [13] tool was picked for the task – it works for both English and Czech and is available as a Python package.Footnote 4
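For illustration, a lemmatization pass with the MorphoDiTa Python bindings could be sketched roughly as follows (the tagger model file name is only a placeholder; an appropriate Czech or English model must be obtained separately):

```python
from ufal.morphodita import Tagger, Forms, TaggedLemmas, TokenRanges

# Placeholder model name -- substitute the appropriate Czech or English tagger model.
tagger = Tagger.load("czech-morfflex-pdt.tagger")

def lemmatize(text):
    """Return the lemmas of all tokens in the given text."""
    forms, lemmas, tokens = Forms(), TaggedLemmas(), TokenRanges()
    tokenizer = tagger.newTokenizer()
    tokenizer.setText(text)
    result = []
    while tokenizer.nextSentence(forms, tokens):
        tagger.tag(forms, lemmas)
        result.extend(l.lemma for l in lemmas)
    return result
```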

A further preprocessing operation used in this paper is a variant of traditional stop word removal: only the top T lemmas with the highest mutual information (MI) are kept.
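One possible realization of this selection, assuming the mutual information is computed between lemma occurrence and the document category (the exact criterion of [8] may differ), is sketched below; T = 5000 matches the vector size used later:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

def top_mi_lemmas(documents, labels, T=5000):
    """Return the set of T lemmas with the highest mutual information."""
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(documents)                      # lemma counts
    mi = mutual_info_classif(X, labels, discrete_features=True)
    best = np.argsort(mi)[::-1][:T]
    return set(np.array(vectorizer.get_feature_names_out())[best])
```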

After applying all these processing steps we can create the following vector representations:

3.1 Representation by TF-IDF Weights

A common representation in text processing tasks is the TF-IDF weighting, i.e. a combination of the Term Frequency (TF) and Inverse Document Frequency (IDF) weights. The well-known formula for computing the TF-IDF weight \(w_{l,d}\) for a lemma \(l \in L\) and a document \(d \in D\) is:

$$\begin{aligned} w_{l,d} = tf_{l,d} * idf_l \end{aligned}$$
(1)

where \(tf_{l,d}\) denotes the number of times the lemma l occurs in document d and \(idf_l\) is computed using the formula:

$$\begin{aligned} idf_l = \frac{N}{N(l)} \end{aligned}$$
(2)

where N is a total number of documents and N(l) denotes a number of documents containing the lemma l.

In essentially all further experiments we use the Python package sklearn [9]Footnote 5 to compute the TF-IDF weights.
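A minimal sketch of this step with sklearn (note that sklearn's default idf is smoothed and logarithmic, so it does not follow Eq. (2) literally; `lemmatized_documents` is an assumed variable holding the preprocessed texts):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# max_features=5000 mirrors the 5000-dimensional vectors used in Sect. 6.
vectorizer = TfidfVectorizer(max_features=5000)
X_tfidf = vectorizer.fit_transform(lemmatized_documents)   # sparse matrix, shape (N, 5000)
```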

3.2 Representation by Doc2vec Weights

According to [4], the doc2vec representation is a simple extension of word2vec that embeds whole word sequences into vectors; the input can be n-grams, sentences, paragraphs or whole documents. This type of representation is considered state-of-the-art for sentiment analysis, which is essentially also a classification task. There was therefore a good chance that it would help in our task as well.

In this paper we use the doc2vec implementation from the Gensim package [10] for Python. The input data take the form of pairs consisting of the preprocessed document representation obtained in Sect. 3 and the label of the given document. The output is then a matrix of doc2vec weights, where every row corresponds to a specific document.
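A minimal Gensim sketch of this step, assuming `tokenized_docs` holds the preprocessed lemma sequences from Sect. 3 (the hyper-parameters shown are illustrative only):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Pair every document with a unique tag (here simply its index).
tagged = [TaggedDocument(words=tokens, tags=[i])
          for i, tokens in enumerate(tokenized_docs)]

model = Doc2Vec(tagged, vector_size=300, min_count=2, epochs=20, workers=4)

# One doc2vec vector per document, rows aligned with the input order
# (model.dv in Gensim 4; model.docvecs in older versions).
X_doc2vec = [model.dv[i] for i in range(len(tagged))]
```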

3.3 Use of LSA Reduction on Representations 3.1 and 3.2

We have also tried to further reduce the dimension of the vector representations described in Sects. 3.1 and 3.2 by Latent Semantic Analysis (LSA) and to subsequently analyze the effect on the classification accuracy. The LSA method is implemented in the Python package sklearn – the TruncatedSVD module.
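A short sketch of the reduction (the target dimension of 200 matches the setting used in the experiments below):

```python
from sklearn.decomposition import TruncatedSVD

lsa_tfidf = TruncatedSVD(n_components=200, random_state=0)
X_tfidf_lsa = lsa_tfidf.fit_transform(X_tfidf)        # 5000-dim tf-idf -> 200-dim

lsa_d2v = TruncatedSVD(n_components=200, random_state=0)
X_doc2vec_lsa = lsa_d2v.fit_transform(X_doc2vec)      # same reduction for the doc2vec weights
```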

4 Classification Methods

For our purposes, we picked one simple supervised and one simple unsupervised method. Our goal is to use unsupervised classification and still achieve results at least similar to the supervised ones.

4.1 K-Means

A simple unsupervised classification algorithm – the classic K-means clustering method [6] – is used here. It is generally accepted that even such a simple method is quite powerful for unsupervised data clustering if it is given appropriate feature vectors. As we have shown in [8], even simple feature vectors consisting of the tf-idf weights appear to capture the content of the document rather well (and the reduced feature vectors obtained from LSA do it even better). However, we expected to obtain even better results from the doc2vec weights, as they have been shown to be very good at extracting semantic information from documents.

We use the K-means implementation from the sklearn package. All preprocessed representations created according to Sect. 3 are used, and the model is applied to all the data sets described in Sect. 2. Results can be found in Sect. 6.
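A sketch of the clustering step (the number of clusters is set to the number of original categories of the given data set; the cluster-to-category assignment needed for evaluation is not shown):

```python
from sklearn.cluster import KMeans

n_categories = 20   # e.g. the full 20NewsGroups corpus
kmeans = KMeans(n_clusters=n_categories, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_tfidf_lsa)   # or any other representation from Sect. 3
```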

4.2 SVM

The supervised classification method used here is the classic Linear SVM algorithm. This simple but powerful supervised data classification algorithm should be quite sufficient. This algorithm was run only with the TF-IDF weight representation.

We have used the version of the Linear SVM algorithm implemented in our favourite sklearn package (to be exact, the LinearSVC class). Results can be found in Sect. 6.
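A sketch of the supervised baseline with the 10-fold cross-validation mentioned in Sect. 6 (default LinearSVC hyper-parameters, for illustration only; `labels` is an assumed array of category labels):

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

svm = LinearSVC()
scores = cross_val_score(svm, X_tfidf, labels, cv=10)   # 10-fold cross-validation
print("mean accuracy: %.3f" % scores.mean())
```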

5 Evaluation

Quite a few measures for the evaluation of classification algorithms are widely used in published papers. In our experiments, we have decided to use accuracy, precision, recall and F1; this choice was guided mostly by the fact that we wanted to compare the performance of our algorithms to the previously published results.

The Accuracy (Acc) measure is used only for the 20NG2 data set. It represents the percentage of correctly classified documents, i.e. the share of test documents that are assigned the correct topic.

Tables 1 and 3 list the results using the Precision and Recall measures computed according to [12]. The following equations for computing the micro-averaged Precision and Recall measures are explained in our previous work [8] and in [12]:

$$\begin{aligned} P(T) = \frac{\sum _c \alpha (c,T)}{\sum _c \left( \alpha (c,T)+\beta (c,T)\right) } \end{aligned}$$
(3)
$$\begin{aligned} R(T) = \frac{\sum _c \alpha (c,T)}{\sum _c \left( \alpha (c,T)+\gamma (c,T)\right) } \end{aligned}$$
(4)

where \(\alpha (c,T)\), \(\beta (c,T)\) and \(\gamma (c,T)\) denote, respectively, the number of documents correctly assigned to category c, incorrectly assigned to c, and incorrectly not assigned to c, given the topic assignment T. The standard equation for computing the F1 measure is [1]:

$$\begin{aligned} F1 = 2* \frac{P*R}{P+R} \end{aligned}$$
(5)

The results reported in Tables 1 and 3 list only the Precision measure. This is because the data sets are uni-labeled (and the number of original categories in the corpus is the same as the number of output clusters of the algorithms); therefore P(T) is necessarily equal to R(T) and to F1, and it is sufficient to report only one of those values.

6 Results

The first set of results is listed in Table 1; these results were achieved on the 20NG, 10NG, Binary[0/1/2], 5Multi[0/1/2] and 10Multi[0/1/2] data sets. We report only the 10Multi Average, 5Multi Average and 2Multi Average results of the smaller data sub-sets and compare them with the values reported in the previously published paper [12]. Only the results of the unsupervised Sequential Information Bottleneck (sIB) method developed by the authors of the mentioned paper are used. In our experiments, Linear SVM uses the 10-fold cross-validation technique and we run the K-means algorithm 10 times over each subset (the same approach as used in [12]). Averaged results from those runs are listed in Table 1. The meaning of the K-means experiment labels is as follows:

  • TF-IDF uses tf-idf weights as input; every vector has size 5000.

  • TF-IDF (LSA) uses tf-idf weights reduced by the LSA method; every vector has size 200.

  • doc2vec uses doc2vec weights as input; every vector has size 5000.

  • doc2vec (LSA) uses doc2vec weights reduced by the LSA method; every vector has size 200.

  • TF-IDF + doc2vec is a combination of the TF-IDF (LSA) and doc2vec (LSA) weights (see the sketch below); every vector has size 400.
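Assuming the combination in the last item is a simple concatenation of the two reduced matrices (which matches the stated dimensions, 200 + 200 = 400), it can be obtained e.g. as:

```python
import numpy as np

# 200-dim TF-IDF (LSA) + 200-dim doc2vec (LSA) -> 400-dim combined vectors
X_combined = np.hstack([X_tfidf_lsa, X_doc2vec_lsa])
```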

Table 1. Comparison of our results with results achieved in [12].

The second set of results is listed in Table 2. We again compare our results with the values reported in the previously published papers [1, 3]. The authors of [1] used the SVM based 1 (SVM b. 1) and SVM based 2 (SVM b. 2) methods. Both of these methods are classic SVM algorithms; SVM b. 1 uses as input training data generated with the help of WordNet together with the documents of the input corpus, preprocessed by stop-word removal, tokenization, TF-IDF representation, clusters created by Latent Semantic Indexing (LSI), etc. SVM b. 2 uses the same preprocessing but only the corpus of input documents. The results of both their methods and of the algorithms used by us are macro F1-measures over the three data sub-sets divided into training and testing data according to Sect. 2.

The method listed as HM, described in [3], is a semi-supervised classification method and uses a hybrid model of a deep belief network and softmax regression. The unlabeled data are used to train the deep belief network model and the labelled data are used to train the softmax regression model and fine-tune the whole system. The results stated as HM are only one of several results in [3]; for them the authors used a different division of the data set into training and testing data, namely 7 500 documents as the test set, 11 000 as the unlabeled training set and 3 000 as the labelled training set. To obtain our results we used a similar division to the one in [3]: the training data were created by concatenating their unlabeled and labelled data (approximately 13 000 documents, labelled for Linear SVM and without labels for K-means) and the test data contain approximately 7 000 documents.

Table 2. Comparison of our results with results achieved in [1, 3].

Results on the Czech data sets are listed in Table 3. We report these only for the purpose of testing our approach on data in a language other than English. The results on a language rather distant from English show that our approach to preparing the data can also be applied in this case.

Table 3. Results on Czech data sets.

7 Conclusion

A reasonably effective pipeline for the unsupervised classification of text documents according to their topic is introduced in this paper. Preprocessing of the raw input textFootnote 6 and the extracted feature vectorsFootnote 7 are the key factors in our approach. The simple supervised Linear SVM and the unsupervised K-means classification algorithms were used and, as was predicted, the supervised one is superior to the unsupervised one. Our main goal was for the unsupervised algorithm to achieve results at least similar to the supervised one. The performance of this unsupervised method (stated in Table 2) was almost on par with the semi-supervised algorithm and even better than the supervised algorithms used in [1]. Also, as can be seen from Tables 1, 2 and 3, the representation using the doc2vec model increases the performance of our unsupervised method by around 10%. This is an important finding of our research, since the benchmark training data – which are necessary for supervised learning – are often not available. Our approach to preprocessing the input texts is also suitable for the simple supervised Linear SVM algorithm, whose performance is comparable with the more complex ones (Table 2).