
1 Introduction

Text classification is the task of assigning specific categories to documents; examples include spam detection and sentiment analysis. Naive Bayes, a technique based on applying Bayes’ Theorem, is frequently used as a baseline method for text classification because it is relatively effective, fast, and easy to implement [11]. Numerous attempts have been made to tackle the poor assumptions of Naive Bayes [8, 17].

Various types of neural networks have been developed throughout the years, and many of these techniques are used for natural language processing (NLP) applications. A traditional method is the multilayer perceptron (MLP), trained on the bag-of-words (BoW) model [1]. The BoW model is a sparse representation of texts that treats a text as an unordered set of words, ignoring word order as well as semantic and syntactic features. In order to capture the subtleties of language, we seek a dense representation that does capture these features. Many state-of-the-art word embedding techniques [12, 15] are based on the distributional hypothesis [3], which states that linguistic items with similar distributions have similar meanings. These dense representations capture multiple degrees of similarity [14], both semantic and syntactic, such that similar words have similar representations.

Convolutional neural networks (CNN) make use of the internal structure of the dense representation, both in the feature domain, and the temporal (word order) domain. CNN models have achieved remarkable results on various text classification tasks [5, 21]. Whereas CNN models make use of the word order for a specific region size, recurrent neural networks (RNN) have the ability to capture long-term dependencies for texts of any length. More specifically, the Long Short-Term Memory (LSTM) architecture [4] is well suited for longer texts because of its ability to remember information for long periods of time.

In this paper, we introduce a novel dataset which we will use for multi-label text classification. We compare several state-of-the-art techniques, such as the concatenation-CNN and the LSTM network, with more traditional techniques. Furthermore, we introduce a novel architecture that applies a histogram on word embeddings, followed by an MLP. Unlike most research, we trained our own word embeddings, making our setup stand-alone.

In Sect. 2 we introduce our dataset, followed by Sect. 3 where we explain the methods used. The experimental setup is described in Sect. 4, and in Sect. 5 we show and discuss the results. Section 6 concludes the paper and proposes future work.

2 Dataset

The dataset used in the experiment is an intra-lingual movie subtitle corpus, collected by [9], which originates from OpenSubtitles. We extracted the English corpus and removed all tokens apart from the spoken text. Subsequently, we converted all words to lowercase and removed punctuation, see Fig. 1. We did not apply stop word removal or stemming. The total dataset consists of 44,171 subtitles, with a total of 135,862,112 words and 920,705 unique words. Every subtitle is linked to at least one, and often multiple, genres. In total the dataset contains subtitles with 27 different genres, ranging from animation and comedy to documentary. Because every subtitle can have multiple genres, the classification task is a multi-label classification task. This should not be confused with multi-class classification, where every document has exactly one label. Multi-label classification is considered to be significantly more difficult, due to the vast number of possible label combinations.
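For illustration, the preprocessing can be sketched as simple lowercasing and punctuation removal; the exact character classes kept by our pipeline are not reproduced here, so the regular expression below is an assumption.

```python
import re

def preprocess(text: str) -> str:
    """Lowercase a subtitle line and strip punctuation.

    A minimal sketch; the exact cleaning rules of the original
    pipeline are an assumption.
    """
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop punctuation and symbols
    return " ".join(text.split())             # collapse repeated whitespace

print(preprocess("He's here -- RUN!"))  # -> "he s here run"
```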

Because of limited computing power, we narrow our focus to the classification of the following genres: “Romance”, “Thriller”, and “Action”. This subset consists of 15,500 subtitles, with a total of 48,998,774 words and 448,101 unique words. The distribution of the subtitle lengths is depicted in Fig. 2.

Fig. 1. Text preprocessing, single-sentence example.

Fig. 2. Distribution of subtitle lengths.

3 Methods

In this paper, we differentiate between models that use the BoW model and models that use word embeddings. For the former we discuss two methods, and for the latter four.

3.1 Bag of Words

We will use the BoW model for both the Naive Bayes classifier and the multilayer perceptron. Let \(\varvec{d} = \lbrace d_1, ... ,d_n \rbrace \) be a collection of documents, where \(d_{ij}\) denotes the number of occurrences of word i in document j. Furthermore, let \(\varvec{l} = \lbrace l_1, ..., l_n \rbrace \) be the corresponding labels, where \(l_i\) is itself a set of possibly multiple labels.

Multinomial Naive Bayes Transformations. We will focus on the Multinomial version of the Naive Bayes model (MNB), where each word position is assumed to be independent of every other word. We will use several improvements proposed by [17], e.g. normalising the weight vectors.

To train the MNB model, we apply the transformations described in Eq. 1. For a test document t, with word i occurring \(t_i\) times, the document is labelled according to Eq. 2 for some threshold \(\theta \).

Eqs. 1 and 2: the training-weight transformations and the multi-label classification rule.

In Eq. 1.1, we down-weight common words, a heuristic known as “inverse document frequency”. Common words have little influence on the class of a document, but small variations can cause spurious correlations. Note that in most of the literature a “term frequency” transform precedes Eq. 1.1; we found, however, that this did not improve the accuracy. Therefore, as shown in Eq. 1.1, we use the raw term frequencies. To prevent document length from affecting the classification, we normalize every document according to Eq. 1.2. Finally, in Eq. 1.3 we add the weights of all documents belonging to the same genre.

For classification we first multiply each word frequency with the corresponding weight, as illustrated in Eq. 2.1. In standard multi-class classification, we could now assign the document to the class with the highest score. However, since the task is multi-label classification, we have to be able to assign multiple labels to a single document. We do this by first normalizing the weights according to Eq. 2.2, and then assigning each label for which the weight is greater than the predefined threshold \(\theta \). By increasing \(\theta \) we can trade off recall for precision (defined in Sect. 5.1). We determine this threshold by means of the validation set.
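As an illustration, the decision rule can be sketched as follows. The per-genre weight vectors are assumed to come from the transformations of Eq. 1, and the normalisation step here divides by the largest genre score, which is an assumption standing in for Eq. 2.2.

```python
import numpy as np

def predict_genres(t, W, genres, theta=0.7):
    """Multi-label decision rule, sketched after Eq. 2.

    t      : word-count vector of the test document, shape (V,)
    W      : per-genre weight vectors obtained via Eq. 1, shape (G, V)
    genres : list of G genre names
    theta  : threshold tuned on the validation set (0.7 for MNB, Sect. 4.2)
    """
    scores = W @ t                    # Eq. 2.1: weighted word frequencies
    scores = scores / scores.max()    # Eq. 2.2 (assumed normalisation)
    return [g for g, s in zip(genres, scores) if s > theta]
```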

Multi-layer Perceptron. The multi-layer perceptron (MLP) has been shown to be effective on a wide variety of tasks, despite its simplicity. We use a fully connected network, with two hidden layers. We use the ReLU activation function, and in every layer we apply \(L_2\)-normalisation before activation. The input of the MLP is again the BoW model, with the n most frequent words. Every word frequency is rescaled according to \(d_{ij} = \log {(1 + d_{ij})}\), reducing the influence of frequently occurring words.
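A minimal PyTorch sketch of this BoW-MLP is given below; layer sizes and dropout follow Sect. 4.2, and the sigmoid output layer for the three genres is an assumption for the multi-label setting.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoWMLP(nn.Module):
    """Sketch of the BoW-MLP: two hidden layers with L2-normalisation
    before the ReLU activation; layer sizes follow Sect. 4.2."""

    def __init__(self, vocab_size=50_000, n_labels=3):
        super().__init__()
        self.fc1 = nn.Linear(vocab_size, 512)
        self.fc2 = nn.Linear(512, 256)
        self.out = nn.Linear(256, n_labels)
        self.drop = nn.Dropout(0.5)

    def forward(self, counts):                  # counts: (batch, vocab_size)
        x = torch.log1p(counts)                 # d_ij = log(1 + d_ij)
        x = self.drop(F.relu(F.normalize(self.fc1(x), p=2, dim=1)))
        x = self.drop(F.relu(F.normalize(self.fc2(x), p=2, dim=1)))
        return torch.sigmoid(self.out(x))       # assumed multi-label output
```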

3.2 Skip-Gram Model

Many state-of-the-art techniques require dense word vectors as input. It is hypothesised that the techniques developed by e.g. [12] create dense word vectors that contain ‘universal’ features that can be used for various tasks. We will focus on the Skip-gram model [13]. In this model, each current word is used as an input, and the target is to predict the words that occur within a certain context c before and after the center word, as illustrated in Fig. 3. Furthermore, we use negative sampling (NEG) as the objective, where the task is to distinguish the target word from k negative samples drawn from a noise distribution. Since frequent words generally provide less information, we apply subsampling to all words as described in [13]. We train the model using all subtitles in our dataset; the hyperparameters used are listed in Sect. 4.2.

In order to explore the quality of the word vectors we use the Semantic-Syntactic Word Relationship test set, defined in [12]. This test set consists of five types of semantic questions and nine types of syntactic questions. The task is to predict a word, based on the relationship between three given words. An example of a semantic question is: “What word is similar to Oslo in the same way as France is similar to Paris?”; the answer would be Norway. This test is performed by computing the vector \(\varvec{x} =\) vector(“france”) - vector(“paris”) + vector(“oslo”), and finding the word that has the smallest cosine distance to this vector \(\varvec{x}\) (different from the three question words). An answer is considered correct only if the closest word is identical to the word in the question.

Table 1 shows the results on the word analogy task, indicating the effectiveness of the technique as well as the generalizability of the dataset used. For the accuracy we report both the percentage correct and the number of correctly classified pairs together with the total number of pairs. Note that we used a subset of the original test set, because some of the test words do not occur in our dataset. We evaluated 6,067 out of the original 8,869 semantic relations; for the syntactic relations we evaluated 10,300 out of 10,675 pairs. The results show that for the semantic relations the categories Common capital city and Man-Woman are learned very accurately, whereas Currency scores poorly. We expect that this is a result of the nature of movie subtitles: relationships (Man-Woman) and famous locations (Common capital city) play an important part in many movies, in contrast to currencies. The syntactic relations show a more balanced result, probably because all nine syntactic categories occur in spoken language.

Table 1. Results on the Semantic-Syntactic Word Relationship test set.
Fig. 3. The Skip-gram architecture, with a context size c of 2.
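The analogy test described above can be sketched with plain cosine similarity; `emb` is assumed to be a dictionary mapping words to L2-normalised numpy vectors.

```python
import numpy as np

def analogy(emb, word_a, word_b, word_c):
    """Answer "word_a is to word_b as word_c is to ?".

    E.g. analogy(emb, "paris", "france", "oslo") should return "norway".
    emb is assumed to map words to L2-normalised vectors.
    """
    x = emb[word_b] - emb[word_a] + emb[word_c]
    x = x / np.linalg.norm(x)
    best_word, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (word_a, word_b, word_c):
            continue                      # exclude the three question words
        sim = float(np.dot(x, vec))       # cosine similarity of unit vectors
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word
```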

3.3 MLP on Histogram of Word Embeddings

Previous research has shown that first training a part of the model on an unsupervised task can reduce the training time and increase the accuracy on the supervised task [16]. Because the pre-trained word embeddings contain various features, we expect that a basic model can find relationships between several features in order to learn a supervised task. Preliminary experiments have shown that taking scalar indicators (such as \(\min \), \(\max \), or \(\mathrm {mean}\)) of a single feature over all words in combination with an MLP does not lead to satisfying results. Both the \(\min \) and \(\max \) operators can be affected by single, meaningless outliers, whereas the \(\mathrm {mean}\) operator can reduce significant positive and negative weights to a meaningless average. In order to capture more information we propose to use a histogram, where each word-embedding feature is described by a certain number of bins. Every bin denotes the relative frequency of a range of values for that specific feature.

We will now describe the method used to convert a document to a word-embedding histogram, which can subsequently be used as input for an MLP. Let every subtitle consist of n words, such that

$$\begin{aligned} {\begin{matrix} X = \varvec{x}_1 \oplus \varvec{x}_2 \oplus ... \oplus \varvec{x}_n \end{matrix}} \end{aligned}$$
(3)

where \(\varvec{x}_t \in \mathbb {R}^{k}\) is the k-dimensional word embedding and \(\oplus \) is the concatenation operator. Note that in most of the literature word embeddings are referred to as “words”; we will use “concepts”, because a word embedding is actually the representation of the concept of a word, and not of the word itself. The concatenation of the word embeddings results in a matrix \(X \in \mathbb {R}^{n \times k}\), where n and k denote the number of words in the subtitle and the dimension of the word embedding respectively. In order to use this matrix in combination with a histogram, we first need to scale the values such that we can use bins with a prefixed size and range. We normalize the matrix X according to

$$\begin{aligned} {\begin{matrix} X_{ij}&= \frac{X_{ij} - \min _{i} X_{ij} }{ \max _{i} X_{ij} - \min _{i} X_{ij}} \end{matrix}} \end{aligned}$$
(4)

We will now make a histogram along every word dimension, using s bins, where every bin has a width of size 1/s. The ranges of the bins are denoted by \(\lbrace b_1= [0, \frac{1}{s} ) , b_2 = [\frac{1}{s}, \frac{2}{s}), ..., b_s = [\frac{s-1}{s} , 1 ]\rbrace \). For every bin \(b_l\) and every word dimension k we now calculate

$$\begin{aligned} {\begin{matrix} H_{lk}&= \text {card}(\lbrace X_{jk} : X_{jk} \in b_l \rbrace ) \end{matrix}} \end{aligned}$$
(5)

Subsequently, we calculate the \(L_1\) norm

$$\begin{aligned} {\begin{matrix} H_{lk}&= \frac{H_{lk}}{\sum _{i=1}^s H_{ik} } \end{matrix}} \end{aligned}$$
(6)

and calculate the z-score

$$\begin{aligned} {\begin{matrix} Z_{lk}&= \frac{H_{lk} - \mu }{\sigma } \end{matrix}} \end{aligned}$$
(7)

where \(\mu \) and \(\sigma \) are the mean and standard deviation of all values in H, respectively. The resulting matrix \(Z \in \mathbb {R}^{s \times k}\) is then used as an input for an MLP.
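A numpy sketch of Eqs. 4–7 follows, assuming X holds the n word embeddings of one subtitle as rows; s = 25 bins matches the setting in Sect. 4.2.

```python
import numpy as np

def embedding_histogram(X, s=25):
    """Turn an (n words x k dims) embedding matrix into the (s x k)
    z-scored histogram Z of Eqs. 4-7."""
    # Eq. 4: min-max normalise every embedding dimension to [0, 1].
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    # Eq. 5: count the values falling into each of the s bins, per dimension.
    edges = np.linspace(0.0, 1.0, s + 1)
    H = np.stack([np.histogram(X[:, j], bins=edges)[0]
                  for j in range(X.shape[1])], axis=1)
    # Eq. 6: L1-normalise each column so the bin frequencies sum to one.
    H = H / H.sum(axis=0)
    # Eq. 7: z-score with the mean and std of all entries of H.
    return (H - H.mean()) / H.std()

Z = embedding_histogram(np.random.rand(1000, 300))   # e.g. 1000 words, k = 300
print(Z.shape)                                       # -> (25, 300)
```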

3.4 Convolutional Neural Network

A Convolutional Neural Network (CNN) is a feedforward neural network, originally used for image classification [7]. CNN models have been shown to be effective on various NLP tasks, by utilising local features of the word embeddings [6]. We will now describe the CNN architecture. Let every document of length n be described by a sequence as defined in Eq. 3 (padded if necessary). Let \(\varvec{x}_{i:i+j}\) denote the concatenation of concepts \(\varvec{x}_i\) up to \(\varvec{x}_{i+j}\). The convolutional filter \(\varvec{w} \in \mathbb {R}^{h \times k}\) is applied to a window of h concepts, which produces a new feature \(c_i\). Note that k again denotes the word-embedding size. We could in theory slide the convolution along the embedding features too; however, there is no reason to assume that any specific local relationships exist between neighbouring embedding dimensions. The window of concepts \(\varvec{x}_{i:i+h-1}\) generates a new feature by

$$\begin{aligned} {\begin{matrix} c_i = f(\varvec{w} \circ \varvec{x}_{i:i+h-1} + b) \end{matrix}} \end{aligned}$$
(8)

where \(\varvec{w} \circ \varvec{x}_{i:i+h-1}\) denotes the sum over the element-wise product of \(\varvec{w}\) and \(\varvec{x}_{i:i+h-1}\) (so that \(c_i\) is a scalar), \(b \in \mathbb {R}\) is a bias term, and f is a non-linear function such as the sigmoid or the hyperbolic tangent. By applying this filter to all possible windows, we obtain a feature map \(\varvec{c} = [c_1, c_2, ..., c_{n-h+1} ]\). Note that we can use multiple filters, with possibly different filter widths.

In order to capture the more significant events, we subsequently apply a max pooling operation on the feature map \(\varvec{c}\). Throughout the paper we will differentiate between two types of max pooling, namely max-over-time pooling and 1-D max pooling.

Max-over-time Pooling. The first technique extracts a single (maximum) scalar from each feature map. By using multiple convolutional filters, with varying filter widths, we obtain several features which are then passed on to a fully connected layer. This architecture was introduced by [6], and is referred to as concatenation-CNN (C-CNN). Whereas the architecture introduced by [6] uses a final softmax layer, we adapt the network for a multi-label problem by using a sigmoid activation output layer.
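A PyTorch sketch of this C-CNN follows; filter widths and feature-map counts follow Sect. 4.2, and ReLU is assumed as the non-linearity f.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CCNN(nn.Module):
    """Sketch of the concatenation-CNN: parallel convolutions with widths
    3, 4, 5 (128 feature maps each), max-over-time pooling, and a sigmoid
    output layer for the multi-label setting."""

    def __init__(self, emb_dim=300, n_labels=3, widths=(3, 4, 5), maps=128):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, maps, kernel_size=h) for h in widths])
        self.drop = nn.Dropout(0.5)
        self.out = nn.Linear(maps * len(widths), n_labels)

    def forward(self, x):                 # x: (batch, n_words, emb_dim)
        x = x.transpose(1, 2)             # Conv1d expects (batch, channels, time)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        z = self.drop(torch.cat(pooled, dim=1))
        return torch.sigmoid(self.out(z))
```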

1-D Max Pooling. Whereas max-over-time pooling reduces a feature map to a single feature, we can also reduce the feature map to several features, one for each pooling window. In order to determine the maximum value for a window of size m we define

$$\begin{aligned} {\begin{matrix} p_i = \max (c_{i:i+m-1}) \end{matrix}} \end{aligned}$$
(9)

with \(i = (1,1+s,1+2s,...)\), where s denotes the size of the stride. In both the convolutional layer and the 1-D max pooling layer we can vary the stride, meaning that instead of moving the filter one step at a time, we move the filter several places per step. We use multiple filters for the same region, making it possible to learn complementary features from the same regions. With l filters, the generated l feature maps are combined to create a matrix \(X \in \mathbb {R}^{l \times \lfloor (n-h-m+2)/s \rfloor }\). These feature maps are then used in combination with an LSTM network, as explained in Sect. 3.6.

3.5 Long Short-Term Memory Network

A Recurrent Neural Network (RNN) has the ability to capture time-dependent relations between words. It does this dynamically, without the use of fixed-size context windows. In particular, the Long Short-Term Memory network (LSTM) [4] excels at tasks where long-term dependencies are important. This network has received a lot of attention because of its capability of capturing important events throughout time series, while being relatively insensitive to gaps between important events. Given a sequence as described by Eq. 3, at time step t the LSTM network updates \(c_t\) and \(h_t\) with input \(x_t\) as follows

$$\begin{aligned} {\begin{matrix} \begin{pmatrix} i_t \\ f_t \\ o_t \\ \hat{c}_t \end{pmatrix}&= \begin{pmatrix} \text {sigm} \\ \text {sigm} \\ \text {sigm} \\ \text {tanh} \end{pmatrix} \left( W_x x_t + W_h h_{t-1} + b \right) \end{matrix}} \end{aligned}$$
(10)
$$\begin{aligned} {\begin{matrix} c_t&= f_t \circ c_{t-1} + i_t \circ \hat{c}_t \end{matrix}} \end{aligned}$$
(11)
$$\begin{aligned} {\begin{matrix} h_t&= o_t \circ \text {tanh}(c_t) \end{matrix}} \end{aligned}$$
(12)

where \(c_t\) and \(h_t\) are the memory cell and hidden state respectively, and \(i_t\), \(f_t\), \(o_t\), and \(\hat{c}_t\) are the input gate vector, forget gate vector, output gate vector, and candidate cell state vector respectively. \(W_x\) and \(W_h\) denote the input-to-hidden and hidden-to-hidden weight matrices, and b is a bias vector. Note that in Eqs. 10 and 12 the functions \(\text {sigm}\) and \(\text {tanh}\) are applied element-wise. In order to map the output of the LSTM network to the output layer, we apply mean-over-time pooling on the hidden states, meaning that we calculate the mean of the \(h_t\) values over all time steps t. Finally, the mean-over-time pooling is followed by a fully-connected layer with a sigmoid activation function.

The traditional LSTM network may have problems when changes in the parameters of one layer affect the distribution of the inputs to all subsequent layers, a phenomenon known as internal covariate shift. A solution proposed by [2], called Batch Normalized LSTM (BN-LSTM), normalizes both the input-to-hidden and hidden-to-hidden transformations by empirically estimating their means and standard deviations.
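A PyTorch sketch of the LSTM classifier with mean-over-time pooling is shown below; a plain nn.LSTM stands in for the batch-normalised variant of [2], and the hidden size and dropout follow Sect. 4.2.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Sketch: LSTM over the word embeddings, mean-over-time pooling of
    the hidden states, and a sigmoid output layer. The BN-LSTM of [2]
    additionally normalises the input-to-hidden and hidden-to-hidden
    transformations."""

    def __init__(self, emb_dim=300, hidden=300, n_labels=3):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.drop = nn.Dropout(0.2)       # dropout on input and output
        self.out = nn.Linear(hidden, n_labels)

    def forward(self, x):                 # x: (batch, n_words, emb_dim)
        h, _ = self.lstm(self.drop(x))    # h: (batch, n_words, hidden)
        pooled = h.mean(dim=1)            # mean-over-time pooling
        return torch.sigmoid(self.out(self.drop(pooled)))
```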

3.6 CNN-BN-LSTM

We discussed that the CNN leverages the local features of words, whereas the LSTM excels at tasks where long-term relations play a part. By combining the two techniques, we hope to get the best of both worlds. We start by applying a convolutional layer, followed by a 1-D max pooling layer, as discussed in Sect. 3.4. The resulting matrix is then used as input for the LSTM network, such that there are \(\lfloor (n-h-m+2)/s \rfloor \) time steps, each with dimension l. Similar to the procedure discussed in Sect. 3.5, we subsequently apply mean-over-time pooling on the hidden states, together with a fully-connected layer with sigmoid activation. An example of this network is shown in Fig. 4.
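The combined network can be sketched as a convolution and 1-D max pooling feeding an LSTM; hyperparameters are taken from Sect. 4.2, and again a plain LSTM stands in for the batch-normalised variant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNLSTM(nn.Module):
    """Sketch of the CNN-(BN-)LSTM: Conv1d + 1-D max pooling (Sect. 3.4)
    whose feature maps form the time steps of an LSTM with mean-over-time
    pooling (Sect. 3.5) and a sigmoid output layer."""

    def __init__(self, emb_dim=300, maps=200, hidden=300, n_labels=3):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, maps, kernel_size=8, stride=2)
        self.pool = nn.MaxPool1d(kernel_size=4, stride=2)
        self.lstm = nn.LSTM(maps, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_labels)

    def forward(self, x):                    # x: (batch, n_words, emb_dim)
        c = self.pool(F.relu(self.conv(x.transpose(1, 2))))
        h, _ = self.lstm(c.transpose(1, 2))  # time steps = pooled windows
        return torch.sigmoid(self.out(h.mean(dim=1)))
```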

Fig. 4. Graphical representation of the CNN-BN-LSTM network. The hyperparameters of this example are as follows: the word-embedding size is 6; the convolutional layer uses a window size of 2 with a stride of 1; this is followed by 1-D max pooling with a window size of 3 and a stride of 1.

4 Experimental Setup

4.1 Dataset

The proposed models are tested on the dataset introduced in Sect. 2. We split the dataset into a validation set and a train-test set of 1,500 and 14,000 subtitles respectively, so that we can tune the hyperparameters on the validation set and use cross-validation on the train-test set. We use 7-fold cross-validation to test the methods; in each fold the train set consists of 12,000 movies and the test set of 2,000 movies.

4.2 Hyperparameters and Training

The following hyperparameters are all determined by performing a grid search on the validation set. For the MNB model we only take into account words that occur more than 3 times. We use a classification threshold \(\theta \) of 0.7 for the MNB model. For all other models we use a threshold value of 0.5.

For the BoW-MLP model we use the 50,000 most frequent words. The first hidden layer contains 512 nodes, the second layer 256. In both layers we apply the ReLU activation function, followed by dropout [19] with a dropout rate of 0.5.

Throughout all experiments we use a word-embedding size of 300. We use static word embeddings; thus we apply no backpropagation on the word embeddings in any of the experiments. We trained the word embeddings on all subtitles, not only on the subset used for the multi-label classification task. The training was performed for 12 epochs, using a learning rate of 0.1, a mini-batch size of 16, a subsample threshold of \(10^{-3}\), a context size c of 5, and 15 negative samples.
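For reference, training such a skip-gram model could be sketched with gensim (the gensim ≥ 4 API is assumed; gensim does not expose exactly the same mini-batching, so the mini-batch size of 16 has no direct equivalent here).

```python
from gensim.models import Word2Vec

# Toy corpus; in our setting every preprocessed subtitle is one token list.
sentences = [["we", "have", "to", "go"], ["i", "love", "you"]]

model = Word2Vec(
    sentences,
    vector_size=300,   # word-embedding size
    sg=1,              # skip-gram rather than CBOW
    negative=15,       # number of negative samples
    window=5,          # context size c
    sample=1e-3,       # subsampling threshold for frequent words
    alpha=0.1,         # initial learning rate
    epochs=12,
    min_count=1,       # kept low only so the toy corpus has a vocabulary
)
print(model.wv["love"].shape)  # -> (300,)
```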

For the MLP-Histogram model we use 25 bins, followed by 128 hidden nodes in the first layer of the MLP, and 64 nodes in the second layer. Furthermore, in order to prevent overfitting we add Gaussian noise to the input with a mean of 0, and a standard deviation of 0.02. Additionally, after each layer dropout is applied with a rate of 0.5. Finally, in each layer the ReLU activation function is applied.

The BN-LSTM model uses 300 hidden units; on both the input and output connections we use dropout with a rate of 0.2. We constrain the \(L_2\)-norm of the gradient to not exceed 10, which is known as gradient clipping.

For the CNN-BN-LSTM network we use similar LSTM hyperparameters, preceded by a CNN. The CNN consists of 200 feature maps with a window size of 8 and a filter stride of 2, followed by a 1-D max pool filter of size 4 with a stride of 2. The activation function used in the CNN is the ReLU. Again we constrain the \(L_2\)-norm of the gradient to a maximum of 10.

In the C-CNN model we use filters of widths 3, 4, and 5, all with 128 feature maps. We apply dropout with a rate of 0.5, and again constrain the \(L_2\)-norm to 10.

For the CNN-BN-LSTM and the C-CNN model we pad the documents to a maximum length of 4000 words. In all models we use a mini-batch size of 20. We train the MLP-BoW and C-CNN for 6 epochs, all other models are trained for 10 epochs. We used the Adadelta update rule [20] for training, while shuffling the mini-batches.

Throughout all experiments (apart from training the word embeddings) we anneal the learning rate \(\alpha \) using exponential decay, defined by \(\alpha = \alpha _0 r^{t/k}\), where \(\alpha _0\) is the initial learning rate, r is the decay rate, t is the iteration step, and k indicates the decay step, such that every k steps the learning rate is decayed. In all experiments we use a decay rate r of 0.97. For the MLP-Histogram, BN-LSTM, and CNN-BN-LSTM we use an initial learning rate of 0.1; for the MLP-BoW and C-CNN we use an initial learning rate of 0.005.
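The schedule is simply the closed-form expression above; a small sketch, with the step-wise variant obtained by flooring t/k:

```python
def decayed_lr(alpha0, r, t, k, staircase=False):
    """Exponentially decayed learning rate: alpha = alpha0 * r**(t / k).

    With staircase=True the exponent is floored, so the rate only changes
    every k iterations, as described in the text.
    """
    exponent = t // k if staircase else t / k
    return alpha0 * r ** exponent

print(decayed_lr(0.1, 0.97, t=1000, k=100))  # ~0.0737
```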

5 Results and Discussion

5.1 Metrics

In order to compare our models we will use the \(F_1\) score, which takes into account both the recall and precision. Recall, precision, and the \(F_1\) score for one label are respectively defined as:

$$\begin{aligned} {\begin{matrix} \text {recall}&= \frac{\text {true positives}}{\text {true positives} + \text {false negatives}} \end{matrix}} \end{aligned}$$
(13)
$$\begin{aligned} {\begin{matrix} \text {precision}&= \frac{\text {true positives}}{\text {true positives} + \text {false positives}} \end{matrix}} \end{aligned}$$
(14)
$$\begin{aligned} {\begin{matrix} F_1&= 2 \cdot \frac{{\text {precision}} \cdot {\text {recall}}}{\text {precision} + \text {recall}} \end{matrix}} \end{aligned}$$
(15)

In order to calculate the final recall, precision, and \(F_1\)-score of the models, we calculate the mean scores over all three genres.
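A sketch of Eqs. 13–15 with the macro-averaging described above, assuming binary prediction matrices of shape (documents × genres):

```python
import numpy as np

def mean_scores(y_true, y_pred):
    """Per-genre precision, recall and F1 (Eqs. 13-15), averaged over genres.

    y_true, y_pred: binary arrays of shape (n_documents, n_genres).
    """
    tp = np.sum((y_pred == 1) & (y_true == 1), axis=0)
    fp = np.sum((y_pred == 1) & (y_true == 0), axis=0)
    fn = np.sum((y_pred == 0) & (y_true == 1), axis=0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision.mean(), recall.mean(), f1.mean()
```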

5.2 Results

The results of our models are listed in Table 2. Our baseline method (MNB) does not perform well. The model that performs best is the MLP-BoW model, with an average \(F_1\)-score of 0.77 ± 0.02. This is a significantly higher result (\(P < 0.005\)) than that of the other models. The novel MLP-Histogram model achieves the second highest \(F_1\)-score. The BN-LSTM does not perform well on its own; however, in combination with a CNN layer (CNN-BN-LSTM) the model obtains the third best result. Finally, the C-CNN model is outperformed by all but two models.

Table 2. The results on the test set, after the specified number of epochs. Both the mean and standard deviation over the cross-validation folds are displayed. We report the recall, precision, and \(F_1\)-score for each of the three genres, together with their means.

5.3 Discussion

Our baseline model (MNB) does not perform well: the genre thriller has a very low recall and therefore a low \(F_1\)-score. The other two genres, however, have a very high recall (higher than for all other models). We expect that the poor results on the genre thriller are caused by a combination of how the threshold is determined and the poor assumptions of the MNB model. We also experimented with n-grams, with n ranging from 1 to 3, but the performance decreased for n higher than 1.

The MLP-BoW model outperformed all other (more complex) models. This was contrary to our expectations, because the model is relatively simple compared to the other machine learning models. Not only is the \(F_1\)-score high, the training time was also relatively short. The fact that this model achieves the highest \(F_1\)-score could suggest that there exist combinations of important ‘indicator words’ that are strong predictors for certain genres. We experimented with adding more layers to the network, but this had no significant positive effect on the results. Removing one layer had a negative effect on the accuracy.

Considering that the MLP-Histogram model only takes into account the relative frequency of word-embedding feature values, the model performs remarkably well. This is another illustration of the ‘universal’ features of word embeddings. Similar to the MLP-BoW model, the training time is relatively short. Furthermore, this newly proposed model performed best among the models that use word embeddings.

The BN-LSTM model performs rather poorly. We expect that this is due to the length of the documents. A careful observation of Fig. 2 shows that the genre romance has relatively long subtitles. This could explain the poor results on this genre for models that are susceptible to document length. Although the BN-LSTM network suffers less from vanishing gradients than other RNN architectures, it still has problems with documents of substantial length. Another explanation for the inadequacy of the BN-LSTM model could be that for this task word order is irrelevant and only the occurrence of certain words is important. Preliminary experiments have shown that stacking multiple BN-LSTM layers on top of each other had no effect on the final accuracy. The accuracy increased drastically with the use of batch normalization, which also led to faster, more stable convergence. Furthermore, the model often diverged without the use of gradient clipping.

Contrary to the MNB model, we saw that for the C-CNN model the use of n-grams (by means of the filter widths) did increase the performance. Although the model introduced in [6] achieves state-of-the-art results on various tasks, we find only moderate results on our task with a similar model. One main difference is that the documents in the datasets used in [6] are significantly shorter than those in our dataset, making them less susceptible to outliers that can affect the max-over-time pooling.

By combining a CNN model with a BN-LSTM model (into the CNN-BN-LSTM model) we see a performance improvement compared to a separate C-CNN or BN-LSTM model. By combining the two methods we get the powerful feature extractor of the CNN model and the capability of detecting long-term dependencies of the LSTM model. The downside of this method is that even more hyperparameters have to be tuned. Exploratory research indicated that adding a CNN layer after the BN-LSTM or CNN-BN-LSTM model did not improve the accuracy.

6 Conclusion and Future Work

In this paper we described various techniques that can be used for multi-label classification of movie genres based on subtitles. First, we established a baseline using a multinomial naive Bayes (MNB) classifier combined with several heuristics that “tackle the poor assumptions of MNB” [17]. We trained word embeddings on an unsupervised task, and showed that these embeddings contain indicative features for genre classification. We developed a novel architecture that combines a histogram of the word embeddings with an MLP. Despite the simple nature of this model, it outperforms several more complex models. Both the C-CNN network and the BN-LSTM perform poorly on their own. However, by combining the two techniques we observe a drastic increase in performance. The model that performs best is the MLP-BoW model, a surprising result given that many papers consider this network to be a baseline method.

We observed that simple models sometimes outperform more complex, state-of-the-art networks. The best network thus completely depends on the problem at hand. Therefore we would like to stress that exploring simpler text-classification methods is of great importance when a new dataset is studied. This relates directly to the principle of Occam’s razor, which states that of all possible hypotheses, the one with the fewest assumptions should be used. When we decide to use a specific technique, we make certain assumptions about the data. A simple technique is less prone to overfit the data than a more complex technique, because it makes fewer assumptions about the data. With more assumptions, it is easier to choose parameters such that they only fit the observed data and do not generalise well.

In follow-up work we would like to consider non-static word embeddings. In [6] it is shown that for certain tasks the performance improves when either non-static word embeddings, or a combination of static and non-static word embeddings, are used. Moreover, we would like to explore the use of random word embeddings and word embeddings trained by others, e.g. [13]. The final \(F_1\)-scores could be improved by using more advanced threshold techniques, and in future research the number of genres should be extended (up to 27). Finally, experiments on more datasets could be conducted, e.g. the Movie Review Sentiment dataset [10] and the Stanford Sentiment Treebank [18].