
1 Introduction

Text classification is an important task in natural language processing with many applications, and a suitable text encoding scheme can benefit it considerably. Early studies use discrete, context-insensitive approaches such as TF-IDF [15] and weighted bag-of-words [4]. With the flourishing of research on distributed word representations [2, 10], linear combinations of the word embeddings in a sentence [14] became widely used for classification. However, this method captures information from individual words only. Many studies that exploit contextual information adopt neural networks as encoders to capture the semantics of the text, including recurrent neural networks (RNN) [16], tree-structured recursive networks [17] and convolutional neural networks (CNN) [3, 6, 7].

However, the aforementioned methods pay little attention to the semantic segmentation of text. Intuitively, the words in a text play different roles and thus contribute to different groups of semantic functionality. Words like “the”, “an” and “of” act as connections between syntactic structures and contribute little to classification, while words like “internet”, “linux” or “wireless”, which indicate a specific domain (science), tend to have a large impact on classification performance. Some recent studies attempt to capture different types of word importance via hierarchical attention [21] or multi-head self-attention [8]. However, these attention mechanisms still only consider the relative importance of individual words; they do not group words with similar semantics together to strengthen the semantic expression of the whole text.

Since classification is a word-sensitive task, people tend to integrate all related words when deciding the category of a sentence. Therefore, in this paper, we propose to augment text representation at a higher level: the cluster level. Concretely, we divide the words in a text into different latent semantic clusters and obtain cluster representations by combining the contextual embeddings of the words according to their cluster probability distributions. The cluster representations are then concatenated as the final representation of the text. We further introduce two regularization terms to better guide the clustering process. Considering that not all semantic clusters contain useful information for classification, we design a gating mechanism to dynamically control their contributions to classification.

Experiments are conducted on five standard benchmark datasets for text classification. Quantitative results show that our method outperforms, or is at least on a par with, state-of-the-art methods. We also perform visual and statistical analyses of the intermediate word clustering results, which demonstrate the effectiveness of the clustering process.

In summary, the contributions of this paper include:

  • We propose an intuitive architecture for text classification with a novel semantic clustering process for better capturing distant topical information in text representation. The semantic clusters in our framework are automatically calculated on-the-fly instead of being fitted in advance.

  • Due to the probabilistic nature of soft semantic clustering, we introduce two regularization schemes to better guide the behaviors of our model.

  • We conduct extensive experiments to evaluate our proposed method on five text classification benchmarks. Results show that our model could obtain competitive performance compared with state-of-the-art approaches.

  • We provide statistical and visualization analyses of the cluster distributions captured by our learned model, further corroborating our motivation.

Fig. 1. The framework of our model.

2 Model

We propose a latent semantic clustering representation (LSCR) framework, shown in Fig. 1, consisting of four parts: (1) the word representation layer at the bottom converts words into vector representations (embeddings); (2) the encoding layer transforms the sequence of word embeddings into corresponding hidden vector representations; (3) the semantics clustering layer assigns all words to different clusters and composes the words of each cluster into a cluster representation; (4) the aggregation layer combines the cluster vectors into a single vector as the final representation of the text.

Formally, given a text consisting of n words \((w_1, w_2, \cdots , w_n)\), the word representation layer converts them to their corresponding word embeddings, represented as \(\varvec{X}=(\varvec{x}_1, \varvec{x}_2, \cdots , \varvec{x}_n)\). In the encoding layer, we employ a bi-directional LSTM [13] as the encoder to aggregate information along the word sequence, yielding hidden states \(\varvec{h}_t = [\overrightarrow{\varvec{h}}_t;\overleftarrow{\varvec{h}}_t]\). We concatenate the hidden state from the encoder with the initial word embedding to enrich the representation, so the output of this layer for the t-th word takes the form \(\varvec{r}_t = [\varvec{x}_t;\varvec{h}_t]\).
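A minimal sketch of the word representation and encoding layers in TensorFlow 2 (the framework the authors report using in Sect. 3.3); the vocabulary size, dummy inputs and variable names are illustrative assumptions, not the authors' released code.

```python
import tensorflow as tf

vocab_size, emb_dim, hidden_dim = 30000, 300, 300   # illustrative sizes; 300/300 dims as in Sect. 3.3

word_ids = tf.random.uniform([2, 17], maxval=vocab_size, dtype=tf.int32)   # dummy (batch, n) token ids
X = tf.keras.layers.Embedding(vocab_size, emb_dim)(word_ids)               # word representation layer
H = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(hidden_dim, return_sequences=True))(X)            # h_t = [forward; backward]
R = tf.concat([X, H], axis=-1)                                             # r_t = [x_t; h_t], (batch, n, 900)
```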

The final word representations produced by this layer are denoted \(\varvec{R}= (\varvec{r}_1, \varvec{r}_2, \cdots , \varvec{r}_n)\). In the semantics clustering layer, we suppose that the words of the text can be assigned to m semantic clusters, where m is a tunable hyperparameter. For each word, we employ an MLP to determine the probability of assigning it to each cluster, defined as:

$$\begin{aligned} \varvec{A}= f_1(\varvec{W}_2 \cdot f_2(\varvec{W}_1 \cdot \varvec{R} + \varvec{b}_1) + \varvec{b}_2) \end{aligned}$$
(1)

We use the softmax function for \(f_1\), and ReLU for \(f_2\). Concretely, \(A_{i, j}\) indicates the probability of the j-th word being clustered into the i-th cluster. For each word \(w_j\) of the text, \(\sum _{i=1}^m A_{i,j}=1\). After getting the probabilities, the vector representation of the i-th cluster is given by the weighted sum of the contextualized word representations (\( \varvec{R}\)) in the text, followed by a nonlinear transformation. The process is formulated as:

$$\begin{aligned} \varvec{C}=\text {ReLU}(\varvec{W}_s(\varvec{A} \cdot \varvec{R}) + \varvec{b}_s) \end{aligned}$$
(2)
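The following sketch implements Eqs. (1) and (2) in TensorFlow 2; the batch shapes, the dummy input and the layer objects are illustrative assumptions rather than the authors' released code.

```python
import tensorflow as tf

m, d_r, d_c = 8, 900, 600                       # clusters, dim of r_t, dim of cluster vectors (assumed)
R = tf.random.normal([2, 17, d_r])              # (batch, n, d_r) contextualized word representations

# Eq. (1): a two-layer MLP (f_2 = ReLU, f_1 = softmax over clusters),
# so that each word's probabilities over the m clusters sum to 1.
hidden = tf.keras.layers.Dense(800, activation="relu")(R)      # (batch, n, 800)
A = tf.nn.softmax(tf.keras.layers.Dense(m)(hidden), axis=-1)   # (batch, n, m)
A = tf.transpose(A, [0, 2, 1])                                 # (batch, m, n): A[i, j] = p(cluster i | word j)

# Eq. (2): cluster vectors as probability-weighted sums of word representations, then a nonlinearity.
C = tf.keras.layers.Dense(d_c, activation="relu")(tf.matmul(A, R))   # (batch, m, d_c)
```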

The i-th row of \(\varvec{C}\), \(\varvec{c}_i\), is the vector representation of the i-th cluster. Not all clusters are helpful for text classification; there may exist redundant clusters that contain little or irrelevant information for the task. Therefore, in the aggregation layer, we add a gating mechanism on the cluster vectors to control the information flow. Concretely, the gate takes a cluster vector \(\varvec{c}_i\) as input and outputs a gate vector \(\varvec{g}_i\):

$$\begin{aligned} \varvec{g}_i = \sigma (\varvec{W}_g\varvec{c}_i + \varvec{b}_g) \end{aligned}$$
(3)

We do the same operation on other cluster vectors as well, leading to a series of gate vectors \(\varvec{G} = (\varvec{g}_1, \varvec{g}_2, \cdots , \varvec{g}_m)\) from the cluster vectors. Then the gated cluster vectors \(\bar{\varvec{C}} = ( \bar{\varvec{c}}_1, \bar{\varvec{c}}_2, \cdots , \bar{\varvec{c}}_m)\) are calculated as:

$$\begin{aligned} \bar{\varvec{C}} = \varvec{G} \odot \varvec{C} \end{aligned}$$
(4)

Finally, we concatenate the vector representations of all the semantic clusters to form the text representation \(\varvec{s} =[\bar{\varvec{c}}_1, \bar{\varvec{c}}_2, \cdots ,\bar{\varvec{c}}_m]\). For classification, the text representation \(\varvec{s}\) is fed to a simple classifier consisting of a fully connected hidden layer and a softmax output layer, producing the predicted class distribution \(\varvec{y}\). The basic loss function is the cross-entropy loss \(\mathcal {L}\) between the ground-truth distribution and the prediction.
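A sketch of the gating, aggregation and classification steps (Eqs. (3) and (4)) in TensorFlow 2, assuming the cluster matrix \(\varvec{C}\) from Eq. (2); the dummy shapes and layer names are our own.

```python
import tensorflow as tf

num_classes = 4
C = tf.random.normal([2, 8, 600])                       # (batch, m, d_c) cluster vectors from Eq. (2)

# Eq. (3): an element-wise sigmoid gate computed from each cluster vector.
G = tf.keras.layers.Dense(600, activation="sigmoid")(C)

# Eq. (4): gated cluster vectors.
C_bar = G * C

# Aggregation: concatenate the m gated cluster vectors into the text representation s.
s = tf.reshape(C_bar, [tf.shape(C_bar)[0], -1])         # (batch, m * d_c)

# Classifier: one fully connected hidden layer plus a softmax output, trained with cross-entropy.
h = tf.keras.layers.Dense(1000, activation="relu")(s)
y = tf.keras.layers.Dense(num_classes, activation="softmax")(h)
```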

2.1 Regularization Terms

Due to the probabilistic nature of our soft semantic clustering scheme, it is natural to integrate probabilistic prior knowledge to control and regularize the model learning process. We consider two regularization terms, at the word level and at the class level. In the semantics clustering layer, we obtain \(\varvec{a}_i = (a_{i1}, a_{i2}, \cdots , a_{im})\), the probability distribution of the i-th word over the m clusters. The word-level entropy regularization term is defined as:

$$\begin{aligned} \mathcal {L}_{word} = -\sum _{t=1}^N\sum _{k=1}^{m}a_{tk}\log (a_{tk}) \end{aligned}$$
(5)

We expect the probability distribution of a specific word over the clusters to be sparse, i.e., a word should be attributed to only one or a few clusters with high probability instead of being spread evenly over all clusters. Our optimization goal is therefore to minimize the word-level entropy.
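A short sketch of Eq. (5), assuming the per-word cluster distributions are stored as a (batch, n, m) tensor; the epsilon for numerical stability is our own addition.

```python
import tensorflow as tf

eps = 1e-8
A = tf.nn.softmax(tf.random.normal([2, 17, 8]), axis=-1)   # (batch, n, m) per-word cluster distributions

# Eq. (5): entropy of each word's cluster distribution, summed over the words of each text.
# Minimizing it encourages every word to concentrate on one or a few clusters.
L_word = -tf.reduce_sum(A * tf.math.log(A + eps), axis=[1, 2])   # one value per text
```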

The other, class-level regularization term is specifically designed for text classification. Suppose there is a vector \(\varvec{v}_{c_i}\) indicating the i-th class' probability distribution over the m clusters, calculated by averaging the cluster probability distributions of the texts belonging to the i-th class within a mini-batch during training. We take the average of the words' cluster probability distributions \(\varvec{a}\) in a text as the text-level cluster probability distribution \(\varvec{v}_s = \frac{1}{N_w}\sum _{i=1}^{N_w} \varvec{a}_i\), where \(N_w\) is the number of words in the text. The i-th class' cluster probability distribution is then:

$$\begin{aligned} \varvec{v}_{c_i} = \frac{1}{N_{c_i}}\sum _{k=1}^{N_{c_i}} \varvec{v}_{s_k} \end{aligned}$$
(6)

where \(N_{c_i}\) is the number of samples belonging to the i-th class in a mini-batch and \(\varvec{v}_{s_k}\) is the text-level cluster probability distribution of the k-th such text. We hope that different clusters capture semantics related to different categories, so the distributions of every two classes should differ. We adopt an intuitive and practical approach: we expect the peaks of different class-level distributions to fall in different clusters. To implement this, for each cluster we take the maximum value over all class-level distributions, sum these maxima, and encourage the summation to be large. The class-level regularization term is defined as:

$$\begin{aligned} \textstyle \mathcal {L}_{class} = \sum _{i=1}^{m}\max _{j=1:N_C}(\varvec{v}_{c_j}^i) \end{aligned}$$
(7)

where \(N_C\) is the number of categories and \(\varvec{v}_{c_j}^i\) is the i-th dimension (corresponding to the i-th cluster) of the j-th class' probability distribution vector. The larger the summation, the more the class-level distributions differ from one another. The final objective function for classification is defined as:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{total}=\frac{1}{N}\sum _{i=1}^N(\mathcal {L} +\lambda _1 \mathcal {L}_{word}) -\lambda _2 \mathcal {L}_{class} \end{aligned} \end{aligned}$$
(8)

where N is the number of samples in a mini-batch. The training objective is to minimize \(\mathcal {L}_{total}\).
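The sketch below computes the class-level term and assembles the total objective (Eqs. (6)–(8)) for one mini-batch; the batch layout, the handling of classes absent from the batch, and the neglect of padding are simplifying assumptions on our part.

```python
import tensorflow as tf

m, num_classes = 8, 4
A = tf.nn.softmax(tf.random.normal([16, 50, m]), axis=-1)     # (batch, n, m) per-word distributions
labels = tf.random.uniform([16], maxval=num_classes, dtype=tf.int32)

# Text-level distribution v_s: average of the word-level distributions (padding ignored for brevity).
v_s = tf.reduce_mean(A, axis=1)                               # (batch, m)

# Eq. (6): class-level distribution v_{c_i}, the mean of v_s over the batch samples of class i.
one_hot = tf.one_hot(labels, num_classes)                     # (batch, num_classes)
class_sums = tf.matmul(one_hot, v_s, transpose_a=True)        # (num_classes, m)
class_counts = tf.maximum(tf.reduce_sum(one_hot, axis=0)[:, None], 1.0)
v_c = class_sums / class_counts                               # (num_classes, m)

# Eq. (7): for every cluster take the peak over classes and sum; a larger value means
# the class-level distributions peak in different clusters.
L_class = tf.reduce_sum(tf.reduce_max(v_c, axis=0))

# Eq. (8): total objective with lambda_1 = lambda_2 = 0.001 as reported in Sect. 3.3,
# where L_ce is the per-sample cross-entropy loss and L_word comes from Eq. (5):
# L_total = tf.reduce_mean(L_ce + 0.001 * L_word) - 0.001 * L_class
```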

Table 1. Data statistics of the five benchmarks.

3 Experiment

3.1 Datasets

To evaluate the effectiveness of our proposed model, we conduct experiments on the text classification task. We test our model on five standard benchmark datasets (AGNews, DBPedia, Yahoo! Answers, Yelp P., Yelp F.) covering topic classification, sentiment classification and ontology classification, as in [23]. AGNews is a topic classification dataset with 4 categories: world, business, sports and science. The DBPedia ontology dataset is constructed by choosing 14 non-overlapping classes from DBPedia 2014; the fields used are the title and abstract of each Wikipedia article. Yahoo! Answers is a 10-category topic classification dataset obtained through the Yahoo! Webscope program; the fields include question title, question content and best answer. Yelp P. and Yelp F. consist of Yelp reviews obtained from the Yelp Dataset Challenge in 2015: Yelp P. predicts a polarity label by considering stars 1 and 2 as negative and 3 and 4 as positive, while Yelp F. predicts the full number of stars from 1 to 5. We use the preprocessed datasets published by [18], and the summary statistics of the data are shown in Table 1.

3.2 Compared Models

We compare our model with different types of baseline models: traditional feature-based models, i.e., n-gram TF-IDF and bag-of-words (BoW) [23]; word-embedding-based models such as FastText [5] and SWEM [14]; RNNs such as the LSTM [23]; reinforcement-learning-based models including ID-LSTM and HS-LSTM [22]; CNNs comprising DeepCNN [3], the small/large word CNN [23], CNN with dynamic pooling [6] and the densely connected CNN with multi-scale feature attention [19]; a CNN combined with an RNN [20]; the self-attentive model [8]; and other models specifically designed for classification [12, 18].

Table 2. Accuracy of all the models on the five datasets. The result marked with \(\diamond \) is re-printed from [19].

3.3 Implementation Details

For word representation, we use the pre-trained 300-dimensional GloVe word embeddings [11], which are updated with the other parameters during training. We split 10% of the samples from the training set as the validation set and tune the hyperparameters on it. The input texts are padded to the maximum length appearing in the training set. In the encoding layer, the hidden state of the bi-LSTM is set to 300 dimensions for each direction. The MLP for semantic clustering has 800 hidden units, and the dimension of the cluster vectors is set to 600. The cluster number is set to 8 on AGNews and 10 on the other datasets. The MLP used for classification has 1000 hidden units. The coefficients of the regularization terms are set to 0.001. We adopt Adam as the optimizer with a learning rate of 0.0005 and a batch size of 64. Our models are implemented with TensorFlow [1] and trained on one NVIDIA 1080Ti GPU. For all datasets, training converges within 4 epochs. We will release our code later.
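For reference, the reported hyperparameters can be collected in a single configuration dictionary; the key names (and the exact GloVe variant) are our own assumptions, since the code has not been released yet.

```python
# Hyperparameters reported in Sect. 3.3; key names and the GloVe variant are assumptions.
config = {
    "word_embeddings": "GloVe, 300-dim (pre-trained, fine-tuned)",
    "bilstm_hidden_per_direction": 300,
    "clustering_mlp_hidden": 800,
    "cluster_vector_dim": 600,
    "num_clusters": {"AGNews": 8, "others": 10},
    "classifier_hidden": 1000,
    "lambda_word": 0.001,
    "lambda_class": 0.001,
    "optimizer": "Adam",
    "learning_rate": 5e-4,
    "batch_size": 64,
    "validation_split": 0.1,
}
```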

Fig. 2. Visualization of topic classification on AGNews. The heat maps show how the words of a text are distributed over the clusters: the X-axis lists the words of the text, the Y-axis the clusters, and the title of each sub-figure gives the predicted class / ground truth.

3.4 Evaluation Results

We compare our method with several state-of-the-art baselines with respect to accuracy, following the same evaluation protocol as [23], who released these datasets. The experimental results are presented in Table 2, and the overall performance is competitive. Our method improves the best published performance by 1.0%, 0.1% and 2.5% on Yah.A., Yelp P. and Yelp F. respectively, and is comparable on AGNews.

Compared with the baselines, our results exceed the traditional feature-based models (n-gram TF-IDF, BoW), the word-embedding-based models (FastText, SWEM) and LEAM by a large margin. Furthermore, we gain a clear improvement over the LSTM-based models. Although both our model and the self-attentive model aim to obtain multiple vectors capturing different semantic information, our model gathers words into clusters to enrich the representations and therefore achieves better performance. Compared with the deep CNNs, our shallow model with a relatively simple structure outperforms them as well. The models proposed by [19] and [12] aim at capturing features from different regions; our results are on par with [19] and surpass [12].

3.5 Ablation Study

In this section, we randomly select two datasets (Yelp P. and Yah.A.) and conduct ablation studies on them, aiming to analyze the effect of the gating mechanism and of the two regularization terms (word-level and class-level). The results of removing each part separately are shown in Table 3, where “Full Model” denotes the whole model with nothing removed. Models trained with the gating mechanism and the regularization terms outperform their counterparts trained without them, which demonstrates the effect of each part.

Table 3. The ablation results on Yelp P. and Yah.A.
Fig. 3. The statistical clustering results on AGNews for clusters 3, 4 and 7.

3.6 Discussions

Analysis on Clustering Results. In this section, we examine the intermediate clustering results computed when testing on AGNews, a 4-category topic classification dataset covering science, business, sports and world.

Firstly, we visualize the distribution of words over clusters using heat maps, as shown in Fig. 2. Each column gives the probabilities of a word being attributed to the clusters, which sum to 1. We can see that most meaningless words, high-frequency words and punctuation are assigned to cluster 3, e.g. “a”, “for”, “that”. The text in Fig. 2(a) belongs to the business category and the model predicts it correctly; the words about business, such as “depot”, “oil” and “prices”, fall into cluster 4. Likewise, in Fig. 2(b) words about science, like “apple”, “download” and “computer”, are assigned to cluster 7. The text in Fig. 2(b) contains words about both science and business, which are divided into their corresponding clusters separately. Words like “beats”, “second” and “local”, which are domain-independent nouns, adjectives or verbs, have a more uniform probability distribution over the clusters.

Fig. 4. t-SNE plot of the intermediate cluster probability distribution vectors of texts on the AGNews test set.

We further compute statistics of the clustering results over all words. Specifically, we count how often each word is assigned to each cluster, where a word is considered assigned to the cluster that receives its maximum probability. For each cluster, we sort the assigned words by frequency and display the top 20 words of clusters 4, 7 and 3 in Fig. 3. The words in cluster 4 are about business, the words in cluster 7 are science-related, and the words in cluster 3 are almost meaningless for classification. These statistics are consistent with the clustering behavior observed in the heat maps.
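A sketch of this counting procedure; the iteration interface over (tokens, per-word cluster probabilities) pairs is an assumed data layout, not the authors' code.

```python
from collections import Counter, defaultdict
import numpy as np

def top_words_per_cluster(samples, top_k=20):
    """Count how often each word wins each cluster (arg-max of its cluster
    distribution) over a test set and return the top-k words per cluster.

    `samples` yields (tokens, A) pairs, where `tokens` is a list of n words and
    `A` is an (n, m) array of per-word cluster probabilities (assumed layout).
    """
    counts = defaultdict(Counter)
    for tokens, A in samples:
        assignments = np.argmax(A, axis=-1)          # hard cluster id per word
        for word, cluster in zip(tokens, assignments):
            counts[int(cluster)][word] += 1
    return {c: counter.most_common(top_k) for c, counter in counts.items()}
```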

From the above visualizations we make the following observations: first, words that indicate the same topic or have similar semantics are assigned to the same cluster; second, different categories correspond to different clusters; third, representative keywords receive much higher probabilities (deeper color in the heat map) in their specific cluster. These results correspond exactly to the motivation of the two regularization terms.

To evaluate the relevance between the clustering distribution and the classification results, we utilize t-SNE [9] to visualize the text-level cluster probability distributions on a two-dimensional map, as shown in Fig. 4. Each color represents a different class. The points are the text-level cluster probability distributions calculated by averaging the words' cluster probability distributions, and each class label is placed at the median of all points belonging to that class. As can be seen, samples of the same class are grouped together according to their text-level cluster probability distributions, and the boundaries between classes are clear, which again demonstrates the strong relevance between the clustering distribution and the classification results.
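A possible way to reproduce such a plot with scikit-learn and matplotlib; the function name and the array-based interface are our own assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_text_level_distributions(v_s, labels, class_names):
    """Project text-level cluster distributions (num_texts, m) to 2-D with t-SNE
    and color the points by class; `v_s` and `labels` are assumed NumPy arrays."""
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(v_s)
    for c, name in enumerate(class_names):
        pts = coords[labels == c]
        plt.scatter(pts[:, 0], pts[:, 1], s=3, label=name)
        # place the class label at the median of its points, as in Fig. 4
        plt.annotate(name, np.median(pts, axis=0))
    plt.legend()
    plt.show()
```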

Analysis on the Number of Semantic Clusters. Since the number of semantic clusters is a tunable hyperparameter, we analyze how performance is influenced by it, conducting experiments on AGNews and Yelp P. with the cluster number m varying over {2, 4, 6, 8, 10, 12}. From Fig. 5(a) we find that the accuracy increases with the number of clusters and begins to drop after reaching a peak. Notably, the best-performing cluster number does not coincide with the number of classes.

Analysis on Different Text Lengths. As the text length varies across datasets, we examine how test accuracy changes with text length. We perform experiments on AGNews and Yelp F., where the former has shorter texts than the latter, and divide the text lengths into 6 intervals. As Fig. 5(b) shows, our model performs better on relatively longer texts: with increasing text length, it gathers more information from the text. This is in line with the overall performance of our model, which is better on Yah.A., Yelp P. and Yelp F. than on AGNews and DBPedia, as the former three datasets have a longer average text length.

Fig. 5. Quantitative analysis on cluster number and sentence length.

4 Related Work

Text representation is an important and fundamental step in modern natural language processing, especially with the current development of approaches based on deep neural networks. The bi-LSTM [13] is a widely used representation model which propagates information in both directions. Although it alleviates the loss of information as sentence length increases, our framework further integrates information by clustering words with similar semantics and gives a visual explanation of the results. Other popular representation models are based on CNNs [3, 6, 23], which capture word features locally, whereas our model is not limited by distance. An effective recent model is the structured self-attentive sentence embedding model proposed by [8], which uses a multi-head self-attention mechanism to map a sentence into different semantic spaces and obtains a sentence representation for each space through attention-weighted summation. However, it focuses on extracting different aspects of a sentence based on automatically learned relative importance, rather than composing similar aspects together to strengthen the information of each part as our work does.

Besides the above representation models, several other models are specifically designed for classification. Wang et al. [18] take label information into consideration by jointly embedding words and labels in the same latent space. Wang et al. [19] use a densely connected CNN to capture variable n-gram features and adopt multi-scale feature attention to adaptively select multi-scale features. Qiao et al. [12] utilize the information of words' relative positions and local context to produce region embeddings for classification.

5 Conclusion and Future Work

In this paper, we propose to transform the flat word-level text representation into a higher, cluster-level text representation for classification. We cluster words according to the semantics contained in their contextualized vectors, obtained by concatenating the initial word embeddings with the outputs of the bi-LSTM encoder. We further introduce regularization schemes over words and classes to guide the clustering process. Experimental results on five classification benchmarks suggest the effectiveness of our proposed method, and further statistical and visual analyses explicitly show the clustering results and provide interpretability for the classification results. In the future, we will try other encoder backbones for capturing word semantics, and we are interested in using phrases instead of individual words as the basic elements for clustering. The idea of cluster-level representation is also worth exploring on other NLP tasks.