1 Introduction

Text classification, whose goal is to assign labels to text, is one of the classic tasks in NLP [39]. It has a wide range of applications, such as spam detection [30], sentiment classification [22, 29] and topic labeling [36].

A better text representation is key to better performance on natural language processing tasks such as text classification. The traditional text representation is the one-hot representation, which not only loses the contextual information of a text but is also sparse and suffers from the curse of dimensionality.

Recently, distributed word-level representations based on neural network models [3, 5, 11, 23,24,25] have become increasingly popular. Such representations are called word vectors, word embeddings or distributed representations. Neural word embedding techniques model the context and the relationship between the context and the target word, mapping words into a low-dimensional vector space [18].

For text representation, with the improvement of hardware and the increase in the amount of available data, deep learning methods have become more and more popular: convolutional neural networks [13, 14], recurrent neural networks [18] and attention mechanisms [39] have been used to learn text representations for text classification and have achieved better performance. All of this has shown that text classification methods based on neural networks are quite effective [6, 12, 14, 21, 34, 38].

In this paper, we propose a text representation and classification model (ACNN) that combines an attention mechanism and a convolutional network. In this model, we use bi-attention to learn two context vectors. The bi-attention provides the context vector c for the following convolution layer, so that the convolution layer can perform targeted feature extraction; the convolution layer extracts features from the context vector c and converts the text into a low-dimensional feature vector m. Finally, a softmax classifier is used for text classification. We test our model on 8 benchmark text classification datasets, and it achieves better or comparable performance compared with the state-of-the-art methods.

2 Related work

Text classification is one of the basic tasks of natural language processing, and many researchers have explored various ways to improve classification performance. Hill, Cho, and Korhonen (2016) proposed learning distributed representations of sentences from unlabelled data [8]. Conneau et al. (2017) showed the suitability of natural language inference for transfer learning to other NLP tasks [6]. Le and Mikolov (2014) proposed learning a Paragraph Vector for classification [19]. The Illinois-LH system [17], Tree-LSTM [33] and AdaSent [41] are all state-of-the-art methods for text classification.

Arora, Liang, and Ma proposed an unsupervised sentence embedding method [1]. Lin et al. (2017) also proposed a structure for sentence embedding [21]. Lai et al. (2015) proposed a recurrent convolutional neural network for text classification [18]. Zhang, Zhao, and LeCun (2015) used character-level convolutional networks for text classification [40]. Kim (2014) applied word-level convolutional neural networks to sentence classification [14]. Socher et al. (2013) proposed a deep recursive model [31]. Dai and Le (2015) used semi-supervised learning to improve sequence learning [7]. Kalchbrenner, Grefenstette, and Blunsom (2014) applied a convolutional neural network to modelling sentences [13]. Kiros et al. (2015) proposed Skip-Thought vectors [16], and Wang and Manning (2012; 2013) used SVM, F-Dropout, etc. [35, 36]. All of these state-of-the-art methods have demonstrated that text classification based on neural networks is quite effective.

The model proposed in this paper combines attention mechanisms and convolutional neural networks to learn a sentence vector for classification. The attention mechanism was first proposed by Mnih et al. in computer vision [26]. Bahdanau, Cho, and Bengio (2014) applied it to neural machine translation [2]. Yang et al. (2016) proposed a hierarchical attention model for document classification [39] and achieved good performance.

3 Background

In natural language processing, most data come in the form of sequences. However, the original feed-forward neural network is not well suited to processing sequential data. Compared with it, the recurrent neural network (RNN) allows information to persist across steps, which has led to success in speech recognition, language modelling, machine translation and other tasks. However, the RNN performs poorly on long-dependency problems.

Long short-term memory (LSTM), proposed in [9], uses a gating mechanism to address the poor performance of RNNs on long-dependency problems. The gated recurrent unit (GRU), proposed in [4], modifies the gating mechanism of the LSTM so that training time is reduced. Bahdanau et al. [2] proposed applying the attention mechanism to end-to-end neural machine translation and obtained better performance.

In this paper, we use the attention mechanism and a convolutional neural network for text classification and obtain better performance. In the following subsections, we give an introduction to the bidirectional RNN, the attention mechanism, bi-attention and multi-layer attention.

3.1 BiRNN

Bidirectional RNN (BiRNN), bidirectional LSTM (BiLSTM) and bidirectional GRU (BiGRU) are what we use in the experiments; here we give only a brief recall of the BiRNN.

Let (x1,⋯ ,xT) be the input sequence. The hidden state ht of the RNN at step t is calculated as in (1).

$$ \mathbf{h}_{t}=f\left( \mathbf{U}\mathbf{x}_{t}+\mathbf{W}\mathbf{h}_{t-1}+\mathbf{b}\right) $$
(1)

where f (⋅) is a non-linear activation function such as sigmoid or tanh, and the trainable parameters U, W and b are shared across all steps of the RNN.

The BiRNN, by contrast, contains a forward RNN \(\overrightarrow {f}\) and a backward RNN \(\overleftarrow {f}\), as shown in (2) and (3) respectively.

$$\begin{array}{@{}rcl@{}} \overrightarrow{\mathbf{h}}_{t}&=&\overrightarrow{f}\left( \overrightarrow{\mathbf{U}}\mathbf{x}_{t}+\overrightarrow{\mathbf{W}}\overrightarrow{\mathbf{h}}_{t-1}+\overrightarrow{\mathbf{b}}\right) \end{array} $$
(2)
$$\begin{array}{@{}rcl@{}} \overleftarrow{\mathbf{h}}_{t}&=&\overleftarrow{f}\left( \overleftarrow{\mathbf{U}}\mathbf{x}_{t}+\overleftarrow{\mathbf{W}}\overleftarrow{\mathbf{h}}_{t + 1}+\overleftarrow{\mathbf{b}}\right) \end{array} $$
(3)

where \(\overrightarrow {f}\left (\cdot \right )\) and \(\overleftarrow {f}\left (\cdot \right )\) are the corresponding non-linear activation functions, and \(\overrightarrow {\mathbf {U}},\overrightarrow {\mathbf {W}},\overrightarrow {\mathbf {b}},\overleftarrow {\mathbf {U}},\overleftarrow {\mathbf {W}}\) and \(\overleftarrow {\mathbf {b}}\) are the trainable parameters.

The forward RNN \(\overrightarrow {f}\) reads the input sequence from x1 to xT and computes a sequence of forward hidden states \(\left (\overrightarrow {\mathbf {h}}_{1},\cdots ,\overrightarrow {\mathbf {h}}_{T}\right )\). The backward RNN \(\overleftarrow {f}\) reads the sequence from xT to x1 and produces a sequence of backward hidden states \(\left (\overleftarrow {\mathbf {h}}_{1},\cdots ,\overleftarrow {\mathbf {h}}_{T}\right )\).

Finally, we get the hidden state ht of the BiRNN at step t by concatenating the forward hidden state \(\overrightarrow {\mathbf {h}}_{t}\) and the backward hidden state \(\overleftarrow {\mathbf {h}}_{t}\), as shown in (4).

$$ \mathbf{h}_{t}=\left[\overrightarrow{\mathbf{h}}_{t}^{\top};\overleftarrow{\mathbf{h}}_{t}^{\top}\right]^{\top} $$
(4)
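
To make the recurrence concrete, below is a minimal NumPy sketch of (1)–(4). The tanh activation, the zero initial states and the parameter shapes are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, b):
    # Eq. (1): h_t = f(U x_t + W h_{t-1} + b); f = tanh assumed here
    return np.tanh(U @ x_t + W @ h_prev + b)

def birnn(xs, fwd_params, bwd_params, hidden_size):
    """xs: list of T input vectors. Returns the T concatenated hidden states of (4)."""
    T = len(xs)
    h_fwd, h_bwd = [None] * T, [None] * T
    h = np.zeros(hidden_size)
    for t in range(T):                      # forward RNN, Eq. (2)
        h = rnn_step(xs[t], h, *fwd_params)
        h_fwd[t] = h
    h = np.zeros(hidden_size)
    for t in reversed(range(T)):            # backward RNN, Eq. (3)
        h = rnn_step(xs[t], h, *bwd_params)
        h_bwd[t] = h
    # Eq. (4): concatenate forward and backward states at each step
    return [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
```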

3.2 Attention mechanism

Mnih et al. [26] proposed applying the attention mechanism in computer vision. In NLP, Bahdanau et al. [2] first applied the attention mechanism to a neural machine translation system and achieved good performance.

Figure 1 shows the architecture of an RNN with the attention mechanism, where x1,⋯ ,xT is the input sequence and ct is the output (context vector) of the attention part at time t, calculated as follows:

$$ \mathbf{c}_{t}=\sum\limits_{k = 1}^{T}{\alpha_{t}^{k}} \mathbf{h}_{k} $$
(5)

It can be seen from (5) that ct is the weighted sum of the hidden states (h1,⋯ ,hT) of the RNN, where \({\alpha _{t}^{k}}\) is the weight of the hidden state hk, calculated as follows:

$$ {\alpha_{t}^{k}}=\frac{\exp\left( \hat{\alpha}_{t}^{k}\right)}{{\sum}_{j = 1}^{T}\exp\left( \hat{\alpha}_{t}^{j}\right)} $$
(6)

where \(\hat {\alpha }_{t}^{k}\) is learned jointly with the other parts of the model.

Fig. 1 The architecture of a RNN with the attention mechanism

From (6) we know that \(\left ({\alpha _{t}^{1}},\cdots ,{\alpha _{t}^{T}}\right )\) is obtained by softmax normalization of the trainable weights \(\left (\hat {\alpha }_{t}^{1},\cdots ,\hat {\alpha }_{t}^{T}\right )\).
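
A minimal NumPy sketch of (5) and (6) follows. How the unnormalised scores \(\hat {\alpha }_{t}^{k}\) are produced is left abstract here, since the paper only states that they are learned jointly with the rest of the model.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))   # Eq. (6), with a shift for numerical stability
    return e / e.sum()

def context_vector(h, scores_t):
    """h: T x d matrix of hidden states; scores_t: the T unnormalised scores
    (alpha_hat_t^1, ..., alpha_hat_t^T) for step t, learned with the model."""
    alpha_t = softmax(scores_t)           # normalised weights (alpha_t^1, ..., alpha_t^T)
    return alpha_t @ h                    # Eq. (5): weighted sum of the hidden states
```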

3.3 Bi-attention

The bi-attention uses two attention mechanisms to learn two context vectors: a forward RNN with attention learns the forward context vector \(\overrightarrow {\mathbf {c}}\), and a backward RNN with attention learns the backward context vector \(\overleftarrow {\mathbf {c}}\); these are then concatenated to obtain the context vector c.

Figure 2 shows the architecture of bi-attention. For the forward RNN, the forward weights \(\left (\overrightarrow {\alpha }_{t}^{1},\cdots ,\overrightarrow {\alpha }_{t}^{T}\right )\) are learned over the forward hidden states \(\left (\overrightarrow {\mathbf {h}}_{1},\cdots ,\overrightarrow {\mathbf {h}}_{T}\right )\) to obtain the t-th forward context vector \(\overrightarrow {\mathbf {c}}_{t}\). For the backward RNN, the backward weights \(\left (\overleftarrow {\alpha }_{t}^{1},\cdots ,\overleftarrow {\alpha }_{t}^{T}\right )\) are learned over the backward hidden states \(\left (\overleftarrow {\mathbf {h}}_{1},\cdots ,\overleftarrow {\mathbf {h}}_{T}\right )\) to obtain the t-th backward context vector \(\overleftarrow {\mathbf {c}}_{t}\). Then \(\overrightarrow {\mathbf {c}}_{t}\) and \(\overleftarrow {\mathbf {c}}_{t}\) are concatenated to obtain the t-th context vector ct.

Fig. 2 The architecture of Bi-Attention
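
Continuing the sketch from Section 3.2 (and reusing its context_vector function), one step of bi-attention amounts to the following; the score vectors are again assumed to be learned elsewhere in the model.

```python
import numpy as np

def bi_attention_step(h_fwd, h_bwd, scores_fwd_t, scores_bwd_t):
    """h_fwd, h_bwd: T x d matrices of forward/backward hidden states."""
    c_fwd_t = context_vector(h_fwd, scores_fwd_t)   # forward context vector
    c_bwd_t = context_vector(h_bwd, scores_bwd_t)   # backward context vector
    return np.concatenate([c_fwd_t, c_bwd_t])       # t-th context vector c_t
```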

3.4 Multi-layer attention

The multi-layer attention is a stack of bi-attention layers.

Figure 3 shows the architecture of two-layer attention. The input of the forward RNN in the second layer is the forward context vector \(\overrightarrow {\mathbf {c}}_{t}^{1}\), computed from the forward weights \(\left ({~}^{1}\overrightarrow {\alpha }_{t}^{1},\cdots ,^{1}\overrightarrow {\alpha }_{t}^{T}\right )\) and the forward hidden states \(\left (\overrightarrow {\mathbf {h}}_{1}^{1},\cdots ,\overrightarrow {\mathbf {h}}_{T}^{1}\right )\) of the first layer; the input of the backward RNN in the second layer is the backward context vector \(\overleftarrow {\mathbf {c}}_{t}^{1}\), computed from the backward weights \(\left ({~}^{1}\overleftarrow {\alpha }_{t}^{1},\cdots ,^{1}\overleftarrow {\alpha }_{t}^{T}\right )\) and the backward hidden states \(\left (\overleftarrow {\mathbf {h}}_{1}^{1},\cdots ,\overleftarrow {\mathbf {h}}_{T}^{1}\right )\) of the first layer. The second layer computes the forward context vector \(\overrightarrow {\mathbf {c}}_{t}\) from the forward weights \(\left ({~}^{2}\overrightarrow {\alpha }_{t}^{1},\cdots ,^{2}\overrightarrow {\alpha }_{t}^{T}\right )\) and the forward hidden states \(\left (\overrightarrow {\mathbf {h}}_{1}^{2},\cdots ,\overrightarrow {\mathbf {h}}_{T}^{2}\right )\) of the second layer, and the backward context vector \(\overleftarrow {\mathbf {c}}_{t}\) from the backward weights \(\left ({~}^{2}\overleftarrow {\alpha }_{t}^{1},\cdots ,^{2}\overleftarrow {\alpha }_{t}^{T}\right )\) and the backward hidden states \(\left (\overleftarrow {\mathbf {h}}_{1}^{2},\cdots ,\overleftarrow {\mathbf {h}}_{T}^{2}\right )\) of the second layer; \(\overrightarrow {\mathbf {c}}_{t}\) and \(\overleftarrow {\mathbf {c}}_{t}\) are then concatenated to obtain the context vector ct. A minimal sketch follows Fig. 3.

Fig. 3 The architecture of multi-layer attention
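
The stacking can be sketched as follows, again reusing context_vector from Section 3.2; the rnn_* and score objects are hypothetical stand-ins for the trained components of the model.

```python
import numpy as np

def attention_layer(inputs_fwd, inputs_bwd, rnn_fwd, rnn_bwd, scores_fwd, scores_bwd):
    """One bi-attention layer. rnn_fwd/rnn_bwd map a list of T inputs to a list
    of T hidden states; scores_fwd/scores_bwd hold the unnormalised attention
    scores for each step t (learned elsewhere). Returns the per-step forward
    and backward context vectors."""
    h_fwd = np.stack(rnn_fwd(inputs_fwd))
    h_bwd = np.stack(rnn_bwd(inputs_bwd))
    T = len(inputs_fwd)
    c_fwd = [context_vector(h_fwd, scores_fwd[t]) for t in range(T)]
    c_bwd = [context_vector(h_bwd, scores_bwd[t]) for t in range(T)]
    return c_fwd, c_bwd

# Two-layer attention: the context vectors of layer 1 become the inputs of the
# RNNs in layer 2.
#   c_fwd1, c_bwd1 = attention_layer(xs, xs, rnn_fwd1, rnn_bwd1, s_fwd1, s_bwd1)
#   c_fwd2, c_bwd2 = attention_layer(c_fwd1, c_bwd1, rnn_fwd2, rnn_bwd2, s_fwd2, s_bwd2)
#   c_t = np.concatenate([c_fwd2[t], c_bwd2[t]])
```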

4 Model

In this section, we describe the text representation and classification model (ACNN) proposed in this paper, which combines the attention mechanism and a convolutional neural network. The model architecture is shown in Fig. 4. We introduce the model in the following three subsections.

Fig. 4 The architecture of the ACNN

4.1 BiRNN with attention mechanism

As shown in the left part of Fig. 4, we use a BiRNN with the attention mechanism (the bi-attention of Section 3.3) in the attention part of our model.

The input xt at the t-th step of the BiRNN is the word vector of the t-th word in the input sequence. We then obtain the forward hidden state \(\overrightarrow {\mathbf {h}}_{t}\) and the backward hidden state \(\overleftarrow {\mathbf {h}}_{t}\) of the BiRNN (as in (2) and (3)). The forward context vector \(\overrightarrow {\mathbf {c}}_{t}\) is calculated from \(\overrightarrow {\mathbf {h}}_{t}\) and \(\overrightarrow {\alpha }_{t}\), and the backward context vector \(\overleftarrow {\mathbf {c}}_{t}\) from \(\overleftarrow {\mathbf {h}}_{t}\) and \(\overleftarrow {\alpha }_{t}\) (as in (5)), where the weights \(\overrightarrow {\alpha }_{t}\) and \(\overleftarrow {\alpha }_{t}\) are learned as in (6).

Then we concatenate the forward context vector \(\overrightarrow {\mathbf {c}}_{t}\) and the backward context vector \(\overleftarrow {\mathbf {c}}_{t}\) to obtain the context vector ct (7).

$$ \mathbf{c}_{t}=\left[\overrightarrow{\mathbf{c}}_{t}^{\top};\overleftarrow{\mathbf{c}}_{t}^{\top}\right]^{\top} $$
(7)

Finally, we concatenate the context vectors of all steps, i.e.

$$ \mathbf{c}=\left[\mathbf{c}_{1},\cdots,\mathbf{c}_{T}\right], $$
(8)

as the input of the convolution operation in the middle part of Fig. 4.

4.2 Convolution and max-pooling

As shown in the middle part of Fig. 4, we apply a max-pooling operation after the convolution operation to obtain the sentence vector m.

In the convolution operation, we use the context vector c obtained by the attention part as the input. The convolution operation is shown in (9).

$$ \mathbf{v}_{i}=f\left( \mathbf{V}\cdot\mathbf{c}_{i:i+h}+\mathbf{b}\right) $$
(9)

where f (⋅) is a non-linear activation function, h is the size of the filter window, and the weights V and the bias b are trainable parameters. We concatenate the feature vi of each convolution step to obtain v:

$$ \mathbf{v}=\left[\mathbf{v}_{1},\cdots,\mathbf{v}_{T}\right] $$
(10)

Then max-pooling (see (11)) [5] is used to obtain the sentence vector m, which is sent to the classifier shown in the right part of Fig. 4.

$$ \mathbf{m}=\max\left( \mathbf{v}\right) $$
(11)
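
A minimal NumPy sketch of (9)–(11) is given below. The ReLU activation, the flattening of each window of h context vectors and the parameter shapes are our own assumptions for illustration; no padding is applied, so the number of windows is T−h+1.

```python
import numpy as np

def conv_max_pool(c, V, b, h):
    """c: T x d matrix of context vectors (Eq. (8)); V: (n_filters, h*d);
    b: (n_filters,); h: filter window size."""
    T = c.shape[0]
    feats = []
    for i in range(T - h + 1):
        window = c[i:i + h].reshape(-1)                 # h consecutive context vectors
        feats.append(np.maximum(V @ window + b, 0.0))   # Eq. (9), ReLU assumed
    v = np.stack(feats)                                 # Eq. (10)
    return v.max(axis=0)                                # Eq. (11): sentence vector m
```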

4.3 Softmax classifier

As shown in the right part of Fig. 4, there is a fully connected layer after the max-pooling layer, followed by a softmax classifier that predicts the text label.

In this part, the sentence vector m, i.e. the output of the max-pooling layer, is used as the input of the classifier. After the fully connected transform, a softmax function (see (12)) is applied to predict the probability pk that the text belongs to category k, and the argmax then gives the predicted label.

$$ p_{k} = \frac{\exp\left( y_{k}\right)}{{\sum}_{j = 1}^{n}\exp\left( y_{j}\right)} $$
(12)

where yk is the output of the fully connected layer that feeds the softmax layer, and pk is the output of the softmax layer.
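
A minimal NumPy sketch of this classifier follows; the single fully connected layer producing the scores y is an assumption consistent with the description above.

```python
import numpy as np

def classify(m, W, b):
    """m: sentence vector from max-pooling; W, b: fully connected parameters."""
    y = W @ m + b                  # scores y_1, ..., y_n for the n classes
    p = np.exp(y - y.max())
    p = p / p.sum()                # Eq. (12): class probabilities p_k
    return int(np.argmax(p)), p    # argmax gives the predicted label
```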

5 Experiments

This section mainly introduces the datasets, experimental setup and the model variations.

5.1 Datasets

We test the model proposed in this paper on 8 benchmark text classification datasets. A summary of the datasets is shown in Table 1.

  • MR: MR (Movie Reviews) was first used in [28], with one line per review. The dataset has 5,331 positive and 5,331 negative processed reviews and two target classes (positive or negative), so this is a binary classification task. We use 10-fold cross-validation on this dataset.

  • Subj: Subj (Subjectivity), with 5,000 subjective and 5,000 objective processed sentences, was first used in [27]. There are two target classes (subjective or objective), so this is also a binary classification task, and we again use 10-fold cross-validation.

  • SST: SST (Stanford Sentiment Treebank) comes with a train/dev/test split provided by [31] and 5 labels (very positive, positive, neutral, negative and very negative). There are 11,855 samples in this dataset: 8,544 for the train set, 2,210 for the test set and the rest for the dev set.

  • SST2: SST2 is derived from SST. We removed the neutral reviews and merged the very positive and positive reviews into a positive class and the negative and very negative reviews into a negative class. This leaves 9,613 samples: 7,792 for the train set and 1,821 for the test set. This is also a binary classification task.

  • IMDB: IMDB contains 50,000 reviews split evenly into 25k train and 25k test sets, provided by [22]. This is a binary sentiment classification dataset containing substantially more data than previous benchmark datasets; additional unlabeled data is also available.

  • TREC: TREC is a question classification dataset with 6 labels, first used in [20]. There are 5,952 samples in this dataset, with labeled training sets of 1,000, 2,000, 3,000, 4,000 and 5,500 samples respectively, and a test set of 500 samples.

  • CR: Annotated customer reviews of 14 products obtained from Amazon [10]. The task is to classify each customer review as positive or negative.

  • MPQA: The phrase-level opinion polarity detection subtask of the MPQA dataset [37].

Table 1 A summary of the datasets

5.2 Experimental setup

  • Word vectors: The word vectors we use were pre-trained by Mikolov et al. [23] on 100 billion words of Google News. They were trained with the CBOW (continuous bag-of-words) model, and their dimensionality is 300. Words not present in the pre-trained vectors are initialized randomly. To match the dimensionality of the word vectors to the number of hidden units of the BiRNN, we use a fully connected layer:

    $$ \mathbf{u} = f\left( \mathbf{W}\cdot\mathbf{w}+\mathbf{b}\right) $$
    (13)

    where f (⋅) is a non-linear activation function, w is the word vector, and W and b are trainable parameters. The dimensionality of u equals the number of hidden units.

  • Weights and biases initialization: All weights are randomly initialized from a normal distribution with mean 0 and standard deviation 0.1. All biases are initialized to 0.1.

  • Hyperparameters: We use the Adam [15] optimizer with a learning rate of 0.001. For the fully connected layer, L2 regularization with rate 0.0001 is used to prevent over-fitting. The dropout rate [32] is set to 0.5. The training batch size is 64, and the number of hidden units of the BiRNN is 128. For the convolution layer, we use filter windows of 3, 4 and 5 with 100 feature maps each. These settings, together with the projection in (13), are summarised in the sketch after this list.
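
The following sketch collects the setup above; the key names and the tanh activation are our own choices, while the values are those reported in this section.

```python
import numpy as np

# Hyperparameters as reported above (the key names are ours).
config = {
    "optimizer": "Adam",
    "learning_rate": 0.001,
    "l2_rate": 0.0001,          # L2 regularization on the fully connected layer
    "dropout": 0.5,
    "batch_size": 64,
    "hidden_units": 128,        # BiRNN hidden units
    "filter_windows": [3, 4, 5],
    "filters_per_window": 100,
    "embedding_dim": 300,       # pre-trained GoogleNews word2vec vectors
    "init_std": 0.1,            # normal(0, 0.1) weight initialization, biases 0.1
}

def project_word_vector(w, W, b):
    # Eq. (13): map a 300-d word vector to the BiRNN hidden size (tanh assumed)
    return np.tanh(W @ w + b)
```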

5.3 Model variations

There are three model variants in our experiments. They differ in the recurrent neural network structure of the attention part, i.e. in the structure used to compute the hidden states h (a minimal sketch of this dispatch follows the list below).

  • ACNN(BiRNN): The recurrent structure is based on the original RNN. The hidden state h is calculated as in (14):

    $$\begin{array}{@{}rcl@{}} \mathbf{h}_{t} = \text{BiRNN}\left( \mathbf{x}_{t}\right),\ t \in\left[1,\cdots,T\right] \end{array} $$
    (14)

    For simplicity, BiRNN denotes the computation of the bidirectional recurrent neural network.

  • ACNN(BiLSTM): The recurrent structure is based on the LSTM. The hidden state h is calculated as follows:

    $$ \mathbf{h}_{t} = \text{BiLSTM}\left( \mathbf{x}_{t}\right),\ t \in\left[1,\cdots,T\right] $$
    (15)

    where an LSTM cell replaces the original RNN cell.

  • ACNN(BiGRU): The recurrent structure is based on the GRU. The hidden state h is calculated as follows:

    $$ \mathbf{h}_{t} = \text{BiGRU}\left( \mathbf{x}_{t}\right),\ t \in\left[1,\cdots,T\right] $$
    (16)

    where a GRU cell replaces the original RNN cell.
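
The three variants differ only in the recurrent cell, as the minimal sketch below shows; rnn_step is the plain RNN cell sketched in Section 3.1, while lstm_step and gru_step are hypothetical cell functions (e.g. from a deep-learning library). For the LSTM the state would in practice also carry a cell state, which this sketch glosses over.

```python
def make_hidden_states(cell_step, xs, init_state, params):
    """Run one directional pass of any recurrent cell over the inputs xs.
    cell_step(x_t, state, *params) returns the next state."""
    state, states = init_state, []
    for x_t in xs:
        state = cell_step(x_t, state, *params)
        states.append(state)
    return states

# e.g.  ACNN(BiRNN):  make_hidden_states(rnn_step,  xs, h0, rnn_params)
#       ACNN(BiLSTM): make_hidden_states(lstm_step, xs, s0, lstm_params)
#       ACNN(BiGRU):  make_hidden_states(gru_step,  xs, h0, gru_params)
```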

6 Results and discussion

In this section, we analyse the experimental results. The classification accuracy of our ACNN compared with the state-of-the-art models is shown in Tables 2 and 3. Underlined are the best results among the ACNNs, and in bold are the best results of the state-of-the-art methods as reported in the corresponding papers.

  • Comparison with the state-of-the-art models: As shown in Table 2, the BiRNN-based model matches the performance of most of the state-of-the-art models on the 7 text classification tasks, and the BiLSTM-based and BiGRU-based models reach almost the same accuracy as the best state-of-the-art results. In particular, on the TREC and CR datasets our methods achieve the best accuracy: on TREC we reduce the error rate by 6.3% with the BiRNN-based ACNN, by 26.7% with the BiGRU-based ACNN and by 34.4% with the BiLSTM-based ACNN, and on CR by 8.8% with the BiLSTM-based ACNN. For the MR dataset, the accuracy of the BiLSTM-based ACNN equals the best of the state-of-the-art methods, AdaSent. For the Subj, SST and MPQA datasets, ACNN obtains the second-highest accuracy, lower only than AdaSent; for SST2, ACNN obtains the third-highest accuracy, lower only than Tree-LSTM (Glove vectors, tuned) and CNN-multichannel. These results demonstrate the effectiveness of the proposed method.

  • Comparison of the BiRNN-, BiLSTM- and BiGRU-based models: The BiLSTM-based and BiGRU-based ACNNs perform better than the BiRNN-based model. For all of the datasets, the maximum sentence length exceeds 30 words. As discussed in the Background section, the RNN performs poorly on long-dependency problems; the LSTM uses a gating mechanism to improve the RNN, and the GRU only modifies the gating mechanism of the LSTM. This is why the BiGRU-based and BiLSTM-based ACNNs outperform the BiRNN-based ACNN. The GRU simplifies the LSTM gating mechanism by reducing the number of gates, which speeds up training; however, with fewer weights its performance does not exceed that of the LSTM, so the BiGRU-based ACNN performs slightly worse than the BiLSTM-based ACNN.

  • Performance on the IMDB dataset: As shown in Table 3, the accuracy of our model is 91.1%, better than most of the state-of-the-art models, but Paragraph-Vec, SA-LSTM and seq2-bown-CNN perform better than our model. An IMDB sample contains multiple sentences; in the experiment we treat each sample as a single sentence, so the average sentence length in the IMDB dataset is 231 (shown in Table 1). The RNN is not good at long-dependency problems, and even the LSTM, although an improvement over the RNN, struggles with such long sequences. Paragraph-Vec [19] learns a paragraph vector, which may be more suitable for multi-sentence data. A hierarchical structure (as done in [39]) could be a way to handle multi-sentence data.

  • Convergence speed: As seen in Fig. 5, as training goes on the BiLSTM-based and BiGRU-based ACNNs reach over 90% accuracy quickly, after about 1,000 batches of training, whereas the BiRNN-based ACNN requires 5,000 batches to reach 90% accuracy. The BiLSTM-based and BiGRU-based ACNNs therefore converge faster than the BiRNN-based ACNN. As discussed in the Background section, the LSTM and GRU handle long dependencies better than the RNN, and for all of the datasets the maximum sentence length exceeds 30 words. For long sentences, the BiLSTM-based and BiGRU-based ACNNs thus reach a local optimum more easily than the BiRNN-based ACNN, since the LSTM and GRU improve the performance of the RNN on long sentences.

  • Comparison with BRCNNs (no attention mechanism): As shown in Table 4, stacking LSTM layers improves the accuracy; however, the ACNN with a single attention layer matches the BRCNN with 2 or 3 LSTM layers. The validation accuracy is very volatile when we stack LSTM layers (2 or 3 layers, shown in Fig. 6), but not when we stack attention layers (shown in Fig. 7). When we add the attention mechanism to our model, it converges stably to a local optimum; that is, with the attention mechanism we obtain a better sentence vector.

  • Multi-layer attention: We also trained ACNNs with multi-layer attention (see Fig. 3); the accuracy is shown in Table 4. Stacking attention layers does not increase the accuracy, which means that one attention layer is enough.

Table 2 Results of our models and the state-of-the-art models
Table 3 Performance of models on the IMDB sentiment classification task
Fig. 5 The accuracy of different ACNNs

Table 4 Performance of models on the TREC dataset
Fig. 6 The accuracy of BRCNN based on multi-layer BiLSTM with attention mechanism

Fig. 7 The accuracy of ACNN based on BiLSTM with multi layers

In this section, we gave a detailed analysis of the experimental results. The experiments show that ACNNs can learn a better sentence vector for classification: the RNN underlying the attention mechanism models the word-order information of a sentence, the attention mechanism assigns different weights to the information and computes the context vectors, the convolution operation captures local information, and the max-pooling operation keeps the strongest feature information. Together these yield better sentence representation and classification.

7 Conclusion

This paper proposes bi-attention and multi-layer attention, and describes a text representation and classification model based on the attention mechanism and a convolutional neural network. The bi-attention uses two attention mechanisms to learn two context vectors: a forward RNN with attention learns the forward context vector and a backward RNN with attention learns the backward context vector; these are then concatenated to obtain the context vector. The multi-layer attention is a stack of bi-attention layers. The ACNN combines bi-attention and a convolutional neural network for sentence representation and classification. In tests on the 8 benchmark text classification datasets, our model achieves better or comparable performance compared with the state-of-the-art methods, which shows that our approach is feasible; in other words, ACNNs can learn a better sentence vector for classification.

8 Future work

In this work we focused only on English text representation and classification; in the future we will experiment on text in other languages. We may also speed up training by using only the max-pooling operation instead of the convolution operation followed by max-pooling.