
1 Introduction

Recently, researchers from many subareas of Natural Language Processing and Machine Learning have been working on sentiment analysis and related tasks [8, 13, 14, 17, 20]. In this work, we focus on one fundamental task in sentiment analysis: the detection of opinion expressions, both direct subjective expressions (DSEs) and expressive subjective expressions (ESEs), as defined in Wiebe et al. [18]. DSEs are explicit mentions of private states or speech events expressing private states; ESEs are expressions that indicate sentiment, emotion, etc. without explicitly conveying them.

Opinion expression extraction has often been treated as a sequence labeling task in previous work. This approach usually uses the conventional B-I-O tagging scheme to convert the original opinion expressions into sequences of tag tokens: B marks the beginning of an opinion expression, I marks a token inside an opinion expression, and O marks a token outside any opinion expression. Since two types of opinion expressions (DSE, ESE) are used in the annotation, there are five tag labels in this task: B_DSE, I_DSE, B_ESE, I_ESE and O. The example sentences in Fig. 1 show this tagging scheme. For instance, the DSE “wanted this very much” results in one B_DSE tag for “wanted” and three I_DSE tags for “this very much”.

Fig. 1. Example sentences with opinion expression B-I-O labels
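To make the tagging scheme concrete, the following minimal Python sketch (ours, not code released with the paper; the function spans_to_bio and its (start, end, kind) span format are hypothetical) converts annotated expression spans into B-I-O tags:

```python
def spans_to_bio(tokens, spans):
    """spans: list of (start, end, kind), end exclusive, kind in {"DSE", "ESE"}."""
    tags = ["O"] * len(tokens)
    for start, end, kind in spans:
        tags[start] = "B_" + kind            # B marks the beginning of the expression
        for i in range(start + 1, end):
            tags[i] = "I_" + kind            # I marks tokens inside the expression
    return tags

tokens = "I wanted this very much .".split()
print(spans_to_bio(tokens, [(1, 5, "DSE")]))
# ['O', 'B_DSE', 'I_DSE', 'I_DSE', 'I_DSE', 'O']
```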

Conditional random fields (CRFs) [10] have been quite successful on various sequence labeling problems in sentiment analysis, including opinion target extraction [15] and opinion holder recognition [11]. The state-of-the-art models for opinion expression extraction are also a CRF [2] and a variant of CRF that relaxes the Markovian assumption [21]. However, the success of CRFs depends heavily on an appropriate feature set and careful manual selection, which requires a lot of engineering effort.

In recent years, deep learning has brought remarkable advances to natural language processing (NLP) research. Deep learning models automatically learn latent features and represent them as distributed vectors, outperforming CRF-based models in several NLP tasks. For example, Yao et al. applied recurrent neural networks (RNNs) to named entity recognition and obtained state-of-the-art results on that task [22]. Building on these architectures, a new direction of neural networks has emerged: models that learn to focus “attention” on specific parts of text, simulating human attention while reading. Research on neural networks with attention mechanisms has shown promising results on sequence-to-sequence (seq2seq) tasks in NLP, including machine translation [1], caption generation [19] and text summarization [16].

Motivated by recent research on attention models for neural networks, we explore applying recurrent neural networks with attention to opinion expression extraction, which can be treated as an instance of seq2seq learning. In general, we expect the neural attention model to exploit correlations among the words in a sentence and to emphasize the parts crucial to this task, improving performance over vanilla RNNs.

The rest of this paper proceeds as follows. In Sect. 2, we present our recurrent neural network model with attention. In Sect. 3, we show the experimental results on the MPQA dataset and analyze them. In Sect. 4, we conclude and discuss future work.

2 Methodology

This section describes a novel architecture for opinion expression extraction. The architecture consists of a bidirectional recurrent neural network with long short-term memory (LSTM) as a word encoder, a decoder that outputs the predicted B-I-O tags of opinion expressions, and a neural attention layer that softly aligns the word sequence with the output sequence.

2.1 RNN with Long Short-Term Memory

An RNN [4] is a kind of neural network that processes sequences of arbitrary length by recursively applying a transition function to its hidden state vector \( h_{t} \in {\mathbb{R}}^{d} \) for each element of the input sequence. The hidden state vector at time-step \( t \) depends on the input symbol \( x_{t} \) and the hidden state vector at the previous time-step \( h_{t - 1} \):

$$ h_{t} = \begin{cases} 0 & t = 0 \\ g(h_{t - 1} ,x_{t} ) & {\text{otherwise}} \end{cases} $$
(1)
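As an illustration of this recurrence (not the authors' implementation), the following sketch unrolls Eq. (1) in NumPy, taking \( g \) to be a tanh layer with hypothetical parameters W, U and b:

```python
import numpy as np

def rnn_forward(xs, W, U, b):
    h = np.zeros(U.shape[0])              # h_0 = 0
    states = []
    for x in xs:                          # h_t = g(h_{t-1}, x_t)
        h = np.tanh(W @ x + U @ h + b)
        states.append(h)
    return states

# Example: a sequence of five 4-dimensional inputs and a 3-dimensional hidden state.
states = rnn_forward(np.random.randn(5, 4), np.random.randn(3, 4),
                     np.random.randn(3, 3), np.zeros(3))
```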

A fundamental problem in traditional RNNs is that gradients propagated over many steps tend to either vanish or explode, which makes it difficult for RNNs to learn long-range dependencies in a sequence. The long short-term memory (LSTM) network was proposed in [7] to alleviate this problem. An LSTM unit has three gates (an input gate \( i_{t} \), a forget gate \( f_{t} \) and an output gate \( o_{t} \)) and a memory cell \( c_{t} \). They are all vectors in \( {\mathbb{R}}^{d} \). The LSTM transition equations are:

$$ \begin{aligned} & i_{t} = \sigma (W_{i} x_{t} + U_{i} h_{t - 1} + V_{i} c_{t - 1} ), \\ & f_{t} = \sigma (W_{f} x_{t} + U_{f} h_{t - 1} + V_{f} c_{t - 1} ), \\ & o_{t} = \sigma (W_{o} x_{t} + U_{o} h_{t - 1} + V_{o} c_{t - 1} ), \\ & \tilde{c}_{t} = \tanh (W_{c} x_{t} + U_{c} h_{t - 1} ), \\ & c_{t} = f_{t} \odot c_{t - 1} + i_{t} \odot \tilde{c}_{t} , \\ & h_{t} = o_{t} \odot \tanh (c_{t} ) \\ \end{aligned} $$
(2)

where \( x_{t} \) is the input at the current time step, \( \sigma \) is the sigmoid function and \( \odot \) denotes elementwise multiplication. In our model, we use the output vector \( o_{t} \) of each time step as the representation of the input sequence.
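The following NumPy sketch implements one step of Eq. (2), including the peephole terms \( V_{i} \), \( V_{f} \) and \( V_{o} \) on the previous cell state; the parameter dictionary p and its key names are our own convention, not released code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    # p is a dict of parameter matrices named after Eq. (2).
    i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["Vi"] @ c_prev)   # input gate
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["Vf"] @ c_prev)   # forget gate
    o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["Vo"] @ c_prev)   # output gate
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev)                  # candidate cell
    c_t = f_t * c_prev + i_t * c_tilde                                   # new memory cell
    h_t = o_t * np.tanh(c_t)                                             # new hidden state
    return h_t, c_t
```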

2.2 Bidirectional RNNs

Observe that with the above definition, an LSTM only has information about the past when making a decision on input \( x_{t} \); it cannot exploit future context, which is important for most NLP tasks. To capture long-distance dependencies from the future as well as from the past, Graves et al. proposed bidirectional LSTMs, which allow bidirectional links in the network [6]. For the Elman-type RNN in Sect. 2.1, the bidirectional variant is:

$$ \begin{aligned} & \overrightarrow{h_{t}} = \overrightarrow{g}(\overrightarrow{h_{t - 1}} ,x_{t} ) \quad (\overrightarrow{h_{0}} = 0) \\ & \overleftarrow{h_{t}} = \overleftarrow{g}(\overleftarrow{h_{t + 1}} ,x_{t} ) \quad (\overleftarrow{h_{T + 1}} = 0) \\ & h_{t} = [\overrightarrow{h_{t}} ,\overleftarrow{h_{t}} ] \\ \end{aligned} $$
(3)

where \( \overrightarrow{g} \) and \( \overleftarrow{g} \) are the forward and backward transition functions; they use different weight matrices and bias vectors. The concatenated vector \( h_{t} = [\overrightarrow{h_{t}} ,\overleftarrow{h_{t}} ] \) combines the vectors of the same time-step from both directions. We can thus interpret \( h_{t} \) as an intermediate representation summarizing both the past and the future, which is then used to make a decision on the current input. Similarly, unidirectional LSTMs can be extended to bidirectional LSTMs by allowing bidirectional connections in the hidden layers.
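As an illustration (not the authors' implementation), a bidirectional encoder of this kind can be sketched in PyTorch, where bidirectional=True runs a forward and a backward LSTM and concatenates their hidden states at every time step; the dimensions below are placeholders:

```python
import torch
import torch.nn as nn

# The output at each time step is the concatenation [forward h_t, backward h_t],
# so its size is 2 * hidden_size.
encoder = nn.LSTM(input_size=300, hidden_size=32, bidirectional=True, batch_first=True)
sentence = torch.randn(1, 10, 300)     # (batch, tokens, embedding dim)
outputs, _ = encoder(sentence)         # outputs: (1, 10, 64)
```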

2.3 Stacked RNNs

Here, we briefly describe the underlying framework, called stacked RNNs, proposed by El Hihi and Bengio [3], on which we build a novel architecture that models attention. In the stacked RNNs framework, there are \( k \) (\( k \ge 2 \)) RNNs, RNN1, RNN2, …, RNNk, where the \( j \)th RNN receives the \( (j - 1) \)th RNN’s output as its input and feeds its output into the \( (j + 1) \)th RNN; the first RNN receives the word sequence as its input and the last RNN emits the vector representations of the labels, which are used to predict the targets. Suppose the output of the \( j \)th RNN at time-step \( t \) is \( h_{t}^{j} \); then the stacked RNNs can be formulated as:

$$ h_{t}^{j} = \begin{cases} x_{t} & j = 0 \\ g(h_{t - 1}^{j} ,h_{t}^{j - 1} ) & {\text{otherwise}} \end{cases} $$
(4)

The function \( g \) used in (4) can be replaced by any RNN transition function; in this paper, we use the bidirectional LSTM described in Sect. 2.2. Figure 2 shows a stacked RNN consisting of two LSTMs, whose input sequence is the vectors of the words in a sentence and whose output sequence is the B-I-O tags of opinion expressions. To keep the stacked RNNs easy to extend, we use stacked bidirectional LSTMs with depth of 2 as our basic model in this paper.

Fig. 2. Demonstration of stacked RNNs for opinion expression extraction: the input of the whole model is the word embeddings and the output is the predicted B-I-O tags. In this paper we use stacked bidirectional LSTMs with depth of 2 as our basic model
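A minimal sketch of such a depth-2 stacked bidirectional LSTM tagger is given below (our reconstruction; the class name, dimensions and the final linear scoring layer are assumptions rather than details taken from the paper):

```python
import torch
import torch.nn as nn

class StackedBiLSTMTagger(nn.Module):
    def __init__(self, emb_dim=300, hidden=32, n_tags=5):
        super().__init__()
        # num_layers=2: the second bi-LSTM reads the first one's outputs.
        self.rnn = nn.LSTM(emb_dim, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)   # scores for the five B-I-O tags

    def forward(self, embeddings):                 # (batch, seq_len, emb_dim)
        h, _ = self.rnn(embeddings)                # (batch, seq_len, 2 * hidden)
        return self.out(h)                         # tag scores per token

scores = StackedBiLSTMTagger()(torch.randn(2, 12, 300))   # (2, 12, 5)
```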

2.4 Stacked RNNs with Neural Attention

As noted in Sect. 1, neural networks with attention mechanisms have shown promising results on sequence-to-sequence (seq2seq) tasks in NLP, including machine translation [1], caption generation [19] and text summarization [16]. For opinion expression extraction, we propose to use neural attention to focus on the important parts of a sentence. As described in Sect. 2.3, we use stacked bidirectional LSTMs with depth of 2 as our basic model. For the attention model, the input of the second LSTM at each time step \( t \) is a weighted sum of the first LSTM’s output vectors. The input vector of the second LSTM at time \( t \), \( i_{t}^{2} \), is given by

$$ i_{t}^{2} = \mathop \sum \limits_{s = 1}^{T} \alpha_{ts} h_{s}^{1} $$
(5)

In Eq. (5), \( h_{s}^{1} \) is the output vector of the first LSTM at time step \( s \), and \( \alpha_{ts} \) is the weight that maps the output sequence of the first LSTM, \( [h_{1}^{1} ,h_{2}^{1} , \ldots ,h_{T}^{1} ] \), to the input vector of the second LSTM. \( \alpha_{ts} \) can also be regarded as a value indicating how much influence the \( s \)th word has on the decision for the \( t \)th label. The weight \( \alpha_{ts} \) is obtained by

$$ \begin{aligned} & e_{ts} = \tanh (W^{1} h_{s}^{1} + W^{2} h_{t - 1}^{2} + b) \\ & \alpha_{ts} = \frac{{\exp (e_{ts}^{T} e)}}{{\mathop \sum \limits_{k = 1}^{T} \exp (e_{tk}^{T} e)}} \\ \end{aligned} $$
(6)

In Eq. (6), \( W^{1} \) and \( W^{2} \) are parameter matrices that are tuned in the training phase, and \( b \) is a bias vector. The vector \( e \) has the same length as \( e_{ts} \) and is trained jointly with all other parameters. The first line of this equation can be treated as a fully-connected layer whose inputs are the output vectors of the two LSTMs, each multiplied by its own parameter matrix. The second line is a softmax that normalizes the scores \( e_{ts}^{T} e \) into attention weights. The whole model is illustrated in Fig. 3.

Fig. 3. Stacked RNNs with neural attention. For the sake of simplicity, the attention layer in this figure is represented as an abstract component.
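A sketch of the attention layer in Eqs. (5) and (6) is shown below for a single sentence, keeping the indexing close to the notation; the module and variable names are ours. In the full model the second LSTM is unrolled one step at a time, computing \( i_{t}^{2} \) from the previous decoder state \( h_{t - 1}^{2} \) and feeding it in as the input at step \( t \):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    def __init__(self, enc_dim, dec_dim, att_dim):
        super().__init__()
        self.W1 = nn.Linear(enc_dim, att_dim, bias=False)   # W^1 h_s^1
        self.W2 = nn.Linear(dec_dim, att_dim)                # W^2 h_{t-1}^2 + b
        self.e = nn.Parameter(torch.randn(att_dim))          # the shared vector e

    def forward(self, h1, h2_prev):
        # h1: (T, enc_dim), outputs of the first bi-LSTM; h2_prev: (dec_dim,)
        e_ts = torch.tanh(self.W1(h1) + self.W2(h2_prev))    # (T, att_dim), Eq. (6) line 1
        alpha = F.softmax(e_ts @ self.e, dim=0)              # (T,) attention weights, Eq. (6) line 2
        return alpha @ h1                                    # i_t^2 = sum_s alpha_ts h_s^1, Eq. (5)

# One attention step with placeholder dimensions and random states.
att = Attention(enc_dim=64, dec_dim=32, att_dim=32)
i_t = att(torch.randn(10, 64), torch.randn(32))              # (64,)
```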

3 Experiments

In this section, we investigate the empirical performance of our proposed model on opinion expression extraction and compare it with state-of-the-art models for this task. We use the MPQA 1.2 corpus [18], which contains 535 news documents (11,111 sentences) annotated at the phrase level with both DSE and ESE labels. As in previous work, we use 135 documents as a development set and perform 10-fold cross validation on the remaining 400 documents. Summary statistics of MPQA 1.2 are listed in Table 1.

Table 1. Summary statistics of the MPQA 1.2 datasets.

3.1 Evaluation Metrics

We use precision, recall and F1-measure to evaluate the performance of the model. Since the boundaries of opinion expressions are hard to define even for human annotators [18], we use Binary Overlap and Proportional Overlap as two soft measures. Breck et al. first introduced the Binary Overlap measure for opinion expression extraction; it counts every overlapping match between a predicted and a true expression as correct [2]. Proportional Overlap is a stricter measure that computes the proportion of overlapping spans [9].
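The sketch below reflects our reading of these two measures (it is not an official evaluation script); spans are (start, end) token offsets with end exclusive, and recall is obtained by swapping the roles of predicted and gold spans:

```python
def overlap_len(a, b):
    # Number of tokens shared by two (start, end) spans, end exclusive.
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def binary_precision(pred_spans, gold_spans):
    # Binary Overlap: a predicted span counts as correct if it overlaps any gold span.
    hits = sum(1 for p in pred_spans if any(overlap_len(p, g) > 0 for g in gold_spans))
    return hits / len(pred_spans) if pred_spans else 0.0

def proportional_precision(pred_spans, gold_spans):
    # Proportional Overlap: each predicted span is credited with the fraction of
    # its tokens that fall inside a gold span.
    credit = sum(max((overlap_len(p, g) for g in gold_spans), default=0) / (p[1] - p[0])
                 for p in pred_spans)
    return credit / len(pred_spans) if pred_spans else 0.0

print(binary_precision([(2, 5)], [(4, 8)]))        # 1.0: the spans overlap
print(proportional_precision([(2, 5)], [(4, 8)]))  # 0.33: one of three tokens overlaps
```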

3.2 Model Training and Hyper-parameters

The model can be trained end-to-end by back-propagation, with the cross-entropy loss as the objective function. Training is done through gradient descent with the Adadelta update rule. In all experiments, the word embeddings are initialized with the publicly available word2vec vectors trained on 100 billion words of Google News [12]. The other parameters are set as follows: the number of hidden units of both LSTMs is 32, the dropout rate is 0.5 and the mini-batch size is 128. These hyper-parameters are chosen via a grid search on the development set.
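A sketch of this training configuration is shown below; the per-token linear classifier is only a stand-in so the snippet runs on its own, taking the place of the attention model of Sect. 2.4:

```python
import torch
import torch.nn as nn

# Cross-entropy loss over the five B-I-O tags, Adadelta updates, dropout 0.5,
# mini-batches of 128; shapes and the stand-in model are our assumptions.
model = nn.Sequential(nn.Dropout(0.5), nn.Linear(300, 5))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adadelta(model.parameters())

embeddings = torch.randn(128, 20, 300)      # one mini-batch of 20-token sentences
tags = torch.randint(0, 5, (128, 20))       # gold B-I-O tag ids

optimizer.zero_grad()
loss = criterion(model(embeddings).reshape(-1, 5), tags.reshape(-1))
loss.backward()
optimizer.step()
```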

3.3 Baselines

To illustrate the performance boost of our proposed attention model, we compare it with several baseline methods. Since we use bidirectional LSTMs as components of our model, we implement a bidirectional LSTM as a baseline. We also compare our model with a stacked bidirectional LSTM of depth 2.

  • Bi-LSTM: bidirectional LSTM for sequence labelling [5].

  • Bi-LSTM (stacked): a stacked model of two bidirectional LSTMs [8].

    We also compare our model with the following state-of-the-art models:

  • CRF: Features used in CRF are words, part-of-speech tags and membership in a manually constructed opinion lexicon (within a [−1,+1] context window) [2].

  • Semi-CRF: a variant of the traditional CRF that relaxes the Markovian assumption and focuses on phrase-level rather than token-level features. Semi-CRF also uses parse trees to generate the candidate segments of sentences [21].

3.4 Results and Analysis

Since our model is based on RNNs, we first conduct experiments to confirm that it outperforms the vanilla bidirectional LSTM and the stacked LSTM. The experimental results are shown in Table 2. The vanilla bidirectional LSTM performs worst among all the models, since it cannot extract high-level features for this task. The two-layer LSTM uses a deeper architecture “in space” to give the LSTM additional power to tackle complex problems, and it obtains higher F1 scores than the vanilla LSTM. Our model, which adds the attention layer to the stacked LSTM, gives the best performance among the three. In F1 score, our model outperforms the stacked LSTM with maximum absolute gains of 2.80% for DSE and 3.39% for ESE. All differences are statistically significant at the 0.05 level. These results demonstrate that the neural attention model can emphasize the parts crucial to a specific task and improve the performance of RNNs on sequence labeling problems.

Table 2. Experimental evaluation of our proposed model and baseline methods

Table 3 compares our model to the previous best results in the literature. In terms of F1, our model performs best for both DSE and ESE detection. Semi-CRF, with its high recall, performs comparably to our model on the F1 measure. Note that our model does not use any hand-crafted features other than word embeddings pre-trained by word2vec. In general, the CRF model achieves high precision but low recall on both DSE and ESE detection (it obtains the best precision for the binary and proportional measures but the worst recall), while Semi-CRF exhibits high recall but low precision, since it uses a more relaxed Markovian assumption. Compared with Semi-CRF, our model produces even higher recall and comparable precision. Our model obtains higher F1 scores than Semi-CRF: 71.17 vs. 71.15 (binary overlap) and 65.10 vs. 64.27 (proportional overlap) for DSEs; 66.48 vs. 66.37 (binary overlap) and 57.57 vs. 50.95 (proportional overlap) for ESEs.

Table 3. Results of our proposed model against CRF-based models.

3.5 Case Study

To validate that our model is able to select salient parts of a text sequence, we visualize the attention weights in Fig. 4 for an example sentence from the MPQA dataset on which our model predicted all labels correctly. The example sentence and its corresponding labels are:

Fig. 4. Visualization of attention signals in sample sentences in the MPQA dataset (Color figure online)

Nevertheless/O he/O wanted/B_DSE to/I_DSE clarify/I_DSE some/O of/O Powell/O ‘s/O statement/O

This sentence contains a DSE, “wanted to clarify”, which is a verb phrase. To understand the attitudes and feelings this phrase conveys, we have to consider its corresponding object, “Powell’s statement”. We expect our attention model to recognize this correlation and emphasize it when extracting the correct opinion expressions.

In Fig. 4, deeper colors mean higher attention and paler colors indicate lower attention. First of all, we observe that for each label, the highest attention value is always associated with its corresponding word in the sentence. This is consistent with our expectation, since each word has the biggest influence on its own label. We also find that, apart from the words of “wanted to clarify” themselves, the phrase “Powell’s statement” receives the highest attention for the labels of this DSE. This means our model can emphasize words related to an opinion expression beyond the expression’s own tokens. This example shows that introducing the attention mechanism gives RNNs additional power to tackle more complicated sequence labeling problems that involve semantic understanding.

4 Conclusion

In this paper, we improve traditional recurrent neural networks (RNNs) by introducing an attention mechanism for the opinion expression extraction task. The new model can emphasize the most important parts of a text and evaluate the correlation of each word in a sentence with the expression labels (DSE and ESE). Experimental results show that the attention layer gives RNNs additional power to handle more complicated sequence labeling problems such as opinion expression extraction. Since our model produces higher recall on both DSE and ESE, it outperforms traditional CRF-based methods on the MPQA dataset.

In the future, we would like to apply our model to other sequence labeling tasks in sentiment analysis, including opinion holder extraction and aspect-based sentiment analysis.