
1 Introduction

With the evolution of the Internet, opinion mining (OM) has become one of the most vigorous research areas in the NLP field. An aspect is a concept about which an opinion is expressed in a given text [1]. The aspect-specific OM task can be divided into four main subtasks: aspect extraction, opinion extraction, sentiment analysis, and opinion summarization [2]. This paper focuses on the second subtask: opinion extraction. We propose a hierarchical model based on a stacked Bi-LSTM that uses both semantic and syntactic information as input to extract aspect-specific opinions.

OM on Internet reviews can be carried out at three levels: document-level OM [3], sentence-level OM, and aspect-level OM. Aspect-level OM extracts both the aspects and the corresponding opinion expressions in sentences [4]. Extracting the opinion expression for its corresponding aspect is a core task in aspect-level OM. In recent years, neural networks have achieved remarkable results in NLP. Pang et al. [5] conducted a survey of the deep models used to handle text sequence problems. Socher et al. [6] proposed the recursive neural tensor network and represented phrases by distributed vectors. RNNs [7] and their variants such as LSTM [8] and GRU [9] stand out among the various deep learning methods. Huang et al. [10] proposed a bidirectional LSTM-CRF model for sequence labeling, and on this basis, Ma et al. [11] added CNNs to the model to encode character-level information of a word into its character-level representation. Du et al. [12] proposed an attention-based RNN model containing two bidirectional LSTM layers to label sequences and thus extract opinion phrases. Nevertheless, the performance of neural networks drops rapidly when the models depend solely on neural embeddings as input [11].

2 SBLSTM Model

2.1 SBLSTM Model Structure

We model aspect-specific opinion extraction as a sequence labeling task. The input of the model includes word embedding vectors, POS tags, and dependency relations; the output is the label sequence corresponding to the input text sequence. We place a stacked Bi-LSTM between the input layer and the output layer. Opinion expression extraction has often been treated as a sequence labeling task, and this kind of method usually uses the conventional B-I-O tagging scheme, illustrated below.
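As a concrete illustration of the B-I-O scheme, the short Python snippet below tags a hypothetical English review sentence; the sentence, the target aspect, and the marked opinion expression are invented for illustration only.

```python
# Hypothetical B-I-O tagging example for aspect-specific opinion extraction.
# Target aspect: "acting"; opinion expression: "surprisingly natural".
sentence = ["The", "acting", "was", "surprisingly", "natural", "."]
tags     = ["O",   "O",      "O",   "B",            "I",       "O"]

# B = beginning of an opinion expression, I = inside one, O = outside.
for token, tag in zip(sentence, tags):
    print("{:<14s}{}".format(token, tag))
```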

The basic idea of the bidirectional LSTM is to present each sequence forwards and backwards to two separate hidden states that capture past and future information, respectively. The two hidden states are then concatenated to form the final output. The hidden state update of one bidirectional unit at time step t is as follows:

$$ \overrightarrow{h_t} = \overrightarrow{g}\left(\overrightarrow{h_{t-1}}, x_t\right), \qquad \overrightarrow{h_0} = 0 $$
(1)
$$ \overleftarrow{h_t} = \overleftarrow{g}\left(\overleftarrow{h_{t+1}}, x_t\right), \qquad \overleftarrow{h_T} = 0 $$
(2)

\( h_t = \left[\overrightarrow{h_t}, \overleftarrow{h_t}\right] \) can be regarded as an intermediate representation containing the information from both directions, used to predict the label of the current input \( x_t \).

A stacked RNN is built by stacking k (k ≥ 2) RNN layers. The first RNN receives the word embedding sequence as its input, and the last RNN forms the abstract vector representation of the input sequence that is used to predict the final labels. Let \( h_t^j \) be the output of the j-th RNN at time step t; the stacked RNN can then be formulated as follows:

$$ h_t^j = \begin{cases} g\left(h_{t-1}^{j}, x_t\right) & j = 0 \\ g\left(h_{t-1}^{j}, h_t^{j-1}\right) & \text{otherwise} \end{cases} $$
(3)

The function g in (3) can be replaced by any RNN transition function. Since we expect the model to focus on the important opinion elements, we choose a 2-layer stacked Bi-LSTM network as the basic model and add an attention mechanism to it. In this attention model, the second Bi-LSTM's input \( i_t^2 \) at time t can be expressed as:

$$ i_t^2 = \sum\nolimits_{s=1}^{T} \alpha_{ts}\, h_s^1 $$
(4)

where \( h_s^1 \) is the output vector of the first Bi-LSTM at time s and \( \alpha_{ts} \) is the attention weight over the output vector sequence \( [h_1^1, h_2^1, h_3^1, \ldots, h_T^1] \); their weighted sum is the input of the second Bi-LSTM at time t. The weight \( \alpha_{ts} \) is calculated as follows:

$$ e_{ts} = \tanh\left(W^1 h_s^1 + W^2 h_{t-1}^2 + b\right) $$
(5)
$$ \alpha_{ts} = \frac{\exp\left(e_{ts}^{T} e\right)}{\sum\nolimits_{k=1}^{T} \exp\left(e_{tk}^{T} e\right)} $$
(6)

where \( W^1 \) and \( W^2 \) are parameter matrices updated during model training and b is the bias vector. The vector e has the same dimension as \( e_{ts} \) and is also updated together with the above parameters during training.
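To make Eqs. (4)-(6) concrete, the following is a minimal NumPy sketch of the attention step that forms the second Bi-LSTM's input from the first layer's outputs; the dimensions and random parameters are illustrative assumptions, not the configuration used in our experiments.

```python
import numpy as np

def attention_input(H1, h2_prev, W1, W2, b, e):
    """Compute the second Bi-LSTM's input i_t^2 following Eqs. (4)-(6).

    H1      : (T, d1)  outputs of the first Bi-LSTM, [h_1^1, ..., h_T^1]
    h2_prev : (d2,)    previous hidden state of the second Bi-LSTM, h_{t-1}^2
    W1, W2, b, e       trainable parameters (W1: (d_a, d1), W2: (d_a, d2),
                       b and e: (d_a,)); e has the same dimension as e_ts
    """
    # Eq. (5): e_ts = tanh(W1 h_s^1 + W2 h_{t-1}^2 + b), computed for every s
    E = np.tanh(np.dot(H1, W1.T) + np.dot(W2, h2_prev) + b)   # (T, d_a)
    # Eq. (6): softmax over the scores e_ts^T e
    scores = np.dot(E, e)                                      # (T,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                       # weights a_ts
    # Eq. (4): attention-weighted sum of the first layer's outputs
    return np.dot(alpha, H1)                                   # (d1,)

# Toy usage with random values (shapes only, no trained parameters)
T, d1, d2, d_a = 5, 8, 8, 6
rng = np.random.RandomState(0)
i_t2 = attention_input(rng.randn(T, d1), rng.randn(d2),
                       rng.randn(d_a, d1), rng.randn(d_a, d2),
                       rng.randn(d_a), rng.randn(d_a))
print(i_t2.shape)  # (8,)
```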

Figure 1 shows a stacked Bi-LSTM model consisting of two Bi-LSTMs with an attention layer. The input is the sequence of distributed word vectors of the text, while the output is the series of B-I-O tags predicted by the network. To keep the stacked RNN easy to extend, we use a stacked bidirectional LSTM with a depth of 2 as our basic model in this paper.

Fig. 1. A stacked bidirectional LSTM network
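For reference, the following is a minimal Keras sketch of the basic 2-layer stacked Bi-LSTM tagger (the attention connection between the two layers, which needs a custom layer, is omitted); the vocabulary size, tag set, and hidden sizes are placeholders rather than our exact configuration.

```python
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

MAX_LEN, VOCAB_SIZE, EMB_DIM, N_TAGS = 60, 20000, 100, 3   # placeholder sizes

model = Sequential()
# Map token indices to distributed word vectors
model.add(Embedding(VOCAB_SIZE, EMB_DIM, input_length=MAX_LEN, mask_zero=True))
# First Bi-LSTM: returns the full output sequence [h_1^1, ..., h_T^1]
model.add(Bidirectional(LSTM(64, return_sequences=True)))
# Second Bi-LSTM stacked on top of the first layer's outputs
model.add(Bidirectional(LSTM(64, return_sequences=True)))
# Per-token softmax over the B-I-O tag set
model.add(TimeDistributed(Dense(N_TAGS, activation='softmax')))
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```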

2.2 Features in SBLSTM Model

The features used in the SBLSTM model are as follows:

  • Word embeddings. A word embedding is a distributed vector that encodes semantic information.

  • POS tags. We use the Stanford Tagger to obtain the POS tags.

  • Syntactic tree. Here we specifically use the dependency tree as the syntactic information in the model. Figure 2 displays the dependency tree of a movie review.

    Fig. 2. Dependency tree of an example context

The syntactic representation of one word is defined by its m (m ≥ 0) children in the dependency tree, where m is a window size that limits the number of dependency relations per word fed to the learning models. Introducing the window size prevents excessive GPU memory usage.

Finally, the three types of features are concatenated into the input vector and fed to the SBLSTM model. Figure 3 shows the final feature composition of one word, and a small sketch of this composition follows the figure.

Fig. 3. Input composition of one word
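As a rough sketch of this composition, the snippet below builds the input vector for a single word by concatenating its word embedding, a one-hot POS vector, and the embeddings of at most m dependency children, padding with zero vectors when a word has fewer children; the names and dimensions are illustrative assumptions, not our exact setup.

```python
import numpy as np

EMB_DIM, N_POS, WINDOW = 100, 4, 4   # illustrative sizes; WINDOW plays the role of m

def word_input_vector(word_emb, pos_id, child_embs):
    """Concatenate word embedding + one-hot POS tag + up to WINDOW children."""
    pos_onehot = np.zeros(N_POS)
    pos_onehot[pos_id] = 1.0
    # Keep at most WINDOW dependency children; pad the rest with zero vectors
    children = list(child_embs)[:WINDOW]
    children += [np.zeros(EMB_DIM)] * (WINDOW - len(children))
    return np.concatenate([word_emb, pos_onehot] + children)

# Toy usage: a word with two dependency children
rng = np.random.RandomState(1)
vec = word_input_vector(rng.randn(EMB_DIM), pos_id=2,
                        child_embs=[rng.randn(EMB_DIM), rng.randn(EMB_DIM)])
print(vec.shape)   # (100 + 4 + 4 * 100,) = (504,)
```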

3 Experiments Design and Analysis

3.1 Datasets

So far, there are no publicly available benchmark datasets that mark the phrase boundaries of aspect-specific opinion expressions. Therefore, two manually constructed datasets are used in our experiments. Mukherjee constructed an annotated English corpus of 1-, 2-, and 3-star product reviews from Amazon. The other dataset consists of online reviews of three Chinese movies, Mr. Six, The Witness, and Chongqing Hotpot, collected from Douban, Mtime, and Sina Weibo. The movie review dataset was manually annotated.

The statistics of these two datasets are shown in Table 1. Figures 4 and 5 display the sentence length distributions of the two datasets.

Table 1. Statistics of the datasets
Fig. 4. Sentence length distribution of the movie dataset

Fig. 5. Sentence length distribution of the product dataset

3.2 Experimental Setting

In the experiments, we use the Stanford parser to obtain the syntactic information. The SBLSTM model is implemented in Python 2.7, and we use the Keras framework to construct the deep neural networks. The input length of the LSTM models is limited to 60, so the number of LSTM input units is 60, while the number of output units is 64. The word embedding dimension is set to 100. The window size for extracting dependency features is initially set to 4. During training, the ratio between the training set and the validation set is 4:1. The output activation function is the softmax function and the batch size used to train the model is 256.
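A minimal sketch of the corresponding data preparation and training call is shown below, assuming the stacked Bi-LSTM `model` from the sketch in Sect. 2.1; the toy token and tag sequences are placeholders for the encoded review corpora.

```python
import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

MAX_LEN, N_TAGS, BATCH_SIZE, EPOCHS = 60, 3, 256, 30

# Placeholder data: in practice these come from the annotated review corpora
token_ids = [[12, 5, 87, 3], [44, 9, 2, 61, 7]]        # word-index sequences
tag_ids   = [[0, 0, 1, 2],   [0, 1, 2, 0, 0]]          # B/I/O label indices

# Pad (or truncate) every sequence to the fixed input length of 60
X = pad_sequences(token_ids, maxlen=MAX_LEN, padding='post')
y = to_categorical(pad_sequences(tag_ids, maxlen=MAX_LEN, padding='post'),
                   num_classes=N_TAGS)                  # one-hot tag targets

# 4:1 train/validation split, batch size 256, 30 epochs
# (`model` is the stacked Bi-LSTM sketched in Sect. 2.1)
# model.fit(X, y, batch_size=BATCH_SIZE, epochs=EPOCHS, validation_split=0.2)
```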

3.3 Quantitative Analysis

Evaluation Metrics.

Precision, recall, and F1 score are commonly used to evaluate the performance of OM models. In the OM task, the boundaries of opinion expressions are hard to define exactly, so we use proportional overlap as a soft measure to evaluate performance, as sketched below.
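Since the exact formulation is not spelled out above, the sketch below shows one common way to compute proportional-overlap (soft) precision, recall, and F1 over token spans; treat it as an illustrative assumption rather than the definitive metric implementation.

```python
def _coverage(span, others):
    """Fraction of `span`'s tokens that fall inside any span in `others`."""
    start, end = span                      # token indices, end exclusive
    covered = sum(1 for t in range(start, end)
                  if any(s <= t < e for s, e in others))
    return covered / float(end - start)

def proportional_overlap(pred_spans, gold_spans):
    """Soft precision / recall / F1 over opinion-expression spans."""
    precision = (sum(_coverage(p, gold_spans) for p in pred_spans)
                 / len(pred_spans)) if pred_spans else 0.0
    recall = (sum(_coverage(g, pred_spans) for g in gold_spans)
              / len(gold_spans)) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: the prediction covers 2 of the 3 gold tokens
print(proportional_overlap(pred_spans=[(3, 5)], gold_spans=[(3, 6)]))
# (1.0, 0.666..., 0.8)
```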

Model Comparative Analysis.

To illustrate the performance gain of our SBLSTM model, we first compare it with several baseline methods on both datasets. Since we use a stacked bidirectional LSTM with a depth of 2 as the core model, we choose the LSTM network and the bidirectional LSTM as baselines. Furthermore, we also compare our model with a CRF model and a rule-based method that relies on the dependency tree.

As shown in Table 2, we report the accuracy, precision, recall, and F1-score across all single runs for each approach. The proposed SBLSTM model outperforms the baseline methods in terms of accuracy, recall, and F1. Bi-LSTM achieves the best precision on the movie dataset, and compared with the CRF model, which achieves the highest precision on the product dataset, our proposed model still provides comparable precision. Another observation is that, for both datasets, Bi-LSTM outperforms the plain LSTM model with absolute gains of 4.73% and 4.87% in F1 score.

Table 2. Results of our proposed model against baseline methods

Feature Comparative Analysis.

In the training process, the batch size is set to 256 and the number of epochs is set to 30. Table 3 shows the comparison of experimental results using different feature sets.

Table 3. Comparison of model performance using different features

Our proposed method, which uses all three types of features, performs best in terms of accuracy, recall, and F1 score. Comparing the third and fourth lines of Table 3, adding word embeddings to the feature set improves the performance of the model in a similar way. This indicates that both word embeddings and POS tags help in extracting the aspect-specific opinion expressions. In particular, the recall and the F1-score improve by 20% and 10%, respectively, when the dependency relations are added to the features, providing evidence that syntactic information does play an important role in extracting the opinions.

Window Size Analysis.

We conduct a series of experiments with different window sizes on the movie dataset to analyze the impact of the number of dependency-tree children on model performance. The batch size in training is set to 256 and each trial is run for 300 epochs. Table 4 compares the predictive performance of the proposed stacked Bi-LSTM models using both the semantic and the syntactic features.

Table 4. Performance of different window sizes

From Table 4, we find that the F1-score generally increases with the window size and tends to be stable when the window size is greater than 4.

3.4 Qualitative Analysis

To explore the contribution of this paper, we conducted a qualitative analysis on five Chinese movie comments, with Feng Xiaogang as the target aspect.

The experiment uses the rule-based model, the CRF model, the LSTM network, and the Bi-LSTM network as baseline methods. The aspect-specific opinion extraction results of the different methods are shown in Table 5. Green words are the annotated opinion expressions that we want the models to extract, red words are annotated words that the model failed to extract, and blue words are extracted words that are not in the annotation.

Table 5. Aspect specific opinion extraction results of different methods

The dependency-rule-based method is more effective when the sentence is short and simple. For a complex sentence, however, it cannot recover further information when the comment contains a demonstrative pronoun. Most importantly, regardless of sentence length, our model extracts the opinion information well.

4 Conclusions

In this paper, we proposed a method to embed syntactic information into deep neural models. Experimental results on datasets from two domains and in different languages showed that the proposed stacked bidirectional LSTM model outperforms all the baseline methods, demonstrating that syntactic information plays a significant role in correctly locating the aspect-specific opinion expressions.