
1 Introduction

Aspect-level sentiment analysis is a fine-grained sentiment analysis task that aims to predict the sentiment polarity (i.e., positive, neutral, or negative) of a specific target in a given sentence. It can be applied in many fields, such as product review analysis, public opinion analysis, and stock opinion analysis.

A core challenge of aspect-level fine-grained sentiment analysis is to correctly identify the sentiment polarity of a given aspect in a sentence that contains more than one aspect with different polarities. For example, in the sentence "Service was slow, but the people were friendly.", "service" and "people" are two targets, each related to a different opinion word, "slow" and "friendly" respectively. Thus the polarity for the target "service" is negative, while the polarity for the target "people" is positive. Therefore, finding the relationship between the target and its corresponding opinion words is important for obtaining the final sentiment of a target in the sentence.

In previous studies, various solutions have been proposed to capture the context information of a given target. One solution is to use the position of opinion words in the sentence to obtain a more precise relationship between opinion words and the target. For example, Zeng et al. [1] introduced the position information of words to help capture the relationship between opinion words and the target. Another solution is to model the context information (not limited to the opinion words) related to the target in the sentence. For example, Tang et al. [2] used two LSTM networks to model the left and right context information related to the target, respectively. Wang et al. [3] combined each word's hidden state with the aspect embedding as supplementary context information for the target. These methods achieve good performance on aspect-level sentiment analysis based on the context information related to the target words in the sentence or on word location information.

However, we find that the context information captured by the above models covers all the words across the sentence (e.g., the left context and the right context). We argue that opinion words are more important in determining the polarity of the sentence for the given target; that is, the relationship between the target and the opinion words deserves to be modeled independently.

To this end, we propose a position-aware hybrid attention network which consists of two components, namely an opinion attention network and a context attention network. The context attention network is used to capture the context information between the words across the sentence and the target, and the opinion attention network is used to incorporate the independent relationship between opinion words and the target. The proposed model shows stable improvements on the laptop and restaurant datasets. The main contributions of our work are as follows:

(1) We propose a hybrid attention network to capture the context information between the words across the sentence and the target, as well as the independent relationship between the opinion words and the target, so as to obtain more precise sentiment information for the given target in the sentence.

(2) We conduct several experiments and ablation tests on the public laptop and restaurant datasets to validate our model. We show that our model achieves stable and effective performance compared with the baseline models.

2 Related Work

Aspect-level sentiment analysis aims to detect the polarity of a given target in a sentence. Many previous studies rely on rich features, such as sentiment lexicons, linguistic features, and syntax, to help detect the sentiment. Kiritchenko et al. [4] built two sentiment lexicons for the restaurant and laptop domains and achieved good results in detecting aspects and sentiment by using these lexicons. Wagner et al. [5] combined four sentiment lexicons to design rule-based features and extracted bag-of-n-gram features to train a classifier for aspect-level sentiment analysis. Vo et al. [6] split a tweet into a left context and a right context according to a given target, using distributed word representations and neural pooling functions to extract features.

In recent years, different neural network based models have been proposed and achieve good results on the aspect-level sentiment analysis task due to their strong capacity to automatically extract high-level features of sentences [7,8,9,10,11,12,13,14,15,16,17,18,19,20].

As the context information of a given target is useful for improving the performance of the aspect-level sentiment analysis task, some previous studies focused on modeling context information related to the target. For example, Tang et al. [2] used two LSTM networks to model the left and right context information related to the target words, respectively. The left and right target-dependent representations are concatenated as the final representation of the sentence to predict the sentiment polarity of the aspect. Wang et al. [3] combined the hidden states of each word with the aspect embedding as supplementary context information to supervise the generation of attention vectors, and used the attention vectors to generate the final representation of the sentence. Tang et al. [7] captured the correlation between each context word and the aspect through multiple attentions and used the output of the last attention as the final representation of the sentence. Different from the above models, Ma et al. [8] used two independent LSTM networks to model aspects and contexts respectively, and used the attentive representation of the aspect for the context, with the concatenation of the hidden states of the two LSTM networks as the final representation of the sentence. Chen et al. [9] proposed a multi-layer architecture in which each layer includes attention-based word feature aggregation and a GRU unit to learn the sentence representation.

Some recent studies paid more attention to word location information and achieved good results [1, 21, 22]. Zeng et al. [1] used a Gaussian kernel to model the position of words. By introducing the position information into the model, their method improved the results of the aspect-level sentiment analysis task. Wang et al. [21] introduced global attention scores and grammar-based local attention scores for the task, and a gating mechanism was used to synthesize the global and local information to generate the final representation of the sentence.

In this paper, we also focus on capturing the context information related to the given target of the sentence. We propose a hybrid attention network based model to incorporate the independent relationship between opinion words and the target, as well as the context information between the words across the sentence and the target.

3 The Proposed Model

In this paper, a position-aware hybrid attention network is proposed for aspect-level sentiment analysis. As shown in Fig. 1, our model mainly includes four parts: an embedding layer, an encoder layer, a hybrid attention layer, and an output layer, where the hybrid attention layer is divided into an opinion attention module and a context attention module.

Fig. 1. The whole architecture of our model.

For the context attention module, following previous work, we use the aspect representation to help calculate the attention of each word across the sentence with respect to the target. For the opinion attention module, we use the aspect representation to help calculate the attention scores of the candidate opinion words, and generate the opinion feature representation with different weights. We then feed the context representation obtained from the whole sentence and the opinion relationship representation obtained from only the independent opinion words into a fully connected layer to get the final representation of the sentence. In our model, similar to previous work, we also concatenate a position embedding with each word embedding to better capture the position information of the words relative to the target.

In the following sections, we will describe our model in more detail. Section 3.1 gives the problem definition, Sect. 3.2 introduces word position embedding, Sect. 3.3 introduces the encoding layer for sentence, target and opinion words, Sect. 3.4 introduces the hybrid attention networks and Sect. 3.5 describes the loss function of our model.

3.1 Problem Definition

Given a sentence of n words \( S = \left\{ {w_{1} ,w_{2} ,w_{3} , \ldots ,w_{n} } \right\} \) and a target of k words \( A = \left\{ {w_{1}^{a} ,w_{2}^{a} ,w_{3}^{a} , \ldots ,w_{k}^{a} } \right\} \), where \( A \) is a subset of \( S \), the purpose of aspect-level sentiment analysis is to find the sentiment of the given target A in the context sentence S.

As stated above, we argue that the opinion words are also important in determining the polarity of the sentence for the given target. In our model, we use the sentiment lexicon of Bing Liu [23] to extract the opinion words of the sentence. Given the sentence S, we can also obtain the opinion words \( O = \left\{ {w_{1}^{o} ,w_{2}^{o} ,w_{3}^{o} , \ldots ,w_{m}^{o} } \right\} \), where \( O \) is a subset of \( S \). The task is then defined as finding the sentiment of the given target A in a context sentence S with the extracted opinion words \( O \).
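To make the definition concrete, the example sentence from Sect. 1 yields the following toy instance (the variable names are only illustrative and not part of the model):

```python
# Toy instance of the task definition, using the example from Sect. 1
S = ["Service", "was", "slow", ",", "but", "the", "people", "were", "friendly", "."]
A = ["Service"]                    # the given target
O = ["slow", "friendly"]           # opinion words matched against the sentiment lexicon
# expected prediction for the target "Service": negative
```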

3.2 Word Position Embedding

Let \( E \in {\mathbb{R}}^{{d_{e} \times \left| V \right|}} \) be the pre-trained word embedding matrix generated by an unsupervised method [24, 25], and \( P \in {\mathbb{R}}^{{d_{p} \times \left| N \right|}} \) be the position embedding matrix, where \( d_{e} \) is the dimension of the word embedding, \( \left| V \right| \) is the vocabulary size, \( d_{p} \) is the dimension of the position embedding, and \( \left| N \right| \) is the number of possible relative positions between each word and the aspect.

We define the relative distance between each word and the target as the relative offset of the word in the sentence with respect to the target. We calculate the distance using formula (1), where i is the index of each word in the sentence, j is the index of the first word of the target, k is the length of the target, and n is the length of the whole sentence.

$$ \left\{ {\begin{array}{*{20}l} {i - j} \hfill & {\quad i < j} \hfill \\ {i - j - k} \hfill & {\quad j + k < i \le n} \hfill \\ {0} \hfill & {\quad j \le i \le j + k} \hfill \\ \end{array} } \right. $$
(1)
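For illustration, the offset computation in formula (1) can be written as a small Python function; the function name and argument conventions are ours rather than the paper's:

```python
def relative_distance(i, j, k):
    """Relative offset of the i-th word to a target starting at index j with length k (formula (1))."""
    if i < j:
        return i - j          # word lies to the left of the target
    elif i > j + k:
        return i - j - k      # word lies to the right of the target
    else:
        return 0              # word lies inside the target span
```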

We look up each word of the sentence S, the opinion words O, and the target A in the pre-trained word embedding matrix and map them into \( d_{e} \)-dimensional vectors. We also look up each word of the sentence S and the opinion words O in the position embedding matrix and map them into \( d_{p} \)-dimensional vectors. Finally, the word embedding and the position embedding are concatenated. For the target A, there is no position embedding, so no concatenation is needed:

$$ x_{i} = \left[ {E\left( {w_{i} } \right);P\left( {w_{i} } \right)} \right] $$
(2)
$$ x_{i}^{o} = \left[ {E\left( {w_{i}^{o} } \right);P\left( {w_{i}^{o} } \right)} \right] $$
(3)
$$ x_{i}^{a} = E\left( {w_{i}^{a} } \right) $$
(4)

where \( w_{i} ,w_{i}^{o} ,w_{i}^{a} \) denote words of the sentence S, the opinion words O, and the aspect A respectively, E(w) denotes a lookup in the word embedding matrix, P(w) denotes a lookup in the position embedding matrix, and [;] denotes vector concatenation.
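A minimal PyTorch-style sketch of Eqs. (2)–(4) might look as follows; the class name, default dimensions, and index handling are our assumptions rather than details given in the paper:

```python
import torch
import torch.nn as nn

class WordPositionEmbedding(nn.Module):
    """Sketch of Eqs. (2)-(4): word embedding, optionally concatenated with a position embedding."""
    def __init__(self, vocab_size, num_positions, d_e=300, d_p=100):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_e)     # E in the paper
        self.pos_emb = nn.Embedding(num_positions, d_p)   # P in the paper

    def forward(self, word_ids, pos_ids=None):
        # pos_ids are relative offsets from formula (1), shifted to non-negative indices beforehand
        x = self.word_emb(word_ids)
        if pos_ids is None:                               # target words carry no position embedding (Eq. (4))
            return x
        return torch.cat([x, self.pos_emb(pos_ids)], dim=-1)   # Eqs. (2)-(3)
```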

3.3 Sentence, Target and Opinion Words Encoding

In our model, we use three bidirectional long short-term memory (Bi-LSTM) networks [26] to encode the context information, the opinion information, and the aspect information, respectively. For the forward LSTM, we feed the word embedding \( x_{t} \) and the hidden state at the previous time step \( \overrightarrow{h}_{t - 1} \) into the LSTM cell, and the hidden state \( \overrightarrow{h}_{t} \) is calculated as:

$$ \overrightarrow{h}_{t} = \overrightarrow{LSTM} \left( {x_{t} ,\overrightarrow{h}_{t - 1} } \right) $$
(5)

The backward LSTM does the same thing as the forward LSTM except that the input sequence is fed in reverse. Then the hidden states of the forward LSTM and the backward LSTM are concatenated, and a hyperbolic tangent activation function is applied to the concatenation to form the hidden state \( h_{i} \):

$$ \overleftarrow{h}_{t} = \overleftarrow{LSTM} \left( {x_{t} ,\overleftarrow{h}_{t + 1} } \right) $$
(6)
$$ h_{i} = \tanh \left( {\left[ {\overrightarrow{h}_{i} ; \overleftarrow{h}_{i} } \right]} \right) $$
(7)

where \( \overrightarrow{h}_{i} \) and \( \overleftarrow{h}_{i} \) are the hidden states of the forward LSTM and the backward LSTM at time step i, respectively. The outputs of the encoder layer are denoted as \( H = \left\{ {h_{1} , h_{2} , h_{3} , \ldots , h_{n} } \right\} \), \( H_{o} = \left\{ {h_{1}^{o} , h_{2}^{o} , h_{3}^{o} , \ldots , h_{m}^{o} } \right\} \), and \( H_{a} = \left\{ {h_{1}^{a} , h_{2}^{a} , h_{3}^{a} , \ldots , h_{k}^{a} } \right\} \).
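As a rough sketch, the encoder of Eqs. (5)–(7) could be implemented with a single bidirectional LSTM layer; the class name and hidden size below are assumptions:

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Sketch of the Bi-LSTM encoder in Eqs. (5)-(7)."""
    def __init__(self, input_dim, d_l=100):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, d_l, batch_first=True, bidirectional=True)

    def forward(self, x):          # x: (batch, seq_len, input_dim)
        h, _ = self.lstm(x)        # h already concatenates forward and backward states
        return torch.tanh(h)       # Eq. (7): tanh over [h_forward; h_backward]
```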

3.4 Hybrid Attention Network

As shown in Fig. 1, we design two attention modules, Opinion Attention and Context Attention, to incorporate the independent relationship between opinion words and the target and the context information of the words across the sentence related to the target. Opinion Attention aims to generate a precise opinion representation by learning the relationship between opinion words and the target, and Context Attention makes the model focus on the words across the sentence that are related to the target.

Opinion Attention.

Opinion Attention is designed to capture the independent relationship between different opinion words and the target. As shown in Fig. 2, our opinion word extraction strategy is as follows: we first combine the positive and negative sentiment lexicons into a single sentiment lexicon. Based on the combined lexicon, given a sentence S, we obtain the candidate opinion words O. In order to determine the corresponding opinion words of each aspect, we also use dependency syntax analysis. Words that are directly connected to the aspect in the dependency tree are said to be directly reachable, and the distance between these words and the aspect is 1. We test candidate opinion word extraction strategies with different dependency distances, and the results are discussed later; a minimal sketch of the extraction step is shown after Fig. 2.

Fig. 2. Our opinion words extraction strategy based on the dependency tree.
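The following is a minimal sketch of this extraction strategy. The paper parses sentences with Stanford CoreNLP; here we simply assume the dependency parse is already available as a list of (head, dependent) index pairs, and all names are illustrative:

```python
from collections import deque

def extract_opinion_words(tokens, lexicon, dep_edges, aspect_idx, max_dist=1):
    """Keep lexicon words within `max_dist` hops of the aspect in the dependency graph."""
    # build an undirected adjacency list over token indices
    adj = {i: [] for i in range(len(tokens))}
    for head, dep in dep_edges:
        adj[head].append(dep)
        adj[dep].append(head)
    # breadth-first search outward from the aspect tokens
    dist = {i: 0 for i in aspect_idx}
    queue = deque(aspect_idx)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return [tokens[i] for i in sorted(dist)
            if 0 < dist[i] <= max_dist and tokens[i].lower() in lexicon]
```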

Given the target representation \( H_{a} = \left\{ {h_{1}^{a} ,h_{2}^{a} , \ldots ,h_{k}^{a} } \right\} \) and the hidden states of the opinion words \( H_{o} = \left\{ {h_{1}^{o} ,h_{2}^{o} , \ldots ,h_{m}^{o} } \right\} \), the opinion attention scores α can be calculated by formulas (8)–(10). First, we obtain the average pooling of the target representation \( h_{a\_avg} \). We then use the aspect representation to learn the attention score \( \alpha_{i} \) of each candidate opinion word with respect to the target, where \( W_{att1} \in {\mathbb{R}}^{{2d_{l} \times 2d_{l} }} \) is a weight matrix.

$$ h_{a\_avg} = \frac{1}{k}\mathop \sum \limits_{i = 1}^{k} h_{i}^{a} $$
(8)
$$ f_{o} \left( {h_{i}^{o} ,h_{a\_avg} } \right) = h_{i}^{o} W_{att1} h_{a\_avg}^{{\rm T}} $$
(9)
$$ \alpha_{i} = \frac{{\exp \left( {f_{o} \left( {h_{i}^{o} ,h_{a\_avg} } \right)} \right)}}{{\mathop \sum \nolimits_{j = 1}^{m} \exp \left( {f_{o} \left( {h_{j}^{o} ,h_{a\_avg} } \right)} \right)}} $$
(10)

Then the relationship representation \( r_{o} \in {\mathbb{R}}^{{2d_{l} }} \) is computed as the sum of the hidden states \( h_{i}^{o} \) weighted by their attention scores \( \alpha_{i} \), as shown in formula (11):

$$ r_{o} = \mathop \sum \limits_{i = 1}^{m} h_{i}^{o} \alpha_{i} $$
(11)
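Both the opinion attention (Eqs. (8)–(11)) and the context attention described next (Eqs. (12)–(14)) share the same bilinear form with separate weights. A hedged PyTorch sketch of one such module, under our own naming and dimension assumptions, is:

```python
import torch
import torch.nn as nn

class BilinearAttention(nn.Module):
    """Sketch of the bilinear attention used in Eqs. (8)-(11) and (12)-(14)."""
    def __init__(self, d_l=100):
        super().__init__()
        self.W = nn.Parameter(torch.empty(2 * d_l, 2 * d_l))   # W_att1 or W_att2
        nn.init.xavier_uniform_(self.W)

    def forward(self, H, H_a):
        # H: (batch, m, 2*d_l) opinion or context states; H_a: (batch, k, 2*d_l) aspect states
        h_a_avg = H_a.mean(dim=1)                               # Eq. (8): average pooling of the target
        scores = H @ self.W @ h_a_avg.unsqueeze(-1)             # Eq. (9)/(12): bilinear score
        alpha = torch.softmax(scores, dim=1)                    # Eq. (10)/(13): normalize over words
        r = (alpha * H).sum(dim=1)                              # Eq. (11)/(14): weighted sum
        return r, alpha.squeeze(-1)
```

In the full model, one instance of such a module would produce \( r_{o} \) from the opinion-word states and a second instance would produce \( r_{c} \) from the sentence states.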

Context Attention.

Given the target representation and the hidden states of the words across the sentence \( H = \left\{ {h_{1} ,h_{2} , \ldots ,h_{n} } \right\} \), the context attention scores β can be calculated by formulas (12)–(13), where \( W_{att2} \in {\mathbb{R}}^{{2d_{l} \times 2d_{l} }} \) is a weight matrix.

$$ f_{c} \left( {h_{i} ,h_{a\_avg} } \right) = h_{i} W_{att2} h_{a\_avg}^{{\rm T}} $$
(12)
$$ \beta_{i} = \frac{{\exp \left( {f_{c} \left( {h_{i} ,h_{a\_avg} } \right)} \right)}}{{\mathop \sum \nolimits_{j = 1}^{n} \exp \left( {f_{c} \left( {h_{j} ,h_{a\_avg} } \right)} \right)}} $$
(13)

Then the context representation \( r_{c} \in {\mathbb{R}}^{{2d_{l} }} \) is computed as the sum of the hidden states \( h_{i} \) weighted by their attention scores \( \beta_{i} \), as shown in formula (14).

$$ r_{c} = \mathop \sum \limits_{i = 1}^{n} h_{i} \beta_{i} $$
(14)

From the above attention modules, we obtain the relationship representation \( r_{o} \) and the context representation \( r_{c} \). We then use a non-linear layer to project them into the C-class target space and obtain the aspect-specific representation r, as shown in formula (15).

$$ r = \tanh \left( {W_{o} r_{o} + W_{c} r_{c} } \right) $$
(15)

where \( W_{o} \in {\mathbb{R}}^{{2d_{l} \times C}} \) and \( W_{c} \in {\mathbb{R}}^{{2d_{l} \times C}} \) are weight matrices and C is the number of sentiment polarities. Then we use softmax to compute the sentiment distribution over r as in formula (16).

$$ y_{i} = \frac{{\exp \left( {r_{i} } \right)}}{{\mathop \sum \nolimits_{j = 1}^{C} \exp \left( {r_{j} } \right)}} $$
(16)
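A small sketch of this output layer (Eqs. (15)–(16)) under the same assumed conventions:

```python
import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    """Sketch of Eqs. (15)-(16): fuse the two attention representations and predict polarity."""
    def __init__(self, d_l=100, num_classes=3):
        super().__init__()
        self.W_o = nn.Linear(2 * d_l, num_classes, bias=False)
        self.W_c = nn.Linear(2 * d_l, num_classes, bias=False)

    def forward(self, r_o, r_c):
        r = torch.tanh(self.W_o(r_o) + self.W_c(r_c))   # Eq. (15)
        return torch.softmax(r, dim=-1)                 # Eq. (16)
```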

3.5 Loss Function

Let \( \hat{y} \) be the estimated probability distribution and y be the true distribution. We use cross entropy with L2 regularization of the parameters as the loss function, as shown in formula (17), where i is the index of the sentence, j is the index of the class, N is the number of training samples, \( C \) is the number of sentiment classes, \( \lambda \) is the L2-regularization coefficient, and \( \Theta \) is the parameter set.

$$ J = - \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \mathop \sum \limits_{j = 1}^{C} y_{i}^{j} \log \left( {\widehat{{y_{i}^{j} }}} \right) + \lambda \left( {\mathop \sum \limits_{{\theta \in\Theta }} \theta^{2} } \right) $$
(17)
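Assuming one-hot true labels and a list of model parameters, formula (17) can be sketched as:

```python
import torch

def loss_fn(y_pred, y_true, params, lam=0.001):
    """Sketch of Eq. (17): averaged cross entropy plus an L2 penalty on the parameters."""
    ce = -(y_true * torch.log(y_pred + 1e-12)).sum(dim=1).mean()   # cross entropy over N samples
    l2 = sum((p ** 2).sum() for p in params)                       # L2 term over the parameter set
    return ce + lam * l2
```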

4 Experiments

4.1 Dataset

We conducted several experiments on the SemEval 2014 task 4 dataset to verify the effectiveness of our model. The SemEval 2014 dataset includes reviews in two domains, laptops and restaurants. These reviews have three sentiment polarities: positive, neutral, and negative, as shown in Table 1. In addition, following previous work, we use accuracy as the evaluation metric of the model.

Table 1. The details of the laptop and restaurant datasets.

4.2 Experiment Settings

In our experiments, word embeddings are initialized with pre-trained 300-dimensional GloVe word vectors [24]. All out-of-vocabulary words are initialized by sampling from the uniform distribution over (−0.1, 0.1). The position embeddings of sentences and opinion words are initialized with the Xavier uniform distribution, and their dimension is set to 100. The weight matrices and biases are also initialized with the Xavier uniform distribution. To perform dependency syntax analysis, the sentences of both datasets are parsed with Stanford CoreNLP.

For model training, we set the dimension of the LSTM hidden state to 100, the dropout rate to 0.5, and the L2 regularization weight to 0.001. We use the Adam optimizer and set the batch size and the learning rate to 64 and 0.001, respectively.
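For reference, the reported hyperparameters can be collected into a single configuration (values taken from this section; the key names are ours):

```python
# Hyperparameters reported in Sect. 4.2
CONFIG = {
    "word_emb_dim": 300,        # pre-trained GloVe vectors
    "pos_emb_dim": 100,         # Xavier-uniform initialized position embeddings
    "lstm_hidden_dim": 100,
    "dropout": 0.5,
    "l2_weight": 0.001,
    "optimizer": "Adam",
    "batch_size": 64,
    "learning_rate": 0.001,
}
```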

4.3 Baseline Models

We compare our model with several baseline models, which are as follows:

Majority assigns the most frequent sentiment polarity in the training set to each sample in the test set. TD-LSTM [2] uses two LSTM networks to model the left and right contexts together with the target, which are concatenated as the final representation to predict the sentiment polarity of the aspect. AE-LSTM [3] uses an LSTM network to model the context words, and combines the word hidden states with the aspect embedding to supervise the generation of attention vectors. ATAE-LSTM [3] is an improvement of AE-LSTM; it further strengthens the effect of the aspect embedding by appending the aspect embedding to each word embedding vector to represent the context. PosATT-LSTM [1] introduces position information to model the word's position, and then combines the hidden states with the aspect and position information to supervise the generation of attention vectors.

MemNet [7] captures the correlation between each context word and the given aspect through multiple attentions, and uses the output of the last attention. IAN [8] uses two independent LSTM networks to model aspects and contexts respectively, and uses the average pooling of the hidden states of the context for the aspect attention score calculation. RAM [9] is a multi-layer architecture, where each layer includes attention-based word feature aggregation and a GRU unit to learn the sentence representation. SHAN [21] synthesizes global information and local information with a gating mechanism by introducing a global attention score and a grammar-based local attention score, respectively.

4.4 Experimental Results and Analysis

We test our model on the laptop and restaurant datasets; the experimental results are shown in Table 2. As shown in Table 2, Majority has the worst performance among all models. The LSTM-based models are better than Majority, which shows that LSTM networks can effectively generate sentence feature representations to predict the sentiment polarity of aspects.

Table 2. Experimental results of different models on the laptop and restaurant datasets.

We can also see that using the word position information related to the target plays an important role in generating the final representation. Both PosATT-LSTM and SHAN consider the positions of the words, and the experimental results of the two models are remarkable. Comparing ATAE-LSTM and PosATT-LSTM, we can see that PosATT-LSTM gains 4.1% and 2.2% in performance on the laptop and restaurant datasets, respectively, by using location information. SHAN does not directly use the relative distance between each word and the aspect, but considers a syntax-based distance, which eliminates a lot of noise to a certain extent and also achieves good results.

Our model combines the relative distance and the syntactic distance to further improve the performance. Compared with the above baseline models, our model achieves the best performance. On the laptop and restaurant datasets, our model achieves 75.71% and 81.43% accuracy, respectively, which demonstrates the effectiveness of our model.

4.5 Ablation Studies

To verify the contribution of the different components of our proposed model, we also carried out an ablation test. Pos-LSTM denotes our model with only the sentence encoding with position embedding, without the other components. Pos-Context-ATT denotes our model with the context attention component but without the opinion attention component. The ablation results are reported in Table 3.

Table 3. Experimental results of our model in ablation analysis.

As shown in Table 3, Pos-Context-ATT performs better than Pos-LSTM, with an increase of 2.35% and 1.78% on the laptop and restaurant datasets. This indicates that capturing the context information of the words across the sentence related to the target can indeed improve the performance of this task. In addition, compared with Pos-Context-ATT, our final model gains 1.26% and 1.61% on the laptop and restaurant datasets, which means that the relationship between opinion words and the target effectively supervises the final representation and improves the prediction results.

4.6 Discussion

To verify the impact of the dependency distance on our model, we conducted several experiments with dependency distances of 1, 2, and 3. The results are shown in Table 4.

Table 4. The impact analysis of dependency distance to our model.

It can be observed that the greater the dependency distance, the worse the performance of our model. Compared with a dependency distance of 1, when the dependency distance is 2, the accuracy decreases by 0.95% and 0.81%, and when the dependency distance is 3, the accuracy decreases by 2.2% and 1.79%. We believe that when the dependency distance is too large, the model selects opinion words that are not related to the aspect, and these opinion words introduce a lot of noise, which decreases the performance.

5 Conclusions

Based on the observation that the independent relationship between opinion words and the target conveys important sentiment information about the given target, a position-aware hybrid attention network for aspect-level sentiment analysis is proposed in this paper. Our model not only captures the context information of the words related to the target across the sentence, but also captures the relationship between opinion words and the target. The experimental results on the public datasets show that our model is more effective than the compared baseline models.

Although the hybrid attention proposed in our model achieves good performance, we find that the information of the opinion attention is not well used in the context attention. In future research, we will focus on the interaction between the opinion words and the context. We hope that opinion words can help supervise the generation of attention scores over the context, which can make the model focus on context words related to opinion words.