
1 Introduction

Mobile social media has developed rapidly in recent years and has thoroughly penetrated global user communities. As one of the most important mobile social applications, Weibo covers entertainment, social interaction, marketing and more [1]. It has gradually evolved from a service satisfying people's "weak relationship" social needs into a popular public opinion platform, becoming one of the most important real-time information sources and a center for spreading public opinion. Because Weibo has a huge user base with diverse standpoints and background knowledge, the viewpoints and proposals expressed on it are broad and representative. Through sentiment analysis, consumers and businesses can understand most users' emotional attitudes toward particular products, which provides decision-making references for consumers and evidence for enterprises to improve product quality. The main purpose of Weibo sentiment analysis is to recognize the subjective information in Weibo posts, that is, to analyze users' viewpoints and proposals toward products, news and hot topics.

Viewpoints expressed in Weibo posts often involve multiple aspects. For example, the text "Nice service but the food was too bad!" expresses emotions toward two different targets. Aspect-level sentiment analysis lets us analyze the opinions and emotions expressed in a post in a more fine-grained manner. In the above example, the polarity is positive when the target is "service", but it turns negative when "food" is taken as the target.

Traditional methods for aspect-level sentiment analysis rely on hand-crafted features such as sentiment lexicons and bag-of-words features, which are then used to train a classifier [2]. However, such manual feature engineering is labor-intensive, and these methods depend heavily on it. In contrast, neural network models extract text features in a labor-saving and scalable way.

In neural network models, text features are mainly learned with sequence transduction models. Mainstream sequence transduction models are based on complicated recurrent neural networks (RNN) or convolutional neural networks (CNN) consisting of an encoder and a decoder, and models that connect the encoder and decoder through attention cells achieve the best overall performance [3]. Long short-term memory (LSTM) [4] and gated recurrent unit (GRU) [5] networks have attained the best results in the sentiment analysis domain. However, Weibo texts have become increasingly complex, and it is difficult to extract context-dependent text features from them. Although LSTM and GRU are currently the dominant sequence transduction models, they cannot process long texts well. Reference [6] proposes separating the text into four parts to ensure that no key information is lost, but context-dependent text features may still fail to be extracted.

Besides, a target may consist of several words, and its feature can be learned directly with sequence transduction models. Experiments show that averaging the target word vectors is the best way to obtain the target feature [7]; however, this method has certain weaknesses. For example, in the text "Nice macarons in France are not good at all.", the target "Nice macarons" consists of two words. Because the vocabulary is limited, it may contain only "Nice", so "macarons" is assigned a small random vector. The averaged target feature of "Nice macarons" then approximately equals the feature of "Nice", which leads to wrong classification results.

Based on the two problems analyzed above, we propose a memory network model that incorporates the Transformer. The Transformer is a novel network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Its basic unit is the self-attention mechanism, which obtains better context-dependent text representations by computing interactions between all parts of the sequence, regardless of sequence length.

We then utilize the memory network to capture the sentiment information of a given target. It contains four modules: the context module for encoding Weibo texts, the memory module for storing knowledge from previous steps, the question module for converting the target into a question and encoding it, and the answer module for predicting the sentiment polarity from the data in the memory module.

The memory network was originally proposed for the question answering task, but aspect-based sentiment classification has no explicit question. The original MN handles this by initializing the question vector produced by the question module with a zero or offset vector, whereas we argue that every target in the text can be converted into a question. We propose the Transformer-based memory network (TF-MN) to realize this idea: the question module of TF-MN treats each target in the text as implicitly asking "What is the emotion tendency of target in the text?". Figure 1 gives an overview of the TF-MN architecture.

Fig. 1. The architecture diagram of TF-MN.

The following is a summary of our work:

  • To the best of our knowledge, we are the first to use the Transformer to extract long-text features in the sentiment analysis task. Our model effectively solves the problem of inaccurate long-text feature extraction with LSTM and GRU.

  • Instead of averaging the target word vectors, the question module converts the target into a corresponding sentiment question, which handles the case where the target consists of multiple words.

  • On our dataset, our model achieves the best accuracy, and the experimental results further show that using the Transformer and adding the implicit question indeed improve the performance of the model.

2 Related Work

Aspect-based sentiment analysis (ABSA) is a subdomain of sentiment analysis that focuses on fine-grained sentiment information [8]. There are two main approaches to the ABSA problem.

The first is the traditional approach using lexicons and rules. Reference [9] computes sentiment word scores using a weighted-sum method. Reference [10] proposed a holistic lexicon-based method involving both explicit and implicit opinions. Reference [11] improves the identification of aspect-level relations with multiple kernels.

The second is the machine learning approach. Reference [12] employed hinge-loss Markov random fields to tackle ABSA in MOOCs. Reference [13] put forward an emotion-aligned model to predict aspect ratings. Reference [14] combined a lexicon-based method with a feature-based Support Vector Machine (SVM) and, for the first time, detected sentiment toward aspect words in the SemEval-2014 competition. Reference [15] constructed a binary phrase dependency tree of the target to build the features of aspect words. Reference [16] solved the problem with recurrent neural networks and proposed two approaches, TD-LSTM and TC-LSTM. Reference [17] proposed an attention-based LSTM method, which is the best method for handling abstract context memory information. Reference [18] introduced a deep memory network for the ABSA task; it uses a hierarchical model in which the text is fully connected with the target and attention cells produce the final classification [19]. Reference [20] processed the averaged target vector and the context-dependent vectors from an LSTM with an attention method. Reference [6] introduced a method that divides each sentence into three parts and extracts context-dependent features with a bidirectional GRU. The models mentioned above widely use LSTM and GRU without considering how to extract long-text features. Moreover, they never consider the situation where target words are missing from the vocabulary when the target representation is obtained by averaging target word vectors.

Different from the above models, we are inspired by the self-attention mechanism. We solve the long-text problem with the Transformer. In addition, we address the target-representation problem by converting the target into a sentiment question. Finally, we eliminate the influence of sentiment-irrelevant words through multiple extractions in the memory module.

3 The Proposed Model

We present the TF-MN model for ABSA in this section. The ABSA task can be formulated as follows: given a text of n words \(C = \{w_1^C, w_2^C, \cdots , w_{n-1}^C, w_n^C\}\), called the context, and a target \(T = \{w_1^T, w_2^T, \cdots , w_{i-1}^T, w_i^T\}\) consisting of several adjacent words that appear in the context, the aim is to predict the sentiment polarity of the specified target in the given context.
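To make the input format concrete, the following is a minimal, hypothetical example of one ABSA instance in Python; the field names and tokens are illustrative only and are not taken from our dataset.

```python
# One ABSA instance: a tokenized context, a target span of adjacent context
# words, and one of three polarity labels.
example = {
    "context": ["Nice", "service", "but", "the", "food", "was", "too", "bad", "!"],
    "target": ["food"],
    "polarity": "negative",   # one of {"positive", "negative", "neutral"}
}
```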

Fig. 2. The architecture of the TF-MN model.

Figure 2 shows the TF-MN architecture. In the context module, the context is converted into a sequence of low-dimensional vectors with pretrained word embeddings, and a Transformer then processes the context sequence while preserving sequential information. In the question module, the target is converted into the sentiment question "What is the emotion tendency of target in the text?", and a Transformer is likewise applied to the question. In the memory module, the effect of unrelated words is eliminated through several extractions. Finally, a softmax layer outputs the sentiment polarity. Each step of our model is described below.

3.1 Context Module

The context module includes three layers: the context encoder layer, the location encoder layer and the fusion layer. The context encoder and location encoder layers encode the context words and their location information into vectors separately, while the fusion layer exchanges information among these encoded vectors using the Transformer.

Context Encoder Layer. Given a context \(C = \left\{ w_1^C, w_2^C, \cdots , w_{n-1}^C, w_n^C\right\} \), every word in C is converted into a k-dimensional vector \(e_i^C \in \mathbb {R}^k\) with a pretrained word embedding matrix \(E \in \mathbb {R}^{k*\left| V\right| }\), such as the Tencent AI Lab Embedding [21]:

$$\begin{aligned} e_i^C = E(w_i^C) \end{aligned}$$
(1)

where \(\left| V\right| \) and k are the vocabulary size and the word-vector dimension, respectively.
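A minimal PyTorch sketch of Eq. (1) is given below; it is an illustration rather than our implementation, the random "pretrained" tensor stands in for the real Tencent AI Lab vectors, and the vocabulary size is an assumption.

```python
import torch
import torch.nn as nn

# Eq. (1): look up each context word index in a pretrained embedding matrix E
# of shape (|V|, k).
vocab_size, k = 10000, 200                     # assumed |V| and word-vector size
pretrained = torch.randn(vocab_size, k)        # placeholder for pretrained vectors
embed = nn.Embedding.from_pretrained(pretrained, freeze=False)

context_ids = torch.tensor([[3, 17, 52, 9]])   # indices of w_1^C ... w_4^C, batch of 1
E_C = embed(context_ids)                       # (1, 4, k) context embeddings e_i^C
```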

Location Encoder Layer. We establish a context location embedding matrix \(L \in \mathbb {R}^{k*n}\), which maps each word location to a k-dimensional vector \(l_i^C~\in ~\mathbb {R}^k\):

$$\begin{aligned} l_i^C = L(w_i^C) \end{aligned}$$
(2)

where n is the maximum context length. To provide rich location information for the context, each location is mapped to a k-dimensional vector of location information. L is a trainable parameter matrix whose entries are initialized from the uniform distribution \(U\left( -0.02, 0.02\right) \).
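Under this reading, Eq. (2) can be sketched as a trainable position-embedding lookup; the maximum context length below is an assumption.

```python
import torch
import torch.nn as nn

# Eq. (2): one trainable k-dimensional location vector per position,
# initialized from U(-0.02, 0.02).
k, n_c = 200, 250                              # word-vector size, assumed max context length
L = nn.Embedding(n_c, k)                       # location embedding table
nn.init.uniform_(L.weight, -0.02, 0.02)

positions = torch.arange(4).unsqueeze(0)       # positions 0..3 for a 4-word context
L_C = L(positions)                             # (1, 4, k) location embeddings l_i^C
```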

Fusion Layer. The fusion layer combines the context embedding matrix \(E_C\) and the context location embedding matrix \(L_C\), exchanging information among the vectors, and generates the context representation \(H_C \in \mathbb {R}^{k*nc}\):

$$\begin{aligned} H_C = Transformer(E_C, L_C) \end{aligned}$$
(3)

where nc denotes the maximum context length (shorter contexts are zero-padded).
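A sketch of Eq. (3) with a standard Transformer encoder follows. Summing the word and location embeddings, and the number of heads and layers, are our assumptions; the paper does not state how the two inputs are combined.

```python
import torch
import torch.nn as nn

# Eq. (3): fuse word and location embeddings and encode them with a
# Transformer encoder.
k, n_c = 200, 250
E_C = torch.randn(1, n_c, k)                   # padded context word embeddings
L_C = torch.randn(1, n_c, k)                   # context location embeddings

layer = nn.TransformerEncoderLayer(d_model=k, nhead=4, batch_first=True)
transformer = nn.TransformerEncoder(layer, num_layers=2)
H_C = transformer(E_C + L_C)                   # (1, n_c, k) context representation H_C
```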

3.2 Question Module

The question module likewise contains three layers: the question encoder layer, the location encoder layer and the fusion layer. The question encoder layer first converts the target into a sentiment question and then encodes the question into vectors. The location encoder layer encodes location information into vectors. The fusion layer fuses these vectors into more specific features through the Transformer.

Question Encoder Layer. Given a target \(T = \left\{ w_1^T, w_2^T, \cdots , w_i^T\right\} \), T consists of one or more words, which may or may not appear in C. Embedding T into the sentiment question template yields the question "What is the emotion tendency of target in the text?". Every word of the question is converted into a k-dimensional vector \(e_i^Q \in \mathbb {R}^k\) with the pretrained word embedding matrix E:

$$\begin{aligned} e_i^Q = E(w_i^Q) \end{aligned}$$
(4)
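The question construction step can be sketched as a simple template fill; the helper name is illustrative, and the English template mirrors the Chinese one used on our Weibo data.

```python
# Insert the target words into the fixed sentiment-question template before
# embedding the question with E.
def build_question(target_words):
    return (["What", "is", "the", "emotion", "tendency", "of"]
            + target_words
            + ["in", "the", "text", "?"])

print(build_question(["Nice", "macarons"]))
# ['What', 'is', 'the', 'emotion', 'tendency', 'of', 'Nice', 'macarons',
#  'in', 'the', 'text', '?']
```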

Location Encoder Layer. We also apply the location embedding matrix L, which maps each word location to a k-dimensional vector \(l_i^Q \in \mathbb {R}^k\):

$$\begin{aligned} l_i^Q = L\left( w_i^Q\right) \end{aligned}$$
(5)

where the location embedding matrix for Q is the same as in the context module.

Fusion Layer. The fusion layer generates the sentiment question representation \(H_Q \in \mathbb {R}^{k*nq}\):

$$\begin{aligned} H_Q = Transformer\left( E_Q, L_Q\right) \end{aligned}$$
(6)

where nq denotes the maximum question length.

3.3 Memory Module

The memory module has three components: the attention gate, the feature conversion and the memory update gate, which together combine information from the context with the target and purify the target-related information from the given context.

The module receives three inputs: the output F from the context module, the question \(q^*\) from the question module, and the knowledge acquired in previous steps, stored in the memory vector \(m_{t-1}\). The three inputs are transformed by:

$$\begin{aligned} u = \left[ F *q^*; \left| F - q^*\right| ; F *m_{t-1}; \left| F - m_{t-1} \right| \right] \end{aligned}$$
(7)

where “;” is concatenation. “\(*,~-,~\left| \right| \)” are element-wise product, subtraction and absolute value respectively. F is a matrix of size \(\left( 1, H_C\right) \), while \(q^*\) and \(m_{t-1}\) are vectors of size \(\left( 1, H_Q\right) \) and \(\left( 1, H_m\right) \), where \(H_m\) is the output size of the memory update gate. To allow element-wise operation, \(H_C\), \(H_Q\) and \(H_m\) are set to the same shape. In Eq. (7), the first two terms measure the similarity and difference between facts and the question. The last two terms have the same functionality for context and the last memory state.

Let the i-th element of \(\alpha \) be the attention weight for \(w_i^C\). \(\alpha \) is obtained by transforming u with a two-layer perceptron:

$$\begin{aligned} \alpha = softmax\left( tanh\left( u\cdot W_{m1}\right) \cdot W_{m2}\right) \end{aligned}$$
(8)

where \(W_{m1}\) and \(W_{m2}\) are parameters of the perceptron and we omit bias terms.
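The following sketch illustrates Eqs. (7) and (8); the tensor shapes, the broadcasting of the question and memory vectors over context positions, and the perceptron hidden size are our assumptions.

```python
import torch
import torch.nn as nn

# Eqs. (7)-(8): build the fused feature u from the facts F, the question q*
# and the previous memory m_{t-1}, then score every context position with a
# two-layer perceptron (biases omitted, as in the paper).
k, n_c = 200, 250
F = torch.randn(1, n_c, k)                     # context facts from the context module
q = torch.randn(1, 1, k)                       # question vector q*, broadcast over positions
m = torch.randn(1, 1, k)                       # previous memory state m_{t-1}

u = torch.cat([F * q, (F - q).abs(), F * m, (F - m).abs()], dim=-1)   # Eq. (7), (1, n_c, 4k)

W_m1 = nn.Linear(4 * k, k, bias=False)
W_m2 = nn.Linear(k, 1, bias=False)
alpha = torch.softmax(W_m2(torch.tanh(W_m1(u))).squeeze(-1), dim=-1)  # Eq. (8), (1, n_c)
```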

The feature conversion takes F and \(\alpha \) as input and produces the updated F:

$$\begin{aligned} F = F \cdot \alpha \end{aligned}$$
(9)

The memory update gate outputs the updated memory \(m_t\) using question \(q^{*}\), previous memory state \(m_{t-1}\) and the updated F:

$$\begin{aligned} m_t = relu\left( \left[ q^*;m_{t-1};F\right] \cdot W_u\right) \end{aligned}$$
(10)

where \(W_u\) is the parameter of the linear layer.

The memory module can be iterated several times, with a new \(\alpha \) generated each time. This allows the model to attend to different parts of the facts in different iterations, which enables it to perform complicated reasoning across sentences. The memory module produces \(m_t\) at the last iteration as its output.
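The sketch below ties Eqs. (9)-(10) and the multi-hop loop together in the same hedged spirit; pooling the re-weighted facts into a single vector before Eq. (10) and initializing the memory with the question vector are our assumptions, not details stated above.

```python
import torch
import torch.nn as nn

# Eqs. (9)-(10) with multi-hop iteration. The attention weights here are a
# stand-in for Eq. (8) computed as in the previous sketch.
k, n_c, hops = 200, 250, 5
F0 = torch.randn(1, n_c, k)                    # facts from the context module
q = torch.randn(1, k)                          # question vector q*
m = q.clone()                                  # assumed initial memory state
W_u = nn.Linear(3 * k, k)

for _ in range(hops):
    alpha = torch.softmax(torch.randn(1, n_c), dim=-1)        # stand-in for Eq. (8)
    F = F0 * alpha.unsqueeze(-1)               # Eq. (9): re-weight the facts
    f = F.sum(dim=1)                           # pool facts to a single vector (assumption)
    m = torch.relu(W_u(torch.cat([q, m, f], dim=-1)))          # Eq. (10): update memory
```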

3.4 Answer Module

In the answer module, we regard the memory module output as the final representation and feed it into a softmax layer for the aspect-based sentiment classification task. The model is trained in a supervised manner to minimize the cross-entropy error of sentiment classification, with the loss function defined as follows:

$$\begin{aligned} loss = -\sum _{\left( c, q\right) \in T}\sum _{lb\in LB}P_{lb}^g\left( c, q\right) \cdot \log \left( P_{lb}\left( c, q\right) \right) \end{aligned}$$
(11)

where T is the set of all training items, LB is the set of sentiment polarities, and \(\left( c, q\right) \) is a context-question pair. \(P_{lb}\left( c, q\right) \) is the probability of class lb that our system outputs for the item \(\left( c, q\right) \), and \(P_{lb}^g\left( c, q\right) \) is one if lb is the gold label of the item and zero otherwise. We compute the gradients of all parameters by back-propagation and update them in a stochastic gradient descent manner.
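For concreteness, Eq. (11) corresponds to the standard cross-entropy loss over the three polarity classes; a minimal sketch with dummy logits, labels and an assumed label order follows.

```python
import torch
import torch.nn as nn

# Eq. (11): cross-entropy over the polarity classes applied to the answer
# module's softmax output (here averaged over a dummy batch).
logits = torch.randn(8, 3, requires_grad=True)    # one row per (c, q) pair
gold = torch.tensor([0, 2, 1, 1, 0, 2, 0, 1])     # assumed order: pos=0, neg=1, neu=2
loss = nn.functional.cross_entropy(logits, gold)
loss.backward()                                   # gradients for SGD-style updates
```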

4 Experiment

4.1 Dataset and Experiment Setup

Dataset. Over the last ten years, many Chinese Weibo competitions have been held and have produced excellent datasets such as the NLPCC 2013 and 2014 training sets. Unfortunately, most of these datasets only annotate the overall sentiment polarity of an entire sentence, so we construct a new dataset for ABSA. We collect posts from Weibo, and each post may contain multiple target entities. A target can be an entity that appears explicitly in a post or an abstract entity implied by it. We annotate each target with one of three polarities: positive, negative or neutral; if a target diverges among these three polarities, we ignore that target entity. We mainly build the dataset in the restaurant domain, which covers four aspects: traffic, service, price and environment. We randomly selected 18480 Weibo items (out of a total of 22821) as the training set and use the remaining 4341 items as the test set. Table 1 gives the details of the dataset. Finally, it is worth noting that the texts in our dataset are generally longer than 200 words, which qualifies them as long texts.

Table 1. Statistics of dataset

Evaluation. We evaluate our model with accuracy, and labels are removed from the test set before training. A prediction is counted as correct if the label output by the answer module matches the manually annotated label.

Parameter Setting. In the model, the word embeddings of the contexts and targets are 200-dimensional vectors from the Tencent AI Lab Embedding [21]. Words outside the vocabulary are randomly assigned vectors drawn from the uniform distribution \(U\left( -0.01, 0.01\right) \). To prevent overfitting, we set the dropout rate to 0.1. We use the Adam optimizer with a batch size of 8 and a learning rate of 6.25e−5. We use jieba [22] for Chinese word segmentation and to generate the word vector matrix for our experiments. Note that the experimental results vary with each random assignment of word vectors even when the random seeds are fixed; to address this, all reported results are averaged over 10 runs.
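A minimal sketch of this training configuration is given below; the model object is a placeholder standing in for the full TF-MN network.

```python
import torch
import torch.nn as nn

# Optimizer and regularization settings reported above.
model = nn.Linear(200, 3)                           # placeholder for TF-MN
dropout = nn.Dropout(p=0.1)                         # dropout rate 0.1
optimizer = torch.optim.Adam(model.parameters(), lr=6.25e-5)
batch_size = 8
# Reported accuracy is the average over 10 runs with different random initializations.
```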

4.2 Experiment Setting

To test the performance of our model, we compare it against the following methods: SVM, LSTM, TD-LSTM [17], AT-LSTM [23], IAN [24], BILSTM-ATT and MemNet [16].

Firstly, for the SVM experiment, we directly call the SVM class in sklearn [25]. Since the number of samples is much larger than the number of features, we use the nonlinear RBF kernel. The optimal parameters of the model are found with grid search over a coarse, wide-range grid.
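A sketch of this baseline follows, assuming bag-of-words-style features in X_train and polarity labels in y_train; the grid values are illustrative.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# RBF-kernel SVC tuned by grid search over a coarse, wide-range grid.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# grid.fit(X_train, y_train)
# best_svm = grid.best_estimator_
```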

Secondly, we use the LSTM and BiLSTM modules in TensorFlow, taking the Weibo text as input and outputting the classification results through a softmax layer.
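A corresponding sketch of these baselines in TensorFlow is shown below; the vocabulary size and hidden dimension are assumptions.

```python
import tensorflow as tf

# LSTM / BiLSTM baseline: encode the Weibo text with an (optionally
# bidirectional) LSTM and classify with a softmax layer.
vocab_size, k = 10000, 200
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, k),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),  # drop Bidirectional for plain LSTM
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```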

Finally, we run experiments with the open-source TD-LSTM and AT-LSTM [26], IAN [27] and MemNet [28] code on GitHub. The parameters and settings of these experiments follow the original authors' papers as closely as possible.

4.3 Model Comparisons and Analysis of Results

Among these models, SVM belongs to traditional machine learning; LSTM and TD-LSTM are general neural network methods; AT-LSTM, IAN, BILSTM-ATT and MemNet mainly apply attention mechanisms.

We compare the accuracy of the TF-MN model with that of the other models; the results are listed in Table 2.

From the table, we observe that SVM shows the largest performance gap relative to TF-MN. We suggest two reasons: the SVM model does not use the aspect information, and it only classifies the text without mining deep text features.

The LSTM model also performs poorly on this dataset, because the forget-gate mechanism of LSTM causes the loss of key sentiment information when processing long texts.

The accuracy of TD-LSTM is much better than that of the plain LSTM model because it combines the aspect with the text.

Unlike TD-LSTM, AT-LSTM uses an attention mechanism to effectively extract important emotional information from the text given the aspect, which greatly improves the final classification result.

BILSTM-ATT improves on the AT-LSTM model by using a bidirectional LSTM, which can significantly improve performance on sequence classification problems.

The above models mainly focus on the influence of the aspect on the text, whereas the IAN model also uses the influence of the text on the aspect as a basis for classification; it assumes that the aspect and the text should influence each other rather than being connected in one direction only.

Going beyond the above models, MemNet proposes that the attention mechanism should act on the text representation directly instead of through LSTM encoding. It therefore abandons the LSTM and uses a simpler memory module to encode information, repeatedly applying a local attention mechanism to extract information, and achieves good results.

Although MemNet performs well, we found during our experiments that its memory module alone does not encode the text well and is as weak as the LSTM model on longer texts. At the same time, its treatment of the aspect is too coarse. Based on these two problems, we improve on the MemNet model and achieve better results.

Table 2. 3-way experimental results in accuracy. 3-way represents the three polarities of positive, negative, neutral. Best scores in each group are in bold.

4.4 Memory Network Optimization

To improve the ability of the memory module to extract emotional information, we conducted a series of experiments to tune the number of memory updates. We found that the model classifies best when the hop count is set to 5; the results of these experiments are shown in Table 3. We believe that beyond this point, excessive update operations cause the local attention mechanism to operate repeatedly on the same block of text.

Table 3. Comparison of memory network hop counts.

5 Conclusion

In this work, we propose the TF-MN model, which uses a memory network architecture to cast Weibo aspect-based sentiment classification as a question answering task. The key idea is to frame the target as a sentiment question. We conducted 3-way experiments on our Weibo dataset in the restaurant domain, and the results show that this way of modeling improves classification accuracy. We believe that the memory network could be replaced by other, more efficient network architectures, and that the attention gate in the memory module of TF-MN could incorporate syntactic information. Other tasks that have a context but no explicit question may also benefit from this work.