
1 Introduction

Emotion analysis is an active research area in natural language processing. As comments and opinions on the Internet play an increasingly important role in shaping people's choices and events, the reasons behind these comments become correspondingly important. The objective of the Emotion Cause Extraction (ECE) [1] task is to discover the causes behind emotions. With the development of deep learning, deep models have been applied to the ECE task. Inspired by the memory networks used in question answering (QA) systems, Gui et al. [2] designed an ECE model based on a QA architecture. This method, however, ignores the context of emotional words in the text, so Li et al. [3] designed a co-attention network to make full use of the information between clauses. Furthermore, Yu et al. [4] proposed a hierarchical network to model information at the word, phrase, and clause levels. Xia et al. [5] also designed a Transformer-based hierarchical network and combined global prediction information with relative position. Fan et al. [6] further introduced an external emotion corpus as a regularization term in a hierarchical network, together with a regularization term for relative position.

However, the ECE task requires the polarity of emotions to be annotated in advance. The Emotion Cause Pair Extraction (ECPE) task improves on ECE: it does not need emotion annotations in advance, and its objective is to select both the clause in which the emotion appears and the clause that causes it. At present, the ECPE task is addressed with deep learning models, mainly via two approaches: the two-step method and the end-to-end method. The two-step method is the traditional approach: first, all emotion and cause clauses in the text are extracted to obtain two clause sets; a filter is then trained to pick the correct emotion cause pairs from the candidates.

However, this approach suffers from error propagation. In addition, because the first step targets emotion and cause clause extraction rather than the ECPE task itself, the performance of the model degrades.

To address this problem, end-to-end architectures are widely used. Wu et al. [7] proposed a multi-task neural network that performs ECPE and two auxiliary tasks in a unified model. Fan et al. [8] cast the ECPE task as a directed graph generation task and designed TransECPE, a transition-based ECPE model. Wei et al. [9] proposed the more efficient model RANKCP, which uses a graph attention network to model document content and structure and solves the ECPE task from a ranking perspective. In addition, to exploit the interactive information between different tasks, the parameter sharing method from multi-task learning is adopted to relate the two sub-tasks to emotion cause pair extraction. Tang et al. [10] calculated relationship scores between emotion and cause clauses using self-attention [11] and extracted the emotion cause pairs whose scores exceeded a threshold.

However, multi-task models cannot make full use of the rich interaction information between tasks, and information transfer between tasks remains implicit. Label embedding has proven to be an effective method across various fields and tasks. In natural language processing, label embedding for text classification has been studied in the heterogeneous network of Tang et al. [12] and the multi-task learning setting of Zhang et al. [13]. Subsequently, Wang et al. [14] treated text classification as a joint label and word embedding problem and introduced an attention framework to measure the compatibility between text sequences and label embeddings. Zhang et al. [15] proposed multi-task label embedding, which maps the labels of each task to semantic vectors and processes word sequences in a manner similar to word embedding, thereby transforming classification tasks into vector matching tasks.

The main contributions of this paper can be summarized as follows: (1) The relationship between tasks is constructed by sharing parameters at the bottom level, and the results of emotion and cause extraction are shared with the ECPE task explicitly through label embedding. (2) Feature fusion is adopted to obtain feature representations that carry multiple kinds of relational information. (3) The model achieves the best results on the ECPE benchmark dataset, which verifies its validity.

2 Related Work

Xia et al. [16] proposed an end-to-end solution to the error propagation problem of the two-step method, integrating the generation, pairing, and final prediction of emotion cause pair representations into a joint framework. They designed a two-dimensional Transformer [17] and two variants to obtain emotion cause pair representations and to capture the interactions between different emotions and causes while generating the vector representations.

Some researchers transfer ideas from other tasks to ECPE. Song et al. [18] regarded ECPE as a link prediction problem: predict whether there is an edge from the emotion clause to the cause clause, and if there is, the two clauses constitute an emotion cause pair. Fan et al. [19] treated the ECPE task as a sequence labeling problem and proposed a multi-task sequence labeling model via label distribution refinement.

Most current graph convolution networks rely on the structural information of graphs for classification. Li et al. [20] retained both the structural information and the feature similarity of graphs and introduced an attention mechanism for classification. Chen et al. [21] applied a graph convolution network and constructed an emotion cause pair graph to model three kinds of dependency relationships between candidate pairs in a local neighborhood, where each node represents a candidate emotion cause pair and an edge connecting two nodes represents the dependency between two candidate pairs. The graph convolution network is then used to aggregate over these three types of edges and propagate context information through the graph.

3 Methodology

3.1 Task Description

The ECPE task in this paper is to extract emotion clauses and their corresponding cause clauses from a given text. A text case from the ECPE dataset is given in Table 1. The text is divided into eight clauses according to punctuation marks; the sixth clause is the emotion clause, and the eighth clause is the corresponding cause clause. The final goal of the ECPE task is to extract the emotion cause pair (C6, C8).

Table 1. Data example.

3.2 Model Description

The main architecture of Mul-ECPE, the ECPE model proposed in this paper, is shown in Fig. 1.

Fig. 1. Mul-ECPE model architecture.

3.3 Input Layer

The input layer contains the word embedding and word encoding layers. First, word embedding is applied to the input text sequence: the segmented words are transformed into vectors to obtain the text sequence matrix. The i-th clause in document D can be represented by an embedding matrix \(c_i = \left( {w_i^1 , \ldots ,w_i^{\left| {c_i } \right|} } \right)\).

Bi-LSTM is used to encode the words: for each clause, it encodes word-level context information, as shown in the following formula.

$$ \begin{array}{*{20}c} {\left[ {h_i^1 , \ldots ,h_i^j , \ldots ,h_i^{\left| {c_i } \right|} } \right] = \left[ {Bi - LSTM\left( {w_i^1 , \ldots ,w_i^j , \ldots ,w_i^{\left| {c_i } \right|} } \right)} \right]} \\ \end{array} $$
(1)

where \(w_i^j\) represents the word vector of the j-th word in the i-th clause of the document, and \(h_i^j\) represents the corresponding word-level feature representation.
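To make the data flow concrete, the following is a minimal PyTorch sketch of the input layer (Eq. 1). All dimensions and class names are illustrative assumptions, not the authors' released implementation.

```python
import torch.nn as nn

# Minimal sketch of the input layer, assuming illustrative dimensions:
# word embedding followed by a word-level Bi-LSTM (Eq. 1).
class InputLayer(nn.Module):
    def __init__(self, vocab_size, emb_dim=200, hidden_dim=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, clause_tokens):
        # clause_tokens: (num_clauses, max_words) word ids of one document
        w = self.embedding(clause_tokens)   # word vectors w_i^j
        h, _ = self.bilstm(w)               # h_i^j with dimension 2 * hidden_dim
        return h
```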

3.4 Feature Fusion Layer

After passing through the input layer, the feature representation at the word level is obtained. The feature representation of each clause of the text is shown in the following formula.

$$ \begin{array}{*{20}c} {c_i = \left[ {h_i^1 , \ldots ,h_i^{\left| {c_i } \right|} } \right]} \\ \end{array} $$
(2)

Convolution kernels of different sizes are used to conduct one-dimensional convolution operations at the word level, so that the model can learn multiple features at the word level, as shown in the formula.

$$ \begin{array}{*{20}c} {x_t^i = conv_t \left( {c_1 , \ldots ,c_{\left| c \right|} } \right)} \\ \end{array} $$
(3)

where \(conv_t\) represents the convolution operation and t the size of the convolution kernel. Since a single plain connection cannot adequately capture language composition, the convolutional layers are densely connected:

$$ \begin{array}{*{20}c} {x = \left[ { \oplus_t x_t } \right]} \\ \end{array} $$
(4)

With dense connections, downstream convolution layers can access the features generated by upstream layers. This paper draws on the multi-scale feature attention mechanism proposed by Wang et al. [14], so that each position of the text adaptively selects features of different scales. The mechanism consists of two steps: convolution aggregation and scale weighting (Fig. 2).

Fig. 2. Feature fusion.

First, convolution aggregation uses a descriptor \(s_l^i\) to represent the features of different scales obtained at position \(i\). \(K\) convolution kernels are used in each convolution block, and each feature vector is then summarized by a scalar.

$$ \begin{array}{*{20}c} {s_l^i = F_{ensem} \left( {x_l^i } \right)} \\ \end{array} $$
(5)

where \(F_{ensem}\) is a function that sums all \(K\) elements of the input vector; the resulting scalar serves as the descriptor of the feature vector.

Second, the descriptors are normalized across the scales (e.g., by softmax) to obtain the weights \(\alpha_l^i\), so that features of different scales are weighted adaptively. The final feature representation is calculated as follows:

$$ \begin{array}{*{20}c} {x_{atten}^i = \mathop \sum \limits_{l = 1}^L \alpha_l^i x_l^i } \\ \end{array} $$
(6)
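The sketch below illustrates Eqs. (3), (5), and (6) in PyTorch under stated assumptions: kernel sizes, filter counts, and the softmax normalization of the descriptors are illustrative choices, and the dense connections of Eq. (4) are omitted for brevity.

```python
import torch
import torch.nn as nn

# Hedged sketch of the feature fusion layer: multi-scale 1-D convolutions
# and multi-scale feature attention. Dense connections (Eq. 4) are omitted.
class FeatureFusion(nn.Module):
    def __init__(self, in_dim, k_filters=50, kernel_sizes=(1, 2, 3)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(in_dim, k_filters, t, padding=t // 2)
            for t in kernel_sizes])

    def forward(self, h):
        # h: (batch, seq_len, in_dim) word-level features from the input layer
        x = h.transpose(1, 2)                        # Conv1d expects (batch, C, L)
        feats = [conv(x)[..., :h.size(1)] for conv in self.convs]   # Eq. (3)
        x_l = torch.stack(feats, dim=1)              # (batch, scales, K, seq_len)
        s = x_l.sum(dim=2)                           # descriptor s_l^i, Eq. (5)
        alpha = torch.softmax(s, dim=1)              # scale weights alpha_l^i
        x_atten = (alpha.unsqueeze(2) * x_l).sum(dim=1)              # Eq. (6)
        return x_atten.transpose(1, 2)               # (batch, seq_len, K)
```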

3.5 Inter-clause Information Encoding Layer

To further learn the semantic order information across the clauses of a document, another Bi-LSTM is applied over the fused clause representations \(a_i\) (the \(x_{atten}^i\) output of the feature fusion layer):

$$ \begin{array}{*{20}c} {s_i = \left[ {Bi - LSTM\left( {a_i } \right)} \right]} \\ \end{array} $$
(7)

3.6 Label Embedding Layer

Following the bi-affine attention mechanism, this paper takes the emotion clause as the head and the cause clause as the dependent. The matrix composed of emotion and cause clauses is therefore asymmetric and direction-aware. Two MLPs are applied to generate the representations of emotion clauses and cause clauses respectively, as shown in the following formulas:

$$ \begin{array}{*{20}c} {z_i^e = \sigma \left( {W^e h_i^e + b^e } \right)} \\ \end{array} $$
(8)
$$ \begin{array}{*{20}c} {z_i^c = \sigma \left( {W^c h_i^c + b^c } \right)} \\ \end{array} $$
(9)

where \(z_i^e\) represents the i-th emotion clause representation and \(z_i^c\) the i-th cause clause representation.

The emotion and cause clause extraction results are embedded into the input of the graph attention network. When processing the results of the two auxiliary tasks, the extraction results for the emotion and cause clauses are first mapped to vector representations by the embedding matrices \(W^e\) and \(W^c\):

$$ \begin{array}{*{20}c} {Y_i^e = W^e y_i^e } \\ \end{array} $$
(10)
$$ \begin{array}{*{20}c} {Y_i^c = W^c y_i^c } \\ \end{array} $$
(11)

where \(y\) represents the extraction result of a clause, \(Y\) is the transformed vector, and \(s\) is the context-encoded clause representation output by the inter-clause encoding layer. After concatenation, these vectors are used as the input of the graph attention network to encode the representations of emotion and cause clauses.

$$ \begin{array}{*{20}c} {s_i^e = \left[ {s_i ;Y_i^e } \right]} \\ \end{array} $$
(12)
$$ \begin{array}{*{20}c} {s_i^c = \left[ {s_i ;Y_i^c } \right]} \\ \end{array} $$
(13)

The concatenated vectors are transformed into the emotion clause representation and the cause clause representation through fully connected layers.

$$ \begin{array}{*{20}c} {z_i = \sigma \left( {W^e s_i^e + b^e } \right)} \\ \end{array} $$
(14)
$$ \begin{array}{*{20}c} {d_i = \sigma \left( {W^c s_i^c + b^c } \right)} \\ \end{array} $$
(15)

where \(z\) represents the emotion clause representation and \(d\) the cause clause representation.
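A minimal sketch of this label embedding step (Eqs. 10-15) is given below, assuming two-dimensional label vectors (e.g., one-hot or softmax outputs) and illustrative dimensions; it is an interpretation of the text, not the authors' code.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the label embedding layer: auxiliary extraction
# results y are embedded, concatenated with the clause representation s,
# and projected by fully connected layers (Eqs. 10-15).
class LabelEmbedding(nn.Module):
    def __init__(self, num_labels=2, label_dim=50, clause_dim=200, out_dim=100):
        super().__init__()
        self.emb_e = nn.Linear(num_labels, label_dim, bias=False)  # W^e, Eq. (10)
        self.emb_c = nn.Linear(num_labels, label_dim, bias=False)  # W^c, Eq. (11)
        self.fc_e = nn.Linear(clause_dim + label_dim, out_dim)     # Eq. (14)
        self.fc_c = nn.Linear(clause_dim + label_dim, out_dim)     # Eq. (15)

    def forward(self, s, y_e, y_c):
        # s: (n, clause_dim); y_e, y_c: (n, num_labels) extraction results
        s_e = torch.cat([s, self.emb_e(y_e)], dim=-1)   # Eq. (12)
        s_c = torch.cat([s, self.emb_c(y_c)], dim=-1)   # Eq. (13)
        z = torch.sigmoid(self.fc_e(s_e))               # emotion representation z_i
        d = torch.sigmoid(self.fc_c(s_c))               # cause representation d_i
        return z, d
```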

3.7 Graph Attention Layer

Although the information between clauses has been learned through Bi-LSTM, it captures only single-hop relationships, while there may be multi-hop relationships between emotions and causes. The graph attention mechanism can naturally capture multi-hop relationships and integrate higher-order information. In this paper, the clause representations are updated by stacking two graph attention layers.

The attention weights, which reflect the correlation between clauses, are learned by an MLP. The attention between clauses is calculated as follows:

$$ \begin{array}{*{20}c} {e_{ij}^{\left( t \right)} = w^{\left( t \right)^T } \tanh \left( {\left[ {W^{\left( t \right)} z_i^{\left( {t - 1} \right)} ;W^{\left( t \right)} d_j^{\left( {t - 1} \right)} } \right]} \right)} \\ \end{array} $$
(16)
$$ \begin{array}{*{20}c} {\alpha_{ij}^{\left( t \right)} = \frac{{\exp \left( {LeakyReLU\left( {e_{ij}^{\left( t \right)} } \right)} \right)}}{{\sum_{k \in N\left( i \right)} \exp \left( {LeakyReLU\left( {e_{ik}^{\left( t \right)} } \right)} \right)}}} \\ \end{array} $$
(17)

Considering that an emotion clause may correspond to multiple cause clauses, the original neighbor aggregation of the graph attention network is modified so that multi-hop relationship information between clauses can be better learned during the update. The formulas are shown below:

$$ \begin{array}{*{20}c} {z_i^t = g\left[ {W_1 \sum \limits_{j \in N\left( i \right)} \left( {\alpha_{ji}^t z_j^{t - 1} + \alpha_{ij}^t d_j^{t - 1} } \right) + b_1 z_i^{t - 1} } \right]} \\ \end{array} $$
(18)
$$ \begin{array}{*{20}c} {d_i^t = g\left[ {W_2 \sum \limits_{j \in N\left( i \right)} \left( {\alpha_{ij}^t z_j^{t - 1} + \alpha_{ij}^t d_j^{t - 1} } \right) + b_2 d_i^{t - 1} } \right]} \\ \end{array} $$
(19)
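As a concrete reading of Eqs. (16)-(19), the sketch below assumes a fully connected clause graph, takes the activation \(g\) to be sigmoid, and treats \(b_1\), \(b_2\) as learnable scalars; these are interpretation choices, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch of one graph attention layer over emotion (z) and
# cause (d) clause representations (Eqs. 16-19).
class PairGATLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)    # W^(t) of Eq. (16)
        self.w = nn.Linear(2 * dim, 1, bias=False)  # w^(t) of Eq. (16)
        self.W1 = nn.Linear(dim, dim, bias=False)   # W_1 of Eq. (18)
        self.W2 = nn.Linear(dim, dim, bias=False)   # W_2 of Eq. (19)
        self.b1 = nn.Parameter(torch.ones(1))
        self.b2 = nn.Parameter(torch.ones(1))

    def forward(self, z, d):
        n = z.size(0)
        zi = self.W(z).unsqueeze(1).expand(n, n, -1)   # row i: clause i
        dj = self.W(d).unsqueeze(0).expand(n, n, -1)   # column j: clause j
        e = self.w(torch.tanh(torch.cat([zi, dj], dim=-1))).squeeze(-1)  # Eq. (16)
        alpha = torch.softmax(F.leaky_relu(e), dim=-1)                   # Eq. (17)
        agg_z = alpha.t() @ z + alpha @ d   # sum_j (a_ji z_j + a_ij d_j)
        agg_d = alpha @ z + alpha @ d
        z_new = torch.sigmoid(self.W1(agg_z) + self.b1 * z)  # Eq. (18)
        d_new = torch.sigmoid(self.W2(agg_d) + self.b2 * d)  # Eq. (19)
        return z_new, d_new
```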

To integrate these two representations into the emotion cause pair matrix for the final task objective, the emotion and cause representations are combined in two steps. First, a bilinear-like operation is performed for each candidate emotion cause pair; second, bi-affine transformations handle the complex interactions, as shown in the formula:

$$ \begin{array}{*{20}c} {M_{p,q} = \left( {W^m z_p^e + b^m } \right)^T z_q^c } \\ \end{array} $$
(20)

All possible clause pairs in the original document \(d\) are regarded as candidates. Assuming the document length is \(\left| d \right|\), all possible emotion cause pairs form a \(\left| d \right| \times \left| d \right|\) matrix, where \(M_{p,q}\) is the score of the candidate pair composed of the p-th emotion clause and the q-th cause clause.

The sigmoid function is further used to activate the emotion cause pair matrix:

$$ \begin{array}{*{20}c} {\tilde{M}_{p,q} = g\left( {M_{p,q} } \right)} \\ \end{array} $$
(21)

The relative position of the emotion and cause clauses is also an important factor influencing accuracy: cause clauses tend to appear close to their emotion clauses, so candidate pairs whose clauses are closer are assigned larger weights, as shown in the following formula.

$$ \begin{array}{*{20}c} {A_{p,q} = \frac{{\left| D \right| - \left| {p - q} \right| + \epsilon }}{\left| D \right| + \epsilon }} \\ \end{array} $$
(22)

where \(\epsilon\) is a smoothing term. The position matrix is then integrated with the emotion cause pair matrix:

$$ \begin{array}{*{20}c} {\hat{M}_{p,q} = \tilde{M}_{p,q} \odot A_{p,q} } \\ \end{array} $$
(23)
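The following sketch covers the bi-affine pairing and position weighting (Eqs. 20-23) under stated assumptions; the value of \(\epsilon\) and the tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

# Sketch of the pair matrix: bilinear scoring (Eq. 20), sigmoid
# activation (Eq. 21), and position weighting (Eqs. 22-23).
class PairMatrix(nn.Module):
    def __init__(self, dim, eps=1.0):
        super().__init__()
        self.Wm = nn.Linear(dim, dim)   # W^m and b^m of Eq. (20)
        self.eps = eps                  # smoothing term epsilon

    def forward(self, z_e, z_c):
        n = z_e.size(0)                                  # document length |d|
        M = self.Wm(z_e) @ z_c.t()                       # Eq. (20), shape (n, n)
        M = torch.sigmoid(M)                             # Eq. (21)
        idx = torch.arange(n, dtype=torch.float)
        A = (n - (idx[:, None] - idx[None, :]).abs() + self.eps) / (n + self.eps)
        return M * A                                     # Eqs. (22)-(23)
```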

3.8 Extraction Results

In the E2EECPE model proposed by Song et al. [18], emotion and cause clause extraction are treated as two independent tasks. This paper improves on that design by feeding the extraction results for emotion and cause clauses into the subsequent emotion and cause clause representations.

Emotion and cause extraction for clauses is computed as follows. First, the feature representation of each clause used for extraction is obtained:

$$ \begin{array}{*{20}c} {z_i^{ae} = \sigma \left( {W^{ae} h_i^{ae} + b^{ae} } \right)} \\ \end{array} $$
(24)
$$ \begin{array}{*{20}c} {z_i^{ac} = \sigma \left( {W^{ac} h_i^{ac} + b^{ac} } \right)} \\ \end{array} $$
(25)

After a further fully connected transformation, the representations are fed into a softmax layer:

$$ \begin{array}{*{20}c} {y^{ae} = softmax\left( {\hat{W}^{ae} z_i^{ae} + \hat{b}^{ae} } \right)} \\ \end{array} $$
(26)
$$ \begin{array}{*{20}c} {y^{ac} = softmax\left( {\hat{W}^{ac} z_i^{ac} + \hat{b}^{ac} } \right)} \\ \end{array} $$
(27)

The final results of emotion extraction and cause extraction are obtained in a similar way to the auxiliary tasks, but with different parameters:

$$ \begin{array}{*{20}c} {\tilde{z}_i^e = \sigma \left( {\tilde{W}^e h_i^e + \tilde{b}^e } \right)} \\ \end{array} $$
(28)
$$ \begin{array}{*{20}c} {\tilde{z}_i^c = \sigma \left( {\tilde{W}^c h_i^c + \tilde{b}^c } \right)} \\ \end{array} $$
(29)

where \(\tilde{W}^e \in R^{d_z \times 2d_h }\), \(\tilde{b}^e \in R^{d_z }\) and \(\tilde{W}^c \in R^{d_z \times 2d_h }\), \(\tilde{b}^c \in R^{d_z }\).
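A minimal sketch of such an extraction head (Eqs. 24-27; Eqs. 28-29 are analogous with separate parameters) is shown below; all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of a clause extraction head: one fully connected projection
# followed by a softmax classifier (Eqs. 24-27). The same structure with
# different parameters yields the final extraction results (Eqs. 28-29).
class ExtractionHead(nn.Module):
    def __init__(self, in_dim, z_dim=100, num_labels=2):
        super().__init__()
        self.fc = nn.Linear(in_dim, z_dim)         # Eq. (24) / (25)
        self.out = nn.Linear(z_dim, num_labels)    # Eq. (26) / (27)

    def forward(self, h):
        z = torch.sigmoid(self.fc(h))              # clause feature z_i
        return torch.softmax(self.out(z), dim=-1)  # extraction probabilities y
```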

3.9 Prediction Layer

\(\hat{M}\) is compared with a preset threshold \(\eta\) to judge whether each candidate is an emotion cause pair:

$$ \left\{ {\begin{array}{*{20}c} {\hat{Y}_{p,q} = 1 \left( {\hat{M}_{p,q} > \eta } \right)} \\ {\hat{Y}_{p,q} = 0 \left( {\hat{M}_{p,q} \le \eta } \right)} \\ \end{array} } \right. $$
(30)

The whole model can be trained by standard gradient descent. The loss combines cross entropy with L2 regularization and consists of the following parts.

The first part is the loss for emotion cause pair extraction:

$$ \begin{array}{*{20}c} {L_{pair} = - \sum \limits_{p,q} Y_{p,q} \log \left( {\hat{M}_{p,q} } \right) - \sum \limits_{p,q} (1 - Y_{p,q} )\log \left( {1 - \hat{M}_{p,q} } \right)} \\ \end{array} $$
(31)

The second part is the loss of emotion extraction and cause extraction for clauses:

$$ \begin{array}{*{20}c} {L_{class} = - \sum \limits_i \left[ { \sum \limits_k y_k^e \log \left( {\hat{y}_k^e } \right) + \sum \limits_k y_k^c \log \left( {\hat{y}_k^c } \right)} \right]} \\ \end{array} $$
(32)

where \(y_k^e\), \(y_k^c\), and \(Y_{p,q}\) denote the ground-truth labels for emotion clause extraction, cause clause extraction, and emotion cause pair extraction, respectively.

If a clause is an emotion clause then \(y^e = 1\), otherwise \(y^e = 0\); the same rule applies to \(y^c\) and \(Y_{p,q}\).

The two auxiliary tasks also require a loss function:

$$ \begin{array}{*{20}c} {L_{aux} = - \sum \limits_i \left[ { \sum \limits_k y_k^e \log \left( {y_k^{ae} } \right) + \sum \limits_k y_k^c \log \left( {y_k^{ac} } \right)} \right]} \\ \end{array} $$
(33)

The final training objective is composed of the above objective function, and coefficients are added to control its influence, as shown in the formula:

$$ L = L_{pair} + L_{class} + \beta L_{aux} + \lambda \left| {\left| \theta \right|} \right|_2 $$
(34)

\(\beta\) adjusts the influence of the auxiliary tasks on the overall model, \(\theta\) denotes all trainable parameters, and \(\lambda\) is the L2 regularization coefficient.
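Putting Eqs. (30)-(34) together, a hedged sketch of the training objective and the thresholded prediction follows; the summed reductions match the written formulas, while the values of \(\beta\), \(\lambda\), and \(\eta\) are illustrative.

```python
import torch
import torch.nn.functional as F

# Sketch of the training objective (Eqs. 31-34) and prediction (Eq. 30).
# beta, lam and eta are illustrative hyperparameter values.
def total_loss(M_hat, Y, ye_pred, yc_pred, yae_pred, yac_pred,
               ye_true, yc_true, params, beta=0.1, lam=1e-5):
    # Eq. (31): binary cross entropy over the pair matrix
    L_pair = F.binary_cross_entropy(M_hat, Y.float(), reduction='sum')
    # Eq. (32): cross entropy of the final extraction results
    # (log of softmax outputs; log_softmax would be more stable in practice)
    L_class = F.nll_loss(ye_pred.log(), ye_true, reduction='sum') + \
              F.nll_loss(yc_pred.log(), yc_true, reduction='sum')
    # Eq. (33): cross entropy of the auxiliary extraction results
    L_aux = F.nll_loss(yae_pred.log(), ye_true, reduction='sum') + \
            F.nll_loss(yac_pred.log(), yc_true, reduction='sum')
    L2 = sum(p.pow(2).sum() for p in params)              # ||theta||_2 term
    return L_pair + L_class + beta * L_aux + lam * L2     # Eq. (34)

def predict_pairs(M_hat, eta=0.5):
    # Eq. (30): threshold the weighted pair matrix
    return (M_hat > eta).long()
```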

The detailed approaches are stated in Algorithm 1.


4 Experiment

4.1 Experiment Setup

Dataset

The dataset was constructed by Xia et al. [16]; each document contains exactly one emotion and one or more corresponding causes. The average document length is 14.77 clauses, and the maximum length is 73. Table 2 shows the proportion of documents with different numbers of emotion cause pairs.

Table 2. Dataset information

To better suit the ECPE task, documents with the same text content are merged into one document. In total there are 1,945 documents containing 2,167 emotion cause pairs, and documents with only one emotion cause pair account for 89.77% of the total.

Experimental Parameters

Experiments were run on an RTX 2080 Ti GPU using the PyTorch framework. Word2Vec vectors trained on Weibo are used as word embeddings, and the number of convolution filters is set to 50. The specific experimental parameters are shown in Table 3.

Table 3. Experimental parameters.

4.2 Experimental Results and Analysis

Table 4 shows a specific document, and the final analysis result is shown in Fig. 3. A lighter color indicates a higher score for the candidate emotion cause pair.

Table 4. Case analysis.

Figure 3 shows that all the output values are small. This is because the sigmoid output of the bi-affine attention layer is multiplied by the position weight matrix, which lowers the scores.

Fig. 3. Result analysis.

As shown in Fig. 4, an ablation experiment was conducted to better analyze the effect of each module proposed in this paper.

Fig. 4. Comparison of emotion cause pair extraction results.

The ablation experiment shows that adding label embedding and feature fusion to the model greatly improves all three evaluation metrics of the ECPE task. Label embedding compensates for the limitations of shared feature parameters. With feature fusion the metrics also increase, indicating that the resulting multi-scale features help the model better represent emotion words and cause words.

To further verify its effectiveness, the Mul-ECPE model is compared with other models in Table 5. The table shows that the proposed model outperforms the latest models in the F1 values of all three tasks; the F1 value on the ECPE task is 0.8% higher than that of the best previous model.

Table 5. Experiment Results.

5 Conclusion

The Mul-ECPE model proposed in this paper addresses the ECPE task. The model makes full use of feature fusion: multi-scale information is incorporated into the word vector representations so that the model can better understand emotion words and cause words. The addition of label embedding exploits the rich interactive relationship information among the three tasks. Experimental results show that the proposed model achieves good results on the ECPE dataset. Existing methods remain at the clause level, so future work can improve the granularity and extract emotion cause pairs at a finer-grained level.