1 Introduction

In recent years, discovering the causes behind emotions has received increasing attention in the field of sentiment analysis. Gui et al. [1] defined a clause-level emotion cause extraction (ECE) task and provided a new corpus. Given annotated emotions, the ECE task detects the underlying causes behind them; its goal is to judge whether each clause in a document is a cause of the given emotion. Analyzing the causes of emotions has a wide range of applications, such as mining user reviews on e-commerce and internet platforms and monitoring trends in public opinion on social platforms. The ECE task has received extensive attention in follow-up research. Although the task is well defined, two key problems remain: first, it ignores the potential relationship between emotions and causes; second, the emotion must be annotated before the cause can be extracted. These problems limit the application scenarios of the ECE task to a large extent. In response, Xia and Ding [2] proposed a new task named emotion-cause pair extraction (ECPE), which aims to extract all emotion-cause pairs from a given document. As shown in Fig. 1, there are seven sentences in the document, and all clauses are used as input to the model. Clause E contains the emotion “lucky” and clause F contains the emotion “thank”; their corresponding cause clauses are clause D (“Now I have a baby.”) and clause F (“I also want to thank all the people who have helped me.”), respectively. The trained model is expected to output the emotion-cause pairs {E, D} and {F, F}.

Fig. 1 A sample document in the data set

For the ECPE task, Xia and Ding [2] proposed a two-step framework. First, Bi-LSTM-based binary classifiers extract emotion clauses and cause clauses separately; then all candidate emotion-cause pairs are formed by a Cartesian product, and a filter removes the invalid pairs. This two-step framework is effective, but it has two problems. First, errors in the emotion and cause extraction of the first step directly affect the accuracy of the second step. Second, the model relies on a filter to remove incorrect pairs, which results in low prediction accuracy. As a result, some scholars have proposed end-to-end models to solve these problems [3,4,5]. With the popularity of modules such as the transformer and graph convolution, several basic deep learning models have been applied to the pairing of emotions and causes. For example, Ding et al. [3] proposed an end-to-end model called ECPE-2D, which uses a transformer to integrate the two-dimensional representation, interaction, and prediction of emotion-cause pairs. This type of model integrates the hierarchical relationship between the ECPE task and its two subtasks well, and fully learns the relationship between all possible emotion-cause pairs. However, it also has the following two problems, which affect the accuracy and efficiency of prediction:

First, when generating the representation of emotion-cause pairs, existing research considers the relationship between emotion-cause pairs but ignores the relationship between emotion clauses and cause clauses. Learning and strengthening the latter may be more beneficial to the prediction of emotion-cause pairs. Take the ECPE-2D model proposed by Ding et al. [3] as an example. ECPE-2D pairs the emotion and cause representations learned by Bi-LSTM one by one, represents each emotion-cause pair as a vector, and then uses a transformer to learn the relationship between the pairs. As shown in Fig. 2, ECPE-2D first pairs emotion clauses and cause clauses one by one to form a two-dimensional representation (Pair in Fig. 2) and then feeds it into the transformer to learn the relationship between pairs. However, it ignores the relationship between the emotions and the causes themselves (Emo and Cau in Fig. 2). We believe that the causal relationship between the emotion clause and the cause clause within a correct emotion-cause pair is more conducive to predicting that pair. Therefore, learning the relationship between emotion clauses and cause clauses is more important than learning the relationship between pairs, and the above approach fails to capture the complex relationships and correlations at different levels.

Fig. 2 An example comparing the two models in the process of emotion-cause pair extraction

Second, current research also suffers from redundant model architectures and large numbers of trainable parameters. For example, ECPE-2D pairs all clauses one by one and then feeds the pairs into the transformer for multi-level training. This produces a large number of trainable parameters, which leads to a complex network structure, time-consuming training, and lower prediction accuracy.

To solve these two problems, we propose an end-to-end model based on interactive attention, called IA-ECPE. Building on the work of Ding et al. [3] and other scholars [5], we keep the end-to-end network structure. The resulting set of emotion clauses and set of cause clauses are input to the interactive attention module separately, which learns the degree of correlation between emotion clauses and cause clauses. The output of the interactive attention is then fused with multi-level information, and finally the emotion-cause pairs are predicted. Our model not only learns the relationship between pairs, but also considers the interaction between emotion clauses and cause clauses. At the same time, by using BiGRU [6], interactive attention and other modules, the number of model parameters is reduced and the accuracy is improved. The experimental results show that our method achieves the best results on the benchmark corpus. In summary, the main contributions of this article are as follows:

  • We designed an interactive attention module suitable for ECPE tasks, aiming to accurately capture the causal relationship between emotions and causes.

  • We added different levels of fusion mechanisms to the module. At the same time, the complexity of our model architecture has been reduced, and the number of trainable parameters has also been reduced to a certain extent.

  • Our model is tested on a benchmark corpus, and the results show that our method outperforms the state-of-the-art baselines.

The rest of this paper is organized as follows. Section 2 reviews the related research on this task and its recent progress. Section 3 introduces our model, describing in detail the interactive attention mechanism and how the fusion mechanism captures information at different levels. Section 4 presents the experimental details, results and analysis. Finally, Section 5 summarizes the article and outlines future work.

2 Related work

This section introduces related research on the ECE task and the ECPE task.

2.1 Emotion cause extraction

Lee et al. [7] first studied the task of extracting emotion causes. They designed a system based on linguistic rules to detect cause events. Early work used rule-based [8,9,10], common-sense-based [11] and traditional machine learning [12] methods to extract the causes of emotional expressions. Gui et al. [1] proposed an event-driven multi-kernel SVM method and published a benchmark corpus. Feature-based methods [13] and neural methods [14,15,16] have recently been proposed. Xia et al. [17] adopted a transformer encoder enhanced by position information and integrated global prediction embeddings to improve prediction accuracy. Fan et al. [18] used sentiment and position regularizers to restrain parameter learning. Hu et al. [19] used an external emotion classification corpus to pre-train the model. Other work [20] extracted emotion causes in the context of multi-user microblogs. In addition, Kim and Klinger [21] and Bostan et al. [22] treated emotions as structured data and studied the semantic roles of emotions, including trigger phrases, experiencers, targets and causes, as well as readers’ perceptions.

2.2 Emotion-cause pair extraction

Previously, emotion cause extraction required emotion clauses to be annotated in advance. In response to this problem, Xia and Ding [2] first proposed the emotion-cause pair extraction task in 2019, together with a two-step framework that extracts emotion clauses and cause clauses separately and then trains a classifier to filter out negative clause pairs. Recent studies on the ECPE task have proposed end-to-end models, which avoid the cascading errors that may be caused by the two-step method of Xia and Ding [2]. Wei et al. [23] proposed a model named RANKCP, which uses a graph neural network to propagate information between clauses, learns pairwise representations, ranks the candidate emotion-cause pairs based on these representations, and finally outputs the predictions. Tang et al. [24] proposed a model called LAE-Joint-MANN-BERT, which is based on BERT and jointly handles the emotion detection (ED) and ECPE tasks. Specifically, they calculated attention values for the associated clauses among all clauses to express their relevance and importance, and thereby predicted the probability that each pair is an emotion-cause pair. Similarly, Song et al. [25] regarded the ECPE task as a link prediction problem and proposed E2EECPE to solve it. Specifically, the model uses a biaffine attention module to model the relevance and importance of clauses. Ding et al. [3] proposed the ECPE-2D model, which uses a hierarchical self-attention module to calculate the attention matrix between all clauses and deploys a 2D transformer to model the correlation and importance between emotion-cause pairs. Fan et al. [18] transformed the extraction of emotion-cause pairs into a series of actions and modeled them to solve the ECPE task from a different perspective. Specifically, they converted each given document into a directed graph and the original data set into a sequence. On this basis, they trained a model to predict the next state from the current state. Different from the above work, Turcan et al. [26] redefined the ECPE task as a unified sequence labeling problem and designed a unified label to replace the original labels, so that their model can extract emotion-cause pairs of different emotion types and encode adjacent information to improve task performance.

3 Model

In this section, we propose an emotion-cause pair extraction framework based on interactive attention, called IA-ECPE. As shown in Fig. 3, IA-ECPE is a stacked framework in which the lower network provides information to the upper network to optimize the results. In the embedding part of the model, since the input is a document, we use hierarchical encoding, from words to sentences and then from sentences to the document. We encode words with word2vec, use a BiLSTM to obtain the encoding of each sentence from its words, and then feed the sentence encodings into a BiGRU to obtain the representations of the emotion clauses and the cause clauses. These clause representations are then input to the interactive attention module to learn the relationship between emotion clauses and cause clauses, and a fusion mechanism strengthens the internal information of the emotion clauses and the cause clauses, so that complex correlations at different levels are captured. Finally, the resulting vectors are fused with position information to produce the final prediction. The modules we designed are introduced in detail below.

Fig. 3 The overall framework of the joint model based on interactive attention for the emotion-cause pair extraction task

3.1 Interactive attention

Fig. 4 shows the internal framework of the interactive attention module we designed. The module is mainly composed of a feed-forward neural network and a weight calculation module. Specifically, from the output of BiGRU we obtain the representations of the emotion clauses \(emo=\left \{c_{1}^{emo},c_{2}^{emo},\ldots ,c_{N}^{emo}\right \}\) and of the cause clauses \(cau=\left \{c_{1}^{cau},c_{2}^{cau},\ldots ,c_{N}^{cau}\right \}\), where \(N\) is the number of clauses in a document. We first feed each emotion clause and cause clause into a feed-forward neural network with a ReLU non-linearity to update its vector representation, which yields \(e_{i}\) and \(c_{i}\). The module then uses the fusion mechanism to fuse the output of BiGRU with these features, which gives the vector representations \(E_{i}\) and \(C_{i}\) that are finally input to the attention module. The specific formulas are:

$$ e_{i}=F_{1}\left( W^{e}c_{i}^{emo}+b^{e}\right), $$
(1)
$$ c_{i}=F_{2}\left( W^{c}c_{i}^{cau}+b^{c}\right), $$
(2)
$$ E_{i}=f_{a}\left( c_{i}^{emo},e_{i}\right), $$
(3)
$$ C_{i}=f_{a}\left( c_{i}^{cau},c_{i}\right), $$
(4)

where \(e_{i}\) is the updated representation of the i-th emotion clause in the document, \(c_{i}\) is the updated representation of the i-th cause clause, \(W^{e}\) and \(W^{c}\) are the coefficient matrices of the linear transformations for emotion and cause, respectively, \(b^{e}\) and \(b^{c}\) are bias terms, \(F_{1}\) and \(F_{2}\) denote the feed-forward neural network functions, and \(f_{a}\) denotes element-wise addition of two vectors.
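Eqs. (1)-(4) can be illustrated with a minimal Python sketch. It assumes a single-layer feed-forward network whose output has the same dimension as its input, so that the element-wise addition is well defined; the paper does not spell out the layer shapes, and the helper names (`affine`, `feed_forward`, `f_a`, `update_clauses`) are hypothetical rather than the authors' implementation.

```python
def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in v]


def affine(W, x, b):
    """Return W x + b for a matrix W (list of rows), a vector x and a bias b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def feed_forward(W, b, x):
    """F(W x + b) with ReLU as the non-linearity, as in Eqs. (1)-(2)."""
    return relu(affine(W, x, b))


def f_a(u, v):
    """Element-wise addition of two vectors of equal length."""
    assert len(u) == len(v)
    return [p + q for p, q in zip(u, v)]


def update_clauses(clauses, W, b):
    """Fuse each BiGRU output c_i with its feed-forward update, as in Eqs. (3)-(4).

    Assumption: W is square, so F(W c_i + b) has the same size as c_i.
    """
    return [f_a(c, feed_forward(W, b, c)) for c in clauses]
```

With an identity weight matrix and zero bias, `update_clauses([[1.0, -2.0]], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])` adds the ReLU-activated copy of the clause to the clause itself, which shows the residual-style effect of the fusion step.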

Fig. 4 Internal structure of interactive attention

The attention weight matrix is computed jointly from the obtained \(E_{i}\) and \(C_{i}\). Its rows and columns represent the relevance and importance of the emotion clauses and the cause clauses. Specifically, each row gives the degree of relevance of one emotion clause to all the cause clauses in the document, and each column gives the degree of relevance of one cause clause to all the emotion clauses. The attention weight matrix is then normalized. As shown in Fig. 5, the normalization is applied in two directions, along rows and along columns, and the specific formulas are as follows:

$$ a_{ij}={E_{i}^{T}}C_{j}, $$
(5)
$$ \alpha_{ij}=\frac{\exp\left( a_{ij}\right)}{\sum_{m=1}^{N}\exp\left( a_{im}\right)} \quad j\in[1, N], $$
(6)
$$ \beta_{ij}=\frac{\exp\left( a_{ij}\right)}{\sum_{m=1}^{N}\exp\left( a_{mj}\right)} \quad i\in[1, N], $$
(7)

where \(\alpha_{ij}\) and \(\beta_{ij}\) are the correlation weight coefficients corresponding to the emotions and the causes, respectively, and \(a_{ij}\) is the correlation degree in the i-th row and the j-th column of the interactive attention matrix.

Fig. 5 Calculation principle of attention weight matrix
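The two normalization directions can be sketched in plain Python as follows. The sketch assumes that the score between the i-th emotion clause and the j-th cause clause is the dot product of their fused vectors, and it subtracts the maximum before exponentiating for numerical stability, which is a common implementation detail that the paper does not mention. The function names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two vectors."""
    return sum(p * q for p, q in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def interactive_attention(E, C):
    """Return (a, alpha, beta) for emotion vectors E and cause vectors C.

    a[i][j]     : raw score between emotion clause i and cause clause j
    alpha[i][j] : a normalized over j (each row sums to 1)
    beta[i][j]  : a normalized over i (each column sums to 1)
    """
    n_e, n_c = len(E), len(C)
    a = [[dot(E[i], C[j]) for j in range(n_c)] for i in range(n_e)]
    alpha = [softmax(row) for row in a]
    # Normalize each column, then store the result back in row-major order.
    cols = [softmax([a[i][j] for i in range(n_e)]) for j in range(n_c)]
    beta = [[cols[j][i] for j in range(n_c)] for i in range(n_e)]
    return a, alpha, beta
```

Row-wise weights answer "which causes matter for this emotion clause", while column-wise weights answer "which emotions matter for this cause clause"; keeping both is what makes the attention interactive.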

3.2 Fusion mechanism

In the previous calculations, the vector representations of clauses may lose some key information during training, so we add a fusion mechanism to the interactive attention framework to strengthen the correlation between clauses and capture the complex dependencies between different levels. The specific operation is as follows. From the above processing we obtain the attention weights, that is, the correlation coefficient \(\alpha_{ij}\) of each emotion clause with respect to the cause clauses and the correlation coefficient \(\beta_{ij}\) of each cause clause with respect to the emotion clauses. The attention weights are fused with the updated emotion clause representation \(E_{i}\) and cause clause representation \(C_{i}\), and the resulting correlation information is fused with the original information to obtain the attention-based representations \(Emo^{com}\) and \(Cau^{com}\). As shown in Fig. 4, because we consider all possible pairs, we fuse the features of all emotion clauses and cause clauses. The emotion clauses and cause clauses are then paired one by one and concatenated to give the final output \(Y_{ij}\) of the interactive attention. The specific formulas are as follows:

$$ {Emo}_{j}^{com}=f_{a}\left( c_{j}^{emo},\alpha_{ij}C_{j}\right), $$
(8)
$$ {Cau}_{i}^{com}=f_{a}\left( c_{i}^{cau},\beta_{ij}E_{i}\right), $$
(9)
$$ Y_{ij}=\left[f_{a}\left( {Emo}_{i}^{com},F_{2}\left( {Emo}_{i}^{com}\right)\right), f_{a}\left( {Cau}_{j}^{com},F_{2}\left( {Cau}_{j}^{com}\right)\right)\right], $$
(10)

where \({Emo}_{i}^{com}\) and \({Cau}_{j}^{com}\) are the vectors of the emotion and cause clauses obtained by fusing the original information with the attention information, \(f_{a}\) denotes element-wise addition of two vectors, and \(F_{2}\) denotes the feed-forward neural network function with ReLU.
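One possible reading of Eqs. (8)-(10) is sketched below. Eqs. (8) and (9) each leave one index free, so the sketch assumes that the weighted vectors are summed over that index (a weighted sum of cause vectors for each emotion clause, and of emotion vectors for each cause clause); this is our interpretation, not something the text states. It also assumes a single square layer for the ReLU network, and all function names are hypothetical.

```python
def f_a(u, v):
    """Element-wise addition of two vectors."""
    return [p + q for p, q in zip(u, v)]


def weighted_sum(weights, vectors):
    """Sum of vectors scaled by their attention weights."""
    dim = len(vectors[0])
    return [sum(w * vec[k] for w, vec in zip(weights, vectors)) for k in range(dim)]


def relu_layer(W, b, x):
    """ReLU(W x + b) for a square matrix W (assumed shape)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


def fuse(c_emo, c_cau, E, C, alpha, beta, W, b):
    """Return Y, where Y[i][j] concatenates the fused emotion i and cause j."""
    n = len(c_emo)
    # Assumed form of Eq. (8): original emotion vector + attended cause vectors.
    emo_com = [f_a(c_emo[i], weighted_sum(alpha[i], C)) for i in range(n)]
    # Assumed form of Eq. (9): original cause vector + attended emotion vectors.
    cau_com = [f_a(c_cau[j], weighted_sum([beta[i][j] for i in range(n)], E))
               for j in range(n)]
    # Eq. (10): add the feed-forward update, then concatenate each pair.
    emo_out = [f_a(v, relu_layer(W, b, v)) for v in emo_com]
    cau_out = [f_a(v, relu_layer(W, b, v)) for v in cau_com]
    return [[emo_out[i] + cau_out[j] for j in range(n)] for i in range(n)]
```

For a document with N clauses and vectors of size d, the result is an N by N grid of vectors of size 2d, one per candidate emotion-cause pair.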

3.3 Joint expression

Our model is a stacked framework, so the results of the lower-level subtasks may affect the upper-level task. Therefore, we feed the emotion and cause clause representations output by BiGRU into softmax layers to obtain the predicted values \(y^{emo}\) and \(y^{cau}\) of the emotion and cause clauses. Each emotion clause is then paired with each cause clause, and the output \(Y\) of the interactive attention is fused with the predicted values \(y^{emo}\) and \(y^{cau}\) and the position information \(P\) to obtain the representation \(U\) of the emotion-cause pair:

$$ U_{ij}=Concat\left( Y_{ij},y_{i}^{emo},y_{j}^{cau},P_{ij}\right), $$
(11)

where \(W_{1}\) and \(W_{2}\) are weight matrices, \(b_{1}\) and \(b_{2}\) are bias terms, \(P_{ij}\) is the relative position information between the i-th emotion clause and the j-th cause clause, and \(U_{ij}\) represents the joint expression of the i-th emotion clause and the j-th cause clause.
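The concatenation in Eq. (11) can be sketched as below. The sketch assumes that the position information is looked up from a table of embeddings keyed by the relative offset j - i; the table `pos_emb` and the function name are hypothetical, and in the real model the embeddings would be learned.

```python
def joint_representation(Y, y_emo, y_cau, pos_emb):
    """Build U[i][j] = Concat(Y[i][j], y_emo[i], y_cau[j], P_ij).

    Y       : N x N grid of pair vectors from the interactive attention
    y_emo   : per-clause emotion prediction distributions
    y_cau   : per-clause cause prediction distributions
    pos_emb : dict mapping the relative offset (j - i) to a position vector
              (hypothetical lookup table standing in for P_ij)
    """
    n = len(Y)
    U = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Python list addition concatenates the four parts in order.
            U[i][j] = Y[i][j] + y_emo[i] + y_cau[j] + pos_emb[j - i]
    return U
```

Feeding the subtask predictions into the pair representation is what lets the lower-level results inform the upper-level pairing decision.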

3.4 Predictions of affective causes

Through the above calculation, we have obtained the representation U of all possible emotion-cause pairs. Based on common sense, causes generally appear near the emotion. Therefore, we set up a sliding window: for each emotion clause, only the candidate pairs whose cause clause lies within a certain range around it are used for prediction. The sliding window not only narrows the prediction range but also reduces the trainable parameters of the model to a certain extent. We then apply softmax to the resulting windowed representation Uw to make predictions. The loss functions for emotion clause classification, cause clause classification and emotion-cause pair classification are as follows:

$$ L^{emo}=-\sum\limits_{i=1}^{\lvert N\rvert}{Y_{i}^{e}}\cdot \log\left (y_{i}^{emo} \right ), $$
(12)
$$ L^{cau}=-\sum\limits_{i=1}^{\lvert N\rvert}{Y_{i}^{c}}\cdot \log\left (y_{i}^{cau} \right ), $$
(13)
$$ L^{pair}=-\sum\limits_{i=1}^{\lvert N\rvert}\sum\limits_{j=1}^{\lvert M\rvert}u_{i,j}^{pair}\cdot \log\left (U_{ij} \right ), $$
(14)

where \({Y_{i}^{e}}\), \({Y_{i}^{c}}\), \(u_{i,j}^{pair}\) are the ground truth distributions of emotion, cause and emotion-cause pair, respectively. Combining the three gives the final loss function \(L\):

$$ L=L^{emo}+L^{cau}+L^{pair}+\lambda \left \| \theta \right \|^{2}, $$
(15)

where \(\lambda\) is the weight of the \(\ell_{2}\) regularization term and \(\theta\) represents all model parameters.
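The sliding window and the combined loss in Eqs. (12)-(15) can be sketched as follows. The sketch assumes that the window keeps every pair whose clause indices differ by at most `window`, which is one reading of "within a certain range", and that pair predictions are stored in dictionaries keyed by (i, j). A small `eps` guards the logarithm; it is an implementation detail not taken from the paper. All names are hypothetical.

```python
import math


def window_pairs(n, window):
    """All (i, j) index pairs with |i - j| <= window (assumed window rule)."""
    return [(i, j) for i in range(n) for j in range(n) if abs(i - j) <= window]


def cross_entropy(target, pred, eps=1e-12):
    """-sum_k target_k * log(pred_k) for one distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))


def total_loss(emo_true, emo_pred, cau_true, cau_pred,
               pair_true, pair_pred, params, lam):
    """L = L_emo + L_cau + L_pair + lam * ||theta||^2, as in Eq. (15)."""
    l_emo = sum(cross_entropy(t, p) for t, p in zip(emo_true, emo_pred))
    l_cau = sum(cross_entropy(t, p) for t, p in zip(cau_true, cau_pred))
    # Only pairs kept by the sliding window contribute to the pair loss.
    l_pair = sum(cross_entropy(pair_true[k], pair_pred[k]) for k in pair_true)
    l2 = sum(w * w for w in params)
    return l_emo + l_cau + l_pair + lam * l2
```

For a document with n clauses, the window reduces the number of scored pairs from n squared to roughly n times (2 * window + 1), which is where the saving in prediction range comes from.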

4 Experiment

In this section, we will introduce the details of our experimental research and evaluate the feasibility and effectiveness of our model.

4.1 Dataset, metrics and experimental settings

We evaluate the model on two publicly available ECPE datasets [2, 5]. The Chinese dataset is based on the corpus constructed by Gui et al. [1] and contains 1945 documents from news websites, and the English dataset contains 2843 documents from English novels. We divide each data set into a training set (90% of the data) and a testing set (the remaining 10%). To obtain statistically reliable results, we perform 10-fold cross-validation and report the average results. We use the same data partitioning as Xia and Ding [2]. We use precision, recall and F1 score as evaluation metrics. F1, the main metric, combines precision and recall. They are calculated as follows:

$$ P=\frac{\sum{correct_{ECP}}}{\sum{proposed_{ECP}}}, $$
(16)
$$ R=\frac{\sum{correct_{ECP}}}{\sum{annotated_{ECP}}}, $$
(17)
$$ F1=\frac{2\times P\times R}{P+R}, $$
(18)

where \(correct_{ECP}\) is the number of emotion-cause pairs that are both annotated and predicted, \(proposed_{ECP}\) is the number of emotion-cause pairs predicted by the model, and \(annotated_{ECP}\) is the total number of emotion-cause pairs annotated in the data set.
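As a small worked illustration of Eqs. (16)-(18), the following Python sketch computes the three metrics over a list of documents. It assumes that each document's pairs are given as a set of (emotion index, cause index) tuples; the function name and data layout are hypothetical.

```python
def pair_prf(predicted, annotated):
    """Precision, recall and F1 over emotion-cause pairs.

    predicted, annotated: lists with one set of (emotion_idx, cause_idx)
    tuples per document, aligned by position.
    """
    correct = sum(len(p & a) for p, a in zip(predicted, annotated))
    proposed = sum(len(p) for p in predicted)
    gold = sum(len(a) for a in annotated)
    precision = correct / proposed if proposed else 0.0
    recall = correct / gold if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, if the annotated pairs of the document in Fig. 1 are written as {(5, 4), (6, 6)} (clauses E, D and F numbered by position) and a model proposes those two pairs plus one wrong pair, precision is 2/3, recall is 1 and F1 is 0.8.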

The experimental parameter settings are as follows. We use word2vec [27] to pre-train the word vectors on a Chinese Weibo corpus, with a dimension of 200. The number of hidden units of the BiLSTM and BiGRU used in the model is set to 100. The hidden state in the attention mechanism is set to 30. In the feed-forward neural network, the dimension of the middle layer is 30, and the dropout is set to 0.9. The initial dimension of the relative position vector is 50. The sliding window size is 3. During training, we set the learning rate to 0.005 and the batch size to 32. For optimization, we use the Stochastic Gradient Descent (SGD) algorithm and the Adam optimizer. λ is set to 1e-5.
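For reference, the settings above can be collected in a single configuration object. The sketch below is only a convenience; the class and field names are hypothetical, and the text does not say whether the dropout value of 0.9 is a drop rate or a keep probability, so the field simply records the reported number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IAECPEConfig:
    """Hyperparameters as reported in the text (field names are ours)."""
    embedding_dim: int = 200      # word2vec vector size
    rnn_hidden_size: int = 100    # BiLSTM / BiGRU hidden size
    attention_hidden: int = 30    # hidden state in the attention mechanism
    ffn_hidden: int = 30          # middle layer of the feed-forward network
    dropout: float = 0.9          # as reported; rate vs. keep probability unspecified
    position_dim: int = 50        # relative position vector size
    window_size: int = 3          # sliding window size
    learning_rate: float = 0.005
    batch_size: int = 32
    l2_weight: float = 1e-5       # lambda in the loss function


config = IAECPEConfig()
```

Freezing the dataclass keeps the reported values from being changed accidentally during an experiment run.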

4.2 Baseline models

We compare our model with the following baseline models:

  • Indep: The first model proposed by Xia and Ding [2], which treats emotion extraction and cause extraction as two independent tasks. Emotions and causes are extracted by multi-task learning and are then paired and filtered by a filter.

  • Inter-CE: An interactive multi-task learning method that uses the prediction of cause extraction to improve emotion extraction. The rest of the model is the same as Indep.

  • Inter-EC: This is another interactive multi-task learning method that uses the prediction of emotion extraction to improve the cause extraction. The rest of the model is the same as Indep.

  • ECPE-2D: Proposed by Ding et al. [3], this model realizes the interaction of all emotion-cause pairs in 2D form and uses the self-attention mechanism to calculate the attention matrix of emotion-cause pairs. Here we choose its better-performing Inter-EC variant.

  • E2EECPE: An end-to-end model proposed by Song et al. [25]. This model is a multi-task learning linkage framework that uses a biaffine attention to mine the relationship between any two clauses.

4.3 Overall performance

The overall performance of our model is shown in Table 1. Our model achieves the best F1 on all three tasks, which demonstrates its effectiveness.

Table 1 The performance of our model and the baseline models on the ECPE task and the two subtasks in terms of precision, recall and F1

First, our model is significantly better than Indep, Inter-CE and Inter-EC, both on the two subtasks and on the ECPE task. Compared with Inter-EC, the best of the three, our model improves F1 by about 5.28% on the emotion-cause pair extraction task.

Second, we compare our model with ECPE-2D, the best-performing baseline. Our model improves on all three tasks, with the most notable F1 gain on the ECPE task. This demonstrates the effectiveness of the interactive attention module and supports our hypothesis that the relationship between emotion and cause may be more conducive to pairing than the relationship between pairs. At the same time, our model uses a simpler architecture to achieve similar or even better results than ECPE-2D and also requires fewer parameters, as shown in Table 2: compared with ECPE-2D, the number of parameters is reduced by 7.73%.

Table 2 The parameters of our model and ECPE-2D

To verify the robustness of our model, we also apply it to the English dataset. According to Table 3, on the emotion extraction task, IA-ECPE is only slightly higher than ECPE-2D in recall. However, on the cause extraction task, the P, R and F1 values of IA-ECPE are all better than those of the baseline models, and F1 is 2.62% higher than the best baseline. This indicates that our model can effectively capture the underlying causes in a document. On the emotion-cause pair extraction task, IA-ECPE also achieves better results. We speculate that this is because our model improves the accuracy of cause clause extraction, which in turn improves the accuracy of emotion-cause pair extraction.

Table 3 Experimental performance of emotion-cause pair extraction, emotion extraction and cause extraction on English datasets

4.4 Ablation study

To verify the effectiveness of the individual modules of our model, we conducted ablation experiments, as shown in Table 4. First, “-IA” means that the interactive attention we designed is removed and the output of BiGRU is used directly to form the joint representation and make predictions. According to Table 4, F1 decreases to varying degrees on all three tasks, which indicates that removing the interactive attention reduces prediction accuracy. At the same time, even without interactive attention, our model achieves a better F1 on the ECPE task than ECPE-2D, although its subtask results are not as good as those of ECPE-2D. This shows that our framework has certain advantages in pairing emotions and causes.

Table 4 Ablation study of our method

Second, we conducted an ablation experiment on the fusion mechanism. “+IA-M” means that the interactive attention is added but the fusion mechanism inside it is removed. Compared with “-IA”, the F1 of all three tasks increases and P also increases significantly, which demonstrates the effectiveness of interactive attention. In terms of R, the full IA-ECPE model improves on both subtasks and on the ECPE task compared with +IA-M. We conjecture that the fusion mechanism in IA-ECPE helps to improve recall, and that this improvement in R is what raises F1 on the three tasks. These results demonstrate the effectiveness of the proposed IA-ECPE framework.

5 Conclusion

In response to emotion-cause pair extraction (ECPE), a new task in sentiment analysis in recent years, we propose an end-to-end model based on interactive attention. In our model, the emotion clause set and the cause clause set are input to the interactive attention module to learn the relationship between emotions and causes and to obtain new clause representations, which are then fused and paired. This approach not only learns the relationship between pairs, but also considers the interaction between emotion clauses and cause clauses. At the same time, the use of BiGRU and interactive attention reduces the number of model parameters, shortens training time and improves accuracy. The experimental results show that our method performs well on the benchmark data set, and the ablation experiments further confirm its effectiveness. In the future, we plan to study conditional random fields for the ECPE task and apply them to our model, and to use BERT for pre-trained word representations in our framework. We think this may further improve prediction accuracy.