1 Introduction

Emotion cause extraction (ECE), a branch of emotion analysis, aims to discover the corresponding causes of a certain emotion expressed in a document. The task was first defined as a word-level sequence labeling problem [1]. Afterward, Chen et al. [2] observed that emotion causes are often expressed in phrases or sentences, and changed the extraction granularity of the task from word-level to clause-level. Gui et al. [3] treated the ECE task as a clause-level binary classification problem, which aims to detect clause-level causes of a certain emotion expressed in the document. Following the formulation of the ECE task in [3], many methods [4,5,6,7,8] have been proposed to address it. However, these models share an obvious disadvantage: the emotions must be manually annotated before the ECE task can be performed, which significantly limits their practical application scenarios.

To overcome this limitation, Xia et al. [9] proposed a new task called emotion-cause pair extraction (ECPE), which seeks to discover all emotion-cause pairs in a document. A specific example is shown in Fig. 1. The document contains 4 clauses. Clause c4 is an emotion clause, and its corresponding cause clause is c3. The goal of ECPE is to find all latent emotion-cause pairs, e.g., (c4,c3).

Fig. 1

An example used to illustrate the emotion-cause pair extraction task

Meanwhile, Xia et al. [9] put forward a two-stage model to extract emotion-cause pairs. In the first stage, the emotion clauses and cause clauses are extracted. In the second stage, a classification model extracts the emotion-cause pairs from all candidate pairs formed by the Cartesian product of the emotion clauses and the cause clauses. However, this model has an obvious shortcoming: errors generated in the first stage are propagated to the second stage. Therefore, much effort has been devoted to handling the ECPE task in an end-to-end manner [10,11,12,13,14,15,16,17].

Most of these end-to-end methods use multi-task learning to complete the ECPE task, but some of them [10, 14, 17] cannot share feature information among different tasks. We therefore employ a unified framework to jointly perform emotion clause extraction (EE), cause clause extraction (CE), and ECPE. Specifically, the three tasks share the same shallow structure to exchange information between tasks; each task then extracts its own task-specific features and makes the corresponding predictions.

In addition, the ECPE task suffers from a label imbalance problem: among all candidate clause pairs, only a few are emotion-cause pairs, and the vast majority are not. This means the class distribution in the training data is highly uneven. Unfortunately, almost all existing approaches put every candidate pair into the training set and then attempt to identify the positive cases, i.e., the emotion-cause pairs.

Based on the above considerations, we propose a new multi-task learning framework for the ECPE task (ECPE-MTL) that solves the EE, CE, and ECPE tasks simultaneously. It is composed of two modules: a shared module and a task-specific module. The shared module is responsible for generating clause representations, fully mining the information shared among tasks. On top of the shared module, the task-specific module contains three independent parts, each responsible for generating a task-specific representation and performing the corresponding task, i.e., EE, CE, or ECPE. Generally, the causes of an emotion tend to appear in the context of the clause expressing that emotion, and vice versa. Thus, to address the label imbalance of emotion-cause pairs, we construct the training set by sampling a subset of all candidate pairs according to the absolute distance between the two clauses in each candidate pair. The main contributions of our work can be summarized as follows:

  1. We design a multi-task learning framework to jointly perform EE, CE, and ECPE with the aim of gaining mutual benefit.

  2. To handle the imbalanced class distribution in the ECPE task, we design a conceptually simple and effective strategy for constructing the training set.

  3. Experimental results on the benchmark dataset demonstrate that our method achieves better performance on the ECPE task.

2 Related work

2.1 Emotion cause extraction

Gui et al. [3] employed a multi-kernel SVM classifier to perform the ECE task on a publicly available dataset that they constructed, which has since become the benchmark dataset for ECE research. With the development of deep representation learning and the extensive application of the attention mechanism, a set of deep learning based methods [4,5,6,7,8, 18] were proposed for the ECE task. These methods model text sequence information and the relationship between emotional words and clauses to improve emotion cause extraction. For example, Li et al. [5] held that the context of the specific emotion is also a valuable clue for finding the corresponding causes, and designed a co-attention module to exploit the emotion context for the ECE task. Yu et al. [6] believed that the relationships among clauses are also important and proposed a hierarchical framework, which not only takes the semantic information between the emotion description and a clause into consideration, but also considers the relationships among clauses. In addition to the content of the document, Ding et al. [7] found that label information and the relative position between the emotion description and a clause are important for emotion cause extraction. Xia et al. [8] employed the Transformer [21] as the clause encoder to model the relations between clauses and further extract emotion causes. Hu et al. [18] employed graph convolutional networks to encode the semantic and structural information of clauses and achieved superior performance on extracting emotion causes. However, the ECE task has an obvious disadvantage: emotions must be labeled manually before extracting emotion causes. As a result, Xia et al. [9] proposed the ECPE task.

2.2 Emotion-cause pair extraction

In recent years, extensive efforts have been devoted to addressing the ECPE task in an end-to-end fashion. For example, Wei et al. [10] emphasized the importance of the relationships between clauses. They adopted a ranking perspective for this task and selected pairs with high confidence as the emotion-cause pairs. Ding et al. [12] designed a 2D Transformer to model the interaction between pairs, integrating emotion-cause pair representation learning, interaction, and prediction into a unified framework. Additionally, Ding et al. [13] pointed out that extracting the causes without specifying the emotion is unreasonable, and vice versa. Thus, they proposed two dual frameworks for the ECPE task. The first framework, named EMLL, takes every clause in the document as an emotion clause and then employs multi-label learning to extract the corresponding cause clauses in the context of the emotion clause. The second framework, named CMLL, regards every clause in the document as a cause clause and then employs multi-label learning to extract the corresponding emotion clauses in the context of the cause clause. Yuan et al. [11] devised a novel tagging scheme and proposed a sequence labeling model based on Bi-LSTM to extract emotion-cause pairs. Tang et al. [14] argued that existing research failed to exploit the relationship between emotion detection (ED) and ECPE, and thus proposed a multi-task learning framework for the ED and ECPE tasks. Wu et al. [15] proposed a multi-task learning neural network that jointly performs emotion extraction, cause extraction, and emotion-cause relation classification, exploring the interactions among these tasks. Song et al. [16] tackled the ECPE task as predicting directional links between emotions and causes. They designed a multi-task learning model that performs the ECPE task with the help of the auxiliary EE and CE tasks. Yu et al. [19] proposed a mutually auxiliary multi-task model, which adds two auxiliary tasks to build the interaction between emotion extraction and cause extraction. However, the label imbalance issue was not addressed by the above methods. In this paper, we propose an end-to-end multi-task learning model which employs a sampling-based strategy to construct the training set and thus alleviate this problem.

3 Approach

3.1 Problem definition

The input of the ECPE task is a document composed of multiple clauses \(\ D=\left [c_{1},c_{2},\ldots , c_{|D|}\right ]\), where |D| is the number of clauses in it. Every clause \(c_{i}\ \left (i=1,2,\ldots ,|D|\right )\) consists of several words \(c_{i}=\left [w^{i}_{1},{w^{i}_{2}},\ldots ,w^{i}_{|c_{i}|}\right ]\), where \(\left | {{c_{i}}} \right |\) is the length of clause ci. The target of the ECPE task is to find all potential emotion-cause pairs in the document D:

$$ P=\left\{\ldots,(c^{emo}_{j},c^{cau}_{j}),\ldots\right\}, $$
(1)

where \(\left (c^{emo}_{j},c^{cau}_{j}\right )\) is the j-th emotion-cause pair, and \(c^{emo}_{j}\) and \(c^{cau}_{j}\) are the emotion clause and the corresponding cause clause, respectively.
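To make the input and output format concrete, the following minimal sketch uses illustrative placeholder clause strings (not the actual Fig. 1 text) to show a document as a list of clauses and its gold emotion-cause pairs as index tuples:

```python
# A minimal sketch of the ECPE input/output format described above.
# The clause strings are illustrative placeholders, not the actual Fig. 1 text.
from typing import List, Tuple

Document = List[str]                        # D = [c_1, ..., c_|D|], one string per clause
EmotionCausePairs = List[Tuple[int, int]]   # (emotion clause index, cause clause index)

document: Document = [
    "clause c1 (context)",
    "clause c2 (context)",
    "clause c3 (cause)",
    "clause c4 (emotion)",
]

# For the Fig. 1 example, c4 is the emotion clause and c3 is its cause,
# so the expected output is the single pair (c4, c3).
gold_pairs: EmotionCausePairs = [(3, 2)]    # 0-based indices of (emotion, cause)
```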

3.2 Overall architecture

We propose a novel multi-task learning framework to perform the EE, CE, and ECPE tasks simultaneously (shown in Fig. 2), which mainly consists of two modules. The lower module, termed the shared module, generates clause representations. The upper module is a task-specific module, which includes three independent parts. These three parts generate task-specific representations based on the outputs of the shared module and then perform the corresponding tasks, i.e., the EE, CE, and ECPE tasks.

Fig. 2

The overall architecture of ECPE-MTL. The shared module is responsible for generating task-invariant clause embeddings [o1,o2,...,o|D|], where |D| is the number of clauses in the document and oi is the embedding of clause i. The three blue parts in the task-specific module are responsible for the CE, ECPE, and EE tasks. The left part (CE) extracts the cause clauses, and \(\hat {y}_{c}^{i}\) is the probability of clause ci being a cause clause. The middle part (ECPE) extracts the emotion-cause pairs, and \(\hat {y}_{ij}\) denotes the probability of the candidate pair (ci,cj) being an emotion-cause pair. The right part (EE) extracts the emotion clauses, and \(\hat {y}_{e}^{i}\) is the probability of clause ci being an emotion clause

3.3 Shared module

3.3.1 Clause encoder

BERT [20] is a bidirectional pre-trained language model that has brought substantial improvements across downstream NLP tasks. Therefore, our model generates clause representations based on BERT. Specifically, given a document \(D=\left [c_{1},c_{2},\ldots ,c_{|D|}\right ]\) consisting of |D| clauses, with each clause \({c_{i}} = \left [w_{1}^{i},{w_{2}^{i}}, {\ldots } ,w_{\left | {{c_{i}}} \right |}^{i}\right ]\) containing |ci| words, the input to BERT is composed of ci and two additional tokens, formulated as \(\left (\left [CLS\right ],{w_{1}^{i}},{w_{2}^{i}},\ldots ,w_{|c_{i}|}^{i},[SEP]\right )\). The [CLS] token is added at the beginning of each clause, and its final hidden state can be used as a semantic representation of the whole clause. The [SEP] token is added at the end of each clause to separate it from the other clauses. We take the final hidden state of [CLS] as the raw clause representation \(h_{i}\in \mathbb {R}^{d_{B}}\) of clause ci.

Then we employ a fully connected layer for dimension reduction. Finally, we obtain the hidden representations of all clauses in the document D, formulated as \(\left [h_{1},h_{2},\ldots ,h_{|D|}\right ]\in \mathbb {R}^{d_{b} \times |D|}\), which are fed into the Transformer layer.
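A minimal sketch of this clause encoder is given below. It assumes the HuggingFace transformers library and the public bert-base-chinese checkpoint (the paper only states that BERT-Chinese is used), and projects the [CLS] state down to db = 200 dimensions:

```python
# Hedged sketch of the clause encoder: BERT [CLS] state + one fully connected
# layer for dimension reduction. Library and checkpoint names are assumptions.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")      # hidden size d_B = 768
projection = nn.Linear(bert.config.hidden_size, 200)       # dimension reduction to d_b = 200

def encode_clauses(clauses):
    """Return clause representations [h_1, ..., h_|D|] with shape (|D|, d_b)."""
    # The tokenizer prepends [CLS] and appends [SEP] to every clause.
    batch = tokenizer(clauses, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**batch)
    cls_states = outputs.last_hidden_state[:, 0]            # final hidden state of [CLS]
    return projection(cls_states)                           # (|D|, 200)

h = encode_clauses(["clause one", "clause two", "clause three"])
print(h.shape)  # torch.Size([3, 200])
```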

3.3.2 Learning correlations between clauses with Transformer

As we know, clauses in a document do not exist independently, and the correlations between clauses provide helpful information. Generally, grasping contextual cues helps us understand the current clause better. Therefore, we apply the encoder module of the Transformer [21] to generate an updated clause representation by incorporating the information of other clauses into the current clause, which allows the current clause to be understood from the perspective of the whole document. The standard Transformer encoder is a stack of N identical layers, where each layer has two sub-layers. The first sub-layer is a multi-head self-attention layer, and the second is a fully connected feed-forward network.

Multi-head self-attention layer

Single-head attention is the foundation of multi-head attention, and in our setting we adopt single-head attention. Concretely, for each clause ci, we first feed its clause representation \(h_{i}\in \mathbb {R}^{d_{b}}\) into three distinct fully connected layers to produce the query, key, and value vectors, denoted qi, ki, and vi, as follows:

$$ q_{i}=W_{q}h_{i}, $$
(2)
$$ k_{i}=W_{k}h_{i}, $$
(3)
$$ v_{i}=W_{v}h_{i}, $$
(4)

where \(W_{q}\in \mathbb {R}^{d_{q}\times d_{b}}\), \(W_{k}\in \mathbb {R}^{d_{k}\times d_{b}}\), and \(W_{v}\in \mathbb {R}^{d_{v}\times d_{b}}\) are trainable parameters, and dk, dq, dv are the dimensions of the key, query, and value vectors, respectively.

After that, the query vector qi of clause ci is dot-multiplied with all key vectors kj (j = 1,2,…,|D|) to produce a score vector \({Score}_{i} \in \mathbb {R}^{|D|}\) as follows:

$$ {Score}_{i}=[{Score}_{i,1},{Score}_{i,2},\ldots,{Score}_{i,|D|}]^{\top}, $$
(5)
$$ {Score}_{i,j} = q_{i}^{\top} \cdot k_{j},\quad j = 1,2,\ldots,\left| D \right|. $$
(6)

Finally, we normalize the score vector Scorei to obtain the attention weight vector \(A_{i}\in \mathbb {R}^{|D|}\) and compute an output vector \(z_{i}\in \mathbb {R}^{d_{v}}\) as the weighted sum of all value vectors, where zi is the updated clause representation of clause ci:

$$ A_{i}=softmax(\frac{{Score}_{i}}{\sqrt{d_{k}}}), $$
(7)
$$ z_{i}={\sum}_{j}{A_{i,j}v_{j}}. $$
(8)

Intuitively, self-attention maps each input clause embedding into an updated clause embedding that incorporates the learned information of the whole document.

Fully connected feed-forward network layer

The attention sublayer is then followed by a fully connected feed-forward network sublayer:

$$ e_{i}=W_{2}(ReLU(W_{1}z_{i}+b_{1}))+b_{2}, $$
(9)

where \(W_{1}\in \mathbb {R}^{d_{ff}\times d_{v}}\), \(W_{2}\in \mathbb {R}^{d_{v}\times d_{ff}}\), \(b_{1}\in \mathbb {R}^{d_{ff}}\), and \(b_{2}\in \mathbb {R}^{d_{v}}\) are trainable parameters.

To facilitate the training of the Transformer, we add a residual connection followed by layer normalization at the output of each of the above sublayers:

$$ o_{i}=LayerNorm(x_{i}+Sublayer(x_{i})), $$
(10)

where xi is the input of Sublayer and Sublayer(xi) is the output of the sublayer.

As noted above, the standard encoder of Transformer is composed of N identical layers. We take the output of the previous layer as the input of the next layer:

$$ {H^{(l + 1)}} = {O^{(l)}}, $$
(11)

where l denotes the index of encoder layers.

Finally, the Transformer encoder outputs a set of clause embeddings \(\left [o_{1}^{(N)},o_{2}^{(N)},\ldots ,o_{|D|}^{(N)}\right ]\), which we denote as \(\left [ {{o_{1}},{o_{2}}, {\ldots } ,{o_{\left | D \right |}}} \right ]\) for brevity.
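The following PyTorch sketch puts Eqs. (2)-(11) together as one single-head encoder layer. It assumes, as in our setup, that db = dq = dk = dv = 200 so that the residual connections have matching dimensions; it is a simplified illustration rather than the exact implementation:

```python
# Minimal single-head Transformer encoder layer following Eqs. (2)-(11).
import math
import torch
import torch.nn as nn

class SingleHeadEncoderLayer(nn.Module):
    def __init__(self, d_model=200, d_ff=400):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)   # Eq. (2)
        self.w_k = nn.Linear(d_model, d_model, bias=False)   # Eq. (3)
        self.w_v = nn.Linear(d_model, d_model, bias=False)   # Eq. (4)
        self.ffn = nn.Sequential(                             # Eq. (9)
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)                    # Eq. (10), attention sublayer
        self.norm2 = nn.LayerNorm(d_model)                    # Eq. (10), feed-forward sublayer
        self.d_k = d_model

    def forward(self, h):                                     # h: (|D|, d_model)
        q, k, v = self.w_q(h), self.w_k(h), self.w_v(h)
        scores = q @ k.transpose(0, 1) / math.sqrt(self.d_k)  # Eqs. (5)-(6), scaled as in Eq. (7)
        z = torch.softmax(scores, dim=-1) @ v                 # Eqs. (7)-(8)
        z = self.norm1(h + z)                                  # residual + LayerNorm
        return self.norm2(z + self.ffn(z))                     # residual + LayerNorm

layer = SingleHeadEncoderLayer()
o = layer(torch.randn(4, 200))  # updated embeddings for a 4-clause document
```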

3.4 Task-specific module

3.4.1 Multi-task setting

Since the ECPE task is closely related to the EE and CE tasks, we apply multi-task learning to improve emotion-cause pair extraction with the help of the auxiliary EE and CE tasks. Following previous work [22], we generate task-specific representations for each task. Specifically, since the extraction granularity of all three tasks is clause-level, the shared module generates clause representations. The three parts of the task-specific module all share the outputs of the shared module and generate task-specific features for their respective tasks.

In detail, on top of the clause representations [o1,o2,…, o|D|] output by the shared module, we use three different fully connected layers to generate three task-specific feature vectors for EE, CE, and ECPE.

3.4.2 Emotion clause extraction and cause clause extraction

Given a document \(D = \left [ {{c_{1}},{c_{2}}, {\ldots } ,{c_{\left | D \right |}}} \right ]\), EE aims to predict whether the clause ci \(\left ({i = 1,2, {\ldots } ,\left | D \right |} \right )\) is an emotion clause, and CE aims to predict whether clause ci \(\left (i=1,2,\ldots ,|D|\right )\) is a cause clause. For each clause ci, we feed its clause representation \(o_{i}\in \mathbb {R}^{d_{v}}\) into two distinct fully connected layers to get the task-specific clause representations \({h_{e}^{i}}\) for the EE task and \({h_{c}^{i}}\) for the CE task:

$$ {h_{e}^{i}}=W_{e}o_{i}+b_{e}, $$
(12)
$$ {h_{c}^{i}}=W_{c}o_{i}+b_{c}, $$
(13)

where \(W_{e}, W_{c}\in \mathbb {R}^{d_{h}\times d_{v}}\) and \(b_{e}, b_{c}\in \mathbb {R}^{d_{h}}\) are trainable parameters.

After obtaining the task-specific clause representations \({h_{e}^{i}}\) and \({h_{c}^{i}}\), we feed \({h_{e}^{i}}\) into a fully connected layer followed by a logistic function σ(⋅) to predict the probability of clause ci being an emotion clause. Similarly, we feed \({h_{c}^{i}}\) into another fully connected layer followed by a logistic function σ(⋅) to predict the probability of clause ci being a cause clause. The formulas are as follows:

$$ \hat{y}_{e}^{i}=\sigma\left( \hat{W}_{e} {h_{e}^{i}}+\hat{b}_{e}\right), $$
(14)
$$ \hat{y}_{c}^{i}=\sigma\left( \hat{W}_{c} {h_{c}^{i}}+\hat{b}_{c}\right), $$
(15)

where \(\hat {W}_{e}\in \mathbb {R}^{1\times d_{h}}\), \(\hat {b}_{e}\in \mathbb {R}\), \(\hat {W}_{c}\in \mathbb {R}^{1\times d_{h}}\), and \(\hat {b}_{c}\in \mathbb {R}\) are trainable parameters.
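A compact sketch of these two heads (Eqs. (12)-(15)) is shown below; dv = 200 follows the later experimental setup, while dh is treated as an assumed hyperparameter since its value is not stated here:

```python
# Sketch of the EE and CE prediction heads, Eqs. (12)-(15).
import torch
import torch.nn as nn

class EmotionCauseHeads(nn.Module):
    def __init__(self, d_v=200, d_h=200):
        super().__init__()
        self.proj_e = nn.Linear(d_v, d_h)   # Eq. (12): task-specific features for EE
        self.proj_c = nn.Linear(d_v, d_h)   # Eq. (13): task-specific features for CE
        self.out_e = nn.Linear(d_h, 1)      # Eq. (14)
        self.out_c = nn.Linear(d_h, 1)      # Eq. (15)

    def forward(self, o):                   # o: (|D|, d_v) clause embeddings
        h_e, h_c = self.proj_e(o), self.proj_c(o)
        y_e = torch.sigmoid(self.out_e(h_e)).squeeze(-1)  # P(c_i is an emotion clause)
        y_c = torch.sigmoid(self.out_c(h_c)).squeeze(-1)  # P(c_i is a cause clause)
        return y_e, y_c
```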

3.4.3 Emotion-cause pair extraction

For the ECPE task, an intuitive method is to take all candidate pairs as the training set. However, in most cases there is only one emotion-cause pair in a document, which leads to a severe class imbalance. Moreover, according to the dataset constructed by Gui et al. [3], emotion clauses mostly appear in the context of their corresponding cause clauses, and vice versa. Based on the above analysis, we sample a subset of all candidate pairs and use it as the training set.

Concretely, if the absolute distance between clause ci and clause cj is less than or equal to a specific positive value W, we treat the pair as a training sample for ECPE. Consequently, we construct the training set \(\mathcal {P}\) for ECPE:

$$ \mathcal{P} = \left\{ {\left( {{c_{i}},{c_{j}}} \right)|\left| {j - i} \right| \le W} \right\},\ i,j = 1,2, {\ldots} ,\left| D \right|. $$
(16)
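A minimal sketch of this construction (Eq. (16)), using 0-based clause indices and the value W = 4 from our experiments as a default, might look as follows:

```python
# Keep only candidate pairs whose absolute clause distance is at most W, Eq. (16).
def build_candidate_pairs(num_clauses: int, window: int = 4):
    return [
        (i, j)
        for i in range(num_clauses)
        for j in range(num_clauses)
        if abs(j - i) <= window
    ]

pairs = build_candidate_pairs(num_clauses=4)  # all 16 pairs for a short 4-clause document
```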

For each candidate clause pair \(\left (c_{i}, c_{j}\right )\in \mathcal {P}\), we construct its representation by concatenating three vectors, i.e., the clause embedding \(o_{i}\in \mathbb {R}^{d_{v}}\) of clause ci, the clause embedding \(o_{j}\in \mathbb {R}^{d_{v}}\) of clause cj, and their relative position embedding \(r_{j-i}\in \mathbb {R}^{d_{r}}\), which encodes the distance between clauses ci and cj. Then we use a fully connected layer followed by a ReLU function to obtain a task-specific pair representation pij for ECPE:

$$ h_{ij}=\left( o_{i}\oplus o_{j}\oplus r_{j-i}\right), $$
(17)
$$ p_{ij} = ReLU\left( W_{p} h_{ij} + b_{p} \right), $$
(18)

where the relative position embedding rj−i is the same as that in RANKCP [10], dr is the dimension of the relative position embedding, and \({W}_{p}\in \mathbb {R}^{(2\times d_{v}+d_{r})\times (2\times d_{v}+d_{r})}\) and \({b}_{p}\in \mathbb {R}^{(2\times d_{v}+d_{r})}\) are trainable parameters.

Then the pair representation pij is fed into a classifier with the logistic function σ(⋅) to obtain \({\hat {y}}_{ij}\), which denotes the probability of the candidate pair (ci,cj) being an emotion-cause pair:

$$ \hat{y}_{i j}=\sigma\left( \hat{W_{p}}{p}_{ij}+\hat{b}_{p}\right), $$
(19)

where \(\hat {W}_{p}\in \mathbb {R}^{1\times (2\times d_{v}+d_{r})}\) and \(\hat {b}_{p}\in \mathbb {R}\) are trainable parameters.
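The sketch below combines Eqs. (17)-(19) into one pair scorer. For illustration it uses a simple trainable relative-position embedding table clipped to [-W, W]; the paper adopts the embedding of RANKCP [10], so this table and the clipping are assumptions:

```python
# Sketch of the ECPE pair scorer, Eqs. (17)-(19).
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    def __init__(self, d_v=200, d_r=50, window=4):
        super().__init__()
        d_pair = 2 * d_v + d_r
        self.window = window
        self.rel_pos = nn.Embedding(2 * window + 1, d_r)   # r_{j-i}, index shifted by +W
        self.fc = nn.Linear(d_pair, d_pair)                 # Eq. (18)
        self.out = nn.Linear(d_pair, 1)                     # Eq. (19)

    def forward(self, o, pairs):                            # o: (|D|, d_v); pairs: list of (i, j)
        scores = []
        for i, j in pairs:
            dist = max(-self.window, min(self.window, j - i))   # clip the distance to [-W, W]
            r = self.rel_pos(torch.tensor(dist + self.window))  # relative position embedding
            h_ij = torch.cat([o[i], o[j], r])                   # Eq. (17): o_i ⊕ o_j ⊕ r_{j-i}
            p_ij = torch.relu(self.fc(h_ij))                    # Eq. (18)
            scores.append(torch.sigmoid(self.out(p_ij)))        # Eq. (19)
        return torch.stack(scores).squeeze(-1)
```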

In the testing phase, we predict whether each candidate pair in \(\mathcal {P}^{\prime }=\left \{ { {\ldots } ,\left ({{c_{i}},{c_{j}}} \right ), {\ldots } } \right \}\ \left ({i,j = 1,2 {\ldots } ,\left | D \right |} \right )\) is an emotion-cause pair. Specifically, for each candidate pair \(\left (c_{i}, c_{j}\right )\) in \({\mathcal {P}^{\prime }}\), we first construct its pair representation by concatenating the representations of ci and cj and their distance embedding rj−i. For pairs whose relative distance is larger than W, we use rW as the position embedding; for pairs whose relative distance is less than -W, we use r-W. Then we feed the representation into the ECPE part of the task-specific module to obtain the probability \(\hat {y}_{ij}\) of \(\left (c_{i}, c_{j}\right )\) being an emotion-cause pair. Finally, we obtain \(\{\ldots ,\hat {y}_{ij},\ldots \}\) and select the candidate pair with the highest probability as the emotion-cause pair.

In addition, some documents contain multiple emotion-cause pairs. To handle such cases, we take the same approach as RANKCP [10]. Concretely, we select the top-N candidate pairs {p1,p2,…,pN} from \({\mathcal {P}^{\prime }}\) according to their probabilities of being an emotion-cause pair and take p1 as an emotion-cause pair by default. For each remaining pair \(p=({c_{i}^{1}},{c_{j}^{2}})\in \{p_{2},\ldots ,p_{N} \}\), if its probability is larger than a threshold η and the clause \({c_{i}^{1}}\) contains a sentiment word according to a sentiment lexicon [23], we also extract it as an emotion-cause pair.
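The following sketch illustrates this inference rule. Here sentiment_lexicon stands in for the lexicon of [23], eta = 0.5 follows the experimental setup, and the default top_n is an assumption since N is not restated here:

```python
# Sketch of pair selection at test time: keep the top-scoring pair by default,
# keep further pairs only if prob > eta and the emotion clause contains a
# sentiment word. `sentiment_lexicon` is a placeholder for the lexicon of [23].
def select_pairs(pairs, probs, clauses, sentiment_lexicon, top_n=3, eta=0.5):
    ranked = sorted(zip(pairs, probs), key=lambda x: x[1], reverse=True)[:top_n]
    selected = [ranked[0][0]]                       # p1 is an emotion-cause pair by default
    for (i, j), prob in ranked[1:]:
        has_sentiment = any(word in clauses[i] for word in sentiment_lexicon)
        if prob > eta and has_sentiment:
            selected.append((i, j))
    return selected
```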

3.5 Objective function

As shown in (20), (21) and (22), \({\mathscr{L}}_{e}\) is the loss of the EE task, \({\mathscr{L}}_{c}\) is the loss of the CE task, and \({\mathscr{L}}_{p}\) is the loss of the ECPE task:

$$ \mathcal{L}_{e}=-{\sum}_{k} \left( {y_{e}^{k}} \log \left( \hat{y}_{e}^{k}\right)+\left( 1 - {y_{e}^{k}}\right) \log \left( 1-\hat{y}_{e}^{k}\right)\right), $$
(20)
$$ \mathcal{L}_{c}= - {\sum}_{k} \left( {y_{c}^{k}} \log \left( \hat{y}_{c}^{k}\right)+\left( 1-{y_{c}^{k}}\right) \log \left( 1-\hat{y}_{c}^{k}\right)\right), $$
(21)
$$ \mathcal{L}_{p}=\!-{\sum}_{\forall\left( c_{i}, c_{j}\right) \in \mathcal{P}} \left( y_{i j} \log \left( \hat{y}_{i j}\right) + \left( 1 - y_{i j}\right) \log \left( 1 - \hat{y}_{i j}\right)\right). $$
(22)

Since our model trains EE, CE, and ECPE tasks jointly, the objective function is the combination of cross-entropy loss of these tasks with the L2-norm regularization term, which is formulated as follows:

$$ \mathcal{L}=\mathcal{L}_{e}+\mathcal{L}_{c}+\mathcal{L}_{p}+\lambda \|\theta\|_{2}, $$
(23)

where θ is the set of parameters subject to L2-norm regularization and λ is the regularization coefficient.
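A minimal PyTorch sketch of this joint objective (Eqs. (20)-(23)) is shown below; the L2 term is assumed to be handled via the optimizer's weight decay rather than added explicitly:

```python
# Joint training loss: sum of the binary cross-entropy losses of EE, CE, and ECPE.
import torch.nn.functional as F

def joint_loss(y_e_hat, y_e, y_c_hat, y_c, y_p_hat, y_p):
    loss_e = F.binary_cross_entropy(y_e_hat, y_e, reduction="sum")  # Eq. (20)
    loss_c = F.binary_cross_entropy(y_c_hat, y_c, reduction="sum")  # Eq. (21)
    loss_p = F.binary_cross_entropy(y_p_hat, y_p, reduction="sum")  # Eq. (22)
    return loss_e + loss_c + loss_p                                  # Eq. (23) minus the L2 term
```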

4 Experiments

4.1 Dataset

We evaluate our ECPE-MTL model on the benchmark dataset published by Xia et al. [9], which is the only dataset for the ECPE task. This dataset is constructed from an open dataset for emotion cause extraction released by Gui et al. [3]. Statistics of the benchmark dataset are shown in Table 1. Notably, nearly 90% of the documents have only one emotion-cause pair. Moreover, emotion-cause pairs in which the relative distance between the emotion clause ci and the cause clause cj is less than 3 account for 95.8% of all pairs.

Table 1 Statistics of benchmark dataset for ECPE

4.2 Evaluation metrics

Following previous work, we adopt the P, R, and F1 scores as evaluation metrics for the ECPE task:

$$ P=\frac{\sum{correct\_pairs}}{\sum{predicted\_pairs}}, $$
(24)
$$ R=\frac{\sum{correct\_pairs}}{\sum{actual\_pairs}}, $$
(25)
$$ F1=\frac{2\times P\times R}{P+R}, $$
(26)

where predicted_pairs is the number of pairs predicted as emotion-cause pairs, correct_pairs is the number of pairs correctly predicted as emotion-cause pairs, and actual_pairs is the number of actual emotion-cause pairs in the dataset.
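As a quick reference, one way to compute the metrics in Eqs. (24)-(26) is sketched below, treating predicted and gold pairs as sets of clause-index tuples:

```python
# Pair-level precision, recall, and F1, Eqs. (24)-(26).
def pair_prf(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

print(pair_prf(predicted=[(3, 2), (1, 0)], gold=[(3, 2)]))  # (0.5, 1.0, ~0.667)
```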

Similarly, we use the same evaluation metrics as in [3] to evaluate the performance of our model on the EE and CE tasks.

4.3 Experimental setup

Our model is trained with the Adam optimizer; the batch size is set to 4 and the learning rate to 2e-5. In the objective function, the L2 regularization coefficient λ is set to 1e-5. We adopt BERT-Chinese as the clause encoder. The dimension of the clause representations db is set to 200. In the Transformer, the number of encoder layers N is set to 1, the dimensions of the query, key, and value vectors \(\left ({{d_{q}},{d_{k}},{d_{v}}} \right )\) are all set to 200, and the dimension dff of the hidden states is set to 400. The relative distance W is set to 4, and the dimension of the relative position embedding is set to 50. The threshold η is set to 0.5. In our experiments, we divide the data into 10 parts and employ 10-fold cross-validation; the average result over the ten folds is reported as the final result.
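For convenience, the reported setup can be summarized in a single configuration dictionary (values taken directly from the description above):

```python
# Hyperparameters as reported in the experimental setup.
CONFIG = {
    "optimizer": "Adam",
    "batch_size": 4,
    "learning_rate": 2e-5,
    "l2_lambda": 1e-5,
    "clause_encoder": "BERT-Chinese",
    "d_b": 200,                 # clause representation dimension
    "transformer_layers": 1,    # N
    "d_q": 200, "d_k": 200, "d_v": 200,
    "d_ff": 400,
    "window_W": 4,              # relative distance for candidate-pair sampling
    "d_r": 50,                  # relative position embedding dimension
    "eta": 0.5,                 # threshold for extracting additional pairs
    "cv_folds": 10,
}
```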

4.4 Baselines

To prove the effectiveness of our ECPE-MTL model on the ECPE task, we compare it with the following state-of-the-art methods.

Indep

[9] is a two-stage method. It extracts all emotion clauses and cause clauses in the first stage. In the second stage, emotion-cause pairs are extracted based on the results of the first stage.

Inter-CE/Inter-EC

[9] differ from Indep in the first stage. Inter-CE utilizes the results of cause prediction to assist emotion prediction, while Inter-EC utilizes the results of emotion prediction to predict the causes. The second stage of Inter-CE and Inter-EC is the same as that of Indep.

RANKCP

[10] tackles the ECPE task from a ranking perspective, which emphasizes inter-clause modeling and enhances the pair representations. RANKCP/BERT is based on BERT embeddings.

LAE-MANN

[14] emphasizes the connection between EE and ECPE. It employs a multi-level attention module to model word-level interaction between two clauses in the candidate pair. LAE-MANN/BERT is based on BERT embeddings.

MTNECP

[15] performs EE, CE, and ECPE tasks jointly and exploits the interaction between these tasks.

E2EECPE

[16] is a multi-task learning model that jointly executes EE, CE, and ECPE tasks. It regards EE and CE as auxiliary tasks to improve the performance of ECPE.

MAM-SD

[19] is a mutually auxiliary multi-task model that aims to model the mutual interaction between emotion extraction and cause extraction to improve the performance of the ECPE task.

ECPE-tagging

[11] is a tagging method. It tackles ECPE as a tagging task and builds a novel tagging scheme.

ECPE-MLL

[13] integrates two joint frameworks named CMLL and EMLL to solve the ECPE task. CMLL assumes that each clause in the document is an emotion-oriented clause and then finds the corresponding cause clauses in its context. EMLL works in the same way but in the opposite direction. The final result is obtained by integrating the results of CMLL and EMLL. ECPE-MLL/BERT is based on BERT embeddings.

4.4.1 Results on emotion-cause pair extraction

The experimental results are shown in Table 2. Our ECPE-MTL model achieves the best performance in terms of F1 score on ECPE. In addition, its P and R scores are higher than those of most models, which demonstrates the effectiveness of ECPE-MTL. We further compare our model with RANKCP/BERT because we use the same method to extract emotion-cause pairs: overall, our model improves the F1 and P scores on the ECPE task by 1.43% and 4.29%, respectively. In terms of performance, ECPE-MLL/BERT and RANKCP/BERT are currently the two strongest baselines. It can be clearly observed that ECPE-MLL/BERT achieves the highest P score but performs less well on R, which indicates that it extracts fewer positive cases (i.e., emotion-cause pairs) than RANKCP/BERT and our model. Compared with MAM-SD, our model improves P, R, and F1 by 5.85%, 17.58%, and 11.83%, respectively; the main reason is that MAM-SD is still a two-stage model and thus suffers from error propagation. LAE-MANN/BERT performs ED and ECPE jointly; compared with it, our model improves P, R, and F1 by 4.38%, 14.87%, and 9.53%, respectively. Our model exceeds it by a large margin on R because we employ a sampling method to construct the candidate pairs before executing the ECPE task, which helps the model focus on finding the pairs with the emotion-cause relationship. Notably, our model also performs well on EE and CE. We attribute this high performance to the multi-task setting: based on the clause representations output by the shared module, each task generates a task-specific representation that carries information helpful for that task.

Table 2 The experimental results on emotion-cause pair extraction

4.4.2 Results on emotion cause extraction

Before ECPE was proposed, the ECE task had been widely discussed and studied. Therefore, we compare our model with several ECE models, including a traditional machine learning method (Multi-kernel [3]) and several deep learning based methods (MemNet [4], CANN [5], PAE-DGL [7], and RTHN [8]). It should be noted that these models require emotions to be labeled manually before performing the ECE task, whereas our model does not need any annotation in advance. RTHN-APE and CANN-E are variants that remove the emotion annotations during the training phase.

As seen in Table 3, our ECPE-MTL model performs better than these models, which rely on emotion annotations to extract emotion causes. We attribute this to two reasons. On the one hand, the CE task benefits from multi-task learning because our approach performs EE, CE, and ECPE jointly. On the other hand, we utilize BERT to generate powerful clause embeddings.

Table 3 Comparison of the performance of our ECPE-MTL model with other models on emotion cause extraction

4.5 Ablation studies

In this section, we conduct several ablation experiments to verify the effects of several components in our approach.

4.5.1 Effectiveness of multi-task learning

To explore the effect of auxiliary tasks, we conduct an ablation experiment on ECPE-MTL. Concretely, we remove \({\mathscr{L}}_{e}\ +\ {\mathscr{L}}_{c}\) in our objective function and train our model with only \({\mathscr{L}}_{p}\), which we name ECPE-MTL w/o aux. The results are shown in Table 4. We notice that the performance of ECPE-MTL w/o aux drops a lot compared with ECPE-MTL, which demonstrates that ECPE actually benefits from the joint learning of the three tasks.

Table 4 Effectiveness of multi-task learning

4.5.2 Effectiveness of transformer

We apply the Transformer to learn the correlations between clauses, which enables our model to understand the current clause from the perspective of the document. To verify this design choice, we also try using a graph attention network to model the correlations between clauses. We conduct an ablation study with the following ECPE-MTL variant:

ECPE-MTL w/o Transformer+GAT

In this variant, we replace the Transformer with a Graph Attention Network [24] to learn the correlations between clauses.

The results are shown in Table 5. Compared with ECPE-MTL w/o Transformer+GAT, ECPE-MTL performs better on the P, R, and F1 scores. This indicates that the Transformer is better at learning the correlations between clauses in this setting.

Table 5 Effectiveness of Transformer. We employ Graph Attention Network to learn the correlations between clauses

4.5.3 Effectiveness of constructing the training set

In ECPE-MTL, we construct the training set for the ECPE task by selecting candidate pairs (ci,cj) (i,j = 1,2,…,|D|) whose relative distance |j − i| is less than or equal to a specific value W, instead of taking all candidate pairs as the training set. We explore the effectiveness of this training set construction method through an ablation study.

ECPE-MTL w/o Sampling

In this variant, we do not use the sampling strategy but directly put all candidate pairs into the training set.

From Table 6, we can see that ECPE-MTL performs better than ECPE-MTL w/o Sampling in terms of P, R, and F1 score.

Table 6 Effectiveness of constructing training set

This proves that our sampling-based method for constructing the training set is effective: it alleviates the label imbalance problem and enables our model to focus on finding the candidate pairs with the emotion-cause relationship. Since the training set is constructed according to the hyperparameter W, we further explore the influence of different values of W on our model; the results are shown in Fig. 3. Our model achieves the best performance when W is set to 4.

Fig. 3

The influence of the hyperparameter W. We construct the training set of ECPE according to W: if the absolute distance between the two clauses in a candidate pair is less than or equal to W, we treat it as a training sample

5 Conclusion

For the ECPE task, existing works usually tackle the problem in a two-step fashion, and the label imbalance problem of the ECPE task is ignored. To address these shortcomings, in this paper we proposed an end-to-end model that employs multi-task learning to jointly perform the EE, CE, and ECPE tasks. In addition, targeting the imbalanced class distribution of the ECPE task, we proposed a sampling-based method to construct the training set. The experimental results showed that both our multi-task learning model and the sampling-based training set construction are effective. In future work, we plan to proceed in two directions. First, we will attempt to design a unified framework that extracts an emotion first and then extracts its corresponding causes from the context of that emotion, and vice versa, which is more in line with the human thinking process. Second, because emotions and causes are usually expressed by several words rather than a whole clause, we will consider more fine-grained extraction, such as span-level rather than clause-level extraction, to discover emotions and causes.