
1 Introduction

Open-domain conversation generation has made remarkable progress over recent years, relying on deep learning and neural networks [5, 19, 27, 32]. However, previous works primarily centre around improving the linguistic quality of the generated responses, such as grammatical correctness, content variety, and topic relevance, neglecting the important factor of emotion [31]. The information conveyed by humans during communication contains not only syntactic and semantic information but also emotional information. Emotion is one of the essential cognitive behaviors in humans, and artificial intelligence has the objective of enabling machines to mimic human intelligent behaviors. As an important research branch of AI, one of the long-term goals of dialogue systems is to enable machines to perceive, comprehend and express emotions. In addition, studies [13,14,15] have shown that introducing emotional information into conversation systems can improve user engagement and satisfaction, make human-computer conversation more natural, and reduce the number of conversation terminations.

Empathetic conversation generation is a new research hotspot in the NLP community, and most existing approaches focus on identifying the emotion category of the input sequence and generating a response based on the predicted emotion label. Song et al. [23] introduce an external emotion lexicon into the generation module to achieve explicit and implicit emotion expression. Li et al. [9] create a coarse-grained and fine-grained emotion representation for the input sequence by using an emotion classifier to identify the input sequence’s emotion category and an external emotion lexicon to identify the emotion terms. Majumder et al. [12] improve the empathetic expression in the generated response by mimicking the input sequence’s emotion while taking into account its emotional polarity. Firdaus et al. [3] incorporate multiple emotions to varying degrees in the generation of responses to make the model more anthropomorphic.

Table 1. Examples of empathetic conversation.

Existing works mainly focus on emotion-related issues and pay less attention to content-relevance. However, a case study of the responses generated by existing models shows that they do not reliably guarantee the content-relevance of the generated responses. As shown in Table 1, in case 1, EmpDG [9] generates an emotionally irrational and irrelevant response, whereas GPT2 [16] expresses empathy for the user’s emotion but generates a response that deviates from the conversation’s topic (from diet to age); in case 2, EmpDG and GPT2 both focus too much on the user’s loneliness and fail to develop the conversation around the user’s specific situation, resulting in a safe response that is also irrelevant to the situation.

We suggest there are two main reasons. Firstly, as Gao et al. [4] pointed out, existing works deal with emotions on a surface level and do not consider the underlying causes of the emotion, which makes it difficult to comprehend the user’s complicated emotions and harms the subsequent steps of emotion prediction and empathetic conversation generation. Secondly, the emotion category is a strong supervisory signal, and overemphasizing it when generating responses can easily lead to safe responses tied to specific emotion categories. For the examples in Table 1, if the model can accurately capture the emotion cause in the input sequence (as highlighted in red) and incorporate it into the generation process, it can better understand the user’s emotion and generate more relevant responses by developing topics around the facts conveyed by the user.

To this end, we propose an empathetic conversation generation model enhanced by emotion cause to improve the content-relevance of generated responses. Specifically, our model involves two components, an emotion cause extractor and an empathetic conversation generator. In order to accurately identify emotion cause in the absence of large-scale labeled data, we present a semi-supervised training method to optimize the emotion cause extractor. To integrate the extracted emotion cause into the empathetic conversation generator and minimize the damage to the general language knowledge already learned by the pre-trained language model, we introduce a biased self-attention mechanism to enhance the model’s attention to the emotion cause when generating responses.

The contributions of our work are summarized as follows:

  • To compensate for the scarcity of large-scale word-level emotion-cause labeled datasets, a semi-supervised training method using labeled and unlabeled data for joint training is proposed.

  • To integrate the extracted emotion cause into the generation process, a biased self-attention mechanism that does not introduce new additional parameters is proposed.

  • Experimental results indicate that our proposed model outperforms the baselines and improves the content-relevance of the generated responses.

2 Related Work

Empathetic conversation generation has made great progress in recent years. Several works [18, 20, 21, 23, 26, 30] attempt to make dialogue models more empathetic and have achieved promising results. Song et al. [23] introduce an external emotion lexicon into the generation module to achieve explicit and implicit emotion expression. Shen et al. [20] present a novel framework that extends the emotional conversation generation through a dual task and alternatively generates the responses and queries. Welivita et al. [26] combine dialogue intent modeling and neural response generation to obtain more controllable and empathetic responses. Zheng et al. [30] propose a multi-factor hierarchical framework to model communication mechanism, dialog act and emotion in a hierarchical way. Sabour et al. [18] introduce external commonsense information to absorb additional information about the situation and help the model better understand the user’s emotion.

Emotion cause extraction is intended to discover the stimulus reasons behind the user’s emotion [2, 7]. Although there have been many excellent works in this research direction [1, 24, 28], most of the existing datasets are at the sentence/sub-sentence level [6]. A large-scale word-level emotion-cause labeled dataset is still lacking.

Most existing approaches to empathetic conversation generation only consider superficial emotional information in the dialogue context and ignore deeper emotion causes. Recently, some studies [4, 6] have attempted to investigate the emotion cause in empathetic conversation generation, resulting in more relevant and empathetic responses. Since there is no large-scale word-level emotion-cause labeled dataset, Gao et al. [4] train an emotion cause extractor using a sentence-level labeled dataset and then automatically construct a word-level labeled dataset. Kim et al. [6] use a Bayesian conditional probability formula based on the emotion category of the dialogue context to train an emotion cause extractor in a weakly supervised way. To incorporate the emotion cause into the process of generating responses, Gao et al. [4] introduce a soft gating mechanism and a hard gating mechanism to make the model boost its attention on the emotion cause, while Kim et al. [6] introduce the RSA framework, which is essentially a Bayesian conditional probability-based response rewriting module built on the original decoder.

3 Task Formulation

Emotion Cause Extraction. Given an input sequence \(X_e=\left( x_1,x_2,...,x_k \right) \), the goal is to predict the emotion cause probability \(C=\left( c_1,c_2,...,c_k \right) \) that indicates whether the token is an emotion cause. Specifically, we add special tokens [CLS] and [SEP] at the beginning and end of the sequence, respectively (as shown in Fig. 1).

Empathetic Conversation Generation. Given an input sequence \(X_g=\left( x_1,x_2,...,x_n \right) \), the goal is to generate a response \(Y=\left( y_1,y_2,...,y_m \right) \) that is empathetic and relevant to the conversation. Specifically, following previous works [4, 10, 22], we concatenate all utterances in the dialogue context together as input and separate utterances by [SEP] tokens (as shown in Fig. 1).
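The two input formats described in this section can be illustrated with a minimal sketch at the token-list level. The function names `build_extractor_input` and `build_generator_input` are hypothetical, and the literal strings `[CLS]` and `[SEP]` are taken from the text; an actual tokenizer may use different special-token strings.

```python
CLS, SEP = "[CLS]", "[SEP]"  # special-token strings as written in the text

def build_extractor_input(tokens):
    """Wrap a token list as [CLS] x_1 ... x_k [SEP] (hypothetical helper)."""
    return [CLS] + list(tokens) + [SEP]

def build_generator_input(utterances):
    """Concatenate tokenized utterances, separated by [SEP] (hypothetical helper)."""
    sequence = []
    for index, utterance in enumerate(utterances):
        if index > 0:
            sequence.append(SEP)  # separator only between utterances
        sequence.extend(utterance)
    return sequence

extractor_input = build_extractor_input(["my", "dog", "died"])
generator_input = build_generator_input([["my", "dog", "died"], ["i", "am", "sorry"]])
```

Whether a trailing [SEP] is also appended to the generator input is not stated in the text, so this sketch places separators only between utterances.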

Fig. 1. The overview of our proposed ECE and ECG.

4 Approach

Our proposed emotion-cause-enhanced empathetic conversation generation model consists of two main modules: Emotion Cause Extractor and Empathetic Conversation Generator. The overview is shown in Fig. 1. Since there is no large-scale word-level emotion cause dataset available, we present a semi-supervised training method to obtain the emotion cause extractor using small-scale labeled data jointly trained with large-scale unlabeled data. To involve the emotion cause in the generation process, we introduce multiplicative signals to implement the biased self-attention mechanism. The multiplicative signal enhances the model’s attention to the emotion cause in the generation process and improves the content-relevance of the generated responses.

4.1 Emotion Cause Extractor

The RoBERTa model [11] created by stacking the Transformer encoder [25] can better model contextual information in both directions. We construct the Emotion Cause Extractor (ECE for short) based on the RoBERTa to identify the emotion categories of the input sequence and its emotion causes. Thus the tasks of the ECE can be divided into emotion recognition and emotion cause detection.

Emotion Recognition. Emotion recognition is a classification problem aiming to predict the emotion category of the input sequence. Given an input sequence \(X_e\), the forward propagation process of the model can be defined as:

$$\begin{aligned} H_h^E&= \textrm{RoBERTa} {\left( X_e \right) } \end{aligned}$$
(1)
$$\begin{aligned} P&= \textrm{softmax} {\left( W_eH_{h,1}^E + b_e \right) } \end{aligned}$$
(2)

where \(H_h^E\) denotes the output of the last hidden layer, and \(H_{h,1}^E\) denotes the output of the first token (i.e., [CLS]) in the last hidden layer. \(W_e\) and \(b_e\) denote the parameters of the feed-forward neural network.

After obtaining the probability distribution P of emotion category, the emotion category of the \(X_e\) can be defined as \(\mathcal {E} = \textrm{argmax} {\left( P \right) }\).

We employ the following loss function to optimize the parameters:

$$\begin{aligned} \mathcal {L}_{emo} \left( P \right) = -\sum _{i \in labels} t \left( i \right) \log p_i \end{aligned}$$
(3)

where \(labels = \left\{ 1,2,...,s \right\} \) denotes the set of emotion categories, \(p_i\) denotes the i-th component of P, and \(t \left( i \right) \) denotes the ground truth distribution corresponding to the input sequence.
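As a small numeric illustration of Eqs. (2) and (3), the sketch below applies a softmax to hypothetical logits (standing in for \(W_eH_{h,1}^E + b_e\)), takes the argmax as the predicted category, and evaluates the cross-entropy under the assumption that the ground truth distribution is one-hot. The logit values and function names are illustrative only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def emotion_loss(probs, gold_index):
    """Eq. (3) with a one-hot ground truth: the sum reduces to -log p_gold."""
    return -math.log(probs[gold_index])

logits = [2.0, 0.5, -1.0]                 # hypothetical classifier outputs
probs = softmax(logits)                   # distribution P of Eq. (2)
predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax(P)
loss = emotion_loss(probs, gold_index=0)
```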

It is noted that the input representation of the RoBERTa contains both word embedding and positional embedding:

$$\begin{aligned} H_0^E = X_eW_e^W + X_e^PW_e^P \end{aligned}$$
(4)

where \(W_e^W\) denotes the word embedding matrix, \(X_e^P\) denotes the absolute position of tokens in \(X_e\), and \(W_e^P\) denotes the positional embedding matrix.

Emotion Cause Detection. Emotion cause detection is a sequence labeling problem that aims to predict whether each token in the input sequence is the emotion cause, i.e., a word-level \(\left\{ 0,1 \right\} \) labeling problem. Since no large-scale word-level emotion cause dataset is available, this section proposes a semi-supervised training method using small-scale labeled data jointly with large-scale unlabeled data.

For the labeled data, given an input sequence \(X_e\), the context-aware word representation is obtained by encoding using the RoBERTa. Then, a layer of the feed-forward neural network is used for \(\left\{ 0,1 \right\} \) sequence labeling:

$$\begin{aligned} H_h^E&= \textrm{RoBERTa} {\left( X_e \right) } \end{aligned}$$
(5)
$$\begin{aligned} \widehat{C}&= \textrm{softmax} {\left( W_cH_h^E + b_c \right) } \end{aligned}$$
(6)

where \(\widehat{C}\) represents the emotion cause probability of each token, \(W_c\) and \(b_c\) denote the parameters of the feed-forward neural network.

The loss function applied for parameter learning is as follows:

$$\begin{aligned} \mathcal {L}_{cau} \left( \widehat{C} \right) = -\sum _{i=1}^k \log \textrm{P} {\left( \widehat{C}_i \right) } \end{aligned}$$
(7)

where k indicates the length of the input sequence, and \(\textrm{P} {\left( \cdot \right) }\) denotes obtaining the probability corresponding to the ground truth label of each token.
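Eq. (7) can be read as a sum of token-level negative log-likelihoods. The following sketch assumes each token is represented by its predicted probability of being an emotion cause, so that the probability of the ground truth label is either that value or its complement; the function name `cause_loss` is hypothetical.

```python
import math

def cause_loss(cause_probs, gold_labels):
    """Sum over tokens of -log P(gold label), following Eq. (7).

    cause_probs[i] is the predicted probability that token i is a cause;
    gold_labels[i] is the 0/1 ground truth for token i.
    """
    total = 0.0
    for p_cause, label in zip(cause_probs, gold_labels):
        p_gold = p_cause if label == 1 else 1.0 - p_cause
        total -= math.log(p_gold)
    return total

confident = cause_loss([0.9, 0.1, 0.8], [1, 0, 1])
uncertain = cause_loss([0.5, 0.5, 0.5], [1, 0, 1])
```

A prediction that agrees more strongly with the labels yields a smaller loss, which is what `confident < uncertain` reflects here.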

For the unlabeled data, we observe that the model needs to pay attention to the emotion cause when predicting the emotion category of the input sequence. Thus the attention weight distribution of the model in predicting emotion categories can be used to predict whether each token is an emotion cause or not. Given an input sequence \(X_e\), emotion recognition is performed using the RoBERTa to obtain the attention weight distribution \(Att^{CLS}\) of the first [CLS] token in the last hidden layer. Then, simple filtering based on the rules (including removing punctuation, special words, stop words, etc.) is applied, and the tokens with top-k weights are selected as the emotion cause of the input sequence. In this way, emotion cause labels can be automatically constructed for unlabeled data, and the rest of the processing is similar to labeled data.
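The automatic labeling step can be sketched as follows, assuming the [CLS] attention weights are already available as a list aligned with the tokens. The stop-word list, the example sentence, the attention values, and the function names are all hypothetical and serve only to illustrate rule-based filtering followed by top-k selection.

```python
import string

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}
STOP_WORDS = {"i", "a", "the", "my", "was", "is", "to", "and", "of"}  # illustrative only

def is_candidate(token):
    """Rule-based filter: drop special tokens, stop words, and punctuation."""
    if token in SPECIAL_TOKENS or token.lower() in STOP_WORDS:
        return False
    return not all(ch in string.punctuation for ch in token)

def auto_label(tokens, cls_attention, top_k):
    """Mark the top-k surviving tokens by [CLS] attention weight as causes."""
    candidates = [i for i, token in enumerate(tokens) if is_candidate(token)]
    ranked = sorted(candidates, key=lambda i: cls_attention[i], reverse=True)
    chosen = set(ranked[:top_k])
    return [1 if i in chosen else 0 for i in range(len(tokens))]

tokens = ["[CLS]", "my", "dog", "died", "yesterday", ".", "[SEP]"]
attention = [0.30, 0.05, 0.20, 0.25, 0.10, 0.08, 0.02]  # hypothetical weights
labels = auto_label(tokens, attention, top_k=2)
```

In this example the [CLS] token has the largest weight but is filtered out, so the labels fall on "dog" and "died".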

However, the above method of automatic emotion cause labeling requires converting each token from vector to text in the implementation and then performing rule-based filtering. As a result, the computational graph of the automatic emotion cause labeling module is not fully connected to that of the emotion cause detection module, i.e., the loss function \(\mathcal {L}_{cau}\) of emotion cause detection is not differentiable with respect to \(Att^{CLS}\) and cannot directly contribute to the optimization of \(Att^{CLS}\). Thus we propose an additional auxiliary loss function to link the computational graphs and introduce a regularization constraint by computing the vector inner product of \(Att^{CLS}\) and \(\widehat{C^1}\):

$$\begin{aligned} \mathcal {L}_{aux} \left( Att^{CLS},\widehat{C} \right) = Att^{CLS} \cdot \widehat{C^1} \end{aligned}$$
(8)

where \(\widehat{C^1}=\widehat{C} \left[ 1,: \right] \) denotes the probability that each token is the emotion cause.

In summary, we employ the following loss function to optimize the emotion cause extractor:

$$\begin{aligned} \mathcal {L}^{ECE} = \lambda _1 \mathcal {L}_{emo} + \lambda _2 \mathcal {L}_{cau} + \lambda _3 \mathcal {L}_{aux} \end{aligned}$$
(9)

where \(\lambda _i\) indicates the weight of each loss function (we set \(\lambda _1=1/3\), \(\lambda _2=\lambda _3=1\)).
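A minimal sketch of Eqs. (8) and (9) with plain Python numbers is given below. It only shows how the terms are combined with the stated weights; in actual training these quantities would be tensors in an automatic differentiation framework, and the function names and example values are hypothetical.

```python
def aux_loss(cls_attention, cause_probs):
    """Eq. (8): inner product of the [CLS] attention and the cause probabilities."""
    return sum(a * c for a, c in zip(cls_attention, cause_probs))

def ece_loss(l_emo, l_cau, l_aux, weights=(1 / 3, 1.0, 1.0)):
    """Eq. (9): weighted sum with lambda_1 = 1/3 and lambda_2 = lambda_3 = 1."""
    w_emo, w_cau, w_aux = weights
    return w_emo * l_emo + w_cau * l_cau + w_aux * l_aux

l_aux = aux_loss([0.5, 0.3, 0.2], [1.0, 0.0, 0.5])   # hypothetical values
total = ece_loss(l_emo=3.0, l_cau=1.0, l_aux=l_aux)
```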

4.2 Empathetic Conversation Generator

Conversation Generation. Given an input sequence \(X_g\) and the corresponding probability of emotion cause C, the goal of the Empathetic Conversation Generator (ECG for short) is to maximize the probability \(P\left( Y | X_g,C \right) \). The empathetic conversation generator proposed in this section is implemented based on the GPT2 [16]. The forward propagation process of the GPT2 in the conversation generation task can be defined as:

$$\begin{aligned} H_h^G&= \textrm{GPT2} {\left( X_g \right) } \end{aligned}$$
(10)
$$\begin{aligned} \widehat{Y}&= \textrm{softmax} {\left( W_gH_h^G + b_g \right) } \end{aligned}$$
(11)

where \(W_g\) and \(b_g\) denote the parameters of the feed-forward neural network.

The loss function is as follows:

$$\begin{aligned} \mathcal {L}^{ECG} \left( \widehat{Y} \right) = -\sum _{i=1}^m \log \textrm{P} {\left( \widehat{Y}_i \right) } \end{aligned}$$
(12)

where m denotes the length of the sequence, and \(\textrm{P} {\left( \cdot \right) }\) denotes obtaining the probability corresponding to the ground truth.

It is noted that the input representation of the GPT2 contains three parts: word embedding, positional embedding and role embedding:

$$\begin{aligned} H_0^G = X_gW_g^W + X_g^PW_g^P + X_g^RW_g^R \end{aligned}$$
(13)

where \(X_g^R\) denotes the role identifier of each token in the input sequence \(X_g\) (used to distinguish different speakers), and \(W_g^R\) denotes the role embedding matrix.

Biased Self-attention Mechanism. In order to integrate the emotion cause into the generation process of the GPT2, it is typical to introduce a new attention mechanism layer. However, considering that the GPT2 has large-scale, trained parameters, introducing a new attention mechanism layer in the fine-tuning phase may greatly impact the original parameters and destroy the general knowledge already learned by the GPT2. Therefore, we choose to introduce multiplicative signals based on the emotion cause on top of the original self-attention mechanism of the GPT2 to enhance the model’s attention to the emotion cause during generation. Since no additional parameters are introduced, the above problems are avoided.

Moreover, considering that deep neural networks are biased toward modelling syntactic information at the bottom level and semantic information at the top level, the first few layers of the GPT2 network do not require special attention for the emotion cause. We use the layer number information to scale the above multiplicative signals. As the number of layers increases, the multiplicative signals based on the emotion cause gradually strengthen.

The original self-attention mechanism of the GPT2 is defined as:

$$\begin{aligned} \textrm{MaskedAttention} {\left( Q,K,V \right) } = \textrm{softmax} {\left( \frac{QK^T}{\sqrt{d_k}} \odot M - \lambda \left( I-M \right) \right) V } \end{aligned}$$
(14)

where \(\odot \) denotes element-wise multiplication, \(\lambda \) denotes a large scalar that stands in for infinity (generally taken as \(\lambda = 10000\)), M denotes the lower triangular matrix whose non-zero elements are all 1, and I denotes the matrix whose elements are all 1.
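To make Eq. (14) concrete, the sketch below computes single-head masked attention on small nested lists in pure Python. It follows the formula term by term (scaled scores multiplied by M, minus \(\lambda (I-M)\), then a row-wise softmax); the function names and the toy matrices are hypothetical, and a real implementation would use batched tensor operations.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    shift = max(row)
    exps = [math.exp(value - shift) for value in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_scores(q, k, big=10000.0):
    """Row-wise softmax of QK^T / sqrt(d_k) * M - big * (I - M)."""
    d_k = len(k[0])
    weights = []
    for i, query in enumerate(q):
        row = []
        for j, key in enumerate(k):
            score = sum(a * b for a, b in zip(query, key)) / math.sqrt(d_k)
            visible = 1.0 if j <= i else 0.0  # entry M[i][j] of the causal mask
            row.append(score * visible - big * (1.0 - visible))
        weights.append(softmax(row))
    return weights

def masked_attention(q, k, v):
    """Multiply the masked attention weights by the value vectors."""
    weights = masked_scores(q, k)
    width = len(v[0])
    return [[sum(w * v[j][c] for j, w in enumerate(row)) for c in range(width)]
            for row in weights]

q = k = [[1.0, 0.0], [0.0, 1.0]]   # toy queries and keys
v = [[1.0, 0.0], [0.0, 1.0]]       # toy values
weights = masked_scores(q, k)
output = masked_attention(q, k, v)
```

The first position can only attend to itself, so its output equals the first value vector; the second position distributes its weight over both positions.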

Our proposed biased self-attention mechanism based on the emotion cause can be defined as:

$$\begin{aligned} \textrm{MaskedScore} {\left( Q,K \right) }&= \textrm{softmax} {\left( \frac{QK^T}{\sqrt{d_k}} \odot M - \lambda \left( I-M \right) \right) } \end{aligned}$$
(15)
$$\begin{aligned} \textrm{BiasedScore} {\left( Q,K \right) }&= \textrm{Normalize} {\left( \textrm{MaskedScore} {\left( Q,K \right) } \odot \left( I + \frac{h_i}{h} C \right) \right) } \end{aligned}$$
(16)
$$\begin{aligned} \textrm{Normalize} {\left( X \right) }_{i,j}&= \frac{x_{i,j}}{\sum _{j'} x_{i,j'}} \end{aligned}$$
(17)
$$\begin{aligned} \textrm{BiasedAttention} {\left( Q,K,V \right) }&= \textrm{BiasedScore} {\left( Q,K \right) } V \end{aligned}$$
(18)

where C represents the probability of each token being an emotion cause, \(h_i \in \left\{ 1,2,...,h \right\} \) denotes the index of the current self-attention layer among the h layers, and \(\textrm{Normalize} {\left( \cdot \right) }\) denotes row-wise normalization.
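The biasing step of Eq. (16) can be sketched on an already masked and softmax-normalized score matrix. This sketch assumes that C is broadcast across rows, so that the weight placed on key position j is scaled by \(1 + \frac{h_i}{h} c_j\), and that each row is then renormalized to sum to 1; the function name and the toy values are hypothetical.

```python
def biased_scores(masked, cause_probs, layer_index, num_layers):
    """Scale each attention weight by 1 + (layer_index / num_layers) * c_j,
    then renormalize every row so that it sums to 1."""
    scale = layer_index / num_layers
    biased = []
    for row in masked:
        boosted = [w * (1.0 + scale * c) for w, c in zip(row, cause_probs)]
        total = sum(boosted)
        biased.append([b / total for b in boosted])
    return biased

# Toy causal attention weights (each row sums to 1) and cause probabilities.
masked = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.25, 0.25, 0.5]]
cause = [0.0, 1.0, 0.0]   # only the second token is marked as an emotion cause

bottom = biased_scores(masked, cause, layer_index=1, num_layers=12)
top = biased_scores(masked, cause, layer_index=12, num_layers=12)
```

With these values the weight on the cause token grows from the bottom layer to the top layer, while masked (zero) positions stay at zero and no new parameters are involved.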

4.3 Training Strategy

Our proposed model is trained using a two-stage training strategy.

In the first stage, the ECE is trained using a semi-supervised training method, as shown in Algorithm 1.


In the second stage, the ECG is trained based on the emotion cause extracted by the ECE, and the parameters of the ECE are frozen in this stage. The training process is shown in Algorithm 2.


5 Experiments

5.1 Datasets

We use the following two datasets to conduct experiments.

EmpatheticDialogues (EmpDialog for short) is a dataset for empathetic conversation generation created by Rashkin et al. [17]. The dataset, which contains 19,533 conversations in the training set, 2,770 conversations in the validation set and 2,547 conversations in the test set, was collected via the Amazon Mechanical Turk platform. EmpDialog defines 32 emotion categories, and each conversation is created based on an emotion category and a situation description. An example of the EmpDialog dataset is shown in Table 2.

Table 2. An example of the EmpDialog dataset.

EmoCause is a word-level emotion cause dataset created by Kim et al. [6] based on the validation and test sets of EmpDialog. The dataset was also collected via the Amazon Mechanical Turk platform. The workers were asked to vote for each token in a given situation to determine whether it is the emotion cause. EmoCause has 2,770 validation examples and 2,547 test examples. An example of the EmoCause dataset is shown in Table 3.

Table 3. An example of the EmoCause dataset.

As described in Subsect. 4.3, our proposed model is trained in two stages, and the experimental data used in each stage are different.

Experimental Data for ECE: The experimental data used by ECE are obtained from EmpDialog and EmoCause. First, the validation set of EmoCause is randomly divided into two equal parts (denoted as EmoCause-1 and EmoCause-2). Then, the training set (unlabeled) of EmpDialog is combined with EmoCause-1 (labeled) to form the training set used in the experiments, EmoCause-2 is used as the validation set for experiments, and the test set of EmoCause is used as the test set for experiments.
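The data preparation for ECE can be sketched as below. The random seed, the function names, and the integer placeholders standing in for examples are assumptions made for illustration; the text only states that the EmoCause validation set is randomly divided into two equal parts.

```python
import random

def split_in_half(examples, seed=0):
    """Shuffle a copy of the examples and cut it at the midpoint
    (the second part gets the extra item when the count is odd)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]

def build_ece_splits(empdialog_train, emocause_valid, emocause_test, seed=0):
    """Assemble the ECE training, validation, and test data as described."""
    emocause_1, emocause_2 = split_in_half(emocause_valid, seed)
    return {
        "train_unlabeled": list(empdialog_train),  # EmpDialog training set
        "train_labeled": emocause_1,               # EmoCause-1
        "valid": emocause_2,                       # EmoCause-2
        "test": list(emocause_test),               # EmoCause test set
    }

# Integer placeholders stand in for real examples.
splits = build_ece_splits(range(10), range(100, 106), range(200, 204))
```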

Experimental Data for ECG: The experimental data used in ECG are derived from EmpDialog, and the division of the training set, validation set and test set is the same as the original dataset.

5.2 Comparison Methods

For ECE, we chose the following three models as baselines: (1) EmpDG [9]: a Transformer-based model that creates the coarse and fine-grained emotion representation by emotion classification and external emotion lexicon. In addition, it uses two discriminators to interact with user feedback. Here, we select the coarse-grained tokens as the emotion cause. (2) RoBERTa_Att: a RoBERTa-based [11] model trained on the emotion recognition task, from which we obtain the emotion cause via the attention weight distribution of the first special token [CLS]. (3) GEE [6]: a BART-based [8] model that uses a Bayesian conditional probability formula based on the emotion category labels of context to predict emotion cause.

For ECG, we chose the following three models as baselines: (1) EmpDG [9]: the same as mentioned above. (2) RecEC [4]: a Transformer-based model that incorporates emotion cause into response generation with gating mechanisms. It constructs emotion cause labels using a pre-trained sentence-level emotion cause extractor. (3) GPT2 [16]: a GPT2-based model that is fine-tuned on the conversation generation task.

5.3 Evaluation Metrics

For ECE, we conducted an automatic evaluation with the following metrics: emotion classification accuracy (Accuracy for short) and emotion cause recall rate (Recall for short).

For ECG, we used automatic evaluation and manual evaluation to verify the effectiveness. The metrics used for the automatic evaluation included Perplexity, Distinct-1, Distinct-2, and emotion classification accuracy (Accuracy for short), well-known metrics commonly used to evaluate conversation generation. Additionally, we introduced BERTScore [29], which measures the similarity between the generated response and the gold response using the cosine similarity of contextual token embeddings. BERTScore contains three more specific metrics, namely recall rate (\(\mathrm {R_{BERT}}\)), precision rate (\(\mathrm {P_{BERT}}\)) and F1 score (\(\mathrm {F_{BERT}}\)).
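For reference, Distinct-n is commonly computed as the number of unique n-grams divided by the total number of n-grams. The sketch below uses a corpus-level variant over tokenized responses; whether the experiments here use a corpus-level or per-response variant is not stated, so this choice, the function name, and the example responses are assumptions.

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams over all responses."""
    total = 0
    unique = set()
    for tokens in responses:
        for start in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[start:start + n]))
            total += 1
    return len(unique) / total if total else 0.0

responses = [["i", "am", "sorry"], ["i", "am", "here"]]  # toy tokenized outputs
distinct_1 = distinct_n(responses, 1)   # 4 unique unigrams out of 6
distinct_2 = distinct_n(responses, 2)   # 3 unique bigrams out of 4
```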

The manual evaluation included both quantitative and qualitative components. The quantitative component required scorers to score on three dimensions of Empathy, Relevance, and Fluency, with each dimension scored on a scale from 1 to 5, where higher is better. The qualitative component required scorers to rank the responses generated by different models in order of preference. For the manual evaluation, we randomly selected 100 test examples and shuffled the responses generated by different models. Afterwards, these responses were distributed to 3 scorers for scoring, and the final results were averaged. This procedure helps ensure the fairness of the manual evaluation.

5.4 Parameter Settings

ECE is constructed based on RoBERTa-base, and ECG is constructed based on GPT2-base. Table 4 lists the parameter settings in detail.

Table 4. Parameter setting of ECE and ECG.
Table 5. Results on comparative experiments of the different Emotion Cause Extractors.

5.5 Experimental Results and Analysis

Table 5 shows the experimental results of different emotion cause extractors. Our ECE achieves the best results on all metrics among the compared methods. Compared with RoBERTa_Att, ECE remains competitive in emotion classification accuracy while achieving a remarkable improvement in emotion cause recall rate. These results indicate that our proposed semi-supervised training method can effectively narrow the gap between emotion recognition and emotion cause detection and significantly improve the emotion cause detection ability of the model.

Table 6. Results on ablation study of the ECE.

We design an ablation study to further analyze the effectiveness of our proposed semi-supervised training method. In Table 6, “train” (or “valid”) under Training Dataset indicates that ECE uses only the training (or validation) set of EmoCause for unsupervised (or supervised) training. Similarly, “merge” indicates that ECE uses the training set of EmpDialog with EmoCause-1 for semi-supervised training. Note that in the “valid” experiments, the test set of EmoCause is used as the validation set, which is not regular practice and is done here only because no additional labeled data are available for the ablation experiments.

The experimental results in Table 6 show that the supervised training method clearly outperforms the unsupervised training method on Top-1 Recall and Top-3 Recall, yet it is significantly weaker on Top-5 Recall. This suggests that supervised training performs better overall but can easily overfit and become unstable. In contrast, the semi-supervised training method combines the advantages of the two. On the one hand, supervised training provides a clear, task-appropriate optimization goal for emotion cause detection. On the other hand, the labeled data can guide the automatic emotion cause labeling, and the unlabeled data can mitigate the overfitting that may result from using only labeled data. In addition, an ablation study on \(\mathcal {L}_{aux}\) under the semi-supervised training method also validates the effectiveness of our proposed auxiliary loss function.

Table 7. Results on Automatic Evaluation of the ECG. It should be noted that the particularly large Perplexity of RecEC is because the model is trained with \(\mathrm {F_{BERT}}\) as the optimization target for the early stop strategy.

Table 7 presents the automatic evaluation results of different empathetic conversation generation models. Our ECG achieves remarkable improvements in all metrics compared with EmpDG and RecEC, which are Transformer-based models. ECG also achieves a small improvement in all metrics except Distinct compared with the pre-trained language model GPT2. These results suggest that our ECG can improve the quality of the generated responses by introducing attention to the emotion cause on top of a pre-trained language model. The weaker performance of ECG on Distinct may be due to the constraints that the emotion cause imposes on the generation process.

Table 8. Results on Manual Evaluation of the ECG.
Table 9. Preference test (%) between any two methods.

Table 8 shows the manual evaluation results of different empathetic conversation generation models. The improvement in Empathy and Relevance of the responses generated by ECG is remarkable, which indicates that introducing attention to the emotion cause in the generation process can promote the model’s understanding of user emotion and generate more content-relevant emotional responses. Table 9 presents the preferences of scorers for different models. The scorers’ preference for our ECG is greater than the other models, which verifies the validity of the ECG.

5.6 Case Study

Table 10. Two cases of responses generated by different models.

To further illustrate that focusing on the emotion cause helps improve the content-relevance of the generated responses, we show two cases in Table 10. In the first case, ECE identifies the emotion cause in the user input (as highlighted in red) and understands that the stimulus behind the user’s disgust is the poor environment of the restaurant, which prompts ECG to generate an empathetic response that expresses sympathy and shows concern for what happened next (as highlighted in cyan). In the second case, ECE recognizes the emotion cause in the user input (as highlighted in red) and understands that the stimulus behind the user’s joy is the long-awaited birth of a son, prompting ECG to generate an empathetic response that congratulates the user and fits the user’s family situation (as highlighted in cyan).

Comparing the responses generated by different models in the above two cases, it can be seen that our proposed model can accurately capture the emotion cause in user input and effectively incorporate it into the generation process, showing stronger content-relevance compared to other baselines, which further illustrates the important role of the emotion cause in the content-relevance of generated responses.

6 Conclusion

In this paper, we present an empathetic conversation generation model enhanced by the emotion cause to make the generated responses more content-relevant. Our proposed model comprises an emotion cause extractor and an empathetic conversation generator. To compensate for the scarcity of large-scale word-level emotion-cause labeled datasets, we propose a semi-supervised training method that simultaneously uses labeled and unlabeled data for training. To integrate the extracted emotion cause into the generation process, we propose a biased self-attention mechanism that does not introduce additional parameters. Experimental results indicate that our proposed model outperforms the baselines and that its generated responses are more empathetic and content-relevant.