
1 Introduction

Empathy is the ability to understand others' feelings and respond appropriately to their situations. Previous studies have shown that empathetic dialogue models can improve users' satisfaction in several areas, such as customer service [14] and healthcare communities [26]. Therefore, how to successfully implement empathy has become one of the key issues in building an intelligent and considerate agent. In recent years, many studies have been conducted on the task of empathetic dialogue generation, and they fall mainly into two categories. The first enhances the understanding of a user's situation and emotion by leveraging knowledge from one or more external knowledge bases [11, 15, 21, 25] or by adding emotion causes as prior emotion knowledge [3, 25], thereby improving the cognitive ability. The issue with this line of work is that it overlooks the importance of paths between users' critical keywords, which actually reflect the contextual logic of the conversation. Although some studies [25] build paths between emotion concepts and cause concepts, they focus mainly on causality and ignore the fact that paths between any keywords can help. The second category designs emotion strategies, such as mixtures of experts [12], emotion mimicry [17] and multi-resolution emotions [10], to generate appropriate responses from the affective aspect. Unfortunately, these studies learn to respond properly mainly according to the speaker's emotion rather than both interlocutors' emotions. In this paper, we aim to improve upon these weak aspects of existing work to help advance the study of empathetic dialogue generation.

Fig. 1.

A dialogue from the EmpatheticDialogues dataset. The cognitive ability is improved by retrieving entities (bold in black) and relationships (grey) from ConceptNet and building paths between critical keywords (red) to generate a high-quality response under the influence of the anxious and confident emotions and the wishing dialogue act. (Color figure online)

Psychological research shows that empathy is a complex mental process involving three aspects of the interlocutors: cognition, affection and behavior [13]. Specifically, cognitive empathy refers to the ability to understand and interpret a user's situation [2]; affective empathy is an emotional reaction based on differentiating the emotions of oneself and others [13]; behavioral empathy means the verbal or non-verbal forms of communication used in empathetic dialogue [6]. Among the existing works, some only consider the aspects of cognition and affection [21, 28]; others mainly consider the aspect of behavior [1, 27]. None of the existing works has comprehensively considered all three aspects (cognition, affection, behavior), which we believe are all important. In the following, we elaborate with the example in Fig. 1. The dialogue in Fig. 1 shows that (1) Cognition: The speaker is anxious about attending a job interview. In the first turn, there exists a path between <job, interview> with internship as a bridge, which enhances the understanding of the keywords and the context. In the next turn, the paths between <poorly, asked> and <asked, job> are built to alleviate the difficulty of capturing the contextual logic from limited context. Thus, the paths, which establish the relationships between utterances, are critical for improving the cognitive ability. (2) Affection: In interpersonal conversations, responses are usually influenced by both interlocutors' emotions [5]. As shown in Fig. 1, in the second turn, instead of both sides falling into anxiety, the listener is able to perceive the speaker's emotion and accept the emotion difference between them, thus generating a response with a more positive emotion (hopeful). Therefore, learning the emotional dependencies between the context and the target response based on both interlocutors' emotions is critical for responding properly. (3) Behavior: Appropriate dialogue acts are used as communicative forms to enhance empathy expression. For example, the listener inspires the speaker by encouraging and relaxes the speaker by wishing. Different from [27], we consider that all responses (rather than only some of them) are generated under the guidance of dialogue acts. In this way, we can guide dialogue generation better.

To this end, we propose a novel empathetic dialogue generation model covering the aspects of Cognition, Affection and Behavior (CAB) to achieve a comprehensive empathetic dialogue task. Specifically, since keywords are important for understanding the contextual logic, our model builds paths between critical keywords through multi-hop commonsense reasoning to enhance the cognitive ability. A Conditional Variational Auto-Encoder (CVAE) model with dual latent variables is built based on both interlocutors' emotions, and then the dual latent variables are injected into the decoder together with the dialogue act features to produce empathetic responses from the perspectives of affection and behavior. Our contributions are summarized as follows:

  • To the best of our knowledge, we are the first to propose a novel framework for empathetic dialogue generation based on psychological theory from three perspectives: cognition, affection and behavior.

  • We propose a context-based multi-hop reasoning method, in which paths are established between critical keywords to acquire implicit knowledge and learn contextual logic.

  • We present a novel CVAE model, which introduces dual latent variables to learn the emotional dependencies between the context and target responses. After that, we incorporate the dialogue act features into the decoder to guide the generation.

  • Experiments demonstrate that CAB generates more relevant and empathetic responses compared with the state-of-the-art methods.

2 Related Work

Recently, there have been numerous works on the task of empathetic dialogue generation, which was proposed by Rashkin et al. [20]. Lin et al. [12] assign different decoders to various emotions and fuse the output of each decoder with the user's emotion weights. Majumder et al. [17] adopt emotion stochastic sampling and emotion mimicry to respond to positive or negative emotions when generating empathetic responses. Li et al. [10] construct an interactive adversarial learning network that considers multi-resolution emotions and user feedback. Liu et al. [16] incorporate anticipated emotions into response generation via reinforcement learning. Gao et al. [3] adopt emotion causes to better understand the user's emotion. However, all of the above methods only consider the user's emotion and ignore the mutual influence between both interlocutors' emotions in the dialogue.

Several studies have incorporated external knowledge into empathetic dialogue generation. Li et al. [11] employ multi-type knowledge to explore implicit information and construct an emotional context graph to improve emotional perception. Liu et al. [15] prepend the retrieved knowledge triples to the gold responses in order to obtain proper responses. However, these approaches retrieve knowledge triples without fully considering the contextual meaning of the words. Although Wang et al. [25] adopt ConceptNet to explore emotional causality through commonsense reasoning between the emotion clause and the cause clause, the logical relationships between other utterances may be ignored. Sabour et al. [21] use ATOMIC for commonsense reasoning to better understand the user's situation and feelings, but reasoning over the whole dialogue history may neglect the important role of keywords in the context. To overcome the aforementioned shortcomings, we propose a context-based multi-hop commonsense reasoning method to enrich contextual information and reason about the logical relationships between utterances.

Fig. 2.

The overall architecture of CAB.

3 Method

3.1 Task Formulation and Overview

In empathetic dialogue generation, each dialogue consists of a dialogue history \(C=[S_1,L_1,S_2,L_2,\ldots ,S_{N-1},L_{N-1},S_{N}]\) of \(2N-1\) utterances and a gold empathetic response \(L_N=[w_N^1,w_N^2,\ldots ,w_N^n]\) of \(n\) words, where \(S_i\) and \(L_i\) denote the i-th utterance of the speaker and the listener, respectively. Our goal is to generate an empathetic response \(R=[r_1,r_2,\ldots ,r_m]\) based on the dialogue history C, the speaker's emotion \(e_s\), the listener's emotion \(e_l\), and the listener's dialogue act \(a_l\).

We provide an overview of CAB in Fig. 2, which consists of five components: (a) Emotional Context Representation. The predicted emotions \(e_s\) and \(e_l\) are fed, together with the context C, into the emotional context encoder to obtain the emotional context representations \(\boldsymbol{\hat{H}}_S\) and \(\boldsymbol{\hat{H}}_L\); (b) Affection. The prior network and the posterior network then capture the dual latent variables \(\boldsymbol{z}_s\) and \(\boldsymbol{z}_l\) based on \(\boldsymbol{\hat{H}}_S\) and \(\boldsymbol{\hat{H}}_L\) in the testing and training phases, respectively; (c) Cognition. To build the paths P, we leverage ConceptNet to acquire external knowledge and incorporate it into C to obtain a knowledge-enhanced context representation \(\boldsymbol{\hat{H}}_C\); (d) Behavior. The dialogue act features \(\boldsymbol{E}_a\) are distilled based on a predictor and the embedding layer; (e) Response Generation. The three-stage decoder generates an empathetic response R based on the aspects of affection, cognition and behavior.

We evaluate the model on EmpatheticDialogues [20], a publicly available benchmark dataset for empathetic dialogue generation. However, the dialogues in this dataset do not contain emotion and dialogue act labels for each listener's utterance, so we annotate them with EmoBERTa [7] and EmoBERT [27], respectively, to support the studies in this paper.

From Sect. 3.2 to Sect. 3.7, we introduce CAB only briefly due to space limits. More model and experiment details can be found in the full version [4].

3.2 Emotional Context Encoder

Input Representation. We divide the dialogue history into two segments, \(C_S=[S_1,S_2,\ldots ,S_N]\) and \(C_L=[L_1,L_2,\ldots ,L_{N-1}]\). Following previous work [12], we first obtain the embeddings of the speaker context, listener context, global context and gold response, respectively. The embeddings of the speaker context and listener context are then fed into the Transformer-based inter-encoder (ItrEnc) to obtain \(\boldsymbol{H}_S\) and \(\boldsymbol{H}_L\), and the Transformer encoder (TransEnc) encodes the embeddings of the global context and gold response into \(\boldsymbol{H}_C\) and \(\boldsymbol{H}_N\).
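The split of the dialogue history into the two segments is mechanical; the following sketch (illustrative only, assuming C alternates speaker and listener utterances as in Sect. 3.1) makes it explicit:

```python
# Illustrative sketch: split the alternating dialogue history
# C = [S_1, L_1, S_2, L_2, ..., S_N] into C_S and C_L.
def split_history(C):
    C_S = C[0::2]   # speaker utterances S_1, ..., S_N
    C_L = C[1::2]   # listener utterances L_1, ..., L_{N-1}
    return C_S, C_L

# split_history(["S1", "L1", "S2", "L2", "S3"]) -> (["S1", "S2", "S3"], ["L1", "L2"])
```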

Emotion Classification. To understand the emotions of the speaker and the listener, we project the hidden representations of the first tokens of \(\boldsymbol{H}_S\) and \(\boldsymbol{H}_L\) into the emotion category distributions \(P_{s}\) and \(P_{l}\) to predict their emotions. We then send the emotions to a trainable emotion embedding layer to obtain the emotion state embeddings \(\boldsymbol{E}_{emos}\) and \(\boldsymbol{E}_{emol}\).

Emotion Self-attention. To make the latent variables in Sect. 3.3 incorporate both interlocutors' emotions, \(\boldsymbol{H}_S\) and \(\boldsymbol{H}_L\) are concatenated with \(\boldsymbol{E}_{emos}\) and \(\boldsymbol{E}_{emol}\), respectively, and then fed into a self-attention layer followed by a linear layer to obtain the emotional context representations \(\boldsymbol{\hat{H}}_S\) and \(\boldsymbol{\hat{H}}_L\).
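To make this concrete, the following PyTorch sketch shows one way the emotion embedding, self-attention and linear projection could be combined; the layer sizes, the prepending of the emotion state along the sequence dimension, and the 32-way emotion space are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class EmotionalContextLayer(nn.Module):
    """Sketch of the emotional context encoder head (Sect. 3.2)."""
    def __init__(self, d_model=300, n_emotions=32, n_heads=6):
        super().__init__()
        self.emo_emb = nn.Embedding(n_emotions, d_model)    # trainable emotion embedding
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)              # linear layer after self-attention

    def forward(self, H, emo_ids):
        # H: [B, T, d] encoded (speaker or listener) context; emo_ids: [B] predicted emotions
        E_emo = self.emo_emb(emo_ids).unsqueeze(1)            # [B, 1, d] emotion state
        x = torch.cat([E_emo, H], dim=1)                      # concatenate emotion with context
        out, _ = self.self_attn(x, x, x)
        return self.proj(out)                                 # emotional context representation

layer = EmotionalContextLayer()
H_hat_s = layer(torch.randn(2, 20, 300), torch.tensor([3, 7]))   # -> [2, 21, 300]
```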

3.3 Prior Network and Recognition Network (Affection)

We introduce dual latent variables \(\boldsymbol{z}_*\in \{\boldsymbol{z}_s,\boldsymbol{z}_l\}\) in the CVAE, mapping the input sequences \(C_*\in \{C_S,C_L\}\) into the output sequence \(L_N\) via \(\boldsymbol{z}_*\). Taking the speaker as an example, we illustrate how the prior network and the recognition network are realized. The prior network \(p_\theta (\boldsymbol{z}_s \vert C_S)\) is parameterized by 3-layer MLPs to compute the mean \(\mu _s^\prime \) and variance \(\sigma _s^{\prime 2}\) of \(\boldsymbol{z}_s\). The network structure of the recognition network \(q_\varphi (\boldsymbol{z}_s \vert C_S,L_N)\) is the same as that of the prior network, except that its input also includes \(\boldsymbol{H}_N\). In order to learn the emotional dependencies based on both interlocutors' emotions, we fuse \(\boldsymbol{z}_s\) and \(\boldsymbol{z}_l\) according to the emotional similarity coefficient \(\beta \) between \(\boldsymbol{E}_{emos}\) and \(\boldsymbol{E}_{emol}\), obtaining \(\boldsymbol{z}=\beta \cdot \boldsymbol{z}_s+(1-\beta )\cdot \boldsymbol{z}_l\).
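A minimal sketch of the speaker-side latent variable and the fusion step is given below; the pooling of the context into a single vector, the hidden sizes and the use of cosine similarity for \(\beta\) are our assumptions, since the paper only specifies 3-layer MLPs and the weighted sum.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(d_in, d_hid, d_out):
    # 3-layer MLP producing the concatenation [mu; log_var]
    return nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU(),
                         nn.Linear(d_hid, d_hid), nn.ReLU(),
                         nn.Linear(d_hid, d_out))

class SpeakerLatent(nn.Module):
    def __init__(self, d_ctx=300, d_z=200):
        super().__init__()
        self.prior = mlp(d_ctx, 300, 2 * d_z)            # p_theta(z_s | C_S)
        self.recog = mlp(2 * d_ctx, 300, 2 * d_z)        # q_phi(z_s | C_S, L_N)

    def forward(self, h_ctx, h_resp=None):
        # h_ctx: pooled emotional context [B, d]; h_resp: pooled gold response (training only)
        if h_resp is None:                               # test phase: sample from the prior
            mu, log_var = self.prior(h_ctx).chunk(2, dim=-1)
        else:                                            # training phase: sample from the posterior
            mu, log_var = self.recog(torch.cat([h_ctx, h_resp], -1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)   # reparameterization trick
        return z, mu, log_var

def fuse_latents(z_s, z_l, E_emos, E_emol):
    # beta as an emotional similarity coefficient (cosine similarity is an assumption)
    beta = F.cosine_similarity(E_emos, E_emol, dim=-1).clamp(0, 1).unsqueeze(-1)
    return beta * z_s + (1 - beta) * z_l                 # z = beta*z_s + (1-beta)*z_l
```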

3.4 Knowledge Acquisition and Fusion (Cognition)

Knowledge Acquisition. We first obtain the keyword set \(\tau _{all}\) of size \(\boldsymbol{cw}\) from \(C_S\) based on the TextRank algorithm [18]. Then we build paths as follows:

a. Take one keyword in \(\tau _{all}\) as the head entity \(h_i\in \tau _{all}\), then feed the embeddings of \(h_i\) and the speaker context into ItrEnc to extract the semantic features of \(h_i\). The Top-K knowledge triples in ConceptNet associated with \(h_i\) are retrieved based on a score and a removed relation set [11].

b. To ensure that the triples are logically related to the other keywords \(\tau _{other}\), we first obtain the semantic features of \(h_j\in \tau _{other}\) as in step a. After ranking the triples by the relevance between the tail entity and \(h_j\), we select the Top-k triples. If the tail entity is the same as \(h_j\), which indicates that there exists a one-hop path between \(h_i\) and \(h_j\), we add them to the final keyword set \(\tau _r\) (e.g., the red circles in Fig. 2). If not, the tail entity is added to \(\tau _{all}\) to continue searching for paths by repeating steps a and b. Finally, we retain some paths P (e.g., the paths connected by grey arrows in Fig. 2) for further fusion; a simplified sketch of this search is given below. The attention weight vector \(\boldsymbol{g}\) is calculated by the attention mechanism to measure the importance of each word in C with respect to \(\tau _r\).
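The following sketch illustrates the path search between keywords; retrieve_topk_triples is a hypothetical helper standing in for the ConceptNet lookup and scoring of step a, and the hop limit and data structures are our simplifications, not the authors' implementation.

```python
def build_paths(keywords, retrieve_topk_triples, max_hops=2, k=10):
    """Search for (multi-hop) paths that connect pairs of context keywords."""
    paths, tau_r = [], set()
    frontier = [(h, [h]) for h in keywords]                    # (current head entity, path so far)
    for _ in range(max_hops):
        next_frontier = []
        for head, path in frontier:
            for rel, tail in retrieve_topk_triples(head, k):   # Top-k scored triples for the head
                other_keywords = [w for w in keywords if w != path[0]]
                if tail in other_keywords:                     # tail matches another keyword: path closed
                    paths.append(path + [rel, tail])
                    tau_r.update([path[0], tail])              # keep both endpoints in the keyword set
                else:                                          # otherwise expand from the new tail entity
                    next_frontier.append((tail, path + [rel, tail]))
        frontier = next_frontier
    return paths, tau_r
```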

Knowledge Fusion. We first convert the paths into sequences. The sequences are then fed into a two-layer Bi-GRU to obtain the knowledge representation \(\boldsymbol{H}_k\). Finally, following previous work [21], we concatenate \(\boldsymbol{H}_k\) with the context at the token level to learn the knowledge-enhanced context representation \(\boldsymbol{\hat{H}}_C\).
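A rough PyTorch sketch of this fusion step is shown below; pooling the path states before broadcasting them over the context tokens is our simplification of the token-level concatenation.

```python
import torch
import torch.nn as nn

class KnowledgeFusion(nn.Module):
    def __init__(self, d_model=300):
        super().__init__()
        self.bigru = nn.GRU(d_model, d_model // 2, num_layers=2,
                            bidirectional=True, batch_first=True)   # two-layer Bi-GRU
        self.out = nn.Linear(2 * d_model, d_model)                   # fuse context and knowledge

    def forward(self, path_emb, H_c):
        # path_emb: [B, Lp, d] embedded path sequences; H_c: [B, Tc, d] context states
        H_k, _ = self.bigru(path_emb)                                # knowledge representation H_k
        k_tok = H_k.mean(dim=1, keepdim=True).expand(-1, H_c.size(1), -1)  # broadcast to tokens
        return self.out(torch.cat([H_c, k_tok], dim=-1))             # knowledge-enhanced context
```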

3.5 Dialogue Act Predictor and Representation (Behavior)

To guide the communicative form of empathetic dialogue generation, our model uses the first token of \(\boldsymbol{\hat{H}}_C\) to predict dialogue act \(\boldsymbol{a}_l\). Then, \(\boldsymbol{a}_l\) is fed into the embedding layer to learn the dialogue act embedding representation \(\boldsymbol{E}_a\).
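A small sketch of this predictor is given below; the number of dialogue acts and the argmax decoding are placeholders for illustration.

```python
import torch
import torch.nn as nn

class DialogueActHead(nn.Module):
    def __init__(self, d_model=300, n_acts=9):           # n_acts is a placeholder value
        super().__init__()
        self.cls = nn.Linear(d_model, n_acts)            # act classifier over the first token
        self.act_emb = nn.Embedding(n_acts, d_model)     # trainable dialogue act embedding

    def forward(self, H_hat_c):
        logits = self.cls(H_hat_c[:, 0])                  # first token of the enhanced context
        a_l = logits.argmax(dim=-1)                       # predicted dialogue act
        return logits, self.act_emb(a_l)                  # logits for the act loss, E_a for the decoder
```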

3.6 Response Generation

Finally, the aforementioned information \(\boldsymbol{E}_a\), \(\boldsymbol{g}\), \(\boldsymbol{z}\) and \(\boldsymbol{\hat{H}}_{C}\) is applied in the Transformer-based decoder (TransDec) through the following three stages: (1) The embedding of the start-of-sequence token \(\boldsymbol{E}_{SOS}\) and \(\boldsymbol{E}_a\) are fed into a linear layer, and the resulting high-level act features are used to guide the generation. (2) We design a multi-head keywords attention, which takes the output of the cross-attention layer as the query and the dot-product of \(\boldsymbol{g}\) and \(\boldsymbol{\hat{H}}_C\) as the key and value. TransDec then outputs the hidden state \(\boldsymbol{H}_G\). (3) To learn the emotional dependencies, we concatenate \(\boldsymbol{z}\) and \(\boldsymbol{H}_G\) at the token level and use a pointer network [23] to output the probability distribution over the vocabulary.
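The sketch below illustrates stage (2); reading the dot-product of \(\boldsymbol{g}\) and \(\boldsymbol{\hat{H}}_C\) as a per-token re-weighting of the context is our interpretation rather than a confirmed implementation detail.

```python
import torch
import torch.nn as nn

class KeywordsAttention(nn.Module):
    def __init__(self, d_model=300, n_heads=6):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, dec_states, H_hat_c, g):
        # dec_states: [B, Td, d] cross-attention output (query)
        # H_hat_c:    [B, Tc, d] knowledge-enhanced context
        # g:          [B, Tc]    keyword importance weights
        kv = H_hat_c * g.unsqueeze(-1)            # emphasize keyword-relevant tokens
        out, _ = self.attn(dec_states, kv, kv)    # keys and values share the re-weighted context
        return out                                # hidden state H_G passed to stage (3)
```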

3.7 Training Objectives

We jointly optimize the emotion classification losses, the dialogue act prediction loss, the loss of the CVAE model and the bag-of-words loss:

$$\begin{aligned} \mathcal {L}=\gamma _1 \mathcal {L}_{s}+\gamma _2 \mathcal {L}_{l}+\gamma _3 \mathcal {L}_{a}+ \gamma _4 \mathcal {L}(C_*,L_N;\theta ,\varphi )+\gamma _5 \mathcal {L}_{bow} \end{aligned}$$
(1)

where \(\gamma _1\), \(\gamma _2\), \(\gamma _3\), \(\gamma _4\) and \(\gamma _5\) are hyper-parameters.
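Assembled in code, the objective might look like the following sketch; the \(\gamma\) values are placeholders and kl_weight denotes the KL-annealing factor mentioned in Sect. 4.1.

```python
def total_loss(L_s, L_l, L_a, recon, kl_s, kl_l, L_bow,
               gammas=(1.0, 1.0, 1.0, 1.0, 1.0), kl_weight=1.0):
    # L_s, L_l: emotion classification losses; L_a: dialogue act prediction loss;
    # recon plus the annealed KL terms form the CVAE loss; L_bow: bag-of-words loss.
    g1, g2, g3, g4, g5 = gammas
    L_cvae = recon + kl_weight * (kl_s + kl_l)
    return g1 * L_s + g2 * L_l + g3 * L_a + g4 * L_cvae + g5 * L_bow
```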

4 Experiments

4.1 Experimental Setup

Baselines. We compare our model with the following state-of-the-art models: (1) Transformer [22]: The vanilla Transformer with the pointer network, trained by optimizing the generation loss. (2) Multi-Trans [20]: A variant of the Transformer that adds an emotion classification loss to the generation loss to jointly optimize the model. (3) MOEL [12]: A model with several Transformer decoders whose outputs are softly combined to generate responses. (4) MIME [17]: A model adopting emotion mimicry and emotion clusters to deal with positive or negative emotions. (5) EmpDG [10]: A generative adversarial network that considers multi-resolution emotions and introduces discriminators to supervise training in semantics and emotion. (6) KEMP [11]: A model that uses two types of knowledge to help understand and express emotions. (7) CEM [21]: A method that generates empathetic responses by leveraging commonsense to improve the understanding of interlocutors' situations and feelings.

Implementation Details. We implement all models in PyTorch on a GeForce RTX 3090 GPU, and train them using the Adam optimizer [8] with a mini-batch size of 16. All common hyper-parameters are the same as in [12]. We adopt 300-dimensional pre-trained 840B GloVe vectors [19] to initialize the word embeddings, which are shared between the encoders and the decoder. The hidden size is 300 everywhere, and the size of each latent variable is 200. We use KL annealing over 15,000 batches to achieve the best performance. During testing, the batch size is 1 and the maximum number of greedy decoding steps is 50.
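For reference, a linear KL-annealing schedule over 15,000 batches could be written as below; the linear shape is an assumption, as the paper only specifies the step budget.

```python
def kl_anneal_weight(step, total_steps=15000):
    # Ramp the KL weight from 0 to 1 over the first `total_steps` training batches.
    return min(1.0, step / total_steps)
```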

Automatic Evaluation Metrics. We choose the widely used PPL [24], Distinct-1 and Distinct-2 [9] as our main automatic metrics. PPL estimates the overall generation quality of a model, while Distinct-1 and Distinct-2 measure the diversity of responses. Since the emotion accuracy of the speaker/listener (EmoSA/EmoLA) reflects the understanding of both interlocutors' emotions and the dialogue act accuracy (ActA) indicates whether proper dialogue acts are chosen to produce responses, we also report these metrics.
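As a reference, Distinct-n is commonly computed as the ratio of unique n-grams to the total number of n-grams over all generated responses [9]; a minimal implementation is sketched below.

```python
def distinct_n(responses, n):
    total, unique = 0, set()
    for resp in responses:
        tokens = resp.split()
        ngrams = list(zip(*[tokens[i:] for i in range(n)]))   # all n-grams of the response
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / max(total, 1)    # proportion of distinct n-grams

# distinct_n(["i am so sorry to hear that", "that sounds great"], 2)
```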

Table 1. Results of the automatic evaluation. w/o Cog/Aff/Beh denote the ablation experiments; the best results among all models are in bold.

4.2 Results and Analysis

Automatic Evaluation Results. The overall automatic evaluation results are shown in Table 1. Our model CAB significantly outperforms the baselines on all metrics. The lower PPL score implies that CAB generally generates higher-quality responses, reflecting the importance of considering empathy from multiple perspectives. The remarkable improvements in Distinct-1 and Distinct-2 suggest that introducing external knowledge helps improve the understanding of the dialogue history and thus produces a wider variety of responses. The higher accuracy of emotion classification verifies the validity of modelling both interlocutors' emotions separately.

Ablation Study. As shown in the bottom part of Table 1, we also conduct ablation experiments to explore the effect of each component. From the results, we observe that all metrics except PPL decrease, especially Distinct-1 and Distinct-2, when commonsense knowledge acquisition and fusion are removed (w/o Cog), suggesting that the paths capture additional information to enhance the cognitive ability and thus improve the quality and diversity of responses. The increased PPL score may be due to the introduction of knowledge, which can affect the fluency of the generated responses. In addition, we find that only considering the speaker's emotion by removing the latent variable of the listener (w/o Aff) yields lower emotion accuracy and a higher PPL score; it is thus difficult to generate appropriate responses without accurately understanding both interlocutors' emotions. All metrics decrease when we remove the dialogue act classification and the dialogue act features fused at the decoder (w/o Beh), indicating the importance of dialogue acts in improving empathy.

5 Conclusions

In this paper, we build paths by leveraging commonsense knowledge to enhance the understanding of the user's situation, consider both interlocutors' emotions, and guide response generation through dialogue acts, thereby generating empathetic responses from three perspectives: cognition, affection and behavior. Extensive experiments on benchmark metrics show that our method CAB outperforms the state-of-the-art methods, demonstrating its effectiveness in improving the empathy of the generated responses.