
1 Introduction

Empathy, an important aspect of engaging human conversations [14], usually refers to the ability to understand others’ situations, perceive their emotions, and respond to them appropriately [5]. Research in social psychology has also shown that empathy is a vital step towards a more humanized dialogue system [20]. Consequently, we concentrate on the task of empathetic dialogue generation, which aims to empathize with users and thereby build a more human-like chatbot.

It has been demonstrated that empathy is a complex construct involving both cognition and affection [15]: cognitive empathy attends to users’ situations [2], while affective empathy focuses on users’ emotions [3]. However, most existing methods [7,8,9,10,14] in empathetic dialogue generation rely solely on detecting users’ emotions and modeling emotional dependencies, ignoring the importance of cognitive empathy. To achieve empathy in both aspects, Sabour et al. [15] use commonsense knowledge to enhance the modeling of cognitive and affective empathy by inferring several cognition conditions. However, the limited cognitive conditions and the lack of fine-grained information still result in inaccurate recognition of users’ circumstances and feelings, thereby impairing the empathetic effect of the generated responses.

Providing external knowledge to dialogue systems has been shown to benefit the modeling of empathy in both cognition and affection [15]. Intuitively, fine-grained knowledge enables a more comprehensive understanding of the user’s situation and a more accurate perception of the user’s feelings, as illustrated in Fig. 1. In this example, with the related cognitive concept of “walk”, the chatbot understands that the user encountered a snake while walking, so it asks what the user did. Likewise, the affective concepts “dazed”, “demon” and “poisonous” help the chatbot perceive the user’s terrified feeling.

Fig. 1. An example from EMPATHETICDIALOGUES. Words related to cognition and affection are highlighted in red, cognitive concepts and relations in green, and affective concepts in blue. (Color figure online)

In this paper, we propose a fine-grained Knowledge-Enhanced EMpathetic dialogue generation model (KEEM). We first explore novel knowledge selection strategies to filter commonsense knowledge, i.e. structural and semantic knowledge, and emotional knowledge, and use the selected fine-grained knowledge to construct cognitive and affective context graphs. We then learn the corresponding context representations from these two knowledge-enhanced context graphs. Next, we encode the original dialogue context to acquire its original cognitive and affective information and fuse it with the two knowledge-enhanced representations. Finally, we feed the fused cognitive and affective representations to a decoder to generate empathetic responses with coherent content and appropriate emotion. Extensive experiments on the benchmark dataset EMPATHETICDIALOGUES [14] for empathetic dialogue generation verify that our model produces more empathetic responses than several competitive models.

Our contributions can be summarized as follows:

  • We propose KEEM, a novel approach that models cognitive and affective empathy by constructing and encoding corresponding context graphs.

  • We explore knowledge selection strategies for cognitive and emotional knowledge to obtain more accurate and fine-grained knowledge.

  • We conduct automatic and human evaluations and analyses to demonstrate the effectiveness of KEEM.

2 Preliminaries

2.1 Commonsense and Emotional Knowledge

In this work, we leverage the commonsense knowledge graph ConceptNet [17] and the emotional lexicon NRC-VAD [12] to infer the speakers’ situations and their emotions, which enhances cognitive and affective empathy and leads to more empathetic responses.

ConceptNet is a large-scale knowledge graph that connects words and phrases of natural language with labeled edges. It represents general knowledge, allowing models to better understand the meanings behind words [11]. It contains 34 relations, over 21M edges, and over 8M nodes. The edges stored in ConceptNet can be concisely represented as quadruples of their start node, relation label, end node, and confidence score: (h, r, t, s).

NRC-VAD is a lexicon of more than 20k English words and their vectors of three independent dimensions, i.e. valence (positiveness-negativeness/pleasure-displeasure), arousal (active-passive), and dominance (dominant-submissive), abbreviated as VAD. The values of VAD vectors are fine-grained real numbers in the interval from 0 (lowest) to 1 (highest).

2.2 Task Formulation

Dialogue context C is a sequence of M utterances: \( C=[U_1,U_2,\cdots ,U_M] \), where the i-th utterance \(U_i=[w_i^1,w_i^2,\cdots ,w_i^{m_i}]\) consists of \(m_i\) words. Following [9], we flatten C into a token sequence and prepend a CLS token to it, thus obtaining a new context sequence: \(C=[CLS,w_1^1,\cdots ,w_1^{m_1},\cdots ,w_M^1,\cdots ,w_M^{m_M}]\). Given C, the task of empathetic dialogue generation is to generate an empathetic response \(Y=[y_1,y_2,\cdots ,y_n]\) with coherent content and appropriate emotion.
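For illustration, the following Python snippet sketches this flattening step; the whitespace tokenizer and the literal "[CLS]" string are placeholders rather than the exact preprocessing used in our implementation.

```python
def flatten_context(utterances):
    """Flatten utterances U_1..U_M into one token sequence with a prepended CLS token."""
    tokens = ["[CLS]"]
    for utterance in utterances:
        tokens.extend(utterance.split())  # placeholder whitespace tokenizer
    return tokens

# Example with M = 2 utterances:
context = flatten_context(["I saw a snake while walking", "It looked poisonous"])
# -> ['[CLS]', 'I', 'saw', 'a', 'snake', 'while', 'walking', 'It', 'looked', 'poisonous']
```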

3 Methodology

3.1 Overview

Figure 2 shows an overview of our proposed model. Our model (KEEM) is composed of four stages: 1) cognitive context graph constructing and encoding; 2) affective context graph constructing and encoding; 3) cognition and affection fusion; and 4) empathetic response generation, where the first two stages can be performed simultaneously.

Fig. 2. Overview of our model (KEEM).

3.2 Cognitive Context Graph Constructing and Encoding

Cognitive Knowledge Selection. The structural and semantic information of ConceptNet, i.e. the relations and concepts therein, can help enhance cognition. Hence, for each non-stop word \(w_i^j\) of C, we first retrieve all of its quadruples from ConceptNet as candidates. We then filter the cognitive knowledge with the following heuristic steps: 1) We remove the quadruples that have low confidence scores (i.e. scores lower than 1.0) or inappropriate relations (i.e. relations unrelated to cognition). 2) We define and calculate the correlation degree of each retrieved concept, namely the number of C’s words that link to it in ConceptNet; if the concept itself appears in the dialogue context, the correlation degree is increased by one. 3) We rank the quadruples by correlation degree and select the top \(K_1\) as the knowledge needed in cognitive graph construction.
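The sketch below illustrates these three steps under simplifying assumptions: quadruples are (head, relation, tail, score) tuples, `allowed_relations` stands for the cognition-related relation set (not enumerated here), and `concept_links` maps each candidate concept to the context words linked to it in ConceptNet.

```python
def select_cognitive_knowledge(context_words, candidates, k1,
                               allowed_relations, concept_links):
    """Select the top-K1 cognition-related quadruples by correlation degree."""
    kept = []
    for head, rel, tail, score in candidates:
        # Step 1: drop low-confidence or cognition-unrelated quadruples.
        if score < 1.0 or rel not in allowed_relations:
            continue
        # Step 2: correlation degree = number of context words linked to the
        # tail concept, plus one if the concept itself appears in the context.
        degree = len(concept_links.get(tail, set()))
        if tail in context_words:
            degree += 1
        kept.append((degree, (head, rel, tail, score)))
    # Step 3: rank by correlation degree and keep the top K1.
    kept.sort(key=lambda item: item[0], reverse=True)
    return [quad for _, quad in kept[:k1]]
```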

Cognitive Context Graph Constructing. To build the cognitive graph, we first take all the tokens of C, including the CLS token, as the initial nodes of the graph. We then add the semantic concepts selected by Cognitive Knowledge Selection as new nodes of the graph. Next, we connect vertex pairs of three types: 1) every two consecutive words in the dialogue context; 2) every word in the dialogue context and all of its semantically related concepts; 3) the CLS token and all words in the dialogue context. Note that the edges connecting vertices are directed.

Thus, the dialogue context is enhanced by external structural and semantic knowledge and represented as the cognitive context graph \(G_{cog}\), and the links of the graph are stored in the adjacency matrix \(A_{cog}\).
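As an illustration, a simplified construction of the node list and the directed adjacency matrix \(A_{cog}\) is sketched below; treating each selected quadruple as a (word index, concept) pair and the particular edge directions are our assumptions for this sketch.

```python
import numpy as np

def build_cognitive_graph(tokens, concept_edges):
    """tokens: flattened context with the CLS token at position 0.
    concept_edges: (word_index, concept) pairs from the selected quadruples."""
    concepts = [concept for _, concept in concept_edges]
    nodes = tokens + concepts
    n = len(nodes)
    adj = np.zeros((n, n), dtype=np.int64)
    # 1) every two consecutive words in the dialogue context (skip CLS at 0).
    for i in range(1, len(tokens) - 1):
        adj[i, i + 1] = 1
    # 2) every context word and its semantically related concepts.
    for j, (word_idx, _) in enumerate(concept_edges):
        adj[word_idx, len(tokens) + j] = 1
    # 3) the CLS token and all words in the dialogue context.
    for i in range(1, len(tokens)):
        adj[0, i] = 1
    return nodes, adj
```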

Cognitive Context Graph Encoding. Similar to [8], we first initialize the cognitive vector representation of every vertex \(v_{cog}^i\) by summing up its word embedding \(\boldsymbol{E}_w\left( v_{{c o g }}^i\right) \in \mathbb {R}^{d_{m o d e l}}\), positional embedding \(\boldsymbol{E}_p\left( v_{c o g}^i\right) \in \mathbb {R}^{d_{m o d e l}}\), and dialogue state embedding \(\boldsymbol{E}_s\left( v_{c o g}^i\right) \):

$$\begin{aligned} \textbf{v}_{cog }^i=\boldsymbol{E}_w\left( v_{c o g}^i\right) +\boldsymbol{E}_p\left( v_{c o g}^i\right) +\boldsymbol{E}_s\left( v_{c o g}^i\right) \end{aligned}$$
(1)

where \(d_{model}\) is the dimension of the embeddings.

Then, we adopt a multi-head graph-attention mechanism, followed by a residual connection and layer normalization, so that \(v_{cog}^i\) attends to all its immediate neighbors \(\left\{ v_{c o g}^j\right\} _{j \in A_{c o g}^i}\) and updates its cognitive representation with structural and semantic knowledge:

$$\begin{aligned} \hat{\textbf{v}}_{cog }^i=\text {LayerNorm}\left( \textbf{v}_{c o g}^i+\Vert _{n=1}^H \sum _{j \in A_{c o g}^i} \text {att}_{c o g}^n\left( \textbf{v}_{c o g}^i, \textbf{v}_{cog }^j\right) \textbf{W}_{c o g}^{n v} \textbf{v}_{cog }^j\right) \end{aligned}$$
(2)

where \(\text {LayerNorm}\) is layer normalization, \(\Vert \) denotes the concatenation of H attention heads, \(A_{c o g}^i\) are the immediate neighbors of \(v_{c o g}^i\) given by the adjacency matrix \(A_{c o g}\), \(\textbf{W}_{cog }^{n v} \in \mathbb {R}^{d_{model } \times d_h}\) is a linear transformation, \(d_h=\frac{d_{model}}{H}\) is the dimension of each head, and \(\text {att}_{c o g}^n\left( \textbf{v}_{cog }^i, \textbf{v}_{cog }^j\right) \) is the self-attention mechanism for the n-th attention head:

$$\begin{aligned} \text {att}_{c o g}^n\left( \textbf{v}_{cog }^i, \textbf{v}_{cog }^j\right) =\frac{\exp \left( \left( \textbf{W}_{cog }^{n q} \textbf{v}_{cog }^i\right) ^{\top } \textbf{W}_{cog }^{n k} \textbf{v}_{cog }^j\right) }{\sum _{k \in A_{c o g}^i} \exp \left( \left( \textbf{W}_{cog }^{n q} \textbf{v}_{cog }^i\right) ^{\top } \textbf{W}_{cog }^{n k} \textbf{v}_{cog }^k\right) } \end{aligned}$$
(3)

where \(\textbf{W}_{cog }^{n q} \in \mathbb {R}^{d_{model } \times d_h}\), \(\textbf{W}_{cog }^{n k} \in \mathbb {R}^{d_{model } \times d_h}\) are linear transformations. Notably, when \(\textbf{v}_{cog }^i\) is the vector of a word in the dialogue context and \(\textbf{v}_{cog }^j\) is the vector of a semantically related concept, following [1], we update \(\textbf{v}_{cog }^j\) by subtracting the corresponding relation embedding from the concept embedding:

$$\begin{aligned} \textbf{v}_{cog }^{j}=\textbf{v}_{cog }^{j}-\boldsymbol{E}_{r}\left( \textbf{r}_{cog }^{i j}\right) \end{aligned}$$
(4)

where \(\boldsymbol{E}_{r}\left( \boldsymbol{r}_{cog }^{i j}\right) \in \mathbb {R}^{d_{model }}\) is the relation embedding between \(\textbf{v}_{cog }^i\) and \(\textbf{v}_{cog }^j\).

After that, we apply Transformer layers [18] to update the vector representations of vertices, incorporating global cognitive information into all vertices in \(G_{cog}\):

$$\begin{aligned} \tilde{\textbf{v}}_{cog}^i=\text {TRSEnc}\left( \hat{\textbf{v}}_{cog }^i\right) \end{aligned}$$
(5)

where \(\tilde{\textbf{v}}_{cog}^i \in \mathbb {R}^{d_{m o d e l}}\) is the updated cognitive vector representation of \(v_{cog}^i\), and \(\text {TRSEnc}\) represents Transformer encoder layers.

Finally, we use the cognitive vertex representation of the CLS token, i.e. \(\tilde{\textbf{v}}_{cog}^0\), as the global cognitive representation of \(G_{cog}\).
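A minimal PyTorch sketch of one graph-attention layer of Eqs. (2)-(3) is shown below. It assumes the relation adjustment of Eq. (4) has already been applied to the neighbor vectors, follows Eq. (3) in omitting the usual \(\sqrt{d_h}\) scaling, and leaves out details such as dropout; it is not the exact implementation.

```python
import torch
import torch.nn as nn

class GraphAttentionLayer(nn.Module):
    """One multi-head graph-attention layer with residual connection and LayerNorm."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_h = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, v, adj):
        # v: (n_nodes, d_model); adj: (n_nodes, n_nodes) 0/1 adjacency matrix.
        # Nodes without neighbors would need a self-loop to avoid NaNs.
        n = v.size(0)
        q = self.w_q(v).view(n, self.h, self.d_h).transpose(0, 1)   # (H, n, d_h)
        k = self.w_k(v).view(n, self.h, self.d_h).transpose(0, 1)
        val = self.w_v(v).view(n, self.h, self.d_h).transpose(0, 1)
        scores = q @ k.transpose(-2, -1)                             # (H, n, n)
        scores = scores.masked_fill(adj.unsqueeze(0) == 0, float("-inf"))
        att = torch.softmax(scores, dim=-1)                          # Eq. (3)
        out = (att @ val).transpose(0, 1).reshape(n, -1)             # concatenate heads
        return self.norm(v + out)                                    # Eq. (2)
```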

3.3 Affective Context Graph Constructing and Encoding

The procedures of Subsects. 3.2 and 3.3 are similar; only the knowledge selection strategies and the graph encoding approaches differ.

Affective Knowledge Selection. To inject appropriate affective knowledge into the dialogue context for detecting users’ emotions, we select emotional concepts with high emotion intensity values and low emotion gap values with respect to the dialogue context.

Analogous to Cognitive Knowledge Selection, for each non-stop word \(w_i^j\) of C, we first acquire its quadruples from ConceptNet and filter the affective knowledge by removing the quadruples that have low confidence scores or affection-unrelated relations.

Then we retrieve the VAD vectors (mentioned in Sect. 2) of the emotional concepts and calculate their emotion intensity values [8, 21]. The formula for calculating the emotional intensity value of concept c is as follows:

$$\begin{aligned} E I(c)=\min \text {-}\max \left( \left\| \left[ V(c)-\frac{1}{2}, \quad \frac{A(c)}{2}\right] \right\| _2\right) \end{aligned}$$
(6)

where \(\min \text {-}\max \) is the min-max normalization, \(\Vert \cdot \Vert _2\) is the \(L_2\) norm, and V(c) and A(c) are concept c’s values of the valence and arousal dimensions in the VAD vector, respectively. If c is not in NRC-VAD, EI(c) is set to 0.

Besides, we compute the values of the emotion gap between each nonstop word and its emotional concepts. The formula for computing the emotion gap value between word w and concept c is as follows:

$$\begin{aligned} E G(w, c)=\frac{\left| V(w)-V(c)\right| +\left| A(w)-A(c)\right| }{2} \end{aligned}$$
(7)

where \(\left| \cdot \right| \) denotes the absolute value.

Eventually, we keep the quadruples whose emotion gap values between head concept and tail concept (i.e. a word in the dialogue context and its emotional concept) are lower than 0.5, rank the remaining quadruples by emotion intensity, and select the top \(K_2\) ones.
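The following sketch illustrates the affective selection, assuming a `vad` dictionary mapping words to (valence, arousal, dominance) triples from NRC-VAD; the corpus-level min-max normalization of Eq. (6) is omitted since it does not change the ranking.

```python
def emotion_intensity(vad, concept):
    """Eq. (6) without min-max normalization; unknown concepts get intensity 0."""
    if concept not in vad:
        return 0.0
    v, a, _ = vad[concept]
    return ((v - 0.5) ** 2 + (a / 2) ** 2) ** 0.5   # L2 norm of [V - 1/2, A / 2]

def emotion_gap(vad, word, concept):
    """Eq. (7): mean absolute difference in valence and arousal."""
    v_w, a_w, _ = vad[word]
    v_c, a_c, _ = vad[concept]
    return (abs(v_w - v_c) + abs(a_w - a_c)) / 2

def select_affective_knowledge(vad, quadruples, k2, gap_threshold=0.5):
    """Keep quadruples with a small emotion gap between head (context word)
    and tail (emotional concept), then take the top-K2 by emotion intensity."""
    kept = [(emotion_intensity(vad, t), (h, r, t, s))
            for h, r, t, s in quadruples
            if h in vad and t in vad and emotion_gap(vad, h, t) < gap_threshold]
    kept.sort(key=lambda item: item[0], reverse=True)
    return [quad for _, quad in kept[:k2]]
```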

Affective Context Graph Constructing. We construct the affective context graph in a way similar to Cognitive Context Graph Constructing and represent it as \(G_{aff}\), and the emotional links in the knowledge-enhanced graph are stored in the affective adjacency matrix \(A_{aff}\).

Affective Context Graph Encoding. Taking the same steps as Cognitive Context Graph Encoding, we initialize the affective vector representation of every vertex \(v_{aff}^i\), and then update \(v_{aff}^i\) with affective knowledge. Note, however, that when encoding the affective context graph, the relations between vertices are not considered. We also employ Transformer layers [18] to update the vector representations of vertices, incorporating global emotional information into all vertices in \(G_{aff}\).

The affective vertex representation of the CLS token, that is \(\tilde{\textbf{v}}_{aff}^0\), is also used as the global affective representation of \(G_{aff}\).

3.4 Cognition and Affection Fusion

To fully exploit the original information of the dialogue context, we encode the raw dialogue context. We use the same initialization method as Cognitive Context Graph Encoding to acquire the embeddings of the context sequence, i.e. \(E_C\), and then feed them into new Transformer encoder layers to get the hidden representations of C:

$$\begin{aligned} H=\text {TRSEnc}\left( E_C\right) \end{aligned}$$
(8)

Then we use the hidden representation of the CLS token to represent the context sequence:

$$\begin{aligned} h=H[0] \end{aligned}$$
(9)

We then perform cognitive and affective linear transformations on the hidden representation h to obtain the corresponding cognitive and affective representations of the dialogue context, respectively:

$$\begin{aligned} h_{cog }=\textbf{W}_{coc } h \end{aligned}$$
(10)
$$\begin{aligned} h_{aff}=\textbf{W}_{aoc } h \end{aligned}$$
(11)

where \(\textbf{W}_{coc } \in \mathbb {R}^{d_{model } \times d_{model}}\), \(\textbf{W}_{aoc } \in \mathbb {R}^{d_{model } \times d_{model}}\) are the cognitive and affective linear transformations.

Emotion Classification. Our proposed model learns to predict the user’s emotional state to guide the empathetic response generation. We concatenate \(\widetilde{\textbf{v}}_{a f f}^0\) with \(h_{aff}\) to obtain a fused affective representation \(\tilde{h}_{a f f}\):

$$\begin{aligned} \tilde{h}_{a f f}=\tilde{\textbf{v}}_{a f f}^0 \oplus h_{a f f} \end{aligned}$$
(12)

where \(\oplus \) denotes concatenation and \(\tilde{h}_{a f f}\in \mathbb {R}^{2d_{model }}\).

Hence, we pass \(\tilde{h}_{a f f}\) through a linear layer followed by a Softmax operation to produce the emotion category distribution \(P_{e m o}\in \mathbb {R}^{q}\), where q is the number of emotion categories:

$$\begin{aligned} P_{e m o}=\text {Softmax}\left( W_{e m o} \tilde{h}_{a f f}\right) \end{aligned}$$
(13)

where \(W_{e m o}\in \mathbb {R}^{2d_{model } \times q}\) is the emotional linear transformation. During training, we conduct the parameter learning by minimizing the Cross-Entropy (CE) loss between the ground truth label \(e^*\) and the predicted label e:

$$\begin{aligned} \mathcal {L}_{e m o}=-\log \left( P_{e m o}\left( e=e^*\right) \right) \end{aligned}$$
(14)
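A compact sketch of the emotion classification head of Eqs. (12)-(14) is given below; the batching convention is an assumption, and the Softmax of Eq. (13) is folded into PyTorch’s cross-entropy loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionClassifier(nn.Module):
    """Fuse the graph-level and context-level affective representations
    and predict the emotion label."""
    def __init__(self, d_model, num_emotions):
        super().__init__()
        self.linear = nn.Linear(2 * d_model, num_emotions)

    def forward(self, v_aff_cls, h_aff, target=None):
        # v_aff_cls, h_aff: (batch, d_model); target: (batch,) gold emotion ids.
        h_aff_tilde = torch.cat([v_aff_cls, h_aff], dim=-1)   # Eq. (12)
        logits = self.linear(h_aff_tilde)                     # Eq. (13) before Softmax
        loss = F.cross_entropy(logits, target) if target is not None else None  # Eq. (14)
        return h_aff_tilde, logits, loss
```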

Information Integration. To generate empathetic responses, we integrate cognitive and affective information into the dialogue context. We first concatenate the global cognitive representation of the cognitive context graph, i.e. \(\tilde{\textbf{v}}_{cog}^0\), and the cognitive representation of the raw dialogue context, i.e. \(h_{c o g}\), to obtain a fused cognitive representation:

$$\begin{aligned} \tilde{h}_{c o g}=\tilde{\textbf{v}}_{cog}^0 \oplus h_{c o g} \end{aligned}$$
(15)

where \(\tilde{h}_{c o g} \in \mathbb {R}^{2d_{m o d e l}}\).

Then \(\tilde{h}_{c o g}\) and \(\tilde{h}_{a f f}\) are concatenated, and the combination is passed through a Multi-Layer Perceptron with ReLU activation, which learns a contextualized representation with adequate cognitive and affective information:

$$\begin{aligned} \hat{h}_{c t x}=\tilde{h}_{c o g} \oplus \tilde{h}_{a f f} \end{aligned}$$
(16)
$$\begin{aligned} \tilde{h}_{c t x}=\text {MLP}\left( \sigma \left( \hat{h}_{c t x}\right) \odot \hat{h}_{c t x}\right) \end{aligned}$$
(17)

where \(\hat{h}_{c t x} \in \mathbb {R}^{4d_{m o d e l}}\), \(\text {MLP}\) denotes the Multi-Layer Perceptron, and \(\odot \) denotes element-wise multiplication.
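The fusion of Eqs. (15)-(17) can be sketched as follows; interpreting \(\sigma \) as a sigmoid gate and the hidden and output sizes of the MLP are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class CognitionAffectionFusion(nn.Module):
    """Gate and fuse the cognitive and affective representations into h_ctx."""
    def __init__(self, d_model):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * d_model, d_model),  # assumed hidden size
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, h_cog_tilde, h_aff_tilde):
        # h_cog_tilde, h_aff_tilde: (batch, 2 * d_model)
        h_ctx_hat = torch.cat([h_cog_tilde, h_aff_tilde], dim=-1)   # Eq. (16)
        gated = torch.sigmoid(h_ctx_hat) * h_ctx_hat                # Eq. (17), gate
        return self.mlp(gated)                                      # Eq. (17), MLP
```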

3.5 Empathetic Response Generation

The contextualized representation of cognition and affection, i.e. \(\tilde{h}_{c t x}\), and the word embeddings of the target response \(R_{gold }\), i.e. \(E_w\left( R_{gold }\right) \), are fed as the inputs into the Transformer decoder layers to generate a response:

$$\begin{aligned} O=\text {TRSDec}\left( E_w\left( R_{g o l d}\right) , \tilde{h}_{c t x}\right) \end{aligned}$$
(18)
$$\begin{aligned} P_{r e s p}=\text {softmax}\left( W_o O\right) \end{aligned}$$
(19)
$$\begin{aligned} p\left( R_t \mid R_{<t}, G_{\text{ cog } }, G_{\text{ aff } }\right) =P_{\text{ resp } }[t] \end{aligned}$$
(20)

where \(O\in \mathbb {R}^{l_{R } \times d_{model}}\), \(l_{R }\) is the length of the predicted response, \(\text {TRSDec}\) represents the Transformer decoder layers, \(P_{r e s p} \in \mathbb {R}^{l_R \times |V|}\), |V| is the vocabulary size, and \(p\left( R_t \mid R_{<t}, G_{\text{ cog } }, G_{\text{ aff } }\right) \) is the distribution over the vocabulary V for the t-th word \(R_t\).
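A rough sketch of Eqs. (18)-(20) on top of torch.nn.TransformerDecoder is shown below; feeding \(\tilde{h}_{c t x}\) as a single-token memory is only one plausible way to condition the decoder, not necessarily the configuration used in our implementation.

```python
import torch
import torch.nn as nn

class ResponseDecoder(nn.Module):
    """Transformer decoder over the gold response, conditioned on h_ctx."""
    def __init__(self, vocab_size, d_model, num_heads, num_layers):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, response_ids, h_ctx):
        # response_ids: (batch, l_R); h_ctx: (batch, d_model)
        tgt = self.embed(response_ids)
        memory = h_ctx.unsqueeze(1)                # conditioning signal as memory
        causal = nn.Transformer.generate_square_subsequent_mask(
            response_ids.size(1)).to(tgt.device)
        out = self.decoder(tgt, memory, tgt_mask=causal)        # Eq. (18)
        return torch.log_softmax(self.out(out), dim=-1)         # log of Eq. (19)
```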

Then a standard Negative Log-Likelihood (NLL) is used to optimize generated responses:

$$\begin{aligned} \mathcal {L}_{g e n}=-\sum _{t=1}^{l_R} \log p\left( R_t \mid R_{<t}, G_{c o g}, G_{a f f}\right) \end{aligned}$$
(21)

To avoid generating generic empathetic responses, following [15], we adopt Frequency-Aware Cross-Entropy (FACE) [4] as an additional loss to penalize high-frequency tokens. Therefore, during the training process, we first compute the relative frequency of each token \({word}_i\) in the training corpus:

$$\begin{aligned} R F_i=\frac{\text {freq}\left( {word}_i\right) }{\sum _{j=1}^V \text {freq}\left( {word}_j\right) } \end{aligned}$$
(22)

where V is the vocabulary size of the training corpus. Accordingly, the frequency-based weight \(w_i\) can be calculated as follows:

$$\begin{aligned} w_i=a \times R F_i+1 \end{aligned}$$
(23)

where \(a=-\left( \max _{1 \le j \le V}\left( R F_j\right) \right) ^{-1}\) is the frequency slope, 1 is added as the bias so that \(w_i\) falls into [0, 1]. As done by [15], we normalize \(w_i\) to have a mean of 1. The diversity loss is finally computed as below:

$$\begin{aligned} \mathcal {L}_{d i v}=-\sum _{t=1}^T \sum _{i=1}^V w_i \delta _t\left( c_i\right) \log \textrm{P}\left( c_i \mid y_{<t}, C\right) \end{aligned}$$
(24)

where \(c_i\) is a candidate token in the vocabulary and \(\delta _t\left( c_i\right) \) is the indicator function, which equals 1 if \(c_i=y_t\) and 0 otherwise.
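The frequency-based weights and the diversity loss can be sketched as below, assuming a precomputed tensor of training-corpus token counts; this is an illustrative re-implementation of Eqs. (22)-(24), not the original FACE code.

```python
import torch

def frequency_weights(token_counts):
    """Eqs. (22)-(23): relative frequencies and weights, normalized to mean 1."""
    rf = token_counts.float() / token_counts.sum()
    a = -1.0 / rf.max()
    w = a * rf + 1.0            # high-frequency tokens receive smaller weights
    return w / w.mean()

def diversity_loss(log_probs, targets, weights):
    """Eq. (24): token-level NLL re-weighted by frequency weights.
    log_probs: (l_R, |V|) log-distributions; targets: (l_R,) gold token ids."""
    token_nll = -log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return (weights[targets] * token_nll).sum()
```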

Eventually, all the parameters of our proposed model are trained and optimized by jointly minimizing the emotional loss (Eq. 14), the generation loss (Eq. 21) and the diversity loss (Eq. 24) as follows:

$$\begin{aligned} \mathcal {L}=\gamma _1 \mathcal {L}_{e m o}+\gamma _2 \mathcal {L}_{g e n}+\gamma _3 \mathcal {L}_{d i v} \end{aligned}$$
(25)

where \(\gamma _1\), \(\gamma _2\), \(\gamma _3\) are hyper-parameters to balance the above three losses.

4 Experiments

4.1 Baselines

We compare our proposed model with the following baselines:

  • MoEL [9]: A variation of the Transformer consisting of one encoder and several decoders, each focusing on one emotion.

  • EmpDG [7]: An adversarial model that encodes the semantic context and multi-resolution emotional context respectively and interacts with user feedback.

  • MIME [10]: Another Transformer variation that performs emotion grouping and applies stochastic emotion sampling and emotion mimicry.

  • KEMP [8]: A knowledge-enriched model that uses commonsense and emotional lexical knowledge to explicitly understand and express emotions.

  • CEM [15]: Another knowledge-enriched Transformer-based model that uses commonsense to obtain more information about users’ situations.

4.2 Implementation Details

We conduct our experiments on EMPATHETICDIALOGUES [14], a large-scale dataset of 25k conversations grounded in emotional situations. The dataset considers 32 emotion labels, whose distribution is close to uniform. For our experiments, we use the original 8:1:1 train/validation/test split of this dataset.

We use PyTorch to implement the proposed model. The word embeddings are initialized with pre-trained GloVe vectors [13], and the relation embeddings are randomly initialized and fixed during training. For the positional embeddings, we follow the original paper [18]. The dimension of embeddings is set to 300 empirically. The maximum numbers of external concepts introduced per dialogue and per token are set to 10 and 1, respectively. The loss weights \(\gamma _1\), \(\gamma _2\), \(\gamma _3\) are all set to 1. We use the same Transformer hyper-parameters as [8], including the hidden size, the number of attention heads, etc. When training our proposed model, we use Adam and early stopping with a batch size of 16 and an initial learning rate of 1e-5. We vary the learning rate during training following [18], and use a batch size of 1 and a maximum of 30 decoding steps during testing and inference.

4.3 Evaluations

We evaluate our model from two aspects, i.e. automatic and human evaluations.

Automatic Evaluation. To evaluate the performance of KEEM, we first adopt Emotion Accuracy, i.e. the accuracy of emotion detection. Perplexity [16] is also utilized to measure the high-level general quality of the generation model; a response generated with higher confidence results in a lower perplexity. Furthermore, Distinct-1 and Distinct-2 [6] are used to measure the proportion of distinct unigrams and bigrams in all the generated results, which indicates the diversity of the produced responses.

Table 1. Results of automatic evaluation.

The results of the automatic evaluation are shown in Table 1. We observe that KEEM achieves the highest emotion accuracy, which suggests that the new strategy of selecting affective knowledge is beneficial for detecting users’ emotions. Although CEM [15] obtains a slightly lower perplexity score than ours, our proposed model considerably outperforms the baselines in terms of Distinct-1 and Distinct-2, which highlights the importance of the novel approaches to incorporating commonsense knowledge and constructing the cognitive context graph.

Human Evaluation. For qualitative evaluation, we conduct human A/B tests to compare KEEM with the five baselines, following [15]. For a given dialogue context, our model’s response is paired with a response from a baseline, and annotators are asked to choose the better one from the following three aspects: 1) Empathy: which one shows more understanding of the user’s situation and feelings; 2) Coherence: which one is more on-topic and relevant to the context; 3) Fluency: which one is more fluent and natural. We randomly sample 100 dialogues and their corresponding results from our model as well as the baselines, and then assign three crowdsourcing workers to annotate each pair.

As displayed in Table 2, responses generated by KEEM are more often preferred by human judges in empathy and coherence compared to the baselines. This demonstrates that, with the enhancement of commonsense and emotional knowledge, our model is able to produce more empathetic and relevant responses. We also notice that KEEM does not significantly outperform the baselines without external knowledge in fluency, which might imply that the incorporated knowledge has a negative influence on fluency. One reasonable explanation is that the selected knowledge contains not only useful information but also noise, which decreases the fluency of the responses.

Table 2. Results of human evaluation (\(\%\)).

4.4 Ablation Study

We conduct an ablation study to verify the effect and contribution of each component of KEEM. More specifically, we consider the following two variants of KEEM:

  • w/o Cog: We remove the procedures in Sect. 3.2 and delete the concatenation of the global and original cognitive representations of the dialogue context (Eq. 15). The fused cognitive representation is replaced with the latter in subsequent calculations.

  • w/o Aff: We ablate the new approach to filtering affective knowledge (Eq. 7). The effects of introducing affective knowledge and building and encoding affective context graphs have been proven in [8].

The results of the above two variants are shown in Table 1, which indicate that each component contributes to KEEM from different aspects. Specifically, removing the cognitive knowledge degrades performance on most metrics, suggesting that incorporating extra cognitive information helps to recognize the users’ situations and identify their emotions, to varying degrees. Ablating the novel strategy of selecting affective knowledge affects the emotion classification considerably and impairs the quality of the generated results slightly.

4.5 Case Study

Cases from KEEM and the five baseline models are listed in Table 3. In the first case, KEEM shows the best recognition of the user’s action and feeling by incorporating the concepts “destroy”, “joy”, and “great”, which correspond to the words “fell”, “cheer”, and “nice” in the dialogue context. In contrast, the baseline MoEL only attends to “baby” and ignores what the user did for the baby, and the other four baselines are only able to identify the positivity of the dialogue context, producing some context-unrelated content. In the second case, KEEM generates the most context-consistent and emotion-appropriate response, which expresses a good wish for the user about his future promotion, while MoEL and EmpDG do not even detect the user’s hopeful mood. Both cases demonstrate that KEEM can generate empathetic responses.

Table 3. Responses generated by different models.

5 Related Work

With the support of newly proposed datasets [14, 19], research in empathetic dialogue generation has developed rapidly. Rashkin et al. [14] explore adaptations of dialogue models for empathetic responding. Lin et al. [9] design listeners (i.e. decoders) responding to different emotions and softly combine the listeners’ outputs. Li et al. [7] propose a multi-resolution adversarial model to capture the nuances of user emotion and consider the potential of user feedback. Majumder et al. [10] observe that empathetic responses often mimic users’ emotions to varying degrees. Li et al. [8] construct an emotional context graph to perceive implicit emotions and learn emotional interactions. All of these works focus on how to perceive and express emotions.

Recently, some work [15] has attempted to boost both the cognitive and affective empathy of dialogue models. Sabour et al. [15] argue that cognitive empathy regarding the user’s situation should be considered, and they introduce commonsense knowledge to further enhance it. Nonetheless, due to some weaknesses of the knowledge base used (e.g. the limited commonsense relations, the unsatisfactory inference accuracy, and the inability to acquire fine-grained information), the effectiveness of this approach is limited.

6 Conclusion

In this paper, we propose a novel Knowledge-Enhanced EMpathetic dialogue generation model (KEEM) to demonstrate how leveraging commonsense and emotional knowledge benefits the cognition of users’ situations and the detection of users’ feelings, which helps produce more empathetic responses. We conduct experiments on the EMPATHETICDIALOGUES dataset, and our automatic and manual evaluations empirically prove the effectiveness of our approach for empathetic response generation. Nevertheless, as the results demonstrate, the model still has shortcomings in terms of fluency, which we leave as a direction for future work.