
1 Introduction

Conversational systems are an active topic in NLP research. One study [1] predicted that 80% of enterprises would be equipped with chatbots (conversational systems) by the end of 2021 and that the market would grow to $9.4 billion by 2024.

At its core, a text generation task is the task of generating an expected output sequence from a provided input sequence, commonly known as a sequence-to-sequence task. Thanks to advances in deep learning [2], numerous deep learning networks have been applied to dialogue systems, including recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformers, each offering a different approach to the complexities of dialogue generation. Although many models already perform well, traditional sequence-to-sequence models understand discourse poorly and tend to generate generic replies, because the input text alone carries little knowledge. For example, a recently proposed sequence-to-sequence model [3] can dynamically capture the range of local contexts and extract semantic information well, yet it still produces meaningless responses such as "I don't know" due to the lack of external knowledge. Models that introduce external knowledge [4, 5] have emerged to address this problem, and several studies demonstrate that external knowledge enhances performance: Huang [6] introduced a knowledge graph that enabled the model to answer 10% more responses than the original model, and introducing a knowledge graph into story generation also helps the model follow the storyline [7].

To address the aforementioned limitations [8], researchers have proposed background-based dialogue approaches. These methods aim to generate sensible and informative responses by combining background knowledge (unstructured information) with the input dialogue, producing responses that are contextually relevant and add valuable information to the conversation. Knowledge selection is one of the most critical modules in background-based dialogue: it must identify the appropriate knowledge from the background based on the conversation, and it directly affects the quality of the generated response.

Using appropriate external knowledge augmentation also enables model-generated responses to be implicitly emotional because, like humans, machines must rely on experience and external knowledge to express implicit emotions [9, 10]. A dialogue system with some empathy can generate more appropriate and fluent responses [11, 12].

Background-based dialogue is one branch of research on external knowledge enhancement; its advantage over traditional methods without knowledge enhancement is its use of unstructured external knowledge [13]. Recent studies have shown that the coverage of a single knowledge source is insufficient [14], and several studies have shown that using more knowledge sources improves the performance of knowledge-enhanced dialogue models [14, 15].

To tackle the challenges mentioned above, this paper introduces a common-sense emotional context enhanced dialogue model (CEC). To fully utilize all available information (session history, background knowledge, external knowledge), a double-matching approach is proposed to fuse the information for knowledge selection. The model first encodes the conversation history and background knowledge separately and then uses double matching to obtain the relevance weights among conversation history, background, and sentiment. Knowledge selection then yields a knowledge topic transformation vector, which is combined with the graph feature representation to generate fluent and informative responses.

In this paper, we perform an experimental analysis of CEC on Holl-E [16]. The experimental results show that CEC significantly outperforms the baseline model in machine evaluation, with stronger performance in knowledge selection and the ability to generate more appropriate responses.

The summarized contributions of this paper are as follows:

(1) We propose a dialogue model with emotional knowledge enhancement (CEC). By introducing common-sense knowledge and emotion information, the information implicit in the session is taken into account during knowledge selection, improving the accuracy of knowledge selection and producing more appropriate responses.

(2) We introduce external knowledge through graph construction and propose a double-matching matrix that integrates the conversation with knowledge from various sources to construct an emotional topic guidance vector that guides response generation.

2 Model

This model aims to combine external knowledge with background knowledge to improve the soundness of knowledge selection and to generate logically coherent responses. Formally, given a session \(C=\{c_{1},c_{2},c_{3},c_{4},...,c_{\Vert C\Vert }\}\), where \(c_{n}\) denotes the \(n^{th}\) word, and unstructured background knowledge \(K=\{k_{1},k_{2},k_{3},k_{4},...,k_{\Vert K\Vert }\}\), where \(k_{n}\) denotes the \(n^{th}\) word, the model generates a response \(R=\{r_{1},r_{2},r_{3},r_{4},...,r_{\Vert R\Vert }\}\) based on the conversation and the background knowledge. The overall model framework is shown in Fig. 1.

In this section, the four modules that make up the entire model are presented.

(1) Background Context Encoder: Two independent encoders encode the given history session and background knowledge, and an aggregation operation then yields the history session vector \(H_{C}\) and the background knowledge vector \(H_{K}\).

(2) Emotional Context Graph and Graph Encoder: ConceptNet and NRC_VAD, two sentiment enhancement resources, are combined with the session history C to form an emotional context graph G, which is fed into the graph encoder to obtain the graph feature representation \(H_{G}\).

(3) Knowledge Selection: Based on the double-matching matrix, the historical session representation \(H_{C}\), graph feature representation \(H_{G}\), and background knowledge representation \(H_{K}\) are used in the matching operations.

(4) Response Decoder: The knowledge topic transformation vector \(H_{CG \rightarrow k}^{s}\) and the graph feature representation \(H_{G}\) are concatenated to obtain the emotional topic guidance vector \(H_{GCK}^{g}\), and the module generates words from the vocabulary based on this vector.

The whole process can be summarized as follows: the history session C and the background knowledge K are fed into the context encoder; the session history is combined with the knowledge bases and passed through the graph encoding layer to obtain the graph feature representation; the knowledge selection module then chooses the relevant information, which guides the response decoder in generating the final response.

Fig. 1. The Overview of CEC

2.1 Background Context Encoder

We use two independent bidirectional GRUs to encode session history C and background knowledge K, respectively, to obtain \(h_{C}=\{h_{c_{1}},h_{c_{2}},h_{c_{3}},h_{c_{4}},...,h_{c_{\Vert C\Vert }}\}\) and \(h_{K}=\{h_{k_{1}},h_{k_{2}},h_{k_{3}},h_{k_{4}},...,h_{k_{\Vert K\Vert }}\}\).

$$\begin{aligned} h_{c_{t}}=BIGRU(e(c_{t}),h_{c_{t-1}}) \end{aligned}$$
(1)

The parameters of these two GRUs are independent. We apply a highway transformation to each sequence, combining the output at each step of the bidirectional GRU with the final hidden state, to obtain the historical session representation \(H_{C}\) and the background knowledge representation \(H_{K}\) used in the subsequent matching operations.

$$\begin{aligned} H_{k_{t}}=g_{k}(W_{1}[h_{k_{t}},h_{X_{\Vert x\Vert }}]+b)+(1-g_{k})tanh(W_{n1}[h_{k_{t}},h_{X_{\Vert x\Vert }}]+b) \end{aligned}$$
(2)
$$\begin{aligned} g_{k}=\sigma (W_{g}[h_{k_{t}},h_{X_{\Vert x\Vert }}]+b) \end{aligned}$$
(3)
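To make the encoder concrete, the following is a minimal sketch of Eqs. (1)-(3), assuming PyTorch. Class and variable names are invented, the gate is scalar per position, and the final conversation state stands in for \(h_{X_{\Vert x\Vert }}\); this is an illustration, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class BackgroundContextEncoder(nn.Module):
    def __init__(self, vocab_size, emb_size=300, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        # Two independent bidirectional GRUs: one for the conversation C,
        # one for the background knowledge K (Eq. 1).
        self.gru_c = nn.GRU(emb_size, hidden_size, bidirectional=True, batch_first=True)
        self.gru_k = nn.GRU(emb_size, hidden_size, bidirectional=True, batch_first=True)
        d = 2 * hidden_size
        # Highway transformation (Eqs. 2-3): gate g mixes a linear branch and
        # a tanh branch over [h_t ; h_last].
        self.w_lin = nn.Linear(2 * d, d)
        self.w_tanh = nn.Linear(2 * d, d)
        self.w_gate = nn.Linear(2 * d, 1)

    def highway(self, h_tokens, h_last):
        # h_tokens: (B, T, d); h_last: (B, d), broadcast to every position.
        x = torch.cat([h_tokens, h_last.unsqueeze(1).expand_as(h_tokens)], dim=-1)
        g = torch.sigmoid(self.w_gate(x))                                # Eq. 3
        return g * self.w_lin(x) + (1 - g) * torch.tanh(self.w_tanh(x))  # Eq. 2

    def forward(self, c_ids, k_ids):
        h_c, _ = self.gru_c(self.embed(c_ids))   # Eq. 1 over the conversation
        h_k, _ = self.gru_k(self.embed(k_ids))
        h_c_last = h_c[:, -1]                    # final conversation state
        H_C = self.highway(h_c, h_c_last)
        H_K = self.highway(h_k, h_c_last)
        return H_C, H_K
```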

2.2 Emotional Context Graph and Graph Encoder

In this module, we use ConceptNet and NRC_VAD, combined with the dialogue C, to construct the emotional context graph G. Inspired by Li et al., for each non-stopword of the dialogue we construct a series of candidate tuples \(T_{i}=\{t_{i}^{k}=(c_{i},r_{k}^{i},x_{k}^{i},s_{k}^{i})\}_{k=1,2,3,...,K}\) from the keywords in ConceptNet. The candidate tuples are filtered according to the following rules: (1) only tuples with confidence scores greater than 0.1 (\(s_{k}^{i}>0.1\)) are retained; (2) NRC_VAD is used to calculate the emotion intensity value \(\mu (x_{i}^{k})\), and the k tuples with the highest scores are selected. We then build the graph from the candidate tuples and the dialogue according to the following rules: (1) adjacent words point to the next word in order; (2) each selected candidate emotion word points to its keyword (\(c_{i}\)).
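The filtering rules above can be summarized in a short sketch. Here `concept_triples` and `vad_intensity` are hypothetical helpers standing in for ConceptNet lookups and NRC_VAD intensity scores; the paper does not specify its implementation.

```python
# Illustrative sketch of the candidate-tuple filtering rules in Sect. 2.2.
def build_candidates(tokens, stopwords, concept_triples, vad_intensity, top_k=10):
    candidates = {}
    for i, word in enumerate(tokens):
        if word in stopwords:          # only non-stopwords are expanded
            continue
        # ConceptNet tuples (c_i, r_k, x_k, s_k) for this keyword.
        tuples = concept_triples(word)
        # Rule 1: keep tuples with confidence score s_k > 0.1.
        tuples = [t for t in tuples if t[3] > 0.1]
        # Rule 2: rank by NRC_VAD emotion intensity of the tail concept x_k
        # and keep the top_k highest-scoring tuples.
        tuples.sort(key=lambda t: vad_intensity(t[2]), reverse=True)
        candidates[i] = tuples[:top_k]
    return candidates
```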

For the graph encoder, we need to transform each vertex of the sentiment graph G. Similar to the transformer model, our proposed model utilizes both the position embedding layer and the word embedding layer. Additionally, we incorporate the vertex state embedding to further enhance the model’s performance. Therefore, the vector representation of the entire vertex consists of three embeddings:

$$\begin{aligned} v_{i}=E_{w}(v_{i})+E_{p}(v_{i})+E_{v}(v_{i}) \end{aligned}$$
(4)

The vertices are then passed through a multi-head graph attention mechanism to obtain a deeper representation of each vertex.

$$\begin{aligned} \hat{v_{i}}=v_{i}+ \Vert _{n=1}^{H}\sum _{j\in A_{i} }a_{ij}^{n}W_{v}^{n}v_{j} \end{aligned}$$
(5)
$$\begin{aligned} a_{ij}^{n}=a^{n}(v_{i},v_{j}) \end{aligned}$$
(6)

where H is the number of attention heads, \(A_{i}\) is the set of vertices adjacent to \(v_{i}\) in G, and \(a^{n}\) is the self-attention module of the \(n^{th}\) head. To obtain a global contextual representation, after the multi-head graph attention layer we use transformer encoding layers for global modelling, yielding the emotional context graph representation \(h_{g}=\{\bar{v_{i}}\}\).

$$\begin{aligned} h_{i}^{l}=LayerNorm(\hat{v}_{i}^{l-1}+MHA(\hat{v}_{i}^{l-1})) \end{aligned}$$
(7)
$$\begin{aligned} \bar{v}_{i}^{l}=LayerNorm(h_{i}^{l}+FNN(h_{i}^{l})) \end{aligned}$$
(8)

where l denotes the \(l^{th}\) encoding layer, MHA denotes the multi-head attention module, and FNN denotes a two-layer feedforward network with ReLU activation.
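A compact sketch of the graph encoder (Eqs. (4)-(8)) follows, assuming PyTorch. `nn.MultiheadAttention` stands in for the per-head formulation of Eq. (5), with a boolean adjacency mask restricting attention to neighbours; layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class GraphEncoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8, d_ff=1024):
        super().__init__()
        self.graph_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # FNN of Eq. 8: two-layer feedforward network with ReLU.
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, v, adj_mask):
        # v: (B, T, d) vertex embeddings, each the sum of word, position,
        # and vertex-state embeddings (Eq. 4).
        # adj_mask: (T, T) boolean mask, True where attention is blocked
        # (i.e. j is NOT a neighbour of i), shared across the batch.
        local, _ = self.graph_attn(v, v, v, attn_mask=adj_mask)
        v_hat = v + local                              # residual form of Eq. 5
        # Standard transformer block for global modelling (Eqs. 7-8).
        h, _ = self.global_attn(v_hat, v_hat, v_hat)
        h = self.norm1(v_hat + h)                      # Eq. 7
        return self.norm2(h + self.ffn(h))             # Eq. 8
```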

2.3 Knowledge Selection

This module uses a double-matching matrix; the first matrix is constructed from the latent representations of the historical session \(H_{C}\) and the background knowledge \(H_{K}\) derived in Sect. 2.1.

$$\begin{aligned} M_{kc}[i,j]=V_{M}tanh(W_{m_{1}}H_{k_{i}}+W_{m_{2}}H_{c_{j}}) \end{aligned}$$
(9)

where \(V_{M}\) is a learnable vector, and \(W_{m_{1}}\) and \(W_{m_{2}}\) are learnable parameter matrices. To match the emotion graph features with the background features, we first use a multilayer perceptron (MLP) to transform the \(h_{g}\) derived in Sect. 2.2 into \(H_{G}\).

$$\begin{aligned} H_{G}=MLP(h_{g}) \end{aligned}$$
(10)

We use a similar approach to obtain the second matching matrix \(M_{kg}\):

$$\begin{aligned} M_{kg}[i,j]=V_{Mg}tanh(W_{mg_{1}}H_{k_{i}}+W_{mg_{2}}H_{g_{j}}) \end{aligned}$$
(11)

For this double-matching matrix pair, we apply a max-pooling layer along the context axis to obtain two context-aware background weight representations; each element of a representation is the weight of relevance to the context, with higher weights indicating greater relevance:

$$\begin{aligned} W_{C\rightarrow K}=\max _{x}(M_{kc}) \end{aligned}$$
(12)
$$\begin{aligned} W_{G\rightarrow K}=\max _{x}(M_{kg}) \end{aligned}$$
(13)
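The two matching matrices and their pooling (Eqs. (9)-(13)) can be sketched as follows, assuming PyTorch; the additive-attention form and the way the two pooled weight vectors are fused afterwards are illustrative choices, not necessarily the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DoubleMatching(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.w_m1 = nn.Linear(d, d, bias=False)    # W_{m1} in Eq. 9
        self.w_m2 = nn.Linear(d, d, bias=False)    # W_{m2} in Eq. 9
        self.v_m = nn.Linear(d, 1, bias=False)     # V_M in Eq. 9
        self.w_mg1 = nn.Linear(d, d, bias=False)   # W_{mg1} in Eq. 11
        self.w_mg2 = nn.Linear(d, d, bias=False)   # W_{mg2} in Eq. 11
        self.v_mg = nn.Linear(d, 1, bias=False)    # V_{Mg} in Eq. 11

    def match(self, h_k, h_other, w1, w2, v):
        # Additive matching: M[i, j] = v^T tanh(W1 h_k_i + W2 h_other_j).
        a = w1(h_k).unsqueeze(2)                   # (B, |K|, 1, d)
        b = w2(h_other).unsqueeze(1)               # (B, 1, |X|, d)
        return v(torch.tanh(a + b)).squeeze(-1)    # (B, |K|, |X|)

    def forward(self, H_K, H_C, H_G):
        M_kc = self.match(H_K, H_C, self.w_m1, self.w_m2, self.v_m)     # Eq. 9
        M_kg = self.match(H_K, H_G, self.w_mg1, self.w_mg2, self.v_mg)  # Eq. 11
        # Eqs. 12-13: max-pool over the context axis to get per-knowledge-word
        # relevance weights.
        W_c2k = M_kc.max(dim=-1).values            # (B, |K|)
        W_g2k = M_kg.max(dim=-1).values            # (B, |K|)
        # Fusing by summation is an illustrative choice for W_{CG->K}.
        return W_c2k + W_g2k
```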

Finally, we combine these two context-aware weight representations to obtain the emotional context-aware weight vector \(W_{CG\rightarrow K}\). Although this vector captures the relationships among the context, the emotion graph, and the background, it considers only the distribution of relationships at the word level and lacks the global perspective needed to derive a proper probability distribution for knowledge selection. Drawing inspiration from GLKS, we adopt sliding windows for global knowledge selection. In the knowledge selection module, we employ the “m-size unfold and sum” and “m-size unfold and attention” operations: the former obtains global semantic information, and the latter obtains global attention weights.

Through the first operation, “m-size unfold and sum”, we obtain a sliding semantic representation by the following formula:

$$\begin{aligned} {W}'_{CG\rightarrow K}=([{W}'_{CG\rightarrow K}]_{0:m},...,[{W}'_{CG\rightarrow K}]_{N:N+m},...) \end{aligned}$$
(14)
$$\begin{aligned}{}[{W}'_{CG\rightarrow K}]_{N:N+m}=\sum _{i=N}^{N+m}W_{CG\rightarrow K}[i] \end{aligned}$$
(15)

For the second operation, we apply “m-size unfold and attention” to the last layer of the background knowledge representation \(h_{K}\) to obtain the global attention representation \({H}'_{K}\):

$$\begin{aligned} {H}'_{K}=([{h}'_{K}]_{0:m},...,[{h}'_{K}]_{N:N+m},...) \end{aligned}$$
(16)
$$\begin{aligned}{}[{h}'_{K}]_{N:N+m}=\sum _{i=N}^{N+m}a_{i}h_{K}[i] \end{aligned}$$
(17)
$$\begin{aligned} a_{i}=att(h_{c_{\Vert C\Vert }},[h_{k_{N}}...h_{k_{N+m}}]) \end{aligned}$$
(18)

where \(a_{i}\) represents the attention weight of the session over the background knowledge. We then combine the background knowledge K to generate the knowledge topic transformation vector \(H_{CG\rightarrow k}^{s}\):

$$\begin{aligned} H_{CG\rightarrow k}^{s}=\sum _{N}P(K_{N}:K_{N+m}\vert C)[{h}'_{K}]_{N:N+m} \end{aligned}$$
(19)
$$\begin{aligned} P(K_N:K_{N+m} \vert C) \propto \text {softmax}([{W}'_{CG \rightarrow K}]_{N:N+m}) \end{aligned}$$
(20)
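The two sliding-window operations and the final expectation (Eqs. (14)-(20)) can be sketched with tensor unfolding, assuming PyTorch. Using a single precomputed attention vector over the background is a simplification of the per-window attention in Eq. (18); names and the window size are illustrative.

```python
import torch
import torch.nn.functional as F

def sliding_knowledge_selection(w_cgk, h_k, attn_weights, m=5):
    # w_cgk: (B, |K|) fused relevance weights W_{CG->K}.
    # h_k: (B, |K|, d) last-layer background states.
    # attn_weights: (B, |K|) attention of the final conversation state over K.
    # Eqs. 14-15: sum the relevance weights inside every m-word window.
    window_scores = w_cgk.unfold(1, m, 1).sum(-1)            # (B, N_w)
    # Eqs. 16-17: attention-weighted sum of h_k inside every window.
    weighted = attn_weights.unsqueeze(-1) * h_k              # (B, |K|, d)
    window_states = weighted.unfold(1, m, 1).sum(-1)         # (B, N_w, d)
    # Eq. 20: softmax over windows gives P(K_N : K_{N+m} | C).
    p = F.softmax(window_scores, dim=-1)                     # (B, N_w)
    # Eq. 19: expectation over windows yields H^s_{CG->k}.
    return (p.unsqueeze(-1) * window_states).sum(dim=1)      # (B, d)
```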

2.4 Response Decoder

During each decoding time step, the response decoder concatenates the knowledge topic transformation vector \(H_{CG \rightarrow k}^{s}\) and the graph feature representation \(H_{G}\) to obtain the emotional topic guidance vector \(H_{GCK}^{g}\). Based on this vector, the decoder computes the probability of generating from the vocabulary and the probability of copying directly from the background, and a gate mechanism finally decides how to generate.

First, we concatenate the embedding of the previously generated word with \(H_{CG \rightarrow k}^{s}\) and \(H_{G}\):

$$\begin{aligned} H_{GCK}^{g}=[H_{CG \rightarrow k}^{s},H_{G},e(r_{t-1})] \end{aligned}$$
(21)

where \(e(r_{t-1})\) denotes the embedding of the word generated at the previous time step. We then use an attention module to attend from the knowledge-emotion topic vector over the background knowledge K, yielding the background guidance vector \(\bar{h}_{K_t}\); similarly, attending over the session history C yields the session guidance vector \(\bar{h}_{C_t}\):

$$\begin{aligned} \bar{h}_{K_{t}}=\sum _{i=1}^{\Vert K\Vert }a_{K_{i}}h_{K_{i}} \end{aligned}$$
(22)
$$\begin{aligned} a_{K_{i}}=attention(H_{GCK_{t}}^{g},h_{K}) \end{aligned}$$
(23)

We then concatenate the two guidance vectors with the knowledge-emotion topic vector and use a readout layer to obtain an overall feature vector \(\bar{h}_{r_t}\).

$$\begin{aligned} \bar{h}_{r_t} = \text {readout}(H_{GCK_t}^g, \bar{h}_{K_t}, \bar{h}_{C_t}) \end{aligned}$$
(24)

The feature vector \(\bar{h}_{r_t}\) is fed into a linear layer with a softmax function to obtain the probability of generating words from the vocabulary:

$$\begin{aligned} P_v(r_t) = \text {softmax}(W_v \bar{h}_{r_t}) \end{aligned}$$
(25)

For \(P_k(r_t)\), we use an attention module over the background knowledge to learn the start and end position pointers of the copied span.

$$\begin{aligned} P_k(r_t) = \text {attention}(H_{GCK_t}^g, h_K) \end{aligned}$$
(26)

Finally, we combine \(P_v(r_t)\) and \(P_k(r_t)\) as follows:

$$\begin{aligned} P(r_t) = gP_v(r_t) + (1-g)P_k(r_t) \end{aligned}$$
(27)
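A single decoding step (Eqs. (21)-(27)) might look like the following sketch, assuming PyTorch. The dot-product copy distribution simplifies the start/end pointer mechanism of Eq. (26), scattering the copy probabilities onto vocabulary ids is omitted, and all layer names are invented.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResponseDecoderStep(nn.Module):
    def __init__(self, d=512, emb=300, vocab=26000):
        super().__init__()
        self.proj = nn.Linear(2 * d + emb, d)   # projects H^g_{GCK} (Eq. 21)
        self.readout = nn.Linear(3 * d, d)      # readout layer (Eq. 24)
        self.w_v = nn.Linear(d, vocab)          # vocabulary projection (Eq. 25)
        self.gate = nn.Linear(d, 1)             # mixing gate g (Eq. 27)

    def attend(self, query, keys):
        # Dot-product attention standing in for Eqs. 22-23 and Eq. 26.
        return F.softmax(torch.einsum('bd,btd->bt', query, keys), dim=-1)

    def forward(self, h_topic, h_graph, e_prev, h_k, h_c):
        # h_topic: (B, d) H^s_{CG->k}; h_graph: (B, d) pooled H_G;
        # e_prev: (B, emb) embedding of the previous word;
        # h_k: (B, |K|, d) background states; h_c: (B, |C|, d) session states.
        h_gck = self.proj(torch.cat([h_topic, h_graph, e_prev], dim=-1))  # Eq. 21
        a_k = self.attend(h_gck, h_k)                          # Eq. 23
        h_k_bar = torch.einsum('bt,btd->bd', a_k, h_k)         # Eq. 22
        a_c = self.attend(h_gck, h_c)
        h_c_bar = torch.einsum('bt,btd->bd', a_c, h_c)
        h_r = self.readout(torch.cat([h_gck, h_k_bar, h_c_bar], dim=-1))  # Eq. 24
        p_v = F.softmax(self.w_v(h_r), dim=-1)                 # Eq. 25
        p_k = self.attend(h_gck, h_k)                          # copy dist. (Eq. 26)
        g = torch.sigmoid(self.gate(h_r))                      # Eq. 27
        # Merging p_k into the vocabulary distribution requires scattering
        # over the background token ids, omitted here for brevity.
        return g * p_v, (1 - g) * p_k
```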

3 Experiments

3.1 Implementation Details

The word embedding size is set to 300 and the hidden layer size to 256. The vocabulary is limited to approximately 26,000 words, the conversation history to 65 tokens, and the background knowledge to 256 tokens. We use the Adam optimizer with a batch size of 32. The model was trained for 20 epochs, and the best-scoring checkpoint was used in the evaluation phase.
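For reference, the hyperparameters above collected as a configuration sketch (key names are ours, not from any released code):

```python
config = {
    "embedding_size": 300,
    "hidden_size": 256,
    "vocab_size": 26000,        # approximate
    "max_context_len": 65,      # conversation history, in tokens
    "max_background_len": 256,  # background knowledge, in tokens
    "optimizer": "Adam",
    "batch_size": 32,
    "epochs": 20,               # best checkpoint used for evaluation
}
```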

3.2 Datasets

To ensure a more accurate representation of the model’s performance, we opted for Holl-E as the benchmark for our comparative experiments. The number of samples in the datasets is shown below.

Holl-E: This dataset provides background knowledge together with gold knowledge selection labels. It focuses on movies: two people converse about a movie's plot, and each response is a modification or copy of the background knowledge. The background knowledge consists of four parts: the movie plot, reviews, professional commentary, and fact sheets related to the movie. The experiments in this paper use two versions of Holl-E: oracle background and mixed-short background. Following the original partitioning, the training set contains 34,486 samples, the validation set 4,388 samples, and the test set 4,318 samples (Table 1).

Table 1. Dataset sizes.

3.3 Evaluation Metrics

In this paper, the evaluation metrics chosen for automatic evaluation are ROUGE-1, ROUGE-2, and ROUGE-L. Since dialogue responses are generated from background knowledge, previous studies have shown that these metrics are consistent with BLEU; employing them therefore provides a comprehensive assessment of the model's performance.
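For reproducibility, the scores can be computed with, e.g., Google's `rouge-score` package; the paper does not name its scoring tool, so this is one plausible choice:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the plot follows the detective through the city",   # reference response
    "the plot follows a detective across the city",      # generated response
)
print({name: round(s.fmeasure, 3) for name, s in scores.items()})
```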

3.4 Results

The experimental results are shown in the tables. Table 2 and Table 3 report the results on the oracle background and mixed-short background versions of Holl-E, respectively.

The experimental results demonstrate that CEC outperforms the baseline models across all metrics, providing evidence that CEC improves knowledge selection and generates more appropriate responses. Compared with BiDAF, an extraction-based method, CEC benefits from combining extractive and generative approaches and generates more reasonable and natural responses while still using the background knowledge well. RefNet requires span annotations, whereas CEC needs no additional annotation and still locates the correct background knowledge better; this is because we use guidance vectors and learn two pointers to locate background spans during generation. Compared with AKGCM, which fuses knowledge graphs, and GLKS, previously the strongest model in knowledge selection, CEC connects structured knowledge in a more principled way and at the same time significantly improves knowledge selection. Our advantage lies in the double-matching matrix, which effectively fuses structured and unstructured knowledge; this leads to a substantial improvement in knowledge selection while ensuring that empty responses are not generated. Across the two versions of the Holl-E dataset, every model in both tables (including CEC) performs better on the mixed-short background version than on the oracle background version, because the knowledge in the oracle background draws on only one source and thus carries less information. Additionally, compared with the improvements the baseline models obtain across the two datasets, the improvement of CEC is less pronounced; this may be because the knowledge richness of the dataset already reaches a sufficient level, so the added knowledge contributes little extra. The above analysis shows that including additional knowledge in a session is essential and that choosing the right way to integrate different knowledge types can improve response quality.

Table 2. Results on oracle background (256-word)
Table 3. Results on mixed-short background (256-word)

3.5 Ablation Study

Since CEC behaves consistently across datasets, the ablation experiments in this section are conducted only on the oracle background version. We analyze three variants: (1) w/o emo_embedding+emo_match: neither the sentiment matching matrix nor the sentiment vector; (2) w/o emo_match: no sentiment matching matrix; (3) w/o emo_embedding: no sentiment vector.

The experimental results are shown in Table 4. Both the sentiment matching matrix and the sentiment vector affect the final generation, and removing either degrades performance. The degradation is most obvious when the sentiment matching matrix used for knowledge selection is removed (w/o emo_match). This demonstrates that the added sentiment-structured knowledge significantly improves the accuracy of knowledge selection and overall model performance, possibly because this knowledge is generated from the current session and is therefore highly relevant and rich in useful information. Finally, to validate the effectiveness of the sentiment vector, we remove it (w/o emo_embedding) when composing the sentiment topic guidance vector. The results demonstrate that the sentiment vector improves the generation module, meaning it provides emotional information beyond the session itself; it also improves the correctness of the selected knowledge during response generation, making the responses more reasonable and better grounded.

Table 4. Ablation study

4 Conclusion

In this article, we introduce external knowledge by constructing an emotion graph, generate an emotion vector using graph attention, and then use a matching matrix to combine the background knowledge with the emotion vector, enhancing both the precision of knowledge selection and the naturalness of response generation. The experimental results surpass all baselines.

This paper introduces a sentiment knowledge base. Although it improves the final responses, the model does not explicitly model sentiment classification or recognition, so it can only constrain the generated responses to the sentiment of the session. To build an empathetic dialogue model, our future work will focus on enhancing the model's capabilities in emotion recognition and inference.