1 Introduction

Dialogue systems aim to use large-scale data to provide users with fast and concise answers. They can be divided into single-turn and multi-turn dialogue systems. Because multi-turn dialogue is closer to actual situations, it has been widely used in real-world applications, including customer service systems, personal assistants, and chatbots.

To make a dialogue system behave more like a human, two kinds of methods are mainly used: retrieval-based methods and generation-based methods. The retrieval method is widely used in single-turn dialogue to respond quickly to user requests. A multi-turn dialogue is much more complicated than a single-turn dialogue: the system must handle more contextual information and extract the effective parts of the context to generate a better response. How to use complex context effectively to generate a more appropriate response is therefore an important problem in the multi-turn dialogue generation task.

In a multi-turn conversation model, it is very difficult to determine which turns of the context are important. The Hierarchical Recurrent Encoder-Decoder (HRED) [14, 17] is widely used in multi-turn dialogue generation. It captures context information through a hierarchical encoder-decoder structure, adding an additional encoder to the traditional encoder-decoder model. Compared with an ordinary recurrent neural network (RNN) language model, it can capture the context, reduce the calculation steps between adjacent sentences, and realize multi-turn dialogue generation. However, the performance of the model suffers if the context is handled indiscriminately, and the model often fails to grasp the information of specific sentences. To address this problem, the relevant context with self-attention (ReCoSa) [25] model uses a long-distance self-attention mechanism to model the context and the response separately, so as to find out which words are important. It uses self-attention to encode and decode the information, and the parameters of both encoder and decoder are learned by maximizing the averaged likelihood of the training data. However, because long-distance attention is used only to calculate the weights of the context, the model tends to lose the valid information of the current turn and often replies with repeated information.

Previous research has shown that considering more specific information is very effective in multi-turn conversation models. Short-text Topic-level Attention Relevance with Biterm Topic Model (STAR-BTM) [24] integrates topic information into dialogue generation. However, topic information is an implicit representation of a conversation, so little valid information can be learned from it. Conditional Historical Generation (CHG) [26] focuses more on the integration of historical information. But CHG cannot learn additional information, and the repeated, useless historical dialogue information of a session cannot improve the performance of the model.

Table 1. An example in the Daily dialogue dataset.

On the other hand, it is also a big challenge to use the valid information in the context to retrieve relevant historical information and integrate it into the model. There have been many attempts to integrate retrieval methods and generative methods in single-turn dialogue generation. Zhu et al. [28] use adversarial training to integrate retrieval information in single-turn dialogue. They propose a Retrieval-Enhanced Adversarial Training approach to make better use of N-best response candidates. However, using adversarial training in text generation tasks greatly increases the difficulty of training the model. To our knowledge, no one has yet combined retrieval methods and generative methods in multi-turn dialogue, possibly because of the complexity caused by multiple turns.

Table 1 shows an example from the Daily dialogue dataset. People often say more than one sentence in a turn, which makes it harder for multi-turn dialogue models to capture the most effective information. Utterance 1, utterance 2, and utterance 3 are former turns; the current turn is the question being asked; retrieval 1 and retrieval 2 are the answers obtained by the retrieval method according to the contextual information. The words in red represent the response obtained by integrating, at the word level, important words from different turns and from the historical information. We argue that the current turn is the most important, and that applying an attention mechanism on the current turn yields different sentence-level weights for the previous turns. From Table 1, we can see that the response is related to the current turn information, while the context information and historical information serve as a supplement. Utterance 1 and utterance 2 talk about the price of the product. Utterance 3 talks about the reasons for the rise in commodity prices. In the current turn, the customer wants a better price. Retrieval 1 and retrieval 2 supplement the context, supplying more background knowledge and aiming to reach a deal with the customer in terms of quantity. The response can give the right answer according to the context and the historical topic, such as the ‘scale’ and ‘profit’ information. Because historical information is retrieved, the response is more diverse, and the historical information gives the response a correct direction. Without historical information, it is difficult to guarantee the diversity of the generated dialogue and its consistency with the context.

In this paper, we propose the hybrid retrieval and generation (HRG) model, a multi-turn dialogue generation model that combines generated information with retrieved historical information. The model can obtain information about the same scene according to semantic similarity. We separate the contexts into two parts: the current turn information and the previous turn information. We argue that the current turn information is the most important context. The motivation of this paper is that combining generative and retrieval methods makes the generated response more in line with the actual context, and lets the model capture more information and find latent features that maintain the semantic consistency of the context. Compared with existing multi-turn dialogue methods, we use retrieved historical information and propose a novel fusion method to integrate it hierarchically. Compared with the existing single-turn dialogue method, our model fits multi-turn tasks. In the proposed HRG model, we use KL divergence to measure the difference between the retrieval information and the context information, give different weights to different contexts, and finally use a hierarchical fusion encoder to dynamically integrate the retrieved information. KL divergence is also widely used in computer vision to generate images and to increase generalization ability.

In the retrieval stage, we first use a semantic similarity method to encode different sentences and then find the most relevant answer for the given sentence. Because the amount of data is very large, our retrieval method uses the distilled robustly optimized BERT pretraining approach (RoBERTa) [10]. In the generative stage, we use Transformer encoder blocks to capture context information, current turn information, and retrieved historical information separately, and then use the hierarchical fusion encoder to integrate the retrieval information. Finally, we send the fusion vectors into the decoder to get the response. In our experiments, we use two public datasets, the Daily dialogue dataset and the English Ubuntu community dataset, to evaluate our model. The results show that our model produces more flexible and more appropriate responses than the existing baseline models. The contextual consistency of the dialogue is maintained after incorporating retrieval information, which shows that our method is effective and reasonable.

The contributions of this paper are summarized as follows:

  • We propose the HRG model, which integrates the retrieved historical information on the basis of the generative model. Due to the employment of historical information, our model can find latent information and maintain the consistency of the dialogue context.

  • We propose a novel hierarchical fusion encoder that dynamically integrates the retrieved historical information through a KL divergence gate. The hierarchical fusion encoder makes effective use of historical information.

  • We conduct experiments on the Ubuntu large-scale English multi-turn dialogue community dataset and Daily dialogue dataset. The experimental results show that our hybrid model performs well on both automatic evaluation and human evaluation compared with the existing baseline models.

2 Related Work

Most existing dialogue systems are based on the retrieval method. They use it to find rich information and respond smoothly: they take the information and discourse of the previous turns as input and select a context-sensitive natural response. However, the answers produced by generation-based methods are more flexible and can cope with complex contexts.

2.1 Single-Turn Response Matching

Retrieval-based methods choose a response from candidate responses and focus on message-response matching. Matching algorithms have to overcome semantic gaps between messages and responses [3]. Early studies of retrieval-based chatbots focus on response selection [20, 21], where only a single-turn message is used to select a proper response. They calculate the similarity between the context and the answer vector, which is encoded by a long short-term memory (LSTM) network. Recent semantic methods calculate the similarity of sentences with pre-trained models, such as [2, 10]. Azzalini et al. [1] try to use entity linkage to improve semantic quality.

2.2 Multi-turn Response Matching

In multi-turn retrieval matching, existing work splices the utterances in the context together and matches them against the final response. In multi-turn response selection, the current message and the previous utterances are taken as input, and the model selects from the repository the response that is most relevant to the whole context. Identifying the important information in the previous contexts is therefore essential. Lowe et al. [11] encoded the context and the candidate response into a context vector and a response vector through an RNN and then computed the similarity between the two vectors. The next improvement was to establish a strategy for selecting context information: Yan et al. [23] selected the previous utterances with different strategies and combined them with the current message to get the answer, which proved effective.

2.3 Single-Turn Response Generation

Early end-to-end open-domain dialogue generation work was inspired by neural machine translation [13, 16]. A good generator can often cope with complex contexts and produce a fluent, grammatical answer. The most widely used method is the sequence-to-sequence (Seq2Seq) model [18], which uses an RNN or LSTM as encoder and decoder. Li et al. [6] try to use meta-learning to train a better encoder. Given a context sequence, an RNN-based encoder first encodes each message, and in the decoding stage another RNN decoder generates the response. The parameters of both encoder and decoder are learned by maximizing the averaged likelihood of the training data. With the introduction of the attention mechanism, the Seq2Seq model performs better: the decoder can use attention to weight sentence information differently, which improves the accuracy of generation.

2.4 Multi-turn Response Generation

Although there is much existing research on single-turn dialogue generation, multi-turn dialogue generation has gained increasing attention [9, 12, 27]. One reason is that it is closer to real application scenarios, such as customer service chatbots. Because more information must be considered and customers often change topics in a multi-turn dialogue, it poses huge challenges to researchers in this field. In the multi-turn dialogue generation task, the hierarchical recurrent encoder-decoder (HRED) architecture [14] was proposed to capture context information. Later, Serban et al. proposed an HRED model with hidden variables, called VHRED [15]. This method introduces hidden variables into the intermediate state of HRED to improve the diversity of the generated dialogue. The ReCoSa [25] model can find the most important word-level information in multi-turn dialogue; it uses a long-distance self-attention mechanism [19] to model multi-turn dialogues. The hierarchical self-attention network (HSAN) [5] can find the most important words and utterances in the context simultaneously.

However, although these methods fully model the context information, they do not consider the diversity of the generated sentences or the consistency of the context information. Data-driven single-turn dialogue systems can be roughly divided into two categories: generative and retrieval. Generative methods, like the multi-turn dialogue generation methods, improve on the basic Seq2Seq model. Recently, the decoder of the Transformer block [19] has been widely used because of its speed advantage, but this method often generates repetitive and meaningless responses.

Some works integrate retrieval and generative methods. Zhu et al. [28] used adversarial training to combine generated sentences with sentences obtained from retrieval and achieved good results, but this method is based on single-turn dialogue, and generative adversarial networks (GANs) are hard to train. Similar to fusing retrieval information, STAR-BTM [24] integrates topic information into dialogue generation. CHG [26] proposes a merchant history conversation selection module which can copy words directly from the relevant history conversations. However, it is difficult to retrieve valid information and fuse it into the context, and, as far as we know, no one has merged retrieved information into multi-turn dialogue.

3 HRG Model

Fig. 1. The architecture of the Hybrid Generation and Retrieval model. Each turn passes through the Bi-GRU to obtain its sentence representation, and the turn-level position encoding is added.

In this section, we illustrate our model in detail, whose architecture is depicted in Fig. 1. The red bar represents the distribution of retrieved historical information. The knowledge base contains all candidate context information.

3.1 Current Turn and Context Attention

To find the most important turns in contexts, we get the weights of the contexts through the attention mechanism. Given contexts:

$$\begin{aligned} Contexts = \left\{ {utterance{_1},utterance{_2},...,utterance{_n}} \right\} , \end{aligned}$$

each \(utterance{_i}\) in contexts can be represented as \( utterance{_i} = \left\{ {{w_1},{w_2},...,{w_n}} \right\} \), where \({w_i}\) represents a word in an utterance. Each sentence in a session is encoded separately by a Bi-GRU to obtain its representation; we then compute word-level importance coefficients and obtain every sentence representation through the current turn and context attention. In addition, we add a positional embedding to the contexts so as to find the importance of different contexts at the sentence level. The final contexts representation is as follows:

figure f

where \(TOS_i\) is the turn of sentence embedding, a positional embedding that indicates the order of turns, and the other two terms are the sentence vectors of the two Bi-GRU directions, respectively. The current turn representation is:

figure i

The context attention layer and attention formula are calculated as follows,

$$\begin{aligned} Attention = softmax \left( {\frac{{Q{K^T}}}{{\sqrt{{D_K}} }}} \right) V \end{aligned}$$
(1)
$$\begin{aligned} h^{CurCon} = Attention\left( {CurrentTurn,Contexts} \right) , \end{aligned}$$
(2)

where \(h^{CurCon}\) is the current turn and contexts attention representation.
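Formulas 1 and 2 can be illustrated with a minimal, dependency-free sketch. It assumes that the current turn supplies the queries and that the context sentence vectors serve as both keys and values; the text does not state this mapping explicitly, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(D_K)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # One score per key, scaled by sqrt(D_K) as in formula 1.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


def current_context_attention(current_turn, contexts):
    """Formula 2, under the ASSUMPTION that Q = current turn and K = V = contexts."""
    return attention(current_turn, contexts, contexts)


h_cur_con = current_context_attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

With an all-zero query, every context receives the same weight, so the result is the average of the two context vectors.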

3.2 Retrieval

To obtain supplementary information related to the contexts, we use the distilled RoBERTa [10] language model. First, we set up message-response pairs and use RoBERTa to encode each message into a 768-dimensional vector. Then we use each message as a query to compute semantic similarity against the vector library and search for the best-matching messages within the range k. Finally, during training, the semantic algorithm returns the best-matching message according to the matching score; we call it the candidate context. In addition, our retrieval model is trained together with the generation model, which improves not only the quality of generation but also the retrieval of similar sentences. In this way, our model can enrich both the context and the response fields.

Since the training set used for retrieval is the same as the training set used for model training, we exclude the corresponding retrieved sentences from the ranking to prevent the retrieval method from seeing the ground truth. The similarity between the retrieval vectors and the attention vectors is computed with cosine similarity.
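The retrieval step can be sketched as a top-k cosine-similarity search over pre-encoded vectors. This is an illustrative sketch only: the vectors are assumed to have been produced beforehand by the sentence encoder, the `library` dictionary, the `exclude_ids` argument, and the function names are hypothetical, and a real system over a large corpus would likely use an approximate nearest-neighbor index instead of a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, library, k, exclude_ids=()):
    """Return the k (message_id, score) pairs most similar to query_vec.

    library maps a message id to its pre-encoded vector. Ids listed in
    exclude_ids (for example the query's own training pair) are skipped so
    that the ground truth cannot be retrieved.
    """
    excluded = set(exclude_ids)
    scored = [(cosine_similarity(query_vec, vec), msg_id)
              for msg_id, vec in library.items()
              if msg_id not in excluded]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(msg_id, score) for score, msg_id in scored[:k]]


# Toy 2-dimensional vectors stand in for the 768-dimensional encodings.
library = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
candidates = retrieve_top_k([1.0, 0.0], library, k=2, exclude_ids=["a"])
```

Here message "a" is identical to the query and is excluded, so the ranking starts with "b".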

Algorithm 1 shows how to find the semantic retrieval vectors, where \(h^{Con}\) and \(h^{Cur}\) represent the context vector and the current turn vector. Given the retrieval knowledge as input, the algorithm returns the most relevant retrieval vectors. The similarity used in the algorithm is computed on pre-encoded vectors with the cosine similarity described above.

figure j

3.3 Hierarchical Fusion Encoder

To keep the context and the response semantically consistent, it is very important to integrate the information obtained from semantic retrieval into the context. Based on this principle, we propose a hierarchical divergence fusion encoder and use KL divergence, which measures the difference between two distributions, to compare the word distributions of the context and the retrieved context. We then use a hierarchical encoder to fuse the retrieval and context information according to this difference.

From the retrieval part, we get the context candidate and the response candidate. Taking context-level fusion as an example, to detect the difference between the retrieved candidate context information and the context information, we pass the candidate context through the KL gate, defined as follows:

$$\begin{aligned} {D_{KL}}\left( {h^{Con}|h^{Ret}} \right) = \sum \limits _i {h_i^{Con}\log \frac{{h_i^{Con}}}{{h_i^{Ret}}}} \end{aligned}$$
(3)
$$\begin{aligned} Gate = {{D_{KL}}\left( {h^{Con}|h^{Ret}} \right) }, \end{aligned}$$
(4)

where i represents the \(i^{th}\) utterance in the context. From formula 2 we get the attention weights of the contexts, which we use to control the amount of information in each context. Our hierarchical fusion encoder dynamically fuses the context at each time step to get the final contexts vector. The representation of the fused \(context_i\) is calculated as follows:

$$\begin{aligned} h^{Fus} = h^{Con}*W_{Con} + \left( {1 - Gate} \right) *h^{Ret}, \end{aligned}$$
(5)

where \(W_{Con}\) represents the attention weights in the context attention, and \(h^{Fus}\) represents the final fused retrieval vector.
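Formulas 3 to 5 can be traced for a single context vector with the sketch below. Several points are assumptions that the text does not specify: both vectors are turned into distributions with a softmax before the KL divergence is taken, the attention weight `w_con` is a single scalar per context, and a small `eps` guards the logarithm. The function names are hypothetical.

```python
import math


def softmax(x):
    """Numerically stable softmax, used here to obtain a distribution."""
    peak = max(x)
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """Formula 3: sum_i p_i * log(p_i / q_i); eps avoids log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def kl_gate(h_con, h_ret):
    """Formula 4, ASSUMING softmax-normalized inputs."""
    return kl_divergence(softmax(h_con), softmax(h_ret))


def fuse(h_con, h_ret, w_con):
    """Formula 5: h_fus = h_con * W_con + (1 - Gate) * h_ret."""
    gate = kl_gate(h_con, h_ret)
    return [c * w_con + (1.0 - gate) * r for c, r in zip(h_con, h_ret)]


# When the retrieved vector equals the context vector, the gate is 0 and the
# retrieved information passes through in full.
h_fus = fuse([1.0, 2.0], [1.0, 2.0], w_con=0.5)
```

The larger the divergence between the context and the retrieved candidate, the less of the retrieved vector is mixed in. A raw KL value is non-negative but not bounded by 1, so \(1 - Gate\) can become negative in this sketch; the text does not say whether the gate is normalized further.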

Each fusion context vector passes through the GRU to get the final contexts representation. Figure 2 shows that the fusion is auto-regressive and yields the representation at the final time step. We then concatenate the contexts representation and the current turn representation. The response representation is obtained similarly: each word in the response is treated like a context in the contexts, and the resulting fusion representation is used as the decoder input.

Fig. 2. Context hierarchical fusion encoder.

Table 2. Human evaluation results of mean score, proportions of three levels (+2, +1, and 0 represent excellent, good and average respectively).

3.4 Decoder

Finally, the parameters of the encoder and the decoder are jointly trained. We combine the fused contexts vectors and the current turn vectors and send them into the decoder. The context response attention can be expressed by the following equation:

$$\begin{aligned} h^{CR} = Attention\left( {h_{Con}^{Fus}, h_{Res}} \right) , \end{aligned}$$
(6)

where \(h_{Con}^{Fus}\) represents the hierarchical fusion of contexts and current turn, and \(h_{Res}\) represents the response vector. Given the contexts \(C\) and a response \(Y=\{y_1,y_2,\dots ,y_m\}\), the likelihood of the response sequence is:

$$\begin{aligned} P\left( {Y|C;\theta } \right) = \prod \limits _{t = 1}^{m} {P\left( {{y_t}|C,{y_1},...,{y_{t - 1}};\theta } \right) } \end{aligned}$$
(7)

After passing the context and response attention, we can generate words of response through softmax:

$$\begin{aligned} P\left( {{y_t}|C,{y_1},...,{y_{t - 1}};\theta } \right) = P\left( {{y_t}|h^{CR};\theta } \right) \end{aligned}$$
(8)
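The factorization in formula 7 is usually evaluated in log space, and the same per-token probabilities yield the perplexity reported in the automatic evaluation. The sketch below assumes that `step_probs` (a hypothetical name) holds the probability the decoder assigns to each reference token \(y_t\) given the contexts and the preceding tokens; perplexity is computed with its standard definition, the exponential of the negative mean log-likelihood.

```python
import math


def sequence_log_likelihood(step_probs):
    """log P(Y | C) = sum_t log P(y_t | C, y_1, ..., y_{t-1})."""
    return sum(math.log(p) for p in step_probs)


def perplexity(step_probs):
    """exp of the negative mean per-token log-likelihood."""
    if not step_probs:
        raise ValueError("step_probs must contain at least one probability")
    return math.exp(-sequence_log_likelihood(step_probs) / len(step_probs))


# A two-token response whose tokens each receive probability 0.5.
log_likelihood = sequence_log_likelihood([0.5, 0.5])
ppl = perplexity([0.5, 0.5])
```

Maximizing the summed log-likelihood over the training data is equivalent to minimizing perplexity, so lower perplexity indicates a better fit to the reference responses.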

4 Experiment

4.1 Datasets

We use the Ubuntu community multi-turn dialogue dataset [11] and the Daily dialogue dataset [8] to evaluate the performance of our proposed model. We use the official script to mark the positive samples used for training.

4.2 Baselines

  • Seq2Seq: the sequence-to-sequence model with an attention mechanism [18].

  • HRED: the Hierarchical Recurrent Encoder-Decoder [14], which models multiple dialogue turns separately.

  • VHRED: a variant of HRED that adds hidden variable information to increase robustness [15].

  • ReCoSa: relevant context with self-attention [25], which uses long-distance attention to capture important word information.

  • STAR-BTM: multi-turn dialogue generation that integrates topic information [24].

  • CHG: a model that uses historical dialogue representation learning and a historical dialogue selection mechanism [26].

  • HSAN: a hierarchical self-attention network, which finds the important words and utterances in the context simultaneously [5].

4.3 Experiment Settings

To make a fair comparison between all baseline methods, the hidden layer size is set to 512, the batch size is set to 32, and 8-head attention is used. We use PyTorch to run all models on three Tesla T4 GPUs.

4.4 Human Evaluation

Because human evaluation is extremely time-consuming, we randomly sampled 200 messages from the Ubuntu test set for it. We recruited 5 evaluators to judge the responses from three aspects [4].

  • Appropriateness: a response is logical and appropriate to its message.

  • Informativeness: a response has meaningful information relevant to its message.

  • Grammaticality: a response is fluent and grammatical.
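Table 2 reports a mean score and the proportions of the three levels. A minimal sketch of that aggregation for one aspect, assuming each judgment is an integer in {0, 1, 2} and that all judgments are pooled before averaging (the pooling scheme and the function name are assumptions, not stated in the text):

```python
from collections import Counter


def summarize_ratings(ratings, levels=(2, 1, 0)):
    """Return (mean score, {level: proportion}) for one evaluation aspect."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    counts = Counter(ratings)
    total = len(ratings)
    mean_score = sum(ratings) / total
    proportions = {level: counts[level] / total for level in levels}
    return mean_score, proportions


# Four pooled judgments for one aspect: two excellent, one good, one average.
mean_score, proportions = summarize_ratings([2, 2, 1, 0])
```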

4.5 Automatic Evaluation

We use perplexity [14], BLEU [22], and Dist-1 and Dist-2 [7] to evaluate our responses. Dist-k measures diversity: it is the number of distinct k-grams normalized by the total number of words in the responses.
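The Dist-k definition above can be made concrete with a short sketch. Whitespace tokenization and corpus-level pooling of all generated responses are assumptions; the text does not specify either, and the function name is hypothetical.

```python
def distinct_k(responses, k):
    """Number of distinct k-grams divided by the total number of words."""
    total_words = 0
    seen = set()
    for response in responses:
        tokens = response.split()  # ASSUMPTION: whitespace tokenization
        total_words += len(tokens)
        for i in range(len(tokens) - k + 1):
            seen.add(tuple(tokens[i:i + k]))
    return len(seen) / total_words if total_words else 0.0


responses = ["i am fine", "i am ok"]
dist_1 = distinct_k(responses, 1)  # 4 distinct unigrams over 6 words
dist_2 = distinct_k(responses, 2)  # 3 distinct bigrams over 6 words
```

Repeated responses add words to the denominator without adding new k-grams, so a model that keeps producing the same reply receives a low Dist-k.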

We ran extensive experiments on both datasets to verify the effectiveness of our retrieval model. To keep the comparison with the baselines fair and consistent, we conducted several groups of experiments. First, we compare our complete model (with retrieval information) with baseline models without retrieval information (Table 3), to show the importance of introducing semantic retrieval information. Second, we compare our complete model with baseline models that use a simple fusion of retrieval information (Table 4), to ensure fairness between our model and the baselines and to show the effectiveness of the hierarchical fusion encoder. The simple fusion uses the same semantic retrieval mechanism to find the retrieval sentence most relevant to the current turn and context attention features, and splices the two directly together. Finally, we conduct an ablation experiment on the Daily dialogue dataset (Table 5) to show that our hierarchical encoder is effective; in one ablation setting, the KL gate is replaced by a simple attention.

Table 3. Performance of different models on Ubuntu dataset and daily dialogue dataset.
Table 4. Comparison of our complete model and simple fusion retrieval information model on Ubuntu and daily dialogue dataset.
Table 5. Ablation experiment on daily dialogue dataset.

5 Analysis

The test results are shown in Tables 2 to 5. We use two kinds of evaluation: human evaluation and mainstream automatic evaluation metrics based on machine translation. On the same dataset, the hierarchical fusion mechanism through the KL gate improves the degree to which other information is fused. This shows that the performance of the model can be improved by fusing more information, and that semantic-based information retrieval can recall the most relevant candidates. In other words, considering more information helps improve the quality of the generated dialogue. A detailed analysis follows.

The human evaluation focuses on aspects that are not covered by the automatic assessment. In the human evaluation in Table 2, our method has the highest average score in terms of appropriateness, informativeness, and grammaticality. This shows that, after integrating the retrieved information, the model can capture more background context and generate richer responses. Thanks to the turn-level context attention, the model also achieves good results in keeping the generated response consistent with the context. From Table 2, we find that the improvement in appropriateness is not large: during generation, the model is constrained by the direction of the retrieved candidate background context, which limits its syntactic flexibility. The human evaluation table also shows that our model performs very well at the +2 level of informativeness, which indicates that considering richer information is very effective. Compared with the CHG and STAR-BTM models, which merge additional information, fusing retrieved information hierarchically is more effective than using historical information or topic information. Overall, all indicators improve.

In automatic evaluation, to rule out the influence of retrieval information on the comparison, we also add retrieval information to the other baselines. As shown in Table 3, our model achieves good results on the PPL and Dist indices of both datasets. We believe this is closely related to the basic encoder-decoder architecture and to the integration of the retrieved information. Besides, the sentences retrieved by semantic similarity are of much better quality: thanks to the powerful RoBERTa model, these sentences are more in line with the context, which also helps the later experimental results. We conducted several experiments with the ReCoSa method but could not reproduce the BLEU score reported in the ReCoSa paper; for the other indicators we follow the paper. The Dist index measures the richness of responses. Our model performs well on it because the fusion of more information makes the generated sentences very rich. In Table 3, our HRG model performs well on all indicators. Compared with HSAN, which uses hierarchical self-attention, our method with turn attention performs much better. In terms of fusing additional information, our hierarchical fusion model can dynamically measure the information difference according to the data distribution by KL divergence, and fuse according to the importance of different turns on the two datasets.

As shown in Table 4, we also add semantic retrieval information to the baseline models, but integrate it only in a simple way: we use the same semantic retrieval mechanism to find the retrieval sentence most relevant to the current turn, and directly splice the resulting vector with the vector of the current turn. The results show that after this simple introduction of the retrieval module, all baseline models improve on BLEU, which also shows that considering more information can improve the richness of the model. The performance of the STAR-BTM+R model improves after integrating the retrieval distribution, which suggests that the semantic retrieval features roughly contain the implicit information of the topic dimension, and that the retrieved sentence information is richer. Moreover, splitting the context with our method improves the performance of the baseline models, which shows that our split method is very effective. The PPL index increases a little compared with the other baseline models, perhaps because the retrieval information is fused only in a simple way. Overall, the response quality of all baseline models improves to some extent, which demonstrates the effectiveness of integrating retrieved background information. Besides, our hierarchical fusion encoder is more effective than the simple integration method. Both automatic evaluation and human evaluation show that the context consistency of the conversation and the richness of the answers improve after the retrieval information is added.

In our ablation experiment on the Daily dialogue dataset, shown in Table 5, we first remove the retrieval information and the hierarchical encoder simultaneously. The results on the different indicators are then similar to those of the ReCoSa method, but because we use the turn of sentence embedding (TOS), our HRG model performs slightly better, which shows the necessity of the TOS embedding. When we instead remove only the KL gate and replace it with a simple attention, the results improve considerably compared with the former ablation setting, which shows that the KL gate can effectively measure the difference between two distributions.

6 Conclusion and Future Work

In this paper, we propose a novel hierarchical fusion encoder that incorporates retrieved information into multi-turn dialogue generation. We encode the current turn and the context information separately and add a turn of sentence embedding (TOS) to enhance the sentence-level attention. The hierarchical fusion encoder makes effective use of the retrieved historical information according to the different weights of the context sentences. Experiments show the effectiveness of this method. In future research, we will explore how to introduce knowledge graph information or other external forms of knowledge into multi-turn dialogue generation.