1 Introduction

With the development of open-domain dialogue systems, great progress has been achieved in many applications, such as intelligent assistants, customer service, and chatbots [3]. The end-to-end network [14] has proven effective for generative dialogue systems. However, it is still difficult to build engaging and realistic conversations owing to the lack of interlocutor personas.

Fig. 1. The two dialogues from ConvAI2 persona-chat, where sentences shown in the same color are related to each other in the conversation.

Several efforts have been made to explore the use of personas for facilitating response generation [7]. [20] introduced a novel dataset, PERSONA-CHAT, where each dialogue is assigned a character description using 5 sentences as a persona profile. We define these persona profiles as explicit personas. Subsequently, [12, 13] generated responses with this kind of persona. However, in real conversations, repliers sometimes answer with explicit personas directly, and sometimes answer with useful information that can be inferred from the explicit personas and the context, which we define as implicit personas. Specifically, as shown in Fig. 1, the two dialogues are associated with the same explicit personas. The response in Dialogue1 is directly associated with the explicit personas ‘My skin is olive colored. My eyes are green. I wear glasses that are cateye.’ It describes the appearance of the speaker, which is consistent with the context. Therefore, capturing context-relevant personas is essential in persona-based dialogue. In Dialogue2, however, the response not only mentions the persona ‘I want to be a librarian.’ but also explains why the speaker wants to be a librarian. This kind of information does not appear in the context or the explicit personas, but it can be inferred from the persona-based context. This indicates that implicit personas can be used in some responses. Although some persona-based dialogue methods have been proposed, the following challenges remain: 1) In multi-turn dialogue, as shown in Fig. 1, the response is related to some contextual personas, but previous methods cannot effectively capture the key explicit personas, which harms persona consistency. 2) In persona-based dialogue, attractive responses are not only persona-consistent but also diverse, whereas existing methods mainly focus on persona consistency. 3) Previous methods usually take explicit personas into consideration but neglect that the explicit and implicit personas described above can interact with each other in one model at the same time.

To tackle these challenges, we propose a model called Exploiting Explicit and Inferred Implicit Personas for Multi-turn Dialogue Generation (EIPD), which consists of three components. First, the explicit persona extractor adopts a transformer encoder to acquire the explicit personas relevant to the context. Second, the implicit persona inference module employs the von Mises-Fisher (vMF) distribution, which is suitable for modeling directional data, to infer the implicit personas and improve response diversity. Third, the persona-response generator is designed to guide the implicit personas and fuse the two kinds of personas to generate the response. Finally, the ConvAI2 persona-chat dataset is used to evaluate the effectiveness of the proposed model. We summarize the contributions of this work as follows:

  • It is the first time that an effective framework for multi-turn dialogue generation takes two kinds of personas into consideration simultaneously.

  • An implicit persona inference module with a vMF distribution is devised to infer the implicit personas.

  • The persona generator is used to supervise the generation of implicit personas.

  • The experimental results demonstrate that our model can generate responses with more diversity and persona consistency compared with baseline results.

2 Related Work

2.1 Persona-Based Dialogue Model

In open-domain dialogue generation, persona-based dialogue models have attracted increasing attention from researchers. Recent works focus on improving the performance of persona-based dialogue generation as well as persona consistency. [11] assigned a desired identity to a chatbot so that it can generate coherent responses. [20] constructed a persona-chat dataset with different speaker profiles. Based on this dataset, [13] proposed a reinforcement learning framework to improve the persona consistency of responses. Besides these works using speaker profiles, other works use implicit information to achieve this goal. [7] used pretrained speaker embeddings and dialogue context to produce more informative and diverse responses. [10] proposed a multi-task learning approach that incorporated speaker characteristics to train neural conversation models. Despite the success of using implicit personas in conversation, it is still difficult for these methods to automatically learn the implicit personas displayed by the speakers.

2.2 von Mises-Fisher Distribution

The von Mises-Fisher (vMF) distribution is defined on a hypersphere and is therefore well suited to modeling directional data. Considering this characteristic, the vMF distribution has been introduced into several NLP works. Both [1] and [9] integrated vMF into a topic model to explore semantic consistency and improve performance. [18] replaced the Gaussian distribution with the vMF distribution in CVAE and found that the ‘collapse’ problem can also be alleviated. [5] used the vMF distribution to draw the context word vectors to improve embedding models. Different from these works, we apply the vMF distribution in the Conditional Variational Autoencoder (CVAE) framework to infer the implicit personas.

Fig. 2. The framework of the EIPD model, including the explicit persona extractor, the implicit persona inference module, and the persona response generator. The process represented by the dotted line only occurs during training.

3 The Proposed Model

A persona-based dialogue system generates responses with context and personas. Our problem is formulated as follows: context \(X=\{{x_1},{x_2},...,{x_m}\}\), each utterance \({x_i}=(w_{i,1}^x,w_{i,2}^x,...,w_{i,{M_i}}^x)\), a set of explicit personas \(P_{exp}=\{{p_1},{p_2},...,{p_n}\}\), each persona \({p_i} = (w_{i,1}^p,w_{i,2}^p,...,w_{i,{N_i}}^p)\), and response \(Y=\{w_1^y,w_2^y,...,w_k^y\}\). Given X, the implicit personas \({P_{imp}}\) are explored by the implicit persona inference module with the supervision of explicit personas. By leveraging the context, explicit personas, and implicit personas, the goal is to generate a diverse and persona-consistent response Y. We drop the subscript of \(P_{exp}\) for simplicity.

As shown in Fig. 2(a), the whole framework can be divided into three modules: (1) Explicit Persona Extractor, (2) Implicit Persona Inference, and (3) Persona Response Generator.

3.1 Explicit Persona Extractor

Following Transformer [15], this component (Fig. 2(b)), which includes a context encoder and a persona encoder, takes context and explicit personas as the input and extracts the most relevant explicit personas to improve persona consistency.

Context Encoder: We use the transformer encoder to encode the context X. The multi-head self-attention is defined as \(\mathrm{{MultiHead}}(Q,K,V)\), where Q, K, and V represent the query, key, and value, respectively. The encoder is composed of \({N_c}\) layers. The encoding of the context is as follows:

$$\begin{aligned} H_c^n\mathrm{{ = MultiHead}}(O_c^{n - 1},O_c^{n - 1},O_c^{n - 1}) \end{aligned}$$
(1)
$$\begin{aligned} O_c^n = \mathrm{{FFN(}}H_c^n\mathrm{{)}} \end{aligned}$$
(2)
$$\begin{aligned} \mathrm{{FFN(}}x\mathrm{{) = max(0, }}x{W_1} + {b_1}){W_2} + {b_2} \end{aligned}$$
(3)

where \(n \in [2, {N_c}]\). \(H_c^{n}\) and \({O_c^{n}}\) are the n-th layer outputs of the multi-head self-attention and the feed-forward network, respectively. In the first layer, \({O_c^1}\) represents the sum of the word embedding and positional embedding of the input. Following [15], we also add layer normalization to the sub-layers, and we finally obtain the context representation \({O^{{N_c}}}\) (that is, \(O_c^{{N_c}}\)) after \({N_c}\) layers.

Persona Encoder: According to the examples in Fig. 1, we observe the following: 1) The response Y is often related to some personas \({p_i}\) and context utterances \({x_j}\). 2) The relevance between X and \({p_i}\) is beneficial for generating an informative and consistent response. Therefore, we model the relevance between the context and the explicit personas. Specifically, we use another multi-head self-attention to encode the explicit personas, and \(O_{exp}\) represents the output of this attention mechanism. We then use PerCon-Attention, which takes \({O^{{N_c}}}\) as the query and \(O_{exp}\) as the key and value, to compute the contextual explicit persona hidden vector \({O_{cp}}\) based on the following equations:

$$\begin{aligned} {H_{cp}} = \mathrm{{PerConAtt}}(O^{{N_c}},{O_{exp}},{O_{exp}}) \end{aligned}$$
(4)
$$\begin{aligned} {O_{cp}} = \mathrm{{FFN}}({H_{cp}}) \end{aligned}$$
(5)
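As a rough illustration of Eqs. (4) and (5), the following sketch computes a single-head scaled dot-product attention with the context representation as the query and the persona representation as the key and value, followed by the feed-forward network of Eq. (3). It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the single head, the toy inputs, and the identity weights are all hypothetical, and layer normalization and learned projections are omitted. Setting the query, key, and value to the same sequence gives the self-attention pattern of Eqs. (1) and (2).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def ffn(x, w1, b1, w2, b2):
    """Eq. (3) for one vector x: max(0, x W1 + b1) W2 + b2."""
    hidden = [max(0.0, sum(xi * w1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(h * w2[i][j] for i, h in enumerate(hidden)) + b2[j]
            for j in range(len(b2))]

# Toy inputs (hypothetical values): 2 context positions, 3 persona positions, width 2.
o_context = [[1.0, 0.0], [0.0, 1.0]]            # stands in for O^{N_c}
o_exp = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # stands in for O_exp

h_cp = attention(o_context, o_exp, o_exp)       # Eq. (4): query = context, key = value = personas
identity = [[1.0, 0.0], [0.0, 1.0]]             # placeholder FFN weights
zeros = [0.0, 0.0]
o_cp = [ffn(h, identity, zeros, identity, zeros) for h in h_cp]  # Eq. (5)
```

Each row of `o_cp` is a context position re-expressed as a weighted mixture of persona vectors, so positions that match a persona more strongly draw more of their representation from it.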

3.2 Implicit Persona Inference

According to Fig. 1, the personas shown in the response are not entirely extracted from the given explicit personas. We therefore employ an inference module using the vMF distribution to infer the implicit personas (Fig. 2(c)) for personalized and diverse responses.

Since different speakers express different implicit personas, this information can be represented as different directions in the semantic space. Because the vMF distribution [18] is better suited to modeling directional data, we introduce it into the CVAE framework. Specifically, in the CVAE framework, the prior network \({p_\theta }(z|X,P)\) and the recognition network \({q_\varphi }(z|X,P,Y)\) are used to sample the latent variable z, which here represents the implicit personas and is written as \({p_{imp}}\). In our setting, \({p_{imp}}\) follows the vMF distribution: the prior network is \({p_\theta }({p_{imp}}|X,P) \sim vMF({\mu _{prior}},{\kappa _{prior}})\) and the posterior network is \({q_\varphi }({p_{imp}}|X,P,Y)\sim vMF({\mu _{pos}},{\kappa _{pos}})\).

vMF Distribution: The von Mises-Fisher distribution is defined over the unit hypersphere and is parameterized by a direction vector \(\mu \in R {^m}\) with \(\mathrm{{||}}\mu \mathrm{{|| = }}1\) and a concentration parameter \(\kappa \in R{_{ \ge 0}}\), where m denotes the dimension of the word vectors. The probability density function of the vMF distribution for a random unit vector \({p_{imp}} \in {R^m}\) is defined as:

$$\begin{aligned} {f_m}({p_{imp}};\mu ,\kappa ) = {C_m}(\kappa )\exp (\kappa {\mu ^\mathrm{T}}{p_{imp}}) \end{aligned}$$
(6)
$$\begin{aligned} {C_m}(\kappa ) = \frac{{{\kappa ^{m/2 - 1}}}}{{{{(2\pi )}^{m/2}}{I_{m/2 - 1}}(\kappa )}} \end{aligned}$$
(7)

where \({C_m}(\kappa )\) is the normalization constant and \({I_{m/2 - 1}}\) stands for the modified Bessel function of the first kind of order \(m/2 - 1\). Inspired by NVSRN [2], we encode Y into representations \({O_y}\), set \({\kappa _{prior}}\) and \({\kappa _{pos}}\) as constants and compute \({\mu _{prior}}\), \({\mu _{pos}}\) as:

$$\begin{aligned} {\tilde{\mu }_{pos}} = {f_{pos}}([{O^{{N_c}}}, {O_{cp}}, {O_y}]) \end{aligned}$$
(8)
$$\begin{aligned} {\mu _{pos}} = {\tilde{\mu }_{pos}}/||{\tilde{\mu }_{pos}}|| \end{aligned}$$
(9)
$$\begin{aligned} {\tilde{\mu }_{prior}} = {f_{prior}}([{O^{{N_c}}}, {O_{cp}}]) \end{aligned}$$
(10)
$$\begin{aligned} {\mu _{prior}} = {\tilde{\mu }_{prior}}/||{\tilde{\mu }_{prior}}|| \end{aligned}$$
(11)

where \({f_{prior}}\) and \({f_{pos}}\) are two transformations and \(|| \cdot ||\) denotes the 2-norm used to ensure the normalization. Since the prior \({p_\theta }({p_{imp}}|X,P)\) follows the \(vMF({\mu _{prior}},{\kappa _{prior}})\) rather than \(vMF( \cdot ,0)\), the KL divergence will be computed as:

$$\begin{aligned} \begin{aligned} {{\mathcal {L}}_{KL}}&= KL({q_\varphi }(p_{imp}|X,P,Y)||{p_\theta }(p_{imp}|X,P))\\&= (m/2 - 1)\log \frac{{{\kappa _{pos}}}}{{{\kappa _{prior}}}} + \log \frac{{{I_{m/2 - 1}}({\kappa _{prior}})}}{{{I_{m/2 - 1}}({\kappa _{pos}})}} \\&\quad - {\kappa _{prior}}\mu _{prior}^\mathrm{T}{\mu _{pos}}\frac{{{I_{m/2}}({\kappa _{pos}})}}{{{I_{m/2 - 1}}({\kappa _{pos}})}} + {\kappa _{pos}}\frac{{{I_{m/2}}({\kappa _{pos}})}}{{{I_{m/2 - 1}}({\kappa _{pos}})}} \end{aligned} \end{aligned}$$
(12)
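The KL term between two vMF distributions can be evaluated numerically. The sketch below is an illustration derived directly from the density in Eqs. (6) and (7), using the fact that the mean of a vMF distribution is \(\mu \, I_{m/2}(\kappa )/I_{m/2 - 1}(\kappa )\). It is not the authors' code: the function names are hypothetical, and the Bessel function is computed from its power series in log space using only the standard library, which is an implementation choice made here for self-containment.

```python
import math

def log_bessel_i(order, x, terms=500):
    """log I_order(x) for x > 0, from the power series summed in log space."""
    logs = [(2 * k + order) * math.log(x / 2.0)
            - math.lgamma(k + 1) - math.lgamma(k + order + 1)
            for k in range(terms)]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))

def normalize(vec):
    """Project a vector onto the unit sphere, as in Eqs. (9) and (11)."""
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]

def vmf_kl(mu_pos, kappa_pos, mu_prior, kappa_prior):
    """KL( vMF(mu_pos, kappa_pos) || vMF(mu_prior, kappa_prior) ) for unit vectors."""
    m = len(mu_pos)
    order = m / 2.0 - 1.0
    log_i_pos = log_bessel_i(order, kappa_pos)
    log_i_prior = log_bessel_i(order, kappa_prior)
    # Mean resultant length of the posterior: I_{m/2}(kappa_pos) / I_{m/2-1}(kappa_pos).
    a_pos = math.exp(log_bessel_i(order + 1.0, kappa_pos) - log_i_pos)
    cosine = sum(a * b for a, b in zip(mu_pos, mu_prior))
    return (order * math.log(kappa_pos / kappa_prior)
            + log_i_prior - log_i_pos
            + (kappa_pos - kappa_prior * cosine) * a_pos)

mu_a = normalize([1.0, 2.0, 2.0])
mu_b = normalize([2.0, -1.0, 0.0])        # orthogonal to mu_a
kl_same = vmf_kl(mu_a, 5.0, mu_a, 5.0)    # identical distributions
kl_diff = vmf_kl(mu_a, 5.0, mu_b, 5.0)    # same concentration, different direction
```

With both concentrations fixed as constants, the first two terms are constant during training, and only the term involving the cosine between \(\mu _{pos}\) and \(\mu _{prior}\) depends on the network outputs.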

Sampling Technique for vMF: Following the implementation of [4], we use the rejection sampling scheme to sample \(w \in [ - 1,1]\), and then the latent variable \({p_{imp}}\) is derived from \({p_{imp}} = w\mu + v\sqrt{1 - {w^2}} \), where v is a randomly sampled unit vector tangent to the hypersphere at \(\mu \).
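The sampling step can be sketched as follows. This is a minimal sketch, assuming the rejection scheme is the one commonly attributed to Wood (1994) for drawing w; whether this matches the implementation of [4] in every detail is an assumption. The function name `sample_vmf` and the `rng` argument are hypothetical, and `mu` is expected to be a unit vector with at least two dimensions.

```python
import math
import random

def sample_vmf(mu, kappa, rng=random):
    """Draw one sample from vMF(mu, kappa); mu must be a unit vector, len(mu) >= 2."""
    m = len(mu)
    # Step 1: rejection-sample the component w along mu.
    b = (-2.0 * kappa + math.sqrt(4.0 * kappa ** 2 + (m - 1) ** 2)) / (m - 1)
    x0 = (1.0 - b) / (1.0 + b)
    const = kappa * x0 + (m - 1) * math.log(1.0 - x0 ** 2)
    while True:
        z = rng.betavariate((m - 1) / 2.0, (m - 1) / 2.0)
        w = (1.0 - (1.0 + b) * z) / (1.0 - (1.0 - b) * z)
        u = 1.0 - rng.random()  # in (0, 1], avoids log(0)
        if kappa * w + (m - 1) * math.log(1.0 - x0 * w) - const >= math.log(u):
            break
    # Step 2: draw a random unit vector v tangent to the sphere at mu.
    g = [rng.gauss(0.0, 1.0) for _ in range(m)]
    dot = sum(gi * mi for gi, mi in zip(g, mu))
    v = [gi - dot * mi for gi, mi in zip(g, mu)]
    norm = math.sqrt(sum(vi * vi for vi in v))
    v = [vi / norm for vi in v]
    # Step 3: combine, p_imp = w * mu + sqrt(1 - w^2) * v.
    scale = math.sqrt(max(0.0, 1.0 - w * w))
    return [w * mi + scale * vi for mi, vi in zip(mu, v)]

rng = random.Random(0)
mu = [0.0, 0.0, 1.0]
samples = [sample_vmf(mu, 50.0, rng) for _ in range(200)]
```

Every sample lies on the unit sphere by construction, and a larger \(\kappa \) concentrates the samples more tightly around \(\mu \).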

3.3 Persona Response Generator

This component comprises a response generator and a persona generator (Fig. 2(d)). Considering the interaction between the two kinds of personas, we use the two generators to further enhance the modeling of directional data and better fuse the implicit and explicit personas.

Persona Generator: To strengthen the supervision for implicit personas, during this process, we employ an RNN decoder that receives implicit persona \(p_{imp}\) as the initial hidden state and then generates tokens sequentially under the probability distributions:

$$\begin{aligned} {p_{{\theta _p}}}(P|p_{imp}) = \prod \limits _{i = 1}^n {\prod \limits _{j = 1}^{{N_i}} {p({w_{i,j}}|{P_{< i}},{w_{i, < j}})}} \end{aligned}$$
(13)

where n is the number of explicit persona sentences and \({N_i}\) is the length of the i-th persona sentence \({p_i}\). During this process, the loss function is:

$$\begin{aligned} {{\mathcal {L}}_{p}}\mathrm{{ = }}{\mathbf{{E}}_{{q_\varphi }({p_{imp}}|X,P,Y)}}[\log {p_\theta }(P|{p_{imp}})] \end{aligned}$$
(14)

Response Generator: Finally, conditioned on the explicit personas, implicit personas, and context, we employ a response decoder to generate the response Y:

$$\begin{aligned} {p_{{\theta _g}}}(Y|X,P,p_{imp}) = \prod \limits _{i = 1}^k {{p_{vocab}}({w_{y,i}})} \end{aligned}$$
(15)

where \({p_{vocab}}\) is the vocabulary’s probability distribution; \({{p_{vocab}}({w_{y,i}})}\) is the probability of the word \({w_{y,i}}\); k is the length of the response Y. In general, the ELBO in the decoder can be rewritten as:

$$\begin{aligned} {\mathcal {L}}_{r}\mathrm{{ = }}{\mathbf{{E}}_{{q_\varphi }(p_{imp}|X,Y,P)}}[\log {p_\theta }(Y|p_{imp},X,P)] - {{\mathcal {L}}_{KL}} \end{aligned}$$
(16)

3.4 Training Objective

In the EIPD model, the overall objective is:

$$\begin{aligned} \mathcal {L} = \lambda {{\mathcal {L}}_{p}} + (1 - \lambda ){{\mathcal {L}}_{r}} \end{aligned}$$
(17)

where the hyperparameter \(\lambda \) is used to control the balance between response generator and persona generator.
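How Eqs. (14), (16), and (17) combine can be shown with scalar values. The sketch below is illustrative only: the function names are hypothetical, each expectation is replaced by a single-sample estimate, and treating the negated objective as the loss handed to the optimizer is an assumption about the training setup rather than something stated in the text.

```python
def eipd_objective(log_p_persona, log_p_response, kl, lam):
    """Combine the persona and response terms as in Eq. (17)."""
    l_p = log_p_persona            # Eq. (14): log-likelihood of the explicit personas
    l_r = log_p_response - kl      # Eq. (16): ELBO of the response decoder
    return lam * l_p + (1.0 - lam) * l_r

def training_loss(log_p_persona, log_p_response, kl, lam):
    """Assumed minimization form: the negated objective."""
    return -eipd_objective(log_p_persona, log_p_response, kl, lam)

# Hypothetical numbers for one training example.
objective = eipd_objective(log_p_persona=-10.0, log_p_response=-20.0, kl=2.0, lam=0.25)
loss = training_loss(-10.0, -20.0, 2.0, 0.25)
```

A larger \(\lambda \) shifts weight toward reconstructing the explicit personas from \(p_{imp}\), while a smaller \(\lambda \) emphasizes response likelihood and the KL term.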

4 Experiments

4.1 Experimental Settings

Dataset: We use the released ConvAI2 persona-chat dataset, which is an extended version of PERSONA-CHAT [20], to verify our proposed method. The dataset consists of 164,356 utterances in 10,981 dialogues, and each speaker has at least 4 persona profiles. We randomly split the data into the training, validation, and test sets, which respectively contain 67,112, 8,395, and 4,478 dialogues.

Baselines: We compared the proposed EIPD model with five commonly used baseline models. S2SAP: the Seq2Seq model, which integrates context and persona as the input [20]. CVAE: an RNN-based model that exploits latent variables to improve the diversity of the response [21]. Trans: the transformer model [15] that concatenates personas and context as the input. PerCVAE: a memory-augmented CVAE model that uses multi-hop attention to exploit the persona information to improve the response quality [12]. TransferTransfo: a finetuned GPT2 that takes personas and dialogue context as the input [16] (Table 1).

Table 1. Objective (on the left) and subjective evaluation (on the right) results with respect to the ConvAI2 persona-chat dataset. Results in bold represent the best scores. In the subjective evaluations, the percentages of each kind of response are calculated by combining the evaluations from three annotators together. The Kappa scores of all models are higher than 0.4, which indicates that the three annotators reach a fair agreement.

Parameters: For the RNN-based models, we set the word embedding size to 300. The encoder is a 2-layer GRU with a hidden size of 600. For the Transformer, the word embedding size is set to 512, and the numbers of encoder and decoder layers are set to 3 and 1, respectively. In addition, the number of heads in multi-head attention is 8, and the inner-layer size of the feed-forward network is 2048. In our model, the parameters of the explicit persona extractor are the same as those of the Transformer. The dimension of the latent variable is set to 180. We use the Adam algorithm to update the parameters with a learning rate of 0.0001. The batch size is set to 32. An early-stopping strategy is used to obtain the best model. Our model is implemented using the TensorFlow framework. We conduct all experiments on a GPU.
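For reference, the reported hyperparameters can be collected in one configuration object. This is a sketch only: the class name and field names are hypothetical and not taken from the authors' code, and only values stated in the text are included.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EIPDConfig:
    # RNN-based baselines
    rnn_embedding_size: int = 300
    rnn_encoder_layers: int = 2
    rnn_hidden_size: int = 600
    # Transformer and explicit persona extractor
    transformer_embedding_size: int = 512
    transformer_encoder_layers: int = 3
    transformer_decoder_layers: int = 1
    attention_heads: int = 8
    ffn_inner_size: int = 2048
    # Implicit persona inference
    latent_size: int = 180
    # Optimization (Adam)
    learning_rate: float = 1e-4
    batch_size: int = 32

config = EIPDConfig()
# Per-head width implied by the embedding size and head count.
head_size = config.transformer_embedding_size // config.attention_heads
```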

Evaluations: In our experiments, we use Dist-1, BLEU-1/2 and F1 to evaluate our method. In addition to the automatic metrics, we recruit three human annotators familiar with NLP tasks to judge the quality of the generated responses. We sample 200 context-response-persona triples from the above models. The annotators are required to provide 4-graded judgements according to the following criteria: G1: The generated response is not grammatically correct, is irrelevant to the semantics of the context, or is inconsistent with the given personas. G2: The generated response is fluent and weakly related to the context, such as some generic responses. G3: The generated response is fluent and relevant to the context semantics and slightly consistent with the personas. G4: The generated response is not only fluent and semantically relevant but also consistent with the given personas.
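The aggregation of the three annotators' grades can be sketched as below. The text does not say which kappa statistic is used, so Fleiss' kappa is an assumption here (it is one common choice for more than two annotators); the function names and the toy ratings are hypothetical.

```python
from collections import Counter

def grade_percentages(ratings, grades=(1, 2, 3, 4)):
    """Percentage of each grade after pooling all annotators' judgements."""
    counts = Counter(g for item in ratings for g in item)
    total = sum(counts.values())
    return {g: 100.0 * counts.get(g, 0) / total for g in grades}

def fleiss_kappa(ratings, grades=(1, 2, 3, 4)):
    """Fleiss' kappa; ratings is a list of per-item grade lists of equal length."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    table = [[item.count(g) for g in grades] for item in ratings]
    # Share of all judgements that fall in each grade.
    p_grade = [sum(row[j] for row in table) / (n_items * n_raters)
               for j in range(len(grades))]
    # Observed agreement per item, then averaged.
    p_item = [(sum(x * x for x in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in table]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_grade)
    # Undefined (division by zero) if every judgement is the same grade.
    return (p_observed - p_expected) / (1.0 - p_expected)

# Hypothetical judgements: two responses, three annotators each.
ratings = [[4, 4, 3], [2, 2, 2]]
percentages = grade_percentages(ratings)
kappa = fleiss_kappa(ratings)
```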

Table 2. Performances of model ablation. EIPD is significantly better than the ablation approaches.

4.2 Experimental Results

Objective and Subjective Evaluations: For objective evaluation, (1) Dist-1 is the ratio of distinct unigrams, which reflects the diversity of the generated responses. The performance of S2SAP is the worst because it only roughly combines the explicit personas. PerCVAE surpasses the other baselines owing to its exploitation of explicit personas. EIPD outperforms all baselines, which indicates that the proposed model can generate diverse responses. (2) BLEU-1/2 evaluates how many n-grams (n = 1, 2) in the generated responses overlap with those in the ground truth. EIPD performs better than the baselines except for TransferTransfo in BLEU-1, and we speculate that the reason may be that the pretrained language model contains semantic information. (3) For F1, the score of EIPD is higher than the others, demonstrating that the model can generate more accurate information.
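The automatic metrics can be sketched as follows. This is a minimal sketch under stated assumptions, not the evaluation script used in the experiments: whitespace tokenization, corpus-level pooling for Dist-1, word-level F1, and the omission of BLEU's brevity penalty are all assumptions, and the function names are hypothetical.

```python
from collections import Counter

def distinct_1(responses):
    """Ratio of distinct unigrams to all unigrams over a set of responses."""
    tokens = [tok for r in responses for tok in r.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def ngram_precision(hypothesis, reference, n):
    """Clipped n-gram precision, the core term of BLEU-n (no brevity penalty)."""
    hyp, ref = hypothesis.split(), reference.split()
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    return sum((hyp_ngrams & ref_ngrams).values()) / total

def unigram_f1(hypothesis, reference):
    """Harmonic mean of unigram precision and recall against the reference."""
    hyp, ref = hypothesis.split(), reference.split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical generated responses and one reference.
dist1 = distinct_1(["i like tea", "i like coffee"])
p2 = ngram_precision("i like tea", "i like green tea", 2)
f1 = unigram_f1("i like tea", "i like green tea")
```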

For subjective evaluation, the responses generated by EIPD are more engaging than the responses from all baselines. The percentage of diverse and persona-consistent responses (the grade ‘G3&4’) is 66.89%, clearly higher than the others, which indicates that EIPD can generate persona-consistent responses. Additionally, the percentage of ‘G2’ decreases while the percentage of ‘G3’ increases. This suggests that EIPD is able to generate context-relevant responses and alleviate the problem of generic responses at the same time. Among the baselines, S2SA performs poorly since the model does not take any kind of personas into consideration. By adding explicit personas or global information, the performance of these models improves gradually, yet it is still worse than that of our model.

Ablation Analysis: To investigate the effects of specific modules in EIPD, we ablated our model in several ways. D: a generative dialogue model without explicit or implicit personas. EPD: removes the implicit persona inference module, that is, the model does not use implicit personas. IPD: replaces the explicit persona extractor with an RNN to represent the explicit personas. \(\mathbf{EIPD}_{\mathbf{Gau}}\): replaces the vMF distribution with the Gaussian distribution. \(\mathbf{EIPD}_{\mathbf{pd}}\): removes the persona generator, so the generation of implicit personas loses the supervision of the explicit personas.

As shown in Table 2, from D, IPD, EPD, \(\mathrm{{{EIPD}}_{{Gau}}}\), and \(\mathrm{{{EIPD}}_{{pd}}}\) to EIPD, every step yields an observed improvement on the automatic metrics. EIPD achieves the best performance among all the methods. Specifically, compared with D, the improvements of EPD and IPD on all metrics imply that the explicit persona extractor can capture the explicit personas related to the context, and that the implicit persona inference module can obtain the implicit personas inferred from the given context and explicit personas. Furthermore, EIPD performs better than \(\mathrm{{EIPD}_{Gau}}\) on all metrics, which suggests that the vMF distribution is more useful than the Gaussian distribution in this framework. Specifically, the implicit persona inference module can infer more reasonable implicit personas with the vMF distribution, which is consistent with the characteristics of vMF: it is good at modeling directional data, such as the personalities of different speakers. In addition, the performance of \(\mathrm{{EIPD}_{pd}}\) is inferior to that of EIPD, which indicates that the persona generator facilitates the generation of persona-consistent and diverse responses.

Table 3. An example of dialogue with the personas ‘Black coffee is my addiction. My favorite hobby is gardening. My family gets together every Saturday. My husband died last year.’ in the ConvAI2 persona-chat dataset.

Case Study: According to Table 3, the baseline models often generate fluent but irrelevant and weakly personalized responses. For comparison, we use EIPD to generate different responses through implicit persona inference, and we find that the responses are related to the persona ‘My favorite hobby is gardening’. The first response directly expresses the speaker’s attitude toward gardening, and the second response expands on the information in the given personas.

5 Conclusion and Future Work

In this paper, we propose EIPD, an effective model for multi-turn persona-based dialogue. To the best of our knowledge, we are the first to fuse explicit personas and implicit personas to generate more realistic responses. The model uses an explicit persona extractor to improve persona consistency and employs an implicit persona inference module with the vMF distribution to improve diversity. Finally, the persona response generator is used to fuse the personas and generate the response. Experimental results on the ConvAI2 persona-chat dataset demonstrate the effectiveness of our model and verify the importance of implicit personas. In the future, we would like to use knowledge graphs and pretrained language models to strengthen the inference of implicit personas.