1 Introduction

In the field of dialogue systems, [1] pointed out that chitchat models must not only generate diverse responses but also establish an emotional connection with interlocutors during the conversation. Each user has his or her own characteristics and habits, so it is also very important to fully mine the user's persona information. Seq2Seq models [2,3,4,5] can fully integrate the context information of a dialogue and overcome the limitation of RNNs, which produce outputs of fixed dimensionality; such models can effectively improve the diversity of responses in a dialogue system. Although the Seq2Seq model has now been widely adopted in dialogue systems, there is still a long way to go before dialogue systems can understand natural human language and pass the Turing test [6]. There are still quite a few problems with adopting the Seq2Seq model to build dialogue language models; for example, the generated responses show a low degree of personalization [7] and weak correlation with the dialogue history. Seq2Seq models applied to dialogue generation also tend to produce safe and generic responses (e.g., "I don't know") [8]. [3] pointed out that the reason for these problems is that the persona information associated with the speakers is not incorporated into the dialogue generation process.

In some cases, the generated response does not need to reflect persona information, but when it does, the persona information should be combined properly on the basis of the full dialogue history. To improve the personality of dialogue models, we exploit latent personalized interaction information through a multihop attention mechanism over the dialogue context and persona information. In this paper, we design a persona information memory selection network (PMSN), which is trained on predefined persona information. When predicting responses, we employ the PMSN to select the most relevant persona information and feed it into the dialogue generation blocks to assist the generation process. Figure 1 illustrates our method: the persona information \(W^{A}\) of speaker A is composed of L pieces of profile information \(\{{W_{1}^{A}}, W_{2}^{A}, {\ldots}, W_{L}^{A}\}\). When we communicate with others, we first consider what kind of people they are and what personality traits they have; these features are produced by the PMSN based on the predefined persona information. The dialogue history \(h_{n-1}^{A}=({x_{1}^{A}}, {x_{1}^{B}}, {x_{2}^{A}}, {x_{2}^{B}}, {\ldots}, x_{n-1}^{A}, x_{n-1}^{B})\) and the most relevant persona information are then used in the dialogue generation process. When we chat with others, not every response needs to incorporate persona information; appropriate integration of persona information is more suitable. Therefore, to improve response diversity, we input the dialogue context into the PMSN to predict the most relevant persona information and then employ these selected persona features to assist the dialogue generation process. The PMSN employs a multilayer perceptron (MLP) to mine the persona information and adopts \(W^{*} = \text{MLP}(W, x)\) to choose the most relevant persona information. The dialogue generation network, Transferrer, is a sequence prediction model that generates more personalized and diverse responses based on the dialogue history and the selected persona information. Our basic dialogue model adopts the GPT-2 pretrained model, which involves more parameters and is trained on more data than GPT and achieves excellent performance in dialogue generation. Transferrer adopts the conditional probability \(p({x_{n}^{A}} \mid W^{A}, W^{*}, h_{n-1}^{A})\) to predict the target sequence, where \({x_{n}^{A}}\) denotes the target token, \(W^{A}\) is the persona information of speaker A, \(W^{*}\) denotes the most relevant persona information, and \(h_{n-1}^{A}\) is the n − 1 rounds of dialogue history.
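To make the notation concrete, the following minimal Python sketch (with hypothetical names; it is not the authors' implementation) shows how the persona sentences \(W^{A}\), the dialogue history \(h_{n-1}^{A}\), and the selected persona \(W^{*}\) enter the generation step.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SpeakerState:
    persona: List[str]   # W^A: L predefined profile sentences {W_1^A, ..., W_L^A}
    history: List[str]   # h_{n-1}^A: alternating utterances (x_1^A, x_1^B, ..., x_{n-1}^B)

def generate_response(state: SpeakerState, pmsn, transferrer) -> str:
    """Hypothetical top-level flow: the PMSN picks W*, and the Transferrer
    models p(x_n^A | W^A, W*, h_{n-1}^A) to decode the next utterance."""
    w_star = pmsn.select(state.persona, state.history)        # most relevant profile sentences
    return transferrer.decode(state.persona, w_star, state.history)
```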

Fig. 1 Method overview

To speed up the convergence of our model and improve its performance, we integrate a reinforcement learning method based on a Markov decision process (MDP) into the learning process and apply fine-tuning to optimize the model parameters. Two Transferrers are initialized and allowed to chat with each other. In addition, we design a reward mechanism that suits interlocutors' dialogue preferences to generate interesting, personalized and smooth responses. Through the agents' exploration, the PMSN is populated with persona information. A successful set of dialogues allows the two interlocutors to enhance their understanding of each other through the dialogue content and each other's characteristics. In summary, our contribution is threefold:

  • We propose a solution for personalized dialogue generation. Our method can generate abundant personalized and smooth responses from predefined persona information and effectively distinguish the most relevant profile features from noisy data.

  • We design an optimization method, the Self-Transferrer framework, for our personalized dialogue model. The Self-Transferrer framework jointly optimizes each part of our model to generate more personalized and smooth responses.

  • Extensive experiments conducted on ConvAI2 validate the superiority of our model, which generates more personalized responses than other popular baselines in terms of different metrics.

The remainder of our paper is organized as follows. In Section 2, we review related work on dialogue systems and personalized systems. Sections 3 and 4 introduce the PMSN and the Transferrer architecture employed in our personalized dialogue generation process, respectively. In Section 5, we present the Self-Transferrer framework, which jointly optimizes each part of the dialogue model to improve the personality and fluency of the responses. Section 6 reports the experimental details, results, and further analysis that verify the effectiveness of our method. Section 7 concludes our work and suggests ways to improve dialogue quality in the future.

2 Related work

Due to the availability of various related large-scale datasets [9, 10], the task of personalized dialogue generation has made considerable progress [36]. Table 1 illustrates a round of conversation in the ConvAI2 dataset, where the persona information is expressed in a few descriptive sentences. Generation-based dialogue systems employ an encoder-decoder architecture, which encodes the dialogue context and adopts a Seq2Seq model to predict the target responses. Such systems still have certain problems, such as nonsmooth and repeated responses [11, 35, 37]. The responses they generate are relatively fixed and lack flexibility, making them difficult to apply in chatbots. According to research in cognitive science, effective communication creates similar activation maps in the brains of the two interlocutors [12], suggesting that understanding interlocutors' persona information and emotional states is an essential process for generating high-quality conversations.

Table 1 A sample of the dataset; each sample includes the persona information of both speakers and the dialogue context

In the field of personalized dialogue generation, the traditional method of constructing personalized dialogue systems focuses on psychological characteristics, such as the "Big Five" personality traits [13]. However, modeling these psychological features and collecting the corresponding dialogue context is very difficult, and these limitations have hindered the development of personalized dialogue systems. Recent studies have therefore tried to build personalized dialogue generation models in a data-driven manner. [8] first incorporated persona information by converting it into a dense vector for the subsequent dialogue generation task, which effectively reduced the number of generic responses and increased response variety. Nevertheless, such methods rely heavily on labeled data, the training cost is high, and the features of the training data are sparse; [11] therefore proposed a reinforcement learning process based on a Markov decision process, which effectively reduced the number of generic responses and increased response variety. [14] fed persona information and dialogue history into a neural network model with the aim of generating more meaningful responses.

Traditional datasets for personalized dialogue generation do little to encourage dialogue models to understand natural language and maintain persona information. Motivated by this observation, [9] proposed a dialogue dataset based on persona information and further proposed two generative models, persona-Seq2Seq and a generative profile memory network, to incorporate persona information into responses. In the work of [15], the researchers pointed out that dialogue agents fail to engage users, especially when trained end to end. For this reason, they introduced a new dataset providing more persona information and dialogue context grounded in persona information. With the development of pretrained models, [16] proposed a new approach named TransferTransfo that used the Transformer model to improve the fluency of responses. To fuse the target persona information into the decoding process effectively and balance its contribution, [17] proposed an attention routing structure, which makes more effective use of sparse personalized data during model training. There are still many meaningful challenges in building a personalized dialogue system, for example, how to generate an informative response with multiple relevant persona traits without losing fluency and coherence. To address this issue, [18] presented a model that incorporates recurrent personality interactions among response decoding steps to fuse appropriate persona information. The dialogue generation process needs to incorporate the interlocutors' persona information in most cases, whereas in some specific cases it is not necessary [19]. To incorporate more coherent personality information into the dialogue generation process, [20] predefined several profile key-value pairs, including name, gender, age, and location, and explicitly expressed a profile value in the response. Although the above methods have achieved positive results, most models focus too much on deliberately imitating human responses and generate responses that are excessively related to persona information. To address these issues, our work generates responses with the most relevant persona information without losing fluency and coherence. [21] proposed P2 BOT, which incorporates mutual persona perception to improve the quality of personalized dialogue generation, and there are also more open-domain dialogue systems with reinforcement learning methods [22,23,24,25]. Moreover, integrating the task of dialogue generation into task-oriented dialogue systems, which provides users with a more interesting experience while completing tasks, has also attracted the attention of researchers [26, 27].

In summary, researchers have performed many meaningful studies in the field of personalized dialogue generation. Nevertheless, many challenges still exist, such as (1) how to enhance the diversity of the dialogue generation process by mining the personalized expressions in the dialogue history, (2) how to model the dialogue process with the help of pretrained models, and (3) how to combine reinforcement learning methods to make the training of data-driven dialogue systems more effective. These issues arise mostly because the available personality characteristics of the interlocutors are limited and there is no good way to explore persona features from the limited persona information and the dialogue history. To address these challenges, we design our model from the following three aspects. First, we exploit deep implicit personalized interaction information through a multihop attention mechanism, which explores the relation between the given profile and the dialogue history. Second, we leverage the advantages of a pretrained language model to enhance the semantic representation of the limited persona information. Third, we design a self-learning process based on a reinforcement learning method, which helps improve the learning efficiency of the dialogue system.

3 Persona information memory selection network (PMSN)

To better integrate persona information into the dialogue generation process, the persona information is input to the PMSN for memorization before the dialogue starts. Memorizing persona information is a process of attending to the persona information. To reduce the error caused by the memorization process, a multihop attention method is adopted to compute attention over the persona information. The calculation process is as follows:

First, inspired by the attention mechanism, our model calculates the attention score between \(d_{t}\) (query) and each \(w_{i}\) (persona) by \(e_{t i}={d_{t}^{T}} w_{i}\) and then employs the softmax function to normalize the attention scores, as shown in formula 1.

$$ a_{t i}=softmax\left( {d_{t}^{T}} w_{i}\right) $$
(1)

Second, we adopt attention weights to measure the matching degree between the current dialogue context and the persona information. Each piece of persona information has an output vector \(C_{i}\) (produced by another embedding matrix C), and the weighted sum of the attention weights \(a_{ti}\) and the corresponding \(C_{i}\) is employed as the attention output of the t-th dialogue sequence, as shown in formula 2.

$$ Attention \left( h_{t}, \mathrm{W}, \mathrm{C}\right)={\sum}_{i} a_{t i} C_{i} $$
(2)

The attention calculation is essentially a weighted sum. If only one-hop attention is employed, a certain error is introduced: features that are not highly relevant to the current context are discarded, and the resulting attention matrix cannot indicate well the degree of association between the target sentence and the current context.

Finally, inspired by [19, 38, 39], we adopt a multihop attention structure, where the attention output after i hops is given in formula 3.

$$ m^{i}=m^{i-1}+ Attention^{i-1} $$
(3)

Where \(m^{0}=d_{t}\), and \(m^{3}\) is employed as our memorized persona. In our experiments, i = 3 achieves better performance than i = 1 or 2, whereas there is no significant improvement when i = 4, 5 or 6.

When selecting the persona representation \(W^{*}\) related to the current context, a linear transformation is applied to the output of the multihop attention to obtain the persona information, as follows:

$$ W^{*}=softmax\left( W_{p}\left[m^{3}\right]\right)=M L P\left( \left[m^{3}\right]\right) $$

Where \(W_{p}\) is the weight matrix for selecting the most relevant persona information; the selected persona information is then adopted in the dialogue generation process.
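For illustration, the multihop attention of formulas 1–3 and the selection step above can be sketched as follows in PyTorch; the module structure, the assumption that the query of hop i is \(m^{i-1}\) (as in standard memory networks), and all variable names are ours, not a definitive implementation.

```python
import torch
import torch.nn as nn

class PMSN(nn.Module):
    """Multihop attention over persona memories (formulas 1-3) followed by
    an MLP selection layer; a sketch assuming precomputed embeddings."""
    def __init__(self, dim: int, num_personas: int, hops: int = 3):
        super().__init__()
        self.hops = hops
        self.value_proj = nn.Linear(dim, dim)          # embedding matrix C producing output vectors C_i
        self.selector = nn.Linear(dim, num_personas)   # W_p in W* = softmax(W_p[m^3])

    def forward(self, d_t: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # d_t: (dim,) query built from the dialogue context; w: (L, dim) persona embeddings
        c = self.value_proj(w)                         # output vectors C_i
        m = d_t                                        # m^0 = d_t
        for _ in range(self.hops):
            a = torch.softmax(w @ m, dim=0)            # a_{ti} = softmax(d_t^T w_i)        (formula 1)
            attn = a @ c                               # Attention = sum_i a_{ti} C_i       (formula 2)
            m = m + attn                               # m^i = m^{i-1} + Attention^{i-1}    (formula 3)
        return torch.softmax(self.selector(m), dim=-1) # distribution over persona sentences (W*)
```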

3.1 Training

It is necessary to label the dialogue context before training the PMSN; we adopt TF-IDF to compute the relevance between the dialogue context and each piece of persona information. The inverse document frequency is computed by formula 4:

$$ i d f_{i}=\frac{1}{\left( 1+\log \left( 1+t f_{i}\right)\right)} $$
(4)

Where \(tf_{i}\) is the term frequency of the i-th word in the GloVe vocabulary, \(W^{*}\) is the predicted persona information most relevant to the current context, \(p_{i}\) is the labeled persona, and the loss function is as follows:

$$ \mathcal{L}_{p m s n}=\frac{1}{N} \sum\limits_{i=1}^{N}\left|W_{i}^{*}-p_{i}\right| $$
(5)
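A minimal sketch of this training procedure, assuming the TF-IDF-style weight of formula 4 is used to pick the pseudo-label and the loss of formula 5 is a mean absolute error; the helper names and the whitespace tokenization are illustrative.

```python
import math
import torch

def tfidf_weight(term_freq: int) -> float:
    # formula 4: idf_i = 1 / (1 + log(1 + tf_i))
    return 1.0 / (1.0 + math.log(1.0 + term_freq))

def label_persona(context_tokens, persona_sentences, term_freqs) -> int:
    """Pick the persona sentence whose words shared with the context carry the
    highest summed weight; a hypothetical labelling routine."""
    scores = []
    for sent in persona_sentences:
        shared = set(context_tokens) & set(sent.split())
        scores.append(sum(tfidf_weight(term_freqs.get(w, 0)) for w in shared))
    return max(range(len(scores)), key=scores.__getitem__)

def pmsn_loss(w_star: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    # formula 5: L_pmsn = (1/N) * sum_i |W_i^* - p_i|
    return torch.mean(torch.abs(w_star - p))
```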

4 Transferrer

Following the research of [9, 11, 21], we treat the task of dialogue generation as a sequence prediction process. Our method adopts the GPT-2 language model [28] to initialize our dialogue model. Compared with GPT [29], GPT-2 increases the scale of the training data and the capacity of the model; both are based on the Transformer model. The training of GPT-2 also verified the effectiveness of unsupervised learning.

The complete dialogue generation process is shown in Fig. 2. The Transferrer adopts the decoder structure of the Transformer model [30], processes the dialogue-related context information and generates responses. The dialogue-related context information includes the persona information \(W^{A}\), the dialogue history \({h_{n}^{A}}\), and the persona information \(w^{A^{*}}\) with the highest degree of correlation with the current context. We employ maximum likelihood estimation (MLE) to predict the next token of the response sequence, and the loss function is given in formula 6.

$$ \mathcal{L}_{mle} = \sum\limits_{t} \log p_{\theta}\left( x_{n, t}^{A} \mid W^{A}, w^{A^{*}}, {h_{n}^{A}}, x_{n,<t}^{A}\right) $$
(6)

Where 𝜃 is the parameter of the Transferrer, \(x_{n, t}^{A}\) denotes the t-th token of \({x_{n}^{A}}\), and \(x_{n,<t}^{A}\) denotes the tokens before the t-th token. Formula 6 applies to both A and B; we mention only A for brevity (and likewise below).
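The loss of formula 6 can be sketched with a Hugging Face-style GPT-2 interface as follows; concatenating the persona, the selected persona and the history into a single text prefix is a simplifying assumption rather than the exact input format.

```python
import torch

def mle_loss(model, tokenizer, persona, w_star, history, target, device="cpu"):
    """Condition on (W^A, w^{A*}, h_n^A) and maximize log p(target tokens | prefix);
    a sketch using a Hugging Face causal LM, not the authors' exact pipeline."""
    prefix = " ".join(persona + w_star + history)
    prefix_ids = tokenizer.encode(prefix)
    target_ids = tokenizer.encode(target)
    input_ids = torch.tensor([prefix_ids + target_ids], device=device)
    # -100 masks the prefix so only the response tokens contribute to the loss
    labels = torch.tensor([[-100] * len(prefix_ids) + target_ids], device=device)
    out = model(input_ids=input_ids, labels=labels)   # GPT-2 LM head returns the averaged NLL
    return out.loss                                   # negative of formula 6 (token-averaged)
```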

Fig. 2 Transferrer architecture; every block is a decoder block of the Transformer

During the prediction process, beam search is applied to store the top-ranked response candidates \(\{\hat {x}_{n}^{A^{*}}\}\), and the Transferrer chooses the candidate that maximizes the length-normalized score as the prediction, as follows:

$$ \hat{x}_{n}^{A^{*}}=argmax \frac{\log p_{\theta}(\hat{x}_{n}^{A^{*}} \mid W^{A}, w^{A^{*}}, {h_{n}^{A}})}{|\hat{x}_{n}^{A^{*}}|} $$
(7)

To improve the generalizability of the model and find a more powerful and robust feature representation for dialogue generation, inspired by [21], we set an auxiliary task, next-utterance classification, to optimize the dialogue generation model. In addition to generating more appropriate responses, a [CLS] token is appended to the generated sequence, and a classifier is added to the last layer of the model to determine whether the response generated by the system is an appropriate one. The classifier is trained on randomly selected distractor utterances to distinguish correct replies from distractors, and formula 7 is extended into formula 8:

$$ \begin{array}{@{}rcl@{}} x_{n}^{\mathcal{A}^{*}}&=&\underset{\hat{x}_{n}^{A}}{\arg \max }(\alpha \cdot \frac{\log p_{\theta}(\hat{x}_{n}^{\mathcal{A}} \mid w^{\mathcal{A}}, h_{n}^{\mathcal{A}})}{|\hat{x}_{n}^{\mathcal{A}}|}\\&&+(1-\alpha) \cdot \log p_{\theta}(y_{n}=1 \mid w^{\mathcal{A}}, h_{n}^{\mathcal{A}}, \hat{x}_{n}^{\mathcal{A}})) \end{array} $$
(8)

Where 𝜃 is the parameter shared by the dialogue generation task and the auxiliary task, \(y_{n} = 1\) indicates that \(\hat{x}_{n}^{A}\) is predicted as the next personalized utterance, and α is a hyperparameter.
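A small sketch of the reranking in formulas 7 and 8, assuming the language-model and classifier log-probabilities are provided by helper functions defined elsewhere.

```python
def rerank(candidates, lm_logprob, cls_logprob, alpha=0.1):
    """candidates: list of token-id lists produced by beam search.
    lm_logprob(c): sum of log p_theta over the tokens of candidate c (formula 7 numerator).
    cls_logprob(c): log p_theta(y_n = 1 | persona, history, c)        (formula 8, second term).
    Returns the candidate maximizing the blended score of formula 8."""
    def score(c):
        return alpha * lm_logprob(c) / len(c) + (1.0 - alpha) * cls_logprob(c)
    return max(candidates, key=score)
```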

5 Self-Transferrer & fine-tuning

Although a supervised dialogue generation model can imitate a speaker's personalized responses well based on the training data, it cannot fully enable the machine to understand natural language. Therefore, we randomly pair two speakers, let them carry out a dialogue, and encourage the Transferrer to learn a policy that obtains the maximum reward through reinforcement learning. We further optimize the model by fine-tuning, employing reinforcement learning to maximize the reward function. We apply self-play to simulate the interaction between two Transferrers, which use the dialogue history and the interlocutors' persona information for complete exploration. The exploration is shown in Fig. 3, and the details are explained below.

Fig. 3 The exploration process. Agent A selects a context from the context database to start the conversation, as it is difficult to generate a high-quality sequence without any dialogue history

Following the work of [21], we divide the two conversational individuals into a user and an agent. Self-play is the process by which the agent optimizes the parameters 𝜃. User \(\mathcal {A}\) starts the conversation, and \({\mathscr{B}}\) replies as the agent. Inspired by the work of [11, 21], we introduce the necessary formulation for modeling our problem with reinforcement learning. A policy defines the behavior of the learnable agent at a specific time and computes the conditional probability of taking a certain action in a certain state, expressed as \(p_{\theta }({a_{n}^{B}} \mid {s_{n}^{B}})\); the policy is responsible for mapping a state to an action. The reward defines the temporary income of the agent: after the agent takes an action, the environment sends a reward to the agent at each time step. The goal of the value function is to judge which action is better from a long-term perspective, indicating the long-term expectation after taking a certain action. A state contains the overall persona information of the interlocutors, the persona information most relevant to the current context, and the dialogue history. Here, we define the state as a triple s = (W, h, W*), so the state of agent \({\mathscr{B}}\) in the n-th round is expressed as \(s_{n}^{{\mathscr{B}}}=(W^{{\mathscr{B}}}, h_{n}^{{\mathscr{B}}}, W^{*})\). An action is taken by the agent according to the policy. In our personalized dialogue generation task, we regard the action as the response of agent \({\mathscr{B}}\) to the question of user \(\mathcal {A}\), defined as \(a_{n}^{{\mathscr{B}}}\). The agent learns from the dialogue history and chooses the best answer from \(a_{t+1}^{1}, {\ldots}, a_{t+1}^{K}\) at each time step t = 1,...,T. After the agent takes an action, it receives a reward from the environment and obtains the hidden state \(h_{t+1}\) for the next time step. Then, the agent generates a new action set \(a_{t+1}^{1}, {\ldots}, a_{t+1}^{K}\) and chooses the proper answer according to the state and the policy. For user \(\mathcal {A}\), the state is updated upon receiving \({\mathscr{B}}\)'s response, and the Transferrer is employed to generate a reply. We adopt a policy gradient method [29] for the Transferrer, which outputs the policy function directly.
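The state and one self-play exchange can be sketched as follows; the agent interface (sample, policy_logprob) is hypothetical and only illustrates the state-action-policy roles described above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DialogueState:
    persona: List[str]                                   # W: the speaker's full persona
    selected: List[str]                                  # W*: persona most relevant to the context
    history: List[str] = field(default_factory=list)     # h_n: dialogue history

def self_play_turn(agent, user_utterance: str, state: DialogueState, k: int = 5) -> str:
    """Agent B observes user A's utterance, samples K candidate actions from its
    policy and keeps the one the policy prefers; the state is then updated."""
    state.history.append(user_utterance)
    candidates = [agent.sample(state) for _ in range(k)]          # a_{t+1}^1 ... a_{t+1}^K
    action = max(candidates, key=lambda a: agent.policy_logprob(a, state))
    state.history.append(action)
    return action
```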

We define the sequence of the exploratory process as τ = {s1, a1, s2, a2, …, sT, aT}, where \(s_{1}=(W^{\mathcal {A}}, h_{0}^{\mathcal {A}}, W^{*})\), \(s_{2}=(W^{{\mathscr{B}}}, {h_{1}^{B}}, W^{*})\), and user \(\mathcal {A}\) and agent \({\mathscr{B}}\) alternately interact with the environment to update their states. According to the Markov decision process, the probability of the occurrence of a certain sequence τi is given in formula 9.

$$ p_{\theta}\left( \tau_{i}\right) =p\left( s_{1}\right) \prod\limits_{t=1}^{T} p_{\theta} \left( a_{t} \mid s_{t}\right) p\left( s_{t+1} \mid s_{t}, a_{t}\right) $$
(9)

Each episode sequence has an expected reward, and the expected return of an episode sequence is estimated by the action-value function, as shown in formula 10.

$$ \bar{R}_{\theta} =E_{\tau \sim p_{\theta}(\tau)}[R(\tau)] $$
(10)

Where R(τ) is the true reward during exploration, and the target is to maximize the expected reward. We employ the policy gradient to optimize the parameters in the next section.

5.1 Policy gradient

To obtain the maximum expected value of the reward, the likelihood ratio trick is adopted to update the parameter 𝜃 through a gradient ascent method, where the gradient of the expected value is shown in formula 11.

$$ \nabla \bar{R}_{\theta}= {E}_{\tau \sim p_{\theta}(\tau)}[R(\tau) \nabla \log p_{\theta}(\tau)] $$
(11)

where 𝜃 is the parameter and the update method of 𝜃 is shown in formula 12.

$$ \theta \leftarrow \theta+\eta \nabla \bar{R}_{\theta} $$
(12)

As mentioned above, the action space is infinite, so the REINFORCE algorithm is adopted to approximate formula 11 by sampling an action from the policy distribution. Furthermore, [31] applied the mean reward of a mini-batch as a baseline, which is subtracted to reduce the variance. The agent samples tokens one by one through multinomial sampling over the output distribution, which provides more diversity than beam search sampling.
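A sketch of one REINFORCE update with the mini-batch mean reward as baseline and multinomial sampling; sample_fn and reward_fn are assumed helpers, and the code only illustrates formulas 11 and 12.

```python
import torch

def reinforce_step(model, optimizer, batch_states, sample_fn, reward_fn):
    """One policy-gradient step: sample an action per state by multinomial
    sampling, subtract the batch-mean reward as a baseline, and follow
    grad(R_bar) ~ (R - b) * grad log p_theta(action)."""
    log_probs, rewards = [], []
    for state in batch_states:
        action, log_prob = sample_fn(model, state)   # multinomial sampling, token by token
        log_probs.append(log_prob)                   # sum_t log p_theta(a_t | ...)
        rewards.append(reward_fn(action, state))     # r = beta1*R1 + beta2*R2 (formula 15)
    rewards = torch.tensor(rewards)
    baseline = rewards.mean()                        # mini-batch mean reward as baseline [31]
    loss = -torch.stack(log_probs).mul(rewards - baseline).mean()  # minimize -R_bar
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                 # theta <- theta + eta * grad R_bar (formula 12)
```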

5.2 Reward shaping

According to the work of [21], a high-quality personalized dialogue generation model should focus on modeling human language and the mutual persona perception between interlocutors. In reinforcement learning, when the reward from the environment is too sparse, relying solely on the agent's exploration may make finding a solution slow; therefore, prior experience can be integrated into the reward design to solve the problem more effectively and speed up convergence. We design two reward shaping schemes as follows.

5.2.1 RS.1

In the task of personalized dialogue generation, the responses generated by the system must conform to human language characteristics and dialogue rules so that they are meaningful. According to the work of [21], such rules can be measured by pretrained models. Therefore, we employ reward shaping based on the pretrained model (GPT-2). The reward for the actions taken by learner \({\mathscr{B}}\) in sequence τ is

$$ R_{1}\left( a_{n}^{\mathcal{B}}\right)=\frac{1}{\left|a_{n}^{\mathcal{B}}\right|} \sum\limits_{t} \log p_{\theta}\left( a_{n, t}^{\mathcal{B}} \mid a_{n,<t}^{\mathcal{B}}\right) $$
(13)

5.2.2 RS.2

RS.1 evaluates the language in isolation and does not fully consider the coherence with the dialogue context. A reasonable dialogue generation model should fully integrate the dialogue history to generate more meaningful responses, so we employ the auxiliary task introduced in Section 4 to design a second reward, as shown in formula 14.

$$ R_{2}\left( a_{n}^{\mathcal{B}}\right)=\log p_{\theta}\left( y_{n}=1 \mid a_{n}^{\mathcal{B}}, s_{n}^{\mathcal{B}}\right) $$
(14)

It is safe to assume that human responses are always more natural and personalized than those of dialogue agents. yn = 1 indicates that the generated response \(a_{n}^{{\mathscr{B}}}\) is predicted to be the next personalized utterance. In summary, the final reward is as follows:

$$ r=\beta_{1} R_{1}+\beta_{2} R_{2} $$
(15)

Where β1 and β2 are hyperparameters.
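The shaped reward can be sketched as follows, where lm_score and cls_score stand for the GPT-2 log-likelihood of formula 13 and the classifier log-probability of formula 14; both helpers are assumptions.

```python
def shaped_reward(action_tokens, state, lm_score, cls_score,
                  beta1: float = 0.4, beta2: float = 0.6) -> float:
    """lm_score(tokens): sum_t log p_theta(a_t | a_<t) under the pretrained GPT-2.
    cls_score(tokens, state): log p_theta(y_n = 1 | action, state)."""
    r1 = lm_score(action_tokens) / len(action_tokens)   # formula 13: length-normalized log-likelihood
    r2 = cls_score(action_tokens, state)                # formula 14: next-utterance classification
    return beta1 * r1 + beta2 * r2                      # formula 15: r = beta1*R1 + beta2*R2
```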


6 Experiment

6.1 ConvAI2 dataset and preparation

Our experiments are based on the large-scale ConvAI2 dataset, which incorporates interlocutor persona information and adds a new test set to the PERSONA-CHAT dataset proposed by [9]. The interlocutors are randomly paired and assigned persona information selected from a persona information pool. The training set contains more than 10,000 multiround dialogues, comprising approximately 160,000 utterances. Each multiround dialogue is accompanied by at least five sentences describing the speaker's profile information.

6.2 Baselines

The baselines for comparison are divided into three categories: dialogue generation models without persona information, dialogue generation models based on persona information, and dialogue generation models based on a pretrained model.

STSA

[4]: In the architecture of the Seq2Seq model with an attention mechanism, the encoder encodes the dialogue context and calculates the semantic vector \(c_{t}\) at each time step, and the decoder linearly transforms the generated semantic vector and generates a response. This method does not consider that persona information plays a significant role in the dialogue generation process.

Per-STSA

[9]: On the basis of integrating the attention mechanism, persona information is incorporated into the dialogue generation process and effectively improves response diversity; i.e., \(x = \forall p \in P \parallel x\), where ∥ denotes concatenation.

Dia-CVAE

[32]: This dialogue model without profile information adopts a hidden variable to obtain potential features in the dialogue generation process and aims at increasing the diversity of the generated responses.

Per-CVAE

[19]: This method employs a memory-augmented architecture to exploit persona information from the context, together with a conditional variational autoencoder, to generate diverse and sustainable conversations.

TransferTransfo

[16]: This model combines transfer learning and the Transformer model and fine-tunes the pretrained model by optimizing the multitask objective function.

Transformer MemNet

[10]: This dialogue generation process employs two trained models, namely, a knowledge selection model and a dialogue prediction model.

KIC

[33]: This method combines knowledge-aware pointer networks with a recurrent knowledge-interaction hybrid generator.

P2 BOT

[21]: This dialogue system incorporates mutual persona perception and reinforcement learning methods to improve the personality of the generated response.

6.3 Experimental settings

In our experiments, the RNN is a two-layer GRU with a 768-dimensional hidden state, and the word embedding dimension is set to 768. The vocabulary size is limited to 50,256. The mini-batch size is 16, and the Adam optimizer is adopted with an initial learning rate of 0.001. All parameters are initialized by sampling from a uniform distribution. For the PMSN, the hidden size is 768, the number of epochs is 50, and the maximum length of the persona sequence is 15. For the Transferrer, the hidden state size is 512, the batch size is 256, the beam size is 3, the maximum length is 256, the position embedding size is 512, 100 epochs are employed, and the learning rate is set to 6.25e-5. For the reinforcement learning process, the learning rate is 0.5, β1 = 0.4, β2 = 0.6, and α = 0.1.
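For readability, these settings can be gathered into a single configuration; the sketch below simply mirrors the values listed above.

```python
CONFIG = {
    "encoder": {"gru_layers": 2, "hidden_size": 768, "emb_dim": 768,
                "vocab_size": 50256, "mini_batch": 16,
                "optimizer": "Adam", "lr": 1e-3},
    "pmsn": {"hidden_size": 768, "epochs": 50, "max_persona_len": 15},
    "transferrer": {"hidden_size": 512, "batch_size": 256, "beam_size": 3,
                    "max_len": 256, "position_emb": 512, "epochs": 100,
                    "lr": 6.25e-5},
    "rl": {"lr": 0.5, "beta1": 0.4, "beta2": 0.6, "alpha": 0.1},
}
```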

6.4 Automatic evaluation

Evaluation is an important task when building an open-domain dialogue system [34]. Automatically evaluating an open-domain dialogue generation model is still a challenging task. Inspired by [9], we employ official automatic metrics to evaluate our model:

PPL, F1:

PPL (perplexity): the basic idea is that a good model assigns higher probability to the sentences in the test set; perplexity thus measures the fluency of the dialogue and the intelligibility of the generated responses. The lower the perplexity, the better the generated responses fit the sentences of the test set and the smoother and easier to understand they are. PPL is defined as follows:

$$ \operatorname{PPL}(S)=P\left( W_{1}, W_{2}, W_{3} {\ldots} W_{N}\right)^{-\frac{1}{N}} $$
(16)

The F1 score reflects both precision and recall. We additionally adopt the following metrics to demonstrate the effectiveness of our method.

Diff-k-grams:

Following the idea of [19], Distinct-K is adopted to measure the diversity of the generated responses by counting the number of distinct k-grams in them.

Persona Coverage(P.Cover):

Inspired by [19], we adopt this metric to measure the coverage of persona information in the generated responses. Suppose that there are M pieces of predefined interlocutor persona information \(\{p_{1}, p_{2}, {\ldots}, p_{M}\}\) and the generated responses are \(\left \{\hat {y}_{1}, \hat {y}_{2}, {\ldots}, \hat {y}_{N}\right \}\); P.Cover is then defined as follows:

$$ \mathcal{C}_{\text {per}}=\frac{{{\sum}_{i}^{N}} \max_{j \in[1, M]} \mathcal{S}\left( \hat{y}_{i}, p_{j}\right)}{N}\\ \mathcal{S}\left( \hat{y}_{i}, p_{j}\right)=\frac{{\sum}_{w_{k} \in W}\left( f_{k}\right)}{|W|} $$
(17)

where W is the set of words shared by \(\hat {y}_{i}\) and \(p_{j}\), |W| is the size of the shared word set, and N is the number of generated responses in each turn. The weight \(f_{k}\) is computed as \(f_{k}=\frac {1}{\left (1+\log \left (1+t f_{i}\right )\right )}\).
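The two personalization metrics can be sketched as follows; whitespace tokenization and an external term-frequency table are simplifying assumptions.

```python
import math

def distinct_k(responses, k: int = 2) -> float:
    """Ratio of unique k-grams to total k-grams over all generated responses."""
    ngrams = []
    for resp in responses:
        toks = resp.split()
        ngrams += [tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def p_cover(responses, personas, term_freqs) -> float:
    """Formula 17: average over responses of the best-matching persona score."""
    def f(word):                                       # f_k = 1 / (1 + log(1 + tf))
        return 1.0 / (1.0 + math.log(1.0 + term_freqs.get(word, 0)))
    def s(resp, persona):                              # S(y_hat_i, p_j): mean weight over shared words W
        shared = set(resp.split()) & set(persona.split())
        return sum(f(w) for w in shared) / len(shared) if shared else 0.0
    return sum(max(s(r, p) for p in personas) for r in responses) / len(responses)
```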

The final experimental score is obtained by averaging the model's results on the test set, and the results are shown in Table 2. When N = 1, the diversity of our model is higher than that of the baselines, and the P.Cover of the generated responses is slightly lower than those of Per-CVAE, TransferTransfo and P2 BOT. When N = 5 or 10, since we employ the GPT-2 pretrained model and the reinforcement learning method in the dialogue generation process, our model outperforms the other baselines in PPL and F1, as shown in Fig. 4. Our method outperforms almost all the baselines on the personalized metrics because the Self-Transferrer framework focuses on generating more diverse and fluent dialogues. As the dialogue continues, the amount of information involved increases and P.Cover shows a downward trend, since the persona information of the interlocutors is limited, usually to five sentences per speaker.

Table 2 Experiment on diversity
Fig. 4 Experiment on dialogue quality: a STSA, b Per-STSA, c Dia-CVAE, d Per-CVAE, e TransferTransfo, f Transformer MemNet, g KIC, h P2 BOT, i Ours (PMSN+GPT2+RL)

6.5 Additional analysis of the pretrained model

After employing the PMSN, the diversity of the generated responses shows a certain improvement. The purpose of adopting RS.1 is to make the predicted responses conform to human dialogue characteristics and dialogue rules. Table 3 shows that with RS.1 the quality of the predicted responses improves to a certain extent: the perplexity of the responses is reduced and the language is clear and easy to understand. The purpose of adopting RS.2 is to allow the reply to fully integrate the dialogue history so that it contains richer content; as shown in Table 3, after introducing RS.2, the persona information contained in the replies becomes richer. We also compare our method with previous methods after removing the pretrained model, and the result is shown in Fig. 5. Owing to RS.2 and the PMSN, the personalized metrics achieve significant improvements. After we remove the pretrained model, our method is still slightly better than the other methods in PPL and F1; these improvements benefit from the multihop attention mechanism and the exploration of the RL algorithm. Figure 5 also demonstrates the effectiveness of the pretrained GPT-2 model.

Table 3 Ablation tests
Fig. 5 Our method compared with previous methods without the pretrained model: a TransferTransfo, b Transformer-MemNet, c P2 BOT, d Ours (PMSN+RL)

6.6 Human evaluation

The automatic evaluation metrics show that our method can effectively combine persona information to generate a variety of interesting responses. We employ human evaluation to better evaluate the diversity of the responses generated by our method. The following results are based on N = 5.

In the human evaluation process, we randomly selected 10 workers to test our dialogue model; these workers have high-level language skills and know nothing about our methods. We randomly sampled 200 profile-question-response pairs from the test set and filtered out repeated pairs. The workers chat with the different chatbots following the given persona information and score the quality of the generated responses, and the average score is taken as the final result. In our human evaluation, a score of 1 means the response is only fluent in terms of grammar and vocabulary, 2 means the response is related to the given persona information, 3 means the reply contains comprehensive and diverse information about the interlocutors, and 4 means the response is consistent with the given persona information. The results are shown in Table 4, from which we can see that our model generates more personalized and consistent responses. Finally, we provide a dialogue example in Table 5 to show more directly the superiority of our method over other methods in the personalized dialogue generation process.

Table 4 Result of human evaluation
Table 5 Sampled dialogues from different models

7 Conclusion and future work

Our proposed BOT-PMSN method is based on the speaker's persona information; it adopts the Self-Transferrer framework and persona information to assist dialogue generation and introduces a reward signal into the dialogue process. The signal enhances persona perception between humans and machines and realizes the task of personalized dialogue generation. The dialogue generation model that we trained can effectively understand natural language, and experiments on the large-scale public dialogue dataset ConvAI2 verify the effectiveness of our method. In future work, we will consider sustainability in the dialogue generation process and mine the rich relations between personas and dialogue contexts. Meaningful dialogue also needs to fully consider the transfer of the speaker's emotional states; therefore, emotional analysis will be added in subsequent work to increase the diversity of the generated responses.