1 Introduction

Developing a personalized open-domain dialogue system is a fascinating yet challenging area of study that has attracted significant attention from both academia and industry. Despite numerous dialogue methods, several challenges remain unaddressed [1]. One of the major challenges in this field is dialogue consistency [2, 3]. Two widely adopted strategies for improving response consistency are persona consistency and contextual consistency: responses must be coherent with both persona knowledge and the dialogue history. In open-domain dialogue systems, explicit persona knowledge refers to a set of descriptive sentences and features that determine expression, personality and behavior. Recent work has emphasized that dialogue systems must exhibit consistent persona characteristics to establish and maintain long-term user trust. A well-designed dialogue system should avoid generating responses that contradict its predefined persona knowledge or its previous responses in logic and reasoning. Therefore, equipping dialogue systems with a consistent persona and generating responses that maintain a consistent expression are critical research tasks.

Several attempts have been made at personalized dialogue generation. Zhang et al. [4] create the PersonaChat dataset and use a memory network to store and retrieve persona knowledge. Song et al. [5] exploit persona knowledge from context and incorporate a conditional variational autoencoder (CVAE) model to generate diverse conversations. Previous research has focused on manually constructing explicit persona knowledge, either as personalized descriptions [6] or as key-value formatted persona knowledge [7]. These studies primarily aim at incorporating explicit persona knowledge into responses but overlook the implicit persona knowledge hidden in the dialogue history. Implicit persona knowledge reflects the hidden character of interlocutors and is embedded in their dialogue history, so there is a need for research that effectively integrates both explicit and implicit persona knowledge. A significant amount of implicit persona knowledge can be derived from the dialogue history, making it a promising source of such information [8,9,10,11], and incorporating it into dialogue systems can significantly enhance dialogue consistency. In the following, we discuss the advantages of leveraging implicit persona knowledge.

Fig. 1

Different users reply differently to the same question. The implicit persona knowledge is hidden in the dialogue history

Based on the above discussion, we propose integrating implicit and explicit persona knowledge to generate more consistent and contextually appropriate responses. The advantages of implicit persona knowledge are twofold. First, regarding convenience and extensibility, implicit persona knowledge is often more favorable than explicit persona knowledge. Acquiring a significant amount of explicit persona knowledge requires extensive manual labeling, a labor-intensive and time-consuming process; in practice, users may also be unwilling to label personalized sentences, making it difficult to gather such data at scale. Furthermore, implicit persona knowledge is easily updated as the dialogue history accumulates, since it is naturally learned over time, which makes it more flexible and adaptable than explicit persona knowledge. Second, various personalized expression characteristics are regularly hidden in the dialogue history. Here, personalized expression characteristics mean personalized attributes (e.g., dialogue habits, prerequisite knowledge, and preferences), which are valuable for enhancing a dialogue system's ability to generate coherent responses. As illustrated in Fig. 1, implicit persona knowledge can be derived from the dialogue history, since users with different personas may respond differently to the same question; personalized expression characteristics concealed in the dialogue history thus become a valuable source of information for dialogue systems. Taking the question Where do you live and what did you do today? in Fig. 1 as an example, the user with personas I live in New York. and I love sandwich. replies I live in New York and I will be making some sandwiches today., while the user with persona I live in Midwest America. replies I live in Midwest America.
Similarly, the dialogue history in Fig. 1 shows that user B considers listening to music a good way to relieve stress: when asked What do you do when you're unhappy?, user B replies I love to read or listen to music. Most existing personalized dialogue generation methods rely on explicit persona information, which is expensive to label, so it is necessary to model the user's persona from the dialogue history.

Motivated by the positive impact of leveraging implicit persona knowledge, we develop a new dialogue model named IMPACT, which focuses on integrating implicit persona knowledge from the dialogue history. Our approach generates a consistent response from two perspectives: discovering implicit personalized characteristics and modeling a consistent expressive style through contextual and persona consistency. The former discovers implicit personalized characteristics from the dialogue history and explicit persona knowledge; the latter models the interdependence between the current query, the user's persona and the dialogue history to ensure consistency. Moreover, because existing approaches often lack a clear consistency modeling objective, we adopt the unlikelihood training objective [12] to improve dialogue consistency.

More specifically, we first develop a personalized characteristics discovering module that uses multi-layer attentive modules to capture multi-grained personalized characteristics. At each layer, we simultaneously perform self-attention and cross-attention over historical responses, related posts, explicit persona knowledge and the current query, obtaining personalized attributes and a better understanding of the contextual relationships between responses and historical posts. Second, we build a dialogue consistency matching module that models the relationship between the current query and the dialogue history from two perspectives, contextual-level relevance and word-level relevance, and models the relationship between the current query and persona knowledge in the same way. Finally, we integrate the three matching features in a fusion process to generate the most appropriate response, and we employ the unlikelihood training objective to mitigate inconsistent responses. To validate the effectiveness of our model, we conduct comprehensive experiments on two publicly available datasets, evaluating personalized response generation with both automatic and human metrics. We further test our method in a persona-sparse setting, since not all dialogues require the integration of persona knowledge. Experimental results show that our model outperforms all baseline models. Our contributions are threefold:

  • To generate consistent responses, we propose IMPACT to model the personalized expression characteristics and construct the implicit persona knowledge from dialogue history.

  • We model the inner relations between the query, persona and dialogue history from multiple views, including contextual consistency and persona consistency. Moreover, we employ the unlikelihood training objective to alleviate inconsistent expression.

  • We carry out comparative experiments in various environments to prove the effectiveness of our method. Extensive experiments show that our model outperforms all the baseline models.

2 Related work

As a challenging task in natural language processing, dialogue systems have attracted significant attention from researchers due to their broad applications. Early dialogue systems, such as Parry [13] and Eliza [14], aimed to imitate human dialogue behavior to pass various Turing tests. Generally speaking, there are two kinds of dialogue systems: task-oriented and non-task-oriented. Task-oriented dialogue systems [15,16,17,18,19] are designed for specific purposes, such as flight reservations, hotel bookings, customer service, technical support, and other particular fields, and have been successfully employed in real-world applications. Non-task-oriented dialogue systems, in contrast, are more challenging to develop as they aim to generate open-ended conversations. Personalized dialogue generation is one of the most active research topics in non-task-oriented dialogue systems, aiming to generate persona-aware responses in multi-turn conversations [20,21,22].

Recently, personalized dialogue generation has attracted interest from various fields, and a significant amount of influential work has aimed to construct personalized dialogue systems [23]. Li et al. [24] and Zhang et al. [25] incorporate persona knowledge by encoding it into a vector and decoding it during utterance prediction to capture individual characteristics. Such approaches require conversational data tagged with user personas, which is costly to obtain at scale. Wang et al. [26] therefore introduce personalized models with only categorical attributes (e.g., gender, hobbies, and location), which are converted into vectors and fed into the decoder for response generation. However, user identities are often unavailable, which makes external supervision difficult. Zhang et al. [27] thus design a neural dialogue model that generates consistent responses by preserving features correlated with personas and topics. Cheng et al. [28] introduce a novel approach for learning dynamically updated speaker embeddings in conversational contexts. Unlike methods that require externally supervised data, this approach leverages dialogue data to train persona feature extractors and self-supervised topic models. The resulting model captures the nuances of speakers' personalities and conversational topics, which can be used for content ranking in dialogue action estimation, and has the potential to improve conversational agents and other dialogue-based systems by enabling them to better understand and respond to users' needs and preferences.

Although Ouchi et al. [29] and Zhang et al. [30] show that user embeddings are an effective way to constrain the characters of speakers, personalization in these models is handled implicitly and is therefore hard to interpret and control when producing desired replies. Researchers consequently began modeling user embeddings with explicit persona knowledge. Qian et al. [31] propose an explicit persona model that produces well-organized replies conditioned on a specified user profile; the chatbot's personality is defined in key-value form, composed of gender, hobbies, names and other related factors. During generation, the model first picks key-value pairs from the profile and then decodes a reply both backward and forward. XiaoIce also applies an explicit persona model [32]. Many personalized dialogue datasets with human-annotated explicit persona information have been proposed to support such models. Persona-Chat [4] is one such dataset and has greatly advanced the field. Building on Persona-Chat, Song et al. [5] introduce a memory-based framework that exploits persona knowledge from context and applies a conditional variational autoencoder (CVAE) to produce sustainable and diverse conversations. However, a one-stage decoding model can hardly avoid generating inconsistent persona words. Song et al. [33] therefore introduce a three-stage generate-delete-rewrite framework that removes inconsistent words from a generated reply prototype and rewrites it into a persona-consistent one. Yet most daily conversations do not aim to display personality within a few turns of interaction; dialogues in the real world are generally unrelated to the speaker's personality. Based on this fact, Zheng et al. [34] release PersonalDialog, a Chinese persona-sparse dataset that guides models to imitate real-world conversations.
The above methods increase dialogue consistency by explicitly defining a set of personas and learning to generate personalized responses. However, such methods lack an explicit consistency modeling process, and these personalized dialogue models still face the inconsistency challenge [35]. Welleck et al. [35] therefore frame consistency in dialogue generation as a natural language inference (NLI) task and build a novel NLI dataset called Dialogue NLI (DNLI). To discover the consistency relations between attribute information and replies, Song et al. [7] create KvPI, a large-scale human-annotated dataset that labels the relation between profile and response. To use persona knowledge correctly and generate consistent responses, Xu et al. [36] propose the Persona Exploration and Exploitation (PEE) framework, which extends persona descriptions with semantically correlated content before exploiting them to produce dialogue responses.

Above, we have discussed methods for building personalized dialogue systems that control conversation generation with explicitly defined user persona knowledge. The remaining challenges in modeling personalized dialogue can be summarized in three points: (1) most existing approaches are insufficient for modeling the psychological personality of speakers from the dialogue history; (2) they ignore the inner relations between the user's message, persona and dialogue history; (3) they lack a consistency modeling training objective. Next, we introduce our model in detail with respect to these three challenges.

3 Methodology

3.1 Task formulation

Personalized dialogue generation consists in predicting an utterance R given a context \(x=\{q, K, H\}\) that includes the dialogue query q, persona sentences K and dialogue history utterances H from interlocutors who take alternating turns. The response must reflect the personality defined by the persona knowledge K and, more importantly, remain consistent with K and with previous responses in the dialogue history H.

3.2 Preliminary

The general goal of personalized dialogue systems is to generate the most appropriate response, one that reflects both explicit and implicit persona. The generative dialogue model predicts the output distribution \(p_{\gamma }\left( r^{*} \mid q, K, H\right) \) for a given message q under the dialogue history H and the given persona knowledge K. For a single-turn dialogue system, \(H=\varnothing \) and responses are generated from the above distribution given the input message q and persona knowledge K. For a multi-turn dialogue system, H is the context of the previous conversation turns. This work investigates generating a consistent response in a multi-turn dialogue system. We divide the problem of consistency into context consistency and persona consistency. The consistency issue can be formally defined as: given the dialogue history H and a set of persona knowledge \(K=\{k_{1},k_{2},...,k_{n}\}\), generate a response \(r^{*}\) based on H and K such that \(r^{*}\) is consistent with both the persona knowledge set K and the dialogue history H, i.e., \(\forall K, H,\ \text {NLI}(r^{*},K) \in \{E,N\},\ \text {NLI}(r^{*},H) \in \{E,N\}\), where E and N denote entailment and neutral, respectively. The notations used in this paper are defined in Table 1.
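The consistency constraint above can be sketched as a simple acceptance check: a candidate response is admissible only if an NLI model labels it entailed (E) or neutral (N) against every persona sentence and every history utterance. The `nli` function below is a hypothetical toy stand-in, not a real classifier.

```python
# Sketch of the task-formulation consistency check. The nli() heuristic is a
# toy stand-in (assumption): it flags a contradiction only when an explicit
# negation of the premise appears in the hypothesis.
def nli(premise, hypothesis):
    # returns "C" (contradiction) or "E" (entailment/neutral proxy)
    return "C" if ("not " + premise) in hypothesis else "E"

def is_consistent(response, persona, history):
    # response must be entailed (E) or neutral (N) w.r.t. every k in K and h in H
    ok_persona = all(nli(k, response) in {"E", "N"} for k in persona)
    ok_history = all(nli(h, response) in {"E", "N"} for h in history)
    return ok_persona and ok_history

persona = ["i live in new york"]
history = ["i love sandwiches"]
print(is_consistent("i live in new york and i love sandwiches", persona, history))  # True
```

A real system would replace `nli` with a trained inference model such as one fine-tuned on DNLI.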

Table 1 Notations

3.3 IMPACT overview

Fig. 2

Main framework of our method

Two significant aspects impact the consistency of a dialogue agent's expression. First, the agent should be consistent with its previous personalized expression in the dialogue history; such a personalized expression manner is usually determined by the preferred expression style. Second, how an agent generates a personalized response is also conditioned on the internal relations between the current query, personalized preference and dialogue history. Given the same dialogue query, different users may react differently, and we attempt to obtain personalized preferences from the dialogue history. Based on this, we assume the personalized description consists of two aspects: (1) a personalized expression manner S, and (2) consistency matching styles \(C_{H}\), \(C_{P}\) and \(C_{K}\), which measure context consistency and persona consistency, respectively.

Figure 2 shows the structure of IMPACT. Specifically, we first build a personalized characteristics discovering module (Sect. 3.5), which caches the personalized expression representation \(g^S\) using historical dialogue responses. Next, we design a dialogue consistency matching module (Sect. 3.6), which measures the degree of consistency with the persona knowledge \(g^{C_{K}}\) and with the dialogue history \(g^{C_{H}}\), \(g^{C_{P}}\). The three matching modules operate separately, and the model fusion module (Sect. 3.7) merges the three types of features to compute the final matching score. In the remainder of this section, we introduce each component in detail.

3.4 Foundation: attentive module

In this part, we describe the fundamental building block of our method, the attentive module. Inspired by previous works [37, 38], we employ the attentive module to transform the semantics of the dialogue context into embedding representations. The attentive module is a variant of the Transformer structure [39]; compared with the multi-head attention layer in the Transformer, it adopts only one attention head. Specifically, the attentive module \(\text {Attn}(Q, K, V)\) takes \(Q \in {\mathbb {R}}^{l \times d}\), \(K \in {\mathbb {R}}^{l \times d}\), and \(V \in {\mathbb {R}}^{l \times d}\) as input, where l and d denote the sequence length and the number of hidden dimensions, respectively. The attentive module maps Q, K and V to a weighted output, with weights computed by letting each token attend to the tokens in the key sentences via scaled dot-product attention:

$$\begin{aligned} \text {Attn}(Q, K, V)=\text {softmax}\left( \frac{Q \cdot K^{\top }}{\sqrt{d}}\right) V \end{aligned}$$
(1)

Then we apply a residual connection with layer normalization to obtain a better representation and prevent vanishing gradients. The final result is computed by a feed-forward network (FFN) with ReLU activation:

$$\begin{aligned} \text {FFN}(x)=\text {ReLU}\left( \mathbf {W_{1}} \cdot x +b_{1}\right) \cdot \mathbf {W_{2}}+b_{2} \end{aligned}$$
(2)

where x has the same shape as the query Q, and b and \({\textbf{W}}\) are trainable parameters. We denote the above process as \(f_{\text {Attn}}(\cdot , \cdot , \cdot )\).
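The attentive module of Eqs. (1)-(2) can be sketched as follows, assuming numpy; all sizes, the random initialization, and the placement of the residual/normalization steps are illustrative.

```python
# Minimal sketch of the single-head attentive module (Eqs. 1-2).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def attentive_module(Q, K, V, W1, b1, W2, b2):
    d = Q.shape[-1]
    # Eq. 1: scaled dot-product attention with a single head
    attn = softmax(Q @ K.T / np.sqrt(d)) @ V
    # residual connection + layer normalization
    h = layer_norm(Q + attn)
    # Eq. 2: position-wise FFN with ReLU, plus a second residual
    ffn = np.maximum(h @ W1 + b1, 0.0) @ W2 + b2
    return layer_norm(h + ffn)

rng = np.random.default_rng(0)
l, d, d_ff = 6, 16, 32                    # sequence length, hidden size, FFN size
Q = rng.normal(size=(l, d))
W1, b1 = rng.normal(size=(d, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)) * 0.1, np.zeros(d)
out = attentive_module(Q, Q, Q, W1, b1, W2, b2)  # self-attention case
print(out.shape)  # (6, 16)
```

Passing the same tensor as Q, K and V gives the self-attention of Eq. (3); passing different tensors gives the cross-attention used later.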

3.5 Personalized characteristics discovering module

Incorporating personalized expression into the conversation is essential for a personalized dialogue system to generate coherent and consistent responses that accurately reflect a user's persona. We achieve this with a personalized characteristics discovering module that analyzes a user's historical responses and identifies patterns in their preferred speaking style. User personas and dialogue history are modeled interactively, and we obtain persona-related representations from the dialogue history, i.e., implicit persona information. Finally, we integrate the captured multi-view persona features and use them for downstream personalized response generation.

Formally, given the dialogue history \(H=\{(p_{1},r_{1}),(p_{2},r_{2}),...,(p_{t-1},r_{t-1})\}\), persona knowledge K and the dialogue query \(q^{t}\) at time step t, the personalized characteristics discovering module aims to compute a matching vector \(g^S(q^{t},K,H)\) that mines the personalized features from the persona K and the historical context. The module achieves this through three layers: a Semantic Extraction Layer, a Knowledge Interaction Layer and a Fusion Layer. The Semantic Extraction Layer models multi-grained semantic representations of historical responses and persona knowledge; the Knowledge Interaction Layer performs matching at each semantic scale; and the Fusion Layer fuses the matching signals between historical responses and persona knowledge to obtain \(g^S(q^{t},K,H)\). We introduce these layers in detail as follows.

The Semantic Extraction Layer aims to obtain the multi-grained semantic representation \(R_{c}^j=\{r^j_{0}, r^j_{1},...,r^j_{n}\}\) for the personalized style, as well as the cross-attention representations \(P_{c}^j=\{c^p_{0}, c^p_{1},...,c^p_{n}\}\) and \(K_{c}^j=\{c^k_{0}, c^k_{1},...,c^k_{n}\}\) for the historical response \(r_j, j\in [1, t-1]\), the dialogue query q and the persona knowledge K, using n attentive modules.

Taking the \(j\)-th response \(r_j\) as an example, we use \(R_{c}^j=\{r^j_{0}, r^j_{1},...,r^j_{n}\}\) to represent the contextual semantic features of \(r_j\). Specifically, we first initialize the word representation \(r^j_{0}\) with Word2Vec [40], then feed the word embeddings into n attentive modules to obtain deep contextual response representations.

$$\begin{aligned} r_{l}^{j}=f_{\text {Attn}}\left( r_{l-1}^{j}, r_{l-1}^{j}, r_{l-1}^{j}\right) , \quad 1 \le l \le n \end{aligned}$$
(3)

where \(r_{l}^{j}\) is the contextual representation output by the \(l\)-th attentive module.

Furthermore, the personalized expression is also conditioned on dialogue posts. In view of this, we let the representation \(r_j \in R^j\) attend to the corresponding post representation \(p_j\) to obtain the cross-attention representation \(P^j=\{c^p_{0}, c^p_{1},...,c^p_{n}\}\):

$$\begin{aligned} c_{l}^{p}=f_{\text {Attn}}\left( r_{l}^{j}, p_{l}^{j}, p_{l}^{j}\right) , l \in [1,n], \end{aligned}$$
(4)

where \(p_{l}^j\) is obtained in the same way as \(r_{l}^j\), and \(K^j\) is calculated between \(q_t\) and the persona knowledge K.

In the conversation process, we believe that dialogue posts and the current dialogue query are essential for persona understanding, whereas redundant persona knowledge negatively impacts dialogue generation. In view of this, we also perform cross-attention between the dialogue posts P, the current query q and the persona knowledge K to obtain \(K_{c}^{j}=\{c_1^K, c_2^K,...,c_n^K\}\) and \(R^{j}=\{c_1^q, c_2^q,...,c_n^q\}\):

$$\begin{aligned} \begin{aligned} c_{l}^{K}&=f_{\text {Attn}}\left( p_{l}^{j}, K_{l}, K_{l}\right) , l \in [1,n] \\ c_{l}^{q}&=f_{\text {Attn}}\left( K_{l}, q_{l}, q_{l}\right) , l \in [1,n]. \end{aligned} \end{aligned}$$
(5)
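The stacked self-attention of Eq. (3) and the cross-attention of Eqs. (4)-(5) can be sketched as follows, assuming numpy; a bare scaled dot-product attention stands in for the full attentive module, and all sizes and the random embeddings are illustrative.

```python
# Sketch of the Semantic Extraction Layer: n stacked self-attention passes
# over a response/post pair (Eq. 3), then cross-attention at each depth (Eq. 4).
import numpy as np

def f_attn(Q, K, V):
    # simplified stand-in for the attentive module (attention only, no FFN)
    d = Q.shape[-1]
    s = Q @ K.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(1)
n, l, d = 3, 5, 8                        # attentive layers, length, hidden size
r = [rng.normal(size=(l, d))]            # r_0: word embeddings of response r_j
p = [rng.normal(size=(l, d))]            # p_0: embeddings of the matching post p_j
for _ in range(n):                       # Eq. 3: stacked self-attention
    r.append(f_attn(r[-1], r[-1], r[-1]))
    p.append(f_attn(p[-1], p[-1], p[-1]))
# Eq. 4: the response attends to its post at every depth
c_p = [f_attn(r[i], p[i], p[i]) for i in range(1, n + 1)]
print(len(r), len(c_p), c_p[0].shape)    # 4 3 (5, 8)
```

The query-persona cross-attentions of Eq. (5) follow the same pattern with the post, persona and query representations swapped into the Q/K/V slots.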

The Knowledge Interaction Layer aims to generate a personalized style state \(M_s\) that measures the personalized matching degree at multiple granularities. Specifically, given the \(j\)-th dialogue post \(p^j_{l}\) and its corresponding response \(r_{l}^{j}\), we calculate \({m1}_{l}^{j}\), which measures the relation between the dialogue post and the response. We then compute \({m2}_{l}^{j}\) and \({m3}_{l}^{j}\) to capture the relations between dialogue posts, persona knowledge and the dialogue query:

$$\begin{aligned} {m1}_{l}^{j}=\frac{c_{l}^{P} \cdot {p^j_{l}}^\top }{\sqrt{d}}, {m2}_{l}^{j}=\frac{c_{l}^{q} \cdot {p^j_{l}}^\top }{\sqrt{d}}, {m3}_{l}^{j}=\frac{{c}_{l}^K \cdot {q_{l}^t}^\top }{\sqrt{d}}, l \in [1,n] \end{aligned}$$
(6)

where d is the dimension of the embeddings and l denotes the representation output by the \(l\)-th attentive module. Therefore, for the historical context \(H=\left\{ (p_{1},r_{1}),...,(p_{t-1},r_{t-1})\right\} \), we have \(M1_{l}=\{{m1}_{l}^{1},...,{m1}_{l}^{t-1}\}\), \({{M2}}_{l}=\{{{m2}}_{l}^{1},...,{{m2}}_{l}^{t-1}\}\) and \({{M3}}_{l}=\{{{m3}}_{l}^{1},...,{{m3}}_{l}^{t-1}\}\) to mine the personalized features from the dialogue history and persona knowledge.

To share the above matching matrices, we transform them into a shared feature space:

$$\begin{aligned} M_{s}=f_{\text {stack}}\left( \left\{ {M1}_{1}, \ldots , {M1}_{n}, {M2}_{1}, \ldots , {M2}_{n}, {{M3}}_{1}, \ldots , {{M3}}_{n} \right\} \right) \end{aligned}$$
(7)

where \(f_{\text {stack}}(\cdot )\) refers to concatenation along a new dimension, \(M_{s} \in {\mathbb {R}}^{3(n+1) \times (t-1) \times L \times L}\), and L is the maximum sequence length.
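Eqs. (6)-(7) can be sketched as follows with numpy; random arrays stand in for the real cross-attended representations, and all sizes are illustrative.

```python
# Sketch of the Knowledge Interaction Layer: scaled similarity matrices (Eq. 6)
# gathered over depths and turns, then stacked into M_s (Eq. 7).
import numpy as np

rng = np.random.default_rng(2)
n, t, L, d = 2, 4, 5, 8            # attentive depths, turns, max length, hidden size

def match(a, b):
    # one entry of Eq. 6: (L, d) x (L, d) -> (L, L), scaled by sqrt(d)
    return a @ b.T / np.sqrt(d)

M = []                             # Eq. 7: three matrix families, n+1 depths each
for _ in range(3 * (n + 1)):
    per_turn = [match(rng.normal(size=(L, d)), rng.normal(size=(L, d)))
                for _ in range(t - 1)]
    M.append(np.stack(per_turn))   # (t-1, L, L)
M_s = np.stack(M)                  # stack on a new dimension: (3(n+1), t-1, L, L)
print(M_s.shape)  # (9, 3, 5, 5)
```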

Inspired by [41, 42], the Fusion Layer extracts matching features from the personalized style state \(M_s\) via a convolutional neural network (CNN), which simply and effectively aggregates characteristics from multiple perspectives, including explicit persona features, implicit persona features and dialogue history features. The high-dimensional feature is then linearly mapped into a lower-dimensional feature space via a multi-layer perceptron (MLP):

$$\begin{aligned} V_{s} = \text {MLP}(f_{\text {CNN}}(M_{s})). \end{aligned}$$
(8)

In particular, the personalized matching matrix \(V_{s}\) contains the matching features between the dialogue history and the dialogue query. Although dialogues are sorted by time, the influence of long-distance turns fades for the historical responses. We therefore employ self-attention to dynamically aggregate the personalized matching states \(V_{s}\). Finally, we obtain the personalized matching feature \(g^S(q^{t},K,H)\):

$$\begin{aligned} s_{Attn}=\, & {} \text {softmax}\left( \text {MLP}\left( \tanh \left( \text {MLP}\left( V_{s}\right) \right) \right) \right) \end{aligned}$$
(9)
$$\begin{aligned} g^S(q,K,H)= & {} \sum _{\text {dim}=0} s_{\text {Attn}} \odot V_{s}, \end{aligned}$$
(10)

where \(s_{\text {Attn}} \in {\mathbb {R}}^{(t-1) \times d}\) means the attention weights and \(\odot \) denotes element-wise multiplication.
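The self-attentive aggregation of Eqs. (9)-(10) can be sketched as follows, assuming numpy; \(V_s\) is taken as a \((t-1) \times d\) matrix of per-turn matching features, and the two MLPs are reduced to single linear maps for illustration.

```python
# Sketch of the Fusion Layer aggregation: attention weights over turns (Eq. 9)
# followed by an element-wise weighted sum over the turn dimension (Eq. 10).
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
t_minus_1, d, d_h = 4, 8, 6
V_s = rng.normal(size=(t_minus_1, d))               # per-turn matching features
W_a = rng.normal(size=(d, d_h)) * 0.1               # inner MLP (illustrative)
W_b = rng.normal(size=(d_h, d)) * 0.1               # outer MLP (illustrative)

s_attn = softmax(np.tanh(V_s @ W_a) @ W_b, axis=0)  # Eq. 9: weights per turn
g_S = (s_attn * V_s).sum(axis=0)                    # Eq. 10: weighted sum
print(g_S.shape)  # (8,)
```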

3.6 Dialogue consistency matching module

Generative persona-grounded dialogue systems should output coherent responses consistent with the context, which is essential for generating a consistent response. For example, suppose a chatbot once said it likes to drink tea; when later asked what it hates drinking, it also replied tea. Such context inconsistency makes it difficult for users to trust the chatbot and undermines its practical value. Based on this observation, we design a dialogue consistency matching module to model consistency features from the dialogue.

The consistency matching module aims to obtain the consistency matching state vectors \(g^C(q,H)\) and \(g^C(q,K)\), which measure the relevance between the dialogue query q, the dialogue history H and the persona knowledge K. Unrelated dialogue history and persona knowledge can negatively affect response generation, so we need to filter them out. Intuitively, we can compute relevance vectors \(s_H \in {\mathbb {R}}^{t-1}\), \(s_P \in {\mathbb {R}}^{t-1}\) and \(s_K \in {\mathbb {R}}^{t-1}\) that measure the topic relatedness between the current query q, the historical posts \(p=\{p_1,p_2,...,p_{t-1}\}\) and the persona knowledge K. Then, we re-weight the dialogue history H, the dialogue posts P and the persona knowledge K:

$$\begin{aligned} D_{H}=s_{H} \cdot H, D_{P}=s_{P} \cdot P, D_{K}=s_{K} \cdot K. \end{aligned}$$
(11)

We assume the topic relatedness can be divided into contextual-level relevance and word-level relevance. The relevance state s consists of contextual-level relevance state \(s_1\) and word-level relevance state \(s_2\).

For contextual-level relevance state:

$$\begin{aligned} {\textbf{s}}_{1}=f_{\text {sim}}(p,q)=\frac{{\textbf{U}}^{p} \cdot {\textbf{u}}^{q}}{\left\| {\textbf{U}}^{p}\right\| _{2}\left\| {\textbf{u}}^{q}\right\| _{2}}, \end{aligned}$$
(12)

where \(u^q = \mathop {\mathrm{{mean}}}\limits _{\dim = 1} {q}\) and \(U^{P} = \mathop {\mathrm{{mean}}}\limits _{\dim = 2} {P}\) are sentence representations obtained by mean pooling over the word dimension.
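The contextual-level relevance of Eq. (12) is a cosine similarity between mean-pooled sentence vectors, sketched below with numpy; random embeddings stand in for learned representations.

```python
# Sketch of Eq. 12: cosine similarity between the mean-pooled query and each
# mean-pooled historical post. Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(4)
t_minus_1, L, d = 3, 5, 8
q = rng.normal(size=(L, d))                # current query, L word embeddings
P = rng.normal(size=(t_minus_1, L, d))     # t-1 historical posts

u_q = q.mean(axis=0)                       # mean-pool over words -> (d,)
U_p = P.mean(axis=1)                       # mean-pool per post  -> (t-1, d)
s1 = (U_p @ u_q) / (np.linalg.norm(U_p, axis=1) * np.linalg.norm(u_q))
print(s1.shape)  # (3,)
```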

For the word-level relevance state, we compute the word-level matching state matrix \(M_w \in {\mathbb {R}}^{(t+1) \times L \times L}\) by:

$$\begin{aligned} M_{w}={\textbf{W}}_{3}^{\top } \tanh \left( K \cdot {\textbf{W}}_{1} \cdot q^{\top } + P \cdot {\textbf{W}}_{2} \cdot q^{\top } \right) \end{aligned}$$
(13)

where \({\textbf{W}}_{1} \in {\mathbb {R}}^{d \times d \times (t+1)}\), \({\textbf{W}}_{2} \in {\mathbb {R}}^{d \times d \times (t+1)}\) and \({\textbf{W}}_{3} \in {\mathbb {R}}^{(t+1) \times 1}\), and \(K=\{k_{1},k_{2},...,k_{t-1}\}\) is the contextual representation of the historical posts. To obtain the most important matching features, we apply max-pooling to the word-level matching state matrix and then use an MLP with softmax to produce the word-level relevance vector \(s_2\):

$$\begin{aligned} s_{2} = \text {softmax}\left( \text {MLP}\left( \left[ \mathop {\max }\limits _{\dim =2} M_{w};\, \mathop {\max }\limits _{\dim =3} M_{w}\right] \right) \right) \end{aligned}$$
(14)

where [;] denotes the concatenation operation. We combine the context-level and word-level relevance vectors by:

$$\begin{aligned} s = \varvec{\alpha } \cdot s_1 + (1-\varvec{\alpha }) \cdot s_2, \end{aligned}$$
(15)

The relevance vector s is obtained using the current query q as a key to attend to the persona knowledge and the related historical posts. We then design a multi-hop perception method to track the multi-hop conversation dynamically. Specifically, we store historical posts in a dialogue buffer and a persona buffer. At each hop, the most related historical dialogue \(s^{H_j}\) and persona knowledge \(s^{K_j}\) are selected and collected into \(S=\{(s^{H_1}, s^{K_1}),(s^{H_2}, s^{K_2}),...,(s^{H_{t-1}}, s^{K_{t-1}})\}\). At hop 1, \(S=\varnothing \) and the relevance score is computed by Eq. (15) with the current query q. We then update the current query q by:

$$\begin{aligned} q=\mathop {{\text {mean}}}\limits _{\dim =2} f_{\text {stack}}(q \cup S) \end{aligned}$$
(16)

We then obtain a new relevance score s via Eq. (15) using the updated representation q, where \(s^i\) denotes the relevance score at hop i. We then linearly map these scores into the final re-weighting scores: \({\bar{s}}_{H} = S \cdot \varvec{\beta _{H}}\), where \(\varvec{\beta _{H}} \in {\mathbb {R}}^{k \times 1}\) and \({\bar{s}}_{H} \in {\mathbb {R}}^{t+1}\). Thus, we can rewrite Eq. (11) as:

$$\begin{aligned} D_H = {\bar{s}}_H \times H, D_P = {\bar{s}}_P \times P, D_K = {\bar{s}}_K \times K. \end{aligned}$$
(17)

To thoroughly measure the relevance between dialogue history, current query and persona knowledge, we construct three matching matrices:

$$\begin{aligned} \begin{aligned} M_1^{K}&= \left[\frac{{q{B_1}D_{K}^T}}{{\sqrt{d} }};\frac{{q \cdot D_{K}^T}}{{{{\left\| q \right\| }_2}{{\left\| {D_{K}} \right\| }_2}}}\right] \\ M_1^P&= \left[\frac{{q{B_2}{D_{P}^T}}}{{\sqrt{d} }};\frac{{q \cdot {D_{P}^T}}}{{{{\left\| q \right\| }_2}{{\left\| D_{P} \right\| }_2}}}\right] \\ M_1^H&= \left[\frac{{q{B_3}{D_{H}^T}}}{{\sqrt{d} }};\frac{{q \cdot {D_{H}^T}}}{{{{\left\| q \right\| }_2}{{\left\| D_{H} \right\| }_2}}}\right]. \end{aligned} \end{aligned}$$
(18)

where \(B_* \in {\mathbb {R}}^{d \times d}\) and d is the embedding size.
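As a concrete illustration, each matrix in Eq. (18) pairs a scaled bilinear term with a cosine term; the sketch below stacks the two terms as channels for the downstream 2D CNN, with a random matrix standing in for the learned \(B_*\):

```python
import torch


def matching_matrix(q, D, B):
    # Eq. (18), one of the three matrices: a scaled bilinear term and a
    # cosine term, stacked here as two channels for the downstream 2D CNN.
    # q: (Lq, d) query token embeddings, D: (Ld, d) memory embeddings,
    # B: (d, d) learned bilinear map (random here, for illustration).
    d = q.shape[-1]
    bilinear = q @ B @ D.t() / d ** 0.5                        # (Lq, Ld)
    cosine = torch.nn.functional.cosine_similarity(
        q.unsqueeze(1), D.unsqueeze(0), dim=-1)                # (Lq, Ld)
    return torch.stack([bilinear, cosine], dim=0)              # (2, Lq, Ld)
```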

As with the personalized matching process, we use a 2D CNN with max-pooling to extract high-level matching features. Since the dialogue history is chronological, we utilize a single GRU layer to capture the temporal signals across response pairs in the dialogue history. We employ the GRU's final state as the consistency matching feature \(g^{C_{H}}(q,H)\); \(g^{C_{P}}(q,P)\) and \(g^{C_{K}}(q,K)\) are calculated in the same way.
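A hedged sketch of this extractor: a small 2D CNN with max-pooling applied to each turn's matching matrix, followed by a single GRU over turns whose final state serves as the consistency matching feature. Layer sizes here are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn


class ConsistencyMatcher(nn.Module):
    # 2D CNN with max-pooling over each turn's matching matrix, then a
    # single GRU over turns; the GRU's final state is the matching feature.
    # Channel counts and feature sizes are illustrative assumptions.
    def __init__(self, in_channels=2, feat_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool2d((4, 4)),  # max-pool to a fixed grid
        )
        self.gru = nn.GRU(8 * 4 * 4, feat_dim, batch_first=True)

    def forward(self, matrices):
        # matrices: (turns, channels, Lq, Ld), one matching matrix per turn
        feats = self.conv(matrices).flatten(1)   # (turns, 128)
        _, h_n = self.gru(feats.unsqueeze(0))    # GRU over the turn sequence
        return h_n.squeeze(0).squeeze(0)         # final state: (feat_dim,)
```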

3.7 Feature fusion module

In Sects. 3.5 and 3.6, we obtain three kinds of matching features: (1) the personalized matching feature \(g^S(q_t, K, H)\), which mines personalized features from the persona K and the historical context; (2) the context-aware consistency matching features \(g^{C_{H}}(q,H)\) and \(g^{C_{P}}(q,P)\); and (3) the persona-aware consistency matching feature \(g^{C_{K}}(q,K)\), which measure the consistency of generated responses from multiple views. We concatenate these matching features to obtain the final matching vector. Next, we use an MLP with a sigmoid activation function to calculate the final matching score:

$$\begin{aligned} {\mathscr {F}} = \sigma (\textrm{MLP}([g^S(q, P, H);\,g^{C_{H}}(q,H);\,g^{C_{P}}(q,P);\,g^{C_{K}}(q,K)])). \end{aligned}$$
(19)
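Eq. (19) can be sketched as a straightforward concatenate-then-score module; the feature dimensions and hidden size below are illustrative assumptions:

```python
import torch
import torch.nn as nn


class FusionModule(nn.Module):
    # Eq. (19): concatenate the personalized matching feature with the
    # three consistency matching features and score with an MLP + sigmoid.
    # Feature and hidden dimensions are illustrative.
    def __init__(self, dims=(64, 64, 64, 64), hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(sum(dims), hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, g_s, g_ch, g_cp, g_ck):
        fused = torch.cat([g_s, g_ch, g_cp, g_ck], dim=-1)
        return torch.sigmoid(self.mlp(fused)).squeeze(-1)
```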

3.8 Unlikelihood training

We train our model with an unlikelihood method [12, 35, 43] so that it learns to understand coherence from large-scale dialogue inference data. We employ a negative log-likelihood (NLL) loss for dialogue generation and an unlikelihood loss for consistency understanding. Details are provided below.

$$\begin{aligned} \mathcal {L}_{\mathrm{{NLL}}} = -\log p_{\gamma }(R \mid q,H,K). \end{aligned}$$
(20)

We collect positive samples \({\mathcal {D}}^\text {p}\) from the entailed category and negative samples \({\mathcal {D}}^\text {n}\) from the contradicted category of DNLI:

$$\begin{aligned} {\mathcal {D}}^\text {p} = \{ {{\bar{P}}^{(i)}}, {{\bar{R}}^{(i){p}}} \}, {\mathcal {D}}^\text {n} = \{ {\bar{P}}^{(i)}, {\bar{R}}^{(i){n}} \}, \end{aligned}$$
(21)

where \({\bar{P}}\) and \({\bar{R}}\) are the premise and hypothesis, respectively. For data from \({\mathcal {D}}^\text {p}\), we employ the NLL loss:

$$\begin{aligned} {\mathcal {L}}_\mathrm{{UL}}^{p} = - \sum _{i=1}^{|{{\bar{R}}} |} \log \left( p_\gamma \left( r^* \mid {\bar{P}}, {\bar{R}}\right) \right) , \end{aligned}$$
(22)

For data from \({\mathcal {D}}^\text {n}\), we apply the unlikelihood objective to minimize the likelihood of contradictions:

$$\begin{aligned} {\mathcal {L}}^{n}_\mathrm{{UL}} = - \sum _{i=1}^{|{{\bar{R}}} |} \log (1 - p_\gamma (r^* \mid {\bar{P}}, {\bar{R}})), \end{aligned}$$
(23)

which reduces the probability of inconsistent tokens during generation. The training steps can be summarized as follows:

  1. Response prediction: Given the dialogue query q, persona knowledge K, and dialogue history H from personalized dialogue data, our model calculates the dialogue loss following Eq. (20);

  2. Consistency enhancing: Given \({\mathcal {D}}^\text {p}\) and \({\mathcal {D}}^\text {n}\) from DNLI, our model calculates the unlikelihood loss \({\mathcal {L}}=\varvec{\beta } {\mathcal {L}}_\mathrm{{UL}}^{p} + (1-\varvec{\beta }){\mathcal {L}}_\mathrm{{UL}}^{n}\).
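The two objectives above reduce, per token probability, to Eqs. (22) and (23); the sketch below illustrates them on plain probability lists, with the mixing weight `beta` as a placeholder:

```python
import math


def nll_loss(token_probs):
    # Eqs. (20)/(22): negative log-likelihood of the target tokens.
    return -sum(math.log(p) for p in token_probs)


def unlikelihood_loss(token_probs):
    # Eq. (23): penalize probability mass assigned to contradicted targets.
    return -sum(math.log(1.0 - p) for p in token_probs)


def consistency_loss(pos_probs, neg_probs, beta=0.5):
    # Step (2): blend the two objectives; beta=0.5 is a placeholder value.
    return beta * nll_loss(pos_probs) + (1 - beta) * unlikelihood_loss(neg_probs)
```

Driving a contradicted token's probability toward zero makes the unlikelihood term vanish, while the NLL term rewards confident entailed tokens.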

4 Experiments

In this section, we illustrate and discuss the experimental results to validate the effectiveness of our approach and explore its limitations. Our code will be released at https://github.com/fuyongxu0908/IMPACT.

4.1 Datasets

To assess the performance of our model, we performed persona-based dialogue generation experiments in a persona-dense and a persona-sparse scenario with two publicly available datasets:

  • PersonaChat [4] is a crowdsourced dataset that covers a wealth of persona features. Dialogues in this dataset are grounded in specific personal facts. Here we use the ConvAI2 [44] version, so the results are comparable to those of existing methods.

  • PersonalDialog [45] is a persona-sparse dataset collected from Weibo. This dataset provides persona profiles and conversations, but most conversations are not persona-related. Random and biased test sets are available: the random test set is distributed similarly to the training set, while the biased test set is hand-selected to cover personality-related characteristics.

Table 2 Statistics of persona-based dialogue datasets
Table 3 Statistics of different inference datasets

The key statistics of two personalized dialogue datasets are summarized in Table 2. As aforementioned, we leverage two dialogue inference datasets, DNLI [35] and KvPI [7] for unlikelihood training. The statistics of these inference datasets are summarized in Table 3.

4.2 Baselines

  • Seq2Seq [46] is the standard Seq2Seq model with attention. We concatenate persona descriptions and dialogue history as the input sequence to generate responses.

  • HRED [47] is a hierarchical recurrent encoder-decoder model with attention.

  • Generative Profile Memory network(GPMN) [4] is a generative model that encodes each persona description as an individual memory representation within a memory network.

  • Persona-CVAE (Per-CVAE) [5] is a memory-augmented CVAE conditioned on the chatbot's persona that focuses on generating diverse conversational responses.

  • Transformer [39] is employed as a baseline for both the PersonaChat and PersonalDialog experiments, with personas linked to the dialogue queries.

  • CMAML [48] is a meta-learning-based approach that adapts to rarely seen personas through customized model structures.

  • GDR [33] is a three-stage framework that employs a generate-delete-rewrite mechanism to remove inconsistent words from a generated response prototype and rewrite it into a persona-consistent response.

  • Generative Split Memory Network (GSMN) [49] is a memory network that utilizes split memories, one for persona knowledge and the other for dialogue history.

4.3 Evaluation metrics

We evaluate our approach primarily in three areas: response quality, diversity, and consistency. For comparison of different models, we employ automatic metrics and human evaluations.

4.3.1 Automatic metrics

We employ perplexity (PPL.) and Distinct-1/2 (Dist.1/2), following common practice [4, 34], to evaluate our method. Lower perplexity indicates better language modeling performance. Distinct-1/2 [50] is the proportion of distinct uni-grams/bi-grams; higher distinct scores mean the generated responses are more diverse. Distinct-1/2 is formulated as:

$$\begin{aligned} \begin{aligned} \textrm{Distinct}-k({\hat{Y}})&= \frac{|C_{k} |}{\sum _{s \in C_{k}} \sum _{i=1}^{N} f({\hat{Y}}_i, s)} \\ C_{k}&= \bigcup _{i=1}^{N} \mathrm {k-gram}({\hat{Y}}_i), \end{aligned} \end{aligned}$$
(24)

where \(f({\hat{Y}}, s)\) is the number of occurrences of s in \({\hat{Y}}\), and k=1,2.
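Eq. (24) amounts to counting unique k-grams over total k-grams across the generated responses, e.g.:

```python
def distinct_k(responses, k):
    # Eq. (24): unique k-grams across all generated responses divided by
    # the total k-gram count. `responses` is a list of token lists.
    kgrams = []
    for toks in responses:
        kgrams.extend(tuple(toks[i:i + k]) for i in range(len(toks) - k + 1))
    return len(set(kgrams)) / len(kgrams) if kgrams else 0.0
```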

We employ Consistency Score (C.Score) [51] and Delta Perplexity (\(\Delta P\)) [12] to test the response consistency of our method. The Consistency Score leverages a referee model to predict the consistency between a response r and each of the t given persona sentences \(p_i\), and can be defined as:

$$\begin{aligned} \begin{aligned} \textrm{NLI}(r, p_i)&= {\left\{ \begin{array}{ll} -1&{} \textrm{contradict}\\ 0 &{} \textrm{irrelevant}\\ 1 &{} \textrm{entailment} \end{array}\right. }\\ \mathrm {C.Score}(r)&=\sum _{i=1}^t\textrm{NLI}(r,p_i) \end{aligned} \end{aligned}$$
(25)

Here the NLI function is a pre-trained RoBERTa model finetuned on the dialogue inference datasets, i.e., DNLI and KvPI, as described in Table 3. Delta Perplexity (\(\Delta P\)) evaluates consistency from the model's internal distributions. Li et al. [12] first estimate the perplexity of entailed (p.Ent) and contradicted (p.Ctd) dialogues in the inference dataset. A model that understands dialogue well should assign low perplexity to entailed dialogues and high perplexity to contradicted ones. Based on this, \(\Delta P\) can be defined as:

$$\begin{aligned} \Delta P = {f}_{\text {PPL}}(\text {Contr}) - {f}_{\text {PPL}}(\text {Ent}) \end{aligned}$$
(26)

where a larger \(\Delta P\) means the model has a better ability to distinguish entailment from contradiction.
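Both consistency metrics are simple aggregations once the referee judgments and perplexities are available; for illustration:

```python
def c_score(nli_labels):
    # Eq. (25): sum the referee's judgments, each being -1 (contradict),
    # 0 (irrelevant), or 1 (entailment).
    assert all(lab in (-1, 0, 1) for lab in nli_labels)
    return sum(nli_labels)


def delta_p(ppl_contradicted, ppl_entailed):
    # Eq. (26): a larger gap means the model separates entailment from
    # contradiction better.
    return ppl_contradicted - ppl_entailed
```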

4.3.2 Human evaluations

We recruit five volunteers who are skilled in language tasks but know nothing about the models. We sample 100 examples for each volunteer to evaluate our model. The volunteers are asked to measure dialogue quality from three aspects: fluency (Flue.), informativeness (Info.), and relevance (Relv.). Each aspect is rated on a five-point scale, where 1, 3, and 5 indicate inappropriate, intermediate, and ideal performance, respectively. The volunteers are likewise instructed to label the consistency between the context, the persona, and the response (Con.C. and Per.C.), where 1 means consistent, 0 means neutral, and -1 means contradicted.

4.4 Results and analysis

Not all real-world conversations need to integrate persona knowledge; personas in dialogue are sparse. To verify the effectiveness of our method, we experiment in two dialogue settings: persona-dense dialogue and persona-sparse dialogue. The results are shown below (Figs. 3 and 4).

4.4.1 Persona-dense results

Table 4 Automatic and human evaluation results on the full ConvAI2 dataset
Fig. 3
figure 3

Results on full PersonaChat

We first report the results on PersonaChat in Table 4. Our approach achieves better performance on all human and automatic evaluation metrics, which indicates our model's effectiveness. In particular, our model obtains significant improvements on \(\Delta P\) and C.Score. These results make clear that our model differentiates entailment from contradiction more effectively than the baseline approaches, i.e., it better understands persona consistency. Besides, our method also shows clear improvements on PPL and diversity: lower PPL demonstrates better language modeling ability, and higher diversity indicates that our model generates more varied responses. Furthermore, the human evaluation metrics, including the consistency metrics, show that our dialogue quality is better than that of the baseline models.

For the ablation study, we respectively remove (1) the unlikelihood training objective (UL), (2) the Personalized Characteristics Discovering module (PCD), and (3) the Dialogue Consistency Matching module (DCM) from IMPACT to investigate their effectiveness. Results are shown in Table 4. The diversity metric D.AVG drops notably when PCD is removed: without PCD, IMPACT only employs explicit persona knowledge to generate the personalized response, so the implicit persona knowledge from PCD improves the diversity of generated responses. From Per.C., we also conclude that PCD improves persona consistency. When we remove DCM, \(\Delta P\) and Con.C. also decline. Besides, when PCD and DCM are added, the final responses become slightly less fluent according to PPL and Flue.: while mining implicit semantics, PCD and DCM may introduce some noise that affects response fluency. Moreover, removing UL also reduces the consistency metrics Con.C. and Per.C.

4.4.2 Persona-sparse results

In the real world, persona knowledge is sparse in conversation, and not all responses need to be integrated with a specific persona. We further validate our model in a persona-sparse setting and test its performance in different contexts. The random test set is persona-sparse, sampled from real-world conversations. The biased test set was purposely selected to provide contexts under which speakers tend to reveal their personas. We report the evaluation results on both the random and biased test sets in Tables 5 and 6.

Table 5 The results of automatic and human evaluation on Random Testset of PersonalDialog

In the persona-sparse setting, our model's persona consistency cannot surpass that of the GDR model with its rewriting mechanism. One possible reason is that dialogue generation with persona-sparse data degenerates into a conventional dialogue generation task, so the advantages of our model can only partially be demonstrated. At the same time, we find that our model improves contextual consistency in the persona-sparse environment. From the results, it is clear that when we remove DCM and only use PCD for feature extraction and fusion, our model achieves higher diversity than our final model. Although DCM plays a positive role in enhancing consistency understanding, it biases generation toward the words in the persona and context, causing a certain degree of repetition and a decline in diversity; removing it therefore improves the diversity of generated responses. The accompanying drop in dialogue consistency when DCM is removed proves its effectiveness in dialogue consistency modeling. On the contrary, on the biased test set with richer persona knowledge, our method obtains the best results on the consistency metrics \(\Delta P\), C.Score, Con.C., and Per.C., demonstrating our approach's effectiveness in improving consistency. When we remove the unlikelihood training objective (UL), the consistency metrics inevitably decline, which provides evidence for the effectiveness of the consistency modeling objective. Overall, the descent of the automatic metrics makes apparent that implicit persona knowledge and dialogue consistency modeling are both necessary.

Table 6 The results of automatic and human evaluation results on Biased Testset of PersonalDialog
Fig. 4
figure 4

PPL and F1 results in PersonalDialog

The results on PersonalDialog are similar to those on PersonaChat, as shown in Tables 5 and 6. Our method beats all baseline models on the consistency-related automatic metrics \(\Delta P\) and C.Score, clearly supporting the consistency improvement. When we removed the unlikelihood training objective (UL) and used the regular cross-entropy objective on the Chinese dataset, the dialogue quality was slightly reduced, especially the consistency with context and persona. A possible reason is that the unlikelihood training objective is less sensitive to Chinese, whose semantics are challenging to understand.

4.5 Experimental settings

In our model, we adopt Word2Vec [40] to initialize the word embeddings; all baseline models use the same word embeddings in our experiments. The maximum length of the input is 128, and the maximum length of the response sequence is 64. We set the number of attentive modules in the personalized characteristics discovering module to 5. The hidden state size in our model is 256 (Fig. 5c and d). We optimize the model with the Adam method, with the learning rate set to 0.00001 (Fig. 5a and b) and the dropout set to 0.3. During prediction, we use beam search with a beam size of 10. We tune IMPACT and all baseline models on the validation set and evaluate them on the test set. For different datasets, we employ different batch sizes, as shown in Fig. 5e and f, and train the model for 20 epochs on one GeForce RTX 3090 24 G GPU. \(\alpha \), \(\beta _{H}\), \(\beta _{P}\), and \(\beta _{K}\) are set to 0.5, 0.3, 0.3, and 0.5, respectively.

Fig. 5
figure 5

Performance with different parameters

Table 7 Example of dialogue by IMPACT in English

4.6 Case study

In this section, we show a dialogue example generated by IMPACT, as shown in Table 7. Our method effectively mines implicit persona information from the dialogue history and promotes the generation model's understanding of the persona and context, producing responses consistent with the given persona and context.

5 Conclusion and future work

This paper proposes a personalized dialogue system (IMPACT) that discovers implicit persona knowledge to generate consistent responses. IMPACT achieves this target from two aspects. First, our method discovers implicit persona knowledge, including the personalized expression style. Second, the dialogue consistency matching module enhances dialogue consistency from multiple views, including contextual consistency and persona consistency. IMPACT can further be optimized through the unlikelihood training objective to improve dialogue consistency. Extensive experimental results on two large datasets show that our method outperforms all previous baseline models and verify that IMPACT is effective. However, the external implicit persona knowledge may introduce noise to a certain extent, which affects the model's understanding of the persona. Therefore, in future work we will consider screening persona knowledge and selecting knowledge valuable for interaction. Our method mines hidden knowledge from predefined knowledge. With the development of large language models and text generation technology, models such as ChatGPT have emerged; how to transfer knowledge from large models is also a promising direction.