1 Introduction

We live in a globally mobile society in which people of widely different cultural backgrounds live and work together. The number of people who leave their ancestral cultural environment and move to countries with a different culture and language is increasing. This spurs the need for culturally sensitive conversation agents, especially for sensitive topics such as health care. Hence, our aim is to design a culture-aware dialogue system which allows communication in accordance with the user’s cultural idiosyncrasies. By adapting the system’s behaviour to the user’s cultural background, the conversation agent may appear more familiar and trustworthy.

In this work, we use spoken dialogues from four European cultures (German, Polish, Spanish and Turkish) to train a culture-aware dialogue manager (see Fig. 1). The selection of the next system action is thus adapted to the cultural background of the user. We use a dialogue management framework based on the concept of probabilistic rules, which combines the benefits of logical and statistical methods for dialogue modelling: the probabilistic rules represent the internal models of the domain in a compact form, and the unknown parameters included in the probabilistic rules are automatically estimated from data using supervised learning. We investigate whether it is possible to train culture-specific parameters for these probabilistic rules in order to represent cultural patterns in the dialogue management decision process.

Fig. 1 The user’s culture is used in the dialogue management to adapt the system behaviour to the user.

The structure of the paper is as follows: In Sect. 2, we present related work in the field of culture-sensitive interface design. Afterwards, the corpus that has been used in this work is described in Sect. 3 and the design and implementation are outlined in Sect. 4. We present the evaluation and our results in Sect. 5, before concluding in Sect. 6.

2 Related Work

Brejcha [1] has described patterns of language and culture in Human-Computer Interaction and has shown why these patterns matter and how to exploit them to design a better user interface. Reinecke and Bernstein [8] have presented a design approach for culturally-adaptive user interfaces to enhance the usability. The authors developed a prototype web application that serves as a to-do list tool. During registration, the users have to input their current country of residence, former countries they lived in and the length of stay in each country. Based on this information, the design and content of the web application is adapted to the user.

Furthermore, Traum [9] has outlined how cultural aspects may be included in the design of a visual human-like body and the intelligent cognition driving action of the body of a virtual human. To this end, different cultural models have been examined and the author points out steps towards a fuller model of culture. Georgila and Traum [2] have presented how culture-specific dialogue policies of virtual humans for negotiation, and in particular for argumentation and persuasion, may be built. A corpus of non-culture-specific dialogues is used to build simulated users, which are then employed to learn negotiation dialogue policies using Reinforcement Learning. However, only negotiation-specific aspects are taken into account, while we aim to create an overall culture-sensitive dialogue manager.

A study presented by Miehle et al. [7] investigated cultural differences between Germany and Japan. The results pointed out that communication idiosyncrasies found in Human-Human Interaction may also be observed during Human-Computer Interaction in a Spoken Dialogue System context. Moreover, the study described by Miehle et al. [6] examined five European cultures whose communication styles are much more alike than the German and Japanese communication idiosyncrasies. The results show that there are differences among the cultures and that it depends on the culture whether there are gender differences concerning the user’s preference in the system’s communication style. These studies show that culture-adaptive behaviour is an important aspect of spoken user interfaces. This is why we address the task of culture-aware dialogue management.

3 Corpus Description

Our data set is based on recordings on health care topics containing spontaneous interactions in dialogue format between two participants: one acts as the system is expected to perform, while the other takes the role of the user of the system. Each dialogue is assigned a unique dialogue ID and each action is assigned

  • a dialogue action number,

  • a participant,

  • a speaker,

  • a dialogue action, and

  • the original utterance.

The dialogue action number counts from 1 to n starting with the first dialogue action, where n is equal to the number of dialogue actions of the respective dialogue. The participant specifies one of the two roles, system or user, and the speaker indicates which of the predefined speakers was talking. Each speaker is identified by an anonymous speaker ID and a separate table contains profile information about each speaker, including the gender, the culture, the age, the country of birth and the current country of residence. The dialogue action is chosen out of a predefined set of 59 distinct dialogue actions. Furthermore, the original utterance (in the original language) is added to each dialogue action and for each utterance, the topics being talked about are annotated (in English). Moreover, for each dialogue, the system’s role is specified. The available system roles are defined as social companion, nursing assistant and health expert. Overall, the corpus covers 258 dialogues. The culture distribution of the dialogue actions is shown in Table 1.
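The annotation scheme described above can be pictured as two record types. The following is only a minimal sketch: the class names, field names, IDs and the two example utterances are hypothetical and are not taken from the corpus.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DialogueAction:
    """One annotated dialogue action (field names are illustrative)."""
    dialogue_id: str       # unique ID of the dialogue
    action_number: int     # counts from 1 to n within the dialogue
    participant: str       # role: "system" or "user"
    speaker_id: str        # anonymous speaker ID
    dialogue_action: str   # one of the 59 predefined labels
    utterance: str         # original utterance in the original language
    topics: List[str]      # topics annotated in English


@dataclass
class SpeakerProfile:
    """Profile information kept in a separate table."""
    speaker_id: str
    gender: str
    culture: str
    age: int
    country_of_birth: str
    country_of_residence: str


def check_numbering(actions):
    """Return True if the action numbers run from 1 to n without gaps."""
    numbers = sorted(a.action_number for a in actions)
    return numbers == list(range(1, len(numbers) + 1))


# Made-up records for illustration only.
example = [
    DialogueAction("d001", 1, "user", "s17", "Greet", "Hallo.", ["greeting"]),
    DialogueAction("d001", 2, "system", "s03", "PersonalGreet",
                   "Hallo Anna.", ["greeting"]),
]
```

Keeping the speaker profile in its own record mirrors the separate profile table, so that the culture of a dialogue action can be looked up via the speaker ID.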

Table 1 The culture distribution of the dialogue actions in the corpus

4 Design and Implementation

For the implementation of our culture-aware dialogue manager, we have used the open-source software toolkit OpenDial [5]. It combines the benefits of logical and statistical methods for dialogue modelling by adopting a hybrid approach. Probabilistic rules represent the domain model in a structured format and allow system designers to integrate their domain knowledge. These rules contain unknown parameters that can be estimated from dialogue data using supervised learning. Thus, this hybrid concept allows the system designers to integrate domain-dependent constraints into a probabilistic context. The probabilistic rules formalism is described in [4]. In practice, the rules are defined as if...then...else constructs that map logical conditions to a distribution over possible effects. For the action selection, OpenDial provides utility rules that associate utility values with system decisions. They can be used to find the action with the highest expected utility in the current state. Using these utility rules, we have implemented our dialogue domain as described in Sect. 4.1. Afterwards, we have performed parameter estimation as explained in Sect. 4.2.
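The idea of a utility rule can be illustrated with a small sketch. It is not OpenDial code and does not use OpenDial's rule syntax: the function names, the parameter names and the three candidate actions are hypothetical, the parameters are represented by point values (e.g. their means) rather than distributions, and summing utilities across rules is an assumption of this sketch.

```python
def greet_rule(state, theta):
    """If the user action is Greet, assign utilities to candidate replies.

    Each candidate is a tuple of one or two system dialogue actions and
    its utility is given by one rule parameter.
    """
    if state.get("user_action") == "Greet":
        return {
            ("PersonalGreet",): theta["greet_personal"],
            ("PersonalGreet", "AskMood"): theta["greet_personal_mood"],
            ("Greet",): theta["greet_simple"],
        }
    return {}  # condition not satisfied: the rule contributes nothing


def select_action(state, rules, theta):
    """Return the candidate with the highest total utility, or None."""
    utilities = {}
    for rule in rules:
        for action, utility in rule(state, theta).items():
            utilities[action] = utilities.get(action, 0.0) + utility
    if not utilities:
        return None
    return max(utilities, key=utilities.get)


# Hypothetical parameter values, chosen only to make the example concrete.
theta = {"greet_personal": 5.1, "greet_personal_mood": 5.6, "greet_simple": 4.8}
best = select_action({"user_action": "Greet"}, [greet_rule], theta)
```

With these placeholder values, the two-action reply receives the highest utility and is selected; changing the parameter values changes the selected reply without touching the rule structure.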

Fig. 2 Implementation of the rule for greeting based on the corpus

4.1 Domain Design

For the implementation of our dialogue domain, we have derived the utility rules from the database described in Sect. 3. We have extracted all possible system actions in response to a user action, regardless of culture. Since three or more consecutive dialogue actions occur rarely in the data, we have limited the number of possible system actions as response to a user action to two. Afterwards, we have implemented a rule for every user action. Overall, we have seven user actions:

  • Accept

  • Declare

  • Goodbye

  • Greet

  • Reject

  • Request

  • Thank

As an example, the implementation of the rule for greeting is depicted in Fig. 2. The rule gets activated if the condition is true, i.e. if the user action is Greet. Since it is possible to react with one or two consecutive dialogue actions, there are eight possible effects for the next system action. This design approach ensures that only reasonable pairs of dialogue actions that are covered in the database are included in the domain.
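The extraction of possible system responses can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the input format are hypothetical, and cutting longer system turns down to their first two actions is one possible reading of the limit described above, not necessarily how the domain was actually built.

```python
from collections import defaultdict


def extract_responses(turns, max_len=2):
    """Map each user action to the set of system replies that follow it.

    turns: ordered list of (participant, dialogue_action) pairs of one
    dialogue. A reply is a tuple of at most max_len consecutive system
    actions (longer replies are truncated in this sketch).
    """
    responses = defaultdict(set)
    i = 0
    while i < len(turns):
        participant, action = turns[i]
        if participant != "user":
            i += 1
            continue
        # Collect the system actions that directly follow this user action.
        j = i + 1
        reply = []
        while j < len(turns) and turns[j][0] == "system":
            reply.append(turns[j][1])
            j += 1
        if reply:
            responses[action].add(tuple(reply[:max_len]))
        i = j
    return dict(responses)


# Made-up dialogue for illustration only.
turns = [
    ("user", "Greet"),
    ("system", "PersonalGreet"),
    ("system", "AskMood"),
    ("system", "Declare"),
    ("user", "Thank"),
    ("system", "AnswerThank"),
]
domain = extract_responses(turns)
```

Each key of the resulting dictionary corresponds to one rule condition, and each tuple in its value set to one possible effect of that rule.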

Fig. 3 Example of a dialogue transcript based on the corpus

4.2 Parameter Estimation

After implementing the dialogue domain, which captures the possible system actions in response to every user action, we have used the supervised learning approach based on the so-called Wizard-of-Oz learning provided within the OpenDial toolkit in order to estimate the rule parameters. This learning approach makes it possible to learn not only from Wizard-of-Oz experiments, but also from dialogue transcripts. As our corpus contains dialogue interactions between two participants where one takes the role of the system while the other takes the role of the user of that system, thus resembling the situation of Wizard-of-Oz experiments, we have created transcripts of these dialogues as input for our parameter estimation. An example of such a transcript is shown in Fig. 3. In this interaction, the user action Greet is followed by the system action PersonalGreet.

In the following, we explain the Wizard-of-Oz learning that has been used for the parameter estimation. According to Lison [4], a Wizard-of-Oz interaction is defined as a sequence of state-action pairs

$$\begin{aligned} \mathscr {D} = \{\langle \mathscr {B}_{i}, a_{i} \rangle : 1 \le i \le n \}, \end{aligned}$$
(1)

where \(\mathscr {B}_{i}\) is the dialogue state, \(a_{i}\) the performed wizard action at time \(i\) and \(n\) the total number of recorded actions. During the learning process, the goal is to learn the posterior distribution of the rule parameters \(\varvec{\theta }\) based on the Wizard-of-Oz training data set \(\mathscr {D}\). The algorithm takes each state-action pair \(\langle \mathscr {B}_{i}, a_{i} \rangle \in \mathscr {D}\) and updates the posterior parameter distribution after each pair. This posterior distribution can be decomposed as

$$\begin{aligned} P (\varvec{\theta }~|~\mathscr {D}) = \eta ~P (\varvec{\theta }) \prod _{\langle \mathscr {B}_{i}, a_{i} \rangle \in \mathscr {D}} P ( a_i~|~\mathscr {B}_{i};\varvec{\theta }), \end{aligned}$$
(2)

where \(P (\varvec{\theta })\) is the prior distribution and \(P ( a_i~|~\mathscr {B}_{i};\varvec{\theta })\) represents the likelihood of the wizard selecting the action \(a_i\) in the dialogue state \(\mathscr {B}_{i}\) given the rule parameters \(\varvec{\theta }\). This likelihood can be expressed as a geometric distribution

$$\begin{aligned} P ( a_i~|~\mathscr {B}_{i};\varvec{\theta }) = \eta (1 - p)^{x-1} p, \end{aligned}$$
(3)

where x is the rank of action \(a_i\) in the utility \(U ( a_i~|~\mathscr {B}_{i};\varvec{\theta })\), \(\eta \) is a normalisation factor and p represents the learning rate of the estimation process. For our experiments, we used \(p = 0.2\).
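The rank-based likelihood of Eq. 3 can be made concrete with a short sketch. The function names and the example utilities are hypothetical. The sketch further assumes that \(\eta \) normalises the geometric distribution over the finite number of candidate actions, which is one natural choice and not necessarily the one used in OpenDial.

```python
def rank_of(action, utilities):
    """Rank of an action (1 = highest utility) among all candidates."""
    ordered = sorted(utilities, key=utilities.get, reverse=True)
    return ordered.index(action) + 1


def wizard_likelihood(action, utilities, p=0.2):
    """Likelihood of the wizard choosing `action` (Eq. 3).

    The geometric distribution is truncated to the n candidate actions,
    so eta = 1 / (1 - (1 - p)^n) makes the likelihoods sum to one.
    """
    n = len(utilities)
    x = rank_of(action, utilities)
    eta = 1.0 / (1.0 - (1.0 - p) ** n)
    return eta * (1.0 - p) ** (x - 1) * p


# Hypothetical utilities for three candidate system actions.
utilities = {"PersonalGreet": 5.6, "Greet": 5.1, "AskMood": 4.8}
likelihoods = {a: wizard_likelihood(a, utilities) for a in utilities}
```

An action ranked first under the current parameters receives the largest likelihood, and each further rank multiplies the likelihood by \(1 - p\), so parameter values that rank the wizard's choice highly are favoured in the posterior.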

As prior distribution \(P (\varvec{\theta })\), we selected a Gaussian distribution with a mean value of 5 and a variance of 1. However, the probabilistic model in Eq. 2 contains both continuous and discrete random variables. This leads to a nontrivial inference problem. To address this issue, OpenDial offers a sampling technique called likelihood weighting, which approximates the inference process [3]. Hence, the posterior parameter distribution is sampled after the likelihood of the wizard action is calculated. The outcome is then expressed as a kernel density estimator, which subsequently can be converted into a Gaussian distribution. This procedure is repeated as long as training data is available.
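A single update step of this kind can be sketched in a few lines. The sketch is a simplification under several assumptions and is not OpenDial's implementation: every candidate action's utility is given by exactly one parameter, the weighted samples are summarised directly by their weighted mean and variance instead of going through a kernel density estimator, and all function and parameter names are hypothetical.

```python
import math
import random


def geometric_weight(rank, n_actions, p=0.2):
    """Normalised geometric likelihood of an action with the given rank."""
    eta = 1.0 / (1.0 - (1.0 - p) ** n_actions)
    return eta * (1.0 - p) ** (rank - 1) * p


def update_parameters(priors, chosen, n_samples=2000, p=0.2, rng=None):
    """One likelihood-weighting update for a single wizard action.

    priors: dict mapping parameter name -> (mean, variance) of a Gaussian.
    chosen: name of the parameter attached to the wizard's action.
    Returns Gaussian approximations (mean, variance) of the posterior.
    """
    if rng is None:
        rng = random.Random(0)
    names = list(priors)
    samples, weights = [], []
    for _ in range(n_samples):
        # Sample all parameters from their current Gaussian distributions.
        theta = {k: rng.gauss(m, math.sqrt(v)) for k, (m, v) in priors.items()}
        # Weight the sample by the likelihood of the wizard's choice.
        ordered = sorted(names, key=theta.get, reverse=True)
        rank = ordered.index(chosen) + 1
        samples.append(theta)
        weights.append(geometric_weight(rank, len(names), p))
    total = sum(weights)
    posterior = {}
    for k in names:
        mean = sum(w * s[k] for w, s in zip(weights, samples)) / total
        var = sum(w * (s[k] - mean) ** 2 for w, s in zip(weights, samples)) / total
        posterior[k] = (mean, var)
    return posterior


# Three hypothetical parameters, all starting from the prior N(5, 1).
rng = random.Random(42)
params = {"a": (5.0, 1.0), "b": (5.0, 1.0), "c": (5.0, 1.0)}
for _ in range(10):  # the wizard picks the action of parameter "a" ten times
    params = update_parameters(params, "a", rng=rng)
```

Repeating the update with the same wizard choice gradually raises the mean of the chosen parameter relative to the others, which is the behaviour the training procedure relies on.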

5 Evaluation

After implementing the dialogue domain and creating the dialogue transcript files for each culture (German, Polish, Spanish and Turkish) based on the data set described in Sect. 3, we have used the transcript files to train the rule parameters \(\varvec{\theta }\) of the dialogue domain for the different cultures. Thus, four different culture-specific domains have been trained. Proceeding from the initial probability distribution (Gaussian distribution, \(\mu =5\), \(\sigma ^2=1\)), each parameter has been trained based on the appearance of the corresponding system action in the data set. Since the parameters are updated after each pair of user action and system action, a more frequent occurrence of a system action in the database shifts the mean to a higher value. In contrast, a rare occurrence correlates with a lower mean value, reducing the probability that such a system action is selected. This effect is illustrated in Fig. 4. In the following, we will evaluate whether the trained parameters vary among the different cultures and therefore represent cultural patterns.

Fig. 4 Probability distribution of each parameter before training and example probability distributions of two parameters after training, representing a frequently occurring system action and a rarely occurring system action

Table 2 Mean values \(\mu \) and variances \(\sigma ^2\) of the probability distributions of the parameters \(\varvec{\theta }\) after training with 1000 German dialogue actions each
Table 3 Mean values \(\mu \) and variances \(\sigma ^2\) of the probability distributions of the three most highly ranked parameters \(\varvec{\theta }\) after training with the culture-specific data sets

In a first step, we have evaluated whether 1000 dialogue actions are enough for training the rule parameters as the Polish, Spanish and Turkish data sets each contain slightly more than 1000 dialogue actions (see Table 1). In order to do so, we have split the German data set consisting of 4849 dialogue actions into four subsets, each containing 1000 dialogue actions. After training with each of the four German training sets, we have obtained similar alternatives for the action selection as the three parameters with the highest mean values are the same for each rule. The corresponding values for the mean \(\mu \) and the variance \(\sigma ^2\) of the probability distributions for each rule are shown in Table 2. For some rules (e.g. Declare), we get the same ranking for every subset, for others (e.g. Accept) the mean values of the three most highly ranked parameters differ only slightly. However, each of the four German domains results in a similar system strategy, showing that the relevant information is contained in the data. This allows the assumption that training with 1000 dialogue actions is sufficient to train culture-specific parameters. Furthermore, as the average over all four subsets corresponds approximately to the values of subset German1, this subset is used to represent the German culture in the further course of the paper. Moreover, this first part of evaluation has revealed that the variance \(\sigma ^2\) of the probability distributions of the parameters is of little importance in the applied sampling technique. The action selection is mainly based on the mean value \(\mu \). Hence, we have based the cross-cultural comparison in the second part of our evaluation on the means.
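The consistency check across the four German subsets amounts to comparing the most highly ranked parameters per rule. A minimal sketch is given here, with hypothetical function names and placeholder numbers that are not the values reported in Table 2.

```python
def top_k(means, k=3):
    """Names of the k parameters with the highest mean, best first."""
    return sorted(means, key=means.get, reverse=True)[:k]


def same_top_set(subsets, k=3):
    """True if every subset yields the same set of top-k parameters."""
    tops = [set(top_k(means, k)) for means in subsets.values()]
    return all(t == tops[0] for t in tops)


def same_ranking(subsets, k=3):
    """True if the top-k parameters also appear in the same order."""
    tops = [top_k(means, k) for means in subsets.values()]
    return all(t == tops[0] for t in tops)


# Placeholder means for one rule and two subsets (illustrative only).
subsets = {
    "German1": {"t1": 6.2, "t2": 5.9, "t3": 5.7, "t4": 4.1},
    "German2": {"t1": 6.0, "t2": 5.6, "t3": 5.8, "t4": 4.3},
}
```

In this made-up example, both subsets share the same three top parameters although their order differs, which corresponds to the case described above in which the mean values differ only slightly.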

In the second step, we have used the German, Polish, Spanish and Turkish data sets (each containing approximately 1000 dialogue actions) to train the rule parameters \(\varvec{\theta }\) of four culture-specific dialogue domains. The three parameters with the highest mean value for each culture are shown in Table 3. It can be seen that the different characteristics of the cultures occasionally have led to different parameters with highest mean values. In the following, we discuss the similarities and differences for each rule.

Accept

If the last user action is accepting what the system has said, the system either reacts with a request for more information or with giving some information to the user (Declare or ReadNewspaper, which is a special form of giving information). This is the case for every culture tested in our scenario. However, the difference between the cultures lies in whether the system adds an Acknowledge (e.g. “Okay.”) or not. While this is very likely for German, it is less likely for Polish, Spanish and Turkish (as the parameter with the highest mean does not include it for these cultures).

Declare

If the last user action is a Declare, the system may request more information or give some information to the user. This again applies to each of our cultures. However, we are able to observe two differences between the cultures, namely (1) whether an Acknowledge is added or not, and (2) whether the information is presented as an Advise or a Declare. While it is very likely to add an Acknowledge for German and Turkish, it is less likely for Polish and Spanish. Moreover, the information will always be presented as a Declare (e.g. “He needs help getting up.”) for German and Spanish, while an Advise (e.g. “You should help him up.”) may be used for Polish and Turkish.

Goodbye

After the user says goodbye, the system usually answers with any form of saying goodbye. However, the form differs between the examined cultures. While for German, Polish and Turkish a SimpleGoodbye (e.g. “Good bye.”) is most probable, for Spanish it is more likely that a MeetAgainGoodbye (e.g. “See you.”) is used. Moreover, for German it is also very common that the user’s name is used in a PersonalGoodbye (e.g. “Bye Anna.”) and for Polish a Thank (e.g. “Thank you.”) or an AnswerThank (e.g. “You’re welcome.”) instead of a Goodbye might be used.

Greet

If the user greets the system, the most likely system response is also a greeting. The German, Polish and Turkish models use a PersonalGreet (e.g. “Hello Anna.”). However, in our Spanish model, the user is not addressed by name but simply welcomed with a Greet (e.g. “Hello.”). Furthermore, in the German, Polish and Turkish models, an AskTask (e.g. “How can I help you?”) or an AskMood (e.g. “How are you?”) is very likely to be added to the greeting. However, for Spanish only one dialogue action is used.

Reject

In our Polish data, the user never rejects anything the system has said, even though the topics of conversation are evenly distributed among the cultures. For the other cultures, the system usually reacts with a request for more information or with giving some information to the user. For the latter, the difference between the cultures lies in whether the system uses a Declare or an Advise to present the information: German uses a Declare, while Spanish and Turkish use an Advise. Moreover, for German and Spanish, an Acknowledge is usually added.

Request

A request is the most likely user action, which can be seen from the high mean values in this rule. As expected, answering the user’s question and thus giving the requested information is the most likely system response for every culture. However, there are differences in how the information is presented. While in the German, Polish and Turkish models a Declare with the optional addition of an Accept or Reject is used, the Spanish model utilises an Advise rather than a Declare. Moreover, the Turkish model also uses a Motivate (e.g. “Good idea!”).

Thank

If the last user action is a Thank, the system may react to it with an AnswerThank or interpret it as a sign that the user wants to end the dialogue and thus answer with a Goodbye (or combine both dialogue actions). As the mean values of the three most highly ranked parameters differ only slightly for every culture, we compare the number of occurrences of these options among the most highly ranked parameters for each culture. We can see that the Polish model always answers with a Goodbye, while the others do not always interpret the user’s Thank in this way. Moreover, the Spanish model also uses a Motivate and the Turkish model may add an Advise.

6 Conclusion and Future Directions

With the aim of designing a culturally sensitive conversational assistant, in this work we have investigated whether culture-specific parameters may be trained using a supervised learning approach. In order to do so, we have used spoken dialogues from four European cultures, namely German, Polish, Spanish, and Turkish, to train a culture-aware dialogue manager. For the implementation we have used the open-source software toolkit OpenDial [5], which is based on the concept of probabilistic rules. Thus, it combines the benefits of logical and statistical methods for dialogue modelling.

For our evaluation we have trained four different culture-specific dialogue domains. For each culture, we have used a data set containing approximately 1000 dialogue actions as we have shown that 1000 dialogue actions are enough for training the rule parameters. Afterwards, we have compared the probability distributions of the trained parameters. Each parameter is expressed as a Gaussian distribution. Thus, we have examined the differences between the cultures in terms of the mean values of the corresponding probability distributions. The evaluation results have shown that the different characteristics of the cultures result in different parameters with highest mean values. Hence, the system response to a user action varies depending on the culture.

In future work, we will examine whether the proposed approach can be extended to other conversational topics and further cultures. In particular, we are interested in non-European cultures since the differences from the cultures studied in this work might be more significant. Moreover, we plan to conduct an evaluation with real users to see how a varying action selection based on the culture is perceived.