Abstract
Using emotional expressions is an effective dialogue technique in human–human dialogue. Introducing such techniques into human–robot interaction might similarly encourage a cooperative dialogue manner in system users. However, most existing research on emotional agent systems relied on the Wizard-of-Oz (WOZ) method to verify the abilities of interactive interfaces. In this paper, we build an autonomous dialogue robot that uses emotional expressions to elicit a cooperative dialogue manner from users. The robot uses both verbal and multimodal expressions, including emotional speech and emotional gestures, in interactions. Our dialogue experiments showed that positive emotional expressions are the most efficient strategy for facilitating cooperative dialogues with users, and that negative emotional expressions are also effective in some dialogue contexts. We also investigated several modalities that emphasize the robot's emotional expression abilities.
1 Introduction
Existing studies in human–human interaction have verified that emotional expressions are effective for eliciting a cooperative dialogue manner from the dialogue partner [13, 20, 28]. In some dialogue contexts, emotional appeals are more effective than rational arguments across various dialogue domains. For example, positive emotions can create a cooperative atmosphere that leads to a successful negotiation [7]. Another study showed that negative emotions such as anger can effectively wrest concessions from users [24]. These findings suggest that, in human–robot interaction, emotional expressions by dialogue agents or robots can give users a good impression and elicit cooperative dialogue.
Some existing studies based on the Wizard-of-Oz (WOZ) method verified that emotional expressions are effective not only in human–human interaction but also in human–robot dialogues. Adler et al. [1] investigated the relationship between utterance logicality and polarity in text chats with the WOZ method. Their results showed that positive utterances by their system make an effective impression on human interlocutors. Watanabe et al. [27] experimentally showed that using negative emotional expressions achieved successful negotiation dialogue with an android that operated on a pre-defined scenario and a touch-panel interface. These results suggest that robots and agents can elicit a cooperative dialogue manner from human partners using emotional appeals, as humans do.
Since these existing works on human–robot/agent interaction with emotional expressions rely on the WOZ method, investigating the effect of emotional expressions from an autonomous dialogue robot or agent remains an important challenge. This challenge has motivated researchers to apply deep learning techniques for automatic, fluent response selection and generation. Some works tackled the problem of generating or selecting the system's emotional response in text [10, 22]; others utilized the user's multimodal information to improve emotional treatment [4, 17]. In contrast, we focus on the effect of multimodal emotional expressions from dialogue robots in a cooperative dialogue situation. Our emotional robot aims to elicit the user's cooperative mind with multimodal expressions.
In this work, we built a dialogue system that can express its emotional state through various modalities based on a response selection approach. Our response selection module is based on Bidirectional Encoder Representations from Transformers (BERT) [5], a de facto standard model for fluent response selection and generation. We used speech variations for each emotional state corresponding to the same dialogue contexts, collected via crowdsourcing and recorded by a voice actress [2, 29]. The response selection module selects the emotional speech and the robot's emotional gestures considering the dialogue context.
We conducted dialogue experiments between users and our systems under different experimental conditions: different emotional states and different modalities. We investigated whether the dialogue robot elicits cooperation from the human partner, especially with high-arousal emotions (happiness and anger). The impression on the human partner is emphasized by increasing the number of modalities used by the dialogue robot. We also examined an emotion model that predicts the emotional state from dialogue contexts; however, using multiple emotions remains challenging because it requires natural emotional transitions.
2 Dialogue Robot with Multimodal Emotional Expressions
This study built a spoken dialogue robot that interacts with users using multimodal emotional expressions, to investigate how well such expressions convince others. In this section, we explain our task and the overall architecture of the system.
2.1 System Overview
The system overview is shown in Fig. 1. When the system receives a user utterance in text, it constructs a dialogue context, which consists of the user utterance and the previous system utterance (response). The system then selects an appropriate response based on the dialogue context and the emotional state chosen by the system (response selection module). The system uses the selected response and the emotional state to play emotional speech and make emotional gestures (speech and gesture generation module).
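The module flow above can be sketched as a simple turn loop. All names below are illustrative assumptions for the sake of the sketch, not the authors' implementation.

```python
# Sketch of one dialogue turn: build context -> decide emotion ->
# select response -> render speech and gesture.

def build_context(user_utterance, prev_system_utterance):
    """The dialogue context is the current user utterance plus the
    previous system response."""
    return (prev_system_utterance, user_utterance)

def dialogue_turn(user_utterance, prev_system_utterance,
                  select_response, decide_emotion, render):
    context = build_context(user_utterance, prev_system_utterance)
    emotion = decide_emotion(context)             # emotion decision model
    response = select_response(context, emotion)  # response selection module
    render(response, emotion)                     # speech + gesture module
    return response

# Toy plug-ins to show the control flow only.
responses = {"happy": "Great! Let's go jogging together!"}
out = dialogue_turn(
    "I don't feel like exercising.",
    "How about some jogging today?",
    select_response=lambda ctx, emo: responses[emo],
    decide_emotion=lambda ctx: "happy",
    render=lambda resp, emo: None,
)
```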
2.2 Dialogue Scenario
We assume a conversation between a robot and a user, as shown in Table 1. The robot speaks to the user about changing one of their living habits. We set the task as "a dialogue that encourages users to exercise"; the robot's goal is thus to convince the user to get more exercise. The dialogue continues until the user accepts the request or a pre-defined number of turns passes. This dialogue scenario is known as "persuasive dialogue," which encourages users to change their behaviors through interactions [6, 14, 26].
In human–human dialogues, some studies concluded that using emotional expressions is an efficient technique for persuasion and negotiation [13, 20, 28]; that is, for persuasive dialogues, emotional appeals are sometimes more effective than rational arguments. These findings suggest that the persuasive dialogue scenario is a good testbed for assessing the elicitation ability of a robot's emotional expressions.
2.3 Response Selection
There are two ways to determine the system response given a dialogue context: the response selection approach [11, 19] and the response generation approach [8, 23]. With the advance of neural network-based response generation methods, many studies have tackled emotional response generation. Ghosh et al. [9] controlled the degree of emotion in utterances by changing the emotional-word ratio. Zhou et al. [30] implemented both internal and external memories to change the emotional expressiveness of responses. However, since dialogue corpora labeled with emotional states for training generation systems are limited, it is not easy to train fluent response generation models conditioned on emotional state labels. Moreover, if we plan to use speech output as the system interface, we must build an emotional speech synthesizer, for which no established methods yet exist [16]. The response selection approach, on the other hand, guarantees the naturalness and fluency of each sentence, although it sometimes causes a coverage problem. With speech output, we can also use high-quality emotional speech with high naturalness and expressiveness, because the emotional speech for the selection samples can be recorded in advance. Thus, we use the response selection approach to build a persuasive dialogue system for investigating the effect of emotional expressions and modalities through persuasive dialogue experiments.
Our response selection architecture is shown in Fig. 1. The system takes the user utterance and the previous system utterance as the dialogue context and converts them into sentence vectors. We used a BERT model trained with a masked word prediction task on Japanese texts extracted from social network services (SNS) and blogs [21], because it is essential to find a selection sample whose dialogue context semantically resembles the target dialogue context. The masked word prediction task trains a model to extract semantically similar sentences based on the distributional hypothesis [12]; since our target task is dialogue, a model trained on SNS and blog text is suitable. We calculate the similarity between the current dialogue context and each context sample stored in the response-selection pool to identify the best sample, using cosine similarity between the vectors produced by BERT. Each response sample has four response variations, one for each emotion we defined, and the system selects one of them based on its emotional state.
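A minimal sketch of this selection step follows. The toy `embed()` (a bag-of-characters vector) is a stand-in assumption for the Japanese BERT sentence encoder; the pool contents are invented examples.

```python
import math

def embed(text):
    # Hypothetical stand-in for BERT: a 26-dim bag-of-characters vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_response(context, emotion, pool):
    """pool: list of (stored_context, {emotion: response}) pairs.
    Pick the stored context most similar to the current one, then
    return its response variant for the chosen emotion."""
    cvec = embed(context)
    best = max(pool, key=lambda item: cosine(cvec, embed(item[0])))
    return best[1][emotion]

pool = [
    ("I am too busy to exercise", {"happy": "Even ten minutes helps!",
                                   "angry": "No excuses, make time!"}),
    ("Jogging is boring",         {"happy": "Let's find a fun route!",
                                   "angry": "Boring beats unhealthy!"}),
]
best_reply = select_response("I'm very busy today", "happy", pool)
```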
2.4 System’s Emotional State
Our system uses four emotional states: neutral, angry, sad, and happy. They were decided based on Russell's circumplex model and an existing work [29], which also used a "content" emotion. However, since the proportion of the "content" label was insufficient (3.81%), we merged this emotional state into "neutral."
2.5 System Emotion Decision
In the proposed architecture, the system has to decide its emotional state (the next system emotion) at each turn. Using several emotional states is a promising way to improve the system's ability to select appropriate emotions, if it works perfectly; however, predicting appropriate system emotions from an emotional dialogue corpus is difficult. Moreover, a system that uses a single emotional state throughout the dialogue may already improve persuasion performance over a neutral system. Thus, we prepared the following six emotion decision models for our experiment.
- Neutral: The system always uses a neutral state (\(=\) without an emotional state).
- Angry: The system always uses an angry state.
- Sad: The system always uses a sad state.
- Happy: The system always uses a happy state.
- Multi-emo (Random): The system randomly selects its emotional state.
- Multi-emo (LR): The system predicts its emotional state with a logistic regression model. The model uses the previous system emotional state and the dialogue history vector from the response selection model (Sect. 2.3) as features and outputs the next system emotional state. The prediction accuracy was 58.8%, which indicates that the prediction is difficult and that the model may cause problems in its emotional transitions.
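The six decision models above can be sketched as follows. The linear weights in the LR variant are illustrative placeholders, not the trained parameters from the paper.

```python
import random

EMOTIONS = ["neutral", "angry", "sad", "happy"]

def fixed_emotion(label):
    """Single-emotion baselines (Neutral/Angry/Sad/Happy): always the
    same state, regardless of context."""
    return lambda prev_emotion, context_vec: label

def multi_emo_random(prev_emotion, context_vec):
    """Multi-emo (Random): pick any of the four states uniformly."""
    return random.choice(EMOTIONS)

def multi_emo_lr(prev_emotion, context_vec, weights, bias):
    """Multi-emo (LR) sketch: a linear model over [one-hot previous
    emotion; dialogue history vector], argmax over class scores."""
    onehot = [1.0 if e == prev_emotion else 0.0 for e in EMOTIONS]
    feats = onehot + list(context_vec)
    scores = [sum(w * f for w, f in zip(row, feats)) + b
              for row, b in zip(weights, bias)]
    return EMOTIONS[scores.index(max(scores))]
```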
2.6 Speech and Gesture Generation
There are several ways to communicate the system’s intent to its users: texts, spoken language, gestures, and facial expressions. Modalities that affect visual and acoustic senses, such as spoken language and gestures, effectively show a system’s emotion [18]. Such non-verbal modalities also affect user impressions of the system [3]. In our system, we use both speech and robot gesture outputs for effective emotional expressions. The system plays emotional speech corresponding to the selected response text and simultaneously shows emotional gestures based on the current system’s emotional state.
3 Speech Corpus for Emotional Dialogue System
We built a dialogue system for persuasion scenarios that can use multimodal emotional expressions. We used the emotional speech corpus collected by Asai et al. [2], which extends an existing dialogue corpus [29]. This corpus was collected with two aims: collecting variations of emotional expressions corresponding to each emotional state in a given context, and collecting their emotional speech. In this section, we describe the details of the corpus extension.
3.1 Response Variation Collection for Each Emotional State
The corpus extends an existing dialogue corpus of persuasive dialogues with emotional language. Since the existing corpus consists of natural persuasion scenarios, the number of emotion labels is biased. The dialogue corpus has variations of dialogue contexts, but the emotion variations in the responses are limited. Because this property complicates the selection of a natural response for a given emotion, the corpus was extended by a paraphrasing approach.
Crowdsourcing was used to collect emotional response variations for the given dialogue contexts. We showed the dialogue context and the current response with its emotional state to crowd-workers and asked them to paraphrase the response under different emotion labels. An example is shown in Table 2. "Dialogue contexts" shows the utterances preceding the target response. "Target response" indicates the target system response to be paraphrased, with its emotion annotation. "Response variations in different emotions" shows the response variations collected in the extension, which have the same meaning as the original target response but different emotional expressions. During crowdsourcing, the following instructions were given to the crowd-workers for making their paraphrases:
1. The response must be appropriate to the given context.
2. The response must expressively show the given emotion.
3. The system's purpose is to persuade the user.
By adding 5,517 paraphrased responses, the 1,839 dialogue patterns in the original corpus were extended to 7,356 responses: four emotion labels for each of the 1,839 dialogue contexts.
3.2 Emotional Speech Recording
Correctly conveying system emotions to users is challenging. Emotional speech was therefore added to the response variations collected in Sect. 3.1 by a hired voice actress. During the recording, the voice actress was shown the response variation with its emotion and its dialogue context (user and system utterances in the previous turn). 4,280 emotional voice samples (1,070 samples for each emotion) were recorded as system responses, selected by K-means clustering. The duration of each emotion is shown in Table 3.
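The K-means sample-selection step can be sketched with a small pure-Python implementation. The 1-D points below stand in for high-dimensional utterance embeddings; a real pipeline would cluster sentence vectors and record the response nearest each centroid so the recorded set covers the space.

```python
import random

def kmeans_representatives(points, k, iters=20, seed=0):
    """Cluster 1-D points with K-means and return the actual sample
    closest to each final centroid (one representative per cluster)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # move each centroid to its cluster mean (keep it if empty)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    # pick the real sample nearest each centroid for recording
    return [min(points, key=lambda p: abs(p - c)) for c in centroids]

points = [0.1, 0.2, 0.15, 5.0, 5.2, 4.9, 10.0, 10.3]
reps = kmeans_representatives(points, k=3)
```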
3.3 Emotional Robot Gesture
Our system also uses robot gestures to express emotions more efficiently. We implemented three different types of gestures for each emotional state, referencing the characteristics of each emotion from an existing study [15]. We show some example gestures in Fig. 2. We designed 0.5-s gestures for "angry," "happy," and "neutral" and 0.75-s gestures for "sad" to express their arousal levels. These gestures are repeated to cover the duration of the emotional speech.
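The repetition rule above amounts to a ceiling division of the speech duration by the per-emotion gesture length. A small sketch, assuming the durations stated in the text:

```python
import math

# 0.5 s gesture units for angry/happy/neutral, 0.75 s for the slower
# "sad" gesture, looped enough times to cover the utterance.
GESTURE_UNIT = {"angry": 0.5, "happy": 0.5, "neutral": 0.5, "sad": 0.75}

def gesture_repeats(emotion, speech_duration):
    """Number of gesture repetitions needed to span the speech."""
    unit = GESTURE_UNIT[emotion]
    return math.ceil(speech_duration / unit)
```

For a 3-second utterance, the happy gesture loops 6 times while the sad gesture loops 4 times, reflecting their different arousal levels.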
3.4 Emotion Expressiveness
Our system requires high emotional expressiveness. Thus, we subjectively investigated the emotional expressiveness of the collected emotional speech corpus and robot gestures. We randomly extracted 100 speech samples for each emotion label and evaluated their emotional expressiveness with three human subjects, who read, listened to, or watched the samples as text, speech, or speech+gesture. They then chose emotion labels from four options: neutral, angry, happy, or sad. We showed Russell's circumplex model and the dialogue histories (previous user and system utterances) during the evaluation. The accuracies for each emotion label under the different conditions (text, speech, and speech+gesture) are shown in Table 4. These results indicate that using additional modalities improved emotion expressiveness: more than 90% of the emotions were recognized correctly when using the speech and gesture modalities.
4 Dialogue Experiment
We conducted dialogue experiments to investigate the effect of emotional expressions from automated dialogue robots and confirmed the effects of multimodality by comparing systems on different modalities. This section shows the experimental setup and results.
4.1 Experimental Setup
Our first experiment compared the effects of emotional expressions from dialogue robots in dialogues. We compared the six system emotion decision models described in Sect. 2.5. If some emotional models improve system performance over the neutral model, then using emotional expressions effectively improves persuasion performance.
Another experiment compared three different models based on different modalities: text, speech, and speech+gesture. We compared these models by setting the system emotion to angry or happy. Gestures were randomly selected from three choices, which were prepared for each emotion label.
We recruited 22 subjects (11 males and 11 females) for the first experiment (emotion effect) and 16 subjects (8 males and 8 females) for the second experiment (modality effect). Each subject held dialogues with the robot under the different conditions, in randomized order. Subjects talked with the robot, which was placed on a table with a display; in the text and speech conditions, we did not place the robot and prepared only the display. Subjects input their utterances as text to prevent input errors caused by speech recognition. We gave them the following instructions to shape the dialogue situation.
Instruction
You are living with a robot that provides daily life support. Since you have lived with this robot for a long time, you trust it. After learning that recently you have not been getting enough exercise, it encourages you to start jogging. You refuse to get any exercise.
A dialogue starts with a system utterance and ends when the user accepts the system's persuasion or after a pre-defined number of turns (20 turns). Participants were told to say "okay" when they agreed to the system's proposal; however, they had to wait at least five turns before they could say "okay." We asked the subjects the following six questions after each dialogue:
- Naturalness: Were the system responses natural?
- Persuasiveness: Was the system persuasive?
- Human-likeness: Was the system humanlike?
- Kindness: Did the system talk kindly to you?
- Expressiveness: Did the system exhibit sufficient emotional expressiveness?
- Considerateness: Did the system consider your situation?
All the scores were given on a five-level Likert scale, where 5 is the highest and 1 is the lowest. During the dialogue turns, participants also annotated their degree of acceptance of the system's persuasion on five levels (1: I will definitely decline the offer, 2: I will probably decline the offer, 3: Undecided, 4: I will probably accept the offer, 5: I will definitely accept the offer). We also collected free answers after the dialogue evaluations.
4.2 Experimental Results on Emotion Effects
Table 5 shows the results of the first experiment on the effect of emotional expressions. We conducted a Wilcoxon signed-rank test comparing each system with the "neutral" system to investigate the effects of each emotion (*: p<0.05, **: p<0.01). The happy emotion had the highest score for every question except expressiveness, and the happy system scored significantly higher than neutral on naturalness, human-likeness, kindness, expressiveness, and considerateness. We found no significant difference in persuasiveness, although its score was higher than the neutral system's. The other emotions also scored higher than the neutral system, except for persuasiveness. In the free answers, some subjects said they enjoyed the dialogue with the "happy" system and described it as fun, and some found it difficult to decline the offer of the "sad" system. The "angry" system effectively achieved higher considerateness; however, "happy" outperformed "angry" on most metrics. We found no significant differences between the multi-emo systems (Random and LR) and the neutral system except for emotion expressiveness, indicating that a natural emotion transition model is needed to change the system emotion during dialogues. Some subjects pointed out in the free answers that the emotional changes were very extreme and the systems seemed emotionally labile.
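The significance tests here operate on paired Likert scores: each subject rates both the emotional and the neutral system. A minimal sketch of the Wilcoxon signed-rank statistic W follows; the score lists are illustrative, not the paper's data, and in practice one would call scipy.stats.wilcoxon to obtain the p-value as well.

```python
def wilcoxon_w(xs, ys):
    """W = min(sum of positive ranks, sum of negative ranks) over the
    ranked absolute paired differences; zero differences are dropped."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    # rank |d|, averaging ranks over ties (1-based ranks)
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)

happy   = [5, 4, 4, 5, 3, 4]   # illustrative Likert scores, not real data
neutral = [3, 3, 4, 4, 2, 3]
w = wilcoxon_w(happy, neutral)  # all non-zero diffs positive, so W = 0
```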
The proportions of user acceptance scores for the models are shown in Fig. 3. The “happy” emotion is efficient in all cases because it has the highest proportion of acceptance (4 and 5) and the lowest proportion of decline (1 and 2). “Angry” and “sad” had higher acceptances than “neutral”; however, their numbers of declines also exceeded “neutral”. These negative emotions can be used if the system can learn the appropriate timing for using them.
4.3 Experimental Results on Modality Effects
In the next experiment, we compared three systems that used different modalities (text, speech, and speech+gesture) with the happy and angry emotions, which achieved high scores in Sect. 4.2. Table 6 shows the scores for the questions under each condition. We conducted a Wilcoxon signed-rank test comparing each system with the text system (*: p<0.05, **: p<0.01).
Using speech or gesture modalities achieved higher scores than only using the system’s verbal presentation for all the questions. The speech systems achieved the highest persuasiveness. The multi-modal system (speech+gesture) achieved higher scores on naturalness, human-likeness, kindness, expressiveness, and considerateness. These results indicate that we improved the convincing ability of the persuasive systems by adding expression modalities.
The proportions of the user acceptance scores for all the settings are shown in Figs. 4 and 5. The acceptance proportions (4 and 5) were improved by adding modalities to both the angry and happy emotions. We improved the system’s persuasive ability by adding system modalities.
4.4 Dialogue Example
A dialogue example from our experiments using the angry emotion is shown in Table 7. S indicates system utterances and U user utterances, with their dialogue turns; the user acceptance scores are also shown. In this experiment, the system used both speech and gestures. The system consistently made positive utterances, and the user acceptance scores increased.
5 Conclusion
We built a dialogue robot that can make multimodal emotional expressions. Based on the scenario of existing studies of persuasive dialogue with emotional expressions, we collected multiple responses in different emotions and built a response selection-based dialogue robot with emotional speech and gestures. We focused on the automated system's capability to use multimodal emotional expressions from two viewpoints: the effect of using emotional expressions and the use of several modalities to express emotions. Experimental results showed that a persuasive dialogue robot with the "happy" emotion provided significantly better persuasion ability, and that emotions such as "angry" or "sad" also have the potential to improve persuasive dialogue systems. We also showed that using several modalities improves the system's expressiveness and persuasive ability. Another finding was that unnatural emotion transitions decrease system performance.
Our future work will implement more natural gestures, including lip-syncing and actions corresponding to the selected responses. Automatic generation of empathic robot gestures is required to apply the system to a variety of dialogue domains [25]. Optimizing the system emotion decision toward the dialogue purpose (e.g., persuasion) is another direction of our research; we can use reinforcement learning to improve the success rate of persuasion, as in existing goal-oriented dialogue systems. Finally, our experiment evaluated persuasiveness only subjectively; we should also measure the system's effect by persuasion success.
References
Adler RF, Iacobelli F, Gutstein Y (2016) Are you convinced? A wizard of OZ study to test emotional vs. rational persuasion strategies in dialogues. Comput Hum Behav 57:75–81
Asai S, Yoshino K, Shinagawa S, Sakti S, Nakamura S (2020) Emotional speech corpus for persuasive dialogue system. In: Proceedings of The 12th language resources and evaluation conference, pp 491–497. Marseille, France
Becker C, Kopp S, Wachsmuth I (2004) Simulating the emotion dynamics of a multimodal conversational agent. In: Tutorial and research workshop on affective dialogue systems. Springer, pp 154–165
Colombo P, Witon W, Modi A, Kennedy J, Kapadia M (2019) Affect-driven dialog generation. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, pp 3734–3743
Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp 4171–4186
Fogg B (1997) Captology: the study of computers as persuasive technologies. In: Proceedings CHI extended abstracts on HFCS, CHI EA ’97, p 129
Forgas JP (1998) On feeling good and getting your way: mood effects on negotiator cognition and bargaining strategies
Galley M, Brockett C, Gao X, Dolan B, Gao J (2019) End-to-end conversation modeling: moving beyond chitchat. In: AAAI the seventh dialogue system technology challenge
Ghosh S, Chollet M, Laksana E, Morency LP, Scherer S (2017) Affect-lm: a neural language model for customizable affective text generation. In: ACL
Goswamy T, Singh I, Barkati A, Modi A (2020) Adapting a language model for controlled affective text generation. In: Proceedings of the 28th international conference on computational linguistics, pp 2787–2801
Gunasekara C, Kummerfeld JK, Polymenakos L, Lasecki W (2019) Dstc7 task 1: noetic end-to-end response selection. In: Proceedings the first workshop on NLP for conversational AI, pp 60–67
Harris ZS (1954) Distributional structure. Word 10(2–3):146–162
Heath R, Brandt D, Nairn A (2006) Brand relationships: strengthened by emotion, weakened by attention. J Advert Res 46(4):410–419
Hiraoka T, Neubig G, Sakti S, Toda T, Nakamura S (2016) Learning cooperative persuasive dialogue policies using framing. Speech Commun 84:83–96
Lhommet M, Marsella S (2014) Expressing emotion through posture. The Oxford handbook of affective computing, pp 273–285
Lorenzo-Trueba J, Barra-Chicote R, San-Segundo R, Ferreiros J, Yamagishi J, Montero J (2015) Emotion transplantation through adaptation in hmm-based speech synthesis. Comput Speech Lang 34(1):292–307. https://doi.org/10.1016/j.csl.2015.03.008
Lubis N, Sakti S, Yoshino K, Nakamura S (2018) Eliciting positive emotion through affect-sensitive dialogue response generation: a neural network approach. In: Proceedings of the AAAI conference on artificial intelligence, vol 32
Mehrabian A, Russell JA (1974) The basic emotional impact of environments. Percept Motor Skills 38(1):283–301
Mizukami M, Kizuki H, Nomura T, Neubig G, Yoshino K, Sakti S, Toda T, Nakamura S (2015) Adaptive selection from multiple response candidates in example-based dialogue. In: 2015 IEEE workshop on automatic speech recognition and understanding, pp 784–790
Morris M, Keltner D (2000) How emotions work: the social functions of emotional expression in negotiations. Res Organ Behav 22:1–50
Sakaki T, Mizuki S, Gunji N (2019) Bert pre-trained model trained on large-scale Japanese social media corpus. Hottolink
Santhanam S, Shaikh S (2019) Emotional neural language generation grounded in situational contexts. In: Proceedings the 4th workshop on computational creativity in language generation, pp 22–27
Serban IV, Sordoni A, Lowe R, Charlin L, Pineau J, Courville A, Bengio Y (2017) A hierarchical latent variable encoder-decoder model for generating dialogues. In: Thirty-First AAAI conference on artificial intelligence
Sinaceur M, Tiedens LZ (2006) Get mad and get more than even: when and why anger expression is effective in negotiations
Tuyen NTV, Jeong S, Chong NY (2018) Emotional bodily expressions for culturally competent robots through long term human-robot interaction. In: 2018 IEEE/RSJ international conference on intelligent robots and systems, pp 2008–2013
Wang X, Shi W, Kim R, Oh Y, Yang S, Zhang J, Yu Z (2019) Persuasion for good: towards a personalized persuasive dialogue system for social good. In: Proceedings of ACL
Watanabe M, Ogawa K, Ishiguro H (2018) At the department store—can androids be a social entity in the real world? In: Geminoid studies, pp 423–427
Wilson E (2003) Perceived effectiveness of interpersonal persuasion strategies in computer-mediated communication. Comput Hum Behav 19(5):537–552
Yoshino K, Ishikawa Y, Mizukami M, Suzuki Y, Sakti S, Nakamura S (2018) Dialogue scenario collection of persuasive dialogue with emotional expressions via crowdsourcing. In: Proceedings of the 11th language resources and evaluation conference
Zhou H, Huang M, Zhang T, Zhu X, Liu B (2017) Emotional chatting machine: emotional conversation generation with internal and external memory. In: AAAI
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Asai, S., Yoshino, K., Shinagawa, S., Sakti, S., Nakamura, S. (2022). Eliciting Cooperative Persuasive Dialogue by Multimodal Emotional Robot. In: Stoyanchev, S., Ultes, S., Li, H. (eds) Conversational AI for Natural Human-Centric Interaction. Lecture Notes in Electrical Engineering, vol 943. Springer, Singapore. https://doi.org/10.1007/978-981-19-5538-9_10