Abstract
We present a dialogue system for a conversational robot, Erica. Our goal is for Erica to engage in more human-like conversation, rather than being a simple question-answering robot. Our dialogue manager integrates question-answering with a statement response component which generates dialogue by asking about focused words detected in the user’s utterance, and a proactive initiator which generates dialogue based on events detected by Erica. We evaluate the statement response component and find that it produces coherent responses to a majority of user utterances taken from a human-machine dialogue corpus. An initial study with real users also shows that it reduces the number of fallback utterances by half. Our system is beneficial for producing mixed-initiative conversation.
1 Introduction
Androids are a form of humanoid robots which are intended to look, move and perceive the world like human beings. Human-machine interaction supported with androids has been studied for some years, with many works gauging user acceptance of tele-operated androids in public places such as outdoor festivals and shopping malls [3, 17]. Relatively few have the ability to hold a multimodal conversation autonomously, one of the exceptions being the android Nadine [26].
In this paper, we introduce a dialogue management system for Erica (ERato Intelligent Conversational Android). Erica is a Japanese-speaking android which converses with one or more human users. She is able to perceive the environment and users through microphones, video cameras, depth and motion sensors. The design objective is for Erica to maintain an autonomous prolonged conversation on a variety of topics in a human-like manner. An image of Erica is shown in Fig. 1.
Erica’s realistic physical appearance implies that her spoken dialogue system must be able to hold a conversation in a similarly human-like manner, displaying conversational behaviors such as backchannels, turn-taking and fillers. We want Erica to create mixed-initiative dialogues through robust answer retrieval, dynamic statement-response generation and proactive initiation of conversation. This distinguishes Erica as a conversational partner, as opposed to smartphone-embedded voice assistants or text-based chatbot applications. Erica must handle many types of dialogue so that she can take on a wide range of conversational roles.
Chatting systems, often called chatbots, conduct a conversation with their users. They may be based on rules [4, 8] or machine-learned dialogue models [21, 25]. Conducting a conversation has a wide meaning for a dialogue system: wide variation exists in the modalities employed, the knowledge sources available, the embodiment and the physical human-likeness. Erica is fully embodied, ultra-realistic and can express emotions. Our aim is not for the dialogue system to have access to a vast amount of knowledge, but to talk and answer questions about more personal topics. She should also demonstrate attentive listening, showing sustained interest in the discourse and attempting to increase user engagement.
Several virtual agent systems are tuned towards question-answering dialogues using information retrieval techniques [15, 23]. However, these techniques are not flexible enough to accept a wide variety of user utterances beyond well-defined queries, and they resort to a default failing answer when unable to provide a confident one. Moreover, most information retrieval engines assume the input to be text or a near-perfect speech transcription. An additional drawback of such omniscient systems is the latency they introduce into the interaction while searching for a response. Our goal is to avoid this latency by providing appropriate feedback even when the system is uncertain about the user’s utterance.
Towards this goal we introduce an attentive listener which convinces the interlocutor of Erica’s interest in the dialogue so that they continue the interaction. To do this, a system typically produces backchannels and other feedback based on its limited understanding of the discourse. Since a deep understanding of the dialogue context or of the semantics of the user utterance is unnecessary, some automatic attentive listeners have been developed as open-domain systems [14]. Others use predefined scripts or sub-dialogues that are pooled together to iteratively build the ongoing interaction [1]. One advantage is that attentive listeners do not need to completely understand the user’s utterance to provide a suitable response. We present a novel approach based on capturing the focus word in the input utterance, which is then used in an n-gram-based sentence construction.
Virtual agents combining question-answering abilities with attentive listening are rare. SimSensei Kiosk [22] is an example of a sophisticated agent which integrates backchannels into a virtual interviewer. The virtual character is displayed on a screen and thus does not situate the interaction in the real world. Erica iteratively builds a short-term interaction path in order to demonstrate a mixed-initiative multimodal conversation. Her aim is to keep the user engaged in the dialogue by answering questions and showing her interest. We use a hierarchical approach to control the system’s utterance generation: the top-level controller queries the components and decides which of them (question-answering, statement response, backchannel or proactive initiator) shall take the lead in the interaction.
This work presents the integration of these conversation-based components as the foundation of the dialogue management system for Erica. We introduce the general architecture in the next section. Within Sect. 3, we describe the individual components of the system and then conduct a preliminary evaluation in Sect. 4. Note that Erica speaks Japanese, so translations are given in English where necessary.
2 Architecture
Erica’s dialogue system combines various individual components. A top-level controller selects the appropriate component to use depending on the state of the dialogue. We cluster dialogue segments into four main classes as shown in Table 1 (examples are translated from Japanese). The controller estimates the current dialogue segment based on the short-term history of the conversation and then triggers the appropriate module to generate a response.
Figure 2 shows the general architecture of Erica’s dialogue system, focusing on the speech and event-based dialogue management which is the topic of this paper.
The system uses a Kinect sensor to reliably identify a speaker in the environment. Users’ utterances are transcribed by the speech recognizer and aligned with a tracked user. An event detector continuously monitors the spatial and acoustic environment to extract selected events such as periods of silence and user locations. Each interaction step is triggered by the reception of an input, either a transcribed utterance or a detected event, and ends with a system action.
First, the utterance is sent to the question-answering and statement response components which generate an associated confidence score. This score is based on factors such as the hypothesized dialogue act of the user utterance and the presence of keywords and focus phrases. The controller then selects the component’s response with the highest confidence score. However if this score does not meet the minimum threshold, the dialogue manager produces a backchannel fallback.
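The controller’s selection step can be sketched as follows. This is an illustrative sketch only: the component names, the fallback utterance and the threshold value are assumptions, not details taken from the implementation.

```python
from typing import Callable, Dict, Tuple

FALLBACK_THRESHOLD = 0.4  # assumed value; the chapter does not state it


def select_response(utterance: str,
                    components: Dict[str, Callable[[str], Tuple[str, float]]]) -> str:
    """Query each component for a (response, confidence) pair, keep the
    highest-confidence response, and fall back to a backchannel when no
    score reaches the threshold."""
    best_response, best_score = "", -1.0
    for _name, generate in components.items():
        response, score = generate(utterance)
        if score > best_score:
            best_response, best_score = response, score
    if best_score < FALLBACK_THRESHOLD:
        return "e?"  # implicit non-understanding prompt (backchannel fallback)
    return best_response
```

Because every component is queried on every turn, a component can be added or removed without changing the controller itself; only the confidence scores decide who speaks.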
Both the question-answering and statement response components use dialogue act tagging to generate their confidence scores. We use a dialogue act tagger based on support vector machines to classify an utterance into a question or non-question. Focus word detection is used by the statement response system and is described in more detail in Sect. 3.2.
Events such as silences and users entering the proximity of Erica are detected and handled by the proactive initiator. Erica instigates dialogue which is not in response to any speech input from the user, but events in the environment such as a user entering her social space. This dialogue is generated based on rules and is described in more detail in Sect. 3.3.
3 Components
In this section, we describe individual components of the system and some example responses they generate.
3.1 Question Answering with Fallback
Task-oriented spoken dialogue systems handle uncertain inputs with explicit or implicit confirmations [9]. There is a trade-off between the consequences of processing an erroneous utterance and the expected fluency of the system [24]. Question-answering engines such as smartphone assistants make no confirmations and let users decide whether they accept the returned results. As a conversational partner, Erica cannot use such explicit strategies as they interrupt the flow of the dialogue. We can consider chatting with uncertainty to be similar to conversing with non-native speakers, with misunderstandings being communicated and repaired jointly.
Erica’s question-answering component enables her to implicitly handle errors and uncertainty. Since the system’s goal is to generate conversational dialogue, an exact deep understanding of the user utterance is not necessary. Erica can produce implicit non-understanding prompts such as “e?” (“huh?” in English), backchannels and nodding. These signals are used when the system is unable to generate an answer with sufficiently high confidence.
The following conversation shows an example of an interaction segment between a user and Erica in which her responses are managed with only the question-answering and backchannel components (Table 2).
The question-answering manager bases its knowledge on a handcrafted database of adjacency pairs. The following measure is used to compare a set of n ranked speech recognition hypotheses \(\{(u_1, cm_1)...(u_n, cm_n)\}\) and all first-pair parts \(fpp_{db}\) in the database:
\(ld(fpp_{db}, u_i)\) is the normalized Levenshtein distance between a database entry \(fpp_{db}\) and the hypothesis utterance \(u_i\). \(cm_i\) is the confidence measure of the speech recognizer, mapped to the interval \(\left[ 0;1\right] \) by a sigmoid function. \(\alpha \) and \(\beta \) are weights given to the language understanding and speech recognition parts, and \(\gamma \) is a bias that determines the system’s overall degree of acceptance. This approach is not highly sophisticated, but retrieval is not the main focus of this work; we found it sufficient for most user questions within the scope of the database of topics.
The algorithm searches for the first-pair part most similar to the incoming input: the entry with the lowest computed measure is selected and its associated system response is generated. If even this best measure exceeds the acceptance threshold, the system resorts to a fallback response.
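The measure can be sketched as follows. Since the exact formula is not reproduced in this text, the combination below is a hedged reading of the description: a weighted sum of the normalized Levenshtein distance and the complement of the mapped ASR confidence, minus the bias, with lower values indicating better matches. The default weights are placeholders.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def match_measure(fpp_db, hypotheses, alpha=1.0, beta=1.0, gamma=0.0):
    """Score a database first-pair part against n ASR hypotheses given as
    (utterance, raw_confidence) pairs; return the best (lowest) measure."""
    best = float("inf")
    for utterance, raw_cm in hypotheses:
        ld = levenshtein(fpp_db, utterance) / max(len(fpp_db), len(utterance), 1)
        cm = 1.0 / (1.0 + math.exp(-raw_cm))  # sigmoid map to [0, 1]
        best = min(best, alpha * ld + beta * (1.0 - cm) - gamma)
    return best
```

A perfect transcription of a known first-pair part scores close to zero, while an unrelated utterance scores close to \(\alpha + \beta\), so a single threshold separates the two regimes.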
3.2 Statement Response
In addition to answering questions from a user, Erica can also generate focus-based responses to statements. Statements are defined as utterances that do not explicitly request the system to provide information and are not responses to questions. For instance, “Today I will go shopping with my friends” or “I am happy about your wedding” are statements. Chatting is largely based on such exchanges of information, with varying degrees of intimacy depending on speaker familiarity.
Higashinaka et al. [10] proposed a method to automatically generate and rank responses to why-questions asked by users. Previous work also offered a similar learning method to help disambiguate the natural language understanding process using the larger dialogue context [11, 13] and to map from semantic concept to turn realization [12].
Our approach is based on knowledge of common statement responses in Japanese conversation [7]. These include some repetition of the previous speaker’s utterance, but do not require full understanding of it. As an example, consider the user utterance “Yesterday, I ate a pizza”. Erica’s objective is to engage the user, so she may elaborate with a question (“What kind of pizza?”) or partially repeat the utterance with a rising tone (“A pizza?”). The key is the knowledge that “pizza” is the most relevant word in the previous utterance. This has also been used in previous robot dialogue systems [19].
We define four cases for replying to a statement as shown in Fig. 3, with examples shown in Table 3. Focus phrases and predicates are underlined and question words are in boxes. Similar to question-answering, a fallback utterance is used when no suitable response can be found, indicated in the table as a formulaic expression.
To implement our algorithm, we first search for a focus word or phrase in the transcribed user utterance. This process uses a conditional random field over a phrase-level dependency tree of the sentence, aligned with part-of-speech tags, as input [25]. The algorithm labels each phrase with its likelihood of being the sentence focus. The most likely phrase is taken as the focus if its probability exceeds 0.5. The resulting focus phrase is stripped so that only nouns are kept (see Note 1). For example, the utterance “The video game that has been released is cool” would yield ‘video game’ as the focus noun.
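The selection and stripping step can be sketched as follows. The CRF itself is not reproduced; its per-phrase focus probabilities are assumed as input, and part-of-speech tags are assumed to use a Penn-style ‘N’ prefix for nouns.

```python
from typing import List, Optional, Tuple

FOCUS_THRESHOLD = 0.5  # from the text: a focus is accepted only above 0.5


def extract_focus_noun(
    phrases: List[Tuple[str, float, List[Tuple[str, str]]]],
) -> Optional[str]:
    """phrases: (surface, focus_probability, [(token, pos_tag), ...]).
    Return the nouns of the most likely focus phrase, or None when no
    phrase clears the threshold or the phrase contains no noun."""
    surface, prob, tokens = max(phrases, key=lambda p: p[1])
    if prob <= FOCUS_THRESHOLD:
        return None
    nouns = [tok for tok, pos in tokens if pos.startswith("N")]
    return " ".join(nouns) if nouns else None
```

Returning `None` is what routes the utterance to the predicate-based patterns described below.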
We then decide which question marker to use as a response, depending on whether a focus word can be found in the utterance. These markers transform an affirmative sentence into a question. Table 4 displays some examples of question words with and without a focus. We then compute the likelihood of the focus noun combined with each question word using an n-gram language model. N-gram probabilities are computed from the Balanced Corpus of Contemporary Written Japanese (see Note 2). The corpus consists of 100 million words from books, magazines, newspapers and other texts. The models are filtered so that they only contain n-grams which include the question words defined above. The maximum probability over focus noun and question word combinations is denoted \(P_{max}\). When no focus can be extracted with high enough confidence, we instead use a pattern based on the predicate: we compute sequences made of the utterance’s main predicate and a set of complements containing question words, and the best n-gram likelihood is likewise taken as \(P_{max}\).
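The ranking step can be sketched with toy bigram counts. The question-word set and the counts below are invented for illustration; the real models are built from the BCCWJ corpus.

```python
from typing import Dict, Tuple

# Assumed illustrative question words: "what kind of", "which", "when".
QUESTION_WORDS = ["donna", "dono", "itsu"]


def best_question_word(
    focus_noun: str,
    bigram_counts: Dict[Tuple[str, str], int],
) -> Tuple[str, float]:
    """Return the question word maximizing the relative frequency of the
    (question_word, focus_noun) bigram, together with that P_max value."""
    total = sum(bigram_counts.values()) or 1
    scored = [(qw, bigram_counts.get((qw, focus_noun), 0) / total)
              for qw in QUESTION_WORDS]
    return max(scored, key=lambda s: s[1])
```

The returned probability plays the role of \(P_{max}\) and is compared against the thresholds in the decision tree.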
The second stage of the tree in Fig. 3 selects one of the four cases shown in Table 3, each of which defines a different pattern for constructing the response. The thresholds \(T_f\) and \(T_p\) have been tuned empirically. Table 5 displays the conditions for generating each response pattern (see Note 3).
3.3 Proactive Initiator
As shown in Table 1, the proactive initiator takes part in several scenarios. Typical spoken dialogue systems are built to react to the user’s speech, while a situated system such as Erica continuously monitors its environment for cues about the user’s intent. This kind of interactive setup has been the focus of recent studies [5, 6, 16, 18, 20]. Erica uses an event detector to track the environment and generate discrete events. For example, we define three circular zones around Erica: her personal space (0–0.8 m), social space (0.8–2.5 m) and far space (2.5–10 m). The system triggers events whenever a previously empty zone becomes occupied or an occupied zone becomes empty. We also detect prolonged silences of fixed lengths.
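The zone-based event detection can be sketched as follows. The zone radii are those given in the text; the event names and the snapshot-comparison scheme are assumptions for illustration.

```python
from typing import List

ZONES = [("personal", 0.8), ("social", 2.5), ("far", 10.0)]  # outer radii in metres


def zone_of(distance: float) -> str:
    """Map a tracked user's distance to Erica onto a named zone."""
    for name, outer in ZONES:
        if distance <= outer:
            return name
    return "outside"


def zone_events(prev_distances: List[float], new_distances: List[float]) -> List[str]:
    """Compare two snapshots of tracked-user distances and emit
    'entered <zone>' / 'left <zone>' events for zones whose occupancy flipped."""
    events = []
    for name, _ in ZONES:
        before = any(zone_of(d) == name for d in prev_distances)
        after = any(zone_of(d) == name for d in new_distances)
        if after and not before:
            events.append(f"entered {name}")
        elif before and not after:
            events.append(f"left {name}")
    return events
```

Each emitted event is then fed to the proactive initiator exactly like a transcribed utterance would be fed to the other components.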
Currently, we use the proactive initiator for three scenarios:
1. If a silence longer than two seconds is detected during a question-answering dialogue, Erica asks a follow-up question related to the most recent topic.

2. If a silence longer than five seconds is detected, Erica starts a ‘topic introduction’ dialogue: she draws a random topic from the pool of available ones, using a weighted distribution inversely proportional to the distance from the current topic in a word-embedding space.

3. When users enter or leave her social space, Erica greets them or takes her leave.
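The weighted topic draw in the second scenario can be sketched as follows. The toy 2-D embeddings are invented, and the weighting is our reading of "inversely proportional to the distance"; the actual embedding space and weighting are not specified here.

```python
import math
import random


def draw_topic(current, topics, embeddings, rng=random):
    """Sample a topic with weight inversely proportional to its Euclidean
    distance from the current topic in the embedding space.
    topics: candidate topic names; embeddings: name -> vector."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    weights = [1.0 / (dist(embeddings[current], embeddings[t]) + 1e-6)
               for t in topics]
    return rng.choices(topics, weights=weights, k=1)[0]
```

Nearby topics are therefore drawn far more often, which keeps the introduced topic loosely related to the ongoing conversation while still allowing occasional jumps.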
4 Evaluation and Discussion
The goal of our evaluation is to test whether our system can avoid generic fallback utterances under uncertainty while still providing suitable answers. We first evaluate the statement response component independently, and then evaluate whether it reduces the number of fallback utterances. As we have no existing baseline, our methodology is to conduct an initial user study using only the question-answering system; we then collect the users’ utterances and feed them into our updated system for comparison.
We evaluated the statement response component on dialogue extracted from a chatting corpus created for Project Next’s NLP task (see Note 4). This corpus is a collection of 1046 transcribed and annotated dialogues between a human and an automatic system. Three annotators judged the quality of each answer given by the automatic system as coherent, undecided or incoherent. We extracted 200 user statements from the corpus for which the response from the automated system had been judged coherent.
All 200 statements were input into the statement response system and two annotators judged if the response was categorized correctly according to the decision tree in Fig. 3. Precision and recall results are displayed in Table 6.
Our results showed that the decision tree selected the appropriate category in the majority of cases. The gap between the high performance of the formulaic expression and the lower performance of the question on predicate suggests that the decision threshold used when no focus word is found could be fine-tuned to improve overall performance.
We then tested whether the integration of statement response into Erica’s dialogue system reduced the number of fallback utterances. The initial user study consisted of 22 participants who were asked to interact with Erica by asking her questions from a list of 30 topics, such as Erica’s hobbies and favorite animals. They could speak freely, and the system would either answer their question or provide a fallback utterance such as “Huh?” or “I cannot answer that”.
Users interacted with Erica for 361 seconds on average (\(sd=\) 131 seconds), with interactions lasting 21.6 turns on average (\(sd=\) 7.8 turns). Of 226 user turns, 187 were answered correctly by Erica and 39 received fallback utterances. Users also subjectively rated their interaction using a modified Godspeed questionnaire [2], which measures Erica’s perceived intelligence, animacy and likeability as sums of items rated on 5-point Likert scales. Participants rated Erica’s intelligence on average as 16.8 (maximum of 25), animacy as 8.8 (maximum of 15), and likeability as 18.4 (maximum of 25).
We then fed all utterances into our updated system, which includes the statement response component, so that user utterances which had produced a fallback response were now handled by the statement response system. Out of the 39 such utterances, 19 were handled by the statement response system, while 20 could not be handled and again received a fallback response. This shows that around half of these utterances could be handled with the addition of the statement response component, as shown in Fig. 4. The dialogues produced by the statement response system were generally coherent, with the correct focus word found.
5 Conclusion
Our dialogue system for Erica combines different approaches to build and maintain a conversation. The knowledge and models used to cover a wide range of topics and roles are designed to improve the system’s flexibility. We plan on improving the components using data collected through Wizard-of-Oz experiments.
While the question-answering system is simplistic, it can yield control to other components when uncertainty arises. The statement response mechanism helps to continue the conversation and reinforces the impression that Erica is attentive to her conversational partner. In the future we also aim to evaluate Erica’s proactive behavior and to handle errors in speech recognition.
Our experiment demonstrated that a two-layered decision approach can handle interaction according to simple top-level rules. We obtained promising results with our statement response system and intend to improve it in future prototypes. The main challenge in this architecture is determining which component should handle the conversation; other ongoing research focuses on learning this component selection from data, which will be addressed in future work.
Notes

1. In Japanese, there are no articles such as ‘a’ or ‘the’.

2.

3. “So desu ka”: “I see”; “Tashikani”: “Sure”.

4.
References
Banchs R, Li H (2012) IRIS: a chat-oriented dialogue system based on the vector space model. In: Annual meeting of the association for computational linguistics, July, pp 37–42
Bartneck C, Kulić D, Croft E, Zoghbi S (2009) Measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots. Int J Soc Robot 1(1):71–81
Becker-Asano C, Ogawa K, Nishio S, Ishiguro H (2010) Exploring the uncanny valley with Geminoid HI-1 in a real-world application. In: Proceedings of IADIS international conference interfaces and human computer interaction, pp 121–128
Bevacqua E, Cowie R, Eyben F, Gunes H, Heylen D, Maat M, Mckeown G, Pammi S, Pantic M, Pelachaud C, De Sevin E, Valstar M, Wollmer M, Shroder M, Schuller B (2012) Building autonomous sensitive artificial listeners. IEEE Trans Affect Comput 3(2):165–183
Bohus D, Horvitz E (2014) Managing human-robot engagement with forecasts and... um... hesitations. In: International conference on multimodal interaction, pp 2–9
Bohus D, Kamar E, Horvitz E (2012) Towards situated collaboration. In: NAACL-HLT workshop on future directions and needs in the spoken dialog community: tools and data, pp 13–14
Den Y, Yoshida N, Takanashi K, Koiso H (2011) Annotation of Japanese response tokens and preliminary analysis on their distribution in three-party conversations. In: 2011 international conference on speech database and assessments (COCOSDA). IEEE, pp 168–173
DeVault D, Artstein R, Benn G, Dey T, Fast E, Gainer A, Georgila K, Gratch J, Hartholt A, Lhommet M, Lucas G, Marsella S, Morbini F, Nazarian A, Scherer S, Stratou G, Suri A, Traum D, Wood R, Xu Y, Rizzo A, Morency Lp (2014) SimSensei kiosk: a virtual human interviewer for healthcare decision support. In: International conference on autonomous agents and multi-agent systems, vol 1, pp 1061–1068
Ha EY, Mitchell CM, Boyer KE, Lester JC (2013) Learning dialogue management models for task-oriented dialogue with parallel dialogue and task streams. In: SIGdial meeting on discourse and dialogue, August, pp 204–213
Higashinaka R, Isozaki H (2008) Corpus-based question answering for why-questions. In: International joint conference on natural language processing, pp 418–425
Higashinaka R, Nakano M, Aikawa K (2003) Corpus-based discourse understanding in spoken dialogue systems. In: Annual meeting on association for computational linguistics, vol 1, pp 240–247. https://doi.org/10.3115/1075096.1075127
Higashinaka R, Prasad R, Walker MA (2006a) Learning to generate naturalistic utterances using reviews in spoken dialogue systems. In: International conference on computational linguistics, July, pp 265–272. https://doi.org/10.3115/1220175.1220209
Higashinaka R, Sudoh K, Nakano M (2006b) Incorporating discourse features into confidence scoring of intention recognition results in spoken dialogue systems. Speech Commun 48(3–4):417–436. https://doi.org/10.1016/j.specom.2005.06.011
Higashinaka R, Imamura K, Meguro T, Miyazaki C, Kobayashi N, Sugiyama H, Hirano T, Makino T, Matsuo Y (2014) Towards an open-domain conversational system fully based on natural language processing. In: International conference on computational linguistics, pp 928–939. http://www.aclweb.org/anthology/C14-1088
Leuski A, Traum D (2011) NPCEditor: creating virtual human dialogue using information retrieval techniques. AI Mag 32(2):42–56
Misu T, Raux A, Lane I, Devassy J, Gupta R (2013) Situated multi-modal dialog system in vehicles. In: Proceedings of the 6th workshop on eye gaze in intelligent human machine interaction: gaze in multimodal interaction, pp 7–9
Ogawa K, Nishio S, Koda K, Balistreri G, Watanabe T, Ishiguro H (2011) Exploring the natural reaction of young and aged person with telenoid in a real world. JACIII 15(5):592–597
Pejsa T, Bohus D, Cohen MF, Saw CW, Mahoney J, Horvitz E (2014) Natural communication about uncertainties in situated interaction. In: International conference on multimodal interaction, pp 283–290. https://doi.org/10.1145/2663204.2663249
Shitaoka K, Tokuhisa R, Yoshimura T, Hoshino H, Watanabe N (2010) Active listening system for dialogue robot. In: JSAI SIG-SLUD Technical Report, vol 58, pp 61–66 (in Japanese)
Skantze G, Hjalmarsson A, Oertel C (2014) Turn-taking, feedback and joint attention in situated human-robot interaction. Speech Commun 65:50–66
Su PH, Gašić M, Mrksic N, Rojas-Barahona L, Ultes S, Vandyke D, Wen TH, Young S (2016) On-line reward learning for policy optimisation in spoken dialogue systems. In: ACL
Traum D, Swartout W, Gratch J, Marsella S (2008) A virtual human dialogue model for non-team interaction. In: Recent trends in discourse and dialogue. Springer, pp 45–67
Traum D, Aggarwal P, Artstein R, Foutz S, Gerten J, Katsamanis A, Leuski A, Noren D, Swartout W (2012) Ada and grace: direct interaction with museum visitors. In: Intelligent virtual agents. Springer, pp 245–251
Varges S, Quarteroni S, Riccardi G, Ivanov AV (2010) Investigating clarification strategies in a hybrid POMDP dialog manager. In: Proceedings of the 11th annual meeting of the special interest group on discourse and dialogue, pp 213–216
Yoshino K, Kawahara T (2015) Conversational system for information navigation based on pomdp with user focus tracking. Comput Speech Lang 34(1):275–291
Yumak Z, Ren J, Thalmann NM, Yuan J (2014) Modelling multi-party interactions among virtual characters, robots, and humans. Presence Teleoperators Virtual Environ 23(2):172–190
Acknowledgements
This work was supported by JST ERATO Ishiguro Symbiotic Human-Robot Interaction program (Grant Number JPMJER1401), Japan.
© 2019 Springer International Publishing AG, part of Springer Nature
Milhorat, P. et al. (2019). A Conversational Dialogue Manager for the Humanoid Robot ERICA. In: Eskenazi, M., Devillers, L., Mariani, J. (eds) Advanced Social Interaction with Agents. Lecture Notes in Electrical Engineering, vol 510. Springer, Cham. https://doi.org/10.1007/978-3-319-92108-2_14