1 Introduction

Androids are a form of humanoid robot intended to look, move and perceive the world like human beings. Human-machine interaction supported with androids has been studied for some years, with many works gauging user acceptance of tele-operated androids in public places such as outdoor festivals and shopping malls [3, 17]. Relatively few androids can hold a multimodal conversation autonomously, one of the exceptions being the android Nadine [26].

In this paper, we introduce a dialogue management system for Erica (ERato Intelligent Conversational Android). Erica is a Japanese-speaking android which converses with one or more human users. She is able to perceive the environment and users through microphones, video cameras, depth and motion sensors. The design objective is for Erica to maintain an autonomous prolonged conversation on a variety of topics in a human-like manner. An image of Erica is shown in Fig. 1.

Fig. 1 The android Erica is designed to be physically realistic. Motors within her face provide speaking motions in addition to unconscious behaviors such as breathing and blinking.

Erica’s realistic physical appearance implies that her spoken dialogue system must be able to hold a conversation in a similarly human-like manner by displaying conversational aspects such as backchannels, turn-taking and fillers. We want Erica to create mixed-initiative dialogues through a robust answer retrieval system, dynamic statement-response generation and proactive initiation of conversation. This distinguishes Erica as a conversational partner as opposed to smartphone-embedded vocal assistants or text-based chatbot applications. Erica must consider many types of dialogue so she can take on a wide range of conversational roles.

Chatting systems, often called chatbots, conduct a conversation with their users. They may be based on rules [4, 8] or machine-learned dialogue models [21, 25]. Conducting a conversation has a wide meaning for a dialogue system. Wide variations exist in the modalities employed, the knowledge sources available, the embodiment and the physical human-likeness. Erica is fully embodied, ultra realistic and may express emotions. Our aim is not for the dialogue system to have access to a vast amount of knowledge, but to be able to talk and answer questions about more personal topics. She should also demonstrate attentive listening abilities where she shows sustained interest in the discourse and attempts to increase user engagement.

Several virtual agent systems are tuned towards question-answering dialogues by using information retrieval techniques [15, 23]. However, these techniques are not flexible enough to accept a wide variety of user utterances other than well-defined queries. They resort to a default failing answer when unable to provide a confident one. Moreover, most information retrieval engines assume the inputs to be text-based or a near-perfect speech transcription. One additional drawback of such omniscient systems is the latency they introduce in the interaction when searching for a response. Our goal is to avoid this latency by providing appropriate feedback even if the system is uncertain about the user’s dialogue.

Towards this goal, we introduce an attentive listener which convinces the interlocutor of its interest in the dialogue so that they continue the interaction. To do this, a system typically produces backchannels and other feedback with the limited understanding it has of the discourse. Since a deep understanding of the context of the dialogue or the semantics of the user utterance is unnecessary, some automatic attentive listeners have been developed as open domain systems [14]. Others use predefined scripts or sub-dialogues that are pooled together to iteratively build the ongoing interaction [1]. One advantage is that attentive listeners do not need to completely understand the user’s dialogue to provide a suitable response. We present a novel approach based on capturing the focus word in the input utterance, which is then used in an n-gram-based sentence construction.

Virtual agents combining question-answering abilities with attentive listening are rare. SimSensei Kiosk [22] is an example of a sophisticated agent which integrates backchannels into a virtual interviewer. The virtual character is displayed on a screen and thus does not situate the interaction in the real world. Erica iteratively builds a short-term interaction path in order to demonstrate a mixed-initiative multimodal conversation. Her aim is to keep the user engaged in the dialogue by answering questions and showing her interest. We use a hierarchical approach to control the system’s utterance generation. The top-level controller queries the components and decides which one (question-answering, statement response, backchannel or proactive initiator) shall take the lead in the interaction.

This work presents the integration of these conversation-based components as the foundation of the dialogue management system for Erica. We introduce the general architecture in the next section. Within Sect. 3, we describe the individual components of the system and then conduct a preliminary evaluation in Sect. 4. Note that Erica speaks Japanese, so translations are given in English where necessary.

2 Architecture

Erica’s dialogue system combines various individual components. A top-level controller selects the appropriate component to use depending on the state of the dialogue. We cluster dialogue segments into four main classes as shown in Table 1 (examples are translated from Japanese). The controller estimates the current dialogue segment based on the short-term history of the conversation and then triggers the appropriate module to generate a response.

Table 1 Classification of dialogue segments ([] = event/action, “” = utterance)

Figure 2 shows the general architecture of Erica’s dialogue system, focusing on the speech and event-based dialogue management which is the topic of this paper.

The system uses a Kinect sensor to reliably identify a speaker in the environment. Users’ utterances are transcribed by the speech recognizer and aligned with a tracked user. An event detector continuously monitors the spatial and acoustic environment to extract selected events such as periods of silence and user locations. The interaction is divided into steps: each step is triggered by the reception of an input, which is either a transcribed utterance or a detected event, and ends with a system action.

First, the utterance is sent to the question-answering and statement response components which generate an associated confidence score. This score is based on factors such as the hypothesized dialogue act of the user utterance and the presence of keywords and focus phrases. The controller then selects the component’s response with the highest confidence score. However if this score does not meet the minimum threshold, the dialogue manager produces a backchannel fallback.
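The selection rule described above can be sketched in a few lines of Python. This is only an illustration: the `Candidate` class, the `select_response` function and the value of `MIN_CONFIDENCE` are hypothetical names and settings chosen for the example, and the way each component computes its confidence is left out.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Candidate:
    component: str     # e.g. "question_answering" or "statement_response"
    response: str      # text the component proposes to say
    confidence: float  # assumed to lie in [0, 1]


# Hypothetical threshold; the actual value used for Erica is not given in the text.
MIN_CONFIDENCE = 0.5

# Fallback produced when no component is confident enough.
BACKCHANNEL = Candidate("backchannel", "e?", 1.0)


def select_response(candidates: Sequence[Candidate],
                    threshold: float = MIN_CONFIDENCE) -> Candidate:
    """Return the most confident candidate, or a backchannel fallback."""
    best = max(candidates, key=lambda c: c.confidence, default=None)
    if best is None or best.confidence < threshold:
        return BACKCHANNEL
    return best
```

For instance, given a question-answering candidate with confidence 0.82 and a statement response candidate with confidence 0.40, the sketch returns the question-answering candidate; if both fall below the threshold, it returns the backchannel.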

Both the question-answering and statement response components use dialogue act tagging to generate their confidence scores. We use a dialogue act tagger based on support vector machines to classify an utterance as a question or non-question. Focus word detection is used by the statement response system and is described in more detail in Sect. 3.2.

Events such as silences and users entering the proximity of Erica are detected and handled by the proactive initiator. Erica instigates dialogue which is not in response to any speech input from the user, but events in the environment such as a user entering her social space. This dialogue is generated based on rules and is described in more detail in Sect. 3.3.

Fig. 2 Architecture of the dialogue system

3 Components

In this section, we describe individual components of the system and some example responses they generate.

3.1 Question Answering with Fallback

Task-oriented spoken dialogue systems handle uncertain inputs with explicit or implicit confirmations [9]. There is a trade-off between the consequences of processing an erroneous utterance and the expected fluency of the system [24]. Question-answering engines such as smartphone assistants make no confirmations and let users decide whether they accept the returned results. As a conversational partner, Erica cannot use such explicit strategies as they interrupt the flow of the dialogue. We can consider chatting with uncertainty to be similar to conversing with non-native speakers, with misunderstandings being communicated and repaired jointly.

Erica’s question-answering component enables her to implicitly handle errors and uncertainty. Since the system’s goal is to generate conversational dialogues, an exact deep understanding of the user utterances is not necessary. Erica is able to generate implicit non-understanding prompts such as “e?” (“huh?” in English), backchannels and nodding. These signals are used when the system is unable to generate an answer with sufficiently high confidence.

The following conversation shows an example interaction segment between a user and Erica in which her responses are managed with only the question-answering and backchannel components (Table 2).

Table 2 Example of question-answering based interaction

The question-answering manager bases its knowledge on a handcrafted database of adjacency pairs. The following measure is used to compare a set of n ranked speech recognition hypotheses \(\{(u_1, cm_1)...(u_n, cm_n)\}\) and all first-pair parts \(fpp_{db}\) in the database:

$$ m(u_i, cm_i, fpp_{db}) = \frac{1}{1 + e^{\alpha \cdot ld(fpp_{db}, u_i) + \beta \cdot (1 - cm_i) + \gamma }} $$

\(ld(fpp_{db}, u_i)\) is the normalized Levenshtein distance between a database entry \(fpp_{db}\) and the hypothesis’ utterance \(u_i\). \(cm_i\) is the confidence measure of the speech recognizer mapped to the interval \(\left[ 0;1\right] \) using the sigmoid function. \(\alpha \) and \(\beta \) are weights given to the language understanding and speech recognition parts. \(\gamma \) is a bias that determines the overall degree of acceptance of the system. This approach is not highly sophisticated, but is not the main focus of this work. We found it sufficient for most user questions which were within the scope of the database of topics.

The algorithm searches for the most similar first-pair part given the incoming input. The entry for which the computed measure is the highest is selected and the associated system response is generated. If the measure m does not exceed a threshold, the system resorts to a fallback response.
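A minimal Python sketch of this matching step is given below, under several assumptions that are not stated in the text: the Levenshtein distance is normalized by the length of the longer string, the weights \(\alpha \) and \(\beta \) are positive (so that a closer match and a more confident recognition give a larger m, consistent with the threshold test), and the values of `alpha`, `beta`, `gamma` and `threshold` are arbitrary illustrative choices. The function names are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalized_ld(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def measure(u: str, cm: float, fpp: str,
            alpha: float, beta: float, gamma: float) -> float:
    """m(u_i, cm_i, fpp_db) as defined in the equation above."""
    z = alpha * normalized_ld(fpp, u) + beta * (1.0 - cm) + gamma
    return 1.0 / (1.0 + math.exp(z))


def answer(hypotheses: Sequence[Tuple[str, float]],
           database: Dict[str, str],
           alpha: float = 6.0, beta: float = 2.0, gamma: float = -3.0,
           threshold: float = 0.5) -> Optional[str]:
    """Return the response paired with the best-matching first-pair part,
    or None (meaning: use a fallback) if the best measure is too low."""
    best_score, best_response = 0.0, None
    for u, cm in hypotheses:
        for fpp, response in database.items():
            score = measure(u, cm, fpp, alpha, beta, gamma)
            if score > best_score:
                best_score, best_response = score, response
    return best_response if best_score > threshold else None
```

With these illustrative weights, an exact match recognized with confidence 0.9 scores about 0.94 and is answered, whereas an unrelated query falls below the threshold and triggers the fallback.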

3.2 Statement Response

In addition to answering questions from a user, Erica can also generate focus-based responses to statements. Statements are defined as utterances that do not explicitly request the system to provide information and are not responses to questions. For instance, “Today I will go shopping with my friends” or “I am happy about your wedding” are statements. Chatting is largely based on such exchanges of information, with varying degrees of intimacy depending on speaker familiarity.

Higashinaka et al. [10] proposed a method to automatically generate and rank responses to why-questions asked by users. Previous work also offered a similar learning method to help disambiguate the natural language understanding process using the larger dialogue context [11, 13] and to map from semantic concept to turn realization [12].

Our approach is based on knowledge of common statement responses in Japanese conversation [7]. This includes some repetition of the utterance of the previous speaker, but does not require full understanding of their utterance. As an example, consider the user utterance “Yesterday, I ate a pizza”. Erica’s objective is to engage the user, so she may ask an elaborating question (“What kind of pizza?”) or partially repeat the utterance with a rising tone (“A pizza?”). The key is the knowledge that “pizza” is the most relevant word in the previous utterance. This has also been used in previous robot dialogue systems [19].

We define four cases for replying to a statement as shown in Fig. 3, with examples shown in Table 3. Focus phrases and predicates are underlined and question words are in boxes. Similar to question-answering, a fallback utterance is used when no suitable response can be found, indicated in the table as a formulaic expression.

Table 3 Response to statement cases

Fig. 3 Decision tree for statement-response

To implement our algorithm, we first search for a focus word or phrase in the transcribed user utterance. This process uses a conditional random field whose input is a phrase-level dependency tree of the sentence aligned with part-of-speech tags [25]. The algorithm labels each phrase with its likelihood of being the sentence focus. The most likely phrase is taken as the focus if its probability exceeds 0.5. The resulting focus phrase is stripped so that only nouns are kept (see Footnote 1). For example, from the utterance “The video game that has been released is cool”, ‘video game’ would be extracted as the focus noun.

We then choose the question words to use in the response, depending on whether a focus word was found in the utterance. Question words transform an affirmative sentence into a question. Table 4 displays some examples of question words with and without a focus. We then compute the likelihood of the focus nouns associated with question words using an n-gram language model. N-gram probabilities are computed from the Balanced Corpus of Contemporary Written Japanese (see Footnote 2). The corpus is made of 100 million words from books, magazines, newspapers and other texts. The models have been filtered so they only contain n-grams which include the question words defined above. The value of the maximum probability of the focus noun and question word combination is \(P_{max}\). In the case where no focus could be extracted with a high enough confidence, we use an appropriate pattern based on the predicate. In this case, instead of the focus phrase, we compute sequences made of the utterance’s main predicate and a set of complements containing question words. The best n-gram likelihood is also defined as \(P_{max}\).

The second stage of the tree in Fig. 3 selects one of the four cases shown in Table 3. Each of these defines a different pattern in the response construction. The thresholds \(T_f\) and \(T_p\) have been empirically fine-tuned. Table 5 displays the conditions for generating each response pattern (see Footnote 3).
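The two-stage decision can be sketched as follows. Since the exact conditions of Table 5 are not reproduced here, the sketch assumes one plausible reading: \(P_{max}\) is compared with \(T_f\) when a focus was found and with \(T_p\) otherwise, and the response falls back to a partial repeat or a formulaic expression when \(P_{max}\) is below the relevant threshold. The English question templates, the `score` callable standing in for the n-gram language model, and all function names are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# Stand-in for the n-gram language model: maps a word sequence to a probability.
Scorer = Callable[[str], float]


def best_candidate(candidates: Sequence[str], score: Scorer) -> Tuple[str, float]:
    """Return the most likely candidate and its probability (P_max)."""
    best = max(candidates, key=score)
    return best, score(best)


def statement_response(focus: Optional[str], predicate: str, score: Scorer,
                       t_f: float, t_p: float,
                       focus_templates: Sequence[str] = ("What kind of {}?",),
                       predicate_templates: Sequence[str] = ("Where did you {}?",),
                       formulaic: str = "I see.") -> str:
    """Pick one of four response patterns (assumed reading of the decision tree)."""
    if focus is not None:
        # A focus noun was detected: try to attach a question word to it.
        candidates = [t.format(focus) for t in focus_templates]
        best, p_max = best_candidate(candidates, score)
        if p_max >= t_f:
            return best                    # question on the focus
        return f"{focus.capitalize()}?"    # partial repeat with rising tone
    # No focus: build a question around the main predicate instead.
    candidates = [t.format(predicate) for t in predicate_templates]
    best, p_max = best_candidate(candidates, score)
    if p_max >= t_p:
        return best                        # question on the predicate
    return formulaic                       # formulaic expression (fallback)
```

In this sketch the focus is passed in as `None` when the focus detector's best probability does not exceed 0.5, so the caller decides which branch of the first stage applies.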

Table 4 Question words
Table 5 Response to statement methods

3.3 Proactive Initiator

As shown in Table 1, the proactive initiator takes part in several scenarios. Typical spoken dialogue systems are built with the intent of reacting to the user’s speech, while a situated system such as Erica continuously monitors its environment in search of cues about the intent of the user. This kind of interactive setup has been the focus of recent studies [5, 6, 16, 18, 20]. Erica uses an event detector to track the environment and generate discrete events. For example, we define three circular zones around Erica as her personal space (0–0.8 m), social space (0.8–2.5 m) and far space (2.5–10 m). The system triggers events whenever a previously empty zone gets filled or when a crowded one is left empty. We also measure prolonged silences of fixed lengths.
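As an illustration of the zone-based events, the following Python sketch maps user distances to the three zones and emits an event when a zone changes between empty and occupied. The zone boundaries come from the text; the half-open interval convention, the event names and the function names are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Optional

# (name, inner radius, outer radius) in metres, as defined in the text.
ZONES = (("personal", 0.0, 0.8), ("social", 0.8, 2.5), ("far", 2.5, 10.0))


def zone_of(distance_m: float) -> Optional[str]:
    """Return the zone containing a user at the given distance, if any."""
    for name, inner, outer in ZONES:
        if inner <= distance_m < outer:  # boundary convention is an assumption
            return name
    return None


def occupancy(distances: Iterable[float]) -> Dict[str, int]:
    """Count tracked users per zone."""
    counts = {name: 0 for name, _, _ in ZONES}
    for d in distances:
        zone = zone_of(d)
        if zone is not None:
            counts[zone] += 1
    return counts


def zone_events(previous: Dict[str, int], current: Dict[str, int]) -> List[str]:
    """Emit an event when an empty zone gets filled or an occupied one empties."""
    events = []
    for name, _, _ in ZONES:
        if previous[name] == 0 and current[name] > 0:
            events.append(f"{name}_filled")
        elif previous[name] > 0 and current[name] == 0:
            events.append(f"{name}_emptied")
    return events
```

Calling `zone_events` on two consecutive occupancy snapshots yields the discrete events that a proactive component could react to, for example a greeting when the social zone becomes occupied.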

Currently, we use the proactive initiator for three scenarios:

  1. If a silence longer than two seconds has been detected in a question-answering dialogue, Erica will ask a follow-up question related to the most recent topic.

  2. If a silence longer than five seconds has been detected, Erica starts a ‘topic introduction’ dialogue where she draws a random topic from the pool of available ones using a weighted distribution which is inversely proportional to the distance to the current topic in the word-embedding space.

  3. When users enter or leave a social space, Erica greets or takes leave of them.
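The topic draw in the second scenario can be sketched as below. The text does not specify the distance metric, so Euclidean distance is assumed here; excluding the current topic from the draw, the small constant `eps` that avoids division by zero, and the toy two-dimensional embeddings in the usage note are also assumptions.

```python
import math
import random
from typing import Dict, Optional, Sequence


def topic_weights(current: str,
                  embeddings: Dict[str, Sequence[float]],
                  eps: float = 1e-6) -> Dict[str, float]:
    """Normalized weights inversely proportional to the distance from the
    current topic (Euclidean distance is an assumption)."""
    origin = embeddings[current]
    raw = {topic: 1.0 / (math.dist(origin, vector) + eps)
           for topic, vector in embeddings.items() if topic != current}
    total = sum(raw.values())
    return {topic: weight / total for topic, weight in raw.items()}


def introduce_topic(current: str,
                    embeddings: Dict[str, Sequence[float]],
                    rng: Optional[random.Random] = None) -> str:
    """Draw the next topic at random, favoring topics close to the current one."""
    rng = rng or random.Random()
    weights = topic_weights(current, embeddings)
    topics = list(weights)
    return rng.choices(topics, weights=[weights[t] for t in topics], k=1)[0]
```

With toy embeddings such as `{"hobbies": (0, 0), "movies": (1, 0), "animals": (0, 4)}` and "hobbies" as the current topic, "movies" is about four times as likely to be drawn as "animals".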

4 Evaluation and Discussion

The goal of our evaluation is to test whether our system can avoid making generic fallback utterances under uncertainty while providing a suitable answer. We first evaluate the statement response system independently. Then we evaluate if this system can reduce the number of fallback utterances. As we have no existing baseline, our methodology is to conduct an initial user study using only the question-answering system. We then collect the utterances from users and feed them into our updated system for comparison.

We evaluated the statement response component by extracting dialogue from a chatting corpus created for Project Next’s NLP task (see Footnote 4). This corpus is a collection of 1046 transcribed and annotated dialogues between a human and an automatic system. The corpus has been subjectively annotated by three annotators who judged the quality of the answers given by the annotated system as coherent, undecided or incoherent. We extracted 200 user statements from the corpus for which the response from the automated system had been judged as coherent.

All 200 statements were input into the statement response system and two annotators judged if the response was categorized correctly according to the decision tree in Fig. 3. Precision and recall results are displayed in Table 6.

Table 6 Evaluation of statement response component

Our results showed that the decision tree correctly selected the appropriate category in the majority of cases. The difference between the high performance of the formulaic expression and the question on predicate shows that the decision threshold in the case of no focus word could be fine-tuned to improve the overall performance.

We then tested whether the integration of statement response into Erica’s dialogue system reduced the number of fallback utterances. The initial user study consisted of 22 participants who were asked to interact with Erica by asking her questions from a list of 30 topics such as Erica’s hobbies and favorite animals. They could speak freely and the system would either answer their question or provide a fallback utterance, such as “Huh?” or “I cannot answer that”.

Users interacted with Erica for 361 seconds on average (\(sd=\) 131 seconds), with an interaction lasting on average 21.6 turns (\(sd=\) 7.8 turns). Of the 226 user turns, 187 were answered correctly by Erica and 39 were responded to with fallback utterances. Users also subjectively rated their interaction using a modified Godspeed questionnaire [2]. This questionnaire measured Erica’s perceived intelligence, animacy and likeability as a summation of factors which were measured on 5-point Likert scales. Participants rated Erica’s intelligence on average as 16.8 (maximum of 25), animacy as 8.8 (maximum of 15), and likeability as 18.4 (maximum of 25).

We then fed all utterances into our updated system which included the statement response component. User utterances which produced a fallback response were now handled by the statement response system. Out of 39 utterances, 19 were handled by the statement response system, while 20 could not be handled and so again reverted to a fallback response. This result showed that around half the utterances could be handled with the addition of a statement response component, as shown in Fig. 4. The dialogues produced by the statement response system were generally coherent with the correct focus word found.

Fig. 4 Proportion of system turns answered by each component in the experiment (left) and in the updated system including statement response (right)
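The proportions behind this comparison can be recomputed directly from the counts reported above. The short Python snippet below does only that arithmetic; the variable and function names are chosen for the example.

```python
# Counts reported in the text.
total_turns = 226
answered_by_qa = 187
fallback_before = 39
handled_by_statement_response = 19
fallback_after = 20


def share(count: int, total: int) -> float:
    """Percentage of `total` represented by `count`, rounded to one decimal."""
    return round(100 * count / total, 1)


# Original system: question answering plus fallback only.
shares_before = {
    "question_answering": share(answered_by_qa, total_turns),
    "fallback": share(fallback_before, total_turns),
}

# Updated system: former fallbacks are first offered to statement response.
shares_after = {
    "question_answering": share(answered_by_qa, total_turns),
    "statement_response": share(handled_by_statement_response, total_turns),
    "fallback": share(fallback_after, total_turns),
}

# Share of former fallback turns recovered by statement response.
recovered = share(handled_by_statement_response, fallback_before)
```

On these counts, fallback utterances drop from 17.3% to 8.8% of user turns, and 48.7% of the former fallback turns are handled by the statement response component.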

5 Conclusion

Our dialogue system for Erica combines different approaches to build and maintain a conversation. The knowledge and models used to cover a wide range of topics and roles are designed to improve the system’s flexibility. We plan on improving the components using data collected through Wizard-of-Oz experiments.

While the question-answering system is simplistic, it can yield control to other components when uncertainty arises. The statement response mechanism helps to continue the conversation and increase the user’s belief that Erica is attentive to her conversational partner. In the future we also aim to evaluate Erica’s proactive behavior and handle errors in speech recognition.

Our experiment demonstrated that a two-layered decision approach handles interaction according to simple top-level rules. We obtained some promising results with our statement response system and intend to improve it in future prototypes. Other ongoing research focuses on learning the component selection process based on data. The main challenge in this architecture is determining which component should handle the conversation, which will be addressed in future work.