1 Introduction

The vAssist project [6] aims at providing specific voice-controlled home care and communication services for two target groups of older persons: seniors living with chronic diseases and persons living with (fine) motor skills impairments. The main goal is the development of simplified and adapted interface variants for tele-medical and communication applications using multilingual natural speech and voice interaction (and supportive graphical user interfaces where necessary) [19, 22]. The vAssist consortium consists of research institutes and companies from Austria, France and Italy. Toward the end of the project, the University of the Basque Country joined the consortium in order to extend the scope to Spanish-speaking users.

2 Related Work

A Spoken Dialog System (SDS) is a system providing an interface to a service or an application through a dialog. An interaction qualifies as a dialog when it exceeds one turn. This requires keeping track of the dialog state, including the history of past turns, in order to select the appropriate next step.

Those systems do not usually consist of a single component but comprise several specialized programs combined in order to recognize the speech, extract the relevant information in the transcriptions, act on back-end services, decide on the best next step, generate system responses and synthesize speech.

JUPITER [31] was one of the first SDSs released to the public. The phone-based weather information conversational interface received over 30,000 calls between 1997 and 1999. Earlier, researchers from Philips implemented an automatic train timetable information desk for Germany [1]. More recently, Carnegie Mellon University provided Olympus [2], which has been used to build systems like the Let’s Go! Bus Information System [20], leading to the largest publicly available corpus of human-machine dialogs with real users to date. Recent platforms for developing spoken dialog systems include the Opendial toolkit [17] and the architecture developed by the University of Cambridge [30] for its startup VocalIQ.

ELIZA [27] is considered by many as the first dialog system. The core of the system was based on scripts that selected a system response by looking for patterns in the user input. Larsson and Traum argued that the state of the dialog, including its history, may be represented as the sum of the information exchanged so far [14]. An Information State (IS) designer defines the elements of the information relevant to a dialog, a set of update rules and an update strategy. An example-based dialog manager (DM) [15] constructs a request to a database from the annotated input Dialog Act (DA). The database stores examples seen in the interaction data; the algorithm looks for the most similar entry and then executes the associated system action. On the other hand, plan-based DMs [3, 21] require a pre-programmed task model.

On the stochastic side, Markov Decision Processes (MDPs) provide a statistical decision framework for managing dialogs [16, 29]. Here, the dialog state space contains all the states the dialog may be in, and transitions depend on the user inputs. The behavior of an MDP-based DM is defined by a strategy that associates each state with an executable action. Statistical methods used for dialog management also include Stochastic Finite-State models [9, 10, 25] and Semi-MDPs [8]. Finally, the state-of-the-art Partially Observable MDP (POMDP) [28] extends the MDP by hiding the states, which emit observations according to a probability distribution [13, 28, 30]. This additional layer encodes the uncertainty introduced by the Natural Language Understanding (NLU) and, in the case of SDSs, the Automatic Speech Recognition (ASR). From a theoretical point of view, the POMDP approach is appealing because it provides a global statistical framework that allows for optimization. However, practical POMDP-based DMs are currently limited in the number of variables they can handle and by the computational intractability of finding an optimal strategy [7, 13, 30].

In vAssist, the development context, along with the difficulty of collecting ‘real’ training dialogs, favored the use of a deterministic control formalism. This choice was also motivated by the overall requirements of the system into which the DM had to be integrated.

3 Main Goals and Contributions

This article describes the vAssist SDS and presents the results of its final system evaluation. The vAssist DM system is based on an open and adaptive software architecture that allows for an easy configuration of DMs according to a given target scenario, user requirements and context. In accordance with [18], the novelties of the vAssist SDS are the Semantic Unifier and Reference Resolver (SURR) defined in the natural language understanding module and the Linked Form Filling (LFF) concept proposed to model the task (for both cf. Sect. 4.6). The vAssist prototype is based on the Disco plan-based DM and the LFF task model. For comparison purposes we have also integrated an alternative plan-based DM, i.e. RavenClaw (cf. Sect. 4.7).

The main contribution of this work is therefore a multilingual lab evaluation of the final vAssist assistive home care and communication service applications running on a smartphone. The evaluation was carried out with real users in Austria, France and Spain (cf. Sect. 5). As an additional contribution, the evaluation covers both system performance and user experience (cf. Sect. 6). The final contribution of this work is the experimental comparison of the Disco-LFF DM and the RavenClaw DM working within the same SDS architecture, dealing with the same task and language (i.e. Spanish), and interacting with the same users (also cf. Sect. 6).

4 System Description

The vAssist SDS extends the usual chained design (i.e. ASR \(+\) NLU \(+\) DM \(+\) NLG \(+\) TTS). Components were split into modified sub-modules and new processes were integrated into a state-of-the-art workflow chain. Figure 1 shows the resulting SDS architecture.

Fig. 1 Architecture of the platform
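To make the chained design concrete, the following minimal Python sketch shows how such sub-modules could be composed into a turn-level processing chain. All class and method names are illustrative placeholders, not the actual vAssist interfaces.

```python
# Minimal sketch of the chained SDS design (ASR -> NLU -> DM -> NLG -> TTS).
# Component classes and method names are illustrative placeholders.

class Turn:
    def __init__(self, audio_in):
        self.audio_in = audio_in      # raw user speech segment
        self.hypotheses = []          # n-best ASR results
        self.semantic_frame = None    # parsed and contextualized meaning
        self.dialog_act = None        # DA passed to the dialog manager
        self.system_frame = None      # DM output (semantic frame)
        self.system_text = None       # NLG output
        self.audio_out = None         # synthesized speech

def process_turn(turn, asr, parser, surr, da_mapper, dm, nlg, tts):
    """Run one user turn through the processing chain."""
    turn.hypotheses = asr.recognize(turn.audio_in)
    turn.semantic_frame = surr.resolve(parser.parse(turn.hypotheses))
    turn.dialog_act = da_mapper.map(turn.semantic_frame)
    turn.system_frame = dm.next_step(turn.dialog_act)
    turn.system_text = nlg.generate(turn.system_frame)
    turn.audio_out = tts.synthesize(turn.system_text)
    return turn
```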

4.1 Speech Recognition

The system uses the Google Speech API, where an HTTP POST request transmits the signal segment to be recognized. The API returns the n-best hypotheses, where n is a parameter of the request, as well as the confidence score for the best one. An empty result is returned when the segment cannot be recognized with enough confidence, i.e. when it does not contain speech.
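The sketch below illustrates this kind of HTTP-based recognition request. The endpoint, parameter names and response layout are placeholders for illustration only and do not reproduce the actual Google Speech API contract.

```python
import requests

# Illustrative sketch of an HTTP-based recognition request; the endpoint,
# parameters and response layout are assumptions, not the real API contract.
def recognize(audio_bytes, n_best=5, lang="de-DE",
              url="https://speech.example.com/recognize"):
    response = requests.post(
        url,
        params={"lang": lang, "maxresults": n_best},
        data=audio_bytes,
        headers={"Content-Type": "audio/x-flac; rate=16000"},
    )
    response.raise_for_status()
    result = response.json()
    # Assumed response shape: a list of hypotheses, confidence on the best one.
    hypotheses = result.get("hypotheses", [])
    if not hypotheses:          # no speech detected / low confidence
        return None
    return hypotheses           # [{"transcript": ..., "confidence": ...}, ...]
```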

4.2 Natural Language Generation and Text-to-Speech

A simple but effective solution to produce natural language utterances conveying the DM’s messages was targeted. Input messages are Semantic Frames (SFs). The engine is fed with a set of templates that consist of a title (identical to an SF’s goal) associated with an utterance, and whose parts may be replaced by slot names or slot name-value pairs. The result is a natural language utterance to be synthesized or displayed on a screen.
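A minimal sketch of this template-based generation step is given below: a template is selected by the SF’s goal and its placeholders are filled with slot values. The template syntax (curly-brace placeholders) and the example templates are our assumptions for illustration.

```python
# Minimal sketch of template-based generation: a template is selected by the
# semantic frame's goal and its placeholders are filled with slot values.
# The template syntax and contents are illustrative assumptions.
TEMPLATES = {
    "confirm_prescription": "You want to take {quantity} of {medication} at {time}. Is that correct?",
    "ask_value":            "Please tell me the {slot}.",
}

def generate(semantic_frame):
    template = TEMPLATES[semantic_frame["goal"]]
    return template.format(**semantic_frame.get("slots", {}))

# generate({"goal": "ask_value", "slots": {"slot": "time of intake"}})
# -> "Please tell me the time of intake."
```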

MaryTTS [23], an open-source speech synthesis framework maintained by the Cluster of Excellence MMCI and the DFKI, is used for synthesis. It offers pre-built voice models for different languages as well as tools to create and manipulate them. The MaryTTS module is a client connected to a generating server (hosted locally or remotely). A request containing the text to be synthesized with additional prosodic information is sent to the central server, which returns the speech stream. The text-to-speech module of the present platform is a basic client program embedded into an ActiveMQ wrapper.
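The following sketch shows what such a client request could look like, assuming the standard MaryTTS 5.x HTTP interface on its default port (59125); it is a simplified illustration rather than the platform’s actual ActiveMQ-wrapped client.

```python
import requests

# Sketch of a MaryTTS client request, assuming the standard MaryTTS 5.x HTTP
# interface on its default port (59125); parameter names follow that interface.
def synthesize(text, locale="de", voice=None, host="http://localhost:59125"):
    params = {
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": locale,
    }
    if voice:
        params["VOICE"] = voice
    response = requests.get(host + "/process", params=params)
    response.raise_for_status()
    return response.content      # WAV byte stream returned by the server
```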

4.3 Semantic Parsing

The semantic parser, which gets inputs from the ASR, associates semantic labels to text utterances (or parts of them). The most commonly used parsing techniques are based on context-free grammars or probabilistic context-free grammars, which are either hand-coded, based on the analysis of collected dialog data, or designed by experts.

Our semantic parser integrates the algorithm proposed by [12], which applies the approach of [4]. Instead of matching whole sentences with parse structures, the algorithm looks for patterns in chunks of the text-level utterance and in the temporary (i.e. currently assigned) SF. The module applies an ordered set of conditional rules learned from data.
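The sketch below illustrates the idea of applying ordered conditional rules over utterance chunks and the temporary SF. The rule format and the example rules are our illustrative assumptions; in the actual system the rules are learned from data [12].

```python
import re

# Illustrative sketch: an ordered set of conditional rules is applied to chunks
# of the utterance and to the temporary semantic frame (SF). The rule format
# and contents are assumptions for illustration.
RULES = [
    # (pattern on a text chunk, condition on the temporary SF, update to apply)
    (re.compile(r"\bblood pressure\b"), lambda sf: "goal" not in sf,
     {"goal": "health_report", "measure": "blood_pressure"}),
    (re.compile(r"\b(\d+)\s*mg\b"),     lambda sf: sf.get("goal") == "prescription",
     {"dosage_unit": "mg"}),
]

def parse(utterance):
    frame = {}                              # temporary SF, refined rule by rule
    for pattern, condition, update in RULES:
        if pattern.search(utterance) and condition(frame):
            frame.update(update)
    return frame
```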

4.4 Semantic Unification and Resolution

The Semantic Unifier and Reference Resolver (SURR) holds a rather simplistic forest of nodes which is used to mine the dialog history, incorporate external information sources and add local turn context. It is the meeting point of the user’s semantic frames, the external environment sensors and functions, the dialog history, and the links generated by the context catcher.

At its core the SURR embeds a forest structure. Trees consist of hierarchies of fully or partially defined SFs (some nodes are calls to external systems or services). When requested, the SURR may dynamically modify (remove/add) branches of the forest. The top node of a hierarchy defines the root.

The SURR algorithm tries to find a unique path from an input SF, i.e. from the parsed user input, through nodes of the forest, up to a root node. Going up the trees, the algorithm applies the optional operations held on branches.

Reaching a root node equals the user input being contextualized [18]. In case the algorithm cannot find such a path, i.e. the SURR fails to produce a suitable SF (given the current context and available knowledge), a “NoMap” SF is generated to signal a ‘non-understanding’ to consecutive components.
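A minimal sketch of this resolution step follows: starting from a node that matches the parsed SF, the algorithm climbs toward the root, applying the optional operations attached to branches, and signals “NoMap” when no path is found. The node and forest structures are illustrative, not the actual vAssist implementation.

```python
# Minimal sketch of the SURR resolution step. Node/edge structures are
# illustrative assumptions; 'forest' is an iterable over all nodes.
class Node:
    def __init__(self, frame, parent=None, operation=None):
        self.frame = frame          # fully or partially defined SF
        self.parent = parent        # None for a root node
        self.operation = operation  # optional transformation applied on the way up

def matches(node_frame, input_frame):
    return all(input_frame.get(k) == v for k, v in node_frame.items())

def resolve(input_frame, forest):
    for node in forest:
        if not matches(node.frame, input_frame):
            continue
        frame = dict(input_frame)
        while node.parent is not None:          # climb toward the root
            if node.operation:
                frame = node.operation(frame)   # e.g. add context, rename slots
            node = node.parent
        return frame                            # contextualized SF
    return {"goal": "NoMap"}                    # signal a non-understanding
```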

4.5 Dialog Act Mapping

As a last stage of the NLU processing, the dialog act mapping is performed. Once an input has been parsed, external and local references have been resolved, and the semantic level has been unified, the ultimate step is to convert the SF into a DA. Following an input, the mapper retrieves the set of currently available DAs and then looks for a unique match between the SF and this set.
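A short sketch of this matching step is given below; the DA and SF data structures are illustrative assumptions.

```python
# Sketch of the dialog act mapping: the unified SF is compared to the set of
# currently available DAs; only a unique match is converted. Data structures
# are illustrative assumptions.
def map_to_dialog_act(frame, available_das):
    candidates = [da for da in available_das
                  if da["goal"] == frame.get("goal")
                  and set(da["slots"]) >= set(frame.get("slots", {}))]
    if len(candidates) == 1:
        return {"act": candidates[0]["act"], "slots": frame.get("slots", {})}
    return {"act": "NonUnderstanding"}
```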

4.6 Dialog Management Based on Disco

The core of the implemented DM is based on Disco [21], an open-source dialog management library whose algorithm processes task hierarchy models. A dialog model is a constrained XML tree of tasks. The plan recognizer uses the recipes defined in the dialog models and the current dialog state to select the best available plans for the tasks in the stack. The reasoning engine then selects the most appropriate next step.

In an attempt to overcome the hurdles inherent to the specification of task models, the dialog modeling paradigm was shifted to a Linked Form Filling (LFF) one. Form-filling dialogs are based on structures containing sets of fields for which the user needs to provide a value in order to trigger a terminal action. The order in which the DM asks for the values is not predefined. The user may define multiple field values within a single utterance/turn.

The LFF language offers to combine these properties with the ability to trigger actions at any point of the dialog and the inclusion of subforms. Furthermore, fields and subforms can be optional, i.e. either be ignored when unset or proposed to the user. Here, we use the unlimited depth of a task model to circle tasks while keeping a sequencing order; i.e. the link between two task nodes is a reference, hence a node can point to its ‘parent’ node.
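The sketch below illustrates the linked form-filling idea: forms hold required and optional fields, trigger an action once the required fields are filled, and may link to subforms (links may also point back to a ‘parent’ form). It is an illustration of the concept in Python, not the actual XML-based LFF language.

```python
# Illustration of the linked form-filling concept; class and field names are
# assumptions, not the actual LFF language.
class Form:
    def __init__(self, name, required, optional=(), action=None, links=()):
        self.name = name
        self.required = list(required)
        self.optional = list(optional)
        self.action = action          # callable triggered when the form is complete
        self.links = list(links)      # subforms, possibly referencing ancestors
        self.values = {}

    def next_prompt(self):
        """Return the next unfilled required field, or None if complete."""
        for field in self.required:
            if field not in self.values:
                return field
        return None

    def fill(self, slots):
        """Accept several field values from a single user turn."""
        for name, value in slots.items():
            if name in self.required or name in self.optional:
                self.values[name] = value
        if self.next_prompt() is None and self.action:
            self.action(self.values)  # terminal action of the form
```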

The aim of the LFF language is to offer a somewhat simpler design method on top of a powerful standard dialog modeling specification. Since it is also an XML-based language, we opted for XSLT to convert an LFF document into a compliant dialog model.

A number of rules have been defined to create well-formed LFF documents. With this approach, the relative reduction in code size and task hierarchy depth was 76% and 77%, respectively.

4.7 Dialog Management Based on RavenClaw

RavenClaw (part of the CMU Communicator system [3]) is a task-independent DM. It manages dialogs using a task tree and an agenda.

The task tree is basically a plan to achieve the overall dialog task. At runtime, the tree is traversed recursively from left to right and from top to bottom. The execution of the dialog ends when the bottom-right node has been reached. During this process, loops and conditional control mechanisms may be added to the nodes in order to alter the normal exploration of the tree, allowing the definition of more complex dialog structures.

The second defining structure, the agenda, is an ordered list of agents that is used to dispatch inputs to appropriate agents in the task tree. It is recomputed for every turn, with the current agent placed on top of the stack. Inputs are matched to successive items on the agenda. When a match occurs, the corresponding agent is activated with the matching concepts as input. An agent may not consume all input concepts; the remaining concepts are passed further down the agenda until other agents can consume them.
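The following sketch illustrates this agenda-based dispatch loop. The agent interface (bind/activate) is an illustrative assumption, not RavenClaw’s actual API.

```python
# Sketch of agenda-based input dispatch: the agenda is an ordered list of
# agents (current agent on top); concepts are offered to successive agents,
# each consumes what it can, and the remainder is passed further down.
# The agent interface is an illustrative assumption.
def dispatch(agenda, concepts):
    remaining = dict(concepts)
    for agent in agenda:                        # top of the stack first
        consumed = agent.bind(remaining)        # concepts this agent can use
        if consumed:
            agent.activate(consumed)
            for name in consumed:
                remaining.pop(name, None)
        if not remaining:
            break
    return remaining                            # concepts no agent could consume
```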

In order to integrate RavenClaw into the architecture shown in Fig. 1, the original Disco-LFF DM was replaced with a module responsible for translating between the message format defined by RavenClaw and the one defined by the Disco-based component.

5 Task and Experimental Scenarios

To empirically evaluate the operation of the developed voice-controlled application running on a smartphone under standardized conditions, several scenarios were defined and implemented. In detail, the following scenarios and associated tasks were applied for the experimental study:

  • The Prescription Management enables users to monitor medical prescriptions and individual intake times. To evaluate this scenario, participants were asked to add a new predefined prescription to the application database and to set a reminder for it (AP). The app requests information regarding the name of the medication, quantity, dosage form, frequency, and time of intake.

  • The Health Report (HR) provides an overview of physiological data. Participants filled in predefined glycaemia and blood pressure data.

  • The Sleep Report (SR) monitors sleep quality. Users provided the following data: the time they went to bed, the time they fell asleep, and their wake-up time. Participants also reported periods awake during the night and the total number of hours slept. Finally, users were asked to rate their well-being on a six-point scale. Furthermore, the evaluation included setting a reminder to complete the sleep report (SRR).

  • Fitness Data Management consists of reporting daily activities (FD) and setting reminders for the reports. Within the evaluation, participants were asked to enter a new report including the duration of their fitness activity.

  • The Communication Services include sending messages (SM) and initiating phone calls (PC). Participants were asked to test both functions.

6 Experimental Evaluation

Two series of experiments were carried out. First, we evaluated the vAssist system including the Disco-LFF engine in three languages: French, German and Spanish. Second, we compared the RavenClaw and Disco-LFF DMs built into the vAssist system with Spanish users.

Sixteen users took part in the experiments at each of the trial sites. In France, 14 male and 2 female participants between 65 and 90 years of age (mean \(=\) 77.0) took part in the study. In Austria, 8 male and 8 female participants between 60 and 76 years (mean \(=\) 68.0) took part. The Spanish trial site included 12 males and 4 females between 26 and 69 years (mean \(=\) 39.6).

Users were first shown the smartphone application, followed by a short demonstration and usage advice. The experimental scenarios were then carried out without any other introduction than a simple description of the goal. It was up to the user to figure out how to perform each task.

The system’s performance was measured in terms of Task Completion (TC), i.e. success rate, and Average Dialog Length (ADL), i.e. efficiency. TC evaluates the success rate of the system in providing the user with the requested information, based on the total number of dialogs carried out and the number of successful dialogs achieved for a specific task. ADL is the average number of turns in a successful task.
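The two metrics can be computed directly from dialog logs, as in the sketch below; the record format (one entry per dialog with a success flag and a turn count) is an assumption for illustration.

```python
# Computation of the two reported metrics from dialog logs, following the
# definitions above; the record format is an illustrative assumption.
def task_completion(dialogs):
    """TC: fraction of dialogs in which the task goal was achieved."""
    successful = sum(1 for d in dialogs if d["success"])
    return successful / len(dialogs)

def average_dialog_length(dialogs):
    """ADL: average number of turns over the successful dialogs only."""
    successful = [d for d in dialogs if d["success"]]
    return sum(d["turns"] for d in successful) / len(successful)
```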

For the subjective measures, a set of standardized questionnaires was applied. The standard Single Ease Questionnaire (SEQ) [24], the System Usability Scale (SUS) [5] and the Subjective Assessment of Speech System Interfaces (SASSI) [11] questionnaire were used to evaluate the vAssist system with the Disco-LFF DM. A custom set of questions was used to compare the Disco-LFF-based DM with the RavenClaw-based DM. Results of the SEQ, SUS and SASSI are not given for Spanish, as no localized mobile application interface was available for this language.

6.1 System Performance

The first series of experiments was carried out in France, Austria and Spain, evaluating the vAssist system with the Disco-LFF DM. Table 1 shows the system performance evaluation in terms of TC and ADL values.

Table 1 TC and ADL of the vAssist system using the Disco-LFF DM

Table 1 reveals good TC rates, with the French version generating the highest system performance and the Spanish version the lowest. Surprisingly, our results show that the vAssist system did not perform better for younger users (Spain: mean \(=\) 39.6 years) than for older ones (France: mean \(=\) 77.0 years). The language-dependent modules, i.e. the ASR and, more importantly, the NLU, were more robust for French and German. The Spanish results suffered from a less robust semantic parser and the missing mobile UI, leading to a higher number of turns to achieve the task goals.

6.2 Task Easiness and Usability

Besides performance, the perceived task easiness is considered an important factor influencing user experience [26]. This aspect was measured right after each task with the SEQ, using a 7-point semantic differential (“very difficult”—“very easy”). The analysis revealed sufficient ease of use for each task; e.g. mean ratings for the Prescription Management and for sending a message were 4.94, while initiating a phone call and the Health Report were rated 5.06.

To obtain insights regarding the prototype’s usability, learnability, and intuitiveness, the SUS was used. SUS scores fall between 0 and 100; the higher the score, the better. The values for Austria and France were 68 (sd \(=\) 17.2) and 70 (sd \(=\) 11.5), respectively. Hence, even though the perceived easiness of single tasks was good, the overall system experience could still be improved.

6.3 Speech Assessment

The SASSI questionnaire was employed to examine the interaction quality. The analysis provides developers with an assessment of the system along several axes such as easiness, friendliness, speed, etc.

Table 2 Comparing the Disco-LFF and RavenClaw DMs

Results indicate that both “Response Accuracy” (Austria: 4.27, France: 3.99) and “Speed” (Austria: 4.64, France: 4.19) were judged neutral. The analysis of the French sample reveals that “Likeability” (4.9) and “Cognitive Demand” (5.15) were fair. In contrast, the Austrian participants rated these factors as good (Likeability: 5.28, Cognitive Demand: 5.15). Hence, we may argue that participants liked the system and were not overwhelmed by its cognitive demands.

6.4 Disco-LFF and RavenClaw DM Comparison

The second series of experiments was carried out in Spanish only. Note that both DMs were integrated into the same architecture (Fig. 1); only the task planning and the agent execution differed. Each user carried out the scenarios defined in Sect. 5 with either of the DMs. Table 2 shows the system performance achieved by both systems in terms of TC and ADL for each of the defined subscenarios. Both metrics show similar behavior for the Disco-LFF and the RavenClaw DM. A Z-test comparing the average TC proportions and the ADL means showed no statistically significant difference between the two DMs at the 0.05 significance level. A detailed scenario-based analysis showed, however, differences between TC values in the AP and the SR scenarios, which correspond to longer dialogs in terms of the ADL metric. A previous series of experiments had furthermore highlighted a certain lack of robustness in the language-dependent modules of the Spanish vAssist version. This issue was more evident in the longer dialogs (AP and SR).
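For completeness, a two-proportion Z-test of the kind that could be applied to compare the TC rates is sketched below; the exact test formulation used in the study is not detailed, so this is an assumed variant.

```python
from math import sqrt
from statistics import NormalDist

# Sketch of a two-proportion z-test as it could be applied to compare TC rates
# of the two DMs; an assumed formulation, not necessarily the one used here.
def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    p_a, p_b = successes_a / total_a, successes_b / total_b
    p_pool = (successes_a + successes_b) / (total_a + total_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided
    return z, p_value
```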

As there was no mobile UI for the Spanish language, the user experience was evaluated through a set of direct questions regarding system efficiency, usability and user satisfaction. Task easiness received an average score of 3.00 for the Disco-LFF DM and 3.14 for the RavenClaw DM. The respective satisfaction scores were 3.57 and 3.43, and efficiency scored 3.28 and 3.14.

7 Conclusion

This article had two objectives. First, we reported on the results of the final lab evaluation of the vAssist system; second, we compared the system’s core DM implementation with a publicly deployed one.

Despite minimal differences between languages, the vAssist SDS performance proved to be sufficient for its target users, i.e. older adults living with chronic diseases and persons living with (fine) motor skills impairments.

The DM comparison showed similar performance and subjective experience for the system with the Disco-LFF DM and the one with RavenClaw, establishing the Disco-LFF DM as a valid alternative to existing DM approaches.