1 Introduction

Speech and natural language technologies allow users to communicate in a flexible and efficient way, while also enabling access to applications when traditional input and output interfaces cannot be used (e.g. in-car applications, access for disabled persons, etc.). Speech-based interfaces also work seamlessly with small devices (e.g., smartphones and tablet PCs), allowing users to easily invoke local applications or access remote information. For this reason, spoken dialog systems [22, 48, 59] are becoming a strong alternative to traditional graphical interfaces, which might not be appropriate for all users and/or applications.

These systems are computer programs that receive speech as input and generate as output synthesized speech, engaging the user in a dialog that aims to be similar to that between humans [48, 59]. Thus, these interfaces make technologies more usable, as they ease interaction [23], allow integration in different environments [22], and make technologies more accessible, especially for disabled people and the elderly [80].

In a dialog system of this kind, several modules cooperate to perform the interaction with the user: the Automatic Speech Recognizer (ASR), the Spoken Language Understanding Module (SLU), the Dialog Manager (DM), the Natural Language Generation module (NLG), and the Text-To-Speech Synthesizer (TTS). Each of them has its own characteristics, and the selection of the most appropriate model depends on several factors: the goal of the module, the possibility of manually defining its behavior, and the capability of automatically obtaining models from training samples. Figure 1 shows the set of actions and main modules in the architecture of a spoken dialog system.

Fig. 1 Set of actions and modules in a spoken dialog system

The goal of speech recognition is to obtain the sequence of words uttered by a speaker. Once the speech recognizer has provided an output, the system must understand what the user said. The goal of spoken language understanding is to obtain the semantics of the recognized sentence. This process generally requires morphological, lexical, syntactic, semantic, discourse and pragmatic knowledge.

The dialog manager decides the next action of the system, interpreting the incoming semantic representation of the user input in the context of the dialog. In addition, it resolves ellipsis and anaphora, evaluates the relevance and completeness of user requests, identifies and recovers from recognition and understanding errors, retrieves information from data repositories, and decides on the next system response. Natural language generation is the process of obtaining sentences in natural language from the non-linguistic, internal representation of information handled by the dialog system. Finally, the TTS module transforms the generated sentences into synthesized speech.

In order to enable rapid deployment of these systems, markup languages such as VoiceXML have been widely adopted as they reduce the time and effort required for system implementation. However, system development with this approach involves a very costly engineering cycle [62]. As an alternative, data-based models try to reduce the effort and time required to develop a new dialog system or to adapt an existing one to a new task. These models are usually based on probabilistic modeling of the different processes and on learning the parameters of the corresponding statistical models from a dialog corpus. This approach has been widely used for speech recognition and also for language understanding [14, 20, 37, 51, 67]. Even though manually designed dialog managers can be found in the literature, over the last few years approaches using statistical models to represent the behavior of the dialog manager have also been developed [36, 38, 74, 83].

As described by [58], there are three main categories of elements of the spoken dialog interaction where the availability of vast amounts of data (known as Big Data [2, 19, 47]) can potentially improve automation rate, and ultimately, the penetration and acceptance of speech interfaces in the wider consumer market: task-independent behaviors (e.g., error correction and confirmation behavior), task-specific behaviors (e.g., logic associated with certain customer-care practices), and task-interface behaviors (e.g., prompt selection). Today, however, all three categories share a common shortcoming: the lack of robust guiding principles validated by empirical evidence.

The following sections of this chapter describe the current uses of Big Data to develop conversational interfaces including speech recognition, natural language understanding, dialog management and optimization, context-awareness, emotion recognition, user adaptation and service personalization, multi-domain and multilingual services, proactiveness, and spoken language generation and synthesis.

2 Spoken Language Recognition

As described in the introduction, speech recognition is the process of obtaining the text string corresponding to an acoustic input [45, 55, 78]. It is a highly complex task, as there is a great deal of variation in input characteristics, which can differ according to the linguistics of the utterance, the speaker, the interaction context and the transmission channel. Aspects usually taken into account when classifying ASR systems are the kind of users supported (user-independent or user-dependent systems), the style of speech supported (recognition of isolated words, connected words, or continuous speech), and the vocabulary size (small, medium, or large).

The complexity of the recognition task lies in several problems: the acoustic variability (each person pronounces sounds differently when speaking), acoustic confusion (many words sound similar), the coarticulation problem (the characteristics of spoken sounds may vary depending on neighboring sounds), out of vocabulary words and spontaneous speech (interjections, pauses, doubts, false starts, repetitions of words, self-corrections, etc.), and environmental conditions (noise, channel distortion, bandwidth limitations, etc.). For these reasons, it is very important to try to detect and correct errors generated during the ASR process, since the output of the ASR is the starting point of the other modules in a spoken dialog system.

During the last decades, the field of automatic speech recognition has progressed from the recognition of isolated words in reduced vocabularies to continuous speech recognition with increasing vocabulary sets. These advances have made the communication with dialog systems increasingly more natural. Among the variety of techniques used to develop ASR systems, the data-based approach is currently the most widely used. In this approach, the speech recognition problem can be understood as finding the word sequence W uttered by the user given a sequence of acoustic data A. This sequence can be determined by means of the following expression:

$$\begin{aligned} W = \arg \max _{W} P(W|A) \end{aligned}$$
(1)

Using the Bayes rule, the previous equation can be rewritten as follows:

$$\begin{aligned} P(W|A) = \frac{P(A|W)P(W)}{P(A)} \end{aligned}$$
(2)

where P(A|W) is the acoustic model (the probability of observing the acoustic sequence A given that the word sequence W has been uttered) and P(W) is provided by the language model (the probability of the sequence of words). The parameters of these models are learned from training data. The acoustic model is created by taking audio recordings of speech together with their transcriptions and compiling them into statistical representations of the sounds that make up the different words. Learning a language model requires transcriptions of sentences related to the application domain of the system. Since P(A) does not depend on the word sequence being searched, the previous expression can be rewritten as follows:

$$\begin{aligned} W = \arg \max _{W} P(A|W)P(W) \end{aligned}$$
(3)
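
To make Eq. (3) concrete, the following minimal Python sketch rescores a handful of candidate transcriptions by combining acoustic and language model scores in log space. The candidate sentences, their scores, and the language model weight are purely illustrative, not taken from any real recognizer.

```python
# Toy rescoring of candidate transcriptions according to Eq. (3): the best hypothesis
# maximizes P(A|W)P(W). All scores below are illustrative log-probabilities.
candidates = {
    "recognize speech":   {"log_acoustic": -12.1, "log_lm": -4.2},
    "wreck a nice beach": {"log_acoustic": -11.8, "log_lm": -9.7},
}

def total_score(scores, lm_weight=1.0):
    # Work in log space: log P(A|W) + lm_weight * log P(W)
    return scores["log_acoustic"] + lm_weight * scores["log_lm"]

best = max(candidates, key=lambda w: total_score(candidates[w]))
print(best)  # "recognize speech": its worse acoustic score is offset by a better language model score
```

In practice the search is performed over a huge lattice of hypotheses rather than an explicit list, but the scoring principle is the same.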

For the practical implementation of this approach, the most widely used solution consists of modeling the acoustic units by means of Hidden Markov Models (HMMs), as is the case in speech recognizers widely used by the scientific community, such as HTK (Hidden Markov Model Toolkit) or CMU Sphinx.

The success of HMMs is mainly based on the use of machine learning algorithms to learn the parameters of the model [61], as well as on their ability to represent speech as a sequential phenomenon over time. Multiple models have been studied, such as discrete, semicontinuous and continuous models, as well as a variety of model topologies.
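
As an illustration of the kind of decoding performed with HMMs, the sketch below applies the Viterbi algorithm to a toy discrete HMM with two states and three observation symbols. Every parameter is invented for illustration; real recognizers use thousands of context-dependent units and continuous acoustic features.

```python
import numpy as np

# Toy discrete HMM: two phone-like states and three acoustic symbols.
# All parameters are illustrative, not taken from a real recognizer.
states = ["s1", "s2"]
start_p = np.array([0.6, 0.4])        # P(initial state)
trans_p = np.array([[0.7, 0.3],       # P(next state | current state)
                    [0.4, 0.6]])
emit_p = np.array([[0.5, 0.4, 0.1],   # P(symbol | state)
                   [0.1, 0.3, 0.6]])

def viterbi(observations):
    """Return the most likely state sequence for a list of symbol indices."""
    T, N = len(observations), len(states)
    delta = np.zeros((T, N))           # best log-probability of paths ending in each state
    psi = np.zeros((T, N), dtype=int)  # back-pointers
    delta[0] = np.log(start_p) + np.log(emit_p[:, observations[0]])
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] + np.log(trans_p[:, j])
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] + np.log(emit_p[j, observations[t]])
    path = [int(np.argmax(delta[-1]))]  # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.insert(0, psi[t, path[0]])
    return [states[i] for i in path]

print(viterbi([0, 1, 2]))  # e.g. ['s1', 's1', 's2']
```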

The language model is one of the essential components required to develop a continuous speech recognizer. The most widely used language models are based on N-grams [3, 28] and on regular or context-free grammars [29, 67]. Grammars are usually suitable for small tasks, providing greater precision thanks to the restrictions they impose. However, they are not able to represent the great variability of natural speech.

N-gram models capture the word concatenations observed in the data more easily when a sufficient number of training samples is available. In an n-gram model, the probability \(P(w_1,\ldots ,w_m)\) of observing the sentence \(w_1,\ldots ,w_m\) is approximated as

$$\begin{aligned} P(w_1,\ldots ,w_m) = \prod ^m_{i=1} P(w_i\mid w_1,\ldots ,w_{i-1}) \approx \prod ^m_{i=1} P(w_i\mid w_{i-(n-1)},\ldots ,w_{i-1}) \end{aligned}$$
(4)

This equation assumes that the probability of observing the i-th word \(w_i\) given the context history of the preceding \(i-1\) words can be approximated by the probability of observing it given the shortened context history of the preceding \(n-1\) words (a Markov assumption of order \(n-1\)). The conditional probability can be calculated from n-gram frequency counts:

$$\begin{aligned} P(w_i\mid w_{i-(n-1)},\ldots ,w_{i-1}) = \frac{\mathrm {count}(w_{i-(n-1)},\ldots ,w_{i-1},w_i)}{\mathrm {count}(w_{i-(n-1)},\ldots ,w_{i-1})} \end{aligned}$$
(5)

Figure 2 shows an example of the estimation of bigram probabilities using the maximum likelihood estimate. Typically, the probabilities are not derived directly from the frequency counts. Instead, some form of smoothing is necessary, assigning part of the total probability mass to unseen words or n-grams. Various methods are used, from simple "add-one" smoothing (assign a count of 1 to unseen n-grams) to more sophisticated models, such as Good-Turing discounting or back-off models.

Fig. 2 Estimating bigram probabilities by means of the maximum likelihood estimate
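
The following sketch reproduces, on a toy corpus, the kind of estimation shown in Fig. 2: the maximum likelihood estimate of Eq. (5) together with a simple add-one smoothed variant. The corpus is invented for illustration.

```python
from collections import Counter

# Tiny toy corpus; <s> and </s> mark sentence boundaries.
corpus = [
    "<s> i want a ticket </s>",
    "<s> i want a flight </s>",
    "<s> a ticket please </s>",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

vocab_size = len(unigrams)

def bigram_mle(prev, word):
    """Maximum likelihood estimate P(word | prev), as in Eq. (5)."""
    return bigrams[(prev, word)] / unigrams[prev]

def bigram_add_one(prev, word):
    """Add-one (Laplace) smoothing, reserving probability mass for unseen bigrams."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

print(bigram_mle("i", "want"))           # 1.0  ("want" always follows "i" in this corpus)
print(bigram_add_one("i", "want"))       # (2 + 1) / (2 + 8) = 0.3
print(bigram_add_one("want", "please"))  # an unseen bigram still receives a small probability
```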

From around 2010, Deep Neural Networks (DNNs) have replaced HMM models. DNNs are now used extensively in industrial and academic research as well as in most commercially deployed ASR systems. Various studies have shown that DNNs outperform HMM models in terms of increased recognition accuracy [24, 68]. Deep Learning algorithms extract high-level, complex abstractions as data representations through a hierarchical learning process. As described in [53], a key benefit of Deep Learning is the analysis and learning of massive amounts of unsupervised data, making it a valuable tool for Big Data Analytics where raw data is largely unlabeled and uncategorized.

3 Spoken Language Understanding

Once the spoken dialog system has recognized what the user uttered, it is necessary to understand what was said [46, 50, 85]. Natural language processing is a method of obtaining the semantics of a text string and generally involves morphological, lexical, syntactic, semantic, discourse and pragmatic knowledge. In the first stage, lexical and morphological knowledge divides words into their constituents by distinguishing between lexemes and morphemes: lexemes are the parts of words that carry their semantics, and morphemes are the affixes that produce the different word classes.

Syntactic analysis yields the hierarchical structure of the sentences. However, in spoken language, phrases are frequently affected by the so-called disfluency phenomena: filled pauses, repetitions, syntactic incompleteness and repairs [18]. Semantic analysis extracts the meaning of a complex syntactic structure from the meaning of its constituent parts. In the pragmatic and discourse-processing stage, the sentences are interpreted in the context of the whole dialog; the main difficulties at this stage are the resolution of anaphora and of ambiguities derived from phenomena such as irony, sarcasm or double entendre.

The understanding process can be seen as a change in language representation, from natural language to a semantic language, such that the meaning of the message is preserved. As with the speech recognizer, the spoken language understanding module can work with several hypotheses (both for recognition and understanding) and confidence measures. There are currently two major approaches to the understanding problem: rule-based approaches and statistical models learned from dialog corpora.

Rule-based approaches extract semantic information based on a syntactic-semantic analysis of the sentences, using grammars defined for the task, or by means of the detection of keywords with semantic meanings. Some analyzers, in order to improve the robustness of the analysis, combine syntactic and semantic aspects of the specific task. Other techniques are based on an analysis at two levels, in which grammars are used to carry out a detailed analysis of the sentence and extract relevant semantic information. In addition, there are systems that use rule-based analyzers automatically learned from a training corpus using natural language processing techniques.

In the case of statistical methods, the process is based on the definition of linguistic units with semantic content and on obtaining models from labeled samples. This type of analysis [50, 67] uses a probabilistic model to identify concepts, case markers and their values, to represent the relationships between case markers and their values, and to semantically decode the user's utterances. The model is generated during a training (learning) phase, in which its parameters capture the correspondences between text inputs and semantic representations. Once the model has been trained, it is used as a decoder to generate the best representation.

The semantic definition is usually based on the concept of frame in most current dialog systems. In this approach, the representation generated by the spoken language understanding module contains concepts (the different types of queries that users can request from the system) and attributes (the information to be provided by the user to complete or modify the queries). Thus, every message sent by the spoken language understanding module to the dialog manager after each user utterance consists of a frame structure.
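
As an illustration, the sketch below shows one possible frame structure for a hypothetical train timetable domain; the concept and attribute names are invented for this example and do not correspond to any particular system.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Frame:
    """Frame-style semantic representation sent by the SLU module to the dialog manager."""
    concept: str                                        # type of query requested by the user
    attributes: Dict[str, Optional[str]] = field(default_factory=dict)

# Hypothetical SLU output for "I want a train from Madrid to Valencia tomorrow morning".
frame = Frame(
    concept="Timetable-Query",
    attributes={
        "origin": "Madrid",
        "destination": "Valencia",
        "date": "tomorrow",
        "time": "morning",
        "train_type": None,   # not provided yet; the dialog manager may ask for it
    },
)
print(frame.concept, frame.attributes)
```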

4 Dialog Management

Although dialog management is only a part of the development cycle of spoken dialog systems, it can be considered one of the most demanding tasks given that this module encapsulates the logic of the speech application [81]. [77] state that dialog management involves four main tasks: (i) updating the dialog context, (ii) providing a context for sentence interpretation, (iii) coordinating other modules and (iv) deciding the information to convey to the user and when to do it. Thus, the selection of a specific system action depends on multiple factors, such as the output of the speech recognizer (e.g., measures that define the reliability of the recognized information), the dialog interaction and previous dialog history (e.g., the number of repairs carried out so far), the application domain (e.g., guidelines for customer service), knowledge about the users, and the responses and status of external back-ends, devices, and data repositories. Given that the actions of the system directly impact users, the dialog manager is largely responsible for user satisfaction. This way, the design of an appropriate dialog management strategy is at the core of dialog system engineering.

Statistical approaches to dialog management present several important advantages with regard to traditional rule-based methodologies. Rather than maintaining a single hypothesis for the dialog state, they maintain a distribution over many hypotheses for the correct dialog state. In addition, statistical methodologies choose actions using an optimization process, in which a developer specifies high-level goals and the optimization works out the detailed dialog plan. For instance, Hoxha and Weng have recently proposed a mixed-initiative dialog-based approach to support autonomous clinical data access, and have recommended the technology development and communication studies needed to accelerate clinical research [26].

Automating dialog management is useful for developing, deploying and re-deploying applications and also reducing the time-consuming process of hand-crafted design. In fact, the application of machine learning approaches to dialog management strategy design is a rapidly growing research area. Machine-learning approaches to dialog management attempt to learn optimal strategies from corpora of real human-computer dialog data using automated “trial-and-error” methods instead of relying on empirical design principles [86]. The main trend in this area is an increased use of data for automatically improving the performance of the system.

Statistical models can be trained with corpora of human-computer dialogs with the goal of explicitly modeling the variance in user behavior that can be difficult to address by means of hand-written rules [66]. Additionally, if it is necessary to satisfy certain deterministic behaviors, it is possible to extend the strategy learned from the training corpus with handcrafted rules that include expert knowledge or specifications about the task [32, 72, 75, 87].

The goal is to build systems that exhibit more robust performance, improved portability, better scalability and easier adaptation to other tasks. However, model construction and parameterization is dependent on expert knowledge, and the success of statistical approaches is dependent on the quality and coverage of the models and data used for training [66]. Moreover, the training data must be correctly labeled for the learning process. The size of currently available annotated dialog corpora is usually too small to sufficiently explore the vast space of possible dialog states and strategies. Collecting a corpus with real users and annotating it requires considerable time and effort.

To address these problems, researchers have proposed alternative techniques that facilitate the acquisition and labeling of corpora, such as Wizard of Oz [16, 31], bootstrapping [1, 15], active learning [9, 41], automatic dialog act classification and labeling [56, 79], and user simulation [43, 66].

Another relevant problem is how to deal with unseen situations, that is, situations that may occur during the dialog and that were not considered during training. To address this point it is necessary to employ generalizable models in order to obtain appropriate system responses that allow the dialog to continue in a satisfactory way.

Another difficulty is in the design of a good dialog strategy, which in many cases is far from being trivial. In fact, there is no clear definition of what constitutes a good dialog strategy [35, 66]. Users are diverse, which makes it difficult to foresee which form of system behavior will lead to quick and successful dialog completion, and speech recognition errors may introduce uncertainty about their intention.

The most widespread methodology for machine-learning of dialog strategies consists of modeling human-computer interaction as an optimization problem using Markov Decision Processes (MDP) and reinforcement methods [38, 70]. The main drawback of this approach is that the large state space of practical spoken dialog systems makes its direct representation intractable [89]. Partially Observable MDPs (POMDPs) outperform MDP-based dialog strategies since they provide an explicit representation of uncertainty [63]. This enables the dialog manager to avoid and recover from recognition errors by sharing and shifting probability mass between multiple hypotheses of the current dialog state.

An approach that scales the POMDP framework to practical spoken dialog systems through the definition of two state spaces is presented in [88]. Approximate algorithms have also been developed to overcome the intractability of exact algorithms, but even the most efficient of these techniques, such as Point-Based Value Iteration (PBVI), cannot scale to the many thousands of states required by a statistical dialog manager [82]. Composite Summary Point-Based Value Iteration (CSPBVI) uses a small summary space for each slot in which PBVI policy optimization can be applied. However, policy learning in this technique can only be performed offline, i.e. at design time, because policy training requires an existing accurate model of user behavior. An alternative technique for online training based on Q-learning is presented in [73], which allows the system to adapt to real users as new dialogs are recorded. This technique does not require any model of user behavior, so user simulation techniques are proposed to iteratively learn the dialog model.
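
As a rough illustration of how a dialog policy can be optimized by reinforcement learning against a simulated user, the sketch below applies tabular Q-learning to a toy two-slot task. The state representation, action set, reward values and simulated user are all invented for illustration and are far simpler than those required by practical (PO)MDP dialog managers.

```python
import random

SLOTS = ("origin", "destination")
ACTIONS = ["ask_origin", "ask_destination", "close"]

def simulate_turn(state, action):
    """Toy environment: the simulated user answers a request correctly 80% of the time."""
    filled = set(state)
    if action == "close":
        reward = 20 if filled == set(SLOTS) else -20   # task success vs. premature closing
        return state, reward, True
    slot = action.split("_", 1)[1]
    if random.random() < 0.8:                          # successful recognition of the answer
        filled.add(slot)
    return tuple(sorted(filled)), -1, False            # small cost per turn

Q = {}                                                 # tabular action-value function
alpha, gamma, epsilon = 0.1, 0.95, 0.2

def q(state, action):
    return Q.get((state, action), 0.0)

for episode in range(5000):
    state, done = (), False
    while not done:
        if random.random() < epsilon:                  # epsilon-greedy exploration
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q(state, a))
        next_state, reward, done = simulate_turn(state, action)
        target = reward if done else reward + gamma * max(q(next_state, a) for a in ACTIONS)
        Q[(state, action)] = q(state, action) + alpha * (target - q(state, action))
        state = next_state

# After training, the learned policy typically asks for the missing slots and then closes.
for s in [(), ("origin",), ("destination",), ("destination", "origin")]:
    print(s, max(ACTIONS, key=lambda a: q(s, a)))
```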

Other authors have combined conventional dialog managers with a fully-observable Markov decision process [21, 71], or proposed using multiple POMDPs and selecting actions using hand-crafted rules [82]. In [84], the authors combine the robustness of the POMDP with the developer control afforded in conventional approaches: the (conventional) dialog manager and POMDP run in parallel, but the dialog manager is augmented so that it outputs one or more allowed actions at each time-step. The POMDP then chooses the best action from this limited set. Results from a real voice dialer application show that adding the POMDP machinery to a standard dialog system can yield a significant improvement [84].

Other interesting approaches to statistical dialog management are based on modeling the system by means of Hidden Markov Models [10], stochastic Finite-State Transducers [25, 27, 60], or Bayesian Networks [49, 57]. In addition, [33] proposed a different hybrid approach to dialog modeling in which n-best recognition hypotheses are weighted using a mixture of expert knowledge and data-driven measures, by means of an agenda and an example-based machine translation approach, respectively.

5 Natural Language Generation

Natural language generation is the process of obtaining texts in natural language from a non-linguistic representation [34, 42]. It is usually carried out in five steps: content organization, content distribution in sentences, lexicalization, generation of referential expressions, and linguistic realization. It is important to obtain legible messages, optimizing the text with referring expressions and linking words, and adapting the vocabulary and the complexity of the syntactic structures to the user's linguistic expertise.

The simplest approach consists of using predefined text messages (e.g. error messages and warnings). Although intuitive, this approach completely lacks flexibility. The next level of sophistication is template-based generation, in which the same message structure is produced with slight alterations. The template approach is used mainly for multi-sentence generation, particularly in applications whose texts are fairly regular in structure, such as business reports.
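
A minimal sketch of template-based generation is shown below, assuming a hypothetical timetable-information domain; the template names and slots are illustrative only.

```python
# Illustrative templates: the same message structure is reused with slight alterations.
TEMPLATES = {
    "inform_departure": "The next train to {destination} leaves at {time} from platform {platform}.",
    "ask_slot": "Please tell me the {slot} of your trip.",
    "error": "Sorry, I did not understand you. Could you repeat that?",
}

def generate(template_name, **values):
    """Fill a fixed message structure with the slot values supplied by the dialog manager."""
    return TEMPLATES[template_name].format(**values)

print(generate("ask_slot", slot="destination"))
print(generate("inform_departure", destination="Valencia", time="10:15", platform="4"))
```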

Phrase-based systems employ what can be considered as generalized templates at the sentence level (in which case the phrases resemble phrase structure grammar rules), or at the discourse level (in which case they are often called text plans). In such systems, a pattern is first selected to match the top level of the input, and then each part of the pattern is expanded into a more specific one that matches some portion of the input. The cascading process stops when every pattern has been replaced by one or more words.

Finally, feature-based systems represent the maximum level of generalization and flexibility. In feature-based systems, each possible minimal alternative of expression is represented by a single feature: for example, whether the sentence is positive or negative, whether it is a question, an imperative or a statement, or what its tense is. Arranging the features requires linguistic knowledge. Another alternative is corpus-based natural language generation [54], which stochastically generates system utterances.

6 Text-To-Speech Synthesis

Text-to-speech synthesizers transform a text into an acoustic signal [11]. A text-to-speech system is composed of two parts: a front-end and a back-end. The front-end carries out two major tasks. Firstly, it converts raw text containing symbols such as numbers and abbreviations into their equivalent words. This process is often called text normalization, pre-processing, or tokenization. Secondly, it assigns a phonetic transcription to each word, and divides and marks the text into prosodic units, i.e. phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. The output of the front-end is a symbolic representation constituted by the phonetic transcriptions and the prosody information.
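
The sketch below illustrates a tiny fraction of the text normalization step, expanding a few abbreviations and spelling out digits. The rules are invented for illustration; a real front-end relies on much richer lexica and on language-dependent grammars for numbers, dates, currency and so on.

```python
import re

# Minimal, illustrative normalization rules.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}
DIGITS = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]

def expand_number(match):
    # Spell out each digit; a real system would verbalize full numbers, dates, currency...
    return " ".join(DIGITS[int(d)] for d in match.group())

def normalize(text):
    """Toy text normalization: lowercase, expand abbreviations and digits."""
    text = text.lower()
    for abbr, expansion in ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)
    return re.sub(r"\d+", expand_number, text)

print(normalize("Dr. Smith lives at 221 Baker St."))
# -> "doctor smith lives at two two one baker street"
```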

The back-end (often referred to as the synthesizer) converts the symbolic linguistic representation into sound. On the one hand, speech synthesis can be based on models of human speech production. This is the case of parametric synthesis, which simulates the physiological parameters of the vocal tract, and formant-based synthesis, which models the vibration of the vocal cords. In this technique, parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech. Another approach based on physiological models is articulatory synthesis, which refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes.

On the other hand, concatenative synthesis employs pre-recorded units of human voice. Concatenative synthesis is based on stringing together segments of recorded speech. It generally produces the most natural-sounding synthesized speech; however, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. The quality of the synthesized speech depends on the size of the synthesis unit employed.

Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, syllables, morphemes, words, phrases, and sentences. Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing to the recorded speech. There is a trade-off between the intelligibility and naturalness of the voice output and the degree of automation of the synthesis procedure. For example, synthesis based on whole words is more intelligible than phone-based synthesis, but each new word requires a new recording, whereas phones allow any new word to be built. At one extreme, domain-specific synthesis concatenates pre-recorded words and phrases to create complete utterances. It is used in applications in which the variety of texts the system will produce is limited to a particular domain, like transit schedule announcements or weather reports.
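
The sketch below illustrates the unit-selection idea with a greedy left-to-right search that minimizes the sum of a target cost (mismatch with the desired prosody) and a concatenation cost (spectral discontinuity with the previously chosen unit). Real systems perform a Viterbi search over all candidate units and use much richer features; the database entries, features and costs here are invented for illustration.

```python
def target_cost(candidate, target):
    # Mismatch between the candidate unit and the desired prosodic specification.
    return abs(candidate["pitch"] - target["pitch"]) + abs(candidate["duration"] - target["duration"])

def join_cost(prev, candidate):
    # Spectral discontinuity at the join; zero for the first unit.
    return 0.0 if prev is None else abs(prev["end_spectrum"] - candidate["start_spectrum"])

def select_units(targets, database):
    """Greedy left-to-right selection of recorded units (a sketch, not a full Viterbi search)."""
    chosen, prev = [], None
    for target in targets:
        candidates = database[target["phone"]]
        best = min(candidates, key=lambda c: target_cost(c, target) + join_cost(prev, c))
        chosen.append(best)
        prev = best
    return chosen

database = {
    "a": [{"pitch": 120, "duration": 80, "start_spectrum": 0.2, "end_spectrum": 0.4},
          {"pitch": 140, "duration": 95, "start_spectrum": 0.5, "end_spectrum": 0.6}],
    "t": [{"pitch": 0, "duration": 40, "start_spectrum": 0.4, "end_spectrum": 0.1}],
}
targets = [{"phone": "a", "pitch": 130, "duration": 90},
           {"phone": "t", "pitch": 0, "duration": 40}]
print(select_units(targets, database))
```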

At the other extreme, diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a language. The number of diphones depends on the phonotactics of the language: for example, Spanish has about 800 diphones and German about 2,500. In diphone synthesis, only one example of each diphone is contained in the speech database. Finally, HMM-based synthesis is a method in which the frequency spectrum (vocal tract), fundamental frequency (vocal source), and duration (prosody) of speech are modeled simultaneously by HMMs. Speech waveforms are generated from HMMs themselves, based on the maximum likelihood criterion.

7 User Modeling and Evaluation of the System

Research in techniques for user modeling has a long history within the fields of language processing and conversational agents. The main purpose of a simulated user in this field is to improve the usability of a conversational agent through the generation of corpora of interactions between the system and simulated users [52], reducing time and effort required for collecting large samples of interactions with real users. Moreover, each time changes are made to the system it is necessary to collect more data in order to evaluate the changes. Thus, the availability of large corpora acquired with a user simulator should contribute positively to the development of the system.

User simulators can be used to evaluate different aspects of a conversational agent, particularly at the earlier stages of development, or to determine the effects of changes to the system's functionalities (e.g., to evaluate confirmation strategies or to introduce errors or unpredicted answers in order to assess the capacity of the dialog manager to react to unexpected situations). A second usage, in which we are mainly interested in this contribution, is to support the automatic learning of optimal dialog strategies using statistical methodologies. Large amounts of data are required for a systematic exploration of the dialog state space, and corpora acquired with simulated users are extremely valuable for this purpose.

Two main approaches can be distinguished to the creation of simulated users: rule based and data or corpus based. In a rule-based simulated user the researcher can create different rules that determine the behavior of the system [8, 39, 44]. This approach is particularly useful when the purpose of the research is to evaluate the effects of different dialog management strategies. In this way the researcher has complete control over the design of the evaluation study.

Data-based user models rely on probabilistic methods to generate the user input, with the advantage that this uncertainty can better reflect the unexpected behaviors of users interacting with the system. Statistical models of user behavior have been suggested as a solution to the lack of data required for training and evaluating dialog strategies. Using this approach, the dialog manager can explore the space of possible dialog situations and learn new, potentially better strategies. Methodologies based on learning user intentions aim to optimize dialog strategies. A summary of user simulation techniques for reinforcement learning of the dialog strategy can be found in [66].

The most extended methodology for machine learning of dialog strategies consists of modeling human-computer interaction as an optimization problem using Markov Decision Processes (MDPs) and reinforcement methods [38]. The main drawback of this approach is that the large state space of practical spoken dialog systems makes its direct representation intractable. Although Partially Observable MDPs (POMDPs) outperform MDP-based dialog strategies, they are limited to small-scale problems, since the state space would be huge and exact POMDP optimization is again intractable [83].

In [12, 13], Eckert, Levin and Pieraccini introduced the use of statistical models to predict the next user action by means of an n-gram model. The proposed model has the advantage of being both statistical and task-independent. Its weak point is that it approximates the complete history of the dialog by a bigram model. In [38], the bigram model is modified by considering only the set of possible user answers following a given system action (the Levin model). Both models have the drawback of assuming that every user response depends only on the previous system turn. Therefore, the simulated user can change objectives continuously or repeat information previously provided.
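
The following sketch shows a toy Levin-style simulated user that samples its next action conditioned only on the previous system action. The actions and probabilities are invented for illustration, and the example also makes the limitation just mentioned visible: the sampled response ignores everything said earlier in the dialog.

```python
import random

# Toy Levin-style user model: P(user action | last system action).
USER_MODEL = {
    "ask_origin":      {"provide_origin": 0.80, "provide_origin_and_destination": 0.15, "hang_up": 0.05},
    "ask_destination": {"provide_destination": 0.85, "repeat_origin": 0.10, "hang_up": 0.05},
    "confirm":         {"affirm": 0.70, "negate": 0.30},
}

def simulated_user_response(system_action):
    """Sample the next user action conditioned only on the previous system action."""
    actions, probs = zip(*USER_MODEL[system_action].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(simulated_user_response("ask_origin"))
```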

Georgila, Henderson and Lemon propose the use of HMMs, defining a more detailed description of the states and considering an extended representation of the history of the dialog [17]. Dialog is described as a sequence of Information States [7]. Two different methodologies are described to select the next user action given a history of information states. The first method uses n-grams [12], but with values of n from 2 to 5 in order to consider a longer history of the dialog; the best results are obtained with 4-grams. The second methodology is based on a linear combination of 290 features to calculate the probability of every action for a specific state.

Cuayáhuitl et al. present a method for dialog simulation based on HMMs in which both user and system behaviors are simulated [10]. Instead of training only a generic HMM model to simulate any type of dialog, the dialogs of an initial corpus are grouped according to the different objectives. A submodel is trained for each one of the objectives, and a bigram model is used to predict the sequence of objectives.

In [64], a new technique for user simulation based on explicit representations of the user goal and the user agenda is presented. The user agenda is a structure that contains the pending user dialog acts that are needed to elicit the information specified in the goal. This model formalizes human-machine dialogs at a semantic level as a sequence of states and dialog acts. An EM-based algorithm is used to estimate optimal parameter values iteratively. In [65], the agenda-based simulator is used to train a statistical POMDP-based dialog manager.

A data-driven user intention simulation method that integrates diverse user discourse knowledge (cooperative, corrective, and self-directing) is presented in [30]. User intention is modeled based on logistic regression and the Markov logic framework. Human dialog knowledge is organized into two layers, domain and discourse knowledge, and integrated with the data-driven model at generation time. A methodology of user simulation applied to the evaluation and refinement of stochastic dialog systems is presented in [76]. The proposed user simulator incorporates several knowledge sources, combining statistical and heuristic information to enhance the dialog models through automatic strategy learning. As described in the following section, our proposed user simulation technique is based on a classification process that considers the complete dialog history, likewise incorporating several knowledge sources and combining statistical and heuristic information to enhance the dialog models through automatic strategy learning.

In the area of user modeling and dialog systems, emotion has been used for several purposes, as summarized in the taxonomy of applications proposed in [5]. In some application domains, it is fundamental to recognize the affective state of the user in order to adapt the system's behavior. For example, in emergency services [6] or intelligent tutors [40], it is necessary to know the user's emotional state to calm them down, or to encourage them in learning activities. In other application domains, it can also play an important role in resolving stages of the dialog that cause negative emotional states, avoiding them and fostering positive ones in future interactions. Bain recently presented a proposal to extract emotional information using cloud-based Big Data infrastructure and mobile devices [4]. Hosain et al. have also very recently proposed an infrastructure that combines the potential of emotion-aware Big Data and cloud technology towards the future generation of mobile communication technologies (5G) [69].

8 Future Research and Challenges

Throughout the last years, some experts have dared to envision what the future research guidelines in the application of multimodal dialog systems for educative purposes would be, based on the advances in Big Data research. These objectives have gradually shifted towards ever more complex goals, such as providing the system with advanced reasoning, problem-solving capabilities, adaptiveness, proactiveness, affective intelligence, and multilinguality. These concepts are not mutually exclusive; for example, the system's intelligence is also involved in the degree to which it can adapt to new situations, and this adaptiveness can result in better portability for use in different environments.

As can be observed, these new objectives refer to the system as a whole, and represent major trends that in practice are achieved through joint work in different areas and components of the dialog system. Thus, current research trends are characterized by large-scale objectives which are shared out between the different researchers in different areas.

Proactiveness is necessary for computers to stop being considered mere tools and to become real conversational partners. Proactive systems have the capability of engaging in a conversation with the user even when the user has not explicitly requested the system's intervention. This is a key aspect in the development of ubiquitous computing architectures in which the system is embedded in the user's environment, so that users are not aware that they are interacting with a computer, but rather perceive that they are interacting with the environment. To achieve this goal, it is necessary to provide the systems with problem-solving capabilities and context-awareness.

Adaptivity may also refer to other aspects of speech applications. There are different levels at which the system can adapt to the user. The simplest one is through personal profiles, in which users make static choices to customize the interaction. Systems can also adapt to the users' environment, for example in ambient intelligence applications such as the ubiquitous proactive systems described above. A more sophisticated approach is to adapt to the user's knowledge and expertise. This is especially important in educative systems, where the system should take into account the specific evolution of each student, their previous uses of the system, and the errors made during previous interactions.

There is also an increasing interest in the development of multimodal conversational systems that dynamically adapt their conversational behaviors to the users’ affective state. The empathetic educative agent can thus indeed contribute to a more positive perception of the interaction.

Portability is currently addressed from very different perspectives, the three main ones being domain, language and technological independence. Ideally, systems should be able to work over different educative application domains, or at least be easily adaptable between them. Current studies on domain independence center on how to merge lexical, syntactic and semantic structures from different contexts and how to develop dialog managers that deal with different domains.

Finally, technological independence deals with the possibility of using multimodal systems with different hardware configurations. Computer processing power will continue to increase, with lower costs for both processor and memory components. The systems that support even the most sophisticated multimodal applications will move from centralized architectures to distributed configurations and thus must be able to work with different underlying technologies.

9 Conclusions

Dialog systems appeared as a technology aimed at sustaining conversations with their users that could be considered natural and human-like. However, to achieve this long pursued objective, these systems must be able to operate in a wide range of domains and tasks, some of them difficult to process and complex [19], for which being able to learn from massive amounts of data becomes crucial to show appropriate behaviors.

We have addressed Big Data as (1) new sources of huge amounts of data, and (2) the novel machine learning and information extraction algorithms that have appeared to process them. On the one hand, web pages and searches, social networks, blog posts, and emails all provide an invaluable source of natural language resources. Similarly, voice calls, recorded dialogs and conversations have a huge potential to provide insights into human conversational behavior. However, manually examining such Big Data is laborious and error-prone. On the other hand, the emergence of different statistical approaches has made it possible to accurately analyze unstructured data, with a double benefit: to be less dependent on intensive manual annotation, and to gain a better understanding of human conversation by learning more accurate models from a more representative amount of data.

In this chapter we have discussed the tremendous potential of Big Data to improve several aspects of dialog system research and development, including speech processing, natural language understanding and dialog management.