
After reading this chapter, you should know the answers to these questions:
  • What is a dialog system and how can it be used in patient- and consumer-facing systems in medicine?

  • What are the main approaches to the implementation of dialog systems? What are the limitations of these approaches?

  • How are dialog systems evaluated?

  • What are some of the safety issues in fielding patient- and consumer-facing dialog systems in medicine?

Introduction to Dialog Systems

People most commonly communicate with each other not in isolated utterances, but in interleaved sequences of utterances wrapped in ritualized behavior that we colloquially refer to as conversations. Developing natural language interfaces that can move beyond single transactions of user query/system response to fully engage users in conversation would benefit a variety of applications. At a minimum, once the information that needs to be exchanged extends beyond what can be expressed in a single utterance, dialog becomes imperative. Beyond this, dialog is essential for performing tasks that require multiple natural language exchanges with a user in a coherent manner, for example a series of questions and responses that automates an interactive, incremental differential diagnosis. Certainly, the emulation of any kind of counseling session or interview to produce automated patient-facing health education systems requires complex goal-oriented dialog management that spans many interleaved patient and system messages. In addition, interleaved sequences of messages allow a listener to confirm understanding or request clarification of information provided (a process referred to as “grounding”). Only in dialog can a conversational task (e.g., diagnosis or health counseling) be dynamically decomposed into sub-tasks in a coherent manner.

To ground our discussion, Fig. 9.1 shows an excerpt of a dialog between a study nurse and a patient about informed consent for an oncology clinical trial. There are several interesting things to note for those who have not studied natural conversations before. First, spontaneous conversation is full of disfluencies: there are very few grammatically complete and correct sentences, and “filler words” such as “um” (as in line #11) are very common. Second, conversational turns can range from a single word to many sentences in length. Third, a great deal of conversation is spent establishing mutual understanding of what was said: the patient feedback at lines #2 and #10, and the patient query at line #14, all serve exclusively to ensure that both parties understand each other, at least well enough for the purpose at hand. Fourth, only one person can talk at a time, and people are generally very good at coordinating their use of the speech channel, but overlaps, pauses (as in line #11), and interruptions (as in line #15) are common. Finally, conversation typically makes extensive use of “deixis”: reference to the immediate physical context or to what was said before (for example, line #1 refers to the current day, and line #9 refers to the consent form being handed to the patient). Designing automated dialog systems that can participate in these kinds of conversations, for example taking the role of the study nurse here to automate administration of informed consent, represents an aspirational goal for dialog systems researchers. However, the state of the art is quite far from achieving this level of performance.

Fig. 9.1

Excerpt of nurse-patient dialog for administration of oncology clinical trial informed consent

Chapter 7 introduced natural language processing (NLP). In this chapter we review the state of the art in dialog systems, a sub-field of NLP, including text-based chatbots, speech-based conversational assistants, and multimodal embodied conversational agents that simulate face-to-face conversation, for both provider- and patient-facing biomedical applications.

Definitions and Scope

Dialog has been defined as a conversational exchange between two or more entities. For the purposes of this chapter, we will be concerned with communicative exchanges between a human (health professional, patient, or consumer) and an automated system in which messages are in textual or spoken natural language. This system may also be augmented with additional information such as the nonverbal behavior used by humans in face-to-face conversation (hand gestures, facial displays, eye gaze, etc.). We refer to an isolated message from one entity within a dialog as an utterance. While a dialog can consist of a single utterance, we are primarily concerned with dialog in which several utterances from two entities are interleaved in order to accomplish some task. Discourse is a generalization of dialog that also includes the study of written text comprising multiple sentences.

Discourse theory is generally concerned with how multiple utterances fit together to specify meaning. Theories of discourse generally assume that discourses are composed of discourse segments (consisting of one or more adjacent utterances), organized according to a set of rules. Beyond this, however, discourse theories vary widely in how they define discourse segments and the nature of the inter-segment relationships. Some define these relationships to be a function of surface structure (e.g., based on categories of utterance function, such as request or inform, called “speech acts” [1]), while others posit that these relationships must be a function of the intentions (plans and goals) of the individuals engaged in conversation [2, 3]. In addition, researchers developing computational models of discourse and dialog have included a number of other constructs in their representation of discourse context, including: entities previously mentioned in the conversation; topics currently being discussed (e.g., “questions under discussion” [4]); and information structure, which indicates which parts of utterances contribute new information to the conversation as opposed to those parts that serve mainly to tie new contributions back to earlier conversation [5].

Discourse theory also seeks to provide accounts of a wide range of phenomena that occur in naturally-occurring dialog, including: mechanisms for conversation initiation, termination, maintenance and turn-taking; interruptions; speech intonation (used to convey a range of information about discourse context [6]); discourse markers (words or phrases like “anyway” that signal changes in discourse context [7]); discourse ellipsis (omission of a syntactically required phrase when the content can be inferred from discourse context); grounding (how speaker and listener negotiate and confirm the meaning of utterances through signals such as headnods and paraverbals such as “uh huh” [8]); and indirect speech acts (e.g., when a speaker says “do you have the time?” they want to know the time rather than simply wanting to know whether the hearer knows the time or not [9]).

Box 9.1 Definition

Dialog systems are computational artifacts designed to engage humans in dialog, as defined above. Intelligent agents are autonomous, goal-directed computational artifacts. Conversational agents are intelligent agents that converse with humans via a dialog system interface. Conversational assistants are conversational agents that use speech input and output to perform a wide range of tasks, as exemplified by the now ubiquitous Siri, Amazon Alexa, and Google Home products.

Embodied Conversational Agents (ECAs) are conversational agents that include the ability to use human-like conversational nonverbal behavior in their dialog (Fig. 9.2). ECAs are animated humanoid computer-based characters that use speech, eye gaze, hand gesture, facial expression and other nonverbal modalities to emulate the experience of human face-to-face conversation with their users [10]. Such agents can provide a “virtual consultation” with a simulated health provider, offering a natural and accessible source of information for patients. These agents represent one form of multimodal dialog system, in which the nonverbal modalities are recognized and produced in addition to accompanying text or speech, to more fully understand the user’s communicative intent and to better express system output. In addition to carrying additional factual information, nonverbal behavior is also used in face-to-face conversation to regulate the interaction structure itself, for example, gaze and intonation to regulate turn-taking behavior, and body position and orientation to regulate conversation initiation and termination. Nonverbal behavior is also particularly effective for conveying affective and relational cues that may be important for establishing patient trust in and working alliance with the ECA [11].

Fig. 9.2

Embodied conversational agent for patient education at hospital discharge

What’s Hard About Getting Machines to Engage in Spontaneous Human Conversation?

People leverage a complex set of processes to make conversation work, most of which are entirely automatic and unconscious. They assume that an entity that engages them in what appears to be natural language dialog has these abilities until they discover its limitations. Several of these processes, such as conversation initiation and termination, turn-taking, and grounding, were mentioned above. Additional examples include: deixis, referring in language to something in the speakers’ mutual context (object, time, location, social relationship); anaphoric or cataphoric references (referring to something said earlier or later in the dialog); and conversational framing [12] or layering [13], in which different styles or genres of talk are used to change how utterances are interpreted (e.g., symptom inquiry by a third party made within social chat storytelling occurring within the context of a clinical interview). There are many more conversational processes and linguistic phenomena that together make the seemingly effortless task of a water cooler conversation seem miraculous upon close inspection.

Fortunately, most of these conversational processes can be “compiled out” by tightly constraining what a user is allowed to do, or by greatly lowering their expectations. System-initiated dialog that rigidly walks a user through a series of steps generally avoids the need to engage in many of these processes. Similarly, a system that engages a patient in scripted greeting and small talk at the start of a health education session does not need a computational model of conversational frames. Agents that only provide responses to single utterance user queries (such as popular conversational assistants like Siri) have trained users to not expect any conversational behavior beyond these simple exchanges.

Machine Learning and Dialog Systems

In the research community, the dominant modern approach to dialog systems is now based on machine learning (ML; see Chap. 6). Learning-based approaches to dialog permit flexibility and avoid the need for exhaustive manual engineering of rules. ML-based approaches to dialog have yielded strong empirical performance, although measuring this is a challenge ([14]; see the section on “Automated Metrics for End-to-End Architectures”). However, building such systems requires training data (i.e., example conversations) from which to learn, which may not be available in all domains, and can be prohibitively expensive to collect. Moreover, it is difficult to control the outputs of machine learning models, and so deploying such systems in the context of healthcare applications may be a risky endeavor.

History of Dialog Systems in Healthcare

Chapter 2 reviewed the history of AI in medicine; here we focus on the development of dialog systems specifically. One of the very earliest dialog systems developed was produced as a demonstration of a patient-facing psychotherapy counseling agent (see Chap. 2). The ELIZA system was developed to simulate the behavior of a Rogerian psychotherapist, in which the patient and the computer exchanged typed text messages [15]. Although ELIZA was not intended to be used for actual therapy, similar systems have been proven effective for therapy in which the system is essentially prompting a patient to think aloud and work through his or her own problems [16]. An example conversation with ELIZA is shown in Fig. 9.3.

Fig. 9.3

Example Conversation with ELIZA

Colby developed an ELIZA-like system that was designed to use Cognitive Behavioral Therapy to treat individuals with depression. In addition to providing typed text counseling with patients, the system provided text-based educational materials about depression [17]. These systems are characterized by system responses that are coherent only with the immediately preceding user utterance, implemented using pattern-response rules that are matched to the user input with regular expressions, and template-based text generation of system responses. They also use a variety of techniques to maintain the illusion of coherent dialog, including: maintaining system-initiated dialog; having most system outputs prompt the user with open-ended questions; relying on the user’s sense-making ability to infer coherent explanations for the system’s outputs; and reflecting the user’s inputs back to them with minor wording changes in order to give the illusion of understanding what the user is saying. This approach to dialog system implementation is widely used in “chatbots” deployed on the web for entertainment, marketing, and sales applications, and has given rise to an open standard chatbot implementation language (AIML [18]).
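To make this concrete, the following is a minimal Python sketch of this pattern-response approach; the rules and wording are invented for illustration and are not ELIZA’s actual script:

```python
import re
import random

# Illustrative pattern-response rules: each maps a regular expression over the
# user's input to response templates. "{0}" is filled with the text captured
# by the first group, reflecting the user's own words back to them.
RULES = [
    (re.compile(r"\bi feel (.+)", re.I),
     ["Why do you feel {0}?", "How long have you felt {0}?"]),
    (re.compile(r"\bmy (mother|father|family)\b", re.I),
     ["Tell me more about your {0}."]),
    (re.compile(r".*"),  # fallback rule keeps the conversation going
     ["Please go on.", "Can you say more about that?"]),
]

def respond(user_utterance: str) -> str:
    """Generate a response coherent only with the immediately preceding input."""
    for pattern, templates in RULES:
        match = pattern.search(user_utterance)
        if match:
            return random.choice(templates).format(*match.groups())

print(respond("I feel anxious about my test results"))
# e.g., "Why do you feel anxious about my test results?" -- note the pronoun
# error; ELIZA-style systems therefore also swap "my" -> "your", "I" -> "you".
```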

Work dating from 1964 also produced systems that could collect a medical history from patients [19]. Unlike ELIZA, these systems conducted system-initiated dialog only, asking patients a series of questions with highly constrained patient input (mostly YES/NO questions) to drive branching logic. Research and development of these systems has continued, and some commercial tools are available, although they still have not attained wide use in clinical practice [20].

Some of the earliest work in physician-facing medical expert systems used system-initiated dialog to interact with providers for decision support. MYCIN was an early rule-based expert system that identified bacteria causing an infection and recommended antibiotics [21] (see Chap. 2). It was designed to interact with physicians by asking a series of very constrained questions requiring one- or two-word responses. In fact, it was a desire to avoid having to implement natural language understanding that led to the use of MYCIN’s core backward-chaining diagnostic algorithm.

By using a backward-chained approach, MYCIN controlled the dialogue and therefore could ask specific questions that generally required one- or two-word answers. ([21], p. 601)

MYCIN (and derivative projects) used various text generation techniques to produce their final output case summaries.

The sections on “Example Patient- and Consumer-facing Dialog Systems” and “Example Provider-facing Dialog Systems” provide more recent examples of patient- and provider-facing medical dialog systems.

In the last decade, deep neural network-based methods trained on massive corpora have come to dominate Natural Language Processing (NLP) [22]. These methods have enabled highly accurate automatic speech transcription tasks [23] and improved NLP system performance across a variety of problems, including building conversational agents [24].

One means for building dialog systems entails specifying models that map user input utterances directly to output utterances (“end-to-end” systems). This can yield strong performance with respect to the fluency of outputs, but such systems can struggle to maintain coherence throughout a dialog [25]. While such text-to-text models have been used in the context of task-oriented dialog systems [26, 27], they may be more suitable to “general domain” conversational agents—i.e., general “chatbots”—as such models are not naturally amenable to guiding “goal-based” dialog. This reflects the myopic optimization strategy used to estimate model parameters: Typically one aims to find parameters that make the model as likely as possible to produce the words comprising response utterances in the observed training data. This optimization criterion does not explicitly encode higher-order conversational goals, which likely require explicit planning to achieve.
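In symbols, for training pairs consisting of a dialog context \( x_i \) (e.g., the preceding utterance) and an observed response \( y_i = (y_{i,1}, \ldots, y_{i,T_i}) \), this criterion amounts to

\[ \theta^{*} = \arg\max_{\theta} \sum_{i} \sum_{t=1}^{T_i} \log p_{\theta}\left(y_{i,t} \mid y_{i,<t},\, x_i\right), \]

i.e., maximizing the probability of each word of the reference response given the context and the preceding words; no term in this objective rewards progress toward longer-range conversational goals.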

A common strategy to address this problem is to decompose dialog systems into independent modules, to be trained separately and combined in a pipeline. For example, one module might process user utterances, a second might then decide on an action to take, and a third might then generate a response, conditioned on this. Developing “end-to-end” methods that permit joint optimization of all components necessary for goal-based dialog is an active area of research [28]. We discuss archetypal modern machine learning models in the section on “Neural Network Methods and End-to-End Architectures”.

Dialog System Technology

Classic Symbolic Pipeline Architectures

Historically, dialog systems have been developed using a pipeline architecture, in which a user utterance is incrementally transformed into a representation that the core agent logic can provide a response to, followed by another series of processing stages to render the system output. These stages can include Automated Speech Recognition (ASR), multimodal integration, utterance understanding, dialog management, natural language generation, multimodal generation, and Text-To-Speech (TTS). Approaches to dialog management include finite-state automata, frames, and plan-based frameworks (Fig. 9.4).

Fig. 9.4
The pipeline’s processing stages in order: automatic speech recognition (ASR); domain classifier; intent classifier; slot tagger; discourse model and dialog manager; content determination; document structuring; sentence planning; surface realization; text-to-speech (TTS).

Pipeline dialog system architecture

Automated Speech Recognition (ASR) is responsible for transcribing the user’s speech input into one or more text representations. Speech recognition has improved significantly, from single-speaker digit recognition systems in 1952 [29] to speaker-independent continuous speech recognition systems based on deep neural networks [30]. Currently, several open source ASR engines, such as Pocketsphinx [31], Kaldi [32], and HTK [33], are available, but accurate speech recognition can require substantial processing power, which cloud-based services such as IBM Watson [34] and the Google Cloud Platform [35] provide. Although recent systems have achieved around 5% word error rates [36, 37], there are still doubts regarding the use of ASR in applications such as medical documentation [38]. Goss et al. [39] reported that 71% of notes dictated by emergency physicians using ASR contained errors, and 15% contained critical errors.

A Natural Language Understanding (NLU) module extracts a semantic representation of the user’s utterance, which can then be used by the dialog manager to generate a system response. State-of-the-art statistical NLU systems often contain three main components: domain detection, intent detection, and slot tagging [40]. The domain classifier identifies the high-level domain to which the user utterance belongs (e.g., symptoms, medications, or educational content). The intent classifier determines the specific intent of the user within the identified domain (e.g., report_finding). Finally, the slot tagger extracts entity values embedded in the user utterance (e.g., syndrome_name or severity_level). NLU is one of the most complex tasks in dialog systems for several reasons. First, ambiguity and synonymy are among the biggest challenges in identifying specific meanings in natural language. Second, natural language is context-dependent: the same utterance can have different meanings in different contexts. Third, spontaneous speech is often noisy, with disfluencies (e.g., filled pauses, repairs, restarts).
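As a highly simplified sketch of these three stages, consider the following Python fragment; real systems use trained statistical classifiers rather than keyword rules, and the domain, intent, and slot names here are illustrative:

```python
import re

# Toy keyword-based domain classifier; a real system would be trained on data.
DOMAIN_KEYWORDS = {
    "symptoms": ["pain", "nausea", "headache", "fever"],
    "medications": ["pill", "dose", "prescription", "refill"],
}

def classify_domain(utterance: str) -> str:
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(k in utterance.lower() for k in keywords):
            return domain
    return "educational_content"

def classify_intent(utterance: str, domain: str) -> str:
    # Within the "symptoms" domain, reporting vs. asking are distinct intents.
    if domain == "symptoms":
        return "ask_about_symptom" if "?" in utterance else "report_finding"
    return "request_information"

SEVERITY = re.compile(r"\b(mild|moderate|severe)\b", re.I)

def tag_slots(utterance: str) -> dict:
    # Slot tagging: extract entity values embedded in the utterance.
    slots = {}
    m = SEVERITY.search(utterance)
    if m:
        slots["severity_level"] = m.group(1).lower()
    return slots

utterance = "I have a severe headache"
domain = classify_domain(utterance)
print(domain, classify_intent(utterance, domain), tag_slots(utterance))
# symptoms report_finding {'severity_level': 'severe'}
```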

Dialog management is most typically implemented using finite state machines or layers of finite state machines, also referred to as hierarchical transition networks, particularly for applications in which the system maintains the conversational initiative. In these systems, states typically represent system utterances and branches to next states are made in response to user responses. Layers in the hierarchy can be used to represent discourse segments, for example to satisfy a particular conversational goal. Dialog managers can also be frame-based, in which a current “frame” is used to guide the conversation by asking users for information to fill slots until enough information has been gathered for the system to take an action.
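A minimal finite-state dialog manager of this kind might look as follows (a sketch with invented states and branches; production systems add repair prompts, confirmation states, and logging):

```python
# States represent system utterances; user responses select the next state.
STATES = {
    "ask_smoker": {
        "prompt": "Do you currently smoke?",
        "next": {"yes": "ask_quit", "no": "ask_exercise"},
    },
    "ask_quit": {
        "prompt": "Have you considered quitting?",
        "next": {"yes": "done", "no": "done"},
    },
    "ask_exercise": {
        "prompt": "Do you exercise regularly?",
        "next": {"yes": "done", "no": "done"},
    },
    "done": {"prompt": "Thank you. That completes this check-in.", "next": {}},
}

def run_dialog():
    state = "ask_smoker"  # the system keeps the conversational initiative
    while True:
        print(STATES[state]["prompt"])
        if not STATES[state]["next"]:
            break
        reply = input("> ").strip().lower()
        # Unrecognized responses re-prompt the same state rather than
        # branching -- a common repair strategy in system-initiated dialog.
        state = STATES[state]["next"].get(reply, state)

run_dialog()
```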

More advanced approaches to dialog management involve the explicit representation of user and system plans and goals, which is required to manage conversational phenomena such as: mixed-initiative dialog, in which either the user or the system can take control of the conversation at any time; proper handling of interruptions and requests for clarifications; and indirect speech acts. Flexibly handling these phenomena requires representing and reasoning about the intentions that underlie system and user utterances, inferring the user’s goals and task plan, and dynamically synthesizing the system’s task plan. Inferring a user’s goals and task plan is necessary because, as exemplified by indirect speech acts, people’s utterances do not always correspond directly to their communicative intent (e.g., as in “Do you have the time?”). Thus, plan-based theories of communicative action and dialog assume that the speaker's speech acts are part of a plan, and the listener's task is to infer it and respond appropriately to the underlying plan, rather than just to the utterance [41]. Synthesizing system task plans, including communicative and other actions, is necessary in complex applications in which all possible conversational contingencies (and their possible orderings) cannot be anticipated and scripted, but must be addressed in an incremental, reactive manner.

Dynamic planning and plan inference can be computationally very complex and difficult to develop, and thus have not been used much to date in fielded health dialog systems. However, they remain active areas of research, and a handful of health dialog systems that use these techniques have been developed for the application of clinical guidelines [42], for the automatic generation of reminders for older adults with cognitive impairment [43], for medication advice [44], and for diet promotion [45].

One research project used a task decomposition planning formalism to drive health behavior change counseling dialog for exercise and diet promotion [46]. This formalism was based on the Shared Plans theory [47, 48], in which dialog is viewed as a collaboration in which participants coordinate their actions towards achieving a shared goal. Discourse segments are defined by the sequence of sub-goals or atomic actions in a recipe that serve to elaborate a particular goal, and the only meaningful relationships among discourse segments are elaboration (goal expansion) and ordering of goals and actions. Figure 9.5 shows a portion of the plan tree for an exercise promotion dialog. Plan fragments that elaborate dialog goals into subgoals and atomic actions are referred to as recipes and are represented in ANSI/CEA-2018 [49], a standard declarative representation for tasks that can be decomposed in this manner. Figure 9.6 shows a portion of a high-level recipe for behavior goal negotiation, and an example of an atomic dialog turn that elaborates the “Recommend” subgoal for negotiating long-term exercise goals (a simplified sketch of this elaboration process follows Fig. 9.6 below). The run-time planning system (based on the COLLAGEN collaborative dialog system [50]) starts with a top-level goal to have a counseling dialog, then incrementally elaborates the goal using recipes until atomic utterances are produced. This process results in a plan tree in which the root is the initial goal and the leaves are the utterances produced by the agent and the user (Fig. 9.5). The planning process proceeds without backtracking, i.e., elaborations are never undone once they are added to the dialog plan tree.

Fig. 9.5
The top-level goal “Have counseling dialog” decomposes, in order, into greeting, small talk, review status, counseling, pre-closing, and goodbye; counseling expands to “negotiate goal”, which expands to “recommend”, realized as agent and patient turns.

Plan tree fragment for exercise promotion counseling dialog

Fig. 9.6
Pseudocode for two tasks, Negotiate and Recommend, each with input parameters and outputs; Negotiate comprises four steps, while Recommend specifies a precondition, an adjacency pair, and an agent utterance.

Example pseudocode for a high-level recipe and a low-level dialogue specification
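The following sketch mimics the elaboration process shown in Figs. 9.5 and 9.6 in highly simplified form; the recipes and utterances are invented, and a real COLLAGEN-style planner would expand goals incrementally, interleaving user turns and choosing recipes based on dialog state:

```python
# Recipes map a goal to an ordered list of subgoals; elaboration proceeds
# top-down without backtracking until atomic utterances are reached.
RECIPES = {
    "have_counseling_dialog": [
        "greeting", "small_talk", "review_status",
        "counseling", "pre_closing", "goodbye",
    ],
    "counseling": ["negotiate_goal"],
    "negotiate_goal": ["recommend"],
}

ATOMIC_UTTERANCES = {
    "greeting": "Hello! It's good to see you.",
    "small_talk": "How has your week been?",
    "review_status": "Last time you planned to walk 20 minutes a day. How did it go?",
    "recommend": "I'd suggest we aim for 30 minutes of walking, five days a week.",
    "pre_closing": "Let's review what we agreed on today.",
    "goodbye": "Take care, and see you next time.",
}

def elaborate(goal: str, plan_tree: list):
    """Depth-first expansion of a goal into a sequence of agent utterances."""
    if goal in RECIPES:
        for subgoal in RECIPES[goal]:
            elaborate(subgoal, plan_tree)
    else:
        plan_tree.append((goal, ATOMIC_UTTERANCES[goal]))

plan = []
elaborate("have_counseling_dialog", plan)
for goal, utterance in plan:
    print(f"[{goal}] {utterance}")
```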

The most common approach to symbolic Natural Language Generation (NLG) is template-based text generation, in which an output utterance is represented as a string annotated with variables whose values are determined at runtime [51]. While relatively simple and straightforward, this approach does not offer much flexibility or expressivity. In the most general case, text generation can involve word-by-word synthesis of utterances based on a grammar and dictionary, discourse context and world knowledge, and is itself decomposed into another pipeline of processing stages [51] (Fig. 9.4).
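In its simplest form, template-based generation amounts to runtime substitution of variable values into annotated strings, as in this sketch (the slot names are invented):

```python
from string import Template

# A template is an output string annotated with variables bound at runtime.
template = Template("Your next dose of $medication is $dose at $time.")
print(template.substitute(medication="metformin", dose="500 mg", time="6 pm"))
# Your next dose of metformin is 500 mg at 6 pm.
```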

Content determination is the first stage of this general NLG pipeline, and involves deciding what information should be communicated in the output, beyond that dictated by the dialog manager. Document structuring decides how chunks of text should be grouped together in one or more output utterances and how they should be related in rhetorical terms. Sentence planning involves: selection of the specific words or other linguistic resources that should be used to express the selected content; deciding what expressions should be used to refer to entities; and deciding how structures created by document planning should be mapped onto linguistic structures, such as utterances or conversational turns.

The last step of NLG, referred to as surface realization, involves turning the internal representations produced during sentence planning into the text of one or more utterances. Research has also been conducted into generation of multi-modal system outputs (speech or text plus accompanying nonverbal behavior for an ECA, or graphics to help illustrate a concept to be conveyed) although, as with multi-modal input understanding, this has not been used widely in health dialog systems to date.

Finally, Text-To-Speech (TTS) involves the conversion of utterance text into an acoustic signal. TTS is now a very mature technology, and its quality and naturalness have improved significantly over the last decade, producing understandable speech for a wide range of languages. The Speech Synthesis Markup Language (SSML) enables the annotation of utterance text with tags that manipulate speed, pitch, volume, and other aspects of prosody to produce more expressive speech [52].
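For example, the following fragment (here constructed as a Python string) uses standard SSML prosody and break tags to slow down and emphasize a key value:

```python
# <prosody> and <break> are standard SSML elements; pitch can be given in
# semitones ("+2st") and pause duration in milliseconds.
ssml = """
<speak>
  Your blood pressure today was
  <prosody rate="slow" pitch="+2st">130 over 85</prosody>.
  <break time="400ms"/>
  That is a little higher than yesterday.
</speak>
"""
print(ssml.strip())
```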

Neural Network Methods and End-to-End Architectures

In the past decade, neural networks have emerged as the dominant model class for natural language processing (NLP) [22], as they have become the dominant machine learning formalism for many areas of modeling in medicine (see Chap. 1). Neural network-based NLP has in turn given rise to neural conversational models [25, 53,54,55]. Departing from the classical symbolic approaches reviewed above, neural models represent utterances as dense, continuous vectors (i.e., learned representations). Neural language models [56] are typically used in such architectures to generate responses conditioned on a representation of context, e.g., the most recent utterance.

Completely “end-to-end” systems forego explicit planning and learn to map directly from an input to an output utterance via a deep neural network [57]. Sequence-to-sequence architectures are currently the dominant neural models for dialog; these use neural network modules to map input to output texts. This is typically accomplished with an encoder-decoder architecture. The encoder learns to compress inputs into a dense representation; this is then passed on to the decoder, which is responsible for conditionally generating a response. Because the decoder must generate text, it is typically defined as an auto-regressive (conditional) language model, which is to say that it generates outputs one word at a time in a left-to-right fashion. Figure 9.7 provides a high-level schematic of this approach.

Fig. 9.7
A bidirectional encoder reads the tokens of a patient utterance and passes its dense representation to an autoregressive decoder, which generates the response one token at a time.

High-level schematic of a Seq2Seq model for dialog generation
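As a sketch of how such a model is used in practice, the following loads BlenderBot, an open-domain encoder-decoder dialog model distributed via the Hugging Face transformers library, and generates a response. This is a general-domain chatbot, not a medically validated system:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "facebook/blenderbot-400M-distill"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

utterance = "I've been feeling tired all week."
inputs = tokenizer(utterance, return_tensors="pt")       # encode the context
reply_ids = model.generate(**inputs, max_new_tokens=40)  # autoregressive decoding
print(tokenizer.decode(reply_ids[0], skip_special_tokens=True))
```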

Simple sequence-to-sequence approaches have the advantage of minimizing the manual effort that must be expended to build new conversational systems; they are induced entirely from training data, and so do not require explicit rule- or template-formulation. However, this brings inherent drawbacks. Chief amongst them is the reliance on large, high-quality training corpora. In addition, such models struggle to make meaningful use of dialog history [58]. With respect to task-oriented dialog systems, end-to-end sequence-to-sequence models can learn to take particular actions only implicitly, which makes them difficult to interpret and control.

Some work has attempted to make neural dialog models more explicitly task-oriented by learning policies via (deep) reinforcement learning [25, 59]. Other recent efforts have aimed to combine the strengths of end-to-end and more explicitly goal-oriented approaches [28]. Unifying the symbolic approaches discussed above with modern, data-driven neural network models for dialog is likely to remain an active area of research in the coming years.

Approaches to Dialog System Evaluation

Evaluating dialog systems is important in general, but is especially crucial in safety-critical areas such as medicine. Due to the multi-faceted nature of dialog systems, and the inherent complexity of natural language, evaluation is typically multi-dimensional. Of course, medical applications typically have well-defined health outcomes that are ultimately of greatest importance, such as knowledge gain for health education systems, or objective health outcomes for conversational agents that promote health behavior change, but here we review application-independent performance metrics and methods.

Evaluation of Pipeline Architectures

In classic pipeline architecture-based systems, a “Wizard-of-Oz” methodology is commonly used to replace one or more pipeline components with a human “wizard” (unbeknownst to test subjects) so that the overall system can be evaluated prior to full implementation, or to provide a baseline comparison for a fully-automated system [60]. Dialog from these sessions is recorded and analyzed for several purposes, including: early characterization of domain dialogs; characterization of user responses in particular contexts of interest; assessment of user acceptance of and attitude towards a planned system; and assessment of utility and efficacy of a planned system. Although ideally, user-system interaction will closely follow provider-patient interaction, it has been observed that in many situations users speak and otherwise behave differently when interacting with a computerized system than when with another human (e.g., they simplify their speech patterns) [61]. In these situations, Wizard-of-Oz testing is particularly important, since the study of provider-patient interaction will not correctly characterize these dialogs.

Pipeline architectures also have well-established evaluation metrics for certain components. For example, Word Error Rate (WER), the proportion of word substitutions, insertions, and deletions relative to a reference transcript, is the standard figure of merit for ASR modules.
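WER can be computed with a standard word-level edit (Levenshtein) distance, as in this sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and the
    # first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("take two tablets daily", "take to tablets twice daily"))
# 0.5: one substitution ("two" -> "to") plus one insertion ("twice"),
# over a 4-word reference
```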

Automated Metrics for End-to-End Architectures

Manual assessment of model outputs remains the gold standard for evaluating Natural Language Processing (NLP) models for text generation tasks broadly (e.g., machine translation, abstractive summarization), and for dialog systems in particular. Manual assessment involves having humans interact with a dialog system, or review transcripts of interactions or text generation outputs, and provide subjective and objective performance evaluations. However, enlisting domain experts to perform such assessments is time-consuming and expensive. Manual evaluation is therefore impracticable for model development, which typically requires iterative refinement. For this reason, contemporary work on NLP models for text generation tasks tends to favor use of fully automated metrics to facilitate model development.

Such metrics assume access to “reference” texts written by humans and aim to measure some notion of similarity between a model output \( \hat{y}_i \) for a given input \( x_i \) and the corresponding reference text \( y_i \). In the context of dialog systems, \( x_i \) might be an utterance and \( y_i \) a reference response. Intuitively, we would like a metric that is high if \( \hat{y}_i \) is similar to \( y_i \). Most automated metrics essentially measure similarity as some function of word overlap between the model output and reference.

BLEU (short for Bilingual Evaluation Understudy) is one such metric, first popularized in the context of automated machine translation. The motivating dictum behind BLEU is “The closer a machine translation is to a professional human translation, the better it is” [62]. To operationalize this intuition, BLEU computes n-gram precision for varying n; that is, it measures the number of n-grams in a generated output that also appear in the corresponding reference. The precision values for different n-gram lengths are then combined using a weighted average. This aggregate precision score is subsequently multiplied by a “brevity penalty” factor, which penalizes outputs that are shorter than their reference texts. Meteor [63] is a similar metric, also popular in machine translation, which proposes several modifications intended to address limitations of BLEU. Both of these metrics have been shown to correlate reasonably well with human assessments of translation system outputs [62, 63].

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [64], another automated metric, is perhaps the dominant choice for evaluating summarization systems. It is, as the name suggests, more focused on recall, i.e., it is high when \( \hat{y}_i \) contains as many n-grams as possible that also appear in \( y_i \). Typically one calculates ROUGE-N for a particular n-gram length; for example, ROUGE-1 tallies unigram recall of the model output with respect to the reference. In the context of automated summarization, ROUGE has been shown to correlate with human judgements of quality [64], although it has been noted that it does not reliably measure higher-order properties of outputs such as factual accuracy [65].
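The core n-gram overlap computations behind both families of metrics are simple, as the following sketch shows; note that full BLEU additionally combines several n-gram lengths and applies the brevity penalty, and ROUGE implementations often report precision/recall/F-measure variants:

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(output, reference, n):
    """BLEU-style modified n-gram precision: the fraction of the output's
    n-grams that also appear in the reference (clipped by reference counts)."""
    out, ref = ngrams(output.split(), n), ngrams(reference.split(), n)
    overlap = sum(min(count, ref[gram]) for gram, count in out.items())
    return overlap / max(sum(out.values()), 1)

def rouge_n_recall(output, reference, n):
    """ROUGE-N: the fraction of the reference's n-grams recovered in the output."""
    out, ref = ngrams(output.split(), n), ngrams(reference.split(), n)
    overlap = sum(min(count, out[gram]) for gram, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

ref = "please take your medication with food"
hyp = "take your medication with a meal"
print(ngram_precision(hyp, ref, 1))  # 4/6 of the output's unigrams match
print(rouge_n_recall(hyp, ref, 1))   # 4/6 of the reference unigrams recovered
```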

The above automated measures of generated outputs were not designed for evaluating dialog generation systems, but they are nonetheless often used for this when “reference” response utterances are available. However, in the context of evaluating dialog systems such metrics have been shown to poorly correlate with human judgements, and so should be interpreted accordingly [14]. Developing better automated metrics for evaluating automatically generated dialog responses is an active area of research [66, 67].

System-Level Evaluation

There are a number of approaches for evaluating overall dialog system performance (see Chap. 17 for a more general discussion of evaluation issues). From a usability perspective, metrics such as task completion rate, user satisfaction, efficiency, and learnability are relevant. One influential dialog system evaluation framework (PARADISE) attempts to combine these into a single metric [68]. PARADISE uses a decision-theoretic framework to combine evaluations of system accuracy (success rate at achieving desired conversational outcomes) with the “costs” of using a system, comprising quantitative efficiency measures (number of dialog turns, conversation time, etc.) and qualitative measures (e.g., number of repair utterances), to yield a single quality measure for a given interaction. Weights for the various elements of the evaluation are determined empirically from overall assessments of user satisfaction for a sample set of conversations, and the evaluation formula can be applied to sub-dialogs as well as to entire conversations to enable identification of problematic dialog fragments.
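The weight-estimation step can be sketched as an ordinary least-squares regression of satisfaction ratings on normalized success and cost measures; the data below are fabricated for illustration, and PARADISE’s original formulation measures task success with the kappa statistic:

```python
import numpy as np

success = np.array([1.0, 0.8, 0.4, 0.9, 0.2])       # task success per dialog
turns = np.array([12, 20, 35, 15, 40])              # efficiency cost measure
repairs = np.array([1, 3, 6, 2, 8])                 # qualitative cost measure
satisfaction = np.array([4.5, 3.8, 2.1, 4.2, 1.5])  # overall user ratings

def z(x):
    # Normalize each measure so fitted weights are comparable across units
    return (x - x.mean()) / x.std()

X = np.column_stack([z(success), z(turns), z(repairs), np.ones(len(turns))])
weights, *_ = np.linalg.lstsq(X, satisfaction, rcond=None)
print(dict(zip(["success", "turns", "repairs", "intercept"], weights.round(2))))
# The fitted weights can then score new dialogs (or sub-dialogs) directly.
```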

Two other qualitative evaluation methods were developed in the TRINDI and DISC projects. They provide criteria for evaluating a dialog system’s competence in handling certain dialog phenomena. The TRINDI Tick-List consists of three sets of questions that are intended to elicit explanations describing the extent of a system’s competence [69]. The first set consists of eight questions relating to the flexibility of dialog that a system can handle. For example, the question “Can the system deal with answers to questions that give more information than was requested?” assesses whether the system has any ability to handle mixed-initiative dialog. The DISC Dialog Management grids [70] include a set of nine questions, similar to the TRINDI Tick-List, that are intended to elicit factual information regarding the potential of a dialog system.

Langevin et al. recently developed a set of usability heuristics to guide the evaluation of text- or speech-based conversational agents [71]. Usability heuristics are used to guide “expert evaluation” of an interface, in which a designer uses them as a checklist to draw their attention to common classes of usability problems. Derived from Nielsen’s classic usability heuristics [72], the 11 new heuristics were found to be more effective at identifying problems with conversational agents than Nielsen’s original set. Examples of the heuristics are shown in Table 9.1.

Table 9.1 Example conversational agent usability heuristics (from [71])

Example Patient- and Consumer-Facing Dialog Systems

A number of embodied conversational agents have been developed to provide health education and health behavior change counseling across several health conditions. For example, an ECA was developed as a virtual discharge nurse who explained patients’ hospital discharge and home care instructions (Fig. 9.2) [73, 74]. The agent was provided on a touch screen kiosk to patients while they were in their hospital beds, and spent 30–60 min reviewing a hospital discharge booklet with them, including information about medications, follow-up appointments, and self-care procedures. Patient understanding was confirmed using comprehension checks, and at the end of the session a report was printed for the human discharge nurse indicating the questions the patient still had, so that the nurse could address them. A randomized controlled trial (RCT) comparing the virtual nurse to standard care was conducted with 764 patients (mean age 49.6 years; 49.7% with inadequate health literacy) on a general medicine floor at an urban safety net hospital. Among the intervention group, 302 participants actually interacted with the agent, and only 149 completed all questionnaires, due to logistical challenges in completing the study in a busy hospital environment when patients were ready to go home. Patients reported very high satisfaction and working alliance with the agent, and more patients preferred talking to the agent than to their doctors or nurses in the hospital.

Several speech-based conversational agents have also been developed and evaluated in RCTs [43, 75, 76]. For example, the Telephone-Linked Care (TLC) systems developed by Friedman and colleagues at Boston University used recorded speech output, and either touch-tone (DTMF) keypad presses or ASR for user input. TLC behavior change applications have been applied to changing dietary behavior [77], promoting physical activity [78], smoking cessation [79], and promoting medication adherence in patients with depression [80] and hypertension [81]. TLC chronic disease applications have been developed for chronic obstructive pulmonary disease (COPD) [82], and coronary heart disease, hypercholesterolemia, and diabetes mellitus [81]. All of these systems have been evaluated in RCTs, and most were shown to be effective on at least one outcome measure compared to standard-of-care or non-intervention control conditions.

There are now many commercially successful patient- and consumer-facing dialog systems. Woebot is a text-based chatbot designed to alleviate anxiety and depression using a range of counseling techniques, and was recently demonstrated to be effective at reducing substance misuse [83]. Clear Genetics produces a text-based chatbot that provides a range of genetic counseling functions, including administration of informed consent for genetic testing [84]. In addition, many dialog systems have been developed as add-on “skills” for speech-based conversational assistants such as Alexa. At the time of this writing, Amazon lists over 2000 skills (task-specific modules that extend Alexa’s functionality) in their Health and Fitness category, all of which can be considered patient- and consumer-facing health dialog systems.

Example Provider-Facing Dialog Systems

There are far fewer examples of provider-facing medical dialog systems in the literature, and these have largely been early research prototypes. For example, the HOMEY system is a decision support tool that advises physicians on whether a patient should be referred to a cancer specialist [85]. Laranjo et al. describe several additional dialog systems that interact with both patients and providers [86]. Dialog systems may be less acceptable to health providers than to patients and consumers because they are slower to use and more error-prone compared to functionally-equivalent graphical user interfaces.

Safety Issues in Dialog Systems for Healthcare

Dialog systems that provide advice to healthcare providers can tolerate imperfect performance, since providers presumably have the expertise to recognize unsafe recommendations. However, due to the inherent ambiguity of natural language, users’ lack of knowledge about the expertise and natural language abilities of a conversational agent, and potentially misplaced trust, great care must be taken to ensure that patients and consumers do not act on mistaken information provided by a conversational agent in ways that could cause harm. To demonstrate these potential safety issues, a study was conducted using three widely-available disembodied conversational agents (Apple’s Siri, Google Home, and Amazon’s Alexa). Laypersons were recruited to ask these agents for advice on what to do in several medical scenarios provided to them in which incorrect actions could lead to harm or death, and then to report what action they would take. Out of 394 tasks attempted, participants were only able to complete 168 (42.6%). For those tasks, 49 (29.2%) of the reported actions could have resulted in some degree of harm, including 27 (16.1%) that could have resulted in death, as rated by clinicians using a standard medical harm scale [87]. The errors responsible for these outcomes were found at every level of system processing, as well as in user actions in specifying their queries and in interpreting results (see Fig. 9.8 for an example). The findings from this study imply that, given the state of the art, unconstrained natural language input, in the form of speech or typed text, should not be used for systems that provide medical advice. Users should be tightly constrained in the kinds of advice they can ask for, for example through the use of multiple-choice menus of utterances they are allowed to “say” at each step of the conversation (e.g., as in Fig. 9.2). In addition, unconstrained generative approaches to dialog generation pose additional complications; for example, they may yield offensive or medically inaccurate outputs (as discussed in the section on “Approaches to Dialog System Evaluation”).

Fig. 9.8

Example of medical advice from Siri that was rated as potentially fatal (excerpt from [87]; “RA” is the research assistant)

State of the Art: What We Currently Can and Can’t Do

There are currently several commercially-available toolkits for developing state-machine-based dialog systems in which the system maintains initiative, and constrained or unconstrained user inputs can be reliably mapped to a small number of options in each state. This state-based model is exemplified by standard dialog management languages (e.g., VoiceXML for speech-based systems [88]) and commercial dialog-management tools (e.g., Google’s DialogFlow [89]). As described in the section on “Example Patient- and Consumer-facing Dialog Systems”, there are also several commercial products that use this approach for consumer- and patient-facing health education and counseling.

However, we cannot reliably support general, unconstrained user input in mixed-initiative conversations, nor any of the other conversational phenomena described in the section on “Introduction to Dialog Systems”, at least not to the degree that people can.

Future Directions

It is unclear whether end-to-end approaches will ever be capable of sustaining coherent dialog over many utterances, given that the discourse context alone becomes combinatorially large. Hybrid systems that combine the best capabilities of the pipelined and neural approaches represent a more promising direction, at least in the near term. Consistent with the recurrent theme of combining machine learning and symbolic approaches mentioned elsewhere in this book, one approach is to bring more machine learning-based components into the classic symbolic pipeline. Going forward, key questions include: How can we unify end-to-end neural systems with symbolic planning-based approaches? Conversely, how might we better represent and exploit (long-term) context in modern neural dialog systems?

Patient- and consumer-facing dialog systems that support unconstrained natural language input are certainly preferred to those that are more constrained, since patients can express themselves freely and may be able to communicate more nuanced information. However, these systems represent a safety risk as described in the section on “Safety Issues in Dialog Systems for Healthcare”. The identification and mitigation of unsafe medical dialog remains an important area of research and a problem that must be addressed before these systems can reach their potential.

There are several active research areas in dialog systems. For example, in pipeline architectures, incremental processing, in which system responses are generated incrementally while a user is still producing their utterance, allows for much faster system response times, but requires re-architecting how the pipeline works [90]. Multiparty interaction represents another important area of research, needed to support group counseling [91] or three-way patient-clinician-agent interactions. Multimodal dialog with ECAs or humanoid robots, in which user verbal and nonverbal behavior can be used to support conversational processes and allow users to better express themselves [92], also represents an active area of research.

These advances will enable the development of automated health providers and counselors that can deliver complex information to patients and consumers in a natural, fluid, and intuitive way, tailored to each user and situation, without requiring users to simplify their language and requests, supporting conversations like the one shown in Fig. 9.1. This will enable routine patient education and counseling tasks, such as administration of informed consent, explanation of medications and medical procedures, and explanation of discharge instructions, to be fully automated.

Conclusion

In this chapter, we have provided a review of the state of medically relevant dialog systems, including their current capabilities and limitations, and directions of ongoing and future research and development. Development of this technology is important for the delivery of complex information to patients and consumers, but is particularly important for those with low health or computer literacy who may struggle with text-heavy graphical user interfaces. While these systems have great potential for improving health, attention must be paid to the risks inherent in using unconstrained text or speech input in situations in which misunderstandings can lead to harm.

Questions for Discussion

  • How do pipeline and rule-based systems differ from “end-to-end” neural approaches?

  • Why might existing automated metrics like ROUGE fail to reliably measure the factual accuracy of utterances?

  • How can unsafe medical dialog be identified and mitigated?

  • What kinds of medical applications would benefit most from embodiment by the conversational agent?

  • What kinds of medical applications would make the relative slowness of dialog systems acceptable to clinicians?

Further Reading

Chattopadhyay D, Ma T, Sharifi H, Martyn-Nemeth P. Computer-controlled virtual humans in patient-facing systems: systematic review and meta-analysis. J Med Internet Res. 2020;22(7):e18839 [93].

  • This article provides a comprehensive review of patient-facing embodied conversational agents in medicine.

Laranjo L, Dunn AG, Tong HL, Kocaballi AB, Chen J, Bashir R, et al. Conversational agents in healthcare: a systematic review. J Am Med Inform Assoc. 2018;25(9):1248–58 [86].

  • This article provides a review of conversational agents in medicine that use unconstrained natural language input.

Bickmore T, Trinh H, Olafsson S, O'Leary T, Asadi R, Rickles N, et al. Patient and consumer safety risks when using conversational assistants for medical information: an observational study of Siri, Alexa, and Google Assistant. J Med Internet Res. 2018;20(9):e11510 [87].

  • This is an empirical study of worst-case safety issues when patients or consumers use conversational agents for actionable medical advice.

Grosz B, Pollack ME, Sidner CL. Discourse. In: Posner MI, editor. Foundations of cognitive science. Cambridge: MIT Press; 1989 [94].

  • This chapter provides an excellent primer on basic issues in the study of discourse.

Sordoni A, Galley M, Auli M, Brockett C, Ji Y, Mitchell M, et al. A neural network approach to context-sensitive generation of conversational responses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 196–205, Denver, Colorado. Association for Computational Linguistics [95].

  • An early neural dialog system that is illustrative of approaches to follow.

Sankar C, Subramanian S, Pal C, Chandar S, Bengio Y. Do neural dialog systems use the conversation history effectively? An empirical study. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 32–37, Florence, Italy. Association for Computational Linguistics [96].

  • An examination of how well current neural based approaches can harness conversational history to inform utterances/responses.