Introduction

Developing diagnosis and clinical reasoning skills is a key element of medical education. In addition to clinical practice, medical students and practitioners can enhance these abilities by means of mannequins, role play and simulation systems. These have shown beneficial results [1,2,3,4,5,6,7] and are currently integrated in virtual patients [8,9,10,11,12,13,14,15]. Virtual patients (VPs) are software applications through which students can train themselves by emulating the role of health providers [16].

Ideally, a VP simulation system should simulate a patient in all consultation stages. Taking the patient's medical history (anamnesis) is an essential but difficult-to-master skill: real consultations occur in time-restricted settings, and there is a language-level gap in doctor-patient communication. Given the health implications, doctors need training to acquire these skills so that they can assess patients' conditions and make a correct diagnosis.

Natural language dialogue systems (chatbots or conversational agents) have been integrated in healthcare applications [17,18,19] and VP simulation environments. Interaction modules allow trainees to simulate history taking, mostly through constrained input, e.g. lists of questions and answers prepared for a specific case [11, 20,21,22,23,24,25]. Other methods for processing user input use rules, ontologies and knowledge bases [26, 27], statistical language models [28], machine-learning classifiers [29], crowd-sourced data [22] and preliminary neural approaches [30, 31]. Some systems feature automatic speech recognition [32,33,34]. However, very few virtual patients support dialogue through natural language [34], humans' inherent mode of communication, which might result in more natural interaction with a conversational agent [35, 36].

A successful interaction relies both on the type of technology and on the degree to which the VP helps users acquire clinical reasoning and history-taking skills. For this purpose, interacting with a wide range of cases is beneficial [36]. Accordingly, a VP system should provide simulations covering a variety of clinical specialities. Most systems, nonetheless, only deal with one or a few conditions [33, 34, 37,38,39,40,41,42,43]. Very few systems cope with diverse pathologies [22, 44].

Objectives

Our objective was to overcome the limitation of the scarce number of simulated cases by designing a dialogue-enabled VP system that can cope with a variety of clinical conditions. We hypothesise that a multi-case VP simulation system can be achieved if medical trainers can create VPs easily, through a graphical interface (Fig. 7, Appendix), without any programming or intervention from the development team. The description of the clinical case, in the form of a semi-structured record, is typed offline in natural language; the dialogue system then embodies a patient based on each clinical case.

Accordingly, a first requirement of the system is to cope with new content across medical specialities. The second requirement is to provide unconstrained input, because the system aims at improving medical students' history-taking skills through interaction with the VP. Figure 1 shows a sample dialogue and illustrates natural dialogue phenomena. The system is integrated in a serious game developed with partner companies and a medical team [45]. The software features an animated avatar with text-to-speech, lip synchronisation and minor gestures.

Fig. 1 Sample of an actual dialogue of a medical student (D for Doctor) with a virtual patient (P); the transcript comes from a session with the English version of our system

To make the system able to handle a large number of cases, we gave it extensive conceptual and terminological coverage of the domain [27, 46]. The system can also adapt to new records dynamically: we provided it with components that detect out-of-vocabulary words (OOVs) and predict the morphological information of these missing words. The system with adaptation modules is available in French; English and Spanish versions are available but less well supported.

This article reports a usability evaluation of the French system, where we assessed, in a simulated history-taking setting:

  1. Q1: Whether a multi-case system can provide quality dialogue (with regard to grammar and on-topic and realistic replies) through natural language across clinical cases.

  2. Q2: Whether quality dialogue is maintained when processing unseen records across medical specialities.

We evaluated these aspects through user experiments in a real context. Study participants (n = 39) interacted in French with the dialogue system and then evaluated their dialogue. A graphical abstract (Fig. 9, Appendix) summarizes our work.

Material and methods

Dialogue system architecture

To tackle the task, we first designed a patient record model, which defines a virtual patient's health state in a semi-structured format; Table 9 (Appendix) shows an example. Second, we defined a knowledge model for the task, i.e. a scheme of question types, dialogue acts and entity types concerning the anamnesis. Third, we created a termino-ontological model, which hosts structured thesauri for managing the variation of terms [46, 47]. Figure 2 shows a schema of the different stages (which occur asynchronously): case creation by an instructor (1), comparison and analysis of a new record (2), and dialogue by a student (3).
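For illustration, a record following this semi-structured model could look roughly like the sketch below (a Python dictionary); the field names and values are hypothetical and do not reproduce the actual schema shown in Table 9.

```python
# Hypothetical semi-structured patient record; field names and values are
# illustrative only, not the schema of Table 9 in the Appendix.
patient_record = {
    "identity": {"age": 54, "sex": "female", "occupation": "teacher"},
    "chief_complaint": "chest pain since this morning",
    "history_of_present_illness": {
        "pain": {"location": "retrosternal", "intensity": "7/10",
                 "onset": "3 hours ago", "radiation": "left arm"},
    },
    "medical_history": ["hypertension", "appendectomy (1998)"],
    "medications": ["amlodipine 5 mg daily"],
    "allergies": [],
    "lifestyle": {"smoking": "10 cigarettes/day", "alcohol": "occasional"},
}
```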

Fig. 2 Schema of the virtual patient dialogue system and update components

We designed the system following a knowledge-based and rule-frame-based approach [27]. The user—typically a medical student or resident—types text. A natural language understanding (NLU) module performs the linguistic and semantic processing (e.g. pain is a symptom). A semantic frame is fed to a dialogue manager, which keeps track of the dialogue state and context information, queries the record, selects the information and replies through a template-based generation module (Fig. 3).
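To make this pipeline more tangible, the following toy sketch (an assumption, not the actual implementation) shows how an utterance could be turned into a semantic frame, used to query the record and answered through a template; the function names and matching rules are purely illustrative.

```python
import re

# Minimal record excerpt (illustrative; see Table 9 for the real format).
RECORD = {"pain": {"location": "retrosternal", "onset": "3 hours ago"}}

def nlu(utterance: str) -> dict:
    """Toy NLU: map a doctor's question to a semantic frame.
    The real module relies on rules and the termino-ontological model."""
    text = utterance.lower()
    frame = {"dialogue_act": "question", "entity": None, "attribute": None}
    if re.search(r"\b(pain|ache|hurts?)\b", text):
        frame["entity"] = "pain"
        if "where" in text:
            frame["attribute"] = "location"
        elif "when" in text or "since" in text:
            frame["attribute"] = "onset"
    return frame

def dialogue_manager(frame: dict, record: dict, state: list) -> str:
    """Query the record with the semantic frame and fill a reply template."""
    state.append(frame)                                   # keep dialogue context
    attribute = frame.get("attribute")
    if frame.get("entity") == "pain" and attribute in record["pain"]:
        return f"It is {record['pain'][attribute]}."      # template-based generation
    return "I am not sure I understand, doctor."          # defer instead of guessing

state: list = []
print(dialogue_manager(nlu("Where is the pain?"), RECORD, state))  # -> It is retrosternal.
```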

Fig. 3 Example of functioning of the dialogue system from input to output. The patient record is simplified; Table 9 shows a full example

The termino-ontological model contains lexical resources for processing linguistic variation: inflection (e.g. lung \(\leftrightarrow \) lungs), derivation (e.g. face \(\leftrightarrow \) facial), synonymy (e.g. operation \(\leftrightarrow \) surgery) and mapping between full words and affixes/roots (e.g. heart \(\leftrightarrow \) cardio-). The model also defines domain relations and concepts for processing and normalising the variety of terms in a case: e.g. pain and ache refer to the same concept. These resources support a key feature of the system: its ability to map the doctor's language to the patient's language to better simulate a real patient. We populated this model with large general and domain resources (e.g. the Unified Medical Language System® [48]). Our lexicons contain domain lists (over 161,000 terms in French, 116,000 in English, and 103,000 in Spanish) and dictionaries (over 959,000 word/concept entries in French, 1,886,000 in English, and 1,428,000 in Spanish).
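As an illustration of how such resources can be used, the sketch below normalises surface variants to a shared concept and maps a technical term to lay language; the entries are invented examples, not the content of our lexicons.

```python
# Illustrative excerpts of termino-ontological resources (invented entries).
SYNONYMS = {"ache": "pain", "surgery": "operation", "cephalalgia": "headache"}
INFLECTION = {"lungs": "lung", "operations": "operation"}
LAY_TERMS = {"myocardial infarction": "heart attack", "cephalalgia": "headache"}

def normalise(term: str) -> str:
    """Map a surface form to its normalised concept label."""
    term = term.lower()
    term = INFLECTION.get(term, term)      # inflection: lungs -> lung
    term = SYNONYMS.get(term, term)        # synonymy: ache -> pain
    return term

def to_patient_language(term: str) -> str:
    """Render a doctor's technical term the way a lay patient might say it."""
    return LAY_TERMS.get(term.lower(), term)

assert normalise("ache") == normalise("pain")          # same concept
print(to_patient_language("myocardial infarction"))    # -> heart attack
```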

Although these resources allow the system to handle a large number of cases, medical jargon evolves continually with neologisms. Unknown out-of-vocabulary words (OOVs) might cause incorrectly generated replies, because the system lacks the linguistic information needed for morphological agreement. We thus developed methods to predict the Part-of-Speech (PoS) and gender/number of OOVs (see the bottom of Table 9 in the Appendix). Multiple approaches run in parallel: dictionary-based lookup, and inference from the linguistic context or from the base form/affixes (Fig. 6 in the Appendix). Their outputs are combined using heuristic weights set during development. This prediction is executed offline whenever an instructor creates or modifies a case. Figure 8 (Appendix) gives more technical details of the system components.
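A minimal sketch of this weighted combination is given below; the predictor names, weights and example OOV are illustrative assumptions, not the values tuned during development.

```python
from collections import defaultdict

def combine_predictions(predictions: list[tuple[str, str, float]]) -> str:
    """Combine PoS guesses from several predictors with heuristic weights.
    `predictions` holds (predictor_name, pos_tag, weight) triples; the
    weights used here are illustrative."""
    scores: dict[str, float] = defaultdict(float)
    for _, pos_tag, weight in predictions:
        scores[pos_tag] += weight
    return max(scores, key=scores.get)

# Example for a hypothetical OOV: each method votes independently.
votes = [
    ("dictionary_lookup", "UNKNOWN", 0.0),   # not found in the lexicons
    ("context_inference", "ADJ", 0.6),       # sentence context suggests an adjective
    ("suffix_inference", "ADJ", 0.8),        # French suffix "-ique" suggests an adjective
]
print(combine_predictions(votes))            # -> ADJ
```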

Evaluation design

To assess whether the system provides quality dialogue across clinical cases (Q1), potential end-users (n = 39) tested 35 different VPs. Medical students, interns and expert practitioners conducted medical history taking in French with a VP and evaluated the system's performance over several evaluation rounds under two types of conditions (Table 1). Some sessions used unseen cases that had just been created; we did not modify the system between creation and use. Other sessions used already seen cases, created earlier, for which we had fine-tuned the system manually. The system evolved over the evaluation rounds and improved gradually as we corrected the errors found in interaction logs.

Table 1 Evaluation rounds and medical specialities

The medical evaluators had varied profiles (Table 2) and some participated in multiple evaluation rounds. Medical instructors created the content of 6 seen and 23 unseen cases. A co-author of this paper (LC) input the records of 6 unseen cases using the wording of clinical cases from the French national classifying examinations for medical students. Tables 10 and 11 (Appendix) provide a brief description of each case.

Table 2 Medical evaluators’ profiles

We first conducted a user evaluation by means of 5-point Likert-scale questionnaires ranging from 1 (Very poor) to 5 (Very good). After each interaction, evaluators assessed the system on nine aspects (Table 3) drawn from evaluation frameworks for dialogue systems [49, 50]. Evaluators were given instructions on the types of utterances the system can process, and a link to the online questionnaire.

Table 3 Description of aspects addressed in the qualitative evaluation; scores ranged from 5 (Very good) to 1 (Very poor)

We also evaluated the dialogue system's correctness. We gathered data from the dialogues with all 35 VP cases, analysed the dialogue logs and quantified the number of correct replies. We considered correct those replies giving a coherent answer (consistent with the user input and accurate with regard to the data in the record). Table 6 (Appendix) describes some examples of correct, incorrect and deferred replies. An author of this paper (LC) annotated all data; another author (SR) checked the annotations of a subset of 84 (2%) turn-reply pairs that were difficult to classify, and a consensus was reached. We computed the kappa agreement between the two annotators.
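For readers who wish to reproduce this kind of agreement measure, a minimal sketch using scikit-learn is shown below; it assumes Cohen's kappa for two annotators (the variant is not specified above) and uses invented labels.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels for doubly-annotated turn-reply pairs
# (1 = correct, 0 = incorrect, 2 = deferred); data are illustrative.
annotator_lc = [1, 1, 0, 2, 1, 0, 1, 1, 2, 0]
annotator_sr = [1, 1, 0, 2, 1, 1, 1, 1, 2, 0]

print(cohen_kappa_score(annotator_lc, annotator_sr))
```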

To evaluate whether quality dialogue is maintained with new cases (Q2), we compared the evaluation scores given to seen and unseen cases (Table 1). 26 of the 39 medical evaluators assessed the 6 seen VP cases (50 questionnaires), and 23 of the 39 evaluators assessed the 29 unseen cases (67 questionnaires); some evaluators assessed both seen and unseen cases. We conducted two-tailed t-tests and Mann-Whitney tests, using the Prism 5 software, to determine whether the differences in scores were statistically significant.
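Although we used Prism 5, an equivalent check could be scripted as follows; the score vectors are invented for illustration.

```python
from scipy.stats import mannwhitneyu, ttest_ind

# Illustrative Likert scores (1-5) given to one aspect for seen vs unseen cases.
seen_scores = [3, 4, 3, 3, 4, 3, 4, 2, 3, 4]
unseen_scores = [4, 4, 5, 3, 4, 4, 5, 4, 3, 4]

u_stat, p_mw = mannwhitneyu(seen_scores, unseen_scores, alternative="two-sided")
t_stat, p_t = ttest_ind(seen_scores, unseen_scores)   # two-tailed by default

print(f"Mann-Whitney p = {p_mw:.3f}, t-test p = {p_t:.3f}")
```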

To measure the diversity of the unseen cases, we counted the word types (i.e. different word forms) appearing in only one record, and the types shared across different cases. The unseen cases belong to 14 specialities (Table 1). We analysed how scores varied according to evaluators’ profiles.
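The type-counting procedure described above could be sketched as follows, assuming a naive whitespace tokenisation (the actual tokenisation is handled by the system's linguistic resources); the records are invented.

```python
from collections import Counter

def type_diversity(records: dict[str, str]) -> float:
    """Share of word types (unique forms) occurring in only one record.
    Tokenisation is naively whitespace-based here, for illustration only."""
    type_counts: Counter[str] = Counter()
    for text in records.values():
        for word_type in set(text.lower().split()):
            type_counts[word_type] += 1        # count records containing the type
    unique = sum(1 for count in type_counts.values() if count == 1)
    return unique / len(type_counts)

records = {
    "case_cardio": "douleur thoracique depuis ce matin",
    "case_neuro": "céphalées et vertiges depuis hier matin",
}
print(type_diversity(records))   # fraction of types found in a single record
```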

Results

Quality of natural language dialogue

Each case was tested by an average of 3.74 evaluators (± 2.8; minimum number of evaluators per case = 1; maximum = 13). Panels A and B of Fig. 4 display the average evaluator scores for the seen and unseen cases respectively. Lower scores are placed to the left of each Y axis; neutral scores, in the middle; and higher scores, to the right. The bars show the cumulated percentages of evaluator scores that were Very good, Good, Neutral, Poor and Very poor. For example, in the seen cases, performance was assessed as Very good by 6% of the evaluators, as Good by an additional 52% of evaluators, as Neutral by 28% of them, and as Poor by the remaining 14%. The overall average score, obtained by averaging the mean scores given to the 9 evaluated aspects, was 3.84 out of 5 for seen cases and 4.05 for unseen cases, both above the Likert-scale midpoint. The proportion of dialogues with Poor or Very poor scores ranges from 16% (naturalness) to 0% (user-understanding) for seen cases, and from 6% (naturalness) to 0% (speed) for unseen cases.

Fig. 4 Results of the qualitative evaluation and comparison between seen cases (used in development) and unseen cases

Regarding system correctness, we analysed 8,078 turn-reply pairs from 131 dialogues (Tables 4 and 5). We removed 149 turn-reply pairs containing out-of-task questions or statements. The two researchers who double-checked the subset of turn-reply pairs had a kappa agreement of 0.827. In the full set of dialogue logs (seen and unseen cases), when analysed per medical speciality, an average of 74.3% (± 9.5) of system replies were correct (min = 53.6%, max = 93.8%), i.e. answers were coherent with regard to inputs and provided accurate information from the record. An average of 14.9% (± 6.3) of system replies were incorrect; however, only 2 errors were caused by unseen words. Incorrect replies affected the system's faithfulness (26.5%), the dialogue flow (56.2%) and the exhaustiveness of the information provided by the virtual patient (17.3%) (Table 8, Appendix). The system determined that the remaining questions were beyond the dialogue task and answered I do not understand (an average of 7.8% ± 5.3) or asked for more precision (an average of 2.9% ± 2.7). This defers giving an incorrect reply and accounts for an additional 10.7% of correct system behaviour on average, despite having a negative impact on the dialogue flow. When analysing the data per dialogue, the results obtained were very similar (Table 5).

Table 4 Evaluation data for all collected dialogues (#d = 131): #T: count of turns; #W: count of words; stdev: standard deviation; #U/d: average turns per dialogue; #W/d: average words per dialogue
Table 5 Evaluation of system correctness expressed as average percentage (±standard deviation) [minimum - maximum]

Performance with unseen cases across specialities

Panels A and B of Fig. 4 display, respectively, the proportion of scores given to each aspect for the 6 seen and 29 unseen cases. Evaluators rated every aspect better in the unseen cases. The differences in evaluation scores were statistically significant for the following aspects: system performance (a mean of 3.50 (95% CI[3.27-3.73]) for seen cases versus 3.81 (95% CI[3.64-3.97]) for unseen cases, p-value = 0.029, Mann-Whitney test), coherence in replies (a mean of 3.38 (95% CI[3.18-3.58]) for seen cases versus 3.73 (95% CI[3.61-3.86]) for unseen cases, p = 0.004, Mann-Whitney test), informativeness (a mean of 3.78 (95% CI[3.58-3.98]) for seen cases versus 4.03 (95% CI[3.86-4.20]) for unseen cases, p = 0.047, Mann-Whitney test) and system-understanding (a mean of 3.44 (95% CI[3.22-3.66]) for seen cases versus 3.90 (95% CI[3.72-4.07]), p = 0.001, t-test).

We also examined the variation of scores along evaluation rounds; panels C-E in Fig. 4 show the average scores for each aspect. When we compared the scores given in the first evaluation round (using seen cases) with those in the last round (using unseen cases), the following aspects showed statistically significant differences: performance (a mean of 3.48 (95% CI[3.21-3.74]) in the first round versus 4.00 (95% CI[3.86-4.14]) in the last round, p = 0.003, Mann-Whitney test), coherence (a mean of 3.31 (95% CI[3.09-3.53]) in the first round versus 3.76 (95% CI[3.56-3.95]) in the last round, p = 0.005, t-test), informativeness (a mean of 3.69 (95% CI[3.48-3.90]) in the first round versus 4.03 (95% CI[3.87-4.19]) in the last round, p = 0.018, Mann-Whitney test), concision (a mean of 4.00 (95% CI[3.76-4.24]) in the first round versus 4.59 (95% CI[4.40-4.78]) in the last round, p = 0.001, Mann-Whitney test), and system-understanding (a mean of 3.36 (95% CI[3.11-3.60]) in the first round versus 4.07 (95% CI[3.89-4.24]) in the last round, p < 0.0001, t-test).

Figure 5 plots the evaluation scores of the unseen cases grouped by speciality. From a qualitative point of view, we could not find any speciality that would consistently obtain scores below the others; outlier values correspond to cases where few dialogues were conducted.

Fig. 5 Qualitative evaluation across medical specialities and evaluator profiles. The size of each point expresses the number of dialogues conducted: 1–5 (small size), 6–10 (medium size) and > 10 (large size). The abbreviations of specialities are given in Table 1

Concerning the diversity of the vocabulary, the unseen cases contained 1,488 types (unique word forms). 1,017 types (68.4%) appeared in only one record; in other words, only about one third of the types (31.6%) occurred in more than one case. The average proportion of unique types per record is 34.6% (± 7.4). These numbers show the extent to which the lexical content of each unseen case differs from that of the others.

We also analysed the quantity of out-of-vocabulary words (OOVs) in the unseen cases. Out of the 1,488 types in the unseen cases, only 33 words (2.5%) were missing from the system resources (avg = 1.2 OOVs per case, ± 1.66); that is, our resources covered 97.5% of the vocabulary in the 29 new cases. Our analysis showed that most OOVs were spelling mistakes made when inputting data to create a new record. Our methods predicted the PoS category of these OOVs with a precision of 69.8%, a recall of 76.9%, and an F-measure of 73.2% (micro-average). For the OOVs whose category was correctly predicted, our methods for predicting morphological data achieved a precision of 59.4%, a recall of 61.3%, and an F-measure of 60.3% (micro-average). Table 7 (Appendix) shows further details about our results per category.
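As a reading aid, the reported F-measures follow from the usual harmonic mean of precision and recall; for the PoS prediction, for instance:

\[ F_1 = \frac{2PR}{P+R} = \frac{2 \times 0.698 \times 0.769}{0.698 + 0.769} \approx 0.732 \]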

Lastly, Fig. 5 (bottom right) depicts differences in assessment according to the evaluators' profiles. The average scores of most or all evaluator profiles agreed on user-understanding, speed, tediousness and concision. Students and residents gave higher average scores for system performance, coherence of replies, informativeness, and system- and user-understanding. Senior practitioners or instructors generally gave lower scores.

Discussion

The quality of the natural language dialogue in seen and unseen cases received very positive, positive, or neutral judgements from between 93% and 100% of the evaluators, allowing us to answer Q1 positively. System performance and coherence of replies received Good and Very good scores and overall satisfaction was high with an average of 3.84 (seen cases) and 4.06 (unseen cases) across all aspects. We cannot compare the error rate with other works (e.g. [34]) without bias, since we tested more patient cases.

Regarding Q2, in the test on unseen cases, every aspect received a higher user evaluation score than for seen cases, and the improvement in some aspects was statistically significant. The system was robust enough to cope with new cases without quality loss, and its vocabulary coverage of unseen cases was very high (97.5%). Overall, we tested 35 different cases covering 18 medical specialities which, to the best of our knowledge, is a much larger number than reported so far in the literature.

The unseen cases covered varied medical specialities; from a qualitative point of view, we could not identify any speciality that was consistently handled less well. Analysing this aspect from a quantitative perspective would require a larger number of dialogues in each speciality. The comparison of scores across evaluators' profiles showed that medical students and residents evaluated the system more favourably. This is encouraging since they are the primary target users of the system.

The correctness rate of system replies varied across cases largely due to the content of each record: e.g. performance was lower in a postpartum case, where some questions referred to the patient's newborn, but the system could not distinguish them from those related to the VP herself. Our analysis of logs across cases revealed that most errors were due to the lack of variants of question formulations, missing question types, or processing errors (Table 6, Appendix). These weaknesses require fallback strategies, which we explored using machine learning [51].

At a technical level, we want to improve the performance of the dialogue manager and the comparison and update procedures. Given the lack of dialogue corpora for the task, we did not apply machine/deep learning approaches. Terminological components can mitigate the needs of a domain that is rich in variant terms and acronyms but lacks open training data; this is the main asset of our system. Once enough dialogue logs are collected through a rule- and terminology-based system, they can be used to train models that complement the dialogue policy manager, or to generate word embeddings for OOV terms; this is left for future work. The naturalness of system replies also needs refinement, especially the way the system simplifies long sentences and renders negative symptoms and layman terms. We are also interested in evaluating the system within the overall framework of a simulated consultation, where medical students would have to diagnose the patient. This would allow us to know whether the system helps students obtain all key elements of the history-taking step, and to ascertain whether students make a correct diagnosis. Finally, we need to gather dialogue data to evaluate the English and Spanish versions.

Lessons learned

Regarding development, several aspects demanded a heavy investment in resource creation: terminology components for concept mapping, update procedures to compare OOVs against the existing knowledge base, and linguistically motivated modules to transform the data created by medical trainers according to the patient's perspective. Moreover, misspellings in trainers' input called for spelling-correction tools. To fix OOV errors related to spelling mistakes, the most reliable approach would be to include a correction module in the back-office interface that trainer doctors use to create the patient record: input tokens could be matched against the system vocabulary, flagged when absent, and corrected by trainers before the interaction. Nevertheless, the developed modules were capable of adapting the system to new cases without causing problematic interactions, according to the end-user evaluation.
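One possible realisation of such a flagging step, sketched here with Python's standard difflib module, would match record tokens against the system vocabulary and propose close matches; the vocabulary excerpt and the misspelling are illustrative.

```python
import difflib

# Illustrative vocabulary excerpt; the real system has over 161,000 French terms.
VOCABULARY = {"douleur", "thoracique", "hypertension", "appendicectomie"}

def flag_misspellings(record_text: str, vocabulary: set[str]) -> dict[str, list[str]]:
    """Flag tokens absent from the vocabulary and suggest close matches
    for the trainer to confirm before the record goes live."""
    flagged: dict[str, list[str]] = {}
    for token in record_text.lower().split():
        if token not in vocabulary:
            flagged[token] = difflib.get_close_matches(token, vocabulary, n=3)
    return flagged

print(flag_misspellings("douleur thorracique", VOCABULARY))
# -> {'thorracique': ['thoracique']}
```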

Regarding system design and evaluation, we strongly advise that medical professionals be involved from the beginning. The closer to reality the patient data we received, the better the system could be tested and improved; and the more iterations were conducted to inspect logs and fix errors, the better the system was rated. Our evaluation revealed that experienced practitioners found the system less satisfactory, given their greater diagnostic experience and different perception of these tools. This highlights the importance of carefully choosing the target end-users and the impact of this choice on the framework design. This multi-case, adaptable VP system seems to fit medical students and interns, since they can tolerate infelicities in system replies and need to engage in the interaction to gain experience. A tool with canned answers would be rigid and would require more engineering to adapt to new cases. If no dialogue data are available for the task, collecting dialogue logs with potential end-users seems a feasible first step before data-intensive methods (machine or deep learning) can be applied. Finally, this system is not yet suited for simulating VPs with chronic conditions that need follow-up consultations: evolving symptoms would require a more advanced model of the VP's disease timeline.

Overall, the tradeoff between adaptability and naturalness has design implications related to immediate versus long-term needs, and to sophisticated case-specific versus generic applications. Table 12 (Appendix) outlines our observations.

Conclusion

Medical doctors need to master medical history taking, and these abilities may be enhanced through practice with software simulations. To complement direct contact with patients, we proposed a dialogue system for simulating the interview with multiple virtual patient cases. Because this system features interaction through natural language, it provides favourable conditions for improving medical students' anamnesis skills. We reported here the usability evaluation of the French system and assessed the extent to which it is mature enough for a real use context.

The agent was tested with 35 different cases from 18 different specialities. Medical evaluators considered that this system provides quality dialogue through natural language, that it does so across heterogeneous cases and medical specialities, and that it processes new records without quality loss compared to already known cases. Our usability evaluation showed that this multi-case system can support student training in history taking and provided us with lessons we thought useful to share regarding its strengths and limits.