Introduction

Developing diagnosis and clinical reasoning skills is a key element of medical education. In addition to clinical practice, medical students and practitioners can enhance these abilities by means of mannequins, role play and simulation systems. These have shown beneficial results [1,2,3,4,5,6,7] and are currently integrated in virtual patients [8,9,10,11,12,13,14,15]. Virtual patients (VPs) are software applications through which students can train themselves by emulating the role of health providers [16].

Ideally, a VP simulation system should simulate a patient in all consultation stages. Taking the patient's medical history (anamnesis) is an essential but difficult-to-master skill: real consultations occur in time-restricted settings, and there is a language-level gap in doctor-patient communication. Given the health implications, doctors need training to acquire these skills so that they can assess patients' conditions and make a correct diagnosis.

Natural language dialogue systems (chatbots or conversational agents) have been integrated in healthcare applications [17,18,19] and VP simulation environments. Interaction modules allow trainees to simulate history taking, mostly through constrained input, e.g. lists of questions and answers prepared for a specific case [11, 20,21,22,23,24,25]. Other methods for processing user input use rules, ontologies and knowledge bases [26, 27], statistical language models [28], machine-learning classifiers [29], crowd-sourced data [22] and preliminary neural approaches [30, 31]. Some systems feature automatic speech recognition [32,33,34]. However, very few virtual patients support dialogue through natural language [34], humans' inherent mode of communication, which might result in more natural interaction with a conversational agent [35, 36].

A successful interaction relies both on the type of technology and on the degree to which the VP helps users acquire clinical reasoning and history-taking skills. For this purpose, interacting with a wide range of cases is beneficial [36]. Accordingly, a VP system should provide simulations covering a variety of clinical specialities. Most systems, nonetheless, only deal with one or a few conditions [33, 34, 37,38,39,40,41,42,43]. Very few systems cope with diverse pathologies [22, 44].

Objectives

Our objective was to overcome the limitation of the scarce number of simulated cases by designing a dialogue-enabled VP system that can cope with a variety of clinical conditions. We hypothesise that a multi-case VP simulation system can be achieved if medical trainers can create VPs easily, through a graphical interface (Fig. 7, Appendix), without any programming or intervention from the development team. The description of the clinical case, in the form of a semi-structured record, is typed offline in natural language; the dialogue system then embodies a patient based on each clinical case.

Accordingly, a first requirement of the system is to cope with new content across medical specialities. The second requirement is to provide unconstrained input, because the system aims at improving medical students' history-taking skills through interaction with the VP. Figure 1 shows a sample dialogue and illustrates natural dialogue phenomena. The system is integrated in a serious game developed with partner companies and a medical team [45]. The software features an animated avatar with text-to-speech, lip synchronisation and minor gestures.

Fig. 1 Sample of an actual dialogue of a medical student (D for Doctor) with a virtual patient (P); the transcript comes from a session with the English version of our system

To make the system able to handle a large number of cases, we gave it extensive conceptual and terminological coverage of the domain [27, 46]. The system can also adapt to new records dynamically: we provided it with components that detect out-of-vocabulary words (OOVs) and predict the morphological information of these missing words. The system with adaptation modules is available in French; English and Spanish versions are available but less well supported.

This article reports a usability evaluation of the French system, where we assessed, in a simulated history-taking setting:

  1. Q1: Whether a multi-case system can provide quality dialogue (with regard to grammar and on-topic and realistic replies) through natural language across clinical cases.

  2. Q2: Whether quality dialogue is maintained when processing unseen records across medical specialities.

We evaluated these aspects through user experiments in a real context. Study participants (n = 39) interacted in French with the dialogue system and then evaluated their dialogue. A graphical abstract (Fig. 9, Appendix) summarizes our work.

Material and methods

Dialogue system architecture

To tackle the task, we first designed a patient record model, which defines a virtual patient's health state in a semi-structured format; Table 9 (Appendix) shows an example. Second, we defined a knowledge model for the task, i.e. a scheme of question types, dialogue acts and entity types concerning the anamnesis. Third, we created a termino-ontological model, which hosts structured thesauri for managing the variation of terms [46, 47]. Figure 2 shows a schema of the different stages (which occur asynchronously): case creation by an instructor (1), comparison and analysis of a new record (2), and dialogue by a student (3).
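For illustration, a record following this semi-structured model could look roughly like the sketch below (a Python dictionary); the field names and values are hypothetical and do not reproduce the actual schema shown in Table 9.

```python
# Hypothetical semi-structured patient record; field names and values are
# illustrative only, not the schema of Table 9 in the Appendix.
patient_record = {
    "identity": {"age": 54, "sex": "female", "occupation": "teacher"},
    "chief_complaint": "chest pain since this morning",
    "history_of_present_illness": {
        "pain": {"location": "retrosternal", "intensity": "7/10",
                 "onset": "3 hours ago", "radiation": "left arm"},
    },
    "medical_history": ["hypertension", "appendectomy (1998)"],
    "medications": ["amlodipine 5 mg daily"],
    "allergies": [],
    "lifestyle": {"smoking": "10 cigarettes/day", "alcohol": "occasional"},
}
```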

Fig. 2 Schema of the virtual patient dialogue system and update components

We designed the system following a knowledge-based and rule-frame-based approach [27]. The user—typically a medical student or resident—types text. A natural language understanding (NLU) module performs the linguistic and semantic processing (e.g. pain is a symptom). A semantic frame is fed to a dialogue manager, which keeps track of the dialogue state and context information, queries the record, selects the information and replies through a template-based generation module (Fig. 3).
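To make this pipeline more tangible, the following toy sketch (an assumption, not the actual implementation) shows how an utterance could be turned into a semantic frame, used to query the record and answered through a template; the function names and matching rules are purely illustrative.

```python
import re

# Minimal record excerpt (illustrative; see Table 9 for the real format).
RECORD = {"pain": {"location": "retrosternal", "onset": "3 hours ago"}}

def nlu(utterance: str) -> dict:
    """Toy NLU: map a doctor's question to a semantic frame.
    The real module relies on rules and the termino-ontological model."""
    text = utterance.lower()
    frame = {"dialogue_act": "question", "entity": None, "attribute": None}
    if re.search(r"\b(pain|ache|hurts?)\b", text):
        frame["entity"] = "pain"
        if "where" in text:
            frame["attribute"] = "location"
        elif "when" in text or "since" in text:
            frame["attribute"] = "onset"
    return frame

def dialogue_manager(frame: dict, record: dict, state: list) -> str:
    """Query the record with the semantic frame and fill a reply template."""
    state.append(frame)                                   # keep dialogue context
    attribute = frame.get("attribute")
    if frame.get("entity") == "pain" and attribute in record["pain"]:
        return f"It is {record['pain'][attribute]}."      # template-based generation
    return "I am not sure I understand, doctor."          # defer instead of guessing

state: list = []
print(dialogue_manager(nlu("Where is the pain?"), RECORD, state))  # -> It is retrosternal.
```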

Fig. 3 Example of functioning of the dialogue system from input to output. The patient record is simplified; Table 9 shows a full example

The termino-ontological model contains lexical resources for processing linguistic variation: inflection (e.g. lung \(\leftrightarrow \) lungs), derivation (e.g. face \(\leftrightarrow \) facial), synonymy (e.g. operation \(\leftrightarrow \) surgery) and mapping between full words and affixes/roots (e.g. heart \(\leftrightarrow \) cardio-). The model also defines domain relations and concepts for processing and normalising the variety of terms in a case: e.g. pain and ache refer to the same concept. These resources support a key feature of the system: its ability to map the doctor's language to the patient's language to better simulate a real patient. We populated this model with large general and domain resources (e.g. the Unified Medical Language System® [48]). Our lexicons contain domain lists (over 161,000 terms in French, 116,000 in English, and 103,000 in Spanish) and dictionaries (over 959,000 word/concept entries in French, 1,886,000 in English, and 1,428,000 in Spanish).
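As an illustration of how such resources can be used, the sketch below normalises surface variants to a shared concept and maps a technical term to lay language; the entries are invented examples, not the content of our lexicons.

```python
# Illustrative excerpts of termino-ontological resources (invented entries).
SYNONYMS = {"ache": "pain", "surgery": "operation", "cephalalgia": "headache"}
INFLECTION = {"lungs": "lung", "operations": "operation"}
LAY_TERMS = {"myocardial infarction": "heart attack", "cephalalgia": "headache"}

def normalise(term: str) -> str:
    """Map a surface form to its normalised concept label."""
    term = term.lower()
    term = INFLECTION.get(term, term)      # inflection: lungs -> lung
    term = SYNONYMS.get(term, term)        # synonymy: ache -> pain
    return term

def to_patient_language(term: str) -> str:
    """Render a doctor's technical term the way a lay patient might say it."""
    return LAY_TERMS.get(term.lower(), term)

assert normalise("ache") == normalise("pain")          # same concept
print(to_patient_language("myocardial infarction"))    # -> heart attack
```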

Although these resources allow the system to handle a large number of cases, medical jargon evolves continually with neologisms. Unknown out-of-vocabulary words (OOVs) might cause incorrectly generated replies, because the system lacks the linguistic information needed for morphological agreement. We thus developed methods to predict the Part-of-Speech (PoS) and gender/number of OOVs (see the bottom of Table 9 in the Appendix). Multiple approaches run in parallel: dictionary-based lookup, and inference from the linguistic context or from the base form/affixes (Fig. 6 in the Appendix). Their outputs are combined using heuristic weights set during development. This prediction is executed offline whenever an instructor creates or modifies a case. Figure 8 (Appendix) gives more technical details of the system components.
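A minimal sketch of this weighted combination is given below; the predictor names, weights and example OOV are illustrative assumptions, not the values tuned during development.

```python
from collections import defaultdict

def combine_predictions(predictions: list[tuple[str, str, float]]) -> str:
    """Combine PoS guesses from several predictors with heuristic weights.
    `predictions` holds (predictor_name, pos_tag, weight) triples; the
    weights used here are illustrative."""
    scores: dict[str, float] = defaultdict(float)
    for _, pos_tag, weight in predictions:
        scores[pos_tag] += weight
    return max(scores, key=scores.get)

# Example for a hypothetical OOV: each method votes independently.
votes = [
    ("dictionary_lookup", "UNKNOWN", 0.0),   # not found in the lexicons
    ("context_inference", "ADJ", 0.6),       # sentence context suggests an adjective
    ("suffix_inference", "ADJ", 0.8),        # French suffix "-ique" suggests an adjective
]
print(combine_predictions(votes))            # -> ADJ
```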

Evaluation design

To assess whether the system provides quality dialogue across clinical cases (Q1), potential end-users (n = 39) tested 35 different VPs. Medical students, interns and expert practitioners conducted medical history taking in French with a VP and evaluated the system's performance over several evaluation rounds under two types of conditions (Table 1). Some sessions used unseen cases that had just been created; we did not modify the system between creation and use. Other sessions used already seen cases, created earlier, for which we had fine-tuned the system manually. The system evolved over the evaluation rounds and improved gradually as we corrected the errors found in interaction logs.

Table 1 Evaluation rounds and medical specialities

The medical evaluators had varied profiles (Table 2) and some participated in multiple evaluation rounds. Medical instructors created the content of 6 seen and 23 unseen cases. A co-author of this paper (LC) input the records of 6 unseen cases using the wording of clinical cases from the French national classifying examinations for medical students. Tables 10 and 11 (Appendix) provide a brief description of each case.

Table 2 Medical evaluators’ profiles

We first conducted a user evaluation by means of 5-point Likert-scale questionnaires ranging from 1 (Very poor) to 5 (Very good). After each interaction, evaluators assessed the system on nine aspects (Table 3) drawn from evaluation frameworks for dialogue systems [49, 50]. Evaluators were given instructions on the types of utterances the system can process, and a link to the online questionnaire.

Table 3 Description of aspects addressed in the qualitative evaluation; scores ranged from 5 (Very good) to 1 (Very poor)

We also evaluated the dialogue system's correctness. We gathered data from the dialogues with all 35 VP cases, analysed the dialogue logs and quantified the number of correct replies. We considered correct those replies giving a coherent answer (consistent with the user input and accurate with regard to the data in the record). Table 6 (Appendix) describes some examples of correct, incorrect and deferred replies. An author of this paper (LC) annotated all data; another author (SR) checked the annotations of a subset of 84 (2%) turn-reply pairs that were difficult to classify, and a consensus was reached. We computed the kappa agreement between the two annotators.
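For readers who wish to reproduce this kind of agreement measure, a minimal sketch using scikit-learn is shown below; it assumes Cohen's kappa for two annotators (the variant is not specified above) and uses invented labels.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels for doubly-annotated turn-reply pairs
# (1 = correct, 0 = incorrect, 2 = deferred); data are illustrative.
annotator_lc = [1, 1, 0, 2, 1, 0, 1, 1, 2, 0]
annotator_sr = [1, 1, 0, 2, 1, 1, 1, 1, 2, 0]

print(cohen_kappa_score(annotator_lc, annotator_sr))
```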

To evaluate whether quality dialogue is maintained with new cases (Q2), we compared the evaluation scores given to seen and unseen cases (Table 1). 26 of the 39 medical evaluators assessed the 6 seen VP cases (50 questionnaires), and 23 of the 39 evaluators assessed the 29 unseen cases (67 questionnaires); some evaluators assessed both seen and unseen cases. We conducted two-tailed t-tests and Mann-Whitney tests, using the Prism 5 software, to determine whether the differences in scores were statistically significant.
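Although we used Prism 5, an equivalent check could be scripted as follows; the score vectors are invented for illustration.

```python
from scipy.stats import mannwhitneyu, ttest_ind

# Illustrative Likert scores (1-5) given to one aspect for seen vs unseen cases.
seen_scores = [3, 4, 3, 3, 4, 3, 4, 2, 3, 4]
unseen_scores = [4, 4, 5, 3, 4, 4, 5, 4, 3, 4]

u_stat, p_mw = mannwhitneyu(seen_scores, unseen_scores, alternative="two-sided")
t_stat, p_t = ttest_ind(seen_scores, unseen_scores)   # two-tailed by default

print(f"Mann-Whitney p = {p_mw:.3f}, t-test p = {p_t:.3f}")
```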

To measure the diversity of the unseen cases, we counted the word types (i.e. different word forms) appearing in only one record, and the types shared across different cases. The unseen cases belong to 14 specialities (Table 1). We analysed how scores varied according to evaluators’ profiles.
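The type-counting procedure described above could be sketched as follows, assuming a naive whitespace tokenisation (the actual tokenisation is handled by the system's linguistic resources); the records are invented.

```python
from collections import Counter

def type_diversity(records: dict[str, str]) -> float:
    """Share of word types (unique forms) occurring in only one record.
    Tokenisation is naively whitespace-based here, for illustration only."""
    type_counts: Counter[str] = Counter()
    for text in records.values():
        for word_type in set(text.lower().split()):
            type_counts[word_type] += 1        # count records containing the type
    unique = sum(1 for count in type_counts.values() if count == 1)
    return unique / len(type_counts)

records = {
    "case_cardio": "douleur thoracique depuis ce matin",
    "case_neuro": "céphalées et vertiges depuis hier matin",
}
print(type_diversity(records))   # fraction of types found in a single record
```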

Results

Quality of natural language dialogue

Each case was tested by an average of 3.74 evaluators (± 2.8; minimum number of evaluators per case = 1; maximum = 13). Panels A and B of Fig. 4 display the average evaluator scores for the seen and unseen cases respectively. Lower scores are placed to the left of each Y axis; neutral scores, in the middle; and higher scores, to the right. The bars show the cumulated percentages of evaluator scores that were Very good, Good, Neutral, Poor and Very poor. For example, in the seen cases, performance was assessed as Very good by 6% of the evaluators, as Good by an additional 52% of evaluators, as Neutral by 28% of them, and as Poor by the remaining 14%. The overall average score, obtained by averaging the mean scores given to the 9 evaluated aspects, was 3.84 out of 5 for seen cases and 4.05 for unseen cases, both above the Likert-scale midpoint. The proportion of dialogues with Poor or Very poor scores ranges from 16% (naturalness) to 0% (user-understanding) for seen cases, and from 6% (naturalness) to 0% (speed) for unseen cases.

Fig. 4 Results of the qualitative evaluation and comparison between seen cases (used in development) and unseen cases

Regarding system correctness, we analysed 8,078 turn-reply pairs from 131 dialogues (Tables 4 and 5). We removed 149 turn-reply pairs containing out-of-task questions or statements. The two researchers who double-checked the subset of turn-reply pairs had a kappa agreement of 0.827. In the full set of dialogue logs (seen and unseen cases), when analysed per medical speciality, an average of 74.3% (± 9.5) of system replies were correct (min = 53.6%, max = 93.8%), i.e. answers were coherent with regard to inputs and provided accurate information from the record. An average of 14.9% (± 6.3) of system replies were incorrect; however, only 2 errors were caused by unseen words. Incorrect replies affected the system's faithfulness (26.5%), the dialogue flow (56.2%) and the exhaustiveness of the information provided by the virtual patient (17.3%) (Table 8, Appendix). The system determined that the remaining questions were beyond the dialogue task and answered I do not understand (an average of 7.8% ± 5.3) or asked for more precision (an average of 2.9% ± 2.7). This defers giving an incorrect reply and accounts for an additional 10.7% of correct system behaviour on average, despite having a negative impact on the dialogue flow. When analysing the data per dialogue, the results obtained were very similar (Table 5).

Table 4 Evaluation data for all collected dialogues (#d = 131): #T: count of turns; #W: count of words; stdev: standard deviation; #U/d: average turns per dialogue; #W/d: average words per dialogue
Table 5 Evaluation of system correctness expressed as average percentage (±standard deviation) [minimum - maximum]

Performance with unseen cases across specialities

Panels A and B of Fig. 4 display, respectively, the proportion of scores given to each aspect for the 6 seen and 29 unseen cases. Evaluators rated every aspect better in the unseen cases. The differences in evaluation scores were statistically significant for the following aspects: system performance (a mean of 3.50 (95% CI[3.27-3.73]) for seen cases versus 3.81 (95% CI[3.64-3.97]) for unseen cases, p-value = 0.029, Mann-Whitney test), coherence in replies (a mean of 3.38 (95% CI[3.18-3.58]) for seen cases versus 3.73 (95% CI[3.61-3.86]) for unseen cases, p = 0.004, Mann-Whitney test), informativeness (a mean of 3.78 (95% CI[3.58-3.98]) for seen cases versus 4.03 (95% CI[3.86-4.20]) for unseen cases, p = 0.047, Mann-Whitney test) and system-understanding (a mean of 3.44 (95% CI[3.22-3.66]) for seen cases versus 3.90 (95% CI[3.72-4.07]), p = 0.001, t-test).

We also examined the variation of scores along evaluation rounds; panels C-E in Fig. 4 show the average scores for each aspect. When we compared the scores given in the first evaluation round (using seen cases) with those in the last round (using unseen cases), the following aspects showed statistically significant differences: performance (a mean of 3.48 (95% CI[3.21-3.74]) in the first round versus 4.00 (95% CI[3.86-4.14]) in the last round, p = 0.003, Mann-Whitney test), coherence (a mean of 3.31 (95% CI[3.09-3.53]) in the first round versus 3.76 (95% CI[3.56-3.95]) in the last round, p = 0.005, t-test), informativeness (a mean of 3.69 (95% CI[3.48-3.90]) in the first round versus 4.03 (95% CI[3.87-4.19]) in the last round, p = 0.018, Mann-Whitney test), concision (a mean of 4.00 (95% CI[3.76-4.24]) in the first round versus 4.59 (95% CI[4.40-4.78]) in the last round, p = 0.001, Mann-Whitney test), and system-understanding (a mean of 3.36 (95% CI[3.11-3.60]) in the first round versus 4.07 (95% CI[3.89-4.24]) in the last round, p < 0.0001, t-test).

Figure 5 plots the evaluation scores of the unseen cases grouped by speciality. From a qualitative point of view, we could not find any speciality that would consistently obtain scores below the others; outlier values correspond to cases where few dialogues were conducted.

Fig. 5 Qualitative evaluation across medical specialities and evaluator profiles. The size of each point expresses the number of dialogues conducted: 1–5 (small size), 6–10 (medium size) and > 10 (large size). The abbreviations of specialities are given in Table 1

Concerning the diversity of the vocabulary, the unseen cases contained 1,488 types (unique word forms). 1,017 types (68.4%) appeared in only one record; in other words, only about one third of the types (31.6%) occurred in more than one case. The average proportion of unique types per record is 34.6% (± 7.4). These numbers show the extent to which the lexical content of each unseen case differs from that of the others.

We also analysed the quantity of out-of-vocabulary words (OOVs) in the unseen cases. Out of the 1,488 types in the unseen cases, only 33 words (2.5%) were missing from the system resources (avg = 1.2 OOVs per case, ± 1.66); that is, our resources covered 97.5% of the vocabulary in the 29 new cases. Our analysis showed that most OOVs were spelling mistakes made when inputting data to create a new record. Our methods predicted the PoS category of these OOVs with a precision of 69.8%, a recall of 76.9%, and an F-measure of 73.2% (micro-average). For the OOVs whose category was correctly predicted, our methods for predicting morphological data achieved a precision of 59.4%, a recall of 61.3%, and an F-measure of 60.3% (micro-average). Table 7 (Appendix) shows further details about our results per category.
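As a reading aid, the reported F-measures follow from the usual harmonic mean of precision and recall; for the PoS prediction, for instance:

\[ F_1 = \frac{2PR}{P+R} = \frac{2 \times 0.698 \times 0.769}{0.698 + 0.769} \approx 0.732 \]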

Lastly, Fig. 5 (bottom right) depicts differences in assessment according to the evaluators' profiles. The average scores of most or all evaluator profiles agreed on user-understanding, speed, tediousness and concision. Students and residents gave higher average scores for system performance, coherence of replies, informativeness, and system- and user-understanding. Senior practitioners or instructors generally gave lower scores.

Discussion

The quality of the natural language dialogue in seen and unseen cases received very positive, positive, or neutral judgements from between 93% and 100% of the evaluators, allowing us to answer Q1 positively. System performance and coherence of replies received Good and Very good scores and overall satisfaction was high with an average of 3.84 (seen cases) and 4.06 (unseen cases) across all aspects. We cannot compare the error rate with other works (e.g. [34]) without bias, since we tested more patient cases.

Regarding Q2, in the test on unseen cases, every aspect received a higher user evaluation score than for seen cases, and the improvement in some aspects was statistically significant. The system was robust enough to cope with new cases without quality loss, and its vocabulary coverage of unseen cases was very high (97.5%). Overall, we tested 35 different cases covering 18 medical specialities which, to the best of our knowledge, is a much larger number than reported so far in the literature.

The unseen cases covered varied medical specialities; from a qualitative point of view, we could not identify any speciality that was consistently handled less well. Analysing this aspect from a quantitative perspective would require a larger number of dialogues in each speciality. The comparison of scores across evaluators' profiles showed that medical students and residents evaluated the system more favourably. This is encouraging since they are the primary target users of the system.

The correctness rate of system replies varied across cases largely due to the content of each record: e.g. performance was lower in a postpartum case, where some questions referred to the patient's newborn, but the system could not distinguish them from those related to the VP herself. Our analysis of logs across cases revealed that most errors were due to the lack of variants of question formulations, missing question types, or processing errors (Table 6, Appendix). These weaknesses require fallback strategies, which we explored using machine learning [51].

At a technical level, we want to improve the performance of the dialogue manager and the comparison and update procedures. Given the lack of dialogue corpora for the task, we did not apply machine/deep learning approaches. Terminological components can mitigate the needs of a domain that is rich in variant terms and acronyms but lacks open training data; this is the main asset of our system. Once enough dialogue logs are collected through a rule- and terminology-based system, they can be used to train models that complement the dialogue policy manager, or to generate word embeddings for OOV terms; this is left for future work. The naturalness of system replies also needs refinement, especially the way the system simplifies long sentences and renders negative symptoms and layman terms. We are also interested in evaluating the system within the overall framework of a simulated consultation, where medical students would have to diagnose the patient. This would allow us to know whether the system helps students obtain all key elements of the history-taking step, and to ascertain whether students make a correct diagnosis. Finally, we need to gather dialogue data to evaluate the English and Spanish versions.

Lessons learned

Regarding development, several aspects demanded a heavy investment in resource creation: terminology components for concept mapping, update procedures to compare OOVs against the existing knowledge base, and linguistically motivated modules to transform the data created by medical trainers according to the patient's perspective. Moreover, misspellings in trainers' input called for spelling-correction tools. To fix OOV errors related to spelling mistakes, the most reliable approach would be to include a correction module in the back-office interface that trainer doctors use to create the patient record: input tokens could be matched against the system vocabulary, flagged when absent, and corrected by trainers before the interaction. Nevertheless, the developed modules were capable of adapting the system to new cases without causing problematic interactions, according to the end-user evaluation.
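One possible realisation of such a flagging step, sketched here with Python's standard difflib module, would match record tokens against the system vocabulary and propose close matches; the vocabulary excerpt and the misspelling are illustrative.

```python
import difflib

# Illustrative vocabulary excerpt; the real system has over 161,000 French terms.
VOCABULARY = {"douleur", "thoracique", "hypertension", "appendicectomie"}

def flag_misspellings(record_text: str, vocabulary: set[str]) -> dict[str, list[str]]:
    """Flag tokens absent from the vocabulary and suggest close matches
    for the trainer to confirm before the record goes live."""
    flagged: dict[str, list[str]] = {}
    for token in record_text.lower().split():
        if token not in vocabulary:
            flagged[token] = difflib.get_close_matches(token, vocabulary, n=3)
    return flagged

print(flag_misspellings("douleur thorracique", VOCABULARY))
# -> {'thorracique': ['thoracique']}
```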

Regarding system design and evaluation, we strongly advise that medical professionals be involved from the beginning. The closer to reality the patient data we received, the better the system could be tested and improved; and the more iterations were conducted to inspect logs and fix errors, the better the system was rated. Our evaluation revealed that experienced practitioners found the system less satisfactory, given their greater diagnostic experience and different perception of these tools. This highlights the importance of carefully choosing the target end-users and the impact of this choice on the framework design. This multi-case, adaptable VP system seems to fit medical students and interns, since they can tolerate infelicities in system replies and need to engage in the interaction to gain experience. A tool with canned answers would be rigid and would require more engineering to adapt to new cases. If no dialogue data are available for the task, collecting dialogue logs with potential end-users seems a feasible first step before data-intensive methods (machine or deep learning) can be applied. Finally, this system is not yet suited for simulating VPs with chronic conditions that need follow-up consultations: evolving symptoms would require a more advanced model of the VP's disease timeline.

Overall, the tradeoff between adaptability and naturalness has design implications related to immediate versus long-term needs, and to sophisticated case-specific versus generic applications. Table 12 (Appendix) outlines our observations.

Conclusion

Medical doctors need to master medical history taking, and these abilities may be enhanced through practice with software simulations. To complement direct contact with patients, we proposed a dialogue system for simulating the interview with multiple virtual patient cases. Because this system features interaction through natural language, it provides favourable conditions for improving medical students' anamnesis skills. We reported here the usability evaluation of the French system and assessed the extent to which it is mature enough for a real use context.

The agent was tested with 35 different cases from 18 different specialities. Medical evaluators considered that this system provides quality dialogue through natural language, that it does so across heterogeneous cases and medical specialities, and that it processes new records without quality loss compared to already known cases. Our usability evaluation showed that this multi-case system can support student training in history taking and provided us with lessons we thought useful to share regarding its strengths and limits.