Background

History-taking is an essential skill for becoming a competent doctor and a fundamental component of work across medical fields [1]. It typically covers general data, the chief complaint, history of present illness, past medical history, family history, social history, and a review of systems. Because it directs subsequent examinations, diagnosis, and treatment choices, gathering the patient history is the first and most important stage in identifying medical conditions [2]. It is therefore vital to train medical professionals in history-taking before they engage in clinical practice [3,4,5,6].

Currently, the most common method for teaching history-taking combines theoretical instruction with simulation-based education, with standardized patients (SP) as the primary simulation method. SP are individuals who have undergone standardized, systematic training to accurately, realistically, and consistently portray the characteristics, psychosocial features, and emotional responses of specific medical cases [7, 8]. Doctor-patient communication encompasses both verbal and non-verbal components, which are equally important [2, 9]. During the diagnostic process, doctors collect information from patients by observing their facial expressions and body language, and likewise use their own body language and facial expressions to encourage patients and put them at ease [10, 11]. The United States first used SP for clinical teaching in the 1960s, and China adopted the practice in the 1990s [12]. SP teaching is a valuable bridge between theoretical instruction and clinical practice: it facilitates the simulation of authentic medical scenarios without ethical concerns, boosts student engagement, enhances clinical communication skills, supports the acquisition of medical knowledge, and promotes a deeper grasp of abstract concepts [13].

However, the use of SP in medical education has its own challenges. SP training is rigorous, time-consuming, and resource-intensive, so qualified SP are in short supply [14, 15], and subjective factors cannot be avoided in SP-based evaluation [16]. This scarcity makes one-on-one history-taking training difficult to implement effectively. The virtual standardized patient (VSP) offers a potential solution to these limitations. As early as the beginning of the twenty-first century, research suggested using computers to aid history-taking exercises [17,18,19,20], but VSP has not been widely adopted. Implementing VSP in history-taking instruction can address the limitations of SP: it reduces the lengthy training time and costs associated with SP, allows repeated practice [21, 22], facilitates the assessment of teaching effectiveness [23], and thereby boosts student confidence [24, 25]. It also reduces the potential subjectivity of both instructors and SP, enabling a more objective and standardized evaluation [23].

To meet these needs, we developed a VSP that combines speech recognition, intention recognition, and automatic scoring. The VSP initially relied on sentence-similarity matching and was later upgraded to intention recognition. This article explores the accuracy of the VSP and assesses whether the upgraded system is accurate enough to support diagnostic teaching and performance evaluation. The research highlights certain limitations of SP-based training and examines the application accuracy of our independently developed VSP, with the goal of establishing a foundation for more effective teaching strategies.

Methods

VSP

This study utilizes a virtual standardized patient history-taking system jointly developed by our institution and Shanghai Chuxin Medical Technology Co., Ltd., based on speech recognition and intent recognition technology. The system operates in both a human–computer dialogue mode and a human–computer collaborative mode, as detailed in Fig. 1. This research employs the latter.

Fig. 1
figure 1

Human–computer dialogue mode and human–computer collaborative mode of VSP a human–computer dialogue mode; b human–computer collaborative mode

The system first converts the spoken dialogue into text. The sentences are then segmented into phrases; after part-of-speech recognition and classification, they are compared against the intent templates stored in the intent library to generate assessments and comments. After gathering all the data, the system performs self-learning to adjust the corpus [26]. The specific process is illustrated in Fig. 2.

Fig. 2
figure 2

The specific process of the virtual standardized patient history-taking system based on speech recognition technology and artificial intelligence

The VSP underwent a general system replacement during the experiment: VSP 1.0 is the old version, and VSP 2.0 and VSP 3.0 are the new versions. VSP 1.0, the old version, compares sentences and keywords with standard statements; an utterance is treated as the same statement when it is sufficiently similar to a standard statement [27]. In the updated system, VSP 2.0 divides sentences into phrases, categorizes the phrases, and matches them against intent templates [28]. VSP 3.0 is VSP 2.0 after self-learning optimization [29,30,31,32]. The differences between the old and new versions are shown in Fig. 3.
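The contrast between the two matching strategies can be sketched as follows. This is a minimal illustration only: the thresholds, similarity measure, and keyword sets are assumptions, not the system's actual implementation.

```python
# Illustrative contrast between VSP 1.0-style surface similarity and
# VSP 2.0-style intent-template matching (all parameters hypothetical).
from difflib import SequenceMatcher

def matches_v1(utterance: str, standard: str, threshold: float = 0.8) -> bool:
    """VSP 1.0-style: accept when surface similarity to a standard
    statement exceeds a threshold."""
    ratio = SequenceMatcher(None, utterance.lower(), standard.lower()).ratio()
    return ratio >= threshold

def matches_v2(utterance: str, template_keywords: set[str]) -> bool:
    """VSP 2.0-style: accept when every keyword of an intent template
    appears among the utterance's phrases."""
    phrases = set(utterance.lower().strip("?!. ").split())
    return template_keywords <= phrases
```

The design difference is that surface similarity penalizes rephrasing ("When did your cough begin?" scores low against "When did the cough start?"), whereas template matching tolerates varied wording as long as the intent-bearing phrases are present.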

Fig. 3
figure 3

Comparison of old and new VSP

Design, setting and subjects

We adopted a prospective study design. The Biomedical Ethics Committee of West China Hospital, Sichuan University approved this study (Approval 2019 No.1071). We applied different versions of the VSP, in human–computer collaborative mode, to assess the clinical performance of medical students with no prior clinical experience and of residents. Clinical medical students were recruited from the annual diagnostics course, and residents from the enhanced training sessions, conducted over three years. Participants willing to use the VSP were recruited from these two courses so that their history-taking scores could be compared with those given by SP. Informed consent was obtained from all participants before the tests, and they were informed that the results of this study would not affect their final course grades. All participants had previously received theoretical instruction in history-taking.

Measurements

In this study, we determined the application accuracy of the VSP, comprising speech recognition accuracy, intention recognition accuracy, and scoring accuracy. Speech recognition accuracy is the ratio of correctly recognized characters to the total number of characters. Intent recognition accuracy is the ratio of correctly matched phrases to the total number of phrases. The system computes both automatically and sets aside intent matches with a probability below 80% for manual review. Two technicians reviewed the results; if their judgments differed, a third technician made the final decision. The score covered both the content and the skill of history-taking across 70 scoring points, with point values varying between items. The scoring scale was established by the Department of Diagnostics' multidisciplinary team, has been validated and applied for many years, and underwent minor modifications based on actual conditions to ensure quality control. Because SP are highly trained and experienced, we treated their scores as the gold standard: scoring accuracy was calculated as the ratio of scoring points on which the VSP and SP scores agreed to the total number of scoring points.
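The three measures above are all simple ratios, and the 80% rule is a filter on match probabilities. A minimal sketch (the numbers used below are invented for illustration, not study data):

```python
# Illustrative computation of the accuracy measures defined above.

def accuracy(correct: int, total: int) -> float:
    """Ratio of correct items to total items (characters for speech
    recognition, phrases for intent recognition, scoring points for
    scoring accuracy)."""
    return correct / total if total else 0.0

def flag_for_review(match_probabilities: list[float], threshold: float = 0.8) -> list[int]:
    """Indices of intent matches whose probability falls below the 80%
    threshold and so would be set aside for manual review."""
    return [i for i, p in enumerate(match_probabilities) if p < threshold]
```

For example, 95 correctly recognized characters out of 100 gives a speech recognition accuracy of 0.95, and a match probability list of [0.95, 0.70, 0.85] would send only the second match to the technicians.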

Procedure

Each participant randomly selected one of four cases (diarrhea, syncope, palpitation, cough) during the assessment. Throughout the assessment, the SP acted as a patient and interacted with the participant performing the role of doctor, while the VSP was placed alongside without responding, collecting information for real-time scoring. Two sets of scores were thus obtained, one from the SP and one from the VSP. After the examination, participants were invited to complete relevant questionnaires voluntarily (results are shown in the appendices): following the SP history-taking session, both SP and VSP scores were obtained simultaneously, and participants provided feedback on the VSP after comparing them. Speech and intent recognition employ mature commercial technologies, with accuracy automatically generated by the system from the corpus. Finally, technicians reviewed the texts and recordings to verify and adjust the accuracy values provided by the system. See Fig. 4 for details.

Fig. 4
figure 4

The procedure of the study

Data analysis

Since the data were not normally distributed, we used the Wilcoxon rank sum test to compare SP and VSP scores. The Kruskal–Wallis one-way ANOVA was used to test the significance of differences in the scoring, speech recognition, and intent recognition accuracy of the VSP across versions in each case, with post hoc pairwise comparisons using the Bonferroni correction. The independent t-test was used to compare the scoring, speech recognition, and intent recognition accuracy of VSP 3.0 between medical students and residents in each case. p < 0.05 was considered statistically significant.
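The Bonferroni correction used for the post hoc pairwise comparisons simply scales each raw p-value by the number of comparisons, capping at 1. A minimal sketch (the p-values below are illustrative, not study results):

```python
# Bonferroni adjustment for a family of pairwise comparisons.

def bonferroni(p_values: list[float]) -> list[float]:
    """Multiply each raw p-value by the number of comparisons,
    capping the adjusted value at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

With two pairwise comparisons, raw p-values of 0.25 and 0.5 become 0.5 and 1.0, which is why a raw p-value must be well below 0.05 to survive adjustment across the three version pairs compared here.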

Results

Demographic characteristics

A total of 502 participants took part in the study over the three years. Of these, 476 were included in the final analysis; data for 26 were not recorded because of VSP 1.0 system problems. Among the included participants, 89 medical students used VSP 1.0, 129 used VSP 2.0, and 104 used VSP 3.0; all 154 residents used VSP 3.0. Statistics by version and randomly selected case are shown in Table 1:

Table 1 Distribution of subjects (n = 476)

The medical students who used different versions of the VSP were at similar stages of learning history-taking and of comparable age. Residents had more clinical experience than medical students at the same stage of training.

Comparison of history-taking scores given by SP and VSP

The Wilcoxon rank sum test revealed significant differences between the scores given by SP and VSP for both medical students (Z = -8.194, p < 0.05; Z = -9.864, p < 0.05; Z = -8.867, p < 0.05) and residents (Z = -10.773, p < 0.05). VSP scores were generally lower than SP scores. The score distribution was skewed toward the high-score range, as shown in Table 2.

Table 2 Comparison of SP and VSP history-taking scores (n = 476)

Comparison of VSP application accuracy

Our study examined four distinct medical scenarios, comparing the application accuracy of the various VSP versions and determining whether accuracy differed between medical students and residents using VSP 3.0.

Comparison of VSP application accuracy in diarrhea cases

The results indicated significant differences in the application accuracy (H = 42.424, p < 0.001; H = 27.220, p < 0.001; H = 44.135, p < 0.001) among the three versions of the VSP system. Multiple mean comparisons revealed significant differences in scoring accuracy between VSP 1.0 and VSP 2.0 (p < 0.001), VSP 1.0 and VSP 3.0 (p < 0.001). The speech recognition accuracy between VSP 1.0 and VSP 3.0 (p < 0.001), and VSP 2.0 and VSP 3.0 (p < 0.001) was significantly different. Intent recognition accuracy was significantly different between VSP 1.0 and VSP 2.0 (p < 0.001), VSP 1.0 and VSP 3.0 (p < 0.001). The results are presented in Table 3.

Table 3 Comparison of VSP application accuracy in diarrhea cases (n = 76)

We then examined whether application accuracy differed between medical students and residents taught history-taking with VSP 3.0. The results showed significant differences in speech recognition accuracy (Z = -2.719, p = 0.007) and intent recognition accuracy (Z = -2.406, p = 0.016). The results are presented in Table 4.

Table 4 Comparison of the accuracy of VSP application in different groups of diarrhea cases (n = 66)

Comparison of VSP application accuracy in syncope cases

There were significant differences in the application accuracy (H = 34.506, p < 0.001; H = 27.233, p < 0.001; H = 38.485, p < 0.001). Multiple mean comparison results showed significant differences in scoring accuracy between VSP 1.0 and VSP 2.0 (p < 0.001), as well as between VSP 1.0 and VSP 3.0 (p < 0.001). Significant differences were observed in speech recognition accuracy between VSP 1.0 and VSP 2.0 (p < 0.001), between VSP 1.0 and VSP 3.0 (p = 0.016), and between VSP 2.0 and VSP 3.0 (p = 0.019). Intent recognition accuracy exhibited significant differences between VSP 1.0 and VSP 2.0 (p < 0.001), between VSP 1.0 and VSP 3.0 (p < 0.001), and between VSP 2.0 and VSP 3.0 (p = 0.036). The results are presented in Table 5.

Table 5 Comparison of VSP application accuracy in syncope cases (n = 69)

There were no significant differences in application accuracy (Z = -0.426, p = 0.670; Z = -0.216, p = 0.829; Z = -0.035, p = 0.972) between medical students and residents using VSP 3.0 in syncope cases. The results are presented in Table 6.

Table 6 Comparison of the accuracy of VSP application in different groups of syncope cases (n = 63)

Comparison of VSP application accuracy in palpitation cases

There were significant differences in the application accuracy (H = 71.858, p < 0.001; H = 23.986, p < 0.001; H = 77.121, p < 0.001). Multiple mean comparison results showed significant differences in scoring accuracy between VSP 1.0 and VSP 2.0 (p < 0.001), as well as between VSP 1.0 and VSP 3.0 (p < 0.001). Significant differences were observed in speech recognition accuracy between VSP 1.0 and VSP 2.0 (p = 0.011), VSP 1.0 and VSP 3.0 (p < 0.001), and between VSP 2.0 and VSP 3.0 (p = 0.035). Intent recognition accuracy exhibited significant differences between VSP 1.0 and VSP 2.0 (p < 0.001), VSP 1.0 and VSP 3.0 (p < 0.001), and between VSP 2.0 and VSP 3.0 (p = 0.033). The results are presented in Table 7.

Table 7 Comparison of VSP application accuracy in palpitation cases (n = 109)

The results showed no significant differences in application accuracy (t = 1.055, p = 0.132; t = 0.138, p = 0.068; t = -0.872, p = 0.557) when using VSP 3.0 for teaching medical students and residents in palpitation cases. The results are presented in Table 8.

Table 8 Comparison of the accuracy of VSP application in different groups of palpitation cases (n = 65)

Comparison of VSP application accuracy in cough cases

There were significant differences in the application accuracy (H = 40.521, p < 0.001; H = 18.961, p < 0.001; F = 235.851, p < 0.001). Multiple mean comparison results indicated significant differences in scoring accuracy between VSP 1.0 and VSP 2.0 (p < 0.001), as well as between VSP 1.0 and VSP 3.0 (p < 0.001). Significant differences were observed in speech recognition accuracy between VSP 1.0 and VSP 2.0 (p < 0.001), as well as between VSP 1.0 and VSP 3.0 (p = 0.011). Intent recognition accuracy exhibited significant differences between VSP 1.0 and VSP 2.0 (p < 0.001), between VSP 1.0 and VSP 3.0 (p < 0.001), and between VSP 2.0 and VSP 3.0 (p < 0.001). The results are presented in Table 9.

Table 9 Comparison of VSP application accuracy in cough cases (n = 68)

There were no significant differences in application accuracy (t = 0.276, p = 0.241; t = -4.933, p = 0.186; t = -0.486, p = 0.309) when using VSP 3.0 for teaching medical students and residents in cough cases. The results are presented in Table 10.

Table 10 Comparison of the accuracy of VSP application in different groups of cough cases (n = 64)

Changes in the accuracy of different cases

We analyzed and compared scoring accuracy, speech recognition accuracy, and intent recognition accuracy (Fig. 5). Both scoring accuracy and intent recognition accuracy increased with each VSP upgrade, and the standard deviation decreased, whereas the trend in speech recognition accuracy varied across cases. Under VSP 1.0, the syncope case showed the best speech and intent recognition accuracy, followed by diarrhea, palpitation, and cough. Under VSP 2.0 and VSP 3.0, scoring and intent recognition accuracy were nearly identical across all four cases.

Fig. 5
figure 5

Accuracy trends across VSP versions (maximum accuracy 100%). a scoring accuracy; b speech recognition accuracy; c intent recognition accuracy

Discussion

We explored the accuracy of our self-developed VSP for assessing history-taking skills. Intent recognition and scoring accuracy increased with the updates and optimization, while speech recognition accuracy remained consistently high. After optimization and updating, the VSP's application accuracy stabilized at high levels across scenarios and did not vary with the study population.

The statistics in Table 2 make clear that VSP scores are generally lower than SP scores. This finding aligns with a study by Fink and others [33], who attributed it to subjects' lower interest, reduced appraisal of motivational value, and decreased evidence generation reported for virtual patients. However, our VSP did not engage in human–computer dialogue, so we believe the reason is different. Based on our analysis, the cause likely lies in the VSP system's overall operational processes and assessment methods. Because the VSP first translates speech into text, then performs intention recognition, and finally produces a score, errors in speech and intention recognition and classification may propagate into the score, lowering it relative to SP. Our study therefore further explored the scoring, speech recognition, and intent recognition accuracy of the VSP.

Considering the potential confounding effects of different case content, we conducted separate analyses of scoring accuracy, speech recognition accuracy, and intent recognition accuracy for each of the four cases: diarrhea, syncope, palpitations, and cough. All versions of VSP used these same four cases. We analyzed the scoring accuracy, speech recognition accuracy, and intent recognition accuracy of different VSP versions in these cases. The results all showed significant differences.

We also looked for discrepancies in accuracy between medical students and residents using VSP 3.0. Significant differences emerged only in the diarrhea case, where speech recognition and intent recognition accuracy differed between groups. The reason may be that medical students and residents conduct history-taking differently. Medical students, lacking clinical experience, tend to follow the standardized model taught by professors, making their questions easy for the system to recognize. Residents, with some clinical experience, use more varied inquiry styles that pose identification challenges. Furthermore, regional and dialectal accent variation in Chinese introduces some errors into speech recognition.

Based on the scoring accuracy data, the latest version of the VSP achieved an accuracy of 85.40–89.62%, in line with similar research: response accuracy ranged from 84 to 88% in a study by William and colleagues [34] and from 79 to 86% in Maicher et al.'s study [35]. The construction of this system has thus been relatively successful, although future work should enrich the synonym database and improve accuracy further. Pairwise comparisons indicate a significant improvement in scoring accuracy with the newer versions, VSP 2.0 and VSP 3.0. However, VSP 3.0 showed no improvement over VSP 2.0, suggesting that the system's self-learning functionality had a limited impact on scoring accuracy, possibly because the collected corpus was too small and the machine had insufficient time to self-learn. Whether the self-learning feature affects scoring accuracy remains uncertain and requires further research.

Speech recognition accuracy was relatively high across the four medical cases, and pairwise comparisons revealed no specific pattern behind the significant differences. Several factors may explain this. First, regional and ethnic differences produce distinct accents, while the system is designed for standard Mandarin. Second, fast speech can make it difficult for the system to capture the spoken words accurately. Finally, the system may fail to recognize sentence breaks correctly. When speech-to-text conversion fails to convey the intended meaning, the system responds with errors or not at all. This aligns with the findings of Kammoun et al. [36], whose system automatically moves to the next section when it cannot accurately recognize the speech. In the future, the speech recognition component could be optimized by customizing response time intervals for each individual.

Overall, intent recognition accuracy improved consistently. This suggests that VSP 2.0 resolved the intent recognition problems of VSP 1.0, and that the self-learning feature further enhances intent recognition accuracy. The results also suggest that the VSP's intent recognition accuracy does not vary across experience groups; future research should include a more diverse range of participants to validate these findings.

The findings (Fig. 5) indicate that scoring accuracy and intent recognition accuracy improve with each VSP upgrade, whereas speech recognition accuracy varies across cases. This discrepancy can be attributed to the factors mentioned earlier: subjects' accents, speaking speed, and sentence breaks all pose challenges for VSP recognition. In VSP 1.0, application accuracy showed a notable standard deviation, with diarrhea cases achieving the highest accuracy; this variation may be linked to VSP 1.0's slight instability and differing word recognition accuracy across cases. Because scoring involves speech recognition followed by intent recognition, the scoring, speech recognition, and intent recognition accuracy of VSP 1.0 were relatively consistent with one another.

Limitation

This study used only the system's examination mode, focusing solely on speech conversion and score feedback. The system also offers a human–computer interaction mode for student history-taking training, which we did not explore here, and we did not examine the system's comprehension of voice text or its response accuracy; these aspects can be studied in future research. Furthermore, we examined only a few key metrics (VSP scoring accuracy, speech recognition accuracy, and intent recognition accuracy) rather than all possible metrics. Finally, this study included only medical students and residents. Future analyses should consider additional demographic variables to investigate the VSP's application accuracy across different population groups.

Conclusion

The VSP proves to be a feasible way to train history-taking skills. This study describes the scoring process of our self-developed VSP and demonstrates its commendable application accuracy. The system's upgrades and self-learning function improved the stability and accuracy of the VSP. The accuracy of VSP 3.0 has now reached the level required of a history-taking training aid, opening up possibilities for integrating diagnostic training tools into clinical education and effectively addressing the shortage of SP training opportunities for students. With continuous optimization, the VSP can become a reliable training and assessment tool, fostering students' independent learning abilities in classroom teaching.