The laparoscopic approach has become the standard of care for a wide variety of surgical procedures and has the advantages of faster recovery, minimal blood loss, and lower cost of treatment [1]. Despite the many benefits of minimally invasive surgery (MIS), the technique is more demanding for surgeons and requires extensive training. This is due to the increased sensorimotor challenges associated with this technique such as hand–eye coordination, two-dimensional field of view, and lack of perceivable haptic feedback [2]. As a result, laparoscopic surgery trainees must undergo a substantial amount of preparation using simulators prior to performing live operations.

The current standard for basic laparoscopic skill development is the Fundamentals of Laparoscopic Surgery (FLS) curriculum [3, 4], which is administered by the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) and the American College of Surgeons (ACS). The FLS trainer is a physical box trainer based on the McGill Inanimate System for Training and Evaluation of Laparoscopic Skills (MISTELS) [5]. The five FLS tasks used in the manual skills portion of the curriculum are peg transfer, pattern cutting (PC), ligating loop, suture with intracorporeal knot, and suture with extracorporeal knot. Achieving proficiency in these tasks provides the foundation of laparoscopic surgical skill performance. Since 2009 in the USA, successful completion of the FLS exam is a requirement before being eligible to take the Qualifying Examination of the American Board of Surgery.

Despite being the standard in laparoscopic training, there are major drawbacks to the FLS practical exam, including difficulty in evaluating performance objectively and the time needed to score manually. To overcome these issues, virtual reality (VR)-based simulators can be used to replace physical models. VR simulators enable objective and automated assessment of performance, in real time, and without the need for proctors. Moreover, they permit unlimited training without the expense of consumables. They can also provide haptic feedback, which has already been shown to be an essential component of minimally invasive surgery simulations [6, 7]. VR-based simulators have been shown to transfer effective technical skills to the operating room environment [8, 9]. The Virtual Basic Laparoscopic Skill Trainer (VBLaST) was developed as the VR version of the FLS trainer [10] and maps the five FLS skills to a virtual environment. The VBLaST has shown face validity, as well as construct, concurrent, and discriminant validity [10,11,12,13].

The cumulative summation (CUSUM) is a criterion-based method that is commonly used for characterizing learning curves. It is a statistical and graphical tool that analyzes trends for sequential events in time and hence can be used for quality control of individual performance and group performance. It can be applied in the learning phases, such as while learning a new procedure, and at the end of the training phase after the acquisition of the skill [14, 15]. Previous research has examined the learning curve for the VBLaST peg transfer task and found it to be comparable to that of the FLS peg transfer task. The objective of this study was to continue the validation process of the VBLaST simulator and demonstrate the convergent validity of the PC task using the CUSUM method. To demonstrate convergent validity, the system must be at least as effective as a commonly accepted training system, such as the FLS. Therefore, it was expected that the learning curves on the VBLaST-PC and FLS would be similar, with performance improving with practice. In addition, it was expected that subjects trained on either simulator would perform better than those with no training in the posttest and retention test.

Materials and methods

Subjects

Based on prior learning curve studies and power calculations, five subjects were necessary for each of the three conditions in this learning curve study. Thirty medical students were recruited to allow for attrition, which was anticipated due to the long time commitment required.

An IRB-approved recruitment email was sent to all Tufts University Medical students. Inclusion criteria were as follows: little or no prior experience with surgery or surgical simulators, normal or corrected to normal vision, and no motor impairment that prevented the handling of two laparoscopic tools in the surgical simulators. Subjects were compensated for their participation.

Ten subjects were randomly assigned to each of the three conditions (control, FLS, and VBLaST). At the end of the study, there were nine subjects in the control group, eight in the FLS training group, and seven in the VBLaST training group, due to attrition.

Equipment

The FLS system (Fig. 1A) used was the standard SAGES-approved trainer box. The view of the task space was filmed with a fixed focal length camera and displayed on a monitor for the subject. A digital capture device (AVerMedia, Milpitas, CA, USA) was used to record subjects’ performance inside the task space. The video was used to gather timing and error measurements for data analysis.

Fig. 1 A VBLaST-PC, B FLS-PC (http://www.flsprogram.org), and C comparison of VBLaST-PC on the left and FLS-PC on the right

The VBLaST-PC system (Fig. 1B) consisted of two laparoscopic tools connected to haptic devices mounted in front of a monitor, and a VR environment simulating the FLS-PC task. Custom-developed computational software in the VBLaST-PC simulated the interaction between the tools and objects in the virtual environment. The two Geomagic Touch haptic devices (3D Systems Inc.) connected to the instrumented tools provided force feedback to the user. Figure 1C shows side-by-side comparison of the two systems. Even though the system is capable of tracking and calculating performance variables such as instrument path length and smoothness, only time to task completion and errors were used in this study to provide a fair comparison to the FLS.

Experimental design and procedure

The PC task was used in this convergent validity study. The FLS-PC task requires the subject to cut a circle out of a 4 cm × 4 cm piece of gauze along a pre-marked black line as quickly and accurately as possible. The task is completed using laparoscopic instruments and the official FLS box trainer as shown below. The same task was completed with the VBLaST© system using laparoscopic instruments and computational software.

In this mixed experimental design, participants were randomly assigned to one of three conditions: control, VBLaST, and FLS. Subjects in the control group did not receive any training on the task, while those in the two training groups received training on the assigned simulator over a period of 3 weeks. Demographic data including age, medical school year, and information regarding previous experience and laparoscopic surgery were gathered for all subjects. Before the testing session began, all subjects watched an instructional video that demonstrated the proper procedure to perform the PC task in both the FLS and VBLaST systems. All subjects then performed the task once using both simulators to establish their own baseline performance. This also served as the pretest assessment. The order of simulators was counterbalanced—half the subjects used the FLS system first for the pretest and half the subjects used the VBLaST system first.

Subjects randomized to the training groups were asked to attend one training session per day, 5 days per week, for three consecutive weeks, for a total of fifteen 30-min sessions. During each 30-min session, subjects were asked to perform ten trials of the PC task or as many trials as they could complete within the session, whichever came first. The experimenter was always present during these training sessions to answer questions and provide instruction when needed. VBLaST-PC time and error data were automatically recorded by the simulator. FLS training scores were computed manually and time was measured using a stopwatch.

At the end of 3 weeks, all subjects (training and control) performed the PC task on both the FLS and VBLaST to record posttest data. To assess retention, a final session was held 2 weeks after the last training session. Table 1 summarizes our experimental design.

Table 1 Learning curve study timeline

Data analysis

Performance data analysis

Using SPSS, the pretest, posttest, and retention test data (time, error, score) were analyzed using a two-way ANOVA mixed design and Tukey’s HSD post hoc test, as well as multiple pairwise comparisons with Bonferroni correction. The criterion for statistical significance was set at α = 0.05. Outliers were removed from the data and replaced with the group means before analysis. On the FLS simulator, one data point on the time measure and four data points on the error measure were considered outliers. On the VBLaST simulator, three data points in time and four data points in error were considered outliers.

CUSUM analysis

CUSUM analysis was performed by generating CUSUM charts for all the subjects in both the FLS and the VBLaST training groups. For both training groups, a criterion on the normalized score was established based on the currently accepted proficiency score for the FLS-PC task (score = 72) and an intermediate proficiency score (score = 56) calculated as the average score for all subjects in the first 40 trials. When the computed PC score for a trial equaled or exceeded the criterion score, the trial was defined as a ‘success’ (1), while a lower score was deemed a ‘failure’ (0). The acceptable failure rate (p₀) was set at 5%, and the unacceptable failure rate (p₁) was set at 10% (2 × p₀). Type I and type II error rates (α and β) were set at 0.05 and 0.20, respectively. Based on those parameters, two decision limits (h₀ and h₁) and s, the target value for CUSUM, were calculated for each successive trial. For each ‘success,’ s was subtracted from the previous CUSUM score. For each ‘failure,’ 1 − s was added to the previous CUSUM score. A negative slope of the CUSUM line indicates success, whereas a positive slope suggests failure. This procedure was repeated for each subject in both training groups. Table 2 shows the CUSUM variables for data analysis.
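The CUSUM procedure described above can be sketched in a few lines of Python. The text does not give the formulas used to derive s and the decision limits, so this sketch assumes the formulation commonly used in surgical learning-curve studies, with P = ln(p1/p0) and Q = ln((1 − p0)/(1 − p1)). The function names and the example scores are hypothetical and are not taken from the study.

```python
import math

def cusum_parameters(p0, p1, alpha, beta):
    """Return (s, h0, h1) for a CUSUM chart.

    Assumed formulation (not stated in the paper):
      P = ln(p1/p0), Q = ln((1-p0)/(1-p1)), s = Q/(P+Q),
      h0 = -ln((1-alpha)/beta)/(P+Q), h1 = ln((1-beta)/alpha)/(P+Q).
    """
    P = math.log(p1 / p0)
    Q = math.log((1 - p0) / (1 - p1))
    s = Q / (P + Q)
    h0 = -math.log((1 - alpha) / beta) / (P + Q)  # lower decision limit
    h1 = math.log((1 - beta) / alpha) / (P + Q)   # upper decision limit
    return s, h0, h1

def cusum_curve(scores, criterion, s):
    """Running CUSUM: subtract s per success, add (1 - s) per failure."""
    total = 0.0
    curve = []
    for score in scores:
        success = score >= criterion  # a score at or above criterion is a success
        total += -s if success else (1 - s)
        curve.append(total)
    return curve

# Parameters from the text: p0 = 5%, p1 = 10%, alpha = 0.05, beta = 0.20.
s, h0, h1 = cusum_parameters(0.05, 0.10, 0.05, 0.20)

# Hypothetical trial scores, judged against the intermediate criterion of 56.
example_curve = cusum_curve([50, 60, 80], criterion=56, s=s)
```

Under these assumptions, a subject's curve falls by s after every successful trial, so a sustained run of successes eventually carries it below h₀, while repeated failures push it up toward h₁.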

Table 2 CUSUM criteria score and parameters

Learning plateau and learning rate analysis

We calculated the learning plateau and learning rate using the learning curve data of all subjects in both FLS and VBLaST training conditions based on the method of inverse curve fitting. The learning plateau was defined as the asymptote of the fitted curve and learning rate was defined as the number of trials required to reach 90% of the plateau [16].
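As an illustration of the plateau and learning-rate definitions, the sketch below fits an inverse curve by ordinary least squares. The functional form y = a − b/x (y the score, x the trial number) is an assumption, since the text names the method but does not state the equation. Under that form the plateau is the asymptote a, and the number of trials needed to reach 90% of it is b/(0.1 × a). The function names and the synthetic data are hypothetical.

```python
def fit_inverse_curve(scores):
    """Least-squares fit of y = a - b/x, where scores[i] is the score on trial i + 1.

    The model is linear in u = 1/x, so a simple linear regression of y on u
    gives intercept a (the plateau) and slope -b.
    """
    n = len(scores)
    us = [1.0 / (i + 1) for i in range(n)]
    mean_u = sum(us) / n
    mean_y = sum(scores) / n
    s_uu = sum((u - mean_u) ** 2 for u in us)
    s_uy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, scores))
    slope = s_uy / s_uu
    a = mean_y - slope * mean_u  # asymptote: learning plateau
    b = -slope
    return a, b

def trials_to_fraction_of_plateau(a, b, fraction=0.9):
    """Solve a - b/x = fraction * a for x (assumes a > 0 and b > 0)."""
    return b / ((1 - fraction) * a)

# Synthetic, noise-free scores generated from y = 80 - 40/x (illustrative only).
synthetic_scores = [80 - 40 / x for x in range(1, 21)]
plateau, b = fit_inverse_curve(synthetic_scores)
learning_rate = trials_to_fraction_of_plateau(plateau, b)
```

With real trial data the fit would be noisy, and a nonlinear fitting routine might be preferred; the linear-in-1/x form is used here only to keep the sketch dependency-free.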

Results

Subjects’ baseline performance (pretest) scores are summarized as group means in Table 3. Analysis results indicate that the groups were similar at the beginning of the study, even though the differences in their performance scores on the FLS were significant (p = .048). No other measures showed a significant difference between groups.

Table 3 Subjects’ baseline performance as measured by the pretest on FLS and VBLaST

FLS simulator

Time to task completion

Analysis of variance results (see Table 4) showed a significant main effect in training condition [F(2,21) = 7.749, p = .003, η² = .425], and learning effect as measured in pretest, posttest, and retention test [F(2,42) = 21.924, p < .001, η² = .511]. No significant interaction between training condition and learning effect was found. Tukey’s HSD post hoc test showed that FLS-trained group was significantly different from the control group. There was no difference between the control group and the VBLaST-trained group. Post hoc pairwise comparisons, with Bonferroni correction, showed a significant difference between pretest and posttest, suggesting significant improvement over time, and no significant differences between posttest and retention test (Fig. 2A).
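The effect sizes reported alongside each F statistic appear to be partial eta squared. This is an inference from the reported numbers rather than something the text states. Under that assumption, the effect size can be recovered from F and its degrees of freedom as (F × df_effect) / (F × df_effect + df_error). The function name below is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# Values reported above for FLS time to task completion.
training_effect = partial_eta_squared(7.749, 2, 21)   # reported as .425
learning_effect = partial_eta_squared(21.924, 2, 42)  # reported as .511
```

Both values reproduce the reported effect sizes to three decimal places, which is consistent with the partial eta squared interpretation.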

Table 4 ANOVA results for the 3 × 3 mixed design (3 training conditions × 3 tests)

Fig. 2 Performance group means for the three training groups in pretest, posttest, and retention test on the FLS simulator (the error bars represent standard deviation): A completion time, B error, and C normalized score

Error

There was a significant learning effect in the error measure [F(2,42) = 22.957, p < .001, η² = .522], but no difference as a function of training condition. No interaction between the two factors was observed. Post hoc pairwise comparisons, with Bonferroni correction, showed that errors improved significantly from pretest to posttest, but no significant changes were observed from posttest to retention test, suggesting that learning occurred with training, and the learning effect did not decay with time after the training period (Fig. 2B).

Normalized score

The normalized score showed a significant main effect in the training condition [F(2,21) = 42.87, p < .001, η² = .803] and a significant learning effect [F(2,42) = 40.03, p < .001, η² = .656]. There was also a significant interaction between training condition and testing condition [F(4,42) = 23.67, p < .001, η² = .693]. Tukey’s HSD post hoc test showed a significant difference between all training groups. Post hoc pairwise comparisons, with Bonferroni correction, showed that pretest and posttest were significantly different, suggesting learning, but no difference between posttest and retention test scores, indicating no skill decay with time after the training period (Fig. 2C).

Pretest–posttest

The change in performance from pretest to posttest as an indicator of learning was analyzed. Results from individual t tests showed that the change in performance for the VBLaST-trained group was not different from that of the control group, whereas the change in performance for the FLS-trained group was significantly different from the control group. This suggests that there was no transfer of learning from the VBLaST environment.

VBLaST simulator

Time to task completion

Analysis of variance results (see Table 4) showed a significant difference in subjects’ time to task completion as a function of training condition [F(2,21) = 4.101, p = .031, η² = .281]. There was also a significant learning effect as measured in pretest, posttest, and retention test [F(2,42) = 42.087, p < .001, η² = .667]. A significant interaction between training and learning was noted [F(4,42) = 3.373, p = .018, η² = .243]. Tukey’s HSD post hoc test showed that the only difference was between the control group and the VBLaST-trained group. There were no significant differences between the other groups, suggesting that FLS skills did not transfer to the VBLaST simulator. Post hoc pairwise comparisons, with Bonferroni correction, showed a significant difference between pretest and posttest, suggesting improvement with training. There were no significant differences between posttest and retention test, suggesting that there was no decay in skill with time after the training period (Fig. 3A).

Fig. 3 Performance group means for the three training groups in pretest, posttest, and retention test on the VBLaST-PC simulator (error bars represent standard deviation): A completion time, B error, and C normalized score

Error

No significant differences were found in any of the factors for the error measure on the VBLaST simulator (Fig. 3B).

Normalized score

The normalized score showed a significant main effect in the training factor [F(2,21) = 10.362, p = .001, η² = .497] and a significant learning effect [F(2,42) = 63.61, p < .001, η² = .752]. There was a significant interaction between training condition and testing condition [F(4,42) = 8.414, p < .001, η² = .445]. Tukey’s HSD post hoc test showed a significant difference between the subjects in the control and VBLaST-trained groups, and between subjects in the FLS- and VBLaST-trained groups. Control subjects and FLS-trained subjects were not different, suggesting that FLS skills did not transfer to the VBLaST simulator. Again, the pairwise comparisons with Bonferroni correction indicate that significant learning occurred from pretest to posttest, and no significant decay took place between posttest and retention test (Fig. 3C).

Pretest–posttest

The change in performance from pretest to posttest as an indicator of learning was analyzed. Results from individual t tests showed that the change in performance for the FLS-trained group was not different from that of the control group, whereas the change in performance for the VBLaST-trained group was significantly different from that of the control group. This suggests that there was no transfer of learning from the FLS environment.

Cumulative summation analyses for FLS-PC and VBLaST-PC training groups

Based on the intermediate proficiency criterion (score = 56) (Fig. 4), five of the seven medical students (MS) achieved the acceptable failure rate of 5% by 150 trials (MS 20, MS 18, and MS 5 at the 73rd trial, MS 34 at the 101st trial, MS 39 at the 115th trial). All subjects showed performance transition points (the trial at which the slope of the CUSUM curve becomes negative), indicating that they reached the targeted score and were improving as trials progressed (MS 18 at the 3rd trial, MS 20 at the 4th trial, MS 5 at the 5th trial, MS 39 at the 7th trial, MS 17 at the 14th trial, and MS 19 at the 29th trial). The performance of MS 17 was between the two decision limits of h₀ and h₁ and did not reach the acceptable failure rate. MS 39 did not achieve proficiency with acceptable failure rate in the 150 trials.

Fig. 4 CUSUM learning curves for medical students trained on FLS simulator using the intermediate criterion success score of 56, acceptable failure rate p₀ = 5%

Based on the proficiency score of 72, none of the students achieved the acceptable failure rate of 5% (Fig. 5). For one student (MS 20), the performance crossed the upper decision limit h₁ but not the lower decision limit h₀ within 150 trials. Six students showed performance transition points (MS 17 at the 75th trial, MS 39 at the 77th trial, MS 19 at the 90th trial, MS 20 at the 95th trial, and MS 34 at the 109th trial).

Fig. 5 CUSUM learning curves for medical students trained on FLS simulator using the FLS proficiency criterion success score of 72, acceptable failure rate p₀ = 5%

Based on the intermediate proficiency criterion score of 52, two students achieved the 5% acceptable failure rate (see Fig. 6) on the VBLaST simulator (MS 37 at the 58th trial and MS 30 at the 143rd trial). The performance of MS 4 was between the two decision limits. All students showed transition points (MS 37 at the 12th trial, MS 4 at the 18th trial, MS 14 at the 32nd trial, MS 26 and MS 22 at the 49th trial, and MS 14 at the 79th trial).

Fig. 6 CUSUM learning curves for medical students trained on VBLaST-PC simulator using the intermediate criterion success score of 56, acceptable failure rate p₀ = 5%

Based on the proficiency criterion score of 72, none of the students achieved the acceptable failure rate of 5% (Fig. 7). The performance of one student (MS 37) was between the two decision limits h₀ and h₁. Two medical students showed transition points (MS 30 at the 49th trial and MS 37 at the 41st trial).

Fig. 7 CUSUM learning curves for medical students trained on VBLaST-PC simulator using the FLS proficiency criterion success score of 72, acceptable failure rate p₀ = 5%

Learning plateau and learning rate for FLS-PC and VBLaST-PC training groups

The inverse curve fitting results for the FLS-PC and the VBLaST-PC groups are shown in Fig. 8A, B. The learning plateau and the learning rate calculated from inverse curve fitting are presented in Table 5. The FLS group achieved a higher plateau (77.93, p < .001) compared to the VBLaST group (70.00, p < .001) in normalized scores. The FLS group also learned faster, requiring seven trials to reach 90% of the plateau, compared to the VBLaST group, which required an average of ten trials.

Fig. 8 Inverse curve fitting for the derivation of learning plateau and learning rate based on A normalized FLS scores and B normalized VBLaST scores

Table 5 Learning plateau and learning rate for subjects trained on FLS or VBLaST

Discussion

Learning curves capture gains in performance with repetition and have been widely used in assessing learning in surgical tasks [17,18,19,20,21,22,23,24,25,26,27,28]. There are multiple types of learning curves (linear, S-shaped, and positively or negatively accelerated) [29]. Although learning curves are useful, they are not adequate as it is hard to assess performance over time with knowledge of only the performance plateau and time to reach that level. Cumulative Summation Analysis (CUSUM) is a method where performance over time can be studied with set criteria [15]. CUSUM has been widely used in surgery to assess the learning progress for many different procedures [17, 18, 21, 23, 25, 27, 28, 30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51].

In our previous study assessing learning of the VBLaST peg transfer task [17], we analyzed performance using three criteria (junior, intermediate, and senior) based on the classification reported by Fraser et al. [27]. In this work, we analyzed the learning of the VBLaST-PC task in a study with medical students.

Based on the two criteria (intermediate and proficient), four students in FLS and two students in VBLaST reached the acceptable failure rate of 5%. For the proficient score, none of the medical students were able to reach the acceptable failure rate at the end of 150 trials for both the simulators. It should be noted that all but one subject showed improvement in performance with a transition point on the CUSUM learning curve. Compared to peg transfer, the PC task is more challenging and 150 trials were not enough to train subjects at the 5% failure rate.

When comparing the learning plateau and learning rate reached by both groups on their respective simulators, the FLS group achieved a slightly higher learning plateau than the VBLaST group. The subjects in the FLS group also took three fewer trials than the VBLaST group to reach 90% of the plateau. Based on our experience working with VR simulators, we attribute this difference to the fact that subjects needed more time to become familiar with the VR and haptics technology in the simulator. This hypothesis is difficult to test and may become self-evident in time as technology in VR and haptics becomes more sophisticated.

In terms of learning, subjects in the control condition did not show any learning on the FLS at the end of the study, with a total exposure of three trials on each of the simulators. They showed continual improvement on the VBLaST simulator from pretest to posttest to retention test, as indicated by the increasing test scores (Fig. 3C). This may suggest that the virtual simulator is more conducive to learning than the physical simulator during the trainee’s initial exposure to simulation training. This finding may have important implications for surgical education when time availability for training is limited.

Compared to the control group, subjects who received training on the simulators improved their skills significantly after the 3 weeks of practice on their assigned simulator only. Based on the subjects’ testing results (pretest–posttest), learning occurred from day 1 to day 15 of training. That is, VBLaST-trained subjects significantly improved in performance at posttest in the VBLaST simulator but not in the FLS simulator, whereas FLS-trained subjects showed significant improvement in the FLS simulator but not in the VBLaST simulator. This is to be expected due to specificity in learning.

Trained subjects were able to retain their skills even after a period of non-use beyond the training period. In fact, subjects who were trained on the VBLaST continued to improve their performance on the FLS simulator at retention test. This may imply that skills learned on the VBLaST simulator are being transferred to the FLS environment. Similarly, the continual improvement shown by FLS-trained subjects on the VBLaST at pretest and posttest suggests that there is some transfer of learning from the FLS environment.

Overall, our study highlights that laparoscopic surgical skill training in a virtual environment is comparable to training in a physical environment, taking into account additional time needed for familiarization with the VR environment. This is a very important finding when it comes to planning simulation center experiences for surgical trainees. Repeated practice on the VBLaST simulator does not require expensive consumables and does not require replacing the materials between trials. The virtual environment is also capable of providing immediate feedback on time, error, and score, whereas the FLS simulator requires a proctor to keep time and calculate error and score, which is time consuming, labor intensive and results in significantly delayed feedback. The virtual environment is also capable of providing adaptive learning and one can imagine that a simulator can be programmed at different levels of difficulty with progression of training.

Future directions for this research include studies to investigate the transfer of learning from one simulator environment to the other and, by extension, the transfer of these laparoscopic surgery skills from the simulation lab to the operating theater. This will allow us to validate the predictive power of the simulators as a training tool for surgical skills mastery.