Surgical training follows an apprenticeship model in which surgical residents practice operations under the supervision and mentorship of faculty surgeons. This method requires significant time and personal resources while not providing a standardized means of surgical skill evaluation [1, 2]. Traditional surgical assessment methods, such as direct observation of the trainee by an experienced trainer, are generally subjective and use global rating scales (GRS) to score competency. Methods such as Objective Structured Assessment of Technical Skills (OSATS), Global Operative Assessment of Laparoscopic Surgery (GOALS), and Global Rating Index for Technical Skills (GRITS) allow experienced surgeons to use structured checklists of technical criteria to rate the surgical performance of the trainee under direct observation [3,4,5,6]. However, because of their subjective nature, these generalized rating assessments have drawn serious criticism. Such criticism cites tremendous human resource costs, poor interrater reliability among human observers, and poor correlation of rated technical skill with patient outcome in the operating room [7, 8].

To provide more objectivity and standardization for laparoscopic skills assessment, the McGill Inanimate System for Training and Evaluation of Laparoscopic Skills (MISTELS) was developed and validated as an effective simulator to teach and assess laparoscopic surgical skills [9]. The Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) adopted MISTELS into a program called the Fundamentals of Laparoscopic Surgery (FLS). Consequently, the FLS program is now the standard for assessing proficiency in laparoscopic skills and has been required for board certification since 2009 [10,11,12]. While the validated FLS simulator has been shown to correlate with clinical skill performance, it has inherent drawbacks such as subjectivity in task assessment, high cost of test administration, and the significant amount of time required for scoring [13,14,15,16,17]. To address the general limitations of physical trainers, virtual reality-based simulators have been developed and shown to provide a safe and effective training and assessment platform for laparoscopic surgical skills [18, 19]. To specifically address the limitations of the FLS training simulator, we have developed the Virtual Basic Laparoscopic Skills Trainer (VBLaST), which is capable of simulating the five FLS task modules in real time [17, 20,21,22,23]. The benefits of the VBLaST system include automated and robust scoring, the introduction of kinematic metrics that are correlated with task performance, dramatically increased objectivity in task performance assessment, and the elimination of the high cost of administration and testing materials [17, 20,21,22,23]. As with any virtual reality-based simulator, a thorough validation is required to demonstrate its effectiveness as a surgical training and performance assessment tool.

The goal of this study is to demonstrate convergent validation of the VBLaST pattern cutting module as an effective training and task assessment simulator for laparoscopic surgical skills. To achieve this goal, we aim to determine whether there are significant improvements in pattern cutting task performance scores between trained VBLaST subjects and untrained control subjects once training is complete. Furthermore, we aim to determine whether the surgical motor skills acquired by the trained VBLaST subjects transfer to the FLS training simulator and ex vivo models. We hypothesize that trained VBLaST subjects will outperform control subjects not only in the VBLaST simulation environment but also in the FLS and ex vivo simulation environments. We propose three different mechanisms to show validity of the VBLaST pattern cutting system. First, we show task performance learning curves for the FLS and VBLaST trainers that are objectively characterized by the cumulative summation (CUSUM) criterion [17, 24, 25]. Next, we show that there is significant task performance retention and transfer from the FLS to the VBLaST simulation environment, and vice versa (p < 0.05). Finally, we show that task performance transfers from the simulation environments to ex vivo cadaveric models mimicking the pattern cutting task (p < 0.05). Ultimately, we present evidence that we have achieved convergent validity of the VBLaST pattern cutting module and show that it can be used as an effective laparoscopic skills training and task assessment simulator.

Methods

The study was approved by the Institutional Review Board of University of Buffalo and Rensselaer Polytechnic Institute.

Subject recruitment

Prior to subject recruitment, we performed an a priori power analysis based on the Mann–Whitney U test to determine the minimum number of subjects required for the FLS training group, the VBLaST training group, and the control group. Using FLS and VBLaST task performance scores from pilot study data, we estimated conservative effect sizes of d = 5.67 and d = 2.57 for the FLS and VBLaST groups, respectively. Based on these effect sizes, a 95% confidence level, and a minimum power of 0.80, we determined that a minimum of four subjects was required for the FLS training group, three subjects for the VBLaST training group, and four subjects for the control group. Consequently, we recruited seven subjects for the FLS training group, six subjects for the VBLaST training group, and five subjects for the control group. None of the recruited subjects had prior skills in laparoscopic surgery, and, to eliminate any bias due to handedness, all were right-handed. Subjects were monetarily compensated for their participation. The statistical software G*Power was used to determine the effect sizes and the minimum number of subjects required for this study [26].

Hardware

Two different simulators were used over the course of this learning curve study. The FLS group trained on a standard SAGES-certified FLS box trainer with the official supplementary materials to administer the pattern cutting task. The VBLaST group trained on the VBLaST system, specifically on the pattern cutting module. The VBLaST system consists of two major components: a hardware interface and a simulation software suite. The hardware interface uses two PHANTOM Omni haptic devices (Geomagic, Morrisville, North Carolina), connected to appropriate surgical tool interfaces, that provide positional tracking and real-time force feedback in the virtual environment. The simulation software uses custom-developed algorithms to simulate tool-to-cloth interactions in the virtual environment. Figure 1 displays both the FLS box trainer and the VBLaST simulator.

Fig. 1

FLS and VBLaST simulators and study design. A The physical FLS pattern cutting (PC) box trainer (left) and the VBLaST PC simulator (right) used in this study. B Schematic illustrating the learning curve study design. Two training groups, VBLaST (blue) and FLS (magenta), undergo a training period, whereas the control group (green) performs only the baseline (Day 1), retention, and transfer task tests (Color figure online)

Learning curve and task retention study design

Recruited subjects were randomly split into three groups: FLS training group, VBLaST training group, and control group with no training. All the subjects were given standardized instructions on how to successfully complete the pattern cutting task for the FLS and VBLaST simulators. The untrained control group performed three FLS trials and three VBLaST trials on the first day. The control group then waited 2 weeks and performed three FLS trials and three VBLaST trials as part of the final task retention day without undergoing any laparoscopic skills training. The FLS and VBLaST training groups were instructed to complete up to 10 trials per day for twelve consecutive days on each group’s respective simulator. Following 12 days of training, each group was instructed to wait 2 weeks without undergoing any laparoscopic training before performing three FLS and three VBLaST trials each as part of the final task retention day. A schematic illustrating the study design is shown in Fig. 1B.

Transfer task study design

Following the task retention trials, each subject was asked to perform an FLS pattern cutting task on ex vivo cadaveric peritoneal tissue to assess motor skill transfer from the simulation environment to ex vivo tissue models. The transfer task consisted of replicating the FLS pattern cutting task on marked, excised cadaveric abdominal tissue samples. The official FLS pattern cutting gauze pads were used as a stencil to draw circles on the ex vivo samples, ensuring that the marked circles had the same diameter on every sample. Using a standardized set of instructions, the subjects were told to resect the marked peritoneal tissue as accurately and as quickly as possible without damaging the underlying fascia or muscle tissue. Each tissue sample was photographed before and after the completion of the transfer task. Figure 2 shows sample images taken before and after completion of the transfer task for an example subject.

Fig. 2

Pattern cutting transfer task ex vivo sample. A Ex vivo peritoneum sample prior to transfer task completion for FLS trained subject 3. B Completed pattern cutting transfer task for FLS trained subject 3 with the pattern cutting task replicated and the marked peritoneal tissue resected

Task performance metrics

The proprietary FLS scoring metrics for the pattern cutting task were used to manually score each trial for each subject [9]. The completion time of each FLS pattern cutting trial was recorded manually with an accuracy of ±1 s. The FLS scoring metrics were obtained from the FLS committee under a non-disclosure agreement, and hence their details cannot be reproduced in this paper. The VBLaST task performance metric reproduces the same undisclosed FLS scoring formulation in the automated VR environment [23]. The FLS and VBLaST pattern cutting performance scores were used as outcome measures for the learning curve and task retention tests. Since video recording was not allowed under institute policies at the gross anatomy lab, the performance metric for the ex vivo-based transfer task was completion time, defined as the total time (min) required to completely resect the circle-marked peritoneal tissue from the tissue sample. Each transfer task trial’s completion time was recorded manually with an accuracy of ±1 s.

Statistical analysis

Matlab (MathWorks, Natick, MA) was used to perform all statistical analyses in this study. Mann–Whitney U tests at a significance level of 0.05 were used to determine statistically significant differences between any two groups. All box plots display midlines indicating median values, along with whiskers that extend up to 1.5 times the interquartile range beyond the box and therefore cover approximately 99.3% of normally distributed data, or ±2.7σ, where σ is the standard deviation. Each box plot represents all trials for all subjects in the respective group on the given training day. CUSUM scores were calculated for each trial of each subject. Each consecutive trial was flagged as a “success” or a “failure.” A trial was a “success” when the FLS or VBLaST task performance score was equal to or higher than the defined threshold, and a “failure” when the score was lower than the defined threshold. In this study, the defined threshold for achieving a “senior” level of mastery is 63 [9]. P0 equals 5% and is defined as the acceptable failure rate, whereas P1 equals 10% and is defined as the unacceptable failure rate [17, 27]. Type I and type II error rates were set to 0.05 and 0.2, respectively. Each “success” trial subtracts the parameter s = 0.07 from the CUSUM score, and each “failure” trial adds the parameter 1 − s, which equals 0.93, to the CUSUM score, so that a run of successes drives the curve downward toward the lower decision limit. The failure rates and error rates define the decision limits, H0 and H1, which are equal to −2.09 and 3.71, respectively. The parameters s, H0, and H1 are independent of the assessment task and have been well defined in previous studies [17, 27]. A CUSUM learning curve that falls below the H0 decision limit indicates that the subject’s rate of failing to achieve the “senior” mastery level is below 5%.
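The quoted CUSUM parameters can be reproduced from the stated failure rates and error rates. The sketch below is a minimal illustration, not the authors' Matlab code: it assumes the standard sequential-probability-ratio formulas commonly used for CUSUM learning curves, which the text does not spell out, and the function name `cusum_parameters` is hypothetical.

```python
import math


def cusum_parameters(p0, p1, alpha, beta):
    """Return (s, h0, h1) for a CUSUM chart.

    p0, p1 : acceptable and unacceptable failure rates
    alpha  : type I error rate
    beta   : type II error rate

    Assumption: the usual sequential-probability-ratio formulas apply;
    the source text quotes only the resulting values.
    """
    P = math.log(p1 / p0)
    Q = math.log((1 - p0) / (1 - p1))
    s = Q / (P + Q)                                # per-trial increment
    h0 = -math.log((1 - alpha) / beta) / (P + Q)   # lower decision limit
    h1 = math.log((1 - beta) / alpha) / (P + Q)    # upper decision limit
    return s, h0, h1


s, h0, h1 = cusum_parameters(p0=0.05, p1=0.10, alpha=0.05, beta=0.20)
print(f"s = {s:.2f}, H0 = {h0:.2f}, H1 = {h1:.2f}")
# prints: s = 0.07, H0 = -2.09, H1 = 3.71
```

Under these assumed formulas, the rounded outputs agree with the values s = 0.07, H0 = −2.09, and H1 = 3.71 given in the text.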

Results

Figure 3 shows the FLS pattern cutting performance scores, with respect to training days, for the FLS training and control groups. Results show that there are no significant differences between the FLS training group and the control group for the first day of training. FLS pattern cutting retention task scores show that both the FLS-trained (223.5 ± 18) and VBLaST-trained (109.6 ± 26.8) groups significantly outperformed the untrained control group (81.5 ± 25, p < 0.05). Figure 4 shows the VBLaST pattern cutting performance scores, with respect to training days, for the VBLaST training and control groups. Results indicate that there are no significant differences between the VBLaST training group and the control group for the first day of training. However, VBLaST pattern cutting retention task scores indicate that both the VBLaST-trained (209.4 ± 21) and the FLS-trained (175.2 ± 26.3) groups significantly outperformed the untrained control group (155 ± 21.2).
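The group comparisons reported here rely on the Mann–Whitney U test. As a rough illustration of how such a test behaves with group sizes this small, the following pure-Python sketch computes the U statistic and an exact two-sided p-value by enumerating every possible relabeling of the pooled ranks. It is not the Matlab routine used in the study, the helper names are hypothetical, and the scores in the usage example are invented for illustration only; they are not study data.

```python
from itertools import combinations


def midranks(values):
    """Assign 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_exact(x, y):
    """Return (U for x, exact two-sided p-value) by full enumeration."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    offset = n1 * (n1 + 1) / 2
    u_observed = sum(ranks[:n1]) - offset
    u_mean = n1 * n2 / 2
    deviation = abs(u_observed - u_mean)
    total = extreme = 0
    # Every way of choosing which n1 pooled ranks belong to the first group
    for chosen in combinations(range(n1 + n2), n1):
        u = sum(ranks[i] for i in chosen) - offset
        total += 1
        if abs(u - u_mean) >= deviation - 1e-12:
            extreme += 1
    return u_observed, extreme / total


# Invented scores for illustration only (NOT data from this study)
trained = [210, 225, 198, 240, 232, 215]
control = [80, 95, 60, 110, 75]
u, p = mann_whitney_exact(trained, control)
print(u, round(p, 4))  # prints: 30.0 0.0043
```

With six and five subjects per group, complete separation of the two groups gives the smallest attainable two-sided p-value, 2/462, which is why even small groups can reach p < 0.05 when the effect is large.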

Fig. 3

FLS performance score learning curve. FLS pattern cutting performance scores are shown with respect to training day. FLS training students (magenta) are compared to untrained control students (green). FLS pattern cutting task retention scores are shown for trained FLS students (magenta), untrained control subjects (green), and VBLaST-trained subjects (blue). Mann–Whitney U tests were used to statistically differentiate the control and FLS training groups (n.s. not significant, *p < 0.05) (Color figure online)

Fig. 4

VBLaST performance score learning curve. VBLaST pattern cutting performance scores are shown with respect to training day. VBLaST training students (blue) are compared to untrained control students (green). VBLaST pattern cutting task retention scores are shown for trained VBLaST students (blue), untrained control subjects (green), and FLS-trained subjects (magenta). Mann–Whitney U tests were used to statistically differentiate the control and FLS training groups (n.s. not significant, *p < 0.05) (Color figure online)

Figure 5A shows the CUSUM learning curve results for subjects trained on the FLS simulator. Three subjects, FLS2, FLS3, and FLS5, passed the acceptable failure rate of 5% (H0) over the course of the 12-day training period. Specifically, subjects FLS2, FLS3, and FLS5 passed the acceptable failure rate at trials 71, 85, and 85, respectively. Figure 5B shows the CUSUM learning curve results for subjects trained on the VBLaST simulator, where four subjects, VBLaST1, VBLaST4, VBLaST5, and VBLaST6, passed the acceptable failure rate of 5% (H0) over the course of the training period. Specifically, subjects VBLaST1, VBLaST4, VBLaST5, and VBLaST6 passed the acceptable failure rate at trials 57, 29, 29, and 29, respectively.
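The reported crossing trials can be related to the CUSUM parameters with a little arithmetic. Sign conventions for CUSUM vary between reports; the sketch below assumes the convention in which each success lowers the running score by s and each failure raises it by 1 − s, so that a curve falling below H0 signals an acceptable failure rate. It also assumes unrounded parameters computed from P0 = 5%, P1 = 10%, and error rates of 0.05 and 0.2 using the standard sequential-probability-ratio formulas. The function names and the success/failure sequences are hypothetical and are not study data.

```python
import math

# Parameters quoted in the text; formulas are an assumption (see lead-in)
P0, P1, ALPHA, BETA = 0.05, 0.10, 0.05, 0.20
_DENOM = math.log(P1 / P0) + math.log((1 - P0) / (1 - P1))
S = math.log((1 - P0) / (1 - P1)) / _DENOM    # about 0.0724
H0 = -math.log((1 - ALPHA) / BETA) / _DENOM   # about -2.0853


def first_crossing(outcomes, s=S, h0=H0):
    """Return the 1-based trial at which the running score first drops
    below h0, or None if it never does. `outcomes` holds True for a
    success and False for a failure."""
    score = 0.0
    for trial, success in enumerate(outcomes, start=1):
        score += -s if success else (1 - s)
        if score < h0:
            return trial
    return None


def earliest_crossing(failures, s=S, h0=H0):
    """Smallest trial count n satisfying failures - n * s < h0."""
    return math.floor((failures - h0) / s) + 1


print(first_crossing([True] * 40))                   # prints: 29
print(first_crossing([True] * 40, s=0.07, h0=-2.09))  # prints: 30
print(first_crossing([False] * 2 + [True] * 100))    # prints: 57
print([earliest_crossing(f) for f in range(5)])      # prints: [29, 43, 57, 71, 85]
```

Under these assumptions, an unbroken run of successes first crosses H0 at trial 29, and each failure delays the crossing by 14 trials. The reported crossing trials of 29, 57, 71, and 85 coincide with the values for zero, two, three, and four failures; this is an observation about the arithmetic rather than a result stated in the study.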

Fig. 5

CUSUM scores for trained FLS and VBLaST groups. CUSUM scores for each subject with respect to number of trials. The threshold score to be considered a senior in the pattern cutting task is 63 [9]. A CUSUM scores indicate that three subjects (FLS2, FLS3, FLS5) achieved the level of senior during the FLS training period. B CUSUM scores indicate that four subjects (VBLaST1, VBLaST4–6) achieved the level of senior during the VBLaST training period

Figure 6 shows the ex vivo transfer task completion times for the trained FLS, trained VBLaST, and untrained control groups. Results indicate that the trained FLS (7.9 ± 3.3) and trained VBLaST (12.3 ± 1.9) subjects completed the transfer task significantly faster than the untrained control group (18.4 ± 3.1, p < 0.05). However, there was no significant difference in transfer task completion time between the trained FLS and VBLaST groups (p > 0.05).

Fig. 6

Transfer task completion time. Transfer task completion times for FLS group (n = 7), control group (n = 5), and VBLaST group (n = 6) where all of the subjects had to mimic the pattern cutting task on the marked cadaveric tissue along with resecting the marked peritoneum tissue. Mann–Whitney U tests were used to statistically differentiate the control and FLS training groups (n.s. not significant, *p < 0.05)

Discussion

In this study, we establish convergent validity for the VBLaST pattern cutting simulator: trained VBLaST subjects significantly outperform the untrained control students in both the FLS and VBLaST simulation environments, indicating motor skill retention and transfer to a new simulation environment. These results are benchmarked against the established FLS simulator, for which there is also evidence of motor skill learning and transfer to the VBLaST simulation environment. Learning curve studies have shown evidence of laparoscopic skill learning in various laparoscopic-based procedures [17, 27,28,29,30]. Many of these studies use the CUSUM method to quantify learning curve outcomes. Specific to the VBLaST trainer, we previously reported the learning curves for the VBLaST peg transfer simulator with the “junior,” “intermediate,” and “senior” mastery levels [17]. While the pattern cutting and peg transfer tasks are different, we report an increased number of students who achieve the “senior” mastery level when compared to our previously validated VBLaST peg transfer module [17]. Learning curve results indicated that only three of the seven FLS students and four of the six VBLaST students achieved the “senior” mastery level. Although a direct comparison cannot be made, both simulation environments result in comparable learning.

While some studies report laparoscopic skills transfer from simulation environments to the operating room [31,32,33,34], we have chosen an ex vivo cadaveric tissue model to assess laparoscopic motor skill transfer. We observe that task transfer performance completion times for the trained VBLaST and FLS groups are significantly lower than for the control group and there was no significant difference between training on the real and the virtual simulators.

Limitations and future work

Currently, only task performance scores are used to determine surgical motor skill performance on the FLS and VBLaST trainers. Studies have shown that other measures, such as kinematic metrics, can also be used as effective measures for assessing surgical skill [35, 36]. All of these metrics, whether task performance scores or kinematic metrics, assess the resulting motor task performance. They focus on the outcomes of task performance instead of assessing the underlying neurological responses to fine motor skills. Neurophysiological metrics that can be incorporated into surgical simulators can also provide objective measures of motor skill performance by directly measuring cortical activation during a given task [37]. Ultimately, a multivariate approach that combines numerous distinguishable metrics can be useful in objectively differentiating and classifying laparoscopic motor skills with significantly higher accuracy. Another limitation is the use of CUSUM scores to objectively measure learning curve outcomes for longitudinal studies. The CUSUM method uses a threshold to assign each trial a binary value of “success” or “failure” depending on whether the threshold condition is met. However, many learning curves are non-linear, and this non-linearity is not captured by the CUSUM method. Moreover, CUSUM scores rely on arbitrary threshold values that may not directly translate from one simulation environment to another. Traditionally, transfer tasks have been performed on live patients or animal models to show transfer of laparoscopic motor skills from the simulation environment to clinical environments [31,32,33,34]. Due to the complexity and variability of in vivo clinical environments, it is often difficult to standardize the transfer task for each subject. Furthermore, metrics to assess laparoscopic motor skill transfer are often subjective or depend on GRS that are not robust. By using ex vivo-based models, it is possible to add more objectivity to the assessment of laparoscopic motor skill transfer, even if the objective measures are as simple as task completion time. We plan to address some of these limitations regarding objective assessment of motor skill learning and transfer in future studies.