With mounting concerns about patient safety and the need for objective measures of surgical technical competence, simulation as a means of surgical training and certification is rapidly gaining ground [1]. The fundamentals of laparoscopic surgery (FLS), which employs a box trainer, and the fundamentals of endoscopic surgery (FES), which employs a virtual reality-based simulator, have recently been adopted by the American Board of Surgery as prerequisites for certification in general surgery [2,3,4,5,6,7,8,9,10]. However, prior to acceptance, each simulator, real or virtual, must undergo extensive validation and show evidence of successful transfer of technical skills from the simulation environment to the clinical environment [1, 11, 12].

The current standard for assessing successful transfer of skills from the simulation environment to a clinical setting is direct observation by an expert clinician [13] using a checklist such as the objective structured assessment of technical skills or the global operative assessment of laparoscopic skills [13,14,15]. Alternative metrics such as task completion time have also been reported for assessing technical skill transfer [16]. Despite the widespread use of these generalized rating or completion time-based assessments, these methods have significant drawbacks, including personnel resource costs, poor interrater reliability between proctors, and poor correlation between technical skills learned on the simulator and outcomes in the operating room [16,17,18]. These limitations point to the need for more objective and analytical methods to assess surgical skill transfer [19, 20].

A promising objective technique for determining surgical motor skill is non-invasive brain imaging. Among the non-invasive brain imaging methods currently available, functional near-infrared spectroscopy (fNIRS) offers a unique combination of features: it is portable, non-invasive, unobtrusive to the performance of the surgical task, fast, and relatively inexpensive [21, 22]. Investigators have used fNIRS to study differences in brain activation between surgical experts and novices during the performance of surgical training tasks by measuring fluctuations in hemodynamic signals, namely changes in the concentration of oxygenated and deoxygenated hemoglobin [23,24,25,26,27,28,29]. However, these studies are limited in scope: they are subject to signal contamination from superficial tissue and show no evidence of surgical skill transfer to more clinically relevant environments.

The purpose of this study is to determine whether fNIRS can accurately assess motor skill transfer from simulation to ex-vivo environments for trained and untrained subjects as they perform an established surgical training task. We hypothesize that fNIRS-based metrics can classify different levels of surgical motor skill transfer more accurately than established methods. To test this hypothesis, subjects trained on a physical or virtual surgical simulator, where they practiced a surgical training task, and subsequently performed a surgical transfer task post-training. We then applied multivariate statistical approaches to the brain imaging metrics to objectively differentiate and classify subjects that exhibit successful motor skill transfer.

Methods

Experimental setup

Two different laparoscopic skills trainers were utilized in the study. We used the official FLS box trainer as the physical simulator, since it is widely used for training laparoscopic skills and is validated for board certification [8, 30, 31]. We used the validated Virtual Basic Laparoscopic Skills Trainer (VBLaST) system, which replicates the FLS pattern cutting task on a computer model with high fidelity [32,33,34,35,36,37], as the virtual simulator. To perform real-time brain imaging, an fNIRS system (CW6 system, TechEn Inc., MA, USA) was used to deliver infrared light.

To measure cortical activation changes during the transfer task, we measured functional activation specifically in the prefrontal cortex (PFC), primary motor cortex (M1), and supplementary motor area (SMA), as these cortical regions are directly involved in fine motor skill learning, planning, and execution [28, 38,39,40,41,42]. We designed a probe geometry that includes eight infrared illumination sources coupled to 16 long separation detectors and eight short separation detectors. Monte Carlo simulations indicate that this probe design is highly sensitive to functional activation changes in the PFC, M1, and SMA [43]. The distance between each long separation detector and its corresponding source is within 30–40 mm to ensure specificity to white and gray matter. Short separation detectors were placed 8 mm away from each corresponding source so that they measure only superficial tissue layers, such as the skin, bone, dura, and pial surface; these superficial signals are later regressed out during post-processing. A schematic of the probe locations on the scalp, along with the probe geometry specifications, is shown in Fig. 1.
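As a concrete illustration of these geometric constraints, the short Python sketch below checks hypothetical scalp-projected optode coordinates against the long (30–40 mm) and short (8 mm) separation requirements described above; the coordinates and channel labels are assumptions for illustration, not the actual probe layout.

# Minimal sketch (not the study's probe design files): verify source-detector
# separations against the constraints described above, using hypothetical
# 2D scalp-projected optode coordinates in millimeters.
import numpy as np

sources = {1: np.array([0.0, 0.0])}                   # hypothetical source position
long_detectors = {(1, 1): np.array([32.0, 8.0])}      # (source, detector) -> position
short_detectors = {(1, 1): np.array([8.0, 0.0])}

def separation(a, b):
    """Euclidean distance between two optode positions (mm)."""
    return float(np.linalg.norm(a - b))

for (src, det), pos in long_detectors.items():
    d = separation(sources[src], pos)
    assert 30.0 <= d <= 40.0, f"long channel {src}-{det} outside 30-40 mm: {d:.1f} mm"

for (src, det), pos in short_detectors.items():
    d = separation(sources[src], pos)
    assert abs(d - 8.0) <= 1.0, f"short channel {src}-{det} not ~8 mm: {d:.1f} mm"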

Fig. 1

Infrared probe geometry and positioning. Schematic of probe placement projected onto cortical locations specific to the PFC, M1, and SMA. Optodes are placed for maximum coverage over the PFC, M1, and SMA. Red dots indicate infrared sources, blue dots indicate long separation detectors, and textured blue dots indicate short separation detectors. The PFC has three sources (1–3), M1 has four sources (4–7), and the SMA has one source (8). Each source is connected to its corresponding long and short separation detectors. (Color figure online)

Subject recruitment and study design

In this IRB-approved study, 18 medical student subjects were recruited at the University at Buffalo. These subjects had no prior surgical experience and were randomly assigned to one of three groups: untrained control (n = 5), FLS training (n = 7), and VBLaST training (n = 6). Only the FLS and VBLaST training groups underwent rigorous training on their respective simulators, over 12 consecutive days, completing an average of more than 100 pattern cutting trials per subject; the control group did not train on either simulator. Once training was complete, all subjects performed a post-test after a 2-week break period to measure surgical skill retention; for the control group, the 2-week break followed their baseline tests. The post-test consisted of three pattern cutting trials per subject on each of the FLS and VBLaST simulators. Nemani et al. provide further details on the study design, power calculations for sample sizes, and other experimental design considerations [37].

The transfer task consisted of the FLS pattern cutting task performed on cadaveric abdominal tissue instead of gauze. One cadaveric tissue sample, consisting of a peritoneum layer with underlying fascia and muscle tissue, was prepared for each subject. While each sample was on average half an inch thick, the peritoneum layer was only a few millimeters deep. Each sample was marked with a circle of the same dimensions as the marked circle in the FLS pattern cutting task. Tissue samples were securely placed in the official FLS trainer box. Each subject was then instructed to cut the marked circle on the peritoneal tissue sample and resect the cut peritoneum section as quickly and accurately as possible without damaging the underlying muscle.

Accredited task performance metrics

Task performance metrics based on time and error have already been established for the FLS and VBLaST simulators. The FLS scoring metrics used in board certification are proprietary and were obtained under a non-disclosure agreement with the Society of American Gastrointestinal and Endoscopic Surgeons. The VBLaST pattern cutting score reproduces the FLS scoring methodology in the virtual environment [36]. As a measure of training effectiveness, the FLS and VBLaST pattern cutting scores at the post-test were reported to demonstrate that trained subjects significantly outperformed untrained subjects. The performance metric for the transfer task was completion time; university policies prohibited video recording of cadaveric tissue, so no further performance measures could be obtained. Task completion time, measured with an accuracy of ± 1 s, consisted of the total time (in minutes) required to cut and resect the marked peritoneal tissue from the overall tissue sample.

Neuroimaging-based performance metrics

Functional brain imaging using fNIRS was used to derive a metric for measuring bimanual surgical skill performance in this study. Prior to data analysis, only measurement channels with signal quality between 80 and 140 dB were included. The signals measured at the 690 nm and 830 nm wavelengths were converted to optical density, and motion artifacts and systemic physiology interference were corrected using low-pass filters and recursive principal component analysis [46,47,48]. The filtered optical density data were then converted to changes in the concentrations of oxy- and deoxy-hemoglobin using the modified Beer–Lambert law, with partial pathlength factors of 6.4 and 5.8 for the two wavelengths, respectively [44, 45]. To remove signals from superficial tissue layers and increase specificity to cortical tissue hemodynamics, signals from the short separation detectors were regressed out of the long separation detector signals [49]. Finally, the corresponding source-detector pairs for each source were averaged over the transfer task completion time, yielding a scalar value of the change in oxy-hemoglobin for each brain region and each participant. All fNIRS data processing was completed using the open-source software HOMER2 [46].
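The following Python sketch illustrates, in simplified form, the general flow of such a pipeline: two-wavelength intensity is converted to optical density, converted to hemoglobin concentration changes via the modified Beer–Lambert law, cleaned with short separation regression, and averaged over the task. It is not the HOMER2 implementation used in the study; the extinction coefficients, sampling parameters, and synthetic signals are placeholder assumptions, and the filtering steps are omitted.

# Minimal sketch of an fNIRS processing chain (not the HOMER2 pipeline used
# in the study). Extinction coefficients and signals are placeholders.
import numpy as np

def intensity_to_od(intensity):
    """Optical density change relative to the mean intensity."""
    return -np.log10(intensity / np.mean(intensity))

def mbll(d_od_690, d_od_830, distance_mm, ppf=(6.4, 5.8)):
    """Modified Beer-Lambert law for one channel: solve the 2x2 system
    dOD(lambda) = [eps_HbO(lambda), eps_HbR(lambda)] . [dHbO, dHbR] * L * PPF(lambda)
    for the two chromophores. The extinction matrix below holds placeholder
    values; a real analysis uses tabulated extinction spectra."""
    eps = np.array([[0.10, 0.50],    # 690 nm: HbO, HbR (placeholder, 1/(mM*mm))
                    [0.20, 0.18]])   # 830 nm: HbO, HbR (placeholder, 1/(mM*mm))
    path = distance_mm * np.array(ppf)          # effective pathlength per wavelength
    d_od = np.vstack([d_od_690 / path[0], d_od_830 / path[1]])
    d_hbo, d_hbr = np.linalg.solve(eps, d_od)   # concentration changes over time
    return d_hbo, d_hbr

def regress_short_separation(long_hbo, short_hbo):
    """Remove superficial physiology by regressing the short channel out."""
    beta = np.dot(short_hbo, long_hbo) / np.dot(short_hbo, short_hbo)
    return long_hbo - beta * short_hbo

# Synthetic example: one long (35 mm) and one short (8 mm) channel, 10 Hz, 60 s.
rng = np.random.default_rng(1)
raw_690 = 1.0 + 0.01 * rng.standard_normal(600)
raw_830 = 1.0 + 0.01 * rng.standard_normal(600)
hbo_long, _ = mbll(intensity_to_od(raw_690), intensity_to_od(raw_830), 35.0)
hbo_short, _ = mbll(intensity_to_od(raw_690), intensity_to_od(raw_830), 8.0)
hbo_cortical = regress_short_separation(hbo_long, hbo_short)
task_average_hbo = hbo_cortical.mean()          # scalar metric for this channel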

Statistical tests and classification approaches

To determine statistical significance between data sets, two-tailed Mann–Whitney U tests were used at the 95% confidence level. This test was applied to all univariate comparisons, with the type I error rate set at 0.05 for all hypothesis tests.
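A minimal sketch of such a univariate comparison is shown below, using SciPy and placeholder group values; the numbers are illustrative, not the study data.

# Minimal sketch: two-tailed Mann-Whitney U test at alpha = 0.05.
# The group values below are placeholders, not study data.
from scipy.stats import mannwhitneyu

trained = [7.2, 6.8, 9.5, 8.1, 7.7]          # e.g., hypothetical completion times (min)
untrained = [17.9, 18.5, 21.0, 15.8, 19.2]

u_stat, p_value = mannwhitneyu(trained, untrained, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}, significant = {p_value < 0.05}")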

Linear discriminant analysis (LDA) was used to classify untrained control subjects against either FLS trained or VBLaST trained subjects based on traditional and fNIRS metrics. LDA is an established multivariate classification approach that determines the maximal separation between two classes based on multivariate metrics [50, 51]. The type I error is defined as 0.05 for all classification models. The quality of classification is reported as misclassification errors (MCE), specifically MCE1 and MCE2. MCE1 is defined as the probability that a trained subject is misclassified as an untrained subject during the transfer task; conversely, MCE2 is defined as the probability that an untrained subject is misclassified as a trained subject. Theoretically, MCEs of 100% indicate that untrained and trained subjects are identical and indistinguishable, whereas MCEs of 0% indicate that untrained and trained subjects can be classified and differentiated with absolute certainty.
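For concreteness, the sketch below shows how an LDA model and the two misclassification errors can be computed with scikit-learn; the feature matrix and labels are hypothetical placeholders, and the study's actual analysis was performed in Matlab.

# Minimal sketch (not the authors' Matlab code): LDA classification of trained
# vs. untrained subjects and the two misclassification errors defined above.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Rows: subjects; columns: hypothetical [PFC, LMM1, SMA] activation metrics.
X = np.array([[0.10, 0.60, 0.40], [0.20, 0.70, 0.50], [0.05, 0.50, 0.30], [0.15, 0.55, 0.45],
              [0.10, -0.40, -0.10], [0.20, -0.50, 0.00], [0.00, -0.30, -0.20], [0.12, -0.45, -0.05]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # 1 = trained, 0 = untrained control

lda = LinearDiscriminantAnalysis().fit(X, y)
pred = lda.predict(X)

mce1 = np.mean(pred[y == 1] == 0)   # trained misclassified as untrained
mce2 = np.mean(pred[y == 0] == 1)   # untrained misclassified as trained
print(f"MCE1 = {mce1:.0%}, MCE2 = {mce2:.0%}")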

Leave-one-out cross-validation was used to assess how well each classification model generalizes to independent data: one data point is systematically held out at a time, the model is fit on the remaining data, and the held-out point is classified. Ultimately, cross-validation allows an objective assessment of the robustness of the classification models when incorporating potentially new untrained or trained subject data. All classification and statistical analyses were completed using Matlab (Mathworks, Natick, MA).
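A corresponding leave-one-out loop is sketched below with scikit-learn, reusing the same hypothetical feature matrix and labels as in the previous sketch; again, this is an illustrative analogue of the Matlab analysis, not the study code.

# Minimal sketch of leave-one-out cross-validation for the LDA model;
# X and y are the same hypothetical feature matrix and labels as above.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut

X = np.array([[0.10, 0.60, 0.40], [0.20, 0.70, 0.50], [0.05, 0.50, 0.30], [0.15, 0.55, 0.45],
              [0.10, -0.40, -0.10], [0.20, -0.50, 0.00], [0.00, -0.30, -0.20], [0.12, -0.45, -0.05]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LinearDiscriminantAnalysis().fit(X[train_idx], y[train_idx])
    errors.append(int(model.predict(X[test_idx])[0] != y[test_idx][0]))

# Fraction of held-out subjects that the model misclassifies.
print(f"LOOCV misclassification rate = {np.mean(errors):.0%}")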

Results

Differentiation and classification of motor skill transfer based on traditional task performance

To investigate whether trained subjects significantly outperform untrained subjects in the ex-vivo environment, we first report transfer task completion times for trained FLS, trained VBLaST, and untrained control subjects. As shown in Fig. 2A, both the trained FLS (7.9 ± 3.3 min) and trained VBLaST (12.2 ± 1.8 min) groups completed the transfer task significantly faster than the untrained control group (18.3 ± 3.1 min, p < 0.05). While these results show that transfer task time can statistically differentiate trained and untrained subjects during a transfer task, they do not address the accuracy of that differentiation.

Fig. 2

Classification of surgical motor skill transfer based on task completion time. A Transfer task completion times for the trained FLS, untrained control, and trained VBLaST subjects (*p < 0.05). B LDA classification of trained FLS and control subjects during the transfer task based on completion times and C corresponding cross-validation results. D LDA classification of trained VBLaST and control subjects during the transfer task based on completion times and E corresponding cross-validation results

In this context, LDA-based classification was used to classify trained and untrained subjects based on completion time. Figure 2B shows that classification of trained FLS and untrained control subjects based on transfer task completion time is poor, as shown by high MCEs (MCE1 = 20%, MCE2 = 14%). These results indicate that a trained FLS subject has a 20% probability of being misclassified as a control subject, and an untrained control subject has a 14% probability of being misclassified as an FLS trained subject. Cross-validation results, as seen in Fig. 2C, show that 10/12 or 83% of the samples have MCEs less than 5%, indicating that the classification model is valid for potential future data sets. The same classification approach was applied to the virtual simulator trained (VBLaST) subjects versus untrained control subjects, as shown in Fig. 2D. Once again, subject classification based on transfer task completion time is poor, indicated by high MCEs (MCE1 = 20%, MCE2 = 41%). Furthermore, cross-validation results show that 8/11 or 72% of the samples have MCEs less than 5%, as shown in Fig. 2E.

Neuroimaging-based metrics for differentiation and classification of motor skill transfer

Due to the high MCE encountered in assessing transfer task performance based on task time, we propose subject classification based on fNIRS metrics. Prior to classification, we determined whether fNIRS is sensitive to cortical activation changes during the transfer task, specifically in the PFC, left medial M1 (LMM1), and SMA. Results indicate that simulator trained subjects showed no significant differences in any PFC cortical region compared to control subjects (p > 0.05). However, both FLS and VBLaST simulator trained subjects had significantly higher functional activation in the left medial M1 (0.64 ± 0.54 and 0.44 ± 0.18 ∆HbO2 conc. µM*mm, respectively) compared to untrained control subjects (− 0.44 ± 0.72 ∆HbO2 conc. µM*mm, p = 0.018 and p = 0.004, respectively). Furthermore, both FLS and VBLaST trained subjects also showed significant increases in functional activation in the SMA (0.42 ± 0.56 and 0.74 ± 0.47 ∆HbO2 conc. µM*mm, respectively) compared to untrained control subjects (− 0.08 ± 0.22 ∆HbO2 conc. µM*mm, p = 0.048 and p = 0.009, respectively). Figure 3A summarizes these descriptive statistics, and Fig. 3B shows a visual depiction of average functional activation changes with respect to the various cortical regions.

Fig. 3

Changes in cortical activation during the transfer task with respect to cortical regions. A Average changes in oxy-hemoglobin (ΔHbO2) concentration as a measure of functional activation with respect to different cortical regions for FLS trained subjects (magenta), untrained control subjects (cyan), and VBLaST trained subjects (black) while all subjects perform the ex-vivo transfer task. B Spatial maps of functional activation with respect to varying levels of training during the ex-vivo transfer task. Spatial maps cover specific regions including the PFC, M1, and SMA for all subjects. (Color figure online)

To compare the accuracy of subject classification based on transfer task completion time versus fNIRS-based metrics, several combinations of metrics were used in the classification models. These combinations included transfer task completion time alone and all possible combinations of the PFC, LMM1, and SMA metrics. Figure 4A shows the relative MCEs for the various combinations of performance and fNIRS metrics used to classify FLS trained subjects against untrained control subjects. The fNIRS metric combination of PFC + LMM1 + SMA used for the FLS classification model yields very low misclassification errors (MCE1 = 2.2%, MCE2 = 2.7%).
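A sketch of this combination search is shown below: each subset of hypothetical per-subject metrics (completion time, PFC, LMM1, SMA) is fed to an LDA model and both misclassification errors are reported. This is an illustrative scikit-learn analogue of the Matlab analysis using synthetic data; in the study, each combination would additionally be assessed with leave-one-out cross-validation.

# Minimal sketch: evaluate LDA classification over every combination of the
# transfer task time and fNIRS metrics, analogous to Fig. 4. All values are
# synthetic placeholders, not study data.
from itertools import combinations
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
names = ["time", "PFC", "LMM1", "SMA"]
# Hypothetical per-subject metrics: 6 trained (label 1), 5 untrained (label 0).
X_all = np.vstack([rng.normal([8.0, 0.1, 0.6, 0.4], 0.3, size=(6, 4)),
                   rng.normal([18.0, 0.1, -0.4, -0.1], 0.3, size=(5, 4))])
y = np.array([1] * 6 + [0] * 5)

for r in range(1, len(names) + 1):
    for combo in combinations(range(len(names)), r):
        cols = list(combo)
        pred = LinearDiscriminantAnalysis().fit(X_all[:, cols], y).predict(X_all[:, cols])
        mce1 = np.mean(pred[y == 1] == 0)    # trained misclassified as untrained
        mce2 = np.mean(pred[y == 0] == 1)    # untrained misclassified as trained
        print("+".join(names[i] for i in cols), f"MCE1={mce1:.0%}", f"MCE2={mce2:.0%}")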

Fig. 4

A Classification results for FLS trained, VBLaST trained, and untrained control subjects: the cumulative set of MCE1 and MCE2 for all combinations of fNIRS metrics and transfer task completion time used to classify FLS trained and control subjects. MCE1 indicates the probability that a trained subject is misclassified as an untrained control subject, and MCE2 indicates the probability that an untrained control subject is misclassified as a trained subject. B The cumulative set of MCE1 and MCE2 for all combinations of fNIRS and transfer task metrics used to classify VBLaST trained and control subjects. C Leave-one-out cross-validation results indicating the MCE for each sample treated as an independent sample in the LDA model, for all combinations of fNIRS and transfer task metrics; the percentage of samples with MCE below 0.05 is shown for each metric combination

Similarly, the fNIRS metric combination of PFC + LMM1 + SMA used for the VBLaST classification model yields very low misclassification errors (MCE1 = 8.9%, MCE2 = 9.1%), as shown in Fig. 4B. Figure 4C shows the cross-validation results of the various classification models for distinguishing trained FLS or VBLaST subjects from untrained control subjects. For FLS trained versus control subjects, classification based on the transfer performance score and the PFC + LMM1 + SMA combination yields 83% of samples with MCE less than 0.05. Similarly, for VBLaST trained versus control subjects, the transfer task performance score and the PFC + LMM1 + SMA metric combination yield 72% of samples with MCE less than 0.05. These cross-validation results independently assess the accuracy of both classification models using transfer task time and PFC + LMM1 + SMA metrics, showing that the resulting MCEs from the two classification models can be objectively compared.

Discussion

Accurate and objective assessment of surgical skill transfer from simulation environments to clinical settings is vital in determining the effectiveness of surgical training. Current standards utilizing rating checklists or task completion time metrics are of limited reliability when objectively determining motor skill transfer to clinical environments [16,17,18, 52, 53]. For the first time, we present evidence that a neuroimaging-based approach provides objective assessment of surgical skill transfer from simulation to clinically relevant environments. The results are independent of whether the simulated task was performed on a physical or a virtual simulator and have been independently assessed to be robust in classifying trained and untrained subjects.

Note that the de facto metric used in numerous validation studies to show surgical skill transfer is performance time [16]. While our results corroborate the notion that decreases in task performance time are a feature of expert surgical skill, using this metric alone leads to inconsistencies in the literature [16, 53,54,55]. This point is further supported by our classification models, where task performance time metrics yield 20–41% MCE, indicating a lack of robustness. Since no single metric, such as task completion time, can by itself demonstrate surgical skill proficiency between trained and untrained subjects [16, 54], our multivariate approach to classifying trained and untrained subjects based on fNIRS metrics brings robustness to surgical skill transfer assessment. Unfortunately, task quality measures are also subjective and not standardized for simulation paradigms, further prompting the need for alternative methods such as our neuroimaging-based approach [16,17,18, 54].

Using fNIRS to measure functional brain activation in real time, we have shown that FLS and VBLaST trained subjects show significant increases in activation in the left medial M1 and SMA, but no significant differences in the PFC. These regions were deliberately chosen for their influence on motor task planning, execution, and fine motor control in complex motor tasks and their critical role in motor skill learning [38, 39, 56,57,58,59,60]. Specifically, the PFC is associated with motor strategy and the early stages of motor skill learning, whereas M1 and the SMA are associated with execution and fine motor control and show increased activation during the later stages of motor skill learning, an indication of the acquisition of fine motor skills.

Our results are consistent with literature findings indicating that subjects with fine motor skills in complex motor tasks exhibit higher M1 and SMA activation, particularly for bimanual motor tasks [41, 42, 61]. Furthermore, since all subjects were right handed, the majority of the fine motor manipulations employed during the pattern cutting task were performed with the right hand. Since right-handed motor tasks evoke contralateral activation in the left hemisphere of the cortex, we expect increased activity in the left medial M1 [38, 39, 56,57,58,59,60]. Although we do not report any significant cortical activation differences between the untrained and trained subjects in the PFC during the transfer task, this is an expected result, since all subjects must recruit the PFC to develop a motor strategy for this unfamiliar transfer task. While these results show promise in assessing surgical motor skill transfer via brain imaging techniques, future studies are required for a definitive conclusion. These studies would prospectively include larger sample sizes and broader subject recruitment, high-density probes for higher spatial resolution imaging, and the inclusion of other FLS tasks for transfer task assessment.

Using well-established neurophysiological principles, our work integrates the most recent advances in neuroimaging with the assessment of surgical competence during transfer of skills from a simulation environment. Since fNIRS signals are heavily contaminated by superficial tissue, short separation channel regression can be used to isolate cortical brain activation signals from superficial tissue [49, 62]. Such approaches provide more robust estimates of the underlying hemodynamic responses associated with surgical tasks, which were not reported in previous fNIRS surgical studies.

Conclusion

Here, we propose fNIRS as a non-invasive, real-time imaging method to successfully differentiate and classify surgical motor skills that transfer from simulation to ex-vivo environments. First, we show that conventional surgical skill transfer metrics, such as task completion time, have high MCE when used to classify trained and untrained subjects in assessing surgical motor skill transfer. We also show that fNIRS-based metrics have significantly lower MCE than task completion time for surgical skill transfer assessment. fNIRS-based approaches to objectively quantifying motor skill transfer may represent a paradigm shift for the surgical community in determining the effectiveness of surgical trainers in training technical skills that ultimately transfer to the operating room.