Introduction

Introducing new surgical methods can initially lead to longer operating times and an increased risk of adverse events, which can compromise patient safety [1, 2].

Over the last decade, robotic-assisted laparoscopic surgery (RALS) has become increasingly common, creating a need for structured training programmes and training tools for learning RALS [3]. Several randomised studies have shown that simulation-based training in laparoscopy shortens the learning curve for inexperienced surgeons and reduces operating time and the risk of adverse events during the initial procedures [4,5,6].

Whereas much research has focused on simulation-based training of laparoscopic surgery, RALS has received comparatively little attention [7,8,9].

An important aspect of modern surgical education is that it uses competency-based training rather than training based on time spent or the number of procedures performed. The implementation of tools to assess surgical skills is therefore essential. Simulation-based assessments can ensure that novices possess basic skills before operating on real patients, which may help increase patient safety [3, 7]. However, before using a simulation-based test to assess competency, it is imperative that the assessment tool is examined to ensure its relevance, validity and reliability [10].

It is especially important to determine a defensible pass/fail level using an appropriate standard-setting method to ensure an adequate level of competency. However, there is currently no consensus on which method to apply [11]. One example of a simulation-based test for robotic surgical skills is the Fundamentals of Robotic Surgery, which is used for dry-lab training. However, no other simulation-based assessment tools have gained widespread use, despite several virtual reality robotic simulators being available [12, 13]. The da Vinci Skills Simulator (dVSS) is the only robotic simulator available today that uses the actual robotic console. It is therefore believed to be the simulator system closest to the real-life experience of performing surgery with the da Vinci robotic system.

The objective of the present study was to examine validity evidence for the dVSS as an assessment tool for robotic skills and to evaluate its usefulness as a training tool. We also wished to assess the implications of setting a pass/fail level based on the initial performance of experienced surgeons. This was done by comparing participants' performance before and after simulator training and using this comparison to determine a relevant pass/fail level.

Materials and methods

We designed a validation study examining the da Vinci Skills Simulator and used the unitary framework to describe the validity evidence [10, 14].

Participants and settings

Surgeons from five different specialties at three hospitals in the eastern part of Denmark (Rigshospitalet-Glostrup University Hospital, Herlev-Gentofte University Hospital and Zealand University Hospital) were invited to participate in the study. The inclusion criterion was surgeons (specialists or residents in their final year) who were about to begin or were already performing RALS procedures. They were grouped according to the number of previously performed robotic procedures: novice surgeons (0 procedures), intermediate surgeons (1–50 procedures) and experienced surgeons (> 50 procedures). Novices with more than three hours of simulation-based robotic surgical training were excluded.

Description of the simulator and selection of exercises

The dVSS (Intuitive Surgical®, California, USA) was used in combination with the da Vinci® Si™ console.

The simulator has a built-in scoring system consisting of metrics such as time to complete exercise, economy of motion and excessive instrument force, as well as task-specific metrics. Based on a weighted average of these parameters, a total performance score was calculated for each exercise. All data were stored automatically on the simulator.

An experienced robotic surgeon evaluated all 40 simulator exercises on the dVSS and selected the 10 most relevant. Subsequently, five experienced robotic surgeons ranked these 10 exercises from 1 to 10, with 1 being the most relevant and 10 the least relevant. The five exercises with the best average rank were chosen for further testing.

Testing of exercises

All participants completed a questionnaire on baseline demographics. All participants received a short oral introduction to the dVSS and performed one exercise (Peg Board 2) to familiarise themselves with the simulator. They then completed the five exercises once to familiarise themselves with these particular exercises. Finally, they completed all five exercises again, and this was used to assess their baseline performance (first attempt).

Participants were then invited back for three additional sessions on the dVSS with an interval of 1–4 weeks between sessions. At each session, they completed three repetitions of the five exercises. Between sessions, the experts and intermediates were not allowed to practise on the dVSS and novices were not allowed to perform robotic-assisted surgery.

After each repetition of the five exercises, the participants filled out a Subjective Mental Effort Questionnaire (SMEQ) to measure their cognitive load. Using the SMEQ, they rated the amount of mental effort they felt they had used, on a scale of 0 to 150 [15, 16]. The scale included nine markers with corresponding statements from “not at all hard to do” to “tremendously hard to do”.

An investigator (MC or FB) was present during all sessions and was only allowed to provide technical support to the participants. The investigator did not provide feedback or instructions on how to complete the exercises.

Outcomes

The primary outcome was the average total performance score across the five exercises on the first attempt. The secondary outcome was the average total performance score across the five exercises on the 10th attempt. Exploratory outcomes were the SMEQ scores used to assess cognitive load after the 1st and 10th attempts.

Statistical analysis

The Kruskal–Wallis test was used to examine differences between the three groups of surgeons with different levels of robotic experience. Group-wise comparisons, using the Wilcoxon rank-sum test with Benjamini–Hochberg correction, were performed to identify differences between novice and intermediate, novice and experienced, and intermediate and experienced surgeons [17]. The Spearman rank correlation coefficient (rs) was calculated to examine the correlation between the number of robotic procedures performed and the average total performance score.
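The analyses were run in SPSS (see below). For readers wishing to reproduce the group comparisons in open-source software, a minimal sketch in Python is given here; the score and procedure arrays and their values are purely illustrative and are not study data.

```python
# Minimal sketch of the group comparisons, assuming average total scores
# (first attempt) grouped by experience level. All values are illustrative.
from scipy import stats
from statsmodels.stats.multitest import multipletests

novice = [45.2, 51.0, 38.7, 55.4]
intermediate = [72.1, 68.4, 75.9, 70.2]
experienced = [70.3, 81.2, 77.5, 79.8]

# Overall difference between the three experience groups (Kruskal-Wallis)
h_stat, p_overall = stats.kruskal(novice, intermediate, experienced)

# Pairwise Wilcoxon rank-sum (Mann-Whitney U) tests with
# Benjamini-Hochberg correction of the p-values
pairs = [(novice, intermediate), (novice, experienced), (intermediate, experienced)]
raw_p = [stats.mannwhitneyu(a, b, alternative="two-sided").pvalue for a, b in pairs]
_, adjusted_p, _, _ = multipletests(raw_p, method="fdr_bh")

# Spearman correlation between number of robotic procedures and score
procedures = [0, 0, 0, 0, 12, 30, 45, 40, 80, 150, 200, 120]
scores = novice + intermediate + experienced
rho, p_corr = stats.spearmanr(procedures, scores)
```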

The 9th and 10th attempts were used to calculate the intraclass correlation coefficient (ICC), with the single-measures, absolute-agreement definition, to evaluate the internal structure.
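The "single measures, absolute agreement" ICC in SPSS corresponds to ICC(2,1) under a two-way random-effects model. As a hedged illustration only, the same quantity could be obtained in Python with the pingouin package; the column names and values below are assumptions made for the sketch, not study data.

```python
# Illustrative sketch: ICC (single measures, absolute agreement), assuming a
# two-way random-effects model (reported as ICC2 by pingouin). Placeholder data.
import pandas as pd
import pingouin as pg

long_scores = pd.DataFrame({
    "participant": ["p01", "p01", "p02", "p02", "p03", "p03", "p04", "p04"],
    "attempt":     [9, 10, 9, 10, 9, 10, 9, 10],
    "score":       [78.0, 81.0, 64.0, 70.0, 85.0, 83.0, 72.0, 75.0],
})

icc_table = pg.intraclass_corr(data=long_scores, targets="participant",
                               raters="attempt", ratings="score")
icc_single_absolute = icc_table.loc[icc_table["Type"] == "ICC2", "ICC"].item()
```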

To investigate the consequences of testing and to define an acceptable pass/fail level, we examined how many participants in the novice and experienced groups would pass or fail at their 1st and 10th attempts, respectively. We calculated the pass/fail levels as the experienced group's average performance score for all five exercises minus one standard deviation. Based on this, we decided on a relevant pass/fail level. This was reported using both frequencies and percentages.
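As a concrete illustration of this standard-setting arithmetic (experienced-group mean minus one standard deviation, then counting who reaches the cut-off), a short sketch with placeholder scores could look like this:

```python
# Sketch of the pass/fail calculation: experienced-group mean minus one SD.
# Score lists are illustrative placeholders, not study data.
import statistics

experienced_scores = [82.0, 74.0, 69.0, 88.0, 77.0]   # average total scores (%)
novice_scores = [55.0, 71.0, 48.0, 63.0, 59.0]

pass_fail_level = (statistics.mean(experienced_scores)
                   - statistics.stdev(experienced_scores))

def pass_rate(scores, cutoff):
    """Fraction of participants whose average score reaches the cut-off."""
    return sum(score >= cutoff for score in scores) / len(scores)

novice_pass_rate = pass_rate(novice_scores, pass_fail_level)
experienced_pass_rate = pass_rate(experienced_scores, pass_fail_level)
```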

To examine the effect of simulator training, we performed the Wilcoxon signed-rank test and calculated the effect size, comparing the 1st attempt with the 10th attempt. To assess whether simulator training had an impact on cognitive load, the SMEQ scores for the 1st and 10th attempts were compared (Fig. 1).
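For the paired comparison, a corresponding sketch is shown below. The effect size is computed here as r = |Z|/√n, which is one common convention for the Wilcoxon signed-rank test; whether this matches the convention used in the study is an assumption, and all values are illustrative.

```python
# Sketch of the paired 1st-vs-10th-attempt comparison and an effect-size
# estimate r = |Z| / sqrt(n). Scores are illustrative placeholders.
import math
from scipy import stats

first_attempt = [45.0, 52.0, 38.0, 61.0, 49.0, 57.0, 44.0, 50.0]
tenth_attempt = [78.0, 81.0, 70.0, 84.0, 76.0, 80.0, 73.0, 79.0]

# Two-sided Wilcoxon signed-rank test (two-sided is the default)
res = stats.wilcoxon(first_attempt, tenth_attempt)

# Recover an approximate Z statistic from the two-sided p-value and
# convert it to the effect size r
n = len(first_attempt)
z = stats.norm.isf(res.pvalue / 2)
effect_size_r = abs(z) / math.sqrt(n)
```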

We used SPSS® version 21.0 (IBM, Armonk, New York, USA) for statistical analysis. Complete case analysis was performed and participants who did not complete all four training sessions were excluded from analysis of the secondary and exploratory outcomes. A two-sided significance level of 0.05 was used.

Fig. 1
figure 1

Surgeon practising on the da Vinci Skills Simulator

Ethics and funding

The study was submitted to the Regional Scientific Ethical Committee (Ref. No. H-1-2013-FSP-73), which found that no approval was necessary to carry out the study. All participants were informed by the principal investigator and signed a written consent form before participating. This study was not supported by any specific grants from funding agencies in the public, commercial, or not-for-profit sectors.

Results

The 32 participants comprised 11 novice, 11 intermediate and 10 experienced surgeons from the following specialties: gynaecology, urology, gastrointestinal surgery, thoracic surgery and paediatric surgery. The participants’ baseline demographics are shown in Table 1. Twenty-eight participants completed all training sessions; four participants did not complete all sessions, and only data from their first attempt were included.

Table 1 Participant baseline demographics

The 10 exercises evaluated for relevance and the 5 final exercises included in the test are shown in Table 2 and Fig. 2, respectively.

Table 2 The da Vinci Skills Simulator Exercises ranked for inclusion in the final test
Fig. 2
figure 2

Screenshots of the five exercises included in the final test. Used with permission of Intuitive Surgical®, California, USA

There was a significant difference in the average total score between the three groups (p < 0.0001) (Table 3). Group-wise comparisons revealed a significant difference between novices and intermediates (p < 0.0001), as well as between novices and experienced surgeons (p = 0.002) for the first attempt. No significant difference was found between intermediates and experienced surgeons (p = 0.36).

Table 3 Primary, secondary and exploratory outcomes

A moderate correlation was identified between the average total score for the first attempt and the number of robotic procedures performed (rs  =  0.58; p = 0.0004).

All groups showed a significant improvement in their average performance score between the 1st and 10th attempts, with effect sizes ranging from 0.60 to 0.63.

The SMEQ score was reduced in all three groups from the 1st to 10th attempts, although the reduction for the intermediate group was minimal and not significant. The results of the intragroup comparisons are shown in Table 4.

Table 4 Intragroup comparisons for average total performance score and Subjective Mental Effort Questionnaire when comparing 1st and 10th attempts

The intraclass correlation coefficient was 0.825 (p < 0.0001).

To explore the consequences of testing, we calculated pass/fail levels using the 1st and 10th attempts of the experienced group. Using the first attempt, the average performance score had to be above 69% to pass; using the 10th attempt, the level was set at 80%. As shown in Table 5, using the first attempt resulted in a pass rate of 18% for novices and 80% for experienced surgeons, whereas using the 10th attempt resulted in a pass rate of 0% for novices and 50% for experienced surgeons.

Table 5 Comparison of pass/fail rates for novices and experienced surgeons for their first attempt for different pass/fail settings

Based on these results, we decided on a pass/fail level of 75%. The number of participants in each group who passed or failed at each of these levels is summarised in Table 5.

Discussion

In this study, we evaluated validity evidence for the dVSS as a simulation-based assessment tool for basic robotic surgical skills and found that it can be used for both training and assessment of these skills.

We identified five content-relevant exercises, which were further evaluated as a combined test.

We chose the number of robotic procedures performed as the external variable and demonstrated a moderate correlation with the test score. We also found a significant difference between the novices and both the intermediate and experienced surgeons.

We examined the internal structure of the test, which revealed a high intraclass correlation coefficient; this indicates that the average performance score is reliable and supports its use as an assessment tool.

Furthermore, the automated registration of data and the standardised nature of the simulator ensured uniform data collection and minimised bias in the response process.

We decided that the pass/fail level would be an average score of 75% for the five exercises. This level can be used for formative assessment and as a training goal for proficiency-based training, and also to determine whether a surgeon has acquired the necessary robotic skills, before moving on to more advanced simulator tasks and supervised operations.

We observed a reduction in cognitive load, as assessed by the SMEQ, for novices and experienced surgeons, although there was no significant change for the intermediate group. This supports the use of simulators as a tool for reducing psychological stress when learning new skills or procedures, as has also been observed in another study [18]. As shown in Table 3, the novices’ SMEQ score after 10 repetitions had been reduced to approximately the starting level of the experienced surgeons. The reduction in cognitive load and the improvement in performance in all three groups demonstrate that the simulator has potential as a training tool. Interestingly, the experienced group also reduced their cognitive load while practising, which could indicate that they too needed further familiarisation with the simulator.

Overall, the proposed simulation-based test on the dVSS can be used both as an assessment tool and as a training tool. It demonstrates high reliability and is able to discriminate between groups with different levels of experience, although only a moderate correlation with robotic experience was observed. We believe that the test is currently best suited for formative assessment of skills before surgeons start performing RALS in the operating room. Before the assessment tool can be recommended for certification of surgeons (that is, “high-stakes” assessment), more studies examining validity evidence are needed. The participants’ simulation-based performance improved with training and a reduction in cognitive load was also observed, suggesting that the test can also serve as a training tool.

We used the contemporary framework for validity, as proposed by Messick, which is a strength of this study [10, 14]. Most earlier validity studies on robotic surgery have used older frameworks [3, 9].

Another strength of the present study is that a relatively high number of participants were surgeons, and even the novices were all surgical residents with a certain amount of surgical experience. Some previous validity studies have used medical students or inexperienced residents as participants [9]. This is important for the validity argument, because the novices in the present study represent the actual target group for robotic simulator training and competency assessment.

Because we used basic skills exercises, the proposed test is more generalisable. This enabled us to use it across multiple surgical specialties and thereby include a higher number of robotic surgeons than some of the previous studies [9].

A further strength of this study is that we set a relevant pass/fail level based on performance over time, instead of relying solely on the initial performance of experienced surgeons. Although the chosen level would fail 60% of the experienced surgeons, they would quickly pass after gaining familiarity with the simulator.

We did not examine the effect of simulation-based training on actual procedures; the number of procedures performed was used as a surrogate for surgical experience, which is a major limitation of this study. However, assessing the effect of simulation-based training on actual procedures was not feasible in the setup of the present study, as it involved multiple institutions. Furthermore, the novices had not yet undergone the formal robotic course and were therefore not permitted to perform real operations. Because multiple specialties were included, a specific procedure could not be used as a post-test, and procedural modules were not available on the simulator.

Also, the comparison of groups with varying levels of experience is a commonly used source of validity evidence, even though its relevance has been questioned [19].

We used the simulator-generated performance score as the primary outcome. We decided to use the average score for all five exercises, thereby letting the test express overall performance. However, using the automated score from the simulator could be seen as a limitation. One of the challenges with the dVSS is that users cannot program the pass/fail settings for each parameter of the different exercises. Therefore, to create a test that was feasible to administer, we decided to use the total score. Choosing different measures of surgeons’ performance could have changed the outcome.

The intermediate and experienced surgeons in the study all had some simulation-based training experience prior to the study, as they had completed the formalised Intuitive Surgical® robotic surgery training course. However, all of them had completed this course long before our study and had subsequently obtained actual operating experience; we therefore believe that any bias from this was minimal.

We defined the pass/fail level as the experts’ average performance minus one standard deviation, although the data were not normally distributed, which is a limitation. However, we did not find it appropriate to use the contrasting-groups method. There is no consensus about which method should be used to define a pass/fail level, and the choice of level is ultimately an institutional decision. It is important to emphasise that simulator training is only one aspect of training prior to performing actual operations [11].

Simulation-based assessment of robotic surgical skills has great potential for use in the ongoing development of robotic surgical curricula and certification. Virtual reality simulators possess the advantage of standardisation and can generate many different exercises. Unlike in laparoscopic simulators, the lack of haptic feedback does not pose a problem, because there is no haptic feedback during robotic surgery.

While no standardised curriculum for robotic surgical training has yet gained widespread use, many have been described and are currently being developed. One example of a curriculum that includes assessment of technical skills is the Fundamentals of Robotic Surgery, which uses real training exercises on the da Vinci® system [12]. Industry-led courses are currently the standard for certification, but these courses are not evidence-based and relevant training goals on the simulator have not been defined. To optimise training and increase patient safety, it is essential that training is proficiency-based and that relevant training goals and exercises are identified. The same process has taken place for laparoscopic simulation-based training over many years, and the experience from these studies can help accelerate implementation for robotic surgery.

An advantage of the dVSS is that it can be used with an actual da Vinci surgeon console. This provides the opportunity to practise using the same equipment as in the operating theatre [20]. However, a limiting factor is that we could only test basic skills exercises and not simulated surgical procedures, as none were available for this simulator.

Our study examined basic technical skills for robotic surgery. As simulators develop further, the incorporation of procedural modules after basic skills training should be emphasised, and the training of non-technical skills should also be prioritised [21]. A full surgical curriculum is needed that combines technical skills training for surgeons with theoretical education, followed by inter-professional scenario training in which the entire team learns how to dock the robot, handle equipment and manage complications and unforeseen events.

Conclusion

The da Vinci Skills Simulator can be used to assess basic robotic skills and can serve as a first step in training before moving on to more advanced tasks. Although the proposed test can be used for assessment of competency and formative feedback, further validity evidence is needed before it can be used for certification and high-stakes assessment.