Performance evaluation plays a pivotal role in medical education. Reliable and valid global rating scales have been developed to evaluate surgical skill during surgical training, for program evaluation, and in surgical education research [1,2,3,4].

The development of endoscopic surgery has facilitated video recording of intraoperative surgical procedures, making it possible to analyze various aspects of technical performance during surgery, thereby enabling provision of detailed guidance and feedback. Skill evaluation scales were developed and investigated to validate educational achievements in specific procedures [5,6,7,8]. In the past decade, several studies reporting the correlation between surgical performance and postoperative outcomes have demonstrated the potential utility of intraoperative skill evaluation as a predictor of the surgical outcome [9,10,11,12,13,14,15,16,17]. However, most of these studies were limited to using global rating scales for skill evaluation of surgical procedures.

We previously developed the Japanese operative-rating scale for laparoscopic distal gastrectomy (JORS-LDG), which is a procedure-specific assessment tool for skill evaluation and training in laparoscopic gastrectomy [7]. However, data on its validity and its correlation with patient outcomes are lacking. This study aimed to investigate the validity of the JORS-LDG as an assessment tool and the correlation between the assessment score and clinical outcomes.

Methods

Ethical considerations

All procedures conformed to the ethical standards of the responsible committee on human experimentation (institutional and national) and to the Helsinki Declaration of 1964 and later versions. The Ethics Committee of Hokkaido University (investigator’s institute; IRB No. 015-0245) and 15 other participating facilities approved the study. Written informed consent was obtained from the participating patients and surgeons, and all personal information was protected.

JORS-LDG: The assessment tool

JORS-LDG was developed through cognitive task analysis and expert consensus using the Delphi method [18]. It measures intraoperative performance during LDG using a scoring sheet (Table 1) [7]. The scale consists of the following tasks: (1) patient and energy device settings, (2) trocar placement, (3) investigation of distant intraperitoneal metastases, (4) lymph node dissection, (5) reconstruction, and (6) final check of the intraperitoneal space. The six tasks comprise 33 items that describe the detailed steps of LDG and are graded using two- or three-grade criteria. Simple steps are scored on a two-point scale: 0 if a given step is not performed and 1 if it is performed. The other items are graded on a three-point scale: 0, unable to perform due to lack of knowledge or skill; 1, needs moderate guidance due to insufficient knowledge or skill; 2, able to perform independently without guidance. If the first assistant took over as the operator for a short time during the procedure to ensure patient safety, the score for that section was marked as zero. The time points of any takeover by the first assistant were recorded to provide the video raters with a reference during evaluation. The total score is the sum of all item scores, with a maximum of 46 points for Billroth-I and Billroth-II reconstructions or 52 for Roux-en-Y reconstruction.
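The scoring rules above can be sketched in a few lines. This is a minimal illustration only: the `Item` structure, the function names, and the two-section example are hypothetical and do not reproduce the actual scoring sheet in Table 1.

```python
from dataclasses import dataclass


@dataclass
class Item:
    max_points: int  # 1 for two-point items (0/1), 2 for three-point items (0/1/2)
    score: int       # grade assigned by the rater


def section_score(items, taken_over=False):
    """Sum the item scores of one section.

    A section in which the first assistant took over as operator scores zero.
    """
    if taken_over:
        return 0
    for item in items:
        if not 0 <= item.score <= item.max_points:
            raise ValueError("score outside the item's scale")
    return sum(item.score for item in items)


def total_score(sections):
    """sections: list of (items, taken_over) pairs; returns the summed score."""
    return sum(section_score(items, taken_over) for items, taken_over in sections)


# Hypothetical sections, not the real JORS-LDG items.
trocar = [Item(1, 1), Item(2, 2)]
dissection = [Item(2, 1), Item(2, 2)]
example_total = total_score([(trocar, False), (dissection, True)])
```

In this hypothetical case the dissection section contributes nothing because it was taken over, so only the trocar section counts toward the total.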

Table 1 JORS-LDG scoring sheet [7]

Data collection in LDG cases

We evaluated the LDG performance of surgeons with various skill levels using the JORS-LDG. To minimize the differences in complexity among cases, patients with obesity (body mass index [BMI] > 30 kg/m²) or a history of upper abdominal laparotomy were excluded from the study. Intraoperative complications were defined as unintended organ injury or hemorrhage (> 100 mL) that might impact safe surgical progress. Short-term postoperative complications were defined as those of Clavien–Dindo grade III or higher occurring within 30 days of the initial LDG.

LDG skill evaluation

Rating in the operating room

The operator’s intraoperative performance during LDG was evaluated by the first assistant in the operating room, who was an expert in the procedure. Operator self-evaluations were also undertaken.

Video rating

The unedited LDG videos of all participating subjects were evaluated by blinded raters. After the high reliability of the JORS-LDG as an assessment tool was confirmed by comparing the scores of three raters (A, B, and C), the JORS-LDG score of one rater, i.e., video rater A, was adopted for comparison with the other factors.

Rater criteria

The investigator established the following criteria for the first assistant in the operating room and the video raters: (A) had performed more than 100 laparoscopic gastrectomy (LG) cases, or (B) was qualified as a master of endoscopic surgery by the Endoscopic Surgical Skill Qualification System (ESSQS) of the Japan Society of Endoscopic Surgery. The ESSQS was established in 2004 [22] and performs the most rigorous examinations for Japanese endoscopic surgeons, with an LG pass rate of 21% in 2019.

Guidance for raters

At the beginning of the study, the investigator guided the raters individually on the evaluation criteria through 20-min telephone conversations. The guidance confirmed the primary evaluation points: (A) the extent of understanding the anatomy and concept of each aspect, and (B) the level of autonomy and safety. No rater guidance was provided to the LDG surgeons about the self-evaluation.

JORS-LDG scoring items and calculation adjustments

Six of the JORS-LDG evaluation items, related to patient positioning, setup of surgical instruments, and first trocar placement, were not recorded in the LDG videos; therefore, they were excluded from the evaluations in this study. The evaluation points for the Billroth-I and Billroth-II methods (perfect, 3 points) were tripled to match the Roux-en-Y method (perfect, 9 points) to unify their total points for analysis. Following these adjustments, the maximum unified JORS-LDG score was 44 points.
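The adjustment can be illustrated with a short sketch. The function and constant names are hypothetical, and the 35-point maximum for the non-reconstruction items is derived here as 44 − 9 rather than stated in the text.

```python
# Maximum reconstruction points per method, as described in the text.
RECON_MAX = {"billroth-i": 3, "billroth-ii": 3, "roux-en-y": 9}
UNIFIED_RECON_MAX = 9   # all methods are rescaled to the Roux-en-Y scale
UNIFIED_TOTAL_MAX = 44  # maximum unified score after excluding six items
OTHER_MAX = UNIFIED_TOTAL_MAX - UNIFIED_RECON_MAX  # 35, derived (assumption)


def unified_total(other_points, recon_points, method):
    """Return the unified total score.

    other_points: points for the video-rated items other than reconstruction.
    recon_points: raw reconstruction points on the method's own scale.
    """
    recon_max = RECON_MAX[method.lower()]
    if not 0 <= recon_points <= recon_max:
        raise ValueError("reconstruction points outside the method's scale")
    if not 0 <= other_points <= OTHER_MAX:
        raise ValueError("non-reconstruction points outside the expected range")
    scale = UNIFIED_RECON_MAX // recon_max  # 3 for Billroth-I/II, 1 for Roux-en-Y
    return other_points + recon_points * scale
```

Under this sketch, a perfect Billroth-I case (35 + 3 × 3) and a perfect Roux-en-Y case (35 + 9) both reach the unified maximum of 44 points.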

Setting of the raters for the investigation of JORS-LDG reliability and validity

Based on the sample size calculation, 17 LDG videos were chosen randomly from a total of 54 videos and evaluated by blinded raters (A, B, and C). After determining the high reliability of the JORS-LDG scores of three raters, rater A blindly evaluated the remaining 37 videos alone. Therefore, the JORS-LDG scores of 54 LDG videos, which were evaluated by rater A, were used for the comparison with the direct observation scores and surgical outcomes.

Categorization into three groups based on the LDG skill level

The participating surgeons were categorized into three groups depending on the total JORS-LDG score for video evaluation. We arranged the total scores of the participants in descending order and divided them evenly into three groups: high, intermediate, and low.
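A minimal sketch of this sort-and-split step follows. The function name and the case identifiers are hypothetical, and the sketch splits purely by position; how tied scores at a group boundary were handled is not specified in the text, so that detail is left out.

```python
import math


def split_into_tertiles(scores):
    """Assign each case to 'high', 'intermediate', or 'low'.

    scores: dict mapping a case identifier to its total JORS-LDG score.
    Cases are sorted in descending order of score and divided by position.
    """
    ranked = sorted(scores, key=lambda case: scores[case], reverse=True)
    n = len(ranked)
    cut_high = math.ceil(n / 3)
    cut_mid = math.ceil(2 * n / 3)
    groups = {}
    for position, case in enumerate(ranked):
        if position < cut_high:
            groups[case] = "high"
        elif position < cut_mid:
            groups[case] = "intermediate"
        else:
            groups[case] = "low"
    return groups


# Hypothetical scores for six cases.
example = split_into_tertiles(
    {"a": 44, "b": 43, "c": 41, "d": 40, "e": 38, "f": 30}
)
```

With tied scores, a purely positional split can place equal scores in different groups, so a real analysis would need an explicit rule for boundaries.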

Statistical analysis

The results are reported as medians and interquartile ranges (IQRs) or means and standard deviations (SDs). Statistical significance was set at P < 0.05. All analyses were performed using IBM SPSS Statistics for Macintosh, Version 26.0 (IBM Corp., Armonk, NY, USA) in consultation with a statistician. Intraclass correlation coefficients (ICCs) and 95% confidence intervals (CIs) were calculated from the JORS-LDG scores of the three independent video raters to estimate the inter-rater reliability. The internal consistency of the JORS-LDG items was estimated using Cronbach’s α. The correlations between the evaluation methods and the association between the total score and each aspect of LDG were investigated using Spearman’s correlation coefficient. The Kruskal–Wallis test was used to compare the patient characteristics and surgical factors among the three LDG skill-level groups.
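As an illustration of the internal-consistency measure named above, the following sketch computes Cronbach’s α from an item-score matrix using only the Python standard library. It is not the SPSS procedure used in the study, and the function name and example data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a matrix of item scores.

    rows: list of cases, each a list of k item scores.
    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two cases")
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical scores: four cases rated on three items.
alpha = cronbach_alpha([[2, 1, 2], [1, 1, 0], [2, 2, 2], [0, 1, 1]])
```

Values closer to 1 indicate that the items vary together across cases, which is the sense in which the text describes internal consistency.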

The sample size for the investigation of inter-rater reliability was calculated using the formula presented by Walter et al. [19]. The minimum acceptable ICC was set to 0.5, and the expected ICC was 0.8. With three raters, an alpha error of 0.05, and a power of 0.7, the required sample size was 15.
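The calculation can be sketched with the closed-form approximation k = 1 + 2(z_α + z_β)² n / ((ln C₀)² (n − 1)), where C₀ = (1 + nθ₀)/(1 + nθ₁) and θ = ρ/(1 − ρ). Whether the authors used exactly this approximation, and whether the alpha error was one- or two-sided, is not stated; the two-sided default below is an assumption, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def walter_sample_size(n_raters, rho0, rho1, alpha=0.05, power=0.7, two_sided=True):
    """Approximate number of subjects for an ICC reliability study.

    rho0: minimum acceptable ICC; rho1: expected ICC.
    two_sided=True is an assumption, not taken from the text.
    """
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2) if two_sided else normal.inv_cdf(1 - alpha)
    z_beta = normal.inv_cdf(power)
    theta0 = rho0 / (1 - rho0)
    theta1 = rho1 / (1 - rho1)
    c0 = (1 + n_raters * theta0) / (1 + n_raters * theta1)
    return 1 + 2 * (z_alpha + z_beta) ** 2 * n_raters / (
        math.log(c0) ** 2 * (n_raters - 1)
    )


# Three raters, minimum acceptable ICC 0.5, expected ICC 0.8, power 0.7.
subjects = math.ceil(walter_sample_size(3, 0.5, 0.8))
```

With a two-sided alpha the approximation gives about 14.3, which rounds up to 15; with a one-sided alpha it gives about 11.2, so the result is sensitive to that choice.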

Since this was the first study to examine the intraoperative performance of various surgeons during LDG using the JORS-LDG, it was difficult to hypothesize the relation between the assessment score and operative complications of LDG. Therefore, post-hoc power analysis was planned for operative outcomes.

Results

We analyzed 54 LDG procedures performed by 40 surgeons at 16 institutions from January 2016 to December 2018. The operator characteristics and experience are shown in Table 2. Most surgeons were board certified with a median experience of 50 open and LG cases, and 40% were surgeons qualified by the ESSQS [20].

Table 2 Characteristics of surgeons

Inter-rater reliability for the JORS-LDG

Three blinded raters (A, B, and C) evaluated 17 videos. The ICCs for the total JORS-LDG scores of the three raters were higher than 0.8, and the internal consistency of the JORS-LDG items was excellent (Cronbach’s α = 0.94; Table 3).
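For readers unfamiliar with the ICC, the following sketch computes one common form, the two-way random-effects, absolute-agreement, single-rater coefficient, from a cases-by-raters matrix. The ICC model used in the study is not stated, so this choice is an assumption, and the ratings shown are illustrative rather than study data.

```python
def icc_two_way_random_single(ratings):
    """ICC for absolute agreement of single raters (two-way random effects).

    ratings: list of n cases, each a list of k rater scores.
    ICC = (MSR - MSE) / (MSR + (k - 1) * MSE + k * (MSC - MSE) / n)
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)               # between-case mean square
    ms_cols = ss_cols / (k - 1)               # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative matrix: six cases scored by four raters.
example_icc = icc_two_way_random_single(
    [[9, 2, 5, 8], [6, 1, 3, 2], [8, 4, 6, 8], [7, 1, 2, 6], [10, 5, 6, 9], [6, 2, 4, 7]]
)
```

In this illustrative matrix the raters differ systematically in leniency, which lowers an absolute-agreement ICC even when they rank the cases similarly.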

Table 3 Inter-rater reliability for JORS-LDG

Correlations between the evaluation scores of direct observations, videos, and self-evaluations

We observed good correlations between the video evaluation scores and both the first assistants’ direct evaluation scores (R = 0.69, P < 0.001) and the self-evaluation scores (R = 0.69, P < 0.001; Fig. 1). The correlation between the self-evaluation and the first assistants’ direct evaluation was excellent (R = 0.85, P < 0.001).
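Spearman’s coefficient, used for these comparisons, is the Pearson correlation of the ranks of the two score sets. The sketch below is a standard-library illustration with average ranks for ties; the function names and inputs are hypothetical, and it does not compute P values.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical paired scores (e.g., video rating vs. direct rating).
rho = spearman([1, 2, 2, 3], [1, 2, 3, 4])
```

Because only ranks are used, the coefficient is unaffected by any monotonic rescaling of either score set.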

Fig. 1

Correlations between the scores for direct observation, video, and self-evaluation. Good correlation was observed between the JORS-LDG score for direct observation and video evaluation (a, b), and excellent correlation existed between the self-evaluations and direct observations (c). JORS-LDG Japanese operative-rating scale for laparoscopic distal gastrectomy

Correlations between the total JORS-LDG score and each aspect score

The total JORS-LDG score had an excellent correlation with the scores for lymph node dissection in the infrapyloric (station no. 6; R = 0.81, P < 0.001) and upper pancreatic edge (station no. 8a + 9; R = 0.75, P < 0.001) regions (Table 4).

Table 4 Correlation between the total JORS-LDG score and the score for each aspect

Laparoscopic distal gastrectomy skill level groups

The participating surgeons were divided into three LDG skill-level groups: high (JORS-LDG score, 42–44; n = 19), intermediate (JORS-LDG score, 39–41; n = 17), and low (JORS-LDG score, ≤ 38; n = 18) according to their total JORS-LDG scores. The first assistant took over the operation for a short time in five of the low-group cases to ensure patient safety.

Patient characteristics

The patients’ characteristics are summarized in Table 5. The characteristics, medical history, and pathological stages were similar among the three LDG skill level groups.

Table 5 Patient characteristics

Correlation between the total JORS-LDG scores and surgical outcomes

Comparison of the surgical factors revealed that the numbers of laparoscopic surgery (P < 0.001) and LG (P < 0.001) cases previously performed by the operators differed significantly among the three LDG skill-level groups (Table 6). The high group performed more Roux-en-Y reconstructions than the other two groups (low, 22.2%; intermediate, 23.5%; high, 57.9%; P = 0.01). The low, intermediate, and high groups differed significantly in terms of the median operating time (311, 266, and 229 min, respectively; P < 0.001), rate of unintended intraoperative organ injury or hemorrhage > 100 mL (27.8, 11.8, and 0%, respectively; P = 0.01), and postoperative complication rate (22.2, 0, and 0%, respectively; P = 0.002). The intraoperative complications comprised hemorrhage > 100 mL in seven cases, splenic injury in two, and duodenal injury in one, some of which occurred in the same case. Postoperative complications included anastomotic leakage in two cases, pancreatic fistula in two, intraperitoneal abscess in one, and surgical site infection in one, some of which also occurred in the same case. The post-hoc power of this study was 0.71 with an alpha error of 0.05.
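The Kruskal–Wallis test named in the statistical analysis can be sketched for three groups as follows. This is a standard-library illustration, not the SPSS routine used in the study: it applies the usual tie correction and the chi-square approximation with 2 degrees of freedom, for which the P value is exp(−H/2). The function name and the operating times in the example are hypothetical.

```python
import math


def kruskal_wallis_three_groups(groups):
    """Return (H, p) for three independent groups.

    H is tie-corrected; p uses the chi-square approximation with 2 degrees
    of freedom, whose upper-tail probability is exp(-H / 2).
    """
    if len(groups) != 3:
        raise ValueError("this sketch handles exactly three groups")
    pooled = sorted(
        (value, index) for index, group in enumerate(groups) for value in group
    )
    n = len(pooled)
    rank_sums = [0.0, 0.0, 0.0]
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average rank shared by tied values
        tied = j - i + 1
        tie_term += tied ** 3 - tied
        for t in range(i, j + 1):
            rank_sums[pooled[t][1]] += mean_rank
        i = j + 1
    h = 12 / (n * (n + 1)) * sum(
        rank_sum ** 2 / len(group) for rank_sum, group in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    h /= 1 - tie_term / (n ** 3 - n)
    return h, math.exp(-h / 2)


# Hypothetical operating times (min) for low, intermediate, and high groups.
h_stat, p_value = kruskal_wallis_three_groups(
    [[310, 320, 330], [260, 270, 280], [220, 230, 240]]
)
```

The chi-square approximation is asymptotic, so for very small groups an exact or permutation-based P value would be more appropriate.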

Table 6 Surgical factors

Discussion

This was the first study to demonstrate the reliability and validity of an LG-specific performance scale (JORS-LDG) and the correlation between intraoperative performance and clinical outcomes for LG.

Performance evaluation, which plays an essential role in surgical education, requires reliability among raters because proper evaluation and feedback on trainees’ performance cannot be provided without a reliable measure [21, 22]. ICCs > 0.8 were demonstrated for the three raters during the video evaluation of LDG performance using the JORS-LDG. This result indicated that the JORS-LDG, which we developed [7] using the Delphi method [18], was easy for different evaluators to understand and apply. Further, the LDG experts of the participating facilities were given 20-min telephone guidance and asked to serve as the first assistant in every case to maintain the quality of surgery and conduct direct evaluation. In Japan, general surgeons usually acquire board certification after completing a 5-year surgical residency program. However, most board-certified surgeons cannot perform LDG independently without advanced training in minimally invasive surgery. Therefore, an expert in minimally invasive surgery usually acts as the first assistant during an LDG case to teach and control the quality of the procedure, similar to the setting of this study. The excellent inter-rater reliability observed in this study suggests that the telephone guidance contributed toward raters’ understanding of the definitive criteria of the JORS-LDG, which includes some subjective criteria. Therefore, a user manual that defines each step and aspect of the evaluation criteria of the JORS-LDG is required to ensure widespread adoption of the JORS-LDG in various educational facilities in the future.

Furthermore, we observed excellent correlation between direct and self-evaluations (R = 0.85, P < 0.001). Even after consideration of bias in the direct and self-evaluations, the results suggested that the operators could perform LDG based on proper communication with the first assistants, mutually confirming what they understood and what they could manage at each step of the operation. This result demonstrated that the JORS-LDG, initially developed as a formative assessment tool, could be suitable for this very purpose.

The high, intermediate, and low skill-level groups stratified according to the JORS-LDG scores exhibited significant differences in the total number of performed laparoscopic surgeries and LGs. This demonstrated construct validity, as the JORS-LDG score correlated with experience in performing laparoscopic surgery and LG. Birkmeyer et al. [11], Fecso et al. [14], and Curtis et al. [15] argued that the surgical outcomes were affected by intraoperative performance and not by the duration of training and history. Taken together, these findings indicate that it is important to provide surgical trainees with abundant case experience in addition to competency-based training that focuses on performance in advanced procedures such as LDG.

It is well known that lymph node dissection in the infrapyloric and upper pancreatic edge regions is a technically difficult aspect of LG [23,24,25], and unsafe execution of these steps could cause intraoperative and postoperative complications such as hemorrhage and pancreatic fistula [26,27,28]. It was interesting to note that the scores of these two steps correlated with the total JORS-LDG score more closely than the scores for the other steps. From an educational perspective, deliberate training focusing on such aspects that would significantly impact the total score could aid in efficient acquisition of LDG skills. Specific analysis of the JORS-LDG scores and errors at each step could be useful for predicting the onset of complications caused by the individual LDG steps and prognosticating the clinical outcomes of the surgery in future research.

Over the past decade, studies have demonstrated a correlation between intraoperative performance and short-term outcomes, particularly the postoperative complication rates [11,12,13,14,15,16,17]. Fecso et al. [14] examined 61 LG procedures for patients with gastric cancer performed by three surgeons at three institutions. The researchers used the Objective-Structured Assessments of Technical Skills [1] and Generic Error-Rating Tool [29] and demonstrated a relationship between intraoperative LG performance and postoperative complications. Our study demonstrated that the intraoperative performances of 40 surgeons in 54 LDGs, scored by the JORS-LDG, were correlated with intraoperative and postoperative complications. Our study is valuable since it demonstrated the utility of the JORS-LDG for evaluating surgeons at various skill levels at 16 institutions.

Curtis et al. [15] first demonstrated the correlation between intraoperative performance and short-term clinical outcomes using a procedure-specific skill measure for laparoscopic total mesorectal excision. As the only procedure-specific rating scale for LG skill, the JORS-LDG proved reliable and valid and correlated with the short-term clinical outcomes. Although few rating scales exist for procedure-specific skills because of the extensive labor required for their development [30], they possess tremendous potential to provide specific educational feedback for surgery and detailed analysis of the causes of surgical complications. Moreover, the robust correlation between the total JORS-LDG scores and surgical outcomes suggested that the scale could serve as a prognostic factor, a criterion for operators in surgical training, and a quality-control measure for surgical intervention in clinical trials.

Few studies have investigated the correlation between intraoperative performance and long-term surgical outcomes [15,16,17], and this correlation has not been proven in surgeries for malignant diseases [15]. Future research should examine the relationship between procedure-specific performance and postoperative complications by type and site and the short- to long-term outcomes related to the complications associated with malignant diseases.

This study has several limitations. First, since participating surgeons were volunteers asked to present LDG videos for this research, video selection for evaluation could have been biased. To minimize bias, the LDG cases chosen for evaluation should have been randomly selected from various cases of participating surgeons. However, this was not possible due to the limited number of LDG cases available for each participating surgeon. Moreover, our data analysis revealed correlations between LDG performance and short-term clinical outcomes. These correlations suggest that the quality of the participating surgeon’s performance in each LDG reflected the clinical outcomes of the case.

Second, the quality of LDG performance by non-expert surgeons may have been influenced by the team’s ability, including the assistants’ support and advice, and the laparoscope intra-abdominal view in this complicated procedure. However, there was no prior confirmation regarding the extent and type of support and advice provided by the assistants during LDG in the operating room. The relationship between team performance, including assistants’ support, and the overall quality of surgery should be evaluated in future research. Moreover, the expert assistants decided that patient safety was of utmost importance and took over the operation for a short time in five low-group cases. Although all operator alternation times were recorded and accurately reflected in the direct and video evaluations, the possible effects of operator alternations on the intraoperative and postoperative outcomes were not investigated in this study. Appropriate and timely operator alternations in challenging situations could prevent unintended intraoperative complications. Investigation into the occurrence of complications due to lack of experience and the appropriate skill level in each step will require detailed analysis using a procedure-specific assessment scale.

Conclusion

This study demonstrated the reliability and validity of the JORS-LDG, exhibiting excellent correlation between its scores and short-term surgical outcomes. This newly developed JORS-LDG could be useful in surgical training and surgical outcome prediction, potentially improving patient outcomes.