Evidence supports that 40–60% of adverse events in surgical patients can be linked to errors in the operating room [1,2,3]. Yet efforts to improve surgical outcomes have largely focused on perioperative care with very little emphasis on measuring and improving operative performance [4]. Difficulty in accessing information on ‘what happens in the operating room’ and lack of appropriate tools for assessment of intraoperative performance have hampered this area of research [4, 5]. However, the expansion of image-guided surgery including laparoscopic and robotic operations facilitates capture, storage, and sharing of recorded procedures. Consequently, video-based assessment (VBA) may provide a valuable opportunity to measure intraoperative performance while minimizing observer bias related to unblinded in-theater evaluations [6, 7].

There is significant interest in the use of VBA of intraoperative performance for formative assessment in education and coaching [4, 8, 9]. In addition, there is interest in the use of VBA for summative ‘high stakes’ decisions such as certification after completion of surgical training [5] or after learning a new procedure [10, 11]. However, the use of VBA to inform competency decisions for trainees requires robust supporting evidence. A landmark paper from Birkmeyer et al. published in 2013 reported a significant association between surgeon technical performance and outcomes after Roux-en-Y gastric bypass, including complications, reoperations, and readmissions [12]. A systematic review, however, identified important limitations in the literature published in this field related to lack of standardized assessment tools and reliance on indirect observations of technical performance such as postoperative imaging or pathological specimen quality [13]. This has become an active area of research and several studies published subsequent to that review contributed new evidence that may further inform the integration of VBA into credentialing, certification, coaching, and quality improvement processes for practicing surgeons. Therefore, the objective of this study was to systematically review and summarize the existing literature on the association between intraoperative technical performance measured using VBAs and patient outcomes.

Materials and methods

This review was conducted and reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [14]. The review protocol was registered a priori at Open Science Framework (osf.io/c29yb).

Eligibility criteria

We included studies that (1) measured intraoperative technical performance of practicing surgeons from recorded cases; (2) described the association of intraoperative technical performance with the outcomes of patients undergoing the same type of procedure; and (3) used a performance assessment tool with published validity evidence supporting its intended use and interpretation. Studies from all surgical specialities published after 1990 (introduction of image-guided procedures) [15] were included. We excluded (1) studies evaluating surgical trainees; (2) studies that relied solely on surrogate measures of technical performance such as postoperative imaging or pathological specimens; (3) studies with only a qualitative assessment of intraoperative technical performance (i.e., lacking a standardized assessment tool); (4) case reports, comments, editorials, and non-human studies; and (5) abstracts that could not be traced to full-text articles. There were no language restrictions.

Literature search

The following databases were searched for relevant studies: Medline (via OvidSP and PubMed [for articles ahead of print]), Embase (OvidSP), The Cochrane Database (via Cochrane Library, including Cochrane Central Register of Controlled Trials, Database of Abstracts of Reviews of Effects, and National Health Service Economic Evaluation Database), and Web of Science (Thomson Reuters). The search strategies (eMethods 1) were developed by an experienced medical librarian according to best practice recommendations [16]. The reference lists of the selected studies were screened for further studies that met the inclusion criteria [17]. Searches were carried out in August 2020 and updated in March 2021 before manuscript submission. No language restrictions were applied.

Study selection and data extraction

Two reviewers (SB and AK) independently assessed titles, abstracts, and selected full texts of the articles obtained through the literature review. Any discrepancies between the included and excluded articles were resolved by consensus between the reviewers or by consulting a third independent reviewer (MH).

Quality assessment of individual studies

The methodological quality of each study included in the final selection was independently judged by two reviewers (SB and AK) using the Newcastle–Ottawa Scale (NOS) [18]. Any discrepancies were resolved by consensus between the reviewers or by consulting a third independent reviewer (LF). NOS is a validated system developed for assessing the quality of non-randomized studies based on three domains: selection of the study groups (maximum of 4 stars), comparability of the groups (maximum of 2 stars), and ascertainment of the exposure or outcome of interest (maximum of 3 stars), with a maximum total score of 9 stars [19]. Although the NOS tool defines no cutoff values differentiating high-quality from low-quality study methods, studies with fewer than 6 stars, with 1 star for the selection of participants or outcome ascertainment, or with zero stars for any domain were deemed to have high risk of bias [20,21,22,23]. We followed a priori criteria for risk of bias analysis based on the NOS guidelines, as outlined in Supplemental Digital Content 1 [24, 25].
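
As an illustration only, the risk-of-bias rule described above can be written as a short decision function. This is a minimal sketch and not the authors' analysis code: the names `NosScore` and `is_high_risk_of_bias` are hypothetical, and reading "1 star" as exactly one star in the selection or outcome domain is an assumption made here.

```python
from dataclasses import dataclass

# Maximum stars per NOS domain, as stated in the text.
NOS_MAX = {"selection": 4, "comparability": 2, "outcome": 3}


@dataclass(frozen=True)
class NosScore:
    """Hypothetical container for the three NOS domain scores of one study."""
    selection: int      # 0-4 stars
    comparability: int  # 0-2 stars
    outcome: int        # 0-3 stars (exposure or outcome ascertainment)

    def __post_init__(self):
        # Reject scores outside the range allowed for each domain.
        for domain, max_stars in NOS_MAX.items():
            stars = getattr(self, domain)
            if not 0 <= stars <= max_stars:
                raise ValueError(
                    f"{domain} must be between 0 and {max_stars}, got {stars}"
                )

    @property
    def total(self) -> int:
        return self.selection + self.comparability + self.outcome


def is_high_risk_of_bias(score: NosScore) -> bool:
    """Return True if any of the three conditions from the text holds."""
    too_few_stars = score.total < 6
    # Assumption: "1 star" is read as exactly one star in that domain.
    single_star = score.selection == 1 or score.outcome == 1
    empty_domain = 0 in (score.selection, score.comparability, score.outcome)
    return too_few_stars or single_star or empty_domain
```

Under this reading, a study scoring 4, 2, and 3 stars would be classed as low risk, whereas a study scoring 1, 2, and 3 stars would be classed as high risk despite a total of 6 stars.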

Data synthesis

This systematic review was reported using a narrative synthesis approach [26]. Meta-analysis was precluded as the identified studies were heterogeneous with respect to population, exposure, and outcome measures.

Results

A total of 3984 unique articles were identified, and 31 articles were chosen for full-text review after screening of titles and abstracts (Fig. 1). Three additional studies were identified through other sources (cross referencing [n = 2] [27, 28] or expert suggestions of recent papers which had not yet been indexed in Medline [n = 1] [29]). Twenty-three articles were excluded (articles and reasons for exclusion are listed in Supplemental Digital Content 1) and 11 articles met eligibility criteria [12, 29,30,31,32,33,34,35,36,37,38].

Fig. 1 PRISMA flow diagram (PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses) [14]

Characteristics of the included studies are summarized in Table 1. All were observational studies (10 cohort studies and 1 case–control study). The other ten studies were published after the landmark paper by Birkmeyer et al. [12]. Eight of 11 studies were multicenter collaborations. Two studies involved urologic procedures [34, 36], with the remainder involving general surgery procedures (foregut/bariatric [n = 4], colorectal [n = 4], and hepatobiliary surgery [n = 1]) [12, 29,30,31,32,33, 35, 37, 38]. Eight different procedures were evaluated in these studies. All studies involved minimally invasive surgical procedures (2 studies in robotic surgery and 9 in laparoscopic surgery). The number of surgeons evaluated in each study ranged from 1 to 34. The rate of participation of invited surgeons ranged from 32 to 100% when specified. The number of patients assessed for surgical outcomes in the identified studies ranged from 47 to 10,242.

Table 1 Overview of the included studies

Table 2 summarizes the characteristics of the intraoperative technical performance assessment tools used and the features of the study designs that may influence their uses and interpretations [13]. A wide variety of generic and procedure-specific assessment tools were used, with 55% of the studies (n = 6) using the generic modified Objective Structured Assessment of Technical Skills (mOSATS) tool. The Generic Error Rating Tool (GERT) was the only error rating tool identified and was used in two studies. The remaining assessment tools used in these studies were procedure-specific, including the American Society of Colon and Rectal Surgeons (ASCRS) Video Assessment Tool, which was used in two of the three studies evaluating laparoscopic colectomy. Six studies assessed only critical parts of a given procedure that were defined a priori, such as the anastomosis or critical dissections. In three of these six studies, the intraoperative recording was edited by the research team to include only the a priori identified critical section of the operation. Five studies involved VBA of the entire procedure. In ten studies, the assessors were blinded to the patient and surgeon identifiers, and in one study, this was not specifically reported. Eight studies characterized the assessors as “expert,” while three studies characterized them as “peer assessors.” Only six (55%) studies described any attempt to train or calibrate the raters in using the assessment rubrics. Videos used for intraoperative technical performance assessment were submitted in one of two ways. In five studies, participating surgeons chose and submitted one video as representative of their overall performance. In this approach, the surgeon’s technical performance was estimated from that single video, and patient outcomes for each surgeon were determined from an existing registry. In the remaining six studies, videos were available for each case, and the association between intraoperative technical performance and outcomes was analyzed for each patient.

Table 2 Overview of intraoperative skills assessment

Quality assessment of each study was performed using the NOS tool [19]. A total of 6 studies were deemed to have low risk of bias and 5 studies to have high risk of bias (Table 3). Common reasons for penalizing the quality of the studies were bias in the selection of participants (n = 8) [12, 29,30,31, 33, 35, 37, 38] and bias in the measurement of exposure together with non-disclosure of the frequency and handling of missing data (n = 10) [12, 29,30,31,32,33, 35,36,37,38]. A complete description of the risk of bias assessment for each study is reported in Supplemental Digital Content 1.

Table 3 Study quality assessment for primary outcomes

The relationship between intraoperative technical performance and postoperative outcomes for each study is summarized in Table 4. The outcomes assessed were categorized as short-term (≤ 30 days) or long-term (> 30 days). Short-term outcomes (including 30-day complications, reoperations, readmissions, emergency department [ED] visits, and survival) were reported in 8 studies. Better intraoperative performance was associated with fewer postoperative complications in 6 of 7 studies, covering laparoscopic right and left hemicolectomy, laparoscopic total mesorectal excision, laparoscopic gastrectomy, laparoscopic gastric bypass, and robotic Whipple procedures. Of these 7 studies, 3 had low risk of bias [12, 30, 31], 2 of which demonstrated an association between better intraoperative performance and fewer postoperative complications (rate reduction between 5.1% and 9.2%) [12, 31]. Better intraoperative performance was associated with fewer reoperations in 3 of 4 studies (rate reduction between 0.7% and 2.5%), including all 3 studies with low risk of bias [12, 30,31,32]. Better intraoperative performance was associated with fewer readmissions in only 1 of 4 studies [12]; only one of these studies (which showed no association) had a high risk of bias [31]. All studies looking at ED visits and mortality were of low risk of bias [12, 30, 31]. One of 2 studies showed an association between better intraoperative performance and fewer ED visits and lower mortality [12].

Table 4 Association of intraoperative performance with postoperative outcomes

The association between intraoperative performance and long-term outcomes was reported in 6 studies. An association was supported for weight loss (1 of 2 studies, both with low risk of bias) [30, 35] and patient satisfaction (1 of 1 study with high risk of bias) [35], but not for cancer recurrence (0 of 1 study with high risk of bias) [32]. Cancer survival was investigated in 2 studies: an association between better intraoperative technical performance and longer overall cancer survival was supported by one study with low risk of bias [29], while a second study with high risk of bias reported a large but statistically non-significant increase in overall survival [32]. In minimally invasive prostatectomy, an association between intraoperative technical performance and improved 3-month postoperative urinary continence rate was supported in 2 studies (1 with low risk of bias [34] and 1 with high risk of bias [36]) (Table 4). Four studies reported the association between intraoperative technical performance and pathological outcomes [29, 32, 36, 38]. Of the 3 studies investigating the association between intraoperative technical performance and lymph node yield, 2 showed no association [29, 32] and 1 showed a significant association (13 vs. 18 LNs in colon cancer) [38]. One study showed a significant association between better intraoperative technical performance and a higher rate of pathologic success in rectal cancer surgery (defined as mesorectal fascial plane, circumferential margin ≥ 1 mm, and distal margin ≥ 1 mm) [32], and another reported an association with the distal margin in left colon cancer surgery (median 3 vs. 4 cm) [38].

Discussion

This systematic review summarizes the existing literature investigating the association between intraoperative technical performance, as evaluated using VBA measures, and patient outcomes. Despite study heterogeneity, the results support the association between better intraoperative technical performance and improved short-term outcomes including 30-day complications and reoperations in laparoscopic colectomy, laparoscopic total mesorectal excision, laparoscopic gastrectomy, laparoscopic gastric bypass, and robotic Whipple procedures. There was more limited evidence supporting the relationship between technical performance and short-term resource utilization (readmissions and ED visits), as well as longer-term outcomes such as weight loss after bariatric surgery and survival after cancer resections.

Our study builds on the previous systematic review assessing the association between technical performance and patient outcome, which included studies conducted up to 2014. Of the 24 studies in that earlier review, only one used an intraoperative assessment tool with validity evidence for VBA of practicing surgeons, while the remaining studies relied on indirect evaluations of intraoperative performance such as postoperative imaging or pathological specimens [12, 13]. Our systematic review was further strengthened by compliance with PRISMA methodological standards and the use of cross referencing to maximize our literature search [14, 17].

Given that the majority of the VBA tools used in the studies, such as mOSATS, focus mostly on elements of psychomotor proficiency, such as dexterity and tissue handling, it is not surprising that associations were found between intraoperative performance and short-term safety outcomes, while associations with long-term efficacy outcomes were less clear. While intraoperative technical performance seems important in preventing early complications like bleeding and infection, most assessment tools used in the included studies do not fully capture the complex cognitive skills related to surgical expertise that may have a larger role to play in determining the long-term effectiveness of the operation [5, 39]. Therefore, the tool used for VBA should be selected based on the outcome of interest. An additional source of variability is that operations are not standardized between surgeons and these variations (e.g., oversewing versus not oversewing of the staple-line or length of the roux-limb in bariatric surgery) may also be associated with postoperative outcomes [40, 41]. However, technical variation was not considered in any of the identified studies in this review, which may also contribute to the heterogeneity observed in the effect measures [30]. One of the long-term outcomes that was associated with superior intraoperative technical performance was improved cancer survival in 2 studies, despite the mixed findings in the association between intraoperative performance and pathology outcomes. This may be related to the detrimental impact of major early postoperative complications on oncological outcomes through increased systemic spread or delayed adjuvant treatment [42,43,44].

The association between surgeon technical performance and patient outcome has several important implications. It suggests a potential avenue for quality improvement and continuing professional development through feedback, benchmarking, and coaching [8, 45]. Similarly, there is an interest in using VBA to measure and improve surgical techniques from leading groups such as the American Board of Surgery [46]. It is important to highlight that association does not imply causation; while there is evidence for the benefits of video analysis and feedback in surgical trainees [47], additional studies are required to support the effectiveness of this approach for practicing surgeons. Additionally, for VBA to be used to inform higher-stakes decisions (e.g., certification and credentialing), the measurement tools need to be supported by rigorous studies supporting their validity for that use and be representative of all domains the tool seeks to measure, including operative safety and effectiveness [5, 48]. There is limited evidence supporting the use of the generic assessment tools identified in this review for summative video-based evaluation in practicing surgeons [49, 50]. However, other instruments identified in our study were in fact developed specifically to assess performance of a specific procedure by practicing surgeons, using a recorded case, with evidence provided supporting their uses, interpretations, and psychometric properties [10, 32]. This work is critical as automated metrics of performance using computer vision and machine learning are rapidly being developed [51]. Finally, the ability to accurately document and measure variations in surgical technique using VBA has implications for surgical research, with many randomized trials now requiring submission and analysis of procedure video to ensure quality and standardization [52].

We identified significant heterogeneity in study design related to video editing, the type of assessment tool, rater qualification, and rater training. These characteristics were selected based on the published recommendations for minimizing measurement error when using VBAs [13, 53]. Although our review only included studies using assessment tools supported by validity evidence, evaluation of the strength of the validity evidence for the intended uses and interpretations falls outside the scope of this review. As discussed earlier, the development and use of assessment tools with robust psychometric properties should be standard practice for video-based evaluations [48].

While most studies followed the recommendation to use blinded evaluators, rater qualification varied and was either described as “peer” or “expert” evaluation. The definition of expert raters varied between studies but was commonly described as an experienced surgeon in the field (i.e., some may argue this is a “peer”) or as having familiarity in the use or development of intraoperative assessment tools (i.e., may not be clinical expertise). Use of multiple peer raters (as opposed to defined experts in the field) has been justified in the literature based on the theory that the collective intelligence of a group may solve problems more efficiently than individuals [7]. The literature supporting peer VBA assessment in comparison to expert assessment (the default gold standard) has been mixed [54, 55] with supporting evidence for their use in evaluating simple tasks such as knot tying [56] and in the presence of added information such as intraoperative audio [55]. Since the use of peer assessors would significantly increase the feasibility of larger scale assessment programs, the qualifications of the raters should be better defined, and evidence to support optimizing rater training should be prioritized in future studies.

There was also a wide range of definitions for rater training, ranging from passive training based on descriptive manuals [32] to full training programs with continuous calibration of the assessors [33]. Only one of 5 studies that used peer assessors described any attempts at rater training. In studies without rater training, lack of familiarity with the nuances of assessment tools can result in non-differential measurement error, resulting in underestimation of the effect size and biasing the analysis toward the null. For future studies, rater training is recommended to enhance reliability and reduce non-differential measurement bias, but more work is needed to determine the optimal mode of rater training [7, 13, 57].
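
The bias toward the null described above can be illustrated with a small simulation. This is a sketch on entirely synthetic data, not a reanalysis of any included study: the effect size, the noise levels, the continuous outcome score, and the function names `ols_slope` and `simulate` are all assumptions chosen for illustration.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate(n_cases=5000, rating_noise_sd=1.0, seed=0):
    """Return (slope using true skill, slope using noisy rated skill)."""
    rng = random.Random(seed)
    # Hypothetical effect: each unit of skill lowers a complication score by 2.
    true_effect = -2.0
    true_skill = [rng.gauss(0.0, 1.0) for _ in range(n_cases)]
    outcome = [true_effect * s + rng.gauss(0.0, 1.0) for s in true_skill]
    # Non-differential error: rating noise is independent of the outcome.
    rated_skill = [s + rng.gauss(0.0, rating_noise_sd) for s in true_skill]
    return ols_slope(true_skill, outcome), ols_slope(rated_skill, outcome)


if __name__ == "__main__":
    slope_true, slope_rated = simulate()
    print(f"slope with true skill:  {slope_true:.2f}")
    print(f"slope with rated skill: {slope_rated:.2f}")
```

In this setup the rating noise has the same variance as the true skill, so the slope estimated from the noisy ratings is expected to shrink to roughly half of the true slope. Reducing `rating_noise_sd`, which is what rater training aims to achieve, moves the estimate back toward the true effect.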

Inconsistency in the association between intraoperative technical performance and outcomes between studies may be related to other issues in study design. In almost half of the studies, a single video chosen by the participating surgeon was used, compared to the alternative of having one video per patient. The former method is susceptible to selection bias, and evaluating a surgeon based on a single video does not take into account the surgeon’s learning curve or the evolution of their technique throughout their years of practice. In addition, surgeons would likely select their “best” videos, which would bias the results toward the null. The number of assessments required for a reliable score using VBA has been investigated in trainees; however, this information is lacking for the assessment of practicing surgeons [58].

This review has several limitations. Study heterogeneity precluded meta-analysis. In addition to the risk of measurement bias discussed above, eight of the eleven identified studies were at high risk of selection bias. The most common reason was the degree of participation from surgeons, consistently reported below 35% of invited participants. Another area of potential bias was the inclusion of patients based on the availability of intraoperative videos, rather than a consecutive cohort of patients in which video and outcome data were both available for every patient. Twelve abstracts were excluded because they could not be traced to a full-text article. Our systematic review also did not identify any studies of open surgical procedures, likely because open procedures are more complex to record.

This review contributes evidence regarding the relationship between technical performance as measured through video-based assessment and surgical outcomes, supporting the association between superior intraoperative technical performance and lower risk of perioperative complications and reoperations. Long-term outcomes were less commonly investigated, with mixed results. Future research should investigate the impact of technical performance and technical variation on postoperative outcomes in a more diverse range of procedures and investigate the effectiveness of interventions to improve technical skill on patient outcomes.