While common sense suggests that more skilled surgeons will have better postoperative outcomes, there is surprisingly little literature that tests this hypothesis [1]. Given the constraints of data sharing and patient privacy [2], currently available studies tend to summarize a surgeon’s technical skill as a single score and correlate that score with the surgeon’s overall outcomes [1,3]. We hypothesized that, at the patient level, how well a particular operation is performed will correlate with that patient’s postoperative outcomes.

Performance assessment has traditionally been carried out through qualitative judgment and informal observation in the operating room. Quantitative scoring systems such as the Objective Structured Assessment of Technical Skill (OSATS) and Global Evaluative Assessment of Robotic Skills (GEARS) have recently been developed to provide reproducible assessments of surgical skills [4,5]. These assessment tools use a Likert scale to score components of intraoperative performance, namely depth perception, bimanual dexterity, efficiency, force sensitivity, and robotic control, with each domain scored out of 5 for a total of 25 possible points. The GEARS score is internally validated [6] and is consistent across expert and crowd-sourced review, allowing laypeople to quantify surgical skill and avoiding the costly and time-consuming process of expert review [7,8].
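As a minimal illustration of the scoring arithmetic described above, the sketch below sums five domain scores, each on a 1 to 5 scale, into a total out of 25. The domain names follow the subcomponents reported in the Results; the helper name `gears_total` and the example values are hypothetical and are not study data.

```python
from typing import Mapping

MAX_PER_DOMAIN = 5  # each domain is scored on a 1-5 Likert scale
N_DOMAINS = 5       # five domains, for a maximum total of 25


def gears_total(domain_scores: Mapping[str, float]) -> float:
    """Sum per-domain scores into a total score (hypothetical helper)."""
    if len(domain_scores) != N_DOMAINS:
        raise ValueError(f"expected {N_DOMAINS} domains, got {len(domain_scores)}")
    for name, score in domain_scores.items():
        if not 1 <= score <= MAX_PER_DOMAIN:
            raise ValueError(f"{name} score {score} is outside 1-{MAX_PER_DOMAIN}")
    return sum(domain_scores.values())


# Illustrative values only, not study data.
example = {
    "depth perception": 4,
    "bimanual dexterity": 4,
    "efficiency": 3,
    "force sensitivity": 5,
    "robotic control": 4,
}
print(gears_total(example))  # prints 20
```

Because crowd-sourced scores are typically averaged across raters, the helper accepts fractional domain scores as well as integers.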

Bariatric surgery offers standardized procedures, barring some nuance [9], in a relatively healthy patient population with a distinctive outcome, excess weight loss (EWL). We sought to determine the relationship between postoperative outcomes and intraoperative technical skills for robotic sleeve gastrectomy, as quantified by the GEARS score during crowd-sourced video-based assessment.
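The manuscript does not state how EWL was calculated. A common convention, assumed here, expresses weight lost as a percentage of the excess weight above an ideal weight corresponding to a BMI of 25. The sketch below follows that assumption; the function name, the BMI 25 reference, and the example values are illustrative and are not taken from the study.

```python
IDEAL_BMI = 25.0  # assumption: ideal weight is the weight at BMI 25


def percent_ewl(initial_kg: float, current_kg: float, height_m: float,
                ideal_bmi: float = IDEAL_BMI) -> float:
    """Percent excess weight loss under the ideal-BMI convention."""
    ideal_kg = ideal_bmi * height_m ** 2   # weight at the reference BMI
    excess_kg = initial_kg - ideal_kg      # preoperative excess weight
    if excess_kg <= 0:
        raise ValueError("initial weight must exceed ideal weight")
    return 100.0 * (initial_kg - current_kg) / excess_kg


# Illustrative patient: 1.60 m tall, 104 kg before surgery, 84 kg at follow-up.
# Ideal weight is 64 kg, so 20 kg lost out of 40 kg excess.
print(f"{percent_ewl(104.0, 84.0, 1.6):.1f}")  # prints 50.0
```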

Methods

Patients undergoing robotic sleeve gastrectomy between July 2018 and January 2021 at a single health care system were captured in a prospective database for retrospective analysis. Because GEARS scores cannot be assigned for laparoscopic or open cases, any patient who was converted from a robotic approach was excluded. Patients younger than 18 years old were also excluded. GEARS scores were assigned by crowd-sourced evaluators through a third party; the methodology has been previously described by this group [10,11]. Patient identifying information was captured and encrypted with a one-way hashing algorithm. This information and the operative videos were uploaded onto a secure database, managed by Crowd-Sourced Assessment of Technical Skills (C-SATS, Seattle, WA), for assignment of GEARS scores by crowd-sourced evaluators. Online evaluators did not have access to any identifying information. Evaluators are trained in video-based assessment (VBA) and are frequently evaluated against other layperson evaluators and expert surgeon reviewers to determine the reliability of their scoring. After technical skills were assessed by a minimum of 30 evaluators, the scores and hashed identifying number were returned to the research team via a secure application programming interface for re-linkage to patient identifiers and correlation with patient variables. All data were stored in a secure, HIPAA-compliant database within the surgical department’s quality improvement initiative.
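The privacy-preserving workflow above can be sketched as follows. The manuscript does not name the hashing algorithm or describe how returned scores were matched to patients, so the keyed HMAC-SHA256 hash, the lookup-table matching, the averaging of evaluator ratings, and all names below are assumptions made for illustration only.

```python
import hashlib
import hmac
from statistics import mean

SECRET_KEY = b"replace-with-a-site-secret"  # hypothetical key held by the research team
MIN_EVALUATORS = 30                          # minimum ratings per video, per the Methods


def hash_identifier(patient_id: str, key: bytes = SECRET_KEY) -> str:
    """Keyed one-way hash (HMAC-SHA256) of a patient identifier."""
    return hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def relink_scores(patient_ids, returned_ratings):
    """Match returned ratings (keyed by hash) back to patient identifiers.

    A one-way hash cannot be reversed, so the hash is recomputed for each
    known identifier and the returned records are matched on that value.
    """
    lookup = {hash_identifier(pid): pid for pid in patient_ids}
    linked = {}
    for hashed_id, ratings in returned_ratings.items():
        if hashed_id not in lookup:
            continue  # unknown hash: cannot be linked to a patient
        if len(ratings) < MIN_EVALUATORS:
            continue  # too few evaluators to report a score
        linked[lookup[hashed_id]] = mean(ratings)  # assumption: mean across evaluators
    return linked


# Illustrative identifiers and ratings only, not study data.
ids = ["MRN-001", "MRN-002"]
returned = {
    hash_identifier("MRN-001"): [20.0] * 30,
    hash_identifier("MRN-002"): [19.0] * 12,  # fewer than 30 ratings, skipped
}
print(relink_scores(ids, returned))
```

Only the hashed value leaves the research team in this sketch, so evaluators never see an identifier, while the team can still join scores to clinical variables.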

Serious morbidity included wound dehiscence, stroke or transient ischemic attack, cardiac arrest, myocardial infarction, pulmonary embolism, deep venous thrombosis, acute kidney injury, sepsis or septic shock, surgical site infection, pneumonia, unplanned intubation, urinary tract infection, ileus, anastomotic or staple line leak, and postoperative hernia. Only complications occurring within 30 days of surgery were recorded.

Bivariate Pearson’s correlation was used to compare continuous variables, one-way ANOVA to compare categorical variables with a continuous variable, and the chi-square test to compare two categorical variables. Variables that were significant in the univariable screen (age, sex, race, BMI, CCI, and ASA) were included in a multivariable linear regression model. Patients lost to follow-up were censored at their last known visit date. Separate models were created for EWL at 6 and 12 months, and each GEARS subcomponent was evaluated in a separate model. No multivariable regression was performed for serious morbidity as there were no significant variables in the univariable screen. The assumptions of linear regression were tested and met: there was a linear relationship between the outcome variable (excess weight loss) and the independent variables, the independent variables were not highly correlated with each other, and the residuals were normally distributed. All analyses were performed with SPSS 26.0 (IBM, Armonk, NY) statistical software. A two-tailed p-value < 0.05 was considered significant. This study was approved by the Institutional Review Board at Northwell Health. Written consent was not required.
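The study analyses were run in SPSS. For readers who want to see the two main tools in code, the sketch below implements Pearson correlation and multivariable linear regression by ordinary least squares using only the Python standard library. The function names and the synthetic data are hypothetical, p-values are not computed, and this is not the authors' analysis code.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(predictors, outcome):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0] + list(p) for p in predictors]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, outcome)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic data only: each row is (skill score, age); the outcome is generated
# from known coefficients so the fit can be checked against them.
predictors = [(19, 30), (20, 45), (21, 35), (20, 50), (22, 40), (19, 55)]
outcome = [10 + 3 * g - 0.5 * a for g, a in predictors]
print([round(c, 3) for c in ols(predictors, outcome)])
```

Fitting the outcome on several predictors at once is what allows the coefficient for one predictor, such as a skill score, to be interpreted after adjusting for the others.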

Results

A total of 162 patients who underwent robotic sleeve gastrectomy performed by a total of 7 surgeons were captured (Table 1). No patients met exclusion criteria. The majority of patients were young and healthy, with a mean age of 40.8 ± 12.6 years, a mean Charlson comorbidity index (CCI) of 0.69 ± 1.2, and a mean American Society of Anesthesiologists (ASA) score of 2.5 ± 0.5. Most patients were non-Hispanic (73.4%) and women (80.2%), and racial identities were split among white (32.6%), Black (25.3%), and other (32.7%). From a mean starting BMI of 42.4 ± 5.1, the mean EWL was 72 ± 11.7% at 6 months and 74.7 ± 14.5% at 12 months. EWL was available for only 88 patients at 6 months and 55 patients at 12 months. The mean GEARS score was 20.2 with a standard deviation of 0.72. Mean subcomponent scores were bimanual dexterity 4.1 ± 0.2; depth perception 4.0 ± 0.2; efficiency 3.8 ± 0.2; force sensitivity 4.2 ± 0.2; and robotic control 4.2 ± 0.2. Only 9 patients (5.6%) experienced a serious morbidity: 1 urinary tract infection, 1 pneumonia, 1 acute kidney injury, 2 deep venous thromboses, 2 surgical site infections (1 requiring return to the operating room for washout), and 2 port site hernias.

Table 1 Patient demographics and correlation with excess weight loss

To evaluate the potential for confounding, patient demographics were evaluated on a univariate screen (Table 1). Age, sex, race, BMI, and ASA correlated with EWL at 6 months, and age, race, BMI, and CCI correlated with EWL at 12 months. The correlation between GEARS score and demographics and outcomes was similarly evaluated (Table 2). The overall GEARS score was correlated with age (p = 0.031) and estimated blood loss (p = 0.017); however, no correlation was identified with other patient demographics, including BMI (p = 0.496), or with outcomes.

Table 2 Patient demographics and correlation with GEARS score

Neither the total GEARS score nor its subcomponents were correlated with EWL at 6 or 12 months on unadjusted analysis (Table 3). However, after adjusting for age, sex, race, BMI, CCI, and ASA, the total GEARS score and each of its subcomponents were positively correlated with EWL at 6 and 12 months (p < 0.001). There was insufficient evidence to conclude that any patient demographic or GEARS score was correlated with serious morbidity (Table 4).

Table 3 GEARS score and correlation with excess weight loss
Table 4 Correlation of patient demographics and GEARS score with serious morbidity

Discussion

This study evaluated the relationship between intraoperative technical skill and postoperative outcomes for robotic sleeve gastrectomy. We found that skill, as determined by blinded video-based review and quantified with the GEARS score, correlates with weight loss. While the low number of serious complications means that a much larger study would be required to determine the relationship between skill and serious complications, this work is among the few studies demonstrating that more technically skilled surgeons may have better outcomes. These conclusions have meaningful consequences for surgical credentialing and residency education [2,12].

This is the first study to correlate technical skills of the surgeon with patient outcomes on a patient level. Previous studies asked surgeons to submit a small number of representative intraoperative videos, summarized each surgeon’s skill with one number, and then correlated that skill evaluation with the surgeon’s overall outcomes [3,9,13,14]. In comparison, we correlated individual patient outcomes with the skill demonstrated in their specific surgery. We were able to accomplish this with our encrypted program interface, which allows us to share hashed patient identifiers with a separate team for skill evaluation while maintaining patient privacy [2]. We demonstrated that even for experienced, fellowship-trained bariatric surgeons, the skill with which a particular surgery is performed is associated with that specific patient’s outcome.

The seminal paper by Birkmeyer et al. first established the relationship between postoperative outcomes and technical skill as measured by direct assessment with blinded video review [3]. Since then, numerous studies have sought to replicate these results or expand them into other operations, beyond the original gastric bypass [1]. However, there are few studies that use direct objective measurements of skill rather than a proxy, such as operative time or surgeon experience [1]. Importantly, these proxies have not been validated as a measure of technical skill. Operative time, surgeon experience, length of stay, and complication rates have complex interdependent relationships [1,10,15]. Furthermore, this group asserts that operative time and length of stay are outcomes of skill rather than an indirect measurement of skill itself.

In bariatric surgery, Birkmeyer et al. evaluated 20 surgeons performing laparoscopic gastric bypass and found that surgeons in the top quartile of skill had lower complication rates and mortality [3]. Similarly, Varban et al. evaluated 25 surgeons performing laparoscopic sleeve gastrectomy and found that more skilled surgeons had lower rates of specific surgical complications but not a lower rate of overall 30-day complications [9]. In robotic surgery, postoperative outcomes have not been correlated with objective technical skill outside of urologic procedures [16,17].

While representing early work in the field of video-based assessment in robotic surgery, this work has several important limitations. All surgeons included are fellowship-trained bariatric surgeons operating within a bariatric center of excellence. This high level of expertise allows us to conclude that even skilled surgeons have small case-by-case variations that impact patient outcomes. However, it limits the number of patients experiencing complications, precluding our ability to create a regression model that is not over-fit to the data. Our conclusions are also limited to this population of highly experienced surgeons at a bariatric center of excellence; however, robotic bariatric surgery is typically performed in this setting. This highly trained cohort helps explain the small standard deviation of GEARS scores. Additionally, scores averaged across such a large number of evaluators may regress toward the mean, such that small differences in the GEARS score correspond to a large difference in operative skill. While there is trainee involvement in these cases, our VBA is currently limited in that it does not account for which console, and therefore which surgeon, is performing each case. When evaluating regression coefficients, the GEARS score and its subcomponents were all positively correlated with EWL; however, the effect sizes were relatively small. Further studies with a larger population and more surgeons across a wider range of skill may show a larger effect size.

Our study may also be limited by selection bias. At this institution, we routinely send all robotic bariatric surgical videos for objective scoring. Some videos may not have been recorded in their entirety or linked with patient identifiers, whether due to surgeon preference or to technical or human error. We lost 46% of patients to follow-up at 6 months (n = 88 remaining) and 66% at 12 months (n = 55 remaining). Our rate of loss to follow-up is similar to that of other studies of sleeve gastrectomy [9]; additionally, our EWL is comparable to that generally reported for sleeve gastrectomy [18]. Finally, while the GEARS score is a validated measure of surgical skill, there is no standard measurement of robotic technical skill [6,7]. The GEARS score was designed to describe the fundamental elements of robotic surgery regardless of the specific procedure [5]. There are numerous other scoring systems that describe nuances of robotic skill, such as those specific to microsurgery or control of the console [19,20]. By utilizing the GEARS score, this study can be repeated for any robotic procedure and the results compared across specialties.

This study is also limited in its ability to answer the following question: what is a more highly skilled surgeon doing differently from a less skilled surgeon that may result in better weight loss for their patients? To answer this question on a larger scale, this group is looking at kinematic data to break down specific movements. For example, does the angle the stapler takes at the angle of His differ consistently for patients with better EWL? Does more gentle tissue handling result in less swelling of the sleeve and better postoperative outcomes? VBA has been combined with kinematic data derived from the da Vinci system to evaluate robotic performance in other specialties, and our group will next look into applying these data to bariatric surgery [21].

Conclusion

In this retrospective review of patients undergoing robotic sleeve gastrectomy, higher technical skill as assessed by crowd-sourced assignment of the GEARS score did not correlate with serious morbidity but did correlate with weight loss; patients whose cases were assigned a higher GEARS score had more weight loss. Objective, video-based assessment of technical skill may predict postoperative weight loss in robotic sleeve gastrectomy at the patient level.