Achieving technical proficiency in laparoscopic surgery is critical as it remains the most frequently employed surgical technique by case volume [1]. Recent studies in bariatric [2] and colorectal [3] surgery have shown that greater technical skill is associated with better outcomes and fewer complications. Because laparoscopic technical skills are difficult to acquire directly in the operating room, simulation-based training has emerged as a viable alternative [4,5,6,7,8,9]. Simulation platforms teach the technical and cognitive competencies necessary to master laparoscopic surgery in a safe, patient-free setting, without the cognitive load experienced in the operating room. An effective simulation-based training program is contingent upon a robust curriculum with clearly defined and quantifiable performance metrics. Such metrics can be summative, to establish a high-stakes pass/fail determination, or formative, to provide trainees with targeted feedback for improvement. The most widely used summative tool for the evaluation of surgical performance is the Objective Structured Assessment of Technical Skills (OSATS) [10], a validated global tool for assessing operative performance across 6 domains, typically through video-based review. Formative assessment usually requires the creation and validation of task-specific metrics tailored to each procedure or task.

Laparoscopic hiatal hernia repair (LHHR) is a complex procedure requiring advanced surgical training [11]. Attaining proficiency in this procedure is crucial given the high recurrence rate for such hernias, up to 50%, especially for paraesophageal hernias [12]. LHHR remains a difficult procedure to master, with reported learning curves ranging from 50 to 200 cases [13,14,15], emphasizing the need to optimize training to acquire the necessary skills. Traditional anatomic models such as cadavers and live animals have been useful for simulating many procedures but may fall short in replicating important aspects of human hiatal hernia repair (HHR) and can pose ethical, cost, logistical, and curricular challenges. Advances in technology have made virtual reality (VR) simulators a potentially ideal solution, offering the detailed anatomic representations characteristic of HHR and facilitating focused, deliberate practice [16]. Using standardized simulation scenarios, VR trainers also enable automated objective assessment and targeted feedback to improve performance without the need for an expert surgeon reviewer. Importantly, skills acquired in VR simulators have been shown to improve operating room performance [17, 18]. We are developing a VR simulator for LHHR training as part of an NIH-funded project. The purpose of this study was to develop and assess task-specific metrics for LHHR, specifically evaluating their reliability and validity for the fundoplication portion of the procedure.

Materials and methods

This study was approved by the UT Southwestern Institutional Review Board and was conducted in two phases. In phase I, interviews were conducted with experts to create task-specific metrics for the assessment of performance in laparoscopic Nissen fundoplication. In phase II, a bench model study was performed to evaluate validity evidence supporting the newly created metrics.

Development of task-specific metrics for fundoplication

We performed a hierarchical task analysis (HTA) of the LHHR by conducting hour-long semi-structured interviews with local foregut surgeons and experts from the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) Foregut Task Force. HTA is a well-established method in surgery that breaks a procedure down into tasks, sub-tasks, and motion end effectors, and it has been successfully used to deconstruct various minimally invasive procedures [19,20,21]. To guide our expert interviews, we formulated an initial list of procedure steps, drawing from recorded operative videos, textbooks, and prior task analyses of the laparoscopic fundoplication procedure [22, 23]. Experts were then asked to describe how they perform the procedure, highlight key moments, identify variations, and list common procedural errors in order of severity. The recordings were then independently analyzed by two authors (SH and GS) to create task trees with variations and a list of errors. Any discrepancies were resolved by an expert author (CH) and through consultations with the interviewed experts.

Validity evidence evaluation for the fundoplication task-specific metrics

In phase II, we assessed the validity evidence of the newly created metrics by conducting a study at the UT Southwestern Simulation Center using a porcine explant Nissen fundoplication model. Messick’s unitary framework was used to evaluate the validity of our task-specific metrics [24]. Specifically, data were collected to evaluate validity evidence in the following domains: content alignment, response process, internal structure, and relationship to other variables.

Fundoplication simulator design

We created a Nissen fundoplication simulator using a porcine stomach explant, which was placed inside a modified version of a laparoscopic box trainer [4] (Fig. 1). A frozen porcine stomach and esophagus specimen (Animal Technologies Inc., Tyler, Texas) was thawed and positioned in the box trainer. The esophagus was passed through a small circular incision in the lap box, about 2 inches from the base and held taut using an Allis clamp. To prevent lateral movement, the stomach was secured with two alligator clips. To create a retroesophageal window for the fundoplication, a Penrose drain was inserted through a circular incision about 4 inches from the base to lift the stomach at the gastroesophageal junction and keep the model under tension. A 0° laparoscope connected to a standard equipment tower was used for visualization. A pair of standard laparoscopic needle drivers, curved graspers, and scissors were used to perform the procedure. In addition, 2–0 silk sutures pre-cut to 15 cm in length were placed on a foam box to be used for suturing.

Fig. 1
figure 1

Laparoscopic Nissen fundoplication simulator; A overall set-up; B specimen set-up

Study design and procedure

The study was performed at the UT Southwestern Simulation Center with a between-subjects design. Recruited participants were stratified into three groups by level of expertise: novices (post-graduate year [PGY] 1–2 general surgery residents), intermediates (PGY 3–5 residents and a minimally invasive surgery fellow), and experts (faculty).

Prior to starting the procedure, each participant completed a survey capturing demographic information, clinical experience, and simulator experience. After providing informed consent, participants were given general instructions explaining the study objective and the task, without any technical/operative guidance. Specifically, we did not provide any instructions on the number and type of sutures, the distance between sutures, or the placement of the wrap. Participants were then asked to complete a Nissen fundoplication on the porcine stomach model and were given 1 h to complete up to 2 unassisted attempts. Video recordings focused on the instruments actively used in the laparoscopic box trainer, the training model itself, and a card displaying the participant's random identification number to ensure anonymity during video review. We also collected and analyzed the following real-time in situ metrics for each participant: (I) number of attempts completed (1 or 2), (II) number of sutures placed for the fundoplication, (III) space between sutures measured in centimeters, and (IV) whether seromuscular bites were taken through the esophagus (dichotomized as 0 or 1).

At the conclusion of the study, participants were asked to complete a post-simulation survey to assess the quality of the simulator on a 5-point Likert scale. The survey covered 5 categories that included the visual appearance of the simulation, the quality of models and textures, the realism of the simulator interface, how closely the task mirrored the actual surgical procedure, and the simulator’s overall effectiveness in teaching LHHR.

Two qualified raters, blinded to the participants' experience levels, independently evaluated the video recordings of performances using both global and task-specific metrics. Table 1 presents the global metrics derived from the OSATS rubric, whereas Table 2 displays the task-specific metrics grounded in the HTA [10, 25,26,27,28]. Among the OSATS domains, we excluded the scoring rubric for knowledge of instruments because all participants were provided with the same set of laparoscopic tools. Initially, the two raters assessed the performance of 5 participants, comparing their ratings to discuss the grading and to resolve discrepancies. They then evaluated another 5 videos to ensure concordance between their ratings and reviewed the intraclass correlation coefficient (ICC). Finally, each rater independently graded the remaining videos.

Table 1 Rubric for assessing performance using global metrics
Table 2 Task-specific metrics for assessing performance of the creation and securing the wrap portion of the laparoscopic fundoplication procedure

Data analysis

The ICC estimates and their 95% confidence intervals for establishing interrater reliability (IRR) were calculated based on the mean rating (k = 2), absolute agreement, and a 2-way mixed-effects model. An ICC value between 0.75 and 0.9 was deemed good, while a value above 0.9 was deemed excellent for IRR [29]. A total score was calculated by first averaging the individual metric scores from both raters and then summing them, for both the global and task-specific evaluations. The Spearman rank correlation test was used to assess the association between the total global and task-specific scores. To determine performance differences between the three groups, the data were first evaluated for normality using the Shapiro–Wilk test. If the data were normally distributed, a one-way analysis of variance (ANOVA) was conducted, followed by pairwise t tests with Bonferroni correction for post hoc analysis. If not, the Kruskal–Wallis test was employed, followed by pairwise Wilcoxon tests with Benjamini–Hochberg correction. Post hoc effect sizes were reported when appropriate.
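For illustration, the ICC form described above (two-way model, absolute agreement, mean of k = 2 raters; ICC(A,k) in McGraw and Wong's notation) can be computed directly from the subject-by-rater score matrix. Below is a minimal pure-Python sketch using hypothetical ratings, not the study data:

```python
def icc_a_k(ratings):
    """Two-way model ICC, absolute agreement, average of k raters (ICC(A,k)).

    `ratings` is a list of rows, one per subject, each holding the scores
    the k raters gave that subject.
    """
    n = len(ratings)           # subjects
    k = len(ratings[0])        # raters
    grand = sum(x for row in ratings for x in row) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (msc - mse) / n)

# Hypothetical example: rater 2 scores a constant 0.5 higher than rater 1.
scores = [[1.0, 1.5], [2.0, 2.5], [3.0, 3.5], [4.0, 4.5]]
print(round(icc_a_k(scores), 3))  # → 0.964
```

Because absolute agreement penalizes the rater (column) variance, the constant offset lowers the ICC slightly; a consistency-type ICC would ignore the offset entirely.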

Sample size

An a priori power analysis was conducted using the G*Power software [30] to test the difference in performance between the three groups, with α = 0.05, an effect size of f = 0.5, and power (1 − β) = 0.8. The analysis showed that a total of 30 subjects, equally distributed across the three groups, was needed to achieve the necessary power.
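Such an a priori calculation can also be sanity-checked by Monte Carlo simulation: draw normal data for three groups whose means produce the target Cohen's f, and count how often a one-way ANOVA rejects the null. The pure-Python sketch below uses illustrative group sizes and simulation counts and is not a substitute for G*Power:

```python
import random

random.seed(0)

def f_stat(groups):
    """One-way ANOVA F statistic for a list of groups."""
    means = [sum(g) / len(g) for g in groups]
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    k = len(groups)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

def simulate_power(n_per_group=10, f_effect=0.5, alpha=0.05, sims=2000):
    # Spacing the three group means at (-d, 0, d) with unit SD gives
    # Cohen's f = sd(means)/sd(within) = f_effect.
    d = f_effect * (3 / 2) ** 0.5
    # Empirical critical value from null simulations (all means equal).
    null_f = sorted(
        f_stat([[random.gauss(0, 1) for _ in range(n_per_group)] for _ in range(3)])
        for _ in range(sims)
    )
    crit = null_f[int((1 - alpha) * sims)]
    # Power = rejection rate under the alternative.
    hits = sum(
        f_stat([[random.gauss(m, 1) for _ in range(n_per_group)] for m in (-d, 0.0, d)]) > crit
        for _ in range(sims)
    )
    return hits / sims

print(f"estimated power: {simulate_power():.2f}")
```

Estimating the critical value from null simulations keeps the sketch free of F-distribution tables; a statistical package would instead use the exact noncentral F distribution.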

Results

Phase I results

Task analysis

A total of 12 expert foregut surgeons participated in interviews for task analysis, spanning 720 min in total. Table 3 displays the HTA of LHHR, outlining 6 major tasks, 27 subtasks, and 19 major errors. Using the HTA (Table 3) and the cataloged errors, we formulated metrics for video-based assessment of the LHHR (see Appendix 1).

Table 3 Hierarchical task analysis of the laparoscopic hiatal hernia repair showing major tasks, sub-tasks, and errors

Phase II results

Pre-survey results

Demographics

A total of 38 participants were recruited to complete the fundoplication simulation (Table 4). Participants were grouped into novices (n = 17, 45%), intermediates (n = 15, 39%), and experts (n = 6, 16%). Additionally, 50% (n = 19) were male, 45% (n = 17) were under the age of 30, 87% (n = 33) self-reported being right-handed, and 58% (n = 22) wore corrective lenses.

Table 4 Demographics of the participants
Prior experience

The overwhelming majority of novice and intermediate participants (n = 28, 88%) reported having observed 0–10 HHRs, while 3 (9%) reported observing 11–30 cases and 1 (3%) reported observing 30–50 cases. Among the attending surgeons, most had observed and/or participated in at least 100 cases; only 1 reported fewer than 100. Overall, 42% (n = 16) of participants self-reported prior exposure to a robotic (da Vinci) or laparoscopic (Fundamentals of Laparoscopic Surgery) simulation trainer, indicating prior hands-on familiarity with the technology being assessed. Additionally, 37% (n = 14) reported gaming experience, with more than half of them (n = 10) playing at least 1–5 h a week. None of the participants reported any exposure to VR laparoscopic training.

Post-simulation survey results

After the Nissen fundoplication task, participants completed a post-simulation survey rating the realism and usefulness of their experience on a scale of 1–5, with 1 being not realistic/useful and 5 being very realistic/useful. The survey covered 5 categories: the realism of the model's anatomy, the realism of the ex vivo porcine model (texture), the realism of the simulator interface (instruments, display), the overall realism of the task compared to the actual surgical task, and the overall perceived usefulness of the simulator for learning laparoscopic hiatal hernia surgical skills. Table 5 shows the survey results for the degree of realism and usefulness of the fundoplication simulation model. The vast majority of participants in all three groups rated the simulator's realism highly and recognized its usefulness and ability to capture the essential features of the task, thus establishing content alignment.

Table 5 Survey completed after performing the Nissen fundoplication simulation on the porcine model

Reliability analysis

The IRR between the two blinded raters was good for both the global metrics (ICC = 0.84, 95% CI 0.79–0.87, p < 0.001) and the task-specific metrics (ICC = 0.75, 95% CI 0.70–0.78, p < 0.001), thereby establishing internal structure validity. Grading the videos with blinded raters mitigated potential rater bias, supporting response process validity.

Analysis of metrics

The descriptive statistics of the metrics used for assessing performance are shown in Table 6. Due to the unequal sample size of the groups and data violating normality using the Shapiro–Wilk test, non-parametric tests were used and are reported here.

Table 6 Median and interquartile range (IQR) of metrics used for the assessment of performance

Global metrics

Table 6 presents the median and interquartile range of the total global scores for all three groups. The Kruskal–Wallis test showed a significant difference in performance between the groups (χ2 = 24.01, p < 0.001). As depicted in Fig. 2, performance improved with increasing level of expertise. Post hoc analysis revealed significant differences among all three groups: novice vs. intermediate (p = 0.001), intermediate vs. expert (p = 0.01), and novice vs. expert (p = 0.007).
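For reference, the Kruskal–Wallis H statistic used for these group comparisons can be computed from pooled ranks in a few lines. The sketch below is pure Python with hypothetical total scores, not the study data:

```python
def kruskal_h(groups):
    """Kruskal-Wallis H with average ranks for ties and tie correction."""
    pooled = sorted((x, gi) for gi, g in enumerate(groups) for x in g)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1                         # j - i tied observations
        avg_rank = (i + j + 1) / 2         # average of ranks i+1 .. j
        tie_term += (j - i) ** 3 - (j - i)
        for _, gi in pooled[i:j]:
            rank_sums[gi] += avg_rank
        i = j
    h = (12 / (n_total * (n_total + 1))) * sum(
        rs ** 2 / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (n_total + 1)
    return h / (1 - tie_term / (n_total ** 3 - n_total))

# Hypothetical total global scores for three expertise groups.
novice = [12, 14, 15, 16]
intermediate = [17, 18, 20, 21]
expert = [22, 24, 25, 26]
print(kruskal_h([novice, intermediate, expert]))
```

With adequate group sizes, H is compared against a χ2 distribution with (number of groups − 1) degrees of freedom to obtain the p value.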

Fig. 2
figure 2

Total global score for the three groups

Task-specific metrics

The median and interquartile range of the total task-specific scores for all three groups are shown in Table 6. The Kruskal–Wallis test revealed a significant difference in performance among the groups (χ2 = 18.4, p < 0.001). As illustrated in Fig. 3 and mirroring the total global score, performance improved with increasing levels of experience. Post hoc analysis showed a significant difference in performance among all three groups: novice vs. intermediate (p = 0.001), intermediate vs. expert (p = 0.03), and novice vs. expert (p = 0.001). The Spearman rank correlation indicated a strong association between the total global score and the total task-specific scores (rs = 0.87, p < 0.001), as depicted in Fig. 4. In addition, Fig. 5 displays photos of subjects executing various components of the task-specific metrics.
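The Spearman rank correlation reported here is simply a Pearson correlation computed on ranks (with average ranks for ties). A small pure-Python sketch with hypothetical score pairs, not the study data:

```python
def _ranks(values):
    """Ranks 1..n, assigning average ranks to tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[order[j]] == values[order[i]]:
            j += 1
        avg = (i + j + 1) / 2
        for idx in order[i:j]:
            ranks[idx] = avg
        i = j
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical paired global and task-specific totals.
global_scores = [14, 18, 21, 25, 28]
task_scores = [10, 13, 17, 18, 22]
print(round(spearman_rho(global_scores, task_scores), 2))
```

Because rho depends only on rank order, it is robust to the non-normal score distributions that motivated the non-parametric tests above.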

Fig. 3
figure 3

Total task-specific score for the three groups

Fig. 4
figure 4

Correlation between total global- and task-specific scores

Fig. 5
figure 5

Video-based assessment of the fundoplication task

In situ metrics

  I. Number of attempts: all participants in the expert and intermediate groups completed the maximum of 2 attempts in the allotted time, except for 1 subject in each group; in the novice group, only 6 of 17 subjects proceeded to a second attempt. The Kruskal–Wallis test showed a significant difference in the number of attempts between the groups (χ2 = 12.5, p = 0.001). Post hoc analysis showed a significant difference between the novice and intermediate groups (p = 0.002). No difference was found between the novice and expert groups (p = 0.07) or the intermediate and expert groups (p = 0.54).

  II. Number of sutures: all subjects in the expert group placed 3 sutures to complete the fundoplication. In the intermediate group, 13 subjects placed 3 sutures and 2 placed 4. In the novice group, 3 placed only 1 suture, 2 placed 2, 8 placed 3, and 4 placed 4. The Kruskal–Wallis test showed no significant difference in the number of sutures placed between the three groups (χ2 = 0.94, p = 0.62).

  III. Sum of distance between sutures: the summed inter-suture distance ranged from 2 to 3.5 cm for experts, 1.4 to 4.4 cm for intermediates, and 0 to 3.5 cm for novices. The Kruskal–Wallis test showed a significant difference between the groups in the sum of the distances for all sutures (χ2 = 6.04, p = 0.04). However, post hoc analysis found no significant pairwise differences: novice vs. intermediate (p = 0.15), novice vs. expert (p = 0.08), and intermediate vs. expert (p = 0.08).

  IV. Seromuscular bite: overall, 5 of 6 subjects in the expert group, 10 of 15 in the intermediate group, and 5 of 17 in the novice group placed a seromuscular bite on the esophagus while performing the fundoplication. The Kruskal–Wallis test showed a significant difference between the groups (χ2 = 6.94, p = 0.03). However, post hoc analysis found no significant pairwise differences: novice vs. intermediate (p = 0.06), novice vs. expert (p = 0.06), and intermediate vs. expert (p = 0.49).
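The Benjamini–Hochberg step-up correction applied to these pairwise comparisons (per the Methods) can be written compactly. A minimal pure-Python sketch with hypothetical p values, not the study results:

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p values (false discovery rate control)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):           # walk from the largest p value down
        i = order[rank - 1]
        # Raw BH value p * m / rank, made monotone non-decreasing in p.
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p values for three pairwise post hoc comparisons.
print(bh_adjust([0.002, 0.03, 0.049]))
```

BH is less conservative than Bonferroni for multiple pairwise tests, which is why a Kruskal–Wallis result can be significant overall while individual adjusted pairwise comparisons are not.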

The results of this analysis demonstrated the metrics' relationship to other variables, supporting construct validity.

Discussion

Our results demonstrate that task-specific metrics differentiate performance in the wrap creation step of laparoscopic fundoplication among novice, intermediate, and expert surgeons. A strong positive correlation was also observed between the validated global OSATS score and our task-specific scores. High IRR for both metric sets supports the feasibility of using our task-specific metrics for video-based assessment of performance. Additionally, it is noteworthy that 89% of participants rated the simulator's usefulness as 4 or 5 on a 5-point scale. This rating was reinforced by informal comments from several non-expert participants throughout the study expressing their desire for this practice opportunity before performing the procedure in the operating room. Many trainees also mentioned that the experience enhanced their confidence in approaching such cases in live patients.

Expertise in laparoscopic hiatal hernia surgery requires extensive training with high case volume. Learning curve studies have shown that individual surgeons need a total of 20–40 cases, and individual institutions about 50 cases, for postoperative complication rates to stabilize [13, 31]. In a 10-year institutional learning curve study, 200 fundoplication cases had to be performed before operative time, conversion rates, and complications plateaued [14]. Given the procedure's long learning curve, obtaining adequate training is further complicated by the concentration of cases in a few high-volume specialty centers [32]. This centralization can reduce the number of cases performed by residents, whose training in complex foregut surgery is limited to their experience in the operating room. In our study, 88% of residents reported participating in 10 or fewer LHHRs. Simulation-based training can help bridge this gap by allowing trainees to practice this task outside of the operating room.

As the exposure of surgical trainees to LHHR varies based on whether or not they train at a high-volume center, a simulator for this procedure is essential. Such a simulator should not only train the important cognitive and technical aspects of the procedure but should also support both high-stakes summative and low-stakes formative assessment of skills. Several tools exist for video-based assessment of performance in LHHR, albeit with limited validity evidence [33]. A majority of training programs use a global tool for assessment of laparoscopic performance, such as the OSATS and the Global Operative Assessment of Laparoscopic Skills (GOALS) [34,35,36,37], or a combination of global scales and procedure-specific assessment tools in the form of checklists [38, 39]. In a study by Peyre et al. [40], investigators focused on a detailed 65-step procedural checklist previously developed through task analysis for the evaluation of technical performance in laparoscopic Nissen fundoplication [41]. Sixty-four of the 65 steps showed a high degree of reliability (> 0.8) when expert operative performance of Nissen fundoplication was graded by five surgeons using the checklist. More recently, as part of its Masters program, SAGES developed a video-based assessment tool for laparoscopic fundoplication and demonstrated its content validity [22]. In our work, we independently developed assessment metrics using the well-established HTA method. Overall, the major tasks and sub-tasks aligned with prior HTA findings for this procedure [20, 22, 41]. Using HTA, we identified 19 major errors and developed task-specific metrics to evaluate performance in LHHR. Task-specific metrics developed using HTA and expert consensus have previously been validated for the assessment of performance in endotracheal intubation and colorectal anastomosis [27, 28, 42].
Though only the task-specific metrics for the creation and securing the wrap portion of the procedure were tested in our work, we were able to clearly establish validity evidence in the following domains defined by Messick’s unitary framework, namely, content alignment, response process, internal structure, and relationship to other variables.

One unique aspect of this study was the incorporation of in situ metrics in addition to our task-specific metrics for assessment. Both the number of attempts and placement of a seromuscular bite were found to be useful metrics that could be easily incorporated into the VR simulator for assessment. Although the goal of this work was to develop assessment metrics for our VR simulator, the developed metrics and their validity evidence can also be used for video-based assessment of performance in laparoscopic fundoplication procedures. We showed the relationship of our metrics to other variables by comparing our task-specific metrics to the OSATS; however, given the time required for video-based assessment, it is not yet known how our task-specific metrics correlate with other instruments developed for this procedure, which will be part of our future work.

The transferability of skills from simulation to the live operating room must be a priority when creating a simulator: transferability both encourages use and yields actual improvement in operative technical skill and patient outcomes. Although we did not test the initial dissection and reduction of the hernia sac with its contents or the assessment of intraabdominal esophageal length, due to constraints in creating a physical model, we plan to test those aspects later in a VR model. We have created a model of the crura with an enlarged esophageal hiatus and are performing studies to establish validity of the metrics for the crural repair portion of the procedure, which will be reported separately. Our fundoplication simulation closely mimics a portion of the actual LHHR operation with a few differences, most notably the lack of a diaphragm; hence, it does not replicate the exact constraints experienced in real surgery. Nevertheless, the realism of our simulation is evident from participant feedback: 79% graded it 4 or 5 on a 5-point realism scale.

Limitations of this study include a relatively small sample size and unequal participant numbers across groups. While we maintained representation from each level of surgical expertise, the intermediate and expert groups had comparatively fewer participants, likely because escalating operative and clinical responsibilities at each PGY level leave less availability for research participation. The smaller and unequal samples also produced small-to-moderate effect sizes without clear post hoc comparison results for our in situ metrics. Furthermore, despite blinding of participant identities in the videos, some participants may still have felt reluctance and apprehension about skill evaluation. Finally, due to resource constraints, we could not use flexible endoscopy to assess the quality and securement of the wrap, such as tightness and potential full-thickness bites.

Using an ex vivo fundoplication model, this study established the validity and reliability of task-specific metrics developed for assessment of performance in the creation and securing the wrap portion of the LHHR. The developed simulator and the video-based assessment metrics can be used for training and assessment in this procedure. Our next step is to incorporate the validated task-specific metrics in our VR simulator for automated assessment.