The acquisition of technical skills is one of the fundamental goals of postgraduate surgical training. Nearly 80 % of graduating chief residents in general surgery apply for a fellowship, and nearly 30 % apply to the Fellowship Council for a fellowship (advanced gastrointestinal/minimally invasive surgery, bariatric surgery, hepatobiliary surgery, etc.). This indicates that these residents perceive a need for more training, specifically training to acquire expert technical skills. Although both faculty and residents perceive this need, an objective method to assess the technical skills of trainees remains elusive. Program directors have never had valid, objective, reliable tools with which to measure the technical performance of residents and fellows and, therefore, have never been able to determine objectively which trainees are proficient and which need additional technical training.

Traditionally, surgeons developed their technical skills through protracted exposure to a supervised, graded operative experience. The skills required to complete an operation result from applying a specific combination of abilities to the task of the operation, and the assessment of operative performance measures the overall efficiency of completing the task by measuring the component abilities required to complete it [1]. Procedural training in current practice is often unsystematic and unstructured; it is typically driven by the unpredictable need to perform various surgical procedures. Surgical educators have worked for decades to improve the methods by which surgical education is delivered to trainees [2, 3]. However, validated teaching methods have not been integrated into clinical education, in part because validated tools with which to assess trainee performance did not exist [3]. Surgical fellows and their preceptors spend a large amount of time demonstrating and observing technical skills in the operating room, but traditional assessment of technical skills relies upon subjective evaluations by senior staff members, case records, and morbidity and mortality rates [4], none of which have been shown to be valid or reliable.

Surgical educators need validated tools to assess surgical competency during training to ensure that graduates have acquired sufficient skill to be safe and effective in independent practice [5–14]. Although surgeons have developed a few tools to evaluate performance, and some have been validated for specific procedures, surgical educators and certifying organizations have not endorsed or mandated a comprehensive, objective, validated method to measure technical performance. The basic reason to assess performance is to guide training to the level of proficiency; however, public demand, reduced resident work hours [15], and regulatory mandates have all increased the urgency to develop and implement validated, objective assessment tools that will reliably measure technical performance.

Global Operative Assessment of Laparoscopic Skills (GOALS)

GOALS evaluates laparoscopic operative performance in five domains: depth perception, bimanual dexterity, efficiency, tissue handling, and autonomy. Each domain is scored from 1 to 5, with descriptive anchors provided for scores of 1, 3, and 5 (Table 1). The total score for an operation is the sum of the five domain scores and may be considered an overall assessment of a trainee’s performance. GOALS was first validated for dissection of the gallbladder from the liver bed [16]. Subsequently, GOALS has been validated as an assessment tool for laparoscopic cholecystectomy and laparoscopic appendectomy but has not previously been validated for assessing performance during more complex operations [17]. Chang and colleagues also demonstrated construct validity and inter-rater reliability using GOALS to evaluate video recordings of laparoscopic cholecystectomy performed by a novice and by an expert [18].

Table 1 Basic GOALS for Nissen fundoplication, gastric band, and gastric bypass
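As a toy illustration of the scoring scheme (not data from this study), a single GOALS rating could be recorded and totaled as follows in R; the domain ratings shown are invented.

```r
# Five domains, each rated 1-5; the total therefore ranges from 5 to 25.
domains <- c(depth_perception  = 4, bimanual_dexterity = 3, efficiency = 3,
             tissue_handling   = 4, autonomy = 2)
stopifnot(all(domains >= 1 & domains <= 5))   # each domain must be scored 1-5
total <- sum(domains)                         # overall assessment of the trainee's performance
total                                         # 16 of a possible 25
```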

More recently, investigators have developed and validated tools designed to evaluate performance during groin hernia repair (GOALS-GH) [19] and incisional hernia repair (GOALS-IH) [20, 21]. The lack of a consensus, validated assessment tool for most complex laparoscopic operations motivated this study. We hypothesized that GOALS would differentiate novice fellows from graduating fellows and, thereby, establish construct validity for GOALS when used to evaluate performance during complex laparoscopic gastrointestinal surgery. We further hypothesized that the GOALS scores of fellows would correlate with (1) total previous experience with complex laparoscopic operations, (2) previous experience with the procedure under evaluation, (3) difficulty of the operation, and (4) length of time as a fellow.

Materials and methods

The Fellowship Council provided a data set that included all voluntarily reported performance scores for fellows between June 2010 and November 2011. In addition to the performance scores, the data set included the percent of the case performed by the fellow, the name and date of the procedure, previous experience with complex cases, previous experience with the case being reported, and the difficulty of the case. With permission from the Fellowship Council and approval by the Institutional Review Board, the data were stratified by case type and then analyzed to identify predictors of improved performance.

Statistical analysis

Basic descriptive statistics were calculated first, including the mean and standard deviation of the scores for each domain for each quarter of the fellowship year. To determine which of the available factors (number of previous complex procedures performed, number of the same type of procedure performed, case difficulty, and total time in the fellowship program) were related to scores, we applied a linear mixed effects model to the data. Each of the available factors was treated as a fixed effect, and the fellow was included as a random effect, allowing us to model fellow-to-fellow variability and to account for correlation among different scores for a given subject. To allow for apparent nonlinearity of the learning curve, we log-transformed the time in the program. All analyses were performed with R version 2.15.2 (www.r-project.org) for Windows.
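As a concrete illustration, a minimal R sketch of such a model specification is given below. The package (lme4) and the column names (score, prior_complex, prior_same, difficulty, months_in_program, fellow_id) are assumptions for illustration only and are not taken from the original analysis.

```r
# Minimal sketch (not the authors' code) of the mixed effects analysis described above.
library(lme4)

# Hypothetical file of Fellowship Council performance records, one row per assessment
goals <- read.csv("goals_scores.csv")

# Fixed effects: the four candidate predictors.
# Random intercept per fellow accounts for correlation among repeated scores from the same subject.
# Time in the program is log-transformed to allow for a nonlinear learning curve.
fit <- lmer(score ~ prior_complex + prior_same + difficulty +
              log(months_in_program) + (1 | fellow_id),
            data = goals)

summary(fit)
```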

Results

The performance of each of 31 unique fellows during 98 complex laparoscopic operations was assessed using the GOALS tool for three types of operations: laparoscopic Nissen fundoplication (n = 27), laparoscopic adjustable gastric band (n = 17), and laparoscopic Roux-en-Y gastric bypass (n = 54; Table 2). Of the 31 fellows, 14 (45.2 %) had only one set of performance scores available. The mean scores for each quarter for each domain increased throughout the fellowship year (Fig. 1).

Table 2 Descriptive statistics—assessment of laparoscopic cases
Fig. 1
figure 1

Mean score of overall and five performance domains across four quarters

Table 3 displays the results of analyses using three different models; the mean performance score is the outcome variable in each model. On the left of the table (“Separate models”) are the results from fitting a separate linear mixed effects model for each fixed effect predictor. In the middle of Table 3 (“Full model”) are the results of the analysis in which all four factors are included as fixed effects in a single model. On the right of Table 3 (“Final model”) are the results from the model giving the lowest Bayesian Information Criterion [22]. In this model, each of the three retained predictor variables is significantly related to the outcome.

Table 3 Modeling results for four predictive factors
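Continuing the sketch above, the model-comparison step could be carried out as follows. The candidate reductions shown are illustrative only; fitting with REML = FALSE is assumed so that models with different fixed effects are comparable by BIC.

```r
# Separate models: one predictor at a time (one example shown)
sep_difficulty <- lmer(score ~ difficulty + (1 | fellow_id),
                       data = goals, REML = FALSE)

# Full model: all four predictors as fixed effects
full <- lmer(score ~ prior_complex + prior_same + difficulty +
               log(months_in_program) + (1 | fellow_id),
             data = goals, REML = FALSE)

# Candidate reduced models, dropping one predictor at a time
drop_complex <- update(full, . ~ . - prior_complex)
drop_same    <- update(full, . ~ . - prior_same)

# The candidate with the lowest BIC would be retained as the Final Model
BIC(full, drop_complex, drop_same)
```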

Figure 2 shows all of the raw data together with the estimated average learning curve for overall performance and for each domain, calculated using the Final Model. The learning curves for overall performance, bimanual dexterity, efficiency, and autonomy demonstrated statistically significant improvement during the fellowship year. Although the learning curves for depth perception and tissue handling showed a trend toward improvement, the improvement did not reach statistical significance.

Fig. 2
figure 2

Learning curves for overall performance and five performance domains (GOALS)
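For readers who wish to reproduce this kind of figure, a sketch of how an average learning curve could be derived from a fitted model follows. It reuses the hypothetical fit object and column names from the sketches above and assumes a recent version of lme4 that provides predict() for mixed models.

```r
# Predict the population-level (fixed-effects-only) score across the year,
# holding the other predictors at typical values.
newdat <- data.frame(
  months_in_program = seq(1, 12, by = 0.25),
  prior_complex     = median(goals$prior_complex, na.rm = TRUE),
  prior_same        = median(goals$prior_same, na.rm = TRUE),
  difficulty        = median(goals$difficulty, na.rm = TRUE)
)
newdat$pred <- predict(fit, newdata = newdat, re.form = NA)  # ignore fellow-level random effects

# Raw scores with the estimated average learning curve overlaid
plot(score ~ months_in_program, data = goals, col = "grey60",
     xlab = "Months in fellowship", ylab = "Mean GOALS score")
lines(newdat$months_in_program, newdat$pred, lwd = 2)
```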

Discussion

Competency-based education is currently being introduced into all levels of surgical training in the United States. Fellowship training in complex surgery is continually evolving, not only to keep pace with advances in patient care and the adoption of new technologies but also to meet the needs of trainees. Valid and reliable performance assessment tools are essential to ensure that competencies are acquired. Several tools have been shown to be valid measures of technical performance, but none has been adopted for general use in technical skills assessment.

In addition to supporting feedback, data from assessments can be analyzed together with other variables to identify factors that result in improved performance. In this project, we developed several analytic models to determine which would best identify these predictors. In our Final Model, we confirmed that three separate factors each independently affected the performance scores. In future studies with larger data sets, we will test these models further. This study provides evidence that such models can be used to identify factors that affect the performance scores of trainees.

By comparing the performance scores of fellows at the beginning of the fellowship year with their scores at the end of the year, we determined that GOALS is able to differentiate novice fellows from graduating fellows, thereby establishing construct validity for the tool. These results indicate that GOALS can be used reliably by program directors and other faculty to provide feedback to fellows. Formative feedback of this type has the potential to allow program directors to customize training for each fellow to meet her/his specific needs for technical skills training. Ultimately, this type of tool could be used for summative assessment to inform critical decisions such as graduation, certification, and credentialing.

Learning curves graphically display the change in performance over time during a period of learning; when a learning curve reaches a plateau, learning has been completed. A trainee has achieved proficiency in performing an operation if her/his scores plateau at a high level of performance consistent with proficiency. If the scores plateau at a lower level, either the trainee or the training methods may be deficient. However, before surgical educators can make summative assessments with tools such as GOALS, those tools must be validated.

The limitations of this study are the small number of fellows who were assessed, the small number of assessments for each fellow, and the small number of fellows for whom there were multiple assessments of the same type of case throughout the fellowship year. Despite these limitations, the scores of the fellows represented in the data set tended to improve throughout the fellowship year, suggesting that, as a group, the fellows performed significantly better as the year passed.

Although the performance scores improved through the course of the year overall and in all domains, the improvement did not reach statistical significance in the domains of depth perception and tissue handling. One possible explanation is that the number of fellows may have been too small to provide sufficient power. Another is that the fellows had already acquired substantial skill in these domains before beginning fellowship; this possibility is supported by the fact that the performance scores in these two domains at the beginning of the fellowship were higher than the scores in the other domains. All fellows had previously completed a general surgery residency and had extensive operative experience with laparoscopic cholecystectomy, during which they may have acquired the skills to manage the lack of depth perception and to handle tissue effectively.

The ultimate value of assessment tools, such as GOALS, is that they will enable training to technical proficiency. Training surgeons to proficiency will not only enable surgical training programs to graduate better-trained surgeons but also enable surgical educators to document that each graduating surgeon is well trained.

Conclusions

This study has documented that GOALS is able to differentiate novice fellows from graduating fellows and, therefore, the study establishes construct validity for GOALS as an assessment tool for technical performance during complex laparoscopic gastrointestinal operations. The analytic models developed and tested in this study may now be studied on larger, more complete data sets as they become available. These future studies will better define optimal use of both the GOALS tool and the analytic models.