Introduction

Training residents is a key task of surgeons in teaching hospitals. Gaining insight into the strengths and weaknesses of surgeons’ teaching performance is crucial for maintaining and enhancing high-quality training programs. There is evidence suggesting that unguided (isolated) self-evaluation of performance does not provide sufficient information for adequate performance enhancement [1, 2]. In response to these findings, a process of informed self-evaluation, which combines external and internal data to self-evaluate performance, has been suggested as a valuable alternative [3]. Several feedback sources can provide an external view on surgeons’ teaching performance, including feedback from residents in training. Previous research has shown that surgeons found residents’ feedback valuable, especially when they combined it with a self-evaluation of their performance to enhance their self-awareness [4]. Robust performance evaluation systems are now available to guide residents in collecting and feeding back data on surgeons’ teaching performance [5–7]. However, the effects of such evaluation systems on surgeons’ subsequent teaching performance are unknown. Therefore, this study evaluates how surgeons’ teaching performance evolves after two cycles of evaluation, reporting, and feedback.

In the process of informed self-evaluation, internal and external data sources are integrated to provide a comprehensive overview of surgeons’ performance [3]. However, combining and comparing data sources such as resident evaluations and self-evaluations of surgeons’ teaching performance can create tensions for surgeons, thereby delaying, or even leading to the dismissal of, self-improvement actions [8–11]. In particular, surgeons who are confronted with a discrepancy between self- and external evaluations of their performance may develop (emotional) reactions that affect how they respond to their performance feedback and, subsequently, their actual performance improvement [8–13]. Furthermore, psychological studies show that discrepancies between self- and other-perceptions of one’s performance can be perceived as unsatisfactory, and suggest that overestimating can impede subsequent performance, while underestimating is usually harmless for performance [14–20]. Consequently, surgeons may aim to minimize the discrepancy either by attempting to influence resident evaluations or by adjusting their self-evaluations. This study therefore has two main aims. First, we explore how resident evaluations and self-evaluations of surgeons’ teaching performance evolve after two cycles of evaluation, reporting, and feedback. Second, we explore whether over- or underestimating their own performance influences resident and self-evaluations of surgeons’ subsequent performance.

Materials and methods

Setting and study population

This study was conducted at 29 surgical teaching programs in 13 hospitals, including general surgery (10), obstetrics and gynecology (10), ophthalmology (3), orthopedic surgery (2), otorhinolaryngology (1), urology (1), neurosurgery (1), and plastic surgery (1). Teaching programs could participate voluntarily by approaching the project leaders. In the Netherlands, postgraduate medical training is organized in eight geographical regions, each coordinated by an academic medical center. All larger (more than five residents) surgical training programs that were based at or coordinated from the project leaders’ academic medical center participated in this study (24 of the 29 included programs). Additionally, five training programs from other regions in the Netherlands participated. Data were collected between September 2008 and May 2013, during annual evaluation periods lasting one month. Residents could choose which and how many surgeons to evaluate, based on whose teaching performance they believed they could evaluate accurately. For each residency training program, data from three subsequent evaluation periods were included, representing two full cycles of evaluation, feeding back, follow-up, and re-evaluation. In total, 351 surgeons were invited to participate in this study. Only surgeons who participated in the first evaluation period at their training program were included; none of the surgeons could enter during a later evaluation period. All residents were asked to provide feedback: 299 residents were invited to evaluate surgeons’ teaching performance during the first, 346 during the second, and 341 during the third evaluation period. Participants were invited via email, which stressed the formative purpose and use of the evaluations and the confidential and voluntary character of participation.

System for evaluation of teaching qualities (SETQ)

We used the system for evaluation of teaching qualities (SETQ), which provides surgeons with reliable and valid evaluations of, and feedback on, their teaching performance in order to improve the quality of teaching in residency training. The SETQ items are theory based and have been extensively tested [5, 6, 21, 22]; they are listed in Appendix Table 4. Briefly, the SETQ is composed of two tools (questionnaires): one for surgeons’ self-evaluation and one for resident evaluation of a surgeon’s teaching performance. The two tools include exactly the same 26 items and were administered via the Internet [5, 6]. Each item could be rated on a five-point Likert scale: 1 ‘strongly disagree’, 2 ‘disagree’, 3 ‘neutral’, 4 ‘agree’, 5 ‘strongly agree’, with an additional option ‘I cannot judge’. The items were statements such as “this surgeon explains why residents are incorrect.” In addition to these numerical items, the tools contained two narrative items: residents could provide ‘positive attributes of the surgeon’s teaching performance’ and ‘suggestions for improvement of the surgeon’s teaching performance’. A previous study showed that residents provided surgeons with a median of 11 positive open-text feedback comments and four suggestions for improvement per evaluation report [23]. The day after closure of an evaluation period, surgeons received their individual feedback report, summarizing residents’ ratings and narrative comments, along with their self-evaluation. Previous studies indicated that resident evaluations of surgeons’ teaching performance achieved high reliability with six to eight resident evaluations per surgeon [5, 6]. To preserve the anonymity of the residents, only the number of residents who provided feedback was reported to surgeons. Surgeons were encouraged to discuss their feedback with their peers or program director.

Study variables

The first variables of interest were surgeons’ self-evaluations and resident evaluations of surgeons’ teaching performance. To obtain an overall teaching performance score, all SETQ items were averaged. Resident evaluations were first aggregated at the surgeon level. Subsequently, the discrepancy between resident evaluation and self-evaluation was calculated. Previous studies defined the cut-off points for over- and underestimating at half a standard deviation (which corresponds to 0.45–0.50 points across the evaluations in the current study) [15, 19]. Although no clear rationale for this choice of cut-off points was given in those studies [15, 19], the absence of a better-founded alternative led us to adopt this method in the current study. Consequently, surgeons who evaluated their performance >0.5 points higher than residents did were categorized as ‘overestimating’, surgeons who evaluated their performance >0.5 points lower than residents did as ‘underestimating’, and surgeons whose discrepancy was within ±0.5 points as ‘in agreement’. In addition, several covariates were included in the analyses: surgeon’s sex, years of experience, teacher training, whether or not the surgeon formally discussed the feedback from a previous evaluation, training program’s specialty, and training program’s hospital.
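To make the categorization concrete, the sketch below (in R, using hypothetical variable names that are not taken from the study data set) shows one way the discrepancy score and the three categories could be computed:

```r
# Illustrative sketch with hypothetical variable names: categorize surgeons by
# the discrepancy between their self-evaluation and the aggregated resident score.
categorize_discrepancy <- function(self_score, resident_score, cutoff = 0.5) {
  discrepancy <- self_score - resident_score
  ifelse(discrepancy > cutoff, "overestimating",
         ifelse(discrepancy < -cutoff, "underestimating", "in agreement"))
}

# Example: a self-evaluation of 4.4 against an aggregated resident score of 3.8
# gives a discrepancy of 0.6 (> 0.5), so the surgeon is labeled 'overestimating'.
categorize_discrepancy(self_score = 4.4, resident_score = 3.8)
```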

Analytical strategies

Initially, we calculated appropriate descriptive statistics. Subsequently, missing data were imputed using multiple imputation (the ‘mice’ package in R) [24]. We used generalized linear mixed effects growth models to explore how the evaluation scores changed over the three subsequent evaluation periods [25, 26]. The mixed models framework allowed adjustment for clustering at the individual, specialty, and hospital levels.
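As an illustration of this step, the sketch below combines imputation and a growth model in R, consistent with the ‘mice’ package mentioned above; the mixed model is fitted here with the ‘lme4’ package, which is one possible implementation and not necessarily the software used in the study. The data set and variable names are hypothetical.

```r
library(mice)   # multiple imputation of missing evaluation data
library(lme4)   # linear mixed effects (growth) models

# Hypothetical long-format data set 'evaluations': one row per surgeon per
# evaluation period, with columns score, period (0, 1, 2), surgeon_id,
# specialty, and hospital.
imputed <- mice(evaluations, m = 5, seed = 2013)

# Growth model: fixed effect of evaluation period, random intercepts for
# surgeon, specialty, and hospital to adjust for clustering at those levels.
fits <- with(imputed,
             lmer(score ~ period + (1 | surgeon_id) + (1 | specialty) + (1 | hospital)))

# Pool the estimates over the imputed data sets (Rubin's rules).
summary(pool(fits))
```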

Next, the effect of over- and underestimating performance on subsequent teaching performance was analyzed using regression analysis. More specifically, sequential g-estimation within a generalized linear mixed models framework was used, a technique developed to estimate causal effects in longitudinal studies with time-varying exposures [27]. The first regression model had resident-evaluated subsequent teaching performance as the outcome and whether surgeons over- or underestimated their previous performance as the predictor. The second model had surgeons’ self-evaluated subsequent teaching performance as the outcome and the same predictor. Both models were additionally adjusted for previous teaching performance scores, whether surgeons formally discussed their previous evaluation report, surgeon’s sex, experience, teacher training, residency training program’s specialty, and residency training program’s hospital. Effect heterogeneity by surgeon’s sex and by whether surgeons discussed their previous performance was explored and is reported in the appendix.
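The general structure of these adjusted models can be sketched as follows (a simplified illustration in R with hypothetical variable names; it omits the sequential g-estimation machinery and shows only the form of the outcome models described above):

```r
library(lme4)

# Hypothetical variables: next_resident_score is resident-evaluated teaching
# performance in the following period; estimation_category is a factor with
# levels 'in agreement' (reference), 'overestimating', and 'underestimating'.
model_resident <- lmer(
  next_resident_score ~ estimation_category + previous_score + discussed_feedback +
    sex + experience + teacher_training +
    (1 | specialty) + (1 | hospital),
  data = analysis_data
)
summary(model_resident)

# The second model is analogous, with self-evaluated subsequent performance
# (e.g. next_self_score) as the outcome.
```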

Because this cohort study involved surgeons who were lost to follow-up (because they retired, switched jobs, quit teaching, or received no resident evaluations), sensitivity analyses for this loss-to-follow-up (selection or censoring) bias were performed. In these sensitivity analyses, an inverse probability of censoring (IPC) weight was calculated for each surgeon based on the surgeon’s background characteristics and his/her scores in previous evaluations [28]. Subsequently, all models described above were re-estimated, weighting each surgeon by his/her IPC weight to account for loss-to-follow-up bias. All analyses were performed using IBM SPSS Statistics 21.0 for Windows (IBM, Armonk, NY, USA).
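The weighting step can be sketched as follows (again an illustrative R example with hypothetical variable names, not the study’s actual code): a logistic regression estimates each surgeon’s probability of remaining in the study, and the inverse of that probability is used as a weight when the models are re-estimated.

```r
# Model the probability of remaining in the study from background characteristics
# and previous evaluation scores (hypothetical variable names).
censor_model <- glm(still_in_study ~ sex + experience + teacher_training + previous_score,
                    family = binomial, data = analysis_data)

# Inverse probability of censoring (IPC) weight for each surgeon.
analysis_data$ipc_weight <- 1 / predict(censor_model, type = "response")

# Re-estimate the model, weighting each surgeon by his/her IPC weight.
weighted_model <- lme4::lmer(
  next_resident_score ~ estimation_category + previous_score + sex + experience +
    (1 | specialty) + (1 | hospital),
  data = analysis_data, weights = ipc_weight
)
```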

Results

Study participants and response

Of the 351 invited surgeons, 347 (99 %), 313 (89 %), and 288 (82 %) received residents’ feedback during the first, second, and third evaluation periods, respectively. Self-evaluations were completed by 295 (84 %), 249 (71 %), and 242 (69 %) surgeons during the first, second, and third evaluation periods, respectively. Residents’ response rates were 84, 74, and 78 %, respectively, during the three subsequent evaluation periods. Characteristics of surgeons and residents are reported in Table 1.

Table 1 Study and participant characteristics

Findings

The median score of resident evaluations of surgeons’ teaching performance increased from 3.83 in the first and 3.82 in the second evaluation period to 3.91 in the third evaluation period (p < 0.001) (Table 2; Fig. 1). Surgeons’ median self-evaluated teaching performance scores did not change over the three subsequent evaluation periods, and the growth models likewise indicated no change (Table 2; Fig. 1). There were no differences between the unweighted growth models and the IPC-weighted models.

Table 2 Median, 20th and 80th percentile scores, marginal means, and 95 % CL of resident evaluations and surgeon self-evaluations for the three subsequent evaluation periods
Fig. 1 Median teaching performance scores over three subsequent evaluation periods

Overestimating teaching performance resulted in lower subsequent teaching performance as evaluated by both residents (regression coefficient (b): −0.08, 95 % confidence limits (CL): −0.18, 0.02) and surgeons themselves (b: −0.12, 95 % CL: −0.21, −0.02). Underestimating performance did not impact resident-evaluated teaching performance (b: 0.01, 95 % CL: −0.08, 0.06), while it resulted in enhanced self-evaluated performance (b: 0.10, 95 % CL: 0.03, 0.16) (Table 3). The IPC-weighted models yielded similar effect estimates and are available in Appendix Table 5.

Table 3 Unstandardized regression coefficients (b) and 95 % CLs for the associations between (resident and own) evaluation discrepancy and surgeon’s subsequent teaching performance

Surgeons’ sex was found to modify the relationship between over- and underestimating teaching performance and subsequent performance. Therefore, the models were re-estimated for male and female surgeons separately (Appendix Table 6; Fig. 2). No modification by discussion of feedback was found.

Discussion

This study showed that residents evaluated surgeons’ teaching performance higher after two cycles of evaluation, feeding back, follow-up, and re-evaluation. Surgeons’ self-evaluations of their teaching performance did not change over the years. Surgeons who overestimated their performance received lower scores from residents on their subsequent teaching performance. Surgeons who underestimated rated their subsequent teaching performance higher in self-evaluations, while surgeons who overestimated rated it lower.

Surgeons’ teaching performance was evaluated higher by residents after two cycles of evaluation, reporting, and feeding back. This finding suggests that feedback can be helpful for enhancing teaching performance. Feedback is often used to guide surgeons’ development and to enhance surgeons’ performance, and this study provides further empirical evidence that feedback systems can be effective in enhancing performance. Although the changes in performance are limited, a recent Cochrane review concluded that even small changes of this size have the potential to actually change performance in practice [29]. Surgeons’ teaching performance was enhanced after two cycles of feedback, but not after the first feedback cycle. Several factors, such as lack of time for, or low prioritization of, changing particular behaviors in response to feedback, could have delayed actual changes in behavior [23, 30]. Furthermore, there may be some distrust in the validity and usefulness of a recently developed evaluation system, and surgeons may feel discomfort with the new process of receiving residents’ feedback [23]. These factors may have impeded surgeons from changing their behaviors after the first feedback cycle. After the second cycle, surgeons, individually as well as collectively, were more familiar with the evaluation system and the process of receiving feedback, and may have given higher priority to change after receiving particular feedback twice.

Surgeons who overestimated their performance had lower subsequent teaching performance as evaluated by residents. As noted earlier, although the regression coefficients are small, they do have potential clinical relevance [29]. Several managerial and psychological studies found similar negative effects of overestimating one’s own performance [15–17, 19]. The negative effects may be caused by overestimating surgeons perceiving the feedback as inaccurate, or by other negative (emotional) reactions evoked by overestimating one’s own performance [10, 11, 17, 31, 32]. An alternative explanation for the negative effects of overestimation may be found in the different background characteristics of over-estimators compared with under- and in-agreement estimators [19]. It has been proposed that characteristics such as sex, experience, and age might influence performance (enhancement) more than the overestimation itself. Previous studies identified that over-estimators tended to be older and were more likely to be male than under- or in-agreement estimators [19, 33]. The modification by surgeons’ sex found in this study also suggests that female surgeons, who are less likely to be over-estimators, had higher subsequent performance than male surgeons. With more women entering surgery, the number of overestimating surgeons may decrease in the near future. Underestimation of performance had no influence on subsequent teaching performance as evaluated by residents. This is not surprising, since most studies in the psychological literature found little difference in performance between under-estimators and in-agreement estimators [15, 16, 19].

Surgeons who overestimated their teaching performance self-evaluated lower in subsequent evaluations, while surgeons who underestimated rated themselves higher in follow-up evaluations. These findings are in line with previous research showing that people’s most obvious reaction to external performance evaluations that disagree with their self-evaluations is to converge their self-evaluations towards the external ratings in a follow-up evaluation [15, 18, 20]. These findings can be explained by self-consistency theory, which states that people seek to minimize the discrepancy between self- and external ratings of performance [14].

In line with informed self-assessment theory [3], the results of self- and external evaluations should be integrated before drawing any conclusions about the performance (enhancement) of individual surgeons. We suggest that, at a minimum, both resident- and self-evaluated performance be considered when interpreting the performance of individual surgeons, especially since these two evaluations tend to be complementary rather than identical [5, 34].

This study involved all attending surgeons of 29 residency training programs in 13 teaching hospitals. The participation rates were high, loss to follow-up was limited to 17 % over 3 years, and several potential sources of bias (including loss-to-follow-up bias) were addressed in the data analyses, which contributed to the robustness of this study’s findings. The cut-off scores for over- and underestimating applied in this study were arbitrary, although they were similar to those of previous studies on this topic [15, 19]. Further, there was no uniform procedure for discussing the feedback. Therefore, exploring modification by discussion of feedback, and adjusting the regression analyses for it, could only be done for whether the feedback was formally discussed, not for how it was discussed. The results of this study suggest that changing performance takes time; it will therefore be interesting to study whether surgeons’ performance is enhanced further after a third, fourth, or fifth evaluation cycle. Future studies will explore the effects of evaluation over a longer follow-up period. Because self-evaluated performance remained stable while resident-evaluated performance was enhanced, fewer surgeons were overestimating their performance after two SETQ cycles. Given the finding that overestimating performance negatively impacted subsequent performance, this trend is probably beneficial for surgeons’ subsequent performance after more than two SETQ cycles.

Knowledge about whether surgeons over- or underestimated their teaching performance can be important to guide the follow-up once the feedback is received. Because surgeons who overestimated their performance were more likely to show lower subsequent teaching performance, specific guidance and support in the reflection process could help these surgeons interpret, and react constructively to, the feedback they receive. For this purpose, structured reflection methods that take the surgeon’s individual emotions and the specific content of the feedback into account may help surgeons appreciate their performance evaluation feedback [35]. However, more research is needed to explore whether tailored guidance and support, for over-estimating, under-estimating, and in-agreement surgeons, and for male and female surgeons, can enhance subsequent performance.