INTRODUCTION

Residency programs must assess resident competency in patient care, medical knowledge, practice-based learning and improvement, interpersonal and communication skills, professionalism, and systems-based practice.1 Although reliable and valid assessments of medical knowledge (e.g., United States Medical Licensing Examination, American Board of Internal Medicine certification examination) are readily available, assessment of the other competencies is more difficult.

Faculty ratings are commonly used to assess residents;2,3 however, ratings of resident competence by individual faculty are often problematic.4 For example, individual post-rotation performance ratings often miss a large percentage of the deficiencies that become apparent at summative, end-of-year progress reviews.5 Rating forms are sometimes inaccurate and provide little specific formative feedback to residents.6 It is difficult to ensure the consistent application of performance standards across a large number of faculty members, and the problem of leniency has been clearly documented.7,8 Lastly, we recognize the so-called “halo” effect, first described by Thorndike,9 in which a good or bad performance in one area colors assessments in other performance domains.10

Compared with individual faculty assessments, group discussions of clinical performance have proven beneficial in undergraduate medical education. Group discussions have been shown to identify students with marginal funds of knowledge11,12 and deficiencies in professionalism,13 while improving the quality of narrative comments and better justifying assigned grades.14 Furthermore, group discussions of learner performance provide a forum for case-based faculty development.13 Despite the known advantages of group assessment in undergraduate medical education, little is known about the use of group assessment in residency education.

The aims of this study were to determine (1) whether adding faculty group assessments of residents in an ambulatory clinic to individual faculty-on-resident assessments improves reliability and reduces the halo effect, compared with individual assessments alone; and (2) faculty perceptions of the group assessment process.

METHODS

Setting and Design

This was a prospective longitudinal study performed over the course of one academic year at a large internal medicine residency program. The Mayo Clinic Rochester Internal Medicine Resident Continuity Clinic is organized into 6 firms, each with 24 residents and a mode of 8 (range 7–11) faculty preceptors. During this study, each faculty member precepted on one or two specific afternoons each week (a fixed schedule), while the once-weekly clinic day of individual residents varied with their hospital call schedule (Fig. 1). Therefore, each resident presented a few cases to each of several faculty members over the span of several months.

Figure 1

Model of a 4-week calendar for a single firm and annual calendar of group assessments. The top part of the figure conceptually illustrates a 4-week calendar for a single firm. Faculty members (indicated by the letters A through J) precept clinic on a fixed day every week; for example, Faculty A always precepts on Mondays. In contrast, residents (indicated by numbers 1 through 24) attend clinic on a different day each week, since their clinic calendar depends on their hospital call schedule, which, in turn, is independent of the day of the week. For example, Resident 1 attends clinic on Monday during the first week and on Friday during the second week. As a result, numerous faculty members gain experience with a specific resident over time, but no single faculty member has extensive experience with any one resident. The lower part of the figure illustrates the annual calendar of group assessments, wherein the performance of residents from a single postgraduate year (PGY) is discussed quarterly in a rotating fashion.
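To make this scheduling pattern concrete, the following minimal Python sketch simulates one firm. All names and parameters (two preceptors per weekday, a uniformly random call-driven clinic day) are illustrative assumptions, not the program's actual scheduling logic:

import random
from collections import Counter

random.seed(1)
WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]
# Assumed: two preceptors per weekday (10 total, mirroring faculty A-J),
# each precepting on a fixed day every week.
faculty_by_day = {day: [f"Faculty_{day}_{i}" for i in (1, 2)]
                  for day in WEEKDAYS}
residents = [f"Resident_{i}" for i in range(1, 25)]

# Each resident attends clinic once weekly on a day driven by the hospital
# call schedule, modeled here as a uniformly random weekday.
encounters = Counter()
for week in range(12):  # roughly one quarter
    for resident in residents:
        day = random.choice(WEEKDAYS)
        for faculty in faculty_by_day[day]:
            encounters[(resident, faculty)] += 1

# Many faculty encounter each resident, but no single pairing is extensive.
r1 = [(f, n) for (r, f), n in encounters.items() if r == "Resident_1"]
print(f"Resident_1 precepted by {len(r1)} of 10 faculty; "
      f"most encounters with any one faculty member: {max(n for _, n in r1)}")

In a typical run of this toy model, each resident is precepted by most of the firm's faculty within a quarter, with only a handful of encounters per pairing: broad but shallow sampling that the group assessment is designed to pool.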

Data Collection

Individual Faculty-on-Resident Assessments

Firm faculty preceptors were selected based on their desire to teach in the continuity clinics and their teaching excellence, as determined by validated teaching assessment ratings.15 Individual faculty-on-resident assessments were independently completed by firm faculty preceptors on a quarterly basis. Previous validation research on Mayo Clinic internal medicine person-on-person assessments revealed excellent internal consistency reliability and a single dimension of clinical performance based on factor analysis.16 An average of 4.7 (range 4–8) resident forms (matching those residents scheduled for the upcoming group assessment) were completed quarterly by individual faculty members, who were instructed to document direct observations of residents and to decline assessing residents with whom they had not worked. Assessment items, structured on 5-point scales, addressed residents' performance in the following domains (linked ACGME competency in parentheses): accuracy and completeness of data gathering (patient care), effectiveness of interpretation (patient care), appropriate test selection (patient care), illness management (patient care), provision of follow-up (patient care), patient recruitment and retention (communication), and commitment to education (professionalism). Anchors (1 = needs improvement, 3 = average, 5 = top 10%) were provided for each performance domain. Individual assessments were submitted electronically prior to the group assessment.

Group Assessments

Group assessment sessions in each of the six firms were scheduled quarterly (Fig. 1) for 90 min overlapping the usual lunch hour. Sessions were facilitated by the respective firm chief (a faculty member who administers the firm) and attended by firm faculty, who had submitted individual faculty-on-resident assessments beforehand. Reviews of six to eight residents, grouped by year of training, were scheduled for each session, allowing 10–15 min of discussion per resident. Following group discussion of each resident, the firm chief electronically submitted a single group faculty-on-resident assessment form, with questions identical to those on the individual faculty-on-resident forms and with scores and narrative comments reflecting the consensus of the firm preceptors.

Group assessments were added to the existing components of resident assessment in the continuity clinics, which were left unchanged. Residents are evaluated quarterly in the clinic setting. Continuity clinic advisors (who serve as preceptors in the resident’s firm) meet quarterly with residents, during which time the resident’s performance in clinic is reviewed.

Faculty Opinion Survey

Faculty opinion surveys were completed immediately following each of the quarterly group assessment sessions. The survey, which was not pre-validated, asked faculty whether they agreed (on a 5-point scale) that the group assessment method had improved their knowledge of each resident's strengths, weaknesses, and learning plan; improved their understanding of assessment in general and confidence in their own assessment skills; and improved resident assessment in the continuity clinic setting. Faculty members' overall satisfaction with the group assessment method was then queried.

Statistical Analysis

The Wilcoxon signed-rank test was used to compare mean scores, for each item and overall, between group and individual-rater assessments. The intraclass correlation coefficient (ICC) was calculated using an ANOVA model that included evaluator, evaluatee, and their interaction term. The ICC and its 95% confidence interval were calculated using a random sample of 49 evaluators for individual assessments and 6 evaluators for group assessments. ICC was interpreted as follows: <0.4, poor; 0.4 to 0.75, fair to good; and >0.75, excellent.17 Halo error, defined in this context as giving similar or identical scores across all item domains, was determined by calculating inter-item correlations, an approach used in previous education research18 and the most traditional method.19,20 Higher inter-item correlations were interpreted to reflect greater halo error, and lower inter-item correlations to reflect less. Mean scores were determined for faculty survey responses.
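As a minimal, hypothetical illustration of these calculations (not the study's actual analysis code), the following Python sketch applies the Wilcoxon signed-rank test to paired scores, computes the mean inter-item correlation used here as the halo index, and derives a two-way random-effects intraclass correlation, ICC(2,1), from ANOVA mean squares. The synthetic data, variable names, and fully crossed design are all assumptions:

import numpy as np
import pandas as pd
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# --- Paired comparison of overall scores (Wilcoxon signed-rank) ---
# Hypothetical per-resident mean scores under each assessment method.
individual = rng.normal(3.83, 0.38, size=136)
group = individual + rng.normal(0.09, 0.20, size=136)
stat, p = wilcoxon(group, individual)
print(f"Wilcoxon signed-rank: W={stat:.1f}, p={p:.4f}")

# --- Halo index: mean inter-item correlation across the 7 items ---
items = pd.DataFrame(rng.normal(3.9, 0.5, size=(136, 7)),
                     columns=[f"item{i}" for i in range(1, 8)])
corr = items.corr().to_numpy()
inter_item = corr[np.triu_indices_from(corr, k=1)].mean()  # off-diagonal mean
print(f"Mean inter-item correlation: {inter_item:.2f}")

# --- Two-way random-effects ICC(2,1) from ANOVA mean squares ---
def icc_2_1(ratings):
    """ratings: n_targets x k_raters matrix, fully crossed (a
    simplification; the study's design was unbalanced)."""
    n, k = ratings.shape
    grand = ratings.mean()
    ms_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    resid = (ratings - ratings.mean(axis=1, keepdims=True)
             - ratings.mean(axis=0, keepdims=True) + grand)
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

ratings = (rng.normal(3.9, 0.4, size=(24, 6))
           + rng.normal(0, 0.3, size=(24, 1)))  # shared per-resident signal
print(f"ICC(2,1): {icc_2_1(ratings):.2f}")

Under the interpretation described above, a lower inter-item correlation would be read as reduced halo error, and an ICC above 0.75 as excellent inter-rater reliability.17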

RESULTS

A total of 679 individual faculty-on-resident assessments and 136 group faculty-on-resident assessment forms (representing 94% of the group assessment discussions) were collected over the 1-year period. Faculty opinion data were collected from 16 of 18 (89%) sessions.

Scores on items for individual and group assessments are shown in Table 1. Mean scores on 3 of 7 individual items, and the overall average of item mean scores, were significantly higher for group than for individual assessments (overall, 3.92 ± 0.51 vs. 3.83 ± 0.38; p = 0.0001). Item-level inter-rater reliability for individual assessments alone and with the addition of group assessments is shown in Table 2. Overall inter-rater reliability increased when group assessments were combined with individual assessments compared to individual assessments alone (intraclass correlation coefficient, 95% CI: 0.828, 0.785–0.866 vs. 0.749, 0.686–0.804). Inter-item correlation was lower for group (0.49) than for individual (0.68) assessments, signifying reduced halo error.

Table 1 Individual Versus Group Assessment Item Mean Scores
Table 2 Inter-Rater Reliability and Inter-item Correlation of Individual Assessments Alone and with the Addition of Group Assessments

Faculty opinions regarding the group assessments are shown in Figure 2. In general, faculty agreed (on a 5-point scale) that the group assessments improved their knowledge of resident strengths (4.4 ± 0.7), weaknesses (4.5 ± 0.7), and learning plans (3.9 ± 0.9); and improved their understanding of resident assessment (3.8 ± 1.1) and confidence in their assessment skills (3.7 ± 0.9). Faculty strongly agreed that group assessments improved resident assessment in the continuity clinic setting (4.6 ± 0.7) and highly rated their overall satisfaction with the group assessment method (4.3 ± 0.9).

Figure 2

Faculty opinions regarding group assessments (n = 87). Faculty members were asked immediately following each group assessment whether the discussion improved their understanding of resident strengths, weaknesses, and learning plans; improved their understanding of the process of assessment; improved their confidence in their own assessment abilities; and improved the overall process of assessment. They were then asked to rate their overall satisfaction with the process.

DISCUSSION

Residency programs are charged with accurately assessing resident competency across multiple domains for both formative and summative purposes,1 yet this task is challenging for many reasons. Individual faculty members have only brief contact and limited exposure to resident behaviors, so they may miss significant deficiencies in performance.5 Additionally, faculty members often provide minimal feedback21 and may place variable emphasis on different performance criteria.7,10 We report an integrative faculty group assessment model for internal medicine continuity clinics that uses pooled faculty observations to identify learner deficiencies, enable more specific feedback, and provide ongoing case-based faculty development. This study demonstrates improved inter-rater reliability and reduced range restriction (halo effect) of resident assessment across multiple performance domains by incorporating the group assessment method.

Evaluating residents is becoming more challenging. Performance criteria are increasingly explicit, which makes assessments more complex, and faculty practice burdens are growing. Recent recommendations to extend resident duty hour restrictions,22 while potentially important for patient safety, may further reduce resident-faculty continuity. Fortunately, the group assessment model can mitigate these challenges because pooling individual assessments2 enables faculty to "connect the dots" between separate brief observations and discern the larger pattern of performance. Group leaders can ensure that faculty members within a group apply consistent performance standards across many residents. Over time, the group assessment model serves as case-based faculty development with the potential to improve faculty members' knowledge of specific residents, their understanding of the assessment process in general, and their confidence in resident assessment. This type of group assessment model can also be an important component of outcomes-based assessment within residency programs.23 Properly used, the information gleaned from group assessments may provide residents with more discriminating formative feedback so they can remedy deficiencies and become better physicians.

The use of multiple assessment modalities can potentially minimize the deficiencies of any one method.2,3,24 Although faculty ratings are the most common method for assessing residents' clinical performance,2 this method is fraught with rater errors,4 the most common of which is the "halo effect": the tendency of faculty raters to give similar scores across different domains of a resident's performance even when the domains are clearly separate.4,18,25 Related to this is "grade inflation," in which ratings are globally higher than a learner's performance warrants. Prior studies suggest that a committee process for making progress decisions may reduce grade inflation;26 our study extends this literature by demonstrating a reduced halo effect through the use of a group assessment method. Interestingly, however, the average scores across all performance items in our study were higher with group assessments than with individual assessments, although by only a small amount. One explanation for this finding is that group assessment improved faculty members' ability to identify the specific domains in which a resident was excellent and those requiring improvement, with the greater emphasis falling on domains of excellent performance.

We implemented group assessments in the outpatient setting to take advantage of the longitudinal relationship between our Continuity Clinic faculty and residents over their 3 years of training. However, most previous studies of the group assessment model occurred in the inpatient setting, where observations from faculty, senior residents, and junior residents were gathered for student assessment. Hemmer and colleagues have shown that formal evaluation sessions, compared with individual assessments, better identified core clerkship students with marginal funds of knowledge and episodes of unprofessional behavior in the hospital setting,11–13 and additionally served as case-based faculty development sessions.27 The current study extends these findings to resident assessment in the context of the ACGME competencies and to the outpatient ambulatory clinic setting, and adds support to the important role formal evaluation sessions can serve in continuous faculty development. Schwind and colleagues found that discussion at a surgical resident evaluation committee, compared with individual assessments, identified an increased number of deficiencies across three performance domains.5 By comparison, our study provides evidence of significant differences in reliability and halo effect between the two methods of assessment. Future studies should examine inpatient group assessment for resident physicians, which could combine assessments from attending physicians, resident peers or senior medical residents, and allied health staff. Additionally, group assessments of senior residents might incorporate junior residents' feedback on teaching skills.

Group assessments in this study were remarkably inexpensive and efficient. Ninety-minute sessions were conducted quarterly, overlapping the lunch hour to minimize the impact on faculty's clinical productivity and other academic commitments. Lunch was provided, and meeting rooms with projectors—used to view photographs of residents and documented clinical observations—enhanced the discussions. Firm chiefs were given administrative time to compile information from the group assessment discussions. Finally, information discussed during the group assessments was obtained partly from individual continuity clinic faculty assessments (for example, direct observations of resident clinical performance), the completion of which is already a standing expectation for teaching faculty.

This study has several limitations. First, it was performed at a single institution, which may limit the generalizability of the findings. The group assessment model, however, has demonstrated feasibility at other institutions in clinical clerkships in both inpatient and outpatient settings.11,13,28 Given the relatively low costs of implementation and the adaptability of the group assessment model to different institutional environments and practice settings, we believe this model has universal appeal. Second, this study did not assess the impact of group assessments on resident outcomes. Nonetheless, we found that these assessments provide discriminating assessment data across domains of resident performance, which should allow more specific feedback to residents and thus enhance their potential for improvement. Third, this study did not assess whether group dynamics influenced the groups' collective assessments. However, prior research on the effects of group dynamics on resident progress committee deliberations suggests that a controlled group decision-making process, such as that used in our group assessment sessions, does not compromise the validity of assessment.29 Fourth, although formal assessments of students have been found to help identify lapses in professionalism, we did not target the professionalism domain quantitatively with a dedicated Likert-scaled item. However, we observed that narrative comments during the group assessments often provided information that could not be gleaned from the individual assessment scale data alone; in particular, issues pertaining to communication and professionalism often emerged through group discussion. This accords with previous qualitative studies at our institution showing that narrative comments provide valuable information regarding interpersonal dynamics beyond what quantitative questionnaire data capture.30 Finally, faculty members' satisfaction and improved confidence with group assessment were surveyed with a non-validated instrument (so the measurement properties of this instrument are unknown) and were not substantiated by objective measures. Future studies should assess whether these perceptions can be confirmed objectively.

In summary, this study demonstrates that integrating group assessment into an internal medicine continuity clinic improved the reliability and reduced the range restriction (halo effect) of resident assessment across multiple performance domains compared with traditional individual faculty-on-resident assessment alone. This model, as one component of a global assessment system, should help graduate medical education programs achieve reliable and discriminating resident assessments despite increasing practice demands and faculty-resident discontinuity; it may therefore be one solution to the challenges posed by forthcoming duty hour reforms. Future research should assess the efficacy of this model in hospital settings, investigate whether it leads to improved resident outcomes compared with traditional assessment models, and determine whether its use leads to objective improvement in faculty assessment skills.