Introduction

The objective structured clinical examination (OSCE) is widely used in medical education assessment. Considerable research has been conducted to increase the reliability of such examinations; however, until recently few investigations have focused on the associated validity issues. In this paper we report on a mixed methods approach that we used to probe more deeply into the complex processes underlying OSCE examinations.

Background

Since the late twentieth century, OSCEs have become a major component of medical education assessment (Boursicot and Burdick 2014; Cox et al. 2007; Regehr et al. 2011). Statistical advancements such as those outlined by Brennan (2001), Cronbach et al. (1972), and Linacre and Wright (2002) have allowed researchers to estimate outcomes more accurately and to streamline assessments (Khan et al. 2013; Medical Council of Canada 2013). Generally, research in medical education assessment has been devoted to examining the reliability of assessments (Fuller et al. 2013; Hodges and McIlroy 2003; Hodges et al. 1999; Liao et al. 2010; Regehr et al. 1998). More recently, researchers have become interested in the nature and validity of the assumptions made by raters (Berendonk et al. 2012; Gingerich et al. 2014a, b; Johnston et al. 2013; Kogan et al. 2011; Wood 2014). Drawing on theories from social and cognitive psychology, emerging medical education research is focusing on understanding the underlying, hidden structures that govern rater cognition (Gingerich and Eva 2011). In parallel, researchers in the broader assessment community have moved a step further and developed conceptual models that illustrate the cognitive processes of raters when scoring open-ended items (Crisp 2012; Eckes 2012; Joe et al. 2011; Kishor 1990, 1995; Wolfe 2004, 2006). In this paper we describe a mixed methods approach to advance this effort in medical education. We anticipate that the findings will provide insights particular to OSCE ratings and contribute to the larger assessment community's understanding of the tacit priorities of examiners rating performance for the workplace.

Hidden structures

Interpretation of results has always been an area of concern in assessment, due to the level of inference needed to build evidence towards the validity of conclusions about examinees (Cronbach and Meehl 1955; Kane 2013; Kane and Bejar 2014; Kelley 1927; Messick 1975; Shepard 1997). In performance assessment settings such as OSCEs, interpretations of candidates' results are made by human examiners. For instance, during an OSCE an examiner can quickly rate the History Taking skills of a candidate, provide a score, and move to the next candidate using a rating scale as a guide. However, if we were to interview the examiner after the fact, they may provide general impressions that may or may not coincide with the ratings (Williams et al. 2003). We know little about the hidden structures that govern raters' cognition. However, in order to provide precise estimates of candidate performance, at least two elements are needed: congruent definitions of a construct (e.g. History Taking) and consistent measurement. The latter concern can be readily addressed with the numerous psychometric models used to investigate reliability. The former relates to validity and requires in-depth investigations into the structures, often latent, that govern rater decisions.

How raters decide what is important in an exam is invariably a hidden part of the assessment process. When ratings are produced, there is an assumption that the examiners’ beliefs about what is important are universal and that these are accurately reflected in a candidate’s score. However, researchers have found that identical examiner ratings may be based on different rationales (Douglas and Selinker 1992). Therefore, a major part of the justification of assessment procedures, not yet fully explored, is understanding what raters are thinking during the exam (Bejar 2012; Messick 1994). Knowledge of examiners’ understanding of the assessment process is necessary to ensure the appropriateness of that process, the salience of the results and the utility of the OSCE in this context.

Methodology

Study setting

In North America, International Medical Graduates (IMGs) play a critical role in health care systems. Approximately one-quarter of family physicians have had some education outside North America (Boulet et al. 2009; Canadian Institute for Health Information 2009; Norcini et al. 2014; Walsh et al. 2011). In Canada, the assessment process for IMGs is provincially administered. The College of Physicians and Surgeons of Nova Scotia has one of the oldest IMG programs in Canada: the Clinician Assessment for Practice Program (CAPP). The CAPP OSCE differs from many other OSCEs, as it is a practice-ready assessment. A candidate has to pass a rigorous review process that includes the CAPP OSCE as well as a written therapeutics exam and a file review. After the review process, successful IMGs receive a defined license that requires their practice to be mentored for 8 months, and they must subsequently obtain certification from the College of Family Physicians of Canada within 4 years.

The examiners who participate in the CAPP OSCE are physicians currently practicing in Nova Scotia. Some are IMGs themselves and others have experience supervising IMGs. The CAPP OSCE comprises twelve 12-min stations with 3-min intervals between them. All stations are monitored, and some are recorded during the exam for quality assurance. Over the past 8 years, the CAPP OSCE has evolved to a high level of standardization and reliability, from case development to implementation. The program has conducted several internal research studies to ensure consistency and optimize the exam (Maudsley 2008).

This study utilized data from the 2010–2014 CAPP OSCE administrations. The quantitative data were examined along with four follow-up cognitive interviews conducted with examiners after the 2014 administration. The interviews were intended to explore how examiners conceptualized practice-ready competence. The study was conducted with the approval of the Mount Saint Vincent University Ethics Board and the College of Physicians and Surgeons of Nova Scotia.

Hierarchical Linear Modeling

Hierarchical Linear Modeling (HLM) is a multivariate statistical technique, an extension of regression modeling developed in the early 1980s to model nested data structures (Goldstein 1986; Raudenbush and Bryk 1986; Wong and Mason 1985). Since its development, the approach has become widespread and is used across fields ranging from economics to sociology and developmental psychology. Researchers have illustrated ways in which logistic HLM can be used as a measurement model comparable to Rasch modeling (Kamata et al. 2008; Beretvas and Kamata 2005; Kamata 2001). The HLM approach lends itself to OSCE data because stations are nested within persons and persons are nested within streams or cohorts. It also provides more accurate estimates of performance than traditional analysis methods because the data do not have to be aggregated or disaggregated (Osborne 2000). In this study we used Ordinal Logistic Hierarchical Linear Modeling (OLHLM) to gain insight into which competencies are most predictive of the Overall Global score, and thereby to identify which competencies practicing family physicians believe are most critical to being practice-ready.
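The study itself fitted the model in HLM 6.02, but for readers who want to experiment, the sketch below shows a roughly equivalent cumulative-link (ordinal logistic) mixed model in R. The data frame osce, its column names, the candidate identifier, and the use of the ordinal package's clmm() are all assumptions for illustration; they are not the study's actual code or parameterization.

```r
# Minimal sketch only: a hypothetical long-format data frame 'osce' with one
# row per candidate-by-station observation, an ordered Overall Global rating,
# station-level competency ratings, and a candidate identifier.
library(ordinal)  # provides clmm() for cumulative-link mixed models

osce$global <- factor(osce$global,
                      levels  = c("Poor", "Borderline", "Satisfactory", "Very Good"),
                      ordered = TRUE)

# Two-level version of the nesting described above: station-level ratings
# (level 1) within candidates (level 2), via a candidate random intercept.
fit <- clmm(global ~ history + communication + counseling + inv_mgmt + (1 | candidate),
            data = osce, link = "logit")
summary(fit)  # fixed-effect coefficients indicate which competencies predict Overall Global
```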

Quantitative analyses

Data from 204 IMGs who participated in the CAPP OSCE in the years 2010 through 2014 were used. In 2010 the OSCE had 14 stations; since 2011 there have been 12 stations. Station cases are not repeated within a 3-year period. The CAPP did state that two cases were repeated over the 5 years; however, these were anonymized in the data set and as a result were treated independently. At each station candidates were rated on a scale of 1 (Inferior), 2 (Poor), 3 (Borderline), 4 (Satisfactory), 5 (Very Good), and 6 (Excellent) on eight competencies:

  • (1) History Taking (HIST),

  • (2) Physical Exam (PE),

  • (3a) Physician Examiner Rated Communication Skills (PECOMM),

  • (3b) Simulated Patient Rated Communication Skills (SPCOMM),

  • (4a) Physician Examiner Rated Quality of Spoken English (PEQSE),

  • (4b) Simulated Patient Rated Quality of Spoken English (SPQSE),

  • (5) Counseling (COUN),

  • (6) Professional Behavior (BEHV),

  • (7) Problem Definition and Diagnosis (PDD), and

  • (8) Investigation and Management (INMAN).

Candidates were also rated at each station on the outcome variable (Overall Global), which used the same 6-point scale. Since there were very few scores in the Inferior and Excellent categories, these were combined with the adjacent categories for analysis, resulting in four categories: Poor, Borderline, Satisfactory and Very Good. Due to ethical restrictions, examiner identities were randomly coded; as a result the researchers could not determine whether an examiner took part in more than one year of administration. However, the CAPP confirmed that some examiners return from year to year. In each year, all examiners participated in a mandatory orientation that included both online and face-to-face training components. SPSS 21 (IBM Corp. 2012), HLM 6.02 (Raudenbush et al. 2004) and the ggplot2 package (Wickham and Chang 2015) for R version 3.1.3 (CRAN 2015) were used to conduct the quantitative analyses.
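As a small illustration of the category collapsing described above (Inferior merged into Poor, Excellent merged into Very Good), something like the following R snippet would do the job; the data frame and column names are hypothetical, not taken from the study's code.

```r
# Collapse the 6-point scale (1 Inferior ... 6 Excellent) into four analysis
# categories: 1-2 -> Poor, 3 -> Borderline, 4 -> Satisfactory, 5-6 -> Very Good.
# 'osce' and 'global_raw' are hypothetical names.
osce$global <- cut(osce$global_raw,
                   breaks = c(0, 2, 3, 4, 6),
                   labels = c("Poor", "Borderline", "Satisfactory", "Very Good"),
                   ordered_result = TRUE)
table(osce$global)  # check the resulting category counts
```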

Cognitive interviews

Asking participants to think aloud during a task is a common way to investigate the cognitive processes underlying that task. However, due to the nature of an OSCE, we cannot interrupt examiners to ask what they are thinking. Cognitive interviewing techniques were therefore used in this study to elicit the thought processes of examiners (Willis 2005). These techniques are very flexible and may be applied in vivo or retrospectively. In a cognitive interview, the interviewee conducts a task and is then asked to describe what they did and why. The researcher can delve deeply, using probes to encourage the interviewee to describe their thinking more comprehensively.

In the 2014 administration of the CAPP OSCE, Physician Examiners were asked to participate in an interview during which they watched a video of themselves in the exam room with the candidate and the standardized patient. The video was paused every 2 min and the examiner was asked to describe their thinking. All of the examiners in the exam (n = 24) were invited to participate in the study. However, because the ethics protocol required simulated patient, candidate and examiner consent, only 10 videos qualified for use in the study. The examiners in those 10 videos were contacted and four agreed to take part in an interview; the remaining examiners did not respond to invitations. The four participants received a $150 honorarium for their time.

The interviews were video recorded and transcribed verbatim. To conduct the analysis, the interview transcripts were divided into 12 two-minute segments. The segments were placed on a grid to allow the researchers to look across the 12 segments and discern any patterns (Miles et al. 2014). We initially designed this study as a mixed methods study, planned as parallel data collection with integration during interpretation, in which the primary themes would be presented alongside comparable quantitative results (Creswell et al. 2011). However, due to the small sample, we were unable to draw an overall generalization or interpretation of the patterns from a thematic analysis. Instead, we explored specific excerpts that provide further insight into the quantitative results. Excerpts were chosen to be representative of similar interview stages across the participants.

Results

The quantitative data comprised 2443 station-level observations per competency from 204 candidates (199 complete cases) over 5 years. Table 1 provides the mean and standard deviation of each competency at the station level.

Table 1 Descriptive statistics on competencies at each station

As shown in Table 1, the competencies were not assessed an equal number of times; for example, Physical Exam (1214 station-level observations) was present in approximately half of the stations, while History Taking was present in all of them. For every competency the standard deviation was approximately 1 point; thus, the majority of scores fell within one point of the mean.

The Overall Global outcome score (transformed to a 4-point scale) was not generated from an average of the individual competencies but was provided as a separate overall judgment of the candidate's performance at that station. This variable does not appear in Table 1 because it was the outcome variable. There were 2415 complete observations of the Overall Global outcome; the remaining 28 had missing values. Of the 2415 station-level observations, 21 % were Poor, 33 % Borderline, 28 % Satisfactory, and 16 % Very Good. It is important to note the high-stakes nature of this exam: candidates failed stations more often than they passed them, which is commonplace in practice-ready IMG OSCEs (MacLellan et al. 2010).

Ordinal Logistic Hierarchical Linear Modeling results

The modeling procedure for OLHLM is similar to that for regression. The competencies were examined prior to inclusion in the model and were removed if not predictive, in order to achieve a parsimonious model. The variance–covariance matrix revealed a value close to 1 for Problem Definition and Diagnosis (PDD) and Investigation and Management (INMAN), suggesting collinearity. We selected Investigation and Management (INMAN) for inclusion in the model as it was the stronger predictor of the Overall Global score. Only four competencies [Professional Behavior (BEHV), Physician Examiner Rated Communication Skills (PECOMM), Counseling (COUN), and INMAN] were significant predictors. The other competencies, History Taking (HIST), Physical Exam (PE), and Quality of Spoken English (QSE), were not significant. The final model is presented in the Appendix and Table 2 provides the results of the modeling.
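The collinearity screen described above can be reproduced in spirit with a simple correlation matrix (the standardized analogue of the variance–covariance matrix). The R sketch below is illustrative only; the object and column names are hypothetical.

```r
# Hypothetical check: pairwise correlations among station-level competency
# ratings. A value near 1 for PDD and INMAN would flag the collinearity
# described above.
comps <- osce[, c("HIST", "PE", "PECOMM", "SPCOMM", "PEQSE",
                  "SPQSE", "COUN", "BEHV", "PDD", "INMAN")]
round(cor(comps, use = "pairwise.complete.obs"), 2)
```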

Table 2 Ordinal Logistic Hierarchical Linear Modeling results

The OLHLM results are based on the log-odds of the four category outcomes, estimated through three connected equations, the first of which (see the Appendix) represents the lowest (i.e. Poor) category. Since a Poor outcome was less likely than the other categories combined, all of the coefficients were negative. The greater the absolute magnitude of a competency's coefficient, the more substantial that competency was as a predictor of Overall Global at the station level. The most predictive competency was Investigation and Management (INMAN, coefficient = −0.92) and the least was Professional Behavior (BEHV, coefficient = −0.26).
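For readers less familiar with ordinal logistic models, a generic cumulative-logit form is sketched below. This is an illustration of the general model class, not the exact parameterization given in the Appendix; the symbols (thresholds, coefficients, random effect) are ours.

```latex
% Generic cumulative-logit mixed model for candidate j on station i,
% category k = 1, 2, 3 (Poor; Poor or Borderline; Poor, Borderline or Satisfactory):
\log\!\left[\frac{P(Y_{ij} \le k)}{1 - P(Y_{ij} \le k)}\right]
  = \delta_k
  + \beta_{\mathrm{INMAN}}\,\mathrm{INMAN}_{ij}
  + \beta_{\mathrm{COUN}}\,\mathrm{COUN}_{ij}
  + \beta_{\mathrm{PECOMM}}\,\mathrm{PECOMM}_{ij}
  + \beta_{\mathrm{BEHV}}\,\mathrm{BEHV}_{ij}
  + u_j
```

Here the three values of k correspond to the three connected equations, the delta_k are category thresholds, and u_j is a candidate-level random effect. Under this convention a negative competency coefficient means that higher ratings on that competency lower the odds of falling in or below a low Overall Global category.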

The OLHLM is expanded upon in the Appendix. Using the results in Table 2 and functions 6–9 in the Appendix, we can derive probability estimates based on all candidates and stations. We estimate that, at a typical station, a given candidate had a 12 % chance of being in the Poor category, a 50 % chance of Borderline, a 33 % chance of Satisfactory and a 5 % chance of Very Good. These estimates differ from the observed proportions of Poor/Borderline/Satisfactory/Very Good ratings because they are based on the probability of an outcome under our parsimonious HLM model. The large coefficient on Investigation and Management indicates that, in order to pass, a candidate had to score high on that competency.
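Purely to illustrate the mechanics of that conversion, the R snippet below plugs hypothetical threshold values into the cumulative-logit relationship. The thresholds are made up so that the arithmetic roughly reproduces the percentages quoted above; they are not the fitted values from Table 2, nor the exact functions 6–9 from the Appendix.

```r
# Hypothetical illustration of how four category probabilities fall out of
# three cumulative logits. The 'delta' values are invented for this sketch.
eta   <- 0                         # linear predictor for a "typical" candidate/station
delta <- c(-1.99, 0.49, 2.94)      # hypothetical thresholds: <=Poor, <=Borderline, <=Satisfactory

cum   <- plogis(delta + eta)                 # cumulative probabilities P(Y <= k)
probs <- c(cum[1], diff(cum), 1 - cum[3])    # successive differences give category probabilities
names(probs) <- c("Poor", "Borderline", "Satisfactory", "Very Good")
round(probs, 2)                              # approx. 0.12, 0.50, 0.33, 0.05
```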

Conceptualizing competence: introducing the interview data

To illustrate our results, we substituted plausible values (−3 to 3, at 0.1 intervals) into the model for each competency while holding the others constant (i.e. at zero). The following two figures show the estimated probability of each Overall Global outcome (vertical axis) as the score on an individual competency (horizontal axis) moves from low to high.
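The figures were produced with ggplot2, and the sketch below shows one way curves of this sort can be generated. The threshold values and the coefficient are placeholders rather than the fitted estimates, and the model's exact parameterization is in the Appendix; the sketch simply follows the generic cumulative-logit form given earlier.

```r
# Sketch of how probability curves like those in Figs. 1 and 2 can be drawn.
# The thresholds and coefficient below are hypothetical placeholders.
library(ggplot2)

delta <- c(-2.0, 0.5, 2.9)   # hypothetical thresholds: <=Poor, <=Borderline, <=Satisfactory
beta  <- -0.9                # hypothetical competency coefficient (negative, as in Table 2)
score <- seq(-3, 3, by = 0.1)

cum  <- sapply(delta, function(d) plogis(d + beta * score))   # P(Y <= k) at each score
prob <- cbind(cum[, 1], cum[, 2] - cum[, 1], cum[, 3] - cum[, 2], 1 - cum[, 3])
colnames(prob) <- c("Poor", "Borderline", "Satisfactory", "Very Good")

plot_df <- data.frame(score    = rep(score, times = 4),
                      category = factor(rep(colnames(prob), each = length(score)),
                                        levels = colnames(prob)),
                      p        = as.vector(prob))

ggplot(plot_df, aes(x = score, y = p, linetype = category)) +
  geom_line() +
  labs(x = "Competency rating (others held at 0)",
       y = "Estimated probability of Overall Global category")
```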

Figure 1 shows that the lines for each rating remained fairly flat as ratings on Communication and Professional Behavior increased. This suggests that it is difficult to predict the likelihood of a candidate receiving any particular Overall Global outcome using Communication or Professional Behavior alone. However, we can see that as scores move from low to high, the Satisfactory line ascends; these competencies appear to have been more predictive of a Satisfactory outcome than of the other outcomes.

Using the interview data, we can explore further why these two competencies were of relatively low value to examiners. In his interview, Examiner 1 described both of these competencies as interactions:

Researcher: Can you describe, what do you mean by interaction?

Examiner 1: You know the stuff that might want to build a rapport so that people can communicate to their patients, even getting to the meat of the substance. You know, the meet and greet, the sort of eye contact, how are you doing, what is your problem today.

Interactions that can be described as communication and professional behavior seem to be important at the very beginning of a 12-min OSCE. Examiner 4 described his/her thinking about a candidate who seemed to be at the borderline level.

Researcher: What are you thinking now?

Examiner 4:

… Probably the same what I was thinking at that time. Good questions. Some of them are certainly relevant for lab results and that’s good. What’s bad is, that we seem to be using the medical terms more than I would like for the patient of this nature, this type of a person. This supposed to be very straightforward, regular citizen who works at a diner and telling her “levels are high”, she is just going to (the examiner is giving over the head gesture). You have to explain that. And it doesn’t take long, you have to use simple words.

It was easy for examiners to discern when a candidate was not doing well when it came to communication. When communication was not an issue, an examiner began to evaluate other components of the OSCE. For example, Examiner 2, part way through the OSCE, had already moved on to History Taking:

Researcher: At this point can you describe what were you thinking?

Examiner 2:

So, in this case he took the history of what was going on with her and then I listened to see whether they’ll branch out to do a comprehensive to cover the past medical history, current medications, any allergies, to cover a bit of social history, family history and a bit of a review of the system. So, he is progressing on those lines. So, right now I’m thinking; OK, he has finished off his history of what’s going on with her, the history of the present illness and now he is moving on to past medical history, social history, medications etc. So, I’m thinking; all right this is reasonably well organized, his history taking. We can sort of see that they’re going on the right track or not.

While all of the examiners described in detail the good and poor qualities of the candidate's performance, they did so using a process: they were first looking for Communication and Professional Behavior, followed by History Taking. The point in the interviews at which they seemed to begin separating candidates into categories was about halfway through the OSCE, when they expected the candidates to start describing the issues at hand and to begin counseling the standardized patient. This is reflected in the quantitative data. The two figures below illustrate the quantitative results on Counseling and on Investigation and Management.

In comparison to Fig. 1, in which the lines were relatively flat, the lines in Fig. 2 show much more movement as competency ratings shift from low to high. For Investigation and Management, the Very Good category (solid line) ascends substantially as scores increase, while Satisfactory begins to descend. Counseling is also a strong predictor, with more pronounced lines than those presented in Fig. 1.

Fig. 1 Side by side graphs of estimated Overall Global outcome by ratings of Communication and Professional Behavior. Note: Each line represents the change in probability of being in a given category (1. Very Good, 2. Satisfactory, 3. Borderline, 4. Poor) as a function of the competency (i.e. Communication)

Fig. 2 Side by side graphs of estimated Overall Global outcome by ratings of Counseling and Investigation and Management. Note: Each line represents the change in probability of being in a given category (1. Very Good, 2. Satisfactory, 3. Borderline, 4. Poor) as a function of the competency (i.e. Counseling)

About halfway through the OSCE (i.e. the interview segments at 6 and 8 min), the examiners were looking for clues as to whether candidates had detected the problem through an investigative process and were managing the information gathered from the standardized patient. Examiner 2 reflected on her thinking during the station:

Researcher: What are you thinking now?

Examiner 2:

I remember thinking when he asked about her family history if she had any history of thyroid disease, I thought he’s got it. So, that was good. He told her what he was thinking which I think is quite reasonable because I often tell patients what I think is going on before I examine them…So, clearly he’s taking a good history, he is coming to a conclusion that she’s got a thyroid problem.

All of the examiners were looking for candidates to narrow the possibilities to the most salient and important ones, make the clinical diagnosis, and then translate it to the patient in a consultative, professional manner. When a candidate was doing well, it was qualitatively very different from when they were not doing as well. Examiner 4 described this honing process as "selectivity" and explained his thinking when a candidate did not perform this skill:

Researcher: What are you thinking now?

Examiner 4:

If he came here with an abdominal pain, you know, to me while you’re listening to what they say, you examine the system that you think where the money is and then you start moving out. Not the other way around. Just throwing the net and hoping to catch something but that’s not the way you work in the real world practice where you see patients every 5–10 min. You don’t have 30 min per patient. So selectivity is a big thing. You have to be able to focus. This guy is not focused at all… Anyone can do that. And this is not what I would say is practice-ready.

In this quote from the last few minutes of the interview, the examiner was referring to a much more complex skill than investigation and management alone. All of the examiners suggested that physicians experience time pressures, especially in a rural setting. The examiner here was referring to the cognitive process expected of the physician: the ability to detect the problem, manage the information, and develop a treatment plan in a very short amount of time. Thus, there were two abilities examiners were looking for when assessing practice readiness: first, being able to investigate and manage the information presented by the patient, and second, doing so within the first 4–6 min of the encounter.

Discussion

This paper combined an extended quantitative data set with focused qualitative interviews in order to explore why examiners valued some things more than others in the CAPP OSCE. The quantitative evidence shows that examiners valued Investigation and Management above all other competencies. This competency requires the capture of information from a patient through an investigative process, the management of all the evidence, and the formation of a conclusion. The examiners referred to it as "where the money is and then you start moving out" and "coming to a conclusion." In the minds of the examiners, this complex skill of investigation and management is the "key feature" of the OSCE, following the usage of Page et al. (1995). While there may have been individual differences in what was valued and when, and despite the often heterogeneous performance of IMGs, the results across 5 years of data suggest that examiners were universally looking for the same competency manifestations in candidates. This echoes Gingerich et al.'s (2014b) recent identification of a limited number of patterns of social judgments and their conclusion that examiners may not be as idiosyncratic as once theorized.

The qualitative evidence also suggests that examiners began to make and confirm their judgments halfway into the station. In addition to Investigation and Management and Counseling skills, they were looking for the sequencing of the encounter to be paced in such a way that IMGs were able to make a diagnosis within 6–8 min and then begin counseling the patient. Although more data are needed to confirm this, there were hints that examiners expect practice-readiness to amount to the ability to diagnose accurately from a "thin slice" of information (Ambady and Rosenthal 1992).

This study expanded our current understanding of rater cognition by using a novel methodology; however, we wish to acknowledge several limitations. First, the quantitative data were collected from a single province in Canada, so the findings may not generalize to other Canadian contexts. Further work is needed to determine whether similar findings can be replicated in other jurisdictions. Second, we were unable to account for examiners who participated repeatedly in the CAPP OSCE from 2010 to 2014. While we believe the overall model would remain consistent, accounting for repeated examiners would provide more accurate estimates of the competency coefficients. Lastly, the qualitative data were collected from only four examiners. While we focused on themes that were consistently reported across all examiners, a larger sample might have captured additional themes or more variability in the beliefs and assumptions of examiners with different training and professional backgrounds.

Implications

What can we do with the knowledge that Investigation and Management (INMAN) was treated as the principal component by examiners of the CAPP OSCE? The conventional solution is to differentially weight Investigation and Management so that it is worth more than the other competencies. Weighting was developed almost a century ago by Toops (1927) and has since been applied across many fields (Bobko et al. 2007). However, there are two issues with the application of weights: the first is determining which competencies are worth more and which less in a composite score, and the second is determining the magnitude of each weight. The first is not a simple task. Defining the ingredients of practice-ready competency for family medicine is challenging because the definition of competent and competency has been difficult to construct in a way that can be readily measured (Kane 1992; Williams et al. 2003; ten Cate et al. 2010; Epstein and Hundert 2002; Newble 2004; van der Vleuten 1996; van der Vleuten and Schuwirth 2005). The second task is equally complex from a mathematical point of view: a recent study in medical education showed that establishing weights requires a very large data set and that, compared with non-weighted equivalents, weighting does not provide additional information (Sandilands et al. 2014). Considering the conceptual complexity of competency and the large data set required to derive weights, developing appropriate weights may not be feasible in OSCEs like the CAPP OSCE.
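For concreteness, differential weighting simply means replacing an unweighted sum of competency ratings with a weighted composite. The form below is a generic sketch; the weights and their ordering are purely illustrative, not values proposed by this study.

```latex
% Illustrative weighted composite station score, where x_c is the rating on
% competency c and w_c is its (hypothetical) weight:
\text{Station score} = \sum_{c} w_c \, x_c ,
\qquad \text{e.g. } w_{\mathrm{INMAN}} > w_{\mathrm{COUN}} > w_{\mathrm{HIST}} = w_{\mathrm{PE}}
```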

If the conventional solution of weighting is infeasible, we need to seek out other solutions. Another, simpler alternative for exams similar to the CAPP OSCE is to reorganize the rating scales in order of increasing complexity. Conventionally, the OSCE uses an anchored Likert rating scale in which each competency is weighted equally. However, our results suggest an underlying hidden structure in the OSCE that lends itself to a Guttman-style rating scale (Mislevy 1993). In a Guttman scale, tasks are ordered from easy to difficult, and success on the more difficult tasks implies success on the simpler ones. As visualized in Fig. 3, our results suggest that competencies such as conducting a Physical Exam and Quality of Spoken English are easier to assess than competencies such as History Taking, Communication and Professional Behavior, which appear to be precursors to the more critical competencies of Problem Definition and Diagnosis, Investigation and Management, and Counseling. This structure is hypothetical, based on the patterns revealed in the data, and is intended to enhance the rating process for examiners. It is important to note, however, that in practice physicians may be more fluid and natural in an encounter, possibly doing some History Taking while performing a physical exam.

Fig. 3 Model of the nested structure of competencies for the CAPP OSCE

Although more research is needed to establish this structure, the rating form could potentially be reorganized so that the simpler competencies are assessed first using a simple rating scale, and the more complex competencies are assessed using longer rating scales with areas for narrative. Reorganizing the scale in this way may better reflect how examiners work, allowing them to direct their cognitive load where it matters: to the assessment of the principal competency of Investigation and Management.

Conclusion

The initial purpose of this study was to gain deeper insight into how examiners conceptualize practice-readiness. Through the combined use of OLHLM and cognitive interviews, we developed a novel way to capture what examiners believe are the essential abilities for medical practice. Quantitatively and qualitatively we found that the essence of “practice-ready” in the Nova Scotia context lay in the competencies of Investigation and Management along with some Counseling.

Despite the limitations of the study, we believe this work adds to the evidence base on rater cognition in medical education. It also suggests that future inquiries are needed to explore the generalizability of these findings in other contexts, particularly with respect to how examiners prioritize competencies and what they expect in terms of diagnostic efficiency. Lastly, this work suggests that in order to develop authentic assessment practices, we need to ensure that assessment measures and processes mirror the real world; that is, the ways in which examiners actually approach the assessment process and conceptualize competency.