Introduction

Competency-based medical education (CBME) is being adopted worldwide as a new approach to medical education, particularly in postgraduate training (Iobst et al., 2010). In many countries, including the Netherlands, the USA and Canada, the shift to CBME has also come with the implementation of, and increased focus on, workplace-based assessment (WBA) as part of programmatic assessment. WBA uses low-stakes assessment tools, implemented in the authentic clinical environment, that are intended to encourage direct observation and feedback in an assessment-for-learning paradigm (Norcini et al., 2007).

A further innovation has been the introduction of WBAs that capture supervision judgements using entrustment supervision scales. Entrustment supervision scales are behaviourally anchored rating scales that capture the level of supervision a learner requires to perform a clinical task as they progress towards unsupervised practice (Ten Cate, 2020). Entrustment supervision scales have been touted as having several benefits. Scales that anchor on the supervisor’s perception of a trainee’s progressive clinical ability should promote construct alignment between the rating and the priorities of the supervisor (Crossley et al., 2011). Entrustment anchors that closely align with the degree of supervision required during the clinical task may encourage supervisors to use the entire range of the scale when rating a performance. Supervisors who may have been reluctant to tell a resident they are “below average” using traditional rating scales may be more willing to record “I had to do it” if that accurately captures the supervision provided. Finally, entrustment supervision scales capitalize on the natural decision making of clinical supervisors, who decide daily whether learners can be allowed to undertake clinical tasks with or without supervision (Ten Cate, 2020).

While many different entrustment supervision scales have been developed, one prominent tool in use across North American residency training programs is the Ottawa Surgical Competency Operating Room Evaluation (OSCORE) (Gofton et al., 2012; Dudek et al., 2015; MacEwan et al., 2016; Ode et al., 2019; Thanawala et al., 2018; Saliken et al., 2019; Fitzpatrick et al., 2019; Cutrer et al., 2020; Dudek et al., 2019; Thanawala et al., 2019; Van Heest et al., 2019; Prudhomme et al., 2020; Gillis et al., 2020; Halman et al., 2020; Thoma et al., 2020; Meholick et al., 2020; RCPSC, 2021). In the Canadian postgraduate medical education (PGME) context, the OSCORE is promoted as a ‘strongly recommended’ WBA entrustment supervision tool (RCPSC, 2021).

The OSCORE was originally developed as a tool that would allow surgical training programs to determine surgical residents’ competence, defined as “readiness for independent performance of the particular procedure”, in select procedures throughout the course of their training (Gofton et al., 2012, p. 1402). The OSCORE has 8 clinical items rated on a 5-point scale (1 = “I had to do it” to 5 = “I did not need to be there”), one yes/no question about ability to perform the procedure independently, and two open-ended feedback questions (Gofton et al., 2012). The OSCORE is novel compared to other surgical evaluation tools: it assesses overall surgical competence rather than narrowly focusing on technical skill, and it assesses a resident’s ability to independently perform the procedure rather than comparing the resident with their peer group.

Although originally intended as an assessment of surgical procedure competence (Gofton et al., 2012), the OSCORE is currently utilized as an entrustment supervision scale, as evidenced by its inclusion in a recent review of entrustment supervision scales (Ten Cate et al., 2020). While it makes conceptual sense that there would be a direct relationship between a supervisor’s assessment of a resident’s competence and the supervisor’s level of entrustment of the resident to perform the task independently, in reality entrustment is influenced by a host of factors, and the relationship between competence, independence and entrustment is complex (Hauer et al., 2015; Gilchrist et al., 2021; Klasen and Lingard, 2021). Within surgical supervision, emerging evidence supports a relationship between a supervisor’s assessment of competence and entrustment of an operative procedure (Ji et al., 2019). The confluence of the promotion of entrustment-based decisions within CBME and the language of the OSCORE anchors (e.g. “How much supervision did this trainee require to perform the procedure independently?”) (Ten Cate et al., 2020) has driven the OSCORE’s evolution into an entrustment supervision scale.

While there have been multiple individual studies on the use of the OSCORE in medical education, no review has systematically examined them together to understand whether the OSCORE is measuring what it intends to measure, and its effect on learners and programs of assessment (i.e., the validity argument underlying the OSCORE). The frameworks for organizing validity arguments have evolved from the early categories of content, criterion and construct validity to more unifying contemporary conceptualizations of validity in which all validity is construct validity, supported by different sources of evidence (Messick, 1989). Kane’s validity framework is one such contemporary framework; it is highly versatile as it both highlights the sources of validity evidence and offers a structure for synthesizing that evidence into a validity argument (Kane, 2013). Kane’s framework can be used for both quantitative and qualitative assessment tools, as well as quantitative and qualitative sources of validity evidence. Kane’s framework has two major components, starting with the interpretation/use argument (IUA) for the assessment tool (i.e., explicitly articulating the decision being made about a learner). The IUA identifies the key assumptions and inferences associated with the assessment decision. Once the IUA has been articulated, the necessary and/or available evidence that tests the assumptions of the IUA is evaluated. Validity evidence is captured from multiple sources and categorized under one of four inferences: scoring (evidence that examines the translation of an observation to a score on the rating tool); generalization (evidence supporting the sampling and reliability of the measurement); extrapolation (what the score infers about real-world performance); and implications (the impact of the assessment on the learner, program and/or patient) (Kane, 2013).

In the current study, we address the gap in the literature between the individual studies and the overall validity argument for the OSCORE. We use systematic review methodology to gather validity evidence and Kane’s framework to examine the validity argument, identifying strengths and weaknesses and potential areas for future research and development of the OSCORE. We address the question: What is the validity argument underlying the use of the OSCORE in assessing readiness for independent performance of a procedure by medical learners?

Methods

The methodology for this systematic review was based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses Protocols (PRISMA-P) (Moher et al., 2015).

Search strategy

We searched MEDLINE, Embase and Google Scholar from 2005 (the earliest papers on entrustment) to September 2020 with the assistance of a reference librarian. The initial search included terms related to assessment (competenc*, assess*, evaluat*, educational measure*), combined with “entrust*”, and supplemented by searching ‘OSCORE’ as a text word in the databases. Additional studies were sought by hand-searching the reference lists from two published reviews of entrustment supervision scales (Rekman et al., 2016; Ten Cate et al., 2020).

Study inclusion and exclusion

We included original quantitative or qualitative full-text research studies published in English. Studies had to address the use of the OSCORE for assessment of health professionals. Modifications of the original tool were included, but new derivative tools (e.g. Ontario Bronchoscopy Assessment Tool, Ottawa Clinic Assessment Tool) were excluded. Health professionals included physicians, nurses, pharmacists, dentists, veterinarians, allied health professionals, medical lab technicians if they provided patient care, and clinical psychologists. Meeting abstracts were excluded.

All identified titles and abstracts, and subsequently full-text articles, were independently screened by two authors to identify those that met the inclusion criteria. Disagreements were resolved by consensus.

Data abstraction process

All articles included in the systematic review were reviewed for general study characteristics and sources of validity evidence as per Kane’s validity framework (Kane, 2013). We followed a previously published guide to categorize the validity evidence under each of Kane’s inferences (Cook et al., 2015). A data abstraction sheet was developed and used to record information relevant to assessment including: the clinical setting in which the OSCORE was used (health care profession, specialty, inpatient vs. outpatient, academic vs. community, geographical location, simulation vs. real life, medical vs. surgical specialty, procedural vs. clinical), learner characteristics (level of training, number of learners, number of encounters/learner, voluntary vs. mandatory participation, OSCORE learner training, incentives), assessor characteristics (title/rank, number of assessors, number of encounters/assessor, OSCORE training for rater) and study design (purpose of assessment, study duration, task evaluated, opportunities for feedback by participants).

The interpretation/use argument (IUA) was extracted if it was explicitly stated in the study. Sources of validity evidence were also extracted and organized using Kane’s framework.

Study quality

Methodological quality of the included quantitative studies was appraised using the Medical Education Research Study Quality Instrument (MERSQI) (Reed et al., 2007).

Data synthesis

Two authors critically examined the extracted data and categorized the validity evidence, with discrepancies resolved by consensus. All authors contributed to data analysis and synthesis to articulate the overall validity argument for the OSCORE and identify evidence gaps.

Reflexivity

All of the authors either currently hold or have held educational leadership positions in postgraduate medical education related to assessment. Two of the authors (JS, RH) also have careers in education scholarship and RH has previously published using Kane’s framework. While Kane’s framework itself is not inherently associated with a specific philosophical position on assessment, it is helpful to articulate our philosophical positions as they will influence our examination of the validity evidence (Tavares et al., 2020). While we describe ourselves as holding predominantly post-positivist views on assessment of learning, we hold philosophical positions more closely aligned with constructivism for WBAs such as the OSCORE. Specifically, we view competence as demonstrated through authentic clinical tasks as interpersonal, co-constructed between learner, supervisor and patient, and socially situated with multiple dimensions.

Results

The initial search yielded 1491 articles, which were narrowed, using the inclusion criteria, to 19 studies focused on the OSCORE (Fig. 1). Seventeen were quantitative studies and two were qualitative. Eighteen studies were in postgraduate medicine; one study was in undergraduate medicine. The majority of the postgraduate studies (13/18) were in surgical specialties, many of which included orthopedic residents (7) and general surgery residents (5). All of these surgical studies examined the original (n = 10) or modified (n = 3) multi-item OSCORE. The remaining postgraduate studies included emergency medicine (n = 2), internal medicine (n = 1), critical care medicine (n = 1), or multiple medical specialties (n = 1). All of these non-surgical studies, and the undergraduate study, examined a single global rating scale (GRS) with the OSCORE entrustment anchors. Six studies used the OSCORE for assessment in a simulation setting.

Fig. 1 PRISMA flow diagram

The MERSQI scores for included quantitative studies ranged from 9 to 14 with a mean score of 11.7 out of a possible score of 18. We divided the MERSQI scores into terciles of methodological quality, with 1–6.5 being low quality, 7–12.5 being moderate quality and 13–18 being high quality. The majority of quantitative studies (12/17) included in this review were of moderate methodological quality; the remaining studies (5/17) were of high methodological quality (Table 1). Table 1 summarizes each study, the MERSQI score, and the detailed validity evidence.

Table 1 Description and detailed validity evidence for included studies

Below, we present a narrative summary of the validity evidence using Kane’s framework. While not explicitly stated, the studies predominantly examined the OSCORE assessments through a post-positivist lens (e.g. describing minimizing rater bias or considering reliability as the gold standard for generalizability). The results presented below are consistent with this post-positivist perspective.

Interpretation/use argument (IUA)

The OSCORE was created as a “succinct surgical assessment tool that could be used to evaluate competence on any surgical procedure” (Gofton et al., 2012, p. 1402), where surgical competence was defined as “readiness for independent performance of a particular procedure” (Gofton et al., 2012, p. 1402). This IUA is consistent across the surgical postgraduate workplace-based studies included in our review. These surgical studies chose operative procedures across a range of different surgical specialties and assessed a resident’s ability to perform a particular procedure independently using Gofton et al.’s five anchors. In the non-surgical postgraduate studies in which non-procedural skills were assessed, a clear IUA was not articulated although the studies imply an IUA of readiness for independent performance of a task. In the undergraduate study, assessors were asked to document the extent to which they had to intervene in the clinical task (Cutrer et al., 2020). By contrast, most studies in the simulation setting focused on assessing competence (Gerull et al., 2019; Halman et al., 2020; Prudhomme et al., 2020).

Validity argument: 1) Scoring

Evidence supporting the scoring inference describes how observation of performance is translated and captured as a numeric score or written comment (Cook et al., 2015). Only the original high methodological quality OSCORE study describes how the scale was developed (Gofton et al., 2012). An expert group referenced previously validated surgical assessment tools and created the unique entrustment supervision scale anchored with colloquial language that surgeons used to describe a resident’s participation in a given procedure. Local surgeons reviewed the wording for relevance. The tool was piloted with orthopedic and general surgery residents and subsequently revised to its final form (Gofton et al., 2012).

None of the moderate methodological quality studies (MERSQI scores 9–12.5) that modified the OSCORE items (Gerull et al., 2019; Meholick et al., 2020; Thanawala et al., 2018) or reduced the OSCORE to a single GRS (Cutrer et al., 2020; Halman et al., 2020; Lord et al., 2019; MacEwan et al., 2016; Prudhomme et al., 2020; Thoma et al., 2020) described the development process for their modified scales.

Regarding response process, focus group participants in the original study (Gofton et al., 2012) felt the language of the anchors closely reflected real-world assessment. Another study reported that residents and faculty found the OSCORE anchors useful in procedural and non-procedural contexts for both junior and senior learners (Dudek et al., 2019). However, residents in a qualitative study did not perceive the single GRS OSCORE anchors as being different from traditional scales (Martin et al., 2020). Some residents preferred traditional anchors, which allowed comparison with their peers and gave them information on their expected rate of progress (Martin et al., 2020).

While rater training was undertaken in eight studies (Dudek et al., 2015; Gillis et al., 2020; Gofton et al., 2012; Halman et al., 2020; Lord et al., 2019; MacEwan et al., 2016; Meholick et al., 2020; Van Heest et al., 2019), none provided a detailed description and only one study was of high methodological quality (Gofton et al., 2012).

The influence of raters on scoring remains uncertain and was only assessed in three simulation studies of moderate methodological quality. In one unblinded simulation study that included rater training (Gillis et al., 2020), there were no differences between ratings by community faculty (less familiar with residents) and academic faculty, which the authors suggest indicates minimal rater bias. Two of the simulation studies blinded the rater by using videotaped surgical procedures focusing only on the resident’s gloved hands (MacEwan et al., 2016; Saliken et al., 2019). For studies in the clinical environment, raters were familiar with their residents.

In terms of entrustment scores, there is a tendency towards range restriction favouring the high end of the scale (Gofton et al., 2012; Saliken et al., 2019). The low end of the scale is infrequently used, even for very junior residents (Gofton et al., 2012). In a high methodological quality study, less than 10% of first-year emergency medicine residents scored 1 or 2 on the OSCORE, whereas more than 60% scored 4 or 5 (Thoma et al., 2020).

For psychometrics related to the scoring inference, the original OSCORE study found item-total correlations of 0.57–0.82 (Gofton et al., 2012) for 8 items. One moderate methodological quality OSCE-based study (Halman et al., 2020) found single OSCORE GRS item-total correlations of 0.30–0.79 by station compared to overall exam score.
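
For readers less familiar with this statistic, an item-total correlation is the Pearson correlation between scores on a single item and the scale total; whether the original studies used the corrected (item-excluded) total is not reported here, so the corrected form below is our assumption, shown only for illustration:

\[
r_{i,\text{total}} = \mathrm{corr}\!\left(X_i,\ \sum_{j \neq i} X_j\right),
\]

where \(X_i\) is the score on item \(i\). Values in the reported 0.30–0.82 range indicate that individual items track the overall rating to a moderate to strong degree.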

Validity argument: 2) Generalization

The two major sources of evidence supporting the generalization inference are sampling and reliability (Cook et al., 2015). As outlined in Table 1, most studies took place at a single academic institution. A wide range of surgical procedures were assessed across moderate and high methodological quality studies. Seven studies included orthopedic procedures (Gofton et al., 2012; Dudek et al., 2015; MacEwan et al., 2016; Ode et al., 2019; Saliken et al., 2019; Van Heest et al., 2019; Gillis et al., 2020), five studies included general surgery procedures (Gofton et al., 2012; Dudek et al., 2015; Thanawala et al., 2019; Gerull et al., 2019; Meholick et al., 2020), three studies included urological procedures (Dudek et al., 2015; Fitzpatrick et al., 2019; Gerull et al., 2019) and one included gynecological procedures (Gerull et al., 2019). The moderate methodological quality simulation studies examined only one type of procedure (Lord et al., 2019; MacEwan et al., 2016; Saliken et al., 2019). One study including surgical residents did not specify the types of procedures included (Thanawala et al., 2018) while one qualitative study included procedural and non-procedural specialties but did not specify which procedural specialties were included (Dudek et al., 2019). Among the non-surgical studies, two high methodological quality studies were in emergency medicine (Prudhomme et al., 2020; Thoma et al., 2020), one in internal medicine (Halman et al., 2020) and one in undergraduate medicine (Cutrer et al., 2020). In Gofton et al.’s original study, which included both orthopedic and general surgical residents and faculty, specialty accounted for little variability in item ratings, but this was not re-examined in the other studies (Gofton et al., 2012). All of these sampling issues limit generalizability to broader contexts, particularly non-surgical settings.

Examining the psychometric data, five studies employed generalizability theory to examine different sources of measurement error; two were of high methodological quality and three were of moderate quality (Gofton et al., 2012; MacEwan et al., 2016; Lord et al., 2019; Saliken et al., 2019; Prudhomme et al., 2020). Consistent across studies, variance attributed to raters was relatively small compared to variance attributed to residents (Gofton et al., 2012; Lord et al., 2019; MacEwan et al., 2016). High reliability was achievable with multiple assessments, ranging from a g-coefficient of 0.80 for five assessments by one rater in a clinical setting (Gofton et al., 2012) to 0.90 for eight assessments by two raters in a simulated context (MacEwan et al., 2016). In a high methodological quality study comparing workplace-based to simulation-based single GRS OSCORE assessments of first-year residents’ emergency resuscitation, the g-coefficient was markedly lower for the clinical setting (0.35 across twelve assessments with a single rater in the clinical setting compared to 0.75 across four cases and two raters in a simulated environment) (Prudhomme et al., 2020). This study suggests 33 workplace-based assessments would be required to achieve a reliability of 0.6. D-studies confirm that the number of raters can be reduced to one to two in the simulation setting without significantly impacting reliability (Lord et al., 2019; MacEwan et al., 2016).
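
As an illustration of the decision-study logic behind these projections, consider a simplified single-facet design in which each rated encounter is treated as one observation and rater and occasion variance are pooled into a single error term (a simplifying assumption on our part, not the full design reported by Prudhomme et al., 2020). The projected reliability for n assessments then follows a Spearman–Brown-type relation:

\[
E\rho^2(n) = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_e/n}.
\]

From the reported clinical-setting coefficient of 0.35 over \(n = 12\) assessments, the error-to-person variance ratio is \(\sigma^2_e/\sigma^2_p = 12 \times (1 - 0.35)/0.35 \approx 22.3\), so reaching \(E\rho^2 = 0.6\) requires \(n = 1.5 \times 22.3 \approx 33\) assessments, consistent with the figure reported above.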

Validity argument: 3) Extrapolation

Evidence supporting the extrapolation inference examines how performance on the OSCORE is related to real-world performance (Cook et al., 2015). As is evident in Table 1, the dominant source of evidence collected under the extrapolation inference is novice-expert differences. All ten of the moderate to high methodological quality studies (Gofton et al., 2012; MacEwan et al., 2016; Saliken et al., 2019; Van Heest et al., 2019; Thanawala et al., 2019; Ode et al., 2019; Prudhomme et al., 2020; Halman et al., 2020; Meholick et al., 2020; Gillis et al., 2020) that examined this relationship found that OSCORE scores increased with level of training, even across months of training for first-year emergency medicine residents (Prudhomme et al., 2020). Most of these studies are confounded by raters being unblinded to the residents’ level of training.

Additional extrapolation evidence, from both moderate and high methodological quality studies, examined the relationship of the OSCORE to other assessment tools. There are high correlations between the OSCORE and other surgical technical assessments completed at the same time (Gillis et al., 2020; MacEwan et al., 2016; Thanawala et al., 2019). There is moderate correlation between the single statement “resident competent to independently complete the procedure?” (Gofton et al., 2012, p. 1407) and the mean OSCORE rating (Gofton et al., 2012; Saliken et al., 2019), raising the possibility that the single-item score and the multi-component OSCORE function similarly. The OSCORE performed equivalently to the P-score, a single-question summative assessment, in discriminating between levels of training (Van Heest et al., 2019).

Examining non-surgical assessments, Halman et al. found a high correlation between the OSCORE entrustment anchors and multiple other performance measures, including a case-specific checklist, a GRS and a training level rating scale, during an internal medicine OSCE in a moderate methodological quality study (Halman et al., 2020). In a high methodological quality study in undergraduate medicine, the OSCORE entrustment ratings were concordant with the higher ratings from another entrustment supervision scale (the Chen Supervisory scale), but mismatches were found for mid-range scores (Cutrer et al., 2020).

Only one high methodological quality study examined the relationship between simulation-based performance and clinical performance. There was no correlation (concordance correlation coefficient = −0.01, 95% CI −0.31 to 0.29, p = 0.93) between simulation and workplace-based single GRS OSCOREs for emergency medicine residents’ emergency resuscitation, with significantly higher scores in the workplace setting (Prudhomme et al., 2020).
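
For context, the concordance correlation coefficient reported here is, under Lin’s standard formulation (our assumption as to the exact estimator used by Prudhomme et al., 2020),

\[
\rho_c = \frac{2\rho\,\sigma_x\sigma_y}{\sigma^2_x + \sigma^2_y + (\mu_x - \mu_y)^2},
\]

where \(x\) and \(y\) denote the paired simulation and workplace scores, \(\rho\) is their Pearson correlation, and \(\mu\) and \(\sigma^2\) are the respective means and variances. Unlike the Pearson correlation alone, \(\rho_c\) also penalizes systematic differences in means and variances, so a value near zero reflects an absence of agreement between the two settings rather than a simple scaling difference.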

Validity argument: 4) Implications

The implications inference addresses how available evidence impacts the learner, the faculty, the training program and/or patients and society (Cook et al., 2015). As demonstrated in Table 1, the vast majority of implications evidence focused on perceived feasibility of the OSCORE in practice and acceptability amongst staff and residents, across predominantly moderate methodological quality studies. Within the surgical studies, the OSCORE was generally found to be feasible for workplace-based assessment, using a feasibility standard of completing more than 50% of eligible assessments (Fitzpatrick et al., 2019; Ode et al., 2019; Van Heest et al., 2019). However, two studies found contrasting results, with Meholick et al. and Dudek et al. reporting OSCORE completion rates of 11% and 6%, respectively (Meholick et al., 2020; Dudek et al., 2015). Facilitators of high completion rates included email reminders, setting completion rate targets, and providing residents with immediate access to the OSCORE (Thanawala et al., 2018; Van Heest et al., 2019). The identified barriers included accessing electronic platforms, residents perceiving they were intruding on faculty, residents selectively choosing cases for assessment, and lack of time (Dudek et al., 2015; Ode et al., 2019). Two studies reported it took less than two minutes to complete the OSCORE after a surgical procedure (Thanawala et al., 2018, 2019), but in a third study, surgical residents felt it took too long to complete (Ode et al., 2019).

There was mixed evidence regarding acceptability of the OSCORE to residents and faculty in surgical specialties (Fitzpatrick et al., 2019; Ode et al., 2019; Van Heest et al., 2019). Across high methodological quality studies, use of the OSCORE helped to define important aspects of a surgical case for residents (Gofton et al., 2012; Van Heest et al., 2019) or clarified performance expectations (Cutrer et al., 2020). In two studies, residents reported that they were more accepting of lower entrustment scores compared to traditional anchors (Dudek et al., 2019; Gofton et al., 2012) as the entrustment anchors highlighted performance deficits. In an internal medicine OSCE, faculty perceived the single GRS OSCORE entrustment scale to be a better measure of a resident’s abilities than a global rating scale, training level rating scale or case-specific checklist (Halman et al., 2020).

A number of moderate and high methodological quality studies examined the impact of the OSCORE on feedback with surgical studies reporting that the amount and quality of verbal feedback improved (Fitzpatrick et al., 2019; Gofton et al., 2012; Ode et al., 2019; Van Heest et al., 2019). By contrast, in one study of residents across mixed specialties, residents did not find that direct observation increased with implementation of OSCORE assessments (Dudek et al., 2019). In a qualitative study with medicine subspecialty and emergency medicine residents, the OSCORE was felt to negatively impact residents’ sense of self-efficacy and to potentially reinforce a performance mindset (seeking only positive assessments) over a growth mindset (seeking feedback to improve performance) (Martin et al., 2020).

Examining the impact on training programs, Meholick et al. found that the OSCORE could be used to identify residents requiring extra surgical simulation training and to assess progress after simulation training (Meholick et al., 2020). Use of the OSCORE impacted standard-setting in an internal medicine OSCE, with more residents labelled as failing using the single GRS OSCORE compared to traditional OSCE ratings (Halman et al., 2020). Both of these studies were of moderate methodological quality. In a high methodological quality, national study of emergency medicine competence committee decisions using single GRS OSCORE assessments to guide resident promotion, residents required longer than predicted training time to advance through training, while promotion decisions were based on fewer assessment data points than recommended. There was also large between-program variability in terms of the number of assessments collected and promotion timelines (Thoma et al., 2020).

Discussion

Having used systematic review methodology to search, identify, appraise and abstract the original research studies, we now articulate the validity argument for the OSCORE, separating out the multi-item OSCORE (either original or modified) implemented in surgical specialties from the single GRS OSCORE implemented in non-surgical specialties.

There is a reasonable validity argument for the multi-item OSCORE in surgical specialties, grounded in an interpretation/use argument of assessing surgical competence as readiness for independent performance for a given procedure. The evidence is predominantly from single-institution studies, across a mix of simulation and clinical contexts, and heavily weighted towards orthopedics and general surgery. The individual studies were of moderate to high methodological quality. The scoring, generalization and extrapolation inferences are well-supported. In terms of implications, there is reasonable data that the OSCORE can feasibly be implemented and effectively used in training programs, and that residents and faculty alike find that it is an acceptable tool. Only one study commented on how the OSCORE was able to identify those in need of more practice at a given procedure (Meholick et al., 2020). It should be noted that in the original study (Gofton et al., 2012), the OSCORE was not intended to be used to make summative decisions about promotion or independent practice, instead focussing on readiness to perform a given procedure independently. As such, neither the original study nor the additional available evidence supports use of the OSCORE in summative decisions about the promotion of residents through their training program.

Taking a deeper look at some of the issues raised under the scoring, generalization and extrapolation inferences, although the colloquial anchors were intended to encourage raters to use the entire scale (Gofton et al., 2012), there is evidence of range restriction towards the higher end of the scale. Despite limited descriptions and assessment of rater training, generalizability studies did not report major error variance due to raters. From a post-positivist perspective, this may suggest that the construct alignment of the scales mitigates the need for rater training and/or minimizes the impact of rater bias (Crossley et al., 2011; Weller et al., 2017).

Alternatively, the unblinded assessment design and rater familiarity with a learner may confound this finding. However, adopting a constructivist lens consistent with the reflexivity of our team, we embrace variability between raters (i.e., we expect that raters, based on different experiences and expertise, would hold different views of a resident’s performance). From this constructivist perspective, the lack of variability between raters is unexpected. Possible explanations include, but are not limited to, rater training discouraging variability in perspectives, or the OSCORE itself not encouraging varied perspectives on performance.

OSCORE assessments completed in the clinical environment require a significantly, and potentially prohibitively, larger number of assessments to achieve reliability compared to those completed in simulated settings. For surgical cases, reliability can be achieved with a relatively small number of ratings per resident, making the OSCORE a potentially effective and time-efficient assessment tool for surgical residency programs.

Although there is reasonable evidence for the extrapolation inference, it should be noted that this is largely in the form of expert-novice differences. Although it is reassuring to know that senior residents have higher OSCOREs than junior residents, there may be many confounding factors as to why this is the case (Cook, 2015). As such, expert-novice differences are necessary but not sufficient to support a strong validity argument under the extrapolation inference. No further studies showing expert-novice comparisons are needed.

The original intent of the OSCORE was to assess overall surgical competence, not simply technical skill (Gofton et al., 2012). However, the OSCORE demonstrated moderate to high correlations with other surgical skills assessments. This brings into question whether all surgical performance-based tools, including the OSCORE, are assessing the same construct of surgical competence.

In contrast to the validity argument supporting the multi-item OSCORE in surgical contexts, limited validity evidence exists for the single-item OSCORE in non-surgical contexts. A clear interpretation/use argument has not been articulated in these contexts, although the underlying assumption seems to be readiness for independent performance. There is limited sampling across specialties, programs and centres. The non-surgical studies raise many issues that require further study. It is unclear if the single GRS OSCORE anchors represent a different construct than traditional behaviour-based anchors. Furthermore, there was a lack of correlation between performance in simulation and clinical contexts (Prudhomme et al., 2020). The implications evidence in the non-surgical contexts is also more concerning than in the surgical studies. Two qualitative studies demonstrated mixed impacts of OSCORE assessments on resident behaviours (Dudek et al., 2019; Martin et al., 2020), and one quantitative study found highly variable impacts on decision-making regarding resident promotion across emergency medicine training programs (Thoma et al., 2020).

Limitations

We explicitly excluded the Ottawa Clinic Assessment Tool and the Ontario Bronchoscopy Assessment Tool, which were derived from the OSCORE. These tools include scoring items that deviate significantly from the original OSCORE; they may be assessing different constructs. Our synthesis is also limited by the modest methodological quality of the original studies. Notable factors that negatively affected study quality included single-group cross-sectional or post-test-only designs, unblinded raters and limited sampling across institutions. We also included simulation-based studies, although it may be argued that the purpose of the OSCORE is for workplace-based assessment, where readiness for independent performance is a clearer construct than in the controlled simulation setting. Finally, although we hold a constructivist stance on workplace-based assessment, the bulk of the research into the OSCORE sits firmly in a post-positivist perspective, which limited our data interpretation.

Implications for educational practice and research

Acknowledging that there has been confounding in practice between the original IUA of the OSCORE (i.e., “Can this resident perform this procedure independently?”) and an entrustment supervision decision regarding the procedure (i.e., “How much supervision did I provide to this resident to perform the procedure independently?”), we believe the available evidence does support the use of the OSCORE for ad hoc (in-the-moment) entrustment decisions about surgical procedures by frontline supervisors. The language of the OSCORE anchors aligns with retrospective entrustment supervision decisions (“I had to do it” through to “I did not need to be there”) (Ten Cate et al., 2020) and there is evidence for a relationship between competence, independence and entrustment in surgical supervision (Ji et al., 2019). Given that the IUA for OSCOREs in the simulation context focusses on competence (as opposed to readiness for independent performance of a procedure), programs should consider interpreting performance in the simulation context differently from assessments generated in the clinical context.

There is little evidence to support the use of the OSCORE by surgical programs for summative assessment decisions, such as determining the progress of a remediating resident or making promotion decisions. In order to determine if the OSCORE actually predicts readiness for independent practice, studies comparing OSCORE performance to results of post-residency qualifying exams and actual performance in independent practice are required. These comparisons will take time to develop. Furthermore, if the OSCORE continues to be used in simulation contexts, more validity evidence is required examining the relationship to authentic clinical performance.

Considerable caution is required before widespread implementation of the OSCORE in non-surgical contexts, given the very limited available evidence. There is a pressing need to articulate the interpretation/use argument for the OSCORE in these settings, and to determine if the current anchors are construct-aligned to either competence or entrustment supervision or whether they represent a novel construct. Much more evidence is needed under each of the inferences to understand the OSCORE in these contexts. In the Canadian implementation of CBME, the OSCORE has been promoted as a core WBA instrument for assessment of Entrustable Professional Activities (RCPSC, 2021). In this educational landscape, it is important to reflect on the ramifications of our articulated validity argument. While the OSCORE underwent rigorous development in the surgical population in the original study (Gofton et al., 2012), residency programs should be aware that the OSCORE has yet to be studied in community hospitals and little evidence exists outside of surgical specialties. Validity arguments change across contexts, as validity is not a stand-alone property of the tool, and the argument must be re-examined in new contexts (Cook et al., 2015). Perhaps most concerning, if competence committees are relying on OSCORE data to make decisions regarding resident progression, the only study in this regard suggests high between-program variability, which could threaten the defensibility of summative decisions (Thoma et al., 2020). Gathering additional data would inform program-specific standard-setting around best practices for the number of assessments and predicted length of training (Thoma et al., 2020).

Conclusion

This systematic review demonstrates that the OSCORE has reasonable validity evidence to support its use for surgical operative assessment, under the scoring, generalization and extrapolation inferences of Kane’s framework. However, a validity argument for the extension to non-surgical contexts is not supported. Evidence to support the implications of this assessment instrument is nascent. We are optimistic that the OSCORE can be an informative and relevant tool for postgraduate learner assessment. However, widespread adoption must be informed by concurrent data collection in more diverse settings and specialties.