Background

Society’s implied social contract with any health profession includes the obligation of that profession to self-regulate in a manner that ensures the protection of patients (Cruess and Cruess 2014). To address this obligation, regulatory authorities in the US and Canada have long included formal national examinations as part of their efforts to construct a fair and equitable quality assurance process. The primary mandate of these examinations has been to ensure that doctors have the knowledge and skill required to provide safe and effective health care (Swanson and Roberts 2016), and there is good evidence that scores on such examinations are associated with performance during candidates’ subsequent careers (Cadieux et al. 2007; Wenghofer et al. 2009; Tamblyn et al. 2007).

There is also evidence, however, that increases in the amount of high-stakes assessment have not led to improvements in quality and safety. Rather, the opposite has been observed: the most recent data (James 2013) suggest that the actual number of deaths due to medical error and poor quality healthcare in the US may be three times the number reported in the seminal Institute of Medicine report published nearly two decades ago (Kohn et al. 1999). It is unlikely that any one intervention could overcome such pernicious findings, and it would be inappropriate to blame any one aspect of the system. At a minimum, however, these findings mandate a critical look at our current approaches to assessment.

Such an analysis does not deny the importance of the gatekeeping function of high quality testing buttressed by high psychometric standards, but it does argue for supplementation by ongoing assessment that emphasizes quality improvement efforts during the practice years. In other words, effective continuing professional development is vital for all healthcare providers, regardless of where they sit on the pass-fail continuum according to high-stakes assessment practices (Eva et al. 2013). This argument does not represent a fundamental shift away from concerns about patient care towards concerns about physician growth; rather, it is a recognition that continuing education offers a means of improving patient care, given that many more patients will be affected by physicians who pass our current exams than by those who fail.

The purpose of this reflections article is, therefore, to cast a critical lens on current assessment practices to offer insights into ways in which they might be adapted to ensure alignment with modern conceptions of health professional education for the ultimate goal of improved healthcare.

Methods and conceptual framing

To ensure that any assessment program has a positive influence on patient care by promoting lifelong professional development, it is important to consider both the implicit messages sent to candidates and stakeholders and any unintended consequences of adopting a particular assessment strategy. This is Messick’s (1989) notion of consequential validity, and it requires a broader, scoping review not only of current best practices in assessment, but also of the alignment between current assessment systems and modern understanding of health professional practice (Cizek 2012). In considering these issues, van der Vleuten’s (1996) model of utility (reliability, validity, feasibility, acceptability, and educational impact) continues to provide a useful framework from which to judge the adequacy of any assessment system. As van der Vleuten notes, compromise is necessary across these factors. In an ideal world this compromise should result from deliberate decision making rather than unexamined imbalance (Razack et al. 2015). At the same time, the broad conception of validity that Kane (1992), Messick (1989), and others (Cook et al. 2015; Cook 2014; Downing 2003) have put forward demands that a wide variety of additional factors be taken into account when deciding whether or not a system of assessment is fit for purpose. For example, the creation of a coherent and integrated system of assessment that promotes ongoing learning across the continuum of training and practice requires a process that (a) is made efficient for candidates, ensuring appropriate and comprehensive coverage of many aspects of performance while eliminating unnecessary redundancy; (b) emphasizes the primacy of learning by harnessing the power of feedback (Boud and Molloy 2013; Galbraith et al. 2011); and (c) creates a shared accountability between the learner, educational programs, and regulatory authorities for engaging in continuous performance improvement (Mylopoulos and Scardamalia 2008; Bordage et al. 2013).

To explore these issues, the authorship team began by reviewing four recent overviews of the current state of health professional assessment: (1) The Future of Medical Education in Canada (AFMC 2010); (2) Assessment in Postgraduate Medical Education (Regehr et al. 2011); (3) Assessment Strategies within the Revised Maintenance of Certification Program (RCPSC 2011); and (4) The Ottawa Conference consensus statement (Norcini et al. 2011). From that foundation we launched iterative discussions, both in person and asynchronously, over the course of 3 years about the key issues facing medical training, regulatory, licensing, and certification authorities in the near future, and we sought out literature to inform these discussions with the intent of offering additional perspectives and issues that supplement these four documents. Our goal was not to debate the merits of any particular form of assessment. Rather, it was to re-formulate general principles that could be relevant to anyone involved with assessment, be they individual course directors, national testing agencies, or anything in between.

Results

The three broad themes discussed below are not intended to be either exclusive or exhaustive. Where possible, we have tried to offer paths for further exploration of the issues raised. In an effort to balance the discussion of these themes between broad perspectives on assessment and the operational needs of the health professions, we have focused on three levels of consideration: (1) Conceptual—issues about how, why, and when different assessment practices impact upon the culture of the profession; (2) Logistical—specific avenues of exploration through which the conceptual issues might be redressed within practical realities; and (3) Systemic—cultural issues inherent in current practice and education systems that create barriers that need to be overcome. The first two levels will be discussed independently for each theme, after which the third, integrating level will be examined more generally.

Theme 1: Overcoming unintended consequences of competency-based assessment

Conceptual issues

Modern systems of medical education are increasingly centered on competency-based frameworks that outline the core knowledge, skills, and attitudes that individual physicians are expected to maintain (Frank et al. 2010; Morcke et al. 2013). This movement has had a definite positive impact by offering an explicit and broadly focused model from which educators, regulatory authorities, and licensing bodies can guide educational and assessment practices for the sake of improved patient care (Holmboe et al. 2010). While there are well-recognized psychometric challenges in the assessment of many competencies, in general these challenges can be overcome through deliberate and diverse sampling across different situations (Norcini et al. 2003; Eva and Hodges 2012), and the direct observations required offer increased opportunities for formative feedback.

Implicit in common models of competency-based assessment, however, are a variety of assumptions that may have unintended and undesirable consequences (Ginsburg et al. 2010). Most central is the notion that competence is something one can check off. Claiming that a student can “perform a complete and appropriate assessment of a patient,” for example, ignores the robust literature indicating that contextual factors play an important role in our ability to perform any task (Eva 2003; Colliver 2002) and risks sending an implicit message that once a task can be achieved there is no further work to be done (Neve and Hanks 2016; Norman et al. 2014; Newell et al. 2001). The fact that every candidate who passes a minimal competence exam is effectively labeled competent overlooks the realities that (a) there is always considerable variability of performance within the passing range, (b) even the top performers have room for improvement, and (c) knowledge and skill are subject to drift and deterioration (decay) over time (Choudhry et al. 2005; Norman et al. 2014). A considerable amount of information is collected on students during medical school, postgraduate training, and licensing and certification examinations, yet it is simply ignored once the individual is, at a particular point in time, deemed ‘good enough’ (Klass 2007). This creates a number of problems.

First, focusing on a determination of ‘competent’ contributes to assessment protocols being seen as hurdles that one simply needs to get over rather than as diagnostic opportunities that can be put to use for further pedagogic benefit. Second, focusing exclusively on the pass-fail cut-point removes any incentive, and creates considerable disincentive, for disclosing difficulties and continuing to pursue improvement (Eva et al. 2012). Passing the examination may then signal that whatever weaknesses one experiences are unimportant (Butler 1987). Third, such competency-based models may reduce the degree of support educators feel compelled to provide, given that there is little need to offer guidance to trainees who have already been deemed competent. Finally, using the label ‘competent’ overlooks the well-established view that knowledge and skills must be continuously used for them to be maintained (Ericsson 2004; Eva 2002). Having successfully crammed to pass an exam should not be viewed as an indication that one will remember the material after the exam is completed (Custers 2010).

Moreover, the “state of independence” that underlies the label of competent runs counter to modern perspectives on expertise, which differentiate between the routine expert, who achieves a certain degree of performance and simply reproduces that performance repeatedly (Regehr 1994), and the adaptive expert, who continuously reinvests her energies into better understanding and innovating within the domain of practice for the sake of continuous performance improvement (Mylopoulos and Regehr 2011). In sum, focusing assessment efforts on the achievement of competence, by striving exclusively to identify the minority of individuals who do not meet an established threshold, eliminates opportunities to provide formative guidance that could direct future learning for the majority. While some individuals definitely need to be excluded from practice until such time as their ability can be improved, formally supporting the others in their performance improvement efforts could have a bigger impact on the quality of healthcare that patients receive.

Logistical considerations

Addressing such unintended consequences would require greater emphasis on longitudinal and direct observation, with accompanying feedback, across medical school, postgraduate training, and continuing professional development. This could, as a result, lead to a radical increase in assessment. That must be considered in light of the observations that physicians in practice are already overworked and the organizations responsible for implementation of assessment do not have endless resources. Further, given the value of progressive independence (Kennedy et al. 2009) and of desirable difficulties (i.e., being challenged in a manner that drives learning) for performance improvement (Guadagnoli et al. 2012; Eva 2009; Bjork 1994), there are dangers associated with trainees or practitioners being observed constantly. Thus, as we move forward in addressing the unintended consequences of competency-based assessment it will be important to optimize the time and resources that are available rather than assuming that more is necessarily better.

To achieve such optimization, it will be necessary to create a more continuous model of assessment that is integrated across the various stages of training and practice, with information carried forward by the individual. Such a model would enable the profession to home in on the particulars of performance that would be most impactful for a particular learner at a particular time and would promote the notion that learners (trainees and practitioners) need to be accountable for their own learning (van der Vleuten and Schuwirth 2005). An assessment system that recognizes the continuous nature of performance, as opposed to dichotomizing into pass-fail, would further normalize this process such that all learners would be expected to maintain ownership over a learning plan that could efficiently guide their activities while minimizing threats to the candidates’ self-concept, an important determinant of performance (Eva et al. 2012; Kluger and van Dijk 2010). Thinking this way about competent practice would not remove the need to develop strategies for external observation (by supervisors or peers), but it might create a situation in which such opportunities are more rewarding because they can be deliberately directed at the key aspects of performance that would yield palpable benefit while helping learners maintain a sense of ownership over their own learning agenda (Mutabdzic et al. 2015).

To some extent these ideas are translations of the key features model of exam question development (Page and Bordage 1995; Farmer and Page 2005), with “key” now defined by what the learner most requires rather than by the critical next step for the patient in a clinical case. This analogy leads to a proposal to use the information available in any form of assessment to define the key actions that are most appropriate at a particular moment in one’s development. For example, with regard to issues of knowledge acquisition, the high quality summative assessment processes currently mounted by large testing organizations are likely to be more trustworthy than assessment protocols generated at local institutions. By contrast, for issues of practice performance and related competencies, the greater number of opportunities for situated and longitudinal observation in workplaces and learning institutions will likely make the data from those settings more trustworthy than those collected by large testing organizations. As a result, we believe maximal strength will come from integration of efforts across the stakeholder organizations that are responsible for each level of training and practice. Within current assessment practices, it is conceivable that OSCE stations or key features questions could be built such that they require candidates to follow up on an error that was made during a preceding assessment moment (Bordage et al. 2013); to ask for help, guidance, or supervision (Cote and Bordage 2012); to demonstrate use of a clinical decision support tool; or to summarize a case for an attending, making it clear not only what is understood (Goldszmidt et al. 2013), but what the learner would take away from the situation to direct further formative development that would facilitate better care for patients. This promotes the translation of the assessment experience into a personal inquiry-based learning strategy and integrates the idea of using data to make sense of one’s experience and frame a plan for improvement. In an ideal world such planning would take place with a coach or peer support (Marsh and Roche 1997).

Using the resulting information to tailor components of subsequent assessment processes that require the candidate to demonstrate how they have utilized previous experiences to their patients’ benefit would create considerable incentive for candidates to “nurture” their learning as a lifelong effort requiring continuous reinvestment rather than simply trusting that they know enough because they have passed their exams (Schön 1983). Doing so would require considerable harmonization of assessment practices across many groups. Working towards such a coherent and integrated assessment system, however, would create the potential to overcome negative reactions to assessment practices by establishing a cultural norm and expectation that shifts accountability toward a shared responsibility between learner and system (Galbraith et al. 2008) and harnesses feedback in ways that emphasize the primacy of learning (Eva et al. 2012), which leads us to Theme 2.

Theme 2: Striving to implement quality assurance efforts while promoting performance improvement

Conceptual issues

For assessment of any kind to provide a meaningful mechanism through which individuals can be expected to grow, it needs to promote and empower self-regulating aspects of practice (Eva et al. 2010), provide high quality and credible feedback (Sargeant et al. 2011), and deliver support for the candidate (Eva et al. 2012; Marsh and Roche 1997). Perhaps because of this complexity it is commonly believed that an assessment cannot fulfill the dual purposes of offering summative measurement and formative guidance. The distinction is an important one in that it helps to make sense of the compromises that are appropriate when determining the utility of an assessment protocol in different contexts. However, when treated as an absolute rule, this distinction can be detrimental. It risks absolving training organizations that are responsible for nurturing and supporting learners from serving as effective gatekeepers. It also risks removing responsibility from high-stakes testing organizations to attend to assessment for learning. More fundamentally, the assumption that duality of purpose cannot be achieved simply misconstrues the reality of the learner’s experience. We presume that even the staunchest advocates of the distinction would recognize the adage that assessment drives learning and would, therefore, concede that studying for or sitting a summative assessment has a formative influence (Newble and Jaeger 1983; Larsen et al. 2008; Norman et al. 2010). Further, any time one performs a task in which identity is invested there is an aspect of summative judgment even if the assessment is intended to be “purely formative.” That judgment may not take the form of a high-stakes, ‘pass-fail’ decision, but it can certainly have a direct impact on whether an observer deems the performer worth the effort of providing feedback or whether the performer himself deems further improvement to be important (Eva et al. 2010).

Thus the question is not whether an assessment is summative or formative, but the extent to which summative or formative purposes are foregrounded in the mind of the learner. In this regard, a more relevant continuum is the level of the stakes involved in the judgment (again, in the perception of the person being assessed). Part of the challenge with current assessment practices, especially in high-stakes contexts, is that they are routinely singular events with no effort to facilitate use of the information gained from their administration beyond a summative decision. Further, there is generally no effort directed at follow-up to determine whether the information that could be gleaned from the assessment is utilized by the candidate for learning purposes.

Exacerbating the problems associated with the unsophisticated dichotomization of summative and formative testing purposes is the romanticized construction of the “self-regulating professional” as one who will rationally and neutrally accept data and strive to use them to change their own behaviour (Eva and Regehr 2013; Watling et al. 2014; Harrison et al. 2015). Many of the roles (e.g., Scholar, Professional) inherent in modern conceptions of clinical practice incorporate competencies that require individuals to continually reinvest in developing their performance over the course of their careers (Mann et al. 2009; Ericsson 2004). This demands that individual practitioners seek out good data regarding their current level of practice. Yet, data that conflict with one’s self-identity are threatening to the individual recipient (Kluger and van Dijk 2010) and create an experience of cognitive dissonance that can make it easier to discount the data than to determine how best to use them for professional growth (Eva et al. 2012). This is especially true given the confidence that follows increasing experience (Eva 2009). For recipients to be influenced by feedback they must be receptive to it (Shute 2008). For recipients to be receptive to feedback they must deem it credible, not just with respect to its validity, but with respect to believing that it is delivered with the sincere goal of helping the recipient practice better (Sargeant et al. 2011; Galbraith et al. 2011). Achieving such credibility requires more than simply convincing the recipient that the data are psychometrically sound. At the level of the individual, we must offer not just data but also guidance regarding how to use external evidence to improve (Marsh and Roche 1997). Culturally, we must normalize the improvement process across the range of performance, because focusing attention only on those at the bottom of the distribution reduces the need for the majority of candidates to pay attention to the data available (Kluger and van Dijk 2010; Butler 1987). Functionally, we must strive for an integrated and continuous system with shared accountability by focusing beyond point-in-time assessment moments that will inevitably be treated simply as hurdles to be overcome before returning to one’s normal stride.

It is rare for trainees or physicians in practice to have better data with which to guide their continued development than they obtain upon engaging in formal assessment activities organized by institutions that have invested heavily in providing the best possible data. Failing to consider the use of assessment for performance improvement, even in high-stakes contexts, is therefore a considerable missed opportunity. There will inevitably be a tension between quality control and quality improvement goals whenever the threat of high-stakes assessment looms over candidates (Kluger and van Dijk 2010). We believe benefit can occur despite such tension, however, with further integration across the continuum of training and with deliberate attention paid to quality improvement as a goal, such that one of the things that is “learned” through high-stakes assessments is that the material covered matters and must be considered beyond demonstrating adequate performance at a particular moment. Thus, we offer some initial thoughts regarding how such concepts might be built into current high-stakes assessment activities.

Logistical considerations

A burgeoning area of research in recent years is demonstrating the conditions under which testing can have pedagogical value (Larsen et al. 2008; Kromann et al. 2010; Rohrer and Pashler 2010). For example, more frequent testing tends to yield a greater learning effect, especially when the testing format requires constructed responses (e.g., short answers) rather than recognition (e.g., MCQs; Karpicke and Roediger 2008; Kornell and Son 2009). This phenomenon creates a perspective in which shorter, more frequent, lower-stakes quizzes become increasingly valuable. While such approaches might seem more feasible for learners still in training, a number of groups, including the American Board of Anesthesiology (ABA), have provided proof of concept for their use with clinicians in practice (ABA 2014). The ABA provides questions longitudinally to the diplomate, who is then given a period of time to answer, offering an alternative pathway to continuous quality improvement and a new approach to keeping up with medical knowledge. The ABA’s efforts provide an excellent example of how to implement a summative assessment that can help to confirm aspects of knowledge that are up-to-date while enabling learning and improvement.
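As a rough illustration of what such longitudinal delivery might look like operationally, the sketch below spaces questions across consecutive weeks, each with a short answer window. It is a minimal sketch under assumed parameters (hypothetical item identifiers, two questions per week, a seven-day window) and is not a description of the ABA’s actual delivery rules.

```python
from datetime import date, timedelta
import random

# Hypothetical item identifiers; a real program would draw from a curated bank.
QUESTIONS = [f"Q{i}" for i in range(1, 121)]


def build_schedule(questions, per_week=2, window_days=7, start=date(2025, 1, 6), seed=0):
    """Spread questions over consecutive weeks; each gets a deadline window_days after release."""
    rng = random.Random(seed)
    items = list(questions)
    rng.shuffle(items)
    schedule = []
    for i, q in enumerate(items):
        release = start + timedelta(weeks=i // per_week)
        schedule.append({"item": q, "release": release, "due": release + timedelta(days=window_days)})
    return schedule


if __name__ == "__main__":
    for entry in build_schedule(QUESTIONS)[:4]:
        print(entry["item"], entry["release"], "->", entry["due"])
```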

One of the reasons offered against using summative assessments for formative purposes is the cost inherent in generating a high quality assessment exercise. If the gatekeeping function is to be maintained in high-stakes assessments, test security is an issue, and providing feedback on items may mean radically increasing the pool of quality items available for use. Similarly, if assessment is to be offered more continuously through smaller-scale but more frequent testing, this too would likely require an increase in the pool of questions available. However, if the profession truly believes that assessment illuminates a road to improved healthcare, this is an investment worth making. Further, this might become more feasible with the rapid development of automatic item generation (AIG) processes (Gierl and Lai 2013; Gierl et al. 2012), which mitigate test security issues by allowing new tests to be built relatively efficiently. If adequate databases of questions could be built, through AIG and other approaches, several possibilities for the use of testing to support performance improvement could be considered.
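To make the general logic of template-based AIG concrete, the sketch below shows an item model with variable slots, constrained value sets for those slots, and an answer key computed from the selected values. The item model, variable values, and distractor rules are entirely hypothetical and offered only as an illustration; they are not drawn from any real examination bank or from the specific systems described by Gierl and colleagues.

```python
import itertools
import random

# Hypothetical item model: a stem with variable slots and a key computed from those slots.
STEM = (
    "A patient weighing {weight} kg is prescribed a drug at {dose} mg/kg "
    "as a single loading dose. What total dose (in mg) should be given?"
)

# Constrained value sets define the space of permissible item variants.
VARIABLES = {
    "weight": [50, 70, 90, 110],   # kg
    "dose": [5, 10, 15],           # mg/kg
}


def generate_items(n=5, seed=0):
    """Sample n variants; the key and distractors are computed rather than hand-authored."""
    rng = random.Random(seed)
    combos = list(itertools.product(VARIABLES["weight"], VARIABLES["dose"]))
    rng.shuffle(combos)
    items = []
    for weight, dose in combos[:n]:
        key = weight * dose
        # Distractors modeled on plausible procedural errors (adding instead of multiplying, halving, doubling).
        distractors = sorted({weight + dose, key // 2, key * 2} - {key})
        items.append({
            "stem": STEM.format(weight=weight, dose=dose),
            "key": key,
            "options": sorted(distractors + [key]),
        })
    return items


if __name__ == "__main__":
    for item in generate_items(3):
        print(item["stem"])
        print("  options:", item["options"], "| key:", item["key"])
```

Even this toy version suggests why AIG eases security and pool-size concerns: a single vetted item model yields many interchangeable variants whose keys never need to be authored by hand.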

The learning associated with this sort of assessment program could be further enhanced with the creation of a test-tailoring platform in which the domains of focus and item blueprints could be specified for individual trainees and physicians in near real time. Customized tests would both support learning and create habits of engaging in deliberate practice improvement activities. In an ideal world, item databases would be created that allow practitioners to sync their current scope of practice (based on electronic health records, prescription habits, etc.) to a rubric that organizes the item database, such that formative tests could be tailored to yield optimal guidance regarding mechanisms for improvement. This is closer to reality now than it was 20 years ago, as the amount and quality of data that physicians have available about their practice are increasing (Ellaway et al. 2014). The success of such targeted assessment and feedback could then be tracked through the individual’s practice patterns by considering the very patient outcomes that led to the generation of the formative test.
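As one rough sketch of what such tailoring might involve, the example below weights a formative quiz blueprint toward the domains a clinician actually encounters, using a hypothetical practice profile (encounter counts of the kind that might be derived from electronic health records or prescribing data) and a hypothetical item bank. All domain names, counts, and identifiers are assumptions made for illustration; a real platform would presumably also weight by demonstrated performance gaps rather than practice volume alone.

```python
from collections import Counter
import random

# Hypothetical practice profile: encounter counts by domain over a recent period.
PRACTICE_PROFILE = Counter({
    "anticoagulation": 120,
    "diabetes": 300,
    "copd": 80,
    "dermatology": 15,
})

# Hypothetical item bank: pools of item identifiers organized by the same domain rubric.
ITEM_BANK = {
    "anticoagulation": [f"AC-{i}" for i in range(200)],
    "diabetes": [f"DM-{i}" for i in range(200)],
    "copd": [f"CO-{i}" for i in range(200)],
    "dermatology": [f"DE-{i}" for i in range(200)],
}


def tailor_quiz(profile, bank, total_items=20, seed=0):
    """Allocate quiz items to domains roughly in proportion to practice volume."""
    rng = random.Random(seed)
    volume = sum(profile.values())
    quiz = []
    for domain, count in profile.items():
        n = max(1, round(total_items * count / volume))  # at least one item per active domain
        quiz.extend(rng.sample(bank[domain], n))
    return quiz


if __name__ == "__main__":
    print(tailor_quiz(PRACTICE_PROFILE, ITEM_BANK))
```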

Establishing such an iterative program of assessment, in combination with harmonization across levels of training and across stakeholder organizations, could be still further enhanced through innovative examination protocols for candidates to demonstrate how feedback on areas requiring improvement led to an action plan that formed the basis for additional learning. This would extend the scope of examinations away from the single moment in time in which the candidate is physically present for the exam. An OSCE station, for example, could involve review of data collected from each candidate’s actual patient encounters and require them to demonstrate how they have understood their experiences with workplace-based assessments and evaluation of patient outcomes. Similarly, a “Diagnostic OSCE” administered late in undergraduate MD training or early in postgraduate training could be used deliberately to identify aspects of performance that would benefit from further development and could form the basis for tailoring subsequent assessment efforts. Ideally this process would be repeated at the end of residency and, in both instances, would allow further exploration of the candidates’ conceptualization of practice (Bogo et al. 2011) while enabling the stakes of any given assessment to be lowered because motivation would come from the need to demonstrate follow-up rather than the need to pass the exam. How to implement such processes in a manner that will be deemed authentic, and therefore used, is the focus of Theme 3.

Theme 3: Authentically linking assessment and practice

Conceptual issues

To be maximally effective as an educational tool, any system of assessment should model the realities of practice as closely as possible. Such alignment increases acceptability and makes claims of validity much more credible (Bernabeo et al. 2013). Authenticity does not mean using high fidelity simulation to mimic practice (Norman et al. 2012). Rather, authenticity is achieved when assessment protocols accurately reflect the domain of practice such that “studying to the test” or learning to “game the system” equates with learning to practice well. Too often we hear statements from clinical preceptors to their trainees along the lines of “in reality I would do X, but for your exam you should do Y.” Such disconnects threaten to undermine the entire system and create a culture in which assessments are viewed merely as hurdles to be overcome to prove oneself competent. They encourage trainees to practice in a manner that is misaligned with reality and encourage educational programs to teach to the exam rather than to learners for the benefit of patients.

Although this section is the shortest in this reflections piece, this aspect of our deliberations may be the most fundamental to enabling the cultural shift encouraged by efforts to broaden the definition of good medical practice. It is important not only that assessment processes accurately reflect the aspects of practice that stakeholders desire to promote, but also that assessment candidates be able to express an understanding of why their behaviour might differ in “typical” practice or why their behaviour might be variable within their practice. In other words, it might be appropriate for assessment-driven behaviour to deviate from one’s normal practice because context matters. Practicing in rural and remote areas, for example, will not be the same as practicing in large urban academic tertiary care centres, and assessment practices should provide some sense of whether or not candidates demonstrate appropriate (i.e., safe) awareness of variation in their practice. Having candidates state their broader conceptualizations of practice by indicating why their practice might change from one context to the next might give assessors better information regarding the driving motivation for behaviours observed. It is only by marrying abstract standards of practice with meaningful understanding of local variability that assessment can be truly authentic in the eyes of the individual being assessed.

Workplace-based assessment practices such as in-training evaluation reports, mini-CEXs, patient-reported outcomes, and direct observation of procedural skills (Pugh et al. 2014; Kogan and Holmboe 2013; Kogan et al. 2009; Norcini and Burch 2007; Norcini 2005; Norcini et al. 2003; Galbraith et al. 2011) in many ways represent the next frontier of assessment technology. They are not currently part of the most high-stakes assessment activities despite their potential for assessing a greater variety of dimensions of practice and their capacity to better reflect what individuals actually do in their day-to-day activity. Of course, there remain concerns about standardization and reliability with most workplace-based assessment practices, but collection of data over many evaluators, rotations, and cases does tend to yield sufficient reliability (Ginsburg et al. 2013), and uniformity of opinion may not be the ultimate goal in all contexts (Gingerich et al. 2014).

Logistical considerations

At its root, making assessment authentic requires having candidates engage with clinical scenarios that are not clear and obvious cut-outs from a blueprint. In in vivo, work-based situations such as using practice data, peer review, or portfolios, generating “authentic” assessment would seem straightforward, as the data are by definition based on the individual’s practice. When the stakes are high and momentary, however, even one’s personal “reflections” can become fictional when the system encourages them to be written for external review (Hays and Gay 2011). We see value, therefore, in leaving control of learner portfolios (Galbraith et al. 2008) in the hands of the learner to engender a sense of accountability and responsibility (van Tartwijk and Driessen 2009) while also enabling deliberate exploration of practice patterns, successes, and concerns without fear of the repercussions that can arise from placing great weight on any one assessment moment. This view is reinforced by recent literature examining physicians’ engagement with Practice Improvement Modules, which has suggested that physicians are more likely to believe the data available because they collected them (Bernabeo et al. 2013). Placing the emphasis on learning from workplace-based assessment in a manner that accumulates towards a higher-stakes decision, as advocated by van der Vleuten and Schuwirth (2005), is important for engaging communities of practitioners in discussing and learning from each other why their practice might deviate from that of others and when or whether such differences are safe and appropriate.

In ex vivo assessment situations such as OSCEs, the cases must allow uncertainty and avoid prompting statements such as “here comes the breaking bad news station.” Doing so might involve allowing stations with multiple pathways even at the cost of absolute standardization (Hodges 2003). Models are being developed: many groups have been experimenting with sequential OSCE processes in which later stations revisit previously encountered patients at an ostensibly later point in time, offering test results that were not available previously or focusing on some other form of follow-up visit (MacRae et al. 1997; Hatala et al. 2011). Within a station, it is also conceivable that standardized patients could be trained to offer information midway through a case that contradicts the most apparent diagnosis from the early portion of the encounter. Doing so would further provide some indication of candidates’ capacity to overcome their first impressions and avoid falling prey to premature closure (Eva and Cunnington 2006).

At the same time, there is a tendency to infer the cause of behaviours, trusting that the right things, when done, were done for the right reasons (Ginsburg et al. 2004). Given that context influences performance, there might be value in establishing opportunities for examiners to explore the reasoning underlying candidates’ behaviour (Bogo et al. 2011; Kogan et al. 2011; Williams et al. 2014). This could be done through post-encounter probes akin to the debriefing sessions that follow simulation encounters, in that both require the candidate to explain why certain things were done (Williams et al. 2014), why alternative actions were ruled out, and whether or how decision-making might have changed if the context had differed in specified ways. Gaining a better understanding of candidates’ conceptualizations of practice might help to account for some of the apparent inconsistency both in behaviour and in rater perception of that behaviour while also enabling a strategy for assessing other aspects of competence embedded within the Scholar and Professional roles (Bogo et al. 2011). Whether or not any of these innovations prove effective in adequately measuring candidate performance, movement beyond simple “examine the knee” types of cases is thought necessary to enable assessment of holistic aspects of practice that avoid the atomization of medical practice that the specification of a series of competencies can create (Eva and Hodges 2012).

Systemic considerations

In offering these reflections we fully recognize that good assessment is time and resource intensive. Given the generic nature of the issues raised here, without specific focus on any one setting or level of training, it is impossible to specify with any precision the cost of the concepts outlined. We do know, however, that a considerable sum of money is already spent on assessment that might well be diverted in ways that would better align assessment processes with the ideas we are exploring, especially the quality, safety, and improvement of patient care. We are also aware that many of the issues we raise are cultural in their roots and that changes in the direction we are suggesting might be threatening to many stakeholders. However, any assessment process will be threatening and can become a source of frustration, or even provoke outright rebellion, amongst practitioners (as recently seen in relation to Maintenance of Certification practices in the US). We suggest that this is more likely to be the case if the assessment community continues to emphasize summative processes based on a ‘standard practice’ that does not exist, thereby creating an unnatural, high-stakes test of competence. It seems antithetical to the very reasoning behind assessment (the protection of patients) to suggest that we should not think about how to improve current assessment practices, not only in terms of their role in gatekeeping but also in terms of their opportunities for shaping further learning. Leadership is called for now, just as it was when substantial funds were devoted to the development of Multiple Choice Question technology from the 1950s onward. Such leadership will only be achieved through effective collaboration across educational and testing organizations, and with providers of continuing education services, to enable practices and expectations of healthcare practitioners to be established early in training.

A common criticism of the medical training system is the sharp transitions experienced when moving from pre-clerkship to clerkship, from clerkship to postgraduate training, and from postgraduate training to practice (Jarvis-Selinger et al. 2012; Teunissen and Westerman 2011). Some degree of transition pain is inevitable, but the challenges might be reduced by efforts to create a cohesive system of assessment. Enabling supervisors, mentors, program directors, and colleges to receive high quality information regarding each individual’s relative strengths and weaknesses such that further educational opportunities might be crafted efficiently would maximize the opportunity to resolve any issues even as the next stage of training is experienced and the process of discovery begins anew.

Finally, it will be important that these changes are understood to add value for the recipients of such assessment and feedback. Encouraging active engagement will require a reward structure in which assessment data and candidate responses are recognized as evidence that continuing professional development is being undertaken, and assessment experiences that authentically reflect the practice in which physicians are engaged and, thereby, have clear relevance to their patients. Thus, we do not see this process of change as a top-down exercise that is imposed upon trainees and practitioners, but rather as a co-productive collective exercise that truly engages learners in benefiting patients.

Summary

Conceptions of best practice in health professional assessment are evolving away from simply focusing on “knows how and shows how” processes towards processes that catalyze quality improvement and patient safety. More robust and timely performance measurement is becoming increasingly available through local Electronic Medical Records and large clinical databases. These forms of information are calling into question the exclusive reliance on traditional assessment approaches, given their potential to provide a real-time, authentic “window” into a physician’s practice. As thinking and data sources evolve, we consider the following to be common issues facing anyone concerned with mounting high quality assessment practices:

  1. Broadening the base of assessment beyond knowledge tests;

  2. Rigorously focusing data collection and decision-making practices in a manner that enables the assessment body to draw relevant and meaningful inferences;

  3. Adding emphasis on healthcare processes and outcomes, including strengthening of the ability of the assessments to predict who will perform well against those outcomes and who will further develop in their ability after training;

  4. Building a coherent and integrated system of assessment across the continuum of training to practice;

  5. Emphasizing the primacy of learning as an integral part of assessment;

  6. Harnessing the power of feedback; and

  7. Shifting accountability towards a model of shared responsibility between the individual and the educational system.

Continuing the evolution of assessment practices in the manner outlined here will require time, energy, and resources. However, patient safety challenges and the licensing and certification of physicians are not going to stop while these issues are resolved. As such, we advocate engaging in a set of pilot efforts aimed at quickly determining the feasibility of multiple strategies that might facilitate movement in the desired direction, rather than waiting for a comprehensive system of assessment to be designed and investing heavily in a complete infrastructure, the components of which will undoubtedly be variable in effectiveness. While there are no simple answers, there are many testing organizations working towards determining how to offer authentic, tailored, and meaningful assessment practices for professional regulation (e.g., Eva et al. 2013; Hawkins et al. under review; Krumholz et al. under review). Fundamental to all of these efforts is that we avoid confusing quality assurance with quality improvement, reliability with usefulness, and precision of measurement with actionability, and that we avoid confusing practitioners’ desire to practice well with a desire to be told how they are doing (Mann et al. 2011).