Introduction

The past decade has seen increasing interest in instructional or system-based interventions to reduce diagnostic errors. As reviewed by Graber, a number of strategies have been attempted with varying success (Graber et al. 2012). One of the earliest strategies used computer-based “decision support systems” to ameliorate errors. These systems were based on an “expert-system” framework that employs explicit rules extracted from experts. Such systems were cumbersome and not particularly accurate, listing as many as 30–40 diagnoses, with overall accuracy that rarely exceeded 40% (Elstein et al. 1996). Not surprisingly, large-scale trials (Berner et al. 1999; Friedman et al. 1999) showed these types of decision-support interventions had relatively small impact on errors. Integration of artificial intelligence to improve the usability and functionality of these systems is appealing and may allow machine-assisted diagnostic accuracy that surpasses that of clinicians (Haenssle et al. 2018); however, even the most robust technology still requires human oversight (Obermeyer and Emanuel 2016).

A second popular strategy was based on the assumption that most errors derive from cognitive biases, suggesting that interventions designed to educate clinicians about biases should reduce errors (Croskerry 2003a, b; Croskerry et al. 2013a, b). This underlying assumption has been challenged (Norman and Eva 2009), and some studies have shown that, while students can learn to identify biases (Reilly et al. 2013), such knowledge does not reduce the rate of errors (Sherbino et al. 2011, 2014).

A third strategy incorporates knowledge mobilization techniques, following the theory that errors are a consequence of knowledge deficits (Kahneman and Egan 2011). Zwaan et al. (2017) conducted a retrospective chart review of successive cases of COPD, both with and without errors, and found that insufficient knowledge was the primary cause of “suboptimal clinical acts”. Strategies to mobilize knowledge have been investigated by Mamede and Schmidt (2008, 2014), who showed that a structured “reflection” intervention, designed to encourage clinicians to systematically explore their knowledge of the relationships between features and diagnoses, had small but reasonably consistent effects in reducing diagnostic error. However, the process is very resource intensive, is ineffective when the clinician selects the cases to review (Monteiro et al. 2015), and is effective only when the original case is available in written form for review (Schmidt, personal communication).

One strategy that may improve diagnostic accuracy is the use of checklists. Checklists have been proposed as safety interventions in many domains of medicine, with improved clinical outcomes related to central line infections (Marsteller et al. 2012) and peri-operative communication (Haynes et al. 2009). Ely et al. (2011) have advocated for the use of checklists to reduce diagnostic errors. They point out that such checklists can be designed to achieve different purposes. They describe three kinds of checklists, directing the clinician to:

  a) Identify cognitive biases, using general questions like “Could I be anchoring on the wrong diagnosis?” (availability, anchoring), “Have I selectively elicited data to confirm my diagnosis?” (confirmation bias), and “Did I consider the flaws of heuristic thinking?” (all biases);

  b) Consider alternative diagnoses within the domain;

  c) Critically examine the process of data gathering/interpretation.

Each of these checklist types interacts with a different component of the diagnostic reasoning process (see Fig. 1), which is most commonly typified as an interplay between automatic, pattern-recognizing System 1 processes and more cognitively intensive System 2 processes. These processes are variably involved in gathering and interpreting data around the illness presentation, arriving at a diagnosis, and ratifying it through a calibration step (Croskerry 2009).

Fig. 1 Three different types of checklists in a model of diagnostic decision making

While there is no evidence showing that checklists to identify cognitive bias can reduce errors, Sibbald et al. (2013, 2014) demonstrated that checklists directed at interpretation of ECG components can produce a small but consistent reduction in errors at all levels of training, although the benefit was larger for novices (Sibbald et al. 2014). Critically, only one study has examined a differential diagnosis checklist in a head-to-head comparison. Shimizu et al. (2013) compared a differential diagnosis checklist to a cognitive bias checklist and found that the former produced a statistically significant reduction in errors, whereas the latter did not. It is possible that the differential diagnosis checklist acted as a retrieval aid or knowledge mobilization tool. However, the differences were very small and inconsistent: in one group the difference emerged on easy cases but not difficult ones; in a second group the reverse was true. Critically, the study involved medical students and written clinical cases, and may not generalize to more experienced clinicians. The current study re-examines this comparison in a visual diagnostic task with clinicians at different levels of expertise.

Understanding the potential of checklists to improve diagnostic performance requires a direct comparison of error-reduction strategies based on identifying cognitive biases against alternative strategies based on mobilizing condition-specific knowledge, and against no intervention.

Research goal

The primary goal of the current study is to examine whether the use of checklists could reduce diagnostic error. We trialed two different types of checklists: the first focused on identifying and addressing cognitive biases (debiasing checklists), and the second directed towards a systematic review of ECG features (content checklists). To gauge the impact of these checklist-based interventions on diagnostic error in ECG interpretation, we compared performance in these conditions to a no-intervention control.

As a secondary goal, we examined the relative effectiveness of these checklist interventions in novice (postgraduate year 1) and experienced trainees (postgraduate year 4 or 5). We also examined the relative effectiveness of these strategies in cases specifically designed to induce cognitive biases compared with cases that were not. The rationale for this design feature is that most of the experimental literature implicating cognitive bias as a cause of diagnostic error uses written cases or experimental strategies designed to induce bias (Christensen et al. 1991; Hatala et al. 1999; Mamede et al. 2010). We anticipated that experts would be less susceptible to bias and would therefore benefit less from either intervention. We also anticipated that participants using a checklist focused on identifying cognitive bias would have lower error rates when interpreting content engineered to induce biases.

Methods

Study design

The study tested the relative effectiveness of three educational interventions, described in Table 1, intended to reduce errors in ECG interpretation, following the method of Sibbald et al. (2013). The conditions were as follows:

Table 1 Checklists for each condition

  1. A ‘control’ condition containing only general instructions to review their interpretation carefully for diagnostic errors.

  2. A ‘debiasing’ condition involving instruction on identifying common biases, coupled with a checklist that prompts learners to identify any cognitive biases that might lead to errors in interpretation, based on a checklist used in a prior study (Shimizu et al. 2013).

  3. A ‘content’ checklist condition involving instruction on how to systematically interpret an ECG, followed by a content-specific checklist drawing attention to specific features of the ECG, based on a checklist used in prior studies (Sibbald et al. 2013, 2014).

Study flow

The study flow consisted of four steps (see Fig. 2). Following informed consent, all participants watched a review session on ECG diagnosis with worked examples and embedded instruction on the underlying concepts of ECG reading. All participants received the same instructions. The main goal of this component was to ensure that all participants had, to some extent, a common working knowledge of the principles of ECG interpretation. The instructional material was drawn from a popular online ECG teaching site (ECG Made Simple 2017). Next, participants watched a slide show on the prevalence and importance of diagnostic errors, their potential origins in system or cognitive processes, reasons why clinical decision-making might be faulty, and the value of checking diagnostic decisions.

Participants then received instruction customized to their random group assignment, as follows:

  1. In the control condition, participants were given general instructions to review their diagnoses carefully.

  2. In the debiasing condition, participants were given instruction in identifying specific cognitive biases, adapted from the online materials recommended by the Society to Improve Diagnosis in Medicine. We provided definitions and examples for anchoring bias, availability bias, confirmation bias, framing effect and search satisficing. These biases are all identified as important in the Institute of Medicine report (Donaldson et al. 2000), have been described in detail previously, and have some evidential support (Blumenthal-Barby and Krieger 2015).

  3. In the content checklist condition, participants received instruction on use of the content checklist (Table 1), which drew attention to various aspects of the ECG, including rate, rhythm, axis, hypertrophy, ischemia, and intervals.

Finally, all groups practiced interpreting ECGs using the same set of 6 ECGs with their condition-specific checklist (or with no checklist in the control condition). Table 1 includes the checklists. All practice ECGs contained clinical features that could induce a cognitive bias, as described below. The correct answer was provided for each ECG.

Instruction and testing were completed on a secure web-based testing platform (LimeSurvey, Hamburg, Germany).

Testing

During the testing phase, all participants were given the same 20 ECGs, each consisting of a brief clinical history and an ECG. They were instructed to provide a written interpretation in a free text box for each ECG, and then advanced to a separate screen where they were asked to check their interpretation using their condition-specific instruction. They were then allowed to revise their interpretation in a free text box.

The 20 ECGs were selected from an online case bank (www.ecgmadesimple.com, Table 2) and presented in random order. Ten of these ECGs included case stems designed to suggest 2 of the 6 cognitive biases described in the debiasing checklist. These two biases were selected for their ease of adaptation to the ECG format:

Table 2 ECG testing material

  1. Search satisficing occurs when the search for a diagnosis stops after one cause or explanation is found, even though multiple problems are present. The ECG cases designed to promote search satisficing had more than one important diagnosis (e.g., rapid atrial fibrillation AND inferior ST elevation myocardial infarction).

  2. Confirmation bias describes the tendency to look for confirming data to support a diagnosis rather than for disconfirming evidence to refute it, although the latter may be more persuasive and definitive. The ECG cases designed to promote confirmation bias included a plausible but incorrect diagnosis in the stem (e.g., “rule out atrial fibrillation” when the ECG showed sinus rhythm with multiple premature atrial contractions).

ECGs were selected by MS and reviewed by JS, SM and GN to confirm their fit for suggesting the two cognitive biases, and were pilot tested with two practicing cardiologists. Answers were adopted from the online case bank, which publishes an interpretation of each ECG and a justification for that interpretation.

Participants

In order to ensure a baseline working knowledge of ECG interpretation, we enrolled novice emergency medicine and internal medicine residents in their first year of postgraduate training. Experienced participants were cardiology fellows or residents in their fourth, fifth or sixth year of postgraduate training. Participants were recruited from Canada (McMaster University, Western University), the United States (University of Washington), and the Netherlands (Rotterdam University). Participants were paid a small honorarium for participating. We planned a sample size of at least 20 novice and 20 experienced learners, giving a power of 0.80 to detect a 10% difference among the three groups, accepting a 5% type I error rate (http://powerandsamplesize.com).
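
The sketch below illustrates the kind of two-group power calculation cited above, using statsmodels rather than the online calculator the authors used. The baseline error proportion (50%) and the unit of analysis (individual interpretations, roughly 20 participants × 20 ECGs per condition) are illustrative assumptions, not values reported in the paper.

```python
# A minimal sketch of a two-proportion power calculation; inputs are assumptions,
# not the figures used by the authors.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.50, 0.40)  # assumed 10% absolute difference in error proportion
power = NormalIndPower().solve_power(
    effect_size=effect,
    nobs1=400,             # assumed observations per condition (20 participants x 20 ECGs)
    alpha=0.05,            # 5% type I error rate
    alternative="two-sided",
)
print(f"power = {power:.2f}")  # about 0.81 under these assumptions
```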

Scoring

Consistent with the approach of Sibbald et al. (2013), errors in each ECG interpretation before and after the condition-specific instruction were counted. Errors included missed correct diagnoses as well as additional incorrect diagnoses that were provided. Interpretation time and the additional time taken following the condition-specific instruction were also collected as secondary endpoints.
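
As a minimal sketch of this error count (a hypothetical helper, not the authors' scoring procedure), the score for one ECG can be expressed as the number of correct diagnoses missed plus the number of incorrect diagnoses added:

```python
# Hypothetical illustration of the scoring rule described above.
def count_errors(interpretation: set[str], answer_key: set[str]) -> int:
    missed = answer_key - interpretation   # correct diagnoses the participant omitted
    added = interpretation - answer_key    # incorrect diagnoses the participant included
    return len(missed) + len(added)

# Example: key = {"rapid atrial fibrillation", "inferior STEMI"},
# response names only the atrial fibrillation -> 1 error (a missed diagnosis).
print(count_errors({"rapid atrial fibrillation"},
                   {"rapid atrial fibrillation", "inferior STEMI"}))  # 1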

Analysis

Data were analyzed using repeated measures ANOVA, with condition (control, debiasing checklist and content checklist) and learner level (novice, experienced) as between-subject factors, and case (20 levels) as a repeated measure. Initial analysis looked at the effect of expertise on accuracy and time to diagnosis, as a form of validation.
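
As an illustration of how such an analysis could be set up, the sketch below fits a linear mixed-effects analogue of this repeated measures ANOVA (a random intercept per participant standing in for the within-subject error term). The original analysis was run in SPSS; the data file and column names here are hypothetical.

```python
# Illustrative sketch only: mixed-effects analogue of the repeated measures ANOVA,
# fitted to long-format data (one row per participant x case).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ecg_errors_long.csv")  # assumed columns: participant, condition, level, case, errors

model = smf.mixedlm(
    "errors ~ C(condition) * C(level) + C(case)",  # between-subject factors, their interaction, and case
    data=df,
    groups=df["participant"],                      # repeated measures nested within participant
)
print(model.fit().summary())
```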

The primary analysis focused on the number of errors following revision after the use of the two checklist and control conditions. Secondary analysis examined how often the diagnosis was revised under different instructional conditions. To determine whether the effect of instructional condition differed by educational level, we examined the interaction between instruction condition and level.

We did not analyze the initial diagnoses provided prior to application of the checklist intervention for each case, since these may have been influenced to an unknown degree by recall of the checklist from prior cases.

Finally, we examined performance on specific cases to determine whether the debiasing instruction was more effective than the content checklist or control condition for those cases that included a specific cognitive bias. All analyses were conducted with SPSS version 23 (IBM Corp., Armonk, NY, United States).

Results

Sixty-one participants were recruited, 40 novice learners and 21 experienced learners. Participant demographics are reported in Table 3.

Table 3 Participant demographics

Effect of educational level

Overall, there were 1.19 ± 1.34 errors per ECG. Experienced learners made fewer errors than novice learners (0.60 vs. 1.52, F = 66.0, p < 0.0001). There were fewer errors in the two checklist conditions compared to controls (Table 4), although the differences were small and not significant (F = 0.61, p = 0.55).

Table 4 Errors in ECG interpretation by condition, learner level and content

Although novice residents appeared to take longer than experienced residents to enter their preliminary diagnosis (276 s vs. 164 s), this difference was not significant (F = 2.14, p = .15). In revising their interpretation after use of the checklist, juniors took significantly longer (31 s vs. 11 s, F = 4.60, p = .04).

The substantial difference between the novice and experienced groups constitutes some evidence of the appropriateness of the materials and task.

Effect of instructional condition on errors and time

The error rates for junior and senior learners by condition are shown in Fig. 2. The control condition had slightly more errors among juniors, but the main effect of condition was not significant (F = 0.26, p = .77), and the interaction between condition and trainee experience was not significant (F = 0.82, p = .44). Participants in the content checklist condition took longer, on average, than those in the debiasing and control conditions (content checklist 276 s, debiasing 225 s, control 223 s), but these differences were not significant (F = 0.206, p = .81).

Fig. 2 Study process

The process of revision

Of the 1220 interpretations made on the ECGs, there were only 97 instances in which participants changed their original interpretation on their ‘second look’ (i.e., interpreting the ECG in the context of one of the two checklists, or with no additional information in the control condition). The mean number of errors for these changed interpretations was 1.87 before the intervention and 1.86 after (F = .000, p = .997). Novice trainees made 1.92 errors before and 2.06 errors after the intervention, while more experienced trainees made 1.79 before and 1.65 after; the interaction between condition and experience on change in accuracy was not significant (F = 0.64, p = .42).

Specific effect of debiasing intervention on cases with cognitive biases

As shown in Table 4, cases engineered to induce specific biases resulted in significantly more errors for both juniors and seniors (0.67 errors overall for non-bias-inducing cases, compared with 1.08 and 1.82 for cases with confirmation bias or search satisficing, respectively; F = 96.9, p < 0.0001). There was no evidence that specific debiasing instruction resulted in fewer errors on the biased cases, for either novice or experienced residents (F = 0.25, p = .78). There was, however, a significant interaction between case type and learner level, with novice participants being more susceptible to confirmation bias (F = 5.59, p = 0.001).

Discussion

This study systematically compared debiasing checklists with knowledge retrieval checklists as strategies to reduce errors in ECG interpretation. While we found significant effects of expertise on diagnostic accuracy, we did not demonstrate any advantage for checklists, either those directed at specific aspects of the ECG or those designed to identify cognitive biases. Moreover, even when cases contained specific cognitive biases related to the instruction and checklist, there was no increase in accuracy related to use of the debiasing checklist.

One obvious conclusion is that the checklists and instruction were inadequate. However, we specifically attempted to use “state of the art” materials (Sibbald et al. 2013, 2014; Society to Improve Diagnosis in Medicine). The testing content was taken from a well-rated ECG education site and carefully engineered to promote bias, a manipulation that successfully resulted in a higher error rate among both novice and experienced learners. Learners were drawn from multiple institutions in multiple countries to minimize systematic influences of individual training contexts.

Despite all this careful effort, little impact on diagnostic errors was seen with either strategy. Certainly, the systematic application of a checklist after the initial diagnosis resulted in relatively few changes and no clear improvement. To be fair, novice learners experienced a small reduction in errors in both the debiasing and content checklist conditions; however, the effect size was small and present only in a post hoc analysis of the novice group, and the interaction between expertise and condition was not significant. It is not clear whether this observation reflects the specific content of these instructional conditions or a more general phenomenon of increased instructional scaffolding translating into better novice performance.

The results contrast with an earlier study of ECG diagnosis using the same content checklist (Sibbald et al. 2013). It may be that the present study was relatively underpowered; however, one clear distinction is that the Sibbald et al. (2013) study used more complex cases, with almost three times as many errors per case as the present study. That study also showed a substantial number of error corrections, particularly among novices.

The low frequency of changed diagnoses, and the resulting minimal effect on accuracy, following a systematic intervention after an initial diagnosis is a negative result of the present study, but it is not unique to this study. Monteiro, using written clinical cases, found a similar 8% rate of changed diagnoses after revision, with a similarly small overall improvement (Monteiro et al. 2015).

This may in turn suggest to some that using decision support after an initial diagnosis is futile, and that students should instead be taught to be systematic and thorough from the outset, using checklists similar to those in this study to guide systematic search and reduce premature closure and confirmation bias. Surprisingly, a direct test of this approach of “hypothesis-free” systematic inquiry, using feature checklists and ECG cases, led to the opposite conclusion: such approaches actually increased diagnostic errors (Norman et al. 1999).

Importantly, teaching about bias and providing checklists to help learners identify biases did not mitigate the increase in errors in testing content engineered to promote bias. This can be viewed as a ‘best case’ scenario, to the extent that the cases clearly exhibited the defined cognitive bias and participants were actively reminded of the biases during the diagnostic process. However, the study by Zwaan et al. (2017) raises concern that even those with substantial experience with cognitive biases have difficulty identifying and agreeing on which bias is present. Moreover, identifying a bias does not guarantee error correction unless the clinician is aware of the correct alternative (Dhaliwal 2017).

This study has some important limitations. We chose ECG diagnosis as an example of visual test interpretation. While visual test interpretation is frequently used in studies of diagnostic error, contexts in which data are not automatically provided but must be collected by a healthcare professional may be more prone to bias. Second, this study included learners at the beginning and near the end of their postgraduate training. We cannot comment on more novice learners (e.g. medical students) or more expert clinicians (e.g. practicing cardiologists). However, given the range of diagnostic performance captured in the two groups in this study, and the minimal impact of both interventions, it is unlikely that larger effects would be seen in other groups. Finally, our intervention represents a relatively blunt checklist instrument applied to the nuanced problem of bias contributing to diagnostic error. Our findings do not preclude the efficacy of more selective, targeted applications of checklists.

What are the implications of this study? The effects of these interventions are disappointingly small. Educators might reduce the curriculum time devoted to these types of interventions until further work shows larger impact, either through empirically refined tools or more targeted application of these ideas. It is difficult to justify taking ECG instructional time to cover the interventions used in this study given their minimal effect. It is becoming increasingly clear that interventions that rely on encouraging clinicians to “think harder”, however that is defined, whether “take more time, slow down” (Sherbino et al. 2012), “be aware of biases” (Sherbino et al. 2014), “pay attention to specific aspects” (Sibbald et al. 2013, 2014) or “reflect” (Mamede and Schmidt 2014; Mamede et al. 2008, 2010), but which rely entirely on existing knowledge, have little or no impact on errors (Dhaliwal 2017). Certainly, one recurrent issue is that clinicians are unaware that they are committing an error and so are unlikely to take steps to correct it (Eva and Regehr 2011). While such errors may appear obvious in hindsight, this retrospective process has its own problems (Zwaan et al. 2017).

It is difficult to escape the conclusion that, however serious the problem of diagnostic errors, there are unlikely to be any ‘quick fixes’ that improve diagnostic performance through general, real-time reminders that clinicians broadly apply in practice.