Introduction

Studies of diagnostic test accuracy should be designed to minimise bias, a principle that underpins guidance for both reporting [1] and appraising the quality of diagnostic test research [2, 3]. At the same time, study results should ideally be generalisable to everyday clinical practice. Balancing bias against generalisability is not straightforward. For example, to reduce the risk of clinical review bias, it is generally accepted that study observers should be blinded to prior investigations [4]. However, concealing information contrasts with daily practice, where patients’ clinical history, examination and prior investigations are known to the observer when formulating a diagnosis. Particularly in the fields of radiology, histopathology and endoscopy, test interpretation involves a substantial subjective element that could be influenced by methods that manipulate the clinical context.

In addition to individual patient information, study observers are often unaware of sample characteristics, notably disease prevalence. This issue is potentially important when assessing diagnostic tests intended for screening: in daily practice, observers expect asymptomatic patients to have a low likelihood of disease and, when disease is present, lower-stage disease (i.e. more subtle pathology). However, it is unclear how the observer’s a priori expectations influence subsequent interpretation, if at all: some studies have found diminished vigilance when prevalence is low [5], while clustering of abnormal cases in high-prevalence situations may also bias interpretation [6]. Nevertheless, studies of diagnostic test accuracy usually increase the prevalence of abnormality to achieve adequate statistical power within a feasible study size [7, 8]. Results of studies performed in the “laboratory” may therefore not be transferable to lower-prevalence screening populations in “the field.”
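To illustrate the statistical pressure towards enrichment, consider a worked example (ours, not drawn from the cited studies) using the usual normal-approximation formula for estimating sensitivity to a given precision:

$$
n_{\text{diseased}} \approx \frac{z_{1-\alpha/2}^{2}\,Se\,(1 - Se)}{w^{2}}, \qquad
n_{\text{total}} = \frac{n_{\text{diseased}}}{\pi},
$$

where $Se$ is the anticipated sensitivity, $w$ the desired half-width of the confidence interval and $\pi$ the sample prevalence. For illustrative values $Se = 0.80$ and $w = 0.05$ at 95% confidence ($z \approx 1.96$), approximately 246 diseased cases are needed; at a screening prevalence of 2% this implies roughly 12,300 cases in total, whereas enriching the sample to 28% reduces the total to roughly 880.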

Other pragmatic issues may also influence generalisability. For example, to complete research within a reasonable timescale, reporting intensity (the number of cases reported within a given period) frequently exceeds normal practice, and is often exacerbated by the requirement to re-evaluate cases under different conditions (e.g. when comparing MR with CT) [8] or on more than one occasion (e.g. with and without computer-aided detection). Moreover, because it is widely believed that prior exposure will influence subsequent interpretation (observer recall bias), it is recommended that consecutive interpretations be separated by a “washout phase” [9]. However, the ideal duration is unknown and there is little evidence that such procedures are effective or necessary.

While these potential “laboratory effects” [10, 11] have been discussed in the methodology literature [6, 11–14], their impact remains unverified. To attempt to quantify their magnitude, we performed a systematic review of studies in which the context of interpretation was manipulated or investigated (i.e. “laboratory” versus “field”). In particular, we wished to investigate the effect of varying sample characteristics, for example, enriching disease prevalence or increasing reporting intensity. Moreover, we aimed to explore the effect of concealing sample information (especially prevalence) from observers. We were also interested in studies that addressed the “memory effect” due to observer recall bias.

Methods

Data sources and search strategies

D.B. searched the biomedical literature to March 2010 using three complementary search strategies. A primary search identified any existing systematic reviews dealing with our research questions (Table 1). Because our review was not restricted to a specific test, diagnosis or clinical situation (which would have facilitated keyword identification), we examined 10 key publications [6, 10, 15–22] known to the authors in the fields of radiology, medical statistics and image perception that had dealt with case-specific information (Table 2). Relevant keywords/phrases identified from these 10 articles were: clinical information, recall bias, intensity, prevalence, prior knowledge and laboratory effect. The MEDLINE database was then searched via PubMed (http://www.nlm.nih.gov/pubmed), applying the systematic review filter to each term in turn. “Snowballing,” an iterative process for searching complex material [23], identified further potentially relevant publications by reintroducing new keywords and repeating the process until no new relevant material emerged.

Table 1 Primary search strategy: Search for related systematic reviews using six keywords or phrases identified by hand-searching the ten “key publications” described in Table 2
Table 2 Secondary search strategy: Details of the 10 “key publications”, the related record search, and the number of publications citing each key publication

A secondary search was performed to (A) identify indexed literature that shared two or more of the references cited by the 10 key publications and (B) identify all indexed literature citing a key publication (using “related records” and “citation map” searches through Web of Knowledge, http://www.isiknowledge.com). Citations were collated, duplicates eliminated, and abstracts reviewed (or titles if abstracts were unavailable) for potential inclusion (Table 2).

Lastly, a tertiary search was initiated by retrieving Medical Subject Heading (MeSH) terms from each potentially relevant publication identified by the primary and secondary searches. Terms were ranked in order of frequency and those likely to be non-discriminatory were excluded (e.g. adult, male, female, mammography, CT). Multiple suffixes (e.g. radiology, radiological) were substituted by a truncated heading (e.g. radiol*). Related disciplines (e.g. histopathology, endoscopy) were linked with “OR” operators. Ultimately there were three “modality” terms (endoscop*, radiol* and [cyto* OR histo* OR patho*]) and six “manipulation” terms (prevalen*, attention, Bayes theorem, bias*, observer varia*, and research design), which were paired using the “AND” operator. MEDLINE was searched with these strings using the “diagnosis” option in the “Clinical Queries” filter. Duplicates were excluded and abstracts examined (Table 3). Potentially relevant publications were expanded using the secondary search strategy previously described, and any new publications were introduced using snowballing [23].
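For illustration, the pairing of the three modality terms with the six manipulation terms to produce the 18 search strings can be sketched as follows (a minimal sketch of the combinatorial structure only; the actual searches were run through PubMed/Clinical Queries rather than in code, and the exact string syntax may have differed):

```python
from itertools import product

# The three "modality" terms and six "manipulation" terms described above
modality_terms = ["endoscop*", "radiol*", "(cyto* OR histo* OR patho*)"]
manipulation_terms = ["prevalen*", "attention", "Bayes theorem",
                      "bias*", "observer varia*", "research design"]

# Pairing every modality term with every manipulation term gives 3 x 6 = 18 strings
search_strings = [f"({m}) AND ({k})"
                  for m, k in product(modality_terms, manipulation_terms)]

for s in search_strings:
    print(s)  # e.g. "(radiol*) AND (prevalen*)"
```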

Table 3 Table detailing the Boolean search strings used for the tertiary search strategy and the number of individual abstracts identified by each term, with details of the full texts subsequently examined

The search strategies were tested: The secondary search identified all 10 key publications. The tertiary search identified all articles from which the MeSH headings had been compiled, and 7 of the 10 key publications.

Inclusion criteria

English-language studies to March 2010 inclusive were eligible if they investigated the effect on diagnosis of experimentally modifying the context of observers’ interpretations. In particular, we sought studies investigating the effects of varying disease prevalence, blinding observers to sample characteristics, varying reporting intensity, and observer recall bias. Studies exploring the effect of artificial “laboratory” conditions on outcome were also eligible. However, we excluded studies whose focus was manipulation of case-specific information (e.g. concealment of individual-patient information), since this has been investigated previously by systematic review [4]. Participants were human observers (computer-assisted detection was excluded) making subjective diagnoses based on interpretation of visual data, blinded to reference standard results. Studies were excluded if the number of observers or cases interpreted was unreported. There was no restriction on disease type. We anticipated that most studies would be radiological, but subjective interpretation of any medical image (e.g. endoscopy, histopathology) was eligible. Non-medical interpretation (e.g. airport security X-ray screening) was excluded, as were narrative reviews.

Data extraction

D.B. extracted data from the full-text articles, consulting S.H. and S.A.T. (both experienced in systematic review) when uncertain. Differences of opinion were resolved by consensus. Data were extracted into a data-sheet incorporating measures developed from QUADAS [2] and QAREL [24], with additional fields specific to the review question. We extracted: author, journal, imaging modality, topic, number of observers/cases and their characteristics (e.g. professional background and experience), reference standard, concealment of population characteristics from observers, blinding of observers to study participation and purpose, reporting intensity, washout period, prevalence of abnormality and whether this varied, and data clustering (grouping of normal/abnormal cases).
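As an illustration only, one row of such a data-sheet might be represented as follows (the field names are ours and do not reproduce the actual extraction form):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractionRecord:
    """One row of the data-extraction sheet (illustrative field names only)."""
    author: str
    journal: str
    modality: str                        # e.g. "mammography", "endoscopy"
    topic: str
    n_observers: int
    n_cases: int
    observer_background: str             # professional background and experience
    reference_standard: str
    prevalence_concealed: bool           # sample prevalence hidden from observers?
    blinded_to_participation: bool       # unaware they were taking part in research?
    blinded_to_purpose: bool
    reporting_intensity: Optional[str]   # cases per session, if reported
    washout_period: Optional[str]        # interval between repeated readings
    prevalence_of_abnormality: Optional[float]
    prevalence_varied: bool
    case_clustering: Optional[str]       # grouping of normal/abnormal cases
```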

Results

The primary search (Table 1) found 6050 abstracts; D.B. retrieved 56 full articles, of which one was suitable [25]. The secondary search (Table 2) identified 2828 publications, with the full text retrieved for 34: ultimately 6 were included [6, 13, 26–29] and 28 rejected because the research focused on case-specific information. The tertiary search (Table 3) identified 74 MeSH terms, which were combined into 18 Boolean search strings: these identified 111 potential articles, with a further 2 via snowballing; 5 articles were ultimately included [10, 12, 30–32]. Overall, 11247 abstracts were reviewed, 201 full articles retrieved, and 12 ultimately included for systematic review (Table 4).

Table 4 Details of the 12 publications included in the systematic review

Description of studies investigating clinical context

Of the 12 identified studies that investigated the effect of manipulating clinical context, 3 focused on varying the prevalence of abnormality [6, 13, 26]. The remaining 9 studies investigated observer performance in different situations with fixed prevalence: 3 compared performance in the laboratory with daily practice [10, 12, 32]; 3 investigated observer blinding to previous clinical investigations [29–31]; 1 investigated training [27]; 1 investigated varying reporting conditions [25]; and 1 investigated recall bias [28]. The 4 studies that investigated interpretation in “the field” used retrospective data obtained from normal clinical practice [10, 12, 25, 32]. One study recruited from an international conference [30]. The remaining 7 used a laboratory environment exclusively.

Study characteristics and settings (Table 4)

The following diagnostic tests were investigated by the 12 included studies: 9 were radiological (5 mammographic [10, 12, 25, 28, 29], 3 chest radiographic [13, 26, 27], 1 angiographic [6]), 2 were endoscopic [30, 32], and 1 was histopathological [31]. A single research group contributed 5 studies [10, 13, 26–28].

Study design

All studies used a design incorporating an independent reference standard, except for a single study of observer agreement [31]. With the exception of one study [31], all observers were blinded to the research hypothesis. Furthermore, one study [30] used observers who were unaware that they were taking part in research. However, despite this attempt to overcome “study knowledge bias” [14] (an area of interest to this review), the effect was not formally quantified by repeating the study with observers who knew they were being studied.

Observer and case characteristics (Table 4)

In all studies the observers were medically qualified/board certified, with a median of 8 observers per study (inter-quartile range [IQR] 3.5 to 14, range 2 to 129). Six studies were restricted to observers who were “specialists” [10, 25, 31] or “experienced” [28, 29, 32], but only 2 studies [10, 28] quantified this. Five studies included less-experienced observers, e.g. residents [6, 13, 26, 27, 30]. In one study the authors did not detail experience [12]. The median number of cases per study was 300 (IQR 100 to 1761, range 5 to 9520). Case selection criteria were well defined for 9 (75%) studies. Of these, recruitment was consecutive in 4 studies [10, 12, 25, 29], 4 [13, 26, 30, 31] selected cases for optimal technical quality, and 1 [28] selected “stress” cases (specifically, cases misinterpreted previously in clinical practice). All 12 studies used technically acceptable material, e.g. genuine radiographs or video endoscopy.

Effect of sample disease prevalence (Table 5)

Three articles investigated the effect of varying the prevalence of abnormality on observers’ diagnoses (Table 5). The earliest [6] investigated context bias (i.e. whether clustering of abnormal cases influenced interpretation of subsequent cases), finding that sensitivity for pulmonary embolus increased significantly (from 60% to 75%) when prevalence was increased from 20% to 60%. Two studies by Gur and colleagues [13, 26] increased the prevalence of subtle chest radiographic findings from 2% to 28% in a sample of 3208 cases read by 14 observers of varying experience in a laboratory environment. While no significant effect on observer performance [measured via receiver operating characteristic (ROC) area under the curve (AUC)] was demonstrated [13], reader confidence scores increased at higher prevalence levels [26]. However, the effects on sensitivity, or indeed on the ROC curve itself, were not addressed. Furthermore, the maximum prevalence used was 28%, whereas researchers frequently increase prevalence far beyond this level: 6 (50%) studies in this review used a prevalence between 50% and 100% [6, 27, 29–32].

Table 5 Articles investigating the effect of manipulating the prevalence of abnormality on studies of diagnostic test accuracy

Effect of blinding observers to disease prevalence (Table 5)

Of the 12 studies reviewed, 8 (67%) concealed the prevalence of disease. One mammographic study [10] informed observers that the prevalence of abnormality in the sample was enriched (while concealing the exact extent and proportion) but that BI-RADS ratings should be assigned as if in a screening environment. In the remaining three studies, observers were told the sample prevalence [28], were aware of it because they had designed the study [31], or were aware of it because the entire study was performed in the clinic [25].

Although 2 studies [13, 26] varied the sample prevalence without informing readers, they did not specifically test the effect of revealing the sample prevalence on observers’ interpretation. Hence the effect of blinding readers to the spectrum of abnormality in the study sample remains uncertain.

Effect of reporting intensity (Table 6)

We did not identify any research that specifically manipulated reporting intensity (i.e. the burden of interpreting cases) in the laboratory or compared it with daily practice. While a retrospective analysis of mammography in daily practice found that false-positive diagnoses diminished following implementation of high-intensity batch reading [25], the change was unquantified; the researchers attributed the improved performance to decreased disruption. Of the remaining 11 studies, 6 detailed the setting, observer experience and case-load, enabling an inference of reporting intensity versus normal practice (Table 6). Observers each read a median of 300 (IQR 100 to 3208) cases at a median rate of 50 (IQR 40 to 50) cases per session. One angiographic study [6] stipulated interpretation within three minutes, which likely exceeded normal practice. Intensity was either unreported or unclear in 5 studies. No article attempted to justify the reporting intensity used.

Table 6 Estimation of reporting intensity and generalisability to daily practice of “lab” studies

Effect of observer recall bias (Fig. 1)

Fig. 1 Duration and scientific justification of the “washout” interval to reduce observer recall bias in studies requiring repeated observations of the same data

One article investigated recall bias specifically [28], asking observers to reinterpret mammograms they had reported in clinical practice 14 to 36 months previously. One observer recognised a single mammogram but subsequently reported it incorrectly. The authors concluded that recall is rare and unlikely to bias studies. The same group [13] tested for 2-week recall via subgroup analysis, finding no effect, but the study was neither designed nor powered for this analysis. Eight (67%) studies included repeated observations of the same cases. One study [30] did not account for recall bias at all, requiring reinterpretation within minutes. The remaining studies incorporated a washout period between observations: 3 studies used between 2 and 8 weeks, 3 indicated 14 to 36 months, and the exact duration was unclear in 1 article (Fig. 1). Moreover, only one article [13] justified the interval and, even then, based it upon anecdotal opinion.

“Laboratory” vs. “field” settings

All articles considered aspects of generalisability to daily practice, which was the primary focus of 6 articles (Table 4). Three studies [10, 12, 32] compared “laboratory” interpretation with observers’ prior interpretation of the same cases in clinical practice. Gur [10] and Rutter [12] found higher mean observer sensitivity and specificity in normal clinical practice. While Meining et al. also found improved accuracy in the clinical environment, laboratory performance improved significantly when observers had access to clinical information [32].

Irwig [29] questioned whether results from standard tests should be revealed when new diagnostic alternatives are assessed, believing that observers may give undue weight to standard tests with which they are familiar and so confound the assessment. The authors concluded that such practice is acceptable only when the standard test is both sensitive and specific. One histopathological study examined whether unavoidable initial viewing of low-magnification images may bias subsequent interpretation of high-magnification images [31], arguing that performance would be diminished if studies were restricted to high-power fields. One article [27] explored “checkbox” bias in ROC methodology, concluding that measures encouraging readers to use the full extent of confidence scales might themselves introduce bias.

Discussion

We wished to investigate and quantify the effect on diagnostic accuracy results of blinding observers interpreting medical images to sample information, including disease prevalence. We found that, although manipulation/concealment of individual case information is relatively well investigated, including a 2004 meta-analysis of 14 studies [4], few researchers have addressed sample information. Our systematic review identified only 12 studies (9 radiological) that investigated generalisability of results from laboratory environments to daily practice and, of these, only 3 focused specifically on prevalence [6, 13, 26], 2 of which came from the same research group. Furthermore, only 2 modalities have been investigated: angiography [6] and chest radiography [13, 26]. The literature base is therefore very insubstantial. We had originally intended to perform a meta-analysis to quantify the effect of the potential biases investigated, but the paucity of available data prevented this.

Enriched prevalence may be an unavoidable aspect of study design if research is to be completed within an acceptable timeframe, with available resources and without undue observer burden. It is important to distinguish between two potential reasons why prevalence might affect sensitivity. Firstly, high-prevalence clinical settings are often associated with a more severe disease spectrum, which will in itself increase sensitivity. Secondly, prevalence may be increased without an increase in disease severity, a situation often encountered in research studies, especially of screening technologies. In this latter situation, it is uncertain how increased prevalence will affect study results. For results to be generalisable we must know the effect, if any, of these enriched study designs on measures of diagnostic test performance, including its magnitude and direction. It is widely believed that increasing prevalence raises sensitivity because disease is encountered more frequently than in daily practice [21], a view supported by Egglin et al. [6]. However, it is only where increased prevalence is associated with an increase in disease severity that there are theoretical reasons to expect prevalence to affect the ROC curve [33]. Although Gur et al. did not demonstrate a significant difference in ROC AUC despite varying prevalence [13], it does not necessarily follow that a prevalence effect does not exist. Indeed, the authors cautioned in a separate editorial [11] that while results obtained in enriched populations should be generalisable to lower-prevalence lab-based studies (provided they were analysed using ROC AUC methods), this is not the case for clinical practice. In addition, although the maximum prevalence used was 28%, this level is still well below that often employed by researchers.
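The distinction can be made explicit using the standard definitions (a textbook identity rather than an analysis of the included studies): sensitivity and specificity are conditional on disease status and contain no prevalence term, whereas predictive values do,

$$
Se = P(T^{+} \mid D^{+}), \qquad Sp = P(T^{-} \mid D^{-}), \qquad
\mathrm{PPV} = \frac{\pi\,Se}{\pi\,Se + (1 - \pi)(1 - Sp)}.
$$

Hence, if enrichment alters neither the disease spectrum nor the observer’s decision threshold, sensitivity and the ROC curve should in principle be unchanged; any prevalence effect observed empirically, as in Egglin et al. [6], must operate through the observer (e.g. altered vigilance or threshold) rather than through the arithmetic of the accuracy measures themselves.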

Our interest in sample prevalence was precipitated by studies of CT colonography for colorectal cancer screening, but we could find no research that addressed the design of these studies. Screening for lung and colorectal cancer by CT, and for breast cancer by mammography, is the subject of considerable research, but it is currently impossible to draw evidence-based conclusions regarding the effect of sample prevalence on measures of diagnostic test accuracy.

It is intuitive that observers’ prior knowledge of the sample prevalence in a study will influence their expectation of disease, and we were interested in whether this might affect measures of diagnostic accuracy. For example, it is believed that vigilance is reduced when prevalence is low (e.g. in screening) because disease is encountered infrequently [34]. Surprisingly, we could identify no research that specifically addressed this issue, whether by blinding, unblinding or misleading readers. Most studies concealed prevalence altogether, whereas some altered prevalence without readers’ knowledge. Recall bias (i.e. where interpretation is influenced by recollection of prior interpretations) is a related issue. Many studies incorporated a “washout” phase between consecutive interpretations of identical cases, but we could find no research that specifically investigated the impact of varying its duration. It could be argued that the repetitive nature of screening (in terms of material and task) favours a short washout; indeed, one study concluded that recall is rare and unlikely to bias results [28]. We could also find no research that specifically addressed the effect of manipulating reporting intensity on measures of diagnostic test performance.

Although anecdotal opinion suggests that observers’ performance in an artificial “laboratory” environment (reviewing cases enriched with pathology, far from the pressures of normal daily practice) should exceed that achieved in “the clinic,” the available evidence identified by our review [10, 12, 32] suggests the opposite. The availability of clinical information in normal practice might help explain this, but meta-analysis suggests the effect is small [4]. Another possible explanation is that observers in laboratory studies are aware that their assessments will have no clinical consequences; such “study knowledge bias” is also likely to influence observer studies, but we found no research to substantiate this. Lastly, the substantial reporting burden associated with research studies (often performed at unsocial hours so as not to interfere with normal duties) may explain why accuracy is diminished. This discrepancy between “lab” and “field” performance has important implications, not only for evaluation of diagnostic tests but also for how radiologists’ performance is assessed in isolation. For example, the PERFORMS programme for evaluating mammographic interpretation uses a cancer prevalence of 22% [35] and so may not reflect radiologists’ performance in clinical practice. Toms et al. suggested that a more accurate assessment would be obtained by sporadically introducing abnormal test cases into normal daily reporting [36].

Our review revealed that the existing evidence base is too insubstantial to guide many aspects of study design, and high-quality research is needed to investigate and quantify the biases we examined. Inevitably, studies specifically designed to answer the questions we posed will be expensive and time-consuming. For example, most studies we identified used observer samples in the single digits, and variance is likely to be high; much larger studies are required. We predict that funding would be difficult to obtain for large-scale methodological research specifically designed to quantify these potential biases. However, given that funding agencies have previously provided very substantial support for large-scale studies of screening technologies, we suggest that future studies incorporate additional research that aims to estimate bias and generalisability. For example, this could be achieved via parallel or nested sub-studies that incorporate unblinded observers or different contexts, or that vary the duration of the washout period for different groups of observers, as sketched below. Such an approach would combine large-scale diagnostic test accuracy studies with methodological research at relatively little additional cost.
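Purely as a hypothetical illustration of such a nested design (the arm durations, function name and observer labels below are invented for the example), observers of a host accuracy study could be allocated at random to washout-duration arms:

```python
import random

def allocate_washout_arms(observers, washout_weeks=(2, 8, 26), seed=42):
    """Randomly allocate observers of a host accuracy study to washout-duration
    arms for a nested methodological sub-study (hypothetical illustration)."""
    rng = random.Random(seed)   # fixed seed so the allocation is reproducible
    shuffled = list(observers)
    rng.shuffle(shuffled)
    # Deal observers round-robin so arm sizes differ by at most one
    arms = {weeks: [] for weeks in washout_weeks}
    for i, observer in enumerate(shuffled):
        arms[washout_weeks[i % len(washout_weeks)]].append(observer)
    return arms

# Example: 12 observers split across 2-week, 8-week and 26-week washout arms
print(allocate_washout_arms([f"observer_{i:02d}" for i in range(1, 13)]))
```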

Our review does have limitations. In particular, relevant research may have been missed because of a lack of search terms specific to our review question: many papers discuss potential bias, but few test it as a primary outcome. Aware of this, we used multiple search strategies and snowballing to maximise the number of studies retrieved. Even so, the total body of relevant literature we identified was small and heterogeneous in the issues addressed.

In summary, several issues central to the design of studies of diagnostic test accuracy have not been well researched, and there is an insufficient evidence base to guide many aspects of study design. High-quality research is needed to address potential bias resulting from observers’ knowledge of prevalence and the effects of recall bias across several imaging technologies and diseases, most notably for studies of screening.