Introduction

Research on clinical reasoning has been ongoing for more than three decades (Norman 2005). The initial focus of this research was on the process of clinical performance construed as a generic problem-solving ability. Researchers at Michigan State University and McMaster University observed clinical reasoning and described a generic strategy called hypothetico-deductive reasoning. While this strategy was found in experts and novices alike, expertise was linked to generating better hypotheses more quickly. However, looking at the outcome of clinical reasoning, an intriguing finding came to light: performance as measured by diagnostic accuracy was not homogeneous, with correlations across problems in the order of 0.1–0.3. Elstein referred to this phenomenon as content-specificity (Norman 2005). Variability across cases has since been found in written assessments as well as in competence and performance assessments (van der Vleuten and Newble 1995). The consequences of these lines of research have been far-reaching, most notably in assessment. The traditional long case has generally been abandoned and replaced by new assessment methods that sample broadly using numerous cases and items, such as objective structured clinical examinations (OSCEs) and written tests using multiple-choice questions (MCQs) (Norman 2005).

The phenomenon has also been labelled “case-specificity” and “context-specificity”. The three terms have different connotations and reflect different underpinning explanations for the phenomenon (Norman et al. 2006). Case-specificity simply reflects the widespread research findings of variability of performance across cases without suggesting any underlying causal principle. Context-specificity refers to repeated findings in the psychological literature showing that human behaviour is not constant and depends on circumstances (Eva 2003). Content-specificity, the initial term used by Elstein, suggests that the reason for variability from case to case is the variability of the knowledge base, i.e., knowledge of the topic presented in the case is a prerequisite for performance, and differences in performance reflect differences in mastery of the underlying knowledge. This initial hypothesis led to research on the links between knowledge, mental representations, and clinical reasoning. These studies have shown that clinicians have different types of knowledge organisation, from formal basic science knowledge, to illness scripts where knowledge is encapsulated, to clinical exemplars or instances (Custers et al. 1996; Schmidt and Boshuizen 1993; Schmidt et al. 1992). They also use different reasoning strategies, from analytical causal and hypothetico-deductive reasoning to schema-induction and pattern recognition. Expertise is not linked to the use of a single strategy (Charlin et al. 2007; Eva 2005; Norman 2005). Indeed, the illness script framework describes two steps in clinical reasoning: script activation, which is usually non-analytical in experts, and script processing, which is usually deliberately reflective and analytical (Charlin et al. 2007).

New assessment formats and new statistical methods such as generalisability analysis and structural equation modelling have provided useful insights into “case-specificity” and clinical reasoning. Recent research has shown that although specific components play a role in clinical problem-solving, elements of more generic abilities, which can be transferred across cases, also contribute to performance (Mattick et al. 2008; Wimmers et al. 2007; Wimmers and Fung 2008). Wimmers et al. used structural equation modelling techniques to test the goodness of fit of three models explaining clinical clerkship oral examination grades. They found that neither a model based wholly on a general factor nor a model based on strictly independent factors fit their data, but rather that a mixed model accounted for the data more adequately (Wimmers et al. 2007). The same type of analysis of data from a clinical examination was performed by Wimmers and Fung. Station scores included subscores on three general components, i.e., history taking, physical examination, and communication. Again, mixed models provided a better fit to the data (Wimmers and Fung 2008). Mattick et al. conducted the same type of analysis on another clinical examination. They found that content-specific factors contributed 43–54% of variance, with transferable skills accounting for 13–16% of variance (Mattick et al. 2008). The exact nature of the specific and generalisable components is, however, still a matter of discussion.

Regarding the nature of “case-specificity”, a study by Norman et al. casts doubt upon its sole link to the knowledge base (Norman et al. 1985). Residents exposed to the same cases played by two different standardized patients (SPs) performed differently, with a kappa coefficient of 0.12 for diagnostic accuracy between the two presentations (Norman et al. 1985). Generalisability analysis separates variance into different components. Assessment generally aims to separate individuals (students) on the basis of some ability. Studies using generalisability analysis have found that the proportion of variance attributed to variability of individuals (i.e., what the test is supposed to be measuring) is small. Item variability accounts for some of the variance (i.e., some items are more difficult than others). There remains, however, a large amount of variance attributed to the interaction of individuals with items (i.e., different students perform differently on different items, independently of their overall ability and independently of item difficulty). This interaction has been interpreted as reflecting the phenomenon of case specificity (Norman et al. 1985). New item formats provide interesting contexts to study the aetiology of this remaining variance, for instance by clustering items within cases (as in key features tests) or within topics (as in extended-matching tests). A recent study by Norman et al. using generalisability analysis on a key features test showed that variance associated with items nested (or clustered) within cases was larger than variance associated with cases, going against the common notion that specificity is linked to cases as such (Norman et al. 2006). However, it should be noted that the nature of case specificity may be different in different assessments. For example, different SPs may account for variance between OSCE stations, while written tests are exempt from this source of variance. Current understanding of clinical reasoning may help to interpret different findings from different studies. If clinicians, novices and experts alike, do not use a single reasoning strategy, perhaps it is this variability in reasoning strategy that accounts, at least in part, for the phenomenon (Eva et al. 1998). Wimmers and Fung have suggested that specific content-related processes may be important for easier items, with more generalisable problem-solving processes needed to solve more difficult items (Wimmers and Fung 2008). Heemskerk et al. have shown that item format and item difficulty do indeed influence reasoning strategy (Heemskerk et al. 2008).
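
To make the partition of variance described above concrete, the following is a minimal sketch in Python (ours, not taken from any of the studies cited) of a two-facet generalisability analysis for a fully crossed subjects-by-items design. The function name and the simulated data are purely illustrative.

import numpy as np

def crossed_variance_components(scores):
    """Estimate variance components for a subjects x items crossed design.

    scores: 2-D array, rows = subjects, columns = items (0/1 or graded).
    """
    n_s, n_i = scores.shape
    grand = scores.mean()
    subj_means = scores.mean(axis=1)
    item_means = scores.mean(axis=0)
    ms_subjects = n_i * np.sum((subj_means - grand) ** 2) / (n_s - 1)
    ms_items = n_s * np.sum((item_means - grand) ** 2) / (n_i - 1)
    resid = scores - subj_means[:, None] - item_means[None, :] + grand
    ms_residual = np.sum(resid ** 2) / ((n_s - 1) * (n_i - 1))
    return {
        "subjects": max((ms_subjects - ms_residual) / n_i, 0.0),  # ability: what the test targets
        "items": max((ms_items - ms_residual) / n_s, 0.0),        # item difficulty
        "subjects x items": ms_residual,                          # interaction plus residual error
    }

# Simulated dichotomous responses with latent ability and difficulty (illustration only).
rng = np.random.default_rng(0)
ability = rng.normal(size=(120, 1))
difficulty = rng.normal(size=(1, 50))
p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
scores = (rng.random((120, 50)) < p_correct).astype(float)
print(crossed_variance_components(scores))

In simulated dichotomous data of this kind, the subjects-by-items component is typically far larger than the subject component, which is the pattern interpreted above as case specificity.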

Better understanding of the nature of this undesirable error variance may be helpful in improving assessment measurements, both from the point of view of construct validity (is the test congruent with the construct of clinical reasoning it claims to measure?) and from that of reliability. As the study by Norman et al. illustrated, the reliability of a key features examination would be improved more efficiently by increasing the number of items within cases than by increasing the number of cases (Norman et al. 2006).

Norman et al. analysed a key features test where items are nested within cases. In extended-matching tests, items are cases nested in topics. Analysing the sources of variance stemming from the interaction between subjects and items and between subjects and topics would allow further study of the role, if any, of content-specificity, i.e., of differences in levels of topical knowledge in individuals. We sought to verify the interpretation of case specificity as content-specificity, hypothesising that variance attributed to subject-item interaction would be smaller than variance attributed to subject-topic interaction.

Methods

Participants

The 2005 and 2006 cohorts of first- and second-year general practice residents of the Université catholique de Louvain (UCL), Belgium, were required to take part in a day of learning activities on campus. Residents took part in a study on self-assessment which included an extended-matching test (Case and Swanson 1993) in the morning and various seminars in the afternoon, which gave them credit for residency learning activities. Although participation in the day’s activities was compulsory, some residents were unable to attend. The number of attendees was 127 of 169 (75.15%) in 2005 and 120 of 147 (81.63%) in 2006.

Test

The test included 159 extended-matching items (EMIs). EMIs are a type of multiple-choice question with a varying number of answer options (between 5 and 26), which makes them less amenable to guessing, and with several items nested within topics, so that items share an option list and a question stem, cutting down on reading time per item and thus making it possible to include more items per unit of time (example provided below) (Case and Swanson 1993). They have been shown to provide valid and reliable measures of clinical reasoning, within the constraints of written multiple-choice tests (Beullens et al. 2002; Beullens et al. 2005; Swanson et al. 2006). The items were retrieved from an item bank used for the regional examination of general practice certification in the Dutch-speaking part of the country and translated by a professional translator. General practice is extremely similar in the different regional entities in Belgium, so revalidation of content was not deemed necessary. The test covered four domains (chest, urogenital system, locomotor system, and dermatology) with 50 topics and 159 items nested within these topics. Cronbach’s alpha of the test was 0.867 in 2005 and 0.860 in 2006. Scores (percentage correct) for 2005 ranged from 28.93 to 76.73% (mean 58.76%, standard deviation (SD) 9.00); scores for 2006 ranged from 38.99 to 77.99% (mean 60.74%, SD 8.72).
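
For readers unfamiliar with the statistics reported here, the short sketch below (ours, not the scoring script actually used for the test) shows how percentage-correct scores and Cronbach’s alpha can be computed from a subjects-by-items matrix of dichotomous responses; the function names and the simulated data are illustrative only.

import numpy as np

def percentage_correct(responses):
    """Per-subject percentage-correct scores from a subjects x items 0/1 matrix."""
    return 100.0 * responses.mean(axis=1)

def cronbach_alpha(responses):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1)
    total_score_variance = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_score_variance)

# Illustration with simulated responses (127 subjects, 159 items), not study data.
rng = np.random.default_rng(1)
ability = rng.normal(size=(127, 1))
responses = (rng.random((127, 159)) < 1.0 / (1.0 + np.exp(-ability))).astype(float)
print(round(percentage_correct(responses).mean(), 2), round(cronbach_alpha(responses), 3))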

Analysis

Generalisability analysis was conducted using the urGENOVA (http://www.education.uiowa.edu/casma/GenovaPrograms.htm) and G-string II (http://www.fhs.mcmaster.ca/perd/download/) software.

The facets, all random, were: S (subjects), D (domains, i.e., chest, locomotor system, urogenital system, dermatology), T (topics), and I (items). Items were nested within topics, which were in turn nested within domains. The design used was S/I:T:D, and relative G coefficients were computed. Taking into account the total numbers of domains (4), topics nested in domains (50), and items nested in topics (159), the formula for the relative G coefficient in this design is:

$$ G = \frac{\sigma^{2}(S)}{\sigma^{2}(S) + \frac{\sigma^{2}(S \times D)}{4} + \frac{\sigma^{2}(S \times T:D)}{50} + \frac{\sigma^{2}(S \times I:T:D)}{159}} $$

where \( \frac{\sigma^{2}(S \times D)}{4} \), \( \frac{\sigma^{2}(S \times T:D)}{50} \), and \( \frac{\sigma^{2}(S \times I:T:D)}{159} \) can be considered sources of error attributable to the interaction between subjects and domains, topics, and items, respectively. The relative weight of these three error components was calculated.

To illustrate practical implications, a D-study was also conducted for the following three scenarios, all including 120 items and 4 domains: 2 topics per domain, each with 15 items; 6 topics per domain, each with 6 items; 15 topics per domain, each with 2 items.
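
The sketch below, in Python, is ours rather than the urGENOVA/G-string output: it evaluates the relative G coefficient of the formula above for the test configuration and for two of the D-study allocations. The function name and, above all, the variance component values are hypothetical placeholders, not the estimates reported in Tables 1 and 2.

def relative_g(var_s, var_sd, var_std, var_sitd, n_domains, n_topics, n_items):
    """Relative G coefficient for the S/I:T:D design, as in the formula above."""
    relative_error = (var_sd / n_domains
                      + var_std / n_topics
                      + var_sitd / n_items)
    return var_s / (var_s + relative_error)

# Hypothetical variance components, for illustration only.
components = dict(var_s=600.0, var_sd=150.0, var_std=400.0, var_sitd=6000.0)

# G-study configuration of the actual test: 4 domains, 50 topics, 159 items.
print(round(relative_g(**components, n_domains=4, n_topics=50, n_items=159), 3))

# D-study: 120 items over 4 domains, allocated to few broad or many narrow topics.
for topics_per_domain, items_per_topic in [(2, 15), (15, 2)]:
    g = relative_g(**components, n_domains=4,
                   n_topics=4 * topics_per_domain,
                   n_items=4 * topics_per_domain * items_per_topic)
    print(topics_per_domain, "topics/domain,", items_per_topic, "items/topic:", round(g, 3))
# Whatever positive components are used, the many-topic allocation yields the higher
# G coefficient, because only the topic error term shrinks when the item total is fixed.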

Ethical approval

Because of the formative nature of the whole experiment, which was part of learning activities within the residency program, ethical approval was not sought. Residents were assured that individual data would not be available to faculty members involved with residency teaching and/or assessment.

Results

The results of the generalisability (G) study are reported in Table 1. They indicate that the variance component attributed to subjects was small (645 and 601 for 2005 and 2006, respectively), accounting for only 2.64 and 2.50% of score variance. Variance attributed to items was over twice the variance attributed to topics. The variance attributed to the interaction between subjects and items was the largest component, explaining around two-thirds of the total variance.

Table 1 Results of the generalisability G study

Error attributed to subject-domain interaction represented 24.40 and 19.85% of total error, error attributed to subject-topic interaction represented 14.34 and 17.66% of total error, and the main source of error stemmed from subject-item interaction (61.26 and 62.49% of total error). Relative G coefficients were 0.798 for 2005 and 0.794 for 2006.

Results of the D-study are reported in Table 2. Increasing the number of topics results in a higher G coefficient than increasing the number of items per topic for a test with the same total number of items (0.769 for 15 topics per domain with 2 items each versus 0.657 for 2 topics per domain with 15 items each).

Table 2 Results from the D-study (using mean values of variance components)

Discussion

The results of our generalisability analysis indicate that variance linked to items is larger than variance attributed to topics. The figures from our study are extremely similar to those of Norman et al. (2006). However, the two studies used different test formats. Norman et al. used key features questions, where items are nested in cases. We used extended-matching questions, where items are cases nested in topics. Our format allowed comparison of different case presentations that pertain to the same knowledge base, at least in terms of formal knowledge. In spite of this, variance attributed to items within topics was larger than variance attributed to topics.

There are, however, limitations to these analyses. First of all, since items are nested in topics and domains, it is impossible to isolate the net effect of items by removing potential interactions with topic and/or domain. Furthermore, the subject-item interaction component is the highest-order component, which means it contains residual error variance from unknown sources.

Nevertheless, topical knowledge appears less important than response to individual clinical scenarios in explaining variance in test performance. These results question the belief that case-specificity is linked exclusively to variation in the formal knowledge base pertaining to different cases. They do not, however, make explicit what exactly determines the effect. One hypothesis is that the structure of relevant knowledge (i.e., pathophysiological causal networks, schemas, illness scripts, instances), and hence clinical reasoning strategy, may account for the variability of scores. Our subjects were residents who had already had a significant amount of clinical exposure from 18 months of clerkships and, at the time of the test, at least 9 months of general practice residency. Residents may have “recognised” the vignette as similar to a patient they had encountered and used non-analytical reasoning to answer. Other cases may have elicited analytical reasoning such as schema-induction or hypothetico-deductive reasoning. These strategies may then have been more or less successful, even on a case-by-case basis, producing variability in scores. EMIs were initially developed to test pattern recognition ability (Case et al. 1988), although Heemskerk et al., using a think-aloud protocol, found that EMIs elicit more hypothetico-deductive reasoning than non-analytic reasoning, and more so than short-answer questions (Heemskerk et al. 2008). Different reasoning strategies have been found to be linked to different levels of diagnostic accuracy (Coderre et al. 2003), although a causal link is controversial. Clinical reasoning strategy may reflect the knowledge structure of the problem (i.e., experience with a clinical problem may result both in more use of pattern recognition and in more successful use of pattern recognition) (Norman and Eva 2003).

The empirical finding of variability of a subject’s performance from one case to the next, i.e., the phenomenon labelled case-specificity, has widely been interpreted as resulting from variability in knowledge, but other hypotheses have been put forward. In fact, the nature and extent of case-specificity may differ from context to context. Test difficulty may limit specificity: tests that are too easy will elicit consistently good performance and tests that are too difficult consistently poor performance, in either case reducing variability across cases. Performance may depend on different factors at different levels of expertise, e.g., formal knowledge may be more important to novice students while patient similarity may be more important for expert clinicians (Schmidt and Boshuizen 1993; Schmidt et al. 1992). Clinical speciality has even been suggested as a potential factor (Custers et al. 1996). It has been shown that clinical reasoning strategies vary even for the same individual, and item difficulty and item format seem to be important factors too (Heemskerk et al. 2008). Kreiter and Bergus have argued that the variance attributed to the interaction between case and subject may contain residual error linked to factors not included in the model, such as occasion of measurement, and they caution against over-interpretation of this error component (Kreiter and Bergus 2007; Kreiter 2008). Finding a single explanation for results in different study settings may simply not be desirable, and we should be wary of interpreting empirical findings by means of speculation rather than sound hypothesis-testing grounded in theoretical frameworks.

While the discussions on the nature of case-specificity continue, the findings do have a practical impact. The reliability of tests using EMIs would benefit more from an increase in the number of topics than from an increase in the number of items per topic. While this may seem counter-intuitive, as variance attributed to items was greater than variance attributed to topics, careful analysis of the formula for the G coefficient reveals that the error terms are divided by the total numbers of topics and items rather than by the nested numbers (e.g., items per topic). Hence, if the total number of items is held constant, increasing the total number of topics (and thus reducing the number of items nested within topics) is the only way to increase reliability. This would, however, increase testing time for the same number of items. Data allowing such estimations were not available, so the net effect on reliability per hour of testing is unknown. Nevertheless, increasing the number of topics makes sense from a content-validity standpoint, especially in general practice, where the domain one wants to generalise to is so broad. Using topics which are more cross-sectional may be one way to strike a balance between broad sampling of cases and satisfactory reliability per unit of testing time.
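
To spell out the arithmetic behind this recommendation, the relative error for a 120-item test over four domains can be written with the notation of the formula in the Methods:

$$ \sigma^{2}_{rel} = \frac{\sigma^{2}(S \times D)}{4} + \frac{\sigma^{2}(S \times T:D)}{n_{T}} + \frac{\sigma^{2}(S \times I:T:D)}{120} $$

Only the middle term depends on how the fixed item total is allocated: with 2 topics per domain \( n_{T} = 8 \), whereas with 15 topics per domain \( n_{T} = 60 \), so that error term is divided by 7.5 while the other two remain unchanged. This is what drives the rise in the G coefficient from 0.657 to 0.769 in the D-study.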