Introduction

In developed countries, atherosclerotic cardiovascular disease represents one of the leading causes of morbidity and mortality. Coronary artery disease (CAD) is characterised by the development of plaques through the accumulation of calcified or lipid deposits in the coronary vessel walls [1, 2]. Several studies have shown a correlation between the risk of developing the various symptoms associated with CAD and plaque characteristics such as obstruction [3, 4], composition [5–7] and localisation [8–11].

Computed tomography coronary angiography (CTCA) is a robust and accurate imaging technique for the non-invasive assessment of coronary arteries with respect to the diagnosis or exclusion of significant coronary stenoses [12] as well as for the detection and semi-qualitative characterisation of coronary artery plaques [13–16]. A recent study demonstrated the composition of coronary artery plaques, as determined by CTCA, to be an independent risk factor for developing cardiac events, beyond the degree of luminal narrowing alone [17]. In addition, plaque identification with CTCA has been shown to be important for stratifying the risk of patients with CAD [18].

Until now, no objective criteria based on Hounsfield units (HU) at CTCA for plaque detection and differentiation could be defined [19–23]. Hence, a considerable intra- and interobserver variability has been reported both for the detection [24, 25] and for the volumetric assessment [26, 27] of coronary plaques. Another factor potentially affecting the reliability and accuracy of coronary artery plaque detection with CTCA might be the impact of reader experience, as recently demonstrated for the analysis of CTCA in regard to the diagnosis of coronary artery stenoses [28]. The influence of reader experience on intra- and interobserver variability as well as on evaluation time for plaque detection has not been reported so far.

The purpose of our study was to assess the effect of reader experience on variability, evaluation time and accuracy for the detection of coronary artery plaques with CTCA.

Materials and methods

Patients

Between August 2007 and September 2007, 50 consecutive patients (35 male, 15 female, aged 67.3 ± 10.4 years, range 46–86 years) undergoing CTCA for clinical reasons were enrolled in this study. The indications for CTCA were in accordance with current guidelines and recommendations [12], and CTCA ruled out significant coronary stenoses in all 50 patients. All patients suffered from atypical chest pain and had a low to intermediate risk of having CAD, as determined according to Diamond and Forrester [29]. Patients with nephropathy, known hypersensitivity to iodine-containing contrast media, previous myocardial infarction (clinical and ECG), known CAD, aorto-coronary bypass grafts or previous coronary interventions were excluded from study enrolment. Eight per cent of the patients were on chronic beta-blocker therapy. Clinical characteristics and demographic data of the patients are summarised in Table 1.

Table 1 Demographics of the 50 patients

This retrospective study was approved by the local ethics committee, which waived the written informed consent requirement.

CTCA protocol

All patients underwent dual-source computed tomography (CT) (Somatom Definition, Siemens Healthcare, Forchheim, Germany) following a standard retrospectively electrocardiography (ECG)-gated CTCA protocol. All patients received a single dose of 2.5 mg isosorbide dinitrate s. l. (Isoket, Schwarz Pharma, Monheim, Germany) before CT. No beta-receptor antagonists were given. Eighty millilitres of a non-ionic, iodinated contrast agent (iopromide, Ultravist 370, 370 mg/ml, Bayer Schering Pharma, Berlin, Germany) was injected at a flow rate of 5 ml/s followed by 30 ml saline solution through an antecubital vein. Contrast agent application was controlled by bolus tracking in the ascending aorta (signal attenuation threshold 140 HU). CT parameters were: detector collimation 2 × 32 × 0.6 mm, slice acquisition 2 × 64 × 0.6 mm by means of a z-flying focal spot, gantry rotation time 330 ms, tube potential 120 kV, tube current time product 330 mAs per rotation, and pitch of 0.2–0.5 depending on the heart rate. ECG-gated tube current modulation for radiation dose reduction was used in all patients as previously recommended [30]. CT images were acquired in a cranio-caudal direction from the level of the tracheal bifurcation to the diaphragm. CTCA data were reconstructed using a mono-segment algorithm with a slice thickness of 0.75 mm, a reconstruction increment of 0.4 mm and using a soft-tissue convolution kernel (B26f) during mid-diastole at 70% of the R–R interval. When motion artefacts were present in the data set, additional reconstructions were performed in 5% steps within the window of full tube current. The reconstruction phase with least motion artefacts as determined by the attending radiologist during acquisition was used for further analysis.

Data analysis

Three observers with different levels of experience in cardiac CT imaging according to the statement of the Society of Cardiovascular Computed Tomography, the Society of Atherosclerosis Imaging and Prevention, the Society for Cardiovascular Angiography and Interventions, and the American Society of Nuclear Cardiology [31] were involved.

Reader 1 (R1) was a first-year resident of our radiology department with less than 1 year of experience in cardiac CT (level 1 experience, i.e. 0 or more mentored examinations while being present during the performance and more than 50 interpreted mentored examinations).

Reader 2 (R2) was a third-year resident of our radiology department with 3 years of experience in cardiac CT (level 2 experience, i.e. more than 35 mentored examinations while being present during the performance and more than 150 interpreted mentored examinations).

Reader 3 (R3) was a sixth-year resident of our radiology department with 5 years of experience in cardiac CT (level 3 experience, i.e. more than 100 mentored examinations while being present during the performance and more than 300 interpreted mentored examinations).

All three readers, blinded to the clinical presentation and history of the patient, independently evaluated the 50 CTCA data sets for the presence or absence of coronary artery plaques. All data sets were anonymised and presented to the readers in random order. The randomisation was performed with a random generator having equal weights. For each individual reading, the random generator created for each of the 50 data sets a number between 1 and 50 representing the position at which the data set should be displayed during the reading. If the order number was already assigned to another data set, a new order number was created. The reading of the 50 data sets was repeated 4 weeks after the first reading.
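The randomisation procedure described above can be illustrated with a minimal Python sketch. The function name `draw_reading_order` and the `seed` parameter are assumptions made for illustration; the actual random generator used in the study is not specified beyond its equal weights.

```python
import random


def draw_reading_order(n_datasets=50, seed=None):
    """Assign each data set a unique order number in 1..n_datasets.

    Hypothetical helper: mirrors the described procedure, in which a
    number is redrawn whenever it is already assigned to another data set.
    """
    rng = random.Random(seed)
    used = set()          # order numbers already assigned
    assigned = {}         # data set index -> order number
    for dataset in range(1, n_datasets + 1):
        order = rng.randint(1, n_datasets)   # equal weights for 1..n
        while order in used:                 # collision: draw again
            order = rng.randint(1, n_datasets)
        used.add(order)
        assigned[dataset] = order
    return assigned


order = draw_reading_order(50, seed=1)
```

Drawing with rejection in this way yields a uniformly random permutation of the 50 data sets, so a single call to a shuffle routine would give a statistically equivalent reading order.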

Interface

A graphical user interface offering the standard functionality of a commercial workstation was developed with the MeVisLab software (Version 1.5 for Windows, MeVis Medical Solutions AG, Bremen, Germany). The user interface allowed automated measurement of the evaluation time in seconds per patient. The reader was instructed to start, pause/resume and finish the reading of each patient by using the corresponding buttons. The evaluation time was defined as the elapsed time between pressing the start and end buttons, which included the manual reporting of the findings on prepared sheets. When the reading was paused, the data set was hidden and the time measurement was suspended. The readers were aware of the time recording but the evaluation time was not displayed to the reader during reading. Evaluation and reporting included the following:

Segments

Coronary segments were defined and numbered according to the 16-segment scheme proposed by the American Heart Association (AHA) [32]. First, in each reading, the segments were classified by the readers as being of diagnostic image quality, non-diagnostic image quality because of major artefacts, or as anatomically not present. Only segments with diagnostic image quality were evaluated for the presence or absence of plaques.

Plaques

Three different types of plaque were visually differentiated: purely calcified plaques, purely non-calcified plaques and mixed plaques, the last of these indicating a mixture of calcified and non-calcified components. Plaques extending over more than one segment were labelled according to their most proximal segment.

Consensus

A consensus reading was performed 4 weeks after the last individual reading, in which all three readers jointly determined the classification of the segments as well as the presence and type of plaques within the segments. The readers were unaware of their individual performance during the consensus reading, which was then defined as the reference standard of the study. Discrepancies such as differing segment numbering among the readers or differing assignment of plaques extending over more than one segment were resolved during consensus reading.
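The evaluation-time measurement of the reading interface (start, pause/resume and finish, with the clock suspended while paused) can be sketched as follows. This is not the MeVisLab implementation used in the study; the class `EvaluationTimer` and its injectable `clock` argument are hypothetical names chosen for this illustration.

```python
import time


class EvaluationTimer:
    """Stopwatch that excludes paused intervals (illustrative sketch only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock        # injectable so the timer can be tested
        self._elapsed = 0.0        # seconds accumulated while running
        self._started_at = None    # None while paused or not yet started

    def start(self):
        self._elapsed = 0.0
        self._started_at = self._clock()

    def pause(self):
        if self._started_at is not None:
            self._elapsed += self._clock() - self._started_at
            self._started_at = None

    def resume(self):
        if self._started_at is None:
            self._started_at = self._clock()

    def finish(self):
        """Stop the timer and return the evaluation time in seconds."""
        self.pause()
        return self._elapsed


# Usage with a fake clock: read 100 s, pause for 60 s, read another 140 s.
ticks = iter([0.0, 100.0, 160.0, 300.0])
timer = EvaluationTimer(clock=lambda: next(ticks))
timer.start()
timer.pause()
timer.resume()
evaluation_time = timer.finish()
```

Using a monotonic clock rather than wall-clock time avoids distortions from system clock adjustments during a reading session.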

Statistical analysis

All statistical analyses were performed by using the statistical software package R (release 2.8.1 for Windows, www.r-project.org). Categorical variables were expressed as frequencies or percentages. Quantitative variables were expressed as means ± standard deviations as well as medians. The non-Gaussian distributed evaluation times were compared with a Wilcoxon signed rank test. The relationship between evaluation time and experience level was analysed with Spearman’s rank-order correlation coefficient.

To account for a potential correlation between the 16 segments analysed for plaque detection per patient, the data were treated as clustered by patient [28]. For this purpose, the bootstrap method [33] was applied with 1,000 resamples created by randomly sampling the 50 patients with replacement. If a given patient was included in the resample, all associated observations from the 16 segments of this patient were included. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and κ values were estimated for each bootstrap resample. Intraobserver variability and interobserver variability between two readings for plaque detection on a segment level were assessed using Cohen’s κ, whereas observer variability among multiple readings was assessed by Fleiss’ κ. Bias-corrected 95% confidence intervals (CI) were calculated by using the 1,000 estimates of each quantity. The 1,000 estimates of two quantities were compared using Friedman’s rank sum test [34, 35]. The relationship between intraobserver variability and experience level was assessed using Spearman’s rank-order correlation coefficient.
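The patient-level (cluster) bootstrap for Cohen's κ can be sketched in Python as shown below. Several points are assumptions of this sketch rather than the study's procedure: the analyses were performed in R, the ratings generated here are synthetic, the function names are hypothetical, and a plain percentile interval is used instead of the bias-corrected interval reported in the study.

```python
import random


def cohen_kappa(a, b):
    """Cohen's kappa for two equally long lists of binary labels (0/1)."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    pa1, pb1 = sum(a) / n, sum(b) / n
    p_exp = pa1 * pb1 + (1 - pa1) * (1 - pb1)   # chance agreement
    if p_exp == 1.0:
        return 1.0   # kappa is undefined here; arbitrary convention
    return (p_obs - p_exp) / (1 - p_exp)


def cluster_bootstrap_kappa(ratings_a, ratings_b, n_resamples=1000, seed=0):
    """Percentile 95% CI of kappa, resampling whole patients.

    ratings_a / ratings_b: dict mapping patient id -> list of segment labels.
    """
    rng = random.Random(seed)
    patients = sorted(ratings_a)
    estimates = []
    for _ in range(n_resamples):
        # Sample patients with replacement; keep all segments of each patient.
        sample = [rng.choice(patients) for _ in patients]
        a = [label for p in sample for label in ratings_a[p]]
        b = [label for p in sample for label in ratings_b[p]]
        estimates.append(cohen_kappa(a, b))
    estimates.sort()
    low = estimates[int(0.025 * n_resamples)]
    high = estimates[int(0.975 * n_resamples) - 1]
    return low, high


# Synthetic example: 50 patients x 16 segments, about 90% agreement.
gen = random.Random(42)
reading_1 = {p: [gen.randint(0, 1) for _ in range(16)] for p in range(50)}
reading_2 = {p: [x if gen.random() < 0.9 else 1 - x for x in segs]
             for p, segs in reading_1.items()}
ci_low, ci_high = cluster_bootstrap_kappa(reading_1, reading_2)
```

Resampling patients rather than individual segments keeps the within-patient correlation intact in every resample, which is the reason for clustering the data.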

Results

Consensus reading

The consensus reading revealed 663 (82.9%) segments with diagnostic image quality, 31 (3.9%) segments with non-diagnostic image quality and 106 (13.2%) anatomically non-present segments. A total of 377 (47.1%) segments were identified to be harbouring plaques. The consensus reading resulted in 125 (27.9%) calcified, 217 (48.4%) mixed and 106 (23.7%) non-calcified plaques.

Evaluation time

The average (median) evaluation time per patient in seconds for the first and second readings, respectively, was 328.3 s (307.5 s) and 321.8 s (326.7 s) for R1, 266.2 s (247.0 s) and 242.9 s (234.0 s) for R2, and 172.9 s (163.6 s) and 177.8 s (151.3 s) for R3. No significant differences (R1: p = 0.79, R2: p = 0.07, R3: p = 0.95) were found between the evaluation times of the first and second readings for all readers (Fig. 1). Significant differences were present when the R1 and R2 (p < 0.05), R1 and R3 (p < 0.05), and R2 and R3 (p < 0.05) evaluation times were compared. The evaluation time per patient showed a significant negative correlation with the experience level (r = −0.59, p < 0.05).

Fig. 1

Box plots of the evaluation times for the three readers and two readings. No significant differences were found in the evaluation times between the two readings of each reader, whereas the evaluation times of the three readers differed significantly
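Spearman's rank-order correlation, used for the relationship between evaluation time and experience level, can be sketched in plain Python as below. The helper names are hypothetical. The example uses only the three mean first-reading evaluation times, so its result is not comparable to the per-patient coefficient of r = −0.59, which was computed from the individual evaluation times.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed samples."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


experience_level = [1, 2, 3]                  # R1, R2, R3
mean_time_first_reading = [328.3, 266.2, 172.9]   # seconds, from the text
rho = spearman_rho(experience_level, mean_time_first_reading)
```

With only three strictly decreasing means the coefficient is at its minimum of about −1; the weaker per-patient correlation reflects the spread of the individual evaluation times within each reader.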

Observer variability

Reader R1 labelled 233/234 segments, reader R2 marked 253/269 segments and reader R3 indicated 321/327 segments as having plaques in the first/second reading, respectively (Table 2). Compared with the consensus, R1 missed 151/153, R2 missed 141/132 and R3 missed 71/64 segments with plaques in the first/second reading, respectively (Fig. 2). The sensitivity, specificity, negative predictive value (NPV) and positive predictive value (PPV) for all readers and readings are listed in Table 3.

Fig. 2

Confusion matrix of all readings as balloon plots showing the agreement as compared with the consensus

Table 2 Summary of the plaques detected by the two readings of the three readers
Table 3 Sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) for plaque detection with CTCA as well as the observer variability as compared with the consensus given as kappa value and 95% confidence interval (CI)
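For readers less familiar with these measures, the following Python sketch shows how sensitivity, specificity, PPV and NPV follow from a segment-level confusion matrix against the consensus. The counts in the example are hypothetical and are not study data; the function name `diagnostic_metrics` is likewise an assumption of this illustration.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard accuracy measures from confusion-matrix counts.

    tp: plaque segments correctly labelled; fn: plaque segments missed;
    fp: plaque-free segments labelled as plaque; tn: correctly plaque-free.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical counts, chosen only to illustrate the calculation.
metrics = diagnostic_metrics(tp=80, fp=10, fn=20, tn=90)
```

Missed plaques enter as false negatives and therefore lower sensitivity and NPV, whereas specificity and PPV depend on the number of false positives.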

Variability as compared with consensus

The observer variability for plaque detection, compared with the consensus, varied between κ = 0.582 and κ = 0.802 (Table 3). For both readings, R3 was significantly better than R1 (first p < 0.05/second p < 0.05) and R2 (p < 0.05/p < 0.05) (Fig. 3). There was no significant difference when comparing the first reading of R1 with the first reading of R2 (p = 0.31). A significant difference existed between the first reading of R1 and the second reading of R2 (p < 0.05) as well as between the second reading of R1 and the two readings of R2 (p < 0.05/p < 0.05).

Fig. 3

Box plots showing the variance of the observer variability as compared with the consensus. For both readings, R3 was significantly better than R1 and R2

Intraobserver variability

The κ values for the intraobserver variability were R1: κ = 0.761 (95% CI [0.693, 0.830]), R2: κ = 0.779 (95% CI [0.715, 0.835]) and R3: κ = 0.847 (95% CI [0.787, 0.902]) with significant differences between R1 and R2 (p < 0.05), R1 and R3 (p < 0.05), and between R2 and R3 (p < 0.05). A significant correlation between the experience level and the κ values for the intraobserver variability was found (r = 0.73, p < 0.05), i.e. a reader with a higher experience level performed more consistent labelling (Fig. 4).

Fig. 4

Box plots showing the variance of the intraobserver variability. Significant differences among the readers were observed

Interobserver variability

The interobserver variability among all readings was κ = 0.662 (CI [0.622, 0.704]). The κ value for the interobserver variability between two readers varied between 0.582 and 0.715.
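Fleiss' κ, the measure used for the agreement among all readings, can be computed as in the following Python sketch. The input layout (one row per segment, one column per category, six readings per segment) and the example values are assumptions made for illustration and do not reproduce the study data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of category counts.

    counts[i][j]: number of readings that assigned subject i to category j.
    Every row must sum to the same number of readings n (n >= 2).
    """
    n_subjects = len(counts)
    n_categories = len(counts[0])
    n = sum(counts[0])
    # Overall proportion of assignments falling into each category.
    p_j = [sum(row[j] for row in counts) / (n_subjects * n)
           for j in range(n_categories)]
    # Per-subject agreement among the n readings.
    p_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    p_bar = sum(p_i) / n_subjects
    p_e = sum(p * p for p in p_j)          # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical table: 4 segments, 6 readings each (3 readers x 2 readings),
# columns = [no plaque, plaque].
example = [[6, 0], [5, 1], [1, 5], [0, 6]]
kappa_all = fleiss_kappa(example)
```

In contrast to Cohen's κ, which compares exactly two readings, this statistic summarises the agreement of any fixed number of readings per segment in a single value.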

Discussion

This study is one of the first to demonstrate the effect of reader experience on the variability, time effectiveness and accuracy of coronary artery plaque detection with CTCA. With an increasing level of experience, intraobserver variability and evaluation time for coronary plaque detection decreased, while the accuracy improved.

Pugliese et al. [28] recently reported on the effect of reader experience on the time effectiveness for reading CTCA data sets with regard to the presence or absence of coronary stenoses. Although the evaluation time is not of primary interest for making the diagnosis, the parameter is of clinical importance in terms of economic factors like reimbursement. Therefore, there is a demand for keeping the evaluation time for reading radiological studies as low as possible while maintaining the accuracy of the method. Our study demonstrates a significantly decreased evaluation time for plaque detection with increasing reader experience, comparable to the previous study on reading coronary stenoses with CTCA [28].

The interobserver agreement for the detection of coronary plaques in our study (κ = 0.66) was lower than that reported by Hoffmann et al. [25] (κ = 0.89) and Ferencik et al. [24] (κ = 0.85). This may be explained by the selection of participating readers in our study, covering three different experience levels [31]. The intraobserver variability and accuracy of the reader with most experience in our study was similar to those reported in the literature [24, 25]. Our study shows that differences in variability are correlated with the experience level. As the experience level increases, the detection of coronary plaques with CTCA becomes more consistent and more accurate with lower intraobserver variability and lower variability as compared with the consensus.

Specificity and positive predictive values did not differ between the three readers, but sensitivity and negative predictive values were lower for the less experienced ones. This result could reflect the tendency of less trained readers to attribute a pathological result in case of doubt and thus to increase the number of false positives. However, it is more likely that an inexperienced reader more often misses a plaque and thus increases the number of false negative ratings. Certainly, further targeted studies are needed to investigate this issue in more detail.

During education, a reader is normally mentored by an experienced senior reader who provides her own experience as feedback and thus helps to constantly improve the reader’s performance in plaque detection. In addition to this, it might be conceivable to incorporate software-based tools to further support this decision-making and learning process [40]. Although automatic algorithms for plaque detection have already been proposed [36, 37], their performance still falls short of the needs of clinical application as they all focus on specific plaque types. None of them covered all visible plaque types in CTCA as described by Becker et al. [38].

Limitations

First, our consensus was based on CTCA images, not on intravascular ultrasound (IVUS) which is the current reference imaging technique for detecting and characterising coronary plaques. However, we only evaluated observer variability for the detection of visible plaques in CTCA and not the diagnostic accuracy of CTCA itself. Moreover, our study aimed to analyse plaques in the entire coronary artery tree, which prevents the use of IVUS as reference standard considering its applicability only in larger, proximal coronary segments. Second, the consensus was performed with the participating readers and not with an outside panel of experienced observers which might induce a bias towards the opinion of the most experienced reader. To limit this, the readers were blinded to their individual performance during consensus reading. Third, the plaques were classified according to their CTCA attenuation behaviour into calcified, non-calcified and mixed, but no sub-analyses for the different plaque types were performed. This was due to the limited number of plaques in our study population preventing a meaningful statistical analysis. Fourth, we did not use the most recent CT scanner technology for cardiac imaging [39]. Finally, each experience level was represented by one single reader; thus, no intra-experience level statistics could be performed.

Conclusions

Our study demonstrates that with increasing experience, the intraobserver variability and evaluation time of coronary artery plaque detection with CTCA decrease, while the accuracy increases.