Introduction

Magnetic resonance imaging of the breast has been established worldwide as a highly accurate imaging modality for the detection of breast cancer, with a sensitivity of up to 100 % [1–3]. However, breast MRI can be challenging: many different criteria can be used for image interpretation, and technical recommendations encompass a broad variety of examination and interpretation quality. Several recommendations have been published that were designed to improve the standardization of breast MRI acquisition and reporting [4–8]. The most widely accepted standard is the American College of Radiology (ACR) Breast Imaging Reporting and Data System (BI-RADS) lexicon [7]. It contains a structured common language for interpretation and reporting of mammography (MG), ultrasound (US), and MRI. Without a doubt, the BI-RADS lexicon facilitates communication among physicians through the use of a standardized terminology. The MRI BI-RADS lexicon features cover lesion morphology, such as margins and internal enhancement pattern, and functional contrast enhancement kinetics. However, the BI-RADS lexicon does not provide defined rules by which to convert specific imaging features into a diagnostic category [7]. Moreover, the use of multiple diagnostic criteria is associated with the risk of information redundancy [9]. As a consequence, inter-reader agreement of BI-RADS is generally moderate while diagnostic accuracy is highly variable [10–13].

A scoring system is defined as a clinical decision rule that leads to a prognostic estimation or a diagnostic category by incorporating several criteria [14]. Scoring systems have been investigated in order to assist radiologists in characterising MRI findings and improving their specificity for the prediction of breast lesion malignancy [8, 15–21]. Baltzer et al. proposed a classification Tree flowchart as a structured and intuitive algorithm for the differentiation of malignant and benign lesions [15]. In that algorithm, five diagnostic criteria independently contribute to lesion diagnosis and each specific combination of criteria provides a likelihood of malignancy. Such a scoring system would be expected to improve inter-reader agreement and may reduce experience-related variability. According to the Centre of Evidence-Based Medicine reasoning, the Tree scoring system of Baltzer et al. was exploratory. In order to achieve a high level of diagnostic accuracy, a validating study in one clinical centre is required [14].

Our aim was to investigate the diagnostic performance and inter-reader agreement of the Tree scoring system and compare these parameters with standard MRI BI-RADS lexicon reading.

Materials and methods

Study cohort

Our institutional review board (IRB) approved this retrospective diagnostic single-centre study and waived the necessity for informed consent. Four hundred and fifty-nine consecutive patients undergoing MRI examinations from April 4th, 2013, to September 25th, 2014, were eligible. MRI, in accordance with international recommendations, was performed to evaluate the following conditions: (a) unclear findings (conventional imaging BI-RADS 0 and 3); (b) suspicious lesions or lesions highly suggestive of malignancy (conventional imaging BI-RADS 4 and 5); and (c) preoperative staging of biopsy-proven breast cancer (BI-RADS 6). No high-risk screening patients were included in this study. A further inclusion criterion was a reference standard by means of histopathological analysis, either by image-guided biopsy (US-guided core biopsy or vacuum-assisted biopsy under MG/MRI guidance) or open surgery according to international guidelines for quality assurance in breast cancer screening and diagnosis [22]. Board-certified breast pathologists performed the work-up of breast tissue specimens. Patient selection details are given in Fig. 1.

Fig. 1

Patient selection flow-chart and final lesion diagnoses stratified by presentation as mass or non-mass

Magnetic resonance imaging of the breast

MRI examinations were performed on a 1.5 T scanner (Siemens Espree; Erlangen, Germany) with the patient in the prone position, using a dedicated bilateral four-channel breast coil. The examination protocol included: an axial T2-weighted turbo spin echo (TSE) sequence without fat saturation (TR/TE: 5160/197 ms; spatial resolution 0.7 × 0.7 × 3 mm; acquisition time 3:07 min); an axial turbo inversion recovery magnitude (TIRM) sequence (TR/TE: 10460/64 ms; spatial resolution 1.1 × 1.1 × 4 mm; acquisition time 2:59 min); and an axial diffusion-weighted echo-planar imaging (DW-EPI) sequence (TR/TE: 6300/117 ms; spatial resolution 1.6 × 1.6 × 3 mm; b values of 0 and 1000 s/mm²; fat saturation by water excitation; 12 averages; acquisition time 2:50 min). During the time course of this study, two different dynamic contrast-enhanced protocols were used because the protocol was modernized. Contrast agent dosage and injection (20 mL Gadoteridol [ProHance, BRACCO, Italy] at 2.5 mL/s, followed by 20 mL saline solution) were kept constant. Protocol 1 implemented an axial T1-weighted 2D-FLASH sequence (TR/TE: 129/4.74 ms; spatial resolution 0.9 × 0.9 × 3 mm; acquisition time per measurement 1:09 min; one measurement before and four after contrast medium injection), and the modernized protocol 2 used an axial T1-weighted 3D-FLASH sequence with radial k-space sampling and Dixon fat saturation (TR/TE1/TE2: 12.4/4.77/9.54 ms; spatial resolution 0.7 × 0.7 × 2 mm; acquisition time per measurement 1:35 min; one measurement before and three after contrast medium injection). ADC maps and scaled subtractions were calculated automatically, using the vendor-supplied scanner software. All image data sets were stored in our picture archiving and communication system (PACS, IMPAX EE, AGFA, Bonn, Germany).

Data analysis

All examinations were independently analysed by four breast imaging radiologists, blinded to the initial radiological BI-RADS (both at conventional and MR imaging) category assignment and the final histopathological diagnosis. The readers had different levels of experience in breast MRI and they were classified according to the number of cases read prior to this study with histological verification: less experienced (R1, 200 cases); intermediately experienced (R2, R3, 600 cases each); and highly experienced (R4, >5000 cases).

They were asked to assign a BI-RADS rating to each lesion according to the suspicion of malignancy (BI-RADS 2 to BI-RADS 5), in congruence with the fifth edition of the MRI BI-RADS lexicon [7]. The results were stored in a spreadsheet. Subsequently, the radiologists followed a previously published scoring system (Tree) based on five diagnostic criteria and containing 11 assignment categories that corresponded to an increasing probability of malignancy (1 = lowest, cancer very unlikely, to 11 = highest, cancer highly probable; Fig. 2, [15]). A detailed explanation of the diagnostic criteria of the Tree including schematic drawings and example cases is provided as Supplemental Material 1. A diagnostic category was chosen by following the Tree criteria and noted in a spreadsheet. Before the beginning of image analysis, a training session was held, in which readers were shown 10 example cases that were not part of the study cohort, to demonstrate the application of the Tree scoring system. Examples of mass and non-mass lesions are given in Figs. 3 and 4.

Fig. 2

Tree scoring system flow-chart following the initial description by Baltzer et al. 2014 [15]. Terminal nodes are hierarchically ordered and denote an increasing probability of malignancy (1 = lowest, 11 = highest)

Fig. 3

Mass lesion examples. Upper row (capital letters) shows a fibroadenoma, presenting as a mass lesion with circumscribed margins, heterogeneous internal enhancement, and a persistent signal enhancement time curve. Based on the Tree system (Fig. 2), the absence of the root sign resulted in a node 1 (benign finding most likely) rating. The lesion was classified as BI-RADS 3 (T2w A, early B, and late C subtractions). The lower row shows an invasive carcinoma presenting as an irregularly shaped mass lesion. Margins are not circumscribed, with some small spiculations. The curve type is washout and a perifocal oedema is present as high SI on T2w. Based on the Tree system, the root sign is present, the curve type is washout, and perifocal oedema is present, which resulted in a node 11 (malignant finding most likely) rating. T2w (a), early (b), and late (c) subtractions

Fig. 4

Non-mass lesion examples. Upper row (capital letters) shows a focal inflammation, presenting as a regional non-mass lesion with homogeneous enhancement and a persistent signal enhancement time curve. Based on the Tree system (Fig. 2), the absence of a root sign, the persistent time curve type, and the non-circumscribed margins resulted in a node 3 (benign finding likely) rating. The lesion was classified as BI-RADS 3 (T2w A, early B, and late C subtractions). The lower row shows an invasive lobular carcinoma presenting as a clumped segmental non-mass lesion. Some readers reported persistent signal enhancement and some readers reported plateau enhancement time curves. Based on the Tree system, the root sign is present, the curve type is either persistent or plateau, and oedema is absent, resulting in a node 6 or 7 rating (malignant finding suspected, refer to biopsy). T2w (a), early (b), and late (c) subtractions

Statistical analysis

Statistical analyses were performed using SPSS 22.0 (SPSS, IBM, USA) and MedCalc 15 (MedCalc software bvba, Ostend, Belgium). All calculations were primarily performed on a per-lesion basis. Final diagnosis, as the reference standard for each lesion, was obtained from histopathology reports prospectively stored in our institutional database and prospectively checked for congruence during our weekly interdisciplinary meetings. Inter-reader agreement in the assigned MRI BI-RADS and Tree categories was assessed using kappa statistics. A receiver operating characteristic (ROC) analysis was performed to determine overall diagnostic performance, measured by the area under the ROC curve. Further, sensitivity and specificity and likelihood ratios were calculated at cut-off values of BI-RADS >3 and Tree >4 and >7. Statistical significance was assumed at P-values ≤0.05.
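As an illustration of the cut-off based parameters described above, the following minimal Python sketch dichotomizes an ordinal rating at a cut-off (for example BI-RADS >3 or Tree >4) and derives sensitivity, specificity, and likelihood ratios. It is not the SPSS or MedCalc procedure used in this study; the function name `diagnostic_metrics` and the toy ratings are hypothetical and serve only to show the calculation.

```python
def diagnostic_metrics(ratings, is_malignant, cutoff):
    """Sensitivity, specificity and likelihood ratios for 'rating > cutoff'.

    ratings      : ordinal category per lesion (e.g. BI-RADS 2-5 or Tree 1-11)
    is_malignant : histopathological reference standard per lesion (bool)
    cutoff       : a lesion is test-positive if its rating exceeds this value
    """
    tp = fp = tn = fn = 0
    for rating, malignant in zip(ratings, is_malignant):
        positive = rating > cutoff
        if positive and malignant:
            tp += 1
        elif positive and not malignant:
            fp += 1
        elif not positive and malignant:
            fn += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one malignant and one benign lesion")

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Likelihood ratios are undefined (infinite) at the extremes.
    lr_pos = sensitivity / (1 - specificity) if specificity < 1 else float("inf")
    lr_neg = (1 - sensitivity) / specificity if specificity > 0 else float("inf")
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "sensitivity": sensitivity, "specificity": specificity,
        "lr_pos": lr_pos, "lr_neg": lr_neg,
    }


# Hypothetical toy data, not taken from the study cohort.
toy_ratings = [2, 3, 4, 5, 5, 3, 4, 2]
toy_truth = [False, False, True, True, True, False, False, False]
toy_metrics = diagnostic_metrics(toy_ratings, toy_truth, cutoff=3)
print(toy_metrics)
```

The same function can be applied per reader and per lesion type (mass or non-mass) by passing the corresponding subset of ratings.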

Results

Lesion characteristics

Our final study cohort consisted of 100 subjects (mean age 53 ± 14 years [standard deviation]; age range 25–88 years). Here, a total of 121 lesions were histologically verified, 52 (43 %) of them malignant and 69 (57 %) benign. Thirty-seven malignant lesions presented as masses (71 %; mean size 2 ± 1.1 cm; range 0.7–4.9 cm), 15 presented as non-mass lesions (29 %; mean size 4.16 ± 3.3 cm; range 1–13 cm). Forty-one (60 %) benign lesions were masses (mean size 1.39 ± 0.9 cm; range 0.5–5 cm) and 28 (40 %) presented as non-mass lesions (mean size 2 ± 2.5 cm; range 0.6–10 cm).

Details on lesion diagnoses stratified by presentation as mass or non-mass are given in Fig. 1.

Inter-reader agreement

The kappa agreement among the four readers for the characterization of breast lesions as cancers was substantial to almost perfect (k = 0.643–0.896) for Tree, while it was only moderate for BI-RADS (k = 0.455–0.657, Table 1).

Table 1 Kappa values and AUC difference significance level for BI-RADS and Tree in all four readers
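For readers unfamiliar with kappa statistics, the sketch below computes Cohen's kappa for one pair of readers and maps it to the commonly used Landis and Koch agreement bands (moderate, substantial, almost perfect). This is an illustrative assumption: the text does not specify which kappa variant or weighting was used, and the ratings shown are hypothetical rather than study data.

```python
from collections import Counter


def cohen_kappa(reader_a, reader_b):
    """Unweighted Cohen's kappa for two readers rating the same lesions."""
    if len(reader_a) != len(reader_b) or not reader_a:
        raise ValueError("both readers must rate the same, non-empty set of lesions")
    n = len(reader_a)
    # Observed proportion of lesions on which both readers agree.
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n
    # Agreement expected by chance from each reader's marginal frequencies.
    count_a, count_b = Counter(reader_a), Counter(reader_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1:
        return 1.0  # both readers used a single identical category
    return (observed - expected) / (1 - expected)


def landis_koch(kappa):
    """Verbal agreement label following the Landis and Koch bands."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical dichotomized ratings (1 = suspicious, 0 = not suspicious).
r_a = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
r_b = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
k = cohen_kappa(r_a, r_b)
print(round(k, 3), landis_koch(k))
```

With four readers, a pairwise calculation of this kind yields a range of kappa values, which is one plausible way a range such as those reported in Table 1 could arise.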

Area under the ROC curve

The area under the ROC curve (AUC) for lesion diagnosis ranged between 0.889 and 0.943 for Tree and between 0.872 and 0.953 for BI-RADS (Fig. 5). The AUC of Tree reading was higher than that of BI-RADS reading in the less-experienced and intermediately-experienced readers; however, this difference was only significant for R2 (Tables 1 and 2).

Fig. 5

Receiver operating characteristics (ROC) curves of BI-RADS (left) and Tree (right) ratings. Note that a high sensitivity level, corresponding to a rule-out criterion, was achieved for all observers using both techniques, but with a more uniform specificity for Tree (for details see Tables 1 and 3)

Table 2 Areas under the ROC curves (AUC) for BI-RADS and Tree reading for all four readers with their corresponding standard errors and 95 % confidence intervals

No significant differences were found between the AUC of intermediately experienced readers (R2, R3) and the expert reader (R2/R4 P = 0.4; R3/R4 P = 0.08) in Tree reading, but there was a significant difference between R1 and R4 (P = 0.02). In BI-RADS reading, the expert reader (R4) performed significantly better than all less-experienced readers (R1/R4 P = 0.004; R2/R4 P = 0.01; R3/R4 P = 0.002).
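The AUC of an ordinal rating can be read as the probability that a randomly chosen malignant lesion receives a higher rating than a randomly chosen benign lesion, with ties counted as one half. The following Python sketch computes the AUC in exactly this pairwise way; it is a minimal illustration on hypothetical ratings and does not reproduce the statistical comparison of AUCs (P-values) reported above.

```python
def auc_from_ratings(ratings, is_malignant):
    """Empirical AUC as the fraction of malignant/benign pairs ranked correctly."""
    malignant = [r for r, m in zip(ratings, is_malignant) if m]
    benign = [r for r, m in zip(ratings, is_malignant) if not m]
    if not malignant or not benign:
        raise ValueError("need at least one malignant and one benign lesion")
    score = 0.0
    for m_rating in malignant:
        for b_rating in benign:
            if m_rating > b_rating:
                score += 1.0   # pair ranked correctly
            elif m_rating == b_rating:
                score += 0.5   # tie counts as half
    return score / (len(malignant) * len(benign))


# Hypothetical Tree node ratings (1-11), not taken from the study cohort.
toy_ratings = [1, 3, 6, 7, 4, 9, 11, 6]
toy_truth = [False, False, False, False, True, True, True, True]
toy_auc = auc_from_ratings(toy_ratings, toy_truth)
print(toy_auc)
```

Because only the rank order of ratings matters, the same function applies to BI-RADS categories and Tree nodes alike.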

Sensitivity and specificity

Detailed reading results using BI-RADS and Tree, with the corresponding sensitivity and specificity values and 95 % confidence intervals for each reader stratified by presentation as mass or non-mass are displayed in Table 3.

Table 3 Diagnostic parameters of BI-RADS reading results and Tree reading results for all four readers stratified by lesion presentation as mass or non-mass lesions

Tree achieved sensitivity equal to that of BI-RADS readings (differences in sensitivity each P > 0.1), ranging between 96.2 and 98.1 % (Table 3). Specificity was significantly improved by Tree reading in the inexperienced reader (R1, specificity difference: 16 %, 95 %CI 6.9–16 %, P = 0.001). False-positive cases were reduced by 30.6 % (36 FP with BI-RADS vs. 25 FP with Tree). Specificity did not change in the readers with intermediate and high experience (P > 0.1). Similar results were observed in mass and non-mass lesions. R1 had a higher specificity in mass lesions using Tree compared to BI-RADS [difference 17.1 % (95 %CI 3.1–17.1 %), P = 0.0156]. No further differences were observed between Tree and BI-RADS in mass and non-mass lesions (P > 0.1, respectively). The sensitivity and specificity of Tree did not differ between masses and non-masses (P > 0.1, respectively).
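The relative reduction in false positives and the specificity difference for R1 can be cross-checked with simple arithmetic. The sketch below assumes that the 36 and 25 false-positive ratings refer to R1 across all 69 benign lesions; the resulting specificity values are derived here for illustration and are not quoted from Table 3.

```python
benign_lesions = 69   # benign lesions in the cohort
fp_birads = 36        # false positives with BI-RADS (assumed: R1, all lesions)
fp_tree = 25          # false positives with Tree (assumed: R1, all lesions)

# Relative reduction in false-positive ratings.
relative_fp_reduction = (fp_birads - fp_tree) / fp_birads

# Specificity = true negatives / all benign lesions.
specificity_birads = (benign_lesions - fp_birads) / benign_lesions
specificity_tree = (benign_lesions - fp_tree) / benign_lesions
specificity_gain = specificity_tree - specificity_birads

print(f"FP reduction:       {relative_fp_reduction:.1%}")
print(f"Specificity BI-RADS {specificity_birads:.1%}, Tree {specificity_tree:.1%}")
print(f"Specificity gain:   {specificity_gain:.1%}")
```

Under this assumption, the 11 avoided false positives correspond to a 30.6 % relative reduction and a specificity gain of about 16 percentage points, consistent with the values stated above.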

Discussion

Our study investigated the inter-reader agreement and diagnostic performance of Tree, a scoring system for breast MRI. Tree leads the radiologist step-by-step toward the final diagnosis of a breast lesion detected on MRI, using an intuitive flow-chart that is easy to follow [15]. We found that Tree is a highly accurate scoring system that improves inter-reader agreement. This was especially evident for the least experienced radiologist. These results have important clinical implications: in addition to BI-RADS, Tree provides specific guidance about what certain combinations of lesion features indicate with regard to potential malignancy. This simplifies and structures the process of lesion interpretation. Our results demonstrate a reduction of inter-observer variability related to reader experience. Tree uses a small number of diagnostically relevant criteria in a simple flow-chart to lead the reader toward a definite diagnosis. The Oxford Centre for Evidence-Based Medicine requires diagnostic scoring systems such as Tree to be validated in an independent study in order to prove their diagnostic value [14]. Our study follows this recommendation by validating Tree in an independent study, conducted in another department using different MRI technology and readers.

For masses, the MRI BI-RADS terminology is effective in predicting malignancy and has a good reproducibility for the final category assignment [23–26]. Similar results could not be confirmed for non-mass lesions: it has been reported that BI-RADS descriptors fail to correctly diagnose non-mass lesions and that several morphologic and dynamic features show an overlap of diagnostic information [9, 23, 24, 26]. Consequently, there is a wide variability among radiologists in choosing the best BI-RADS lesion description, especially when reporting non-mass lesions. In addition, the diagnostic performance of MRI BI-RADS reading is affected by reader experience, as demonstrated by our results and a prior study. This prior study reported the poorest diagnostic outcomes in the interpretation of MRI when less experienced readers assessed non-mass lesions [25]. Our findings show similar results for Tree reading in mass and non-mass lesions. Of note, all readers achieved 100 % sensitivity in non-mass lesions while specificity stayed similar or improved with Tree. This improvement did not reach statistical significance due to the low number of cases, but, similar to mass lesions, the improvement was strongest in the inexperienced reader. While the original Tree article did not investigate this issue [15], we demonstrated that Tree is applicable not only to masses, but also to non-mass lesions. Moreover, inter-reader agreement was improved by using Tree. As a result, Tree can be applied without restrictions in the diagnostic setting.

Some limitations of our research merit acknowledgment. One would expect a certain bias toward higher inter-reader agreement by the mono-centric design of this study. However, all four readers were trained in four different institutions and underwent only a short training session, as described above. Therefore, our results clearly demonstrate the high reproducibility of Tree, which was superior to the BI-RADS reading approach. It is not our intent to replace the BI-RADS lexicon. On the contrary: Tree is complementary to BI-RADS, as it provides empirically validated guidance where no specific recommendations are contained in BI-RADS. We validated the previously proposed Tree classification algorithm. Both the initial study and this validation study focused on histologically verified lesions. Although we thus confirmed the applicability of Tree and demonstrated its use in mass and non-mass lesions alike, our results may not directly apply to a general population. In such a population, because of the large number of negative MRI cases not referred for biopsy, specificity is likely to be higher [27]. Our study was performed considering MRI features only, and did not integrate patient characteristics. Integrating patient characteristics yielded higher diagnostic accuracy in a prior study on non-mass lesions [28]. Of note, the Tree system allows integration of further diagnostic data, as discussed in [15]. In addition to clinical and conventional findings, quantitative information, e.g., from DWI, may be integrated [29, 30]. However, this approach was beyond the aim of our study and has not yet been validated.

In conclusion, the Tree scoring system improves inter-reader agreement and achieves a diagnostic performance similar to that of BI-RADS. Less experienced radiologists, in particular, benefit from Tree.