Introduction

Digital breast tomosynthesis (DBT) is a relatively new imaging technique that is spreading rapidly in breast diagnosis centres. DBT uses a series of individual low-dose projections acquired while the x-ray tube rotates over a limited arc above the compressed breast. Using mathematical algorithms, data from these multiple low-dose projections are reconstructed into a quasi-3D breast volume of thin slices parallel to the detector plane. Thus, DBT potentially facilitates diagnosis of breast lesions by reducing tissue overlap. Published clinical studies show that the accuracy of one- or two-view DBT is equal to or better than that of conventional full-field digital mammography (FFDM). These studies also show superior lesion detection and a lower recall rate when DBT is used in combination with FFDM [1,2,3,4,5,6,7,8,9,10]. The combination of DBT plus FFDM yields mean glandular doses that may double the doses delivered in FFDM examinations [11]. According to our previous studies [12,13,14,15], the mean glandular dose of DBT is higher than that of FFDM by 50% for breast thicknesses of 5–6 cm (the most common), 40% for 3–4 cm, and 30% for 7–8 cm. Dose values are of great concern, especially when considering the incorporation of this technology into breast-screening programs. This has led most manufacturers to develop a 2D synthetic image (SI) from the reconstructed tomographic slices with the aim of substituting the FFDM images.

Some studies have addressed the clinical performance of SIs. Skaane et al. [16] conducted a prospective study over a screening population (12,270 people) with an arm aiming to compare SI with FFDM. The results show comparable performance of SI + DBT and FFDM + DBT in terms of cancer detection rates and false positive scores. The TOMMY trial [17] is a retrospective reading study with three arms (FFDM vs. DBT + FFDM vs. DBT + SI) in which 7060 cases were blindly reviewed. This study concluded that DBT + SI showed a similar performance to that of DBT + FFDM. In a retrospective study of 214 cases, Choi et al. [18] concluded that SI and FFDM show comparable detection rates for T1-stage breast cancers.

Most published studies compare the sensitivity of SI versus FFDM when used in combination with DBT slices. A direct comparison between FFDM and SI with no access to DBT slices is of interest because it avoids the influence of DBT reading on the diagnosis. The aim of the present work is to evaluate the clinical performance of SI alone compared with FFDM alone in terms of lesion detectability and BIRADS lesion categorisation [19]. In [15], we published a preliminary study based on phantom images and a reduced patient sample (50 patients). We found that the visibility of radiological findings in the clinical images (grouped as architectural distortions, micro-calcifications, and nodules) was similar for both types of images except for distortions, which were better visualised in SI (p < 0.01). However, this preliminary study had some limitations: it lacked a sufficiently large patient sample, readers had access to the corroborated patient diagnosis, and intra-observer variability was not evaluated. Therefore, we have now conducted a more conclusive study based on a sufficiently large image sample in which intra-observer variability is included as one of the sources of uncertainty.

We compare the sensitivity and specificity of FFDM and SI in order to test the non-inferiority of SI. A positive result would allow the clinical protocol based on DBT + FFDM to be replaced by DBT + SI, with the subsequent dose savings.

Material and methods

An observational, retrospective, single-centre, multireader blinded study was performed following approval of the institutional ethics committee.

Study design and patient sample

The sample size was calculated to provide a statistical power of at least 80% when establishing the non-inferiority of SI compared to FFDM regarding the diagnostic capability of the two image types [20]. The sample comprised 244 patients who underwent a 4-projection (2 breasts × 2 projections: CC and MLO) COMBO exam (the exam routinely performed in our institution at the time of the study) in a Selenia Dimensions DBT unit (Hologic Inc., Bedford, MA, USA) between May 2013 and July 2014. The COMBO modality performs an FFDM acquisition followed by a DBT acquisition with the breast under the same compression force. For all patients and all acquired projections, the SI (C-View for Hologic, version 1.0.0.1) was obtained.

All recruited patients arrived at our breast unit for a screening or diagnostic appointment. A radiologist who did not participate in the reading study selected the patients based on their final diagnosis as determined by final interpretation with complementary ultrasound or magnetic resonance examinations, or biopsy with histological studies. Inclusion criteria were: a) breasts with no mammographic findings randomly selected from all available cases and b) breasts with mammographic benign and malignant findings representing the typical range of lesions found in clinical practice. Patients with breast prostheses were not enrolled. A subset of 54 patients was included twice in order to evaluate intra-observer variability. Thus, the effective sample size was 298 patients. The sample included 119 biopsy-proven cancers, 15 high-risk lesions, 110 benign lesions, and 350 breasts with no lesions. Twenty-six breasts in the sample had two lesions. The ground truth in this study was defined in terms of BIRADS categorisation and radiological findings reported during the routine diagnosis, complementary examinations, or histological studies.

In order to guarantee blind evaluation, the FFDM and SI of each breast were separately anonymised so that the two types of images and the images corresponding to contralateral breasts were de-coupled. DBT slices were discarded since they were not used in the study. For each patient included in the sample, four anonymised studies were generated containing two images corresponding to the FFDM (SI) CC and MLO projections of the left (right) breast (see Fig. 1). All the anonymised studies were randomly ordered. In total, 1192 anonymised studies (2384 images) were included (596 FFDM and 596 SI).
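The counts quoted above follow directly from the sample description. The short sketch below only reproduces that bookkeeping; the variable names are illustrative and are not taken from the study's own software.

```python
# Bookkeeping for the anonymised reading set (illustrative variable names).
patients = 244              # recruited patients
repeated = 54               # patients included twice for intra-observer variability
breasts_per_patient = 2     # left and right
modalities = ("FFDM", "SI") # each breast is anonymised once per image type
views = ("CC", "MLO")       # two projections per anonymised study

effective_patients = patients + repeated            # 298
breasts = effective_patients * breasts_per_patient  # 596
studies = breasts * len(modalities)                 # 1192 anonymised studies
images = studies * len(views)                       # 2384 images
studies_per_modality = studies // len(modalities)   # 596 FFDM and 596 SI

print(effective_patients, breasts, studies, images, studies_per_modality)
```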

Fig. 1 Study flowchart

Reader study

The images were read by three radiologists experienced in digital mammography (over 5000 mammograms per year [21]) and DBT (over 7000 studies per year). The SI reading experience of the radiologists was 1 year on average (the first SI version was installed in late 2012). The readings started 4 months after patient recruitment and were performed in multiple sessions on an independent 5-megapixel Hologic workstation, courtesy of Emsor (distributor of Hologic systems in Spain), without the ability to recall DBT slices or previous studies of the patients. The CC and MLO projections of a single breast were presented to the reader, who had to detect mammographic findings, score their visibility, and classify the breast according to the BIRADS categorisation. Readers were blinded to the patient's clinical history and to images of the contralateral breast. It is important to point out that SI images are easily recognised by experienced readers because of their characteristic texture and the C-View tag present in the images (Fig. 2). The randomly ordered image set guaranteed that images corresponding to the same patient were separated. Image readings were spread over 8 months in order to prevent memory effects.

Fig. 2 (Left) Synthetic and (right) FFDM mediolateral oblique images of a 51-year-old woman

Readers were allowed to score a maximum of three mammographic findings in each image, each of which had to be classified into one of five categories: micro-calcification, nodule, nodule with micro-calcifications, architectural distortion, and focal density. The visibility was rated on a scale of 0 to 3 (0: no finding detected; 1: subtle visibility and very difficult characterisation; 2: medium visibility and difficult characterisation; 3: clear visibility and characterisation). Finally, the BIRADS categorisation (1–5) was used to classify each breast according to the probability of malignancy of the most suspicious finding. Readers were provided with a data sheet designed using database software (Microsoft Access) and were asked to assign a BIRADS category and to select the type of finding, scoring its visibility. Each data sheet carried a case number that matched the one displayed on the workstation.

Statistical analysis

Inter- and intra-reader agreements for both BIRADS categorisation and lesion visibility were separately evaluated for SI and FFDM. The agreement level was assessed by calculating the kappa coefficient with a 95% confidence interval (CI). Conventionally, kappa values of 0.00–0.20, 0.21–0.40, 0.41–0.60, 0.61–0.80, and 0.81–1.00 indicate minimal, fair, moderate, substantial, and near-perfect agreement, respectively [22]. Multiple-reader, multiple-case (MRMC) ROC methodology was used to compare the diagnostic capabilities of SI and FFDM. ROC curves were determined for each of the three readers and overall, using the BIRADS categorisation and lesion visibility. Smooth ROC curves were calculated using the bi-normal model [23] with the ROC function from the pROC package of the R programme. The diagnostic capability of SI and FFDM was defined in terms of the area under the ROC curve (AUC). The overall comparison of the diagnostic capabilities of SI and FFDM was obtained from the difference in AUC. Non-inferiority of SI against FFDM was evaluated using the 95% CI for the difference between mean AUCs. Confidence limits were obtained using an Obuchowski–Rockette model with Hillis improvements to calculate degrees of freedom [24], using the ORH analysis function from the RJafroc package of the R programme. To conclude that SI was non-inferior to FFDM, the lower limit of the CI was required to be above the non-inferiority margin.
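As an illustration of the agreement measure, the following minimal sketch computes a kappa coefficient for two sets of grouped BIRADS ratings and maps it to the verbal scale quoted above. It assumes Cohen's two-rater kappa, since the text does not state which kappa variant was used for the three readers; the function names and the ratings are hypothetical and are not study data. The analysis itself was run in R, so this Python version is only a sketch.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters scoring the same cases (assumed variant)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed proportion of cases on which both raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)

def agreement_label(kappa):
    """Verbal scale quoted in the text for kappa values."""
    if kappa <= 0.20:
        return "minimal"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "near-perfect"

# Hypothetical grouped BIRADS ratings (1-3 vs 4-5) for ten breasts.
reader_1 = ["1-3", "1-3", "4-5", "4-5", "1-3", "4-5", "1-3", "1-3", "4-5", "1-3"]
reader_2 = ["1-3", "1-3", "4-5", "1-3", "1-3", "4-5", "1-3", "4-5", "4-5", "1-3"]

kappa = cohen_kappa(reader_1, reader_2)
label = agreement_label(kappa)
print(round(kappa, 3), label)
```

The sketch omits the confidence interval, which in practice would come from the statistical package used for the analysis.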

Results

The radiological findings were: 121 nodules, 42 microcalcifications, 19 nodules with microcalcifications, 24 distortions, and 13 focal densities (the values correspond to the number of confirmed findings present in the breast sample). The cancers in the effective sample were 77 invasive ductal carcinoma (IDC), 7 ductal carcinoma in situ (DCIS), 11 infiltrating lobular carcinoma (ILC), 6 IDC + DCIS, and 18 other cancers.

Table 1 shows kappa coefficients (95% CI) for agreement between all three readers based on the BIRADS categorisation and lesion detectability (detected/non-detected) for both SI and FFDM.

Table 1 Inter-reader concordance for BIRADS categorisation (1–3, 4–5) and lesion detectability (detected/non-detected) for each reader and for digital mammography (FFDM) and synthetic image (SI)

BIRADS categories were grouped into two classes: 1–3 and 4–5, which separates healthy breasts or breasts with benign findings (BIRADS 1–3) from breasts with malignant lesions (BIRADS 4–5). Results show substantial agreement between readers for BIRADS categorisation in both image modalities, with a slightly higher kappa for SI. The analysis performed over the 5-step BIRADS categorisation reveals slightly poorer agreement (results not shown).

Substantial inter-reader agreement was also found for nodule and micro-calcification detectability in both FFDM and SI, while fair to moderate agreement was found for densities, distortions, and nodule+micro findings. It is important to consider that the latter result is based on a much smaller sample: only 13 densities, 24 distortions, and 20 nodules+micros were available in the patient sample, compared with 121 nodules and 42 micro-calcifications.

Intra-reader agreement for BIRADS categorisation (1–3, 4–5) and lesion detectability was almost perfect for all the readers and both image modalities (Table 2). The exceptions were reader 1, whose intra-reader agreement was only fair for densities in the SI, and reader 3 for architectural distortions in the FFDM. The 95% CIs are wide for densities, distortions, and nodules+micros because of the lower number of cases in the sample.

Table 2 Intra-reader agreement for BIRADS categorisation (1–3, 4–5) and lesion detectability (detected/non-detected) for each reader and for digital mammography (FFDM) and synthetic image (SI)

A substantial agreement between SI and FFDM was found for BIRADS categorisation and nodule and micro-calcification detectability for all three readers (Table 3). Moderate agreement was found for all other radiological findings.

Table 3 Agreement between synthetic image and digital mammography for BIRADS categorisation (1–3, 4–5) and lesion detectability (detected/non-detected) for the three readers

AUC for each reader and mean AUC for the three readers for both SI and FFDM were obtained by combining the visibility scores for all the radiological findings (Table 4).

Table 4 Area under the ROC curves for each reader and for synthetic image and digital mammography based on BIRADS categorisation and lesion visibility

The 5-step BIRADS categorisation was used to compute the ROC curve (Fig. 3a). The difference between the AUC of SI and FFDM across the three readers (Fig. 4) is -0.014 (95% CI: -0.042–0.016), which is not statistically significant (p = 0.282). Because the lower limit of the CI (-0.042) lies above the -0.05 non-inferiority margin, SI proved to be non-inferior to FFDM based on BIRADS categorisation.
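To make the link between ordinal BIRADS scores and the AUC concrete, the sketch below computes an empirical (non-parametric) AUC as the probability that a malignant breast receives a higher score than a non-malignant one, counting ties as one half. This is not the bi-normal smoothing described in the statistical analysis, so its values would generally differ slightly from a smooth-curve AUC; the function name and the scores are hypothetical.

```python
def empirical_auc(scores_positive, scores_negative):
    """Empirical AUC: P(positive score > negative score) + 0.5 * P(tie)."""
    if not scores_positive or not scores_negative:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties contribute half a win
    return wins / (len(scores_positive) * len(scores_negative))

# Hypothetical BIRADS scores (1-5) assigned by one reader.
malignant = [5, 4, 4, 3, 5]
non_malignant = [1, 2, 1, 3, 2, 4]

auc = empirical_auc(malignant, non_malignant)
print(round(auc, 3))
```

Computing this quantity per reader and per image type gives the inputs that an MRMC analysis then compares across modalities.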

Fig. 3 ROC curve for synthetic image (SI) and digital mammography (FFDM) based on BIRADS categorisation (a) and lesion visibility (b)

Fig. 4 Mean AUC difference between synthetic image and digital mammography (SI-FFDM) based on BIRADS categorisation and on lesion visibility, and 95% confidence interval for each difference. The zero difference line and the −0.05 non-inferiority margin are shown as dashed vertical lines

The difference between the AUC computed for lesion visibility (Fig. 4) in SI and FFDM across the three readers is -0.001 (95% CI: -0.035–0.037), which is not statistically significant (p = 0.9607). Because the lower limit of the CI (-0.035) lies above the -0.05 non-inferiority margin, SI proved to be non-inferior to FFDM based on lesion visibility.
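The non-inferiority decision reduces to comparing the lower confidence limit of the AUC difference (SI minus FFDM) with the margin. The sketch below applies that rule to the intervals reported above, using the −0.05 margin shown in Fig. 4; the function and variable names are illustrative only.

```python
NON_INFERIORITY_MARGIN = -0.05  # margin drawn in Fig. 4

def is_non_inferior(ci_lower, margin=NON_INFERIORITY_MARGIN):
    """SI is non-inferior when the lower CI limit of (SI - FFDM) exceeds the margin."""
    return ci_lower > margin

def includes_zero(ci_lower, ci_upper):
    """A CI that contains zero means the difference is not statistically significant."""
    return ci_lower <= 0.0 <= ci_upper

# (difference, lower 95% limit, upper 95% limit) as reported in the text.
reported = {
    "BIRADS categorisation": (-0.014, -0.042, 0.016),
    "lesion visibility": (-0.001, -0.035, 0.037),
}

decisions = {}
for criterion, (diff, lower, upper) in reported.items():
    decisions[criterion] = is_non_inferior(lower)
    print(criterion, "non-inferior:", decisions[criterion],
          "| CI includes zero:", includes_zero(lower, upper))
```

Note that a non-significant difference alone would not establish non-inferiority; it is the position of the lower limit relative to the margin that supports the conclusion.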

Regarding the sensitivity of both image modalities, the rate of correct detection of malignant lesions (ground truth = BIRADS 5) was computed, assuming that the lesion would have been detected if a BIRADS 5 or 4 had been assigned during image reading (Table 5). On average, FFDM images had a higher sensitivity (79%) than SI (75%) although this difference was not statistically significant (95%CI: -0.15–-0.16). The sensitivity was also calculated by considering only those malignant lesions scored as BIRADS 5 (Table 5). The results showed that on average, SI images had higher sensitivity (63%) than FFDM (58%) and the differences were statistically significant (Mean difference = 0.046, p = 0.001).

Table 5 Sensitivity (S) and specificity (Sp) for synthetic image (SI) and digital mammography (FFDM) of each individual reader and the means across the three readers. Sensitivity values are for malignant lesions categorised as either BIRADS 4 or 5, and as BIRADS 5 only. Specificity values are for breasts categorised as BIRADS 1–3, and for BIRADS 1 breasts

Table 5 also shows the specificity for each reader and the mean for each imaging type. The specificity was calculated by dividing the total number of breasts scored as benign or without lesion (BIRADS 1–3) by the total number of breasts in the sample that were benign or without lesion [3]. SI and FFDM presented similar specificity, and the differences in the mean values are not statistically significant. In a similar way, the specificity was recalculated by taking into account only the breasts scored as BIRADS 1 (breasts without lesions). SI presented a higher specificity (86%) than FFDM (81%), and the differences were statistically significant (mean difference = 0.049; 95% CI: -0.072–-0.015; p = 0.007).
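The two sensitivity and two specificity definitions differ only in the BIRADS threshold applied to the reader's score. The sketch below illustrates this with hypothetical scores (not study data) and hypothetical function names. It assumes that the denominator stays the same when the threshold is tightened, which the text does not state explicitly for the stricter variants.

```python
def sensitivity(assigned_birads, threshold):
    """Fraction of truly malignant breasts scored at or above `threshold`."""
    if not assigned_birads:
        raise ValueError("no malignant cases supplied")
    return sum(b >= threshold for b in assigned_birads) / len(assigned_birads)

def specificity(assigned_birads, threshold):
    """Fraction of truly non-malignant breasts scored at or below `threshold`."""
    if not assigned_birads:
        raise ValueError("no non-malignant cases supplied")
    return sum(b <= threshold for b in assigned_birads) / len(assigned_birads)

# Hypothetical reader scores, split by ground truth.
scores_for_malignant = [5, 5, 4, 3, 5, 4, 2, 5]
scores_for_non_malignant = [1, 1, 2, 3, 4, 1, 1, 2, 1, 3]

sens_4_or_5 = sensitivity(scores_for_malignant, threshold=4)      # BIRADS 4 or 5 counts as detected
sens_5_only = sensitivity(scores_for_malignant, threshold=5)      # only BIRADS 5 counts as detected
spec_1_to_3 = specificity(scores_for_non_malignant, threshold=3)  # BIRADS 1-3 counts as negative
spec_1_only = specificity(scores_for_non_malignant, threshold=1)  # only BIRADS 1 counts as negative

print(sens_4_or_5, sens_5_only, spec_1_to_3, spec_1_only)
```

Tightening either threshold can only lower the corresponding rate for a given reader, which is why the stricter figures in Table 5 are smaller than the grouped ones.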

Discussion

In this work, we demonstrate that the clinical performance of SI is not inferior to that of FFDM images for lesion visibility or BIRADS categorisation. Other published studies [16,17,18] compared the clinical performance of SI + DBT and FFDM + DBT. At present, the use of SI as a valid replacement for FFDM in DBT examinations is under debate, and the good results obtained with DBT in screening programs intensify this debate. The inclusion of DBT in these programs entails overcoming important challenges such as the time required to interpret DBT exams and the radiation dose. Therefore, we consider that a direct comparison of both images can inform this discussion. Zuley et al. [25] directly compared SI and FFDM in terms of the malignancy probability assigned to various radiological findings, and they found that both image types were comparable in performance. In our work, SI and FFDM were compared in terms of lesion visibility, while malignancy probability was evaluated through BIRADS categorisation. Both studies support the validity of SI as a replacement for FFDM images in DBT examinations, which would yield substantial dose savings. According to our results in previous studies, dose values are reduced by 40–45% when using the SI instead of FFDM [13,14,15].
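The 40–45% figure is consistent with the relative DBT doses quoted in the Introduction (DBT dose 50%, 40%, and 30% higher than FFDM, depending on breast thickness). The back-of-the-envelope sketch below makes that arithmetic explicit under two assumptions that are ours rather than statements from the dosimetry studies: the COMBO dose is the simple sum of the FFDM and DBT doses, and the SI adds no dose because it is computed from the DBT data.

```python
# Relative excess of DBT dose over FFDM dose, as quoted in the Introduction.
dbt_excess_over_ffdm = {
    "3-4 cm": 0.40,
    "5-6 cm": 0.50,
    "7-8 cm": 0.30,
}

savings = {}
for thickness, excess in dbt_excess_over_ffdm.items():
    ffdm_dose = 1.0                    # FFDM dose taken as the unit
    dbt_dose = ffdm_dose * (1.0 + excess)
    combo_dose = ffdm_dose + dbt_dose  # assumption: DBT + FFDM doses simply add
    si_protocol_dose = dbt_dose        # assumption: SI requires no extra exposure
    savings[thickness] = (combo_dose - si_protocol_dose) / combo_dose
    print(thickness, f"{savings[thickness]:.1%}")
```

Under these assumptions, the relative saving is the FFDM share of the combined dose, which comes out at roughly 40% to 43% across the three thickness ranges.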

It is important to note that SI is the result of computational algorithms that evolve over time and differ between manufacturers. Skaane et al. [16] and Gur et al. [26] analysed the performance of one of the first versions of the Hologic C-View SI and found that SI performed worse than FFDM. Locatelli et al. [27] reported low sensitivity and reduced conspicuity when using the SI generated by a DBT system from a different manufacturer. Thus, the conclusions of this study are only valid for the SI version used in this research.

The clinical protocol followed in our institution prior to this study included two-view DBT + FFDM acquisitions per breast. The results obtained in this study support omitting FFDM, and, currently, only DBT acquisitions with SI are performed, with the subsequent dose savings. Other clinical protocols, such as one-view DBT + two-view 2D, can also provide important dose savings [28, 29]. As with SI, the option of performing one-view DBT instead of two-view DBT needs to be supported by studies demonstrating that the two have similar diagnostic capability. The results of these studies will also depend on the specific characteristics of the different DBT systems and cannot be easily generalised.

Inter- and intra-reader agreement was evaluated by grouping BIRADS categories into 1–3 and 4–5, to separate healthy breasts and breasts with benign lesions from breasts with malignant lesions. This may have introduced a limitation in the study, as less conspicuous lesions with BIRADS assignations of 2 or 3 become indistinguishable from undetected lesions, for which a BIRADS 1 would be assigned. To overcome this limitation, the specificity was also computed considering only breasts categorised as BIRADS 1. Furthermore, the sensitivity was also estimated taking into account only the breasts in the BIRADS 5 category. In both cases, SI performed better than FFDM. Another potential limitation is the diminished statistical power obtained for lesion visibility due to the smaller sample available for each type of lesion.

In conclusion, this study proves that the clinical performance of SI is not inferior to that of FFDM even when DBT planes are not present during image reading.