Introduction

Digital breast tomosynthesis (DBT) is rapidly emerging as the new standard of care for breast cancer screening. This novel x-ray technique images the breast with multiple low-dose exposures obtained along an arc, which are reconstructed into a series of thin images or “slices” of the breast [1, 2]. The ability to scroll through the multiple reconstructed images minimizes the impact of overlapping structures, a limitation of two-dimensional mammographic imaging [3]. The three-dimensional format of the DBT images allows better localization of lesions and improves the conspicuity of both benign and malignant lesions.

Thus far, early studies comparing screening with DBT combined with digital mammography (DM) to screening with DM alone have shown reductions in recall of 15 to 37 % [4–11] and increases in cancer detection of 10 to 35 % [4–10]. These results have prompted the Centers for Medicare and Medicaid Services to introduce billing codes adding a global reimbursement of approximately $56 [12] for DBT imaging, further promoting the adoption of this new technology. While these prior studies are encouraging, the majority have not included the patient-level follow-up necessary to assess false negative or interval cancer rates. Additionally, the modalities may have been used differentially across patient groups, so estimates of benefit may need to be adjusted so that the groups compared are statistically comparable.

We present results comparing screening outcomes using DBT screening to DM alone from three research centers participating in the Population-Based Research Optimizing Screening through Personalized Regimens (PROSPR) consortium. The consortium includes large academic centers as well as community clinics reflecting a population-based evaluation of the possible benefit of DBT. We evaluated patient level data and conducted an analysis among a subset of patients with at least one year of follow-up to assess cancer rates, cancer detection rates, false negative rates, sensitivity, specificity, and positive predictive value.

Methods

Study setting

This study was conducted as part of the NCI-funded PROSPR consortium. The overall aim of PROSPR is to conduct multi-site, coordinated, transdisciplinary research to evaluate and improve cancer screening processes. The ten PROSPR Research Centers reflect the diversity of US delivery system organizations. Our study included three PROSPR Research Centers evaluating breast cancer screening—University of Pennsylvania, an integrated health care delivery system; University of Vermont, a statewide breast cancer surveillance system; and Geisel School of Medicine at Dartmouth in conjunction with Brigham and Women’s Hospital, a primary care clinical network. A conceptual model of the breast cancer screening process with further details about the PROSPR research centers has been published previously [13]. All activities were approved by the institutional review boards at each research center and by the PROSPR Statistical Coordinating Center.

Data collection

We pooled data from PROSPR’s central data repository to evaluate breast cancer screening outcomes with DBT in combination with DM (for brevity, henceforth called DBT) compared to DM alone. The overall study time frame was from 2011 to 2014; data availability varied by time for each research center (Fig. 1). University of Pennsylvania (UPenn) began DBT screening for all patients on October 1, 2011 at a single imaging facility. A low volume DM facility with the same readers during the same time period was used for comparison. DBT screening began in January 2012 at one University of Vermont (VT) facility based on room availability and patient preference. Additional units were added in July 2012, November 2013, and December 2013. The Dartmouth-Hitchcock Health System in New Hampshire and Brigham and Women’s Hospital in Massachusetts (D-BWH) began DBT screening in March 2011 at one facility. There was a more gradual conversion to DBT at other facilities during 2012 and 2013. DBT was used if requested by a patient or provider, and at some facilities women with dense breasts, baseline exams or with no obtainable prior imaging were targeted for DBT screening. We ascertained biopsy information from electronic health records and pathology databases. Cancer data came from local institutional tumor registries, state registries, and one statewide surveillance system. Pathology and cancer data availability varied by time for each center (Fig. 1).

Fig. 1 Data availability for imaging, pathology, and cancer outcome by calendar time. DBT digital breast tomosynthesis, UPenn University of Pennsylvania, D-BWH Dartmouth-Hitchcock Health System and Brigham and Women’s Hospital, VT University of Vermont. *UPenn imaging data include imaging from 1/1/12 to 12/31/13. Follow-up imaging data were available through 6/30/14. The largest imaging site began exclusively using tomosynthesis on October 1, 2011, but data availability began on January 1, 2012. Cancer data also included UPenn institutional cancer registry data through June 2014.

Our analyses included all bilateral exams with an indication of screening and no other breast imaging within 3 months prior, among women 40–74 years of age with no known history of prior breast cancer. Furthermore, we limited exams to those with radiologists who had interpreted at least 50 DBT and 50 DM screening exams (UPenn = 6, D-BWH = 27, VT = 14). A total of 55,998 DBT exams and 142,883 DM exams from 103,401 women met these criteria (45,049 women contributed 1 exam; 29,041 women contributed two exams; and 29,311 women contributed ≥3 exams). We defined a first exam as the first screening exam with no prior films and no prior imaging records available in PROSPR data, and no self-report of prior breast imaging. All other exams were considered subsequent exams. Breast density was extracted from the clinical screening report and used the Breast Imaging Reporting and Data System (BI-RADS) categories (almost entirely fat, scattered fibroglandular densities, heterogeneously dense, extremely dense) [14]. Race and ethnicity data were available from electronic health records and patient self-report.
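The inclusion criteria above can be expressed as a simple filter. The following is a minimal sketch only: the `Exam` record and its field names are hypothetical and do not reflect the PROSPR data schema, and "3 months" is approximated here as 90 days.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Exam:
    """Hypothetical exam record; field names are illustrative, not the PROSPR schema."""
    age: int
    bilateral: bool
    indication: str                          # e.g. "screening" or "diagnostic"
    days_since_prior_imaging: Optional[int]  # None = no prior breast imaging on record
    prior_breast_cancer: bool
    reader_dbt_count: int                    # DBT screening exams read by the interpreting radiologist
    reader_dm_count: int                     # DM screening exams read by the interpreting radiologist


def is_eligible(exam: Exam, min_reads: int = 50, washout_days: int = 90) -> bool:
    """Apply the stated inclusion criteria (washout_days=90 is an assumed stand-in for 3 months)."""
    recent_imaging = (
        exam.days_since_prior_imaging is not None
        and exam.days_since_prior_imaging <= washout_days
    )
    return (
        exam.bilateral
        and exam.indication == "screening"
        and not recent_imaging
        and 40 <= exam.age <= 74
        and not exam.prior_breast_cancer
        and exam.reader_dbt_count >= min_reads
        and exam.reader_dm_count >= min_reads
    )


# Illustrative records (made-up values)
included = Exam(52, True, "screening", 400, False, 120, 300)
too_young = Exam(38, True, "screening", None, False, 120, 300)
low_volume_reader = Exam(52, True, "screening", 400, False, 20, 300)
```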

Outcome measures

We evaluated the following screening outcomes: recall rate (%), biopsy rate (%), cancer rate (per 1000 exams), cancer detection rate (per 1000 exams), false negative rate (per 1000 exams), positive predictive value (%), sensitivity (%), and specificity (%). A positive screening exam included exams with BI-RADS assessment category 0, 3, 4, or 5. Recall rates are for positive screening exams; biopsy rates include any biopsy occurring after screening, regardless of the BI-RADS assessment category of the exam. Cancer rate was the number of cancers within 365 days of the screening exam; cancer detection rate was restricted to cancers within 365 days of a positive screen. False negative rates were determined from the difference between cancer rates and cancer detection rates. We evaluated the positive predictive value (PPV1), defined as the number of cancers diagnosed per number of positive screens. We calculated cancer rates, cancer detection rates, false negative rates, positive predictive values, sensitivity, and specificity among women under observation for at least one year (n = 25,268 DBT and n = 113,061 DM exams).
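To make these definitions concrete, the sketch below computes each measure from exam-level data. It is illustrative only: the input format (a BI-RADS assessment paired with a flag for cancer diagnosed within 365 days) and the toy counts are assumptions, and sensitivity and specificity use the standard definitions, which the text does not spell out.

```python
POSITIVE_BIRADS = {0, 3, 4, 5}  # assessment categories counted as a positive screen


def screening_outcomes(exams):
    """exams: list of (birads_category, cancer_within_365_days) tuples (hypothetical format)."""
    n = len(exams)
    tp = sum(1 for birads, cancer in exams if birads in POSITIVE_BIRADS and cancer)
    fp = sum(1 for birads, cancer in exams if birads in POSITIVE_BIRADS and not cancer)
    fn = sum(1 for birads, cancer in exams if birads not in POSITIVE_BIRADS and cancer)
    tn = n - tp - fp - fn
    return {
        "recall_rate_pct": 100 * (tp + fp) / n,
        "cancer_rate_per_1000": 1000 * (tp + fn) / n,
        "cancer_detection_rate_per_1000": 1000 * tp / n,
        # Equals cancer rate minus cancer detection rate
        "false_negative_rate_per_1000": 1000 * fn / n,
        "ppv1_pct": 100 * tp / (tp + fp),
        "sensitivity_pct": 100 * tp / (tp + fn),
        "specificity_pct": 100 * tn / (tn + fp),
    }


# Toy data (made-up): 1000 exams, 100 positive screens (6 with cancer),
# 900 negative screens (1 with cancer)
toy_exams = (
    [(0, True)] * 6 + [(0, False)] * 94 + [(1, True)] * 1 + [(1, False)] * 899
)
toy = screening_outcomes(toy_exams)
```

In this toy example the recall rate is 10 %, the cancer rate is 7 per 1000, the cancer detection rate is 6 per 1000, and the false negative rate is the difference, 1 per 1000.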

Statistical analysis

We compared screening outcomes (recall rates, biopsy rates, cancer rates, cancer detection rates, false negative rates, positive predictive values, sensitivity, and specificity) among DBT and DM exams using logistic regression and calculating odds ratios (ORs) and 95 % confidence intervals (CIs). For 2 × 2 tables, we used two-sided Fisher exact tests; p-values <0.05 were considered statistically significant. A priori we adjusted the logistic regression models for research center, age (40–49, 50–59, 60–74 years), breast density (the four BI-RADS density categories), and first exam. In supplementary analyses, we further adjusted for race and ethnicity (non-Hispanic white, non-Hispanic black, Hispanic, Asian/Pacific Islander, American Indian/Alaska Native, multiple races/other race). To evaluate the impact of differences in recall rate among interpreters, we additionally adjusted for interpreter in a conditional logistic regression model comparing recall rates. For the primary outcomes, we also considered a GEE logistic model that accounts for potential correlation of examinations within the same individual. These models gave the same OR estimate and confidence interval. Results given are from the standard logistic model since inference based on likelihood ratio testing is valid and did not differ from those from the GEE models. We used SAS Version 9.4 (SAS Institute, Inc.) for all analyses.
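As a rough illustration of what an odds ratio and its 95 % CI represent, the sketch below computes an unadjusted OR with a Wald interval from a 2 × 2 table. This is not the adjusted logistic regression fitted in SAS for this study: it ignores center, age, breast density, and first exam, and the counts are hypothetical.

```python
import math


def odds_ratio_wald_ci(events_a, total_a, events_b, total_b, z=1.96):
    """Unadjusted odds ratio (group A vs. group B) with a Wald confidence interval.

    z=1.96 gives an approximate 95 % interval.
    """
    a, b = events_a, total_a - events_a  # group A: events, non-events
    c, d = events_b, total_b - events_b  # group B: events, non-events
    if min(a, b, c, d) <= 0:
        raise ValueError("All four cells of the 2 x 2 table must be positive.")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 87 recalls among 1000 DBT exams vs. 104 among 1000 DM exams
or_est, ci_low, ci_high = odds_ratio_wald_ci(87, 1000, 104, 1000)
```

An OR below 1 indicates lower odds of the outcome in the first group; the interval excludes 1 only when the difference is statistically significant at the chosen level.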

Results

DBT exams comprised 28 % of all screening exams, with the percentage varying according to how quickly the sites adopted DBT (Table 1). Compared to DM exams, DBT exams were more likely in women 40–49 years of age, among non-Hispanic black women, and among women with heterogeneously or extremely dense breasts. DBT exams were slightly more likely to be first screening exams compared to DM exams. Some of the differing characteristics between DM and DBT exams were due to differences in the populations being screened with DBT at each center, but the differences remained important even after accounting for center.

Table 1 Characteristics of exams using digital mammography (DM) alone or in combination with digital breast tomosynthesis (DBT)

The overall recall rate for DBT and DM screening exams was 8.7 and 10.4 %, respectively (Table 2, p < 0.0001). The odds of recall were 32 % lower for DBT compared to DM after adjusting for center, age, breast density, and first exam (OR = 0.68, 95 % CI = 0.65–0.71). Stratification by individual interpreters did not change the adjusted OR substantially (OR = 0.72, 95 % CI = 0.69–0.75). Unadjusted biopsy rates were statistically significantly higher for DBT compared to DM (2.0 % DBT vs. 1.8 % DM, p = 0.0074). However, after adjusting for center, age, breast density, and first exam, the odds of biopsy were statistically significantly lower for DBT than DM (OR = 0.85, 95 % CI = 0.77–0.93).

Table 2 Recall rates and biopsy rates for digital breast tomosynthesis in combination with digital mammography (DBT) compared to digital mammography (DM) alone by PROSPR research center

We observed an overall cancer rate of 6.5 per 1000 DBT exams compared to 4.9 per 1000 DM exams among exams with at least one year of follow-up (Table 3, p = 0.0016, adjusted OR = 1.49, 95 % CI = 1.17–1.89). The invasive cancer rate was also higher for DBT relative to DM (4.7 vs. 3.7 per 1000 exams, p = 0.0252; adjusted OR = 1.45, 95 % CI = 1.09–1.92). The overall cancer detection rate was higher for DBT relative to DM (5.9 vs. 4.4 per 1000 exams, p = 0.0026; adjusted OR = 1.45, 95 % CI = 1.12–1.88). Restricted to invasive disease only, the cancer detection rate was also higher (4.2 vs. 3.3 per 1000 exams, p = 0.045; adjusted OR = 1.38, 95 % CI = 1.02–1.87). The PPV1 was statistically significantly higher for DBT compared to DM (6.4 vs. 4.1 %, p < 0.0001, adjusted OR = 2.02, 95 % CI = 1.54–2.65). The false negative rates were similar for both modalities, with rates of 0.60 for DBT vs. 0.46 for DM per 1000 screened (adjusted OR = 0.55, 95 % CI = 0.13–2.26).

Table 3 Cancer outcomes for digital breast tomosynthesis in combination with digital mammography (DBT) compared to digital mammography (DM) alone

Sensitivity was not improved (DBT = 90.9 %, DM = 90.6 %; adjusted OR = 0.79, 95 % CI = 0.38–1.64); however, specificity did increase (DBT = 91.3 %, DM = 89.7 %; p < 0.0001; adjusted OR = 1.39, 95 % CI = 1.30–1.48). In supplementary analyses, we further adjusted all multivariable models evaluating screening outcomes for race/ethnicity and the ORs did not meaningfully change (results not shown).

We evaluated all screening outcomes by age group (40–49 and 50–74 years) and breast density (non-dense versus dense). The adjusted ORs comparing DBT to DM for recall rate were similar for each age group and for each breast density group (Table 4). For biopsy rates, the adjusted ORs were comparable by age and by breast density, although there was some suggestion that the magnitude of the adjusted OR comparing DBT to DM was greater among dense than non-dense breasts. Sample sizes were small for cancer diagnoses. Nevertheless, there was a suggestion that the magnitude of the adjusted OR comparing DBT to DM for cancer rate, cancer detection rate, and PPV1 was greater among women ages 40–49 than ages 50–74, and greater among non-dense than dense breasts.

Table 4 Screening outcomes for digital breast tomosynthesis in combination with digital mammography (DBT) compared to digital mammography (DM) alone by age and breast density

Conclusion

The results from our multi-center cohort study further support that screening with DBT increases cancer detection, reduces recalls, and does not increase false negative exams compared to screening with DM alone. In the subset of patients with at least one year of follow-up, we observed a statistically significant improvement in specificity. Additionally, our findings support that the reduction in recall can be achieved with a statistically significant 34 % increase in overall cancer detection, or 1.5 more cancers detected per 1000 screened, with DBT screening compared to DM alone. In comparing invasive cancer detection rates, there was a 27 % increase, or 0.9 additional invasive cancers detected per 1000 screened with DBT, not as large an increase as achieved in other large studies, but still statistically significant [7]. We also compared the recall rate, cancer rate, and cancer detection rate among all exams by age group to Breast Cancer Surveillance Consortium (BCSC) data based on 2,061,691 digital mammography exams from 2004 to 2008 [15]. While the overall cancer rates were slightly higher in both our DM and DBT cohorts compared to the BCSC, the cancer detection rate was significantly higher in our DBT cohort (results not shown).
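The relative and absolute increases quoted above follow directly from the unadjusted cancer detection rates reported in the Results (5.9 vs. 4.4 per 1000 overall; 4.2 vs. 3.3 per 1000 for invasive cancers). The short sketch below reproduces that arithmetic; the function name is ours and is purely illustrative.

```python
def rate_differences(dbt_per_1000, dm_per_1000):
    """Return (absolute difference per 1000, relative increase in %) for DBT vs. DM."""
    absolute = dbt_per_1000 - dm_per_1000
    relative_pct = 100 * absolute / dm_per_1000
    return absolute, relative_pct


# Overall cancer detection rate: 5.9 (DBT) vs. 4.4 (DM) per 1000 exams
overall_abs, overall_rel = rate_differences(5.9, 4.4)
print(round(overall_abs, 1), round(overall_rel))    # 1.5 34

# Invasive cancer detection rate: 4.2 (DBT) vs. 3.3 (DM) per 1000 exams
invasive_abs, invasive_rel = rate_differences(4.2, 3.3)
print(round(invasive_abs, 1), round(invasive_rel))  # 0.9 27
```

These are unadjusted comparisons of rates; the adjusted ORs in the Results account for center, age, breast density, and first exam and are therefore not expected to match these percentages exactly.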

Our study is important because it is the first U.S. multi-site study to include a subset of the screened population with at least one year of imaging follow-up. While the number of exams with one year of follow-up is limited to 138,329 (70 % of the examinations), and the study was not powered to evaluate false negative rates, we observed no statistically significant change in the false negative rates for DBT versus DM (0.60 versus 0.46 per 1000 screened). In Skaane’s interval analysis of the first 12,621 subjects screened in a multi-arm, prospective trial with only 9 months of follow-up [4], there was a 40 % increase in invasive cancers and 3 known interval cancers, for a rate of 0.2 per 1000 screened. In the STORM trial, a prospective, multi-armed reader study with a minimum of 13 months of follow-up, the interval cancer rate was 0.82/1000 screens for both the DM and DBT reading arms, but there was an absolute difference in cancer detection of 2.7 per 1000 screened with DBT compared with DM alone [16]. In our two separate yet concurrent screening populations with follow-up, the false negative rates of 0.60 and 0.46/1000 screened for DBT and DM, respectively, are lower than the STORM interval cancer rate, but must be viewed with caution since our definition of a false negative screen may have included cancers detected within one year by other screening modalities such as magnetic resonance imaging and ultrasound. The classic definition of an interval cancer is a cancer that presents symptomatically after a negative screening exam and before the next scheduled screen [17]. However, in our recent publication of the single-site UPenn data, the interval cancer rate using this classic definition was similar to the rate in this multi-site study [18]. Further analysis of our false negative cases is ongoing to determine mode of presentation.

The overall relative reduction in recall rate of 15.6 % or 13 women per 1000 screened achieved in our population screened with DBT compared to those screened with DM alone is in keeping with other studies [4, 6–11]. When adjusted for center as well as patient age and breast density, we showed a 32 % decrease in the odds of recall with DBT versus DM alone (OR = 0.68, 95 % CI = 0.65–0.71). Thus far, these are the only such patient-level data published from a multi-center study. These data further support that the benefits of screening with DBT may be achievable across many different populations, sites, and readers.

In our study, the absolute recall reduction with DBT was greater for women with dense breasts than for those with non-dense breasts (23 versus 17 per 1000 screened), and both reductions were statistically significant. However, when adjusted for center, first exam, and age, the odds of recall comparing DBT to DM were similar for women with dense and non-dense breasts. When stratifying by age, the recall reduction was greater for women ages 40–49 than for women ages 50–74, and the odds of recall were statistically significantly lower for DBT than for DM in both age groups, even after adjusting for breast density. These findings suggest that all women may benefit from improved screening with DBT, with no particular advantage due to age or breast density.

Several limitations should be considered when interpreting our findings. The research centers began DBT screening at different times and with variable volumes, and the data captured within PROSPR did not always start from the initiation of DBT screening. Therefore, the data represent samples from different points in the “learning curve” of implementing this new modality. We are investigating time trends in DBT performance, but that is beyond the scope of this paper. In addition, the populations at the various sites were quite different in terms of race/ethnicity and potential intrinsic individual risk level, which may have contributed to variability in recall and cancer rates. There may also be some misclassification of first versus subsequent exams due to limited retrospective imaging data at some centers; however, we do not expect that this would meaningfully impact our results.

Despite these limitations, our multi-site study is the first to have follow-up data at the patient level with comprehensive cancer data sources, so that sensitivity and specificity can be estimated for DBT screening. We have shown that across multiple, diverse research centers, screening with DBT is associated with a statistically significant increase in cancer detection with a concomitant improvement in specificity, further supporting that this innovative technology offers critical improvements over breast cancer screening with DM alone.