Introduction

Digital breast tomosynthesis (DBT) has emerged as a valuable imaging modality involving the acquisition of projection images that are reconstructed into thin image slices of the breast [1]. Multiple studies have demonstrated that integration of DBT into the screening setting has led to increased cancer detection rates (CDR) with fewer false-positive examinations [2,3,4,5,6]. For example, in a retrospective analysis of screening performance metrics from 13 academic and nonacademic breast centers, the addition of DBT to digital 2D mammography (DM) was associated with an increase in the invasive CDR from 2.9 to 4.1 per 1,000 examinations and a decrease in the abnormal interpretation rate (AIR) from 10.7% to 9.1% [4].

The benefits of DBT have been primarily studied in the screening setting, with limited research on DBT performance metrics in the diagnostic setting [7]. A single-institution study found that the integration of DBT into the diagnostic setting led to a significant improvement in positive predictive value 3 (PPV3), but no other performance metrics were reported [7]. The American College of Radiology (ACR) recommends separate auditing of screening and diagnostic mammograms, but there are no established benchmarks for diagnostic DBT [8].

Our institution first integrated DBT into clinical practice in early 2011 and transitioned completely to DBT in early 2013, and thus it is well positioned to study performance metrics of DBT in the diagnostic setting. The purpose of our study is to compare performance metrics between DM and DBT in the diagnostic setting.

Materials and methods

Study subjects

The institutional review board exempted this Health Insurance Portability and Accountability Act-compliant retrospective study from the requirement for informed consent. We retrospectively reviewed consecutive diagnostic mammograms from August 1, 2008, to February 28, 2011 (DM group, before DBT integration), and from January 1, 2013, to July 31, 2015 (DM/DBT group, after complete DBT integration), at a single large tertiary academic medical center. After complete integration of DBT in January 2013, all examinations were performed with combined DBT and conventional DM. Diagnostic mammograms performed from March 2011 to December 2012 were excluded to avoid selection bias, since during this period some patients underwent DM-only examinations and others underwent combined DBT and DM examinations.

We searched the mammography information system (MagView, Burtonsville, MD) for all diagnostic mammogram reports with BI-RADS final assessment categories of 0 to 5 from August 2008 to February 2011 and from January 2013 to July 2015, in addition to all image-guided core biopsy and surgical pathology results within 1 year after the diagnostic mammogram was performed. Indications for diagnostic examinations included primarily evaluation of a breast problem (such as a palpable lump or focal pain) and short-interval follow-up. A less common indication was mammography performed after identification of a finding on MRI, CT, or PET/CT. Diagnostic examinations performed for additional evaluation of a recent screening mammogram were not included in the study cohort. The database generated from the mammography information system was linked to tumor registries for five hospitals within our health care system.

Imaging technique and interpretation

Mammographic examinations were performed using full-field DM or DM/DBT (Hologic, Bedford, MA). All DBT examinations combined tomosynthesis and conventional DM, without using reconstructed DM views. At our institution, diagnostic examinations typically are unilateral and include craniocaudal and mediolateral oblique or mediolateral views of the affected breast, with additional views (such as spot compression or spot magnification views) and ultrasound examinations obtained at the discretion of the interpreting radiologist. During the time period of this study, the ultrasound units used by our institution remained the same (Philips Healthcare, Amsterdam, The Netherlands), but the software was periodically updated.

All examinations were interpreted by 1 of 28 breast imaging radiologists, with experience ranging from 1 to 35 years, using the BI-RADS Atlas [8]. Sixteen of the 28 radiologists were the same in the DM and DM/DBT groups. In the DM group, there were 21 interpreting radiologists, 8 of whom interpreted more than 1,000 examinations each, accounting for 66.9% (15,300/22,883) of the examinations. The other 13 radiologists in the DM group interpreted 33.1% (7,583/22,883) of the examinations. There were 23 interpreting radiologists in the DM/DBT group, 8 of whom interpreted more than 1,000 examinations each, accounting for 69.1% (15,768/22,824) of the examinations. The other 15 radiologists in the DM/DBT group interpreted 30.9% (7,056/22,824) of the examinations.

Standard definitions

Mammograms were categorized as true positive, true negative, false positive, or false negative according to standard definitions in the BI-RADS Atlas [8]. Mammograms were considered “true-positive” examinations if there was a tissue diagnosis of cancer [invasive cancer or ductal carcinoma in situ (DCIS)] within 1 year after a positive examination (BI-RADS 4 or 5). Mammograms were considered “true-negative” examinations if there was no known tissue diagnosis of cancer within 1 year of a negative examination (BI-RADS 1, 2, or 3). Mammograms were considered “false-positive” examinations if there was no known tissue diagnosis of cancer within 1 year of a positive examination (BI-RADS 4 or 5). Mammograms were considered “false-negative” examinations if there was a tissue diagnosis of cancer within 1 year of a negative examination (BI-RADS 1, 2, or 3). For diagnostic mammograms with a BI-RADS final assessment category of 0 [0.6% (144/22,883) in the DM group and 0.4% (89/22,824) in the DM/DBT group], the final assessment category (1-5) given after completion of the recommended workup was used for purposes of the analysis. (An assessment of “0” was used primarily when the patient was unable to stay for additional recommended workup, such as ultrasound, or when an MRI examination was recommended for problem-solving purposes.)
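The four outcome definitions above can be expressed as a small decision rule. The following is a minimal sketch in Python, not the code used in the study; the function name `classify_exam` and its parameters are hypothetical. For an examination initially assessed as category 0, the category assigned after the recommended workup is assumed to be passed in.

```python
from typing import Literal

Outcome = Literal["TP", "TN", "FP", "FN"]


def classify_exam(final_birads: int, cancer_within_1yr: bool) -> Outcome:
    """Classify one diagnostic examination as TP, TN, FP, or FN.

    final_birads: final assessment category (1-5). For an initial
        category 0, pass the category given after the recommended workup.
    cancer_within_1yr: True if there was a tissue diagnosis of cancer
        (invasive or DCIS) within 1 year of the examination.
    """
    if final_birads not in (1, 2, 3, 4, 5):
        raise ValueError("final assessment category must be 1-5")
    # BI-RADS 4 or 5 is a positive examination; 1, 2, or 3 is negative.
    is_positive = final_birads in (4, 5)
    if is_positive:
        return "TP" if cancer_within_1yr else "FP"
    return "FN" if cancer_within_1yr else "TN"
```

Under these definitions, a BI-RADS 3 examination followed by a cancer diagnosis within 1 year counts as a false negative, because category 3 is treated as a negative assessment.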

Performance metrics were calculated using standard formulas from the BI-RADS Atlas [8]. The CDR is the number of cancers detected (true positives) per 1,000 examinations. The AIR is the percentage of total examinations interpreted as positive (BI-RADS 4 or 5). PPV2 is the percentage of diagnostic examinations recommended for tissue diagnosis or surgical consultation (BI-RADS 4 or 5) that result in a tissue diagnosis of cancer within 1 year. PPV3 is the percentage of all known biopsies performed as a result of positive diagnostic examinations (BI-RADS 4 or 5) that resulted in a tissue diagnosis of cancer within 1 year. Sensitivity, which is the probability of interpreting an examination as positive when cancer exists, is measured as the number of positive examinations for which there is a tissue diagnosis of cancer within 1 year of the examination (true positives) divided by all cancers present in the population examined in the same time period (true positives and false negatives). Specificity, which is the probability of interpreting an examination as negative when cancer does not exist, is measured as the number of negative examinations for which there is no tissue diagnosis of cancer within 1 year of the examination (true negatives) divided by all examinations for which there is no diagnosis of cancer within the same time period (true negatives and false positives).
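As an illustration of the formulas above, the sketch below computes each metric from outcome counts. It is a simplified example under stated assumptions: the class `AuditCounts`, the function `performance_metrics`, and the counts in `example` are hypothetical and are not data from this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditCounts:
    tp: int  # true-positive examinations
    fp: int  # false-positive examinations
    tn: int  # true-negative examinations
    fn: int  # false-negative examinations
    biopsies: int  # known biopsies performed after positive examinations
    biopsies_malignant: int  # those biopsies with a cancer diagnosis


def performance_metrics(c: AuditCounts) -> dict[str, float]:
    """Compute audit metrics from outcome counts (percentages unless noted)."""
    total = c.tp + c.fp + c.tn + c.fn
    positives = c.tp + c.fp  # examinations assessed as BI-RADS 4 or 5
    return {
        "cdr_per_1000": 1000 * c.tp / total,
        "air_pct": 100 * positives / total,
        "ppv2_pct": 100 * c.tp / positives,
        "ppv3_pct": 100 * c.biopsies_malignant / c.biopsies,
        "sensitivity_pct": 100 * c.tp / (c.tp + c.fn),
        "specificity_pct": 100 * c.tn / (c.tn + c.fp),
    }


# Hypothetical counts for 1,000 examinations, for illustration only.
example = AuditCounts(tp=30, fp=70, tn=895, fn=5,
                      biopsies=80, biopsies_malignant=28)
metrics = performance_metrics(example)
```

PPV2 and PPV3 differ only in the denominator: PPV2 uses all examinations recommended for tissue diagnosis, whereas PPV3 uses only the biopsies that were actually performed.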

Data collection and statistical analysis

For each diagnostic examination report, the following information was extracted from the mammography information system (MagView, Burtonsville, MD): all image-guided core biopsy and surgical pathology results within 1 year after the diagnostic mammogram, patient age, patient race, breast density, history of breast cancer, and reader. The mammographic finding types (architectural distortion, asymmetry, calcifications, or mass) for all true- and false-positive examinations were recorded.

All data were analyzed with statistical software (R version 3.4.2). We evaluated the distribution of age, race, breast density, and history of breast cancer by modality type (DM versus DM/DBT) using the Wilcoxon test (for continuous variables) and Pearson’s chi-squared test (for categorical variables). For each of the performance metrics, parameters of a multivariable regression model were estimated using generalized estimating equations (GEE) with a logit link and an autoregressive (AR1) correlation structure to account for multiple examinations per subject. The advantage of a multivariable regression model is that the association between an independent variable and the outcome can be estimated holding all other variables constant, which allows us to account for potentially confounding variables. Each model was adjusted for age, race, breast density, history of breast cancer, and reader. For each model, we calculated adjusted odds ratios, 95% confidence intervals, and p values for the DM/DBT group relative to the DM group. In addition, the mammographic finding types that led to true- and false-positive examinations were compared between the DM and DM/DBT groups using Pearson’s chi-squared test. P < 0.05 was considered statistically significant.
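For readers less familiar with GEE, the model described above can be written schematically as follows. The notation is an assumption introduced here for illustration and does not appear in the original analysis: Y_ij is the binary outcome for examination j of subject i, DBT_ij indicates the DM/DBT group, and x_ij collects the adjustment covariates (age, race, breast density, history of breast cancer, and reader).

```latex
\operatorname{logit}\,\Pr(Y_{ij}=1)
  = \beta_0 + \beta_1\,\mathrm{DBT}_{ij} + \boldsymbol{\beta}_2^{\top}\mathbf{x}_{ij},
\qquad
\mathrm{OR}_{\mathrm{adj}} = e^{\beta_1},
\qquad
\operatorname{Corr}(Y_{ij}, Y_{ik}) = \rho^{\,|j-k|}
```

In this form, the adjusted odds ratio for DM/DBT versus DM is the exponentiated coefficient of the modality indicator, and the AR1 working correlation assumes that examinations of the same subject are more strongly correlated when they are closer together in sequence.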

Results

Characteristics of the study population

Over a 31-month period before DBT integration, 22,883 DM diagnostic examinations were performed in 15,823 women (DM group); over a 31-month period after complete DBT integration, 22,824 DM/DBT diagnostic examinations were performed in 16,881 women (DM/DBT group) (Fig. 1). There were small but statistically significant differences in age (52.8 versus 53.5 years in the DM and DM/DBT groups, respectively, p < 0.01) and in the proportion of patients with a prior history of breast cancer (11.9% versus 10.9% in the DM and DM/DBT groups, respectively, p < 0.01) (Table 1). There were no differences in race or breast density between the DM and DM/DBT groups (p = 0.10 and p = 0.13, respectively) (Table 1).

Fig. 1 Flow diagram of patient selection

Table 1 Comparison of patient characteristics with DM versus DM/DBT

Of 45,707 total examinations performed in the DM and DM/DBT groups, 29,190 (63.9%) were given a final assessment category of 1 or 2, 11,776 (25.8%) were given a final assessment category of 3, and 4,741 (10.4%) were given a final assessment category of 4 or 5. The proportions of cases reported as BI-RADS 1 or 2, BI-RADS 3, and BI-RADS 4 or 5 were similar in the DM and DM/DBT groups (p = 0.53-0.73) (Table 2).

Table 2 Comparison of BI-RADS final assessment categories with DM versus DM/DBT

Performance metrics with DM versus DM/DBT

A total of 1,589 breast cancers were diagnosed within 1 year after 45,707 diagnostic examinations for an overall CDR of 34.8 per 1,000. After adjusting for age, race, breast density, prior history of breast cancer, and reader, the CDR was similar in both groups (adjusted OR = 1.11, 95% CI 0.97-1.27, p = 0.14) (Table 3). However, a higher proportion of cancers were invasive rather than in situ with DM/DBT compared with DM [83.7% (731/873) versus 72.3% (518/716), p < 0.01].

Table 3 Comparison of performance metrics with DM versus DM/DBT

The AIR was lower in the DM/DBT group (adjusted OR = 0.87, 95% CI 0.80-0.94, p < 0.01). PPV2 and PPV3 were higher in the DM/DBT group (adjusted OR = 1.34, 95% CI 1.13-1.60, p < 0.01, for PPV2; adjusted OR = 1.26, 95% CI 1.05-1.51, p = 0.01, for PPV3) (Table 3). Sensitivity was similar in the DM and DM/DBT groups (adjusted OR = 1.26, 95% CI 0.93-1.72, p = 0.14) (Table 3). Specificity was higher in the DM/DBT group (adjusted OR = 1.28, 95% CI 1.17-1.41, p < 0.01) (Table 3).
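The relationship between a reported odds ratio, its 95% confidence interval, and its p value can be checked approximately. The sketch below assumes that the intervals are Wald intervals on the log-odds scale, which the analysis above does not state explicitly; the function name `wald_p_from_or_ci` is hypothetical, and because the inputs are rounded, the results are only approximate.

```python
import math


def wald_p_from_or_ci(odds_ratio: float, ci_low: float, ci_high: float,
                      z_crit: float = 1.959964) -> tuple[float, float, float]:
    """Recover the log-scale standard error, z statistic, and two-sided p value.

    Assumes a symmetric (Wald) 95% interval on the log-odds scale.
    """
    # Half-width of the log-scale interval divided by the critical value.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p


# Rounded values as reported in the text above.
se_air, z_air, p_air = wald_p_from_or_ci(0.87, 0.80, 0.94)      # AIR
se_sens, z_sens, p_sens = wald_p_from_or_ci(1.26, 0.93, 1.72)   # sensitivity
```

Under this assumption, the sensitivity result gives a p value of approximately 0.14 and the AIR result gives a p value well below 0.01, both in line with the values reported above.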

Mammographic finding types with DM versus DM/DBT

The mammographic finding of mass led to more true-positive examinations in the DM/DBT group than DM group [52.9% (462/873) vs. 41.3% (296/716), p < 0.01] (Table 4). The mammographic finding of calcifications led to fewer true-positive examinations in the DM/DBT group than DM group [23.8% (208/873) vs. 32.1% (230/716), p < 0.01] (Table 4).
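As a rough check of the mass comparison above, the sketch below applies an unadjusted Pearson chi-squared test to the 2 x 2 table of mass versus non-mass findings by group. It assumes no continuity correction and a single 2 x 2 comparison, which may differ from how the original test was run; the function name `chi2_2x2` is hypothetical.

```python
import math


def chi2_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-squared test (1 degree of freedom, no continuity
    correction) for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# True-positive examinations with a mass finding: 462 of 873 (DM/DBT)
# versus 296 of 716 (DM), as reported above.
chi2_mass, p_mass = chi2_2x2(462, 873 - 462, 296, 716 - 296)
```

With these counts the statistic is about 21 on 1 degree of freedom, which corresponds to p < 0.01.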

Table 4 Comparison of mammographic finding types that led to true-positive examinations with DM versus DM/DBT

The mammographic findings of architectural distortion and mass led to more false-positive examinations in the DM/DBT group than DM group [5.3% (78/1,485) vs. 1.4% (24/1,669), p < 0.01, for architectural distortion and 37.1% (551/1,485) vs. 20.8% (347/1,669), p < 0.01, for masses] (Table 5). The mammographic findings of asymmetry and calcifications led to fewer false-positive examinations in the DM/DBT group than DM group [11.6% (172/1,485) vs. 18.2% (303/1,669), p < 0.01, for asymmetries and 30.3% (450/1,485) vs. 36.8% (614/1,669), p < 0.01, for calcifications] (Table 5).

Table 5 Comparison of mammographic finding types that led to false-positive examinations with DM versus DM/DBT

Discussion

In our study, integration of DBT into the diagnostic setting led to a significant decrease in AIR and significant increase in PPV2, PPV3, and specificity. Although there was no significant difference in CDR after adjusting for multiple variables, there was a higher proportion of invasive relative to in situ carcinomas with DM/DBT compared with DM. These results demonstrate that integration of DBT into the diagnostic setting leads to improved performance. In addition, our research supports guidelines that recommend separate auditing of screening and diagnostic examinations [8].

Our results suggest that diagnostic DBT leads to increased positive predictive values with improved selection of patients recommended for biopsy. Our findings are consistent with the robust literature that supports the use of DBT in the screening setting. Through the acquisition of projection images that are reconstructed into thin image slices of the breast, DBT minimizes the masking effect of overlying tissue, enables improved cancer detection, and reduces false-positive findings [1]. In addition, we observed a higher proportion of invasive relative to in situ carcinomas with DM/DBT. This phenomenon has also been described with the use of DBT in the screening setting [3, 4, 9, 10]. The preferential ratio of invasive relative to in situ carcinomas with the transition to DBT may contribute to the optimization of patient outcomes from diagnostic mammography.

There is limited research on the performance metrics of diagnostic DBT. In a retrospective study of all diagnostic mammograms obtained during 1 year before DBT integration and during 3 consecutive years transitioning to full DBT integration (DBT1, DBT2, and DBT3), the authors found a significant increase in PPV3 from 29.6% (85/287) in the DM group to 50% (182/364) in the DBT3 group [7]. Of note, in the DBT1 and DBT2 groups, the patients received DBT or DM based on unit availability, so comparisons were primarily made between the DM group and DBT3 group. Other performance metrics were not calculated since limited follow-up data were available for the DBT3 group. Our study offers a comprehensive review of performance metrics in a large DBT practice and compares metrics during a 31-month period before the integration of DBT, when our institution was performing only DM, to a 31-month period after complete integration of DBT, when our institution was performing combined DBT and DM in all patients, thus eliminating selection biases that may occur in a hybrid environment.

The aforementioned study reported a significant increase in examinations given a BI-RADS final assessment category of 1 or 2 and a significant decrease in examinations given a BI-RADS final assessment category of 3 after integration of DBT into the diagnostic setting [7]. (No significant change in examinations given a BI-RADS final assessment category of 4 or 5 was observed.) In our study, however, we observed no significant differences in the proportion of cases reported as BI-RADS 1 or 2 or BI-RADS 3 with integration of DBT. This finding may reflect differences in institutional practice regarding the use of BI-RADS 3. Of note, in the study that reported a decrease in BI-RADS 3 assessments, the baseline use of BI-RADS 3 with DM was higher than our institution’s baseline use of BI-RADS 3 with DM (33.3% versus 25.6%) [7].

In a recent update from the Breast Cancer Surveillance Consortium (BCSC) on diagnostic DM, the CDR was 34.7 per 1,000 examinations (95% CI: 34.1, 35.2) [11]. The CDR for diagnostic DM in our study was 31.3 per 1,000. The lower CDR in our study may be due to patient selection: we included all diagnostic mammograms apart from those that were performed for additional evaluation of a recent screening mammogram. We chose to focus on this specific diagnostic population, since there is limited literature on diagnostic DBT for evaluation of findings that are not recalled from screening. If we exclude mammograms performed for additional evaluation of a recent screening mammogram from the BCSC study, the CDR would be 29.4 per 1,000 (8,006/272,572), which is more in keeping with the CDR in our study. As practices rapidly transition from DM to DBT, there is a need for benchmarks for facility auditing based on modern DBT performance in the diagnostic setting.

Our study also demonstrates that the types of mammographic findings that led to true- and false-positive examinations differed before and after complete integration of DBT. The mammographic finding of mass led to more true-positive examinations in the DM/DBT group than DM group, whereas the mammographic finding of calcifications led to fewer true-positive examinations in the DM/DBT group than DM group. These findings are consistent with our observation of a higher proportion of invasive relative to in situ carcinomas with DM/DBT, with invasive carcinomas often presenting as masses and in situ carcinomas often presenting as calcifications.

The types of mammographic findings that led to false-positive examinations differed before and after complete integration of DBT, with fewer false-positive examinations resulting from asymmetries and calcifications after complete integration of DBT. The types of findings that led to false-positive examinations with DBT provide insight into the advantages and possible disadvantages of this new imaging modality. More false-positive examinations resulted from architectural distortion after integration of DBT. These results are consistent with a recent study that found that architectural distortion is more readily detected on DBT than DM but is less likely to be malignant on DBT than DM; as such, the finding of architectural distortion can lead to more nonmalignant biopsies with DBT [12]. More false-positive examinations also resulted from masses with DBT, suggesting that short-term follow-up may be a reasonable option when mammographic features with DBT appear likely or probably benign after margin analysis. On the other hand, fewer false-positive examinations resulted from asymmetries with DBT, suggesting that fewer asymmetries are being recommended for biopsy with DBT or that asymmetries that persist on additional imaging are more likely to be significant.

Several studies have compared DM and DBT with regard to lesion characterization and image quality in the diagnostic setting and have concluded that the integration of DBT into the diagnostic setting can lead to more efficient evaluations and provide sufficient information for lesion localization [13,14,15,16,17,18,19,20,21,22,23]. For example, in a study of 146 women with 158 abnormalities who underwent DM and two-view DBT, with imaging reviewed by three radiologists, the authors reported that DBT could replace conventional DM views for the evaluation of non-calcified findings [18]. For calcifications, however, unlike masses, asymmetries, and architectural distortion, DM is more sensitive than DBT [24].

In clinical practice, the results of this study suggest that the use of DBT is particularly helpful in the diagnostic setting and can reduce benign biopsies. DBT allows for better margin analysis and can increase the radiologist’s confidence in determining the benignity of a lesion [1]. With subtle lesions, DBT can inform the decision for biopsy and increase overall accuracy [22]. Unnecessary ultrasound examinations and an increase in examination time can also be avoided in cases in which spot compression images with DBT demonstrate that an asymmetry represents normal superimposition of breast tissue. Currently, one of the disadvantages of DBT at our institution is the increased radiation dose to patients with combined DM and DBT; however, the radiation dose can be reduced using reconstructed DM views in place of the 2D DM acquisitions [25, 26].

Our study has several limitations. First, it was conducted at a large tertiary academic medical center that has been performing DBT since early 2011, and thus the results may not be generalizable to other institutions, especially those that have more recently transitioned to DBT. Second, the database generated from our mammography information system was linked to tumor registries for the five hospitals within our health care system; however, our database was not linked to a state registry. Third, although we avoided the selection bias inherent to other retrospective studies by comparing performance metrics in non-hybrid environments (DM in all patients versus combined DBT and DM in all patients), this study was not a randomized trial, and thus other factors may have influenced our results; however, we performed a multivariable regression and adjusted each model for age, race, breast density, history of breast cancer, and reader.

To our knowledge, our study is the largest and one of the first to report performance metrics for DBT fully integrated into the diagnostic setting. In a large DBT practice, complete integration of DBT into the diagnostic setting is associated with improved diagnostic performance. Increased utilization of DBT in the diagnostic setting may thus result in decreased health care costs and lessened patient anxiety, in addition to a shift in the benchmarks and outcomes that have been established for conventional DM.