Introduction

Architectural distortion (AD) is a subtle mammographic finding that can be a manifestation of breast cancer. While AD can be due to a variety of malignant and non-malignant causes [1], the positive predictive value for malignancy is approximately 75% [2]. AD may be the earliest manifestation of breast cancer [3] and is the most commonly missed abnormality on false-negative mammograms [4]. Earlier detection of AD may improve patient prognosis more than earlier detection of calcifications [3].

Compared to digital mammography (DM), digital breast tomosynthesis (DBT) is a newer imaging technology that displays thin slices of breast tissue, thus mitigating the effects of tissue overlap. DBT has been shown to increase cancer detection rates and decrease screening call-back rates [5,6,7,8,9]. In addition, use of screening DBT allows more recalled patients to undergo ultrasound alone [10], thus potentially improving diagnostic workflow efficiency for patients recalled from screening. The risk of malignancy in abnormalities detected only by DBT is significant [11], and has been reported at nearly 50% [12]. DBT is particularly helpful in detecting abnormalities in patients with dense breasts [13, 14].

Minimising disagreement in difficult cases is the best way to reduce inconsistencies in screening mammogram interpretation [15]. AD has high interobserver variability (IOV) [16], and while sensitivity for AD is lower than for non-AD manifestations of breast cancer [17], DBT improves detection of AD [11]. The purpose of this study was to compare IOV, reader confidence and sensitivity/specificity in detecting AD on DM versus DBT.

Materials and methods

Study design

This reader study, approved by the Institutional Review Board and compliant with the United States Health Insurance Portability and Accountability Act, used a counterbalanced experimental design to estimate the effects of DBT relative to DM regarding sensitivity/specificity, IOV (or reader agreement) and reader confidence in detecting AD. Informed consent was waived.

Selection of patient images

The radiology database at a tertiary breast centre was searched using the PenRad Management Information System (PenRad Technologies Inc., Buffalo, MI, USA) for all reports containing the words ‘architectural distortion’ or ‘possible architectural distortion’ on all screening mammograms performed from 5 March 2012 to 27 November 2013. Unilateral examinations were excluded to reduce the likelihood of a true-positive hit by chance and to allow each breast to act as a control within a subject. All studies consisted of standard two-view full-field DM images and DBT reconstructions in both the mediolateral oblique and craniocaudal projections. Both DM and DBT images were obtained on 3D units (Selenia Dimensions, Hologic, Marlborough, MA, USA) in a single compression episode for each projection. All patients had both DM and DBT (not synthetic mammography views reconstructed from DBT data). DBT images were obtained through the motion of the x-ray tube over a 15° arc and reconstructed into 1-mm sections. DM and DBT images from all reports containing the words AD or possible AD were consensus-reviewed to confirm the presence of AD or possible AD (Fig. 1) on screening views on a 5-megapixel liquid crystal display (LCD) diagnostic workstation (SecurView, Hologic, Marlborough, MA, USA) by three radiologists (one breast imaging fellow and two fellowship-trained breast imagers). At the time of the consensus review, the breast imagers had 6 and 16 years of breast imaging experience in practice, respectively. This consensus review took place 3 years prior to the current study. Our institution began using reconstructed two-dimensional images after the consensus review and before this study was performed. To maintain the integrity of the methods for the current experimental study, we did not collect additional AD cases using reconstructed two-dimensional imaging. In addition, the 3-year delay provided the beneficial effects of memory decay and retrograde interference.
Control cases were identified through searches on our institution’s PACS (GE Healthcare; Chicago, IL, USA) and through MONTAGE™ Search and Analytics software (Montage Healthcare Solutions, Philadelphia, PA, USA). Controls were matched for age, breast density (as described in the screening mammogram report by the reading radiologist), presence and side of prior malignancy, presence and side of new malignancy on the presented mammogram, presence and side of prior surgery, and date of mammogram when possible. While matching for breast cancer history may have increased the breast cancer risk profile of the study sample in both the AD and non-AD groups compared to the general screening population, it controlled for the possibility that either group would be more complex or at higher risk than its counterpart. The ratio of case/control was one AD patient/one control patient. All cases and controls were bilateral examinations. Patient demographics, imaging findings, pathology findings and follow-up imaging results were obtained from the electronic medical record and recorded. Imaging and clinical follow-up through May 2016 were reviewed via the electronic medical record, thus all cases had at least 2 years of follow-up available.

Fig. 1

A 58-year-old woman with left breast architectural distortion. Left breast (a) craniocaudal and (b) mediolateral oblique digital mammogram images show heterogeneously dense breast tissue. Four out of four study readers did not detect architectural distortion on this patient’s digital mammogram images. Left breast (c) craniocaudal and (d) mediolateral oblique digital breast tomosynthesis images show architectural distortion in the 12 o’clock location (white circles). The architectural distortion was detected by four out of four study readers on this patient’s digital breast tomosynthesis images. The patient went on to biopsy and subsequent excision revealing invasive ductal carcinoma on pathology

Review of images

Two breast radiologists (9 and 19 years of experience, respectively) and two breast imaging fellows in the second half of their 1-year breast imaging fellowships independently reviewed the DBT and DM images from screening mammograms performed with combined technique in two separate reading sessions. All four readers were blinded to patient information, prior imaging, and outcomes. In the first session, only the DBT images were reviewed for half of the cases (1–59) and only the DM images were reviewed for the other half (60–118). In the second session (at least 1 month later), only the DM images were reviewed for those cases in which DBT images had been previously evaluated (1–59), and only DBT images were reviewed for cases in which DM had been previously evaluated (60–118). While the order of patient images was held constant across sessions 1 and 2, the position of AD versus non-AD cases within that order was randomly assigned. This counterbalanced experimental design allowed patient images and radiologists to be held constant while only imaging technique (DBT vs. DM) was manipulated; therefore, observed differences in radiologists’ performance can be attributed to the direct effect of imaging technique.
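
The session allocation described above can be sketched in a few lines of Python. This is an illustration only, not software used in the study; the function name `modality_for` and the assumption that cases are numbered 1–118 in presentation order are hypothetical.

```python
def modality_for(case_id: int, session: int, n_cases: int = 118) -> str:
    """Return the technique read for a case in a given session (1 or 2).

    Hypothetical helper: cases 1..n_cases/2 are read with DBT first and
    DM second; the remaining cases are read with DM first and DBT second.
    """
    if not 1 <= case_id <= n_cases:
        raise ValueError("case_id out of range")
    if session not in (1, 2):
        raise ValueError("session must be 1 or 2")
    dbt_first = case_id <= n_cases // 2  # cases 1-59 when n_cases is 118
    if session == 1:
        return "DBT" if dbt_first else "DM"
    return "DM" if dbt_first else "DBT"


# Full reading schedule: one list of techniques per session, in case order.
schedule = {
    session: [modality_for(case_id, session) for case_id in range(1, 119)]
    for session in (1, 2)
}
```

Because every case is read once with each technique and the two halves swap techniques between sessions, any session-related effect such as fatigue or learning falls equally on DBT and DM.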

Measures

For each breast, readers recorded the presence or absence of AD or possible AD (i.e., Yes/No), the location in the breast using clock face position, and their confidence in that interpretation on a scale of 0–4 (i.e., How confident? 0 ‘Not at all’, 1 ‘Somewhat’, 2 ‘Confident’, 3 ‘Very confident’, 4 ‘Almost sure’). AD or possible AD that was due to post-surgical change and clearly identified as such by the reader was not coded as AD or possible AD. Only unexplained AD or unexplained possible AD was coded as AD or possible AD for the purposes of this study in order to reproduce the clinical setting in which architectural distortion that is clearly due to prior surgery is not clinically significant. The studies were interpreted on a 5 megapixel LCD diagnostic workstation (SecurView, Hologic).

Statistical methods

All analyses were conducted using SAS Software 9.4 (SAS Inc., Cary, NC, USA). Agreement was examined using weighted Kappa. Differences in confidence between DBT versus DM and between attending radiologists versus fellows were examined using generalised mixed modeling assuming a binomial distribution. Differences in sensitivity/specificity between DBT versus DM and between attending radiologists versus fellows were examined using generalised mixed modeling assuming a binary distribution. Modeling was accomplished with sandwich estimation using PROC GLIMMIX, where images were nested within patients and patients nested within radiologists. Positive likelihood ratios were calculated as sensitivity/(1 − specificity) and negative likelihood ratios were calculated as (1 − sensitivity)/specificity. Receiver operating characteristic (ROC) estimates were calculated using PROC LOGISTIC. Statistical significance was established a priori at the 0.05 level and all interval estimates are calculated for 95% confidence. Multiple comparisons were examined using Tukey correction.
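
For readers unfamiliar with the agreement statistic, the following Python sketch computes a weighted kappa for two readers. It is not the SAS code used in this study, and the text does not state how ratings from the four readers were pooled; the function name `weighted_kappa` and the example ratings are hypothetical. For a two-category rating such as Yes/No, the weighted kappa reduces to the unweighted Cohen's kappa.

```python
def weighted_kappa(ratings_a, ratings_b, categories, weights="linear"):
    """Weighted kappa for two raters over ordered categories.

    Uses disagreement weights: |i - j| / (k - 1) for "linear",
    and the square of that value for "quadratic".
    """
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        if k == 1:
            return 0.0
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    obs_disagreement = sum(
        weight(i, j) * observed[i][j] for i in range(k) for j in range(k)
    )
    exp_disagreement = sum(
        weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k)
    )
    if exp_disagreement == 0:
        raise ValueError("kappa is undefined when no disagreement is expected")
    return 1.0 - obs_disagreement / exp_disagreement


# Invented ratings for ten breasts (1 = AD present, 0 = AD absent).
reader_a = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
reader_b = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]
kappa = weighted_kappa(reader_a, reader_b, categories=[0, 1])
```

In this invented example the readers agree on 8 of 10 breasts, chance agreement is 0.52, and kappa is therefore (0.80 − 0.52)/(1 − 0.52), or approximately 0.58.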

Results

Of 25,369 screening DBT examinations, there were 84 reports of AD or possible AD. Thirty-two cases were excluded because AD or possible AD was not confirmed on consensus review. Four cases were excluded because they were unilateral examinations. The 59 remaining patients had 59 AD lesions (all cases were single lesions in one breast of a bilateral screening mammogram) and were matched to 59 controls for a total of 1,888 observations (59 × 2 (cases and controls) × 2 breasts × 2 imaging techniques × 4 readers, or 472 observations per reader). No differences were observed between AD and non-AD patients on matched variables (see Table 1). Results of biopsy or imaging follow-up of AD cases are shown in Table 2. Although the purpose of the present study was to examine consensus-reviewed AD and possible AD, almost half of our identified cases were confirmed AD that persisted on additional imaging and required biopsy (27 of the 59). Thirty-two out of the 59 cases resolved with additional imaging or turned out to be postsurgical change that was not readily apparent to the consensus reviewers.
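
The observation count follows directly from the study design and can be checked with simple arithmetic. The variable names in this short Python sketch are illustrative.

```python
ad_patients = 59          # one AD lesion per patient
groups = 2                # AD cases and matched controls
breasts_per_patient = 2   # bilateral examinations only
techniques = 2            # DM and DBT
readers = 4               # two attendings and two fellows

observations_per_reader = ad_patients * groups * breasts_per_patient * techniques
total_observations = observations_per_reader * readers
```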

Table 1 Patient demographics
Table 2 Architectural distortion (AD) pathology or imaging follow-up

Agreement (Tables 3 and 4)

Table 3 Agreement, reader confidence, sensitivity, specificity, and positive and negative likelihood ratios for architectural distortion (AD) and possible AD cases
Table 4 Agreement, reader confidence, sensitivity, specificity, and positive and negative likelihood ratios for confirmed architectural distortion (AD) cases

Overall agreement among radiologists was fair to moderate [18] for DM (κ = 0.37) and moderate to good for DBT (κ = 0.61). Agreement between attendings was fair to moderate for DM (κ = 0.40) and good for DBT (κ = 0.72). Agreement between fellows was fair for DM (κ = 0.34) and moderate for DBT (κ = 0.57). In addition, for the 27 confirmed AD cases (Table 4), overall agreement was much higher with DBT (κ = 0.71) than with DM (κ = 0.35); this held true for both attendings (κ = 0.76 for DBT vs. 0.46 for DM) and fellows (κ = 0.60 for DBT vs. 0.34 for DM).

Reader confidence (Tables 3 and 4)

Level of confidence in detecting AD was higher when using DBT compared with DM (3.2 vs. 2.6 on a 0–4 scale), p < .001. Attendings’ level of confidence in detecting AD was higher when using DBT compared with DM (3.7 vs. 3.1 on a 0–4 scale), p < .001. Fellows’ level of confidence in detecting AD was higher when using DBT compared with DM (2.4 vs. 1.8 on a 0–4 scale), p < .001. As seen in Table 4, confidence also increased when using DBT (3.1) relative to DM (2.5) for confirmed cases of AD, p < .0001.

Sensitivity and specificity (Tables 3 and 4)

Overall, DBT achieved higher sensitivity than DM (.59 vs. .32), p = .0006. Sensitivity for attendings was much higher for DBT than DM (.69 vs. .29), p < .0001. Sensitivity for fellows was higher for DBT than DM (.49 vs. .35), p < .0001. Specificity remained high for both DBT and DM for both attendings and fellows (>.90). The small reduction in specificity from .96 to .93 observed for attendings was offset by the increase in specificity from .91 to .94 for fellows, resulting in no significant change overall when combined. In addition, DBT achieved higher positive likelihood ratio values, smaller negative likelihood ratio values, and larger ROC values relative to DM (Table 3). Sensitivity and specificity were also examined for the 27 confirmed AD cases. As seen in Table 4, overall sensitivity increased when using DBT relative to DM (.86 vs. .41). In particular, sensitivity increased dramatically for attendings using DBT relative to DM (.97 vs. .38), p < .001, for confirmed AD cases; an increase in sensitivity was also observed for fellows using DBT relative to DM for confirmed AD (.75 vs. .43), p < .0001. Specificity remained high throughout (.90–.96). Figure 2 shows area under the curve values for attendings and fellows for all AD and possible AD cases versus just for confirmed AD cases.
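
As a worked illustration of the likelihood ratio formulas given in the statistical methods, the Python sketch below applies them to the attendings' sensitivity and specificity for all AD and possible AD cases as reported in the text above. The function name `likelihood_ratios` is hypothetical, and because the inputs are rounded to two decimals, the results may differ slightly from the values in Table 3.

```python
def likelihood_ratios(sensitivity: float, specificity: float):
    """Return (LR+, LR-) where LR+ = sens / (1 - spec) and
    LR- = (1 - sens) / spec."""
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must lie in [0, 1]")
    if not 0.0 < specificity < 1.0:
        raise ValueError("specificity must lie strictly between 0 and 1")
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


# Attendings, all AD and possible AD cases (rounded values from the text).
attendings = {
    "DM": {"sensitivity": 0.29, "specificity": 0.96},
    "DBT": {"sensitivity": 0.69, "specificity": 0.93},
}

results = {
    technique: likelihood_ratios(v["sensitivity"], v["specificity"])
    for technique, v in attendings.items()
}

for technique, (lr_pos, lr_neg) in results.items():
    print(f"{technique}: LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

With these rounded inputs, the positive likelihood ratio for attendings rises from roughly 7 with DM to roughly 10 with DBT, and the negative likelihood ratio falls from roughly 0.74 to roughly 0.33.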

Fig. 2

Area under the curve values for all architectural distortion (AD) and possible AD cases (‘Possible’) versus just for confirmed AD cases (‘Confirmed’)

Discussion

AD is a subtle but clinically important mammographic finding that may be the earliest manifestation of breast cancer [3]. The sensitivity for AD is lower than for non-AD manifestations of breast cancer [17], and AD is the most commonly missed abnormality on false-negative mammograms [4]. The results of our study indicate that, compared to DM, DBT decreases IOV, increases reader confidence, increases sensitivity, and maintains high specificity in the detection of AD. Because our study design held patient images constant (i.e., patient images serve as their own controls) and radiologists constant (i.e., radiologists serve as their own controls) while only imaging technique (DBT vs. DM) was manipulated, observed differences in radiologists’ performance (e.g., sensitivity, specificity) regarding patient images can be attributed to the direct effect of imaging technique.

IOV is an assessment of radiologists’ consensus and, particularly in the setting of expert readers, ambiguity of difficult findings. Because there is no gold standard to confirm mammographic findings other than expert consensus, improving consistency in clinical practice is critical to providing the highest quality care. In a study performed prior to DBT, AD was found to have high IOV compared to other mammographic abnormalities [16]. Our study shows that DBT decreases IOV for AD compared to DM. Consistency increased among our four readers together as well as between our two experienced readers and between our two less experienced readers.

In addition, reader confidence was significantly higher when using DBT compared with DM. Confidence in mammogram interpretation is associated with improved accuracy, particularly among low volume readers [19]. Similarly, Tucker et al. recently showed that the increase in sensitivity with DBT is greater for readers with less than 10 years of experience [20]. Compared to DM, DBT allowed for increased detection of AD by all of our readers when examined together as well as by the experienced readers alone (i.e., when examined without the less experienced readers) and by the less experienced readers alone (i.e., when examined without the experienced readers).

The sensitivity for detecting AD in our study was 29–35% with DM and 49–69% with DBT, lower than the 87% found by Suleiman et al. [17], although all readers in that study were ‘experienced breast screen readers’ and all cases contained a biopsy-proven malignancy. With DBT, the sensitivity of our attending readers was higher than that of our fellow readers but still lower than that found by Suleiman et al. This is likely due to the differences in case mix. Our cases were consensus reviewed as having AD or possible AD on screening images, although 32/59 cases resolved with additional imaging or turned out to be postsurgical change that was not readily apparent to the consensus reviewers. Scars from prior benign surgeries were not always marked by the technologist, thus some cases of postsurgical distortion were not prospectively identified as such. Many of the cases that were not identified as having AD by our four readers were from this group of 32 cases. When we examined only confirmed AD cases (i.e., only the cases that persisted on additional imaging and required biopsy), sensitivity with DBT increased to 97% for attendings and 75% for fellows, which is similar to the sensitivity reported by Suleiman et al. It is possible that additional years of experience with DBT between the time of the consensus review and the time of this study helped the readers to better select only the true cases of AD. Alternatively, it may be that studies selected as having possible AD or AD during the consensus review process were already known to have been read as AD or possible AD by the original radiologist, while in this study the readers also had control cases and were blinded to the original interpretation. Patient information or previous imaging could have contributed to the original designation of AD, but our readers were blinded to patient information and previous imaging. A study by Partyka et al. showed that some AD is seen better or only on DBT compared to DM [11], and our cases were consensus reviewed using both DM and DBT (as in clinical practice but not in our experiment); this likely accounts for some of the low sensitivity seen with DM alone in our study.

Importantly, the benefit of increased sensitivity of DBT compared to DM did not come at a cost to specificity, as specificity remained high for both DBT and DM for both attendings and fellows. Although the change in specificity was statistically significant for each group individually, there was no statistically significant difference when the groups were combined. The 3% change (decrease for attendings and increase for fellows) when specificity remained >90% may not be clinically relevant.

The relative benefit in diagnostic performance of DBT compared with DM can also be seen from the likelihood ratio estimates and area under the curve estimates. Specifically, the overall positive likelihood ratio almost doubled when using DBT relative to DM and area under the curve increased by 10%, both indicating a stronger association with AD for DBT relative to DM. Though both groups experienced these gains, fellows in particular experienced greater gains in diagnostic performance when using DBT.

Our study design controlled for bias in the following ways: (1) Radiologists were blinded to patient information (e.g., patient identifiers, outcomes), thus reducing the chance of recall due to patient recognition; (2) Reading sessions were separated by at least 1 month and the same patients were viewed using a different imaging technique in each session, thus reducing the likelihood that radiologists would recall images read with the other technique in the previous session (this time lag was selected to allow for memory decay and retrograde interference of memory, given the high volume of breast imaging interpreted by the radiologists in clinical practice); (3) Recall bias, if present, would be the same for DBT and DM because of the counterbalanced design. In addition, although the cases were selected 3 years prior to the study, memory decay and retrograde interference would reduce recall bias after reading thousands of images in clinical practice over the intervening years; (4) To control for order effects, the order of patient images was held constant across sessions 1 and 2, but the position of AD versus non-AD cases within that order was randomly assigned using a block design.

Our study has some limitations. As this was an efficacy study, a 1:1 matched AD/non-AD design was used to optimise detection of change in sensitivity and specificity equally; positive predictive value and false-positive rate cannot be assessed because of their inherent relationship with prevalence. In addition, as the goal of our study was to examine AD (known to arise from a variety of aetiologies), we included all cases of consensus-reviewed AD, not just those cases due to cancer. A recent study [21] comparing single view DBT and DM showed an increase in recalls for stellate distortions with DBT, most assessed as normal breast tissue after additional imaging; the second most common aetiology was radial scar [21]. Radial scars diagnosed on percutaneous biopsy risk being upgraded to malignancy and thus are often excised, but increased detection of radial scars with DBT compared to DM warrants additional investigation into the need for excision in all cases, particularly when there is no evidence of atypia [22, 23]. The inclusion of all AD and possible AD cases in our study, benign and malignant, was to reflect a more real-world collection of cases as seen in clinical practice. While our findings suggest that DBT may improve detection of cancer presenting as AD, our study did not specifically address this. Our study also examined DM alone versus DBT alone, although DBT is used in conjunction with DM or synthesised two-dimensional imaging in clinical practice. When reading DM in conjunction with DBT, the presence of AD on DBT alone (i.e., AD not visible on corresponding DM) would likely prompt further work-up by most radiologists, although we did not specifically address this scenario with our study design. Use of digitally reconstructed two-dimensional images from DBT, as opposed to use of a separate DM acquisition, is increasing.
We believe that our results with DM would translate to reconstructed images because both are two-dimensional techniques as opposed to DBT, although we did not examine this in our study. Finally, our cases of AD were agreed upon at consensus review by three radiologists, but different consensus reviewers could disagree on the presence of AD in our case mix.

In conclusion, digital breast tomosynthesis decreases IOV, increases reader confidence, and improves sensitivity while maintaining high specificity in detecting architectural distortion.