1 Introduction

Breast cancer is the most commonly diagnosed cancer among women [15]. Mammography is a common examination for early breast cancer diagnosis, and malignancy classification of mammograms is therefore crucial. Most existing methods require extra annotations, such as bounding boxes for detection [1, 8, 12, 16, 17] and mask ground truth for segmentation [7]. However, such extra annotations require expert domain knowledge and are costly to obtain. Therefore, mammogram malignancy classification with only image-level labels as supervision is of vital clinical significance.

Exploring lesion features from a full mammogram image is the key to solving this problem. However, lesion exploration is very challenging since lesions can take diverse appearances and high-intensity breast tissues may partially obscure them. Previous works mainly use attention mechanisms for abnormality exploration, e.g., Zhu et al. [20] and Fukui et al. [2]. However, the lack of mammogram domain knowledge limits their performance.

Learning healthy generation could be an effective way to exploit domain structure priors. Given a diseased image, if we know how its healthy version behaves, we can localize the abnormal regions easily by the difference between the original and its healthy version. Such a prior thus provides a more direct and credible way to localize abnormalities in a full mammogram. AnoGAN [13] applies this idea to anomaly detection, but training with only healthy images restricts its effectiveness in our application. Fixed-Point GAN [14] and CycleGAN [19] can be used for healthy generation based on the cycle consistency mechanism. An intuitive way to apply them to our task is to regard unhealthy images as one domain and healthy images as another. However, such approaches have two major limitations. First, we need to know which images are healthy. Moreover, healthy patterns in mammograms vary widely and can even resemble lesions in some cases, as shown in Fig. 1. We want to generate a healthy image that maintains all the healthy content of the original; regarding healthy images as the generation reference may lead to diverse generations and conflict with this goal. Second, the cycle consistency mechanism assumes that the translated data can be translated back to the original data [5, 10], which leads to the preservation of the original features, e.g., large objects and textures. However, lesions in our application can appear anywhere and have diverse appearances, which makes translating the healthy domain back to the original domain an ill-posed task. Thus such methods result in undesirable lesion removal in our application.

Fig. 1. Cases showing how unhealthy breasts look asymmetrical, while healthy breasts are roughly symmetrical

To address the first problem, we directly use the contralateral image as the generation reference by exploiting the bilateral symmetry prior of mammograms. For clarity, we call the image to be classified the target and the image of the opposite side the contralateral. Bilateral breasts from the same person have a roughly symmetrical glandular pattern. Most lesions appear on only one side and are invisible in the symmetrical regions of the opposite side. Therefore, the contralateral can be an effective reference for generating the healthy version of the target. Besides, a standard mammography examination contains images of both sides, so no extra data are required. To tackle the second problem, i.e., the ill-posed back translation, we preserve the suspicious lesion information while translating the target data to its healthy version, and add it back when translating in the reverse direction. Thus, the information fed to the back translation is sufficient.

In this paper, we propose a novel model named Bilateral Residual Generating Adversarial Network (BR-GAN) to improve mammogram malignancy classification by making use of the bilateral prior and a healthy generation mechanism. First, we propose a bilateral-cycle mechanism. We use contralateral images as references instead of healthy images. Due to the bilateral misalignment problem, we perform the generation at the feature level. Second, we propose a residual-preserved mechanism for better preserving lesion information during translation. While generating healthy features, we preserve the target-healthy residual features with an attention mechanism. We constrain the preserved features and the target features to share the same malignancy prediction via a residual embedding loss. In the healthy/contralateral-to-target translation, the preserved features are also fed into the translation network. Finally, we aggregate the generated features with the target features for further classification. Experimental results on both a public dataset and an in-house dataset demonstrate that the proposed BR-GAN achieves state-of-the-art performance.

Fig. 2. The schematic overview of BR-GAN. The framework is based on CycleGAN [19] but uses contralateral features as references and adds the preserved features, calculated in the Residual-Preserved Module, to the back-translation network. The generated features are then fed into the Classification Module together with the target features, and the Classification Module outputs benign/malignant labels.

2 Bilateral Residual Generating Adversarial Network

Figure 2 outlines the overall network architecture of our framework. We first use contralateral features as references and generate the healthy version of the target features (Sect. 2.1). Then we feed both the target features and the residual between the generated features and the target features into the Classification Module to predict labels (Sect. 2.2). We design our model at the feature level instead of the pixel level due to the bilateral misalignment.

2.1 Feature Generation

The generated features are the healthy version of the target features. To achieve feature generation, we propose a bilateral-cycle mechanism based on a CycleGAN framework. Due to the limitation of the cycle mechanism, we design a residual-preserved mechanism to provide lesion information for the translation from healthy back to unhealthy.

Bilateral-Cycle Mechanism. The GAN loss is defined as in Eq. 1.

$$\begin{aligned} \begin{aligned} \min _G \max _D \mathcal {L}_{G}(G,D,f_{target},&f_{reference}) := \log \left( D\left( f_{reference} \right) \right) \\&+ \log \left( 1 - D\left( G(f_{target}) \right) \right) , \end{aligned} \end{aligned}$$
(1)

where \(f_{reference}\) denotes the features of the reference used in the discriminator D and \(f_{target}\) denotes the features of the target image to be classified.
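For concreteness, a minimal PyTorch sketch of this adversarial objective is given below. It assumes the discriminator ends with a sigmoid so its output is a probability; the function name and epsilon are illustrative rather than from the paper.

```python
import torch

def gan_losses(D, G, f_target, f_reference, eps=1e-8):
    """Sketch of Eq. 1: D scores reference features as real and generated
    features as fake; G translates target features toward the reference.
    Assumes D outputs probabilities in (0, 1)."""
    f_generated = G(f_target)
    # Discriminator side (the max player): log D(f_ref) + log(1 - D(G(f_tgt))).
    d_loss = -(torch.log(D(f_reference) + eps).mean()
               + torch.log(1 - D(f_generated.detach()) + eps).mean())
    # Generator side (the min player) of the same objective.
    g_loss = torch.log(1 - D(f_generated) + eps).mean()
    return d_loss, g_loss
```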

Most paired healthy breasts are roughly symmetrical, while abnormalities are rarely symmetrical. Thus, contralateral features are appropriate references for healthy generation, and we use them as references in our basic CycleGAN framework. Generator \(G_{T2C}\) tries to generate, from the target features \(f^{T}\), healthy features \(f^{T}_{H}\) that look similar to the contralateral features \(f^{C}\), while \(D_C\) aims to distinguish between the translated features \(f^{T}_{H}\) and the real features \(f^{C}\). Generator \(G_{T2C}\) is optimized by \(\underset{G_{T2C}}{min} \underset{D_C}{max} {L}_{G_{T2C}}(G_{T2C},D_C,f^{T},f^{C})\), with \(f^{T}_{H} = G_{T2C}(f^{T})\). Meanwhile, generator \(G_{C2T}\) tries to translate the generated healthy features \(f^{T}_{H}\) back to the target features \(f^{T}\), which helps \(f^{T}_{H}\) maintain the target features in lesion-free areas.

However, lesions in our application can appear anywhere and have multiple shapes. Due to the limitation of the cycle consistency mechanism mentioned in Sect. 1, the generated healthy features cannot provide lesion information for the back translation. Thus it becomes an ill-posed problem if we feed the generated features \(f^{T}_{H}\) into the generator \(G_{C2T}\) directly. We propose a residual-preserved mechanism to tackle this problem.

Residual-Preserved Mechanism. While translating the target features to their healthy version, we separate the suspicious lesion features, i.e., the preserved features \(f^{T}_{P}\), from the target features in the Residual-Preserved Module. The preserved features \(f^{T}_{P}\) serve as guidance indicating the predicted lesion information, so they should contain the texture and spatial information of the lesions. The concatenation of the preserved features \(f^{T}_{P}\) and the generated features \(f^{T}_{H}\) is the input to the generator \(G_{C2T}\), i.e., \(G_{C2T}([f^{T}_{H}, f^{T}_{P}])\). The preserved features \(f^{T}_{P}\) provide lesion features for the back translation and avoid the ill-posed problem.

To calculate the preserved features, we first compute the residual between the target features and the generated features. Second, to avoid the back-translation network \(G_{C2T}\) collapsing to a direct identity mapping, we do not use the residual as the preserved features directly. Instead, we turn the residual features into an attention map with a softmax function for normalization. If the generated features learn to be the healthy version of the target features, the locations with high values on the residual features should indicate high abnormal probabilities. Third, we multiply the attention map with the target features. Finally, we obtain the preserved features \(f^{T}_{P}\), defined as:

$$\begin{aligned} f^{T}_{P}=f^{T} * softmax(f^{T} - f^{T}_{H}) \end{aligned}$$
(2)
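A possible implementation of Eq. 2 is sketched below; the paper does not specify the axis of the softmax, so normalizing over the spatial locations of each channel is our assumption.

```python
import torch
import torch.nn.functional as F

def preserved_features(f_T, f_H):
    """Sketch of Eq. 2: f_P = f_T * softmax(f_T - f_H).
    Softmax over spatial locations per channel is an assumption."""
    b, c, h, w = f_T.shape
    residual = (f_T - f_H).view(b, c, h * w)
    attn = F.softmax(residual, dim=-1).view(b, c, h, w)  # attention map
    return f_T * attn
```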

To further encourage a successful separation, we define a residual embedding loss (Eq. 3) that constrains the preserved features and the target features to share the same malignancy prediction. We use the malignancy classifier to predict the malignant probabilities \(p_m(\cdot )\) of the target features \(f^{T}\) and the preserved features \(f^{T}_{P}\).

$$\begin{aligned} \mathcal {L}_{RE}= -p_m(f^{T}) * log(p_m(f^{T}_{P})) - (1-p_m(f^{T})) * log(1-p_m(f^{T}_{P})). \end{aligned}$$
(3)
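Eq. 3 is a binary cross-entropy in which the target prediction acts as a soft label. A minimal sketch follows; treating \(p_m(f^{T})\) as detached (no gradient flowing through the label side) is our assumption.

```python
import torch

def residual_embedding_loss(p_target, p_preserved, eps=1e-8):
    """Sketch of Eq. 3: tie the preserved features' malignancy prediction
    to that of the target features via cross-entropy."""
    t = p_target.detach()  # assumption: target prediction is a fixed soft label
    return -(t * torch.log(p_preserved + eps)
             + (1 - t) * torch.log(1 - p_preserved + eps)).mean()
```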

We design the residual cycle consistency loss \(\mathcal {L}_{c}^{T}\), measured by a selected mean squared error (MSE), to achieve \(f^{T} \rightarrow G_{T2C}(f^{T}) \rightarrow G_{C2T}([G_{T2C}(f^{T}), f^{T}_{P}]) \approx f^{T}\).

However, using only one residual cycle consistency constraint may lead to a collapsed identity mapping from the contralateral features. To avoid this problem, we design another cycle consistency loss \(\mathcal {L}_{c}^{C}\), also measured by MSE, which achieves \(f^{C} \rightarrow G_{C2T}([f^{C}, f^{T}_{P}]) \rightarrow G_{T2C}(G_{C2T}([f^{C}, f^{T}_{P}])) \approx f^{C}\). The generator \(G_{C2T}\) is optimized by \(\underset{G_{C2T}}{min} \underset{D_T}{max} {L}_{G_{C2T}}(G_{C2T},D_T,f^{C},f^{T})\), while \(D_T\) aims to distinguish between the translated features \(G_{C2T}([f^{C},f^{T}_{P}])\) and the real target features \(f^{T}\).
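Both cycle terms can be sketched as follows; plain MSE stands in for the "selected" MSE (whose selection rule is not defined here), and channel-wise concatenation of the preserved features is assumed.

```python
import torch
import torch.nn.functional as F

def cycle_losses(G_T2C, G_C2T, f_T, f_C, f_P):
    """Sketch of the two cycle consistency terms (plain MSE as a stand-in)."""
    # Target cycle: f_T -> healthy -> back, with preserved features appended.
    f_H = G_T2C(f_T)
    f_T_rec = G_C2T(torch.cat([f_H, f_P], dim=1))
    loss_c_T = F.mse_loss(f_T_rec, f_T)
    # Contralateral cycle: f_C -> target-like -> back to f_C.
    f_C2T = G_C2T(torch.cat([f_C, f_P], dim=1))
    loss_c_C = F.mse_loss(G_T2C(f_C2T), f_C)
    return loss_c_T, loss_c_C
```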

2.2 Classification

From the feature generation procedure, we obtain the healthy features \(f^{T}_{H}\) of the target image \(x^T\). The healthy features \(f^{T}_{H}\) and the target features \(f^{T}\) are fed into the Classification Module for the final classification. In the module, we first calculate the residual between the generated features \(f^{T}_{H}\) and the original target features \(f^{T}\). Then the concatenation of the residual and the original target features \(f^{T}\), which contain global semantic information, is used to predict benign/malignant labels. We use the cross-entropy loss as the loss function \(\mathcal {L}_{CLS}\) for mammogram classification.
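A minimal sketch of such a module is given below; the 1x1 fusion layer, global pooling, and layer sizes are illustrative assumptions, since the paper does not detail the module's architecture.

```python
import torch
import torch.nn as nn

class ClassificationModule(nn.Module):
    """Sketch: classify from [f_T - f_H, f_T]; sizes are illustrative."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, 2)  # benign vs. malignant logits

    def forward(self, f_T, f_H):
        residual = f_T - f_H                   # suspected-lesion evidence
        x = torch.cat([residual, f_T], dim=1)  # keep global semantics
        x = self.pool(self.fuse(x)).flatten(1)
        return self.fc(x)
```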

During training, we optimize both feature generation and classification modules jointly as in Eq. 4.

$$\begin{aligned} \begin{aligned} \mathcal {L}&= \mathcal {L}_{RE} + \mathcal {L}_{CLS} + \min _{G_{T2C}} \max _{D_C} \mathcal {L}_{G_{T2C}}(G_{T2C},D_C,f^{T},f^{C}) \\&\quad + \min _{G_{C2T}} \max _{D_T} \mathcal {L}_{G_{C2T}}(G_{C2T},D_T,f^{C},f^{T}) + \mathcal {L}_{c}^{T} + \mathcal {L}_{c}^{C} \end{aligned} \end{aligned}$$
(4)
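In code, Eq. 4 amounts to summing the six terms, with the two adversarial terms handled by alternating generator/discriminator updates. A trivial sketch follows; the unweighted sum is our assumption, as the paper states no term weights.

```python
def total_objective(parts):
    """Sketch of Eq. 4 as an unweighted sum of named scalar loss tensors.
    In practice, the adversarial terms alternate min (generator) and
    max (discriminator) updates rather than entering one joint sum."""
    return sum(parts.values())

# Hypothetical usage with precomputed terms:
# loss = total_objective({"RE": loss_RE, "CLS": loss_CLS,
#                         "G_T2C": g_loss_T2C, "G_C2T": g_loss_C2T,
#                         "cyc_T": loss_c_T, "cyc_C": loss_c_C})
```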

3 Experiments

3.1 Experimental Settings

Datasets. We evaluate BR-GAN on the public INBreast dataset [9] and an in-house dataset. INBreast [9] has 115 cases and 410 mammograms, and provides a BI-RADS result for each image as image-wise ground truth. We use the same processing as Zhu et al. [20] (malignant if BI-RADS > 3; benign otherwise). For a fair comparison, our settings for mass classification on INBreast [9] are the same as Zhu et al. [20]. However, we discard 9 images for lack of contralateral images; all remaining images have contralaterals. In addition, we also attempt mixed-lesion classification, including masses, calcification clusters and distortions, to test generalization. The in-house dataset contains 1303 images with malignancy annotations from 642 patients, including 589 with only masses, 120 with only suspicious calcifications, 34 with only architectural distortions, 197 with only asymmetries and 363 with multiple lesions. All 1303 images have opposite sides, i.e., 1303 pairs. We randomly divide the dataset into training, validation and testing sets in an 8:1:1 ratio patient-wise.

Implementation Details. We use Otsu's method [11] to segment the breast regions and remove backgrounds from the original images in 14-bit DICOM format. We implement all models with PyTorch and use the Adam optimizer. Both target and contralateral features are extracted from the last convolution layer. We use the Area Under the Curve (AUC) as the image-wise evaluation metric.
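As an illustration of this preprocessing step, a rough breast segmentation via Otsu's method might look like the scikit-image-based sketch below; keeping the largest connected component is our assumption, as the paper's exact post-processing is unspecified.

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import label, regionprops

def segment_breast(img):
    """Rough breast mask from a 2-D array of raw 14-bit pixel values."""
    mask = img > threshold_otsu(img)  # foreground vs. dark background
    lbl = label(mask)
    # Keep the largest connected component, assumed to be the breast.
    largest = max(regionprops(lbl), key=lambda r: r.area)
    return lbl == largest.label
```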

3.2 Performances

Mass Classification. The first four lines of Table 1 summarize the results of the representative methods. For fairness, we compare results with the AlexNet [6] and ResNet50 [4] backbones separately. Because discarding the images without contralateral references slightly changes the image set, we re-implement some representative methods for mammogram classification [20], natural image classification [2, 18] and healthy generation [13, 14, 19] by adjusting the source code provided by the authors; these methods are marked with '*' in the table.

Mixed-Lesion Classification. The performances on the INBreast dataset and the in-house dataset are shown in the last two columns of Table 1.

Table 1. AUC evaluation on (a) INBreast for mass classification with AlexNet; (b) INBreast for mass classification with ResNet50; (c) INBreast for mixed-lesion classification with ResNet50; (d) in-house dataset for mixed-lesion classification with AlexNet.
Table 2. Top-1 localization error on (b) the INBreast dataset for mass classification with ResNet50; (d) the INBreast dataset for mixed-lesion classification with ResNet50.
Fig. 3. Visualization of class activation maps of Vanilla CNN, AnoGAN [13], Fixed-Point GAN [14], CycleGAN [19], Zhu et al. [20], ABN [2] and our BR-GAN. The target containing lesions is bounded by a red rectangle. The ground truth bounding boxes are labeled by green rectangles in the third column. (Color figure online)

Results. Attention mechanisms (Zhu et al. [20], CAM [18], ABN [2]) work, but are limited by the lack of mammogram domain knowledge. Training with only healthy data (AnoGAN [13]) relies heavily on the amount of healthy data and is limited by the lack of reference to unhealthy data. The cycle consistency mechanism (Fixed-Point GAN [14], CycleGAN [19]) is effective to some extent but suffers from the ill-posed back translation problem in our application. In contrast, our BR-GAN outperforms the representative methods significantly on both datasets.

To further evaluate the effectiveness of the generated features (the healthy version of the target features), we calculate the mean FID [3] to measure the average feature-distribution distance on the INBreast dataset. The mean FID between the target and contralateral features is 63.63. The generated-contralateral mean FID is 27.54. The target-generated mean FID is 22.81, and it drops to 0.73 after removing the ground-truth lesion areas. From these comparisons, we find that the generated features contain both the contralateral distribution and the target information in healthy areas, as intended.
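For reference, the Fréchet distance between two sets of feature vectors can be computed as in the sketch below; how features were pooled per image for the paper's mean FID is not specified, so the (n_samples, dim) layout is an assumption.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_a, feats_b):
    """Fréchet distance between two (n_samples, dim) feature arrays."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical noise
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a + cov_b - 2 * covmean))
```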

Localization. To verify whether the proposed model focuses on the lesion areas, we evaluate the localization error with CAM [18]. We use the top-1 localization error as in ILSVRC, with an intersection-over-union (IoU) threshold of 0.1. As shown in Table 2, BR-GAN largely outperforms the representative methods.
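The IoU criterion itself is standard; as a sketch, a predicted box counts as correct when the following value reaches the 0.1 threshold (corner-format boxes assumed):

```python
def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes; >= 0.1 counts as a hit here."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```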

Furthermore, Fig. 3 visualizes the class activation maps of some cases. As we can see, all lesions satisfy the bilateral asymmetry prior. The proposed BR-GAN succeeds in focusing on all lesions since it incorporates the bilateral asymmetry prior and modifies the cycle mechanism. The other methods, which do not consider bilateral information, show uneven results.

3.3 Ablation Experiments

To verify the effectiveness of each component, we evaluate several variant models and show the results in Table 3. The variants are defined as follows:

SBF: Simple Bilateral Features. The bilateral features are combined and fed into the fusion layer directly;

Single: Only use the consistency loss \({L}_{c}^{T}\);

Double: Use both consistency losses \({L}_{c}^{T}\) and \({L}_{c}^{C}\);

Mask: Whether the attention mask is used in the Residual-Preserved Module.

Note that bilateral breasts exhibit misalignment, so classifying with SBF is not robust enough. As shown in Table 3, the bilateral-cycle mechanism, the double consistency losses, the residual-preserved mechanism and the attention mask for the preserved features all prove effective.

Table 3. Ablation experiments on (a) INBreast dataset for mass classification with AlexNet; (b) INBreast dataset for mass classification with ResNet50; (c) INBreast dataset for mixed-lesion classification with ResNet50; (d) in-house dataset for mixed-lesion classification with AlexNet.

4 Conclusions

In this paper, we present a novel approach called the Bilateral Residual Generating Adversarial Network (BR-GAN) to improve mammogram classification performance. The approach generates the healthy version of the target features in a novel way to help find the abnormal features, which also enhances the interpretability of the results for clinical diagnosis. Experimental results indicate that the proposed BR-GAN achieves state-of-the-art performance on both the public and the in-house datasets.