
1 Introduction

Cancer is a disease that affects cells, and detecting it at an early stage increases the chances of recovery. Breast cancer is one of the most common cancers among women, affecting about 10% of women at some point in their lives, and it shows high incidence and mortality rates. It is the second leading cause of cancer-related death among females, after lung cancer [1]. The World Health Organization’s International Agency for Research on Cancer (IARC) reported an anticipated increase in the number of breast cancer cases to 1.1 million by 2030, with the gap between developed and developing nations expected to widen [2]. Cancer can be described as the uncontrolled proliferation of abnormal cells that form masses. These tumors can be benign or malignant: benign tumors remain localized and grow slowly, whereas malignant tumors invade nearby structures and may destroy other parts of the body [3]. Mammography is an effective imaging technique for the detection of breast cancer and the one most used in screening programs [4]. It helps detect suspicious lesions such as masses and microcalcifications. However, because mammography projects the breast onto a 2D image, tissue overlap can mask a lesion or create a false one, producing false-positive and false-negative results. In addition, mammography is known to be less sensitive in dense breasts (30–64%) than in fatty breasts (76–98%), while women with dense breasts have been shown to be more susceptible to breast cancer.

In the last decade, several works based on artificial intelligence tools have been developed to enhance computer-aided diagnosis (CAD) of breast cancer. These approaches have shown their ability to detect abnormal lumps and calcifications in the breast and to predict their growth. They help radiologists and oncologists diagnose breast cancer by providing a second opinion. In this work, we propose a new breast cancer detection system using a pre-trained DenseNet with an integrated attention mechanism. The combination of these two models has proved effective in improving detection performance: the attention mechanism increases the weights of the relevant features and decreases those of the others, leading to better decisions. Furthermore, we applied a data augmentation technique to increase the number of training images and improve the model's generalization.

The rest of this paper is organized as follows. Section 2 presents related works in the breast cancer detection field. Section 3 describes the different parts of the proposed methodology. Section 4 presents the experimental results and a comparison with similar works. We conclude the paper with a summary and some prospects.

2 Related Works

In the literature, numerous approaches have been proposed for breast cancer detection based on mammography images. Samee et al. [5] proposed a breast cancer detection system based on several deep learning architectures: features were extracted automatically from the AlexNet, VGG, and GoogleNet models and then fused for the prediction task. This system was evaluated on the INbreast database and achieved an accuracy of 98.50%. In another work, Jiang et al. [6] introduced a new dataset of breast mammograms named Film Mammography dataset number 3 (BCDR-F03). They applied the GoogLeNet and AlexNet models to classify segmented tumors found on mammograms, and obtained accuracies of 88% and 83%, respectively. Ribli et al. [7] used a Region-based CNN (R-CNN) model to detect and classify breast lesions in mammograms, obtaining an accuracy of 95% on the INbreast dataset. Alruwaili et al. [8] used the ResNet50 model to distinguish between malignant and benign breast cancer. Data augmentation was applied to increase the number of training images and prevent over-fitting; the model was assessed on the MIAS dataset and achieved an accuracy of 89.5%. Kaur et al. [9] proposed a hybrid model combining deep neural networks (DNN) and Support Vector Machines (SVM), where the SVM replaced the regular dense layers after the DNN feature extraction part. The results showed that the SVM improves the recognition rate from 70% to 96.9% on the multi-class MIAS dataset. Mohapatra et al. [10] evaluated several pre-trained deep learning models, such as AlexNet, VGG16, and ResNet50, on mammogram images. Due to the limited number of training images, they applied data augmentation to address over-fitting; the experiments on the Mini-DDSM dataset reached an accuracy of 65% with ResNet50. Muduli et al. [11] proposed a deep convolutional neural network (CNN) model for breast cancer classification using mammograms and ultrasound images. This model extracts prominent features from the images with only a few tunable parameters. They applied data augmentation to increase the number of training images and prevent over-fitting; the model was evaluated on the MIAS and INbreast datasets and achieved accuracies of 90.68% and 91.28%, respectively. Rouhi et al. [12] proposed a model for primary breast cancer detection using a region-growing method. Their model hybridizes a cellular neural network with a genetic algorithm and achieved accuracies of 96.47% and 95.13% on the MIAS and DDSM databases, respectively. In [13], transfer learning was applied with the pre-trained deep neural networks Inception V3, ResNet50, VGG16, and Inception-ResNet. The best result was obtained with the VGG16 model, which achieved an accuracy of 98.96% on the MIAS database. Punithavathi et al. [14] proposed a hybrid model based on SVM and KNN classifiers: multiple categories of images were fed to the SVM, and the final decision was made by the KNN algorithm. This model achieved an accuracy of 99.34% on the MIAS dataset. Pillai et al. [15] evaluated several pre-trained deep learning models, such as EfficientNet, AlexNet, VGG16, and GoogleNet, on the MIAS database, applying data augmentation to increase the number of training images and prevent over-fitting. The best performance was obtained with the VGG16 model, which achieved an average accuracy of 75.46%. Chougrad et al. [16] applied a fine-tuned Inception-v3 model to the MIAS database to classify breast lesions and obtained an accuracy of 98.23%. Selvathi et al. [17] proposed a new system for breast cancer detection based on stacked autoencoder architectures with a softmax classifier. They also preprocessed the MIAS images to remove noise, background, and pectoral muscle, and obtained an accuracy of 98.5%.

3 Proposed Methodology

In this section, we present our system for multi-class breast cancer detection based on mammogram images. The proposed methodology employs the pre-trained DenseNet121 model truncated at the feature extraction part, followed by an attention model that gives more importance to the relevant features of the Region of Interest (ROI). Thereafter, the convolution and attention modules are combined to fuse the high-level information with the relevant semantic information. The resulting features are fed into a Global Average Pooling (GAP) layer to reduce the feature map dimensions and preserve pertinent features for the classification part.

3.1 DenseNet121 Architecture

Dense Convolutional Network (DenseNet) is a modern CNN architecture designed for visual object recognition with relatively few parameters [18]. It achieved state-of-the-art results on several image classification datasets, such as CIFAR-10, SVHN, and ImageNet [19]. The basic structure of the network comprises two component modules: dense and transition blocks (Fig. 1). DenseNet-121 contains a total of 4 dense blocks and 3 transition blocks. Each layer in a dense block is densely connected to all subsequent layers [20]. Moreover, each dense block is composed of a stack of two convolution layers with kernel sizes of \((1\times 1)\) and \((3\times 3)\), respectively. Each transition block performs a \((1\times 1)\) convolution and a \((2\times 2)\) average pooling operation. Table 1 shows the overall architecture of the DenseNet121 model: dense and transition blocks alternate, and the pair of convolution layers within the dense blocks is repeated 6, 12, 24, and 16 times, respectively.
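As an illustration, the truncated feature-extraction part of DenseNet121 is available in standard deep learning libraries. The following is a minimal sketch using TensorFlow/Keras; the library choice and the 256×256×3 input size (matching the one used in Sect. 4.5) are our assumptions rather than details prescribed by the architecture itself.

```python
# Minimal sketch: loading the DenseNet121 feature-extraction part with
# ImageNet weights (transfer learning), truncated before the original
# classification head (include_top=False).
from tensorflow.keras.applications import DenseNet121

backbone = DenseNet121(
    include_top=False,          # keep only the dense and transition blocks
    weights="imagenet",         # pre-trained weights
    input_shape=(256, 256, 3),  # image size used in our experiments
)
backbone.summary()              # shows the 4 dense blocks and 3 transition blocks
```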

Fig. 1.

DenseNet121 model concept [21]

Table 1. DenseNet121 structure

3.2 Self Attention Model

After the global average pooling layer, we implemented a Multi-Head Self-Attention (MHSA) model to improve the model's effectiveness (Fig. 2). MHSA is a mechanism that provides additional focus on specific components of the data: it enables the network to concentrate on a few aspects at a time and ignore the rest [22]. Instead of performing a single attention function, MHSA consists of several attention layers running in parallel. The input consists of queries and keys of dimension \(d_{k}\) (Q and K, respectively) and values of dimension \(d_{v}\) (V). The output of the attention model is obtained by computing the scaled dot product of the queries with all keys and applying a SoftMax function to obtain the weights on the values V (Eq. 1). The queries, keys, and values are linearly projected h times with different learned weight matrices (\(W_{Q}\), \(W_{K}\), \(W_{V}\)), and the resulting representation subspaces are concatenated to form the final output (Eq. 2). We applied a particular version of the attention model called self-attention, in which the query, key, and value inputs are the same. The calculation proceeds as follows: first, we compute the dot product (MatMul) of the query and key tensors and scale the obtained scores; next, we apply a SoftMax function to these scores to obtain attention probabilities; finally, we take a linear combination of these distributions with the value tensors and concatenate the heads into one output.

Fig. 2.

Attention model architecture

$$\begin{aligned} Attention(Q, K, V) = Softmax\left( \frac{Q\times K^T}{\sqrt{d_{k}}}\right) \times V \end{aligned}$$
(1)
$$\begin{aligned} {\left\{ \begin{array}{ll} \text {MHA}(Q, K, V) = \text {concat}(\text {head}_{1}, \ldots , \text {head}_{h}) \\ \text {head}_{i} = \text {Attention}(QW_{Q}^{i}, KW_{K}^{i}, VW_{V}^{i}), \quad i = 1, \ldots , h \end{array}\right. } \end{aligned}$$
(2)
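For illustration, Eqs. 1 and 2 can be written as a short sketch in TensorFlow/Keras; the tensor shapes are arbitrary examples, and the 8 heads with 64-unit projections follow the setting reported in Sect. 4.3.

```python
# Sketch of Eq. 1 (scaled dot-product attention) and Eq. 2 (multi-head
# self-attention); shapes are illustrative assumptions.
import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d_k)  # Q.K^T / sqrt(d_k)
    weights = tf.nn.softmax(scores, axis=-1)                   # SoftMax over the keys
    return tf.matmul(weights, v)                               # weighted sum of the values

# Self-attention: query, key and value are the same tensor.
x = tf.random.normal((1, 64, 1024))          # (batch, positions, channels)
out = scaled_dot_product_attention(x, x, x)  # Eq. 1

# Multi-head version (Eq. 2): Keras handles the learned projections
# W_Q, W_K, W_V and the concatenation of the h heads.
mhsa = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)
out_mh = mhsa(query=x, value=x, key=x)
```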

The proposed methodology takes the dot product of the DenseNet121 and self-attention model outputs. Thereafter, Global Average Pooling is applied to both the attention model output and the resulting dot-product tensor. The classification part is composed of two dense layers with dropout to prevent over-fitting. Figure 3 illustrates the different parts of the proposed breast cancer detection system.
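The following is a minimal sketch of how such an assembly could be wired in Keras, following the description above. The reshaping of the DenseNet feature maps into a sequence for the attention layer, the interpretation of the dot product as an element-wise product, and the dense-layer sizes are our assumptions for illustration, not values prescribed by the text.

```python
# Minimal sketch of the proposed pipeline (TensorFlow/Keras); layer sizes
# and the fusion operator are assumptions made for illustration.
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import DenseNet121

inputs = layers.Input(shape=(256, 256, 3))
backbone = DenseNet121(include_top=False, weights="imagenet", input_tensor=inputs)
feats = backbone.output                                   # (8, 8, 1024) feature maps

seq = layers.Reshape((8 * 8, 1024))(feats)                # flatten the spatial grid into a sequence
attn = layers.MultiHeadAttention(num_heads=8, key_dim=64)(seq, seq)  # self-attention

fused = layers.Multiply()([seq, attn])                    # combine backbone and attention features
gap_attn = layers.GlobalAveragePooling1D()(attn)          # GAP on the attention output
gap_fused = layers.GlobalAveragePooling1D()(fused)        # GAP on the fused features

x = layers.Concatenate()([gap_attn, gap_fused])
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(3, activation="softmax")(x)        # normal / benign / malignant

model = Model(inputs, outputs)
```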

Fig. 3.

Proposed methodology architecture

4 Experimentation and Results

4.1 Database Description

The proposed methodology was evaluated on the multi-class MIAS database, which contains normal, benign, and malignant breast images [23]. The database consists of 322 mammogram images of size \((1024\times 1024)\) pixels, stored in Portable Gray Map (PGM) format. The images belong to three tissue types: dense glandular, fatty, and fatty glandular, and each type is divided into three categories: normal, benign, and malignant. The dataset also contains radiologists' annotations of the locations of the abnormalities (benign, malignant), with an approximate radius around the center of each anomaly. In this work, we use all the images in the dataset: 207 normal, 64 benign, and 51 malignant images. Figure 4 shows three images from the MIAS database representing the three categories (Normal, Benign, and Malignant).
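As an illustrative sketch, a MIAS mammogram can be read and prepared for the network as follows; the file path is hypothetical, and the resizing to 256×256 with channel replication is an assumption consistent with the input shape used in Sect. 4.5.

```python
# Sketch of loading a MIAS mammogram (PGM format); path and label mapping
# are assumptions for illustration.
import numpy as np
from PIL import Image

CLASSES = ["normal", "benign", "malignant"]

def load_mias_image(path, size=(256, 256)):
    """Read a 1024x1024 grayscale PGM image, resize it, and replicate the
    channel to obtain the (256, 256, 3) shape expected by the model."""
    img = Image.open(path).convert("L")     # PGM images are grayscale
    img = img.resize(size)
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return np.stack([arr] * 3, axis=-1)     # (256, 256, 3)

x = load_mias_image("all-mias/mdb001.pgm")  # hypothetical file path
```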

Fig. 4.

MIAS database samples. (a) Normal (b) Benign (c) Malignant

4.2 Data Augmentation

Since the MIAS dataset contains only 322 images, the proposed model may not generalize well. For this reason, we applied data augmentation to increase the number of training samples in each class and prevent over-fitting. In this work, data augmentation is mainly based on geometric transformations, including rotation, flipping, and shifting. We thus obtained a new dataset of 1836 breast cancer images evenly distributed over the three classes (612 images per class). Figure 5 shows an example of data augmentation in which horizontal and vertical flips were applied to the original image.
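A minimal sketch of such geometric augmentation with Keras' ImageDataGenerator is given below; the exact transformation ranges are not specified in the text and are therefore assumptions.

```python
# Sketch of geometric data augmentation (rotation, flipping, shifting);
# the ranges below are illustrative assumptions.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=30,       # random rotations
    width_shift_range=0.1,   # horizontal shifting
    height_shift_range=0.1,  # vertical shifting
    horizontal_flip=True,    # horizontal flip (Fig. 5b)
    vertical_flip=True,      # vertical flip (Fig. 5c)
)

# x_train: training images, y_train: one-hot labels (assumed variable names)
# train_generator = augmenter.flow(x_train, y_train, batch_size=32)
```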

Fig. 5.

Data augmentation samples. (a) original (b) horizontal flip (c) vertical flip

4.3 Experimental Setup

During the experiments, the training set was divided into batches of size 32, with shuffling enabled so that the mini-batch samples differ in each epoch. In each iteration, the categorical cross-entropy loss was computed between the desired and predicted outputs. The model was trained with the Adam (Adaptive Moment Estimation) optimizer and an initial learning rate of 0.001, which is reduced by a factor of 0.5 once learning stagnates. Moreover, early stopping is applied as a regularization method: it stops the training process before the model over-fits the training data. In the multi-head self-attention model, we employed 8 parallel attention layers (heads), each with 64 units in the linear projections of the query, key, and value matrices (Table 2).
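A minimal sketch of this training configuration in Keras is shown below, assuming the `model` assembled in Sect. 3; the patience values and the monitored quantity are assumptions, as they are not reported in the text.

```python
# Sketch of the training setup: Adam with lr = 0.001, categorical
# cross-entropy, learning-rate reduction on plateau, and early stopping.
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # initial learning rate 0.001
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

callbacks = [
    # halve the learning rate once learning stagnates
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
    # stop training before the model over-fits the training data
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
]

# history = model.fit(x_train, y_train, batch_size=32, shuffle=True,
#                     validation_data=(x_val, y_val), epochs=100,
#                     callbacks=callbacks)
```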

Table 2. Hyperparameters setting

4.4 Evaluation Metrics

To assess the performance of the proposed model, the confusion matrix and several metrics were calculated: Accuracy, Recall, Precision, and F1-score (Eqs. 3–6). They are all based on the counts of True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN). TP denotes images correctly predicted as containing breast cancer, and TN relates to normal images correctly predicted as healthy. FP concerns normal images predicted as breast cancer, and FN refers to breast cancer images incorrectly predicted as normal.

$$\begin{aligned} Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
(3)
$$\begin{aligned} Recall = \frac{TP}{TP + FN} \end{aligned}$$
(4)
$$\begin{aligned} Precision = \frac{TP}{TP + FP} \end{aligned}$$
(5)
$$\begin{aligned} \text {F1-score} = 2\times \frac{Precision \times Recall}{Precision + Recall} \end{aligned}$$
(6)
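For reference, these metrics and the confusion matrix can be computed with scikit-learn as in the sketch below; `y_true` and `y_pred` are hypothetical arrays of class indices, and the macro averaging over the three classes is our assumption.

```python
# Sketch: confusion matrix and the metrics of Eqs. 3-6 with scikit-learn.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [0, 1, 2, 2, 0]  # example ground-truth labels (0: normal, 1: benign, 2: malignant)
y_pred = [0, 1, 2, 1, 0]  # example predictions

cm = confusion_matrix(y_true, y_pred)                    # rows: true classes, columns: predictions
acc = accuracy_score(y_true, y_pred)                     # Eq. 3
rec = recall_score(y_true, y_pred, average="macro")      # Eq. 4, averaged over classes
prec = precision_score(y_true, y_pred, average="macro")  # Eq. 5
f1 = f1_score(y_true, y_pred, average="macro")           # Eq. 6
```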

4.5 Experimental Results

In the experiments, the image shape was fixed to \((256\times 256\times 3)\). Several models were studied with different split ratios and optimizers, all initialized with pre-trained weights. First, we evaluated the models' performance without the self-attention mechanism; the best result was obtained with DenseNet-121 (Table 3). When applying the multi-head self-attention mechanism, the DenseNet-121 accuracy improved by 6%, reaching 0.9939 with a 90% training split. Several other metrics were also evaluated, such as recall, precision, and AUC (Table 4); for all of them, the best results were obtained using the DenseNet-121 model with the Adam optimizer. Figures 6 and 7 show the confusion matrices and classification reports for the different split ratios. We observe that the model's performance improves when using the multi-head self-attention mechanism. Moreover, the proposed model discriminates well between benign and malignant samples, but it confuses the normal and benign classes (Table 5).

Table 3. Models accuracies without and with attention
Table 4. DenseNet model performance with different split ratio
Table 5. Performance results with optimizers
Fig. 6.

Confusion matrices without self attention. (a) split ratio (70:30) (b) split ratio (80:20) (c) split ratio (90:10)

Fig. 7.

Confusion matrices with self attention. (a) split ratio (70:30) (b) split ratio (80:20) (c) split ratio (90:10)

4.6 Comparative Study and Discussion

Table 6 summarizes several works evaluated on the multi-class MIAS dataset. With a 90% training split and the multi-head self-attention mechanism, the proposed model achieves state-of-the-art performance on the MIAS dataset and outperforms the models based on ADL-BCD and ResNet50. With an 80% split, the proposed approach outperforms the DenseNet-201 model but is slightly less accurate than the VGG16 and OMLTS-DLCN approaches. Furthermore, the proposed work is, to our knowledge, the only one to combine the multi-head self-attention mechanism with the pre-trained deep neural network DenseNet-121, a combination that has led to a significant improvement in classification rates. The attention model has most often been applied to sequential data; in this work, we adapted it to the image classification task to assign high attention weights to the parts of the images carrying relevant features.

Table 6. Results comparison on MIAS database

5 Conclusion

In this paper, we proposed a deep architecture for breast cancer classification based on mammographic images to help medical doctors in breast cancer detection and diagnosis. The approach classifies breast images into normal, benign, and malignant categories. The strength of our method is the combination of the pre-trained deep convolutional neural network DenseNet121 with a self-attention model. Moreover, data augmentation was applied to increase the number of images and prevent the model from overfitting. During the experiments, several hyper-parameters, such as the optimizer and learning rate, were tuned to boost diagnostic efficiency. The proposed methodology achieved accuracies of 92.64% and 99.39% for split ratios of 80% and 90%, respectively. Finally, it can be concluded that integrating a CNN trained with transfer learning with the attention mechanism yields a clear improvement over other existing approaches. The results presented in this study open new windows for the use of self-attention-based architectures and vision transformer technology for breast cancer classification, toward high-performance CAD schemes with better results.