1 Introduction

In 2021, the National Center for Health Statistics estimated that approximately 1,898,160 new cancer cases would be diagnosed that year, with approximately 608,570 cancer deaths projected to occur in the United States. In Malaysia, about 1 in 19 women will be diagnosed with breast cancer. According to the World Health Organization (WHO), breast cancer became the most commonly diagnosed cancer in 2021, accounting for 12% of new cancer cases worldwide each year (Siegel and Miller 2021). Among the various forms of breast cancer, invasive ductal carcinoma (IDC) is the most common, representing approximately 80% of breast cancer incidence at diagnosis (Makki 2015). Early detection is crucial in the diagnosis of breast cancer because survival is highly influenced by the stage of the malignancy at diagnosis. Early detection enables medical experts to provide appropriate treatment to patients, thereby reducing mortality (Youlden et al. 2012; Wang 2017). An informative diagnosis across the various cancer classifications is essential to help medical professionals select appropriate treatments. Technological advancements in screening tests that identify early-stage cancer cells have therefore been recommended (Wang 2017).

Mammography is the standard screening test for detecting breast cancer, but its effectiveness is limited for patients under 40 years old and those with high-density breast tissue. It is also less sensitive to tumors smaller than 1 mm and may not provide conclusive evidence of breast cancer (Onega et al. 2016). Another screening test, contrast-enhanced (CE) digital mammography, can deliver higher diagnostic accuracy than other screening tests in cases involving high-density breast tissue, but its availability is limited by its high cost and the elevated levels of radiation involved in the procedure (Lewis et al. 2017). A further option is magnetic resonance imaging (MRI) used in conjunction with mammography. MRI is a medical imaging tool that can detect small tumors that are difficult to visualize with mammography. However, MRI has its drawbacks: high cost, low specificity, the need to inject a contrast agent, and the risk of over-diagnosis (Hua et al. 2015). Finally, a biopsy is considered the definitive test for a confirmative and comprehensive diagnosis of breast cancer, with the specimen examined microscopically. Whole slide imaging (WSI) is a commonly used imaging modality in this microscopic setting for investigating breast cancer. WSI provides high-resolution histopathology images that aid in visualizing cellular features and tissue structures (Cruz-Roa et al. 2014).

Currently, many medical practitioners still rely on manual identification of invasive ductal carcinoma (IDC) in the breast. However, this approach is time-consuming and operator dependent, as it involves scanning a large area to identify IDCs. Moreover, manual delineation of a breast cancer mass requires the medical practitioner to have prior knowledge of the presence of an abnormality. Discrepancies in diagnostic opinions among medical experts and radiologists also necessitate a dual-reading procedure (Yap and Yap 2016). Another approach is the semi-automatic detection and classification of breast cancer abnormalities (Sim et al. 2014; Ting et al. 2017). However, it is challenging to apply common image-processing techniques to locate the various types of lesions in mammograms, as malignant lesions can appear at different locations and have different intensity distributions.

Recently, the use of machine learning has shown immense potential in addressing a wide range of tasks and challenges faced by the healthcare industry. Genetic programming, a subset of machine learning, is a method for automatically generating computer programs or mathematical models to solve complex problems without explicit programming by humans. The uniqueness of genetic programming lies in its ability to evolve programs or mathematical models, allowing it to handle a wide range of problems. Recently, D’Angelo et al. (2023) introduced the use of genetic programming to develop a classifier for diabetic foot (DF). The authors proposed an explainable genetic programming classifier (X-GPC), which aims to produce a model that can provide a human-readable explanation of the diabetic foot ulcer (DFU) diagnosis. Aside from genetic programming, the authors also discussed evolutionary algorithms, a class of optimization algorithms inspired by the process of natural selection and evolution in biological systems. Algorithms of this type are used to find optimal or near-optimal solutions to complex problems, such as those arising from biomedical data (D’Angelo and Palmieri 2020).

Deep learning, a subset of machine learning, has emerged as a groundbreaking approach that can mimic the workings of the human brain's neural networks. This technique enables end-to-end learning, where the model learns all the steps between the initial input and the final output. It automatically learns and extracts patterns and representations from complex medical data. One of the most significant applications of deep learning in the healthcare industry is medical imaging analysis. Traditional diagnostic methods often rely on human expertise to interpret the information in images and are subject to human error. Deep learning algorithms, on the other hand, can automatically learn to interpret that information, enabling faster and more efficient diagnoses (Araújo et al. 2017).

Hence, this paper aims to apply deep learning methods to non-IDC and IDC classification. Deep learning models are well suited to medical image processing due to the availability of a large number of sample images for training. The proposed model, residual attention neural network breast cancer classification (RANN-BCC), aims to assist medical practitioners in investigating medical images of breast cancer quickly and effectively. RANN-BCC utilizes a residual neural network (ResNet) as a supportive tool to classify breast cancer lesions, thereby reducing the time required for breast cancer diagnosis.

To evaluate the performance of the RANN-BCC model, a classification experiment was conducted using a dataset of non-IDC and IDC images, and the results were compared with those of other deep learning models. The paper is organized as follows. A review of related work is presented in Sect. 2. The structure of the RANN-BCC model is explained in Sect. 3. The results of RANN-BCC and other deep learning models are presented and discussed in Sect. 4. Finally, the study is summarized in Sect. 5.

2 Related works

2.1 Whole slide images

Whole slide imaging (WSI) is a technology that produces digital images by scanning and digitizing entire glass (histology) slides. A WSI is a digital file comparable to a glass slide viewed under a microscope. WSIs are increasingly being used by pathology departments, scientists, and pathologists for educational, clinical, and research activities (Hanna et al. 2020). A trained and experienced histopathologist can make accurate diagnoses of biopsy specimens based on WSI data. However, given the varying dimensions of WSIs and the increasing number of cancer cases, the analysis of WSIs is time-consuming and becomes difficult when histopathologists are in short supply (Khened et al. 2021). Figure 1 shows the typical workflow of digital pathology research, where several image analysis techniques are used to perform segmentation, detection, and classification.

Fig. 1
figure 1

The workflow of pathology research for segmentation, detection, and classification (Janowczyk and Madabhushi 2016)

In the past, most research methods involved segmenting histological primitives and extracting handcrafted features that describe the arrangement and appearance of these primitives to distinguish malignant from benign areas. Petushi et al. (2006) introduced tissue micro-texture classification to segment nuclei and extract two features: the spatial position and the surface density of the nuclei. Dundar et al. (2011) presented a computerized classification of intraductal breast lesions that can distinguish between actionable subtypes and ductal hyperplasia. Niwas et al. classified breast lesions using log-Gabor complex wavelet bases to evaluate the color texture features of the segmented nucleus. These earlier methods relied on manually handcrafted features to describe the content of patches extracted from WSIs. They not only involved numerous preprocessing steps, but the classification accuracy of each step also depended on the accuracy of the preceding one. In recent years, deep learning has provided state-of-the-art results in various image analyses. Deep learning does not require handcrafted features; instead, it automatically learns the feature content of the patches extracted from a WSI. With the rapid adoption of deep learning in imaging, the wider accessibility of WSIs now invites its application in this domain.

2.2 Deep learning in image classification

Deep learning models have proven useful in the development of medical research and currently receive considerable attention due to their superior classification performance on large training sets. These models have shown an outstanding capability to mimic humans, including in the field of medical imaging (Tan et al. 2017; Ting and Sim 2017).

Among the different types of deep learning models, the convolutional neural network (CNN) is the one most commonly used in image classification. A CNN consists of several layers of neural connections and has greatly advanced the field of computer vision while requiring minimal preprocessing. The architecture of a CNN comprises several parts, such as convolutional layers, pooling layers, and fully connected layers. A convolutional layer learns the feature representation of the image by detecting lines, edges, and other patterns. To compute different feature maps, several kernels are applied to the image to obtain the convolved features. These features are then passed to a pooling layer, which reduces the computational burden by decreasing the feature map resolution. Afterwards, the features are flattened and fed into a fully connected layer to classify them into the various classes. A CNN can learn a hierarchical representation, from low-level to high-level features, and extract the most important features for a specific task (Krizhevsky et al. 2012). Since deep CNN architectures usually involve numerous layers, with potentially millions of weight parameters to be estimated, a large number of samples is required to train the model and set the parameters. This suggests that deep learning models are well suited to medical imaging, since large numbers of medical sample images are available for training. Recently, deep learning-based systems have been proposed for applications such as lung cancer (Hua et al. 2015; Kumar et al. 2015), breast cancer classification (Wang et al. 2016; Ting et al. 2019), cognitive classification (Toa et al. 2021), Alzheimer’s disease (AD) (Ji et al. 2019; Suk et al. 2014), and even pain quantification (Elsayed et al. 2020). Moreover, recent studies report that deeply learned features provide a more effective feature-learning technique for image classification than handcrafted features (Toa et al. 2021; Arevalo et al. 2016). Cruz-Roa et al. performed automatic detection of IDC in WSIs using a CNN and reported that the deep learning method yielded better IDC detection results than an approach using handcrafted features (Cruz-Roa et al. 2014). Janowczyk and Madabhushi analyzed digital pathology images and used deep learning to produce results superior to handcrafted feature-based classification approaches (Janowczyk and Madabhushi 2016).
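To make the convolution–pooling–fully-connected pipeline above concrete, the following minimal PyTorch sketch (an illustration under assumed layer sizes, not the architecture of any model cited here) classifies 50 × 50 patches into two classes:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Minimal CNN sketch: convolution -> pooling -> fully connected.
    Layer sizes are illustrative assumptions, not those of the cited models."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # learn low-level lines/edges
            nn.ReLU(),
            nn.MaxPool2d(2),                               # halve spatial resolution
            nn.Conv2d(16, 32, kernel_size=3, padding=1),   # learn higher-level patterns
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 12 * 12, num_classes)  # 50 -> 25 -> 12 after two poolings

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)    # flatten the feature maps before the fully connected layer
        return self.classifier(x)

logits = SimpleCNN()(torch.randn(1, 3, 50, 50))  # -> shape (1, 2)
```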

3 Materials and methods

3.1 Materials

Invasive ductal carcinoma (IDC) is a common subtype of breast cancer. The digital database applied here is publicly available and was collected in previous studies (Cruz-Roa et al. 2014; Janowczyk and Madabhushi 2016). Figure 2 shows non-invasive ductal carcinoma (non-IDC) and invasive ductal carcinoma (IDC) in whole slide imaging (WSI). The dataset consists of 162 WSI breast cancer specimens scanned at 40×. From these WSIs, 277,524 patches of size 50 × 50 were extracted and converted into Portable Network Graphics (PNG) format, comprising 198,738 non-IDC (class 0) patches and 78,786 IDC (class 1) patches. The filename of each patch encodes the x- and y-coordinates of the cropped patch location and its class (0 or 1).

Fig. 2
figure 2

Examples of non-IDC and IDC in whole slide imaging (WSI)
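Since each filename encodes the patch coordinates and class, labels can be recovered programmatically. The sketch below assumes a filename pattern of the form `..._xX_yY_classC.png` and a hypothetical dataset directory name; both should be verified against the downloaded files:

```python
import re
from pathlib import Path

# Assumed filename pattern, e.g. "10253_idx5_x1351_y1101_class0.png";
# verify against the actual dataset files.
PATCH_RE = re.compile(r"x(\d+)_y(\d+)_class([01])\.png$")

def parse_patch(path: Path):
    """Return (x, y, label) parsed from a patch filename, or None if it does not match."""
    m = PATCH_RE.search(path.name)
    if m is None:
        return None
    x, y, label = (int(g) for g in m.groups())
    return x, y, label

# "IDC_dataset" is a hypothetical root directory for the extracted patches.
patches = [parse_patch(p) for p in Path("IDC_dataset").rglob("*.png")]
labels = [t[2] for t in patches if t is not None]
print(f"{labels.count(0)} non-IDC patches, {labels.count(1)} IDC patches")
```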

3.2 Methods

To identify and classify breast cancer lesions, we designed a neural network architecture named residual attention neural network breast cancer classification (RANN-BCC). It consists of six building blocks, which draw on several deep learning concepts such as residual learning, attention mechanisms, convolution, and deconvolution. Figure 3 shows the overall design of the architecture. The subsections below explain each building block individually.

Fig. 3
figure 3

An overview of the residual attention neural network breast cancer classification (RANN-BCC) architecture

3.2.1 Block 1: feature extractor

This block uses a residual neural network with 34 layers (ResNet34) to map significant features of breast cancer images to feature maps (He et al. 2016). ResNet34 is an architecture designed to mitigate the vanishing gradient problem that arises when constructing deeper networks. Figure 4 shows the ResNet34 architecture, and the parameters used are shown in Table 1.

Fig. 4
figure 4

The ResNet34 architecture

Table 1 Parameters of ResNet34

As shown in Fig. 4, the residual connections between layers are important in many deep learning problems because they allow gradients to flow directly through the network without passing through non-linear activation functions, which alleviates common neural network training issues such as vanishing gradients. In other words, as shown in Fig. 5, a residual connection links the output of a previous layer to the new layer.

Fig. 5
figure 5

Residual learning building block (He et al. 2016)
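The residual building block of Fig. 5 can be sketched in a few lines of PyTorch; the two-convolution layout follows He et al. (2016), while the channel count is an illustrative assumption:

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """y = ReLU(F(x) + x): the identity shortcut lets gradients bypass the convolutions."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # residual connection: add the input to the transformed output
```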

As mentioned above, each image fed into this building block results in the creation of 512 feature maps, each of which carries important features that help the classifier identify the cancerous tumor. Figure 6 demonstrates what the feature maps can look like.

Fig. 6
figure 6

A demonstration of the feature maps created by the feature extractor building block

3.2.2 Block 2: self-attention block

The input to this building block is the set of features extracted from the input image using a residual network (He et al. 2016): the average pooling and classification layers (the last two layers) of a ResNet34 (He et al. 2016) are removed to obtain features of shape \(k\times k\times d\), where \(k\) is the spatial size and \(d\) is the number of dimensions. We then apply an adaptive average pooling layer and denote the resulting features as \(F\).
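A minimal sketch of this feature extraction step, assuming the torchvision implementation of ResNet34 (the text does not name a specific implementation) and an illustrative spatial size \(k = 7\):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

backbone = resnet34(weights=None)  # pretrained weights could also be used; an assumption
# Drop the last two layers: average pooling and the fully connected classifier.
extractor = nn.Sequential(*list(backbone.children())[:-2])
pool = nn.AdaptiveAvgPool2d(7)     # k = 7 is an illustrative choice

x = torch.randn(1, 3, 224, 224)    # input size is an assumption
F = pool(extractor(x))             # shape (1, 512, 7, 7): k x k x d with d = 512
```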

Self-attention was proposed by Vaswani et al. (2017) and builds on the attention mechanism of Bahdanau et al. (2015); in our system it is mainly used to extract relationships from the images. The attention components and their mathematical formulation are presented here. The self-attention mechanism projects its input through three projections into a key (K), query (Q), and value (V), as in Eq. 1. It then performs a dot-product operation to measure the similarity between the query and the key and generates attention weights that signify the importance of each query with respect to all the keys. Finally, it multiplies these attention weights with the projected values and sums the vectors to obtain a representation of each query contextualized with all of its important values.

$$Q={W}_{q}\widehat{Q},\quad K={W}_{K}\widehat{K},\quad V={W}_{v}\widehat{V}.$$
(1)

Self-attention is a function of the similarity between Q and K, normalized with the softmax function to generate probability values that sum to one, as shown in Eq. 2:

$$A=\mathrm{Attention}\left(Q,K,V\right)=\mathrm{softmax}\left(Q{K}^{T}\right)V.$$
(2)

The self-attention mechanism output described in Eq. 2 is then fed to a final linear layer as shown in Eq. 3:

$$O={W}_{o}A+{b}_{o}.$$
(3)

To improve performance, the attention is modeled with multiple heads, and the outputs of the heads are concatenated, as in Eqs. 4 and 5:

$${A}_{i}=\mathrm{Attention}\left(Q,K,V\right)=\mathrm{softmax}\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)V,$$
(4)
$$O=\mathrm{Concatenate}\left({A}_{1},\dots ,{A}_{h}\right){W}_{o}+{b}_{o},$$
(5)

where \(O\) is the output, \(h\) is the number of heads, and \({d}_{k}\) is the dimensionality of each head, computed as \({d}_{\mathrm{model}}/h\).

In the system, as shown in Eq. 6, the input to the self-attention block is the features extracted from the feature extractor block, denoted as \(F\). The Q, K, and V are projected using three separate linear layers, followed by the attention mechanism:

$$S=\mathrm{Attention}\left({W}_{qF}F,{W}_{kF}F,{W}_{vF}F\right).$$
(6)

It is worth noting that, owing to how the self-attention mechanism operates, applying self-attention to the visual features is equivalent to exploring the relationships between visual elements. Figure 7 shows the architecture of the self-attention block.

Fig. 7
figure 7

Architecture of self-attention block
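The following minimal PyTorch sketch realizes Eqs. 1–6; the model dimension, number of heads, and token count are illustrative assumptions rather than the exact RANN-BCC configuration:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Scaled dot-product attention with h heads (Eqs. 1, 4, and 5)."""
    def __init__(self, d_model: int = 512, h: int = 8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h       # d_k = d_model / number of heads
        self.w_q = nn.Linear(d_model, d_model)   # W_q
        self.w_k = nn.Linear(d_model, d_model)   # W_K
        self.w_v = nn.Linear(d_model, d_model)   # W_v
        self.w_o = nn.Linear(d_model, d_model)   # W_o, final linear layer (Eq. 5)

    def forward(self, q, k, v):
        B, L, D = q.shape
        def split(t):  # (B, L, D) -> (B, h, L, d_k)
            return t.view(B, -1, self.h, self.d_k).transpose(1, 2)
        Q, K, V = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)  # query-key similarity (Eq. 4)
        A = torch.softmax(scores, dim=-1) @ V                   # attention-weighted values
        A = A.transpose(1, 2).reshape(B, L, D)                  # concatenate the heads
        return self.w_o(A)

# Self-attention is the special case q = k = v = F (Eq. 6);
# here F is 7x7 spatial positions flattened to 49 tokens (an assumption).
F = torch.randn(1, 49, 512)
S = MultiHeadAttention()(F, F, F)
```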

3.2.3 Block 3: cross-attention block

The only difference between this block and the self-attention block is that Q is a projection of the model's input, while K and V are projections of different features; here, the input queries the other features rather than querying itself. In our system, the query is the output (\(O\)) of the self-attention layer, and the keys and values are the features extracted from the feature extractor building block. The output of the first cross-attention layer is then fed as K and V to a second cross-attention layer, where Q is projected from the features extracted by the feature extractor (the first CNN). The purpose of this block is to cross-reference and confirm the weights that reflect the importance of the features resulting from applying self-attention to the output of the feature extractor building block, as sketched below.
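Reusing the `MultiHeadAttention` sketch from the previous subsection, the cross-attention wiring described here only changes where the query, keys, and values come from; the layer names below are hypothetical:

```python
import torch

# F: the extractor features, flattened to tokens as in the previous sketch.
F = torch.randn(1, 49, 512)

self_attn = MultiHeadAttention()     # block 2 (sketched in Sect. 3.2.2)
cross_attn1 = MultiHeadAttention()   # same module, different inputs
cross_attn2 = MultiHeadAttention()

O = self_attn(F, F, F)       # self-attention: the features query themselves
C1 = cross_attn1(O, F, F)    # first cross-attention: O queries the features F
C2 = cross_attn2(F, C1, C1)  # second cross-attention: the features query the first output
```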

Note that blocks 1 and 2 include a residual connection (He et al. 2016) and a layer normalization layer (Ba et al. 2016) at the output. At the end of each block, a position-wise feed-forward network composed of two linear layers with a ReLU activation in between adds non-linearity to the network, again followed by residual and layer normalization layers. The input to the second self-attention layer is the output of the first cross-attention layer, the input to the second cross-attention layer is the output of the second self-attention layer, and so on.

3.2.4 Block 4: collector

This block and the next were partly inspired by squeeze-and-excitation networks (SENet) (Hu et al. 2020), which were originally designed for image recognition. The collector building block was mainly added to our system to filter the feature maps before the classification stage. Adding this block to our classification system provides an effective, learnable replacement for image-processing filtering techniques. It is important to note that SENets replace the equal weighting of feature maps with a content-aware mechanism that adaptively weights each channel, in contrast to a standard CNN, which weights all feature maps equally. Figure 8 shows the inner architecture of the collector and compressor building blocks. The only reason the collector is presented separately from the next block is to emphasize their two distinct objectives, namely filtering and dimension reduction.

Fig. 8
figure 8

Combined architecture of the collector and the compressor building block
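A minimal sketch of the SE-style channel weighting that inspired the collector (Hu et al. 2020); the channel count and reduction ratio are assumed hyperparameters:

```python
import torch
import torch.nn as nn

class ChannelWeighting(nn.Module):
    """SE-style content-aware channel weighting (Hu et al. 2020): squeeze each
    feature map to a scalar, then excite with learned per-channel weights."""
    def __init__(self, channels: int = 512, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)          # one scalar per feature map
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                               # weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(B, C))     # per-channel importance
        return x * w.view(B, C, 1, 1)                   # reweight each feature map
```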

3.2.5 Block 5: compressor

This building block is mainly added to our classification system to reduce dimensionality while maintaining the important features extracted by the previous building blocks. This dimension reduction step is intended to enhance the efficiency and accuracy of the classifier building block. Figure 8 shows the architecture of blocks 4 and 5 combined. They can be considered one block; we divide them into two here only to highlight their respective objectives of filtering the feature maps and reducing their dimensions before the classifier.

3.2.6 Block 6: classifier

As mentioned above, our system consists of six building blocks, of which blocks 4 and 5 can be combined. The output of the compressor building block (block 5) is fed to the classifier building block, where it is run through a classification layer with two output classes: (0) non-IDC and (1) IDC. We use the cross-entropy loss to optimize our network, as given in Eq. 7:

$$\mathrm{CE}=-\frac{1}{n} \sum_{j=1}^{n} \sum_{i=1}^{c}{y}_{i}\log {\widehat{y}}_{i},$$
(7)

where \({y}_{i}\) is the class label (either 0 or 1), \({\widehat{y}}_{i}\) is the predicted probability of the class, \(c\) is the number of classes (2 in our case), and \(n\) is the number of samples in the batch. The complete network is optimized with the Adam optimizer (Kingma and Ba 2015) with a batch size of 15. We set an initial learning rate of 2e−4, which is then reduced by a factor of 0.8 every 3 epochs. The model is trained for 25 epochs with early stopping, a standard approach that monitors model performance during training and halts training once performance begins to degrade. Within the classifier, the first layer is an adaptive average pooling layer, followed by a convolutional layer, and finally a sigmoid is applied to facilitate the classification process. Figure 9 shows the architecture of the classifier building block.

Fig. 9
figure 9

The architecture of the classification building block
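The stated optimization setup translates directly into a short training loop. In the sketch below, `model`, `train_loader`, `val_loader`, the `evaluate` helper, and the early-stopping patience are hypothetical placeholders; only the loss, optimizer, learning-rate schedule, batch size, and epoch count come from the text:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()                       # cross-entropy loss (Eq. 7)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.8)

best_val, patience, bad_epochs = float("inf"), 5, 0     # patience value is an assumption
for epoch in range(25):
    model.train()
    for patches, labels in train_loader:                # batch size 15, per the text
        optimizer.zero_grad()
        loss = criterion(model(patches), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                                    # lr *= 0.8 every 3 epochs

    val_loss = evaluate(model, val_loader)              # hypothetical validation helper
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                      # early stopping on degradation
            break
```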

4 Results and discussion

This section presents the experimental results of our proposed residual attention neural network breast cancer classification (RANN-BCC) model and compares them with existing methods for classifying non-invasive ductal carcinoma (non-IDC) and invasive ductal carcinoma (IDC). The first method is the convolutional neural network (CNN) of Cruz-Roa et al., proposed for the automatic detection of IDC (Cruz-Roa et al. 2014). The model adopts a 3-layer CNN architecture with 16 feature maps in the first layer, 32 feature maps in the second layer, and 7200 flattened features in the fully connected layer. A kernel size of 8 × 8 was used in the convolutional layers and 2 × 2 in the pooling layers. The second method is the AlexNet network used by Janowczyk and Madabhushi for digital pathology image classification (Janowczyk and Madabhushi 2016). This AlexNet model consists of 3 convolutional layers and 1 fully connected layer: the 1st and 2nd convolutional layers have 32 feature maps each, the 3rd convolutional layer has 64 feature maps, and the fully connected layer has 1024 flattened features. A kernel size of 5 × 5 was used in the convolutional layers and 3 × 3 in the pooling layers. Moreover, to strengthen the comparison, other baseline models, namely a feed-forward neural network and ResNet34, are compared with our model. A feed-forward neural network is a type of artificial neural network in which information flows in a single direction, from the input nodes through the hidden nodes to the output nodes. This network consists of 4 layers, with 2500 input dimensions, 100 hidden dimensions, and 2 output dimensions. The residual neural network 34 (ResNet34) is an architecture that is 34 layers deep; it introduced the use of residual connections to solve the vanishing gradient problem when constructing deeper networks. The ResNet34 model consists of 6 layers, with 64 feature maps in the 1st and 2nd layers, 128 feature maps in the 3rd layer, 256 feature maps in the 4th layer, 512 feature maps in the 5th layer, and 25,088 flattened features in the fully connected layer.

All the deep learning models are compared using four classification metrics, namely accuracy, recall, precision, and F-score, as shown in Eqs. 8–11.

$$\mathrm{Accuracy}= \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FN}+\mathrm{TN}+\mathrm{FP}}\times 100\mathrm{\%},$$
(8)
$$\mathrm{Recall}= \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}},$$
(9)
$$\mathrm{Precision}= \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}},$$
(10)
$$\mathrm{F}\text{-}\mathrm{score}= \frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}},$$
(11)

where TP (true positive) is the case in which the model correctly predicts the IDC class, TN (true negative) the case in which it correctly predicts the non-IDC class, FN (false negative) the case in which it predicts non-IDC for an actual IDC sample, and FP (false positive) the case in which it predicts IDC for an actual non-IDC sample. Tables 2, 3, 4 and 5 show the classification metrics of the deep learning models.
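For reference, Eqs. 8–11 follow directly from the four confusion-matrix counts, as in this small sketch (the counts shown are illustrative, not the paper's results):

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    """Compute the four metrics of Eqs. 8-11 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp) * 100          # Eq. 8, in percent
    recall = tp / (tp + fn)                                   # Eq. 9
    precision = tp / (tp + fp)                                # Eq. 10
    f_score = 2 * precision * recall / (precision + recall)   # Eq. 11
    return accuracy, recall, precision, f_score

# Illustrative counts only.
print(classification_metrics(tp=70, tn=120, fp=7, fn=8))
```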

Table 2 Result of our model compared to other models in terms of accuracy
Table 3 Result of our model compared to other models in terms of recall
Table 4 Result of our model compared to other models in terms of precision
Table 5 Result of our model compared to other models in terms of F-score

For classification accuracy, as shown in Table 2, the RANN-BCC model obtains the highest accuracy of 92.45%, followed by AlexNet (90.28%), CNN (89.56%), ResNet34 (79.49%), and the feed-forward neural network (71.18%). The accuracy of ResNet34 alone is lower than that of CNN and AlexNet. By combining ResNet34 with the other mechanisms introduced above, namely self-attention, cross-attention, the collector, and the compressor, the RANN-BCC model achieves 92.45% accuracy. In other words, introducing these mechanisms improves the accuracy from 79.49 to 92.45%, an increase of 12.96 percentage points.

For the recall metric, as shown in Table 3, all models achieve high recall within a margin of 0.05 of one another, indicating that all models rarely misclassify actual IDC samples. For the precision metric, RANN-BCC achieves the highest value of 0.91, followed by CNN and AlexNet with 0.87, ResNet34 with 0.76, and the feed-forward neural network with 0.71. This shows that RANN-BCC misclassifies actual non-IDC samples less often than the other models, whereas the feed-forward neural network, with the lowest precision, misclassifies them most often. Although ResNet34 has the highest recall of 1, its lower precision of 0.76 indicates few incorrect predictions on actual IDC samples but many on actual non-IDC samples; the model is therefore biased toward the IDC class.

The F-score, shown in Table 5, is the harmonic mean of precision and recall. Since RANN-BCC has high precision and recall, it unsurprisingly attains the highest value of 0.94, followed by CNN and AlexNet with 0.92 each, ResNet34 with 0.86, and the feed-forward neural network with 0.81. The highest F-score indicates that RANN-BCC produces both few false positives and few false negatives. Based on these classification metrics, RANN-BCC shows the best overall performance, achieving the highest accuracy, precision, and F-score while maintaining high recall when classifying the IDC and non-IDC classes of breast cancer.

To show that RANN-BCC generalizes well, we plot the loss curves and the receiver operating characteristic (ROC). The loss function evaluates how well the model performs on the dataset. Figure 10 shows the validation loss and training loss of the RANN-BCC model. The training curve (blue) and validation curve (orange) stay close to each other while decaying roughly exponentially, which indicates that the model generalizes well and is not overfitting the breast cancer dataset.

Fig. 10
figure 10

Loss graph of training and validation process

Next is the ROC, a useful way to measure how well the model can distinguish between the IDC and non-IDC classes. The area under the curve (AUC) measures the area underneath the ROC curve, with a score from 0 to 1; the higher the AUC score, the better the model is at separating the IDC and non-IDC classes. Figure 11 shows the ROC curves of the RANN-BCC model. Two averaged curves are shown: the micro-average, which pools the TP, FP, and FN counts of the model across classes, and the macro-average, which averages the per-class precision and recall. The AUC scores of the micro-average and macro-average are 0.98 and 0.99, respectively, both approaching 1. This indicates that the RANN-BCC model generalizes well in distinguishing the IDC class from the non-IDC class.

Fig. 11
figure 11

Receiver operating characteristic (ROC) curve
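Micro- and macro-averaged AUC scores of this kind can be computed as sketched below, assuming scikit-learn and one-hot encoded labels; the arrays are tiny illustrative placeholders, not the model's outputs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Tiny illustrative arrays; in practice these are the model's validation outputs.
y_true = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])              # one-hot: non-IDC / IDC
y_score = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])

micro_auc = roc_auc_score(y_true, y_score, average="micro")      # pooled over classes
macro_auc = roc_auc_score(y_true, y_score, average="macro")      # averaged per class

# Micro-average ROC curve: pool all (label, score) pairs across both classes.
fpr, tpr, _ = roc_curve(y_true.ravel(), y_score.ravel())
print(f"micro AUC = {micro_auc:.2f}, macro AUC = {macro_auc:.2f}")
```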

5 Conclusion

In this paper, we introduced the residual attention neural network breast cancer classification (RANN-BCC) model to classify a breast cancer dataset into invasive ductal carcinoma (IDC) and non-invasive ductal carcinoma (non-IDC). We demonstrated that our model outperforms other deep learning models and showed the contribution of each block of the RANN-BCC architecture. We found that accuracy improved from 79.49 to 92.45% by integrating the residual neural network 34 (ResNet34) with self-attention, cross-attention, the collector, and the compressor. We believe this integrated deep learning approach will not only help medical practitioners classify IDC and non-IDC breast cancer by learning the feature content of medical images, but will also contribute to the field of computer-aided diagnostics by inspiring further effective deep learning approaches.