1 Introduction

Breast cancer is, next to lung cancer, the most prevalent type of cancer in women, and early detection has significantly increased the survival rate, as clinical reports have shown (Zhang et al. 2018; Kim et al. 2016; Yousefi et al. 2018; Shin et al. 2017). A number of breast cancer imaging modalities, from the old (screen-film mammography) to the recent (digital breast tomosynthesis), have been used by radiologists to screen for breast cancer. These imaging modalities have shown remarkable success in detecting breast cancer abnormalities, which include masses, microcalcifications, architectural distortions, and bilateral asymmetry. However, they suffer from issues such as breast tissue overlap, which hides breast information and keeps suspicious lesions out of sight (Yousefi et al. 2018).

Breast cancer abnormalities can be categorized into in-situ carcinoma and invasive ductal carcinoma (IDC). In-situ carcinoma represents approximately 20–30\(\%\) of all new breast cancer diagnoses (Brennan et al. 2011; Zhu et al. 2018), whereas IDC is the most common type of breast cancer, accounting for almost 80\(\%\). The distinction matters for treatment: in-situ carcinoma has started to be treated using active surveillance without surgical treatment (Grimm et al. 2017; Zhu et al. 2018), which is not the case for IDC. Therefore, early differentiation of breast cancer as in-situ or invasive is very important for patients in order to define the treatment strategy (Grimm et al. 2017; Zhu et al. 2018). To aid the reader, the medical terms used in this survey are defined in Table 1.

Table 1 Medical terms and their definition

In Sect. 2 of this survey paper, we discuss the methodology adopted to search papers in the selected search databases. In Sects. 3 and 4, we review the most common breast cancer imaging modalities and the most cited breast cancer databases, respectively. In Sect. 5, we review the application areas of deep learning (DL) in medical image analysis in general and in breast cancer image analysis in particular. Finally, Sect. 6 concludes the survey and highlights research gaps for further improvement.

2 Methods

We reviewed articles from 2004 to 2018 to (1) evaluate the use of imaging modalities, (2) compare breast cancer imaging modalities, (3) point out the most cited and publicly available breast cancer databases with different formats and modalities, (4) evaluate the use of DL in medical image analysis, specifically breast cancer image analysis, and (5) evaluate the application of DL to histopathology-based breast cancer image analysis. Our general search criteria consisted of keywords such as ‘breast imaging technology’, ‘deep learning and medical image analysis’, ‘application of deep learning in medical image analysis’, and ‘application of deep learning to breast cancer’. However, we used different search criteria for some of the search databases. The searches were carried out on eight databases: (1) Web of Science, (2) PubMed, (3) Science Direct, (4) IEEE Xplore Digital Library, (5) Google Scholar, (6) arXiv, (7) MICCAI, and (8) SPIE. PubMed was searched for papers containing “convolutional neural network” OR “deep learning” OR “medical imaging” OR “histology”. arXiv was searched using terminologies related to medical imaging with the search string ’abs:((medical OR mri OR “magnetic resonance” OR (medical OR “histology” OR “ultrasound” OR sfm OR “screen-film mammography” OR “digital mammography” OR “breast cancer”)) AND (“deep learning” OR “deep learning application” OR convolutional OR cnn OR “neural network”))’. IEEE Xplore Digital Library was searched for papers containing “convolutional neural network” OR “deep learning” OR “medical imaging”. Conference proceedings for MICCAI and SPIE were searched with terminologies that include: DL in breast cancer and MRI, DL in breast cancer and US, DL in breast cancer and DBT, DL in breast cancer and DM OR GM, DL in breast cancer and histology, DL and medical image analysis, and application of deep learning in medical image analysis.

3 Breast cancer imaging modalities

In breast cancer image analysis, abnormality detection starts with imaging modalities for screening (Zhang et al. 2018). When an abnormality is found early, it is easier to treat the patient; once clinical evidence appears, the cancer may have started to spread and can by then be difficult to treat (Ethiopian Cancer Association 2016). Among the several imaging technologies used for breast cancer screening, selected ones are discussed here. The performance of breast cancer imaging modalities is mostly evaluated by sensitivity, specificity, recall rates, positive predictive value (PPV), AUC, F-score, and accuracy.
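Most of these metrics derive from the four confusion-matrix counts (true/false positives and negatives). As a quick reference, here is a minimal sketch in Python with hypothetical screening counts:

```python
def screening_metrics(tp, fp, tn, fn):
    """Common screening metrics computed from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                # recall / true positive rate
    specificity = tn / (tn + fp)                # true negative rate
    ppv = tp / (tp + fp)                        # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_score = 2 * ppv * sensitivity / (ppv + sensitivity)
    return dict(sensitivity=sensitivity, specificity=specificity,
                ppv=ppv, accuracy=accuracy, f_score=f_score)

# Hypothetical example: 90 detected cancers, 10 missed cancers,
# 880 correct negatives, 120 false-positive recalls.
print(screening_metrics(tp=90, fp=120, tn=880, fn=10))
```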

3.1 Screen-film mammography (SFM)

Screen-film mammography has been the standard imaging modality for detecting suspicious lesions at an early stage and is still in use in some countries, including Ethiopia. Over the past five decades, SFM became a useful medium in breast screening. SFM has a high sensitivity (100\(\%\)) in detecting suspicious lesions in breasts composed primarily of fatty tissue (Duijm et al. 1997). However, sensitivity decreases significantly for breasts with dense glandular tissue; consequently, 10–20\(\%\) of breast cancers are not visualized (Burrell et al. 1996). Besides, the decrease in lesion conspicuousness may be due to the film itself, since it serves as the medium of image acquisition, display, and storage. Once the film is produced, no further improvement is possible, and part of the image may be displayed with lower contrast. Because such images cannot be enhanced, patients may need to undergo another mammographic examination and consequently be exposed to an additional radiation dose. Another drawback of film is that different regions of the breast are rendered according to the characteristic response curve of the mammographic film, which imposes a trade-off between dynamic range (latitude) and contrast resolution (gradient) (Helvie 2010). Finally, SFM is an analog technology: images cannot be post-processed, transmitted, or archived digitally.

3.2 Digital mammography (DM)

Digital mammography is an effective imaging modality for early-stage breast cancer screening (Gilbert et al. 2015; Liu et al. 2018) and has been the most effective and standard breast imaging modality for detecting and diagnosing abnormalities of the female breast (Jalalian et al. 2013). However, it has some limitations, including low specificity; as a consequence, there may be a higher number of unnecessary biopsies, which increases costs and stress on patients (Gilbert et al. 2015; Jalalian et al. 2013). Besides low specificity and high cost, digital mammography exposes patients to ionizing radiation, which endangers their health (Jalalian et al. 2013). Where breast tissue overlaps, there is a high possibility of missing cancers in the retro-mammary space as a result of insufficient positioning of deep tissue (Gilbert et al. 2015; Kevin et al. 2010). Digital mammography nevertheless offers several advantages over SFM (Patterson and Roubidoux 2014). In addition, computer-assisted detection (CAD) systems have shown favorable results in mammography and are used in clinical routine to improve the radiologist’s sensitivity (Becker et al. 2018). Still, digital mammography has three notable limitations: high false-positive results, which imply higher recall rates; high false-negative results; and high radiation exposure (Liu et al. 2018).

3.3 Ultrasound (US)

Ultrasound is an imaging modality that has been used for breast lesion detection and differentiation, even though it is operator dependent: detection and differentiation are only possible with the help of an operator who can properly locate the lesion using the ultrasound scanner (Byra et al. 2018). In contrast to mammography, however, ultrasound does not require ionizing radiation (Becker et al. 2018). According to the reviews in Sudarshan et al. (2016) and Jalalian et al. (2013), ultrasound is used for the detection and diagnosis of breast cancer abnormalities as the second choice after DM. Jalalian et al. (2013) indicate that ultrasound achieves high accuracy in detecting and discriminating benign and malignant masses, a finding supported by Shin et al. (2017); this has enabled US to reduce unnecessary biopsies. According to Byra et al. (2018) and Shin et al. (2017), US is safe, accurate, low cost, and widely accessible compared to magnetic resonance imaging, DM, and digital breast tomosynthesis. Nevertheless, interpreting each specific lesion type requires deep knowledge of its image features, which makes ultrasound image interpretation far from straightforward. Ultrasound has shown high sensitivity for identifying abnormalities in dense breasts and in women younger than 35 years of age (Sudarshan et al. 2016; Becker et al. 2018). It is well recommended as a supplement to DM because of its availability, its low cost compared to other modalities, and its being well tolerated by patients (Kevin et al. 2010; Leach et al. 2005; Becker et al. 2018).

3.4 Magnetic resonance imaging (MRI)

MRI is based on radio-frequency absorption by nuclei in the presence of strong magnetic fields. It is used for high-risk patients and for clinical diagnosis and monitoring of breast cancer (Amit et al. 2017; Antropova and Giger 2018; Morrow et al. 2011; Kuhl et al. 2014; Saslow et al. 2007; Lin and Brown 2007). In previous studies MRI was used for breast segmentation (Gubern-Mèrida et al. 2015; Wu et al. 2013), breast abnormality detection (Chang et al. 2014; Renz et al. 2012), and breast abnormality classification (Gallego-Ortiz and Martel 2015; Agliozzo et al. 2012; Agner et al. 2011; Pang et al. 2015) using computer-aided detection/diagnosis (CAD) systems. The technologically enhanced form of MRI, dynamic contrast-enhanced MRI (DCE-MRI), provides higher volumetric resolution for better lesion visualization and enhances the temporal patterns of lesions, from which valuable information can be extracted for better cancer management (Antropova and Giger 2018; Turkbey 2009). Studies have shown that DCE-MRI is a useful tool for breast cancer diagnosis (Mahrooghy et al. 2015; Zhang et al. 2018), prognosis (Mazurowski et al. 2015a; Zhang et al. 2018), and correlation with genomics (Mazurowski 2015b; Zhang et al. 2018). Compared with other imaging modalities such as mammography and ultrasound, MRI has shown high sensitivity for breast cancer diagnosis (Antropova and Giger 2018; Zhang et al. 2018; Lin and Brown 2007). Contrast-enhanced MRI (CE-MRI) has shown high sensitivity for cancer detection even in dense breasts (Leach et al. 2005). Although recommended for women at high risk of breast cancer, MRI might not be the optimal imaging modality because of its higher cost and lower specificity (Griebsh et al. 2006; Kuhl et al. 2007).

3.5 Digital breast tomosynthesis (DBT)

Digital breast tomosynthesis is an imaging modality that produces a 3D image of the breast using low-dose X-rays received at different angles (Regina et al. 2017; Helvie 2010). It is a newer modality in which the breast is placed and compressed in the same way as for a mammogram, but the X-ray tube moves in a circular arc around the breast (Gur et al. 2009; Gennaro et al. 2010; Wallis et al. 2012; Andersson et al. 2008; Zhang et al. 2018; Poplack et al. 2007). It takes less time for imaging (Fotin et al. 2016) and provides better detail of dense breast tissue than conventional mammography (Zhang et al. 2018; Poplack et al. 2007). The 3D breast images are reconstructed by computer from the information received from the X-rays, and the X-ray dose for a tomosynthesis image is similar to that of a regular mammogram (American College of Radiology Imaging Network 2017). After digital mammography, DBT has emerged as a favorable breast cancer imaging modality for enhancing the sensitivity and accuracy of screening (Gur et al. 2009; Gennaro et al. 2010; Wallis et al. 2012; Andersson et al. 2008; Poplack et al. 2007). DBT offers many benefits as a new modality. However, DBT cannot detect malignant microcalcifications if those calcifications are not on the DBT slice plane (Regina et al. 2017), and it increases recall rates for the architectural distortion type of breast abnormality (Lourenco et al. 2015). It has also substantially increased the reading time compared to digital mammography (DM) (Samala et al. 2016b) (Table 2).

Table 2 Advantages and disadvantages of breast cancer imaging modalities

3.6 Combination of breast cancer imaging modalities

Radiologists and researchers have started to combine imaging modalities during screening to enhance the rate of early detection. A few representative studies are presented below.

Gilbert et al. (2015) evaluated the performance of three breast imaging modalities (DM, DBT and synthetic DM) and their combinations (DM + DBT and synthetic DM + DBT). The comparison used a dataset of 7060 cases collected randomly from 8869 women aged between 29 and 85. Independent, blinded radiologists reviewed the images in DM + DBT, DM, and synthetic DM + DBT without access to previous examination results, and the review was assessed in terms of specificity and sensitivity. The sensitivity for DM, DM + DBT, and synthetic DM + DBT was 87\(\%\), 89\(\%\) and 88\(\%\), respectively. The blinded review showed that for ages 50 to 59, sensitivity was significantly higher (p = 0.01) for DM + DBT than for DM. Patients with dense breasts were included in the study, and for those with a breast density of 50\(\%\) or higher, the sensitivity was 93\(\%\) for DM + DBT and 86\(\%\) for DM (p = 0.03). The specificity for DM, DM + DBT, and synthetic DM + DBT was 57\(\%\), 70\(\%\) and 72\(\%\), respectively. The study thus showed that adding DBT to DM increased sensitivity for patients with dense breasts and increased specificity for all age groups. More importantly, DBT showed potential benefits especially for dense breasts in younger women.

Mariscotti et al. (2014) compared the efficiency of four imaging modalities (DM, DBT, US, MRI) using 200 patients aged 26 to 79 who underwent screening. Their goal was to compare DM against DBT alone and MRI against the combination DM + DBT + US. The parameters used for evaluation were sensitivity, specificity, and overall accuracy. DBT scored a higher sensitivity than DM alone: 90.7\(\%\) for DBT versus 85.2\(\%\) for DM. The three combined modalities (DM + DBT + US) achieved a sensitivity of 97.7\(\%\), while MRI alone achieved 98.8\(\%\); however, combining MRI with the other three modalities did not improve overall sensitivity. The overall accuracy of MRI and of DM + DBT + US was 93.3\(\%\) and 93.7\(\%\), respectively. Breast density affects the sensitivity of some imaging modalities, for example DM and DBT, but not MRI.

Kuhl et al. (2005) screened 529 participants and found 43 cancers, of which 34 were invasive and 9 were ductal carcinoma in-situ. Three imaging modalities (DM, US, and MRI) were compared, and the sensitivity of MRI (91\(\%\)) was significantly higher than that of DM (33\(\%\)), US (40\(\%\)) and DM + US (49\(\%\)). However, DM and MRI scored almost the same specificity: 97.2\(\%\) for MRI and 96.8\(\%\) for mammography.

Leach et al. (2005) performed a comparative analysis between DM and contrast-enhanced magnetic resonance imaging (CE-MRI) in terms of sensitivity and specificity. The study involved 649 women aged between 35 and 49 with a family history of breast cancer (BRCA1 and BRCA2). Sensitivity and specificity were computed after annual screening for 2–7 years: CE-MRI, DM, and CE-MRI + DM scored sensitivities of 77\(\%\), 40\(\%\) and 94\(\%\), respectively, and specificities of 81\(\%\), 93\(\%\) and 77\(\%\), respectively.

Warner et al. (2004) compared three breast cancer imaging modalities (DM, US, and MRI) and clinical breast examination (CBE) in terms of sensitivity and specificity. The patients considered were carriers of BRCA1 or BRCA2 mutations, and the study recommended CBE every 6 months from age 25 onward for those carriers. The study confirmed that MRI has higher sensitivity for detecting breast cancers than DM, US, or CBE. The sensitivity and specificity were 77\(\%\) and 95.4\(\%\) for MRI, 36\(\%\) and 99.8\(\%\) for DM, 33\(\%\) and 96\(\%\) for US, and 9.1\(\%\) and 99.3\(\%\) for CBE, respectively (Warner et al. 2004). Additionally, screening using MRI + DM + US + CBE was compared with DM + CBE and achieved sensitivities of 95\(\%\) and 45\(\%\), respectively.

Patient screening using MRI + DM scored higher sensitivity than DM alone in all age ranges (Phi et al. 2017). For example, DM + MRI achieved a sensitivity of 95\(\%\), versus 51\(\%\) for DM alone and 50\(\%\) for MRI alone. For women aged between 40 and 49, the sensitivity of MRI + DM was 98\(\%\), and that of DM and MRI alone was 57\(\%\) and 47\(\%\), respectively. The sensitivity of DM improved somewhat with increasing age but remained low in women under the age of 40.

Phi et al. (2016) evaluated the performance of two breast imaging modalities (DM and MRI) and their combination (DM + MRI) for two mutation status indicators, BRCA1 and BRCA2. The study divided the patients into four age groups (all ages, \(\le\) 40, ages between 41 and 50 years, and above 50) to perform age-based performance analysis using sensitivity and specificity. For all age groups and BRCA1 mutation status, the sensitivity and specificity were 35.7\(\%\) and 93.8\(\%\) for DM, 88.6\(\%\) and 84.4\(\%\) for MRI, and 92.5\(\%\) and 80.4\(\%\) for DM + MRI, respectively. For all age groups and BRCA2 mutation status, the sensitivity and specificity were 44.6\(\%\) and 93.4\(\%\) for DM, 80.1\(\%\) and 85.3\(\%\) for MRI, and 92.7\(\%\) and 80.5\(\%\) for DM + MRI, respectively. The sensitivity and specificity for the other age groups and for BRCA1 and BRCA2 are also presented in Phi et al. (2016) (Table 3).

Table 3 Summary of performance of imaging modalities in terms of sensitivity and specificity

4 Breast cancer image databases

Over the last few decades, many databases/datasets have been produced and published in different repositories, some of them publicly available. Most datasets exist in two formats: CSV and image (jpg, pgm, png, DICOM and jpeg). Breast cancer image analysis has mainly relied on these databases. For example, the Mammographic Image Analysis Society (MIAS) database is among the most popular and most widely used by researchers. It contains 322 image samples, of which 208 are normal and 114 are abnormal (63 benign cases and 51 malignant cases). Another popular database is the Digital Database for Screening Mammography (DDSM), with 2500 images. A summary of the most cited and recently updated breast cancer databases is presented in Table 4.

Table 4 Summary of most cited and recently updated (2016–2019) breast cancer datasets (N—normal, AB—abnormal, R—repository, A—author, M—malignant, B—benign, CM—causal mutation, LNV—likely neutral variant, NV—neutral variant, UV—unknown/unclassified variant, DC—ductal carcinoma, LC—lobular carcinoma, MC—mucinous carcinoma, and PC—papillary carcinoma)

5 Deep learning and breast cancer image analysis

In this section we present breast cancer image analysis from two perspectives: in Sect. 5.1, we present breast cancer image analysis by deep convolutional neural networks with datasets developed using various breast cancer imaging modalities; in Sect. 5.2, we review histopathology-based breast cancer analysis using deep convolutional neural networks.

5.1 Imaging modalities and deep learning based breast cancer image analysis

Over the last few decades, we have witnessed the importance of medical imaging, e.g., screen-film mammography, computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET), digital mammography, ultrasound, and so on, for the early detection, diagnosis, and treatment of diseases (Antropova and Giger 2018). In the clinic, medical image interpretation has mostly been performed by human experts such as radiologists and physicians. However, due to large variations in pathology and the potential fatigue of human experts, researchers and doctors have recently begun to benefit from computer-assisted interventions. Compared to the advances in medical imaging technologies, computational medical image analysis has lagged behind, though it has recently been improving with the help of machine learning techniques. The most common application areas of DL in medical health care include: breast cancer image analysis (Rodriguez-Ruiz et al. 2018; Kooi et al. 2017a; Wang et al. 2017; Debelee et al. 2018), brain image analysis (Shen et al. 2017; Hosseini-Asl et al. 2016; Burgh et al. 2017; Ghafoorian et al. 2017), retinal image analysis (Wu et al. 2016; Zilly et al. 2017), chest X-ray image analysis (Rajkomar et al. 2017; Kim and Hwang 2016; Anavi et al. 2015, 2016; Bar et al. 2015, 2016; Hwang et al. 2016; Shin et al. 2016a; Wang et al. 2016a), abdominal image analysis (Shah et al. 2016; Zhu et al. 2017) and musculoskeletal image analysis (Forsberg et al. 2017; Spampinato et al. 2017).

Deep learning algorithms with many layers, such as deep convolutional neural networks (DCNN), have recently shown success in different medical image analysis tasks such as segmentation, detection, and classification (Kooi et al. 2016), for example for the urinary bladder (Cha et al. 2016), thoracic-abdominal lymph nodes, interstitial lung disease (Gao et al. 2016), and pulmonary perifissural nodules (Ciompi et al. 2015; Shin et al. 2016b). Angelov and Sperduti (2016) made an impressive and concise review of the challenges in DL. They first described how the multiple layers of a DL approach enable efficient learning of hidden representations in datasets and an exponential gain in the representational power of each feature (Angelov and Gu 2017). Besides its computational cost, they noted that fine-tuning the hyper-parameters of the models and selecting structural features is not yet a solved problem for DL techniques. However, the availability of pre-trained models has enabled researchers either to extract features at different points of a DL network (Sargano et al. 2017) or to use the models for incremental training, adapting them to domains other than the one they were trained on (Angelov and Gu 2018). A summary of DL application types is given in Table 6. The acronyms of the databases used in the papers considered in this survey are given in Table 5.
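To make the feature-extraction route concrete, the following minimal PyTorch sketch (our illustration, not code from any of the cited papers) turns an ImageNet pre-trained VGG16 into a fixed feature extractor:

```python
import torch
import torchvision.models as models

# Load an ImageNet pre-trained VGG16 and switch to inference mode.
model = models.vgg16(pretrained=True)
model.eval()

# Keep the convolutional backbone plus the first fully connected layer;
# everything before the classifier output then serves as a feature encoder.
feature_extractor = torch.nn.Sequential(
    model.features,                          # convolutional layers
    torch.nn.AdaptiveAvgPool2d((7, 7)),
    torch.nn.Flatten(),
    *list(model.classifier.children())[:2],  # FC1 + ReLU -> 4096-d features
)

with torch.no_grad():
    batch = torch.randn(4, 3, 224, 224)      # stand-in for preprocessed ROIs
    features = feature_extractor(batch)
print(features.shape)                        # torch.Size([4, 4096])
```

The resulting 4096-dimensional vectors can be fed to any conventional classifier (SVM, LDA, random forest), which is the pattern several of the studies reviewed below follow.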

Table 5 Names of databases used in papers that we included in this survey paper and their acronyms
Table 6 Deep learning application types in medical image analysis

Samala et al. (2016a) evaluated their proposed DCNN, built with 12 hidden layers, by comparing it with a CNN with 8 hidden layers in terms of AUC. The DCNN (kernel size 5) and the CNN (kernel size 3) were designed to classify true microcalcifications against false positives and achieved AUC values of 0.93 and 0.89, respectively. The dataset used in this work comprised 64 DBT cases collected at the University of Michigan.

Samala et al. (2016b) proposed feature-based and DCNN-based CAD systems. In the DCNN, transfer learning was applied to train the first four convolutional layers and the last three fully connected layers using only mammographic images for lesion recognition and false-positive reduction. The transfer learning scored a training AUC of 0.99 and, when validated on a DBT image dataset, achieved an AUC of 0.81. After the training on DM alone, additional training with DBT images improved the validation score of the model to an AUC of 0.90. The data used in the study were obtained with three imaging modalities (SFM, DM and DBT): 2282 images from digitized SFM and DM, and 324 DBT images. The sources of the image dataset were the Department of Radiology at the University of Michigan Health System and the University of South Florida. Morphological and texture features were used in the feature-based CAD system for mass detection in the mammograms with the aim of false-positive reduction. Finally, the feature-based and DCNN-based CAD systems achieved sensitivities of 83\(\%\) and 91\(\%\), respectively, at 1 FP/DBT volume.

Kim et al. (2016) proposed a latent bilateral feature representation learned by a DCNN to classify masses and FPs through multi-level abstraction of the data, yielding a more accurate representation of the image dataset. The approach was applied to the latent bilateral feature representation of masses in DBT and compared with hand-crafted features: the AUC was 0.826 for hand-crafted features and 0.847 for latent bilateral features.

Fotin et al. (2016) presented a comparative analysis between a conventional approach and a DCNN using 3D (DBT) images to detect ROIs and classify two breast cancer abnormalities (masses and architectural distortions). In the conventional approach, hand-crafted features (contrast, histogram, gradient, texture, shape and topology descriptors) were extracted from the ROIs and given to an ensemble of boosted decision trees. In the DCNN approach, instead of hand-crafted features, resized \(256\times 256\) ROIs were fed to the DCNN to detect and classify the abnormalities. The sensitivity of the conventional and DCNN approaches was 83.2\(\%\) and 89.3\(\%\), respectively, for suspicious ROIs, and 85.2\(\%\) and 93.0\(\%\), respectively, for malignant ROIs.

Zhang et al. (2018) collected weakly annotated mass image datasets, labeled by experts, for their proposed approach: a fully convolutional network-based heatmap regression for mass detection. The weakly annotated mammograms were given as input to the fully convolutional model, which generated a heatmap of the breast mass. The trained model was then used for two purposes: first, to estimate the probability map of mass locations for 439 mammograms; second, 40 DBT images were used to evaluate the performance of transfer learning by fine-tuning only the last two layers of the pre-trained U-Net model, which had been trained on mammographic images. The evaluation parameters were precision and recall: 0.85 and 0.92 for the mammographic images, and 0.33 and 0.41, respectively, for the tomosynthesis images.

Samala et al. (2018a) explained how the very large number of parameters in pre-trained models has become a major challenge when training deep learning systems. Pre-trained models such as AlexNet, VGGNet-16, GoogLeNet and ResNet use about 60 million, 138 million, 4 million and 60 million parameters, respectively (Krizhevsky et al. 2012; Simonyan and Zisserman 2014; Szegedy et al. 2015; He et al. 2016). The limited amount of medical images is another challenge in training these models, and the common practice to overcome it is to pre-train the models on non-medical images. In this study (Samala et al. 2018a), an ImageNet pre-trained deep CNN model was selected for transfer learning. The images used in the experiment were 2282 ROIs from 2461 mass lesions in a mammographic image dataset and 230 ROIs from 228 DBT mass lesions. Data augmentation was applied to these images, resulting in a total of 19,688 mammographic images and 9120 DBT images. The authors added two fully connected (FC) layers, the first with 100 nodes and the second with two nodes, to avoid the divergence caused by cross-domain transfer learning. According to Samala et al. (2017), freezing the first convolutional layer of the ImageNet pre-trained model gave the best transfer learning performance on mammographic images, and a DCNN trained on mammographic images performed well when validated on DBT image data. In the final stage of this approach, transfer learning using DBT was performed by freezing the layers from the first convolutional layer up to the third fully connected layer. These layers were used as a feature extractor to generate 1000 features, after which a recursive feature reduction method selected 240 features. After feature reduction, a genetic algorithm and layered pathway evolution were used to compress the frozen deep CNN. The AUC-based classification performance of the method was 0.88 before compression and 0.90 after compression.

Samala et al. (2018b) proposed a two-stage cross-domain transfer learning approach using a DCNN (ImageNet-trained) with five convolutional layers (C1–C5) trained on 1.2 million non-medical images. In the first stage, some convolutional layers of the DCNN were frozen and the network was trained with 20K ROIs from mammographic images; the freezing was done in three ways: first only C1, then C1 to C3, and finally C1 to C5. In the second stage, the mammography-trained DCNN was further trained using 9K ROIs from DBT in all three configurations of the first stage. Finally, the efficiency of the transfer learning approach was evaluated in terms of AUC: 0.76 for C1 frozen, 0.73 for C1–C3, and 0.73 with all convolutional layers frozen. The result indicates that freezing only C1 gives the highest performance during transfer learning.
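The freezing scheme compared in these studies can be sketched as follows in PyTorch, using AlexNet's five convolutional layers as stand-ins for C1–C5 (an assumption for illustration; the authors' exact network and training pipeline differ):

```python
import torch
import torchvision.models as models

model = models.alexnet(pretrained=True)

# Indices of the Conv2d modules inside model.features (C1..C5).
conv_ids = [i for i, m in enumerate(model.features)
            if isinstance(m, torch.nn.Conv2d)]

def freeze_up_to(model, last_frozen_conv):
    """Freeze C1..C<last_frozen_conv>; leave later layers trainable."""
    for i, m in enumerate(model.features):
        trainable = i > conv_ids[last_frozen_conv - 1]
        for p in m.parameters():
            p.requires_grad = trainable

# Stage 1: freeze only C1, then fine-tune on mammographic ROIs.
freeze_up_to(model, 1)
# ... train on ~20K mammographic ROIs ...

# Stage 2: keep the frozen layers fixed, continue training on DBT ROIs.
# ... train on ~9K DBT ROIs ...
```

Repeating the sketch with `freeze_up_to(model, 3)` and `freeze_up_to(model, 5)` reproduces the three configurations compared above.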

Semi-automated breast mass segmentation using DBT images was proposed by Zhang et al. (2018). In recent years, mass detection and segmentation using machine learning approaches have achieved remarkable results (Zhang et al. 2016, 2017; Lian et al. 2015, 2017; Zhu et al. 2016, 2017; Liu et al. 2017), and DCNN-based methods have become even more robust and precise at detecting and segmenting breast masses (Zhang et al. 2018). In their study, an encoder-decoder network is used for mass segmentation in a training stage and an application stage. In the training stage, breast mass masks were used to build the encoder-decoder model that realizes the mass segmentation. In the application stage, mass regions annotated by radiologists were extracted from each DBT image and fed to the pre-trained model, with a U-Net architecture, for pixel-wise mass segmentation. The network has two parts: an encoding path (two convolution operations, two rectified linear units and one max-pooling operation) for feature extraction, and a decoding path (one up-pooling operation, one feature map, and two convolution operations) for image expansion. In the experiments, n-fold cross-validation was applied to measure the efficiency of the proposed mass segmentation in terms of the Dice similarity coefficient (DSC), achieving a value of 0.59.
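The DSC used here (and by several of the studies below) measures the overlap between a predicted mask and the ground truth; a minimal NumPy version, with synthetic masks, looks like this:

```python
import numpy as np

def dice_similarity(pred, truth):
    """Dice similarity coefficient between two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum())

# Hypothetical 2D masks: predicted segmentation vs. radiologist ground truth.
pred  = np.zeros((64, 64), dtype=np.uint8); pred[20:40, 20:40] = 1
truth = np.zeros((64, 64), dtype=np.uint8); truth[25:45, 22:42] = 1
print(round(dice_similarity(pred, truth), 3))   # 0.675 for these toy masks
```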

Yousefi et al. (2018) introduced three different CAD frameworks for the automatic detection of spiculated masses: a hand-crafted feature-based MIL framework, a DCNN Multiple Instance-Random Forest (DCNN MI-RF), and a deep cardinality-restricted Boltzmann machine Multiple Instance-Random Forest (DCaRBM MI-RF). The 5040 2D slices collected from 87 DBT volumes were preprocessed with data augmentation, noise removal, and pectoral muscle removal; for the DCNN and deep CaRBM, data augmentation was carried out before noise and pectoral muscle removal. The efficiency of all three frameworks was measured by sensitivity, AUC, specificity, and accuracy. In the hand-crafted framework, four feature types (morphological, statistical, gray-level, texture) were extracted from the ROIs and given to an MI-RF classifier to classify the DBT slices; its specificity, sensitivity, accuracy, and AUC were 75\(\%\), 66.6\(\%\), 69.2\(\%\) and 0.75, respectively. In the DCNN MI-RF framework, a DCNN was embedded to obtain an optimal high-level feature representation from the pre-processed, \(256\times 256\)-resized DBT slices; these features were then given to the MI-RF classifier, and the framework achieved an AUC, accuracy, sensitivity, and specificity of 0.87, 86.81\(\%\), 86.6\(\%\) and 87.5\(\%\), respectively. The CaRBM-based framework was similar to the DCNN one except that the DCNN was replaced by a deep CaRBM for feature representation; its AUC, accuracy, specificity, and sensitivity were 0.70, 78.5\(\%\), 66.6\(\%\) and 81.8\(\%\), respectively.

Mendel et al. (2018) designed a CNN-based feature extraction method for ROIs obtained from DM, synthesized 2D images and DBT slices. The images were collected from 76 patients using DBT and DM. Expert radiologists identified 78 lesions (ROIs) of \(512\times 512\) pixels in these datasets, of which 48 were benign and the rest malignant; some lesions were visible in CC views and some in MLO views. The ROIs were given to a pre-trained DCNN (VGGNet-19) to extract features (LeCun et al. 2015; Shin et al. 2016b). Feature extraction was followed by feature reduction, eliminating features with zero values for 50\(\%\) of the ROIs. Finally, the reduced features were given to a linear SVM, whose performance was measured in terms of AUC for the three datasets. The AUC values for DM, synthesized mammographic images and DBT slices were 0.755, 0.814 and 0.743, respectively, for the CC view, and 0.757, 0.881 and 0.832, respectively, for the MLO view.

Rodriguez-Ruiz et al. (2018) adopted a DCNN architecture for three-class (pectoral, breast or open field) pixel classification, similar to the U-Net used in Ronneberger et al. (2015). The model was evaluated with the Dice similarity coefficient (DSC), comparing the area overlap between segmentation and ground truth. The data were collected from 100 patients, yielding 172 DBT slices: 121 for training, 15 for validation and 36 for testing. The experimental results showed a DSC of 0.970 on the test data, and the method was found to be promising for other modalities such as mammography and synthetic mammograms (Tables 7, 8).

Table 7 Deep learning applications with DBT, DM, MRI, and US imaging modalities and databases
Table 8 Performance comparison of selected studies using DBT, DM, MRI, and US databases in terms of size of images, AUC, accuracy (Acc), specificity (Spec) and sensitivity (Sen) and modality

Kooi et al. (2017a) carried out a feature-extraction approach using a DCNN to classify benign solitary cysts against malignant masses. They adopted data augmentation at different image resolutions but ended up with no significant improvement in performance. Their experiment achieved an AUC of 0.80.

Jadoon et al. (2017) introduced CNN-DW and CNN-CT based multi-class classification techniques to classify mammograms from the IRMA dataset into normal, benign, and malignant. The fusion of CNN features with the most descriptive wavelet features performed well, achieving an accuracy of 83.74\(\%\) with an SVM classifier.

Gallego-Posado et al. (2016) applied a DCNN for breast tumor detection and diagnosis. The authors preprocessed (cropped and resized) the original mammograms from MIAS and then applied data augmentation by rotating the original images to enlarge the dataset. They extracted features using a pre-trained CNN model, fed them to an SVM, and scored an accuracy of 64.52\(\%\).

Amit et al. (2017) introduced two DCNN techniques to classify breast images into benign and malignant. The annotated images were cropped using a square bounding box around the annotated boundaries, giving 891 malignant (BI-RADS 5) and 365 benign (BI-RADS 2) images. These images were augmented using rotation (90\(^{\circ }\), 180\(^{\circ }\), 270\(^{\circ }\)) and flipping (right-left, down-up), as sketched below. In the first approach, a CNN with three convolutional layers was trained on the labeled datasets. In the second approach, the same labeled datasets were given to a pre-trained VGGNet, features were extracted from a fully connected layer, and classification was done with an SVM. The first approach’s accuracy, sensitivity, specificity, and AUC were 83\(\%\), 84\(\%\), 82\(\%\) and 0.91, respectively; the second approach’s were 73\(\%\), 77\(\%\), 68\(\%\) and 0.81, respectively.
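A minimal NumPy sketch of this rotation-and-flip augmentation (our illustration, not the authors' code):

```python
import numpy as np

def augment(roi):
    """Rotations (90, 180, 270 degrees) and flips (right-left, down-up),
    as in the augmentation scheme described above."""
    out = [roi]
    for k in (1, 2, 3):              # 90-, 180-, 270-degree rotations
        out.append(np.rot90(roi, k))
    out.append(np.fliplr(roi))       # right-left flip
    out.append(np.flipud(roi))       # down-up flip
    return out

roi = np.random.rand(128, 128)       # stand-in for a cropped lesion ROI
print(len(augment(roi)))             # 6 variants including the original
```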

Antropova et al. (2017a) proposed two ways to extract features for classifying images into benign and malignant: one segmentation-based and one CNN-based. The study used 640 images acquired with DCE-MRI, of which 191 were benign and 449 malignant. In the segmentation-based approach, 38 features from 6 categories (enhancement texture, size, kinetics variance, morphology, shape, and kinetics) were extracted after segmentation. In the CNN-based approach, extracted \(148\times 148\)-pixel ROIs were given to the AlexNet pre-trained model and 4096-dimensional feature vectors were extracted from the FC layers; however, only 518 features were used for analysis, after discarding the roughly 80\(\%\) of feature dimensions that were zero-valued. Performance was evaluated with an LDA classifier under round-robin cross-validation for three cases: segmentation-based features (38), CNN-based features (518) and fused features (556). The AUC for segmentation-based, CNN-based and combined features was 0.88, 0.76 and 0.91, respectively.

Antropova et al. (2017b) collected three datasets using three imaging modalities (mammography, ultrasound, and DCE-MRI). The number of patients was 245 for mammography, 1125 for ultrasound and 690 for DCE-MRI; the corresponding numbers of ROIs were 739 (328 benign, 411 malignant), 2393 (1978 benign, 415 malignant) and 690 (212 benign, 478 malignant). For all datasets, CNN-based features from the FC and max-pool layers of VGGNet (VGG19) and conventional (hand-crafted) features were collected. The max-pool features outperformed the FC features in an AUC comparison, so the conventional features were compared only against the max-pool features. The two feature sets (conventional and CNN) were fed to a non-linear SVM with a Gaussian RBF kernel and achieved AUC values of 0.79 and 0.81 for mammographic images, 0.84 and 0.87 for ultrasound images, and 0.86 and 0.87 for DCE-MRI, respectively. The SVM was also evaluated on the combined features (conventional + CNN) and achieved AUC values of 0.86, 0.90 and 0.89 for the mammography, ultrasound, and DCE-MRI datasets, respectively.
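A hedged sketch of such a max-pool feature pipeline, in PyTorch with scikit-learn: each of VGG19's five max-pool outputs is average-pooled per channel, the pooled vectors are concatenated, and an RBF-kernel SVM is trained on top. The pooling scheme and preprocessing here are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np
import torch
import torchvision.models as models
from sklearn.svm import SVC

vgg = models.vgg19(pretrained=True).features.eval()

def maxpool_features(x):
    """Collect spatially averaged activations after every max-pool layer."""
    feats = []
    for layer in vgg:
        x = layer(x)
        if isinstance(layer, torch.nn.MaxPool2d):
            feats.append(x.mean(dim=(2, 3)))   # one value per feature map
    return torch.cat(feats, dim=1)             # 64+128+256+512+512 = 1472-d

with torch.no_grad():
    X = maxpool_features(torch.randn(20, 3, 224, 224)).numpy()
y = np.random.randint(0, 2, 20)                # stand-in benign/malignant labels

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
```

Pooling each map to a single value makes the representation robust to the exact spatial position of the lesion, which is one motivation for preferring max-pool features over FC features.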

Antropova and Giger (2018) extracted CNN-based features from all five max-pooling layers using DCE-MRI images from 690 cases; based on reports from pathologists and radiologists, 212 cases were benign and 478 malignant. The extracted features were first normalized with the Euclidean norm before being concatenated into fused CNN feature vectors, which were then given to a linear SVM to classify the MRI images as malignant or benign. The discriminating power of the features from the three ROI types was evaluated using AUC, with 80\(\%\) of the features for training and 20\(\%\) for testing. The AUC for the central slice of the second postcontrast, the central slice of the second postcontrast subtracted, and the MIP was 0.80, 0.83 and 0.88, respectively.

Antropova et al. (2018) extracted features from all five max-pool layers of a 19-layer VGGNet for benign/malignant classification. Of the 703 images collected with DCE-MRI, 221 were benign and 482 malignant. Features were extracted separately from images before and after contrast enhancement and fed to an LSTM and an SVM with RBF kernel. The parameters of both classifiers were tuned with a grid search under 5-fold cross-validation. The efficiency of the classifiers and the distinguishing power of the features were measured by AUC analysis: 0.81 for the SVM classifier and 0.85 for the LSTM classifier.

Zhu et al. (2018) used a 16-layer VGGNet to extract features (deep features from Conv11, Conv12, Conv13, FC1, and FC2) from MRI images. Images were collected from a total of 131 patients, 35 of whom were diagnosed with invasive cancer and the rest with DCIS. After generating ROIs from the original images, data augmentation was applied using random translation and rotation. SVMs with different kernel functions (polynomial, linear and RBF) were trained, evaluated in terms of AUC, and validated using 10-fold cross-validation. The best AUC value (0.68) was achieved with deep features from convolutional layer 13.

Zhang et al. (2018) proposed a two-stage CNN-based segmentation technique for images collected from 272 patients using DCE-MRI. In the first stage, a rough segmentation of the breast tumor is obtained; in the second stage, it is refined by an FCN. Segmentation efficiency was evaluated against manually annotated ground truth using three measurements (Dice similarity coefficient, sensitivity, and PPV), giving a DSC of 0.7176, a sensitivity of 75.04\(\%\) and a PPV of 77.33\(\%\).

Li et al. (2017) proposed 2D CNN and 3D CNN classification of 143 breast images (77 malignant, 66 benign) as benign or malignant, with AUC, accuracy, sensitivity and specificity as evaluation parameters. On test data without augmentation, the AUC, sensitivity, specificity, and accuracy were 0.841, 81.4\(\%\), 77.3\(\%\) and 80.4\(\%\), respectively, for the 3D CNN, and 0.752, 76.1\(\%\), 67.4\(\%\) and 71.1\(\%\), respectively, for the 2D CNN.

Benjamin et al. (2017) applied cropping to extract 561 ROIs of \(111\times 111\) pixels from 64 images. VGGNet was used to extract features from its five convolutional blocks to capture spatial information in both lower-level and higher-level features. The five blocks contributed 64, 128, 256, 512, and 512 features, respectively, for a fused total of 1472 features. A standardization technique was applied to all features to achieve zero mean and unit variance. After removing features with zero variance, the predictive power of an LDA classifier for response to therapy was measured in terms of AUC, scoring 0.85.
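The post-processing described above (zero-variance removal, standardization, LDA) maps directly onto scikit-learn utilities; a sketch with synthetic stand-in data, not the study's actual features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for 1472 fused VGGNet features over 64 samples.
X = np.random.rand(64, 1472)
X[:, :10] = 0.5                                 # a few constant (zero-variance) columns
y = np.random.randint(0, 2, 64)                 # stand-in response labels

X = VarianceThreshold(threshold=0.0).fit_transform(X)  # drop constant features
X = StandardScaler().fit_transform(X)                  # zero mean, unit variance
lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.score(X, y))
```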

Becker et al. (2018) studied 632 patients who underwent breast ultrasound in 2014; 550 had malignant and the remaining 82 benign lesions. The authors proposed a generic DL approach and compared it with human readers of different expertise (experienced and intermediate readers, inexperienced readers) in classifying the ultrasound images as benign or malignant. Hold-out validation was used, with 70\(\%\) of the dataset for training and 30\(\%\) for testing. In the AUC-based performance analysis, the DL method scored 0.84, experienced and intermediate readers 0.88, and inexperienced readers 0.79.

Han et al. (2017) carried out an experiment by modifying the GoogLeNet architecture. The modifications involved removing the two auxiliary classifiers, adapting the input layer to grayscale instead of color images, and reducing the number of output classes to 2. The authors collected 7408 biopsy-confirmed ultrasound breast images (ROIs) associated with masses; a semi-automatic segmentation technique was used to collect the ROIs from the lesions of 5151 patients. The dataset covered 4254 benign and 3154 malignant lesions. Pre-processing included histogram equalization, image cropping and margin augmentation: image cropping was done with a 180-pixel margin, and data augmentation used cropping with two further margins (120 and 150 pixels) plus translation to increase the size of the training set. Of the 7408 ROIs, 6579 (3765 benign and 2814 malignant) were used for training and 829 for testing.

Shin et al. (2017) proposed a CNN-based framework to localize and classify masses in breast ultrasound (BUS) images. The CNNs (VGGNet-16 and ResNet-101) were trained using a large, weakly annotated dataset (DX) and a small but strongly annotated dataset (DX + Loc, 600 benign and 600 malignant). The evaluation was conducted on DX + Loc-Test using the correct localization (CorLoc) measure: the percentage of images in which a method correctly localizes an object of the target class. Better results were obtained when both the weakly and strongly annotated datasets were used to train the network; the DX dataset used an image-level loss, whereas the DX + Loc dataset used region-level losses. VGGNet-16 scored a CorLoc of 0.8450 and ResNet-101 scored 0.8325.
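CorLoc is typically computed by checking whether the top predicted box overlaps the ground-truth box with an intersection-over-union of at least 0.5; a small sketch (our illustration, the paper's exact matching rule may differ):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def corloc(predicted, ground_truth, thresh=0.5):
    """Fraction of images whose top predicted box matches the ground truth."""
    hits = [iou(p, g) >= thresh for p, g in zip(predicted, ground_truth)]
    return float(np.mean(hits))

preds = [(10, 10, 50, 50), (0, 0, 30, 30)]   # hypothetical detections
gts   = [(12, 8, 52, 48), (40, 40, 80, 80)]  # hypothetical ground truth
print(corloc(preds, gts))                    # 0.5
```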

Yap et al. (2018a) proposed three different DL approaches, namely patch-based LeNet, U-Net, and transfer learning using a fully convolutional AlexNet, for breast ultrasound lesion detection. The authors used two datasets, dataset A (60 malignant, 246 benign) and dataset B (53 malignant, 110 benign), and the overall best performance was achieved by LeNet when the two datasets were combined.

Yap et al. (2018) proposed end-to-end breast ultrasound lesion detection using a fully convolutional version of AlexNet (FCN-AlexNet). The dataset was identical to the one used in Yap et al. (2018a), and the proposed approach performed better at detecting benign lesions than malignant ones, based on an assessment using hold-out validation (70\(\%\) for training, 10\(\%\) for validation and 20\(\%\) for testing).

5.2 Histopathology and deep learning based breast cancer image analysis

Histopathology is a technique that has been applied to cancer diagnosis and prognostication for many decades, in which pathologists analyze tissue cells under different microscopic standards (Ahmad and Khurshid 2019; Mobadersany et al. 2018). However, pathologists rarely converge on a single final decision, since the assessment is subjective, and frequent use of this method is tiresome and not repeatable (Ahmad and Khurshid 2019; Mobadersany et al. 2018). In addition, issues related to slide preparation, variations in scanning and staining across sites, and biological variance among patients (Janowczyk and Madabhushi 2016) make histopathology-based breast cancer analysis very challenging.

Ahmad and Khurshid (2019) applied the histopathological method to breast cancer image analysis using deep convolutional neural networks as a supervised classification method. They adopted three DCNN architectures (AlexNet, GoogLeNet, and ResNet) to classify 260 images into four classes (normal, benign, in-situ and invasive); the original dataset contained 51 normal, 74 benign, 68 in-situ and 67 invasive images. Classification was performed both patch-wise and image-wise, and image-wise classification performed better than patch-wise for all three CNN models.

Xie et al. (2019) adopted two DCNN models (Inception-V3 and Inception-ResNet-V2) to classify the BreaKHis histology image dataset into binary classes (benign and malignant) and multi-classes. The multi-class setting arises from the malignant subtypes: ductal carcinoma (DC), lobular carcinoma (LC), mucinous carcinoma (MC), and papillary carcinoma (PC). In their experimental analysis, they found that histopathology-based image classification using the two selected DCNN models was superior to the existing methods, and they showed that Inception-ResNet-V2 was the best-performing DCNN architecture for diagnosing breast cancer from histopathological images.

Sun and Binder (2017) applied three DCNN architectures (CaffeNet, GoogLeNet, and ResNet-50) to breast cancer biopsies from the BreaKHis dataset at magnifications of \(40\,\times\), \(100\,\times\), \(200\,\times\) and \(400\,\times\). The whole networks were fine-tuned with different crop sizes of histopathology images from the target dataset at the specified magnifications, and performance was evaluated using accuracy. The best results were achieved at \(200\,\times\) magnification, where the accuracy of CaffeNet, GoogLeNet and ResNet-50 was 89.40\(\%\), 89.86\(\%\) and 89.60\(\%\), respectively.

Jiang et al. (2019) introduced a novel DCNN composed of a convolutional layer, a small SE-ResNet module, and a fully connected layer to classify histopathology images from the BreaKHis dataset into binary classes (benign and malignant) and multi-classes, the latter covering the malignant subtypes ductal carcinoma (DC), lobular carcinoma (LC), mucinous carcinoma (MC), and papillary carcinoma (PC). In their architecture, they introduced a new module combining a residual module with a squeeze-and-excitation block, and they added a new learning-rate scheduler that achieves good performance without a complicated fine-tuning process (Table 9).
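For readers unfamiliar with the squeeze-and-excitation idea, the generic block (sketched here in PyTorch; Jiang et al.'s small SE-ResNet module differs in detail) learns per-channel weights that recalibrate the feature maps:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Generic squeeze-and-excitation block: a global-pooling 'squeeze'
    followed by a two-layer 'excitation' that rescales each channel."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: global average pooling
        w = self.fc(w).view(b, c, 1, 1)  # excitation: per-channel weights
        return x * w                     # recalibrate the feature maps

x = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(x).shape)              # torch.Size([2, 64, 32, 32])
```

Combining such a block with a residual connection yields the SE-ResNet style of module the authors build on.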

Table 9 Histopathology and deep learning based breast cancer image analysis

In the final stage of this survey, we selected papers published between 2016 and 2019, as indicated in Fig. 1, to show (a) the number of papers that use each particular database/dataset considered in this survey, (b) the distribution of papers across the application types of DL in breast cancer image analysis, and (c) the frequency of each breast cancer abnormality type diagnosed.

Fig. 1

Distribution of papers by publication year. a The number of papers that used a particular database from 2016 to 2019; b the number of papers that considered a particular breast abnormality type from 2016 to 2018; c the number of papers considered for a particular DCNN application type from 2016 to 2018

6 Conclusion

Medical image analysis using DL has proven superior to conventional machine learning approaches for scientific researchers. Recent remarkable advances in deep learning for medical image analysis have enabled it to discover feature patterns in raw images, although it continues to demand huge image datasets. The application types commonly found in today’s DL-based research are feature extraction, classification, detection, and segmentation, and all of these DL application types are considered in this survey. Since DL methods have achieved state-of-the-art results across medical applications such as breast image analysis, brain image analysis, retinal image analysis, abdominal image analysis, and musculoskeletal image analysis, building on them is a natural step toward further improvement in medical image analysis. However, some gaps still need to be addressed. First, big datasets of medical images should be built and made available to researchers, so that pre-trained models trained on medical images become available, which in turn would ease the image requirements of transfer learning. Second, new algorithms should be developed that require fewer images to train deep models for specific domains in medical applications.