1 Introduction

Breast cancer is the second-ranked cause of death for women, after lung cancer. It is generally caused by the growth of uncontrolled abnormal cells that usually arise from the inner milk ducts or lobules [1]. Microcalcifications and masses are two common types of breast cancer and can be benign or malignant. Early detection of breast cancer is critical for patient survival. In general, digital X-ray mammography is most widely used for breast imaging and screening. The main purpose of mammography is to detect early signs of cancer and to diagnose breast masses [2]. The American Cancer Society (ACS), American College of Radiology (ACR), and American Congress of Obstetricians and Gynecologists (ACOG) suggest that women undergo annual mammograms starting at age 40 [3]. For women between 40 and 50 years of age, the National Cancer Institute (NCI) encourages breast screening one or two times a year [4]. Screening with mammography is performed in two steps. First, the breast is compressed between two small flat plates. Then, a low X-ray dose is applied directly through the breast and acquired by a two-dimensional (2D) panel detector. After that, mammographic images are assessed by radiologists. However, due to a large number of breast images, abnormal lesions can be missed. Computer-aided diagnosis (CAD) could support radiologists by improving the visual screening process and allowing better recognition of breast abnormalities. It has been reported that CAD systems improve the overall accuracy of detection of breast cancer [5,6,7,8].

There have been several studies using CAD to detect breast abnormalities in mammograms. Balanica et al. developed a CAD system that utilized spiculated lesion features (i.e., the edge shapes of lesions) [9]. They adopted a neural network (NN) technique to train their system on 96 cases and classified masses as either benign or malignant. Al-Olfe et al. built a CAD system using unsupervised clustering and biclustering classifier techniques [10]. They extracted first and second order statistical features and shape features via wavelet transform to classify the masses as either normal or abnormal. Kang et al. utilized two-dimensional (2D) and fractal dimension entropies as texture features [11]. They applied a support vector machine (SVM) classifier to recognize the mass abnormalities. Sharma and Khanna developed a CAD system based on the manual collection of many regions of interest (ROIs) from two public databases, the Digital Database for Screening Mammography (DDSM) and Image Retrieval in Medical Applications (IRMA). They extracted Zernike features, which are rotation and shift invariant and also relevant to shape [12]. They used an SVM classifier to evaluate their system and diagnose lesions as malignant or nonmalignant. Recently, a new class of classifiers based on deep learning has been introduced and extended to the task of breast tissue classification. Arevalo et al. utilized a convolutional neural network (CNN) to represent the content of breast images and classified masses as benign or malignant [13]. Their CAD system achieved an area under the ROC curve (AUC) of 82.10%. Jiao et al. designed a CAD system based on CNN to recognize benign or malignant tissues in a total of 600 cases [14]. They extracted multiple hierarchical levels of features to train the CNN. Kooi et al. implemented a deep model using CNN to identify malignant and suspicious normal regions among 398 regions of mammograms [15]. As CNNs have only recently been adopted for CAD, active investigations are underway to improve their performance to a satisfactory level.

In this paper, we propose a CAD system utilizing deep belief network (DBN), one of the deep learning algorithms [16], to classify abnormalities on mammograms or breast masses into normal, benign, or malignant. We evaluated its performance against those of other conventional classifiers. Our results show that our DBN-based CAD outperforms the conventional CAD systems. The paper is organized as follows. First, we present an overview of our proposed CAD system. Second, we automatically extract ROIs and generate the features for training and testing processes. Third, we train our proposed DBN-based CAD system using the DDSM database. Finally, we compare the performance between the proposed DBN and other classifiers.

2 Materials and Methods

The schematic diagram of our DBN-based CAD system processes is shown in Fig. 1. Our system involves automatic mass detection, ROI extraction techniques, feature extraction, and DBN classifier modules. Our presented system is slightly different in the use of ROIs and involves two techniques, Technique A (i.e., Mass ROIs), and Technique B (i.e., Whole Mass ROIs), to study the feasibility of a DBN-based CAD system. These two different ROI selection options are studied to investigate whether they have a significant effect on the performance of the proposed CAD system.

Fig. 1 Schematic diagram of our proposed CAD system for both techniques

2.1 DDSM Dataset

In this study, we utilize the Digital Database for Screening Mammography (DDSM) to evaluate the proposed system [17]. This dataset was collected by the University of South Florida and is available online for research purposes [17]. It represents real breast data with an average image size of 3000 × 4800 pixels, a resolution of 42 microns, and a depth of 16 bits. The DDSM database consists of 2620 cases categorized into 43 volumes. Each case involves four breast images: a Mediolateral Oblique (MLO) view and a Cranio-Caudal (CC) view of each breast. Benign and malignant masses in all mammograms are recognized and annotated by expert radiologists. In this paper, we utilize 150 mammograms divided equally into normal, benign, and malignant classes.

2.2 Preprocessing

Due to breast compression in the scanning process of mammography, breast deformation occurs. The peripheral area of the breast is affected by this compression, which alters the grey level values of breast tissue in these regions [18]. Thus, the intensity values of peripheral areas are always lower than those of the central area. To diagnose correctly, physicians must adjust the window level settings while inspecting suspicious regions. However, this process is inconvenient and can take a long time, especially with a large number of patients. To enhance mammographic images for automatic detection of abnormal masses, we apply a multi-threshold peripheral equalization algorithm [19]. This algorithm enhances the mammogram and removes irrelevant information from it. Its main purpose is to enhance the peripheral area of the mammogram by utilizing multiple thresholds to create multiple images and then averaging them to produce smooth transitions between the central and peripheral areas. Thus, physicians can view and inspect the lesions with a single window level setting. The algorithm involves the following five sequential steps. First, by utilizing adaptive Otsu thresholding, we separate the breast region from its background to generate a segmented image (\({\text{I}}_{\text{seg}}\)), as illustrated in Fig. 2b. Second, a 2D Gaussian low pass filter (GLPF) is applied to the original mammogram, shown in Fig. 2a, to produce a blurred image (\({\text{I}}_{\text{blur}}\)), shown in Fig. 2c. Third, \({\text{I}}_{\text{blur}}\) is multiplied by the segmented image to eliminate the unwanted information outside the breast tissue, as depicted in Fig. 2d. Fourth, a normalized thickness profile (NTP) of the mammogram is estimated, as shown in Fig. 2e. The NTP is obtained using five threshold values \(( {\text{T}}_{\text{n}} )\), as the mean value of the corresponding pixels of the five thresholded images [19]. Each threshold value is computed as follows:

Fig. 2 Peripheral density correction using multi-thresholding algorithm. a Original mammogram, b segmented image (\({\text{I}}_{\text{seg}}\)) with adaptive Otsu thresholding, c blurred image (\({\text{I}}_{\text{blur}}\)), d blurred image after multiplication by \({\text{I}}_{\text{seg}}\), e normalized thickness profile (NTP) of mammogram, and f peripheral equalized image (\({\text{I}}_{\text{peq}}\)) of mammogram

$${\text{T}}_{\text{n}} = {\text{I}}_{\text{ave}} \times {\text{F}}_{\text{n}} ;\quad {\text{n}} = 1,2, \ldots 5,$$
(1)

where \({\text{I}}_{\text{ave}}\) is the average intensity value of \({\text{I}}_{\text{blur}}\), and \({\text{F}}_{\text{n}}\) is equal to 0.8, 0.9, 1.0, 1.1, and 1.2 and represents the scale parameter used to adjust the threshold value to be around \({\text{I}}_{\text{ave}}\). In this study, we used these five thresholds to increase the intensities of the peripheral regions and eliminate the boundary effect produced if a single threshold is used [19].

To create NTP, we averaged all \({\hat{\text{I}}}_{\text{blur}}\) images, which are estimated by rescaling the \({\text{I}}_{\text{blur}}\) image according to each threshold value as follows:

$${\hat{\text{I}}}_{\text{blur}} ( {\text{i,j)}} = \left\{ {\begin{array}{*{20}c} {\frac{{{\text{I}}_{\text{blur}} ( {\text{i,j)}}}}{{{\text{T}}_{\text{n}} }} ;} & {{\text{I}}_{\text{blur}} ( {\text{i,j) }} \le {\text{ T}}_{\text{n}} } \\ { 1 ;} & {\text{otherwise}} \\ \end{array} } \right..$$
(2)

then,

$$\text{NTP} = \frac{1}{5}\sum\limits_{n = 1}^{5} {{\hat{\text{I}}}_{\text{blur}} ({\text{n}})} ,$$
(3)

where \({\text{i}} = 1, 2, 3, \ldots ,{\text{M}}, {\text{and j}} = 1, 2, \ldots ,{\text{N}}\), and M × N is the size of the mammogram. Finally, the peripheral equalized image (\({\text{I}}_{\text{peq}}\)) of the mammogram is achieved as follows:

$${\text{I}}_{\text{peq}} = \frac{{{\text{I}}_{\text{att}} }}{{({\text{NTP}})^{\text{r}} }},$$
(4)

where \({\text{I}}_{\text{att}}\) is an attenuation image (i.e., the original mammogram, as shown in Fig. 2a), and r is a constant in the range [0.70, 1.0], as in [19]. The peripheral equalized mammogram is illustrated in Fig. 2f. The signal-to-noise ratio (SNR) is computed for the original and preprocessed mammograms shown in Fig. 2a, f [20], as follows:

$${\text{SNR }}\left( {\text{dB}} \right) = 1 0\cdot { \log }_{ 1 0} \left[ {\frac{{\sum\limits_{\text{i = 1}}^{\text{M}} {\sum\limits_{\text{j = 1}}^{\text{N}} {\left[ {{\text{I}}_{\text{att}} \left( {\text{i,j}} \right)} \right]^{ 2} } } }}{{\sum\limits_{\text{i = 1}}^{\text{M}} {\sum\limits_{\text{j = 1}}^{\text{N}} {\left[ {{\text{I}}_{\text{att}} \left( {\text{i,j}} \right) - {\text{I}}_{\text{peq}} \left( {\text{i,j}} \right)} \right]^{ 2} } } }}} \right].$$
(5)

The SNR is estimated to be 17.01 and 17.35 dB when r is equal to 0.7 and 1.0, respectively. Therefore, we set \({\text{r}} = 1.0\) for the entire dataset in our CAD system.
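The chain from Eq. 1 to Eq. 5 can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, images are plain nested lists, \({\text{I}}_{\text{ave}}\) is assumed to be averaged over non-zero (breast) pixels only, and background pixels with an NTP near zero are set to zero to avoid dividing by zero.

```python
import math

FACTORS = (0.8, 0.9, 1.0, 1.1, 1.2)  # F_n in Eq. 1


def normalized_thickness_profile(i_blur, factors=FACTORS):
    """Eqs. 1-3: mean of I_blur rescaled by each threshold T_n and clipped at 1."""
    # Assumption: I_ave is the mean over non-zero (breast) pixels only.
    inside = [p for row in i_blur for p in row if p > 0]
    i_ave = sum(inside) / len(inside)
    ntp = [[0.0] * len(row) for row in i_blur]
    for f in factors:
        t_n = i_ave * f  # Eq. 1
        for i, row in enumerate(i_blur):
            for j, p in enumerate(row):
                scaled = p / t_n if p <= t_n else 1.0  # Eq. 2
                ntp[i][j] += scaled / len(factors)     # Eq. 3
    return ntp


def peripheral_equalize(i_att, ntp, r=1.0, eps=1e-6):
    """Eq. 4; pixels whose NTP is close to 0 (background) are set to 0."""
    return [[a / n ** r if n > eps else 0.0 for a, n in zip(row_a, row_n)]
            for row_a, row_n in zip(i_att, ntp)]


def snr_db(i_att, i_peq):
    """Eq. 5 in decibels; returns infinity when the two images are identical."""
    num = sum(a * a for row in i_att for a in row)
    den = sum((a - p) ** 2
              for row_a, row_p in zip(i_att, i_peq)
              for a, p in zip(row_a, row_p))
    return math.inf if den == 0 else 10.0 * math.log10(num / den)


# Toy 2 x 3 "mammogram": the first column is background.
image = [[0, 50, 100], [0, 100, 200]]
ntp = normalized_thickness_profile(image)
equalized = peripheral_equalize(image, ntp, r=1.0)
print(snr_db(image, equalized))
```

In this toy example, pixels well above all five thresholds keep their value (NTP equal to 1), while darker peripheral pixels are brightened, which is the intended effect of the equalization.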

2.3 Automatic Mass Detection

One of the important steps in CAD for breast cancer classification is to detect specific masses or suspicious regions on mammograms [21]. In this work, we implemented our automatic mass detection algorithm as shown in Fig. 3. Figure 4a shows the breast-only image after label removal. Figure 4b shows the initial suspicious regions identified via the adaptive threshold procedure. We computed the threshold value, known as the grayscale intensity of the local background, by summing all gray level intensities inside the breast tissue and dividing by the total number of non-zero pixels (L), as follows:

Fig. 3 Block diagram of our proposed technique for automatic mass detection

Fig. 4 Proposed algorithm for automatic mass detection. a Label removal, b after applying adaptive thresholding, c binary image, d after applying morphological operations, e mass region extraction, f mass contour superimposed on the corresponding mammogram

$${\text{T}}_{\text{thr}} = \frac{{\sum\nolimits_{{{\text{i,j}} \in {\text{I}}}} {\text{I(i,j)}} }}{\text{L}} ;\quad {\text{I(i,j)}} > 0.$$
(6)

Then, we applied this threshold (\({\text{T}}_{\text{thr}}\)) to the breast image and converted the thresholded image of Fig. 4b to the binary image in Fig. 4c in order to apply morphological operations. Consecutive binary morphological operations were applied to determine the proper shape and size of the mass. These operations were accomplished in three steps. First, we utilized a fill operation to complete the whole expected suspicious region. Second, erosion with a disk-shaped structuring element was applied. Finally, we removed the small disconnected areas remaining around the mass after the erosion process. Figure 4d shows the end result of the morphological processing. The pectoral muscle is known to lie at the border of the mammogram in both MLO and CC views. After applying our mass detection algorithm as shown in Fig. 4, the pectoral muscle was disconnected from the mass. We then removed the part connected to the border, which represents the pectoral muscle. Figure 4f shows the whole breast image with the superimposed contour around the extracted mass. We verified this approach on MLO and CC views of 150 mammograms, which were equally divided among the classes (i.e., normal, benign, and malignant). A detection was considered correct if the intersection over union (\({\text{IOU}}_{{{\text{Ground}}\;{\text{truth}}}}^{\text{Extracted}}\)) between the extracted mass and its ground truth exceeded 50%. By comparing the detected mass contours against the ground truth under this criterion, we achieved an average correct mass detection accuracy of 86% for benign and malignant cases. The subsequent stages of mass classification depended solely on the successfully detected masses. It should be noted that extraction of masses from dense breast regions remains a limitation of our method.

Figure 5 shows some sample results compared to the ground truth of the original mammogram, with contours that were drawn manually by expert radiologists during the mass inspection procedure. Some mammograms involved two masses of different sizes, as shown in Fig. 5d. Thus, we collected 56 benign and 56 malignant masses that were extracted correctly. In tumor classification, dense breast tissue remains a challenge. So, in order to reduce classification bias, we extracted some normal regions randomly. In total, 56 regions from normal cases were collected.
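As a rough illustration of the threshold in Eq. 6 and of the 50% intersection-over-union criterion, the following pure-Python sketch operates on small nested lists. The function names are hypothetical, the morphological steps (fill, erosion, small-area removal) are omitted, and the toy image and ground-truth mask are invented for the example.

```python
def mean_nonzero_threshold(image):
    """Eq. 6: sum of gray levels divided by the number L of non-zero pixels."""
    values = [p for row in image for p in row if p > 0]
    return sum(values) / len(values)


def binarize(image, threshold):
    """Mark pixels brighter than the threshold as candidate mass pixels."""
    return [[1 if p > threshold else 0 for p in row] for row in image]


def intersection_over_union(mask_a, mask_b):
    """IoU of two binary masks of equal size (0.0 if both are empty)."""
    pairs = [(a, b) for row_a, row_b in zip(mask_a, mask_b)
             for a, b in zip(row_a, row_b)]
    inter = sum(1 for a, b in pairs if a and b)
    union = sum(1 for a, b in pairs if a or b)
    return inter / union if union else 0.0


# Toy 3 x 3 breast image (first column is background) and a made-up ground truth.
image = [[0, 10, 10],
         [0, 10, 90],
         [0, 80, 90]]
ground_truth = [[0, 0, 0],
                [0, 1, 1],
                [0, 1, 1]]

t_thr = mean_nonzero_threshold(image)   # 290 / 6
detected = binarize(image, t_thr)
iou = intersection_over_union(detected, ground_truth)
print(iou, iou > 0.5)                   # 0.75 True
```

Here three of the four ground-truth pixels are recovered, so the IoU is 3/4 and the detection would count as correct under the 50% rule.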

Fig. 5 Examples of proposed automatic mass detection results. The top row shows the ground truth outlined by radiologists while the bottom row illustrates automatically detected masses for the same mammograms

2.4 ROI Extraction

Once breast masses were detected, we derived two kinds of windows to train and test our proposed CAD system. In Technique A (i.e., Mass ROIs), we utilized 168 regions (i.e., 56 benign masses, 56 malignant masses, and 56 normal regions). Then, we randomly extracted four non-overlapping ROIs of 32 × 32 pixels around the center of each region, yielding 224 ROIs for each class, as shown in Fig. 6a. Thus, a total of 672 ROIs were collected. The ROI size was chosen to obtain reliable statistical features [10, 21, 22]: the smallest sufficient ROI requires at least 800 pixels, and a 32 × 32 ROI provides 1024 pixels. However, one could use more or fewer ROIs as long as a sufficient number of statistical features can be derived.
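One possible way to draw four random, non-overlapping 32 × 32 windows near a region center is rejection sampling, sketched below. The paper does not specify its sampling procedure, so the function name, the `spread` parameter, and the sampling scheme are all assumptions; image-boundary clipping is not handled.

```python
import random


def sample_rois(center, n=4, size=32, spread=48, rng=None, max_tries=10000):
    """Return n top-left corners (row, col) of non-overlapping size x size windows.

    Corners are drawn uniformly within +/- spread pixels of the corner of the
    window centred on `center` (hypothetical scheme, not the paper's).
    """
    rng = rng or random.Random()
    base_r = center[0] - size // 2
    base_c = center[1] - size // 2
    rois = []
    for _ in range(max_tries):
        r = base_r + rng.randint(-spread, spread)
        c = base_c + rng.randint(-spread, spread)
        # Two equal-size squares overlap only if they are closer than `size`
        # along both axes, so one separated axis is enough to accept.
        if all(abs(r - r0) >= size or abs(c - c0) >= size for r0, c0 in rois):
            rois.append((r, c))
            if len(rois) == n:
                return rois
    raise RuntimeError("could not place the requested number of ROIs")


corners = sample_rois(center=(200, 200), rng=random.Random(0))
print(corners)
```

Each returned corner defines a 1024-pixel window, which satisfies the minimum of 800 pixels mentioned above.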

Fig. 6 ROI settings for proposed CAD system. a Four non-overlapping ROIs from each mass (i.e., Mass ROIs), and b Whole Mass ROIs

In Technique B (i.e., Whole Mass ROIs), a whole detected mass was utilized as depicted in Fig. 6b. For benign and malignant cases, after extracting the whole ROIs, rectangular boxes were drawn around the irregular shapes of the masses, which numbered 56 benign and 56 malignant. For normal cases, we utilized 56 regions that were manually extracted from normal mammograms. Thus, a total of 168 ROIs were collected.

2.5 Feature Extraction and Selection

The next step was to derive features representing the breast tissue types. These attributes are quantitative measures of breast tissues that are used to describe the salient characteristics of the tissues [23, 24]. In this study, we used first and higher order statistical features to describe the characteristics of the regions. These statistical features involve both intensity and texture feature types. These features were utilized in previous studies and were shown to be strong enough to distinguish different lesions [13, 25,26,27,28,29,30]. A total of 347 statistical features were extracted from the ROIs. After that, we applied feature selection techniques for the conventional classifiers, but fed the entire feature set to DBN, considering the feature selection capability of deep learning.

2.5.1 First Order Statistical Features

From the pixels in the ROI, the first order attributes were computed by applying statistical analyses to both the grayscale intensities and the histogram of each ROI. Nine features were extracted from the histograms of the ROIs, namely, entropy, modified entropy, standard deviation (SD), modified standard deviation (MSD), energy, modified energy, asymmetry, modified skewness, and range value of the histogram. Other features included mean, SD, smoothness, third moment, entropy, skewness, kurtosis, variance, mode, interquartile range, and percentiles (i.e., quantiles at levels 0.1–0.9) [25, 29, 30]. A total of 28 first order features were derived.
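A few of the intensity-based first order features can be sketched in pure Python as follows. The exact definitions used in [25, 29, 30] are not reproduced in the text, so the choices here (population moments, base-2 entropy over observed gray levels, linearly interpolated quantiles) and all function names are assumptions.

```python
import math


def quantile(sorted_vals, q):
    """Quantile with linear interpolation between order statistics."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def first_order_features(pixels):
    """Subset of first order features for a flat list of ROI gray levels."""
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n  # population variance
    sd = math.sqrt(var)
    skew = sum((p - mean) ** 3 for p in pixels) / n / sd ** 3 if sd > 0 else 0.0
    kurt = sum((p - mean) ** 4 for p in pixels) / n / sd ** 4 if sd > 0 else 0.0
    counts = {}
    for p in pixels:
        counts[p] = counts.get(p, 0) + 1
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    srt = sorted(pixels)
    feats = {
        "mean": mean, "variance": var, "sd": sd,
        "skewness": skew, "kurtosis": kurt, "entropy": entropy,
        "range": srt[-1] - srt[0],
        "iqr": quantile(srt, 0.75) - quantile(srt, 0.25),
    }
    for k in range(1, 10):  # quantiles at levels 0.1 ... 0.9
        feats["q%d0" % k] = quantile(srt, k / 10)
    return feats


print(first_order_features([1, 2, 3, 4, 5]))
```

For a real ROI, `pixels` would be the 1024 gray levels of a 32 × 32 window flattened into one list.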

2.5.2 Higher Order Statistical Features

To take into consideration the spatial inter-relationships of the pixels as well as their grayscale values, the second order attributes were computed on the gray-level co-occurrence matrix (GLCM) as proposed by [31]. The GLCM is the 2D histogram of grayscale intensities for pairs of pixels. We utilized a grayscale quantization (i.e., L = 32), angles of orientation (i.e., θ = 0°, 45°, 90°, and 135°), and displacements (i.e., d = 1, 3, 5, and 9) to create GLCMs [26,27,28, 32]. The value of d is acceptable in the range of 1–10 [30]. For each value of d, we estimated four GLCMs according to θ, resulting in 16 different GLCMs. Then, we extracted 19 different statistical features from each one, for a total of 304 statistical features. These features included energy, contrast, correlation, homogeneity, entropy, maximum probability, inverse difference moment (IDM), variance, sum average, sum entropy, sum variance, difference entropy, difference variance, autocorrelation, dissimilarity, cluster shade, cluster prominence, correlation information 1, and correlation information 2 [21, 24]. In addition, some previous studies have demonstrated that d = 1 and d = 2 for GLCM provided good overall accuracy [31, 32]. Thus, we estimated four GLCMs at d = 2 with respect to θ and then averaged them to calculate the average GLCM. Then, we extracted 15 features from this average GLCM: seven invariant moments, entropy, maximum probability, homogeneity, IDM, variance, uniformity, correlation information 1, and correlation information 2 [31]. Finally, a total of 319 higher order features were derived. All extracted features were normalized to the range [0, 1].
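To make the GLCM construction concrete, the sketch below builds a symmetric, normalized GLCM for a given displacement d and angle θ and derives four of the listed features. It is a simplified illustration with hypothetical function names: the image is assumed to be already quantized to integer levels, the symmetric counting convention is an assumption, and the homogeneity variant shown divides by 1 + |i − j|.

```python
import math

# (row, col) step per unit displacement for each orientation.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}


def glcm(image, d, angle, levels):
    """Symmetric, normalized co-occurrence matrix of a quantized image."""
    dr, dc = (d * s for s in OFFSETS[angle])
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = image[r][c], image[r2][c2]
                counts[a][b] += 1
                counts[b][a] += 1  # count each pair in both directions
    total = sum(map(sum, counts))
    return [[x / total if total else 0.0 for x in row] for row in counts]


def glcm_features(p):
    """Energy, contrast, homogeneity, and entropy of a normalized GLCM."""
    cells = [(i, j, v) for i, row in enumerate(p) for j, v in enumerate(row)]
    return {
        "energy": sum(v * v for _, _, v in cells),
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "homogeneity": sum(v / (1 + abs(i - j)) for i, j, v in cells),
        "entropy": -sum(v * math.log2(v) for _, _, v in cells if v > 0),
    }


# 4 x 4 toy image with 4 gray levels.
toy = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 2, 2, 2],
       [2, 2, 3, 3]]
p = glcm(toy, d=1, angle=0, levels=4)
print(glcm_features(p)["contrast"])  # 14/24, about 0.583
```

Looping this over d = 1, 3, 5, 9 and the four angles on a 32 × 32 ROI quantized to 32 levels would give the 16 matrices described above.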

2.5.3 Feature Selection Techniques

To reduce the redundancy in the features and to select the most prominent features, we utilized four feature selection algorithms: sequential backward (SBS), sequential forward (SFS), sequential floating forward (SFFS), and branch and bound (BBS) as described in [33].
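As an illustration of the simplest of these algorithms, sequential forward selection (SFS) greedily adds the feature that most improves a subset score. The sketch below is generic: `score` is a hypothetical callable (in practice it could be, for example, the cross-validated accuracy of a classifier on the chosen columns), and the stopping rule of a fixed subset size k is an assumption.

```python
def sequential_forward_selection(n_features, score, k):
    """Greedy SFS: grow the subset one feature at a time up to k features.

    `score` maps a list of feature indices to a number (higher is better).
    """
    selected = []
    remaining = list(range(n_features))
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy criterion: each feature has a fixed, additive usefulness.
weights = [0.1, 0.9, 0.5, 0.3]
chosen = sequential_forward_selection(4, lambda s: sum(weights[f] for f in s), k=3)
print(chosen)  # [1, 2, 3]
```

SBS runs the same loop in reverse, starting from all features and removing the least useful one at each step, while SFFS adds a conditional removal step after each addition.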

2.6 Classifier Designs

The selected features were used to train and test the conventional classifiers: linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and neural network (NN). LDA finds the linear combination of features that best separates two or more classes of objects [34, 35]. QDA is a general discriminant function with a quadratic decision boundary that can be used to classify datasets with two or more classes [34, 35]. In this study, we address a 3-class classification problem with multi-dimensional features. Therefore, the task of LDA and QDA is to classify 3 groups from an input of multi-dimensional features. The term neural network (NN) usually denotes a feedforward neural network, while deep neural network (DNN) denotes a feedforward NN with many layers [16]. In comparison to the conventional artificial neural network (ANN), the deep belief network (DBN) has a slightly different structure. DBN utilizes restricted Boltzmann machines (RBMs) whose layers can be trained using the unsupervised contrastive divergence (CD) learning algorithm. Hinton et al. [16] devised this approach and showed that classification can be achieved in deeper architectures when each layer (RBM) is pre-trained with an unsupervised learning algorithm (i.e., the CD algorithm). Then, the network can be trained in a supervised way using backpropagation in order to “fine-tune” the weights. Thus, a DBN-based CAD system can deal with all extracted features without the need for selection algorithms. Therefore, no feature selection method was used, and all features were used for the training and testing of DBN.

2.7 Deep Belief Network (DBN)

The deep belief network (DBN) based on RBM was established by Hinton et al. [16]. An RBM is a generative stochastic neural network with connections only between visible nodes (v) and hidden nodes (h). In an RBM, there are no connections between nodes or units in the same layer; hence, Gibbs sampling can be performed on all units of a layer at once [36]. DBN has a deep structure that builds a model from multiple layers of RBM, as shown in Fig. 7. To develop our DBN for both techniques, we utilized R and Q hidden nodes in the first and second hidden layers, respectively. The training of DBN was achieved through two consecutive processes. First, a pre-training process for the RBMs was performed via unsupervised learning. Then, we utilized supervised learning, applying a back propagation algorithm with the known labels of the breast cancer features to adjust the weights and fine-tune the network. Pre-training on the training data helps the neural network mitigate over-fitting.

Fig. 7 a Structure of our proposed DBN based CAD system for both techniques, b pre-training process, and c back propagation algorithm

One of the benefits of using DBN is the ability to extract and select the more prominent features from the input data. Figure 7b shows three layers of RBM for both of our techniques. Each RBM layer is updated depending on the previous one. Once the first layer is prepared by computing its weight matrix, its hidden layer is considered as the input for the next layer. This process trains the RBMs one after another, and each uses the features extracted by the previous one for learning. So, the dimension of the input data is reduced layer by layer during this process. Thus, the selected features at the hidden nodes of the last layer can be considered as a vector of features. In both structures of our DBN, the contrastive divergence (CD) algorithm with block Gibbs sampling is utilized to update the weight matrix w layer by layer [35,36,37]. There are five consecutive steps to train an RBM. First, all parameters of the network are initialized to zero. These parameters are the weight matrix w and two real-valued bias vectors, A for the visible layer and B for the hidden layer. Second, the logical state of the first hidden layer is computed as follows:

$${\text{h}}_{ 1} = \left\{ {\begin{array}{*{20}c} { 1 ;} & {{\text{f}}\left( {{\mathbf{B}}{\text{ + v}}_{ 1} {\mathbf{w}}^{\text{T}} } \right) > \upvarphi } \\ { 0 ;} & {\text{otherwise}} \\ \end{array} } \right.$$
(7)

where \({\text{f}}\left( {\text{z}} \right) = 1 /\left( { 1 {\text{ + e}}^{\text{ - z}} } \right)\) is a sigmoid activation function, and \(\upvarphi\) is an activation threshold. Third, after \({\text{h}}_{ 1}\) is obtained, the state of the visible layer \({\text{v}}_{\text{recon}}\) is reconstructed according to the following formula:

$${\text{v}}_{\text{recon}} = \left\{ {\begin{array}{*{20}c} { 1 ;} & {{\text{f(}}{\mathbf{A}}{\text{ + h}}_{ 1} {\mathbf{w}} )> \upvarphi } \\ { 0 ;} & {\text{otherwise}} \\ \end{array} } \right..$$
(8)

The fourth step is to compute the state of the hidden layer \({\text{h}}_{\text{recon}}\) using \({\text{v}}_{\text{recon}}\) as follows:

$${\text{h}}_{\text{recon}} = {\text{f}}\left( {{\mathbf{B}} + {\text{v}}_{\text{recon}} {\mathbf{w}}^{\text{T}} } \right) .$$
(9)

Finally, the weight difference \({\mathbf{\Delta w}}\) is estimated to compute the updated weights as follows:

$${\mathbf{w}}_{{{\mathbf{k + 1}}}} = {\mathbf{\Delta w}}{ + }{\mathbf{w}}_{{\mathbf{k}}} ,$$
$${\mathbf{\Delta w}} = \left( {\frac{{{\text{h}}_{ 1} {\text{v}}_{ 1} }}{\upeta }} \right) - \left( {\frac{{{\text{h}}_{\text{recon}} {\text{v}}_{\text{recon}} }}{\upeta }} \right) ,$$
(10)

where \(\upeta\) is the batch size.
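The five steps in Eqs. 7–10 amount to one step of contrastive divergence (CD-1). The sketch below is a minimal pure-Python reading of them under stated assumptions, not the authors' code: the function names are hypothetical, w is stored as a hidden-by-visible nested list, `a` plays the role of A in Eq. 8 and `b` the role of B in Eqs. 7 and 9, only the weights are updated (as in Eq. 10), and the activation threshold φ is assumed to be drawn uniformly from [0, 1) for each unit, which is the usual Gibbs-sampling interpretation. With a fixed φ = 0.5 and all-zero initial parameters, no unit would ever fire and Δw would stay at zero.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cd1_step(w, a, b, batch, rng):
    """Return updated weights after one CD-1 pass over a batch (Eqs. 7-10).

    w: hidden x visible weights, a: visible biases, b: hidden biases,
    batch: list of visible vectors, rng: random.Random used to draw phi.
    """
    n_hid, n_vis = len(w), len(w[0])
    eta = len(batch)  # batch size
    dw = [[0.0] * n_vis for _ in range(n_hid)]
    for v1 in batch:
        # Eq. 7: binary hidden state from the data.
        h1 = [1 if sigmoid(b[j] + sum(v1[i] * w[j][i] for i in range(n_vis)))
              > rng.random() else 0 for j in range(n_hid)]
        # Eq. 8: binary reconstruction of the visible layer.
        v_rec = [1 if sigmoid(a[i] + sum(h1[j] * w[j][i] for j in range(n_hid)))
                 > rng.random() else 0 for i in range(n_vis)]
        # Eq. 9: hidden probabilities given the reconstruction.
        h_rec = [sigmoid(b[j] + sum(v_rec[i] * w[j][i] for i in range(n_vis)))
                 for j in range(n_hid)]
        # Eq. 10: positive minus negative statistics, averaged over the batch.
        for j in range(n_hid):
            for i in range(n_vis):
                dw[j][i] += (h1[j] * v1[i] - h_rec[j] * v_rec[i]) / eta
    return [[w[j][i] + dw[j][i] for i in range(n_vis)] for j in range(n_hid)]


# Toy RBM: 3 visible units, 2 hidden units, zero-initialized parameters.
w0 = [[0.0] * 3 for _ in range(2)]
batch = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
w1 = cd1_step(w0, a=[0.0] * 3, b=[0.0] * 2, batch=batch, rng=random.Random(0))
print(w1)
```

Stacking RBMs then means feeding the hidden activations of one trained RBM to the next as its visible vectors, with `cd1_step` called repeatedly over the batches of each layer.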

As shown in Fig. 8b–d, \({\text{v}}_{ 1}\) represents the visible layer for each RBM during the pre-training process. This means that, after the first RBM is tuned, its hidden layer is considered as the visible layer for the next one, and so on. All of these steps are iterated over the batches until training converges. Figure 8 presents the pre-training process of DBN utilizing the CD algorithm, where red boxes and arrows indicate the current network layers, while blue ones indicate the derived weights and states. Figure 8e shows the learning process of DBN via back propagation, which includes pre-training and fine-tuning of all parameters. For both techniques, all extracted features (i.e., 347 features) were directly utilized as input for DBN through m visible units (i.e., \({\text{V}}_{\text{m}} = {\text{V}}_{ 1} , {\text{ V}}_{ 2} ,\ldots ,{\text{V}}_{ 3 4 7}\)), as illustrated in Fig. 7a. In Technique A (i.e., Mass ROIs), two hidden layers were used with \({\text{R}} = 1 5\;{\text{and}}\;{\text{Q}} = 8\) hidden nodes for RBM and a batch size \(\upeta\) of 10. In Technique B (i.e., Whole Mass ROIs), we also utilized two hidden layers with \({\text{R}} = 2 0\;{\text{and}}\;{\text{Q}} = 8\) hidden nodes and \(\upeta = 7\) for RBM. For both techniques, the batch size \(\upeta\) of DBN was obtained empirically by trying different values (i.e., by trial and error) in order to obtain the best accuracy, as in [36]. In this study, the overall accuracy was derived from 2-fold cross-validation.

Fig. 8 a Proposed DBN layers, b–d schematic diagram of the Contrastive Divergence (CD) technique for greedy layer-wise training, and e learning process of DBN via back propagation algorithm

2.8 Evaluation of CAD system performance

To evaluate our proposed DBN-based CAD system, we utilized overall accuracy and the receiver operating characteristic (ROC) curve with its area under the curve (AUC) [12,13,14, 23]. The ROC curve presents the trade-off between sensitivity and specificity, and a higher AUC indicates a higher rate of correct classification [13]. The ROC curve is defined based on sensitivity and specificity as follows:

$${\text{Sensitivity}} = \frac{\text{TP}}{\text{TP + FN}} ,$$
(11)
$${\text{Specificity}} = \frac{\text{TN}}{\text{TN + FP}},$$
(12)

where sensitivity represents the ability to measure disease appearance as abnormal, while specificity represents the ability to measure absence of disease as normal. Overall accuracy represents the ability of the system to distinguish between the different classes (i.e., normal, benign, or malignant) and is defined as follows:

$${\text{Overall accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}} ,$$
(13)

where true positive (TP) and true negative (TN) represent the numbers of studies that are classified correctly as positives and negatives, respectively. False positive (FP) indicates the negative studies that are incorrectly distinguished as positives. False negative (FN) indicates the positive studies that are incorrectly classified as negatives. In this study, we utilized the pairwise comparison method between each two classes to compute all of these metrics, as presented in [38].
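Eqs. 11–13 can be applied pairwise to a 3-class confusion matrix as in the sketch below. The reading of the pairwise comparison (restricting the matrix to the two classes being compared) is an assumption about [38], the function names are hypothetical, and the confusion matrix is invented for illustration and is not taken from Table 2.

```python
def pairwise_metrics(conf, pos, neg):
    """Sensitivity, specificity, and accuracy for one class pair.

    conf[t][p] is the number of samples of true class t predicted as class p;
    only the rows and columns of classes `pos` and `neg` are used.
    """
    tp, fn = conf[pos][pos], conf[pos][neg]
    tn, fp = conf[neg][neg], conf[neg][pos]
    return {
        "sensitivity": tp / (tp + fn),                 # Eq. 11
        "specificity": tn / (tn + fp),                 # Eq. 12
        "accuracy": (tp + tn) / (tp + tn + fp + fn),   # Eq. 13
    }


def multiclass_accuracy(conf):
    """Fraction of all samples on the diagonal of the confusion matrix."""
    correct = sum(conf[i][i] for i in range(len(conf)))
    return correct / sum(map(sum, conf))


# Illustrative counts only; class order: 0 = normal, 1 = benign, 2 = malignant.
conf = [[9, 1, 0],
        [1, 8, 1],
        [0, 2, 8]]
print(pairwise_metrics(conf, pos=1, neg=0))  # benign against normal
print(multiclass_accuracy(conf))             # 25/30
```

In this made-up example, benign against normal gives a sensitivity of 8/9 and a specificity of 9/10.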

3 Results and Discussion

For data preparation in this work, we randomly split the total cases into training and testing datasets. That is, a total of 168 whole cases (or masses) were randomly split into a training dataset (84 cases) and a test dataset (84 cases).

In Technique A (i.e., Mass ROIs), we extracted four ROIs from each mass in the training dataset and collected a total of 336 ROIs for training. For testing, we also extracted four ROIs from each mass in the test dataset and classified each mass. In Technique B (i.e., Whole Mass ROIs), we trained the classifiers with the features from the whole mass in the training dataset. After training, we classified each mass from the test dataset. In both techniques, training and testing sets contained equal numbers of cases for each class. In this study, all results were derived as the average of 2-fold cross-validation with all classifiers.

Our two techniques using a CAD system based on DBN were evaluated in terms of overall accuracy and area under the ROC curve. All classifiers were used to classify three kinds of breast tissue: normal, benign, and malignant. The conventional classifiers (i.e., LDA and QDA) depend on feature selection algorithms to select the most prominent features. The main purpose of using feature selection techniques is to reduce the feature dimension so that the classifiers are strong enough to distinguish between the different lesions of breast tissue. The NN classifier, on the other hand, has a structure that can deal with all features, so its performance does not change much with the selection algorithm. In contrast, we fed all extracted features directly to DBN without any dimension reduction. Overall accuracies of our DBN-based CAD system compared with other classifiers in both techniques are reported in Table 1. The performance of each conventional classifier varied with the feature selection algorithm used. In Technique A, the performance of NN with SFFS was better than those of LDA and QDA, with an accuracy of 83.85%. The performances of QDA and LDA were better with SBS than with the other selection algorithms. DBN, however, achieved a much higher classification rate with an accuracy of 92.86%. In Technique B, DBN again showed better performance than all other classifiers, with an accuracy of 90.48%. In this technique, all conventional classifiers provided high accuracy with the SFFS and SFS algorithms. The accuracy of QDA with SFFS was 1.19% higher than with SFS. From the results in Table 1, we conclude that DBN provides higher accuracy than the other methods in both techniques. The accuracy of DBN in Technique A was 2.38% higher than in Technique B. Among the conventional classifiers, the NN classifier showed better performance than LDA and QDA for both techniques. This indicates that our DBN-based CAD is preferable to the other classifiers for obtaining a highly accurate breast cancer diagnosis.

The confusion matrices of our three classes for DBN are shown in Table 2. In Technique A, the error (i.e., misclassification) of mixing benign with normal cases was 3.57%, while the error of mixing malignant with benign was 17.86%. In Technique B, there was no mixing error for benign with normal, but the error rate for confusing benign with malignant was 28.57%. The ROC curve is also used to verify the performance of the CAD system. Due to the misclassification rate between normal and benign for both techniques, the \({\text{AUC}}_{\text{NB}}\) (i.e., AUC for the ROC curve between 1-specificity for normal and sensitivity for benign) in Technique A was slightly lower than that in Technique B, as reported in Table 3. Due to the false rates of classification between benign and malignant, as shown in Table 2, the values of \({\text{AUC}}_{\text{BM}}\) (i.e., AUC for the ROC curve between 1-specificity for benign and sensitivity for malignant) were 93.54% and 86.56% for Techniques A and B, respectively. Therefore, the average of all AUCs in Technique A (97.26%) was higher than that in Technique B (95.39%). Figure 9 shows an example of ROC curves for DBN against other classifiers corresponding to 1-specificity for normal and sensitivity for benign, together with the AUCs (i.e., \({\text{AUC}}_{\text{NB}}\) for all classifiers) for both techniques. ROC curves of the conventional classifiers are depicted with the most robust feature selection algorithm, SFFS. Both Techniques A and B with DBN provided much higher AUCs compared with other classifiers. This indicates that a DBN-based CAD system can distinguish different breast abnormalities with higher probability.

Figure 10 shows the capability of DBN along with other conventional classifiers with respect to different dataset sizes in terms of accuracy and stability, as in [14]. These results show that the performance of DBN improves slightly as the size of the dataset increases. Compared with the conventional classifiers, DBN remained stable and showed some improvement in performance as the dataset size increased. A much larger dataset on a more powerful machine would likely provide better characterization of the clusters and hence better performance. The presented results of our proposed CAD system utilizing DBN show the feasibility of distinguishing different breast tissues. The usability of deep learning techniques such as DBN or CNN has been validated for various medical applications such as prostate cancer detection [39], breast cancer detection [13, 14], and discriminative cue integration for medical image annotation [21]. In this work, we developed a DBN-based CAD system to distinguish among three different breast tissue types, an improvement over previous studies [40, 41] that differentiated only two tissue types (i.e., benign against malignant). Currently, we are investigating the application of a convolutional neural network (CNN) for even better performance with a CAD system [13,14,15, 39,40,41,42,43].

Table 1 Accuracy of the proposed DBN based CAD system against other conventional classifiers
Table 2 Confusion matrices of DBN based CAD system for both techniques
Table 3 Performance analysis of DBN based CAD system
Fig. 9 ROC curves of CAD system for normal against benign classes in a Technique A and b Technique B

Fig. 10 Comparison of the accuracies of DBN and the conventional classifiers with different dataset sizes in a Technique A and b Technique B

4 Conclusions

In this paper, we present a DBN-based CAD system to classify breast tissues into three classes: normal, benign, and malignant. We present an automatic mass detection algorithm to identify suspicious masses on mammograms. Then, statistical features are derived from the detected masses. After that, we utilize DBN to investigate its potential for a CAD system with breast tissues. The results of the DBN-based CAD system demonstrate significantly improved performance compared to conventional CAD systems. The results indicate that CAD systems with deep learning capability offer great potential for computer-aided detection of breast cancers. For the two proposed techniques, there was a slight difference in performance. However, Technique A could be preferable for microcalcification problems, especially when the expected target is relatively small in size. Otherwise, Technique B could be preferable when the entire mass is considered.