1 Introduction

Cancer is one of the main causes of morbidity and mortality worldwide: in 2012, 14 million new cancer cases were diagnosed and 8.2 million patients died of cancer [1]. In the same year, 1.7 million women were diagnosed with breast cancer, while 6.3 million women were living with a breast cancer diagnosed within the 5 years before 2012 [2]. Relative to 2008, breast cancer incidence had increased by 20% and mortality by 14% [2].

Breast cancer clearly grows fast and extensively, particularly in women. It is therefore important to propose breast cancer prediction methods for early diagnosis and treatment in order to minimize the risk of metastasis of the cancerous tissue. In medical fields, the artificial neural network (ANN) has attracted considerable attention; owing to its high capability, it is used in a wide range of decision-making tasks. Saritas [3, 4], Saad [5], Groshev [6], and Wahab [7] used the ANN in cancer diagnosis. Many researchers have also combined the ANN with other techniques for breast cancer diagnosis. Karabatak et al. [8] integrated an ANN with an expert system based on association rules: the association rules were used to reduce the breast cancer database, the ANN was applied for intelligent classification, and the results of the combined approach were compared with those of the ANN alone. Huang et al. [9] diagnosed breast cancer using ANN classification with entropy-based feature selection: they reduced the dataset using feature selection, trained the ANN with the Levenberg-Marquardt algorithm, and employed particle swarm optimization (PSO) to determine the optimal ANN parameters. Senapati et al. [10] investigated a local linear wavelet neural network for breast cancer recognition and optimized it by training its parameters using recursive least squares.

In addition to the ANN, many researchers have used other computational tools for the prediction and classification of breast cancer patterns. Ludwig and Roos [11] predicted breast cancer using machine learning models based on genetic programming, investigating both a linear programming approach and genetic programming. Chao et al. [12] used support vector machines (SVM), logistic regression, and decision trees for the prediction of breast cancer survival, building a model of the survival level of patients with breast cancer and validating it with 10-fold cross-validation. Shin et al. [13] investigated the prediction of low-risk breast cancer tumors from perfusion parameters and the apparent diffusion coefficient, developing an empirical model for the prediction of low-risk tumors using logistic regression analysis and analysis of the system performance characteristic curve. Korkmaz et al. [14] used the texture of optical microscope and mammography images, together with relative entropy via kernel estimation, for breast cancer diagnosis; for early breast cancer diagnosis from mammography and histopathology images, they evaluated the results with three divergence measures (triangular, Jensen-Shannon, and Hellinger). Nugroho et al. [15] investigated machine learning-based methods to improve the detection of breast cancer using mammography, proposing a model based on naive Bayes and sequential minimal optimization (NB-SMO).

It should be noted that breast cancer, caused by the abnormal proliferation of breast cells, is a multistep and complicated disease with particular biological features and clinical behaviors [16]. Certainly, accurate diagnosis of breast cancer minimizes morbidity and mortality and increases the survival rate. Given that traditional diagnostic approaches are time consuming and of low precision, an efficient, accurate, and fast method is needed to help physicians in the diagnosis and prediction of this disorder.

The innovation of this paper is dealing with the above-mentioned issues using a hybrid computational model based on an unsupervised learning technique (self-organizing map) and a supervised one (complex-valued neural network), resulting in a reliable detection of breast cancer. The self-organizing map (SOM) and the complex-valued neural network (CVNN) are two popular and practical methods, in both industry and medicine, for clustering and classification purposes, respectively [17,18,19,20,21].

The rest of this paper is organized as follows. Section 2 explains the methods and the dataset used, along with the five features most significant in the detection of breast cancer. Section 3 describes the obtained results in detail. Section 4 then examines the proposed model (SOM-CVNN) and compares the obtained results with those of other models. Finally, Section 5 presents the conclusion and future work.

2 Methods

2.1 Data Acquisition and Feature Selection

The dataset used in this article includes 822 patients and was provided by M. Elter, R. Schulz-Wendtland, and T. Wittenberg [22]. Although newer techniques such as ultrasonography and MRI exist, their limitations in screening mean they cannot replace traditional mammography. Mammography can detect abnormalities in breast tissue as small as 0.5 cm, whereas a breast mass of 1 cm may go undetected by physicians in physical examination. A mass can be either benign or malignant. Benign masses do not pose a threat to health and their cells do not multiply rapidly; however, some types of benign breast conditions are linked to breast cancer risk. For example, proliferative lesions with atypia are associated with the greatest breast cancer risk [23]. Malignant masses have the potential to be dangerous for patients: they grow quickly and invade nearby tissues, organs, and other parts of the body, which can cause damage. As can be seen in Fig. 1, a mass is a space-occupying lesion seen in two different projections. If a potential mass is seen in only a single projection, it should be called an “asymmetry” until its three-dimensionality is confirmed. Mammographic mass features include the shape, density, and margin of the breast mass. The shape of a mass is round, oval, or irregular. The margin of a lesion can be circumscribed (historically “well-defined”, known as a benign appearance), obscured or partially obscured, microlobulated, indistinct (historically “ill-defined”), or spiculated. One of the most significant features indicating a malignant mass is an irregular or spiculated margin [24]. The density of a mass is expressed relative to the expected attenuation of an equal volume of fibroglandular tissue and has been shown to be a risk factor in breast cancer [25]. High density is associated with malignancy; it is extremely rare for a breast cancer tissue to be of low density.

Fig. 1 The five most significant patient features

On the other hand, the American College of Radiology established the Breast Imaging Reporting and Data System (BI-RADS) as a guideline for the diagnosis, prevention, and treatment of patients. BI-RADS is an instrument for the qualitative expression and grading of risk in breast mammography and can reduce confusion in breast imaging interpretations. Employing this system makes mammogram reports standard and understandable for non-radiologists and improves the communication between physician and radiologist [26]. BI-RADS was updated in 2011 and divided into five categories [27]. All of these features are depicted in Fig. 1.

2.2 Procedural information

2.2.1 SOM

Kohonen developed the SOM in the 1980s [28]. The SOM is a very popular artificial neural network model based on the unsupervised learning paradigm [29, 30]. Its operation consists of three phases, namely, competition, cooperation, and adaptation. The technique maps large-scale data to fewer dimensions by grouping the most similar data into distinct clusters; clustering and data compression are its original applications. The SOM generates a mapping from a continuous high-dimensional input space Ψ onto a discretized low-dimensional output space λ by learning from past examples. The discrete output space λ comprises k neurons arranged in a fixed topology (a 2D hexagonal or rectangular lattice). The weight vectors W = (w1, w2, …, wk) define the mapping c(x) : Ψ → λ, which assigns each input vector x(t) to a neuron index (the competition phase):

$$ i^{\ast}(t) = \arg\min_{i} \left\Vert x(t) - W_i(t) \right\Vert $$
(1)

where ‖·‖ denotes the Euclidean distance and t is the current (discrete) iteration. It is noteworthy that the weight vectors have the same dimensionality as the input patterns.

To train the weight vectors, a competitive-cooperative learning rule is adopted. When an input vector is fed to the network, the weight vector of the winner neuron as well as that of its neighbors are updated as follows (known as adaptation phase):

$$ W_i(t+1) = W_i(t) + \eta(t)\, h(i^{\ast}, i; t)\, \left[ x(t) - W_i(t) \right] $$
(2)

Hence, the weight vectors of the adapted neurons move slightly toward the input vector. The learning rate η, which decreases exponentially with time, controls the magnitude of the movement. A neighborhood function h determines which neurons are impacted by this adaptation. The neighborhood function is typically unimodal, symmetric, and monotonically decreasing as the distance to the winner increases. The Gaussian function is a popular option (the cooperation phase):

$$ h(i^{\ast}, i; t) = \exp\left( -\frac{\left\Vert r_i(t) - r_{i^{\ast}}(t) \right\Vert^2}{2\sigma^2(t)} \right) $$
(3)

where \( \left\Vert r_i(t) - r_{i^{\ast}}(t) \right\Vert \) is the distance between neurons i and i∗ in the discrete output space λ, and σ(t) is the radius of the neighborhood function at time t, which decreases exponentially to guarantee that the neighborhood size shrinks during training.

The neighborhood function is chosen to embrace a large part of the output space λ at the onset of learning and is then decreased gradually, so that toward the end of the process only the winner is adapted. When the global ordering of the weight vectors reaches a steady state, the map is said to have converged. The preservation of neighborhood relations, i.e., the mapping of nearby data vectors in the input space onto adjacent neurons in the output space, is a significant characteristic of the resulting map. Because of this topology-preserving property, the low-dimensional output space can display structure hidden in the high-dimensional data, e.g., clusters and spatial relationships.
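
To make the three phases concrete, the following is a minimal Python sketch of SOM training according to Eqs. (1)-(3); the grid size, decay schedules, iteration count, and function names are illustrative assumptions rather than the exact settings used in this paper.

```python
import numpy as np

def train_som(data, grid=(2, 2), n_iter=1000, eta0=0.5, sigma0=1.0, seed=0):
    """Train a SOM on `data` (n_samples x n_features); illustrative sketch."""
    rng = np.random.default_rng(seed)
    k = grid[0] * grid[1]
    weights = rng.random((k, data.shape[1]))          # weight vectors W
    # Fixed lattice positions r_i of the k output neurons
    positions = np.array([(r, c) for r in range(grid[0])
                          for c in range(grid[1])], dtype=float)
    for t in range(n_iter):
        x = data[rng.integers(len(data))]             # input vector x(t)
        # Competition (Eq. 1): index of the best-matching neuron
        winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
        eta = eta0 * np.exp(-t / n_iter)              # decaying learning rate
        sigma = sigma0 * np.exp(-t / n_iter)          # decaying radius
        # Cooperation (Eq. 3): Gaussian neighborhood around the winner
        d2 = np.sum((positions - positions[winner]) ** 2, axis=1)
        h = np.exp(-d2 / (2 * sigma ** 2))
        # Adaptation (Eq. 2): move weights toward the input
        weights += eta * h[:, None] * (x - weights)
    return weights, positions

def assign_clusters(data, weights):
    """Assign each sample to its best-matching neuron (cluster index)."""
    d = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return np.argmin(d, axis=1)
```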

2.2.2 Min-max normalization technique

Sometimes, the values of the input and output parameters are extremely low or high, so the raw data may not be suitable for direct use and needs to undergo preprocessing, i.e., scaling. One approach to scaling uses the following formula, which normalizes the data to values between 0 and 1 [31, 32]:

$$ X_i^{\prime} = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} $$
(4)

where \( x_i \) is the original value of the parameter, \( X_i^{\prime} \) is its normalized value, and \( x_{\min} \) and \( x_{\max} \) are the parameter's minimum and maximum values.
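
As a small sketch, Eq. (4) can be applied column-wise to a feature matrix (names are illustrative):

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature (column) of X to [0, 1] via Eq. (4)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # Note: a constant column (x_max == x_min) would divide by zero
    # and should be handled separately in practice.
    return (X - x_min) / (x_max - x_min)
```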

2.2.3 Transferring problem space

A phase transformation is used to transfer the input feature space from real values to complex values as follows [20]:

If \( x_j \in [a, b] \), where a, b ∈ ℝ, then the phase ϕ is defined as follows:

$$ \phi = \frac{\frac{\pi}{2}\left( x_j - a \right)}{b - a} $$
(5)

Then, assuming a constant magnitude and using Euler's formula, the complex variable \( z_j \) corresponding to the input variable \( x_j \) is as follows:

$$ f_{R\to C} = z_j = e^{i\phi} = \cos(\phi) + i\,\sin(\phi) $$
(6)

As can be observed from the above equation, this conversion maps the input feature space from \( x \in [a, b] \) to \( \phi \in \left[0, \frac{\pi}{2}\right] \).
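
A small sketch of Eqs. (5)-(6) follows; after the min-max normalization of Section 2.2.2 the bounds are a = 0 and b = 1 (the function name is illustrative):

```python
import numpy as np

def real_to_complex(x, a=0.0, b=1.0):
    """Map x in [a, b] to a unit-magnitude complex number (Eqs. 5-6)."""
    phi = (np.pi / 2) * (x - a) / (b - a)   # phase in [0, pi/2], Eq. (5)
    return np.exp(1j * phi)                 # cos(phi) + i sin(phi), Eq. (6)
```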

2.2.4 CVNN

These networks have been highly regarded since the 1980s, and it has been shown that the CVNN makes it possible to solve complicated problems that cannot be solved by real-valued neural networks [19]. The networks have developed in both theoretical and practical directions, so that today they have many applications in areas such as digital image processing and data processing [32]. Consider the set \( \{(z_1, y_1), \cdots, (z_t, y_t), \cdots, (z_N, y_N)\} \), where \( z_t \) and \( y_t \) are m- and s-dimensional complex-valued inputs and outputs, respectively (\( z_t \in C^m \), \( y_t \in C^s \)), applied to a CVNN. The network operates on an m-dimensional complex input \( z_t = [z_1\, z_2 \cdots z_m] \), with complex activation function \( f_a(\cdot) \), complex input weight matrix \( V^0 \), and complex output weight matrix \( V^1 \), to produce an s-dimensional complex output \( \hat{y} = [\hat{y}_1\, \hat{y}_2 \dots \hat{y}_s] \). The structure of the CVNN is shown in Fig. 2. The CVNN has m neurons in the input layer, h neurons in the hidden layer, and s neurons in the output layer, represented as ξ m : h : s. The activation functions in the hidden and output layers also take complex values. The output of the kth neuron in the hidden layer is calculated as follows:

$$ z_h^k = f_a\left( \sum_{j=1}^{m} V_{kj}^{0}\, z_j \right);\quad k = 1, 2, \dots, h $$
(7)
Fig. 2 Structure of a CVNN [33]

In the above equation, \( {V}_{kj}^0 \) is a complex weight that connects the jth input neuron to the kth hidden neuron and f a is the activation function.

Similarly, for the output of the lth output neuron of CVNN, we have

$$ \hat{y_l}={f}_a\left(\sum_{k=1}^h{V}_{lk}^1{z}_h^k\right);l=1,2,\dots, s $$
(8)

In the above equation, \( V_{lk}^1 \) is a complex weight that connects the kth hidden neuron to the lth output neuron and \( f_a \) is the activation function. The objective of the CVNN is the proper estimation of the parameters \( V^0 \) and \( V^1 \) so as to minimize the following error:

$$ E=\frac{1}{2}{e}^He=\frac{1}{2}\sum_{k=1}^n{e}_k{\overline{e}}_k $$
(9)

In the above equation, \( e_k = y_k - \hat{y}_k \), \( \overline{e}_k \) is its complex conjugate, and the superscript H denotes the Hermitian (conjugate) transpose. The back propagation algorithm with complex values is used for estimating the parameters \( V^0 \) and \( V^1 \), as described in detail in [33]. The rule for updating the output weight connecting the kth hidden neuron and the lth output neuron is as follows:

$$ \Delta v_{lk}^{1} = \eta_v\, \delta_l\, \overline{z}_h^k $$
(10)

In this relationship, \( \delta_l = y_l - \hat{y}_l \) and \( \eta_v \) is the learning rate, which can be a real or complex value. Similarly, the first derivative of the mean square error (MSE) with respect to the input weights \( v_{kj}^0;\; k = 1, \dots, h;\; j = 1, \dots, m \) is used to update the input weight connecting the jth input neuron and the kth hidden neuron:

$$ \Delta v_{kj}^{0} = \eta_v \left( \sum_{l=1}^{s} \delta_l\, \overline{v}_{lk}^{1} \right) \overline{f}_a^{\prime}\left( \sum_{j=1}^{m} v_{kj}^{0} z_j \right) \overline{z}_j;\quad k = 1, \dots, h,\; j = 1, \dots, m $$
(11)

In the above equation, \( \overline{f}_a^{\prime} \) is the complex conjugate of the derivative of the activation function.
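
The forward pass of Eqs. (7)-(8) and the output-weight update of Eq. (10) can be sketched as follows. The complex tanh activation, weight scales, and learning rate are illustrative stand-ins, not the paper's exact choices (the paper uses an exponential activation, Section 2.3.2):

```python
import numpy as np

rng = np.random.default_rng(0)
m, h, s = 5, 8, 1                      # input, hidden, output neuron counts
V0 = 0.1 * (rng.standard_normal((h, m)) + 1j * rng.standard_normal((h, m)))
V1 = 0.1 * (rng.standard_normal((s, h)) + 1j * rng.standard_normal((s, h)))
f_a = np.tanh                          # fully complex activation (assumption)

def forward(z):
    z_h = f_a(V0 @ z)                  # Eq. (7): hidden-layer outputs
    y_hat = f_a(V1 @ z_h)              # Eq. (8): network outputs
    return z_h, y_hat

def updated_output_weights(z, y, eta=0.01):
    z_h, y_hat = forward(z)
    delta = y - y_hat                  # delta_l = y_l - y_hat_l
    # Eq. (10): Delta v_lk = eta * delta_l * conj(z_h^k)
    return V1 + eta * np.outer(delta, np.conj(z_h))
```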

2.2.5 Transferring problem space with complex values to that with real values

In the previous section, the phase conversion was carried out to transfer the problem space from real values to complex values, treating the complex number's magnitude as constant. Therefore, the following relationship can be used to convert complex values back to the initial real values:

$$ f_{C\to R} = \arctan\left(\frac{y}{x}\right) \Big/ \frac{\pi}{2} $$
(12)

In the above equation, y and x are the imaginary and real parts of the complex number, respectively. The phase value ϕ of the complex number is obtained by computing \( \arctan\left(\frac{y}{x}\right) \). Furthermore, given that the conversion from real to complex values maps the input feature space from \( x \in [a, b] \) to \( \phi \in \left[0, \frac{\pi}{2}\right] \), the real value corresponding to the complex number is obtained by dividing the phase by \( \frac{\pi}{2} \).
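
A sketch of Eq. (12), assuming the unit-magnitude encoding of Section 2.2.3 with a = 0 and b = 1, so that the recovered phase divided by π/2 is the normalized real value:

```python
import numpy as np

def complex_to_real(z):
    """Recover the normalized real value from a unit-magnitude complex
    number via Eq. (12)."""
    phi = np.arctan2(z.imag, z.real)   # phase, in [0, pi/2] for this encoding
    return phi / (np.pi / 2)
```

Round-tripping through `real_to_complex` from Section 2.2.3, e.g. `complex_to_real(real_to_complex(0.3))`, returns the original value up to floating-point error.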

2.3 The proposed computational intelligence model (SOM-CVNN)

In this paper, after normalizing the 822 input samples, each with five features, and applying the unsupervised learning technique (SOM), separate clusters were defined; each cluster contains patients with a maximum degree of similarity. The sample values of each cluster were then transferred to the complex space, split into training and testing sets (693 samples for training and 129 samples for testing), and applied to the supervised learning technique (CVNN). In this stage, selecting a suitable neural network structure, including the number of neurons in the hidden layer, the learning rate, and the initial weights for each cluster's samples, is very important. Then, if the accuracy of the trained network is satisfactory (the magnitude and phase errors of the complex output are minimal), the testing samples are applied to the network; otherwise, the previous step is repeated, the network architecture is modified, and efforts are made to reduce the error. In the end, the complex output of the network is transferred back to the real space, and the evaluation criteria of the confusion matrix and the receiver operating characteristic (ROC) are computed over the total samples of all clusters. The overall process of the proposed algorithm is shown in the flowchart of Fig. 3.
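
Assuming the helper sketches from the previous sections are available, the overall flow of Fig. 3 could look roughly as follows; `train_cvnn` and its `predict` method are hypothetical stand-ins for the complex-valued BP training of Section 2.2.4, and the split handling is simplified:

```python
import numpy as np

def som_cvnn_pipeline(X, y, is_train):
    """Rough sketch of Fig. 3: normalize, cluster with a 2x2 SOM,
    then train one CVNN per cluster on complex-encoded features."""
    Xn = min_max_normalize(X)                    # Section 2.2.2
    weights, _ = train_som(Xn, grid=(2, 2))      # 2x2 SOM -> 4 clusters
    clusters = assign_clusters(Xn, weights)
    predictions = np.empty(len(Xn))
    for c in np.unique(clusters):
        idx = clusters == c
        Z = real_to_complex(Xn[idx])             # Section 2.2.3
        mask = is_train[idx]
        model = train_cvnn(Z[mask], y[idx][mask])        # hypothetical trainer
        predictions[idx] = complex_to_real(model.predict(Z))  # Section 2.2.5
    return predictions
```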

Fig. 3 The flowchart of the proposed computational intelligence model

2.3.1 The proposed structure of SOM for patient clustering

There is no appropriate theoretical basis for choosing the optimal size of the SOM (number of neurons = number of clusters); the size is determined according to the application at hand [34]. In this paper, considering the small number of input samples (822), the SOM is taken to be a 2 × 2 network of neurons, and consequently the number of clusters is 4.

2.3.2 The proposed structure of CVNN for breast cancer detection

Here, the intention is to use the exponential function as the activation function for processing non-linear complex data, together with a logarithmic error function comprising complex-valued magnitude and phase errors. Moreover, the back propagation (BP) learning algorithm with complex values was used for CVNN training. The complex-valued BP learning algorithm based on the logarithmic error function is quite similar to the BP learning algorithm based on the mean square error.
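
The exact form of the logarithmic error is not spelled out here; one common variant for complex-valued networks, in which magnitude and phase errors appear as separate terms, is sketched below. That this is the precise variant used in this paper is an assumption:

```python
import numpy as np

def log_error(y, y_hat, eps=1e-12):
    """0.5 * sum of squared log-magnitude and phase errors (assumed form)."""
    mag_err = np.log((np.abs(y) + eps) / (np.abs(y_hat) + eps))
    phase_err = np.angle(y) - np.angle(y_hat)
    return 0.5 * np.sum(mag_err ** 2 + phase_err ** 2)
```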

In the following, the pseudo-code of the proposed computational model is presented step by step (Fig. 4).

Fig. 4 Pseudo-code of the proposed computational intelligence model

2.4 Statistics (confusion matrix and ROC curve analysis)

In statistics, a ROC curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The TPR is also known as sensitivity, recall, or probability of detection in machine learning [35]. The FPR is also known as the fall-out or probability of false alarm and can be calculated as (1 − specificity). The ROC curve is thus the sensitivity as a function of the fall-out. In general, if the TPR and FPR are known, the ROC curve can be generated by plotting the cumulative distribution function of the TPR (the area under the probability distribution from −∞ to the discrimination threshold) on the y axis versus the cumulative distribution function of the FPR on the x axis.
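
As an illustration (variable names are ours), an ROC curve can be traced by sweeping a threshold over the classifier's scores and recording the (FPR, TPR) pairs:

```python
import numpy as np

def roc_points(scores, labels):
    """ROC points from [0, 1] scores and binary labels (1 = malignant)."""
    points = []
    for thr in np.sort(np.unique(scores))[::-1]:
        pred = scores >= thr
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        tn = np.sum(~pred & (labels == 0))
        points.append((fp / (fp + tn), tp / (tp + fn)))   # (FPR, TPR)
    return np.array(points)
```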

In this article, a diagnostic test that seeks to determine whether a patient has a cancerous or non-cancerous mass (malignant or benign) was considered. There are four possible outcomes from a binary classifier. If the outcome from the detection model is P and the actual value from the medical diagnosis is also P, it is called a true positive (TP); if the actual value is N, it is a false positive (FP). Conversely, a true negative (TN) occurs when both the model outcome and the actual value are N, and a false negative (FN) occurs when the model outcome is N while the actual value is P. A FP in this case occurs when the person tests positive but does not actually have the disease; a FN occurs when the person tests negative, suggesting they are healthy, when they actually do have a malignant mass. Consider an experiment with P positive patients and N negative patients for some condition. The above definitions are then as follows:

$$ \mathrm{TPR} = \mathrm{Sensitivity} = \Sigma\ \mathrm{True\ positive\ (TP)} \,/\, \Sigma\ \mathrm{Condition\ positive\ (TP + FN)} $$
(13)
$$ \mathrm{TNR} = \mathrm{Specificity} = \Sigma\ \mathrm{True\ negative\ (TN)} \,/\, \Sigma\ \mathrm{Condition\ negative\ (TN + FP)} $$
(14)
$$ \mathrm{FPR} = 1 - \mathrm{TNR}\ \mathrm{(specificity)} $$
(15)

ROC analysis is related to the accuracy ratio, a common technique for judging the accuracy of default probability models. ROC curves are widely used in laboratory medicine to assess the diagnostic accuracy of a test, to choose its optimal cutoff, and to compare the diagnostic accuracy of several tests. Thus, we also use the accuracy criterion, with the following formula:

$$ \mathrm{Accuracy} = \left( \Sigma\ \mathrm{True\ positive\ (TP)} + \Sigma\ \mathrm{True\ negative\ (TN)} \right) / \Sigma\ \mathrm{Total\ population\ (P + N)} $$
(16)
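
Eqs. (13)-(16) translate directly into code; as a small sketch:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics of Eqs. (13)-(16) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                  # TPR, Eq. (13)
    specificity = tn / (tn + fp)                  # TNR, Eq. (14)
    fpr = 1 - specificity                         # FPR, Eq. (15)
    accuracy = (tp + tn) / (tp + fp + tn + fn)    # Eq. (16)
    return sensitivity, specificity, fpr, accuracy
```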

An ROC space is defined by the FPR and TPR as the x and y axes, respectively, and depicts the relative trade-off between true positives (benefits) and false positives (costs). Since the TPR is equivalent to sensitivity and the FPR is equal to 1 − specificity, the ROC graph is sometimes called the sensitivity versus (1 − specificity) plot. Each prediction result, or instance of a confusion matrix, represents one point in the ROC space. The best possible prediction method would yield a point in the upper left corner, coordinate (0, 1), of the ROC space, representing 100% sensitivity (no false negatives) and 100% specificity (no false positives); the (0, 1) point is therefore called a perfect classification. In addition, the diagonal divides the ROC space: points above the diagonal represent good classification results, and points below the line represent poor ones.

3 Results

In this article, the detection outputs of the CVNN vary in the interval [0, 1], and candidate thresholds are defined over these outputs. The ROC analysis displays, for each possible threshold value, the values of the various performance indices. We wish to strongly penalize diagnostic errors, particularly the case where sick patients are not detected (FN). So, for each threshold, different cost values are assigned to the TP, TN, FP, and FN outcomes. For example, if we decide to report a positive mass status when the mass severity is greater than or equal to 0.9, the sensitivity is 0.31, the specificity is 0.98, and the total cost is 172. In fact, we decided to select an appropriate threshold based on the total cost values; for this purpose, the values assigned to TP, TN, FP, and FN were 1, 1, 2, and 4, respectively. As can be seen in Fig. 5, the decision plot allows choosing the threshold value that minimizes the cost. This value corresponds to a severity of 0.5 for both the training and testing sections.
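
A sketch of this cost-based threshold selection, using the stated weights (TP = 1, TN = 1, FP = 2, FN = 4); names are illustrative:

```python
import numpy as np

def best_threshold(scores, labels, c_tp=1, c_tn=1, c_fp=2, c_fn=4):
    """Return the threshold over [0, 1] scores that minimizes total cost."""
    best, best_cost = None, np.inf
    for thr in np.unique(scores):
        pred = scores >= thr
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        tn = np.sum(~pred & (labels == 0))
        cost = c_tp * tp + c_tn * tn + c_fp * fp + c_fn * fn
        if cost < best_cost:
            best, best_cost = thr, cost
    return best, best_cost
```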

Fig. 5 Decision plots for a training phase with 693 patients and b testing phase with 129 patients

Thus, according to Fig. 6, based on the ROC analysis, the area under the ROC curve for the CVNN training and testing data was 0.92 and 0.93, respectively.

Fig. 6 Graph of the ROC analysis for a training data (693 patients) and b testing data (129 patients)

4 Discussion

For the training and testing phases, the four outcomes obtained from the CVNN are shown in Tables 1 and 2 as 2 × 2 confusion matrices.

Table 1 Confusion matrix for training samples (693 patients in four clusters)
Table 2 Confusion matrix for testing samples (129 patients in four clusters)

According to the confusion matrix analysis of the test dataset, the ratio of correctly identified sick patients among the truly sick (sensitivity) is 95%, the ratio of correctly identified healthy subjects among the truly healthy (specificity) is 94.2%, and the ratio of correct diagnoses, sick and healthy, over all tests (accuracy) is 94.5%. The proposed model was thus successful in detecting breast cancer in a reliable manner. Additionally, the sensitivity, specificity, and AUC are apt criteria for comparing different diagnostic tests; as Table 3 shows, the current work compares very favorably with other proposed diagnostic methods.

Table 3 Comparison between proposed models (obtained results in testing phase)

5 Conclusion

This paper proposed a hybrid computational intelligence model, namely SOM-CVNN, for the diagnosis of breast cancer. We applied a real dataset including 822 patients with five features. The SOM technique was used to cluster the patients, and then, for each cluster, the patients' features were fed to a complex-valued neural network to classify breast cancer severity (benign or malignant). In the testing phase, the health and disease detection ratios were 94 and 95%, respectively. The obtained results show that the model can be used as a reliable tool that may eliminate unnecessary biopsies. Furthermore, better breast cancer detection results could be obtained by increasing the number of patients and combining the model with other innovative machine learning methods. We are working on another novel hybrid model based heavily on data mining along with pattern recognition.