1 Introduction

Breast cancer is the most frequent cancer in females worldwide, accounting for 15 % of all female cancers. In 2012, 521,000 deaths were due to breast cancer. Although some risk reduction can be achieved through prevention, such strategies cannot eliminate the majority of breast cancers, which are diagnosed at very late stages. As a result, early detection is the cornerstone of breast cancer control and of improving breast cancer survival [57].

Mammography and fine-needle aspiration cytology (FNAC) are the commonly used diagnostic techniques, but their diagnostic performance is not fully satisfactory. There is no doubt that the evaluation of data obtained from patients, together with doctors' judgments, is among the most valuable elements of diagnosis. Alongside mammography and FNAC, different data mining techniques can support doctors' diagnosis and decision making, yielding an improved diagnostic system. With respect to these requirements, data mining techniques can be utilized to improve diagnostic systems. With automatic diagnostic systems, probable doctor mistakes during diagnosis can be reduced, and medical data can be analyzed in more detail in less time. The purpose of this study is to establish an accurate automatic diagnostic system that can distinguish benign breast tumors from malignant cancers. To solve this task, different data mining techniques were applied and their performances were evaluated and compared. These techniques include Logistic Regression, Decision Trees, Random Forest, Bayesian Network, Multilayer Perceptron (MLP), Radial Basis Function Networks (RBFN) and Support Vector Machine (SVM).

Selection of the most significant and informative features and removal of the remaining ones (in other words, compression of the original feature set into a smaller set) is one of the most important tasks in the design of an efficient classification model. Therefore, in order to construct an efficient breast cancer diagnosis dataset, a method is needed that extracts the most informative features under the following constraints: no prior knowledge of the information contained within the original dataset is assumed, and the significance of the original features must be preserved. Genetic algorithms (GAs) may be employed as a tool to determine information dependencies and to decrease the number of features in a dataset through purely structural techniques [37]. A GA can be seen as a data compression algorithm that eliminates unwanted features and chooses a feature subset with the same discernibility as the initial set of features, resulting in better classification performance [9]. One of the main goals of this study is to exploit the advantages of GA-based feature reduction on breast cancer data when constructing an automatic diagnosis system. The new dataset obtained after the GA step is fed as input to different classifiers. Our proposed approach has two stages. In the first stage, the GA is employed as a feature reduction mechanism to determine the discriminative features; it serves to remove redundant data. In the second stage, the best feature subset is employed as the input to different data mining techniques. The effectiveness and efficiency of the methods are evaluated on breast cancer datasets. Experiments showed that the data mining techniques achieve better predictive classification accuracy and performance with a smaller number of attributes.

Numerous studies have proposed different systems for automatic diagnosis of breast cancer based on the Wisconsin Breast Cancer datasets, and many of these studies reported high classification performances. Quinlan [41] used the C4.5 decision tree method and tenfold cross-validation. Hamilton et al. [17] used the RIAC method, and Ster and Dobnikar [48] used linear discriminant analysis. Pena-Reyes and Sipper [38] used a fuzzy-GA method, and Setiono [47] used a feed-forward neural network rule extraction algorithm. Albrecht et al. [3] obtained 98.80 % accuracy using a learning algorithm that combined logarithmic simulated annealing with the perceptron [33]. Goodman et al. [15] used three distinct methods, namely artificial immune recognition system (AIRS), big LVQ and optimized learning vector quantization, and achieved 97.2, 96.8 and 96.7 % accuracies, respectively. Abonyi and Szeifert [2] used a supervised fuzzy clustering method, and Hassanien [21] used a rough set method. Sahan and Polat [44] used a novel hybrid technique based on a fuzzy-artificial immune system and the k-NN algorithm, and the accuracy was 99.14 % [9]. Maglogiannis and Zafiropoulos [32] used three different methods: SVM, Bayesian classifiers and artificial neural networks (ANNs). Peng et al. [39] used a hybrid method that combines filter and wrapper tools [9, 33]. In [49], a support vector machine (SVM) and an evolutionary algorithm were used, and the obtained accuracy was around 97 %. Koloseni et al. [27] used a differential evolution classifier with optimal distance measures on the WDBC dataset, and the average classification accuracy obtained was around 93.64 %. Astudillo et al. [4] applied a tree-based topology-oriented SOM to the WDBC dataset to discriminate between malignant and benign cancer, and the obtained classification accuracy was 93.32 %. Tabakhi et al. [51] proposed an unsupervised feature selection algorithm based on ant colony optimization combined with Naïve Bayes for classification; the classification accuracy obtained with this system for discriminating between benign and malignant cancer was 92.42 % on the WDBC dataset. Saez et al. [43] proposed mutual information (MI) between features as a weighting factor for the nearest neighbor (NN) classifier, and the classification accuracy obtained on the WDBC dataset was 96.14 %. Chen et al. [8] suggested a system based on parallel time-variant particle swarm optimization (PTVPSO) for concurrent parameter optimization and feature selection for SVM; the classification accuracy obtained on the WDBC dataset was 98.44 %. Zheng et al. [60] proposed a breast cancer diagnosis system based on K-means and SVM (K-SVM), tested it on the WDBC dataset, and the obtained classification accuracy was 97.38 %. Lim et al. [30] extended the Bandler–Kohout (BK) subproduct to interval-valued fuzzy sets (IVFS), and the classification accuracy obtained with this approach was 95.26 % on the WDBC dataset.

In this study, GA feature selection and different data mining techniques, namely Logistic Regression, Decision Trees, Random Forest, Bayesian Network, MLP, RBFN, SVM and Rotation Forest, have been investigated in order to construct an automated system that distinguishes between benign and malignant breast tumors. Widely used datasets from the literature were used to evaluate the performance of the proposed system. It is observed that Rotation Forest, a multiple classifier system (MCS), combined with GA feature selection achieved the highest classification accuracy (99.48 %) in breast cancer data classification.

This paper is organized as follows. In the next section, information is given about the Wisconsin Diagnostic Breast Cancer datasets, and the methods used in each step of the classification process are presented. Section 3 provides a complete experimental study of the different data mining techniques for the diagnosis of breast cancer, in which the effects of the feature set and of algorithmic choices on classification performance are compared. Finally, the conclusions are summarized in Sect. 4.

2 Materials and methods

2.1 Breast cancer database overview

Breast cancer is a malignant tumor arising from breast cells. Even though some of the risk factors (e.g., aging, genetic risk factors, family history, menstrual periods, not having children, obesity) that raise a woman's chance of developing breast cancer are known, it is not yet known what causes most breast cancers or how various factors initiate the change of cells into cancerous ones. Many studies are being conducted to learn more, and scientists are making great progress in understanding how certain alterations in DNA can cause healthy breast cells to become cancerous [25, 33].

In this study, two different Wisconsin Breast Cancer datasets (obtained from the UCI Machine Learning Repository) were studied. The first dataset is the Wisconsin Breast Cancer (Diagnostic) (WBC (Diagnostic)) dataset. This dataset contains 569 instances and 32 attributes. Three hundred and fifty-seven cases are benign, and 212 cases are malignant. All attributes are calculated from a digitized image of a fine-needle aspirate (FNA) of a patient's breast tissue. Each cell nucleus in the breast tissue is described by ten real-valued features, and for each of these features, the mean, the standard error and the "worst" (mean of the three largest values) are computed. As a result, a total of 30 attributes were obtained for each image [52]:

  • Radius (mean of distances from center to points on the perimeter)—\(a_{1,1}, a_{1,2}, a_{1,3}\);

  • Texture (standard deviation of grayscale values)—\(a_{2,1}, a_{2,2}, a_{2,3}\);

  • Perimeter—\(a_{3,1}, a_{3,2}, a_{3,3}\);

  • Area—\(a_{4,1}, a_{4,2}, a_{4,3}\);

  • Smoothness (local variation in radius lengths)—\(a_{5,1}, a_{5,2}, a_{5,3}\);

  • Compactness (perimeter²/area − 1.0)—\(a_{6,1}, a_{6,2}, a_{6,3}\);

  • Concavity (severity of concave portions of the contour)—\(a_{7,1}, a_{7,2}, a_{7,3}\);

  • Concave points (number of concave portions of the contour)—\(a_{8,1}, a_{8,2}, a_{8,3}\);

  • Symmetry—\(a_{9,1}, a_{9,2}, a_{9,3}\);

  • Fractal dimension ("coastline approximation" − 1)—\(a_{10,1}, a_{10,2}, a_{10,3}\);

where \(a_{i,1}\) refers to the mean of the ith attribute, \(a_{i,2}\) to its standard error, and \(a_{i,3}\) to its "worst" value (i = 1, …, 10).
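For readers who wish to inspect these attributes directly, the snippet below loads the data; as a convenience assumption, it uses the copy of the WDBC dataset bundled with scikit-learn rather than a fresh UCI download.

```python
# Inspect the WBC (Diagnostic) attributes described above; scikit-learn
# ships a copy of this UCI dataset, so no download is needed.
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target           # 569 samples x 30 attributes
print(X.shape)                          # (569, 30)
print(data.feature_names[:3])           # mean radius, mean texture, mean perimeter
print(dict(zip(data.target_names, [sum(y == i) for i in (0, 1)])))
# {'malignant': 212, 'benign': 357}
```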

The second dataset is the Wisconsin Breast Cancer (Original) dataset, which contains 699 samples obtained from breast tissue. After data with missing values are removed from the dataset, 683 cases remain for our experiment. Every record in the database has nine attributes, each represented as an integer between 1 and 10, and these values were found to fluctuate notably between benign and malignant instances. The nine measured attributes are [53]:

  • Clump thickness;

  • Uniformity of cell size;

  • Uniformity of cell shape;

  • Marginal adhesion;

  • Single epithelial cell size;

  • Bare nuclei;

  • Bland chromatin;

  • Normal nucleoli;

  • Mitoses.
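As a hedged illustration of the cleanup described above (699 → 683 cases), the sketch below assumes the raw UCI file breast-cancer-wisconsin.data is available locally; in that file, missing Bare Nuclei values are coded as '?', the class label uses 2 for benign and 4 for malignant, and the column names are our own labels.

```python
# Reproduce the 699 -> 683 cleanup of WBC (Original), assuming the raw UCI
# file breast-cancer-wisconsin.data has been downloaded locally.
import pandas as pd

cols = ["id", "clump_thickness", "cell_size_uniformity", "cell_shape_uniformity",
        "marginal_adhesion", "single_epithelial_cell_size", "bare_nuclei",
        "bland_chromatin", "normal_nucleoli", "mitoses", "class"]
df = pd.read_csv("breast-cancer-wisconsin.data", names=cols, na_values="?")
df = df.dropna()                        # drops the 16 rows with missing values
X = df[cols[1:-1]].astype(int)          # nine attributes, values 1..10
y = (df["class"] == 4).astype(int)      # 2 = benign, 4 = malignant
print(len(df))                          # 683
```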

2.2 Genetic algorithm-based feature selection

Genetic algorithms (GAs) have found a broad range of applications. The approach is based on an analogy to natural selection. A GA operates on a population, and the best solution is obtained after a sequence of iterative steps. The GA evolves successive populations of candidate solutions, each represented by a chromosome, until satisfactory results are reached [55].

A fitness function estimates the quality of a solution in the evaluation step. The two major operators are crossover and mutation, and these have the key impact on the fitness value. Chromosomes are selected for reproduction according to their fitness values: The higher the fitness value, the higher the probability that a chromosome is selected. The fitter chromosomes thus have a higher likelihood of being selected into the recombination pool, using either roulette wheel or tournament selection [55].

In mutation, genes may be updated randomly. Crossover is a genetic operator that combines features from a pair of parent subsets into a new subset. The offspring replace the previous population, using an elitism or diversity replacement strategy, to create a new population for the next generation [55]. To achieve better performance, the GA-selected features are applied as input to the classifiers.

Three criteria are used to model the fitness function: model accuracy, number of selected features and cost. For any chromosome with an acceptable classification accuracy rate, selecting only significant and informative features and reducing cost result in a satisfactory fitness value. A chromosome with a higher fitness value has a better chance of being used in the following generation, and these criteria are weighted according to the user's specifications. GA-based feature selection follows these steps [55]:

  1. Data preprocessing (scaling): Scaling has two advantages: It prevents attributes in larger numeric ranges from dominating attributes in smaller numeric ranges, and it avoids numerical difficulties during calculation [24, 55].

  2. Conversion of genotype to phenotype: Each chromosome (bit string) is converted into the corresponding feature subset.

  3. Feature subset generation.

  4. Fitness evaluation.

  5. Termination criteria: If they are met, the process is stopped; otherwise, it continues with the next generation.

  6. Genetic operations: In this step, a better solution is sought through the genetic operators.

The GA algorithm applied to feature selection is presented in Fig. 1.

Fig. 1 Algorithm for the proposed GA–Rotation Forest model
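To make the loop of Fig. 1 concrete, the following is a minimal sketch of GA-based feature selection, not the authors' exact implementation: the population size, mutation rate, per-feature penalty and the SVM fitness classifier are all illustrative assumptions.

```python
# Sketch of GA feature selection: chromosomes are bit masks over the 30
# WDBC features; fitness is cross-validated accuracy minus a size penalty.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
n_features = X.shape[1]

def fitness(mask):
    # accuracy of an SVM on the selected columns, minus a per-feature penalty
    if not mask.any():
        return 0.0
    acc = cross_val_score(SVC(kernel="poly", degree=2), X[:, mask], y, cv=5).mean()
    return acc - 0.002 * mask.sum()

def tournament(pop, fits, k=3):
    idx = rng.choice(len(pop), size=k, replace=False)
    return pop[idx[np.argmax(fits[idx])]]

pop = rng.random((20, n_features)) < 0.5      # random initial population
for generation in range(15):
    fits = np.array([fitness(m) for m in pop])
    nxt = [pop[np.argmax(fits)].copy()]       # elitism: carry over the best
    while len(nxt) < len(pop):
        p1, p2 = tournament(pop, fits), tournament(pop, fits)
        cut = rng.integers(1, n_features)     # one-point crossover
        child = np.concatenate([p1[:cut], p2[cut:]])
        child ^= rng.random(n_features) < 0.02   # bit-flip mutation
        nxt.append(child)
    pop = np.array(nxt)

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected feature indices:", np.flatnonzero(best))
```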

2.3 Logistic Regression

The logistic model arises from modeling the posterior probabilities of K classes via linear functions in x, while ensuring that they sum to one and remain in the range [0, 1]. The model is specified in terms of K − 1 logit transformations (log odds). Although the model uses the last class as the denominator in the odds ratios, the choice of denominator is arbitrary in that the estimates are equivariant under this choice. When K = 2, the model is particularly simple, since there is just a single linear function. The model is used extensively in biostatistical applications where binary responses (two classes) occur frequently [16, 22, 56].
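As a small illustration of the K = 2 case, the sketch below fits a single logit to the WDBC data with scikit-learn; the solver settings are illustrative, not tuned.

```python
# For K = 2 the model reduces to one linear function of x whose output is
# the log odds of the positive class; clf.coef_ holds its weights.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000).fit(StandardScaler().fit_transform(X), y)
print(clf.coef_.shape, clf.intercept_)  # (1, 30) weights and one intercept
```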

2.4 Bayesian Network

A Bayesian Network describes the joint probability distribution for a set of variables by specifying sets of local conditional probabilities together with a set of conditional independence assumptions. Each variable in the joint space is represented by a node in the Bayesian Network. For each variable, two types of information are specified. First, the variable is conditionally independent of its non-descendants in the network given its immediate predecessors. Second, a conditional probability table is given for each variable, describing the probability distribution for that variable given the values of its immediate predecessors. The joint probability for any assignment of values \((b_1, \ldots, b_n)\) to the tuple of network variables \((B_1, \ldots, B_n)\) can be computed by the formula:

$$P(b_{1}, \ldots, b_{n}) = \prod\limits_{i = 1}^{n} P\left(b_{i} \mid {\text{Parents}}(B_{i})\right)$$
(1)

where Parents(\(B_i\)) denotes the set of immediate predecessors of \(B_i\) in the network. The values of \(P(b_i \mid \text{Parents}(B_i))\) are those stored in the conditional probability table associated with node \(B_i\) [46].
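A toy worked example of Eq. (1), with three hypothetical binary variables arranged in a chain B1 → B2 → B3 and made-up conditional probability tables:

```python
# Eq. (1) on a chain B1 -> B2 -> B3:
# P(b1, b2, b3) = P(b1) * P(b2 | b1) * P(b3 | b2).
p_b1 = {True: 0.3, False: 0.7}          # CPT for B1 (no parents)
p_b2 = {True: {True: 0.8, False: 0.1}}  # P(B2 = True | B1)
p_b3 = {True: {True: 0.9, False: 0.2}}  # P(B3 = True | B2)

def p(table, value, parent):
    pt = table[True][parent]            # P(var = True | parent value)
    return pt if value else 1.0 - pt

b1, b2, b3 = True, True, False
joint = p_b1[b1] * p(p_b2, b2, b1) * p(p_b3, b3, b2)
print(joint)                            # 0.3 * 0.8 * 0.1 = 0.024
```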

2.5 Multilayer Perceptron (MLP)

Multilayer Perceptrons (MLPs) are neural networks consisting of an input layer, one or more hidden layers of computation nodes and an output layer of computation nodes. The input signal travels in the forward direction on a layer-by-layer basis. MLPs have been used successfully to solve challenging and diverse problems by training them in a supervised manner with the well-known back-propagation algorithm [23].

Back-propagation learning consists of two passes through the layers: a forward pass and a backward pass. In the forward pass, the synaptic weights are all fixed, whereas in the backward pass the weights are adjusted. An error signal is created by subtracting the actual output of the network from the target output. This error signal propagates backward through the network, against the direction of the synaptic connections, and the weights are tuned to move the actual response closer to the target. This learning process is called back-propagation learning [23].

2.6 Radial Basis Function Networks (RBFN)

RBFN is a popular alternative to the Multilayer Perceptron (MLP) because it has a simpler structure and a faster training process. In an RBFN, each neuron in the hidden layer uses an RBF as its nonlinear activation function, so the hidden layer performs a nonlinear transformation of the input into a new space. The output layer of the RBFN is a linear combiner acting on this new space. Biases for the output layer neurons can be modeled by adding an extra hidden neuron with a constant activation of 1 [10].
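Since scikit-learn offers no built-in RBFN, the sketch below hand-rolls one under simple assumptions: k-means picks the hidden-layer centers, a crude mean-distance heuristic sets the Gaussian width, and a logistic output acts as the linear combiner.

```python
# Hand-rolled RBFN sketch: k-means centers -> Gaussian hidden layer ->
# linear (logistic) output layer. The width heuristic is an assumption.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

centers = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X).cluster_centers_
width = np.mean([np.linalg.norm(c1 - c2) for c1 in centers for c2 in centers])

def rbf_features(X):
    # hidden layer: Gaussian activation around each center
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(d / width) ** 2)

out = LogisticRegression(max_iter=1000).fit(rbf_features(X), y)
print(out.score(rbf_features(X), y))    # training accuracy of the sketch
```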

2.7 Support Vector Machine (SVM)

Support Vector Machine (SVM) is a supervised learning method that selects a small number of critical boundary samples, known as support vectors, from each class and constructs a linear discriminant function separating the classes as widely as possible. SVMs overcome the restriction of linear boundaries by extending the discriminant function with additional nonlinear terms, making it possible to form quadratic, cubic and higher-order decision boundaries [58].

Support Vector Machines (SVMs) are built on an algorithm that finds a particular type of linear model called the maximum margin hyperplane. ("Hyperplane" is simply another term for a linear model.) To illustrate the maximum margin hyperplane, consider a two-class dataset whose classes are linearly separable; in other words, there is a hyperplane in the sample space that classifies all training samples correctly. The maximum margin hyperplane is the one offering the greatest separation between the classes: It comes no closer to either class than it must. To be precise, the convex hull of a set of points is the tightest enclosing convex polygon; it is obtained when each point of the set is connected to every other point. Since the two classes are assumed to be linearly separable, their convex hulls cannot intersect. Among all hyperplanes separating the classes, the maximum margin hyperplane is the one farthest from both convex hulls; it is the perpendicular bisector of the shortest line connecting the hulls [58]. With the selection of a suitable mapping, the input examples become linearly or approximately linearly separable in a high-dimensional space. The SVM then finds the optimal hyperplane that maximizes the distance between the instances of the two classes [54].
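A brief sketch of a polynomial-kernel SVM on the WDBC data; the kernel degree and C value are assumptions, not the paper's tuned settings.

```python
# Maximum-margin classification with a polynomial kernel, the same kernel
# family used later as the Rotation Forest base classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
svm = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=2, C=1.0))
print(cross_val_score(svm, X, y, cv=10).mean())
# after fitting, svm[-1].support_vectors_ exposes the boundary samples
```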

2.8 C4.5 Decision Tree

This algorithm was developed by J. Ross Quinlan. It begins with large sets of samples belonging to known classes. The samples, described by any combination of nominal and numeric attributes, are examined for patterns that allow the classes to be accurately characterized. These patterns are then expressed as models, in the form of decision trees or sets of if–then rules, that can be employed to classify new samples, with special emphasis on making the models comprehensible as well as precise. The C4.5 algorithm uses measures based on information theory to estimate the "goodness" of a test; in particular, it selects the test that extracts the most information from a set of samples, subject to the restriction that only a single attribute is tested at a time [40].

Two common problems in decision tree algorithms are how to handle unknown values and how to avoid overfitting. C4.5 handles unknown values as follows: Samples with unknown values are ignored while calculating the information content, and the information gain for an attribute A is then multiplied by the fraction of samples for which the value of A is defined. Thus, if A is unknown for a large fraction of samples, the information gained by testing A at a node will be relatively small, which matches the intuition about how such attributes ought to be treated. A decision tree that correctly classifies all samples in a training set may not be as good a classifier as a smaller tree that does not fit all of the training data. To address this problem, a pruning approach was adopted for C4.5. This method is based on estimating the error rate of every subtree and replacing a subtree with a leaf node when the estimated error of the leaf is smaller. If the error estimates were perfect, this approach would always lead to a better decision tree; in practice, even though the estimates are very crude, the approach frequently performs rather well [40].
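scikit-learn implements CART rather than C4.5, but its cost-complexity pruning illustrates the same idea of replacing subtrees whose estimated error does not justify their size; the alpha values below are illustrative.

```python
# Pruned vs. unpruned trees via cost-complexity pruning (CART stand-in
# for C4.5; ccp_alpha = 0.0 means no pruning).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
for alpha in (0.0, 0.01):
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    print(alpha, cross_val_score(tree, X, y, cv=10).mean())
```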

2.9 Random Forests (RF)

Random Forests [5] is an important modification of bagging that builds a large collection of de-correlated trees and then averages them. RF is very simple to train and tune and, as a consequence, has found a wide range of applications. RF is used for both classification and regression, with a difference in how the trees are combined. When RF performs a classification task, it collects a class vote from each tree and classifies by majority vote. When RF is used for regression, the predictions at a target point x from the individual trees are simply averaged [22].

The use of out-of-bag (OOB) samples is a significant characteristic of Random Forests. RFs use the OOB samples to construct a variable importance measure, quantifying the prediction strength of each variable. After the bth tree is grown, the OOB samples are passed down the tree and the prediction accuracy is recorded. Then the values of the jth variable are randomly permuted in the OOB samples, and the accuracy is computed again. The decrease in accuracy caused by this permutation is averaged over all trees and used as a measure of the importance of variable j in the Random Forest [22].
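A sketch of OOB accuracy and permutation-based variable importance with scikit-learn (100 trees is an illustrative choice; permutation_importance mirrors the permute-the-jth-variable procedure, here on a held-out split rather than the OOB samples themselves):

```python
# OOB accuracy plus permutation importance for a Random Forest.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(Xtr, ytr)
print("OOB accuracy:", rf.oob_score_)
imp = permutation_importance(rf, Xte, yte, n_repeats=10, random_state=0)
print("most important feature index:", imp.importances_mean.argmax())
```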

2.10 Rotation Forest

Rotation Forest is a novel method for generating classifier ensembles. In the first step, the feature set is split into S subsets and principal component analysis (PCA) is run independently on each subset; a new extracted feature set is then reconstructed while keeping all the components. The new features are obtained from linearly transformed data. In this study, an SVM with a polynomial kernel is used as the base classifier for Rotation Forest. Different feature set splits lead to different rotations, so different classifiers are obtained, while information about the scatter of the data is preserved in the new extracted feature space. Thus, individual classifiers with high performance are built. Achieving both diversity and accuracy together is therefore the objective of Rotation Forest [42].
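The following is a deliberately simplified Rotation Forest sketch, not a faithful re-implementation of [42] (it omits, for example, the bootstrap sampling used when fitting the per-subset PCAs): features are split into disjoint random subsets, PCA is run on each with all components kept, and each polynomial-kernel SVM is trained on the rotated data.

```python
# Simplified Rotation Forest: per-model random feature split, per-subset
# PCA assembled into a block-diagonal rotation, SVM base classifiers.
import numpy as np
from scipy.linalg import block_diag
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
n_models, S = 10, 3                      # ensemble size, number of feature subsets

models = []
for _ in range(n_models):
    perm = rng.permutation(X.shape[1])   # random disjoint split of the features
    subsets = np.array_split(perm, S)
    blocks = [PCA().fit(X[:, s]).components_.T for s in subsets]
    R = block_diag(*blocks)              # block-diagonal rotation, all components kept
    order = np.concatenate(subsets)
    clf = SVC(kernel="poly", degree=2).fit(X[:, order] @ R, y)
    models.append((order, R, clf))

# majority vote over the ensemble (evaluated on the training set for brevity)
votes = np.mean([clf.predict(X[:, o] @ R) for o, R, clf in models], axis=0)
print("training accuracy:", np.mean((votes > 0.5) == y))
```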

3 Results and discussion

In this study, we used two different WBC medical datasets to test the performance of the models. These two datasets, WBC (Diagnostic) and WBC (Original), are described in Sect. 2.1. We used different data mining techniques, namely Logistic Regression, Decision Trees, Random Forest (RF), Bayesian Network, Multilayer Perceptron (MLP), Radial Basis Function Networks (RBFN), Support Vector Machine (SVM) and Rotation Forest. We also used genetic algorithm-based feature selection to find the best attributes and then applied the data mining techniques for classification.

3.1 Experimental setup and dataset

Two different experiments were set up on the training data for the two WBC datasets. In the first case, the same training–testing setup was applied as in [1, 13, 19]. The publicly available open-source machine learning software WEKA was employed to implement the algorithms and the approach proposed in this study, with tenfold cross-validation on the training dataset. In the second case, we first used GA feature selection to select the best attributes and then applied tenfold cross-validation on these selected attributes. Numerous studies evaluating breast cancer classification using k-fold cross-validation can be found in the literature. In the k-fold cross-validation technique, the dataset is randomly separated into k subsets; k − 1 subsets, in our case nine, are used for training, and the remaining subset is used for testing the classifier [20]. We compared the efficiency of the proposed techniques with and without GA feature selection.
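As a rough stand-in for the WEKA setup, the sketch below runs tenfold cross-validation over scikit-learn counterparts of several of the classifiers named above (CART stands in for C4.5; all parameter settings are illustrative, so the numbers will not reproduce the paper's tables).

```python
# Tenfold cross-validation harness over several of the classifiers used
# in this study, with standardization applied inside each fold.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
models = {
    "Logistic": LogisticRegression(max_iter=5000),
    "Tree": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "MLP": MLPClassifier(max_iter=2000, random_state=0),
    "SVM": SVC(kernel="poly", degree=2),
}
for name, m in models.items():
    pipe = make_pipeline(StandardScaler(), m)
    print(name, cross_val_score(pipe, X, y, cv=10).mean().round(4))
```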

The area under the ROC [receiver operating characteristic] curve (AUC) was also employed to assess the discrimination capability of the classifiers proposed in this study. ROC curves represent the performance of a classifier without taking class distribution or misclassification costs into consideration. A ROC curve is produced by plotting the sensitivity values (true-positive fraction) on the y-axis against the corresponding (1 − specificity) values (false-positive fraction) on the x-axis for all available thresholds. The quality of the approximation to a curve depends on the number of thresholds tested. For each fold of a tenfold cross-validation, the samples are weighted for a selection of distinct cost ratios, the system is trained on each weighted set, the true positives and false positives are calculated on the test set, and the resulting point is plotted on the ROC axes [58]. The classification success is then quantified by the AUC. The average AUC value indicates the characteristic AUC generated from the specified input data and shows how consistently the result is predicted [18, 36, 50]. AUC is generally considered a good index of performance since it provides a single measure of total accuracy that does not depend on any specific threshold [34, 50]. Despite its advantages, the ROC plot does not by itself provide a rule for case classification; however, there are approaches that can be employed to derive decision rules from the ROC plot [12, 45]. As a guideline, Zweig and Campbell [45, 61] proposed that if the false-positive costs (FPCs) exceed the false-negative costs (FNCs), the threshold should favor specificity, while sensitivity should be favored if the FNCs exceed the FPCs. Combining these costs with the prevalence (p) of positive cases permits the computation of a slope [34, 45]:

$$m = \frac{{\text{FPC}}}{{\text{FNC}}} \times \frac{1 - p}{p}$$
(2)

where m refers to the slope of a tangent to the ROC plot. The sensitivity/specificity pair is positioned where the line and the curve first make contact [45]. An additional measure used to describe performance is F-measure, defined as:

$$\text{F-measure} = \frac{{2\,{\text{TP}}}}{{2\,{\text{TP}} + {\text{FP}} + {\text{FN}}}}$$
(3)
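A short sketch computing the AUC and the F-measure of Eq. (3) from cross-validated predictions; the logistic model here is just a placeholder classifier.

```python
# AUC and F-measure (Eq. 3) from tenfold cross-validated probabilities.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_predict(clf, X, y, cv=10, method="predict_proba")[:, 1]
print("AUC:", roc_auc_score(y, scores).round(3))
print("F-measure:", f1_score(y, scores > 0.5).round(3))  # 2TP/(2TP+FP+FN)
```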

3.2 Results without GA

The experimental results achieved for the WBC (Diagnostic) dataset are given in Table 1. We obtained an average accuracy of 97.19 % for Logistic Regression, 93.32 % for the Decision Tree (C4.5), 96.13 % for Random Forest, 95.08 % for Bayes Net, 96.66 % for the Multilayer Perceptron (MLP), 94.20 % for the Radial Basis Function Network (RBFN), 96.89 % for SVM and 97.41 % for the multiple classifier system (MCS) Rotation Forest.

Table 1 Results for WBC diagnostic using different data mining techniques

The experimental results obtained for the WBC (Original) dataset are given in Table 2. As shown in Table 2, the total accuracy achieved with the SVM classifier based on the polynomial kernel on the test set was 96.78 %. The total accuracies are 95.75 % for the RBFN classifier, 96.05 % for the Multilayer Perceptron (MLP), 96.05 % for the C4.5 Decision Tree, 96.34 % for the Random Forest, 97.22 % for the Bayes Net, 96.78 % for the Logistic Regression classifier and 96.78 % for Rotation Forest.

Table 2 Results for WBC original using different data mining techniques

3.3 Results with GA

In the second test, we first used genetic algorithm (GA)-based feature selection to select the best attributes and then applied the same data mining techniques as in the previous section. The experimental results showed that the highest classification performance is achieved when Rotation Forest is used as the classifier. Therefore, the model proposed in this study uses GA for feature selection and Rotation Forest for classification. The GA–Rotation Forest structure is given in Fig. 1.

The experimental results obtained for the WBC (Diagnostic) dataset are given in Tables 3, 4 and 5. To determine which of the 30 attributes in the WBC (Diagnostic) dataset are most important, the GA was employed. Genetic algorithm-based feature selection identified 14 attributes as important: \(a_{1,2}\), \(a_{2,1}\), \(a_{3,1}\), \(a_{3,2}\), \(a_{4,2}\), \(a_{5,2}\), \(a_{6,3}\), \(a_{7,1}\), \(a_{7,3}\), \(a_{8,2}\), \(a_{8,3}\), \(a_{9,1}\), \(a_{9,3}\) and \(a_{10,1}\). These 14 attributes noticeably differentiated between benign and malignant breast cells and tissues. As shown in Tables 3, 4 and 5, the total accuracy, AUC and F-measure achieved with the Rotation Forest classifier on the WBC (Diagnostic) dataset were 99.48 %, 0.993 and 0.995, respectively. These results were better than those achieved by the other classifiers. Indeed, the total accuracies are 94.38 % for the RBFN classifier, 98.45 % for the Multilayer Perceptron (MLP), 94.02 % for the C4.5 Decision Tree, 95.34 % for the Random Forest, 95.34 % for the Bayes Net and 98.45 % for the Logistic Regression classifier. The AUCs were 0.979 for the RBFN classifier, 0.999 for the Multilayer Perceptron (MLP), 0.954 for the C4.5 Decision Tree, 0.993 for the Random Forest, 0.995 for the Bayes Net and 0.999 for the Logistic Regression classifier. The F-measures were 0.944 for the RBFN classifier, 0.984 for the Multilayer Perceptron (MLP), 0.932 for the C4.5 Decision Tree, 0.953 for the Random Forest, 0.953 for the Bayes Net and 0.984 for the Logistic Regression classifier. The obtained results suggest that doctors and medical workers should pay particular attention to the 14 attributes mentioned above. These findings confirm that Rotation Forest is superior to the other classifiers and provide a reference classification accuracy against which the suggested classification algorithm can be measured.

Table 3 Results for WBC diagnostic dataset with genetic algorithm feature selection
Table 4 AUC results for WBC diagnostic dataset with genetic algorithm feature selection
Table 5 F-measure results for WBC diagnostic dataset with genetic algorithm feature selection

Applying genetic algorithm-based feature selection on WBC (Original) did not change the results obtained without GA, because WBC (Original) has a very small number of attributes and the GA-based feature selection returned all of these attributes as significant.

4 Discussion

The performance demonstrated by the ensemble data mining techniques for breast cancer diagnosis depends on the choice of input variables and the selection of the classification method. The parameters most appropriate for breast cancer diagnosis must be used as the inputs of the model. For this reason, GA is well suited to the classification of the WBC (Diagnostic) data in breast cancer diagnosis. In the second test, where the GA was applied, the highest obtained accuracy was 99.48 % with the Rotation Forest classifier.

Two important observations can be made from the obtained performance results: (1) the GA can correctly rank significant attributes, since the GA-selected features perform well in terms of classification performance, and (2) Rotation Forest outperformed all other traditional linear and nonlinear classification methods by giving the highest accuracy. There are several reasons for the superiority of Rotation Forest over the other traditional methods employed in the literature for breast cancer classification. Rotation Forest is a multiple classifier system and, because of this, is more robust, since it can consistently improve on the performance of the individual classification methods while providing diversity in the ensemble. Each base classifier in Rotation Forest employs a distinct subset of the WBC Diagnostic and Original datasets, capturing different aspects of these two datasets, so that diversity is achieved.

Accurate identification of breast cancer is important for both diagnosis and treatment evaluation. The developed Rotation Forest model classifies the WBC (Diagnostic) data, using GA for feature selection, with an accuracy of 99.48 %. This also resulted in an improved ROC area (AUC = 0.993), and the F-measure of Rotation Forest (0.995) was higher than that of the other classifiers. Rotation Forest, as deployed in this study, is thus competitive with other algorithms in breast cancer diagnosis. After applying different kinds of data mining techniques to our selected datasets, SVM with a polynomial kernel also achieved a satisfactorily high accuracy of 98.96 %.

To summarize, the suggested expert system accomplished a higher classification accuracy rate, decreased the number of attributes and obtained a higher performance rate. The results obtained in this study show that the suggested expert system is valuable in helping doctors and other medical workers to make the correct breast cancer diagnosis and may demonstrate great potential in the area of medical decision making.

To demonstrate the success of our approach, the outcomes achieved in this research are compared with other results reported in the literature. To compare the breast cancer classification efficiency of the proposed model, numerous studies that employed the same data but different classification techniques were used. For consistency with those studies, the same division of the train–test dataset as explained previously was followed. The comparison of classification performance with previous studies is illustrated in Table 6. Most of the studies mentioned in Table 6 used the same data division as our proposed model. For WBC (Original), both tests gave the same results, because GA-based feature selection on WBC (Original) returned all initial attributes (9 in total) as important. It is worth mentioning that several systems evaluated on the WBC (Original) dataset with high classification performance have been proposed in the literature. In [33], a Multilayer Perceptron (AMMLP) algorithm was applied, and the achieved classification accuracy was 99.26 %. In [9], a rough set (RS)-based support vector machine classifier (RS_SVM) was proposed, and the obtained classification accuracy was 96.87 %. In [3], the LSA machine algorithm was applied, and the obtained classification accuracy was close to 90 %. One of the main objectives of this study is not only to construct an accurate classification system but also to find the best-performing attribute selection algorithm. Therefore, WBC (Diagnostic) was employed to evaluate the performance of the system proposed in this study, since it has more than three times as many features as WBC (Original).

Table 6 Comparison of accuracies with previous researches

5 Conclusion

A great number of studies have been conducted in the medical area to analyze medical disorders and find accurate diagnoses, and data mining techniques have been widely used for these purposes. In this study, we have proposed several different data mining methods, with and without genetic algorithm-based feature selection, to correctly classify medical data taken from the Wisconsin Diagnostic Breast Cancer database. Rotation Forest with GA feature selection gave the highest accuracy of 99.48 %, which is among the highest classification accuracies reported in previous research in this field. We also achieved good classification accuracy using SVM. Many powerful methods have been applied to WBC (Diagnostic) prediction problems. This paper shows that instead of using complex methods based on strong classifiers to achieve good classification accuracies, an ensemble of simpler classifiers can be used as well, producing remarkable results. An ensemble of several methods allows us to exploit the advantages of each method in order to achieve high classification accuracies for breast cancer diagnosis. A group of these rather simple methods can also be used to classify other medical diseases and to help doctors make more precise decisions in breast cancer diagnosis.