1 Introduction

Breast cancer (BC) is the most common invasive cancer and one of the most prevalent diseases among females. In 2020, 2.3 million new cases of BC were reported, making it the most commonly diagnosed cancer [1]. BC tumors occur when certain breast cells grow out of control, often beginning in the lining of the breast ducts. Tumors can be classified as benign or malignant [2]. A benign tumor is not cancerous: it does not invade nearby tissues or spread to other areas of the body (metastasize), and once removed it usually does not grow back. Malignant tumors, in contrast, can grow and spread to other parts of the body.

Early detection of BC is key to increasing the survival rate: cancer detected early is more likely to respond to effective treatment, resulting in a greater probability of survival, less morbidity, and less expensive treatment. Different techniques have been investigated to diagnose BC, such as ultrasound, thermography, mammography and biopsy [3]. Mammography is probably the procedure most intensively used by physicians. However, if a mammogram looks suspicious, a biopsy is required to decide whether an abnormality is in fact a breast cancer [4]. Furthermore, if the detection accuracy of the tumor is low (below 70%), the doctor cannot report a final diagnosis. This wastes time and can cause mental discomfort for the patient [5]. Accordingly, the patient faces additional tests, which can be costly and demanding.

In order to provide a quick and accurate diagnosis for BC [6], various classification techniques have been investigated in the literature, such as Artificial Neural Networks (ANNs), Support Vector Machines (SVMs) and K Nearest Neighbor (KNN) [7, 8]. Nevertheless, single techniques are not always the most appropriate choice, since they do not achieve the best performance in all situations. In fact, the performance of a single technique depends on the characteristics of the dataset [5, 8, 9]. Furthermore, each single classification technique has advantages and limitations with respect to the classification task. To address this challenge, a powerful approach called ensemble classifiers has been widely investigated: it combines a set of individual classifiers by means of an aggregation layer [8, 10].

One of the most important tasks in optimizing an ensemble learning system is to select a subset of the “best” classifiers (ensemble members) from the whole pool of classifiers, which can drive an ensemble to outperform its members [8, 11, 12]. Otherwise, the performance of an ensemble can be worse than that of all or most of its members [13]. Several previous studies have sought optimal ways to combine classifiers [14,15,16,17]. However, the selection of the ensemble members is also a crucial challenge to address in order to improve the ensemble performance [18]. For instance, Aytu et al. [15] proposed a hybrid ensemble approach that employs randomized search and a clustering scheme to produce an ensemble. They trained a multitude of single classifiers with different parameters to create a group of diverse classifiers. Classifier clusters are then formed using the classification performance of the single classifiers, and two single classifiers from each cluster are selected as candidate members, based on their pairwise diversity, to generate the ensemble. Caruana et al. [16] proposed an ensemble selection scheme from a library of thousands of classification models: many machine learning algorithms and parameter settings are used to build a model library, and a selection strategy, such as forward stepwise selection, is then used to select the members that maximize the ensemble performance. Aksela [17] proposed a method to select ensemble members based on several selection criteria, such as the correlation between errors, the Q statistic and a weighted count of errors. A measure focused on penalizing classifiers that make the same errors, the exponential error count approach, was identified as generating the best selections.

The performance of an ensemble can be influenced by many criteria, including the accuracy of the single classifiers, the number of base classifiers, the combination rule, the data sampling technique, and the diversity of the members [19]. While diversity was generally considered the relevant criterion impacting the performance of an ensemble [20,21,22,23], other studies reached the opposite conclusion [24,25,26,27], arguing that “diversity alone is a poor predictor of the ensemble accuracy” [28]. In [29], Krogh and Vedelsby underlined that members improve the performance when they are both accurate and diverse.

In the literature, most existing methods for selecting ensemble members investigate a single criterion: accuracy or diversity [18, 30,31,32,33]. This paper proposes a method for selecting heterogeneous ensemble members for breast cancer classification that uses both accuracy and diversity as selection criteria. Accuracy is measured in terms of the recall, precision, and accuracy metrics, while diversity is evaluated by means of the Q statistic, one of the most popular diversity measures due to its simplicity and understandability [34].

Furthermore, we compare our proposed selection method with two existing strategies: (1) selecting ensemble members based on the accuracy criterion alone (i.e. we investigated the effect of selecting the most accurate models from a group of seven classifiers); and (2) selecting all the single classifiers without using any criterion. The empirical evaluations were carried out using: (1) seven single classifiers: K Nearest Neighbor (KNN), Multilayer Perceptron (MLP), Decision Trees (DTs) and four variants of Support Vector Machines (SVMs) with four different kernels: Linear Kernel (LK), Normalized Polynomial Kernel (NP), Radial Basis Function Kernel (RBF), and Pearson VII function-based Universal Kernel (PUK); (2) a majority voting combination rule to combine the outputs of the ensemble members; (3) three well-known BC datasets from online repositories; (4) three performance metrics, namely accuracy, recall and precision, to evaluate the constructed ensembles; and (5) the Scott-Knott statistical test and the Borda Count voting system to perform the significance tests and rank the best classifiers, respectively.

The contributions of this paper are: (1) analyzing the impact of investigating both accuracy and diversity for ensemble members’ selection in breast cancer classification; and (2) comparing the proposed members’ selection method with two existing selection strategies.

The rest of this paper is structured as follows: Sect. 2 briefly presents the single techniques used, the ensemble concept and the existing measures of diversity. Section 3 presents an overview of related work investigating diversity in members’ selection. Section 4 describes the experimental design pursued in this study. The empirical findings are presented and discussed in Sect. 5. The threats to validity are given in Sect. 6. Conclusions and future work are summarized in Sect. 7.

2 Background

This section gives a summary of the single classification techniques used, the concept of ensemble classifiers, and the measures of diversity used to select ensemble members in classification.

2.1 Single Techniques

KNN:

is a popular machine learning algorithm known for its simple implementation and robustness [35]. It is a non-parametric method first created in 1951 by Evelyn Fix and Joseph Hodges, and later refined by Thomas Cover. It is used to solve both classification and regression problems. KNN stores all available instances and classifies new instances based on a similarity measure; in general, it uses the Euclidean distance to measure the similarity to its nearest neighbors.

SVMs:

are powerful algorithms used to solve both classification and regression problems. SVM was developed in the 1990s by Vladimir Vapnik [36]. It classifies a new unknown instance into one of the predefined classes and has the ability to model complex nonlinear relationships by choosing an appropriate kernel function [37]. In fact, the kernel function transforms the training samples so that a non-linear decision boundary becomes a linear equation in a higher-dimensional space [12, 38]. In this study, four variants of the SVM classifier were used, with four different kernels: Linear Kernel (LK), Radial Basis Function Kernel (RBF), Pearson VII function-based Universal Kernel (PUK) and Normalized Polynomial Kernel (NP).

MLP Neural Networks:

are the most frequently used feedforward neural networks due to their fast operation, ease of implementation, and smaller training set requirements [39, 40]. They are used for both classification and regression problems [41, 42]. Their architecture consists of three types of layers: the input layer, the hidden layers and the output layer. The nodes of each layer are connected to those of the next layer; this is the feed-forward principle, whereby information moves only in a forward direction. The neurons of each layer are connected to the neurons of the subsequent layer by means of weights, and the output signals are a function of the sum of the neuron inputs modified by an activation function. Generally, the neurons of the hidden layer use a nonlinear activation function, while a linear activation function is usually used for the output neurons.

DTs:

are among the most frequently used classification techniques, being easy to use and to interpret. They can be used for both classification and regression problems [43]. A DT is a tree-structured model in which internal nodes represent dataset attributes, branches represent decision rules, and each leaf node represents a class label. Depending on the task addressed, the class label can be categorical or continuous. The classification rules are described by the paths from the root to the leaves. In this study, the C4.5 algorithm was investigated [44].
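To make the experimental setup concrete, the sketch below instantiates rough scikit-learn counterparts of the seven single classifiers. This is an illustrative assumption only: the study itself uses WEKA, and neither C4.5 (J48) nor the PUK kernel has an exact scikit-learn equivalent, so CART stands in for C4.5 and no PUK variant is shown.

```python
# Sketch: approximate scikit-learn counterparts of the seven single classifiers.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

candidates = {
    "KNN":     KNeighborsClassifier(),        # Euclidean distance by default
    "MLP":     MLPClassifier(max_iter=1000),
    "DT":      DecisionTreeClassifier(),      # CART here; the paper uses C4.5
    "SVM-LK":  SVC(kernel="linear"),
    "SVM-NP":  SVC(kernel="poly"),            # plain polynomial, not WEKA's normalized one
    "SVM-RBF": SVC(kernel="rbf"),
    # SVM-PUK: scikit-learn has no built-in PUK kernel; a custom kernel would be needed
}
```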

2.2 Ensemble Classifiers

Ensemble classifiers are a powerful machine learning technique that creates multiple models and then combines them by means of an aggregation rule in order to produce one optimal predictive model [8, 12]. Ensembles can be grouped into two types: homogeneous and heterogeneous [8, 10, 12, 45]. A homogeneous ensemble combines one base learning algorithm with at least two different variants, or one base learning algorithm with one meta ensemble such as Boosting [46], while a heterogeneous ensemble combines members built with different base learning algorithms. The current research is based on heterogeneous ensembles and adopts the majority voting combination rule to combine the decisions of the individual classifiers that comprise the ensemble. Note that the majority vote rule is the most popular and most frequently used combination method in the ensemble literature [47].
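As a minimal sketch of such an ensemble (again in scikit-learn terms rather than the WEKA setup used in the study), three heterogeneous members combined by hard majority voting:

```python
# Minimal heterogeneous ensemble with hard majority voting.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # WDBC, bundled with scikit-learn
ensemble = VotingClassifier(
    [("knn", KNeighborsClassifier()),
     ("dt", DecisionTreeClassifier()),
     ("svm_lk", SVC(kernel="linear"))],
    voting="hard")                           # each member casts one vote
print(cross_val_score(ensemble, X, y, cv=10).mean())
```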

2.3 Measures of Diversity in Ensemble Based Classification

It is well known that the performance of an ensemble is impacted by the diversity of its members, i.e., the degree of disagreement within the members of an ensemble [18, 25]. Diversity is loosely described as “making errors on different examples” [48, 49], and has been acknowledged as a very relevant characteristic in classifier combination. Kuncheva [28, 50] analyzed ten diversity measures and classified them into two groups: pairwise and non-pairwise measures. Pairwise measures calculate diversity values between two base classifiers of an ensemble; the overall diversity of an ensemble can then be estimated by averaging the pairwise diversity values over all pairs, using, for example, the Q statistic [51], the double-fault measure [52], or the disagreement measure [50]. Non-pairwise measures, on the other hand, estimate the diversity among all base classifiers by accounting for all potential disagreements between them, using, for example, entropy [53], generalized diversity [54], or the measure of difficulty [50]. In this study, the Q statistic was used; it is preferred over other diversity measures because of its simplicity and understandability [34].

The Q statistic is based on Yule’s Q and assesses the similarity of two classifiers’ outputs [50]. For two classifiers $L_i$ and $L_k$, the Q statistic is defined by Eq. 1.

$${Q}_{i,k}=\frac{{\mathrm{N}}^{11}{\mathrm{N}}^{00}- {\mathrm{N}}^{01}{\mathrm{N}}^{10}}{{\mathrm{N}}^{11}{\mathrm{N}}^{00}+{\mathrm{N}}^{01}{\mathrm{N}}^{10}}$$
(1)

where $N^{ab}$ is the number of training instances for which $L_i$ gives result ‘a’ and $L_k$ gives result ‘b’ (the result is taken to be 1 if an instance is classified correctly and 0 if it is misclassified). The expected value of Q for statistically independent classifiers is 0, and Q ranges between −1 and 1. Classifiers that tend to correctly classify the same instances have positive Q values, while those that make errors on different instances have negative Q values [21, 50].
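Eq. 1 is simple to compute from the two classifiers’ per-instance correctness vectors; a minimal sketch follows (the guard for a zero denominator is an assumption not covered by Eq. 1):

```python
import numpy as np

def q_statistic(correct_i, correct_k):
    """Yule's Q (Eq. 1) from two boolean 'correctly classified' vectors."""
    ci, ck = np.asarray(correct_i, bool), np.asarray(correct_k, bool)
    n11 = np.sum(ci & ck)      # both correct
    n00 = np.sum(~ci & ~ck)    # both wrong
    n10 = np.sum(ci & ~ck)     # only L_i correct
    n01 = np.sum(~ci & ck)     # only L_k correct
    denom = n11 * n00 + n01 * n10
    return (n11 * n00 - n01 * n10) / denom if denom else 0.0

# Two classifiers that err on different instances -> negative Q (here -1.0)
print(q_statistic([1, 1, 0, 1, 0, 1], [0, 1, 1, 1, 1, 0]))
```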

3 Related Work

This section presents an overview of some related work investigating diversity in ensemble techniques.

Banfield et al. [55] proposed an algorithm termed the percentage correct diversity measure (PCDM) to construct decision tree ensembles. The algorithm seeks the test samples for which the percentage of single classifiers that are correct lies between 10 and 90; these test samples are then removed when determining the ensemble diversity. The proposed technique was evaluated on twelve datasets from the UCI repository using ten-fold cross-validation, and the empirical results suggest its effectiveness.

Kadkhodaei et al. [56] proposed an entropy-based approach to determine the best combination of classifiers from a pool of ten different single techniques. The proposed heterogeneous ensemble was evaluated on three datasets from the UCI repository. The empirical results indicated that the proposed technique generates an accurate ensemble and that the time required to build it is less than that required by the bagging and boosting ensemble techniques.

Nascimento et al. [57] presented a new approach for the automatic selection of both base classifiers and features, based on an evolutionary method composed of two genetic algorithm instances. Two proposed diversity measures were investigated in order to analyze the performance of the framework. The empirical evaluations were performed using ten different classification algorithms under the bagging architecture, on five datasets selected from the UCI repository. The results suggested that the proposed technique was effective at generating accurate ensembles. The authors recommended taking into account factors other than diversity, such as accuracy and complexity, when constructing ensembles.

Lysiak et al. [58] proposed a novel approach for dynamic ensemble selection (DES) based on probabilistic measures of competence and diversity between member classifiers. Two types of ensembles were constructed: a homogeneous ensemble of 20 pruned decision tree classifiers and a heterogeneous ensemble of nine different classifiers. Seven public datasets were used to assess the effectiveness of the proposed approach. The results indicated that the proposed method can eliminate weak classifiers and keep the ensemble maximally diverse. Further, the proposed DES led to better classification accuracy than ensembles generated by a DES system using only the competence measure.

4 Experimental Design

This section explains the experimental design of the empirical evaluations, including the performance metrics used, the Scott-Knott (SK) test, the Borda Count voting system, the dataset descriptions, the ensemble selection process, and the abbreviations used.

4.1 Performance Metrics

The following performance metrics are used to assess the performance of single and ensemble techniques:

Accuracy, Recall and Precision are defined by Eqs. 2, 3 and 4, respectively [10].

$$\mathrm{Accuracy}=\frac{\mathrm{TN}+\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}+\mathrm{TN}+\mathrm{FN}}$$
(2)
$$\mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$
(3)
$$\mathrm{Precision }(\mathrm{Prec})=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$
(4)

where TP stands for True Positive, TN for True Negative, FP for False Positive and FN for False Negative.
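For illustration, the three metrics can be computed directly from the confusion-matrix counts, mirroring Eqs. 2–4 (the labels below are toy values):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 1])   # toy ground truth
y_pred = np.array([1, 0, 0, 1, 1, 1])   # toy predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy  = (tp + tn) / (tp + fp + tn + fn)   # Eq. 2
recall    = tp / (tp + fn)                    # Eq. 3
precision = tp / (tp + fp)                    # Eq. 4
```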

4.2 Scott-Knott Test

The Scott-Knott (SK) test is a hierarchical clustering algorithm developed by Scott and Knott (1974) and an efficient method for conducting multiple-comparison procedures without ambiguity [59]. Compared to other statistical tests such as the Tukey test, the Student–Newman–Keuls (SNK) test and the t-test, the SK test, which is widely used [60,61,62,63], has the ability to group techniques into non-ambiguous (non-overlapping) groups [64, 65]. In this study, the SK test was used to cluster the single and ensemble techniques based on their error rates (error rate = 1 − accuracy) and to check for significant differences between them. Ten-fold cross-validation was used in all the experiments presented in this study.
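For readers who want the mechanics, the sketch below is a simplified Python rendering of the SK procedure: techniques are sorted by mean error rate, the split maximizing the between-group sum of squares is found, and the split is accepted when the λ statistic exceeds a chi-square cut-off. The variance estimate follows the standard SK formulas under the assumption of equal replication (here, CV folds); the study itself relies on a full implementation in R.

```python
# Simplified Scott-Knott sketch: recursively split techniques (sorted by mean
# error rate) where the between-group sum of squares B0 is maximal, accepting
# the split when the lambda statistic exceeds the chi-square critical value.
import numpy as np
from scipy.stats import chi2

def scott_knott(groups, alpha=0.05):
    """groups: dict name -> per-fold error rates. Returns a list of clusters."""
    names = sorted(groups, key=lambda n: np.mean(groups[n]))
    k = len(names)
    if k < 2:
        return [names]
    means = np.array([np.mean(groups[n]) for n in names])
    grand = means.mean()
    b0, cut = max((i * (means[:i].mean() - grand) ** 2
                   + (k - i) * (means[i:].mean() - grand) ** 2, i)
                  for i in range(1, k))
    r = len(next(iter(groups.values())))   # replicates per technique (folds)
    v = k * (r - 1)                        # degrees of freedom, pooled variance
    s2y = np.mean([np.var(groups[n], ddof=1) for n in names]) / r
    sig02 = (np.sum((means - grand) ** 2) + v * s2y) / (k + v)
    lam = np.pi / (2 * (np.pi - 2)) * b0 / sig02
    if lam > chi2.ppf(1 - alpha, df=k / (np.pi - 2)):
        return (scott_knott({n: groups[n] for n in names[:cut]}, alpha)
                + scott_knott({n: groups[n] for n in names[cut:]}, alpha))
    return [names]

rng = np.random.default_rng(0)
folds = {n: m + rng.normal(0, 0.02, 10)    # toy 10-fold error rates
         for n, m in [("DT", 0.03), ("KNN", 0.04), ("MLP", 0.20)]}
print(scott_knott(folds))                  # expected: [['DT', 'KNN'], ['MLP']]
```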

4.3 Borda Count Voting System

The Borda count [66] is a form of single-winner election in which voters rank candidates in order of preference. The Borda count determines the winner by allocating points to each candidate based on the voters’ rankings; the candidates’ points are then aggregated into scores, and the candidate with the highest score wins. To illustrate this process, the example in Table 1 shows the steps of the Borda count for choosing the winner among four candidates (Cd1, Cd2, Cd3 and Cd4) voted on by four voters (Vot1, Vot2, Vot3 and Vot4). Each voter Voti assigns each candidate to one of the positions i (1, 2, 3 or 4). We then compute the position vector PI = (n1, n2, n3, n4) of each candidate, where ni is the number of times the candidate has been ranked in position i. Mi points are assigned to each position i (Mi = #candidates − i + 1). Finally, the score of each candidate is the sum of ni × Mi over all positions i.

Table 1. Borda count voting system for four voters who rank four candidates according to their preferences.
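The computation itself is a few lines; the ballots below are hypothetical, since Table 1’s contents are not reproduced here:

```python
# Borda count over hypothetical ballots (best candidate listed first).
ballots = {
    "Vot1": ["Cd1", "Cd2", "Cd3", "Cd4"],
    "Vot2": ["Cd2", "Cd1", "Cd4", "Cd3"],
    "Vot3": ["Cd1", "Cd3", "Cd2", "Cd4"],
    "Vot4": ["Cd3", "Cd1", "Cd2", "Cd4"],
}
n_candidates = 4
scores = {}
for ranking in ballots.values():
    for i, cand in enumerate(ranking, start=1):
        scores[cand] = scores.get(cand, 0) + (n_candidates - i + 1)  # Mi = #candidates - i + 1
print(scores, "-> winner:", max(scores, key=scores.get))  # Cd1 wins here
```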

4.4 Datasets Description

In order to evaluate the performance of the proposed techniques, three datasets obtained from the online UCI repository were investigated in this study; they are among the most widely used by researchers in the literature [7]. A short description of each dataset is reported in Table 2. Note that two of the datasets contain missing values; we simply removed the affected instances, since their number was very small. Furthermore, the WPBC and Wisconsin datasets are unbalanced, so the Synthetic Minority Over-sampling Technique (SMOTE) [67] was used to address this problem.
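A minimal sketch of the rebalancing step with imbalanced-learn’s SMOTE, shown on the WDBC data bundled with scikit-learn for convenience (the study applies SMOTE to the unbalanced WPBC and Wisconsin datasets):

```python
# Oversample the minority class with SMOTE (imbalanced-learn).
from imblearn.over_sampling import SMOTE
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(len(y), "instances ->", len(y_res), "after oversampling")
```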

Table 2. Datasets description

4.5 Ensemble Selection Process

The first concern when constructing an effective ensemble classifier is to ensure that all individual classifiers are accurate [68]. The ensemble performance can then be improved by rejecting weak classifiers and combining accurate members only. Toward this aim, we select ensemble members based on accuracy, referred to as Selection by Accuracy (SbA). The SbA process is as follows (a code sketch is given after the list):

  1. Assess the performance of the N (7 in this study) single classifiers based on the three metrics: accuracy, precision and recall.

  2. Perform the SK statistical test on the accuracy values in order to cluster the classifiers evaluated in Step 1 into non-overlapping clusters. Each cluster includes one or more classifiers with comparable predictive capabilities. The best cluster is the one with the lowest error rate (error rate = 1 − accuracy); the classifiers belonging to this cluster are retained for the next step.

  3. Build a heterogeneous ensemble by combining the base classifiers belonging to the best cluster (if the best cluster contains only one classifier, we combine the classifiers of the two best clusters).

  4. Evaluate the performance of the heterogeneous ensemble constructed in Step 3 according to the three criteria: accuracy, recall and precision.
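A sketch of the SbA steps, under the assumption that the SK clusters (ordered from lowest to highest error rate) have already been computed, for instance with a helper like the scott_knott sketch in Sect. 4.2:

```python
# SbA sketch: combine the classifiers of the best Scott-Knott cluster by majority vote.
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

def select_by_accuracy(candidates, sk_clusters, X, y):
    """candidates: dict name -> unfitted classifier.
    sk_clusters: lists of names, best (lowest-error) cluster first."""
    chosen = list(sk_clusters[0])
    if len(chosen) == 1:                    # Step 3 fallback: two best clusters
        chosen += sk_clusters[1]
    ensemble = VotingClassifier([(n, candidates[n]) for n in chosen],
                                voting="hard")
    acc = cross_val_score(ensemble, X, y, cv=10, scoring="accuracy").mean()
    return ensemble, acc
```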

Since the diversity of the ensemble members is relevant to improving the accuracy of an ensemble, we also construct ensembles based on both accuracy and diversity, referred to as Selection by Accuracy and Diversity (SbAD). The SbAD process is as follows (a code sketch is given after the list):

  1. Carry out Steps 1, 2 and 3 of the SbA selection process.

  2. Calculate the diversity between the heterogeneous ensemble constructed in Step 1 (i.e. based on the accuracy criterion) and each of the remaining classifiers (i.e. the classifiers outside the best cluster).

  3. Select the single classifier with the highest diversity and include it in the SbA heterogeneous ensemble.

  4. Assess the performance of the heterogeneous ensemble constructed in Step 3.

  5. Repeat Steps 3 and 4 until the heterogeneous ensemble reaches N − 1 members.
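A sketch of the SbAD loop follows. Step 2’s diversity between the ensemble and a remaining classifier is approximated here by the candidate’s average pairwise Q statistic with the current members (lowest average Q = most diverse); this is one reasonable reading, not necessarily the exact computation used in the study.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_predict

def q_statistic(ci, ck):                          # Eq. 1, as in Sect. 2.3
    n11, n00 = np.sum(ci & ck), np.sum(~ci & ~ck)
    n10, n01 = np.sum(ci & ~ck), np.sum(~ci & ck)
    d = n11 * n00 + n01 * n10
    return (n11 * n00 - n01 * n10) / d if d else 0.0

def select_by_accuracy_and_diversity(candidates, sba_members, X, y):
    """candidates: dict name -> unfitted classifier; sba_members: SbA selection."""
    correct = {n: cross_val_predict(c, X, y, cv=10) == y   # per-instance correctness
               for n, c in candidates.items()}
    members = list(sba_members)
    rest = [n for n in candidates if n not in members]
    ensembles = []
    while rest and len(members) < len(candidates) - 1:     # grow up to N-1 members
        pick = min(rest, key=lambda r: np.mean(            # lowest mean Q = most diverse
            [q_statistic(correct[r], correct[m]) for m in members]))
        rest.remove(pick)
        members.append(pick)
        ensembles.append(VotingClassifier(
            [(n, candidates[n]) for n in members], voting="hard"))
    return ensembles      # each intermediate ensemble is then evaluated (Step 4)
```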

We also combined all seven single classifiers (DT, MLP, KNN, SVM-PUK, SVM-NP, SVM-RBF and SVM-LK) in one ensemble in order to compare its performance with that of the ensembles constructed using SbA and SbAD. We refer to this ensemble as the No Selection classifier (NSc). Figure 1 presents the experimental process we followed.

4.6 Abbreviations Used

The following abbreviation rules were used to simplify the names of ensembles:

E-SingleTechnique1SingleTechnique2
E-SingleTechnique1SingleTechnique2SingleTechnique3
…
E-SingleTechnique1SingleTechnique2…SingleTechniqueN

It is worth noting that for ensemble techniques, we shorten the names of single classifiers as well:

K for KNN, D for DT, M for MLP, L for SVM-LK, R for SVM-RBF, P for SVM-PUK and N for SVM-NP.

For example, EDKLM refers to the ensemble constructed by the fusion of the four single techniques, DT, KNN, SVM-LK and MLP.
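As an illustration, a hypothetical helper that decodes these abbreviations:

```python
# Hypothetical decoder for the ensemble name abbreviations described above.
CODES = {"D": "DT", "K": "KNN", "M": "MLP", "L": "SVM-LK",
         "R": "SVM-RBF", "P": "SVM-PUK", "N": "SVM-NP"}

def decode(ensemble_name):
    assert ensemble_name.startswith("E")
    return [CODES[c] for c in ensemble_name[1:]]

print(decode("EDKLM"))   # ['DT', 'KNN', 'SVM-LK', 'MLP']
```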

Fig. 1. Experimental process.

5 Empirical Results

This section discusses the empirical evaluation results of the 7 individual classifiers, the SbA and SbAD ensembles, as well as the NSc ensemble. The R software was used for the statistical tests, and the Waikato Environment for Knowledge Analysis (WEKA 3.9) was used to conduct the empirical evaluations [69].

5.1 Individual Classifiers Evaluation

This section presents the evaluation results of the 7 classification techniques over the three datasets based on three criteria (Accuracy, Recall, and Precision). Table 3 depicts the performance values of the 7 single techniques. As shown in Table 3, SVM-LK displayed the best performance on the WDBC dataset, SVM-PUK on the WPBC dataset, and DT on the Wisconsin dataset.

Based on the results of Table 3, we notice that no individual classifier is best across all datasets, and that the ranking of the same classifier varies depending on the dataset. This is because the performance of single techniques depends on the characteristics of the dataset (number of instances, dimensionality, number of classes, etc.) [8].

Table 3. Performance results.

5.2 Ensembles Evaluation

Figure 2 depicts the results of the SK test carried out on the error rates over all datasets. The SK test identified 2, 4, and 2 clusters in the WDBC, WPBC and Wisconsin datasets, respectively. Therefore, we constructed: (1) one SbA heterogeneous ensemble on the WDBC dataset, whose members were SVM-PUK, SVM-LK, KNN and DT; (2) one SbA heterogeneous ensemble on the Wisconsin dataset, whose members were DT, MLP and KNN; and (3) one SbA heterogeneous ensemble on the WPBC dataset, whose members were SVM-PUK, MLP, DT and KNN (since the best cluster of the WPBC dataset contained only one technique, we also included the techniques of the second-best cluster).

For the SbAD ensembles, for each dataset we constructed ensembles based on the SbAD selection process described above (see Sect. 4.5):

  • For Wisconsin, we constructed 3 SbAD ensembles: EDKNM, EDKLNM and EDKLNMP

  • For WDBC, we constructed 2 SbAD ensembles: EDLKMP and EDLKMNP

  • For WPBC, we constructed 2 SbAD ensembles: EDKMPR and EDKMPRN

Fig. 2. SK test of single techniques over all datasets.

WDBC Dataset:

Table 4 shows the performance metric values of the SbA and SbAD ensembles as well as the NSc ensemble for the WDBC dataset. We observe that the SbAD ensemble EDLKMP and the SbA ensemble come first with 97.72%, 97% and 97% for accuracy, recall and precision, respectively, and that the SbAD ensemble EDLKMNP comes second with 97.54%, 97.5% and 97.6% for accuracy, recall and precision, respectively. Note that the SbA and SbAD ensembles outperformed the NSc ensemble.

Table 4. Performance results: WDBC dataset.

Wisconsin Dataset:

Table 5 reports the performance metric values of the SbA and SbAD ensembles as well as the NSc ensemble for the Wisconsin dataset. In terms of the three performance metrics (accuracy, precision, and recall), the SbAD ensemble EDKLNMP marginally outperformed the others, with accuracy, precision and recall values of 97.07%, 97.1% and 97.1%, respectively.

Table 5. Performance results: Wisconsin dataset.

WPBC dataset:

Table 6 depicts the performance metric values of the SbA and SbAD ensembles as well as the NSc ensemble for the WPBC dataset. We observe that the SbA ensemble EDKMP outperformed all the other ensembles, achieving an accuracy, precision and recall of 91.22%, 91.5% and 91.2%, respectively. As can be seen from Table 6, the SbA and SbAD ensembles in general outperformed the NSc ensemble.

Table 6. Performance results: WPBC dataset.

5.3 Comparing SbA, SbAD and NSc Ensembles

To check for significant differences between the performances of the SbA, SbAD and NSc ensembles, the SK test was carried out on the error rate values. Figure 3 displays the results of the SK test on the built ensembles for each dataset. As can be observed in Fig. 3, only one cluster was identified by the SK test in all datasets, meaning that the SbA, SbAD and NSc ensembles show the same predictive capabilities in terms of accuracy in all datasets.

Fig. 3. Accuracy-based SK test results of the SbA, SbAD and NSc ensembles.

To compare the predictive capabilities of the SbA, SbAD and NSc ensembles more deeply, we used the Borda Count voting system to rank them based on the three performance metrics: accuracy, precision, and recall. Table 7 displays the ranking supplied by the Borda count voting system for each dataset. We note the following:

Table 7. Borda Count ranks of the best SK cluster techniques
  • Except in the WPBC dataset, where the SbA ensemble EDKMP was ranked first, the SbAD ensembles were generally ranked first (EDLKMP in WDBC and EDKLNMP in Wisconsin).

  • In general, the SbAD and SbA ensembles outperformed the NSc ensemble over all datasets.

  • The best SbAD ensemble in each dataset includes 5 to 6 single techniques (EDLKMP in WDBC, EDKLNMP in Wisconsin and EDKMPR in WPBC).

  • The members DT, KNN, MLP, and SVM-PUK were present in all the best SbAD/SbA ensembles (EDLKMP in WDBC, EDKLNMP in Wisconsin and EDKMP in WPBC).

6 Threats to Validity

This section discusses threats to this study’s validity, with regard to internal, external and construct validity.

Internal Validity:

This study used a 10-fold cross-validation evaluation method, which is commonly used in machine learning [10]. Another internal threat is the presence of missing values in two of the datasets used, which may affect the performance of a model constructed on these datasets and increase the likelihood of drawing incorrect conclusions. In this research, instead of using imputation techniques such as mean imputation or expectation-maximization [70], we deleted the instances with missing values because their number was small; when many values are missing, we recommend the use of imputation methods [71].

Furthermore, majority voting was used in this study to produce the final output of the proposed ensembles; as noted above, it is the most popular and most frequently used combination method in the ensemble literature [72, 73]. However, we are aware that other combination schemes, such as weighted majority voting, probabilistic combiners and the weighted sum, can yield different outcomes. Thus, future experiments might look into the effect of other combiners on the predictive capability of the ensembles.

External Validity:

External threats concern whether the results of this research can be generalized to other contexts [74]; it is therefore crucial to figure out how broadly the findings of this study apply. To counteract this threat, we selected three datasets that differ in size and number of features. Moreover, this study investigated only one diversity measure, the Q statistic, to calculate the classifiers’ diversity. Investigating other diversity measures, such as the double-fault measure [52], the disagreement measure [50] and the measure of difficulty [50], is recommended in order to generalize the findings of this study.

Construct Validity:

Construct validity concerns measurement validity [11], or more specifically, the reliability and credibility of the measures selected to assess the performance of the techniques. We mitigated this threat by using three performance metrics (Accuracy, Precision and Recall) in order to obtain results from various perspectives, and all three criteria yielded the same conclusions. To affirm or refute the reported results, other performance metrics should be investigated.

7 Conclusion and Future Work

This study assessed and compared the impact on ensemble performance of using both accuracy and diversity, instead of accuracy alone, to select ensemble members in BC classification. Moreover, it evaluated and compared the performance of ensembles with and without member selection. We used the Q statistic to evaluate the diversity of ensemble members, and the majority voting combination rule to combine the members of an ensemble. The experiments were carried out on three well-known BC datasets from online repositories. The SK test and the Borda Count were used to assess the significance of performance differences and to rank the ensembles, respectively.

The findings were:

  (1) Investigating both accuracy and diversity to select ensemble members can improve the performance of an ensemble. This confirms the findings of [75].

  (2) Selecting ensemble members using accuracy and/or diversity generally led to better ensemble performance than constructing ensembles without member selection. This confirms the results of Zhou et al. [76].

  (3) The performance of individual classifiers is influenced by the characteristics of the dataset, as the rank of the same technique depended on the dataset.

Ongoing work focuses on investigating other diversity measures and other combination rules to construct better heterogeneous ensembles in BC diagnosis.