1 Introduction

Effective and well-informed disease treatment is a demanding and challenging problem in bioinformatics and medical research [33]. Among dangerous diseases, cancer is one of the most precarious: it arises from the uncontrolled growth of abnormal cells in the human body. In cancer, old cells do not decay and cell growth becomes uncontrolled, which has made cancer the second most deadly disease in the world [11, 41]. Breast cancer is a type of cancer in which a mass of cells forms in breast tissue, and the continuously growing cancerous tissue starts invading other normal parts of the body [41]. Breast cancer is the most prominent type of cancer, as shown in Fig. 1 with data from the World Health Organization (WHO) for the year 2020. Most treatments offered to cancer patients depend on early diagnosis of cancer symptoms, which affects patient survival. When a patient is diagnosed with a tumor, the very first step taken by doctors is to identify whether it is malignant or benign: a malignant tumor is cancerous whereas a benign one is non-cancerous. Differentiating the tumor type is therefore very important for a better and more effective cure and for treatment planning, and doctors require a reliable mechanism to distinguish the two types. Across almost all countries, cancer is the root cause of death for around eight million people annually. Every possible treatment plan for cancer patients requires a deep study of behavioral changes in cells; approximately 85% of breast cancer cases occur in women [8, 10]. Moreover, most nations are still developing, with limited resources, rapid population growth, and a completely unbalanced doctor-to-patient ratio. It is therefore very important to understand the disease's behavior through automation, so that accurate classification using machine learning techniques can offer patients effective and early treatment.

Fig. 1 Reports and mortality rates for breast cancer in 2020 from WHO

Recently, numerous diagnostic institutions, research centers, hospitals, and websites have been producing a huge amount of medical diagnostic data, so it is essential to classify and automate its analysis to speed up disease diagnosis. Nowadays, machine learning techniques are widely embraced in most fields of data analysis, classification, and prediction [11]. These techniques facilitate the analysis of such data, produce revolutionary information for the medical community, and play a vital role in the serious and sophisticated evaluation of various kinds of medical data. Several machine learning and data mining algorithms are widely used for the prediction of breast cancer, and it is very important to select the most appropriate algorithms for its classification [24]. These learning techniques require a huge amount of historical data for training, and their prediction quality depends on that learning, which demands enormous computing power [36]. Healthcare likewise relies on large amounts of data, from diagnosis and pathology to drug discovery and epidemiology.

Machine learning algorithms can provide a more reliable, effective, and easier solution for automatic cancer detection. Well-formulated learning algorithms have the ability to adapt to new data independently and iteratively. The prerequisite of any machine learning technique is continuous learning from historical data; this learning helps machines identify the patterns available in the data and produce reliable, informed results for decision-making tasks. The main goal of machine learning is to make a system learn without human intervention, which helps in designing automated decision-making systems. Tumor-type identification is the essential first phase in suggesting a better treatment plan to a patient and increases the probability of survival. Breast cancer manifests as either a benign tumor or a malignant tumor.

Fig. 2 A general ensemble framework for breast cancer prediction

A number of algorithms have been utilized in the literature to address breast cancer classification by considering histopathological images and, in some cases, symptoms such as fatigue, headaches, pain, and numbness. Breast cancer can also be categorized using ensemble-based classification techniques with an improved prediction rate. In this work, we considered Extra Tree, Random Forest, AdaBoost, Gradient Boosting, and KNN9 (9-nearest neighbour) as base learners for the proposed stacking ensemble classification framework. Further, we evaluate the performance of SELF using the BreakHis and WBCD datasets [36, 42]. The main objective of this research work is to detect and categorize malignant and benign patients faster and to improve prediction accuracy along with other performance parameters. Figure 2 exhibits the structure of a stacked-based ensemble model for cancer classification. Our contributions in this paper are summarized as:

  1. Studying the existing literature to identify the research gaps.

  2. Identifying appropriate dataset(s) on breast cancer, because most of the existing works have considered datasets with a limited number of images/instances. We have considered the BreakHis dataset with 7909 histopathological images and the WBCD dataset with 569 instances.

  3. Identifying the best-performing machine learning models for both datasets to create a stacked-based ensemble learning framework.

  4. Proposing SELF, a stacked-based ensemble learning framework, to classify breast cancer with better accuracy, and evaluating the performance of the proposed framework on the testing dataset.

The rest of the paper is organized as follows: Sect. 2 discusses the related work in the area of breast cancer prediction, and Sect. 3 discusses the different machine learning algorithms used in this work. Section 4 provides the details of our proposed ensemble framework for breast cancer classification, and a detailed discussion of the performance evaluation of SELF is presented in Sect. 5. Finally, we conclude our work in Sect. 6 with further possible improvements.

2 Related works

This section presents the existing works related to breast cancer classification/prediction using image and numerical datasets. The categorization of breast cancer is a classification problem that needs the extraction of relevant features. The amalgamation of medical science and artificial intelligence has great significance, and several researchers working in this domain have produced extraordinary results. Recently, numerous automatic models have been proposed in the literature for breast cancer classification using different machine learning (ML), deep learning (DL), and ensemble learning (EL) approaches. Researchers have adopted a number of existing ML/DL approaches, such as K-nearest neighbor (K-NN), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Extra Tree (ET), ResNet50, VGG16, and VGG19, as well as a number of ensemble learning approaches that combine existing machine learning and deep learning models into better ensemble-based classifiers for breast cancer classification. They have put considerable effort into improving the classification accuracy of their algorithms on different breast cancer datasets. We summarize some of the recent machine learning, deep learning, and ensemble learning techniques for breast cancer classification in Tables 1, 2, and 3 respectively.

From Tables 1, 2, and 3, we have made the following observations on the datasets, learning methodologies/techniques, and learning algorithms used to address the breast cancer classification problem:

Table 1 The recent machine learning based techniques for breast cancer classification
Table 2 The recent deep learning based techniques for breast cancer classification
Table 3 The recent ensemble learning based techniques for breast cancer classification
  • Most researchers work either on publicly available benchmark datasets such as BreakHis, WBCD, BCW, CBCD, FCNC, BACH, and CBIS-DDSM ROI or on custom datasets they created by collecting a smaller number of ultrasound images of patients to address the problem of breast cancer classification/detection/prediction; the available datasets are either image datasets [2,3,4, 6, 9, 12, 14, 17, 18, 21, 22, 25,26,27,28,29,30,31, 34, 38, 43, 44] or numerical datasets [1, 5, 10, 16, 19, 32, 39, 40]. Many existing classification works use datasets with few sample images [1, 3, 5, 6, 16, 17, 19, 22, 26,27,28, 31, 32, 34, 39, 40, 43], which may not be sufficient to train deep learning algorithms, because the training process of a deep learning model requires a large amount of image data. The most frequently used datasets are BreakHis [2, 4, 6, 9, 12, 14, 21, 25, 29, 30, 38, 44] and WBCD [1, 5, 10, 16, 19, 32, 39, 40]. The WBCD dataset consists of only 569 or 699 instances with 32 features, while the BreakHis dataset consists of 7909 images. Therefore, the BreakHis and WBCD datasets are the most suitable, with a sufficient number of breast cancer instances.

  • Researchers have adopted machine learning [1, 6, 10, 22, 27], deep learning [2, 4, 12, 14, 17, 26, 29, 38, 43, 44], or ensemble learning [3, 5, 16, 18, 19, 21, 25, 28, 30,31,32, 34, 39, 40] techniques to address the breast cancer classification problem and have put considerable effort into improving the performance of their proposed approaches. From Tables 1, 2, and 3, we observed that most of the existing works mainly adopt deep learning techniques because the datasets are image datasets [2, 4, 9, 12, 14, 17, 18, 21, 26, 29, 30, 38, 43, 44]. The Convolutional Neural Network (CNN), a deep learning model, performs well on image datasets because it handles the entire feature engineering phase and extracts features from an image in an efficient way. However, CNN training still faces obstacles: it requires a large amount of training data and expensive computational resources, and it is time-consuming [35]. Other researchers have found several machine learning techniques that work with image datasets and have adopted them to address the breast cancer classification problem [6]; a machine learning model can be trained on a small dataset at lower computational cost. An ensemble learning technique, on the other hand, takes advantage of different machine learning/deep learning algorithms to build a powerful classifier; these techniques are mainly classified as bagging, boosting, and stacking. Several researchers have adopted different ensemble learning techniques to build efficient breast cancer classifiers. However, the existing ensemble-based breast cancer classifiers have been evaluated on either the WBCD dataset or the BreakHis dataset: ensemble classifiers built from machine learning algorithms perform well on the WBCD dataset, while ensemble classifiers built from deep learning algorithms perform well on the BreakHis dataset. Therefore, we have an opportunity to take advantage of both ensemble learning techniques and machine learning algorithms to build an efficient breast cancer classifier with reduced computational cost on the BreakHis dataset.

  • Researchers have used different deep learning or machine learning algorithms to build automated breast cancer classifiers on publicly available benchmark image datasets. CNN-based deep learning algorithms perform well on image datasets because they are specially designed for them. On the other hand, some machine learning algorithms, such as SVM, ANN, Random Forest, and Extra Tree, work efficiently on image datasets and can be an alternative to CNN [6]. SVM is the most popular classification algorithm for separating the given data objects into multiple classes using an optimal hyperplane. The RF classifier combines multiple decision tree predictors into one and produces good results even without hyperparameter tuning [20]; its advantages are that it works efficiently on unbalanced datasets and is very fast. Other machine learning algorithms have also been applied to breast cancer diagnosis via image classification [15, 37]. Therefore, machine learning techniques can be a good alternative for solving the breast cancer classification problem with limited computational resources.

In this work, we have compiled the findings of recent approaches for breast cancer classification and, to the best of our knowledge, found that there is still scope for improvement over the existing approaches. We therefore opt for an ensemble-learning-based classifier built from the most effective machine learning algorithms to classify breast cancer with improved accuracy and a reduced false negative rate. Our analysis shows that the performance of the proposed framework is slightly inferior to the existing deep-learning-based ensemble classifiers on the BreakHis dataset, while SELF performs well on the WBCD dataset. Overall, the predicted outcome of the proposed work is at an acceptable level with minimal computational power.

3 Classification algorithms

This section discusses the major classification algorithms used in the proposed ensemble framework. Initially, we applied different classification algorithms to the breast cancer data to achieve better accuracy, and then created our final ensemble model using the top-performing classification models. We trained several distinct classifiers on the training dataset and selected the best five among them on the basis of their accuracy to create the resultant ensemble model. A detailed description of the trained classification models is given in the following subsections.

3.1 Random forest (RF)

The Random Forest (RF) algorithm [20] is a well-known supervised learning algorithm that addresses both classification and regression problems in machine learning. It uses an ensemble learning technique, bagging, that trains multiple decision trees on different subsets of the given dataset. The final prediction is obtained by aggregating the predictions of all decision trees, which improves predictive accuracy. The RF classifier generates a large number of trees as a forest, and the number of trees in the forest affects its accuracy: in general, the more trees in the forest, the higher the accuracy, and vice versa. While growing the forest, RF employs bagging and randomness in the creation of each tree. The split at each node of a decision tree is decided by the entropy measure, described by the following equation:

$$\begin{aligned} Entropy(E) = - \sum _{i = 1}^{C} p_i \log _2 p_i \end{aligned}$$
(1)

where \(p_i\) is the relative frequency of the \(i\)th class and C is the total number of classes.
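As an illustration, the following minimal sketch, assuming scikit-learn and placeholder arrays in place of the actual extracted features, computes the entropy of Eq. (1) and configures a random forest with the entropy criterion and the 500 trees selected later in Sect. 4.3:

```python
# Hedged sketch: entropy of Eq. (1) plus an RF using it as the split
# criterion; X and y are random placeholders, not the real features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def entropy(labels):
    """Entropy(E) = -sum_i p_i * log2(p_i) over the C classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # relative class frequencies p_i
    return -np.sum(p * np.log2(p))

X = np.random.rand(200, 10)            # placeholder feature matrix
y = np.random.randint(0, 2, 200)       # placeholder benign/malignant labels
print(f"Entropy of the labels: {entropy(y):.3f}")

rf = RandomForestClassifier(n_estimators=500, criterion="entropy",
                            random_state=42).fit(X, y)
```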

3.2 K-nearest neighbor (kNN)

The K-Nearest Neighbor (kNN) algorithm [23] is a simple, easy-to-implement algorithm that can be used for both classification and regression problems. As the name suggests, it takes the k nearest neighbors into account during classification or regression. It is a distance-based algorithm: it calculates the distance between a new data point and all existing data points in the training set and then chooses the k data points closest to the new instance. Finally, the new point is classified by the majority class of the k chosen data points. In this work, we chose k = 9 using cross-validation and the Euclidean distance measure to get better results. The distance (d) between two data points is calculated by:

$$\begin{aligned} d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \end{aligned}$$
(2)

where (\(x_1\),\(y_1\)) and (\(x_2\),\(y_2\)) are any two data points in the given data set.
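A minimal sketch of this classifier, assuming scikit-learn and placeholder data: k = 9 neighbours with the Euclidean distance of Eq. (2), which is scikit-learn's default Minkowski metric with p = 2:

```python
# Hedged sketch: 9-nearest-neighbour classification with Euclidean
# distance; the data arrays are random placeholders.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.random.rand(200, 10)
y_train = np.random.randint(0, 2, 200)
X_test = np.random.rand(20, 10)

knn9 = KNeighborsClassifier(n_neighbors=9, metric="minkowski", p=2)
knn9.fit(X_train, y_train)
pred = knn9.predict(X_test)   # majority class among the 9 nearest points
```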

3.3 Extra tree (ET)

The Extra Tree (ET) algorithm [1] is used to solve both classification and regression problems. Like the RF algorithm, the ET algorithm randomly selects subsets of features from the dataset and trains a decision tree on each. It generates multiple trees and combines the predictions from all of them to obtain the resultant classification. It differs from the random forest in two ways: (1) it does not bootstrap observations, i.e., it selects samples without replacement; and (2) its randomness comes from random split points chosen over all observations rather than from bootstrap sampling. Here, we have also used the entropy measure, which can be calculated by Eq. (1).
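Both differences from the random forest map directly onto estimator parameters, as the hedged scikit-learn sketch below illustrates on placeholder data; the random split thresholds are internal to ExtraTreesClassifier:

```python
# Hedged sketch: extra trees without bootstrapping (samples drawn
# without replacement) and with 500 trees as tuned in Sect. 4.3;
# X and y are random placeholders.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

X = np.random.rand(200, 10)
y = np.random.randint(0, 2, 200)

et = ExtraTreesClassifier(n_estimators=500,
                          criterion="entropy",   # Eq. (1)
                          bootstrap=False,       # no bootstrap sampling
                          random_state=42).fit(X, y)
```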

3.4 Adaptive boosting (AdaBoost)

The Adaptive Boosting (AdaBoost) [7] classifier is a meta-estimator that begins by fitting a classifier on the initial dataset. It then fits additional copies of the classifier on the same dataset while increasing the weights of incorrectly classified samples, so that subsequent classifiers focus more on difficult cases. The idea behind the boosting technique is to connect multiple weak classifiers in series to build a strong classifier: we build the first classifier on the training dataset and then build the second classifier to rectify the errors made by the first, continuing until the error is minimized. A boosted classifier is defined as:

$$\begin{aligned} F_T (x) = \sum _{t = 1}^{T} f_t (x) \end{aligned}$$
(3)

where each \(f_t\) is a weak learner that takes an object \(x\) as input and returns a value indicating the class of the object.
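The additive form of Eq. (3) can be observed through staged predictions, which expose the partial sums \(F_1(x), \ldots, F_T(x)\) as weak learners are added; the sketch below assumes scikit-learn, its default decision-stump weak learner, and placeholder data:

```python
# Hedged sketch: AdaBoost as a series of re-weighted weak learners;
# staged_predict yields the prediction of F_t(x) after t rounds.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

X = np.random.rand(200, 10)
y = np.random.randint(0, 2, 200)

ada = AdaBoostClassifier(n_estimators=50, random_state=42).fit(X, y)
for t, stage_pred in enumerate(ada.staged_predict(X), start=1):
    if t % 10 == 0:   # training accuracy of the partial ensemble
        print(f"F_{t}(x): accuracy {accuracy_score(y, stage_pred):.3f}")
```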

3.5 Gradient boosting (GB)

The Gradient Boosting (GB) algorithm [13] is one of the most powerful and widely used techniques in machine learning. Unlike AdaBoost, the base estimator in gradient boosting is fixed, typically a decision stump. The algorithm solves both classification and regression problems. It is a numerical optimization approach that determines the best additive model for minimizing the loss function: at each step, the GB approach constructs a new decision tree that optimally reduces the loss. In regression, the process starts with an initial estimate, usually a decision tree that minimizes the loss function; then a new decision tree is fitted to the current residuals and added to the previous model, updating the residuals at each step. This is a stage-wise procedure, which means that the decision trees added in earlier phases are not changed in subsequent ones. By fitting decision trees to the residuals, the model improves exactly where it previously performed poorly. The predicted residual at each leaf can be calculated as follows:

$$\begin{aligned} Predicted\;residual = \frac{\sum residual_i}{\sum [preprob_i \times (1 - preprob_i)]} \end{aligned}$$
(4)

where \(residual_i\) is the residual of the ith sample in the leaf and \(preprob_i\) is its previously predicted probability.
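A minimal sketch of this stage-wise procedure, assuming scikit-learn and placeholder data; max_depth=1 gives stump-like base estimators, and learning_rate scales each new tree's contribution before it is added to the model:

```python
# Hedged sketch: gradient boosting where each new tree is fitted to
# the residuals of the current model; X and y are placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

X = np.random.rand(200, 10)
y = np.random.randint(0, 2, 200)

gb = GradientBoostingClassifier(n_estimators=100,
                                learning_rate=0.1,  # shrinks each tree
                                max_depth=1,        # decision stumps
                                random_state=42).fit(X, y)
```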

4 SELF: stacked-based ensemble learning framework

This section presents the SELF for breast cancer classification based on the stacking technique. Stacking is an ensemble machine-learning technique that learns how to combine the predictions of well-performing machine-learning models. The following subsections present a detailed description of the dataset, model selection, hyperparameter tuning, phases, and complexity analysis of the SELF.

4.1 Datasets description

Table 4 Distribution of images in BreaKHis dataset
Table 5 Image distribution of malignant tumor
Table 6 Image distribution of benign tumor
Table 7 Description of WBCD dataset

In this work, we considered the publicly available BreakHis dataset, with a total of 7909 images collected from 82 patients [36], and the WBCD dataset, with 569 instances collected from 8 different groups [42]. BreakHis is a breast cancer histopathological database composed of microscopic images of breast tumor tissue at four magnification factors: 40×, 100×, 200×, and 400×. The dataset contains 2480 benign and 5429 malignant samples (700 × 460 pixels, 3-channel RGB, 8-bit depth per channel, PNG format). Table 4 exhibits the image distribution in the BreaKHis dataset, and it can be observed that there is a huge class imbalance: the number of malignant images is almost double the number of benign images. Table 5 exhibits the classification of the malignant images into Ductal Carcinoma (DC), Mucinous Carcinoma (MC), Lobular Carcinoma (LC), and Papillary Carcinoma (PC). Table 6 exhibits the classification of the benign images into Tubular Adenoma (TA), Adenosis (A), Fibroadenoma (F), and Phyllodes Tumor (PT). We applied different data augmentation techniques (scaling, rotation, flipping, shuffling, zooming, and shearing) to deal with the class imbalance problem. In this work, we classified the images into two classes, benign and malignant, which is a supervised learning problem. On the other hand, the WBCD dataset consists of a total of 569 instances with 32 features computed from digitized images of Fine Needle Aspirates (FNA) of breast masses; the distribution of malignant and benign tumors is approximately 37% and 63% respectively. Table 7 exhibits the description of the WBCD dataset.
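The augmentation step could be realized as in the hedged sketch below; the paper names the transformations but not the library, so Keras' ImageDataGenerator and a benign/malignant directory layout are assumptions here:

```python
# Hedged sketch: one possible implementation of the augmentation
# pipeline (scaling, rotation, flipping, shuffling, zooming, shearing);
# the library and the "breakhis/train" layout are assumptions.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(rescale=1.0 / 255,    # scaling
                               rotation_range=20,     # rotation
                               horizontal_flip=True,  # flipping
                               vertical_flip=True,
                               zoom_range=0.2,        # zooming
                               shear_range=0.2)       # shearing

train_flow = augmenter.flow_from_directory(
    "breakhis/train",            # assumed benign/ and malignant/ subdirs
    target_size=(460, 700),      # dataset images are 700 x 460 pixels
    class_mode="binary",
    shuffle=True)                # shuffling
```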

4.2 Model selection

Figure 3 exhibits the proposed ensemble framework for breast cancer prediction. We obtained the breast cancer datasets from Kaggle and preprocessed them to handle the class imbalance problem. After that, we split the processed dataset into 80% for training and 20% for testing. Our proposed framework is based on the stacking ensembling technique, which differs from bagging and boosting. In this work, we trained a total of 9 base learners on the training dataset: Random Forest, Support Vector Machine, K-Nearest Neighbor, Extra Tree, AdaBoost, Stochastic Gradient Descent, Gradient Boosting, Multilayer Perceptron, and Classification and Regression Trees (CART). Table 8 exhibits the accuracy of the different base learners on the BreakHis training dataset. Tables 9 and 10 exhibit the performance of the base learners on the BreakHis and WBCD testing datasets respectively; out of these, we selected the top-5 best-performing common base learners to create our final stacked ensemble meta-model: the Extra Tree classifier, Random Forest, AdaBoost, Gradient Boosting, and 9-Nearest Neighbor (KNN9). A minimal selection sketch is given below.
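The sketch assumes scikit-learn implementations of the nine candidates and placeholder data in place of the extracted features:

```python
# Hedged sketch: train the nine candidate base learners on the 80%
# split and rank them by test accuracy; the top five feed the stack.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(500, 10)
y = np.random.randint(0, 2, 500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42)  # 80/20 split

candidates = {"RF": RandomForestClassifier(), "SVM": SVC(),
              "KNN9": KNeighborsClassifier(n_neighbors=9),
              "ET": ExtraTreesClassifier(), "AdaBoost": AdaBoostClassifier(),
              "SGD": SGDClassifier(), "GB": GradientBoostingClassifier(),
              "MLP": MLPClassifier(max_iter=1000),
              "CART": DecisionTreeClassifier()}
scores = {name: clf.fit(X_tr, y_tr).score(X_te, y_te)
          for name, clf in candidates.items()}
print(sorted(scores, key=scores.get, reverse=True)[:5])  # top-5 learners
```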

Table 8 Performance of the base-learners on BreakHis dataset (train set)
Table 9 Performance of the base-learners on BreakHis dataset (test set)
Table 10 Performance of used base-learners on WBCD dataset (test set)

4.3 Hyperparameter tuning

To improve the performance of the adopted classifiers, we performed hyperparameter tuning for the different classifiers, which improved the performance of the proposed framework. For instance, we evaluated several values of the n_estimators parameter (50, 100, 500, 1000, 1500, and 2000) for the Random Forest classifier, achieved the best accuracy with 500 estimators, and employed the entropy criterion instead of Gini. Further, we ran the kNN model with several choices of k and achieved the best accuracy at k = 9. For the Extra Tree classifier, we investigated the same values of n_estimators (50, 100, 500, 1000, 1500, and 2000) and again achieved the best accuracy with 500 estimators. The selected hyperparameters, with their best values and accuracies for both datasets, are given in Table 11; a hedged tuning sketch follows below.
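The sketch covers the random forest; the grid matches the values listed above, while the use of GridSearchCV itself is an assumption, since the paper reports only the values tried:

```python
# Hedged sketch: cross-validated grid search over n_estimators and the
# split criterion; X and y are random placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X = np.random.rand(300, 10)
y = np.random.randint(0, 2, 300)

param_grid = {"n_estimators": [50, 100, 500, 1000, 1500, 2000],
              "criterion": ["gini", "entropy"]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring="accuracy", cv=5)
search.fit(X, y)
print(search.best_params_)   # reported best: 500 estimators, entropy
```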

Table 11 Hyperparameter tuning of the BreakHis and WBCD datasets
Fig. 3 Framework of SELF for breast cancer prediction

4.4 SELF

Stacking, also known as super learning, is an ensemble strategy that trains a “meta-learner” on the outputs of a variety of classification models. The goal of stacking is to bring diverse groups of strong learners together to improve the accuracy of the resultant model. Our proposed model works in three phases: data preprocessing, feature extraction, and model creation and evaluation. Algorithm 1 exhibits the overall working of the proposed stacked classifier for breast cancer prediction.

  • Data preprocessing phase In this phase, we handle the image imbalance in the BreakHis dataset by resizing the given images and applying image augmentation techniques: scaling, rotation, flipping, shuffling, zooming, and shearing. Finally, we create an array of training images.

  • Feature extraction phase After data preprocessing, we extract 10 features from the images, such as pixel features, Gabor features, and edge features, and remove noise in this phase.

  • Model building and evaluation phase Figure 3 exhibits the working of the proposed framework. After feature extraction, we prepare our final ensemble model for breast cancer prediction. In this phase, we first train the different machine learning models on our training dataset using the different features, and then evaluate all trained models on the testing dataset. We utilize the power of multiple base learners to create a meta-learner classifier using the stacking technique. The architecture of a stacking model involves two or more base learners, often referred to as level-0 learners, and a meta-learner, referred to as the level-1 learner, that combines the predictions of the base learners. We train the level-0 learners on the training data and compile their predictions; the level-1 learner learns how to best combine the predictions of the trained base learners. We trained multiple base learners using 10-fold cross-validation on our training data at level-0, and the output of the base learners is used as input to the meta-learner at level-1. The meta-learner is trained on the predictions made by the different base learners on out-of-sample data. Finally, we build the resultant meta-learner, which interprets the predictions of the base learners smoothly and uses a linear model, logistic regression, to handle the classification problem. The resultant meta-learner is validated on the test data and its performance is evaluated on the basis of several performance metrics. A minimal sketch of this level-0/level-1 structure is given after this list.
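The sketch assumes scikit-learn's StackingClassifier and placeholder data; the five selected base learners sit at level-0, logistic regression is the level-1 meta-learner, and cv=10 produces the out-of-sample predictions it is trained on:

```python
# Hedged sketch: SELF's stacking layout with the five selected base
# learners and a logistic-regression meta-learner; data are placeholders.
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(500, 10)
y = np.random.randint(0, 2, 500)

level0 = [("et", ExtraTreesClassifier(n_estimators=500, criterion="entropy")),
          ("rf", RandomForestClassifier(n_estimators=500, criterion="entropy")),
          ("ada", AdaBoostClassifier()),
          ("gb", GradientBoostingClassifier()),
          ("knn9", KNeighborsClassifier(n_neighbors=9))]

self_clf = StackingClassifier(estimators=level0,
                              final_estimator=LogisticRegression(),
                              cv=10)   # 10-fold CV for level-0 outputs
self_clf.fit(X, y)
print(self_clf.predict(X[:5]))   # 0 = benign, 1 = malignant (placeholder)
```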

4.5 Complexity analysis of SELF

Table 12 exhibits the training complexity of the different machine learning models used in our proposed ensemble model. Let the training dataset have n training examples and m features, and let k be the number of neighbors, p the number of decision trees (or stumps), and d the maximum depth of a decision tree. Figure 4 exhibits the inferential flowchart of the proposed SELF classifier, in which the kNN, Random Forest, Extra Tree, Gradient Boosting, and Adaptive Boosting classifiers are used as base learners at level-0 and the Logistic Regression classifier is used as the meta-learner at level-1, producing the final prediction. The time complexity of a stacked classifier depends on the training pattern of the base learners: they can be trained in either a serial or a parallel manner, which influences the complexity of the resultant stacked classifier. Suppose we have N base learners in total, the complexity of the ith base learner is \(C_i\), and the complexity of the meta-learner is \(C_{meta}\).

Fig. 4 Inferential flowchart of SELF

  • If we train the base-learner in a serial manner then the complexity of the stacked classifier is given as:

    $$\begin{aligned} Complexity\_serial = \sum _{i=1}^{N} (C_i) + C_{meta} \end{aligned}$$
    (5)
  • If we train the base-learner in a parallel manner then the complexity of the stacked classifier is given as:

    $$\begin{aligned} Complexity\_parallel = \max _{i=1}^{N} (C_i) + C_{meta} \end{aligned}$$
    (6)

Therefore, we can compute the training complexity of the proposed model by using either Eq. (5) or Eq. (6).

Table 12 The complexity of different base learners

5 Results and discussion

This section presents the evaluation of the proposed framework along with detailed findings. We evaluated the performance of SELF using several performance metrics on the BreakHis and WBCD datasets and compared it with the existing works based on accuracy, sensitivity, precision, F1-Score, ROC, and MCC. We first present a detailed description of the various performance metrics and then compare SELF with the existing works.

5.1 Performance metrics

Figure 5 exhibits a generalized confusion matrix that helps to define the different performance parameters. The performance metrics accuracy, sensitivity, precision, F1-Score, ROC, and MCC are defined as follows:

Fig. 5 A generalized confusion matrix

  • Accuracy: It is one of the most frequently used measures for assessing the performance of a classifier. It is the proportion of correctly classified samples, defined as:

    $$\begin{aligned} Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
    (7)
  • Precision: It is the fraction of instances predicted positive by a classifier that are actually positive, defined as:

    $$\begin{aligned} Precision = \frac{TP}{TP + FP} \end{aligned}$$
    (8)
  • Sensitivity or true positive rate (TPR) or recall: It describes a classifier's ability to correctly predict a positive outcome when the disease is present, and is defined as:

    $$\begin{aligned} Sensitivity = \frac{TP}{TP + FN} \end{aligned}$$
    (9)
  • Specificity or true negative rate (TNR): It is a classifier's likelihood of predicting a negative outcome when the disease is absent, and is computed as:

    $$\begin{aligned} Specificity = \frac{TN}{ TN + FP} \end{aligned}$$
    (10)
  • F1-score: It is the harmonic mean (weighted average) of precision and recall. A score of 1 indicates the best model, while 0 indicates the worst; true negatives are not taken into account in F-measures. The F1-score is computed as:

    $$\begin{aligned} F1{-}Score = 2 * \frac{Precision * Recall}{Precision + Recall} \end{aligned}$$
    (11)
  • Matthews correlation coefficient (MCC): MCC, a correlation coefficient between predicted and actual classes, is used for binary classification. The value of MCC ranges from \(-\)1 to +1, where +1 indicates a perfect prediction, 0 indicates no better than random prediction, and \(-\)1 indicates complete disagreement between predicted and actual results. MCC is defined as:

    $$\begin{aligned} MCC = \frac{TP * TN - FP * FN}{\sqrt{(TP + FP)*(TP + FN)*(TN + FP)*(TN + FN)}} \end{aligned}$$
    (12)
  • Area under the curve-receiver operating characteristics (AU-ROC): AU-ROC is one of the most important and extensively used performance metrics for classification problems at various threshold settings. The ROC is a probability curve and the AUC measures the degree of separability; a higher AUC indicates a better predictive model. The ROC curve plots the True Positive Rate (TPR) on the Y-axis against the False Positive Rate (FPR) on the X-axis. The AU-ROC is defined as:

    $$\begin{aligned} AU{-}ROC = \frac{1}{2}*\left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right) \end{aligned}$$
    (13)

where TP, FP, TN, and FN denote True Positive, False Positive, True Negative, and False Negative respectively.
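These metrics can be computed directly from predictions, as in the sketch below with placeholder labels; scikit-learn covers Eqs. (7)-(9) and (11)-(13), and specificity (Eq. 10) is derived from the confusion matrix since no direct helper exists:

```python
# Hedged sketch: the metrics of Eqs. (7)-(13) on placeholder labels.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])   # placeholder labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1])   # placeholder predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Accuracy   :", accuracy_score(y_true, y_pred))       # Eq. (7)
print("Precision  :", precision_score(y_true, y_pred))      # Eq. (8)
print("Sensitivity:", recall_score(y_true, y_pred))         # Eq. (9)
print("Specificity:", tn / (tn + fp))                       # Eq. (10)
print("F1-Score   :", f1_score(y_true, y_pred))             # Eq. (11)
print("MCC        :", matthews_corrcoef(y_true, y_pred))    # Eq. (12)
print("AU-ROC     :", roc_auc_score(y_true, y_pred))        # Eq. (13)
```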

Table 13 Performance of base-learners on BreakHis dataset
Table 14 Performance of base-learners on WBCD dataset

5.2 Comparison of SELF with different base-learners

We compared our proposed classifier with the different baseline models on the BreakHis and WBCD datasets. Figures 6 and 7 exhibit the comparison of our proposed model with the different baseline models for breast cancer prediction on the BreakHis and WBCD datasets respectively. From the figures, we observe that our proposed model performs better than the baseline machine learning models and other existing models, giving approximately 95% and 99% accuracy on the BreakHis and WBCD testing datasets respectively. Tables 13 and 14 exhibit performance comparisons of SELF with the base learners on the BreakHis and WBCD datasets respectively. From Table 13, we can also observe that our proposed classifier has the highest F1-Score, ROC, and MCC scores, with values of 94.17%, 89.41%, and 80.81% respectively. With respect to sensitivity, our proposed classifier achieved the second-highest score of 95.96%, whereas the Random Forest and Extra Tree classifiers have the highest sensitivity score of 97.11%. On the other hand, from Table 14, we can observe that our proposed classifier has the highest sensitivity, F1-Score, and MCC scores, with values of 99.09%, 99.09%, and 97.45% respectively. With respect to precision, our proposed classifier achieved the second-highest score of 99.09%, whereas the Random Forest has the highest precision score of 100.00%. Figure 8 exhibits the ROC curves of the best-performing models on BreakHis; this curve displays each classifier's diagnostic ability, and the closer the area under the ROC curve is to one, the greater the model's diagnostic capability. From the figure, we observe that the area covered by our proposed model is the highest, i.e., 0.984. Similarly, Fig. 9 exhibits the comparison of the precision, sensitivity, specificity, and ROC of the proposed classifier with the different machine learning models. From these figures, we observe that our proposed ensemble classifier outperforms the other ML models in most cases.

Fig. 6 Comparison of the proposed SELF with the base-learners on BreakHis dataset

Fig. 7 Comparison of the proposed SELF with the base-learners on WBCD dataset

Fig. 8 AUC-ROC curve of the SELF with respect to base-learners on BreakHis dataset

Fig. 9 Comparison graph of proposed framework with different base-learner models on precision, sensitivity, specificity and ROC on BreakHis dataset

Fig. 10 Comparison of the proposed SELF with the existing classifiers on BreakHis dataset

5.3 Comparison of SELF with the existing breast cancer classifiers

We also compared the proposed SELF with other existing classifiers for breast cancer classification on the BreakHis and WBCD datasets. Figure 10 exhibits the comparison of the SELF model with the existing classifiers on the BreakHis dataset. We observe that our ensemble-based classifier achieved an accuracy of 94.35% on the BreakHis dataset; however, Zou et al. [44], with a deep-learning-based model, and Karthik et al. [21], with a deep-learning-based ensemble, achieved better accuracy than SELF. On the other hand, SELF outperformed the other deep-learning-based ensemble classifiers, as shown in Fig. 11. Figure 12 exhibits the comparison of SELF with various existing ensemble-based classifiers on the WBCD dataset; it shows that SELF performs better than the existing machine-learning-based ensemble classifiers on the WBCD dataset.

Fig. 11 Comparison of SELF with the existing ensemble learning classifiers on BreakHis dataset

Fig. 12 Comparison of SELF with the existing ensemble learning classifiers on WBCD dataset

6 Conclusion

In this work, we proposed SELF, a stacked-based ensemble learning framework that uses the five best-performing machine learning algorithms, i.e., Extra Tree, Random Forest, AdaBoost, Gradient Boosting, and KNN9, to classify breast cancer on the BreakHis and WBCD datasets. The proposed classifier has shown great potential to improve classification performance, with accuracy, precision, sensitivity, specificity, and F1-Score of 94.35%, 92.45%, 95.96%, 82.87%, and 94.17% respectively on the BreakHis dataset. After analysing the overall performance, we found that SELF performs better than several existing classifiers on the BreakHis dataset. Similarly, we evaluated SELF on the WBCD dataset and found that it also performs well, with an improved accuracy of approximately 99%. Further, this work can be extended by including different optimization techniques in the proposed framework to enhance classification performance.