
1 Introduction

Globally, breast cancer has remained the second most common cause of death among women [1], with consistently high mortality and incidence rates. Accurate classification of breast tumors helps in treating the disease at an early stage. Breast tumors are mainly classified as malignant (cancerous) or benign (non-cancerous). To discriminate between these tumors, doctors require a reliable and safe diagnostic system; however, even specialists sometimes find it challenging to identify tumors correctly, so early prediction of the disease is essential to reduce the risk of death. We collected data from the Wisconsin Breast Cancer data set and applied the ANOVA F-test feature-selection method to reduce the high dimensionality of the feature space before classification. This also helped in selecting the subset of input features most relevant to the target variable, so that we get better results. After training all the models, we compared them on eight parameters: accuracy, precision, recall, F1-score, sensitivity, specificity, false negative rate and false positive rate (Fig. 1).

Fig. 1. Breast cancer

2 Related Works

This paper gathers information from various works that researched breast cancer prediction on different data sets, including the Wisconsin Breast Cancer data set. Anusha [7] compared Support Vector Machine (SVM), Decision Tree (CART), Naive Bayes (NB) and k-Nearest Neighbours (kNN) based on accuracy. Naveen [6] compared ensemble machine learning models and obtained 100% accuracy with KNN and decision tree on the Coimbra breast cancer data set with a 90:10 train-test split. Fabiano Teixeira [8] evaluated several classification methods, namely Multilayer Perceptron, Decision Tree, Random Forest, Support Vector Machine and Deep Neural Network, and achieved a good accuracy of 92%. Gilbert Gutabaga Hungilo [12] compared AdaBoost, Random Forest, and XGBoost; his results indicate that the random forest is the best predictive model, with accuracy 97%, sensitivity 96%, and specificity 96%. Quang H. Nguyen [14] analysed prediction models using feature selection and ensemble voting and reported an accuracy of at least 98%.

Previous work has typically compared three or four algorithms [2, 5, 6] of the authors' choice, mainly on the basis of accuracy. Although accuracy is the main factor, the best reported accuracy did not exceed 97%, even after applying feature selection. In this paper, we therefore compare seven commonly used algorithms: Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Decision Tree, Logistic Regression, Gaussian Naive Bayes, Random Forest and XGBoost.

3 Proposed Methodology

We collected breast cancer cases from the benchmark Wisconsin Breast Cancer diagnostic data set and compared various ML algorithms: Decision Tree, K-Nearest Neighbors, Gaussian Naive Bayes, Random Forest, Logistic Regression and Support Vector Machine. In addition, we use the XGBoost classifier, an ensemble learning algorithm (an aggregate of the predictive powers of multiple algorithms), to acquire the best results. The flowchart of the proposed model is shown in Fig. 2.

Fig. 2. Proposed model of the XGBoost classifier

3.1 Data Collection

Data is collected from the Wisconsin Breast Cancer data set, publicly available in the UCI Machine Learning Repository [17]. The data set contains 569 instances with 30 attributes. Including the ‘ID number’ and the ‘diagnosis’ result (“Benign” or “Malignant”), it consists of 32 columns; the 30 attributes are the mean, standard error and ‘worst’ (mean of the three largest values) of 10 cell-nucleus features. The class distribution is shown in Fig. 3.
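As an illustration only (not the paper's code), the same UCI data set can be loaded through scikit-learn, which ships a copy of it:

```python
# Sketch: load the Wisconsin Diagnostic Breast Cancer (WDBC) data.
# scikit-learn ships a copy of the same UCI data set; reading the CSV
# from the UCI repository directly would work equally well.
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
df = data.frame                      # 569 rows: 30 features + target column
print(df.shape)                      # (569, 31)
print(df["target"].value_counts())  # class distribution (0 = malignant, 1 = benign)
```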

Fig. 3. Class distribution

3.2 Data Processing

Every row in the data set that is incomplete or has missing attribute values is removed, and attributes such as ‘id’ are also deleted, as they are of no use for prediction.
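As a sketch, assuming the raw CSV was read into a pandas DataFrame `df` with an ‘id’ column (the column names follow the common UCI/Kaggle export and are our assumptions):

```python
# Cleaning sketch: drop the uninformative 'id' column (and the empty
# 'Unnamed: 32' artifact column present in some CSV exports), then
# remove any rows that still have missing attribute values.
df = df.drop(columns=["id", "Unnamed: 32"], errors="ignore")
df = df.dropna(axis=0)
```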

3.3 Data Manipulation

Since the target attribute ‘diagnosis’ is categorical data, which the machine cannot read directly, it is converted into numerical data.
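A one-line sketch of this encoding, assuming the UCI labels ‘M’/‘B’ (the 1-for-malignant convention is ours, not specified by the paper):

```python
# Encode the categorical target: malignant -> 1, benign -> 0.
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})
```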

3.4 Data Visualization

After data collection and manipulation, we performed data visualization of all the remaining 31 attributes to identify areas that needed attention or improvement. The data can be easily interpreted using Fig. 4; one way to produce such a plot is sketched below.
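For example, with seaborn (feature names follow the common CSV export and are assumptions; only a few features are shown, since all 30 at once are unreadable):

```python
# Pair-plot sketch for a subset of features, colored by diagnosis.
import seaborn as sns
import matplotlib.pyplot as plt

cols = ["radius_mean", "texture_mean", "perimeter_mean", "area_mean", "diagnosis"]
sns.pairplot(df[cols], hue="diagnosis")
plt.show()
```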

Fig. 4. Features pair-plot

3.5 Feature Selection

Feature selection is the most important step, as the final results depend on the pattern of selected features. We therefore choose features so as to obtain the best accuracy and other parameters. In this paper, we used the ANOVA F-test feature-selection method.

ANOVA F-Test Feature Selection. ANOVA stands for Analysis of Variance. It is a popular feature-selection method for numerical input features that compares the means of more than two groups. An F-test is a class of statistical tests that compute the ratio between variances, e.g., the variances of two different samples, or the explained and unexplained variance of a statistical test, here the ANOVA F-test. We select the features with the best scores by creating a selector object and applying fit_transform over the features and the target variable. The F statistic is given in Eq. 1

$$\begin{aligned} F = \frac{\text {variation between the groups}}{\text {variation within the groups}} \end{aligned}$$
(1)
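In scikit-learn, for example, this test is available as f_classif, and SelectKBest implements the selector-object-plus-fit_transform pattern described above; the value of k below is an illustrative choice, not the paper's setting:

```python
# ANOVA F-test feature selection sketch with scikit-learn.
from sklearn.feature_selection import SelectKBest, f_classif

X = df.drop(columns=["diagnosis"])
y = df["diagnosis"]

selector = SelectKBest(score_func=f_classif, k=15)  # k is a tunable choice
X_selected = selector.fit_transform(X, y)
print(X.columns[selector.get_support()])            # retained feature names
```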

Z-Score Normalization. The purpose of normalisation is to equalise the scale of all data points so that each attribute is equally important. Min-Max normalization fails to handle outliers; this outlier issue can be solved by using z-score normalization. The formula for this technique is given in Eq. 2:

$$\begin{aligned} Z = \frac{value -\mu }{\sigma } \end{aligned}$$
(2)

where \(\mu \) and \(\sigma \) are the mean and standard deviation of the feature, respectively. A value is normalised to 0 if it is exactly equal to the mean of all values of the feature; it becomes negative if it is below the mean and positive if it is above the mean.
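Written out explicitly, Eq. 2 applied column-wise looks as follows (a sketch equivalent to what scikit-learn's StandardScaler computes per feature):

```python
# Z-score normalization, applied column-wise to a feature matrix X.
import numpy as np

def z_score(X):
    """Standardize each column: subtract its mean, divide by its std."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)
```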

3.6 Data Splitting and Feature Scaling

In this paper, we use 75% of the data for training and 25% for testing. Since the attributes vary in magnitude, units and range, we scaled the features using z-score normalization to bring all of them to a similar level. The feature distribution after feature scaling is shown in Fig. 5.
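A sketch of the split and scaling, fitting the scaler on the training portion only to avoid leaking test-set statistics (`random_state` and `stratify` are our assumptions, not the paper's settings):

```python
# 75/25 train-test split followed by z-score feature scaling.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.25, random_state=42, stratify=y)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on training data only
X_test = scaler.transform(X_test)        # reuse training statistics
```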

Fig. 5. Feature distribution after feature scaling

4 Background

This paper aims to select the machine learning algorithm best suited for developing our model to the fullest. Machine learning algorithms are classified into two types: supervised and unsupervised learning. We need supervised learning for our breast cancer prediction model [8] (Fig. 5).

4.1 Supervised Machine Learning

In this learning, we train the machine using data which is “labelled” [7]. The learning algorithm predicts outcomes after learning from the labelled training data, using regression and classification techniques to develop predictive models.

  • Logistic Regression: This algorithm models the relationship between a dependent variable (y) and one or more independent variables. It estimates the probability of a binary outcome by passing a linear combination of the independent variables through the logistic (sigmoid) function, and is therefore used for classification rather than for predicting continuous values.

  • KNN: KNN, short for k-Nearest Neighbours, predicts the class of a new data point based on how closely it matches the points in the training set, i.e., based on its similarity to its k nearest neighbours.

  • SVM: An SVM model is a data classification algorithm for predictive analysis that allocates new data points to one of the known classes. It works by defining a decision boundary between the two classes: data points that fall on one side are considered one class, and those on the opposite side the other.

  • Gaussian Naive Bayes: Gaussian Naive Bayes is widely used as a classifier, and with a few alterations it can also be used for regression. In this algorithm, the feature values are assumed to follow a Gaussian distribution, also called a normal distribution. The classification is done based on Bayes’ theorem.

  • Decision Tree: This algorithm identifies different ways to split the data and is used for both classification and regression. Using a tree representation, it recursively splits the data so as to reduce the classification error.

  • Random Forest: The Random Forest classifier builds multiple decision trees, each representing different statistical probabilities, and then combines these trees to acquire a stable and precise prediction. Each individual tree is a Classification and Regression Tree (CART) model. This algorithm is used for both regression and classification problems.

  • XGBoost: XGBoost, or eXtreme Gradient Boosting, is an application of gradient-boosted decision trees designed for high speed and better performance. It is an ensemble learning method: each new model is trained to rectify the errors of the previous one, and training halts when there is no further improvement. In boosting, the base learners are weak learners with little predictive power, whereas the final model is a strong learner with high predictive power, built as a combination of the weak learners, each of which contributes some information to the prediction (Fig. 6; a minimal training sketch follows the figure).

Fig. 6. XGBoost algorithm
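A minimal fitting sketch with the xgboost package, using the data prepared in the earlier sketches; the hyperparameters below are illustrative defaults, as the paper does not report its exact settings:

```python
# Train the XGBoost classifier on the prepared data (sketch).
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=100, learning_rate=0.1,
                      eval_metric="logloss")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```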

Table 1. Accuracy comparison
Fig. 7. Accuracy comparison of models

Table 2. Precision comparison
Fig. 8. Precision comparison of models

Table 3. Recall comparison
Fig. 9. Recall comparison of models

Table 4. F1 comparison
Fig. 10. F1-score comparison of models

5 Experimental Results

In our results, we consider the confusion matrix, which summarizes the outcomes of the classification predictions and shows where the model or classifier gets confused when making predictions. We also computed the accuracy, precision, recall, F1-score, sensitivity, specificity, false negative rate and false positive rate of all the algorithms [6]; a sketch of extracting the confusion-matrix counts is shown below, and a second sketch computing the metrics from these counts follows the list.
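For example, with scikit-learn (using the test labels and one model's predictions from the sketches above):

```python
# Extract the four confusion-matrix counts for a binary classifier.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
```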

  • Accuracy: XGBoost gives the highest accuracy of 98.25%, which is best for our model, whereas the decision tree gives the lowest accuracy of 88.81%, as shown in Fig. 7 (Table 1).

    $$\begin{aligned} Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \end{aligned}$$
    (3)

    Here, TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative.

  • Precision: SVM and Random Forest show the highest precision of 96.23%, and XGBoost has a precision of 95.83%, whereas the decision tree has the lowest precision of 86.23%, as shown in Fig. 8 (Table 2).

    $$\begin{aligned} Precision = \frac{TP}{TP + FP} \end{aligned}$$
    (4)
  • Recall: XGBoost has the highest recall of 100%; the other algorithms also gave good results, but the decision tree shows the lowest value of 78.46%, as shown in Fig. 9 (Table 3).

    $$\begin{aligned} Recall = \frac{TP}{TP + FN} \end{aligned}$$
    (5)
  • F1-Score: XGBoost gives the highest F1-score of 97.87%, whereas the decision tree gives the lowest F1-score of 86.44%, as shown in Fig. 10 (Table 4).

    $$\begin{aligned} F1 = \frac{2 \times ( Precision \times Recall)}{Precision + Recall} \end{aligned}$$
    (6)
  • Sensitivity: XGBoost shows an excellent, highest sensitivity of 100%, whereas the decision tree shows the lowest value of 78.46%, as shown in Fig. 11.

    $$\begin{aligned} Sensitivity = \frac{TP}{TP + FN} \end{aligned}$$
    (7)
  • Specificity: The Support Vector Machine gave the highest specificity of 97.75%; the others also gave good results, but Gaussian Naive Bayes shows the lowest value of 93.41%. XGBoost achieved a specificity of 97.06%, as shown in Fig. 12 (Tables 5 and 6).

    $$\begin{aligned} Specificity = \frac{TN}{FP + TN} \end{aligned}$$
    (8)
  • False Negative Rate: XGBoost showed the lowest false negative rate of 0.00%, which is excellent, whereas the decision tree showed the highest value of 0.21%, as shown in Fig. 13.

    $$\begin{aligned} False\ Negative\ Rate = 100 *\frac{FN}{FN + TP} \end{aligned}$$
    (9)
  • False Positive Rate: XGBoost, Random Forest, Decision Tree and Support Vector Machine (SVM) showed the lowest false positive rate of 0.02%, whereas Gaussian Naive Bayes gave the maximum of 0.06%, as shown in Fig. 14 (Tables 7 and 8).

    $$\begin{aligned} False\ Positive\ Rate = 100 *\frac{FP}{FP + TN} \end{aligned}$$
    (10)
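All eight metrics of Eqs. 3–10 can then be computed directly from the four confusion-matrix counts extracted earlier; a sketch, with values expressed in percent:

```python
# Eqs. 3-10, computed from the confusion-matrix counts (in percent).
accuracy    = 100 * (tp + tn) / (tp + fp + tn + fn)          # Eq. 3
precision   = 100 * tp / (tp + fp)                           # Eq. 4
recall      = 100 * tp / (tp + fn)                           # Eq. 5
f1          = 2 * precision * recall / (precision + recall)  # Eq. 6
sensitivity = 100 * tp / (tp + fn)                           # Eq. 7 (same as recall)
specificity = 100 * tn / (fp + tn)                           # Eq. 8
fnr         = 100 * fn / (fn + tp)                           # Eq. 9, false negative rate
fpr         = 100 * fp / (fp + tn)                           # Eq. 10, false positive rate
```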
Table 5. Sensitivity comparison
Fig. 11. Sensitivity comparison of models

Table 6. Specificity comparison
Fig. 12. Specificity comparison of models

Table 7. False negative rate comparison
Fig. 13. False negative rate comparison of models

Table 8. False positive rate comparison
Fig. 14. False positive rate comparison of models

6 Results and Conclusion

This research offered a new plan of action applying feature selection based on the ANOVA F-test, z-score normalization, and the XGBoost classifier for the prediction of breast cancer. The proposed approach offers the following advantages: improved classification accuracy, better recall, boosted sensitivity, increased precision, and reduced false-positive and false-negative rates. The classification accuracy of the new strategy is 98.25%, the recall 100%, the F1-score 97.87%, the sensitivity 100%, the precision 95.83%, the false positive rate 0.02%, and the false negative rate 0.00%.

This new strategy, a hybrid approach of feature selection and XGBoost, was compared with distinct algorithms for predicting breast malignancy and yielded more reliable performance in terms of various parameters. As XGBoost is an ensemble machine learning algorithm (an ensemble model is a combination of multiple models), it is able to give the best results. In this paper, we thus successfully created a prediction model for breast cancer.

For future research, we intend to implement feature selection based on the differential evolution algorithm to provide reasonably practical and more precise results. Furthermore, we plan to apply the same approach to different data sets and compare the performance of the hybrid approach using optimal features.