1 Introduction

According to WHO statistics, cancer is one of the leading causes of mortality worldwide, with a predicted 9.6 million deaths. Breast cancer is the second most common cancer after lung cancer, with 2.09 million cases. It is also the fifth most prevalent cause of cancer death, accounting for over 627,000 deaths, or 15% of all cancer deaths among women, and it alone accounts for 30% of all new cancer diagnoses in women [1]. Breast cancer is the most frequent cancer in women, affecting about 2.1 million women each year, and it causes the largest number of cancer-related deaths among women; it claimed the lives of over 627,000 women in 2018. Early detection is crucial for improving breast cancer outcomes and survival chances [2].

Prostate cancer is one of the most frequent malignancies in American men and has the second-highest fatality rate after lung cancer. About one in seven men will be diagnosed with prostate cancer. According to recent figures, approximately 161,360 new cases of prostate cancer were diagnosed in 2017, with approximately 26,730 deaths [3]. Fortunately, the mortality rate can be reduced if prostate cancer is detected early.

This work studies breast cancer using publicly available data from Coimbra, Portugal, and from Wisconsin. The Coimbra dataset contains ten quantitative predictors, comprising anthropometric data and parameters obtained from routine blood tests, which are used to determine the presence or absence of breast cancer. This paper also studies prostate cancer, using a dataset taken from Kaggle, and analyzes all classification models on the parameters of that dataset.

This paper is organized as follows: Sect. 1 introduces the cancer types studied. Section 2 reviews recent literature on cancer detection. Section 3 describes each component of the methodology used in this work, followed by a description of the datasets. The results of the experiments are presented and discussed in Sect. 4, followed by the conclusion in Sect. 5.

2 Related Work

Rahman et al. [4] pursued two goals: first, to identify the most relevant breast cancer biomarkers, and second, to improve the current computer-aided diagnostic (CAD) system for detecting early breast cancer. Their work used a dataset with nine anthropometric and clinical variables. Of all the techniques the authors applied, the SVM model with a radial basis function (RBF) kernel gave the best results, with 93.9% accuracy, 95% sensitivity, and 94% specificity.

Ray et al. [5] worked on two datasets, one on diabetes and one on breast cancer. Feature selection techniques were applied before the machine learning models to obtain a reduced feature set for classifying healthy and non-healthy subjects. Most of the features in this set are generated by routine pathology examinations. The authors focused on identifying both biomarkers that require pathological testing and those that do not.

Mushtaq et al. [6] studied the Breast Cancer Wisconsin dataset. Different classification models were applied together with a PCA reduction approach, and the performance of the classifiers was analyzed with PCA variants based on linear, sigmoid, cosine, polynomial, and radial basis functions. The highest accuracy, 99.20%, was obtained by sigmoid-based Naive Bayes. KNN achieved accuracies in the range 96.4–97.8% across all the kernels.

Shakeel et al. [7] worked on prostate cancer. The authors first collected prostate cancer information from the DBCR dataset. They then removed irrelevant records using a mean mode process and selected the other important elements using an ant rough set hypothesis. The results were evaluated in terms of mean square error rate, hit rate, and accuracy.

3 Proposed Methodology

Figure 1 depicts the workflow of the proposed work. The overall steps are data preprocessing with normalization, feature selection, training and testing with the specified models, evaluation of the results, and prediction of breast cancer and prostate cancer. Python 3 was used to carry out this work.

Fig. 1
The diagram has three parts. On the left is a cylinder labeled with the Breast Cancer Wisconsin and Coimbra datasets and the prostate dataset. To its right is a box listing data pre-processing, data cleaning, clean data, data reduction, and data splitting. A box below reads: classification evaluation using basic data classification models.

Model for predicting cancer disease
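The data-splitting step in Fig. 1 is not detailed in the text. As a minimal sketch, assuming a stratified hold-out split, the following shows one way to keep the ratio of cancer to non-cancer cases the same in the training and test parts. The 80/20 ratio, the seed, and the function name `stratified_split` are illustrative assumptions, not settings reported by the authors.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with class proportions kept in both parts.

    The 80/20 ratio and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Group the sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Hold out the same fraction of every class (at least one sample).
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


if __name__ == "__main__":
    # Toy labels: 0 = non-cancer, 1 = cancer (the coding used in this paper).
    y = [0] * 30 + [1] * 20
    train, test = stratified_split(y)
    print(len(train), len(test))
    print(sum(y[i] for i in test), "cancer cases in the test part")
```

Splitting per class matters most for small datasets, where a purely random split can leave one class under-represented in the test part.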

Dataset

In this paper, three datasets are analyzed to cover common cancer types in both males and females. Two datasets concern breast cancer, the Breast Cancer Coimbra (BCC) dataset and the Wisconsin Breast Cancer (WBC) dataset; both were collected from the UCI repository. The third dataset, on prostate cancer, was collected from Kaggle. Table 1 shows the number of cancerous and non-cancerous records in each dataset.

Table 1 Description of breast and prostate cancer datasets

The Coimbra Breast Cancer dataset has clinical parameters such as body mass index, glucose, leptin, and other hormone levels. The WBC dataset, in contrast, contains real-valued parameters computed for each cell nucleus, such as texture, radius, and compactness; for each image, the mean, standard error, and worst value of each parameter were computed. The prostate cancer dataset has ten features, such as area, perimeter, radius, and identification number. In this paper, label 0 is used for non-cancer patients and label 1 for cancer patients.
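The raw files do not necessarily use the 0/1 coding described above. As a small sketch, the helper below maps an arbitrary raw class column onto that coding and counts the classes, in the spirit of Table 1. The function name `encode_labels` and the raw values "M"/"B" (malignant/benign) are assumptions made for illustration; the actual raw coding of each file should be checked before use.

```python
from collections import Counter


def encode_labels(raw_labels, cancer_value):
    """Map raw class values to 1 (cancer) and 0 (non-cancer)."""
    return [1 if value == cancer_value else 0 for value in raw_labels]


if __name__ == "__main__":
    # Toy diagnosis column; "M"/"B" is an assumed raw coding.
    raw = ["M", "B", "B", "M", "B"]
    y = encode_labels(raw, cancer_value="M")
    print(y)           # [1, 0, 0, 1, 0]
    print(Counter(y))  # number of records per class
```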

4 Result Analysis

The proposed work compares the performance of eight classifiers. Two normalization methods, Z-score and min–max, are used for data transformation, but only the best results are discussed in this paper; of the two methods, Z-score gives the better results. Tables 2, 3, and 4 show the results on the BCC, WBC, and prostate cancer datasets, respectively, for all the machine learning techniques. Each table is divided into two parts: results without ANOVA feature selection and results with it.
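For reference, the two normalization methods named above can be written as z = (x - mean) / std and x' = (x - min) / (max - min). The sketch below applies both to a single feature column using only the Python standard library. The function names and the sample values are made up for illustration, and the use of the population standard deviation is an assumption; the paper does not say which variant was used.

```python
from statistics import mean, pstdev


def z_score(values):
    """Standardize to zero mean and unit variance: (x - mean) / std."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed variant)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(x - mu) / sigma for x in values]


def min_max(values):
    """Rescale linearly to the interval [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


if __name__ == "__main__":
    feature = [70.0, 92.0, 91.0, 77.0, 200.0]  # made-up values with one outlier
    print([round(v, 2) for v in z_score(feature)])
    print([round(v, 2) for v in min_max(feature)])
```

With min–max scaling, a single extreme value compresses all other values into a narrow band near 0, which is one commonly cited reason for Z-score to behave better on clinical measurements; the paper itself does not state why Z-score performed better. In practice the scaling statistics are usually estimated on the training split only and then reused on the test split.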

Table 2 Performance analysis of BCC
Table 3 Performance analysis of WBC
Table 4 Performance analysis of prostate cancer
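The evaluation measures named in this paper are precision, recall, F1-score, and accuracy. As a minimal sketch of how they follow from the counts of true and false positives and negatives, with label 1 (cancer) treated as the positive class, consider the function below. The function name and the toy predictions are illustrative assumptions and do not reproduce any value from Tables 2 to 4.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1-score for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy example: 4 cancer cases (1) and 6 non-cancer cases (0).
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
    print(classification_metrics(y_true, y_pred))
    # {'accuracy': 0.8, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

In a screening setting, recall is often watched most closely, because a false negative corresponds to a missed cancer case.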

Table 2 compares the results of the eight classifiers with and without feature selection on the Breast Cancer Coimbra dataset. All classifiers except KNN give better results after using ANOVA. KNN gives the highest accuracy, 80%, which remains the same with or without feature selection.
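ANOVA feature selection ranks each feature by the one-way ANOVA F statistic, that is, the ratio of the between-class variance to the within-class variance of that feature. The sketch below computes this statistic for one feature and keeps the k highest-scoring features. The function names, the toy columns, and the choice of k are illustrative assumptions; the paper does not report how many features were retained. scikit-learn offers the same idea through `SelectKBest` with `f_classif`.

```python
from statistics import mean


def anova_f(feature, labels):
    """One-way ANOVA F statistic of one feature across the class labels.

    Requires at least two classes and more samples than classes.
    """
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(y, []).append(x)

    n_total, n_classes = len(feature), len(groups)
    grand_mean = mean(feature)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2
                     for g in groups.values())
    ss_within = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)

    ms_between = ss_between / (n_classes - 1)
    ms_within = ss_within / (n_total - n_classes)
    return float("inf") if ms_within == 0 else ms_between / ms_within


def select_k_best(columns, labels, k):
    """Return the names of the k features with the largest F statistic."""
    scores = {name: anova_f(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    labels = [0, 0, 0, 1, 1, 1]
    columns = {
        "informative": [1, 2, 3, 7, 8, 9],  # class means differ clearly
        "noisy": [5, 1, 4, 2, 6, 3],        # class means nearly equal
    }
    print(anova_f(columns["informative"], labels))  # 54.0
    print(select_k_best(columns, labels, k=1))      # ['informative']
```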

Table 3 shows the performance of all models on the Wisconsin Breast Cancer dataset. Logistic regression gives the best result, 99% accuracy, with the ANOVA feature selection method. Here, only the Naïve Bayes, logistic regression, and ANN classifiers improve their accuracies after feature selection.

Table 4 shows the results of the classifiers on the prostate cancer dataset. The highest accuracy, 97%, is reached by five classifiers (NB, SVM, RF, DT, and SGD). The only difference is that Naïve Bayes gives its best result without feature selection, whereas the remaining classifiers give their best accuracies after the ANOVA feature selection technique is applied.

Figure 2 shows the learning curves of the classifiers that give the highest accuracy on each dataset. In Fig. 2, curve (a) shows the performance of KNN on the Breast Cancer Coimbra dataset, curve (b) shows the learning curve of logistic regression on the Wisconsin Breast Cancer dataset, and curve (c) shows the results of the support vector machine model on the prostate cancer dataset.

Fig. 2
Three plots, (a), (b), and (c), show the cross-validation score against the number of training examples. In plot (a), the score is 0.48 at 3 training examples, 0.57 at 12, and 0.62 at 20, and it rises continuously to 0.78 at 55. In plot (b), it starts at 0.94 at 25 examples, reaches 0.962 at 60 and 0.97 at 100, and rises to 0.98 at 280. In plot (c), it is 0.650 at 10 examples, rises to 0.8 at 23, and drops to 0.750 at 50. Values are estimated.

Learning curves of models having best accuracy in every dataset
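A learning curve such as those in Fig. 2 plots the cross-validation score against the number of training examples. The paper does not describe how its curves were produced; the sketch below is one plausible procedure, assuming k-fold cross-validation and a plain KNN classifier, run on synthetic two-class data rather than on the paper's datasets. All function names, the fold count, k, and the toy data are assumptions. scikit-learn provides a comparable utility, `learning_curve`, in `sklearn.model_selection`.

```python
import random
from collections import Counter


def knn_predict(train_x, train_y, point, k=3):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    nearest = sorted(
        zip(train_x, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], point)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def learning_curve(x, y, train_sizes, n_folds=5, k=3, seed=0):
    """Mean cross-validation accuracy for each training-set size."""
    order = list(range(len(x)))
    random.Random(seed).shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]

    curve = []
    for size in train_sizes:
        fold_scores = []
        for held_out in folds:
            held = set(held_out)
            # Train on the first `size` shuffled samples outside the fold.
            train = [i for i in order if i not in held][:size]
            tx, ty = [x[i] for i in train], [y[i] for i in train]
            hits = sum(knn_predict(tx, ty, x[i], k) == y[i] for i in held_out)
            fold_scores.append(hits / len(held_out))
        curve.append(sum(fold_scores) / n_folds)
    return curve


def make_toy_data(n_per_class=40, seed=1):
    """Two Gaussian blobs standing in for non-cancer (0) and cancer (1) cases."""
    rng = random.Random(seed)
    x, y = [], []
    for label, center in ((0, 0.0), (1, 5.0)):
        for _ in range(n_per_class):
            x.append((rng.gauss(center, 1.0), rng.gauss(center, 1.0)))
            y.append(label)
    return x, y


if __name__ == "__main__":
    x, y = make_toy_data()
    sizes = [4, 8, 16, 32, 64]
    for size, score in zip(sizes, learning_curve(x, y, sizes)):
        print(f"{size:3d} training examples -> CV accuracy {score:.2f}")
```

A curve that is still rising at the largest training size, as in plot (a), is commonly read as a sign that more data could help, while a drop at larger sizes, as in plot (c), may simply reflect the variance of a small dataset.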

5 Conclusion

This work covers two major cancer types, breast cancer (in females) and prostate cancer (in males), which are among the most dangerous cancers and contribute substantially to mortality worldwide. Predicting these diseases at an early stage is essential for better treatment of patients. To support early and correct prediction, all the classification models are analyzed on each dataset. To improve the performance of the models, Z-score normalization is applied first, and the evaluation measures (precision, recall, F1-score, and accuracy) are then analyzed with and without the feature selection technique. Future work is expected to build on these strategies to address existing shortcomings and improve prediction rates, and thereby help to improve survival rates.