1 Introduction

Among severe medical emergencies, cancer is the most lethal disease induced by tumor cells. The uncontrolled growth of cells into tumors remains a major challenge for the technological world today. Cancer treatments such as chemotherapy and surgery also carry a high risk of destroying healthy tissue cells. Cervical cancer is the fourth most common cause of cancer death among women. According to the World Health Organization (WHO), there were 604,127 new cases of cervical cancer in 2020 and 341,831 deaths, accounting for 6.5% of all cancers diagnosed in women. In 2020, more than 83% of cervical cancer deaths occurred in low- and middle-income nations [1]. In India alone, cervical cancer accounted for 18.3% of cancer cases among women in 2020, which is 9.4% of all cancer patients; it ranked third among all cancers in India, with 123,907 newly registered patients and 77,348 deaths in 2020 [2]. Tumors are categorized as malignant or benign. The lack of early diagnosis, effective screening, and treatment programs makes cervical cancer one of the most lethal malignancies. Cervical cancer cells develop in the cervix of the uterus, and early-stage symptoms include abnormal vaginal bleeding, increased vaginal discharge, bleeding after menopause, pain during intercourse, and pelvic pain.

Cervical cancer occurs in women infected with the human papillomavirus (HPV), which causes cervical tissue to change abnormally. Multiple sexual partners, sexual activity at an early age, long-term use of oral contraceptives, and smoking increase the risk of cervical cancer [1, 3]. The Pap test and the HPV DNA test are the most widely recommended screening tests for cervical cancer. In the Pap test (a.k.a. Pap smear test), a cytology-based screening test, a sample of cells is taken from the cervix and examined for abnormal or cancerous cells, as well as for cell changes that increase the likelihood of cervical cancer. The HPV DNA test detects, in cells taken from the cervix, any HPV type responsible for leading to cervical cancer. The Pap smear and HPV tests can be performed at the same time, using either the same swab or a second swab. When cervical cancer is suspected, patients undergo detailed diagnostic tests such as biopsy [4]. At present, alongside conventional medical approaches, computer vision and machine learning algorithms within cyber-physical systems play a vital role in various medical applications such as disease diagnosis. In this paper, we apply some of the most popular machine learning (ML) approaches, namely NB, LR, KNN, SVM, LDA, MLP, DT, and RF, to cervical cancer data together with several preprocessing methods. Analyzing all the risk factors in disease diagnosis degrades the efficiency of a classification model and increases its computational complexity, so the selection of relevant features also plays a vital role in the performance of a classification model. This article therefore also evaluates some popular feature selection methods for obtaining optimized classification performance.

2 Background of ML Algorithms Used

2.1 Naive Bayes (NB)

NB is a supervised classification model based on a conditional probabilistic approach that applies Bayes' theorem to the instances in a dataset. This classification method is often well suited to high-dimensional datasets [5,6,7]. The approach classifies a problem based on the joint posterior probability distribution:

$$\begin{aligned} p(C \mid X) &= \frac{p(C)\, p(X \mid C)}{p(X)} \\ p(X \mid C) &= \prod_{i = 1}^{n} p(x_{i} \mid C) \end{aligned}$$

Here, \(p(C \mid X)\), \(p(C)\), and \(p(X)\) give the posterior probability, the prior probability of the class, and the probability of the attributes, respectively, and X is the vector of n attributes. Owing to the assumed statistical independence among features, these classifiers are highly scalable and can exploit limited training data with high-dimensional features. In [6], Weighted Principal Component Analysis (WPCA) is used along with the NB classifier to achieve improved performance in Pap smear cervix cell image classification on the Herlev dataset. References [8,9,10,11] compare the prognosis performance of the NB classifier on cervical data with other ML classification models.
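As a minimal illustration (not the authors' exact pipeline), the posterior computation above can be reproduced with scikit-learn's GaussianNB; the feature matrix X and labels y below are synthetic placeholders, not the cervical dataset.

```python
# Minimal Gaussian Naive Bayes sketch on synthetic data.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # 200 samples, 5 toy risk-factor features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy binary target

clf = GaussianNB().fit(X, y)
# predict_proba returns p(C|X), the posterior from Bayes' theorem under the
# conditional-independence assumption p(X|C) = prod_i p(x_i|C).
print(clf.predict_proba(X[:3]))
print(clf.predict(X[:3]))
```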

2.2 Logistic Regression (LR)

LR is a statistical supervised learning method for binary classification that fits a linear model and maps its output to discrete binary outcomes through the logistic function. It uses maximum-likelihood estimation, via an iterative search procedure, to minimize the prediction error and find the optimal coefficient values for the data, so that the classification threshold can be easily adjusted [12]. The essence of the algorithm is the minimization of the cost function:

$$\begin{aligned} J(\theta) &= -\frac{1}{m}\sum_{i = 1}^{m}\left[ y^{(i)} \log h_{\theta}\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right)\log\!\left(1 - h_{\theta}\!\left(x^{(i)}\right)\right) \right] \\ h_{\theta}(x) &= \frac{1}{1 + e^{-\theta^{T} x}} \end{aligned}$$

Here, \(h_{\theta}(x)\) denotes the logistic hypothesis, and \(\log h_{\theta}(x^{(i)})\) and \(\log(1 - h_{\theta}(x^{(i)}))\) give the cost when the class y is '1' and '0', respectively, for m training examples. Reference [13] proposed an LR classifier with a fuzzy inference model using combined grayscale-texture features for Cervical Intraepithelial Neoplasia (CIN) image classification. Several studies [8, 10, 11, 14] used LR as one of the classifiers in comparative analyses of cervical cancer classification. Reference [15] used logistic regression for probability estimation of knowledge, attitude, and perception (KAP) of human papillomavirus (HPV) infection and cervical cancer.
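The sketch below (synthetic data, placeholder names) evaluates the cost J(θ) directly from the formula and checks it against scikit-learn's log_loss at the coefficients fitted by LogisticRegression.

```python
# Logistic hypothesis and cross-entropy cost J(theta) in NumPy, compared with log_loss.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X @ np.array([1.5, -2.0, 0.5, 0.0]) > 0).astype(int)

def cost(theta, b, X, y):
    h = 1.0 / (1.0 + np.exp(-(X @ theta + b)))           # h_theta(x), the sigmoid
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(cost(clf.coef_.ravel(), clf.intercept_[0], X, y))  # J(theta) at the fitted coefficients
print(log_loss(y, clf.predict_proba(X)))                 # same quantity via scikit-learn
```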

2.3 Linear Discriminant Analysis (LDA)

LDA is popularly known as a dimensionality reduction approach; however, it has also proved effective for classifying objects into two or more groups or clusters based on the measured features describing those objects. LDA is an alternative to LR when there are more than two classes, since LR is limited to binary classification problems. It computes statistical characteristics of the data for each class, which are used to make decisions based on Bayes' theorem [16]. Its objective is to predict, for an input x, the class k with the largest:

$$\delta_{k}(x) = \log \pi_{k} - \frac{1}{2}\mu_{k}^{T}\hat{\Sigma}^{-1}\mu_{k} + x^{T}\hat{\Sigma}^{-1}\mu_{k}$$

Here, \(\pi_{k} = p(y = k)\) is the prior probability of class k, and \(\mu_{k}\) and \(\hat{\Sigma}\) denote the mean vector of class k and the covariance matrix, respectively. Reference [17] implemented fuzzy-entropy-based prime feature discrimination from segmented cell nuclei on the Herlev dataset; the segmented features were used with LDA as one of the classification models for abnormal cell detection.
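A minimal sketch (synthetic two-class data, not the paper's experiments): the fitted LDA model evaluates the discriminant for each class and predicts the class with the largest value.

```python
# Minimal LDA sketch with scikit-learn on synthetic Gaussian clusters.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(2, 1, (100, 3))])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict(X[:5]))
print(lda.decision_function(X[:5]))   # signed score: positive values favour class 1
```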

2.4 K-Nearest Neighbor (KNN)

KNN is a non-parametric classification technique that uses a feature-similarity approach, searching for the most similar data points among the available data to assign a class. KNN finds the K nearest data points to a given query point by computing the Euclidean distance (other distance measures include the Manhattan, Minkowski, and Hamming distances) and assigns the class label that occurs most frequently among them. The value of K is chosen by parameter tuning so as to provide the best-suited prediction for the given data [18]. An input x is assigned to the class with the largest probability among all classes:

$$p(y = j \mid X = x) = \frac{1}{K}\sum_{i \in A} I\!\left(y^{(i)} = j\right)$$

Here, I(·) denotes the indicator function, which is '1' when its argument is true and 0 otherwise, and A is the set of the K nearest points to the input x. References [9,10,11, 17] used the KNN classifier in comparative analyses with other classification models for cervical cancer classification.
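A hedged sketch with scikit-learn (toy data): K and the distance metric are the main tuning choices, and predict_proba returns the neighbour vote fractions from the equation above.

```python
# KNN sketch: Minkowski metric with p=2 is the Euclidean distance; p=1 gives Manhattan.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 4))
y = (X[:, 0] > 0).astype(int)

knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2).fit(X, y)
print(knn.predict(X[:5]))
print(knn.predict_proba(X[:5]))   # fraction of the 5 nearest neighbours in each class
```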

2.5 Multilayer Perceptron (MLP)

MLP is a proficient and robust neural-network-based method used for solving nonlinear and complex classification problems. It comprises multiple neurons arranged into an input layer, hidden layers, and an output layer. Some of the nodes (neurons) use non-linear activation functions so that the network can also solve problems that are not linearly separable. The most difficult task is determining the hidden layer sizes [19]. The optimization objective of the MLP model is the minimization of:

$$\begin{aligned} & \min\; \left\| F(X, W) - d \right\|^{2} \\ & F(X, W) = Y = \left(y_{1}, y_{2}, y_{3}, \ldots, y_{n_{N+1}}\right) \end{aligned}$$

Here, F denotes the transfer function, X the input to the model, W the weight matrix, d the desired response, Y the computed output vector, N the total number of hidden layers, and \(n_{N+1}\) the number of output-layer neurons. An incorrect estimate of the network size may result in approximation error, generalization error, and overfitting. In [14], an MLP classifier was used for performance comparison with other classifiers for two-class classification on risk-factor cancer data, using an RFE- and RF-based ensemble method for feature selection.
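A minimal sketch (synthetic, non-linearly separable toy data): hidden_layer_sizes is the hard-to-choose hyperparameter mentioned above, and a non-linear activation lets the network separate classes that a linear model cannot.

```python
# MLP sketch with scikit-learn's MLPClassifier on a non-linearly separable toy problem.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 6))
y = ((X[:, 0] ** 2 + X[:, 1] ** 2) > 1.5).astype(int)   # circular decision boundary

mlp = MLPClassifier(hidden_layer_sizes=(16, 8), activation="relu",
                    max_iter=2000, random_state=0).fit(X, y)
print(mlp.score(X, y))   # training accuracy of the two-hidden-layer network
```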

2.6 Decision Tree (DT)

DT is a supervised learning method with a tree-like structure in which every node represents a test on an attribute value of the instance to be classified, and each branch corresponds to a value assumed by that node [7, 20]. The best split among the training samples is selected based on measures of the class distribution:

$$\begin{aligned} Entropy(t) &= -\sum_{i = 0}^{c - 1} p(i \mid t)\,\log_{2} p(i \mid t) \\ Gini(t) &= 1 - \sum_{i = 0}^{c - 1}\left[p(i \mid t)\right]^{2} \\ Classification\; error(t) &= 1 - \max_{i}\left[p(i \mid t)\right] \end{aligned}$$

Here, c is the total number of target classes and \(p(i \mid t)\) is the proportion of samples belonging to class i at a specific node t. Conventional DT algorithms [8, 10, 21], as well as DT variants such as ID3, C4.5, C5.0, CHAID, and CART [9] and J48 [10, 11], have performed efficiently in cervical cancer detection, both standalone and within ensemble approaches.
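The sketch below (toy values, not from the paper) evaluates the three split measures for a sample class distribution and shows that scikit-learn's DecisionTreeClassifier lets the split criterion be selected.

```python
# Split-quality measures for a toy node distribution, plus a decision tree with a chosen criterion.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

p = np.array([0.9, 0.1])                  # class proportions p(i|t) at a node t
entropy = -np.sum(p * np.log2(p))
gini = 1 - np.sum(p ** 2)
class_error = 1 - p.max()
print(entropy, gini, class_error)

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0).fit(X, y)
print(tree.score(X, y))
```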

2.7 Support Vector Machine (SVM)

Vapnik introduced the SVM approach to deal with both classification and regression problems. SVM is a supervised, discriminative, linear approach that accomplishes binary classification through an explicit hyperplane [7]. Optimization in SVM is based on minimizing:

$$\min_{\theta}\; C\sum_{i = 1}^{m}\left[ y^{(i)}\, z^{\prime}\!\left(\theta^{T} x^{(i)}\right) + \left(1 - y^{(i)}\right) z^{\prime\prime}\!\left(\theta^{T} x^{(i)}\right) \right] + \text{regularization term}$$

Here, C is the penalty factor for error, \(z^{\prime}(\theta^{T} x^{(i)})\) and \(z^{\prime\prime}(\theta^{T} x^{(i)})\) denote the cost functions when the class y equals '1' and '0', respectively, and m indicates the number of samples. In [22], SVM, support vector machine-recursive feature elimination (SVM-RFE), and support vector machine-principal component analysis (SVM-PCA) methods were used for cervical cancer detection, achieving 90-94% accuracy on the risk-factor cervical cancer data. Initially, SVM applications were constrained to two-class classification, but kernel functions for SVM were later introduced that are valuable for multiclass classification [17, 23, 24]. Reference [17] implemented SVM with a linear kernel (SVM-linear) and a radial basis function kernel (SVM-RBF) using a fuzzy-entropy-based feature extraction mechanism for abnormal cell detection in Pap smear images. References [8,9,10,11, 14, 25] performed comparative analyses of SVM with other prediction models for cervical cancer prognosis.
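A minimal sketch (synthetic data): C is the error-penalty factor from the objective above, and the linear and RBF kernels correspond to the SVM-linear and SVM-RBF variants cited.

```python
# SVM sketch with scikit-learn's SVC using linear and RBF kernels.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 4))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

svm_linear = SVC(kernel="linear", C=1.0).fit(X, y)
svm_rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(svm_linear.score(X, y), svm_rbf.score(X, y))
```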

2.8 Random Forest (RF)

Random forest, introduced by Breiman (2001), is an ensemble method used for both classification and regression problems. Ensemble methods combine weak learners to form a strong learner, using multiple learning approaches to produce an enhanced predictive result. RF trains a number of DTs and returns the class found in the majority within the ensemble of all DTs [26]. In RF, each DT predicts a class, and the most-voted class among all predictions becomes the RF model's prediction. The bagging approach in RF predicts the value for a sample x′ by averaging the predictions of the individual DTs:

$$\hat{f}\left(x^{\prime}\right) = \frac{1}{N}\sum_{n = 1}^{N} f_{n}\left(x^{\prime}\right)$$

Here, N is the number of trees (base estimators) in the ensemble. RF algorithms generally perform slightly better than SVMs in many classification problems [27]. The RF classifier [8, 10, 11, 14, 21, 25] performs efficiently for risk-factor cancer data as well as for Pap smear cervix images in cervical cancer detection.
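A hedged sketch (synthetic data): n_estimators is the number N of trees whose votes are averaged, as in the bagging equation above.

```python
# Random forest sketch with scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.predict(X[:5]))
print(rf.predict_proba(X[:5]))   # averaged tree predictions (class-vote proportions for fully grown trees)
```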

3 Methodology

3.1 Data Description

The cancer patients' data used here for diagnosis is available at the UCI repository and was collected at 'Hospital Universitario de Caracas' in Caracas, Venezuela [28]. The dataset includes 858 instances with 36 variables: 32 risk-factor attributes and 4 target variables—Hinselmann, Schiller, Cytology, and Biopsy. The attributes of the cervical cancer data are described in Table 1. The Hinselmann test is colposcopy using acetic acid, the Schiller test is colposcopy using Lugol iodine, and Cytology and Biopsy are the remaining diagnostic tests. A malignant outcome is labeled '1' and a benign outcome '0'. Across the whole dataset, around 90-96% of the data belongs to the benign class for each of the four target variables. All attribute values are of boolean, integer, or float type. To build an efficient learning model, the data fed to it should be proper and complete. Since some samples in the dataset have missing values and each attribute has a different scaling range, the data must be preprocessed before being fed to a learning algorithm.

Table 1 Attributes description of Cervical Cancer dataset

3.2 Preprocessing of Data

In preprocessing, we eliminate the instances and attributes with missing values. After elimination, features 27 and 28 are removed, along with all samples having at least one missing value, leaving 668 samples with 30 features. Variation in the scaling ranges of features may cause a particular feature to dominate the rest when analyzing performance on a dataset. If the magnitude of an attribute's variance is of much higher order than that of the others, it may dominate the objective function and produce an estimator incapable of learning from the remaining attributes as expected. In the given cervical dataset, the attribute 'Age' has high mean, variance, and standard deviation (27.265, 76.168, and 8.727, respectively), while some attributes, such as 'STDs: cervical condylomatosis' and 'STDs: AIDS', have zero mean, variance, and standard deviation. To restrain the weighting effect of the attributes with larger statistical magnitudes and to obtain more numerically stable and better-conditioned optimization, all attributes should be brought to the same scale using feature scaling approaches. Scaling methods are also quite helpful in speeding up the computations within an algorithm. In this article, the ML algorithms are analyzed on unscaled as well as scaled data. For scaling, we use the Min-Max Scaler, the Standard Scaler, and Normalization on the available cancer data [29, 30]. The Standard Scaler, or Z-score normalization, rescales the data to a standard normal distribution with zero mean and unit variance: if µ and σ denote the mean and standard deviation, standardization maps ~N(µ, σ²) → ~N(0, 1), i.e., Z ~ N(0, 1), where N stands for the normal distribution. The standard score, or Z-score, of an instance is given by:

$$\text{Standardization:}\quad z = \frac{x - \mu}{\sigma}$$

Standard scaling is preferred where the distance contributions of all attributes are required to be equal, and it proves most valuable when the attribute distributions are nearly normal (Gaussian). The Min-Max Scaler and Normalization are alternatives to the Standard Scaler when the attribute distributions are not Gaussian and the attributes lie within a bounded range. Both the Min-Max Scaler and Normalization scale the data between 0 and 1, with the difference that the resulting distributions are bounded and of unit norm, respectively. Min-Max scaling preserves the shape of the original distribution and ends up with a smaller value of σ, which can suppress the effect of outliers. Normalization scales each instance (row) instead of each attribute (column), using the Euclidean norm (l2 normalization) or the Manhattan norm (l1 normalization).

$$\begin{aligned} \text{Min-Max scaling:}\quad & x^{\prime} = \frac{x - x_{min}}{x_{max} - x_{min}} \\ \text{Normalization:}\quad & x^{\prime} = \frac{x - x_{mean}}{x_{max} - x_{min}} \end{aligned}$$

Here, \(x\) is a value within a feature (column) for Min-Max scaling and within a sample (row) for Normalization, x′ is the scaled value, and \(x_{min}\), \(x_{max}\), and \(x_{mean}\) denote the minimum, maximum, and mean values of the given set.
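As a minimal scikit-learn sketch of these three options (a small synthetic array stands in for the cervical data; column names are illustrative only):

```python
# Column-wise scalers (StandardScaler, MinMaxScaler) vs. row-wise Normalizer.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

X = np.array([[18.0, 4.0, 0.0],
              [52.0, 1.0, 1.0],
              [34.0, 3.0, 0.0]])                 # toy stand-in rows of risk factors

X_std = StandardScaler().fit_transform(X)        # each column: zero mean, unit variance
X_minmax = MinMaxScaler().fit_transform(X)       # each column scaled to [0, 1]
X_norm = Normalizer(norm="l2").fit_transform(X)  # each row scaled to unit Euclidean norm
print(X_std, X_minmax, X_norm, sep="\n\n")
```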

Among all the samples, only around 4-10% belong to the malignant category for each of the four target variables in the given dataset. This has an adverse effect on the computation of performance metrics and may bias the prediction towards the majority class when one class greatly outnumbers the other, i.e., in the case of class-imbalanced data [31]. Generally, the number of patients with a positive disease diagnosis is much smaller than the negative group. The imbalance ratio for each target variable in the given dataset is shown in Table 2. This imbalance may result in a deceptively high accuracy for the model, owing to the weight of the majority class in the dataset, even when the minority class is wrongly predicted. Oversampling and undersampling are the two ways to achieve a balanced class distribution; since the dataset is not large, oversampling is the better choice. However, the traditional oversampling approach randomly replicates instances, which may cause overfitting; hence a hybrid approach, the synthetic minority oversampling technique (SMOTE), is used as a preprocessing method [32]. SMOTE creates new 'synthetic' minority instances by linear interpolation rather than duplication. A new minority instance is generated by SMOTE as follows:

$$x^{\prime} = x + rand(0, 1) \times \left| x - x_{k} \right|$$
Table 2 Imbalance ratio in cervical cancer dataset

Here, \(x\) is one of the minority instances in the minority-class set A; for each \(x \in A\), \(rand(0, 1)\) is a random number between 0 and 1 and \(\left| x - x_{k} \right|\) is the Euclidean distance between the instance x and its kth nearest neighbor (for k = 1, 2, …, N, where N is the sampling rate set in proportion to the imbalance).
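A hedged sketch of SMOTE with the imbalanced-learn library on a toy imbalanced problem (roughly 5% minority, as in the cervix data); the exact sampling settings of the paper are not reproduced here.

```python
# SMOTE oversampling sketch: minority samples are synthesized by interpolation.
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 5))
y = np.array([0] * 190 + [1] * 10)          # ~5% minority class

X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y), Counter(y_res))           # class counts before and after balancing
```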

3.3 Implementation of ML Algorithms

On the preprocessed data, we apply the aforementioned ML algorithms with fivefold cross-validation to compute the performance metrics. Cross-validation is a performance-analysis method in which a set of samples is reserved for testing and the model is trained on the remaining data; this is repeated for every set of samples in the dataset, and the performance metrics are computed for each trained model. The final performance metric is the mean of the metrics of the individual trained models. K-fold cross-validation splits the whole dataset into K subsets; in each iteration one fold is used as the test set and the others as the training set, and this is repeated until each of the K folds has served as the test set. Stratified K-fold cross-validation, as available in the Scikit-learn library for Python [33], is used in this article to split the data into 5 folds (Fig. 1).

Fig. 1 K-fold cross-validation with K = 5

Apart from fivefold cross-validation, a parameter-selection method is used to determine the best parameters of each ML algorithm for the given data. For each algorithm, we specify a range of values for every parameter and then identify the most suitable combination. The GridSearchCV function available in Scikit-learn computes the finest parameters for each ML algorithm; a parameter grid containing all candidate parameter values is supplied to GridSearchCV, from which the best are selected. The parameter grids used for the simulation of the ML algorithms are given in Table 3.

Table 3 Parameters grid for different ML classifiers
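The sketch below illustrates how stratified fivefold cross-validation can be combined with GridSearchCV for parameter selection; the data and the small RF grid shown are toy placeholders, not the grids of Table 3.

```python
# Stratified 5-fold CV + grid search over a toy random-forest parameter grid.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, GridSearchCV

rng = np.random.default_rng(9)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + X[:, 3] > 0).astype(int)

param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```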

When it comes to performance, accuracy is not the only criterion for determining the best model. Since most of the data belongs to the benign category, overall accuracy is weighted towards the benign cases: it can be very high even if the accuracy on the malignant category is low. However, for a correct disease diagnosis, the prediction of malignant samples must also be accurate. Hence, Precision, Recall, F-score, the ROC curve, and AUC are also used here for performance analysis [34, 35].

$$\begin{aligned} Accuracy &= \frac{TP + TN}{TP + TN + FP + FN} \\ Precision &= \frac{TP}{TP + FP} \\ Recall &= \frac{TP}{TP + FN} \\ F\text{-}score &= \frac{2 \times Precision \times Recall}{Precision + Recall} \end{aligned}$$

Here TP, TN, FP, and FN refer to the True Positive, True Negative, False Positive, and False Negative counts. A True Positive is a malignant cervical cancer case detected correctly, while a True Negative is an uninfected patient predicted correctly. A False Positive is an uninfected sample found with a positive result, while a False Negative is a cervical-cancer-infected patient whose result is negative. Accuracy is the proportion of correctly diagnosed samples among all samples. Precision, also known as the Positive Predictive Value (PPV), is the ratio of actually infected persons to all samples detected as positive. Recall, also known as Sensitivity or the True Positive Rate (TPR), is the fraction of correctly detected cancer-infected patients among all samples actually infected with cervical cancer. The F-score is the harmonic mean of precision and recall, with a best value of 1. The ROC curve and AUC are related in that the ROC curve is plotted between the True Positive Rate (TPR) and the False Positive Rate (FPR), and the AUC score is the area under the ROC curve. The FPR is the ratio of uninfected samples detected as positive to the total number of uninfected samples. AUC lies between 0 (worst) and 1 (best); the closer it is to 1, the better the model performs.
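As a minimal sketch (toy labels and scores, not results from the paper), these metrics and the ROC/AUC quantities can be computed with scikit-learn as follows:

```python
# Computing accuracy, precision, recall, F-score, ROC points and AUC on toy predictions.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_curve, roc_auc_score)

y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_pred = np.array([0, 0, 1, 1, 1, 0, 0, 1])
y_prob = np.array([0.1, 0.2, 0.6, 0.8, 0.9, 0.4, 0.3, 0.7])   # scores for the positive class

print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
fpr, tpr, _ = roc_curve(y_true, y_prob)       # points of the ROC curve (FPR vs. TPR)
print(roc_auc_score(y_true, y_prob))          # area under that curve
```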

3.4 Feature Selection Methodology

The computational cost of a learning model is directly proportional to the dimensionality of its attributes. Further performance improvement can be achieved by selecting the relevant risk factors that contribute most to the classification model. Analyzing a learning method with irrelevant attributes may cause overfitting and increases the computational complexity of the model. Thus, to create an effective classification model, redundant features should be eliminated from the dataset. Selecting only the truly important features makes it possible to train an accurate model with enhanced performance. Filter, Wrapper, and Embedded methods are the three categories of attribute-selection methods [36, 37]. Among the many available methods, two popular feature selection methodologies, Univariate feature selection and Recursive Feature Elimination, are used here.

3.4.1 Univariate Feature Selection

Univariate feature selection is a type of Filter method based on examining, for each risk factor independently, the strength of the correlation between the attribute and the target variable. It relies on various statistical tests to select the attributes carrying the most importance and the most distinct information. Scikit-learn provides SelectKBest(), SelectPercentile(), and GenericUnivariateSelect() as transformer objects for univariate feature selection. SelectKBest() retains the K highest-scoring attributes and eliminates all others, SelectPercentile() retains only the top user-specified percentage of scoring attributes, and GenericUnivariateSelect() uses a configurable strategy to carry out attribute selection. For classification, the chi-square test is a popular statistical tool used with the SelectKBest() univariate approach, and it is used in this paper for feature selection. The chi-square test applies to discrete values and tests the independence of two samples; the intuition is that a risk factor is uninformative for classification if it is independent of the class variable. SelectKBest() with chi-square selects the attributes with the K highest chi-square scores computed between the attributes and the target categories.

$$Chi\text{-}square\quad \chi^{2} = \sum_{i = 1}^{n}\frac{\left(O_{i} - E_{i}\right)^{2}}{E_{i}}$$

Here, \(O_{i}\) and \(E_{i}\) are the observed and expected frequencies for category i among the n categories considered. This approach aims to select the attributes that are most strongly dependent on the categorical target.
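A hedged sketch of chi-square univariate selection with SelectKBest (synthetic data; K = 16 mirrors the choice used later in the paper). Note that chi2 requires non-negative feature values, which holds after Min-Max scaling.

```python
# SelectKBest with the chi-square score on Min-Max scaled toy features.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(10)
X = rng.normal(size=(200, 30))
y = (X[:, 3] + X[:, 7] > 0).astype(int)

X_pos = MinMaxScaler().fit_transform(X)          # make features non-negative for chi2
selector = SelectKBest(score_func=chi2, k=16).fit(X_pos, y)
print(selector.scores_[:5])                      # chi-square scores of the first features
print(selector.get_support(indices=True))        # indices of the 16 retained risk factors
```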

3.4.2 Recursive Feature Elimination (RFE)

RFE is a Wrapper method that recursively eliminates features to single out the important risk factors for disease diagnosis. RFE is a kind of backward selection algorithm, with the difference that it selects features based on an attribute ranking, whereas backward selection eliminates them on the basis of p-value scores [38]. For a classification problem, RFE fits a learning model and retains the specified number of most important attributes, eliminating the weakest ones. An estimator is fitted to the initial set of attributes, and features are selected recursively by removing a few of them in each loop based on a ranking obtained from the estimator's 'coef_' or 'feature_importances_' attribute. RFE offers the option to select a specific number of features or, by default, to retain the strongest ones. Scikit-learn provides RFE for recursive feature elimination and RFECV for finding the optimal number of attributes using a cross-validation approach. RFECV is useful for finding the best set of attributes, ranked on the basis of the validation score obtained with K-fold cross-validation.
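A minimal sketch (synthetic data, illustrative settings only) of RFE with a fixed number of features and of RFECV choosing that number via cross-validation, using a random forest's feature_importances_ for the ranking:

```python
# RFE with n_features_to_select=16 and RFECV with stratified 5-fold cross-validation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, RFECV
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(11)
X = rng.normal(size=(300, 30))
y = (X[:, 0] - X[:, 5] + X[:, 9] > 0).astype(int)

estimator = RandomForestClassifier(n_estimators=100, random_state=0)
rfe = RFE(estimator, n_features_to_select=16, step=1).fit(X, y)
print(rfe.get_support(indices=True))             # the 16 retained risk factors

rfecv = RFECV(estimator, step=1, cv=StratifiedKFold(5), scoring="accuracy").fit(X, y)
print(rfecv.n_features_)                         # CV-optimal number of features
```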

4 Experimental Analysis

The cancer patient data has four target variables, i.e., Hinselmann, Schiller, Cytology, and Biopsy, with 30, 63, 39, and 45 malignant samples, respectively, out of 668 samples overall. Here, the performance metrics of the ML algorithms NB, LR, KNN, SVM, LDA, MLP, DT, and RF are computed with fivefold cross-validation and parameter selection for unscaled and scaled data. Min-Max scaling, Standard scaling, and Normalization are the three methods applied to the oversampled data to obtain three kinds of scaled data. A tabular comparison of the evaluation parameters in terms of accuracy, precision, recall, F-score, and AUC is given in Tables 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18 and 19, along with the comparison of ROC curves shown in Figs. 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 and 17, for all target variables, i.e., Hinselmann, Schiller, Cytology, and Biopsy. The abbreviations used are as follows:

Table 4 Performance metrics for Hinselmann test (unscaled data)
Table 5 Performance metrics for Hinselmann test (MinMax scaler)
Table 6 Performance metrics for Hinselmann test (standard scaler)
Table 7 Performance metrics for Hinselmann test (normalization)
Table 8 Performance metrics for Schiller test (unscaled data)
Table 9 Performance metrics for Schiller test (MinMax scaler)
Table 10 Performance metrics for Schiller test (standard scaler)
Table 11 Performance metrics for Schiller test (normalization)
Table 12 Performance metrics for Cytology test (unscaled data)
Table 13 Performance metrics for Cytology test (MinMax scaler)
Table 14 Performance metrics for Cytology test (standard scaler)
Table 15 Performance metrics for Cytology test (normalization)
Table 16 Performance metrics for Biopsy test (unscaled data)
Table 17 Performance metrics for Biopsy test (MinMax scaler)
Table 18 Performance metrics for Biopsy test (standard scaler)
Table 19 Performance metrics for Biopsy test (normalization)
Fig. 2 Comparison of ROC curves for Hinselmann test (unscaled data)
Fig. 3 Comparison of ROC curves for Hinselmann test (MinMax scaler)
Fig. 4 Comparison of ROC curves for Hinselmann test (standard scaler)
Fig. 5 Comparison of ROC curves for Hinselmann test (normalization)
Fig. 6 Comparison of ROC curves for Schiller test (unscaled data)
Fig. 7 Comparison of ROC curves for Schiller test (MinMax scaler)
Fig. 8 Comparison of ROC curves for Schiller test (standard scaler)
Fig. 9 Comparison of ROC curves for Schiller test (normalization)
Fig. 10 Comparison of ROC curves for Cytology test (unscaled data)
Fig. 11 Comparison of ROC curves for Cytology test (MinMax scaler)
Fig. 12 Comparison of ROC curves for Cytology test (standard scaler)
Fig. 13 Comparison of ROC curves for Cytology test (normalization)
Fig. 14 Comparison of ROC curves for Biopsy test (unscaled data)
Fig. 15 Comparison of ROC curves for Biopsy test (MinMax scaler)
Fig. 16 Comparison of ROC curves for Biopsy test (standard scaler)
Fig. 17 Comparison of ROC curves for Biopsy test (normalization)

C, Classifier; M, Performance metric; A, Accuracy (%); P, Precision (%); R, Recall (%); F, F-score; Ac, AUC.

The evaluation shows that the top three performing ML algorithms for all four targets are RF, SVM, and DT, of which RF is superior in terms of the aforementioned evaluation metrics. The analysis reveals that RF gives the maximum accuracy with standard-scaled data: 97.81%, 93.97%, 95.6%, and 96.18% for the target variables Hinselmann, Schiller, Cytology, and Biopsy, respectively. For SVM on standard-scaled data, the accuracy for the four targets is 93.63%, 90.64%, 91.52%, and 92.69%, respectively. DT achieves accuracies of 93.54%, 88.6%, 90.67%, and 90.97% for the four targets, respectively, which are its highest values, obtained on standard-scaled data. The other evaluation parameters are also quite good for RF, SVM, and DT, as observed from the tabulated data. The ROC curves give a visual comparison of all the ML algorithms and show that RF with standard-scaled data has the maximum AUC scores of 0.99, 0.98, 0.98, and 0.99 for the four target variables, respectively. The performance of the NB classifier is the worst among all eight predictors.

The observations show that ML algorithm performance is best when the data is standard scaled in most cases; however, unscaled data also provides high-quality results with only a small difference in the performance metrics. The Min-Max Scaler also performs nearly as well as the Standard Scaler with most of the algorithms, whereas Normalization performs worst of all. Concerning computation time, evaluation on unscaled data has poorer computational efficiency than on scaled data, as shown in Fig. 18. In terms of computational cost and performance, RF, SVM, and DT with standard-scaled data are the finest algorithms for the cancer diagnosis data when all the risk factors are involved in the computation. Computational efficiency can be further enhanced by eliminating less important features using the Univariate feature selection and RFE algorithms.

Fig. 18 Average computation time for unscaled, Min-Max scaled, standard scaled and normalized data

4.1 Feature Selection Using SelectKBest

SelectKBest is a univariate feature selection approach that selects the K risk factors having the highest correlation with the target variables. The chi-square statistical test is used here with the SelectKBest algorithm to determine feature importance, as shown in Fig. 19. The top ten attributes obtained by SelectKBest for the four target categories of the disease dataset are shown in Table 20. It is observed that attributes 6, 13, 14, 29, and 31 are common to all target variables. Table 1 shows that attribute 6 corresponds to years of smoking, attribute 13 to the number of STDs, attribute 14 to STDs related to condylomatosis, attribute 29 to the radiography test for cancer, and attribute 31 to the radiography test for HPV. Apart from these, attributes 26 and 32 are common to three target variables. To obtain optimized performance, the top 16 relevant risk factors are selected using SelectKBest. The performance analysis is then carried out with these 16 risk factors for the previously identified top three ML algorithms, i.e., RF, SVM, and DT, on standard-scaled data, using fivefold cross-validation and the classifier parameter grids listed in Table 3. Removing almost half of the risk factors from the dataset does not greatly affect the evaluation metrics. Table 21 shows the implementation results, which indicate that the performance of RF, SVM, and DT with 16 risk factors is approximately the same as that obtained with the complete set of attributes.

Fig. 19 Feature importance using univariate selection (SelectKBest)

Table 20 Top ten attributes selected by SelectKBest
Table 21 Performance metrics of DT, SVM and RF algorithms with SelectKBest for K = 16

4.2 Feature Selection Using RFE

RFE is implemented here with fivefold cross-validation for the three ML algorithms, i.e., DT, SVM, and RF, with 16 risk factors selected. SelectKBest retains important features based on chi-square test scores, after which the performance of the ML methods is analyzed, whereas RFE is a recursive sequential selection approach that uses the ML classifier itself to select the optimal risk factors. The top 10 attributes among all 30 risk factors for DT-RFE, SVM-RFE, and RF-RFE are shown in Tables 22, 23 and 24.

Table 22 Top ten attributes on DT-RFE
Table 23 Top ten attributes on SVM-RFE
Table 24 Top ten attributes on RF-RFE

Table 22 shows that for DT-RFE, risk factors 9, 13, 31, and 32 are common to all target variables, and attributes 7, 17, and 29 appear among the top 10 attributes for at least three target variables. Attributes 3, 7, 9, and 13 appear in every column of Table 23 among the 10 most important risk factors for SVM-RFE. In Table 24 for RF-RFE, attributes 6, 7, 9, 13, and 31 are common to all columns. The implementation results of DT-RFE, SVM-RFE, and RF-RFE are shown in Table 25 in terms of the performance metrics. An optimized performance is achieved with recursive feature elimination (RFE) using the reduced set of 16 risk factors, compared with the analysis of the complete set of 30 attributes. Random Forest (RF) again proves to be the best ML classifier for the diagnosis of the given cervix data. The predictor accuracy is 93.72%, 95.05%, and 99.21% for DT-RFE, SVM-RFE, and RF-RFE, respectively, in the Hinselmann test. For the Schiller test, accuracies of 89.33%, 92.17%, and 96.13% are achieved for the three classifiers, respectively. For the Cytology test, DT-RFE, SVM-RFE, and RF-RFE provide accuracies of 91.7%, 92.89%, and 97.01%, respectively. The accuracy is 91.11%, 93.81%, and 98.53%, respectively, for these ML predictors in the Biopsy test. The tabulated data shows the highest precision of 98.5% for RF-RFE in the Hinselmann test. RF-RFE also gives the highest recall score of 100% for the Hinselmann and Biopsy tests, and the maximum F-score achieved is 0.99, again for RF-RFE in the Hinselmann test prediction. RF-RFE further gives the highest AUC score of 0.99 in three tests, i.e., Hinselmann, Cytology, and Biopsy. Overall, RF-RFE gives the best results in terms of all performance metrics compared with DT-RFE and SVM-RFE for the four target variables.

Table 25 Performance metrics of DT-RFE, SVM-RFE and RF-RFE algorithms with 16 selected features

5 Comparison Analysis

Analysis with the complete cervix data shows that the best results are obtained with the RF, SVM, and DT classifiers when the data is standard scaled. The experimental results show that optimized performance is achieved by eliminating the most irrelevant risk factors. SelectKBest and RFE converge on the attributes that are most relevant for prediction. The most significant risk factors in both SelectKBest and RFE are attributes 6, 7, 9, 13, 29, and 31, which appear in most of the columns. These risk factors, which contribute most to the prediction, are listed in Table 26. To make a fair comparison, the 16 most relevant risk factors are chosen in both the SelectKBest and RFE feature selection approaches.

Table 26 Most relevant risk-factors

A detailed, comprehensive performance comparison of the three best ML predictors, i.e., RF, SVM, and DT, is given in Table 27 for all 30 risk factors, the 16 risk factors obtained with SelectKBest, and the 16 risk factors obtained with RF-RFE, SVM-RFE, and DT-RFE. All these results are obtained on standard-scaled cervix data with SMOTE for oversampling, GridSearchCV for parameter selection, and fivefold cross-validation for the computation of the performance scores. The tabulated data makes a clear comparison among the implemented RF, SVM, and DT classifiers with the different approaches used for selecting risk factors.

Table 27 Comparison of DT, SVM and RF with 30 features, 16 features (SelectKBest) and 16 features (RFE)

RF-RFE gives excellent results, with accuracies of 99.21%, 96.13%, 97.01%, and 98.53% for the four targets Hinselmann, Schiller, Cytology, and Biopsy, respectively. The other parameters for RF-RFE are a precision of 98.5%, 95.59%, 96.27%, and 98.07%, a recall of 100%, 98.4%, 98.41%, and 100%, an F-score of 0.99, 0.95, 0.97, and 0.97, and an AUC score of 0.99, 0.98, 0.99, and 0.99 for the four target variables, respectively. Figure 20 shows the accuracy comparison of the implemented ML classifiers. The performance metrics with the 16 risk factors obtained from the SelectKBest approach are almost the same as those obtained with the complete set of features. The RFE approach, however, significantly improves the performance metrics of the RF, SVM, and DT classifiers, specifically accuracy, precision, and recall.

Fig. 20 Accuracy comparison of implemented ML classifiers

6 Conclusion

This paper analyzes the performance of some of the most prominent ML algorithms on cervical cancer data and observes the effect of scaling on the performance metrics in order to efficiently predict samples of malignant type. NB, LR, KNN, SVM, LDA, MLP, DT, and RF are the ML classifiers used to make predictions with all 30 risk factors. The RF, DT, and SVM classifiers rank as the top three, making the best predictions for all four target categories with standard-scaled data; performance is not greatly affected by using unscaled or Min-Max scaled data, except in the case of Normalization. Furthermore, classification is performed with these predictors using the most relevant features identified by the feature selection algorithms. The RF, SVM, and DT classifiers using Univariate feature selection (SelectKBest) and RFE are more efficient than the same classifiers using the complete set of risk factors. There is a significant reduction in computational cost and time when low-information risk factors are removed from the disease diagnosis data. RFE proves to be a better approach than SelectKBest, and the performance of the RF-RFE algorithm with 16 risk attributes is superior to that of the other algorithms.