1 Introduction

The pancreas is an essential organ of the human body, chiefly because it produces insulin, which supports the metabolism of protein, fat and sugar to supply daily energy. Insulin deficiency raises the blood sugar concentration, and the excess sugar is excreted in urine. The resulting disease, ‘diabetes mellitus’, has symptoms and complications such as increased hunger, increased thirst, hypertension, frequent urination, stroke, high blood sugar, dyslipidaemia, cardiovascular damage and kidney damage [1, 2]. Lack of exercise and obesity are the leading causes of diabetes, which also depends on the weight–height ratio, diet and hereditary factors. Diabetes is among the most serious long-term illnesses and affects large numbers of people in both developing and developed countries. The World Health Organization (WHO) reported diabetes as the highest contributor to non-communicable disease (NCD) deaths across the globe [3]. According to one report, 20 million people, including children and adults, suffered from diabetes in the USA during 2007 [4]. Another report suggested that, by 2030, more than four-fifths of diabetic patients across the globe will be from developing countries [5].

The healthcare industry generates huge amounts of data, including treatment data, electronic medical records and patient diagnosis information. These data can be mined for knowledge that lowers cost and supports efficient decision-making. Advances in the medical field, such as the development of antibiotics, vaccinations and sterilization, have transformed the industry, with cascading effects on doctors and patients alike. Owing to recent advancements in intelligent analysis methods [7], applying machine intelligence to medical diagnosis has become a very active research topic [7]. To this end, machine learning (ML) algorithms have gained significance because of their ability to handle voluminous data and make efficient predictions in computationally intensive settings [8, 9]. ML can help reduce the cost of healthcare management and support a better doctor–patient relationship. However, numerous clinical issues remain, such as the need for quick, reliable and accurate decision models, and these must be addressed for accurate disease diagnosis. The large amount of unstructured data in healthcare makes it difficult to categorize and quantify a conversation between a provider and a patient. Furthermore, the performance of classification models degrades because the majority of medical datasets contain incomplete, redundant, irrelevant and noisy information [10]. According to the Centers for Disease Control and Prevention, around one in seven US adults currently suffers from diabetes. If this trend continues, it is estimated that by 2050 one in three individuals will suffer from diabetes. In this regard, we use machine learning to assist in the early prediction of diabetes.

This work presents a prediction system for diabetes that also addresses the problems of data imbalance and the curse of dimensionality in diabetes datasets. The importance of data quality, particularly in clinical data, has driven a steadily growing body of work on data pre-processing strategies. Recent studies fail to identify an approach that is both efficient and easy to implement [11,12,13,14,15].

In this paper, we propose a novel Prediction Model using Synthetic Minority Oversampling Technique, Genetic Algorithm and Decision Tree (PMSGD) for the classification of diabetes mellitus on the Pima Indians Diabetes Database (PIDD). The proposed PMSGD model comprises four layers. The first layer is the pre-processing layer, which is responsible for missing value treatment and for outlier detection and handling [16, 17]. In this layer, the class imbalance problem is solved by oversampling the minority class using the synthetic minority oversampling technique (SMOTE), which yields high-quality training datasets [18, 19]. The second layer is responsible for feature selection; it eliminates insignificant features using correlation and a genetic algorithm (GA) to generate high-quality datasets. The reduced dimension of the dataset lowers the training complexity and also mitigates overfitting. The third layer relies on a decision tree (DT) to classify the diabetic patients’ records [20]. The fourth layer is the performance evaluation layer, in which the implementation of our prediction model on the Pima Indian Diabetes (PID) dataset confirms that the proposed model outperforms the existing models in terms of various performance metrics, including classification accuracy (CA), classification error (CE), precision, recall (sensitivity), F_Measure (FM) and area under the receiver operating characteristic (AUROC).

The major contributions of the proposed approach are as follows:

  1. Pre-processing of the dataset is performed for the following: (a) checking and handling missing values, (b) outlier detection and handling, and (c) production of high-quality training datasets by oversampling the minority class (which solves the class imbalance problem). As most existing artificial intelligence approaches neglect the minority class, they are prone to inconsistent results. This is a major issue in dealing with imbalanced datasets.

  2. Feature selection using correlation and GA is employed to remove the insignificant features from the PID dataset and produce a high-quality dataset. Owing to this reduction in the dimension of the dataset, the training complexity is reduced, thereby mitigating overfitting.

The remainder of the paper is organized as follows. Section 2 presents the related work. Section 3 presents a detailed description of the materials and methods employed. It explores the operations involved in GA, DT and SMOTE and outlines the framework of the proposed prediction model. Section 4 presents the experimental discussion and analysis. It provides the statistical description of the dataset, the visualization of attribute values and the relative performance measures. Finally, Sect. 5 concludes the paper and highlights a few open research trends in the related field.

2 Related work

Typically, several ML techniques are employed to build diagnostic or prognostic models by learning from a sample of observed cases, for the diagnosis of diabetes or the prediction of new diabetic cases. Such models can sometimes outperform expert predictions and can serve as an appropriate aid to guide physicians’ decisions. Further, the dataset used in this work has been utilized in a large number of studies that approached the task differently and achieved varied results. Some of the most influential works in this field are reviewed below.

Barakat et al. [21] used a support vector machine (SVM) for the diagnosis of diabetes and incorporated an interpretation module that converts the SVM’s black box model into an intelligible representation. The rules extracted from the SVM are intended to serve as a second opinion for the diagnosis of diabetes and as a tool for predicting diabetes by identifying high-risk people. The significance of the proposed model lies in its simplicity, understandability and validity. The obtained results show that the proposed model achieves high precision in diagnosis and prediction. In another work, Ganji et al. [22] proposed FCS-ANTMINER for the diagnosis of diabetes, in which a set of fuzzy rules is extracted using an ant colony-based classification system. The proposed model uses artificial ants to explore the state space and progressively generate fuzzy rules. The authors estimated the parameters in such a way that cooperation and competition between the ants are balanced in order to discover more precise rules. The proposed scheme achieves high accuracy in identifying diabetes. Karegowda et al. [23] proposed integrating GA and a back propagation network (BPN) for the diagnosis of diabetes. The proposed scheme relies on estimating the optimal connection weights of the BPN with the help of GA. Similarly, Aslam et al. [24] classified diabetes using a genetic programming (GP)-based model that performs feature selection using GP, F-score selection and the t test. Further, KNN and SVM classifiers are employed to test the GP-generated classification features. Similarly, Han et al. [25] proposed a hybrid model that utilized SVM to screen for diabetes mellitus. The work employed an ensemble learning module dedicated to generating transparent rules from the SVM’s black box and to solving the imbalance problem.

Hayashi et al. [26] proposed to combine rule extraction algorithm and sampling selection technique to achieve interpretable and accurate classification rules for PID data set. Similarly, Li et al. [27] proposed a probabilistic fuzzy-based classification framework that overcomes the fuzzy uncertainties and stochastic uncertainties. The work achieved better classification performance and effectiveness on lower back pain diagnosis and PID data. Cheruku et al. [28] proposed RST-BatMiner, a hybrid decision support system that relies on eliminating the redundant features from the data set using a rough set theory (RST)-based Quick-Reduct scheme. Further, the proposed fitness function is minimized using bat optimization algorithm (BOA) to generate fuzzy rules. In another work, Sharma et al. [29] proposed a novel guided stochastic gradient descent (GSGD) approach that employs greedy selection scheme to overcome the issues of inconsistency in a dataset. The proposed scheme achieved enhanced CA and convergence as compared to its counterparts. Wang et al. [30] proposed a prediction algorithm to classify diabetes mellitus by balancing data class distribution using oversampling technique. In this work, the missing data values are compensated using Naïve Bayes algorithm and the predictions are generated using random forest (RF) classifiers. The proposed work achieves CA of 87.10% on PID dataset.

In another work, Ontiveros et al. [31] proposed a shadowed Type-2 fuzzy inference system (FIS) to mitigate the computational cost and provide better approximation. Similarly, Zhang et al. [32] proposed a fuzzy partition classifier aimed at enhanced classification performance in diabetes diagnosis by exploiting its strong interpretability and uncertainty handling capability. The proposed scheme employs fuzzy clustering to partition the training dataset into several subsets and uses a fuzzy weighted algorithm to combine the predictions of the individual classifiers. The obtained results confirm that it provides enhanced interpretability and classification performance. Similarly, Das et al. [33] proposed a medical disease classification approach that generates membership values using a linguistic neuro-fuzzification (LNF) process and extracts the significantly contributing features using feature extraction algorithms. The authors validated the proposed model using eight benchmark datasets.

Nnamoko et al. [34] proposed a selective data pre-processing scheme aimed at achieving an even distribution among the artificially generated subsets. In this work, the authors identified outliers, performed oversampling and used SMOTE to balance the training data. Similarly, Ameena et al. [35] aimed to predict and detect diabetes using the Pima Indian women dataset. The work focuses on measuring the accuracy of existing prediction models for diabetes analysis using various ML techniques such as RF, SVM, DT and logistic regression. In another work, Tan et al. [36] presented a case study demonstrating a high burden of cardiometabolic risk among Asian youth with Type 2 diabetes. Further, the study highlighted glomerular hyperfiltration as a strong predictor of Type 2 diabetes. The following conclusions can be drawn from the literature reviewed above.

  • The predictive accuracy of diagnosis of diabetes remains a challenging problem and requires further investigation.

  • Missing values, the curse of dimensionality, outliers and class imbalance are common phenomena in medical datasets that directly or indirectly affect the outcome of the classification system.

  • Due to class imbalance in diabetes datasets (such as PIDD), CA alone is inadequate to determine the efficiency of the system.

3 Related terminologies

Improvement in the healthcare industry can contribute significantly to the economic development of a nation, because a healthy individual can carry out workplace tasks more efficiently than an unhealthy one. Technology such as ML plays a major role in developing healthcare infrastructure, as it can aid in the treatment, diagnosis and prevention of various health conditions. ML, along with data mining techniques such as classification [37, 38], clustering [39], regression [40, 41] and feature selection [42,43,44], is the main tool for developing an efficient healthcare system. ML operates on a basic principle: if you input garbage, you'll get garbage. In this work, garbage refers to noise, outliers and class imbalance in the dataset. Prediction using a class-imbalanced dataset is biased in favor of the majority class. The dataset used in this work is imbalanced, and therefore the minority class needs to be oversampled. For this purpose, SMOTE is used to produce class-balanced data. As not every individual feature is required for training a system, the proposed prediction model considers only the most significant features. The proposed model uses correlation and GA for feature selection. This helps to address the issues related to training complexity, performance and the curse of dimensionality in the prediction system. Finally, DT is employed to achieve the main objective of prediction. The proposed model employs GA, DT and SMOTE, as explored in the subsections below.

3.1 Genetic algorithm (GA)

GA is a search scheme based on the mechanisms of natural genetics and natural selection [45]. Based on the concept of “survival of the fittest”, GA uses random genetic operators to eliminate poorer solutions and generate promising new ones. Unexplored areas of the search space are reached by continually exploiting information about the best solutions found so far. This guided movement towards better solutions makes GA similar to tabu search and simulated annealing [46, 47]. Therefore, GA can also be considered a directed random search approach. Equation 1 presents the formal definition of GA.

$$\mathrm{GA}=\left\{P\left(0\right), N,g, s,l,p,f,t\right\},$$
(1)

where \(P\left(0\right)=({x}_{1}\left(0\right), {x}_{2}\left(0\right),\dots ,{x}_{N}\left(0\right))\in {I}^{N}\) denotes the initial population; N denotes the population size; g denotes the genetic operators; s denotes the selection strategy; l denotes the length of the string (chromosome); f denotes the fitness function \([f:I\to {R}^{+}]\); and t represents the termination criterion \([t: {I}^{N}\to \{0, 1\}]\).

The abundance of redundant and irrelevant features in modern medical datasets lowers the efficacy of existing data mining techniques and leads to uninterpretable results. This is known as the Hughes phenomenon [48]. However, appropriate attribute selection can yield interpretable and accurate results, which highlights the need for a pre-processing phase in data mining. To overcome the Hughes phenomenon, data reduction in the proposed model is done via attribute subset selection [49]. In the proposed architecture, attribute selection is done using CFS-GA. CFS (correlation-based feature selection) is an attribute selection scheme that scores candidate feature subsets heuristically, based on the correlation of each feature with the category label and with the other features. Equation 2 represents the assessment method of CFS.

$${A}_{s}=\frac{m\times \overline{{\mathrm{MCD} }_{\mathrm{al}}}}{\sqrt{m+m\left(m-1\right)\times \overline{{\mathrm{MCD} }_{\mathrm{aa}}}}},$$
(2)

where \({A}_{s}\) represents the evaluation of an attribute subset s with m items, \(\overline{{\mathrm{MCD} }_{\mathrm{aa}}}\) represents the mean correlation degree between various attributes and \(\overline{{\mathrm{MCD} }_{\mathrm{al}}}\) represents the mean correlation degree between category label and the attributes. Higher evaluation value is produced by bigger \(\overline{{\mathrm{MCD} }_{\mathrm{al}}}\) or smaller \(\overline{{\mathrm{MCD} }_{\mathrm{aa}}}\). The correlation degree can be estimated by information gain as shown in Eqs. (3) and (4).
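The behaviour of the subset evaluation can be illustrated with a short Python sketch. It assumes the standard CFS merit form, in which the mean attribute-attribute correlation multiplies m(m − 1) under the square root; the function name `cfs_merit` and the correlation values are hypothetical and chosen only for illustration.

```python
import math


def cfs_merit(m: int, mean_attr_label: float, mean_attr_attr: float) -> float:
    """Merit of a subset of m attributes (standard CFS form, assumed here).

    mean_attr_label: mean correlation between attributes and the category label.
    mean_attr_attr:  mean correlation among the attributes themselves.
    """
    if m < 1:
        raise ValueError("the subset must contain at least one attribute")
    return (m * mean_attr_label) / math.sqrt(m + m * (m - 1) * mean_attr_attr)


# Two hypothetical 4-attribute subsets: same relevance, different redundancy.
low_redundancy = cfs_merit(4, 0.30, 0.10)
high_redundancy = cfs_merit(4, 0.30, 0.60)
print(f"low redundancy:  {low_redundancy:.3f}")
print(f"high redundancy: {high_redundancy:.3f}")
```

With equal relevance to the label, the subset whose attributes are less correlated with one another receives the higher merit, which is the trade-off described above.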

$$H\left(C\right)=-\sum_{c\in C}p\left(c\right)\times {\mathrm{log}}_{2}\left(p\left(c\right)\right),$$
(3)
$$H \left(C|D\right)=-\sum_{d\in D}p\left(d\right)\sum_{c\in C}p\left(c|d\right)\times {\mathrm{log}}_{2}\left(p\left(c|d\right)\right),$$
(4)

where c is any possible value of the category attribute C and d is any possible value of attribute D. \(H\left(C\right)\) and \(H \left(C|D\right)\) represent the entropy of C and the entropy of C conditioned on D, respectively. Therefore, the entropy reduction of attribute C can be estimated as

$${\mathrm{ER}}_{C}=H\left(C\right)-H \left(C|D\right).$$
(5)

As \({\mathrm{ER}}_{C}\) represents the amount of information about attribute C provided by attribute D, a higher value of \({\mathrm{ER}}_{C}\) reflects a higher correlation degree between these attributes. For an effective comparison among attributes, the information gain must be normalized to [0, 1], because information gain is biased towards attributes with more values. The normalized correlation between C and D (the symmetrical uncertainty) can be estimated as

$${U}_{CD}=2.0\times \frac{H\left(C\right)-H \left(C|D\right)}{H\left(C\right)+H(D)}.$$
(6)
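Equations (3) to (6) can be checked numerically with a minimal Python sketch for discrete attributes. The function names and the toy attribute values are hypothetical; continuous attributes would first need to be discretized, which is not shown.

```python
import math
from collections import Counter


def entropy(values):
    """H(C) as in Eq. (3), estimated from observed frequencies."""
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in Counter(values).values())


def conditional_entropy(c_values, d_values):
    """H(C|D) as in Eq. (4): entropy of C within each value of D, weighted by p(d)."""
    n = len(c_values)
    groups = {}
    for c, d in zip(c_values, d_values):
        groups.setdefault(d, []).append(c)
    return sum((len(g) / n) * entropy(g) for g in groups.values())


def symmetrical_uncertainty(c_values, d_values):
    """U_CD as in Eq. (6); returns 0.0 when both attributes are constant."""
    h_c, h_d = entropy(c_values), entropy(d_values)
    if h_c + h_d == 0:
        return 0.0
    return 2.0 * (h_c - conditional_entropy(c_values, d_values)) / (h_c + h_d)


labels = [0, 0, 1, 1]
informative = ["low", "low", "high", "high"]    # determines the label exactly
uninformative = ["low", "high", "low", "high"]  # independent of the label
print(symmetrical_uncertainty(labels, informative))
print(symmetrical_uncertainty(labels, uninformative))
```

An attribute that determines the label exactly scores 1, and an attribute that is independent of the label scores 0, so values from different attributes are directly comparable.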

Even though CFS performs well in terms of dimension reduction, it does not achieve a globally optimal result. GA, in contrast, is a wrapper scheme for dimension reduction with global search capability. The proposed scheme therefore combines CFS and GA into a hybrid CFS-GA algorithm that operates in four parts: a coding scheme in which every individual is encoded as a binary string; a selection operator that employs the roulette wheel method; a crossover operator that produces new individuals by swapping segments at the cross points; and a mutation operator that applies bit mutation to the binary encoding. The proposed hybrid CFS-GA algorithm is presented as Algorithm 1.

Algorithm 1 Hybrid CFS-GA algorithm
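The four parts named above (binary coding, roulette wheel selection, crossover and bit mutation) can be sketched in Python as follows. This is an illustrative sketch, not a reproduction of Algorithm 1: the population size, rates, number of generations, the `RELEVANCE` scores and the `toy_fitness` function are all assumptions, and single-point crossover is used as one simple choice of crossover.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its (non-negative) fitness."""
    total = sum(fitnesses)
    if total <= 0:
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if running >= pick:
            return individual
    return population[-1]


def crossover(a, b, rng):
    """Single-point crossover: swap the tails of two parents (needs length >= 2)."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(bits, rate, rng):
    """Bit mutation: flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def run_ga(fitness, n_bits, pop_size=20, generations=30,
           crossover_rate=0.6, mutation_rate=0.05, seed=0):
    """Evolve binary feature masks; all default parameters are illustrative assumptions."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        fitnesses = [fitness(ind) for ind in population]
        next_pop = []
        while len(next_pop) < pop_size:
            p1 = roulette_select(population, fitnesses, rng)
            p2 = roulette_select(population, fitnesses, rng)
            if rng.random() < crossover_rate:
                c1, c2 = crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]
            next_pop.extend([mutate(c1, mutation_rate, rng),
                             mutate(c2, mutation_rate, rng)])
        population = next_pop[:pop_size]
        best = max(population + [best], key=fitness)  # keep the best mask seen so far
    return best


# Hypothetical feature-label relevance scores for eight features (illustrative only).
RELEVANCE = [0.10, 0.45, 0.08, 0.05, 0.12, 0.30, 0.25, 0.28]


def toy_fitness(mask):
    """CFS-style merit with an assumed constant inter-feature correlation of 0.2."""
    chosen = [r for r, bit in zip(RELEVANCE, mask) if bit]
    m = len(chosen)
    if m == 0:
        return 0.0
    return sum(chosen) / (m + m * (m - 1) * 0.2) ** 0.5


best_mask = run_ga(toy_fitness, n_bits=len(RELEVANCE))
print(best_mask, round(toy_fitness(best_mask), 3))
```

Each bit of a chromosome marks whether the corresponding feature is kept, so the fitness function is the only place where the CFS evaluation enters the search.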

Importance of GA is outlined as follows.

  • GA derives new candidate solutions from the best of the previous solutions, so the candidate solutions improve over time. The idea of GA is to combine different solutions so as to carry the best genes (features) of every generation forward and produce better solutions in subsequent generations.

  • GA can handle datasets with many features.

  • GA may not require particular domain knowledge for computation.

3.2 Decision tree

In contrast to many other classification techniques, DT is a white box model and also an active learning scheme [50]. A DT comprises several leaf nodes, some internal nodes and a single root node. A decision tree is shown in Fig. 1, with its root at the top. In the figure, a square denotes a condition (internal node), based on which the tree splits into branches (edges). The end of a branch that does not split any further is a decision (leaf), shown as an oval. Every leaf node carries a class label and is connected to the root node via internal nodes. The starting node of a DT is the root node, and the paths from this node to the leaf nodes yield the classification rules. System operators can use these rules as guidelines to assess and monitor real-time voltage stability.

Fig. 1 A typical decision tree architecture

In this work, we use the C4.5 DT algorithm, which uses the information gain ratio for attribute selection. The C4.5 algorithm mitigates the overfitting problem and handles continuous attributes effectively [50, 51]. The computation procedure of the C4.5 algorithm can be described in five steps, as detailed below.

(1) The initial information entropy for the dataset S is calculated as

$$\mathrm{Entropy} \left(S\right)=-\sum_{a=1}^{m}{p}_{a}\times {\mathrm{log}}_{2}\left({p}_{a}\right),$$
(7)

where m represents the total number of classes and \({p}_{a}\) represents the proportion of samples in S that belong to class a. Two limiting cases arise.

Case 1: If the samples are distributed evenly over the m classes, \(\left[{p}_{a}=\frac{1}{m}\right]\), then \(\mathrm{Entropy }\left(S\right)={\mathrm{log}}_{2}m\) (highest).

Case 2: If the same class label is assigned to all the data, \([{p}_{a}=1]\) for that single class, then \(\mathrm{Entropy} \left(S\right)=0\) (lowest).

(2) Partition S on an attribute A into two subsets (\({S}_{\mathrm{left}}\) and \({S}_{\mathrm{right}}\)). The split entropy of S with respect to A is calculated as

$${\mathrm{Entropy}}_{A}\left(S\right)=\frac{\left|{S}_{\mathrm{left}}\right|}{\left|S\right|}\times \mathrm{Entropy}\left({S}_{\mathrm{left}}\right)+\frac{\left|{S}_{\mathrm{right}}\right|}{\left|S\right|}\times \mathrm{Entropy}\left({S}_{\mathrm{right}}\right),$$
(8)

where A is an attribute of S, and \(\left|S\right|\), \(\left|{S}_{\mathrm{left}}\right|\) and \(\left|{S}_{\mathrm{right}}\right|\) represent the number of samples in S, \({S}_{\mathrm{left}}\) and \({S}_{\mathrm{right}}\), respectively.

(3) Information gain of attribute A is obtained as

$${\mathrm{Information}}_{\mathrm{gain}}=\mathrm{Entropy}\left(S\right)-{\mathrm{Entropy}}_{A}\left(S\right)$$
(9)

Higher value of \({\mathrm{Information}}_{\mathrm{gain}}\) denotes more entropy reduction resulting in a better attribute.

(4) To normalize the information gain and avoid overfitting, the C4.5 algorithm introduces a split information value, computed over the k subsets \({S}_{a}\) produced by splitting S on A, as

$${\mathrm{Split}}_{\mathrm{info}}\left(A\right)=-\sum_{a=1}^{k}\frac{\left|{S}_{a}\right|}{\left|S\right|}\times {\mathrm{log}}_{2}\left[\frac{\left|{S}_{a}\right|}{\left|S\right|}\right].$$
(10)

(5) For every node of a DT, information gain ratio is calculated as

$${\mathrm{IG}}_{\mathrm{ratio}}\left(A\right)=\frac{{\mathrm{Information}}_{\mathrm{gain}}}{{\mathrm{Split}}_{\mathrm{info}}\left(A\right)},$$
(11)

where \({\mathrm{IG}}_{\mathrm{ratio}}\left(A\right)\) represents the information gain ratio of attribute A. The attribute having higher value of \({\mathrm{IG}}_{\mathrm{ratio}}\) is selected. This process is recursively executed to split S into several better subsets. The DT learning algorithm is presented as Algorithm 2.
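As a worked illustration of Eq. (7), consider the class counts reported for PIDD in Sect. 4.1 (268 diabetic and 500 non-diabetic samples out of 768). The values are rounded to three decimals.

$$\mathrm{Entropy}\left(S\right)=-\frac{268}{768}\,{\mathrm{log}}_{2}\frac{268}{768}-\frac{500}{768}\,{\mathrm{log}}_{2}\frac{500}{768}\approx 0.530+0.403=0.933.$$

This lies below the two-class maximum of \({\mathrm{log}}_{2}2=1\), which reflects the imbalance between the two classes.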

Algorithm 2 DT learning algorithm
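Steps (1) to (5) can be traced for a single continuous attribute with a short Python sketch. It assumes a binary split of the form "value ≤ threshold", so k = 2 in Eq. (10); the function names and the toy glucose readings are hypothetical, and a full C4.5 implementation includes further details (such as pruning) that are not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy(S) as in Eq. (7)."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Information gain ratio of the binary split `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty carries no information
    n = len(labels)
    split_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)  # Eq. (8)
    info_gain = entropy(labels) - split_entropy                                          # Eq. (9)
    split_info = -sum((len(s) / n) * math.log2(len(s) / n) for s in (left, right))       # Eq. (10)
    return info_gain / split_info                                                        # Eq. (11)


def best_threshold(values, labels):
    """Return the observed value with the highest gain ratio, or None if no split exists."""
    candidates = sorted(set(values))[:-1]
    if not candidates:
        return None
    return max(candidates, key=lambda t: gain_ratio(values, labels, t))


# Hypothetical glucose readings and class labels (1 = diabetic), for illustration only.
glucose = [85, 90, 100, 110, 140, 150, 160, 170]
outcome = [0, 0, 0, 0, 1, 1, 1, 1]
t = best_threshold(glucose, outcome)
print(t, gain_ratio(glucose, outcome, t))
```

In this toy sample the threshold 110 separates the two classes perfectly, so both the information gain and the split information equal 1 and the gain ratio reaches its maximum.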

Importance of DT:

  • understandable classification rules are generated from the training data;

  • constructs the fastest tree;

  • only necessary features are needed before all information is classified;

  • finding leaf nodes allows the pruning of test results, decreasing the number of tests;

  • whole dataset is scanned to build tree.

3.3 SMOTE

Chawla et al. [52] introduced an oversampling technique named SMOTE that utilizes neighbouring information to create new artificial instances, in contrast to other existing methods that rely on random oversampling of instances. SMOTE enlarges the minority class with randomly generated synthetic instances, thereby effectively balancing the class distribution. It synthesizes new minority instances from existing ones, using linear interpolation between an instance and its nearest minority-class neighbours to generate virtual training records. Pseudocode of the SMOTE algorithm is presented as Algorithm 3.

Algorithm 3 SMOTE algorithm
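The interpolation step at the core of SMOTE can be sketched in Python as follows. This is a simplified illustration rather than a reproduction of Algorithm 3: the function name `smote`, its parameters and the toy minority points are hypothetical, Euclidean distance is assumed, and neighbours are found by brute force.

```python
import math
import random


def _distance(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("at least two minority samples are needed to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the chosen sample (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: _distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


# Hypothetical two-feature minority samples, for illustration only.
minority_points = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (2.5, 2.5)]
new_points = smote(minority_points, n_synthetic=6, k=2)
for point in new_points:
    print([round(v, 3) for v in point])
```

Because each synthetic point lies on the line segment between two existing minority samples, the new records stay inside the region already occupied by the minority class instead of duplicating existing records.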

The main reasons for using SMOTE are as follows:

  • it solves the class imbalance problem in classification;

  • it is independent of the underlying classifier;

  • it can be easily implemented.

3.4 Proposed PMSGD model

In the previous sections, we discussed the ML concepts that are used to address the aforementioned problems of existing diabetes prediction systems. The general architecture of our proposed prediction model can be divided into four layers, namely the pre-processing layer, the dimensionality reduction layer, the training layer and the performance evaluation layer. The functionality of these layers is discussed in the subsections below.

3.4.1 Pre-processing layer

In this layer, pre-processing of the dataset is performed for the following: (1) checking and handling missing values, (2) outlier detection and handling, and (3) production of high-quality training datasets by oversampling the minority class (which solves the class imbalance problem). As most existing artificial intelligence approaches neglect the minority class, they are prone to inconsistent results. This is a major issue in dealing with imbalanced datasets. Therefore, the most significant outcome is success on the minority class.
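The first two pre-processing tasks can be illustrated with a minimal Python sketch. The text does not state which imputation or outlier rule is applied, so every choice here is an assumption made for illustration: zeros are treated as missing measurements, missing values are replaced by the median of the observed values, and outliers are capped at the 1.5 × IQR fences. The function names and the sample column are hypothetical.

```python
import statistics


def impute_zeros_with_median(column):
    """Replace zeros (assumed to encode missing measurements) by the observed median."""
    observed = [v for v in column if v != 0]
    median = statistics.median(observed)
    return [median if v == 0 else v for v in column]


def cap_outliers_iqr(column, factor=1.5):
    """Clip values lying outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = statistics.quantiles(column, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [min(max(v, low), high) for v in column]


# Hypothetical glucose column with two missing entries and one extreme value.
glucose_raw = [148, 85, 0, 89, 137, 116, 78, 0, 197, 125, 600]
glucose_clean = cap_outliers_iqr(impute_zeros_with_median(glucose_raw))
print(glucose_clean)
```

Imputing before capping keeps the placeholder zeros from distorting the quartiles that define the outlier fences.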

3.4.2 Dimensionality reduction layer

The performance of an ML algorithm depends on its input variables. With a larger number of input variables, the performance of ML algorithms can degrade, which may have a dramatic effect on the output of algorithms fitted on data with many input features. In this layer, feature selection using correlation and GA is employed to remove the insignificant features from the PID dataset and produce a high-quality dataset. Owing to this reduction in the dimension of the dataset, the training complexity is reduced, thereby mitigating overfitting. The simulation in this layer reveals the four most significant features of individuals with diabetes, namely glucose, BMI, diabetes pedigree function and age.

3.4.3 Training layer

In this layer, the DT-based prediction model is trained using different splits of the training dataset. The training dataset comprises an array of features and the associated class labels. Through the iterative process of the C4.5 DT algorithm, the proposed prediction model is trained and can then be used to predict the output for new inputs.

The model's training phase starts from a set of pre-processed training data and uses the gain ratio concept. Each training sample consists of an n-dimensional vector of feature values together with the class to which the sample belongs. During training, for each node of the tree, the attribute that most effectively divides the set of samples into subsets enriched in one class or the other is chosen. The gain ratio is the partitioning criterion: the attribute with the highest gain ratio is picked to make the decision. The procedure then recurses on the divided sublists.

3.4.4 Performance evaluation layer

This layer is used to measure the effectiveness of the model. The performance of the proposed prediction model on the PIDD dataset is evaluated on different metrics such as CA, CE, precision, sensitivity, FM and AUROC.

Figure 2 depicts the framework of the proposed prediction model.

Fig. 2 The framework of the proposed PMSGD model

4 Experiment and analysis

This section provides the details of the dataset used, the experimental environment, a statistical study of the dataset and the results of the prediction model on different splits of the dataset. It also demonstrates the significance of the proposed prediction model with the help of a comparative study.

4.1 Dataset

This dataset originated from the National Institute of Diabetes and Digestive and Kidney Diseases. The dataset's purpose is to predict whether a patient has diabetes or not, based on the diagnostic measures included in the dataset. Various restrictions have been imposed on the selection of instances from a larger database. In particular, all patients considered in this dataset are females of Pima Indian heritage who are at least 21 years old. The training and testing sets are taken from the UCI Repository site (https://www.archive.ics.uci.edu/) [53,54,55,56]. The PIDD dataset is composed of 768 samples, with 268 diabetic and 500 non-diabetic samples. It contains eight numerically valued features and a class label, where the value '0' means diabetes negative and the value ‘1’ means diabetes positive. Table 1 presents the statistical description of the dataset attributes, and the visualization of the attribute values with respect to the other attributes is depicted in Fig. 3. Visualization of data helps to present information in such a way that patterns and outliers are easy to identify. A successful visualization suppresses the noise in the information and shows the useful details. In the proposed scheme, the pre-processing phase is capable of handling the noise and outliers [57,58,59,60,61].

Table 1 Dataset attribute statistical description

Fig. 3 Visualization of attribute values

4.2 Experimental environment and simulation parameters

Experiments are performed on a PC with an Intel(R) Core(TM) i7 7th generation processor and 8 GB of memory, running Windows 10. For simulation, the Weka ML library and Java 1.8 are used. To obtain uniformity in the results, the proposed algorithm is executed ten times with all the variations of the dataset and the best outcomes are recorded. Three different simulation strategies have been performed on the proposed PMSGD model using the PIDD dataset: (1) with and without oversampling, (2) with and without feature selection, and (3) with and without both feature selection and oversampling.

4.3 Performance measures

The confusion matrix describes the classifier’s performance by contrasting the actual classes with the predicted classes. The confusion matrix for binary classification is composed of four quadrants, as shown in Table 2. True positive (TP) counts the cases in which the model predicts the positive class as positive. False positive (FP) counts the cases in which the model predicts the negative class as positive. True negative (TN) counts the cases in which the model predicts the negative class as negative. False negative (FN) counts the cases in which the model predicts the positive class as negative.

Table 2 Confusion matrix

The performance indicators CA, CE, precision, sensitivity, FM and AUROC are quantified in accordance with the confusion matrix. CA is the ratio of correctly classified tuples to the total number of tuples. CE is the ratio of incorrectly classified tuples to the total number of tuples. Precision is the ratio of TP to the predicted positive tuples. Sensitivity is the ratio of TP to the actual positive samples. FM is the harmonic mean of precision and recall. AUROC is the area under the curve of recall (true positive rate) plotted against the false positive rate. It indicates how well the model distinguishes between the classes: the higher the value, the better the model is at classifying 0s as 0s and 1s as 1s. For example, the higher the value, the better the model distinguishes between patients with the disease and patients without it. These indicators are calculated as below.

$$\mathrm{Classification\ accuracy}\ \left(\mathrm{CA}\right)=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}},$$
(12)
$$\mathrm{Classification\ error}\ \left(\mathrm{CE}\right)=\frac{\mathrm{FP}+\mathrm{FN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}},$$
(13)
$$\mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}},$$
(14)
$$\mathrm{Recall}\ \left(\mathrm{sensitivity}\right)=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}},$$
(15)
$${F}_{\mathrm{Measure}}=\frac{2\times \mathrm{recall}\times \mathrm{precision}}{\mathrm{recall}+\mathrm{precision}}.$$
(16)
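Equations (12) to (16) translate directly into code, as the following Python sketch shows. The confusion-matrix counts and the scores are hypothetical and are not results from this study; the `auroc` helper uses the rank interpretation of the area under the ROC curve (the probability that a randomly chosen positive sample is scored above a randomly chosen negative one), which is one of several equivalent ways to compute it.

```python
def classification_metrics(tp, tn, fp, fn):
    """CA, CE, precision, recall and F-measure from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * recall * precision / (recall + precision)
                 if recall + precision else 0.0)
    return {
        "CA": (tp + tn) / total,   # Eq. (12)
        "CE": (fp + fn) / total,   # Eq. (13)
        "precision": precision,    # Eq. (14)
        "recall": recall,          # Eq. (15)
        "FM": f_measure,           # Eq. (16)
    }


def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts for a 100-sample test set.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# Hypothetical predicted scores for two positive and two negative samples.
print(auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

Note that CA and CE always sum to 1, so reporting both adds no information beyond either one alone; precision, recall and AUROC are the metrics that expose performance on the minority class.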

4.4 Results and discussions

The split test methodology is adopted for training the model and validating the results on the training and test datasets. The main motivation behind choosing the split test methodology is as follows:

  • Training and testing on the same data only reveals the output of the model on data it has already seen, and gives no indication of how the algorithm would perform on data on which the model was not trained.

  • The problem with multiple split tests is that a few instances of data might never be used for training. This leads to distorted results that do not give a clear indication of the algorithm's accuracy.

  • Cross-validation gives an unbiased estimate of the performance of a method on unseen data. However, if the method itself uses randomness, it produces different results for the same training data each time it is trained with a different random seed (the start of the pseudo-random sequence). Cross-validation does not compensate for this uncertainty in the results of the algorithm.

In the split test methodology, we tested the considered model with different percentage splits: 60–40%, 65–35%, 70–30%, 75–25% and 80–20%. Here, the first number is the size of the training set and the second is the size of the testing set. The method is simulated ten times for each split and the best five outcomes are recorded for each dataset. Four variants of the dataset are used for training and testing, namely PIDD, PIDD + SM, PIDD + GA and PIDD + SM + GA. PIDD is the PIMA Indian Diabetes Dataset; PIDD + SM is the dataset oversampled using SMOTE; PIDD + GA is the dataset with features selected using correlation and GA; and PIDD + SM + GA is the dataset oversampled using SMOTE with features selected using correlation and GA. The features selected using GA on PIDD are shown in Table 3.
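The repeated percentage split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `percentage_split` and `best_runs` are hypothetical, the data is synthetic, and `majority_baseline` is a trivial stand-in for the C4.5 classifier so that the example stays self-contained.

```python
import random
from typing import Callable, List, Sequence, Tuple

Row = Tuple[Sequence[float], int]  # (feature vector, class label 0/1)


def percentage_split(rows: Sequence[Row], train_fraction: float,
                     rng: random.Random) -> Tuple[List[Row], List[Row]]:
    """Shuffle the rows and cut them into a training and a testing part."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def best_runs(rows: Sequence[Row], train_fraction: float,
              evaluate: Callable[[List[Row], List[Row]], float],
              runs: int = 10, keep: int = 5, seed: int = 0) -> List[float]:
    """Repeat the split `runs` times and return the `keep` highest scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        train, test = percentage_split(rows, train_fraction, rng)
        scores.append(evaluate(train, test))
    return sorted(scores, reverse=True)[:keep]


def majority_baseline(train: List[Row], test: List[Row]) -> float:
    """Hypothetical stand-in classifier: always predict the majority class
    of the training set and return the accuracy on the testing set."""
    positives = sum(label for _, label in train)
    guess = 1 if 2 * positives >= len(train) else 0
    return sum(1 for _, label in test if label == guess) / len(test)


# Synthetic data: 100 rows, roughly one third labelled positive.
rows = [((float(i),), 1 if i % 3 == 0 else 0) for i in range(100)]

for fraction in (0.60, 0.65, 0.70, 0.75, 0.80):
    top5 = best_runs(rows, fraction, majority_baseline)
    print(f"{fraction:.0%} training:", [round(score, 4) for score in top5])
```

Because each run reshuffles the data, the scores vary from run to run; fixing `seed` makes a series of runs repeatable.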

Table 3 Features selected using GA on PIDD

The features selected using GA on the dataset oversampled using SMOTE are shown in Table 4.

Table 4 Features selected using GA on PIDD + SM

Table 5 shows the parameter configuration for best-selected features using GA.

Table 5 Parameter configuration

In ML, CA is frequently used as the performance measure in diabetes research. Because of the class imbalance in diabetes datasets such as PIDD, CA alone is inadequate to determine the efficiency of the system, as stated in the related work. To assess and compare the proposed prediction model, the following three simulation scenarios are performed.

  • Classification using the C4.5 Decision Tree classifier with PIDD, PIDD + GA, PIDD + SM and PIDD + SM + GA.

  • Evaluation of the trained model against a series of metrics such as CA, CE, precision, sensitivity, FM and AUROC.

  • Outcome of the proposed prediction model is compared with other standard existing systems in terms of the CA, CE, precision, sensitivity, FM and AUROC.
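
The inadequacy of CA under class imbalance can be seen in a small worked example. Assume a hypothetical test set of 100 tuples (the counts are illustrative, not taken from the experiments), of which 35 are diabetic and 65 are non-diabetic, and a degenerate classifier that labels every tuple as non-diabetic. Then TP = 0, FP = 0, TN = 65 and FN = 35, so that

$$\mathrm{CA}=\frac{0+65}{0+65+0+35}=0.65,\qquad \mathrm{Sensitivity}=\frac{0}{0+35}=0.$$

Such a classifier reaches 65% accuracy without identifying a single diabetic patient, which is why precision, sensitivity, FM and AUROC are reported alongside CA.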

Tables 6, 7, 8, 9 and 10 show the simulation results of the top five outcomes of the proposed model on the PIDD, PIDD + GA, PIDD + SM and PIDD + SM + GA datasets. The model is simulated ten times for each dataset and the top five outcomes are recorded. In each iteration the dataset is randomised, which may change the performance.

Table 6 60–40 Training–testing result
Table 7 65–35 Training–testing result
Table 8 70–30 Training–testing result
Table 9 75–25 Training–testing result
Table 10 80–20 Training–testing result

Table 6 depicts the simulation outcomes of the PMSGD model in which 60% of the tuples are used as the training set and the remaining 40% as the testing set. The following observations are noted in this simulation strategy with respect to accuracy.

  • The best outcome is observed on PIDD + SM. The outcome of the considered performance indicators namely CA, CE, precision, sensitivity, FM and AUROC are 78.5024, 21.4976, 0.7854, 0.8037, 0.7945 and 0.8230, respectively.

  • The second-best outcome is observed on PIDD + SM + GA. The outcome of the considered performance indicators namely CA, CE, precision, sensitivity, FM and AUROC are 77.5362, 22.4638, 0.7642, 0.8178, 0.7901 and 0.8230, respectively.

  • The third best outcome is observed on PIDD + GA. The outcome of the considered performance indicators namely CA, CE, precision, sensitivity, FM and AUROC are 76.9481, 23.0519, 0.6609, 0.7037, 0.6816 and 0.8036, respectively.

  • The fourth best outcome is observed on PIDD. The outcome of the considered performance indicators namely CA, CE, precision, sensitivity, FM and AUROC are 76.6234, 23.3766, 0.7500, 0.5000, 0.6000 and 0.7754, respectively.

Table 7 depicts the simulation outcomes of the PMSGD model in which 65% of the tuples are used as the training set and the remaining 35% as the testing set. The following observations are noted in this simulation strategy with respect to accuracy.

  • The best outcome is observed on PIDD. The outcome of the considered performance indicators namely CA, CE, precision, sensitivity, FM and AUROC are 79.5539, 20.4461, 0.7910, 0.5638, 0.6584 and 0.7619, respectively.

  • The second-best outcome is observed on PIDD + GA. The outcome of the considered performance indicators namely CA, CE, precision, sensitivity, FM and AUROC are 79.5539, 20.4461, 0.7910, 0.5638, 0.6584 and 0.7632, respectively.

  • The third best outcome is observed on PIDD + SM. The outcome of the considered performance indicators namely CA, CE, precision, sensitivity, FM and AUROC are 78.7879, 21.2121, 0.7581, 0.8670, 0.8089 and 0.8068, respectively.

  • The fourth best outcome is observed on PIDD + SM + GA. The outcome of the considered performance indicators namely CA, CE, precision, sensitivity, FM and AUROC are 78.7293, 21.2707, 0.7350, 0.8171 and 0.8078, respectively.

Table 8 depicts the simulation outcomes of the PMSGD model in which 70% of the tuples are used as the training set and the remaining 30% as the testing set. The following observations are noted in this simulation strategy with respect to accuracy.

  • The best outcome is observed on PIDD + SM + GA. The outcome of the considered performance indicators namely CA, CE, precision, sensitivity, FM and AUROC are 79.4212, 20.5788, 0.7513, 0.9006, 0.8192 and 0.8150, respectively.

  • The second-best outcome is observed on PIDD + GA. The outcome of the considered performance indicators namely CA, CE, precision, sensitivity, FM and AUROC are 77.9221, 22.0779, 0.7419, 0.5679, 0.6434 and 0.7776, respectively.

  • The third best outcome is observed on PIDD + SM. The outcome of the considered performance indicators namely CA, CE, precision, sensitivity, FM and AUROC are 77.4920, 22.5080, 0.7974, 0.7578, 0.7771 and 0.8053, respectively.

  • The fourth best outcome is observed on PIDD. The outcome of the considered performance indicators namely CA, CE, precision, sensitivity, FM and AUROC are 77.0563, 22.9437, 0.6458, 0.7654, 0.7006 and 0.7865, respectively.

Table 9 depicts the simulation outcomes of the PMSGD model in which 75% of the tuples are used as the training set and the remaining 25% as the testing set. The following observations are noted in this simulation strategy with respect to accuracy.

  • The best outcome is observed on PIDD + SM + GA. The outcome of the considered performance indicators namely CA, CE, precision, sensitivity, FM and AUROC are 79.9228, 20.0772, 0.8060, 0.8060, 0.8060 and 0.8473, respectively.

  • The second-best outcome is observed on PIDD + SM. The outcome of the considered performance indicators namely CA, CE, precision, sensitivity, FM and AUROC are 79.5367, 20.4633, 0.7956, 0.8134, 0.8044 and 0.8359, respectively.

  • The third best outcome is observed on PIDD + GA. The outcome of the considered performance indicators namely CA, CE, precision, sensitivity, FM and AUROC are 77.6042, 22.3958, 0.7143, 0.5970, 0.6504 and 0.7427, respectively.

  • The fourth best outcome is observed on PIDD. The outcome of the considered performance indicators namely CA, CE, precision, sensitivity, FM and AUROC are 77.6042, 22.3958, 0.7143, 0.5970, 0.6504 and 0.7427, respectively.

Table 10 depicts the simulation outcomes of the PMSGD model in which 80% of the tuples are used as the training set and the remaining 20% as the testing set. The following observations are noted in this simulation strategy with respect to accuracy.

  • The best outcome is observed on PIDD + SM. The outcome of the considered performance indicators namely CA, CE, precision, sensitivity, FM and AUROC are 82.1256, 17.8744, 0.8070, 0.8598, 0.8326 and 0.8511, respectively.

  • The second-best outcome is observed on PIDD + SM + GA. The outcome of the considered performance indicators namely CA, CE, precision, sensitivity, FM and AUROC are 80.1932, 19.8068, 0.7797, 0.8598, 0.8178 and 0.8490, respectively.

  • The third best outcome is observed on PIDD + GA. The outcome of the considered performance indicators namely CA, CE, precision, sensitivity, FM and AUROC are 77.9221, 22.0779, 0.7273, 0.5926, 0.6531 and 0.8222, respectively.

  • The fourth best outcome is observed on PIDD. The outcome of the considered performance indicators namely CA, CE, precision, sensitivity, FM and AUROC are 77.2727, 22.7273, 0.8065, 0.4630, 0.5882 and 0.8129, respectively.

4.5 Performance evaluation with existing systems

The proposed method is compared on the basis of CA, CE, precision, sensitivity, FM and AUROC. It is worth mentioning that the proposed model yields superior results in comparison to the various existing schemes, as shown in Table 11. The best outcome is observed on the PIDD + SM dataset, where CA, CE, precision, sensitivity, FM and AUROC are 82.1256, 17.8744, 0.8070, 0.8598, 0.8326 and 0.8511, respectively. The second-best outcome is observed on the PIDD + SM + GA dataset, where the same indicators are 80.1932, 19.8068, 0.7797, 0.8598, 0.8178 and 0.8490, respectively. The third-best outcome is observed on both the PIDD and PIDD + GA datasets, where the indicators are 79.5539, 20.4461, 0.7910, 0.5638, 0.6584 and 0.7619, respectively.

Table 11 Performance evaluation with other existing methods

The proposed PMSGD model handles missing values and outliers in pre-processing. The dataset used in this work suffers from the class imbalance problem. The proposed model addresses this by oversampling the minority class using SMOTE, which yields high-quality training datasets. Important attributes are then selected using correlation and GA, eliminating the insignificant features. The reduced dimension of the dataset lowers the training complexity and addresses the issue of overfitting. The processed data is then used to predict whether a testing instance has diabetes or not. The notable observation in this experimentation is that the proposed PMSGD model outperforms the other techniques given in [45, 46]. The best outcome achieved by the proposed system in terms of CA, CE, precision, sensitivity, FM and AUROC is 82.1256%, 17.8744%, 0.8070, 0.8598, 0.8326 and 0.8511, respectively.
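
The oversampling step can be illustrated with a minimal sketch of the SMOTE idea: each synthetic tuple is placed on the line segment between a minority-class tuple and one of its k nearest minority-class neighbours. The function name `smote`, its parameters and the sample data are assumptions made for illustration. The sketch also picks the base tuple at random rather than iterating over every minority tuple as the original SMOTE procedure does, so it is a simplified variant and not the implementation used in the experiments.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def smote(minority: List[Vector], n_new: int, k: int = 5,
          seed: int = 0) -> List[List[float]]:
    """Generate `n_new` synthetic minority-class tuples by interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority tuples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than tuples
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of `base`, excluding `base` itself.
        neighbours = sorted(
            (x for x in minority if x is not base),
            key=lambda x: math.dist(base, x),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Illustrative two-feature minority class (values are made up).
minority = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0), (2.5, 2.5)]
new_tuples = smote(minority, n_new=6, k=2)
for point in new_tuples:
    print([round(value, 3) for value in point])
```

Since every synthetic tuple is a convex combination of two existing minority tuples, it always lies inside the range already spanned by the minority class.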

5 Conclusion

In this paper, a novel PMSGD prediction model is proposed for diabetes classification that also addresses the problems of data imbalance, the curse of dimensionality and missing data values in diabetes datasets. The difficulty with imbalanced datasets is that most AI approaches neglect the minority class, which leads to inconsistent results. The proposed model therefore uses SMOTE to oversample the minority class in its pre-processing stage and uses correlation and GA to extract significant features. The feature selection simulation shows that Glucose, BMI, Diabetes Pedigree Function and Age are the significant features of individuals in the PIDD. The training and testing sets are formed on the basis of the outcome of feature selection. The training set is used to train the proposed PMSGD prediction model and the testing set is used to test its efficacy. The proposed model outperforms the existing models in terms of various metrics such as CA, CE, precision, sensitivity, FM and AUROC. The best outcome achieved by the proposed system in terms of CA, CE, precision, sensitivity, FM and AUROC is 82.1256%, 17.8744%, 0.8070, 0.8598, 0.8326 and 0.8511, respectively. In future work, the proposed model can be tested for automatic diabetes analysis and prediction with high precision. Testing its applicability to the diagnosis of other diseases is another research direction. Pruning the rule sets of the proposed PMSGD model is also an interesting direction for future research. Furthermore, nature-inspired algorithms such as PSO, ACO, grasshopper optimization, grey wolf optimization, the Jaya algorithm or fruit fly optimization may be investigated to increase the accuracy, reduce the dimensionality of the dataset and consequently reduce the time complexity.