Introduction

Diabetes was among the top ten causes of death in 2016. It caused 1.6 million deaths that year, up from fewer than 1,000,000 in 2000. HIV/AIDS was the seventh leading cause of death as shown in Fig. 1. The number of people with diabetes grew from 108 million in 1980 to 422 million in 2014, and the global prevalence of diabetes among adults aged over 18 rose from 4.7% in 1980 to 8.5% in 2014.

Fig. 1 Description of dataset

By 2040, diabetes is projected to affect 642 million people (1 in 10 people). In addition, 46.5% of people with diabetes were undiagnosed [1]. Because many deaths among diabetic patients are due to late diagnosis, it is important to develop strategies and procedures that aid early diagnosis and thereby reduce diabetes-related deaths.

Advanced information technology is needed to achieve state-of-the-art early diagnosis of diabetes, and data mining is an important area for this purpose. Data mining makes it possible to extract information from large data repositories and to discover previously unknown, hidden, yet interesting patterns. Such patterns can support medical diagnosis and decision-making [2].

Diabetes mellitus, often simply called diabetes, is one of the diseases that affect a very large human population. It affected more than 425 million people in 2017 [3]. In the same year, about 4 million people died of diabetes and associated complications. With 74 million people suffering from diabetes, India is recognized as the “World Capital for Diabetes”. If the disease is not taken seriously and no major steps are taken to diagnose and prevent it, an estimated 629 million people worldwide will be affected by diabetes by 2045 [4].

Diabetes is a condition of high blood glucose that arises when the body cannot make the required quantity of insulin or cannot effectively use the insulin it produces. The most common risk factors are obesity, urbanization, physical inactivity, unhealthy diet, aging, and a family history of diabetes. When diabetes is not correctly diagnosed or properly managed, it can cause many complications, such as cardiovascular problems, kidney disease, blindness, and neural complications such as stroke [5]. Early diagnosis, together with the recommended healthy daily lifestyle, is the most important factor in managing diabetes and its related complications effectively [6].

Literature Review


The following describes some of the various methods used on PIMA Indian Diabetes Datasets with their results.

Rohan Bansal et al. used a KNN classifier for diabetes diagnosis, with the attributes selected using the PSO technique. This method was reported to be 77 percent accurate [7].

A class-wise KNN (CKNN) methodology, an unconventional, class-specific variant of the KNN classification algorithm applied after normalizing the dataset during preprocessing, has been proposed for diabetes classification. The accuracy of this method is 78.16% [7].

Lin Li et al. proposed a technique known as weight-adjusted voting classification. This method achieved a predictive accuracy of 77 percent on the PIMA Indian diabetes dataset [8].

Priyadarshini et al. used modified extreme learning machines to determine, from the available data, whether or not a patient is diabetic. The authors draw comparative conclusions between neural networks and extreme learning machine classifiers.

Prema NS et al. [9] proposed using an ensemble technique on the normalized PIMA Indian diabetes dataset and reported an accuracy of 81%.

Iyer [10] proposed predicting diabetes with the Naïve Bayes algorithm and reported an accuracy of 79.56 percent. Tarun [11] used PCA and a support vector machine to classify diabetic patients. Experimental tests showed an accuracy of 93.66 percent, with room for further improvement. Kadhmi [12] suggested first applying a k-nearest neighbor algorithm to eliminate unwanted data and then using a decision tree (DT) to assign each data sample to its corresponding class. Han et al. [13] developed a diabetes prediction model based on the K-means algorithm. The model attained 95.42% accuracy [14].

In Ref. [15], k-means clustering was used to identify and remove outliers, a genetic algorithm and CFS were used to extract the relevant features, and k-nearest neighbor (KNN) was used to classify diabetic patients. Patil [16] proposed a hybrid prediction model that applied k-means to the original dataset and then used the C4.5 algorithm to build the classifier. The result was a classification accuracy of 92.38%. Anjali [17] proposed reducing the dimension of the extracted features with principal component analysis and using a neural network (NN) as the classification technique. The resulting accuracy was 92.2% [18].

Methodology

PIMA Indian Diabetes Dataset

The UCI Machine Learning Repository provides many datasets for research on and implementation of ML algorithms. Researchers, students, and educators regularly use it as a primary source of machine learning datasets. We took the PIMA Diabetes Dataset [15] for our study from this repository. The dataset contains the medical data of 768 patients.

There are eight attributes in each data point, and they are:

  • Number of times pregnant

  • Plasma glucose concentration

  • Diastolic blood pressure

  • Triceps skin fold thickness

  • Body mass index

  • 2-h serum insulin

  • Diabetes pedigree function

  • Age

The 9th attribute of each data point is the class variable. Its value is either 0 or 1, indicating a negative or positive diabetes outcome, respectively.

Data Cleaning

The data contain many missing values, which cause problems in the analysis: a model trained on the original dataset with these gaps will not give good results, so the missing values have to be handled. Several cleaning methods are available, such as replacing or deleting every row that has a missing value, but deletion would reduce the amount of training data, which we want to avoid. We therefore used the mean method: each missing value was replaced with the mean of the other values of the same attribute. The imputed entries are of the same kind as the observed ones, so the data can proceed through the rest of the pipeline [16].
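The mean method described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation; the function name `impute_with_mean`, the toy records, and the use of `None` as the missing-value marker are all assumptions made for the example.

```python
from statistics import mean

def impute_with_mean(rows, columns, missing=None):
    """Replace missing entries in the given columns with that column's mean.

    rows is a list of dicts (one per patient); missing is the marker used
    for an absent value. The mean is computed from the non-missing values
    of the same column.
    """
    cleaned = [dict(row) for row in rows]  # copy so the input is not modified
    for col in columns:
        observed = [row[col] for row in cleaned if row[col] is not missing]
        col_mean = mean(observed)
        for row in cleaned:
            if row[col] is missing:
                row[col] = col_mean
    return cleaned

# Hypothetical toy records; None marks a missing measurement.
patients = [
    {"glucose": 148, "bmi": 33.6},
    {"glucose": None, "bmi": 26.6},
    {"glucose": 183, "bmi": None},
    {"glucose": 89, "bmi": 28.1},
]
cleaned = impute_with_mean(patients, ["glucose", "bmi"])
```

No rows are dropped, so the number of training samples is preserved, which is the motivation given above for preferring imputation over deletion.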

Algorithm: Baseline

  • We normally split the data into training and test sets. The test set should be touched only at the very end, for the final performance assessment. We therefore split the training data further into training and validation sets, and use the validation set to tune the model [17].

  • The conventional train/test procedure suffers from high variance: changing the test set changes the prediction result. We use the k-fold validation method on our training and validation set to solve this problem [18].

  • We analyzed the data and then visualized it to understand it better. We plotted a pair plot and found that the data contained many outliers [19]. We investigated the distribution of each feature and checked its skewness and kurtosis. We followed this step with feature engineering, which includes the following.
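The k-fold idea in the second point can be illustrated with a short sketch. It uses only the standard library and is one possible way to produce the splits, not the authors' code; the helper name `k_fold_indices`, the seed, and the sample count are assumptions.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

# Each sample is used for validation exactly once across the k rounds.
splits = list(k_fold_indices(n_samples=20, k=5))
```

Averaging the validation score over the k rounds makes the estimate less sensitive to any single choice of held-out samples, which is the variance problem described above.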

Data Preprocessing

Preprocessing of numerical features differs between tree-based and non-tree-based models. Tree-based models usually do not depend on feature scaling, whereas non-tree-based models depend on it heavily. The most commonly used preprocessing steps are the MinMax scaler, which maps values to [0, 1], and the Standard scaler, which rescales to mean = 0 and std = 1. We then removed the outliers.
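The two scalers mentioned above reduce to simple formulas, sketched here with the standard library. The function names and the sample column are hypothetical; the standard scaler below uses the population standard deviation, and both functions assume the column is not constant (otherwise the denominator would be zero).

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standard_scale(values):
    """Shift and rescale values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

ages = [21, 30, 45, 60]  # hypothetical feature column
scaled_01 = min_max_scale(ages)
standardized = standard_scale(ages)
```

Distance-based and margin-based models such as KNN and SVC compare feature values directly, which is why they benefit from bringing all features onto a common scale.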

Feature Selection

Feature selection means choosing the variables, or features, on which the target variable depends most strongly; in our case the target is whether or not diabetes is present. The attributes in our data are selected automatically, retaining those most relevant to predicting the target variable [20].

All redundant and irrelevant features (columns) are deleted, as they can adversely affect prediction accuracy.
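The text does not say which selection criterion was used. As one simple illustration of ranking features by their dependency on the target, the sketch below scores each feature by its absolute Pearson correlation with the outcome. The criterion, the function names, and the toy data are assumptions made for the example, not the authors' method.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def rank_features(features, target):
    """Sort feature names by absolute correlation with the target, strongest first."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: glucose tracks the outcome, the noise column does not.
features = {
    "glucose": [85, 90, 150, 170, 160, 95],
    "noise": [3, 1, 2, 3, 1, 2],
}
outcome = [0, 0, 1, 1, 1, 0]
ranking = rank_features(features, outcome)
```

Features at the bottom of such a ranking are candidates for removal as irrelevant columns.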

Models and Chosen Hyperparameters

A. Logistic Regression (https://www.kaggle.com/pouryaayria/a-complete-ml-pipeline-tutorial-acu-86#5.1.Logistic-Regression)

  • C: Inverse of the regularization strength; the smaller the value, the stronger the regularization (float).

  • Regularization type: can be either “L2” or “L1”. Default is “L2”.

B. KNN

  • n_neighbors: Number of neighbors to use by default for kneighbors queries.

C. SVC (https://www.kaggle.com/pouryaayria/a-complete-ml-pipeline-tutorial-acu-86#5.3.-SVC)

  • C: The penalty parameter C of the error term.

  • Kernel: Kernel type; can be linear, poly, rbf or sigmoid.

D. Decision Tree (https://www.kaggle.com/pouryaayria/a-complete-ml-pipeline-tutorial-acu-86#5.4.-Decision-Tree)

  • max_depth: Maximum depth of the tree (integer).

  • row_subsample: Proportion of observations to consider (float).

  • max_features: Proportion of columns (features) to consider in each level (float).

E. AdaBoostClassifier (https://www.kaggle.com/pouryaayria/a-complete-ml-pipeline-tutorial-acu-86#5.5-AdaBoostClassifier)

  • learning_rate: Learning rate shrinks the contribution of each classifier by learning_rate.

  • n_estimators: Number of trees to build.

F. GradientBoosting

  • learning_rate: Learning rate shrinks the contribution of each classifier by learning_rate.

  • n_estimators: Number of trees to build.
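For orientation, the six models and the hyperparameters named above map onto scikit-learn constructors roughly as follows. This is a sketch under stated assumptions: the values shown are placeholders (mostly library defaults), not the values chosen in this study, and the dictionary keys are hypothetical labels.

```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder hyperparameter values, not the tuned values from this study.
models = {
    # In scikit-learn, C is the inverse regularization strength.
    "LR": LogisticRegression(C=1.0, penalty="l2", max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    # probability=True lets the SVC output class probabilities (needed for soft voting).
    "SVC": SVC(C=1.0, kernel="rbf", probability=True),
    # row_subsample has no direct DecisionTreeClassifier argument, so it is omitted here.
    "DT": DecisionTreeClassifier(max_depth=4, max_features=None, random_state=0),
    "AdaBoost": AdaBoostClassifier(learning_rate=1.0, n_estimators=50, random_state=0),
    "GB": GradientBoostingClassifier(learning_rate=0.1, n_estimators=100, random_state=0),
}
```

Keeping the models in one dictionary makes it straightforward to loop over them with the same cross-validation procedure and compare accuracies.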

Ensemble Methods

Ensembling is a machine learning technique that combines multiple machine learning models into one optimal predictive model. Its aim is to reduce variance or bias, or to improve predictions [20]. This approach makes it possible to improve predictive performance compared with a single model. There are various ensembling methods, such as bagging, boosting, adaboosting, stacking, voting, and averaging. We applied a voting-based ensembling method to the PIMA Indian diabetes dataset. The ensemble vote classifier is a meta-classifier that combines similar or conceptually different machine learning classifiers for classification through majority or plurality voting.

Voting Classifier Using the Python Library Scikit-learn

A voting classifier is an ML model that trains on a collection of various models and predicts the output class that receives the highest probability.

We pass the findings of each classifier to the voting classifier, which combines them and predicts the output class with the majority of the votes. The idea is that, instead of creating separate dedicated models and calculating the accuracy of each of them, we create a single model that trains all the specified machine learning models [21]; this model predicts the output based on the cumulative majority of their votes for each output class (Figs. 2, 3, 4, 5, 6).
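A minimal sketch of such a voting classifier with scikit-learn's `VotingClassifier` is shown below. It is illustrative only: the data are a synthetic stand-in generated with `make_classification` (not the PIMA dataset), and the choice of three base models and their settings is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 768 rows and 8 features (same shape as the dataset).
X, y = make_classification(n_samples=768, n_features=8, random_state=0)

voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("dt", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ],
    voting="soft",  # average predicted probabilities; "hard" counts label votes
)

# Tenfold cross-validated accuracy of the combined model.
scores = cross_val_score(voter, X, y, cv=10, scoring="accuracy")
mean_accuracy = scores.mean()
```

Because the ensemble is itself an estimator, it can be evaluated with the same cross-validation call as any single model.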

Fig. 2 Histogram of features

Fig. 3 Data cleaning

Fig. 4 The blueprint of algorithm

Fig. 5 Cross-validation

Fig. 6 Flow of ensemble method

Two Types of Votes are Supported by Voting Classifier

Hard voting: In hard voting, the predicted output class is the class that receives the most votes, i.e., the class predicted most often by the individual classifiers. Suppose three classifiers predict the output classes (A, A, B); the majority predicted A, so A is the final prediction.
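The majority rule in this example can be written in a few lines. The helper name `hard_vote` is hypothetical, and ties are resolved here in favor of the label encountered first, which is an arbitrary choice for the sketch.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# Three classifiers predict (A, A, B); the majority label wins.
winner = hard_vote(["A", "A", "B"])
```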

Soft voting: In soft voting, the prediction is based on the average probability assigned to each class. Suppose that, for some input, three models give probabilities of (0.40, 0.57, 0.63) for class A and (0.30, 0.42, 0.50) for class B. The average is 0.5333 for class A and 0.4067 for class B, so class A is the winner.

In soft voting, the class label is predicted from the predicted probabilities p of the classifiers [22]:

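$$\hat{y} = \arg\max_{i} \sum_{j=1}^{m} w_{j} p_{ij},$$

where \(w_{j}\) is the weight that can be assigned to the jth classifier, \(p_{ij}\) is the probability that the jth classifier predicts for class i, and m is the number of classifiers.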

Assuming, as in our figure, a binary classification task with class labels i ∈ {0, 1}, our ensemble could make the following predictions:

$$C_{1}(x) \to [0.8, 0.2]$$
$$C_{2}(x) \to [0.7, 0.3]$$
$$C_{3}(x) \to [0.3, 0.7]$$

Using uniform weights, we compute the average probabilities:

$$p(i_{0} \mid x) = (0.8 + 0.7 + 0.3)/3 = 0.6$$

$$p(i_{1} \mid x) = (0.2 + 0.3 + 0.7)/3 = 0.4$$

$$\hat{y} = \arg\max_{i} \left[ p(i_{0} \mid x), p(i_{1} \mid x) \right] = 0$$

[12]
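The worked example above can be reproduced with a short sketch of weighted soft voting. The function name `soft_vote` is hypothetical. The sketch divides by the total weight so that the reported values are averages; this normalization does not change which class attains the maximum.

```python
def soft_vote(probabilities, weights=None):
    """Return (predicted class index, weighted average probability per class).

    probabilities[j][i] is the probability classifier j assigns to class i.
    """
    m = len(probabilities)
    weights = weights or [1.0] * m  # uniform weights by default
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[i] for w, p in zip(weights, probabilities)) / total
        for i in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda i: averaged[i])
    return predicted, averaged

# The three classifiers from the example, with uniform weights.
label, averaged = soft_vote([[0.8, 0.2], [0.7, 0.3], [0.3, 0.7]])
```

With uniform weights this gives class 0 with averages of about 0.6 and 0.4. Passing non-uniform weights, for example giving the third classifier three times the weight of the others, can change the winning class.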

Result

We applied different classification techniques to the PIMA Indian diabetes dataset. The data were divided into 30% for testing and 70% for training before being passed to the classifiers. The accuracy of the various models under cross-validation is shown in Table 1, and the comparative analysis is shown in Fig. 1 as well.

Table 1 Various models with accuracy

Conclusion

Diabetes prediction was performed using various machine learning models and classifiers. We also applied ensemble voting over a group of classifiers to the PIMA Indian diabetes dataset and compared its accuracy with that of the individual classification algorithms. The data were split into 30% testing and 70% training, and tenfold cross-validation was used. Logistic regression performed surprisingly well, with 84.3% accuracy, and the ensemble voting classifier with default soft voting reached an accuracy of 82.8%.