Timely Prediction of Diabetes by Means of Machine Learning Practices

Introduction

Diabetes was among the top ten causes of death in 2016. It killed 1.6 million people that year, up from fewer than 1 million in 2000, and was the seventh leading cause of death, as shown in Fig. 1. The number of people with diabetes grew from 108 million in 1980 to 422 million in 2014, and the global prevalence of diabetes among adults over 18 years of age rose from 4.7% in 1980 to 8.5% in 2014.

By 2040, diabetes is projected to affect 642 million people (1 in 10 people). In addition, 46.5% of people with diabetes have not been diagnosed [1]. Because many deaths among diabetic patients are due to late diagnosis, it is important to develop strategies and procedures that aid early diagnosis and thereby reduce diabetes-related deaths.

Advanced information technology is needed to achieve state-of-the-art early diagnosis of diabetes, and data mining is an important area for this purpose. Data mining makes it possible to extract previously unknown, hidden, yet interesting patterns from large data repositories. Such patterns can support medical diagnosis and decision-making [2].

Diabetes mellitus, often simply called diabetes, is one of the diseases that affect a very large human population. It affected more than 425 million people in 2017 [3]. In the same year, about 4 million people died of diabetes and associated complications. With 74 million people suffering from diabetes, India is recognized as the “World Capital for Diabetes”. If the disease is not taken seriously and no major steps are taken to diagnose and prevent it, an estimated 629 million people worldwide will be affected by diabetes by 2045 [4].

Diabetes is a condition of high blood glucose that arises when the body cannot make the required quantity of insulin or cannot effectively use the insulin it produces. The most common risk factors are obesity, urbanization, physical inactivity, unhealthy diet, aging and a family history of diabetes. When diabetes is not correctly diagnosed or properly managed, it can cause many complications, such as cardiovascular problems, kidney disease, blindness, and neural complications such as stroke [5]. Early diagnosis, together with a recommended healthy daily lifestyle, is the most important factor in effectively managing diabetes and its related complications [6].

Literature Review

The following describes some of the methods that have been applied to the PIMA Indian Diabetes Dataset, together with their results.

Rohan Bansal et al. used a KNN classifier for diabetes diagnosis, with the attributes selected using particle swarm optimization (PSO). This method achieved an accuracy of 77 percent [7].

Another approach proposes a class-wise KNN (CKNN) methodology for diabetes classification, i.e., a class-specific variant of the KNN classification algorithm applied after the dataset has been normalized during preprocessing. The accuracy of this approach is 78.16% [7].

Lin Li et al. proposed a technique known as weight-adjusted voting classification. This method reached a prediction accuracy of 77 percent on the PIMA Indian diabetes dataset [8].

Priyadarshini et al. used a modified extreme learning machine to determine from the available data whether or not a patient is diabetic. The authors drew comparative conclusions between neural networks and extreme learning machine classifiers.

Prema NS et al. [9] proposed using an ensemble technique on the normalized PIMA Indian diabetes dataset and obtained an accuracy of 81%.

Iyer [10] suggested that diabetes be predicted with the Naïve Bayes algorithm; the study reported an accuracy of 79.56 percent. For the classification of diabetic patients, Tarun [11] used PCA and a support vector machine. Experimental tests showed an accuracy of 93.66 percent, with room for further improvement. Kadhmi [12] suggested that, after applying a K-nearest neighbor algorithm to eliminate unwanted data, a decision tree (DT) be used to assign every data sample to its corresponding class. Han et al. [13] developed a diabetes prediction model based on the K-means algorithm. The model attained an accuracy of 95.42% [14].

In Ref. [15], k-means clustering was used to identify and remove outliers, a genetic algorithm and CFS were used for relevant feature extraction, and k-nearest neighbor (KNN) was used to classify diabetic patients. Patil [16] proposed a hybrid prediction model that applied k-means to the original dataset and then used the C4.5 algorithm to construct the classifier. The resulting classification accuracy was 92.38%. Anjali [17] proposed reducing the dimension of the extracted features with principal component analysis and using a neural network (NN) as the classification technique. The resulting accuracy was 92.2% [18].

Methodology

PIMA Indian Diabetes Dataset

The UCI Machine Learning Repository provides a collection of datasets for the research and implementation of ML algorithms. It has been used very regularly as a primary source of machine learning datasets by researchers, students and educators. We took the PIMA Indian Diabetes Dataset [15] for our study from this repository. The dataset is made up of the medical data of 768 patients.

There are eight attributes in each data point, and they are:

Number of times pregnant
Plasma glucose concentration
Diastolic blood pressure
Triceps skin fold thickness
Body mass index
2-h serum insulin
Diabetes pedigree function
Age

The 9th attribute of each data point is the class variable. The outcome is either 0 or 1, indicating a negative or positive diabetes diagnosis, respectively.

Data Cleaning

The data were found to contain many missing values, which create problems in the analysis: a model trained on the original dataset with these values missing will not give good results, so the missing values have to be handled. Several cleaning methods are available, such as replacing or deleting every row that contains a missing value, but deletion would reduce the amount of training data, which we want to avoid. We therefore used the mean method: each missing value was replaced with the mean of the observed values of the same attribute, which yields values of the same kind and lets processing continue through the pipeline [16].
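As an illustration of the mean method, the sketch below fills each missing entry with the mean of the observed values in the same column. It assumes that missing measurements are encoded as NaN, which the text does not state; the function name `impute_with_column_mean` and the toy values are hypothetical.

```python
import numpy as np

def impute_with_column_mean(X):
    """Replace NaN entries with the mean of the observed values in the same column."""
    X = np.asarray(X, dtype=float).copy()
    col_means = np.nanmean(X, axis=0)       # per-column mean, ignoring NaN
    rows, cols = np.where(np.isnan(X))      # positions of the missing entries
    X[rows, cols] = col_means[cols]
    return X

# Hypothetical toy values (two attributes); np.nan marks a missing measurement.
toy = np.array([[148.0, 33.6],
                [np.nan, 26.6],
                [183.0, np.nan],
                [89.0, 28.1]])
cleaned = impute_with_column_mean(toy)
print(cleaned)
```

No rows are dropped, so the number of training samples is unchanged, which is the motivation given above for preferring this method over row deletion.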

Algorithm: Baseline

Data are normally divided into training and test sets. The test set should be touched only at the very end, for the final performance assessment. The training data can therefore be split further into a training set and a validation set, and we use the validation set to tune the model [17].
The conventional train/test procedure suffers from high variance: changing the test set changes the prediction result. To solve this problem, we use k-fold validation on our training and validation sets [18].
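The k-fold idea can be sketched in a few lines: the sample indices are cut into k folds, and each fold serves once as the validation set while the remaining folds form the training set. This is a minimal illustration without shuffling or stratification; the helper name `k_fold_indices` is hypothetical and the sizes are chosen only for readability.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k (train, validation) index pairs."""
    indices = list(range(n_samples))
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, val))
        start += size
    return folds

folds = k_fold_indices(n_samples=10, k=5)
for train_idx, val_idx in folds:
    print("train:", train_idx, "validation:", val_idx)
```

Averaging the validation score over all k folds gives an estimate that depends less on any single choice of held-out samples.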
We analyzed the data and then visualized them to understand them better; a pair plot revealed many outliers in the data [19]. We investigated the distribution of each feature and checked its skewness and kurtosis. We followed this step with feature engineering, which includes the following.

Data Preprocessing

Preprocessing of numerical features differs between tree-based and non-tree-based models. Tree-based models usually do not depend on scaling, whereas non-tree-based models depend on it heavily. The most commonly used preprocessing methods are the MinMax scaler, which maps values to [0, 1], and the Standard scaler, which rescales to mean = 0 and std = 1. We then removed the outliers.
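The two scalers named above amount to simple column-wise formulas, sketched here with NumPy. The function names and toy values are hypothetical, and the sketch assumes that no column is constant (otherwise the denominators would be zero).

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column linearly to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    return (X - col_min) / (col_max - col_min)

def standard_scale(X):
    """Rescale each column to mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Hypothetical toy values for two numerical attributes.
toy = np.array([[50.0, 20.0],
                [100.0, 30.0],
                [150.0, 40.0]])
print(min_max_scale(toy))
print(standard_scale(toy))
```

In practice the scaling statistics would be computed on the training folds only and then applied to the validation or test data, so that no information leaks from the held-out samples.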

Feature Selection

Feature selection means selecting those variables or features on which the target variable, in our case whether or not diabetes is present, depends most strongly. In our data, the attributes are selected automatically by the feature selection step, which retains those most relevant to predicting the target variable.

Feature selection methods help build the predictive model for our task by letting us choose the features on which the target class depends most strongly [20].

All redundant and irrelevant features (columns) are deleted, as they can have an adverse effect on the prediction accuracy.
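The text does not name the selection criterion that was used. As one possible stand-in, the sketch below ranks features by the absolute Pearson correlation between each column and the class label; the function name, the feature names and the toy data are all hypothetical.

```python
import numpy as np

def rank_features_by_correlation(X, y, names):
    """Return (name, |Pearson r|) pairs sorted from most to least correlated with y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: the first column tracks the label, the second does not.
X = np.array([[90, 3], [100, 1], [150, 2], [170, 3], [95, 1], [160, 2]])
y = np.array([0, 0, 1, 1, 0, 1])
ranking = rank_features_by_correlation(X, y, ["glucose", "noise"])
for name, score in ranking:
    print(name, round(score, 3))
```

Features at the bottom of such a ranking are candidates for removal; a correlation filter only captures linear, one-feature-at-a-time dependence, so other criteria may select differently.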

Models and Chosen Hyperparameters

A. Logistic Regression (https://www.kaggle.com/pouryaayria/a-complete-ml-pipeline-tutorial-acu-86#5.1.Logistic-Regression)
• C: Regularization value; in scikit-learn, C is the inverse of the regularization strength, so smaller values give stronger regularization (double).
• Regularization type: can be either “L2” or “L1”. Default is “L2”.
B. KNN
• n_neighbors: Number of neighbors to use by default for kneighbors queries.
C. SVC (https://www.kaggle.com/pouryaayria/a-complete-ml-pipeline-tutorial-acu-86#5.3.-SVC)
• C: The penalty parameter C of the error term.
• Kernel: Kernel type; can be linear, poly, rbf or sigmoid.
D. Decision Tree (https://www.kaggle.com/pouryaayria/a-complete-ml-pipeline-tutorial-acu-86#5.4.-Decision-Tree)
• max_depth: Maximum depth of the tree (double).
• row_subsample: Proportion of observations to consider (double).
• max_features: Proportion of columns (features) to consider in each level (double).
E. AdaBoostClassifier (https://www.kaggle.com/pouryaayria/a-complete-ml-pipeline-tutorial-acu-86#5.5-AdaBoostClassifier)
• learning_rate: Learning rate shrinks the contribution of each classifier by learning_rate.
• n_estimators: Number of trees to build.
F. GradientBoosting
• learning_rate: Learning rate shrinks the contribution of each classifier by learning_rate.
• n_estimators: Number of trees to build.
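For orientation, the six model families listed above could be instantiated in scikit-learn roughly as follows. All hyperparameter values are illustrative placeholders, not the values tuned in this study, and `row_subsample` is left out because scikit-learn's `DecisionTreeClassifier` has no parameter of that name.

```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder hyperparameter values, chosen only to show where each setting goes.
models = {
    "LR": LogisticRegression(C=1.0, penalty="l2", max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVC": SVC(C=1.0, kernel="rbf"),
    "DT": DecisionTreeClassifier(max_depth=4, max_features=0.8, random_state=0),
    "ADA": AdaBoostClassifier(learning_rate=0.1, n_estimators=100, random_state=0),
    "GB": GradientBoostingClassifier(learning_rate=0.1, n_estimators=100, random_state=0),
}

for name, model in models.items():
    print(name, type(model).__name__)
```

Keeping the models in one dictionary makes it straightforward to loop over them when tuning or comparing accuracies.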

Ensemble Methods

An ensemble is a machine learning technique that combines multiple machine learning models into one optimal predictive model in order to reduce variance, reduce bias or improve predictions [20]. This approach makes it possible to improve predictive performance compared with a single model. There are various ensembling methods, such as bagging, boosting, AdaBoost, stacking, voting and averaging. We applied a voting-based ensembling method to the PIMA Indian diabetes dataset. The ensemble vote classifier is a meta-classifier that combines similar or conceptually different machine learning classifiers for classification through majority or plurality voting.

Voting Classifier Using the Python Library scikit-learn

A voting classifier is an ML model that is built on a collection of various models and predicts the output class that those models favor most, either by number of votes or by probability.

We pass the predictions of each classifier to the voting classifier, which aggregates them and predicts the output class that receives the majority of the votes. The idea is that, instead of creating separate dedicated models and calculating the accuracy of each of them, we create a single model that trains all the specified machine learning models [21] and predicts the output based on their combined majority vote for each output class (Figs. 2, 3, 4, 5, 6).
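A minimal sketch of this idea with scikit-learn's `VotingClassifier` is given below. It uses synthetic stand-in data with eight features rather than the PIMA dataset, and the choice of three base estimators and their settings is an assumption made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 8 features; this is NOT the PIMA dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("dt", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ],
    voting="hard",  # majority vote over the predicted class labels
)
voter.fit(X, y)                      # fits a clone of every listed estimator
predictions = voter.predict(X[:5])   # one majority-vote label per sample
print(predictions)
```

Passing `voting="soft"` instead averages the predicted class probabilities, which requires every base estimator to provide `predict_proba`.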

Two Types of Votes are Supported by Voting Classifier

Hard voting: In hard voting, the predicted output class is the class that receives the most votes, i.e., the class predicted most often by the individual classifiers. Suppose three classifiers predict the output classes (A, A, B); the majority predicted A, so A is the final prediction.

Soft voting: In soft voting, the prediction is based on the average probability assigned to each class. Suppose that, for some input, three models give the probabilities (0.40, 0.57, 0.63) for class A and (0.30, 0.42, 0.50) for class B. The average is 0.5333 for class A and 0.4067 for class B, so the winner is clearly class A.
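The two worked examples above can be reproduced with a few lines of plain Python. The helper names `hard_vote` and `soft_vote` are hypothetical; the inputs are the numbers from the text.

```python
from collections import Counter

def hard_vote(labels):
    """Return the label predicted by the most classifiers."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(class_probs):
    """Return the class with the highest average probability, and all averages."""
    averages = {c: sum(p) / len(p) for c, p in class_probs.items()}
    return max(averages, key=averages.get), averages

print(hard_vote(["A", "A", "B"]))  # A

winner, averages = soft_vote({"A": [0.40, 0.57, 0.63],
                              "B": [0.30, 0.42, 0.50]})
print(winner, {c: round(v, 4) for c, v in averages.items()})
# A {'A': 0.5333, 'B': 0.4067}
```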

In soft voting, the class label is predicted from the probabilities p predicted by the individual classifiers [22].

$$\hat{y} = \arg\max_{i} \sum_{j=1}^{m} w_{j} p_{ij},$$

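where $w_j$ is the weight that can be assigned to the jth classifier, $p_{ij}$ is the probability that the jth classifier predicts for class i, and m is the number of classifiers.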

Assume, as in our figure, a binary classification task with class labels i ∈ {0, 1}; our ensemble could make the following predictions:

$$C_{1}(x) \to [0.8, 0.2]$$

$$C_{2}(x) \to [0.7, 0.3]$$

$$C_{3}(x) \to [0.3, 0.7]$$

Using uniform weights, we compute the average probabilities:

$$p(i_{0} \mid x) = (0.8 + 0.7 + 0.3)/3 = 0.6$$

$$p(i_{1} \mid x) = (0.2 + 0.3 + 0.7)/3 = 0.4$$

$$\hat{y} = \arg\max_{i} \left[ p(i_{0} \mid x), p(i_{1} \mid x) \right] = 0$$

[12]
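The weighted rule can be checked numerically with NumPy. In the sketch below, `np.average` divides by the sum of the weights, which rescales the weighted sum but does not change the arg max for positive weights. The function name `weighted_soft_vote` and the non-uniform weights (1, 1, 3) are hypothetical and serve only to show how weighting can change the outcome.

```python
import numpy as np

def weighted_soft_vote(probas, weights=None):
    """probas has shape (m classifiers, n classes); returns (arg max class, averages)."""
    probas = np.asarray(probas, dtype=float)
    avg = np.average(probas, axis=0, weights=weights)  # weights=None means uniform
    return int(np.argmax(avg)), avg

# Predicted probabilities of the three classifiers from the example above.
probas = [[0.8, 0.2],
          [0.7, 0.3],
          [0.3, 0.7]]

label, avg = weighted_soft_vote(probas)
print(label, avg)            # uniform weights: class 0, averages about [0.6, 0.4]

label_w, avg_w = weighted_soft_vote(probas, weights=[1, 1, 3])
print(label_w, avg_w)        # hypothetical weights favoring the third classifier: class 1
```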

Result

We applied different classification techniques to the PIMA Indian diabetes dataset. The data were passed to the classifiers after being divided into 70% training and 30% testing data. The accuracy of the various models, obtained using cross-validation, is shown in Table 1, and the comparative analysis is also shown in Fig. 1.
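One way to arrange such an evaluation in scikit-learn is sketched below: a 70/30 split, tenfold cross-validation on the training portion, and a final score on the held-out 30%. It runs on synthetic stand-in data, not the PIMA dataset, and applying cross-validation to the training portion only is an assumption, since the text does not spell out how the split and the cross-validation were combined.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data with 8 features; this is NOT the PIMA dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 70% training, 30% testing, keeping the class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)

# Tenfold cross-validated accuracy, computed on the training portion only.
cv_scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")
print("mean CV accuracy:", cv_scores.mean())

# Final check on the held-out 30%.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

The accuracies printed here come from synthetic data and say nothing about the values reported in Table 1.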

Table 1 Various models with accuracy

Conclusion

Diabetes prediction was carried out using various machine learning models and classifiers. We also applied ensemble voting with a group of classifiers to the PIMA Indian diabetes dataset and compared its accuracy with that of the individual classification algorithms. We used tenfold cross-validation on the dataset, with the data divided into 30% testing and 70% training. Logistic regression performed surprisingly well, with an accuracy of 84.3%, and the ensemble voting classifier with default soft voting reached an accuracy of 82.8%.

Availability of Data and Materials

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

References

http://www.who.int/news-room/fact-sheets/detail/diabetes Accessed 27 July 2018
IDF diabetes atlas-8th edition (2017) International Diabetes Federation, 2017. Available online https://diabetesatlas.org/. Accessed 15 Dec 2018
https://www.diabetesdaily.com/learn-about-diabetes/what-is-diabetes/how-many-people-have-diabetes/
Jhaldiyal T, Mishra PK (2014) Analysis and prediction of diabetes mellitus using PCA, REP and SVM. Int J Eng Technol Res (IJETR) 2(8) ISSN: 2321-0869.
Prabhu P et al (2011) Improving the performance of K-means clustering for high dimensional data set. Int J Comput Sci Eng 3(6):2317
Khandegar A, Pawar K (2017) Diagnosis of diabetes mellitus using PCA, neural network and cultural algorithm. Int J Digital Appl Contemp Res 5(6)
Kaur N, Sharma M (2017) Brain tumor detection using self-adaptive K-means clustering. In: 2017 International conference on energy, communication, data analytics and soft computing (ICECDS), pp 1861–1865. IEEE
Motka R, Parmarl V, Kumar B, Verma AR (2013) Diabetes mellitus forecast using different data mining techniques. In: IEEE 4th international conference on computer and communication technology (ICCCT), pp 99–103. IEEE
World Health Organization (2016) Global report on diabetes. WHO Library Cataloguing-in-Publication Data
Pandey BK, Mane D, Nassa VKK, Pandey D, Dutta S, Ventayen RJM, Rastogi R (2021) Secure text extraction from complex degraded images by applying steganography and deep learning. Multidisciplinary approach to modern digital steganography. IGI Global, pp 146–163
Kaur SP, Sharma M (2015) Radially optimized zone-divided energy-aware wireless sensor networks (WSN) protocol using BA (bat algorithm). IETE J Res 61(2):170–179
Madhumathy P, Pandey D (2022) Deep learning based photo acoustic imaging for non-invasive imaging. Multimed Tools Appl 81(5):7501–7518
PIMA Indian diabetes dataset, An open dataset (2019) UCI machine learning repository. Available online http://ftp.ics.uci.edu/pub/machine-learningdatabases/pima-indians-diabetes/. Accessed 11 Jan 2019
Bansal R, Kumar S, Mahajan A (2017) Diagnosis of diabetes mellitus using PSO and KNN classifier. In: 2017 international conference on computing and communication technologies for smart nation (IC3TSN), 2017, pp 32–38
Lelisho ME, Pandey D, Alemu BD, Pandey BK, Tareke SA (2023) The negative impact of social media during COVID-19 pandemic. Trends Psychol 31(1):123–142
Li L (2014) Diagnosis of diabetes using a weight-adjusted voting approach. In: 2014 IEEE international conference on bioinformatics and bioengineering, pp 320–324
Pandey BK, Pandey D, Wairya S, Agarwal G, Dadeech P, Dogiwal SR, Pramanik S (2022) Application of integrated steganography and image compressing techniques for confidential information transmission. Cyber Secur Netw Secur 169–191
Kotsiantis SB, Kanellopoulos D, Pintelas PE (2007) Data preprocessing for supervised leaning. World Acad Sci Eng Technol Int J Comput Electr Autom Control Inf Eng 1(12):4091–4096
Kavakiotis I, Tsave O, Salifoglou A, Maglaveras N (2017) Machine learning and data mining methods in diabetes research. Comput Struct Biotechnol J 15:104–116
Ali R et al (2014) Prediction of diabetes mellitus based on boosting ensemble modeling. In: International conference on ubiquitous computing and ambient intelligence, part of the lecture notes in computer science book series. LNCS. vol 8867. Springer
Sharma M, Sharma B, Gupta AK, Pandey D (2023) Recent developments of image processing to improve explosive detection methodologies and spectroscopic imaging techniques for explosive and drug detection. Multimed Tool Appl 82(5):6849–6865
Goyal S, Pandey D, Singh H, Singh J, Kakkar R, Srinivasu PN (2022) Mathematical modelling for prediction of spread of corona virus and artificial intelligence/machine learning-based technique to detect COVID-19 via smartphone sensors. Int J Mode Identif Control 41(1–2):43–52

Acknowledgements

I would like to thank the DTE.

Funding

No funding.

Author information

Authors and Affiliations

Amity University Tashkent, Tashkent, Uzbekistan
Rajan Prasad Tripathi
Chandigarh Group of Colleges, Chandigarh, Landran, India
Manvinder Sharma & Anuj Kumar Gupta
Department of Technical Education, IET, Dr. A.P.J. Abdul Kalam Technical University, Lucknow, Uttar Pradesh, India
Digvijay Pandey
Department of Information Technology, College of Technology, Govind Ballabh Pant University of Agriculture and Technology, Pantnagar, Uttrakhand, India
Binay Kumar Pandey
SRM Medical College, Kattankulathur, Tamil Nadu, India
Aakifa Shahul
Tbilisi State Medical University, Tbilisi, Georgia
A. S. Hovan George

Authors

Rajan Prasad Tripathi
Manvinder Sharma
Anuj Kumar Gupta
Digvijay Pandey
Binay Kumar Pandey
Aakifa Shahul
A. S. Hovan George

Contributions

All authors contributed.

Corresponding author

Correspondence to Digvijay Pandey.

Ethics declarations

Conflict of interest

There is no conflict of interest.

Consent for Publication

“Not applicable”.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tripathi, R.P., Sharma, M., Gupta, A.K. et al. Timely Prediction of Diabetes by Means of Machine Learning Practices. Augment Hum Res 8, 1 (2023). https://doi.org/10.1007/s41133-023-00062-4

Received: 24 February 2022
Revised: 24 April 2023
Accepted: 06 November 2023
Published: 09 December 2023
DOI: https://doi.org/10.1007/s41133-023-00062-4
