Early Prediction of Diabetes Using Feature Selection and Machine Learning Algorithms

Abdollahi, Jafar; Aref, Solmaz

doi:10.1007/s42979-023-02545-y

Early Prediction of Diabetes Using Feature Selection and Machine Learning Algorithms

Original Research
Published: 20 January 2024

Volume 5, article number 217, (2024)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

SN Computer Science Aims and scope Submit manuscript

Early Prediction of Diabetes Using Feature Selection and Machine Learning Algorithms

Download PDF

271 Accesses
1 Citation
Explore all metrics

Abstract

Diabetes has become one of the most common diseases in middle- and low-income countries. Machine learning (ML) and data mining techniques have recently been used to predict diabetes with a high success rate. As a result, medical professionals seek a dependable method for predicting diagnosis. Of course, the feature selection process may be considered a global combinatorial optimization problem in machine learning. The number of features is reduced, irrelevant, noisy, redundant data are removed, and classification accuracy is acceptable. This work uses particle swarm optimization (PSO) to implement feature selection, followed by performance comparison. After that, three medical datasets are used to compare the performance of several machine learning methods. Standard approaches are used to determine the optimum technique for the three datasets. The best results for three datasets are reported for each scheme. The primary goal is to assess the validity of each algorithm's data classification in terms of efficiency and effectiveness in terms of accuracy, sensitivity, and specificity. Decision Tree, Random Forest, and Naïve Bayes deliver the highest accuracy with the lowest mistake rate, according to the findings of the experiments. Machine learning may classify and determine which instances should be sent to medical for further evaluation and treatment with high accuracy. Using such an algorithm on a global scale could help minimize the number of people diagnosed with diabetes.

Prediction of Diabetes Using Various Feature Selection and Machine Learning Paradigms

Performance Analysis of Machine Learning Based On Optimized Feature Selection for Type II Diabetes Mellitus

Article 27 March 2024

Improving the Accuracy of Diabetes Diagnosis Applications through a Hybrid Feature Selection Algorithm

Article 27 March 2021

Discover the latest articles, news and stories from top researchers in related subjects.

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

Diabetes is a prominent cause of death across the world [1]. Diabetes can harm one's health if discovered too late [2]. Individuals/families, healthcare institutions, and society bear tremendous financial costs [3]. Furthermore, nearly 30 million Indians have diabetes, with many more at risk [4]. Most people get chronic illnesses due to their lifestyle, eating choices, and lack of physical exercise [5]. Predicting future health outcomes is extremely desired, especially for pre-diabetic patients implementing preventative and intervention measures [6]. Diabetes remission is a hotly disputed concept in contemporary endocrinology [7].

Medical practitioners are looking for an effective diabetes prediction system. Different machine learning approaches can examine data from various angles and synthesize it into meaningful information. If specific data mining techniques are applied to large volumes of data, they will be able to provide us with relevant knowledge [8].

Data mining techniques aid in the machine learning process and are widely used in various critical applications [9]. Many data processing methodologies, decision support systems, and systems that probe deeper into the diseases were discovered in the current literature [10,11,12,13,14,15,16,17]. Several machine learning approaches are used in clinical settings to forecast illness, and they have been demonstrated to be more accurate than the traditional methods for diagnosis [18]. As a result, modern medicine has encountered issues acquiring vast amounts of data, analyzing it, and applying the resulting knowledge to solving complicated clinical problems; AI capabilities are required for these goals [19].

Given the importance of diabetes care and the assumption that AI applications for diabetes care are useful tools, and the scarcity of studies examining the use of AI for diabetes care, this study examined AI algorithms and techniques for diabetes care, focusing on machine learning methods. Diabetes outcomes are classified and diagnosed by employing a type of algorithm. This work compares the performance of nine classifiers following Feature Selection using Particle Swarm Optimization (PSO). The most prominent data mining algorithms in the top 10 data mining algorithms research community are LR, NB, C 4.5, DT, RF, SVM, GB, SGDA, and KNN. Our goal is to evaluate the efficiency and effectiveness of these algorithms in terms of accuracy, sensitivity, specificity, and precision.

A significant amount of vital and sensitive healthcare data have been produced due to the tremendous breakthroughs in biotechnology and public healthcare infrastructures. Many intriguing patterns are discovered through intelligent data analysis tools for the early and onset diagnosis and prevention of various fatal diseases. An early diabetes diagnosis can result in more effective therapy. Data mining techniques are widely used for the prediction of disease at an early stage. In this study, diabetes is predicted using significant attributes, the relationship between the various features is also characterized, and a comparison of the proposed approach with the current state-of-the-art techniques is also carried out, demonstrating the proposed method's adaptability in many public applications in healthcare. Moreover, the main contribution of this article is as follows:

Diabetes prediction models using machine learning performed well.
A comparison of the findings from the suggested technique with the most pertinent studies carried out following the prior literature.
We investigate the benefits of feature selection (PSO-ML) for prediction and feature selection.

The article's structure is as follows: In “Related works” Section summarizes related work, “Materials and methods” Section proposes a method, and “Results” Section gives experimental data, including performance evaluation and comparison. The article's “Conclusion and Future Work” Section are presented in the final section.

Related Work

Feature selection (FS) is indeed a tough, challenging, and demanding task due to the large exploration space. It moderates and lessens the number of features. It also eliminates insignificant, noisy, superfluous, repetitive, and duplicate data, and provides reasonably adequate classification accuracy. Present feature selection approaches do face the difficulties like stagnation in local optima, delayed convergence and high computational cost. In machine learning, particle swarm optimization (PSO) is an evolutionary computation procedure which is computationally less costly and can converge quicker than other existing approaches. PSO can be effectively used in various areas, like medical data processing, machine learning and pattern matching, but its potential for feature selection is yet to be fully explored. PSO improves and optimizes a candidate solution iteratively with respect to a certain degree of quality. It provides a solution to the problem by having an inhabitant of swarm particles. By applying mathematical formulas, velocity and position of swarm particles are calculated and these particles are moved in the search space. The movement of individual swarm particle is inclined by its local finest known position and is also directed to the global finest known position in the exploration space. These positions are updated as improved positions, which are found by other particles. These improved positions are then used to move the swarm in the direction of the best solutions. The aim of the study is to inspect and improve the competence of PSO for feature selection. PSO functionalities are used to detect a subset of features to accomplish improved classification performance than using entire features set [20].

In [21] several algorithms are examined on the PIMA Indian dataset and a localized dataset. Principle component analysis (PCA) and PSO are also used in different combinations with classification algorithms. The best results of 79.56% by PCA-LR and 92.43% by PSO-Naive Bayes were achieved on the PIMA Indian and localized datasets. The PSO is also employed by [5], to improve ANN accuracy for diabetes detection. They successfully tried to control the saturation rate of PSO activation function.

Hassan et al. [22] examined a self-organizing map (SOM) optimization algorithm with four metaheuristic algorithms, including PSO, newton-based SOMPSO, SOMHSA (SOM with the Harmony search algorithm), and SOMSwram. The best accuracy of diagnosis of diabetic patients of 80% is achieved on the PIMA Indian diabetes dataset. The four algorithms are also examined on Wisconsin and new Thyroid dataset, and better accuracies than those on the PIMA Indian dataset were obtained. For example, for the new Thyroid dataset, accuracy of 91% through newton-based SOM, and Wisconsin dataset, accuracy of 97% was gained through SOMHSA.

Machine learning methods are now utilized to analyze high-dimensional biomedical data automatically. Some examples of biomedical applications of ML include liver disease diagnosis, skin lesions, cancer categorization, risk assessment for cardiovascular disease, and analysis of genetic and genomic data [19].

Type 1 and type 2 diabetes exacerbates the negative effects of COVID -19 independently [23]. In [24], the proportional contributions of insulin resistance and beta-cell dysfunction in type 2 diabetes are varied and dependent on demographic, genetic, and clinical factors, with significant interaction with environmental factors [25]. In the case of newly diagnosed DM2, the VERIFY research found that early treatment with metformin–vildagliptin improves long-term glycemic control and can slow disease progression [26]. People with type 2 diabetes diagnosed in adolescence and early adulthood (or with a younger present age) were intrinsically and more prone to retinopathy after accounting for illness duration and other key confounding factors [27]. Simple non-invasive fibrosis scores based on normal blood tests are increasingly examined as screening tools [28].

Miroslav Marinov et al. [29] reviewed 31 articles related to a diabetes diagnosis. This study was classified under the classification, clustering, and association data mining methods. The authors stated that data mining has a bright future in biomedicine. However, there was no detailed classification accuracy comparison.

Anjali Khandgar presented a review to interpret various data mining techniques for diabetes prediction. This study has shown standards for analyzing the parameters of behavior and lifestyle of patients such as emotions, physical activities, eating habits, etc. The retrieved information can be used to check clinical parameters, other prognoses, and treatment planning. However, a comparison of the accuracy of different methods is not mentioned [30].

Preeti Verma et al. [31] reviewed various studies with classification techniques for a diabetes diagnosis. The results showed that the support vector machine (SVM) effectively classifies the diabetic disorder. The accuracy rate obtained using SVM is 96.58%. The authors have not investigated the effects of data preprocessing on the accuracy of the prediction of diabetic patients.

Yu et al. [32] used quantum particle swarm optimization (QPSO) and weighted least squares support vector machine (WLS-SVM) for type 2 diabetes prognosis. Fanicol et al. conducted their study on the same data set and used four algorithms NB, DT, LR, and 274RF. They calculated the performance of each classifier and found that the most successful method was RF with tenfold cross-validation with an accuracy of 97.4% [33, 34]. Zhu et al. [35] reduced the data size by principal component analysis (PCA) in feature extraction methods using random data from 68,994 patients obtained from a hospital in Luzhou, China. Using the obtained features, they achieved an accuracy of 80.84% with RF. In the following, comparing the related work with existing work and its limitations (Table 1).

Table 1 Comparing the limitations of related work with Existing work

Early Prediction of Diabetes Using Feature Selection and Machine Learning Algorithms

Abstract

Similar content being viewed by others

Prediction of Diabetes Using Various Feature Selection and Machine Learning Paradigms

Performance Analysis of Machine Learning Based On Optimized Feature Selection for Type II Diabetes Mellitus

Improving the Accuracy of Diabetes Diagnosis Applications through a Hybrid Feature Selection Algorithm

Explore related subjects

Introduction

Related Work

Machine Learning Algorithms

PSO Algorithms

Difference Between PSO and Genetic Algorithm

Advantages and Disadvantages of Particle Swarm Optimization

Equation for the Objective Function they were Maximizing or Minimizing

Feature Selection

Filtering Methods

Wrapper Methods

Embedded Methods

Differences Between Filter and Wrapper Methods

Feature Selection Techniques Using PSO Algorithms

Motivation

Material and Method

Stage 1: Collected Dataset

Pima Indians Diabetes Database

Diabetes 130-US Hospitals for Years 1999–2008 Data Set

Diabetes Iraqi Society Data Set

Stage 2: Data Preprocessing

Need for Data Preprocessing

Stage 3: Proposed Method

Evaluation of Result

Results

Discussion

Limitations

Conclusion and Future Work

Data availability

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of Interest

Ethical Approval

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation