Introduction

Diabetes is a prominent cause of death across the world [1]. Diabetes can harm one's health if it is discovered too late [2]. Individuals and families, healthcare institutions, and society bear tremendous financial costs [3]. Furthermore, nearly 30 million Indians have diabetes, with many more at risk [4]. Most people develop chronic illnesses due to their lifestyle, eating choices, and lack of physical exercise [5]. Predicting future health outcomes is therefore highly desirable, especially for pre-diabetic patients, so that preventive and intervention measures can be implemented [6]. Diabetes remission is a hotly disputed concept in contemporary endocrinology [7].

Medical practitioners are looking for an effective diabetes prediction system. Different machine learning approaches can examine data from various angles and synthesize it into meaningful information. If specific data mining techniques are applied to large volumes of data, they will be able to provide us with relevant knowledge [8].

Data mining techniques aid in the machine learning process and are widely used in various critical applications [9]. Many data processing methodologies, decision support systems, and systems that probe deeper into the diseases were discovered in the current literature [10,11,12,13,14,15,16,17]. Several machine learning approaches are used in clinical settings to forecast illness, and they have been demonstrated to be more accurate than the traditional methods for diagnosis [18]. As a result, modern medicine has encountered issues acquiring vast amounts of data, analyzing it, and applying the resulting knowledge to solving complicated clinical problems; AI capabilities are required for these goals [19].

Given the importance of diabetes care, the assumption that AI applications for diabetes care are useful tools, and the scarcity of studies examining the use of AI for diabetes care, this study examined AI algorithms and techniques for diabetes care, focusing on machine learning methods. Diabetes outcomes are classified and diagnosed by employing several classification algorithms. This work compares the performance of nine classifiers following feature selection using Particle Swarm Optimization (PSO). Among the most prominent algorithms in the data mining research community are LR, NB, C4.5, DT, RF, SVM, GB, SGDA, and KNN. Our goal is to evaluate the efficiency and effectiveness of these algorithms in terms of accuracy, sensitivity, specificity, and precision.

A significant amount of vital and sensitive healthcare data has been produced due to tremendous breakthroughs in biotechnology and public healthcare infrastructures. Many intriguing patterns are discovered through intelligent data analysis tools for the early diagnosis and prevention of various fatal diseases. An early diabetes diagnosis can result in more effective therapy. Data mining techniques are widely used for the prediction of disease at an early stage. In this study, diabetes is predicted using significant attributes, the relationships between the various features are characterized, and a comparison of the proposed approach with current state-of-the-art techniques is carried out, demonstrating the proposed method's adaptability to many public healthcare applications. The main contributions of this article are as follows:

  • Machine learning models for diabetes prediction that achieve strong performance.

  • A comparison of the findings of the proposed technique with the most pertinent studies in the prior literature.

  • An investigation of the benefits of PSO-based feature selection (PSO-ML) for prediction.

The article is structured as follows: the “Related works” section summarizes related work, the “Materials and methods” section describes the proposed method, and the “Results” section presents the experimental results, including performance evaluation and comparison. The “Conclusion and Future Work” section is presented at the end.

Related Work

Feature selection (FS) is a challenging and demanding task due to the large search space. It reduces the number of features, eliminates insignificant, noisy, superfluous, and duplicate data, and still provides reasonably adequate classification accuracy. Existing feature selection approaches face difficulties such as stagnation in local optima, slow convergence, and high computational cost. In machine learning, particle swarm optimization (PSO) is an evolutionary computation procedure that is computationally less costly and can converge more quickly than many existing approaches. PSO has been applied effectively in areas such as medical data processing, machine learning, and pattern matching, but its potential for feature selection has not yet been fully explored. PSO iteratively improves a candidate solution with respect to a given measure of quality. It addresses the problem by maintaining a population of swarm particles; using simple mathematical formulas, the velocity and position of each particle are updated, and the particles are moved through the search space. The movement of each particle is influenced by its own best known position and is also guided toward the best known position found in the exploration space. Whenever better positions are found by other particles, these positions are updated, and the swarm is steered toward the best solutions. The aim of this study is to examine and improve the suitability of PSO for feature selection. PSO is used to detect a subset of features that achieves better classification performance than using the entire feature set [20].

In [21], several algorithms are examined on the PIMA Indian dataset and a localized dataset. Principal component analysis (PCA) and PSO are also used in different combinations with classification algorithms. The best results were 79.56% with PCA-LR on the PIMA Indian dataset and 92.43% with PSO-Naive Bayes on the localized dataset. PSO is also employed in [5] to improve ANN accuracy for diabetes detection, where the authors controlled the saturation rate of the activation function.

Hassan et al. [22] examined a self-organizing map (SOM) optimization algorithm with four metaheuristic algorithms, including PSO, Newton-based SOMPSO, SOMHSA (SOM with the harmony search algorithm), and SOMSwarm. The best accuracy for diagnosing diabetic patients, 80%, was achieved on the PIMA Indian diabetes dataset. The four algorithms were also examined on the Wisconsin and New Thyroid datasets, where better accuracies than on the PIMA Indian dataset were obtained: for example, 91% on the New Thyroid dataset with Newton-based SOM and 97% on the Wisconsin dataset with SOMHSA.

Machine learning methods are now utilized to analyze high-dimensional biomedical data automatically. Some examples of biomedical applications of ML include liver disease diagnosis, skin lesions, cancer categorization, risk assessment for cardiovascular disease, and analysis of genetic and genomic data [19].

Type 1 and type 2 diabetes independently exacerbate the negative effects of COVID-19 [23]. According to [24], the proportional contributions of insulin resistance and beta-cell dysfunction in type 2 diabetes vary and depend on demographic, genetic, and clinical factors, with significant interaction with environmental factors [25]. For newly diagnosed type 2 diabetes, the VERIFY study found that early treatment with metformin–vildagliptin improves long-term glycemic control and can slow disease progression [26]. People diagnosed with type 2 diabetes in adolescence and early adulthood (or with a younger present age) were intrinsically more prone to retinopathy after accounting for illness duration and other key confounding factors [27]. Simple non-invasive fibrosis scores based on routine blood tests are increasingly being examined as screening tools [28].

Miroslav Marinov et al. [29] reviewed 31 articles related to diabetes diagnosis. The reviewed studies were classified under classification, clustering, and association data mining methods. The authors stated that data mining has a bright future in biomedicine; however, no detailed comparison of classification accuracy was provided.

Anjali Khandgar presented a review interpreting various data mining techniques for diabetes prediction. The study outlined standards for analyzing behavioral and lifestyle parameters of patients, such as emotions, physical activity, and eating habits. The retrieved information can be used to check clinical parameters, support prognosis, and plan treatment. However, a comparison of the accuracy of the different methods is not provided [30].

Preeti Verma et al. [31] reviewed various studies with classification techniques for a diabetes diagnosis. The results showed that the support vector machine (SVM) effectively classifies the diabetic disorder. The accuracy rate obtained using SVM is 96.58%. The authors have not investigated the effects of data preprocessing on the accuracy of the prediction of diabetic patients.

Yu et al. [32] used quantum particle swarm optimization (QPSO) and a weighted least squares support vector machine (WLS-SVM) for type 2 diabetes prognosis. Fanicol et al. conducted their study on the same dataset using four algorithms, NB, DT, LR, and RF. They evaluated the performance of each classifier and found that the most successful method was RF with tenfold cross-validation, with an accuracy of 97.4% [33, 34]. Zhu et al. [35] reduced the data size with principal component analysis (PCA) as a feature extraction method, using random data from 68,994 patients obtained from a hospital in Luzhou, China. Using the obtained features, they achieved an accuracy of 80.84% with RF. Table 1 compares the related work with the existing work and its limitations.

Table 1 Comparing the limitations of related work with Existing work

Particle swarm optimization (PSO) is used to implement feature selection in this work, followed by a performance comparison of machine learning algorithms on three medical datasets. The project is divided into two halves. The first is the feature selection approach, which retains the more relevant features while discarding irrelevant ones for faster and more efficient data classification. In the second stage, the classification algorithms are applied to the selected features to make predictions.

Machine Learning Algorithms

Machine learning (ML), a subset of artificial intelligence (AI), has expanded significantly in data analysis and computing in recent years, enabling programs to behave intelligently. ML is often described as the most popular recent technology of the fourth industrial revolution (4IR or Industry 4.0) and gives systems the ability to learn and improve from experience automatically without being explicitly programmed. "Industry 4.0" refers to the ongoing automation of traditional manufacturing and industrial activities using cutting-edge smart technologies such as machine learning, including exploratory data processing. Thus, machine learning algorithms are the key to intelligently analyzing these data and constructing the related real-world applications [36].

In the following sections, we provide a brief overview of several of the most widely used machine learning algorithms. We also aim to highlight the advantages and disadvantages of these algorithms from the perspective of their applications, to help decision-makers make an informed choice when selecting the best algorithm for a given application requirement. Table 2 compares the benefits and drawbacks of the algorithm for diagnosing diabetes with previous methods (Table 3).

Table 2 Advantages and disadvantages of the proposed method in diagnosing diabetes compared to other methods
Table 3 Comparison of the performance of other feature selection

PSO Algorithms

Many challenging research issues can be formulated as optimization problems. The emergence of big data technology has also sparked a large-scale increase in the complexity and size of optimization challenges. The development of parallelized optimization techniques has become necessary due to the high computational cost of these problems. One of the most well-known swarm intelligence-based algorithms, particle swarm optimization (PSO), is valued for its robustness, simplicity, and global search capability [37]. It has undergone numerous improvements since it was first introduced in 1995. With a deeper understanding of the method, researchers have created new variants that address diverse demands, developed applications in various fields, published theoretical analyses of the effects of the different parameters, and proposed numerous algorithm variations [38]. Ant colony optimization (ACO), particle swarm optimization (PSO), artificial fish swarm (AFS), bacterial foraging optimization (BFO), and artificial bee colony (ABC) are just a few of the swarm intelligence techniques developed in recent years. This paper uses PSO to select features. Table 4 compares the effectiveness of PSO-based feature selection with other feature selection methods.

Table 4 Comparison of the PSO approach with other feature selection approaches mentioned (filter methods, wrapper methods, and embedded methods)

In a previous article [1], we used the genetic algorithm to predict diabetes; in the comparison we made with the particle swarm algorithm, we found that PSO has the following advantages over the genetic algorithm and can be successful in predicting diabetes, so we use it here. Because PSO does not rely on the gradient of the objective function, it is computationally more efficient than the genetic algorithm. Moreover, it is simple to parallelize: each particle can be updated concurrently, and since numerous particles are manipulated to find the best answer, the updated values only need to be gathered once per iteration; as a result, PSO can be implemented well on a map-reduce architecture. In this article, feature selection using this algorithm is proposed. The results obtained with this method are compared to those obtained using a number of traditional machine learning algorithms, including Support Vector Machine (SVM), Decision Tree (DT), K-Nearest Neighbor (KNN), Naive Bayesian Classifier (NBC), Random Forest Classifier (RFC), and Logistic Regression (LR). The computational findings of our suggested strategy demonstrate that improved prediction accuracy can be attained with significantly fewer features. This work has the potential to be useful in clinical settings and serve as a resource for clinicians.

Difference Between PSO and Genetic Algorithm

Genetic Algorithms (GAs) and PSO both optimize a cost function, both are iterative, and both have a random element, so they can be applied to similar kinds of problems. The difference between PSO and GAs is that GAs do not traverse the search space like a flock of birds, covering the spaces in between; the operation of GAs is closer to Monte Carlo sampling, where candidate solutions are randomized and the best solutions are picked to compete with a new set of randomized solutions. PSO algorithms also benefit from normalization of the input vectors to reach faster "convergence" (as heuristic algorithms, neither truly converges), while GAs can work with features that are continuous or discrete. In addition, in PSO there is no creation or deletion of individuals; individuals merely move across a landscape where their fitness is measured over time, like a flock of birds or other creatures that communicate.

Advantages and Disadvantages of Particle Swarm Optimization

Advantages:

  • Insensitive to scaling of design variables.

  • Easily parallelized for concurrent processing.

  • Derivative free.

  • Very few algorithm parameters.

  • A very efficient global search algorithm.

Disadvantages:

  • PSO's local search ability is weak.

Equation for the Objective Function to be Maximized or Minimized

We are looking to maximize or minimize a function to find the optimum solution. A function can have multiple local maxima and minima, but only one global maximum and one global minimum. If the function is very complex, finding the global maximum can be a daunting task. PSO tries to capture the global maximum or minimum; even though it may not reach the exact optimum, it gets very close to it, which is why PSO is called a heuristic method. Fish shoaling and bird flocking social behaviors served as inspiration for Eberhart and Kennedy's [42] PSO stochastic optimization method. Each member of the flock is represented by a particle in the PSO, with physical characteristics such as mass and volume.

We first define the parameters used below; additional parameters are introduced where needed.

F: objective function; Vi: velocity of the particle (agent); A: population of agents; W: inertia weight; C1: cognitive constant; U1, U2: random numbers; C2: social constant; Xi: position of the particle (agent); pb: personal best; gb: global best.

The actual algorithm proceeds as follows:

  1. Create a ‘population’ of agents (particles) uniformly distributed over X.

  2. Evaluate each particle’s position according to the objective function (say, the function below).

    $$z = f(x,y) = \sin^2 x + \sin^2 y + \sin x\sin y$$
    (1)

  3. If a particle’s present position is better than its previous best position, update it.

  4. Determine the best particle (according to the particles’ previous best positions).

  5. Update the particles’ velocities.

    $$V_i^{t + 1} = W \cdot V_i^t + c_1 U_1^t (P_{b1}^t - P_i^t ) + c_2 U_2^t (g_b^t - p_i^t )$$
    (2)

  6. Move the particles to their new positions.

    $$P_i^{t + 1} = P_i^t + v_i^{t + 1}$$
    (3)

  7. Go to step 2 until the stopping criteria are satisfied.

The operation of PSO is described by Eqs. (4)–(8).

$$X_i = \left( {x_{i1} , \, x_{i2} , \ldots ,x_{iD} } \right)$$
(4)
$$P_i = \left( {p_{i1} , \, p_{i2} , \ldots ,p_{iD} } \right)$$
(5)
$$V_i = \left( {v_{i1} , \, v_{i2} , \ldots ,v_{iD} } \right)$$
(6)
$$V_{id} = \, w \ast v_{id} + c1 \ast r1 \ast \left( {P_{id} - X_{id} } \right) + c2 \ast r2 \ast \left( {P_{gd} - X_{id} } \right)$$
(7)

where \(x_{id}\) is the current position of a particle, \(P_{id}\) is the particle's personal best, \(P_{gd}\) is the best of the group, \(v_{id}\) is the velocity of the particle, \(w\) is the inertia factor, \(c_1\) is the relative influence of the cognitive component, \(c_2\) is the relative influence of the social component, and \(r_1\), \(r_2\) are random numbers employed to keep the population's change spread uniformly within [0, 1]. The coefficients \(c_1\) and \(c_2\) are the self-recognition constant and the social-component coefficient, as shown in Eq. (7).

$$w = w_{\max } - \frac{{w_{\max } - w_{\min } }}{{{\text{iter}}_{\max } }} \times {\text{iter}}$$
(8)

where the initial weight is shown by \(w_{\max }\), the final weight is shown by \(w_{\min }\), the maximum iteration number is shown by \({\text{iter}}_{\max }\), and the current iteration number is shown by iter.

A particle swarm optimization operates in this manner. We start with a number of random locations on the plane (call them particles) and let them search for the minimum point in a variety of directions, much like a flock of birds searching for food. At each step, every particle searches around the best position it has ever found as well as the best position the entire swarm has ever found. After a certain number of iterations, we take the best point this swarm of particles has ever explored as the minimum point of the function.
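To make the procedure above concrete, the following is a minimal Python sketch of steps 1–7, applied to the example objective in Eq. (1). The swarm size, iteration budget, and coefficient values (w, c1, c2) are illustrative assumptions, not the settings used in this study.

```python
import numpy as np

# Example objective from Eq. (1); here we minimize it (maximization is symmetric).
def f(x, y):
    return np.sin(x) ** 2 + np.sin(y) ** 2 + np.sin(x) * np.sin(y)

rng = np.random.default_rng(0)
n_particles, n_iters = 20, 100        # illustrative swarm size and iteration budget
w, c1, c2 = 0.8, 1.5, 1.5             # inertia, cognitive, and social coefficients (assumed)

X = rng.uniform(-np.pi, np.pi, (n_particles, 2))   # step 1: uniformly distributed particles
V = np.zeros_like(X)
pbest = X.copy()                                   # personal best positions
pbest_val = f(X[:, 0], X[:, 1])                    # step 2: evaluate each particle
gbest = pbest[pbest_val.argmin()].copy()           # step 4: best particle so far

for _ in range(n_iters):
    r1 = rng.random((n_particles, 1))
    r2 = rng.random((n_particles, 1))
    V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (gbest - X)   # step 5: Eq. (2)
    X = X + V                                                   # step 6: Eq. (3)
    vals = f(X[:, 0], X[:, 1])
    improved = vals < pbest_val                                 # step 3: update personal bests
    pbest[improved], pbest_val[improved] = X[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()

print("approximate minimum at", gbest, "value", f(gbest[0], gbest[1]))
```

Keeping the inertia weight closer to 1 makes the swarm explore more of the plane; lowering it shifts the balance toward exploitation, as discussed below.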

Assume we have P particles, and denote the position of particle i at iteration t as \(X^i (t)\), which in the example above is the coordinate pair \(X^i (t) = (x^i (t), y^i (t))\). Besides the position, each particle also has a velocity, denoted \(V^i (t) = (v_x^i (t), v_y^i (t))\). At the next iteration, the position of each particle is updated as

$$X^i (t + 1) = X^i (t) + V^i (t + 1)$$
(9)

and, at the same time, the velocities are updated by the rule

$$V^i (t + 1) = wV^i (t) + c_1 r_1 \left( {{\text{pbest}}^i - X^i (t)} \right) + c_2 r_2 \left( {{\text{gbest}} - X^i (t)} \right)$$
(10)

where r1 and r2 are random numbers between 0 and 1, the constants w, c1, and c2 are parameters of the PSO algorithm, pbesti is the position that gives the best f(X) value ever explored by particle i, and gbest is the best position explored by all the particles in the swarm.

Note that pbesti and Xi(t) are two position vectors, and the difference pbesti − Xi(t) is a vector subtraction. Adding this difference to the original velocity Vi(t) pulls the particle back toward the position pbesti. The same holds for the difference gbest − Xi(t).

We call the parameter W the inertia weight constant. It lies between 0 and 1 and determines how much the particle should keep its previous velocity (i.e., the speed and direction of the search). The parameters C1 and C2 are called the cognitive and social coefficients, respectively. They control how much weight is given to refining the particle's own search result versus following the search result of the swarm. These parameters can be viewed as controlling the trade-off between exploration and exploitation.

$$V_i^{t + 1} = W \cdot V_i^t + c_1 U_1^t (P_{b1}^t - P_i^t ) + c_2 U_2^t (g_b^t - P_i^t )$$
(11)

\(W\) = The inertia weight, a positive constant. This parameter is important for balancing global search, also known as exploration (when higher values are set), and local search, known as exploitation (when lower values are set).

\(W \cdot V_i^t\) = Inertia: makes the particle move in the same direction and with the same velocity; this component also supports diversification, searching for new solutions and finding the regions with potentially the best solutions.

\(c_1 U_1^t (P_{b1}^t - P_i^t )\) = Personal influence: improves the individual; makes the particle return to a previous position that was better than the current one.

\(c_1 U_1^t (P_{b1}^t - P_i^t )\, + \,c_2 U_2^t (g_b^t - P_i^t )\) = Intensification: exploits the previous solutions and finds the best solution of a given region.

\(c_2 U_2^t (g_b^t - P_i^t )\) = Social Influence: Makes the particle follow the best neighbor’s direction.

If W = 1, the particle’s motion is entirely influenced by the previous motion, so the particle may keep going in the same direction. On the other hand, if 0 ≤ W < 1, such influence is reduced, which means that a particle instead goes to other regions in the search domain.

The second term involves the personal best position \(P_{b1}^t\) and the current position \(P_i^t\). The idea behind this term is that, as the particle moves farther from \(P_{b1}^t\) (its personal best), the difference \((P_{b1}^t - P_i^t)\) grows; hence, this term increases, attracting the particle back toward its own best position. The parameter C1 appearing in this product is a positive constant, the individual-cognition parameter, and it weighs the importance of the particle's own previous experiences.

The other hyper-parameter in the second term is U1t, a random value within the [0, 1] range. This random parameter plays an essential role in avoiding premature convergence, increasing the likelihood of reaching the global optimum.

The difference (gbt − Pit) acts as an attraction of the particles toward the best point found up to iteration t. Likewise, C2 is a social learning parameter that weighs the importance of the global learning of the swarm, and U2t plays precisely the same role as U1t.

In the case of C1 = C2 = 0, all particles continue flying at their current speed until they hit the search space’s boundary.

In cases C1 > 0 and C2 = 0, all particles are independent.

In cases C1 = 0 and C2 > 0, all particles are attracted to a single point in the entire swarm.

In case C1 = C2 ≠ 0, all particles are attracted toward the average of pbest and gbest.

Feature Selection

A preprocessing method called feature selection identifies the main characteristics of a particular situation. It has historically been used in various settings, such as the analysis of biological data, financial applications, and intrusion detection systems. Medical applications have effectively employed feature selection to reduce dimensionality and better understand the root causes of disease [43]. Traditional feature selection algorithms do not try to capture causal relationships between features; instead, they select features based on the correlations between the predictive characteristics and the class variable. Since causal linkages reflect the underlying mechanism of a system, knowledge of the causal relationships between features and the class variable has been shown to be useful for developing interpretable and reliable prediction models. As a result, several algorithms have been proposed, and causality-based feature selection has attracted increasing attention [44]. There are three feature selection strategies: filter, wrapper, and embedded. A comparison of the performance of other feature selection approaches is shown in Table 5.

Table 5 Comparing the performance of other feature selection approaches

Filtering Methods

Filter methods evaluate the quality of predictions or classifications using an indirect criterion, such as a distance criterion that shows how well the classes are separated. This technique is usually applied as a preliminary step: features are chosen based on how well they relate to the outcome variable in various statistical tests, rather than by training a model.

Wrapper Methods

Wrapper approaches assess a subset of features throughout the search phase using a search strategy and a learning model. Because they rely on a learning model, wrapper methods typically outperform filter methods in classification accuracy. On the other hand, they have a few drawbacks, including a large computational overhead and the potential for overfitting.

Embedded Methods

These techniques select features during the learning process itself and are typically tied to a given learning algorithm. This approach also takes advantage of the previous models by using different evaluation criteria in different search stages, combining filter and wrapper characteristics; such feature selection mechanisms are built into the learning algorithms themselves. A comparison of the performance of other feature selection approaches is shown in Table 3.
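As a hedged illustration of the three families, the scikit-learn snippet below selects features with a univariate filter, a wrapper (recursive feature elimination), and an embedded model-based selector. The estimators, the breast cancer example data, and the choice of eight features are arbitrary assumptions for demonstration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, RFE, SelectFromModel, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Filter: rank features with a univariate statistical test, independent of any classifier.
filter_sel = SelectKBest(score_func=f_classif, k=8).fit(X, y)

# Wrapper: repeatedly retrain a model, eliminating the weakest features each round.
wrapper_sel = RFE(LogisticRegression(max_iter=5000), n_features_to_select=8).fit(X, y)

# Embedded: feature importance is a by-product of fitting the model itself.
embedded_sel = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0), max_features=8).fit(X, y)

for name, sel in [("filter", filter_sel), ("wrapper", wrapper_sel), ("embedded", embedded_sel)]:
    print(name, sel.get_support(indices=True))
```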

Differences Between Filter and Wrapper Methods

The following are the key differences between wrapper and filter feature selection processes:

  • Since filter methods do not require model training, they are substantially faster than wrapper approaches; wrapper approaches, in contrast, are computationally expensive.

  • Wrapper techniques use cross-validation, while filter methods use statistical tests to examine a subset of characteristics.

  • Wrapper methods may always offer the best feature subset, whereas filtering methods frequently fail to do so.

  • Using a subset of features from a wrapper method makes the model more prone to overfitting than using a subset from a filter method.

Feature Selection Techniques Using PSO Algorithms

Eberhart and Kennedy devised PSO, a population-based method. PSO is a well-known and successful global search method. It is an excellent technique for feature selection problems because features are easy to encode, it has a global search capability, it is computationally affordable, it has few parameters, and it is easy to apply. For these reasons, PSO is used here to select features. The limitations of the feature selection approaches mentioned above (filter, wrapper, and embedded methods) are shown in Table 6.

Table 6 Limitations of other feature selection approaches mentioned (filter methods, wrapper methods, and embedded methods) [47]

PSO was used to explore and choose a subset of principal components or principal features throughout the main space. Particles in PSO represent possible solutions in the search space and form a swarm known as a population. The swarm of particles is created by randomly dispersing 1s and 0s: if a principal component's bit is 1, it is chosen, while a principal component with bit 0 is ignored. As a result, each particle represents a different subset of the principal components. The particle swarm is randomly initialized and then moved in the search (principal) space, updating its position and velocity to find the best set of characteristics [9, 48, 49]. For example, the parameter initialization of PSO-SVM is shown in Table 7.

Table 7 PSO-SVM parameters initialization
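The following is a minimal sketch of the binary-PSO feature selection scheme described above (1 = feature kept, 0 = feature dropped), using the cross-validated accuracy of an SVM as the fitness function. The sigmoid transfer function, the coefficient values, and the use of SVC are common binary-PSO choices assumed here for illustration; they are not necessarily the exact configuration reported in Table 7.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def fitness(mask, X, y):
    """Fitness of a 0/1 particle: cross-validated accuracy on the selected features."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(SVC(), X[:, mask.astype(bool)], y, cv=5).mean()

def binary_pso_select(X, y, n_particles=20, n_iters=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pos = rng.integers(0, 2, (n_particles, d))          # swarm of random 0/1 feature masks
    vel = rng.uniform(-1, 1, (n_particles, d))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p, X, y) for p in pos])
    gbest = pbest[pbest_fit.argmax()].copy()
    for _ in range(n_iters):
        r1 = rng.random((n_particles, d))
        r2 = rng.random((n_particles, d))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        prob = 1.0 / (1.0 + np.exp(-vel))               # sigmoid transfer to [0, 1]
        pos = (rng.random((n_particles, d)) < prob).astype(int)
        fit = np.array([fitness(p, X, y) for p in pos])
        better = fit > pbest_fit
        pbest[better], pbest_fit[better] = pos[better], fit[better]
        gbest = pbest[pbest_fit.argmax()].copy()
    return gbest                                        # best 0/1 mask over the features
```

Applied to a training set, the returned mask is analogous to the "Individual" vectors reported later in Table 15.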

Motivation

Diabetes is typically caused by higher-than-normal blood sugar levels, or the production of insulin may be regarded as insufficient. It has been noted in recent years that the percentage of diabetes-affected patients has grown considerably throughout the world. Evidently, this problem must be taken more seriously in the coming years to ensure that the average percentage of diabetes-affected individuals is reduced. Recently, several research teams have conducted detailed studies on machine learning platforms to compare their precision. Machine learning can be used, through parametric modeling of health data including diabetic patient data sets, to synthesize expertise in the field. In this study, a model is proposed for predicting diabetes based on feature selection and machine learning algorithms. The combination of Particle Swarm Optimization (PSO) and machine learning algorithms is used to evaluate a set of medical data relating to a diabetes diagnosis challenge. Experiments are performed on the Diabetes Database. The sensitivity, specificity, and accuracy metrics widely used in medical studies are used to assess the effectiveness and reliability of the proposed system. The proposed approach also has the potential to be applied for effective and early diagnosis of other medical diseases.

The machine learning process can be implemented using various machine learning techniques. The most extensively utilized learning techniques are supervised and unsupervised learning. The supervised learning technique is applied when historical data are available for a specific problem. The system is trained using inputs and responses before being applied to predict the responses for new data. Artificial neural networks, backpropagation, decision trees, support vector machines, and the Naive Bayes classifier are all examples of supervised techniques. An unsupervised learning technique is applied when the available training data are unlabeled. No prior information or training is provided to the system; the algorithm must analyze and detect patterns in the available data to make judgments or predictions. K-means clustering, hierarchical clustering, principal component analysis, and the hidden Markov model are all examples of unsupervised techniques [19].

Feature selection also reduces the number of features, eliminates useless, noisy, and redundant data, and yields acceptable classification accuracy. The feature selection process can be considered a global combinatorial optimization problem in machine learning. Feature selection is critical in pattern classification, medical data processing, machine learning, and mining applications. A good feature selection strategy, based on the number of characteristics analyzed for sample classification, is necessary to speed up processing, enhance predictive accuracy, and avoid incomprehensibility. This paper implements feature selection using particle swarm optimization (PSO). Machine learning algorithms using the one-versus-rest strategy serve as the PSO fitness function for the classification problem. The selected features are then used to diagnose diabetes with machine learning algorithms.

Material and Method

Stage 1: Dataset Collection

Pima Indians Diabetes Database

The Pima Indians Diabetes Database originates from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) and includes cost information donated by Peter Turney. The selection of these instances from a larger database was subject to several constraints: all patients are of Pima Indian heritage, female, and at least 21 years old. This study uses the type 2 diabetes dataset from https://www.kaggle.com/kumargh/pimaindiansdiabetescsv. There are 768 instances in this data set, divided into two groups (diabetic and non-diabetic), with eight risk factors: number of pregnancies, 2-h plasma glucose concentration in an oral glucose tolerance test, diastolic blood pressure, triceps skin fold thickness, 2-h serum insulin, body mass index, diabetes pedigree function, and age, as shown in Table 8. Seventy percent of the data are used for training and 30% for testing. The attributes include Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, and Class.

Table 8 Description of the Pima Indian diabetes datasets
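A minimal sketch of loading the Pima data and applying the 70/30 split described above. The local file name and the header-less CSV assumption are illustrative; the column names follow Table 8.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Column names follow Table 8; the file path is an assumption for this sketch.
cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
        "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Class"]
df = pd.read_csv("pima-indians-diabetes.csv", names=cols)

X, y = df.drop(columns="Class"), df["Class"]
# 70% of the instances for training, 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
print(X_train.shape, X_test.shape)
```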

Diabetes 130-US Hospitals for Years 1999–2008 Data Set

In addition, the algorithms were trained on a second dataset. Two types of diabetic records were used: those from automatic electronic recording equipment and paper records. The automatic device had an inbuilt clock that allowed it to timestamp events, whereas the paper records had "logical time windows" recorded at fixed times: breakfast (08:00), lunch (12:00), dinner (18:00), and bedtime (22:00). As a result, paper records have notionally consistent recording times, while electronic records have more precise time stamps. These data were analyzed to look for indicators linked to readmission of diabetic patients and other outcomes. For this study, the diabetes dataset available at https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008, which contains 55 useful variables and 100,000 records, was used. Table 9 lists these variables and acronyms. The data are divided into two parts: training (80%) and testing (20%).

Table 9 Description of the diabetes 130-US hospitals for 1999–2008 data set

The dataset represents clinical care at 130 US hospitals and integrated delivery networks over ten years (1999–2008). More than 50 attributes describe patient and hospital outcomes. For an encounter to be extracted from the database, the following conditions had to be satisfied.

  1. It is a hospital encounter (a hospital admission).

  2. It is a diabetic encounter, meaning that diabetes was diagnosed during the encounter.

  3. The length of stay was between 1 and 14 days.

  4. Laboratory tests were performed during the encounter.

  5. Medications were administered during the encounter.

The information includes:

  • Details about the patient's number, race, gender, and age.

  • The type of admission and length of hospital stay.

  • The medical specialty of the admitting physician.

  • The number of lab tests performed.

  • The HbA1c test results.

  • The diagnosis.

  • The number of medications.

  • The number of diabetic medications.

  • The quantity of outpatient, inpatient, and emergency visits in the year before the hospitalization.

  • Etc.…

Diabetes Iraqi Society Data Set

The structure of this diabetes data set is described here. The data were gathered in Iraqi society from the laboratory of the Medical City Hospital and the specialized center for endocrinology and diabetes (Al-Kindy Teaching Hospital). To construct the diabetes dataset, data were extracted from patient records and entered into a database. The data comprise medical information and test results, including the following:

  1. Medical notes,

  2. Laboratory analyses, etc.

  3. The information initially entered into the system:

    • The Number of patients

    • Blood glucose level

    • Age

    • Sex

    • Creatinine (Cr)

    • Body mass index (BMI)

    • Urea

    • Cholesterol (Chol)

    • Fasting lipid profile, including total

    • LDL

    • VLDL

    • Triglycerides (TG)

    • Cholesterol

    • HBA1C.

together with the class (the patient's disease class: diabetic, non-diabetic, or pre-diabetic). The dataset (https://data.mendeley.com/datasets/wj9rwkp9c2/1) contains 14 useful variables and 1000 records. Table 10 lists these variables and acronyms. The data are divided into two parts: training (80%) and testing (20%).

Table 10 Description of the diabetes Iraqi society data set

Stage 2: Data Preprocessing

Data preprocessing transforms data from one format into another that is more usable, desirable, meaningful, and instructive. Machine learning techniques, mathematical modeling, and statistical expertise can all be used to automate this procedure [50,51,52]. Outliers and missing data were removed from the clinical data. Each case with missing survival information was eliminated from the analysis to develop a credible model. In addition, mean and mode imputation techniques were used to treat the remaining missing data. This was accomplished using Python software and data mining techniques.

Need for Data Preprocessing

The data must be properly prepared to generate better results from the model used in machine learning applications. Some machine learning models need the data to be in a certain format; for instance, the Random Forest method cannot handle null values, so null values in the initial raw data set must be treated before the algorithm can run. How the data set is organized should also be considered, so that various machine learning and deep learning algorithms can be run side by side and the best of them selected. The following approaches were employed in this article (a brief implementation sketch of these steps follows Table 11):

  1. Handling Null Values: Every real-world dataset contains some null values. Whether the problem is classification, regression, or any other kind, no model can handle NULL or NaN values on its own, so we must treat them.

  2. Standardization: This is a crucial stage in the preprocessing procedure. We standardize our data by giving each feature a mean of 0 and a standard deviation of 1. In machine learning, there are two techniques for scaling features (Table 11).

  3. Data Reduction: A large database may become slower, cost more to access, and be more challenging to store efficiently. In a data warehouse, data reduction seeks to produce a more compact version of the data.

  4. Rescale Data: Rescaling the data's attributes to the same scale helps various machine learning techniques when the data contain variables of different magnitudes. This is useful for machine learning methods that employ gradient descent and other optimization techniques, for weighted-input algorithms such as regression and neural networks, and for distance-based algorithms such as K-Nearest Neighbors.

  5. Binarize Data (Make Binary): A binary threshold can be used to transform our data. When a value exceeds the threshold, it is marked with a 1; when it is equal to or less than the threshold, it is marked with a 0. This process is known as thresholding or binarizing the data. It can be useful when you have probabilities that you want to convert to crisp values, and also in feature engineering, when you want to add new binary attributes that indicate something meaningful.

Table 11 Techniques of scale features
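A brief sketch of steps 1, 2, 4, and 5 above using scikit-learn; the toy values, the mean-imputation strategy, and the binarization threshold are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Binarizer

X = np.array([[148.0, 72.0, np.nan],
              [ 85.0, 66.0, 29.0],
              [183.0, np.nan, 0.0]])   # toy rows containing null values

# 1. Handle null values: mean imputation (mode imputation would use strategy="most_frequent").
X_imp = SimpleImputer(strategy="mean").fit_transform(X)

# 2. Standardization: each feature rescaled to mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X_imp)

# 4. Rescale data: map every feature into the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(X_imp)

# 5. Binarize data: values above the threshold become 1, the rest 0 (threshold is illustrative).
X_bin = Binarizer(threshold=100.0).fit_transform(X_imp)
```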

Stage 3: Proposed Method

For effective machine learning model creation, it must be recognized that most attributes are typically irrelevant to supervised classification. Feature selection and outlier elimination were part of the raw-data preprocessing phase. There are several approaches to dealing with outlying and inconsistent data; in our study, we chose the attributes whose data were significantly correlated. A feature subset selection based on PSO is proposed in the second stage. After preprocessing and feature selection, the integrated dataset is subjected to classification algorithms.

The project is divided into two halves. The first is the feature selection approach, which focuses on obtaining the more relevant features while discarding irrelevant ones for faster and more efficient data classification; the second applies classification algorithms to the selected features to produce predictions. This article discusses various approaches and datasets for evaluating the performance of different machine learning algorithms. Figure 1 depicts the study's recommended methodology. This study's methodology is divided into three key steps: data collection, preprocessing, and classification. The datasets used for the analysis are the diabetes datasets described above. The proposed method uses data from three different profiles and is based on an integrated methodology. On the other hand, the medical datasets contain a lot of missing and irrelevant data that cannot be used directly for categorization. As a result, the initial phase of the strategy is preparing the datasets using typical imputation techniques in accordance with the data profiles.

Fig. 1
figure 1

Methodology followed in the study

In machine learning applications, feature engineering is a critical stage. Modern data sets are described by many attribute features, only some of which are relevant to the prediction task; the relevant ones are retained and the irrelevant ones discarded for faster and more efficient data classification, and the classification algorithms are then applied to the selected features to produce predictions.

Therefore, the objective of this study is a comparison of machine learning algorithms for diagnosing diabetes. To compare the behavior of LR, NB, KNN, DT, RF, SVM, GB, SGDA, and C4.5, we conducted an experiment evaluating the algorithms' effectiveness and efficiency. Specifically, the research questions we set for the study are:

  1. Which algorithm is the most effective?

  2. Which one is the most efficient?

  3. Which one is the most accurate?

Evaluation of Result

This section presents the results of the information analysis. To apply and evaluate our classifiers, we employed the tenfold Cross-Validation test, a technique for assessing predictive models in which the original set is split into a training sample for training the model and a test set for assessment. After performing the preprocessing and preparation techniques, we visually analyze the data and determine the distribution of values in terms of effectiveness and efficiency.
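As a hedged sketch of this tenfold cross-validation protocol, the snippet below scores a few of the compared classifiers with scikit-learn; the hyperparameters are library defaults rather than the tuned settings used in this study, and X_train/y_train are assumed to come from the preprocessing and feature selection stages above.

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

models = {
    "LR": LogisticRegression(max_iter=5000),
    "NB": GaussianNB(),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
}

# X_train and y_train are assumed to come from the earlier preprocessing stages.
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```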

The classification cost can be represented by a cost matrix. For classification problems with two categories, two types of misclassification can be identified, false positives (FP) and false negatives (FN), along with two types of correct classification, true positives (TP) and true negatives (TN); each has different costs and benefits, as shown in Table 12. The performance measurement is used to determine the effectiveness of the classification method.

Table 12 Confusion matrix

A confusion matrix is a table that describes how well a classification model (or "classifier") performs on a set of test data with known true values. If you have an unequal number of observations in each class, or your dataset has more than two categories, classification accuracy alone may be misleading. Calculating a confusion matrix helps you better understand what your classification model gets right and where it goes wrong. Detailed descriptions of the performance measures are shown in Table 13.

Table 13 Detail descriptions of the performance measures [50,51,52,53,54,55,56]

Accuracy The simplest measure of performance is classification accuracy, defined as the percentage of correctly predicted instances and obtained using the formula [50,51,52,53,54,55,56].

$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FP}} + {\text{TN}} + {\text{FN}}}}$$
(12)

Sensitivity True-positive rate: the proportion of actually positive cases for which the model also returns a positive result, as computed by the formula below.

$${\text{Sensitivity}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$
(13)

Specificity True-negative rate: the proportion of actually negative cases for which the model also returns a negative result, as computed by the formula below.

$${\text{Specificity}} = \frac{{{\text{TN}}}}{{{\text{TN}} + {\text{FP}}}}$$
(14)

PPV How likely is it that a person has diabetes if the model's result is positive?

$${\text{PPV}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}}$$
(15)

NPV How likely is it that a person does not have diabetes if the model's result is negative?

$${\text{NPV}} = \frac{{{\text{TN}}}}{{{\text{TN}} + {\text{FN}}}}$$
(16)
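The measures in Eqs. (12)–(16) follow directly from the confusion matrix counts; a small sketch with placeholder labels is shown below.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # placeholder ground-truth labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])   # placeholder model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + fp + tn + fn)   # Eq. (12)
sensitivity = tp / (tp + fn)                    # Eq. (13)
specificity = tn / (tn + fp)                    # Eq. (14)
ppv         = tp / (tp + fp)                    # Eq. (15)
npv         = tn / (tn + fn)                    # Eq. (16)
print(accuracy, sensitivity, specificity, ppv, npv)
```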

In this section, we assess the efficacy of all classifiers in terms of the time it takes to build the model, the number of correctly categorized examples, the number of misclassified instances, and accuracy. This article was created with the Python 3.7 programming language in the Jupyter Notebook platform's Anaconda environment. Table 14 shows the implementation details.

Table 14 Software requirements

Results

Patients' quality of life and life expectancy can benefit from early diabetes diagnosis. Different diabetes detection models [19] have been developed using supervised algorithms. In almost every classification task, the dataset comprises many features. However, because some features are useless or duplicated, they are not required for good classification performance. As a result, classifiers with fewer features but higher classification accuracy are preferred for ease of interpretation. Due to improved representation, the ability to explore huge spaces, lower computational cost, ease of implementation, and fewer parameters, PSO is an excellent technique for feature selection problems. This work compared a particle swarm optimization algorithm combined with nine machine learning algorithms. Classification accuracy is used as the fitness function. Table 15 shows the feature selection results obtained with the particle swarm algorithm on each data set. All classification techniques were implemented in Python in a Jupyter Notebook.

Table 15 Result of feature selection
  • Feature selection Diabetes Iraqi society Data Set:

    • Number of Features in Subset: 4

    • Individual: [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]

    • Feature Subset: ['No_Pation', 'Cr', 'TG', 'BMI']

  • Feature selection Pima Indian diabetes datasets:

    • Number of Features in Subset: 4

    • Individual: [1, 0, 1, 1, 1, 0, 0, 0]

    • Feature Subset: ['Pregnancies', 'Blood Pressure', 'SkinThickness', 'Insulin']

  • Feature Selection Diabetes 130-US hospitals for years 1999-2008 Data Set:

    • Number of Features in Subset: 7

    • Individual: [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1]

    • Feature Subset: ['gender', 'age', 'admission_type_id', 'discharge_disposition_id', 'diag_2', 'number_diagnoses', 'diabetesMed']

As we mentioned in the “Materials and methods” Section, the selected features (shown in Table 15) are used to diagnose diabetes with machine learning algorithms. We can see from Table 16 that SVM, SGDA, and C4.5 take about 0.09 s to create their models, whereas NB, KNN, and DT take just 0.01 s. Conversely, the accuracy obtained by RF (98.81%) is higher than that obtained by LR, NB, KNN, DT, SVM, GB, SGDA, and C4.5, whose accuracy varies between 90.00% and 98.01%. It can also be seen that RF has the highest number of correctly classified instances and a lower number of incorrectly classified instances than the other classifiers. The results are shown in Table 16 and Fig. 2.

Table 16 Compared evaluation time to build a model (s) classifiers
Fig. 2
figure 2

Accuracy of the classifiers machine learning algorithms

The data set has been partitioned into two parts (training and testing). We trained our models with 70% of the data and tested them with the remaining 30%. Several models have been developed using supervised learning to detect whether the patient is diabetic or non-diabetic. For this purpose, the Logistic Regression (LR), Naive Bayes Classifier (NB), K-Nearest Neighbors (KNN), Decision Tree (DT), Random Forest (RF), Support Vector Machines (SVM), Gradient Boosting (GB), and Stochastic Gradient Descent Algorithm (SGDA) algorithms are used.

Figure 2 shows the accuracy of the nine classification models when applied to the dataset. As shown in Fig. 2, the decision trees and random forests perform better than other algorithms. Simulation error is also considered in this study to measure the performance of classifiers better. To do so, we evaluate the effectiveness of our classifier in terms of:

  • Kappa statistic (KS) as a chance-corrected measure of agreement between the classifications and the actual classes,

    $$k = \frac{p_0 - p_e }{{1 - p_e }} = 1 - \frac{1 - p_o }{{1 - p_e }}$$
    (17)
  • Mean Absolute Error (MAE) as to how close forecasts or predictions are to the eventual outcomes,

    $${\text{MAE}} = \frac{{\sum_{i = 1}^n {|y_i - x_i |} }}{n} = \frac{{\sum_{i = 1}^n {|e_i |} }}{n}$$
    (18)
  • Root Mean Squared Error (RMSE),

    $${\text{RMSD}}\left( {\hat{\theta }} \right) = \sqrt {{\text{MSE}}\left( {\hat{\theta }} \right)} = \sqrt {E\left( {\left( {\hat{\theta } - \theta } \right)^2 } \right)} .$$
    (19)
  • Relative Absolute Error (RAE),

  • Root Relative Squared Error (RRSP).

KS, MAE, and RMSE are reported as numeric values; RAE and RRSE are reported as percentages. The results are shown in Table 17 (a brief computation sketch follows the table).

Table 17 Comparative evaluation Kappa Statistic (KS), Mean Absolute Error (MAE), Root-Mean-Square Error, Relative Absolute Error, and Root Relative Squared Error Classifiers
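A minimal sketch of computing KS, MAE, and RMSE from Eqs. (17)–(19) with scikit-learn; y_test and y_pred are assumed to be the held-out labels and the predictions of one of the fitted classifiers.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, mean_absolute_error, mean_squared_error

# y_test and y_pred are assumed outputs of one of the fitted classifiers above.
ks = cohen_kappa_score(y_test, y_pred)                  # Kappa statistic, Eq. (17)
mae = mean_absolute_error(y_test, y_pred)               # Mean Absolute Error, Eq. (18)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))      # Root Mean Squared Error, Eq. (19)
print(f"KS={ks:.3f}  MAE={mae:.3f}  RMSE={rmse:.3f}")
```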

Once the predictive model is built, we can check its efficiency. For that, we compare the accuracy measures based on precision, recall, TP rate, and FP rate values for LR, NB, C4.5, DT, RF, SVM, GB, SGDA, and KNN, as shown in Table 17.

Figures 3, 4 and 5 show the results for the mean absolute error (MAE) and the root of the squared error. The mean absolute error (MAE) measures the error between paired observations expressing the same phenomenon. The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is a commonly used measure of the discrepancies between the values predicted by a model or estimator and the observed values (sample or population values). The RMSD is the quadratic mean of the differences between predicted and observed values, or the square root of the second sample moment of these differences. Inter-rater reliability is also routinely assessed using the kappa statistic.

Fig. 3
figure 3

Training and simulation error Pima Indians diabetes database

Fig. 4
figure 4

Training and simulation error diabetes 130-US hospitals for years 1999–2008 data set

Fig. 5
figure 5

Training and simulation error Diabetes Iraqi society data set

Tables 18, 19, and 20 show that RF has the best classification performance (0.98) and the lowest error rate (0.01). We also remark that RF shows the best agreement between the reliability and validity of the results obtained.

Table 18 Training and simulation error
Table 19 Training and simulation error
Table 20 Training and simulation error

We will now study the findings acquired while measuring the efficiency of our algorithms after generating the predictive model. The best values were obtained by RF and DT (99.68% and 99.82%, respectively). Based on these findings, we can see why RF beats the other classifiers.

To diagnose diabetes, the performance of each of the nine models is assessed using parameters such as precision, recall, and F-Measure (Table 21). Tenfold cross-validation is used to avoid the problems of overfitting and underfitting. Our classifier's accuracy reveals how often it is correct to determine whether a patient has diabetes. Precision was utilized to assess the classifier's ability to make accurate positive diabetes predictions. In our research, recall or sensitivity is employed to determine the percentage of actual positive diabetes cases properly detected by the classifier. The capacity of a classifier to distinguish negative diabetes cases is measured by its specificity.

Table 21 Evaluate the efficiency and effectiveness of algorithms in terms of accuracy

Discussion

Diabetes is a collection of metabolic illnesses marked by high blood sugar levels caused by a lack of insulin secretion, insulin function, or both. Diabetes-related chronic hyperglycemia is linked to long-term damage, dysfunction, and failure of various organs, including the eyes, kidneys, nerves, heart, and blood vessels. Diabetes must be detected early to maintain a healthy lifestyle. Because diabetes cases are quickly increasing, this disease may cause global concern.

Machine learning (ML) is a computerized method for learning from experience automatically and improving performance to make more accurate predictions. Machine learning techniques are successfully used in various applications, including diagnosis. A machine learning algorithm that builds a classifier system may aid clinicians in identifying and diagnosing diseases at an early stage. We use machine learning classification techniques to improve the speed, performance, reliability, and accuracy of diagnosis for a specific ailment. This research focuses on utilizing machine learning approaches for diabetic illness detection.

Kennedy and Eberhart developed particle swarm optimization (PSO) in 1995, a population-based stochastic optimization approach. PSO models species' social behavior, such as bird flocking and fish schooling, as an autonomously evolving system. PSO refers to each candidate solution as "an individual bird of the flock," i.e., a particle in the search space. Each particle uses its memory and the swarm's collective knowledge to find the best answer (Venter 2002). Each particle has a fitness value, evaluated by a fitness function to be maximized, and a velocity that controls its movement. During movement, each particle adjusts its position depending on its own experience and the experiences of neighboring particles, moving toward the best position it or its neighbors have encountered. The particles thereby follow the current optimal particles through the problem space [9, 48, 49].

Particle swarm optimization (PSO) was used in a study by Asti Herliana et al. to choose the best diabetic retinopathy features from a dataset of diabetic retinopathy cases. The selected features are then classified via a neural network classification approach. The study's findings indicate an improved accuracy of 76.11% when using neural network-based particle swarm optimization (PSO). According to this study, the classification result improved by 4.35% when feature selection was used, compared to the prior result of 71.76% when using the neural network approach alone [57].

Using data mining techniques, Xiaohua Li and colleagues published an article on identifying diabetic patients. Preprocessing, feature selection, and classification are the three steps of the suggested method. Several combinations of the harmony search algorithm, the genetic algorithm, and the particle swarm optimization algorithm with K-means for feature selection are investigated; these combinations had not previously been examined for diabetes diagnosis applications. The diabetes dataset is classified using the K-nearest neighbor algorithm. Sensitivity, specificity, and accuracy were measured to assess the outcomes. The findings show that the proposed strategy performed better than the earlier methods tested in this paper [58], with an accuracy of 91.65%.

To diagnose various medical conditions, Mohammad Reza Daliri proposes a feature selection technique utilizing a binary particle swarm optimization algorithm. The fitness function of the binary particle swarm optimization was implemented using support vector machines. The four databases used to evaluate the suggested technique were the single proton emission computed tomography heart database, the Wisconsin breast cancer data set, the Pima Indians diabetes database, and the Dermatology data set. The findings show that heart, cancer, diabetes, and erythemato-squamous diseases could be diagnosed with a higher degree of accuracy using fewer traits. The approach produced more accurate results when compared with the F-score and information gain, two classic feature selection techniques. The suggested method also demonstrates superior accuracy on all but one of the datasets compared to the genetic algorithm for feature selection. Additionally, the methodology performs better while utilizing fewer characteristics than other methods that employ the same data [59].

Tuan Minh Le et al. [39] proposed a machine learning approach to forecast the early onset of diabetes in patients. It is a wrapper-based feature selection approach that uses Adaptive Particle Swarm Optimization (APSO) and Grey Wolf Optimization (GWO) to optimize a Multilayer Perceptron (MLP) and minimize the number of input features needed. They compared the outcomes of this strategy with several well-known machine learning algorithms, including Support Vector Machine (SVM), Decision Tree (DT), K-Nearest Neighbor (KNN), Naive Bayesian Classifier (NBC), Random Forest Classifier (RFC), and Logistic Regression (LR). Their computational findings demonstrate that, in addition to requiring significantly fewer features, higher prediction accuracy can be attained (97% for APGWO-MLP and 96% for GWO-MLP). This work has the potential to be applied in clinical practice and to become a supporting tool for doctors and physicians. In the following, the related work is compared with the proposed method (Table 22) (Fig. 6).

Table 22 Performance comparison of other feature selection techniques in diabetes diagnosis
Fig. 6 Performance comparison of other feature selection techniques in diabetes diagnosis

Numerous expert systems have been created to improve the accuracy of medical diagnostics and to assist medical diagnosis [64]. Table 22 shows that RF, SVM, and C4.5 take about 0.06 s to build their models, unlike DT, which takes only 0.01 s. Conversely, the accuracy obtained with RF (98.79%) is higher than those of LR, NB, KNN, DT, SVM, GB, SGDA, and C4.5, which range from 94.00% to 98.25%. It is also easy to see that RF has the highest number of correctly classified instances and the lowest number of incorrectly classified instances among the classifiers.

Figures 7, 8 and 9 show the accuracy of the nine classification models when applied to the dataset under the holdout, fivefold, and tenfold validation protocols; a sketch of this evaluation procedure is given after the figure captions below. As shown in Fig. 9, decision trees and random forests outperform the other algorithms.

Fig. 7 Evaluation of the efficiency and effectiveness of the algorithms using holdout validation

Fig. 8 Evaluation of the efficiency and effectiveness of the algorithms using k-fold = 5

Fig. 9 Evaluation of the efficiency and effectiveness of the algorithms using k-fold = 10

In summary, RF has demonstrated its effectiveness, efficiency, and accuracy. Compared with a series of diabetes risk prediction studies in the literature, our experimental results achieve the best value (99.82%) in diabetes risk prediction classification. RF outperforms the other classifiers in terms of accuracy, sensitivity, and specificity for classifying diabetes. Table 23 shows the performance of machine learning and data mining algorithms with the proposed method for diabetes classification.

Table 23 Performance of machine learning algorithms for diabetes classification

Through data exchange among intelligent wearables and sensors, the industrial healthcare system has improved the quality of medical services and opened the prospect of implementing enhanced real-time patient monitoring. However, a system of this kind needs to be highly accurate and error-free (Table 24).

Table 24 Performance of PSO algorithms for feature selection and diabetes classification

Additionally, as is well known, any ML algorithm that we employ, on data of any kind, must be precise, effective, and able to handle widely distributed data. A decentralized learning algorithm is better suited to managing widely scattered data, since it is more concerned with the distribution of the data. As discussed earlier in this article, the centralized learning technique on which the majority of traditional models depend has several issues. In contrast, swarm learning is a branch of artificial intelligence and machine learning research whose major focus is to evaluate the behavior of decentralized systems. A decentralized system may help us overcome the drawbacks of centralized learning techniques. The fundamental concept underlying this form of learning is drawn from PSO's method of operation.

PSO is a metaheuristic: it can search very large spaces of candidate solutions while making few or no assumptions about the problem being optimized. Furthermore, unlike traditional optimization techniques such as gradient descent and quasi-Newton methods, PSO does not use the gradient of the problem being optimized, so the optimization problem need not be differentiable. We recommend using metaheuristics such as PSO to ensure that a good solution can always be found in a decentralized system. Decentralized AI will also offer a ladder of success built on the expansion of knowledge. To achieve high accuracy and reduce error, we combined PSO with machine learning (ML) techniques [74,75,76].
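As a small illustration of this gradient-free property, the sketch below reuses the pso() function from the earlier sketch to tune two random forest hyperparameters against a non-differentiable objective (cross-validated accuracy). The search ranges, dataset, and choice of hyperparameters are assumptions made for the example, not settings from this study.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def negative_cv_accuracy(params):
    """Non-differentiable objective: PSO only needs the returned value, never a
    gradient. Continuous particle coordinates are rounded to valid hyperparameters."""
    n_estimators = max(int(round(params[0])), 10)
    max_depth = max(int(round(params[1])), 1)
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth, random_state=0)
    return -cross_val_score(model, X, y, cv=3).mean()

# Reuse the pso() sketch defined earlier; both coordinates are searched in [1, 300]
# and clipped/rounded inside the objective to valid hyperparameter values.
best, best_f = pso(negative_cv_accuracy, dim=2, n_particles=10, iters=20,
                   lower=1.0, upper=300.0)
print("best CV accuracy:", -best_f)
```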

Limitations

The benefits of machine learning techniques are numerous, but they are not without flaws that limit their potential in some respects. For example, many algorithms could be suitable for tackling a particular problem; similarly, one algorithm may perform well on a given dataset while others do not. As a result, selecting an appropriate algorithm for a given dataset can be a major hurdle in bioinformatics, as can choosing an appropriate feature selection approach. Furthermore, training ML algorithms often requires large datasets, which must be unbiased and of good quality, and data collection itself takes time.

Furthermore, ML algorithms require sufficient time for training and testing to produce highly reliable outcomes, and they demand a significant amount of hardware and other resources. In addition, it is difficult to verify the results of ML algorithms, so proving that their predictions hold in all cases is hard.

The correct analysis and interpretation of the findings generated by ML algorithms is, once again, a major challenge in their use. Finally, machine learning algorithms are prone to errors: they generate false results when trained with faulty or incomplete data, which can set off a cascade of diagnosis or medication errors that disrupt the healing process. When such problems are detected, identifying the cause of the errors takes time, and correcting them is even more difficult.

Conclusion and Future Work

Detecting the dangers of diabetes at an early stage is one of the world's most pressing health concerns. Machine learning and deep learning have been successfully applied in medical image and healthcare analysis [52], such as whole-slide pathology [54], X-ray [50], diabetes [1, 2], breast cancer [51], heart disease [53], time series [77], medicinal plants [55], the stock market [78], stroke [79], maximizing impact on social networks [35], and outcome prediction of bupropion exposure [20]. This research aims to develop a framework for predicting the likelihood of developing diabetes. This paper compared the outcomes of nine machine learning classification algorithms with various statistical measures. The dataset, collected from the UCI repository, was used for the experiments.

There are also many data processing and machine learning strategies for analyzing medical knowledge. Producing accurate and computationally affordable classifiers for medical applications is a significant challenge in data processing and machine learning. On the diabetes datasets, this study used nine primary algorithms: LR, NB, C4.5, DT, RF, SVM, GB, SGDA, and KNN. To select the best algorithm in terms of classification accuracy, we analyzed the efficiency and efficacy of the various algorithms in terms of accuracy, sensitivity, and specificity. Random forest and decision trees performed better than all other algorithms. In conclusion, DT, NB, and RF proved their strength in diagnosing and identifying diabetes and achieved the best performance, high accuracy, and a low error rate.

The findings show that by choosing fewer variables, we could diagnose diabetes with a higher degree of accuracy. Our method produced more accurate results than the usual feature selection approaches, namely the F-score and information gain. The accuracy of the suggested method is also higher than that of the genetic algorithm for feature selection (99.79% for RF using holdout, 99.59% for DT using k-fold = 5, and 99.86% for NB using k-fold = 10). Additionally, the strategy achieved superior performance while using fewer features than other methods applied to the same data. This work has the potential to be useful in clinical practice and to serve as a tool for doctors and other medical professionals.

In the future, the performance of the machine learning classifiers can be improved by feature subset selection using the Ant Colony Optimization algorithm and by employing methods such as XGBoost, Extreme Learning Machine, ensemble learning classifiers, and neural networks.