1 Introduction

Diabetes is a chronic disease affecting the world’s general population. Diabetic patients have elevated blood glucose levels and are at risk of developing life-threatening medical conditions in the future. Research [1] has shown that 693 million people will be affected by diabetes by 2045. Diabetes is classified into type-I, type-II, gestational diabetes, maturity-onset diabetes of the young (MODY), and other variants. Although a variety of variants have been identified, the most common are type-I and type-II. Type-I diabetes occurs when the immune system attacks the insulin-producing beta cells. It typically develops early in life, although it can also appear later. Type-II diabetes is also known as non-insulin-dependent diabetes mellitus (NIDDM) or adult-onset diabetes. It normally affects people in middle age and makes their bodies insulin resistant. Insulin resistance is a condition in which the body’s cells cannot absorb glucose. Type-II diabetes is mainly caused by excess weight, lack of exercise, and genetic factors. People who follow a poor diet and are physically inactive [2, 3] tend to develop type-II diabetes more often than those who follow a healthy diet and exercise routine. The complications of type-II diabetes include renal failure, blindness, bleeding disorders, hypertension, impaired wound healing, heart disease, stroke, and neurodegeneration [4]. Early prediction of type-II diabetes is essential for reducing risk and delaying chronic complications over a patient’s lifetime. The assessment of risk factors for type-II diabetes assists in the monitoring of undiagnosed patients at risk.

The identified factors linked to increased risk of type-II diabetes [5] are adults with a BMI of 25 kg/m² or higher, overweight women planning for pregnancy, individuals with a family history of diabetes, signs of insulin resistance, physical inactivity, etc. Glucose concentration in the blood plays a significant role in predicting diabetes. Diabetes develops when an individual’s fasting plasma glucose (FPG) exceeds an average of 126 mg/dl. The FPG level is tested after eight to 12 h of fasting, and the postprandial blood glucose (PPG) level is tested after a meal. A 2-h plasma glucose level of 140 mg/dl or higher indicates that a patient has greater potential for type-II diabetes [6]. A PPG value ≥160 mg/dl indicates the possibility of type-II diabetes and also cardiovascular disease. FPG has a higher degree of specificity than PPG in predicting diabetes. Glycosylated hemoglobin, or hemoglobin A1C (HbA1c) [7], is used to evaluate the sugar level bound to red blood cells. An HbA1c value of ≥6.5% indicates type-II diabetes.

In recent years, the possibility of identifying type-II diabetes at an earlier stage and increasing classification accuracy has been addressed by machine-learning algorithms [8,9,10,11]. A wide range of Machine Learning (ML) algorithms [12, 13] have been proposed for diagnosing type-II diabetes. Machine Learning enables predictive modeling, an approach that develops a mathematical model to offer accurate predictions [14]. The vast amount of healthcare data generated daily worldwide makes this process achievable. Clinical data analysis offers better healthcare solutions to patients and also aids in financial and operational improvements. Integrating ML algorithms and data systems enables earlier prediction of type-II diabetes.

In recent years, a wide range of clinical studies has focused on predicting type-II diabetes in order to evaluate the associated risk factors effectively. R. Delshi Howsalya Devi et al. [15] used the Farthest First Clustering (FFC), Support Vector Machine (SVM), and Sequential Minimal Optimization (SMO) algorithms to present a hybrid approach for type-II diabetes prediction. Bum Ju Lee et al. [16] argued that hypertriglyceridemic waist (HW) and waist circumference (WC) are the most strongly associated risk factors for type-II diabetes, whereas triglyceride (TG) is not considered a strong factor in comparison. Hang Lai et al. [17] developed an interactive computer program to help doctors predict the risk of diabetes in their patients and provide preventive measures. They suggested that their proposed model performs better in detecting type-II diabetes than models using Random Forest and Decision Tree. Karim M. Orabi et al. [18] designed a predictive system for type-II diabetes, which predicts the age at which a person is prone to become diabetic by using regression techniques and random code mechanisms. However, the researchers did not consider the crucial risk factors associated with type-II diabetes, and therefore the model suffered from low precision values. Namrata Singh et al. [19] developed a hybrid model utilizing an ensemble-based approach (XGBoost) to extract rules from SVM to diagnose hypertension among diabetic patients. However, they did not handle the class imbalance problem present in the dataset.

Researchers have integrated several ML techniques to predict diseases related to type-II diabetes and other conditions in healthcare [20]. Xiao-lu XIONG et al. [21] adopted a cross-sectional retrospective study in Chinese adults using different ML algorithms (AdaBoost (AD), Multilayer Perceptron (MLP), SVM, Trees Random Forest (TRF), and Gradient Tree Boosting (GTB)) to choose the most appropriate technique for predicting the critical risk factors present in type-II diabetic patients. This study evaluated eleven risk factors that are closely related to the disease. Bassam Farran et al. [22] combined four different ML techniques (Multifactor Dimensionality Reduction (MDR), k-nearest neighbors (k-NN), conventional Logistic Regression (LR), and SVM) to create predictive models for evaluating the likelihood of hypertension, type-II diabetes, and their comorbidity using the Kuwaiti national health dataset. This work builds a predictive model using non-intrusive data to identify diabetic and hypertensive patients at high risk. Here the risk factors associated with type-II diabetes are BMI, ethnicity, and family history of diabetes.

Bassam Farran et al. [23] proposed a prognostic model to predict future risk of type-II diabetes (within 3, 5, and 7 years) using LR, k-NN, and SVM on Kuwait Health Network data. They demonstrated that the model could identify subjects at higher risk of type-II diabetes and support prognosis at an early stage. In our earlier work (A. Sheik Abdullah et al. [24]), an enhanced combination of PSO and Decision Trees was used to examine the risk factors correlated with type-II diabetes. That study evaluated the risk factors associated with diabetes using a mathematical model named Fisher’s Linear Discriminant Analysis (FLDA) on the discovered attributes. Han Wu et al. [25] used K-Means and LR algorithms to implement a data mining technique for the prediction of type-2 diabetes mellitus; this model suffers from the high time complexity of its pre-processing phase. Hamid R. Marateb et al. [26] applied an Expert-Based Fuzzy Micro Albuminuria (EBFMA) classifier, PSO, and multiple LR techniques to identify microalbuminuria in type-II diabetes patients without measuring urinary albumin. The limitations of this study are its small sample size and cross-sectional nature; the authors did not consider a larger sample size for detailed investigations. In recent years, evolutionary methods [27,28,29,30,31] have achieved better performance in engineering and medical applications.

A novel approach based on transfer learning has been deployed for human gene regulatory network reconstruction (Paolo Mignone et al. [32]). The Gene Regulatory Network (GRN) is mainly reconstructed from gene expression data to identify the regulatory schemes used in characterizing different human diseases. The main problem considered there is overcoming the low availability of labeled samples while making full use of the unlabeled samples. Emanuele Pio Barracchia et al. [33] presented a link prediction approach using hierarchical clustering to discover ncRNA relationships in a heterogeneous network comprising different biological attributes. This approach offers increased prediction accuracy along with predictions at different levels of granularity. Methods of this kind fall into two categories, namely algorithm-based and similarity-based techniques. Table 1 provides a summary of the different methods proposed for detecting type-II diabetes and its associated symptoms.

Table 1 A summary of type-II diabetes prediction models in the literature

The present study proposes an automatic type-II diabetes diagnosis system utilizing KELM and a hybrid PSO-AFSO optimization algorithm, which improves the classification accuracy of type-II diabetes prediction. In the first phase, the input training dataset is divided into five sub-samples by a fivefold cross-validation approach, with 90% of the data used to train the model and 10% held out to test it. In the second phase, a Hybrid Particle Swarm Optimization-Artificial Fish Swarm Optimization (HAFPSO) algorithm is applied to optimize the base learners and improve accuracy. In the final phase, the proposed KELM framework is deployed to automatically select the appropriate classifier based on the derived features. The chosen classifier determines whether a person is at risk of developing type-II diabetes. The experiments were conducted on the Pima Indian Diabetes Dataset (PIDD) and a Diabetes Research Centre dataset.

The key goals of this research work are summed up as follows:

  i. Utilizing two datasets (PIDD and Diabetes Research Centre) to identify the important risk factors associated with type-II diabetes.

  ii. Formulation of the KELM-HAFPSO stacking method to assess type-II diabetes accurately.

  iii. Solving the multi-objective selection problem with the HAFPSO algorithm by concurrently optimizing the two benchmark values (classification accuracy and number of classifiers used).

  iv. Testing the efficiency of the proposed HAFPSO algorithm on nine standard benchmark functions (F1-F9).

  v. Comparing the performance of the KELM-HAFPSO classifier with various state-of-the-art classifiers in terms of accuracy, specificity, sensitivity, Matthews correlation coefficient, and Kappa statistics.

The remainder of this paper is structured as follows. Section 2 introduces the problem statement and the models of the current prediction framework; Section 3 demonstrates the formulation of the HAFPSO algorithm; Section 4 describes the proposed KELM methodology; Section 5 explains our study's methodological setup; Section 6 describes the datasets and various important aspects used for experimental evaluation; Section 7 presents the experimental results and discussion; and Section 8 concludes the paper.

2 Problem statement

Type-II diabetes prediction is a binary classification problem that splits the patients involved in this study into two categories, namely a positive class (diagnosed with type-II diabetes) and a negative class (not diagnosed with type-II diabetes). This study proposes a KELM-based HAFPSO stacking approach to identify whether a subject is likely to develop type-II diabetes within 5 years based on several risk factors as predictors. The HAFPSO algorithm is used with the two datasets to perform multi-objective optimization over the classifiers. In the medical domain, Classification Accuracy (CA) is crucial for predicting disease at an early stage. Four indicators, namely True-Positive (Aj1), True-Negative (Aj2), False-Positive (Aj3), and False-Negative (Aj4), are used to compute this objective function. The objective function CA is modeled using the four random reference variables Aj1 = I{ELj = PLj = PC}, Aj2 = I{ELj = PLj = NC}, Aj3 = I{ELj ≠ PLj = PC}, and Aj4 = I{ELj ≠ PLj = NC}. Here PC and NC denote the actual positive and negative classes respectively.

The exact label, which indicates the actual presence of the disease, is denoted by ELj, and the predicted label of the proposed model by PLj. For every sample, exactly one of the four random reference variables equals one, i.e., \( \sum_{k=1}^{4} A_{jk} = 1, \;\forall j \). The aggregate counts are \( TruePositive = \sum_{j=1}^{N} A_{j1} \), \( TrueNegative = \sum_{j=1}^{N} A_{j2} \), \( FalsePositive = \sum_{j=1}^{N} A_{j3} \), and \( FalseNegative = \sum_{j=1}^{N} A_{j4} \).

The proposed model addresses a two-class classification problem on the PIDD and Diabetes Research Centre datasets, where N is the total number of samples in the dataset. The N samples are divided into positive (p) and negative (n) counts using the following expressions: \( p = TP + FN = \sum_{j=1}^{N} A_{j1} + \sum_{j=1}^{N} A_{j4} \) and \( n = TN + FP = \sum_{j=1}^{N} A_{j2} + \sum_{j=1}^{N} A_{j3} \).

This analysis uses two benchmark values, specified in the following equations:

$$ \text{Maximize}\; B_1 = CA = \frac{\sum_{j=1}^{N} A_{j1} + \sum_{j=1}^{N} A_{j2}}{\sum_{j=1}^{N}\sum_{k=1}^{4} A_{jk}} $$
(1)
$$ \text{Minimize}\; B_2 = NBC = \frac{N_{CS}}{T_{NC}} $$
(2)

where NCS is the number of classifiers selected by the HAFPSO stacking approach and TNC denotes the total number of classifiers available. The main aim of this paper is to maximize the CA value while identifying the optimum number of base classifiers required; NBC represents the kernel complexity of the chosen optimum learners. The optimum learners are selected based on the observation that different learners make diverse classification errors, so combining them can enhance the classification accuracy.
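As an illustration (our own sketch, not the authors' code), the two benchmark values of Eqs. (1) and (2) can be computed from per-sample labels; the function and variable names are assumptions made for this sketch.

```python
import numpy as np

def benchmark_values(exact_labels, predicted_labels, selected_mask):
    """exact_labels, predicted_labels: arrays of 0/1 labels (1 = positive class PC).
    selected_mask: boolean array over all available classifiers (True = selected)."""
    el = np.asarray(exact_labels)
    pl = np.asarray(predicted_labels)
    tp = np.sum((el == 1) & (pl == 1))   # sum of A_j1 over samples
    tn = np.sum((el == 0) & (pl == 0))   # sum of A_j2
    fp = np.sum((el == 0) & (pl == 1))   # sum of A_j3
    fn = np.sum((el == 1) & (pl == 0))   # sum of A_j4
    b1 = (tp + tn) / (tp + tn + fp + fn) # Eq. (1): maximize CA
    sel = np.asarray(selected_mask)
    b2 = sel.sum() / sel.size            # Eq. (2): minimize NBC
    return b1, b2
```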

3 Formulation of HAFPSO

Swarm Intelligence is a collection of algorithms inspired by nature’s way of solving problems. The complexity that emerges in the diabetes prediction model can be handled using swarm intelligence.

  i. PSO for global search

The PSO algorithm [34,35,36] is inspired by the social behavior of bird flocking, fish schooling, and swarm theory for solving continuous optimization problems. In addition, it provides fast convergence toward an optimal solution. Particles are candidate solutions that fly through the multi-dimensional search space to seek an optimal solution for the given problem. In the swarm optimization algorithm, a particle (or an artificial fish) represents a solution in the search space. In the initialization phase, each particle is allotted a random initial position and velocity. The position of the particle indicates the solution quality based on the value of the objective function, and each particle memorizes the best position it has found. Velocity is a weighted sum of three components, namely the old velocity, the velocity toward the particle’s previous best solution, and the velocity toward the neighborhood’s best solution.

The PSO swarm consists of a set of particles, also known as the initial population, P = {p1, p2, …, pn}. The quality of a particle’s position, which represents a candidate solution, is measured by the fitness function f. At any time step t, particle pj has an associated position \( \overrightarrow{a}_j^t \) and velocity \( \overrightarrow{v}_j^t \). The best solution found by pj so far, with respect to the fitness function, is denoted \( \overrightarrow{b}_j^t \). The particle pj also receives information from its neighbors Nj ⊆ P. The PSO algorithm is initialized by generating random positions for the population within the search region R. The velocities are usually also initialized randomly, but they are sometimes set to zero or to minimal values to prevent the particles from exiting the search space in the first iteration. When the algorithm enters its main loop, the velocity and position of each particle are iteratively updated until convergence. The update rules are described below.

$$ {\overrightarrow{v}}_j^{t+1}=w{\overrightarrow{v}}_j^t+\kern0.36em {\alpha}_1{\overrightarrow{M}}_1^t\left({\overrightarrow{b}}_j^t-{\overrightarrow{a}}_j^t\right)+{\alpha}_2{\overrightarrow{M}}_2^t\left({\overrightarrow{n}}_j^t-{\overrightarrow{a}}_j^t\right) $$
(3)
$$ {\overrightarrow{a}}_j^{t+1}={\overrightarrow{a}}_j^t+{\overrightarrow{v}}_j^{t+1} $$
(4)

where w represents the inertia weight, α1 and α2 are the acceleration coefficients, and \( \overrightarrow{M}_1 \) and \( \overrightarrow{M}_2 \) are two diagonal matrices with random numbers as the main diagonal entries. The random numbers are drawn from the interval [0, 1], and both matrices are regenerated at the end of each iteration. The vector \( \overrightarrow{n}_j^t \) indicates the best solution found in the neighborhood of particle pj, identified as follows:

$$ f\left({\overrightarrow{n}}_j^t\right)\le f\left({\overrightarrow{b}}_j^t\right)\kern0.96em \forall {p}_j\in {N}_j $$
(5)
Algorithm 1 Pseudocode of the PSO algorithm

The values of w, α1, and α2 should be chosen appropriately to prevent the velocity from diverging. The acceleration coefficients usually take values between zero and four. The PSO algorithm pseudocode is shown in Algorithm 1, and a minimal code sketch follows the list below. The velocity update rule combines three components that generate the local behavior of the particles, listed as follows:

  • Inertia: It keeps the particle moving along its previous flight direction and prevents it from changing direction (velocity) abruptly.

  • Cognitive component: It draws the particle back toward its own previous best solution.

  • Social component: It assesses the performance of the particle relative to its neighbors and usually represents the group norm or standard to be attained.
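A minimal sketch of the update rules of Eqs. (3) and (4), assuming a NumPy environment; the function name and default parameter values are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def pso_step(a, v, pbest, nbest, w=0.7, alpha1=2.0, alpha2=2.0, rng=np.random):
    """a, v: (n_particles, dim) positions and velocities.
    pbest: each particle's best position so far; nbest: its neighborhood's best."""
    m1 = rng.rand(*a.shape)  # random diagonal matrices act element-wise on the vectors
    m2 = rng.rand(*a.shape)
    v_new = w * v + alpha1 * m1 * (pbest - a) + alpha2 * m2 * (nbest - a)  # Eq. (3)
    a_new = a + v_new                                                      # Eq. (4)
    return a_new, v_new
```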

  ii. AFSO for local search

The AFSO algorithm mathematically models the collective movement and social behavior of fish [37]. The algorithm is highly convergent, fast, versatile, and accurate. It imitates the behaviors of fish (preying, swarming, and following) to reach the global optimum. The environment in which an artificial fish (Af) lives is the solution space, which also contains the states of all Af present. The next behavior of an Af relies on its current state as well as its local environment state. First, the algorithm generates potential solutions randomly, and then it searches for the optimal solution. Each Af analyzes its surroundings using its vision. The current state of the Af is represented as Cs, its visual distance as Vd, and its visual position as Vp. The Af moves to its next state (Nstate) if it finds the state at Vp better than Cs. Let Cs = (c1, c2, …, cn) and \( P_v = (p_1^v, p_2^v, \dots, p_n^v) \); then the random search process is expressed as follows:

$$ {p}_i^v={p}_i+{V}_d\bullet \mathit{\operatorname{rand}}\left(\right),\kern3.359999em i\in \left(0,n\right) $$
(6)
$$ {N}_{state}={C}_s+\frac{V_p-{C}_s}{\left\Vert {V}_p-{C}_s\right\Vert}\bullet Step.\mathit{\operatorname{rand}}\left(\right) $$
(7)

where rand() generates a random number in the interval [0, 1], n is the number of variables used, and Step is the step length. The AFSO algorithm includes three behaviors, namely Af-prey, Af-swarm, and Af-follow.

  a) Af-prey: This behavior models the Af chasing its prey (food). The candidate state within the Af’s random visual distance is represented as Cv, and F represents the prey concentration value (objective function/fitness value). As the value of Vd increases, the Af finds its global extreme value more rapidly and converges sooner.

$$ {C}_v={C}_s+{V}_d\bullet \mathit{\operatorname{rand}}\left(\right) $$
(8)
$$ {C}_s^{\left(t+1\right)}={C}_s^{(t)}+\frac{C_v-{C}_s^{(t)}}{\left\Vert {C}_v-{C}_s^{(t)}\right\Vert}\bullet Step.\mathit{\operatorname{rand}}\left(\right) $$
(9)
  b) Af-swarm: The Af always moves in swarms (groups) to survive in its colony and avoid potential threats.

$$ {C}_s^{\left(t+1\right)}={C}_s^{(t)}+\frac{C_{centre}-{C}_s^{(t)}}{\left\Vert {C}_{centre}-{C}_s^{(t)}\right\Vert}\bullet Step.\mathit{\operatorname{rand}}\left(\right) $$
(10)

Let Ccentre be the centre position, nc the number of companions, χ the crowd factor, and n the number of Af. If Fcentre > Fs and \( \frac{n_c}{n} < \chi \), the companion centre has more food (a higher fitness value) and is less crowded, so the Af moves toward the companion centre.

Algorithm 2 Pseudocode of the AFSO algorithm
  c) Af-follow: The following behavior represents the swarm moving toward one or more fish that have found food. If Fv > Fs and \( \frac{n_c}{n} < \chi \), the companion at Cv has more food (a higher fitness value) and is less crowded, so the Af moves toward Cv.

Here Fv denotes the fitness value at Cv, Fcentre the fitness value at Ccentre, and Fs the fitness value at Cs. The AFSO algorithm is presented in Algorithm 2; a minimal sketch of the behaviors follows.
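A hedged sketch of the AFSO movement and behaviors of Eqs. (7)-(10), assuming a fitness function to be maximized; all names and the try_number fallback are illustrative assumptions, not the authors' code.

```python
import numpy as np

def move_towards(cs, target, step, rng=np.random):
    """Move from state cs towards target by a random fraction of 'step' (Eqs. 7, 9, 10)."""
    d = target - cs
    norm = np.linalg.norm(d)
    if norm == 0.0:
        return cs.copy()
    return cs + d / norm * step * rng.rand()

def af_prey(cs, fitness, visual, step, try_number=10, rng=np.random):
    """Sample up to try_number candidate states within the visual range (Eq. 8)."""
    for _ in range(try_number):
        cv = cs + visual * rng.rand(*cs.shape)        # Eq. (8)
        if fitness(cv) > fitness(cs):
            return move_towards(cs, cv, step, rng)    # Eq. (9)
    return cs + step * (rng.rand(*cs.shape) * 2 - 1)  # random step if no better state found

def af_swarm(cs, companions, n_total, fitness, step, chi, rng=np.random):
    """Move toward the companion centre when it is fitter and uncrowded (Eq. 10)."""
    centre = companions.mean(axis=0)
    if fitness(centre) > fitness(cs) and len(companions) / n_total < chi:
        return move_towards(cs, centre, step, rng)
    return None  # centre not better or too crowded: fall back to Af-prey
```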

  iii. HAFPSO optimization

To train the KELM network, the HAFPSO optimization algorithm is used, which makes full use of both component algorithms. It combines the behavior of the artificial fish in the swarm with the particle information of PSO [38]. First, the PSO algorithm is applied to the type-II diabetes prediction problem to perform the global search. When PSO terminates, the AFSO algorithm starts; it takes the final best population from PSO as its initial population and then conducts the local search. The AFSO performs the three artificial-fish behaviors to obtain the final fitness value (best solution). The HAFPSO algorithm is summarized in the steps below, its flowchart is shown in Fig. 1, and a compact code sketch of the two-phase procedure follows the steps.

Fig. 1 Flowchart of HAFPSO

  Step 1: Initialize the population of PSO and set its iteration counter b = 0.

  Step 2: Check the convergence condition of PSO: if b = bmax-PSO, go to Step 9; otherwise go to Step 3. Here bmax-PSO indicates the maximum number of PSO iterations.

  Step 3: Calculate the fitness value of each particle in the population and sort the particles by fitness.

  Step 4: Update the local best position and global best position experienced by the jth particle.

  Step 5: Update the velocity of each particle using equation (3). This prevents the particles from moving in the wrong direction in the search space.

  Step 6: Update the position of each particle using equation (4). The position indicates the current best solution and fitness value.

  Step 7: Repeat Steps 4 to 6 until the maximum number of iterations (bmax-PSO) is reached.

  Step 8: Set b = b + 1 and go to Step 2.

  Step 9: Initialize the parameters of AFSO and select the elite population of PSO as the initial population of AFSO. The scoreboard is updated with the best particle from the elite population. The iteration counter b is reset to 0.

  Step 10: Check whether the convergence condition of AFSO is met. If b = bmax-AFSO, the algorithm converges and the result is written to the scoreboard; here bmax-AFSO is the maximum number of AFSO iterations. Otherwise go to Step 11.

  Step 11: Function selection: each function represents one of the behaviors (Af-prey, Af-swarm, and Af-follow) exhibited by the artificial fish. The artificial fish finds its food (best fitness value) by simulating these three behaviors. If a behavior yields an improved fitness value, perform that behavior; if none does, select the Af-prey function by default.

  Step 12: Update the scoreboard: compare the best fitness value of each artificial fish with the scoreboard. If the fitness value of an artificial fish is superior to the fitness value in the scoreboard, update the scoreboard.

  Step 13: Set b = b + 1 and go to Step 10.
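A compact sketch of the two-phase flow above (a PSO global-search phase whose elite final population seeds the AFSO local-search phase), reusing the pso_step and af_prey helpers sketched earlier; population sizes, iteration counts, bounds, and the maximization convention are illustrative assumptions.

```python
import numpy as np

def hafpso(fitness, dim, n_particles=50, b_max_pso=200, b_max_afso=200,
           bounds=(-1.0, 1.0), visual=3.5, step=0.3, rng=np.random):
    # Step 1: initialize positions and velocities
    a = rng.uniform(bounds[0], bounds[1], (n_particles, dim))
    v = np.zeros_like(a)
    pbest = a.copy()
    pbest_fit = np.array([fitness(x) for x in a])
    for _ in range(b_max_pso):                      # Steps 2-8: PSO global search
        fit = np.array([fitness(x) for x in a])
        improved = fit > pbest_fit
        pbest[improved], pbest_fit[improved] = a[improved], fit[improved]
        gbest = pbest[np.argmax(pbest_fit)]
        a, v = pso_step(a, v, pbest, np.tile(gbest, (n_particles, 1)), rng=rng)
    swarm = pbest.copy()                            # Step 9: elite population seeds AFSO
    scoreboard = swarm[np.argmax(pbest_fit)].copy()
    for _ in range(b_max_afso):                     # Steps 10-13: AFSO local search
        for i in range(n_particles):
            swarm[i] = af_prey(swarm[i], fitness, visual, step, rng=rng)  # Step 11
            if fitness(swarm[i]) > fitness(scoreboard):                   # Step 12
                scoreboard = swarm[i].copy()
    return scoreboard
```

For a minimization benchmark such as F1-F9, one would maximize the negated objective in this convention.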

4 ELM and KELM

4.1 Extreme learning machine (ELM)

ELM is a novel technique for classifying patterns and approximating functions. It is a single-hidden-layer feed-forward neural network [39]. Weights between the inputs and hidden nodes are assigned randomly and remain constant throughout the training and prediction phases. The weights that link the hidden nodes to the output can be trained very quickly. ELM improves prediction accuracy, gives better generalization performance, reduces the risk of overfitting, and lowers computational cost [40]. The ELM model first constructs a classification model from the input dataset D = {dl ∈ Rf}, where l = 1, 2, …, n, with n samples and f input features. The ELM network consists of i input units, g hidden neurons, and o outputs, and its model output can be formulated as follows:

$$ {c}_i(l)={r_i}^Tq(l) $$
(11)

where the weight vector ri ∈ Rg, i ∈ {1, 2, 3, …, o}, connects the hidden neurons to the ith output neuron, and q(l) ∈ Rg is the hidden-layer output vector for the corresponding input sample d(l) ∈ Rf. The vector q(l) is derived as follows:

$$ q(l) = \left[f\left(w_1^Td(l)+b_1\right), f\left(w_2^Td(l)+b_2\right), \dots, f\left(w_g^Td(l)+b_g\right)\right] $$
(12)

Here bk (k = 1, 2, …, g) denotes the bias of the kth hidden neuron, wk ∈ Rf is the weight vector of the kth hidden neuron, and f(·) is the sigmoid activation function. Both the weights (wk) and biases (bk) are generated randomly from a Gaussian distribution. Taking the weights and biases as input, the next step generates the hidden-layer output matrix HO. Accordingly, the weight matrix W = [r1, r2, …, ro] is estimated by the Moore-Penrose pseudo-inverse approach as follows:

$$ W={\left({H}_o\times {H_o}^T\right)}^{-1}\;{H}_o\times {P}^T $$
(13)

where P = [p(1), p(2), …, p(n)] denotes an o × n matrix whose lth column is the actual target vector p(l) ∈ Ro. After the ELM network’s parameters are identified, the class label for type-II diabetes prediction is obtained as follows:

$$ L = \underset{l=1,2,\dots,o}{\arg\max}\left\{{Out}_l\right\} $$
(14)

Here, the predicted class label is represented as L in the above equation.
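As an illustration of Eqs. (11)-(14) (our own sketch, not the authors' code), an ELM can be trained with random hidden parameters and a pseudo-inverse solve; all names are assumptions.

```python
import numpy as np

def train_elm(D, T, g, rng=np.random):
    """D: (n, f) input samples; T: (n, o) one-hot target matrix; g: hidden neurons."""
    f = D.shape[1]
    W_in = rng.randn(g, f)                         # random hidden weights, kept fixed
    b = rng.randn(g)                               # random biases
    H = 1.0 / (1.0 + np.exp(-(D @ W_in.T + b)))    # Eq. (12): sigmoid hidden outputs
    W_out = np.linalg.pinv(H) @ T                  # Moore-Penrose solution, cf. Eq. (13)
    return W_in, b, W_out

def predict_elm(D, W_in, b, W_out):
    H = 1.0 / (1.0 + np.exp(-(D @ W_in.T + b)))
    return np.argmax(H @ W_out, axis=1)            # Eq. (14): predicted class label L
```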

4.2 KELM

KELM is a kernel formulation of the ELM, which can be derived as follows. For S arbitrary distinct samples {(ai, ti) | ai ∈ Rm, ti ∈ Rn, i = 1, 2, …, S}, the output function of the ELM with X hidden neurons is:

$$ {f}_X(a)=\sum \limits_{i=1}^X{\beta}_i{h}_i(a)=h(a)\;\beta $$
(15)

The output weight vector between the X hidden neurons and the output is denoted as β = [β1, β2, …, βX], and h(a) = [h1(a), h2(a), …, hX(a)] is the hidden-layer output vector with respect to the input a, whose sole purpose is to map the data from the input space to the KELM’s feature space. Reducing both the training error and the norm of the output weights improves the generalization performance of the KELM, i.e.,

$$ \mathit{\operatorname{Minimize}}:\left\Vert A\beta -O\right\Vert, \left\Vert \beta \right\Vert $$
(16)

The least-squares solution of equation (16) is based on Karush-Kuhn-Tucker theory [41] and can be derived as shown in equation (17).

$$ \beta ={A}^T{\left(\frac{1}{C}+{AA}^T\right)}^{-1}O $$
(17)

Here A indicates the hidden-layer output matrix, O the expected sample output matrix, and C the regularization coefficient. The ELM learning algorithm’s output function is then derived as follows.

$$ f(a)=h(a)\;{A}^T{\left(\frac{1}{C}+{AA}^T\right)}^{-1}\;O $$
(18)

In the above equation, the feature mapping h(a) is initially unknown. In KELM, the kernel matrix is defined through Mercer’s condition [42] as follows:

$$ M = AA^T;\quad m_{xy} = h\left(a_x\right)\cdot h\left(a_y\right) = k\left(a_x, a_y\right) $$
(19)

From the above equation, the output of the KELM can be derived as

$$ f(a) = \left[k\left(a, a_1\right), \dots, k\left(a, a_N\right)\right]{\left(\frac{1}{C}+M\right)}^{-1}O $$
(20)

where M = AAT and k(a, b) is the kernel function of the hidden neurons in the KELM. The feature selection approach utilized by KELM follows a Leave-One-Out Error (LOOE) scheme [43] for faster convergence and for finding the optimal feature subsets.

This paper uses four kernel functions as four base learners (BL): Linear KELM (Lin-KELM), Polynomial KELM (Pol-KELM), Sigmoid KELM (Sig-KELM), and Gaussian KELM (Gaus-KELM).

  i. Lin-KELM: It offers the best performance on larger datasets, generating a good solution to the optimization problem while increasing predictive performance, and it produces results in less time.

$$ k\left({a}_i,{a}_j\right)={a_i}^T.{a}_j $$
(21)
  ii. Pol-KELM: The Pol-KELM captures similarities and feature interactions in the input samples. The exponent of the polynomial kernel is usually at least one; when it is less than one, the kernel is said to be a fractional polynomial. Here d indicates the polynomial degree.

$$ k\left({a}_i,{a}_j\right)={\left(\gamma\;{a_i}^T{a}_j+C\right)}^d $$
(22)
  iii. Sig-KELM: The Sig-KELM kernel is similar to the sigmoid function used in Logistic Regression.

$$ k\left({a}_i,{a}_j\right)=\tanh\;\left(\gamma\;{a_i}^T{a}_j+C\right) $$
(23)
  iv. Gaus-KELM: In Gaus-KELM, the input samples are mapped into a higher-dimensional space in a non-linear fashion. No prior knowledge is used in determining the parameter γ.

$$ k\left(a_i, a_j\right) = \exp\left(-\gamma{\left\Vert a_i - a_j\right\Vert}^2\right),\quad \gamma > 0 $$
(24)

γ and C are the kernel parameters; C is a constant that trades off higher- and lower-order features in the input dataset. In this paper, KELM is combined with cross-validation in the training phase to yield higher prediction accuracy and reduce the overtraining problem. A fivefold cross-validation scheme is first constructed to find the fitting parameters from the training dataset. The cross-validation-based model selection automates the choice among the four kernel functions and reduces the overfitting problem [20]. This automated process can sometimes degrade system performance when it does not consider the whole model-fitting process, so the HAFPSO optimization algorithm is used to boost system performance.
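A hedged sketch of KELM training and prediction with the four kernels of Eqs. (21)-(24), using the closed form of Eqs. (17) and (20); the function names and default parameter values are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def kernel(A, B, kind="gaussian", gamma=0.01, c=0.5, d=1):
    if kind == "linear":                              # Eq. (21)
        return A @ B.T
    if kind == "polynomial":                          # Eq. (22)
        return (gamma * (A @ B.T) + c) ** d
    if kind == "sigmoid":                             # Eq. (23)
        return np.tanh(gamma * (A @ B.T) + c)
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return np.exp(-gamma * sq)                        # Eq. (24): Gaussian

def train_kelm(X, T, C=1.0, **kw):
    """X: (n, f) training samples; T: (n, o) target matrix. Returns dual coefficients."""
    M = kernel(X, X, **kw)                            # Eq. (19): kernel matrix
    return np.linalg.solve(np.eye(len(X)) / C + M, T) # Eq. (17) in kernel form

def predict_kelm(X_new, X_train, alpha, **kw):
    return kernel(X_new, X_train, **kw) @ alpha       # Eq. (20)
```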

5 Proposed approach

This work proposes the KELM-HAFPSO stacking method to construct a type-II diabetes prediction model. The method is described in the subsections below.

5.1 The KELM stacking approach

This section describes the new KELM method for the diagnosis of type-II diabetes from different data samples. The whole PIDD and physical examination datasets are split into three parts: training, validation, and testing. The training set is used to fit the model learners, and the prediction error is calculated on the validation set. This study utilizes a learning process in which the HAFPSO algorithm performs model selection over the twenty constructed base learners (BL), combined through a stack-based integration approach, as shown in Fig. 2. The KELM learning model comprises two core modules. The first module performs base-level learning, constructing the BL from the training dataset. The second module is the multi-objective generative module, which generates an optimal solution by jointly optimizing the number of base learners and the CA value. The HAFPSO algorithm uses the validation set to perform the model selection procedure, and the selected models are integrated via a stacking method. The following subsections elaborate on these two modules.

Fig. 2 Working of the proposed KELM-HAFPSO architecture

5.2 Base-level learning module

This module constructs the base learners from the training dataset. To build a KELM learning model with high prediction accuracy, four kernel functions are employed as base learners. The whole PIDD dataset and the physical examination dataset are each divided randomly into two parts in a ratio of 90 to 10. The training and validation sets are constructed from the 90% portion of each dataset, which is used for both model construction and estimation of the prediction error. The testing set consists of the remaining 10% and is used to evaluate the generalization error. Five distinct training samples are obtained through fivefold cross-validation, and the four learners (Lin-KELM, Pol-KELM, Sig-KELM, and Gaus-KELM) are applied to each of them, giving twenty BL in total. The validation set estimates the fitness of each solution and identifies the optimal solution, while the testing set evaluates the performance of the KELM model. The diversity of the four kernel functions is an essential ingredient of an efficient KELM model: the trained BL should be diverse yet complementary so as to extract the maximum information from the metadata used for prediction. The crucial factor at the base level is producing a sufficient number of diverse KELMs. The five training samples generated by fivefold cross-validation give rise to distinct base learners, and further diversity comes from the bias values of the four base learners; base learner selection is therefore a significant factor in the performance evaluation of the proposed model.

5.3 Multi-objective generative module

An integrated model shows better productivity than its individual counterparts. Parameter analysis and model construction are important aspects of the proposed model. The multi-objective generative module is the HAFPSO stacking approach, which incorporates both the selection techniques and the combination of models. The training data provides the candidate solutions, and the validation data determines the candidate solutions’ fitness values at each iteration of the generative process.

  i. HAFPSO for parameter analysis: This section discusses how HAFPSO selects the base classifiers for the classification of both datasets and achieves a higher CA value. Model selection is a twofold problem: producing a solution of higher accuracy while using the minimum number of base classifiers. This optimization process is necessary for finding the right combination of base learners. A binary coding scheme is employed to encode a solution to the type-II diabetes problem: the bit value “1” indicates that the corresponding model is selected, and “0” indicates that it is not. Therefore, the length of the solution equals the number of BL used. After model selection is completed, the first benchmark value (CA) is maximized and the second benchmark value (NBC) is minimized.

Evolutionary algorithms based on swarm intelligence are adopted to solve multi-objective optimization problems and hence are also known as multi-objective optimization algorithms. By investigating the generated candidate solutions, such an algorithm selects the potential optimal solution through both global and local search. The HAFPSO algorithm combines the benefits of both component algorithms, resulting in improved accuracy, faster convergence, and enhanced global searching ability. The AFSO algorithm is incorporated into the PSO iteration process; it not only avoids premature convergence in PSO but also improves the exploration and exploitation phases by increasing the diversity of the swarm. This study formulates type-II diabetes prediction as a multi-objective problem whose objectives must be optimized simultaneously. Since these goals are inherently conflicting, progress toward one goal can only be made at the expense of at least one other, so we aim for the best balance between the competing objectives. Here the HAFPSO algorithm is used to optimize the accuracy and the number of base classifiers. This can be explained using the concept of dominance between two decision vectors in the objective space: a decision vector v dominates another decision vector u (denoted v < u) if and only if:

$$ \forall j\; f_j(v) \le f_j(u)\kern0.5em and\kern0.5em \exists i\; f_i(v) < f_i(u)\kern0.6em when\kern0.24em j, i \in \left\{1, 2, 3, \dots, N\right\} $$
(25)

The above equation states that v is no worse than u in every objective and strictly better in at least one. Here N is the total number of objectives, and fj(v) denotes the value of the jth objective function evaluated at v.

  ii. Particle representation: The optimal solution is selected from the 20 BL involved by means of the HAFPSO algorithm. The problem is formulated by encoding the object, specifying the fitness function, and using velocity and memory to explore the search space. In this study, the particle is encoded as a bit string over the five training data groups. Each group has four bits associated with it, and each bit represents a classifier. If a bit value is one, the corresponding kernel function is used for classification with its corresponding data group. The particle size matches the five data groups created, each with four dimensions for its four classifiers; the whole particle is a binary-encoded string of twenty bits. Each dimension of the particle identifies a data-group/classifier pair, and the encoded bit indicates its presence or absence in the model. Figure 3 demonstrates a random particle with 11 ones, each indicating one of the 11 corresponding data-group and classifier combinations; a decoding sketch follows the figure. This particle generates eleven BL (i.e., BL-1, BL-2, BL-4, BL-5, BL-7, BL-9, BL-12, BL-14, BL-15, BL-17, and BL-20), each giving a base-level model as output. As shown in Fig. 3, the Lin-KELM, Sig-KELM, and Gaus-KELM classifiers with data group dg1 result in base-level models. Similarly, base-level models are obtained by training each classifier on its selected data groups. Thus, the kernels are produced according to the BL selected, and their combination is highlighted in blue in Fig. 3.

Fig. 3 Particle representation
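An illustrative decoding of such a 20-bit particle (5 data groups × 4 kernels) into its selected data-group/classifier pairs; the bit layout and names are assumptions made for this sketch.

```python
KERNELS = ["Lin-KELM", "Pol-KELM", "Sig-KELM", "Gaus-KELM"]

def decode_particle(bits):
    """bits: 20 ints (0/1), four per data group. Returns selected (group, kernel) pairs."""
    assert len(bits) == 20
    return [(f"dg{i // 4 + 1}", KERNELS[i % 4])
            for i, b in enumerate(bits) if b == 1]

# Example: a particle selecting Lin-, Sig-, and Gaus-KELM on data group dg1
print(decode_particle([1, 0, 1, 1] + [0] * 16))
# -> [('dg1', 'Lin-KELM'), ('dg1', 'Sig-KELM'), ('dg1', 'Gaus-KELM')]
```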

Since exploration and exploitation serve two different objectives, a single algorithm cannot handle both simultaneously; hence, the HAFPSO algorithm is deployed. In HAFPSO, the local search (exploitation) is done by AFSO and the global search (exploration) by PSO. PSO refines the solution by searching a small region until the particles converge on the same peak of the objective function; the AFSO algorithm then leaves the current peak and searches for a better solution. In the initial phase, the algorithm has large velocities that favor exploration, and in the later phase, when the velocities approach zero, it shifts to exploitation. The balance between the two algorithms leads to a potential optimal solution. Once the HAFPSO algorithm completes, a potential optimal solution is obtained. This solution corresponds to a kernel that represents the tradeoff between the complexity of learning with kernels and the prediction accuracy. To build a precise KELM model, the final particle is taken as the potential optimal solution with high accuracy, and the final kernel is built from the base-level learners whose bits in the encoded particle have the value one.

  iii. Stacking-based model integration: Model integration plays a crucial role in developing a KELM model, and the predictive behavior of the classifiers can be enhanced by the correct integration method. For model integration, the proposed HAFPSO approach uses stacked generalization, with KELM combining the meta-data. Experimental analysis was conducted on the KELM model to explore the combination of the selected BL and the KELM. KELM has gained wide popularity in various classification tasks [38, 44, 45] due to its promising results. The predicted outputs of the BL are used as new attributes for the KELM classification function: the meta-data from the training set is loaded into the KELM meta-learner, and the trained KELM classifier is applied to the test set to obtain the final predictions, which indicate whether or not a subject is diagnosed with type-II diabetes. A sketch of this stacking step is given below.
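A hedged sketch of the stacking step, reusing the train_kelm/predict_kelm helpers above; the .predict interface of the fitted base learners is a hypothetical assumption.

```python
import numpy as np

def build_meta_features(base_learners, X):
    """base_learners: fitted predictors exposing .predict(X) (hypothetical interface).
    Returns an (n_samples, n_learners) matrix of base-learner outputs."""
    return np.column_stack([bl.predict(X) for bl in base_learners])

def stack_and_predict(base_learners, X_train, T_train, X_test, C=1.0, **kw):
    Z_train = build_meta_features(base_learners, X_train)  # meta-data from training set
    alpha = train_kelm(Z_train, T_train, C=C, **kw)        # KELM as meta-learner
    Z_test = build_meta_features(base_learners, X_test)
    return predict_kelm(Z_test, Z_train, alpha, **kw)      # final stacked predictions
```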

6 Experimental evaluation

6.1 Dataset description and preprocessing

This study uses two datasets. The PIDD dataset originated from the National Institute of Diabetes and Digestive and Kidney Diseases and was originally created to predict whether a patient has diabetes based on the diagnostic measurements it contains [46]. The study was conducted on the Pima Indian women population near Phoenix, Arizona over a period of five years. This database is well known among researchers for identifying the onset of diabetes based on eight features from the 768 samples, as shown in Table 2. The eight features tend to be the significant risk factors for predicting type-II diabetes. In Table 2, the ninth feature is a class label that identifies whether a patient has type-II diabetes. The diabetes pedigree function is a likelihood value derived from the patient’s family history of diabetes. Of the 768 samples, 268 patients were actually diagnosed with diabetes within a five-year period. These 268 samples are annotated with the value one, while the remaining samples are labeled zero. Since our proposed work focuses on identifying type-II diabetes, the insulin measure and the number of pregnancies are not considered very significant risk factors.

Table 2 Features present in PIDD dataset

Of the 768 samples obtained, 376 samples lacked experimental values because some attributes were missing. Missing values occur due to errors and irregularities in the dataset, and if they are not replaced they lead to inaccurate results. The pgc, tst, dbp, si, and bmi attributes cannot legitimately be zero; a zero entry therefore indicates a missing real value. To replace the missing values, zero entries are substituted with the mean value of the corresponding attribute in the training data. The preprocessed, error-free data is then passed to the disease prediction model for processing.

Our second dataset contains test values obtained during the physical examination of different patients at a Diabetes Research Centre, Tamilnadu. Of a total of 8700 samples, 5000 patients did not have type-II diabetes and the remaining 3700 patients were affected by type-II diabetes. There were 230 indicators in the physical examination dataset, some of which had no significant relationship with type-II diabetes. For our study, we manually selected the indicators associated with type-II diabetes; they are presented in Table 3 below. A type-II diabetic patient produces enough insulin, but the body is unable to utilize it properly; hence, glucose remains in the blood, and blood glucose levels must be checked to determine insulin resistance. The blood glucose levels of these patients are measured via the FPG, PPG, HbA1c, and 2hPG tests, from which insulin resistance is assessed. The PPG test, FPG test, HbA1c, TC, TGL, BMI, age, and 2hPG are the important factors in the dataset for predicting type-II diabetes. The duration-of-diabetes attribute represents how long type-I or type-II diabetes has been present in the patient. The type-II diabetes class label represents the prediction of new type-II diabetes patients (incident diabetes).

Table 3 Features present in the Diabetes Research Centre dataset

In both datasets, some values were missing, and some samples had more than one feature missing. Here, missing values are replaced with mean values, computed separately for the samples affected by type-II diabetes and the samples not affected. Since each feature has a different range, replacing missing entries with mean values alone would affect prediction accuracy; this study therefore applies min-max normalization to scale every feature to a value between zero and one.

$$ Replaced\; Value = \frac{Old\; value(feature) - \min(feature)}{\max(feature) - \min(feature)} \times \left(new\_\max(feature) - new\_\min(feature)\right) + new\_\min(feature) $$
(26)
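A minimal preprocessing sketch combining the zero-as-missing mean imputation described above with min-max scaling to [0, 1] (Eq. 26 with new_min = 0 and new_max = 1); the helper name and column convention are our own.

```python
import numpy as np

def impute_and_scale(X, zero_is_missing_cols):
    """X: (n_samples, n_features) array; zero_is_missing_cols: column indices
    (e.g. pgc, tst, dbp, si, bmi in PIDD) where a zero encodes a missing value."""
    X = X.astype(float).copy()
    for c in zero_is_missing_cols:
        col = X[:, c]
        col[col == 0] = col[col != 0].mean()   # replace zeros with the attribute mean
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)          # Eq. (26) with new_min=0, new_max=1
```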

6.2 Performance evaluation metrics

The efficiency of the proposed work is measured using different evaluation metrics: Accuracy, Sensitivity, Specificity, and the Matthews Correlation Coefficient (MCC). The Receiver Operating Characteristic (ROC) curve provides a graphical estimate of the predictive performance of the proposed model by plotting the true-positive rate against the false-positive rate. Accuracy indicates the percentage of correctly identified samples. Sensitivity indicates the probability of accurately identifying diabetic patients, and specificity the probability of accurately identifying non-diabetic patients.

$$ Accuracy=\frac{A_{j1}+{A}_{j2}}{A_{j1}+{A}_{j2}+{A}_{j3}+{A}_{j4}}\kern0.24em $$
(27)
$$ Sensitivity=\frac{A_{j1}}{A_{j1}+{A}_{j4}}\kern0.24em $$
(28)
$$ Specificity=\frac{A_{j2}}{A_{j2}+{A}_{j3}}\kern0.24em $$
(29)

The binary classification performance in the type-II diabetes prediction problem can be measured using the MCC. The MCC value lies between −1 and 1: −1 indicates total disagreement between prediction and observation, 1 represents perfect prediction, and 0 indicates that the model’s prediction is no better than random.

$$ MCC = \frac{A_{j1}\times A_{j2} - A_{j3}\times A_{j4}}{\sqrt{\left(A_{j1}+A_{j3}\right)\left(A_{j1}+A_{j4}\right)\left(A_{j2}+A_{j3}\right)\left(A_{j2}+A_{j4}\right)}} $$
(30)

The Kappa Statistic (KS) is used to check the stability of the proposed approach. It compares the result of the proposed model with a result generated at random by another classifier. The KS value ranges between 0 and 1: a value close to 1 indicates that the model performs well, while a value close to 0 indicates poor performance. The KS is derived as shown below.

$$ KS=\left[\frac{P(observed)-P(chance)}{1-P(chance)}\right]\kern0.24em $$
(31)
$$ P(observed)=\left[\frac{A_{j1}+{A}_{j2}}{TN}\right]\kern0.24em $$
(32)
$$ P(chance)=\left[\frac{\left({A}_{j1}+{A}_{j3}\right)\left({A}_{j1}+{A}_{j4}\right)+\left({A}_{j2}+{A}_{j4}\right)\left({A}_{j2}+{A}_{j3}\right)}{TN^2}\right] $$
(33)

Here, TN represents the total number of observations (not to be confused with true negatives), P(observed) the actually observed agreement, P(chance) the chance agreement, Aj1 the true-positive, Aj2 the true-negative, Aj3 the false-positive, and Aj4 the false-negative counts.
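For illustration (our own sketch), Eqs. (27)-(33) can be computed directly from the four aggregate confusion counts:

```python
import math

def metrics(tp, tn, fp, fn):
    """tp, tn, fp, fn correspond to the aggregate A_j1..A_j4 counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total                                     # Eq. (27)
    sensitivity = tp / (tp + fn)                                     # Eq. (28)
    specificity = tn / (tn + fp)                                     # Eq. (29)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))               # Eq. (30)
    p_obs = (tp + tn) / total                                        # Eq. (32)
    p_chance = ((tp + fp) * (tp + fn)
                + (tn + fn) * (tn + fp)) / total ** 2                # Eq. (33)
    kappa = (p_obs - p_chance) / (1 - p_chance)                      # Eq. (31)
    return accuracy, sensitivity, specificity, mcc, kappa
```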

6.3 Parameter settings

This section gives a brief description of the various parameters used in this study. The regularization parameter C and the kernel parameter γ are the most important hyperparameters in KELM, and proper care should be taken when tuning them. The values are taken as \( C \in \{2^{-5}, 2^{-3}, \dots, 2^{15}\} \) and \( \gamma \in \{2^{-15}, 2^{-13}, \dots, 2^{3}\} \). This study utilized 150 combinations of these two parameters, and the KELM model used 45 hidden neurons and a sigmoid activation function. The dimension size D of the problem is taken as 20 and the population size is initialized to 50. The maximum number of iterations M is set to 1000 to increase the quality of the potential optimal solution. rand() is a random number in [0, 1], and the step size is set to 0.3. The visual field Vd is set to 3.5, the try-number to 10, and the crowd factor to χ = 1. For the PSO algorithm, the inertia weight w is set to 0.7 and α1 = α2 = 2. For Gaus-KELM the parameter γ is tuned to 0.01, and for Sig-KELM the parameters γ and C are tuned to 0.5 and 0.01 respectively. For Pol-KELM, the values of C, γ, and d are tuned to 0.5, 0.25, and 1 respectively. The parameters used for the model selection process are listed in Table 4.

Table 4 Parameters used for the model selection process

6.4 Experimental setup and test functions

The performance of the proposed model is compared with various conventional ML techniques from the literature for diagnosing type-II diabetes. The evolutionary algorithms were coded in the MATLAB 2018a environment, and the experiments were conducted on an Intel Core i9-9980HK processor (5.00 GHz maximum turbo frequency) running 64-bit Windows 10. In this study, nine statistical optimization problems are solved to verify the optimization performance of the HAFPSO algorithm. Nine conventional benchmark functions (F1-F9) are used to verify the effectiveness of the proposed algorithm by comparing it with PSO [35], AFSO [37], GWO [47], Non-dominated Sorting Genetic Algorithm-II (NSGA-II) [48], and Pareto Archived Evolution Strategy (PAES) [49]. Functions F1-F3 are single-peak (unimodal) functions used to measure an algorithm’s exploitation capacity, functions F4-F6 are multi-peak functions used to evaluate its exploration capacity, and functions F7-F9 are fixed multi-peak functions used to evaluate its ability to escape from local minima. These nine functions have their own expressions and variable ranges, as shown in Table 5.

Table 5 Benchmark functions (F1-F9)

In Table 5, the number of variables is represented as v, the range of values as Range, and the optimal value as foptimal. The effectiveness of the proposed approach is measured using three evaluation metrics, namely Average Fitness, Best Value, and Standard Deviation.

Table 6 shows the three evaluation metrics used to compare the proposed methodology with the other five techniques. Here Ni indicates the number of iterations, Fj the fitness value, and μ the mean value of the population.

Table 6 Evaluation metric used for fitness estimation

6.5 Computational complexity of kernels

This section discusses the computational complexity of the proposed approach. The worst-case computational complexity encountered in the KELM stacking approach is given as follows:

  i. The training complexities of the four classifiers are as follows: Lin-KELM has a complexity of O(nf), while Pol-KELM, Sig-KELM, and Gaus-KELM have a complexity of O(n³), where n is the sample size and f the number of features per sample. The computational complexity of steps 1-6 of the HAFPSO algorithm is O(TS·(nf + 2n³ + nf²)), where TS is the number of training samples used.

  ii. The computational complexity of the KELM model for both objectives (accuracy and kernel complexity) of the multi-objective optimization problem is O(NS²), where N denotes the number of objectives and S the size of the population. This complexity covers steps 7 to 36.

  iii. The prediction time complexity is O(t), which is constant. For steps 36-43, the complexity is O(nMt), where n is the number of training samples and M the number of selected models.

The overall computational complexity of the proposed solution is thus O(TS·(n³ + nf² + nf + NS²)). The complexity depends primarily on three factors: the number of base learners, the KELM meta-learner, and the optimization method.

7 Discussion

The proposed KELM-HAFPSO model has been assessed for Accuracy, Sensitivity, Specificity, MCC, and KS by means of fivefold cross-validation on two related datasets. A comparative analysis was conducted between the proposed KELM-HAFPSO and seven competitive methods, namely ELM-GA [50], Decision Tree C4.5-PSO [24], k-NN [23], MLP [21], LR [17], SVM [19], and NB [16].

7.1 Test functions to evaluate the performance of the HAFPSO algorithm

The accuracy of the novel hybrid algorithm has been verified by evaluating it on nine different benchmark functions. Table 7 shows the test results of the six algorithms compared. The results in Table 7 show that the proposed hybrid HAFPSO algorithm has the best fitness capability for both local and global search and that it avoids premature convergence. Additionally, in terms of convergence accuracy, HAFPSO converges faster and in fewer iterations than the other algorithms. In view of the above, it is evident that the proposed HAFPSO algorithm effectively boosts the convergence speed and accuracy of the PSO and AFSO algorithms, respectively. In other words, the proposed HAFPSO algorithm statistically outperforms the other five standard algorithms in terms of applicability and practicality.

Table 7 Comparative analysis for optimization results

Figures 4, 5, and 6 show the optimization curves obtained for the nine benchmark functions by the HAFPSO, PSO, AFSO, GWO, NSGA-II, and PAES algorithms, comparing their convergence rates. Figure 4 contains four panels for benchmark functions F1-F4 evaluated by the six algorithms; the remaining panels for functions F5-F9 are shown in Figs. 5 and 6 respectively.

Fig. 4 Comparison of optimization results with benchmark functions F1-F4

Fig. 5 Comparison of optimization results with benchmark functions F5-F8

Fig. 6 Comparison of optimization results with benchmark function F9

7.2 Performance and comparative analysis of the proposed KELM-HAFPSO approach

The generalization performance of the kernel is evaluated on the test data. Here Dataset-I (PIDD) and Dataset-II (Physical Examination Data) are denoted D-I and D-II. The potential optimal solutions obtained from the two input datasets in a single run of the HAFPSO algorithm are shown in Fig. 7, where B1 represents the classification accuracy (CA) and B2 the number of base classifiers used (NBC); the figure also shows the accuracy/kernel-complexity tradeoff among the potential optimal solutions. Each kernel is a solution that balances accuracy and kernel complexity. From Fig. 7, it is clear that solution 5 yields the best accuracy when the appropriate kernel is used: the best kernel for the validation and test data uses 5 BL, with B1 values of 0.9992 (D-I) and 0.9857 (D-II). Table 8 reports both the test accuracy and the validation accuracy obtained by the proposed KELM-HAFPSO model in a single run. It shows that, during kernel formation, Gaus-KELM and Sig-KELM are the most frequently selected base learners for obtaining the potential optimal solution, which helps to increase the prediction accuracy.

Fig. 7
figure 7

Accuracy (B1) versus kernel complexity (B2) of the potential solution obtained in a single run. a D-I, b D-II

Table 8 Validation (B1) and test accuracy obtained for the potential optimal solution in a single run for datasets D-I and D-II

The swarm optimization algorithm generates random combinations of BLs, which leads to diverse classification. The detailed classification results of the proposed KELM-HAFPSO model are shown in Table 9. For the classification of type-II diabetes in D-I, the Accuracy, Sensitivity, Specificity, MCC, and KS values are 0.999, 0.997, 0.859, 0.908, and 0.962, respectively; in D-II they are 0.985, 0.987, 0.832, 0.918, and 0.977. The performance of the proposed approach is superior to that of the other classifiers compared. Over the twenty rounds of the algorithm, the minimum and maximum kernel complexities associated with the potential optimal solution are 4 and 11, and the average number of BLs required to form a kernel is 7.
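
For reference, the following is a minimal sketch, not taken from the paper, of how the five reported measures can be computed from the counts of a binary confusion matrix; the example counts are illustrative.

```python
# Minimal sketch of the five reported measures, computed from binary
# confusion-matrix counts (tp, tn, fp, fn). Example counts are
# illustrative, not drawn from the paper's experiments.
from math import sqrt

def metrics(tp, tn, fp, fn):
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)              # true positive rate
    specificity = tn / (tn + fp)              # true negative rate
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Cohen's kappa: agreement beyond chance
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return accuracy, sensitivity, specificity, mcc, kappa

acc, sen, spe, mcc, ks = metrics(tp=480, tn=240, fp=15, fn=8)
print(f"Acc={acc:.3f} Sen={sen:.3f} Spe={spe:.3f} MCC={mcc:.3f} KS={ks:.3f}")
```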

Table 9 Experimental results of proposed KELM-HAFPSO with different runs on two different datasets

As shown in Table 10, the proposed KELM-HAFPSO model achieves the highest performance among all the competitive classifiers, with an average Accuracy of 98.5%, Sensitivity of 98.2%, Specificity of 84.2%, MCC of 90.5%, and Kappa Statistic of 96.5%. In this study, the seven competitive models were assessed on the same two datasets through 5-fold cross-validation. Table 10 lists the detailed classification results of the ELM-GA, C4.5-PSO, k-NN, MLP, LR, NB, and SVM models. The average classification accuracies of C4.5-PSO and NB fall short of our proposed model by 0.098 and 0.046, respectively. As seen from the table, the classification accuracy of SVM on D-I and D-II is 78.7% and 77.4%, lower than that of our proposed model by 21.2% and 21.1% and the lowest among the compared models. ELM-GA, k-NN, MLP, and LR likewise provide slightly lower Accuracy, Sensitivity, Specificity, MCC, and Kappa Statistic values than the proposed model. The simulation results in Tables 9 and 10 demonstrate the efficiency of the proposed approach: its competitive performance over the conventional classifiers shows promising results, and the proposed KELM-HAFPSO method is therefore concluded to achieve greater prediction accuracy than the seven competitive classifiers.

Table 10 Comparative analysis of proposed approach with individual classifiers

Figures 8, 9, 10, 11, and 12 display bar graphs of the five performance measures, namely Accuracy, Sensitivity, Specificity, MCC, and Kappa Statistic, obtained by the different classifiers and by the proposed KELM-HAFPSO. Figure 8 shows the higher accuracy of the proposed KELM-HAFPSO approach on both datasets D-I and D-II compared with the other classifiers; the second-highest accuracy is obtained by C4.5-PSO. Figure 9 demonstrates the significant sensitivity values obtained by our proposed approach, whereas classifiers such as k-NN, MLP, LR, and SVM yield lower sensitivity values. In Fig. 10, our proposed model achieves higher specificity values than the other classifiers. The MCC scores are shown in Fig. 11, where our proposed approach achieves the maximum MCC score (0.905); in contrast, MLP and SVM achieve the lowest MCC values. Figure 12 compares the Kappa Statistic of the proposed model with the other classifiers: the value for the proposed approach is relatively high, while that of k-NN is slightly low. Overall, against the seven competitive classifiers, the performance of KELM-HAFPSO on datasets D-I and D-II is the highest.

Fig. 8
figure 8

Accuracy of the proposed KELM-HAFPSO approach compared with other competitive classifiers

Fig. 9
figure 9

Sensitivity of the proposed KELM-HAFPSO approach compared with other competitive classifiers

Fig. 10
figure 10

Specificity of the proposed KELM-HAFPSO approach compared with other competitive classifiers

Fig. 11
figure 11

MCC values of the proposed KELM-HAFPSO approach compared with other competitive classifiers

Fig. 12
figure 12

Kappa Statistics values of the proposed KELM-HAFPSO approach compared with other competitive classifiers

The proposed algorithm is compared with ten classifiers from the literature, as shown in Table 11. Table 11 indicates that our proposed model is a promising tool for classifying high-risk diabetic patients, with accuracies of 99.92% and 98.57% for datasets D-I and D-II, respectively.

Table 11 Comparative analysis of classification accuracy of the proposed model with classifiers from literature

8.2 Risk factor analysis

For a given training dataset {U, T}, the risk factor identification problem is represented as

$$ \min_{\alpha,\ \gamma}\;\frac{1}{N}\sum \limits_{i=1}^N{\left({T}_i-f\left({\alpha}^T{a}_i;\gamma \right)\right)}^2 $$

Such that

$$ {\left\Vert \alpha \right\Vert}_0={p}_s\le p,\quad \alpha \in {\left\{0,1\right\}}^f $$
(34)

Here \( f \) is the mapping function of the KELM network, \( \gamma \) is its output weight, \( p_s \) is the size of the feature subset, and \( {\left\Vert \cdot \right\Vert}_0 \) is the \( L_0 \) norm. \( \alpha_i \) is a binary value indicating whether the i-th feature is selected for the subset. Because the \( L_0 \) norm is non-continuous, optimizing the fitness function becomes complex; this difficulty can be resolved by using a relaxed \( L_1 \)-norm version.

$$ \min_{\hat{\alpha},\ \gamma}\;\frac{1}{N}\sum \limits_{i=1}^N{\left({T}_i-f\left({\hat{\alpha}}^T{a}_i;\gamma \right)\right)}^2+{C}_1{\left\Vert \hat{\alpha}\right\Vert}_1 $$
(35)

\( C_1 \) is a regularization coefficient and \( {\left\Vert \cdot \right\Vert}_1 \) denotes the \( L_1 \) norm. The value \( \hat{\alpha} \) is no longer binary and can take any real value; if the i-th entry of \( \hat{\alpha} \) is non-zero, the corresponding feature is selected. The selected features can then be examined via a multivariate LR technique [51]: the individual risk factors associated with type-II diabetes are identified by substituting the features derived from the KELM algorithm directly into the LR equations. The prediction class is denoted by a variable P that takes the value 1 for the positive class and 0 for the negative class, and the predictor variables \( a_1, a_2, \dots, a_n \) form the linear function \( \alpha_0 + \alpha_1 a_1 + \alpha_2 a_2 + \dots + \alpha_n a_n \), where \( \alpha_0 \) is the intercept and \( \alpha_1, \alpha_2, \dots, \alpha_n \) are the regression coefficients. When the predicted probability of type-II diabetes corresponds to the positive class value "1", the proposed model uses the following equation to obtain the probability estimate.

$$ p=\frac{e^{\left({\alpha}_0+{\alpha}_1{a}_1+{\alpha}_2{a}_2+\dots +{\alpha}_n{a}_n\right)}}{1+{e}^{\left({\alpha}_0+{\alpha}_1{a}_1+{\alpha}_2{a}_2+\dots +{\alpha}_n{a}_n\right)}} $$
(36)
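
To make the relaxed selection step concrete, the following is a minimal sketch, not the paper's implementation, that uses scikit-learn's L1-penalized logistic regression as an off-the-shelf stand-in for the objective in Eq. (35): coefficients driven to zero by the L1 penalty drop the corresponding features, and the predicted probability follows the sigmoid form of Eq. (36). The data and penalty setting are illustrative.

```python
# Minimal sketch of L1-relaxed feature selection (cf. Eq. (35)) with
# an off-the-shelf L1-penalized logistic regression standing in for
# the paper's KELM-based objective. Nonzero coefficients mark the
# selected features; predict_proba applies the sigmoid of Eq. (36).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                  # 8 illustrative candidate features
y = (X[:, 0] + 0.8 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X = MinMaxScaler().fit_transform(X)            # Min-Max step, as in the paper
# C is the inverse of a regularization strength analogous to C1 in Eq. (35).
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)

selected = np.flatnonzero(np.abs(model.coef_[0]) > 1e-6)
print("selected feature indices:", selected)
print("P(type-II diabetes | first sample):", model.predict_proba(X[:1])[0, 1])
```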

The features present in both datasets are fitted with a logistic regression model, and the results obtained are displayed in Tables 12 and 13. The value t indicates the Wald statistic, obtained as the ratio of each feature's estimate to its corresponding standard error. The risk factors highly associated with type-II diabetes can be identified from the P>|t| column. In Dataset-I the risk factors are hp, pg, bmi, and dpf, while in Dataset-II they are BMI, FPG, PPG, HbA1c, 2Hpg, TC, and TGL. Both tables make it clear that Dataset-II contains more risk factors closely related to type-II diabetes than Dataset-I.
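
As a minimal sketch of this screening step (using statsmodels as a stand-in; the data and feature names are illustrative), each coefficient's Wald statistic is its estimate divided by its standard error, and features with small p-values are flagged as risk factors.

```python
# Minimal sketch: fit a logistic regression, compute each feature's
# Wald statistic t = estimate / standard error, and flag features
# with small p-values. Data and feature names are illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (1.2 * X[:, 0] - 0.9 * X[:, 2] + rng.normal(size=500) > 0).astype(int)

result = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
wald_t = result.params / result.bse            # Wald statistics
for name, t, p in zip(["const", "x1", "x2", "x3"], wald_t, result.pvalues):
    flag = "  <- likely risk factor" if p < 0.05 else ""
    print(f"{name}: t = {t:+.2f}, p = {p:.3g}{flag}")
```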

Table 12 Experimental results obtained by fitting Dataset-I with LR
Table 13 Experimental results obtained by fitting Dataset-II with LR

9 Conclusion

This paper develops an automatic type-II diabetes prediction system based on the novel KELM-HAFPSO approach. Previously, a number of different ML classifiers have been used to build type-II diabetes prediction models; however, due to the complexity of the diverse features present in the data, it remains difficult to obtain an accurate predictive model that identifies the crucial risk factors associated with this disease. To overcome this obstacle, this paper devises a multiobjective KELM-HAFPSO approach for selecting the risk factors present in the dataset. Two datasets are used: the PIDD dataset and a dataset obtained from the Diabetes Research Centre, Tamilnadu, comprising a total of 8700 samples. Both datasets were preprocessed using Min-Max normalization. Accurate classification is achieved through a stacking-based kernel approach for the KELM classifier: the KELM uses four BLs, namely Lin-KELM, Pol-KELM, Sig-KELM, and Gaus-KELM, trained via fivefold cross-validation to yield 20 trained BLs. The efficiency of the proposed model is evaluated by maximizing the classification accuracy while minimizing the number of base classifiers used. The KELM model serves as a meta-classifier that accurately classifies the test samples associated with a higher risk of type-II diabetes based on the identified risk factors. The comparison of the novel hybrid HAFPSO algorithm with five competitive algorithms on nine different benchmark functions shows the superiority of the proposed algorithm in terms of accuracy and convergence. The experimental results reveal that the proposed approach outperforms the seven competitive classifiers in terms of Accuracy, Sensitivity, Specificity, MCC, and Kappa Statistic on both datasets. In the future, we plan to apply this approach to high-dimensional datasets that include proteomics, genetics, and metabolomics data.