1 Introduction

According to the International Diabetes Federation (IDF) Diabetes Atlas, the global prevalence of diabetes in the 20 to 79 age group was 10.5% in 2021, equivalent to approximately 536.6 million people. This figure is projected to rise to 12.2% by 2045, or around 783.2 million individuals [1]. Diabetes is associated with severe complications such as retinopathy, neuropathy, cancer, and heart attacks, and can be fatal [2, 3]. The high prevalence of diabetes, combined with the absence of intelligent techniques, causes delays and inaccuracies in diagnosis. Medical data mining can uncover hidden patterns in large volumes of data, supporting timely and precise medical decisions [4, 5]. It can also be applied to accurate diabetes prediction, provided that sufficient data of good quality is available. Poor data quality and missing data are common problems in real-world diabetes datasets and degrade the performance of intelligent techniques [6, 7]. In the healthcare domain, patient records are usually produced as a by-product of patient care rather than collected under a structured research protocol, so valuable information may be lost [8]. As a result, a significant portion of patient records contain missing values; some datasets in the University of California Irvine (UCI) Machine Learning Repository have more than 40% missing values [9]. Missingness in medical data has many causes, such as unrecorded values, incorrect measurements, equipment errors, human errors, outliers, or wrong data. When handling missingness, it is important to know the missingness mechanism (the cause of missingness) and the missingness pattern, because both can affect the choice of imputation technique [10].
The existing literature distinguishes three missingness mechanisms: (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) not missing at random (NMAR) [10,11,12]. Handling incomplete data is a vital step in the analysis of medical datasets [13]. The simplest approach is to delete records with incomplete data and compute with complete records only. However, this technique has several drawbacks: it loses information; it can degrade classifier performance, because the deleted values might be the deciding factor in predicting the disease; and it wastes the time, money, and human effort invested in collecting medical data [14]. Imputation is an alternative approach that substitutes missing values with estimated values, and it has been widely adopted as an efficient way to manage incomplete data [15]. Imputing missing data is important across many domains, particularly in healthcare, where it is crucial to use all available data rather than discard records solely because they contain missing values [16]. The most common imputation method fills in each missing value with the average of the observed values of that variable. For numerical attributes the mean is used, while for nominal attributes the mode is used. The advantage of this method is that the sample mean of the imputed variable is unchanged. However, mean imputation is not suitable for multivariate analysis, because it underrepresents the variability in the data and attenuates any correlations involving the imputed variable(s).
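The trade-off of mean imputation described above can be seen in a few lines of plain Python. This is only an illustrative sketch: the `mean_impute` helper and the glucose readings are hypothetical and are not taken from the study's data.

```python
from statistics import mean, pvariance


def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical glucose readings; None marks a missing value.
glucose = [85.0, None, 120.0, 95.0, None, 140.0]
observed = [v for v in glucose if v is not None]
imputed = mean_impute(glucose)

# The sample mean is preserved, but the variance shrinks because every
# imputed value sits exactly at the centre of the distribution.
print(mean(observed), mean(imputed))
print(pvariance(observed), pvariance(imputed))
```

The shrinking variance is the reason mean imputation attenuates correlations with other variables.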
Machine learning (ML)-based imputation techniques use the available variables to predict and estimate the missing data [14, 17]. These techniques build a predictive model to determine the missing values in a dataset. ML-based models are more flexible than traditional statistical models, which enables them to capture intricate higher-order interactions within the data and consequently to produce better predictive outcomes.

The main contributions of this work are:

  • A comparison of three ML-based imputation techniques: K-nearest neighbour imputation (KNNI), MICE, and MissForest.

  • A comprehensive empirical analysis of these three techniques on the UCI Diabetes Dataset for missing rates (MR) of 10% to 50%.

  • A performance analysis of the three imputation techniques on 16 datasets (one complete and fifteen imputed), evaluated using 11 criteria: accuracy, precision, recall, F1 score, Mcoff score, MAE, RMSE, R^2 values, Pearson correlation analysis, AIC, and BIC values.

The remainder of this paper is structured into six sections. Section 2 provides the background of ML-based techniques. Section 3 summarizes the literature on different imputation techniques. Section 4 presents and explains the methodology employed in this study. Section 5 covers the experimental setup, the results obtained, and their analysis. Section 6 discusses the impact of the work. In Section 7, we conclude by discussing our findings and future scope.

2 Background

This work explores three ML-based imputation techniques: KNN, MICE, and MissForest. K-nearest neighbour (KNN) imputation finds the k nearest neighbours of each record with a missing value and imputes the value from them. For numerical variables, the mean or weighted mean of the neighbours replaces the missing value, while the mode is used for binary or categorical variables. In the weighted mean, closer neighbours receive greater weights. The idea behind KNN is that objects close to each other are potentially similar [13, 16]. One challenge is selecting the optimal value of k; another is selecting the neighbours. Euclidean, Manhattan, and Pearson measures, among others, are commonly used as the similarity measure in KNN, and the choice of similarity measure plays an important role in the overall performance of the algorithm [18]. A drawback of KNN is that it searches the whole database for the most similar instances, which limits its use on large databases.
MissForest is an ML-based imputation technique built on the Random Forest (RF) algorithm. It initializes the missing values with the mean or mode. The variable under imputation is then used as the target variable for building an RF model, and each missing value is replaced by the prediction of that model. The approach is iterative: the process of looping through the variables with missing data is repeated several times [13].
Multiple Imputation by Chained Equations (MICE) is a prevalent method for multiple imputation because of its flexibility. In MICE, multivariate missing data are imputed on an attribute-by-attribute basis, an approach called fully conditional specification (Van Buuren, 2007). Imputations are created per variable, so an imputation model must be specified for each incomplete variable. In these imputation models, interactions can be modelled in two ways: first, by manually specifying models that include interaction effects, and second, by imputing subgroups of the data separately.
MICE consists of three steps: (1) generating multiple imputations, (2) analyzing the imputed data, and (3) pooling the analysis results. Consider a set of attributes X1, ..., Xn, some or all of which contain missing values. Initially, all missing values are filled in at random. The first attribute with missing values, say X1, is then regressed on the other attributes X2, ..., Xn, using only the individuals with observed X1. The missing values in X1 are replaced by simulated draws from the posterior predictive distribution of X1. This process is repeated for all other attributes X2, ..., Xn; for example, X2 is regressed on X1, X3, ..., Xn. The cycle is repeated a number of times to create one imputed dataset. The entire procedure is repeated m times, creating m imputed datasets. Each completed dataset is analyzed independently, and the results are then pooled [19].
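The imputation step of the chained-equations cycle described above can be sketched in plain Python. This is a deliberately simplified illustration and not the implementation used in this study: it assumes only two numeric attributes, uses simple linear regression as the per-attribute model, and approximates the posterior predictive draw by adding Gaussian noise with the residual standard deviation (a full MICE implementation would also draw the regression parameters). The names `fit_line` and `chained_imputation` and the toy data are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / max(len(resid) - 2, 1)) ** 0.5
    return a, b, sd


def chained_imputation(rows, cycles=10, seed=0):
    """Return one imputed copy of a two-column dataset (None = missing)."""
    rng = random.Random(seed)
    n_cols = 2
    missing = [[v is None for v in row] for row in rows]
    data = [list(row) for row in rows]
    # Initial fill: a random observed value from the same column.
    for j in range(n_cols):
        pool = [row[j] for row in rows if row[j] is not None]
        for i, row in enumerate(data):
            if missing[i][j]:
                row[j] = rng.choice(pool)
    for _ in range(cycles):
        for j in range(n_cols):
            k = 1 - j  # the other column acts as the predictor
            # Fit only on rows where column j was actually observed.
            xs = [data[i][k] for i in range(len(data)) if not missing[i][j]]
            ys = [data[i][j] for i in range(len(data)) if not missing[i][j]]
            a, b, sd = fit_line(xs, ys)
            for i in range(len(data)):
                if missing[i][j]:
                    # Prediction plus noise stands in for a posterior draw.
                    data[i][j] = a + b * data[i][k] + rng.gauss(0.0, sd)
    return data


# Toy data with missing cells in both columns.
rows = [[1.0, 2.1], [2.0, None], [3.0, 6.2], [None, 8.1], [5.0, 9.9], [6.0, None]]
# Repeating the procedure with different seeds gives m = 5 imputed datasets.
imputed_sets = [chained_imputation(rows, seed=s) for s in range(5)]
```

Each of the m datasets would then be analyzed separately and the results pooled, which this sketch leaves out.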

In MCAR, missing values are randomly distributed. KNNI can be effective when the missing values have no regular relationship with other variables in the dataset. It works on the assumption that close instances share a similar data structure, which makes it suitable for MCAR situations where missingness is unstructured. MICE handles MCAR well because it imputes missing values from the observed values and the relationships present in the dataset. MCAR presumes that missingness is not systematically related to any variable, and MICE is flexible in incorporating relationships between variables; hence MICE is suitable for handling MCAR missingness. MissForest is powerful and can capture complex relationships in the data. It works well even when missingness is random, because it uses information from other variables to predict the missing values. The ensemble nature of Random Forests helps mitigate overfitting, making MissForest suitable for imputation in datasets that combine MCAR with other missing data patterns.
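The KNNI assumption that close instances have similar values can be illustrated with a minimal weighted-mean imputer. The sketch below is an assumption-laden toy, not the configuration used in the experiments: the `knn_impute` function is hypothetical, it handles numeric features only, it measures distance over the features observed in both rows, and it weights donors by inverse distance.

```python
import math


def knn_impute(rows, k=2):
    """Fill None cells with a distance-weighted mean of the k nearest donors."""

    def distance(a, b):
        # Root mean squared difference over features observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    result = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows with feature j observed.
            donors = []
            for m, other in enumerate(rows):
                if m != i and other[j] is not None:
                    d = distance(row, other)
                    if d < math.inf:
                        donors.append((d, other[j]))
            donors = sorted(donors)[:k]
            if not donors:
                continue  # no comparable donor, leave the gap
            # Closer neighbours receive greater weights.
            weights = [1.0 / (d + 1e-9) for d, _ in donors]
            total = sum(w * v for w, (_, v) in zip(weights, donors))
            result[i][j] = total / sum(weights)
    return result


# Toy data: the second row is close to the first, so its gap is filled
# with a value near 10 rather than near 50.
rows = [[1.0, 10.0], [1.1, None], [5.0, 50.0], [5.2, 52.0]]
filled = knn_impute(rows, k=2)
```

Because every gap requires a scan over all other rows, the cost grows with the size of the dataset, which is the scalability drawback noted in the background.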

The primary objective behind comparing machine learning-based imputation methods across four categories using 11 evaluation criteria in diabetes research is multifaceted, with key motivations including: a) The need to enhance data completeness and quality, b) The enhancement of predictive modeling for diabetes, c) The establishment of benchmark imputation methods tailored for diabetes research datasets, d) The utilization of standardized evaluation criteria to guarantee transparent and reproducible results when comparing imputation techniques, e) Empowering both clinicians and researchers with the requisite tools and knowledge to make well-informed decisions in the area of diabetes care and management.

3 Related work

In the existing literature, missing data in the medical field has been addressed through the application of statistical and machine learning-based imputation techniques. Statistical imputation assumes a normal distribution of the data and predicts missing values from the available data distribution. ML-based imputation techniques do not assume any specific data distribution and are capable of handling nonlinear relationships between variables [16, 20,21,22,23,24]. For instance, in a study on real breast cancer datasets, authors [16] employed statistical and ML-based methods to handle missing values. They utilized techniques like hot deck, mean, and hybrid imputation methods, as well as multilayer perceptron, K-nearest neighbor (KNN), and algorithms based on self-organizing map for handling missing data. In another study [25], researchers worked on medical datasets such as breast cancer, hepatitis, and diabetes datasets from the UCI repository. They proposed a novel hybrid prediction model that employed Simple K-means clustering to evaluate various imputation methods and select the superior one for filling in the missing data in the dataset. Similarly, in [26], the authors worked with the hepatitis dataset, which contained an arbitrary pattern of missing values. They utilized principal component analysis and multiple imputation to fill in missing values having arbitrary pattern. Moreover, in [27], the authors also explored the hepatitis dataset and performed imputation using the bootstrap aggregating method. They compared the performance of this method with decision tree imputation, mode imputation, and mean imputation. The comparison demonstrated that the classifier yielded better results when using bootstrap aggregating imputation. Furthermore, in [28], researchers dealt with hepatitis and breast cancer datasets. They employed hot deck imputation for handling missing data and utilized an ensemble method for feature selection. 
The classification task was then executed with a neural network, resulting in an accuracy of 98.47% for the breast cancer dataset and 95.51% for the hepatitis dataset. The authors of [29] worked on a kidney dataset. They introduced the Weighted Average Ensemble Learning Imputation (WAELI) technique to fill in missing values and improve disease prediction. RF, classification and regression trees, and C4.5 were used to predict the missing values, and the final value was obtained as the weighted average of the models' predictions. In [30], a hybrid classifier was used to detect retinal lesions caused by diabetes, where the dataset contained missing values. Another study [31] used a novel hybrid classifier for predicting diabetes and applied multiple imputation to handle missing values. This hybrid classifier combined an adaptive model and logistic regression based on a fuzzy inference system. The deletion method is commonly used in the literature for handling missing values when predicting diabetes. In contrast, the authors of [32] used Bayesian networks and tensor factorization to handle missingness in breast cancer datasets, and employed KNN, decision trees, and SVM for breast cancer recurrence prediction. Furthermore, the researchers in [33] worked on the Iran diabetes dataset and proposed a hybrid imputation method based on single and multiple imputation. They compared the outcomes using three classifiers and evaluated the results based on accuracy, precision, recall, and F1 score. Lastly, in [34], a comparative study was conducted using decision trees, multilayer perceptron, KNN, and RF classifiers to enhance the accuracy of diabetes prediction, with mean imputation used to handle missingness. The precision of the imputation process in the healthcare domain can be further improved by incorporating domain expert knowledge [35]. Deep-learning techniques have also been employed for predicting pneumonia [42].
Various machine learning-based imputation techniques are employed for medical datasets in [43,44,45].

One significant weakness in the literature is the limited discussion and comparison of specific machine learning algorithms and statistical methods used for imputation, which could be more detailed to enhance comprehensibility and applicability. Another critical issue is the lack of clear explanation and justification for the components and integration of the hybrid intelligent system, which hinders reproducibility. Finally, the complexity of the proposed model may limit its accessibility to researchers without specialized knowledge. Evaluating the performance of imputation methods can be challenging, as there is often no ground truth to compare against. This makes it difficult to assess the accuracy of imputed values. In the literature mostly model performance is selected as the evaluation criteria. Nevertheless, imputation fulfills a broader role within data analysis, and therefore, its effectiveness cannot be comprehensively assessed solely through model performance metrics. These metrics may fall short in encapsulating several critical aspects, including the extent of information loss, the introduction of bias, and the overall quality of imputed data (Table 1).

Table 1 Strengths and weaknesses of state-of-the-art

4 Methodology

In this study, we used the Pima Indians Diabetes Dataset, sourced from the UCI repository, which contains 768 records. Because 376 records had missing values in one or more variables, we deleted them and worked with the remaining 392 complete records. We then generated synthetic missingness in this dataset using the MCAR mechanism, producing five incomplete datasets with missing rates (MR) of 10% to 50%. The missingness was generated in a multivariate configuration, that is, in more than one variable, using the binomial distribution. We applied three ML-based imputation techniques (KNNI, MICE, and MissForest) to the five incomplete datasets, producing fifteen imputed datasets: five from KNNI, five from MICE, and five from MissForest. The design process for the comparison of imputation techniques is shown in Fig. 1.
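One way to generate the multivariate MCAR missingness described above is to blank each input cell independently with probability equal to the missing rate, that is, one Bernoulli (single-trial binomial) draw per cell. The sketch below makes that assumption explicit; the exact masking procedure used in the experiments may differ, and the `make_mcar` helper and the random stand-in data are hypothetical.

```python
import random


def make_mcar(rows, rate, seed=0):
    """Blank each feature cell independently with probability `rate`."""
    rng = random.Random(seed)
    out = []
    for *features, label in rows:
        # One Bernoulli(rate) draw per cell; the outcome label is never masked.
        masked = [None if rng.random() < rate else v for v in features]
        out.append(masked + [label])
    return out


# Hypothetical stand-in for the 392 complete records
# (8 input attributes followed by the binary outcome).
data_rng = random.Random(42)
complete = [
    [data_rng.gauss(0, 1) for _ in range(8)] + [data_rng.randint(0, 1)]
    for _ in range(392)
]

# Five incomplete datasets, one per missing rate from 10% to 50%.
incomplete = {pct: make_mcar(complete, pct / 100, seed=pct)
              for pct in (10, 20, 30, 40, 50)}
```

Because every cell is masked independently of all observed and unobserved values, the resulting missingness is MCAR by construction and spreads across all input variables.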

Fig. 1 The experimental design process for comparison of imputation techniques

We evaluated the performance of KNNI, MICE, and MissForest in four categories: (1) diabetes prediction model performance, (2) imputation error rate, (3) correlation analysis, and (4) model selection basis.

  1. Diabetes prediction model performance: The prediction model was built on one complete dataset and fifteen imputed datasets using four classifiers: RF, SVM, AdaBoost, and XGBoost (XGB). These four classifiers are widely used with ML-based imputation techniques on medical datasets. RF is an ensemble method, SVM can act as a linear or non-linear classifier, AdaBoost is an ensemble boosting method, and XGB is a gradient boosting algorithm. This diversity helps assess how different types of classifiers react to imputed data. The prediction performance of the four classifiers on the imputed datasets is compared with that on the complete dataset using five evaluation metrics: accuracy, precision, recall, F1 score, and Mcoff score.

  2. Imputation error rate: We evaluated the imputation quality of KNNI, MICE, and MissForest using the MAE, RMSE, and R^2 metrics, comparing the values in the complete dataset with those in the fifteen imputed datasets. MAE, RMSE, and R^2 are calculated for each technique at 10% to 50% MR.

  3. Correlation analysis: This analysis identifies the imputation technique best able to capture the intricate relationships among the variables in the diabetes dataset and thus produce more accurate results. The Pearson correlation coefficients of all variables in the fifteen imputed datasets are calculated and compared with those of the complete dataset.

  4. Model selection basis: We selected the best model by calculating and comparing the AIC and BIC scores of the full model and the step model for the complete dataset and the fifteen imputed datasets. The full model is constructed with all the variables, and the step model is constructed using stepwise regression, which selects a subset of variables to improve the performance of the model.

5 Experimental setup and results

The objective of this experiment was to conduct a comparative analysis of MCAR (missing completely at random) multivariate missing patterns and to assess the effectiveness of three ML-based imputation techniques in addressing them at 10% to 50% MR. The study evaluates KNNI, MICE, and MissForest in four categories: diabetes prediction model performance, imputation error rate, Pearson correlation analysis, and model selection based on AIC and BIC scores. Diabetes prediction model performance is evaluated with four ML classifiers, namely RF, SVM, AdaBoost, and XGB, on one complete dataset and fifteen imputed datasets. The performance of the imputation techniques with the four classifiers is evaluated using five criteria: accuracy, precision, recall, F1 score, and Mcoff score, each expressed relative to the complete dataset. Imputation error is evaluated using MAE, RMSE, and R^2 values. Pearson correlation coefficients of the variables in the complete dataset and the fifteen imputed datasets are calculated and compared to check whether the relationships between variables are preserved after imputation. Model selection is based on AIC and BIC scores. The experiments used the Pima Indians Diabetes dataset, obtained from the UCI repository [36]. This dataset consists of 768 patient records, all of them female, with 268 diabetic and 500 non-diabetic cases. As summarized in Table 2, the dataset has eight attributes, including glucose, blood pressure, skin thickness, insulin, and BMI. Records with missing entries were removed, leaving 392 records for further analysis.
Of the 392 patient records used for analysis, 130 are diabetes-present cases and 262 are diabetes-absent cases. Missingness was artificially generated in this dataset of 392 complete records. The experiments were carried out using Python 3.8 in Jupyter Notebook on the Anaconda platform.

Table 2 Overview Pima dataset attributes and missingness

To create five incomplete datasets, missing rates ranging from 10% to 50% were artificially introduced into the input variables. The output variable remained intact and was not affected by the missing values. The missing values in these five datasets were then imputed using the KNNI, MICE, and MissForest (MF) techniques, giving five imputed datasets per technique and fifteen imputed datasets in total. Together with the complete dataset without any missing values, this gives sixteen datasets for experimentation, each with a sample size of 392. The artificial missingness in these 392 samples of the Pima Indians Diabetes dataset was produced through the MCAR mechanism, whereas the missingness in the original data exhibits characteristics that fall between MAR and MCAR [7]. The missingness was generated randomly with the binomial distribution in a multivariate pattern, meaning that it is present in more than one variable of the dataset.

Results

The experimental design was formulated to evaluate the comparative effectiveness of the KNNI, MICE, and MissForest imputation techniques. The three methods were compared across four categories: model performance, imputation error rate (MAE, RMSE, and R^2 values), Pearson correlation analysis, and model selection based on AIC and BIC values. Model performance was evaluated with ten-fold cross-validation: the experiment was repeated ten times over the dataset so that each sample was tested once, and the average over the ten iterations was taken as the final result. The remainder of this section reports the performance of the diabetes prediction model in terms of accuracy, precision, recall, F1 score, and Mcoff score with the RF, SVM, AdaBoost, and XGB classifiers (Sections 5.1 to 5.5), the imputation error rate and coefficient of determination of the imputed datasets relative to the complete dataset (Section 5.6), the correlation analysis (Section 5.7), and model selection based on AIC and BIC values for the various missing rates (Section 5.8).
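The ten-fold cross-validation scheme described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the experiments: the folds are shuffled but not stratified by class, and `k_fold_indices`, `cross_validate`, and the `score_fn` callback are hypothetical names.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(score_fn, n_samples, k=10):
    """Average a user-supplied score(train_idx, test_idx) over the k folds."""
    scores = [score_fn(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Ten splits of the 392 records used in this study.
splits = list(k_fold_indices(392, k=10))
```

In practice `score_fn` would train a classifier on the training indices and return a metric such as accuracy on the test indices.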

5.1 Relative performance analysis of prediction accuracy

Prediction accuracy is obtained by dividing the number of correct predictions by the size of the dataset.

$$Accuracy=(TP+TN)\div (TP+TN+FP+FN)$$

TP and TN represent true positives and true negatives, and FP and FN represent false positives and false negatives, respectively.

$$Relative\;Accuracy=100\times \left(\left(AO-\left(AO-AM\right)\right)\div AO\right)$$

AO denotes the prediction accuracy on the complete dataset, in which all features are available and known, and AM denotes the prediction accuracy measured after applying the respective imputation method to fill in the missing data. As depicted in Fig. 2, the MissForest algorithm outperforms the other imputation techniques in four out of the five missing-rate cases, i.e., in 80% of cases.
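Expanding the inner bracket of the relative accuracy defined above shows that it is simply the ratio of the two accuracies expressed as a percentage; the same simplification applies to the relative scores in Sections 5.2 to 5.5. The numbers in the second line (AO = 0.80, AM = 0.76) are hypothetical values chosen only to illustrate the arithmetic and are not results of this study.

$$Relative\;Accuracy=100\times \left(AO-AO+AM\right)\div AO=100\times AM\div AO$$
$$100\times 0.76\div 0.80=95$$

A relative score of 100 therefore means that the classifier performs as well on the imputed dataset as on the complete dataset.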

Fig. 2 Comparing relative differences of prediction accuracy between Original and Imputed dataset for four classifiers

5.2 Relative performance analysis of precision

Precision measures how many of the positive predictions are correct. TP is the number of correct positive predictions, and TP + FP is the number of patients the model predicted to have diabetes.

$$Precision=TP\div \left(TP+FP\right)$$
$$Relative\;Precision\;Score=100\times \left(\left(PO-\left(PO-PM\right)\right)\div PO\right)$$

PO denotes the precision score on the complete dataset, in which all features are available and known, while PM denotes the precision score measured after applying the respective imputation method to fill in the missing data. Fig. 3 shows that the KNNI algorithm performs better than the other imputation techniques in three out of the five missing-rate cases, i.e., 60% of cases, and MissForest performs better in the remaining 40% of cases.

Fig. 3 Comparing relative differences of precision score between Original and Imputed datasets for four classifiers

5.3 Relative performance analysis of recall

Recall measures how many of the positive cases the classifier has correctly predicted. It is very important in medical domains, where we want to minimize the chance of missing positive cases.

$$Recall=TP\div \left(TP+FN\right)$$

where TP (true positives) is the number of correctly predicted patients with diabetes and TP + FN (true positives plus false negatives) is the total number of patients with diabetes in the dataset.

$$Relative\;Difference\;Recall\;Score=100\times \left(\left(RO-\left(RO-RM\right)\right)\div RO\right)$$

RO denotes the recall score on the complete dataset, in which all features are available and known, and RM denotes the recall score after applying the corresponding imputation method to impute the missing values. Fig. 4 shows that the MissForest imputation technique outperforms the other imputation techniques in three out of the five missing-rate cases, i.e., 60% of cases. MissForest also gives its best performance with the SVM classifier.

Fig. 4 Comparing relative differences of recall scores between Original and Imputed datasets for four classifiers

5.4 Relative performance analysis of F1 score

The F1 score, a comprehensive evaluation metric, accounts for both precision and recall, making it suitable for imbalanced datasets where precision and recall must both be considered.

$$F1\;Score=2\times \left(\left(precision\times recall\right)\div \left(precision+recall\right)\right)$$
$$Relative\;Difference\;F1\;Score=100\times \left(\left(FO-\left(FO-FM\right)\right)\div FO\right)$$

FO denotes the F1 score on the complete dataset, in which all features are available and known, and FM denotes the F1 score after applying the corresponding imputation method to impute the missing values. Fig. 5 shows that the MissForest algorithm performs better than the other imputation techniques in all five missing-rate cases. MissForest also gives its best performance with the SVM classifier.

Fig. 5 Comparing relative differences of F1 score between Original and Imputed datasets for four Classifiers

5.5 Relative performance analysis of Mcoff-score

Mcoff is considered a consistent evaluation metric since it yields a high score only when the prediction exhibits excellent performance across all four categories of the confusion matrix.

$$Relative\;Mcoff\;Score=100\times \left(\left(MO-\left(MO-MM\right)\right)\div MO\right)$$

MO denotes the Mcoff score on the complete dataset, in which all features are available and known, and MM denotes the Mcoff score after applying the corresponding imputation method to impute the missing values. Fig. 6 shows that the MissForest algorithm performs better than the other imputation techniques in four out of the five missing-rate cases, i.e., 80% of cases, and that it gives its best results with the SVM classifier. The overall performance of the imputation techniques and classifiers is shown in Figs. 7 and 8, respectively. Classifier performance generally tends to decline as the missing rate increases. However, comparing model performance across missing rates from 10% to 50% (Figs. 2, 3, 4, 5, and 6) shows that the four classifiers, as evaluated on the five criteria, perform better at 50% MR than at 40% MR. This anomaly could be attributed to the synthetic generation of MCAR-type missingness, in which important predictive features might happen to have a higher degree of missingness in the 40% MR dataset.
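The five classification metrics of Sections 5.1 to 5.5 and their relative scores can be computed from confusion-matrix counts as in the sketch below. Two assumptions should be noted: Mcoff is taken here to be the Matthews correlation coefficient, which matches its description above but is not stated explicitly, and the 100× scaling is applied to every relative score. The function names and the confusion-matrix counts are hypothetical (the counts merely respect the 130 positive and 262 negative records) and are not results of this study.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1 and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


def relative_score(original, imputed):
    """100 * (O - (O - M)) / O, which simplifies to 100 * M / O."""
    return 100.0 * imputed / original


# Hypothetical counts for the complete dataset and for one imputed dataset.
complete = confusion_metrics(tp=100, tn=230, fp=32, fn=30)
imputed = confusion_metrics(tp=95, tn=225, fp=37, fn=35)
relative = {name: relative_score(complete[name], imputed[name])
            for name in complete}
```

A relative value below 100 indicates that the classifier loses performance on the imputed data for that metric, and a value above 100 indicates a gain.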

Fig. 6 Comparing relative differences in Mcoff scores between original and imputed datasets for four classifiers

Fig. 7 Comparing overall performance of Imputation techniques

Fig. 8 Comparing overall performance of Classifiers

5.6 Relative performance analysis of imputation techniques by MAE, RMSE, R^2 values

The accuracy of an imputation method is evaluated using metrics that assess the discrepancy between the imputed values and the actual values of the missing data. One commonly used metric is the MAE, which is the average absolute difference between the imputed values and the true values. Another is the RMSE, which is the square root of the average squared difference between the imputed values and the true values and therefore penalizes large errors more heavily. Lower MAE and RMSE values indicate better performance. Additionally, R^2 measures the proportion of the variance in the true values that is explained by the imputed values; a higher R^2 value signifies a better representation of the true values by the imputed values. The complete dataset is compared with the fifteen imputed datasets at 10% to 50% MR, filled with the KNNI, MICE, and MissForest techniques, to calculate the MAE, RMSE, and R^2 values.
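The three error metrics described above can be computed as in the following sketch. It assumes that the metrics are scored only on the cells that were masked and then imputed, and that R^2 is the usual coefficient of determination with the imputed values treated as predictions; the study's exact computation (for example, per variable or after scaling) may differ. The `imputation_errors` helper and the toy rows are hypothetical.

```python
import math


def imputation_errors(true_rows, masked_rows, imputed_rows):
    """MAE, RMSE and R^2 over the cells that were missing before imputation."""
    truth, guess = [], []
    for t_row, m_row, i_row in zip(true_rows, masked_rows, imputed_rows):
        for t, m, g in zip(t_row, m_row, i_row):
            if m is None:  # only score cells that were actually imputed
                truth.append(t)
                guess.append(g)
    n = len(truth)
    mae = sum(abs(t - g) for t, g in zip(truth, guess)) / n
    ss_res = sum((t - g) ** 2 for t, g in zip(truth, guess))
    rmse = math.sqrt(ss_res / n)
    mean_t = sum(truth) / n
    ss_tot = sum((t - mean_t) ** 2 for t in truth)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Toy example: three masked cells with true values 1, 4 and 5.
true_rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
masked_rows = [[None, 2.0], [3.0, None], [None, 6.0]]
imputed_rows = [[1.5, 2.0], [3.0, 3.0], [5.0, 6.0]]
mae, rmse, r2 = imputation_errors(true_rows, masked_rows, imputed_rows)
```

Pooling cells from variables with very different scales would let large-valued variables such as insulin dominate the errors, so in practice the metrics are often computed per variable or on standardized data.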

The MissForest imputation method achieved lower MAE and RMSE and higher R^2 than KNNI and MICE at every missing rate. MissForest therefore performs better than the other two imputation techniques on all three evaluation criteria, as shown in Fig. 9.

Fig. 9 Comparison of MAE, RMSE and R^2 values for various Imputation techniques

5.7 Relative performance analysis of imputation techniques by correlation analysis

Correlation analysis quantifies the association between the imputed values and other variables within the dataset. We evaluated the KNNI, MICE, and MissForest imputation techniques by comparing the Pearson correlation coefficients of the Glucose, Age, Insulin, BMI, Pregnancies, Skin Thickness, Diabetes Pedigree Function, and BP variables in the complete dataset with those in the fifteen imputed datasets, to check whether the imputed values preserve the correlations with the other variables. The analysis shows that, for all variables, MissForest reproduces the correlations of the complete dataset more closely than MICE and KNNI, indicating that it captures the complex relationships between the variables. Results are shown in Figs. 10 and 11.
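The comparison of Pearson coefficients before and after imputation can be sketched as follows. The `pearson` function implements the standard formula; the `correlation_shift` helper, which reports the absolute change in each pairwise coefficient, is a hypothetical summary and not necessarily the one used in this study, and the two short columns are made-up values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_shift(complete_cols, imputed_cols):
    """Absolute change in each pairwise correlation after imputation."""
    names = list(complete_cols)
    shift = {}
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            p, q = names[a], names[b]
            before = pearson(complete_cols[p], complete_cols[q])
            after = pearson(imputed_cols[p], imputed_cols[q])
            shift[(p, q)] = abs(before - after)
    return shift


# Hypothetical columns; real use would pass all eight Pima attributes.
complete_cols = {"Glucose": [90, 110, 130, 150], "BMI": [22, 27, 30, 35]}
imputed_cols = {"Glucose": [90, 120, 130, 150], "BMI": [22, 27, 30, 35]}
shift = correlation_shift(complete_cols, imputed_cols)
```

An imputation technique that preserves the relationships between variables yields shifts close to zero for every pair.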

Fig. 10 Comparison of Pearson correlation coefficient Values of Glucose, Age, Insulin & BMI for various Imputation techniques

Fig. 11 Comparison of Pearson correlation coefficient Values of Pregnancies, Skin Thickness, Diabetes Pedigree Function & BP for various Imputation techniques

5.8 Relative performance analysis of imputation techniques by AIC and BIC scores

5.8.1 AIC

The Akaike Information Criterion (AIC) is a model selection principle proposed by Akaike in 1973. AIC helps in selecting a model by estimating the relative quality of each candidate model. It evaluates a model by how much information it loses: higher quality models lose less information. AIC accounts for the risks of both overfitting and underfitting, and lower AIC values indicate a better model. AIC penalizes model complexity less heavily than BIC, so it tends to favour more complex models. The full model is the model with all the variables of the dataset [37,38,39]. The step model is constructed using stepwise regression, which iteratively adds and deletes variables to select the subset that gives the best performing model. Results are shown in Figs. 12 and 13.

Fig. 12

Comparison of AIC values of full model

Fig. 13

Comparison of AIC values of step model
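
The stepwise procedure behind the step model can be illustrated with a minimal sketch. Several points are assumptions made here, not details taken from this work: the model is a Gaussian linear regression fitted by least squares, the search starts from the intercept-only model, the data are synthetic, and the function names (gaussian_aic, stepwise_aic) are hypothetical.

```python
import numpy as np

def gaussian_aic(X, y, cols):
    """AIC of a least-squares fit on the chosen columns, Gaussian errors."""
    n = len(y)
    chosen = X[:, np.asarray(cols, dtype=int)]
    design = np.column_stack([np.ones(n), chosen])  # intercept + predictors
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    sigma2 = rss / n                                # ML estimate of variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    m = design.shape[1] + 1                         # coefficients + variance
    return -2 * log_lik + 2 * m

def stepwise_aic(X, y):
    """Add or delete one variable per step while the AIC keeps decreasing."""
    selected = []
    best = gaussian_aic(X, y, selected)
    improved = True
    while improved:
        improved = False
        candidates = []
        for j in range(X.shape[1]):
            trial = sorted(set(selected) ^ {j})     # toggle variable j
            candidates.append((gaussian_aic(X, y, trial), trial))
        score, trial = min(candidates, key=lambda c: c[0])
        if score < best - 1e-9:
            best, selected, improved = score, trial, True
    return selected, best

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
# Only columns 0 and 2 carry signal; columns 1 and 3 are noise.
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=300)

null_aic = gaussian_aic(X, y, [])
full_aic = gaussian_aic(X, y, [0, 1, 2, 3])
step_cols, step_aic = stepwise_aic(X, y)
print("full model AIC:", round(full_aic, 2))
print("step model columns:", step_cols, "AIC:", round(step_aic, 2))
```

Toggling a variable covers both the adding and the deleting move in one loop, and the search stops as soon as no single change lowers the AIC.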

5.8.2 BIC

Schwarz proposed the Bayesian Information Criterion (BIC) in 1978 as a model selection principle, which serves as an asymptotic approximation to a transformed Bayesian posterior probability of a candidate model.

$$\mathrm{AIC}=-2\ln\left(\text{maximum likelihood}\right)+2m$$
$$\mathrm{BIC}=-2\ln\left(\text{maximum likelihood}\right)+m\ln\left(n\right)$$

The best model is the one with the minimum AIC or BIC value, where m is the number of estimated parameters and n is the number of observations. BIC penalizes model complexity more heavily than AIC whenever ln(n) > 2 (that is, for n ≥ 8), so BIC tends to select the less complex model [40, 41].
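
As a small worked example, the two formulas can be evaluated directly. The numbers below are illustrative assumptions, not results from this work: the log-likelihood is a made-up value, m = 9 assumes eight predictors plus an intercept, and n = 768 assumes the commonly cited size of the Pima Indian dataset.

```python
import math

def aic(log_likelihood, m):
    """AIC = -2 ln(maximum likelihood) + 2m."""
    return -2.0 * log_likelihood + 2.0 * m

def bic(log_likelihood, m, n):
    """BIC = -2 ln(maximum likelihood) + m ln(n)."""
    return -2.0 * log_likelihood + m * math.log(n)

log_likelihood = -360.0  # hypothetical maximized log-likelihood
m = 9                    # assumed: eight predictors plus an intercept
n = 768                  # assumed number of observations

aic_value = aic(log_likelihood, m)     # 720 + 18 = 738.0
bic_value = bic(log_likelihood, m, n)  # 720 + 9 ln(768), roughly 779.8

print("AIC:", aic_value)
print("BIC:", round(bic_value, 2))
print("per-parameter penalty, AIC vs BIC:", 2, "vs", round(math.log(n), 2))
```

With these values each parameter costs 2 under AIC and about 6.64 under BIC, which is why BIC leans toward the smaller model.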

The full model and the stepwise regression model are constructed for the complete dataset and for the fifteen imputed datasets at 10–50% MR. The AIC and BIC scores of the full model and the step model are compared for KNNI, MICE, and MissForest. The analysis shows that MissForest performs better than MICE and KNNI for both the full and the step model. Results are shown in Figs. 14 and 15.

Fig. 14

Comparison of BIC values of full model

Fig. 15

Comparison of BIC values of step model

6 Impact of the work

Diabetes is a chronic disease that requires continuous monitoring and management. Departments in large hospitals that monitor chronic diseases generate large volumes of data, much of which is prone to missingness. Addressing missingness in a principled manner helps reduce knowledge loss and supports accurate decision-making in various healthcare domains. This work assessed imputation techniques using 11 evaluation criteria, which together provide a holistic view of their performance. Pearson correlation analysis shows how the relationships between variables in the dataset compare before and after imputation. The other 10 evaluation criteria capture different aspects of imputation and give a comprehensive view of each technique's strengths and weaknesses. This helps identify the best imputation technique and classifier, so that datasets are more complete and of higher quality, which is crucial for accurate analysis, better disease prediction models, and accurate decision-making.

7 Conclusion and future scope

In this work, a comparative analysis of three ML-based imputation methods was performed on the Pima Indian dataset. Experimental evidence confirmed that the MissForest imputation technique performed better across the eleven evaluation criteria than the other imputation techniques. It was also found that the SVM classifier performed better than the RF, XGB, and AdaBoost classifiers in precision, recall, F1 score, and Mcoff. The empirical analysis for all five MR cases (10% to 50%), using the MissForest, KNNI, and MICE techniques, revealed that MissForest performed better in accuracy and Mcoff in 80% of cases, in precision and recall in 60% of cases, and in F1 score, MAE, RMSE, R^2, AIC, and BIC in 100% of cases. The Pearson correlation coefficient analysis of the input variables also revealed that MissForest was able to capture the complex relationships between all the variables in the diabetes dataset. Overall, our empirical evidence confirms that MissForest is the better ML-based imputation technique for handling missing data in diabetes datasets. The use of accurate imputation techniques can improve the quality of diabetes research by ensuring that missing data does not compromise the validity of research results. In this work, we exclusively addressed the MCAR missingness mechanism. In the future, we intend to address this limitation by incorporating methods to handle the MAR mechanism, introducing an ensemble imputation approach, and exploring other ML-based methods that are capable of effectively managing both MCAR and MAR missingness in different diseases. Future work will also involve enhancing imputation by integrating medical expertise and developing real-time imputation applications for missing data in clinical settings where prompt decision-making is essential.