1 Introduction

Diabetes is one of the most common chronic diseases, affecting around 415 million people worldwide. Early diagnosis and prediction of diabetes can limit its effects and prevent long-term complications. In the past few years, many works have reported on the prediction of diabetes using machine learning algorithms tested on the PIMA dataset (footnote 1), one of the most widely used diabetes datasets in the literature [1,2,3]. However, such datasets are imbalanced. The class imbalance problem refers to an unequal distribution of samples across the classes. It makes detecting and extracting diabetic patterns harder: because one class dominates, existing machine learning algorithms may fail to detect diabetic cases accurately. Nnamoko and Korkontzelos [3] proposed a two-step data pre-processing approach on the PIMA dataset, in which the first step identifies outliers using the Interquartile Range (IQR) algorithm and the second step applies the Synthetic Minority Oversampling Technique (SMOTE).

This paper aims to find the best machine learning model for predicting diabetes from imbalanced data. To this end, it presents rigorous experimentation in three categories: category 1, experiments with classification algorithms; category 2, experiments with ensemble methods; and category 3, experiments with imbalanced data pre-processing (different undersampling, oversampling, and combination techniques) followed by classification algorithms. Undersampling, oversampling, and combination are techniques for adjusting the class distribution of the data. Undersampling shrinks the majority class by removing observations, oversampling enlarges the minority class by adding observations, and combination methods first oversample the data and then undersample the transformed data.

The performance of the solutions has been evaluated using six metrics: F1-score, Precision, Recall, Area Under the Receiver Operating Characteristic curve (AUROC), Area Under the Precision-Recall curve (AUPR), and Classification Accuracy (Accuracy). Experimental results show that combining imbalanced data pre-processing methods with traditional machine learning classifiers improves their performance, with a best accuracy of 98.49%. The results are compared with existing methods in the literature, and the proposed model yields better accuracy than all of them. In addition, we examined the validity of the proposed model in a domain unrelated to healthcare, using a credit card dataset that exhibits high class imbalance.
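To illustrate why Accuracy alone can mislead on imbalanced data, the following minimal sketch computes four of the six metrics from hard predictions. The function name `classification_metrics` is hypothetical, the class counts (500 non-diabetic, 268 diabetic) are those reported for the PIMA dataset in Sect. 4.3, and AUROC and AUPR are left out because they require predicted scores rather than labels.

```python
def classification_metrics(y_true, y_pred):
    """Precision, Recall, F1-score and Accuracy for binary labels (1 = diabetic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Class counts of the PIMA dataset: 500 non-diabetic, 268 diabetic.
y_true = [0] * 500 + [1] * 268
# A trivial classifier that labels every patient as non-diabetic.
y_all_negative = [0] * len(y_true)

baseline = classification_metrics(y_true, y_all_negative)
```

The trivial classifier reaches an Accuracy of 500/768 while its Recall and F1-score are zero, which is why the paper reports several metrics side by side.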

2 Related Works

The health sector has seen rapid technological growth through the use of machine learning and deep learning. A few notable contributions are the detection of lung cancer [4, 5], dermatoscopic melanocytic skin lesion segmentation [6], lung segmentation [7, 8], and diabetes detection [1,2,3]. One common problem for methodologies dealing with such data is class imbalance. The literature reports many ways to tackle the class imbalance problem in various domains. Common approaches are undersampling techniques, oversampling techniques, or a combination of both.

Undersampling: Undersampling methods can be categorized as (a) methods that select the samples to keep: Near Miss [9] and the Condensed Nearest Neighbor Rule [10]; (b) methods that select the samples to delete: Tomek Links [11] and Edited Nearest Neighbors [12]; and (c) combinations of keep and delete methods: One-Sided Selection [13] and the Neighborhood Cleaning Rule [14].

Oversampling: Similarly, oversampling methods include (a) the Synthetic Minority Oversampling Technique (SMOTE) [15], (b) Borderline-SMOTE [16], (c) Borderline-SMOTE SVM [17], and (d) Adaptive Synthetic Sampling (ADASYN) [18], where (b), (c), and (d) are extensions of (a).

Combination of Undersampling and Oversampling: Combination methods pair an oversampling method with an undersampling method to make resampling more effective. Two examples of effective combinations are (i) SMOTE and Tomek Links [19], and (ii) SMOTE and Edited Nearest Neighbors [20].

3 Materials and Methods

3.1 Dataset Description

The PIMA dataset (footnote 2) is used in this paper for analysis. It has a total of 768 samples with 9 attributes, including the Outcome label. The class ratio of diabetic to non-diabetic is 0.34:0.66 (see Fig. 1). The yellow dots and the blue dots in the scatter plot represent the diabetic cases and non-diabetic cases, respectively.

Fig. 1 Scatter plot of PIMA dataset (blood pressure versus glucose; the color bar shows Outcome, 0 to 1)

3.2 Feature Engineering

The data contains 0 as a measurement for certain features. A 0 in the Pregnancies column indicates that the woman has never been pregnant, so it is a valid value. Age and Diabetes Pedigree Function are continuous attributes. Apart from Outcome, Age, Diabetes Pedigree Function, and Pregnancies, a 0 in any of the remaining features is assumed to be a missing observation. The assumed missing values are replaced with the median, since the median is not affected by extreme values.
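The imputation rule above can be sketched as follows. This is a minimal illustration, not the authors' code: the column names follow the commonly distributed CSV version of the dataset and are an assumption, as is computing the median over the non-zero entries of each column (the text does not say whether zeros were included). The helper name `impute_zeros_with_median` is hypothetical.

```python
from statistics import median

# Columns in which 0 is treated as missing, i.e. everything except
# Pregnancies, Age, DiabetesPedigreeFunction and Outcome.
# Column names are an assumption based on the common CSV release.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def impute_zeros_with_median(rows, columns=ZERO_AS_MISSING):
    """Return a copy of `rows` (list of dicts) with zeros replaced by the
    column median. The median is taken over non-zero entries (assumption)."""
    imputed = [dict(row) for row in rows]  # leave the input untouched
    for col in columns:
        observed = [row[col] for row in rows if row[col] != 0]
        if not observed:
            continue  # nothing to estimate the median from
        col_median = median(observed)
        for row in imputed:
            if row[col] == 0:
                row[col] = col_median
    return imputed


# Toy rows with only three columns, for illustration.
toy_rows = [
    {"Pregnancies": 0, "Glucose": 100, "BloodPressure": 70},
    {"Pregnancies": 2, "Glucose": 0, "BloodPressure": 80},
    {"Pregnancies": 1, "Glucose": 140, "BloodPressure": 0},
    {"Pregnancies": 3, "Glucose": 120, "BloodPressure": 60},
]
cleaned = impute_zeros_with_median(toy_rows, columns=["Glucose", "BloodPressure"])
```

In the toy data, the missing Glucose value becomes 120 and the missing BloodPressure value becomes 70, while the 0 in Pregnancies is kept because it is a valid count.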

3.3 Experimental Setup

This paper reports rigorous experimentation on the class imbalance problem and adopts various state-of-the-art methodologies to improve the prediction of diabetic cases. The experiments were conducted in three categories.

  • Category 1: Experiments with traditional machine learning algorithms. In this category, five supervised classification algorithms are applied to our dataset: Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), K-Nearest Neighbour (KNN), and Deep Neural Network (DNN).

  • Category 2: Experiments with ensemble machine learning algorithms. In this category, five ensemble algorithms are applied to our dataset: Bagging, Random Forest, AdaBoost, Gradient Boosting, and XGBoost.

  • Category 3: Experiments with imbalanced data pre-processing and traditional machine learning methods. Here, to tackle the class imbalance problem, different undersampling techniques, oversampling techniques, and combinations of both are applied to the data before it is fed to the machine learning algorithms. The undersampling techniques are Random Undersampling (RU), Near miss-1 [9], Near miss-2 [9], Tomek Links [11], Edited Nearest Neighbors (ENN) [12], and One-Sided Selection (OSS) [13]. The oversampling techniques are Random Oversampling (RO), Synthetic Minority Oversampling Technique (SMOTE) [15], Borderline SMOTE-1 [16], Borderline SMOTE-2 [16], SVM-SMOTE [17], and Adaptive Synthetic Sampling (ADASYN) [18]. From these experiments, the two best oversampling methods and the two best undersampling methods are selected, and their combinations are then investigated.

4 Experimental Results and Discussions

4.1 Category 1: Experiments with Traditional Machine Learning Methods

Five supervised classification algorithms were applied: Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), K-Nearest Neighbour (KNN), and Deep Neural Network (DNN), with class weights of 0.34 and 0.66 for class 0 and class 1, respectively.
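The effect of these class weights can be illustrated with a weighted log-loss. This is only a sketch under an assumption: the text does not say how the weights enter each classifier, so a weighted binary cross-entropy is used here as one common formulation. The function name `weighted_sample_loss` is hypothetical.

```python
import math

# Class weights as stated in the text: 0.34 for class 0, 0.66 for class 1.
CLASS_WEIGHT = {0: 0.34, 1: 0.66}


def weighted_sample_loss(y, p, class_weight=CLASS_WEIGHT, eps=1e-12):
    """Weighted binary cross-entropy of one sample.

    y is the true label, p the predicted probability of class 1 (diabetic).
    Using cross-entropy is an assumption made for illustration.
    """
    p = min(max(p, eps), 1 - eps)  # keep log() finite
    prob_of_true_class = p if y == 1 else 1 - p
    return -class_weight[y] * math.log(prob_of_true_class)


# Two equally confident mistakes: a missed diabetic case and a false alarm.
missed_diabetic = weighted_sample_loss(1, 0.2)
false_alarm = weighted_sample_loss(0, 0.8)
penalty_ratio = missed_diabetic / false_alarm
```

With these weights, a missed diabetic case costs 0.66/0.34 times as much as an equally confident false alarm, which pushes the classifier toward the minority class.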

For KNN, k = 7 is used because it gave the best performance. The DNN is built with three layers of 5, 8, and 1 units, with the Rectified Linear Unit (ReLU) as the activation function. We used the Adam optimizer with a batch size of 32 and 20 epochs.
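As a rough size check of the network described above, the number of trainable parameters of a fully connected network can be counted layer by layer. The sketch assumes dense layers with biases and an input of eight predictors (the nine attributes minus Outcome); neither detail is stated in the text, and the helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(n_inputs, layer_units):
    """Trainable parameters of a stack of dense layers with biases."""
    total = 0
    fan_in = n_inputs
    for units in layer_units:
        total += fan_in * units + units  # weight matrix plus bias vector
        fan_in = units
    return total


N_INPUTS = 8             # assumption: eight predictors, Outcome excluded
LAYER_UNITS = [5, 8, 1]  # layer sizes stated in the text

n_params = dense_param_count(N_INPUTS, LAYER_UNITS)
```

Under these assumptions the layers contribute 45, 48, and 9 parameters, for a total of 102.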

Table 1 Performance of traditional machine learning algorithms
Table 2 Performance of ensemble methods

The performance of the classification algorithms is shown in Table 1. In terms of all six evaluation metrics, LR performed best and SVM second best. DNN gave comparable performance in terms of AUROC, Precision, Recall, and Accuracy, but performed worst in terms of AUPR and F1-score. In addition, DNN requires further hyper-parameter tuning, and running it after each pre-processing technique is computationally expensive, requiring more memory and computational resources than LR, KNN, SVM, and DT. Hence, we eliminated DNN from the list of classification algorithms for further analysis.

4.2 Category 2: Experiments with Ensemble Machine Learning Methods

To evaluate the performance of ensemble methods on imbalanced data, this study includes experiments with five ensemble algorithms: Bagging, Random Forest, AdaBoost, Gradient Boosting, and XGBoost. Table 2 reports the results in terms of all six evaluation metrics. XGBoost performed best on every metric except AUPR, while Random Forest performed best in terms of AUPR and second best on the remaining metrics.

4.3 Category 3: Experiments with Imbalanced Data Pre-processing and Traditional Machine Learning Methods

This category of experimentation combines imbalanced data pre-processing with machine learning methods. A variety of undersampling and oversampling techniques were examined. The undersampling techniques are RU, Near miss-1, Near miss-2, Tomek Links, ENN, and OSS. The oversampling techniques are RO, SMOTE, Borderline SMOTE-1, Borderline SMOTE-2, SVM-SMOTE, and ADASYN. The two best oversampling methods and the two best undersampling methods are selected, and their combinations are investigated.

The pre-processed data is then classified using traditional machine learning algorithms that include LR, KNN, SVM, and DT.

Undersampling Techniques: The sampling strategy, one of the parameters of undersampling techniques, is defined as the ratio of the number of samples in the minority class to the number of samples in the majority class after re-sampling. In the present dataset, the minority class contains 268 instances and the majority class contains 500 instances, so the denominator can take any value in the range [268, 500]. The sampling strategy therefore cannot be below 0.536 (268/500) and can take any value in the range [0.536, 1]. When the value is 1, the class ratio is balanced to 0.5:0.5, but this reduces the number of samples from 768 to 536. Our aim was to remove only those samples that were hurting the models. Keeping these constraints and the size of the data in mind, we tuned this parameter to 0.625. The number of samples after down-sampling is 694, with a class ratio of diabetic to non-diabetic of 0.39:0.61. Table 3 presents the results when the data is pre-processed with undersampling methods. ENN performed best in terms of all evaluation metrics with all the machine learning algorithms, whereas Near miss-1 performed second best in terms of AUPR.
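The arithmetic in this paragraph can be checked with a few lines. This is a worked calculation using only the counts stated in the text (268 minority, 500 majority, 694 samples after down-sampling); the variable and function names are hypothetical.

```python
N_MINORITY, N_MAJORITY = 268, 500


def class_shares(n_minority, n_majority):
    """Fractions of the minority and majority class in the dataset."""
    total = n_minority + n_majority
    return n_minority / total, n_majority / total


# Lowest possible sampling strategy: keep every majority sample.
lower_bound = N_MINORITY / N_MAJORITY

# Sampling strategy 1: the majority class is cut down to the minority size.
balanced_total = N_MINORITY + N_MINORITY

# Reported outcome: 694 samples remain, so 694 - 268 majority samples are kept.
reported_total = 694
majority_kept = reported_total - N_MINORITY
share_minority, share_majority = class_shares(N_MINORITY, majority_kept)

# Minority-to-majority ratio implied by the reported sample count.
implied_strategy = N_MINORITY / majority_kept
```

The lower bound evaluates to 0.536, the balanced dataset has 536 samples, and 694 samples correspond to 426 retained majority samples with class shares of about 0.39 and 0.61. The ratio implied by these counts, 268/426 ≈ 0.629, is close to but not exactly the tuned value of 0.625; the sketch does not attempt to explain that difference.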

Table 3 Performance of different machine learning methods with undersampling techniques

Oversampling Techniques: In oversampling, the sampling strategy is defined as the ratio of the number of samples in the minority class after re-sampling to the number of samples in the majority class. The numerator can take any value in the range [268, 500], so for our dataset the parameter can take any value in the range [0.536, 1]. Since the dataset is small, we accepted maximum redundancy: repeating information from the minority class balances the two classes in the dataset and thereby gives better results. We set the parameter to 1, so the minority class is upsampled to the size of the majority class. The total number of samples after upsampling is 1000, with a class ratio of diabetic to non-diabetic of 0.5:0.5. Table 4 presents the results when the data is pre-processed with oversampling methods. RO gives the best performance in terms of all the evaluation metrics except Recall with all the machine learning algorithms, whereas SMOTE performed best in terms of Recall.
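Random oversampling with a sampling strategy can be sketched as below. This is a minimal stand-in for illustration, not the implementation used in the experiments; the function name `random_oversample`, the seed, and the placeholder feature values are assumptions, while the class counts are those of the dataset.

```python
import random
from collections import Counter


def random_oversample(X, y, sampling_strategy=1.0, minority_label=1, seed=0):
    """Duplicate randomly chosen minority rows until
    n_minority / n_majority reaches `sampling_strategy` (rounded down)."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    n_majority = len(y) - len(minority_idx)
    target_minority = int(sampling_strategy * n_majority)
    n_extra = max(0, target_minority - len(minority_idx))

    # Sample with replacement from the existing minority rows.
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    X_resampled = list(X) + [X[i] for i in extra]
    y_resampled = list(y) + [y[i] for i in extra]
    return X_resampled, y_resampled


# Placeholder features with the PIMA class counts (500 vs. 268).
y = [0] * 500 + [1] * 268
X = [[float(i)] for i in range(len(y))]

X_res, y_res = random_oversample(X, y, sampling_strategy=1.0)
counts = Counter(y_res)
```

With a sampling strategy of 1, the 268 minority rows grow to 500, giving the 1000 samples and the 0.5:0.5 class ratio described above.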

Table 4 Performance of different machine learning methods with oversampling techniques

Combination Methods: After experimenting with different oversampling and undersampling techniques, we selected the two best undersampling methods and the two best oversampling methods in terms of AUPR and combined them. AUPR is not affected by moderate to high class imbalance and gives an accurate picture of predictive performance [21]. With undersampling, ENN (with KNN and DT) surpassed all the remaining methods in terms of all the evaluation metrics. After ENN, Near miss-1 (with LR) performed second best and achieved an AUPR greater than 80% with three classifiers. With oversampling, RO (with KNN and DT) performed best in terms of AUROC, AUPR, F1-score, Precision, and Accuracy, and performed well in terms of Recall. SMOTE (with KNN) performed best in terms of Recall and second best in terms of AUPR. Hence, we formed four combinations: RO + ENN, RO + Near miss-1, SMOTE + Near miss-1, and SMOTE + ENN. Figures 2 and 3 show the scatter plots after applying these four combinations, using the first two features. The yellow dots and the blue dots represent the diabetic cases and non-diabetic cases, respectively. The ratio of diabetic to non-diabetic cases improves compared with Fig. 1.
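The SMOTE + ENN combination can be sketched in plain Python as oversampling followed by cleaning. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: SMOTE here balances the classes fully with k = 5 minority neighbours, ENN removes any sample (of either class) whose label disagrees with the majority vote of its k = 3 nearest neighbours, and the toy data, seeds, and function names are all hypothetical.

```python
import math
import random
from collections import Counter


def nearest_indices(point, pool, k, skip=None):
    """Indices of the k points in `pool` closest to `point` (Euclidean)."""
    candidates = (i for i in range(len(pool)) if i != skip)
    return sorted(candidates, key=lambda i: math.dist(point, pool[i]))[:k]


def smote(X, y, minority_label=1, k=5, seed=0):
    """Add synthetic minority points until both classes have equal size."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_needed = (len(y) - len(minority)) - len(minority)
    synthetic = []
    for _ in range(max(0, n_needed)):
        i = rng.randrange(len(minority))
        j = rng.choice(nearest_indices(minority[i], minority, k, skip=i))
        gap = rng.random()  # position on the segment between the two points
        synthetic.append([a + gap * (b - a)
                          for a, b in zip(minority[i], minority[j])])
    return list(X) + synthetic, list(y) + [minority_label] * len(synthetic)


def edited_nearest_neighbours(X, y, k=3):
    """Drop samples whose label differs from the vote of their k neighbours."""
    keep = []
    for i, (x, label) in enumerate(zip(X, y)):
        votes = Counter(y[j] for j in nearest_indices(x, X, k, skip=i))
        if votes.most_common(1)[0][0] == label:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy two-feature data: 40 majority points and 15 overlapping minority points.
rng = random.Random(42)
X_toy = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
         + [[rng.gauss(1.5, 1), rng.gauss(1.5, 1)] for _ in range(15)])
y_toy = [0] * 40 + [1] * 15

X_sm, y_sm = smote(X_toy, y_toy)                           # oversample first
X_clean, y_clean = edited_nearest_neighbours(X_sm, y_sm)   # then clean
```

After the SMOTE step both classes have 40 samples; the ENN step then removes points that sit in the overlap between the classes, so the final set is no larger than 80 samples and is not necessarily perfectly balanced.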

Table 5 presents the results of traditional machine learning algorithms on data pre-processed by these combined methods. KNN is used with k = 3, the value at which the model achieved the highest performance.

Fig. 2 Scatter plot of Pima Indian dataset with RO and two best undersampling methods (blood pressure versus glucose; panels: RO + ENN and RO + Near miss-1; the color bar shows Outcome, 0 to 1)

Fig. 3 Scatter plot of Pima Indian dataset with SMOTE and two best undersampling methods (blood pressure versus glucose; panels: SMOTE + ENN and SMOTE + Near miss-1; the color bar shows Outcome, 0 to 1)

Table 5 Performance of different machine learning methods with the combination of undersampling and oversampling techniques

Table 5 shows that SMOTE + ENN gives the best performance in terms of all evaluation metrics with all the classifiers, and that SMOTE + ENN with KNN gives the highest performance overall. The SMOTE + ENN combination is further investigated with different ensemble machine learning methods (see Table 6). The performance of the ensemble methods improves when these two imbalanced data pre-processing techniques are combined, with Random Forest achieving the highest Accuracy of 96%.

Table 6 Performance of ensemble methods with SMOTE + ENN

4.4 Comparison with Previous Studies

Our approach is also compared with state-of-the-art methods (see Table 7). The results show that it produces better accuracy than past studies. Naz and Ahuja [2] obtained an accuracy comparable to ours with Deep Learning (98.07%); however, Deep Learning is computationally expensive to train. Some of the studies listed in Table 7 evaluated their performance in terms of other metrics. In particular, Nanni et al. [22] reported F1-score, G-mean, and AUROC, while Raghuwanshi and Shukla [23] reported G-mean and AUROC. In terms of F1-score and AUROC, our model performs better than both studies. Zahirnia et al. [24] reported feature cost and misclassification cost, and Wei et al. [25] evaluated sensitivity, F3, and G-mean on the diabetes dataset.

Table 7 Comparison with previous studies in terms of accuracy

5 Conclusion and Future Directions

The experiments reported in this paper show that imbalanced data processing methods lead to better performance. To establish this, we investigated the effects of different imbalanced data processing methods and machine learning algorithms on classification performance metrics. The results show that SMOTE + ENN gave the best performance on the Pima Indian dataset, and that these results are better than those of previous studies carried out on the same dataset. Not all studies on diabetes prediction in the literature use this dataset, so we identified those that do and compared results with them. Future work will include investigation of different unsupervised and semi-supervised methods.