1 Introduction

Technology now surrounds us in nearly every aspect of life and is routinely used to make work easier; examples include driverless cars, GPS map navigation, and voice-controlled devices. These technologies use real-world data to train their models and then test them repeatedly on further data to improve accuracy. The ease and accuracy of a technology are directly related to how widely it is used. Machine learning follows the same pattern: algorithms train a model on one part of the data and then test it on another part to produce an outcome. This technique can be used to predict a disease from its associated symptoms. The symptom values serve as inputs to the trained model, and the predicted outcomes are evaluated for accuracy and other measures, which helps patients and the medical field address the patient’s condition at an earlier stage [1].

Diabetes is a major cause of death and metabolic disorder in humans, and it leads to economic and productivity losses throughout the world because it lowers workforce efficiency. It is a metabolic disorder characterized by high blood sugar levels, caused by low insulin production in the pancreas. It increases the risk of long-term complications, including heart disease: about 75% of people with diabetes die from coronary artery disease.

In this research, different machine learning algorithms are compared for predicting the risk of a person having diabetes. Classification algorithms are used to classify the target outcome (1 for diabetic or 0 for non-diabetic) independently. The study is structured as follows: Sect. 2 contains the literature review. Sect. 3 explains the procedural approach, the machine learning techniques used, and the model evaluation. Section 4 discusses the final results obtained from the research. Finally, Sect. 5 contains the conclusion and future work.

2 Literature Review

Diabetes diagnosis and treatment have been a crucial topic in medical research for a long time. With the help of machine learning, good progress has been made in predicting diabetes. The prediction is made by machine learning models trained on a dataset of patients’ medical information together with a label indicating whether each patient is diabetic. After the training phase, the model is evaluated on the testing data to check how well it performs. Kahramanli and Allahverdi [2] combined artificial neural networks and fuzzy logic to build a model that predicts diabetes with good accuracy. Kumar Dwivedi [3] compared five machine learning algorithms for predicting diabetes: artificial neural networks, classification tree, KNN, SVM, and logistic regression. The author in [4] used two classification algorithms, deep neural networks and artificial neural networks, together with principal component analysis; the deep neural network achieved the better accuracy of 82.67%. Khan and Mohamudally [5] used k-means clustering, neural networks, and the C4.5 decision tree algorithm to predict diabetes in patients. Heydari et al. [6] used a Bayesian network, an artificial neural network, SVM, a decision tree, and KNN to predict diabetes. Temurtas et al. [7] built a model with a multilayer neural network structure trained by the Levenberg–Marquardt (LM) algorithm.

Rajesh and Sangeetha used a classification technique, applying the C4.5 decision tree algorithm to find hidden patterns in the dataset for efficient classification [8]. Butwall and Kumar proposed a model using a Random Forest classifier to forecast diabetes behavior [9]. Ashiquzzaman et al. [10] proposed a deep learning prediction framework for diabetes mellitus in which overfitting is reduced by the dropout method. Patil proposed a Hybrid Prediction Model that applies the simple k-means clustering algorithm and then applies a classification algorithm to the clustering result; the C4.5 decision tree algorithm is used to build the classifiers [11].

Patients can have several symptoms, and some of these symptoms and factors are included in the dataset, such as Age, Insulin level, Glucose level, Diabetes Pedigree Function, Blood Pressure level, Skin Thickness, and BMI. The outcome has been predicted from the data using various traditional machine learning techniques and artificial neural networks. Before these algorithms can be applied, the data must be preprocessed, which includes cleaning [12]. The proposed algorithms are then applied and their performance is validated. These preparatory steps are necessary to obtain the best achievable accuracy, precision, and recall.

In this paper, classification algorithms are applied to the diabetic patients’ dataset to predict the presence of diabetes, and we achieved an accuracy of 76% on the test set. Moreover, we obtained this accuracy with traditional machine learning approaches combined with some data preprocessing techniques.

3 Procedural Approach and Methodology

The procedural approach is shown in Fig. 1.

Fig. 1 Proposed architecture

3.1 Data Extraction

The data used is the PIMA Indian diabetes dataset, whose aim is to predict whether or not a patient is diabetic on the basis of several attributes included in the dataset. Different criteria were used in the selection of these values from the database.

The dataset contains the medical details of 768 patients, which were used for classification. These details are stored in 768 rows and 9 columns: 8 attributes and one class column, ‘Outcome’ (diabetic or non-diabetic), as shown in Table 1.

Table 1 List of attributes present in the dataset and their data type

3.2 Data Preprocessing

Collected data cannot be used directly for the study; it must first be processed and cleaned so that useful information can be extracted from the raw data. Raw data is expected to contain inconsistencies, anomalies, out-of-bound values, missing values, or a format unsuitable for our model. Moreover, the vast data found in present-day business, science, industry, and academia needs complex mechanisms to analyze it. Preprocessing includes data cleaning, data transformation, and data reduction tasks, which reduce the complexity of the data and identify and eliminate irrelevant and noisy elements through feature selection or discretization.

Elimination of Null Values

The data was checked for null values, first across all features and then in the individual feature columns. Elimination of Not a Number (NaN) values: we replaced the null values of ‘glucose’ and ‘blood pressure’ with the mean of the respective attribute, and the null values of ‘skin thickness’, ‘insulin’, and ‘BMI’ with the median of the respective attribute.

Table 2 shows the number of null values present in the dataset before and after the elimination of null values.
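The imputation rule described above can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the column keys, the toy rows, and the helper names `count_nulls` and `impute` are hypothetical, and missing entries are assumed to be represented as `None` (if they are encoded some other way, for example as zeros, they would need converting first).

```python
from statistics import mean, median

# Hypothetical column keys; the paper names the attributes but not the exact keys.
MEAN_COLS = ["Glucose", "BloodPressure"]
MEDIAN_COLS = ["SkinThickness", "Insulin", "BMI"]


def count_nulls(rows):
    """Count None entries per column (the kind of tally reported in a null-count table)."""
    return {col: sum(r[col] is None for r in rows) for col in rows[0]}


def impute(rows, mean_cols=MEAN_COLS, median_cols=MEDIAN_COLS):
    """Return a copy of rows with None replaced by the column mean or median."""
    fill = {}
    plan = [(c, mean) for c in mean_cols] + [(c, median) for c in median_cols]
    for col, stat in plan:
        observed = [r[col] for r in rows if r[col] is not None]
        fill[col] = stat(observed)
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]


# Toy records with made-up values, for illustration only.
rows = [
    {"Glucose": 100, "BloodPressure": 70, "SkinThickness": 20, "Insulin": 80, "BMI": 30.0},
    {"Glucose": None, "BloodPressure": 80, "SkinThickness": None, "Insulin": 100, "BMI": 20.0},
    {"Glucose": 140, "BloodPressure": None, "SkinThickness": 30, "Insulin": None, "BMI": None},
    {"Glucose": 120, "BloodPressure": 60, "SkinThickness": 40, "Insulin": 150, "BMI": 40.0},
]

nulls_before = count_nulls(rows)
cleaned = impute(rows)
nulls_after = count_nulls(cleaned)
```

The median is less sensitive to extreme values than the mean, which is one common reason for preferring it on skewed attributes; the paper does not state why each statistic was chosen.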

Table 2 Null values count

Evaluation of Class Distribution

The data was checked to see whether it is distributed evenly between the two outcomes of the target variable.
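A class-distribution check of this kind can be sketched in a few lines. The helper name `class_distribution` and the toy label list are assumptions for illustration and do not reflect the actual class counts of the dataset.

```python
from collections import Counter


def class_distribution(labels):
    """Return the fraction of samples belonging to each class label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


# Toy labels (1 = diabetic, 0 = non-diabetic), made up for illustration.
toy_outcomes = [0, 0, 1, 0]
distribution = class_distribution(toy_outcomes)
```

Fractions far from 0.5 indicate an imbalanced target, in which case accuracy alone can be misleading and precision and recall become more informative.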

3.3 Data Splitting

The entire dataset was divided into training and testing data: two-thirds of the dataset was used for training and the remaining one-third for testing. The training data is given to the model first so that it can learn the relationship between the input attributes and the output attribute. The testing data is given to the model after training is complete, to check how well the model performs.

So, here the testing data is one-third of the whole dataset and the training data is two-thirds, as shown in Table 3.
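As a rough sketch of such a split, the snippet below shuffles row indices and holds out one-third of them. The function name `split_indices`, the fixed seed, and the use of a random shuffle are assumptions; the paper does not say how rows were assigned to each part.

```python
import random


def split_indices(n_rows, test_fraction=1 / 3, seed=0):
    """Shuffle row indices and return (train_indices, test_indices)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


# 768 is the number of rows reported for the dataset.
train_idx, test_idx = split_indices(768)
```

With 768 rows this yields 512 training rows and 256 testing rows, and no row appears in both parts.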

Table 3 Verification of data splitting

3.4 Machine Learning Methodologies

K-Nearest Neighbor

KNN is a supervised machine learning algorithm whose aim is to find the class of a given input. K is the number of neighbors with which the input is compared. The input is assigned to the class that holds the majority among its K nearest neighbors. Nearness is measured with the Euclidean distance, where x and y hold the values of the independent attributes of the neighbor A and the new point B, and m is the number of independent attributes:

$$ \text{dist}\left( A,B \right) = \sqrt{\sum_{i = 1}^{m} \left( x_{i} - y_{i} \right)^{2}} \tag{1} $$
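Equation (1) and the majority vote can be sketched as below. This is an illustrative standard-library version rather than the implementation used in the study; the function names, the value k = 3, and the toy points are assumptions.

```python
from collections import Counter
from math import sqrt


def euclidean(x, y):
    """Euclidean distance between two equal-length attribute vectors, as in Eq. (1)."""
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority class among the k training points nearest to query."""
    by_distance = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy two-attribute data: three points of class 0 and three of class 1.
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_y = [0, 0, 0, 1, 1, 1]

near_class_0 = knn_predict(train_X, train_y, (2, 2))
near_class_1 = knn_predict(train_X, train_y, (8.5, 8.5))
```

Because the distance sums raw attribute differences, attributes with large numeric ranges dominate it, so in practice features are often rescaled before KNN is applied; the paper does not say whether scaling was used.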

Support Vector Machine

SVM is a supervised algorithm that learns from labeled training data and creates a hyperplane separating the points according to their classes. In 2D space, this hyperplane is a line that splits the plane into two parts, one for each class. Linear SVM is a technique for generating a classifier that can distinguish between labeled classes. Given two classes of points, it tries to maximize the geometric margin between them. The letter ‘Z’ is used in solving the maximum-margin problem under the separability constraint.
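For reference, the maximum-margin idea is commonly written as the following optimization problem for linearly separable data. This is the standard textbook hard-margin formulation and not necessarily the exact form used in this study; the symbols w (weight vector), b (bias), and y_i in {-1, +1} (class labels) are introduced here for illustration only.

$$ \min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2} \quad \text{subject to} \quad y_{i}\left( w^{\top} x_{i} + b \right) \ge 1,\quad i = 1,\ldots,n $$

The separating hyperplane is the set of points where w^T x + b = 0, and the margin between the two classes has width 2/||w||, so minimizing ||w|| maximizes the margin.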

Logistic Regression

Logistic regression is an algorithm for predicting binary outcomes such as zero and one (in our case, non-diabetic or diabetic). Linear regression is ineffective for classifying a binary variable because it predicts continuous values that can fall outside the range of 0 to 1.
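The way logistic regression keeps its output between 0 and 1 can be illustrated with the logistic (sigmoid) function applied to a linear score. The weights, bias, threshold of 0.5, and the two-attribute inputs below are made-up values for illustration, not fitted parameters from the study.

```python
from math import exp


def sigmoid(z):
    """Logistic function: squashes a linear score into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def predict_proba(weights, bias, x):
    """Estimated probability of the positive class for attribute vector x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def predict(weights, bias, x, threshold=0.5):
    """Return 1 (diabetic) if the probability reaches the threshold, else 0."""
    return 1 if predict_proba(weights, bias, x) >= threshold else 0


# Hypothetical parameters for two attributes (for example, glucose and BMI).
weights, bias = [0.05, 0.1], -9.0
low_risk = predict(weights, bias, [90, 22.0])    # linear score is negative
high_risk = predict(weights, bias, [170, 35.0])  # linear score is positive
```

A linear score can be any real number, whereas the sigmoid output can be read as a probability and thresholded into one of the two classes.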

Artificial Neural Networks

An ANN is made up of interconnected neurons arranged in three kinds of layers: the input layer, the hidden layer, and the output layer. The hidden layer can itself consist of multiple layers. The nodes in successive layers are all linked together. Every neuron has an activation function, a transformation applied to the node’s value before it is passed to the next layer as input. The result of a node is computed as in Fig. 2.

Fig. 2 Neuron in ANN
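A single neuron of this kind is usually computed as a weighted sum of its inputs plus a bias, passed through the activation function. The sketch below assumes that standard form and a sigmoid activation; the paper does not state which activation or layer sizes were used, so the names and numbers here are illustrative only.

```python
from math import exp


def sigmoid(z):
    """One common choice of activation function (an assumption here)."""
    return 1.0 / (1.0 + exp(-z))


def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def layer(inputs, weight_rows, biases, activation=sigmoid):
    """Fully connected layer: every neuron receives all inputs."""
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]


# Toy forward pass: 2 inputs -> 2 hidden neurons -> 1 output neuron.
hidden = layer([1.0, 2.0], weight_rows=[[0.5, -0.25], [0.1, 0.2]], biases=[0.0, 0.0])
output = neuron(hidden, weights=[1.0, -1.0], bias=0.0)
```

The outputs of one layer become the inputs of the next, which is what linking the nodes in successive layers means in practice.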

3.5 Model Validation

The methodologies were implemented in a Jupyter notebook. The data was analyzed using data visualization techniques, and the results were confirmed via performance evaluation metrics such as accuracy, precision, recall, and F1-score. Cross-validation was used for evaluation. In k-fold cross-validation, the data is broken into k mutually exclusive sets of equal size; in each round, one set is used for testing and the remaining sets are used for training.
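The fold bookkeeping behind k-fold cross-validation can be sketched as follows: each fold serves exactly once as the held-out test set while the remaining k - 1 folds form the training set. The generator name `k_fold_indices` and the choice k = 5 are assumptions, since the paper does not report the value of k.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of k near-equal folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Toy example: 10 rows, 5 folds.
splits = list(k_fold_indices(10, 5))
```

Averaging a metric over the k rounds gives a less split-dependent estimate than a single train/test division. In practice the rows are usually shuffled before folding; that step is omitted here for brevity.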

Evaluation Metrics

The study is validated via the confusion matrix, using metrics such as accuracy, precision, recall, and F1-score.

True Positives (TP): Total predicted diabetic cases, validated as diabetic.

True Negatives (TN): Total predicted non-diabetic cases, validated as non-diabetic.

False Positives (FP): Total predicted diabetic cases, validated as non-diabetic.

False Negatives (FN): Total predicted non-diabetic cases, validated as diabetic.

$$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}},\quad \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}, $$
$$ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}},\quad F1\text{-score} = \frac{2\text{TP}}{2\text{TP} + \text{FP} + \text{FN}} $$
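The four counts and the metrics defined above can be computed directly from true and predicted labels. The sketch below is illustrative: the helper names and the toy label lists are assumptions, and the numbers are not results from the study.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN), treating label 1 (diabetic) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def metrics(tp, tn, fp, fn):
    """Precision, recall, accuracy, and F1-score from the four counts."""
    return {
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }


# Toy labels, made up for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)
scores = metrics(*counts)
```

In this toy case there are three true positives, three true negatives, one false positive, and one false negative, so all four metrics come out to 0.75.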

In this study, a confusion matrix was used for model validation. The confusion matrix is shown in Table 4, and the confusion matrix of the KNN model is shown in Fig. 4. Figure 3 shows the learning curve of the KNN model, which plots the training score and the cross-validation score. Table 5 shows the accuracy, precision, recall, and F1-score of the KNN model. All the other algorithms were validated in the same way.

Fig. 3 Learning curve of KNN

Table 4 Confusion matrix

Fig. 4 Confusion matrix of KNN

Table 5 Accuracy, precision, recall, and F1-score of the KNN model

4 Results

In this research, we performed diabetes prediction on the PIMA Indian dataset to predict whether a person is diabetic or not. First, the data was preprocessed by replacing all Not a Number (NaN) values with the mean or the median of the respective attribute. The prediction was then made using four machine learning algorithms: KNN, SVM, logistic regression, and artificial neural networks. Among the four algorithms, KNN showed the best accuracy, 76%. The accuracy, precision, recall, and F1-score of all four algorithms are shown in Table 6.

Table 6 Performance measures of all the four algorithms

5 Conclusion and Future Work

In this study, we sought to address the complications that occur during the diagnosis of diabetes. The study sheds light on different machine learning algorithms, namely SVM, KNN, logistic regression, and ANN, for predicting whether a patient is diabetic or not. Of these, KNN performed best with an accuracy of 76%; hence, among the algorithms compared, it is the better option for classifying such data.

In the future, we will try to develop better mechanisms and use a much larger dataset to increase the accuracy, helping medical practitioners treat patients and overcome this deadly disease.