1 Introduction

Diabetes mellitus, more commonly referred to as “diabetes”, is a chronic disease associated with abnormally high levels of the sugar glucose in the blood. Diabetes is due to one of two mechanisms: inadequate production of insulin (which is made by the pancreas and lowers blood glucose), or inadequate sensitivity of cells to the action of insulin [2]. Diabetes mellitus may also develop as a secondary condition linked to another disease, such as pancreatic disease; to a genetic syndrome, such as myotonic dystrophy; or to drugs, such as glucocorticoids. Gestational diabetes is a temporary condition associated with pregnancy: blood glucose levels increase during pregnancy but usually return to normal after delivery [3]. Based on data from the 2011 National Diabetes Fact Sheet, diabetes affects an estimated 25.8 million people in the US, about 8.3% of the population. Additionally, approximately 79 million people are estimated to have pre-diabetes [4]. Pre-diabetes refers to blood glucose levels that are higher than normal but not high enough for a diagnosis of diabetes. Increased awareness and treatment of diabetes should begin with prevention. Studies on diabetes prediction have been conducted for several years. The primary objectives are to identify which variables place an individual at high risk of diabetes and to enable preventive action for individuals at increased risk of the disease. Several parameters are considered in this study; they are explained in the next section. A healthy diet, regular physical activity, maintaining a healthy body weight and avoiding the use of tobacco can prevent or delay the onset of type 2 diabetes mellitus [5].

Nowadays, medical healthcare systems are rich in information, and wise use of these data can produce predictive outcomes. Pima Indians in Arizona have participated in a longitudinal diabetes study whose data are publicly available and are used by many researchers for the study of diabetes. Various tools and techniques of Artificial Intelligence have been devised for the early detection of diabetes by extracting information from this large data set. Pardha Repalli [6] predicted how likely people in different age groups are to be affected by diabetes based on their lifestyle activities, and also identified the factors responsible for an individual becoming diabetic. Various classification methods have also been used to detect diabetes. Frequently used classifiers and clustering techniques include random forest, K-Means, the J48 algorithm and fuzzy approaches [7, 8]. Artificial Neural Networks (ANN) have also been widely used in medical research for the prediction of diseases such as malaria and cancer [9, 10]. Some research has also shown that ANN is suited to the early diagnosis of diabetes [11]. Classifying and predicting a patient’s condition based on risk factors is one application of artificial neural networks. The predictive capability of each neural network was analyzed both on the full training dataset and on unseen data.

2 Methodology

The following subsections describe the process through which the diabetes prediction model was developed.

2.1 Data Collection

The data were collected from Kaggle (a platform for predictive modeling and analytics competitions on which companies and researchers post data for research purposes), specifically the Pima Indians Diabetes Dataset. The dataset originates from the National Institute of Diabetes and Digestive and Kidney Diseases and was donated by the RMI Group Leader, Applied Physics Laboratory, The Johns Hopkins University [12]. These data have already been used for forecasting the onset of diabetes mellitus using the ADAP learning algorithm. All patients in this dataset are females at least 21 years old of Pima Indian heritage. The total number of instances is 768, all of which are used in this study. The dataset contains 8 attributes plus one class (label) column. Each attribute is numeric-valued; the attributes of this dataset are as follows:

  • Number of times pregnant

  • Plasma glucose concentration at 2 h in an oral glucose tolerance test

  • Diastolic blood pressure (mm Hg)

  • Triceps skinfold thickness (mm)

  • 2-hour serum insulin (mu U/ml)

  • Body mass index (weight in kg/(height in m)\(^2\))

  • Diabetes pedigree function

  • Age (years)

  • Class variable (0 or 1).

This dataset also contains missing attribute values, which are handled in the next step of the methodology (preprocessing) using statistical techniques. A class value of 1 is interpreted as “tested positive for diabetes”, and a class value of 0 as “tested negative for diabetes”. The data are divided into 688 instances for training and 80 for testing. The mean of each attribute is shown in Table 1. Screenshots of the training and testing sample data are shown in Figs. 1 and 2, respectively.
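
The 688/80 division described above could be sketched as follows. This is only an illustration: the text does not say whether the split was random or sequential, so the shuffling, the seed and the helper name `split_train_test` are assumptions, and a placeholder array stands in for the real data.

```python
import numpy as np

N_TRAIN, N_TEST = 688, 80  # split sizes stated in the text


def split_train_test(data, n_train=N_TRAIN, seed=0):
    """Shuffle the rows (an assumption) and split them into training and test sets."""
    rng = np.random.default_rng(seed)
    shuffled = data[rng.permutation(len(data))]
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder standing in for the 768 x 9 dataset (8 attributes + class label).
placeholder = np.zeros((N_TRAIN + N_TEST, 9))
train, test = split_train_test(placeholder)
print(train.shape, test.shape)
```

Fixing the seed keeps the split reproducible between runs.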

Table 1 Mean value of attributes

Fig. 1 Screenshot of training sample data

Fig. 2 Screenshot of testing sample data

2.2 Data Preprocessing

Some attributes contain missing values, which are recorded as 0. These missing values can lead to inaccurate results and may reduce model accuracy. To handle them, the column-mean method is used: each 0 is replaced with the mean of its column [13]. Programmatically, the mean function of Python’s NumPy package was used to compute the mean and to overwrite the 0 entries of the existing column array with the calculated result [14]. It is also important to prioritize the attributes, so that the Artificial Neural Network calculates the weight of each neuron (attribute) according to the given priority. Prioritizing the attributes is needed to achieve higher accuracy in diabetes detection, and it shows which factors affect diabetes detection and with what priority. Table 2 shows the attribute priority.
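
A minimal sketch of the column-mean replacement with NumPy is given below. The helper name `impute_zeros_with_mean` and the sample values are hypothetical, and the choice to compute the mean over the non-zero entries only is an assumption, since the text does not say whether the zeros are included in the mean or which columns are treated this way.

```python
import numpy as np


def impute_zeros_with_mean(column):
    """Replace 0 entries (treated as missing) with the mean of the non-zero entries."""
    column = np.asarray(column, dtype=float)
    missing = column == 0
    if missing.all():
        return column  # no observed values to estimate a mean from
    filled = column.copy()
    filled[missing] = column[~missing].mean()
    return filled


glucose = [148, 0, 183, 89, 0]  # made-up values for illustration
print(impute_zeros_with_mean(glucose))
```

The same function would be applied column by column to each attribute that uses 0 as a missing-value marker.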

Table 2 Attribute priority

2.3 Model Building

For the prediction of diabetes, the model is built in core Python using the Artificial Neural Network (ANN) algorithm.

Python: Python is a general-purpose language; most data analysis functionality is available in packages such as NumPy and pandas [15]. It has efficient high-level data structures and a simple but effective approach to object-oriented programming.

Artificial Neural Networks (ANN): Artificial Neural Networks (as shown in Fig. 3) are a family of models inspired by biological neural networks (the central nervous systems of animals, in particular the brain). They are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown [16]. Artificial neural networks are presented as systems of interconnected “neurons” which exchange messages with each other. The connections have numeric weights that can be tuned based on experience, making neural nets adaptive to inputs and capable of learning [17]. Three types of parameters typically define an ANN:

  1. The interconnection pattern between the different layers of neurons

  2. The learning process for updating the weights of the interconnections

  3. The activation function that converts a neuron’s weighted input to its output activation.

Mathematically, a neuron’s network function \(f(x)\) is defined as a composition of other functions \(g_i(x)\), which can themselves be defined as compositions of other functions. This can be conveniently represented as a network structure, with arrows depicting the dependencies between variables. A widely used type of composition is the nonlinear weighted sum (given in Eq. 1),

$$\begin{aligned} f(x)=K\left( \sum \limits _{i} w_i\, g_i(x)\right) \end{aligned}$$
(1)

where \(w_i\) are the connection weights and \(K\) (commonly referred to as the activation function) is some predefined function, such as the hyperbolic tangent.
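
As a small illustration of Eq. 1, the sketch below evaluates the nonlinear weighted sum for a single neuron, using the hyperbolic tangent as the activation function K. The function name `neuron_output` and the example numbers are hypothetical.

```python
import math


def neuron_output(weights, inputs, activation=math.tanh):
    """Compute K(sum_i w_i * g_i(x)) for already-evaluated inputs g_i(x)."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    weighted_sum = sum(w * g for w, g in zip(weights, inputs))
    return activation(weighted_sum)


# The weighted sum is 0.5 - 0.5 = 0, and tanh(0) = 0.
print(neuron_output([0.5, -0.5], [1.0, 1.0]))
```

Passing a different callable as `activation` swaps K without changing the weighted sum.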

Fig. 3 Artificial neural networks (ANN) model

The ANN algorithm has various components that simulate the values and learn from historical data for better prediction [15]. These components were written in Python as functions to call and execute:

  1. Read_CSV(): Reads the training data file Diabetes_TrainingData.csv and converts it into an array that Python can process. Using the pandas package [18] and its related functions, the array can be formed and supplied easily as the input training values.

  2. Assigning Random weight(): Random initial weights are assigned first to the connections from the INPUT_NEURONS to the hidden layer (WiH), then from the hidden input neurons (HiD) to the HIDDEN_NEURONS, and finally from the HIDDEN_NEURONS to the OUTPUT_NEURONS.

  3. NeuralNetwork(): The epoch counter is first initialized to \(\text {epoch} = 0\), and the number of training repetitions, TRAINING_REPS, is set; TRAINING_REPS should always be greater than epoch. TrainInputs[ ] is an array that stores the weights and input neuron values, while trainOutput[ ] stores the output values of the hidden neurons, from which the network learns for new values.

  4. feedForward(): The input neuron values are transferred to the hidden layer neurons, where each value is multiplied by its weight; the sum of all these products is stored in the actual variable.

  5. backPropagate(): Backpropagation is a method to calculate the gradient of the loss function with respect to the weights in an artificial neural network [19]. It is commonly used as part of algorithms that optimize the performance of the network by adjusting the weights. Here, backpropagation calls the sigmoidDerivative function, defines LEARN_RATE (initially a low value), and then calculates the error in each sigmoid layer.

  6. sigmoid(val): The sigmoid function is a type of activation function for artificial neurons. The most basic activation function is the Heaviside function (binary step: 0 or 1, low or high). The sigmoid function (a special case of the logistic function) and its formula are shown in Fig. 4.

  7. ErrorCal(): Computes the final error between the actual and predicted values, which indicates the model accuracy; the final error is 8% at the end of model building and prediction.

  8. Graph_Plot(): Shows the result in graphical format. The Matplotlib package is used to plot the graph of actual and predicted values [20]. This graph is shown in the result section.
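
The feed-forward and backpropagation steps listed above can be sketched roughly as follows. This is not the authors’ code: the constant and function names are borrowed from the list, but the hidden-layer size, the values of LEARN_RATE and TRAINING_REPS, the weight initialization range, the averaging of the gradient over samples, the absence of bias terms, the `train` helper and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

INPUT_NEURONS = 8      # one per attribute in the dataset
HIDDEN_NEURONS = 4     # assumed; the text does not give the hidden-layer size
OUTPUT_NEURONS = 1     # single class output
LEARN_RATE = 0.1       # assumed value
TRAINING_REPS = 2000   # assumed value


def sigmoid(val):
    return 1.0 / (1.0 + np.exp(-val))


def sigmoidDerivative(out):
    """Derivative of the sigmoid, written in terms of the sigmoid's output."""
    return out * (1.0 - out)


def feedForward(inputs, w_ih, w_ho):
    """Propagate inputs through the hidden layer to the output layer."""
    hidden = sigmoid(inputs @ w_ih)
    output = sigmoid(hidden @ w_ho)
    return hidden, output


def backPropagate(inputs, targets, hidden, output, w_ih, w_ho):
    """One gradient-descent step on the squared error; updates weights in place."""
    n = len(inputs)
    output_delta = (output - targets) * sigmoidDerivative(output)
    hidden_delta = (output_delta @ w_ho.T) * sigmoidDerivative(hidden)
    w_ho -= LEARN_RATE * (hidden.T @ output_delta) / n
    w_ih -= LEARN_RATE * (inputs.T @ hidden_delta) / n


def train(inputs, targets, seed=0):
    """Assign random weights, then alternate feed-forward and backpropagation."""
    rng = np.random.default_rng(seed)
    w_ih = rng.uniform(-0.5, 0.5, (INPUT_NEURONS, HIDDEN_NEURONS))
    w_ho = rng.uniform(-0.5, 0.5, (HIDDEN_NEURONS, OUTPUT_NEURONS))
    errors = []
    for epoch in range(TRAINING_REPS):
        hidden, output = feedForward(inputs, w_ih, w_ho)
        errors.append(float(np.mean((output - targets) ** 2)))
        backPropagate(inputs, targets, hidden, output, w_ih, w_ho)
    return w_ih, w_ho, errors


# Synthetic stand-in data: 40 samples with 8 features scaled to [0, 1].
data_rng = np.random.default_rng(1)
X = data_rng.random((40, INPUT_NEURONS))
y = (X[:, 1] > 0.5).astype(float).reshape(-1, 1)
w_ih, w_ho, errors = train(X, y)
print(errors[0], errors[-1])  # mean squared error before and after training
```

The hidden-layer delta is computed before the output weights are updated, so both updates use the weights from the same forward pass.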

Fig. 4 Sigmoid function
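
Since the figure is not reproduced here, the sigmoid can be written out under the assumption that it takes the standard logistic form; the symbol \(\sigma \) is notation introduced for this illustration. Its derivative, which is the quantity a sigmoidDerivative step would need during backpropagation, follows directly:

$$\begin{aligned} \sigma (x)=\frac{1}{1+e^{-x}}, \qquad \sigma '(x)=\sigma (x)\left( 1-\sigma (x)\right) \end{aligned}$$

The output lies strictly between 0 and 1, with \(\sigma (0)=0.5\), which is why it can be read as a smooth version of the binary step.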

3 Accuracy Measurement of Model

The Root Mean Squared Error (RMSE) and the area under the ROC (Receiver Operating Characteristic) curve are the performance measures considered for analyzing the accuracy of the ANN model [21]. The ANN produced an RMSE of 0.39 and an ROC area of 0.88. According to the performance guide for classification accuracy, an ROC area above 0.80 indicates a GOOD classifier, while an ROC area of 0.77 is rated FAIR. The closer the ROC area is to 1, the more accurate the classifier’s predictions. A screenshot of the output is shown in Fig. 5.
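
To make the two measures concrete, the sketch below computes them from lists of actual labels and predicted scores. The function names and sample values are hypothetical, and the ROC area is computed with the pairwise-ranking definition (the fraction of positive/negative pairs in which the positive case receives the higher score), which may differ from the tool the authors used.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual labels and predicted scores."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def roc_area(actual, predicted):
    """ROC area as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for a, p in zip(actual, predicted) if a == 1]
    neg = [p for a, p in zip(actual, predicted) if a == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


actual = [1, 0, 1, 0]
predicted = [0.9, 0.2, 0.4, 0.6]  # made-up scores for illustration
print(rmse(actual, predicted), roc_area(actual, predicted))
```

In this toy example three of the four positive/negative pairs are ordered correctly, so the ROC area is 0.75.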

Fig. 5 Output prediction: actual versus predicted

Fig. 6 Graph between predicted and actual value

4 Result and Conclusion

The result of the diabetes prediction model is shown in graphical format (Fig. 6), where the red line shows the predicted values and the blue line shows the actual values. A predicted value above 0.5, and thus closer to 1, is treated as 1 (positive), whereas a predicted value below 0.5, and thus closer to 0, is treated as 0 (negative). The results demonstrate the prediction capacity of the ANN-based model, which can predict the possibility of developing diabetes in the community of Pima Indians. It is also observed that learning from a larger sample dataset can improve accuracy and reduce the error rate. By using relative values of the parameters for other demographic areas, the model can be scaled up to larger areas. This model is useful for health policy makers, who can take preventive action before diabetes occurs in large numbers.
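
The 0.5 cut-off described above amounts to a one-line rule, sketched here with a hypothetical helper `to_class`. The text does not say how a value of exactly 0.5 is handled; treating it as negative is an assumption.

```python
def to_class(predicted_value, threshold=0.5):
    """Map a network output in [0, 1] to a class label: 1 (positive) or 0 (negative)."""
    # A value exactly equal to the threshold is treated as negative (assumption).
    return 1 if predicted_value > threshold else 0


print([to_class(v) for v in [0.91, 0.12, 0.58, 0.34]])  # [1, 0, 1, 0]
```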