1 Introduction

Machine learning plays a major role in predicting diseases from related healthcare datasets [1]. It analyses different attributes and patient lab records and, on the basis of a suitable learning strategy, predicts whether or not a patient has a certain type of disease. It can also predict the severity of a disease by analysing the outcomes of the various attributes or features drawn from different health issues or diseases. For example, by examining a patient dataset it can predict whether the patients have cancer or not, and by learning from the different features of a cancer dataset it can also predict the type of cancer a patient has, that is, whether it is benign or malignant [2]. Machine learning techniques are broadly divided into two categories, supervised and unsupervised [3], with different applications and uses in predicting diseases and analysing disease datasets. Supervised learning provides the facility to derive a result of interest from related labelled information, whereas unsupervised learning works in a different manner.

Some patients appear abnormal because they have unusual combinations of lab results and comorbidities. In such cases we look for interesting structures within the data, related not to predefined categories but to the properties of the data themselves; this is called unsupervised learning. Machine learning can help healthcare analysts with tasks such as precision medicine, and it plays an important role in advancing efforts toward important goals such as helping healthcare evolve. To build a suitable and efficient learning model, we can utilize information gathered from completed studies, socioeconomic data of patients, medical records, and other sources. The distinction between the conventional approach and the machine learning approach to disease prediction lies in the number of factors considered: in a conventional approach, many important factors are not taken into account, whereas the machine learning approach considers a substantially larger number of factors, which brings greater precision to health data. Machine learning has a tremendous capability for analysing, visualizing and predicting different kinds of data, and due to its wide applicability in the healthcare sector, one can build a machine learning model that analyses, visualizes and predicts various kinds of diseases [4]. To accommodate the disease prediction instances mentioned above, we use different classification algorithms of machine learning, in the form of a proposed predictive model, for the analysis and prediction of suitable results. For better understanding of the results, various performance measures are considered, and we also provide a statistical analysis of the various algorithms used in the model.

The rest of the paper is organized as follows. Section 2 describes the different work done in the past on various disease datasets. In Section 3, the methodologies used in our new predictive model for classifying disease datasets are discussed. Section 4 deals with the experiments and their results, giving predictions of various performance measures, and Section 5 concludes the paper.

2 Related works and disease dataset characteristics

2.1 Disease dataset

We look into some of the most common and serious diseases found in patients and analyse the corresponding datasets, collected from the UCI repository [5]. We compare different aspects of ML algorithms on these disease datasets: by analysing the attributes of each dataset, we aim to better predict the causes and circumstances that lead to a particular disease, to predict the severity of the disease in a patient, and to identify patients who are diagnosed with the disease, based on the performance analysis of each disease dataset using our new predictive model. The diseases chosen for our experimental setup are as follows.

  • CKD (Chronic Kidney Disease)

  • Heart Disease (Cardio Vascular Disease)

  • Diabetes

  • Breast Cancer (Wisconsin Breast Cancer dataset)

  • Hepatitis

  • ILPD (Indian Liver Patient dataset)

2.2 CKD (chronic kidney disease)

Chronic kidney disease (CKD) [6] is a major worldwide health problem, with a lifetime risk of more than 50%, greater than that of invasive cancer, diabetes, and coronary heart disease. CKD is defined as the presence of renal impairment, revealed by abnormal excretion of albumin or diminished renal function, and is measured by a glomerular filtration rate (GFR) that remains reduced for more than 3 months [7]. The GFR is the best indicator of how well the kidneys work. CKD has been analysed [8, 9] with various machine learning techniques for better prediction in the literature [10]. The dataset we used for CKD is from the UCI repository [5]: it has 400 instances and 25 attributes, contains some missing values, and is analysed to predict whether the patient has CKD or not.

2.3 Heart disease (cardiovascular disease)

Cardiovascular disease (CVD) [3] is a class of diseases involving the heart or blood vessels. Cardiovascular diseases include coronary artery disease (CAD), such as angina and myocardial infarction (commonly known as a heart attack). Other CVDs include stroke, heart failure, hypertension, rheumatic heart disease, cardiomyopathy, cardiac arrhythmia, congenital heart defects, valvular heart disease, aortic aneurysms, peripheral artery disease, thromboembolic disease and venous thrombosis [11]. In the literature, many heart disease datasets have been analysed with different machine learning techniques [11, 12]. The cardiovascular dataset used here is the Cleveland Heart dataset, which has 303 instances, 14 features and 5 classes [12]. The goal field refers to the presence of cardiac disease in the patient and is an integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have focused on the simple attempt to distinguish presence (values 1, 2, 3, 4) from absence (value 0).

2.4 Hepatitis

Hepatitis is inflammation of the liver tissue [2, 13]. Some people have no symptoms, while others develop yellowing of the skin and the whites of the eyes, lack of appetite, vomiting, fatigue, abdominal pain or diarrhoea. Hepatitis may be temporary (acute) or long-term (chronic), depending on whether it lasts less than or more than six months. Acute hepatitis can sometimes resolve on its own, progress to chronic hepatitis, or, rarely, give rise to acute hepatic failure. Over time, the chronic form can progress to liver scarring, liver failure, or liver cancer. Machine learning plays a vital role in the prediction of hepatitis through data analysis, and related work exists in the literature [14,15,16]. The dataset used for analysis is taken from the UCI repository [5]: it has 155 instances, 20 features and 2 classes, labelled Live or Die, and contains some missing values that are treated during pre-processing.

2.5 Diabetes

Diabetes mellitus (DM), commonly known as diabetes [17], is a group of metabolic disorders in which blood sugar levels are high over an extended period of time. Symptoms of high blood sugar include frequent urination, increased thirst, and increased hunger. If left untreated, diabetes can cause many complications. The classic symptoms of untreated diabetes are weight loss, polyuria (increased urination), polydipsia (increased thirst) and polyphagia (increased hunger). Machine learning plays a vital role in the prediction of diabetes through data analysis, and related work exists in the literature [18, 19]. The dataset used is the Pima Diabetes dataset from the UCI repository: it has 768 instances and 9 attributes, and the class takes the values tested_positive and tested_negative. Some major findings of diabetes prediction methodologies are summarized in Table 1.

Table 1 Findings of diabetes prediction methodologies

2.6 Wisconsin breast cancer dataset

Breast cancer is a disease that occurs when the cells of the breast tissue change (or mutate) and continue to reproduce [26]. These abnormal cells usually group together to form a tumor. A tumor is cancerous (or malignant) when the abnormal cells invade other parts of the breast or spread to other areas of the body through the bloodstream or the lymphatic system, a network of vessels and nodes that helps the body fight infection. The dataset used here is the Wisconsin Breast Cancer dataset, taken from the UCI repository. Machine learning plays a vital role in the prediction of cancer through data analysis, and related work exists in the literature [27,28,29,30]. The dataset has 699 instances and 10 features, and the class takes 2 values indicating whether the tumor is benign or malignant.

2.7 ILPD (Indian liver patient dataset)

Liver disease (also known as hepatic disease) is a type of damage to or disease of the liver [31]. A number of liver (hepatic) function tests are available to check whether the liver is functioning properly. These test for the presence in the blood of enzymes, metabolites or other products that are usually more abundant in liver tissue. Machine learning plays a vital role in the prediction of liver disease through data analysis, and related work exists in the literature [32, 33]. The dataset used in this study is the ILPD dataset [34], taken from the UCI machine learning repository [5]. This dataset contains 583 records with 11 attributes: 416 records of liver patients and 167 records of non-liver patients. The characteristics of the datasets considered are described in Table 2.

Table 2 Characteristics of disease dataset

3 New predictive model for analyzing disease datasets

3.1 Predictive model

In the literature, various works have applied different classification algorithms to a few disease datasets [1, 2]. However, no combined effort exists in which multiple disease datasets are analysed under a common framework, nor one that applies many different algorithms together with ensemble methods to obtain better performance measures. To address this, we take the various disease datasets described above and analyse them with our proposed model.

In this section, we describe the pre-processing methods, classification algorithms and ensemble methods used in our new predictive model for better prediction of the class in each disease dataset. The disease datasets are taken from the open-source UCI repository [5] and are first pre-processed using methods such as discretization, resampling and principal component analysis; in addition, misclassified instances are removed using the Decision Tree algorithm. The traditional classification algorithms, namely SVM, Naïve Bayes, KNN, Decision Tree and Random Forest [3, 35], are then combined with the ensemble methods Bagging and Boosting [36] to obtain better results and improved performance metrics for the different disease datasets. The pre-processing techniques, classification algorithms and ensemble techniques used in our predictive model are described as follows.

3.2 Pre-processing techniques

3.2.1 Discretization

Discretization is a method to transform dataset attributes from numeric to nominal values. We use discretization to transform the disease datasets from numerical to nominal values, which represent the class labels of the classification problem.
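As an illustration, a minimal sketch of this step in Python with scikit-learn is given below (the paper's experiments use Weka's own filters; the bin count and the toy attribute values here are illustrative assumptions).

```python
# Illustrative discretization sketch: bin a continuous attribute into
# nominal (ordinal-coded) values, analogous to Weka's Discretize filter.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[1.2], [3.4], [2.2], [7.8], [5.1], [6.6]])  # toy numeric attribute

# Three equal-width intervals; each bin becomes one nominal value.
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_nominal = disc.fit_transform(X)
print(X_nominal.ravel())  # e.g. [0. 1. 0. 2. 1. 2.]
```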

3.2.2 Resampling

It forms a new dataset by producing a subsample of the previous dataset using sampling with replacement. It is also used to handle missing values in the datasets.
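A minimal sketch of sampling with replacement, assuming scikit-learn in place of Weka's Resample filter and using the Wisconsin breast cancer data merely as a stand-in dataset, could look as follows.

```python
# Illustrative resampling sketch: draw a bootstrap subsample
# (sampling with replacement) of the same size as the original data.
from sklearn.datasets import load_breast_cancer
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)

X_boot, y_boot = resample(X, y, replace=True, n_samples=len(X),
                          random_state=42)
print(X_boot.shape)  # (569, 30): same size, but with duplicated rows
```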

3.2.3 Principal component

Principal components are used to reduce the number of features of the data. Generally, it is desirable for the reduced set of features to still describe a large amount of the "information" in the data. It helps in reducing features and improving prediction.
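For concreteness, a sketch of this reduction with scikit-learn is shown below; the 95% variance threshold and the stand-in dataset are illustrative assumptions, not values from the paper.

```python
# Illustrative PCA sketch: project the features onto the principal
# components that capture most of the variance ("information").
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

pca = PCA(n_components=0.95)               # keep 95% of the variance
X_reduced = pca.fit_transform(X_std)
print(X.shape[1], "->", X_reduced.shape[1], "features")
```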

3.2.4 Replacing misclassified with decision tree

In this technique, misclassified instances are first identified using a decision tree and then removed. This improves the accuracy of prediction of the class labels.
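A minimal sketch of this filter follows, assuming scikit-learn; the tree depth and the stand-in dataset are illustrative, and Weka's corresponding filter may differ in detail.

```python
# Illustrative sketch of removing misclassified instances: fit a
# decision tree on the data and drop the instances it gets wrong.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
mask = tree.predict(X) == y          # True where the tree agrees with the label
X_clean, y_clean = X[mask], y[mask]  # keep only correctly classified instances
print(f"removed {len(X) - len(X_clean)} suspected noisy instances")
```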

3.3 Classification algorithms

3.3.1 Support vector machine (SVM)

A Support Vector Machine (SVM) is a classifier that tries to maximize the margin between the training data and the classification boundary (i.e. the hyperplane defined by w^T x + b = 0). The idea is that maximizing the margin maximizes the chance that classification will be correct on new data, under the assumption that new data of each class lies near the training data of that class. An instance of the classifier is shown in Fig. 1.

Fig. 1
figure 1

SVM Classifier

It is formulated as follows:

  • w: decision hyperplane normal vector

  • xi: data point i

  • yi: class of data point i (+1 or − 1)

  • Classifier is:

    $$ \mathrm{f}\left(\mathbf{x}_i\right)=\operatorname{sign}\left(\mathbf{w}^{\mathrm{T}}\mathbf{x}_i+b\right) $$
    (1)
  • Functional margin of xi is:

    $$ y_i\left(\mathbf{w}^{\mathrm{T}}\mathbf{x}_i+b\right) $$
    (2)
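A minimal sketch of the decision rule in Eq. (1), assuming scikit-learn's linear SVC (rather than the paper's Weka implementation) and the Wisconsin data as a stand-in dataset, is given below: the learned w and b are read off the fitted model and sign(w^T x + b) is checked against the model's own predictions.

```python
# Minimal linear-SVM sketch: verify that prediction is sign(w^T x + b).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
clf = SVC(kernel="linear").fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
manual = np.sign(X @ w + b)  # f(x_i) = sign(w^T x_i + b), Eq. (1)

# Map the {-1, +1} signs back onto the {0, 1} class labels to verify.
print(np.array_equal(manual > 0, clf.predict(X) == 1))  # True
```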

3.3.2 NB (Naïve Bayes) classifier

Bayes Theorem:

P (A|B) = probability of A given that B is true.

$$ P\left(A\mid B\right)=\frac{P\left(B\mid A\right)\times P\left(A\right)}{P\left(B\right)} $$
(3)

B = Data, A = some event.

Naïve Bayes classifiers [37,38,39] are statistical classifiers based on Bayes' theorem. These classifiers predict class membership probabilities, such as the probability that a particular tuple belongs to a particular class. The naïve Bayes classifier assumes that the attributes are conditionally independent of one another given the class label. It first calculates the frequency of the various attribute values per class label, then computes the corresponding probabilities, and predicts the class label with the highest probability.
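A minimal sketch follows, assuming scikit-learn's Gaussian variant (one of several naïve Bayes variants; the paper does not specify which Weka implementation is used) and a stand-in dataset: the per-class posterior from Eq. (3) is inspected and its arg-max gives the prediction.

```python
# Minimal naïve Bayes sketch: P(class | x) ∝ P(x | class) P(class),
# with attributes assumed independent given the class.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_tr, y_tr)
# predict_proba gives the per-class posterior; the arg-max is the prediction.
print(nb.predict_proba(X_te[:1]), nb.predict(X_te[:1]))
```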

3.3.3 K – Nearest neighbour (KNN)

The k-nearest neighbours of a record x are the data points that have the k smallest distances to x [4, 38]. To classify an unknown record, its distance to the training records is computed. The Euclidean distance between two points or tuples, say, x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn), is calculated as;

$$ Dist\left(x,y\right)=\sqrt{\sum_{i=1}^n{\left({x}_i-{y}_i\right)}^2} $$
(4)

The class labels of the nearest neighbours are then used to determine the class label of the unknown record (e.g., by taking a majority vote).
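A minimal k-NN sketch is given below, assuming scikit-learn and a stand-in dataset; k = 5 is an illustrative choice, not a parameter reported in the paper.

```python
# Minimal k-NN sketch: Euclidean distance as in Eq. (4), majority vote
# among the k nearest training records.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_tr, y_tr)
print(knn.score(X_te, y_te))  # accuracy on the held-out records
```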

3.3.4 Decision tree

J48 [39, 40] is the Weka [21] implementation of the C4.5 algorithm. This algorithm induces classification rules in the form of decision trees from a set of given instances. It is an extension of the basic ID3 algorithm designed by Quinlan and works on categorical as well as continuous values. The internal nodes of the tree denote tests on attributes, the branches between nodes represent the possible values of those attributes, and the terminal nodes represent the final values of the dependent variable. In the literature, it is widely used for disease prediction [40].
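For illustration, a decision-tree sketch in scikit-learn is shown below; note that scikit-learn implements CART rather than C4.5/J48, so this is only an analogous learner, and the depth limit and dataset are illustrative assumptions.

```python
# Decision-tree sketch (CART, analogous to J48's induced if-then rules).
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                              random_state=0).fit(X, y)

# Print the induced rules: internal nodes test attributes, leaves give classes.
print(export_text(tree, max_depth=2))
```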

3.4 Ensemble methods

Ensemble methods construct a set of classifiers from the training data and predict the class label of previously unseen records by aggregating the predictions of multiple classifiers [36]. These methods combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M* of increased accuracy.

3.4.1 Random Forest, bagging, and boosting

Random forest [41, 42] is an extension of bagging used for classification or regression. Decision trees are built with a greedy algorithm that selects the best split point at each step of tree construction, so bagged trees end up looking very similar; this limits the variance reduction that bagging can achieve and harms the robustness of the predictions. Random forest improves on bagged decision trees by restricting the greedy algorithm at each split, typically to a random subset of the features, so that the individual trees are de-correlated.

The bagging ensemble is constructed by forming a sequence of classifiers, each obtained by running a specific algorithm on a different bootstrap version of the training dataset; the predictions of these classifiers, which are all of the same type, are then combined. In a similar fashion, boosting is constructed by forming a sequence of classifiers obtained by running a learning algorithm repeatedly while changing the distribution of the training set: unlike bagging, the performance of the previous classifiers affects how each new classifier is trained [43].
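A minimal sketch contrasting the two strategies is given below, assuming scikit-learn's Bagging and AdaBoost implementations (the paper's experiments run the equivalent meta-learners in Weka); the base learner, ensemble size and dataset are illustrative assumptions.

```python
# Sketch of the two ensemble strategies: bagging trains each tree on a
# bootstrap sample and votes; AdaBoost reweights instances so each new
# learner focuses on what the previous ones misclassified.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
base = DecisionTreeClassifier(max_depth=3, random_state=0)

bagging = BaggingClassifier(base, n_estimators=50, random_state=0)
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, clf in [("bagging", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(clf, X, y, cv=10).mean())
```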

3.5 Performance measures

We use four different measures for the evaluation of classification quality: accuracy, precision, recall and F-measure [41, 42]. These measures can be calculated from the confusion matrix given below.

 

                       Labels being predicted
                       Positive               Negative
True    Positive       TP (True Positive)     FN (False Negative)
label   Negative       FP (False Positive)    TN (True Negative)

3.5.1 Accuracy

Accuracy measures the proportion of correctly classified instances among all instances. It is calculated in the following manner.

$$ \mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}} $$
(5)

3.5.2 Precision

Precision describes what proportion of the instances predicted positive are actually positive.

$$ \mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} $$
(6)

3.5.3 Recall

Recall describes the proportion of actual positive cases that are correctly classified.

$$ \mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} $$
(7)

3.5.4 F-measure

The F-measure combines the two previously defined measures, precision and recall, into a single score (their harmonic mean). It is formulated as follows:

$$ \mathrm{F}-\mathrm{Measure}=\frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}} $$
(8)
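For concreteness, the sketch below computes Eqs. (5)-(8) directly from the confusion matrix cells, assuming scikit-learn and a stand-in dataset and classifier (the specific model is only there to produce predictions).

```python
# Sketch computing Eqs. (5)-(8) from the confusion matrix.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
y_pred = GaussianNB().fit(X_tr, y_tr).predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)                  # Eq. (5)
precision = tp / (tp + fp)                                  # Eq. (6)
recall = tp / (tp + fn)                                     # Eq. (7)
f_measure = 2 * precision * recall / (precision + recall)   # Eq. (8)
print(accuracy, precision, recall, f_measure)
```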

Fig. 2 illustrates the flow diagram of our proposed ensemble-method-based predictive model. After the analysis by the ensemble methods is complete, the results are evaluated on the basis of the performance metrics and visualized through graphical representations of the metrics obtained after prediction.

Fig. 2
figure 2

Flow diagram of predictive model for analysis of disease datasets

4 Experimental analysis of predictive model

In this section, we focus on the experimental analysis of the predictive model on the different disease datasets. The whole experiment is carried out in the well-known machine learning tool Weka [21]. The disease datasets contain nominal as well as numeric data. All datasets pass through the pre-processing phase to ensure that the data is converted into nominal form, which is in turn useful for predicting the class value with the proposed model. The traditional algorithms [44, 45] were applied with 10-fold cross-validation, and their performance was then improved by combining the classification algorithms with ensemble methods.
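As a sketch of this protocol outside Weka, the pipeline below chains pre-processing with a traditional classifier and evaluates it with 10-fold cross-validation; the scikit-learn stack, the component count and the stand-in dataset are illustrative assumptions.

```python
# Sketch of the experimental protocol: pre-processing plus a traditional
# classifier, evaluated with 10-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), PCA(n_components=10),
                      SVC(kernel="linear"))
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(scores.mean(), scores.std())
```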

4.1 CKD (chronic kidney disease) analysis

The CKD dataset contains numerical as well as nominal data. We applied our model to it, with discretization, resampling and principal component analysis as pre-processing methods. The results show that Bagging and Boosting improve performance in the case of Naïve Bayes, KNN and SVM, while performance is essentially the same for Random Forest and Decision Tree, with some marginal improvement visible in the results given in Table 3 and Table 4.

Table 3 Predictive analysis of CKD dataset with Traditional algorithms
Table 4 Predictive analysis of CKD dataset with Bagging and Boosting

The tabulated results are shown graphically in Fig. 3. From this, it is observed that the proposed combined model predicts better in comparison to previous models in the literature [46].

Fig. 3
figure 3

CKD a Traditional Algorithms, b Algorithms with Bagging, c Algorithms with Boosting

4.2 Heart (cardiovascular disease) analysis

The heart dataset, also known as the Cleveland Heart dataset [12], contains numerical as well as nominal data. We first apply our model and perform pre-processing using discretization, resampling and principal component analysis. After applying the ensemble methods, it is found that Bagging and Boosting improve performance in the case of Decision Tree and Naïve Bayes, and behave almost the same in the case of Random Forest. The predicted values are shown in Table 5 and Table 6.

Table 5 Predictive analysis of heart (CVD) dataset with traditional algorithms
Table 6 Predictive analysis of heart (CVD) dataset with bagging and boosting

The tabulated results are shown graphically in Fig. 4. From this, it is observed that the proposed model predicts better in comparison to previous models in the literature [11, 47].

Fig. 4
figure 4

CVD analysis a traditional Algorithms, b Algorithms with Bagging, c Algorithms with boosting

4.3 Diabetes (PIMA) dataset

The diabetes dataset, also known as the PIMA (Diabetes) dataset, contains numerical as well as nominal data. We apply our model and perform pre-processing using discretization, resampling and principal component analysis. After applying the ensemble methods, it is found that Bagging and Boosting improve performance in the case of Decision Tree, KNN and SVM; Boosting with Random Forest shows a significant improvement, although for SVM the changes are only marginal. The obtained values are shown in Table 7 and Table 8.

Table 7 Predictive analysis of Diabetes (PIMA) dataset with traditional algorithms
Table 8 Predictive analysis of Diabetes (PIMA) dataset with bagging and boosting

The tabulated results are shown graphically in Fig. 5, which presents the analysis of the traditional algorithms alongside the analyses with Bagging and with Boosting. From this it is concluded that our model predicts better results in comparison to previous models [17, 48].

Fig. 5
figure 5

Diabetes analysis a Traditional Algorithms, b Algorithms with Bagging, c Algorithms with Boosting

4.4 Hepatitis disease dataset analysis

The hepatitis disease dataset contains numerical as well as nominal data. Discretization, resampling and replacing misclassified instances with a decision tree are used as pre-processing methods; the traditional algorithms were applied and then combined with ensemble methods to improve performance. From the results, it is found that Bagging improves performance in the case of Naïve Bayes and KNN and behaves almost the same in the case of SVM, and some marginal improvement is seen for Random Forest with Boosting. On the other hand, no visible improvement is seen in the case of Decision Tree. The results are presented in Table 9 and Table 10.

Table 9 Predictive analysis of Hepatitis dataset with traditional algorithms
Table 10 Predictive analysis of Hepatitis dataset with bagging and boosting

The tabulated results are depicted graphically in Fig. 6, which presents the analysis of the traditional algorithms alongside the analyses with Bagging and with Boosting. From this it is found that the proposed model predicts better results in comparison with other existing models [31, 49].

Fig. 6
figure 6

Hepatitis analysis a traditional Algorithms, b Algorithms with Bagging, c Algorithms with boosting

4.5 Wisconsin breast cancer dataset analysis

The breast cancer dataset, also known as the Wisconsin Breast Cancer dataset, contains numerical as well as nominal data. We apply our model, performing pre-processing with discretization, resampling and principal component analysis; the traditional algorithms were applied and then combined with ensemble methods to improve performance. From the results, it is found that Bagging and Boosting improve performance in the case of Decision Tree, Naïve Bayes and SVM, while for Random Forest with Bagging and Boosting only a marginal improvement is found. The obtained results are depicted in Table 11 and Table 12.

Table 11 Predictive analysis of Wisconsin Breast Cancer dataset with traditional algorithms
Table 12 Predictive analysis of Wisconsin Breast Cancer dataset with bagging and boosting

The tabulated results are shown graphically in Fig. 7, which presents the analysis of the traditional algorithms alongside the analyses with Bagging and with Boosting. From this it is concluded that our model predicts better results in comparison to previous methodologies in the literature [27,28,29, 50].

Fig. 7
figure 7

Cancer disease analysis a traditional Algorithms, b Algorithms with Bagging, c Algorithms with boosting

4.6 ILPD (Indian liver patient disease) dataset analysis

The liver dataset, also known as the Indian Liver Patient dataset [34], contains both numerical and nominal data. Discretization, resampling and principal component analysis are used as pre-processing methods; the traditional algorithms were applied and then combined with ensemble methods to improve performance. From the results, we found that Bagging and Boosting improve performance in the case of Decision Tree and behave almost the same in the case of Boosting with SVM and KNN. Some marginal improvement is seen for Naïve Bayes and Random Forest with Boosting, but no significant improvement is found with Bagging. The related experimental observations are given in Table 13 and Table 14.

Table 13 Predictive analysis of ILPD dataset with traditional algorithms
Table 14 Predictive analysis of ILPD dataset with bagging and boosting

The tabulated results are shown graphically in Fig. 8, which presents the analysis of the traditional algorithms alongside the analyses with Bagging and with Boosting. From this it is concluded that the proposed model predicts better results in comparison to previous models [32, 51, 52].

Fig. 8
figure 8

ILPD analysis a traditional Algorithms, b Algorithms with Bagging, c Algorithms with boosting

The experiments performed on six disease datasets show that our new predictive model obtains a significant improvement in almost every case compared with other existing methodologies [13, 41, 42].

5 Conclusion and future work

In this paper, we developed a new predictive model for disease datasets that improves the performance measures of various traditional classification algorithms by combining them with ensemble methods, where different pre-processing methods are used to handle the numerical as well as nominal data. The performance of the classical algorithms with Boosting was found to be better overall; in some cases Bagging performed significantly well, in a few cases both performed only marginally better than the traditional algorithms, in some cases they performed almost the same, and only in very few cases were there no significant changes. Overall, the new predictive model obtains better results as compared to the existing methods. In future work, the model can be extended with new algorithms and applied to other kinds of datasets for further improvement.