Keywords

1 Introduction

HealthCare is the maintenance and betterment of the health by the help of diagnosis, prevention, and treatment of any kind of diseases. But a major challenge faced is to provide better care and clinical services at an affordable cost. By the help of various predictive analysis, the cost will get diminish and can get better clinical care. Cardiovascular disease refers to the trouble occur with heart. It specifically implies the condition of the heart that contracts or obstructs blood vessels which result in heart attack, chest pain or stroke. So various data mining classification techniques such as Decision tree classifier, Naive Bayes classifier, Support Vector Machine classifier, and k-NN classifier are used to spot and prevent the diseases at an primitive stage.

This paper is organized into section as follows. Section 2 summarizes heart disease. Section 3 provide a brief description of literature survey of heart related disease. The work flow steps are discussed in Sect. 4. Section 5 is all about the concise discussion of the classification techniques such as Naive Bayes, Decision tree, SVM, k-NN. Dataset collection attributes elucidation, comparison study is discussed in Sect. 6. Section 7 is all of the result analysis. Section 8 is the conclusion, summarizes a brief overview of the content.

2 Heart Disease

Any abnormality in heart results to heart disease. Heart disease affects the structure and function of the heart. There are various types of abnormality observed in the heart such as narrowing arteries, heart attack, aberrant rhythms of the heart, crushing of heart, disease related to a heart valve and heart muscle etc. The abnormal function of the heart is because of various factors such as blood sugar level, cholesterol level, blood pressure, etc. From the various study, the death rate of Cardio Vascular Diseases is 272 people per 100 000 population in India and globally it is 235 per 100 000 population. 610,000 number of people deceased because of heart-related problems in the United States every year.

3 Literature Survey

Aditya Sundar et al. describes classification techniques for prediction and evaluates the performance of Naive Bayes classification technique and WAC (Weighted Association Classifier) by using different performance measure [1]. Sellappan Palaniappan et al. describe various data mining classification algorithm Naive Bayes, Decision tree and Neural network to predict heart disease [2]. Dangare et al. in their paper describe early anticipation of heart related illness by the help of Neural network, Decision tree and Naive Bayes and determine their accuracy [3]. J Thomas et al. in their paper describes the classification techniques k-NN classification, Naive Bayes Classification, Decision tree classifier and Neural network method to predict the danger level of a diligent to have a heart-related illness or not [4]. Swathy Wilson et al. in their paper conclude that decision tree with k means clustering yield improved accuracy as compared to others [5]. A Nishara Banu et al. did the study of Association Rule Mining, Classification, and Clustering for spotting heart-related disease. They showed that the designed spotting structure is able to spot the heart attack efficiently [6]. Shabana Asmi et al. add some attributes for spotting the heart-related unwellness which results in high accuracy by the help of association rules [7]. Beant Kaur et al. used various Genetic and data mining algorithm for the spotting of heart-related illness. Their result shows that Genetic Algorithm gives an accuracy of 73.46% [8]. Sashikant Ghumbre et al. in their study used SVM classifier and Radial basis function network for heart disease diagnosis and got the result that SVM is best for identification [9].

4 Work Flow Design

Figure 1 describes the workflow and methodology for the prediction of heart disease. We have taken the dataset from UCI/Kaggle in CSV format then preprocess the data has been done which includes data transformation, data cleaning, and data integration. After preprocessing data mining classification algorithms such as Decision Tree, SVM, Naive Bayes, k-NN are applied for the prediction and comparison of the classification techniques based on their performance.

Fig. 1.
figure 1

Work flow for prediction of heart disease

5 Classification Algorithm

Classification belongs to a supervised learning method, which predicts a class for each object and assigns them to a target class [12]. The main goal is to prognosticate the target class for each data in a data set accurately.

5.1 Decision Tree

A decision tree is basically a tree-like structure in which branch nodes denotes attribute, terminal nodes denote class labels and branches denotes the outcome. Testing criteria are applied on the source node and branch nodes and on the basis of testing criteria result the data will follow the branch till it reaches the leaf node or class label (Table 1).

Table 1. Pros and cons of decision tree classification techniques

5.2 Naive Bayes Classifier

Bayesian classification is based on Bayes theorem. Thus it is a classifier based on probabilities it fails in case of continuous attribute because of frequency count is not possible (Table 2).

Table 2. Pros and cons of Naive Bayes classification techniques

5.3 Svm

A Support Vector Machine (SVM) is based on decision planes on decision margins. A decision plane can be defined as which split up between a bunch of objects, belongs to, unlike class. SVM is used mutually for classification as well as regression analysis [9, 10] (Table 3).

Table 3. Pros and cons of SVM classification techniques

5.4 K-NN

k-NN classifier is the most instance-based method for classifying data. In k-NN the target function may be either discrete valued or real value. k-NN stores all available records and classifies them on the basis of similarity measures (Table 4).

Table 4. Pros and cons of k-NN classification techniques

6 Data Set Elucidation

Heart disease dataset is collected from kaggle/UCI machine learning repository in which there are 14 attributes and 303 patients record [11] (Fig. 2).

Fig. 2.
figure 2

Detail description of dataset [11]

7 Comparison Table of Different Classification Techniques

Table 5 gives the comparison of data mining classification algorithms based on various performance measure.

Table 5. Comparison of different classifier with respect to Accuracy, Sensitivity, Specificity, PPV, NPV and AUC

Confusion Matrix is exploited to compute Accuracy, Sensitivity, Specificity, Area under curve and ROC curve. Confusion Matrix for classification of heart disease is shown in Table 6.

Table 6. Confusion matrix for heart disease
$$ \begin{array}{*{20}l} {{\text{Sensitivity}} = {\text{P}}( + |1) = \% \;{\text{of}}\;{\text{True}}\;{\text{Positive}}\,:\,{\text{TP}}/\left( {{\text{TP}} + {\text{FN}}} \right)} \hfill \\ {{\text{Specificity}} = {\text{P}}( - |0) = \% \;{\text{of}}\;{\text{True}}\;{\text{Negative}}\,:\,{\text{TN}}/\left( {{\text{TN}} + {\text{FP}}} \right)} \hfill \\ {{\text{Accuracy}} = \left( {{\text{TP}} + {\text{TN}}} \right)/\left( {{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}} \right)} \hfill \\ {{\text{PPV }}({\text{Positively}}\;{\text{Predicted}}\;{\text{Value}} = {\text{ TP}}/\left( {{\text{TP}} + {\text{FP}}} \right)} \hfill \\ {{\text{NPV }}\left( {{\text{Negatively}}\;{\text{Predicted}}\;{\text{Value}}} \right) = {\text{TN}}/\left( {{\text{TN}} + {\text{FN}}} \right)} \hfill \\ \end{array} $$

Figure 3 represents the Roc curve of different classifier.

Fig. 3.
figure 3

ROC curve of different classifier

8 Conclusion

This paper focuses on the early detection and prevention of heart related illness by using several data mining classification method which is implemented by using data analytical tool R. For the prediction of heart disease various classifiers are used and we obtained several performance measurement parameters and observed that the performance is better for prediction in case of Naive Bayes as compared to others. Here we also observed that the performance of classifier varies from each other and also depended upon the platform or analytical tool on which the classification techniques are implemented. In future we would try to implement other techniques in which prediction is more accurate.