1 Introduction

Heart disease, also called coronary artery disease, is a condition that affects the heart and is a leading cause of death worldwide. Physicians generally make decisions by evaluating a patient's current test results, and they also examine decisions previously made for other patients with the same condition. Diagnosing heart disease therefore requires experience and highly skilled physicians, making it an important yet complicated task. An automated decision support system would be extremely advantageous. Data mining can be used to automatically infer diagnostic rules and help specialists make the diagnosis process more reliable. Data mining has shown good results in the prediction of heart disease and is widely applied for this purpose.

Random forest is an ensemble classifier that combines bagging with random selection of features. Random forest can handle data without preprocessing and has been used for prediction and probability estimation.

Feature selection is the process of identifying and removing redundant and irrelevant features, which can increase accuracy. Feature subset selection methods are classified into four types: (1) embedded methods, (2) wrapper methods, (3) filter methods, and (4) hybrid methods. The chi-square test is used as a feature selection measure that determines the difference between expected and observed frequencies.

The major contributions of our paper are summarized as follows:

  1. (1)

    Propose a new method that employs the random forest algorithm for the prediction of heart disease.

  2. (2)

    Apply the chi-square feature selection algorithm to identify and remove irrelevant attributes.

  3. (3)

    Apply the chi-square feature selection algorithm to improve accuracy in predicting heart disease.

The rest of the paper is organized as follows. Section 2 presents related work, reviewing various articles on heart disease. Section 3 provides a literature review. Section 4 presents the proposed approach. Experimental results are discussed in Sect. 5. Finally, Sect. 6 summarizes the paper.

2 Related Work

In this section, we will review some articles related to heart disease.

Polat et al. proposed a hybrid method that uses fuzzy weighted preprocessing and an artificial immune system [1]. Their medical decision making method consists of two phases: in the first phase, fuzzy weighted preprocessing is applied to the heart disease data set to weight the input data, and the artificial immune system is then applied to classify the weighted input. They applied their methodology to the Cleveland heart disease data set, which consists of 13 attributes, using 10-fold cross-validation.

Diagnosis of heart disease through neural network ensembles was proposed by Das et al. [2]. Their method creates a new model by combining posterior probabilities from multiple predecessor models. They implemented the method with SAS Base software on the Cleveland heart disease data set and obtained 89.01 % accuracy.

P.K. Anooj developed a clinical decision support system to predict heart disease using a fuzzy weighted approach. The method consists of two phases: the first generates weighted fuzzy rules, and the second builds a fuzzy rule based decision support system. The author used attribute selection and attribute weighting to generate the weighted fuzzy rules. Experiments were carried out on UCI repository data and obtained an accuracy of 57.85 % [3].

Robert Detrano et al. proposed a probability algorithm for the diagnosis of coronary artery disease. The probabilities resulting from the application of the Cleveland algorithm were compared with a Bayesian algorithm. Their method obtained an accuracy of 77 % [4].

A decision tree approach for diagnosing heart disease patients was proposed by Shouman et al. [5]. Different types of decision trees are used for classification. The research involves data discretization, decision tree selection, and reduced error pruning. Their method outperforms bagging and the J48 decision tree, achieving 79.1 % accuracy.

Diagnosis of heart disease through a bagging approach was proposed by Tu et al. [6]. They proposed a bagging algorithm to identify warning signs of heart disease and compared it with a decision tree. Their approach claimed an accuracy of 81.4 %.

Andreeva used the C4.5 decision tree for the diagnosis of heart disease, considering feature extraction and the inference of specific rules from the heart disease data set. The proposed approach achieved an accuracy of 75.73 % [7].

3 Literature Review

This section reviews literature used in this paper.

3.1 Heart Disease

Heart disease refers to a range of conditions that affect the heart. Heart disease, also called coronary heart disease (CHD), involves the deposition of fats inside the vessels that supply blood to the heart muscles. Heart disease can actually start as early as age 18, and patients typically only become aware of it when the blockage exceeds about 70 %. These blockages develop over the years, and increasing pressure can rupture the membrane covering the blockage. If the chemicals released by the broken membrane mix with blood and lead to a blood clot, the result is heart disease [8].

The factors that increase blockage are called risk factors. These risk factors are classified as modifiable and non-modifiable. Non-modifiable risk factors include age, gender, and heredity; these cannot be changed and continue to contribute to heart disease. Risk factors that can be changed by our own efforts are called modifiable risk factors. Modifiable risk factors include (1) food related, (2) habit related, (3) stress related, and (4) biochemical and miscellaneous risk factors.

An effective decision support system should be developed to help tackle the menace of heart disease.

3.2 Random Forest (RF)

The random forest algorithm is one of the most effective ensemble classification approaches. The RF algorithm has been used for prediction and probability estimation. An RF consists of many decision trees, and each tree casts a vote that indicates its decision about the class of the object. Random forest was first proposed by Tin Kam Ho of Bell Labs in 1995. The RF method combines bagging with random selection of features. There are three important tuning parameters in a random forest: (1) the number of trees (n tree), (2) the minimum node size, and (3) the number of features used for splitting each node of each tree (m try). The advantages of the random forest algorithm are listed below.

  1. (1)

    The random forest algorithm is an accurate ensemble learning algorithm.

  2. (2)

    It produces highly accurate classifiers for many data sets.

  3. (3)

    Random forest runs efficiently on large data sets.

  4. (4)

    It can handle hundreds of input variables.

  5. (5)

    Random forest estimates which variables are important in classification.

  6. (6)

    It can handle missing data.

  7. (7)

    Random forest has methods for balancing error for class unbalanced data sets.

  8. (8)

    The forests generated by this method can be saved for future reference [9]. The following algorithm illustrates the random forest method; a short code sketch of these steps is given after the algorithm.

Algorithm Random forest

  1. Step 1:

    From the training set, select a new bootstrap sample.

  2. Step 2:

    Grow an unpruned tree on this bootstrap sample.

  3. Step 3:

    Randomly select m try features at each internal node and determine the best split.

  4. Step 4:

    Grow each tree fully; do not perform pruning.

  5. Step 5:

    Output overall prediction as the majority vote from all the trees.
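The following is a minimal code sketch of these steps in Python using scikit-learn, not the implementation used in this paper; the file name heart.csv and the target column name are hypothetical, and n_estimators, min_samples_leaf, and max_features correspond to the n tree, minimum node size, and m try parameters described above.

```python
# Minimal random forest sketch (assumed CSV "heart.csv" with a binary
# "target" column; names are hypothetical).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = pd.read_csv("heart.csv")
X = data.drop(columns=["target"])
y = data["target"]

# n_estimators ~ n tree, min_samples_leaf ~ minimum node size,
# max_features ~ m try (features considered at each split).
rf = RandomForestClassifier(
    n_estimators=100,
    min_samples_leaf=1,
    max_features="sqrt",
    bootstrap=True,       # each tree is grown on a bootstrap sample
    random_state=42,
)

# 10-fold cross-validation, as used in the paper.
scores = cross_val_score(rf, X, y, cv=10)
print("Mean accuracy: %.2f %%" % (100 * scores.mean()))
```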

3.3 Chi-Square Method

Chi-square is a statistical test used to measure the divergence of feature occurrence from the distribution expected if the feature were independent of the class value [10]. Chi-square requires the following conditions to be satisfied:

  1. (1)

    Data must be quantitative

  2. (2)

    One or more categories of data are required

  3. (3)

    Independent observations

  4. (4)

    The sample should be of adequate size and simple (randomly drawn)

  5. (5)

    Data must be in frequency form

  6. (6)

    All observations must be read.

The chi-square formula used for the data sets is

$$ \chi^{2} = \sum \frac{(O - e)^{2}}{e} $$

where O is the observed frequency and e is the expected frequency. Thus chi-square represents the summed normalized squared deviation of the observed values from the corresponding expected values.
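To make the formula concrete, the sketch below computes the chi-square statistic for a single categorical feature against the class label from its observed and expected frequencies; the feature and class values are made-up toy data, not taken from the heart disease data sets.

```python
# Chi-square statistic for one feature vs. the class label:
# chi^2 = sum((O - e)^2 / e) over all cells of the contingency table.
import numpy as np

def chi_square(feature, target):
    categories = np.unique(feature)
    classes = np.unique(target)
    n = len(feature)
    stat = 0.0
    for c in categories:
        for k in classes:
            observed = np.sum((feature == c) & (target == k))
            # Expected frequency if the feature were independent of the class.
            expected = np.sum(feature == c) * np.sum(target == k) / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Toy example (hypothetical values): a coded attribute vs. disease label.
feature = np.array([1, 1, 2, 2, 3, 3, 3, 1])
target  = np.array([0, 0, 1, 1, 1, 0, 1, 0])
print("chi-square =", chi_square(feature, target))
```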

4 Proposed Method

The literature survey presents various techniques for the prediction of heart disease; each method has its own advantages and shortcomings. The proposed technique uses the random forest algorithm with chi-square feature selection for the prediction of heart disease. Feature subset selection is a process that selects a subset of the original attributes and reduces the feature space [11].

We applied random forest with chi-square feature selection to a heart disease data set collected from various corporate hospitals in Hyderabad (heart disease data set T.S) and also to the Heart Statlog data set. In our proposed work, we used chi-square to reduce the number of attributes and keep only those that contribute most towards the diagnosis of heart disease.

A confusion matrix compares the actual classification of the heart disease data set with the number of correct and incorrect predictions made by the model. A sample confusion matrix is shown below.

                 Disease present   Disease absent
Test positive          10                 7
Test negative           2                11

To evaluate the performance of our proposed model, we used the following classification measures [12]; a worked example based on the sample confusion matrix is given after the definitions below.

  1. (1)

    Specificity = TN/(FP + TN)

  2. (2)

    Sensitivity = TP/(TP + FN)

  3. (3)

    Disease prevalence = (TP + FN)/(TP + FP + TN + FN)

  4. (4)

    Positive predictive value = TP/(TP + FP)

  5. (5)

    Negative predictive value = TN/(FN + TN)

  6. (6)

    Accuracy = (TP + TN)/(TP + FP + TN + FN)

where

TP: Positive tuples that are correctly labeled by the classifier

TN: Negative tuples that are correctly labeled by the classifier

FN: Positive tuples that are incorrectly labeled by the classifier

FP: Negative tuples that are incorrectly labeled by the classifier
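As a worked example, these measures can be computed directly from the sample confusion matrix shown above (TP = 10, FP = 7, FN = 2, TN = 11); the sketch below is illustrative only and is not tied to the experimental results.

```python
# Worked example using the sample confusion matrix shown above:
# TP = 10, FP = 7, FN = 2, TN = 11.
TP, FP, FN, TN = 10, 7, 2, 11
total = TP + FP + FN + TN

specificity = TN / (FP + TN)
sensitivity = TP / (TP + FN)
prevalence  = (TP + FN) / total
ppv         = TP / (TP + FP)   # positive predictive value
npv         = TN / (FN + TN)   # negative predictive value
accuracy    = (TP + TN) / total

for name, value in [("Specificity", specificity), ("Sensitivity", sensitivity),
                    ("Disease prevalence", prevalence), ("PPV", ppv),
                    ("NPV", npv), ("Accuracy", accuracy)]:
    print("%-20s %.2f %%" % (name, 100 * value))
```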

The attributes of our heart disease data set T.S are listed in Table 1.

Table 1 Heart disease data set attributes

Proposed algorithm:

  1. Step 1:

    Load the heart disease data set

  2. Step 2:

    Rank the features in descending order based on their chi-square values; a high chi-square value indicates that a feature is more strongly related to the class. Apply the backward elimination algorithm: starting from the full feature set, it iteratively removes the lowest-ranked feature (lowest chi-square value), one feature per iteration, until the overall model accuracy stops improving. In this way chi-square is used to retain the high-ranked features.

  3. Step 3:

    Select the features with the highest chi-square values.

  4. Step 4:

    Apply the random forest algorithm to the remaining features (those with high chi-square values) of the data set to maximize the classification accuracy.

  5. Step 5:

    Find the accuracy of the classifier.

Steps 1 to 3 deal with feature selection: high-ranked features are selected using the chi-square approach. In Step 4, RF classification is applied to the selected feature subset. After classification is applied, the accuracy of the classifier is calculated. A code sketch of this pipeline is shown below.
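The following sketch illustrates the proposed pipeline in a scikit-learn style workflow; it is a simplified illustration under stated assumptions (hypothetical file and column names, non-negative feature encodings as required by the chi2 scorer), not the exact implementation used in the paper.

```python
# Sketch of the proposed pipeline: rank features by chi-square, apply
# backward elimination, and use random forest accuracy as the criterion.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2
from sklearn.model_selection import cross_val_score

data = pd.read_csv("heart.csv")                 # hypothetical file name
X, y = data.drop(columns=["target"]), data["target"]

# Step 2: rank features in descending order of chi-square value
# (chi2 assumes non-negative features, e.g. categorical/ordinal codes).
scores, _ = chi2(X, y)
ranked = [f for _, f in sorted(zip(scores, X.columns), reverse=True)]

def rf_accuracy(features):
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    return cross_val_score(rf, X[features], y, cv=10).mean()

# Backward elimination: repeatedly drop the lowest-ranked feature
# as long as accuracy does not decrease.
selected = list(ranked)
best = rf_accuracy(selected)
while len(selected) > 1:
    candidate = selected[:-1]                   # drop the lowest-ranked feature
    acc = rf_accuracy(candidate)
    if acc >= best:
        selected, best = candidate, acc
    else:
        break

# Steps 4-5: final random forest on the selected subset and its accuracy.
print("Selected features:", selected)
print("Cross-validated accuracy: %.2f %%" % (100 * best))
```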

5 Experimental Results

To evaluate the performance of our approach, we used the measures listed in Sect. 4. The accuracy comparison for the heart disease data set (Cleveland) [13] is shown in Table 2 and Fig. 1. The naïve Bayes approach obtained an accuracy of 78.56 %, whereas the decision table obtained an accuracy of 82.43 %. The results were obtained using 10-fold cross-validation. Our approach obtained a 7.97 % improvement over the C4.5 algorithm.

The accuracy comparison for heart disease data set T.S against the decision tree (DT) is shown in Table 3 and Fig. 2. Our approach obtained 100 % accuracy, whereas DT obtained an accuracy of 98.66 %. A comparison of various parameters for heart disease data set T.S is listed in Table 4 and Fig. 3. Specificity is the probability that the test result for heart disease will be negative when heart disease is not present. Positive predictive value (PPV) is the probability that heart disease is present when the diagnostic test is positive. The PPV for DT is 98.39 %, whereas our approach records 100 %. For the Heart Statlog data set, the negative predictive value (NPV) recorded by our approach is 90 % and the positive predictive value is 75.8 %, as shown in Table 5 and Fig. 4. Clinically, the disease prevalence (DP) is the same as the probability of the disease being present before the test is performed (the prior probability of disease).

The above experimental results suggest that our proposed approach efficiently achieves a high degree of dimensionality reduction and improves accuracy using the predominant features. Overall, our approach outperforms the other approaches. This indirectly helps reduce the number of diagnostic tests a patient must undergo for the prediction of heart disease.

Table 2 Accuracy comparison for heart disease data set (Cleveland data set)
Fig. 1 Accuracy comparison of Heart Statlog data set by various approaches

Table 3 Accuracy comparison for heart disease data set (T.S data set)
Fig. 2 Accuracy comparison of heart disease data set T.S

Table 4 Comparison of various parameters for heart disease data set T.S
Fig. 3 Comparison of various parameters for heart disease data set T.S

Table 5 Comparison of various parameters for heart disease data set (Heart Statlog)
Fig. 4 Comparison of various parameters for Heart Statlog data set

The specificity for heart disease data set T.S obtained by our approach is 100 %, whereas it is 92.86 % with the decision tree. Our approach obtained a 1.8 % improvement over the decision tree for the Heart Statlog data set, as shown in Figs. 3 and 4.

6 Conclusion

In this paper, we presented an efficient approach for the prediction of heart disease using random forest. We adopted a backward elimination method for feature selection using the chi-square measure for heart disease classification. The feature selection measure improves classification accuracy. Our proposed approach (random forest with chi-square) achieved an accuracy of 83.70 % for the Heart Statlog data set. Applying random forest has shown improved accuracy in the prediction of heart disease. The approach was systematically tested using 10-fold cross-validation to identify the most accurate method. We compared our approach with other classification algorithms, and it outperforms them for effective classification of heart disease. This type of research will play an important role in helping healthcare professionals predict heart disease.