
1 Introduction

The knowledge discovery process has become increasingly complex as data sets grow in size and complexity. Data mining procedures are used to extract meaningful information for effective knowledge discovery. These procedures can be classified as descriptive or predictive. Descriptive procedures summarize past or recent events and require post-processing methods to validate their results. Predictive procedures, on the other hand, predict the patterns and properties of previously unseen data. Commonly used data mining procedures are Clustering, Classification, Association, Outlier Detection, Prediction, and Regression.

Classification is used to discover knowledge in terms of distinct classes. It builds a model that describes and distinguishes data classes from a training data set and identifies the category to which a new observation belongs. A decision tree expresses such a classification model in a tree-like structure. This type of mining, in which the data set is partitioned into smaller subsets while the associated Decision Tree (DT) is built incrementally, belongs to supervised learning. The benefits of decision trees, illustrated by the short code sketch after this list, are:

  • Easy integration, owing to an intuitive representation of the data,

  • Investigative discovery of knowledge,

  • High accuracy,

  • Easy interpretability, and

  • Exclusion of unimportant features.
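
As a concrete illustration of these properties, the following minimal sketch trains a single J48 tree with the Weka Java API (the toolkit used later in this paper), prints the learned tree, which can be read directly, and classifies an observation. The file name heart-train.arff is an assumed placeholder, and the classified record is simply reused from the training set for brevity.

```java
import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainTreeSketch {
    public static void main(String[] args) throws Exception {
        // Load a training set; "heart-train.arff" is an assumed placeholder file.
        Instances train = DataSource.read("heart-train.arff");
        train.setClassIndex(train.numAttributes() - 1);   // last attribute is the class

        // Build a J48 decision tree (Weka's C4.5-style learner).
        J48 tree = new J48();
        tree.buildClassifier(train);

        // The tree is directly human-readable, which is why DTs are easy to interpret.
        System.out.println(tree);

        // Classify an observation; here a training record stands in for an unseen one.
        Instance newObs = train.instance(0);
        double predicted = tree.classifyInstance(newObs);
        System.out.println("Predicted class: " + train.classAttribute().value((int) predicted));
    }
}
```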

Because of the above-mentioned benefits, decision tree classifiers are used for knowledge extraction in areas such as education [33, 40], tourism [18, 34], healthcare [30, 31] and others. The healthcare industry generates a colossal amount of information from which it is extremely difficult to extract useful knowledge. The decision tree is an efficient method for extracting effective knowledge from this mass of information and supporting reliable healthcare decisions. It has been used for effective decision making in medical areas such as cancer detection, heart disease diagnosis and others [9, 11, 32, 45]. The foremost goal of this paper is to present a brief overview of the algorithms for developing decision trees and then to compare these algorithms for predicting heart disease on the basis of performance measures.

Heart diseases are a major cause of death worldwide, accounting for more than 17.6 million deaths per year as of 2016, a toll expected to exceed 23.6 million by 2030 [3]. India too is witnessing an alarming rise in the occurrence of heart disease (HD) [12]. Researchers have developed various decision tree algorithms to diagnose and treat heart diseases effectively. Son, Kim, Kim, Park and Kim [41] used decision trees and a rough set approach to develop a model for heart failure. Chaurasia and Pal [8], Sa [35], and Amin, Chiam, and Varathan [4] developed HD prediction systems by combining decision trees with other data mining algorithms. Mathan, Kumar, Panchatcharam, Manogaran, and Varadharajan [22] presented forecast frameworks for heart diseases using decision tree classifiers. Wu, Badshah and Bhagwat [45] developed a prediction model for HD survivability. Saxena, Johri, Deep and Sharma [37] developed an HD prediction system using KNN and decision tree algorithms. Shekar, Chandra and Rao [38] developed a classifier that provides optimized features for predicting the type of HD using a decision tree and a genetic algorithm. Vallée, Petruescu, Kretz, Safar and Blacher [43] evaluated the role of the APWV index in predicting HD. Pathak and Valan [29] proposed a forecasting model for HD diagnosis by integrating a rule-based approach with a decision tree. Sturts and Slotman [42] used decision tree analysis to predict risks for patients re-admitted within 30 days of hospital discharge for CHF. Given the importance of decision trees in healthcare, this paper presents a brief overview and comparison of seven DT algorithms, based on various evaluation measures, for diagnosing heart disease.

2 Material and Method

2.1 Overview of Decision Tree Algorithms

A decision tree algorithm is a supervised learning method that is implemented in a serial or parallel style depending on the data volume, the available memory and the required scalability. The DT algorithms considered in this study are J48, Decision stump, LMT, Hoeffding tree, Random forest, Random tree and REPTree. These are the most widely used algorithms for predicting various diseases (Table 1). Brief descriptions are given below, followed by a short code sketch that instantiates the seven learners.

Table 1. Different algorithms applied in various areas.

  a. The J48 algorithm develops a decision tree by classifying the class attribute based on the input elements.

  b. The Hoeffding tree algorithm learns incrementally from very large data streams.

  c. A Random tree algorithm draws a random tree from a set of possible trees, where the distribution over trees is considered uniform.

  d. A Random forest algorithm builds multiple decision trees using a bagging approach.

  e. The Logistic model tree (LMT) combines tree induction with linear logistic regression.

  f. Decision stump builds simple binary decision stumps for both nominal and numeric classification tasks.

  g. The REPTree algorithm generates a regression or decision tree using information gain or variance reduction.
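
Assuming the comparison is run through the Weka Java API (the GUI workflow described in Sect. 3 is equivalent), the seven learners can be instantiated uniformly as in the sketch below; only default parameters are shown here, which is an assumption rather than the study's exact configuration.

```java
import weka.classifiers.Classifier;
import weka.classifiers.trees.DecisionStump;
import weka.classifiers.trees.HoeffdingTree;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.LMT;
import weka.classifiers.trees.REPTree;
import weka.classifiers.trees.RandomForest;
import weka.classifiers.trees.RandomTree;

public class TreeLearners {
    /** The seven decision tree learners compared in this study, with default settings. */
    public static Classifier[] learners() {
        return new Classifier[] {
            new J48(),            // C4.5-style pruned tree
            new DecisionStump(),  // one-level binary stump
            new LMT(),            // logistic model tree
            new HoeffdingTree(),  // incremental tree for data streams
            new RandomForest(),   // bagged ensemble of random trees
            new RandomTree(),     // single tree using random attribute subsets
            new REPTree()         // reduced-error-pruned tree using info gain / variance
        };
    }
}
```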

2.2 Data Set

To attain the second goal of this paper, three data sets, drawn from the Cleveland, Hungarian and Switzerland heart disease databases, are used to evaluate the performance measures of the DT algorithms. Sixty data records were taken as input for training and fifty for testing. As shown in Table 2, fourteen attributes are considered when evaluating the performance measures; a short sketch of this data preparation follows Table 2.

Table 2. Description of the input attributes.
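
One possible way to assemble the training and test sets described above is sketched here. The ARFF file name and the use of the first 60 and the next 50 records are assumptions made for illustration, since the exact sampling procedure is not detailed in the text.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DataSetSketch {
    public static void main(String[] args) throws Exception {
        // Assumed file name; the study combines Cleveland, Hungarian and Switzerland records.
        Instances all = DataSource.read("heart-combined.arff");
        all.setClassIndex(all.numAttributes() - 1);     // 14th attribute: heart disease yes/no

        // 60 records for training and 50 for testing, as used in the comparison.
        Instances train = new Instances(all, 0, 60);
        Instances test  = new Instances(all, 60, 50);

        System.out.println("Training records: " + train.numInstances());
        System.out.println("Testing records:  " + test.numInstances());
    }
}
```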

3 Decision Tree Analysis

The performance measures are generated using the data mining tool Weka 3.9.3. Data pre-processing is done with the ReplaceMissingValues filter, which scans all records and replaces missing values. Confusion matrices are then built by applying the considered DT algorithms with two classes, Class 1 = YES (heart disease present) and Class 2 = NO (heart disease not present), where True Positive (TP) denotes a correct positive prediction, False Positive (FP) an incorrect positive prediction, True Negative (TN) a correct negative prediction, False Negative (FN) an incorrect negative prediction, P the positive samples and N the negative samples.
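
A minimal sketch of this pipeline through the Weka Java API is given below. The paper itself uses the Weka 3.9.3 tool, so this is an assumed, equivalent workflow with placeholder file names, shown here for a single classifier (Random forest).

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class ConfusionMatrixSketch {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("heart-train.arff");  // assumed file names
        Instances test  = DataSource.read("heart-test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Replace missing values, as done with the ReplaceMissingValues filter in Weka.
        ReplaceMissingValues filter = new ReplaceMissingValues();
        filter.setInputFormat(train);
        Instances cleanTrain = Filter.useFilter(train, filter);
        Instances cleanTest  = Filter.useFilter(test, filter);

        RandomForest rf = new RandomForest();
        rf.buildClassifier(cleanTrain);

        // Evaluate on the test set and print the 2x2 confusion matrix (YES / NO classes).
        Evaluation eval = new Evaluation(cleanTrain);
        eval.evaluateModel(rf, cleanTest);
        System.out.println(eval.toMatrixString("Confusion matrix"));
    }
}
```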

These confusion matrices are then used to compute the accuracy measures with the following equations:

$$ \text{TP rate} = \frac{TP}{TP + FN} $$
(1)
$$ \text{FP rate} = \frac{FP}{FP + TN} $$
(2)
$$ \text{Accuracy} = \frac{TP + TN}{P + N} $$
(3)
$$ \text{Error rate} = \frac{FP + FN}{P + N} $$
(4)
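
As a worked check of these definitions, the sketch below applies Eqs. (1)-(4) to the counts of a 2x2 confusion matrix; the counts used are arbitrary placeholders, not results from the study.

```java
public class AccuracyMeasures {
    public static void main(String[] args) {
        // Placeholder counts for a 2x2 confusion matrix (not the study's results).
        double tp = 20, fn = 5;   // P = TP + FN positive samples
        double tn = 18, fp = 7;   // N = TN + FP negative samples
        double p = tp + fn, n = tn + fp;

        double tpRate    = tp / (tp + fn);        // Eq. (1)
        double fpRate    = fp / (fp + tn);        // Eq. (2)
        double accuracy  = (tp + tn) / (p + n);   // Eq. (3)
        double errorRate = (fp + fn) / (p + n);   // Eq. (4)

        System.out.printf("TP rate=%.3f  FP rate=%.3f  Accuracy=%.3f  Error rate=%.3f%n",
                tpRate, fpRate, accuracy, errorRate);
    }
}
```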

4 Results: Comparison

Table 3 presents a working comparison of the considered algorithms.

Table 3. Working comparison of decision tree algorithms.

Table 4 shows the performance measures computed for the data using Eq. (3) and Eq. (4).

Table 4. Values of correctly classified instances (CCI) and incorrectly classified instances (ICI).

From Fig. 1 it can be observed that, for the considered data sets, Random forest shows the highest accuracy and the lowest error rate.

Fig. 1. Graph showing accuracy and error rate.

From Table 5 it is clear that the TP rate for class = NO is higher for Decision stump, Hoeffding tree, J48, Random forest, LMT and Random tree, which means these algorithms successfully identify patients who do not have heart disease.

Table 5. Class accuracy.

5 Conclusion and Future Scope

The primary goal of this paper was to compare the most widely used decision tree algorithms and to determine an efficient method for predicting heart disease on the basis of the computed performance measures: Accuracy, True Positive rate, Error rate and False Positive rate. The algorithms evaluated in the study were Hoeffding tree, Decision stump, LMT, J48, Random tree, Random forest and REPTree. From the results it is clear that Random tree and Random forest are efficient methods for generating decision trees. Random forest performs best because each tree splits on a random subset of the features and the trees can be built in parallel (see the configuration sketch below); the algorithm also handles high dimensionality, outliers and non-linear data, and gives quick predictions. However, it is less interpretable and can tend to overfit. In the future, the performance evaluation can consider more attributes responsible for heart diseases. Beyond healthcare, the framework can also be used for evaluating performance in other domains.
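
As a closing illustration of the properties noted above, the sketch below configures Weka's RandomForest (assuming the Weka 3.8+/3.9 API, where it is a bagging-based ensemble) to build its trees in parallel; each tree already considers only a random subset of attributes at every split by default. The specific parameter values are arbitrary.

```java
import weka.classifiers.trees.RandomForest;

public class RandomForestConfig {
    public static void main(String[] args) {
        RandomForest rf = new RandomForest();
        // Build 100 trees; each tree splits on a random subset of attributes by default.
        rf.setNumIterations(100);
        // Train the trees in parallel, exploiting the algorithm's natural parallelism.
        rf.setNumExecutionSlots(4);   // arbitrary choice of 4 threads
        System.out.println(java.util.Arrays.toString(rf.getOptions()));
    }
}
```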