
1 Introduction

In many places today, clinical decisions are often based on doctors’ intuition, skill and expertise rather than on the rich information available in large databases. This practice frequently leads to errors, unintentional bias and high medical costs, and it can drastically affect the quality of service provided to patients.

Today, many hospitals have installed some kind of patient information system to manage healthcare delivery and to collect patient data. These systems typically generate large amounts of data in different formats, such as numbers, text, charts and images. Unfortunately, these databases, rich as they are, are rarely used for clinical decision making, even though much of the information stored in them could be used effectively to support decisions in healthcare.

Here we focus on heart disease prediction using data mining techniques. The motivation for this study is an estimate by the WHO that by the year 2030 almost 23.6 million people will die due to heart disease. Predicting heart disease is therefore important for minimizing this risk. Correct diagnosis is the most difficult and complex task in the healthcare sector. Predicting heart disease from the various parameters of a patient’s diagnostic tests is a multi-layered problem that may lead to false presumptions and unpredictable effects. Nowadays the healthcare sector generates huge amounts of raw data about patients, hospital resources, disease diagnosis, electronic patient records, medical devices and so on. This raw data is the main resource that can be efficiently pre-processed and analysed to extract key information, which directly or indirectly supports cost-effectiveness and decision making in the medical community. Proper diagnosis of heart disease is not possible using human intelligence alone: many factors can affect diagnostic accuracy, such as inaccurate test results, limited experience, time-dependent performance and the need to keep knowledge up to date. Considerable research in this field has used multi-parametric attributes with linear and nonlinear features of heart rate variability (HRV); a novel technique of this kind was proposed by Lee et al. [1]. Researchers have used many classifiers for this purpose, e.g. CMAR (Classification based on Multiple Association Rules), SVM (Support Vector Machine), Bayesian classifiers and C4.5. Some of the latest techniques in this field are described in [2]. Data mining applications have very large scope and potential in healthcare, but their effectiveness depends mostly on the accuracy and cleanliness of the data.
In this regard, it is highly desirable that the healthcare industry adopt policies and methods so that data can be better captured, stored, prepared and mined. The methods we suggest include standardization of clinical data, analysis, and data sharing across related industries to enhance the accuracy and effectiveness of data mining applications in healthcare [3]. It is also advisable to explore text mining and image mining to expand the nature and scope of data mining applications in the healthcare sector, including the application of data mining to digital diagnostic images. Some progress has been made in these areas [4, 5].

The question that arises from this available data is:

“How can we use this data to generate useful information that can be used by healthcare practitioners to make effective clinical decisions?” Answering this question is the main objective of this research.

2 Background

In recent times, many organizations in the healthcare sector have used data mining applications intensively and on a large scale. One reason is that the transactions generated by this sector are too voluminous and complex to be processed and analysed by traditional methods. Data mining applications can substantially improve decision making by discovering trends and patterns in large volumes of complex data [6]. Financial pressure on the healthcare industry has also made the analysis of these large datasets necessary: the extracted information can support decisions based on the analysis of medical and financial data. Knowledge discovery from databases can improve operating efficiency, revenue and cost while maintaining a high level of care [7]. Benko and Wilson argue that healthcare organizations that use data mining applications are better positioned to meet their short-term goals and long-term needs [8]. Transforming raw healthcare data into useful information can yield very valuable results. A major motivation for research in this field is that it benefits all stakeholders in the healthcare sector. For example, insurance providers can detect fraud and abuse, and healthcare practitioners can gain assistance in decision making, for instance in customer relationship management. Healthcare providers (hospitals, physicians, test laboratories, patients, etc.) can also use data mining applications in their respective areas of expertise, for example to identify best practices and correct, effective treatments.

3 UCI Heart Disease Dataset Description

Source Information:

  (a) Creators of the dataset used: V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

  (b) Donor: David W. Aha (aha@ics.uci.edu)

The “num” attribute indicates the absence, or the presence and severity, of heart disease in the patient. Its value ranges from 0 (no presence) to 4 (severe).

Most experiments with the Cleveland database focus on distinguishing absence (“num” value 0) from presence (“num” values 1 to 4). To protect privacy, patients’ personal identification information has been replaced with dummy values.
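The absence/presence reduction described above can be sketched in a few lines of Python. This is a minimal illustration, not the preprocessing used in our experiments; the function name `binarize_num` and the sample values are assumptions introduced here.

```python
def binarize_num(num: int) -> int:
    """Map the 5-level 'num' attribute to absence (0) or presence (1)."""
    if not 0 <= num <= 4:
        raise ValueError(f"num must be in 0..4, got {num}")
    return 0 if num == 0 else 1


# Illustrative values only; these are not rows taken from the dataset.
labels = [0, 2, 1, 0, 4, 3]
binary = [binarize_num(v) for v in labels]
print(binary)  # [0, 1, 1, 0, 1, 1]
```

Note that the results reported in this paper keep all five levels of “num” as the output attribute; the sketch only shows the two-class simplification that other studies commonly apply.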

Number of instances: Cleveland: 303. The directory contains a dataset related to heart disease diagnosis. The data were collected at the following location:

Cleveland Clinic Foundation (cleveland.data).

The Cleveland database contains a total of 76 raw attributes, but our experiments use only 14 of them, because all published experiments to date use this subset of 14 and the data are provided only for these 14 attributes. The dataset used in this experiment contains important parameters such as ECR, cholesterol, chest pain, fasting blood sugar, MHR (maximum heart rate) and many more.

The attributes and their domain ranges are as follows:

@relation Cleveland
@attribute age real [29.0, 77.0]
@attribute sex real [0.0, 1.0]
@attribute cp real [1.0, 4.0]
@attribute trestbps real [94.0, 200.0]
@attribute chol real [126.0, 564.0]
@attribute fbs real [0.0, 1.0]
@attribute restecg real [0.0, 2.0]
@attribute thalach real [71.0, 202.0]
@attribute exang real [0.0, 1.0]
@attribute oldpeak real [0.0, 6.2]
@attribute slope real [1.0, 3.0]
@attribute ca real [0.0, 3.0]
@attribute thal real [3.0, 7.0]
@attribute num {0, 1, 2, 3, 4}
@inputs age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal
@outputs num
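The attribute declarations above can be read programmatically, for example to check that a value lies within its declared domain. The following Python sketch is an assumption-laden illustration: the regular expressions, the function names `parse_header` and `in_range`, and the shortened `HEADER` string are introduced here and are not part of KEEL.

```python
import re

# A shortened copy of the header above, used only for illustration.
HEADER = """\
@relation Cleveland
@attribute age real [29.0, 77.0]
@attribute chol real [126.0, 564.0]
@attribute num {0, 1, 2, 3, 4}
"""

# Hypothetical patterns for the two declaration forms that appear in the header.
REAL_RE = re.compile(
    r"@attribute\s+(\w+)\s+real\s*\[\s*([-\d.]+)\s*,\s*([-\d.]+)\s*\]"
)
NOMINAL_RE = re.compile(r"@attribute\s+(\w+)\s*\{([^}]*)\}")


def parse_header(text):
    """Return ({name: (low, high)}, {name: [values]}) for real and nominal attributes."""
    real = {
        m.group(1): (float(m.group(2)), float(m.group(3)))
        for m in REAL_RE.finditer(text)
    }
    nominal = {
        m.group(1): [v.strip() for v in m.group(2).split(",")]
        for m in NOMINAL_RE.finditer(text)
    }
    return real, nominal


real_ranges, nominal_values = parse_header(HEADER)


def in_range(name, value):
    """Check a value against the declared domain of a real-valued attribute."""
    low, high = real_ranges[name]
    return low <= value <= high


print(real_ranges["age"])      # (29.0, 77.0)
print(nominal_values["num"])   # ['0', '1', '2', '3', '4']
print(in_range("chol", 300.0)) # True
```

Because the patterns scan the whole text rather than individual lines, the sketch also works when several declarations share one line.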

We applied classification models based on covering rules derived from decision trees, namely C4.5 Rules [9–11] and Partial Tree, to the modified dataset described above and obtained the generated rule sets with their different priorities. We also generated pruned and classified rules. We used the WEKA tool [12] for dataset analysis and KEEL [13, 14] to derive the classification decision rules and to generate the partial trees.
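To illustrate how a prioritized rule set classifies a record, the following Python sketch applies rules in priority order and returns the label of the first rule that fires. The two rules shown are HYPOTHETICAL examples written for this illustration; they are not rules generated in our experiments, and the names `Rule` and `classify` are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, float]


@dataclass
class Rule:
    condition: Callable[[Record], bool]  # antecedent of the rule
    label: int                           # predicted value of "num"
    description: str                     # human-readable form


def classify(record: Record, rules: List[Rule], default: int) -> int:
    """Return the label of the first matching rule, or the default class."""
    for rule in rules:
        if rule.condition(record):
            return rule.label
    return default


# HYPOTHETICAL rules in priority order, for illustration only.
rules = [
    Rule(lambda r: r["ca"] >= 1.0 and r["thal"] == 7.0, 1,
         "IF ca >= 1 AND thal = 7 THEN num = 1"),
    Rule(lambda r: r["cp"] == 4.0 and r["oldpeak"] > 2.0, 2,
         "IF cp = 4 AND oldpeak > 2 THEN num = 2"),
]

# Made-up record; only the attributes used by the rules are included.
patient = {"cp": 4.0, "ca": 0.0, "thal": 3.0, "oldpeak": 2.5}
prediction = classify(patient, rules, default=0)
print(prediction)  # 2
```

The default class handles records that no rule covers, which is the usual way an ordered rule list is completed.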

4 Experiment Design with KEEL

We used the KEEL (Knowledge Extraction based on Evolutionary Learning) tool [14]. KEEL is an open-source (GPLv3) Java software tool for assessing evolutionary algorithms for data mining problems.

We designed an experiment using the Cleveland dataset, as shown in Fig. 1. In the preprocessing phase we used the AllPossible-MV algorithm [15] to fill in the missing values in the dataset.
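The idea behind this kind of imputation, as we understand it, is to replace each instance that has a missing value by several instances, one for every possible value of the affected attribute. The Python sketch below illustrates that idea on toy data; it is not KEEL's implementation, and the `MISSING` marker, the function name `all_possible_mv` and the sample rows are assumptions.

```python
from itertools import product

MISSING = None  # marker for a missing value in this sketch


def all_possible_mv(rows):
    """Expand each row with missing values into one row per combination of
    the values observed for the affected attributes."""
    n_cols = len(rows[0])
    # Distinct observed values per attribute, in order of first appearance.
    observed = [
        list(dict.fromkeys(r[c] for r in rows if r[c] is not MISSING))
        for c in range(n_cols)
    ]
    expanded = []
    for row in rows:
        gaps = [c for c, v in enumerate(row) if v is MISSING]
        if not gaps:
            expanded.append(tuple(row))
            continue
        for combo in product(*(observed[c] for c in gaps)):
            filled = list(row)
            for c, v in zip(gaps, combo):
                filled[c] = v
            expanded.append(tuple(filled))
    return expanded


# Toy rows of (ca, thal, num); the values are illustrative only.
rows = [(0.0, 3.0, 0), (1.0, 7.0, 1), (MISSING, 3.0, 0), (2.0, 6.0, 2)]
expanded = all_possible_mv(rows)
print(len(rows), "->", len(expanded))  # 4 -> 6
```

A consequence of this strategy is that the imputed dataset can contain more instances than the original one.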

Fig. 1 Heart disease prediction model

5 Classification Rule Generation

6 Evaluation Results

We used 5-fold cross-validation, evaluating classification accuracy with several measures on both the training and the test partition of each of the 5 folds.
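A 5-fold split can be sketched as follows. This is a plain shuffled split written for illustration; it does not reproduce the partitions produced by KEEL (which may, for example, be stratified by class), and the function name `k_fold_indices` and the `seed` parameter are assumptions.

```python
import random


def k_fold_indices(n_instances, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# 303 is the number of instances in the Cleveland database.
splits = list(k_fold_indices(303, k=5))
for fold, (train, test) in enumerate(splits):
    print(f"Fold {fold}: train={len(train)} test={len(test)}")
```

Each instance appears in exactly one test partition, and a classifier is trained once per fold on the remaining four partitions.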

6.1 Classification Results by Algorithm and by Fold

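We evaluated the classification accuracy of the C4.5 Rules and Partial Tree classifiers. The fold-wise results for each classifier are as follows, where CORRECT is the fraction of correctly classified instances and N/C is the fraction of instances left unclassified.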

Test Results using Partial Tree Classifier

Fold 0:

CORRECT=0.540983606557377 N/C=0.0

Fold 1:

CORRECT=0.5454545454545454 N/C=0.0

Fold 2:

CORRECT=0.540983606557377 N/C=0.0

Fold 3:

CORRECT=0.5238095238095238 N/C=0.0

Fold 4:

CORRECT=0.5882352941176471 N/C=0.0

  • Global Classification Error + N/C: 0.45210668470070586

  • Stddev Global Classification Error + N/C: 0.02148925320086861

  • Correctly classified: 0.5478933152992942, Global N/C: 0.0

Train Results using Partial Tree Classifier

Fold 0:

CORRECT=0.5503875968992248 N/C=0.0

Fold 1:

CORRECT=0.5494071146245059 N/C=0.0

Fold 2:

CORRECT=0.5503875968992248 N/C=0.0

Fold 3:

CORRECT=0.5546875 N/C=0.0

Fold 4:

CORRECT=0.5378486055776892 N/C=0.0

  • Global Classification Error + N/C: 0.451456317199871

  • Stddev Global Classification Error + N/C: 0.0056511365140908335

  • Correctly classified: 0.548543682800129, Global N/C: 0.0

Test Results using C4.5 Rules Classifier

Fold 0:

CORRECT=0.5081967213114754 N/C=0.0

Fold 1:

CORRECT=0.5 N/C=0.0

Fold 2:

CORRECT=0.6065573770491803 N/C=0.0

Fold 3:

CORRECT=0.47619047619047616 N/C=0.0

Fold 4:

CORRECT=0.4558823529411765 N/C=0.0

  • Global Classification Error + N/C: 0.4906346145015383

  • Stddev Global Classification Error + N/C: 0.05195453555220856

  • Correctly classified: 0.5093653854984617, Global N/C: 0.0

Train Results using C4.5 Rules Classifier

Fold 0:

CORRECT=0.7093023255813953 N/C=0.0

Fold 1:

CORRECT=0.6482213438735178 N/C=0.0

Fold 2:

CORRECT=0.6550387596899225 N/C=0.0

Fold 3:

CORRECT=0.62890625 N/C=0.0

Fold 4:

CORRECT=0.6772908366533865 N/C=0.0

  • Global Classification Error + N/C: 0.33624809684035556

  • Stddev Global Classification Error + N/C: 0.02752991241516823

  • Correctly classified: 0.6637519031596444, Global N/C: 0.0
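The global figures listed above can be recomputed from the per-fold CORRECT values. The Python sketch below does this under the assumption that the global value is the arithmetic mean over folds and that the reported standard deviation is the population standard deviation (dividing by the number of folds); with these assumptions the listed figures are reproduced. The names `results` and `summarize` are introduced here for illustration.

```python
from statistics import fmean, pstdev

# Per-fold CORRECT values copied from the listings above.
results = {
    ("Partial Tree", "test"): [0.540983606557377, 0.5454545454545454,
                               0.540983606557377, 0.5238095238095238,
                               0.5882352941176471],
    ("Partial Tree", "train"): [0.5503875968992248, 0.5494071146245059,
                                0.5503875968992248, 0.5546875,
                                0.5378486055776892],
    ("C4.5 Rules", "test"): [0.5081967213114754, 0.5,
                             0.6065573770491803, 0.47619047619047616,
                             0.4558823529411765],
    ("C4.5 Rules", "train"): [0.7093023255813953, 0.6482213438735178,
                              0.6550387596899225, 0.62890625,
                              0.6772908366533865],
}


def summarize(fold_accuracy):
    """Mean accuracy, mean error and population std. deviation over folds."""
    mean_acc = fmean(fold_accuracy)
    return {
        "correct": mean_acc,
        "error": 1.0 - mean_acc,
        # Shifting by a constant leaves the spread unchanged, so the
        # deviation of the error equals the deviation of the accuracy.
        "stddev": pstdev(fold_accuracy),
    }


summary = {key: summarize(vals) for key, vals in results.items()}
for (clf, split), s in summary.items():
    print(f"{clf:12s} {split:5s} correct={s['correct']:.4f} "
          f"error={s['error']:.4f} stddev={s['stddev']:.4f}")
```

Using the sample standard deviation (dividing by the number of folds minus one) would give larger values than those listed.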

6.2 Global Average and Variance

The global average and variance obtained with the C4.5 Rules and Partial Tree classifiers are given in Table 1.

Table 1 Global average and variance

6.3 Classification Rate by Algorithm and by Fold

The fold-wise classification rates of the C4.5 Rules and Partial Tree classifiers on the test and training data sets are given in Table 2.

Table 2 Classification rate by algorithm and by fold

7 Conclusions

We evaluated two classifiers for heart disease prediction using several statistical measures. The results show that the C4.5 Rules classifier correctly classified up to 70.93 % of instances (training data, fold 0). On the training data, C4.5 Rules outperformed the Partial Tree classifier on the given dataset (66.38 % versus 54.85 % correctly classified on average), whereas on the test data the Partial Tree classifier achieved the higher average (54.79 % versus 50.94 %).