1 Introduction

Cardiovascular diseases are among the leading causes of death and disability across the country, and deaths from them are rising in every age group. They are diseases of the heart and blood vessels. Coronary artery disease is a type of cardiovascular disease in which plaque accumulates in the coronary vessels, restricting blood flow and causing pain, heart attack, or even death [1,2,3,4]. The major causes of heart disease are an unhealthy lifestyle, lack of exercise, smoking, and unhealthy eating habits. Developing and developed countries alike spend a large share of their national budgets on detecting and treating the disease. Despite advances in medical science, accurate diagnosis and treatment of heart disease remains challenging because the diagnostic and treatment methods are complex, especially in resource-poor settings. Accurate diagnosis and treatment are necessary to save lives and to reduce the risk of more severe disease [5,6,7,8,9]. Heart disease can be prevented by adopting a healthy lifestyle and by timely tracking and treatment of the disease. Many well-known invasive and non-invasive modalities are available for identifying the disease [10]. Non-invasive methods include techniques such as ECG, echocardiogram, and stress test. The results of these modalities are sometimes inconclusive and take time to assess, so coronary angiography is a popular modality for identifying the disease. It is, however, invasive, painful, and expensive, and it requires a costly clinical setup [11,12,13,14].

Improvements in technology and low-cost storage devices have made it easy to store huge amounts of data, and the health sector is no exception. Machine learning methods are widely used to analyze the collected data because of their ability to predict diseases. Researchers are seeking inexpensive, reproducible, fast, and computationally light methods for detecting heart disease. They are exploring machine learning methods such as support vector machine (SVM), K-nearest neighbor (KNN), artificial neural network (ANN), decision tree, logistic regression, and naive Bayes for identifying heart disease [15,16,17,18,19,20]. This paper applies several machine learning methods to identify heart disease. The experiments use the Cleveland heart disease dataset and the Z-Alizadesh Sani heart disease dataset, both available from the University of California Irvine (UCI) machine learning repository.

The healthcare sector now generates large amounts of data on patients, diseases, clinical reports, physician notes, laboratory tests, and administration. Knowledge miners use advanced computational intelligence methods to extract useful patterns from this data, which leads to lower-cost healthcare services with fewer errors and improved diagnostic methods.

2 Framework for Intelligent Coronary Artery Disease Prediction

The benchmark coronary artery disease datasets were collected from the UCI machine learning repository, where they are available for research purposes. The Z-Alizadesh Sani coronary artery dataset contains 303 records and 53 attributes, such as Age, Weight, Length, Gender, Body mass index, Diabetes Mellitus, Hypertension, Current smoker, EX-Smoker, Obesity, Airway disease, Thyroid, and Chest pain. The Cleveland dataset consists of 14 attributes and 303 instances. The features include Age, Gender, Chest pain type, Resting blood pressure on admission, Serum cholesterol, Fasting blood sugar, Resting ECG outcome, Max heart rate achieved, Old peak, Slope, Number of vessels colored by fluoroscopy, Reversible defect, and Outcome [21].

The data is preprocessed so that predictive models can be built with classification methods. The disease identification model is evaluated using performance measures such as accuracy, error rate, AUC, and F-measure. The experimental results show that the ensemble-based model is the better approach with regard to the reliability and predictive power of diagnosis. Table 1 describes the Z-Alizadesh Sani heart disease dataset.

Table 1 Description of Z-Alizadesh Sani dataset

The CAD dataset collected from the UCI machine learning repository was preprocessed. Then logistic regression, deep learning, decision tree, random forest, gradient boosted tree, and SVM learning algorithms were applied to identify the presence or absence of coronary artery disease. Accuracy, error rate, AUC, and F-measure were recorded to evaluate the performance of the learning algorithms.
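
The comparison of learning algorithms described here can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the paper does not name its software, so scikit-learn is used; synthetic data from `make_classification` stands in for the CAD records; a small `MLPClassifier` stands in for the deep learning model; and 5-fold cross-validation is one possible evaluation protocol, not necessarily the one used by the authors.

```python
# Sketch only: synthetic data stands in for the CAD dataset, and scikit-learn
# is an assumed toolkit (the paper does not state which software was used).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# 303 records and 13 predictors mirror only the shape of the Cleveland data.
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=0
)

# Scale-sensitive learners are wrapped in a pipeline with standardization.
models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    # Small neural network used as a stand-in for the deep learning model.
    "neural network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    ),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosted trees": GradientBoostingClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}


def evaluate(models, X, y, folds=5):
    """Return the mean cross-validated accuracy of each model."""
    return {
        name: cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
        for name, model in models.items()
    }


scores = evaluate(models, X, y)
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:24s} accuracy={acc:.3f} error={1 - acc:.3f}")
```

Because the data is synthetic, the printed ranking says nothing about the results reported in this paper; the sketch only shows how the six learners can be compared under one protocol.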

Accuracy is the percentage of patients identified correctly, counting both those who have coronary artery disease and those who do not. Error rate is the percentage of patients identified incorrectly: patients without coronary artery disease who are identified as positive, and patients with the disease who are identified as negative. Figure 1 shows the framework for the intelligent coronary artery disease system.

Fig. 1 Framework for intelligent coronary artery disease system
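
The accuracy and error rate definitions used in this section can be made concrete with a short sketch in plain Python. The labels below are made up for ten hypothetical patients, and the F-measure is assumed to be the balanced F1 score (the harmonic mean of precision and recall), since the paper does not state which variant it uses.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN, treating label 1 as 'CAD present'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def metrics(y_true, y_pred):
    """Accuracy, error rate, and F1 (assumed F-measure variant) as fractions."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,    # correctly identified patients
        "error_rate": (fp + fn) / total,  # false positives plus false negatives
        "f_measure": f1,
    }


# Hypothetical example: 6 patients with CAD (1) and 4 without (0).
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]
result = metrics(y_true, y_pred)
print(result)
```

In this example one diseased patient is missed and one healthy patient is flagged, so two of ten predictions are wrong. Accuracy and error rate always sum to one, which is why the two measures rank the classifiers in exactly opposite order.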

3 Result Analysis

Table 2 presents the accuracy, error rate, F-measure, and AUC obtained on the Z-Alizadesh Sani dataset. Logistic regression achieves a prediction accuracy of 80%, the deep learning-based model 78%, decision tree 69%, SVM 71%, and random forest 82%. The ensemble-based gradient boosted tree model achieves the highest prediction accuracy, 84%. Figure 2 presents the accuracy of the classification methods.

Table 2 Accuracy/Error rate/F-measures/AUC of models on Z-Alizadesh Sani dataset

Fig. 2 Accuracy of models on Z-Alizadesh Sani dataset

For the misclassification error rate, logistic regression achieves 20%, the deep learning-based model 22%, decision tree 31%, random forest 18%, and SVM 29%, while gradient boosted tree achieves the lowest error rate, 16%. Figure 3 presents the misclassification error rate of the classifiers. Figures 4 and 5 present the AUC and F-measure of the prediction models.

Fig. 3 Error rate of models on Z-Alizadesh Sani dataset

Fig. 4 AUC of models using Z-Alizadesh Sani dataset

Fig. 5 F-measure of models using Z-Alizadesh Sani dataset
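
For readers less familiar with AUC, the following sketch shows one standard way to compute it: as the fraction of (diseased, healthy) patient pairs in which the diseased patient receives the higher predicted score, with ties counted as half. The scores are invented for illustration and are not taken from the experiments; the function name `auc` is likewise only illustrative.

```python
def auc(y_true, scores):
    """AUC as the probability that a positive case outranks a negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities of CAD for six patients.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
example_auc = auc(y_true, y_score)
print(f"AUC = {example_auc:.3f}")
```

Here eight of the nine pairs are ordered correctly, so the AUC is 8/9. Unlike accuracy, AUC does not depend on a single decision threshold, which is why it is reported alongside accuracy and error rate.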

Table 3 shows the experimental results on the Cleveland heart disease dataset. Logistic regression and deep learning each achieve a prediction accuracy of 83%, decision tree 74%, random forest 77%, and SVM 74%, while gradient boosted tree (the ensemble-based method) achieves the highest prediction accuracy, 84%. Gradient boosted tree also achieves the lowest error rate, 16%, and support vector machine has the highest error rate, 26%. Logistic regression, deep learning, and random forest achieve error rates of 17%, 17%, and 23%, respectively (Figs. 6, 7 and 8).

Table 3 Results on Cleveland heart disease dataset

Fig. 6 Accuracy of models using Cleveland heart disease dataset

Fig. 7 Error rate of models using Cleveland heart disease dataset

Fig. 8 AUC of models using Cleveland heart disease dataset
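
The accuracies and error rates quoted in this section can be cross-checked with a few lines of Python. The numbers below are copied from the text (in percent); the dictionary layout and the helper `summarize` are illustrative only. The error rate of the decision tree on the Cleveland dataset is not stated in the text, so it is left out rather than inferred.

```python
# Accuracies (%) as reported in the text for each dataset.
reported_accuracy = {
    "Z-Alizadesh Sani": {
        "logistic regression": 80, "deep learning": 78, "decision tree": 69,
        "SVM": 71, "random forest": 82, "gradient boosted tree": 84,
    },
    "Cleveland": {
        "logistic regression": 83, "deep learning": 83, "decision tree": 74,
        "random forest": 77, "SVM": 74, "gradient boosted tree": 84,
    },
}

# Error rates (%) as reported in the text; entries not stated are omitted.
reported_error = {
    "Z-Alizadesh Sani": {
        "logistic regression": 20, "deep learning": 22, "decision tree": 31,
        "random forest": 18, "SVM": 29, "gradient boosted tree": 16,
    },
    "Cleveland": {
        "gradient boosted tree": 16, "SVM": 26, "logistic regression": 17,
        "deep learning": 17, "random forest": 23,
    },
}


def summarize(dataset):
    """Return the most accurate model and any accuracy/error mismatches."""
    acc = reported_accuracy[dataset]
    best = max(acc, key=acc.get)
    # Error rate should equal 100 minus accuracy for every model.
    mismatches = [
        model for model, err in reported_error[dataset].items()
        if acc[model] + err != 100
    ]
    return best, mismatches


for name in reported_accuracy:
    best, mismatches = summarize(name)
    print(f"{name}: best = {best}, inconsistent entries = {mismatches}")
```

On both datasets the reported figures are internally consistent, and gradient boosted tree is the most accurate model.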

4 Conclusion

The experimental results on the Z-Alizadesh Sani and Cleveland datasets show that the ensemble-based method outperforms the other coronary artery disease detection models. The proposed model can reduce the cost of initial detection of coronary artery disease. The clinical parameters can be collected easily from hospitals, and the results can be reproduced in a fast, accurate, scalable, and reliable manner. The model can serve as an adjunct tool in clinical settings.