1 Introduction

Heart disease (HD) is the leading cause of death worldwide. The WHO reported that about 17.7 million deaths in 2015 were caused by cardiovascular diseases [1]. Early prediction of HD can help save lives by allowing warnings and precautionary measures to be issued to those at risk. Machine learning (ML) techniques play a crucial role in heart disease prediction (HDP) using previously collected patient data [2]. A wide range of ML techniques is available for developing heart disease predictors [3]. Patient datasets contain numerous attributes, not all of which are useful for predicting heart disease. Feature selection (FS) helps improve prediction accuracy by removing non-contributing and irrelevant attributes [4,5,6,7,8]. Bio-inspired algorithms are gaining popularity for FS [9]. This study uses the lion optimization (LO) algorithm, which is inspired by the social behavior of lions [10]. Lion optimization for feature selection (LOFS) has not yet been used in the ML-based HDP domain. To keep the research focused, the following research goals are established:

  • R1: To report the best ML-based HDP model among the proposed models to predict heart disease effectively.

  • R2: To establish the statistical validation of the work.

The paper is organized as follows. Sect. 2 discusses the literature related to this study. The experimental methods and setup are given in Sect. 3. The experimental results are reported in Sect. 4. Sect. 5 concludes the work and outlines future work.

2 Literature Work

This section summarizes prior work in the literature on HDP using machine learning techniques. The survey is summarized in Table 1.

Table 1 Related work in the literature

3 Research Methodology

The research methodology adopted for this work, including the experimental methods and setup, is described in this section.

This work uses three datasets from the UCI repository for the experiments [15]. The dataset attributes are described in Table 2. Each patient dataset is partitioned into training and testing sets in a 70–30 ratio. Then, the lion optimization algorithm for feature selection (LOFS) [14] is applied to select the most significant features. The features selected by the LOFS algorithm for all three experimental datasets are listed in Table 3. Only the selected features are then fed to the ML-based classifiers for training. Three well-known classification algorithms [2] are selected for heart disease prediction (HDP): artificial neural network (ANN) [16], support vector machine (SVM) [17], and decision tree (DT) [18, 19]. The performance of all three proposed classifiers is recorded over all three datasets. Figure 1 depicts the proposed experimental model.
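The data preparation steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `train_test_split_70_30` and `keep_selected`, the toy records, and the selected column indices are hypothetical, and the feature subset stands in for whatever the LOFS algorithm would return.

```python
import random

def train_test_split_70_30(rows, labels, seed=0):
    """Shuffle the records and split them 70-30 into training and testing parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(round(0.7 * len(indices)))

    def pick(idx):
        return [rows[i] for i in idx], [labels[i] for i in idx]

    return pick(indices[:cut]), pick(indices[cut:])

def keep_selected(rows, selected):
    """Keep only the columns whose index appears in `selected`."""
    return [[row[j] for j in selected] for row in rows]

# Toy data: 10 records with 4 attributes each (values are illustrative only).
rows = [[i, i * 2, i % 3, 10 - i] for i in range(10)]
labels = [i % 2 for i in range(10)]

(train_X, train_y), (test_X, test_y) = train_test_split_70_30(rows, labels)

# Hypothetical output of a feature selector such as LOFS: keep columns 0 and 2.
selected = [0, 2]
train_X_sel = keep_selected(train_X, selected)
test_X_sel = keep_selected(test_X, selected)
```

The same column subset is applied to both partitions, so a classifier trained on `train_X_sel` sees the testing records in the same reduced feature space.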

Table 2 Description of the datasets used
Table 3 Features Selected Using LOFS Algorithm
Fig. 1
A flow chart to select features from 3 UCI datasets by LOFS, predict heart disease by 3 machine learning algorithms and evaluate their performance.

Proposed heart disease prediction model with LOFS

For performance evaluation, the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), and accuracy are considered [2, 3, 11, 12, 13, 16, 17, 18, 19, 20, 21].
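As a minimal sketch of two of these measures, accuracy and AUC can be computed as follows. The function names, labels, and scores are illustrative assumptions and are not taken from the experiments. The AUC is computed here through its rank interpretation, that is, the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one.

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Illustrative labels and classifier scores (not taken from the paper).
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 decision threshold

acc = accuracy(y_true, y_pred)
area = auc(y_true, scores)
```

Accuracy depends on the chosen decision threshold, whereas AUC summarizes the ranking quality over all thresholds, which is one reason to report both.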

4 Results and Discussion

This section reports the experimental results and lists the inferences drawn from their analysis.

4.1 Finding the Best ML-Based HDP Model (R1)

LOFS-ANN, LOFS-SVM, and LOFS-DT are compared to find the best performer. First, the AUC values of all candidate models are recorded over all three datasets and reported in Table 4. Next, the accuracy is recorded (see Table 5). LOFS-ANN performs best on the accuracy criterion as well. The results are plotted in Fig. 2 to visualize the comparison.

Table 4 Comparison over AUC
Table 5 Comparison over accuracy

To achieve goal R1, the ROC curve is also considered for performance evaluation. The ROC plots for the three datasets, UCI Heart Disease Dataset (Cleveland) [15], UCI Statlog (Heart), and UCI Heart Failure Clinical Dataset, are shown in Figs. 3, 4, and 5, respectively.
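For readers unfamiliar with how such plots are built, the sketch below derives the points of an ROC curve by lowering the decision threshold one score at a time. The function name `roc_points` and the sample labels and scores are assumptions made for illustration and do not reproduce the curves in the figures.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    # Visit records from the highest score to the lowest.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Records sharing one score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

# Illustrative labels and classifier scores (not taken from the paper).
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
curve = roc_points(y_true, scores)
```

Plotting the second coordinate of each pair against the first gives the ROC curve, and the area under it obtained with the trapezoidal rule is the AUC.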

Fig. 2
A bar chart of performance metrics AUC and accuracy of the 3 HDP models LOFS-ANN, LOFS-SVM, and LOFS-DT. The ANN model performs best.

Comparison of HDP models over AUC and accuracy

Fig. 3
A graph has three lines, LOFS-ANN, SVM, and DT, with the false positive rate on the X-axis and the true positive rate on the Y-axis. All lines exhibit upward trends.

ROC curve over UCI Heart Disease Dataset (Cleveland)

Fig. 4
A graph has three lines, LOFS-ANN, SVM, and DT, with the false positive rate on the X-axis and the true positive rate on the Y-axis. All lines exhibit upward trends.

ROC curve over UCI Statlog (Heart)

Fig. 5
A graph has three lines, LOFS-ANN, SVM, and DT, with the false positive rate on the X-axis and the true positive rate on the Y-axis. All lines exhibit upward trends.

ROC curve over UCI Heart Failure Clinical Dataset

The experimental results show that LOFS-ANN achieves the best accuracy for predicting heart disease compared with the rest of the models.

Response to R1: The proposed LOFS-ANN performs best among the proposed models on all datasets.

4.2 Statistical Justification (R2)

To obtain statistical evidence, Friedman's test is conducted [20]. The result of the test indicates whether statistical support for goal R1 exists. The test is conducted at a significance level of 5%. The resulting p-value is less than 0.05 (see Fig. 6), so the null hypothesis that the three models perform equally is rejected. Hence, the results statistically support the finding that the proposed LOFS-ANN-based HDP model is better than LOFS-SVM and LOFS-DT.
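The computation behind Friedman's test can be sketched as follows, assuming one row per dataset and one column per model. The score table is a placeholder and does not reproduce the values in Tables 4 and 5, the function names are hypothetical, ties receive average ranks without a tie correction, and the closed-form p-value applies only to three models (two degrees of freedom).

```python
import math

def average_ranks(values):
    """Rank values so the largest gets rank 1; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda j: -values[j])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks

def friedman_statistic(table):
    """Friedman chi-square for n blocks (rows) by k treatments (columns).

    No tie correction is applied.
    """
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

def p_value_three_treatments(chi2):
    """Upper-tail chi-square probability for 2 degrees of freedom (k = 3 only)."""
    return math.exp(-chi2 / 2.0)

# Hypothetical scores, one row per dataset; columns: LOFS-ANN, LOFS-SVM, LOFS-DT.
# These numbers are placeholders, not the values reported in Tables 4 and 5.
table = [
    [0.90, 0.85, 0.80],
    [0.88, 0.84, 0.79],
    [0.91, 0.86, 0.82],
]
chi2 = friedman_statistic(table)
p = p_value_three_treatments(chi2)
significant = p < 0.05
```

With so few blocks the chi-square approximation is coarse, so a p-value obtained this way is best read as indicative. A pairwise post hoc test would be needed to compare individual models directly.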

Fig. 6
A table titled Friedman's ANOVA with six columns and four rows. The column headers are sources, SS, df, MS, Chi-square, and Prob Chi-square.

p-value for Friedman test

Response to R2: There is statistical evidence to validate the research work carried out in this paper.

5 Conclusion

Heart disease is the leading cause of death worldwide. If it is predicted well in advance and the patient is forewarned, lives can be saved. ML classification algorithms are being used to predict heart disease. The accuracy of a heart disease predictor improves when an appropriate subset of features that correlate well with the target is selected from the full feature set. In this paper, the lion optimization-based feature selection (LOFS) method has been used to select the most significant features from three datasets: UCI Heart Disease Dataset (Cleveland), UCI Statlog (Heart), and UCI Heart Failure Clinical Dataset. The preprocessed data are used to train three classifiers (ANN, SVM, and DT), resulting in three HDP models: LOFS-ANN, LOFS-SVM, and LOFS-DT. The performance of these proposed models is compared. The author concludes that ANN with LOFS performs best for heart disease prediction.

The author proposes to replicate the work in the future with larger clinical datasets to contribute more accurate heart disease predictors to the biomedical domain.