1 Introduction

Cardiovascular diseases, including heart disease, are the leading cause of death worldwide. According to the World Health Organization (WHO), heart diseases account for roughly one-third of deaths worldwide [1]. Early detection of heart disease is often challenging, yet it is essential to patient survival. Therefore, several machine learning (ML) methods have been developed to predict heart disease risk [2, 3]. Medical data usually contains many features, some of which can be noisy and can degrade a model's performance. An efficient feature selection approach can identify the most informative feature subset, reduce the computational cost of making predictions, and enhance prediction performance [4]. Therefore, feature selection is an essential step in most ML applications, especially in medical diagnosis.

Feature selection refers to retaining the most suitable features while discarding redundant ones [5]. It is usually achieved using wrapper, filter, or embedded methods. Wrapper methods perform feature selection via a classifier's predictions, while filter-based techniques score each feature and select those with the highest scores [6]. Embedded methods combine aspects of both wrapper and filter-based methods [7]. Furthermore, having too many attributes in a model increases its complexity and can lead to overfitting, whereas a smaller, well-chosen feature set tends to yield ML models that predict the class variable more effectively. Therefore, this research uses the recursive feature elimination technique to obtain the most relevant features for detecting heart disease.

Recursive feature elimination (RFE) is a wrapper-based feature selection method. It is a greedy algorithm used to obtain an optimal feature set [8]. RFE employs another ML algorithm to rank the attributes and recursively eliminates the least important ones, i.e., those whose removal improves the classifier's generalization performance. This iterative elimination of the weakest attributes continues until the specified number of attributes remains. RFE's performance depends on the classifier used as the estimator in its implementation. Therefore, it is beneficial to use different classifiers and compare the selected features to obtain a more reliable feature set.

Our research aims to develop a feature selection approach that obtains the most informative features to enhance classification performance. This research uses three base algorithms separately as RFE estimators to predict the most relevant features: gradient boosting, logistic regression, and decision tree. A feature selection rule based on set theory is then applied to obtain the optimal feature set. Finally, the optimal feature set serves as input to the logistic regression (LR), decision tree (DT), random forest (RF), linear discriminant analysis (LDA), naïve Bayes (NB), and extreme gradient boosting (XGBoost) classifiers.

Meanwhile, the class imbalance problem must usually be considered when dealing with medical datasets because healthy samples (the majority class) typically outnumber sick ones (the minority class) [9]. Most conventional machine learning algorithms tend to underperform when trained on imbalanced data, especially in classifying samples of the minority class. Furthermore, in medical data, minority-class samples are of particular interest, and the cost of misclassifying them is higher than for the majority class [10]. Hence, this study employs the hybrid Synthetic Minority Oversampling Technique-Edited Nearest Neighbor (SMOTE-ENN) method to resample the data and create a dataset with a balanced class distribution.

The contributions of this research include the development of an efficient approach to detect heart disease, the implementation of effective data resampling, the selection of the most relevant heart disease features from the dataset, and a comparison of the performance of different ML algorithms. The rest of this paper is structured as follows: Sect. 2 reviews related work in recent literature. Section 3 briefly discusses the proposed approach and the various algorithms used in the study. Section 4 describes the dataset and the performance assessment metrics used in this paper. Section 5 presents the results and discussion, while Sect. 6 concludes the article and discusses future research directions.

2 Related Works

Many research works have presented ML-based methods to predict heart disease accurately. For example, in [11], a diagnostic system was developed that uses a random search algorithm to select the relevant features and a random forest classifier to predict heart disease. The random search algorithm selected seven of the 14 features in the well-known Cleveland heart disease dataset as the most informative. The experimental results showed that the proposed approach achieved a 3.3% increase in accuracy compared to the traditional random forest algorithm, and it also outperformed five other ML algorithms and eleven methods from previous literature.

In [12], a deep belief network (DBN) was optimized to prevent overfitting and underfitting in heart disease prediction. The authors employed the Ruzzo-Tompa method to eliminate irrelevant features. The study also developed a stacked genetic algorithm (GA) to find the optimal settings for the DBN. The experimental results showed better performance compared to other ML techniques. Similarly, Ishaq et al. [13] used the random forest algorithm to select the optimal features for heart disease prediction. They employed nine ML algorithms for the prediction task, including the adaptive boosting classifier (AdaBoost), gradient boosting machine (GBM), support vector machine (SVM), extra tree classifier (ETC), and logistic regression. The study also utilized the synthetic minority oversampling technique (SMOTE) to balance the data. The experimental results showed that the ETC achieved the best performance with an accuracy of 92.6%.

Ghosh et al. [1] proposed a machine learning approach for effective heart disease prediction by incorporating several techniques. The research combined well-known heart disease datasets, including the Cleveland, Hungarian, Long Beach, and Statlog datasets. Feature selection was achieved using the least absolute shrinkage and selection operator (LASSO) algorithm. From the results obtained, the hybrid random forest bagging method achieved the best performance. Meanwhile, Lakshmanarao et al. [14] developed an ML approach to predict heart disease using sampling techniques to balance the data and feature selection to obtain the most relevant features. The preprocessed data were then employed to train an ensemble classifier. The sampling techniques included SMOTE, random oversampling, and the adaptive synthetic (ADASYN) sampling approach. The results showed that the feature selection and sampling techniques enhanced the performance of the ensemble classifier, which obtained a prediction accuracy of 91%.

Furthermore, Haq et al. [15] applied feature selection to the Cleveland heart disease dataset to obtain the most relevant features, improving classification performance and reducing the computational cost of a decision support system. Feature selection was achieved using the sequential backward selection technique, and classification was performed using a k-nearest neighbor (KNN) classifier. The experimental results showed that the feature selection step improved the performance of the KNN classifier, and an accuracy of 90% was obtained.

Mienye et al. [2] proposed an improved ensemble learning method to detect heart disease. The study employed decision trees as base learners in building a homogeneous ensemble classifier, which achieved an accuracy of 93%. In [3], the authors presented a heart disease prediction approach that combined a sparse autoencoder and an artificial neural network. The autoencoder performed unsupervised feature learning to enhance the classification performance of the neural network, and a classification accuracy of 90% was obtained. Overall, most heart disease prediction models in the literature achieve only moderately acceptable performance. Research has shown that datasets with a balanced class distribution and an optimal feature set can significantly improve the predictive ability of machine learning classifiers [16, 17]. Therefore, this study aims to implement an efficient data resampling method and a robust feature selection method to enhance heart disease prediction.

3 Methodology

This section briefly discusses the methods used in this research. Firstly, we discuss the hybrid SMOTE-ENN technique used to balance the heart disease data. Secondly, we provide an overview of the recursive feature elimination method and the proposed feature selection rule. Thirdly, we discuss the ML classifiers used to train the models.

3.1 Hybrid SMOTE-ENN

Resampling techniques add or remove instances to create balanced data for efficient machine learning, as conventional ML classifiers perform better with balanced training data. Oversampling techniques create new synthetic samples in the minority class, while undersampling techniques eliminate examples from the majority class [18]. Both techniques have achieved good performance in diverse tasks. However, previous research has shown that combining the two yields even better resampling [19].

This study performs both oversampling and undersampling using the hybrid SMOTE-ENN method proposed by Batista et al. [20]. This hybrid method creates balanced data by combining SMOTE's ability to create synthetic samples in the minority class with ENN's ability to remove examples whose class differs from the majority class of their k nearest neighbors. The algorithm first applies SMOTE to oversample the minority class until the data is balanced; ENN is then used to remove unwanted overlapping examples from both classes to maintain an even class distribution [21]. Several studies have shown that SMOTE-ENN performs better than SMOTE or ENN alone [19, 22, 23].
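For illustration, the following minimal sketch applies SMOTE-ENN with the imbalanced-learn library; the feature matrix and labels here are random placeholders standing in for the heart disease data:

```python
# A minimal sketch of SMOTE-ENN resampling with imbalanced-learn; X and y
# are placeholders to be replaced by the actual heart disease dataset.
from collections import Counter

import numpy as np
from imblearn.combine import SMOTEENN

rng = np.random.default_rng(42)
X = rng.normal(size=(299, 12))        # placeholder feature matrix
y = np.array([0] * 203 + [1] * 96)    # imbalanced placeholder labels

resampler = SMOTEENN(random_state=42) # SMOTE oversampling + ENN cleaning
X_res, y_res = resampler.fit_resample(X, y)

print("before:", Counter(y))          # original class distribution
print("after: ", Counter(y_res))      # counts after SMOTE + ENN cleaning
```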

3.2 Recursive Feature Elimination

Recursive feature elimination is a wrapper-based feature selection algorithm: a separate ML algorithm is wrapped by the RFE and used at its core. The algorithm iteratively fits a model on the current feature set and uses the model's coefficients (or feature importances) to rank the features, removing the least important ones in each iteration. Firstly, RFE fits the estimator on the full feature set so that every predictor receives a score. The lowest-scoring features are removed in each iteration, and the estimator's performance is recalculated on the remaining feature set. Finally, the subset that produces the best performance is returned as the optimal feature set [24].

An essential part of the RFE technique is the choice of estimator used to select the features. Basing the final selection on a single algorithm could therefore be unreliable, whereas combining two or more complementary algorithms can produce a more dependable feature subset. In this research, we use gradient boosting, decision tree, and logistic regression as RFE estimators and introduce a feature selection rule to obtain the most relevant features from the three predicted feature sets: a feature is selected if it was chosen by at least two of the three base algorithms. Let the final feature set be denoted by \(A\), and let the optimal feature sets selected by gradient boosting, logistic regression, and decision tree be \(X\), \(Y\), and \(Z\), respectively. Using set theory, the rule is defined as:

$$A=\left(X\cap Y\cap Z\right)\cup \left(X\cap Y\right)\cup \left(Y\cap Z\right)\cup \left(X\cap Z\right)$$
(1)
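The following sketch (not the authors' exact implementation) illustrates this rule with scikit-learn's RFE; the synthetic data and the choice of nine features per estimator are illustrative assumptions:

```python
# A sketch of the proposed rule: run RFE with three estimators and keep
# each feature chosen by at least two of them, as in Eq. (1).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=12, random_state=0)
features = [f"F{i}" for i in range(1, 13)]

estimators = [
    GradientBoostingClassifier(random_state=0),
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=0),
]

# One selected-feature set per estimator (X, Y, and Z in Eq. (1)).
subsets = []
for est in estimators:
    rfe = RFE(estimator=est, n_features_to_select=9).fit(X, y)
    subsets.append({f for f, keep in zip(features, rfe.support_) if keep})

x_set, y_set, z_set = subsets
# Eq. (1): union of the pairwise intersections; the X∩Y∩Z term is already
# contained in each pairwise intersection.
final = (x_set & y_set) | (y_set & z_set) | (x_set & z_set)
print(sorted(final))
```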

3.3 Logistic Regression

Logistic regression is a statistical method that applies the logistic function to model a binary target variable. It is similar to linear regression but with a binary target variable. Logistic regression models the probability of an event based on individual attributes. Because the probability is bounded between 0 and 1, the model works with the logarithm of the odds (the logit):

$$\mathrm{log}\left(\frac{\pi }{1-\pi }\right)={\beta }_{0}+{\beta }_{1}{x}_{1}+{\beta }_{2}{x}_{2}+\dots +{\beta }_{m}{x}_{m}$$
(2)

where \(\pi \) represents the probability of an outcome (i.e., heart disease or no heart disease), \({\beta }_{i}\) denotes the regression coefficients, and \({x}_{i}\) represents the independent variables [25].
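As a brief illustration of Eq. (2), the sketch below fits scikit-learn's LogisticRegression on synthetic data and verifies that the model's log-odds equal the intercept plus the weighted sum of the inputs:

```python
# A small sketch, with synthetic data, showing that a fitted logistic
# regression model follows Eq. (2).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

pi = model.predict_proba(X[:1])[0, 1]                # P(class 1 | x)
logit = model.intercept_[0] + X[0] @ model.coef_[0]  # beta_0 + sum(beta_i x_i)
print(np.log(pi / (1 - pi)), logit)                  # the two values agree
```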

3.4 Decision Tree

Decision trees are popular ML algorithms that can be used for both classification and regression tasks. They utilize a tree-like model of decisions to develop their predictions. There are different types of decision tree algorithms, but in this study, we use the classification and regression tree (CART) [26] algorithm to develop our decision tree model. CART uses the Gini index, which measures the probability that a randomly selected instance would be misclassified if labelled according to the class distribution at a node. Assuming a set of samples has \(J\) classes, and \(i\in \{1,2,\dots ,J\}\), the Gini index is defined as:

$$Gini=1-\sum\nolimits_{i=1}^{J} {p}_{i}^{2}$$
(3)

where \({p}_{i}\) is the proportion of samples belonging to class \(i\).
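A minimal implementation of Eq. (3) is shown below:

```python
# The Gini index of a node, computed from the class proportions of the
# samples it contains, as in Eq. (3).
import numpy as np

def gini_index(labels: np.ndarray) -> float:
    """Return 1 - sum(p_i^2) over the classes present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_index(np.array([0, 0, 1, 1])))  # 0.5, maximally impure binary node
print(gini_index(np.array([1, 1, 1, 1])))  # 0.0, pure node
```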

3.5 Random Forest

Random forest [27] is an ensemble learning algorithm that combines multiple decision trees to achieve better classification. It extends the bagging technique by also drawing random feature subsets, which keeps the correlation between the individual trees low. The algorithm builds several decision trees, each trained on a bootstrap sample of the input data. In classification tasks, the input vector is passed to every tree in the forest, and each tree votes for a class [28]; the random forest classifier then selects the class with the most votes. A key difference from a single decision tree is that the random forest considers only a random subset of the input features at each split, while a decision tree considers all possible feature splits. Different variants of the random forest algorithm [29,30,31] have been widely applied in medical diagnosis applications with excellent performance.
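The sketch below illustrates this point with scikit-learn's RandomForestClassifier, where the max_features argument controls the random feature subset considered at each split; the parameter values are illustrative:

```python
# A short sketch: `max_features` makes each split consider only a random
# subset of the features, unlike a single decision tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=12, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,     # number of bootstrap-trained trees that vote
    max_features="sqrt",  # random feature subset per split (low correlation)
    random_state=0,
).fit(X, y)
print(rf.predict(X[:5]))  # majority vote over the 100 trees
```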

3.6 Linear Discriminant Analysis

Linear discriminant analysis is a generalization of Fisher's linear discriminant, a statistical method that computes a linear combination of features separating two or more classes. The resulting combination can be used either as a linear classifier or for dimensionality reduction followed by classification. LDA aims to find a linear function:

$$y={a}_{1}{x}_{{i}_{1}}+{a}_{2}{x}_{{i}_{2}}+{a}_{3}{x}_{{i}_{3}}+\dots +{a}_{m}{x}_{{i}_{m}}$$
(4)

where \(a={\left[{a}_{1},{a}_{2},\dots ,{a}_{m}\right]}^{T}\) is the vector of coefficients to be estimated, and \({x}_{i}=\left[{x}_{{i}_{1}},{x}_{{i}_{2}},\dots ,{x}_{{i}_{m}}\right]\) is the vector of input variables [32].
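The following brief sketch relates Eq. (4) to scikit-learn's LinearDiscriminantAnalysis, whose coef_ attribute holds the coefficient vector \(a\) (synthetic data, illustrative only):

```python
# After fitting, `coef_` holds the coefficients of Eq. (4) and
# `decision_function` evaluates the corresponding linear score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
lda = LinearDiscriminantAnalysis().fit(X, y)

a = lda.coef_[0]                      # coefficients a_1 ... a_m
score = X[0] @ a + lda.intercept_[0]  # linear function of Eq. (4)
print(np.isclose(score, lda.decision_function(X[:1])[0]))  # True
```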

3.7 Naïve Bayes

Naïve Bayes classifiers are probabilistic classifiers based on Bayes' theorem. They are called naïve because they assume that the attributes used to train the model are independent of each other [33]. Let \(X\) be a sample with \(n\) attributes, represented by \(X={(x}_{1},\dots ,{x}_{n})\). To determine the class \({C}_{k}\) that \(X\) belongs to, the algorithm employs a probability model based on Bayes' theorem:

$$P\left({C}_{k}|X\right)=\frac{P(X|{C}_{k})P({C}_{k})}{P(X)}$$
(5)

The class that \(X\) belongs to is assigned using a decision rule:

$$y=\underset{k}{\mathrm{argmax}}\;P\left({C}_{k}\right)\prod\nolimits_{i=1}^{n}P\left({x}_{i}|{C}_{k}\right)$$
(6)

where \(y\) represents the predicted class. The naïve Bayes algorithm is a simple method for building classifiers. There are numerous algorithms based on the naïve Bayes principle, all of which assume that the value of a given attribute is independent of the values of the other attributes, given the class variable. In this study, we employ the Gaussian naïve Bayes algorithm, which assumes that the continuous values associated with each class follow a Gaussian (i.e., normal) distribution.
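As an illustration, the sketch below evaluates the decision rule of Eq. (6) with Gaussian likelihoods and checks it against scikit-learn's GaussianNB on synthetic data:

```python
# A compact sketch of Eq. (6) with per-class Gaussian feature likelihoods,
# compared against scikit-learn's GaussianNB.
import numpy as np
from scipy.stats import norm
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
gnb = GaussianNB().fit(X, y)

x = X[0]
# P(C_k) * prod_i P(x_i | C_k) for each class k.
scores = [
    prior * np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(var)))
    for prior, mu, var in zip(gnb.class_prior_, gnb.theta_, gnb.var_)
]
print(np.argmax(scores), gnb.predict(x.reshape(1, -1))[0])  # same class
```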

3.8 Extreme Gradient Boosting

Extreme gradient boosting (XGBoost) is an implementation of the gradient boosting machine. It is based on decision trees and can be used for both regression and classification problems. XGBoost builds its prediction additively, as the accumulated output of a sequence of trees:

$${\widehat{y}}_{i}^{(T)}={\widehat{y}}_{i}^{(0)}+\sum\nolimits_{t=1}^{T}{f}_{t}({x}_{i})$$
(7)

where \({f}_{0}\left({x}_{i}\right)={\widehat{y}}_{i}^{(0)}=0\) and \({f}_{t}\left({x}_{i}\right)={\omega }_{q({x}_{i})}\). Here, \(T\) represents the number of decision trees, \({\widehat{y}}_{i}^{(T)}\) denotes the predicted value for the \(i\)-th instance, \(\omega \) is a weight vector associated with the leaf nodes, and \(q({x}_{i})\) maps the feature vector \({x}_{i}\) to a leaf node [34]. In the XGBoost implementation, trees are added one after the other to make up the ensemble, each trained to correct the misclassifications made by the previous models.
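The additive structure of Eq. (7) can be observed with the xgboost package, assuming a reasonably recent version whose scikit-learn interface accepts iteration_range; the prediction after \(T\) trees is the accumulated output of trees 1 to \(T\):

```python
# A hedged sketch of Eq. (7): widening `iteration_range` sums in more trees,
# refining the prediction step by step. Data and settings are illustrative.
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=300, n_features=12, random_state=0)
model = XGBClassifier(n_estimators=50, eval_metric="logloss").fit(X, y)

for T in (1, 10, 50):
    # Accumulated prediction of trees 1..T for the first sample.
    p = model.predict_proba(X[:1], iteration_range=(0, T))[0, 1]
    print(f"after {T:2d} trees: predicted probability = {p:.3f}")
```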

3.9 The Architecture of the Proposed Heart Disease Prediction Model

The flowchart of the proposed heart disease prediction method is shown in Fig. 1. Firstly, the heart disease dataset is resampled using the SMOTE-ENN method to create a dataset with a balanced class distribution. Secondly, the proposed feature selection method is used to select the optimal feature set, and the resulting reduced dataset is split into training and testing sets. The training set is used to train the various classifiers, while the testing set is used to evaluate their performance; a condensed code sketch of this pipeline is given after Fig. 1.

Fig. 1. Flowchart of the proposed heart disease prediction method
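The sketch below condenses this pipeline under the assumptions of the earlier snippets; the load_heart_data loader, the parameter values, and the selected column indices are hypothetical placeholders to be replaced by the real dataset and tuned classifiers:

```python
# A hypothetical end-to-end sketch of Fig. 1: resample with SMOTE-ENN,
# keep the selected features, split, then train and evaluate a classifier.
import numpy as np
from imblearn.combine import SMOTEENN
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def load_heart_data():
    """Hypothetical loader; replace with the real 299 x 12 dataset."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(299, 12)), rng.integers(0, 2, size=299)

X, y = load_heart_data()
X_res, y_res = SMOTEENN(random_state=0).fit_resample(X, y)

# Column indices of F1, F2, F3, F5, F7, F8, F9, F10, F12 (0-based).
selected = [0, 1, 2, 4, 6, 7, 8, 9, 11]
X_sel = X_res[:, selected]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_sel, y_res, test_size=0.2, random_state=0, stratify=y_res)
clf = XGBClassifier(eval_metric="logloss").fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```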

4 Dataset and Performance Metrics

The heart disease dataset used in this study contains 299 samples obtained from the medical records of patients aged over 40 years. The dataset was compiled at the Faisalabad Institute of Cardiology in Punjab, Pakistan [35]. It comprises 12 attributes and a target variable, including binary attributes such as anaemia, gender, diabetes, smoking, and high blood pressure (HBP). The attributes also include creatinine phosphokinase (CPK), the level of the CPK enzyme in the blood, and ejection fraction, the percentage of blood leaving the heart at each contraction, as well as platelets, serum creatinine, etc. The full feature list is shown in Table 1.

Table 1. Features of the heart disease dataset.

Meanwhile, the dataset is imbalanced, with more samples in the majority class than in the minority class; hence, the data needs to be balanced efficiently to enhance classification performance. Furthermore, this research uses performance metrics such as accuracy, precision, sensitivity, and F-measure, defined as follows:

$$Accuracy= \frac{TP+TN}{TP+TN+FP+FN}$$
(8)
$$Precision=\frac{TP}{TP+FP}$$
(9)
$$Sensitivity=\frac{TP}{TP+FN}$$
(10)
$$F\text{-}measure=\frac{2\times Precision\times Sensitivity}{Precision+Sensitivity}$$
(11)

where \(TN\), \(TP\), \(FN\), and \(FP\) represent true negative, true positive, false negative, and false positive, respectively. Also, we utilize the receiver operating characteristic (ROC) curve and area under the ROC curve (AUC) to evaluate the performance of the various ML models.
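These metrics can be computed with scikit-learn as sketched below; the label arrays are placeholders:

```python
# A minimal sketch of Eqs. (8)-(11) plus AUC, computed with scikit-learn
# from a model's test-set predictions.
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])          # ground-truth labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])          # hard predictions
y_prob = np.array([.9, .2, .8, .4, .1, .7, .6, .3])  # predicted P(class 1)

print("accuracy:   ", accuracy_score(y_true, y_pred))   # Eq. (8)
print("precision:  ", precision_score(y_true, y_pred))  # Eq. (9)
print("sensitivity:", recall_score(y_true, y_pred))     # Eq. (10)
print("f-measure:  ", f1_score(y_true, y_pred))         # Eq. (11)
print("auc:        ", roc_auc_score(y_true, y_prob))    # area under ROC
```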

5 Results and Discussion

This section presents the results obtained from the experiments. Firstly, the heart disease data is resampled using SMOTE-ENN to create a dataset with a balanced class distribution. Secondly, feature selection is performed using the proposed RFE technique. Though all the features are associated with heart disease, research has shown that reduced feature sets usually improve classification performance [36, 37]. The optimal feature set obtained by the RFE with the gradient boosting estimator comprises F1, F3, F5, F7, F8, F9, F10, F11, and F12.

The logistic regression estimator selected F1, F2, F3, F5, F7, F8, F9, F10, and F12, whereas the decision tree estimator selected F1, F2, F3, F5, F6, F7, F8, F9, and F12. Applying the proposed feature selection rule therefore gives the final feature set: F1, F2, F3, F5, F7, F8, F9, F10, and F12 (a quick check of this rule is sketched below). The selected features are used to build the ML models. Table 2 shows the performance of the classifiers when trained with the complete feature set, while Table 3 shows their performance after the data has been resampled and the feature selection applied.
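As a quick check, applying Eq. (1) to the three subsets with plain Python sets reproduces the reported final feature set:

```python
# Verifying that Eq. (1) applied to the three selected subsets yields the
# final feature set reported above.
gb = {"F1", "F3", "F5", "F7", "F8", "F9", "F10", "F11", "F12"}
lr = {"F1", "F2", "F3", "F5", "F7", "F8", "F9", "F10", "F12"}
dt = {"F1", "F2", "F3", "F5", "F6", "F7", "F8", "F9", "F12"}

final = (gb & lr) | (lr & dt) | (gb & dt)  # Eq. (1); X∩Y∩Z is subsumed
print(sorted(final, key=lambda f: int(f[1:])))
# ['F1', 'F2', 'F3', 'F5', 'F7', 'F8', 'F9', 'F10', 'F12']
```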

Table 2. Performance of the algorithms without feature selection and data resampling.
Table 3. Performance of the algorithms after feature selection and data resampling.

Table 3 shows that the reduced feature set enhanced the performance of the classifiers, and the XGBoost obtained the best performance with an accuracy, sensitivity, precision, F-measure, and AUC of 0.956, 0.981, 0.932, 0.955, and 0.970, respectively. Furthermore, the ROC curves of the various classifiers trained with the reduced feature set are shown in Fig. 2. The ROC curve further validates the superior performance of the XGBoost model trained with the reduced feature set.

Fig. 2. ROC curves of the classifiers trained with the reduced feature set

Furthermore, we used the XGBoost model to conduct a comparative study with other recently developed methods; the results are shown in Table 4. The comparative analysis is conducted to further validate the performance of our approach. We compare the XGBoost performance with recently developed methods, including an SVM- and LASSO-based feature selection method [38], an XGBoost model [39], a hybrid random forest [40], a deep neural network (DNN) [41], a sparse autoencoder-based neural network [3], an enhanced ensemble learning method [2], an improved KNN model [42], and an extra tree classifier with SMOTE-based data resampling [13].

Table 4. Performance comparison with other studies.

Table 4 further demonstrates the robustness of our approach, as the XGBoost model trained with the reduced feature set outperformed the methods developed in the existing literature. This research also highlights the importance of data resampling and efficient feature selection in machine learning.

6 Conclusion

In machine learning applied to medical diagnosis, data resampling and the selection of relevant features from the dataset are vital to improving the performance of the prediction model. In this study, we developed an efficient feature selection approach based on recursive feature elimination. The method uses a set theory-based feature selection rule to combine the features selected by three recursive feature elimination estimators. The reduced feature set then served as input to six machine learning algorithms, of which the XGBoost classifier obtained the best performance. Our approach also showed superior performance compared to eight other methods from recent literature.

Meanwhile, a limitation of this work is that the proposed approach was tested on a single disease dataset. Future research will apply the proposed approach to the prediction of other diseases. Furthermore, future research could employ evolutionary optimization methods, such as genetic algorithms, to select the optimal feature set for training the machine learning algorithms, which could be compared with the method proposed in this work and evaluated on other disease datasets.