1 Introduction

Heart disease refers to conditions that affect the functioning of the heart and blood vessels, and it is a major cause of death worldwide, accounting for about 31% of global deaths. According to the World Health Organization, approximately 17.9 million people worldwide die from this disease each year. According to the American Heart Association, 121.5 million American adults were affected by heart disease in 2016 [1]. Early detection of heart disease can reduce the chance of the disease progressing to a more severe stage by enabling appropriate treatment [2].

With the advent of machine learning, decision support systems have become useful tools in many fields such as manufacturing, marketing, education, weather forecasting, transportation, and healthcare [3]. In the past few decades, machine learning has influenced the healthcare sector to a great extent and various automated decision support systems have been developed for the prediction and diagnosis of diseases [4].

Heart disease is a deadly condition, and timely diagnosis can reduce its severity and thereby save human lives. A decision support system developed using machine learning methods can help in diagnosing the disease using non-invasive tests. Researchers have made many efforts in this direction, and research is still ongoing [5, 6]. This paper provides a detailed survey of the various decision support systems developed for heart disease diagnosis. In developing these systems, researchers have used machine learning and deep learning methods. The performance parameters and validation methods used to evaluate these systems are also presented. Researchers have utilized several online available heart disease datasets to validate their systems; the details of these datasets are also discussed. Finally, the various challenges faced by researchers are described, along with some feasible solutions.

1.1 Motivation

Several decision support systems for heart disease diagnosis have been developed in recent years. To enhance these systems, it is critical to understand how they were developed and what problems researchers faced. It is also crucial to identify what improvements can be made to the existing systems.

1.2 Research Questions

This paper aims to answer the following research questions:

  1. What methods have been used to develop decision support systems for heart disease diagnosis?

  2. What performance parameters and validation methods have been used to evaluate the performance of these systems?

  3. What heart disease datasets are available online?

  4. What issues and challenges have researchers faced in developing automated diagnostic systems?

  5. What strategies can be used to overcome these challenges?

  6. What improvements can be made to heart disease diagnostic systems in the future?

1.2.1 Data Sources

The authors surveyed articles published from 2014 to 2022. An extensive search was performed using the following keywords:

  • “Heart disease diagnosis using machine learning”

  • “Heart disease prediction using machine learning”

  • “Heart disease diagnosis using deep learning”

  • “Heart disease prediction using deep learning”

  • “Intelligent system to diagnose heart disease”

  • “Decision support system for heart disease diagnosis”

  • “Automated heart disease diagnosis”.

The number of articles studied is shown in Fig. 1.

Fig. 1 Number of articles reviewed from 2014 to 2022

The authors surveyed articles from renowned digital libraries such as Springer, IEEE, Hindawi, and Elsevier. The number of articles reviewed from each library is shown in Fig. 2.

Fig. 2 Number of papers from different publishers

A list of notable journals from these digital libraries used for the study is shown in Fig. 3.

Fig. 3 List of journals

The remainder of the paper is organized as follows: Sect. 2 surveys heart disease diagnostic systems developed using machine learning methods, and Sect. 3 surveys systems developed using deep learning methods. Section 4 describes the various online available heart disease datasets. Section 5 describes the performance parameters used to evaluate the performance of decision support systems, and Sect. 6 describes the validation methods used to perform experiments. Section 7 discusses the issues and challenges faced by researchers. Section 8 presents the conclusion and future work.

2 Machine Learning Methods for Heart Disease Diagnosis

Several ways of detecting heart disease using machine learning techniques have been developed by researchers in the previous decade. Many researchers have proposed different approaches under diverse strategies, and these methods are discussed in this section. Ghadiri Hedeshi and Saniee Abadeh [7] diagnosed heart disease by extracting rules using the PSO (particle swarm optimization) algorithm; multiple rules were extracted in each run of PSO. The authors worked on a dataset created by combining the Cleveland, Long Beach, Hungarian, and Switzerland datasets. The dataset contained 920 records, and heart disease was diagnosed using 13 features. An accuracy of 85.76% was obtained. Bashir et al. [8] developed a system for the prediction of heart disease using an ensemble mechanism. The authors performed experiments on five datasets. The absence of disease was indicated by a class label of 0 and the presence of disease by a class label of 1. The inter-quartile range method was used to detect outliers. Classification was performed using DT (decision tree), NB (naive bayes), SVM (support vector machine), and memory-based learner classifiers. Classifier results were aggregated with a majority vote to increase accuracy. An accuracy of 85.81% was obtained on the Cleveland dataset, 80.15% on the SPECTF dataset, 82.40% on the SPECT dataset, 86.12% on the Eric dataset, and 88.52% on the Statlog dataset.
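As an illustration of the majority-voting ensemble idea described above, the sketch below combines a decision tree, naive Bayes, SVM, and a k-nearest neighbour (memory-based) learner with scikit-learn's VotingClassifier. The synthetic data and classifier settings are placeholders, not the configuration of Bashir et al.

```python
# Minimal majority-voting ensemble sketch (hard voting over four base classifiers).
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=13, random_state=0)  # placeholder data
ensemble = VotingClassifier(
    estimators=[("dt", DecisionTreeClassifier(random_state=0)),
                ("nb", GaussianNB()),
                ("svm", SVC(kernel="rbf")),
                ("knn", KNeighborsClassifier(n_neighbors=5))],  # memory-based learner
    voting="hard")  # majority vote over predicted class labels
print("10-fold CV accuracy:", cross_val_score(ensemble, X, y, cv=10).mean())
```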

Tomar and Agarwal [9] developed a system using LSTSVM (least squares twin support vector machine). Features were selected based on their F-score. Experiments were performed on the Statlog dataset and 85.59% accuracy was achieved. Olaniyi et al. [10] proposed a model using MLP (multilayer perceptron) and SVM (support vector machine). The backpropagation algorithm was used to train the MLP with a learning rate of 0.32. SVM provided an accuracy of 87.5% and the multilayer perceptron provided an accuracy of 85%. Marateb and Goudarzi [11] diagnosed heart disease using a fuzzy rule-based system developed with an NFC (neuro-fuzzy classifier). The scaled conjugate gradient algorithm was used to sharply reduce the root mean square error and increase the learning speed of the NFC. In their research, the authors used the Cleveland dataset. As part of data preprocessing, discretization was performed to convert continuous feature values into discrete values. The classification process included both fuzzification and defuzzification. SFS (sequential feature selection) and MLR (multiple logistic regression) algorithms were used to identify significant features. The system was validated using the hold-out method. NFC performance was measured without feature selection, in combination with MLR, and in combination with SFS. The results demonstrated that NFC in combination with MLR provided the highest accuracy of 84%. Khanna et al. [12] achieved 84.7% accuracy with SVM on the Cleveland dataset.

Long et al. [13] developed a system using IT2FLS (interval type-2 fuzzy logic system). Chaos firefly and rough set feature selection algorithms were used to optimize the IT2FLS. The system achieved 88.3% accuracy on the Statlog dataset. Miranda et al. [14] developed a model to detect the risk of heart disease using the NB classifier. The results of blood and urine tests were used to develop the model. Data were collected from Mayapada hospital, and prediction attributes were selected by conducting interview sessions. The data contained 38 attributes of 60,589 patients. Feature selection was performed using the BE (backward elimination) method. Records with incomplete data were removed; after data cleaning, data from 50,528 patients were retained. Data normalization was performed by converting numerical data into categorical data. After data cleaning and normalization, classification was performed using the NB classifier, and the system achieved 80% accuracy. Verma et al. [15] performed feature selection by combining CFS (correlation-based feature selection) with PSO. The K-means clustering algorithm was used to remove outliers. Four classifiers, MLP (multilayer perceptron), MLR (multinomial logistic regression), FURIA (fuzzy unordered rule induction algorithm), and C4.5, were used to develop the prediction model. Experiments were performed on data collected from IGMC (Indira Gandhi Medical College), Shimla. MLR achieved an accuracy of 88.4%.

Jabbar et al. [16] used RF (random forest) to predict heart disease. The feature set was reduced using the chi-square method, and the system achieved 83.70% accuracy. Liu et al. [17] used relief and RS (rough set) methods to select features relevant to heart disease diagnosis. The relief method assigns weights to features and selects the important ones; its output was used as the input of RS to further reduce the feature set. In their research, the authors used an ensemble classifier based on a boosting mechanism, with C4.5 as the weak classifier. System performance was validated with the jackknife test on the Statlog dataset, and 92.59% accuracy was achieved. Buchan et al. [18] predicted the disease based on risk factors such as high cholesterol, physical inactivity, high blood pressure, and an unhealthy diet. The authors used electronic medical records of patients, which are unstructured data, and therefore had to combine natural language processing and machine learning to make predictions. They used the i2b2 Heart Disease Risk Factors Challenge dataset containing records of 296 diabetic patients; since some risk factors for diabetes are also common to heart disease, this posed a challenge for the researchers. Apache cTAKES was used for natural language processing, and the information extracted by cTAKES was used to train the model. Feature selection was performed using PCA (principal component analysis) and MI (mutual information). After feature selection, classification was performed using MaxEnt (maximum entropy), SVM, and NB classifiers. The system achieved a 77.4% F1-score.

Mdhaffar et al. [19] combined the technique of CEP (complex event processing) with statistical methods. The system collected the health parameters of patients through wearable sensors. CEP processed the input data by executing analysis rules based on thresholds; threshold values vary from patient to patient and were calculated automatically from historical data. A Raspberry Pi 3 was used to format the collected data, and a NoSQL database was used to store it. CEP can trigger alarms for predicting heart failure, and the system can generate prediction reports that can be used by cardiologists. The system achieved 84.75% precision. Babic et al. [20] approached heart disease diagnosis using descriptive and predictive analysis. Predictive analysis was performed using NB, DT, SVM, and ANN (artificial neural network), whereas descriptive analysis was performed using decision and association rules. Important features for prediction were selected using statistical methods.

The analysis was performed on three datasets: the Z-Alizadeh Sani dataset, the South African dataset, and a combined dataset created from the Hungarian, Cleveland, Long Beach, and Switzerland datasets. On the Z-Alizadeh Sani dataset, the best accuracy of 86.67% was achieved with SVM. The best accuracy of 73.87% was achieved with DT on the South African dataset. On the combined dataset, ANN achieved the best accuracy of 89.93%.

Davari Dolatabadi et al. [21] diagnosed heart disease from ECG (electrocardiogram) signals obtained from the Long-Term ST Database. The database included ECG recordings of eighty individuals representing events of ST-segment changes. HRV (heart rate variability) signals were extracted from the ECG signals. PCA was applied to the extracted features to select the important ones, which were then used by an SVM classifier for diagnosis, achieving an accuracy of 99.2%. Kumar and Inbarani [22] diagnosed heart disease from ECG signals acquired from the MIT-BIH Arrhythmia database. The discrete wavelet transform was applied to remove noise from the ECG signals and to perform feature extraction. The authors proposed an NRSC (neighborhood rough set classifier) to perform the diagnosis, with Euclidean distance as the distance metric to define the neighborhood. The system diagnosed the disease by classifying each signal as a normal or abnormal heartbeat and achieved 99.32% accuracy.

Shah et al. [23] proposed a system combining PPCA (probabilistic principal component analysis) with SVM. PPCA was used to reduce the dimensionality of the features. RBF (radial basis function) based SVM was used to classify a smaller set of features. The system achieved 85.82% accuracy on the Hungarian dataset, 82.18% accuracy on the Cleveland dataset, and 91.30% accuracy on the Switzerland dataset.

Qin et al. [24] used RF, SVM, MLP, LR (logistic regression), GBDT (gradient boosting decision tree), AdaBoost (adaptive boosting), and KNN (k-nearest neighbor) classifiers to detect heart disease. The authors proposed an ensemble algorithm based upon multiple feature selection methods to select relevant features so that detection accuracy could be improved. A maximum accuracy of 93.70% was achieved on the Z-Alizadeh dataset. Nalluri et al. [25] diagnosed heart disease using a hybrid system. SVM and MLP classifiers were used to perform the classification, and three evolutionary algorithms, GSA (gravitational search algorithm), FA (firefly algorithm), and PSO, were used to optimize the parameters: momentum and learning rate in MLP, and margins in SVM. The system was validated on five cardiovascular disease datasets and obtained an accuracy of 94.1% on the Cleveland dataset, 90.74% on the Statlog dataset, 89.5% on the SPECT dataset, 90.6% on the SPECTF dataset, and 91.4% on the Eric dataset. Alizadehsani et al. [26] used three classifiers to detect the stenosis of three coronary arteries. For the detection of stenosis in each artery, the features to be used were selected by SVM. Each of the three classifiers was only able to predict blockage in an individual artery, so the final prediction of heart disease was made by combining their results. The authors achieved 88.77% accuracy on the Hungarian dataset, 93.06% on the Cleveland dataset, and 96.40% on the Z-Alizadeh dataset.

Verma et al. [27] presented the use of NB, C4.5, and MLP for CAD diagnosis. The authors collected the data of 335 individuals from IGMC, Shimla, India. Disease severity was detected with 77.6% accuracy using C4.5, 73.73% accuracy using NB, and 71.94% accuracy using MLP.

Dhanaseelan and Jeya Sutha [28] proposed HCFI (a frequent itemset algorithm based upon hashing) to detect heart disease. The algorithm efficiently detected the disease by removing unnecessary features and worked in two steps: in the first step, the transaction was initiated, and in the second step, frequent itemsets were generated. The authors evaluated HCFI on the Cleveland dataset. David and Belcy [29] used RF, DT, and NB classifiers, and the system provided the best accuracy of 81% with RF on the Statlog dataset. Haq et al. [30] used three feature selection algorithms, MRMR (minimal redundancy maximal relevance), relief, and LASSO (least absolute shrinkage and selection operator), to select significant features. Classification was performed using seven classifiers: LR, SVM, NB, ANN, DT, RF, and KNN. The combination of RF and relief provided an accuracy of 85%.

Vijayashree and Sultana [31] classified heart disease using SVM. Feature selection was performed using an improved PSO, in which optimal weights were selected using a fitness function optimized with a support vector machine. The system achieved 84.36% accuracy on the Cleveland dataset. Dwivedi [32] predicted heart disease using six classifiers: ANN, LR, SVM, CT (classification tree), KNN, and NB. LR achieved the highest accuracy of 85%. Dogan et al. [33] predicted heart disease using genetic and epigenetic data from the Framingham dataset; a model was constructed using random forest and 78% accuracy was achieved. Saqlain et al. [34] diagnosed heart disease using an RBF kernel SVM. The accuracy of diagnosis was increased by performing feature selection using three methods: different feature subsets were created by combining features using a Fisher score-based algorithm, and forward and reverse feature selection algorithms were then used to select the final feature subset. The system achieved accuracies of 92.68%, 81.19%, 84.52%, and 82.7% on the Switzerland, Cleveland, Hungarian, and SPECTF datasets, respectively. Abdar et al. [35] developed a system using three types of SVM. GA (genetic algorithm) and PSO were used for feature selection and model optimization. Experiments were performed on the Z-Alizadeh dataset and 93.08% accuracy was achieved.

Ayatollahi et al. [36] diagnosed heart disease using SVM and ANN. Data were collected from the Aja University of Medical Sciences, and twenty-five features were used for disease diagnosis. SVM diagnosed with greater accuracy than ANN and provided a sensitivity of 92.23%. Latha and Jeeva [37] used majority voting with NB, BN (Bayes network), MLP, and RF to develop an ensemble model that predicted heart disease with 85.48% accuracy. Khennou et al. [38] performed heart disease prediction using KNN and SVM. A combined dataset of Cleveland, Switzerland, and Hungarian was used to evaluate the results, and a maximum accuracy of 87% was achieved with SVM. Magesh and Swarnalatha [39] used CDTL (cluster-based decision tree learning) to optimize the set of features and performed heart disease classification using these optimized features. The authors used the Cleveland heart disease dataset, which was divided into multiple datasets using class labels. The data was then preprocessed and different class pairs were created. A decision tree was then applied to each dataset and decision attributes were selected from each cluster. Interconnecting features were extracted from these decision attributes, and classifiers were applied to the extracted features. The system achieved an accuracy of 89.30% using an optimized random forest classifier.

Khourdifi and Bahaj [40] developed a system using the SVM, KNN, MLP, RF, and NB classifiers. The FCBF (fast correlation-based feature selection) method was used to select features relevant for classification based on the correlation between the features. The selected subset of features was further optimized by ACO (ant colony optimization) and PSO (particle swarm optimization). In PSO, different individuals of the population, called particles, work together to find a globally optimum solution. ACO optimizes the feature set by selecting features that have less similarity with other features, hence reducing redundancy in the selected features. The system was validated on the Cleveland dataset and provided the best accuracy of 99.65% with KNN. Mohan et al. [41] developed HRFLM (hybrid random forest with a linear model) for the classification of heart disease. Feature selection was performed using a decision tree based on entropy values. The system achieved 88.7% accuracy on the Cleveland dataset. Ali et al. [42] predicted heart failure using a stacked model developed from two SVM models: one for feature selection and the other for classification. An L1 regularized linear SVM was used for feature selection, and an L2 regularized RBF (radial basis function) kernel SVM was used for classification. The system achieved 92.22% accuracy on the Cleveland dataset.
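To make the PSO-based feature selection idea concrete, the sketch below implements a simple binary PSO in which each particle encodes a feature mask and fitness is the cross-validated accuracy of a KNN classifier on the selected features. The synthetic data, KNN fitness function, and PSO constants are illustrative assumptions, not the setup used in the works cited above.

```python
# Binary PSO feature selection sketch: particles are 0/1 masks over features;
# velocities pass through a sigmoid to give the probability of keeping a feature.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=13, random_state=0)  # placeholder data
n_particles, n_features, n_iter = 20, X.shape[1], 30
w, c1, c2 = 0.7, 1.5, 1.5                      # inertia and acceleration constants

def fitness(mask):
    if mask.sum() == 0:                        # avoid an empty feature subset
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, mask == 1], y, cv=5).mean()

pos = rng.integers(0, 2, size=(n_particles, n_features))
vel = rng.uniform(-1, 1, size=(n_particles, n_features))
pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iter):
    r1, r2 = rng.random((2, n_particles, n_features))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = (rng.random((n_particles, n_features)) < 1 / (1 + np.exp(-vel))).astype(int)
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

print("selected features:", np.flatnonzero(gbest), "best CV accuracy:", pbest_fit.max())
```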

Li et al. [43] developed a heart disease diagnosis system using DT, KNN, ANN, LR, SVM, and NB classifiers. The Cleveland dataset was used for the experiments. Missing values in the dataset were removed, and standard scaling and min–max scaling were applied as preprocessing. The authors used standard feature selection methods, including MRMR, relief, LL (local learning), and LASSO, to select relevant features and increase the accuracy of the system. They also proposed a new feature selection method, FCMIM (fast conditional mutual information), based on conditional mutual information: FCMIM selects features with a high mutual information value with respect to the target class that are also compatible with the already selected features. The system achieved a maximum accuracy of 92.37%.

Fitriyani et al. [44] validated their proposed system on the Cleveland and Statlog datasets. Outliers were detected and eliminated using DBSCAN (density-based spatial clustering of applications with noise). The training dataset was balanced using a hybrid SMOTE-ENN method: SMOTE (synthetic minority oversampling technique) oversampled the minority class, and ENN (edited nearest neighbor) removed undesired overlapping samples while ensuring a balanced class distribution. PCC (Pearson's correlation coefficient) and the information gain method were used to remove irrelevant features. The Weka V3.8 tool was used to perform the experiments, and the XGBoost (extreme gradient boosting) classifier was used to predict heart disease. The system achieved 95.90% accuracy on the Statlog dataset and 98.40% accuracy on the Cleveland dataset.
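A minimal Python sketch of the preprocessing pipeline described above (DBSCAN-based outlier removal, SMOTE-ENN balancing of the training data, and XGBoost classification) is given below. The authors used Weka V3.8, so this is only an analogous workflow; the file name cleveland.csv, the target column, and all hyperparameters are assumptions for illustration.

```python
# Sketch: DBSCAN outlier removal -> SMOTE-ENN balancing -> XGBoost classification.
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.combine import SMOTEENN      # pip install imbalanced-learn
from xgboost import XGBClassifier          # pip install xgboost

df = pd.read_csv("cleveland.csv")          # hypothetical path to a heart disease dataset
X = df.drop(columns=["target"]).values     # "target" is an assumed label column
y = (df["target"] > 0).astype(int).values  # binarize: 0 = absence, 1-4 = presence

# 1. Remove records flagged as noise (label -1) by DBSCAN
labels = DBSCAN(eps=3.0, min_samples=5).fit_predict(StandardScaler().fit_transform(X))
X, y = X[labels != -1], y[labels != -1]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# 2. Balance only the training split with SMOTE oversampling + ENN cleaning
X_tr, y_tr = SMOTEENN(random_state=0).fit_resample(X_tr, y_tr)

# 3. Train XGBoost and evaluate on the untouched test split
model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1, eval_metric="logloss")
model.fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, model.predict(X_te)))
```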

Almustafa [45] predicted heart disease using different classifiers: naive Bayes, k-nearest neighbors, decision tree J48, SVM, JRip, stochastic gradient descent, decision tables, and AdaBoost. The author used a combined dataset of the Hungarian, Cleveland, Long Beach, and Switzerland datasets, available online on Kaggle, to conduct the experiments. A total of 14 of the 76 features were used, and from these 14 features the relevant ones were selected using the classifier subset evaluator method. Decision tree, KNN, and JRip provided the best results with accuracies of 98.04%, 99.70%, and 97.26%, respectively. The author also performed a sensitivity analysis on the decision tree and naive Bayes classifiers: for the decision tree, the PCF (pruning confidence factor) was taken as the parameter, and for naive Bayes, the training size was taken as the parameter. The decision tree was chosen for the sensitivity analysis because of its maximum accuracy, and naive Bayes because of its low accuracy. The decision tree performed best with a PCF value of 0.35, and naive Bayes performed best with an 80% training size.

Tama et al. [46] developed a system using a stacked ensemble model. The stacked model was constructed using GB, RF, and XGBoost classifiers, and dimensionality reduction was performed using PSO. The system achieved 93.55% accuracy on the Statlog dataset, 86.49% accuracy on the Cleveland dataset, and 91.18% accuracy on the Hungarian dataset. Terrada et al. [47] used DT, ANN, and AdaBoost classifiers to diagnose heart disease and performed experiments on three datasets; ANN provided the best accuracy of 94%. Verma [48] developed an ensemble model using J48, CART (classification and regression tree), and RF classifiers. The model was validated on the Z-Alizadeh dataset and achieved 84.82% accuracy. Javid et al. [37] developed an ensemble model using a voting mechanism for heart disease prediction. KNN, RF, SVM, GRU (gated recurrent unit), and LSTM (long short-term memory) classifiers were used to develop the ensemble model. Experiments were performed on the Cleveland dataset and 85.71% accuracy was achieved.

Joloudari et al. [49] developed models using DT (decision tree), RT (random tree), CHAID (chi-squared automatic interaction detection), and SVM. Important features were selected based on feature ranking, and the random tree provided the best accuracy of 91.47%. Mienye et al. [50] partitioned the dataset into different segments and developed models on the partitioned datasets using CART (classification and regression tree). An ensemble model was developed by combining the different CART models, and the system achieved 91% accuracy on the Framingham dataset. Spencer et al. [51] combined four datasets (Cleveland, Hungarian, Long Beach, and Switzerland) into one dataset. After creating the integrated dataset, important features were selected using the chi-square method, and heart disease classification was performed using the Bayes net algorithm, achieving 85% accuracy. Gazeloğlu [52] achieved 84.81% accuracy in heart disease classification using the SVM classifier. Budholiya et al. [53] developed a system using the XGBoost classifier whose hyperparameters were optimized using Bayesian optimization; the system achieved 91.8% prediction accuracy. Amin et al. [54] developed different models for predicting heart disease using DT, KNN, NB, SVM, LR, and ANN classifiers. An ensemble model was also developed by applying voting to naive Bayes and logistic regression. Significant features were selected using the brute force method, and the system achieved a maximum accuracy of 87.4% using the voting mechanism.

(L et al. 2021) developed models to predict heart disease using RF, NB, DT, AdaBoost, LR, GB (gradient boosting), and XGBoost classifiers. Relevant features were selected using GA, and the performance of the models was optimized using hyperparameter optimization. Feature selection increased the accuracy of all the classifiers: DT achieved 88.7% accuracy, RF 90.7%, AdaBoost 85.5%, NB 62.7%, LR 70.4%, KNN 84.5%, XGBoost 85.2%, and GB 86.8%. Gárate-Escamila et al. [55] developed a system using random forest in which the optimal set of features was selected using chi-square and PCA. The system provided an accuracy of 99% on the Hungarian dataset, 98.7% on the Cleveland dataset, and 99.4% on the combined Cleveland-Hungarian dataset. Arul Jothi et al. [56] used DT and KNN classifiers to predict heart disease; KNN achieved 67% accuracy and DT achieved 81% accuracy. Valarmathi and Sheela [57] optimized the random forest classifier using randomized search, grid search, and a genetic algorithm. Important features for diagnosis were selected using the SFS (sequential forward selection) algorithm. The optimized random forest provided 80.2% accuracy on the Z-Alizadeh dataset. Bahani et al. [58] developed a system to predict heart disease using FCRLC (a fuzzy rule-based classification system with fuzzy clustering and linguistic modifiers). The system achieved 83.17% accuracy on the Cleveland dataset.

Shorewala [59] developed a stacked model using KNN, SVM, and RF for heart disease detection. The LASSO algorithm was utilized for feature selection. The model was evaluated on a cardiovascular disease dataset containing records of 70,000 patients and was shown to be 75.1% accurate. (L et al. 2021) developed an optimized model using RF, with features chosen using GA; the model provided 90.7% accuracy on the Z-Alizadeh dataset. Rani et al. [60] developed a hybrid system using NB, SVM, RF, LR, and AdaBoost classifiers. Features were selected using GA and RFE (recursive feature elimination) algorithms. SMOTE and standard scaler methods were also used for data preprocessing, and missing values were imputed using MICE (multivariate imputation by chained equations). The system achieved a maximum accuracy of 86.6% with RF (random forest). Rani et al. [61] selected features by computing feature importance using an ET (extra trees) classifier. Classification was performed using KNN, XGBoost, SVM-Linear (support vector machine with a linear kernel), and SVM-RBF (support vector machine with a radial basis function kernel). Hyperparameter optimization was performed using grid search. The system provided an accuracy of 95.16% with SVM-RBF on the Z-Alizadeh Sani dataset.

Patro et al. [62] used a support vector machine optimized with a Bayesian algorithm and achieved 93.3% accuracy on the Cleveland dataset. Louridi et al. [63] filled in missing data using the MICE, mean, KNN, and mode methods, and class balancing was also performed on the dataset; an accuracy of 95.83% was achieved using a stacking algorithm. Ghosh et al. [64] selected features using the relief and LASSO methods. Hybrid classifiers were developed by combining boosting and bagging methods, and the best accuracy of 99.05% was achieved with RFBM (random forest bagging method). Nawaz et al. [65] developed a heart disease prediction model using KNN, SVM, RF, ANN, and GDO (gradient descent optimization); GDO achieved a maximum accuracy of 98.54%. Chang et al. [66] developed a system using RF and achieved 83% accuracy. Archana et al. [67] developed a hybrid method using NB and RF with features selected using the relief algorithm; an accuracy of 93% was obtained on the Cleveland dataset. Nagavelli et al. [68] detected heart disease using the XGBoost classifier and achieved 95.9% accuracy on the Cleveland dataset; records with missing values were not used. Gao et al. [69] performed experiments with SVM, RF, DT, KNN, NB, and ensemble algorithms. Features were selected using LDA (linear discriminant analysis) and PCA, and the ensemble algorithm with DT gave a maximum accuracy of 98.6%. The heart disease diagnosis systems developed by researchers using machine learning methods are summarized in Table 1.

Table 1 Summary of heart disease diagnosis systems developed using machine learning

3 Deep Learning Methods for Heart Disease Diagnosis

Researchers have proposed several methods for identifying heart disease using deep learning techniques. Many researchers have presented various approaches under various strategies, which are discussed here. Choi et al. [72] developed a model to detect heart failure using an RNN (recurrent neural network). The authors analyzed the relationships between temporal events in EHR (electronic health records) using the GRU of the RNN. The EHR data were obtained from Sutter PAMF (Palo Alto Medical Foundation). EHR events were represented by a set of one-hot vectors: an N-dimensional vector was used to represent N events, with one dimension set to 1 to indicate the occurrence of the event and the rest set to 0. The vector x_t at timestamp t was given as input to the GRU, whose state was stored in hidden layer h; the state of the hidden layer changed with each timestamp. Logistic regression was applied to the final hidden state to produce a scalar value representing the patient's risk score. The model achieved an AUC value of 0.777. Arabasadi et al. [73] proposed a system for CAD diagnosis using a neural network whose weights were optimized by GA. The backpropagation algorithm was used to train the ANN. The GA used 100 chromosomes as the initial population, and the fitness of each chromosome was calculated using the RMSE (root mean square error) of the untrained ANN. The roulette wheel algorithm was used for selection, two-point crossover was used with a crossover probability of 1, and mutation was performed using Gaussian mutation. Each chromosome contained all the weights of the neural network, with each gene holding one weight. Feature selection was performed by SVM. The system was evaluated on the Z-Alizadeh Sani dataset and achieved an accuracy of 93.85%.
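The sketch below illustrates the general shape of such a GRU-based risk model in PyTorch: one-hot event vectors per timestamp are fed to a GRU, and a logistic layer on the final hidden state produces a scalar risk score. The dimensions, layer sizes, and toy input are assumptions for illustration, not the architecture or data of Choi et al.

```python
# GRU over sequences of one-hot EHR event vectors, with a logistic output layer.
import torch
import torch.nn as nn

class HeartFailureRiskGRU(nn.Module):
    def __init__(self, n_event_types: int, hidden_size: int = 128):
        super().__init__()
        self.gru = nn.GRU(input_size=n_event_types, hidden_size=hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, 1)          # logistic regression on the final state

    def forward(self, events):                        # events: (batch, timestamps, n_event_types)
        _, h_final = self.gru(events)                 # h_final: (1, batch, hidden_size)
        return torch.sigmoid(self.out(h_final.squeeze(0)))   # scalar risk score per patient

# Toy usage: 2 patients, 5 timestamps, 100 possible event types (one-hot encoded)
x = torch.zeros(2, 5, 100)
x[0, 0, 3] = 1.0                                      # event 3 occurred at patient 0's first visit
model = HeartFailureRiskGRU(n_event_types=100)
print(model(x).shape)                                 # torch.Size([2, 1])
```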

Samuel et al. [74] developed an ANN model to predict the risk of heart failure. The network weights were optimized using a fuzzy approach, and an accuracy of 91.10% was achieved. Kim and Kang [75] developed a system to predict the risk level of heart disease using ANN. Relevant features contributing to diagnosis were selected by performing a correlation analysis of the features, and the correlated features were coupled by connecting them to the hidden layer of the neural network. Only relevant features were used as inputs to the neural network to predict the disease. Experiments were performed on the dataset collected in KNHANES-VI (the 6th Korea National Health and Nutrition Examination Survey), and the system provided an accuracy of 82.51%. Caliskan and Yuksel [76] proposed a DNN (deep neural network) model for CAD diagnosis by combining a softmax layer with two autoencoders. The authors validated the model on four different datasets, achieving 85.2% accuracy on the Cleveland dataset, 84% on the Long Beach dataset, 92.2% on the Switzerland dataset, and 83.5% on the Hungarian dataset. Poornima and Gladis [77] proposed a hybrid classifier for the prediction of heart disease. Features were selected using OLPP (orthogonal local preserving projection), and classification was performed using ANN. The neural network had 4 neurons in the input layer, 100 neurons in the hidden layer, and 5 neurons in the output layer, with connection weights ranging from −10 to 10. The network was optimized using LM (Levenberg–Marquardt) and GSO (group search optimization) for setting the weights; of the two sets of weights obtained, the better set was used in the network. The authors used three datasets, Cleveland, Hungarian, and Switzerland, to validate the results, achieving 94% accuracy on the Cleveland dataset, 98% on the Hungarian dataset, and 87% on the Switzerland dataset. Malav and Kadam [78] predicted heart disease using ANN and K-means on the Cleveland dataset. The dataset was first clustered using K-means, and the output of K-means was then given as input to the ANN for classification; K-means reduced the convergence time of the ANN. The system achieved 89.53% sensitivity and 93.52% precision.

Tan et al. [79] proposed a system for CAD diagnosis using a stacked model of LSTM and CNN applied to ECG signals. The ECG signal data were obtained from the PhysioNet database, and only lead 2 signals were used. The dataset consists of ECG signals from 7 CAD patients and 40 healthy individuals. The system achieved an accuracy of 99.85%. Miao and Miao [80] developed a DNN model for heart disease diagnosis and achieved 83.67% accuracy on the Cleveland dataset. Ali et al. [81] developed a system for diagnosing heart disease using a DNN. In a DNN, both overfitting and underfitting of the model must be avoided: irrelevant features in the training data lead to overfitting, while an insufficient number of features may lead to underfitting. To tackle the problem of selecting relevant features, the authors used the chi-square statistical method, and the network configuration was optimized using an exhaustive grid search. The authors used the Cleveland dataset for their experiments, achieving 93.33% accuracy.

Meshref [82] achieved 84.25% accuracy in heart disease diagnosis using ANN, with features selected using an attribute subset selection method. Verma and Mathur [83] used correlation and the cuckoo search method to select important features and developed a DNN for the diagnosis of heart disease. For an individual detected with the disease, the severity of the disease was reported using case-based logic. An accuracy of 85.48% was obtained using this approach.

Javeed et al. [84] developed two hybrid systems, FWAFE-DNN and FWAFE-ANN, for heart disease diagnosis. The authors proposed the FWAFE (floating window with adaptive size for feature elimination) algorithm for feature selection, which was used in both systems. In this feature selection method, a floating window is used to eliminate features: the window size is varied from 1 to n−1, the features that reside in the window are eliminated, and feature selection is performed by evaluating the performance of the system for the different resulting feature subsets. After feature selection, classification was performed using ANN in the FWAFE-ANN hybrid system and DNN in the FWAFE-DNN hybrid system. The authors used the Cleveland dataset and the hold-out validation method with 70% training data and 30% testing data. FWAFE-ANN achieved 91.11% accuracy and FWAFE-DNN achieved 93.33% accuracy.
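The sketch below gives one possible reading of the floating-window idea: every contiguous window of size 1 to n−1 is removed in turn, the remaining features are scored with a hold-out-validated classifier, and the best-scoring subset is kept. The scoring model (scikit-learn's MLPClassifier standing in for an ANN), the synthetic data, and the 70/30 split are assumptions, not the authors' exact FWAFE implementation.

```python
# Floating-window feature elimination sketch: drop each window of features,
# score the remaining subset on a hold-out split, keep the best subset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=13, random_state=0)  # placeholder data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def score(keep):
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
    clf.fit(X_tr[:, keep], y_tr)
    return clf.score(X_te[:, keep], y_te)

n = X.shape[1]
best_keep, best_acc = np.arange(n), score(np.arange(n))
for window in range(1, n):                        # window sizes 1 .. n-1
    for start in range(0, n - window + 1):        # slide the window across the features
        keep = np.setdiff1d(np.arange(n), np.arange(start, start + window))
        acc = score(keep)
        if acc > best_acc:
            best_keep, best_acc = keep, acc

print("retained features:", best_keep, "hold-out accuracy:", best_acc)
```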

Pan et al. [85] developed a system for heart disease prediction using an enhanced deep-learning-assisted convolutional neural network. The dataset was pre-processed by removing missing values and applying scaling methods. Deep learning was used for feature selection, and classification was performed using MLP and BN. The system achieved 94.9% accuracy. Ali et al. [81] proposed OCI-DBN (optimally configured and improved deep belief network) for heart disease prediction. Important features were selected using the Ruzzo-Tompa approach, in which features are selected by computing the fitness of each feature. The configuration of the deep belief network was optimized using a stacked genetic algorithm, and the system achieved 94.61% accuracy. Dutta et al. [86] used data collected in the National Health and Nutrition Examination Survey from 1999 to 2016. The data were highly imbalanced, containing 1300 records of heart patients and 35,779 records of healthy individuals. A CNN (convolutional neural network) model was proposed for CAD diagnosis that provided 79.5% accuracy; relevant features were selected using LASSO before being passed to the model for classification.

Paragliola and Coronato [87] proposed a system to identify the risk of cardiac events due to hypertension. The system was developed using LSTM and CNN and used ECG signals as inputs, achieving 98% accuracy. Cherian et al. [88] predicted heart disease using an ANN model. The feature set was reduced using PCA, and a hybrid approach combining the LA (lion algorithm) and PSO was used to optimize the weights of the neural network. Results were validated on the Statlog dataset and 87.09% accuracy was achieved. Salhi and Tari [89] selected important features using a correlation matrix and used ANN to diagnose heart disease with the selected features; a maximum accuracy of 93% was achieved. Murugesan et al. [90] developed a super learner by combining three bio-inspired algorithms with ANN. Three sets of features were selected using the CSO (cat swarm optimization), BFO (bacterial foraging optimization), and KH (krill herd) algorithms, and a BPNN (backpropagation neural network) was trained using the features selected by each algorithm. An accuracy of 86.36% was achieved on the Statlog dataset and 84% on the Cleveland dataset. Bharti et al. [91] developed a DNN model to detect heart disease, using dropout layers to prevent overfitting; the DNN achieved 94.2% accuracy on the Cleveland dataset. Mehmood et al. [92] used LASSO to select features and applied the selected features to a CNN, achieving 97% accuracy. Koppu et al. [93] proposed a model in which preprocessing was first performed using spline interpolation to fill in missing data and entropy correlation to detect outliers. Optimal features were selected using F-DA (fitness-oriented dragonfly optimization algorithm) and applied to a DBN (deep belief network), achieving 84.44% accuracy. The heart disease diagnosis systems developed by researchers using deep learning methods are summarized in Table 2.

Table 2 Summary of heart disease diagnosis systems developed using deep learning

4 Online Available Heart Disease Datasets

In the literature, researchers have used various clinical datasets available online to develop models for heart disease diagnosis. The details of these datasets are given in this section.

4.1 Cleveland, Hungarian, Switzerland, and Long Beach Heart Disease Datasets

All four datasets are available in the UCI (University of California, Irvine) repository. These datasets have a total of 76 features, including both continuous and categorical features, although published studies have used only 14 of them. Thirteen of these 14 features are prognostic features and one feature differentiates between the presence and absence of disease.

The presence of the disease is indicated by a value from 1 to 4, indicating the level of disease severity, and the absence of disease is indicated by the value 0. All four datasets have some missing values. The Cleveland dataset is the most widely used by researchers [95].
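For reference, the snippet below shows the common way the processed Cleveland data are loaded and the multi-level target (0-4) is binarized into absence/presence, as most of the surveyed studies do. The column names follow the UCI documentation; the URL is the traditional UCI location and may need adjusting.

```python
# Load the processed Cleveland dataset and binarize the target (0 = absence, 1-4 = presence).
import pandas as pd

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "heart-disease/processed.cleveland.data")
cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
        "exang", "oldpeak", "slope", "ca", "thal", "num"]
df = pd.read_csv(url, names=cols, na_values="?")   # missing values are coded as "?"

X = df.drop(columns=["num"])
y = (df["num"] > 0).astype(int)                    # 0 = no disease, 1 = disease present
print(X.shape, y.value_counts().to_dict())
```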

4.2 Statlog Heart Disease Dataset

This dataset has 13 predictive features and one feature indicating the presence or absence of heart disease. There are 270 instances in the dataset. The presence of heart disease is indicated by 1 and its absence by 0. There are no missing values in this dataset [101].

4.3 Framingham Heart Disease Dataset

This dataset is available on Kaggle and contains 15 predictive features and 4240 examples. The values of some features are missing in this dataset [100].

4.4 SPECTF Heart Disease Dataset

This dataset includes features extracted from cardiac SPECT (single photon emission computed tomography) images. The dataset consists of a total of 44 attributes based on which heart disease is diagnosed and contains records of 267 individuals. It is available in the UCI repository [96].

4.5 Z-Alizadeh Sani Dataset

This dataset contains records of 303 individuals. The dataset contains 54 predictive features classified into four categories: ECG, demographic, symptom and examination, and laboratory. Based on these predictive features, a person is classified as normal or as a CAD patient. There are no missing values in this dataset [97].

The various online available heart disease datasets are summarized in Table 3, and Fig. 4 shows the datasets along with their number of features.

Table 3 Summary of online available heart disease datasets
Fig. 4 Online available heart disease datasets

5 Performance Parameters

The various performance parameters used to evaluate the classification performance of a system are as follows:

  • Accuracy

  • Sensitivity

  • Specificity

  • Precision

  • F-Measure

  • AUROC (area under the receiver operating characteristic curve)

The performance parameters are calculated using the numbers of true positives, false positives, true negatives, and false negatives. If a patient suffers from the disease and the model predicts the disease, the prediction is a true positive; if the model fails to predict it, it is a false negative. If an individual does not suffer from the disease and the model correctly classifies the individual as healthy, the prediction is a true negative; otherwise, it is a false positive. The performance parameters are shown in Fig. 5 and the usage percentage of these parameters in the studied literature is shown in Fig. 6.
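A small worked example of how these parameters are computed from the four counts is shown below; the counts are made up for illustration. Note that AUROC is computed from the model's predicted scores rather than from these counts alone, so it is omitted here.

```python
# Compute accuracy, sensitivity, specificity, precision, and F-measure
# from illustrative confusion-matrix counts.
TP, FP, TN, FN = 40, 5, 45, 10

accuracy    = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)           # recall / true positive rate
specificity = TN / (TN + FP)           # true negative rate
precision   = TP / (TP + FP)
f_measure   = 2 * precision * sensitivity / (precision + sensitivity)

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} precision={precision:.3f} f_measure={f_measure:.3f}")
```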

Fig. 5 Performance parameters

Fig. 6 Usage percentage of performance parameters

6 Validation Methods

Most researchers have used one or both of the following methods to validate the results:

6.1 Hold-Out Validation

In this method, the dataset is divided into two parts: one part is used for training the system and the other for testing it. Most researchers have used 70% of the data for training and 30% for testing, although some have used other percentage splits.
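A minimal sketch of the commonly used 70/30 hold-out split is shown below; the synthetic data and random forest classifier are placeholders.

```python
# Hold-out validation: 70% of the data for training, 30% for testing.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=13, random_state=0)  # placeholder data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))
```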

6.2 K-Fold Validation

In this method, the dataset is divided into k groups. Classification performance is evaluated over k iterations, using k−1 groups for model training and one group for model testing. In each iteration, a different group is selected for testing and the remaining groups are used for training. The performance of the classifier is calculated by averaging the performance over the k iterations. Most researchers have used 10-fold validation (k = 10), although some have also used 2-fold, 3-fold, and 5-fold validation.
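The sketch below shows 10-fold cross-validation with the classifier's accuracy averaged over the folds; again, the data and classifier are placeholders.

```python
# 10-fold cross-validation: each fold is used once for testing, and the accuracy is averaged.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=13, random_state=0)  # placeholder data
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=10)
print("mean 10-fold accuracy:", scores.mean())
```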

7 Challenges and Suggested Solutions

The study of the existing literature shows that there are many issues and challenges in the automated diagnosis of heart disease. The challenges faced by researchers are shown in Fig. 7 and the suggested solutions in Table 4.

Table 4 Suggested solutions to overcome challenges
Fig. 7 Challenges faced by researchers

8 Conclusion and Future Work

Machine learning and deep learning approaches have been used to develop several decision support systems for the detection of heart disease. These systems were accurate to varying degrees. A system's accuracy is determined by the preprocessing methods, feature selection method, and classifier used; designing a high-performance decision support system for heart disease diagnosis therefore requires effective preprocessing approaches, feature selection, and classifiers. The majority of the researchers used data from the UCI repository, which is available online; the Cleveland and Z-Alizadeh datasets are the most popular and widely used.

Based on the review of the literature, the following suggestions should be considered in future research for more accurate heart disease detection:

  1. By combining several machine learning algorithms and mining the unstructured data available in enormous quantities in healthcare organisations, more hybrid models for reliable prediction of heart disease can be developed.

  2. In heart disease prediction, classification algorithms have received greater attention than association rules. Future research should also consider association rules to achieve better results.

  3. The majority of studies used datasets available online to train and test prediction models. Real-time data can be collected from a large number of heart disease patients at reputable medical institutes across the country and used to train and evaluate prediction algorithms.

  4. For a more accurate diagnosis, highly skilled cardiologists must be consulted to prioritize the features based on their impact on the patient's health and to add more vital heart disease attributes.