1 Introduction

Heart disease is a serious concern in the research field, and the significant challenge is accurate detection or prediction at an early stage to minimize the risk of death. According to the World Health Organization (WHO) [1], medical professionals have predicted only 67% of heart diseases correctly, and hence there is vast research scope in the area of heart disease prediction. Many technicalities and parameters are involved in predicting the disease accurately. Various machine learning and deep learning algorithms and several optimization techniques have been used to predict heart disease risk. All these techniques focus mainly on achieving higher accuracy, which shows the importance of correct prediction of heart disease. Accurate prediction at an early stage would help doctors save millions of lives [2]. Recurrent neural network (RNN) models are best suited to temporally sequenced data, and several variants have been chosen for sequenced features. Long short-term memory (LSTM) has been used in various sequence-based tasks such as language modelling, handwriting recognition and other such tasks, where it shows impressive performance [3, 4]. For better performance, evolutionary algorithms (EAs) are used for model optimization. Population-based EAs with self-adaptability are very useful for feature selection and extraction. The EAs used in recent years include ant colony optimization (ACO), particle swarm optimization (PSO) and the genetic algorithm (GA). The GA is a stochastic method for optimization and global search, which is very helpful in handling medical data. It obtains possible solutions from a set of individuals and improves their quality by means of the mutation, crossover and selection operators.
PSO, a meta-heuristic algorithm, is considered in this study because of its simplicity and ease of implementation. It uses only a few parameters and therefore requires little parameter tuning. PSO is population-based and exhibits an information-sharing mechanism, and hence it has been extended from single-objective to multi-objective optimization. It has been successfully applied in the medical field for heart disease prediction and has recorded good performance [5, 6]. The main contributions of this study are:

  • Improving the accuracy of heart disease prediction in humans using efficient feature selection and classification methods.

  • Implementing the GA and PSO for efficient feature selection.

  • Implementing the RNN and LSTM to improve the accuracy of heart disease prediction.

  • Comparing the performance of the proposed method with existing techniques in terms of accuracy, precision, recall and f-measure.

The remainder of the paper is organized as follows: Sect. 2 presents a literature survey of existing research on feature selection techniques and deep learning classification methods for heart disease prediction. Section 3 discusses the implementation of the GA and PSO optimization algorithms and of LSTM and RNN classification. Section 4 discusses the performance analysis of the proposed work. The conclusion is presented in Sect. 5.

2 Related Work

In [7], researchers showed that optimization algorithms are necessary for efficient heart disease diagnosis and for estimating the disease level. They used a support vector machine (SVM) and generated an optimization function using the GA to select the more substantial features for identifying heart disease. The data set used in that research is the Cleveland heart disease database. G. T. Reddy et al. developed an adaptive GA with fuzzy logic design (AGAFL) in [8], which helps medical practitioners diagnose heart disease at an early stage. Heart disease was predicted using the hybrid AGAFL classifier, and the research was performed on UCI heart disease data sets. Coronary artery disease is usually diagnosed by angiography, but this method has significant side effects and is highly expensive. Alternative modalities based on data mining and machine learning techniques are presented in [9], where coronary artery disease is diagnosed with a more accurate hybrid technique that uses the GA to enhance the performance of a neural network. That work uses the Z-Alizadeh Sani data set and yields values above 90% in specificity, accuracy and sensitivity. In [10], researchers proposed a recurrent fuzzy neural network (RFNN) trained with the GA for heart disease prediction. The UCI Cleveland heart disease data set is used, and an accuracy of 97.78% was obtained on the testing set. Machine learning is considered an effective support system for large health-diagnosis data, but analyzing such massive data generally requires more execution time and resources. An effective feature selection algorithm was proposed by J. Vijayashree et al. in [11] to identify the significant features that contribute most to disease diagnosis, and PSO was implemented to identify the best solution in reduced time.
In addition to selecting the important features in the given data set, PSO also removes the redundant and irrelevant features. A novel fitness function for PSO was designed in that work using the support vector machine (SVM) to solve the problem of selecting optimal weights for updating the velocity and position of the particles. Overall, optimization algorithms show the merit of handling difficult non-linear problems with adaptability and flexibility. To improve the quality of heart disease classification, Y. Khourdifi et al. used the fast correlation-based feature selection (FCBF) method in [12] to filter out redundant features. Classification based on SVM, random forest, MLP, K-nearest neighbor and an artificial neural network, optimized using PSO mixed with ant colony optimization (ACO) techniques, was applied to a heart disease data set, and the approach proved robust and effective for heart disease classification. In [13], heart disease was predicted with less time and cost using data mining and artificial intelligence; the work focused on PSO and a feed-forward back-propagation neural network, with feature ranking applied to the effective disease factors present in the Cleveland clinical database. After evaluating the selected features, the results show that the proposed classification methods gave the best accuracy. In [14], machine learning algorithms play a major role in disease risk prediction, and the prediction accuracy is influenced by attribute selection in the data set. The Matthews correlation coefficient was considered as the performance metric, and an altered PSO was applied for attribute selection. N. S. R. Pillai et al. in [15] demonstrated a language-model-like technique based on deep RNNs, named PP-RNN, to predict high-risk diagnoses of patients (prognosis prediction).
The proposed PP-RNN uses several RNNs to learn from patients' diagnosis codes in order to predict the existence of high-risk diseases, and it achieved higher accuracy. In [16], M. S. Islam et al. suggested the grey wolf optimization (GWO) algorithm combined with an RNN for predicting medical diseases. Feature selection using GWO removes the irrelevant and redundant attributes, and the RNN classifier, which predicts different diseases, avoids the feature dimensionality problem. That study used UCI data sets and obtained enhanced accuracy in disease prediction on the Cleveland data set. Deep learning techniques can reveal hidden information in structured and unstructured medical data. In [17], researchers used the LSTM for predicting cardiovascular disease (CVD) risk factors; it yields a Matthews correlation coefficient (MCC) of 0.90 and an accuracy of 95%, which is better than the existing methods. Compared with other statistical machine learning algorithms, the proposed LSTM-based module shows the best performance in predicting CVD risk factors. A novel LSTM deep learning method in [18] helped in predicting heart failure at an early stage. Compared with general methods such as SVM, logistic regression, MLP and KNN, the proposed LSTM method shows superior performance. CVD can also occur due to mental anxiety, which may increase during the COVID-19 lockdown period. In [19], researchers proposed an automated tool that uses an RNN for a health care assistance system. A stacked bi-directional LSTM layer is used to detect cardiac problems from the previous health records of patients, and the experimental results show that cardiac troubles were predicted with 93.22% accuracy. In [21], Senthilkumar Mohan et al. proposed a hybrid machine learning technique for effective prediction of heart disease.
Their method finds the major features that improve the accuracy of cardiovascular prediction, using different combinations of features and several known classification techniques. Machine learning techniques were used in that work to process raw data and provided a new and novel discernment towards heart disease. The challenges seen in the existing studies are as follows:

  • In the medical field, a large amount of training data is necessary to avoid over-fitting. If the data set is imbalanced, predictions are biased towards the majority samples and over-fitting occurs.

  • Deep learning algorithms are optimized by tuning hyper-parameters such as activation functions, learning rates and network architecture. However, hyper-parameter selection is a long process because several values are interdependent and multiple trials are required.

  • Significant memory and computational resources are required to ensure timely completion. In addition, the accuracy on the Cleveland heart disease data set needs to be improved using deep learning with feature selection techniques.

3 Methodology

The main purpose of this study is to predict heart disease in humans. The proposed workflow is shown in Fig. 1. It starts with collection of the data set and data pre-processing, followed by feature selection using PSO and GA and classification using the RNN and LSTM classifiers. Finally, the proposed model is evaluated with respect to accuracy, precision, recall and f-measure. This section describes the workflow of the proposed study.

Fig. 1. Proposed heart disease prediction flow with RNN and LSTM classification

3.1 Preprocessing the Data

The data set contains 303 records and 14 attributes, of which 8 are categorical and 6 are numeric. The class attribute is zero for normal and one for having heart disease. Six records have missing values. Two pre-processing strategies are followed: missing values are replaced by the mean value of the feature, and string values are converted into numerical categorical values. After filtering out the records with missing values, 297 complete records were available for heart disease prediction.
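The two pre-processing strategies can be illustrated with a minimal sketch. The field names `chol` and `chest_pain`, the toy values and the helper names are hypothetical and are not taken from the data set; the sketch only assumes that a missing value is represented as `None`.

```python
from statistics import mean

def impute_mean(rows, column):
    """Replace missing values (None) in one numeric column with the column mean."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = mean(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]

def encode_categories(rows, column):
    """Map each distinct string in a column to an integer code (sorted order)."""
    labels = sorted({row[column] for row in rows})
    codes = {label: i for i, label in enumerate(labels)}
    return [{**row, column: codes[row[column]]} for row in rows], codes

# Toy records (illustrative only, not real patient data).
records = [
    {"chol": 200.0, "chest_pain": "typical"},
    {"chol": None, "chest_pain": "atypical"},
    {"chol": 240.0, "chest_pain": "typical"},
]
records = impute_mean(records, "chol")
records, mapping = encode_categories(records, "chest_pain")
```

Here the missing `chol` entry receives the mean of the two observed values, and the two string labels are replaced by the integer codes stored in `mapping`.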

3.2 Enhanced GA and PSO for Feature Selection

Genetic Algorithm (GA)

The GA is a stochastic search algorithm that imitates the natural evolutionary process by means of operators applied to a population. It uses two major operators: crossover and mutation. The crossover operator mates individuals in the parent population, whereas the mutation operator randomly changes the characteristics of individuals and results in diverse offspring. In Algorithm 1, the generated offspring systematically replace the parents. The crossover is single-point and symmetric in nature, and mutation is achieved through bit flipping. The fitness is expressed as a minimization problem,

$$\begin{aligned} fit = \alpha E(C) + \beta \frac{|s_{f}|}{|A_{f}|} \end{aligned}$$
(1)

where E(C) is the error rate of the classifier, \(|s_{f}|\) is the length of the selected feature subset and \(|A_{f}|\) is the total count of available features. The parameters \(\alpha \in [0, 1]\) and \(\beta = 1 - \alpha\) are weights that control the trade-off between classification accuracy and feature reduction.
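As a worked illustration of Eq. (1), the fitness can be computed as in the sketch below. The function name `ga_fitness`, the default value of alpha (0.9) and the example error rate are assumptions made for illustration; the source does not state the weight that was used.

```python
def ga_fitness(error_rate, n_selected, n_available, alpha=0.9):
    """Eq. (1): alpha * E(C) + beta * |s_f| / |A_f|, with beta = 1 - alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha
    return alpha * error_rate + beta * n_selected / n_available

# Same (assumed) classifier error, different subset sizes out of 13 features.
full = ga_fitness(error_rate=0.10, n_selected=13, n_available=13)
subset = ga_fitness(error_rate=0.10, n_selected=8, n_available=13)
```

For an equal error rate, the smaller subset obtains the lower (better) fitness, which is how the second term rewards feature reduction in a minimization setting.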

Selection

Selection chooses a portion of the population to breed the next generation. The choice is based on the fitness values measured using Eq. (1).

Crossover

For further breeding, two parents are selected at random from the previously selected pool, and the process continues until a suitable population size is reached. The crossover takes place at a single point, which is the mid-point of the parent solutions. The crossover probability parameter \(Prob_{c}\) controls the crossover frequency.

Mutation

Random solutions are selected from the candidates chosen for breeding, and bit flipping is carried out on them. A diverse group of solutions arises, which retains various characteristics of the parents. The mutation probability parameter \(Prob_{m}\) controls the mutation frequency.
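The crossover and mutation operators described above can be sketched as follows. This is a minimal illustration in which solutions are lists of 0/1 bits; the function names and the parameter `n_sites` (the number of bits flipped per mutation, one possible reading of nsite) are assumptions.

```python
import random

def midpoint_crossover(parent_a, parent_b):
    """Single-point crossover at the mid-point of two equal-length bit lists."""
    cut = len(parent_a) // 2
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]

def bit_flip_mutation(solution, n_sites, rng):
    """Flip n_sites distinct, randomly chosen bits and return a new solution."""
    child = list(solution)
    for site in rng.sample(range(len(child)), n_sites):
        child[site] = 1 - child[site]
    return child

rng = random.Random(0)
a = [1, 1, 1, 1, 1, 1]
b = [0, 0, 0, 0, 0, 0]
child_1, child_2 = midpoint_crossover(a, b)       # symmetric pair of offspring
mutant = bit_flip_mutation(a, n_sites=2, rng=rng)  # the parent is left unchanged
```

Each offspring of the crossover takes its first half from one parent and its second half from the other, while the mutant differs from its parent in exactly `n_sites` positions.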

Table 1. Algorithm 1

Algorithm 1 (Table 1) proceeds as follows. First, the values of the population size N, nsite, \(Prob_{c}\) and \(max_{it}\) are initialized. Then the population is initialized randomly, with each solution represented as \(x_{i} = (x_{i1}, x_{i2}, \ldots, x_{iD})\). The following steps are repeated until the ending criterion is met: i) evaluate the fitness value \(f(x_{i})\) of each solution; ii) select the breeding population \(x_{val}\) as the top N/2 solutions after sorting by fitness; iii) draw a random value and, if it is higher than \(Prob_{c}\), apply crossover to a random sample from \(x_{val}\); iv) update the existing solution with the enhanced new solution; v) draw a random value and, if it is higher than \(Prob_{m}\), apply mutation to a random sample from \(x_{val}\); vi) update the existing solution with the enhanced new solution; vii) combine \(x_{val}\) and \(x_{newval}\) to form the new population. Finally, the global best solution is returned as the best solution found.
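The loop of Algorithm 1 can be sketched roughly as below. This is a sketch under stated assumptions, not the authors' implementation: the comparison against \(Prob_{c}\) is taken to trigger crossover, the default parameter values are arbitrary, and `toy_fitness` with its `INFORMATIVE` index set is a hypothetical stand-in for a fitness that would normally require training a classifier.

```python
import random

def run_ga(fitness, n_features, pop_size=20, prob_c=0.3, prob_m=0.3,
           n_sites=1, max_it=50, seed=0):
    """Minimise `fitness` over bit lists of length n_features."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = min(population, key=fitness)
    for _ in range(max_it):
        # Selection: keep the better half as the breeding pool (x_val).
        pool = sorted(population, key=fitness)[:pop_size // 2]
        offspring = []
        while len(offspring) < pop_size - len(pool):
            a, b = rng.sample(pool, 2)
            child = list(a)
            if rng.random() > prob_c:      # mid-point crossover
                cut = n_features // 2
                child = a[:cut] + b[cut:]
            if rng.random() > prob_m:      # bit-flip mutation
                for site in rng.sample(range(n_features), n_sites):
                    child[site] = 1 - child[site]
            offspring.append(child)
        population = pool + offspring      # x_val combined with x_newval
        best = min(population + [best], key=fitness)
    return best

INFORMATIVE = {0, 2, 5}  # hypothetical indices of useful features

def toy_fitness(bits, alpha=0.9):
    """Stand-in for Eq. (1): the error grows with each informative feature missed."""
    missed = sum(1 for i in INFORMATIVE if bits[i] == 0)
    error = missed / len(INFORMATIVE)
    return alpha * error + (1 - alpha) * sum(bits) / len(bits)

best = run_ga(toy_fitness, n_features=8)
```

Because the breeding pool is carried over unchanged into the next population, the best fitness found never gets worse from one iteration to the next.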

Particle Swarm Optimization (PSO)

PSO is a population-based search technique inspired by the exchange of information among birds. First, a random population of particles is initialized, and the particles move with a certain velocity based on their interaction with the other particles. At each iteration, the personal best and the global best over all particles are obtained, and the velocities of all particles are updated based on this information. Certain variables assign weights to the personal best and the global best. Algorithm 2 uses a specified transfer function k to convert continuous values into binary values, as a substitute for the basic hyperbolic tangent function. Every particle is represented by a D-dimensional vector whose individual values are binary and are initialized randomly,

$$\begin{aligned} x_{i}= (x_{i1}, x_{i2}...x_ {iD}) \in A_{s} \end{aligned}$$
(2)

where \(A_{s}\) is the available search space. The velocity \(v_{i}\) is likewise represented by a D-dimensional vector and is initialized to 0,

$$\begin{aligned} v_{i}=(v_{i1}, v_{i2}...v_{iD}) \in A_{s} \end{aligned}$$
(3)

Each particle retains its best personal position, recorded as

$$\begin{aligned} p_{i}= (p_{i1}, p_{i2} ...p_{iD}) \end{aligned}$$
(4)
Table 2. Algorithm 2

Algorithm 2 (Table 2) proceeds as follows. First, the swarm size N, the acceleration constants \(A_{c1}\) and \(A_{c2}\), and the values \(w_{max}\), \(w_{min}\), \(v_{max}\) and \(max_{it}\) are initialized. The population is initialized randomly as in Eq. (2), and the velocity vectors are initialized as in Eq. (3). The following steps are repeated until the ending criterion is met: i) the inertia weight w is updated; ii) the fitness value of each solution is evaluated using \(f(x_{i})\); iii) the personal best solution \(p_{best}\) and the global best solution \(g_{best}\) are assigned; iv) the velocity of each particle is computed for the current iteration c; v) the continuous values are mapped into binary values using the transfer function k to generate the new solutions. Finally, the global best is returned as the best solution found.
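A rough sketch of a binary PSO of this kind is given below. Several details are assumptions, since the source does not specify them: a sigmoid is used as the transfer function, the inertia weight decays linearly from \(w_{max}\) to \(w_{min}\), the velocity is clipped to \(\pm v_{max}\), the default parameter values are arbitrary, and `toy_fitness` is a hypothetical stand-in for the classifier-based fitness.

```python
import math
import random

def sigmoid(v):
    """S-shaped transfer function mapping a velocity to a probability."""
    return 1.0 / (1.0 + math.exp(-v))

def run_binary_pso(fitness, n_features, swarm_size=20, a_c1=2.0, a_c2=2.0,
                   w_max=0.9, w_min=0.4, v_max=6.0, max_it=50, seed=0):
    """Minimise `fitness` over bit lists of length n_features."""
    rng = random.Random(seed)
    x = [[rng.randint(0, 1) for _ in range(n_features)]
         for _ in range(swarm_size)]
    v = [[0.0] * n_features for _ in range(swarm_size)]   # Eq. (3), all zeros
    p_best = [list(xi) for xi in x]
    g_best = list(min(p_best, key=fitness))
    for c in range(max_it):
        w = w_max - (w_max - w_min) * c / max_it           # assumed linear decay
        for i in range(swarm_size):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                vel = (w * v[i][d]
                       + a_c1 * r1 * (p_best[i][d] - x[i][d])
                       + a_c2 * r2 * (g_best[d] - x[i][d]))
                v[i][d] = max(-v_max, min(v_max, vel))
                # Transfer function: continuous velocity -> binary position.
                x[i][d] = 1 if rng.random() < sigmoid(v[i][d]) else 0
            if fitness(x[i]) < fitness(p_best[i]):
                p_best[i] = list(x[i])
                if fitness(p_best[i]) < fitness(g_best):
                    g_best = list(p_best[i])
    return g_best

INFORMATIVE = {0, 2, 5}  # hypothetical indices of useful features

def toy_fitness(bits, alpha=0.9):
    """Stand-in for a classifier-based fitness; lower is better."""
    missed = sum(1 for i in INFORMATIVE if bits[i] == 0)
    return alpha * missed / len(INFORMATIVE) + (1 - alpha) * sum(bits) / len(bits)

best = run_binary_pso(toy_fitness, n_features=8)
```

In a real run, the fitness of a bit list would be obtained by training and validating a classifier on the features whose bits are set to one, which is what makes each evaluation expensive.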

LSTM and RNN for Classification

A classification technique to predict heart disease using the RNN and LSTM models is developed. The LSTM model, first proposed by Hochreiter et al. in 1997, is a special RNN model [20]. An RNN connects the current hidden-layer state to the previous n hidden-layer states in order to obtain long-term memory. On the basis of the RNN, the LSTM adds valve (gate) nodes to the layers, which overcomes the long-term memory problems of the RNN. In general, the LSTM adds three gates to the original RNN: an input gate, a forget gate and an output gate. The key idea of the LSTM design is to integrate non-linear, data-dependent controls into the RNN cell, which can be trained to ensure that the gradient of the objective function with respect to the state signal does not vanish. The specification of the RNN and LSTM is shown in Table 3.
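The role of the three gates can be illustrated with one step of a scalar LSTM cell in its standard textbook form. The sketch is not the configuration of Table 3; the function name, the parameter layout and the all-zero weights are assumptions chosen only to make the arithmetic easy to follow.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params maps gate name -> (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))         # how much new information to write
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate cell content
    c = f * c_prev + i * g            # updated cell state (long-term memory)
    h = o * math.tanh(c)              # updated hidden state
    return h, c

# With all-zero weights every gate is half open and the candidate is zero.
zero = {name: (0.0, 0.0, 0.0)
        for name in ("input", "forget", "output", "candidate")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=zero)
```

The additive update of the cell state, scaled by the forget gate, is what lets information and gradients persist across many time steps.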

Table 3. RNN and LSTM Specification

The GA and PSO algorithms combined with the LSTM deep learning model are shown in Fig. 2 and Fig. 3, respectively. Here, GA and PSO are used as feature selection algorithms, and the LSTM is used as the classifier to classify patients into normal and abnormal classes. The selected features are given as input to the classifier. The details of the selected features are given in Table 6.

Table 4. Dataset features description [22]

Fig. 2. GA with LSTM work flow

Fig. 3. PSO with LSTM work flow

4 Results and Discussion

4.1 Dataset Description

This study uses the Cleveland heart disease data set, which is available in the UCI machine learning repository. The data set contains 303 instances, of which six have missing attribute values and 297 have no missing data. In its original form, the data set contains 76 raw features, but the experiments use only the 13 features listed in Table 4. The instances with no missing values are used in the proposed experiments.

4.2 Performance Metrics

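The results of the proposed predictive model are evaluated using performance metrics such as accuracy, precision, recall and f-measure. The metrics are formulated as follows, where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.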

Accuracy: The percentage of records in the test data set that are correctly classified is termed accuracy. It is calculated using the formula given in Eq. (5),

$$\begin{aligned} Accuracy = \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(5)

Precision: Precision is the proportion of subjects predicted as positive that are correctly classified. It is calculated using the formula given in Eq. (6),

$$\begin{aligned} Precision = \frac{TP}{TP+FP} \end{aligned}$$
(6)

Recall: Recall is the proportion of relevant instances that have been retrieved. Both precision and recall are therefore based on an understanding and measure of relevance. Recall is estimated by the formula given in Eq. (7),

$$\begin{aligned} Recall = \frac{TP}{TP+FN} \end{aligned}$$
(7)

F-measure: The F1 score is the harmonic mean of precision and recall. It is computed using the formula given in Eq. (8),

$$\begin{aligned} F1\ Score = \frac{2 \times Precision \times Recall}{Precision + Recall} \end{aligned}$$
(8)
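Eqs. (5) to (8) can be checked with a short sketch. The counts used in the example are those reported for LSTM + PSO in Sect. 4.3 (32 abnormal and 25 normal records predicted correctly, two normal records predicted as abnormal and two abnormal records predicted as normal, 61 in total); the function name is an assumption.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)            # Eq. (5)
    precision = tp / (tp + fp)                            # Eq. (6)
    recall = tp / (tp + fn)                               # Eq. (7)
    f1 = 2 * precision * recall / (precision + recall)    # Eq. (8)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Abnormal (class 1) is the positive class.
m = classification_metrics(tp=32, tn=25, fp=2, fn=2)
```

With these counts the accuracy is 57/61 and the precision, recall and F1 score are each 32/34, that is, roughly 0.93 and 0.94, in line with the values reported in Sect. 4.3.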

4.3 Comparison of Results

The following figures present the performance metrics of the proposed method with feature selection by GA and PSO and classification using RNN and LSTM. The data set is split into 70% for training and 30% for testing. Out of 303 records, 212 randomly chosen records are taken for training, and 61 records are taken for testing and are predicted as the normal class (class 0, the negative class with no heart disease) or the abnormal class (class 1, the positive class with heart disease).
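A random hold-out split of this kind can be sketched as follows. The function name, the fixed seed and the ten toy records are assumptions for illustration; the sketch does not reproduce the exact partition used in the experiments.

```python
import random

def split_records(records, test_fraction=0.3, seed=0):
    """Shuffle the records and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training, testing)

train, test = split_records(range(10), test_fraction=0.3)
```

Fixing the seed makes the partition reproducible, so that different models can be compared on the same test records.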

Fig. 4. Performance evaluation with accuracy of model

Fig. 4 shows the accuracy of the deep learning models, RNN and LSTM, with and without the feature selection algorithms GA and PSO. All six models are compared, and LSTM + PSO shows the best accuracy of 93.5%. Out of 61 records tested, 57 are predicted accurately, of which 25 records are from the normal class and 32 records are from the abnormal class. The LSTM also reaches its accuracy in less time than the RNN, as shown in Table 5.

Table 5. Performance in time

Fig. 5. Performance evaluation with precision of model

Fig. 5 shows the precision of the deep learning models, RNN and LSTM, with and without the feature selection algorithms GA and PSO. Precision gives the percentage of patients predicted as the positive class who are correctly classified. All six models are compared, and LSTM + PSO shows the best performance of 94%. Out of 61 records tested, 34 records are predicted as the abnormal class, of which 32 are accurately predicted and two are from the normal class but wrongly predicted as abnormal.

Fig. 6. Performance evaluation with recall of model

Fig. 6 shows the recall of the deep learning models, RNN and LSTM, with and without the feature selection algorithms GA and PSO. All six models are compared, and PSO shows the best performance of 94% for both the RNN and LSTM classifiers. Out of 61 records tested, 34 records are from the abnormal class, of which 32 are accurately predicted and 2 are wrongly predicted as the normal class.

Fig. 7. Performance evaluation with F-measure of model

Fig. 7 shows the f-measure of the deep learning models, RNN and LSTM, with and without the feature selection algorithms GA and PSO. All six models are compared, and LSTM + PSO shows the best performance of 94%. The f-measure is the harmonic mean of precision and recall.

Table 6. Features accuracy

Table 6 shows the features selected by PSO and GA in the evaluation of the proposed method. PSO selects 8 features, reaches an accuracy of 91% and takes more time, whereas GA selects 11 features, reaches an accuracy of 90% and takes less time than PSO. In terms of accuracy, however, PSO performs better than GA.

Fig. 8. Proposed method performance of RNN evaluation (With GA and PSO)

Fig. 8 shows the performance of the RNN with the GA and PSO feature selection algorithms. The RNN with PSO performs better than the RNN with GA and the RNN without any feature selection. Accuracy is increased by 3% using the PSO algorithm.

Fig. 9. Proposed method performance of LSTM evaluation (With GA and PSO)

Fig. 9 shows the performance of the LSTM with the GA and PSO feature selection algorithms. The LSTM with PSO performs better than the LSTM with GA and the LSTM without any feature selection. Accuracy is increased by 7% using the PSO algorithm.

Table 7. Comparison of Existing methods with Proposed method

Table 7 shows that, compared with the existing methods, the proposed method with LSTM + PSO achieves higher accuracy in predicting heart disease.

5 Conclusion

In this study, an efficient diagnosis approach has been developed for accurate prediction of heart disease. The proposed approach uses enhanced GA and PSO for optimized feature selection from the heart disease data set. Classification is then performed using deep learning models, namely RNN and LSTM. The proposed model has been evaluated using the accuracy, precision, recall and f-measure performance metrics. The results show that the proposed method, which implements LSTM with PSO, yields an accuracy of 93.5%; it has a slightly higher computational time due to the feature selection phase but predicts heart disease more accurately than the existing methods. LSTM + PSO also shows better performance on the other metrics, namely precision, recall and f-measure. Future work may consider further enhancing the performance of the proposed model.