1 Introduction

According to the World Health Organization (WHO), the top three causes of death worldwide are: (a) cardiovascular diseases, including ischaemic heart disease and stroke; (b) respiratory diseases, comprising chronic obstructive pulmonary disease and lower respiratory infections; and (c) neonatal conditions (birth asphyxia and birth trauma, neonatal sepsis and infections, and preterm birth complications) [1]. Of these, cardiovascular diseases are the leading cause of mortality across the globe, accounting for 16% of all deaths [2]. Heart failure occurs when, owing to an anomaly of cardiac function, the heart cannot pump blood at a rate sufficient for the needs of the metabolizing tissues, or can do so only at a raised diastolic filling pressure. Figure 1 presents global statistics and the age distribution pattern of the rising prevalence and incidence of heart failure. In 2017, there were 56.5 million deaths worldwide; slightly under half of the deceased were adults over the age of 70, 26% were aged 50 to 69, 13% were aged 15 to 49, about 1% were aged 5 to 14, and over 9% were children under the age of 5 [3].

In 2020, India, Pakistan, Bangladesh, Sri Lanka, Nepal, Bhutan, and the Maldives collectively accounted for 1.8 billion people, or 23% of the world's population [4]. South Asian nations continue to bear a heavy burden of congenital, nutritional, and infectious diseases that are typical of earlier phases of the epidemiological transition. Every year, over 17 million people worldwide die from cardiovascular diseases, and 0.5 to 1.8 million new cases of heart failure per year occur in India [5]. Further, heart failure is strongly associated with age, and the population of India is on average younger than that of other countries [6].

There are several risk factors for heart failure, including gender, family history, and age; these are categorized as uncontrollable risk factors, whereas high cholesterol, smoking, high blood pressure, and obesity are categorized as controllable risk factors [7]. In the realm of healthcare, electronic health records (EHR) [8] are a systematized collection of patient data that may also be used in clinical and research settings. The EHR can assist care-related tasks, either directly or indirectly, via various interfaces, such as evidence-based decision support, quality monitoring, and outcomes reporting. Currently, mass screening for heart disease is time-consuming and error-prone, and hence expert systems built on artificial intelligence (AI) models are effectively utilized to predict heart failure readmission and mortality rates. These strategies employ patient data to identify occult patterns that strongly influence outcomes. For predicting hospitalization due to heart failure, a combination of clinical and therapeutic data, routine laboratory data, and biomarkers is also beneficial.

Healthcare services, primary care, standardization, cost, data privacy, and security are among the challenges faced by the healthcare sector [9]. Although the effectiveness of AI has been established in many domains, the medical sector has encountered significant deployment challenges for several probable reasons: (1) the need for enormous amounts of patient data to train the models; (2) the fact that if the data used to train machine learning (ML) models is biased, AI-powered systems using those models will produce misleading results; and (3) the possibility that results become tainted when all accessible input data is used indiscriminately [6]. To achieve rapid pattern and anomaly detection, the best features for the model are automatically identified by feature engineering that leverages domain knowledge and data mining. Various automated approaches for heart disease identification have been suggested by researchers in recent years using clinical feature-based data modalities [9,10,11,12,13,14].

Fig. 1

(a) Leading causes of death worldwide and (b) Age distribution of heart disease worldwide based on prevalence and incidence

Currently, it remains challenging to predict the survival of patients with heart failure, both in terms of achieving high prediction accuracy and in ascertaining the contributing factors. As highlighted by Sakamoto et al., most models created for this purpose achieve only moderate accuracy, with poor interpretability of the predictive parameters, mostly owing to a lack of repeatability [15]. To determine the best features and estimate the mortality rate, researchers have used Cox regression, Kaplan–Meier survival plots, and sex-based mortality prediction models [16]. While multiple researchers have used standard statistical approaches to generate some intriguing outcomes, these techniques are ineffective when working with large datasets [17], and more efficient ML algorithms are needed. In recent years, mobile health (mHealth) has garnered considerable attention and popularity due to the pervasive use of mobile devices and the availability of numerous health and wellness applications [18]. A comprehensive evaluation conducted in 2020 [12] also concluded that mobile health interventions were superior to standard treatment in lowering all-cause mortality and all-cause hospitalizations among patients with heart failure.

The use of mHealth applications has the potential to revolutionize the delivery of healthcare by making healthcare solutions more practical and readily accessible. It is estimated that over 500 million patients use mHealth apps to support their self-care activities [19]. Fitness tracking, stress management, blood pressure monitoring, chronic disease management, fetal development tracking, and social support are some of the most promising categories of mHealth utilization. The main characteristics of mHealth apps for cardiovascular disease self-management are [20]: (1) general health status; (2) monitoring of parameters such as weight, step count, heart rate, and blood pressure; (3) electronic health records; (4) medication control and follow-up; and (5) real-time warnings and medication reminders. This research project explored the feasibility of using mHealth technology to aid cardiovascular disease patients in disease management.

Fig. 2

Overview of the proposed machine learning-based heart failure prediction model

In view of these gaps, we propose an ML-based approach for accurate survival diagnosis of cardiovascular patients. The primary objective is to estimate the mortality rate of heart failure in India and explore its relationship with other significant risk factors. To address the issue of class imbalance, no sampling, random oversampling (ROS), random undersampling (RUS), and the synthetic minority oversampling technique (SMOTE) were utilized. Ten different classifiers were employed for medical data mining to predict mortality from heart failure. Furthermore, a robust technique based on recursive feature elimination (RFE) is utilized to extract from the dataset the essential features that affect the performance of the ML algorithms, and these features are then used to investigate the most significant risk factors. Figure 2 depicts each stage of the proposed machine learning-based approach for heart failure prediction. The model that offers the best accuracy is chosen in the subsequent phase and combined with a web-based mHealth application to diagnose heart failure conditions. The Flask backend for this web-based application was built in the Python programming language. The proposed mHealth application offers (1) a user-friendly interface, (2) more reliable data, (3) usability by medical professionals even in the absence of an internet connection, and (4) a highly customized application with distinguishing features. The following are the key contributions of the proposed work:

  • An efficient decision support system has been developed for early prediction of heart failure patients.

  • ROS, RUS, and SMOTE have been used to address class imbalance, providing benchmarks against which the classification models' ability to predict cardiac patient survival is evaluated.

  • The RFE technique has been used to enable the decision support system to automatically identify the highest-ranking optimal features during the training phase in order to improve performance.

  • The effectiveness of the machine learning algorithm is evaluated with regard to several criteria, including precision, recall, and F1 scores.

  • Furthermore, a web application utilizing the Python Flask web development framework was developed by integrating the proposed model.

2 Related work

Healthcare experts often employ computer-aided diagnostic (CAD) systems to save expenses, assist physicians in disease detection, and distinguish between disease progressions. Many studies in the field of cardiovascular diseases use CAD systems based on echocardiography, cardiac magnetic resonance, cardiac computed tomography, or single photon emission computed tomography [21]. When it comes to the detection of cardiac issues, angiography is among the most reliable of the traditional diagnostic tools. Angiography, on the other hand, has a few limitations, including high cost, computational complexity, and the need for advanced technology [22]. Due to human error, conventional approaches often result in inaccurate diagnoses and involve time-consuming evaluations. This has motivated researchers to develop a non-invasive smart healthcare system based on predictive machine learning (ML) [23]. These predictions can then be used to identify patients at risk of developing heart failure or other conditions, allowing healthcare providers to intervene early and provide treatment before the condition becomes severe. By identifying patients at risk early and providing appropriate treatment, the mortality rate for heart failure can be reduced.

Machine learning models have been extensively employed in detecting precursor markers at the early stages of cardiac disease. Risk factors for cardiovascular diseases include tobacco use, advanced age, diabetes, and high blood pressure [24, 25]. Using data from 159 patients, Yang et al. [26] developed a scoring model based on support vector machines and obtained a Youden's index of 69, a sensitivity of 75%, and a specificity of 94%. Twelve different parameters were adopted, including the ratio between low-frequency and high-frequency heart rate variability, sodium, B-type natriuretic peptide, left ventricular ejection fraction, left ventricular end-diastolic diameter, left atrial maximal volume, left ventricular posterior wall, P-R interval, cardiothoracic ratio, six-minute walk distance, standard maximum oxygen consumption, and the ratio between early and late ventricular filling velocity. Overall, the model offered a classification accuracy of 74.4%, with individual accuracy rates of 78.79%, 87.5%, and 65.85% for identifying the healthy group, the heart failure-prone group, and the heart failure group, respectively. Using features including gender, age, blood pressure, and cigarette usage from the medical records of 40 patients, Gharehchopogh and Khalifelu [27] were able to predict 85% of test cases accurately.

Accuracy rates of 79.54%, 61.46%, and 68.96% have been reported by Alizadehsani et al. [28] for the identification of left anterior descending, left circumflex, and right coronary artery stenoses, respectively, using a bagging-based classification model. A two-tier ensemble model has been suggested by Tama et al. [29], which incorporates the class-label predictions of a Gradient Boosting Machine (GBM), Random Forest (RF), and Extreme Gradient Boosting (XGBoost). The investigation conducted by Parthiban and Srivatsa [30] concentrated on diabetic patients who also had cardiac problems. They utilized a variety of prognostic factors, including age, blood pressure, and blood sugar levels, among others, and attained an accuracy of 94.60% using the support vector machine (SVM) classifier. The training was performed on an imbalanced dataset, which is very common in heart diagnosis because certain medical conditions, such as heart failure, are relatively rare compared to healthy cases. An imbalanced dataset may result in biased training, as the model may learn to predict the majority class more accurately than the minority class. Additionally, if the model is unable to generalize well and performs poorly on unseen examples, it may misdiagnose healthy patients as having heart failure, leading to unnecessary treatment and costs.

Adrian et al. [31] collected features such as blood urea nitrogen, confidence interval, chronic kidney disease epidemiology collaboration (CKD-EPI) estimated glomerular filtration rate, high-density lipoprotein, hazard ratio, N-terminal pro-B-type natriuretic peptide, and systolic blood pressure from multiple centers. They found that the predictors of mortality and the predictors of hospitalization due to heart failure were noticeably different. In order to better understand the causes of mortality and cardiac problems, Shah et al. [32] established a research framework that employs ML techniques. Only 14 of the possible 76 features were chosen, since the researchers' primary focus was on developing an accurate and effective system with a minimum number of components. K-Nearest Neighbor (KNC) was the most accurate of the four classifiers evaluated, which also included Naive Bayes (NB), Radial Basis Function (RBF), and an ensemble technique.

Our literature survey indicated that only a limited number of studies have been performed on heart failure diagnosis. In this study, we aim to develop a decision support system for detecting heart failure that is both accurate and efficient. Dataset acquisition is difficult when building an ML-based system, especially in the medical field, as issues arise regarding privacy, confidentiality, ethics, and security [33]. Fortunately, multivariate data comprising 43 features were collected from the All India Institute of Medical Sciences (AIIMS) for 152 patients and used for further study. The most significant features were selected from the feature space using the average importance of each feature across 10 different models. During RFE, the best precision and F1-score of 0.98 and 0.95, respectively, were achieved using KNC with only 4 features. While most published studies utilize a single model to select the best features, the feature importance varies with the choice of model, so averaging the feature importance over multiple models helped us identify the importance of each feature more accurately. The significant properties identified by RFE were gender, diabetes mellitus, Na, heart rate, and systolic blood pressure. On these selected features, GNB was found to be the best among all chosen classifiers, with the best precision value of 0.86 and recall value of 0.71.

Fig. 3

Flowchart of the proposed framework for predicting patient's survival in heart failure

3 Methods and materials

In this study, an ML model is proposed to detect heart failure using data collected between 2020 and 2022 at the All India Institute of Medical Sciences (AIIMS), New Delhi. The model is composed of supervised machine learning classification techniques, including the decision tree classifier (DTC), adaptive boosting classifier (ABC), random forest classifier (RFC), Gaussian process classifier (GPC), gradient boosting classifier (GBC), k-nearest neighbors classifier (KNC), extra tree classifier (ETC), Gaussian naive Bayes classifier (GNB), multi-layer perceptron classifier (MLPC), and support vector machine (SVM). Figure 3 illustrates the architectural schematic of the model, which consists of seven steps: pre-processing, scaling, sampling, training, assessment, feature engineering, and classification. Features with more than 10% missing data are removed from the dataset, and patients with more than 5% missing features are then removed. The remaining missing values are imputed with values predicted by a regression model that takes the top three correlated variables as input. Two scaling techniques, standard scaling and min–max scaling, are applied to the whole dataset, followed by a random selection of the training and testing datasets. Random oversampling (ROS), random under-sampling (RUS), and the synthetic minority oversampling technique (SMOTE) are used to handle the class imbalance in the training set. The aforementioned 10 classifiers are trained on the training dataset, and the importance ordering of the feature set is calculated using recursive feature elimination. A combined average feature ranking is then calculated to select important features, so that better predictions can be made with fewer features, helping doctors reach accurate and early diagnoses.
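As an illustrative sketch only (not the authors' code), this workflow can be expressed with pandas, scikit-learn, and imbalanced-learn; the file name, target column, and the particular sampler/classifier pair shown here are assumptions made for the example:

```python
# Illustrative sketch of the workflow described above (not the authors' exact code).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Hypothetical file and target column names.
df = pd.read_csv("heart_failure_cohort.csv")
X, y = df.drop(columns=["target"]), df["target"]

# 70/30 split, min-max scaling, and SMOTE applied to the training portion only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
scaler = MinMaxScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train_s, y_train)

# One of the ten classifiers, trained on the resampled data and evaluated on the held-out set.
clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
print(classification_report(y_test, clf.predict(X_test_s)))
```

In the study, this loop is repeated over the ten classifiers, the three samplers, and the two scaling options.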

Fig. 4

Probability distribution for all features according to the target variable

3.1 Dataset description

The current investigation was carried out prospectively on individuals who were hospitalized over a two-year period (2020–2022) at AIIMS Delhi, India. The patients range in age from 17 to 80 years, with 30.46% women and 69.54% men. All 151 patients had acute decompensated heart failure and fell into classes II, III, and IV of the New York Heart Association (NYHA) classification of heart failure severity, based on their history of heart failure.

Table 1 consists of 33 features, which report information on the body, clinical data, and lifestyle: BNP, stair, age, gender, diabetes mellitus (DM), thyroid, chronic obstructive pulmonary disease (COPD), chronic kidney disease (CKD), cardiac resynchronization therapy (CRT), edema, tobacco usage, systolic blood pressure (SBP), diastolic blood pressure (DBP), heart rate (HR), sodium (Na), potassium (K), blood urea, serum creatinine (SCr), hemoglobin, white blood cell count (WBC), albumin, total cholesterol, high-density lipoprotein (HDL), low-density lipoprotein (LDL), triglycerides, serum glutamic oxaloacetic transaminase (SGOT/AST), serum glutamic pyruvic transaminase (SGPT/ALT), bilirubin, ejection fraction (EF), dressing yourself, showering/bathing, walking 1 block on level ground, and hurrying or jogging.

Table 1 Characteristics of HF patients in the study cohort

3.2 Dataset pre-processing

Prior to building the machine learning model, the data must be pre-processed. Key activities include exploratory data analysis, generation of new features from the current features, transformation of data, and partitioning of data into training, validation, and testing sets. A total of 151 patient records, each with 43 attributes, were retrieved from AIIMS Delhi, India. The handling of missing values, often represented by the notation NaN (Not a Number), is one of the first significant issues that machine learning must tackle. This issue is handled iteratively using a two-stage strategy that takes into account the total number of NaN cells. First, the columns containing NaN values are identified, and those in which the proportion of NaN values exceeds 10% are removed. In the second phase, rows containing more than 5% NaN values are removed. As a result of this preliminary processing, 128 patients with 33 features were retained for further processing.
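A minimal pandas sketch of this two-stage filter, assuming the records are loaded from a hypothetical CSV file:

```python
# Illustrative two-stage missing-value filter (thresholds from the text; file name hypothetical).
import pandas as pd

df = pd.read_csv("aiims_heart_failure.csv")

# Stage 1: drop columns (features) with more than 10% missing values.
col_missing = df.isna().mean()
df = df.drop(columns=col_missing[col_missing > 0.10].index)

# Stage 2: drop rows (patients) with more than 5% of the remaining features missing.
row_missing = df.isna().mean(axis=1)
df = df[row_missing <= 0.05].reset_index(drop=True)
print(df.shape)
```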

3.3 Scaling

When training on a dataset, it is important that all features receive appropriate weight in determining the outcome. As seen in Fig. 4, a few parameters such as BNP, SGOT, and SGPT typically fall into the range of thousands, while others, like SBP, DBP, HR, Na, cholesterol, HDL, and LDL, vary in the hundreds. Hence, feature scaling techniques are used to normalize the magnitude distribution of each feature before the dataset is utilized. An overview of data preparation techniques, along with feature selection and dimensionality reduction for the transformation of attributes, has been offered by Tan et al. [34]. The following transformation approaches were examined: standard scaling (\({{\text{x}}}_{{\text{ss}}}\)) and min–max scaling (\({{\text{x}}}_{{\text{ms}}}\)).

$${{\text{x}}}_{{\text{ss}}}=\frac{{\text{x}}-\overline{{\text{x}}}}{\upsigma }$$
(1)
$${{\text{x}}}_{{\text{ms}}}=\frac{{\text{x}}-{{\text{x}}}_{{\text{min}}}}{{{\text{x}}}_{{\text{max}}}-{{\text{x}}}_{{\text{min}}}}$$
(2)

where, \(\overline{{\text{x}}}\) and \(\upsigma\) denote the average and standard deviation, respectively, and the output of \({{\text{x}}}_{{\text{ss}}}\) is of the form \(\overline{{\text{x}}}=0\) and \(\upsigma =1\), whereas \({{\text{x}}}_{{\text{ms}}}\) restricts all values to the range [0, 1].
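For illustration, the two transformations of Eqs. (1) and (2) correspond to scikit-learn's StandardScaler and MinMaxScaler (toy values shown, not the study data):

```python
# Sketch of the two scaling options in Eqs. (1) and (2) using scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[120.0, 4500.0], [140.0, 800.0], [110.0, 1500.0]])  # e.g. SBP and BNP (toy values)

X_ss = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature (Eq. 1)
X_ms = MinMaxScaler().fit_transform(X)     # values rescaled to [0, 1] per feature (Eq. 2)
print(X_ss, X_ms, sep="\n")
```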

Fig. 5

Pearson's correlation matrix for all features included in the study. Except for features F30, F31, F32, and F33, there was no significant association between input variables

3.4 Class imbalance

A notable use of data analysis in medicine is in the diagnosis of disorders. However, significant challenges in this endeavor are the uneven distribution of data and the unequal representation of the majority and minority classes, which typically result in misclassification [35]. Mortality is diagnosed from the minority class samples, even though the classifier implicitly treats the majority class samples and their accurate classification as more important. Misdiagnosis leads to additional clinical testing for individuals with mild heart failure but may be fatal for those with severe heart failure. Therefore, it is crucial from a clinical perspective to study class imbalance by performing a detailed assessment of the effects of the imbalanced data. In this context, three different techniques were employed, namely ROS, RUS, and SMOTE; a brief code sketch follows the list below.

  • ROS is a non-heuristic technique that includes randomly duplicating instances from the minority class and adding them to the training dataset.

  • In RUS, instances from the majority class are randomly chosen and removed from the training dataset, leaving fewer majority-class examples in the modified training set. This process can be repeated until the desired class distribution, such as an equal number of samples in each class, is reached.

  • SMOTE uses existing data to generate new synthetic samples in order to strengthen the minority class. As a result, the overfitting associated with simple duplication is mitigated, and the decision boundaries for the minority class are expanded into the region of the majority class.
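A brief sketch of the three resampling strategies using the imbalanced-learn library on a toy imbalanced dataset (illustrative only; in the study they are applied to the training split of the AIIMS data):

```python
# Sketch of ROS, RUS, and SMOTE with imbalanced-learn on a toy imbalanced dataset.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=42)  # imbalanced toy data

samplers = {
    "ROS": RandomOverSampler(random_state=42),
    "RUS": RandomUnderSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(name, Counter(y_res))   # each strategy yields a balanced class distribution
```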

Algorithm 1

Recursive feature elimination algorithm

4 Feature selection

After imputing the missing values, it is necessary to select the critical features that correlate significantly with the outcome of interest for disease diagnosis. The construction of a reliable diagnostic model is hindered by unnecessary and irrelevant features in the extracted feature vector [36]. In this investigation, we utilized the recursive feature elimination (RFE) method to extract the most important characteristics of the patients' cardiovascular disorders. RFE is an advanced technique that methodically evaluates and prioritizes the features in a dataset by their level of importance [37, 38]. In each iteration, it detects and eliminates the least significant feature according to the relevance scores, thereby improving model performance by preserving the most valuable features. As illustrated in Algorithm 1, the feature search process employs backward elimination, initially considering the complete feature set \({\text{F}}=\{{{\text{F}}}_{1}, {{\text{F}}}_{2}, \dots , {{\text{F}}}_{{\text{n}}}\},\) and iteratively eliminating features that do not enhance the classification accuracy, until the optimal number of features is reached and an effective feature subset remains.
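The following sketch mirrors Algorithm 1 in spirit, using permutation importance as the ranking criterion; it is an assumed implementation, not the authors' exact code, and the stopping threshold min_features is illustrative:

```python
# Sketch of backward elimination driven by permutation importance
# (mirrors Algorithm 1 in spirit; not the authors' exact implementation).
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance

def rfe_permutation(X: pd.DataFrame, y, estimator, min_features: int = 4):
    features = list(X.columns)
    while len(features) > min_features:
        est = estimator.fit(X[features], y)
        imp = permutation_importance(est, X[features], y, scoring="f1",
                                     n_repeats=10, random_state=42)
        worst = features[int(np.argmin(imp.importances_mean))]   # least important feature
        features.remove(worst)                                   # drop it and repeat
    return features
```

For example, calling `rfe_permutation(X_train, y_train, GaussianNB())` on the preprocessed training set would return the feature names retained for the Gaussian naive Bayes model.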

5 Classification

Classification algorithms such as DTC, ABC, RFC, GPC, GBC, KNC, ETC, GNB, MLPC, and SVM have been employed in medical data mining to predict mortality from heart failure. A brief description of the various classification algorithms that were examined in this study is shown in Table S1 of the supplementary section.
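For reference, the ten classifiers map onto their scikit-learn implementations as sketched below; the hyperparameters are library defaults, since the exact settings are not reported here:

```python
# The ten classifiers used in this study, instantiated with scikit-learn defaults
# (placeholder hyperparameters; the paper does not report the exact settings).
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

classifiers = {
    "DTC": DecisionTreeClassifier(random_state=42),
    "ABC": AdaBoostClassifier(random_state=42),
    "RFC": RandomForestClassifier(random_state=42),
    "GPC": GaussianProcessClassifier(random_state=42),
    "GBC": GradientBoostingClassifier(random_state=42),
    "KNC": KNeighborsClassifier(),
    "ETC": ExtraTreeClassifier(random_state=42),  # single extra tree; the ensemble variant is ExtraTreesClassifier
    "GNB": GaussianNB(),
    "MLPC": MLPClassifier(max_iter=1000, random_state=42),
    "SVM": SVC(probability=True, random_state=42),
}
```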

6 Results and discussion

To evaluate the efficacy of the proposed technique, a number of experiments on heart failure patients' survival prediction were conducted. In order to construct a generic machine learning model predicting heart failure or mortality, we split our cohort into two sets: 70% for training and 30% held out for testing. Initially, we trained the models on the complete set of features, followed by the significant subset of features. ROS, RUS, and SMOTE were employed to address the issue of class imbalance. All experiments were conducted in a Python environment using different libraries on a server with an Intel Silver 4210 CPU at 2.19 GHz and 128 GB RAM.

6.1 Evaluation metric

Accuracy, precision, recall, and the F1-score are the four fundamental metrics investigated in this study to quantify the differences between the machine learning-based algorithms. These metrics are described in Eqs. (3)–(6), in which \({{\text{K}}}_{{\text{TP}}}\) and \({{\text{K}}}_{{\text{TN}}}\) represent true positive and true negative predictions, whereas \({{\text{K}}}_{{\text{FP}}}\) and \({{\text{K}}}_{{\text{FN}}}\) represent false positives and false negatives, respectively.

$${\text{Precision}}= \frac{{{\text{K}}}_{{\text{TP}}}}{{{\text{K}}}_{{\text{TP}}}+{{\text{K}}}_{{\text{FP}}}}\times 100$$
(3)
$${\text{Recall}}=\frac{{{\text{K}}}_{{\text{TP}}}}{{{\text{K}}}_{{\text{TP}}}+{{\text{K}}}_{{\text{FN}}}}\times 100$$
(4)
$${\text{F}}1-{\text{score}}=2\times \frac{{\text{Precision}}\times {\text{recall}}}{{\text{Precision}}+{\text{recall}}}$$
(5)
$${\text{Accuracy}}=\frac{({{\text{K}}}_{{\text{TP}}}+{{\text{K}}}_{{\text{TN}}})}{{{\text{K}}}_{{\text{TP}}}+{{\text{K}}}_{{\text{FN}}}+{{\text{K}}}_{{\text{FP}}}+{{\text{K}}}_{{\text{TN}}}}$$
(6)
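These metrics are available directly in scikit-learn; a short self-contained example with toy labels:

```python
# Computing the metrics of Eqs. (3)-(6) with scikit-learn (toy labels for illustration only).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = death event, 0 = survival (illustrative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
```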

6.2 Feature correlation

After deleting rows and columns with more than 5% and 10% missing data, respectively, the dataset consists of 151 instances, 11 remaining NaN values, and 33 features, including patient ID, age, gender, and other important features, as indicated in Fig. 5, which depicts the dataset's feature correlation matrix. Each cell represents the Pearson correlation coefficient (Eq. 7) between two variables. The values range from +1, indicating a perfect direct (positive) relationship, to -1, indicating a perfect inverse (negative) relationship, while a correlation of 0 indicates no linear relationship between the movements of the two variables.

$$\mathrm{Pearson\;correlation\;coefficient}=\frac{\sum ({{\text{V}}}_{{\text{i}}}-\overline{{\text{V}} })({{\text{Y}}}_{{\text{i}}}-\overline{{\text{Y}} })}{\sqrt{\sum {({{\text{V}}}_{{\text{i}}}-\overline{{\text{V}} })}^{2}\sum {({{\text{Y}}}_{{\text{i}}}-\overline{{\text{Y}} })}^{2}}}$$
(7)

where \({V}_{i}\) and \({Y}_{{\text{i}}}\) represent the values of the variables V and Y in a sample, and \(\overline{V }\) and \(\overline{Y }\) represent the means of the V and Y values. High correlations were observed between the features representing the best and worst values of the different feature sets. For instance, stair (F2) is highly correlated with dressing yourself (F30), showering/bathing (F31), walking 1 block on level ground (F32), and hurrying or jogging (F33). SGOT (F26) and SGPT (F27) are highly correlated with each other, whereas BNP (F1) shows a poor correlation with total cholesterol (F22), dressing yourself (F30), showering/bathing (F31), walking 1 block on level ground (F32), and hurrying or jogging (F33). The 11 missing values are predicted by a regression model that uses the three most strongly correlated features as input and the feature with the missing value as output.
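A sketch of this correlation-guided imputation step; the text does not specify the regression model, so ordinary linear regression is assumed here, and the function name is illustrative:

```python
# Sketch of imputing a missing value from the three most correlated features
# (linear regression is an assumption; the paper does not name the regression model).
import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_from_top_correlated(df: pd.DataFrame, target_col: str, k: int = 3) -> pd.DataFrame:
    corr = df.corr(numeric_only=True)[target_col].abs().drop(target_col)
    predictors = corr.nlargest(k).index.tolist()             # top-k correlated features
    known = df.dropna(subset=[target_col] + predictors)      # rows where the target is observed
    missing = df[df[target_col].isna()].dropna(subset=predictors)
    if not missing.empty:
        model = LinearRegression().fit(known[predictors], known[target_col])
        df.loc[missing.index, target_col] = model.predict(missing[predictors])
    return df
```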

6.3 Ablation study on the impact of significant features

RFE has been implemented over the 10 classifiers with four different performance criteria for calculating permutation feature importance (PFI), namely balanced accuracy, precision, recall, and F1-score. The performance varies greatly among the classifiers because of their different underlying algorithms. As given in Table 2, every classifier attains a best precision value greater than 0.95, each with a different number of selected features. The highest precision of 0.98 is given by KNC using 4 features, namely F1, F2, F31, and F3, with ROS and MMS preprocessing and recall as the PFI performance criterion. GBC also shows its best precision of 0.97 with the same preprocessing, the F1-score as the PFI performance criterion, and 4 features, namely F1, F2, F25, and F3. The best recall and best F1-score of each classifier are both greater than 0.8, with the highest recall of 0.93 for RFC with 11 features and the highest F1-score of 0.95 for KNC with 4 features. Although other classifiers perform better at their best, GNB yields the best performance when averaged over all feature subsets and preprocessing techniques. The best performance of GNB is achieved with RUS and no scaling, with a precision, recall, and F1-score of 0.90 each using 7 features.

Table 2 Classification results of machine learning models using a selected number of features
Fig. 6

Performance analysis of recursive feature elimination: (a) F1-score, (b) precision, and (c) recall for the ROS and MMS preprocessing steps

The RFE results for ROS and MMS are shown in Fig. 6, and a peak at 4 features can be noted for KNC in precision, recall, and F1-score. From Table 2 and Fig. 6, it can be concluded that the performance of the different classifiers varies considerably with the number of features, the preprocessing technique, and the PFI performance criterion. At the same time, the feature ranking also varies greatly. To identify the features affecting the patient's condition, the average of all feature rankings is used. Figure 7 shows a violin plot of the ranking of all features across all classifiers and preprocessing techniques. A visible gap can be noted at an average feature rank of 10, with 5 features having an average rank below 10. Finally, gender (F4), diabetes mellitus (F5), Na (F12), HR (F14), and systolic blood pressure (F15) are highlighted as significant features by RFE.
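The rank-averaging step can be sketched as follows; the rank values in this example are hypothetical and serve only to illustrate the computation, not to reproduce the study's results:

```python
# Averaging feature ranks across classifiers (toy example; ranks are illustrative, not the study's values).
import pandas as pd

ranks = pd.DataFrame({
    "KNC": {"F4": 2, "F5": 4, "F12": 1, "F14": 3, "F15": 5, "F22": 12},
    "RFC": {"F4": 3, "F5": 2, "F12": 4, "F14": 1, "F15": 6, "F22": 15},
    "GNB": {"F4": 1, "F5": 5, "F12": 2, "F14": 4, "F15": 3, "F22": 11},
})

avg_rank = ranks.mean(axis=1).sort_values()          # average rank per feature across classifiers
selected = avg_rank[avg_rank < 10].index.tolist()    # threshold of 10, as in Fig. 7
print(avg_rank, selected, sep="\n")
```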

Fig. 7

Variation in feature rankings based on all RFE results. The red dot shows the median rank, and the black line spans the 25th to 75th percentile of the rankings. The red dotted line marks the threshold rank that best separates the features with low average rank from those with high average rank

Fig. 8

Classifier performance over selected features

6.4 Heart failure prediction with selected features

The selected features from the averaged RFE ranking are further used to classify the mortality of the patients. Figure 8 shows the precision, recall, and F1-score of the 10 selected classifiers using 5 features with different scaling and sampling techniques. From Fig. 8(a), it is evident that ABC with RUS and MMS yields the best performance among all classifiers (recall of 0.763, precision of 0.718, and F1-score of 0.649), compared with RFC with RUS and MMS (recall of 0.744, precision of 0.692, and F1-score of 0.675) and GBC (recall of 0.6203, precision of 0.595, and F1-score of 0.5606). For RUS with SS, ABC (recall of 0.763, precision of 0.7187, and F1-score of 0.6491) outperforms other classifiers such as RFC (recall of 0.7180, precision of 0.6726, and F1-score of 0.6405) and GNB (recall of 0.6917, precision of 0.6545, and F1-score of 0.6060). From Fig. 8(c), it is seen that ABC outperforms the other classifiers in the absence of standardization. Classification performance over the selected features using different sampling and scaling techniques with the ten classifiers is discussed in Fig. S1 of the supplementary section. The significance of the selected features is as follows:

  • Gender: the incidence of heart failure is much higher in males than in females.

  • Diabetes: over the last two decades, the prevalence of heart failure in patients with diabetes has been observed to be very high, and the prognosis is worse [49].

  • Sodium: patients with heart failure are restricted in sodium intake, as excess sodium may lead to comorbidities such as hypertension, chronic kidney disease, stroke, and cardiovascular diseases.

  • Heart rate: a raised heart rate has been postulated as a high risk factor for heart failure [50].

  • Systolic blood pressure: strongly associated with heart failure; its elevation was found to be a major risk factor in middle-aged men [51].

6.5 Prediction results of web application

Self-management mHealth applications are primarily developed to support medical diagnosis and treatment strategies by building prediction models with machine learning techniques. As a result of this trend, the demand for mHealth applications in the market is also increasing, ranging from fitness apps to medical apps for managing chronic conditions such as heart failure [52], diabetes [53], hypertension [54], breast cancer [55], blood pressure monitoring [56], and lung diseases [57]. The proposed framework allows the mHealth application to deliver solutions for improved decision-making based on the selected features. As a result, cardiovascular disease diagnosis can be improved, the efficiency of cardiovascular disease preventative measures can be increased, and better cardiovascular disease self-management can be promoted. Figure 9 illustrates that the user is required to provide some information in order to get the desired output. When data is submitted via the mobile app, a request is sent to the Flask server, where the explainable machine learning model calculates the predicted result. The result is then returned by the Flask server to the mobile app, which displays it to the user. Comparing the best model obtained with the suggested feature elimination approach under the various scaling and imbalance-handling strategies, together with the comparison of performance metrics, allows for an accurate assessment of the classification model's efficacy.
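A minimal sketch of such a Flask prediction endpoint is shown below; the route, field names, model file, and use of joblib are assumptions made for illustration rather than the deployed application's actual code:

```python
# Minimal Flask prediction endpoint (illustrative sketch; route, field, and file names are assumptions).
import joblib
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("heart_failure_model.joblib")   # hypothetical serialized model trained on the 5 features

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    try:
        # Order must match the training features: gender, diabetes, sodium, heart rate, systolic BP.
        x = np.array([[float(data["gender"]), float(data["diabetes"]), float(data["sodium"]),
                       float(data["heart_rate"]), float(data["systolic_bp"])]])
    except (KeyError, ValueError):
        return jsonify({"error": "invalid or missing input field"}), 400   # mirrors the app's validation screen
    prob = float(model.predict_proba(x)[0, 1])
    return jsonify({"heart_failure_risk": prob})

if __name__ == "__main__":
    app.run()
```

A client (the web or mobile front end) would POST a JSON object with the five fields and receive the predicted risk in the response.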

Fig. 9

Web application for heart failure prediction: (a) user input validation, (b) error screen, and (c) prediction results

The app development has been strongly influenced by the selected factors: gender, diabetes, sodium, heart rate, and systolic blood pressure. Users see a screen similar to Fig. 9(a) when the app is launched. The application verifies that the input provided for each field is valid; a cautionary notice, shown in Fig. 9(b), appears on the screen if the user enters an incorrect value for any of the parameters. If the user enters valid data, the app estimates the probability of heart failure, as shown in Fig. 9(c). By downloading and using this application on their mobile devices or desktop PCs, users of any device may obtain a heart failure risk assessment.

7 Conclusions

Despite recent advances in medical technology, the number of deaths associated with heart failure hospitalization is continually rising, which may be related to woefully inadequate diagnostic tools. In order to reduce the risk of developing heart failure, there is a need for a system that can either produce rules or categorize the data via the use of machine learning strategies. Unlike conventional ML approaches that assume an identical risk contribution for all attributes of heart failure, the proposed method considers each attribute individually and can identify discrete subsets with high discriminating capability. We tested the proposed technique using the AIIMS clinical dataset, which contains 151 samples from possible heart failure patients, and found that the proposed algorithm outperformed the state-of-the-art methods in terms of accuracy, precision, recall, and F1-score. A comparison with different ML techniques, including DTC, ABC, RFC, GPC, GBC, KNC, ETC, GNB, MLPC, and SVM, has been presented. Further, ROS, RUS, and SMOTE were applied to deal with the class imbalance problem, and RFE was employed for feature ranking. According to RFE, the most significant features are gender, diabetes mellitus, Na, HR, and systolic blood pressure. This study has the potential to advance medical practice and give clinicians a new resource for gauging heart failure patients' likelihood of survival. The performances of the machine learning models have been compared on the full set of features and on the selected features from the heart failure clinical records dataset. ROS significantly improved the performance of KNC in predicting heart patient survival; ROS with MMS showed the highest results across all evaluation measures, achieving a precision of 0.9852, a recall of 0.9167, and an F1-score of 0.9470. Any user may provide clinical data to be analyzed by the web-based application, which can determine whether or not heart failure is present. In the future, this work might be extended to the classification of heart failure using an explainable machine learning technique, with the goal of improving the model's accuracy, transparency, and the interpretability of its outcomes.