1 Introduction

The world has witnessed a huge loss of human lives due to heart diseases, also known as cardiovascular diseases. The World Health Organization (WHO) reported 17.9 million deaths caused by cardiovascular diseases in the year 2019, estimated to be 32% of all deaths for that year [11]. India is a major contributor to this tally [12]. The situation persists for several reasons. Therefore, monitoring the heart condition at regular intervals and detecting problems at an early stage is the need of the hour to control life-threatening situations due to heart failure. Advanced technologies like Artificial Intelligence (AI) and Machine Learning (ML) have joined hands with the healthcare system to provide solutions for proper monitoring and diagnosis of heart disease. Towards this, most existing applications use supervised learning approaches such as Decision Tree, Artificial Neural Network, and Naïve Bayes classifiers for the prediction of heart disease. Unlike existing proposals, our proposal works in three steps: (i) we propose distribution preserving train-test splits, called the Distribution preserving Hold-out (DPH) method and the Distribution preserving K-fold cross validation (DPK) method; (ii) we apply individual classifiers on these train-test splits; (iii) we select the best three individual classifiers and build an ensemble out of them. Four evaluation metrics, Accuracy, Precision, Recall, and F1-score, are used to measure the performance of the classifiers. The experimental results show that among all individual classifiers, Naïve Bayes performs the best according to Accuracy, Precision, Recall, and F1-score for both the DPH and DPK methods. The best three classifiers, Artificial Neural Network Classifier (ANN-C) [15], Logistic Regression Classifier (LR-C) [18], and Naïve Bayes Classifier (NB-C) [18], are ensembled into a classifier called ALN-C, which is used for heart disease prediction. The results show that ALN-C performs better than AdaBoost [2] and Random Forest [18] according to Accuracy and F1-score.

The major contributions of this paper are:

  1. We propose the Distribution preserving Hold-out (DPH) method and the Distribution preserving K-fold cross validation (DPK) method for preserving the distribution of classification labels of the overall dataset in the training and testing datasets.

  2. An ensemble of Artificial Neural Network Classifier (ANN-C), Logistic Regression Classifier (LR-C), and Naïve Bayes Classifier (NB-C) is built and used for heart disease prediction.

The remainder of this paper is organized as follows. Section 2 discusses the related work. The dataset preparation is described in Sect. 3. Section 4 covers the methodology. The results and discussions are presented in Sect. 5. Finally, Sect. 6 presents the conclusion and future scope of research.

2 Related work

Recently, several works have been reported on the applications of machine learning methods in biological systems [17, 20, 23]. In this direction, disease analysis and prediction are predominant research concerns that need a multidisciplinary treatment to address the current challenges. Machine learning has long served such multidisciplinary research due to its data-centric approach, and heart disease prediction is a vital area of exploration within it. Researchers have investigated various machine learning methods for heart disease prediction, with the objective of identifying heart disease at an early stage so that treatment can be provided beforehand to avoid mortality. Dwivedi [5] evaluated six machine learning techniques for predicting heart disease and reported logistic regression, with an accuracy of 85%, as the best among the classifiers under consideration. In [21], SVM is shown to be the best with an accuracy of 83%. Ghumbre and Ghatol [8] considered an Indian heart disease dataset and also found SVM to be the best. Likewise, in [6] logistic regression is shown to be the best, and in [13, 19] KNN is reported to give the best result. Some methods take a feature reduction-based approach to applying machine learning techniques. Sahu et al. [18] discussed an early prediction strategy for heart disease using a machine learning approach supported by principal component analysis for feature reduction. Kannan and Vasanthi [14] used the receiver operating characteristic curve to predict heart disease. A dynamic n-gram based feature optimization is used in [1] to reduce false alarms in heart disease prediction. Similarly, [10] discusses a diagnostic system for heart disease prediction based on machine learning approaches, comparing classifiers on the full attribute set and on a reduced set of attributes. Furthermore, researchers have applied different ensemble methods under bagging and boosting for heart disease prediction. Ghosh et al. [7] applied bagging and boosting based classifiers on a combination of five datasets and found the Random Forest based bagging method to be the best among all bagging and boosting methods under consideration. According to [16], Random Forest is likewise reported as the best. Many more results in the same direction have been reported in the literature [3, 22]; in this section, we have highlighted only a few vital contributions.

3 Dataset preparation

We consider the heart disease dataset collected from the UCI machine learning repository [4]. The dataset is created by combining 5 datasets (Cleveland, Statlog, Hungary, Switzerland, and Long Beach) and contains 1190 records. From the whole attribute set, we consider 11 independent attributes (age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise induced angina, oldpeak, slope) and one dependent attribute (Target) containing labels 1 and 0, denoting a person suffering from heart disease and not suffering from heart disease, respectively. The whole dataset contains 629 records with label 1 and 561 records with label 0.
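As a quick sanity check, the dataset can be loaded and its label distribution verified; a minimal sketch follows, in which the file name heart_disease_combined.csv and the column name target are hypothetical and should be adjusted to the actual CSV layout.

```python
# Minimal sketch: load the combined heart disease dataset and verify the
# record count and class distribution reported above.
import pandas as pd

df = pd.read_csv("heart_disease_combined.csv")  # hypothetical file name

X = df.drop(columns=["target"])  # the 11 independent attributes
y = df["target"]                 # 1 = heart disease, 0 = no heart disease

print(len(df))                   # expected: 1190 records
print(y.value_counts())          # expected: 629 with label 1, 561 with label 0
```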

4 Methodology

In this section, we propose a methodology to prepare an ensemble classifier by combining the decisions of more than one classifier. The ensemble classifier works on top of the proposed stratified sampling-based distribution preserving train-test split.

4.1 Stratified sampling-based distribution preserving train-test split

We create two strata for the binary classification: all samples with class label 0 are kept in one group, and all samples with class label 1 in another. We then compute the fractions |D0|/|D| and |D1|/|D|, where |D0| is the number of tuples of the dataset D with label 0, |D1| is the number of tuples with label 1, and |D| is the total number of tuples in D. The following methods are then employed to provide the train-test split.

4.1.1 Distribution preserving hold-out method (DPH)

  (a) Split the whole dataset D into two groups: train and test data samples. Also, decide the split percentages of both, say p% and q%, where p + q = 100. Let P denote p% of the dataset size |D|, and Q denote q% of |D|.

  (b) To maintain the distribution of class labels of the dataset D in the train and test datasets, randomly select P*|D0|/|D| samples from stratum 0 and P*|D1|/|D| samples from stratum 1 to populate the train dataset. Likewise, select Q*|D0|/|D| samples from stratum 0 and Q*|D1|/|D| samples from stratum 1 to create the test dataset. A minimal code sketch follows this list.
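A minimal sketch of the DPH split, assuming p = 80 and q = 20; scikit-learn's train_test_split with the stratify option draws exactly the per-stratum counts of step (b), so the label distribution of D is preserved in both splits. The variables X and y are assumed to hold the attributes and Target labels from Sect. 3.

```python
# Minimal sketch of DPH: a stratified hold-out split with p = 80, q = 20.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,      # q% of |D|
    stratify=y,          # one stratum per class label (0 and 1)
    random_state=42,     # for reproducibility
)

# Both splits should now show roughly the original 53% / 47% label ratio.
print(y_train.mean(), y_test.mean())
```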

4.1.2 Distribution preserving K-fold cross validation (DPK)

The whole dataset D is divided into k folds, say D1, D2, ..., Dk, such that each fold contains |D|/k samples, of which |D|/k * |D0|/|D| come from stratum 0 and |D|/k * |D1|/|D| from stratum 1. Then, for each iteration i of the total k iterations, fold Di is used as the test dataset and the combination of the remaining folds as the training dataset. The average over the k iterations is used for performance evaluation.
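A minimal sketch of DPK, assuming k = 10; scikit-learn's StratifiedKFold realizes exactly the per-fold distribution preservation described above, with NB-C standing in as an example classifier and X, y as in Sect. 3.

```python
# Minimal sketch of DPK: stratified k-fold cross validation with k = 10.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in skf.split(X, y):
    clf = GaussianNB()                       # NB-C as the example classifier
    clf.fit(X.iloc[train_idx], y.iloc[train_idx])
    y_pred = clf.predict(X.iloc[test_idx])
    scores.append(accuracy_score(y.iloc[test_idx], y_pred))

print(np.mean(scores))                       # average over the k iterations
```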

4.2 Voting-based classification

We employ a voting-based classifier that combines the prediction results of n different classifiers to predict the classification label. The ensemble takes the majority among the decisions of the n classifiers. As we consider a binary classification problem, n is chosen to be an odd number so that a consensus decision is reached without a tie.

The following steps are adopted for prediction using the voting-based classifier (a minimal code sketch follows the list):

  1. Train the n classifiers Classifier-1, Classifier-2, ..., Classifier-n independently on the training data samples generated by the Distribution preserving Hold-out (DPH) method or the Distribution preserving K-fold cross validation (DPK) method.

  2. For each testing sample X, collect the outputs of Classifier-1, Classifier-2, ..., Classifier-n and use the majority of their outputs as the classification label of X.

  3. Evaluate the performance using metrics such as Accuracy, Precision, Recall, and F1-score.
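A minimal sketch of this voting scheme, instantiated with the three classifiers used later for ALN-C (Sect. 5.2) and assuming scikit-learn's VotingClassifier with hard (majority) voting; the ANN hyperparameters follow Sect. 5.1, and X_train, y_train come from the DPH sketch above.

```python
# Minimal sketch of the majority-vote ensemble with n = 3 classifiers.
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

aln_c = VotingClassifier(
    estimators=[
        ("ann", MLPClassifier(hidden_layer_sizes=(20,) * 15,   # 15 layers x 20 neurons
                              activation="logistic",           # sigmoid activation
                              max_iter=1000)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
    ],
    voting="hard",        # majority vote; n = 3 (odd) avoids ties in binary labels
)

aln_c.fit(X_train, y_train)
y_pred = aln_c.predict(X_test)
```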

The proposed methodology follows the flow shown in Fig. 1.

Fig. 1 Flowchart of the proposed methodology

5 Results and discussions

The proposed and existing algorithms are implemented in a Python environment using the scikit-learn library. The dataset used for this experiment is the heart disease dataset described in Sect. 3, and the classification problem considered here is a binary classification problem. First, we evaluate all the individual classifiers under the Distribution preserving Hold-out (DPH) method and the Distribution preserving K-fold cross validation (DPK) method described in Sect. 4. Second, the proposed ensemble method is compared with existing ensemble methods. We use four evaluation metrics, Accuracy, Precision, Recall, and F1-score [9], as defined below for comparing the performance of the classification algorithms.

$$ Accuracy = \frac{TP + TN}{P + N} $$
(1)
$$ Precision = \frac{TP}{TP + FP} $$
(2)
$$ Recall = \frac{TP}{P} $$
(3)
$$ F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall} $$
(4)

where TP = True Positives, TN = True Negatives, FP = False Positives, P = total Positives, and N = total Negatives.
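For reference, the four metrics of Eqs. (1)-(4) can be computed directly with scikit-learn; y_test and y_pred below are assumed to come from any of the train-test sketches above.

```python
# Minimal sketch: Eqs. (1)-(4) computed from true labels and predictions.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
```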

5.1 Performance of individual classifiers under DPH and DPK methods

We evaluate the performance of the individual classifiers: Logistic Regression Classifier (LR-C) [18], Naïve Bayes Classifier (NB-C) [18], Support Vector Machine Classifier (SVM-C) [18], K Nearest Neighbor Classifier (KNN-C) [18] with K = 7 (the best among several trial values), and Artificial Neural Network Classifier (ANN-C) [15] with 15 hidden layers of 20 neurons each (likewise the best among several trial values), using a sigmoid activation function and backpropagation learning.
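A minimal sketch of the five classifiers with the hyperparameters stated above; any setting not listed is left at its scikit-learn default, which is an assumption on our part.

```python
# Minimal sketch: the five individual classifiers under evaluation.
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

classifiers = {
    "LR-C": LogisticRegression(max_iter=1000),
    "NB-C": GaussianNB(),
    "SVM-C": SVC(),
    "KNN-C": KNeighborsClassifier(n_neighbors=7),           # K = 7
    "ANN-C": MLPClassifier(hidden_layer_sizes=(20,) * 15,   # 15 layers x 20 neurons
                           activation="logistic",           # sigmoid activation
                           max_iter=1000),
}
```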

The results of the performance evaluation according to the four metrics, Accuracy, Precision, Recall, and F1-score, for both training and testing samples under the Distribution preserving Hold-out (DPH) method are shown in Tables 1 and 2. For comparison among the algorithms we consider Accuracy and F1-score, the latter of which combines both Precision and Recall. Our results differ from those reported in the existing papers [5, 19] because the classifiers are applied under the distribution preservation consideration (the train and test samples each contain 53% samples with label 1 and 47% samples with label 0). It is clear from Figs. 2 and 3 that NB-C performs the best among all classifiers for both training and testing samples. Likewise, Tables 3 and 4 list the values of the performance metrics for the 5 classifiers under the Distribution preserving K-fold cross validation (DPK) method. Figures 4 and 5 show the Accuracy and F1-score of the 5 classifiers on training and testing samples, respectively, under the DPK method; here, LR-C performs the best among all classifiers under consideration. Combining the results of both the DPH and DPK methods, we can say that NB-C is the best overall.

Table 1 Training performance (individual classifier) using DPH method
Table 2 Testing performance (individual classifier) using DPH method
Fig. 2 Accuracy and F1-score of individual classifiers under DPH method (training performance)

Fig. 3 Accuracy and F1-score of individual classifiers under DPH method (testing performance)

Table 3 Training performance (individual classifier) using DPK method
Table 4 Testing performance (individual classifier) using DPK method
Fig. 4 Accuracy and F1-score of individual classifiers under DPK method (training performance)

Fig. 5 Accuracy and F1-score of individual classifiers under DPK method (testing performance)

5.2 Performance of ensemble classifiers under DPH and DPK methods

We consider an ensemble of the three best individual classifiers as evaluated in Sect. 5.1: Artificial Neural Network Classifier (ANN-C), Logistic Regression Classifier (LR-C), and Naïve Bayes Classifier (NB-C).

5.2.1 ANN-C

We consider a multilayer ANN classifier: a feed-forward network with m hidden layers of n neurons each, using the sigmoid activation function and trained with the backpropagation learning algorithm (in our experiments, m = 15 and n = 20, as in Sect. 5.1).

5.2.2 LR-C

The Logistic Regression based classifier (LR-C) classifies samples into two groups, 0 and 1, using the logistic function $\sigma(z) = 1/(1 + e^{-z})$. Although regression models are generally used for prediction of continuous values, logistic regression is useful in classification because the continuous output of the logistic function is thresholded to 0 or 1.

5.2.3 NB-C

The Naive Bayes classifier (NB-C) is a probability-based classifier built on Bayes' theorem, which treats the features as conditionally independent of one another given the class. As our features are taken from continuous domains, they are assumed to follow a Gaussian probability distribution.
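Under this assumption, the class-conditional likelihood of a continuous feature $x_i$ given class $y$ takes the standard Gaussian form (a textbook result we state here for completeness):

$$ P(x_i \mid y) = \frac{1}{\sqrt{2\pi \sigma_y^2}} \exp\left( -\frac{(x_i - \mu_y)^2}{2\sigma_y^2} \right) $$

where $\mu_y$ and $\sigma_y^2$ are the mean and variance of feature $x_i$ estimated from the training samples of class $y$.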

The performance of ALN-C is measured under both the DPH and DPK methods. The comparison with existing ensemble techniques, namely AdaBoost (boosting) [2] and Random Forest (bagging) [18], for both training and testing datasets under the DPH method is shown in Figs. 6 and 7, respectively. We consider Accuracy and F1-score for the comparison. Likewise, Figs. 8 and 9 show the performance of the ensemble methods under the DPK method for training and testing samples. From these observations, it is clear that ALN-C performs better than AdaBoost and Random Forest.
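A minimal sketch of this comparison under the DPH split, assuming default scikit-learn hyperparameters for the AdaBoost and Random Forest baselines; aln_c, X_train, y_train, X_test, and y_test refer to the earlier sketches.

```python
# Minimal sketch: compare ALN-C against the boosting and bagging baselines.
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

ensembles = {
    "ALN-C": aln_c,                             # voting ensemble (Sect. 4.2 sketch)
    "AdaBoost": AdaBoostClassifier(),           # boosting baseline [2]
    "Random Forest": RandomForestClassifier(),  # bagging baseline [18]
}

for name, model in ensembles.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name, accuracy_score(y_test, y_pred), f1_score(y_test, y_pred))
```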

Fig. 6 Accuracy and F1-score of ensemble classifiers under DPH method (training performance)

Fig. 7 Accuracy and F1-score of ensemble classifiers under DPH method (testing performance)

Fig. 8 Accuracy and F1-score of ensemble classifiers under DPK method (training performance)

Fig. 9 Accuracy and F1-score of ensemble classifiers under DPK method (testing performance)

6 Conclusion and future scope

This paper presents a machine learning based approach for heart disease prediction. We have proposed two methods, the Distribution preserving Hold-out (DPH) method and the Distribution preserving K-fold cross validation (DPK) method, for preserving the distribution of class labels in the training and testing datasets. We applied individual classifiers on these train-test splits and found Naïve Bayes to be the best among all classifiers under consideration for both the DPH and DPK methods. An ensemble of the Artificial Neural Network based classifier (ANN-C), Logistic Regression based classifier (LR-C), and Naïve Bayes classifier (NB-C), named ALN-C, was then prepared; compared with AdaBoost and Random Forest, ALN-C was found to be the best of the three. The evaluation metrics Accuracy, Precision, Recall, and F1-score were used for performance measurement. In the future, this work can be extended by applying different ensemble methods to the heart disease dataset to achieve better prediction results.