Introduction

Ensemble methods are classifiers in which a collection of base estimators is built and their individual outputs are combined, typically through a voting scheme, to produce the final classification [11, 14]. Bagging and boosting are the two major techniques used to obtain an ensemble [11]. Two well-known examples of ensemble methods are Random Forests (RF) [7] and Gradient Boosting Trees. The first is an estimator that fuses Decision Trees built on subsets of the training dataset to control over-fitting [40]; the second is a greedy approximation of a tree collection [15]. Weak classifiers are the base estimators of ensemble methods, and Decision Trees are widely used in this role. A Decision Tree is an ML model that establishes an induction [33] machine through a set of human-interpretable rules.
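For concreteness, a bagging ensemble of Decision Trees can be assembled in a few lines of scikit-learn; the following is a minimal sketch with synthetic stand-in data, and all values shown are illustrative rather than the settings used in this study:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for a real medical dataset (shapes are arbitrary)
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: each tree is fit on a bootstrap sample of the training set,
# and the ensemble predicts by majority vote over the trees
ensemble = BaggingClassifier(DecisionTreeClassifier(),
                             n_estimators=50, random_state=0)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```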

Statlog and Spectf are two well-known datasets used for heart disease prediction [24]. Statlog has 13 features obtained from medical measurements, whereas Spectf has 44 features extracted from tomography images. Accordingly, in this study, two novel classification algorithms are proposed, one for each of these different datasets.

We follow the experimental setup of [24], so it is appropriate to introduce their Chaos Firefly Attribute Reduction and Fuzzy Logic (CAFL) method here: CAFL relies on attribute reduction using Rough Sets [31] and Chaos Firefly optimization [13], after which a type-2 Fuzzy Logic system performs the classification.

Related Works

Heart disease prediction is a field where ensemble methods have been successfully applied [4, 5]. In [24], on the other hand, a Fuzzy Logic approach [23] is evaluated together with rough set [17] feature reduction.

Artificial Neural Networks (ANN) are extensively used in the literature; apart from Deep Learning architectures [37], classical Neural Network structures are also employed [10, 12, 19], and there are hybrid methods such as [25, 43]. Deep Learning architectures have been used to improve the diagnosis of Chronic Kidney Disease and Lung Cancer, respectively, within online clinical decision support systems [20, 21]. [39] used ANN together with Principal Component Analysis to select features before Breast Cancer classification. In [38], PCA and Linear Discriminant Analysis (LDA) are applied to select features, and an ANN classifies the resulting Breast Cancer data.

Support Vector Machines (SVM) are based on structural risk minimization: linearly non-separable data are implicitly mapped to a higher-dimensional space to obtain separability [9]. SVMs have been applied to a wide variety of problems [27]. In the context of heart disease prediction, the SVM is mostly used as a helper method to select features [3] or as a component of ensembles [28, 34]. [42] integrated fractal image analysis with SVM to classify breast cancer.

Naive Bayes (NB) classifiers assume that features are independent [36] and choose the class that maximizes the overall probability. [41] proposed a decision support system using Naive Bayes, [26] conducted experiments on the Cleveland dataset, and [30] developed a web-based application upon Naive Bayes categorization.

We claim that ensemble methods are still valuable in the domain of heart disease prediction and that they outperform the firefly algorithm of [24].

Methodology

Two classification algorithms are proposed for heart disease prediction, one for the image dataset and one for the medical measurement dataset. The first proposed base classifier, the Reference Vector Classifier (RVC), is based on quantifying the randomness of class label sequences with respect to a subset of vectors of the image data. Given a vector \(\mathbf{u}\), namely an observation from the training set, we examine whether the class labels of the other observations form a 'regular' sequence when sorted by their distances to \(\mathbf{u}\). The core idea is that the less random the corresponding label sequence is, the more valuable \(\mathbf{u}\) is for classification. This introduces a wide range of alternatives through the selection of randomness tests [2]: a dataset or a dataset domain may be captured more effectively by a specific randomness test.

We first present our randomness analysis classifier. For each observation \(\mathbf{u}\), we form the binary sequence associated with \(\mathbf{u}\): the class labels of the other observations, sorted by their distances to \(\mathbf{u}\). A randomness score is then computed for this sequence; the more random the sequence, the less important the vector. This process is applied to every vector in the training set. The label sequences and distances are stored as a matrix, which is truncated to form a decision function (Figs. 1, 2, 3).
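As a sketch of this construction (the function and variable names are ours, not taken from the original listings), the label sequence of a single training observation \(\mathbf{u}\) can be computed as follows:

```python
import numpy as np

def label_sequence(u, X, y):
    """Return the class labels of all other observations, sorted by
    their Euclidean distance to u, together with those distances."""
    d = np.linalg.norm(X - u, axis=1)   # distance of every row of X to u
    order = np.argsort(d)               # nearest observations first
    order = order[d[order] > 0]         # drop u itself (assumes u occurs once in X)
    return y[order], d[order]
```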

Fig. 1 Class label sequences and distances before sorting

Fig. 2 Class label sequences and distances after sorting

Algorithm listing a

find_randomness refers to a function for calculating randomness; we use the exponential of the autocovariance function [29] for this calculation. The FIT() function returns the n most important observations, together with their label sequences and distances, based on their randomness values; n is a hyperparameter. For an observation x, prediction is performed by computing its distance to each reference vector and finding the closest ones. The labels of the closest reference vectors are collected, and the classification result is obtained by majority voting. This weak classifier is then plugged into a Bagging Classifier [6].
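The exact form of the score and of FIT() is given only in the listing; the following is a simplified sketch of one plausible reading, reusing label_sequence from the earlier sketch. In particular, find_randomness below (the exponential of the largest absolute autocovariance of the centered sequence) and the k-nearest voting rule are our assumptions, not the verbatim implementation:

```python
import numpy as np

def find_randomness(labels, max_lag=10):
    """One plausible reading of the exponential-of-autocovariance score:
    structured sequences have large |autocovariance| at some lag, so they
    score low; random sequences score close to 1 (assumption)."""
    s = labels - labels.mean()
    n = len(s)
    acov = [np.mean(s[:n - k] * s[k:]) for k in range(1, min(max_lag, n - 1) + 1)]
    return np.exp(-np.max(np.abs(acov)))

def fit(X, y, n):
    """FIT(): keep the n observations whose label sequences are least random
    (simplified: the full method also stores sequences and distances)."""
    scores = [find_randomness(label_sequence(u, X, y)[0]) for u in X]
    keep = np.argsort(scores)[:n]       # lowest randomness = most important
    return X[keep], y[keep]

def predict_one(x, refs, ref_labels, k=5):
    """Majority vote over the labels of the k closest reference vectors
    (assumes integer class labels)."""
    d = np.linalg.norm(refs - x, axis=1)
    nearest = ref_labels[np.argsort(d)[:k]]
    return np.bincount(nearest).argmax()
```

Wrapped in a small estimator class implementing the scikit-learn interface, these fit/predict steps can be passed directly to BaggingClassifier as the base estimator.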

The second classifier, the Shrunk Covariance Classifier (SCC), is developed for the medical parameter dataset (Statlog) and is derived almost directly from Graphical Lasso [16] and Ledoit–Wolf shrinkage estimation [22]: Glasso and Ledoit–Wolf inverse covariances are fitted, and prediction is done with respect to a combined Mahalanobis distance. To our knowledge, the Glasso and Ledoit–Wolf methods have not previously been applied in this context, that is, in combination with Bagging Classification for heart disease prediction.

Algorithm listing b
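As a minimal sketch of the idea (our reading of the listing; in particular, the combination of the two Mahalanobis distances is assumed here to be a simple sum):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso, LedoitWolf

class SCCSketch:
    """Per-class Glasso and Ledoit-Wolf precision matrices; predict the
    class with the smallest combined Mahalanobis distance (a sketch)."""

    def fit(self, X, y):
        self.models = {}
        for c in np.unique(y):
            Xc = X[y == c]
            self.models[c] = (Xc.mean(axis=0),
                              GraphicalLasso().fit(Xc).precision_,
                              LedoitWolf().fit(Xc).precision_)
        return self

    def predict(self, X):
        preds = []
        for x in X:
            best, best_d = None, np.inf
            for c, (mu, P1, P2) in self.models.items():
                diff = x - mu
                # Combined distance: sum of the two quadratic forms (assumption)
                d = diff @ P1 @ diff + diff @ P2 @ diff
                if d < best_d:
                    best, best_d = c, d
            preds.append(best)
        return np.array(preds)
```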

The RVC and SCC methods are depicted in detail in Figs. 4, 5 and 6.

Fig. 3 Overall algorithm

Experiments

Experiments are conducted on the Spectf and Statlog datasets. The first has 44 features extracted from Single Photon Emission Computed Tomography (SPECT) images. The second has 13 features: age, sex, chest pain type, resting blood pressure, serum cholesterol in mg/dl, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, ST depression induced by exercise relative to rest, the slope of the peak exercise ST segment, number of major vessels (0–3) colored by fluoroscopy, and defect type.

Experiments are conducted in the same way as in [24]. Feature reduction is applied to both datasets, resulting in 33 features for Spectf and 10 features for Statlog. One third of each dataset is kept for training and validation, and the remainder is used for testing. Cross-validation is applied to choose the optimal hyperparameters of the bagging classifier (number of estimators, maximum sample size, maximum features, etc.). The performance of the proposed algorithms is compared with that of NB, ANN and SVM, as reported in [24]. Min-max scaling [18] is applied before feature selection and classification. CIFE [8] and ANOVA-based selection are the reduction methods used for Spectf and Statlog, respectively.

The experiments are implemented in the Python programming language; sklearn [32] is used for Shrunk Covariance estimation, cross-validation and accuracy measurements, and Spyder [35] is the IDE in which all code is written.
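With these components, the Statlog arm of the setup can be sketched roughly as follows (grid values and the 5-fold split are illustrative, not the exact values used; CIFE for Spectf is not available in scikit-learn and would come from elsewhere):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV

# Min-max scaling, ANOVA-based selection down to 10 features, then bagging
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", BaggingClassifier()),   # base estimator: the proposed weak classifier
])

# Illustrative grid over the bagging hyperparameters mentioned above
grid = {
    "clf__n_estimators": [10, 30, 50],
    "clf__max_samples": [0.5, 0.8, 1.0],
    "clf__max_features": [0.5, 0.8, 1.0],
}
search = GridSearchCV(pipe, grid, cv=5)
# search.fit(X_trainval, y_trainval)  # the training-validation third (placeholders)
```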

Additionally, we conducted a software defect prediction experiment on the kc2 dataset, which has 21 features describing software characteristics such as lines of code and McCabe cyclomatic complexity [1]. In this experiment, we compared RVC with RF, SVM and NB. Here, ANOVA feature reduction (down to 10 features) after Robust Scaling is performed before classification.
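Under the same assumptions, the kc2 pipeline differs from the sketch above only in the scaler:

```python
from sklearn.preprocessing import RobustScaler

# Same illustrative pipeline, with a median/IQR-based scaler that is
# robust to the outliers common in code metrics
pipe_kc2 = pipe.set_params(scale=RobustScaler())
```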

Results

To test the performance of the algorithms, several experiments were conducted. We use four measures: accuracy, precision, recall and f-measure.
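All four measures are available in scikit-learn; in the snippet below, y_test and y_pred are placeholders for the test labels and the model's predictions:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# y_test, y_pred: test labels and model predictions (placeholders)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f-measure:", f1_score(y_test, y_pred))
```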

The results can be seen in Tables 1 and 2. Our methods outperform the classical algorithms and the state-of-the-art Chaos Firefly and Fuzzy Logic (CAFL) procedure.

Table 1 Accuracy comparison
Table 2 Performance of our methods

Discussion

The proposed RVC and SCC algorithms outperform CAFL with respect to the accuracy metric.

One major advantage of our method over CAFL is that the algorithm remains manageable under high dimensionality. Another advantage, on Statlog, is speed: attribute reduction in CAFL takes more than 5 minutes, whereas our entire cross-validation, that is, the whole parameter search and testing together with dimension reduction, takes only about 1 minute. One major disadvantage of our methods, however, concerns random states. The Bagging Classifier implementation has some dependency on random value generation, and we used the random state giving the maximum test accuracy score. This amounts to a form of 'peeking at the test data' [44]. As a future study, we plan to develop a robust variant of this algorithm.

A second disadvantage is that our classifiers are dataset dependent, whereas the CAFL method itself is successful on both Spectf and Statlog.

Our second analysis considers the Shrunk Covariance method, which is a direct application of covariance estimation to classification. It also suffers from the curse of dimensionality to some extent, but it is more straightforward, simpler, more interpretable and more accurate than CAFL. The Bagging Classifier random state problem, of course, arises here too. Nevertheless, the overall speed is again better than CAFL's, which is an advantage in terms of training time.

SVM captures non-linearity via the kernel trick. The results indicate that a more sophisticated kernel would be needed to derive an accurate classifier on Spectf and Statlog. RVC sidesteps this by basing the decision step on distances to specific ('important') observations.

Naive Bayes assumes that features are independent. Despite the shrunk nature of SCC, the results on Statlog show that variable interactions can be valuable in this case.

To sum up, the proposed algorithms are more accurate and efficient than the standard methods and CAFL, the state-of-the-art technique in the context of heart disease prediction.

Conclusion

In this work, we proposed two algorithms, RVC and SCC, for two important datasets, Spectf and Statlog, respectively. We have shown that randomness test-based importance detection is beneficial for classification, and that shrunk covariance estimators are potentially good sources of Mahalanobis distance measures. Two different feature reduction schemes are plugged into the framework to obtain better accuracy results.

Fig. 4 Base train algorithms

Fig. 5 RVC base predict algorithm

Fig. 6 SCC base predict algorithm

Future Work

Future work will focus on two aspects: first, a more robust variant that is independent of random states (with an average score higher than the state of the art) and, second, application to various datasets beyond heart disease.