1 Introduction

In every human body, blood vessels spanning an approximate length of 60,000 miles ensure the proper functioning of all its organs. The heart continuously pumps fresh blood into these vessels at a rate of about 4.7 liters per minute. The heart may therefore be called the engine of the human body, whose failure results in the collapse of the body. As per reports by the World Health Organization, cardiac diseases are among the main factors contributing to human deaths. Surveys conducted since 2012 concluded that of the almost 56 million people who lost their lives in 2012, 17.5 million deaths were caused by cardiovascular diseases [1], making them the single largest natural cause of death. Heart diseases can be cured or controlled if they are diagnosed in their early stages. This is possible through a series of medical tests conducted by specialized medical experts, coupled with the medical history of the subject. As per World Bank figures, the ratio of general physicians to population in developed countries is 3:1000, while in developing countries this ratio is 0.6:1000 [2]. Since there are fewer heart specialists than general physicians, the availability of the former is far lower than that of the latter. For subjects living in remote rural areas in particular, it becomes nearly impossible to reach a specialist.

Automated computer systems have remarkably improved the quality and standard of life in every aspect. These systems also assist medical professionals in accomplishing their jobs. Along with the development of automatic diagnosis systems for other diseases, research is in progress to diagnose heart diseases through computerized intelligent systems [3]. During the last two decades, considerable work has been presented toward achieving the desired results. In these systems, statistics of medical reports, medical history and the subject’s life habits are provided as inputs. The automated system, through its learning and decision functions, then predicts whether the subject is suffering from heart disease or not. Research on these systems has focused on two areas: (1) the selection of important features while discarding redundant ones and (2) the selection of an appropriate classifier. The selection of an appropriate feature subset not only reduces the computational complexity but also improves the classification results [4]. Features should be selected such that each can be considered a discriminating feature, and the combination of all the individual discriminating features must then produce a better classification. The role of the classifier is key to the success of prediction systems. Broadly, two types of machine learning techniques are used, i.e., supervised and unsupervised learning-based techniques. Supervised learning is quite successful when historical data along with class labels are available. Support vector machines (SVMs) are supervised learning techniques used to solve both linear and nonlinear problems. When radial basis function (RBF) kernels are used, SVMs become capable of solving linearly non-separable problems.

This paper presents feature selection techniques that select features on the basis of both their individual discriminating power and their collective contribution to the performance of the subset they are added to. Not only are features with high individual discriminating scores considered, but those with lower scores are taken into account as well if their presence improves the classification accuracy of the whole subset. The Matthews correlation coefficient (MCC) is used as the metric for selecting the feature subset: the higher the MCC score, the better the feature subset. An RBF kernel-based SVM classifier takes the reduced-dimensional feature subset as input and predicts whether the subject under consideration is a potential heart disease patient or a normal control subject. The rest of the paper is organized as follows: Sect. 2 briefly discusses the latest relevant literature; Sect. 3 presents the proposed research; Sect. 4 discusses the datasets and evaluation criteria used; Sect. 5 presents, discusses and analyzes the research results; and Sect. 6 concludes the paper.

2 Literature review

This section presents a review of the relevant research contributions published in the recent past. Long et al. [5] proposed a heart disease diagnostic system which used fuzzy-based classification. The authors used a feature subset obtained by reducing the original dimension of the feature set through rough sets and the chaos firefly algorithm. The technique was validated by testing it over the heart disease and SPECTF datasets. An extreme learning-based algorithm was used by Ismaeel et al. [6] for heart disease diagnostics; their approach achieved 80% accuracy over the Cleveland dataset. A minimum distance-based KNN classifier was used by Krishnaiah et al. [7] to diagnose heart disease. As an initial step, they fuzzified the medically measured data and then provided it to the KNN through fuzzy membership. An intelligent prediction system was proposed by Chitra and Seenivasagam [8] which used a feed-forward neural network and a cascaded correlation neural network for classification. Srinivas et al. [9] generated fuzzy rules through set theory, which were then used to perform fuzzy classification. The technique was validated by implementing it over the UCI datasets (i.e., Cleveland, Hungarian, SPECTF and Switzerland). The authors achieved the highest accuracy (80%) for the Switzerland dataset and the lowest accuracy (42%) for the Hungarian dataset. A hybrid approach was developed by Yang et al. [10] which combined an adaptive network-based fuzzy inference system (ANFIS) with linear discriminant analysis (LDA) and was used to assess the risk of coronary heart disease. The technique was validated on a dataset of the Korean National Health and Nutrition Examination Survey, achieving an accuracy of 80.2%. A hybrid modeling scheme for heart disease diagnosis was proposed by Yuehjen et al. [11]. It was a two-stage methodology in which a set of explanatory variables was first reduced through logistic regression (LR), multivariate adaptive regression splines (MARS) and rough sets (RS), followed by classification through an artificial neural network (ANN). An expert system for the diagnosis of coronary heart disease was presented by Muthukaruppan and Er [12] in which they employed particle swarm optimization (PSO) along with fuzzy classifiers. Fuzzy membership functions were tuned using the PSO. Afterward, they used the Cleveland and Hungarian heart disease datasets for validating their technique.

Anooj [13, 14] presented a clinical decision support system (CDSS) for heart disease diagnosis. The proposed CDSS had two stages: in the first stage, attribute selection and attribute weighting were performed using data mining techniques, while a fuzzy rule-based CDSS was proposed in the second stage. Comparative results against an ANN-based technique were presented over the UCI datasets. Shilaskar and Ghatol [15] used a reduced feature subset for the diagnosis of heart disease. They used distance measures as a metric to select features through forward inclusion, forward selection and backward elimination search techniques. To improve the classification accuracy, the authors proposed their own hybrid forward selection algorithm and achieved comparatively better results than the forward selection and backward elimination algorithms. Expert judgment was compared with automated feature selection techniques using different classifiers [16]. The authors showed that a combination of expert judgment with intelligent feature selection techniques improved the accuracy of the naive Bayes, IBK and SMO classifiers. ECG heart rate signals were used as input to predict cardiac health by Giri et al. [17]. Frequency sub-bands of the input signals were obtained using the discrete wavelet transform (DWT). Principal component analysis (PCA), independent component analysis (ICA) and linear discriminant analysis (LDA) were then employed over the DWT coefficients, and different tests were performed using the Gaussian mixture model (GMM), support vector machine (SVM), probabilistic neural network (PNN) and artificial neural network (ANN) classifiers. Zhao et al. [18] presented a feature selection framework which preserved similarity measures through a conventional combinatorial optimization formulation. In order to improve the efficiency and effectiveness of their methodology, they extended it with a sparse multiple-output regression formulation, which the authors concluded allowed them to achieve better results.

The feature selection problem was investigated by Hancer et al. [19], who proposed an artificial bee colony (ABC)-based algorithm. They extended the standard ABC algorithm by integrating it with an evolutionary-based similarity mechanism. A feature selection criterion was proposed by Zhang et al. [20] in which the \(l_{2,1}\) norm regularization was incorporated into the original Fisher criterion. The \(l_{2,1}\) norm regularization term ensures the sparsity of the feature selection matrix, turning feature selection into a globally optimized solution. A multiclass problem was solved by Li et al. [21], who proposed a cost-sensitive LDA classifier. Linear discriminant coefficients were used for estimating the posterior probabilities of the testing instances. The authors achieved a lower computational cost and higher accuracy than existing state-of-the-art methodologies. Markos et al. [22] proposed a four-stage fuzzy-based classifier. A generalized Fisher score was proposed by Gu et al. [23] which maximized the lower bound of the traditional Fisher score for selecting the feature subset. The resulting feature selection problem was reformulated as a quadratically constrained linear program (QCLP) and solved by a cutting plane algorithm. Information gain and divergence were used by Lee and Lee [24] for selecting feature subsets. Through this methodology, the authors reduced the redundancy among features while maintaining information gain.

Support vector machines (SVMs) are a family of supervised learning methodologies [25]. Akay [26] presented a technique to predict breast cancer in women. The authors utilized the classification ability of the SVM over a reduced-dimensional feature set and achieved 99.51% accuracy. Çomak et al. [27] used Doppler signals of the heart valves for predicting heart disease, with a least-squares support vector machine (LS-SVM) as the predictor. Yahiaoui et al. [28] used an SVM for the diagnosis of tuberculosis. Comparative results of the proposed technique with existing techniques were presented, and the proposed methodology reported better results. Huang et al. [29] used ensembles of SVM classifiers for predicting breast cancer. The authors showed that ensembles of SVMs produced better results than a single SVM, and that among the different kernels evaluated, the RBF kernel performed best.

3 Proposed methodology

The proposed technique selects a feature subset through a set of proposed feature selection algorithms. The selected subset is then used to train and test an RBF kernel-based SVM classifier (Fig. 1).

Prior to the feature selection process, the feature attribute ranges are standardized and the Fisher score of each feature is calculated. For subset selection, we present the mean Fisher score-based feature selection algorithm (MFSFSA), the forward feature selection algorithm (FFSA) and the reverse feature selection algorithm (RFSA). The feature subset selection algorithm (FSSA) then returns the most effective lower-dimensional feature subset.

Definition 1

The heart feature set \(S_\mathrm{{HP}}\) is the set of suspected heart patients, consisting of M tuples. Each tuple \(T_i\) has N features. \(S_\mathrm{{HP}}\) is represented by \(S_{HP} =\{T_{1},T_{2},\ldots ,T_{M}\}\), and tuple \(T_i\) is represented by \(T_{i}=\{f_{1},f_{2},\ldots ,f_{N}\}\) for every \( 1\le i \le M\).

We need to classify each \(T_{i} \in C\) where \(C=\{C_\mathrm{{HDP}},C_\mathrm{{NCS}}\}\) and \(C_\mathrm{{HDP}}\) and \(C_\mathrm{{NCS}}\) represent class of heart disease patient (HDP) and class of normal control subject (NCS), respectively.

3.1 Feature selection

Redundant and less class-discriminating features not only compromise the classification accuracy but also lead toward high computational complexity. To avoid their participation, a feature subset with higher class-discriminating capability must be selected. The required goal is achieved by performing the following steps:

3.1.1 Data standardization

The medical data of a subject consist of multiple attributes represented by different data types. Some of the attributes are of a binary nature, while others may be decimal or fractional numbers. The ranges of the attributes can therefore be quite diverse, resulting in a bias toward the selection of certain features.

Definition 2

Given a tuple T with features \(f_{i}, 1\le i\le n\), each \(f_i\) may have a different data format. Data standardization is the process of transforming these diverse data representations into a single format. The transformed format helps in comparing and classifying the data instances.

Fig. 1 Flow diagram of proposed technique

As a means of preprocessing, all the features are standardized using the following relation:

$$\begin{aligned} f'_{i}=\frac{f_{i}-\overline{f_{i}}}{{\sigma _{f_{i}}}} \end{aligned}$$
(1)

where \(f_{i}\) is the ith feature in its raw form, \(\overline{f_{i}}\) is the mean of the ith feature, while \(\sigma _{f_{i}}\) is the standard deviation of the ith feature. \(f'_{i}\) is the ith feature in a standardized form.
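As an illustration, a minimal NumPy sketch of this standardization step is given below (the array contents and shapes are illustrative assumptions, not values from the paper):

```python
import numpy as np

def standardize_features(X):
    """Column-wise z-score standardization of Eq. (1): every feature is
    shifted to zero mean and scaled to unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0              # guard against constant features
    return (X - mean) / std

# Example: five tuples with three raw attributes of very different ranges
X_raw = np.array([[63.0, 1.0, 233.0],
                  [41.0, 0.0, 180.0],
                  [57.0, 1.0, 294.0],
                  [49.0, 0.0, 204.0],
                  [66.0, 1.0, 256.0]])
X_std = standardize_features(X_raw)  # each column now has mean 0 and std 1
```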

3.1.2 Fisher score calculation

The feature selection methodology should favor the features with higher discriminating capabilities while the others should be suppressed.

Definition 3

Using a feature set \(F=\{x_{1},x_{2},\ldots ,x_{N}\}\), Fisher score is used to measure the class discriminant nature of each \(x_i\). Features with the higher Fisher scores are more discriminant than the features with lower ones.

The proposed technique uses the Fisher score to rank each feature, calculated using the following mathematical formulation:

$$\begin{aligned} \mathrm{SCF}(F_{i})=\frac{\sum ^{C}_{j=1}\eta _{j}(\mu _{i,j} -\mu _{i})^{2}}{\sum ^{C}_{j=1}\eta _{j}\sigma _{i,j}^2} \end{aligned}$$
(2)

where \(\eta _{j}\), \(\mu _{i,j}\), \(\mu _{i}\) and \(\sigma _{i,j}^2\) represent the number of tuples lying in the jth class, mean of the ith feature in the jth class, mean of the ith feature and variance of the ith feature in the jth class, respectively. C represents total number of classes.
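The following sketch implements Eq. (2) for an arbitrary number of classes (a plain NumPy illustration; the variable names are assumptions of this sketch):

```python
import numpy as np

def fisher_scores(X, y):
    """Fisher score of Eq. (2) for every feature (column) of X, where y
    holds the class labels; works for any number of classes."""
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    overall_mean = X.mean(axis=0)
    for c in np.unique(y):
        Xc = X[y == c]
        n_c = Xc.shape[0]
        num += n_c * (Xc.mean(axis=0) - overall_mean) ** 2
        den += n_c * Xc.var(axis=0)
    den[den == 0] = np.finfo(float).eps  # avoid division by zero
    return num / den

# scores = fisher_scores(X_std, y)   # higher score => more class-discriminant
```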

3.1.3 Feature subset selection

In order to select a minimal number of features used for coronary heart disease diagnosis, we present a set of feature selection algorithms. The final outcome of the proposed set of algorithms is a P-dimensional feature subset, i.e., \(M \times N\rightarrow M \times P\), where M, N and P represent the number of instances in the dataset, original feature dimension and the reduced feature dimension, respectively.

3.2 Mean Fisher score-based feature selection algorithm (MFSFSA)

Definition 4

Given a set of ranked Fisher scores, \(RFS=\{ fs_{1},fs_{2},\ldots ,fs_{m}\} \forall fs_{j-1}\ge fs_{j}\), and the feature set corresponding to RFS, \(FSR=\{ x_{1},x_{2},\ldots ,x_{m}\} \forall fs_{x_{j-1}}\ge fs_{x_{j}}\), the mean Fisher score-based feature selection algorithm (MFSFSA) selects a feature subset by comparing individual Fisher scores with the mean Fisher score. It returns the MCC score of the selected feature subset as well.

Algorithm 1: MFSFSA (pseudocode figure)

In MFSFSA, all the features having a Fisher score higher than the mean score are selected into the feature subset, as they have greater class discriminative power. The selected feature subset is used for training the support vector machine and calculating \(MFS\_MCC\) through a validation process.
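A sketch of this step using scikit-learn is shown below; the train/validation split, the SVM hyperparameters and the helper names are assumptions of this sketch rather than the paper's exact implementation:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import matthews_corrcoef

def mcc_of_subset(X_tr, y_tr, X_val, y_val, idx):
    """Train an RBF-SVM on the given feature indices and return the
    MCC of its predictions on the validation split."""
    clf = SVC(kernel='rbf', gamma='scale')
    clf.fit(X_tr[:, idx], y_tr)
    return matthews_corrcoef(y_val, clf.predict(X_val[:, idx]))

def mfsfsa(X_tr, y_tr, X_val, y_val, fisher):
    """Keep every feature whose Fisher score exceeds the mean score and
    report the MCC of that subset (MFS_MCC)."""
    selected = list(np.where(fisher > fisher.mean())[0])
    return selected, mcc_of_subset(X_tr, y_tr, X_val, y_val, selected)
```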

3.2.1 Forward feature selection algorithm (FFSA)

Definition 5

Given a set of ranked Fisher scores, i.e., \(RFS=\{ fs_{1},fs_{2},\ldots ,fs_{N}\} \forall fs_{j-1}\ge fs_{j}\), the feature set corresponding to RFS, i.e., \(FSR=\{ x_{1},x_{2},\ldots ,x_{N}\} \forall fs_{x_{j-1}}\ge fs_{x_{j}}\), the feature subset obtained through MFSFSA, i.e., \(FSFS=\{x_{1},x_{2},\ldots ,x_{Q}\}, Q\le N\), and the MCC score obtained through MFSFSA (\(MFS\_MCC\)), the forward feature selection algorithm (FFSA) adds features to FSFS in descending order of their Fisher scores, forming candidate feature subsets \(SFS_{F_{i}}\). It returns the selected feature subset with the maximum MCC score.

Algorithm 2: FFSA (pseudocode figure)

During the processing of FFSA, each feature having a Fisher score lower than the mean is tried in turn, in descending order of the corresponding Fisher scores. The classifier is trained and validated using the updated feature subset, and the corresponding MCC is calculated. If the addition of a feature increases the Matthews correlation coefficient score by an acceptable amount, the feature is kept in the selected subset; otherwise, it is discarded. The process continues until all the features have been evaluated.
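One possible reading of this loop is sketched below, reusing mcc_of_subset from the MFSFSA sketch; the rule that the required MCC boost starts at 0.15 x MFS_MCC and halves after every accepted feature is an assumption inferred from the thresholds quoted in Sect. 5:

```python
import numpy as np

def ffsa(X_tr, y_tr, X_val, y_val, fisher, fsfs, mfs_mcc):
    """Offer the remaining features in descending Fisher-score order and
    keep one only if it raises the validation MCC by at least delta."""
    subset, best_mcc = list(fsfs), mfs_mcc
    delta = 0.15 * mfs_mcc                    # initial required boost
    for f in np.argsort(-fisher):             # descending Fisher rank
        if f in subset:
            continue
        mcc = mcc_of_subset(X_tr, y_tr, X_val, y_val, subset + [f])
        if mcc > best_mcc + delta:            # acceptable improvement
            subset.append(f)
            best_mcc = mcc
            delta /= 2.0                      # assumed halving of the threshold
        # otherwise the feature is discarded and the next one is tried
    return subset, best_mcc
```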

3.2.2 Reverse feature selection algorithm (RFSA)

Definition 6

Given a set of ranked Fisher scores, i.e., \(RFS=\{ fs_{1},fs_{2},\ldots ,fs_{N}\} \forall fs_{j-1}\ge fs_{j}\), the feature set corresponding to RFS, i.e., \(FSR=\{ x_{1},x_{2},\ldots ,x_{N}\} \forall fs_{x_{j-1}}\ge fs_{x_{j}}\), the feature subset obtained through MFSFSA, i.e., \(FSFS=\{x_{1},x_{2},\ldots ,x_{Q}\}, Q\le N\), and the MCC score obtained through MFSFSA (\(MFS\_MCC\)), the reverse feature selection algorithm (RFSA) adds features to FSFS in ascending order of their Fisher scores, forming candidate feature subsets \(SFS_{R_{i}}\). It returns the selected feature subset with the maximum MCC score.

Algorithm 3: RFSA (pseudocode figure)

The FFSA favors the features with higher Fisher scores by giving them a chance before the features with lower scores. To avoid this bias, RFSA works in reverse to the FFSA by offering the features with lower Fisher scores first.
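Under the same assumptions as the FFSA sketch above, RFSA differs only in the order in which the remaining features are offered:

```python
import numpy as np

def rfsa(X_tr, y_tr, X_val, y_val, fisher, fsfs, mfs_mcc):
    """Same loop as ffsa, but the remaining features are offered in
    ascending Fisher-score order (lowest ranked first)."""
    subset, best_mcc = list(fsfs), mfs_mcc
    delta = 0.15 * mfs_mcc
    for f in np.argsort(fisher):              # ascending Fisher rank
        if f in subset:
            continue
        mcc = mcc_of_subset(X_tr, y_tr, X_val, y_val, subset + [f])
        if mcc > best_mcc + delta:
            subset.append(f)
            best_mcc = mcc
            delta /= 2.0                      # same assumed halving rule
    return subset, best_mcc
```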

3.2.3 Feature subset selection algorithm (FSSA)

Definition 7

Given the MCC scores from MFSFSA, FFSA and RFSA along with their corresponding feature subsets, the feature subset selection algorithm (FSSA) computes and returns the selected feature subset (SFS).

Algorithm 4: FSSA (pseudocode figure)

Finally, FSSA looks for the feature subset that yields the highest Matthews correlation coefficient score. If FFSA and RFSA do not improve the MCC score, the subset obtained through MFSFSA is selected; otherwise, the feature subset scoring the highest MCC is selected. If the MCC scores for both the FFSA and RFSA subsets are equal, the feature set obtained through the FFSA is selected, as its features are more likely to have higher Fisher scores.
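This decision rule can be written compactly as follows (a sketch; the tie-breaking order follows the paragraph above and Theorem 1):

```python
def fssa(fsfs, mfs_mcc, sfs_f, f_mcc, sfs_r, r_mcc):
    """Return the subset with the highest MCC; break MCC ties in favour
    of the smaller subset, and a remaining tie in favour of FFSA."""
    candidates = [(mfs_mcc, len(fsfs),  0, fsfs),   # 0/1/2 encode preference
                  (f_mcc,   len(sfs_f), 1, sfs_f),
                  (r_mcc,   len(sfs_r), 2, sfs_r)]
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    return candidates[0][3]
```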

Cardiac system algorithm (CSA) (pseudocode figure)

The cardiac system algorithm (CSA) presents the flow of all the four proposed algorithms.

Theorem 1

The FSSA returns the feature subset of minimum dimension having maximum class discriminating ability.

Proof

Let F be a set of m features, i.e., \(\hbox {F}=\{x_{1},x_{2},x_{3},\ldots ,x_{m}\}\), and \(F^{'}\) the set of Fisher scores for each \(x_j\): \(1\le j\le m\). Algorithm 1, MFSFSA, returns a feature subset whose features have individual Fisher scores greater than the mean Fisher score, i.e., \(\hbox {FSFS}= \{x_{i},x_{i+1},\ldots x_{s}\}\), \(1\le s\le m\), and \(MFS\_MCC\) as its class discriminant score.

  1. If \(MFS\_MCC\) is one, FSFS is the most efficient feature subset and no further computation is required.

  2. If \(MFS\_MCC\) is less than one, then FSFS may not be an optimal subset. Algorithm 2 returns \(SFS_F\), a selected feature subset with a higher class discriminant score, i.e., \(F\_MCC \ge MFS\_MCC\) and FSFS \(\subseteq SFS_F\). Similarly, from Algorithm 3, a feature subset \(SFS_R\) may be obtained with higher class discrimination, i.e., \(R\_MCC \ge MFS\_MCC\) and FSFS \(\subseteq SFS_R\).

  3. By using Algorithm 4, if \(MFS\_MCC\), \(F\_MCC\) and \(R\_MCC\) are equal, then the feature subset corresponding to \(MFS\_MCC\) is selected, as it has the lowest dimension while sharing the highest class discriminant score. Otherwise, the feature subset with the higher discriminant score is selected.

  4. Through step 3, if \(F\_MCC\) and \(R\_MCC\) are equal, so that both corresponding subsets have equal class discriminant scores, the subset with the lower dimension is selected. This gives a feature subset with the higher class discriminant score and the lowest possible feature dimension.

  5. If step 3 returns equal \(F\_MCC\) and \(R\_MCC\) scores and the dimensions of the corresponding feature subsets \(SFS_F\) and \(SFS_R\) are equal as well, then \(SFS_F\) is selected: it has the same class discriminant score as \(SFS_R\), but its individual features are more likely to have higher class discriminant scores.

From 1, 3, 4 and 5, it is clear that the FSSA always returns a feature subset with the highest class discriminating ability and the minimum possible dimension. \(\square \)

3.3 Classification

In the proposed technique, a support vector machine (SVM) is used for binary classification (i.e., heart disease patient vs. normal control subject). The input dataset of order \( M \times P \) is divided into two subsets: (1) train data and (2) test data. Train data are of order \(T_r \times P \), and test data are of order \( T \times P \). Decision functions for linear SVMs can be represented by \( (u.s_{j}+c) \le -1 \) if \( t_{j}=-1 \) and \( (u.s_{j}+c) \ge 1 \) if \( t_{j}=1 \). The two relations can be combined to cover both cases, i.e., \( t_{j}(u.s_{j}+c) \ge 1 \), where t represents the class label, s is the input data, u represents the weight vector and c is the bias. The problem of coronary heart disease prediction falls in the class of nonlinear problems. The decision function for nonlinear classification is represented as \( f(s)=u.\varPhi (s)+c \); this form is used because the input is a multidimensional feature subset rather than a single-dimensional quantity. For convergence on higher-dimensional data, SVM kernel functions are widely used. In a kernel-based SVM, u may be represented as \( u= \sum ^{m}_{j=1} \beta _{j} \varPhi (s_{j}) \). The decision function for a nonlinear classification problem with a multidimensional feature set can then be represented as:

$$\begin{aligned} f(s)= \sum ^{m}_{j=1}\beta _{j}\varPhi (s_{j}).\varPhi (s)+c \end{aligned}$$
(3)

where \( \varPhi (s_{j}).\varPhi (s) \) is defined as the kernel function and denoted as \( KF(s,s_{j}) \). By replacing \( \varPhi (s_{j}).\varPhi (s) \) by \( KF(s,s_{j}) \), the decision function is given as:

$$\begin{aligned} f(s)= \sum ^{m}_{j=1}\beta _{j}\cdot KF(s,s_{j})+c \end{aligned}$$
(4)

In order to solve nonlinearly separable problems with a high-dimensional feature set, the radial basis function (RBF) kernel is identified as the most suitable choice, since the RBF kernel corresponds to an implicit mapping into an infinite-dimensional feature space. The RBF kernel can be represented as

$$\begin{aligned} KF(s,s_{j})= \mathrm{e}^{(-\gamma \left| |s-s_{j}\right| |^2)} \end{aligned}$$
(5)

By substituting Eq. (5) into Eq. (4), the required decision function becomes:

$$\begin{aligned} f(s)= \sum ^{m}_{j=1}\beta _{j}\cdot \mathrm{e}^{(-\gamma \left| |s-s_{j}\right| |^2)}+c \end{aligned}$$
(6)
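Equation (6) is exactly the decision function that an off-the-shelf RBF-kernel SVM learns; the sketch below shows how it could be fitted with scikit-learn (the data, \(\gamma\) and C values here are illustrative assumptions, not the paper's settings):

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-ins for the reduced-dimension subsets (T_r x P train, T x P test)
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(60, 7)), rng.integers(0, 2, 60)
X_test = rng.normal(size=(20, 7))

# beta_j and c of Eq. (6) correspond to the dual coefficients and the
# intercept learned by an RBF-kernel SVC; gamma is the kernel width.
clf = SVC(kernel='rbf', gamma=0.1, C=1.0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# clf.dual_coef_, clf.support_vectors_ and clf.intercept_ expose the
# learned quantities that enter Eq. (6).
```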

Theorem 2

The RBF-based SVM solves the heart disease classification problem.

Proof

Given input data of dimension \(M\times P\) partitioned into two subsets, training data (\(T_r\times P\)) and test data (\(T\times P\)), training and testing a non-linearly separable multidimensional feature set with binary classes is a nonlinear problem. In order to produce a nonlinear decision boundary, the SVM uses \(f(s)=u.\varPhi (s)+c\) as its decision function. The diagnosis of heart disease through a feature subset is a problem over multidimensionally spaced data. As multidimensional feature set classification is achieved using a kernel-based decision function, \( f(s)= \sum ^{m}_{j=1}\beta _{j}\varPhi (s_{j}).\varPhi (s)+c \) solves the nonlinear classification problem. The RBF kernel, \(KF(s,s_{j})= e^{(-\gamma \left| |s-s_{j}\right| |^2)}\), is suited to infinite-dimensional feature mappings. So, using \(f(s)= \sum ^{m}_{j=1}\beta _{j}.e^{(-\gamma \left| |s-s_{j}\right| |^2)}+c\), classification of a nonlinearly separable multidimensional feature set can be achieved. Hence, it is concluded that the RBF-based SVM solves the heart disease classification problem. \(\square \)

4 Experimental setup

4.1 Evaluation criteria

In order to evaluate the performance of the proposed technique, three quantitative measures have been used, i.e., (1) accuracy, (2) specificity and (3) sensitivity.

$$\begin{aligned} \mathrm{{Sensitivity}}= & {} \frac{\mathrm{{True\ Positive}}}{(\mathrm{{True\ Positive}} + \mathrm{{False\ Negative}})} \end{aligned}$$
(7)
$$\begin{aligned} \mathrm{{Specificity}}= & {} \frac{\mathrm{{True\ Negative}}}{(\mathrm{{True\ Negative}} + \mathrm{{False\ Positive}})}\end{aligned}$$
(8)
$$\begin{aligned} \mathrm{{Accuracy}}= & {} \frac{(\mathrm{{True\ Negative}} + \mathrm{{True\ Positive}})}{(\mathrm{{True\ Negative}} + \mathrm{{True\ Positive}} + \mathrm{{False\ Negative}} + \mathrm{{False\ Positive}})} \end{aligned}$$
(9)

where

  • True positive: heart disease patient correctly identified as heart disease patient

  • False positive: normal control subject incorrectly identified as heart disease patient

  • True negative: normal control subject correctly identified as normal control subject

  • False negative: heart disease patient incorrectly identified as normal control subject
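Given these definitions, Eqs. (7)-(9) can be computed directly from a confusion matrix; a small scikit-learn sketch follows (the label convention 1 = HDP, 0 = NCS is an assumption of this sketch):

```python
from sklearn.metrics import confusion_matrix

def evaluate(y_true, y_pred):
    """Sensitivity, specificity and accuracy of Eqs. (7)-(9), with
    1 = heart disease patient (positive) and 0 = normal control subject."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy
```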

4.2 Dataset

Four UCI [30] datasets are used: (1) Cleveland, (2) Hungarian, (3) Switzerland and (4) SPECTF. Originally, the first three datasets had 76 features, out of which 14 attributes including the class label are selected, while SPECTF has 45 attributes, all of which are used. Details of the used attributes (UA) for evaluating the proposed technique are shown in Table 1.

Table 1 Partition details of the datasets in train and test sets

Three distinct statistics are shown for each dataset (i.e., the total number of records and the training and test data statistics). The Cleveland dataset with 303 instances is partitioned into training and test sets with 202 and 101 instances, respectively. The partitioning is performed so as to keep the class membership proportions similar in both sets. Out of the 202 training records, 111 belong to the HDP class while 91 belong to NCS. In the test subset, 53 records are part of the HDP class and 48 records belong to NCS. The rest of the datasets are partitioned in the same way.
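A stratified split along these lines could be reproduced as in the sketch below; the placeholder arrays and the random seed are assumptions, and the exact record assignment used by the authors is not specified:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the 303 Cleveland records; the class
# counts (164 HDP, 139 NCS) follow from the figures quoted above, and the
# actual feature values would be loaded from the UCI files.
X = np.zeros((303, 13))
y = np.array([1] * 164 + [0] * 139)          # 1 = HDP, 0 = NCS

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=101, stratify=y, random_state=0)
# Stratification keeps the HDP/NCS proportions of the 202/101 partition
# close to those reported in Table 1.
```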

Table 2 Dataset specifications

5 Results and discussion

Details of the datasets [30] used for experimentation are given in Tables 1 and 2. In order to avoid bias in the comparative analysis with the existing techniques, the dataset partitioning is performed in the same way as adopted by the competing techniques.

Table 3 Results of proposed technique over all datasets

Table 3 shows the mean Fisher score, the dimensions of the intermediate subsets selected using MFSFSA, FFSA and RFSA, and the finally selected subset using FSSA. Originally, three of the datasets have a 13-dimensional feature set (i.e., Cleveland, Hungarian and Switzerland) while SPECTF has 44 features. The mean Fisher score varies from 0.0758 (Switzerland) to 0.1457 (Cleveland). For the Cleveland dataset, six features are selected using MFSFSA and one extra feature is selected using each of the FFSA and RFSA; the FSSA selects the subset produced by the FFSA. For the Hungarian dataset, MFSFSA selects four features while two and three extra features are selected through the FFSA and RFSA, respectively. In the same manner, four features are selected by the MFSFSA for the Switzerland dataset; afterward, the FFSA selects four extra features, while the RFSA selects two. The FSSA selects the feature subset produced by the FFSA. Experimental results of the proposed technique over SPECTF show the selection of a 19-dimensional feature subset, with all 19 features selected using the MFSFSA.

Table 4 Feature dimension details

Table 4 clearly shows the reduction in the original feature dimension achieved by the proposed technique. The dimensions of the selected feature subsets for Cleveland, Hungarian, Switzerland and SPECTF are 47, 47, 39 and 57% lower, respectively, than those of their original feature sets.

Fig. 2 Performance analysis of the proposed technique

Figure 2 graphically represents the performance achieved by the proposed technique in terms of accuracy, sensitivity and specificity.

5.1 Results for the Cleveland dataset

Figure 3a, b, c graphically shows the feature selection statistics using FFSA, RFSA and FSSA for the Cleveland dataset. Out of the 13 features of the Cleveland dataset, MFSFSA selects the six features with a Fisher score greater than the mean Fisher score of all the features. The MCC score (i.e., \(MFS\_MCC\)) for the selected feature subset is 0.79. Figure 3a shows the MCC scores for feature subset selection using FFSA (i.e., \(F\_MCC\)). Initially, \(\delta _{f}\) is 15% of the \(MFS\_MCC\) score. The inclusion of the 7th ranked feature increases the MCC score from 0.79 to 0.90 and fulfills the required criterion, i.e., \(F\_MCC > TS+\delta _{f}\). Adding the 8th ranked feature to the selected subset yields an \(F\_MCC\) score of 0.93, an increase of 0.03; the 8th ranked feature is therefore not selected, as the \(F\_MCC\) score would have needed to be boosted by at least 0.06. All the other features are added and tested against the MCC boosting criterion, but none of them is selected into the feature subset. Through the FFSA, only the 7th ranked feature is selected, and it becomes a member of the existing feature subset obtained through the MFSFSA criterion.

Figure 3b shows the MCC scores of the feature subsets selected through the RFSA. As in the case of the FFSA, six features are selected through the MFSFSA. The rest of the features are added to the selected subset in the reverse order to the FFSA (i.e., features with low Fisher scores are tested first). When the 13th ranked feature (i.e., the feature with the lowest Fisher score) is added to the feature subset, the \(R\_MCC\) score is boosted from 0.79 to 0.85, an increase of 0.06 points. For the inclusion of the first feature, the minimum required increase in the score is 0.11, in order to fulfill \(R\_MCC> TS+\delta _{r}\), where \(\delta _{r}\) is initialized with \(0.15\times MFS\_MCC\); this feature is therefore not selected. The process continues, and when the 8th ranked feature is added to the feature subset it fulfills the criterion and gets selected. This is the only feature selected, and the final dimension of the feature subset using the RFSA comes to seven.

The feature subsets selected through the MFSFSA, FFSA and RFSA are fed as input to the FSSA. The subsets selected through FFSA and RFSA have a higher dimension than that of MFSFSA, but a higher MCC score as well, so the feature subset selected through MFSFSA drops out of the competition. The subsets selected through FFSA and RFSA contain different feature combinations but have the same MCC score and feature dimension. In this case, as the 7th ranked feature added by the FFSA has a higher Fisher score than the 8th ranked feature added by the RFSA, the former is used as the final choice. So, the FSSA returns the feature subset selected through the FFSA, shown in Fig. 3c.

Fig. 3 Feature subset selection using FSSA: Cleveland dataset

Figure 4 presents a graphical comparative analysis of the proposed technique versus the existing ones when the experiments are performed over the Cleveland dataset. It may be observed that the proposed technique performs better in terms of accuracy and specificity; among the existing techniques, the highest accuracy was achieved by the fuzzy-based CDSS [15]. In terms of sensitivity, the proposed technique shows moderate results which are close to the highest achieved result, i.e., 76%. In terms of specificity, the proposed technique again performs better than the existing techniques, the closest results being those of the fuzzy decision tree [22].

Fig. 4 Comparative analysis: Cleveland dataset

5.2 Results for the Hungarian dataset

Figure 5a, b, c shows the results of feature subset selection when the proposed technique is applied to the Hungarian dataset. In all three graphs, the number of features selected through the MFSFSA is four; these are the highest ranked features as per the Fisher score. The \(MFS\_MCC\) score achieved through these four selected features is 0.6. Figure 5a presents the MCC score (i.e., \(F\_MCC\)) obtained when a feature is added to the feature subset using FFSA. When the 5th ranked feature is added, the \(F\_MCC\) score of the subset becomes 0.67. This does not fulfill the criterion \(F\_MCC \ge TS+\delta _{f}\): the inclusion of the 5th ranked feature boosted the \(F\_MCC\) score by 0.07, but as per the criterion, it should have been at least 0.09. A similar situation arises with the 6th ranked feature, which is not selected either. When the 7th ranked feature is added to the feature subset, it gets selected as it increases the MCC score by 0.1. Features ranked 8 through 11 are not selected, as they do not boost the MCC score by the minimum of 0.045. The 12th ranked feature is selected for the feature subset as it increases the MCC score by 0.05. In total, six features are selected, with an MCC score of 0.75.

Fig. 5 Feature subset selection using FSSA: Hungarian dataset

The same process is performed for the RFSA but with the ranked features included in reverse order, as shown in Fig. 5b. A subset of seven features is selected, attaining an MCC score (i.e., \(R\_MCC\)) of 0.81. Figure 5c shows a graph of the selected features with the \(R\_MCC\) score achieved by their inclusion. The subsets selected through the MFSFSA, FFSA and RFSA are provided as input to the FSSA. The subset selected through the MFSFSA has four features, the FFSA subset has six and the RFSA subset has seven. Although the dimension of the subset selected through the RFSA is the highest, it also achieves the highest MCC score, so the subset selected through the RFSA is returned as the final choice.

Fig. 6 Comparative analysis: Hungarian dataset

Figure 6 presents the comparative results of the proposed technique with the existing methodologies in terms of the adopted evaluation metrics. In terms of accuracy, the performance of the proposed technique is better than that of the existing competitors by a considerable margin: its accuracy is 38% higher than that of the fuzzy-based CDSS technique [15], which, at 46%, had the highest accuracy among the existing techniques. Likewise, the proposed technique performs better in terms of the other two metrics, i.e., sensitivity and specificity, as well.

5.3 Results for the Switzerland dataset

Figure 7 shows the results of the feature subset selection process obtained by applying the proposed technique over the Switzerland dataset. In Fig. 7a, a feature subset is selected through the MFSFSA and FFSA. Four features have Fisher scores higher than the mean score and are selected through MFSFSA; the MCC score (\(MFS\_MCC\)) obtained through these selected features is 0.601. Through the FFSA, when the 5th ranked feature is added to the feature subset it attains an MCC score of 0.70. It is selected as it boosts the MCC score (i.e., \(F\_MCC\)) by 0.1, higher than the minimum required increase of 0.09. In the next iteration, the 6th ranked feature is included in the feature subset and an \(F\_MCC\) score of 0.759 is achieved; the boost of 0.059 is more than the required increase of 0.045. The next ranked feature does not affect the \(F\_MCC\) score, so it is excluded. When the 8th ranked feature is included, the subset achieves an MCC score of 0.92, a boost of 0.161, which is higher than the required threshold of 0.023, so it becomes part of the feature subset. The 9th and 10th ranked features make no impact on the MCC score and are not selected. When the 11th ranked feature is added to the feature subset, the subset achieves an \(F\_MCC\) score of 1, the maximum possible score; the attained boost of 0.08 is greater than the required 0.011. The remaining features make no impact either. As such, the dimension of the feature subset selected through the FFSA is eight, with an \(F\_MCC\) score of 1.

Figure 7b presents the simulation results of feature subset selection using the RFSA. As previously mentioned, the Swiss dataset has four features included in the subset through MFSFSA. As features are added in reverse order (i.e., lowest to highest rank), the inclusion of the features ranked 13th through 9th does not make any considerable impact on the MCC score (\(R\_MCC\)). When the 8th ranked feature is included, however, it raises the \(R\_MCC\) score of the feature subset from 0.601 to 0.718; this increase of 0.117, higher than the required raise of 0.09, leads to its selection. The features ranked 7th and 6th are also not good choices for the feature subset, as their inclusion does not meet the required minimum increase in the \(R\_MCC\) score. The 5th ranked feature is selected as it boosts the \(R\_MCC\) score to 0.856, an increase of 0.138, which is higher than the threshold of 0.045. After the RFSA completes, there are six features in the subset with an \(R\_MCC\) score of 0.856. Finally, the FSSA selects the feature subset obtained through the FFSA, which attained an MCC score of 1. Figure 7c shows the MCC scores of the feature subset as each selected feature is included.

Fig. 7 Feature subset selection using FSSA: Switzerland dataset

Fig. 8 Comparative analysis: Switzerland dataset

Figure 8 shows that the comparative performance of the proposed technique in terms of accuracy is better than that of its competitors. Similarly, the sensitivity score of the proposed technique is quite promising and better than all the compared techniques, while the specificity score is reasonable.

5.4 Results for the SPECTF dataset

The SPECTF dataset is quite different from the other UCI datasets as it has a 44-dimensional feature set. When the proposed technique is applied over SPECTF, 19 features are directly selected through the MFSFSA criterion as they have Fisher scores higher than the mean. The MCC score of the selected feature subset is already 1, the maximum score, so the cardiac system algorithm (CSA) decides that there is no need to perform the FFSA and RFSA, and this subset remains the final selected feature subset. Figure 9 graphically presents the comparative results of the proposed technique and eight existing techniques [5]. A different set of existing techniques is used for the SPECTF dataset because of the differences in the number and type of its attributes compared to the other datasets (as evident in Tables 1, 2). It can be seen that the proposed technique performs better than the existing ones in terms of accuracy (82.7%), except for one (i.e., FireFly+CFARS-AR). In terms of sensitivity and specificity, however, the proposed technique achieves average performance compared to the others.

Fig. 9 Comparative analysis: SPECTF dataset

5.5 K-fold validations

This section presents the K-fold validations for the three UCI datasets and provides a comparison with the K-fold validations of the NN-based CDSS [15], Markos [22] and fuzzy-based CDSS [15]. For the Swiss dataset, an eightfold analysis was used, while a tenfold analysis was used for the other two. Each fold contained 30, 28 and 15 records for the Cleveland, Hungarian and Swiss datasets, respectively. K-fold analysis over SPECTF is not performed due to the unavailability of comparative results. The K-fold validations confirm that the proposed technique is the most accurate among the compared techniques: Figs. 10, 11 and 12 show that the proposed technique provides better accuracy for almost every single fold. It is important to state that the fold configurations used are identical to those used in the existing research techniques.
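For reference, such a fold-by-fold accuracy comparison can be generated as sketched below with scikit-learn; the shuffling, seed and SVM settings are illustrative assumptions, while the fold counts follow the text above:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def per_fold_accuracies(X, y, n_folds=10):
    """Accuracy of the RBF-SVM on each of the K folds (10 folds for
    Cleveland and Hungarian, 8 for the Swiss dataset in the paper)."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    accuracies = []
    for train_idx, test_idx in skf.split(X, y):
        clf = SVC(kernel='rbf', gamma='scale')
        clf.fit(X[train_idx], y[train_idx])
        accuracies.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
    return accuracies
```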

Fig. 10 K-fold comparative analysis: Cleveland

Fig. 11 K-fold comparative analysis: Hungarian

Fig. 12 K-fold comparative analysis: Switzerland

5.6 Computational cost analysis of the proposed technique

The computational complexity of the RBF-SVM is \(O(max(M,d)\, min(M,d)^2)\), where M represents the number of samples and d is the feature dimension [31]. As the number of samples is greater than the feature dimension, i.e., \(M>d\), the complexity becomes \(O(Md^2)\). Using the RBF-SVM complexity, the computational cost of the MFSFSA becomes:

$$\begin{aligned} 4+O((M-T)Q^2)\Rightarrow O((M-T)Q^2) \end{aligned}$$
(10)

where Q is the number of features selected through MFSFSA. The computational complexity of the FFSA is calculated as:

$$\begin{aligned}&2+(N-Q)O\left( 1+\left( (M-T)\frac{N(N+Q+1)}{2}\right) ^2\right) \nonumber \\&\quad \Rightarrow (N-Q)O\left( (M-T)\left( N\left( \frac{N+Q+1}{2}\right) \right) ^2\right) \end{aligned}$$
(11)

N and T represent the original feature dimension and the number of instances in the test data, respectively. As only training and validation are used in FFSA and RFSA, the test instances are excluded. The RFSA works the same as the FFSA but chooses the features in reverse order of their Fisher scores, so its complexity is the same as that of the FFSA, i.e., \((N-Q)O\left( (M-T)\left( N\frac{(N+Q+1)}{2}\right) ^2\right) \). The maximum number of statements (including the conditional statements) that come into execution is seven. Once the feature subset of reduced dimension (P) has been selected, the computational cost of training and testing over the reduced-dimensional feature subset is:

$$\begin{aligned} O(MP^2) \end{aligned}$$
(12)

By summing individual complexities of all the algorithms, the overall computational cost of the proposed technique becomes:

$$\begin{aligned} \begin{aligned}&O((M-T)Q^2)+(N-Q)O\left( (M-T)\left( N\frac{(N+Q+1)}{2}\right) ^2\right) \\&\quad +(N-Q)O\left( (M-T)\left( N\frac{(N+Q+1)}{2}\right) ^2\right) +7+O(MP^2)\\&\quad \Rightarrow O((M-T)Q^2)+2(N-Q)O\left( (M-T)\left( N\frac{(N+Q+1)}{2}\right) ^2\right) \\&\qquad +(N-Q)O(M-T)+O(MP^2) \end{aligned} \end{aligned}$$
(13)

For the best-case scenario, the complexity is:

$$\begin{aligned} O((M-T)Q^2)+ O(MP^2) \end{aligned}$$
(14)

This is the case when the features selected through the MFSFSA already return the maximum MCC, i.e., one. In this case, no features are selected using FFSA and RFSA, as they would not be able to increase the validation MCC score. Since P and Q are the same in this case, we may rewrite it as:

$$\begin{aligned} O((M-T)Q^2)+ O(MQ^2) \end{aligned}$$
(15)

The best computational complexity is achieved for the experiments performed over the SPECTF dataset. Out of a total of 44 features, 19 were selected through MFSFSA and they produced an MCC score of 1. This is the case where the selection process through FFSA and RFSA is not performed, resulting in the use of the same 19 features for training and testing purposes. For the average case, the computational complexity remains \(O((M-T)Q^2)+2(N-Q)O((M-T)(N\frac{(N+Q+1)}{2})^2)+(N-Q)O(M-T)+O(MQ^2)\). For the other three experiments (Cleveland, Hungarian and Switzerland), the computational complexity remained the average case. These are the cases when all the algorithms are executed and \(Q>1\) and \(P<N\).

The highest computational complexity occurs in three cases: (i) when Q is nearly equal to 1, so that FFSA and RFSA have to go through nearly all the features, although P remains well below N; (ii) when Q is well above 1 but a high MCC is not obtained, so that FFSA and RFSA proceed as in the average case but P nearly equals N; and (iii) the worst case of all, when Q nearly equals 1 and P nearly equals N. In this last case, FFSA and RFSA need to go through all the features and training/testing is performed over (approximately) the original feature dimension.

6 Conclusion

A cardiac disease diagnosis system is presented based on the proposed feature selection algorithms MFSFSA, FFSA, RFSA and FSSA. The proposed algorithms use the individual Fisher scores of the features, along with the MCC scores of the candidate subsets, for their selection into the feature subsets. Each of the proposed feature selection algorithms selects one feature subset, and among them the feature subset with the highest MCC score and lowest dimension is chosen. A binary-class RBF kernel-based SVM is used for predicting heart disease. Four UCI datasets are used for the verification of the proposed technique, which is analyzed over three performance metrics, i.e., accuracy, sensitivity and specificity. The proposed technique achieves considerably better results than the other comparative techniques. A detailed computational cost analysis of the proposed technique is presented as well.