
1 Introduction

The human heart is a vital organ about the size of a large fist, weighing roughly 230 to 340 grams. Its role is to pump deoxygenated blood to the lungs for oxygenation and to circulate the oxygenated blood throughout the body as part of the circulatory system. The heart resides in a double-walled sac, called the pericardium, which protects it within the chest. Pericardial fluid flows between the two layers of the pericardium and keeps the heart lubricated as it moves against surrounding structures such as the diaphragm and the lungs. The heart performs its function in two circuits: pulmonary circulation and systemic circulation. In pulmonary circulation, the oxygen-deprived blood received by the heart reaches the lungs through the pulmonary artery, is oxygenated there, and returns to the left atrium of the heart through the pulmonary veins [12, 14]. In systemic circulation, oxygenated blood from the left atrium descends to the left ventricle and eventually leaves the heart through the aorta, which branches into many arteries and capillaries that supply oxygenated blood to all parts of the body. Figure 1 shows the working of the human heart.

Fig. 1
The image represents the working human heart, which supplies oxygenated blood to all parts of the human body. Various parts such as the atria, ventricles, aorta, and veins are labelled.

Working of the human heart

Any blockage in these blood vessels obstructs the smooth flow of blood and may lead to a heart attack [13, 15]. Many factors can lead to such blockages, including high-cholesterol diets, excessive dietary fat, physical inactivity, stress at work, sleep disturbances, air pollution, and alcohol or tobacco use.

According to the World Health Organization (WHO), approximately 17.9 million people die each year from heart disease [1]. This fast-growing mortality rate can be controlled by predicting a person's risk of CVD at an early stage [2]. Heart disease prognosis can be made with the help of advanced machine learning-based models that have proven to be very useful for both patients and physicians. Most predictive models developed so far differentiate between patients by simply detecting the presence or absence of a heart problem. In this paper, however, we emphasize the division of patients into five categories that reflect the degree to which the disease affects them and thus provide a deeper understanding of the patient's health status.

In this research work, heart patients are divided into five categories (0 to 4), with class 0 indicating the absence of the disease, class 1 indicating a minor effect of the disease, and increasing class numbers indicating increasing severity, up to class 4, which denotes a very critical state. The algorithms used for this classification are Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbors (KNN), and Naïve Bayes (NB). We have compared the performance of these algorithms in their standard form and with the One-vs-All (OVA) approach, in which multiple binary classifiers are built instead of a single multiclass classifier. The results of this research work show that the latter provides much better accuracy than the former. The OVA-based performance is further enhanced by Principal Component Analysis (PCA), which increases the predictive accuracy of the algorithms, with KNN showing the best performance at 99.56% accuracy.

The rest of this paper is organized as follows. Section 2 discusses previous work on heart disease prediction using machine learning models. Section 3 presents the workflow of this study, the data used, its preprocessing, and the algorithms applied. Section 4 discusses the results obtained during this study, and the paper concludes in Sect. 5.

2 Literature Review

As mentioned in the previous section, most of the earlier models for heart disease prediction were designed for binary classification of patients, so the state of the art reviewed here is restricted to the limited number of research works on multiclass classification of cardiovascular and other diseases in the healthcare sector, along with a few binary classification studies.

Kirsi Varpa et al. [3] performed multiclass classification of otoneurological disorder patients using the KNN and SVM classification algorithms. These algorithms were applied in their conventional form as well as with the OVA method, with the latter giving the best results in combination with KNN. Anurag Kumar Verma et al. [4], in their paper, classified people suffering from skin diseases into six distinct categories (psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra). This classification was done using six machine learning algorithms and ensembles of them built with bagging, AdaBoost, and gradient boosting. It was noted that using ensembles as predictive models provides better results than the individual algorithms.

Hin Wai Lui et al. [5] combined a conventional neural network with a convolutional neural network to perform multiclass classification of patients affected by myocardial infarction. The resulting model was able to provide 97.2% accuracy with 92.4% sensitivity and can be embedded in portable devices. C. Beulah Christalin Latha et al. [6] performed two-class classification of cardiovascular patients on the Cleveland heart disease dataset. They used six different algorithms, Bayes Net, Naïve Bayes, Random Forest, C4.5, Multilayer Perceptron, and Projective Adaptive Resonance Theory (PART), and again built ensembles from them. The combining methods used were bagging, voting, and stacking. It was noted that the ensemble models offer much higher accuracy than the individual weak algorithms, with majority voting showing the highest accuracy, an improvement of up to 7%. This accuracy was further enhanced by the use of PCA.

Abderrahmane El.daoudy et al. [1] created a cardiovascular prediction model using Apache Cassandra to store the large volumes of generated data and Spark MLlib to make predictions. This model, built using the random forest algorithm, was able to handle real-time data and provided 87.5% accuracy and 86.67% sensitivity.

3 Method and Materials

This section describes the flow of work during this research, the dataset used, its analysis and preprocessing, and also explains the various algorithms used.

The steps presented in the flowchart in Fig. 2 are performed for the One-vs-All implementation of each of the machine learning algorithms used, namely logistic regression, SVM, random forest, KNN, and Naïve Bayes. The binary classifiers generated from each algorithm give their own accuracies, and their mean is computed to determine the overall accuracy of that algorithm.

Fig. 2
A flow chart represents the workflow of the research: the data is collected and preprocessed, and the classification output is finally produced.

Flow of work

3.1 Dataset Description

The dataset used for this research work is the Cleveland heart disease dataset available in the UCI machine learning repository. The target variable of this dataset is a multi-valued attribute that can take values from 0 to 4, with class 0 indicating the absence of the disease, class 1 indicating a mild impact of the disease, class 2 a moderate state, class 3 a fairly severe state, and class 4 the most critical state.

The other predictor variables that are involved in the classification process are the patient’s age, gender, type of chest pain, blood pressure while resting, cholesterol level in the blood, blood sugar level, maximum heartbeat rate, and a few others [7]. The 13 predictors along with the target variable are presented in Table 1.

Table 1 Features of the Cleveland heart disease dataset

Before applying the various machine learning algorithms to predict the target attribute values, the dataset needs to be preprocessed and analyzed properly to handle the outliers and missing values, and to identify the relevant patterns hidden in it.

In the dataset used, the Ca and Thal attributes contained missing values, which we replaced with the values 0 and 3, respectively, these being the most frequent values of those attributes. Outliers were detected by plotting boxplots of the attributes; since their presence was observed not to affect the prediction accuracy adversely, they were left undisturbed.
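The mode-based imputation described above could be sketched as follows in Python with pandas; the file name and column names are assumptions and not part of the original study.

```python
import pandas as pd

# Assumed file name and column names for the UCI Cleveland data,
# which marks missing entries with '?'.
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
           "exang", "oldpeak", "slope", "ca", "thal", "num"]
df = pd.read_csv("processed.cleveland.data", names=columns, na_values="?")

# Replace missing 'ca' and 'thal' entries with the most frequent value (mode),
# which in this dataset is 0 and 3, respectively.
for col in ["ca", "thal"]:
    df[col] = df[col].fillna(df[col].mode()[0])
```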

The following Fig. 3 shows the bar plot for the target attribute, and Figs. 4 and 5 show the boxplot representations of the age and chol attributes, respectively.

Fig. 3
A bar graph describes the target attribute; the x axis represents the target class (plotted as a factor) and the y axis represents the count. The plot shows a declining trend.

Bar plot for target variable

Fig. 4
A boxplot indicates age attributes for various factors, with class 4 associated with the highest age group.

Boxplot for age attribute

Fig. 5
A boxplot indicates that the average cholesterol of all people falls in the range of 200 to 300.

Boxplot for cholesterol attribute

It can be observed from Fig. 3 that the maximum number of people in the dataset (more than 150) belong to class 0, i.e., they are completely free from CVD, and class 4 contains the minimum number of people. Figure 4 indicates that people belonging to a higher age group are more prone to being in class 4, i.e., they are more likely to reach a critical state, with a few outliers in which people aged under 40 belong to this critical class. Figure 5 indicates that the average cholesterol of the people falls in the range of 200–300. It can also be seen from Fig. 5 that a few people with a cholesterol level close to 500 are completely free from heart disease.

Figure 6 represents the correlation matrix for the dataset used, where a value less than 0 indicates a negative correlation, a value greater than 0 indicates a positive correlation, and a value of 0 indicates no linear relationship between the two associated attributes.

Fig. 6
A correlation matrix describes the dataset used, values at 1 are highlighted in diagonal.

Correlation matrix for the Cleveland heart disease dataset attributes
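As a rough illustration only, a correlation matrix such as the one in Fig. 6 could be produced along the following lines, continuing from the preprocessing sketch above (the plotting details are assumptions):

```python
import matplotlib.pyplot as plt

corr = df.corr()  # pairwise Pearson correlations; df from the earlier sketch
plt.figure(figsize=(8, 7))
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="correlation")
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.tight_layout()
plt.show()
```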

3.2 Algorithms Used

In this research work, we have used the following five algorithms to classify the patients and compared their performances with and without the One-vs-All approach.

Logistic Regression, a supervised classification algorithm, is typically used for binary classification and cannot perform multiclass classification in its ordinary form; in this research, however, we use it for multiclass classification by implementing it with the One-vs-All approach. The LR method requires a dataset free from missing values, but the dataset used here has missing values in the Ca and Thal attributes, which we handled by replacing the blanks with the most frequent value of the respective attribute. The core of this algorithm, used to estimate the probability that a specific class applies to a data point, is the sigmoid function shown in Eq. 1 [8].

$$y = \frac{1}{1 + e^{-x}}$$
(1)

where ‘e’ is Euler's number, approximately 2.718, x is a linear combination of all the predictors, and y is a probability between 0 and 1 indicating the class to which the new tuple belongs, determined by comparing it against a threshold between 0 and 1, which is 0.5 by default.
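As an illustration, the sigmoid of Eq. 1 and the default 0.5 threshold can be expressed in a few lines of Python (the numeric inputs are arbitrary examples):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid function of Eq. 1: maps any real-valued score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# x is the linear combination of the predictors, e.g. w . x + b
scores = np.array([-2.0, 0.0, 1.5])
probs = sigmoid(scores)                # estimated class probabilities
labels = (probs >= 0.5).astype(int)    # default threshold of 0.5
print(probs, labels)
```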

Support Vector Machine is another supervised algorithm that is applicable to binary as well as multiclass classification. It attempts to generate a separating boundary, known as a hyperplane, for the dataset being used, separating the tuples belonging to different classes with the maximum margin [9]. Out of all the possible hyperplanes that segregate the data points, the one that provides the maximum margin is called the Maximal Margin Hyperplane (MMH). Figure 7 shows a 2-dimensional space containing data points of two different classes A and B separated by a hyperplane.

Fig. 7
A line graph describes a two dimensional space containing data points of two different classes A and B separated by a hyperplane.

Support vector machine architecture in a 2-d space

In the case of non-linearly separable data, i.e., when the boundary required to separate the data points is not a straight line, a kernel function is used. It takes the low-dimensional feature space as input, converts it into a high-dimensional feature space, and generates a curve or a plane as the hyperplane separating the data elements belonging to distinct classes. While implementing SVM on our dataset without the One-vs-All approach, 182 support vectors were obtained, as shown in Fig. 8.

Fig. 8
The S V M text output represents the fitted model, with 182 support vectors.

SVM model built
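A minimal scikit-learn sketch of fitting an SVM with a non-linear (RBF) kernel on such data is given below; the hyperparameter values and the train/test split are illustrative assumptions, not the settings of the original study.

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Predictors and the 0-4 target, taken from the preprocessed DataFrame of the
# earlier sketch (column names are assumptions).
X = df.drop(columns="num").values
y = df["num"].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# The RBF kernel implicitly maps the features into a higher-dimensional space.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
print("Support vectors:", model.named_steps["svc"].n_support_.sum())
```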

Random Forest is a machine learning classifier based on the ensemble technique and uses the concept of decision trees in a randomized manner [10]. Each decision tree takes its own bootstrap sample of tuples and a random subset of attributes from the original dataset and produces its own prediction for the newly introduced data point. These individual predictions are aggregated using the majority voting technique, i.e., the value predicted by the maximum number of trees is taken as the final prediction of the random forest model [11].
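A brief sketch of this ensemble idea with scikit-learn, reusing the train/test split from the SVM sketch above (the number of trees is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier

# Each of the 100 trees is fit on a bootstrap sample of the training data;
# the forest's prediction is the majority vote over all trees.
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```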

K-Nearest Neighbors is another supervised machine learning algorithm used to solve classification and regression problems. To classify a newly introduced data point into one of the classes in the dataset, the ‘K’ data points closest to the new data point are identified. These ‘K’ nearest data points are found by measuring the distance of all the existing data points from the new one. The class to which the majority of these ‘K’ neighbors belong is the class to which the new data point is assigned. The distance between the new data point and the existing data points can be calculated using several measures [10]. A few commonly used distance measures are given below.

Euclidean distance given by

$$d = \left( \sum_{i} \left( x_i - y_i \right)^2 \right)^{1/2}$$
(2)

Manhattan distance given by

$$d = \sum_{i} \left| x_i - y_i \right|$$
(3)

Minkowski distance given by

$$d = \left( \sum_{i} \left| x_i - y_i \right|^{r} \right)^{1/r}$$
(4)

The error in prediction varies with the value chosen for ‘K’. In our research, we iterated over K = 1 to 10 to choose the value of ‘K’ giving the minimum error rate. The following Figs. 9, 10 and 11 show the error rate versus K plots for the KNN algorithm implemented in its ordinary form, with One-vs-All, and with OVA combined with the principal component analysis technique, respectively.

Fig. 9
A line graph indicates error value versus K plots for the K N N algorithm in ordinary forms.

Ordinary KNN

Fig. 10
A line graph indicates error value versus K plots for the K N N algorithm in One versus All form.

One-vs-all

Fig. 11
A line graph indicates the error value versus K plots for the K N N in one versus all and P C A.

One-vs-all & PCA

It can be observed from the above figures that the error rate is minimum at K = 10 for the ordinary KNN implementation, reaches its minimum at K = 2 and remains the same up to K = 10 when implemented with OVA, and becomes 0 for K = 1 to 10 when KNN is implemented with OVA combined with PCA.
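The K-selection loop described above could be sketched as follows, with the error rate taken as one minus the test accuracy and the split reused from the earlier sketches (a simplification of the procedure in the paper):

```python
from sklearn.neighbors import KNeighborsClassifier

error_rates = []
for k in range(1, 11):                        # K = 1 .. 10, as in the paper
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    error_rates.append(1 - knn.score(X_test, y_test))  # misclassification rate

best_k = error_rates.index(min(error_rates)) + 1
print("Chosen K:", best_k)
```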

Naïve Bayes is a supervised, probabilistic classification algorithm. It assumes complete independence among the features of the dataset, i.e., the value of one attribute does not depend on the value of any other attribute [10]. It works on the principle of Bayes' theorem, which is given as follows:

$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$
(5)

where P(A|B) is the probability of hypothesis A given that event B has occurred, P(B|A) is the likelihood of event B occurring given that A is true, P(A) denotes the prior probability of the hypothesis before the event has occurred, and P(B) denotes the evidence, i.e., the probability of occurrence of event B.
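For illustration, a Gaussian Naïve Bayes classifier can be fitted on the same split as follows; the Gaussian variant is an assumption, since the paper does not state which variant was used.

```python
from sklearn.naive_bayes import GaussianNB

# Models each feature with a class-conditional normal distribution under the
# feature-independence assumption of Bayes' theorem.
nb = GaussianNB()
nb.fit(X_train, y_train)
print("Test accuracy:", nb.score(X_test, y_test))
```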

One-vs-All is a technique that allows machine learning algorithms to perform multiclass classification much more efficiently than in their ordinary form. In this approach, ‘n’ binary classifiers are built based on a chosen algorithm instead of a single multiclass classifier, where ‘n’ is the number of classes in the dataset. Each binary classifier is dedicated to a single class, i.e., it gives the prediction accuracy for that class by treating it as class 1 and all other classes as class 0. The accuracies obtained for each class from the associated binary classifiers are averaged to find the overall accuracy of the algorithm. Figure 12 shows the architecture of the One-vs-All technique.

Fig. 12
A diagram represents a technique where n binary classifiers are built based on a chosen algorithm instead of a single multiclass classifier. 3 classes represented in various colors.

Schematic diagram of one-vs-all technique
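The OVA scheme described above can be sketched directly: one binary classifier per class, with the per-class accuracies averaged. The base estimator and data split below are illustrative assumptions; scikit-learn's OneVsRestClassifier offers an equivalent built-in alternative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def ova_mean_accuracy(base_estimator, X_tr, y_tr, X_te, y_te):
    """Build one binary classifier per class and average their accuracies."""
    accuracies = []
    for c in np.unique(y_tr):
        clf = clone(base_estimator)
        clf.fit(X_tr, (y_tr == c).astype(int))            # class c vs. all others
        accuracies.append(clf.score(X_te, (y_te == c).astype(int)))
    return float(np.mean(accuracies))

print(ova_mean_accuracy(LogisticRegression(max_iter=1000),
                        X_train, y_train, X_test, y_test))
```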

Principal Component Analysis is an unsupervised machine learning technique used to reduce the dimensionality of the dataset, allowing the model to predict the target values on a reduced dataset and thereby lowering the chances of overfitting [10]. PCA derives new attributes, known as principal components, that capture the maximum variance in the higher-dimensional space and projects the data onto a lower-dimensional space, retaining only the relevant information. Using these principal components to make predictions not only reduces the computational burden on the models but also improves their prediction accuracy.
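A short sketch of reducing the dataset with PCA before classification is given below; the fraction of variance retained is an assumption, as the paper does not specify the number of components.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=0.95)   # keep components explaining ~95% of the variance (assumed)
X_train_pca = pca.fit_transform(scaler.transform(X_train))
X_test_pca = pca.transform(scaler.transform(X_test))
print("Principal components retained:", pca.n_components_)
```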

4 Experimental Results

The research outcome shows a comparison among the heart disease prediction accuracies provided by the five algorithms, LR, SVM, RF, KNN, and NB, when implemented on the Cleveland heart disease dataset. These algorithms are implemented in three ways: in their ordinary form, with the One-vs-All approach, and with One-vs-All combined with principal component analysis. The accuracies obtained in each case are compared, and it is observed that using the OVA approach significantly increases the prediction accuracy of each algorithm. Moreover, applying the OVA technique to the dataset reduced by PCA further enhances the performance of the models.

The accuracy of each algorithm is computed from the confusion matrices generated. A confusion matrix is an n×n matrix, where n is the number of distinct classes in the dataset. In a confusion matrix, each column sum indicates the number of people who actually belong to that class and each row sum indicates the number of people who have been categorized into that class. Hence, the diagonal elements of the confusion matrix indicate the number of patients who have been correctly classified. Equation 6 is used to compute the classification accuracy from the confusion matrix.

$$\text{Accuracy} = \frac{\sum \text{diagonal elements}}{\sum \text{all elements}} \times 100$$
(6)
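Equation 6 can be computed directly from any confusion matrix, for example:

```python
import numpy as np

def accuracy_from_confusion(cm):
    """Eq. 6: sum of diagonal elements over sum of all elements, in percent."""
    cm = np.asarray(cm)
    return np.trace(cm) / cm.sum() * 100

# Toy 3-class confusion matrix (values are purely illustrative).
cm = [[50, 2, 1],
      [3, 40, 2],
      [0, 4, 30]]
print(accuracy_from_confusion(cm))   # about 90.9
```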

The following Figs. 13 and 14 depict the confusion matrices generated by all the above-mentioned algorithms without OVA and with OVA, respectively. In the OVA approach, as mentioned earlier, instead of a single multiclass classifier, five binary classifiers are built for each algorithm, each generating its own 2×2 confusion matrix. The accuracies computed from all five confusion matrices are averaged to obtain the overall accuracy of that algorithm when implemented with the OVA approach.

Fig. 13
The four matrices represent the confusion matrices generated without the O V A approach.

Confusion matrices without one-vs-all

Fig. 14
The confusion matrix with the O V A approach, for S V M, Random forest, K N N, Naive Bayes and logistic regression.

Confusion matrices with one-vs-All

The confusion matrices generated by the application of OVA approach on the dataset reduced by PCA have slightly better values than the confusion matrices without PCA, thus providing slightly better accuracy.

The following Table 2 states the accuracy provided by each of the algorithms when implemented without the OVA method, with the OVA approach, and with OVA on the PCA-reduced dataset.

Table 2 Accuracies obtained by each algorithm in all 3 cases

Figure 15 shows the bar graph representation of the accuracy values acquired by all the algorithms for the three types of implementation. It can be observed from Table 2 and Fig. 15 that no accuracy value exists for the ordinary implementation of logistic regression as it is a binary classification algorithm and can perform multiclass classification only with the One-vs-all approach.

Fig. 15
A bar graph represents the accuracy values obtained by all the algorithms for the three types of implementation.

Bar plot for accuracies obtained

Also, it can be seen that the KNN algorithm has outperformed all other algorithms by providing the highest accuracy of 99.56%.

5 Conclusion

This paper has emphasized the classification of CVD patients into more than two classes, which will be more helpful for physicians in providing the best possible treatment to their patients with greater precision, instead of simply detecting the presence of a cardiac problem. We have recommended the use of the One-vs-All method with different machine learning algorithms to perform multiclass classification of the patients. The accuracy obtained is further improved by implementing these algorithms with OVA on the dataset reduced by PCA. Out of the five algorithms implemented in this research, KNN has shown the best performance, with an accuracy of 99.56% when combined with the PCA and One-vs-All techniques.