
1 Introduction

Since healthcare organizations are complex systems, they generate vast amounts of data in many different formats. The key challenge is therefore to build intelligent systems that efficiently interpret the generated data and support humans in decision making [1, 2].

To build such systems, different solutions are combined, such as artificial intelligence methods, linked data, semantic web technologies and NoSQL datastores. In recent years, machine learning has become popular for developing intelligent systems in healthcare. Machine learning algorithms are capable of approximating relationships between dataset variables in the form of a function that is then used for prediction and decision making. The application of machine learning models in healthcare systems improves the efficiency and accuracy of the overall system [3].

Google employed machine learning to detect breast cancer by detecting patterns in tissue, achieving 87% accuracy, which is better than the 73% accuracy achieved by humans [4]. Scientists from Stanford developed an algorithm for skin cancer detection using visual processing and deep neural networks [5]. In [6], an algorithm for the detection of diabetic retinopathy using neural networks is presented.

One of the most common problems in healthcare is heart disease. Based on the latest statistics, the number of people with heart disease is predicted to rise by 46% by 2030, to more than 8 million adults. According to the World Health Organization [7], cardiovascular diseases (a group of disorders of the heart and blood vessels) are the number one cause of death worldwide [8]. The 2017 statistical report of the American Heart Association [9] shows that about 92.1 million Americans are living with some form of cardiovascular disease.

Gi Beom Kim [10] reports that, according to the database of the Korean Health Foundation, about 50,000 adults with heart disease live in South Korea and more than 4,000 enter adulthood every year. At the current rate, it is estimated that about 70,000 adults with heart disease will live there by the year 2020. Approximately 2,200 people die of cardiovascular disease each day, which is one death on average every 40 seconds.

In recent research [12,13,14,15,16], various machine learning algorithms have been applied to heart disease prediction. Different techniques are employed to build reliable systems that produce useful results while lowering costs and diagnosis time. The goal of this paper is to build a model that combines multiple classification algorithms to predict heart disease and to compare single-algorithm models with the ensemble model. Majority voting is used as the ensemble method for combining the machine learning algorithms [11].

The rest of the paper is organized as follows: Sect. 2 summarizes related work, Sect. 3 describes the dataset and the applied methods, Sect. 4 presents the results of our work, and Sect. 5 concludes the paper.

2 Literature Review

In March 2017, Singh et al. [12] proposed a web application that enables users to share their heart-related problems and obtain an online diagnosis from an intelligent system. The application takes inputs from the user, processes them and returns the disease related to those inputs. To avoid variance, a dataset with 14 input attributes was split randomly into two sets. The model was implemented using a Naïve Bayes classifier. As a result, the application tells users whether the predicted risk of heart disease is low, average or high.

Various data mining techniques have been applied to predict heart disease. Devi et al. [13] analysed classification techniques for decision making in this field, especially decision trees, Naïve Bayes, neural networks and support vector machines. They found that the application of hybrid data mining techniques can give promising results: combining and comparing the outputs of the individual algorithms helps to make predictions quicker and more accurate.

Datasets related to the same disease may show different results when the same machine learning techniques are applied. El-Bialy et al. [14] focused on integrating the results of machine learning techniques applied to several heart disease datasets. They applied the fast decision tree and the C4.5 tree techniques and then compared the features of the trees obtained from the different datasets. The features common to all datasets were collected to create a new dataset used in later analysis. The classification accuracy on this new dataset is higher than the average accuracy over the separate datasets: the average accuracy over all datasets was 75.48% using the fast decision tree and 76.30% using C4.5, whereas the accuracy on the newly collected dataset was 78.06% and 77.50%, respectively.

Venkatalakshmi and Shivsankar [15] presented a heart disease prediction system based on predictive mining. Experiments were carried out using Weka, an open-source data mining tool, and data from the UCI Machine Learning Repository. The goal was to compare the performance of predictive data mining techniques such as Naïve Bayes and decision trees. Naïve Bayes outperformed the decision tree, with accuracies of 85.03% and 84.01%, respectively.

Jabbar et al. [16] combined the k-nearest neighbors algorithm with a genetic algorithm on seven datasets to build a heart disease classifier. The results of their study showed that accuracy increases by 5% when KNN is combined with GA rather than used alone. They also observed that accuracy decreases as the value of k increases. Although the emphasis was on data related to Andhra Pradesh, a state in India, the classifier was shown to give high accuracy when applied to other heart disease datasets.

Table 1 summarizes the experiments mentioned above, displaying the accuracy of the heart disease classifiers obtained by the various authors and methods.

Table 1 Accuracy of various methods for heart disease classification in related work

The authors in [14, 15] used the same dataset as we did; unlike them, however, we additionally apply multiclass classification to evaluate the results. Beyond that, our main goal is to explore the gain achieved by the application of ensemble learning.

3 Methodology

3.1 Dataset

As part of this research, multiple machine learning models were developed using the Heart Disease dataset from the UCI Machine Learning Repository [17]. The original dataset contains 76 attributes, but published works use only 14 of them, so we did the same in our work. These attributes are considered the most important for reliable prediction. The dataset contains 303 instances and is publicly available.

The last attribute in the dataset represents the diagnosis of heart disease: value 0 indicates absence of heart disease, and values 1, 2, 3 and 4 indicate different levels of disease. Analyzing the representation of each class individually (54.12%, 18.15%, 11.88%, 11.55% and 4.29%, respectively), we conclude that this dataset does not suffer from the skewed class problem.
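For illustration (the original experiments were carried out in Weka, not in the code shown here), the following minimal Python sketch loads the processed Cleveland subset of the dataset and prints the class proportions; the URL and the 14 column names follow the UCI repository documentation.

# Minimal sketch: load the processed Cleveland data and inspect class balance.
import pandas as pd

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "heart-disease/processed.cleveland.data")
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

df = pd.read_csv(URL, header=None, names=COLUMNS, na_values="?")
print(len(df))                                   # 303 instances
print(df["num"].value_counts(normalize=True))    # per-class proportions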

Figure 1 shows a visual representation of all attributes in the dataset, presenting the distribution of each attribute with respect to the class.

Fig. 1 Visual representation of attributes

To split the dataset, we applied two approaches. First, we split the data into training and testing sets with a 66:34 ratio. Second, we applied 10-fold cross-validation to compare the results.
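A minimal sketch of the two evaluation setups, written in scikit-learn for illustration (the study itself used Weka); it assumes the DataFrame df from the previous sketch, and clf is a placeholder for a classifier built later.

# 66:34 percentage split and stratified 10-fold cross-validation.
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

X = df.drop(columns="num")
X = X.fillna(X.median())   # impute the few missing 'ca'/'thal' values
y = df["num"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, stratify=y, random_state=0)  # 66:34 split

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
# scores = cross_val_score(clf, X, y, cv=cv)  # clf is built in Sect. 3.3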

3.2 Feature Selection

After the dataset was loaded, we performed feature selection to identify the attributes most relevant for prediction [18]. Two evaluators were tried: GainRatioAttributeEval and InfoGainAttributeEval. According to the output of these two evaluators, several attributes were excluded from further examination, as they showed no significant impact on automated heart disease prediction. The excluded attributes are resting blood pressure, serum cholesterol, fasting blood sugar and resting electrocardiographic results.
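As an illustrative analogue of Weka's InfoGainAttributeEval (an approximation, not the exact Weka evaluator), features can be ranked by mutual information with the class; the column names of the excluded attributes (trestbps, chol, fbs, restecg) follow the UCI documentation.

# Rank features by mutual information, then drop the four excluded attributes.
from sklearn.feature_selection import mutual_info_classif

scores = mutual_info_classif(X, y, random_state=0)
for name, score in sorted(zip(X.columns, scores), key=lambda t: -t[1]):
    print(f"{name:8s} {score:.3f}")

X_selected = X.drop(columns=["trestbps", "chol", "fbs", "restecg"])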

3.3 Classification

In our work, we applied ensemble learning to build the model. Ensemble learning combines several models to improve the results. Multiple ensemble learning methods exist, such as voting, stacking, bagging and boosting, which are explained in detail in [11]. In our work we investigate majority voting: each model included in the vote makes its own prediction, and the final prediction is the one with the highest number of votes.

We considered three classification algorithms: artificial neural network (ANN), support vector machine (SVM) and k-nearest neighbors (KNN). These algorithms are combined so that they complement each other. Figure 2 shows the implementation process.

Fig. 2 Process of model implementation

Different combinations of parameters were tried for each of these algorithms. The final model was built using the parameter values given below.

The artificial neural network contained one hidden layer and used a learning rate of 0.4. The training time was increased from the default value of 500 to 5000. K-nearest neighbors was applied with k equal to 5, using LinearNNSearch with a filtered distance. The support vector machine was used with a polynomial kernel with exponent 1.5 and a regularization factor of 3.0.

The described parameters were selected as the best combination for these algorithms according to the obtained accuracy values.
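A sketch of the majority-voting ensemble with parameter values approximating those above, in scikit-learn for illustration: Weka's polynomial kernel exponent of 1.5 has no exact SVC counterpart (SVC requires an integer degree, so degree=2 is used), and the hidden layer size of 8 is an assumed value, since only the number of layers is fixed above.

# Majority-voting ensemble of ANN, KNN and SVM (hard voting).
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

ann = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(8,),   # 1 hidden layer
                                  learning_rate_init=0.4,    # learning rate 0.4
                                  max_iter=5000,             # training time 5000
                                  random_state=0))
knn = KNeighborsClassifier(n_neighbors=5)                    # k = 5
svm = make_pipeline(StandardScaler(),
                    SVC(kernel="poly", degree=2, C=3.0))     # approximation

vote = VotingClassifier(estimators=[("ann", ann), ("knn", knn), ("svm", svm)],
                        voting="hard")                       # majority voting

vote.fit(X_train, y_train)            # split from the Sect. 3.1 sketch
print(vote.score(X_test, y_test))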

The initial dataset has five output classes (class 0 for absence of disease and classes 1, 2, 3 and 4 for presence), which are used for multiclass classification. To compare the results, we also applied binary classification with two classes (class 0 for absence and class 1 for presence of disease). For each setting we tested the performance of ensemble learning.
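The binary variant reduces to relabeling the target, as in this one-line continuation of the earlier sketches:

# Collapse disease levels 1-4 into a single positive class.
y_binary = (y > 0).astype(int)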

4 Results

To measure the performance of the classifiers, we calculated the accuracy, the sensitivity and the precision.

The accuracy is the ratio of correctly classified samples to the total number of samples [19].

$$ \text{ACC} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} $$
(1)

The sensitivity, or true positive rate (TPR), is the ratio of true positives to actual positives (TP + FN) [19].

$$ \text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$
(2)

The precision, or positive predictive value (PPV), is the ratio of true positives to predicted positives (TP + FP) [19].

$$ \text{PPV} = \frac{\text{TP}}{\text{TP} + \text{FP}} $$
(3)
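For the binary variant, these three measures can be computed directly from the confusion matrix, as in the following sketch, where y_true and y_pred are assumed binary label vectors:

# Accuracy, sensitivity and precision from a binary confusion matrix.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
acc = (tp + tn) / (tp + tn + fp + fn)  # Eq. (1)
tpr = tp / (tp + fn)                   # Eq. (2)
ppv = tp / (tp + fp)                   # Eq. (3)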

From the confusion matrix presented in Table 3, we can see the types of mismatch between classes and notice that errors mostly occur between neighboring classes.

Table 2 and Fig. 3 show the results obtained by multiclass classification with percentage split. The highest accuracy, 61.16%, is achieved by majority voting, while ANN and KNN each reached the same accuracy of 58.25% when applied separately.

Table 2 Multiclass classification results by percentage split

Fig. 3 Multiclass classification results by percentage split

From the confusion matrix presented in Table 5, we can see that mismatches occur between classes 0, 1, 2 and 3, not only between neighboring classes; however, errors decrease as the distance between classes increases.

Table 4 and Fig. 4 show the results obtained by multiclass classification with 10-fold cross-validation. Again, majority voting achieved the highest accuracy, 58.41%. In this case KNN outperformed ANN, while SVM still had the lowest accuracy.

Table 3 Confusion matrix obtained by percentage split

Fig. 4 Multiclass classification results by 10-fold cross-validation

Comparing the results obtained by percentage split and cross-validation for multiclass classification, we observe that for most measurements the percentage split results are higher. Only ANN achieved higher accuracy with cross-validation than with percentage split.

In the next experiment, the problem was transformed into binary classification with output labels 0 and 1, where 0 represents absence and 1 presence of heart disease.

Table 6 and Fig. 5 show the results obtained by binary classification with percentage split. The results are noticeably higher than those obtained by multiclass classification. Majority voting achieved the highest accuracy, 87.37%. Moreover, contrary to the multiclass results, KNN obtained the lowest accuracy and SVM the highest when applied without the ensemble.

Table 4 Multiclass classification results by 10-fold cross-validation

Fig. 5 Binary classification results by percentage split

Table 7 and Fig. 6 show the results obtained by binary classification with 10-fold cross-validation. Unlike in all previous cases, here majority voting did not achieve the highest accuracy; the highest accuracy, 84.15%, was obtained by ANN. As with percentage split, KNN achieved the lowest accuracy.

Table 5 Confusion matrix obtained by 10-fold cross-validation

Fig. 6 Binary classification results by 10-fold cross-validation

Table 6 Binary classification results by percentage split
Table 7 Binary classification results by 10-fold cross-validation

For binary classification, all results obtained by percentage split are higher than those obtained by cross-validation.

5 Conclusion

In this paper we presented the application of an artificial neural network, k-nearest neighbors and a support vector machine to a dataset with 14 attributes and 303 instances.

We evaluated the difference between the results obtained by the algorithms applied separately and by majority voting as an ensemble learning method. The problem was solved in two ways, as multiclass and as binary classification, using two types of evaluation: a 66:34 percentage split and 10-fold cross-validation.

In three of the four cases majority voting had the highest accuracy: 61.16% for multiclass and 87.37% for binary classification. Only in binary classification with 10-fold cross-validation did a single algorithm, ANN, outperform the ensemble method. The superiority of ensemble learning over the single algorithms can be explained by the fact that ensemble learning combines the strengths of all algorithms into a single result. The results for binary classification are also higher than those for multiclass classification; the likely reason is the number of classes available in decision making, since it is harder to 'learn' with five outputs than with two.

When it comes to the data split, generally higher values of sensitivity, precision and accuracy are obtained by percentage split than by 10-fold cross-validation.

Therefore, we may conclude that majority voting outperforms the algorithms used separately and that percentage split gives better results than 10-fold cross-validation.